Launch spark shell on multiple executors


spark-shell --deploy-mode client --master yarn --executor-cores 4 --num-executors 6 --executor-memory 12g

The same tuning parameters we use with spark-submit apply to spark-shell as well. The one exception is deploy-mode, which cannot be 'cluster' because the shell's driver has to run on the machine where you type.

Also make sure spark.dynamicAllocation.enabled is set to true.

With these settings, you can see that YARN executors are allocated on demand and removed when they are no longer required.
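
One way to pass the dynamic allocation setting is with --conf on the command line. The following is a minimal sketch rather than a definitive recipe: the minExecutors/maxExecutors bounds are illustrative assumptions, and spark.shuffle.service.enabled is included on the assumption that the external shuffle service is running on the NodeManagers, which dynamic allocation on YARN generally expects. The script only prints the command so that it can be reviewed before launching.

```shell
#!/bin/sh
# Sketch: build a spark-shell command line with dynamic allocation enabled.
# The executor bounds (1 and 6) are illustrative assumptions; tune them for your cluster.
SPARK_SHELL_CMD="spark-shell --master yarn --deploy-mode client \
--executor-cores 4 --executor-memory 12g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=6"

# Print the command for review; run the printed line on a gateway node to launch the shell.
echo "$SPARK_SHELL_CMD"
```

While the shell is idle, the executor count shown in the YARN or Spark UI should drop toward the configured minimum and grow again when a job runs.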

Spark Scala - Perform data aggregation on last or next n seconds time window

Often while performing statistical aggregations, we need to aggregate, for each row, over the rows that fall within the next n seconds of the current row.

The Spark Window API provides rangeBetween, which makes this kind of aggregation straightforward.

For example:
// Sample data with timestamp
val customers = sc.parallelize(List(("Alice", "2016-05-01 00:00:00", 10,4),
("Alice", "2016-05-01 00:00:01", 20,2),
("Alice", "2016-05-01 00:00:02", 30,4),
("Alice", "2016-05-01 00:00:02", 40,6),
("Alice", "2016-05-01 00:00:03", 50,1),
("Alice", "2016-05-01 00:00:03", 60,4),
("Alice", "2016-05-01 00:00:04", 70,2),
("Alice", "2016-05-01 00:00:05", 80,4),
("Bob", "2016-05-01 00:00:03", 25,6),
("Bob", "2016-05-01 00:00:04", 29,7),
("Bob", "2016-05-01 00:00:05"…

Accessing application listening at custom port on Oracle VirtualBox Hosted VM

For virtual machines such as the Hortonworks Sandbox (HDP) VM hosted on Oracle VirtualBox, we often deploy custom applications inside the VM. It then comes as a surprise that the application is not accessible from the host machine, whereas standard Hadoop ports such as 8080, 8888, and 4040 open normally.

The trick is that we need to enable port forwarding in VirtualBox for the new ports we added.

VM -> Devices -> Network -> Adapter -> Advanced -> click the Port Forwarding button -> add the Host Port and Guest Port numbers (these can be the same).
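
The same rule can also be added from the host's command line with VBoxManage. This is a minimal sketch under stated assumptions: the VM name, rule name, and port 9090 are hypothetical placeholders, and the first network adapter is assumed to be in NAT mode. The script only prints the command.

```shell
#!/bin/sh
# Sketch: compose a VBoxManage port-forwarding rule for a running VM.
# VM_NAME, RULE_NAME and port 9090 are hypothetical placeholders.
VM_NAME="Hortonworks Sandbox"
RULE_NAME="custom-app"
HOST_PORT=9090
GUEST_PORT=9090

# Rule format: name,protocol,host-ip,host-port,guest-ip,guest-port
# (host-ip and guest-ip are left empty here).
NAT_RULE="${RULE_NAME},tcp,,${HOST_PORT},,${GUEST_PORT}"

# natpf1 targets the first network adapter, assumed to use NAT.
echo "VBoxManage controlvm \"${VM_NAME}\" natpf1 \"${NAT_RULE}\""
```

Run the printed command on the host while the VM is running. The application inside the VM should then be reachable at localhost on the chosen host port.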

Here is a screenshot showing where to update:

HBase : Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.

When accessing a remote HBase database with the HBase client from a Java application, you may get the following error:

"Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master."
This happens because the configuration on the HBase server may be non-default and is not necessarily known to the client developer. For example, in my case 'zookeeper.znode.parent' was changed to /hbase/secure instead of the default /hbase. There could be more such changes, which makes it cumbersome for the HBase client to pass every setting explicitly. One easy way around this is:
    Get the hbase-site.xml from the HBase cluster and add it to your Java application's classpath. That's it. This way we don't have to set the ZooKeeper quorum either.

Sample Scala Client Code:

val conf = org.apache.hadoop.hbase.HBaseConfiguration.create() // Instead of the following settings, pass hbase-site.xml in …

HDFS filenames without rest of the file details

To get just the file names with their absolute paths from HDFS:
The general hdfs list output looks like this:
$ hdfs dfs -ls
-rw-r--r--   3 foo bar    6346268 2016-12-28 02:52 /user/foo/data/file007.csv
-rw-r--r--   3 foo bar    4397850 2016-12-28 02:52 /user/foo/data/file014.csv
-rw-r--r--   3 foo bar   13297361 2016-12-28 02:52 /user/foo/data/file020.csv
-rw-r--r--   3 foo bar   10400852 2016-12-28 02:53 /user/foo/data/file118.csv
-rw-r--r--   3 foo bar   10184639 2016-12-28 02:52 /user/foo/data/file205.csv
-rw-r--r--   3 foo bar    5542293 2016-12-28 02:53 /user/foo/data/file214.csv
-rw-r--r--   3 foo bar    6085128 2016-12-28 02:53 /user/foo/data/file307.csv
But in shell scripts we often need just the absolute HDFS file paths, for example to perform CRUD operations on each file.
One way to achieve this is with awk:

$ hdfs dfs -ls | awk -F " "…
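
As a self-contained illustration of the awk approach, here is a minimal sketch that may differ from the original command. A here-document stands in for real `hdfs dfs -ls` output. The "Found 3 items" header imitates the summary line that the listing normally starts with, and the archive directory entry is a hypothetical addition. The sketch assumes paths contain no spaces.

```shell
#!/bin/sh
# Sketch: keep only the path column of an `hdfs dfs -ls` listing.
# The here-document stands in for real `hdfs dfs -ls` output.
listing=$(cat <<'EOF'
Found 3 items
-rw-r--r--   3 foo bar    6346268 2016-12-28 02:52 /user/foo/data/file007.csv
-rw-r--r--   3 foo bar    4397850 2016-12-28 02:52 /user/foo/data/file014.csv
drwxr-xr-x   - foo bar          0 2016-12-28 02:53 /user/foo/data/archive
EOF
)

# Skip the "Found N items" header (fewer than 8 fields) and print the last field.
# Assumes paths contain no spaces.
paths=$(printf '%s\n' "$listing" | awk 'NF >= 8 { print $NF }')
printf '%s\n' "$paths"
```

Against a real cluster, the same filter would be used as `hdfs dfs -ls /user/foo/data | awk 'NF >= 8 { print $NF }'`. Newer Hadoop releases also accept `hdfs dfs -ls -C`, which prints paths only, if your version supports it.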

Spark Cluster Mode - Too many open files

When running complex Spark jobs in cluster mode (yarn-client or yarn-cluster), it is quite possible that the default ulimit (number of open files) of 1024 is not sufficient, and the job fails with the error "Too many open files".

One way to address this issue is to increase the ulimit. Steps (as root user):

1) vim /etc/sysctl.conf, append the file with:
    fs.file-max = 65536

2) vim /etc/security/limits.conf, append the file with:
    * soft nproc 65535
    * hard nproc 65535
    * soft nofile 65535
    * hard nofile 65535

3) Log in again (limits.conf is applied to new sessions) and verify the new limit with:
    ulimit -n
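
The verification step can be sketched as a small check script. This is only an illustration: REQUIRED_NOFILE is a name introduced here, set to the 65535 value from the limits.conf step, and the check reports the soft limit of the shell that runs it.

```shell
#!/bin/sh
# Sketch: compare the current soft open-file limit with a required minimum.
# REQUIRED_NOFILE mirrors the 65535 nofile value from the limits.conf step.
REQUIRED_NOFILE=65535

current=$(ulimit -n)

# "unlimited" always passes; otherwise compare numerically.
if [ "$current" = "unlimited" ] || [ "$current" -ge "$REQUIRED_NOFILE" ]; then
  status="ok"
else
  status="too-low"
fi

echo "open-file limit: $current (required: $REQUIRED_NOFILE) -> $status"
```

Run it as the user that launches the Spark executors, since the limit is applied per user session.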

HBase table does not exist, but exists

We often come across this situation with HBase: table creation fails with "ERROR: Table already exists: hbaseFeatures!", yet the table does not appear when you list tables.

This often happens when HBase is upgraded to a new version but stale data remains in ZooKeeper. If you are no longer interested in those tables, one way to solve this is to clear the corresponding data in ZooKeeper. Here are some instructions:
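
A minimal sketch of such a clean-up, under stated assumptions: the znode parent is the default /hbase (use whatever 'zookeeper.znode.parent' is set to on your cluster), table znodes live under its table child, and the ZooKeeper CLI still supports `rmr` (newer CLIs use `deleteall` instead). The helper name zk_cleanup_commands is hypothetical, and the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print the ZooKeeper CLI commands that remove a stale table znode.
# ZNODE_PARENT must match 'zookeeper.znode.parent' on the cluster (default /hbase).
ZNODE_PARENT="/hbase"
TABLE_NAME="hbaseFeatures"

# Hypothetical helper: emits the commands, it does not contact ZooKeeper.
zk_cleanup_commands() {
  # List the table znodes first, then remove the stale one.
  # `rmr` is the recursive delete of older ZooKeeper CLIs; newer ones use `deleteall`.
  echo "ls ${ZNODE_PARENT}/table"
  echo "rmr ${ZNODE_PARENT}/table/${TABLE_NAME}"
}

zk_cleanup_commands
```

Type the printed commands into a session opened with `hbase zkcli`, and check the `ls` output before deleting anything.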

Now try to create the HBase table, which should succeed.

All happy!