Accessing application listening at custom port on Oracle VirtualBox Hosted VM

For virtual machines such as the Hortonworks Sandbox (HDP) VM hosted on Oracle VirtualBox, we often deploy custom applications inside the VM. The surprise comes later: the application is not accessible from the host machine, whereas standard Hadoop ports such as 8080, 8888 and 4040 open normally.

The trick is that we need to enable Port Forwarding in VirtualBox for the new ports we added.

VM -> Devices -> Network -> Adapter -> Advanced -> Click on Port Forwarding button -> Add the Host Port and Guest Port numbers (these can be the same).
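The same rule can also be added from the host's command line with VBoxManage. Here is a minimal sketch, assuming the VM uses NAT on adapter 1. The function name forward_port, the rule name prefix and the VM name in the usage line are made up for illustration.

```shell
# Add a NAT port forwarding rule on adapter 1 of a powered-off VM.
# Rule format: name,protocol,host-ip,host-port,guest-ip,guest-port
# (empty host-ip/guest-ip means "any").
forward_port() {
  vm_name=$1
  host_port=$2
  guest_port=$3
  VBoxManage modifyvm "$vm_name" \
    --natpf1 "app-${host_port},tcp,,${host_port},,${guest_port}"
}

# Usage (hypothetical VM name and port):
#   forward_port "Hortonworks Sandbox" 8090 8090
```

modifyvm expects the VM to be powered off. For a running VM, "VBoxManage controlvm <vm> natpf1 <rule>" accepts the same rule string.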

Here is the screenshot on where to update:

HBase : Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.

When accessing a remote HBase database using the HBase Client from Java applications, you may get the following error:

"Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master."
This is because the HBase server configuration may be non-default and not necessarily known to the client developer. For example, in my case 'zookeeper.znode.parent' was changed to /hbase/secure instead of the default /hbase. There could be more such changes, which makes it cumbersome for the HBase Client to pass all the configurations. One easy way around this is:
    Get the hbase-site.xml from the HBase cluster and add it to your Java application's classpath. That's it. This way we don't have to set the ZooKeeper quorum either.
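One detail worth spelling out: HBaseConfiguration.create() looks up hbase-site.xml as a classpath resource, so it is the directory containing the file that goes on the classpath, not the file itself. Here is a minimal launch sketch, assuming the application jar already bundles the HBase client libraries. The function name, paths and class name are placeholders.

```shell
# Launch a JVM with the directory holding hbase-site.xml on the classpath.
run_hbase_client() {
  conf_dir=$1     # directory that contains the copied hbase-site.xml
  app_jar=$2      # application jar (assumed to include the HBase client)
  main_class=$3
  java -cp "${conf_dir}:${app_jar}" "$main_class"
}

# Usage (hypothetical paths and class):
#   run_hbase_client /opt/myapp/conf myapp.jar com.example.HBaseClientApp
```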

Sample Scala Client Code:

val conf = org.apache.hadoop.hbase.HBaseConfiguration.create() // Instead of the following settings, pass hbase-site.xml in …

HDFS filenames without rest of the file details

Getting only the file names with their absolute paths from HDFS.
The usual hdfs list output looks like:
$ hdfs dfs -ls
-rw-r--r--   3 foo bar    6346268 2016-12-28 02:52 /user/foo/data/file007.csv
-rw-r--r--   3 foo bar    4397850 2016-12-28 02:52 /user/foo/data/file014.csv
-rw-r--r--   3 foo bar   13297361 2016-12-28 02:52 /user/foo/data/file020.csv
-rw-r--r--   3 foo bar   10400852 2016-12-28 02:53 /user/foo/data/file118.csv
-rw-r--r--   3 foo bar   10184639 2016-12-28 02:52 /user/foo/data/file205.csv
-rw-r--r--   3 foo bar    5542293 2016-12-28 02:53 /user/foo/data/file214.csv
-rw-r--r--   3 foo bar    6085128 2016-12-28 02:53 /user/foo/data/file307.csv
But in shell scripts we often need just the absolute HDFS file paths, for example to perform CRUD operations on them.
One way to achieve this is using awk, like:

$ hdfs dfs -ls | awk -F " &q…
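The command above is cut off in this copy. Here is a minimal sketch of the same idea, assuming the standard 8-column listing shown earlier and paths without spaces. The helper name hdfs_paths is made up.

```shell
# Read `hdfs dfs -ls` output on stdin and print only the path column.
# The NF >= 8 guard skips the "Found N items" header line.
hdfs_paths() {
  awk 'NF >= 8 { print $8 }'
}

# Usage:
#   hdfs dfs -ls /user/foo/data | hdfs_paths
```

If your Hadoop version supports it, "hdfs dfs -ls -C" prints only the paths and needs no awk at all.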

Spark Cluster Mode - Too many open files

When running complex Spark jobs in cluster mode (Yarn-Client or Yarn-Cluster), it is quite possible that the default ulimit (number of open files) of 1024 is not sufficient, and the job fails with the error "Too many open files".

One way to address this issue is to increase the ulimit. Steps (as root user):

1) vim /etc/sysctl.conf, append the file with:
    fs.file-max = 65536

2) vim /etc/security/limits.conf, append the file with:
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535

3) ulimit -n
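Changes in /etc/security/limits.conf normally apply to new login sessions only, and "sysctl -p" reloads /etc/sysctl.conf without a reboot. To confirm the new limit from a fresh session, a small check like the following can help. This is only a sketch, and the function name nofile_at_least is made up.

```shell
# Succeed if the current soft open-file limit is at least the given value.
nofile_at_least() {
  current=$(ulimit -Sn)
  [ "$current" = "unlimited" ] || [ "$current" -ge "$1" ]
}

# Usage:
#   nofile_at_least 65535 && echo "limit ok" || echo "limit too low: $(ulimit -Sn)"
```

Remember that the limit has to be in effect for the user running the Spark/YARN processes on each node, not just for root.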

HBase Table does not exist, but exists

Often we come across this situation with HBase: table creation fails with "ERROR: Table already exists: hbaseFeatures!", yet the table does not show up when you list tables.

This often happens when HBase is upgraded to a new version but stale data remains in ZooKeeper. If you are no longer interested in those tables, one way to solve this is to clear the corresponding data in ZooKeeper. Here are some instructions:
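The original commands are not reproduced in this copy, so what follows is only a rough sketch under a few assumptions: the table znodes live under <zookeeper.znode.parent>/table, the ZooKeeper CLI still provides rmr (newer ZooKeeper releases use deleteall instead), and the function name remove_table_znode is made up. Double-check the path with "hbase zkcli ls /hbase/table" before deleting anything.

```shell
# Remove the stale znode of one table via the ZooKeeper CLI bundled with HBase.
remove_table_znode() {
  znode_parent=$1   # value of zookeeper.znode.parent, e.g. /hbase
  table=$2          # table name as reported in the error message
  hbase zkcli rmr "${znode_parent}/table/${table}"
}

# Usage:
#   remove_table_znode /hbase hbaseFeatures
```

Depending on the setup, HBase may need a restart afterwards to pick up the change.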

Now try to create the HBase table, which should succeed.

All happy!

Download All Project dependencies using Maven

At times there is a need to physically download all the dependencies of a Java/Scala based application,
especially when we want to develop or run the application in a remote, offline environment.
Maven has a convenient plug-in to rescue us:
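The exact command is missing from this copy. One plug-in that fits the description is the Maven Dependency Plugin, whose copy-dependencies goal copies the project's dependency jars into a directory. Here is a minimal sketch, run from the directory containing pom.xml. The wrapper name download_deps is made up.

```shell
# Copy all project dependencies into a directory
# (defaults to the plug-in's usual target/dependency).
download_deps() {
  out_dir=${1:-target/dependency}
  mvn dependency:copy-dependencies -DoutputDirectory="$out_dir"
}

# Usage:
#   download_deps libs
```

If the goal is only to pre-fill the local repository for offline builds, "mvn dependency:go-offline" is a related option.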

Binary Data to Float using Spark SQL

Often, when large volumes of data are exchanged among enterprise applications, the data is converted to a binary data type, for reasons such as:
faster I/O, smaller size, the ability to represent any type of data, etc.

This brings in the need to convert the binary data back to a known, readable data type for further analysis or processing.

For example, here the floating point data is represented as an Array[Byte] of decimal values and then stored in binary format.

Ex: [62, -10, 0, 0]
The equivalent floating point value is: 0.480
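To see where that number comes from, the four bytes can be decoded by hand. The sketch below is not the Spark conversion. It is a sanity check that assumes the bytes are a big-endian IEEE 754 single-precision float and handles normal numbers only. The function name bytes_to_float is made up.

```shell
# Decode four signed bytes (big-endian IEEE 754 single precision) to a float.
bytes_to_float() {
  awk -v bytes="$*" 'BEGIN {
    split(bytes, b, " ")
    for (i = 1; i <= 4; i++) if (b[i] < 0) b[i] += 256    # signed -> unsigned
    sign     = (b[1] >= 128) ? -1 : 1                     # 1 sign bit
    exponent = (b[1] % 128) * 2 + int(b[2] / 128)         # 8 exponent bits
    mantissa = (b[2] % 128) * 65536 + b[3] * 256 + b[4]   # 23 fraction bits
    # Normal numbers: sign * (1 + mantissa / 2^23) * 2^(exponent - 127)
    printf "%.8f\n", sign * (1 + mantissa / 8388608) * 2 ^ (exponent - 127)
  }'
}

bytes_to_float 62 -10 0 0   # prints 0.48046875
```

Here -10 becomes 246 (0xF6), so the bytes are 0x3EF60000: exponent 125 and fraction 0.921875, giving 1.921875 * 2^-2 = 0.48046875, which rounds to 0.480.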

To achieve the above conversion using Apache Spark SQL DataFrames: