Tuesday, September 6, 2016

Spark Cluster Mode - Too many open files

When running complex Spark jobs in cluster mode (yarn-client or yarn-cluster), it is quite possible that the default ulimit (maximum number of open files) of 1024 is not sufficient, and the job fails with the error "Too many open files".

One way to address this issue is to increase the ulimit. Steps (as root user):

1) vim /etc/sysctl.conf, append the file with:
    fs.file-max = 65536

2) vim /etc/security/limits.conf, append the file with:
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535

3) Log out and back in, then verify the new limit with: ulimit -n
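The steps above can be sketched as one shell session (run as root on every node running Spark executors; the limits are the same values assumed above):

```shell
# Raise the system-wide open-file limit; sysctl -p applies it immediately
echo "fs.file-max = 65536" >> /etc/sysctl.conf
sysctl -p

# Raise per-user process and open-file limits for all users
cat >> /etc/security/limits.conf <<'EOF'
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
EOF

# Verify from a fresh login shell
ulimit -n
```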

Monday, June 27, 2016

HBase Table does not exist, but exists

Often we come across this situation with HBase: table creation fails with "ERROR: Table already exists: hbaseFeatures!", yet that table does not appear when you list tables.

This often happens when HBase is upgraded to a new version, leaving stale data behind in ZooKeeper. If you are no longer interested in those tables, one way to solve this is to clear the corresponding data in ZooKeeper. Here are some instructions:
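A sketch of the cleanup, assuming the stale table is named hbaseFeatures and the default /hbase znode root (the exact znode path varies by HBase version), using the ZooKeeper CLI that ships with HBase:

```shell
# Open the ZooKeeper shell bundled with HBase
hbase zkcli

# Inside the shell: list the table znodes, then remove the stale one
# (on some HBase versions the parent znode is /hbase/table04 instead)
ls /hbase/table
rmr /hbase/table/hbaseFeatures

# Exit the shell, then restart HBase so the Master and RegionServers
# drop their cached state
```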

Now try to create the HBase table, which should succeed.

All happy!

Friday, June 10, 2016

Download All Project dependencies using Maven

At times there is a need to physically download all the dependencies of a Java/Scala based application, especially when we want to develop or run the application in a remote, offline environment.
Maven has a convenient plug-in to rescue us:
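Presumably the plug-in meant here is the maven-dependency-plugin; a minimal sketch of its relevant goals, run from the project root (the output directory name is just an example):

```shell
# Copy every dependency of the project (including transitive ones)
# into the given directory (defaults to target/dependency):
mvn dependency:copy-dependencies -DoutputDirectory=./offline-libs

# Alternatively, pre-populate the local repository so later builds
# can run with the -o (offline) flag:
mvn dependency:go-offline
```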

Binary Data to Float using Spark SQL

Often when there is massive data exchange among enterprise applications, the data is converted to a binary datatype, for reasons like faster I/O, smaller size, and the ability to represent any type of data.

This brings in the need to convert binary back to a known/readable data type for further analysis/processing.

For example, here a floating-point value is represented as an Array[Byte] (shown as signed decimal byte values) and then stored in Binary format.

Ex: [62, -10, 0, 0]
The floating-point equivalent is: 0.480

To achieve the above conversion using Apache Spark SQL DataFrames:
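A minimal sketch of the conversion (the DataFrame df and column name binCol are hypothetical; the bytes are assumed to be in big-endian order, which is ByteBuffer's default):

```scala
import java.nio.ByteBuffer

// Reassemble a 4-byte IEEE-754 float from its binary representation
def binaryToFloat(bytes: Array[Byte]): Float =
  ByteBuffer.wrap(bytes).getFloat

binaryToFloat(Array[Byte](62, -10, 0, 0))  // 0.48046875 (~0.480)

// To apply it to a DataFrame column, wrap it in a UDF:
//   import org.apache.spark.sql.functions.udf
//   val binToFloat = udf(binaryToFloat _)
//   df.withColumn("floatCol", binToFloat($"binCol"))
```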


Wednesday, April 13, 2016

JDBC Operations in Spark + Scala

Here is sample code to create a DataFrame from an RDD and then insert it into a MySQL database. This approach should work for any other relational database.
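A sketch of what such code might look like, using the DataFrameWriter.jdbc API (the table name, JDBC URL, and credentials are placeholders; it assumes the MySQL JDBC driver jar is on the classpath):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JdbcDemo").getOrCreate()
import spark.implicits._

// Build a DataFrame from an RDD of tuples
val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))
val df  = rdd.toDF("id", "name")

// JDBC connection details -- placeholders, replace with your own
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Append the rows into the target table
df.write.mode("append")
  .jdbc("jdbc:mysql://dbhost:3306/mydb", "people", props)
```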

Happy Sparking...

Thursday, March 3, 2016

Installing Additional Hadoop Services using Parcels in Cloudera Manager

To install an additional service using a parcel in Cloudera Manager, follow these steps:

  1. Download the parcel file from the internet; make sure the component version is supported by your Cloudera version & operating system
  2. Download the corresponding manifest.json file, and create a .sha file using the command
     echo "*********************" > componentName.parcel.sha
  3. The hash to put in the .sha file can be found in the manifest.json file
  4. scp these three files to the Hadoop master node, to a location like /opt/cloudera/parcel-local/newdir
  5. Run an HTTP server in this directory to expose it as a local parcel repository
    python -m SimpleHTTPServer 8900
  6. In Cloudera Manager, go to the Parcels page and click Check for New Parcels. You might have to edit the parcel settings and add your server:8900 as one of the parcel repositories.
  7. You should now see your parcel listed here
  8. Download the parcel in CM
  9. Distribute it, by clicking the Distribute button
  10. Activate it, by clicking the Activate button
  11. Check Parcel Usage
  12. You may now need to perform 'Add Service' on the cluster to actually install your component based on the above parcel
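Steps 2-5 above can be sketched as a shell session (componentName, the host name, and the hash-extraction line are illustrative; the exact layout of your manifest.json may differ):

```shell
# The hash for each parcel is listed in manifest.json;
# find it, then copy it into the .sha file
grep -A2 '"parcelName": "componentName.parcel"' manifest.json | grep '"hash"'
echo "<hash-from-manifest>" > componentName.parcel.sha

# Ship parcel, .sha, and manifest to the master node
scp componentName.parcel componentName.parcel.sha manifest.json \
    root@masternode:/opt/cloudera/parcel-local/newdir/

# Serve the directory as a local parcel repository
cd /opt/cloudera/parcel-local/newdir
python -m SimpleHTTPServer 8900
```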

Monday, February 22, 2016

Cloudera Hadoop Setup - HDFS Canary Health Check issue

After setting up a Hadoop cluster using Cloudera Manager, one of the common issues some of us face is a Canary health check failure. This most often happens due to connectivity problems between the master and slave nodes. In my case HDFS was throwing a Canary error saying it was unable to write/read to /tmp/.cloudera_health_monitoring_canary_timestamp. I finally had to open the corresponding port on my DataNodes to resolve this error:

$ iptables-save | grep 8042
output: will be blank
$ firewall-cmd --zone=public --add-port=8042/tcp --permanent
output: success
$ firewall-cmd --reload
output: success
$ iptables-save | grep 8042
output: -A IN_public_allow -p tcp -m tcp --dport 7180 -m conntrack --ctstate NEW -j ACCEPT