Posts

Showing posts from 2016

HDFS filenames without rest of the file details

To get just the file names with their absolute paths from HDFS:

A typical hdfs list output looks like this:

    $ hdfs dfs -ls
    -rw-r--r--   3 foo bar    6346268 2016-12-28 02:52 /user/foo/data/file007.csv
    -rw-r--r--   3 foo bar    4397850 2016-12-28 02:52 /user/foo/data/file014.csv
    -rw-r--r--   3 foo bar   13297361 2016-12-28 02:52 /user/foo/data/file020.csv
    -rw-r--r--   3 foo bar   10400852 2016-12-28 02:53 /user/foo/data/file118.csv
    -rw-r--r--   3 foo bar   10184639 2016-12-28 02:52 /user/foo/data/file205.csv
    -rw-r--r--   3 foo bar    5542293 2016-12-28 02:53 /user/foo/data/file214.csv
    -rw-r--r--   3 foo bar    6085128 2016-12-28 02:53 /user/foo/data/file307.csv

But we often need just the absolute HDFS file paths, especially in shell scripts that perform CRUD operations, like:

    /user/foo/data/file007.csv
    /user/foo/data/file014.csv
    ...
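The post's own solution is cut off in this excerpt. As one possible approach, here is a minimal Python sketch that takes listing text in the format shown above and keeps only the path column. The function name extract_paths is hypothetical, and the sketch assumes the usual eight-column layout of `hdfs dfs -ls` output, with the path as the last column.

```python
from typing import List


def extract_paths(ls_output: str) -> List[str]:
    """Return the path column from `hdfs dfs -ls` style output."""
    paths = []
    for line in ls_output.splitlines():
        # Assumed columns: permissions, replication, owner, group,
        # size, date, time, path. maxsplit=7 keeps a path that
        # contains spaces in one piece.
        fields = line.split(None, 7)
        # Blank lines and the "Found N items" header have fewer columns.
        if len(fields) == 8 and fields[0][0] in "-d":
            paths.append(fields[7])
    return paths


sample = """Found 2 items
-rw-r--r--   3 foo bar    6346268 2016-12-28 02:52 /user/foo/data/file007.csv
-rw-r--r--   3 foo bar    4397850 2016-12-28 02:52 /user/foo/data/file014.csv"""

for path in extract_paths(sample):
    print(path)
```

Inside a shell script, a pipeline that prints only the eighth column of the listing would likely do the same job.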

Spark Cluster Mode - Too many open files

When running complex Spark jobs in cluster mode (Yarn-Client or Yarn-Cluster), it is quite possible that the default ulimit (number of open files) of 1024 is not sufficient, and the job fails with the error "Too many open files". One way to address this issue is to increase the ulimit size.

Steps (as root user):

1) vim /etc/sysctl.conf, and append to the file:

    fs.file-max = 65536

2) vim /etc/security/limits.conf, and append to the file:

    * soft nproc 65535
    * hard nproc 65535
    * soft nofile 65535
    * hard nofile 65535

3) ulimit -n 65535
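To confirm which limits a process actually ends up with after these changes, a small check along the following lines may help. It is only a sketch: it uses Python's standard resource module (Unix only), the helper names open_file_limits and describe are made up for illustration, and it reports the limits of the Python process itself, not of a running Spark executor.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print("soft nofile limit:", describe(soft))
print("hard nofile limit:", describe(hard))
```

Run it as the same user that launches the Spark job, in a fresh login session, since limits.conf changes are normally picked up at login.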

HBase Table does not exist, but exists

With HBase we often come across the situation where table creation fails with "ERROR: Table already exists: hbaseFeatures!", yet that table does not show up when you list tables. This often happens when HBase has been upgraded to a new version but stale data remains in Zookeeper. If you are no longer interested in those tables, one way to solve this is to clear the corresponding data in Zookeeper. Here are some instructions: Now try to create the HBase table, which should succeed. All happy!

Download All Project dependencies using Maven

At times there is a need to physically download all the dependencies of a Java/Scala based application, especially when we want to develop or run the application in a remote offline environment. Maven has a convenient plug-in to rescue us:

Binary Data to Float using Spark SQL

Often, when enterprise applications exchange massive amounts of data, the data is converted to a binary datatype, for reasons like faster I/O, smaller size, and the ability to represent any type of data. This brings the need to convert the binary back to a known, readable data type for further analysis and processing. For example, here floating point data is represented as an Array[Byte] of decimal values and then stored in binary format. Ex: [62, -10, 0, 0]. Its floating point equivalent is 0.480. To achieve the above conversion using Apache Spark SQL DataFrames: Voila!
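The Spark SQL code itself is not part of this excerpt. To make the arithmetic behind the example concrete, here is a plain Python sketch (not Spark) that decodes the same four bytes. It assumes the bytes form a big-endian IEEE 754 single-precision float, which is the interpretation that reproduces the 0.480 quoted above; the function name signed_bytes_to_float is hypothetical.

```python
import struct
from typing import Sequence


def signed_bytes_to_float(values: Sequence[int]) -> float:
    """Decode four signed byte values as a big-endian IEEE 754 float."""
    if len(values) != 4:
        raise ValueError("expected exactly 4 byte values")
    # JVM bytes range from -128 to 127; masking with 0xFF gives the
    # unsigned 0..255 form that Python's bytes type expects.
    raw = bytes(v & 0xFF for v in values)
    # ">f" means big-endian, 32-bit float.
    return struct.unpack(">f", raw)[0]


print(signed_bytes_to_float([62, -10, 0, 0]))
```

The bytes 62 and -10 become 0x3E and 0xF6, so the 32-bit pattern is 0x3EF60000, which decodes to 0.48046875.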

JDBC Operations in Spark + Scala

Here is sample code to create a sample DataFrame from an RDD and then insert it into a MySQL database. This approach should work for any other relational database. Happy Sparking...

Installing Additional Hadoop Services using Parcels in Cloudera Manager

To install additional services using parcels in Cloudera Manager, follow these steps:

1) Download the parcel file from the internet; make sure the component version is supported by your Cloudera version and operating system.

2) Download the corresponding manifest.json file, and create a .sha file using the command:

    echo "*********************" > componentName.parcel.sha

   where the hash is the one found in the manifest.json file.

3) Scp these three files to the Hadoop master node, at a location like /opt/cloudera/parcel-local/newdir

4) Run an HTTP server in this directory to expose it as a local parcel repository:

    python -m SimpleHTTPServer 8900

5) In Cloudera Manager, go to the parcel page and click Check for New Parcels. You might have to configure the parcel settings and add your server:8900 as one of the parcel repositories. You should now see your parcel listed.

6) Perform Download of this parcel in CM.

7) Perform Distribute, by clicking on the Distribute button.

8) Perform Activate, by clicking on Activate butt...
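The hash for the .sha file has to be copied out of manifest.json. A small sketch of that lookup follows. It assumes the manifest has a top-level "parcels" list whose entries carry "parcelName" and "hash" keys, so check your own manifest.json before relying on it. The function name parcel_hash and the sample manifest values are hypothetical.

```python
import json
from typing import Optional


def parcel_hash(manifest_text: str, parcel_name: str) -> Optional[str]:
    """Look up the hash recorded for one parcel in manifest.json text."""
    manifest = json.loads(manifest_text)
    # Assumed layout: {"parcels": [{"parcelName": "...", "hash": "..."}]}
    for entry in manifest.get("parcels", []):
        if entry.get("parcelName") == parcel_name:
            return entry.get("hash")
    return None


# Hypothetical manifest content, for illustration only.
example_manifest = json.dumps({
    "parcels": [
        {"parcelName": "componentName.parcel", "hash": "0123abcd"},
    ]
})

print(parcel_hash(example_manifest, "componentName.parcel"))
```

The returned string is what goes into componentName.parcel.sha. Note also that `python -m SimpleHTTPServer` is the Python 2 form; with Python 3 the equivalent is `python3 -m http.server 8900`.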

Cloudera Hadoop Setup - HDFS Canary Health Check issue

After setting up a Hadoop cluster using Cloudera Manager, one of the common issues some of us face is a failing Canary health check. This most often happens because of connectivity problems between the master and slave nodes. In my case, HDFS was throwing a Canary error saying it was unable to write/read /tmp/.cloudera_health_monitoring_canary_timestamp. I finally had to open the corresponding port on my DataNodes to resolve this error.

Open the port in the firewall:

    $ iptables-save | grep 8042
    output: will be blank

    $ firewall-cmd --zone=public --add-port=8042/tcp --permanent
    output: success

    $ firewall-cmd --reload
    output: success

    $ iptables-save | grep 8042
    output: -A IN_public_allow -p tcp -m tcp --dport 8042 -m conntrack --ctstate NEW -j ACCEPT
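Once the firewall rule is in place, it can be useful to confirm from the master node that the port is really reachable. The following is a rough sketch using only Python's standard socket module. The function name port_is_open is hypothetical, and the address in the demo line is a stand-in to be replaced with a DataNode host.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


# 127.0.0.1 is a stand-in; use a DataNode host name when running
# this from the master node.
print(port_is_open("127.0.0.1", 8042))
```

A result of True needs both an open firewall and a service listening on the port, so False on its own does not prove that the firewall is the culprit.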

Running Apache Kafka in Windows

Windows Kafka setup steps:

1) Download Kafka and unzip.

2) Start Zookeeper:

    kafka_2.10-0.8.2.1\bin\windows\zookeeper-server-start.bat tools\kafka_2.10-0.8.2.1\config\zookeeper.properties

3) Start Kafka Server:

    kafka_2.10-0.8.2.1\bin\windows\kafka-server-start.bat tools\kafka_2.10-0.8.2.1\config\server.properties

4) List topics, to make sure Kafka is up and running:

    kafka_2.10-0.8.2.1\bin\windows\kafka-topics.bat --list --zookeeper localhost:2181

5) Create a new topic, for example:

    kafka-topics.bat --create --topic sensor1 --replication-factor 1 --zookeeper localhost:2181 --partitions 5

6) Produce some sample Kafka messages:

    kafka_2.10-0.8.2.1\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic sensor1

7) Run a consumer to test whether the messages produced above were successfully published to the Kafka broker:

    kafka_2.10-0.8.2.1\bin\windows\kafka-console-consumer.bat --zookeeper localhost:2181 --topic sensor1 --from-beginning

8) Stop Kafka Server:

    tools\kafka_2....