Posts

Showing posts from December, 2017

Launch spark shell on multiple executors

Example: spark-shell --master yarn --deploy-mode client --executor-cores 4 --num-executors 6 --executor-memory 12g The same tuning parameters we use with spark-submit apply to spark-shell as well, except that deploy-mode cannot be 'cluster': the shell's driver must run on the machine you launch it from. Also make sure spark.dynamicAllocation.enabled is set to true. With these settings, you can see that YARN executors are allocated on demand and removed when no longer required.
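The same idea spelled out as a full command line, passing the dynamic-allocation properties explicitly via --conf. Note that on YARN, dynamic allocation also needs the external shuffle service; the numeric values below are illustrative, not recommendations for any particular cluster:

```shell
# Launch an interactive shell on YARN with dynamic executor allocation.
# spark.shuffle.service.enabled is required so executors can be removed
# without losing shuffle data; min/max bound how far YARN scales.
spark-shell \
  --master yarn \
  --deploy-mode client \
  --executor-cores 4 \
  --executor-memory 12g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=6
```

With these flags, --num-executors becomes unnecessary; YARN grows the executor count toward maxExecutors while tasks are queued and shrinks it back when the shell sits idle.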

Spark Scala - Perform data aggregation on last or next n seconds time window

Often while performing statistical aggregations, we encounter scenarios where, for each row, we need to aggregate over the next n seconds from the current row. The Spark Window API provides a nice rangeBetween function which facilitates exactly this. For example: // Sample data with timestamp val customers = sc.parallelize(List(("Alice", "2016-05-01 00:00:00", 10,4), ("Alice", "2016-05-01 00:00:01", 20,2), ("Alice", "2016-05-01 00:00:02", 30,4), ("Alice", "2016-05-01 00:00:02", 40,6), ("Alice", "2016-05-01 00:00:03", 50,1), ("Alice", "2016-05-01 00:00:03", 60,4), ("Alice", "2016-05-01 00:00:04", 70,2), ("Alice", "2016-05-01 00:00:05", 80,4), ("Bob", "2016-05-01 00:00:03", 25,6), ("Bob", "2016-05-01 00:00:04", 29,7), ("Bob", "2016-05-
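A minimal, self-contained sketch of the rangeBetween approach described above. The column names, the 2-second window, and the use of a local SparkSession are illustrative assumptions; rangeBetween needs a numeric ordering column, so the timestamp string is first converted to epoch seconds:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object NextNSecondsAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rangeBetween-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A subset of the sample data from the post
    val customers = Seq(
      ("Alice", "2016-05-01 00:00:00", 10, 4),
      ("Alice", "2016-05-01 00:00:01", 20, 2),
      ("Alice", "2016-05-01 00:00:02", 30, 4),
      ("Alice", "2016-05-01 00:00:03", 50, 1),
      ("Bob",   "2016-05-01 00:00:03", 25, 6),
      ("Bob",   "2016-05-01 00:00:04", 29, 7)
    ).toDF("name", "ts", "amount", "qty")
      // rangeBetween operates on the ordering column's values,
      // so convert the timestamp string to epoch seconds
      .withColumn("epoch", unix_timestamp($"ts"))

    // Per-customer window covering the current row's second
    // plus the next 2 seconds (inclusive)
    val w = Window.partitionBy("name").orderBy("epoch").rangeBetween(0L, 2L)

    customers
      .withColumn("sum_next_2s", sum("amount").over(w))
      .show(false)

    spark.stop()
  }
}
```

Because the frame is defined by value ranges rather than row counts, rows sharing the same timestamp fall into one another's windows, which is usually what you want for time-based aggregation.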