An Approach to Logging in Spark Jobs


The Spark website provides three options for using a custom log4j configuration for logging (the first two are sketched after the list):

  1. Upload a custom log4j.properties with every job
  2. Add -Dlog4j.configuration={location of log4j config} with every job
  3. Update the $SPARK_CONF_DIR/log4j.properties file
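For context, options 1 and 2 are commonly combined on the spark-submit command line, roughly as below. This is only a sketch; the file path, class name, and jar name are placeholders, not something from our setup:

# Sketch: ship a custom log4j.properties with the job (option 1)
# and point log4j at it via system property (option 2).
spark-submit \
  --master yarn --deploy-mode cluster \
  --files /path/to/custom-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --class com.example.MyJob my-job.jar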

In one of our products, which runs Spark jobs in YARN mode on an EMR cluster through a Livy server, we chose Option 3. Here I explain how we did it and the challenges we faced.

First, we prepared a custom log4j.properties file with all the configuration we wanted: custom loggers, their log file locations, the log rotation strategy, and so on. This content is appended to /etc/spark/conf.dist/log4j.properties during EMR deployment. The log file location is important here, because the log files are created on the master and on every executor in the cluster.
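For reference, the appended section could look roughly like the following. The logger and appender names (com.example, myLogger) and the rotation settings are placeholders, not our exact configuration:

# Hypothetical custom logger with a rolling file appender
log4j.logger.com.example=INFO, myLogger
log4j.additivity.com.example=false

log4j.appender.myLogger=org.apache.log4j.RollingFileAppender
log4j.appender.myLogger.File=${spark.yarn.app.container.log.dir}/application.log
log4j.appender.myLogger.MaxFileSize=50MB
log4j.appender.myLogger.MaxBackupIndex=5
log4j.appender.myLogger.layout=org.apache.log4j.PatternLayout
log4j.appender.myLogger.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n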

The Spark website suggests using ${spark.yarn.app.container.log.dir} as the base directory for the log file location. Example:

log4j.appender.myLogger.File=${spark.yarn.app.container.log.dir}/application.log

On EMR executors the variable resolves to /var/log/hadoop-yarn/containers/{application_id}/{container_dir}/. This path is different for each application (job) and for each container within the application, so a separate application.log file is created on each executor for every execution of the job.

Another catch is that ${spark.yarn.app.container.log.dir} is not available on the master. However, when the application starts, log4j reads the configuration and tries to create the log file on the master as well. Since the variable is not set there, the log file location evaluates to /application.log, i.e. in the root directory, and it fails with a "Permission denied" exception. Our solution was to create an empty /application.log as part of the deployment itself and grant write permission to the user who runs the job. In our case, since the Spark jobs are submitted through a Livy server running as the "livy" user, we had to give the livy user write permission on the /application.log file.
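In a deployment script this boils down to something like the commands below; the exact ownership or mode is whatever your environment requires, so treat this as a sketch rather than our exact bootstrap step:

# Sketch: pre-create the placeholder log file on the master and let the livy user write to it
sudo touch /application.log
sudo chown livy:livy /application.log
# (alternatively, keep root ownership and grant group write access to a group that includes livy)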

Once this was in place, we observed complete logging for the Spark jobs. However, depending on what gets executed where (master or executors), the logging happens locally on each node. To aggregate and analyze the logs from all the nodes we use Splunk.
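As an illustration, a Splunk universal forwarder on each node could pick up these files with monitor stanzas along these lines; the index and sourcetype names are placeholders, not our actual Splunk configuration:

# Hypothetical inputs.conf stanzas for the Splunk universal forwarder on each node
[monitor:///application.log]
index = spark_jobs
sourcetype = spark:application

[monitor:///var/log/hadoop-yarn/containers/*/*/application.log]
index = spark_jobs
sourcetype = spark:application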


		