An Approach to Logging in Spark Jobs


The Spark website provides three options for using a custom log4j configuration for logging (the first two are sketched after the list):

  1. Upload a custom log4j.properties with every job
  2. Add -Dlog4j.configuration={location of log4j config} with every job
  3. Update the $SPARK_CONF_DIR/log4j.properties file
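For context, options 1 and 2 are commonly combined on the spark-submit command line, roughly as below. This is only a sketch; the file path, class name, and jar name are placeholders, not something from our setup:

# Sketch: ship a custom log4j.properties with the job (option 1)
# and point log4j at it via system property (option 2).
spark-submit \
  --master yarn --deploy-mode cluster \
  --files /path/to/custom-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --class com.example.MyJob my-job.jar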

In one of our products, which runs Spark jobs in YARN mode on an EMR cluster through a Livy server, we chose Option 3. Here I explain how we did it and the challenges we faced.

First, we prepared a custom log4j.properties file with all the configuration we wanted: custom loggers, their log file locations, the log rotation strategy, and so on. This content is appended to /etc/spark/conf.dist/log4j.properties during EMR deployment. The log file location is important here, because the log files are created on the master and on every executor in the cluster.
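For reference, the appended section could look roughly like the following. The logger and appender names (com.example, myLogger) and the rotation settings are placeholders, not our exact configuration:

# Hypothetical custom logger with a rolling file appender
log4j.logger.com.example=INFO, myLogger
log4j.additivity.com.example=false

log4j.appender.myLogger=org.apache.log4j.RollingFileAppender
log4j.appender.myLogger.File=${spark.yarn.app.container.log.dir}/application.log
log4j.appender.myLogger.MaxFileSize=50MB
log4j.appender.myLogger.MaxBackupIndex=5
log4j.appender.myLogger.layout=org.apache.log4j.PatternLayout
log4j.appender.myLogger.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n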

The Spark website suggests using ${spark.yarn.app.container.log.dir} as the base directory for the log file location. Example:

log4j.appender.myLogger.File=${spark.yarn.app.container.log.dir}/application.log

On EMR executors the variable resolves to /var/log/hadoop-yarn/containers/{application_id}/{container_dir}/. This path is different for each application (job) and for each container within the application, so a separate application.log file is created on each executor for every execution of the job.

Another catch is that ${spark.yarn.app.container.log.dir} is not available on the master. However, when the application starts, log4j reads the configuration and tries to create the log file on the master as well. Since the variable is not set there, the log file location evaluates to /application.log, i.e. in the root directory, and it fails with a "Permission denied" exception. Our solution was to create an empty /application.log as part of the deployment itself and grant write permission to the user who runs the job. In our case, since the Spark jobs are submitted through a Livy server running as the "livy" user, we had to give the livy user write permission on the /application.log file.
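In a deployment script this boils down to something like the commands below; the exact ownership or mode is whatever your environment requires, so treat this as a sketch rather than our exact bootstrap step:

# Sketch: pre-create the placeholder log file on the master and let the livy user write to it
sudo touch /application.log
sudo chown livy:livy /application.log
# (alternatively, keep root ownership and grant group write access to a group that includes livy)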

Once this was in place, we observed complete logging for the Spark jobs. However, depending on what gets executed where (master or executors), the logging happens locally on each node. To aggregate and analyze the logs from all the nodes we use Splunk.
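As an illustration, a Splunk universal forwarder on each node could pick up these files with monitor stanzas along these lines; the index and sourcetype names are placeholders, not our actual Splunk configuration:

# Hypothetical inputs.conf stanzas for the Splunk universal forwarder on each node
[monitor:///application.log]
index = spark_jobs
sourcetype = spark:application

[monitor:///var/log/hadoop-yarn/containers/*/*/application.log]
index = spark_jobs
sourcetype = spark:application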


		