12 ways to troubleshoot a failed MapReduce task

Troubleshooting a failed MapReduce task in a Hadoop cluster can be hard. The following suggestions can help narrow down the problem:

1. Set additional configuration properties:

  • keep.task.files.pattern:  This setting keeps the files of tasks whose names match the given pattern, whether the task fails or succeeds.
  • keep.failed.task.files:  This setting tells the TaskTracker node to keep the files of failed tasks on the local machine. The files are stored under {HADOOP_LOG_DIR}/local/taskTracker/taskid/.
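
As a sketch, these properties can be set in the job's configuration file (old-style hadoop-site.xml names, matching the Hadoop versions this article targets; the pattern value below is only an example):

```xml
<!-- hadoop-site.xml: keep failed task files for post-mortem debugging -->
<property>
  <name>keep.failed.task.files</name>
  <value>true</value>
</property>

<!-- Or keep files for specific tasks by name pattern, pass or fail
     (example pattern: the second map task of any job) -->
<property>
  <name>keep.task.files.pattern</name>
  <value>.*_m_000001_.*</value>
</property>
```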

2. Launching IsolationRunner:

   Go to the directory where the failed task's files were kept (see item 1) and launch IsolationRunner (org.apache.hadoop.mapred.IsolationRunner) as shown below. It re-runs the failed task in a child JVM, with the same input, so you can debug it.

    • $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
    • job.xml is generated by the MapReduce framework from the job's JobConf when the job is submitted.

3. Configure Log4j to emit additional logging from specific classes. For example, use the setting below to get more detailed logs from the TaskTracker:

  • log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG

4. Be sure the following logging-related Hadoop environment variables are set in the cluster:

  • HADOOP_LOG_DIR: directory for log files
  • HADOOP_PID_DIR: directory to store the PID files for the servers
  • HADOOP_ROOT_LOGGER: logging configuration for hadoop.root.logger (default: "INFO,console")
  • HADOOP_SECURITY_LOGGER: logging configuration for hadoop.security.logger (default: "INFO,NullAppender")
  • HDFS_AUDIT_LOGGER: logging configuration for hdfs.audit.logger (default: "INFO,NullAppender")
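
These are typically exported from conf/hadoop-env.sh. A sketch with example values (the paths are assumptions; adjust them for your cluster):

```shell
# conf/hadoop-env.sh (example values, not defaults)
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_PID_DIR=/var/run/hadoop

# Temporarily raise verbosity while troubleshooting:
export HADOOP_ROOT_LOGGER="DEBUG,console"
```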

5.  Create a small single-node cluster, run the same job against a small data set, and look for a simpler execution pattern. This is a good way to check issues related to task-specific execution.

6.  You can temporarily run the JobTracker in local mode, even on a large cluster, to test code execution:

  • mapred.job.tracker = local
  • fs.default.name = local
  • Note: These settings can also be placed in hadoop-site.xml, but there they persist until the XML is changed. Set them on the job's configuration instead to affect only the current job.
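
As a sketch, the equivalent hadoop-site.xml entries would be:

```xml
<!-- hadoop-site.xml: run jobs in local mode (temporary, for debugging) -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>local</value>
</property>
```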

7. Send a kill signal to the running Java process:

  • "kill -QUIT #java_process_pid" prints the call stack, thread, lock, and deadlock details to the process's stdout.
  • Java 1.5 and later ship jps (list Java processes) and jstack (print a Java process's call stack) to further troubleshoot the problem.

8. For Hadoop Pipes (the C++ library supporting MapReduce programs written in C++), set the following configuration to keep a troubled task's files saved on the node:

  • hadoop.pipes.command-file.keep = true

9. Running a debug script against the failed task's files is also an option with Hadoop 0.15 and above.

  • Upload script file to HDFS
  • Set the configuration as
    • mapred.cache.files = full_path#script_file_name (multiple script files can be added, separated by commas)
    • mapred.create.symlink = yes (the script file must be symlinked)
  • The same can be done programmatically:
    • jobConf.setMapDebugScript("./debugscript");
    • DistributedCache.createSymlink(jobConf);
    • DistributedCache.addCacheFile("/debug/scripts/script#debugscript");
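
As a sketch, a debug script usually just dumps the failed task's output files; Hadoop invokes it with the task's stdout, stderr, syslog, and job configuration paths as arguments. The function and file names below are stand-ins for illustration:

```shell
# Hypothetical debug script body; Hadoop passes the failed task's
# stdout, stderr, syslog, and jobconf paths as $1..$4.
debug_task() {
    echo "=== task stdout ==="
    cat "$1"
    echo "=== task stderr ==="
    cat "$2"
    echo "=== syslog (last 20 lines) ==="
    tail -n 20 "$3"
}

# Stand-in files for demonstration; in a real run Hadoop supplies the paths.
printf 'map output line\n' > /tmp/demo_stdout
printf 'java.lang.NullPointerException\n' > /tmp/demo_stderr
printf 'INFO mapred.TaskRunner: task started\n' > /tmp/demo_syslog
debug_task /tmp/demo_stdout /tmp/demo_stderr /tmp/demo_syslog
```

The script's stdout is shown on the task-details page of the web UI, which is what makes this useful for post-mortem inspection.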

10. Enable Job Profiling

  • Map and reduce tasks can be sampled through the built-in Java profiler.
  • Use the following configuration:
    • mapred.task.profile = true
    • mapred.task.profile.maps = 0-2
    • mapred.task.profile.reduces = 0-2
  • The same effect is achieved programmatically:
    • JobConf.setProfileEnabled(boolean)
    • JobConf.setProfileTaskRange(boolean, String)
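
As a sketch, the same profiling settings expressed as configuration-file entries:

```xml
<!-- Profile the first three map and the first three reduce task attempts -->
<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-2</value>
</property>
```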

11. Using MapReduce Tool Interface

  • The Tool interface in Hadoop provides standard handling of command line options
  • Here is a list of Hadoop command line options
    • -conf <configuration file>
    • -D <property=value>
    • -fs <local|namenode:port>
    • -jt <local|jobtracker:port>
  • The same can be achieved in code through ToolRunner, which parses these generic options before your job logic runs.
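
A minimal sketch of the Tool/ToolRunner pattern against the old org.apache.hadoop.mapred API (the class name is illustrative, and this requires the Hadoop jars on the classpath):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative job driver; ToolRunner parses -conf, -D, -fs and -jt
// and applies them to the configuration before run() is called.
public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already reflects any generic options from the command line.
    JobConf conf = new JobConf(getConf(), MyJob.class);
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}
```

With this in place, options such as -jt local or -D mapred.task.profile=true can be passed on the command line without code changes.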

12.  Enable code-level logging by using the Reporter class.

  • To display debug information, use the Reporter parameter passed to the map and reduce methods. The method Reporter.setStatus(String status) changes the displayed status of the task, which is visible on the JobTracker web page.
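
A minimal mapper sketch against the old org.apache.hadoop.mapred API showing Reporter in use (class name and counter names are illustrative; requires the Hadoop jars to compile):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper that surfaces debug information through Reporter.
public class DebugMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    // Shown on the task's row in the JobTracker web UI.
    reporter.setStatus("processing offset " + key.get());
    // Counters are another lightweight way to expose debug data.
    reporter.incrCounter("Debug", "records-seen", 1);
    out.collect(value, new IntWritable(1));
  }
}
```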
