While working with Amazon EMR, you might see an exception like the following from a failed map/reduce task:
java.lang.Throwable: Child Error
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
The Root Cause:
The likely cause is that the Hadoop mapper or reducer task was killed by the Linux OS for oversubscribing memory. Keep in mind that this is the Linux OOM (out-of-memory) killer at work, not a Java OutOfMemoryError. In general, this happens when a process is configured to use far more memory than the OS can provide; once memory is oversubscribed, the OS has no choice but to kill the process.
With Amazon EMR, such job failures can happen easily if you have set mapred.child.java.opts far too high for your specific EMR instance type. Each EMR instance type comes with preconfigured settings for map and reduce tasks, and a misconfiguration can lead to this problem.
For example, the m1.large EMR instance type has 768 MB of memory allocated for each map or reduce task in a Hadoop job, as below:
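A sketch of what the preconfigured setting in mapred-site.xml might look like for this instance type (the exact property names and values vary by EMR AMI and Hadoop version; this fragment is illustrative, not copied from an actual EMR configuration):

```xml
<!-- Illustrative per-task heap default on an m1.large:
     768 MB maximum heap for each child (map/reduce) JVM -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx768m</value>
</property>
```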
However, suppose the user sets mapred.child.java.opts in mapred-site.xml to a far higher value, e.g. 8 GB, as below:
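An illustrative fragment of such a misconfiguration (values chosen to match the 8 GB example above):

```xml
<!-- Misconfiguration: an 8 GB heap per child JVM on an instance
     that budgets only ~768 MB per map/reduce task -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx8192m</value>
</property>
```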
The above configuration would cause the Linux OS to kill the mapper or reducer task due to the very high memory subscription. Linux will kill the process even if it is configured to use 4 GB, because that is still far over the instance's configured limit.
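You can confirm that the Linux OOM killer (rather than a Java error) terminated the task by checking the kernel log on the affected node. A minimal sketch, piping a sample kernel-log line through the same filter you would apply to `dmesg` output (the PID and score shown are made up for illustration):

```shell
# On the task node, search the kernel log for OOM-killer activity:
#   dmesg | grep -i 'out of memory'
# A matching line typically looks like the sample below:
printf 'Out of memory: Kill process 2345 (java) score 800 or sacrifice child\n' \
  | grep -i 'out of memory'
```

If a `java` process appears in such a line around the time the task failed, the 137 exit status (128 + SIGKILL) came from the kernel, not from the JVM.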
The solution is to let your job use the default MapReduce settings instead of overriding them yourself at job submission. Using the defaults lets the JobTracker run the task with whatever settings the Amazon EMR instance type already has configured.