Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone

Please visit my GitHub and the specific Herringbone page for more details.

Installation:

Note: You must be on a Hadoop machine; Herringbone requires a Hadoop environment.

Prerequisite: Thrift

  • Thrift 0.9.1 (you MUST use 0.9.1, as 0.9.3 and 0.10.0 will give errors while packaging)
  • Get Thrift 0.9.1 from the Apache Thrift downloads page and build it from source (see the sketch below)
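
A typical source build of Thrift 0.9.1 looks like the sketch below (a sketch only; exact configure flags and build dependencies such as Boost depend on your platform):

$ tar xzf thrift-0.9.1.tar.gz
$ cd thrift-0.9.1
$ ./configure
$ make
$ sudo make install
$ thrift -version   # should report Thrift version 0.9.1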

Prerequisite: Impala

  • First set up the Cloudera repository on your machine
  • Install Impala:
    • Install Impala: $ sudo apt-get install impala
    • Install Impala server: $ sudo apt-get install impala-server
    • Install Impala state-store: $ sudo apt-get install impala-state-store
    • Install Impala shell: $ sudo apt-get install impala-shell
    • Verify Impala: $ impala-shell
impala-shell
Starting Impala Shell without Kerberos authentication
Connected to mr-0xd7-precise1.0xdata.loc:21000
Server version: impalad version 2.6.0-cdh5.8.4 RELEASE (build 207450616f75adbe082a4c2e1145a2384da83fa6)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.4.0-cdh4-INTERNAL (08fa346) built on Mon Jul 14 15:52:52 PDT 2014)

Building the Herringbone source
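
Assuming you build straight from the GitHub repository linked above, the build is a standard Maven package run from the repository root (run it on the Hadoop machine where the prerequisites above are installed):

$ git clone https://github.com/stripe/herringbone.git
$ cd herringbone
$ mvn package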

Here is the log of a successful Herringbone "mvn package" build for your review:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Herringbone Impala
[INFO] Herringbone Main
[INFO] Herringbone
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone Impala 0.0.2
[INFO] ------------------------------------------------------------------------
..
..
..
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone 0.0.1
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Herringbone Impala ................................. SUCCESS [ 2.930 s]
[INFO] Herringbone Main ................................... SUCCESS [ 13.012 s]
[INFO] Herringbone ........................................ SUCCESS [ 0.000 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.079 s
[INFO] Finished at: 2017-10-06T11:27:20-07:00
[INFO] Final Memory: 90M/1963M
[INFO] ------------------------------------------------------------------------

Using Herringbone

Note: The input files must be on Hadoop (HDFS), not on the local file system.

Verify the file on Hadoop:

~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet
-rw-r--r-- 3 avkash avkash 1463376 2017-09-13 16:56 /user/avkash/file-test1.parquet

~/herringbone$ bin/herringbone flatten -i /user/avkash/file-test1.parquet
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/avkash/herringbone/herringbone-main/target/herringbone-0.0.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/10/06 12:06:44 INFO client.RMProxy: Connecting to ResourceManager at mr-0xd1-precise1.0xdata.loc/172.16.2.211:8032
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/10/06 12:06:45 INFO input.FileInputFormat: Total input paths to process : 1
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
1 initial splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
1 merged splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: number of splits:1
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1499294366934_0707
17/10/06 12:06:45 INFO impl.YarnClientImpl: Submitted application application_1499294366934_0707
17/10/06 12:06:46 INFO mapreduce.Job: The url to track the job: http://mr-0xd1-precise1.0xdata.loc:8088/proxy/application_1499294366934_0707/
17/10/06 12:06:46 INFO mapreduce.Job: Running job: job_1499294366934_0707
17/10/06 12:06:52 INFO mapreduce.Job: Job job_1499294366934_0707 running in uber mode : false
17/10/06 12:06:52 INFO mapreduce.Job:  map 0% reduce 0%
17/10/06 12:07:22 INFO mapreduce.Job:  map 100% reduce 0%

Now verify the file:

~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet-flat

Found 2 items
-rw-r--r--   3 avkash avkash          0 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/_SUCCESS
-rw-r--r--   3 avkash avkash    2901311 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/part-m-00000.parquet
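
Optionally, if you have the parquet-tools jar handy, you can compare the schemas of the original and flattened files to confirm the nested columns were flattened (a hedged example; the parquet-tools jar name and version below are illustrative):

~/herringbone$ hadoop jar parquet-tools-1.6.0.jar schema /user/avkash/file-test1.parquet
~/herringbone$ hadoop jar parquet-tools-1.6.0.jar schema /user/avkash/file-test1.parquet-flat/part-m-00000.parquet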

That's it, enjoy!!


Running python and pysparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin is very useful for working in cell-based notebooks (similar to Jupyter) with various applications, i.e. Spark, Python, Hive, HBase, etc., through its interpreters.
With H2O and Sparkling Water you can use Zeppelin on a Hadoop cluster with YARN, and then use Python or PySparkling to submit jobs.
Here are the steps for using PySparkling with YARN on a Hadoop cluster.
1. Get the latest build of Sparkling Water from the H2O download page.
2. Download and unzip the Sparkling Water version compatible with your Spark version onto one of the edge nodes in your Hadoop cluster.
3. Set the following environment variables to the right paths before running services:
export MASTER="yarn-client"   # submit to the YARN cluster
export SPARK_HOME="path_to_the_directory_where_spark_unzipped"
export HADOOP_CONF_DIR="path_to_the_hadoop_installation"
export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.1.0"
export PYTHONPATH="_path_to_where_python_installed"
export SPARKLING_EGG=$(ls -t /sparkling-water-2.1.0/py/build/dist/h2o_pysparkling*.egg | head -1)
# update the path to the Sparkling Water egg file above
Please make sure the version values above reflect the following:
  • 2.11 -> refers to the Scala version.
  • 2.1.0 -> refers to the Spark version.
4. Set "spark.executor.memory 4g" in Zeppelin, either in the configuration file or in the Zeppelin UI, if Error 143 is seen while starting the Zeppelin server (see the sketch after this note for the configuration-file route).
Note: To configure it in the Zeppelin UI, go to the dropdown next to the user at the top right corner, select Interpreters, and in the Spark section either edit or add the configuration.
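If you prefer the configuration-file route, one option (a sketch reusing the SPARK_SUBMIT_OPTIONS variable from step 3; the flag value simply mirrors the 4g setting above) is to pass the executor memory through spark-submit:
export SPARK_SUBMIT_OPTIONS="$SPARK_SUBMIT_OPTIONS --conf spark.executor.memory=4g"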
5. Start the Zeppelin server using the command below. This will start Zeppelin in a YARN container.
bin/zeppelin.sh -Pspark-2.1
6. In a Zeppelin notebook, create a new note with the paragraph below and add the path to the egg file. This adds the PySparkling dependency and its classes.
%pyspark
sc.addPyFile("_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
7. Now one can start calling PySparkling APIs like below:
%pyspark
sc.addPyFile("_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)
8. To use Scala Sparkling Water, one does not need to add the dependency explicitly in the Zeppelin note. A sample script would look like this:

%spark

import org.apache.spark.h2o._
sc.version
val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
val h2oContext = H2OContext.getOrCreate(sc)
That's all, enjoy!!

Accessing Remote Hadoop Server using Hadoop API or Tools from local machine (Example: Hortonworks HDP Sandbox VM)

Sometimes you may need to access the Hadoop runtime from a machine where Hadoop services are not running. In this process you will create password-less SSH access to the Hadoop machine from your local machine; once that is ready, you can use the Hadoop API to access the Hadoop cluster, or use Hadoop commands directly from the local machine by passing the proper Hadoop configuration.

Starting Hortonworks HDP 1.3 and/or 2.1 VM

You can use these instructions on any VM running Hadoop, or you can download the HDP 1.3 or 2.1 images from the link below:

http://hortonworks.com/products/hortonworks-sandbox/#install

Now start your VM and make sure your Hadoop cluster is up and running. Once your VM is up, you will see its IP address and hostname on the VM screen, which is usually 192.168.21.xxx as shown below:

[Screenshot: HDP Sandbox VM console showing the assigned IP address and hostname]

Accessing Hortonworks HDP 1.3 and/or 2.1 from browser:

Using the IP address provided you can check the Hadoop server status on port 8000 as below

HDP 1.3 – http://192.168.21.187:8000/about/

HDP 2.1 – http://192.168.21.186:8000/about/

The UI for both HDP1.3 and HDP 2.1 looks as below:

[Screenshot: HDP 1.3 and HDP 2.1 sandbox web UI (port 8000 about page)]

Now from your host machine you can also try to SSH into any of the machines using the username root and password hadoop as below:

$ssh root@192.168.21.187

The authenticity of host '192.168.21.187 (192.168.21.187)' can't be established.
RSA key fingerprint is b2:c0:9a:4b:10:b4:0f:c0:a0:da:7c:47:60:84:f5:dc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.21.187' (RSA) to the list of known hosts.
root@192.168.21.187's password: hadoop
Last login: Thu Jun 5 03:55:17 2014

Now we will add password-less SSH access to these VMs; there are two options:

Option 1: You already have an SSH key created for yourself earlier and want to reuse it here:

In this option, first we will make sure we have an RSA-based key for SSH sessions on our local machine, and then we will use it for password-less SSH access:

  1. In your home folder (/Users/<yourname>), go to the folder named .ssh
  2. Identify the file id_rsa.pub (/Users/avkashchauhan/.ssh/id_rsa.pub); it contains a long key string
  3. Also identify the file authorized_keys there (i.e. /Users/avkashchauhan/.ssh/authorized_keys); it contains one or more long key strings
  4. Check the contents of id_rsa.pub and make sure this key is also present in the authorized_keys file, along with any other keys
  5. Now copy the key string from the id_rsa.pub file
  6. SSH to your HDP machine, as in the previous step, using the username and password
  7. Go to the /root/.ssh folder
  8. You will find an authorized_keys file there; open it in an editor and append the key you copied in step #5
  9. Save the authorized_keys file
  10. In the same VM you will also find an id_rsa.pub file; copy its contents
  11. Exit the HDP VM
  12. On your host machine, append the key from the HDP VM to the authorized_keys file you already checked in step #3, and save it
  13. Now try logging into the HDP VM as below:

ssh root@192.168.21.187

Last login: Thu Jun 5 06:35:31 2014 from 192.168.21.1

Note: You will see that a password is not needed this time, as password-less SSH is working.
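
As a shortcut, if your local machine has ssh-copy-id installed, steps #5 through #9 above can be replaced with a single command (you will be prompted for the root password once):

$ ssh-copy-id root@192.168.21.187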

Option 2: You haven’t created SSH key in your local machine and will do everything from scratch:

In this option we will first create an SSH key and then use it exactly as in Option #1.

  • Log into your host machine and open a terminal
  • For example, your home folder will be /Users/<username>
  • Create a folder named .ssh inside your working folder
  • Now go inside the .ssh folder and run the following command:

$ ssh-keygen -C 'SSH Access Key' -t rsa

Enter file in which to save the key (/home/avkashchauhan/.ssh/id_rsa): ENTER

Enter passphrase (empty for no passphrase): ENTER

Enter same passphrase again: ENTER

  • You will see that the id_rsa and id_rsa.pub files are created. Now we will append the contents of id_rsa.pub to the authorized_keys file; if that file is not there, the same command will create it. In both cases the command is as below:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

  • In the above step the contents of id_rsa.pub are appended to authorized_keys.
  • Now we will set proper permissions on the keys and folders as below:

$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*

  • Finally, follow Option #1 to add both id_rsa.pub keys to both machines' authorized_keys files so that password-less SSH works (a quick verification check follows below).
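
To confirm that password-less login works end to end, you can tell SSH to fail instead of prompting for a password (a quick check; if the key setup is correct it prints the remote hostname without asking for anything):

$ ssh -o PasswordAuthentication=no root@192.168.21.187 hostname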


Migrating Hadoop configuration from Remote Machine to local Machine:

To get this working, copy the Hadoop configuration files from the HDP server to your local machine as below:

HDP 1.3:

Create a folder named hdp13 in your working folder, then use the scp command to copy the configuration files over password-less SSH as below:

$ scp -r root@192.168.21.187:/etc/hadoop/conf.empty/ ~/hdp13

HDP 2.1:

Create a folder named hdp21 in your working folder, then use the scp command to copy the configuration files over password-less SSH as below:

$ scp -r root@192.168.21.186:/etc/hadoop/conf/ ~/hdp21

Adding correct JAVA_HOME to imported Hadoop configuration hadoop-env.sh

Now go to your hdp13 or hdp21 folder and edit the hadoop-env.sh file with the correct JAVA_HOME as below:

# The java implementation to use. Required.
# export JAVA_HOME=/usr/jdk/jdk1.6.0_31  
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`

Adding the correct HDP hostnames to your local machine's hosts entries:

Now you need to add the Hortonworks HDP hostnames to your local machine's hosts file. On Mac OS X, edit the /private/etc/hosts file and add the following:

#HDP 2.1
192.168.21.186 sandbox.hortonworks.com
#HDP 1.3
192.168.21.187 sandbox

 

Once added make sure you can ping the hosts by name as below:

$ ping sandbox

PING sandbox (192.168.21.187): 56 data bytes
64 bytes from 192.168.21.187: icmp_seq=0 ttl=64 time=0.461 ms

And for HDP 2.1

$ ping sandbox.hortonworks.com
PING sandbox.hortonworks.com (192.168.21.186): 56 data bytes
64 bytes from 192.168.21.186: icmp_seq=0 ttl=64 time=0.420 ms

Accessing the Hadoop runtime on the remote machine from Hadoop commands (or the API) on the local machine:

Now, using the local machine's Hadoop installation, you can connect to Hadoop on the HDP VM as below:

HDP 1.3

$ ./hadoop --config /Users/avkashchauhan/hdp13/conf.empty fs -ls /
Found 4 items
drwxr-xr-x - hdfs hdfs 0 2013-05-30 10:34 /apps
drwx------ - mapred hdfs 0 2014-06-05 03:54 /mapred
drwxrwxrwx - hdfs hdfs 0 2014-06-05 06:19 /tmp
drwxr-xr-x - hdfs hdfs 0 2013-06-10 14:39 /user

HDP 2.1

$ ./hadoop --config /Users/avkashchauhan/hdp21/conf fs -ls /
Found 6 items
drwxrwxrwx - yarn hadoop 0 2014-04-21 07:21 /app-logs
drwxr-xr-x - hdfs hdfs 0 2014-04-21 07:23 /apps
drwxr-xr-x - mapred hdfs 0 2014-04-21 07:16 /mapred
drwxr-xr-x - hdfs hdfs 0 2014-04-21 07:16 /mr-history
drwxrwxrwx - hdfs hdfs 0 2014-05-23 11:35 /tmp
drwxr-xr-x - hdfs hdfs 0 2014-05-23 11:35 /user

If you are using the Hadoop API, you can pass the path of these configuration files to the API and get access to the Hadoop runtime.
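
For example, instead of passing --config on every command, you can export HADOOP_CONF_DIR once; the hadoop command (and tools launched through it) then picks up core-site.xml and hdfs-site.xml from that directory (a sketch using the HDP 2.1 path above):

$ export HADOOP_CONF_DIR=/Users/avkashchauhan/hdp21/conf
$ ./hadoop fs -ls /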

 

Hadoop 2.4.0 release (helpful links)

Kudos to the Hadoop community: the Hadoop 2.4.0 release is available for everyone to consume. A short (but not exhaustive) list of improvements in HDFS, MapReduce, and the overall framework is below:

Hadoop 2.4.0 Highlights:

  • HDFS:
    • Full HTTPS support
    • ACL support in HDFS, which allows easier access to Apache Sentry-managed data by components using it
    • Native support for rolling upgrades in HDFS
    • HDFS FSImage using protocol-buffers for smoother operational upgrades
  • YARN:
    • ResourceManager HA Automatic Failover
    •  YARN Timeline Server PREVIEW for storing and serving generic application history

Hadoop 2.4.0 Release Notes:

http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/releasenotes.html

Hadoop 2.4.0 Source download:

http://apache.mirrors.tds.net/hadoop/common/hadoop-2.4.0/hadoop-2.4.0-src.tar.gz

Hadoop 2.4.0 Binary download:

http://apache.mirrors.tds.net/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz

Hadoop HDFS Error: xxxx could only be replicated to 0 nodes, instead of 1

Sometimes when using Hadoop, either accessing HDFS directly or running a MapReduce job that accesses HDFS, users get an error such as: XXXX could only be replicated to 0 nodes, instead of 1

Example (1): Copying a file from local file system to HDFS
$myhadoop$ ./currenthadoop/bin/hadoop fs -copyFromLocal ./b.txt /
14/02/03 11:59:48 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /b.txt could only be replicated to 0 nodes, instead of 1
Example (2): Running MapReduce Job:
$myhadoop$ ./currenthadoop/bin/hadoop jar hadoop-examples-1.2.1.jar pi 10 1
 Number of Maps  = 10
 Samples per Map = 1
 14/02/03 12:02:11 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/henryo/PiEstimator_TMP_3_141592654/in/part0 could only be replicated to 0 nodes, instead of 1
The root cause of the above problem is that the DataNode is not available, i.e. the DataNode process is not running at all.
You can verify this by running the jps command as below to make sure all key processes are running, for your specific HDFS/MR1/MR2 (YARN) version.
Hadoop Process for HDFS/MR1:

$ jps
69269 TaskTracker
69092 DataNode
68993 NameNode
69171 JobTracker

Hadoop Process for HDFS/MR2

$ jps
43624 DataNode
44005 ResourceManager
43529 NameNode
43890 SecondaryNameNode
44105 NodeManager

If you look at the DataNode logs you might see the reason why the DataNode could not be started, e.g. as below:

2014-02-03 17:50:37,334 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: Cannot locate configuration: tried hadoop-metrics2-datanode.properties, hadoop-metrics2.properties
2014-02-03 17:50:37,947 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /private/tmp/hdfs/datanode: namenode namespaceID = 1867802097; datanode namespaceID = 1895712546
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:321)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
Based on the above, the problem is that the folder where the HDFS DataNode stores its data (/tmp/hdfs/datanode) is not correctly configured. Either the folder does not exist, or its contents are unreadable, or the folder is inaccessible or locked.
Solution:
To solve this problem, check the accessibility of your HDFS DataNode folder, and once it is properly configured, start the DataNode/NameNode again.
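
In a dev/sandbox setup where the DataNode's data is disposable, one common recovery (a hedged sketch; the path must match the dfs.data.dir shown in the log above, and this deletes any blocks stored there) is to clear the DataNode directory so it re-registers with the NameNode's current namespaceID, and then restart the DataNode:

$ rm -rf /tmp/hdfs/datanode/*
$ ./currenthadoop/bin/hadoop-daemon.sh start datanode
$ jps    # DataNode should now appear in the list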

Troubleshooting YARN NodeManager – Unable to start NodeManager because mapreduce.shuffle value is invalid

With Hadoop 2.2.x you might find that the NodeManager is not running, and the failure reports the following error message when starting the YARN NodeManager:

2014-01-31 17:13:00,500 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Failed to initialize mapreduce.shuffle
java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:98)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:218)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:188)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:338)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:386)

 

If you check yarn-site.xml (in etc/hadoop/) you will see the following setting by default:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>

Solution:

To solve this problem you just need to change mapreduce.shuffle to mapreduce_shuffle as shown below:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Note: With Hadoop 0.23.10 the value mapreduce.shuffle is still correct and works fine, so this change applies to Hadoop 2.2.x.
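
Once the value is updated, restart the NodeManager so it picks up the change (a sketch assuming the stock sbin scripts shipped with Hadoop 2.2.x):

$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/yarn-daemon.sh start nodemanager
$ jps    # NodeManager should now be running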

 

 

YARN Job Problem: Application application_** failed 1 times due to AM Container for XX exited with exitCode: 127

Running a sample Pi job on YARN (Hadoop 0.23.x or 2.2.x) might fail with the following error message:

[Hadoop_Home] $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.10.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.23.10.jar 16 10000

Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
14/01/31 14:58:10 INFO input.FileInputFormat: Total input paths to process : 16
14/01/31 14:58:10 INFO mapreduce.JobSubmitter: number of splits:16
14/01/31 14:58:10 INFO mapred.ResourceMgrDelegate: Submitted application application_1391206707058_0002 to ResourceManager at /0.0.0.0:8032
14/01/31 14:58:10 INFO mapreduce.Job: The url to track the job: http://Avkashs-MacBook-Pro.local:8088/proxy/application_1391206707058_0002/
14/01/31 14:58:10 INFO mapreduce.Job: Running job: job_1391206707058_0002
14/01/31 14:58:12 INFO mapreduce.Job: Job job_1391206707058_0002 running in uber mode : false
14/01/31 14:58:12 INFO mapreduce.Job: map 0% reduce 0%
14/01/31 14:58:12 INFO mapreduce.Job: Job job_1391206707058_0002 failed with state FAILED due to: Application application_1391206707058_0002 failed 1 times due to AM Container for appattempt_1391206707058_0002_000001 exited with exitCode: 127 due to:
.Failing this attempt.. Failing the application.
14/01/31 14:58:12 INFO mapreduce.Job: Counters: 0
Job Finished in 2.676 seconds
java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/user/avkashchauhan/QuasiMonteCarlo_1391209089737_1265113759/out/reduce-out
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:738)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1685)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1709)
at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Root cause:

The problem is caused by YARN using a different path for the Java executable than the one on your OS. To troubleshoot, look at the local logs for the failed task, which show stderr and stdout; select "stderr" to see the failure, and you will find the following message:

/bin/bash: /bin/java: No such file or directory

[Screenshot: YARN task logs page showing the stderr output above]

The hardcoded path checked for java is /bin/java; however, if /bin/java is not your Java executable, the YARN job will fail. For example, on OS X I have Java 1.7 running at /usr/bin/java, as below:

$ java -version

java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

Solution:

To solve this problem on OS X, I created a link from /bin/java to /usr/bin/java as below:

$ sudo ln -s /usr/bin/java /bin/java
Password: *****
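
An alternative (hedged) fix is to make sure JAVA_HOME is exported for Hadoop/YARN, for example in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh, so that the container launcher resolves $JAVA_HOME/bin/java instead of falling back to /bin/java:

export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)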

Let's retry the Pi sample again:

$bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.10.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.23.10.jar 16 10000

Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
14/01/31 15:09:55 INFO input.FileInputFormat: Total input paths to process : 16
14/01/31 15:09:55 INFO mapreduce.JobSubmitter: number of splits:16
14/01/31 15:09:56 INFO mapred.ResourceMgrDelegate: Submitted application application_1391206707058_0003 to ResourceManager at /0.0.0.0:8032
14/01/31 15:09:56 INFO mapreduce.Job: The url to track the job: http://Avkashs-MacBook-Pro.local:8088/proxy/application_1391206707058_0003/
14/01/31 15:09:56 INFO mapreduce.Job: Running job: job_1391206707058_0003
14/01/31 15:10:01 INFO mapreduce.Job: Job job_1391206707058_0003 running in uber mode : false
14/01/31 15:10:01 INFO mapreduce.Job: map 0% reduce 0%
14/01/31 15:10:07 INFO mapreduce.Job: map 37% reduce 0%
14/01/31 15:10:12 INFO mapreduce.Job: map 50% reduce 0%
14/01/31 15:10:13 INFO mapreduce.Job: map 75% reduce 0%
14/01/31 15:10:18 INFO mapreduce.Job: map 100% reduce 0%
14/01/31 15:10:18 INFO mapreduce.Job: map 100% reduce 100%
14/01/31 15:10:18 INFO mapreduce.Job: Job job_1391206707058_0003 completed successfully
14/01/31 15:10:18 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=358
FILE: Number of bytes written=1088273
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4358
HDFS: Number of bytes written=215
HDFS: Number of read operations=67
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=16
Launched reduce tasks=1
Rack-local map tasks=16
Total time spent by all maps in occupied slots (ms)=61842
Total time spent by all reduces in occupied slots (ms)=4465
Map-Reduce Framework
Map input records=16
Map output records=32
Map output bytes=288
Map output materialized bytes=448
Input split bytes=2470
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=448
Reduce input records=32
Reduce output records=0
Spilled Records=64
Shuffled Maps =16
Failed Shuffles=0
Merged Map outputs=16
GC time elapsed (ms)=290
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=3422552064
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1888
File Output Format Counters
Bytes Written=97
Job Finished in 23.024 seconds
14/01/31 15:10:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Estimated value of Pi is 3.14127500000000000000