Running 10GB Sort Hadoop Job on Windows Azure with Teragen, TeraSort and TeraValidate Options

Running 10GB Sort Hadoop Job with Teragen Option:

This example consists of the three map/reduce applications that Owen O’Malley and Arun Murthy used to win the annual general-purpose (Daytona) terabyte sort benchmark at sortbenchmark.org. The sample is part of the prebuilt package in your Hadoop on Azure portal, so, just like any other prebuilt sample, you can deploy it to the cluster as below:

There are three steps to this example:
1. TeraGen is a map/reduce program to generate the data.
2. TeraSort samples the input data and uses map/reduce to sort the data into a total order.
3. TeraValidate is a map/reduce program that validates the output is sorted.
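
As a rough illustration of how the three steps chain together, the sketch below builds the command line for each stage. The jar name and the input and output folders are the ones used in this post. The report folder passed to teravalidate is a hypothetical path chosen for illustration, and passing each -D option as its own argument is an assumption of this sketch rather than the exact form the portal uses.

```python
# Sketch: build the argument lists for the three stages of the example.
JAR = "hadoop-examples-0.20.203.1-SNAPSHOT.jar"   # jar name used in this post
INPUT_DIR = "/example/data/10GB-sort-input"        # folder used in this post
OUTPUT_DIR = "/example/data/10GB-sort-out"         # folder used in this post
REPORT_DIR = "/example/data/10GB-sort-validate"    # hypothetical, not from the post


def teragen_cmd(rows, maps):
    """Stage 1: generate `rows` rows of input data."""
    return ["hadoop", "jar", JAR, "teragen",
            f"-Dmapred.map.tasks={maps}", str(rows), INPUT_DIR]


def terasort_cmd(maps, reduces):
    """Stage 2: sort the generated data into a total order."""
    return ["hadoop", "jar", JAR, "terasort",
            f"-Dmapred.map.tasks={maps}", f"-Dmapred.reduce.tasks={reduces}",
            INPUT_DIR, OUTPUT_DIR]


def teravalidate_cmd():
    """Stage 3: check that the sorted output really is in order."""
    return ["hadoop", "jar", JAR, "teravalidate", OUTPUT_DIR, REPORT_DIR]


def pipeline(rows=100_000_000, maps=50, reduces=25):
    """Return the three commands in the order they should run."""
    return [teragen_cmd(rows, maps), terasort_cmd(maps, reduces), teravalidate_cmd()]


if __name__ == "__main__":
    for cmd in pipeline():
        print(" ".join(cmd))
```

Each stage reads what the previous one wrote, so the order matters: teragen fills the input folder, terasort reads it and writes the output folder, and teravalidate reads the output folder.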

The example deployment is pre-loaded with the first Teragen job.

1. Teragen (sample loaded by default)
> hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar teragen "-Dmapred.map.tasks=50" 100000000 /example/data/10GB-sort-input
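
To see why 100000000 rows add up to "10GB", here is a small arithmetic sketch. It relies on the fact that TeraGen writes fixed-size 100-byte rows. The function name and dictionary keys are made up for illustration.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_plan(rows, maps):
    """Work out the data volume implied by a teragen invocation."""
    rows_per_map = rows // maps
    return {
        "total_bytes": rows * ROW_BYTES,
        "rows_per_map": rows_per_map,
        "bytes_per_file": rows_per_map * ROW_BYTES,
    }


if __name__ == "__main__":
    plan = teragen_plan(rows=100_000_000, maps=50)
    print(plan)
    # {'total_bytes': 10000000000, 'rows_per_map': 2000000, 'bytes_per_file': 200000000}
```

So the job writes 10,000,000,000 bytes (10 GB in decimal units), and each of the 50 map tasks produces 2,000,000 rows, or about 200 MB per output file.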

Once the sample is deployed to the cluster, you can verify the parameters first and then start the job:

Once the job is started, it first creates the input data in 50 different files on HDFS…

…which you can verify in HDFS management as below:

Finally, when the job is completed, the results are displayed as below:

10GB Terasort Example

Job Info

Status: Completed Successfully
Type: jar
Start time: 12/30/2011 5:54:16 PM
End time: 12/30/2011 6:04:59 PM
Exit code: 0

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar teragen "-Dmapred.map.tasks=50" 100000000 /example/data/10GB-sort-input

Output (stdout)

Generating 100000000 using 50 maps with step of 2000000

Errors (stderr)
11/12/30 17:54:20 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 17:54:49 INFO mapred.JobClient: map 2% reduce 0%
11/12/30 17:54:52 INFO mapred.JobClient: map 4% reduce 0%
11/12/30 17:54:55 INFO mapred.JobClient: map 5% reduce 0%
11/12/30 17:55:01 INFO mapred.JobClient: map 6% reduce 0%
11/12/30 17:55:22 INFO mapred.JobClient: map 7% reduce 0%
11/12/30 17:55:28 INFO mapred.JobClient: map 8% reduce 0%
11/12/30 17:55:43 INFO mapred.JobClient: map 9% reduce 0%
11/12/30 17:55:46 INFO mapred.JobClient: map 12% reduce 0%
11/12/30 17:55:49 INFO mapred.JobClient: map 14% reduce 0%
11/12/30 17:56:10 INFO mapred.JobClient: map 15% reduce 0%
11/12/30 17:56:13 INFO mapred.JobClient: map 16% reduce 0%
11/12/30 17:56:28 INFO mapred.JobClient: map 18% reduce 0%
11/12/30 17:56:31 INFO mapred.JobClient: map 19% reduce 0%
11/12/30 17:56:34 INFO mapred.JobClient: map 20% reduce 0%
11/12/30 17:56:43 INFO mapred.JobClient: map 21% reduce 0%
11/12/30 17:56:49 INFO mapred.JobClient: map 22% reduce 0%
11/12/30 17:56:52 INFO mapred.JobClient: map 23% reduce 0%
11/12/30 17:56:58 INFO mapred.JobClient: map 24% reduce 0%
11/12/30 17:57:01 INFO mapred.JobClient: map 25% reduce 0%
11/12/30 17:57:04 INFO mapred.JobClient: map 26% reduce 0%
11/12/30 17:57:10 INFO mapred.JobClient: map 28% reduce 0%
11/12/30 17:57:19 INFO mapred.JobClient: map 29% reduce 0%
11/12/30 17:57:22 INFO mapred.JobClient: map 30% reduce 0%
11/12/30 17:57:28 INFO mapred.JobClient: map 31% reduce 0%
11/12/30 17:57:31 INFO mapred.JobClient: map 32% reduce 0%
11/12/30 17:58:04 INFO mapred.JobClient: map 33% reduce 0%
11/12/30 17:58:07 INFO mapred.JobClient: map 35% reduce 0%
11/12/30 17:58:10 INFO mapred.JobClient: map 36% reduce 0%
11/12/30 17:58:13 INFO mapred.JobClient: map 37% reduce 0%
11/12/30 17:58:19 INFO mapred.JobClient: map 38% reduce 0%
11/12/30 17:58:25 INFO mapred.JobClient: map 39% reduce 0%
11/12/30 17:58:34 INFO mapred.JobClient: map 40% reduce 0%
11/12/30 17:58:37 INFO mapred.JobClient: map 42% reduce 0%
11/12/30 17:58:44 INFO mapred.JobClient: map 43% reduce 0%
11/12/30 17:58:47 INFO mapred.JobClient: map 44% reduce 0%
11/12/30 17:58:52 INFO mapred.JobClient: map 45% reduce 0%
11/12/30 17:58:59 INFO mapred.JobClient: map 46% reduce 0%
11/12/30 17:59:23 INFO mapred.JobClient: map 48% reduce 0%
11/12/30 17:59:26 INFO mapred.JobClient: map 49% reduce 0%
11/12/30 17:59:32 INFO mapred.JobClient: map 50% reduce 0%
11/12/30 17:59:40 INFO mapred.JobClient: map 51% reduce 0%
11/12/30 17:59:44 INFO mapred.JobClient: map 52% reduce 0%
11/12/30 17:59:46 INFO mapred.JobClient: map 53% reduce 0%
11/12/30 17:59:47 INFO mapred.JobClient: map 54% reduce 0%
11/12/30 17:59:58 INFO mapred.JobClient: map 55% reduce 0%
11/12/30 18:00:11 INFO mapred.JobClient: map 56% reduce 0%
11/12/30 18:00:14 INFO mapred.JobClient: map 58% reduce 0%
11/12/30 18:00:16 INFO mapred.JobClient: map 59% reduce 0%
11/12/30 18:00:20 INFO mapred.JobClient: map 60% reduce 0%
11/12/30 18:00:23 INFO mapred.JobClient: map 61% reduce 0%
11/12/30 18:00:31 INFO mapred.JobClient: map 62% reduce 0%
11/12/30 18:00:50 INFO mapred.JobClient: map 63% reduce 0%
11/12/30 18:00:53 INFO mapred.JobClient: map 65% reduce 0%
11/12/30 18:00:59 INFO mapred.JobClient: map 66% reduce 0%
11/12/30 18:01:10 INFO mapred.JobClient: map 67% reduce 0%
11/12/30 18:01:13 INFO mapred.JobClient: map 68% reduce 0%
11/12/30 18:01:14 INFO mapred.JobClient: map 69% reduce 0%
11/12/30 18:01:17 INFO mapred.JobClient: map 70% reduce 0%
11/12/30 18:01:20 INFO mapred.JobClient: map 71% reduce 0%
11/12/30 18:01:23 INFO mapred.JobClient: map 72% reduce 0%
11/12/30 18:01:37 INFO mapred.JobClient: map 73% reduce 0%
11/12/30 18:01:38 INFO mapred.JobClient: map 74% reduce 0%
11/12/30 18:01:50 INFO mapred.JobClient: map 75% reduce 0%
11/12/30 18:02:07 INFO mapred.JobClient: map 76% reduce 0%
11/12/30 18:02:11 INFO mapred.JobClient: map 77% reduce 0%
11/12/30 18:02:14 INFO mapred.JobClient: map 78% reduce 0%
11/12/30 18:02:17 INFO mapred.JobClient: map 79% reduce 0%
11/12/30 18:02:20 INFO mapred.JobClient: map 80% reduce 0%
11/12/30 18:02:32 INFO mapred.JobClient: map 81% reduce 0%
11/12/30 18:02:44 INFO mapred.JobClient: map 82% reduce 0%
11/12/30 18:02:53 INFO mapred.JobClient: map 83% reduce 0%
11/12/30 18:02:59 INFO mapred.JobClient: map 84% reduce 0%
11/12/30 18:03:05 INFO mapred.JobClient: map 85% reduce 0%
11/12/30 18:03:08 INFO mapred.JobClient: map 87% reduce 0%
11/12/30 18:03:14 INFO mapred.JobClient: map 88% reduce 0%
11/12/30 18:03:20 INFO mapred.JobClient: map 89% reduce 0%
11/12/30 18:03:38 INFO mapred.JobClient: map 90% reduce 0%
11/12/30 18:03:41 INFO mapred.JobClient: map 92% reduce 0%
11/12/30 18:03:47 INFO mapred.JobClient: map 93% reduce 0%
11/12/30 18:03:50 INFO mapred.JobClient: map 94% reduce 0%
11/12/30 18:03:56 INFO mapred.JobClient: map 95% reduce 0%
11/12/30 18:04:05 INFO mapred.JobClient: map 96% reduce 0%
11/12/30 18:04:11 INFO mapred.JobClient: map 97% reduce 0%
11/12/30 18:04:14 INFO mapred.JobClient: map 98% reduce 0%
11/12/30 18:04:23 INFO mapred.JobClient: map 99% reduce 0%
11/12/30 18:04:47 INFO mapred.JobClient: map 100% reduce 0%
11/12/30 18:04:58 INFO mapred.JobClient: Job complete: job_201112290558_0005
11/12/30 18:04:58 INFO mapred.JobClient: Counters: 16
11/12/30 18:04:58 INFO mapred.JobClient: Job Counters
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4761149
11/12/30 18:04:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 18:04:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 18:04:58 INFO mapred.JobClient: Launched map tasks=54
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/12/30 18:04:58 INFO mapred.JobClient: File Input Format Counters
11/12/30 18:04:58 INFO mapred.JobClient: Bytes Read=0
11/12/30 18:04:58 INFO mapred.JobClient: File Output Format Counters
11/12/30 18:04:58 INFO mapred.JobClient: Bytes Written=10000000000
11/12/30 18:04:58 INFO mapred.JobClient: FileSystemCounters
11/12/30 18:04:58 INFO mapred.JobClient: FILE_BYTES_READ=113880
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_READ=4288
11/12/30 18:04:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1180870
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000000
11/12/30 18:04:58 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 18:04:58 INFO mapred.JobClient: Map input records=100000000
11/12/30 18:04:58 INFO mapred.JobClient: Spilled Records=0
11/12/30 18:04:58 INFO mapred.JobClient: Map input bytes=100000000
11/12/30 18:04:58 INFO mapred.JobClient: Map output records=100000000
11/12/30 18:04:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=4288
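
The counters at the end of the log are easier to work with once they are pulled into a dictionary. The following is a minimal sketch that assumes the log has been saved as plain text in the format shown above. The regular expression and function name are illustrative only, and the sample lines are copied from the log above.

```python
import re

# Matches JobClient lines of the form "... INFO mapred.JobClient: <name>=<integer>"
COUNTER_RE = re.compile(r"INFO mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$")


def parse_counters(log_text):
    """Collect every 'name=value' counter line into a dict of ints."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


SAMPLE = """\
11/12/30 18:04:58 INFO mapred.JobClient: Counters: 16
11/12/30 18:04:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4761149
11/12/30 18:04:58 INFO mapred.JobClient: Launched map tasks=54
11/12/30 18:04:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000000
11/12/30 18:04:58 INFO mapred.JobClient: Map output records=100000000
"""

if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    # Average slot time per launched map task, in seconds
    print(c["SLOTS_MILLIS_MAPS"] / c["Launched map tasks"] / 1000)
    # Bytes written per generated record
    print(c["HDFS_BYTES_WRITTEN"] / c["Map output records"])
```

With the values from this run, each launched map task occupied a slot for roughly 88 seconds on average, and each record accounts for 100 bytes of output.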

 

Running 10GB Sort Hadoop Job with TeraSort Option:

In this section we will run the same 10GB sorting Hadoop job with the TeraSort option. With the TeraSort option, the parameters are changed as below:

With the above parameters, there are a couple of things to remember:

  1. The /example/data/10GB-sort-input folder must exist and contain data (it is created when you run the teragen option first, as explained in Exercise 5)
  2. You also need to have

Once you start the job, you will see that it reads data from the /example/data/10GB-sort-input folder:

Now you can log in to your cluster using your Remote Desktop (RD) credentials and open the local IP address on port 50030 for node administration as below:

If you want to know how many nodes the cluster has, click the “Nodes” entry in the Cluster Summary table above to see the details and IP address of each node:

In the Map/Reduce Administration page, once you click on the running job, you can get further details about your job progress as below:

If you want to dig further into individual pending, running, or completed tasks, just click on either the Map or the Reduce task counter above and you will see details as below:

Pending Tasks:

Completed Tasks:

Now, if you select a completed task and open it for more information, you will see:

While your job is running, you can also visualize the Map/Reduce process either at the Hadoop on Azure portal or directly in the node administration section inside the cluster, as below:

Job Progress at Hadoop on Azure Portal:

11/12/30 19:41:45 INFO mapred.JobClient: map 57% reduce 4%
11/12/30 19:41:57 INFO mapred.JobClient: map 59% reduce 4%
11/12/30 19:42:06 INFO mapred.JobClient: map 60% reduce 4%
11/12/30 19:42:07 INFO mapred.JobClient: map 61% reduce 4%
11/12/30 19:42:15 INFO mapred.JobClient: map 62% reduce 4%
11/12/30 19:42:18 INFO mapred.JobClient: map 63% reduce 4%
11/12/30 19:42:21 INFO mapred.JobClient: map 64% reduce 4%

Job progress seen directly inside the cluster at http://10.186.42.44:50030/jobdetails.jsp?jobid=job_201112290558_0006&refresh=30
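
The job details address above follows a simple pattern: the node's local IP address, port 50030, and the job ID as a query parameter. The helper below is a small sketch of building that address for any job. The function name and its defaults are illustrative, and the host and job ID are the ones from this run.

```python
from urllib.parse import urlencode


def job_details_url(host, job_id, refresh_seconds=30, port=50030):
    """Build the Map/Reduce administration URL for one job."""
    query = urlencode({"jobid": job_id, "refresh": refresh_seconds})
    return f"http://{host}:{port}/jobdetails.jsp?{query}"


if __name__ == "__main__":
    print(job_details_url("10.186.42.44", "job_201112290558_0006"))
    # http://10.186.42.44:50030/jobdetails.jsp?jobid=job_201112290558_0006&refresh=30
```

The refresh parameter is the page reload interval in seconds, so the page keeps updating while the job runs.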

Finally, the job is complete when both the Map and Reduce phases reach 100%:

10GB Terasort Example

Job Info

Status: Completed Successfully
Type: jar
Start time: 12/30/2011 7:36:50 PM
End time: 12/30/2011 8:48:36 PM
Exit code: 0
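
The start and end times reported by the portal make it easy to compare the two runs. This is a small sketch using the timestamps shown in the two Job Info blocks of this post. The throughput figure assumes 10,000,000,000 bytes were processed in each job, as the byte counters in both logs report.

```python
from datetime import datetime

PORTAL_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "12/30/2011 5:54:16 PM"
DATA_BYTES = 10_000_000_000             # bytes written by each job


def job_duration(start, end):
    """Return the elapsed time between two portal timestamps."""
    return datetime.strptime(end, PORTAL_FORMAT) - datetime.strptime(start, PORTAL_FORMAT)


if __name__ == "__main__":
    teragen = job_duration("12/30/2011 5:54:16 PM", "12/30/2011 6:04:59 PM")
    terasort = job_duration("12/30/2011 7:36:50 PM", "12/30/2011 8:48:36 PM")
    for name, elapsed in (("teragen", teragen), ("terasort", terasort)):
        mb_per_s = DATA_BYTES / elapsed.total_seconds() / 1_000_000
        print(f"{name}: {elapsed} ({mb_per_s:.2f} MB/s)")
    # teragen: 0:10:43 (15.55 MB/s)
    # terasort: 1:11:46 (2.32 MB/s)
```

Generating the data took under 11 minutes, while sorting it took about 72 minutes on this cluster.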

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar terasort "-Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25" /example/data/10GB-sort-input /example/data/10GB-sort-out

Output (stdout)

Making 1 from 100000 records
Step size is 100000.0
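
The two stdout lines above come from the sampling step: TeraSort sampled 100000 records and derived the boundaries for 1 partition, so the step size is 100000 / 1. The sketch below illustrates the general idea of sample-based range partitioning. It is a simplified illustration with made-up function names, not TeraSort's actual implementation.

```python
import bisect


def split_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sorted sample."""
    keys = sorted(sample_keys)
    if len(keys) < num_partitions:
        raise ValueError("need at least one sampled key per partition")
    step = len(keys) / num_partitions  # the "step size"
    return [keys[round(step * i)] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the partition (reducer) a key belongs to."""
    return bisect.bisect_right(boundaries, key)


if __name__ == "__main__":
    sample = list(range(100))
    print(split_points(sample, 4))   # [25, 50, 75]
    print(split_points(sample, 1))   # [] (one partition needs no boundaries)
    print(partition_for(60, split_points(sample, 4)))  # 2
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the sorted outputs of the reducers in order yields a totally ordered result.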

Errors (stderr)

11/12/30 19:36:51 INFO terasort.TeraSort: starting
11/12/30 19:36:51 INFO mapred.FileInputFormat: Total input paths to process : 50
11/12/30 19:36:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/12/30 19:36:52 INFO compress.CodecPool: Got brand-new compressor
11/12/30 19:36:53 INFO mapred.FileInputFormat: Total input paths to process : 50
11/12/30 19:36:54 INFO mapred.JobClient: Running job: job_201112290558_0006
11/12/30 19:36:55 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 19:37:24 INFO mapred.JobClient: map 2% reduce 0%
11/12/30 19:37:26 INFO mapred.JobClient: map 5% reduce 0%
11/12/30 19:37:44 INFO mapred.JobClient: map 7% reduce 0%
11/12/30 19:37:48 INFO mapred.JobClient: map 8% reduce 0%
11/12/30 19:37:50 INFO mapred.JobClient: map 9% reduce 0%
11/12/30 19:37:52 INFO mapred.JobClient: map 10% reduce 0%

……

……

11/12/30 20:47:24 INFO mapred.JobClient: map 100% reduce 98%
11/12/30 20:47:51 INFO mapred.JobClient: map 100% reduce 99%
11/12/30 20:48:21 INFO mapred.JobClient: map 100% reduce 100%
11/12/30 20:48:35 INFO mapred.JobClient: Job complete: job_201112290558_0006
11/12/30 20:48:35 INFO mapred.JobClient: Counters: 27
11/12/30 20:48:35 INFO mapred.JobClient: Job Counters
11/12/30 20:48:35 INFO mapred.JobClient: Launched reduce tasks=1
11/12/30 20:48:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3766703
11/12/30 20:48:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/30 20:48:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/30 20:48:35 INFO mapred.JobClient: Rack-local map tasks=1
11/12/30 20:48:35 INFO mapred.JobClient: Launched map tasks=153
11/12/30 20:48:35 INFO mapred.JobClient: Data-local map tasks=152
11/12/30 20:48:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=4244002
11/12/30 20:48:35 INFO mapred.JobClient: File Input Format Counters
11/12/30 20:48:35 INFO mapred.JobClient: Bytes Read=10013107300
11/12/30 20:48:35 INFO mapred.JobClient: File Output Format Counters
11/12/30 20:48:35 INFO mapred.JobClient: Bytes Written=10000000000
11/12/30 20:48:35 INFO mapred.JobClient: FileSystemCounters
11/12/30 20:48:35 INFO mapred.JobClient: FILE_BYTES_READ=26766944216
11/12/30 20:48:35 INFO mapred.JobClient: HDFS_BYTES_READ=10013124850
11/12/30 20:48:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=36970291186
11/12/30 20:48:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000000
11/12/30 20:48:35 INFO mapred.JobClient: Map-Reduce Framework
11/12/30 20:48:35 INFO mapred.JobClient: Map output materialized bytes=10200000900
11/12/30 20:48:35 INFO mapred.JobClient: Map input records=100000000
11/12/30 20:48:35 INFO mapred.JobClient: Reduce shuffle bytes=10132903050
11/12/30 20:48:35 INFO mapred.JobClient: Spilled Records=362419016
11/12/30 20:48:35 INFO mapred.JobClient: Map output bytes=10000000000
11/12/30 20:48:35 INFO mapred.JobClient: Map input bytes=10000000000
11/12/30 20:48:35 INFO mapred.JobClient: Combine input records=0
11/12/30 20:48:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=17550
11/12/30 20:48:35 INFO mapred.JobClient: Reduce input records=100000000
11/12/30 20:48:35 INFO mapred.JobClient: Reduce input groups=100000000
11/12/30 20:48:35 INFO mapred.JobClient: Combine output records=0
11/12/30 20:48:35 INFO mapred.JobClient: Reduce output records=100000000
11/12/30 20:48:35 INFO mapred.JobClient: Map output records=100000000
11/12/30 20:48:35 INFO terasort.TeraSort: done
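
A few ratios derived from the counters above help in reading this run. The sketch below copies the counter values from the log by hand. The metric names are made up for illustration, and interpreting them is left to the reader.

```python
# Counter values copied from the TeraSort log above.
counters = {
    "Launched map tasks": 153,
    "Data-local map tasks": 152,
    "Launched reduce tasks": 1,
    "Spilled Records": 362419016,
    "Map output records": 100000000,
    "Map output bytes": 10000000000,
    "Map output materialized bytes": 10200000900,
}


def derived_metrics(c):
    """Compute a few simple ratios from the raw job counters."""
    return {
        # Share of map tasks that read their input from a local disk
        "data_local_fraction": c["Data-local map tasks"] / c["Launched map tasks"],
        # Records written to disk as spills, per record produced by the maps
        "spills_per_record": c["Spilled Records"] / c["Map output records"],
        # Extra bytes added when map output is materialized for the shuffle
        "materialized_overhead": c["Map output materialized bytes"] / c["Map output bytes"],
    }


if __name__ == "__main__":
    for name, value in derived_metrics(counters).items():
        print(f"{name}: {value:.4f}")
    print("reduce tasks launched:", counters["Launched reduce tasks"])
```

Two things stand out in this run: the log reports a single launched reduce task even though the command requested 25, and records were spilled to disk about 3.6 times each on average.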

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce

One thought on “Running 10GB Sort Hadoop Job on Windows Azure with Teragen, TeraSort and TeraValidate Options”

  1. Hi Avkash,

    Thanks for the detailed article. Could you give a split up of how long the mapper and reducer ran individually?
    And I would also like to know the significance of following options that you have used:
    -Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25

    I am running a similar job using Isotope, it is held up in reducer, which has single task which is in Pending state for ever!
