Using Windows Azure Blob Storage with Map/Reduce Job in Windows Azure Hadoop Cluster

The Microsoft distribution of Apache Hadoop comes with direct connectivity to cloud storage, i.e. Windows Azure Blob Storage or Amazon S3. Here we will learn how to connect to your Windows Azure Storage account directly from your Hadoop cluster.

 

As you know, Windows Azure Storage access requires the following two things:

  1. Azure Storage Name
  2. Azure Storage Access Key

Using the above information, we build the storage connection string from the following parts:

  • DefaultEndpointsProtocol=https;
  • AccountName=<Your_Azure_Blob_Storage_Name>;
  • AccountKey=<Azure_Storage_Key>
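Joined together, these parts form a single connection string; this is the exact value that goes into the core-site.xml setting shown below (the placeholders are kept as-is here):

DefaultEndpointsProtocol=https;AccountName=<Your_Azure_Blob_Storage_Name>;AccountKey=<Azure_Storage_Key>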

 

Now we just need to set up the above information inside the Hadoop cluster configuration. To do that, open C:\Apps\Dist\conf\core-site.xml and include the following two properties related to Azure Blob Storage access from the Hadoop cluster:

 

<property>
  <name>fs.azure.buffer.dir</name>
  <value>/tmp</value>
</property>
<property>
  <name>fs.azure.storageConnectionString</name>
  <value>DefaultEndpointsProtocol=https;AccountName=<YourAzureBlobStoreName>;AccountKey=<YourAzurePrimaryKey></value>
</property>

 

The above configuration makes Azure Blob Storage available inside Hadoop through the asv:// scheme, which maps to your storage account as follows:

asv:// => https://<Azure_Blob_Storage_name>.blob.core.windows.net
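For example, using the container and blob names from this walkthrough (and keeping the storage account name as a placeholder), an asv:// path such as the first line below would generally resolve to the blob URL on the second line:

asv://hadoop/input/helloworldblob.txt
https://<Azure_Blob_Storage_name>.blob.core.windows.net/hadoop/input/helloworldblob.txt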

 

Now let’s try to list the blobs in a specific container:

c:\apps\dist>hadoop fs -lsr asv://hadoop/input

-rwxrwxrwx   1        107 2012-01-05 05:52 /input/helloworldblob.txt

 

Let’s verify in Azure Storage that the results we received above match the blobs actually stored in the container.

 

For example, if you want to copy a file from the Hadoop cluster to Azure Storage, you can use the following command:

hadoop fs -copyFromLocal <Filename> asv://<Target_Container_Name>/<Blob_Name_or_same_filename>

 

Example:

c:\Apps>hadoop.cmd fs -copyFromLocal helloworld.txt asv://filefromhadoop/helloworldblob.txt

This will upload the helloworld.txt file to the container named “filefromhadoop” as a blob named “helloworldblob.txt”.

 

c:\Apps>hadoop.cmd fs -copyToLocal asv://hadoop/input/helloworldblob.txt helloworldblob.txt

This command will download the helloworldblob.txt blob from Azure Storage and make it available locally on the Hadoop cluster node.

 

Please see the usage output below to learn more about the “hadoop fs” command:

 

c:\Apps>hadoop fs
Usage: java FsShell
           [-ls <path>]
           [-lsr <path>]
           [-du <path>]
           [-dus <path>]
           [-count[-q] <path>]
           [-mv <src> <dst>]
           [-cp <src> <dst>]
           [-rm [-skipTrash] <path>]
           [-rmr [-skipTrash] <path>]
           [-expunge]
           [-put <localsrc> ... <dst>]
           [-copyFromLocal <localsrc> ... <dst>]
           [-moveFromLocal <localsrc> ... <dst>]
           [-get [-ignoreCrc] [-crc] <src> <localdst>]
           [-getmerge <src> <localdst> [addnl]]
           [-cat <src>]
           [-text <src>]
           [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
           [-moveToLocal [-crc] <src> <localdst>]
           [-mkdir <path>]
           [-setrep [-R] [-w] <rep> <path/file>]
           [-touchz <path>]
           [-test -[ezd] <path>]
           [-stat [format] <path>]
           [-tail [-f] <file>]
           [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
           [-chown [-R] [OWNER][:[GROUP]] PATH...]
           [-chgrp [-R] GROUP PATH...]
           [-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
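As a quick illustration of combining the generic options with the fs shell (this exact invocation is my own example, not taken from the original walkthrough), you could point the command at a specific configuration file while listing an asv:// path:

c:\Apps>hadoop fs -conf C:\Apps\Dist\conf\core-site.xml -lsr asv://hadoop/input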

 

After setting up your Hadoop configuration to connect with Azure Storage, verify that the connection is working. Then, before running a Hadoop job, be sure you understand the correct format for asv:// paths, as shown below:

 

When specifying an input or output location in Azure Storage, you must use the following format:

Input

      asv://<container_name>/<symbolic_folder_name>

Example: asv://hadoop/input

Output

      asv://<container_name>/<symbolic_folder_name>

Example: asv://hadoop/output

Note: If you use only asv://<container_name> (with no folder), the job will return an error.
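For example, using the container from this walkthrough:

asv://hadoop/input      valid input location
asv://hadoop            container only, no folder: the job will fail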

 

Let’s verify in Azure Storage that we do have some data in the proper location.

 

 

The contents of the file helloworldblob.txt are as below:

This is Hello World

I like Hello World

Hello Country

Hello World

Love World

World is Love

Hello World

 

 

Now let’s run a simple WordCount Map/Reduce job, using helloworldblob.txt as the input file and storing the results in Azure Storage as well.

 

 

Job Command:

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://hadoop/input asv://hadoop/output
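For reference, the wordcount program in the hadoop-examples jar follows the classic tokenizing-mapper / summing-reducer pattern. The sketch below is a minimal version of that pattern rather than the exact shipped source (class names here are illustrative); note that the input and output paths are taken straight from the command line, so asv:// locations work without any code changes:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Paths come straight from the command line, e.g. asv://hadoop/input and asv://hadoop/output.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}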

 

Once the job completes, the results are available in the asv://hadoop/output location.

 

Opening part-r-00000 shows the results as below:

Country  1
Hello    5
I        1
Love     2
This     1
World    6
is       2
like     1

 

Finally, the Azure HeadNode WebApp shows the following final output for the Hadoop job:

 

WordCount Example


Job Info

Status: Completed Successfully
Type: jar
Start time: 1/5/2012 5:53:49 AM
End time: 1/5/2012 5:55:52 AM
Exit code: 0

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://hadoop/input asv://hadoop/output

Output (stdout)

 

Errors (stderr)
12/01/05 05:53:59 INFO mapred.JobClient: Running job: job_201201042206_0001
12/01/05 05:54:00 INFO mapred.JobClient: map 0% reduce 0%
12/01/05 05:54:39 INFO mapred.JobClient: map 100% reduce 0%
12/01/05 05:55:00 INFO mapred.JobClient: map 100% reduce 66%
12/01/05 05:55:30 INFO mapred.JobClient: map 100% reduce 100%
12/01/05 05:55:51 INFO mapred.JobClient: Job complete: job_201201042206_0001
12/01/05 05:55:51 INFO mapred.JobClient: Counters: 25
12/01/05 05:55:51 INFO mapred.JobClient: Job Counters
12/01/05 05:55:51 INFO mapred.JobClient: Launched reduce tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=40856
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Rack-local map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: Launched map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=48433
12/01/05 05:55:51 INFO mapred.JobClient: File Output Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Written=56
12/01/05 05:55:51 INFO mapred.JobClient: FileSystemCounters
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_READ=1134
12/01/05 05:55:51 INFO mapred.JobClient: HDFS_BYTES_READ=102
12/01/05 05:55:51 INFO mapred.JobClient: ASV_BYTES_WRITTEN=56
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44949
12/01/05 05:55:51 INFO mapred.JobClient: File Input Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Read=0
12/01/05 05:55:51 INFO mapred.JobClient: Map-Reduce Framework
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input groups=8
12/01/05 05:55:51 INFO mapred.JobClient: Map output materialized bytes=94
12/01/05 05:55:51 INFO mapred.JobClient: Combine output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Map input records=7
12/01/05 05:55:51 INFO mapred.JobClient: Reduce shuffle bytes=0
12/01/05 05:55:51 INFO mapred.JobClient: Reduce output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Spilled Records=16
12/01/05 05:55:51 INFO mapred.JobClient: Map output bytes=178
12/01/05 05:55:51 INFO mapred.JobClient: Combine input records=19
12/01/05 05:55:51 INFO mapred.JobClient: Map output records=19
12/01/05 05:55:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=102
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input records=8

 

 

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce

