Writing your very own WordCount Hadoop Job in Java and deploying it to a Windows Azure Cluster

In this article, I will help you write your own WordCount Hadoop Job and then deploy it to a Windows Azure cluster for processing.

Let’s create a Java source file named “AvkashWordCount.java” as below:

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvkashWordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1) for every token.
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted for this word and write the total.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(AvkashWordCount.class);
        job.setJobName("avkashwordcountjob");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(AvkashWordCount.Map.class);
        // The reducer is associative and commutative, so it can also serve as the combiner.
        job.setCombinerClass(AvkashWordCount.Reduce.class);
        job.setReducerClass(AvkashWordCount.Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
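As a quick sanity check of the logic above, consider a hypothetical one-line input file containing “apple orange apple”. The mapper emits (apple, 1), (orange, 1), (apple, 1); the combiner/reducer then sums the counts per word, so the job output would read:

apple	2
orange	1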

Let’s compile the Java code first. You must have Hadoop 0.20 or above installed on your machine to build this code:

C:\AzureJava> C:\Apps\java\openjdk7\bin\javac -classpath c:\Apps\dist\hadoop-core-0.20.203.1-SNAPSHOT.jar -d . AvkashWordCount.java

Now let’s create the JAR file:

C:\AzureJava> C:\Apps\java\openjdk7\bin\jar -cvf AvkashWordCount.jar org

added manifest

adding: org/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/AvkashWordCount$Map.class(in = 1893) (out= 792)(deflated 58%)

adding: org/myorg/AvkashWordCount$Reduce.class(in = 1378) (out= 596)(deflated 56%)

adding: org/myorg/AvkashWordCount.class(in = 1399) (out= 754)(deflated 46%)
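If you also have a local single-node Hadoop installation, you can optionally smoke-test the JAR from the command line before deploying it. A minimal sketch, assuming the Hadoop binaries live under c:\Apps\dist (the same location as the hadoop-core JAR used for compilation) and that the input folder already exists in HDFS:

C:\AzureJava> c:\Apps\dist\bin\hadoop jar AvkashWordCount.jar org.myorg.AvkashWordCount /user/avkash/inputfolder /user/avkash/outputfolder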

Once the JAR is created, deploy it to your Windows Azure Hadoop cluster as described below:

On the cluster portal’s job creation page, follow these steps:

  • Step 1: Click Browse to select your “AvkashWordCount.jar” file
  • Step 2: Enter the job name as defined in the source code (avkashwordcountjob)
  • Step 3: Add the fully qualified class name org.myorg.AvkashWordCount as the first parameter
  • Step 4: Add the input folder name from which files will be read for word counting
  • Step 5: Add the output folder name where the results will be stored
  • Step 6: Start the Job

Note: Be sure to have some data in your input folder. (I am using /user/avkash/inputfolder, which contains a text file with lots of words to be used as the Word Count input.)
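If the input folder does not exist yet, you can create it and upload a text file with the standard hadoop fs commands from the cluster’s Hadoop command shell; a sketch, where mywords.txt is a placeholder for your own input file:

hadoop fs -mkdir /user/avkash/inputfolder
hadoop fs -put mywords.txt /user/avkash/inputfolder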

Once the job is started, you will see the results as below:

avkashwordcountjob

Job Info

Status: Completed Successfully
Type: jar
Start time: 12/31/2011 4:06:51 PM
End time: 12/31/2011 4:07:53 PM
Exit code: 0

Command

call hadoop.cmd jar AvkashWordCount.jar org.myorg.AvkashWordCount /user/avkash/inputfolder /user/avkash/outputfolder

Output (stdout)

Errors (stderr)
11/12/31 16:06:53 INFO input.FileInputFormat: Total input paths to process : 1
11/12/31 16:06:54 INFO mapred.JobClient: Running job: job_201112310614_0001
11/12/31 16:06:55 INFO mapred.JobClient: map 0% reduce 0%
11/12/31 16:07:20 INFO mapred.JobClient: map 100% reduce 0%
11/12/31 16:07:42 INFO mapred.JobClient: map 100% reduce 100%
11/12/31 16:07:53 INFO mapred.JobClient: Job complete: job_201112310614_0001
11/12/31 16:07:53 INFO mapred.JobClient: Counters: 25
11/12/31 16:07:53 INFO mapred.JobClient: Job Counters
11/12/31 16:07:53 INFO mapred.JobClient: Launched reduce tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29029
11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/31 16:07:53 INFO mapred.JobClient: Launched map tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: Data-local map tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18764
11/12/31 16:07:53 INFO mapred.JobClient: File Output Format Counters
11/12/31 16:07:53 INFO mapred.JobClient: Bytes Written=123
11/12/31 16:07:53 INFO mapred.JobClient: FileSystemCounters
11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_READ=709
11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_READ=234
11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43709
11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=123
11/12/31 16:07:53 INFO mapred.JobClient: File Input Format Counters
11/12/31 16:07:53 INFO mapred.JobClient: Bytes Read=108
11/12/31 16:07:53 INFO mapred.JobClient: Map-Reduce Framework
11/12/31 16:07:53 INFO mapred.JobClient: Reduce input groups=7
11/12/31 16:07:53 INFO mapred.JobClient: Map output materialized bytes=189
11/12/31 16:07:53 INFO mapred.JobClient: Combine output records=15
11/12/31 16:07:53 INFO mapred.JobClient: Map input records=15
11/12/31 16:07:53 INFO mapred.JobClient: Reduce shuffle bytes=0
11/12/31 16:07:53 INFO mapred.JobClient: Reduce output records=15
11/12/31 16:07:53 INFO mapred.JobClient: Spilled Records=30
11/12/31 16:07:53 INFO mapred.JobClient: Map output bytes=153
11/12/31 16:07:53 INFO mapred.JobClient: Combine input records=15
11/12/31 16:07:53 INFO mapred.JobClient: Map output records=15
11/12/31 16:07:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=126
11/12/31 16:07:53 INFO mapred.JobClient: Reduce input records=15

 

Finally, you can open the output folder /user/avkash/outputfolder and read the Word Count results.
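For example, you can dump the results from the Hadoop command shell with hadoop fs, assuming the single reducer wrote the default part file name:

hadoop fs -cat /user/avkash/outputfolder/part-r-00000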

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce
