Building Hadoop from Source on OS X

Step 1: Select your desired Hadoop branch from the list below:

https://svn.apache.org/repos/asf/hadoop/common/branches/

Step 2: Use svn to check out the source from that branch, e.g.:

$ svn co https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.0.5-alpha/ hadoop-2.0.5

Note: The above command downloads the branch-2.0.5-alpha source code into a folder named hadoop-2.0.5.

Step 3: Change your current directory to the hadoop-2.0.5 folder, which from here on is treated as the Hadoop source root folder.

Step 4: Open pom.xml and verify the hadoop-main version as shown below, to make sure this is the branch you are targeting:

<artifactId>hadoop-main</artifactId>
<version>2.0.5-alpha</version>

Step 5: Open the BUILDING.txt file and pay attention to the requirements described there, which include:

* JDK 1.6
* Maven 3.0
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.4.1+ (for MapReduce and HDFS)
* CMake 2.6 or newer (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)

Step 6: Make sure you have everything listed in Step 5; if not, use the notes below to install the required components:

  • Maven 3.0.4 works fine
  • For ProtocolBuffer, download the source from the ProtocolBuffer project site and build it:
  • $ ./configure
  • $ make
  • $ make install
  • For CMake, you can use Homebrew on OS X:
  • $ brew install cmake

Step 7: From the Hadoop source root, run the following commands in order to compile the source and build the distribution package:

  •  $ mvn -version
  •  $ mvn clean
  •  $ mvn install  -DskipTests
  •  $ mvn compile  -DskipTests
  •  $ mvn package  -DskipTests
  •  $ mvn package -Pdist -DskipTests -Dtar

Now you can go into the hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/bin folder and run the Hadoop commands (hadoop, hdfs, mapred, etc.) as below:

~/work/hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/bin$ ./hadoop version
Hadoop 2.0.5-alpha
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1511192
Compiled by hadoopworld on 2013-08-07T07:01Z
From source with checksum c8f4bd45ac25c31b815f311b32ef17
This command was run using ~/work/hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/share/hadoop/common/hadoop-common-2.0.5-alpha.jar


Constructing blocks and file system relationship in HDFS

I am using a 3-node Hadoop cluster running on Windows Azure HDInsight for this test.

In Hadoop we can use the fsck utility to diagnose the health of the HDFS file system and to find missing files or blocks and check their integrity.

Let's run fsck on the root of the file system:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck /

FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:27:39 GMT 2013
..........Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:27:39 GMT 2013 in 8 milliseconds

The filesystem under path '/' is HEALTHY

 

Now let's list the root (/) recursively to verify the files and directories:

 

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -lsr /

drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/apps
-rw-r--r-- 3 avkash supergroup 4608 2013-03-04 21:16 /example/apps/cat.exe
-rw-r--r-- 3 avkash supergroup 5120 2013-03-04 21:16 /example/apps/wc.exe
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data/gutenberg
-rw-r--r-- 3 avkash supergroup 1395667 2013-03-04 21:16 /example/data/gutenberg/davinci.txt
-rw-r--r-- 3 avkash supergroup 674762 2013-03-04 21:16 /example/data/gutenberg/outlineofscience.txt
-rw-r--r-- 3 avkash supergroup 1573044 2013-03-04 21:16 /example/data/gutenberg/ulysses.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred
drwx------ - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred/system
-rw------- 3 avkash supergroup 4 2013-03-04 21:15 /hdfs/tmp/mapred/system/jobtracker.info
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse/hivesampletable
-rw-r--r-- 3 avkash supergroup 5015508 2013-03-04 21:16 /hive/warehouse/hivesampletable/HiveSampleData.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp/hive-avkash
drwxrwxrwx - SYSTEM supergroup 0 2013-03-04 21:15 /uploads
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/graph
-rw-r--r-- 3 avkash supergroup 80 2013-03-04 21:16 /user/SYSTEM/graph/catepillar_star.edge
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/query
-rw-r--r-- 3 avkash supergroup 12 2013-03-04 21:16 /user/SYSTEM/query/catepillar_star_rwr.query
drwxr-xr-x - avkash supergroup 0 2013-03-05 07:37 /user/avkash
drwxr-xr-x - avkash supergroup 0 2013-03-04 23:00 /user/avkash/.Trash
-rw-r--r-- 3 avkash supergroup 543666528 2013-03-05 07:37 /user/avkash/data_w3c_large.txt

Above we can see that there are 21 directories and 10 files in total. Now we can dig further and check how the 12 HDFS blocks are laid out for each file:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck / -files -blocks -racks
FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:35:44 GMT 2013
/

/example

/example/apps

/example/apps/cat.exe 4608 bytes, 1 block(s): OK

0. blk_9084981204553714951_1008 len=4608 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

/example/apps/wc.exe 5120 bytes, 1 block(s): OK
0. blk_-7951603158243426483_1009 len=5120 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/example/data

/example/data/gutenberg

/example/data/gutenberg/davinci.txt 1395667 bytes, 1 block(s): OK

0. blk_3859330889089858864_1005 len=1395667 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/example/data/gutenberg/outlineofscience.txt 674762 bytes, 1 block(s): OK

0. blk_-3790696559021810548_1006 len=674762 repl=3 [/fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010, /fd1/ud1/10.114.228.35:50010]

/example/data/gutenberg/ulysses.txt 1573044 bytes, 1 block(s): OK

0. blk_-8671592324971725227_1007 len=1573044 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/hdfs

/hdfs/tmp

/hdfs/tmp/mapred

/hdfs/tmp/mapred/system

/hdfs/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK

0. blk_5997185491433558819_1003 len=4 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/hive

/hive/warehouse

/hive/warehouse/hivesampletable

/hive/warehouse/hivesampletable/HiveSampleData.txt 5015508 bytes, 1 block(s): OK

0. blk_44873054283747216_1004 len=5015508 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/tmp

/tmp/hive-avkash

/uploads

/user

/user/SYSTEM

/user/SYSTEM/graph

/user/SYSTEM/graph/catepillar_star.edge 80 bytes, 1 block(s): OK

0. blk_-6715685143024983574_1010 len=80 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/user/SYSTEM/query

/user/SYSTEM/query/catepillar_star_rwr.query 12 bytes, 1 block(s): OK

0. blk_8102317020509190444_1011 len=12 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

/user/avkash

/user/avkash/.Trash

/user/avkash/data_w3c_large.txt 543666528 bytes, 3 block(s): OK

0. blk_2005027737969478969_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]

1. blk_1970119524179712436_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]

2. blk_6223000007391223944_1012 len=6795616 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:35:44 GMT 2013 in 10 milliseconds

The filesystem under path '/' is HEALTHY

Above we can verify where all 12 blocks are distributed: 9 of the files consist of a single block each (9 blocks), and the one large file, data_w3c_large.txt, is split into 3 blocks, giving 12 blocks in total.

 

Keywords: FSCK, Hadoop, HDFS, Blocks, File System, Replication, HDInsight

MapReduce in Cloud

When you look to the cloud for a MapReduce service to process large amounts of data, I think this is what you are looking for:

  1. A collection of machines that are Hadoop/MapReduce ready and instantly available
  2. You don't want to build Hadoop (HDFS/MapReduce) instances from scratch; several IaaS services will give you hundreds of machines in the cloud, but turning them into a Hadoop cluster yourself is a nightmare
  3. You just want to hook up your data and submit MapReduce jobs immediately
  4. Being in the cloud means you want to harness the power of thousands of machines "instantly" and pay only for the CPU hours you actually consume

Here are a few options available now, each of which I tried before writing this:

Apache Hadoop on Windows Azure:
Microsoft has Hadoop/MapReduce running on Windows Azure, but it is in a limited CTP; you can submit your information and request CTP access at the link below:
https://www.hadooponazure.com/

The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.

Amazon: Elastic MapReduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/

Google Big Query:
Besides these, you can also try Google BigQuery, for which you will have to move your data into Google's proprietary storage first and then run BigQuery against it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster thanks to its column-based processing.
Google BigQuery is invitation-only, but you can certainly request access:
https://developers.google.com/bigquery/

Mortar Data:
Another option is Mortar Data, which uses Python and Pig intelligently to make it easy to write jobs and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works

Big Data in Astronomical scale HDF and HUDF

Scientists in general, and astronomers in particular, have been at the forefront when it comes to dealing with large amounts of data. These days, the “Big Data” community, as it is known, includes almost every scientific endeavor — and even you.

In fact, Big Data is not just about extremely large collections of information hidden in databases inside archives like the Barbara A. Mikulski Archive for Space Telescopes. Big Data includes the hidden data you carry with you all the time in now-ubiquitous smart phones: calendars, photographs, SMS messages, usage information and records of our current and past locations. As we live our lives, we leave behind us a “data exhaust” that tells something about ourselves.

Star-Forming Region LH 95 in the Large Magellanic Cloud

…..

In late 1995, the Hubble Space Telescope took hundreds of exposures of a seemingly empty patch of sky near the constellation of Ursa Major (the Big Dipper). The Hubble Deep Field (HDF), as it is known, uncovered a mystifying collection of about 3,000 galaxies at various stages of their evolution. Most of the galaxies were faint, and from them we began to learn a story about our Universe that had not been told before.

……

So was the HDF unique? Were we just lucky to observe a crowded but faint patch of sky? To address this question, and determine if indeed the HDF was a “lucky shot,” in 2004  Hubble took a million-second-long exposure in a similarly “empty” patch of sky: The Hubble Ultra Deep Field (HUDF). The result was even more breathtaking. Containing an estimated 10,000 galaxies, the HUDF revealed glimpses of the first galaxies as they emerge from the so-called “dark ages” — the time shortly after the Big Bang when the first stars reheated the cold, dark universe. As with the HDF, the HUDF data was made immediately available to the community, and has spawned hundreds of publications and several follow-up observations.

Read Full Article at: http://hubblesite.org/blog/2012/04/data-exhaust/

Open Source system for data mining – RapidMiner

RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.

  • Data Integration, Analytical ETL, Data Analysis, and Reporting in one single suite
  • Powerful but intuitive graphical user interface for the design of analysis processes
  • Repositories for process, data and meta data handling
  • The only solution with metadata transformation: forget trial and error and inspect results already at design time
  • The only solution that supports on-the-fly error recognition and quick fixes
  • Complete and flexible: Hundreds of data loading, data transformation, data modeling, and data visualization methods

[Screenshot: RapidMiner design perspective]

Programmatically retrieving Task ID and Unique Reducer ID in MapReduce

For each Mapper and Reducer you can get both the task attempt ID and the task ID. This can be done when you set up your Mapper, using the Context object. Likewise, when a Reducer is set up, a unique reducer number is available inside the Reducer class's setup method, and you can retrieve that as well.

There are multiple ways you can get this info:

1. Using the JobConf class:

  • JobConf.get("mapred.task.id") provides most of the information related to the Map or Reduce task, along with the attempt ID (see the sketch below).
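For illustration only, here is a minimal sketch using the old org.apache.hadoop.mapred API; the mapred.task.id property comes from the point above, while the class name and key/value types are made up for this example:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that reads its own task attempt ID from the job configuration.
public class TaskIdLoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private String taskAttemptId;

  @Override
  public void configure(JobConf job) {
    // Something like "attempt_201303042115_0001_m_000000_0"
    taskAttemptId = job.get("mapred.task.id");
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Emit the attempt ID so you can see which task processed which records.
    output.collect(new Text(taskAttemptId), new LongWritable(1));
  }
}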

2. Using the Context class, as below (see the sketch after this list):

  • To get the task attempt ID – context.getTaskAttemptID()
  • Reducer task ID – context.getTaskAttemptID().getTaskID()
  • Reducer number – context.getTaskAttemptID().getTaskID().getId()
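Below is a similar minimal sketch of the Context-based approach from the new org.apache.hadoop.mapreduce API; the class name and key/value types are again illustrative only:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.TaskID;

// Hypothetical reducer that captures its task attempt ID, task ID, and reducer number.
public class TaskIdAwareReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private int reducerNumber;

  @Override
  protected void setup(Context context) {
    TaskAttemptID attemptId = context.getTaskAttemptID(); // full task attempt ID
    TaskID taskId = attemptId.getTaskID();                // task ID without the attempt suffix
    reducerNumber = taskId.getId();                       // 0-based reducer number
    System.out.println("attempt=" + attemptId + " task=" + taskId + " reducer#=" + reducerNumber);
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    // Tag each key with the reducer number so the output shows which reducer handled it.
    context.write(new Text(key + "@" + reducerNumber), new LongWritable(sum));
  }
}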

Keywords: Hadoop, MapReduce, Job Performance, Mapper, Reducer

Resource Allocation Model in MapReduce 2.0

What was available in previous MapReduce:

  • Each node in the cluster was statically assigned the capability of running a predefined number of Map slots and a predefined number of Reduce slots.
  • The slots could not be shared between Maps and Reduces. This static allocation of slots wasn’t optimal since slot requirements vary during the MR job life cycle
  • In general there is a demand for Map slots when the job starts, as opposed to the need for Reduce slots towards the end

Key drawback in previous MapReduce:

  • In a real cluster, where jobs are submitted at arbitrary times and each has its own Map/Reduce slot requirements, achieving optimal utilization of the cluster was hard, if not impossible.

What is new in MapReduce 2.0:

  • The resource allocation model in Hadoop 0.23 addresses this deficiency (the key drawback above) by providing a more flexible resource model.
  • Resources are requested in the form of containers, where each container has a number of non-static attributes.
  • At the time of writing this blog, the only supported attribute was memory (RAM). However, the model is generic, and the intention is to add more attributes in future releases (e.g. CPU and network bandwidth).
  • In this new resource management model, only a minimum and a maximum for each attribute are defined, and Application Masters (AMs) can request containers with attribute values that are multiples of the minimum (see the sketch below).
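As a rough illustration of this request model only: the sketch below uses the AMRMClient API from later Hadoop 2.x releases (the client classes in 0.23 itself differ), and the hostname, container count, and memory size are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical Application Master fragment: asks the ResourceManager for containers
// whose memory is a multiple of the cluster's configured minimum allocation.
public class ContainerRequestSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register this (hypothetical) AM with the ResourceManager.
    rmClient.registerApplicationMaster("am-host.example.com", 0, "");

    // Ask for 5 containers of 1024 MB each; memory is the only attribute discussed
    // in the post, virtual cores were added in later releases.
    Priority priority = Priority.newInstance(0);
    Resource capability = Resource.newInstance(1024, 1);
    for (int i = 0; i < 5; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // A single allocate() heartbeat; a real AM would loop until all containers arrive.
    AllocateResponse response = rmClient.allocate(0.0f);
    for (Container allocated : response.getAllocatedContainers()) {
      System.out.println("Got container " + allocated.getId() + " with " + allocated.getResource());
    }

    rmClient.stop();
  }
}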

Credit: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/