HDInsight (Hadoop on Azure) Demo: Submit a MapReduce job, process the results with Pig, and filter the final results in Hive

In this demo we will submit a WordCount MapReduce job to an HDInsight cluster, process the results in Pig, and then filter the results in Hive by storing the structured results into a table.

Step 1: Submitting the WordCount MapReduce job to a 4-node HDInsight cluster:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount

(screenshot: WordCount MapReduce job console output)

The results are stored at /user/admin/outcount.

Verify the results in the Interactive Shell:

js> #ls /user/admin/outcount
Found 2 items
-rwxrwxrwx 1 admin supergroup 0 2013-03-28 05:22 /user/admin/outcount/_SUCCESS
-rwxrwxrwx 1 admin supergroup 337623 2013-03-28 05:22 /user/admin/outcount/part-r-00000

Step 2: Loading the /user/admin/outcount/part-r-00000 results into Pig:

First we load the flat text file data in (words, wordCount) format, as below:

Grunt>mydata = load '/user/admin/outcount/part-r-00000' using PigStorage('\t') as (words:chararray, wordCount:int);

Grunt>first10 = LIMIT mydata 10;

Grunt>dump first10;

(screenshot: dump of first10 showing words with frequency 1)

Note: This shows results for words with frequency 1. We need to reorder the results in descending order to get the words with the highest frequency.

Grunt>mydatadsc = order mydata by wordCount DESC;

Grunt>first10 = LIMIT mydatadsc 10;

Grunt>dump first10;

(screenshot: dump of first10 showing the top 10 words by frequency)

Now we have the results as expected. Let's store the results into a file in HDFS.

Grunt>Store first10 into '/user/admin/myresults10';
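One thing to watch: with the default PigStorage, STORE writes tab-delimited output, while the Hive table created in Step 3 below expects comma-separated fields. If you run into that mismatch, one option (a sketch, not part of the original run) is to store with an explicit delimiter instead, and then list the output directory to confirm the part file was written:

Grunt>STORE first10 INTO '/user/admin/myresults10' USING PigStorage(',');

Grunt>fs -ls /user/admin/myresults10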

Step 3: Filtering the Pig results into a Hive table:

First we will create a table in Hive using the same format (words and wordcount, separated by a comma):

hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by ',' lines terminated by '\n';

Now that the table is created, we will load the file '/user/admin/myresults10/part-r-00000' into the wordslist10 table we just created:

hive> load data inpath '/user/admin/myresults10/part-r-00000' overwrite into table wordslist10;

That's all. As you can see, the results are now in the table:

hive> select * from wordslist10;

(screenshot: rows returned by the select on wordslist10)

Keywords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, BigData

Understanding HBase tables and HDFS file structure in Hadoop

Learn more on HBase here: http://hbase.apache.org/book.html

Let's create an HBase table first and add some data to it.

[cloudera@localhost ~]$ hbase shell
13/03/27 00:04:31 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.2-cdh4.2.0, rUnknown, Fri Feb 15 11:51:18 PST 2013

hbase(main):001:0> create 'students', 'name'
0 row(s) in 2.5020 seconds

=> Hbase::Table - students
hbase(main):002:0> list ‘students’
TABLE
students
1 row(s) in 0.0540 seconds

=> [“students”]
hbase(main):003:0> put 'students', 'row1', 'name:id1', 'John'
0 row(s) in 0.0400 seconds

hbase(main):004:0> put 'students', 'row2', 'name:id2', 'Jim'
0 row(s) in 0.0070 seconds

hbase(main):005:0> put 'students', 'row3', 'name:id3', 'Will'
0 row(s) in 0.0070 seconds

hbase(main):006:0> put 'students', 'row4', 'name:id4', 'Henry'
0 row(s) in 0.0040 seconds

hbase(main):007:0> put 'students', 'row5', 'name:id5', 'Ken'
0 row(s) in 0.0440 seconds

hbase(main):008:0> scan 'students'
ROW COLUMN+CELL
row1 column=name:id1, timestamp=1364357135479, value=John
row2 column=name:id2, timestamp=1364357147587, value=Jim
row3 column=name:id3, timestamp=1364357161684, value=Will
row4 column=name:id4, timestamp=1364357173959, value=Henry
row5 column=name:id5, timestamp=1364357189836, value=Ken
5 row(s) in 0.0450 seconds

As you can see above, we have the students table, and only that one table name is returned when running the list command.
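As a quick sanity check (not part of the original session), a single row can also be read back with the get command:

hbase(main):009:0> get 'students', 'row1'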

In my Hadoop cluster, HBase is configured to use the /hbase folder, so let's check the disk utilization in /hbase:

[cloudera@localhost ~]$ hdfs dfs -du /hbase
2868 /hbase/-ROOT-
245 /hbase/.META.
0 /hbase/.archive
0 /hbase/.corrupt
2424 /hbase/.logs
0 /hbase/.oldlogs
0 /hbase/.tmp
38 /hbase/hbase.id
3 /hbase/hbase.version
928 /hbase/students

Above, the students table is a user table, while -ROOT- and .META. are HBase catalog tables. These are internal tables in which HBase keeps catalog information about the user tables. To understand each table's structure, let's run the describe command:

hbase(main):010:0> describe '-ROOT-'
{NAME => '-ROOT-', IS_ROOT => 'true', IS_META => 'true', FAMILIES => [{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true', IN_MEMORY => 'true', BLOCKCACHE => 'true'}]} (ENABLED: true)
1 row(s) in 0.0700 seconds

hbase(main):011:0> describe '.META.'
{NAME => '.META.', IS_META => 'true', FAMILIES => [{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true', IN_MEMORY => 'true', BLOCKCACHE => 'true'}]} (ENABLED: true)
1 row(s) in 0.0470 seconds

hbase(main):012:0> describe 'students'
{NAME => 'students', FAMILIES => [{NAME => 'name', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]} (ENABLED: true)
1 row(s) in 0.0270 seconds

Now we can check the file structure for the user table 'students' as below:

[cloudera@localhost ~]$ hdfs dfs -du /hbase/students
697 /hbase/students/.tableinfo.0000000001
0 /hbase/students/.tmp
231 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14

We can also check the structure of the HBase system tables:

[cloudera@localhost ~]$ hdfs dfs -du /hbase/-ROOT-
727 /hbase/-ROOT-/.tableinfo.0000000001
0 /hbase/-ROOT-/.tmp
2141 /hbase/-ROOT-/70236052
[cloudera@localhost ~]$ hdfs dfs -du /hbase/.META.
245 /hbase/.META./1028785192

Now if we dig further into the file structure of the user table students, we can learn about its region and .regioninfo file, as below:

[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students
Found 3 items
-rw-r--r-- 1 hbase supergroup 697 2013-03-27 00:04 /hbase/students/.tableinfo.0000000001
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/.tmp
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14
[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/.tmp

Here we can see the .regioninfo details for the table 'students':

[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/b2cd87df288adbb7e9ff2423ca532e14
Found 2 items
-rw-r--r-- 1 hbase supergroup 231 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/.regioninfo
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/name

[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/name

[cloudera@localhost ~]$ hdfs dfs -cat /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/.regioninfo
=�&:9students,,1364357097018.b2cd87df288adbb7e9ff2423ca532e14students�}�

{NAME => 'students,,1364357097018.b2cd87df288adbb7e9ff2423ca532e14.', STARTKEY => '', ENDKEY => '', ENCODED => b2cd87df288adbb7e9ff2423ca532e14,}[cloudera@localhost ~]$

This is how we can learn more about the HDFS layout of HBase user tables.
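If you only want the total on-disk footprint of a table rather than the per-file breakdown, the du summary flag adds it up for you (a quick sketch; on older shells the equivalent option is -dus):

[cloudera@localhost ~]$ hdfs dfs -du -s /hbase/students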

Keywords: Hadoop, HBase, Regions, RegionServer, Catalog, Cloudera, HDFS

Understanding the HDFS cluster FSImage and looking at its contents

FSImage in HDFS Cluster:

  • The HDFS namespace is stored by the NameNode.
  • The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
  • The FsImage is stored as a file in the NameNode’s local file system too.
  • The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
  • The NameNode reads the FsImage at startup only and keeps it in memory for the lifetime of the process.
    • This creates a direct dependency on NameNode RAM.
    • Consider "one large file" versus "1,000,000 small files" (see the worked example below).
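As a rough worked example (using the commonly cited rule of thumb that every file, directory, and block object costs the NameNode on the order of 150 bytes of heap): one 1 GB file stored as eight 128 MB blocks needs roughly 9 namespace objects, or about 1.3 KB of heap, while 1,000,000 small files need at least 1,000,000 file objects plus 1,000,000 block objects, roughly 2,000,000 objects or around 300 MB of heap. The numbers are approximate, but they show why many small files put far more pressure on the NameNode than one large file of the same total size.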

Step 1: First, save the fsimage to the local file system (note: you cannot save the fsimage to HDFS).

Step 2: Convert the saved image into a human-readable file using the oiv (Offline Image Viewer) option.

Step 3: Now you can open the file in your favorite editor, or use head or tail for a quick peek.
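A minimal sketch of those three steps on a Hadoop 2.x cluster is below. The local directory and the fsimage file name are placeholders (your fsimage file name will differ), -fetchImage requires HDFS administrator privileges, and the oiv processors available vary by release:

$ hdfs dfsadmin -fetchImage /tmp/fsimage_dump                                            # Step 1: pull the latest fsimage from the NameNode
$ hdfs oiv -i /tmp/fsimage_dump/fsimage_0000000000000001234 -o /tmp/fsimage.xml -p XML   # Step 2: convert it with the Offline Image Viewer
$ head -n 20 /tmp/fsimage.xml                                                            # Step 3: peek at the converted file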

(screenshot: contents of the converted fsimage file)

List of machine learning algorithms supported in Mahout 0.7

Here is a list of Mahout built-in algorithms available in 0.7:

  •   arff.vector: : Generate Vectors from an ARFF file or directory
  •   baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  •   canopy: : Canopy clustering
  •   cat: : Print a file or resource as the logistic regression models would see it
  •   cleansvd: : Cleanup and verification of SVD output
  •   clusterdump: : Dump cluster output to text
  •   clusterpp: : Groups Clustering Output In Clusters
  •   cmdump: : Dump confusion matrix in HTML or text formats
  •   cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  •   cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  •   dirichlet: : Dirichlet Clustering
  •   eigencuts: : Eigencuts spectral clustering
  •   evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  •   fkmeans: : Fuzzy K-means clustering
  •   fpg: : Frequent Pattern Growth
  •   hmmpredict: : Generate random sequence of observations by given HMM
  •   itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  •   kmeans: : K-means clustering
  •   lucene.vector: : Generate Vectors from a Lucene index
  •   matrixdump: : Dump matrix in CSV format
  •   matrixmult: : Take the product of two matrices
  •   meanshift: : Mean Shift clustering
  •   minhash: : Run Minhash clustering
  •   parallelALS: : ALS-WR factorization of a rating matrix
  •   recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  •   recommenditembased: : Compute recommendations using item-based collaborative filtering
  •   regexconverter: : Convert text files on a per line basis based on regular expressions
  •   rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  •   rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  •   runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  •   runlogistic: : Run a logistic regression model against CSV data
  •   seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  •   seq2sparse: : Sparse Vector generation from Text sequence files
  •   seqdirectory: : Generate sequence files (of Text) from a directory
  •   seqdumper: : Generic Sequence File dumper
  •   seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  •   seqwiki: : Wikipedia xml dump to sequence file
  •   spectralkmeans: : Spectral k-means clustering
  •   split: : Split Input data into test and train sets
  •   splitDataset: : split a rating dataset into training and probe parts
  •   ssvd: : Stochastic SVD
  •   svd: : Lanczos Singular Value Decomposition
  •   testnb: : Test the Vector-based Bayes classifier
  •   trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  •   trainlogistic: : Train a logistic regression using stochastic gradient descent
  •   trainnb: : Train the Vector-based Bayes classifier
  •   transpose: : Take the transpose of a matrix
  •   validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  •   vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  •   vectordump: : Dump vectors from a sequence file to text
  •   viterbi: : Viterbi decoding of hidden states from given output states sequence

To use any of the above, run mahout followed by the algorithm name and you will see more details on how to use it:

$mahout kmeans

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths>              comma separated archives to be unarchived
on the compute machines.
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-files <paths>                 comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-libjars <paths>               comma separated jar files to include in the classpath.
-tokenCacheFile <tokensFile>   name of the file with the tokens

Missing required option --clusters

Usage:
[--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a
SequenceFile of Writable, Cluster/Canopy.  If k is
also specified, then a random set of vectors will
be selected and written out to this path first
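As a rough, hypothetical end-to-end sketch (the HDFS paths below are placeholders and the exact options can vary between Mahout releases): convert a directory of text documents into sequence files, vectorize them, and then run k-means with randomly seeded centroids:

$mahout seqdirectory -i /user/demo/docs -o /user/demo/docs-seq
$mahout seq2sparse -i /user/demo/docs-seq -o /user/demo/vectors
$mahout kmeans -i /user/demo/vectors/tfidf-vectors -c /user/demo/centroids -o /user/demo/kmeans-output -k 10 -x 20 -cl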

Finding Hadoop specific processes running in a Hadoop Cluster

Recently I was asked to provide information about all the Hadoop-specific processes running in a Hadoop cluster. I decided to run a few commands, shown below, to gather that information.

Hadoop 2.0.x on Linux (CentOS 6.3) – Single Node Cluster

First, list all the Java processes running in the cluster:

[cloudera@localhost usr]$ ps -A | grep java
1768 ?        00:00:28 java
2197 ?        00:00:54 java
2439 ?        00:00:30 java
2507 ?        00:01:19 java
2654 ?        00:00:35 java
2784 ?        00:00:52 java
2911 ?        00:00:56 java
3028 ?        00:00:31 java
3239 ?        00:00:59 java
3344 ?        00:01:11 java
3446 ?        00:00:27 java
3551 ?        00:00:30 java
3644 ?        00:00:22 java
3878 ?        00:01:08 java
4142 ?        00:02:16 java
4201 ?        00:00:36 java
4223 ?        00:00:25 java
4259 ?        00:00:21 java
4364 ?        00:00:29 java
4497 ?        00:11:11 java
4561 ?        00:00:44 java

Next, inspect each Java process in more detail to see which Hadoop-specific application is running inside it:

[cloudera@localhost usr]$ ps -aef | grep java

499       1768     1  0 08:29 ?        00:00:29 /usr/java/jdk1.6.0_31/bin/java -Dzookeeper.datadir.autocreate=false -Dzookeeper.log.dir=/var/log/zookeeper -********

yarn 2197 1 0 08:29 ? 00:00:55 /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn ********

sqoop2 2439 1 0 08:29 ? 00:00:31 /usr/java/jdk1.6.0_31/bin/java -Djava.util.logging.config.file=/usr/lib/sqoop2/sqoop-server/conf/logging.properties -Dsqoop.config.dir=/etc/sqoop2/conf ****************

yarn 2507 1 0 08:29 ? 00:01:21 /usr/java/jdk1.6.0_31/bin/java -Dproc_nodemanager -Xmx1000m -server -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn **********

mapred 2654 1 0 08:30 ? 00:00:36 /usr/java/jdk1.6.0_31/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=yarn-mapred-historyserver-localhost.localdomain.log -Dhadoop.home.dir=/usr/lib/hadoop ********

hdfs 2784 1 0 08:30 ? 00:00:53 /usr/java/jdk1.6.0_31/bin/java -Dproc_datanode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-datanode-localhost.localdomain.log ********

hdfs 2911 1 0 08:30 ? 00:00:57 /usr/java/jdk1.6.0_31/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-localhost.localdomain.log *********

hdfs 3028 1 0 08:30 ? 00:00:31 /usr/java/jdk1.6.0_31/bin/java -Dproc_secondarynamenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-localhost.localdomain.log -Dhadoop.home.dir=/usr/lib/hadoop ********

hbase 3239 1 0 08:31 ? 00:01:00 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -XX:+UseConcMarkSweepGC -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log *******

hbase 3344 1 0 08:31 ? 00:01:13 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -XX:+UseConcMarkSweepGC -XX:+UseConcMarkSweepGC ****

hbase 3446 1 0 08:31 ? 00:00:28 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -XX:+UseConcMarkSweepGC -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-rest-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/*******

hbase 3551 1 0 08:31 ? 00:00:31 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -XX:+UseConcMarkSweepGC -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-thrift-localhost.localdomain.log *******

flume 3644 1 0 08:31 ? 00:00:23 /usr/java/jdk1.6.0_31/bin/java -Xmx20m -cp /etc/flume-ng/conf:/usr/lib/flume-ng/lib/*:/etc/hadoop/conf:/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoop/lib/asm-3.2.jar *******

root 3865 1 0 08:31 ? 00:00:00 su mapred -s /usr/java/jdk1.6.0_31/bin/java -- -Dproc_jobtracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-0.20-mapreduce -Dhadoop.log.file=hadoop-hadoop-jobtracker-localhost.localdomain.log ********

mapred 3878 3865 0 08:31 ? 00:01:09 java -Dproc_jobtracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-0.20-mapreduce -Dhadoop.log.file=hadoop-hadoop-jobtracker-localhost.localdomain.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20-mapreduce -Dhadoop.id.str=hadoop **********

root 4139 1 0 08:31 ? 00:00:00 su mapred -s /usr/java/jdk1.6.0_31/bin/java -- -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-0.20-mapreduce -Dhadoop.log.file=hadoop-hadoop-tasktracker-localhost.localdomain.log ************

mapred 4142 4139 1 08:31 ? 00:02:19 java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-0.20-mapreduce -Dhadoop.log.file=hadoop-hadoop-tasktracker-localhost.localdomain.log ***************

httpfs 4201 1 0 08:31 ? 00:00:37 /usr/java/jdk1.6.0_31/bin/java -Djava.util.logging.config.file=/usr/lib/hadoop-httpfs/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager ******

hive 4223 1 0 08:31 ? 00:00:26 /usr/java/jdk1.6.0_31/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-metastore.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=//usr/lib/hadoop/logs *********

hive 4259 1 0 08:31 ? 00:00:22 /usr/java/jdk1.6.0_31/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server.log -Dhive.log.threshold=INFO -Dhadoop.log.dir=//usr/lib/hadoop/logs *****

hue 4364 4349 0 08:31 ? 00:00:30 /usr/java/jdk1.6.0_31/bin/java -Xmx1000m -Dlog4j.configuration=log4j.properties -Dhadoop.log.dir=//usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log *******

oozie 4497 1 6 08:31 ? 00:11:27 /usr/bin/java -Djava.util.logging.config.file=/usr/lib/oozie/oozie-server-0.20/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xmx1024m -Doozie.https.port=11443 *********

sqoop 4561 1 0 08:31 ? 00:00:45 /usr/java/jdk1.6.0_31/bin/java -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop *******

cloudera 15657 8150 0 11:26 pts/4 00:00:00 grep java

Note: The above output is trimmed, as each process prints its full classpath and other process-specific details.
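A quicker alternative, assuming the JDK's jps tool is on the PATH, is to let it print each JVM's main class name, which makes the Hadoop daemons (NameNode, DataNode, JobTracker, TaskTracker, and so on) easy to identify; run it as root so that JVMs owned by the various service accounts are visible:

[cloudera@localhost usr]$ sudo jps -l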

HDInsight On Windows – Single Node Cluster

Apache Hadoop datanode Running Automatic .\hadoop
Apache Hadoop historyserver Running Automatic .\hadoop
Apache Hadoop isotopejs Running Automatic .\hadoop
Apache Hadoop jobtracker Running Automatic .\hadoop
Apache Hadoop namenode Running Automatic .\hadoop
Apache Hadoop secondarynamenode Running Automatic .\hadoop
Apache Hadoop tasktracker Running Automatic .\hadoop
Apache Hive Derbyserver Running Automatic .\hadoop
Apache Hive hiveserver Running Automatic .\hadoop
Apache Hive hwi Running Automatic .\hadoop
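One way to produce a list like this from the command line on the Windows head node (a sketch; the service display names may vary between HDInsight builds) is to filter the started services by name:

C:\>net start | findstr /i "Apache"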

Constructing blocks and file system relationship in HDFS

I am using a 3-node Hadoop cluster running on Windows Azure HDInsight for this test.

In Hadoop we can use the fsck utility to diagnose the health of the HDFS file system, to find missing files or blocks, and to check replication and block integrity.

Let's run fsck on the root of the file system:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck /

FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:27:39 GMT 2013
……….Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:27:39 GMT 2013 in 8 milliseconds

The filesystem under path '/' is HEALTHY

Now let's list everything under the root (/) to verify the files and directories:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -lsr /

drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/apps
-rw-r--r-- 3 avkash supergroup 4608 2013-03-04 21:16 /example/apps/cat.exe
-rw-r--r-- 3 avkash supergroup 5120 2013-03-04 21:16 /example/apps/wc.exe
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data/gutenberg
-rw-r--r-- 3 avkash supergroup 1395667 2013-03-04 21:16 /example/data/gutenberg/davinci.txt
-rw-r--r-- 3 avkash supergroup 674762 2013-03-04 21:16 /example/data/gutenberg/outlineofscience.txt
-rw-r--r-- 3 avkash supergroup 1573044 2013-03-04 21:16 /example/data/gutenberg/ulysses.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred
drwx------ - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred/system
-rw------- 3 avkash supergroup 4 2013-03-04 21:15 /hdfs/tmp/mapred/system/jobtracker.info
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse/hivesampletable
-rw-r--r-- 3 avkash supergroup 5015508 2013-03-04 21:16 /hive/warehouse/hivesampletable/HiveSampleData.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp/hive-avkash
drwxrwxrwx - SYSTEM supergroup 0 2013-03-04 21:15 /uploads
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/graph
-rw-r--r-- 3 avkash supergroup 80 2013-03-04 21:16 /user/SYSTEM/graph/catepillar_star.edge
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/query
-rw-r--r-- 3 avkash supergroup 12 2013-03-04 21:16 /user/SYSTEM/query/catepillar_star_rwr.query
drwxr-xr-x - avkash supergroup 0 2013-03-05 07:37 /user/avkash
drwxr-xr-x - avkash supergroup 0 2013-03-04 23:00 /user/avkash/.Trash
-rw-r--r-- 3 avkash supergroup 543666528 2013-03-05 07:37 /user/avkash/data_w3c_large.txt

Above we found that there are 21 directories and 10 files in total. Now we can dig further to see how the 12 HDFS blocks are distributed across these files:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck / -files -blocks -racks
FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:35:44 GMT 2013
/

/example

/example/apps

/example/apps/cat.exe 4608 bytes, 1 block(s): OK

0. blk_9084981204553714951_1008 len=4608 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

/example/apps/wc.exe 5120 bytes, 1 block(s): OK
0. blk_-7951603158243426483_1009 len=5120 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/example/data

/example/data/gutenberg

/example/data/gutenberg/davinci.txt 1395667 bytes, 1 block(s): OK

0. blk_3859330889089858864_1005 len=1395667 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/example/data/gutenberg/outlineofscience.txt 674762 bytes, 1 block(s): OK

0. blk_-3790696559021810548_1006 len=674762 repl=3 [/fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010, /fd1/ud1/10.114.228.35:50010]

/example/data/gutenberg/ulysses.txt 1573044 bytes, 1 block(s): OK

0. blk_-8671592324971725227_1007 len=1573044 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/hdfs

/hdfs/tmp

/hdfs/tmp/mapred

/hdfs/tmp/mapred/system

/hdfs/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK

0. blk_5997185491433558819_1003 len=4 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/hive

/hive/warehouse

/hive/warehouse/hivesampletable

/hive/warehouse/hivesampletable/HiveSampleData.txt 5015508 bytes, 1 block(s): OK

0. blk_44873054283747216_1004 len=5015508 repl=3 [/fd1/ud1/10.114.228.35:50010,/fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/tmp

/tmp/hive-avkash

/uploads

/user

/user/SYSTEM

/user/SYSTEM/graph

/user/SYSTEM/graph/catepillar_star.edge 80 bytes, 1 block(s): OK

0. blk_-6715685143024983574_1010 len=80 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]

/user/SYSTEM/query

/user/SYSTEM/query/catepillar_star_rwr.query 12 bytes, 1 block(s): OK

0. blk_8102317020509190444_1011 len=12 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

/user/avkash

/user/avkash/.Trash

/user/avkash/data_w3c_large.txt 543666528 bytes, 3 block(s): OK

0. blk_2005027737969478969_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]

1. blk_1970119524179712436_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]

2. blk_6223000007391223944_1012 len=6795616 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]

Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:35:44 GMT 2013 in 10 milliseconds

The filesystem under path '/' is HEALTHY

Above we can verify where all 12 blocks are distributed: 9 of the files contribute one block each, and the single large file (/user/avkash/data_w3c_large.txt, 543,666,528 bytes) is split into 3 blocks, two full 268,435,456-byte blocks plus a 6,795,616-byte remainder.
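You can also point fsck at a single path to see just that file's blocks and their locations; for example, a quick sketch using the large file above:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck /user/avkash/data_w3c_large.txt -files -blocks -locations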

 

Keywords: FSCK, Hadoop, HDFS, Blocks, File System, Replications, HDInsight

Cloudera Impala Hands-on Video

Cloudera Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in (near) real time.
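For instance, once the demo VM is running, a quick aggregate query from the shell might look like this (the table name here is hypothetical, shown only to illustrate the SQL-like interface):

[cloudera@localhost ~]$ impala-shell -q "SELECT COUNT(*) FROM sample_table"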

Learn more: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Download the Impala Virtual Machine: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Impala+Demo+VM