Tips and tricks to manage your Hadoop cluster in Windows Azure

In a Hadoop cluster, the namenode communicates with all the other nodes. Apache Hadoop on Windows Azure has the following XML files, which contain the primary settings for Hadoop:

 

C:\Apps\Dist\conf\hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <!-- This is the NAME node data directory -->
    <name>dfs.name.dir</name>
    <value>c:\hdfs\nn</value>
  </property>
  <property>
    <!-- This is the DATA node data directory -->
    <name>dfs.data.dir</name>
    <value>c:\hdfs\dn</value>
  </property>
</configuration>
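If you want to inspect these settings programmatically, a Hadoop `*-site.xml` file is easy to parse with the Python standard library. This is a minimal sketch (not part of the original article); the `read_hadoop_conf` helper and the inline sample are illustrative:

```python
# Sketch: parse a Hadoop-style <configuration> XML into a {name: value} dict.
import xml.etree.ElementTree as ET

def read_hadoop_conf(xml_text):
    """Return a dict of property name -> value from Hadoop config XML."""
    conf = {}
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        # Each <property> carries a <name> and a <value> child element.
        conf[prop.findtext("name")] = prop.findtext("value")
    return conf

# A trimmed sample in the same shape as hdfs-site.xml above.
sample = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.permissions</name><value>false</value></property>
</configuration>"""

conf = read_hadoop_conf(sample)
print(conf["dfs.replication"])  # → 3
```

The same helper works for core-site.xml and mapred-site.xml, since they share the same `<configuration>`/`<property>` layout.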

 

 

C:\Apps\Dist\conf\core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <!-- After the role starts, the VM gets an IP address, which is then included here -->
    <value>hdfs://10.26.104.45:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
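Because the namenode IP in `fs.default.name` is only known once the Azure role has started, it has to be patched into the configuration at that point. A hypothetical sketch of that substitution step — the `set_default_fs` helper and its defaults are illustrative, not from the article:

```python
# Sketch: fill in fs.default.name once the VM's IP address is known.
def set_default_fs(conf, ip, port=9000):
    """Return a copy of conf with fs.default.name pointing at ip:port."""
    patched = dict(conf)
    patched["fs.default.name"] = "hdfs://%s:%d" % (ip, port)
    return patched

# The rest of core-site.xml stays fixed; only the namenode address changes.
base = {"hadoop.tmp.dir": "/hdfs/tmp", "io.file.buffer.size": "131072"}
patched = set_default_fs(base, "10.26.104.45")
print(patched["fs.default.name"])  # → hdfs://10.26.104.45:9000
```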

 

C:\Apps\Dist\conf\mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.26.104.45:9010</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/hdfs/mapred/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapreduce.client.tasklog.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.connect.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.read.timeout</name>
    <value>600000</value>
  </property>
</configuration>
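The slot and heap settings above interact: each task slot can spawn a child JVM with the `-Xmx` from `mapred.child.java.opts`, so peak child-JVM memory on a tasktracker is roughly slots times heap. A back-of-the-envelope check (assuming every slot is busy at once, which is the worst case):

```python
# Sketch: worst-case child-JVM memory per tasktracker, assuming every
# map and reduce slot runs a child JVM at the configured -Xmx.
def peak_child_heap_mb(map_slots, reduce_slots, xmx_mb):
    return (map_slots + reduce_slots) * xmx_mb

# Values from mapred-site.xml above: 2 map slots, 1 reduce slot, -Xmx1024m.
print(peak_child_heap_mb(2, 1, 1024))  # → 3072 (MB)
```

This is worth checking before raising `-Xmx` or the slot maximums, so the VM size you pick for the worker role can actually hold the combined heaps.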

 

You can certainly make changes to the settings above; however, after doing so you would need to reformat and restart the namenode as below:

  • C:\Apps\Dist> hadoop namenode -format

 

For more commands, check the Hadoop command-line help:

c:\apps\dist>hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
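If you script cluster maintenance from the head node, it can help to build these command lines in one place before executing them. An illustrative helper (assuming the `hadoop` launcher is on the PATH; the function name is mine, not from the article) that only constructs the argument list, so it can be inspected or logged before running:

```python
# Sketch: build an argument list for the hadoop launcher shown above.
def hadoop_cmd(command, *args, config_dir=None):
    """Return argv for `hadoop [--config confdir] COMMAND [args...]`."""
    argv = ["hadoop"]
    if config_dir:
        # Matches the optional --config confdir flag in the usage text.
        argv += ["--config", config_dir]
    argv.append(command)
    argv += list(args)
    return argv

print(hadoop_cmd("namenode", "-format", config_dir=r"c:\apps\dist\conf"))
# → ['hadoop', '--config', 'c:\\apps\\dist\\conf', 'namenode', '-format']
```

The resulting list could then be passed to `subprocess.call` on a node where Hadoop is installed.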

 

 

The following Java logging configuration can also be modified; however, you would need to re-launch the Java process afterwards:

C:\Apps\Dist\conf\log4j.properties:

hadoop.log.file=hadoop.log
log4j.rootLogger=${hadoop.root.logger}, EventCounter
log4j.threshhold=ALL

#
# TaskLog Appender
#

# Default values
hadoop.tasklog.taskid=null
hadoop.tasklog.noKeepSplits=4
hadoop.tasklog.totalLogFileSize=100
hadoop.tasklog.purgeLogSplits=true
hadoop.tasklog.logsRetainHours=12

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender
log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}
log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout
log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# FSNamesystem Audit logging
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN

# Custom Logging levels
#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
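Before relaunching the Java process, it can be useful to double-check which levels and retention values are actually in effect. A small sketch (the `parse_properties` helper is mine) that reads `key=value` pairs in the log4j.properties style with the standard library:

```python
# Sketch: parse log4j.properties-style key=value pairs into a dict,
# skipping blank lines and commented-out entries.
def parse_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments such as the commented DEBUG levels above
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

# A trimmed sample in the same shape as the file above.
sample = """# Custom Logging levels
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
hadoop.tasklog.logsRetainHours=12"""

props = parse_properties(sample)
print(props["hadoop.tasklog.logsRetainHours"])  # → 12
```

Note this simple reader does not expand `${...}` placeholders (such as `${hadoop.tasklog.taskid}`); those are resolved by log4j itself at runtime.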

 

Resources:

http://hadoop.apache.org/common/docs/current/cluster_setup.html

http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
