Handling YARN resource manager issues with decommissioned nodes

If you hit the following exception with your YARN resource manager:

ERROR/Exception:

17/07/31 15:06:13 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes over rm1. Not retrying because try once and fail.
java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl

Troubleshooting:

Please try running the following command and you will see the exact same exception:

$ yarn node -list -all

Root Cause:

This problem happens when your YARN cluster has decommissioned nodes, and it can cause dependent applications (e.g., H2O) to fail to start.

Solution:

Make sure all the decommissioned nodes are either removed from the nodes list or added back as full-service nodes.
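A minimal sketch of the first option, assuming the stale hosts sit in the exclude file referenced by yarn.resourcemanager.nodes.exclude-path in yarn-site.xml (the file path below is illustrative):

$ vi /etc/hadoop/conf/yarn.exclude     # remove the decommissioned hosts (path is an assumption)
$ yarn rmadmin -refreshNodes           # make the ResourceManager re-read its include/exclude lists
$ yarn node -list -all                 # verify the listing now succeeds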
That’s it, enjoy!!

Setting various log levels for H2O

Setting log levels in different H2O deployment scenarios.

Standalone H2O mode (H2O on VMs, laptops…)

You can specify options -log_level and/or -log_dir:

-log_level <TRACE,DEBUG,INFO,WARN,ERRR,FATAL>
    Write messages at this logging level, or above. Default is INFO.

-log_dir <fileSystemPath>
    The directory where H2O writes logs to disk. (This usually has a good default that you need not change.)

$ java -jar h2o.jar -log_level DEBUG

H2O on Hadoop

The log level option is not directly exposed, but you can still set it by passing extra Java arguments through the -J option of the Hadoop h2odriver: “-J -log_level -J DEBUG”. Here is an example:

$ hadoop jar h2odriver.jar -J -log_level -J DEBUG -nodes 1 -mapperXmx 1g -output t/$RANDOM

Sparkling Water:

Log levels can be adjusted using two Spark conf properties: spark.ext.h2o.node.log.level for the H2O compute nodes and spark.ext.h2o.client.log.level for the H2O client running in Sparkling Water’s driver program.

$ bin/sparkling-shell --conf spark.ext.h2o.node.log.level=DEBUG --conf spark.ext.h2o.client.log.level=DEBUG

Free ebook: Introducing Microsoft Azure HDInsight

New Free eBook by Microsoft Press:

Microsoft Press is thrilled to share another new free ebook with you: Introducing Microsoft Azure HDInsight, by Avkash Chauhan, Valentine Fontama, Michele Hart, Wee Hyong Tok, and Buck Woody.

Introduction (excerpt)

Microsoft Azure HDInsight is Microsoft’s 100 percent compliant distribution of Apache Hadoop on Microsoft Azure. This means that standard Hadoop concepts and technologies apply, so learning the Hadoop stack helps you learn the HDInsight service. At the time of this writing, HDInsight (version 3.0) uses Hadoop version 2.2 and Hortonworks Data Platform 2.0.

In Introducing Microsoft Azure HDInsight, we cover what big data really means, how you can use it to your advantage in your company or organization, and one of the services you can use to do that quickly—specifically, Microsoft’s HDInsight service. We start with an overview of big data and Hadoop, but we don’t emphasize only concepts in this book—we want you to jump in and get your hands dirty working with HDInsight in a practical way. To help you learn and even implement HDInsight right away, we focus on a specific use case that applies to almost any organization and demonstrate a process that you can follow along with.

We also help you learn more. In the last chapter, we look ahead at the future of HDInsight and give you recommendations for self-learning so that you can dive deeper into important concepts and round out your education on working with big data.

Here are the download links:

Download the PDF (6.37 MB; 130 pages) from http://aka.ms/IntroHDInsight/PDF

Download the EPUB (8.46 MB) from http://aka.ms/IntroHDInsight/EPUB

Download the MOBI (12.8 MB) from http://aka.ms/IntroHDInsight/MOBI

Download the code samples (6.83 KB) from http://aka.ms/IntroHDInsight/CompContent

Troubleshooting Cloudera Manager installation and startup issues

After Cloudera Manager is installed and running, you can access its web UI at <Your_IP_Address>:7180. If you cannot reach it, here are a few ways to troubleshoot the problem. These details are especially helpful when you are installing Cloudera Hadoop on a remote machine to which you only have shell access over SSH.

Verify that Cloudera Manager is running:

[root@cloudera-master ~]# service cloudera-scm-server status
cloudera-scm-server (pid 4652) is running…

Now let’s check the Cloudera Manager processes for further verification:

[root@cloudera-master ~]# ps -ef | grep cloudera-scm
498        977     1  0 16:31 ?        00:00:01 /usr/bin/postgres -D /var/lib/cloudera-scm-server-db/data
root      3729     1  0 16:59 pts/0    00:00:00 su cloudera-scm -s /bin/bash -c nohup /usr/sbin/cmf-server
498       3731  3729 53 16:59 ?        00:00:47 /usr/java/jdk1.6.0_31/bin/java -cp .:lib/*:/usr/share/java/mysql-connector-java.jar -Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dcmf.root.logger=INFO,LOGFILE -Dcmf.log.dir=/var/log/cloudera-scm-server -Dcmf.log.file=cloudera-scm-server.log -Dcmf.jetty.threshhold=WARN -Dcmf.schema.dir=/usr/share/cmf/schema -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dpython.home=/usr/share/cmf -Xmx2G -XX:MaxPermSize=256m com.cloudera.server.cmf.Main
root      3835  1180  0 17:00 pts/0    00:00:00 grep cloudera-scm

You can find the Cloudera Manager logs in the directory shown on the command line above (-Dcmf.log.dir):

[root@cloudera-master ~]# tail -5 /var/log/cloudera-scm-server/cloudera-scm-server.log 
2014-02-06 16:59:29,596  INFO [WebServerImpl:cmon.JobDetailGatekeeper@71] ActivityMonitor configured to allow job details for all jobs.
2014-02-06 16:59:29,597  INFO [WebServerImpl:cmon.JobDetailGatekeeper@71] ActivityMonitor configured to allow job details for all jobs.
2014-02-06 16:59:29,601  INFO [WebServerImpl:mortbay.log@67] jetty-6.1.26.cloudera.2
2014-02-06 16:59:29,605  INFO [WebServerImpl:mortbay.log@67] Started SelectChannelConnector@0.0.0.0:7180
2014-02-06 16:59:29,606  INFO [WebServerImpl:cmf.WebServerImpl@280] Started Jetty server.

Let’s disable the Linux machine’s firewall to see whether that is what blocks access:

First, make sure the firewall is actually what you need to fiddle with; below I am disabling the firewall on my Linux machine:

[root@cloudera-master ~]# /etc/init.d/iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:[  OK  ]
[root@cloudera-master ~]# /etc/init.d/iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@cloudera-master ~]# /etc/init.d/ip6tables save
ip6tables: Saving firewall rules to /etc/sysconfig/ip6table[  OK  ]
[root@cloudera-master ~]# /etc/init.d/ip6tables stop
ip6tables: Setting chains to policy ACCEPT: filter         [  OK  ]
ip6tables: Flushing firewall rules:                        [  OK  ]
ip6tables: Unloading modules:                              [  OK  ]
[root@cloudera-master ~]# chkconfig 
ip6tables           0:off     1:off     2:off     3:off     4:off     5:off     6:off
iptables            0:off     1:off     2:off     3:off     4:off     5:off     6:off

Once the firewall is completely stopped, restart Cloudera Manager:

If your machines are behind a perimeter firewall, you can go ahead and disable the local firewall on the machine running Cloudera Manager. If not, open the specific ports Cloudera Manager needs, as sketched below.
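A minimal sketch that opens just the Cloudera Manager web UI port instead of disabling the firewall entirely (7180 is the admin console port; your setup may need other CM ports as well):

[root@cloudera-master ~]# iptables -I INPUT -p tcp --dport 7180 -j ACCEPT
[root@cloudera-master ~]# service iptables save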

[root@cloudera-master ~]# service --status-all | grep cloudera
cloudera-scm-server (pid  1082) is running…
/usr/bin/postgres “-D” “/var/lib/cloudera-scm-server-db/data”

[root@cloudera-master ~]# service cloudera-scm-server 
Usage: cloudera-scm-server {start|stop|restart|status}

[root@cloudera-master ~]# service cloudera-scm-server restart
Stopping cloudera-scm-server:                              [  OK  ]
Starting cloudera-scm-server:                              [  OK  ]

Note that cloudera-scm-server is listed through chkconfig; however, its runlevel status shows off:

[root@cloudera-master ~]# chkconfig 
cloudera-scm-server     0:off     1:off     2:off     3:off     4:off     5:off     6:off
cloudera-scm-server-db     0:off     1:off     2:off     3:on     4:on     5:on     6:off

Once working and accessible, the Cloudera Manager login page looks as below:

[Screenshot: Cloudera Manager login page]

Handling Hadoop Error "could only be replicated to 0 nodes, instead of 1" when copying data to HDFS or running MapReduce jobs

Sometimes when copying files to HDFS or running a MapReduce job, you might receive one of the errors below.

During a file copy to HDFS, the error and call stack look like this:

File /platfora/uploads/test.xml could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation

at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1339)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2198)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:501)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:299)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44954)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1751)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1747)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1745) 
UTC Timestamp: 11/20 04:14 AM    Version: 2.5.4-IQT-build.73

When a MapReduce job fails, the error message and call stack look like this:

DFSClient.java (line 2873) DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File ****/xyz.jar could only be replicated to 0 nodes, instead of 1

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1569)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

There are various datanode-side problems that can exhibit this issue, such as the following (a quick triage sketch follows the list):

  • Inconsistency in your datanodes
    • Restart your Hadoop cluster and see if this solves the problem.
  • Communication problems between datanodes and the namenode
    • Network issues
      • For example, if your Hadoop nodes run on EC2 instances and security settings prevent them from talking to each other, this problem can occur. Putting all nodes inside the same EC2 security group solves it.
    • Make sure that you can get datanode status from the HDFS web page or from the command line:
      • $ hadoop dfsadmin -report
  • Disk space full on a datanode
    • Verify disk space availability on each node and make sure the Hadoop logs are not warning about disk space issues.
  • Busy or unresponsive datanodes
    • Sometimes datanodes are busy scanning blocks or working on a maintenance job initiated by the namenode.
  • Invalid (e.g., negative) block size configuration
    • Check the value of dfs.block.size in hdfs-site.xml and correct it for your Hadoop configuration.
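A quick triage sequence for the checks above (a sketch; it assumes a shell on a cluster node with the Hadoop client on the PATH, and the log path varies by distribution):

$ hadoop dfsadmin -report                      # how many live datanodes does the namenode see?
$ df -h                                        # is any local disk full?
$ jps                                          # is the DataNode process actually running here?
$ tail -100 /var/log/hadoop/*datanode*.log     # log path is an assumption; check your distro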

20TB Earth Science Dataset on AWS with NASA/NEX Available to the Public

AWS has been working with the NASA Earth Exchange (NEX) team to make it easier and more efficient for researchers to access and process earth science data. The goal is to make a number of important data sets accessible to a wider audience of full-time researchers, students, and citizen scientists. This important new project is called OpenNEX. Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size (tens of terabytes). Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical.

[Image: NASA NEX Landsat US 2005 forest leaf area]

Access Dataset: s3://nasanex/NEX-DCP30
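For example, you can browse the dataset with the AWS CLI (a sketch; it assumes the bucket allows unauthenticated access, as AWS public data sets typically do):

$ aws s3 ls s3://nasanex/NEX-DCP30/ --no-sign-request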

Consult the detail page and the tech note to learn more about the provenance, format, structure, and attribution requirements.

NASA Earth Exchange (NEX):

The NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) dataset is comprised of downscaled climate scenarios for the conterminous United States that are derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across the four greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011] developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). The dataset includes downscaled projections from 33 models, as well as ensemble statistics calculated for each RCP from all model runs available. The purpose of these datasets is to provide a set of high resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.

Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run).

Website: NASA NEX

Summary

  • Short Name: NEX-DCP30
  • Version: 1
  • Format: netCDF4 classic
  • Spatial Coverage: CONUS
  • Temporal Coverage:
    • 1950 – 2005 historical or 2006 – 2099 RCP
  • Data Resolution:
    • Latitude Resolution: 30 arc second
    • Longitude Resolution: 30 arc second
    • Temporal Resolution: monthly
  • Data Size:
    • Total Dataset Size: 17 TB
    • Individual file size: 2 GB

Learn more about NEX – NASA Earth Exchange Downscaled Project

NEX Virtual Workshop: https://nex.nasa.gov/nex/projects/1328/

 

Top 20 Big Data and Analytics Startups with significant VC Funding

 

Startup              Funding (in $M)
MongoDB              231
Pivotal              210
Mu Sigma             208
Cloudera             141
Opera Solutions      114
Hortonworks          98
DataStax             83.7
Guavus               80.5
GoodData             75.5
ParAccel (Actian)    74
Talend               61.6
MapR                 61
Pentaho              60
Couchbase            56
Platfora             27.5
Datameer             18
Hadapt               16.2
Karmasphere          14.5
Databricks           14
Quantifind           11.2

 

Keywords: Big Data, Data Analytics, Infographic, Hadoop, BI

Hadoop Job and Task Name Classification and Convention

Hadoop MapReduce jobs and tasks follow a preconfigured naming convention, so during job analysis or troubleshooting you can easily tell what you are looking at and where to look.

Here is some key information about the naming classification and convention for Hadoop jobs and mapper/reducer tasks:

Job Name convention:

  • job_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED}_JobID
    • First part – the keyword “job” is assigned to every job
    • Second part – the full date and time when the JobTracker was started
    • Third part – the job counter, incremented for every job since the JobTracker was started

A task is the unit of job execution and consists of mappers and reducers. The total numbers of mappers and reducers are determined when a job is submitted, and the JobTracker hands the tasks out according to how many mapper and reducer slots are available in the Hadoop cluster. There are two kinds of tasks:

  1. Mapper

    1. There are three kinds of mapper tasks:
      1. Work mapper – the actual mapper tasks, each performing identical work. The IDs for these tasks start at 0 and end at Total – 1.
      2. Setup mapper – the job setup task; it carries the very last task ID.
      3. Cleanup mapper – the task which cleans up the overall work; its ID comes right after the last work mapper, one less than the setup task’s. (See the example below to understand it clearly.)
      Note: neither the setup nor the cleanup mapper is counted in the actual mapper total. Also, depending on the task count, it is possible to have more than one cleanup task.
  2. Reducer

    1. There is only one kind of reducer task.

Task Name convention: 

  • For mappers
    • task_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED_JobID}_m_{6-Digit-Mapper-ID}_{mapper-instance}
  • For reducers
    • task_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED_JobID}_r_{6-Digit-Reducer-ID}_{reducer-instance}

Here is an Example:

  • Job ID
    • job_201307091604_1081
      • job – the literal keyword
      • 201307091604 – the time when the JobTracker was started
        • 2013/07/09 – date
        • 16:04 (4:04 PM) – time
      • 1081 – the job counter (job ID)
  • Mappers (e.g., 20 in total -> IDs 000000 – 000019)
    • task_201307091604_1081_m_000000_0
      • First instance of the first mapper task (ID 000000)
    • task_201307091604_1081_m_000010_0
    • task_201307091604_1081_m_000010_1
    • task_201307091604_1081_m_000010_2
      • Three instances of the same map task (ID 000010)
    • task_201307091604_1081_m_000019_0
      • First instance of the last work mapper task (ID 000019)
  • Reducers (6 in total)
    • task_201307091604_1081_r_000000_0
      • First instance of the first reducer task (ID 000000)
    • task_201307091604_1081_r_000005_0
      • First instance of the 6th reducer task (ID 000005)
  • Besides the above, two more mapper tasks are added to every job:
    • Setup task
      • Even though it is the setup task, its task counter is the very last
    • Cleanup task
      • Its task ID is one less than the setup task’s
    • For example, if you have 20 work mappers in total, the setup task ID will be 21 and the cleanup task ID will be 20:
      • 0 – 19 – the 20 work mappers
      • 20 – cleanup task
      • 21 – setup task
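Given a job ID in this format, you can pull the job’s status and task-attempt IDs from the command line (MR1-era syntax, a sketch; the job ID is the example above):

$ hadoop job -status job_201307091604_1081
$ hadoop job -list-attempt-ids job_201307091604_1081 map running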

12 key steps to keep your Hadoop cluster running strong and performing optimally

1. Hadoop deployment on 64bit OS:

  • A 32-bit OS has a ~3GB limit on the Java heap size, so make sure the Hadoop namenode/datanode processes are running on a 64-bit OS.

2. Mapper and Reducer count setup:

These are cluster-specific values that set the total number of mapper and reducer slots per tasktracker.

File                   Parameter                                  Value   Notes
conf/mapred-site.xml   mapred.tasktracker.map.tasks.maximum       N       The maximum number of map task slots to run simultaneously
conf/mapred-site.xml   mapred.tasktracker.reduce.tasks.maximum    N       The maximum number of reduce task slots to run simultaneously

If no value is set, the default is 2. A value of -1 specifies that the number of map/reduce task slots is derived from the total amount of memory reserved for MapReduce by the sysadmin.

To set these values you need to take the tasktracker’s CPU (with or without hyper-threading), disk, and memory into account, along with how CPU-intensive your job is on a scale of 1-10. For example, if a tasktracker is a quad-core CPU with hyper-threading, it has 4 physical plus 4 virtual cores, 8 CPUs in total. For a highly CPU-intensive job we can safely assign 4 mapper and 4 reducer slots, while for a far less CPU-intensive job we can go up to 40 mappers and 40 reducers. The mapper and reducer counts do not need to match; it all depends on how the jobs are constructed. We could equally run 6 mappers and 2 reducers, depending on how much work each mapper and reducer does, and job-specific counters will tell you that.

The right number of mappers and reducers per tasktracker comes down to CPU utilization per task. Look at each task’s counters to see how long the CPU was actually busy relative to the total task time. If there are long waits, you may need to reduce the count; if everything finishes very fast, that is a hint you can add mapper or reducer slots per tasktracker.

Users must understand that a mapper count larger than the number of physical CPU cores will cause CPU context switching, which can slow overall job completion. A balanced per-CPU job configuration leads to faster job completion; a sketch follows.
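As a sketch, the quad-core hyper-threaded example above would translate into mapred-site.xml like this (4 map and 4 reduce slots; the values are assumptions to tune for your own hardware and workload):

<!-- conf/mapred-site.xml: slots per tasktracker (illustrative values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>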

 

3. Per Task JVM Memory Configuration:

This memory configuration should be set based on the total RAM available in each tasktracker.

File                   Parameter                Value               Notes
conf/mapred-site.xml   mapred.child.java.opts   -Xmx{YOUR_Value}M   Larger heap size for the child JVMs of maps/reduces

The value of this parameter depends on the total number of mapper and reducer tasks per tasktracker, so you must know those two numbers before setting it. Here are a few ways to calculate proper values:

  • Let’s say there are 4 mappers and 4 reducers per tasktracker, with 32GB total RAM in each machine
    • In this scenario there will be a total of 8 tasks running on any tasktracker
    • Assume about 2-4GB RAM is required by the tasktracker for its other jobs, leaving about ~28GB RAM for Hadoop tasks
    • Dividing 28/8 gives 3.5GB RAM per task
    • The value in this case will be -Xmx3500M
  • Let’s say there are 8 mappers and 4 reducers per tasktracker, with 32GB total RAM
    • In this scenario there will be a total of 12 tasks running on any tasktracker
    • Assume about 2-4GB RAM is required by the tasktracker for its other jobs, leaving about ~28GB RAM for Hadoop tasks
    • Dividing 28/12 gives roughly 2.33GB RAM per task
    • The value in this case will be -Xmx2300M
  • Let’s say there are 12 mappers and 8 reducers per tasktracker, with 128GB total RAM, and one specific node also runs the secondary namenode
    • It is not advisable to co-locate the secondary namenode with a datanode/tasktracker; we keep it here only for the sake of the calculation.
    • In this scenario there will be a total of 20 tasks running on any tasktracker
    • Assume about 8GB RAM is required by the secondary namenode and 4GB for other jobs, leaving about ~100GB RAM for Hadoop tasks
    • Dividing 100/20 gives 5GB RAM per task
    • The value in this case will be around -Xmx5000M
  • Notes:
    • HDP 1.2 has some new JVM-specific configuration that can be used for much more granular memory settings.
    • If the Hadoop cluster does not have identical machines memory-wise (e.g., a mix of 32GB and 64GB machines), use the lower memory configuration as the baseline.
    • It is always best to leave ~20% of memory for other processes.
    • Do not overcommit memory across the total task count; it will surely cause JVM OOM errors.
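Expressed as configuration, the first example above (8 tasks, ~28GB usable, 3.5GB per task) would look like this sketch:

<!-- conf/mapred-site.xml: per-task child JVM heap (value from the first example) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3500M</value>
</property>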

4. Setting mapper or reducer memory limit to unlimited:

Setting both mapred.job.{map|reduce}.memory.mb values to -1 (or to their maximum) lets MapReduce jobs use the maximum amount of memory available (see the combined sketch after step 5).

Parameter                     Value   Notes
mapred.job.map.memory.mb      -1      Sets the virtual memory size of a single map task for the job
mapred.job.reduce.memory.mb   -1      Sets the virtual memory size of a single reduce task for the job

5. Setting No limit (or Maximum) for total number of tasks per job:

Setting this value to a fixed limit puts constraints on MapReduce job completion and performance. It is best to set it to -1 so jobs can use the maximum available.

Parameter                            Value   Notes
mapred.jobtracker.maxtasks.per.job   -1      Set to any positive integer to cap the number of tasks for a single job. The default of -1 means there is no maximum.
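The “no limit” settings from steps 4 and 5 combined, as a mapred-site.xml sketch:

<!-- conf/mapred-site.xml: remove per-task memory and per-job task-count limits -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.job.reduce.memory.mb</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.jobtracker.maxtasks.per.job</name>
  <value>-1</value>
</property>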

6. Memory configuration for sorting data within processes:

There are two values in this segment, io.sort.factor and io.sort.mb. Based on experience, io.sort.mb should be 25-30% of the mapred.child.java.opts heap value.

File                 Parameter        Value   Notes
conf/core-site.xml   io.sort.factor   100     More streams merged at once while sorting files
conf/core-site.xml   io.sort.mb       NNN     Higher memory limit while sorting data

For example, if mapred.child.java.opts is 2GB, io.sort.mb can be 500MB; if mapred.child.java.opts is 3.5GB, io.sort.mb can be 768MB.

Also, after running a few MapReduce jobs, analyzing the log messages will help you determine a better setting for the io.sort.mb memory size. Be aware that a low io.sort.mb costs a lot more time in the sort procedure, while too high a value may cause job failures.
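Continuing the 3.5GB-heap example above, a core-site.xml sketch (io.sort.mb at roughly 22% of the child heap, per the guidance above; the values are illustrative):

<!-- conf/core-site.xml: sort buffers (illustrative values) -->
<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>768</value>
</property>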

 

7. Reducer Parallel copies configuration:

A large number of parallel copies causes high memory utilization and can trigger Java heap errors, while a small number slows job completion. Keeping this value at an optimum helps MapReduce jobs complete faster.

File                   Parameter                       Value   Notes
conf/mapred-site.xml   mapred.reduce.parallel.copies   20      The default number of parallel transfers run by reduce during the copy (shuffle) phase; a higher number lets reducers fetch outputs from a very large number of maps

This value is very network-specific. A larger value means more network activity between tasktrackers: with more parallel reduce copies, reducers open many network connections, which can congest the network in a Hadoop cluster, while a lower number keeps connectivity stable. Choose this number based on your network strength; in a gigabit network I think the recommended value is between 12 and 18.

 

8. Setting Reducer Input limit to maximum:

Setting a low limit on the reducer input size can cause job failures. It is best to set the reducer input limit to the maximum.

File                   Parameter                      Value   Notes
conf/mapred-site.xml   mapreduce.reduce.input.limit   -1      The limit on the input size of the reduce; if the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means there is no limit.

This value relates to disk size and available space on each tasktracker. If the datanodes in a cluster vary in configured disk space, a fixed value may cause job failures; setting it to -1 lets reducers work with whatever space is available.

 

9. Setting Map input split size:

During MapReduce job execution, map tasks are created per input split. Leaving the minimum split size at 0 lets the jobtracker decide the split size based on the data source.

Parameter               Value   Notes
mapred.min.split.size   0       The minimum size chunk that map input should be split into; file formats with minimum split sizes take priority over this setting

10. Setting HDFS block size:

  • I have seen various Hadoop clusters running great with a variety of HDFS block sizes.
  • You can set dfs.block.size in hdfs-site.xml to anywhere between 64MB and 1GB or more; a sketch follows.
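Steps 8-10 combined as a sketch (the 128MB block size is just one reasonable choice inside the 64MB-1GB range mentioned above):

<!-- conf/mapred-site.xml -->
<property>
  <name>mapreduce.reduce.input.limit</name>
  <value>-1</value>         <!-- no cap on reducer input size -->
</property>
<property>
  <name>mapred.min.split.size</name>
  <value>0</value>          <!-- let the jobtracker pick the split size -->
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128MB, illustrative -->
</property>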

11. Setting user priority to “High” in the Hadoop cluster:

  • In Hadoop clusters, jobs are submitted based on user priority when certain types of job scheduler are configured
  • If a Hadoop user has lower priority, that user’s mapper and reducer tasks wait longer for task slots on the tasktrackers, which ultimately means longer MapReduce jobs
    • In some cases a timeout could occur and the MapReduce job may fail
  • If a job scheduler is configured, submitting a job as a user with high scheduling priority results in faster job completion; see the command sketch below
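A single job’s priority can also be raised from the command line (MR1-era syntax, a sketch; the job ID is illustrative):

$ hadoop job -set-priority job_201307091604_1081 HIGH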

 

12. Secondary Namenode or Highly Available Namenode Configuration:

  • Having a secondary namenode or a highly available namenode helps keep the Hadoop cluster always/highly available.
  • However, I have seen cases where the secondary namenode or HA namenode runs on a datanode, which can hurt cluster performance.
  • Keeping the secondary namenode or HA namenode separate from the datanodes/jobtracker keeps dedicated resources available for the tasks assigned to the tasktrackers.