Icons for Hadoop v1.0 by Hortonworks

Hortonworks released a set of icons today, which can be downloaded from here.

This is an unofficial set of icons for Hadoop projects, related components and resources that anyone can use to create architecture diagrams or other images. The icons are designed to be simple and flexible to assist comprehension in technical diagrams.

Foundation steps to become a leader

Recently I went through some literature that was given to me during a leadership training at my previous job. I wanted to relive the training, so I decided to work through it again completely. The content was great, and it mainly conveyed 3 things about becoming a successful leader:

  1. Translating individual potential into team potential and individual strength into team strength
    1. This is done by showing trust and confidence in the team
      1. Winning trust is not easy, and it is done by being honest
      2. Confidence in the team is created by encouraging them to do exemplary things by showing the path, not by doing it yourself
  2. Searching for hidden potential in the team and bringing it into the limelight
    1. This is done by preparing and promoting team members for bigger tasks
      1. This is achieved by thinking big and looking for the big picture
      2. After that, showing that exact picture to everyone
      3. Finally, bringing everyone together to follow the same objective
  3. Doing less and achieving BIG
    1. This is the hardest to achieve, because a leader wants to see results at his or her own scale and, to match them, starts doing everything alone, which is the end of leadership.
      1. A leader is nothing if his or her team does not see him or her as a leader, and you cannot force someone to follow your lead.
        1. This is achieved by mastering #1 and #2 above.

A leader is considered to be like the “paddle” of a boat. The “paddle” looks very small compared to the “boat”; however, it is the “paddle” that gives the boat its direction and helps it reach its destination. Everyone sees the “boat”, but no one sees the “paddle”. These days, many leaders are like the “flag” on the boat, there to show off and take the credit from the “paddle”.

zsh Shell HowTo

I would say that, being new to the Mac, I have found the zsh shell to be one of the best, and I am digging into it more every day. I thought this article would be helpful for anyone in need of a great terminal setup. The zsh shell comes with lots of themes to beautify the otherwise static terminal. I use iTerm2 on my Mac, and with zsh the combination is a match made in heaven.

I use brew on my Mac as the application installer, so using brew to install the zsh shell is the best option; however, this only works if the zsh package is available through the brew installer. If you don’t have brew on your Mac, you can install it by visiting the site below and following the instructions:

http://brew.sh/

To check whether any application or package is available through brew, you can try:

$ brew search <package_name> 

$ brew search zsh 

zsh
zsh-completions
zsh-lovers
zsh-syntax-highlighting
zshdb

As we can see above, the zsh package is available, so we can get more info about it as below:

$ brew info <package_name>

$ brew info zsh

zsh: stable 5.0.2

http://www.zsh.org/
/usr/local/Cellar/zsh/5.0.2 (1053 files, 8.7M) *
Built from source
From: https://github.com/mxcl/homebrew/commits/master/Library/Formula/zsh.rb
==> Dependencies
Required: gdbm, pcre
==> Options
--disable-etcdir
Disable the reading of Zsh rc files in /etc
==> Caveats
To use this build of Zsh as your login shell, add it to /etc/shells.

If you have administrator privileges, you must fix an Apple miss
configuration in Mac OS X 10.7 Lion by renaming /etc/zshenv to
/etc/zprofile, or Zsh will have the wrong PATH when executed
non-interactively by scripts.

Alternatively, install Zsh with /etc disabled:
brew install --disable-etcdir zsh

Add the following to your zshrc to access the online help:
unalias run-help
autoload run-help
HELPDIR=/usr/local/share/zsh/helpfiles

To install the zsh shell you can just run:

$ brew install <package_name>

$ brew install zsh

You can make sure the installation completed and zsh is installed by locating the zsh binaries on disk:

/bin/zsh
/Users/hadoopworld/Library/Logs/Homebrew/zsh
/usr/lib/zsh
/usr/lib/zsh/4.3.11/zsh
/usr/local/bin/zsh
/usr/local/Cellar/zsh
/usr/local/Cellar/zsh/5.0.2/bin/zsh
/usr/local/Cellar/zsh/5.0.2/lib/zsh
/usr/local/Cellar/zsh/5.0.2/share/zsh
/usr/local/lib/zsh
/usr/local/Library/LinkedKegs/zsh
/usr/local/opt/zsh
/usr/local/share/zsh
/usr/share/zsh

After the installation is complete, you will need to download the config files, which you can get directly from the oh-my-zsh git repo by cloning it into a specific folder. In the command below I am cloning oh-my-zsh into my work folder, into its own folder named .oh-my-zsh.

 $ git clone https://github.com/robbyrussell/oh-my-zsh.git ~/work/.oh-my-zsh

Once cloning is done, you can change the shell for a specific user on your Mac using the chsh command, as below:

$ sudo chsh -s <desired_shell_binary> <user_name>

$ sudo chsh -s /usr/local/bin/zsh my_user_name
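
Note: as the brew caveats above mention, the new shell must be listed in /etc/shells before it can be used as a login shell. A quick sketch, assuming the default Homebrew prefix of /usr/local:

$ sudo sh -c 'echo /usr/local/bin/zsh >> /etc/shells'

$ sudo chsh -s /usr/local/bin/zsh my_user_name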

Now the last step is to configure the zsh shell as you desire. First, make sure that the zsh shell resource configuration file, named .zshrc, is located in the user home folder, as below:

$ ls -lah ~/.zshrc

Next, edit the .zshrc file to reflect the correct .oh-my-zsh location as below (in your case, choose the folder where you cloned the zsh shell config):

$ vi ~/.zshrc

Edit the path for correctness

# Path to your oh-my-zsh configuration.
ZSH=$HOME/work/.oh-my-zsh

You can also set the theme of your choice as below:

ZSH_THEME="jonathan"  ## default is "robbyrussell"

Finally, source the configuration to reflect the changes, as below:

$ source ~/.zshrc

That’s all. You now have the zsh shell working with your terminal.

Space Time Crystals

From time to time, I like to think outside of work and enjoy totally random stuff, so I decided to learn more about space-time crystals and collected the following:

What is a crystal?

A crystal or crystalline solid is a solid material whose constituent atoms, molecules, or ions are arranged in an ordered pattern extending in all three spatial dimensions. (more from wiki)

What is a time crystal? (also called a space-time crystal or four-dimensional crystal)

A space-time crystal or four-dimensional crystal is a theoretical structure that is periodic in both time and space. It extends the idea of a crystal to four dimensions. (more from wiki)

Classical time crystals:

 We consider the possibility that classical dynamical systems display motion in their lowest energy state, forming a time analogue of crystalline spatial order. Challenges facing that idea are identified and overcome. We display arbitrary orbits of an angular variable as lowest-energy trajectories for nonsingular Lagrangian systems. Dynamics within orbits of broken symmetry provide a natural arena for formation of time crystals. We exhibit models of that kind, including a model with traveling density waves.

Read Full Paper here

Quantum time crystals:

Difficulties around the idea of spontaneous breaking of time translation symmetry in a closed quantum mechanical system are identified, and then overcome in a simple model. The possibility of ordering in imaginary time is also discussed.

Read Full Paper here.

 

Finding Python Module version

Sometimes you need to find out a particular module’s version in Python; here is the quickest way to do it:

  1. Launch python
  2. >>> import <module_by_name>
  3. Try any of the following ways to get the module version:
    1. >>> ModuleName.__version__
    2. >>> ModuleName.version
    3. >>> ModuleName.version_string

Here is an example of finding the pyCrypto (module name – Crypto) version:

╚═╝ § python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Crypto
>>> Crpto.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Crpto' is not defined
>>> Crypto.__version__
'2.6'
>>> quit()

Hadoop MapReduce job failure with java.io.IOException: Task process exit with nonzero status of 137

While working with Amazon EMR, you might see an exception like the one below from a failed map/reduce task:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 137. 
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

The Root Cause:

The potential problem in this case is that the Hadoop mapper or reducer tasks are killed by the Linux OS due to oversubscribed memory. Keep in mind that this is the Linux OOM killer, not a Java OOM error. In general, this issue occurs when a particular process is configured to use far more memory than the OS can provide, and because of the memory oversubscription the OS has no option left but to kill the process.

With Amazon EMR, such a job failure can happen very easily if you have configured the mapred.child.java.opts setting far too high relative to the specific EMR instance type. This is because each EMR instance type has preconfigured settings for map and reduce tasks, and a misconfiguration can lead to this problem.

An Example:

For example, suppose the EMR instance type is m1.xlarge, which has 768 MB of memory allocated for each map or reduce task in a Hadoop job, as below:

m1.xlarge

Parameter Value
HADOOP_JOBTRACKER_HEAPSIZE 6912
HADOOP_NAMENODE_HEAPSIZE 2304
HADOOP_TASKTRACKER_HEAPSIZE 384
HADOOP_DATANODE_HEAPSIZE 384
mapred.child.java.opts -Xmx768m
mapred.tasktracker.map.tasks.maximum 8
mapred.tasktracker.reduce.tasks.maximum 3

However, if in mapred-site.xml the user sets mapred.child.java.opts to a very high value, e.g. 8 GB, as below:

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx8192m</value>
</property>

The above configuration will cause the Linux OS to kill the mapper or reducer task due to very high memory oversubscription. The Linux OS would kill the same process even if it were configured to use 4 GB, because that is still far over the configured limit.

The solution:

The solution to this problem is to let your job use the default MapReduce settings instead of setting them yourself during job submission. Using the default MapReduce settings lets the jobtracker run the tasks with whatever settings the Amazon EMR instance already has configured.
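
If you do need to override the value, a safer sketch is to keep it at or below the instance-type default shown above (768 MB on m1.xlarge); the exact number below is only an illustration, not a recommendation for every workload:

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx768m</value>
</property>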

 

A visualization of how a blog article is viewed by LinkedIn users

After I had written a blog entry and shared it through LinkedIn, I found that LinkedIn has a new way to visualize how my article is viewed within my inner and extended LinkedIn circles. Here is an example:

First, LinkedIn shows how my specific article has been shared with others.

 

As displayed in the image below, you can see the blog has been viewed by 168 LinkedIn users and 2 of them liked it. Those who liked the article are from two separate groups:

Screen Shot 2013-08-19 at 1.31.50 PM

 

Now, in the image below, you can see who from my direct contacts viewed and liked the article:

 

Screen Shot 2013-08-19 at 1.31.32 PM

Next, as shown in the image below, you can see who from my extended circle liked and viewed the article:

Screen Shot 2013-08-19 at 1.31.42 PM

Finally you can see more….

Screen Shot 2013-08-19 at 1.37.21 PM

Hadoop Job and Task Name Classification and Convention

Hadoop MapReduce jobs and tasks follow a preconfigured naming convention, so during job analysis or troubleshooting you can easily understand what to look for and where.

Here is some key information regarding the naming classification and convention for Hadoop jobs and mapper/reducer tasks:

Job Name convention:

  • job_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED}_{JobID}
    • First Part – the “job” keyword, assigned to every job
    • Second Part – the full date and time when the jobtracker was started
    • Third Part – the job counter since the jobtracker was started

A task is a unit of job execution consisting of mappers and reducers. The total number of mappers and reducers is determined when a job is submitted, and based on the number of mapper and reducer slots available in the Hadoop cluster, the jobtracker hands out these tasks. There are two kinds of tasks:

  1. Mapper

    1. There are 3 kinds of mapper tasks:
      1. Work mappers – These are the actual mapper tasks which perform the real map work. The IDs of these mapper tasks start at 0 and end at (total mappers – 1).
      2. Setup mapper – This is the job setup task; despite its name, its counter is the very last one.
      3. Cleanup mapper – This is the task which cleans up the overall job; its ID comes right before the setup task’s. (See the example below to understand it clearly.)
      4. Note: Neither the setup nor the cleanup mapper is counted in the actual mapper total. Also, depending on the task count, it is possible to have more than one cleanup task.
  2. Reducer

    1. There is only one kind of reducer task.

Task Name convention: 

  • For mappers
    • task_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED}_{JobID}_m_{6-Digit-Mapper-ID}_{mapper-instance}
  • For reducers
    • task_{DATE-TIME-WHEN-JOBTRACKER-WAS-STARTED}_{JobID}_r_{6-Digit-Reducer-ID}_{reducer-instance}

Here is an Example:

  • Job ID
    • job_201307091604_1081
      • job – job
      • 201307091604 – The time when the jobtracker (i.e., the Hadoop cluster) was started
        • 2013/07/09 – Date
        • 16:04 (4:04 PM)
      • 1081 – Job ID
  • Mappers (e.g., a total of 20 -> 000000 – 000019)
    • task_201307091604_1081_m_000000_0
      • First instance of the first mapper task (ID – 000000)
    • task_201307091604_1081_m_000010_0
    • task_201307091604_1081_m_000010_1
    • task_201307091604_1081_m_000010_2
      • The above are 3 instances of the same map task (ID – 000010)
    • task_201307091604_1081_m_000019_0
      • First instance of the last mapper task (ID – 000019)
  • Reducers (Total 6)
    • task_201307091604_1081_r_000000_0
      • First Instance of first reducer task (ID – 000)
    • task_201307091604_1081_r_000005_0
      • First instance of 6th reducer task (ID – 005)
  • Besides the above, there are 2 more mapper tasks added to every job:
    • Setup task
      • Even though it is the setup task, its task counter is the very last one
    • Cleanup task
      • This task’s ID will be “LAST – 1”
    • For example, if you have a total of 20 mappers, then the setup task ID will be 21 and the cleanup task ID will be 20.
      • 0 – 19 – total 20 mappers
      • 20 – cleanup task
      • 21 – setup task
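
As a quick illustration of the convention above, you can split a task name into its fields with a simple shell one-liner (the task name is the same example used above; the field labels are mine):

$ echo "task_201307091604_1081_m_000010_2" | awk -F'_' '{print "jobtracker-start:", $2, "job-id:", $3, "type:", $4, "task-id:", $5, "attempt:", $6}'

jobtracker-start: 201307091604 job-id: 1081 type: m task-id: 000010 attempt: 2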

12 key steps to keep your Hadoop cluster running strong and performing optimally

1. Hadoop deployment on 64bit OS:

  • A 32-bit OS has a 3 GB limit on Java heap size, so make sure that the Hadoop namenode/datanode machines are running a 64-bit OS.

2. Mapper and Reducer count setup:

This is a cluster-specific value and reflects the total number of mappers and reducers per tasktracker.

conf/mapred-site.xml | mapred.tasktracker.map.tasks.maximum | N | The maximum number of map task slots to run simultaneously
conf/mapred-site.xml | mapred.tasktracker.reduce.tasks.maximum | N | The maximum number of reduce task slots to run simultaneously

If no value is set, the default is 2, and -1 specifies that the number of map/reduce task slots is based on the total amount of memory reserved for MapReduce by the sysadmin.

To set this value you need to take the tasktracker’s CPU (with or without hyper-threading), disk, and memory into account, along with how CPU-intensive your job is on a scale of 1-10. For example, if the tasktracker is a quad-core CPU with hyper-threading, then there are 4 physical and 4 virtual cores, 8 CPUs in total. For a highly CPU-intensive job we might assign 4 mapper and 4 reducer slots, whereas for a far less CPU-intensive job we could have up to 40 mappers and 40 reducers. The mapper and reducer counts do not need to be equal; it all depends on how the jobs are written. We could also have 6 mappers and 2 reducers, depending on how much work is done by each mapper and reducer, and to get this information we can look at the job-specific counters. The number of mappers and reducers per tasktracker depends on the CPU utilization per task. You can also look at each reduce task’s counters to see how long the CPU was utilized out of the total map/reduce task time. If there is a long wait, you may need to reduce the count; if everything finishes very quickly, that suggests you can add to either the mapper or reducer count per tasktracker.

Users must understand that a mapper count larger than the number of physical CPU cores will result in CPU context switching, which may slow overall job completion. A configuration balanced per CPU, however, may result in faster job completion.
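
As a minimal mapred-site.xml sketch, here is how the two slot counts are set; the values (4 maps and 4 reduces, matching the CPU-intensive quad-core example above) are purely illustrative:

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>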

 

3. Per Task JVM Memory Configuration:

This particular memory configuration is important to set based on the total RAM in each tasktracker.

conf/mapred-site.xml | mapred.child.java.opts | -Xmx{YOUR_Value}M | Larger heap-size for child JVMs of maps/reduces.

 

The value of the above parameter depends on the total number of mapper and reducer tasks per tasktracker, so you must know those two numbers before setting it. Here are a few ways to calculate proper values for these parameters:

  • Let’s say there are 4 mappers and 4 reducers per tasktracker, with 32 GB total RAM in each machine
    • In this scenario there will be a total of 8 tasks running on any tasktracker
    • Assuming about 2-4 GB RAM is required by the tasktracker for its other work, there is roughly ~28 GB RAM available for Hadoop tasks
    • Now we can divide 28/8 and get 3.5 GB RAM per task
    • The value in this case will be -Xmx3500M (see the mapred-site.xml sketch after this list)
  • Let’s say there are 8 mappers and 4 reducers per tasktracker, with 32 GB total RAM
    • In this scenario there will be a total of 12 tasks running on any tasktracker
    • Assuming about 2-4 GB RAM is required by the tasktracker for its other work, there is roughly ~28 GB RAM available for Hadoop tasks
    • Now we can divide 28/12 and get about 2.33 GB RAM per task
    • The value in this case will be -Xmx2300M
  • Let’s say there are 12 mappers and 8 reducers per tasktracker, with 128 GB total RAM, and one specific node is also acting as the secondary namenode
    • It is not recommended to keep the secondary namenode with a datanode/tasktracker; however, for the sake of the calculation we will keep it here.
    • In this scenario there will be a total of 20 tasks running on any tasktracker
    • Assuming about 8 GB RAM is required by the secondary namenode, another 4 GB for other jobs, and some extra headroom, there is roughly ~100 GB RAM available for Hadoop tasks
    • Now we can divide 100/20 and get 5 GB RAM per task
    • The value in this case will be around -Xmx5000M
  • Note:
    • HDP 1.2 has some new JVM-specific configuration options which can be used for much more granular memory settings.
    • If the Hadoop cluster does not have machines with identical memory (e.g., a mix of 32 GB and 64 GB RAM machines), then use the lower memory configuration as the baseline.
    • It is always best to leave ~20% of memory for other processes.
    • Do not overcommit memory across the total tasks; it will surely cause JVM OOM errors.
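
For the first scenario above (4 mappers + 4 reducers on a 32 GB machine), the corresponding mapred-site.xml entry would look like this sketch:

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx3500M</value>
</property>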

4. Setting mapper or reducer memory limit to unlimited:

Setting both mapred.job.{map|reduce}.memory.mb values to -1 (or the maximum) helps MapReduce jobs use the maximum amount of memory available.

mapred.job.map.memory.mb | -1 | This property’s value sets the virtual memory size of a single map task for the job.
mapred.job.reduce.memory.mb | -1 | This property’s value sets the virtual memory size of a single reduce task for the job.
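
The corresponding mapred-site.xml sketch for both properties:

<property>
<name>mapred.job.map.memory.mb</name>
<value>-1</value>
</property>
<property>
<name>mapred.job.reduce.memory.mb</name>
<value>-1</value>
</property>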

 

5. Setting No limit (or Maximum) for total number of tasks per job:

Setting this value to a specific limit puts constraints on MapReduce job completion and performance. It is best to set it to -1 so jobs can use the maximum available.

mapred.jobtracker.maxtasks.per.job | -1 | Set this property’s value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum.

6. Memory configuration for sorting data within processes:

There are two values, io.sort.factor and io.sort.mb, in this segment. Based on experience, io.sort.mb should be 25-30% of the mapred.child.java.opts value.

conf/core-site.xml | io.sort.factor | 100 | More streams merged at once while sorting files.
conf/core-site.xml | io.sort.mb | NNN | Higher memory-limit while sorting data.

So, for example, if mapred.child.java.opts is 2 GB, io.sort.mb can be 500 MB, and if mapred.child.java.opts is 3.5 GB, then io.sort.mb can be 768 MB.

Also, after running a few MapReduce jobs, analyzing the log messages will help you determine a better setting for the io.sort.mb memory size. Keep in mind that a low io.sort.mb causes a lot more time to be spent in the sort procedure, while a value that is too high may cause job failures.
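
As a sketch, following the 25-30% guideline above with a 3.5 GB mapred.child.java.opts, the sort settings could look like this (768 is in MB and is only an illustrative value):

<property>
<name>io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>io.sort.mb</name>
<value>768</value>
</property>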

 

7. Reducer Parallel copies configuration:

A large number of parallel copies causes high memory utilization and can lead to Java heap errors, while a small number causes slow job completion. Keeping this value at an optimum helps MapReduce jobs complete faster.

conf/mapred-site.xml | mapred.reduce.parallel.copies | 20 | The default number of parallel transfers run by reduce during the copy (shuffle) phase. A higher number of parallel copies lets reduces fetch outputs from a very large number of maps.

This value is very much network specific. A larger value means higher network activity between tasktrackers. With more parallel reduce copies, reducers create many network connections which congest the network in a Hadoop cluster. A lower number helps maintain stable network connectivity in a Hadoop cluster. Users should choose this number depending on their network strength. I think the recommended value can be between 12-18 on a gigabit network.
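
A sketch for a gigabit network, picking 15 as an illustrative value from the 12-18 range suggested above:

<property>
<name>mapred.reduce.parallel.copies</name>
<value>15</value>
</property>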

 

8. Setting Reducer Input limit to maximum:

Sometimes setting a low limit on the reducer input size can cause job failures. It is best to set the reducer input limit to the maximum (i.e., no limit).

conf/mapred-site.xml | mapreduce.reduce.input.limit | -1 | The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job is failed. A value of -1 means that there is no limit set.

This value relates to the disk size and available space on the tasktracker. So in a cluster where datanodes vary in configured disk space, setting a specific value may cause job failures. Setting this value to -1 lets reducers work based on the available space.
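
The corresponding mapred-site.xml sketch:

<property>
<name>mapreduce.reduce.input.limit</name>
<value>-1</value>
</property>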

 

9. Setting Map input split size:

During MapReduce job execution, map tasks are created per input split. Setting the minimum split size to 0 lets the jobtracker decide the split size based on the data source.

mapred.min.split.size | 0 | The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting.

10. Setting HDFS block size:

  • Currently I have seen various Hadoop clusters running great with a variety of HDFS block sizes.
  • A user can set dfs.block.size in hdfs-site.xml anywhere between 64 MB and 1 GB or more (see the sketch below).
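
A minimal hdfs-site.xml sketch; 128 MB (134217728 bytes, since the value is given in bytes) is just one common choice within the range mentioned above:

<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>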

11. Setting user priority to “High” in a Hadoop cluster:

  • In Hadoop clusters, jobs are submitted based on the user’s priority if a certain type of job scheduler is configured
  • If a Hadoop user has lower priority, the mapper and reducer tasks will have to wait longer to get task slots on the tasktrackers, which ultimately leads to longer MapReduce jobs.
    • In some cases a timeout could occur and the MapReduce job may fail
  • If a job scheduler is configured, submitting the job as a user with high job-scheduling priority will result in faster job completion in the Hadoop cluster.

 

12. Secondary Namenode or Highly Available Namenode Configuration:

  • Having a secondary namenode or a highly available namenode helps keep the Hadoop cluster always/highly available.
  • However, I have seen some cases where the secondary namenode or HA namenode is running on a datanode, which can impact cluster performance.
  • Keeping the secondary namenode or highly available namenode separate from the datanodes/jobtracker keeps resources dedicated to the tasks assigned to the tasktracker.