Classification of companies in Big Data Space


Class 1: Companies that have adopted Hadoop as their Big Data strategy. Class 1 companies partner with Class 2 companies to excel in the Big Data space.

Class 2: Companies that have taken Hadoop and productized it.

Class 3: Companies that are creating products which add value to the overall Hadoop ecosystem.

Class 4: Companies that consume or provide Hadoop-based services to other companies on a smaller scale compared to Class 1 and Class 2 companies.


Primary Namenode and Secondary Namenode configuration in Apache Hadoop

The Apache Hadoop primary NameNode and secondary NameNode architecture is designed as described below:

Namenode Master:

The conf/masters file defines the master node(s) of any single- or multi-node cluster. On the master, conf/masters looks like this (the hostname is an example):

master

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will run. When both the master box and the slave box act as Hadoop slaves, you will see the same hostname listed in both the masters and slaves files.

On the master, conf/slaves looks like this (the hostnames are examples):

master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one per line. Be sure that your NameNode can ping the machines listed in your slaves file.

Secondary Namenode:

If you are building a test cluster, you don't need to set up the secondary NameNode on a different machine (much like a pseudo-distributed install). However, if you are building a real distributed cluster, it is a great idea to move the secondary NameNode to another machine. Running the secondary NameNode on a machine other than the primary NameNode helps in case the primary NameNode goes down.

The masters file contains the name of the machine where the secondary NameNode will start. If you have modified the scripts to change your secondary NameNode details (location and name), be sure that when the DFS service starts it reads the updated configuration script so it can start the secondary NameNode correctly.

In a Linux-based Hadoop cluster, the secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file. Internally, bin/start-dfs.sh calls bin/hadoop-daemons.sh, where the name of the masters/slaves file is passed as a command-line option.

Starting the secondary NameNode on demand or via DFS:

The location of your Hadoop conf directory is set using the $HADOOP_CONF_DIR shell variable. Different distributions (e.g. Cloudera or MapR) set it up differently, so check where your Hadoop conf folder is.

To start the secondary NameNode on any machine, use the following command:

$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR secondarynamenode

When the secondary NameNode is started by the DFS service, the sequence is as follows:

$HADOOP_HOME/bin/start-dfs.sh starts the SecondaryNameNode via:

"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

If you have renamed the secondary NameNode hosts file, say to "hadoopsecondary", then when starting the secondary NameNode you need to provide that hosts file name, and be sure the change is picked up when bin/start-dfs.sh runs by default:

"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts hadoopsecondary start secondarynamenode

This starts the secondary NameNode on ALL hosts specified in the file "hadoopsecondary".

How Hadoop DFS Service Starts in a Cluster:
In a Linux-based Hadoop cluster:

1. NameNode service: starts the NameNode on the same machine from which DFS is started.
2. DataNode service: looks into the slaves file and starts a DataNode on all slaves using the following command:
#> $HADOOP_HOME/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode
3. SecondaryNameNode service: looks into the masters file and starts the SecondaryNameNode on all hosts listed in it using the following command:
#> $HADOOP_HOME/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR start secondarynamenode

An alternative backup for the NameNode: the Avatar NameNode:

The secondary NameNode is often treated as a backup for the primary NameNode, to keep the cluster going if the primary NameNode goes down (strictly speaking it performs checkpointing rather than acting as a hot standby). If you want true NameNode HA, alternatives to the secondary NameNode are available. One such method is to use an Avatar NameNode. An Avatar NameNode is created by migrating the NameNode to an Avatar NameNode, and the Avatar NameNode must be built on a separate machine.

Technically, once migrated, the Avatar NameNode is a hot standby for the NameNode, so the Avatar NameNode is always in sync with the NameNode. If you create a new file on the master NameNode, you can also read it on the standby Avatar NameNode in real time.

In standby mode, the Avatar NameNode is a read-only NameNode. At any given time you can transition the Avatar NameNode to act as the primary NameNode; when needed, you can switch from standby to fully active mode in just a few seconds. To do that, you must have a VIP for NameNode migration and an NFS mount for NameNode data replication.

Master Slave architecture in Hadoop

Apache Hadoop is designed with a Master-Slave architecture.

  • Master: NameNode, JobTracker
  • Slaves: {DataNode, TaskTracker}, ….. {DataNode, TaskTracker}

HDFS is one of the primary components of a Hadoop cluster, and HDFS is designed with a Master-Slave architecture.

  • Master: NameNode
  • Slaves: {DataNode}…..{DataNode}
  • –     The Master (NameNode) manages file system namespace operations like opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes, along with regulating client access to files.
  • –     Slaves (DataNodes) are responsible for serving read and write requests from the file system's clients, along with performing block creation, deletion, and replication upon instruction from the Master (NameNode).

Map/Reduce is also a primary component of Hadoop, and it also has a Master-Slave architecture.

  • Master: JobTracker
  • Slaves: {TaskTracker}……{TaskTracker}
  •  –     The Master {JobTracker} is the point of interaction between users and the map/reduce framework. When a map/reduce job is submitted, the JobTracker puts it in a queue of pending jobs, executes them on a first-come/first-served basis, and manages the assignment of map and reduce tasks to the TaskTrackers.
  • –     Slaves {TaskTracker} execute tasks upon instruction from the Master {JobTracker} and also handle data motion between the map and reduce phases.
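
The division of labor above can be illustrated with a toy map/reduce flow in plain Python. This is a sketch of the programming model only; in real Hadoop, map and reduce tasks run as distributed processes scheduled by the JobTracker on TaskTrackers.

```python
from collections import defaultdict

def map_phase(lines):
    """Map task: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```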

Facts about any NoSQL Database


Here are some facts about any NoSQL database:

  • Experimental spontaneity
  • Performance lies in application design
  • The structure of the query is more important than the structure of the data
  • To design a DB, you always start from the kind of query that will be used to access the data
  • The approach is results-centric: what you really expect as a result
  • This also helps in designing how to run an efficient query
  • Queries are used to define data objects
  • Accepting less-than-perfect consistency provides huge improvements in performance
  • Big-Data-backed applications have the DB spread across thousands of nodes, so
    • The performance penalty of locking the DB for writes/modifications is huge
    • Serializing all writes would lose the advantage of a distributed database
  • In an eventually consistent DB, changes arrive at the nodes within tenths of a second before the DB reaches a consistent state
  • For web applications, availability always gets priority over consistency
  • If consistency is not guaranteed by the DB, then the application has to manage it, which makes the application complex
  • By adding replication and failover strategies, a database designed for consistency can deliver super-high availability
    • HBase is an example here

Types of NoSQL databases and extensive details

Please study this article first as background: Why is there a need for a NoSQL database?

What about NoSQL:

  • NoSQL does not mean a completely "No Schema" DB
  • There are mainly 3 types of NoSQL DBs
    • Document DB
    • Data Structure Oriented DB
    • Column Oriented DB

What is a Document DB?

  • Documents are key-value pairs
  • Documents can also be stored in JSON format
  • Because of JSON, a document can be treated as an object
  • JSON documents are used as key-value pairs
  • A document can have any set of keys
  • Any key can be associated with an arbitrarily complex value that is itself a JSON document
  • Documents are added with different sets of keys:
    • Missing keys
    • Extra keys
    • Keys added later when needed
  • The application must know whether a certain key is present
  • Queries are made on keys
  • Indexes are set on keys to make searches efficient
  • Examples: CouchDB, MongoDB, Riak (Redis is covered below as a data-structure-oriented DB)
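
A minimal sketch of the document model in plain Python (dicts standing in for JSON documents; no real document DB is used, and all names are hypothetical): documents in the same collection can have different key sets, and queries match on keys.

```python
import json

# Two documents in the same "collection" with different key sets:
# doc2 has an extra "location" key and no "email" key.
doc1 = {"_id": "u1", "name": "Alice", "email": "alice@example.com"}
doc2 = {"_id": "u2", "name": "Bob", "location": {"city": "Pune"}}

collection = [doc1, doc2]

# A key can hold an arbitrarily complex value that is itself a document.
assert isinstance(doc2["location"], dict)

# Query on a key: the application must handle documents missing that key.
with_email = [d for d in collection if "email" in d]
print(json.dumps(with_email))
```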

Example of Document DB – CouchDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are documents in the DB that specify searches
  • Views can be complex
  • Views can use map/reduce to process and summarize results
  • Writes go to an append-only file, which is extremely efficient and makes writes significantly faster
  • Single-headed database
  • Can run in a cluster environment (not available in core)
  • From the CAP theorem:
    • Partition tolerance
    • Availability
    • In a non-clustered environment, availability is the main concern
    • In a clustered environment, consistency is the main concern
  • BigCouch
    • Integrates clustering with CouchDB
    • Cloudant is merging CouchDB and BigCouch
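
A sketch of how a CouchDB-style map/reduce view works, written in plain Python with hypothetical data (CouchDB itself defines views in JavaScript): a map function emits key/value pairs per document, and a reduce function summarizes the values for each key.

```python
from itertools import groupby

docs = [
    {"type": "order", "customer": "alice", "total": 10},
    {"type": "order", "customer": "bob", "total": 5},
    {"type": "order", "customer": "alice", "total": 7},
]

def view_map(doc):
    # Emit (key, value) pairs, like CouchDB's emit(doc.customer, doc.total)
    if doc.get("type") == "order":
        yield (doc["customer"], doc["total"])

def view_reduce(values):
    # Summarize the values for one key, like CouchDB's built-in _sum reduce
    return sum(values)

emitted = sorted(pair for doc in docs for pair in view_map(doc))
result = {key: view_reduce([v for _, v in rows])
          for key, rows in groupby(emitted, key=lambda p: p[0])}
print(result)  # {'alice': 17, 'bob': 5}
```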

Example of Document DB – MongoDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are JSON documents specifying fields and values to match
  • Query results can be processed by built-in map/reduce
  • Single-headed database
  • Can run in a cluster environment (not available in core)
  • From the CAP theorem:
    • Partition tolerance
    • Availability
    • In a non-clustered environment, availability is the main concern
    • In a clustered environment, consistency is the main concern

Example of Document DB – Riak

  • A document database with more flexible document types
  • Supports JSON, XML, and plain text
  • A plugin architecture supports adding other document types
  • Queries must know the structure of the JSON or XML for proper results
  • Query results can be processed by built-in map/reduce
  • Built-in control over replication and distribution
  • The core is designed to run in a cluster environment
  • From the CAP theorem:
    • Partition tolerance
    • Availability
    • Note: the tradeoff between availability and consistency is tunable
  • Writes go to an append-only file, which is extremely efficient and makes writes significantly faster

Data Structure Oriented DB – Redis:

  • An in-memory DB, for the fastest read and write speeds
  • If the dataset can fit in memory, it is the top choice
  • Great for raw speed
  • Data isn't saved to disk and is lost in case of a crash
  • Can be configured to save to disk, but at a performance cost
  • Limited scalability, with some replication
  • Cluster replication support is coming
  • What sets Redis apart:
    • The value can be a data structure (lists or sets)
    • You can do unions and intersections on lists and sets
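
The last point can be sketched in Python using built-in sets to mimic the semantics of Redis set commands (SINTER, SUNION); this is a model of the behavior with made-up data, not a Redis client.

```python
# Two Redis-style sets, e.g. users who viewed product A and product B
viewed_a = {"u1", "u2", "u3"}
viewed_b = {"u2", "u3", "u4"}

# SINTER viewed_a viewed_b -> members present in both sets
both = viewed_a & viewed_b

# SUNION viewed_a viewed_b -> members present in either set
either = viewed_a | viewed_b

print(sorted(both))    # ['u2', 'u3']
print(sorted(either))  # ['u1', 'u2', 'u3', 'u4']
```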

Column Oriented DB

  • Also considered a "sparse row store"
  • The equivalent of a "relational table": a set of rows identified by a key
  • The concept starts with columns
  • Data is organized in columns
  • Columns are stored contiguously
  • Columns tend to hold similar data
  • A row can have as many columns as needed
  • Columns are essentially keys that let you look up values in the rows
  • Columns can be added at any time
  • Unused columns in a row do not occupy storage
  • NULLs don't exist
  • Writes go to an append-only file, which is extremely efficient and makes writes significantly faster
  • Built-in control over replication and distribution
  • Examples: HBase & Cassandra
  • HBase
    • From the CAP theorem:
      • Partition tolerance
      • Consistency
  • Cassandra
    • From the CAP theorem:
      • Partition tolerance
      • Availability (Note: the tradeoff between availability and consistency is tunable)
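
A sparse row store can be sketched in Python as a dict of rows, each row holding only the columns it actually uses. This models the storage idea with hypothetical data, not HBase's or Cassandra's actual API: unused columns simply do not exist, so no NULL placeholders are stored.

```python
# Row key -> {column: value}; each row keeps only the columns it uses.
table = {
    "row1": {"name": "Alice", "email": "alice@example.com"},
    "row2": {"name": "Bob", "phone": "555-0100"},  # no 'email' column stored
}

def get(row_key, column, default=None):
    """Look up a value by row key and column; a missing column costs nothing."""
    return table.get(row_key, {}).get(column, default)

def add_column(row_key, column, value):
    """Columns can be added to any row at any time."""
    table.setdefault(row_key, {})[column] = value

add_column("row2", "email", "bob@example.com")
print(get("row1", "phone"))   # None - column never stored, no NULL placeholder
print(get("row2", "email"))   # bob@example.com
```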

Additional functionalities supported by NoSQL DBs:

  • Scripting language support
    • JavaScript
      • CouchDB, MongoDB
    • Pig
      • Hadoop
    • Hive
      • Hadoop
    • Lua
      • Redis
  • RESTful interface:
    • CouchDB and Riak
    • CouchDB can be considered the best fit with a web application framework
    • Riak also provides a traditional protocol-buffer interface

Why is there a need for a NoSQL database?

Let's start with the issues and requirements around data in this generation:

  • Issues with data size
    • Scalability
      • Vertical
        • CPU limits
      • Horizontal
        • Distributed
        • Scalability across multiple servers
  • Response
    • No overnight queries at all
    • No nightly batch processing; applications need instant results
    • Instant analytics
  • Availability
    • Data is a living, breathing part of your application
    • No single point of failure
    • Distributed in nature
      • Manual distribution (sharding)
        • Relational databases are split across multiple hosts by manual sharding
        • Energy is spent on sharding and replication design
      • Inherent distribution
        • Built-in control over replication and distribution
      • Hybrid (manual and inherent) distribution: not inherently distributed, but designed to be partitioned easily (automatically or manually)
  • Architecture
    • For any RDBMS, the schema is needed even before the program is written
    • Schemaless design is best for agile development
  • Latency while interacting with data
    • Read latency
      • A traditional RDBMS with proper indexing gives FAST read access
    • Write latency
      • Writes to an append-only file are extremely efficient and significantly faster
  • All databases must weigh the following considerations:
    • ACID properties
      • Atomicity
      • Consistency
      • Isolation
      • Durability
    • Two-phase commit
    • CAP theorem: you can get 2 out of the following 3, meaning you must sacrifice the one you need least. Partition tolerance is a must for any distributed database, so most DBs choose to sacrifice either consistency or availability:
      • Partition tolerance
      • Consistency
      • Availability

Here is an example (Twitter) of how data evolves in this generation:

  • Twitter started with 140 characters plus a few other things
  • Later it added pictures
  • Then it added location
  • So you can see that lots of metadata has been added over time
  • The data schema keeps changing, and a fixed schema will not work for this kind of data model
  • There are more and more examples showing that data requirements are fluid
  • So applications need little DB planning at the start
  • Data design is query-centric (what you are looking for) rather than centered on what kind of data it is

Based on the above, the following are the requirements for a database to fulfill these needs:

  • Scalable
  • Super-fast data insertion without concurrency bottlenecks
  • Extremely fast random reads on large datasets
  • Consistent read/write speed across the whole data set
  • Super-efficient data storage
  • Scales well for cloud application needs
  • Easy to maintain
  • Stable, of course
  • Replicates data across machines to avoid problems if a certain machine goes down
  • No more '80s-style batch processing
  • No more rigid schemas, because data changes frequently
  • Structured data is not the priority, as unstructured data grows faster than we can establish schemas

Source: what-you-need-to-know-about-nosql-databases

Commercial Hadoop Distribution or develop your own from scratch?

There is always a question with open source: should you build your own distribution directly from the open source code, or choose a commercial packaged solution that comes with additional components to make your job easier?

With Apache Hadoop, you likewise have the option to take Hadoop directly from the open source repo, build your own distribution from scratch, and pick and choose from the different components available alongside Hadoop core. Otherwise, you can choose a commercial release. While comparing both options I found a few interesting things and decided to share them.

With commercial Hadoop Distribution you will get:

  • A composite solution where you know all of the included components work together and are well tested
  • A stable, reliable setup to start with
  • Additional items, e.g. a management console, admin portal, etc., with the commercial release
  • A single point of contact for support
  • An ecosystem that provides greater benefits
  • In some cases, no or very little effort needed to fine-tune your system

When you pick and choose components from Apache Hadoop yourself, you:

  • Start by picking Hadoop core, then select each module and try to make it work with your Hadoop core
  • Build your own Hadoop cluster and manage it with whatever tooling is available
  • Need some time, depending on several factors, to reach a stable and fine-tuned system
  • Own it, and for any problem you will have to figure it out yourself
  • Have the best option for developing something from scratch within Hadoop
  • Can say the best thing is that it's your creation

Here are a few commercial offerings you can consider:

  • Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.
  • MapR distribution is very sound and provides filesystem and MapReduce engine. MapR also provides additional capabilities such as snapshots, mirrors, NFS access and full read-write file semantics.
  • Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage, and use for enterprise users. Hortonworks is in the process of building its offering; so far I don't know what is available.
  • Amazon provides Hadoop in two different ways:
    • Amazon runs (inefficient) Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
    • Amazon also runs Hadoop via Elastic MapReduce (EMR): provisioning the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are all automated by Elastic MapReduce.
  • Microsoft Announced their Hadoop Offering in late 2011 and their service is currently in CTP. Microsoft Hadoop offering will be available on Windows Azure and Windows Servers.
  • IBM offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition.
  • Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.
  • EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition.
    • The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service.
    • The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.
  • Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.
  • Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with Exadata hardware.

Some of these commercial vendors also have partnerships with hardware vendors.

Don't choose a commercial distribution just because everything works. Do your research and find your own reasons to choose one or another.

Keys to understanding the relationship between MapReduce and HDFS

Map Task (HDFS data localization):

The unit of input for a map task is an HDFS data block of the input file. The map task functions most efficiently if the data block it has to process is available locally on the node on which the task is scheduled. This approach is called HDFS data localization.

An HDFS data locality miss occurs if the data needed by the map task is not available locally. In such a case, the map task will request the data from another node in the cluster: an operation that is expensive and time consuming, leading to inefficiencies and, hence, delay in job completion.
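
The scheduling preference can be sketched in plain Python with hypothetical block-location metadata (the real decision is made by the JobTracker using HDFS block reports): if the preferred node holds a replica, the task runs locally; otherwise a locality miss forces a network transfer.

```python
# block -> nodes holding a replica (hypothetical metadata)
block_locations = {
    "blk_001": ["node1", "node2", "node3"],
    "blk_002": ["node4", "node5", "node6"],
}

def schedule_map_task(block, preferred_node):
    """Prefer a node with a local replica; otherwise record a locality miss."""
    nodes = block_locations[block]
    if preferred_node in nodes:
        return preferred_node, "local"        # HDFS data localization
    return nodes[0], "locality-miss"          # data must cross the network

print(schedule_map_task("blk_001", "node2"))  # ('node2', 'local')
print(schedule_map_task("blk_002", "node1"))  # ('node4', 'locality-miss')
```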

Clients, Data Nodes, and HDFS Storage:

Input data is uploaded to the HDFS file system in either of the following two ways:

  1. An HDFS client has a large amount of data to place into HDFS.
  2. An HDFS client is constantly streaming data into HDFS.

Both scenarios have the same interaction with HDFS, except that in the streaming case the client waits for enough data to fill a data block before writing it to HDFS. Data is stored in HDFS in large blocks, generally 64 to 128 MB or more in size. This storage approach allows easy parallel processing of data.




In hdfs-site.xml, the block size is controlled by the dfs.block.size property (the value is in bytes):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB block size -->
</property>





<property>
  <name>dfs.block.size</name>
  <value>67108864</value> <!-- 64 MB block size (the default if this property is not set) -->
</property>
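
As a quick sketch of what the block size means for storage layout (simple arithmetic, assuming a hypothetical 1 GB input file):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

one_gb = 1024 ** 3
print(num_blocks(one_gb, 128 * 1024 * 1024))  # 8 blocks at 128 MB
print(num_blocks(one_gb, 64 * 1024 * 1024))   # 16 blocks at 64 MB
```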


Block Replication Factor:

During the process of writing to HDFS, the blocks are generally replicated to multiple data nodes for redundancy. The number of copies, or the replication factor, is set to a default of 3 and can be modified by the cluster administrator in hdfs-site.xml as below:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

When the replication factor is three, HDFS’s placement policy is to:

–     Put one replica on one node in the local rack,

–     Another on a node in a different (remote) rack,

–     Last on a different node in the same remote rack.

When a new data block is stored on a data node, the data node initiates a replication process to replicate the data onto a second data node. The second data node, in turn, replicates the block to a third data node, completing the replication of the block.

With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
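
The three-step placement policy described above can be sketched in Python with a hypothetical two-rack topology (the real logic lives inside HDFS's default block placement policy): one replica on the writer's node, one on a node in a remote rack, and one on a different node in that same remote rack.

```python
import random

# Hypothetical topology: rack -> nodes
topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}

def place_replicas(local_node, local_rack, rng=random):
    """Replica 1: the writer's node. Replica 2: a node on a remote rack.
    Replica 3: a different node on that same remote rack."""
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [local_node, second, third]

replicas = place_replicas("n1", "rack1")
print(replicas)  # e.g. ['n1', 'n5', 'n4']
```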

Keywords: Hadoop, MapReduce, HDFS, Performance

Network-related particulars of Apache Hadoop cluster performance

Network Characteristics:

The nodes in a Hadoop cluster are interconnected through the network. Typically, one or more of the following phases of MapReduce jobs transfers data over the network:

1. Writing data: This phase occurs when the initial data is either streamed or bulk-delivered to HDFS. Data blocks of the loaded files are replicated, transferring additional data over the network.

2. Workload execution: The MapReduce algorithm is run.

a. Map phase: In the map phase of the algorithm, almost no traffic is sent over the network. The network is used at the beginning of the map phase only if an HDFS locality miss occurs (the data block is not locally available and has to be requested from another data node).

b. Shuffle phase: This is the phase of workload execution in which traffic is sent over the network, the degree to which depends on the workload. Data is transferred over the network when the output of the mappers is shuffled to the reducers.

c. Reduce phase: In this phase, almost no traffic is sent over the network because the reducers have all the data they need from the shuffle phase.

d. Output replication: MapReduce output is stored as a file in HDFS. The network is used when the blocks of the result file have to be replicated by HDFS for redundancy.

3. Reading data: This phase occurs when the final data is read from HDFS for consumption by the end application, such as the website, indexing, or SQL database.

In addition, the network is crucial for the Hadoop control plane: the signaling and operations of HDFS and the MapReduce infrastructure.
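
As a rough sketch of the write phase's network cost (assumptions: replication factor 3 and the first replica written on the local node, so only the extra replicas cross the network):

```python
def write_phase_network_bytes(data_bytes, replication=3, first_replica_local=True):
    """Approximate bytes sent over the network while loading data into HDFS."""
    copies_over_network = replication - 1 if first_replica_local else replication
    return data_bytes * copies_over_network

ten_gb = 10 * 1024 ** 3
print(write_phase_network_bytes(ten_gb))  # 21474836480, i.e. about 2x the input
```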

Be sure to consider the benefits and costs of the choices available when designing a network: network architectures, network devices, resiliency, oversubscription ratios, etc. The following section discusses some of these parameters in more detail.

Impact of Network Characteristics on Job Completion Times

  • A functional and resilient network is a crucial part of a good Hadoop cluster.
  • However, an analysis of the relative importance of these factors shows that other elements of a cluster have a greater influence on its performance than the network.
  • Nevertheless, you should consider some of the relevant network characteristics and their potential effects.

Network Latency

  • Variations in switch and router latency have been shown to have only limited impact on cluster performance.
  • From a network point of view, any latency-related optimization should start with a network wide analysis.
  • “Architecture first, and device next” is an effective strategy.
  • Architectures that deliver consistently lower latency at scale are better than architectures with higher overall latency but lower individual device latency.
  • The latency contribution to the workload is much higher at the application level, contributed by the application logic (Java Virtual Machine software stack, socket buffers, etc.), than the contribution of network latency.
  • In any case, slightly more or less network latency will not noticeably affect job completion times.

Data Node Network Speed

  • Data nodes must be provisioned with enough bandwidth for efficient job completion. Consider the price-to-performance trade-off entailed in adding more bandwidth to nodes.
  • Recommendations for a cluster depend on workload characteristics.
  • Typical clusters are provisioned with one or two 1-Gbps uplinks per data node. Cluster management is made easier by choosing network architectures that are proven to be resilient and easy to manage and that can scale as your data grows.
  • The use of 10-Gbps server access is largely dependent on the cost/performance trade-off.
  • The workload characteristics and the business requirement to complete jobs in the required time will drive the move to 10-Gbps server connectivity.
  • For example: As 10 Gbps Ethernet LAN-on-motherboard (LOM) connectors become more commonly available on servers in the future, more clusters likely will be built with 10 Gigabit Ethernet data node uplinks.