Classification of companies in Big Data Space


Class 1: Companies that have adopted Hadoop as their Big Data strategy. Class 1 companies partner with Class 2 companies to excel in the Big Data space.

Class 2: Companies that have taken Hadoop and productized it.

Class 3: Companies that are creating products which add value to the overall Hadoop ecosystem.

Class 4: Companies that are consuming or providing Hadoop-based services to other companies on a smaller scale compared to Class 1 and Class 2 companies.

Primary Namenode and Secondary Namenode configuration in Apache Hadoop

The Apache Hadoop Primary Namenode and Secondary Namenode architecture is designed as follows:

Namenode Master:

The conf/masters file defines the master nodes of any single-node or multi-node cluster. On the master, conf/masters looks like this:




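A minimal sketch of what conf/masters might contain (the hostname `master` is an assumption; substitute your own master's hostname):

```
master
```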
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. When both the master box and the slave box act as Hadoop slaves, you will see the same hostname listed in both the masters and slaves files.

On the master, conf/slaves looks like this:




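A minimal sketch of conf/slaves for a two-node setup where the master also acts as a slave (the hostnames `master` and `slave` are assumptions):

```
master
slave
```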
If you have additional slave nodes, just add them to the conf/slaves file, one per line. Be sure that your namenode can ping the machines listed in your slaves file.

Secondary Namenode:

If you are building a test cluster, you don’t need to set up the secondary namenode on a different machine, similar to a pseudo-distributed install. However, if you are building a real distributed cluster, it is a great idea to move the secondary namenode to another machine. Keeping the Secondary Namenode on a machine other than the Primary NameNode helps in case the primary Namenode goes down.

The masters file contains the name of the machine where the secondary namenode will start. If you have modified the scripts to change your secondary namenode details (location and name), be sure that when the DFS service starts it reads the updated configuration script so it can start the secondary namenode correctly.

In a Linux-based Hadoop cluster, the secondary namenode is started by bin/ on the nodes specified in the conf/masters file. Initially bin/ calls bin/, where you specify the name of the masters/slaves file as a command-line option.

Start Secondary Name node on demand or by DFS:

The location of your Hadoop conf directory is set using the $HADOOP_CONF_DIR shell variable. Different distributions (e.g. Cloudera or MapR) set it up differently, so check where your Hadoop conf folder is.

To start the secondary namenode on any machine, use the following command:

$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR secondarynamenode

When the secondary namenode is started by DFS, it does the following:

$HADOOP_HOME/bin/ starts the SecondaryNameNode:

"$bin"/ --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

If you have changed the secondary namenode hosts file name, say to “hadoopsecondary”, then when starting the secondary namenode you would need to provide the hostnames file, and be sure these changes are picked up when starting bin/ by default:

"$bin"/ --config $HADOOP_CONF_DIR --hosts hadoopsecondary start secondarynamenode

This will start the secondary namenode on ALL hosts specified in the file “hadoopsecondary”.

How Hadoop DFS Service Starts in a Cluster:
In Linux based Hadoop Cluster:

1. Namenode Service: starts the Namenode on the same machine from which DFS is started.
2. DataNode Service: looks into the slaves file and starts a DataNode on all slaves using the following command:
#> $HADOOP_HOME/bin/ --config $HADOOP_CONF_DIR start datanode
3. SecondaryNameNode Service: looks into the masters file and starts a SecondaryNameNode on all hosts listed there using the following command:
#> $HADOOP_HOME/bin/ --config $HADOOP_CONF_DIR start secondarynamenode
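The start-up sequence above can be sketched in a few lines of Python (a simplified illustration of the scripts' logic, not the actual Hadoop code; the `hadoop-daemon.sh` command template and file names are assumptions):

```python
# Sketch: how the DFS start scripts fan daemon-start commands out to hosts.
# This mimics the logic only; real Hadoop uses shell scripts plus ssh.

def read_hosts(path):
    """Return non-empty, non-comment hostnames, one per line."""
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

def dfs_start_commands(masters_file, slaves_file, conf_dir="$HADOOP_CONF_DIR"):
    cmds = []
    # 1. The NameNode starts locally, on the machine where DFS is started.
    cmds.append(("localhost", f"hadoop-daemon.sh --config {conf_dir} start namenode"))
    # 2. A DataNode is started on every host listed in the slaves file.
    for host in read_hosts(slaves_file):
        cmds.append((host, f"hadoop-daemon.sh --config {conf_dir} start datanode"))
    # 3. A SecondaryNameNode is started on every host in the masters file.
    for host in read_hosts(masters_file):
        cmds.append((host, f"hadoop-daemon.sh --config {conf_dir} start secondarynamenode"))
    return cmds
```

Given a masters file containing `master` and a slaves file containing `master` and `slave`, this yields one namenode command, two datanode commands, and one secondarynamenode command.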

Alternative to backup Namenode or Avatar Namenode:

The secondary namenode is created as a backup of the primary namenode, to keep the cluster going in case the primary namenode goes down. There are alternatives to the secondary namenode available if you want to build namenode HA. One such method is to use an Avatar namenode. An Avatar namenode can be created by migrating the namenode to an Avatar namenode, and the Avatar namenode must be built on a separate machine.

Technically, after migration the Avatar namenode is the namenode’s hot standby, so the Avatar namenode is always in sync with the namenode. If you create a new file on the master namenode, you can also read it from the standby Avatar namenode in real time.

In standby mode, the Avatar namenode is a read-only namenode. At any given time you can transition the Avatar namenode to act as the primary namenode. When needed, you can switch from standby mode to fully active mode in just a few seconds. To do that, you must have a VIP for namenode migration and an NFS mount for namenode data replication.

Master Slave architecture in Hadoop

Apache Hadoop is designed with a Master-Slave architecture.

  • Master: Namenode, JobTracker
  • Slave: {DataNode, TaskTracker}, ….. {DataNode, TaskTracker}

HDFS is one of the primary components of a Hadoop cluster, and HDFS is designed with a Master-Slave architecture.

  • Master: NameNode
  • Slave: {Datanode}…..{Datanode}
  • The Master (NameNode) manages the file system namespace operations like opening, closing, and renaming files and directories; determines the mapping of blocks to DataNodes; and regulates access to files by clients.
  • Slaves (DataNodes) are responsible for serving read and write requests from the file system’s clients, and perform block creation, deletion, and replication upon instruction from the Master (NameNode).

Map/Reduce is also a primary component of Hadoop, and it too has a Master-Slave architecture.

  • Master: JobTracker
  • Slaves: {TaskTracker}……{TaskTracker}
  • The Master {JobTracker} is the point of interaction between users and the map/reduce framework. When a map/reduce job is submitted, the JobTracker puts it in a queue of pending jobs, executes them on a first-come, first-served basis, and manages the assignment of map and reduce tasks to the TaskTrackers.
  • Slaves {TaskTracker} execute tasks upon instruction from the Master {JobTracker} and also handle data motion between the map and reduce phases.

Facts about any NoSQL Database


Here are some facts about any NoSQL database:

  • Experimental  spontaneity
  • Performance lies in application design
  • Structure of query is more important than structure of data
  • To design a DB you always start from what kind of query will be used to access the data
  • The approach is results centric about what you really expect as result
  • This also helps to design how to run an efficient query
  • Queries are used to define data objects
  • Accepting less-than-perfect consistency provides huge improvements in performance
  • Big Data-backed applications have the DB spread across thousands of nodes, so
    • The performance penalty of locking the DB to write/modify is huge
    • With frequent writes, serializing all writes loses the advantage of a distributed database
  • In an eventually consistent DB, changes arrive at the nodes tenths of a second before the DB reaches a consistent state
  • For web applications, availability always gets priority over consistency
  • If consistency is not guaranteed by the DB, then the application has to manage it, which makes the application complex
  • By adding replication and failover strategies, a database designed for consistency can deliver very high availability
    • HBASE is an example here
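The eventual-consistency behavior described above can be sketched as a toy model (purely illustrative; no real database works exactly this way):

```python
# Toy sketch of eventual consistency: a write lands on one replica
# immediately and reaches the others only when changes are propagated.

class EventuallyConsistentStore:
    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # (key, value) changes not yet propagated

    def write(self, key, value):
        # Fast: only one replica is updated; no cluster-wide lock is taken.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Reads from a lagging replica may return stale data (or nothing).
        return self.replicas[replica].get(key)

    def propagate(self):
        # Anti-entropy step: after this, all replicas agree (consistent state).
        for key, value in self.pending:
            for r in self.replicas:
                r[key] = value
        self.pending = []

store = EventuallyConsistentStore()
store.write("user:1", "alice")
stale = store.read("user:1", replica=2)   # change hasn't arrived at replica 2 yet
store.propagate()
fresh = store.read("user:1", replica=2)   # now all replicas are consistent
```

The write returns immediately without waiting for all replicas, which is exactly the availability-over-consistency tradeoff the bullets above describe.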

Types of NoSQL databases and extensive details

Please study the first article as background of this article: Why there is a need for NoSql Database?

What about NoSQL:

  • NoSQL is not a completely “No Schema” DB
  • There are mainly 3 types of NoSQL DB
    • Document DB
    • Data Structure Oriented DB
    • Column Oriented DB

 What is a Document DB?

  • Documents are key-value pairs
  • Documents can also be stored in JSON format
  • Because of JSON, a document can be treated as an object
  • JSON documents are used as key-value pairs
  • A document can have any set of keys
  • Any key can be associated with an arbitrarily complex value, which is itself a JSON document
  • Documents are added with different sets of keys
    • Missing keys
    • Extra keys
    • Keys can be added in the future as needed
    • The application must know which keys are present
    • Queries are made on keys
    • Indexes are set on keys to make searches efficient
  • Example: CouchDB, MongoDB, Riak
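The flexible-key behavior described above can be sketched with plain Python dicts standing in for JSON documents (the names and fields below are made up for illustration):

```python
import json

# Documents in the same collection may have different key sets:
# missing keys, extra keys, keys added later.
doc1 = {"_id": "u1", "name": "Alice", "email": "alice@example.com"}
doc2 = {"_id": "u2", "name": "Bob", "twitter": "@bob"}  # no email, extra key

collection = [doc1, doc2]

# Queries are made on keys; the application must tolerate absent keys.
with_email = [d for d in collection if "email" in d]

# A value can itself be an arbitrarily complex JSON document.
doc3 = {"_id": "u3", "name": "Carol", "address": {"city": "Pune", "zip": "411001"}}
serialized = json.dumps(doc3)  # documents round-trip as JSON strings
```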

Example of Document DB – CouchDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are documents in the DB that specify searches
  • Views can be complex
  • Views can use map/reduce to process and summarize results
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster
  • Single-headed database
  • Can run in a cluster environment (not available in core)
  • From the CAP Theorem:
    • Partition Tolerance
    • Availability
    • In a non-clustered environment, availability is the priority
    • In a clustered environment, consistency is the priority
  • BigCouch
    • Integrates clustering with CouchDB
    • Cloudant is merging CouchDB & BigCouch

Example of Document DB – MongoDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are JSON documents specifying fields and values to match
  • Query results can be processed by built-in map/reduce
  • Single-headed database
  • Can run in a cluster environment (not available in core)
  • From the CAP Theorem:
    • Partition Tolerance
    • Availability
    • In a non-clustered environment, availability is the priority
    • In a clustered environment, consistency is the priority

Example of Document DB – Riak

  • A document database with more flexible document types
  • Supports JSON, XML, and plain text
  • A plugin architecture supports adding other document types
  • Queries must know the structure of the JSON or XML for proper results
  • Query results can be processed by built-in map/reduce
  • Built-in control over replication and distribution
  • The core is designed to run in a cluster environment
  • From the CAP Theorem:
    • Partition Tolerance
    • Availability
    • Note: the tradeoff between availability and consistency is tunable
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster

Data Structure Oriented DB – Redis:

  • In-memory DB for the fastest read and write speeds
  • If the dataset can fit in memory, it is a top choice
  • Great for raw speed
  • Data isn’t saved on disk and is lost in case of a crash
  • Can be configured to save to disk, at a cost in performance
  • Limited scalability, with some replication
  • Cluster replication support is coming
  • What makes Redis different:
    • A value can be a data structure (a list or a set)
    • You can do unions and intersections on lists and sets
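A quick illustration of that difference, using Python's built-in set type to mirror what Redis computes server-side with its SUNION and SINTER commands (the key names are made up):

```python
# Redis can store sets as values and compute unions/intersections
# server-side (SUNION / SINTER). Python's set type shows the same idea.
monday_visitors = {"alice", "bob", "carol"}
tuesday_visitors = {"bob", "dave"}

either_day = monday_visitors | tuesday_visitors   # union, like SUNION
both_days = monday_visitors & tuesday_visitors    # intersection, like SINTER
```

Doing this inside the database avoids shipping both sets to the application just to combine them.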

Column Oriented DB

  • Also considered a “sparse row store”
  • The equivalent of a “relational table”: a “set of rows” identified by a key
  • The concept starts with columns
  • Data is organized in columns
  • Columns are stored contiguously
  • Columns tend to have similar data
  • A row can have as many columns as needed
  • Columns are essentially keys that let you look up values in the rows
  • Columns can be added at any time
  • Unused columns in a row do not occupy storage
  • NULLs don’t exist
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster
  • Built-in control over replication and distribution
  • Example: HBASE & Cassandra
  • HBase
    • From CAP Theorem
      • Partition Tolerance
      • Consistency
  • Cassandra
    • From CAP Theorem
      • Partition Tolerance
      • Availability
      • Note: the tradeoff between availability and consistency is tunable
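The “sparse row” idea above can be sketched with plain Python dicts (an illustration of the storage model, not any particular database's API; all names are made up):

```python
# Sketch of a "sparse row store": each row is keyed, and only the
# columns a row actually uses take any storage -- there are no NULLs.
rows = {
    "row1": {"name": "Alice", "email": "alice@example.com"},
    "row2": {"name": "Bob", "phone": "555-0100"},  # no email column stored
}

# Columns are essentially keys used to look up values within a row.
email = rows["row1"].get("email")

# Columns can be added at any time, to any row, without a schema change.
rows["row2"]["email"] = "bob@example.com"
```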

Additional functionalities supported by NoSql DB: 

  • Scripting Language Support
    • JavaScript
      • CouchDB, MongoDB
    • Pig
      • Hadoop
    • Hive
      • Hadoop
    • Lua
      • Redis
  • RESTful Interface:
    • CouchDB and Riak
    • CouchDB can be considered the best fit with a web application framework
    • Riak provides a traditional protocol-buffer interface

Why there is a need for NoSql Database?

Let’s start with the issues and requirements around data in this generation:

  • Issues with data size
    • Scalability
      • Vertical: limited by CPU
      • Horizontal: distributed, scaling across multiple servers
  • Response
    • No overnight queries at all
    • No nightly batch processing; applications need instant results
    • Instant analytics
  • Availability
    • Data is a living, breathing part of your application
    • No single point of failure
    • Distributed in nature
      • Manual distribution (sharding): relational databases are split between multiple hosts by manual sharding, and energy is spent on sharding and replication design
      • Inherent distribution: built-in control over replication and distribution
      • Hybrid (manual and inherent) distribution: not inherently distributed, but designed to partition easily (automatically or manually)
  • Architecture
    • For any RDBMS, the schema is needed even before the program is written
    • Schemaless design (best for agile development)
  • Latency while interacting with data
    • Read latency: a traditional RDBMS with proper indexing gives FAST read access
    • Write latency: writing data to an append-only file is extremely efficient and makes writes significantly faster
  • All databases must weigh the following considerations:
    • ACID properties
      • Atomicity
      • Consistency
      • Isolation
      • Durability
    • Two-phase commit
    • CAP Theorem: you can get 2 out of the following 3, meaning you must sacrifice the one you need least. Partition tolerance is a must for any distributed database, so most DBs choose to sacrifice either consistency or availability.
      • Partition Tolerance
      • Consistency
      • Availability

Here is an example (Twitter) of how data evolves in this generation:

  • Twitter started with 140 characters + a few other things
  • Later, pictures were added
  • Then location was added
  • So you can see that lots of metadata has been added over time
  • The data schema is changing regularly, and a fixed schema will not work for this kind of data model
  • There are more and more examples showing that data requirements are fluid
  • So applications need little DB planning at the start
  • Data design is more query-centric (what you are looking for) rather than centered on what kind of data it is
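The evolving-schema example above can be sketched as schemaless records (the field names are illustrative, not Twitter's actual data model):

```python
# Fluid schema: older tweets lack fields that newer tweets carry.
tweets = [
    {"text": "hello world"},                                  # original: text only
    {"text": "lunch!", "pic": "img001.jpg"},                  # pictures added later
    {"text": "hi", "pic": "img002.jpg", "location": "Pune"},  # then location
]

# Query-centric access: ask for what you want, tolerating missing keys.
with_location = [t for t in tweets if "location" in t]
```

No migration was needed to add `pic` or `location`; old records simply lack those keys.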

Based on the above, the following are the requirements for a database to fulfill these needs:

  • Scalable
  • Super-fast data insertion without concurrency bottlenecks
  • Extremely fast random reads on large datasets
  • Consistent read/write speed across the whole data set
  • Super-efficient data storage
  • Scales well for cloud application needs
  • Easy to maintain
  • Stable, of course
  • Replicates data across machines to avoid issues if a certain machine goes down
  • No more ’80s-style batch processing
  • No more rigid schemas, because data changes frequently
  • Structured data is not a priority, as unstructured data is growing faster than we can establish schemas

Source: what-you-need-to-know-about-nosql-databases

Commercial Hadoop Distribution or develop your own from scratch?

There is always a question with open source: should one build their own distribution directly from the open source code, or choose a commercial packaged solution that comes with additional components to make the job easier?

With Apache Hadoop, you too have the option to take Hadoop directly from the open source repo and build your own distribution from scratch, picking and choosing among the different components available alongside the Hadoop core. Otherwise you can choose a commercial release. While comparing both options I found a few interesting things and decided to share them.

With commercial Hadoop Distribution you will get:

  • A complete solution where you know all of the components work together perfectly and are well tested
  • You will have a stable reliable setup to start with
  • You will get additional items, e.g. a management console, an admin portal, etc., with the commercial release
  • One Single point of contact for support
  • You will have an ecosystem which will provide greater benefits
  • In some cases you will need no or very little effort to fine-tune your system

While when you pick and choose components from Apache Hadoop, you:

  • You start by picking the Hadoop core, then select each module and try to make it work with your Hadoop core
  • You build your own Hadoop clusters and manage them with whatever is available
  • It will take some time, depending on several factors, to have a stable & fine-tuned system ready
  • You own it, and for any problem you will have to figure it out yourself
  • It is the best option for developing something from scratch within Hadoop
  • The best thing is that it’s your creation

Here are a few commercial offerings you can consider:

  • Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.
  • The MapR distribution is very sound and provides its own filesystem and MapReduce engine. MapR also provides additional capabilities such as snapshots, mirrors, NFS access, and full read-write file semantics.
  • Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage, and use for enterprise users. Hortonworks is still working on its offering; so far I don’t know what is available.
  • Amazon provides Hadoop in two different ways:
    • Amazon runs (inefficient) Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
    • Amazon also runs Hadoop via Elastic MapReduce (EMR): provisioning the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce.
  • Microsoft announced its Hadoop offering in late 2011, and the service is currently in CTP. The Microsoft Hadoop offering will be available on Windows Azure and Windows Server.
  • IBM offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition.
  • Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.
  • EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition.
    • The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service.
    • The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.
  • Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.
  • Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with Exadata hardware.

Some of these commercial vendors also have partnerships with hardware vendors.

Don’t choose a commercial distribution just because everything works out of the box. Do your research and find your own reasons to choose one or the other.