Enterprise Hadoop solutions distributed by key Hadoop vendors

Let's start with Cloudera Enterprise Data Hub:

[Image: Cloudera Enterprise Data Hub]

Here is the offering from Hortonworks:

[Image: Hortonworks Enterprise Hadoop]

And this is how MapR is packaging Enterprise Hadoop:

[Image: MapR Enterprise Hadoop]

And finally, the Pivotal Enterprise Hadoop offering:

[Image: Pivotal Enterprise Hadoop]

Keywords: Apache Hadoop, Cloudera, Hortonworks, Pivotal, MapR, Big Data

The present and future of Hadoop from its creator Doug Cutting

At Strata + Hadoop World, Hadoop creator Doug Cutting explained Hadoop and talked about its present and future. Doug covered:

  • What is Hadoop?
  • Where did the name “Hadoop” come from?
  • What does a Hadoop application look like?
  • Ethical use of data
  • Upcoming plans and future strategy

Here is the full interview:

[Video: Doug Cutting interview at Strata + Hadoop World]

Big Data $1B Club – Top 20 Players

Here is a list of the top players in the Big Data world with direct or indirect influence over billion-dollar (or larger) Big Data projects (in no particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMware
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each company's direct or indirect involvement in Big Data, whether through a direct product or otherwise. All of the above companies are involved in Big Data projects worth a billion dollars or more …

Accessing a Remote Hadoop Server using the Hadoop API or Tools from a local machine (Example: Hortonworks HDP Sandbox VM)

Sometimes you may need to access the Hadoop runtime from a machine where Hadoop services are not running. In this process you will create password-less SSH access to the Hadoop machine from your local machine; once that is ready, you can use the Hadoop API to access the Hadoop cluster, or run Hadoop commands directly from your local machine by passing the proper Hadoop configuration.

Starting Hortonworks HDP 1.3 and/or 2.1 VM

You can use these instructions on any VM running Hadoop, or you can download the HDP 1.3 or 2.1 images from the link below:

http://hortonworks.com/products/hortonworks-sandbox/#install

Now start your VM and make sure your Hadoop cluster is up and running. Once your VM is up and running, its IP address and hostname appear on the VM screen (typically something like 192.168.21.xxx), as shown below:

[Screenshot: HDP Sandbox VM console showing the IP address and hostname]

Accessing Hortonworks HDP 1.3 and/or 2.1 from a browser:

Using the IP address provided, you can check the Hadoop server status on port 8000 as below:

HDP 1.3 – http://192.168.21.187:8000/about/

HDP 2.1 – http://192.168.21.186:8000/about/
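
If you prefer the terminal, you can also hit the same URL with curl as a quick sanity check (the exact status line will vary by HDP version):

$ curl -sI http://192.168.21.187:8000/about/ | head -1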

The UI for both HDP1.3 and HDP 2.1 looks as below:

[Image: HDP 1.3 and HDP 2.1 sandbox web UI]

Now, from your host machine, you can also try to SSH to either VM using the username root and password hadoop as below:

$ ssh root@192.168.21.187

The authenticity of host '192.168.21.187 (192.168.21.187)' can't be established.
RSA key fingerprint is b2:c0:9a:4b:10:b4:0f:c0:a0:da:7c:47:60:84:f5:dc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.21.187' (RSA) to the list of known hosts.
root@192.168.21.187's password: hadoop
Last login: Thu Jun 5 03:55:17 2014

Now we will add password-less SSH access to these VMs; there are two options:

Option 1: You already have an SSH key created earlier and want to reuse it here:

In this option, we will first make sure we have an RSA key for SSH sessions on our local machine and then use it for password-less SSH access (a one-line shortcut for steps 5 through 9 is shown after the login example below):

  1. In your home folder (/Users/<yourname>), go to the folder named .ssh
  2. Identify the file named id_rsa.pub (/Users/avkashchauhan/.ssh/id_rsa.pub); it contains a long key string
  3. Also identify the file named authorized_keys there (i.e. /Users/avkashchauhan/.ssh/authorized_keys); it contains one or more long key strings
  4. Check the contents of id_rsa.pub and make sure this key is also present in the authorized_keys file along with any other keys (if there are any)
  5. Now copy the key string from the id_rsa.pub file
  6. SSH to your HDP machine using the username and password as in the previous step
  7. Go to the /root/.ssh folder
  8. You will find an authorized_keys file there; open it in an editor and append the key you copied in step 5
  9. Save the authorized_keys file
  10. Now in the same VM you will find an id_rsa.pub file; copy its contents
  11. Exit the HDP VM
  12. On your host machine, append the key from the HDP VM to the authorized_keys file you checked in step 3 and save it
  13. Now try logging into the HDP VM as below:

$ ssh root@192.168.21.187

Last login: Thu Jun 5 06:35:31 2014 from 192.168.21.1

Note: You will see that no password is needed this time, as password-less SSH is working.
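
As a shortcut, steps 5 through 9 can be collapsed into a single command that pipes your local public key into the VM's authorized_keys file (a sketch assuming your key lives at ~/.ssh/id_rsa.pub; you will be prompted for the password one last time):

$ cat ~/.ssh/id_rsa.pub | ssh root@192.168.21.187 'cat >> /root/.ssh/authorized_keys'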

Option 2: You haven’t created an SSH key on your local machine and will do everything from scratch:

In this option we will first create an SSH key and then use it exactly as in Option 1.

  • Log into your host machine and open a terminal
  • For example, your home folder will be /Users/<username>
  • Create a folder named .ssh inside your home folder (if it does not already exist)
  • Now go inside the .ssh folder and run the following command:

$ ssh-keygen -C 'SSH Access Key' -t rsa

Enter file in which to save the key (/home/avkashchauhan/.ssh/id_rsa): ENTER

Enter passphrase (empty for no passphrase): ENTER

Enter same passphrase again: ENTER

  • You will see that the id_rsa and id_rsa.pub files have been created. Now we will append the contents of id_rsa.pub to the authorized_keys file; if that file does not exist, the same command will create it:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

  • In the above step the contents of id_rsa.pub are appended to authorized_keys.
  • Now we will set proper permissions for keys and folders as below:

$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*

  • Finally, follow Option 1 to add both id_rsa.pub keys to both machines' authorized_keys files so that password-less SSH works.
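
Once both keys are in place, you can verify that key-based authentication really is being used; BatchMode disables password prompts, so this quick check succeeds only when password-less SSH works:

$ ssh -o BatchMode=yes root@192.168.21.187 'echo password-less SSH OK'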

Migrating Hadoop configuration from the remote machine to the local machine:

To get this working, you just need to copy the Hadoop configuration files from the HDP server to your local machine as below:

HDP 1.3:

Create a folder named hdp13 in your working folder and use the scp command to copy the configuration files over password-less SSH:

$ scp -r root@192.168.21.187:/etc/hadoop/conf.empty/ ~/hdp13

HDP 2.1:

Create a folder named hdp21 in your working folder and use the scp command to copy the configuration files over password-less SSH:

$ scp -r root@192.168.21.186:/etc/hadoop/conf/ ~/hdp21

Adding correct JAVA_HOME to imported Hadoop configuration hadoop-env.sh

Now go to your hdp13 or hdp21 folder and edit the hadoop-env.sh file with the correct JAVA_HOME as below:

# The java implementation to use. Required.
# export JAVA_HOME=/usr/jdk/jdk1.6.0_31  
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
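
On OS X you can run the backtick expression by itself to confirm which JDK path it resolves to (the exact path depends on the JDK you have installed):

$ /usr/libexec/java_home -v 1.7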

Adding the correct HDP hostnames to the local machine's hosts file:

Now you need to add the Hortonworks HDP hostnames to your local machine's hosts file. On Mac OS X, edit the /private/etc/hosts file to add the following:

#HDP 2.1
192.168.21.186 sandbox.hortonworks.com
#HDP 1.3
192.168.21.187 sandbox

Once added, make sure you can ping the hosts by name as below:

$ ping sandbox

PING sandbox (192.168.21.187): 56 data bytes
64 bytes from 192.168.21.187: icmp_seq=0 ttl=64 time=0.461 ms

And for HDP 2.1

$ ping sandbox.hortonworks.com
PING sandbox.hortonworks.com (192.168.21.186): 56 data bytes
64 bytes from 192.168.21.186: icmp_seq=0 ttl=64 time=0.420 ms

Accessing the Hadoop runtime on the remote machine using Hadoop commands (or the API) from the local machine:

Now, using the local machine's Hadoop runtime, you can connect to Hadoop on the HDP VM as below:

HDP 1.3

$ ./hadoop --config /Users/avkashchauhan/hdp13/conf.empty fs -ls /
Found 4 items
drwxr-xr-x - hdfs hdfs 0 2013-05-30 10:34 /apps
drwx------ - mapred hdfs 0 2014-06-05 03:54 /mapred
drwxrwxrwx - hdfs hdfs 0 2014-06-05 06:19 /tmp
drwxr-xr-x - hdfs hdfs 0 2013-06-10 14:39 /user

HDP 2.1

$ ./hadoop --config /Users/avkashchauhan/hdp21/conf fs -ls /
Found 6 items
drwxrwxrwx - yarn hadoop 0 2014-04-21 07:21 /app-logs
drwxr-xr-x - hdfs hdfs 0 2014-04-21 07:23 /apps
drwxr-xr-x - mapred hdfs 0 2014-04-21 07:16 /mapred
drwxr-xr-x - hdfs hdfs 0 2014-04-21 07:16 /mr-history
drwxrwxrwx - hdfs hdfs 0 2014-05-23 11:35 /tmp
drwxr-xr-x - hdfs hdfs 0 2014-05-23 11:35 /user

If you are using the Hadoop API, you can pass the configuration file path to the API and get access to the Hadoop runtime the same way.
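
Alternatively, rather than passing --config on every invocation, you can export HADOOP_CONF_DIR so that both the command-line tools and API-based programs launched from that shell pick up the remote cluster's configuration (a sketch assuming the HDP 2.1 files were copied to ~/hdp21 as above and the hadoop binary is on your PATH):

# Point local Hadoop tools at the remote cluster's configuration
$ export HADOOP_CONF_DIR=~/hdp21/conf
# Same listing as above, without the --config flag
$ hadoop fs -ls /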

Learning Cloudera Impala book with a great Buy One, Get One Free offer from Packt Publishing

Here are some facts about Cloudera Impala as provided recently by Cloudera:

  • Impala is being widely adopted.
    Impala has been downloaded by people in more than 5,000 unique organizations, and Cloudera Enterprise customers from multiple industries and Fortune 100 companies have deployed it as supported infrastructure, integrated with their existing BI tools, in business-critical environments. Furthermore, MapR and Amazon Web Services (via its Elastic MapReduce service) recently announced support for Impala in their own platforms.
  • Impala has demonstrated production readiness.
    Impala has proved out its superior performance for interactive queries in a concurrent-workloads environment, while providing reliable resource sharing compared to batch-processing alternatives like Apache Hive.
  • Impala sports an ever-increasing list of enterprise features.
    Planned new enterprise features are materializing rapidly and according to plan, including fine-grained role-based authorization, user-defined functions (UDFs), cost-based optimization, and pre-built analytic functions.

[Image: Learning Cloudera Impala book cover]

Buy my book Learning Cloudera Impala from Packt Publishing with the amazing Buy One, Get One Free offer: bit.ly/1j26nPN.

Shark vs. Impala, Hive and Amazon Redshift

Source: AMPlab (UC-Berkeley). Check out this blog post for more details.

[Chart: Shark vs. Impala, Hive and Amazon Redshift benchmark (Source: AMPlab, UC-Berkeley)]

Pivotal HD HAWQ vs. Cloudera Impala and Apache Hive

Source: Pivotal. Check out this whitepaper for more details.

[Chart: Pivotal HD HAWQ vs. Cloudera Impala and Apache Hive benchmark (Source: Pivotal)]

Cloudera Impala is Open Source and you can grab the source from github using the link below:

https://github.com/cloudera/impala

Troubleshooting Cloudera Manager installation and start issues

After Cloudera Manager is installed and running, you can access its web UI at <Your_IP_Address>:7180. If you cannot access it, here are a few ways to troubleshoot the problem. These details are helpful when you are installing Cloudera Hadoop on a remote machine and you only have shell access to that machine over SSH.
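
Before digging into services, it helps to confirm whether anything is listening on port 7180 at all, and whether it is reachable from outside. A rough sketch: run the first command on the server and the second from your local machine:

[root@cloudera-master ~]# netstat -tlnp | grep 7180
$ curl -sI http://<Your_IP_Address>:7180 | head -1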

Verify that Cloudera Manager is running:

[root@cloudera-master ~]# service cloudera-scm-server status
cloudera-scm-server (pid 4652) is running...

Now let's check the Cloudera Manager processes for further verification:

[root@cloudera-master ~]# ps -ef | grep cloudera-scm
498        977     1  0 16:31 ?        00:00:01 /usr/bin/postgres -D /var/lib/cloudera-scm-server-db/data
root      3729     1  0 16:59 pts/0    00:00:00 su cloudera-scm -s /bin/bash -c nohup /usr/sbin/cmf-server
498       3731  3729 53 16:59 ?        00:00:47 /usr/java/jdk1.6.0_31/bin/java -cp .:lib/*:/usr/share/java/mysql-connector-java.jar -Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dcmf.root.logger=INFO,LOGFILE -Dcmf.log.dir=/var/log/cloudera-scm-server -Dcmf.log.file=cloudera-scm-server.log -Dcmf.jetty.threshhold=WARN -Dcmf.schema.dir=/usr/share/cmf/schema -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dpython.home=/usr/share/cmf -Xmx2G -XX:MaxPermSize=256m com.cloudera.server.cmf.Main
root      3835  1180  0 17:00 pts/0    00:00:00 grep cloudera-scm

You can access the Cloudera Manager logs at the location shown in the command line above (-Dcmf.log.dir):

[root@cloudera-master ~]# tail -5 /var/log/cloudera-scm-server/cloudera-scm-server.log 
2014-02-06 16:59:29,596  INFO [WebServerImpl:cmon.JobDetailGatekeeper@71] ActivityMonitor configured to allow job details for all jobs.
2014-02-06 16:59:29,597  INFO [WebServerImpl:cmon.JobDetailGatekeeper@71] ActivityMonitor configured to allow job details for all jobs.
2014-02-06 16:59:29,601  INFO [WebServerImpl:mortbay.log@67] jetty-6.1.26.cloudera.2
2014-02-06 16:59:29,605  INFO [WebServerImpl:mortbay.log@67] Started SelectChannelConnector@0.0.0.0:7180
2014-02-06 16:59:29,606  INFO [WebServerImpl:cmf.WebServerImpl@280] Started Jetty server.

Let's disable the Linux machine's firewall to see if that is what is blocking access:

First, make sure it is actually safe to fiddle with your firewall; below I am disabling the firewall on my Linux machine:

[root@cloudera-master ~]# /etc/init.d/iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:[  OK  ]
[root@cloudera-master ~]# /etc/init.d/iptables stop
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@cloudera-master ~]# /etc/init.d/ip6tables save
ip6tables: Saving firewall rules to /etc/sysconfig/ip6table[  OK  ]
[root@cloudera-master ~]# /etc/init.d/ip6tables stop
ip6tables: Setting chains to policy ACCEPT: filter         [  OK  ]
ip6tables: Flushing firewall rules:                        [  OK  ]
ip6tables: Unloading modules:                              [  OK  ]
[root@cloudera-master ~]# chkconfig 
ip6tables           0:off     1:off     2:off     3:off     4:off     5:off     6:off
iptables            0:off     1:off     2:off     3:off     4:off     5:off     6:off

Once the firewall is completely stopped, restart Cloudera Manager:

If your machines are behind a network firewall, you can go ahead and disable the local firewall on the machine running Cloudera Manager. If not, set up/open the specific ports Cloudera Manager needs to get it working.
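
For example, here is a minimal sketch that opens just the Cloudera Manager web UI port (7180) instead of disabling the firewall entirely (rule placement may need adjusting for your existing iptables chains):

# Insert an ACCEPT rule for TCP port 7180 and persist it
[root@cloudera-master ~]# iptables -I INPUT -p tcp --dport 7180 -j ACCEPT
[root@cloudera-master ~]# service iptables save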

[root@cloudera-master ~]# service --status-all | grep cloudera
cloudera-scm-server (pid  1082) is running...
/usr/bin/postgres "-D" "/var/lib/cloudera-scm-server-db/data"

[root@cloudera-master ~]# service cloudera-scm-server 
Usage: cloudera-scm-server {start|stop|restart|status}

[root@cloudera-master ~]# service cloudera-scm-server restart
Stopping cloudera-scm-server:                              [  OK  ]
Starting cloudera-scm-server:                              [  OK  ]

Note that cloudera-scm-server is listed by chkconfig, but its status shows off for all runlevels.

[root@cloudera-master ~]# chkconfig 
cloudera-scm-server     0:off     1:off     2:off     3:off     4:off     5:off     6:off
cloudera-scm-server-db     0:off     1:off     2:off     3:on     4:on     5:on     6:off
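
If you want cloudera-scm-server to start automatically at boot like its database does, you can enable it for the default runlevels (optional; not required for the manual restart above):

[root@cloudera-master ~]# chkconfig cloudera-scm-server on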

Once working and accessible, the Cloudera Manager login page looks as below:

[Screenshot: Cloudera Manager login page]