Setting up Pivotal Hadoop (PivotalHD 1.1 Community Edition) Cluster in CentOS 6.5

Download Pivotal HD Package

http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/pivotalhd_community_1.1.tar.gz

The package consists of three tarballs:

  • PHD-1.1.0.0-76.tar.gz
  • PCC-2.1.0-460.x86_64.tar.gz
  • PHDTools-1.1.0.0-97.tar.gz

Untar the above package and start with PCC (Pivotal Command Center).

Install Pivotal Command Center:

$ tar -zxvf PCC-2.1.0-460.x86_64.tar.gz
$ PHDCE1.1/PCC-2.1.0-460/install

Log in as the newly created user gpadmin and copy over the default shell profiles:
$ su - gpadmin
$ sudo cp /root/.bashrc .
$ sudo cp /root/.bash_profile .
$ sudo cp /root/.bash_logout .
$ sudo cp /root/.cshrc .
$ sudo cp /root/.tcshrc .

Log out and log back in:
$ exit
$ su - gpadmin

Make sure you have an alias set for your localhost:
$ sudo vi /etc/hosts
xx.xx.xx.xx pivotal-master.hadoopbox.com  pivotal-master
$ sudo service network restart
$ ping pivotal-master
$ ping pivotal-master.hadoopbox.com
Now we will use the Pivotal HD package, so let's untar it into the PHD-1.1.0.0-76 folder and then import it:
$ icm_client import -s PHD-1.1.0.0-76/

Get the cluster-specific configuration template:
$ icm_client fetch-template -o ~/ClusterConfigDir

Edit the cluster configuration based on your domain details:
$ vi ~/ClusterConfigDir/clusterConfig.xml
Replace every host.yourdomain.com with your own hostname. For some reason a dot (.) in the host entries is not accepted, so use the short hostname.
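
To make that replacement across the whole file in one shot, a sed one-liner like this works (pivotal-master is my short hostname; substitute your own):
$ sed -i 's/host\.yourdomain\.com/pivotal-master/g' ~/ClusterConfigDir/clusterConfig.xml
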
Also select the services you want to install. Pivotal HD requires at least the three base services: HDFS, YARN, and ZooKeeper:

<services>hdfs,yarn,zookeeper</services> <!-- hbase,hive,hawq,gpxf,pig,mahout -->

Create password-less SSH configuration:

$ ssh-keygen -t rsa
$ cd .ssh
$ cat id_rsa.pub >> authorized_keys
$ cat authorized_keys
$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*
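
To confirm the password-less setup before deploying, try an SSH round-trip; both commands should return the hostname without prompting for a password:
$ ssh pivotal-master hostname
$ ssh pivotal-master.hadoopbox.com hostname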

[gpadmin@pivotal-master ~]$ icm_client deploy -c ClusterConfigDir
Please enter the root password for the cluster nodes:
PCC creates a gpadmin user on the newly added cluster nodes (if any). Please enter a non-empty password to be used for the gpadmin user:
Verifying input
Starting install
Running scan hosts
[RESULT] The following hosts do not meet PHD prerequisites: [ pivotal-master.hadoopbox.com ] Details…

Host: pivotal-master.hadoopbox.com
Status: [FAILED]
[ERROR] Please verify supported OS type and version. Supported OS: RHEL6.1, RHEL6.2, RHEL6.3, RHEL6.4, CentOS6.1, CentOS6.2, CentOS6.3, CentOS6.4
[OK] SELinux is disabled
[OK] sshpass installed
[OK] gpadmin user exists
[OK] gpadmin user has sudo privilege
[OK] .ssh directory and authorized_keys have proper permission
[OK] Puppet version 2.7.20 installed
[OK] Ruby version 1.9.3 installed
[OK] Facter rpm version 1.6.17 installed
[OK] Admin node is reachable from host using FQDN and admin hostname.
[OK] umask is set to 0002.
[OK] nc and postgresql-devel packages are installed or available in the yum repo
[OK] iptables: Firewall is not running.
[OK] Time difference between clocks within acceptable threshold
[OK] Host FQDN is configured correctly
[OK] Host has proper java version.
ERROR: Fetching status of the cluster failed
HTTP Error 500: Server Error
Cluster ID: 4

Because I have CentOS 6.5, let's edit the /etc/centos-release file to make the Pivotal installer believe this is CentOS 6.4.
[gpadmin@pivotal-master ~]$ cat /etc/centos-release
CentOS release 6.5 (Final)
[gpadmin@pivotal-master ~]$ sudo mv /etc/centos-release /etc/centos-release-orig
[gpadmin@pivotal-master ~]$ sudo cp /etc/centos-release-orig /etc/centos-release
[gpadmin@pivotal-master ~]$ sudo vi /etc/centos-release

CentOS release 6.4 (Final)  <-- edited so it looks like CentOS 6.4 even though this box runs CentOS 6.5
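
The same edit can be done non-interactively. This overwrites the file with the 6.4 string (remember to restore the original from /etc/centos-release-orig once the install is done):
$ sudo sh -c 'echo "CentOS release 6.4 (Final)" > /etc/centos-release'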

[gpadmin@pivotal-master ~]$ icm_client deploy -c ClusterConfigDir
Please enter the root password for the cluster nodes:
PCC creates a gpadmin user on the newly added cluster nodes (if any). Please enter a non-empty password to be used for the gpadmin user:
Verifying input
Starting install
[====================================================================================================] 100%
Results:
pivotal-master… [Success]
Details at /var/log/gphd/gphdmgr/
Cluster ID: 5

$ cat /var/log/gphd/gphdmgr/GPHDClusterInstaller_1392419546.log
Updating Option : TimeOut
Current Value   : 60
TimeOut="180"
pivotal-master : Push Succeeded
pivotal-master : Push Succeeded
pivotal-master : Push Succeeded
pivotal-master : Push Succeeded
pivotal-master : Push Succeeded
pivotal-master : Push Succeeded
[INFO] Deployment ID: 1392419546
[INFO] Private key path : /var/lib/puppet/ssl-icm/private_keys/ssl-icm-1392419546.pem
[INFO] Signed cert path : /var/lib/puppet/ssl-icm/ca/signed/ssl-icm-1392419546.pem
[INFO] CA cert path : /var/lib/puppet/ssl-icm/certs/ca.pem
hostlist: pivotal-master
running: massh /tmp/tmp.jaDiwkIFMH bombed uname -n
sync cmd sudo python ~gpadmin/GPHDNodeInstaller.py --server=pivotal-master.hadoopbox.com --certname=ssl-icm-1392419546 --logfile=/tmp/GPHDNodeInstaller_1392419546.log --sync --username=gpadmin
[INFO] Deploying batch with hosts ['pivotal-master']
writing host list to file /tmp/tmp.43okqQH7Ji
[INFO] All hosts succeeded.

$ icm_client list
Fetching installed clusters
Installed Clusters:
Cluster ID: 5     Cluster Name: pivotal-master     PHD Version: 2.0     Status: installed

$ icm_client start -l pivotal-master
Starting services
Starting cluster
[====================================================================================================] 100%
Results:
pivotal-master… [Success]
Details at /var/log/gphd/gphdmgr/

Check HDFS:
$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - mapred hadoop          0 2014-02-14 15:19 /mapred
drwxrwxrwx   - hdfs   hadoop          0 2014-02-14 15:19 /tmp
drwxrwxrwx   - hdfs   hadoop          0 2014-02-14 15:20 /user
drwxr-xr-x   - hdfs   hadoop          0 2014-02-14 15:20 /yarn
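
As a quick smoke test beyond the listing, write a file into HDFS and read it back (the paths here are just examples):
$ hdfs dfs -mkdir -p /user/gpadmin
$ hdfs dfs -put /etc/hosts /user/gpadmin/hosts.txt
$ hdfs dfs -cat /user/gpadmin/hosts.txt
$ hdfs dfs -rm /user/gpadmin/hosts.txt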

Now open a browser at https://your_domain_name:5443/
Username/Password: gpadmin/gpadmin

Pivotal Command Center Service Status:
$ service commander status
commander (pid  2238) is running…
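
To stop the cluster later, the mirror of the start command should work:
$ icm_client stop -l pivotal-master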

Handling Cloudera Hadoop Cluster from command line

If you have installed Hadoop from the Cloudera distribution without Cloudera Manager, you have to manage your cluster from the console, and things are not easy. Here is some important information for working with Cloudera Hadoop from the console:

Where the hadoop binary is located:

ubuntu@HADOOP_CLUSTER:~$ which hadoop
/usr/bin/hadoop

Files located at /usr/lib/hadoop/

drwxr-xr-x 2 root root 4096 May 22 21:00 bin
drwxr-xr-x 2 root root 4096 May 23 00:25 client
drwxr-xr-x 2 root root 4096 May 23 00:25 client-0.20
drwxr-xr-x 2 root root 4096 May 22 21:00 cloudera
drwxr-xr-x 2 root root 4096 May 22 21:00 etc
-rw-r--r-- 1 root root 16678 Apr 22 17:38 hadoop-annotations-2.0.0-cdh4.2.1.jar
lrwxrwxrwx 1 root root 37 Apr 22 17:38 hadoop-annotations.jar -> hadoop-annotations-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 46858 Apr 22 17:38 hadoop-auth-2.0.0-cdh4.2.1.jar
lrwxrwxrwx 1 root root 30 Apr 22 17:38 hadoop-auth.jar -> hadoop-auth-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 2267883 Apr 22 17:38 hadoop-common-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 1213897 Apr 22 17:38 hadoop-common-2.0.0-cdh4.2.1-tests.jar
lrwxrwxrwx 1 root root 32 Apr 22 17:38 hadoop-common.jar -> hadoop-common-2.0.0-cdh4.2.1.jar
drwxr-xr-x 3 root root 4096 May 22 21:00 lib
drwxr-xr-x 2 root root 4096 May 23 00:25 libexec
drwxr-xr-x 2 root root 4096 May 22 21:00 sbin

Hadoop cluster-specific XML configuration files are stored here:

ubuntu@HADOOP_CLUSTER:~$ ls -l /usr/lib/hadoop/etc/hadoop
lrwxrwxrwx 1 root root 16 Apr 22 17:38 /usr/lib/hadoop/etc/hadoop -> /etc/hadoop/conf
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/hadoop/conf
lrwxrwxrwx 1 root root 29 May 22 21:00 /etc/hadoop/conf -> /etc/alternatives/hadoop-conf
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/alternatives/hadoop-conf
lrwxrwxrwx 1 root root 23 May 22 22:02 /etc/alternatives/hadoop-conf -> /etc/hadoop/conf.avkash
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/hadoop/conf.avkash/

    • core-site.xml
    • hadoop-metrics.properties
    • hadoop-metrics2.properties
    • hdfs-site.xml
    • log4j.properties
    • mapred-site.xml
    • slaves
    • ssl-client.xml.example
    • ssl-server.xml.example
    • yarn-env.sh
    • yarn-site.xml

Note: Otherwise you can find the Hadoop configuration files with:

    • ubuntu@ec2-54-214-67-144:~$ sudo find / -name "hdfs*.xml"
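
Since /etc/hadoop/conf is managed through the alternatives system (as the symlink chain above shows), you can also ask alternatives directly which configuration directory is active:

    • ubuntu@HADOOP_CLUSTER:~$ update-alternatives --display hadoop-conf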

Hadoop cluster-specific scripts are located here:

  • Hadoop
    • /usr/lib/hadoop/libexec/hadoop-config.sh
    • /usr/lib/hadoop/libexec/hadoop-layout.sh
    • /usr/lib/hadoop/sbin/hadoop-daemon.sh
    • /usr/lib/hadoop/sbin/hadoop-daemons.sh
  • MapReduce
    • /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemon.sh
    • /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-config.sh
    • /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemons.sh

To start/stop/restart Hadoop services, the init scripts are located here:

  • Hadoop Namenode and Job Tracker
    • /etc/init.d/hadoop-0.20-mapreduce-jobtracker
    • /etc/init.d/hadoop-hdfs-namenode
  • Hadoop Datanode and TaskTracker
    • /etc/init.d/hadoop-hdfs-datanode
    • /etc/init.d/hadoop-0.20-mapreduce-tasktracker

If you decide to start or stop Hadoop services manually, you can do the following (a one-shot restart loop is sketched after this list):

  • Stop Services:
    • sudo /etc/init.d/hadoop-hdfs-namenode stop
    • sudo /etc/init.d/hadoop-hdfs-datanode stop
    • sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker stop
    • sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker stop
  • Start Services
    • sudo /etc/init.d/hadoop-hdfs-namenode start
    • sudo /etc/init.d/hadoop-hdfs-datanode start
    • sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start
    • sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start
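
To bounce all four daemons in one go, a small loop over the same init scripts works (these scripts accept restart as well as start/stop):

  • for svc in hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-0.20-mapreduce-jobtracker hadoop-0.20-mapreduce-tasktracker; do sudo /etc/init.d/$svc restart; done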

Running hdfs commands in the hdfs user context (two health-check commands follow this list):

  • sudo -u hdfs hdfs dfs -mkdir /tmp
  • sudo -u hdfs hdfs dfs -chmod -R 1777 /tmp
  • sudo -u hdfs hdfs dfs -mkdir -p /var/lib/hadoop-hdfs/cache/
  • hdfs dfs -ls /
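
Two more commands in the hdfs user context that are handy for checking cluster health:

  • sudo -u hdfs hdfs dfsadmin -report
  • sudo -u hdfs hdfs fsck /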

Running a Hadoop example job from the console (viewing the job output is shown after the log):

  • ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -copyFromLocal history.log /
  • ubuntu@HADOOP_CLUSTER:~$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount /history.log /home/ubuntu/results
  • 13/06/04 16:14:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  • 13/06/04 16:14:35 INFO input.FileInputFormat: Total input paths to process : 1
  • 13/06/04 16:14:35 INFO mapred.JobClient: Running job: job_201306041556_0005
  • 13/06/04 16:14:36 INFO mapred.JobClient: map 0% reduce 0%
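
Once the job completes, the word counts land in part files under the output directory; list and view them like this:

  • ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -ls /home/ubuntu/results
  • ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -cat /home/ubuntu/results/part-*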

The following error means HDFS is running but the JobTracker is not (the recovery commands follow the log):

  • 13/06/04 15:48:48 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  • 13/06/04 15:48:49 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  • 13/06/04 15:48:50 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  • 13/06/04 15:48:51 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

Keywords: Hadoop, MapReduce, Cloudera, Services

US Mass-shootings data visualization from 1996-2012

Here is a mass-shootings data visualization for the US between 1996 and 2012. The stats are based on newspaper reports, and the visualization was done using Platfora.

Graph 1: Mass shootings in USA between 1996-2012

Graph 2: Month with maximum numbers of mass shootings

Graph 3: School or Place Name and Casualty Count

Graph 4: Month with maximum numbers of mass shootings

Graph 5: Mass shootings in particular month

Keywords: Hadoop, Big Data, Data Visualization, Platfora, HadoopBI, BigDataBI

Top movies visualizations using Platfora

Here are some cool visualizations of the IMDB Movie Dataset (60,000 records from 1893-2004) using Platfora…

Top Criteria: 7.5+ rating and 50,000+ votes.

Timeline: 1893-2004

Top movies of all time:

Top "R" Rated movies:

Top PG-13 movies of all time:

Top Action Movies:

Top Comedy Movies:

Keywords: Hadoop, BigData, Data Visualization

IMDB Movie Dataset Visualization using Platfora

Here are some cool visualizations of the IMDB Movie Dataset (60,000 records from 1893-2004) using Platfora…

Top movies based on 8.0 rating and highest voting:

Total yearly budget and movie production between 1893-2004

Total movies produced from 1893-2004

Total yearly budget for movie production from 1893-2004

Total movies produced based on MPAA Ratings between 1893-2004

Total movies based on MPAA rating between 1893-2004

Fact: In 1942, a total of 100 animations (of all lengths) were produced, compared to only 94 in 2003

All-time count of movies rated 9+ (the data for year 2005 is incomplete):
