Big Data – Transitioning from Velocity, Variety and Volume to adding Variability and Complexity to the mix

Previous Definition: Velocity, Variety and Volume

New Definition: Velocity, Variety and Volume + Variability and Complexity



A collection of Big Data books from Packt Publishing

I found that Packt Publishing has a few great books on Big Data, and here is a collection of the ones I found very useful:

Packt is giving its readers a chance to dive into its comprehensive catalog of over 2000 books and videos for the next 7 days with the LevelUp program:


Packt is offering all of its eBooks and videos at just $10 each or less.

The more EXP customers want to gain, the more they save:

  • Any 1 or 2 eBooks/Videos – $10 each
  • Any 3 to 5 eBooks/Videos – $8 each
  • Any 6 or more eBooks/Videos – $6 each

More Information is available at bit.ly/Yj6oWq  |  bit.ly/1yu4679

For more information, please visit: www.packtpub.com/packt/offers/levelup

Big Data $1 Billion Club – Top 20 Players

Here is a list of the top players in the Big Data world, each directly or indirectly involved in a billion dollars (or more) worth of Big Data projects (in no particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMware
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each company's direct or indirect involvement in Big Data, whether or not it ships a direct product. All of the above companies are involved in Big Data projects worth a billion dollars or more …

Watch Spark Summit 2014 on UStream

You can check out the Spark Summit 2014 agenda here: http://spark-summit.org/2014/agenda

UStream Sessions:

http://www.ustream.tv/channel/spark-summit-2014

Please register at the summit site to get more detailed information.

Keywords: Spark Summit, Hadoop, Spark

Apache Ambari 1.6.0 with Blueprints support is released

What is an Ambari Blueprint?

Ambari Blueprint allows an operator to instantiate a Hadoop cluster quickly—and reuse the blueprint to replicate cluster instances elsewhere, for example, as development and test clusters, staging clusters, performance testing clusters, or co-located clusters.
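
As a rough sketch of how this works (not an excerpt from the Ambari documentation; the blueprint name, hostnames, credentials and the truncated component list below are illustrative placeholders), a blueprint is a JSON document that is first registered with the Ambari REST API and then referenced when creating a cluster:

# Register a blueprint: host groups with their components, plus the target stack
curl -u admin:admin -H "X-Requested-By: ambari" -X POST http://ambari-server:8080/api/v1/blueprints/my-blueprint -d '{
  "host_groups": [
    { "name": "host_group_1",
      "components": [ { "name": "NAMENODE" }, { "name": "HDFS_CLIENT" }, { "name": "ZOOKEEPER_SERVER" } ],
      "cardinality": "1" }
  ],
  "Blueprints": { "stack_name": "HDP", "stack_version": "2.1" }
}'

# Instantiate a cluster from the registered blueprint by mapping host groups to actual hosts
curl -u admin:admin -H "X-Requested-By: ambari" -X POST http://ambari-server:8080/api/v1/clusters/my-cluster -d '{
  "blueprint": "my-blueprint",
  "host_groups": [ { "name": "host_group_1", "hosts": [ { "fqdn": "node1.example.com" } ] } ]
}'

Because the blueprint captures the stack version and component layout separately from the host mapping, the same blueprint can be reused to stamp out matching development, test, staging or performance clusters.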

Release URL: http://www.apache.org/dyn/closer.cgi/ambari/ambari-1.6.0

Ambari now supports PostgreSQL:

Ambari now extends database support for Ambari DB, Hive and Oozie to include PostgreSQL. This means that Ambari now provides support for the key databases used in enterprises today: PostgreSQL, MySQL and Oracle. The PostgreSQL configuration choice is reflected in this database support matrix.
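
As a minimal sketch (the exact prompt wording varies between Ambari versions), the Ambari server is pointed at an existing PostgreSQL instance during server setup:

# Re-run setup on the Ambari server host and pick the advanced database option
ambari-server setup
# When asked whether to enter advanced database configuration, answer yes,
# choose PostgreSQL from the list, and supply the host, port, database name,
# user and password of the existing PostgreSQL instance.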


Content Source: http://hortonworks.com/blog/apache-ambari-1-6-0-released-blueprints

Learning Cloudera Impala – Book Availability

Learning Cloudera Impala:

 
Learning Cloudera Impala is for those who really want to take advantage of their Hadoop cluster by processing extremely large amounts of raw data in Hadoop at real-time speed. Prior knowledge of Hadoop and some exposure to Hive and MapReduce is expected.
 

You will learn from this book:

  • Understand the various ways of installing Impala in your Hadoop cluster
  • Use the Impala shell API to interact with Impala components
  • Utilize Impala Query Language and built-in functions to play with data
  • Administer and fine-tune Impala for high availability
  • Identify and troubleshoot problems in a variety of ways
  • Get acquainted with various input data formats in Hadoop and how to use them with Impala
  • Comprehend how third party applications can connect with Impala to provide data visualization and various other enhancements



20 TB Earth Science Dataset on AWS with NASA/NEX Available to the Public

AWS has been working with the NASA Earth Exchange (NEX) team to make it easier and more efficient for researchers to access and process earth science data. The goal is to make a number of important data sets accessible to a wider audience of full-time researchers, students, and citizen scientists. This important new project is called OpenNEX. Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size (tens of terabytes). Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical.


Access Dataset: s3://nasanex/NEX-DCP30

Consult the detail page and the tech note to learn more about the provenance, format, structure, and attribution requirements.
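
As a quick sketch (assuming the AWS CLI and the netCDF tools are installed; since the bucket is public, unsigned requests should work, and the filenames below are placeholders), you can browse the dataset straight from S3 and inspect a downloaded file's header:

# List the top level of the public NEX-DCP30 dataset
aws s3 ls s3://nasanex/NEX-DCP30/ --no-sign-request

# Copy one file locally; pick a real key from the listing above
aws s3 cp s3://nasanex/NEX-DCP30/<path-to-a-file>.nc . --no-sign-request

# The files are netCDF4 classic (see the Summary below); print just the header
# (dimensions, variables, attributes) without dumping the data itself
ncdump -h <downloaded-file>.nc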

NASA Earth Exchange (NEX):

The NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) dataset is comprised of downscaled climate scenarios for the conterminous United States that are derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across the four greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011] developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). The dataset includes downscaled projections from 33 models, as well as ensemble statistics calculated for each RCP from all model runs available. The purpose of these datasets is to provide a set of high resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.

Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run).

Website: NASA NEX

Summary

  • Short Name: NEX-DCP30
  • Version: 1
  • Format: netCDF4 classic
  • Spatial Coverage: CONUS
  • Temporal Coverage:
    • 1950 – 2005 historical or 2006 – 2099 RCP
  • Data Resolution:
    • Latitude Resolution: 30 arc second
    • Longitude Resolution: 30 arc second
    • Temporal Resolution: monthly
  • Data Size:
    • Total Dataset Size: 17 TB
    • Individual file size: 2 GB

Learn more about NEX – NASA Earth Exchange Downscaled Project

NEX Virtual Workshop: https://nex.nasa.gov/nex/projects/1328/

 

Permission denied (publickey,gssapi-keyex,gssapi-with-mic) error with SSH connection to Amazon EC2 Instance

Problem: I tried accessing my EC2 instance and got the error below:

Avkash-Machine:~ avkash$ ssh -i MyUtils/SSHAccessKey.pem ec2_user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Let's try a verbose SSH connection to see what the issue could be:

Avkash-Machine:~ avkash$ ssh -v -i MyUtils/SSHAccessKey.pem ec2_user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
OpenSSH_5.9p1, OpenSSL 0.9.8r 8 Feb 2011
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug1: Connecting to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com [xx.xxx.xxx.xx] port 22.
debug1: Connection established.
debug1: identity file MyUtils/SSHAccessKey.pem type -1
debug1: identity file MyUtils/SSHAccessKey.pem-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Server host key: RSA 0f:ce:27:19:18:ee:40:86:df:db:f0:95:79:29:49:05
debug1: Host 'xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com' is known and matches the RSA host key.
debug1: Found key in /Users/avkash/.ssh/known_hosts:15
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: MyUtils/SSHAccessKey.pem
debug1: read PEM private key done: type RSA
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: No more authentication methods to try.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Above you can see that the SSH key is valid and readable, and the connection tried various authentication methods; however, the connection still failed.

As you can see below, the SSH key SSHAccessKey.pem has the proper access mode configured:

Avkash-Machine:~ avkash$ ls -l MyUtils/SSHAccessKey.pem
-rwx------@ 1 avkash staff 1692 May 20 14:38 MyUtils/SSHAccessKey.pem

Now the connection works, as shown in the following full verbose log:

Avkash-Machine:~ avkash$ ssh -v -i MyUtils/SSHAccessKey.pem ec2-user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
OpenSSH_5.9p1, OpenSSL 0.9.8r 8 Feb 2011
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug1: Connecting to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com [xx.xxx.xxx.xx] port 22.
debug1: Connection established.
debug1: identity file MyUtils/SSHAccessKey.pem type -1
debug1: identity file MyUtils/SSHAccessKey.pem-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Server host key: RSA 0f:ce:27:19:18:ee:40:86:df:db:f0:95:79:29:49:05
debug1: Host 'xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com' is known and matches the RSA host key.
debug1: Found key in /Users/avkash/.ssh/known_hosts:15
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: MyUtils/SSHAccessKey.pem
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
Authenticated to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com ([xx.xxx.xxx.xx]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LC_MONETARY = en_US.utf-8
debug1: Sending env LC_NUMERIC = en_US.utf-8
debug1: Sending env LC_MESSAGES = en_US.utf-8
debug1: Sending env LC_COLLATE = en_US.utf-8
debug1: Sending env LANG = ru_RU.UTF-8
debug1: Sending env LC_CTYPE = en_US.utf-8
debug1: Sending env LC_TIME = en_US.utf-8

Verification done:

  1. I checked that my EC2 instance has the proper SSH key
  2. Verified that the EC2 instance has the SSH port open in the firewall (Security Group)

Solution: I was using an incorrect user ID (ec2_user) instead of the correct one (ec2-user).
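
For reference, the working command is the same one with the corrected user name. Note that the default login user depends on the AMI: Amazon Linux instances use ec2-user, while Ubuntu AMIs, for example, use ubuntu.

# ec2-user (with a hyphen) is the default user on Amazon Linux AMIs
ssh -i MyUtils/SSHAccessKey.pem ec2-user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com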

 

 

 

HDInsight (Hadoop on Azure) Demo: Submit a MapReduce job, process the results in Pig, and filter the final results in Hive

In this demo we will submit a WordCount MapReduce job to an HDInsight cluster, process the results in Pig, and then filter the results in Hive by storing the structured results in a table.

Step 1: Submitting the WordCount MapReduce job to a 4-node HDInsight cluster:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount


The results are stored at /user/admin/outcount.

Verify the results in the Interactive Shell:

js> #ls /user/admin/outcount
Found 2 items
-rwxrwxrwx 1 admin supergroup 0 2013-03-28 05:22 /user/admin/outcount/_SUCCESS
-rwxrwxrwx 1 admin supergroup 337623 2013-03-28 05:22 /user/admin/outcount/part-r-00000

Step 2: Loading the /user/admin/outcount/part-r-00000 results in Pig:

First we load the flat text file data in (words, wordCount) format, as shown below:

Grunt>mydata = load '/user/admin/outcount/part-r-00000' using PigStorage('\t') as (words:chararray, wordCount:int);

Grunt>first10 = LIMIT mydata 10;

Grunt>dump first10;


Note: The dump shows results for words with frequency 1. We need to reorder the results in descending order to get the words with the highest frequency.

Grunt>mydatadsc = order mydata by wordCount DESC;

Grunt>first10 = LIMIT mydatadsc 10;

Grunt>dump first10;


Now we have the results as expected. Let's store the results into a file in HDFS.

Grunt>store first10 into '/user/admin/myresults10' using PigStorage(',');

Step 3: Filtering the Pig results into a Hive table:

First we will create a table in Hive using the same format (words and wordCount separated by a comma):

hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by ',' lines terminated by '\n';

Now that the table is created, we will load the stored file '/user/admin/myresults10/part-r-00000' into the wordslist10 table we just created:

hive> load data inpath '/user/admin/myresults10/part-r-00000' overwrite into table wordslist10;

That's all; as you can see, the results are now in the table:

hive> select * from wordslist10;


Keywords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, Big Data