Previous Definition: Velocity, Variety and Volume
New Definition: Velocity, Variety and Volume + Variability and Complexity
I found that Packt Publishing has a few great books on Big Data; here is a collection of the ones I found most useful:
Packt is giving its readers a chance to dive into its comprehensive catalog of over 2000 books and videos for the next 7 days with its LevelUp program:
Packt is offering all of its eBooks and videos at just $10 each or less; the more EXP customers want to gain, the more they save:
More information is available at bit.ly/Yj6oWq | bit.ly/1yu4679
For more information, please visit: www.packtpub.com/packt/offers/levelup
Here is a list of the top players in the Big Data world with direct or indirect influence over billion-dollar (or larger) Big Data projects (in no particular order):
The list is based on each company's direct or indirect involvement in Big Data, whether or not it offers a dedicated product. All of the above companies are involved in Big Data projects worth a billion dollars or more…
You can check out the Spark Summit 2014 agenda here: http://spark-summit.org/2014/agenda
http://www.ustream.tv/channel/spark-summit-2014
Please register at the summit site to get more detailed information.
Keywords: Spark Summit, Hadoop, Spark
Ambari Blueprints allow an operator to instantiate a Hadoop cluster quickly and to reuse the blueprint to replicate cluster instances elsewhere, for example as development and test clusters, staging clusters, performance testing clusters, or co-located clusters.
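As a minimal sketch of how that works: a blueprint is a JSON document registered with the Ambari REST API, and a cluster is then instantiated from it with a second call. The host name, credentials, and JSON file names below are placeholders, not taken from the release:
# Register a blueprint (my-blueprint.json lists the stack, host groups, and components)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @my-blueprint.json http://ambari-host:8080/api/v1/blueprints/my-blueprint
# Instantiate a cluster from the blueprint (cluster-template.json maps host groups to actual hosts)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @cluster-template.json http://ambari-host:8080/api/v1/clusters/MyCluster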
Release URL: http://www.apache.org/dyn/closer.cgi/ambari/ambari-1.6.0
Ambari now extends database support for Ambari DB, Hive and Oozie to include PostgreSQL. This means that Ambari now provides support for the key databases used in enterprises today: PostgreSQL, MySQL and Oracle. The PostgreSQL configuration choice is reflected in this database support matrix.
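For example, pointing Ambari at an existing PostgreSQL instance can be done at setup time. This is a sketch only; the host, port, database name, and credentials below are placeholders:
# Configure the Ambari server to use an external PostgreSQL database
ambari-server setup --database=postgres --databasehost=pg-host.example.com --databaseport=5432 --databasename=ambari --databaseusername=ambari --databasepassword=secret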
More Links:
Content Source: http://hortonworks.com/blog/apache-ambari-1-6-0-released-blueprints
AWS has been working with the NASA Earth Exchange (NEX) team to make it easier and more efficient for researchers to access and process earth science data. The goal is to make a number of important data sets accessible to a wider audience of full-time researchers, students, and citizen scientists. This important new project is called OpenNEX. Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size (tens of terabytes). Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical.
Access Dataset: s3://nasanex/NEX-DCP30
Consult the detail page and the tech note to learn more about the provenance, format, structure, and attribution requirements.
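Assuming you have the AWS CLI installed, the public bucket can be browsed and copied from anonymously; the exact key layout is best discovered by listing, and the copy target below is a placeholder:
# List the public NEX-DCP30 keys without AWS credentials
aws s3 ls s3://nasanex/NEX-DCP30/ --no-sign-request
# Copy a file of interest locally (replace <some-key> with a real key from the listing)
aws s3 cp s3://nasanex/NEX-DCP30/<some-key> . --no-sign-request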
The NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) dataset comprises downscaled climate scenarios for the conterminous United States that are derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across the four greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011] developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). The dataset includes downscaled projections from 33 models, as well as ensemble statistics calculated for each RCP from all available model runs. The purpose of these datasets is to provide a set of high-resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.
Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run).
Website: NASA NEX
Learn more about NEX – NASA Earth Exchange Downscaled Project
NEX Virtual Workshop: https://nex.nasa.gov/nex/projects/1328/
Problem: I tried accessing my EC2 instance and got the following error:
Avkash-Machine:~ avkash$ ssh -i MyUtils/SSHAccessKey.pem ec2_user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Let's try a verbose SSH connection to see what the issue could be:
Avkash-Machine:~ avkash$ ssh -v -i MyUtils/SSHAccessKey.pem ec2_user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
OpenSSH_5.9p1, OpenSSL 0.9.8r 8 Feb 2011
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug1: Connecting to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com [xx.xxx.xxx.xx] port 22.
debug1: Connection established.
debug1: identity file MyUtils/SSHAccessKey.pem type -1
debug1: identity file MyUtils/SSHAccessKey.pem-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Server host key: RSA 0f:ce:27:19:18:ee:40:86:df:db:f0:95:79:29:49:05
debug1: Host 'xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com' is known and matches the RSA host key.
debug1: Found key in /Users/avkash/.ssh/known_hosts:15
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: MyUtils/SSHAccessKey.pem
debug1: read PEM private key done: type RSA
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: No more authentication methods to try.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Above you can see that the SSH key is valid and readable, and that the connection tried the available authentication methods; however, the connection still failed.
As you can see below, the SSH key SSHAccessKey.pem has the proper access mode configured:
Avkash-Machine:~ avkash$ ls -l MyUtils/SSHAccessKey.pem
-rwx------@ 1 avkash staff 1692 May 20 14:38 MyUtils/SSHAccessKey.pem
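(For reference, if the key were group- or world-readable, ssh would refuse to use it with an "UNPROTECTED PRIVATE KEY FILE" warning; tightening the mode is a one-liner:)
# Restrict the private key to owner read-only
chmod 400 MyUtils/SSHAccessKey.pem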
Now the connection works, as shown in the following full verbose log (note the user name in the command below):
Avkash-Machine:~ avkash$ ssh -v -i MyUtils/SSHAccessKey.pem ec2-user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
OpenSSH_5.9p1, OpenSSL 0.9.8r 8 Feb 2011
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug1: Connecting to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com [xx.xxx.xxx.xx] port 22.
debug1: Connection established.
debug1: identity file MyUtils/SSHAccessKey.pem type -1
debug1: identity file MyUtils/SSHAccessKey.pem-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Server host key: RSA 0f:ce:27:19:18:ee:40:86:df:db:f0:95:79:29:49:05
debug1: Host 'xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com' is known and matches the RSA host key.
debug1: Found key in /Users/avkash/.ssh/known_hosts:15
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: publickey
debug1: Trying private key: MyUtils/SSHAccessKey.pem
debug1: read PEM private key done: type RSA
debug1: Authentication succeeded (publickey).
Authenticated to xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com ([xx.xxx.xxx.xx]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LC_MONETARY = en_US.utf-8
debug1: Sending env LC_NUMERIC = en_US.utf-8
debug1: Sending env LC_MESSAGES = en_US.utf-8
debug1: Sending env LC_COLLATE = en_US.utf-8
debug1: Sending env LANG = ru_RU.UTF-8
debug1: Sending env LC_CTYPE = en_US.utf-8
debug1: Sending env LC_TIME = en_US.utf-8
Verification done.
Solution: I was using an incorrect user ID (ec2_user) instead of the correct one (ec2-user).
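More generally, when a key-based EC2 login fails with "Permission denied (publickey)", it is worth double-checking the AMI-specific login name before anything else. A quick sketch, using the same masked host name as above:
# The login name depends on the AMI: ec2-user on Amazon Linux, ubuntu on Ubuntu AMIs
ssh -i MyUtils/SSHAccessKey.pem ec2-user@xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com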
In this demo we will submit a WordCount MapReduce job to an HDInsight cluster, process the results in Pig, and then filter the results in Hive by storing the structured results into a table.
Step 1: Submitting a WordCount MapReduce job to a 4-node HDInsight cluster:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount
The results are stored at /user/admin/outcount.
Verify the results in the interactive shell:
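(A sketch of what that verification can look like from the Hadoop command shell; the exact output will vary:)
# List the job output directory, then print the reducer output
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd fs -ls /user/admin/outcount
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd fs -cat /user/admin/outcount/part-r-00000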
Step 2: Loading the /user/admin/outcount/part-r-00000 results into Pig:
First we load the flat text file data in (words, wordCount) format, as below:
grunt> mydata = load '/user/admin/outcount/part-r-00000' using PigStorage('\t') as (words:chararray, wordCount:int);
grunt> first10 = LIMIT mydata 10;
grunt> dump first10;
Note: This shows results for words with a frequency of 1. We need to reorder the results in descending order to get the words with the highest frequency.
grunt> mydatadsc = order mydata by wordCount DESC;
grunt> first10 = LIMIT mydatadsc 10;
grunt> dump first10;
Now we have the results as expected. Let's store the results into a file on HDFS.
grunt> store first10 into '/user/admin/myresults10';
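Note that store with the default PigStorage writes tab-delimited output, which matters for the Hive table definition in the next step. You can confirm the layout first (a sketch; adjust the path to your cluster):
# Print the stored Pig output to check the delimiter
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd fs -cat /user/admin/myresults10/part-r-00000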
Step 3: Filtering the Pig results into a Hive table:
First we will create a table in Hive using the same format (words and wordcount separated by a tab):
hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by '\t' lines terminated by '\n';
Now that the table is created, we will load the Pig output file '/user/admin/myresults10/part-r-00000' into the wordslist10 table we just created:
hive> load data inpath '/user/admin/myresults10/part-r-00000' overwrite into table wordslist10;
That's all; as you can see, the results are now in the table:
hive> select * from wordslist10;
Keywords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, BigData