Hortonworks Data Platform (HDP) Sandbox with Hadoop Walkthrough

I recently downloaded the Hortonworks Data Platform (HDP) Sandbox and gave it a quick run. As advertised, it takes only about 15 minutes to get familiar with key Hadoop components such as Pig, HBase, Hive, Oozie, and a few others. The Sandbox ships as a pre-configured, pre-installed VMDK image running CentOS 6.2 Linux, compatible with VMware and VirtualBox, which you can download and open easily with either of these free applications. This 15-minute walkthrough covers the key Hadoop components in the HDP Sandbox.

If you are new to Apache Hadoop, this is the easiest way to familiarize yourself with the amazing world of Hadoop.

HDP Sandbox Instructions: http://hortonworks.com/blog/hortonworks-sandbox-the-fastest-on-ramp-to-apache-hadoop/

HDP Download Links: http://hortonworks.com/products/sandbox-instructions/

VMDK Username and Password: root/hadoop

Cloud Storage Performance Tests are out and Windows Azure Cloud Storage is #1 in most categories

Windows Azure Cloud Storage is #1 in most categories, as you can see below:

Cloud Storage Delete Speed Report: (Azure Cloud Storage #1)


Cloud Storage Read Speed Report: (Azure Cloud Storage #1)

Cloud Storage Read/Write Error Report: (Azure Cloud Storage #1)

Cloud Storage Response Time/Uptime Report: (Azure Cloud Storage #1 in response time)
Windows Azure Cloud Storage is not #1 in uptime due to SIE.

Cloud Storage Scaling Test Report: (Azure Cloud Storage #2, behind Amazon)


Cloud Storage Write Speed Report: (Azure Cloud Storage #1 with all file sizes)

Read the full details of the Nasuni Cloud Storage Report here.

Visualizing Social Network Data based on Twitter #Hashtag using NodeXL

In this live demo you will learn how to visualize social network data using NodeXL. NodeXL is an open source add-in for Excel that is great for downloading social network data from Twitter, Facebook, or Flickr. In this example I am using a Twitter #hashtag to collect social network data.

NodeXL: http://nodexl.codeplex.com/

A Menu based Windows Azure PowerShell script for PaaS and IaaS Operations

When run, the PowerShell menu looks like this:

Get the script from here:

For those who would like to fork the script and add more functionality, use the commands below:


$ ls -l
total 5
-rw-r--r--    1 avkashc  Administ     8350 Feb 12 12:02 PowershellMenuPub.ps1
-rw-r--r--    1 avkashc  Administ       73 Feb  6 23:49 README.md

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)
$ git init
Reinitialized existing Git repository in c:/installbox/azureps/toshare/.git/

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)
$ git add PowershellMenuPub.ps1

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)
$ git commit -m "initial commit"
[master 9157958] initial commit
Committer: unknown <**********************************>
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

1 file changed, 28 insertions(+), 5 deletions(-)

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)
$ git remote add myPS https://github.com/Avkash/AzurePowershellmenu.git

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)
$ git push myPS master
Username for 'https://github.com': **********@*****.***
Password for 'https://*********@github.com':#################
Counting objects: 8, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 1.17 KiB, done.
Total 6 (delta 2), reused 0 (delta 0)
To https://github.com/Avkash/AzurePowershellmenu.git
fa6307a..9157958  master -> master

avkashc@AVKASHXPS12 /c/installbox/azureps/toshare (master)

Keywords: PowerShell, Windows Azure, Scripting

Processing unstructured content from a URL in R

R has a built-in function named readLines() which reads the content of a local file or a URL line by line.

For example, my blog URL is http://cloudcelebrity.wordpress.com, so let's read it:

> myblog <- readLines("http://cloudcelebrity.wordpress.com")
Warning message:
In readLines("http://cloudcelebrity.wordpress.com") :
  incomplete final line found on 'http://cloudcelebrity.wordpress.com'

> length(myblog)
[1] 1380

As you can see above, there is a warning message even though myblog does contain all the lines. The warning only means that the last line of the content does not end with a newline. To suppress it, we can pass warn=FALSE as below:

> myblog <- readLines("http://cloudcelebrity.wordpress.com", warn=FALSE)

> length(myblog)
[1] 1380
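
The warning can be reproduced without a network connection. The following is a minimal sketch, assuming a temporary file with made-up content whose last line has no trailing newline; the names tmp, msg, and quiet are illustrative only:

```r
# Write a small file whose final line is NOT terminated by a newline
# (hypothetical sample content, not taken from the blog).
tmp <- tempfile(fileext = ".txt")
cat("first line\nsecond line", file = tmp)

# Capture the warning text instead of letting it print.
msg <- tryCatch(
  {
    readLines(tmp)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)

# With warn = FALSE the same content is read silently.
quiet <- readLines(tmp, warn = FALSE)
length(quiet)
```

Both lines are read either way; warn only controls whether R reports the missing final newline.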

This time there is no warning. If I want to print a specific line based on its line number, I can simply index the vector:

> myblog[100]
[1] "/**/"

Let's get the summary:

> summary(myblog)
   Length     Class      Mode 
     1380 character character 

To read only a limited number of lines from the same URL, I can also pass the maximum number of lines as the second argument (n), as below:

> myblog <- readLines("http://cloudcelebrity.wordpress.com", 100, warn=FALSE)
> summary(myblog)
   Length     Class      Mode 
      100 character character 
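
The line limit also works on an already open connection, where each call continues from where the previous one stopped. Here is a small sketch, assuming an in-memory textConnection with made-up lines in place of the blog URL:

```r
# An in-memory connection with ten made-up lines (stands in for a file or URL).
con <- textConnection(paste("line", 1:10))

# Read the first three lines, naming the n argument for clarity.
first_three <- readLines(con, n = 3)

# A second call on the same open connection picks up at line 4.
next_two <- readLines(con, n = 2)

close(con)
```

Naming the argument (n = 3) instead of passing it by position makes the intent easier to see when warn or other arguments are also supplied.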

After reading all the lines of my blog (without the line limit), let's perform some specific search operations on the content:

Searching all lines with Hadoop or hadoop in them:

To find all the lines which contain Hadoop or hadoop, we can run grep() to get the matching line numbers as below:

> hd <- grep("[hH]adoop", myblog)

Let's print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281

To print all the lines containing Hadoop or hadoop, we can just use:

> myblog[hd]
[1] “<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I:  In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N:  NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning)  and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X:  XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>”
[2] “ttt

[3] “ttt

How Hadoop is shaping up at Disney World?

[4] “ttttttttttttLeave a comment
[5] “tttt

<br/><p> </p>”



[34] “PDRTJS_settings_5386869_post_412={“id”:5386869,”unique_id”:”wp-post-412″,”title”:”Merging%20two%20data%20set%20in%20R%20based%20on%20one%20common%26nbsp%3Bcolumn”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”,”item_id”:”_post_412″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_412 == ‘undefined’ ){PDRTJS_5386869_post_412 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_412 );}}PDRTJS_settings_5386869_post_409={“id”:5386869,”unique_id”:”wp-post-409″,”title”:”Working%20with%20dataset%20in%20R%20and%20using%20subset%20to%20work%20on%26nbsp%3Bdataset”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”,”item_id”:”_post_409″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_409 == ‘undefined’ ){PDRTJS_5386869_post_409 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_409 );}}PDRTJS_settings_5386869_post_398={“id”:5386869,”unique_id”:”wp-post-398″,”title”:”Listing%20base%20datasets%20in%20R%20and%20loading%20as%20Data%26nbsp%3BFrame”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”,”item_id”:”_post_398″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_398 == ‘undefined’ ){PDRTJS_5386869_post_398 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_398 );}}PDRTJS_settings_5386869_post_397={“id”:5386869,”unique_id”:”wp-post-397″,”title”:”ABC%20of%20Data%26nbsp%3BScience”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”,”item_id”:”_post_397″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_397 == ‘undefined’ ){PDRTJS_5386869_post_397 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_397 
);}}PDRTJS_settings_5386869_post_390={“id”:5386869,”unique_id”:”wp-post-390″,”title”:”R%20Programming%20Language%20%28Installation%20and%20configuration%20on%26nbsp%3BWindows%29″,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”,”item_id”:”_post_390″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_390 == ‘undefined’ ){PDRTJS_5386869_post_390 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_390 );}}PDRTJS_settings_5386869_post_382={“id”:5386869,”unique_id”:”wp-post-382″,”title”:”How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”,”item_id”:”_post_382″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_382 == ‘undefined’ ){PDRTJS_5386869_post_382 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_382 );}}PDRTJS_settings_5386869_post_376={“id”:5386869,”unique_id”:”wp-post-376″,”title”:”Hadoop%20Adventures%20with%20Microsoft%26nbsp%3BHDInsight”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”,”item_id”:”_post_376″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_376 == ‘undefined’ ){PDRTJS_5386869_post_376 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_376 );}}”
[35] “ttWPCOM_sharing_counts = {“http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”:412,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”:409,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”:398,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”:397,”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”:390,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”:382,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”:376}t</script>”

I have removed the lines in the middle of the output above to keep the result snippet short.
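
As a variation on the search above, grep() can also return the matching lines directly, or ignore case altogether. This is a small sketch on a made-up character vector (sample_lines is hypothetical, not the blog content):

```r
# Made-up lines standing in for the downloaded page.
sample_lines <- c("Intro to Hadoop", "R basics", "hadoop on Azure", "HADOOP notes")

# Indices of matching lines, as in the hd example.
idx <- grep("[hH]adoop", sample_lines)

# value = TRUE returns the matching lines themselves, like myblog[hd].
vals <- grep("[hH]adoop", sample_lines, value = TRUE)

# ignore.case = TRUE also catches spellings such as "HADOOP".
any_case <- grep("hadoop", sample_lines, ignore.case = TRUE)
```

The character class [hH] only covers the first letter, so an all-caps spelling is matched only by the ignore.case variant.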

Lines 553 through 648 of the page contain a list of all the datasets in R, so I can collect them as follows:

> myLines <- myblog[553:648]
> summary(myLines)
   Length     Class      Mode 
       96 character character 

Note: The myLines character vector above has 96 elements in total, so you can try printing it to see what you get.

Create a list of available datasets from the myLines vector above:

The pattern in myLines is as below:

[1] "AirPassengers Monthly Airline Passenger Numbers 1949-1960"
[2] "BJsales Sales Data with Leading Indicator"
[3] "BOD Biochemical Oxygen Demand"
[4] "CO2 Carbon Dioxide Uptake in Grass Plants"



[92] "treering Yearly Treering Data, -6000-1979"
[93] "trees Girth, Height and Volume for Black Cherry Trees"
[94] "uspop Populations Recorded by the US Census"
[95] "volcano Topographic Information on Auckland’s Maunga"
[96] " Whau Volcano"

So the first word is the dataset name, and everything after the first space is the dataset description. To get only the dataset name, let's use the sub() function to delete everything from the first space onward:

> dsName <- sub(" .*", "", myLines)
> dsName
[1] "AirPassengers" "BJsales" "BOD"
[4] "CO2" "ChickWeight" "DNase"
[7] "EuStockMarkets" "" "Formaldehyde"
[10] "HairEyeColor" "Harman23.cor" "Harman74.cor"
[13] "Indometh" "InsectSprays" "JohnsonJohnson"
[16] "LakeHuron" "LifeCycleSavings" "Loblolly"
[19] "Nile" "Orange" "OrchardSprays"
[22] "PlantGrowth" "Puromycin" "Theoph"
[25] "Titanic" "ToothGrowth" ""
[28] "UCBAdmissions" "UKDriverDeaths" "UKLungDeaths"
[31] "UKgas" "USAccDeaths" "USArrests"
[34] "USJudgeRatings" "" "USPersonalExpenditure"
[37] "VADeaths" "WWWusage" "WorldPhones"
[40] "ability.cov" "airmiles" ""
[43] "airquality" "anscombe" ""
[46] "attenu" "attitude" "austres"
[49] "" "beavers" "cars"
[52] "chickwts" "co2" "crimtab"
[55] "datasets-package" "discoveries" "esoph"
[58] "euro" "eurodist" "faithful"
[61] "freeny" "infert" ""
[64] "iris" "islands" "lh"
[67] "longley" "lynx" "morley"
[70] "mtcars" "nhtemp" "nottem"
[73] "" "occupationalStatus" "precip"
[76] "presidents" "pressure" ""
[79] "quakes" "randu" ""
[82] "rivers" "rock" "sleep"
[85] "stackloss" "state" "sunspot.month"
[88] "sunspot.year" "sunspots" "swiss"
[91] "" "treering" "trees"
[94] "uspop" "volcano" ""

Next work item: dsName has a few empty items (they come from description lines in myLines that wrap and start with a space), so we can clean up the vector.
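
One possible way to do that cleanup is sketched below, assuming a shortened, made-up stand-in for myLines; nzchar() is a base R function that is TRUE for non-empty strings, and dsClean is an illustrative name:

```r
# A shortened stand-in for myLines, including one wrapped continuation line
# that starts with a space.
myLines <- c(
  "AirPassengers Monthly Airline Passenger Numbers 1949-1960",
  "volcano Topographic Information on Auckland's Maunga",
  " Whau Volcano",
  "uspop Populations Recorded by the US Census"
)

# Same extraction as above: drop everything from the first space onward.
# A line that starts with a space therefore becomes an empty string.
dsName <- sub(" .*", "", myLines)

# Keep only the non-empty names.
dsClean <- dsName[nzchar(dsName)]
dsClean
```

An equivalent filter is dsName[dsName != ""]; nzchar() simply states the intent a little more directly.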


Note: readline() in R, unlike readLines(), is used to prompt the user to type something at the console.
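
To make the difference concrete, here is a small sketch; ask_name is a hypothetical helper, and the fallback string is an assumption added so the code also runs in a non-interactive session, where readline() cannot wait for input:

```r
# Hypothetical helper: prompt for a name when a user is at the console.
ask_name <- function() {
  if (interactive()) {
    # readline() waits for one line typed by the user.
    readline(prompt = "Your name: ")
  } else {
    # In scripts run with Rscript there is no user to ask.
    "(non-interactive session)"
  }
}

answer <- ask_name()
```

readLines() reads lines from a file, URL, or other connection, while readline() reads a single line typed by the user.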