Data Analysis

1 Comment Posted on May 9, 2013 Big Data, Data Analysis, Data Visualization, Platfora

IMDB Movie Dataset Visualization using Platfora

Here are some of the cool visualization of IMDB Movie Dataset (60,000 records from 1893-2004) using Platfora…

Top movies based on 8.0 rating and highest voting:

Total yearly budget and movie production between 1893-2004

Total movies produced from 1893-2004

Total yearly budget for movies production from 1893-2004

Total movies produced based on MPAA Ratings between 1893-2004

Total movies based on MPAA rating between 1893-2004

Fact: In 1942, total 100 animation (of all length) produced comparative to only 94 in 2003

All time, top 9+ rating movies total (The data for year 2005 is incomplete):

2 Comments Posted on March 6, 2013 Data Analysis, Hadoop, MapReduce

Cloudera Impala Hands-on Video

Cloudera Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in (near) real time.

Learn more: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Download the Impala virtual Machine: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Impala+Demo+VM

5 Comments Posted on February 20, 2013 Big Data, Data Analysis, Data Visualization, Uncategorized

Visualizing Social Network Data From Twitter using Gephi

This 3 part series recording is available here:

Part 1:

Part 2:

Part 3:

Gephi: http://gephi.org/

Keywords: Data Visualization, Twitter, Gephi

Leave a comment Posted on February 20, 2013 Big Data, Computing and Cloud, Data Analysis, Data Visualization, Uncategorized

Visualizing Social Network Data based on Twitter #Hashtag using NodeXL

In the live demo you will learn how to visualizing social network data using NodeXL. NodeXL is an open source addon to Excel great to download social network data from Twitter, Facebook or flicker. In this example I am using twitter #Hashtag to collect social network data.

NodeXL: http://nodexl.codeplex.com/

Leave a comment Posted on February 12, 2013 Data Analysis, Development, R

Processing unstructured content from a URL in R

R has a built in function name readLines() which read a local file or an URL to read content line by line.

For example my blog URL is http://cloudcelebrity.wordpress.com so lets read it:

> myblog <- readLines(“http://cloudcelebrity.wordpress.com”)
Warning message:
In readLines(“http://cloudcelebrity.wordpress.com”) :
incomplete final line found on ‘http://cloudcelebrity.wordpress.com’

> length(myblog)
[1] 1380

As you can see above there is a warning message even when the myblog does have all the lines in it. To disable this warning message we can use “warn=FALSE” as below:

> myblog <- readLines(“http://cloudcelebrity.wordpress.com”, warn=FALSE)

> length(myblog)
[1] 1380

And above there are no warning.. if I would want to print special lines based on line number I can just call the

> myblog[100]
[1] “/**/”

Lets get the summary:

> summary(myblog)
Length Class Mode
1380 character character

To read only limited lines in the same URL , I can also pass the total line limit as below:

> myblog <- readLines(“http://cloudcelebrity.wordpress.com”, 100, warn=FALSE)
> summary(myblog)
Length Class Mode
100 character character

After I read all the lines in my blog, lets perform some specific search operation in the content:

Searching all lines with Hadoop or hadoop in it:

To search all the lines which have Hadoop or hadoop in it we can run grep command to find all the line numbers as below:

> hd <- grep(“[hH]adoop”, myblog)

Lets print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281

To print all the lines with Hadoop or hadoop in it we can just use:

> myblog[hd]
[1] “A: ACID – Atomicity, Consistency, Isolation and Durability B: Big Data – Volume, Velocity, Variety C: Columnar (or Column-Oriented) Database D: Data Warehousing – Relevant and very useful E: ETL – Extract, transform and load F: Flume – A framework for populating Hadoop with data G: Geospatial Analysis – A picture worth 1,000 words or more H: Hadoop, HDFS, HBASE – Do you really want to know? I: In-Memory Database – A new definition of superfast access J: Java – Hadoop gave biggest push in last years to stay in enterprise market K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn L: Latency – Low Latency and High Latency M: Map/Reduce – MapReduce N: NoSQL Databases – No SQL Database or Not Only SQL O: Oozie – Open-source workflow engine managing Hadoop job processing P: Pig – Platform for analyzing huge data sets Q: Quantitative Data Analysis R: Relational Database – Still relevant and will be for some time S: Sharding (Database Partitioning) and Sqoop (SQL Database to Hadoop) T: Text Analysis – Larger the information, more needed analysis U: Unstructured Data – Growing faster than speed of thoughts V: Visualization – Important to keep the information relevant W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors X: XML – Still eXtensible and no Introduction needed Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes Z: Zookeeper – Help managing Hadoop nodes across a distributed network ”
[2] “ttt

”
[3] “ttt

How Hadoop is shaping up at Disney World?

”
[4] “ttttcloudcelebrityttttttttLeave a comment”
[5] “tttt

”

……………….

………………

[34] “PDRTJS_settings_5386869_post_412={“id”:5386869,”unique_id”:”wp-post-412″,”title”:”Merging%20two%20data%20set%20in%20R%20based%20on%20one%20common%26nbsp%3Bcolumn”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”,”item_id”:”_post_412″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_412 == ‘undefined’ ){PDRTJS_5386869_post_412 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_412 );}}PDRTJS_settings_5386869_post_409={“id”:5386869,”unique_id”:”wp-post-409″,”title”:”Working%20with%20dataset%20in%20R%20and%20using%20subset%20to%20work%20on%26nbsp%3Bdataset”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”,”item_id”:”_post_409″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_409 == ‘undefined’ ){PDRTJS_5386869_post_409 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_409 );}}PDRTJS_settings_5386869_post_398={“id”:5386869,”unique_id”:”wp-post-398″,”title”:”Listing%20base%20datasets%20in%20R%20and%20loading%20as%20Data%26nbsp%3BFrame”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”,”item_id”:”_post_398″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_398 == ‘undefined’ ){PDRTJS_5386869_post_398 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_398 );}}PDRTJS_settings_5386869_post_397={“id”:5386869,”unique_id”:”wp-post-397″,”title”:”ABC%20of%20Data%26nbsp%3BScience”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”,”item_id”:”_post_397″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_397 == ‘undefined’ ){PDRTJS_5386869_post_397 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_397 );}}PDRTJS_settings_5386869_post_390={“id”:5386869,”unique_id”:”wp-post-390″,”title”:”R%20Programming%20Language%20%28Installation%20and%20configuration%20on%26nbsp%3BWindows%29″,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”,”item_id”:”_post_390″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_390 == ‘undefined’ ){PDRTJS_5386869_post_390 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_390 );}}PDRTJS_settings_5386869_post_382={“id”:5386869,”unique_id”:”wp-post-382″,”title”:”How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”,”item_id”:”_post_382″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_382 == ‘undefined’ ){PDRTJS_5386869_post_382 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_382 );}}PDRTJS_settings_5386869_post_376={“id”:5386869,”unique_id”:”wp-post-376″,”title”:”Hadoop%20Adventures%20with%20Microsoft%26nbsp%3BHDInsight”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”,”item_id”:”_post_376″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_376 == ‘undefined’ ){PDRTJS_5386869_post_376 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_376 );}}”
[35] “ttWPCOM_sharing_counts = {“http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”:412,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”:409,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”:398,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”:397,”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”:390,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”:382,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”:376}t</script>”

Above I have just removed the lines in middle to show the result snippet.

In the above If I try to collect lines between 553 .. 648, there is list of all dataset in R, so to collect I can do the following:

> myLines <- myblog[553:648]
> summary(myLines)
Length Class Mode
96 character character

Note: Above mylines character list has total 110 lines so you can try printing and see what you get.

Create a list of available dataset from above myLines vector:

The pattern in myLines is as below:

[1] “AirPassengers Monthly Airline Passenger Numbers 1949-1960”
[2] “BJsales Sales Data with Leading Indicator”
[3] “BOD Biochemical Oxygen Demand”
[4] “CO2 Carbon Dioxide Uptake in Grass Plants”

……….

………

[92] “treering Yearly Treering Data, -6000-1979”
[93] “trees Girth, Height and Volume for Black Cherry Trees”
[94] “uspop Populations Recorded by the US Census”
[95] “volcano Topographic Information on Auckland’s Maunga”
[96] ” Whau Volcano”

so the first word is dataset name and after the space is the dataset information. To get the dataset name only lets use sub function as below:

> dsName <- sub(” .*”, “”, myLines)
> dsName
[1] “AirPassengers” “BJsales” “BOD”
[4] “CO2” “ChickWeight” “DNase”
[7] “EuStockMarkets” “” “Formaldehyde”
[10] “HairEyeColor” “Harman23.cor” “Harman74.cor”
[13] “Indometh” “InsectSprays” “JohnsonJohnson”
[16] “LakeHuron” “LifeCycleSavings” “Loblolly”
[19] “Nile” “Orange” “OrchardSprays”
[22] “PlantGrowth” “Puromycin” “Theoph”
[25] “Titanic” “ToothGrowth” “”
[28] “UCBAdmissions” “UKDriverDeaths” “UKLungDeaths”
[31] “UKgas” “USAccDeaths” “USArrests”
[34] “USJudgeRatings” “” “USPersonalExpenditure”
[37] “VADeaths” “WWWusage” “WorldPhones”
[40] “ability.cov” “airmiles” “”
[43] “airquality” “anscombe” “”
[46] “attenu” “attitude” “austres”
[49] “” “beavers” “cars”
[52] “chickwts” “co2” “crimtab”
[55] “datasets-package” “discoveries” “esoph”
[58] “euro” “eurodist” “faithful”
[61] “freeny” “infert” “”
[64] “iris” “islands” “lh”
[67] “longley” “lynx” “morley”
[70] “mtcars” “nhtemp” “nottem”
[73] “” “occupationalStatus” “precip”
[76] “presidents” “pressure” “”
[79] “quakes” “randu” “”
[82] “rivers” “rock” “sleep”
[85] “stackloss” “state” “sunspot.month”
[88] “sunspot.year” “sunspots” “swiss”
[91] “” “treering” “trees”
[94] “uspop” “volcano” “”

Next work item: mylines does has a few empty item so we can clean the array.

Note: Readline in R is used to prompt user to input something in console.

Leave a comment Posted on January 30, 2013 Data Analysis, Development, R

Merging two data set in R based on one common column

Let’s create a new dataset using mtcars dataset and only mpg and hp column:

> cars.mpg <- subset(mtcars, select = c(mpg, hp))
 
> cars.mpg
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
…………..

Let’s create another dataset using mtcars dataset and only hp and cyl column:

> cars.cyl <- subset(mtcars, select = c(hp,cyl))
> cars.cyl
                     hp cyl
Mazda RX4           110   6
Mazda RX4 Wag       110   6
Datsun 710           93   4
Hornet 4 Drive      110   6
Hornet Sportabout   175   8
Valiant             105   6
Duster 360          245   8
Merc 240D            62   4
Merc 230             95   4
Merc 280            123   6
Merc 280C           123   6
Merc 450SE          180   8
Merc 450SL          180   8
Merc 450SLC         180   8
…………………..

Now we can merge both dataset based on common column hp as below:

 
> merge.ds <- merge(cars.mpg, cars.cyl, by="hp")
> merge.ds
    hp  mpg cyl
1   52 30.4   4
2   62 24.4   4
3   65 33.9   4
4   66 32.4   4
5   66 32.4   4
6   66 27.3   4
7   66 27.3   4
8   91 26.0   4
9   93 22.8   4
10  95 22.8   4
11  97 21.5   4
12 105 18.1   6
13 109 21.4   4
14 110 21.0   6
15 110 21.0   6
16 110 21.0   6
17 110 21.0   6
18 110 21.0   6
19 110 21.0   6
20 110 21.4   6
21 110 21.4   6
22 110 21.4   6
23 113 30.4   4
24 123 17.8   6
25 123 17.8   6
26 123 19.2   6
27 123 19.2   6
28 150 15.2   8
29 150 15.2   8
30 150 15.5   8
31 150 15.5   8
32 175 18.7   8
33 175 18.7   6
34 175 18.7   8
35 175 19.7   8
36 175 19.7   6
37 175 19.7   8
38 175 19.2   8
39 175 19.2   6
40 175 19.2   8
41 180 16.4   8
42 180 16.4   8
43 180 16.4   8
44 180 17.3   8
45 180 17.3   8
46 180 17.3   8
47 180 15.2   8
48 180 15.2   8
49 180 15.2   8
50 205 10.4   8
51 215 10.4   8
52 230 14.7   8
53 245 14.3   8
54 245 14.3   8
55 245 13.3   8
56 245 13.3   8
57 264 15.8   8
58 335 15.0   8

Why you see total 58 merged rows when there were only 32 rows in original data sets?

This is because "merge is a generic function whose principal method is for data frames: the default method coerces its arguments to data frames and calls the "data.frame" method."
Learn more by calling
?merge

Leave a comment Posted on January 30, 2013 Data Analysis, Development, R

Working with dataset in R and using subset to work on dataset

Let’s load built in data set

>Library(datasets)

Let’s access one of the dataset name mtcars:

>mtcars

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4

Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3

Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3

Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3

Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4

Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2

AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4

Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1

Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4

Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

So what is this data set? Running str(_dataset_name) will tell you more about it:

> str(mtcars)
'data.frame':  32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let’s see the top of the data frame using head

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also see the tail as well:

> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

So how many rows are in this dataset?

> nrow(mtcars)
[1] 32

So how many columns are in this dataset?

> ncol(mtcars)
[1] 11

Ok can we get the name of column headers? Sure

> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

So let’s see the summary for our dataset:

> summary(mtcars)

      mpg             cyl             disp             hp             drat     
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760 
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080 
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695 
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597 
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920 
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930 
       wt             qsec             vs               am       
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000 
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000 
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000 
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062 
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000 
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000 
      gear            carb     
 Min.   :3.000   Min.   :1.000 
 1st Qu.:3.000   1st Qu.:2.000 
 Median :4.000   Median :2.000 
 Mean   :3.688   Mean   :2.812 
 3rd Qu.:4.000   3rd Qu.:4.000 
 Max.   :5.000   Max.   :8.000

As you can see above the summary give you the mean, media, max and other values for each column.

Using Subset with Dataset in R

How to list data frame with a few columns? Sure we will be using subset() for this:

Subset(dataframe, condition, format)

For example the following subset will show all the rows in mtcrs but only mpg column:

> subset(mtcars, select = c(mpg))
                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
Merc 450SLC         15.2
Cadillac Fleetwood  10.4
Lincoln Continental 10.4
Chrysler Imperial   14.7
Fiat 128            32.4
Honda Civic         30.4
Toyota Corolla      33.9
Toyota Corona       21.5
Dodge Challenger    15.5
AMC Javelin         15.2
Camaro Z28          13.3
Pontiac Firebird    19.2
Fiat X1-9           27.3
Porsche 914-2       26.0
Lotus Europa        30.4
Ford Pantera L      15.8
Ferrari Dino        19.7
Maserati Bora       15.0
Volvo 142E          21.4

Now what if we want to see the list of cars which have mpg above 20 along with mpg and hp column:

> subset(mtcars, mpg > 20, select = c(mpg, hp))
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Datsun 710     22.8  93
Hornet 4 Drive 21.4 110
Merc 240D      24.4  62
Merc 230       22.8  95
Fiat 128       32.4  66
Honda Civic    30.4  52
Toyota Corolla 33.9  65
Toyota Corona  21.5  97
Fiat X1-9      27.3  66
Porsche 914-2  26.0  91
Lotus Europa   30.4 113
Volvo 142E     21.4 109

We can also call subset within subset.

For example we need to get a list of cars with HP above 100 and mpg above 20.

> subset(subset(mtcars, mpg > 20, select = c(mpg, hp)),hp >= 100)
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Hornet 4 Drive 21.4 110
Lotus Europa   30.4 113
Volvo 142E     21.4 109

Note: following will not work as condition will not match and you will see a full list of cars with mpg and hp column
> subset(mtcars, mpg > 20 && hp >= 100, select = c(mpg, hp))

You can also use subset to show specific columns
> subset(mtcars[1:3])
                     mpg cyl  disp
Mazda RX4           21.0   6 160.0
Mazda RX4 Wag       21.0   6 160.0
Datsun 710          22.8   4 108.0
Hornet 4 Drive      21.4   6 258.0
Hornet Sportabout   18.7   8 360.0

…..

Or you can defined the columns name as below:

> subset(mtcars[c("mpg", “cyl”, "hp")])

1 Comment Posted on November 13, 2012 Big Data, Data Analysis, Hadoop

How Hadoop is shaping up at Disney World?

Hadoop World 2011: Advancing Disney’s Data Infrastructure with Hadoop – Matt Estes, Disney

How Disney built a big data platform on a startup budget

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

Leave a comment Posted on May 19, 2012 Data Analysis, Machine Learning

Pattern Analysis Computation Methods and Algorithms for Machine Learning

In machine learning, pattern recognition is the assignment of a label to a given input value. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is “spam” or “non-spam”). However, pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to do “fuzzy” matching of inputs. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors. In contrast to pattern recognition, pattern matching is generally not considered a type of machine learning, although pattern-matching algorithms (especially with fairly general, carefully tailored patterns) can sometimes succeed in providing similar-quality output to the sort provided by pattern-recognition algorithms.

Pattern Analysis Computation Methods:

Ridge regression
Regularized Fisher discriminant
Regularized kernel Fisher discriminant
Maximizing variance
Maximizing covariance
Canonical correlation analysis
Kernel CCA
Regularized CCA
Kernel regularized CCA
Smallest enclosing hyper sphere
Soft minimal hyper sphere
nu-soft minimal hyper sphere
Hard margin SVM
1-norm soft margin SVM
2-norm soft margin SVM
Ridge regression optimization
Quadratic e-insensitive
Linear e-insensitive SVR
nu-SVR
Soft ranking
Cluster quality
Cluster optimization strategy
Multiclass clustering
Relaxed multiclass clustering
Visualization quality

Pattern Analysis Algorithms:

Normalization
Centering data
Simple novelty detection
Parzen based classifier
Cholesky decomposition or dual Gram�Schmidt
Standardizing data
Kernel Fisher discriminant
Primal PCA
Kernel PCA
Whitening
Primal CCA
Kernel CCA
Principal components regression
PLS feature extraction
Primal PLS
Kernel PLS
Smallest hyper sphere enclosing data
Soft hyper sphere minimization
nu-soft minimal hyper sphere
Hard margin SVM
Alternative hard margin SVM
1-norm soft margin SVM
nu-SVM
2-norm soft margin SVM
Kernel ridge regression
2-norm SVR
1-norm SVR
nu-support vector regression
Kernel perceptron
Kernel adatron
On-line SVR
nu-ranking
On-line ranking
Kernel k-means
MDS for kernel-embedded data
Data visualization

Source: http://www.kernel-methods.net/algos.html

Keywords: Data Mining, Machine Learning, Algorithms

Everything Artificial Intelligence

Everything Artificial Intelligence

Champion yourself into AI (Blog by @avkashchauhan)

Data Analysis

IMDB Movie Dataset Visualization using Platfora

Cloudera Impala Hands-on Video

Visualizing Social Network Data From Twitter using Gephi

Visualizing Social Network Data based on Twitter #Hashtag using NodeXL

Processing unstructured content from a URL in R

How Hadoop is shaping up at Disney World?

Merging two data set in R based on one common column

Working with dataset in R and using subset to work on dataset

How Hadoop is shaping up at Disney World?

Hadoop World 2011: Advancing Disney’s Data Infrastructure with Hadoop – Matt Estes, Disney

How Disney built a big data platform on a startup budget

Pattern Analysis Computation Methods and Algorithms for Machine Learning