Cloudera Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in (near) real time.
In the live demo you will learn how to visualizing social network data using NodeXL. NodeXL is an open source addon to Excel great to download social network data from Twitter, Facebook or flicker. In this example I am using twitter #Hashtag to collect social network data.
As you can see above there is a warning message even when the myblog does have all the lines in it. To disable this warning message we can use “warn=FALSE” as below:
After I read all the lines in my blog, lets perform some specific search operation in the content:
Searching all lines with Hadoop or hadoop in it:
To search all the lines which have Hadoop or hadoop in it we can run grep command to find all the line numbers as below:
> hd <- grep(“[hH]adoop”, myblog)
Lets print hd to see all the line numbers: > hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281
To print all the lines with Hadoop or hadoop in it we can just use:
> myblog[hd] [1] “<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I: In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N: NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning) and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X: XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>” [2] “ttt
[34] “PDRTJS_settings_5386869_post_412={“id”:5386869,”unique_id”:”wp-post-412″,”title”:”Merging%20two%20data%20set%20in%20R%20based%20on%20one%20common%26nbsp%3Bcolumn”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”,”item_id”:”_post_412″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_412 == ‘undefined’ ){PDRTJS_5386869_post_412 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_412 );}}PDRTJS_settings_5386869_post_409={“id”:5386869,”unique_id”:”wp-post-409″,”title”:”Working%20with%20dataset%20in%20R%20and%20using%20subset%20to%20work%20on%26nbsp%3Bdataset”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”,”item_id”:”_post_409″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_409 == ‘undefined’ ){PDRTJS_5386869_post_409 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_409 );}}PDRTJS_settings_5386869_post_398={“id”:5386869,”unique_id”:”wp-post-398″,”title”:”Listing%20base%20datasets%20in%20R%20and%20loading%20as%20Data%26nbsp%3BFrame”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”,”item_id”:”_post_398″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_398 == ‘undefined’ ){PDRTJS_5386869_post_398 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_398 );}}PDRTJS_settings_5386869_post_397={“id”:5386869,”unique_id”:”wp-post-397″,”title”:”ABC%20of%20Data%26nbsp%3BScience”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”,”item_id”:”_post_397″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_397 == ‘undefined’ ){PDRTJS_5386869_post_397 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_397 );}}PDRTJS_settings_5386869_post_390={“id”:5386869,”unique_id”:”wp-post-390″,”title”:”R%20Programming%20Language%20%28Installation%20and%20configuration%20on%26nbsp%3BWindows%29″,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”,”item_id”:”_post_390″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_390 == ‘undefined’ ){PDRTJS_5386869_post_390 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_390 );}}PDRTJS_settings_5386869_post_382={“id”:5386869,”unique_id”:”wp-post-382″,”title”:”How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”,”item_id”:”_post_382″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_382 == ‘undefined’ ){PDRTJS_5386869_post_382 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_382 );}}PDRTJS_settings_5386869_post_376={“id”:5386869,”unique_id”:”wp-post-376″,”title”:”Hadoop%20Adventures%20with%20Microsoft%26nbsp%3BHDInsight”,”permalink”:”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”,”item_id”:”_post_376″}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_376 == ‘undefined’ ){PDRTJS_5386869_post_376 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_376 );}}” [35] “ttWPCOM_sharing_counts = {“http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/merging-two-data-set-in-r-based-on-one-common-column\/”:412,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/30\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\/”:409,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/19\/listing-base-datasets-in-r-and-loading-as-data-frame\/”:398,”http:\/\/cloudcelebrity.wordpress.com\/2013\/01\/01\/abc-of-data-science\/”:397,”http:\/\/cloudcelebrity.wordpress.com\/2012\/12\/18\/r-programming-language-installation-and-configuration-on-windows\/”:390,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/13\/how-hadoop-is-shaping-up-at-disney-world\/”:382,”http:\/\/cloudcelebrity.wordpress.com\/2012\/11\/03\/hadoop-adventures-with-microsoft-hdinsight\/”:376}t</script>”
Above I have just removed the lines in middle to show the result snippet.
In the above If I try to collect lines between 553 .. 648, there is list of all dataset in R, so to collect I can do the following:
> myLines <- myblog[553:648] > summary(myLines)
Length Class Mode
96 character character
Note: Above mylines character list has total 110 lines so you can try printing and see what you get.
Create a list of available dataset from above myLines vector:
The pattern in myLines is as below:
[1] “AirPassengers Monthly Airline Passenger Numbers 1949-1960”
[2] “BJsales Sales Data with Leading Indicator”
[3] “BOD Biochemical Oxygen Demand”
[4] “CO2 Carbon Dioxide Uptake in Grass Plants”
……….
………
[92] “treering Yearly Treering Data, -6000-1979”
[93] “trees Girth, Height and Volume for Black Cherry Trees”
[94] “uspop Populations Recorded by the US Census”
[95] “volcano Topographic Information on Auckland’s Maunga”
[96] ” Whau Volcano”
so the first word is dataset name and after the space is the dataset information. To get the dataset name only lets use subfunction as below:
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000
gear carb
Min. :3.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000
Median :4.000 Median :2.000
Mean :3.688 Mean :2.812
3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :8.000
As you can see above the summary give you the mean, media, max and other values for each column.
Using Subset with Dataset in R
How to list data frame with a few columns? Sure we will be using subset() for this:
Subset(dataframe, condition, format)
For example the following subset will show all the rows in mtcrs but only mpg column:
> subset(mtcars, select = c(mpg))
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
Valiant 18.1
Duster 360 14.3
Merc 240D 24.4
Merc 230 22.8
Merc 280 19.2
Merc 280C 17.8
Merc 450SE 16.4
Merc 450SL 17.3
Merc 450SLC 15.2
Cadillac Fleetwood 10.4
Lincoln Continental 10.4
Chrysler Imperial 14.7
Fiat 128 32.4
Honda Civic 30.4
Toyota Corolla 33.9
Toyota Corona 21.5
Dodge Challenger 15.5
AMC Javelin 15.2
Camaro Z28 13.3
Pontiac Firebird 19.2
Fiat X1-9 27.3
Porsche 914-2 26.0
Lotus Europa 30.4
Ford Pantera L 15.8
Ferrari Dino 19.7
Maserati Bora 15.0
Volvo 142E 21.4
Now what if we want to see the list of cars which have mpg above 20 along with mpg and hp column:
For example we need to get a list of cars with HP above 100 and mpg above 20.
> subset(subset(mtcars, mpg > 20, select = c(mpg, hp)),hp >= 100)
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Hornet 4 Drive 21.4 110
Lotus Europa 30.4 113
Volvo 142E 21.4 109
Note: following will not work as condition will not match and you will see a full list of cars with mpg and hp column
> subset(mtcars, mpg > 20 && hp >= 100, select = c(mpg, hp))
You can also use subset to show specific columns
> subset(mtcars[1:3])
mpg cyl disp
Mazda RX4 21.0 6 160.0
Mazda RX4 Wag 21.0 6 160.0
Datsun 710 22.8 4 108.0
Hornet 4 Drive 21.4 6 258.0
Hornet Sportabout 18.7 8 360.0
In machine learning, pattern recognition is the assignment of a label to a given input value. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is “spam” or “non-spam”). However, pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to do “fuzzy” matching of inputs. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors. In contrast to pattern recognition, pattern matching is generally not considered a type of machine learning, although pattern-matching algorithms (especially with fairly general, carefully tailored patterns) can sometimes succeed in providing similar-quality output to the sort provided by pattern-recognition algorithms.
Pattern Analysis Computation Methods:
Ridge regression
Regularized Fisher discriminant
Regularized kernel Fisher discriminant
Maximizing variance
Maximizing covariance
Canonical correlation analysis
Kernel CCA
Regularized CCA
Kernel regularized CCA
Smallest enclosing hyper sphere
Soft minimal hyper sphere
nu-soft minimal hyper sphere
Hard margin SVM
1-norm soft margin SVM
2-norm soft margin SVM
Ridge regression optimization
Quadratic e-insensitive
Linear e-insensitive SVR
nu-SVR
Soft ranking
Cluster quality
Cluster optimization strategy
Multiclass clustering
Relaxed multiclass clustering
Visualization quality
Pattern Analysis Algorithms:
Normalization
Centering data
Simple novelty detection
Parzen based classifier
Cholesky decomposition or dual Gram�Schmidt
Standardizing data
Kernel Fisher discriminant
Primal PCA
Kernel PCA
Whitening
Primal CCA
Kernel CCA
Principal components regression
PLS feature extraction
Primal PLS
Kernel PLS
Smallest hyper sphere enclosing data
Soft hyper sphere minimization
nu-soft minimal hyper sphere
Hard margin SVM
Alternative hard margin SVM
1-norm soft margin SVM
nu-SVM
2-norm soft margin SVM
Kernel ridge regression
2-norm SVR
1-norm SVR
nu-support vector regression
Kernel perceptron
Kernel adatron
On-line SVR
nu-ranking
On-line ranking
Kernel k-means
MDS for kernel-embedded data
Data visualization
Source: http://www.kernel-methods.net/algos.html
Keywords: Data Mining, Machine Learning, Algorithms