The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. It provides an interface to H2O’s high-performance, distributed machine learning algorithms on Spark from R. Visit the GitHub project: https://github.com/h2oai/rsparkling
You must have the following package installed in your R environment:
If we just try to add these two data frames blindly as below:
final = df2.rbind(df1)
We will get the following error:
H2OValueError: Cannot row-bind a dataframe with 6 columns to a data frame with 4 columns: the columns must match
So to merge two data sets with different columns, we need to prepare our datasets to meet rbind’s requirement. First we will add the columns that are missing from “df1” (but present in “df2”) as below:
df1['C10'] = 0
df1['C20'] = 0
The updated data frame looks like below:

C1   C2   C3   C4   C10  C20
10   20   30   40   0    0
3    4    5    6    0    0
5    7    8    9    0    0
12   3    55   10   0    0
Now we will rbind “df2” onto “df1” as below:
df1 = df1.rbind(df2)
Now “df1” looks like below:

C1   C2   C3   C4   C10  C20
10   20   30   40   0    0
3    4    5    6    0    0
5    7    8    9    0    0
12   3    55   10   0    0
10   20   30   40   33   44
3    4    5    6    11   22
5    7    8    9    90   100
12   3    55   10   33   44
If you are using R, you just need to do the following to add the new columns to your first data frame:
df1$C10 = 0
df1$C20 = 0
You must make sure the number of columns matches before doing rbind, and the number of rows matches before doing cbind.
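The pad-then-rbind flow above can be sketched in plain Python, with dictionaries standing in for the H2O frames (no cluster needed). The `rbind` helper here is a hypothetical stand-in that only mimics H2O's column check; the values mirror the tables above:

```python
# Illustrative stand-ins for the H2O frames: column name -> list of values
df1 = {"C1": [10, 3, 5, 12], "C2": [20, 4, 7, 3],
       "C3": [30, 5, 8, 55], "C4": [40, 6, 9, 10]}
df2 = {"C1": [10, 3, 5, 12], "C2": [20, 4, 7, 3],
       "C3": [30, 5, 8, 55], "C4": [40, 6, 9, 10],
       "C10": [33, 11, 90, 33], "C20": [44, 22, 100, 44]}

def rbind(a, b):
    # Like H2O's rbind: refuse to stack frames whose columns differ
    if set(a) != set(b):
        raise ValueError("Cannot row-bind: the columns must match")
    return {col: a[col] + b[col] for col in a}

# Pad df1 with the columns it is missing, filled with 0, then bind
n_rows = len(df1["C1"])
for col in df2:
    if col not in df1:
        df1[col] = [0] * n_rows

combined = rbind(df1, df2)
```

Without the padding step, `rbind` raises the same kind of column-mismatch error shown earlier.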
Note: Before we jump in, make sure PostgreSQL is up and running and the database is ready to respond to your queries. Check that your queries return records and are not null.
In the following test I am connecting to the DVD Rental DB, which is available in PostgreSQL. If you need help getting it working, visit Here and Here.
Test R (RStudio) for the postgresql connection working:
# Install package if you don't have it
> install.packages("RPostgreSQL")
# Use package RPostgreSQL
> library(RPostgreSQL)
# Code to test database and table:
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, dbname = "dvdrentaldb", host = "localhost", port = 5432,
+                  user = "avkash", password = "avkash")
> dbExistsTable(con, "actor")
TRUE
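For readers working in Python, the same kind of table-existence check can be sketched with the standard-library sqlite3 module (an in-memory database stands in for the PostgreSQL server here; against Postgres itself you would use a driver such as psycopg2). The `db_exists_table` helper is a hypothetical analogue of RPostgreSQL's dbExistsTable(), and the "actor" table is carried over from the example above:

```python
import sqlite3

def db_exists_table(con, table):
    """Rough analogue of RPostgreSQL's dbExistsTable()."""
    cur = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    )
    return cur.fetchone() is not None

# In-memory database standing in for the real server
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT)")

print(db_exists_table(con, "actor"))  # True
print(db_exists_table(con, "film"))   # False
```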
As you can see above, readLines() produces a warning even though myblog does contain all of the lines (typically an “incomplete final line” warning when the source lacks a trailing newline). To disable this warning we can pass warn=FALSE to readLines().
After reading all the lines of my blog, let’s perform some specific search operations on the content.
Searching all lines with Hadoop or hadoop in it:
To find all the lines which have Hadoop or hadoop in them, we can run the grep command to get the matching line numbers as below:
> hd <- grep("[hH]adoop", myblog)
Let’s print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281
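The same search can be sketched in Python with the re module. Note that Python indices are 0-based where R's are 1-based; the short `myblog` list here is an illustrative stand-in, not the real blog content:

```python
import re

# Small stand-in for the myblog vector of lines read from the page
myblog = [
    "Welcome to the blog",
    "Hadoop is a distributed framework",
    "R is great for statistics",
    "Running hadoop jobs at scale",
]

# Equivalent of R's grep("[hH]adoop", myblog): indices of matching lines
hd = [i for i, line in enumerate(myblog) if re.search(r"[hH]adoop", line)]
print(hd)  # [1, 3]
```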
To print all the lines with Hadoop or hadoop in it we can just use:
> myblog[hd]
[1] "<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I: In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N: NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning) and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X: XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>"
...
Above I have trimmed the output to show just a snippet of the result; the later matches come mostly from JavaScript embedded in the page template rather than from the post text.
In the above, the lines between 553 and 648 hold the list of all datasets in R, so to collect them I can do the following:
> myLines <- myblog[553:648]
> summary(myLines)
Length Class Mode
96 character character
Note: The myLines character vector above has 96 entries (553 through 648 inclusive), so you can try printing it to see what you get.
Create a list of available dataset from above myLines vector:
The pattern in myLines is as below:
[1] "AirPassengers Monthly Airline Passenger Numbers 1949-1960"
[2] "BJsales Sales Data with Leading Indicator"
[3] "BOD Biochemical Oxygen Demand"
[4] "CO2 Carbon Dioxide Uptake in Grass Plants"
...
[92] "treering Yearly Treering Data, -6000-1979"
[93] "trees Girth, Height and Volume for Black Cherry Trees"
[94] "uspop Populations Recorded by the US Census"
[95] "volcano Topographic Information on Auckland's Maunga"
[96] " Whau Volcano"
So the first word is the dataset name, and everything after the space is the dataset description. To extract just the dataset names, let’s use the sub() function as below:
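The first-word extraction can also be sketched in Python with re.sub, mirroring what R's sub(" .*", "", myLines) would do (the sample entries are taken from the listing above and are only a stand-in for the full vector):

```python
import re

# Sample entries from the myLines listing above
myLines = [
    "AirPassengers Monthly Airline Passenger Numbers 1949-1960",
    "BJsales Sales Data with Leading Indicator",
    "BOD Biochemical Oxygen Demand",
]

# Delete everything from the first space onward, keeping only the
# dataset name (equivalent to R's sub(" .*", "", myLines))
names = [re.sub(r" .*", "", line) for line in myLines]
print(names)  # ['AirPassengers', 'BJsales', 'BOD']
```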