Calculating Standard Deviation using custom UDF and group by in H2O

Here is the full code to calculate standard deviation using the H2O group-by method as well as a custom UDF:

library(h2o)
h2o.init()
irisPath <- system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex <- h2o.uploadFile(path = irisPath, destination_frame = "iris.hex")

# Calculating Standard Deviation using h2o group by
SdValue <- h2o.group_by(data = iris.hex, by = "class", sd("sepal_len"))

# Printing result
SdValue

# Alternative: defining a UDF for Standard Deviation
mySDUdf <- function(df) { sd(df[, 1], na.rm = TRUE) }

# Using h2o ddply with UDF
SdValue <- h2o.ddply(iris.hex, "class", mySDUdf)

# Printing result
SdValue
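
You can also compute several statistics in one pass, since h2o.group_by() accepts multiple aggregations; a minimal sketch using the same iris.hex frame:

# Mean, standard deviation and row count per class in a single group-by
h2o.group_by(data = iris.hex, by = "class", mean("sepal_len"), sd("sepal_len"), nrow("sepal_len"))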

That's it, enjoy!!


Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that provides an R front-end for the Sparkling Water Spark package from H2O. This gives you an interface to H2O's high-performance, distributed machine learning algorithms on Spark, using R. Visit the github project: https://github.com/h2oai/rsparkling

You must have the sparklyr and rsparkling packages installed in your R environment, and the latest Sparkling Water package downloaded and unzipped locally.

I am using the following versions in my environment:

  • Spark 2.1
  • Sparkling Water 2.1.8
  • sparklyr 0.4.4
  • rsparkling 0.2.0

Now here is the rsparkling script to create the cluster locally:

# Point rsparkling at the local Sparkling Water assembly jar
options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
print(sc)
# Start an H2OContext on the Spark cluster, then open the H2O Flow UI
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)
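
Once the H2OContext is up (and before spark_disconnect), you can move data between Spark and H2O from R. A minimal sketch using rsparkling's as_h2o_frame(); the mtcars copy is just an illustration:

library(dplyr)
# Copy a local R data frame to Spark, then convert the Spark DataFrame to an H2O frame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl)
head(mtcars_h2o)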

Now here is the rsparkling script to connect to a Spark cluster on YARN:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 2
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

That's it, enjoy!!

Splitting an H2O data frame based on a date-time value

Sometimes we may need to split a data frame based on date-time values, i.e. one split contains the rows before a certain date and the other contains the rows on or after that date.

Here is an example of the Python code to split it:

import h2o
import datetime

h2o.init()
timedata = h2o.import_file("/Users/avkashchauhan/Downloads/date-data.csv")
timedata.shape
date_before_data = timedata[timedata['date'] < datetime.datetime(2015, 10, 1, 0, 0, 0),:]
date_after_data = timedata[timedata['date'] >= datetime.datetime(2015, 10, 1, 0, 0, 0),:]
date_before_data.shape
date_after_data.shape

If you then want to split one of the pieces and append part of it back to the other data frame, you can do the following:

part1, part2 = date_after_data.split_frame(ratios=[0.5])
final_data = date_before_data.rbind(part2)
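
The same date-based split can also be done from R; a minimal sketch assuming the same CSV, with the cutoff passed as an H2O frame via as.h2o(as.Date(...)):

library(h2o)
h2o.init()
timedata <- h2o.importFile("/Users/avkashchauhan/Downloads/date-data.csv")
# Dates parse to epoch milliseconds; compare against the cutoff date
date_before_data <- timedata[timedata$date < as.h2o(as.Date("2015-10-01")), ]
date_after_data  <- timedata[timedata$date >= as.h2o(as.Date("2015-10-01")), ]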

Note the CSV file contents are as below:

id date
1 9/1/2015
2 9/2/2015
3 9/3/2015
4 9/4/2015
5 9/5/2015
6 9/6/2015
7 9/7/2015
8 9/8/2015
9 9/9/2015
10 9/10/2015
11 12/1/2015
12 12/2/2015
13 12/3/2015
14 12/4/2015
15 12/5/2015
16 12/6/2015
17 12/7/2015
18 12/8/2015
19 12/9/2015
20 12/10/2015

That's it, enjoy!!

Union of two different H2O data frames in Python and R

We have the first data frame, df1, as below:

C1 C2 C3 C4
10 20 30 40
3 4 5 6
5 7 8 9
12 3 55 10

And then we have the second data frame, df2, as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If we just try to rbind these two data frames blindly as below:

final = df2.rbind(df1)

We will get the following error:

H2OValueError: Cannot row-bind a dataframe with 6 columns to a data frame with 4 columns: the columns must match

So to combine two data sets with different columns, we first need to align the columns to meet the rbind requirement. First we will add the columns df1 is missing (relative to df2), filled with 0, as below:

df1['C10'] = 0
df1['C20'] = 0

The updated df1 looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0

Now we will rbind "df2" to "df1" as below:

df1 = df1.rbind(df2)

Now "df1" looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If you are using R, you just need to do the following to add the new columns to your first data frame:

df1$C10 = 0
df1$C20 = 0

You must make sure the number of columns matches before doing rbind, and the number of rows matches before doing cbind. The sketch below puts the R steps together.
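
Putting the R steps together, a minimal sketch assuming df1 and df2 are the H2O frames above (h2o.rbind() is the R counterpart of the Python rbind call):

# Add the missing columns to df1, filled with 0, so the column sets match
df1$C10 <- 0
df1$C20 <- 0
# Now the row-bind succeeds
df1 <- h2o.rbind(df1, df2)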

That's it, enjoy!!

Building an H2O GLM model using a PostgreSQL database and JDBC driver

Note: Before we jump in, make sure PostgreSQL is up and running and the database is ready to respond to your queries. Check that your queries return records and are not null.

Download the PostgreSQL JDBC driver (JDBC 4, build 42.0.0):

Note: I have tested H2O 3.10.4.2 with the above JDBC 4.0 driver (build 42.0.0) and PostgreSQL 9.2.x.

In the following test I am connecting to the DVD Rental sample database that is available for PostgreSQL.

Test the PostgreSQL connection from R (RStudio):

# Install package if you don't have it
> install.packages("RPostgreSQL")

# Use package RPostgreSQL
> library(RPostgreSQL)

# Code to test database and table:
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, dbname = "dvdrentaldb", host = "localhost", port = 5432,
+                  user = "avkash", password = "avkash")
> dbExistsTable(con, "actor")
TRUE
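
To confirm a query actually returns records (per the note at the top), a quick check on the same connection; this assumes the payment table used later exists in the database:

> dbGetQuery(con, "SELECT count(*) FROM payment")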

Start H2O with the JDBC driver on the classpath:

$ java -cp postgresql-42.0.0.jre6.jar:h2o.jar water.H2OApp

Note:

  • You must have h2o.jar and postgresql-42.0.0.jre6.jar in the same folder as above.
  • You must start H2O first and then connect to the running H2O instance from R as below.
  • I am connecting to a table named payment below.
  • I am using the payment table to build an H2O GLM model.

Connecting to H2O from R:

> library(h2o)
> h2o.init(strict_version_check = FALSE)
> payment = h2o.import_sql_table(connection_url = "jdbc:postgresql://localhost:5432/h2odb?&useSSL=false", table = "payment", username = "avkash", password = "avkash")
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = "amount", training_frame = payment)
> payment_glm
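
If you only need a subset of rows or columns, the R API also provides h2o.import_sql_select(); a minimal sketch assuming the same connection settings:

> payment_small = h2o.import_sql_select(connection_url = "jdbc:postgresql://localhost:5432/h2odb?&useSSL=false", select_query = "SELECT payment_id, amount FROM payment", username = "avkash", password = "avkash")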

Here is the full working code snippet:

 

payment = h2o.import_sql_table(connection_url = "jdbc:postgresql://localhost:5432/h2odb?&useSSL=false", table = "payment", username = "avkash", password = "avkash")
|=============================================| 100%
> payment
payment_id customer_id staff_id rental_id amount payment_date
1 17503 341 2 1520 7.99 1.171607e+12
2 17504 341 1 1778 1.99 1.171675e+12
3 17505 341 1 1849 7.99 1.171695e+12
4 17506 341 2 2829 2.99 1.171943e+12
5 17507 341 2 3130 7.99 1.172022e+12
6 17508 341 1 3382 5.99 1.172090e+12

[14596 rows x 6 columns]
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = "amount", training_frame = payment)
|=============================================| 100%
> payment_glm
Model Details:
==============

H2ORegressionModel: glm
Model ID: GLM_model_R_1490053774745_2
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.038E-4 ) 5 5
number_of_iterations training_frame
1 0 payment_sql_to_hex

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -10.739680 4.200606
2 payment_id -0.000009 -0.038040
3 customer_id 0.000139 0.024262
4 staff_id 0.103740 0.051872
5 rental_id 0.000001 0.003172
6 payment_date 0.000000 0.026343

H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 5.607411
RMSE: 2.367997
MAE: 1.950123
RMSLE: 0.5182649
Mean Residual Deviance : 5.607411
R^2 : 0.0007319098
Null Deviance :81905.72
Null D.o.F. :14595
Residual Deviance :81845.77
Residual D.o.F. :14590
AIC :66600.46

 

That's all, enjoy!!

Concurrent model building in H2O using the parallel package

Here is the full code snippet showing how to build models concurrently using the H2O backend and the R parallel package:

 > library(h2o)
 > h2o.init(nthreads = -1)

 ## To simplify only use first 300 rows

 > prostate.hex = h2o.uploadFile(path = system.file("extdata", "prostate.csv", package="h2o"), destination_frame = "prostate.hex")
 > prostate.hex = prostate.hex[1:300,]
 > ones = rep(1, times = 100)
 > zeroes = rep(0, times = 100)
 > prostate.hex$Fold_1 = as.h2o(data.frame( Fold_1 = c(ones, zeroes, zeroes)))
 > prostate.hex$Fold_2 = as.h2o(data.frame( Fold_2 = c(zeroes, ones, zeroes)))
 > prostate.hex$Fold_3 = as.h2o(data.frame( Fold_3 = c(zeroes, zeroes, ones)))
 
 ## Case 1: Use weights in GLM that will essentially run multiple GLM models on the same frame (so no data replication)

 > glm_weights = c()
 > start = Sys.time()
 > for(i in 1:3) {
 # Use a distinct model_id per case so later cases do not overwrite these models
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate.hex, weights_column = paste0("Fold_", i), model_id = paste0("Weights_Fold_", i))
 glm_weights = c(glm_weights, glm_m)
 }
 > end = Sys.time()
 > weightsTime = end - start
 > weightsTime

 ## Case 2: Subset H2OFrame and try to run GLM in a for loop

 > prostate_1 = prostate.hex[1:100, ]
 > prostate_2 = prostate.hex[101:200, ]
 > prostate_3 = prostate.hex[201:300, ]
 > prostate = list(prostate_1, prostate_2, prostate_3)
 > glm_subset = c()
 > start = Sys.time()
 > for(i in 1:3) {
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Subset_Fold_", i))
 glm_subset = c(glm_subset, glm_m)
 }
 > end = Sys.time()
 > subsetTime = end - start
 > subsetTime

 ## Case 3: Use the parallel package to send all the GLM calls to H2O at once; H2O will handle running the multiple calls optimally

 > library(parallel)
 > start = Sys.time()
 > glm_parallel = mclapply(1:3, function(i)
     h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Parallel_Fold_", i)))
 > end = Sys.time()
 > parallelTimes = end - start
 > parallelTimes
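
 Note: mclapply() defaults to getOption("mc.cores", 2L) workers and relies on process forking, so it does not parallelize on Windows. To fire all three GLM calls at once, request three workers explicitly:

 > glm_parallel = mclapply(1:3, function(i)
     h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Parallel_Fold_", i)), mc.cores = 3)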

 ### Quick check to make sure all the GLM models return the same residual deviance

 > unlist(lapply(glm_parallel, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_weights, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_subset, function(x) h2o.residual_deviance(x)))
 
 ## Compare the model build time

 > comparison_table = data.frame(Time_Elapsed = c(weightsTime, subsetTime, parallelTimes), row.names = c("Case_1", "Case_2", "Case_3"))

How to subset an H2O data frame by date in R

Let's look at the example below:

> time = as.Date(c('2016-01-01 12:30:00', '2016-02-02 10:10:22', '2016-05-02 01:01:30'))

> time
[1] "2016-01-01" "2016-02-02" "2016-05-02"

Note that as.Date() keeps only the date and drops the time-of-day. Let's put the vector into a data frame and convert that to an H2O frame:

> df = data.frame(time, stringsAsFactors = FALSE)
> df
time
1 2016-01-01
2 2016-02-02
3 2016-05-02

> d = as.h2o(df)
|==========================================| 100%
> d
time
1 1.451635e+12
2 1.454400e+12
3 1.462172e+12

[3 rows x 1 column]

Above, d is an H2O frame in which the dates are stored as milliseconds since the epoch. Compare its summary with the original R data frame's:

> summary(d)
NULL
> summary(df)
 time 
 Min. :2016-01-01 
 1st Qu.:2016-01-17 
 Median :2016-02-02 
 Mean :2016-02-21 
 3rd Qu.:2016-03-18 
 Max. :2016-05-02

Our next step is to filter some values from the above H2O data frame based on a date comparison:

> d[d$time < as.h2o(as.Date('2016-05-02 01:01:30')),]
|===================================| 100%
time
1 1.451635e+12
2 1.454400e+12

[2 rows x 1 column]

Above we have filtered the dates earlier than 2016-05-02 and printed the result, which we can store in another frame as needed.

Another option is a boolean mask. This generic snippet assumes an H2O frame air_h2o with a Date column, and two cutoff values last_current_date and new_end_date:

  # slice with a boolean mask
 mask <- air_h2o[,"Date"] > last_current_date
 temp_df <- air_h2o[mask,]
 mask_2 <- air_h2o[,"Date"] < new_end_date
 
 # multiply the mask dataframes to get the intersection
 final_mask <- mask*mask_2
 test_1 <- air_h2o[final_mask,]
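
 Equivalently, since comparison and logical operators work elementwise on H2O frames in R, the two masks can be combined in one expression (same assumed frame and cutoff variables as above):

 # rows where Date falls strictly between the two cutoffs
 test_1 <- air_h2o[air_h2o[,"Date"] > last_current_date & air_h2o[,"Date"] < new_end_date, ]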



Thank you!!

Processing unstructured content from a URL in R

R has a built-in function named readLines() which reads a local file or a URL line by line.

For example, my blog URL is http://cloudcelebrity.wordpress.com, so let's read it:

> myblog <- readLines("http://cloudcelebrity.wordpress.com")
Warning message:
In readLines("http://cloudcelebrity.wordpress.com") :
  incomplete final line found on 'http://cloudcelebrity.wordpress.com'

> length(myblog)
[1] 1380

As you can see above, there is a warning message even though myblog does contain all the lines. To disable this warning we can pass warn=FALSE as below:

> myblog <- readLines("http://cloudcelebrity.wordpress.com", warn=FALSE)

> length(myblog)
[1] 1380

And above there is no warning. If I want to print a specific line by its line number, I can just index it:

> myblog[100]
[1] "/**/"

Let's get the summary:

> summary(myblog)
Length Class Mode
1380 character character

To read only a limited number of lines from the same URL, I can also pass a line limit as below:

> myblog <- readLines("http://cloudcelebrity.wordpress.com", 100, warn=FALSE)
> summary(myblog)
Length Class Mode
100 character character

After reading all the lines of my blog, let's perform some search operations on the content:

Searching all lines with Hadoop or hadoop in them:

To find all the lines containing Hadoop or hadoop, we can use grep() to get the line numbers as below:

> hd <- grep("[hH]adoop", myblog)

Let's print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281

To print all the lines containing Hadoop or hadoop, we can just use:

> myblog[hd]
[1] “<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I:  In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N:  NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning)  and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X:  XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>”
……………….

Above I have trimmed the rest of the matches (mostly HTML and JavaScript from the page template) to keep the result snippet short.

In the above output, lines 553 to 648 contain the list of all datasets in R, so to collect them I can do the following:

> myLines <- myblog[553:648]
> summary(myLines)
Length Class Mode
96 character character

Note: The myLines character vector above has 96 entries, so you can try printing it and see what you get.

Create a list of available datasets from the above myLines vector:

The pattern in myLines is as below:

[1] "AirPassengers Monthly Airline Passenger Numbers 1949-1960"
[2] "BJsales Sales Data with Leading Indicator"
[3] "BOD Biochemical Oxygen Demand"
[4] "CO2 Carbon Dioxide Uptake in Grass Plants"

……….

………

[92] "treering Yearly Treering Data, -6000-1979"
[93] "trees Girth, Height and Volume for Black Cherry Trees"
[94] "uspop Populations Recorded by the US Census"
[95] "volcano Topographic Information on Auckland's Maunga"
[96] " Whau Volcano"

So the first word is the dataset name, and what follows the space is the dataset description. To get only the dataset names, let's use the sub() function as below:

> dsName <- sub(" .*", "", myLines)
> dsName
[1] "AirPassengers" "BJsales" "BOD"
[4] "CO2" "ChickWeight" "DNase"
[7] "EuStockMarkets" "" "Formaldehyde"
[10] "HairEyeColor" "Harman23.cor" "Harman74.cor"
[13] "Indometh" "InsectSprays" "JohnsonJohnson"
[16] "LakeHuron" "LifeCycleSavings" "Loblolly"
[19] "Nile" "Orange" "OrchardSprays"
[22] "PlantGrowth" "Puromycin" "Theoph"
[25] "Titanic" "ToothGrowth" ""
[28] "UCBAdmissions" "UKDriverDeaths" "UKLungDeaths"
[31] "UKgas" "USAccDeaths" "USArrests"
[34] "USJudgeRatings" "" "USPersonalExpenditure"
[37] "VADeaths" "WWWusage" "WorldPhones"
[40] "ability.cov" "airmiles" ""
[43] "airquality" "anscombe" ""
[46] "attenu" "attitude" "austres"
[49] "" "beavers" "cars"
[52] "chickwts" "co2" "crimtab"
[55] "datasets-package" "discoveries" "esoph"
[58] "euro" "eurodist" "faithful"
[61] "freeny" "infert" ""
[64] "iris" "islands" "lh"
[67] "longley" "lynx" "morley"
[70] "mtcars" "nhtemp" "nottem"
[73] "" "occupationalStatus" "precip"
[76] "presidents" "pressure" ""
[79] "quakes" "randu" ""
[82] "rivers" "rock" "sleep"
[85] "stackloss" "state" "sunspot.month"
[88] "sunspot.year" "sunspots" "swiss"
[91] "" "treering" "trees"
[94] "uspop" "volcano" ""

Next work item: dsName has a few empty entries, so let's clean the vector; see the sketch below.
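
A minimal sketch to drop the empty entries (these come from wrapped description lines that begin with a space):

> dsName <- dsName[dsName != ""]
> length(dsName)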

 

Note: readline() in R is used to prompt the user for input at the console; don't confuse it with readLines().

 

Merging two data set in R based on one common column

Let's create a new dataset from the mtcars dataset with only the mpg and hp columns:

> cars.mpg <- subset(mtcars, select = c(mpg, hp))
 
> cars.mpg
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
…………..

Let's create another dataset from the mtcars dataset with only the hp and cyl columns:

> cars.cyl <- subset(mtcars, select = c(hp,cyl))
> cars.cyl
                     hp cyl
Mazda RX4           110   6
Mazda RX4 Wag       110   6
Datsun 710           93   4
Hornet 4 Drive      110   6
Hornet Sportabout   175   8
Valiant             105   6
Duster 360          245   8
Merc 240D            62   4
Merc 230             95   4
Merc 280            123   6
Merc 280C           123   6
Merc 450SE          180   8
Merc 450SL          180   8
Merc 450SLC         180   8
…………………..

Now we can merge both datasets on the common column hp as below:

 
> merge.ds <- merge(cars.mpg, cars.cyl, by="hp")
> merge.ds
    hp  mpg cyl
1   52 30.4   4
2   62 24.4   4
3   65 33.9   4
4   66 32.4   4
5   66 32.4   4
6   66 27.3   4
7   66 27.3   4
8   91 26.0   4
9   93 22.8   4
10  95 22.8   4
11  97 21.5   4
12 105 18.1   6
13 109 21.4   4
14 110 21.0   6
15 110 21.0   6
16 110 21.0   6
17 110 21.0   6
18 110 21.0   6
19 110 21.0   6
20 110 21.4   6
21 110 21.4   6
22 110 21.4   6
23 113 30.4   4
24 123 17.8   6
25 123 17.8   6
26 123 19.2   6
27 123 19.2   6
28 150 15.2   8
29 150 15.2   8
30 150 15.5   8
31 150 15.5   8
32 175 18.7   8
33 175 18.7   6
34 175 18.7   8
35 175 19.7   8
36 175 19.7   6
37 175 19.7   8
38 175 19.2   8
39 175 19.2   6
40 175 19.2   8
41 180 16.4   8
42 180 16.4   8
43 180 16.4   8
44 180 17.3   8
45 180 17.3   8
46 180 17.3   8
47 180 15.2   8
48 180 15.2   8
49 180 15.2   8
50 205 10.4   8
51 215 10.4   8
52 230 14.7   8
53 245 14.3   8
54 245 14.3   8
55 245 13.3   8
56 245 13.3   8
57 264 15.8   8
58 335 15.0   8

Why do you see 58 merged rows in total when there were only 32 rows in each original data set?

This is because merge() performs a database-style join on the by column: an hp value that appears m times in the first data frame and n times in the second produces m * n rows in the result. For example, hp = 110 appears 3 times in each frame, so it contributes 9 merged rows. (The documentation adds that "merge is a generic function whose principal method is for data frames: the default method coerces its arguments to data frames and calls the 'data.frame' method.") Learn more by calling:

?merge
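
A quick sanity check on the arithmetic: both frames come from mtcars, so an hp value appearing m times in each frame contributes m * m merged rows, and the total is the sum of squared counts:

> sum(table(mtcars$hp)^2)
[1] 58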