Calculating Standard Deviation using custom UDF and group by in H2O

Here is the full code to calculate standard deviation in H2O, using the built-in group-by method as well as a custom UDF:

library(h2o)
h2o.init()
irisPath <- system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex <- h2o.uploadFile(path = irisPath, destination_frame = "iris.hex")

# Calculating Standard Deviation using h2o group by
SdValue <- h2o.group_by(data = iris.hex, by = "class", sd("sepal_len"))

# Printing result
SdValue

# Alternative: defining a UDF for standard deviation
mySDUdf <- function(df) { sd(df[, 1], na.rm = TRUE) }

# Using h2o ddply with UDF
SdValue <- h2o.ddply(iris.hex, "class", mySDUdf)

# Printing result
SdValue
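For intuition, the grouped standard deviation that both approaches above compute can be sketched in plain Python (sample standard deviation per class; the class names and values below are made-up examples, not the full iris data):

```python
# A plain-Python sketch of what the grouped standard deviation computes.
# Illustration only -- h2o.group_by does this distributed on the H2O cluster.
from collections import defaultdict
from statistics import stdev

rows = [
    ("Iris-setosa", 5.1), ("Iris-setosa", 4.9),
    ("Iris-versicolor", 7.0), ("Iris-versicolor", 6.4),
]

groups = defaultdict(list)
for cls, sepal_len in rows:
    groups[cls].append(sepal_len)

# Sample standard deviation per class, like sd("sepal_len") grouped by "class"
sd_by_class = {cls: stdev(vals) for cls, vals in groups.items()}
print(sd_by_class)
```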

That's it, enjoy!


Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that provides an R front-end for the Sparkling Water Spark package from H2O. This gives R users an interface to H2O's high-performance, distributed machine learning algorithms on Spark. Visit the GitHub project: https://github.com/h2oai/rsparkling

You must have the sparklyr and rsparkling packages installed in your R environment, and the latest Sparkling Water package downloaded and unzipped locally.

I am using the following versions in my environment:

  • Spark 2.1
  • Sparkling Water 2.1.8
  • sparklyr 0.4.4
  • rsparkling 0.2.0

Now here is the rsparkling script to create the cluster locally:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

Now here is the rsparkling script to create Spark cluster with Yarn:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances = 2
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

That's it, enjoy!

Splitting h2o data frame based on date time value

Sometimes we may need to split a data frame based on a date-time value, i.e., one split contains rows before a certain date and the other contains rows on or after that date.

Here is an example in Python showing how to split it:

import h2o
import datetime

h2o.init()
timedata = h2o.import_file("/Users/avkashchauhan/Downloads/date-data.csv")
timedata.shape
date_before_data = timedata[timedata['date'] < datetime.datetime(2015, 10, 1, 0, 0, 0),:]
date_after_data = timedata[timedata['date'] >= datetime.datetime(2015, 10, 1, 0, 0, 0),:]
date_before_data.shape
date_after_data.shape

If you then want to split one of those frames further and append one of the resulting pieces back onto the other frame, you can do the following:

part1, part2 = date_after_data.split_frame(ratios=[0.5])
final_data = date_before_data.rbind(part2)
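The split-and-recombine logic can be sketched in plain Python on a toy version of the id/date data (illustration only; H2O's split_frame splits randomly by the given ratio, while the slicing below is deterministic):

```python
from datetime import datetime

# Toy rows mirroring the CSV: (id, date). Illustration only, not the H2O API.
rows = [(i, datetime(2015, 9, i)) for i in range(1, 11)] + \
       [(i + 10, datetime(2015, 12, i)) for i in range(1, 11)]

cutoff = datetime(2015, 10, 1)
before = [r for r in rows if r[1] < cutoff]
after  = [r for r in rows if r[1] >= cutoff]

# Split "after" roughly in half (like split_frame(ratios=[0.5]), though
# H2O's split is randomized) and append one part to "before", like rbind.
mid = len(after) // 2
part1, part2 = after[:mid], after[mid:]
final_data = before + part2
print(len(before), len(after), len(final_data))
```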

Note the CSV file contents are as below:

id date
1 9/1/2015
2 9/2/2015
3 9/3/2015
4 9/4/2015
5 9/5/2015
6 9/6/2015
7 9/7/2015
8 9/8/2015
9 9/9/2015
10 9/10/2015
11 12/1/2015
12 12/2/2015
13 12/3/2015
14 12/4/2015
15 12/5/2015
16 12/6/2015
17 12/7/2015
18 12/8/2015
19 12/9/2015
20 12/10/2015

That's it, enjoy!

Union of two different H2O data frames in python and R

We have the first data frame, df1, as below:

C1 C2 C3 C4
10 20 30 40
3 4 5 6
5 7 8 9
12 3 55 10

And then we have the second data frame, df2, as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If we just try to row-bind these two data frames blindly, as below:

final = df2.rbind(df1)

We will get the following error:

H2OValueError: Cannot row-bind a dataframe with 6 columns to a data frame with 4 columns: the columns must match

So to merge two data sets with different columns, we need to adjust our datasets to meet rbind's requirement that the columns match. First we will add the columns that “df2” has but “df1” lacks, as below:

df1['C10'] = 0
df1['C20'] = 0

The updated data frame looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0

Now we will rbind “df2” onto “df1” as below:

df1 = df1.rbind(df2)

Now “df1” looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If you are using R, you just need the following to add the new columns to your first data frame:

df1$C10 = 0
df1$C20 = 0

You must make sure the number of columns matches before doing rbind, and the number of rows matches before doing cbind.
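The pad-with-zeroes-then-rbind idea can be sketched with plain-Python column dictionaries (illustration only, not the H2O API):

```python
# Column-oriented toy frames; a sketch of the pad-then-rbind idea,
# not the H2O API.
df1 = {"C1": [10, 3], "C2": [20, 4], "C3": [30, 5], "C4": [40, 6]}
df2 = {"C1": [10, 3], "C2": [20, 4], "C3": [30, 5], "C4": [40, 6],
       "C10": [33, 11], "C20": [44, 22]}

# Add the columns df1 is missing, filled with zeroes (like df1['C10'] = 0).
n = len(df1["C1"])
for col in df2:
    if col not in df1:
        df1[col] = [0] * n

# Row-bind: the column sets now match, so append df2's rows under df1's.
bound = {col: df1[col] + df2[col] for col in df1}
print(bound["C10"])
```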

That's it, enjoy!

Building H2O GLM model using Postgresql database and JDBC driver

Note: Before we jump in, make sure PostgreSQL is up and running and the database is ready to respond to your queries. Check that your queries return records and are not null.

Download the PostgreSQL JDBC 4 driver (build 42.0.0):

Note: I have tested H2O 3.10.4.2 with the above JDBC 4.0 driver (build 42.0.0) and PostgreSQL 9.2.x.

In the following test I am connecting to the DVD Rental database, which is available for PostgreSQL. If you need help getting it working, visit Here and Here.

Test R (RStudio) for the postgresql connection working:

# Install package if you don't have it
> install.packages("RPostgreSQL")

# Use package RPostgreSQL
> library(RPostgreSQL)

# Code to test database and table:
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, dbname = "dvdrentaldb", host = "localhost", port = 5432,
+                  user = "avkash", password = "avkash")
> dbExistsTable(con, "actor")
TRUE

Start H2O with JDBC driver:

$ java -cp postgresql-42.0.0.jre6.jar:h2o.jar water.H2OApp

Note:

  • You must have h2o.jar and postgresql-42.0.0.jre6.jar in the same folder, as above.
  • You must start H2O first and then connect to the running instance of H2O from R as below.
  • I am connecting to a table named payment below.
  • I am using the table payment to run the H2O GLM model.

Connecting H2O from R:

> library(h2o)
> h2o.init(strict_version_check = FALSE)
> payment = h2o.import_sql_table(connection_url = "jdbc:postgresql://localhost:5432/h2odb?&useSSL=false", table = "payment", username = "avkash", password = "avkash")
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = "amount", training_frame = payment)
> payment_glm

Here is the full working code snippet with output:

payment = h2o.import_sql_table(connection_url = "jdbc:postgresql://localhost:5432/h2odb?&useSSL=false", table = "payment", username = "avkash", password = "avkash")
|=============================================| 100%
> payment
payment_id customer_id staff_id rental_id amount payment_date
1 17503 341 2 1520 7.99 1.171607e+12
2 17504 341 1 1778 1.99 1.171675e+12
3 17505 341 1 1849 7.99 1.171695e+12
4 17506 341 2 2829 2.99 1.171943e+12
5 17507 341 2 3130 7.99 1.172022e+12
6 17508 341 1 3382 5.99 1.172090e+12

[14596 rows x 6 columns]
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = "amount", training_frame = payment)
|=============================================| 100%
> payment_glm
Model Details:
==============

H2ORegressionModel: glm
Model ID: GLM_model_R_1490053774745_2
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.038E-4 ) 5 5
number_of_iterations training_frame
1 0 payment_sql_to_hex

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -10.739680 4.200606
2 payment_id -0.000009 -0.038040
3 customer_id 0.000139 0.024262
4 staff_id 0.103740 0.051872
5 rental_id 0.000001 0.003172
6 payment_date 0.000000 0.026343

H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 5.607411
RMSE: 2.367997
MAE: 1.950123
RMSLE: 0.5182649
Mean Residual Deviance : 5.607411
R^2 : 0.0007319098
Null Deviance :81905.72
Null D.o.F. :14595
Residual Deviance :81845.77
Residual D.o.F. :14590
AIC :66600.46
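As a side note, a gaussian GLM with identity link and no regularization is ordinary least squares. A minimal one-predictor sketch on toy numbers (not the payment table) shows the closed-form fit:

```python
# One-predictor ordinary least squares -- the special case a gaussian GLM
# with identity link and no regularization solves. Toy data only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form slope and intercept from the normal equations
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))
```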

 

That's all, enjoy!

 

Concurrent model building in H2O using Parallel

Here is the full code snippet showing how to build models concurrently using the H2O backend and the R parallel library:

 > library(h2o)
 > h2o.init(nthreads = -1)

 ## To simplify only use first 300 rows

 > prostate.hex = h2o.uploadFile(path = system.file("extdata", "prostate.csv", package="h2o"), destination_frame = "prostate.hex")
 > prostate.hex = prostate.hex[1:300,]
 > ones = rep(1, times = 100)
 > zeroes = rep(0, times = 100)
 > prostate.hex$Fold_1 = as.h2o(data.frame( Fold_1 = c(ones, zeroes, zeroes)))
 > prostate.hex$Fold_2 = as.h2o(data.frame( Fold_2 = c(zeroes, ones, zeroes)))
 > prostate.hex$Fold_3 = as.h2o(data.frame( Fold_3 = c(zeroes, zeroes, ones)))
 
 ## Case 1: Use weights in GLM that will essentially run multiple GLM models on the same frame (so no data replication)

 > glm_weights = c()
 > start = Sys.time()
 > for(i in 1:3) {
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate.hex, weights_column = paste0("Fold_", i), model_id = paste0("Fold_", i))
 glm_weights = c(glm_weights, glm_m)
 }
 > end = Sys.time()
 > weightsTime = end - start
 > weightsTime

 ## Case 2: Subset H2OFrame and try to run GLM in a for loop

 > prostate_1 = prostate.hex[1:100, ]
 > prostate_2 = prostate.hex[101:200, ]
 > prostate_3 = prostate.hex[201:300, ]
 > prostate = c(prostate_1,prostate_2,prostate_3)
 > glm_subset = c()
 > start = Sys.time()
 > for(i in 1:3) {
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Fold_", i))
 glm_subset = c(glm_subset, glm_m)
 }
 > end = Sys.time()
 > subsetTime = end - start
 > subsetTime

 ## Case 3: Use the parallel package to send all the GLM calls to H2O at once and let H2O schedule the concurrent calls optimally

 > library(parallel)
 > start = Sys.time()
 > glm_parallel = mclapply(1:3, function(i)
 +   h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Fold_", i)))
 > end = Sys.time()
 > parallelTimes = end - start
 > parallelTimes

 ### Quick check to make sure all the GLM models return the same residual deviance

 > unlist(lapply(glm_parallel, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_weights, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_subset, function(x) h2o.residual_deviance(x)))
 
 ## Compare the model build time

 > comparison_table = data.frame(Time_Elapsed = c(weightsTime, subsetTime, parallelTimes), row.names = c("Case_1", "Case_2", "Case_3"))
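The mclapply pattern of Case 3, dispatching independent blocking calls concurrently, can be sketched in Python with concurrent.futures; the sleep below is a hypothetical stand-in for a blocking h2o.glm call:

```python
# Sketch of the mclapply pattern: dispatch several independent model-build
# calls concurrently. A sleep stands in for h2o.glm (hypothetical task).
from concurrent.futures import ThreadPoolExecutor
import time

def build_model(fold):
    time.sleep(0.1)          # stand-in for a blocking h2o.glm call
    return f"Fold_{fold}"

start = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    models = list(pool.map(build_model, [1, 2, 3]))
elapsed = time.time() - start
# The three 0.1 s tasks overlap, so elapsed is ~0.1 s, not ~0.3 s
print(models, round(elapsed, 2))
```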

How to subset an H2O data frame by date in R

Let's look at the example below:

> time = as.Date(c('2016-01-01 12:30:00', '2016-02-02 10:10:22', '2016-05-02 01:01:30'))

> time
[1] “2016-01-01” “2016-02-02” “2016-05-02”

Let's convert the above data to an H2O frame, via an R data frame:

> df = data.frame(time, stringsAsFactors = FALSE)
> df
time
1 2016-01-01
2 2016-02-02
3 2016-05-02

> d = as.h2o(df)
|==========================================| 100%
> d
time
1 1.451635e+12
2 1.454400e+12
3 1.462172e+12

[3 rows x 1 column]
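The values like 1.451635e+12 appear because H2O stores dates as milliseconds since the Unix epoch. A quick Python check (using UTC; H2O applies the cluster's timezone, hence the small offset from the printed values):

```python
from datetime import datetime, timezone

# H2OFrames store dates as milliseconds since the Unix epoch, which is
# why the frame prints values like 1.451635e+12. Converting 2016-01-01
# at UTC midnight:
dt = datetime(2016, 1, 1, tzinfo=timezone.utc)
millis = int(dt.timestamp() * 1000)
print(millis)            # 1451606400000
```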

Above, d is an H2OFrame; compare how it behaves with the original R data frame df in the summary calls below:

> summary(d)
NULL
> summary(df)
 time 
 Min. :2016-01-01 
 1st Qu.:2016-01-17 
 Median :2016-02-02 
 Mean :2016-02-21 
 3rd Qu.:2016-03-18 
 Max. :2016-05-02

Our next step is to filter some values from above H2O data frame based on date comparison:

> d[d$time < as.h2o(as.Date('2016-05-02 01:01:30')),]
|===================================| 100%
time
1 1.451635e+12
2 1.454400e+12

[2 rows x 1 column]

Above we have filtered the dates earlier than 2016-05-02 01:01:30; the results are printed, and we can store them in another subset as needed.

Another option for the same is below (here air_h2o is an H2OFrame with a Date column, and last_current_date / new_end_date are previously defined dates):

  # slice with a boolean mask
 mask <- air_h2o[,"Date"] > last_current_date
 temp_df <- air_h2o[mask,]
 mask_2 <- air_h2o[,"Date"] < new_end_date
 
 # multiply the mask dataframes to get the intersection
 final_mask <- mask*mask_2
 test_1 <- air_h2o[final_mask,]
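Multiplying the two masks works because it is an element-wise AND on 0/1 values. A plain-Python sketch with toy dates (air_h2o and the date bounds above are assumed to come from your own data):

```python
from datetime import datetime

# Element-wise AND via multiplication of 0/1 masks, mirroring
# final_mask <- mask * mask_2. Toy dates only.
dates = [datetime(2016, 1, 1), datetime(2016, 2, 2), datetime(2016, 5, 2)]
lower = datetime(2016, 1, 15)
upper = datetime(2016, 4, 1)

mask   = [1 if d > lower else 0 for d in dates]
mask_2 = [1 if d < upper else 0 for d in dates]
final_mask = [a * b for a, b in zip(mask, mask_2)]   # 1 only where both are 1

filtered = [d for d, keep in zip(dates, final_mask) if keep]
print(final_mask, filtered)
```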



Thank you!!