Spark with H2O using rsparkling and sparklyr in R

You must have installed:

  • sparklyr
  • rsparkling


Here is the working script:

library(sparklyr)
options(rsparkling.sparklingwater.version = "2.1.6")
Sys.setenv(SPARK_HOME = "/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.6")
library(rsparkling)
# spark_disconnect(sc)  # only needed if an earlier connection is still open
sc <- spark_connect(master = "local", version = "2.1.0")

Testing the Spark context:

sc
$master
[1] "local[8]"

$method
[1] "shell"

$app_name
[1] "sparklyr"

$config
$config$sparklyr.cores.local
[1] 8

$config$spark.sql.shuffle.partitions.local
[1] 8

$config$spark.env.SPARK_LOCAL_IP.local
[1] "127.0.0.1"

$config$sparklyr.csv.embedded
[1] "^1.*"

$config$`sparklyr.shell.driver-class-path`
[1] ""

attr(,"config")
[1] "default"
attr(,"file")
[1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/sparklyr/conf/config-template.yml"

$spark_home
[1] "/Volumes/OSxexT/tools/spark-2.1.0-bin-hadoop2.6"

$backend
A connection with
description "->localhost:53374"
class "sockconn"
mode "wb"
text "binary"
opened "opened"
can read "yes"
can write "yes"

$monitor
A connection with
description "->localhost:8880"
class "sockconn"
mode "rb"
text "binary"
opened "opened"
can read "yes"
can write "yes"

$output_file
[1] "/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T//RtmpIIVL8I/file6ba7b454325_spark.log"

$spark_context
<jobj[5]>
class org.apache.spark.SparkContext
org.apache.spark.SparkContext@159ba51c

$java_context
<jobj[6]>
class org.apache.spark.api.java.JavaSparkContext
org.apache.spark.api.java.JavaSparkContext@6b114a2d

$hive_context
<jobj[9]>
class org.apache.spark.sql.SparkSession
org.apache.spark.sql.SparkSession@2cd7fdf8

attr(,"class")
[1] "spark_connection" "spark_shell_connection" "DBIConnection"
Create the H2O context (note that the first call below fails because of the stray argument st; strict_version_check is the intended argument):

> h2o_context(sc, st)
Error in is.H2OFrame(x) : object 'st' not found
> h2o_context(sc, strict_version_check = FALSE)
<jobj[15]>
class org.apache.spark.h2o.H2OContext

Sparkling Water Context:
* H2O name: sparkling-water-avkashchauhan_1672148412
* cluster size: 1
* list of used nodes:
(executorId, host, port)
————————
(driver,127.0.0.1,54321)
————————

Open H2O Flow in browser: http://127.0.0.1:54321 (CMD + click in Mac OSX)

You can use the following command to launch H2O FLOW:

h2o_flow(sc, strict_version_check = FALSE)
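
Once the H2OContext is up, you can move data between Spark and H2O. Here is a quick sketch using the built-in mtcars data; as_h2o_frame() comes from rsparkling and copy_to() from sparklyr/dplyr:

library(dplyr)

# Copy a small R data frame into Spark, then hand it to H2O
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
mtcars_hf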

That's it, enjoy!

Building a GBM model in R and exporting POJO and MOJO models

Get the dataset:

Training:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz

Test:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz
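
If you prefer to fetch the files from R instead of the browser, here is a small sketch (the destination folder is illustrative):

download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz",
              destfile = "/Users/avkashchauhan/learn/adult_2013_train.csv.gz")
download.file("http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz",
              destfile = "/Users/avkashchauhan/learn/adult_2013_test.csv.gz")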

Here is the script to build a GBM model and export both POJO and MOJO models:

library(h2o)
h2o.init()

# Importing Dataset
trainfile <- file.path("/Users/avkashchauhan/learn/adult_2013_train.csv.gz")
adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
testfile <- file.path("/Users/avkashchauhan/learn/adult_2013_test.csv.gz")
adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")

# Display Dataset
adult_2013_train
adult_2013_test

# Feature Engineering
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
 adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
 adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

# Building GBM Model:
log_wagp_gbm_grid <- h2o.gbm(x = predset,
 y = "LOG_WAGP",
 training_frame = adult_2013_train,
 model_id = "GBMModel",
 distribution = "gaussian",
 max_depth = 5,
 ntrees = 110,
 validation_frame = adult_2013_test)

log_wagp_gbm_grid

# Prediction 
h2o.predict(log_wagp_gbm_grid, adult_2013_test)

# Download POJO Model:
h2o.download_pojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

# Download MOJO model:
h2o.download_mojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)
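
You can also check fit on the test set before relying on the exported model; a short sketch using the model built above:

# Compute test-set metrics for the GBM model
perf <- h2o.performance(log_wagp_gbm_grid, newdata = adult_2013_test)
h2o.rmse(perf)  # RMSE of LOG_WAGP on the test frame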

You will find GBM_model.java (the POJO) and GBM_model.zip (the MOJO) in the folder you passed to the download calls; with get_genmodel_jar = TRUE, h2o-genmodel.jar is saved alongside them.
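
POJO and MOJO are meant for scoring outside H2O. If you also want a copy you can reload into H2O itself, the binary format works as below (a sketch reusing the model built above):

# Save in H2O's binary format; h2o.saveModel returns the full path to the saved model
model_path <- h2o.saveModel(log_wagp_gbm_grid, path = "/Users/avkashchauhan/learn", force = TRUE)
reloaded <- h2o.loadModel(model_path)
h2o.predict(reloaded, adult_2013_test)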

That's it, enjoy!

Experimental Plotting in H2O FLOW (limited support)

H2O FLOW comes with an experimental plot option, which you can use as follows.

What you need is:

  • X: the column name for the x-axis
  • Y: the column name for the y-axis
  • Data: the data frame name

Here is what the experimental script looks like:

plot (g) -> g(
  g.point(
    g.position "X_Column", "Y_Column"
  )
  g.from inspect "data", getFrameData "DATA_SET_KEY"
)

You can launch the plot configuration as below:

plot inspect 'data', getFrame "<dataframe>"


Here is the example script:

plot (g) -> g(
  g.point(
    g.position "FIRST_PAYMENT_DATE", "CHANNEL"
  )
  g.from inspect "data", getFrameData "FM_Joined"
)


Note: This experimental plot only uses the first 20 columns and 1,000 rows of the frame, and this limit cannot be configured.

That's it, enjoy!

Setting H2O FLOW directory path in Sparkling Water

Sometimes you may want to back up H2O FLOW files to a source-code repo or a backup location. For that reason you may want to change the default FLOW directory.

In H2O, the -flow_dir flag sets the local folder for FLOW files.

Note: You can always specify any H2O property by using system properties on Spark driver/executors.

So, to change the directory where H2O FLOW files are saved, append the following to your Sparkling Water command line:

--conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"
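
If you start Spark through sparklyr/rsparkling instead of the command line, the same system property can go into the connection config. A minimal sketch (the flow directory path is illustrative):

library(sparklyr)
config <- spark_config()
# Pass the H2O property to the driver JVM (path is illustrative)
config$`spark.driver.extraJavaOptions` <- "-Dai.h2o.flow_dir=/your/backup/location"
sc <- spark_connect(master = "local", config = config)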

That's it, thanks.

Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:

https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/

To add cross validation you can do the following:

def buildGLMModel(train: Frame, valid: Frame, response: String)
                 (implicit h2oContext: H2OContext): GLMModel = {
  import _root_.hex.glm.GLMModel.GLMParameters.Family
  import _root_.hex.glm.GLM
  import _root_.hex.glm.GLMModel.GLMParameters
  val glmParams = new GLMParameters(Family.binomial)
  glmParams._train = train
  glmParams._valid = valid
  glmParams._nfolds = 3  // <-- this line enables 3-fold cross-validation
  glmParams._response_column = response
  glmParams._alpha = Array[Double](0.5)
  val glm = new GLM(glmParams, Key.make("glmModel.hex"))
  glm.trainModel().get()
}
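
For comparison, the same cross-validation setup through the H2O R API is a one-liner. A minimal sketch, assuming a training frame train and a binary response column response already exist:

library(h2o)
# Minimal sketch: `train` and `response` are assumed to exist already
glm_cv <- h2o.glm(x = setdiff(names(train), response),
                  y = response,
                  training_frame = train,
                  family = "binomial",
                  alpha = 0.5,
                  nfolds = 3,                           # same 3-fold cross-validation
                  keep_cross_validation_models = TRUE)  # keep each fold model for inspection
h2o.cross_validation_models(glm_cv)                     # the 3 fold models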

To list the cross-validated models, try this:

scala> glmModel._output._cross_validation_models
res12: Array[water.Key[_ <: water.Keyed[_ <: AnyRef]]] = 
    Array(glmModel.hex_cv_1, glmModel.hex_cv_2, glmModel.hex_cv_3)

Now, to get each model, do the following, and you will see its metrics:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]
m1: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_1
 frame id: glmModel.hex_cv_1_train
 MSE: 0.14714406
 RMSE: 0.38359362
 AUC: 0.7167627
 logloss: 0.4703465
 mean_per_class_error: 0.31526923
 default threshold: 0.27434438467025757
 CM: Confusion Matrix (vertical: actual; across: predicted):
            0     1   Error            Rate
 0      10704  1651  0.1336  1,651 / 12,355
 1       1768  1790  0.4969   1,768 / 3,558
 Totals 12472  3441  0.2149  3,419 / 15,913
Gains/Lift Table (Avg response rate: 22.36 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005467 0....
scala> val m2 = DKV.getGet("glmModel.hex_cv_2").asInstanceOf[GLMModel]
m2: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_2
 frame id: glmModel.hex_cv_2_train
 MSE: 0.14598908
 RMSE: 0.38208517
 AUC: 0.7231473
 logloss: 0.46717605
 mean_per_class_error: 0.31456697
 default threshold: 0.29637953639030457
 CM: Confusion Matrix (vertical: actual; across: predicted):
            0     1   Error            Rate
 0      11038  1395  0.1122  1,395 / 12,433
 1       1847  1726  0.5169   1,847 / 3,573
 Totals 12885  3121  0.2025  3,242 / 16,006
Gains/Lift Table (Avg response rate: 22.32 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005873 0...
scala> val m3 = DKV.getGet("glmModel.hex_cv_3").asInstanceOf[GLMModel]
m3: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_3
 frame id: glmModel.hex_cv_3_train
 MSE: 0.14626761
 RMSE: 0.38244948
 AUC: 0.7239823
 logloss: 0.46873763
 mean_per_class_error: 0.31437498
 default threshold: 0.28522220253944397
 CM: Confusion Matrix (vertical: actual; across: predicted):
            0     1   Error            Rate
 0      10982  1490  0.1195  1,490 / 12,472
 1       1838  1771  0.5093   1,838 / 3,609
 Totals 12820  3261  0.2070  3,328 / 16,081
Gains/Lift Table (Avg response rate: 22.44 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01001182 0...
scala>

That's it, enjoy!

Generating a ROC curve in Scala from H2O binary classification models

You can use the blog linked in the previous section to build a binomial classification GLM model.

To collect model metrics for training, use the following:

val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)

Now you can access the model AUC (the _auc object). Note: the _auc object holds an array of thresholds, and for each threshold the corresponding fps and tps (use tab completion to list all members):

scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON

In the above AUC object:

_fps  =  false positives
_tps  =  true positives
_ths  =  threshold values
_p    =  actual positives
_n    =  actual negatives

Now you can use the individual ROC-specific values as below to recreate the ROC curve:

trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths

To print a whole array in the terminal for inspection, you just need the following:

val dd = trainMetrics._auc._fps
println(dd.mkString(" "))

You can access the counts of actual positives (_p) and actual negatives (_n) as below:

scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
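
For reference, if you are working from R instead of Scala, the same curve can be drawn directly from the model metrics. A sketch, assuming a trained binomial model my_model:

perf <- h2o.performance(my_model, train = TRUE)  # metrics on the training frame
plot(perf, type = "roc")                         # draws the ROC curve
h2o.auc(perf)                                    # the corresponding AUC value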
That's it, enjoy!

Saving H2O models from the R/Python API in a Hadoop environment

When you are using H2O in a clustered environment such as Hadoop, the machine where h2o.saveModel() tries to write the model can be different from the machine running your R client, and that is why you see the error "No such file or directory". If you pass a simple path such as /tmp and then visit the machine the H2O connection points at, you will find the model stored there.
Here is a good example to understand it better:
Step [1]: Start the H2O Hadoop driver in the EC2 environment as below:
[ec2-user@ip-10-0-104-179 ~]$ hadoop jar h2o-3.10.4.8-hdp2.6/h2odriver.jar -nodes 2 -mapperXmx 2g -output /usr/ec2-user/005
....
....
....
Open H2O Flow in your web browser: http://10.0.65.248:54323  <=== H2O is started.
Note: Above you can see that the hadoop command was run on IP address 10.0.104.179, but the node where the H2O server started is 10.0.65.248.
Step [2]: Connect the R client to H2O:
> h2o.init(ip = "10.0.65.248", port = 54323, strict_version_check = FALSE)
Note: I used the IP address shown above to connect to the existing H2O cluster. The machine running the R client is a different one, with IP address 34.208.200.16.
Step [3]: Save the H2O model:
h2o.saveModel(my.glm, path = "/tmp", force = TRUE)
So when I save the model, it lands on the 10.0.65.248 machine even though the R client is running at 34.208.200.16.
[ec2-user@ip-10-0-65-248 ~]$ ll /tmp/GLM*
-rw-r--r-- 1 yarn hadoop 90391 Jun 2 20:02 /tmp/GLM_model_R_1496447892009_1
So you need to make sure you have access to a folder on the machine where the H2O service is running, or you can save the model to HDFS with something similar to the following:
h2o.saveModel(my.glm, path = "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", force = TRUE)
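
h2o.saveModel() returns the full path of the saved model, which you can feed straight back to h2o.loadModel(); a quick sketch:

# Keep the returned path and reuse it to load the model back
model_path <- h2o.saveModel(my.glm, path = "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", force = TRUE)
my.glm.loaded <- h2o.loadModel(model_path)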

That's it, enjoy!