RSparkling > The best of R + H2O + Spark

What do you get from R + H2O + Spark?

R is great for statistical computing, graphics, and small-scale data preparation; H2O is an amazing distributed machine learning platform designed for scale and speed; and Spark is great for super-fast data processing at mega scale. Combine all three and you get the best of data science, machine learning, and data processing, all in one.

rsparkling: The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O's high-performance, distributed machine learning algorithms on Spark, using R.

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark. With Sparkling Water you can publish Spark data structures (RDDs, DataFrames, Datasets) as H2O frames and vice versa, use a DSL that feeds Spark data structures into H2O's algorithms, create ML applications that combine the Spark and H2O APIs, and use the Python interface to drive Sparkling Water directly from PySpark.
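As a sketch of that round trip (assuming a live `sc` Spark connection with the H2O context already attached, as in the quick-start script below), converting between a Spark DataFrame and an H2OFrame looks like this:

```r
library(sparklyr)
library(rsparkling)

# Assumes `sc` is an existing spark_connect() session with h2o_context() attached
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Publish the Spark DataFrame as an H2OFrame ...
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)

# ... and convert an H2OFrame back into a Spark DataFrame
mtcars_spark <- as_spark_dataframe(sc, mtcars_h2o)
```

Both conversions are zero-copy-style publishes handled by Sparkling Water, so the same data is usable from either API.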

Installation Packages:
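The original list of install commands was not preserved; the usual way to install the two packages (from CRAN, an assumption here) is:

```r
install.packages("sparklyr")
install.packages("rsparkling")
```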

Quick Start Script:

options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
sc = spark_connect(master = "local", version = "2.1.0")
h2o_context(sc, strict_version_check = FALSE)

Important Settings for your environment:

  • master = "local" > starts a local Spark cluster
  • master = "yarn-client" > starts a cluster managed by YARN
  • To get a list of supported Sparkling Water versions: h2o_release_table()
  • When you call spark_connect() you will see new tabs appear in RStudio
    • The "Spark" tab is used to launch the Spark UI
    • The "Log" tab is used to collect Spark logs
  • If there is any version mismatch between sparklyr and Spark, pass the exact version to spark_connect() as above; otherwise you don't need to pass the version.
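For example, a minimal sketch of connecting to a YARN-managed cluster instead of a local one (the versions shown are placeholders; pick the pair that matches your cluster from the release table):

```r
library(sparklyr)
library(rsparkling)

# List Sparkling Water releases and the Spark versions they pair with
h2o_release_table()

# Placeholder versions; substitute the ones matching your environment
options(rsparkling.sparklingwater.version = "2.1.14")
sc <- spark_connect(master = "yarn-client", version = "2.1.0")
h2o_context(sc, strict_version_check = FALSE)
```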

Startup Script with config parameters to set executor settings:

These are the settings you will use to get your rsparkling/Spark session up and running in RStudio:

options(rsparkling.sparklingwater.version = "2.1.14") 
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 3  # This will create 3 executor instances
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
h2o_context(sc, strict_version_check = FALSE)
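To confirm the executor settings were applied, you can inspect the running context; a small sketch using sparklyr's accessor for the active configuration:

```r
# Inspect the configuration of the running Spark context
ctx_conf <- spark_context_config(sc)
ctx_conf[["spark.executor.memory"]]
ctx_conf[["spark.executor.cores"]]
```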

Accessing SparkUI:

You can access the Spark UI by clicking the SparkUI button in the Spark tab.

Accessing H2O FLOW UI:

You just need to run the following command to open the H2O Flow UI in your browser:

h2o_flow(sc, strict_version_check = FALSE)

Building H2O GLM model using rsparkling + sparklyr + H2O:

In this example we ingest the famous "cars & MPG" (mtcars) dataset and build a GLM (Generalized Linear Model) to predict miles-per-gallon from the given car specifications:

options(rsparkling.sparklingwater.location = "/tmp/sparkling-water-assembly_2.11-2.1.7-all.jar")
sc <- spark_connect(master = "local", version = "2.1.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
mtcars_glm <- h2o.glm(x = c("wt", "cyl"),
                      y = "mpg",
                      training_frame = mtcars_h2o,
                      lambda_search = TRUE)
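Once the model is trained you can look at its metrics and score a frame with it; a minimal sketch, assuming the `mtcars_glm` model and `mtcars_h2o` frame from above:

```r
# Model summary and fitted coefficients
print(mtcars_glm)
h2o.coef(mtcars_glm)

# Score the training frame and pull predictions back into R
pred_h2o <- h2o.predict(mtcars_glm, mtcars_h2o)
pred_df  <- as.data.frame(pred_h2o)
head(pred_df)
```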

That’s all, enjoy!!


Exploring & transforming H2O Data Frame in R and Python

Sometimes you may need to ingest a dataset to build models, and your first task is to explore all the features and their types. Once that is done, you may want to change some feature types to the ones you want.

Here is the code snippet in Python:

df = h2o.import_file('')
df.types
{    u'AGE': u'int', u'CAPSULE': u'int', u'DCAPS': u'int',
     u'DPROS': u'int', u'GLEASON': u'int', u'ID': u'int',
     u'PSA': u'real', u'RACE': u'int', u'VOL': u'real'}
If you would like to visualize all the features in graphical format, one way to do it is to plot histograms over the pandas conversion of the frame:
import pylab as pl
df.as_data_frame().hist(figsize=(20, 20))
pl.show()
Note: If you have more than 50 features, you might have to trim your data frame to fewer features so you can have an effective visualization.
You can also use the following function to convert a list of columns to factor/categorical by passing an H2O data frame and a list of columns:
def convert_columns_as_factor(hdf, column_list):
    list_count = len(column_list)
    if list_count == 0:
        return "Error: You don't have a list of binary columns."
    if len(hdf.columns) == 0:
        return "Error: You don't have any columns in your data frame."
    local_column_list = hdf.columns
    for i in range(list_count):
        try:
            target_index = local_column_list.index(column_list[i])
            hdf[column_list[i]] = hdf[column_list[i]].asfactor()
            print('Column ' + column_list[i] + " is converted into factor/categorical.")
        except ValueError:
            print('Error: ' + str(column_list[i]) + " not found in the data frame.")

# Example usage:
# convert_columns_as_factor(df, ['CAPSULE', 'RACE'])

The following R script performs the same tasks as above:

N = 1000  # number of rows to generate (value assumed; not given in the original)
color = sample(c("D","E","I","F","M"),size=N,replace=TRUE)
num = rnorm(N,mean = 12,sd = 21212)
sex = sample(c("male","female"),size=N,replace=TRUE)
sex = as.factor(sex)
color = as.factor(color)
data = sample(c(0,1),size = N,replace = T)
fdata = factor(data)
dd = data.frame(color,sex,num,fdata)
data = as.h2o(dd)
data$sex = h2o.setLevels(x = data$sex ,levels = c("F","M"))
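You can then verify the column types and factor levels; a small sketch assuming the `data` H2OFrame created above:

```r
# Summarize column types, cardinality and missing counts
h2o.describe(data)

# Inspect the levels of a factor column
h2o.levels(data$sex)

# Convert a column to factor explicitly if needed
data$fdata <- as.factor(data$fdata)
```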
That's it, enjoy!!

Renaming H2O data frame column name in R

Following is the code snippet showing how you can rename a column in H2O data frame in R:

> train.hex <- h2o.importFile("")
  |======================================================| 100%

> train.hex
  sepal_len sepal_wid petal_len petal_wid class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
[150 rows x 5 columns] 

> h2o.names(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "class"    

> h2o.colnames(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "class" 

## Now use the index (starting from 1) to change a column name as below
## Changing "class" to "new_class"

> names(train.hex)[5] = c("new_class")

# Checking Result:
> h2o.colnames(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "new_class"
> h2o.names(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "new_class"  

## Now changing "sepal_len" to "sepal_len_new"
> names(train.hex)[1] = c("sepal_len_new")

> h2o.names(train.hex)
[1] "sepal_len_new" "sepal_wid" "petal_len" "petal_wid" "new_class"    

> h2o.colnames(train.hex)
[1] "sepal_len_new" "sepal_wid" "petal_len" "petal_wid" "new_class"    

> train.hex
  sepal_len_new sepal_wid petal_len petal_wid new_class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa


That’s it, enjoy!!

Calculating Standard Deviation using custom UDF and group by in H2O

Here is the full code to calculate standard deviation using the H2O group-by method as well as using a custom UDF:

irisPath <- system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex <- h2o.uploadFile(path = irisPath, destination_frame = "iris.hex")

# Calculating Standard Deviation using h2o group by
SdValue <- h2o.group_by(data = iris.hex, by = "class", sd("sepal_len"))

# Printing result
SdValue

# Alternative defining a UDF for Standard Deviation
mySDUdf <- function(df) { sd(df[,1],na.rm = T) }

# Using h2o ddply with UDF
SdValue <- h2o.ddply(iris.hex, "class", mySDUdf)

# Printing result
SdValue
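If you want several aggregates at once, `h2o.group_by` accepts multiple aggregations in one call; a sketch on the same `iris.hex` frame:

```r
# Mean, standard deviation and row count of sepal_len per class, in one pass
stats <- h2o.group_by(data = iris.hex, by = "class",
                      mean("sepal_len"), sd("sepal_len"), nrow("sepal_len"))
as.data.frame(stats)
```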

That's it, enjoy!!

Calculate mean using UDF in H2O

Here is the full code to write a UDF to calculate mean for a given data frame using H2O machine learning platform:


ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(path = ausPath)

# Writing the UDF
myMeanUDF = function(Fr) { mean(Fr[, 1]) }

# Applying UDF using ddply
MeanValue = h2o.ddply(australia.hex[, c("premax", "salmax", "Max_czcs")], c("premax", "salmax"), myMeanUDF)

# Printing Results
MeanValue

That's it, enjoy!!

Anomaly Detection with Deep Learning in R with H2O

The following R script downloads the ECG dataset (training and validation) from the internet and performs deep-learning-based anomaly detection on it.

# Import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(
 path = "", 
 header = FALSE, 
 sep = ",")
test_ecg <- h2o.importFile(
 path = "", 
 header = FALSE, 
 sep = ",")

# Train deep autoencoder learning model on "normal" 
# training data, y ignored 
anomaly_model <- h2o.deeplearning(
 x = names(train_ecg), 
 training_frame = train_ecg, 
 activation = "Tanh", 
 autoencoder = TRUE, 
 hidden = c(50,20,50), 
 sparse = TRUE,
 l1 = 1e-4, 
 epochs = 100)

# Compute reconstruction error with the Anomaly 
# detection app (MSE between output and input layers)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)

# Pull reconstruction error data into R and 
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
plot.ts(recon_error)

# Note: Testing = Reconstructing the test dataset
test_recon <- h2o.predict(anomaly_model, test_ecg) 
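To flag anomalous rows programmatically you could threshold the reconstruction error; a sketch under the assumption that high-MSE rows are the outliers (the 95th-percentile cutoff is an arbitrary illustrative choice, and `Reconstruction.MSE` is the column name h2o.anomaly produces):

```r
# Convert reconstruction error to a plain R data.frame
err_df <- as.data.frame(h2o.anomaly(anomaly_model, test_ecg))

# Arbitrary cutoff: flag the top 5% highest-error heartbeats
cutoff   <- quantile(err_df$Reconstruction.MSE, 0.95)
outliers <- which(err_df$Reconstruction.MSE > cutoff)
print(outliers)
```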

h2o.saveModel(anomaly_model, "/Users/avkashchauhan/learn/tmp/anomaly_model.bin")
h2o.download_pojo(anomaly_model, "/Users/avkashchauhan/learn/tmp/", get_jar = TRUE)

h2o.shutdown(prompt= FALSE)



That's it, enjoy!!

Spark with H2O using rsparkling and sparklyr in R

You must have installed:

  • sparklyr
  • rsparkling


Here is the working script:

> options(rsparkling.sparklingwater.version = "2.1.6")
> Sys.setenv(SPARK_HOME='/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.6')
> library(rsparkling)
> library(sparklyr)
> sc <- spark_connect(master = "local", version = "2.1.0")

Testing the spark context:

[1] "local[8]"

[1] "shell"

[1] "sparklyr"

[1] 8

[1] 8

[1] ""

[1] "^1.*"

[1] ""

[1] "default"
[1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/sparklyr/conf/config-template.yml"

[1] "/Volumes/OSxexT/tools/spark-2.1.0-bin-hadoop2.6"

A connection with
description "->localhost:53374"
class "sockconn"
mode "wb"
text "binary"
opened "opened"
can read "yes"
can write "yes"

A connection with
description "->localhost:8880"
class "sockconn"
mode "rb"
text "binary"
opened "opened"
can read "yes"
can write "yes"

[1] "/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T//RtmpIIVL8I/file6ba7b454325_spark.log"

class org.apache.spark.SparkContext


class org.apache.spark.sql.SparkSession

[1] "spark_connection" "spark_shell_connection" "DBIConnection"
> h2o_context(sc, st)
Error in is.H2OFrame(x) : object 'st' not found
> h2o_context(sc, strict_version_check = FALSE)
class org.apache.spark.h2o.H2OContext

Sparkling Water Context:
* H2O name: sparkling-water-avkashchauhan_1672148412
* cluster size: 1
* list of used nodes:
(executorId, host, port)

Open H2O Flow in browser: (CMD + click in Mac OSX)

You can use the following command to launch H2O FLOW:

h2o_flow(sc, strict_version_check = FALSE)

That's it, enjoy!!