Installing IPython 5.x (lower than 6.0), compatible with Python 2.6/2.7

You may need to install some Python library or component in your Python 2.6 or 2.7 environment. If that component depends on IPython, the install can fail, because the latest IPython releases no longer support Python 2.

For example, with Python 2.7.x, when you try to install Jupyter as below:

$ pip install jupyter --user

you will get an error like the one below:

Using cached ipython-6.0.0.tar.gz
 Complete output from command python setup.py egg_info:

IPython 6.0+ does not support Python 2.6, 2.7, 3.0, 3.1, or 3.2.
 When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
 Beginning with IPython 6.0, Python 3.3 and above is required.

See IPython `README.rst` file for more information:

https://github.com/ipython/ipython/blob/master/README.rst

Python sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0) detected.

To solve this problem, you just need to install IPython 5.x instead of 6.0 (which is pulled in by default when installing jupyter, or ipython on its own).

Here is how you can install the IPython 5.x version:

$ pip install IPython==5.0 --user
$ pip install jupyter --user
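If you script your environment setup, the same pin can be chosen programmatically. Here is a minimal sketch (the requirement strings are illustrative choices, not an official API):

```python
import sys

# IPython 6.0+ requires Python 3.3+, so on older interpreters
# pin to the 5.x LTS line; otherwise take the latest release.
if sys.version_info < (3, 3):
    requirement = "ipython>=5.0,<6.0"
else:
    requirement = "ipython"

print(requirement)
```

You can feed the resulting string straight to `pip install` in a setup script.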

That's it, enjoy!!

Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O's high-performance, distributed machine learning algorithms on Spark, using R. Visit the GitHub project: https://github.com/h2oai/rsparkling

You must have the sparklyr and rsparkling packages installed in your R environment.

You must have the latest Sparkling Water package downloaded and unzipped locally.

I am using the following versions in my environment:

  • Spark 2.1
  • Sparkling Water 2.1.8
  • sparklyr 0.4.4
  • rsparkling 0.2.0

Now here is the rsparkling script to create the cluster locally:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

Now here is the rsparkling script to create a Spark cluster on YARN:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 2
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

That's it, enjoy!!

Installing R on Redhat 7 (EC2 RHEL 7)

Check your machine's version:

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Now let's update the RPM repo details:

$ sudo su -c 'rpm -Uvh http://mirror.sfo12.us.leaseweb.net/epel/7/x86_64/e/epel-release-7-9.noarch.rpm'
$ sudo yum update

Make sure all dependencies are installed individually:

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-devel-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall blas-devel-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall blas-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/lapack-devel-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall lapack-devel-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/texinfo-tex-5.1-4.el7.x86_64.rpm
$ sudo yum localinstall texinfo-tex-5.1-4.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/texlive-epsf-svn21461.2.7.4-38.el7.noarch.rpm
$ sudo yum localinstall texlive-epsf-svn21461.2.7.4-38.el7.noarch.rpm

Finally install R now:

$ sudo yum install R

That's it.

How to regularize the intercept in GLM

Sometimes you may want to emulate hierarchical modeling; to achieve this you can use beta_constraints as below:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# One row per constrained coefficient; "Intercept" is a reserved keyword
bc = h2o.H2OFrame([("Intercept", -1000, 1000, 3, 30)],
                  column_names=["names", "lower_bounds", "upper_bounds", "beta_given", "rho"])
glm = H2OGeneralizedLinearEstimator(family="gaussian",
                                    beta_constraints=bc,
                                    standardize=False)
glm.train(x=["class", "sepal_wid", "petal_len", "petal_wid"], y="sepal_len", training_frame=iris)
glm.coef()
The output will look like below:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There's more information in the GLM booklet linked below, but the short version is to create a new constraints frame with the columns names, lower_bounds, upper_bounds, beta_given, and rho, with one row per feature you want to constrain. You can use "Intercept" as a keyword to constrain the intercept.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
  • names: (mandatory) coefficient names
  • lower_bounds: (optional) coefficient lower bounds; must be less than or equal to upper_bounds
  • upper_bounds: (optional) coefficient upper bounds; must be greater than or equal to lower_bounds
  • beta_given: (optional) specifies the given solution in the proximal operator interface
  • rho: (mandatory if beta_given is specified, otherwise ignored) specifies per-column L2 penalties on the distance from the given solution
If you want to go deeper into how these L1/L2 parameters work, here are more details.

What happens is that an L2 penalty is applied between the coefficient and the given value. The proximal penalty is computed as sum_i rho[i] * (beta[i] - beta_given[i])^2. You can confirm this by setting rho to whatever lambda would be and setting lambda to 0; this gives the same result as having set lambda to that value.

You can use beta constraints to assign per-feature regularization strength, but only for the L2 penalty. The math is:

sum_i rho[i] * L2norm2(beta[i] - beta_given[i])

So if you set beta_given to zero and set rho for every column except the intercept to 1e-5, it is equivalent to running without beta constraints, just with alpha = 0 and lambda = 1e-5.
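To make the penalty concrete, here is a small numpy sketch of the proximal term sum_i rho[i] * (beta[i] - beta_given[i])^2 (the coefficient values and rho settings below are illustrative, loosely echoing the iris example above):

```python
import numpy as np

# Coefficients the solver found, the "given" solution, and per-column rho
beta       = np.array([3.0, 0.44, 0.38])   # e.g. Intercept, petal_len, sepal_wid
beta_given = np.array([3.0, 0.00, 0.00])   # pull intercept toward 3, others toward 0
rho        = np.array([30.0, 1e-5, 1e-5])  # strong pull on intercept, weak elsewhere

# Proximal penalty: sum_i rho[i] * (beta[i] - beta_given[i])^2
penalty = np.sum(rho * (beta - beta_given) ** 2)
```

With beta_given all zero and a uniform rho, this reduces to rho * ||beta||^2, i.e. plain ridge (alpha = 0) with lambda = rho, which is the equivalence described above.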
That's it, enjoy!!

Creating Partial Dependency Plot (PDP) in H2O

Starting from version 3.10.0.8, H2O includes a partial dependency plot (PDP) with a Java backend that does the multi-scoring of the dataset with the model. This makes creating PDPs much faster.

To get a PDP in H2O you need the model and the original data set used to build the model. Here are a few ways to create a PDP:

If you want to generate a PDP on a single column:

response = h2o.predict(model, data.pdp[, column_name])
To generate a PDP on the original data set:
response = h2o.predict(model, data.pdp)
If you want to build a PDP directly from the model and dataset without using the PDP API, you can use the following code:
model = prostate.gbm
column_name = "AGE"
data.pdp = data.hex
# Use the 5%..100% quantiles of the column as grid points
bins = unique(h2o.quantile(data.hex[, column_name], probs = seq(0.05, 1, 0.05)))
mean_responses = c()

for (bin in bins) {
  # Pin the column of interest to the grid point, then score the full frame
  data.pdp[, column_name] = bin
  response = h2o.predict(model, data.pdp)
  mean_response = mean(response[, ncol(response)])
  mean_responses = c(mean_responses, mean_response)
}

pdp_manual = data.frame(AGE = bins, mean_response = mean_responses)
plot(pdp_manual, type = "l")
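The same manual PDP loop can be sketched outside H2O in plain numpy, with a toy model function standing in for the fitted model and predict call (all names and data here are illustrative):

```python
import numpy as np

# Toy stand-in for a fitted model: predictions from two features
def model_predict(X):
    return 0.5 * X[:, 0] + 0.1 * X[:, 1]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))

# Grid points: the 5%..100% quantiles of the feature of interest (column 0)
bins = np.unique(np.quantile(X[:, 0], np.arange(0.05, 1.0 + 1e-9, 0.05)))

mean_responses = []
for b in bins:
    X_pdp = X.copy()
    X_pdp[:, 0] = b                       # pin the feature to the grid point
    mean_responses.append(model_predict(X_pdp).mean())

# (bins, mean_responses) is the manual partial-dependence curve
```

Plotting mean_responses against bins gives the same kind of curve as the R snippet above.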
That's it, enjoy!!