Saving H2O model object as text locally

Sometimes you may want to store an H2O model object as text on the local file system. In this example I will show you how you can save an H2O model object to local disk as plain text content. You can get the full working Jupyter notebook for this example from my GitHub.

Based on my experience, the following example works fine with Python 2.7.12 and Python 3.4. I also found that the H2O model object tables were not saved to the text file from a Jupyter notebook; however, when I ran the same code from the command line in a Python shell, all the content was written perfectly.

Let's build an H2O GBM model using the public PROSTATE dataset. The following is a full working script which will generate the GBM binomial model:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y) 
df[y] = df[y].asfactor()

df_train, df_valid = df.split_frame(ratios=[0.9])
print(df_train.shape)
print(df_valid.shape)

prostate_gbm = H2OGradientBoostingEstimator(model_id="prostate_gbm",
                                            ntrees=1000,
                                            learn_rate=0.5,
                                            max_depth=20,
                                            stopping_tolerance=0.001,
                                            stopping_rounds=2,
                                            score_each_iteration=True)

prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_gbm
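
Note that the text dump we are about to create is meant for human inspection only. If you also want a reloadable binary copy of the trained model, h2o.save_model is the standard call (the target directory below is just an illustration):

# persist the trained model in H2O's binary format (the path is an assumption)
model_path = h2o.save_model(model=prostate_gbm, path="/tmp/prostate_gbm_dir", force=True)
print(model_path)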

Now we will redirect the standard output to a local file, so that anything printed goes into the file instead of onto the screen:

import sys

old_target = sys.stdout
f = open('/Users/avkashchauhan/Downloads/model_output.txt', 'w')
sys.stdout = f

Let's look at the content of the local file we have just created in the above step (it is empty at this point):

!cat /Users/avkashchauhan/Downloads/model_output.txt

Now we will run the following commands, which will fill the redirected standard output with the model details as text:

print("Model summary>>> model_object.show()")
prostate_gbm.show()

Now we will restore the original standard output and close the file, which flushes the captured model details to the local text file:

sys.stdout = old_target
f.close()

Now if we check the local file contents again, this time you will see that the output of the above command has been written into the file:

!cat /Users/avkashchauhan/Downloads/model_output.txt

You will see the command output stored in the local text file as below:

Model summary>>> model_object.show()
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  prostate_gbm


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.036289343297
RMSE: 0.190497620187
LogLoss: 0.170007804527
Mean Per-Class Error: 0.0160045361428
AUC: 0.998865964296
Gini: 0.997731928592
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.487417363665: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 40.36 %



ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.161786079676
RMSE: 0.402226403505
LogLoss: 0.483923658542
Mean Per-Class Error: 0.174208144796
AUC: 0.871040723982
Gini: 0.742081447964
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.205076283533: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 39.53 %


Scoring History: 
Variable Importances:

Note: if you are wondering what the "!" sign does here, it is used to run a Linux shell command (in this case "cat") inside a Jupyter cell.
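
As a side note, the same redirection can be done more robustly with contextlib.redirect_stdout (Python 3.4+), which restores the standard output even if an exception occurs in between. A minimal sketch, assuming the same output path as above:

import contextlib

# stdout is redirected only inside the block; the file is closed and
# stdout restored automatically afterwards
with open('/Users/avkashchauhan/Downloads/model_output.txt', 'w') as f:
    with contextlib.redirect_stdout(f):
        prostate_gbm.show()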

That's it, enjoy!!

 


Installing or upgrading python3.6 in Ubuntu 16.04

Download Python 3.6.1 and install it as below:

wget https://www.python.org/ftp/python/3.6.1/Python-3.6.1.tgz
tar xvf Python-3.6.1.tgz
cd Python-3.6.1
./configure --enable-optimizations
make -j8
# If you want to keep the previous version, use altinstall
sudo make altinstall
# If you want to replace the previous version, use install
# sudo make install
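
If the configure or make step fails with missing headers, you will likely need the common build dependencies first. The following package list is a typical baseline on Ubuntu 16.04 (an assumption, adjust to your needs):

sudo apt-get update
sudo apt-get install build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev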

Testing python3.6

$ python3.6

Once it is working, check its install path:

$ which python3.6
/usr/local/bin/python3.6

Now you just need to update the symlink for the python3 binary as below (the link should live somewhere on your PATH, such as /usr/local/bin):

$ sudo ln -s /usr/local/bin/python3.6 /usr/local/bin/python3

Now test python3 one final time:

$ python3
Python 3.6.1 (default, Jun 8 2017, 16:11:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

That’s it, enjoy!!

Installing ipython 5.0 (lower than 6.0) compatible with python 2.6/2.7

It is possible that you may need to install some python library or component with your python 2.6 or 2.7 environment. If those components need IPython, you will run into a version conflict, because IPython 6.0 and above no longer supports Python 2.

For example, with python 2.7.x, when you try to install jupyter as below:

$ pip install jupyter --user

You will get an error like the one below:

Using cached ipython-6.0.0.tar.gz
 Complete output from command python setup.py egg_info:

IPython 6.0+ does not support Python 2.6, 2.7, 3.0, 3.1, or 3.2.
 When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
 Beginning with IPython 6.0, Python 3.3 and above is required.

See IPython `README.rst` file for more information:

https://github.com/ipython/ipython/blob/master/README.rst

Python sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0) detected.

To solve this problem you just need to install IPython 5.x instead of 6.0, which is pulled in by default when installing jupyter (or ipython on its own).

Here is how you can install an IPython 5.x version:

$ pip install IPython==5.0 --user
$ pip install jupyter --user
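
Alternatively, you can let pip pick the newest compatible 5.x release by using a version range instead of pinning an exact version:

$ pip install "ipython<6" --user
$ pip install jupyter --user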

That's it, enjoy!!


Splitting h2o data frame based on date time value

Sometimes we may need to split a data frame based on date-time values, i.e. one split contains all rows before a certain date and the other contains all rows on or after that date.

Here is an example of the Python code showing how to split it:

import datetime
import h2o

h2o.init()
timedata = h2o.import_file("/Users/avkashchauhan/Downloads/date-data.csv")
timedata.shape
# rows strictly before October 1, 2015
date_before_data = timedata[timedata['date'] < datetime.datetime(2015, 10, 1, 0, 0, 0), :]
# rows on or after October 1, 2015
date_after_data = timedata[timedata['date'] >= datetime.datetime(2015, 10, 1, 0, 0, 0), :]
date_before_data.shape
date_after_data.shape

If you then decide to split one of the pieces further and append one of those splits back to the other data frame, you can do the following:

part1, part2 = date_after_data.split_frame(ratios=[0.5])
final_data = date_before_data.rbind(part2)

Note the CSV file contents are as below:

id date
1 9/1/2015
2 9/2/2015
3 9/3/2015
4 9/4/2015
5 9/5/2015
6 9/6/2015
7 9/7/2015
8 9/8/2015
9 9/9/2015
10 9/10/2015
11 12/1/2015
12 12/2/2015
13 12/3/2015
14 12/4/2015
15 12/5/2015
16 12/6/2015
17 12/7/2015
18 12/8/2015
19 12/9/2015
20 12/10/2015
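
If the date column happens to be imported as strings rather than as a time type, the comparison above will not work. In that case you may need to convert the column first; a minimal sketch, assuming the month/day/year format shown above:

# convert a string column into an H2O time column (the format string is an assumption)
timedata['date'] = timedata['date'].as_date("%m/%d/%Y")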

That's it, enjoy!!

Union of two different H2O data frames in python and R

We have the first data frame as below:

C1 C2 C3 C4
10 20 30 40
3 4 5 6
5 7 8 9
12 3 55 10

And then we have the second data frame as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If we just try to row-bind these two data frames blindly as below:

final = df2.rbind(df1)

We will get the following error:

H2OValueError: Cannot row-bind a dataframe with 6 columns to a data frame with 4 columns: the columns must match

So to merge two data sets with different columns, we need to prepare our datasets to meet the rbind requirement. First we will add the columns that “df2” has but “df1” lacks, as below:

df1['C10'] = 0
df1['C20'] = 0
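
More generally, you can add the missing columns programmatically instead of naming them one by one. A minimal sketch (it assumes a constant fill value of 0 makes sense for your data):

# walk df2's columns in order so the appended columns line up for rbind
for col in df2.columns:
    if col not in df1.columns:
        df1[col] = 0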

The updated data frame looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0

Now we will rbind “df2” onto “df1” as below:

df1 = df1.rbind(df2)

Now “df1” looks as below:

C1 C2 C3 C4 C10 C20
10 20 30 40 0 0
3 4 5 6 0 0
5 7 8 9 0 0
12 3 55 10 0 0
10 20 30 40 33 44
3 4 5 6 11 22
5 7 8 9 90 100
12 3 55 10 33 44

If you are using R, you just need to do the following to add the new columns into your first data frame:

df1$C10 = 0
df1$C20 = 0

You must make sure the number of columns match before doing rbind and number of rows match before doing cbind.
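
For completeness, a column-bind is the one-liner below (a hedged sketch; it assumes both frames have the same number of rows):

# append df2's columns to df1 side by side
df3 = df1.cbind(df2)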

That's it, enjoy!!

Renaming data frame column names in H2O (python)

Sometimes you may need to change all the column names, or just a specific column, due to a certain need, and you can do it as below:

>>> df = h2o.import_file("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/iris/iris.csv")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df
 C1 C2 C3 C4 C5
---- ---- ---- ---- -----------
 5.1 3.5 1.4 0.2 Iris-setosa
 4.9 3 1.4 0.2 Iris-setosa
 4.7 3.2 1.3 0.2 Iris-setosa
 4.6 3.1 1.5 0.2 Iris-setosa
 5 3.6 1.4 0.2 Iris-setosa
 5.4 3.9 1.7 0.4 Iris-setosa
 4.6 3.4 1.4 0.3 Iris-setosa
 5 3.4 1.5 0.2 Iris-setosa
 4.4 2.9 1.4 0.2 Iris-setosa
 4.9 3.1 1.5 0.1 Iris-setosa

[150 rows x 5 columns]

>>> df.names
[u'C1', u'C2', u'C3', u'C4', u'C5']

>>> df.set_names(['A1','A2','A3','A4','A5'])
 A1 A2 A3 A4 A5
---- ---- ---- ---- ------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

If you want to change only a few column names, you still need to pass the original names at the same indexes and put the changed names only where applicable. For example, in the above data frame we just want to change A5 to Levels, so we will do it as below:

>>> df.set_names(['A1','A2','A3','A4','Levels'])
 A1 A2 A3 A4 Levels
---- ---- ---- ---- --------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

The set_names function must receive a value for every column in the array, either the same name or a changed name; otherwise it will generate an error.

For example, the following will not work and will throw an error:

>>> df.set_names(['A1'])
>>> df.set_names(['A1','A2','A3','A4','A5','A6'])
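
If typing out the full list is tedious, you can build it programmatically from the existing names so that only one entry changes (a minimal sketch using only df.names and set_names, as shown above):

>>> new_names = list(df.names)
>>> new_names[4] = 'Levels'
>>> df = df.set_names(new_names)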

That's it, enjoy!!

Using Sparkling water and PySpark to log console output

Here is the command Option #1:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

Here is the command Option #2:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0,ai.h2o:sparkling-water-core_2.10:1.6.7 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

We must make sure that both the H2O backend and the Python package are using the same version of the API.

This parameter selects H2O API backend version 1.6.7:
ai.h2o:sparkling-water-core_2.10:1.6.7

This parameter selects the 1.6.7 version of the Python API:
--py-files /mnt/app/sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

Here is the script to test the overall scenario (run it inside the pyspark shell, where the SparkContext sc is already defined):

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> hc = H2OContext.getOrCreate(sc)
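
A quick way to confirm that the Spark side and the H2O side can talk to each other is to push a tiny frame across the boundary. A hedged sketch (the sample data is made up, and as_h2o_frame is assumed to be the pysparkling conversion method for this version):

>>> sdf = sqlContext.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
>>> hf = hc.as_h2o_frame(sdf)
>>> hf.shape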