Saving H2O model object as text locally

Sometimes you may want to store the H2O model object as text on the local file system. In this example I will show you how to save an H2O model object to local disk as simple text content. You can get the full working Jupyter notebook for this example here from my GitHub.

Based on my experience, the following example works fine with Python 2.7.12 and Python 3.4. I also found that the H2O model object tables were not saved to the text file when run from a Jupyter notebook; however, when I ran the same code from the command line in a Python shell, all the content was written perfectly.

Let's build an H2O GBM model using the public PROSTATE dataset (the following is a full working script that generates a binomial GBM model):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y) 
df[y] = df[y].asfactor()

df_train, df_valid = df.split_frame(ratios=[0.9])
print(df_train.shape)
print(df_valid.shape)

prostate_gbm = H2OGradientBoostingEstimator(model_id="prostate_gbm",
                                            ntrees=1000,
                                            learn_rate=0.5,
                                            max_depth=20,
                                            stopping_tolerance=0.001,
                                            stopping_rounds=2,
                                            score_each_iteration=True)

prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_gbm

Now we will redirect the standard output to a local text file so that the model details we print next are captured in it:

import sys

old_target = sys.stdout
f = open('/Users/avkashchauhan/Downloads/model_output.txt', 'w')
sys.stdout = f

Let's look at the content of the local file we just created in the step above (at this point it is empty):

!cat /Users/avkashchauhan/Downloads/model_output.txt

Now we will run the following commands, which write the model details as text to the redirected standard output:

print("Model summary>>> model_object.show()")
prostate_gbm.show()

Now we will restore the standard output and close the local file so that everything we printed is flushed to disk:

sys.stdout = old_target
f.close()

Now check the local file contents again; this time you will see that the output of the above commands has been written to the file:

!cat /Users/avkashchauhan/Downloads/model_output.txt

You will see the command output stored in the local text file, similar to the following:

Model summary>>> model_object.show()
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  prostate_gbm


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.036289343297
RMSE: 0.190497620187
LogLoss: 0.170007804527
Mean Per-Class Error: 0.0160045361428
AUC: 0.998865964296
Gini: 0.997731928592
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.487417363665: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 40.36 %



ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.161786079676
RMSE: 0.402226403505
LogLoss: 0.483923658542
Mean Per-Class Error: 0.174208144796
AUC: 0.871040723982
Gini: 0.742081447964
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.205076283533: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 39.53 %


Scoring History: 
Variable Importances:

Note: If you are wondering what the “!” sign does here, it is used to run a Linux shell command (in this case “cat”) from inside a Jupyter cell.
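If you prefer not to swap sys.stdout by hand, the same capture can be done with contextlib.redirect_stdout from the Python standard library (available in Python 3.4+). This is just a minimal sketch of the alternative; the file path is an example and prostate_gbm is the model trained above:

import contextlib

# Capture everything the model prints into a local text file.
with open('/Users/avkashchauhan/Downloads/model_output.txt', 'w') as f:
    with contextlib.redirect_stdout(f):
        print("Model summary>>> model_object.show()")
        prostate_gbm.show()
# When the "with" blocks exit, the file is closed and fully flushed.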

That's it, enjoy!!

 


Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone

Please visit my GitHub and this specific page for more details.

Installation:

Note: You must be on a Hadoop machine; Herringbone needs a Hadoop environment.

Prerequisite: Thrift

  • Thrift 0.9.1 (you MUST have 0.9.1, as 0.9.3 and 0.10.0 will give errors while packaging)
  • Get Thrift 0.9.1: Link

Prerequisite: Impala

  • First set up the Cloudera repo on your machine
  • Install Impala
    • Install Impala: $ sudo apt-get install impala
    • Install Impala server: $ sudo apt-get install impala-server
    • Install Impala state-store: $ sudo apt-get install impala-state-store
    • Install Impala shell: $ sudo apt-get install impala-shell
    • Verify Impala: $ impala-shell
impala-shell
Starting Impala Shell without Kerberos authentication
Connected to mr-0xd7-precise1.0xdata.loc:21000
Server version: impalad version 2.6.0-cdh5.8.4 RELEASE (build 207450616f75adbe082a4c2e1145a2384da83fa6)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.4.0-cdh4-INTERNAL (08fa346) built on Mon Jul 14 15:52:52 PDT 2014)

Building: Herringbone source

Here is the successful herringbone “mvn package” command log for your review:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Herringbone Impala
[INFO] Herringbone Main
[INFO] Herringbone
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone Impala 0.0.2
[INFO] ------------------------------------------------------------------------
..
..
..
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone 0.0.1
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Herringbone Impala ................................. SUCCESS [ 2.930 s]
[INFO] Herringbone Main ................................... SUCCESS [ 13.012 s]
[INFO] Herringbone ........................................ SUCCESS [ 0.000 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.079 s
[INFO] Finished at: 2017-10-06T11:27:20-07:00
[INFO] Final Memory: 90M/1963M
[INFO] ------------------------------------------------------------------------

Using Herringbone

Note: You must have the files on Hadoop, not on the local file system.

Verify the file on Hadoop:

~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet
-rw-r--r--   3 avkash avkash    1463376 2017-09-13 16:56 /user/avkash/file-test1.parquet

~/herringbone$ bin/herringbone flatten -i /user/avkash/file-test1.parquet
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/avkash/herringbone/herringbone-main/target/herringbone-0.0.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/10/06 12:06:44 INFO client.RMProxy: Connecting to ResourceManager at mr-0xd1-precise1.0xdata.loc/172.16.2.211:8032
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/10/06 12:06:45 INFO input.FileInputFormat: Total input paths to process : 1
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
1 initial splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
1 merged splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: number of splits:1
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1499294366934_0707
17/10/06 12:06:45 INFO impl.YarnClientImpl: Submitted application application_1499294366934_0707
17/10/06 12:06:46 INFO mapreduce.Job: The url to track the job: http://mr-0xd1-precise1.0xdata.loc:8088/proxy/application_1499294366934_0707/
17/10/06 12:06:46 INFO mapreduce.Job: Running job: job_1499294366934_0707
17/10/06 12:06:52 INFO mapreduce.Job: Job job_1499294366934_0707 running in uber mode : false
17/10/06 12:06:52 INFO mapreduce.Job:  map 0% reduce 0%
17/10/06 12:07:22 INFO mapreduce.Job:  map 100% reduce 0%

Now verify the file:

~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet-flat

Found 2 items
-rw-r--r--   3 avkash avkash          0 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/_SUCCESS
-rw-r--r--   3 avkash avkash    2901311 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/part-m-00000.parquet
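As an optional sanity check, you can read the flattened file back and confirm that the nested columns now appear as plain top-level columns. Here is a small PySpark sketch (the path matches the example above, and spark is assumed to be an existing SparkSession on the cluster):

# Read the flattened output back and inspect the schema (PySpark sketch).
flat_df = spark.read.parquet("/user/avkash/file-test1.parquet-flat")
flat_df.printSchema()    # nested structs should now show up as flat columns
print(flat_df.count())   # row count should match the original file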

That's it, enjoy!!

Reading nested parquet file in Scala and exporting to CSV

Recently we were working on a problem where a compressed Parquet file had lots of nested tables, some of which had array-type columns, and our objective was to read it and save it as CSV.

We wrote a script in Scala which does the following:

  • Handles nested compressed Parquet content
  • Looks for "Array"-type columns and removes them

Here is the script:

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colPath = if (prefix == null) s"`${f.name}`" else s"${prefix}.`${f.name}`"

    f.dataType match {
      case st: StructType => flattenSchema(st, colPath)
      // Skip user defined types like array or vectors
      case x if x.isInstanceOf[ArrayType] => Array.empty[Column]
      case _ => Array(col(colPath).alias(colPath.replaceAll("[.`]", "_")))
    }
  })
}

Here are all the steps you need to take to read the compressed Parquet content and then export it to disk as CSV:

val spark = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

scala> :paste
// Entering paste mode (ctrl-D to finish)

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colPath = if (prefix == null) s"`${f.name}`" else s"${prefix}.`${f.name}`"

    f.dataType match {
      case st: StructType => flattenSchema(st, colPath)
      // Skip user defined types like array or vectors
      case x if x.isInstanceOf[ArrayType] => Array.empty[Column]
      case _ => Array(col(colPath).alias(colPath.replaceAll("[.`]", "_")))
    }
  })
}

// Exiting paste mode, now interpreting.

flattenSchema: (schema: org.apache.spark.sql.types.StructType, prefix: String)Array[org.apache.spark.sql.Column]

scala>

val df = spark.read.parquet("/user/avkash/test.parquet")

df.select(flattenSchema(df.schema):_*).write.format("com.databricks.spark.csv").save("/Users/avkashchauhan/Downloads/saveit/result.csv")
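If you are working from PySpark rather than the Scala shell, a rough equivalent of the same flattening idea looks like the sketch below (Spark 2.x). It is untested against the original data, the paths are just the same examples, and spark is assumed to be an existing SparkSession:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, ArrayType

def flatten_schema(schema, prefix=None):
    """Recursively expand struct columns and drop array columns, as in the Scala version."""
    columns = []
    for field in schema.fields:
        col_path = field.name if prefix is None else prefix + "." + field.name
        if isinstance(field.dataType, StructType):
            columns += flatten_schema(field.dataType, col_path)
        elif isinstance(field.dataType, ArrayType):
            continue  # skip array columns
        else:
            columns.append(col(col_path).alias(col_path.replace(".", "_")))
    return columns

df = spark.read.parquet("/user/avkash/test.parquet")
df.select(flatten_schema(df.schema)).write.csv("/user/avkash/test_flat_csv", header=True)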

If you want to see the full working scripts with output, you can visit either of the following links based on your Spark version:

  • Here is the full working demo in Spark 2.1.0
  • Here is the full working demo in Spark 1.6.x.

We got some help from the StackOverflow discussion here. Michal K and Michal M helped me write the above solution.

That's it, enjoy!!

Setting up a Jupyter notebook server as a service in Ubuntu 16.04

Step 1: Verify the jupyter notebook location:

$ ll /home/avkash/.local/bin/jupyter-notebook
-rwxrwxr-x 1 avkash avkash 222 Jun 4 10:00 /home/avkash/.local/bin/jupyter-notebook*

Step 2: Configure your Jupyter notebook with a password and IP address as needed, and note where the configuration file lives. We will use this file as the configuration for the Jupyter service.

jupyter config: /home/avkash/.jupyter/jupyter_notebook_config.py
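For reference, the password in Step 2 is normally stored as a hash generated with the notebook.auth helper, and the relevant lines in jupyter_notebook_config.py look roughly like the sketch below (the IP, port, and hash are placeholders, not values from the original setup):

# Generate a password hash once, in a Python shell:
from notebook.auth import passwd
print(passwd())   # prompts for a password and prints a hash such as 'sha1:...'

# Then, in /home/avkash/.jupyter/jupyter_notebook_config.py (values are examples):
c.NotebookApp.ip = '0.0.0.0'            # listen on all interfaces
c.NotebookApp.port = 8888
c.NotebookApp.password = u'sha1:...'    # paste the hash printed above
c.NotebookApp.open_browser = False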

Step 3: Create a file named jupyter.service as below and save it into the /usr/lib/systemd/system/ folder.

$ cat /usr/lib/systemd/system/jupyter.service
[Unit]
Description=Jupyter Notebook

[Service]
Type=simple
PIDFile=/run/jupyter.pid
# Step 1 and Step 2 details are here..
# ------------------------------------
ExecStart=/home/avkash/.local/bin/jupyter-notebook --config=/home/avkash/.jupyter/jupyter_notebook_config.py
User=avkash
Group=avkash
WorkingDirectory=/home/avkash/tools/notebooks
Restart=always
RestartSec=10
#KillMode=mixed

[Install]
WantedBy=multi-user.target

Step 4: Now enable the service as below:

$ sudo systemctl enable jupyter.service

Step 5: Now reload the systemd daemon as below:

$ sudo systemctl daemon-reload

Step 6: Now restart the service as below:

$ sudo systemctl restart jupyter.service

The service is started now. You can test it as below:

$ systemctl -a | grep jupyter
 jupyter.service      loaded active running Jupyter Notebook

That's it, enjoy!!

Using RESTful API to get POJO and MOJO models in H2O

 

CURL API for Listing Models:

http://<hostname>:<port>/3/Models/

CURL API for Listing specific POJO Model:

http://<hostname>:<port>/3/Models/model_name

List Specific MOJO Model:

http://<hostname>:<port>/3/Models/glm_model/mojo

Here is an example:

curl -X GET "http://localhost:54323/3/Models"
curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> NAME_IT

curl -X GET "http://localhost:54323/3/Models/deeplearning_model" >> dl_model.java
curl -X GET "http://localhost:54323/3/Models/glm_model/mojo" > myglm_mojo.zip

That's it, enjoy!!

Installing IPython 5.0 (lower than 6.0) compatible with Python 2.6/2.7

It is possible that you may need to install some Python library or component in your Python 2.6 or 2.7 environment. If those components need IPython, the installation will fail, because IPython 6.0 and above no longer supports Python 2.

For example, with Python 2.7.x, when you try to install Jupyter as below:

$ pip install jupyter --user

You will get the error as below:

Using cached ipython-6.0.0.tar.gz
 Complete output from command python setup.py egg_info:

IPython 6.0+ does not support Python 2.6, 2.7, 3.0, 3.1, or 3.2.
 When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
 Beginning with IPython 6.0, Python 3.3 and above is required.

See IPython `README.rst` file for more information:

https://github.com/ipython/ipython/blob/master/README.rst

Python sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0) detected.

To solve this problem you just need to install IPython 5.x first (instead of 6.0, which is pulled in by default when installing Jupyter or IPython independently).

Here is the way you can install IPython 5.x version:

$ pip install IPython==5.0 --user
$ pip install jupyter --user
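After the install, a quick check in a Python shell confirms that the pinned IPython version is below 6.0 (a minimal verification sketch):

import sys
import IPython

# Expect an IPython 5.x release when running under Python 2.7
print(sys.version)
print(IPython.__version__)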

That's it, enjoy!!


Installing R on Redhat 7 (EC2 RHEL 7)

Check your machine version:

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Now let's update the RPM repo details:

$ sudo su -c 'rpm -Uvh http://mirror.sfo12.us.leaseweb.net/epel/7/x86_64/e/epel-release-7-9.noarch.rpm'
$ sudo yum update

Make sure all dependencies are installed individually:

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-devel-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall blas-devel-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/blas-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall blas-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/lapack-devel-3.4.2-5.el7.x86_64.rpm
$ sudo yum localinstall lapack-devel-3.4.2-5.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/texinfo-tex-5.1-4.el7.x86_64.rpm
$ sudo yum install texinfo-tex-5.1-4.el7.x86_64.rpm

$ wget http://mirror.centos.org/centos/7/os/x86_64/Packages/texlive-epsf-svn21461.2.7.4-38.el7.noarch.rpm
$ sudo yum install texlive-epsf-svn21461.2.7.4-38.el7.noarch.rpm

Finally, install R:

$ sudo yum install R

That's it.