Superset and Jupyter Notebooks on AWS as a Service

Jupyter Notebook (In EC2 Instance):

The following steps set up Jupyter Notebook as a server inside an AWS EC2 instance, which you can then access from your desktop/laptop as long as the EC2 instance is reachable from your machine:

  • $ conda activate python37
  • $ jupyter notebook --generate-config
    • This will create the jupyter_notebook_config.py configuration file in your home folder, i.e. /home/<username>/.jupyter/
  • $ jupyter notebook password  
    • You can set the password here
  • $ vi /home/centos/.jupyter/jupyter_notebook_config.py
    • Edit the following 2 lines (see the config sketch after this list)
    •   c.NotebookApp.ip = '0.0.0.0'
    •   c.NotebookApp.port = 8888
  • $ jupyter notebook
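
For reference, after the edit the relevant portion of jupyter_notebook_config.py should look roughly like this (a minimal sketch; your file will contain many more commented-out options, and the open_browser line is an optional extra, not part of the steps above):

## /home/<username>/.jupyter/jupyter_notebook_config.py
## Listen on all interfaces so the notebook is reachable from outside the instance
c.NotebookApp.ip = '0.0.0.0'
## Serve on port 8888 (make sure the EC2 security group allows inbound traffic on it)
c.NotebookApp.port = 8888
## Optional: don't try to open a browser on a headless server
c.NotebookApp.open_browser = False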

Apache Superset (In EC2 Instance):

The Superset installation and setup steps are covered in detail in the next section below.

That's all.

@avkashchauhan

Installing Apache Superset into CentOS 7 with Python 3.7

The following are the starter commands to install Superset:

  • $ python --version
    • Python 3.7.5
  • $ pip install superset

Possible Errors:

You might hit any or all of the following errors:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:

building '_geohash' extension
......
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1

gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1

Look for:

  • $ gcc --version <= You must have gcc installed
  • $ locate cc1plus <= You must have cc1plus installed

Install the required libraries and tools:

If any of the above components are missing, you need to install a few required libraries:

  • $ sudo yum install mlocate <= For the locate command
  • $ sudo updatedb <= Update the mlocate database
  • $ sudo yum install gcc <= For gcc if you don't have it
  • $ sudo yum install gcc-c++ <= For cc1plus if you don't have it

Verify the following again:

  • $ gcc --version
  • $ locate cc1plus
    • /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus

Note:

  • If you can locate cc1plus properly but are still getting the error, try the following
    • sudo ln -s /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus /usr/local/bin/
  • Try installing again

Final Installation:

Now you can install Superset as below:

  • $ pip install superset
    • Python 3.7.5
      Flask 1.1.1
      Werkzeug 0.16.0
  • $ superset db upgrade
  • $ export FLASK_APP=superset
  • $ flask fab create-admin
    • Recognized Database Authentications.
      Admin User admin created.
  • $ superset init
  • $ superset run -p 8080 --with-threads --reload --debugger

That's all.

@avkashchauhan

Error with python-geohash installation while installing superset on OSX Catalina

Error Installing superset on OSX Catalina:

Command:

$ pip3 install superset

$ pip3 install python-geohash

Error:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:
command: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/
Complete output (21 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.7
copying geohash.py -> build/lib.macosx-10.7-x86_64-3.7
copying quadtree.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpgrid.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpiarea.py -> build/lib.macosx-10.7-x86_64-3.7
running build_ext
building '_geohash' extension
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/src
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -DPYTHON_MODULE=1 -I/Users/avkashchauhan/anaconda3/include/python3.7m -c src/geohash.cpp -o build/temp.macosx-10.7-x86_64-3.7/src/geohash.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/src/geohash.o -o build/lib.macosx-10.7-x86_64-3.7/_geohash.cpython-37m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Reason:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Solution:

$ sudo CFLAGS=-stdlib=libc++ pip3 install python-geohash

$ sudo CFLAGS=-stdlib=libc++ pip3 install superset

That’s all for now.

@avkashchauhan

Calculating AUC and GINI model metrics for logistic classification

For a logistic classification problem we use the AUC metric to check model performance. Higher is better; as a rule of thumb, a value above 80% is considered good, and over 90% means the model is behaving great.

AUC is an abbreviation for Area Under the Curve. It is used in classification analysis to determine which of the models used predicts the classes best. An example of its application is the ROC curve, where the true positive rate is plotted against the false positive rate. You can learn more about AUC in this QUORA discussion.

We will also look at the GINI metric, which you can learn about from its wiki page. In this example we will see how the AUC and GINI model metrics are calculated using True Positive Rate (TPR) and False Positive Rate (FPR) values from a given test dataset.
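
The two metrics are closely related: for a binary classifier, GINI is just a rescaling of AUC (this is the same relationship used in code at the end of this article):

\[ \mathrm{GINI} = 2 \times \mathrm{AUC} - 1 \]

So an AUC of 0.5 (random guessing) corresponds to a GINI of 0, and a perfect AUC of 1.0 corresponds to a GINI of 1.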

You can get the full working Jupyter Notebook here from my Github.

Let's build a logistic classification model in H2O using the prostate dataset:

Preparation of H2O environment and dataset:

## Importing required libraries
import h2o
import sys
import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator

## Starting H2O machine learning cluster
h2o.init()

## Importing dataset
local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"
df = h2o.import_file(local_url)

## defining features and response column
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

## setting our response column to categorical so our model treats this as a classification problem
df[y] = df[y].asfactor()

Now we will split the dataset into 3 sets: training (80%), validation (10%) and test (the remaining ~10%):

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

Setting up the H2O GBM Estimator and building the GBM Model:

prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm",
 ntrees=500,
 learn_rate=0.001,
 max_depth=10,
 score_each_iteration=True)

## Building H2O GBM Model:
prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

## Understand the H2O GBM Model
prostate_gbm

Generating model performance with training, validation & test datasets:

train_performance = prostate_gbm.model_performance(df_train)
valid_performance = prostate_gbm.model_performance(df_valid)
test_performance = prostate_gbm.model_performance(df_test)

Let's take a look at the AUC metric provided by model performance:

print(train_performance.auc())
print(valid_performance.auc())
print(test_performance.auc())
print(prostate_gbm.auc())

Let's take a look at the GINI metric provided by model performance:

print(train_performance.gini())
print(valid_performance.gini())
print(test_performance.gini())
print(prostate_gbm.gini())

Let's generate the predictions using the test dataset:

predictions = prostate_gbm.predict(df_test)
## Here we will get the probability for the 'p1' values from the prediction frame:
predict_probability = predictions['p1']

Now we will import required scikit-learn libraries to generate AUC manually:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import random

Let's get the actual response values from the test data frame:

actual = df_test[y].as_data_frame()
actual_list = actual['CAPSULE'].tolist()
print(actual_list)

Now let's get the predicted probabilities from the prediction frame:

predictions_temp = predict_probability.as_data_frame()  ## the 'p1' column frame we extracted above
predictions_list = predictions_temp['p1'].tolist()
print(predictions_list)

Calculating False Positive Rate and True Positive Rate:

Let's calculate the TPR, FPR and threshold metrics from the predictions and the original data frame (see the definitions right after this list):
– False Positive Rate (fpr)
– True Positive Rate (tpr)
– Threshold
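
Both rates come from the confusion-matrix counts (true positives TP, false negatives FN, false positives FP, true negatives TN) at each threshold:

\[ \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \]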

fpr, tpr, thresholds = roc_curve(actual_list, predictions_list)
roc_auc = auc(fpr, tpr)
print(roc_auc)
print(test_performance.auc())

Note: Above you will see that our calculated AUC value is exactly the same as the one given by model performance for the test dataset.
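
As a side note, sklearn's auc() computes this area with the trapezoidal rule over the (fpr, tpr) points, so you can reproduce the number with numpy as one more sanity check (a small sketch reusing the fpr and tpr arrays from above):

import numpy as np

## Trapezoidal-rule area under the ROC points; should match roc_auc above
manual_auc = np.trapz(tpr, fpr)
print(manual_auc)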

Let's plot the ROC curve using matplotlib:

plt.title('ROC (Receiver Operating Characteristic)')
plt.plot(fpr, tpr, 'b',
label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate (TPR)')
plt.xlabel('False Positive Rate (FPR)')
plt.show()

[Figure: ROC curve plotted by the code above, with the AUC value in the legend]

This is how the GINI metric is calculated from AUC:

GINI = (2 * roc_auc) - 1
print(GINI)
print(test_performance.gini())

Note: Above you will see that our calculated GINI value is exactly the same as the one given by model performance for the test dataset.

That's it, enjoy!!

How the R2 error is calculated in a Generalized Linear Model

What is R2 (R^2 i.e. R-Squared)?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. … 100% indicates that the model explains all the variability of the response data around its mean. (From here)
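
Concretely, R^2 compares the model's sum of squared errors (SSE) against the total sum of squares around the mean (SST); this is exactly the math implemented in code later in this article:

\[ R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]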

You can get the full working jupyter notebook for this article from here directly from my Github.

Although this article explains how the R^2 error is calculated for an H2O GLM (Generalized Linear Model), the same math is used for any other statistical model, so you can apply this calculation anywhere you need it.

Let's build an H2O GLM model first:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

prostate_glm = H2OGeneralizedLinearEstimator(model_id = "prostate_glm")

prostate_glm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_glm

Now calculate Model Performance based on training, validation and test data:

train_performance = prostate_glm.model_performance(df_train)
valid_performance = prostate_glm.model_performance(df_valid)
test_performance = prostate_glm.model_performance(df_test)

Now let's check the default R^2 metrics for the training, validation and test data:

print(train_performance.r2())
print(valid_performance.r2())
print(test_performance.r2())
print(prostate_glm.r2())

Now let's get the predictions for the test data which we kept separate:

predictions = prostate_glm.predict(df_test)

Here is the math that is used to calculate the R2 metric for the test dataset:

## Sum of squared errors between predictions and actual values
SSE = ((predictions-df_test[y])**2).sum()
## Mean of the actual response values
y_hat = df_test[y].mean()
## Total sum of squares around the mean
SST = ((df_test[y]-y_hat[0])**2).sum()
## R^2 = 1 - SSE/SST
print(1-SSE/SST)
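
If you prefer an independent check, scikit-learn's r2_score produces the same number (a sketch, assuming the GLM prediction frame carries its usual single 'predict' column):

from sklearn.metrics import r2_score

## Pull the actuals and the predictions out of H2O into plain Python lists
actual = df_test[y].as_data_frame()[y].tolist()
predicted = predictions.as_data_frame()['predict'].tolist()
print(r2_score(actual, predicted))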

Now let's get the model performance for the given test data as below:

print(test_performance.r2())

Above we can see that both values, the one given by model performance for the test data and the one we calculated, are the same.

That's it, enjoy!!

Launching H2O cluster on different port in pysparkling

In this example we will launch an H2O machine learning cluster using the pysparkling package. You can visit my Github and this article to learn more about the code execution explained here.

First you would need to install pysparkling in your Python 2.7 setup as below:

> pip install -U h2o_pysparkling_2.1

Next, set SPARK_HOME to point to your Spark installation:

export SPARK_HOME=/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.6

Now launch the pysparkling shell:

~/tools/sw2/sparkling-water-2.1.14 $ bin/pysparkling

Python script to launch the H2O cluster in pysparkling:

## Importing Libraries
from pysparkling import *
import h2o

## Setting H2O Conf Object
h2oConf = H2OConf(sc)
h2oConf

## Setting H2O Conf for different port
h2oConf.set_client_port_base(54300)
h2oConf.set_node_base_port(54300)

## Get the H2O Conf object to see the configuration
h2oConf

## Launching H2O Cluster
hc = H2OContext.getOrCreate(spark, h2oConf)

## Getting H2O Cluster status
h2o.cluster_status()
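
If you later want to attach a plain h2o Python client from another process to this same cluster, you can point it at the configured port (a hypothetical sketch; the IP below is a placeholder, use the node address printed by H2OContext):

import h2o

## Connect an external client to the already-running H2O cluster
## ("10.0.0.1" is a placeholder; use your H2O node's actual address)
h2o.connect(ip="10.0.0.1", port=54300)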

Now if you verify the Sparkling Water configuration, you will see that H2O is running on the configured base port 54300:

Sparkling Water configuration:
  backend cluster mode : internal
  workers              : None
  cloudName            : Not set yet, it will be set automatically before starting H2OContext.
  flatfile             : true
  clientBasePort       : 54300
  nodeBasePort         : 54300
  cloudTimeout         : 60000
  h2oNodeLog           : INFO
  h2oClientLog         : WARN
  nthreads             : -1
  drddMulFactor        : 10

That's it, enjoy!!

Using H2O AutoML for Kaggle Porto Seguro Safe Driver Prediction Competition

If you are into competitive machine learning, you must be visiting Kaggle routinely. Currently you can compete for cash and recognition at the Porto Seguro Safe Driver Prediction competition as well.

I did try the given training dataset (as is) with H2O AutoML, which ran for about 5 hours, and I was able to get into the top 280th position. If you transform the dataset properly and then run H2O AutoML, you may be able to get an even higher ranking.

The following is the simplest H2O AutoML Python script, which you can try as well (Note: make sure to change run_automl_for_seconds to the desired duration of your experiment):

import h2o
import pandas as pd
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroTrain.csv')
test = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroTest.csv')
sub_data = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroSample_submission.csv')

y = 'target'
x = train.columns
x.remove(y)

## Time to run the experiment
run_automl_for_seconds = 18000
## Running AutoML for 5 hours (18000 seconds)
aml = H2OAutoML(max_runtime_secs =run_automl_for_seconds)
train_final, valid = train.split_frame(ratios=[0.9])
aml.train(x=x, y =y, training_frame=train_final, validation_frame=valid)

leader_model = aml.leader
pred = leader_model.predict(test_data=test)

pred_pd = pred.as_data_frame()
sub = sub_data.as_data_frame()

sub['target'] = pred_pd
sub.to_csv('/data/avkash/PortoSeguro/PortoSeguroResult.csv', header=True, index=False)
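
Before writing the submission you may also want to see how the leader stacks up against the other models AutoML trained; printing the leaderboard is an easy way to do that (a short sketch using the aml object from the script above):

## Print the AutoML leaderboard (models ranked by the default metric)
lb = aml.leaderboard
print(lb.head())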

That’s it, enjoy!!