Superset and Jupyter Notebooks on AWS as a Service

Jupyter Notebook (In EC2 Instance):

The following steps show how to run Jupyter Notebook as a server inside an AWS EC2 instance, so that you can access it from your desktop/laptop as long as the EC2 instance is reachable from your machine:

  • $ conda activate python37
  • $ jupyter notebook --generate-config
    • This will create the jupyter_notebook_config.py configuration file inside your working folder, i.e. /home/<username>/.jupyter/
  • $ jupyter notebook password
    • You can set the password here (or generate the hash yourself; see the sketch after this list)
  • $ vi /home/centos/.jupyter/jupyter_notebook_config.py
    • Edit the following 2 lines:
    •   c.NotebookApp.ip = '0.0.0.0'
    •   c.NotebookApp.port = 8888
  • $ jupyter notebook
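
As an alternative to the interactive "jupyter notebook password" step, you can generate the password hash yourself; here is a minimal Python sketch using the classic notebook's notebook.auth helper (the resulting hash goes into c.NotebookApp.password in the same config file):

# Generate a salted password hash for jupyter_notebook_config.py.
from notebook.auth import passwd

# Prompts twice for the password and returns a hash like 'sha1:...';
# set it as c.NotebookApp.password = '<hash>' in the config file.
print(passwd())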

Apache Superset (In EC2 Instance):

For the Superset setup, see the next post, "Installing Apache Superset into CentOS 7 with Python 3.7."

That’s all.

@avkashchauhan


Installing Apache Superset into CentOS 7 with Python 3.7

Following are the starter commands to install superset:

  • $ python --version
    • Python 3.7.5
  • $ pip install superset

Possible Errors:

You might hit any or all of the following errors:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:

building '_geohash' extension
...
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1

gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1

Look for:

  • $ gcc --version <= You must have gcc installed
  • $ locate cc1plus <= You must have cc1plus installed

Install the required libraries and tools:

If any of the above components are missing, install the required libraries and tools as below:

  • $ sudo yum install mlocate <= For the locate command
  • $ sudo updatedb <= Update the mlocate database
  • $ sudo yum install gcc <= For gcc if you don't have it
  • $ sudo yum install gcc-c++ <= For cc1plus if you don't have it

Verify the following again:

  • $ gcc --version
  • $ locate cc1plus
    • /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus

Note:

  • If you can locate cc1plus properly but are still getting the error, try the following:
    • sudo ln -s /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus /usr/local/bin/
  • Try installing again

Final Installation:

Now you can install Superset as below:

  • $ pip install superset
    • Python 3.7.5
      Flask 1.1.1
      Werkzeug 0.16.0
  • $ superset db upgrade
  • $ export FLASK_APP=superset
  • $ flask fab create-admin
    • Recognized Database Authentications.
      Admin User admin created.
  • $ superset init
  • $ superset run -p 8080 --with-threads --reload --debugger
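
To sanity-check the server, you can hit Superset's /health endpoint once it is running; a minimal sketch, assuming the localhost:8080 address from the run command above:

import requests

# A healthy Superset instance answers HTTP 200 on /health.
resp = requests.get("http://localhost:8080/health")
print(resp.status_code, resp.text)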

 

That’s all.

@avkashchauhan

Steps to connect Apache Superset with Apache Druid

Druid Install:

  • Install Druid and run.
  • Get broker port number from druid configuration, mostly 8082 if not changed.
  • Add a test data source to your druid so you can access that from superset
  • Test
    • $ curl http://localhost:8082/druid/v2/datasources
      • ["testdf","plants"]
    • Note: You should get a list of configured druid data sources (see the Python sketch after this list).
    • Note: If the above command does not work, please fix it first before connecting with superset.
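
The same broker check can be scripted from Python; a minimal sketch with the requests library, assuming the default broker host/port used above:

import requests

# Mirrors the curl test: ask the Druid broker for its datasources.
resp = requests.get("http://localhost:8082/druid/v2/datasources")
resp.raise_for_status()
print(resp.json())  # e.g. ["testdf", "plants"]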

Superset Install:

  • Make sure you have python 3.6 or above
  • Install pydruid so Superset can connect to Druid
    • $ pip install pydruid
  • Install Superset and run

Superset Configuration for Druid:

Step 1:

At Superset UI, select the "Sources > Druid Clusters" menu option and fill in the following info:

  • Verbose Name: <provide a string to identify cluster>
  • Broker Host: Please input an IP address, "localhost", or an FQDN
  • Broker Port: Please input the broker port here (default druid broker port: 8082)
  • Broker Username: If configured, input the username, or leave blank
  • Broker Password: If configured, input the password, or leave blank
  • Broker Endpoint: Use the default, druid/v2
  • Cache Timeout: Add as needed or leave empty
  • Cluster: You can use the same verbose name here

The UI looks as below:

[Screenshot: the "Druid Clusters" configuration form]

Save the configuration.

Step 2: 

At Superset UI, select the "Sources > Druid Datasources" menu option and you will see the list of data sources that you have configured into Druid, as below.

[Screenshot: the "Druid Datasources" list]

That’s all you need to get Superset working with Apache Druid.

Common Errors:

[1]

Error:
Error while processing cluster 'druid' name 'requests' is not defined

Solution:

You might have missed installing pydruid. Please install pydruid (and any other missing Python dependency) to fix this problem.

[2]

Error while processing cluster 'druid' HTTPConnectionPool(host='druid', port=8082): Max retries exceeded with url: /druid/v2/datasources (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bc69748>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Solution:

Either your Druid configuration at Superset is wrong or missing some important value. Please follow the configuration steps above to provide the correct info.


That’s all for now.

@avkashchauhan

 

Adding MapBox token with SuperSet

To visualize geographic data with Superset you will need to get a MapBox token first, and then add that token to your Superset configuration so Superset can consume it.

Please visit https://www.mapbox.com/ to request the MapBox token as needed.

Update your shell configuration to support Superset:

What you need:

  • Superset home
    • If you have installed from pip/pip3, use the site-packages location
    • If you have installed from a GitHub clone, use the GitHub clone home
  • Superset config file
    • Create a file named superset_config.py and place it into your $HOME/.superset/ folder
  • PYTHONPATH including the superset config location along with the python binary

Add the following to your .bash_profile or .zshrc:

export SUPERSET_HOME=/Users/avkashchauhan/anaconda3/lib/python3.7/site-packages/superset
export SUPERSET_CONFIG_PATH=$HOME/.superset/superset_config.py
export PYTHONPATH=/Users/avkashchauhan/anaconda3/bin/python:/Users/avkashchauhan/.superset:$PYTHONPATH

Minimal superset_config.py configuration:

#---------------------------------------------------------
# Superset specific config
#---------------------------------------------------------
ROW_LIMIT = 50000

SQLALCHEMY_DATABASE_URI = 'sqlite:////Users/avkashchauhan/.superset/superset.db'

MAPBOX_API_KEY = 'YOUR_TOKEN_HERE'
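
If you would rather not hard-code the token, you can read it from the environment in superset_config.py; a small variant of the config above (MAPBOX_TOKEN is just an environment variable name chosen here, not something Superset defines):

import os

# Fall back to an empty string if the variable is not exported,
# which leaves the Mapbox-backed visualizations unconfigured.
MAPBOX_API_KEY = os.environ.get('MAPBOX_TOKEN', '')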

Start your superset instance:

$ superset run -p 8080 --with-threads --reload --debugger

Please verify the logs to make sure superset_config.py was loaded and read without any error. A successful load logs a line like the below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]

If there are errors, you will see one or more error lines just after the above line, similar to the below:

ERROR:root:Failed to import config for SUPERSET_CONFIG_PATH=/Users/avkashchauhan/.superset/superset_config.py

If your SQLite instance is not configured correctly, you will get an error as below:

2019-11-06 14:25:51,074:ERROR:flask_appbuilder.security.sqla.manager:DB Creation and initialization failed: (sqlite3.OperationalError) unable to open database file
(Background on this error at: http://sqlalche.me/e/e3q8)

A successful superset_config.py load will proceed with no errors, as below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:16,588:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: off
2019-11-06 17:33:17,294:INFO:werkzeug: * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
2019-11-06 17:33:17,306:INFO:werkzeug: * Restarting with fsevents reloader
Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:18,644:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
2019-11-06 17:33:19,345:WARNING:werkzeug: * Debugger is active!
2019-11-06 17:33:19,353:INFO:werkzeug: * Debugger PIN: 134-113-136

Now if you visualize any dataset with geographic columns, i.e. longitude and latitude, Superset will be able to show the data properly, as below:

[Screenshot: a geospatial visualization rendered in Superset]

That’s all for now.

@avkashchauhan

Error with python-geohash installation while installing superset in OSX Catalina

Error Installing superset on OSX Catalina:

Command:

$ pip3  install superset

$ pip3 install python-geohash

Error:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:
command: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/
Complete output (21 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.7
copying geohash.py -> build/lib.macosx-10.7-x86_64-3.7
copying quadtree.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpgrid.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpiarea.py -> build/lib.macosx-10.7-x86_64-3.7
running build_ext
building '_geohash' extension
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/src
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -DPYTHON_MODULE=1 -I/Users/avkashchauhan/anaconda3/include/python3.7m -c src/geohash.cpp -o build/temp.macosx-10.7-x86_64-3.7/src/geohash.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/src/geohash.o -o build/lib.macosx-10.7-x86_64-3.7/_geohash.cpython-37m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Reason:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Solution:

$ sudo CFLAGS=-stdlib=libc++ pip3 install python-geohash

$ sudo CFLAGS=-stdlib=libc++ pip3 install superset

That’s all for now.

@avkashchauhan

Installing R packages: missing R_X11.so error

 

install.packages("roxygen2")

--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors

1: 0-Cloud [https] 2: Algeria [https]
3: Australia (Canberra) [https] 4: Australia (Melbourne 1) [https]
5: Australia (Melbourne 2) [https] 6: Australia (Perth) [https]
7: Austria [https] 8: Belgium (Ghent) [https]
9: Brazil (PR) [https] 10: Brazil (RJ) [https]
11: Brazil (SP 1) [https] 12: Brazil (SP 2) [https]
13: Bulgaria [https] 14: Chile 1 [https]
15: Chile 2 [https] 16: China (Hong Kong) [https]
17: China (Guangzhou) [https] 18: China (Lanzhou) [https]
19: China (Shanghai 1) [https] 20: China (Shanghai 2) [https]
21: Colombia (Cali) [https] 22: Czech Republic [https]
23: Denmark [https] 24: East Asia [https]
25: Ecuador (Cuenca) [https] 26: Ecuador (Quito) [https]
27: Estonia [https] 28: France (Lyon 1) [https]
29: France (Lyon 2) [https] 30: France (Marseille) [https]
31: France (Montpellier) [https] 32: France (Paris 2) [https]
33: Germany (Erlangen) [https] 34: Germany (Göttingen) [https]
35: Germany (Münster) [https] 36: Greece [https]
37: Iceland [https] 38: India [https]
39: Indonesia (Jakarta) [https] 40: Ireland [https]
41: Italy (Padua) [https] 42: Japan (Tokyo) [https]
43: Japan (Yonezawa) [https] 44: Korea (Busan) [https]
45: Korea (Seoul 1) [https] 46: Korea (Ulsan) [https]
47: Malaysia [https] 48: Mexico (Mexico City) [https]
49: Norway [https] 50: Philippines [https]
51: Serbia [https] 52: Spain (A Coruña) [https]
53: Spain (Madrid) [https] 54: Sweden [https]
55: Switzerland [https] 56: Turkey (Denizli) [https]
57: Turkey (Mersin) [https] 58: UK (Bristol) [https]
59: UK (London 1) [https] 60: USA (CA 1) [https]
61: USA (IA) [https] 62: USA (KS) [https]
63: USA (MI 1) [https] 64: USA (NY) [https]
65: USA (OR) [https] 66: USA (TN) [https]
67: USA (TX 1) [https] 68: Uruguay [https]
69: Vietnam [https] 70: (other mirrors)
Selection: 60
trying URL 'https://cran.cnr.berkeley.edu/bin/macosx/el-capitan/contrib/3.5/roxygen2_6.1.0.tgz'

Content type 'application/x-gzip' length 712037 bytes (695 KB)

downloaded 695 KB
The downloaded binary packages are in
/var/folders/tt/vy5b7m5s7rb54r02c9r647tc0000gn/T//RtmpSKQcBX/downloaded_packages
Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
Reason: image not found

quit()

 

Solution: Just install XQuartz from the link below, and you don't need to set X11 as the default server at all. It should fix your problem.

https://www.xquartz.org/

Holdout prediction with cross validation in K-means modeling

Sometimes you may need to collect the holdout (cross-validation) predictions from a model trained with the keep_cross_validation_predictions parameter enabled. Here is sample Python code:

import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init()

# Import the prostate dataset and pick the clustering features
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
predictors = ["AGE", "RACE", "VOL", "GLEASON"]

# Keep the per-fold holdout predictions and the fold assignment
kmeans_model = H2OKMeansEstimator(k=10, nfolds=5,
                                  keep_cross_validation_predictions=True,
                                  keep_cross_validation_fold_assignment=True)
kmeans_model.train(predictors, training_frame=prostate)

# Returns the list of per-fold holdout prediction frames
kmeans_model.cross_validation_predictions()
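
The call above returns the list of per-fold prediction frames. If you want the combined holdout predictions instead (each training row predicted by the model that did not see it), the h2o-py model object also exposes them, along with the fold assignment kept above:

# One frame, aligned with the training rows; every row is predicted
# by the cross-validation model that did NOT train on it.
holdout = kmeans_model.cross_validation_holdout_predictions()

# The fold each training row was assigned to (available because
# keep_cross_validation_fold_assignment=True was set above).
folds = kmeans_model.cross_validation_fold_assignment()

# Put them side by side for inspection.
print(holdout.cbind(folds).head())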

 


Sparkling Water 2.0 Walkthrough with pysparkling

My ENV:

SPARK_HOME=/Users/avkashchauhan/tools/spark-2.0.1-bin-hadoop2.6
H2O_HOME=/Users/avkashchauhan/src/github.com/h2oai/h2o-3
MASTER=local-cluster[3,2,1024]

Pysparkling Command:

$ bin/pysparkling --num-executors 2 --executor-memory 2g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false
 Python 2.7.10 (default, Jul 30 2016, 18:31:42)
 [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel).
 16/10/20 09:29:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Welcome to
 ____ __
 / __/__ ___ _____/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /__ / .__/\_,_/_/ /_/\_\ version 2.0.1
 /_/
 
 Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
 SparkSession available as 'spark'.

Now enter the commands:

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> sqlContext

>>> hc = H2OContext.getOrCreate(sc)

Here is the successful output:

 16/10/20 09:31:10 WARN InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
 16/10/20 09:31:10 WARN InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
 We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
 Warning: if you don't want to start local H2O server, then use of `h2o.connect()` is preferred.
 Checking whether there is an H2O instance running at http://10.0.0.63:54327. connected.
 -------------------------- ----------------------------------------
 H2O cluster uptime: 09 secs
 H2O cluster version: 3.10.0.7
 H2O cluster version age: 1 month
 H2O cluster name: sparkling-water-avkashchauhan_2132345410
 H2O cluster total nodes: 3
 H2O cluster free memory: 2.364 Gb
 H2O cluster total cores: 24
 H2O cluster allowed cores: 24
 H2O cluster status: accepting new members, healthy
 H2O connection url: http://10.0.0.63:54327
 H2O connection proxy:
 Python version: 2.7.10 final
 -------------------------- ----------------------------------------

Now verify the Sparkling Water package, and make sure the h2o module references the pysparkling 2.0-2.0.0 egg, as below.

>>> h2o

<module 'h2o' from '/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/avkashchauhan/spark/work/spark-28af708d-a149-435a-9a53-41e63d9ba7f5/userFiles-cf7c7aaf-610f-439d-9037-2dcddca73524/h2o_pysparkling_2.0-2.0.0-py2.7.egg/h2o/__init__.pyc'>

Getting Help for h2o:

>>> help(h2o)

Getting Cluster Status:

>>> h2o.cluster_status()
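
With the H2OContext up, a common next step is moving data between Spark and H2O. A minimal sketch using the H2OContext conversion helpers (the tiny DataFrame here is just for illustration):

>>> df = sqlContext.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
>>> hf = hc.as_h2o_frame(df)     # Spark DataFrame -> H2OFrame
>>> hf.dim
[3, 1]
>>> df2 = hc.as_spark_frame(hf)  # H2OFrame -> Spark DataFrame
>>> df2.count()
3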

Feature Binning in GBM models and prediction

We have a categorical variable with about 900 levels. Some of these levels occur very rarely: out of millions of observations, there are levels that occur fewer than 500 times. The minimum number of rows chosen to fit the model was 500. In addition, we didn't want to restrict the number of bins, so we chose to go with the default nbins_cats=1024.

I had a look at the POJO, and for that categorical variable we got this part of the script:

// The class representing column C11
class GBM_model_python_1_ColInfo_99 implements java.io.Serializable {
  public static final String[] VALUES = new String[406];
  static {
    GBM_model_python_1_ColInfo_99_0.fill(VALUES);
  }
  static final class GBM_model_python_1_ColInfo_99_0 implements java.io.Serializable {
    static final void fill(String[] sa) {
      sa[0] = "C003";
      ...
      sa[405] = "C702";
    }
  }
}

As you can see, in the script there are only 406 bins created. My questions are the following:

Question 1: What happens with the other bins up to 900? If, when scoring the model, we have a value for variable C99 from the bins not included in the 406, will that observation be scored? And how exactly will variable C99 be used when scoring the model?

Question 2: If the model sees a value for variable C99 outside the 900 distinct levels we had in our training data, will the model score that observation? If yes, how?

Yes, the H2O model (whether in POJO/MOJO form or running in H2O) will score when receiving unseen categorical values.

 

In GBM, during training, the optimal split direction for every feature value (numeric and categorical, including missing values/NAs) is computed, and stored for future use during scoring.

This means that missing numeric, categorical, or unseen categorical values are turned into NAs.

Specifically, if there are NO NAs in the training data, then NAs in the test data follow the majority direction (the direction with the most observations); if there are NAs in the training data, then NAs in the test set follow the direction that is optimal for the NAs of the training data.
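
To see this behavior directly, you can score a frame containing a level the model never saw in training; a minimal sketch with h2o-py, reusing the public prostate dataset from earlier (the column choice and the unseen level "9" are just for illustration):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate["RACE"] = prostate["RACE"].asfactor()  # treat RACE as categorical

gbm = H2OGradientBoostingEstimator(ntrees=10)
gbm.train(x=["AGE", "RACE", "VOL"], y="GLEASON", training_frame=prostate)

# "9" is a RACE level the model never saw; GBM treats it like an NA
# and follows the stored (majority or NA-optimal) split direction.
unseen = h2o.H2OFrame({"AGE": [65], "RACE": ["9"], "VOL": [15.0]})
unseen["RACE"] = unseen["RACE"].asfactor()
print(gbm.predict(unseen))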

For more details we have an updated GBM documentation page at:

http://h2o-release.s3.amazonaws.com/h2o/master/3754/docs-website/h2o-docs/data-science/gbm-faq/missing_values.html