Superset and Jupyter Notebooks on AWS as a Service

Jupyter Notebook (in an EC2 instance):

The following steps run Jupyter Notebook as a server inside an AWS EC2 instance so that you can access it from your desktop/laptop, provided the instance is reachable from your machine:

  • $ conda activate python37
  • $ jupyter notebook --generate-config
    • This creates the jupyter_notebook_config.py configuration file in /home/<username>/.jupyter/
  • $ jupyter notebook password
    • Set the notebook password here
  • $ vi /home/centos/.jupyter/jupyter_notebook_config.py
    • Edit the following two lines (see the configuration sketch after this list)
    •   c.NotebookApp.ip = '0.0.0.0'
    •   c.NotebookApp.port = 8888
  • $ jupyter notebook
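
For reference, the relevant part of jupyter_notebook_config.py ends up looking like the following minimal sketch; the open_browser line is an optional extra (not part of the steps above) that is handy on a headless server:

# /home/<username>/.jupyter/jupyter_notebook_config.py
c = get_config()  # provided by Jupyter when this config file is loaded
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.port = 8888           # open this port in the EC2 security group
c.NotebookApp.open_browser = False  # optional: do not try to launch a browser on the server

With this in place, browse to http://<ec2-public-ip>:8888 from your machine and log in with the password you set.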

Apache Superset (in an EC2 instance): the Superset installation steps are covered in the next section.

That’s all.

@avkashchauhan


Installing Apache Superset on CentOS 7 with Python 3.7

Following are the starter commands to install superset:

  • $ python --version
    • Python 3.7.5
  • $ pip install superset

Possible Errors:

You might hit any or all of the following errors:

Running setup.py install for python-geohash … error
ERROR: Command errored out with exit status 1:

building '_geohash' extension
......
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1

gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1

Look for:

  • $ gcc --version <= You must have gcc installed
  • $ locate cc1plus <= You must have cc1plus installed

Install the required libraries and tools:

If any of the above components are missing, you need to install a few required libraries:

  • $ sudo yum install mlocate <= For the locate command
  • $ sudo updatedb <= Update the mlocate database
  • $ sudo yum install gcc <= For gcc if you don’t have it
  • $ sudo yum install gcc-c++ <= For cc1plus if you don’t have it

Verify the following again:

  • $ gcc --version
  • $ locate cc1plus
    • /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus

Note:

  • If cc1plus is present but you are still getting the error, try the following
    • sudo ln -s /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus /usr/local/bin/
  • Try installing again

Final Installation:

Now you can install Superset as below:

  • $ pip install superset
    • Python 3.7.5
      Flask 1.1.1
      Werkzeug 0.16.0
  • $ superset db upgrade
  • $ export FLASK_APP=superset
  • $ flask fab create-admin
    • Recognized Database Authentications.
      Admin User admin created.
  • $ superset init
  • $ superset run -p 8080 --with-threads --reload --debugger
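
Once the server is running, a quick sanity check from Python (a sketch; it assumes Superset is listening on localhost:8080 and that its /health endpoint is available, which recent versions expose and answer with "OK"):

import requests  # third-party HTTP client: pip install requests

# Hit Superset's health endpoint to confirm the web server is up
resp = requests.get("http://localhost:8080/health", timeout=5)
print(resp.status_code, resp.text)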

 

That’s all.

@avkashchauhan

Steps to connect Apache Superset with Apache Druid

Druid Install:

  • Install Druid and run it.
  • Get the broker port number from the Druid configuration; it is 8082 unless changed.
  • Add a test data source to your Druid so that you can access it from Superset.
  • Test (a Python version of this check is sketched after this list)
    • $ curl http://localhost:8082/druid/v2/datasources
      • ["testdf","plants"]
    • Note: You should get a list of configured Druid data sources.
    • Note: If the above command does not work, fix it first before connecting with Superset.
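
The same broker check can be scripted; here is a minimal sketch using the requests library (the host, port, and the data source name "testdf" are taken from the example above):

import requests  # pip install requests

# Ask the Druid broker for its queryable data sources,
# equivalent to: curl http://localhost:8082/druid/v2/datasources
broker = "http://localhost:8082"
datasources = requests.get(f"{broker}/druid/v2/datasources", timeout=5).json()
print(datasources)  # e.g. ["testdf", "plants"]
assert "testdf" in datasources, "expected test data source is not visible on the broker"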

Superset Install:

  • Make sure you have Python 3.6 or above
  • Install pydruid so Superset can connect to Druid
    • $ pip install pydruid
  • Install Superset and run it (see the pydruid sketch after this list)
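
To confirm that pydruid can actually talk to the broker, here is a hedged sketch of a tiny timeseries query; the data source name ("testdf") and the interval are placeholders from the test setup above, and the aggregation simply counts rows:

from pydruid.client import PyDruid
from pydruid.utils.aggregators import count

# Point pydruid at the Druid broker and its default endpoint
client = PyDruid('http://localhost:8082', 'druid/v2')

# Count rows per day for the test data source over a placeholder interval
query = client.timeseries(
    datasource='testdf',
    granularity='day',
    intervals='2019-01-01/2020-01-01',
    aggregations={'rows': count('count')},
)
print(query.result)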

Superset Configuration for Druid:

Step 1:

At the Superset UI, select the "Sources > Druid Clusters" menu option and fill in the following info:

  • Verbose Name: <provide a string to identify the cluster>
  • Broker Host: Input an IP address, "localhost", or an FQDN
  • Broker Port: Input the broker port here (default Druid broker port: 8082)
  • Broker Username: If configured, input the username; otherwise leave blank
  • Broker Password: If configured, input the password; otherwise leave blank
  • Broker Endpoint: Use the default: druid/v2
  • Cache Timeout: Add as needed or leave empty
  • Cluster: You can use the same verbose name here

The UI looks like the following:

[Screenshot: Druid cluster configuration form in Superset]

Save the configuration.

Step 2: 

At the Superset UI, select the "Sources > Druid Datasources" menu option and you will see the list of data sources that you have configured in Druid, as below.

[Screenshot: Druid data sources listed in Superset]

That’s all you need to get Superset working with Apache Druid.

Common Errors:

[1]

Error:
Error while processing cluster 'druid' name 'requests' is not defined

Solution:

You have probably missed installing pydruid (or another required Python dependency). Install pydruid to fix this problem.

[2]

Error while processing cluster 'druid' HTTPConnectionPool(host='druid', port=8082): Max retries exceeded with url: /druid/v2/datasources (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bc69748>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Solution:

Your Druid configuration in Superset is either wrong or missing an important value (in this example, the broker host 'druid' cannot be resolved). Follow the configuration steps above and provide the correct info.


That’s all for now.

@avkashchauhan

 

Adding a MapBox token to Superset

To visualize geographic data with Superset, you first need to get a MapBox token and then add it to the Superset configuration so Superset can consume it.

Please visit https://www.mapbox.com/ to request the MapBox token as needed.

Update your shell configuration to support Superset:

What you need:

  • Superset home
    • If you installed from pip/pip3, use the site-packages location
    • If you installed from a GitHub clone, use the clone’s home directory
  • Superset config file
    • Create a file named superset_config.py and place it in your $HOME/.superset/ folder
  • A PYTHONPATH that includes the superset_config.py location along with the Python binary

Add the following to your .bash_profile or .zshrc:

export SUPERSET_HOME=/Users/avkashchauhan/anaconda3/lib/python3.7/site-packages/superset
export SUPERSET_CONFIG_PATH=$HOME/.superset/superset_config.py
export PYTHONPATH=/Users/avkashchauhan/anaconda3/bin/python:/Users/avkashchauhan/.superset:$PYTHONPATH

Minimal superset_config.py configuration:

#---------------------------------------------------------
# Superset specific config
#---------------------------------------------------------
ROW_LIMIT = 50000

SQLALCHEMY_DATABASE_URI = 'sqlite:////Users/avkashchauhan/.superset/superset.db'

MAPBOX_API_KEY = 'YOUR_TOKEN_HERE'
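
If you prefer not to hard-code the token, the same file can read it from the environment instead (a small variation, assuming you export MAPBOX_API_KEY in the shell profile updated above):

import os

# Read the MapBox token from the environment instead of hard-coding it in superset_config.py
MAPBOX_API_KEY = os.environ.get('MAPBOX_API_KEY', '')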

Start your superset instance:

$ superset run -p 8080 --with-threads --reload --debugger

Verify the logs to make sure superset_config.py was loaded and read without any error. A successful load will look like the following:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]

If there are problems, you will see one or more errors just after the above line, similar to the following:

ERROR:root:Failed to import config for SUPERSET_CONFIG_PATH=/Users/avkashchauhan/.superset/superset_config.py

If your SQLite database is not configured correctly, you will get an error like the following:

2019-11-06 14:25:51,074:ERROR:flask_appbuilder.security.sqla.manager:DB Creation and initialization failed: (sqlite3.OperationalError) unable to open database file
(Background on this error at: http://sqlalche.me/e/e3q8)

A successful superset_config.py load produces no errors, as below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:16,588:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: off
2019-11-06 17:33:17,294:INFO:werkzeug: * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
2019-11-06 17:33:17,306:INFO:werkzeug: * Restarting with fsevents reloader
Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:18,644:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
2019-11-06 17:33:19,345:WARNING:werkzeug: * Debugger is active!
2019-11-06 17:33:19,353:INFO:werkzeug: * Debugger PIN: 134-113-136

Now, if you visualize any dataset with geographic columns (i.e., longitude and latitude), Superset will be able to show the data properly, as below:

[Screenshot: geographic visualization rendered with the MapBox token]

That’s all for now.

@avkashchauhan

Error with python-geohash installation while installing Superset on OSX Catalina

Error installing Superset on OSX Catalina:

Command:

$ pip3  install superset

$ pip3 install python-geohash

Error:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:
command: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/
Complete output (21 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.7
copying geohash.py -> build/lib.macosx-10.7-x86_64-3.7
copying quadtree.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpgrid.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpiarea.py -> build/lib.macosx-10.7-x86_64-3.7
running build_ext
building '_geohash' extension
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/src
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -DPYTHON_MODULE=1 -I/Users/avkashchauhan/anaconda3/include/python3.7m -c src/geohash.cpp -o build/temp.macosx-10.7-x86_64-3.7/src/geohash.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/src/geohash.o -o build/lib.macosx-10.7-x86_64-3.7/_geohash.cpython-37m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Reason:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Solution:

$ sudo CFLAGS=-stdlib=libc++ pip3 install python-geohash

$ sudo CFLAGS=-stdlib=libc++ pip3 install superset
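
Once the install succeeds, a quick way to confirm the _geohash extension built correctly (a sketch; the coordinates are just an arbitrary test point):

import geohash  # provided by the python-geohash package

# Encode a latitude/longitude pair into a geohash string and decode it back
code = geohash.encode(37.7749, -122.4194)  # arbitrary test point (San Francisco)
print(code)                                # prints a geohash string such as '9q8yy...'
print(geohash.decode(code))                # approximately (37.7749, -122.4194)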

That’s all for now.

@avkashchauhan

Free ebook: Introducing Microsoft Azure HDInsight

New Free eBook by Microsoft Press:

Microsoft Press is thrilled to share another new free ebook with you: Introducing Microsoft Azure HDInsight, by Avkash Chauhan, Valentine Fontama, Michele Hart, Wee Hyong Tok, and Buck Woody.


Free ebook: Introducing Microsoft Azure HDInsight

Introduction (excerpt)

Microsoft Azure HDInsight is Microsoft’s 100 percent compliant distribution of Apache Hadoop on Microsoft Azure. This means that standard Hadoop concepts and technologies apply, so learning the Hadoop stack helps you learn the HDInsight service. At the time of this writing, HDInsight (version 3.0) uses Hadoop version 2.2 and Hortonworks Data Platform 2.0.

In Introducing Microsoft Azure HDInsight, we cover what big data really means, how you can use it to your advantage in your company or organization, and one of the services you can use to do that quickly—specifically, Microsoft’s HDInsight service. We start with an overview of big data and Hadoop, but we don’t emphasize only concepts in this book—we want you to jump in and get your hands dirty working with HDInsight in a practical way. To help you learn and even implement HDInsight right away, we focus on a specific use case that applies to almost any organization and demonstrate a process that you can follow along with.

We also help you learn more. In the last chapter, we look ahead at the future of HDInsight and give you recommendations for self-learning so that you can dive deeper into important concepts and round out your education on working with big data.

Here are the download links (and below the links you’ll find an ebook excerpt that describes this offering):

Download the PDF (6.37 MB; 130 pages) from http://aka.ms/IntroHDInsight/PDF

Download the EPUB (8.46 MB) from http://aka.ms/IntroHDInsight/EPUB

Download the MOBI (12.8 MB) from http://aka.ms/IntroHDInsight/MOBI

Download the code samples (6.83 KB) from http://aka.ms/IntroHDInsight/CompContent

20 TB Earth Science Dataset on AWS with NASA NEX Available to the Public

AWS has been working with the NASA Earth Exchange (NEX) team to make it easier and more efficient for researchers to access and process earth science data. The goal is to make a number of important data sets accessible to a wider audience of full-time researchers, students, and citizen scientists. This important new project is called OpenNEX. Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size (tens of terabytes). Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical.

[Image: NASA NEX Landsat US 2005 forest leaf area]

Access Dataset: s3://nasanex/NEX-DCP30

Consult the detail page and the tech note to learn more about the provenance, format, structure, and attribution requirements.
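
To browse the dataset directly from Python, here is a minimal sketch using boto3; it assumes anonymous (unsigned) access to the public nasanex bucket, with the bucket name and prefix taken from the S3 path above:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The nasanex bucket is public, so use unsigned requests and list a few objects
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket='nasanex', Prefix='NEX-DCP30/', MaxKeys=5)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])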

NASA Earth Exchange (NEX):

The NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) dataset is comprised of downscaled climate scenarios for the conterminous United States that are derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across the four greenhouse gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011] developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). The dataset includes downscaled projections from 33 models, as well as ensemble statistics calculated for each RCP from all model runs available. The purpose of these datasets is to provide a set of high resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.

Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run).

Website: NASA NEX

Summary

  • Short Name: NEX-DCP30
  • Version: 1
  • Format: netCDF4 classic
  • Spatial Coverage: CONUS
  • Temporal Coverage:
    • 1950 – 2005 historical or 2006 – 2099 RCP
  • Data Resolution:
    • Latitude Resolution: 30 arc second
    • Longitude Resolution: 30 arc second
    • Temporal Resolution: monthly
  • Data Size:
    • Total Dataset Size: 17 TB
    • Individual file size: 2 GB

Learn more about NEX – NASA Earth Exchange Downscaled Project

NEX Virtual Workshop: https://nex.nasa.gov/nex/projects/1328/

 

Top 20 Big Data and Analytics Startups with Significant VC Funding

 

Startup                 Funding (in millions)
MongoDB                 231
Pivotal                 210
Mu Sigma                208
Cloudera                141
Opera Solutions         114
HortonWorks             98
DataStax                83.7
Guavas                  80.5
GoodData                75.5
ParAccel (Actian)       74
Talend                  61.6
Pentaho                 60
MapR                    61
CouchBase               56
Platfora                27.5
Datameer                18
Hadapt                  16.2
Karmasphere             14.5
DataBricks              14
Quantifind              11.2

 

[Infographic: Top 20 Big Data and Analytics Startups]

 

Keywords: Big Data, Data Analytics, Infographic, Hadoop, BI