Steps to connect Apache Superset with Apache Druid

Druid Install:

  • Install Druid and run it.
  • Get the broker port number from the Druid configuration; it is 8082 unless you changed it.
  • Add a test data source to your Druid cluster so you can access it from Superset.
  • Test
    • $ curl http://localhost:8082/druid/v2/datasources
      • ["testdf","plants"]
    • Note: You should get a list of the configured Druid data sources (a Python equivalent of this check is sketched below).
    • Note: If the above command does not work, fix that first before connecting Superset.
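
If you prefer to run the same check from Python, here is a minimal sketch using the requests library (assuming the broker is on localhost:8082; adjust the host and port to your configuration):

import requests

# Query the broker's datasources endpoint, the same call the curl test above makes.
resp = requests.get("http://localhost:8082/druid/v2/datasources", timeout=10)
resp.raise_for_status()
print(resp.json())   # expect a list such as ["testdf", "plants"]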

Superset Install:

  • Make sure you have Python 3.6 or above.
  • Install pydruid so Superset can connect to Druid:
    • $ pip install pydruid
  • Install Superset and run it.

Superset Configuration for Druid:

Step 1:

At Superset UI, select “Sources > Drid Clusters” menu option and fill the following info:

  • Verbose Name: <provide a string to identify the cluster>
  • Broker Host: Enter the IP address, "localhost", or FQDN of the broker
  • Broker Port: Enter the broker port (default Druid broker port: 8082)
  • Broker Username: Enter the username if configured, otherwise leave blank
  • Broker Password: Enter the password if configured, otherwise leave blank
  • Broker Endpoint: Keep the default - druid/v2
  • Cache Timeout: Add as needed or leave empty
  • Cluster: You can use the same verbose name here

The UI looks like the screenshot below:

[Screenshot: Druid cluster configuration form in the Superset UI]

Save the configuration.

Step 2: 

At Superset UI, select “Sources > Drid Datasources” menu option and you will see a list of data sources that you have configured into Druid, as below.


[Screenshot: list of Druid data sources in the Superset UI]

That’s all you need to get Superset working with Apache Druid.

Common Errors:

[1]

Error:
Error while processing cluster 'druid' name 'requests' is not defined

Solution:

You might have missed installing pydruid (or another Python dependency such as requests). Install the missing package to fix this problem.
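
As a quick sanity check (a sketch, assuming Superset runs in this same Python environment), confirm that the modules the Druid connector relies on can be imported:

# Run with the same Python interpreter that runs Superset.
try:
    import requests                      # the module named in the error above
    from pydruid.client import PyDruid   # Druid client library used by Superset
    print("pydruid and requests are importable")
except ImportError as err:
    print("Missing dependency:", err)    # install it with pip in this environment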

[2]

Error while processing cluster 'druid' HTTPConnectionPool(host='druid', port=8082): Max retries exceeded with url: /druid/v2/datasources (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bc69748>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Solution:

Your Druid cluster configuration in Superset is either wrong or missing an important value (in this example the broker host "druid" could not be resolved). Follow the configuration steps above and provide the correct info.


That’s all for now.

@avkashchauhan


Adding a MapBox token to Superset

To visualize geographic data with Superset, you first need to get a MapBox token and then add that token to the Superset configuration.

Please visit https://www.mapbox.com/ to request a MapBox token.

Update your shell configuration to support Superset:

What you need:

  • Superset home
    • If you installed from pip/pip3, use the superset folder under site-packages
    • If you installed from a GitHub clone, use the clone's home directory
  • Superset config file
    • Create a file named superset_config.py and place it in your $HOME/.superset/ folder
  • A PYTHONPATH that includes the superset_config.py location along with your Python binary

Add the following to your .bash_profile or .zshrc:

export SUPERSET_HOME=/Users/avkashchauhan/anaconda3/lib/python3.7/site-packages/superset
export SUPERSET_CONFIG_PATH=$HOME/.superset/superset_config.py
export PYTHONPATH=/Users/avkashchauhan/anaconda3/bin/python:/Users/avkashchauhan/.superset:$PYTHONPATH
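
As a quick sanity check (a sketch, to be run after sourcing your updated shell profile), you can confirm that Python can actually find superset_config.py on the path:

import importlib.util
import os

# If find_spec returns None, PYTHONPATH does not include the config location
# and Superset will fall back to its defaults.
print("SUPERSET_CONFIG_PATH =", os.environ.get("SUPERSET_CONFIG_PATH"))
print("superset_config importable:", importlib.util.find_spec("superset_config") is not None)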

Minimal superset_config.py configuration:

#---------------------------------------------------------
# Superset specific config
#---------------------------------------------------------
ROW_LIMIT = 50000

SQLALCHEMY_DATABASE_URI = 'sqlite:////Users/avkashchauhan/.superset/superset.db'

MAPBOX_API_KEY = 'YOUR_TOKEN_HERE'

Start your superset instance:

$ superset run -p 8080 --with-threads --reload --debugger

Please verify the logs to make sure superset_config.py was loaded and read without any error. A successful load looks like the following:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]

If there are errors, you will see one or more error messages just after the above line, similar to:

ERROR:root:Failed to import config for SUPERSET_CONFIG_PATH=/Users/avkashchauhan/.superset/superset_config.py

If your SQLite database is not configured correctly, you will get an error like the following:

2019-11-06 14:25:51,074:ERROR:flask_appbuilder.security.sqla.manager:DB Creation and initialization failed: (sqlite3.OperationalError) unable to open database file
(Background on this error at: http://sqlalche.me/e/e3q8)

A successful superset_config.py load completes with no errors, as below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:16,588:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: off
2019-11-06 17:33:17,294:INFO:werkzeug: * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
2019-11-06 17:33:17,306:INFO:werkzeug: * Restarting with fsevents reloader
Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:18,644:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
2019-11-06 17:33:19,345:WARNING:werkzeug: * Debugger is active!
2019-11-06 17:33:19,353:INFO:werkzeug: * Debugger PIN: 134-113-136

Now if you visualize any dataset with geographic columns (i.e., longitude and latitude), Superset will be able to render the data properly, as below:

[Screenshot: geographic data visualized in Superset using MapBox]

That’s all for now.

@avkashchauhan

Error with python-geohash installation while installing Superset on macOS Catalina

Error installing Superset on macOS Catalina:

Command:

$ pip3 install superset

$ pip3 install python-geohash

Error:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:
command: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/
Complete output (21 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.7
copying geohash.py -> build/lib.macosx-10.7-x86_64-3.7
copying quadtree.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpgrid.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpiarea.py -> build/lib.macosx-10.7-x86_64-3.7
running build_ext
building '_geohash' extension
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/src
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -DPYTHON_MODULE=1 -I/Users/avkashchauhan/anaconda3/include/python3.7m -c src/geohash.cpp -o build/temp.macosx-10.7-x86_64-3.7/src/geohash.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/src/geohash.o -o build/lib.macosx-10.7-x86_64-3.7/_geohash.cpython-37m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Reason:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Solution:

$ sudo CFLAGS=-stdlib=libc++ pip3 install python-geohash

$ sudo CFLAGS=-stdlib=libc++ pip3 install superset

That’s all for now.

@avkashchauhan

Renaming data frame column names in H2O (python)

Sometimes you may need to rename all of the column names, or just a specific column, and you can do that as below:

>>> df = h2o.import_file("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/iris/iris.csv")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df
 C1 C2 C3 C4 C5
---- ---- ---- ---- -----------
 5.1 3.5 1.4 0.2 Iris-setosa
 4.9 3 1.4 0.2 Iris-setosa
 4.7 3.2 1.3 0.2 Iris-setosa
 4.6 3.1 1.5 0.2 Iris-setosa
 5 3.6 1.4 0.2 Iris-setosa
 5.4 3.9 1.7 0.4 Iris-setosa
 4.6 3.4 1.4 0.3 Iris-setosa
 5 3.4 1.5 0.2 Iris-setosa
 4.4 2.9 1.4 0.2 Iris-setosa
 4.9 3.1 1.5 0.1 Iris-setosa

[150 rows x 5 columns]

>>> df.names
[u'C1', u'C2', u'C3', u'C4', u'C5']

>>> df.set_names(['A1','A2','A3','A4','A5'])
 A1 A2 A3 A4 A5
---- ---- ---- ---- ------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

If you want to change only a few column names, you still need to pass the full list, keeping the original names at their indexes and substituting the new names where applicable. For example, in the above data frame we just want to change A5 to Levels, so we do as below:

>>> df.set_names(['A1','A2','A3','A4','Levels'])
 A1 A2 A3 A4 Levels
---- ---- ---- ---- --------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

The set_names function must receive a list containing a name for every column (whether unchanged or new); otherwise it will generate an error.

For example the following will not work and will throw an error:

>>> df.set_names(['A1'])
>>> df.set_names(['A1','A2','A3','A4','A5','A6'])
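
If you rename a single column often, a small helper (not part of the H2O API, just a sketch built on df.names and set_names) keeps you from typing the full list by hand:

def rename_column(frame, old_name, new_name):
    # Rebuild the full list of names that set_names() expects,
    # swapping only the one entry we want to change.
    names = list(frame.names)
    names[names.index(old_name)] = new_name
    return frame.set_names(names)

# e.g. change only column A5 to "Levels":
# df = rename_column(df, "A5", "Levels")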

That's it, enjoy!

Unification of date and time data with Joda-Time in Spark

Here is a code snippet that first parses various kinds of date and time formats and then unifies them so they can be processed by the data munging pipeline.

  import org.apache.spark.sql.functions._
  import org.joda.time._
  import org.joda.time.format._
  import org.apache.spark.sql.expressions.Window

    // UDF to extract the hour of day from the datetime string
    val getHour = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getHourOfDay)
        }
    })

    // UDF to convert the datetime string to a Unix timestamp (seconds)
    val getDT = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getMillis / 1000.0  )
        }
      })

    // UDF for day of week
    val getDayOfWeek = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy")
          Some(fmt.parseDateTime(s.split(" ")(0)).getDayOfWeek)
        }
      }
    })

    val getDate = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          Some(s.split(" ")(0))
        }
      }
    })

    // UDF to tag rows by the size of their time difference (in seconds)
    val getDiffKey = udf((diff:Double) => {
      val threshold = 5 // threshold in seconds (~75th-percentile diff)
      if (diff > threshold) {
        1  // tag as 2nd diff
      } else {
        0 // 1st diff
      }
    })

  val rawDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs://mr-0xc5.0xdata.loc:8020/user/file.csv")

    var df = rawDF.withColumn("hourOfDay", getHour(rawDF.col("datetime")))
      df = df.withColumn("timestamp", getDT(df.col("datetime")))
      df = df.withColumn("dayOfWeek", getDayOfWeek(df.col("datetime")))
      df = df.withColumn("date", getDate(df.col("datetime")))

Categorical Encoding, One Hot Encoding and why use it?

What is categorical encoding?

In data science, categorical values are encoded as enumerations (integers) so algorithms can use them numerically when processing the data and learning relationships with the other features.

Name    Age  Zip Code  Salary
Jim     43   94404     45000
Jon     37   94407     80000
Merry   36   94404     65000
Tim     42   94403     75000
Hailey  29   94407     60000

In the above example, the Zip Code is not really a numeric value; each number represents a certain area. Using the Zip Code as a number will not create a meaningful relationship with other features such as age or salary, but if we encode it as a categorical, the relationships with the other features are defined properly. So we treat the Zip Code feature as categorical (enum) when we feed it to a machine learning algorithm.

Any string or character feature should also be set to categorical (enum) to generalize its relationship with the other features. If we add another feature named "Sex" to the above dataset, as below, then using the "Sex" feature as categorical will improve the relationships with the other features.

Name    Age  Zip Code  Sex  Salary
Jim     43   94404     M    45000
Jon     37   94407     M    80000
Merry   36   94404     F    65000
Tim     42   94403     M    75000
Hailey  29   94407     F    60000

After encoding the Zip Code and Sex features as enums, both features look like this:

Name    Age  Zip Code  Sex  Salary
Jim     43   1         1    45000
Jon     37   2         1    80000
Merry   36   1         0    65000
Tim     42   3         1    75000
Hailey  29   2         0    60000

The Name feature will not help us relate Age, Zip Code, and Sex in any way, so we can drop it and stick with Age, Zip Code, and Sex to first learn Salary and then predict Salary for new values. The input dataset will look like this:

Age  Zip Code  Sex
43   1         1
37   2         1
36   1         0
42   3         1
29   2         0

Above you can see that all the data is in numeric format and ready to be processed by the algorithm, which can learn the relationships first and then predict.
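
As an illustration only, here is a small pandas sketch of the same idea (the exact integer code assigned to each category may differ from the table above):

import pandas as pd

# Hypothetical data matching the example tables.
df = pd.DataFrame({
    "Age":      [43, 37, 36, 42, 29],
    "Zip Code": [94404, 94407, 94404, 94403, 94407],
    "Sex":      ["M", "M", "F", "M", "F"],
    "Salary":   [45000, 80000, 65000, 75000, 60000],
})

# Encode the categorical features as integer codes (enums).
for col in ["Zip Code", "Sex"]:
    df[col] = df[col].astype("category").cat.codes

print(df)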

What is One Hot Encoding?

In the above example you can see that the values Male and Female are just levels of the feature "Sex", so their interaction with the other features is not that rich or in depth. What if Male and Female were features themselves, like Age or Zip Code? In that case the relationship between being Male or Female and the rest of the dataset would be much stronger. Using one hot encoding for a specific feature provides a proper representation of the distinct values of that feature, which helps improve learning.

One Hot Encoding does exactly that: it takes each distinct value of the feature and converts it into a feature of its own, to improve its relationship with the overall data. So if we apply One Hot Encoding to the "Sex" feature, the dataset will look like this:

Age  Zip Code  M  F  Salary
43   1         1  0  45000
37   2         1  0  80000
36   1         0  1  65000
42   3         1  0  75000
29   2         0  1  60000

If we decide to apply One Hot Encoding to Zip Code as well, our dataset will look like this:

Age  94404  94407  94403  M  F  Salary
43   1      0      0      1  0  45000
37   0      1      0      1  0  80000
36   1      0      0      0  1  65000
42   0      0      1      1  0  75000
29   0      1      0      0  1  60000

Above you can see that each value now has its own explicit representation and a richer relationship with the other values. One hot encoding is also called the one-of-K scheme.

One Hot Encoding can use either a dense or a sparse implementation when it creates the new features from the encoded values.
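
As an illustration only, here is a pandas sketch of one hot encoding the Sex and Zip Code features; get_dummies produces the dense 0/1 columns shown in the tables above, and sparse=True switches to a sparse representation:

import pandas as pd

# Hypothetical data matching the example tables.
df = pd.DataFrame({
    "Age":      [43, 37, 36, 42, 29],
    "Zip Code": [94404, 94407, 94404, 94403, 94407],
    "Sex":      ["M", "M", "F", "M", "F"],
    "Salary":   [45000, 80000, 65000, 75000, 60000],
})

# One new column per distinct value of Sex and Zip Code (dense form).
encoded = pd.get_dummies(df, columns=["Sex", "Zip Code"])
# encoded = pd.get_dummies(df, columns=["Sex", "Zip Code"], sparse=True)  # sparse form
print(encoded)

The scikit-learn OneHotEncoder linked below provides the same transformation with built-in support for sparse output.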

Why Use it?

There are several good reasons to use One Hot Encoding on your data.

As you can see, One Hot Encoding introduces sparsity into the original dataset, which is more memory friendly and improves learning time if the algorithm is designed to handle sparse data properly.

Other Resources:

Please visit the following link to see the One-Hot-Encoding implementation in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

For in-depth feature engineering, please see the following slides from HJ Van Veen: