Installing Apache Superset on CentOS 7 with Python 3.7

The following are the starter commands to install Superset:

  • $ python --version
    • Python 3.7.5
  • $ pip install superset

Possible Errors:

You might hit any or all of the following errors:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:

building '_geohash' extension
......
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1

gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'gcc' failed with exit status 1

Look for:

  • $ gcc --version  <= You must have gcc installed
  • $ locate cc1plus  <= You must have cc1plus installed
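If you prefer to run this check from Python, below is a minimal sketch using only the standard library. Note that shutil.which only searches PATH, so it checks g++ (the package that provides cc1plus) rather than cc1plus itself, which lives under /usr/libexec:

# Minimal sketch: confirm the C/C++ toolchain is on PATH.
# cc1plus itself is not on PATH; use `locate cc1plus` to find it.
import shutil

for tool in ("gcc", "g++"):
    path = shutil.which(tool)
    print(tool, "->", path or "NOT FOUND")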

Install the required libraries and tools:

If any of the above components are missing, you need to install a few required libraries:

  • $ sudo yum install mlocate  <= For the locate command
  • $ sudo updatedb  <= Update the mlocate database
  • $ sudo yum install gcc  <= For gcc if you don't have it
  • $ sudo yum install gcc-c++  <= For cc1plus if you don't have it

Verify the following again:

  • $ gcc --version
  • $ locate cc1plus
    • /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus

Note:

  • If you can locate cc1plus properly but are still getting the error, try the following:
    • $ sudo ln -s /usr/libexec/gcc/x86_64-redhat-linux/4.8.2/cc1plus /usr/local/bin/
  • Try installing again

Final Installation:

Now you can install Superset as below:

  • $ pip install superset
    • Python 3.7.5
      Flask 1.1.1
      Werkzeug 0.16.0
  • $ superset db upgrade
  • $ export FLASK_APP=superset
  • $ flask fab create-admin
    • Recognized Database Authentications.
      Admin User admin created.
  • $ superset init
  • $ superset run -p 8080 --with-threads --reload --debugger
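With the server started by the last command above, you can run a quick smoke test from Python. This minimal sketch assumes Superset is listening locally on port 8080 and exposes its /health endpoint:

# Smoke test: assumes Superset is running locally on port 8080 and
# exposes /health (which returns "OK" when the server is healthy).
import urllib.request

with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print(resp.status, resp.read().decode())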


That’s all.

@avkashchauhan

Steps to connect Apache Superset with Apache Druid

Druid Install:

  • Install Druid and run it.
  • Get the broker port number from the Druid configuration; it is 8082 unless changed.
  • Add a test datasource to your Druid so you can access it from Superset.
  • Test (a Python equivalent of this check follows the list):
    • $ curl http://localhost:8082/druid/v2/datasources
      • ["testdf","plants"]
    • Note: You should get a list of the configured Druid datasources.
    • Note: If the above command does not work, fix that first before connecting with Superset.
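For reference, here is a Python equivalent of the curl test above. It is a minimal sketch that assumes the Druid broker is reachable at localhost:8082:

# List Druid datasources from the broker, mirroring the curl test above.
# Assumes the broker is reachable at localhost:8082.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8082/druid/v2/datasources") as resp:
    datasources = json.load(resp)

print(datasources)  # e.g. ["testdf", "plants"]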

Superset Install:

  • Make sure you have Python 3.6 or above
  • Install pydruid so Superset can connect to Druid
    • $ pip install pydruid
  • Install Superset and run it

Superset Configuration for Druid:

Step 1:

At the Superset UI, select the “Sources > Druid Clusters” menu option and fill in the following info:

  • Verbose Name: <provide a string to identify the cluster>
  • Broker Host: Input an IP address, “localhost”, or an FQDN
  • Broker Port: Input the broker port here (default Druid broker port: 8082)
  • Broker Username: If configured, input the username; otherwise leave blank
  • Broker Password: If configured, input the password; otherwise leave blank
  • Broker Endpoint: Keep the default, druid/v2
  • Cache Timeout: Add as needed or leave empty
  • Cluster: You can use the same verbose name here

The UI looks as below:

[Screenshot: Druid cluster configuration form in Superset]

Save the configuration.

Step 2: 

At the Superset UI, select the “Sources > Druid Datasources” menu option and you will see the list of datasources you have configured in Druid, as below.


[Screenshot: list of Druid datasources in Superset]

That’s all you need to get Superset working with Apache Druid.

Common Errors:

[1]

Error:
Error while processing cluster 'druid' name 'requests' is not defined

Solution:

You might have missed installing pydruid. Installing pydruid (which pulls in the requests package named in the error) should fix this problem.
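A quick way to confirm both packages are importable from the Python environment running Superset (a minimal check, nothing more):

# Dependency check: both imports must succeed for Superset's Druid connector.
try:
    import pydruid  # noqa: F401
    import requests  # noqa: F401
    print("pydruid and requests are importable")
except ImportError as err:
    print("Missing dependency:", err)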

[2]

Error while processing cluster 'druid' HTTPConnectionPool(host='druid', port=8082): Max retries exceeded with url: /druid/v2/datasources (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10bc69748>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

Solution:

Either your Druid configuration in Superset is wrong or it is missing an important value. Follow the configuration steps above to provide the correct info.


That’s all for now.

@avkashchauhan


Adding a MapBox token to Superset

To visualize geographic data with Superset, you first need to get a MapBox token and then add that token to the Superset configuration.

Please visit https://www.mapbox.com/ to request the MapBox token as needed.

Update your shell configuration to support Superset:

What you need:

  • Superset Home
    • If you installed via pip/pip3, use the site-packages location
    • If you installed from a GitHub clone, use the GitHub clone home
  • Superset Config file
    • Create a file named superset_config.py and place it in your $HOME/.superset/ folder
  • A PYTHONPATH that includes the superset_config.py location

Add the following to your .bash_profile or .zshrc:

export SUPERSET_HOME=/Users/avkashchauhan/anaconda3/lib/python3.7/site-packages/superset
export SUPERSET_CONFIG_PATH=$HOME/.superset/superset_config.py
export PYTHONPATH=/Users/avkashchauhan/anaconda3/bin/python:/Users/avkashchauhan/.superset:$PYTHONPATH
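To confirm those exports are visible to the Python process that will run Superset, here is a small sanity-check sketch (the variable names match the exports above):

# Sanity check: print the environment variables exported above.
import os

for var in ("SUPERSET_HOME", "SUPERSET_CONFIG_PATH", "PYTHONPATH"):
    print(var, "=", os.environ.get(var, "<not set>"))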

Minimal superset_config.py configuration:

#---------------------------------------------------------
# Superset specific config
#---------------------------------------------------------
ROW_LIMIT = 50000

SQLALCHEMY_DATABASE_URI = 'sqlite:////Users/avkashchauhan/.superset/superset.db'

MAPBOX_API_KEY = 'YOUR_TOKEN_HERE'
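Before starting Superset, you can import the config module the same way Superset will. This minimal sketch assumes $HOME/.superset is on PYTHONPATH as configured above:

# Import superset_config as Superset does, and echo the key settings.
import importlib

cfg = importlib.import_module("superset_config")
print("ROW_LIMIT:", cfg.ROW_LIMIT)
print("MAPBOX_API_KEY is set:", bool(cfg.MAPBOX_API_KEY))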

Start your superset instance:

$ superset run -p 8080 --with-threads --reload --debugger

Please verify the logs to make sure superset_config.py was loaded and read without any errors. A successful log will look as below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]

If there are problems, you will see one or more errors just after the above line, similar to the below:

ERROR:root:Failed to import config for SUPERSET_CONFIG_PATH=/Users/avkashchauhan/.superset/superset_config.py

If your SQLite instance is not configured correctly, you will get an error as below:

2019-11-06 14:25:51,074:ERROR:flask_appbuilder.security.sqla.manager:DB Creation and initialization failed: (sqlite3.OperationalError) unable to open database file
(Background on this error at: http://sqlalche.me/e/e3q8)

A successful superset_config.py load produces no errors, as below:

Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:16,588:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: off
2019-11-06 17:33:17,294:INFO:werkzeug: * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
2019-11-06 17:33:17,306:INFO:werkzeug: * Restarting with fsevents reloader
Loaded your LOCAL configuration at [/Users/avkashchauhan/.superset/superset_config.py]
2019-11-06 17:33:18,644:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
2019-11-06 17:33:19,345:WARNING:werkzeug: * Debugger is active!
2019-11-06 17:33:19,353:INFO:werkzeug: * Debugger PIN: 134-113-136

Now if you visualize any dataset with geographic columns (i.e., longitude and latitude), Superset will be able to show the data properly, as below:

[Screenshot: geographic visualization rendered in Superset with the MapBox token]

That’s all for now.

@avkashchauhan

Error with python-geohash installation while installing superset in OSX Catalina

Error Installing superset on OSX Catalina:

Command:

$ pip3 install superset

$ pip3 install python-geohash

Error:

Running setup.py install for python-geohash ... error
ERROR: Command errored out with exit status 1:
command: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/
Complete output (21 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.7
copying geohash.py -> build/lib.macosx-10.7-x86_64-3.7
copying quadtree.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpgrid.py -> build/lib.macosx-10.7-x86_64-3.7
copying jpiarea.py -> build/lib.macosx-10.7-x86_64-3.7
running build_ext
building '_geohash' extension
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/src
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -I/Users/avkashchauhan/anaconda3/include -arch x86_64 -DPYTHON_MODULE=1 -I/Users/avkashchauhan/anaconda3/include/python3.7m -c src/geohash.cpp -o build/temp.macosx-10.7-x86_64-3.7/src/geohash.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -L/Users/avkashchauhan/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/src/geohash.o -o build/lib.macosx-10.7-x86_64-3.7/_geohash.cpython-37m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/avkashchauhan/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"'; __file__='"'"'/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-install-9hviuey8/python-geohash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/pip-record-h0in5a0u/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Reason:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Solution:

$ sudo CFLAGS=-stdlib=libc++ pip3 install python-geohash

$ sudo CFLAGS=-stdlib=libc++ pip3 install superset
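Once the install succeeds, a short sanity check confirms the compiled _geohash extension loads; the coordinates below are just an example:

# Verify python-geohash works: encode an example latitude/longitude pair.
import geohash

print(geohash.encode(37.7749, -122.4194))  # e.g. a "9q8yy..." geohash string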

That’s all for now.

@avkashchauhan

Full working example of connecting Netezza from Java and python

Before connecting, make sure you can access the Netezza database and table from the machine where you are trying to run the Java and/or Python samples.

Connecting Netezza server from Python Sample

Check out my IPython (Jupyter) notebook with the Python sample.

Step 1: Importing python jaydebeapi library

import jaydebeapi

Step 2: Setting Database connection settings

dsn_database = "avkash"
dsn_hostname = "172.16.181.131"
dsn_port = "5480"
dsn_uid = "admin"
dsn_pwd = "password"
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar"
# JDBC URL format: jdbc:netezza://<server>:<port>/<dbName>
connection_string = 'jdbc:netezza://' + dsn_hostname + ':' + dsn_port + '/' + dsn_database
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

Step 3:Creating Database Connection

# jaydebeapi.connect(driver_class, url, driver_args, jars)
conn = jaydebeapi.connect("org.netezza.Driver", connection_string,
                          {'user': dsn_uid, 'password': dsn_pwd},
                          jars="/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar")
curs = conn.cursor()

Step 4:Processing SQL Query

curs.execute("select * from allusers")
result = curs.fetchall()
print("Total records: " + str(len(result)))
print(result[0])

Step 5: Printing all records

for i in range(len(result)):
    print(result[i])

Step 6: Closing all connections

curs.close()
conn.close()
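Putting Steps 1 through 6 together, here is a compact variant of the same flow with cleanup in finally blocks. It uses the same assumed host, credentials, driver jar, and allusers table as above:

# Compact variant of the steps above: connect, query, and always clean up.
import jaydebeapi

conn = jaydebeapi.connect(
    "org.netezza.Driver",
    "jdbc:netezza://172.16.181.131:5480/avkash",
    {"user": "admin", "password": "password"},
    jars="/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar",
)
try:
    curs = conn.cursor()
    try:
        curs.execute("select * from allusers")
        for row in curs.fetchall():
            print(row)
    finally:
        curs.close()
finally:
    conn.close()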

Connecting Netezza server from Java Code Sample

Step 1: Have the Netezza driver as nzjdbc3.jar in a folder.

Step 2: Create netezzaJdbcMain.java as below in the same folder where nzjdbc3.jar is placed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
public class netezzaJdbcMain {
    public static void main(String[] args) {
        String server = "x.x.x.x";
        String port = "5480";
        String dbName = "_db_name_";
        String url = "jdbc:netezza://" + server + "/" + dbName ;
        String user = "admin";
        String pwd = "password";
        String schema = "db_schema";
        Connection conn = null;
        Statement st = null;
        ResultSet rs = null;
        try {
            Class.forName("org.netezza.Driver");
            System.out.println(" Connecting ... ");
            conn = DriverManager.getConnection(url, user, pwd);
            System.out.println(" Connected "+conn);
            
            String sql = "select * from allusers";
            st = conn.createStatement();
            rs = st.executeQuery(sql);

            System.out.println("Printing result...");
            int i = 0;
            while (rs.next()) {
                String userName = rs.getString("name");
                int year = rs.getInt("age");
                System.out.println("User: " + userName +
                        ", age is: " + year);
                i++;
            }
            if (i==0){
                System.out.println(" No data found");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (rs != null)
                    rs.close();
                if (st != null)
                    st.close();
                if (conn != null)
                    conn.close();
            } catch (SQLException e1) {
                e1.printStackTrace();
            }
        }
    }
}

Step 3: Compile code as below:

$ javac -cp nzjdbc3.jar -J-Xmx2g -J-XX:MaxPermSize=128m netezzaJdbcMain.java
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0

Note: Your main class should compile without any problem (the MaxPermSize warning shown above is harmless on Java 8+).

Step 4: Run compiled class as below:

$ java -cp .:nzjdbc3.jar netezzaJdbcMain

 Connecting ...
 Connected org.netezza.sql.NzConnection@3feba861
Printing result...
User: John                , age is: 30
User: Jason               , age is: 26
User: Jim                 , age is: 20
User: Kyle                , age is: 21
User: Kim                 , age is: 27

Note: You will see results similar to the above.

That's it, enjoy!

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

What is Kylin?

  • Kylin is an open source distributed analytics engine from eBay, with a SQL interface and multi-dimensional analysis (OLAP) support for extremely large datasets on Hadoop.


Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions (see the query sketch after this list)
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build cubes in Kylin with more than 10 billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration capability with BI Tools like Tableau.
  • Other Highlights:
    • Job Management and Monitoring
    • Compression and Encoding Support
    • Incremental Refresh of Cubes
    • Leverages the HBase coprocessor to reduce query latency
    • Approximate query capability for distinct count (HyperLogLog)
    • Easy Web interface to manage, build, monitor and query cubes
    • Security capability to set ACL at Cube/Project Level
    • Supports LDAP integration
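As a taste of the ANSI-SQL interface, the sketch below runs a query through Kylin's REST API. It assumes a local Kylin instance on the default port 7070 with the default ADMIN/KYLIN credentials, and the project and table names are hypothetical; adjust all of these for your setup:

# Sketch: submit a SQL query to Kylin's REST API (POST /kylin/api/query).
# Assumptions: Kylin at localhost:7070, default ADMIN/KYLIN credentials,
# and a hypothetical project "my_project" containing a SALES table.
import base64
import json
import urllib.request

creds = base64.b64encode(b"ADMIN:KYLIN").decode()
payload = json.dumps({"sql": "select count(*) from SALES", "project": "my_project"})

req = urllib.request.Request(
    "http://localhost:7070/kylin/api/query",
    data=payload.encode(),
    headers={"Authorization": "Basic " + creds, "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))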

Keywords: Kylin, Big Data, Hadoop, Jobs, OLAP, SQL, Query