Full working example of connecting to Netezza from Java and Python

Before you start, make sure you can access the Netezza database and table from the machine where you are trying to run the Java and/or Python samples.

Connecting to a Netezza server from Python

Check out my IPython/Jupyter notebook with the Python sample.

Step 1: Importing the Python jaydebeapi library

import jaydebeapi

Step 2: Setting up the database connection parameters

dsn_database = "avkash"            
dsn_hostname = "172.16.181.131" 
dsn_port = "5480"                
dsn_uid = "admin"        
dsn_pwd = "password"      
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar"
# JDBC URL format: jdbc:netezza://<host>:<port>/<database>
connection_string='jdbc:netezza://'+dsn_hostname+':'+dsn_port+'/'+dsn_database
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

Step 3: Creating the database connection

conn = jaydebeapi.connect("org.netezza.Driver", connection_string, {'user': dsn_uid, 'password': dsn_pwd},
                         jars = "/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar")
curs = conn.cursor()

Step 4: Processing the SQL query

curs.execute("select * from allusers")
result = curs.fetchall()
print("Total records: " + str(len(result)))
print(result[0])

Step 5: Printing all records

for row in result:
    print(row)

Step 6: Closing all connections

curs.close()
conn.close()
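
For convenience, here is a minimal end-to-end sketch that combines the steps above and always releases the cursor and connection, even if the query fails. The host, credentials, table name, and driver jar path are the same placeholder values used earlier and must be replaced with your own.

import jaydebeapi

# Placeholder connection details -- replace with your own Netezza host, database, and credentials
jdbc_driver_loc = "/Users/avkashchauhan/learn/customers/netezza/nzjdbc3.jar"
connection_string = "jdbc:netezza://172.16.181.131:5480/avkash"

conn = jaydebeapi.connect("org.netezza.Driver", connection_string,
                          {'user': 'admin', 'password': 'password'},
                          jars=jdbc_driver_loc)
curs = conn.cursor()
try:
    curs.execute("select * from allusers")
    for row in curs.fetchall():
        print(row)
finally:
    # Always close the cursor and connection, even if the query throws
    curs.close()
    conn.close()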

Connecting to a Netezza server from Java

Step 1: Place the Netezza JDBC driver, nzjdbc3.jar, in a folder.

Step 2: Create netezzaJdbcMain.java as below in the same folder where nzjdbc3.jar is placed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
public class netezzaJdbcMain {
    public static void main(String[] args) {
        String server = "x.x.x.x";
        String port = "5480";
        String dbName = "_db_name_";
        String url = "jdbc:netezza://" + server + ":" + port + "/" + dbName;
        String user = "admin";
        String pwd = "password";
        String schema = "db_schema";
        Connection conn = null;
        Statement st = null;
        ResultSet rs = null;
        try {
            Class.forName("org.netezza.Driver");
            System.out.println(" Connecting ... ");
            conn = DriverManager.getConnection(url, user, pwd);
            System.out.println(" Connected "+conn);
            
            String sql = "select * from allusers";
            st = conn.createStatement();
            rs = st.executeQuery(sql);

            System.out.println("Printing result...");
            int i = 0;
            while (rs.next()) {
                String userName = rs.getString("name");
                int age = rs.getInt("age");
                System.out.println("User: " + userName +
                        ", age is: " + age);
                i++;
            }
            if (i==0){
                System.out.println(" No data found");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if( rs != null) 
                    rs.close();
                if( st!= null)
                    st.close();
                if( conn != null)
                    conn.close();
            } catch (SQLException e1) {
                    e1.printStackTrace();
                }
        }
    }
}

Step 3: Compile code as below:

$ javac -cp nzjdbc3.jar -J-Xmx2g -J-XX:MaxPermSize=128m netezzaJdbcMain.java
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0

Note: Your main class should compile without any problems; the MaxPermSize warning on Java 8 is harmless since the option is simply ignored.

Step 4: Run compiled class as below:

$ java -cp .:nzjdbc3.jar netezzaJdbcMain

 Connecting ...
 Connected org.netezza.sql.NzConnection@3feba861
Printing result...
User: John                , age is: 30
User: Jason               , age is: 26
User: Jim                 , age is: 20
User: Kyle                , age is: 21
User: Kim                 , age is: 27

Note: You will see results similar to the above.

That's it, enjoy!


Reading a nested Parquet file in Scala and exporting to CSV

Recently we were working on a problem where a Parquet-compressed file had lots of nested tables, some of which had columns of array type, and our objective was to read it and save it to CSV.

We wrote a script in Scala which does the following:

  • Handles nested Parquet-compressed content
  • Looks for columns of Array type and removes those columns

Here is the script:

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colPath = if (prefix == null) s"`${f.name}`" else s"${prefix}.`${f.name}`"

    f.dataType match {
      case st: StructType => flattenSchema(st, colPath)
      // Skip user defined types like array or vectors
      case x if x.isInstanceOf[ArrayType] => Array.empty[Column]
      case _ => Array(col(colPath).alias(colPath.replaceAll("[.`]", "_")))
    }
  })
}

Here are all the steps you need to take to read the Parquet-compressed content and then export it to disk as CSV.

val spark = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

scala> :paste
// Entering paste mode (ctrl-D to finish)

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colPath = if (prefix == null) s"`${f.name}`" else s"${prefix}.`${f.name}`"

    f.dataType match {
      case st: StructType => flattenSchema(st, colPath)
      // Skip user defined types like array or vectors
      case x if x.isInstanceOf[ArrayType] => Array.empty[Column]
      case _ => Array(col(colPath).alias(colPath.replaceAll("[.`]", "_")))
    }
  })
}

// Exiting paste mode, now interpreting.

flattenSchema: (schema: org.apache.spark.sql.types.StructType, prefix: String)Array[org.apache.spark.sql.Column]

scala>

val df = spark.read.parquet("/user/avkash/test.parquet")

df.select(flattenSchema(df.schema):_*).write.format("com.databricks.spark.csv").save("/Users/avkashchauhan/Downloads/saveit/result.csv")
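
If you are working from PySpark rather than the Scala shell, a rough equivalent of the same flattening logic is sketched below. It is only a sketch: the output path is a placeholder, and in Spark 2.x the built-in CSV writer can be used instead of the spark-csv package.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, ArrayType

spark = SparkSession.builder.appName("flatten-parquet").getOrCreate()

def flatten_schema(schema, prefix=None):
    # Recursively flatten struct columns and skip array columns,
    # mirroring the Scala flattenSchema above
    columns = []
    for f in schema.fields:
        col_path = f.name if prefix is None else prefix + "." + f.name
        if isinstance(f.dataType, StructType):
            columns += flatten_schema(f.dataType, col_path)
        elif isinstance(f.dataType, ArrayType):
            continue
        else:
            columns.append(col(col_path).alias(col_path.replace(".", "_")))
    return columns

df = spark.read.parquet("/user/avkash/test.parquet")
df.select(*flatten_schema(df.schema)).write.csv("/tmp/flattened_csv", header=True)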

If you want to see the full working scripts with output, you can visit either of the following links, depending on your Spark version:

  • Here is the full working demo in Spark 2.1.0
  • Here is the full working demo in Spark 1.6.x.

We got some help from the StackOverflow discussion here; Michal K and Michal M helped me write the above solution.

That's it, enjoy!

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

What is Kylin?

  • Kylin is an open source distributed analytics engine from eBay, with a SQL interface and multi-dimensional analysis (OLAP), built to support extremely large datasets on Hadoop.


Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions (see the JDBC sketch after this list)
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries on the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build cubes in Kylin from more than 10 billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration with BI tools such as Tableau
  • Other Highlights:
    • Job management and monitoring
    • Compression and encoding support
    • Incremental refresh of cubes
    • Leverages HBase coprocessors to reduce query latency
    • Approximate query capability for distinct count (HyperLogLog)
    • Easy web interface to manage, build, monitor, and query cubes
    • Security capability to set ACLs at the cube/project level
    • LDAP integration support
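
As a quick illustration of the ANSI-SQL interface, the sketch below queries Kylin over its JDBC driver using the same jaydebeapi approach as the Netezza example above. The host, port, project, credentials, table, and driver jar path are all assumptions (they follow the defaults of Kylin's bundled sample cube) and should be replaced with your own.

import jaydebeapi

# Assumed values: replace host, port, project, credentials, and the Kylin JDBC jar path
kylin_url = "jdbc:kylin://localhost:7070/learn_kylin"
conn = jaydebeapi.connect("org.apache.kylin.jdbc.Driver", kylin_url,
                          {'user': 'ADMIN', 'password': 'KYLIN'},
                          jars="/path/to/kylin-jdbc.jar")
curs = conn.cursor()
curs.execute("select part_dt, count(*) from kylin_sales group by part_dt")
for row in curs.fetchall():
    print(row)
curs.close()
conn.close()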

Keywords: Kylin, Big Data, Hadoop, Jobs, OLAP, SQL, Query

Handling Apache Hive Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Recently I hit the following error when starting Apache Hive on my server:

ubuntu@HIVE_SERVER:~$ hive
log4j:ERROR Could not find value for key log4j.appender.DEBUG
log4j:ERROR Could not instantiate appender named "DEBUG".
log4j:ERROR Could not find value for key log4j.appender.WARN
log4j:ERROR Could not instantiate appender named "WARN".
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /var/log/hive/hive.log (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:223)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:415)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:52)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:629)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201306111657_1917799460.txt
hive> show tables;
FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> quit;

Potential Root Cause:

This was mainly because I had a previous Hive installation on this machine and had uninstalled it. Not all of the files were removed completely, the /var/log/hive/hive.log file was missing, and most of the settings were messed up.

Even when I created an empty /var/log/hive/hive.log (touch /var/log/hive/hive.log and chmod 777 /var/log/hive/hive.log), the problem still persisted.

Removing Hive:

ubuntu@HIVE_SERVER:~$ sudo apt-get remove hive
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following packages were automatically installed and are no longer required:
libopts25 libcap2 hbase ntp
Use ‘apt-get autoremove’ to remove them.
The following packages will be REMOVED:
hive
0 upgraded, 0 newly installed, 1 to remove and 26 not upgraded.
After this operation, 175 MB disk space will be freed.
Do you want to continue [Y/n]? Y
(Reading database … 57309 files and directories currently installed.)
Removing hive …
Processing triggers for man-db …

Checking what else is left after removal:

ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"
/etc/hive
/usr/lib/hive
/var/lib/hive

Deleting all of the above residual Hive files:

ubuntu@HIVE_SERVER:~$ sudo rm -rf /etc/hive
ubuntu@HIVE_SERVER:~$ sudo rm -rf /usr/lib/hive
ubuntu@HIVE_SERVER:~$ sudo rm -rf /var/lib/hive

Verifying that no Hive files remain on my machine:

ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"

Installing Hive (from the Cloudera distribution):

ubuntu@HIVE_SERVER:~$ sudo apt-get install hive
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following NEW packages will be installed:
hive
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
Need to get 0 B/32.4 MB of archives.
After this operation, 175 MB of additional disk space will be used.
Selecting previously unselected package hive.
(Reading database … 50478 files and directories currently installed.)
Unpacking hive (from …/hive_0.10.0+78-1.cdh4.2.1.p0.8~precise-cdh4.2.1_all.deb) …
Processing triggers for man-db …
Setting up hive (0.10.0+78-1.cdh4.2.1.p0.8~precise-cdh4.2.1) …
update-alternatives: using /etc/hive/conf.dist to provide /etc/hive/conf (hive-conf) in auto mode.

Listing Hive files on my machine:

ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"
/etc/hive
/run/hive
/usr/lib/hive
/usr/lib/hive/lib/php/packages/serde/org/apache/hadoop/hive
/usr/lib/hive/docs/api/org/apache/hadoop/hive
/usr/lib/hive/docs/api/org/apache/hive
/usr/lib/hive/bin/hive
/usr/share/doc/hive
/usr/share/doc/hive/examples/test-plugin/src/org/apache/hive
/usr/bin/hive
/var/lib/hive
/var/log/hive

Testing Hive:

ubuntu@HIVE_SERVER:~$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201306111709_559685140.txt
hive> show tables;
OK
Time taken: 14.431 seconds
hive> quit;

It's working!

Facts about any NoSQL Database

Read both articles (this one and the follow-up on types of NoSQL databases) for more info.

Here are some facts about any NoSQL database:

  • Experimental spontaneity
  • Performance lies in application design
  • The structure of the query is more important than the structure of the data
  • To design the DB, you always start from the kind of query that will be used to access the data
  • The approach is results-centric: what do you really expect as the result?
  • This also helps you design how to run an efficient query
  • Queries are used to define data objects
  • Accepting less-than-perfect consistency provides huge improvements in performance
  • Big-data-backed applications have the DB spread across thousands of nodes, so
    • The performance penalty of locking the DB to write/modify is huge
    • With frequent writes, serializing all writes loses the advantage of a distributed database
  • In an eventually consistent DB, changes arrive at the nodes within tenths of a second before the DB reaches a consistent state
  • For web applications, availability always gets priority over consistency
  • If consistency is not guaranteed by the DB, the application has to manage it, which makes the application more complex
  • By adding replication and failover strategies, a database designed for consistency can deliver very high availability
    • HBase is an example here

Types of NoSQL databases and extensive details

Please study the first article, "Why there is a need for NoSql Database?", as background for this article.

What about NoSQL:

  • NoSQL is not a completely “no schema” DB
  • There are mainly 3 types of NoSQL DBs:
    • Document DB
    • Data Structure Oriented DB
    • Column Oriented DB

 What is a Document DB?

  • Documents are key-value pairs
  • Documents can also be stored in JSON format
  • Because of JSON, a document can be treated as an object
  • JSON documents are used as key-value pairs
  • A document can have any set of keys
  • Any key can be associated with an arbitrarily complex value, which is itself a JSON document
  • Documents are added with different sets of keys
    • Missing keys
    • Extra keys
    • Keys can be added in the future when needed
    • The application must know which keys are present
    • Queries are made on keys
    • Indexes are set on keys to make searches efficient (see the sketch after this list)
  • Examples: CouchDB, MongoDB, Redis, Riak
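
To make the “documents with different key sets” idea concrete, here is a tiny sketch using plain Python dictionaries as stand-in JSON documents; it is only illustrative and not tied to any particular document database.

import json

# Two documents in the same "collection" with different key sets
doc1 = {"_id": "u1", "name": "John", "age": 30}
doc2 = {"_id": "u2", "name": "Kim", "email": "kim@example.com"}  # no "age", extra "email"

collection = [doc1, doc2]

# Queries are made on keys; the application must tolerate missing keys
adults = [d for d in collection if d.get("age", 0) >= 18]
print(json.dumps(adults, indent=2))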

Example of Document DB – CouchDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are documents in the DB that specify searches
  • Views can be complex
  • Views can use map/reduce to process and summarize results (see the HTTP sketch after this list)
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster
  • Single headed database
  • Can run in cluster environment (not available in core)
  • From CAP Theorem –
    • Partition Tolerance
    • Availability
    • In a non-clustered environment, availability is the priority
    • In a clustered environment, consistency is the priority
  • BigCouch
    • Integrating clustering with CouchDB
    • Cloudant merging CouchDB & BigCouch
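
Since CouchDB exposes everything over HTTP, here is a minimal sketch using the Python requests library against a local CouchDB instance. The host, database name, and design document/view are assumptions, and the view itself would have to already exist on the server.

import requests

base = "http://localhost:5984"  # assumed local CouchDB instance

# Create a database and store a JSON document under a chosen id
requests.put(base + "/users")
requests.put(base + "/users/u1", json={"name": "John", "age": 30})

# Fetch the document back
print(requests.get(base + "/users/u1").json())

# Query an assumed view defined in a design document on the server
resp = requests.get(base + "/users/_design/app/_view/by_age", params={"key": 30})
print(resp.json())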

Example of Document DB – MongoDB

  • The value is a plain string in JSON format
  • Queries are views
  • Views are JSON documents specifying fields and values to match
  • Query results can be processed by built-in map/reduce (see the pymongo sketch after this list)
  • Single headed database
  • Can run in cluster environment (not available in core)
  • From CAP Theorem –
    • Partition Tolerance
    • Availability
    • In a non-clustered environment, availability is the priority
    • In a clustered environment, consistency is the priority
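
For comparison, here is a small sketch of the same kind of key-based querying against MongoDB through the pymongo driver; the host, database, and collection names are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed local MongoDB
db = client["testdb"]

# Documents in the same collection may carry different key sets
db.users.insert_many([
    {"name": "John", "age": 30},
    {"name": "Kim", "age": 27, "email": "kim@example.com"},
])

# Query on keys; an index on the key makes the lookup efficient
db.users.create_index("age")
for doc in db.users.find({"age": {"$gt": 25}}):
    print(doc)

client.close()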

Example of Document DB – Riak

  • A document database with more flexible document types
  • Supports JSON, XML, plain text
  • A plugin architecture supports adding other document types
  • Queries must know the structure of JSON or XML for proper results
  • Query results can be processed by built-in map/reduce
  • Built-in control over replication and distribution
  • Core is designed to run in cluster environment
  • From CAP Theorem –
    • Partition Tolerance
    • Availability
    • Note: Tradeoff between availability and consistency is tunable
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster

Data Structure Oriented DB – Redis:

  • In-memory DB for the fastest read and write speeds
  • If the dataset can fit in memory, it is the top choice
  • Great for raw speed
  • Data isn't saved to disk and is lost in case of a crash
  • Can be configured to save to disk, but with a performance hit
  • Limited scalability with some replication
  • Cluster Replication Support is coming
  • In Redis there is a difference:
    • The value can be a data structure (lists or sets)
    • You can do unions and intersections on lists and sets (see the sketch after this list)
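
Here is a small sketch of the set operations mentioned above, using the redis-py client against a local Redis server (host and port are assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis

# Values can be data structures such as sets
r.sadd("admins", "john", "kim")
r.sadd("developers", "kim", "jason")

# Union and intersection are performed server-side
print(r.sunion("admins", "developers"))  # members of either set
print(r.sinter("admins", "developers"))  # members of both sets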

Column Oriented DB

  • Also considered a “sparse row store”
  • Equivalent to a “relational table”: a set of rows identified by a key
  • Concept starts with columns
  • Data is organized in the columns
  • Columns are stored contiguously
  • Columns tend to have similar data
  • A row can have as many columns as needed
  • Columns are essentially keys that let you look up values in the rows
  • Columns can be added any time
  • Unused columns in a row do not occupy storage (see the sketch after this list)
  • NULLs don't exist
  • Writes data to an append-only file, which is extremely efficient and makes writes significantly faster
  • Built-in control over replication and distribution
  • Examples: HBase and Cassandra
  • HBase
    • From CAP Theorem
      • Partition Tolerance
      • Consistency
  • Cassandra
    • From CAP Theorem
      • Partition Tolerance
      • Availability
      • Note: The tradeoff between availability and consistency is tunable
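
To illustrate the sparse, column-family layout, here is a rough sketch using the happybase client against an HBase Thrift server. The host, table, and column family names are assumptions, and the table is presumed to already exist.

import happybase

connection = happybase.Connection("localhost")  # assumed HBase Thrift server
table = connection.table("users")               # assumed table with column family 'info'

# Rows can have different column sets; unused columns take no storage
table.put(b"row1", {b"info:name": b"John", b"info:age": b"30"})
table.put(b"row2", {b"info:name": b"Kim"})      # no 'age' column for this row

for key, data in table.scan():
    print(key, data)

connection.close()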

Additional functionality supported by NoSQL DBs:

  • Scripting language support
    • JavaScript
      • CouchDB, MongoDB
    • Pig
      • Hadoop
    • Hive
      • Hadoop
    • Lua
      • Redis
  • RESTful interface
    • CouchDB and Riak
    • CouchDB can be considered the best fit with a web application framework
    • Riak also provides a traditional protocol buffer interface