Flatten complex nested parquet files on Hadoop with Herringbone

Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive: https://github.com/stripe/herringbone

Please visit my GitHub and this specific page for more details.

Installation:

Note: You must be on a Hadoop machine; Herringbone needs a Hadoop environment.

Prerequisite: Thrift

  • Thrift 0.9.1 (you MUST use 0.9.1; 0.9.3 and 0.10.0 give errors while packaging)
  • Get Thrift 0.9.1: Link

Prerequisite: Impala

  • First, set up the Cloudera repo on your machine.
  • Install Impala:
    • Install Impala: $ sudo apt-get install impala
    • Install the Impala server: $ sudo apt-get install impala-server
    • Install the Impala state store: $ sudo apt-get install impala-state-store
    • Install the Impala shell: $ sudo apt-get install impala-shell
    • Verify Impala: $ impala-shell
impala-shell
Starting Impala Shell without Kerberos authentication
Connected to mr-0xd7-precise1.0xdata.loc:21000
Server version: impalad version 2.6.0-cdh5.8.4 RELEASE (build 207450616f75adbe082a4c2e1145a2384da83fa6)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.4.0-cdh4-INTERNAL (08fa346) built on Mon Jul 14 15:52:52 PDT 2014)

Building: Herringbone source

To build, clone the repository and run "mvn package" from the herringbone directory. Here is a successful "mvn package" log for review:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Herringbone Impala
[INFO] Herringbone Main
[INFO] Herringbone
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone Impala 0.0.2
[INFO] ------------------------------------------------------------------------
..
..
..
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone 0.0.1
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Herringbone Impala ................................. SUCCESS [ 2.930 s]
[INFO] Herringbone Main ................................... SUCCESS [ 13.012 s]
[INFO] Herringbone ........................................ SUCCESS [ 0.000 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.079 s
[INFO] Finished at: 2017-10-06T11:27:20-07:00
[INFO] Final Memory: 90M/1963M
[INFO] ------------------------------------------------------------------------

Using Herringbone

Note: Your files must be on HDFS (Hadoop), not on the local file system.

Verify the file on Hadoop:

  • ~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet
  • -rw-r--r-- 3 avkash avkash 1463376 2017-09-13 16:56 /user/avkash/file-test1.parquet
  • ~/herringbone$ bin/herringbone flatten -i /user/avkash/file-test1.parquet
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/avkash/herringbone/herringbone-main/target/herringbone-0.0.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/10/06 12:06:44 INFO client.RMProxy: Connecting to ResourceManager at mr-0xd1-precise1.0xdata.loc/172.16.2.211:8032
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/10/06 12:06:45 INFO input.FileInputFormat: Total input paths to process : 1
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
1 initial splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
1 merged splits were generated.
  Max: 1.34M
  Min: 1.34M
  Avg: 1.34M
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: number of splits:1
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1499294366934_0707
17/10/06 12:06:45 INFO impl.YarnClientImpl: Submitted application application_1499294366934_0707
17/10/06 12:06:46 INFO mapreduce.Job: The url to track the job: http://mr-0xd1-precise1.0xdata.loc:8088/proxy/application_1499294366934_0707/
17/10/06 12:06:46 INFO mapreduce.Job: Running job: job_1499294366934_0707
17/10/06 12:06:52 INFO mapreduce.Job: Job job_1499294366934_0707 running in uber mode : false
17/10/06 12:06:52 INFO mapreduce.Job:  map 0% reduce 0%
17/10/06 12:07:22 INFO mapreduce.Job:  map 100% reduce 0%

Now verify the file:

~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet-flat

Found 2 items
-rw-r--r--   3 avkash avkash          0 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/_SUCCESS
-rw-r--r--   3 avkash avkash    2901311 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/part-m-00000.parquet
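
If you want to confirm that the nesting was actually removed, one option (an assumption on my part, not part of Herringbone itself) is to copy the output locally with "hadoop fs -get" and compare the schemas using the pyarrow Python package:

import pyarrow.parquet as pq

# Original (nested) schema vs. the flattened output produced by herringbone flatten
print(pq.ParquetFile('file-test1.parquet').schema)
print(pq.ParquetFile('file-test1.parquet-flat/part-m-00000.parquet').schema)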

That's it, enjoy!!


Handling exception “Argument python_obj should be a …”

Recently I hit the following exception when running Python code with H2O functions on a new machine; the same code did not fail on my main machine. The exception was as below:

H2OTypeError: Argument `python_obj` should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable
Error in sys.excepthook:
Traceback (most recent call last):
 File "/usr/local/lib/python2.7/site-packages/h2o/utils/debugging.py", line 95, in _except_hook
 _handle_soft_error(exc_type, exc_value, exc_tb)
 File "/usr/local/lib/python2.7/site-packages/h2o/utils/debugging.py", line 225, in _handle_soft_error
 args_str = _get_args_str(func, highlight=highlight)
 File "/usr/local/lib/python2.7/site-packages/h2o/utils/debugging.py", line 316, in _get_args_str
 s = str(inspect.signature(func))[1:-1]

The following message is worth exploring:

Argument python_obj should be a None | list | tuple | dict | numpy.ndarray | pandas.DataFrame | scipy.sparse.issparse, got H2OTwoDimTable

Diagnostics:

  • The method looks for numpy, pandas, and scipy to be available on the machine
  • I checked: numpy was installed, but pandas was missing
  • The missing pandas library produced this cryptic error message

Solution:

After installing the pandas library, the problem was resolved.
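
As a quick sanity check on a new machine (just a sketch), you can verify that the libraries H2O looks for are importable before running your code:

# Check that the optional dependencies H2O's type checks expect are importable;
# install any missing one, e.g. pip install pandas
for module in ("numpy", "pandas", "scipy"):
    try:
        __import__(module)
        print(module + " is available")
    except ImportError:
        print(module + " is missing")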

That's it, enjoy!!

Exploring & transforming H2O Data Frame in R and Python

Sometimes you need to ingest a dataset to build models, and your first task is to explore all the features and their types. Once that is done, you may want to change some feature types to the ones you want.

Here is the code snippet in Python:

df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
df.types
{    u'AGE': u'int', u'CAPSULE': u'int', u'DCAPS': u'int', 
     u'DPROS': u'int', u'GLEASON': u'int', u'ID': u'int',
     u'PSA': u'real', u'RACE': u'int', u'VOL': u'real'
}
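
For example, to convert a single column (here CAPSULE) to categorical and re-check its type:

df['CAPSULE'] = df['CAPSULE'].asfactor()
df.types   # CAPSULE is now reported as 'enum'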

If you would like to visualize all the features in graphical format, you can do the following:

import pylab as pl
df.as_data_frame().hist(figsize=(20,20))
pl.show()

The result in the Jupyter notebook is a grid of histograms, one per feature.

Note: If you have more than about 50 features, you may want to trim the data frame to fewer columns so the visualization stays readable; a sketch follows.
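
Here is one way to do that (just a sketch, assuming the frame df from above):

subset = df[df.columns[:20]]      # keep only the first 20 columns
subset.as_data_frame().hist(figsize=(20,20))
pl.show()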

You can also use the following function to convert a list of columns to factor/categorical by passing an H2O data frame and a list of column names:

def convert_columns_as_factor(hdf, column_list):
    list_count = len(column_list)
    if list_count == 0:
        return "Error: You don't have a list of columns to convert."
    if len(hdf.columns) == 0:
        return "Error: You don't have any columns in your data frame."
    local_column_list = hdf.columns
    for i in range(list_count):
        try:
            # index() raises ValueError if the column is not present in the frame
            target_index = local_column_list.index(column_list[i])
            hdf[column_list[i]] = hdf[column_list[i]].asfactor()
            print('Column ' + column_list[i] + " is converted into factor/categorical.")
        except ValueError:
            print('Error: ' + str(column_list[i]) + " not found in the data frame.")
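
For example, with the prostate frame imported above:

convert_columns_as_factor(df, ['CAPSULE', 'RACE', 'DCAPS'])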

The following R script performs the same tasks as above:

N=100
set.seed(999)
color = sample(c("D","E","I","F","M"),size=N,replace=TRUE)
num = rnorm(N,mean = 12,sd = 21212)
sex = sample(c("male","female"),size=N,replace=TRUE)
sex = as.factor(sex)
color = as.factor(color)
data = sample(c(0,1),size = N,replace = T)
fdata = factor(data)
table(fdata)
dd = data.frame(color,sex,num,fdata)
data = as.h2o(dd)
str(data)
data$sex = h2o.setLevels(x = data$sex ,levels = c("F","M"))
data
That's it, enjoy!!

H2O Word2Vec Tutorial with example in Scala

If you would like to know what word2vec is and why you should use it, there is plenty of material available to read. You can learn more about the H2O implementation of Word2Vec here, along with its configuration and interpretation.

In this Scala example we will use the H2O Word2Vec algorithm to build a model from the given text (provided as a text file or an array).

The full Scala code of this example is available at my GitHub.

Let's start the H2O cluster first:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)

Now we will import the libraries required to get the job done:

import scala.io.Source
import _root_.hex.word2vec.{Word2Vec, Word2VecModel}
import _root_.hex.word2vec.Word2VecModel.Word2VecParameters
import water.fvec.Vec

Now we will create a list of stop words; these are not useful for text mining and will be removed from the word source:

val STOP_WORDS = Set("ourselves", "hers", "between", "yourself", "but", "again", "there", "about", 
    "once", "during", "out", "very", "having", "with", "they", "own", "an", "be", "some", "for", "do", 
    "its", "yours", "such", "into", "of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as", 
     "from", "him", "each", "the", "themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were", "her", 
    "more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up", 
    "to", "ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have", "in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can", 
    "did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself", "which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by", "doing", 
    "it", "how", "further", "was", "here", "than")

Now let's ingest the text data. We will run the Word2Vec algorithm to vectorize the data first and then run a machine learning experiment on it.

I have downloaded a free copy of the story "The Adventures of Sherlock Holmes" from the Internet and am using it as my source.

val filename = "/Users/avkashchauhan/Downloads/TheAdventuresOfSherlockHolmes.txt"
val lines = Source.fromFile(filename).getLines.toArray
val sparkframe = sc.parallelize(lines)

Now let's define the tokenize function, which will convert our input text to tokens:

def tokenize(line: String) = {
  // Get rid of non-words such as punctuation, as opposed to splitting by just " "
  line.split("""\W+""")
    .map(_.toLowerCase)
    // Remove the stop words defined above
    .filterNot(word => STOP_WORDS.contains(word)) :+ null
}

Now we will call the tokenize function on every line to create the collection of labelled words:

val allLabelledWords = sparkframe.flatMap(d => tokenize(d))

Note: You can also use your own tokenize function or one from a library; you just need to map that function over the data.

Now let's convert the collection of labelled words into an H2O frame:

val h2oFrame = h2oContext.asH2OFrame(allLabelledWords)

Now it's time to use the H2O Word2Vec algorithm, configuring its parameters first:

val w2vParams = new Word2VecParameters
w2vParams._train = h2oFrame._key
w2vParams._epochs = 500
w2vParams._min_word_freq = 0
w2vParams._init_learning_rate = 0.05f
w2vParams._window_size = 20
w2vParams._vec_size = 20
w2vParams._sent_sample_rate = 0.0001f

Now we will perform the real action, building the model:

val w2v = new Word2Vec(w2vParams).trainModel().get()

Now we can apply the model to perform some actions.

Let's start by finding synonyms with the word2vec model. We call the findSynonyms method with a given word and a count; the result is the top 'count' synonyms with their distance values:

w2v.findSynonyms("love", 3)
w2v.findSynonyms("help", 2)
w2v.findSynonyms("hate", 1)

Let's transform words using the w2v model:

The transform() function takes an H2O Vec as its first parameter (here the word column extracted from the H2O frame h2oFrame) and an aggregation method; with AggregateMethod.NONE each word is mapped to its own vector:

val newSparkFrame = w2v.transform(h2oFrame.vec(0), Word2VecModel.AggregateMethod.NONE).toTwoDimTable()

That's it, enjoy!!