Working with variable importance data with models in H2O

When building classification models in H2O, you will get to see the variable importance table at the FLOW UI. It looks like as below:

Screen Shot 2017-04-11 at 3.18.54 PM

Most of the users are using python or R as their shell so there could be a need to get this variable importance table into python or R shell. This is what we will do in next step.

If we want to plot the variable importance graph we can use the following script:

import matplotlib.pyplot as plt
plt.rcdefaults()
fig, ax = plt.subplots()
variables = mymodel._model_json['output']['variable_importances']['variable']
y_pos = np.arange(len(variables))
scaled_importance = mymodel._model_json['output']['variable_importances']['scaled_importance']
ax.barh(y_pos, scaled_importance, align='center', color='green', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(variables)
ax.invert_yaxis()
ax.set_xlabel('Scaled Importance')
ax.set_title('Variable Importance')
plt.show()

Here is the variable importance graph looks like:

Screen Shot 2017-04-11 at 3.09.22 PM

If we want to see the variable metrics directly from the model in python we can do the following:

mymodel._model_json['output']['variable_importances'].as_data_frame()

The results are shown as below:

Screen Shot 2017-04-11 at 3.13.30 PM

Thats it, enjoy!!

Advertisements

Renaming data frame column names in H2O (python)

Sometimes you may need to change the all the column names or a specific column due to certain need, and you can do as below:

>>> df = h2o.import_file("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/iris/iris.csv")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df
 C1 C2 C3 C4 C5
---- ---- ---- ---- -----------
 5.1 3.5 1.4 0.2 Iris-setosa
 4.9 3 1.4 0.2 Iris-setosa
 4.7 3.2 1.3 0.2 Iris-setosa
 4.6 3.1 1.5 0.2 Iris-setosa
 5 3.6 1.4 0.2 Iris-setosa
 5.4 3.9 1.7 0.4 Iris-setosa
 4.6 3.4 1.4 0.3 Iris-setosa
 5 3.4 1.5 0.2 Iris-setosa
 4.4 2.9 1.4 0.2 Iris-setosa
 4.9 3.1 1.5 0.1 Iris-setosa

[150 rows x 5 columns]

>>> df.names
[u'C1', u'C2', u'C3', u'C4', u'C5']

>>> df.set_names(['A1','A2','A3','A4','A5'])
 A1 A2 A3 A4 A5
---- ---- ---- ---- ------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

If you want to change only few column names then you still need to copy the original name in the same index and just add the changed name into where applicable. For example in above data frame, we just want to change A5 to Levels and we will do as below:

>>> df.set_names(['A1','A2','A3','A4','Levels'])
 A1 A2 A3 A4 Levels
---- ---- ---- ---- --------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

The set_names function must have all names values in the array, either same name or changes names otherwise it will generate an error.

For example the following will not work and will throw an error:

>>> df.set_names(['A1'])
>>> df.set_names(['A1','A2','A3','A4','A5','A6'])

Thats it, enjoy!!

Unification of date and time data with joda in Spark

Here is the code snippet which can first parse  various kind of date and time formats and then unify them together to be processed by data munging process.

  import org.apache.spark.sql.functions._
  import org.joda.time._
  import org.joda.time.format._
  import org.apache.spark.sql.expressions.Window

    val getHour = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getHourOfDay)
        }
    })

    val getDT = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getMillis / 1000.0  )
        }
      })

    // UDF for day of week
    val getDayOfWeek = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy")
          Some(fmt.parseDateTime(s.split(" ")(0)).getDayOfWeek)
        }
      }
    })

    val getDate = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          Some(s.split(" ")(0))
        }
      }
    })

    val getDiffKey = udf((diff:Double) => {
      val threshold = 5 // 15 minutes   5 // 75%-tile seconds
      if (diff > threshold) {
        1  // tag as 2nd diff
      } else {
        0 // 1st diff
      }
    })

  val rawDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs://mr-0xc5.0xdata.loc:8020/user/file.csv")

    var df = rawDF.withColumn("hourOfDay", getHour(rawDF.col("datetime")))
      df = df.withColumn("timestamp", getDT(df.col("datetime")))
      df = df.withColumn("dayOfWeek", getDayOfWeek(df.col("datetime")))
      df = df.withColumn("date", getDate(df.col("datetime")))

A collection of Big Data Books from Packt Publication

I found that Packt publication have few great books on Big Data and here is a collection of few books which I found very useful:Screen Shot 2014-09-30 at 11.50.08 AM

Packt is giving its readers a chance to dive into their comprehensive catalog of over 2000 books and videos for the next 7 days with LevelUp program:

packt

Packt is offering all of its eBooks and Videos at just $10 each or less

The more EXP customers want to gain, the more they save:

  • Any 1 or 2 eBooks/Videos – $10 each
  • Any 3 to 5 eBooks/Videos – $8 each
  • Any 6 or more eBooks/Videos – $6 each

More Information is available at bit.ly/Yj6oWq  |  bit.ly/1yu4679

For more information please visit : www.packtpub.com/packt/offers/levelup

Merging two data set in R based on one common column

Let’s create a new dataset using mtcars dataset and only mpg and hp column:

> cars.mpg <- subset(mtcars, select = c(mpg, hp))
 
> cars.mpg
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
…………..

Let’s create another dataset using mtcars dataset and only hp and cyl column:

> cars.cyl <- subset(mtcars, select = c(hp,cyl))
> cars.cyl
                     hp cyl
Mazda RX4           110   6
Mazda RX4 Wag       110   6
Datsun 710           93   4
Hornet 4 Drive      110   6
Hornet Sportabout   175   8
Valiant             105   6
Duster 360          245   8
Merc 240D            62   4
Merc 230             95   4
Merc 280            123   6
Merc 280C           123   6
Merc 450SE          180   8
Merc 450SL          180   8
Merc 450SLC         180   8
…………………..

Now we can merge both dataset based on common column hp as below:

 
> merge.ds <- merge(cars.mpg, cars.cyl, by="hp")
> merge.ds
    hp  mpg cyl
1   52 30.4   4
2   62 24.4   4
3   65 33.9   4
4   66 32.4   4
5   66 32.4   4
6   66 27.3   4
7   66 27.3   4
8   91 26.0   4
9   93 22.8   4
10  95 22.8   4
11  97 21.5   4
12 105 18.1   6
13 109 21.4   4
14 110 21.0   6
15 110 21.0   6
16 110 21.0   6
17 110 21.0   6
18 110 21.0   6
19 110 21.0   6
20 110 21.4   6
21 110 21.4   6
22 110 21.4   6
23 113 30.4   4
24 123 17.8   6
25 123 17.8   6
26 123 19.2   6
27 123 19.2   6
28 150 15.2   8
29 150 15.2   8
30 150 15.5   8
31 150 15.5   8
32 175 18.7   8
33 175 18.7   6
34 175 18.7   8
35 175 19.7   8
36 175 19.7   6
37 175 19.7   8
38 175 19.2   8
39 175 19.2   6
40 175 19.2   8
41 180 16.4   8
42 180 16.4   8
43 180 16.4   8
44 180 17.3   8
45 180 17.3   8
46 180 17.3   8
47 180 15.2   8
48 180 15.2   8
49 180 15.2   8
50 205 10.4   8
51 215 10.4   8
52 230 14.7   8
53 245 14.3   8
54 245 14.3   8
55 245 13.3   8
56 245 13.3   8
57 264 15.8   8
58 335 15.0   8

Why you see total 58 merged rows when there were only 32 rows in original data sets?

This is because "merge is a generic function whose principal method is for data frames: the default method coerces its arguments to data frames and calls the "data.frame" method."
Learn more by calling
?merge