Exploring & transforming H2O Data Frame in R and Python

Sometimes you need to ingest a dataset to build models, and your first task is to explore all of its features and their types. Once that is done, you may want to change some feature types to the ones you need.

Here is the code snippet in Python:

import h2o
h2o.init()
df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
df.types
{    u'AGE': u'int', u'CAPSULE': u'int', u'DCAPS': u'int', 
     u'DPROS': u'int', u'GLEASON': u'int', u'ID': u'int',
     u'PSA': u'real', u'RACE': u'int', u'VOL': u'real'
}
If you would like to visualize all the features in graphical format you can do the following:
import pylab as pl
df.as_data_frame().hist(figsize=(20,20))
pl.show()
In a Jupyter notebook, this renders a grid of histograms, one per numeric feature.
Note: If you have more than about 50 features, you may have to trim the data frame to fewer features to get an effective visualization.
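One way to trim a wide frame, sketched here, is to plot its columns in batches rather than all at once. The helper name chunk_columns and the batch size of 50 are illustrative assumptions, not part of the H2O API. The function only splits a list of column names, so it runs without a cluster.

```python
def chunk_columns(columns, batch_size=50):
    """Split a list of column names into consecutive batches (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]


# Example with 120 made-up column names: three batches of 50, 50 and 20.
example_columns = ["feature_%d" % i for i in range(120)]
batches = chunk_columns(example_columns)
print([len(batch) for batch in batches])

# With an H2O frame `df` (assumed to exist), each batch could then be plotted separately:
# for batch in chunk_columns(df.columns):
#     df[batch].as_data_frame().hist(figsize=(20, 20))
#     pl.show()
```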
You can also use the following function to convert a list of columns to factor/categorical by passing an H2O data frame and a list of column names:
def convert_columns_as_factor(hdf, column_list):
    # Validate the inputs before touching the frame.
    if len(column_list) == 0:
        return "Error: You don't have a list of binary columns."
    if len(hdf.columns) == 0:
        return "Error: You don't have any columns in your data frame."
    local_column_list = hdf.columns
    for column in column_list:
        if column in local_column_list:
            # asfactor() returns the column as categorical; assign it back in place.
            hdf[column] = hdf[column].asfactor()
            print('Column ' + str(column) + " is converted into factor/categorical.")
        else:
            print('Error: ' + str(column) + " not found in the data frame.")

The following R script performs the same tasks:

library(h2o)
h2o.init()
N=100
set.seed(999)
color = sample(c("D","E","I","F","M"),size=N,replace=TRUE)
num = rnorm(N,mean = 12,sd = 21212)
sex = sample(c("male","female"),size=N,replace=TRUE)
sex = as.factor(sex)
color = as.factor(color)
data = sample(c(0,1),size = N,replace = T)
fdata = factor(data)
table(fdata)
dd = data.frame(color,sex,num,fdata)
data = as.h2o(dd)
str(data)
data$sex = h2o.setLevels(x = data$sex ,levels = c("F","M"))
data
That's it, enjoy!!

Applying AND, OR, NOT conditions as filters on a data frame

Question:

How do I add conditions to data frame filters to express AND, OR, and NOT? For example, I have two flags:

  1. the myData flag, stored in the column myData_flag
  2. the myProx flag, stored in the column is_myProx_t_f

Conditions are defined as below:

  • AND: is it data_myDatamyProx = data[(data['myData_flag'].isin(['1']),:)&&( data['is_myProx_t_f'].isin(['1']),:)]?
  • OR: is it data_myDataOrmyProx = data[(data['myData_flag'].isin(['1']),:)||( data['is_myProx_t_f'].isin(['1']),:)]?
  • NOT: is it data_NonemyDatamyProx = data[(data['myData_flag'].isnotin(['1']),:)||( data['is_myProx_t_f'].isnotin(['1']),:)]?

Solution:

For the AND and OR operators, you can already accomplish this as shown below (using the iris dataset as an example):
df[(df['Sepal.Length'] < 5) & (df['Sepal.Width'] > 3) | (df['Species'].isin(['setosa'])), :]
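Above, the operators are & (AND) and | (OR), and negation is the tilde ~. Because & binds more tightly than |, the expression reads as (A & B) | C. Each comparison needs its own parentheses.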
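To see how these operators combine, here is a minimal pure-Python sketch. The Mask class is a hypothetical stand-in for an H2O boolean column, written only to show element-wise &, | and ~ and their precedence. It is not H2O code, and the flag values are made up.

```python
class Mask:
    """Toy element-wise boolean column (hypothetical stand-in, not an H2O class)."""

    def __init__(self, values):
        self.values = [bool(v) for v in values]

    def __and__(self, other):
        return Mask(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):
        return Mask(a or b for a, b in zip(self.values, other.values))

    def __invert__(self):
        return Mask(not a for a in self.values)


# Made-up flags for four rows.
my_data = Mask([1, 1, 0, 0])   # stands in for data['myData_flag'].isin(['1'])
my_prox = Mask([1, 0, 1, 0])   # stands in for data['is_myProx_t_f'].isin(['1'])

both = my_data & my_prox        # AND: both flags set
either = my_data | my_prox      # OR: at least one flag set
neither = ~(my_data | my_prox)  # NOT: neither flag set

print(both.values)     # [True, False, False, False]
print(either.values)   # [True, True, True, False]
print(neither.values)  # [False, False, False, True]

# & binds more tightly than |, so a & b | c means (a & b) | c.
third = Mask([0, 0, 0, 1])
assert (my_data & my_prox | third).values == ((my_data & my_prox) | third).values
```

In H2O, the combined mask goes in the row position of the frame index, as in the iris example above.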
That's it, enjoy!!

Creating a new column in a data frame from a calculation over the data

Sometimes you need to apply a function to either the full data frame or a specific column, and add a new column that holds the results. This is how you can do it:

import numpy as np
import h2o
h2o.init()

# Create a test frame
c_names = ['Prediction']
data1 = np.array([[0.12],
                  [0.43],
                  [0.90],
                  [0.002],
                  [0.52]])
df = h2o.H2OFrame().from_python(data1, destination_frame='df', column_names=c_names)

# Apply a calculation to a specific column and store the result as a new column in the same data frame:
df['new_prediction'] = df['Prediction']*1000
print(df)
That's it, enjoy!!

Maintaining column names after applying function on data frame

Sometimes when we apply a function to a data frame, the column names change. Here is an example:

import numpy as np
import h2o
h2o.init()

# Create a NumPy array and convert it to an H2O data frame
c_names = ['Num', 'Prediction']
data1 = np.array([[1, 0.12],
                  [2, 0.43],
                  [3, 0.90],
                  [4, 0.002],
                  [5, 0.52]])
df = h2o.H2OFrame().from_python(data1, destination_frame='df', column_names=c_names)
# Print the column names of the H2O data frame
print("df Columns: " + str(df.columns))
# Now apply the log1p function
df = df.log1p()
# The function above changes each column name X to log1p(X)
# With df.log() instead, the new column names would be log(X)
print("df Columns: " + str(df.columns))

As you can see above, the column names have changed, so you need to re-apply the original names to the data frame. To do that, store the column names first, apply the function, and then set the stored names back on the data frame, as below:

column_names = df.columns
df = df.log()
df.set_names(column_names)
print(df.columns)
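The store, apply, restore pattern above can be wrapped in a small helper. This is only a sketch: apply_keeping_names is a hypothetical name, and FakeFrame is a toy stand-in that mimics just the columns attribute and set_names method used above, so the example runs without an H2O cluster.

```python
def apply_keeping_names(frame, func):
    """Apply func to frame, then restore the original column names (hypothetical helper)."""
    original_names = list(frame.columns)   # store the names first
    result = func(frame)                   # may rename columns, e.g. X -> log(X)
    result.set_names(original_names)       # set the stored names back, in place
    return result


class FakeFrame:
    """Toy stand-in for an H2O frame; it only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def set_names(self, names):
        self.columns = list(names)

    def log(self):
        # Mimic the renaming described above: X becomes log(X).
        return FakeFrame("log(%s)" % name for name in self.columns)


frame = FakeFrame(["Num", "Prediction"])
print(frame.log().columns)                                    # ['log(Num)', 'log(Prediction)']
print(apply_keeping_names(frame, lambda f: f.log()).columns)  # ['Num', 'Prediction']

# With a real H2O frame `df` (assumed to exist), the call would be:
# df = apply_keeping_names(df, lambda f: f.log())
```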

That's it, enjoy!!

Creating, adding and managing H2O frames in Scala

Creating a new H2O Frame:

To create a new frame in H2O, call the constructor as below:

val df = new Frame()

Adding a frame to another H2O Frame:

To add an H2O frame to another H2O frame, do the following:
frame1.add(frame2)
When the add() method is called on an H2O frame, it mutates the calling frame. It doesn’t create a new Frame, and the Frame keeps the same Key. It’s the same object in memory.
What happens is that frame1 now depends on frame2. Frame “frame1” has the new columns, but they are actually the data from “frame2”. The operation looks as if the data has been duplicated, because there are two keys in the DKV, but there has been no memory copy at all. If you delete “frame2” you will run into an error, because “frame1” now depends on it.
In general, when managing memory in the H2O DKV, there is no automated way to delete old Frames during program execution; you need to manually call Frame.delete() on the Frames you no longer need.

Difference of using val vs var in Scala with new frame:

From the Scala point of view, val dataframeNew = new Frame() doesn’t stop you from changing the dataframeNew frame with dataframeNew.add; it does, however, stop you from reassigning dataframeNew to a different instance of Frame.
Note: If you had var dataframeNew = new Frame(), then dataframeNew could be reassigned to a completely different Frame. This difference comes from how Scala treats val and var in variable assignment: a val is a fixed reference, while a var can be reassigned.
That's it, enjoy!!