Exploring & transforming H2O Data Frame in R and Python

Sometime you may need to ingest a dataset for building models and then your first task is to explore all the features and their type you have. Once that is done you may want to change the feature types to the one you want.

Here is the code snippet in Python:

df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
df.types
{    u'AGE': u'int', u'CAPSULE': u'int', u'DCAPS': u'int', 
     u'DPROS': u'int', u'GLEASON': u'int', u'ID': u'int',
     u'PSA': u'real', u'RACE': u'int', u'VOL': u'real'
}
If you would like to visualize all the features in graphical format you can do the following:
import pylab as pl
df.as_data_frame().hist(figsize=(20,20))
pl.show()
The result looks like as below on jupyter notebook:
Screen Shot 2017-10-05 at 5.20.03 PM
Note: If you have features above 50, you might have to trim your data frame to less features so you can have effective visualization.
Next you may need to You can also use the following function to convert a list of columns as factor/categorical by passing H2O dataframe and a list of columns:
def convert_columns_as_factor(hdf, column_list):
    list_count = len(column_list)
    if list_count is 0:
        return "Error: You don't have a list of binary columns."
    if (len(pdf.columns)) is 0:
        return "Error: You don't have any columns in your data frame."
    local_column_list = pdf.columns
    for i in range(list_count):
        try:
            target_index = local_column_list.index(column_list[i])
            pdf[column_list[i]] = pdf[column_list[i]].asfactor()
            print('Column ' + column_list[i] + " is converted into factor/catagorical.")
        except ValueError:
            print('Error: ' + str(column_list[i]) + " not found in the data frame.")

The following script is in R to perform the same above tasks:

N=100
set.seed(999)
color = sample(c("D","E","I","F","M"),size=N,replace=TRUE)
num = rnorm(N,mean = 12,sd = 21212)
sex = sample(c("male","female"),size=N,replace=TRUE)
sex = as.factor(sex)
color = as.factor(color)
data = sample(c(0,1),size = N,replace = T)
fdata = factor(data)
table(fdata)
dd = data.frame(color,sex,num,fdata)
data = as.h2o(dd)
str(data)
data$sex = h2o.setLevels(x = data$sex ,levels = c("F","M"))
data
Thats it, enjoy!!
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s