Performing complex group by on independent columns with H2O and R

Our objective is to perform a complex group by operation which run a function on multiple columns. Lets consider the sample data frame below:

X1 X2 X3
1    2    4
2    1    -2
3    3    4 
1    3    4
2    3    4
3    3    1

Now lets try to run a group by on X1, and calculate the sums of X2 and X3 then the results will be as below:

X1    X2-Sum  X3-Sum
1     5       8
2     4       2
3     6       5

The following code snippet show how to get it done:

> airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
> airlines.hex = h2o.importFile(path = airlinesURL, destination_frame = "airlines.hex")
> colnames(airlines.hex)
>cols = c("Distance", "ArrDelay", "DepDelay")
> originFlights = h2o.group_by(data = airlines.hex[c("FlightNum",cols)], by = "FlightNum", sum(cols),gb.control = list(na.methods = "ignore", col.names = NULL))

>>> Using lapply now
> res = h2o.cbind(lapply(cols, function(x){h2o.group_by(airlines.hex,by="FlightNum",sum(x))}))
> res
  FlightNum sum_Distance FlightNum0 sum_ArrDelay FlightNum1 sum_DepDelay
1         1         3815          1           23          1           39
2         2         2519          2           84          2           23
3         3         4069          3          NaN          3          NaN
4         4         1740          4           44          4           64
5         5         2535          5           62          5           59
6         6         1844          6          -46          6           -6
[2439 rows x 6 columns]

## Binding the results into one data frame:
> res = lapply(cols, function(x){h2o.group_by(airlines.hex,by="FlightNum",sum(x))})
> h2o.merge(h2o.merge(res[[1]], res[[2]]), res[[3]])
  FlightNum sum_DepDelay sum_ArrDelay sum_Distance
1         1           39           23         3815
2         2           23           84         2519
3         3          NaN          NaN         4069
4         4           64           44         1740
5         5           59           62         2535
6         6           -6          -46         1844
[2439 rows x 4 columns]

# Another way to calculate and bind results:
> res <- h2o.cbind(lapply(cols, function(x){h2o.group_by(airlines.hex,by="FlightNum",sum(x))}))
> df <- res[,c("FlightNum", "sum_DepDelay", "sum_ArrDelay", "sum_Distance")]

# Another way to calculate and bind results in one line:
> res <- h2o.cbind(lapply(cols, function(x){h2o.group_by(airlines.hex,by="FlightNum",sum(x))}))[,c(1,2,4,6)]
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s