Saving H2O models from R/Python API in Hadoop Environment

When you use H2O in a clustered environment such as Hadoop, h2o.saveModel() writes the model on a machine in the H2O cluster, which can be a different machine from the one where your client runs. That is why you see the error “No such file or directory”. If you pass a plain path such as /tmp and then log in to the machine hosting the H2O node that your R session is connected to, you will find the model stored there.
Here is an example that makes this clearer:
Step [1]: Start the H2O Hadoop driver in an EC2 environment:
[ec2-user@ip-10-0-104-179 ~]$ hadoop jar h2o-3.10.4.8-hdp2.6/h2odriver.jar -nodes 2 -mapperXmx 2g -output /usr/ec2-user/005
....
....
....
Open H2O Flow in your web browser: http://10.0.65.248:54323  <=== H2O is started.
Note: The hadoop command above was run on the machine with IP address 10.0.104.179, but the H2O server is running on a different node, 10.0.65.248.
Step [2]: Connect the R client to H2O:
> h2o.init(ip = "10.0.65.248", port = 54323, strict_version_check = FALSE)
Note: I used the IP address shown above to connect to the existing H2O cluster. The machine where I am running the R client is a different one; its IP address is 34.208.200.16.
Step [3]: Save the H2O model:
h2o.saveModel(my.glm, path = "/tmp", force = TRUE)
So when I save the model, it is written to the 10.0.65.248 machine, even though the R client is running on 34.208.200.16:
[ec2-user@ip-10-0-65-248 ~]$ ll /tmp/GLM*
-rw-r--r-- 1 yarn hadoop 90391 Jun 2 20:02 /tmp/GLM_model_R_1496447892009_1
So you need to make sure you have access to a folder on the machine where the H2O service is running, or you can save the model to HDFS, as in this example:
h2o.saveModel(my.glm, path = "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", force = TRUE)
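
For readers using the Python API, the same behaviour can be sketched as below. This is a minimal illustration, not a definitive recipe: `describe_save_target` and `save_model_from_client` are hypothetical helpers written for this post, and the only h2o call used is `h2o.save_model`, the Python counterpart of h2o.saveModel(). The sketch assumes the client is already connected to the cluster.

```python
from urllib.parse import urlparse


def describe_save_target(path, h2o_node_ip):
    """Say where a path passed to the save call will actually be written.

    Hypothetical helper for illustration; it is not part of the h2o package.
    """
    if urlparse(path).scheme == "hdfs":
        return "HDFS: " + path
    # A plain path is resolved on the H2O node, not on the client machine.
    return "local disk of H2O node " + h2o_node_ip + ": " + path


def save_model_from_client(model, path, h2o_node_ip):
    """Save `model` and report where the file ends up.

    Assumes the client is already connected to the H2O cluster.
    """
    import h2o  # imported lazily so the helper above runs without h2o installed

    print(describe_save_target(path, h2o_node_ip))
    return h2o.save_model(model=model, path=path, force=True)


# The two targets used in this post.
print(describe_save_target("/tmp", "10.0.65.248"))
print(describe_save_target(
    "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan",
    "10.0.65.248",
))
```

The point of the helper is only to make the rule visible: a plain path ends up on the H2O node, while an hdfs:// path ends up in HDFS regardless of where the client runs.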

That's it, enjoy!

Setting various log levels for H2O

This post shows how to set log levels in different H2O deployment scenarios.

Standalone H2O mode (H2O on VMs, laptops…)

You can specify options -log_level and/or -log_dir:

-log_level <TRACE,DEBUG,INFO,WARN,ERRR,FATAL>
Write messages at this logging level, or above. Default is INFO.
-log_dir <fileSystemPath>
The directory where H2O writes logs to disk. (This usually has a good default that you need not change.)

$ java -jar h2o.jar -log_level DEBUG

H2O on Hadoop

The log level option is not directly exposed. You can still set the log level by passing it as extra arguments with the -J option of the H2O Hadoop driver: “-J -log_level -J DEBUG”. Here is an example:

$ hadoop jar h2odriver.jar -J -log_level -J DEBUG -nodes 1 -mapperXmx 1g -output t/$RANDOM

Sparkling Water

Log levels can be adjusted using two Spark conf properties: spark.ext.h2o.node.log.level and spark.ext.h2o.client.log.level. These are separate options, one for the H2O compute nodes and one for the H2O client running in Sparkling Water’s driver program.

$ bin/sparkling-shell --conf spark.ext.h2o.node.log.level=DEBUG --conf spark.ext.h2o.client.log.level=DEBUG
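
The three variants above differ only in how the level is passed, which a small Python sketch can make explicit. The function `log_level_args` and its mode labels are made up for this illustration. The sketch also assumes that the same level names are accepted in all three modes, which is not confirmed above for Sparkling Water.

```python
# Levels as listed for the -log_level option above.
LOG_LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERRR", "FATAL")


def log_level_args(mode, level):
    """Return the extra command-line arguments that set the log level.

    `mode` is one of "standalone", "hadoop" or "sparkling-water"
    (labels chosen for this sketch only).
    """
    if level not in LOG_LEVELS:
        raise ValueError("unknown log level: " + level)
    if mode == "standalone":
        return ["-log_level", level]
    if mode == "hadoop":
        # Each token is passed through the driver with its own -J flag.
        return ["-J", "-log_level", "-J", level]
    if mode == "sparkling-water":
        return [
            "--conf", "spark.ext.h2o.node.log.level=" + level,
            "--conf", "spark.ext.h2o.client.log.level=" + level,
        ]
    raise ValueError("unknown mode: " + mode)


# Rebuild the three example commands from this post.
print(" ".join(["java", "-jar", "h2o.jar"] + log_level_args("standalone", "DEBUG")))
print(" ".join(["hadoop", "jar", "h2odriver.jar"] + log_level_args("hadoop", "DEBUG")))
print(" ".join(["bin/sparkling-shell"] + log_level_args("sparkling-water", "DEBUG")))
```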

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

What is Kylin?

  • Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets.

Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration capability with BI Tools like Tableau.
  • Other Highlights:
    • Job Management and Monitoring
    • Compression and Encoding Support
    • Incremental Refresh of Cubes
    • Leverages HBase Coprocessor to reduce query latency
    • Approximate Query Capability for Distinct Count (HyperLogLog)
    • Easy Web interface to manage, build, monitor and query cubes
    • Security capability to set ACL at Cube/Project Level
    • Support LDAP Integration
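
To make the MOLAP cube idea from the list above more concrete, here is a toy sketch in plain Python. It pre-aggregates a measure for every combination of dimensions so that a query becomes a lookup. This is not Kylin's implementation or API. The rows, column names and functions are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Made-up fact rows: (country, device, sales)
ROWS = [
    ("US", "mobile", 10),
    ("US", "desktop", 5),
    ("DE", "mobile", 7),
    ("US", "mobile", 3),
]
DIMENSIONS = ("country", "device")


def build_cube(rows, dimensions):
    """Pre-aggregate the measure for every subset of dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]  # the last column is the measure
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


def query(cube, **filters):
    """Answer SUM(sales) for equality filters by a lookup in the matching cuboid."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(ROWS, DIMENSIONS)
print(query(cube))                                  # total sales: 25
print(query(cube, country="US"))                    # sales for one country: 18
print(query(cube, country="US", device="mobile"))   # sales for one cell: 13
```

The trade-off shown here is the usual one for pre-built cubes: more work and storage at build time in exchange for cheap lookups at query time.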

Keywords: Kylin, Big Data, Hadoop, Jobs, OLAP, SQL, Query

Big Data 1B dollars Club – Top 20 Players

Here is a list of the top players in the Big Data world that influence, directly or indirectly, Big Data projects worth a billion dollars or more (in no particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMware
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each company's involvement in Big Data, direct or indirect, whether or not it has a dedicated Big Data product. All of the above companies are involved in Big Data projects worth a billion dollars or more …