Renaming data frame column names in H2O (python)

Sometimes you may need to change all of the column names, or just a specific one, and you can do so as shown below:

>>> df = h2o.import_file("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/iris/iris.csv")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df
 C1 C2 C3 C4 C5
---- ---- ---- ---- -----------
 5.1 3.5 1.4 0.2 Iris-setosa
 4.9 3 1.4 0.2 Iris-setosa
 4.7 3.2 1.3 0.2 Iris-setosa
 4.6 3.1 1.5 0.2 Iris-setosa
 5 3.6 1.4 0.2 Iris-setosa
 5.4 3.9 1.7 0.4 Iris-setosa
 4.6 3.4 1.4 0.3 Iris-setosa
 5 3.4 1.5 0.2 Iris-setosa
 4.4 2.9 1.4 0.2 Iris-setosa
 4.9 3.1 1.5 0.1 Iris-setosa

[150 rows x 5 columns]

>>> df.names
[u'C1', u'C2', u'C3', u'C4', u'C5']

>>> df.set_names(['A1','A2','A3','A4','A5'])
 A1 A2 A3 A4 A5
---- ---- ---- ---- ------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

If you want to change only a few column names, you still need to supply the original names at their same index positions and put the new names only where they apply. For example, in the data frame above we want to change only A5 to Levels, which we can do as below:

>>> df.set_names(['A1','A2','A3','A4','Levels'])
 A1 A2 A3 A4 Levels
---- ---- ---- ---- --------
 5.1 3.5 1.4 0.2 Iris_A
 4.9 3 1.4 0.2 Iris_A
 4.7 3.2 1.3 0.2 Iris_A
 4.6 3.1 1.5 0.2 Iris_A
 5 3.6 1.4 0.2 Iris_A
 5.4 3.9 1.7 0.4 Iris_A
 4.6 3.4 1.4 0.3 Iris_A
 5 3.4 1.5 0.2 Iris_A
 4.4 2.9 1.4 0.2 Iris_A
 4.9 3.1 1.5 0.1 Iris_A

[150 rows x 5 columns]

The set_names function must be given a value for every column in the list, either the original name or a changed one; otherwise it will raise an error.

For example, the following calls will not work and will throw an error because the number of names does not match the number of columns:

>>> df.set_names(['A1'])
>>> df.set_names(['A1','A2','A3','A4','A5','A6'])
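
If the frame has many columns, one way to avoid retyping every name is to copy df.names, change only the entry you need, and pass the full list back to set_names. A small sketch using the same frame as above (the variable new_names is just illustrative):

>>> new_names = list(df.names)     # copy the current names, e.g. [u'A1', u'A2', u'A3', u'A4', u'A5']
>>> new_names[4] = 'Levels'        # change only the fifth column
>>> df.set_names(new_names)        # set_names still receives all five names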

That's it, enjoy!


Unification of date and time data with joda in Spark

Here is a code snippet that first parses various kinds of date and time formats and then unifies them so they can be consumed by the downstream data munging process.

  import org.apache.spark.sql.functions._
  import org.joda.time._
  import org.joda.time.format._
  import org.apache.spark.sql.expressions.Window

    // UDF to extract the hour of day from a "MM/dd/yyyy hh:mm:ss aa" timestamp string
    val getHour = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getHourOfDay)
        }
    })

    // UDF to convert the timestamp string into epoch seconds
    val getDT = udf((dt:String) =>
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy hh:mm:ss aa")
          Some(fmt.parseDateTime(s).getMillis / 1000.0  )
        }
      })

    // UDF for day of week
    val getDayOfWeek = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          val fmt:DateTimeFormatter = DateTimeFormat.forPattern("MM/dd/yyyy")
          Some(fmt.parseDateTime(s.split(" ")(0)).getDayOfWeek)
        }
      }
    })

    // UDF to extract the date portion (the part before the space)
    val getDate = udf((dt:String) => {
      dt match {
        case null => None
        case s => {
          Some(s.split(" ")(0))
        }
      }
    })

    // UDF to tag whether a time difference is above a threshold
    val getDiffKey = udf((diff:Double) => {
      val threshold = 5 // threshold in seconds (roughly the 75th-percentile gap); tune as needed
      if (diff > threshold) {
        1  // tag as 2nd diff
      } else {
        0 // 1st diff
      }
    })

  val rawDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs://mr-0xc5.0xdata.loc:8020/user/file.csv")

    var df = rawDF.withColumn("hourOfDay", getHour(rawDF.col("datetime")))
      df = df.withColumn("timestamp", getDT(df.col("datetime")))
      df = df.withColumn("dayOfWeek", getDayOfWeek(df.col("datetime")))
      df = df.withColumn("date", getDate(df.col("datetime")))

Categorical Encoding, One Hot Encoding and why use it?

What is categorical encoding?

In data science, categorical values are encoded as enumerators so that algorithms can use them numerically when processing the data and learning relationships with the other features.

Name    Age  Zip Code  Salary
Jim     43   94404     45000
Jon     37   94407     80000
Merry   36   94404     65000
Tim     42   94403     75000
Hailey  29   94407     60000

In the example above, Zip Code is not really a numeric value; each number represents a certain area. Using Zip Code as a plain number will not create a meaningful relationship with other features such as Age or Salary, but if we encode it as a categorical then that relationship can be defined properly. So we treat the Zip Code feature as categorical (enum) when we feed it to a machine learning algorithm.

A string or character feature should be set to categorical (enum) as well, to generalize the relationships among features. If we add another feature named “Sex” to the dataset above, then treating “Sex” as categorical will likewise improve its relationship with the other features.

Name    Age  Zip Code  Sex  Salary
Jim     43   94404     M    45000
Jon     37   94407     M    80000
Merry   36   94404     F    65000
Tim     42   94403     M    75000
Hailey  29   94407     F    60000

After encoding the Zip Code and Sex features as enums, the two features look like this:

Name    Age  Zip Code  Sex  Salary
Jim     43   1         1    45000
Jon     37   2         1    80000
Merry   36   1         0    65000
Tim     42   3         1    75000
Hailey  29   2         0    60000

The Name feature does not help us relate Age, Zip Code and Sex in any way, so we can drop it and keep Age, Zip Code and Sex to learn Salary first and then predict Salary for new values. The input dataset then looks like this:

Age  Zip Code  Sex
43   1         1
37   2         1
36   1         0
42   3         1
29   2         0

As you can see above, all the data is now in numeric format and ready to be processed by an algorithm, which can learn the relationships in it and then predict.
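
As a minimal sketch of this enum-style encoding (assuming pandas, which is not otherwise used in this post; column names are illustrative, and the exact integer codes may differ from the hand-assigned values in the table above):

    import pandas as pd

    # Toy data matching the tables above
    df = pd.DataFrame({
        "Name":    ["Jim", "Jon", "Merry", "Tim", "Hailey"],
        "Age":     [43, 37, 36, 42, 29],
        "ZipCode": ["94404", "94407", "94404", "94403", "94407"],
        "Sex":     ["M", "M", "F", "M", "F"],
        "Salary":  [45000, 80000, 65000, 75000, 60000],
    })

    # Treat ZipCode and Sex as categoricals and replace them with integer codes
    df["ZipCode"] = pd.Categorical(df["ZipCode"]).codes
    df["Sex"] = pd.Categorical(df["Sex"]).codes

    # Drop Name since it does not help relate the other features to Salary
    X = df.drop(columns=["Name", "Salary"])
    y = df["Salary"]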

What is One Hot Encoding?

In the example above, the values Male and Female are just levels of the feature named “Sex”, so their exposure to the other features is not very rich or in depth. What if Male and Female were features in their own right, like Age or Zip Code? In that case the relationship between being Male or Female and the rest of the dataset would be much stronger. Using one hot encoding for a specific feature provides a proper representation of the distinct values of that feature, which helps improve learning.

One Hot Encoding does exactly that: it takes the distinct values of a feature and turns each of them into a feature of its own, improving its relationship with the overall data. If we apply One Hot Encoding to the “Sex” feature, the dataset looks like this:

Age  Zip Code  M  F  Salary
43   1         1  0  45000
37   2         1  0  80000
36   1         0  1  65000
42   3         1  0  75000
29   2         0  1  60000

If we decide to apply One Hot Encoding to Zip Code as well, our dataset looks like this:

Age  94404  94407  94403  M  F  Salary
43   1      0      0      1  0  45000
37   0      1      0      1  0  80000
36   1      0      0      0  1  65000
42   0      0      1      1  0  75000
29   0      1      0      0  1  60000

As you can see above, each value now has its own explicit representation and a direct relationship with the other values. One hot encoding is also called the one-of-K scheme.

One Hot Encoding can use either a dense or a sparse implementation when it creates the new features from the encoded values.
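
As a minimal sketch of the dense case (again assuming pandas; pd.get_dummies creates one 0/1 indicator column per distinct value):

    import pandas as pd

    df = pd.DataFrame({
        "Age":     [43, 37, 36, 42, 29],
        "ZipCode": ["94404", "94407", "94404", "94403", "94407"],
        "Sex":     ["M", "M", "F", "M", "F"],
        "Salary":  [45000, 80000, 65000, 75000, 60000],
    })

    # One Hot Encode the Sex and ZipCode features into indicator columns
    encoded = pd.get_dummies(df, columns=["Sex", "ZipCode"])
    print(encoded)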

Why Use it?

There are several good reasons to use One Hot Encoding on your data.

As shown above, One Hot Encoding introduces sparsity into the dataset, which is memory friendly and can improve training time, as long as the algorithm is designed to handle sparse data properly.
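
For example, scikit-learn's OneHotEncoder (linked under Other Resources below) returns a sparse matrix by default, storing only the non-zero entries. A rough sketch using the integer-coded Zip Code values from the earlier table:

    from sklearn.preprocessing import OneHotEncoder

    # Integer-coded Zip Code column from the earlier table
    zip_codes = [[1], [2], [1], [3], [2]]

    encoder = OneHotEncoder()                # sparse output by default
    sparse_matrix = encoder.fit_transform(zip_codes)

    print(sparse_matrix)                     # only the non-zero positions are stored
    print(sparse_matrix.toarray())           # dense view of the same encoding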

Other Resources:

Please visit the following link to see the One-Hot-Encoding implementation in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

For in-depth feature engineering, please see the following slides from HJ Van Veen:

Big Data 1B dollars Club – Top 20 Players

Here is a list of the top players in the Big Data world that have direct or indirect influence over a billion dollars (or more) of Big Data projects (not in any particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMware
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each of the above companies' direct or indirect involvement in Big Data, whether or not they offer a dedicated Big Data product. All of the above companies are involved in Big Data projects worth a billion dollars or more …

Learning Cloudera Impala – Book Availability

Learning Cloudera Impala:

 
Using Cloudera Impala is for those who really want to take advantage of their Hadoop cluster by processing extremely large amounts of raw data in Hadoop at real-time speed. Prior knowledge of Hadoop and some exposure to Hive and MapReduce is expected.
 

You will learn from this book:

  • Understand the various ways of installing Impala in your Hadoop cluster
  • Use the Impala shell API to interact with Impala components
  • Utilize Impala Query Language and built-in functions to play with data
  • Administer and fine-tune Impala for high availability
  • Identify and troubleshoot problems in a variety of ways
  • Get acquainted with various input data formats in Hadoop and how to use them with Impala
  • Comprehend how third party applications can connect with Impala to provide data visualization and various other enhancements


Spark Summit 2013- Mon Dec 2, 2013 Keynotes

The State of Spark, and Where We’re Going Next

Matei Zaharia (CTO, Databricks; Assistant Professor, MIT)


Community Contributions for Spark

  • YARN support (Yahoo!)
  • Columnar compression in Shark (Yahoo!)
  • Fair scheduling (Intel)
  • Metrics reporting (Intel, Quantifind)
  • New RDD operators (Bizo, ClearStory)
  • Scala 2.10 support (Imaginea)
Downloads: pptx slides | pdf slides

Turning Data into Value
Ion Stoica (CEO, Databricks; CTO, Conviva; Co-Director, UC Berkeley AMPLab)


  • Everyone collects but few extract value from data
  • Unification of comp. and prog. models key to
    • » Efficiently analyze data
    • » Make sophisticated, real-time decisions
  • Spark is unique in unifying
    • » batch, interactive, streaming computation models
    • » data-parallel and graph-parallel prog. models
Downloads: pptx slides | pdf slides

Big Data Research in the AMPLab
Mike Franklin (Director, UC Berkeley AMPLab)


  • GraphX: Unifying Graph Parallel & Data Parallel Analytics
  • OLTP and Serving Workloads
  • MDCC: Multi Data Center Consistency
  • HAT: Highly-Available Transactions
  • PBS: Probabilistically Bounded Staleness
  • PLANET: Predictive Latency-Aware Networked Transactions
  • Fast Matrix Manipulation Libraries
  • Cold Storage, Partitioning, Distributed Caching
  • Machine Learning Pipelines, GPUs,
Downloads: pptx slides | pdf slides

Hadoop and Spark Join Forces in Yahoo
Andy Feng (Distinguished Architect, Cloud Services, Yahoo)


YAHOO AT SCALE:

  • 150 PB of data on Yahoo Hadoop clusters
    • Yahoo data scientists need the data for
      • Model building
      • BI analytics
    • Such datasets should be accessed efficiently
      • avoid latency caused by data movement
  • 35,000 servers in Hadoop cluster
    • Science projects need to leverage all these servers for computation

SOLUTION: HADOOP + SPARK

  • science … Spark API & MLlib ease development of ML algorithms
  • speed … Spark reduces latency of model training via in-memory RDD etc
  • scale … YARN brings Hadoop datasets & servers at scientists’ fingertips
Downloads: pdf slides (large file)

Integration of Spark/Shark into the Yahoo! Data and Analytics Platform

Tim Tully (Distinguished Engineer/Architect, Yahoo)

  • Legacy / Current Hadoop Architecture
  • Reflection / Pain Points
  • Why the movement towards Spark / Shark
  • New Hybrid Environment
  • Future Spark/Shark/Hadoop Stack


 

Downloads: pptx slides | pdf slides

 

Spark in the Hadoop Ecosystem
Eric Baldeschwieler (@jeric14)


Data scientists & Developers need an open standard for sharing their Algorithms & functions, an “R” for big data. 

  • Spark is the best current candidate:
    • Open Source – Apache Foundation
    • Expressive (MR, iteration, Graphs, SQL, streaming)
    • Easily extended & embedded (DSLs, Java, Python…)

Spark “on the radar”:
  • 2008 – Yahoo! Hadoop team collaboration with the Berkeley AMP/RAD Lab begins
  • 2009 – Spark example built for Nexus -> Mesos
  • 2011 – “Spark is 2 years ahead of anything at Google” – Conviva seeing good results with Spark
  • 2012 – Yahoo! working with Spark / Shark
  • Today – Many success stories – early commercial support

Downloads: ppt slides | pdf slides

Keywords: Apache Spark, Hadoop, YARN, Big Data, Mesos, Databricks, Conviva