Learning Cloudera Impala – Book Availability

Learning Cloudera Impala:

This book is for those who want to take full advantage of their Hadoop cluster by processing extremely large amounts of raw data in Hadoop at real-time speed. Prior knowledge of Hadoop and some exposure to Hive and MapReduce are expected.


What you will learn from this book:

  • Understand the various ways of installing Impala in your Hadoop cluster
  • Use the Impala shell API to interact with Impala components
  • Utilize Impala Query Language and built-in functions to play with data
  • Administer and fine-tune Impala for high availability
  • Identify and troubleshoot problems in a variety of ways
  • Get acquainted with various input data formats in Hadoop and how to use them with Impala
  • Understand how third-party applications can connect with Impala to provide data visualization and various other enhancements



Spark Summit 2013 – Mon Dec 2, 2013 Keynotes

The State of Spark, and Where We’re Going Next

Matei Zaharia (CTO, Databricks; Assistant Professor, MIT)


Community Contributions for Spark

  • YARN support (Yahoo!)
  • Columnar compression in Shark (Yahoo!)
  • Fair scheduling (Intel)
  • Metrics reporting (Intel, Quantifind)
  • New RDD operators (Bizo, ClearStory)
  • Scala 2.10 support (Imaginea)
Downloads: pptx slides, pdf slides

Turning Data into Value
Ion Stoica (CEO, Databricks; CTO, Conviva; Co-Director, UC Berkeley AMPLab)


  • Everyone collects data, but few extract value from it
  • Unification of computation and programming models is key to:
    • Efficiently analyzing data
    • Making sophisticated, real-time decisions
  • Spark is unique in unifying:
    • batch, interactive, and streaming computation models
    • data-parallel and graph-parallel programming models
Downloads: pptx slides, pdf slides

Big Data Research in the AMPLab
Mike Franklin (Director, UC Berkeley AMPLab)


  • GraphX: Unifying Graph Parallel & Data Parallel Analytics
  • OLTP and Serving Workloads
  • MDCC: Multi-Data Center Consistency
  • HAT: Highly-Available Transactions
  • PBS: Probabilistically Bounded Staleness
  • PLANET: Predictive Latency-Aware Networked Transactions
  • Fast Matrix Manipulation Libraries
  • Cold Storage, Partitioning, Distributed Caching
  • Machine Learning Pipelines, GPUs
Downloads: pptx slides, pdf slides

Hadoop and Spark Join Forces in Yahoo
Andy Feng (Distinguished Architect, Cloud Services, Yahoo)



  • 150 PB of data on Yahoo Hadoop clusters
    • Yahoo data scientists need the data for
      • Model building
      • BI analytics
    • Such datasets should be accessed efficiently
      • avoid latency caused by data movement
  • 35,000 servers in Hadoop cluster
    • Science projects need to leverage all these servers for computation


  • science … Spark API & MLlib ease development of ML algorithms
  • speed … Spark reduces latency of model training via in-memory RDDs, etc.
  • scale … YARN brings Hadoop datasets & servers to scientists’ fingertips
Downloads: pdf slides (large file)

Integration of Spark/Shark into the Yahoo! Data and Analytics Platform

Tim Tully (Distinguished Engineer/Architect, Yahoo)

  • Legacy / Current Hadoop Architecture
  • Reflection / Pain Points
  • Why the movement towards Spark / Shark
  • New Hybrid Environment
  • Future Spark/Shark/Hadoop Stack



Downloads: pptx slides, pdf slides


Spark in the Hadoop Ecosystem
Eric Baldeschwieler (@jeric14)


Data scientists & developers need an open standard for sharing their algorithms & functions: an “R” for big data.

  • Spark is the best current candidate:
    • Open Source – Apache Foundation
    • Expressive (MR, iteration, graphs, SQL, streaming)
    • Easily extended & embedded (DSLs, Java, Python…)

Spark “on the radar”

  • 2008 – Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
  • 2009 – Spark example built for Nexus -> Mesos
  • 2011 – “Spark is 2 years ahead of anything at Google” – Conviva seeing good results with Spark
  • 2012 – Yahoo! working with Spark / Shark
  • Today – Many success stories – Early commercial support

Downloads: ppt slides, pdf slides

Keywords: Apache Spark, Hadoop, YARN, Big Data, Mesos, Databricks, Conviva