Spark Summit 2013 - Mon Dec 2, 2013 Keynotes

The State of Spark, and Where We’re Going Next

Matei Zaharia (CTO, Databricks; Assistant Professor, MIT)

Community Contributions for Spark

  • YARN support (Yahoo!)
  • Columnar compression in Shark (Yahoo!)
  • Fair scheduling (Intel)
  • Metrics reporting (Intel, Quantifind)
  • New RDD operators (Bizo, ClearStory)
  • Scala 2.10 support (Imaginea)
Downloads: pptx slides, pdf slides

Turning Data into Value
Ion Stoica (CEO, Databricks; CTO, Conviva; Co-Director, UC Berkeley AMPLab)

  • Everyone collects data, but few extract value from it
  • Unifying computation and programming models is key to
    • Efficiently analyzing data
    • Making sophisticated, real-time decisions
  • Spark is unique in unifying
    • batch, interactive, and streaming computation models
    • data-parallel and graph-parallel programming models
Downloads: pptx slides, pdf slides

Big Data Research in the AMPLab
Mike Franklin (Director, UC Berkeley AMPLab)

  • GraphX: Unifying Graph Parallel & Data Parallel Analytics
  • OLTP and Serving Workloads
    • MDCC: Multi-Data Center Consistency
    • HAT: Highly-Available Transactions
    • PBS: Probabilistically Bounded Staleness
    • PLANET: Predictive Latency-Aware Networked Transactions
  • Fast Matrix Manipulation Libraries
  • Cold Storage, Partitioning, Distributed Caching
  • Machine Learning Pipelines, GPUs
Downloads: pptx slides, pdf slides

Hadoop and Spark Join Forces in Yahoo
Andy Feng (Distinguished Architect, Cloud Services, Yahoo)

YAHOO AT SCALE:

  • 150 PB of data on Yahoo Hadoop clusters
    • Yahoo data scientists need the data for
      • Model building
      • BI analytics
    • Such datasets should be accessed efficiently
      • avoid latency caused by data movement
  • 35,000 servers in Hadoop cluster
    • Science projects need to leverage all these servers for computation

SOLUTION: HADOOP + SPARK

  • science … Spark API & MLlib ease development of ML algorithms
  • speed … Spark reduces latency of model training via in-memory RDDs, etc.
  • scale … YARN brings Hadoop datasets & servers to scientists’ fingertips
Downloads: pdf slides (large file)
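
As a rough illustration of the "speed" point in this talk, here is a minimal sketch in plain Python. It is not Spark code and does not come from the slides: CountingSource, train, and the toy data are hypothetical names introduced only for this example. It shows why an iterative training loop benefits from keeping the dataset in memory instead of re-reading it on every pass.

```python
import random


class CountingSource:
    """Hypothetical stand-in for a slow on-disk dataset; counts full scans."""

    def __init__(self, n=200, seed=0):
        rng = random.Random(seed)
        # Toy data: points close to the line y = 3x, with small noise.
        self._rows = [(x, 3.0 * x + rng.uniform(-0.01, 0.01))
                      for x in (rng.uniform(-1, 1) for _ in range(n))]
        self.scans = 0

    def read(self):
        """Simulate one full scan of the underlying storage."""
        self.scans += 1
        return list(self._rows)


def train(get_rows, iterations=50, lr=0.5):
    """Fit y = w * x by gradient descent, requesting the rows on every pass."""
    w = 0.0
    for _ in range(iterations):
        rows = get_rows()
        grad = sum(2 * (w * x - y) * x for x, y in rows) / len(rows)
        w -= lr * grad
    return w


# Without caching: every iteration goes back to storage.
uncached = CountingSource()
w_uncached = train(uncached.read)

# With caching: scan once, then iterate over the in-memory copy.
cached = CountingSource()
rows_in_memory = cached.read()
w_cached = train(lambda: rows_in_memory)

print("scans without cache:", uncached.scans)
print("scans with cache:", cached.scans)
```

Both runs fit the same model; only the number of storage scans differs. In Spark, the analogous step is persisting an RDD with cache() before running an iterative job.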

Integration of Spark/Shark into the Yahoo! Data and Analytics Platform

Tim Tully (Distinguished Engineer/Architect, Yahoo)

  • Legacy / Current Hadoop Architecture
  • Reflection / Pain Points
  • Why the movement towards Spark / Shark
  • New Hybrid Environment
  • Future Spark/Shark/Hadoop Stack

Downloads: pptx slides, pdf slides

Spark in the Hadoop Ecosystem
Eric Baldeschwieler (@jeric14)

Data scientists & developers need an open standard for sharing their algorithms & functions: an “R” for big data.

  • Spark is the best current candidate:
    • Open Source – Apache Foundation
    • Expressive (MR, iteration, graphs, SQL, streaming)
    • Easily extended & embedded (DSLs, Java, Python…)

Spark “on the radar”
  • 2008 – Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
  • 2009 – Spark example built for Nexus -> Mesos
  • 2011 – “Spark is 2 years ahead of anything at Google”; Conviva seeing good results with Spark
  • 2012 – Yahoo! working with Spark / Shark
  • Today – Many success stories; early commercial support

Downloads: ppt slides, pdf slides

Keywords: Apache Spark, Hadoop, YARN, Big Data, Mesos, Databricks, Conviva
