Matei Zaharia (CTO, Databricks; Assistant Professor, MIT)

Community Contributions for Spark
- YARN support (Yahoo!)
- Columnar compression in Shark (Yahoo!)
- Fair scheduling (Intel)
- Metrics reporting (Intel, Quantifind)
- New RDD operators (Bizo, ClearStory)
- Scala 2.10 support (Imaginea)
Downloads: pptx slides, pdf slides
Turning Data into Value
Ion Stoica (CEO, Databricks; CTO, Conviva; Co-Director, UC Berkeley AMPLab)

- Everyone collects but few extract value from data
- Unifying computation and programming models is key to
  - Efficiently analyzing data
  - Making sophisticated, real-time decisions
- Spark is unique in unifying
  - batch, interactive, and streaming computation models
  - data-parallel and graph-parallel programming models
Downloads: pptx slides, pdf slides
Big Data Research in the AMPLab
Mike Franklin (Director, UC Berkeley AMPLab)

- GraphX: Unifying Graph Parallel & Data Parallel Analytics
- OLTP and Serving Workloads
  - MDCC: Multi-Data Center Consistency
  - HAT: Highly-Available Transactions
  - PBS: Probabilistically Bounded Staleness
  - PLANET: Predictive Latency-Aware Networked Transactions
- Fast Matrix Manipulation Libraries
- Cold Storage, Partitioning, Distributed Caching
- Machine Learning Pipelines, GPUs
Downloads: pptx slides, pdf slides

YAHOO AT SCALE:
- 150 PB of data on Yahoo Hadoop clusters
- Yahoo data scientists need the data for
  - Model building
  - BI analytics
- Such datasets should be accessed efficiently
  - avoid latency caused by data movement
- 35,000 servers in Hadoop cluster
- Science projects need to leverage all these servers for computation
SOLUTION: HADOOP + SPARK
- science … Spark API & MLlib ease development of ML algorithms
- speed … Spark reduces latency of model training via in-memory RDDs, etc.
- scale … YARN brings Hadoop datasets & servers to scientists’ fingertips
Downloads: pdf slides (large file)
Tim Tully (Distinguished Engineer/Architect, Yahoo)
- Legacy / Current Hadoop Architecture
- Reflection / Pain Points
- Why the movement towards Spark / Shark
- New Hybrid Environment
- Future Spark/Shark/Hadoop Stack

Downloads: pptx slides, pdf slides
Spark in the Hadoop Ecosystem
Eric Baldeschwieler (@jeric14)

Data scientists & developers need an open standard for sharing their algorithms & functions, an “R” for big data.
• Spark is the best current candidate:
  • Open Source – Apache Foundation
  • Expressive (MR, iteration, graphs, SQL, streaming)
  • Easily extended & embedded (DSLs, Java, Python…)
Spark “on the radar”
• 2008 – Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
• 2009 – Spark example built for Nexus -> Mesos
• 2011 – “Spark is 2 years ahead of anything at Google”; Conviva seeing good results with Spark
• 2012 – Yahoo! working with Spark / Shark
• Today – Many success stories; early commercial support
Downloads: ppt slides, pdf slides
Keywords: Apache Spark, Hadoop, YARN, Big Data, Mesos, Databricks, Conviva