Databricks CSV parser with Sparkling Water in Scala

You need to download Sparkling Water 2.0.2 along with Spark 2.0.2 for this example to work:

bin/sparkling-shell --packages com.databricks:spark-csv_2.11:1.5.0

-----
 Spark master (MASTER) : local[*]
 Spark home (SPARK_HOME) : /Users/avkashchauhan/tools/spark-2.0.2-bin-hadoop2.6
 H2O build version : 3.10.0.10 (turing)
 Spark build version : 2.0.1
 Scala version : 2.11
 ----
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
 Ivy Default Cache set to: /Users/avkashchauhan/.ivy2/cache
 The jars for the packages stored in: /Users/avkashchauhan/.ivy2/jars
 :: loading settings :: url = jar:file:/Volumes/OSxexT/tools/spark-2.0.2-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 com.databricks#spark-csv_2.11 added as a dependency
 :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
 confs: [default]
 found com.databricks#spark-csv_2.11;1.5.0 in central
 found org.apache.commons#commons-csv;1.1 in local-m2-cache
 found com.univocity#univocity-parsers;1.5.1 in central
 :: resolution report :: resolve 189ms :: artifacts dl 3ms
 :: modules in use:
 com.databricks#spark-csv_2.11;1.5.0 from central in [default]
 com.univocity#univocity-parsers;1.5.1 from central in [default]
 org.apache.commons#commons-csv;1.1 from local-m2-cache in [default]
 ---------------------------------------------------------------------
 | | modules || artifacts |
 | conf | number| search|dwnlded|evicted|| number|dwnlded|
 ---------------------------------------------------------------------
 | default | 3 | 0 | 0 | 0 || 3 | 0 |
 ---------------------------------------------------------------------
 :: retrieving :: org.apache.spark#spark-submit-parent
 confs: [default]
 0 artifacts copied, 3 already retrieved (0kB/5ms)
 16/12/22 15:10:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 16/12/22 15:10:34 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
 Spark context Web UI available at http://172.16.2.122:4040
 Spark context available as 'sc' (master = local[*], app id = local-1482448233926).
 Spark session available as 'spark'.
 Welcome to
 ____ __
 / __/__ ___ _____/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /___/ .__/\_,_/_/ /_/\_\ version 2.0.2
 /_/
 Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
 Type in expressions to have them evaluated.
 Type :help for more information.

Importing the key packages/libraries:

scala> import com.databricks.spark
 final package spark
 scala> import com.databricks.spark.csv
 final package csv

If you want to see all the configuration, use the Spark session:

scala> spark.conf.getAll
 res1: Map[String,String] = Map(spark.driver.host -> 172.16.2.122, spark.driver.port -> 60937, hive.metastore.warehouse.dir -> file:/Volumes/OSxexT/tools/sparkling-water-2.0.1/spark-warehouse, spark.repl.class.uri -> spark://172.16.2.122:60937/classes, spark.jars -> file:/Users/avkashchauhan/tools/sparkling-water-2.0.1/assembly/build/libs/sparkling-water-assembly_2.11-2.0.1-all.jar,file:/Users/avkashchauhan/.ivy2/jars/com.databricks_spark-csv_2.11-1.5.0.jar,file:/Users/avkashchauhan/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/Users/avkashchauhan/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar, spark.repl.class.outputDir -> /private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/avkashchauhan/spark/work/spark-eeab779b-1114-4a38-9cb4-098b2d551115/repl-8246fd61-287d-4718-9...
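If you only need a specific setting, you can read it individually instead of dumping the whole map. A minimal sketch, assuming the same spark-shell session (`spark.master` is a standard Spark property, and the application id is the one printed at startup):

```scala
// Read individual settings from the running session
// (assumes the `spark` session provided by sparkling-shell).
val master = spark.conf.get("spark.master")   // the master URL, e.g. local[*]
val appId  = spark.sparkContext.applicationId // the app id shown at startup
println(s"master=$master appId=$appId")
```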

Note: If you have Spark 1.6.x, you can use the Spark context as below:

> sc.jars
res2: Seq[String] = List(file:/Users/avkashchauhan/tools/sparkling-water-2.0.1/assembly/build/libs/sparkling-water-assembly_2.11-2.0.1-all.jar, file:/Users/avkashchauhan/.ivy2/jars/com.databricks_spark-csv_2.11-1.5.0.jar, file:/Users/avkashchauhan/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar, file:/Users/avkashchauhan/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar)

Now you can load a CSV file from its full path as below:

val a = spark.sqlContext.read.format("com.databricks.spark.csv").load("/Users/avkashchauhan/tools/sparkling-water-2.0.1/cars.csv")
 a: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]
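Because no options were passed, the header row was read as a data row and the columns came back as generic `_c0` … `_c4` strings. A sketch of the same load using the spark-csv `header` and `inferSchema` options through the non-deprecated `DataFrameReader` API (assuming the same session and file path):

```scala
// Re-read the file, letting spark-csv use the first row as column names
// and infer column types (year becomes an Int instead of a String).
val cars = spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/avkashchauhan/tools/sparkling-water-2.0.1/cars.csv")

cars.printSchema()
```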

Display a few rows:

> a.show(10)
 +----+-----+-----+--------------------+-----+
 | _c0| _c1| _c2| _c3| _c4|
 +----+-----+-----+--------------------+-----+
 |year| make|model| comment|blank|
 |2012|Tesla| S| No comment| null|
 |1997| Ford| E350|Go get one now th...| null|
 |2015|Chevy| Volt| null| null|
 +----+-----+-----+--------------------+-----+

> a.columns.length

res8: Int = 5
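If you want to keep the headerless load but still get meaningful column names, you can rename the generic columns and drop the header row yourself. A sketch under the same assumptions, with the names taken from the header row shown in the output above:

```scala
// Rename _c0.._c4 using the header row's values, then filter that row out.
// Assumes `a` is the DataFrame loaded without options above.
val named = a.toDF("year", "make", "model", "comment", "blank")
val data  = named.filter(named("year") =!= "year") // drop the header row
data.show()
```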

There is another way to read the table:

> val df = spark.sqlContext.load(
| "com.databricks.spark.csv",
| Map("path" -> "/Users/avkashchauhan/tools/sparkling-water-2.0.1/cars.csv", "header" -> "true", "inferSchema" -> "true"))
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]
> df.show(10)
> df.head.toString()
res13: String = [2012,Tesla,S,No comment,null]
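With `header` and `inferSchema` applied, `year` is an Int, so you can query the typed columns directly. A small sketch assuming the `df` created above:

```scala
// Typed queries against the DataFrame loaded with inferSchema.
df.select("year", "make").show()
df.filter(df("year") > 2000).show() // Tesla (2012) and the Chevy Volt (2015)
```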
