Writing Datasets to Disk | Datasets In Apache Spark III

In the last tutorial we saw how to create parametrized datasets. Once you have created datasets and performed some operations on them, you will want to save the results back to storage. That is what this article covers: saving Datasets to storage.


The first thing we'll do, as always, is create the Spark session variable.

import org.apache.spark.sql.SparkSession;

// Initialize SparkSession
SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();

Using that session variable, we read the fake-people.csv file which has data like this:

id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,nmckirdy0@slideshare.net,Female,148.3.248.193
2,Nickey,Curreen,ncurreen1@tripadvisor.com,Male,206.9.48.216
3,Allayne,Chatainier,achatainier2@trellian.com,Male,191.118.4.217
...

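We read this file into a dataset as follows:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Read the CSV file, treating the first line as column names
Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");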

Once we have the dataset, let's assume you've performed some operations on it: some column selections, some filtering, some sorting and so on. After all those operations we have a new dataset.

// After performing several awesome operations
Dataset<Row> newDs = ....

We want to store this dataset back on disk. We can do that with write(), which is called on the dataset itself, much like read() is called on the Spark session variable.

newDs.write().csv("processed-data");

The processed-data in the above command is not the name of the output CSV file but of the output directory. When you write a Dataset, Spark creates a directory with that name and stores the data inside it in the format you asked for, CSV in this case, along with some checksum files and a status flag.

These are the files that get created in the processed-data folder.

$ ls ../../apache-spark/processed-data
_SUCCESS  part-00000-311049cf-3e48-4286-b93c-7d2096a18678-c000.csv

There are two more hidden CRC files that I'm not showing here. The file whose name starts with part-00000 is the actual data file, which holds the data from the new dataset.
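Since the data ends up in a part file with a generated name rather than in a file you named yourself, code that consumes the output usually has to look inside the directory. Here is a minimal sketch of that lookup in plain Java, assuming the naming seen in the listing above (data files start with part-, hidden checksum files end in .crc). The class and method names are made up for this illustration and are not part of Spark.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for this article; not a Spark class.
class PartFiles {

    // True for data files such as part-00000-<uuid>-c000.csv,
    // false for _SUCCESS and for the hidden .crc checksum files.
    static boolean isPartFile(String fileName) {
        return fileName.startsWith("part-") && !fileName.endsWith(".crc");
    }

    // Lists the data files directly inside an output directory, sorted by name.
    static List<Path> listPartFiles(Path outputDir) {
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries
                    .filter(p -> isPartFile(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the output directory name used in this article.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : "processed-data");
        if (!Files.isDirectory(outputDir)) {
            System.out.println("Not a directory: " + outputDir);
            return;
        }
        listPartFiles(outputDir).forEach(System.out::println);
    }
}
```

Run it with the output directory as the argument and it prints only the data files, skipping _SUCCESS and the checksum files.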

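You can also write the dataset as JSON by running

newDs.write().json("processed-data-json");

That creates another folder containing the JSON part file and the _SUCCESS file. Note the different directory name: by default, Spark refuses to write into a directory that already exists, so reusing processed-data here would fail.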

You can also save this data to an external database if you want to. You call the jdbc() method on the writer with the JDBC connection URL, the table name and the connection properties (such as user and password), and Spark writes the dataset to that table.
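To make the jdbc() arguments concrete, here is a small sketch that prepares them in plain Java. The URL, table name, user and password are invented placeholders, and the helper class is hypothetical. The Spark call itself is shown only as a comment, because running it needs Spark and a JDBC driver on the classpath plus a reachable database.

```java
import java.util.Properties;

// Hypothetical helper for this article; all values below are placeholders.
class JdbcTarget {

    // Builds the connection properties passed as the third argument to jdbc().
    // "user" and "password" are the standard JDBC property names.
    static Properties connectionProperties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        return props;
    }

    public static void main(String[] args) {
        // Assumed connection details; replace them with your own database.
        String url = "jdbc:postgresql://localhost:5432/people_db";
        String table = "processed_people";
        Properties props = connectionProperties("spark_user", "change-me");

        // With Spark and the driver available, the write would be:
        //     newDs.write().jdbc(url, table, props);
        System.out.println("Would write to table " + table + " at " + url
                + " as user " + props.getProperty("user"));
    }
}
```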


Apart from CSV and JSON, there is one more popular data format in the Data Science and Big Data world: Parquet. Parquet is a columnar data format that is highly optimized and well suited for column-wise operations. It is widely used across the Big Data ecosystem as a data serialization format, and in Spark it is the default format used when you save or load data without naming one. One main difference between Parquet and formats like CSV and JSON is that Parquet is not meant to be read by humans. It is a binary format that can only be read by a Parquet reader. A sample file looks something like this:

PAR1   ï¿½k ï¿½>, ï¿½          999  1     ï¿½5,   ï¿½      1   2   3   4   5   6   7   8   9  - 0   1   2   3   4   5   6   7   8   < 2 < 2 < 2 < 2 < 2 < 2 < 2 <
.....

Utter gibberish. But Spark can read and understand it. In fact, as Parquet is designed for speed and throughput, reading and writing it can be 10-100 times faster than with an ordinary text format like CSV or JSON, depending on the type of data.
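The one readable piece of the sample above is the leading PAR1. Parquet files start with that 4-byte marker, so a quick sanity check on a file is possible without a Parquet reader. The following is only a sketch of such a check, with a made-up class name. It looks at the first four bytes and nothing else, so it does not prove the rest of the file is valid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for this article; not part of Spark or Parquet.
class ParquetMagic {

    // The 4-byte marker found at the start of a Parquet file.
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the Parquet marker.
    static boolean startsWithMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ParquetMagic <file>");
            return;
        }
        byte[] header = new byte[MAGIC.length];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Reads up to 4 bytes; returns fewer if the file is shorter.
            read = in.readNBytes(header, 0, header.length);
        }
        boolean looksLikeParquet = startsWithMagic(Arrays.copyOf(header, read));
        System.out.println(args[0] + (looksLikeParquet
                ? " starts with PAR1"
                : " does not start with PAR1"));
    }
}
```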

You save a dataset to Parquet as follows:

newDs.write().parquet("processed");

This creates a directory named processed that holds the dataset as a Parquet part file, along with the _SUCCESS status file.

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Java, Freblogg/Spark

Articles on Apache Spark:

Map Vs Flat map

Spark Word count with Java

Datasets in Spark | Part I

Datasets in Spark | Part II

This is the 17th article as part of my twitter challenge #30DaysOfBlogging. Thirteen more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

If you are interested in contributing to an open source project and haven't found the right one, or if you are unsure how to begin, I would like to suggest my own project, Delorean, a distributed version control system built from scratch in Scala. You can contribute not only code, but also usage documentation and bug reports.

Thanks for reading. See you again in the next article.

Durga Swaroop Perla
