In the last tutorial we saw how to create parameterized datasets. Once you have created datasets and performed some operations on them, you will want to save the results back to storage. That is what we'll do in this article: saving Datasets to storage.
The first thing we'll do, as always, is create the Spark session variable.
// Initialize SparkSession
SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();
Using that session variable, we read the fake-people.csv file, which has data like this:
id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,email@example.com,Female,220.127.116.11
2,Nickey,Curreen,firstname.lastname@example.org,Male,18.104.22.168
3,Allayne,Chatainier,email@example.com,Male,22.214.171.124
...
We read this file into a dataset as follows:
// Read csv file
Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");

// After performing several awesome operations
Dataset<Row> newDs = ....
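We want to store this dataset back on disk. We can do that with write(), the counterpart of read(). Note that write() is called on the dataset itself, not on the Spark session variable:
newDs.write().csv("processed-data");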
Note that processed-data in the above command is not the name of the output CSV file but of the output directory. When you write a Dataset, Spark creates a directory with that name and stores the data inside it in the format you asked for, CSV in this case, along with some checksum files and a status flag.
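Here is a slightly fuller sketch of the CSV write, wrapped in a small helper. The class name CsvWriteSketch and the two extra settings are my own additions for illustration: by default Spark refuses to write into a directory that already exists, so the sketch assumes you want to overwrite it, and it assumes you want the column names kept as a header row.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

class CsvWriteSketch {
    // Writes the dataset as CSV into the given directory (not into a single file).
    static void writeCsv(Dataset<Row> ds, String outputDir) {
        ds.write()
          .mode(SaveMode.Overwrite)   // assumption: replace the directory if it already exists
          .option("header", "true")   // assumption: keep the column names in the output
          .csv(outputDir);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Freblogg-Spark").master("local").getOrCreate();
        Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");
        writeCsv(peopleDs, "processed-data");
        spark.stop();
    }
}
```

If you leave out mode(SaveMode.Overwrite), running the program a second time fails because the output directory is already there.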
These are the files that get created in the processed-data directory:
$ ls ../../apache-spark/processed-data
_SUCCESS  part-00000-311049cf-3e48-4286-b93c-7d2096a18678-c000.csv
There are two more hidden CRC files that I'm not showing here. The part-00000-31hxxxxxxxxx.csv file is the actual data file, containing the data from the new dataset.
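Since the output is a directory with a generated file name inside it, you often need to find the data files afterwards. Below is a minimal sketch in plain Java (no Spark needed) that lists them. The class name PartFiles is made up for this example, and it assumes the output lives on the local file system.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PartFiles {
    // Returns the data files (part-*) in a Spark output directory, sorted by name.
    // The glob skips _SUCCESS and the hidden .crc files, since neither name starts with "part-".
    static List<Path> list(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        for (Path p : list(Paths.get("processed-data"))) {
            System.out.println(p.getFileName());
        }
    }
}
```

With a larger dataset Spark may write several part files, one per partition, which is why the helper returns a list rather than a single path.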
You can also write the dataset as JSON by calling json() instead of csv() on the writer.
That creates another folder, with a JSON file and the _SUCCESS file inside it.
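As a rough sketch, the JSON version could look like this. The class name JsonWriteSketch and the directory name processed-data-json are placeholders I picked for the example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class JsonWriteSketch {
    // Writes the dataset as JSON into the given directory.
    static void writeJson(Dataset<Row> ds, String outputDir) {
        ds.write().json(outputDir);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Freblogg-Spark").master("local").getOrCreate();
        Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");
        // "processed-data-json" is a hypothetical directory name
        writeJson(peopleDs, "processed-data-json");
        spark.stop();
    }
}
```

One thing to keep in mind: the file Spark writes is in JSON Lines form, with one JSON object per row on its own line, rather than a single JSON array.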
You can also save this data to an external database. For that, use the jdbc() method on the writer, passing the connection string, the table name, and the connection properties, and Spark will write the rows to the database.
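A minimal sketch of a JDBC write is shown below. Everything specific in it is an assumption for illustration: the class name JdbcWriteSketch, the PostgreSQL URL, the table name, the credentials, and the choice of append mode. You also need the JDBC driver for your database on the classpath.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class JdbcWriteSketch {
    // Builds the connection properties that jdbc() expects alongside the URL and table name.
    static Properties connectionProperties(String user, String password, String driverClass) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("driver", driverClass);
        return props;
    }

    // Writes the dataset to the given table.
    // Assumption: append to the table if it already exists instead of failing.
    static void writeToTable(Dataset<Row> ds, String jdbcUrl, String table, Properties props) {
        ds.write().mode(SaveMode.Append).jdbc(jdbcUrl, table, props);
    }

    // Hypothetical usage, with placeholder connection details:
    //   Properties props = connectionProperties("spark_user", "secret", "org.postgresql.Driver");
    //   writeToTable(newDs, "jdbc:postgresql://localhost:5432/mydb", "people", props);
}
```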
Apart from CSV and JSON, there is one more popular data format in the Data Science and Big Data world: Parquet. Parquet is a highly optimized columnar format, well suited to column-wise operations, and it is widely used across the Big Data ecosystem as a data serialization format. In Spark, Parquet is the default file storage format. Of course, one main difference between Parquet and formats like CSV and JSON is that Parquet is not meant to be read by humans. It is a binary format that can only be read by a Parquet reader. A sample file looks something like this:
PAR1 ï¿½k ï¿½>, ï¿½ 999 1 ï¿½5, ï¿½ 1 2 3 4 5 6 7 8 9 - 0 1 2 3 4 5 6 7 8 < 2 < 2 < 2 < 2 < 2 < 2 < 2 < .....
Utter gibberish. But Spark can read and understand it. In fact, because Parquet is designed for speed and throughput, reading and writing it can be 10-100 times faster than with an ordinary data format like CSV or JSON, depending on the type of data.
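The one readable bit in that sample, PAR1, is the magic number that Parquet files start with. As a small illustration in plain Java, here is a sketch that checks whether a file begins with those four bytes. The class name ParquetMagic is made up, and this is only a quick sanity check on the header, not a validation of the whole file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ParquetMagic {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the file begins with the 4-byte magic number "PAR1".
    static boolean startsWithMagic(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // file is shorter than the magic number
                }
                read += n;
            }
        }
        return Arrays.equals(head, MAGIC);
    }
}
```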
You save a dataset to Parquet the same way, by calling parquet() on the writer with the output directory.
This saves the dataset as a Parquet file along with the _SUCCESS status file.
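A rough sketch of the Parquet round trip might look like the following. The class name ParquetWriteSketch and the directory name processed-data-parquet are placeholders of mine. Reading the directory back with spark.read().parquet() is an easy way to confirm that the write worked, since you can't just open the file and look.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class ParquetWriteSketch {
    // Writes the dataset as Parquet into the given directory.
    static void writeParquet(Dataset<Row> ds, String outputDir) {
        ds.write().parquet(outputDir);
    }

    // Reads a Parquet directory back into a dataset.
    static Dataset<Row> readParquet(SparkSession spark, String inputDir) {
        return spark.read().parquet(inputDir);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Freblogg-Spark").master("local").getOrCreate();
        Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");
        // "processed-data-parquet" is a hypothetical directory name
        writeParquet(peopleDs, "processed-data-parquet");
        Dataset<Row> reloaded = readParquet(spark, "processed-data-parquet");
        reloaded.printSchema(); // the column names come back from the file itself
        reloaded.show(5);
        spark.stop();
    }
}
```

Unlike CSV, Parquet stores the schema inside the file, so there is no header option to set when reading it back.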
That is all for this article.
This is the 17th article in my Twitter challenge #30DaysOfBlogging. Thirteen more articles are to come, on various topics including but not limited to Java, Git, Vim, Software Development, and Python.
If you are interested in this, make sure to follow me on Twitter @durgaswaroop.
If you are interested in contributing to open source projects and haven't found the right project, or are unsure how to begin, I would like to suggest my own project, Delorean, a distributed version control system built from scratch in Scala. You can contribute not only code but also usage documentation, or by identifying bugs in its functionality.
Thanks for reading. See you again in the next article.