January 02, 2018

Datasets In Apache Spark | Part 2

In the last two tutorials, we covered what Apache Spark is and familiarized ourselves with Datasets, Spark's primary data abstraction. In this tutorial we will see how to read a data file into a Dataset of parametrized bean objects using Encoders.


This tutorial is going to be short, but it is important because you will find yourself doing this frequently. In the last article you saw how to read a CSV or JSON file as a Dataset, and you might have noticed that we were using Dataset<Row> for everything. If you're not familiar with generics in Java, Dataset<Row> can be thought of as a Dataset consisting of Row objects. Row is a Spark SQL class and is the default type when creating a Dataset.

Although the Row class has some useful methods, as a generic container that has to accommodate any schema, it is not ideal for everything. Since the data in a Dataset usually corresponds to a bean class, it is better to create a Dataset of that bean class instead of Row. That way, you'll have access to all the usual getters and setters of the bean class. That's what we'll do in this article: create a Dataset of POJOs instead of Row objects.
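To make that difference concrete outside of Spark, here is a minimal plain-Java sketch (the RowVsBean and Person names are my own illustration, not Spark API) contrasting positional, cast-heavy access with typed getter access:

```java
public class RowVsBean {
    // Bean-style record: fields accessed through named, type-safe getters.
    public static class Person {
        private final int id;
        private final String firstName;
        public Person(int id, String firstName) { this.id = id; this.firstName = firstName; }
        public int getId() { return id; }
        public String getFirstName() { return firstName; }
    }

    public static void main(String[] args) {
        // Row-style access: positional, with a cast for every value.
        Object[] row = {1, "Netti"};
        String fromRow = (String) row[1];   // must know that column 1 is first_name

        // Bean-style access: the getter names the field and fixes the type.
        Person p = new Person(1, "Netti");
        String fromBean = p.getFirstName();

        System.out.println(fromRow.equals(fromBean)); // prints "true"
    }
}
```

With the Row-style access, a wrong index or a wrong cast only fails at runtime; with the bean, a typo in the getter name fails at compile time.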

I'm using the same fake-people.csv file that I used in the last article that looks like this:

id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,nmckirdy0@slideshare.net,Female,148.3.248.193
2,Nickey,Curreen,ncurreen1@tripadvisor.com,Male,206.9.48.216
3,Allayne,Chatainier,achatainier2@trellian.com,Male,191.118.4.217
...

To represent this data, I've created a POJO called FakePeople.java, which looks like this:

import lombok.Data;

public @Data class FakePeople {
    private final int id;
    private final String firstName;
    private final String lastName;
    private final String email;
    private final String gender;
    private final String ipAddress;
}

I'm using Project Lombok here to generate the required getters and other boilerplate POJO methods. (If you don't know about Lombok, you should definitely check it out. It is quite handy.)
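One thing worth noting: Spark's Encoders.bean follows the JavaBean convention, which expects a public no-argument constructor along with getters and setters. If you don't want to pull in Lombok, or if the bean encoder complains about the final fields above, a hand-written equivalent would look something like this (my own sketch, not from the original article):

```java
import java.io.Serializable;

// Plain-Java version of FakePeople, written to the JavaBean convention that
// Encoders.bean expects: a public no-arg constructor plus getters and setters,
// which is why the fields are non-final here.
public class FakePeople implements Serializable {
    private int id;
    private String firstName;
    private String lastName;
    private String email;
    private String gender;
    private String ipAddress;

    public FakePeople() {}  // no-arg constructor required by the bean encoder

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
    public String getGender() { return gender; }
    public void setGender(String gender) { this.gender = gender; }
    public String getIpAddress() { return ipAddress; }
    public void setIpAddress(String ipAddress) { this.ipAddress = ipAddress; }
}
```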

Now that we have our POJO, let's get a parametrized Dataset. To do that, we first need to create an Encoder for the FakePeople class:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<FakePeople> fakePeopleEncoder = Encoders.bean(FakePeople.class);

This creates an encoder that tells Spark how to convert between its internal row representation and FakePeople objects.

Of course, we need our SparkSession as well.

// Initialize SparkSession
SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();

Now we can go ahead and read the CSV file, much like we did before, with just one addition.

// Without the encoder: a Dataset of generic Row objects
Dataset<Row> rows = spark.read().option("header", "true").csv("fake-people.csv");

// With the encoder: a Dataset of FakePeople objects
Dataset<FakePeople> people = spark.read().option("header", "true").csv("fake-people.csv").as(fakePeopleEncoder);

And the output of people.show(5) is the same as what you'd expect.

+---+----------+----------+--------------------+------+--------------+
| id|first_name| last_name|               email|gender|    ip_address|
+---+----------+----------+--------------------+------+--------------+
|  1|     Netti|   McKirdy|nmckirdy0@slidesh...|Female| 148.3.248.193|
|  2|    Nickey|   Curreen|ncurreen1@tripadv...|  Male|  206.9.48.216|
|  3|   Allayne|Chatainier|achatainier2@trel...|  Male| 191.118.4.217|
|  4|     Tades|    Emmett|temmett3@barnesan...|  Male|153.113.87.195|
|  5|     Shawn|    McGenn|smcgenn4@shop-pro.jp|  Male|  247.45.80.68|
+---+----------+----------+--------------------+------+--------------+

As you can see, the only difference in creating the Dataset is .as(fakePeopleEncoder), and that gets us Dataset<FakePeople> instead of Dataset<Row>. With that, we now have access to all the getters and setters of the FakePeople class, which we wouldn't have with a Row object. We'll explore how this is useful in a future tutorial.
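To give a taste of what typed access enables, here is a plain-Java stand-in for the kind of operation you'd run on a typed Dataset (this uses java.util.stream rather than Spark's API, and the TypedFilter and FakePerson names are my own illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TypedFilter {
    public static class FakePerson {
        private final String firstName;
        private final String gender;
        public FakePerson(String firstName, String gender) {
            this.firstName = firstName;
            this.gender = gender;
        }
        public String getFirstName() { return firstName; }
        public String getGender() { return gender; }
    }

    // With typed objects, the filter reads like the domain: no column
    // indices and no casts, just getters on the bean.
    public static List<String> femaleFirstNames(List<FakePerson> people) {
        return people.stream()
                .filter(p -> p.getGender().equals("Female"))
                .map(FakePerson::getFirstName)
                .collect(Collectors.toList());
    }
}
```

A typed Dataset supports the analogous filter and map operations directly on FakePeople objects, which is where the encoder starts paying off.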

For more information on Datasets: Spark SQL, DataFrames and Datasets Guide

That is all for this article.


For more programming articles, check out Freblogg, Freblogg/Java, Freblogg/Spark

Apache Spark articles:

Word count with Apache Spark and Java

Datasets in Apache Spark | Part 1

Datasets in Apache Spark | Part 2


This is the 11th article of my Twitter challenge #31DaysOfBlogging. Twenty more articles on various topics, including but not limited to Java, Git, Vim, software development, and Python, are to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe here on Medium and on my other blog as well.


If you are interested in contributing to open source projects and haven't found the right one, or if you're unsure how to begin, I would like to suggest my own project, Delorean, a distributed version control system built from scratch in Scala. You can contribute not only code but also usage documentation, and you can help by identifying bugs in its functionality.


Thanks for reading. See you again in the next article.
