FreBlogg

Archive vs Archived | A rant on naming things correctly

2021-08-23T23:00:00+05:30

We are always taught that grammar is important in English (or any other language, for that matter). Somehow, that is rarely the case with English usage in programming. The grammar rules seem to be relaxed when people are coding.

My students and my colleagues know how particular I am about naming things in code. It is something I take pride in. Apart from picking descriptive names, I also try to make sure that the tense and number of the variables or functions always agree with the rules of grammar.

I review a lot of code daily, written by my colleagues, my students and others. Even though I am less critical of other people's code than I am of mine, one of the things I do try to give feedback on is adherence to correct grammar. It happens to be one of the most common mistakes I see. People use the correct word to name something, but often in the wrong tense or number.

The latest installment of this is the misuse of "archive" and "archived" in code by a student. And hence this article.

Let's get the definitions first.

archive /ˈɑːkʌɪv/

(noun)

A collection of something. In software terms, it refers to a folder or location where files and data are kept aside for later use.

(verb)

The act of storing or placing something in an archive.

So, you can use them in sentences like:

1. I am placing the files in the archive. (Noun)
2. I am archiving the files. (Verb)

Archived is the state of an object or a file after it has been placed in an archive.

Once you know these, you can name the variables and functions that use this word, correctly.

For example, the folder where you are storing the files would be called an archive. A file you have archived would be named archived_file. The function or method that archives a file can be called archive_file() or maybe simply archive().

Let's add a few more things to it. Based on this, if you have a function called is_archive (or archive? in ruby), you expect it to give True or False based on whether something (a folder) is an archive, or not.

To check if a file has been archived, you will have a function called is_archived that will take a file name.

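Here's some sample code (in Python) that shows all of this in action:

import os
from pathlib import Path

# is_archive, archive_file and is_archived are assumed to be defined elsewhere
archive = Path('/opt/archive')

is_archive(archive) # True

files = os.listdir(os.getcwd()) # Get all files in the current directory

# Archive the files (archive_file, because the name archive is already taken by the folder)
for file in files:
  archive_file(file)

archived_files = os.listdir(archive)

for file in archived_files:
  is_archived(file) # True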
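If you want to run something end to end, here is a minimal sketch of how those three functions could look. The function bodies are my assumptions for illustration (for instance, treating any directory named "archive" as an archive), and it works in a temporary directory instead of /opt/archive.

```python
import shutil
import tempfile
from pathlib import Path


def is_archive(folder: Path) -> bool:
    """Return True if the folder is an archive (assumed here: a directory named 'archive')."""
    return folder.is_dir() and folder.name == "archive"


def archive_file(file: Path, archive: Path) -> Path:
    """Move the file into the archive and return the location of the archived file."""
    archived_file = archive / file.name
    shutil.move(str(file), str(archived_file))
    return archived_file


def is_archived(file_name: str, archive: Path) -> bool:
    """Return True if a file with this name has been placed in the archive."""
    return (archive / file_name).is_file()


def demo() -> dict:
    """Archive one file inside a temporary directory and report each check."""
    with tempfile.TemporaryDirectory() as tmp:
        current_dir = Path(tmp)
        archive = current_dir / "archive"
        archive.mkdir()
        report = current_dir / "report.txt"
        report.write_text("quarterly numbers")

        archived_before = is_archived("report.txt", archive)
        archived_file = archive_file(report, archive)
        return {
            "is_archive": is_archive(archive),
            "archived_before": archived_before,
            "archived_after": is_archived(archived_file.name, archive),
        }
```

Notice how each name tells you what you get back: archive is a thing, archive_file does something, and is_archived answers a question about state.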

You can use similar conventions in other languages also.

Now, for some general advice on naming things from my experience reviewing code for years:

Try to keep the names of variables/objects as nouns and names of functions as verbs or instructions.
A function that validates something would be validate_<something>. A file validator would be called validate_file(...) and so on.
A file that has been validated should be called validated_file or, even better, valid_file.
A function that shares credentials (with something else), should be share_credentials(...) while the actual credential after sharing can be named shared_credential.
Use configuration or config for an object and configure(...) for a function.
Do not use multiple words to mean the same thing. I have seen functions like get_file(...), fetch_file(...) and retrieve_file(...) in the same code base. Stick to one and use that everywhere. get_file(...) is my preferred option.
Do not use a variable in the plural when it only has one value. Similarly, when a variable will have multiple values, use a plural. So, if you have a variable containers, it is expected to contain multiple containers.
Also, you don't need to suffix a variable with _list or _set to indicate the type of collection it is. In most cases, it is not relevant. So, prefer objects instead of object_list and files instead of file_set.
Similarly, don't suffix or prefix data types when it is obvious. No need to use a variable user_name_str to indicate that it is a string. Names are generally strings. So, we can omit the _str suffix.
Use similar forms of words when referring to similar things. For example, if you are referring to the states a Docker container can be in, don't use "Running", "Success" and "Fail". Keep them all in one grammatical form, something like "Running", "Succeeded/Completed", "Failed". That reads better.
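A few of these points can be sketched in Python. Everything here is hypothetical and only meant to show the naming: the ContainerState enum, the ".csv" rule inside validate_file, and the get_valid_files helper are my inventions for the example.

```python
from enum import Enum


class ContainerState(Enum):
    """All states share one grammatical form."""
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


def validate_file(file_name: str) -> bool:
    """A verb for a function. The rule itself (must end in .csv) is made up."""
    return file_name.endswith(".csv")


def get_valid_files(file_names: list) -> list:
    """A plural noun for a collection, with no _list suffix."""
    valid_files = [file_name for file_name in file_names if validate_file(file_name)]
    return valid_files
```

Reading a call like get_valid_files(files) out loud should sound like a sentence. If it doesn't, the name probably needs another look.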

I can go on, but I will stop this list here before it gets too pedantic.

Hopefully, all of this information has been useful for you. If not, you can let me know all about it on Twitter @durgaswaroop

Mocking functions in Python with Pytest Part I

2020-04-11T02:18:00+05:30

Mocking resources in unit tests is just as important and common as writing unit tests. However, a lot of people are not familiar with how to properly mock classes, objects or functions for tests, because the available documentation online is either too short or unnecessarily complicated. One of the main reasons for this confusion is that there are several ways to do the same thing. Every other article out there seems to mock things in a different way. With this series of articles on mocking, I hope to bring some clarity to the topic.

Pre-requisite

This is a tutorial on Mocking with pytest. I am operating with the assumption that you can write unit tests in Python using pytest.

Why Mock?

As you are here, reading this article, I will assume that you are familiar with mocking. In case you are not, let us do a quick overview of what it is and why we need it.

Say, you have a service that collects stock market data and gives information about the top gainers in a particular sector. You get the stock market information from a third party API, and process it to give out the results. Now, to test your code, you would not want to hit the API every time, as it will make the tests slower, and also the API provider would charge you for the extra hits. What you want here is a mock! A mock replaces a function with a dummy you can program to do whatever you choose. This is also called ‘Patching’. For the rest of this series, I am going to use ‘mock’ and ‘patch’ interchangeably.

Packages needed for Mocking

Unlike the majority of programming languages, Python comes with built-in libraries for unit testing and mocking. They are powerful, self-sufficient and provide the functionality you need. The pytest-mock plugin we will use is a convenient wrapper around them, which makes them easier to use in combination with pytest.

If you look up articles on mocking, or if you read through the endless questions on Stackoverflow, you will frequently come across the words Mock, MagicMock, patch, etc. I'm going to demystify them here.

In Python, to mock functions, objects or classes, you will mostly use the Mock class. The Mock class comes from the built-in unittest.mock module. From now on, any time you come across Mock, know that it is from the unittest library. MagicMock is a subclass of Mock with some of the magic methods implemented. Magic methods are your usual dunder methods like __str__, __len__, etc.

For the most part, it does not matter which one you use, Mock or MagicMock. Unless you need magic methods like the above implemented, you can stick to Mock. Pytest-mock gives you access to both of these classes with an easy to use interface.
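To see the difference for yourself, here is a small sketch that uses the two classes directly from unittest.mock, without pytest-mock. The get_price attribute is a made-up name for illustration.

```python
from unittest.mock import MagicMock, Mock

plain_mock = Mock()
magic_mock = MagicMock()

# MagicMock comes with magic methods pre-configured, so len() works out of the box
magic_length = len(magic_mock)

# A plain Mock does not support len() unless you configure __len__ yourself
try:
    len(plain_mock)
    plain_mock_supports_len = True
except TypeError:
    plain_mock_supports_len = False

# For ordinary attributes and return values, both classes behave the same way
plain_mock.get_price.return_value = 42
magic_mock.get_price.return_value = 42
```

So unless your code under test calls len(), str(), iteration and the like on the mocked object, you will not notice which of the two you picked.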

patch is another function that comes from the unittest.mock module and helps replace functions with mocks. Pytest-mock has a wrapper for this too.

Installing Pytest Mock

Before you get started with using pytest-mock, you have to install it. You can install it with pip as follows:

pip install pytest-mock

This is a pytest plugin. So, it will also install pytest, if you have not installed it already.

Mocking a simple function

As this is the first article, we will keep it simple. We will start by mocking a simple function.

Say, we have a function get_operating_system that tells us whether we are using Windows or Linux.

# application.py
from time import sleep


def is_windows():
    # This sleep could be some complex operation instead
    sleep(5)
    return True


def get_operating_system():
    return 'Windows' if is_windows() else 'Linux'

This function uses another function is_windows to check if the current system is Windows or not. Assume that this is_windows function is quite complex taking several seconds to run. We can simulate this slow function by making the program sleep for 5 seconds every time it is called.

A pytest for get_operating_system() would be as follows:

# test_application.py

from application import get_operating_system

def test_get_operating_system():
    assert get_operating_system() == 'Windows'

Since get_operating_system() calls the slow function is_windows, the test is going to be slow. This can be seen below in the output of running pytest, which took 5.05 seconds.

$ pytest
================ test session starts ========================
Python 3.7.3, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /usr/Personal/Projects/pytest-and-mocking
plugins: mock-2.0.0
collected 1 item

test_application.py .                                    [100%]

================ 1 passed in 5.05s ==========================

Unit tests should be fast. We should be able to run hundreds of tests in seconds. A single test that takes five seconds slows down the whole test suite. Enter mocking, to make our lives easier. If we patch the slow function, we can verify get_operating_system's behavior without waiting for five seconds.

Let’s mock this function with pytest-mock.

Pytest-mock provides a fixture called mocker. It provides a nice interface on top of python's built-in mocking constructs. You use mocker by passing it as an argument to your test function, and calling the mock and patch functions from it.

Say, you want the is_windows function to return True without taking those five precious seconds. We can patch it as follows:

mocker.patch('application.is_windows', return_value=True)

You have to refer to is_windows here as application.is_windows, given that it is a function in the application module. If we pass only 'is_windows', patch has no module in which to look up the function, so the call fails. The format is always <module_name>.<function_name>. Knowing how to mock correctly is important, and we will continue working on it in this series.
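Here is a self-contained sketch of that target format. It uses patch from unittest.mock directly, which is what mocker.patch wraps, and it builds a stand-in application module in memory (without the sleep) purely so the snippet can run on its own. That in-memory module is an assumption for the demo, not something you would do in a real test.

```python
import sys
import types
from unittest.mock import patch

# Build a stand-in 'application' module in memory so the sketch is self-contained
application = types.ModuleType("application")
exec(
    "def is_windows():\n"
    "    return False\n"
    "\n"
    "def get_operating_system():\n"
    "    return 'Windows' if is_windows() else 'Linux'\n",
    application.__dict__,
)
sys.modules["application"] = application

# Without a patch, the real is_windows() runs
unpatched_result = application.get_operating_system()

# The target string names the module first, then the function inside it
with patch("application.is_windows", return_value=True) as mocked_is_windows:
    patched_result = application.get_operating_system()
    call_count = mocked_is_windows.call_count

# Leaving the 'with' block restores the original function
restored_result = application.get_operating_system()
```

The patch only affects the lookup of is_windows inside the application module, which is exactly where get_operating_system goes looking for it.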

The updated test function with the patch is as follows:

# 'mocker' fixture provided by pytest-mock
def test_get_operating_system(mocker):  
    # Mock the slow function and return True always
    mocker.patch('application.is_windows', return_value=True) 
    assert get_operating_system() == 'Windows'

Now when you run the test, it will finish much faster.

$ pytest
============ test session starts ==================
Python 3.7.3, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /mnt/c/Personal/Projects/pytest-and-mocking
plugins: mock-2.0.0
collected 1 item

test_application.py .                          [100%]

=========== 1 passed in 0.11s ======================

As you can see, the test took only 0.11 seconds. We have successfully patched the slow function and made the test suite faster.

Another advantage of mocking is that you can make the mock function return anything. You can even make it raise errors to test how your code behaves in those scenarios. We will see how all of this works, and more, in future articles.
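As a small preview, raising errors is done with the side_effect argument of Mock, and mocker.patch accepts the same keyword. The describe_system function below is hypothetical: I made it up so that there is some error handling to exercise.

```python
from unittest.mock import Mock


def describe_system(is_windows) -> str:
    """Hypothetical caller that falls back to 'Unknown' when detection fails."""
    try:
        return "Windows" if is_windows() else "Linux"
    except OSError:
        return "Unknown"


# A mock that raises an error every time it is called
failing_check = Mock(side_effect=OSError("detection failed"))

# A mock that returns a different value on each call
changing_check = Mock(side_effect=[True, False])

result_on_error = describe_system(failing_check)
results_in_sequence = [describe_system(changing_check) for _ in range(2)]
```

With the real function you would have to break your machine to hit the error branch. With a mock it takes one line.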

For now, if you want to test the case where is_windows returns False, write the following test:

def test_operating_system_is_linux(mocker):
    mocker.patch('application.is_windows', return_value=False) # set the return value to be False
    assert get_operating_system() == 'Linux'

Note that all of the mocks & patches set with mocker are function scoped i.e., they will only be available for that specific function. Therefore, you can patch the same function in multiple tests and they will not conflict with each other.

That is your first introduction to the world of mocking with pytest. We will cover more scenarios in the upcoming articles. Stay tuned, stay safe and stay awesome till then.

List of articles in this series:

Mocking Functions Part I 🢠 Current Article

Mocking Functions Part II

If you like this article, you can like this article to encourage me to put out the next article soon. If you think someone you know can benefit from this article, do share it with them.

If you want to thank me, you can say hi on twitter @durgaswaroop. And, if you want to support me here’s my paypal link: paypal.me/durgaswaroop

Attribution: Python Logo — https://www.python.org/community/logos/

A Practical Introduction to Kafka Storage Internals

2018-08-06T18:00:00+05:30

Kafka is everywhere these days. With the advent of Microservices and distributed computing, Kafka has become a regular occurrence in the architecture of every product. In this article, I’ll try to explain how Kafka’s internal storage mechanism works.

Since this is going to be a deep dive into Kafka’s internals, I would expect you to have some understanding about Kafka. Although I’ve tried to keep the entry-level for this article pretty low, you might not be able to understand everything if you’re not familiar with the general workings of Kafka. Proceed further with that in mind.

Kafka is typically referred to as a Distributed, Replicated Messaging Queue, which although technically true, usually leads to some confusion depending on your definition of a messaging queue. Instead, I prefer to call it a Distributed, Replicated Commit Log. This, I think, clearly represents what Kafka does, as all of us understand how logs are written to disk. And in this case, it is the messages pushed into Kafka that are stored to disk.

Regarding storage in Kafka, you’ll always hear two terms - Partition and Topic. Partitions are the units of storage in Kafka for messages. And Topic can be thought of as being a container in which these partitions lie.

With the basic stuff out of our way, let’s understand these concepts better by working with Kafka.

I am going to start by creating a topic in Kafka with three partitions. If you want to follow along, the command looks like this for a local Kafka setup on Windows.

kafka-topics.bat --create --topic freblogg --partitions 3 --replication-factor 1 --zookeeper localhost:2181

If I go into Kafka’s log directory, I see three directories created as follows.

$ tree freblogg*
freblogg-0
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint
freblogg-1
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint
freblogg-2
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint

We have three directories created because we’ve given three partitions for our topic, which means that each partition gets a directory on the file system. You also see some files like index, log etc. We’ll get to them shortly.

One more thing that you should be able to see from here is that in Kafka, the topic is more of a logical grouping than anything else and that the Partition is the actual unit of storage in Kafka. That is what is physically stored on the disk. Let’s understand partitions in some more detail.

Partitions

A partition, in theory, can be described as an immutable collection (or sequence) of messages. We can only append messages to a partition but cannot delete from it. And by “We”, I am talking about the Kafka producer. A producer can’t delete the messages in the topic.

Now we’ll send some messages into the topic. But before that, I want you to see the sizes of files in our partition folders.

$ ls -lh freblogg-0
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

You see the index files combined are about 20M in size while the log file is empty. This is the same case with freblogg-1 and freblogg-2 folders.

Now let us send a couple of messages and see what happens. To send the messages I’m using the console producer as follows:

kafka-console-producer.bat --topic freblogg --broker-list localhost:9092

I have sent two messages, first a customary “hello world” and then I pressed the Enter key, which becomes the second message. Now if I print the sizes again:

$ ls -lh freblogg*
freblogg-0:
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

freblogg-1:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  68 Aug  5 10:15 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 10:15 leader-epoch-checkpoint

freblogg-2:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  79 Aug  5 09:59 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 09:59 leader-epoch-checkpoint

Our two messages went into two of the partitions where you can see that the log files have a non zero size. This is because the messages in the partition are stored in the ‘xxxx.log’ file. To confirm that the messages are indeed stored in the log file, we can just see what’s inside that log file.

$ cat freblogg-2/*.log
@^@^B°£æÃ^@^K^Xÿÿÿÿÿÿ^@^@^@^A"^@^@^A^VHello World^@

The file format of the ‘log’ file is not conducive to textual representation, but you should see the ‘Hello World’ at the end, indicating that this file got updated when we sent the message into the topic. The second message we sent went into the other partition.

Notice that the first message we sent went into the third partition (freblogg-2) and the second message went into the second partition (freblogg-1). This is because Kafka arbitrarily picks the partition for the first message and then distributes the messages to partitions in a round-robin fashion. If a third message comes now, it would go into freblogg-0, and this order of partitions continues for any new message that comes in. We can also make Kafka choose the same partition for our messages by adding a key to the message. Kafka stores all the messages with the same key in a single partition.
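The idea of picking a partition can be sketched in a few lines of Python. This is only an illustration and not Kafka's partitioner: the choose_partition function is hypothetical, and zlib.crc32 is a stand-in for the murmur2 hash that the Kafka client uses on keys.

```python
import zlib
from itertools import count
from typing import Iterator, Optional


def choose_partition(key: Optional[bytes], num_partitions: int, counter: Iterator[int]) -> int:
    """Pick a partition for a message (illustrative only)."""
    if key is None:
        # No key: cycle through the partitions in a round-robin fashion
        return next(counter) % num_partitions
    # With a key: the same key always hashes to the same partition
    return zlib.crc32(key) % num_partitions


counter = count()
keyless_partitions = [choose_partition(None, 3, counter) for _ in range(4)]
keyed_partitions = [choose_partition(b"user-42", 3, counter) for _ in range(3)]
```

The keyless messages walk through all three partitions, while every message with the key user-42 lands in one and the same partition.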

Each new message in the partition gets an Id which is one more than the previous Id. This Id is also called the offset. So, the first message is at offset 0, the second message is at offset 1 and so on. These offset Ids are always incremented from the previous value.

We can understand those random characters in the log file, using a Kafka tool. Those extra characters might not seem useful to us, but they are useful for Kafka as they are the metadata for each message in the queue. If I run,

kafka-run-class.bat kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files logs\freblogg-2\00000000000000000000.log

This gives the output

Dumping logs\freblogg-2\00000000000000000000.log
Starting offset: 0

offset: 0 position: 0 CreateTime: 1533443377944 isvalid: true keysize: -1 valuesize: 11 producerId: -1 headerKeys: [] payload: Hello World

offset: 1 position: 79 CreateTime: 1533462689974 isvalid: true keysize: -1 valuesize: 6 producerId: -1 headerKeys: [] payload: amazon

(I’ve removed a couple of things from this output that are not necessary for this discussion.)

You can see that it stores information of the offset, time of creation, key and value sizes etc along with the actual message payload in the log file.

It is also important to note that a partition is tied to a broker. In other words, if we have three brokers and the folder freblogg-0 exists on broker-1, you can be sure that it will not appear on any of the other brokers. Partitions of a topic can be spread out across multiple brokers, but a partition is always present on one single Kafka broker (when the replication factor has its default value, which is 1; replication is discussed further below).

Segments

We’ll finally talk about those index and log files we’ve seen in the partition directory. Partition might be the standard unit of storage in Kafka, but it is not the lowest level of abstraction provided. Each partition is divided into segments.

A segment is simply a collection of messages of a partition. Instead of storing all the messages of a partition in a single file (think of the log file analogy again), Kafka splits them into chunks called segments. Doing this provides several advantages. Divide and Conquer FTW!

Most importantly, it makes purging data easy. As introduced earlier, a partition is immutable from a consumer’s perspective. But Kafka can still remove messages based on the “Retention policy” of the topic. Deleting segments is much simpler than deleting things from a single file, especially when a producer might be pushing data into it.

$ ls -lh freblogg-0
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

The 00000000000000000000 in front of the log and index files in each partition folder is the name of our segment. Each segment has segment.log, segment.index and segment.timeindex files.

Kafka always writes the messages into these segment files under a partition. There is always an active segment to which Kafka writes. Once the segment’s size limit is reached, a new segment file is created and that becomes the active segment.

Each segment file is created with the offset of the first message as its file name. So, In the above picture, segment 0 has messages from offset 0 to offset 2, segment 3 has messages from offset 3 to 5 and so on. Segment 6 which is the last segment is the active segment.

$ ls -lh freblogg*
freblogg-0:
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

freblogg-1:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  68 Aug  5 10:15 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 10:15 leader-epoch-checkpoint

freblogg-2:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  79 Aug  5 09:59 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 09:59 leader-epoch-checkpoint

In our case, we only had one segment in each of our partitions which is 00000000000000000000. Since we don't see another segment file present, it means that 00000000000000000000 is the active segment in each of those partitions.

The default value for segment size is a high value (1 GB) but let’s say we’ve tweaked the Kafka configuration so that each segment can hold only three messages. Let’s see how that would play out.

Say this is the current state of the freblogg-2 partition. We've three messages pushed into it.

Since ‘three messages’ is the limit we’ve set, if a new message comes into this partition, Kafka will automatically close the current segment, create a new segment, make that the active segment and store the new message in the new segment’s log file.

(I'm not showing the preceding zeroes to make it easy on the eyes)

freblogg-2
|-- 00.index
|-- 00.log
|-- 00.timeindex
|-- 03.index
|-- 03.log
`-- 03.timeindex

You should’ve noted that the name of the newer segment is not 01. Instead, you see 03.index, 03.log. So, what is going on?

This is because Kafka uses the lowest offset in the segment as its name. Since the new message that came into the partition has offset 3, that is the name Kafka gives to the new segment. It also means that since we have 00 and 03 as our segments, we can be sure that the messages with offsets 0, 1 and 2 are indeed present in the 00 segment. New messages coming into the freblogg-2 partition with offsets 3, 4 and 5 will be stored in segment 03.
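The naming rule is easy to play with in Python. The two functions below are hypothetical helpers I wrote for this sketch, using the pretend limit of three messages per segment from above. Real Kafka rolls segments based on size and time, not on a message count.

```python
from bisect import bisect_right


def segment_file_names(message_count: int, messages_per_segment: int) -> list:
    """Name each segment's log file after the first offset it holds, zero-padded to 20 digits."""
    return [
        f"{base_offset:020d}.log"
        for base_offset in range(0, message_count, messages_per_segment)
    ]


def find_segment(offset: int, base_offsets: list) -> int:
    """Return the base offset of the segment holding 'offset' (base_offsets must be sorted)."""
    position = bisect_right(base_offsets, offset) - 1
    if position < 0:
        raise ValueError(f"offset {offset} is older than the first segment")
    return base_offsets[position]


# Four messages with a limit of three per segment give two segments: 00 and 03
names = segment_file_names(message_count=4, messages_per_segment=3)

# With segments 0, 3 and 6, the message at offset 4 lives in segment 3
segment_for_offset_4 = find_segment(4, [0, 3, 6])
```

Because the file names are just base offsets, finding the right segment for any offset needs nothing more than the sorted list of names.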

One of the common operations in Kafka is to read the message at a particular offset. If Kafka had to scan the log file to find that offset, it would be an expensive task, especially because the log file can grow to a huge size (1 GB by default). This is where the .index file becomes useful. The index file stores the offsets and the physical positions of the messages in the log file.

An index file for the log file I’ve shown in the ‘Quick detour’ above would look something like this:

If you need to read the message at offset 1, you first search for it in the index file and figure out that the message is in position 79. Then you directly go to position 79 in the log file and start reading. This makes it quite effective as you can use binary search to quickly get to the correct offset in the already sorted index file.
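Here is a rough sketch of that lookup in Python, using the two offsets and positions from the DumpLogSegments output earlier. The in-memory list and the find_position function are simplifications I made up for illustration. The real index is a binary file, and it does not have to contain an entry for every offset.

```python
from bisect import bisect_right

# (offset, position in the log file), sorted by offset
INDEX = [(0, 0), (1, 79)]


def find_position(index: list, target_offset: int) -> int:
    """Binary-search the index for the closest entry at or before target_offset."""
    offsets = [offset for offset, _ in index]
    entry = bisect_right(offsets, target_offset) - 1
    if entry < 0:
        raise KeyError(f"offset {target_offset} is before the start of this segment")
    # Reading starts here. If the exact offset has no entry, scan forward in the log from this position.
    return index[entry][1]


position_of_offset_1 = find_position(INDEX, 1)
```

The binary search touches only a handful of index entries, no matter how large the log file has grown.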

Parallelism with Partitions

To guarantee the order in which messages are read from a partition, Kafka allows only one consumer (from a consumer group) per partition. So, if a partition gets messages a, f and k, the consumer will also read them in the order a, f and k. This is important to note, as the order of message consumption is not guaranteed at the topic level when you have multiple partitions.

Just increasing the number of consumers won’t increase the parallelism. You need to scale your partitions accordingly. To read data from a topic in parallel with two consumers, you create two partitions so that each consumer can read from its own partition. Also since partitions of a topic can be on different brokers, two consumers of a topic can read the data from two different brokers.
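To see why extra consumers don't help, here is a toy assignment function. It is a hypothetical round-robin sketch and not one of Kafka's actual assignment strategies, but it respects the one rule that matters here: within a group, each partition goes to exactly one consumer.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Give each partition to exactly one consumer in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for position, partition in enumerate(partitions):
        consumer = consumers[position % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


# Three partitions, two consumers: both consumers have work to do
two_consumers = assign_partitions([0, 1, 2], ["consumer-1", "consumer-2"])

# Three partitions, four consumers: the fourth consumer sits idle
four_consumers = assign_partitions(
    [0, 1, 2], ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
)
```

With four consumers and three partitions, consumer-4 ends up with an empty list. The number of partitions is the ceiling on parallelism within a group.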

Topics

We’ve finally come to what a topic is. We’ve covered a lot of things about topics already. The most important thing to know is that a Topic is merely a logical grouping of several partitions.

A topic can be distributed across multiple brokers. This is done using the partitions. But a partition still needs to be on a single broker. Each topic will have its unique name and the partitions will be named from that.

Replication

Let’s talk about replication. Whenever we’re creating a topic in Kafka, we need to specify the replication factor we need for that topic. Let's say we've two brokers and so we've given the replication-factor as 2. What this means is that Kafka will try to always ensure that each partition of this topic has a backup/replica. The way Kafka distributes the partitions is quite similar to how HDFS distributes its data blocks across nodes.

Say for the freblogg topic that we've been using so far, we've given the replication factor as 2. The resulting distribution of its three partitions will look something like this.

Even when you have a replicated partition on a different broker, Kafka wouldn’t let you read from it because in each replicated set of partitions, there is a LEADER and the rest of them are just mere FOLLOWERS serving as backup. The followers keep on syncing the data from the leader partition periodically, waiting for their chance to shine. When the leader goes down, one of the in-sync follower partitions is chosen as the new leader and now you can consume data from this partition.

A Leader and a Follower of a single partition are never in a single broker. It should be quite obvious why that is so.
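A toy placement function makes the rule concrete. This is a simplified sketch under my own assumptions (the place_replicas name, the broker names and the plain round-robin strategy are all hypothetical), and it is not how Kafka actually decides placement. It only shows that the leader and followers of a partition end up on different brokers.

```python
def place_replicas(num_partitions: int, brokers: list, replication_factor: int) -> dict:
    """Spread the replicas of each partition over distinct brokers (illustrative only)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        # Consecutive brokers, starting at a different broker for each partition
        replicas = [
            brokers[(partition + replica) % len(brokers)]
            for replica in range(replication_factor)
        ]
        placement[partition] = {"leader": replicas[0], "followers": replicas[1:]}
    return placement


# The freblogg topic: three partitions, two brokers, replication factor 2
placement = place_replicas(3, ["broker-1", "broker-2"], 2)
```

Asking for more replicas than there are brokers fails, because the extra replica would have to share a broker with another copy of the same partition.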

Finally, this long article ends. Congratulations on making it this far. You now know most of what there is to know about Kafka’s data storage. To ensure that you retain this information let’s do a quick recap.

Recap

Data in Kafka is stored in topics
Topics are partitioned
Each partition is further divided into segments
Each segment has a log file to store the actual message and an index file to store the position of the messages in the log file
Various partitions of a topic can be on different brokers but a partition is always tied to a single broker
Replicated partitions are passive. You can consume messages from them only when the leader is down

That ought to cover everything we’ve talked about. Thanks for reading. See you again in the next one.

Attribution:

Kafka image - https://kafka.apache.org/images/kafka_diagram.png

Reshaping Pandas Data frames with Melt & Pivot

2018-06-17T09:00:00+05:30

Pandas is a wonderful data manipulation library in python. Working in the field of Data science and Machine learning, I find myself using Pandas pretty much everyday. It's an invaluable tool for data analysis and manipulation.

In this short article, I will show you what Melt and Pivot (Reverse melt or Unmelt) are in Pandas, and how you can use them for reshaping and manipulating data frames.

Say I have the stock market closing prices of two major companies for the last week, as follows:

Day	Google	Apple
MON	1129	191
TUE	1132	192
WED	1134	190
THU	1152	190
FRI	1152	188

For an analysis I want to do I need the names of the companies Google & Apple to appear in a single column with the stock price as another column, something like this:

Day	Company	Closing Price
MON	Google	1129
TUE	Google	1132
WED	Google	1134
THU	Google	1152
FRI	Google	1152
MON	Apple	191
TUE	Apple	192
WED	Apple	190
THU	Apple	190
FRI	Apple	188

This is exactly where Melt comes into picture. Melt is used for converting multiple columns into a single column, which is exactly what I need here.
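Before we get to pandas, it may help to see the idea in plain Python. The melt_rows function below is a hypothetical sketch of what melting does to a list of rows. It is not how pandas implements it.

```python
def melt_rows(rows: list, id_var: str, var_name: str = "variable", value_name: str = "value") -> list:
    """Turn every non-id column into (id, column name, value) rows."""
    value_columns = [column for column in rows[0] if column != id_var]
    melted = []
    for column in value_columns:
        for row in rows:
            melted.append({id_var: row[id_var], var_name: column, value_name: row[column]})
    return melted


rows = [
    {"Day": "MON", "Google": 1129, "Apple": 191},
    {"Day": "TUE", "Google": 1132, "Apple": 192},
]
melted_rows = melt_rows(rows, id_var="Day", var_name="Company", value_name="Closing Price")
```

Two rows with two value columns become four rows. Each one carries the Day, the name of the column it came from, and the value.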

Let's see how we can do this.

Melt

First we need to import pandas.

import pandas as pd

Then, we'll create the Dataframe with the data.

df = pd.DataFrame(data = {
    'Day' : ['MON', 'TUE', 'WED', 'THU', 'FRI'], 
    'Google' : [1129,1132,1134,1152,1152], 
    'Apple' : [191,192,190,190,188] 
})

And this will get us the dataframe we need as follows:

	Day	Google	Apple
0	MON	1129	191
1	TUE	1132	192
2	WED	1134	190
3	THU	1152	190
4	FRI	1152	188

Let's melt this now. To melt this dataframe, you call the melt() method on the dataframe with the id_vars parameter set.

reshaped_df = df.melt(id_vars=['Day']) # id_vars is the column you do not want to change

And you're done. Your reshaped_df would look like this now.

	Day	variable	value
0	MON	Apple	191
1	TUE	Apple	192
2	WED	Apple	190
3	THU	Apple	190
4	FRI	Apple	188
5	MON	Google	1129
6	TUE	Google	1132
7	WED	Google	1134
8	THU	Google	1152
9	FRI	Google	1152

The id_vars you've passed into the melt() method is to specify which column you want to leave untouched. Since we want the Day column to stay the same even after the melt, we set id_vars=['Day'].

Also, you would have noticed that the output dataframe of melt has the columns variable and value. These are the default names given by pandas for the columns. We can change this either manually with something like

reshaped_df.columns = ['Day', 'Company', 'Closing Price']

Or, we can specify the values for these columns in the melt() itself. Melt takes arguments var_name and value_name apart from id_vars. These options specify the names for the variable column and the value column respectively.

reshaped_df = df.melt(id_vars=['Day'], var_name='Company', value_name='Closing Price')

That will give us:

	Day	Company	Closing Price
0	MON	Apple	191
1	TUE	Apple	192
2	WED	Apple	190
3	THU	Apple	190
4	FRI	Apple	188
5	MON	Google	1129
6	TUE	Google	1132
7	WED	Google	1134
8	THU	Google	1152
9	FRI	Google	1152

Unmelt/Reverse Melt/Pivot

We can also do the reverse of the melt operation, which is called pivoting. In pivoting, or reverse melting, we convert a column with multiple values into several columns of their own.

The pivot() method on the dataframe takes two main arguments: index and columns. The index parameter is similar to the id_vars we have seen before, i.e., it specifies the column you don't want to touch. The columns parameter specifies which column should be used to create the new columns.

reshaped_df.pivot(index='Day', columns='Company')

Running the above command gives you the following:

+---------+-----------------------+
|         |     Closing Price     |
+=========+:=============:+:=====:+
| Company |     Google    | Apple |
+---------+---------------+-------+
|   Day   |               |       |
+---------+---------------+-------+
|   MON   |     1129      |  191  |
+---------+---------------+-------+
|   TUE   |     1132      |  192  |
+---------+---------------+-------+
|   WED   |     1134      |  190  |
+---------+---------------+-------+
|   THU   |     1152      |  190  |
+---------+---------------+-------+
|   FRI   |     1152      |  188  |
+---------+---------------+-------+

# (Showing in textual format as multi-level columns are not possible in Markdown)

This is close, but probably not exactly what you wanted. The Closing Price is an extra stacked column (index) on top of Google & Apple. So to get exactly the reverse of melt and get the original df dataframe we started with, we do the following:

original_df = reshaped_df.pivot(index='Day', columns='Company')['Closing Price'].reset_index()
original_df.columns.name = None

And that gets us back to what we have started with.

Day	Google	Apple
MON	1129	191
TUE	1132	192
WED	1134	190
THU	1152	190
FRI	1152	188
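The same idea can be sketched in plain Python. Again, this only illustrates the concept and is not pandas' implementation; pivot_rows and the list-of-dicts input are hypothetical names I made up for this example.

```python
def pivot_rows(rows, index, columns, values):
    """Spread the distinct entries of `columns` into columns of their own."""
    wide = {}  # index value -> wide row being built (dicts keep insertion order)
    for row in rows:
        wide_row = wide.setdefault(row[index], {index: row[index]})
        wide_row[row[columns]] = row[values]
    return list(wide.values())


long_rows = [
    {"Day": "MON", "Company": "Google", "Closing Price": 1129},
    {"Day": "TUE", "Company": "Google", "Closing Price": 1132},
    {"Day": "MON", "Company": "Apple", "Closing Price": 191},
    {"Day": "TUE", "Company": "Apple", "Closing Price": 192},
]
wide_rows = pivot_rows(long_rows, index="Day", columns="Company", values="Closing Price")
print(wide_rows)
```

One difference worth knowing: if the same (Day, Company) pair appears twice, this sketch silently keeps the last value, whereas pandas' pivot() raises a ValueError about duplicate entries.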

That is all for this article. I hope this was useful for you and that you'll try to use this in your data processing workflow.

For more programming articles, check out Freblogg, Freblogg/Java, Freblogg/Spark

Thanks for reading. See you again in the next article.

Stash uncommitted changes with Git Stash

2018-01-13T09:00:00+05:30

You are in the middle of developing a feature and suddenly your manager tells you to work on an urgent fix for a production bug! You want to create a new branch for the fix, but Git may not let you switch while you have uncommitted changes. How can you switch to a new branch without losing your local uncommitted changes? Git Stash to your rescue.

Say, I’ve two commits in my git repository:

$ git log --oneline --decorate --graph
* 10c532b (HEAD -> master) Add File2   <- Commit #1
* d19fe8d Add File1                    <- Commit #2

And I’ve those two files file1 and file2 in my directory.

$ ls
file1  file2

After I have added file2, I've made some changes to it that have not been committed yet, as you see from the output of git status below:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   file2

no changes added to commit (use "git add" and/or "git commit -a")

At this stage, say I have to switch to a different branch or a different commit, I usually have two options. Either commit these changes or lose them by switching to the other commit. As I don’t want to pick either of those options, I will go with the third option available, which is Stashing.

Git stash, as the name indicates, lets you stash-away some changes temporarily. You can think of stashes as being "temporary commits".

You can stash your changes with the following command:

git stash save "Changes in file2"

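To git stash, I pass in the save command along with a message. This message is similar to a commit message: it lets you identify a particular stash. (Newer versions of Git deprecate git stash save in favor of git stash push -m "message", which does the same thing here.)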

You can do the same with just git stash as well. But with that, you will not be able to provide a stash message.

You can see the existing list of stashes with the list command:

$ git stash list
stash@{0}: On master: Changes in file2

That is the stash I’ve just created, and you can see the branch name master and the stash message I have given as well. The stash@{0} is the identifier for your stash.

Now that we have created the stash, to use it, we have two options.

apply — Adds the changes in stash to working directory but keeps the stash
pop — Adds the changes in stash to working directory and deletes the stash

To apply a stash:

git stash apply stash@{0}

This will still keep the stash, and you will see it in the output of the list command.

To pop a stash:

git stash pop stash@{0}

If you want to delete the stash, you can do:

git stash drop stash@{0}

And that will delete the stash.
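If you want a mental model for these commands, stashes behave like a stack where stash@{0} is always the most recent one. The toy class below mimics save, list, apply, pop and drop on a plain Python list. StashStack is a made-up name, and this is only a sketch of the behaviour, not how Git actually stores stashes.

```python
class StashStack:
    """Toy model of the stash list; index 0 is the newest stash, like stash@{0}."""

    def __init__(self):
        self._stashes = []

    def save(self, message, changes):
        # New stashes go to the front, so older ones shift to stash@{1}, stash@{2}, ...
        self._stashes.insert(0, (message, changes))

    def list(self):
        return [f"stash@{{{i}}}: {msg}" for i, (msg, _) in enumerate(self._stashes)]

    def apply(self, index=0):
        # Returns the changes but keeps the stash in the list.
        return self._stashes[index][1]

    def pop(self, index=0):
        # Returns the changes and removes the stash.
        return self._stashes.pop(index)[1]

    def drop(self, index=0):
        # Removes the stash without returning its changes.
        del self._stashes[index]


stashes = StashStack()
stashes.save("Changes in file2", {"file2": "edited text"})
print(stashes.list())
applied = stashes.apply(0)   # the stash is still listed after this
popped = stashes.pop(0)      # the stash is gone after this
print(stashes.list())
```

The difference between apply and pop is just whether the entry survives the call, which is exactly the distinction between git stash apply and git stash pop.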

Git stashes are a great way to quickly stow away your uncommitted changes for later use. Try them out; they can be a really useful tool in your development workflow. One way in which I use stashes is to make the same change on multiple branches: I stash the necessary changes and then apply the stash on each of the branches. Pretty neat!

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Git

Thanks for reading. See you again in the next article.

Image attribution:

Git Logo - Git Logo by Jason Long is licensed under the Creative Commons Attribution 3.0 Unported License.

Build A Web Application With Flask In Python Part I

2018-01-12T23:54:00+05:30

Flask is a popular micro web application framework for Python. Unlike other popular frameworks such as Django, Flask keeps its footprint to a minimum, providing only the basic functionality required instead of picking out the entire stack for you the way Django does. That is why we call it a micro framework. With extensibility at its core, Flask lets you build any type of application by picking the components you want to use. Several big-name companies like LinkedIn and Pinterest use Flask for their products.

In this tutorial we will get started with using Flask and create a simple web application with it.

Prerequisites

To follow along with this series you should have some knowledge of the Python language. I'm using Python 3.6 for these tutorials, and if you would like to follow along without any issues, I would suggest you use the same version. For earlier versions, there might be a couple of changes in the syntax, but the ideas and concepts will remain the same.

You will also need to install Flask. You can do that with pip.

pip install -U flask

This will install Flask if you don't already have it, and update it to the latest version if you have an older version installed.

With those two things, you are good to go.

Getting Started

Just like with anything else you start by importing the stuff you want.

from flask import Flask

And this will make Flask ready for you to use. After this you have to create an app object by calling the Flask constructor like this:

app = Flask("hello")

This will create our app object. The name hello I've specified in the constructor can be anything, but the usual convention is to pass __name__, which evaluates to "__main__" when the file is run directly. Also, app is just a variable, so you can name it anything you want.

Next you have to define the routes. Using routes, you configure your server to do different actions. Let's say you type some website URL into your browser; you will be taken to its home page. Now if you go to <website>/info, it will take you to the info page. This mapping of the /info URL to the info page is what we call a route. For the home page, the route is simply /.

Let's say we want our server's homepage to display Hello World. You can configure that with a method like this:

@app.route('/')
def index():
    return "Hello World"

With @app.route('/'), we are defining a route on our server. So, whenever somebody opens that route, which for us is the homepage, the index() function associated with that route decorator will be called. And when index() is called, it will return Hello World, just as we expect it to.
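To get a feel for how a route decorator can map a URL to a function, here is a tiny stand-alone sketch that needs nothing but the standard library. TinyApp, its handle() method and the /info page are all hypothetical names I'm using for illustration; this is not Flask's actual implementation, just the general idea of a URL-to-function registry.

```python
class TinyApp:
    """Hypothetical stand-in for an app object; it only stores URL-to-function mappings."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        # route(path) returns the real decorator, which registers the function.
        def decorator(func):
            self.routes[path] = func
            return func
        return decorator

    def handle(self, path):
        # Look up the function registered for this path and call it.
        view = self.routes.get(path)
        if view is None:
            return "404 Not Found"
        return view()


app = TinyApp()


@app.route("/")
def index():
    return "Hello World"


@app.route("/info")
def info():
    return "Info page"


print(app.handle("/"))
print(app.handle("/info"))
```

A real framework does a lot more (HTTP parsing, URL parameters, responses), but the decorator-as-registration trick is the part that makes the route syntax read so nicely.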

And there is one final command to start and run our server which is:

app.run(debug=True)

And that's it. This will run the app that we have created when you run the python file. The debug=True option is useful while developing and testing applications. So, we'll keep that for now.

Just run your Python script and you should see output like this on the console:

* Debugger is active!
* Debugger PIN: 127-398-124
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Now if you go to http://localhost:5000, you can see Hello World displayed.

That's it. You have successfully created your first web application with Flask in just a handful of lines of code. Now, that is awesome. Stay tuned for the next part.

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Python

Some articles on automation:

Web Scraping For Beginners with Python

My semi automated workflow for blogging

Publish articles to Blogger automatically

Publish articles to Medium automatically

This is the 21st article as part of my twitter challenge #30DaysOfBlogging. Nine more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

If you are interested in contributing to open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and by identifying bugs in its functionality.

Thanks for reading. See you again in the next article.

Json Parsing With Python

2018-01-10T23:56:00+05:30

JSON has become a ubiquitous data exchange format. Pretty much every service has a JSON API. And since it is so popular, most programming languages have built-in JSON parsers. Of course, Python is no exception. In this article, I'll show you how you can parse JSON with Python's json library.

JSON parsing in Python is quite straightforward, unlike in some languages where it is unnecessarily cumbersome. Like everything else in Python, you start by importing the library you want.

import json

In this article, I am going to use the following JSON I got from json.org

{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
      ]
    }
  }
}

We have got a good set of dictionaries and arrays to work with in this data. If you want to follow along, you can use the same JSON or you can use anything else as well.

The first thing to do is to get this json string into a variable.

json_string = """{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}"""

And now we parse this string into a dictionary object with the help of the json library's loads() method.

json_dict = json.loads(json_string)

And you're done. The JSON is parsed and stored in the json_dict object. The json_dict here is a Python dictionary object. If you want to verify, you can do that by calling type() on it:

print(type(json_dict))

And it will show that it is <class 'dict'>.

Getting back: we have the entire JSON object as a dictionary in json_dict, and you can drill down into the dictionary with the keys. At the top level, we have just one key in the dictionary, which is menu. We can get its value by indexing the dictionary with that key.

menu = json_dict['menu']

And of course menu is a dictionary too with the keys id, value, and popup. We can access them and print them as well.

print(menu['id'])            ## => 'file'
print(menu['value'])         ## => 'File'

And then finally we've got popup which is another dictionary as well with the key menuitem which is a list. We can verify this by checking the types of these objects.

popup = menu['popup']
print(type(popup))           ## => <class 'dict'>

menuitem = popup['menuitem']
print(type(menuitem))        ## => <class 'list'>

And since menuitem is a list, we can iterate over it and print the values.

for item in menuitem:
    print(item)

And the output is

{'value': 'New', 'onclick': 'CreateNewDoc()'}
{'value': 'Open', 'onclick': 'OpenDoc()'}
{'value': 'Close', 'onclick': 'CloseDoc()'}

And of course each of these elements is a dictionary, so you can go further inside and access its keys and values.

For example, if you want to access New from the above output, you can do this:

print(menuitem[0]['value'])  ## => New

And so on and so forth to get any value in the JSON.
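Here's a small self-contained sketch that puts the drilling-down together. It uses the same JSON as above; the handlers name and the shortcut key are just things I made up for this example (shortcut deliberately doesn't exist in the data, to show what happens with a missing key).

```python
import json

json_string = """{"menu": {"id": "file", "value": "File", "popup": {"menuitem": [
    {"value": "New", "onclick": "CreateNewDoc()"},
    {"value": "Open", "onclick": "OpenDoc()"},
    {"value": "Close", "onclick": "CloseDoc()"}
]}}}"""

# Chain the keys to go straight to the list we care about.
menuitem = json.loads(json_string)["menu"]["popup"]["menuitem"]

# Map each menu label to the handler it triggers.
handlers = {item["value"]: item["onclick"] for item in menuitem}
print(handlers["Open"])

# .get() avoids a KeyError when a key might be missing.
print(menuitem[0].get("shortcut", "no shortcut"))
```

Since the parsed result is made of ordinary dicts and lists, all the usual Python tools like comprehensions and .get() work on it without anything JSON-specific.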

And not only that, the json library can also accept JSON responses from web services. One cool thing here is that web server responses are byte strings, which means that if you want to use them in your program you'd normally have to convert them to regular strings with the decode() method. But for json you don't have to do that. Since Python 3.6, json.loads() accepts byte strings (encoded as UTF-8, UTF-16 or UTF-32) directly and gives you a parsed object. That's pretty cool!
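A quick way to convince yourself, without calling any web service: the snippet below feeds a hand-written byte string to json.loads() (Python 3.6 or later). The response_body value is a stand-in I made up; a real one would come from whatever HTTP client you use.

```python
import json

# Simulated response body; a real one would come from an HTTP client.
response_body = b'{"menu": {"id": "file", "value": "File"}}'

parsed_from_bytes = json.loads(response_body)                  # bytes in, dict out
parsed_from_text = json.loads(response_body.decode("utf-8"))   # explicit decode also works

print(parsed_from_bytes["menu"]["id"])
print(parsed_from_bytes == parsed_from_text)
```

Both calls give the same dictionary, so the decode() step is optional for JSON.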

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Python

Some of my other articles on automation:

Web Scraping For Beginners with Python

My semi automated workflow for blogging

Publish articles to Blogger automatically

Publish articles to Medium automatically

This is the 19th article as part of my twitter challenge #30DaysOfBlogging. Eleven more articles on various topics, including but not limited to, Java, Git, Vim, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe on Medium and my blog as well.

Thanks for reading. See you again in the next article.

Writing Datasets to Disk | Datasets In Apache Spark III

2018-01-08T21:46:00+05:30

In the last tutorial we've seen how to create parametrized datasets. Once you create datasets and perform some operations on them, you would like to save those results back into storage. This is what we'll try to do in this article - Saving Datasets to storage.

The first thing we'll do as always is to create the spark-session variable.

// Initialize Sparksession
SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();

Using that session variable, we read the fake-people.csv file which has data like this:

id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,nmckirdy0@slideshare.net,Female,148.3.248.193
2,Nickey,Curreen,ncurreen1@tripadvisor.com,Male,206.9.48.216
3,Allayne,Chatainier,achatainier2@trellian.com,Male,191.118.4.217
...

We read this file into a dataset as follows:

// Read csv file
Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");

After we have the dataset, let's assume you've performed some operations on it: some column selections, some filtering, some sorting, etc. And we have a new dataset after all those operations.

// After performing several awesome operations
Dataset<Row> newDs = ....

We want to store this dataset back on the disk. We can do that with the write() method on the dataset, the counterpart of read() on the Spark session.

newDs.write().csv("processed-data");

The processed-data in the above command is not the name of the output CSV file but of the output directory. When you write a Dataset, Spark creates a directory with that name and stores the data inside it in the format you asked for, CSV in this case, along with some check files and status flags.

These are the files that get created in the processed-data folder.

$ ls ../../apache-spark/processed-data
_SUCCESS  part-00000-311049cf-3e48-4286-b93c-7d2096a18678-c000.csv

There are two more hidden CRC files that I'm not showing here. The part-00000-31hxxxxxxxxx.csv is the actual data file which has the data from the new dataset.

You can also create a json file by running

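newDs.write().json("processed-data");

And that will create a folder with the JSON file and the _SUCCESS file inside it. Note that, by default, Spark refuses to write into a directory that already exists, so pick a different directory name (or set a save mode) if processed-data is already there.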

You can also save this data to an external Database if you want to. You'll use the jdbc() method along with the connection string and the table name. And Spark will write it to the DB.

Apart from the CSV and JSON formats, there is one more popular data format in the Data Science and Big Data world: Parquet. Parquet is a columnar data format that is highly optimized and well suited for column-wise operations. It is widely used in a lot of projects in the Big Data ecosystem as a data serialization format, and in Spark, Parquet is the default file storage format. Of course, one main difference between Parquet and formats like CSV and JSON is that Parquet is not meant to be read by humans; it can only be read by a Parquet reader. A sample file looks something like this:

PAR1   ï¿½k ï¿½>, ï¿½          999  1     ï¿½5,   ï¿½      1   2   3   4   5   6   7   8   9  - 0   1   2   3   4   5   6   7   8   < 2 < 2 < 2 < 2 < 2 < 2 < 2 <
.....

Utter gibberish. But Spark can read and understand it. In fact, as Parquet is designed for speed and throughput, reading and writing it can be 10-100 times faster than with an ordinary data format like CSV or JSON, depending on the type of data.

You save a dataset to Parquet as follows:

newDs.write().parquet("processed");

And this will save the dataset as a parquet file along with the _SUCCESS status file.

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Java, Freblogg/Spark

Articles on Apache Spark:

Map Vs Flat map

Spark Word count with Java

Datasets in Spark | Part I

Datasets in Spark | Part II

This is the 17th article as part of my twitter challenge #30DaysOfBlogging. Thirteen more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

Thanks for reading. See you again in the next article.

Remove Duplicate Elements From An Array

2018-01-06T23:35:00+05:30

Interviews are a great place to learn about your strengths and weaknesses, which makes them a great way to improve oneself. In one of my interviews, I was asked to remove duplicate elements from an array. So, given the array a as below, I have to produce b.

a = {1, -2, 3, 1, 0, 9, 5, 6, 4, 5, 3, 1, 0}
b = {1, -2, 3, 0, 9, 5, 6, 4}

Here b has the same order of elements as a, but per the problem statement, it is not necessary to preserve the order.

I was flustered for a bit after getting the question. It took me a while to get to a proper solution, not before getting my first solution rejected for using HashMap which, apparently, I was not supposed to. I am attributing this mainly to the fact that I was told to write Java code on a piece of paper and not in an IDE. Anyway, I came home after that and decided to try it out and find what others have done online. That is what this article is about.

Since that particular interview was in Java, it is only fair that I use Java for the solution here, although I really wanted to do it in Python. Maybe some other time.

Approaches for solving the problem:

Approach #1

The most naive approach would be to just look through the entire array and compare each element to every other element to see if there's a duplicate. Of course, this is inefficient as its time complexity is O(n^2). So, let's skip this one and go to the next one.
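Just for completeness, here is roughly what the naive approach looks like. I'm sketching it in Python rather than Java, and remove_duplicates_naive is simply a name I picked for it.

```python
def remove_duplicates_naive(numbers):
    """O(n^2): for every element, scan the elements kept so far."""
    unique = []
    for number in numbers:
        seen_before = False
        for kept in unique:          # linear scan, no hashing involved
            if kept == number:
                seen_before = True
                break
        if not seen_before:
            unique.append(number)
    return unique


a = [1, -2, 3, 1, 0, 9, 5, 6, 4, 5, 3, 1, 0]
print(remove_duplicates_naive(a))
```

The inner loop is what makes this quadratic: in the worst case (all elements unique) every new element is compared against everything kept so far.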

Approach #2

Another approach is using a HashMap to keep track of elements. This is what I tried initially, but it was rejected because I used a HashMap when I wasn't supposed to. The pseudo code would be:

map = new Map // Create map
new_array = []

for number in numbers_array
  if not map.contains(number)
    map += number
    new_array += number

print(new_array)

Of course, since I wrote my implementation of this in Java, I had to make a few modifications, as you need to first define the size of the array and only then can you add elements to it. So, I added a count variable to count unique elements and then created a new array with that size after the iteration. This requires two loop iterations, but it is still O(n), which is fine. But alas, I couldn't use this.
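As a runnable version of that pseudo code, here is a sketch in Python using a set instead of a map (we only need to know whether a number was seen, not store anything against it). This is not my Java interview code, and remove_duplicates_hashed is just an illustrative name.

```python
def remove_duplicates_hashed(numbers):
    """O(n) on average: a set answers 'seen before?' in constant time."""
    seen = set()
    unique = []
    for number in numbers:
        if number not in seen:
            seen.add(number)
            unique.append(number)
    return unique


a = [1, -2, 3, 1, 0, 9, 5, 6, 4, 5, 3, 1, 0]
print(remove_duplicates_hashed(a))
```

Because Python lists grow on demand, there is no need for the count variable and second pass that the Java array version needed. This version also keeps the original order of the elements.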

And so, then comes my final approach.

Approach #3

The third solution is to first sort the array and then from the sorted array, remove duplicates. We can do this because the problem didn't want us to maintain the given input order. Otherwise, we wouldn't have been able to sort the array.

Sorting is easy enough. We just use the built-in sort method, which will sort the array in place.

Arrays.sort(numbers);

Then comes the major part which is removing the duplicates in that sorted array. We can accomplish that by using two pointers i, j on our array. i goes through the entire loop while j is the slow-moving pointer that only changes based on a condition.

int j = 0; // Slow moving index

// i is the fast moving index that loops through the entire array
for (int i = 1; i < numbers.length; i++) {
    if (numbers[i] != numbers[j]) {
        j++;
        numbers[j] = numbers[i];
    }
}

The j index is basically playing catch up with i. When there is a duplicate element, i moves ahead while j stays back at the first duplicate element and then with numbers[j] = numbers[i], we assign the next unique value to the j location. After this, our original array has unique elements till index j but after that, we'll have leftover elements. To take care of that, we can create a new array from the numbers array.

int[] result = Arrays.copyOf(numbers, j + 1);
System.out.println(Arrays.toString(result));

And that's it. This will remove all the duplicated elements from the array. To test it, let me run the code:

Input array: [1, -2, 3, 1, 0, 9, 5, 6, 4, 5]
Final result after removing duplicates: [-2, 0, 1, 3, 4, 5, 6, 9]

Sorting the array can be assumed to take O(n log n), and the iteration after that is O(n). Put together, the sort dominates and you get O(n log n) overall, which is a bit worse than the O(n) of the HashMap approach but needs no extra data structure. Of course, depending on the specific kind of array, the sort might take less time as well.
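And since I said I wanted to do it in Python, here is a sketch of the same sort-then-compact idea in Python. It mirrors the Java logic with the slow index j and the fast index i; the function name is my own, and it works on a sorted copy instead of sorting in place.

```python
def remove_duplicates_sorted(numbers):
    """Sort, then compact the unique values to the front using two indexes."""
    numbers = sorted(numbers)            # O(n log n); works on a copy
    if not numbers:
        return []
    j = 0                                # slow index: position of the last unique value
    for i in range(1, len(numbers)):     # fast index: visits every element
        if numbers[i] != numbers[j]:
            j += 1
            numbers[j] = numbers[i]
    return numbers[: j + 1]              # drop the leftover elements after j


print(remove_duplicates_sorted([1, -2, 3, 1, 0, 9, 5, 6, 4, 5]))
```

The explicit check for an empty input matters: without it, slicing up to j + 1 would return one element even though there was nothing to begin with.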

The full code is present as a gist:

Let me know if you have any more questions that need answers. That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Java

This is the 15th article as part of my twitter challenge #30DaysOfBlogging. Fifteen more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe here on Medium and my other blog as well.

If you are interested in contributing to open source projects and haven't found the right project, or if you are unsure how to begin, I would like to suggest my own project, Delorean, which is a distributed version control system built from scratch in Scala. You can contribute not only in the form of code but also with usage documentation and by identifying bugs in its functionality.

Thanks for reading. See you again in the next article.

Reduce Image Size With Python And Tinypng

2018-01-04T23:11:00+05:30

Whenever I want to upload images with my articles, I first make sure they have the right dimensions, then I check the file sizes, and if they are too big, I have to compress them. For this compression, I use TinyPNG. It compresses your images to a small size while keeping the image looking the same. I've tried some other services as well, but TinyPNG is definitely the best as its compression ratio is quite impressive.

In this article I'll show you how I'm planning to automate the image compression process using TinyPNG's developer API. And of course we are going to use Python.

Setting up

First of all, you need to have a developer key to connect to TinyPNG and use their services. So, go to Developer's API and enter your name and email.

Once you've registered, you'll get a mail from TinyPNG with a link, and once you click on that, you'll go to your developer page, which also has your API key and your usage information. Do keep in mind that with the free account, you can only compress 500 images per month. For someone like me, that's a number I won't really be reaching in a month anytime soon. But if you do, you should probably check out their paid plans.

PS: That's not my real key :D

Get started

Once you have the developer key, you can start compressing images using their service. The full documentation for Python is here.

You start by installing Tinify, which is TinyPNG's library for compression.

pip install --upgrade tinify

Then we can start using tinify in code by importing it and setting the API key from your developer's page.

import tinify
tinify.key = 'API_Key'

If you have to send your requests through a proxy, you can set that as well.

tinify.proxy = "http://user:pass@192.168.0.1:8080"

Then, you can start compressing your image files. You can upload either PNG or JPEG files and tinify will compress them for you.

For the purpose of this article, I'm going to use the following delorean.jpeg image.

And I'll compress this to delorean-compressed.jpeg. For that we'll use the following code:

source = "delorean.jpeg"
destination = "delorean-compressed.jpeg"

original = tinify.from_file(source)
original.to_file(destination)

And that gives me this file:

If they both look the same, then that is the magic of TinyPNG's compression algorithm. It looks pretty much identical but it did compress it. To verify that, let's print the file sizes.

import os.path as path

original_size = path.getsize(source)
compressed_size = path.getsize(destination)
# Sizes in KB, followed by the compression ratio
print(original_size/1024, compressed_size/1024, original_size/compressed_size)

And this prints,

29.0029296875 25.3466796875 1.144249662878058

The file was originally 29 KB, and after compression it is 25.3 KB, a reduction of about 13% (a ratio of 1.14), which is fairly good for such a small file. If the original file were bigger, you would probably see an even greater reduction.

And since this is the free version, there's a limit on the number of requests we can make. We can keep track of that with the built-in variable compression_count. You can print it after every request to make sure you don't go over the limit.
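If you'd rather have the savings as a percentage, the arithmetic is simple enough to wrap in a small helper. compression_stats is a hypothetical name of mine (it is not part of tinify), and the byte counts below are just the sizes printed above converted back from KB.

```python
def compression_stats(original_bytes, compressed_bytes):
    """Return (ratio, percent_saved) for a pair of file sizes in bytes."""
    ratio = original_bytes / compressed_bytes
    percent_saved = 100 * (original_bytes - compressed_bytes) / original_bytes
    return ratio, percent_saved


# 29699 and 25955 bytes are 29.0029296875 KB and 25.3466796875 KB respectively.
ratio, percent_saved = compression_stats(29699, 25955)
print(f"ratio {ratio:.3f}, saved {percent_saved:.1f}%")
```

In a real script you would pass in the two path.getsize() results instead of hard-coded numbers.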

compressions_this_month = tinify.compression_count
print(compressions_this_month)

You can also compress images from their URLs and store them locally. You just do:

original = tinify.from_url("https://raw.githubusercontent.com/durgaswaroop/delorean/master/delorean.jpeg")

And then you can store the compressed file locally just like before.

Apart from just compressing the images, you can also resize them with TinyPNG's API. We'll cover that in tomorrow's article here.

So, that is all for this article.

For more programming articles, check out Freblogg, Freblogg/Python

Some articles on automation:

Web Scraping For Beginners with Python

My semi automated workflow for blogging

Publish articles to Blogger automatically

Publish articles to Medium automatically

This is the 13th article as part of my twitter challenge #30DaysOfBlogging. Seventeen more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

Thanks for reading. See you again in the next article.

Datasets In Apache Spark | Part 2

2018-01-02T18:43:00+05:30

In the last two tutorials we have covered what Apache Spark is and also got ourselves familiar with Datasets, which are the primary data abstraction in Spark. In this tutorial we will see how to read a data file as a parametrized Bean object Dataset using Encoders.

This tutorial is going to be short, but it is very important as you will find yourself doing this frequently. In the last article you've seen how to read a CSV or JSON file as a Dataset. You might have noticed that we were using Dataset<Row> for everything. If you're not familiar with Generics in Java, Dataset<Row> can be thought of as a Dataset consisting of Row objects. The Row object is a Spark SQL class and is the default when creating a Dataset.

Although the Row class has some useful methods, as a generic object meant to fit all types, it is not the best fit for everything. Since Datasets usually store data that corresponds to a Bean class, it is better to create a Dataset of that bean class instead of Row. With this, you'll have access to all the usual getters and setters of the bean class. That's what we'll do in this article. We'll create a Dataset of POJOs instead of Row objects.

I'm using the same fake-people.csv file that I used in the last article that looks like this:

id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,nmckirdy0@slideshare.net,Female,148.3.248.193
2,Nickey,Curreen,ncurreen1@tripadvisor.com,Male,206.9.48.216
3,Allayne,Chatainier,achatainier2@trellian.com,Male,191.118.4.217
...

To represent this data, I've created a POJO called FakePeople.java, which looks like this:

import lombok.Data;

// Fields are non-final: Lombok's @Data only generates setters for non-final fields.
public @Data class FakePeople {
    private int id;
    private String firstName;
    private String lastName;
    private String email;
    private String gender;
    private String ipAddress;
}

I'm using Project Lombok here, to generate the required Getters, Setters and other POJO methods. (If you don't know about Lombok, you should definitely check that out. It is quite handy).

Now that we have our POJO, let's get a parametrized Dataset. To achieve this, we first need to create an Encoder. We do that for the FakePeople class as follows:

Encoder<FakePeople> fakePeopleEncoder = Encoders.bean(FakePeople.class);

This creates the encoder that will help us map our CSV data to FakePeople objects.

Of course we need our spark session variable as well.

// Initialize Sparksession
SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();

Now we can go ahead and read the CSV file, very much like the way we did before with just one addition.

// Without Encoder
Dataset<Row> people = spark.read().option("header", "true").csv("fake-people.csv");

// With Encoder
Dataset<FakePeople> people = spark.read().option("header", "true").csv("fake-people.csv").as(fakePeopleEncoder);

And the output of people.show(5) is the same as what you'd expect.

+---+----------+----------+--------------------+------+--------------+
| id|first_name| last_name|               email|gender|    ip_address|
+---+----------+----------+--------------------+------+--------------+
|  1|     Netti|   McKirdy|nmckirdy0@slidesh...|Female| 148.3.248.193|
|  2|    Nickey|   Curreen|ncurreen1@tripadv...|  Male|  206.9.48.216|
|  3|   Allayne|Chatainier|achatainier2@trel...|  Male| 191.118.4.217|
|  4|     Tades|    Emmett|temmett3@barnesan...|  Male|153.113.87.195|
|  5|     Shawn|    McGenn|smcgenn4@shop-pro.jp|  Male|  247.45.80.68|
+---+----------+----------+--------------------+------+--------------+

As you can see the only difference in creating the Dataset is .as(fakePeopleEncoder) and that gets us Dataset<FakePeople> instead of Dataset<Row>. And with that, we now have access to all the getters, setters of FakePeople class which we wouldn't otherwise have with a Row object. We'll explore more about how this is useful in a future tutorial.

For more information on Datasets: Spark SQL, DataFrames and Datasets Guide

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Java, Freblogg/Spark

Apache Spark articles:

Word count with Apache Spark and Java

Datasets in Apache Spark | Part 1

Datasets in Apache Spark | Part 2

This is the 11th article as part of my twitter challenge #30DaysOfBlogging. Nineteen more articles on various topics, including but not limited to Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

Thanks for reading. See you again in the next article.

My (Almost) Fully Automated Blogging Workflow

2017-12-31T18:07:00+05:30

In the article My semi automated workflow for blogging, I have outlined what my blogging process is like and how I've started to automate it. Of course, at the time of that article, the process was still in its early stages and I hadn't automated everything I do. And that's where this article comes in. This is the second attempt at automating my entire blogging workflow.

Just to give you some context, here are the things that I do when I'm blogging.

Open a markdown file in Vim with the title of the article as the name along with some template text
Open a browser with the html of the newly created markdown file
Convert markdown to html with pandoc several times during the writing process
Once the article is done and html is produced, edit the html to make some changes depending on whether I'm publishing on Medium or on Blogger
Read the tags/labels and other attributes from the file and Publish the code as draft on Medium or Blogger.
Once it looks good, Schedule or Publish it (This is a manual process. There's no denying it.)
Finally tweet about the post with the link to the article

I have the individual pieces of this process ready. I have already written about them in the following articles.

Semi Automated Blogging Workflow

Publish Articles To Blogger In One Second

Publish Articles To Medium In One Second

Tweeting With Python & Tweepy

Now, since the individual pieces are ready, it might seem that everything is done. But, as it turns out (unsurprisingly), the integration is a big deal and took a lot more effort than I was expecting. I am documenting that in this article along with the complete flow.

It starts with the script blog-it which opens vim for me, opens chrome and also sets up a process for converting markdown to html, continuously.

That script calls blog.py which is what opens the vim along with the default text template. I would like to put the complete gist here, but it is just too long and so instead I'm showing the meat of the script.

# Replace underscores with spaces and title-case the result
article_title = title.replace("_", " ").title()

# Create the markdown file and add the title
with open(md_file, "w+") as f:
    f.write(generate_comments_header(article_title))
    f.write(article_title)
    f.write("\n")
    f.write("-" * len(title))  # Underline, same length as the title
    f.write("\n")
    f.write(generate_footer_text())

# Now, create the html file
html_file = title + ".html"
open(html_file, "w").close()

# Start vim with the markdown file open on line #10
subprocess.run(['C:/Program Files (x86)/Vim/vim80/gvim.exe', '+10', md_file])
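To make the title handling concrete, here is a minimal sketch of just that step in isolation. The function name markdown_title_block is made up for this example and is not part of blog.py.

```python
def markdown_title_block(file_stem):
    """Build a markdown heading (title plus dashed underline) from a file name stem."""
    # "sessions_in_vim" -> "Sessions In Vim"
    article_title = file_stem.replace("_", " ").title()
    # A row of dashes under a line turns that line into a heading in markdown
    underline = "-" * len(article_title)
    return article_title + "\n" + underline + "\n"


print(markdown_title_block("sessions_in_vim"))
```

One thing to keep in mind: str.title() capitalizes the letter after every non-letter character, so a name like "don't_panic" would come out as "Don'T Panic".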

Then comes m2h which continuously converts markdown to html.
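I am not reproducing m2h here, but the idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it polls the file's modification time and shells out to pandoc, and the names needs_conversion, convert and watch are hypothetical, not taken from the actual script.

```python
import os
import subprocess
import time


def needs_conversion(md_path, html_path):
    """Return True when the html is missing or older than the markdown."""
    if not os.path.exists(html_path):
        return True
    return os.path.getmtime(md_path) > os.path.getmtime(html_path)


def convert(md_path, html_path):
    # pandoc infers the input and output formats from the file extensions
    subprocess.run(["pandoc", md_path, "-o", html_path], check=True)


def watch(md_path, html_path, interval=2.0):
    """Re-run pandoc whenever the markdown file is newer than the html file."""
    while True:
        if needs_conversion(md_path, html_path):
            convert(md_path, html_path)
        time.sleep(interval)


# Example (runs until interrupted):
# watch("my_post.md", "my_post.html")
```

Polling every couple of seconds is crude, but it needs nothing beyond the standard library and pandoc on the PATH.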

This ends one flow. Next comes publishing. I have broken this down because publishing is a manual process for me unless I can complete the entire article in one sitting, which is never going to be possible. So, once I'm done writing it, I'll start the publishing.

I'll run publish.py which depending on the comments in the html publishes it to either Blogger or Medium. Again, I'm only showing a part of it. The full gist is available here.

with open(html_file) as file:
    html_file_contents = file.read()

re_comments = re.compile(r'\s*<!--(.*)-->', re.DOTALL)
comments_text = re_comments.search(html_file_contents).group(1).strip()
comments_parser = CommentParser.parse_comments(comments_text)

if comments_parser.destination.lower() == 'blogger':
    blogger_publish.publish(html_file, comments_parser.title, comments_parser.labels, comments_parser.post_id)
elif comments_parser.destination.lower() == 'medium':
    medium_publish.publish(html_file, comments_parser.title, comments_parser.labels)
else:
    print(
        'Unknown destination: ' + comments_parser.destination + '. Supported destinations are Blogger and Medium.')
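The CommentParser itself is not shown above. As a rough idea of what such a parser could look like, here is a minimal sketch. It assumes the header comment holds one key: value pair per line; that format, the field names and the sample html are all hypothetical and may not match the real gist.

```python
import re
from types import SimpleNamespace

# Non-greedy, so only the first comment block in the file is captured
COMMENT_RE = re.compile(r'\s*<!--(.*?)-->', re.DOTALL)


def parse_comments(html_text):
    """Read 'key: value' lines from the first html comment into an object."""
    match = COMMENT_RE.search(html_text)
    if match is None:
        raise ValueError("no comment header found")
    fields = {}
    for line in match.group(1).strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip().lower()] = value.strip()
    labels = [label.strip() for label in fields.get("labels", "").split(",") if label.strip()]
    return SimpleNamespace(
        title=fields.get("title", ""),
        destination=fields.get("destination", ""),
        labels=labels,
        post_id=fields.get("post_id") or None,
    )


sample = """<!--
title: Sessions in Vim
destination: Blogger
labels: vim, productivity
-->
<h1>Sessions in Vim</h1>"""

parsed = parse_comments(sample)
print(parsed.destination, parsed.labels)
```

Keeping the metadata inside an html comment means the same file can be previewed in a browser and fed to the publishing script without any extra sidecar file.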

Then come the individual publishing scripts that publish to Blogger and Medium.

For blogger-publish.py (Gist here), I do any required modifications with blogger_modifications.py (Gist here), which converts some tags to what my Blogger page expects.

Then for medium-publish.py (Gist here), I take the parameters and publish to Medium as html. No modifications need to be done here.

access_token_file = '~/.medium-access-token'
expanded_path = os.path.expanduser(access_token_file)
with open(expanded_path) as file:
  access_token = file.read().strip()

headers = get_headers(access_token)
user_url = get_user_url(headers)

# Publish new post
posts_url = user_url + 'posts/'
payload = generate_payload(title, labels, html_file)
response = requests.request('POST', posts_url, data=payload, headers=headers)

Actually, this publishing step sends the article to the site as a draft instead of actually publishing it. The final publish is a step that I don't know how to automate, as I have to manually take a look at how the article looks in preview. Maybe I should try doing this with selenium or something like that.

Once I've verified that the post looks good, I will publish it, take the URL of the published article and call tweeter.py (Gist here), which opens a Vim file with some default text for the title and URL already filled in, along with some hashtags. I'll complete the tweet and, once I close the file, it gets published on Twitter.

And that completes the process. Obviously there are still a couple of manual steps. Although I can't eliminate all of them, I might be able to minimize them. But so far it looks pretty good, especially with just the little effort I've put into this in just one week. Of course, I'll keep on tuning it as needed to make it even better, and maybe I'll publish one final article for that.

That is all for this article.

For more programming articles, check out Freblogg, Freblogg/Python

Some articles on automation:

Web Scraping For Beginners with Python

My semi automated workflow for blogging

Publish articles to Blogger automatically

Publish articles to Medium automatically

This is the 9th article as part of my twitter challenge #30DaysOfBlogging. Twenty one more articles on various topics, including but not limited to, Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

Thanks for reading. See you again in the next article.

Publish Articles To Medium In One Second

2017-12-29T19:57:00+05:30

In my article My semi automated workflow for blogging, I have talked about my blogging workflow. There were two main things (actually one thing) in that flow that were not automated: uploading to Blogger and uploading to Medium. I have talked about the first one here. This article is about uploading posts to Medium automatically.

Developer documentation for Medium is a breath of fresh air after the mess that is Google’s APIs. Of course, Google’s APIs are complex because they cover so many different services, but they could’ve done a better job at organizing all that stuff. Anyway, let’s see how you can use Medium’s APIs.

Setting Up

We don’t really need any specific dependencies for what we’re doing in this article. You can do everything with urllib, which is already part of the python standard library. I’ll be using requests as well to make it a bit simpler, but you can achieve the same without it.

Getting the access token

To authenticate yourself with Medium, you need to get an access token that you’ll pass along to every request. There are two ways to get that token.

Browser-based authentication
Self-issued access tokens

Which one you should go with depends on what kind of application you’re trying to build. As you can probably guess from the title, we’ll be covering the second method in this article. The first method needs an authentication server that can accept a callback from Medium. Since I don’t have that set up at the moment, I’m going with the second option.

The Self-issued access tokens method is quite easy to work with as you directly take the access token without having to have the user authenticate via the browser.

To get the access token, go to Profile Settings and scroll down till you see the Integration tokens section.

There, enter a description of what you’re going to use this token for and click on Get integration token. Copy the generated token, which looks something like 181d415f34379af07b2c11d144dfbe35d, and save it somewhere to be used in your program.

Using Access token to access Medium

Once you have the access token, you’ll use that token as your password and send it along with every request to get the required data.

Let’s get started then. As I’ve said, we’ll be using the requests library for url connections. We’ll also be using the json library for parsing the responses. So, let’s import them.

import requests
import json

Then take the access_token you’ve got and put it in a headers dictionary.

access_token = '181d415f34379af07b2c11d144dfbe35d'
headers = {
    'Authorization': "Bearer " + access_token,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}

The User-Agent in the above dictionary is required as Medium won’t accept your request otherwise. You don’t have to have the same value as I did.
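If you would rather not hard-code the token in the script, a small helper can read it from a file and build the headers. This is only a sketch: the helper names, the token file location and the default User-Agent value are my assumptions, so swap in whatever works for you.

```python
import os

# Hypothetical default; use whichever browser-like value Medium accepts for you
DEFAULT_USER_AGENT = "Mozilla/5.0"


def read_access_token(path="~/.medium-access-token"):
    """Read the token from a file so that it never lands in the source code."""
    with open(os.path.expanduser(path)) as token_file:
        return token_file.read().strip()


def get_headers(access_token, user_agent=DEFAULT_USER_AGENT):
    """Build the headers dictionary that goes with every request."""
    return {
        "Authorization": "Bearer " + access_token,
        "User-Agent": user_agent,
    }


headers = get_headers("not-a-real-token")
print(headers["Authorization"])
```

In a real script you would call get_headers(read_access_token()) instead of passing a literal string.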

Validating the access token

First thing to check is if the access_token is valid. You can do that by making a GET request to https://api.medium.com/v1/me and checking the response.

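import urllib.request as ureq  # assumed: ureq is an alias for urllib.request

base_url = 'https://api.medium.com/v1/'  # base URL for all Medium API calls

me_url = base_url + 'me'
me_req = ureq.Request(me_url, headers=headers)
me_response = ureq.urlopen(me_req).read()
json_me_response = json.loads(me_response)
print(json_me_response)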

And when I print json_me_response, which is the JSON response parsed into a Python dictionary, I get the following:

{
"data": {
  "id":"5303d74c64f66366f00cb9b2a94f3251bf5adskak7623as",
  "username":"durgaswaroop", 
  "name":"Durga swaroop Perla", 
  "url":"https://medium.com/@durgaswaroop",
  "imageUrl":"https://cdn-images-1.medium.com/fit/c/400/400/0*qVDXEHT9DDYUOcrj."
  }
}

If we get a response like the one above, then we know that the access token we have is valid.

From there, I extract the user_id from the parsed response with

user_id = json_me_response['data']['id']
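A bad token will not give you a data key at all, so it helps to fail with a readable message instead of a KeyError. The sketch below assumes the error body is a JSON object with an errors list whose entries carry a message field; treat that shape, the sample strings and the function name as assumptions and check them against a real failed response.

```python
import json


def extract_user_id(response_text):
    """Return the user id, or raise a readable error if the API reported one."""
    body = json.loads(response_text)
    if "errors" in body:
        # Assumed shape: {"errors": [{"message": "...", "code": ...}]}
        messages = ", ".join(error.get("message", "unknown error") for error in body["errors"])
        raise ValueError("Medium rejected the request: " + messages)
    return body["data"]["id"]


ok_response = '{"data": {"id": "abc123", "username": "durgaswaroop"}}'
bad_response = '{"errors": [{"message": "Token was invalid.", "code": 6003}]}'

print(extract_user_id(ok_response))
```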

Get User’s Publications

From the above request, we’ve validated that the access token is correct and we also have got the user_id. Using that we can get access to the publications of a user. For that, we’ve to make a GET to https://api.medium.com/v1/users/{{userId}}/publications and you’ll see the list of the publications by that user.

user_url = base_url + 'users/' + user_id + '/'
publications_url = user_url + 'publications/'
publications_req = ureq.Request(publications_url, headers=headers)
publications_response = ureq.urlopen(publications_req).read()
print(publications_response)

I don’t have any publications on my medium account, and so I got an empty array as response. But, if you have some publications, the response will be something like this.

{
  "data": [
    {
      "id": "b969ac62a46b",
      "name": "About Medium",
      "description": "What is this thing and how does it work?",
      "url": "https://medium.com/about",
      "imageUrl": "https://cdn-images-1.medium.com/fit/c/200/200/0*ae1jbP_od0W6EulE.jpeg"
    },
    {
      "id": "b45573563f5a",
      "name": "Developers",
      "description": "Medium’s Developer resources",
      "url": "https://medium.com/developers",
      "imageUrl": "https://cdn-images-1.medium.com/fit/c/200/200/1*ccokMT4VXmDDO1EoQQHkzg@2x.png"
    }
  ]
}
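Printing the raw response is fine for a quick look, but usually you want just a few fields out of it. Here is a small sketch that pulls the publication names out of a response shaped like the one above; the function name and the trimmed-down sample are made up for illustration.

```python
import json


def publication_names(response_text):
    """Return the names of all publications in a publications response."""
    body = json.loads(response_text)
    # An account without publications gives an empty list here
    return [publication["name"] for publication in body.get("data", [])]


sample_response = """
{
  "data": [
    {"id": "b969ac62a46b", "name": "About Medium"},
    {"id": "b45573563f5a", "name": "Developers"}
  ]
}
"""

print(publication_names(sample_response))
```

json.loads accepts both str and bytes (Python 3.6 onwards), so the bytes returned by urlopen(...).read() can be passed in directly.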

Now, one weird thing about Medium’s API is that it doesn’t have a GET for posts. From the API we can get a list of all the publications, but we can’t get a user’s posts. You can only publish a new post. Although it is odd for that to be missing, it is not something I’m looking for anyway, as I am only interested in publishing an article. But if you need that, you should probably check to see if there are any hacky ways of achieving the same (at your own risk).

Create a New Post

To create a new post, we have to make a POST request to https://api.medium.com/v1/users/{{authorId}}/posts. The authorId here would be the same as the userId of the user whose access-token you have.

I’m using requests library for this as making a POST request becomes easy with it. Of course, first you need to create a payload to be uploaded. The payload should look something like the following, as described here

    {
      "title": "Liverpool FC",
      "contentFormat": "html",
      "content": "<h1>Liverpool FC</h1><p>You’ll never walk alone.</p>",
      "tags": ["football", "sport", "Liverpool"],
      "publishStatus": "public"
    }

So, for this, I did the following:

posts_url = user_url + 'posts/'

payload = {
    'title': 'Medium Test Post',
    'contentFormat': 'markdown',
    'tags': ['medium', 'test', 'python'],
    'publishStatus': 'draft',
    'content': open('7.Test_post.md').read()
}

response = requests.request('POST', posts_url, data=payload, headers=headers)
print(response.text)

As you see, for contentFormat, I’ve set markdown and for content I read it straight from the file. I didn’t want to publish this as it is just a dummy post and so I’ve set the publishStatus to draft. And sure enough, it works as expected and I can see this draft added on my account.

Do note that the title in the payload object won’t actually be the title of the article. If you want to have a title, you add it in the content itself as a <h*> tag.
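Since the title has to live inside the content, it is handy to build the payload in one place. The following is a minimal sketch and not the code from my gist: the function name, its parameters and the choice to default to a draft are all assumptions for illustration.

```python
import html


def generate_payload(title, labels, html_content, publish_status="draft"):
    """Build a post payload with the title repeated as an <h1> in the content."""
    # Escape the title so characters like & or < don't break the markup
    heading = "<h1>" + html.escape(title) + "</h1>\n"
    return {
        "title": title,
        "contentFormat": "html",
        "content": heading + html_content,
        "tags": list(labels),
        "publishStatus": publish_status,
    }


payload = generate_payload("Medium Test Post", ["medium", "test", "python"], "<p>Hello</p>")
print(payload["content"])
```

Defaulting to draft keeps an accidental run from publishing anything publicly; pass publish_status="public" once you trust the output.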

The full code is available as a gist.

That is all for this article.

For more programming and Python articles, check out Freblogg and Freblogg/Python

Some articles on automation:

Web Scraping For Beginners with Python

My semi automated workflow for blogging

This is the seventh article as part of my twitter challenge #30DaysOfBlogging. Twenty-three more articles on various topics including but not limited to Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you’re at it, go ahead and subscribe to this blog and my blog on Medium as well.

If you are interested in contributing to any open source projects and haven’t found the right project or if you were unsure on how to begin, I would like to suggest my own project, Delorean, which is a Distributed Version control system, built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and also by identifying any bugs in the functionality.

Thanks for reading. See you again in the next article.

Datasets in Apache Spark | Part 1

2017-12-27T20:00:00+05:30

In my previous post I have talked about Apache Spark. We have also built an application for counting the number of words in a file, which is the hello world equivalent of the big data world.

It has been over 18 months since that article and Spark has changed quite a lot in this time. A new major release, Spark 2.0, came out and the latest version is now 2.2.1. And with a new version come new APIs and improvements. In fact, the first thing you’ll probably notice is that you don’t need to create SparkContext or JavaSparkContext objects anymore. The various contexts and configurations have been put together into a new class, SparkSession. You can still access the SparkContext or the SQLContext from the SparkSession object itself. So, you’ll be starting your programs with this now:

SparkSession spark = SparkSession.builder().appName("Freblogg-Spark").master("local").getOrCreate();

And you can use this spark variable the way you’d use other context variables.

Another change in Spark 2.0 is the heavy emphasis on the Dataset APIs, and for good reason. Datasets are more performant and memory efficient than RDDs. RDDs (Resilient Distributed Datasets) have been pushed to second place now. You can still use RDDs if you want, but Datasets are the preferred API. In fact, Datasets have some nice convenience methods, so we can use them even for unstructured data like text. Let’s generate some cool lipsum from Malevole. It looks something like this:

Ulysses, Ulysses - Soaring through all the galaxies. In search of Earth, flying in to the night. Ulysses, Ulysses - Fighting evil 
and tyranny, with all his power, and with all of his might. Ulysses - no-one else can do the things you do. Ulysses - like a bolt of
thunder from the blue. Ulysses - always fighting all the evil forces bringing peace and justice to all....

Now, you might try to use an RDD to read this, but let’s see what we can do with Datasets.

Dataset<String> lipsumDs = spark.read().textFile("fake-text.txt");
lipsumDs.show(5);

Here we are reading the text file using the spark object we created earlier and that gives us a Dataset<String> lipsumDs. The show() method on the dataset object prints the dataset. And we get the following output:

+--------------------+
|               value|
+--------------------+
|Ulysses, Ulysses ...|
|Ulysses, Ulysses ...|
| no-one else can ...|
|  always fighting...|
|                    |
+--------------------+

What we see here are the lines of the text file. Each line in the file is now a row in the Dataset. There is now a rich set of functions available to you in Datasets which weren’t in RDDs. You can filter the rows for certain words, do a count on the table, perform groupBy operations, etc., all like you would on a database table. For a full list of all the available operations on Dataset, read this: Dataset: Spark Documentation.

I hope that’s enough talk about unstructured data analysis. Let’s get to the main focus of this article, which is using Datasets for structured data. More specifically, csv and JSON. For this tutorial, I am using the data created from Mockaroo, an online data generator. I’ve created 1000 csv records that look like this:

id,first_name,last_name,email,gender,ip_address
1,Netti,McKirdy,nmckirdy0@slideshare.net,Female,148.3.248.193
2,Nickey,Curreen,ncurreen1@tripadvisor.com,Male,206.9.48.216
3,Allayne,Chatainier,achatainier2@trellian.com,Male,191.118.4.217
4,Tades,Emmett,temmett3@barnesandnoble.com,Male,153.113.87.195
5,Shawn,McGenn,smcgenn4@shop-pro.jp,Male,247.45.80.68
6,Giuseppe,Scobbie,gscobbie5@twitter.com,Male,123.114.131.200
...

We’ll use this data, which I’ve put in a file named fake-people.csv, to work with Datasets. Let’s create a Dataset out of this csv data.

Dataset<Row> peopleDs = spark.read().option("header", "true").csv("fake-people.csv");
peopleDs.show(5);

Since we’ve column headers in our data, we add the .option("header", "true") and the output is a nicely formatted table of the data with all the columns like this:

+---+----------+----------+--------------------+------+--------------+
| id|first_name| last_name|               email|gender|    ip_address|
+---+----------+----------+--------------------+------+--------------+
|  1|     Netti|   McKirdy|nmckirdy0@slidesh...|Female| 148.3.248.193|
|  2|    Nickey|   Curreen|ncurreen1@tripadv...|  Male|  206.9.48.216|
|  3|   Allayne|Chatainier|achatainier2@trel...|  Male| 191.118.4.217|
|  4|     Tades|    Emmett|temmett3@barnesan...|  Male|153.113.87.195|
|  5|     Shawn|    McGenn|smcgenn4@shop-pro.jp|  Male|  247.45.80.68|
+---+----------+----------+--------------------+------+--------------+

You can read in JSON data similarly as well. So, I generated some JSON this time from Mockaroo.

{"id":1,"first_name":"Zenia","last_name":"Joberne","email":"zjoberne0@foxnews.com","gender":"Female","ip_address":"214.207.159.43"}
{"id":2,"first_name":"Renard","last_name":"Kezor","email":"rkezor1@elpais.com","gender":"Male","ip_address":"199.3.18.104"}
{"id":3,"first_name":"Briant","last_name":"Patel","email":"bpatel2@odnoklassniki.ru","gender":"Male","ip_address":"111.184.217.23"}
{"id":4,"first_name":"Robinett","last_name":"Heasley","email":"rheasley3@tiny.cc","gender":"Female","ip_address":"21.40.190.226"}
{"id":5,"first_name":"Rosalinda","last_name":"Glandfield","email":"rglandfield4@indiegogo.com","gender":"Female","ip_address":"26.16.4.132"}
{"id":6,"first_name":"Haslett","last_name":"Culligan","email":"hculligan5@meetup.com","gender":"Male","ip_address":"201.191.72.10"}
....

Note: By default, Spark reads JSON only in this format, where we have one object per line. Otherwise you will see _corrupt_record when you print your dataset. That’s your cue to make sure the JSON is formatted as per Spark’s needs.
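If your data is one big JSON array instead, a few lines of Python can rewrite it into the one-object-per-line form before Spark ever sees it. This is a preprocessing sketch, not Spark code, and the function name and sample records are made up for illustration.

```python
import json


def to_json_lines(records):
    """Serialize a list of objects as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


array_text = '[{"id": 1, "first_name": "Zenia"}, {"id": 2, "first_name": "Renard"}]'
json_lines = to_json_lines(json.loads(array_text))
print(json_lines)
```

Write the resulting string to a file and point spark.read() at that file.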

And you read JSON very similarly to the way you read csv. Since in JSON we don’t have headers, we don’t need the header option.

Dataset<Row> peopleJsonDs = spark.read().json("fake-people.json");
peopleJsonDs.show(5);

And the output is,

+--------------------+----------+------+---+--------------+---------+
|               email|first_name|gender| id|    ip_address|last_name|
+--------------------+----------+------+---+--------------+---------+
|psurgison0@istock...|   Prissie|Female|  1| 48.151.89.171| Surgison|
| rsewell1@jalbum.net|    Robena|Female|  2| 184.16.37.210|   Sewell|
|aluxon2@list-mana...| Annamarie|Female|  3| 254.69.187.23|    Luxon|
|sodoherty3@twitpi...|   Shannah|Female|  4| 0.245.101.197|O'Doherty|
| alodford4@jigsy.com|     Alice|Female|  5|70.217.170.182|  Lodford|
+--------------------+----------+------+---+--------------+---------+

You can see the order of columns is jumbled (here they come out in alphabetical order). This is because JSON data doesn’t usually keep any specified order and so, when you read JSON data into a dataset, the order might not be the same as what you’ve given. Of course, if you want to display the columns in a particular order, you can always do a select operation.

peopleJsonDs.select("id", "first_name", "last_name", "email", "gender", "ip_address").show(5);

And that would print it in the right order. This is exactly like the SELECT query in SQL, if you’re familiar with it.

Now that we have seen how to create Datasets, let’s see some of the operations we can perform on them.

Operations on Datasets

Datasets are built on top of Data frames. So, if you’re already familiar with Data frames in the spark 1.x releases you already know a ton about Datasets. Some of the operations you can perform on Dataset are as follows:

Column selection

Select one or more columns from the dataset.

peopleDs.select("email").show(5); // Selecting one column
peopleDs.select(col("email"), col("gender")).show(5); // Selecting multiple columns

Note: col is a static import of org.apache.spark.sql.functions.col;

Filtering on columns

Filter a subset of rows in the dataset based on conditions.

// Filter rows with id > 5 and <= 10
peopleDs.filter(col("id").leq(10).and(col("id").gt(5))).show();

Dropping columns

Remove one or more columns from the dataset

peopleDs.drop("last_name", "ip_address").show(5);

Sorting on columns

peopleDs.sort(desc("first_name")).show(5);

And that sorts the dataset in descending order of the column first_name. Like col, desc is a static import from org.apache.spark.sql.functions.

Output:

+---+----------+---------+--------------------+------+-------------+
| id|first_name|last_name|               email|gender|   ip_address|
+---+----------+---------+--------------------+------+-------------+
|685|  Zedekiah|  Brockie|zbrockiej0@mozill...|  Male|105.119.18.98|
|308|     Zarla| Bryceson|zbryceson8j@redif...|Female|55.118.168.15|
|636|  Zacherie|   Kermon|zkermonhn@prnewsw...|  Male| 120.36.10.87|

Those are some of the functions that you can use with Datasets. There are several more database-table-style operations on Datasets, like groupBy, aggregations, joins, etc. We’ll look at them in the next article on Spark, as I think this article already has a lot of information and I don’t want to overload you.

So, that is all for this article. If you’re someone that has never tried Datasets or Dataframes, I hope this article gave a good introduction on the topic to keep you interested in learning more.

The full code is available as gist.

This is the fifth article as part of my twitter challenge #30DaysOfBlogging. Twenty-five more articles on various topics including but not limited to Java, Git, Vim, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

If you are interested in contributing to any open source projects and haven’t found the right project or if you were unsure on how to begin, I would like to suggest my own project, Delorean which is a Distributed Version control system, built from scratch in Scala. You can contribute not only in the form of code, but also with usage documentation and also by identifying any bugs in the functionality.

Thanks for reading. See you again in the next article.

Tweeting with Python and Tweepy

2017-12-25T19:00:00+05:30

Programmers love to automate things and I'm no exception. I always like to automate my common tasks. Whether it is checking stock prices or checking when the next episode of my favorite show is coming, I have automated scripts for that. Today I am going to add one more thing to that list: automated tweeting. I tweet quite frequently and I would love to have a way of automating this as well. And that's exactly what we're going to do today. We are tweeting using python.

We'll use a python library called tweepy for this. Tweepy is a simple, easy to use library for accessing the Twitter API.

Accessing Twitter's APIs programmatically is not just a convenience; it can be of enormous value too. Mining Twitter data is one of the key steps in sentiment analysis. Twitter chat bots have also become quite popular nowadays, with hundreds of thousands of bot accounts. This article only barely scratches the surface, but hopefully it will help you build towards that.

Setting Up

First things first, install tweepy by running pip install tweepy. The latest version at the time of writing this article is 3.5.0.

Then we need to have our Twitter API credentials. Go to Twitter Apps. If you don't have any apps registered already, go ahead and click the Create New App button.

To register your app you have to provide the following three things

Name of your application
Description
Your website url

There is one more option which is callback URL. You can ignore that for now. Then after reading the Twitter developer agreement (wink wink), click on Create your Twitter application button to create a new app.

Once the app is created, you should see it in your twitter apps page. Click on it and go to the Keys and Access Tokens tab.

There you will see four pieces of information. First you have your app API keys which are consumer key and consumer secret. Then you have your access token and access token secret.

We'll need all of them to access twitter API's. So, have them ready. I have copied all of them and exported them as system variables. You could do the same or if you'd like, you can read them from a file as well.

Let's get started

First you have to import tweepy and os (only if you are accessing system variables).

import tweepy
import os

Then I'll populate the access variables by reading them from environment variables.

consumer_key = os.environ["t_consumer_key"]
consumer_secret = os.environ["t_consumer_secret"]
access_token = os.environ["t_access_token"]
access_token_secret = os.environ["t_access_token_secret"]
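If one of those variables is not set, os.environ[...] fails with a bare KeyError for the first missing name only. A small helper can report all the missing ones at once. This is just a sketch; the helper name load_credentials and the fake values are made up for illustration.

```python
import os

REQUIRED_KEYS = (
    "t_consumer_key",
    "t_consumer_secret",
    "t_access_token",
    "t_access_token_secret",
)


def load_credentials(env=os.environ):
    """Return the four credentials in order, or name every missing variable."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[key] for key in REQUIRED_KEYS)


# Demo with a fake environment; in a real script call load_credentials() with no arguments
fake_env = {key: "value-for-" + key for key in REQUIRED_KEYS}
consumer_key, consumer_secret, access_token, access_token_secret = load_credentials(fake_env)
print(consumer_key)
```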

With the keys ready, we set up the authorization.

authorization = tweepy.OAuthHandler(consumer_key, consumer_secret)
authorization.set_access_token(access_token, access_token_secret)

After authorization we create an API object twitter

twitter = tweepy.API(authorization)

And now you can tweet from python using this twitter object like this.

twitter.update_status("Tweet using #tweepy")

That is all you have to do. Just five lines of code and you can already tweet. You should try it out and check your twitter account. I just ran this command and this is the tweet.

Tweet using #tweepy

— Durga Swaroop Perla (@durgaswaroop) December 24, 2017

Not just this, you can also tweet media. Let's tweet again, this time with a picture attached.

image = os.environ['USERPROFILE'] + "\\Pictures\\cubes.jpg"
twitter.update_with_media(image, "Tweet with media using #tweepy")

And this is the media tweet.

Tweet with media using #tweepy pic.twitter.com/9bDuw9DDJI

— Durga Swaroop Perla (@durgaswaroop) December 24, 2017

When you run the previous commands, you'll see that a lot of output is printed on the terminal. This is a status object with a lot of useful data like the number of followers you have, your profile picture URL, your location etc., pretty much everything you get from your twitter page. We can make use of this information if we are building something more comprehensive.

Apart from sending regular tweets, you can also reply to existing tweets. To reply to a tweet you'd first need its tweet_id which you can get from the tweet's URL.

For example the URL for previous tweet is https://twitter.com/durgaswaroop/status/945049796238118912 and the tweet_id is 945049796238118912.
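Copying the id by hand gets old quickly, so here is a small sketch that pulls it out of the URL with a regular expression. The function name is made up, and it simply assumes the URL contains /status/ followed by digits.

```python
import re

# The tweet id is the run of digits right after "/status/"
STATUS_URL_RE = re.compile(r"/status/(\d+)")


def tweet_id_from_url(url):
    """Extract the tweet id (as a string) from a tweet URL."""
    match = STATUS_URL_RE.search(url)
    if match is None:
        raise ValueError("not a tweet URL: " + url)
    return match.group(1)


print(tweet_id_from_url("https://twitter.com/durgaswaroop/status/945049796238118912"))
```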

Using that id, we can send another tweet as reply.

id_of_tweet_to_reply = "945049796238118912"
twitter.update_status("Reply to a tweet using #tweepy", in_reply_to_status_id=id_of_tweet_to_reply)

The only change in the syntax is in_reply_to_status_id=id_of_tweet_to_reply that is passed as the second argument. And with that our new tweet will be added as reply to the original tweet.

The new reply tweet is this:

Reply to a tweet using #tweepy

— Durga Swaroop Perla (@durgaswaroop) December 24, 2017

That's how easy it is to access the Twitter API with tweepy. We now know how to tweet and how to reply to a tweet. Building on this knowledge, in a later tutorial I can show you how to create your own twitter chat-bot and also how to do twitter streaming analysis.

The full code of things covered in this article is available as gist at

For more programming and Python articles, check out Freblogg and Freblogg/Python

Web Scraping For Beginners with Python

This is the third article as part of my twitter challenge #30DaysOfBlogging. Twenty-seven more articles on various topics including but not limited to Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop. While you're at it, go ahead and subscribe to this blog and my blog on Medium as well.

Thanks for reading. See you again in the next article.

Sessions in Vim

2017-12-23T20:00:00+05:30

I love to have a lot of tabs open at the same time in Vim. Being the deputy Scrum master (Yeah, it is a thing) of our dev-team, I have to keep track of a lot of things going on in the team. I maintain a repository of all the links to product documentation, stories, tasks etc. I also need to keep track of the discussions that happened in various team meetings. On top of this, as a backend engineer I have my own stories and tasks to manage as well. All of this means that, I have a ton of tabs and splits open at any given time in Vim. Something like the following:

Now, the problem comes when I have to shut down and restart my computer. All the tabs I have kept open for several days will be gone and I have to open them all up again and put them in the order I want. Oh, the pain! There has to be a better way!

Luckily for us, Vim always has a better way to do something. There is an inbuilt feature just for this.

It is called Vim-sessions and with that you can get back all your tabs with just one command. How nice!

How to create a new session?

To create a vim session, run the command :mksession episodes.session. Here episodes is the name of the session I want to create.

In short, :mks <session-name>.session. And that’s it. Your session is now saved. It saves information about what tabs are currently open, what splits are open and even what buffers are open, all into that session file.

Note: The .session suffix is not needed. But it is the preferred way as you can easily identify the session file.

Once this is done, you can go ahead and close your tabs as all of that is stored in the session file.

How to open an existing session?

The next time you want to open all of those tabs, all you have to do is tell Vim to run that session. You do that by running the command :so <session-file-path> (:so is short for :source).

And boom! All of your windows and tabs are back with just one command. You don’t have to have multiple tmux or screen buffers running anymore. Vim can do all of that with just one command.

That is all you need to know about sessions in vim to make yourself productive. You can always try the vim help with :help session-file and find out more.

For more Vim articles, check out Freblogg/Vim

Beginner Vim Tutorials - Your First Lesson In Vim

Vim Color scheme used: Eclipse

This is the first article as part of my twitter challenge #30DaysOfBlogging. Twenty-nine more articles on various topics including but not limited to Java, Git, Vim, Software Development, Python, to come.

If you are interested in this, make sure to follow me on Twitter @durgaswaroop.

Thanks for reading. See you again in the next article.

How to recover from 'git reset --hard' | Git

2017-09-11T16:00:00+05:30

Git is an amazingly powerful tool. But, as Uncle Ben said,

With great power, comes great responsibility

And that is true for Git as well. If you are not careful when using it, you could easily burn your a**.
So, if something like that has happened to you, or if you want to make sure it never happens to you, then watch this video.

Subscribe to the channel for more videos like this.

Git Cherrypick

2017-02-20T14:26:00+05:30

Cherrypick is one of the most useful commands in Git. It is used to apply a commit that is present on another branch, on the current branch. Let's see an example to understand this better.

Let’s say you have two branches feature1 and feature2 as in the following picture.

Now, the green commit 5 on feature2 has some interesting code that you want on feature1. How would you get that? You are probably thinking about merge/rebase. But with that you will get all the other green commits from 1–4 as well, which you don’t want.
Cherry-pick to the rescue!

Assuming you are on feature1, all you have to say is

git cherry-pick green5 (Assuming 'green5' is the commit id)

And that’s it. You will have the green5 commit on your orange4 commit like in this picture as you wanted.

Notice that the green commit is no longer “5” but has been changed to “5′”. This is to show that, though the changes (change set is the Git term) in the commit remain the same, Git will generate a new commit hash for it, because the hash also takes the parent commit into account. I have used the same colour to show that the content is the same.
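To see why the same change set ends up with a different hash, here is a toy model in Python. It is only a sketch: the commit_id function and the ids "green4" and "orange4" are made up for illustration, and real Git hashes a richer commit object (tree, author, message and so on), not just these two fields.

```python
import hashlib

def commit_id(parent_id, change_set):
    """Toy commit id: a SHA-1 over the parent id plus the change set.

    This is an assumption for illustration, not Git's real object format.
    """
    return hashlib.sha1(f"{parent_id}\n{change_set}".encode()).hexdigest()

# The same change set in both cases.
change_set = "add interesting code"

original = commit_id("green4", change_set)   # commit 5, sitting on feature2
picked = commit_id("orange4", change_set)    # commit 5', cherry-picked onto feature1

print(original[:7], picked[:7])
print(original == picked)  # False: same changes, different parent
```

The content did not change at all, only the parent did, and that alone is enough to produce a different id.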

And that is all you need to know about cherry-picking. So, go ahead and pick some cherries!

Follow @durgaswaroop on Twitter.

Git Merge Vs. Git Rebase

2017-01-31T09:00:00+05:30

Merge and Rebase are two strategies available in Git to combine two (or more) branches into one branch.
Let’s say we have two branches feature1 and feature2 that have diverged from a common commit “a” to have four commits each.

Now we want to combine both the features into a single branch. Merge and Rebase are our options. Let’s see what each of them can do.

Git Merge

Merge will seem like a fairly obvious thing, if you look at the end result. It is pretty much like taking two threads and tying them up in a knot.

Here the commit ‘b’ has the information regarding all the commits in feature1 and feature2. So, Merge preserves the history of the repository.

Git Rebase

Rebase on the other hand doesn’t preserve the history. It quite literally re-bases one branch on top of the other i.e., it changes the base of the branch. Let’s see rebasing with the same example.
Let’s say I want to rebase feature1 onto feature2. What that means is that I want all the commits in the branch feature1 on top of the commits of feature2. So, after the rebase, your commit history would look like the following.

As you see in the picture, the base of feature1 which was previously the commit “a”, has been shifted to the green commit “4”. Hence the name Re-Base. Here feature1 is sitting on top of feature2 as opposed to being on “a”.

Do note that I have added a ′ next to the numbers of the feature1 commits, making them 1′, 2′ and so on, to indicate that the orange 1′ commit is different from the orange 1 commit. This is because each commit, apart from storing the changes to the files, stores the information regarding its parent. So, if the parent of a commit changes, then even if it has the exact same modifications to the files, it will be treated as a different commit by Git, which means we have changed the Git commit history.

Also, anyone who looks at the commit history now would think that feature1 was added after feature2, which is not what actually happened. If this is the end result you’re going for, then it’s absolutely fine, but if you want to show that feature1 and feature2 both started off simultaneously, then you need to use Merge.
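The knock-on effect along the whole branch can be sketched with a toy model in Python. The build_branch helper and the way ids are computed here are assumptions made purely for illustration; real Git commit objects contain more than a parent and a change set.

```python
import hashlib

def commit_id(parent_id, change_set):
    """Toy commit id derived from the parent id and the change set."""
    return hashlib.sha1(f"{parent_id}\n{change_set}".encode()).hexdigest()

def build_branch(base_id, change_sets):
    """Stack the change sets on top of base_id and return the commit ids in order."""
    ids = []
    parent = base_id
    for change_set in change_sets:
        parent = commit_id(parent, change_set)
        ids.append(parent)
    return ids

orange = ["orange 1", "orange 2", "orange 3", "orange 4"]
green = ["green 1", "green 2", "green 3", "green 4"]

feature1 = build_branch("a", orange)           # before the rebase, based on "a"
feature2 = build_branch("a", green)
rebased = build_branch(feature2[-1], orange)   # same changes, replayed on green 4

for before, after in zip(feature1, rebased):
    print(before[:7], "->", after[:7])
```

Only the base was swapped, yet every commit in the chain gets a new id, because each one depends on the id of the commit below it. That is what the 1′, 2′, ... labels stand for.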

Both Merge and Rebase have their pros and cons. Merge keeps the history of the repository but can make it hard for someone to understand and follow what’s going on at a particular stage when there are multiple merges. Rebase on the other hand ‘rewrites’ history (read - creates new history) but makes the repo look cleaner and is much easier to look at.

What you want to use depends on your need. A lot of companies make merges mandatory when adding stuff to master branch because they want to see the history of all changes. And a few companies/Open source projects mandate rebasing as it keeps the flow simple and easy to follow. Use the one that suits your workflow.

Fun Fact:
There is a merge strategy called Octopus merge, where you merge multiple branches into one. For more info on this: Understanding Git Octopus Merge

For more interesting articles, follow me @durgaswaroop on Twitter

Understanding Git Octopus Merge

2016-12-21T05:00:00+05:30

The code for Git merge is one of the most sophisticated pieces of software ever written. There is so much stuff that goes on inside during a merge that it's just mind-boggling. Just for that alone, Linus could be considered a programming genius. Too bad for other geniuses, he also has "Linux kernel" on his resume :-D.

As the title suggests this article is about Octopus Merge in Git. For this, I hope you know what a basic Git merge is and what it means to merge. If you're completely unfamiliar with Git, then I've no idea what you're doing here. You better read up on some Git 101 before jumping in to this article.

Anyway, just to brush up, this is how a simple/familiar Git merge goes.

We have a branch called feature that diverged from master at the second commit and went to have two commits of its own.

Note: For the branch pictures in this article I am using a Git GUI tool called Git Kraken. I have been trying it for a few days now and it looks quite promising. I am a fan of its clean and minimalist UI and have been using it extensively for the beautiful visualization of branches. And above all, it is free for personal non-commercial use. So, you can try it out for free.

Now you want to add those cool new changes on the feature branch to master. The way you do it is by merging (let's not talk about rebasing for now; we will look at it another time). So, when you merge, this is what it looks like.

This is all the usual stuff that we are all familiar with.

Now there is another type of merge called the Octopus Merge. At least some of you must have heard about it, either from an online video or from a colleague in your office who seems to know everything. Either way, the Octopus merge is a really fun way of merging. You probably won't get to do this at your work, as a lot of companies think this complicates things, and we all know how much companies hate complexity.

Anyway, let's see what it looks like. I have a local Git repository with three branches branch1, branch2, branch3 along with master. All four of these branches have two extra commits from the point they diverged.

Now if you want to merge them, the usual way would be to merge two branches at a time to finally get to the final combination after three merges like so.

This may seem fine and might actually be the only way you would think about this if not for the Octopus merge. You have three merge commits here and, as we know, merge commits are noise. They pollute the history of your repository and interrupt the story told by your Git history. So, how about keeping the noise low by having just one merge commit instead of three? How, you ask? Octopus, my friend. All hail the great and mighty Octopus.

The way you perform an Octopus merge is by merging all the branches at once onto master. To do that, while on master, you give a command like this:

git merge branch1 branch2 branch3

This will merge all the three branches to master. The branches will look something like this. Do you see the reference of Octopus now?

Now, if you know anything about octopuses, you might be wondering why we only have four legs here while an octopus has 8. Well, you are right. Octopuses do have 8 legs (technically 6, as two of them are used as hands), but 4 is good enough. Actually, any merge can be called an Octopus if you're merging three or more branches.

If you have been using Git for some time, you might be wondering: if Octopus is so freaking cool, why haven't more people heard about it, and why are more people not using it? Well, you are right, my friend. Octopus is awesome for sure, but as I said, it certainly does complicate things a lot, especially when dealing with merge conflicts. Merge is hard enough as it is when dealing with just two branches. But if you are merging 5 or 10 branches together, it feels like you're doing complex surgery. You have to be really careful in that case, and I am not even sure if any modern GUI tools support 10-way diffing. Also, a lot of people tend to go overboard with Octopus.

Look at this message where Linus Torvalds yells (pleasantly) at a guy for creating an Octopus with 66 branches. Imagine that for a second. 66 branches! I wouldn't want to be the guy that handles merge conflicts on that one! Linus aptly says

that's not an octopus, that's a Cthulhu merge

So, a lot of companies don't really use this. A lot of people won't even consider this for their merge strategies.

A rule of thumb to follow with Octopus is to never overdo it. An 8-way octopus merge, though it borders on crazy hard and insane, is fine, but more than that is overkill. The situations where you have to merge more than 5 or 6 branches tend to be very rare, and in those cases maybe you can do an Octopus on a subset of the branches at a time. Or maybe rethink your merging strategy.

Either way, I hope this article helped you in understanding something new and gives you some ideas for dealing with complex merges. I hope you will educate your peers and colleagues about this new merge and share this article with them.

Well, that is all for this article, folks. See you again in the next one. Until then, goodbye.

Special Thanks to Git Kraken team at Axosoft for developing a great tool like Kraken.

You can find me as @durgaswaroop on Twitter.

Attributions:

Cthulhu Image - CC BY-SA 3.0 -https://commons.wikimedia.org/wiki/File:Cthulhu_and_R'lyeh.jpg

Navigating In Vim II | Your First Lesson in Vim

2016-10-13T12:00:00+05:30

This is the fourth article in the series titled "Your First Lesson In Vim". These articles are written with the goal of helping out new Vim users by teaching the awesomeness of the Vim editor, thereby extending the Vim community. Vim, though quite powerful, has a bad rep for being hard to learn and hard to get started with. So, even when someone is interested in learning about Vim, that infamous learning curve seems to scare them off. This series is going to put an end to all of that.

In the last article Navigating in Vim I, we have seen a lot of Vim motions. Most of these fall under the category of word-motions (:help word-motions). We will learn some more motions in this article. And in case you still haven't tried Vim Adventures you should do it. It will help you a lot with getting the hang of Vim motions and getting around in vim.

Here is the list of Vim motions for this article.

Motion	What it does?
0	Go to the STARTING of the CURRENT LINE
^	Go to the FIRST NBC* of the CURRENT LINE
-	Go to the FIRST NBC* of the PREVIOUS LINE
+	Go to the FIRST NBC* of the NEXT LINE
$	Go to the END of the CURRENT LINE
g_	Go to the LAST NBC* of the CURRENT LINE
f{char}	Find a character FORWARD in the current line (Usage: to go to the first occurrence of c, you type fc)
F{char}	Find a character BACKWARD in the current line (Usage: to go to the first occurrence of c to the left of the cursor, you type Fc)
t{char}	Like f but places the cursor before the character (Mnemonic: t - till)
T{char}	Like F but places the cursor after the character
gg	Move the cursor to the first line (compare this with H)
G	Move the cursor to the last line (compare this with L)

* NBC - Non Blank Character

These motions let you move very fast within and between lines. You can go to any character you want on the current line with just 2 or at most 3 keys, which is insanely fast compared to any other text editor. The last two motions (gg, G) are super useful and are certainly two of my most used commands.

Now we have one final set of motions to learn called Text Object motions (:help object-motions). Text objects are an important concept in Vim and we will cover them in depth in a future article. For now let's look at these motions.

Text Object Motions

Motion	What it does?
(	Go to the beginning of the PREVIOUS sentence
)	Go to the beginning of the NEXT sentence
{	Go one paragraph BACKWARD
}	Go one paragraph FORWARD

These four motions are very useful too. Especially if you're a programmer, the { and } will make navigating the code base a breeze.

And with that, we have covered all the basic Vim motions for you to get started. There is just one more important thing you need to know in conjunction with Motions. I haven't told you about this till now because I wanted you to get a full grasp of Vim motions before I explain this. Anyway, here it goes ..

Every Vim Motion takes a count before it

That's it. It might seem simple and it is simple, but its usefulness is just immeasurable.

Let's say you have to move eight lines down. You don't have to frantically type jjjjjjjj; simply type 8j. Similarly, 4k goes four lines up, 6w goes to the sixth word from the cursor, and so on. This is such a useful feature that, quite literally, the sky is the limit for what you can do with it. Want to go to the second e after the cursor? Try 2fe and your cursor lands directly on that e. Similarly, to go to the end of the fourth line below, just do 5$ (a count on $ also moves down count minus one lines) and B.A.M!
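If it helps to think about the count in code, here is a rough Python sketch of what {count}f{char} does on a single line. The function name find_forward and the sample line are made up for illustration; this is a model of the behaviour, not how Vim implements it.

```python
def find_forward(line, cursor, char, count=1):
    """Return the column of the count-th `char` to the right of `cursor`.

    Models {count}f{char}: if there are fewer than `count` matches,
    the motion fails and the cursor stays where it was.
    """
    pos = cursor
    for _ in range(count):
        pos = line.find(char, pos + 1)
        if pos == -1:
            return cursor
    return pos

line = "hello there, everyone"

print(find_forward(line, 0, "e"))      # like fe
print(find_forward(line, 0, "e", 2))   # like 2fe
print(find_forward(line, 0, "e", 3))   # like 3fe
```

The count simply repeats the same search from wherever the previous one landed, which is exactly why 2fe saves you from typing fe twice.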

This opens up a whole new world of combinations for you to use and I hope you will make use of all of them. With these motions you can move to any place you want in the file with minimal number of keystrokes and your ultimate aim should be to accomplish everything with the minimum possible number of keystrokes. Be a Vim Ninja and conquer the world!

Well, that is all for this article, folks. Will see you again in the next one. Until then, keep practicing and Happy Vimming!

For more Vim stuff : Vim

Attributions:

Vim Logo - Vim Replacement Icon http://wolfrosch.com/works/goodies/vim (CC BY-NC-ND 3.0)

Vim Ninja image - https://goo.gl/QgTrsY (Originally from Practical Vim by Drew Neil)

Navigating in Vim I | Your First Lesson In Vim

2016-10-09T00:00:00+05:30

This is the third article in a series titled "Your First Lesson In Vim". These articles are written with the goal of helping out new Vim users by teaching the awesomeness of the Vim editor, thereby extending the Vim community. Vim, though quite powerful, has a bad rep for being hard to learn and hard to get started with. So, even when someone is interested in learning about Vim, that infamous learning curve seems to scare them off. Hopefully this series will put those fears to bed.

In the last article, How To Exit Vim, we saw what Vim modes are and what they do. So, if you know about Visual Mode, Insert Mode, Command Mode and Normal Mode, then continue with this article. Otherwise take a look again at the previous article. Understanding Vim modes is really important for this article and the upcoming ones.

I wanted this to be a part of the previous article but since this is really important and has a lot of potential information to discuss, I decided to give this its own full article. We will be spending most of our time in Normal mode here as that is where we will navigate in the file. You have seen how you navigate in Vim using h j k l. If you haven't figured out already, Vim's main philosophy is increasing your productivity and because of this some of the things Vim does might seem different compared to the usual way you are used to in other editors. Using h j k l is one of those things.

We covered this in the last article, but as promised I will expand on it here. If you look at your keyboard, you will see that h j k l are on your Home Row (unless you are using a Dvorak keyboard, in which case this article probably won't help you much). Having the navigation keys on the home row is such an advantage, as you don't have to move your fingers at all to access them. Going to the arrow keys for navigating is tiresome and time consuming. Don't think so? Well, try it out yourself. Rest your fingers on their normal positions on the Home row (a s d f - j k l ;) and try to hit the UP arrow and come back. Did you see the travel involved in that? Do it again and see how you moved your hand away from the keys and came back. Do it 5 more times and tell me if I am wrong when I say it's just unnecessary travel, especially since you have the navigation keys right on the home row in Vim. This is one of the reasons why Vim users are usually pretty fast. They don't keep moving their hands on and off the home row every time they have to go up or down. And again, this could save you from potential RSI.

So, I strongly advise you to stop using the UP and DOWN arrow keys. To use the h j k l keys more, try playing Vim Adventures. It's a fun game where you go around the textland collecting characters using Vim's navigation controls. It will help you use the h j k l keys, and after just a couple of tries it becomes muscle memory.

One more thing that you won't see Vim users (#VimRocks) using is the mouse. The argument against using the mouse is the same one as that for the arrow keys. You are taking your hands off the home row, which not only breaks your typing flow but is also just plain annoying. It's the same for the mouse, except you're moving your hand even further, which makes it that much worse.

Getting rid of the arrow keys and the mouse is not an easy thing. This is certainly something that takes time to get used to. But once you do, you will be that much faster in your workflow.

Vim Motions

Vim Motions are the amazing things that make Vim users so fast. Motions are just about anything that moves your cursor from one place to another. You already know about h j k l. Apart from those you also have w W b B e E H M L. Let's see what they do.

Command	What it does?
w	Move the cursor to the starting of the next word
W	Move the cursor to the starting of the next WORD
b	Move the cursor to the starting of the previous word (Mnemonic - back)
B	Move the cursor to the starting of the previous WORD
e	Move the cursor to the end of the current word (Mnemonic - end)
E	Move the cursor to the end of the current WORD
H	Move the cursor to the First line of the current visible screen (Mnemonic - High)
M	Move the cursor to the Middle line of the current visible screen (Mnemonic - Middle)
L	Move the cursor to the Last line of the current visible screen (Mnemonic - Low)

The usage of H M L commands should be clear with this image below. They move your cursor to first, middle and last line of the screen respectively.

To understand about w b e and their upper case variants we have to understand how word and WORD are defined in Vim. From the official documentation (:help word),

A word consists of a sequence of letters, digits and underscores, or a sequence of other non-blank characters, separated with white space (spaces, tabs, <EOL>)

A WORD consists of a sequence of non-blank characters, separated with white space

In short, a group of characters without a space between them is a WORD, and there can be multiple words in that. With that definition, let's take a look at some examples and identify the number of words and the number of WORDS in them.

Text	# of words	# of WORDS
hello world	2 (hello,world)	2 (hello,world)
hello-world	3 (hello,-,world)	1 (hello-world)
hello_world	1 (hello_world)	1 (hello_world)
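The two definitions can be approximated with a couple of lines of Python if you want to check your own examples. This is only a sketch: the function names are made up, and the regular expression assumes Vim's default idea of a keyword character (letters, digits and underscores), which the 'iskeyword' option can change.

```python
import re

def split_words(text):
    """Vim 'words': runs of letters/digits/underscores, or runs of other non-blank characters."""
    return re.findall(r"\w+|[^\w\s]+", text)

def split_big_words(text):
    """Vim 'WORDS': runs of non-blank characters separated by white space."""
    return text.split()

for sample in ("hello world", "hello-world", "hello_world"):
    words = split_words(sample)
    big_words = split_big_words(sample)
    print(f"{sample!r}: {len(words)} words {words}, {len(big_words)} WORDS {big_words}")
```

Running it gives the same counts as the table: the hyphen is a word of its own, while the underscore is just part of one.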

Once you understand the difference between word and WORD, all the motions explained in the first table will be clear. Just to make the foundation firm, try them out yourself. Type something in a file and see what each of the w b e W B E H M L commands does and how it moves the cursor.

Most of the motions we have covered in this article are called word motions (:help word-motions). There are some more Vim motion commands that you need to know to navigate quickly and with ease. To keep this article simple, we will end this discussion here and pick it up again in the next article, where we will discuss the other motion commands. So, keep practicing these motions combined with the other commands discussed in How To Exit Vim and you will be really fast already. Fast like a Puma!

Well, that is all for this article, folks. Will see you again in the next one. Until then, keep practicing and Happy Vimming!

For more Vim stuff : Vim

Attributions:

Vim Logo - Vim Replacement Icon http://wolfrosch.com/works/goodies/vim (CC BY-NC-ND 3.0)

How to Exit Vim? | Your First Lesson In Vim

2016-10-02T17:00:00+05:30

This is the second article in the series titled "Your First Lesson In Vim". These articles are written with the goal of helping out new Vim users by teaching the awesomeness of the Vim editor, thereby extending the Vim community. Vim, though quite powerful, has a bad rep for being hard to learn and hard to get started with. So, even when someone is interested in learning about Vim, that infamous learning curve seems to scare them off. This series is going to put an end to all of that.

In the last article Introduction & Installation we have seen why Vim is the best and coolest editor ever. Hopefully after watching Damian Conway's YouTube video given at the end of that article, you would agree. In this article we will experience Vim for the first time. We will learn about the various modes of operations in Vim. And, most importantly as the title of the article suggests, we will learn How to Exit Vim.

First things first, you have to open Vim. Duh! You can do that either by directly searching for Vim in your search box or by typing vim into your terminal. If you want to start off with gvim, then open that instead. Your Gvim or MacVim would look something like this.

If you have already tried to type something, you will have observed that there is something shady going on here. For example, if you type hello world, you might observe that only world is displayed and hello is nowhere to be seen. Try it out for yourself. This happens because of the infamous Vim modes. One of the first things you have to realize while using Vim is that it's not like your typical run-of-the-mill text editor. Vim works a bit differently, and modes are one of the key things that make Vim different. So, let's take a look at them.

Broadly speaking, Vim has four major modes of operation. That number keeps changing depending on who you're talking to, because there are a few more modes that can technically be called sub-modes, but some people insist on treating them as separate modes. But to keep things simple here, 4 is the magic number for you and 4 is the answer to Life, the Universe and Everything. Not 42, 4!

The modes are :

Normal Mode

Normal mode is the default mode you will be in when you open Vim. Normal mode is used for altering, deleting and formatting text. You won't be inserting any new text into the document in this mode. Normal Mode is the mode you will be spending most of your time in. You can get to Normal mode by pressing ESC from any other mode. One of the main things you will be doing in this mode is moving around your document.

To move around the file in the window you might usually be using arrow keys. In Vim they will work the way you expect them to, but instead vim advises to use h .. j .. k .. l for moving the cursor. The reason why this came to be and the advantages of this will be apparent in the next article but for now let's see how this works.

h moves the cursor left, l moves it right

j moves the cursor down, k moves it up

This following picture would make the idea clear.

If you are thinking of sticking with the arrow keys instead of h j k l, it is fine. There are a lot of people who use Vim this way. But trust me when I say using h j k l speeds up your workflow a lot. Once you get used to this, you won't want to use the arrow keys anymore. But anyway, we will discuss more about this in the next article.

Working in Normal mode, you will see how everything you do gets easy in Vim. Here is a quick sneak peek at some commands.

Command	What it does?
dd	Delete (cut) the current line
yy	Copy (yank) the current line
p	paste the copied text below the current line
u	Undo your previous change
gg	Go to the beginning of the file
G	Go to the end of the file

From now on you don't have to awkwardly select the full line with your mouse to delete it. All you have to do is press dd and that sucker goes away.

Didn't mean to delete it? No problem. Just hit u and it undoes the delete. No more holding down Ctrl and z. Also, u for undo is so simple to remember. What does Z even mean in Ctrl + z? And how did that become synonymous with Undo?

Similarly no more Ctrl + c and Ctrl + v to copy and paste. yy and p got you covered.

So you see, Vim sticks to its philosophy of making you productive. Imagine all the keystrokes you save per day, per year. So, switching to Vim doesn't just improve your productivity, it takes care of your health too. Every key you save keeps you a key further away from carpal tunnel syndrome and RSI. So, Use Vim - Stay healthy. :]

You might be happy with using Ctrl+c to copy and Ctrl+v to paste in your plain old editor. That's absolutely fine, but Vim offers a simple and easy alternative and honestly the choice is pretty clear.

Anyways, that is about Normal mode for now. We will discuss more later when required. Let's look at Insert mode.

Insert Mode

As the name suggests, Insert mode is where you will be inserting text, and that is all you will be doing in here. You enter Insert mode by pressing i in Normal mode. And in almost all Vim distributions you should see a noticeable change in the cursor right away. It would have changed from a block type cursor (█) to an I-beam (|). That's your indication that you're in Insert mode. In another tutorial we will see how you make that even more apparent. Whatever you type in Insert mode is inserted into the file literally. If you type a b c, those characters go into the file as you would expect. Contrast this with pressing dd in Normal mode, which doesn't print those characters in the file but instead does something to the file (in this case, a delete operation).

Unlike in other editors, you won't be spending much time here, and in fact I'd advise you to get out of Insert mode once you are done typing. To go out of Insert mode you just have to press Esc and you will be back in Normal Mode.

Now, let's get to the fun part of Insert mode. Remember before when I said you go from Normal mode to Insert mode by pressing i? Well, it turns out that is just one of the ways to get into Insert mode. There are five more ways in which you can enter Insert mode, and you can choose the best one based on what you need. Sounds complicated? Let's list them down first.

Command	What it does?
i	Enters Insert mode with the cursor placed before the current character
a	Enters Insert mode with the cursor placed after the current character (Remember a - after)
o	Enters Insert mode by opening a new line below the current line (Remember o - open)
I	Enters Insert mode by placing the cursor at the beginning of the line (Remember big I - bigger version of i)
A	Enters Insert mode by placing the cursor at the end of the line (Remember big A - bigger version of a)
O	Enters Insert mode by opening a new line above the current line (Remember O - bigger version of o)

As explained each one has a specific purpose.

If you want to quickly create a new line above the current line and start typing - You press O

If you want to insert a new line below the current line - You press o

To add something quickly at the end of a line - You press A

To add something at the beginning of a line - You press I

Could that be any simpler? Surprisingly, none of the other popular text editors do this. I can promise you that you won't be able to move so quickly in any other editor. This is Vim's power.

Let's look at another easy mode that will help you visualize things better, Enter Visual Mode.

Visual Mode

If you have carefully looked at things till now, you might have started to feel that Vim favours keyboard commands over using a mouse. If you thought so, you would be absolutely correct. So, in the spirit of No Mouse, Visual mode emulates visual selection of your text, similar to the way a mouse selects on screen, but completely with the keyboard.

To enter in to Visual mode, just press v and move your cursor with either h j k l or the arrow keys and you will see that the text is getting highlighted indicating that it has been selected. Now, what can you do on this selected text? You can press d and delete it completely or you can press y and copy it. Notice that these are d and y and not dd and yy like in Normal mode. With v, Visual selection happens character by character. But if you want to select the full line, press V instead and you have the whole line highlighted and you can delete, copy or run any other command on the highlighted text.

This is what it looks like when you have something selected in Visual mode.

And to exit out of Visual mode or to cancel the selection, just press ESC.

Command	What it does?
v	Visual selection by character
V	Visual selection by line

We are finally down to the last mode, which is the Command mode (You don't take Command, Son)

Command Mode

Vim's Command mode is very powerful and one of the reasons why Vim is so versatile. Command mode is where you type Vim's commands, change Vim configurations and plugin settings, open new files, close existing files and also access Vim's built-in help documentation. You enter Command mode by typing : and then you type in the command you want. After you press : you will see the cursor at the bottom of the screen (called the last line, appropriately) and you type the command.

To open a file, you type in :e file_name (:e is short for :edit) and hit Enter. If the file exists, Vim will open it for you, and if it doesn't exist, Vim will open a blank file for you and the file will be created when you save it.

To save or rather to write the file to disk, you do :w and hit Enter for it to be saved. If the file doesn't yet have a name, You type :w file_name and it will save the contents of the window with that file name.

And Now for the most important question in all of Vim's History and the given title of this article, How to exit Vim! If you are using a Graphical version of Vim, then closing Vim is the same as closing the window and poof, its gone. But If you're using a terminal (works in gvim and macvim too) then you quit Vim by typing :q (short for :quit) and that closes the current window. If you have unsaved changes in your buffer Vim will give an error saying No Write Since Last Change. If you don't mind discarding unsaved changes, you append a ! and so the command becomes :q!. That is all there is about how you exit Vim. The bang(!) at the end is similar to -f or --force option in a lot of linux commands. It forces Vim to quit even when there are unsaved changes.

From now on, if you ever see a meme like this, you will know what they are talking about.

Another important Vim command is :help. It opens the full help manual for Vim, so it should be one of your most used commands in the initial days of learning.

And similar to other modes, you exit command mode by pressing the Esc key and you will be back in the Normal mode.

This image illustrates how you switch from one mode to another

Okay. That's a lot of information for one article. Let's do a quick review.

Story Recap

There are four modes of operation in Vim.

Normal Mode: For moving around the document, deleting, copying and formatting, which are some of the common things you do in this mode.
Insert Mode: For inserting text into the document. Enter Insert mode by typing any one of a A i I o O in Normal mode. Exit with Esc.
Visual Mode: For visually selecting text. Enter with v or V and exit with Esc.
Command Mode: For executing commands. :w to save, :help for documentation and :q to quit.

Well, That is all for this article folks. Will see you again in the next one. Until then, Keep practicing and Happy Vimming!


Attributions:

Vim Logo - Vim Replacement Icon http://wolfrosch.com/works/goodies/vim (CC BY-NC-ND 3.0)

Vim Comic image - https://comic.browserling.com/vim.png

Introduction & Installation | Your First Lesson In Vim

2016-09-24T21:20:00+05:30

This is the first article in a series of articles titled "Your First Lesson In Vim". These articles are written with the goal of helping out new Vim users by teaching the awesomeness of the Vim editor, and thereby growing the Vim community. Vim, though quite powerful, has a bad rep for being hard to learn and hard to get started with. So, even when someone is interested in learning Vim, that infamous learning curve seems to scare them off. This series is going to put an end to all of that.

Warning : After going through all the articles in this series you will love Vim so much that you would like to have Vim style keyboard bindings everywhere, in your browser, in your mail client, in your shell and every other place which has a text input, which might not always be possible. Proceed further at your own risk. YOU HAVE BEEN WARNED!

Vim is one of the best text editors available out there. In fact, it is one of the two best editors, the other being Emacs (this is the last time you'll see its name; from here on, it will be referred to as The Editor which shall not be named). Now you might be wondering, what about Sublime Text? Or Atom? Or some other flashy editor that's getting attention? My answer to that is very simple: East or West, Vim is the best. Don't get me wrong, editors like Sublime and Atom are good, and I was a fan of Sublime myself. But to be called the best, a text editor needs to be customizable and extensible and, most importantly, should have a huge community of users helping each other out. None of these editors can beat Vim in those areas. Apart from that, Vim is really fast and robust. It can open huge files that make other editors crash. It has built-in syntax support for hundreds of file types. It has a huge plugin base that both extends Vim's existing functionality and adds new functionality to do pretty much anything you want. And those are just a few of the reasons why it's the best.

Since you are reading this article, I assume that you're interested in learning about what Vim is and about what Vim does. So, Let's start with some brief history of how the Vim editor came to be.

1970s - Bill Joy developed the ex editor for Unix, which later came to be known as the Vi editor for having a Visual interface for editing.
1987 - Stevie was developed as a clone of Vi for Atari ST systems. Stevie stands for 'ST Editor for Vi Enthusiasts'. The name might be a mouthful but the editor itself is quite popular.
1988 - Vim (Vi IMitation) was created by Bram Moolenaar (Remember the name ..) as a port of Stevie for AmigaOS. Though it started as an imitation, Vim quickly added several new features along with support for multiple operating systems.
1993 - Vim 2.0 was released with the name changed to 'Vi IMproved' because, by then, Vim had a lot more features than the original Vi.
. . . Fast forwarding history . . .
2006 - Vim 7.0 was released with support for tabs, code completion, undo branching and a lot more.
2016 - Vim 8.0 was released with a lot of exciting features like asynchronous I/O, channels, jobs, timers, packages and a lot more.

(Shout out to buildingvts.com for putting this history together)

So, as you can see from our brief time travel, Vim has been around for almost 30 years. Now you might be asking yourself, why the heck is this editor still used today after almost 30 years? That's a good question and one that needs to be answered right now.

Technology sure changes a lot, and old things usually tend to get lost among all the new things that keep coming. But in the case of Vim or The Editor which shall not be named, that is simply not the case. They fall into the category of "Old is Gold". These editors were written during the days when floppy disks and magnetic tapes were all the rage and hence were written to be memory efficient. Though Vim has changed a lot over the years to add countless new features, the fundamental idea of being lightweight and memory efficient is still one of its big selling points. That is the reason why Vim has managed to stay relevant through three decades, and that is also the reason why it will continue to be relevant for decades to come.

So, If that answer convinced you to stay the course and explore the exciting and enticing world of Vim, then Welcome aboard! Make sure to remember that this is the day you have decided to take your text editing to the next level by learning Vim.

Now that we know the history of Vim, it's time to install Vim on your computer. If you are rocking a Linux operating system, chances are you already have a version of Vim pre-installed. So, check if it exists by typing vi or vim in the command line. If it is available, you should see a screen that looks something like this.

If you see this then Vim is already installed.

If you don't have it installed, don't worry. Vim is free to use (strictly speaking, it is charityware) and so you can download it for free from Vim's official site, Vim.org. Vim is available for pretty much every major operating system out there. I heard that there is a version of Vim available even for toasters. I have no idea who might use that, but it's there if you need it. And this is another reason why people like Vim so much. No matter the OS, they can be sure that their favorite editor is available. So, just download Vim for your operating system and install it.

And by the way, did I mention that Vim is primarily a terminal based program? It was initially designed to be run in terminals to access files on remote systems. A lot of people, to this day, prefer the terminal version of Vim. But for those of you who like to have a Graphical User Interface (GUI), that is available as well.

For Windows users, it can be downloaded from the vim.org site; look for Gvim (which stands for Graphical Vim). For Mac users, you can download MacVim, which provides a good GUI experience. For Linux users, there are Gvim versions available for most of the distros, so download the one suitable for your distribution.

If you have successfully installed Vim on your system, open Vim either in the terminal or the GUI and you should see a welcome screen similar to the picture above. If you got that, then congratulations, you have the power of Vim with you now.

Don't forget what Uncle Ben said, "With great power comes great responsibility". So, your responsibility as a Vim user is to spread the Vim awesomeness among your co-workers and friends. It would be even better if you can share this article with them but that is entirely up to you. (Jedi mind tricks working implicitly)

And before we finish this article, I will give you a sneak peek at the power of Vim and what you can do with it. Watch Damian Conway's video on Vim: More Instantly Better Vim. Conway is one of the Vim geniuses whom I admire. This video gives you a small window into the world of Vim and what Vim can do in the hands of a seasoned user. You might not be able to understand how Conway is doing his magic, but that is entirely fine. You obviously won't be able to understand Linux kernel module code when you're just starting to write Hello World programs. This video is just to show you how the masters use Vim, and you will be able to do that too once you've mastered it.

Well, That is all for this article folks. Will see you again in the next one. Until then, Keep practicing and Happy Vimming!

Attributions:

Vim Logo - Vim Replacement Icon http://wolfrosch.com/works/goodies/vim (CC BY-NC-ND 3.0)

Floppy Disks - https://goo.gl/0Ns2Dj

Using Tab windows in Vim

2016-07-18T05:30:00+05:30

Using Tabs (Vim calls them tab pages) is one of the sure ways to increase your productivity. Vim tabs are just like the tabs in your browser. Each tab can have multiple splits (referred to as windows in Vim's documentation). So, you can have multiple splits open in one tab and then you can have multiple tabs.

Tabs are a really handy way of grouping things together. So, I usually have multiple tabs open in any session. I have a main editor tab where I will have multiple splits open for the code I am looking at and, since I work with a lot of data files, I will have one tab dedicated to the data-sets that I will be using for my program. And then, if required, I will have another tab open for any notes or info that I have previously noted down.

Here are some commands and tips for working with tabs.
To create a new tab - :tabnew
To go to the next tab - :tabnext
To go to the previous tab - :tabprevious
I don't like to type all of these commands every time, so I have added these mappings in my vimrc to make switching between tabs much easier.

nnoremap <C-Tab> :tabnext<CR>
nnoremap <C-S-Tab> :tabprevious<CR>

With these I can move around the tabs just like I do with the tabs in my browser.
Another important thing that you might want to do with tabs is to be able to move them. I am really particular about how my tabs should be ordered and so I have added these mappings to move them around.

nnoremap <silent> <A-Left> :execute 'silent! tabmove ' . (tabpagenr()-2)<CR>
nnoremap <silent> <A-Right> :execute 'silent! tabmove ' . tabpagenr()<CR>

With these you can hit Alt + Left arrow to move the current tab to the left and Alt + Right arrow to move it to the right.
Try :help tabpage in your Vim help for more info.
A lot of Vim users see tabs as an alternative to buffers and there are a lot of articles and discussions about Buffers vs. Tabs. But, for me, tabs and buffers are not exclusive. I often have multiple tabs open and will still use buffers when I need them.
So, that is all for this article. Come back again for the next article.
Until then, Happy Vimming.

Image Credits : https://pbs.twimg.com/profile_images/64545277/vim_logo_400x400.png

PS: To see all the vim tutorials of FreBlogg , see : Freblogg/Vim

Word Count application with Apache Spark and Java

2016-06-23T05:30:00+05:30

Apache Spark is becoming more ubiquitous by the day and has been dubbed the next big thing in the Big Data world. Spark has been replacing MapReduce thanks to its speed and scalability. In this series of articles on Spark we will try to solve various problems using Spark and Java.

The word count program is the big data equivalent of the classic Hello World program. The aim of this program is to scan a text file and display the number of times each word occurs in that file. For this word count application we will be using Apache Spark 1.6 with Java 8.

For this program, we will be running Spark in local mode on a single machine, so you don't need to set up a cluster. Even Hadoop is not required for this exercise. Assuming you have Spark, Java and Maven installed properly, let's proceed.

Creating pom.xml

To compile Java programs with Maven, you will need a pom.xml file with the required dependencies. Use this pom.xml file if you don't have one available with you.

<?xml version="1.0" encoding="UTF-8"?>
<project>
  <groupId>com.freblogg.sparklearning</groupId>
  <artifactId>freblogg-spark-tutorial</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>example</name>
  <packaging>jar</packaging>
  <version>0.0.1</version>
  <dependencies>
    <dependency>
      <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
  <properties>
    <java.version>1.8</java.version>
    <encoding>UTF-8</encoding>
    <spark.version>1.6.1</spark.version>
  </properties>
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.3</version>
          <configuration>
            <source>${java.version}</source>
            <target>${java.version}</target>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-plugin-plugin</artifactId>
          <version>3.3</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Now, save this file as pom.xml and put it in the same folder as your src directory.

Input File

After creating the POM file, you will need an input file on which we will run our Wordcount program, to count the number of occurrences of each word. This is the file I will be using.

It is close to midnight and something evil is lurking in the dark
Under the moonlight you see a sight that almost stops your heart
You try to scream but terror takes the sound before you make it
You start to freeze as horror looks you right between the eyes
You are paralyzed

Java Program

Once we have the pom file ready, we can start with the code.

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;
import java.util.Arrays;

public class WordCount {
  public static void main(String[] args) {

    // Run Spark locally; no cluster is needed for this example.
    SparkConf conf = new SparkConf().setMaster("local").setAppName("wordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load our input data.
    String inputFile = "file:///home/dsp/Desktop/sparkExamples/sample_testing/resources/inputFile";
    JavaRDD<String> input = sc.textFile(inputFile);

    // Split each line into a list of words.
    JavaRDD<String> words = input.flatMap(l -> Arrays.asList(l.split(" ")));

    // Transform into (word, 1) pairs and add up the counts per word.
    JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((x, y) -> x + y);

    System.out.println(counts.collect());

    // Shut down the Spark context once we are done.
    sc.stop();
  }
}

Execution

Once we have everything ready, it's time to execute our program and see the output.
To compile it, first execute this in the directory with the pom file.

 mvn clean && mvn compile && mvn package

This will take some time to run the first time because Maven will have to download and install the dependencies. After successful compilation, it creates the target folder and a jar file named freblogg-spark-tutorial-0.0.1.jar.

Then to execute the program you need to run the spark-submit script in your SPARK_HOME folder.

 $SPARK_HOME/bin/spark-submit --class "WordCount" target/freblogg-spark-tutorial-0.0.1.jar

Once this command is executed your screen will be completely filled with spark logs. If you scroll a bit to the top, you will see the following output, which is the output we are interested in.

[(freeze,1), (are,1), (Under,1), (it,1), (is,2), (you,3), (takes,1), (lurking,1), (right,1), (that,1), (a,1), (You,3), (terror,1), (start,1), (dark,1), (between,1), (scream,1), (before,1), (to,3), (as,1), (in,1), (moonlight,1), (sound,1), (midnight,1), (see,1), (stops,1), (sight,1), (try,1), (something,1), (paralyzed,1), (evil,1), (It,1), (eyes,1), (make,1), (almost,1), (but,1), (and,1), (close,1), (heart,1), (looks,1), (your,1), (horror,1), (the,4)]

Those are the counts of each word in the file. So, there you go. You have successfully written your first Spark application. Congratulations. You're officially a Spark programmer now!

Understanding the code

Now that we have our application set up, let's see what the program is doing, step by step.

First we have the spark variables sc and conf. Don't worry too much about them right now. All you need to know is that every Spark program needs those two lines.

SparkConf conf = new SparkConf().setMaster("local").setAppName("wordCount");
JavaSparkContext sc = new JavaSparkContext(conf);

So, just copy paste the lines in every application you are going to work on.

Next we are reading the input file into an RDD. RDDs (Resilient Distributed Datasets) are essentially collections of records, here the lines of a text file, that you read from various sources and can transform into whatever you want using various operations. Here we are reading the input file from our local file system. If you want to read from HDFS, then replace the file:/// with hdfs:///

String inputFile = "file:///home/dsp/Desktop/sparkExamples/sample_testing/resources/inputFile";
JavaRDD<String> input = sc.textFile(inputFile);

Then we have our first transformation operation on the input RDD we have created in the above step.

flatMap is a built-in transformation that takes one input element and can produce any number of output elements, depending on the operations used inside it.

JavaRDD<String> words = input.flatMap(l -> Arrays.asList(l.split(" ")));

Here we are splitting each line on the space character. So, the flatMap function here returns all the words in the input document, and they are stored in the RDD named words. For more about flatMap, read this: Spark FlatMap and Map

Next, we have another transformation mapToPair that returns a Tuple of word and the number 1.

A Tuple is very similar to an ordered pair in the Cartesian coordinate system. A Tuple2 looks like (x,y), where x is the key. Similarly, a Tuple3 will be (x,y,z) and so on.

JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));

As an example, the word you in the input will be mapped to (you,1) by mapToPair function. And, since the result is a pair, we have to store it in a JavaPairRDD which supports pairs.

And, then we are doing the final transformation on the pairs that will add up individual counts of each word.

JavaPairRDD<String, Integer> counts = pairs.reduceByKey((x, y) -> x + y);

The reduceByKey method groups all the Tuple pairs with the same key and combines their values. We have the word 'you' repeated thrice and so we have (you,1) three times. Now, (you,1), (you,1), (you,1) will become (you,3) because of the sum we are doing inside the function. And similarly for the other words.

Then finally we are performing an action on the RDD which is where the actual computation of all the above steps takes place. collect() will return all the elements in the RDD and we are printing that using println, giving us the output we want.
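If you want to poke at the same split, pair and sum steps without a Spark installation, here is a minimal sketch using plain Java 8 streams. It is only a local analogue and not Spark code: the class name WordCountSketch and the method countWords are made up for this illustration, and Collectors.toMap with Integer::sum stands in for the mapToPair and reduceByKey steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; this is not part of the Spark program.
class WordCountSketch {

    // Counts how often each word occurs in the given lines.
    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // Like flatMap on the RDD: every line becomes zero or more words.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Like mapToPair + reduceByKey: start every word at 1 and
                // add the values together whenever the same key shows up again.
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("You try to scream", "You start to freeze");
        System.out.println(countWords(lines));
    }
}
```

With these two sample lines, You and to each end up with a count of 2 and every other word with 1. Unlike Spark, nothing here is lazy or distributed; the sketch only mirrors the shape of the pipeline.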

So there you go, your first Spark application is complete. To learn more, go through the documentation and examples given on Spark's webpage and subscribe to Freblogg for more tutorials.

Happy Sparking!

Image : http://www.datanami.com/wp-content/uploads/2014/12/spark-and-java-8.png

Self Promotion:

If you have liked this article and would like to see more, subscribe to our Facebook and G+ pages.

Facebook page @ Facebook.com/freblogg

Google Plus Page @ Google.com/freblogg

Apache Spark | Map and FlatMap

2016-06-19T03:12:00+05:30

Map and FlatMap functions transform one collection into another, just like the map and flatMap functions in several functional languages. In the context of Apache Spark, they transform one RDD into another RDD.

Here is how they differ from each other.

Map

Map converts an RDD of size 'n' into another RDD of size 'n'. The input and output sizes of the RDDs will be the same. Or to put it another way, one element in the input gets mapped to exactly one element in the output.

So, for example let’s say I have an array [1,2,3,4] and I want to increment each element by 10. The input size and output size are same, so we can use map for this transformation.

Required :

[1,2,3,4] -> [11,12,13,14]

Spark code :

myRdd.map(x -> x+10)

So, that is what map function does. While using map, you can be sure that the size of input and output will remain the same and so even if you put a hundred map functions in series, the output and the input will have the same number of elements.
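To try the increment example without Spark, here is a small sketch that uses Java 8 streams as a stand-in for an RDD. The class MapSketch and the method addTen are hypothetical names chosen for this illustration; the stream map call only mirrors the behaviour described above and is not the Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class; a Java list stands in for the RDD.
class MapSketch {

    // Adds 10 to every element. The result always has the same size as the input.
    static List<Integer> addTen(List<Integer> numbers) {
        return numbers.stream()
                .map(x -> x + 10)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(addTen(input)); // prints [11, 12, 13, 14]
    }
}
```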

FlatMap

Coming to FlatMap, it does a similar job: transforming one collection into another. Or in Spark terms, one RDD into another RDD. But there is no condition that the output size has to be equal to the input size. Or to put it another way, one element in the input can map to zero or more elements in the output.

Also, the output of flatMap is flattened. Though the function in flatMap returns a list of element(s) for each individual element of the input, the output of flatMap will be an RDD which has all the elements flattened into a single list.

Let’s see this with an example.

Say you have a text file as follows

Hello World
Who are you

Now, if you run a flatMap on the textFile rdd,

JavaRDD<String> words = linesRDD.flatMap(x -> Arrays.asList(x.split(" ")));

And, the value in the words RDD would be,

["Hello", "World", "Who", "are", "you"]

So, the transformation process looks like this,

 linesRDD -> [ ["Hello", "World"], ["Who", "are", "you"] ]
          -> ["Hello", "World", "Who", "are", "you"]
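The flattening step can be tried locally with Java 8 streams, which behave in a comparable way. This is a sketch under that assumption and not Spark code: FlatMapSketch and toWords are hypothetical names, and a plain list of lines stands in for linesRDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class; a Java list stands in for linesRDD.
class FlatMapSketch {

    // Splits every line into words and flattens the results into one list.
    static List<String> toWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello World", "Who are you");
        System.out.println(toWords(lines)); // prints [Hello, World, Who, are, you]
    }
}
```

Two input lines turn into five output words, which is exactly the case map cannot handle because it must keep the sizes equal.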

So, those are the differences between Map and FlatMap of Apache Spark.
Keep Practicing and Keep Learning!


Image Credits : http://spark.apache.org/images/spark-logo-trademark.png

Quick Vim Tips

2016-06-19T02:59:00+05:30

Vim is one of the most powerful text editors available, so it is not really possible for everyone to know everything about it or to come up with the same ideas for improving their work experience. So, as we have been doing with the various Vim articles, this article collects a few tips and handy shortcuts that will help your productivity but are individually not extensive enough to get their own dedicated article.

So, here are some useful tips for Vim:

Change your directory locally

Usually to change the current directory in Vim, you would do,

:cd <path/directory>

But that would change it for every window. So, to change it locally, for just the current window (split), you can use lcd.

:lcd $HOME    "change the directory locally to $HOME

Open a link in a browser from vim

Now, this is something that I found out recently and it has been very useful ever since.
Especially if you are someone who does a lot of documentation in Vim or someone who works extensively with HTML, this would be invaluable to you. Just put the cursor on the URL and press gx

gx  "opens a link in your default browser

Edit and source your Vimrc fast

If you are using Vim as your editor, changing and updating your vimrc becomes one of the things that you do very frequently. It can be adding new mappings, deleting the old ones, adding new plugins, etc. Whatever it is, having these mappings will help you do it much faster.

Open your vimrc in a vertical split

:nnoremap <leader>r  :vsp $MYVIMRC<CR>

Source your vimrc to get the new changes.

:nnoremap <leader>sv  :source $MYVIMRC<CR>

Automatically delete trailing white-spaces

If you are one of those people (including me) who wouldn't like to have trailing white spaces at the end of lines, then you would absolutely want to have this in your vimrc file. This will take care of all those nasty white-spaces on all lines.

autocmd! BufWritePre * call DeleteTrailingSpaces()

function! DeleteTrailingSpaces()
  execute "normal! mzA "
  "Deletes all Trailing spaces"
  %s/\s\+$//g
  execute "normal! `z"
endfunction

So, that is all for this Quick tips article. More will be coming in the future. So, stay tuned.

Happy Vimming!

Multi task in Vim with panes

2016-06-19T02:42:00+05:30

A lot of programmers use Vim in some way or another, but a vast majority of them use only a handful of features. Knowing how to use windows/splits, tabs, macros and marks can really increase your productivity. Through this and the upcoming articles on Vim I will try to cover the important things that make Vim so awesome.

Splitting your Screen

Vim splits are a very powerful way of keeping your workflow organized. You can use splits (windows or view-ports in Vim vernacular) to get a different view into the same file or to open a different file to see a quick diff.

The advantage of Vim over other popular editors is that they either don't support splitting the screen or have several limitations on how you can split. Vim lets you split the window into as many pieces as you want and also lets you switch between them instantaneously. And, most importantly, you have both horizontal and vertical splits.
So, if it makes sense, you can create a really complex split layout like this (though I would probably advise against it).

Or you can just keep it simple with just a couple of splits open. All up to your requirement.

So, Let's see a few things that will get you started with using Split windows in Vim.
To open a file in a Horizontal Split window just type this in,

:sp filename.here <Enter>

To open a file in a Vertical Split window,

:vsp filename.here <Enter>

Now, if you want to open the current file in a new split (horizontal or vertical), just type :sp or :vsp and it will open a new split-view in to the current file.
Here are a few things that will make using splits much easier.
You probably want to put these in your vimrc file.

" Easy 'split' navigation
""""""""""""""""""""""""
nnoremap <C-h> <C-w>h
nnoremap <C-j> <C-w>j
nnoremap <C-k> <C-w>k
nnoremap <C-l> <C-w>l

This will make switching from one split to another really easy and intuitive. So, <Ctrl + h> moves the cursor to the left split just as 'h' would move it to the left character. Similarly for the others.
Another thing I really like to do is resize the splits. I have the following in my vimrc for easy resizing.

" Easily resize the splits
""""""""""""""""""""""""""""
nnoremap <C-Right> :vertical resize +5<CR>
nnoremap <C-Left> :vertical resize -5<CR>
nnoremap <C-Up> :res +5<CR>
nnoremap <C-Down> :res -5<CR>

So, to resize your vertical split to the right, press Ctrl + <Right-Arrow> and similarly for others.

Another mapping that I use almost every day is mapping my <Right-Arrow> to open up a new vertical split. I absolutely love this and use it whenever I want a quick diff of two files or just to open two views of the file in the buffer.

nnoremap <right> :vsplit<CR>

So, these are the basic things that you need to know to start using splits in Vim. For more info on splits and buffers, use Vim's built-in help: :help windows

That is all for this post. Stay tuned for the next one.
Happy Vimming!

Optional Character | RegEx : The Right Way

2016-06-19T02:24:00+05:30

We have already seen how to use the dot operator in Regular Expressions in the last tutorial. To see all the articles of this Regular Expressions series, click here.

In this lesson, we will see what to do when some characters you want to match are optional, i.e., if they are present, match them and if they are not, don't bother.
So, imagine a situation where you want to match foo and foobar. Here, bar is optional. Then, what do we do to match and identify such words? Let's see ...

We can start with /foo/. As you see, it matches both foo and foobar but also matches any other word that contains foo. So, not exactly what we want.

For situations like these we use '?', which tells the RegEx engine that whatever character precedes the '?' is optional and is not required.
Now let's simplify the problem a bit. We want to match only foo and foob for now.

So, we'll do /foob?/. As you see, it matches both foo and foob completely. So, this is good. Although, it still matches food, ignore that for now.
Extending the same idea, /foob?a?r?/ will mean that 'b', 'a' and 'r' are each optional, and all of these words are matched: foo, foob, fooba and foobar. Since I am looking for only foo and foobar, I need to do something more to not allow the intermediaries.
Also, too many ?'s in the RegEx don't look that good.

So, Let's see how it can be done to meet our requirements. We will use grouping to do that. Groups will be covered in much detail in a separate tutorial. For now just sit tight and continue. So, we will do this, /foo(bar)?/. And, as you can see it is matching both foo and foobar as we wanted.


Now, let's try to understand what we did here.
(bar) is called a Group. A group is just a fancy way of saying that everything inside has something in common.
And, in our case, the common property is that all the letters inside are optional together. When the RegEx engine looks at that, it knows bar is optional and so matches foo whether or not bar is present. The end result being only foo and foobar are matched completely and not the other things.
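Here is a minimal Java sketch of the same pattern, for readers who would rather check it in code than in Regexr. The class OptionalGroupSketch and the method isFooOrFoobar are hypothetical names made up for this example. One assumption to note: Matcher.matches() only succeeds when the whole input matches, which is stricter than the highlighting in Regexr, where foo would still be highlighted inside foob.

```java
import java.util.regex.Pattern;

// Hypothetical example class for trying out the optional group.
class OptionalGroupSketch {

    // "foo" followed by an optional "bar" group.
    private static final Pattern FOO_OPTIONAL_BAR = Pattern.compile("foo(bar)?");

    // True only when the whole word is "foo" or "foobar".
    static boolean isFooOrFoobar(String word) {
        return FOO_OPTIONAL_BAR.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"foo", "foobar", "foob", "fooba", "food"}) {
            System.out.println(word + " -> " + isFooOrFoobar(word));
        }
    }
}
```

Only foo and foobar come back as true; the intermediaries foob and fooba, and the unrelated food, come back as false.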

Now, it's time for some RegExercise...
What will you do to match both 'cats' and 'carts' but nothing else?
Think about it and try it in Regexr.com to see if it is working.

Well, that is everything for this tutorial. Stay Tuned for more.

Happy RegExing!

Encryption in Vim

2016-06-19T02:21:00+05:30

Vim never fails to surprise you with the amazing features it has in its arsenal. Very recently I found that Vim comes bundled with an encryption mechanism referred to as VimCrypt.
It is always a good practice to encrypt your files, especially when they contain personal or sensitive information. I often write my daily journal notes in Vim and I always encrypt them with some external program. But Vim itself is capable of doing that.
Let's see how it works.

Here, I have a file named encrypt.txt open in vim.

So, now I want to encrypt this. All I have to do is this and press Enter.

:X

And, it will prompt you to enter a key for the encryption. Enter it twice. This key will be used to encrypt and later decrypt the file.

And, that is all.. Your file is now encrypted with the key you have given.
The next time you try to open the file again, it will prompt you to enter the Encryption key.

Make sure to remember the key you have entered because if you enter a wrong key to decrypt the data, you will see a completely garbled gibberish on screen.

So, that is how you encrypt your files in Vim.

But, here are a few things to be mindful of

VimCrypt uses a really weak encryption algorithm by default. It can be broken rather effortlessly. Hence, don't use this for encrypting really, really important files.
If you open the file with a wrong password, you'll see garbled text on screen. But do not save that gibberish to disk, because if you do, Vim will overwrite your original file contents and your data will be lost.

There are several new encryption methods available to achieve the same thing like, Blowfish and Blowfish 2. These are much better than the default VimCrypt and will be much harder to crack. More about them later in another tutorial.

So, that is all for this article. It's good practice to encrypt files to keep your data secure. So, try to use this whenever you can.
And, until next time, happy Vimming!

Dot Operator | RegEx : The Right Way

2016-06-19T02:18:00+05:30

Let's continue with Regular Expressions. All articles in this series can be found here. I will be using Regexr.com for most of these tutorials. It is a great site, where you can write and validate your regular expressions against your desired input text.

Now, let's look at what regular expressions usually look like.

The '/' before and after the pattern is the "delimiter". We put our search pattern between the delimiters and in the end pass a flag. Flags add a bit more control on what you want to accomplish. So, with that knowledge let's get started.

Matching a String

To match any given string, we just write /{string}/ and the Regex engine will find that string in the text. So, when I search for Blogg it will be matched with FreBlogg, as we can expect, but not with 'blogg', as matching is case-sensitive by default.

To match even the 'blogg' on the second line, you can use the i flag (which ignores case) along with g. But the problem with the i flag is that it will match a lot more than we want it to, as you can see in the second image. So, use the i flag with caution, as it can match other strings you might not want.

We will later see what to do if you only want to match 'Blogg' and 'blogg' and nothing else.
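The same behaviour can be tried outside Regexr. Here is a minimal Python sketch, assuming a made-up sample text (the three lines below are not the text from the images). Python's re.IGNORECASE stands in for the i flag, and re.findall already returns every match, which is what the g flag asks for.

```python
import re

# Hypothetical sample text, chosen only to show the effect of the flag.
text = "FreBlogg\nblogg\nBLOGGER"

# Case-sensitive by default: only the 'Blogg' inside 'FreBlogg' is found.
case_sensitive = re.findall(r"Blogg", text)
print(case_sensitive)    # ['Blogg']

# Ignoring case also picks up 'blogg' and the 'BLOGG' inside 'BLOGGER',
# which may be more than we wanted.
case_insensitive = re.findall(r"Blogg", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['Blogg', 'blogg', 'BLOGG']
```

Notice how 'BLOGGER' sneaks into the results. That is the kind of extra match the i flag can bring along.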

So, that is how you can match for a single string. Let's look at the Dot operator now.

The Dot Operator

The /./ in regex matches every character except newline characters.
We use this when we want to match a character but don't care what it is. As you see in the image, it matches all letters, numbers and spaces. Even special characters like @, ! and $ will be matched.

When I try /D./, it matches 'Do' because the dot will match anything. But it didn't match the 'D' on line 2 with the 'o' on line 3, because the dot operator won't match newline characters and we have a newline character after that 'D'.
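Here is a small Python sketch of the same idea, assuming a made-up three-line text in place of the one in the image. The re.DOTALL flag at the end is Python's way of letting the dot match newlines too; it is shown only for contrast.

```python
import re

# Hypothetical text: 'Do' on line 1, a lone 'D' on line 2, 'o' on line 3.
text = "Do it\nD\no"

# 'D.' needs a 'D' followed by any one character except a newline,
# so the lone 'D' at the end of line 2 is skipped.
dot_matches = re.findall(r"D.", text)
print(dot_matches)     # ['Do']

# With re.DOTALL the dot matches the newline as well.
dotall_matches = re.findall(r"D.", text, flags=re.DOTALL)
print(dotall_matches)  # ['Do', 'D\n']

# The dot does not care about the kind of character: letters, digits,
# spaces and special characters are all matched one by one.
print(re.findall(r".", "a1 @!$"))  # ['a', '1', ' ', '@', '!', '$']
```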

So, let's use the stuff we've learned till now and do a small exercise.

Say I want to match three-letter words. I can use /.../, but look at what it matched.
As you see, it did not exactly do what we wanted. Since the dot matches spaces as well, it matches a space and two letters, as in 'oh'. So, we can't use it for this particular case. We will see a much better way to do the same thing in later tutorials.
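To see the problem for yourself, here is a short Python sketch. The sentence is a made-up example, not the text from the image.

```python
import re

# Hypothetical sentence with two-letter and three-letter words.
text = "oh my dog ran"

# '...' simply grabs three characters at a time, spaces included,
# so it has no idea where a word starts or ends.
chunks = re.findall(r"...", text)
print(chunks)  # ['oh ', 'my ', 'dog', ' ra']
```

Only 'dog' is a real three-letter word here, and even that is a coincidence of where the chunks happen to fall.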

Well, that is everything for this tutorial. Stay Tuned for more.

Happy RegExing!

RegEx : The Right Way |Tutorial 1

2016-06-19T02:07:00+05:30

A Regular Expression, or RegEx, is a sequence of characters that defines a search pattern. Regex is everywhere these days, and you can use it to extract information from text files, log files, dictionaries, spreadsheets and webpages. Every major programming language has support for Regular Expressions. Most importantly, grep, awk and sed use regex to find/replace matches.

Regular Expressions can help you save a lot of time. Instead of writing complex string pattern searches which span multiple lines, regex gets the job done really easily and really fast.
Let's look at a simple scenario where you might want to use Regex. Let's say you have a string and you want to check if it is a website URL. Here are a few conditions that a URL should satisfy:

Should start with http:// or https://
May or may not have www.
Should have a .com, .org or something similar
Can have characters, digits, underscores, etc.
Might even have a port number at the end, as in http://google.com:80/

Matching all these individually in any programming language, with various string parsing conditions, can be a really challenging task. But using RegEx, the same thing can be achieved much more easily and much faster.
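To give a taste of where this series is heading, here is a rough Python sketch of a pattern built only from the conditions listed above. It is a simplified illustration, not a complete URL validator, and the name looks_like_url is made up for this example. Don't worry about the symbols yet.

```python
import re

# Simplified pattern that follows the listed conditions one by one.
URL_PATTERN = re.compile(
    r"^https?://"            # must start with http:// or https://
    r"(www\.)?"              # may or may not have www.
    r"[A-Za-z0-9_-]+"        # characters, digits, underscores, hyphens
    r"(\.[A-Za-z0-9_-]+)*"   # optional extra dot-separated parts
    r"\.[A-Za-z]{2,}"        # .com, .org or something similar
    r"(:\d+)?"               # optional port number such as :80
    r"(/.*)?$"               # optional path after the host
)

def looks_like_url(candidate):
    """Return True if candidate satisfies the simplified conditions."""
    return URL_PATTERN.match(candidate) is not None

print(looks_like_url("http://google.com:80/"))  # True
print(looks_like_url("google.com"))             # False, no http:// or https://
```

One pattern replaces what would otherwise be a pile of startswith, find and split calls.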

That sounds like fun, doesn't it? Well, let's get started.

If you have used a Linux shell/terminal before, you have probably used something very close to Regular Expressions already. Bash shells have some basic pattern matching capabilities built into them. So, let's look at an example.

I am currently inside a folder with some files in it.

$ ls
file.csv       picture.jpg    README_en.txt  touch2.txt  video.mp4
HelloWorld.rb  program1.java  touch1.txt     touch2.vim

If I want to see only the files with the extension .txt, I can do this.

$ ls *.txt
README_en.txt  touch1.txt  touch2.txt

This can be thought of as regex in its most basic form (strictly speaking, the shell calls this globbing, and its rules differ a little from real regular expressions). We are giving a search pattern and we are seeing the output that matches this pattern. The '*' here is called a wildcard character, which basically matches anything and everything.

This time, let's say I want to search for a txt file whose name is 'touch' followed by something (in this folder we have touch1.txt and touch2.txt). Let's say I don't remember the exact number following 'touch'. To search for that, I can use the ? operator, and that will give me this:

$ ls touch?.txt
touch1.txt  touch2.txt

Now, the '?' operator is also a wildcard character, but it matches exactly one character. So, if there is a file called touchA.txt or touch%.txt, it will be matched too, but touchAB.txt will not be matched.
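The same wildcard rules can be tried without touching your file system. Here is a minimal Python sketch using the standard fnmatch module, which follows shell-style patterns. The helper glob_filter and the three extra file names (touchA.txt, touch%.txt, touchAB.txt) are made up for this example.

```python
from fnmatch import fnmatchcase

# The files from the listing above, plus three hypothetical extras.
files = [
    "file.csv", "picture.jpg", "README_en.txt", "touch2.txt", "video.mp4",
    "HelloWorld.rb", "program1.java", "touch1.txt", "touch2.vim",
    "touchA.txt", "touch%.txt", "touchAB.txt",
]

def glob_filter(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]

# '?' stands for exactly one character, so touchAB.txt is left out.
print(glob_filter("touch?.txt", files))
# ['touch2.txt', 'touch1.txt', 'touchA.txt', 'touch%.txt']

# '*' stands for any number of characters.
print(glob_filter("*.txt", files))
```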

So, this is how you can improve your search results using search patterns. We can use programs like grep, egrep and sed to do a lot more than just this, and we will use them in the upcoming tutorials.

That's everything for this tutorial. So, stay tuned for the upcoming ones.
Happy RegExing!

NERDTree | Your very own Vim file tree

2016-06-19T02:00:00+05:30

NERDTree is a real time saver and a pretty cool extension to your Vim setup that makes it more user friendly. Almost every other text editor out there can show a listing of the directory the current file lives in. If you are wondering how you can do the same in Vim, then look no further, because NERDTree is what you want.

So, what does it do? It shows all the files and folders in the current working directory. Also, you can add and delete files right from the list. That's pretty cool.

So, take a look,

Some useful stuff regarding this plugin:

:NERDTreeToggle - Toggles the file pane on/off. You might want to map that in your vimrc. I have it mapped to <LEADER> n
? - Hit '?' and you'll get all the help you need for using it
m - Hit 'm' and you'll be presented with a menu to create, delete and list
For all extra info - :help NERDTree

So, that is all I have about this plugin. It is a really great addition to your workflow and you will love it.

Download Link : https://github.com/scrooloose/nerdtree

Happy Vimming!

PS: To see all the Vim tutorials of FreBlogg, visit: Freblogg/vim

Image Credits:

vim logo - hackdesign.org - https://goo.gl/ADCh6R

Matrix Rain / Falling Matrix code : Notepad trick

2013-05-11T21:16:00+05:30

Have you watched the movie The Matrix? If you have, you would surely have noticed the green coded numbers running up and down the screen (also called Matrix Rain). That falling code trick is very easy to create on your own. Now, I'll show you how to do that.

Steps To Generate Falling Matrix Code

1) Open Notepad on your computer
2) Copy and paste the following code into it

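@echo off
rem 0 = black background, 2 = green text
color 02
:start
rem print one row of ten random numbers
echo %random% %random% %random% %random% %random% %random% %random% %random% %random% %random%
rem jump back and print the next row, forever
goto start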

3) Save it with any name you wish, but with the extension '.bat' (.bat stands for batch file). Save it wherever you want in your file system.

4) Now, double click on the file and feel like a badass programmer. Try showing this to your friends and feel proud of yourself when they look up to you.

Watch the YouTube video explaining the same

5) You can also change the color of the numbers by changing the 'color 02' in the code to whatever number you want.
Following are the colors for the various possible numbers:

00 - Black
01 - Blue
02 - Green
03 - Aqua Blue (greenish blue)
04 - Red
05 - Purple
06 - Yellow
07 - White
08 - Grey
09 - Light Blue
0A - Light Green
0B - Light Aqua Blue
0C - Light Red
0D - Light Purple
0E - Light Yellow
0F - Bright White

6) To change the background, change the first digit of the number.
Some examples are:
1X - Blue background + color of the letters corresponding to the number X
2X - Green background + color of the letters corresponding to X
AX - Light Green background + color of the letters corresponding to X
And so on; you can keep going.
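If the two-digit scheme feels confusing, here is a small Python sketch that decodes a code into its two parts. The table simply restates the list above, and the helper name describe_color is made up for this example; it is not part of the batch file.

```python
# Color names for each hex digit, copied from the list above.
COLOR_NAMES = {
    "0": "Black", "1": "Blue", "2": "Green", "3": "Aqua Blue",
    "4": "Red", "5": "Purple", "6": "Yellow", "7": "White",
    "8": "Grey", "9": "Light Blue", "A": "Light Green",
    "B": "Light Aqua Blue", "C": "Light Red", "D": "Light Purple",
    "E": "Light Yellow", "F": "Bright White",
}

def describe_color(code):
    """Split a two-digit code like '0A' into (background, text) color names."""
    background, text = code.upper()  # first digit, second digit
    return COLOR_NAMES[background], COLOR_NAMES[text]

print(describe_color("02"))  # ('Black', 'Green')
print(describe_color("1a"))  # ('Blue', 'Light Green')
```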

If no argument is given, the color command restores the color to what it was when CMD.EXE started. This value comes either from the current console window or from the DefaultColor registry value.

Thanks for stopping by. For more interesting and awesome tricks and tweaks subscribe to our blog feed.

SAMSUNG GALAXY S4 Vs HTC ONE : Complete Review

2013-05-10T11:54:00+05:30

Are you planning to buy the Samsung Galaxy S4 or the HTC One?

The Samsung Galaxy S4 and the HTC One, the two new, popular, hyped and high-performance HD Android smartphones, are the talk of the town these days. They are arguably the most popular and most sought-after phones of 2013. These are the biggest (nothing to do with their size) smartphones launched in 2013. They will go head to head and toe to toe against each other, and their match-up will go on till the end of 2013 and much beyond. When they were launched, both were thought of as tough and worthy opponents to the iPhone 5. But they eventually crossed that mark, and now they are no longer seen as mere competitors to the iPhone or other smartphones; they just stand out of the pack. These two have clearly established themselves as the kings of 1080p smartphones. So, here is the side-by-side comparison of the two greatest flagship phones of 2013 (of all time): SAMSUNG GALAXY S4 Vs HTC ONE

Individual Stories

Galaxy S4

The Galaxy S4 is one of the most popular phones of 2013 because of all the hype and publicity done by Samsung, and because its predecessor, the S3, is so popular. These made everybody expect it to be a good phone. Everybody thought of it as a revolutionary update of the S3, but when it came out it was more of an evolution than a revolution. Although Samsung says they have made over 100 changes, people don't really see them. All they can see is a thinner, lighter, squarer S3 which also gets some of its looks from the Galaxy Note 2. So, the S4 remains a minor update of the S3 with some new features and a high-power battery. Its new features, such as Air Gesture, Smart Scroll and Smart Stay, made it worth waiting for.

HTC One

The HTC One is an interesting phone this year for a few reasons. First, it is the flagship phone for HTC, and it is surely their desperate effort to stay high in the market. It is a must-win situation for them, mostly because its predecessor, the One X, got some good reviews but was commercially unsuccessful. HTC is clearly hoping this will be the ONE to change their fate, hence the name ONE. Secondly, it is competing directly against two other great smartphones, the iPhone and Samsung's S4. Just by looking at the specs you can tell that HTC has surely taken a few risks trying to get market share. Most of the emphasis of the ONE is on unique hardware and software features. A new camera with an image sensor built to perform well in low-light environments, and many other features, make it a good pick for anyone.

Face to Face: Features, Pros & Cons

1) Build and Looks

The HTC One is probably the best-built phone ever made (at least among HTC's). Build quality is one of the biggest differences between these two phones, and it is the first thing a person notices when holding one. The immediate first impression of a phone is how it feels. I must say, HTC stands out in this aspect due to its all-metal unibody back design, which also wraps around a bit onto the front side. It has that sturdy and hefty feel when you hold it in your hand, which is one of the greatest things about this phone. The S4 feels completely different. It is made of plastic, so it feels flexible and light. If you don't mind a few extra grams in your hand you could go for the ONE, but if you want your phone to be as light as possible then you can go for the S4. Both plastic and metal have their own pros and cons, but they balance each other out.
Another major difference is the speakers. The S4 has a small rear speaker which is OK (you know what I mean), but the ONE has these massive front-facing stereo speakers which are amazing. HTC calls it 'BoomSound'. No one else has done it as well as HTC. Beats Audio is surely another plus.

2) Screen Size & Display

By: jokaone

The HTC One has a 4.7 inch LCD screen and the Galaxy S4 has a 5 inch Super AMOLED screen. Both are 1080p full HD (1,920 x 1,080) smartphones. Both have a crystal clear, razor sharp display. The HTC ONE has a slightly brighter display because of its LCD screen compared to the AMOLED screen. It also displays more accurate colors. HTC can surely be proud of their higher pixel density (the same number of pixels in a smaller screen).
HTC ONE - 468 ppi
S4 - 441 ppi
(ppi = pixels per inch)
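If you are wondering where these ppi numbers come from, they follow from the resolution and the screen size: divide the number of pixels along the diagonal by the diagonal length in inches. Here is a small Python sketch of that arithmetic; the function name pixels_per_inch is made up for this example.

```python
import math

def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixels along the screen diagonal divided by the diagonal in inches."""
    return math.hypot(width_px, height_px) / diagonal_in

htc_one = pixels_per_inch(1920, 1080, 4.7)    # about 468.7
galaxy_s4 = pixels_per_inch(1920, 1080, 5.0)  # about 440.6

print(f"HTC ONE: {htc_one:.1f} ppi")
print(f"S4:      {galaxy_s4:.1f} ppi")
```

Both phones have exactly the same number of pixels, so the smaller 4.7 inch screen ends up packing them more tightly.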

3) Battery

The Galaxy S4 comes with a 2600 mAh Li-ion battery while the HTC One has a 2300 mAh Li-Po (Lithium Polymer) battery. Clearly, the S4 has an advantage in this aspect on paper, and it should last a bit longer than that of the One. Furthermore, the battery of the S4 is replaceable, unlike that of the ONE. The S4 has one more feature: the battery can be charged wirelessly. Nevertheless, both the S4 and the ONE are massive powerhouses destined to give high performance. In practice, the battery life is almost the same in both of these phones. The high clock speed of the S4 consumes more battery power, and it probably eats away some of the extra advantage of the 2600 mAh battery.

4) Operating System

The S4 ships with the latest version of Android (4.2.2) and Samsung's very own TouchWiz user experience, while the ONE ships with the previous version of Android (4.1.2) and its own Sense 5. The distinction here clearly depends on your tastes. If you like TouchWiz (I doubt that) you can go for the S4, or if Sense makes sense to you, you can prefer the ONE. The ONE will miss some of the updated features of the new version of Android for now, but you can be sure to see the new updates right around the corner.

5) Camera

One of the most highly debated topics when comparing the S4 and the ONE is the camera. The HTC ONE has a 4 'UltraPixel' (same as 4 MP) rear camera while the Galaxy S4 has a 13 MP rear camera. Though both can take pretty good photos, the ONE's 4 MP camera is a sort of downside (though it has its own pros). The S4 takes amazingly crisp and high quality images almost all the time thanks to its 13 MP sensor, while the ONE can take more detailed images in low-light environments because each of its 4 million pixels is larger and captures more light. For regular use we can't see much of a difference between them, but when we zoom or crop the image, the ONE's image falls apart soon enough while the S4's picture stays crisp and crunchy all the way through, which gives the S4 an advantage for photographers. So, if you are a pixel sucker (the one who wishes for more and more pixels), the S4 will be a great choice for you. One soothing thing for ONE lovers is its wide-angle lens in both the front-facing camera and the rear camera, which lets you include more of the room in your shot. This is a noticeable difference between these two great phones.

Phone	Rear Camera	Front Camera
HTC ONE	4 MP with Auto Focus, 1080p HD recording, HDR recording	2.1 MP, 1080p HD recording
Galaxy S4	13 MP with Auto Focus, 1080p HD recording, HDR recording	2 MP, 1080p HD recording

6) Performance

Both the S4 and the ONE are performance drivers. The HTC One comes with a 1.7 GHz quad-core Snapdragon 600 processor. The S4's quad-core variant has a 1.9 GHz Snapdragon 600 SoC, while the octa-core version comes with a 1.6 GHz A15 quad-core cluster and a 1.2 GHz A7 quad-core cluster. Since the octa-core versions are less popular, we'll ignore them here. That leaves the 1.9 GHz S4 with a plus over the 1.7 GHz ONE in raw clock speed. But a higher clock speed means using more battery power, and that is why the S4 uses a 2600 mAh battery. Coming to response speed, HTC takes the lead. The ONE is very quick and has a highly responsive touch screen compared to the S4, which can stutter sometimes, and this gives a slight edge to the ONE.

7) Internal Memory Storage

When it comes to storage, the Galaxy S4 tends to be more versatile and flexible. The S4 comes in 16 GB, 32 GB and 64 GB versions and also has a microSD card slot which can add up to 64 GB more.

The HTC ONE comes in two versions, 32 GB and 64 GB, and does not have an SD card option (except in China).

8) Other features

Some other interesting features of these phones that are worth mentioning:

HTC ONE

BLINKFEED: The ONE features a new news aggregator, known as 'BlinkFeed', which displays a scrolling list of news and other content from social networking sites
ZOE: The camera app includes a new shooting mode called ZOE, with which you can film 4 seconds of video and create your own GIFs
REMOTE CONTROL: The HTC One includes an electronic program guide powered by Peel, with which it can act as a remote control for your TV

GALAXY S4

SMART SCROLL: Screen can be scrolled up or down by tilting the phone
SMART PAUSE: Video gets paused on its own if you are not looking at the screen. It will resume playing when you look at the screen again
GROUP PLAY: Allows you to share files with other Galaxy S4 phones. You can play the same game or you can listen to the same song with the other shared S4's acting as supporting speakers
AIR VIEW: Allows the user to preview an image or a video by hovering a finger over it
ERASER: Allows the user to remove unnecessary things from the image while capturing
DUAL-SHOT: Allows the person taking the picture to be in the picture (So, no more 'who took this image?' sort of questions)
SOUND & SHOT: Allows the user to record a small voice clip alongside a picture
KNOX: A new feature which allows the user to divide the phone between business and personal use

Final say

Both of these are great phones at the end of the day. I will happily recommend either of these phones because no matter which you buy, it's worth your money and time. They counter each other feature for feature. So, whether you buy the S4 or the HTC One, you can be sure that you'll get the high-end performance you expect. The only thing I would like to add is to put your money wherever your interest lies. So, this concludes the S4 Vs ONE review. Which one will you pick? What do you prefer? Let us know what you think of each of these phones. We would be grateful if you could describe your reasons for picking one over the other. Thanks for stopping by. Have a happy day ahead.

Attributions

S4 Black mist image - Author: Samsung Belgium (creative commons license)
HTC ONE image - Author: Hi-tech@Mail.Ru (creative commons license)
Android Image - Author: Google (creative commons license)