In my Big Data learning schedule, the first technology is Hadoop, so I would start with a classical Hello World exercise for it, but written in Kotlin. By the way, all the code from this post can be found in its GitHub repository.
The purpose of this first exercise is pretty easy, developing a MapReduce algorithm that, given as input a text files, generates as a result, a file with for each word the number of times is present in the file. So, first of all, what is MapReduce? Wikipedia says that "MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster". Made it simple from a programming point of view, is just calling two functions, Map() for translating the input into something else and then sorting that, and Reduce() for doing a summary of what happened before.
For our exercise, this MapReduce context is pretty easy: we map each word into a tuple <word, 1>, because each word counts as 1, we sort them in groups that have the same word as key, and then we reduce each group by counting the number of 1 in them.
First the project's settings. I'm using a build.gradle to configure the entire project...
... and another build.gradle for the specific exercise.
They are quite standard gradle configuration files, the only important thing to notice is the jar task at line 35 in the first one. This is necessary for including the kotlin-stdlib inside the jar generated, in order to make possible to use Hadoop for running the example. I haven't found any better way to do it, if anybody has any advice about this, it would be really helpful.
Now, the mapper code. It must be a class that extends org.apache.hadoop.mapreduce.Mapper and override the map(..) method.
The input text is inside value, so we divided it into words, and for each word, we map it to the lowercase version and write it into the context as a key with value = 1. The sorting part on the first value given as input to context.write(..) is completely done by Hadoop, but I wouldn't be surprised if there is a way to influence it.
Similarly to the map part, the reduce class to use must extend org.apache.hadoop.mapreduce.Reducer and override reduce(..). Into reduce, values is the group having all the values that have as common word the value of key. We count all the values and return the couple <word, sum(values)>.
This file, instead, contains a couple of Kotlin extension methods for Hadoop. I'm pretty much obsessed by the concept in the last period, so I couldn't not write some of them, who knows maybe I'll end up writing a Kotlin library for Hadoop!!
Last step, the driver class, the one that configures everything. I still have to dive deeply inside the Hadoop documentation, but I immagine is used mostly to set where to find the input file/files, where to save the results, the mapper and reducer configuration and extra stuff. Here all the extension methods for the Job class make everything more readable in my opinion.
So, I think that this ends up this Hadoop Hello World and my first blog post. I hope everything make sense and please, if you have questions or comments, feel free to write me!