
A guest post by Kasper Grud Skat Madsen, who is currently a Ph.D. student at the University of Southern Denmark. His research interests are data stream management and cloud computing. He is a contributor to storm and storm-deploy, and is also involved in Enorm, a project that aims to extend Storm.

Storm is an open source system for low-latency distributed computation on a shared-nothing architecture. In this post, we will present a simple (but useful) job that maintains the set of unique users from different geographical regions. In the next post, we will show how to deploy this job to Amazon EC2. Only excerpts of the source are given in these posts; for the full source and demo data, see www.github.com/KasperMadsen/SimpleStormJob.

Before starting, I would like to point out the difference between Hadoop and Storm. Both systems scale very well and can therefore process large amounts of data. One of the main differences is that Storm is optimized for low-latency processing, whereas Hadoop is batch-based and is not. Low-latency processing carries a cost compared to batch processing, which makes direct performance comparisons unfair (the two systems solve different problems). As a rule of thumb, use Storm when low latency matters; otherwise you are probably better off using Hadoop.

The problem

A company has developed an app, but knows nothing about how consumers are interacting with it. To gain a better understanding, they need us to help them do statistical calculations in real time (strictly speaking, in an online fashion). The first goal is to create a job that maintains the set of unique consumers from different geographical regions. Each time a consumer opens a page in the app, the following information is written to a stream that feeds directly into Storm (simulated by demo data in this post):

  • User-id
  • Geohash
  • Timestamp (date and time)

Development platform

Installing Storm is simple: just download (http://storm-project.net/) and extract the current stable version (0.8.2 at the time of writing). Start a new Java project in your favorite IDE, add the storm-*.jar file and the lib/*.jar files to the build path of your project, and you are ready to begin.

Topologies

A topology (job) is a directed acyclic graph, where each vertex represents a worker instance and each edge the dataflow between the worker instances. There are two kinds of workers: spouts and bolts. Data can flow from a spout to a bolt, but never from a bolt to a spout. This naturally makes spouts work as inputs in a topology.

Each worker is deployed with arbitrary parallelism, meaning that for each spout and bolt, the user decides how many tasks (instances) to run. Running a worker with more than one task introduces a complication, as Storm needs to know which tasks should receive the data. When you define the dataflow between workers, Storm therefore lets you specify a grouping, which decides how data is distributed among the tasks.

[Figure: one spout task sending data to a bolt running three tasks]

This figure shows one spout sending data to one bolt. The bolt consists of three tasks (instances), so there must be a grouping to define which task (2, 3 or 4) receives the data sent from task 1.

Storm comes with a set of predefined groupings. Let’s look at some of the most useful ones. With an all grouping between the spout and the bolt, each tuple is duplicated to tasks 2, 3 and 4. With a fields grouping on a key, all tuples with the same value of that key are sent to the same task (either 2, 3 or 4). Finally, with a shuffle grouping, tuples are sent randomly to task 2, 3 or 4.
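To make this concrete, here is a rough sketch of how the three groupings are declared with Storm’s TopologyBuilder. The component names and the MySpout/MyBolt classes are placeholders for this illustration, not part of the actual project:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new MySpout(), 1);   // MySpout is a placeholder spout class

    // All grouping: every task of the bolt receives a copy of each tuple.
    builder.setBolt("boltAll", new MyBolt(), 3).allGrouping("spout");

    // Fields grouping: tuples with the same value of "key" always go to the same task.
    builder.setBolt("boltFields", new MyBolt(), 3).fieldsGrouping("spout", new Fields("key"));

    // Shuffle grouping: tuples are distributed randomly across the three tasks.
    builder.setBolt("boltShuffle", new MyBolt(), 3).shuffleGrouping("spout");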

Implementing a spout

When a topology is submitted, Storm calls the open method on all of the spouts. In the open method, the spouts can prepare themselves by setting up a socket connection, reading data from a file and so on. In our job, we read the demo data from a file into the array called _inputContent. After the open method is called, Storm continuously calls the nextTuple function.

This function reads the next entry from the array, parses it, and sends (emits) it along in the job. When sending data between tasks, the schema of the data must be declared. This is done by implementing the function called declareOutputFields.
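The full spout is in the repository; as an illustration, a minimal sketch of such a spout could look as follows. The class name, the comma-separated layout of the demo data and the use of a 3-character prefix as the short geohash are assumptions made for this sketch, not taken from the actual project:

    import java.util.Map;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class InputSpout extends BaseRichSpout {
        private SpoutOutputCollector _collector;
        private String[] _inputContent;  // demo data, loaded in open()
        private int _index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            // Read the demo data file into _inputContent here (omitted for brevity),
            // one line per event: user-id, geohash, timestamp.
            _inputContent = new String[0];
        }

        @Override
        public void nextTuple() {
            if (_index >= _inputContent.length) {
                return;  // no more demo data to emit
            }
            // Assumed comma-separated line: user-id, geohash, timestamp.
            String[] parts = _inputContent[_index++].split(",");
            String userId = parts[0];
            String geohash = parts[1];
            String timestamp = parts[2];
            // Assumption: the first 3 characters of the geohash form the "short" geohash.
            String shortGeohash = geohash.substring(0, Math.min(3, geohash.length()));
            _collector.emit(new Values(userId, shortGeohash, geohash, timestamp));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Schema of every tuple emitted by this spout.
            declarer.declare(new Fields("userId", "shortGeohash", "geohash", "timestamp"));
        }
    }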

The function tells Storm that each tuple emitted from the spout will contain a user-id, a short geohash, a full geohash and a timestamp.

Implementing the bolt

Create a new class called “processor” and make it implement the interface called IRichBolt. This interface declares a set of functions, but for this short post, we will only need to consider one of them.

The execute method is called each time a tuple is received. It parses the incoming tuple and extracts the user-id and the short geohash. Then, for each short geohash, it updates a HashSet of unique users (remember that a HashSet cannot contain duplicate values). The short geohash represents a larger geographical area than the full geohash, because of the way geohashes work. The size of the geographical regions can be controlled by adjusting the length of the short geohash.
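A minimal sketch of such a bolt is shown below. The exact class and field names are illustrative (see the repository for the real implementation); only execute does real work, the remaining IRichBolt methods are left as stubs:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Tuple;

    public class Processor implements IRichBolt {
        // One set of unique user-ids per short geohash (geographical region).
        private Map<String, HashSet<String>> _uniqueUsers;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            _uniqueUsers = new HashMap<String, HashSet<String>>();
        }

        @Override
        public void execute(Tuple input) {
            String userId = input.getStringByField("userId");
            String shortGeohash = input.getStringByField("shortGeohash");

            HashSet<String> users = _uniqueUsers.get(shortGeohash);
            if (users == null) {
                users = new HashSet<String>();
                _uniqueUsers.put(shortGeohash, users);
            }
            users.add(userId);  // a HashSet silently ignores duplicates
        }

        @Override
        public void cleanup() { }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }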

Specifying the topology

All that is left is to define the topology, which means defining which spouts and bolts to use, and how they are interconnected.
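As a sketch, reusing the InputSpout and Processor classes sketched above (the topology name and the LocalCluster test run are assumptions of this example, not part of the original post), the definition could look like this:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class Main {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // One task for the spout "input", two tasks for the bolt "processor".
            builder.setSpout("input", new InputSpout(), 1);
            builder.setBolt("processor", new Processor(), 2)
                   .fieldsGrouping("input", new Fields("shortGeohash"));

            // Run locally for testing; deployment to a real cluster is covered in the next post.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("unique-users", new Config(), builder.createTopology());
        }
    }

Submitting to a LocalCluster runs the whole topology inside a single JVM, which is convenient for testing before deploying to a real cluster.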

The above topology specification can be visualized as a directed acyclic graph, as shown below.

[Figure: the topology as a directed acyclic graph, with the spout “input” feeding the bolt “processor”]

The spout “input” is run using one task (task 1). The bolt “processor” is run using two tasks (tasks 2 and 3). The bolt subscribes to the spout “input” using a fieldsGrouping(shortGeohash), which ensures that all tuples with the same short geohash are sent to the same task (either 2 or 3). This guarantees that the computation of unique users is correct, no matter how many instances of the bolt “processor” are used.

Conclusion

Now it is time to look at the full source code at www.github.com/KasperMadsen/SimpleStormJob, download it, and execute it before moving on to the second post, which covers deploying Storm on Amazon EC2 and will be published soon.

See below for Storm and Hadoop resources from Safari Books Online.


Safari Books Online has the content you need

Getting Started with Storm introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives.
Storm Real-time Processing Cookbook begins with setting up the development environment and then teaches log stream processing. This is followed by a real-time payments workflow, distributed RPC, integration with other software such as Hadoop and Apache Camel, and more.
With Hadoop: The Definitive Guide, 3rd Edition, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You’ll also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

