
An Introduction to Storm

Storm, a big-data processing system, has been presented by Twitter as a distributed and fault-tolerant stream processing system with the following key design features: it is scalable, it guarantees that every message will be processed, it is fault-tolerant, and it is not tied to any single programming language.

In this blog post we provide a solid overview of Storm that will help you get started with implementing big-data analytics over streaming data.

The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for performing stream transformations are spouts and bolts. A spout is a source of streams that generates input tuples (Figure 1). A bolt consumes any number of input streams, carries out some processing, and possibly emits new streams (Figure 2). Complex stream transformations, such as the computation of a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts.


Figure 1: The behavior of a Storm spout


Figure 2: The behavior of a Storm bolt
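
To make the spout and bolt abstractions concrete, here is a minimal sketch of a bolt written against the pre-1.0 backtype.storm API used in Getting Started with Storm. It consumes a stream of sentences and emits one tuple per word; the class name and the "sentence" and "word" field names are illustrative placeholders rather than part of any particular project.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative bolt: splits each incoming sentence into words.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Consume one tuple from the input stream and emit one new tuple per word.
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare the schema of the stream this bolt emits.
        declarer.declare(new Fields("word"));
    }
}

A spout looks similar, except that it extends BaseRichSpout and produces tuples from its nextTuple method instead of consuming them.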

A topology is a graph of stream transformations where each node is a spout or a bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream. Links between nodes in a topology indicate how tuples should be passed around, and each node in a Storm topology executes in parallel. In any topology, we can specify how much parallelism is required for each node, and then Storm will spawn that number of threads across the cluster to perform the execution. Figure 3 depicts a sample Storm topology.


Figure 3: A sample Storm processing topology consisting of two spouts and four bolts
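
As a rough sketch of how such a topology might be wired together and run locally, again using the pre-1.0 backtype.storm API: RandomSentenceSpout and WordCountBolt are hypothetical placeholder components, SplitSentenceBolt is the bolt sketched above, and the numeric argument to setSpout and setBolt is the parallelism hint, i.e. how many threads Storm should spawn across the cluster for that node.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The last argument of setSpout/setBolt is the parallelism hint: the number
        // of threads Storm spawns across the cluster to execute that node.
        // RandomSentenceSpout and WordCountBolt are hypothetical placeholders.
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                // subscribe to the spout's stream
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word")); // group by the "word" field

        // Run in-process for development; on a real cluster the topology would be
        // submitted with StormSubmitter instead of LocalCluster.
        Config conf = new Config();
        new LocalCluster().submitTopology("word-count", conf, builder.createTopology());
    }
}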

The Storm system relies upon the notion of stream grouping to specify how tuples are sent between processing components. In other words, a stream grouping defines how a stream should be partitioned among the tasks of the bolt that consumes it. In particular, Storm supports several built-in stream groupings, such as shuffle grouping (tuples are distributed randomly but evenly across the bolt's tasks), fields grouping (tuples are partitioned by the values of specified fields, so equal values always reach the same task), all grouping (the stream is replicated to every task), and global grouping (the entire stream goes to a single task).

Figure 4 illustrates the behavior of the built-in Storm groupings. In addition to these built-in mechanisms, Storm allows its users to define their own custom grouping mechanisms.


Figure 4: Storm Groupings
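
The snippet below is a rough illustration of how these groupings are declared when connecting bolts to their input streams, again against the pre-1.0 backtype.storm API. Apart from SplitSentenceBolt, the bolt classes and the MyCustomGrouping class are hypothetical placeholders used only to show the calls.

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExamples {
    // Assumes the builder already has a "sentences" spout registered.
    public static void declareGroupings(TopologyBuilder builder) {
        // Shuffle grouping: tuples are distributed randomly but evenly across the bolt's tasks.
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");
        // Fields grouping: tuples with the same value of "word" always go to the same task.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));
        // All grouping: every task of the subscribing bolt receives a copy of every tuple.
        builder.setBolt("metrics", new MetricsBolt())
               .allGrouping("split");
        // Global grouping: the entire stream is routed to a single task of the bolt.
        builder.setBolt("total", new TotalCountBolt())
               .globalGrouping("count");
        // Custom grouping: partitioning logic supplied by a user-defined implementation
        // of the CustomStreamGrouping interface (MyCustomGrouping is a placeholder).
        builder.setBolt("routed", new RoutingBolt())
               .customGrouping("split", new MyCustomGrouping());
    }
}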

Although Storm is implemented in Java, it is possible to use other languages. To learn more about using other languages with Storm, read Chapter 7, "Using Non-JVM Languages with Storm," in Getting Started with Storm.

With this overview of Storm complete, in the next post we will take a look at Storm in action, covering Storm clusters and a Storm example.

Safari Books Online has the content you need

Check out these Storm-related books available from Safari Books Online:

Getting Started with Storm introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives.
Big Data Bibliography contains a selection of the most useful books for data analysis. This bibliography starts from high-level concepts of business intelligence, data analysis, and data mining, and works its way down to the tools needed for number crunching: mathematical toolkits, machine learning, and natural language processing. Cloud services and infrastructure, including Amazon Web Services, are covered, along with Hadoop and NoSQL sections that list the Big Data tools that can be deployed locally or in the cloud.
HBase: The Definitive Guide contains a chapter that covers ZooKeeper; Storm keeps all of its cluster state either in ZooKeeper or on local disk.

About the author

Dr. Sherif Sakr is a Senior Research Scientist in the Software Systems Group at National ICT Australia (NICTA), Sydney, Australia. He is also a Conjoint Senior Lecturer in the School of Computer Science and Engineering (CSE) at the University of New South Wales (UNSW). He received his PhD degree in Computer Science from Konstanz University, Germany, in 2007, and his BSc and MSc degrees in Computer Science from the Information Systems department at the Faculty of Computers and Information, Cairo University, Egypt, in 2000 and 2003 respectively. In 2011, Dr. Sakr held a visiting research scientist position in the eXtreme Computing Group (XCG) at Microsoft Research, Redmond, USA. In 2012, he held a research MTS position at Alcatel-Lucent Bell Labs. Sherif is a Cloudera Certified Developer for Apache Hadoop and a Cloudera Certified Specialist for HBase. You can reach Sherif at ssakr@cse.unsw.edu.au.

About Safari Books Online

Safari Books Online is an online learning library that provides access to thousands of technical, engineering, business, and digital media books and training videos. Get the latest information on topics like Windows 8, Android Development, iOS Development, Cloud Computing, HTML5, and so much more – sometimes even before the book is published or on bookshelves. Learn something new today with a free subscription to Safari Books Online.
