An Introduction to Storm
Storm, a big-data processing system, has been presented by Twitter as a distributed and fault-tolerant stream processing system with the following key design features:
- Horizontal scalability: Computations and data processing are performed in parallel using multiple threads, processes and machines.
- Guaranteed message processing: The system guarantees that each message will be fully processed at least once. The system takes care of replaying messages from the source when a task fails.
- Fault-tolerance: If there are faults during execution of the computation, the system will reassign tasks as necessary.
- Programming language agnostic: Storm tasks and processing components can be defined in any language. Clojure, Java, Ruby, and Python are supported by default, and support for other languages can be added by implementing a simple Storm communication protocol.
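To make the at-least-once guarantee concrete, here is a minimal sketch in plain Python. It does not use Storm's API: `process_at_least_once`, `handler`, and `max_attempts` are hypothetical names invented for illustration, and the in-memory list stands in for a replayable source.

```python
def process_at_least_once(messages, handler, max_attempts=3):
    """Toy model of replay: re-queue a message whenever its handler fails.

    Messages must be hashable. max_attempts is an assumption of this
    sketch, used only to keep the loop from running forever.
    """
    attempts = {}
    pending = list(messages)
    while pending:
        msg = pending.pop(0)
        attempts[msg] = attempts.get(msg, 0) + 1
        try:
            handler(msg)
        except Exception:
            if attempts[msg] >= max_attempts:
                raise
            pending.append(msg)  # replay the message from the "source"
    return attempts


seen = []
failed_once = set()


def flaky_handler(msg):
    # Simulate a task that fails the first time it sees message "b".
    if msg == "b" and msg not in failed_once:
        failed_once.add(msg)
        raise RuntimeError("simulated task failure")
    seen.append(msg)


attempts = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Message "b" is attempted twice, which is why the guarantee is "at least once" rather than "exactly once": any side effect performed before the failure may be repeated on replay.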
In this blog post we give an overview of Storm's core concepts to help you implement big-data analytics over streaming data.
The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. The basic primitives Storm provides for performing stream transformations are spouts and bolts. A spout is a source of streams that generates input tuples (Figure 1). A bolt consumes any number of input streams, carries out some processing, and possibly emits new streams (Figure 2). Complex stream transformations, such as the computation of a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts.
Figure 1: The behavior of Storm Spout
Figure 2: The behavior of Storm Bolt
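The relationship between streams, spouts, and bolts can be sketched with Python generators. This is a toy model, not Storm's API: `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and everything runs in a single thread.

```python
from itertools import islice


def sentence_spout():
    """Toy spout: an unbounded source of one-field tuples."""
    sentences = ["the cow jumped", "over the moon"]
    i = 0
    while True:  # a stream is unbounded, so this never ends on its own
        yield (sentences[i % len(sentences)],)
        i += 1


def split_bolt(stream):
    """Toy bolt: consumes sentence tuples and emits a stream of word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Toy bolt: emits (word, running_count) for every word it consumes."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Take a finite prefix of each unbounded stream for inspection.
words = list(islice(split_bolt(sentence_spout()), 6))
running = list(islice(count_bolt(split_bolt(sentence_spout())), 6))
```

Chaining `split_bolt` into `count_bolt` mirrors the point made above: a multi-step transformation is expressed as several bolts, each consuming the stream emitted by the previous one.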
A topology is a graph of stream transformations where each node is a spout or a bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream. Links between nodes in a topology indicate how tuples should be passed around, and each node in a Storm topology executes in parallel. In any topology, we can specify how much parallelism is required for each node, and then Storm will spawn that number of threads across the cluster to perform the execution. Figure 3 depicts a sample Storm topology.
Figure 3: Sample Storm Processing Topology that consists of 2 Spouts and 4 Bolts
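The subscription behavior described above, where an emitted tuple is delivered to every bolt subscribed to that stream, can be sketched as a small graph structure. `ToyTopology` and its methods are hypothetical and are not part of Storm. The sketch is sequential, whereas a real topology executes its nodes in parallel.

```python
from collections import defaultdict


class ToyTopology:
    """Toy graph of stream transformations (not Storm's API)."""

    def __init__(self):
        self.bolts = {}                       # bolt name -> processing function
        self.subscribers = defaultdict(list)  # source name -> subscribed bolts
        self.received = defaultdict(list)     # bolt name -> tuples it has seen

    def add_bolt(self, name, func, sources):
        """Register a bolt and subscribe it to the named sources (the edges)."""
        self.bolts[name] = func
        for source in sources:
            self.subscribers[source].append(name)

    def emit(self, source, tup):
        """Deliver a tuple to every bolt subscribed to `source`."""
        for bolt in self.subscribers[source]:
            self.received[bolt].append(tup)
            for out in self.bolts[bolt](tup):
                self.emit(bolt, out)  # the bolt's output is itself a stream


topo = ToyTopology()
topo.add_bolt("split", lambda t: [(w,) for w in t[0].split()], sources=["sentences"])
topo.add_bolt("upper", lambda t: [(t[0].upper(),)], sources=["split"])
topo.add_bolt("log", lambda t: [], sources=["split", "upper"])

# "sentences" plays the role of a spout feeding the topology.
topo.emit("sentences", ("hello storm",))
```

Here the `log` bolt subscribes to two streams, so it sees both the raw words from `split` and the upper-cased words from `upper`.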
Storm relies on the notion of a stream grouping to specify how tuples are sent between processing components. In other words, a stream grouping defines how a stream should be partitioned among the tasks of the bolt that consumes it. Storm supports several types of stream grouping, including:
- Shuffle grouping, where stream tuples are distributed randomly across the bolt's tasks so that each task receives an equal share of the tuples.
- Fields grouping, where the stream is partitioned by the fields specified in the grouping, so tuples with the same values for those fields always go to the same task.
- All grouping, where the stream is replicated across all of the bolt's tasks.
- Global grouping, where the entire stream goes to a single one of the bolt's tasks.
- Direct grouping, where the producer of a tuple decides which task of the consumer will receive it.
Figure 4 illustrates the behavior of built-in Storm groupings. In addition to the supported built-in stream grouping mechanisms, the Storm system allows its users to define their own custom grouping mechanisms.
Figure 4: Storm Groupings
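The groupings listed above can each be modeled as a function that maps a tuple to the task indexes that should receive it. The sketch below is an illustration in plain Python, not Storm's implementation: the function names are hypothetical, the CRC32 hash in `fields_grouping` and the choice of task 0 in `global_grouping` are assumptions, and the random choice in `shuffle_grouping` balances load only on average.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one task at random; load evens out over many tuples."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Hash the selected fields so equal values always map to the same task."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send the whole stream to one task (task 0 in this sketch)."""
    return [0]


def direct_grouping(tup, num_tasks, chosen_task):
    """The producer names the receiving task explicitly."""
    if not 0 <= chosen_task < num_tasks:
        raise ValueError("chosen_task is out of range")
    return [chosen_task]


# Two tuples that share the value of field 0 land on the same task.
same_task = (
    fields_grouping(("storm", 1), 4, [0]) == fields_grouping(("storm", 2), 4, [0])
)
```

Fields grouping is the one to reach for when a bolt keeps per-key state, such as a word count, because every tuple for a given key is then handled by the same task.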
Although Storm runs on the JVM and is implemented in a mix of Clojure and Java, components can be written in other languages. To learn more about languages in Storm, read Chapter 7, "Using Non-JVM Languages with Storm," in Getting Started with Storm.
With this overview of Storm complete, in the next post we will take a look at Storm in action, covering Storm clusters and a Storm example.
Safari Books Online has the content you need
Check out these Storm-related books available from Safari Books Online:
Getting Started with Storm introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives.
Big Data Bibliography contains a selection of the most useful books for data analysis. This bibliography starts from high-level concepts of business intelligence, data analysis, and data mining, and works its way down to the tools needed for number crunching: mathematical toolkits, machine learning, and natural language processing. Cloud Services and Infrastructure and Amazon Web Services are covered, along with Hadoop and NoSQL sections that list the Big Data tools that can be deployed locally or in the cloud.
HBase: The Definitive Guide contains a chapter that covers ZooKeeper, which is relevant because Storm keeps all cluster state either in ZooKeeper or on local disk.
About the author
Dr. Sherif Sakr is a Senior Research Scientist in the Software Systems Group at National ICT Australia (NICTA), Sydney, Australia. He is also a Conjoint Senior Lecturer in The School of Computer Science and Engineering (CSE) at University of New South Wales (UNSW). He received his PhD degree in Computer Science from Konstanz University, Germany in 2007. He received his BSc and MSc degrees in Computer Science from the Information Systems department at the Faculty of Computers and Information in Cairo University, Egypt, in 2000 and 2003 respectively. In 2011, Dr. Sakr held a visiting research scientist position in the eXtreme Computing Group (XCG) at Microsoft Research, Redmond, USA. In 2012, he held a research MTS position in Alcatel-Lucent Bell Labs. Sherif is a Cloudera certified developer for Apache Hadoop and Cloudera certified Specialist for HBase. You can reach Sherif at firstname.lastname@example.org.