
A guest post by Kristen Hardwick, who has worked with several different parallel paradigms – including Grid, Cluster, and Cloud. She currently works at Spry where her focus is on designing and developing Big Data analytics for the Hadoop ecosystem. Kristen holds both a Bachelor of Science degree and a Master’s degree in Computer Science from Clemson University, with an emphasis on Parallel Computing.

Apache Giraph is a graph-processing framework built around a small set of methods that you implement to accomplish your processing goal. Two of the examples that come packaged with Giraph are Shortest Paths and Page Rank. These examples are meant to be extended and altered to fit the needs of any new custom application. This post will go through these two example implementations in detail in order to explain the actions necessary to write a Giraph application.

Determine the Input Data Format

The first step of any Giraph application is to determine the format for the graph input data. There are many different built-in Input Formats, each defining data types for the following pieces of information that describe the graph.

Vertex ID

The Vertex ID is the identifier for a vertex in the graph. The framework does not restrict the definition, so this can be as simple as a label or as involved as a fully initialized object with complex pieces. The only limitation is that the information must be representable in a form that Giraph can parse from the input file.

Vertex Value

The Vertex Value is optional, and is another place to store additional information associated with a vertex. Typically, this field is used to store values or objects that should be updated during graph processing.

Edge Tuples

The final piece of input data is the collection of information necessary to define the set of out-edges associated with the source vertex ID. This information is composed of tuples with two elements per edge: the destination Vertex ID and the Edge Weight, where the Edge Weight is optional. Since Giraph only expects out-edges to be specified in this definition, any bi-directional edges must be defined as two separate out-edges with opposite directions.

Input Format Selection

Once the data types for each of these pieces of information have been defined, a MapReduce Input Format must be selected. A large variety of Input Formats are already implemented within the Giraph package so that the user can easily accommodate common data types. One such Input Format is the JsonLongDoubleFloatDoubleVertexInputFormat, which expects data in the following form:
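Each input record is a JSON array describing one vertex: a long Vertex ID, a double Vertex Value, and an array of out-edge tuples, where each tuple pairs a long destination Vertex ID with a float Edge Weight:

    [source_id, source_value, [[dest_id, edge_value], [dest_id, edge_value], ...]]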

When passed into this Input Format, the input data should look like the following – where each vertex is described in its own record:
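For instance, a small five-vertex graph encoded in this format might look like the following illustrative sample, with one JSON record per line:

    [0,0,[[1,1],[3,3]]]
    [1,0,[[0,1],[2,2],[3,1]]]
    [2,0,[[1,2],[4,4]]]
    [3,0,[[0,3],[1,1],[4,4]]]
    [4,0,[[3,4],[2,4]]]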

If this data was read in by Giraph, it would produce a graph like the following:

[Figure: the directed, weighted graph produced from the sample input data (giraph.fig1)]

Determine the Output Data Format

The second step of any Giraph application is determining the data format for the processed results. Giraph has built-in Output Formats for this as well, one of the most common being the IdWithValueTextOutputFormat. This will write Text output such that every Vertex ID is printed with its Vertex Value. For example, if the previously defined graph was printed with no alterations, it would be written to the output file as the following:
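Using the illustrative five-vertex graph from above, the unprocessed output would look something like this, with a tab separating each Vertex ID from its Vertex Value:

    0	0.0
    1	0.0
    2	0.0
    3	0.0
    4	0.0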

Implement the Compute Method

Once the graph is configured, the actual processing steps for the desired algorithm can be implemented. Giraph makes use of a vertex-centric design, where the logic is implemented within the compute method. In order to illustrate what this means, two algorithms will be discussed: Shortest Paths and Page Rank.

Shortest Paths

In the shortest path algorithm, the objective is to mark a particular vertex as the origin, and then determine the travel paths that incur the lowest cost when traveling from that origin to any other vertex in the graph. In this section, we will walk through the Giraph SimpleShortestPathsComputation.java implementation of the compute method required for this algorithm.

The method signature for the compute method defines the data types required for both the vertices in the graph and the messages that should be passed along the edges. In this case, the JsonLongDoubleFloatDoubleVertexInputFormat is being used, so the vertex types align with the Long, Double, and Float types in that format's definition. The Messages are defined to be Double values.
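A sketch of the class and method signature, paraphrased from the Giraph example (the four generic parameters are, in order, the Vertex ID, Vertex Value, Edge Weight, and Message types):

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class SimpleShortestPathsComputation extends BasicComputation<
        LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

      @Override
      public void compute(
          Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        // The algorithm logic, walked through below, goes here.
      }
    }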

It is important to note that the same compute method is called for all iterations, so any special cases that need to be handled in a particular iteration must be done through a call to the getSuperstep() function. In this case, the special case is to initialize the vertex value to the maximum Double value during the first iteration of the algorithm.
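That first-superstep initialization, sketched from the example:

    if (getSuperstep() == 0) {
      // No path is known yet, so every vertex starts at "infinity".
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }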

Next, a check determines whether the vertex passed into the method is the marked origin vertex. If it is the origin, the current minimum distance is set to 0; otherwise, the current minimum is set to the maximum Double value.
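In the shipped example this is a one-line ternary, where isSource is a small helper that compares the Vertex ID against a configurable source vertex ID:

    // isSource(vertex) checks the Vertex ID against a configurable option.
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;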

A loop then iterates over all messages sent during the previous superstep (passed as input to the method, as shown in the signature above), updating the current minimum distance whenever the value in a message is smaller than what is currently stored.
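Paraphrasing the example:

    for (DoubleWritable message : messages) {
      // Keep the smallest distance seen so far.
      minDist = Math.min(minDist, message.get());
    }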

If the vertex value is larger than the minimum distance found in the received messages, it must be updated. After this update occurs, a message is sent along each out-edge associated with this vertex, updating the connected vertex with the new Double distance value.
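Sketched from the example, where each outgoing message carries the new minimum distance plus the weight of the edge being traversed:

    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        // Offer each neighbor the distance of the path through this vertex.
        double distance = minDist + edge.getValue().get();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }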

Finally, after all processing has occurred and all messages have been sent, the vertex will vote to halt processing. The overall execution will be halted when there are no additional messages to process and when all vertices have voted to halt.
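That final call, as it appears in the example:

    vertex.voteToHalt();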

Page Rank

The Page Rank algorithm is used by Google's search engine to rank pages by importance. In this section, we will walk through the Giraph SimplePageRankComputation.java implementation of the compute method required for this algorithm.

The initial steps of the Page Rank algorithm are very similar to the Shortest Paths example. The same Input Format is used, and the Messages will again be of type Double.
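Paraphrasing the example's declaration, the generic parameters match the Shortest Paths sketch above:

    public class SimplePageRankComputation extends BasicComputation<
        LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
      // compute(vertex, messages) has the same signature as before.
    }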

The main process of this compute method is to update the Vertex Value with an importance calculation based upon the messages received along the in-edges. This is computed by iterating over all received messages, one per in-edge, and then adjusting the summed value based on the overall number of vertices in the graph. It is important to note that on the first iteration this entire process is skipped, since no messages have been sent yet.
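Paraphrasing the example's update step, where 0.15 and 0.85 are the usual Page Rank damping constants:

    if (getSuperstep() >= 1) {
      // Sum the rank contributions received along the in-edges.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      // Standard Page Rank update, scaled by the total number of vertices.
      vertex.setValue(new DoubleWritable(
          (0.15f / getTotalNumVertices()) + 0.85f * sum));
    }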

The final step is to send a message across all out-edges so that those vertices can update their importance values accordingly. This process involves dividing the current value assigned to this vertex by the number of out-edges. This process repeats once for each superstep, until the maximum number of supersteps has been reached – at which point the vertices will vote to halt.
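A sketch of that final step; MAX_SUPERSTEPS is a constant defined in the Giraph example:

    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Divide this vertex's current rank evenly among its out-edges.
      long edges = vertex.getNumEdges();
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / edges));
    } else {
      vertex.voteToHalt();
    }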

Conclusion

These walkthroughs of Apache Giraph's built-in Shortest Paths and Page Rank examples should offer some insight into the steps necessary to go beyond those algorithms and into more complex custom code.

Giraph entered into the Apache Incubator process in summer 2011, and had its first release approved in early 2012. Its most recent official version (1.0.0) was released in May 2013. The trunk version (1.1.0) includes support for YARN and is being actively worked on by the community. Giraph is filling a significant gap in the Hadoop-based analytic world, and it will be extremely interesting to follow this technology as it continues to evolve.

Look below for some great Big Data books from Safari Books Online.


Safari Books Online has the content you need

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. The book covers (un)loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 is written by YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli and demonstrates how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.
Professional Hadoop Solutions is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth.

Tags: Apache Giraph, Giraph, Page Rank, Shortest Path, Tuples, Vertex
