Distributed Graph Processing: Big Data Processing Platforms

Hadoop: Well, you know it. A distributed processing framework best suited to embarrassingly parallel, non-iterative jobs with few or no data dependencies.

Piccolo: Hadoop alternative with shared distributed state in the form of a key-value store. Also supports fast graph operations. A 2010 work from NYU; the paper compares it only against Hadoop and MPI.

Spark: Alternative to Hadoop created at UC Berkeley, aimed especially at iterative jobs.

Shark: Built for processing Hive queries on Spark.

Neo4j: A graph database targeted at very fast querying of large graphs. There are lots of such databases.

Storm: Makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Can be used with any programming language. Created at BackType and open-sourced by Twitter after it acquired BackType.

Kineograph: Also targets streaming graph data such as Twitter's, and runs batch processing on specific snapshots of the system. Downside: new updates only appear two minutes later.

YARN: Introduced with Hadoop 2.0. Its purpose is to separate resource management from the processing model in Hadoop. This way, other projects such as Giraph and Spark can be integrated onto Hadoop more easily(?). See figure below:

Pregel: Google's large-scale graph processing framework, based on the synchronous BSP (bulk synchronous parallel) model.
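
To make the synchronous BSP idea concrete, here is a minimal single-process sketch of vertex-centric supersteps that propagates the maximum value through a graph. It is only an illustration under stated assumptions, not Pregel's actual API: the function name `pregel_max` and the dict-based graph layout are made up for this example, and every vertex is assumed to appear as a key in both `graph` and `values`.

```python
from collections import defaultdict


def pregel_max(graph, values):
    """Propagate the maximum value through `graph` in synchronous supersteps.

    graph:  dict mapping each vertex to a list of its out-neighbours
    values: dict mapping each vertex to its initial numeric value
    (Hypothetical layout chosen for this sketch.)
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its out-neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
            # Otherwise the vertex stays silent, i.e. it votes to halt.
        # Barrier: messages sent now are only delivered in the next superstep.
        inbox = outbox
        supersteps += 1

    return values, supersteps
```

The computation stops once a superstep produces no messages. Swapping `inbox` for `outbox` at the end of each loop iteration plays the role of the global barrier.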

Giraph: Open-source implementation of Pregel that runs on Hadoop.

Giraphx: An extension of Giraph that brings serializability and direct memory reads to Giraph.

Giraph++: Another extension of Giraph (by IBM) which also exploits direct memory reads.

GraphLab: An asynchronous large-scale graph processing framework from CMU.

PowerGraph: Later version of GraphLab that performs better on natural (power-law) graphs by factoring the vertex computation over edges.
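
A rough sketch of what factoring the vertex computation over edges can look like, using PageRank written in gather/apply style. This is a single-machine illustration under assumptions, not PowerGraph's API: the names `gas_pagerank`, `gather`, and `apply`, and the edge-list input, are hypothetical. The scatter phase is left out because every vertex is recomputed in every iteration here.

```python
def gas_pagerank(edges, num_iters=20, damping=0.85):
    """PageRank in gather/apply style over a list of (src, dst) edges.

    The per-vertex update is split into a per-edge `gather` whose results
    are summed, followed by a per-vertex `apply`.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    in_edges = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_edges[dst].append(src)

    rank = {v: 1.0 for v in vertices}

    def gather(src):
        # Per-edge work: the contribution flowing along one in-edge.
        return rank[src] / out_degree[src]

    def apply(total):
        # Per-vertex work: combine the summed contributions into a new rank.
        return (1.0 - damping) + damping * total

    for _ in range(num_iters):
        # All gathers read the old ranks before any rank is replaced.
        totals = {v: sum(gather(s) for s in in_edges[v]) for v in vertices}
        rank = {v: apply(totals[v]) for v in vertices}
    return rank
```

Because the gather results are combined with a plain sum, the in-edges of a very high-degree vertex could in principle be summed in separate pieces and merged afterwards, which is why this split matters for graphs with a few huge hubs.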

GraphChi: Single-machine version of GraphLab. It uses the disk efficiently to store parts of the graph during computation.

Mahout: A machine learning library built on Hadoop. Main areas are recommendation, clustering, classification. Claimed to perform worse than GraphLab.

Twister: An early work that extends Hadoop to support iterative MapReduce. Not so promising.

HaLoop: Another iterative MapReduce implementation, after Twister.

Bagel: A Spark implementation of Pregel.

GraphX: A graph computation framework on Spark. More general than Bagel; it can emulate both Pregel and PowerGraph.

Other Pregel Clones:

GPS: From Stanford.

Signal/Collect: Not exactly a Pregel clone. Two main differences: (1) edges can have a compute() method; (2) the barrier can be relaxed to allow asynchronous execution.
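
The two differences can be illustrated with a small single-source shortest-paths sketch: the edge-level computation lives in a `signal` function, the vertex-level update in `collect`, and a plain work queue stands in for barrier-free execution. All names here (`signal_collect_sssp`, `signal`, `collect`) are hypothetical and chosen for this example; this is not the Signal/Collect framework's API, and non-negative edge weights are assumed.

```python
import math
from collections import deque


def signal_collect_sssp(edges, source):
    """Shortest distances from `source` over (src, dst, weight) edges."""
    out_edges = {}
    state = {}
    for src, dst, weight in edges:
        out_edges.setdefault(src, []).append((dst, weight))
        state.setdefault(src, math.inf)
        state.setdefault(dst, math.inf)
    state[source] = 0.0

    def signal(src, weight):
        # Edge-level computation: what this edge tells its target vertex.
        return state[src] + weight

    def collect(vertex, incoming):
        # Vertex-level computation: fold one incoming signal into the state.
        return min(state[vertex], incoming)

    # No global barrier: a vertex is rescheduled as soon as its state changes.
    pending = deque([source])
    while pending:
        vertex = pending.popleft()
        for dst, weight in out_edges.get(vertex, []):
            new_state = collect(dst, signal(vertex, weight))
            if new_state < state[dst]:
                state[dst] = new_state
                pending.append(dst)
    return state
```

Unlike a superstep-based loop, an update made while processing one vertex is visible to the very next vertex taken from the queue.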

Apache Hama: An earlier BSP effort that predates Giraph.

GoldenOrb: Another Pregel clone.

Phoebus: ?

HipG: Differs from Pregel in that it has no supersteps. Instead, it uses synchronizers and sub-synchronizers to coordinate vertices and emulate supersteps.

Distributed Graph Processing

Wednesday, November 6, 2013

Big Data Processing Platforms - One-line Summaries
