Monday, November 11, 2013

LFGraph

LFGraph is a recent study from UIUC which seems solid and promises the best results in distributed graph analytics. It compares itself with PowerGraph, one of the most popular and best-performing alternatives. Both are implemented in C++, and neither relies on Hadoop or any other framework.

The basic properties of LFGraph are:

  • It advocates hash partitioning since it is very fast and provides good load balance among workers. The authors report that clever graph partitioning can take up to 90% of total runtime for most applications and input datasets, so it should be avoided. In addition, contrary to PowerGraph's claim, the LFGraph authors argue that while synthetic power-law graphs cause high worker imbalance, real-world graphs have thicker tails, which significantly reduces this imbalance.
  • It uses synchronous supersteps like Pregel, but each superstep consists of two sub-steps: computation and communication. In the computation step, multiple computation threads iterate over separately assigned vertex lists and compute each new vertex value from the already-received neighbor values. In the communication step, these values are sent to other workers in batches, setting updated flags and invalidating old values.
  • Instead of a vertex sending its value to every remote neighbor separately (as Pregel does), each worker keeps publish lists that record which vertices must be published to which workers. The published values are stored in a single worker-level list at the receiving worker. As a result, communication cost is reduced significantly.
  • It needs no locking, unlike Pregel, since it keeps two values for each vertex: a real value and a shadow value. Although this means extra memory and extra work to shift values between lists at each superstep, it is still highly memory efficient (much more so than PowerGraph).
  • The most basic property: it makes a single pass over neighbors. It reads only in-edges, while others must traverse both the incoming and the outgoing edge lists. This brings a significant improvement in computation time compared to Pregel and GraphLab (both need two passes).
  • Since workers take care of sending vertex values to non-local neighbors, each vertex needs only its incoming edge list. This decreases memory cost significantly.
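The two sub-steps and the publish-list batching above can be sketched roughly as follows. This is a single-machine Python toy of my own; the class and attribute names, and the cold-start behavior of the remote cache, are my simplifications, not the paper's code:

```python
# Hedged sketch of LFGraph-style workers: two values per vertex (no locking),
# a single pass over in-edges, and one batched publish per subscribing worker.
class Worker:
    def __init__(self, vertices, in_edges):
        self.vertices = vertices            # local vertex ids
        self.in_edges = in_edges            # v -> list of in-neighbors (global ids)
        self.value = {v: 1.0 for v in vertices}   # value being computed
        self.shadow = dict(self.value)            # stable copy read by others
        self.remote_cache = {}              # worker-level list of published values
        self.publish_list = {}              # peer worker -> vertices it subscribed to

    def lookup(self, u):
        # an in-neighbor's value: local shadow copy, else the remote cache
        # (the 0.0 default is my cold-start simplification)
        return self.shadow.get(u, self.remote_cache.get(u, 0.0))

    def superstep(self, compute):
        # computation sub-step: a single pass over in-edges only
        new = {v: compute(v, [self.lookup(u) for u in self.in_edges[v]])
               for v in self.vertices}
        self.value = new
        # communication sub-step: one batched message per subscribing worker
        for peer, wanted in self.publish_list.items():
            peer.remote_cache.update({v: self.value[v] for v in wanted})
        # promote computed values to the readable shadow copies (no locks needed)
        self.shadow = dict(self.value)

# two workers over the 3-cycle 0 -> 1 -> 2 -> 0
w0 = Worker([0, 1], {0: [2], 1: [0]})
w1 = Worker([2], {2: [1]})
w0.publish_list = {w1: [1]}   # w1's vertex 2 reads vertex 1
w1.publish_list = {w0: [2]}   # w0's vertex 0 reads vertex 2
for _ in range(2):
    w0.superstep(lambda v, ins: sum(ins))
    w1.superstep(lambda v, ins: sum(ins))
```

Note how the shadow copy lets readers see a consistent value while the new one is computed, which is exactly why no per-vertex locking is needed.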

Concerns:
  1. Since it makes a single pass over incoming edges only, no message is sent to out-neighbors, so they have to stay active all the time. Algorithms that tolerate putting vertices to sleep may not run well here. For example, in PageRank, Giraphx can put most vertices to sleep in most supersteps, but to achieve the same convergence, LFGraph must keep them all awake.
  2. It compares PowerGraph and LFGraph using a constant number of supersteps, as in Pregel. However, async systems are known to converge faster. To reach the convergence PowerGraph achieves in 10 supersteps, LFGraph might need 20, resulting in similar or worse completion times.
  3. They perform experiments on an Emulab cluster (same rack), which has almost zero network delay compared to an EC2 cluster (not necessarily the same rack). When the network is slow, decreasing the number of cross-worker neighbor messages might make a significant difference.
  4. No fault tolerance. Values of vertices with remote neighbors could be recovered from the copies cached at other workers, but purely local vertices cannot. In fact, fault tolerance might be achieved by just backing up the local vertices, which should be a small fraction in a large cluster.
  5. Despite the properties above, I still do not see how it achieves 10x less memory than PowerGraph. Measuring LFGraph with smem and PowerGraph by heap space might not be a fair comparison.

Thursday, November 7, 2013

Signal/Collect

Signal/Collect is a framework for synchronous and asynchronous parallel graph processing. It differs from Pregel in two main ways: (1) edges can have computations associated with them as well, in the signal function, so you essentially write a compute() method for the vertices and another for the edges; (2) the synchronization barrier constraints are relaxed, so it is also possible to implement async algorithms.

Asynchronous computation is done by randomizing the set of vertices on which signal and collect operations are performed.

The main problem is that, at the time, the core system does not scale to multiple machines. It uses a shared-memory model on a single powerful machine with huge memory, which might be very inefficient or hard to use in a distributed setting.

It keeps a list containing the latest known values of neighbors. In addition, it has an incoming message queue for messages that have not been read yet. Such a neighbor list is useful in Giraphx, but it might be memory inefficient.

It supports prioritization of vertices and operations. The authors introduce a threshold score that is used to decide whether a vertex should collect its signals or send signals. Using this score, processing can be accelerated: in each step, signal and collect operations are performed only for vertices whose score exceeds the threshold.
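Roughly, the signal/collect model with score-guided scheduling looks like this. It is a Python toy of my own, using SSSP (the paper's running example); the method names only approximate the real Scala API:

```python
# Hedged sketch of Signal/Collect: edges carry the signal computation,
# vertices carry collect, and score thresholds gate both phases.
import math

class Vertex:
    def __init__(self, vid, state):
        self.id = vid
        self.state = state
        self.last_signal_state = None   # state at the time of the last signal
        self.signals = {}               # latest signal received per in-neighbor

    def collect(self):                  # SSSP: keep the minimum known distance
        return min([self.state] + list(self.signals.values()))

    def score_signal(self):             # worth signaling only if state changed
        return 0.0 if self.state == self.last_signal_state else 1.0

    def score_collect(self):            # worth collecting only if signals exist
        return 1.0 if self.signals else 0.0

def edge_signal(src, weight):           # the edge's computation (SSSP)
    return src.state + weight

def run(vertices, edges, threshold=0.0, max_steps=50):
    for _ in range(max_steps):
        changed = False
        for (u, v, w) in edges:         # signal phase, guarded by the score
            if vertices[u].score_signal() > threshold:
                vertices[v].signals[u] = edge_signal(vertices[u], w)
        for u in vertices:
            vertices[u].last_signal_state = vertices[u].state
        for v in vertices.values():     # collect phase, guarded by the score
            if v.score_collect() > threshold:
                new = v.collect()
                changed = changed or (new != v.state)
                v.state = new
        if not changed:                 # converged: nothing scored high enough
            break

verts = {i: Vertex(i, 0.0 if i == 0 else math.inf) for i in range(4)}
run(verts, [(0, 1, 1), (1, 2, 1), (0, 2, 5), (2, 3, 1)])
```

Raising the threshold skips low-impact work, which is exactly the acceleration mechanism the authors describe; the async mode would additionally randomize the visiting order instead of sweeping in phases.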


It also provides a lot of small extensions which might give hints for the Maestro implementation. It does not support adding or removing edges/vertices.

HipG

Its basic property is that it does not have supersteps. Instead, the computation is coordinated by synchronizers, which can also spawn sub-synchronizers. This way, there is no longer a global barrier; the algorithm can specify different barriers on which subsets of the vertices synchronize. The idea is to allow a more fine-grained synchronization mechanism, avoiding many vertices idling while waiting for unrelated vertices to finish their computation.

Unlike Signal/Collect and Pregel, which can only control individual vertices, HipG's synchronizers make it possible to control global execution.

But the sub-synchronizer mechanism is hard to comprehend. Who manages which vertices?

It does not process vertices in vertex-id order; instead, it starts from a pivot vertex and processes neighbors recursively until all are processed.
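As I understand it, the pivot-based traversal is something like the following sketch (my own illustration, not HipG's API): pick a pivot, recursively process its reachable neighborhood, then pick a new pivot among the still-unprocessed vertices.

```python
# Hedged sketch of pivot-style processing instead of an id-order sweep.
def pivot_process(graph, visit):
    processed, order = set(), []
    unvisited = set(graph)
    while unvisited:
        pivot = min(unvisited)          # pivot choice is arbitrary in this sketch
        stack = [pivot]
        while stack:                    # iterative DFS over the pivot's region
            v = stack.pop()
            if v in processed:
                continue
            processed.add(v)
            order.append(visit(v))
            stack.extend(u for u in graph[v] if u not in processed)
        unvisited -= processed          # pick a fresh pivot for what remains
    return order

# two components: {0, 1, 2} and {3, 4}
g = {0: [1], 1: [2], 2: [0], 3: [4], 4: [3]}
order = pivot_process(g, lambda v: v)
```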


It can create the graph on the fly. This is especially important for model checking, search trees, and so on. I think this is HipG's biggest strength.

Optimizations and Analysis of BSP Graph Processing Models on Public Clouds

In this paper the authors focus on remedies for Pregel's inefficiency on complex graph algorithms such as betweenness centrality. Such algorithms tend to have a highly varying number of messages per superstep, so using a small number of workers may cause insufficient memory at the peak messaging superstep. On the other hand, renting many workers is costly, and most workers do little work in most supersteps. They therefore propose adaptive vertex and superstep scheduling heuristics.

In their vertex scheduling model, computation is started only for a subset of vertices, called a swath, at a time. For finding the best swath size there are two heuristics: the sampling-run method runs small sample swaths, measures their peak memory, and extrapolates to find the largest swath size that fits in memory; the adaptive method bases the next swath size on the peak memory usage of the previous one.

For finding the best time to start a new swath they again have two methods (the baseline of starting the next swath only after the previous one completes is not efficient): the static method starts the next swath after a fixed number of supersteps equal to the average shortest path length; the dynamic method monitors memory utilization and starts the next swath after peak utilization has passed.
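The two swath-size heuristics might be sketched like this (illustrative Python; the function names, constants, and the linear memory model are mine, not the paper's):

```python
# Hedged sketch of the two swath-size heuristics.
def sampling_run_size(sample_sizes, peak_mem_of, mem_budget):
    # run a few small sample swaths, fit peak memory ~ k * size,
    # and extrapolate to the largest swath that still fits in memory
    k = max(peak_mem_of(s) / s for s in sample_sizes)   # conservative slope
    return int(mem_budget / k)

def adaptive_next_size(prev_size, prev_peak_mem, mem_budget, headroom=0.9):
    # scale the next swath from the previous swath's observed peak memory
    return int(prev_size * headroom * mem_budget / prev_peak_mem)

# toy model: each active vertex peaks at ~2 MB of messages
peak = lambda s: 2.0 * s
size = sampling_run_size([100, 200], peak, mem_budget=8000)
nxt = adaptive_next_size(1000, prev_peak_mem=2000.0, mem_budget=8000)
```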

They also evaluate the impact of partitioning, comparing hash partitioning with METIS. As expected, METIS helps PageRank on both experimented graphs. But in the BC and APSP applications, while it helps on the web-Google dataset, surprisingly it does not help on the cit-Patents dataset. They explain this by the increase in work imbalance between workers resulting from the partitioning. To support this claim, they give additional figures showing that although computation time decreases with METIS, synchronization barrier wait time increases, making the total time of a superstep almost the same. But this cannot explain why the problem is not observed on web-Google.

They finally evaluate using the cloud's elastic worker-count feature instead of the swath initiation heuristics. They show that in supersteps where computation is high, adding more workers helps, but when computation is low, adding workers hurts due to increased barrier synchronization cost. Using an elastic worker count achieves a significant cost reduction compared to 8 workers, with similar runtime. However, they do not really scale worker counts dynamically; instead, they run two experiments with 4 and 8 workers and, for each superstep, select the one that runs faster. The cost of scaling might be high enough to make this useless. In addition, AWS charges per hour, so changing the worker count frequently might be very costly if the computation fluctuates greatly.

In this paper the authors implement Pregel on .NET. They claim that Giraph is not good since it is built on Hadoop and suffers from Hadoop's side effects, but they neither explicitly state these side effects nor compare against Giraph.

An important advantage of Pregel.NET is that workers use multiple threads so as to effectively exploit all cores in a worker. On the other hand, it has no combiners, aggregators, or fault tolerance.

However, in most graph problems vertices need to execute together: vertex values change simultaneously until convergence is achieved for all vertices. Starting vertices at different times may be applicable only to the small subset of graph problems in which vertices can execute independently.

It has a comprehensive related-work section about Pregel and other BSP implementations such as GPS, Giraph, Hama, and Trinity. However, it claims Giraph is bad because it buffers messages on disk instead of in memory, which is not correct.


Distributed GraphLab

Distributed GraphLab's main focus is solving graph algorithms efficiently in a distributed environment while keeping a shared-memory abstraction. It has 4 main properties:

- It is designed for graph data. But unlike Pregel, the user focuses on sequential computation rather than on the parallel movement of data.
- It is async, unlike Pregel. Therefore it uses the most recent data, and it is not affected by the slow-worker synchronization problem and highly variable running times in a skewed network.
- For example, in PageRank each vertex converges at a different rate. Unlike Pregel, GraphLab supports prioritizing the computation of vertices and adaptively pulling info from neighbors. This pulling idea resembles our contribution in Giraph-e.
- Provides different levels of consistency to enforce serializability.
- The data input format is similar to Pregel's: there are V, E, and D, where D contains the edge and vertex data.

- An important difference: the graph structure is not modifiable, unlike Pregel.
- Another difference: an update function at a vertex can also modify the data of neighbor vertices. But I don't agree with the argument in Section 3.2; Pregel can also do selective sending in PageRank. In addition, Alg. 1 looks like push, not pull.

- Execution model: similar to Pregel, vertices are executed one by one, but the order may be determined by the user, unlike Pregel. There is no info about the distribution of vertices, though. Do all workers go over the same set of vertices?

- Consistency is maintained at three levels, and the level of consistency is inversely proportional to the level of concurrency: full consistency > edge consistency > vertex consistency. Example: PageRank needs write access on the current vertex and read access on its edges and neighbor vertices, so it needs edge consistency.
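The three consistency levels can be illustrated as lock scopes, i.e. the set of data an update at vertex v may touch exclusively (my own simplification in Python; GraphLab's actual mechanism is more involved):

```python
# Hedged sketch: wider scope = stronger consistency = less concurrency.
def scope(level, v, neighbors):
    edges = {frozenset((v, u)) for u in neighbors}   # adjacent edge data
    if level == "vertex":
        return {v}                          # only v's own data
    if level == "edge":
        return {v} | edges                  # v plus its adjacent edge data
    if level == "full":
        return {v} | edges | set(neighbors) # v, its edges, and neighbor data
    raise ValueError(level)

def can_run_concurrently(level, a, na, b, nb):
    # two updates may run in parallel iff their scopes do not overlap
    return not (scope(level, a, na) & scope(level, b, nb))
```

For instance, under vertex consistency two adjacent vertices may update concurrently, under edge consistency they may not (they share edge data), and full consistency additionally serializes updates two hops apart.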

- Instead of Pregel's aggregators, GraphLab has global variables that can only be used synchronously. But how are they implemented in a distributed setting? As master variables organized via access tokens that workers request with messages?

Structure and Design:
- Partitioning: uses parallel METIS and two-phase partitioning. Border vertices are called ghosts and handled specially: ghosts have cached copies at neighbor partitions. But how is the consistency of ghosts maintained?
The order in which vertices are selected for processing is determined in one of two ways:
- Chromatic Engine: partially async; uses graph coloring (no info on how the coloring is done). Vertices of the same color operate at the same time, in a color-step instead of a superstep. It still needs a barrier sync at the end of each color-step, but requires less work at the end of a color-step than at the end of a superstep.
- Distributed Locking Engine: uses a reader-writer lock at each vertex. Deadlocks are avoided by acquiring locks in a canonical order. Workers are only allowed to update local vertices. Lock requests are pipelined to hide remote lock-acquisition latency.

- Fault tolerance: achieved via an async version of the Chandy-Lamport snapshot algorithm at fixed intervals.
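The deadlock-avoidance part of the locking engine boils down to the classic canonical-lock-order technique, which can be sketched like this (plain mutexes instead of reader-writer locks, and all other details are mine):

```python
# Hedged sketch of deadlock avoidance via a canonical lock order.
from threading import Lock

locks = {v: Lock() for v in range(4)}   # one lock per vertex

def with_scope_locked(vertices, fn):
    ordered = sorted(set(vertices))     # canonical (id) order: no lock cycles,
    for v in ordered:                   # hence no deadlock between updates
        locks[v].acquire()
    try:
        return fn()
    finally:
        for v in reversed(ordered):
            locks[v].release()

# two updates with overlapping scopes both acquire {1, 2} in the same order,
# so one simply waits for the other instead of deadlocking
result = with_scope_locked([2, 1], lambda: "updated")
```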

Wednesday, November 6, 2013

Big Data Processing Platforms - One-line Summaries

Hadoop: Well you know it. Distributed processing framework that fits best for embarrassingly parallel and non-iterative jobs with very small or no data dependencies.

Piccolo: Hadoop alternative with shared distributed state in the form of a key-value store. Supports also fast graph operations. 2010 work by NYU compared only with Hadoop and MPI in the paper.

Spark: Alternative to Hadoop created by UC Berkeley especially for iterative jobs.

Shark: Built for processing Hive queries on Spark.

Neo4j: A graph database targeted at very fast querying of large graphs. There are lots of such databases.

Storm: Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm can be used with any programming language. Designed by Twitter.

Kineograph: Also targets streaming graph data like Twitter and makes batch processing on specific snapshots of the system. Downside: New updates appear two minutes later.

YARN: Announced as Hadoop 2.0. The purpose is to separate processing from data management in Hadoop. This way, other projects such as Giraph and Spark can be integrated onto Hadoop more easily(?). See the figure below:

[Figure: YARN architecture]
Pregel: Large-scale graph processing framework based on synchronous BSP model produced by Google.

Giraph: Hadoop implementation of Pregel.

Giraphx: An extension of Giraph that brings serializability and direct memory reads to Giraph.

Giraph++: Another extension of Giraph (by IBM) which also exploits direct memory reads.

GraphLab: an asynchronous large-scale graph processing framework by CMU.

PowerGraph: Later version of GraphLab which provides better performance on natural graphs by factoring the vertex computation over edges.

GraphChi: single-machine version of GraphLab. It uses disk efficiently to store parts of the graph during computation.

Mahout: A machine learning library built on Hadoop. Main areas are recommendation, clustering, classification. Claimed to perform worse than GraphLab.

Twister: An early work that extends Hadoop to support iterative mapreduce. Not so promising.

Haloop: Another iterative mapreduce implementation after Twister.

Bagel: a Spark implementation of Pregel.

GraphX: a graph computation framework on Spark. More general than Bagel. It can emulate both Pregel and PowerGraph.

Other Pregel Clones:

GPS: by Stanford

Signal/Collect: Not exactly a Pregel clone. Two main differences: 1) edges can have a compute() method; 2) the barrier can be relaxed to allow async execution.

Apache Hama: Effort before Giraph

GoldenOrb: another copy

Phoebus: ?

HipG: Differs from Pregel in this way: It does not have supersteps. Instead it uses synchronizers and sub-synchronizers to coordinate vertices and emulate supersteps.

Spark: Cluster Computing with Working Sets

Incomplete!!

Spark is a framework like MapReduce that can cache data between iterations. For iterative tasks, it therefore outperforms Hadoop by 10x. The main component in Spark is the resilient distributed dataset (RDD) concept, but I have not understood it well enough yet.
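As far as I understand it, the core RDD idea is a lazy lineage of transformations plus optional caching. A toy sketch of my own (nothing like Spark's real implementation, and the real cache() is lazier: it materializes on the first action, not immediately):

```python
# Hedged toy of the RDD idea: a dataset is a recipe (lineage) over a source,
# computed lazily; cache() pins the materialized result so later iterations
# reuse it instead of recomputing from the source.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute          # lineage: how to (re)build this dataset
        self._cached = None

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, p):
        return ToyRDD(lambda: [x for x in self.collect() if p(x)])

    def cache(self):
        self._cached = self._compute()   # materialize once; later reads reuse it
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

# the squares are computed once; every downstream query reuses the cache
data = ToyRDD(lambda: list(range(10))).map(lambda x: x * x).cache()
evens = data.filter(lambda x: x % 2 == 0).collect()
```

The lineage is also what gives RDDs their fault tolerance: a lost partition can be recomputed from the recipe rather than restored from a checkpoint.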

It is also very useful for interactive queries since it can keep the data in memory. For example, they cache the 39GB Wikipedia snapshot and, after the first query, can query it in less than a second since it is already kept in memory. But it is not aimed at graph queries and hence does not support queries like "find the distance between these two nodes".

They implement logistic regression and alternating least squares (ALS) on Spark, and with basic experiments on an EC2 cluster they show that while the first iteration takes a little longer than Hadoop (around 174 sec), subsequent iterations are very fast (6 sec). In Hadoop, each iteration takes 130 sec since they are all independent MapReduce jobs.

Spark is implemented in Scala and uses Mesos for distribution. It is a Hadoop alternative for iterative jobs, similar to Twister. It is suitable for large-scale machine learning but does not target graph processing. A follow-up work by the same group at Berkeley, GraphX, is aimed at graph processing. I will read and review it too.

Another follow-up study, Shark, is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.