In this paper, the authors design an efficient graph processing system that ingests a stream of incoming data for a continuously evolving graph. Their method takes snapshots of the system at regular intervals: submitted updates accumulate until the snapshot point, and at snapshot time all of them are applied in the order they were submitted.
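The batching scheme above can be sketched in a few lines of Python. This is a hypothetical toy, not Kineograph's actual implementation; the class and method names (`SnapshotBatcher`, `submit`, `snapshot`) are my own invention to illustrate the idea of accumulating updates and applying them atomically, in per-ingest-node order, at snapshot time.

```python
from collections import defaultdict

class SnapshotBatcher:
    """Toy sketch of epoch-style batching: updates accumulate
    until a snapshot tick, then are all applied at once,
    preserving each ingest node's submission order."""

    def __init__(self):
        self.pending = defaultdict(list)  # ingest node -> ordered edge updates
        self.graph = defaultdict(set)     # adjacency: vertex -> neighbors

    def submit(self, ingest_node, edge):
        # Updates from the same ingest node keep their submission order.
        self.pending[ingest_node].append(edge)

    def snapshot(self):
        # At snapshot time, apply every accumulated update,
        # node by node, in the order each node submitted them.
        for node, updates in self.pending.items():
            for (src, dst) in updates:
                self.graph[src].add(dst)
        self.pending.clear()
        # Return an immutable-ish copy representing this epoch's view.
        return {v: set(ns) for v, ns in self.graph.items()}
```

For example, two edges submitted through the same ingest node become visible together only after `snapshot()` is called, which is exactly why query results lag behind the stream.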
Although it guarantees atomicity and ensures that all transactions from the same ingest node are processed in sequence-number order, it does not enforce computational consistency: neighbors are allowed to update in parallel. This is safe for most graph problems but might prevent convergence in tasks like graph coloring.
It supports both a pull model (GraphLab-like) and a push model (Pregel-like), and also offers features such as aggregators. In the experiments, they feed the system from a bank of 100M archived tweets and measure its timeliness and throughput. They use two incremental Twitter applications, TunkRank and approximate shortest path, and a non-incremental application called k-exposure.
The current implementation also considers fault tolerance via replication at both the storage layer and the computation layer. In addition, they discuss two not-yet-implemented properties: migration of partitions for incremental expansion, and deletion of outdated information to handle the continuous growth of the graph.
Due to batch processing of updates, operations in Kineograph have a delay of up to 2 minutes. This system might be a good fit for delay-tolerant operations in systems like Twitter, where the graph evolves very fast. But it is clearly not the right solution for answering graph queries as fast as possible over a slowly evolving graph like LinkedIn's (see my LinkedIn review). Therefore I still believe our upcoming Hazelcast-based system will fill a void in distributed graph processing research for fast online graph queries and updates.
Monday, August 26, 2013
Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph
In this paper, the authors describe their method for reducing the number of distributed calls in their online graph processing system at LinkedIn. The main task is to calculate the distance between two members if they are at most 3 hops apart. For faster retrieval of distances, the 2-hop neighbors of each vertex are stored as a compressed sorted array in the Network Cache Service (NCS). 80% of all distance queries are answered using NCS, but the remaining 20% are challenging, and they are the main focus of the paper. In other words, the goal is to find an optimal set of GraphDB nodes to read from, so as to reduce the amount of merging NCS must do for these second-degree queries.
For this purpose, they formulate a set cover problem and solve it greedily. Without any optimization, obtaining all 2-hop neighbors requires reading from every machine in the cluster. With the set-cover solution, the system knows the partition IDs where a vertex's second-degree connections are stored, and at each iteration it selects the machine that covers the most still-uncovered partitions, until all required partitions are covered.
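The classic greedy heuristic they start from can be sketched as follows. This is a generic textbook version under my own naming assumptions (`machines` maps a machine ID to the set of partition IDs it holds), not LinkedIn's code:

```python
def greedy_cover(required, machines):
    """Greedy set cover: repeatedly pick the machine covering the
    most still-uncovered required partitions until all are covered.

    required -- set of partition IDs that must be read
    machines -- dict: machine ID -> set of partition IDs it holds
    Returns the list of chosen machine IDs.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Intersect every machine's partitions with the uncovered set
        # to find the one with maximum marginal coverage.
        best = max(machines, key=lambda m: len(machines[m] & uncovered))
        if not machines[best] & uncovered:
            raise ValueError("required partitions cannot be covered")
        chosen.append(best)
        uncovered -= machines[best]
    return chosen
```

Note that each iteration intersects the uncovered set with every machine's partition list, which is the cost the paper's modification targets.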
As an optimization, they modify the greedy set cover algorithm as follows: at each step, pick a random uncovered partition storing second-degree neighbors, then choose the machine that covers this partition and has the most coverage of the remaining uncovered partitions. Repeat until all partitions are covered. While the first method requires intersecting with all machines, the second requires intersecting with only R machines.
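A sketch of this randomized variant, again under my own naming assumptions rather than the paper's implementation. The key point the code makes explicit is that only machines holding the sampled partition are candidates, so each step intersects far fewer sets than the plain greedy version:

```python
import random

def randomized_cover(required, machines, seed=None):
    """Modified greedy set cover: sample a random uncovered
    partition, then among only the machines holding it, pick the
    one covering the most remaining uncovered partitions.
    """
    rng = random.Random(seed)
    uncovered = set(required)
    chosen = []
    while uncovered:
        p = rng.choice(sorted(uncovered))
        # Only the machines that store partition p are candidates,
        # so we intersect with just those few machines.
        candidates = [m for m in machines if p in machines[m]]
        if not candidates:
            raise ValueError(f"partition {p} is not stored anywhere")
        best = max(candidates, key=lambda m: len(machines[m] & uncovered))
        chosen.append(best)
        uncovered -= machines[best]
    return chosen
```

Every chosen machine covers at least the sampled partition, so the uncovered set shrinks on each iteration and the loop always terminates.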
This paper provides just one example of how to optimize online graph queries. Since most previous research on distributed graph processing (e.g., Pregel, GraphLab) targets offline graph computations, those systems' latencies are too high for online graph queries. Therefore, systems for large-scale, low-latency online graph queries remain a mostly untouched and important area of study.
As far as I know, the only other study targeting distributed processing of continuously changing graphs is Kineograph, which I will review next. I believe that in-memory data grids, such as Hazelcast, can be used to this end, since they present vast opportunities to store and process graph data quickly and efficiently in a distributed manner.