In this paper, the authors describe their method for reducing the number of distributed calls in their online graph processing system at LinkedIn. The main task is to compute the distance between two members when they are at most 3 hops apart. For faster retrieval of distances, the 2-hop neighbors of each vertex are stored as a compressed sorted array in the Network Cache Service (NCS). 80% of all distance queries are answered directly from NCS; the remaining 20% are the challenging cases and the main focus of the paper. For these second-degree queries, the goal is to find the optimal set of GraphDB nodes to contact so as to minimize the amount of merging done in NCS.
For this purpose, they formulate the problem as set cover and solve it greedily. Without any optimization, retrieving all 2-hop neighbors requires reading from every machine in the cluster. With the set-cover formulation, the system knows the partition IDs in which a vertex's second-degree connections reside, and at each iteration it selects the machine that covers the most still-uncovered partitions, until all required partitions are covered.
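The greedy selection described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function and variable names are my own, and I assume each machine's partition list is available as a set of partition IDs.

```python
def greedy_set_cover(required, machine_partitions):
    """Greedily pick machines until all required partitions are covered.

    required: set of partition IDs holding the vertex's second-degree
        neighbors.
    machine_partitions: dict mapping machine ID -> set of partition IDs
        stored on that machine.
    Returns the list of chosen machine IDs.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Pick the machine covering the most still-uncovered partitions.
        best = max(machine_partitions,
                   key=lambda m: len(machine_partitions[m] & uncovered))
        gain = machine_partitions[best] & uncovered
        if not gain:
            raise ValueError("required partitions cannot be covered")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

For example, with machines `{"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}}` and required partitions `{1, 2, 3, 4, 5}`, the greedy pass selects A (3 new partitions) and then C (2 new partitions), skipping B entirely.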
As an optimization, they modify the greedy set cover algorithm in the following way: at each step, pick a random uncovered partition storing second-degree neighbors, then, among the machines that cover this partition, pick the one with the most coverage over the remaining uncovered partitions. Repeat until all partitions are covered. While the first method requires computing intersections with all machines at every step, the second requires intersections with only the R machines that replicate the chosen partition.
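The randomized variant can be sketched as below. Again this is my own illustrative reconstruction, not the paper's code; I assume a replica map from each partition ID to the R machines that hold it, which restricts the per-step candidate set.

```python
import random

def randomized_greedy_cover(required, machine_partitions, replicas):
    """Randomized greedy set cover over replicated partitions.

    required: set of partition IDs to cover.
    machine_partitions: dict machine ID -> set of partition IDs it stores.
    replicas: dict partition ID -> list of the R machine IDs holding it.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Pick a random still-uncovered partition...
        p = random.choice(sorted(uncovered))
        # ...and intersect with only its R replica machines, choosing the
        # one that also covers the most remaining partitions.
        best = max(replicas[p],
                   key=lambda m: len(machine_partitions[m] & uncovered))
        chosen.append(best)
        uncovered -= machine_partitions[best]
    return chosen
```

Since every iteration covers at least the randomly chosen partition, the loop always terminates, and each step inspects only R candidate machines instead of the whole cluster.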
This paper provides just one example of how to optimize online graph queries. Since most previous research on distributed graph processing (e.g. Pregel, GraphLab) targets offline workloads, those systems' latencies are too high for online graph queries. Systems for large-scale, low-latency online graph queries therefore remain a mostly untouched and important area of study.
As far as I know, the only other study targeting distributed processing of continuously changing graphs is Kineograph, which I will review next. I believe that in-memory data grids, such as Hazelcast, could also be used to this end, since they present vast opportunities for storing and processing graph data quickly and efficiently in a distributed manner.