In this paper, the authors aim to efficiently design a graph processing system that takes a stream of incoming data for a continuously evolving graph. Their method is based on taking snapshots of the system at regular intervals, accumulate submitted updates until this point and then apply all updates in the order they were submitted at snapshot time.
Despite it guarantees atomicity and ensures all transactions from the same ingest node are processed in the same sequence number order, it does not enforce computational consistency by allowing neighbors to update in parallel. This is safe for most graph problems but might prevent convergence in tasks like coloring.
It supports both pull model (GraphLab-like) and push model (Pregel-like) and also has features such as aggregators. In the experiments they feed the system from a bank of archived 100M tweets and measure the timeliness and theoughput of the system. They use two incremental Twitter applications; TunkRank and approximate shortest path, and a non-incremental application called k-exposure.
The current implementation also considers fault tolerance via replication at both storage layer and computation layer. In addition they talk about two yet unimplemented properties: the migration of partitions for incremental expansion and deletion of outdated information to handle continuous increase in graph size.
Due to batch processing of updates, operations in Kineograph has a delay up to 2 minutes. This system might be a good fit for delay-tolerant operations in systems like Twitter where the graph evolves very fast. But clearly it is not the right solution for answering graph queries in the fastest manner in a slightly evolving graph like LinkedIn. (see my LinkedIn review). Therefore I still believe our upcoming Hazelcast-based system will fill a void in distributed graph processing research for fast online graph queries and updates.
No comments:
Post a Comment