In this paper the authors focus on the inefficiency of Pregel on complex graph algorithms such as betweenness centrality. Such algorithms tend to have a highly varying number of messages per superstep. Using too few workers may therefore exhaust memory during the peak messaging superstep, while renting many workers is costly and leaves most of them doing little work in most supersteps. The authors therefore propose adaptive vertex and superstep scheduling heuristics.
In their vertex scheduling model, computation starts only for a subset of vertices, called a swath, at a time. They propose two heuristics for choosing the swath size. The sampling-run method runs small sample swaths, measures their peak memory usage, and extrapolates to find the largest swath size that still fits in memory. The adaptive method bases the next swath size on the peak memory usage of the previous one.
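The adaptive heuristic described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's code; the function name, the memory-budget parameter, and the simple linear scaling rule are all assumptions.

```python
# Hypothetical sketch of adaptive swath sizing: scale the next swath by
# how much memory headroom the previous swath left under a fixed budget.
# All names and the linear scaling rule are illustrative assumptions.

def next_swath_size(prev_size, prev_peak_mem, mem_budget, min_size=1):
    """Return the next swath size, scaled by the previous swath's
    peak memory usage relative to the per-worker memory budget."""
    if prev_peak_mem == 0:
        return prev_size  # no information yet; keep the current size
    scale = mem_budget / prev_peak_mem
    return max(min_size, int(prev_size * scale))
```

For example, if a swath of 1000 vertices peaked at twice the memory budget, the next swath would be halved to 500 vertices; if it used only half the budget, the swath would grow to 2000.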
For choosing when to start a new swath they again have two methods, beyond the inefficient baseline of starting the next swath only after the previous one completes. The static method starts the next swath after a fixed number of supersteps, equal to the average shortest path length. The dynamic method monitors memory utilization and starts the next swath once peak utilization has passed.
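The dynamic initiation policy can be sketched as a small predicate over per-superstep memory samples. This is a hedged reconstruction of the idea, not the paper's implementation; the drop-fraction threshold is an assumed detail.

```python
# Illustrative sketch of dynamic swath initiation: start the next swath
# once memory utilization has clearly fallen from its running peak,
# i.e. the messaging peak of the current swath has passed.
# The 0.5 drop threshold is an assumption, not from the paper.

def should_start_next_swath(mem_samples, drop_fraction=0.5):
    """Return True once the latest memory sample has dropped below
    drop_fraction of the peak observed so far for the current swath."""
    if len(mem_samples) < 2:
        return False  # not enough samples to detect a peak
    peak = max(mem_samples)
    return mem_samples[-1] < drop_fraction * peak
```

A monitor would call this after each superstep with the memory usage history of the current swath and launch the next swath on the first True.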
They also evaluate the impact of partitioning, comparing hash partitioning with METIS. As expected, for PageRank METIS helps on both experimented graphs. But for the BC and APSP applications, while it helps on the web-Google dataset, it surprisingly does not help on the cit-Patents dataset. They explain this by the increased work imbalance between workers that the partitioning introduces. To support this claim they provide additional figures showing that although computation time decreases with METIS, synchronization barrier wait time increases, leaving the total time per superstep almost the same. However, this cannot explain why the same problem is not observed on web-Google.
Finally, they evaluate the effect of using the cloud's elastic worker provisioning instead of the swath initiation heuristics. They show that in supersteps where computation is high, adding more workers helps, but when computation is low, adding workers hurts due to increased barrier synchronization cost. Using an elastic number of workers achieves a significant cost reduction compared to a fixed 8 workers, with similar runtime. However, they do not actually scale the number of workers dynamically. Instead, they run two experiments with 4 and 8 workers and, for each superstep, select whichever configuration runs faster. The cost of scaling might be high enough to make the approach useless in practice. In addition, AWS charges per hour, so changing the number of workers frequently could be very costly if the computation fluctuates greatly.
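The elastic-runtime estimate, as the review describes the paper's methodology, amounts to taking the per-superstep minimum over the two fixed configurations. A minimal sketch, with illustrative names:

```python
# Sketch of the per-superstep selection described above: given measured
# superstep times from two fixed configurations (4 and 8 workers), the
# "elastic" runtime is estimated by taking the faster one at each
# superstep. Function and parameter names are illustrative assumptions.

def elastic_runtime(times_4_workers, times_8_workers):
    """Sum, over supersteps, the faster of the two measured times."""
    return sum(min(t4, t8) for t4, t8 in zip(times_4_workers, times_8_workers))
```

Note that this estimate ignores the switching cost between configurations, which is exactly the criticism raised above.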
In this paper the authors implement Pregel on .NET. They claim that Giraph is problematic because it is built on Hadoop and suffers from Hadoop's side effects, but they neither explicitly state what these side effects are nor compare against Giraph.
An important advantage of Pregel.NET is that workers are multithreaded, so all cores of a worker can be used effectively. On the other hand, it lacks combiners, aggregators, and fault tolerance.
However, in most graph problems the vertices need to execute together: vertex values change simultaneously until convergence is achieved for all vertices. Starting vertices at different times may be applicable only to a small subset of graph problems in which vertices can execute independently.
The paper has a comprehensive related work section on Pregel and other BSP implementations such as GPS, Giraph, Hama, and Trinity. However, it claims Giraph is inferior because it buffers messages on disk instead of in memory, which is not correct.