Incomplete!!
Spark is a framework, like MapReduce, that can cache data in memory between iterations over a dataset. For iterative tasks it therefore outperforms Hadoop by up to 10x. The main abstraction in Spark is the resilient distributed dataset (RDD) concept, but I have not understood it well enough yet.
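To make the caching idea concrete, here is a minimal pure-Python sketch (not Spark's actual API; the class and function names are hypothetical) of why keeping a dataset resident pays off across iterations: the expensive load happens once, and every later pass reuses the in-memory copy.

```python
# Toy stand-in for the RDD caching idea: the first action materializes
# the dataset; subsequent accesses are served from memory. In Hadoop,
# by contrast, each MapReduce job re-reads its input from disk.
class ToyCachedDataset:
    def __init__(self, load_fn):
        self._load_fn = load_fn  # simulates reading from HDFS
        self._cache = None

    def collect(self):
        if self._cache is None:
            self._cache = self._load_fn()  # expensive, runs only once
        return self._cache

loads = []
def load_from_disk():
    loads.append(1)  # count how often we "hit disk"
    return list(range(5))

rdd = ToyCachedDataset(load_from_disk)
first = rdd.collect()    # materializes: one simulated disk read
second = rdd.collect()   # served from the cache, no new read
print(len(loads))        # -> 1
```

In real Spark the same effect is requested explicitly (the programmer marks a dataset as cached), and the cached partitions live distributed across the cluster's memory rather than in one process.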
It is also very useful for interactive queries on data since it can keep the data in memory. For example, they load a 39GB Wikipedia snapshot into memory; after the first query, subsequent queries return in less than a second because the data is already resident. But it is not aimed at graph processing and hence does not support graph queries like "find the distance between these two nodes".
They implement logistic regression and alternating least squares (ALS) on Spark and, with basic experiments on an EC2 cluster, show that while the first iteration takes a little longer than Hadoop's (around 174 sec), subsequent iterations are very fast (6 sec). In Hadoop, each iteration takes 130 sec, since the iterations run as independent MapReduce jobs.
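The logistic regression workload in the paper is a good example of why caching matters: every iteration is a full-gradient pass over the same points, so only the weights change between passes. A hedged plain-Python sketch of that loop (toy data I made up, not the paper's dataset or Spark code):

```python
import math

# Gradient descent for logistic regression: the (x, y) points are loaded
# once and rescanned every iteration. Spark keeps this dataset cached in
# memory; MapReduce would re-read it from disk on each iteration.
data = [((0.5, 1.0), 1), ((-0.3, -0.8), -1),
        ((1.2, 0.7), 1), ((-1.0, -0.4), -1)]
w = [0.0, 0.0]   # weight vector
lr = 0.1         # learning rate

for _ in range(10):                  # each pass scans the cached data
    grad = [0.0, 0.0]
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # derivative of log(1 + exp(-margin)) w.r.t. w, via chain rule
        coef = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for j in range(2):
            grad[j] += coef * x[j]
    for j in range(2):
        w[j] -= lr * grad[j]

# all training points classified correctly?
print(all(y * (w[0] * x[0] + w[1] * x[1]) > 0 for (x, y) in data))  # -> True
```

In the paper's measurements, the first Spark iteration is slower than Hadoop's precisely because it includes this one-time load; every later pass skips it.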
Spark is implemented in Scala and uses Mesos for cluster management. Like Twister, Spark is a Hadoop alternative for iterative jobs. It is suitable for large-scale machine learning but does not aim at graph processing. A follow-up work by the same group at Berkeley, GraphX, targets graph processing; I will read and review it too.
Another follow-up study, Shark, is a large-scale data warehouse system built on Spark and designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, faster option for new ones.