Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
Yucheng Low
Carnegie Mellon University
ylow@cs.cmu.edu

Joseph Gonzalez
Carnegie Mellon University
jegonzal@cs.cmu.edu

Aapo Kyrola
Carnegie Mellon University
akyrola@cs.cmu.edu

Danny Bickson
Carnegie Mellon University
bickson@cs.cmu.edu

Carlos Guestrin
Carnegie Mellon University
guestrin@cs.cmu.edu

Joseph M. Hellerstein
UC Berkeley
hellerstein@cs.berkeley.edu
ABSTRACT
While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
1. INTRODUCTION
With the exponential growth in the scale of Machine Learning and Data Mining (MLDM) problems and the increasing sophistication of MLDM techniques, there is an increasing need for systems that can execute MLDM algorithms efficiently in parallel on large clusters. Simultaneously, the availability of Cloud computing services like Amazon EC2 provides the promise of on-demand access to affordable large-scale computing and storage resources without substantial up-front investments. Unfortunately, designing, implementing, and debugging the distributed MLDM algorithms needed to fully utilize the Cloud can be prohibitively challenging, requiring MLDM experts to address race conditions, deadlocks, distributed state, and communication protocols while simultaneously developing mathematically complex models and algorithms.
Nonetheless, the demand for large-scale computational and storage resources has driven many [2, 14, 15, 27, 30, 35] to develop new parallel and distributed MLDM systems targeted at individual models and applications. This time-consuming and often redundant effort slows the progress of the field as different research groups repeatedly solve the same parallel/distributed computing problems. Therefore, the MLDM community needs a high-level distributed abstraction that specifically targets the asynchronous, dynamic, graph-parallel computation found in many MLDM applications while hiding the complexities of parallel/distributed system design. Unfortunately, existing high-level parallel abstractions (e.g., MapReduce [8, 9], Dryad [19] and Pregel [25]) fail to support these critical properties. To help fill this void we introduced [24] the GraphLab abstraction, which directly targets asynchronous, dynamic, graph-parallel computation in the shared-memory setting.
In this paper we extend the multicore GraphLab abstraction to the distributed setting and provide a formal description of the distributed execution model. We then explore several methods to implement an efficient distributed execution model while preserving strict consistency requirements. To achieve this goal we incorporate data versioning to reduce network congestion and pipelined distributed locking to mitigate the effects of network latency. To address the challenges of data locality and ingress we introduce the atom graph for rapidly placing graph-structured data in the distributed setting. We also add fault tolerance to the GraphLab framework by adapting the classic Chandy-Lamport [6] snapshot algorithm and demonstrate how it can be easily implemented within the GraphLab abstraction.

We conduct a comprehensive performance analysis of our optimized C++ implementation on the Amazon Elastic Compute Cloud (EC2) computing service. We show that applications created using GraphLab outperform equivalent Hadoop/MapReduce [9] implementations by 20-60x and match the performance of carefully constructed MPI implementations. Our main contributions are the following:
• A summary of common properties of MLDM algorithms and the limitations of existing large-scale frameworks. (Sec. 2)

• A modified version of the GraphLab abstraction and execution model tailored to the distributed setting. (Sec. 3)

• Two substantially different approaches to implementing the new distributed execution model (Sec. 4):

  ◦ Chromatic Engine: uses graph coloring to achieve efficient sequentially consistent execution for static schedules.

  ◦ Locking Engine: uses pipelined distributed locking and latency hiding to support dynamically prioritized execution.

• Fault tolerance through two snapshotting schemes. (Sec. 4.3)

• Implementations of three state-of-the-art machine learning algorithms on top of distributed GraphLab. (Sec. 5)

• An extensive evaluation of Distributed GraphLab using a 512 processor (64 node) EC2 cluster, including comparisons to Hadoop, Pregel, and MPI implementations. (Sec. 5)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 8
Copyright 2012 VLDB Endowment 2150-8097/12/04... $10.00.
2. MLDM ALGORITHM PROPERTIES
In this section we describe several key properties of efficient large-scale parallel MLDM systems addressed by the GraphLab abstraction [24] and how other parallel frameworks fail to address these properties. A summary of these properties and parallel frameworks can be found in Table 1.
Graph Structured Computation: Many of the recent advances in MLDM have focused on modeling the dependencies between data. By modeling data dependencies, we are able to extract more signal from noisy data. For example, modeling the dependencies between similar shoppers allows us to make better product recommendations than treating shoppers in isolation. Unfortunately, data parallel abstractions like MapReduce [9] are not generally well suited for the dependent computation typically required by more advanced MLDM algorithms. Although it is often possible to map algorithms with computational dependencies into the MapReduce abstraction, the resulting transformations can be challenging and may introduce substantial inefficiency.

As a consequence, there has been a recent trend toward graph-parallel abstractions like Pregel [25] and GraphLab [24] which naturally express computational dependencies. These abstractions adopt a vertex-centric model in which computation is defined as kernels that run on each vertex. For instance, Pregel is a bulk synchronous message passing abstraction where vertices communicate through messages. On the other hand, GraphLab is a sequential shared memory abstraction where each vertex can read and write to data on adjacent vertices and edges. The GraphLab runtime is then responsible for ensuring a consistent parallel execution. Consequently, GraphLab simplifies the design and implementation of graph-parallel algorithms by freeing the user to focus on sequential computation rather than the parallel movement of data (i.e., messaging).
Asynchronous Iterative Computation: Many important MLDM algorithms iteratively update a large set of parameters. Because of the underlying graph structure, parameter updates (on vertices or edges) depend (through the graph adjacency structure) on the values of other parameters. In contrast to synchronous systems, which update all parameters simultaneously (in parallel) using parameter values from the previous time step as input, asynchronous systems update parameters using the most recent parameter values as input. As a consequence, asynchronous systems provide many MLDM algorithms with significant algorithmic benefits. For example, linear systems (common to many MLDM algorithms) have been shown to converge faster when solved asynchronously [4]. Additionally, there are numerous other cases (e.g., belief propagation [13], expectation maximization [28], and stochastic optimization [35, 34]) where asynchronous procedures have been empirically shown to significantly outperform synchronous procedures. In Fig. 1(a) we demonstrate how asynchronous computation can substantially accelerate the convergence of PageRank.
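The distinction between the two update disciplines can be made concrete with a toy sketch (our own illustration, not code from the paper): a synchronous sweep reads only the previous iteration's values, while an asynchronous sweep reads whatever values are freshest. The three-page cycle graph and the constants below are invented for the example.

```python
def sync_sweep(rank, in_links, weight, alpha, n):
    """Synchronous (Jacobi-style): every read comes from the old snapshot."""
    old = dict(rank)
    for v in rank:
        rank[v] = alpha / n + (1 - alpha) * sum(
            weight[(u, v)] * old[u] for u in in_links[v])

def async_sweep(rank, in_links, weight, alpha, n):
    """Asynchronous (Gauss-Seidel-style): reads see the freshest in-place values."""
    for v in rank:
        rank[v] = alpha / n + (1 - alpha) * sum(
            weight[(u, v)] * rank[u] for u in in_links[v])

# Hypothetical 3-page cycle: 0 -> 1 -> 2 -> 0, unit link weights.
alpha, n = 0.15, 3
in_links = {0: [2], 1: [0], 2: [1]}
weight = {(2, 0): 1.0, (0, 1): 1.0, (1, 2): 1.0}
r_sync = {v: 1.0 for v in range(n)}
r_async = {v: 1.0 for v in range(n)}
for _ in range(100):
    sync_sweep(r_sync, in_links, weight, alpha, n)
    async_sweep(r_async, in_links, weight, alpha, n)
# Both reach the same fixed point R(v) = 1/3; the asynchronous sweeps
# typically get there in fewer iterations because reads see fresh values.
```

The fixed point is the same either way; the benefit of asynchrony is in how quickly the error contracts per sweep.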
Synchronous computation incurs costly performance penalties since the runtime of each phase is determined by the slowest machine. The poor performance of the slowest machine may be caused by a multitude of factors including: load and network imbalances, hardware variability, and multi-tenancy (a principal concern in the Cloud). Even in typical cluster settings, each compute node may also provide other services (e.g., distributed file systems). Imbalances in the utilization of these other services will result in substantial performance penalties if synchronous computation is used.

In addition, variability in the complexity and convergence of the individual vertex kernels can produce additional variability in execution time, even when the graph is uniformly partitioned. For example, natural graphs encountered in real-world applications have power-law degree distributions which can lead to highly skewed running times even with a random partition [36]. Furthermore, the actual work required for each vertex could depend on the data in a problem specific manner (e.g., local rate of convergence).
While abstractions based on bulk data processing, such as MapReduce [9] and Dryad [19], were not designed for iterative computation, recent projects such as Spark [38] extend MapReduce and other data parallel abstractions to the iterative setting. However, these abstractions still do not support asynchronous computation. Bulk Synchronous Parallel (BSP) abstractions such as Pregel [25], Piccolo [33], and BPGL [16] do not naturally express asynchronicity. On the other hand, the shared memory GraphLab abstraction was designed to efficiently and naturally express the asynchronous iterative algorithms common to advanced MLDM.
Dynamic Computation: In many MLDM algorithms, iterative computation converges asymmetrically. For example, in parameter optimization, often a large number of parameters will quickly converge in a few iterations, while the remaining parameters will converge slowly over many iterations [11, 10]. In Fig. 1(b) we plot the distribution of updates required to reach convergence for PageRank. Surprisingly, the majority of the vertices required only a single update while only about 3% of the vertices required more than 10 updates. Additionally, prioritizing computation can further accelerate convergence as demonstrated by Zhang et al. [39] for a variety of graph algorithms including PageRank. If we update all parameters equally often, we waste time recomputing parameters that have effectively converged. Conversely, by focusing early computation on more challenging parameters, we can potentially accelerate convergence. In Fig. 1(c) we empirically demonstrate how dynamic scheduling can accelerate convergence of loopy belief propagation (a popular MLDM algorithm).
Several recent abstractions have incorporated forms of dynamic computation. For example, Pregel [25] supports a limited form of dynamic computation by allowing some vertices to skip computation on each superstep. Other abstractions like Pearce et al. [32] and GraphLab allow the user to adaptively prioritize computation. While both Pregel and GraphLab support dynamic computation, only GraphLab permits prioritization as well as the ability to adaptively pull information from adjacent vertices (see Sec. 3.2 for more details). In this paper we relax some of the original GraphLab scheduling requirements described in [24] to enable efficient distributed FIFO and priority scheduling.
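One standard way to realize priority scheduling with deduplication is a heap with lazy deletion: re-adding an already scheduled vertex only raises its priority rather than duplicating it. The sketch below is our own illustration of the idea; the class and method names are invented, not GraphLab's API.

```python
import heapq

class PriorityScheduler:
    """Pop vertices highest-priority first; duplicate adds only raise priority."""

    def __init__(self):
        self._heap = []        # entries of (negated priority, vertex)
        self._pending = {}     # vertex -> best priority scheduled so far

    def add(self, vertex, priority):
        if self._pending.get(vertex, float("-inf")) >= priority:
            return             # already scheduled at equal or higher priority
        self._pending[vertex] = priority
        heapq.heappush(self._heap, (-priority, vertex))

    def pop(self):
        while self._heap:
            neg_priority, vertex = heapq.heappop(self._heap)
            # Skip stale heap entries left behind by priority raises.
            if self._pending.get(vertex) == -neg_priority:
                del self._pending[vertex]
                return vertex
        raise IndexError("empty scheduler")

    def __len__(self):
        return len(self._pending)
```

With this structure, the runtime can honor user priorities while still guaranteeing that every scheduled vertex is eventually executed exactly once per scheduling.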
Serializability: By ensuring that all parallel executions have an equivalent sequential execution, serializability eliminates many challenges associated with designing, implementing, and testing parallel MLDM algorithms. In addition, many algorithms converge faster if serializability is ensured, and some even require serializability for correctness. For instance, Dynamic ALS (Sec. 5.1) is unstable when allowed to race (Fig. 1(d)). Gibbs sampling, a very popular MLDM algorithm, requires serializability for statistical correctness.
Framework             | Computation Model | Sparse Depend. | Async. Comp. | Iterative      | Prioritized Ordering | Enforce Consistency | Distributed
----------------------|-------------------|----------------|--------------|----------------|----------------------|---------------------|------------
MPI                   | Messaging         | Yes            | Yes          | Yes            | N/A                  | No                  | Yes
MapReduce [9]         | Par. data-flow    | No             | No           | extensions (a) | No                   | Yes                 | Yes
Dryad [19]            | Par. data-flow    | Yes            | No           | extensions (b) | No                   | Yes                 | Yes
Pregel [25]/BPGL [16] | Graph BSP         | Yes            | No           | Yes            | No                   | Yes                 | Yes
Piccolo [33]          | Distr. map        | No             | No           | Yes            | No                   | Partially (c)       | Yes
Pearce et al. [32]    | Graph Visitor     | Yes            | Yes          | Yes            | Yes                  | No                  | No
GraphLab              | GraphLab          | Yes            | Yes          | Yes            | Yes                  | Yes                 | Yes

Table 1: Comparison chart of large-scale computation frameworks. (a) [38] describes an iterative extension of MapReduce. (b) [18] proposes an iterative extension for Dryad. (c) Piccolo does not provide a mechanism to ensure consistency but instead exposes a mechanism for the user to attempt to recover from simultaneous writes.
[Figure 1: four panels. (a) Async vs Sync PageRank; (b) Dynamic PageRank; (c) Loopy BP convergence; (d) ALS consistency.]

Figure 1: (a) Rate of convergence, measured in L1 error to the true PageRank vector versus time, of the PageRank algorithm on a 25M vertex 355M edge web graph on 16 processors. (b) The distribution of update counts after running dynamic PageRank to convergence. Notice that the majority of the vertices converged in only a single update. (c) Rate of convergence of Loopy Belief Propagation on web-spam detection. (d) Comparing serializable and non-serializable (racing) execution of the dynamic ALS algorithm in Sec. 5.1 on the Netflix movie recommendation problem. Non-serializable execution exhibits unstable convergence behavior.
An abstraction that enforces serializable computation eliminates much of the complexity introduced by concurrency, allowing the MLDM expert to focus on the algorithm and model design. Debugging mathematical code in a concurrent program whose data is corrupted by data races is difficult and time consuming. Surprisingly, many asynchronous abstractions like [32] do not ensure serializability or, like Piccolo [33], provide only basic mechanisms to recover from data races. GraphLab supports a broad range of consistency settings, allowing a program to choose the level of consistency needed for correctness. In Sec. 4 we describe several techniques we developed to enforce serializability in the distributed setting.
3. DIST. GRAPHLAB ABSTRACTION
The GraphLab abstraction consists of three main parts: the data graph, the update function, and the sync operation. The data graph (Sec. 3.1) represents user modifiable program state; it stores the mutable user-defined data and encodes the sparse computational dependencies. The update function (Sec. 3.2) represents the user computation and operates on the data graph by transforming data in small overlapping contexts called scopes. Finally, the sync operation (Sec. 3.5) concurrently maintains global aggregates. To ground the GraphLab abstraction in a concrete problem, we will use the PageRank algorithm [31] as a running example.
Example 1 (PageRank). The PageRank algorithm recursively defines the rank of a webpage v:

    R(v) = α/n + (1 − α) Σ_{u links to v} w_{u,v} × R(u)    (1)

in terms of the weighted (w_{u,v}) ranks R(u) of the pages u that link to v, as well as some probability α of randomly jumping to that page. The PageRank algorithm iterates Eq. (1) until the PageRank changes by less than some small value ε.
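The fixed-point iteration of Eq. (1) can be sketched directly. The code below is our own illustrative transcription; the three-page web graph, link weights, and default constants are invented for the example.

```python
def pagerank(in_links, weight, alpha=0.15, eps=1e-9):
    """Iterate Eq. (1) until no rank changes by more than eps."""
    n = len(in_links)
    rank = {v: 1.0 / n for v in in_links}
    while True:
        new_rank = {
            v: alpha / n + (1 - alpha) * sum(
                weight[(u, v)] * rank[u] for u in in_links[v])
            for v in in_links}
        delta = max(abs(new_rank[v] - rank[v]) for v in in_links)
        rank = new_rank
        if delta < eps:
            return rank

# Hypothetical web graph: page 0 links to 1 and 2 (half weight each),
# page 1 links to 2, page 2 links to 0. in_links maps v -> pages linking to v.
in_links = {0: [2], 1: [0], 2: [0, 1]}
weight = {(2, 0): 1.0, (0, 1): 0.5, (0, 2): 0.5, (1, 2): 1.0}
ranks = pagerank(in_links, weight)
```

Because each page's outgoing weights sum to one, the total rank mass stays at 1 throughout the iteration, and the page with more in-links (page 2) ends up ranked above the page with fewer (page 1).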
3.1 Data Graph
The GraphLab abstraction stores the program state as a directed graph called the data graph. The data graph G = (V, E, D) is a container that manages the user defined data D. Here we use the term data broadly to refer to model parameters, algorithm state, and even statistical data. Users can associate arbitrary data with each vertex {D_v : v ∈ V} and edge {D_{u→v} : {u,v} ∈ E} in the graph. However, as the GraphLab abstraction is not dependent on edge directions, we also use D_{u↔v} to denote the data on both edge directions u → v and v → u. Finally, while the graph data is mutable, the structure is static and cannot be changed during execution.

Example 2 (PageRank: Ex. 1). The data graph is directly obtained from the web graph, where each vertex corresponds to a web page and each edge represents a link. The vertex data D_v stores R(v), the current estimate of the PageRank, and the edge data D_{u→v} stores w_{u,v}, the directed weight of the link.
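A minimal container with these properties, mutable data on a static structure, might look like the sketch below. This is our own illustration; the class and method names are invented and do not reflect GraphLab's C++ API.

```python
class DataGraph:
    """G = (V, E, D): arbitrary user data on vertices and directed edges.

    Data is mutable, but the structure is fixed at construction: there are
    only query methods, no add/remove operations."""

    def __init__(self, vertices, edges):
        self._vertex_data = {v: None for v in vertices}
        self._edge_data = {e: None for e in edges}   # e is a pair (u, v)
        self._in = {v: [] for v in vertices}
        self._out = {v: [] for v in vertices}
        for (u, v) in edges:
            self._out[u].append(v)
            self._in[v].append(u)

    # Mutable data D_v and D_{u->v}.
    def set_vertex_data(self, v, data):
        self._vertex_data[v] = data

    def vertex_data(self, v):
        return self._vertex_data[v]

    def set_edge_data(self, u, v, data):
        self._edge_data[(u, v)] = data

    def edge_data(self, u, v):
        return self._edge_data[(u, v)]

    # Static structure: read-only adjacency queries.
    def in_neighbors(self, v):
        return tuple(self._in[v])

    def out_neighbors(self, v):
        return tuple(self._out[v])
```

For the PageRank example, one would store R(v) as vertex data and w_{u,v} as edge data on the corresponding link.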
3.2 Update Functions
Computation is encoded in the GraphLab abstraction in the form of update functions. An update function is a stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of update functions on other vertices. The scope of vertex v (denoted by S_v) is the data stored in v, as well as the data stored in all adjacent vertices and adjacent edges (Fig. 2(a)).

A GraphLab update function takes as an input a vertex v and its scope S_v and returns the new versions of the data in the scope as well as a set of vertices T:

    Update: f(v, S_v) → (S_v, T)

After executing an update function the modified data in S_v is written back to the data graph. The set of vertices u ∈ T are eventually executed by applying the update function f(u, S_u) following the execution semantics described later in Sec. 3.3.

Rather than adopting a message passing or data flow model as in [25, 19], GraphLab allows the user defined update functions complete freedom to read and modify any of the data on adjacent vertices and edges. This simplifies user code and eliminates the need for the users to reason about the movement of data. By controlling which vertices are returned in T, and thus executed, GraphLab update functions can efficiently express adaptive computation. For example, an update function may choose to return (schedule) its neighbors only when it has made a substantial change to its local data.

There is an important difference between Pregel and GraphLab in how dynamic computation is expressed. GraphLab decouples the scheduling of future computation from the movement of data. As a consequence, GraphLab update functions have access to data on adjacent vertices even if the adjacent vertices did not schedule the current update. Conversely, Pregel update functions are initiated by messages and can only access the data in the message, limiting what can be expressed. For instance, dynamic PageRank is difficult to express in Pregel since the PageRank computation for a given page requires the PageRank values of all adjacent pages even if some of the adjacent pages have not recently changed. Therefore, the decision to send data (PageRank values) to neighboring vertices cannot be made by the sending vertex (as required by Pregel) but instead must be made by the receiving vertex. GraphLab naturally expresses the pull model, since adjacent vertices are only responsible for scheduling, and update functions can directly read adjacent vertex values even if they have not changed.

Example 3 (PageRank: Ex. 1). The update function for PageRank (defined in Alg. 1) computes a weighted sum of the current ranks of neighboring vertices and assigns it as the rank of the current vertex. The algorithm is adaptive: neighbors are scheduled for update only if the value of the current vertex changes by more than a predefined threshold.
Algorithm 1: PageRank update function
    Input: Vertex data R(v) from S_v
    Input: Edge data {w_{u,v} : u ∈ N[v]} from S_v
    Input: Neighbor vertex data {R(u) : u ∈ N[v]} from S_v
    R_old(v) ← R(v)                         // Save old PageRank
    R(v) ← α/n
    foreach u ∈ N[v] do                     // Loop over neighbors
        R(v) ← R(v) + (1 − α) ∗ w_{u,v} ∗ R(u)
    // If the PageRank changes sufficiently
    if |R(v) − R_old(v)| > ε then
        // Schedule neighbors to be updated
        return {u : u ∈ N[v]}
    Output: Modified scope S_v with new R(v)
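Alg. 1 translates almost line for line into executable form. The sketch below is our own Python transcription (the paper's implementation is C++); the scope is passed as plain dictionaries for illustration.

```python
def pagerank_update(v, rank, in_neighbors, weight, alpha, n, eps):
    """Recompute R(v) in place; return the set of vertices to schedule."""
    r_old = rank[v]                          # save old PageRank
    r_new = alpha / n
    for u in in_neighbors[v]:                # loop over neighbors
        r_new += (1 - alpha) * weight[(u, v)] * rank[u]
    rank[v] = r_new
    # Schedule the neighbors only if the PageRank changed sufficiently.
    if abs(r_new - r_old) > eps:
        return set(in_neighbors[v])
    return set()
```

Note the adaptive behavior: a vertex whose rank has effectively converged returns the empty set and generates no further work.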
3.3 The GraphLab Execution Model
The GraphLab execution model, presented in Alg. 2, follows a simple single loop semantics. The input to the GraphLab abstraction consists of the data graph G = (V, E, D), an update function, and an initial set of vertices T to be executed. While there are vertices remaining in T, the algorithm removes (Line 1) and executes (Line 2) vertices, adding any new vertices back into T (Line 3). Duplicate vertices are ignored. The resulting data graph and global values are returned to the user on completion.

Algorithm 2: GraphLab Execution Model
    Input: Data Graph G = (V, E, D)
    Input: Initial vertex set T = {v_1, v_2, ...}
    while T is not Empty do
1       v ← RemoveNext(T)
2       (T′, S_v) ← f(v, S_v)
3       T ← T ∪ T′
    Output: Modified Data Graph G = (V, E, D′)

To enable a more efficient distributed execution, we relax the execution ordering requirements of the shared-memory GraphLab abstraction and allow the GraphLab runtime to determine the best order to execute vertices. For example, RemoveNext(T) (Line 1) may choose to return vertices in an order that minimizes network communication or latency (see Sec. 4.2.2). The only requirement imposed by the GraphLab abstraction is that all vertices in T are eventually executed. Because many MLDM applications benefit from prioritization, the GraphLab abstraction allows users to assign priorities to the vertices in T. The GraphLab runtime may use these priorities in conjunction with system level objectives to optimize the order in which the vertices are executed.
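The single-loop semantics of Alg. 2 fits in a few lines. In this sketch (our own illustration, not GraphLab's engine), a set plays the role of T, which makes duplicate scheduling a no-op, and RemoveNext is simply an arbitrary pop, since the abstraction leaves the ordering to the runtime.

```python
def graphlab_execute(update, graph, initial_vertices):
    """Run Alg. 2: execute update(v, graph) until no vertices remain."""
    pending = set(initial_vertices)        # T: duplicates are ignored
    while pending:
        v = pending.pop()                  # Line 1: RemoveNext(T)
        new_vertices = update(v, graph)    # Line 2: (T', S_v) <- f(v, S_v)
        pending |= new_vertices            # Line 3: T <- T union T'
    return graph                           # modified data graph
```

Any update function that returns the set of vertices it wants scheduled, such as the PageRank update of Alg. 1, can be driven by this loop.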
3.4 Ensuring Serializability
The GraphLab abstraction presents a rich sequential model which is automatically translated into a parallel execution by allowing multiple processors to execute the same loop on the same graph, removing and executing different vertices simultaneously. To retain the sequential execution semantics we must ensure that overlapping computation is not run simultaneously. We introduce several consistency models that allow the runtime to optimize the parallel execution while maintaining serializability.

The GraphLab runtime ensures a serializable execution. A serializable execution implies that there exists a corresponding serial schedule of update functions that, when executed by Alg. 2, produces the same values in the data graph. By ensuring serializability, GraphLab simplifies reasoning about highly-asynchronous dynamic computation in the distributed setting.

A simple method to achieve serializability is to ensure that the scopes of concurrently executing update functions do not overlap. In [24] we call this the full consistency model (see Fig. 2(b)). However, full consistency limits the potential parallelism since concurrently executing update functions must be at least two vertices apart (see Fig. 2(c)). Fortunately, for many machine learning algorithms, the update functions do not need full read/write access to all of the data within the scope. For instance, the PageRank update in Eq. (1) only requires read access to edges and neighboring vertices. To provide greater parallelism while retaining serializability, GraphLab defines the edge consistency model. The edge consistency model ensures each update function has exclusive read-write access to its vertex and adjacent edges but read only access to adjacent vertices (Fig. 2(b)). As a consequence, the edge consistency model increases parallelism by allowing update functions with slightly overlapping scopes to safely run in parallel (see Fig. 2(c)). Finally, the vertex consistency model allows all update functions to be run in parallel, providing maximum parallelism.
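One standard way to realize scope-level consistency in shared memory is to take per-vertex locks over the scope in a canonical global order, so that two updates with overlapping scopes serialize instead of deadlocking. The sketch below is our own illustration of that idea only, not GraphLab's engine; it uses exclusive locks throughout, standing in for the finer read/write distinction that the edge consistency model actually makes.

```python
import threading

class ScopeLocker:
    """Acquire per-vertex locks over a scope in a fixed global order."""

    def __init__(self, vertices):
        self._locks = {v: threading.Lock() for v in vertices}

    def acquire_scope(self, v, neighbors):
        # Locking in sorted order prevents deadlock between two updates
        # whose scopes overlap (classic ordered lock acquisition).
        order = sorted(set([v]) | set(neighbors))
        for u in order:
            self._locks[u].acquire()
        return order

    def release_scope(self, order):
        for u in reversed(order):
            self._locks[u].release()
```

An update on v would wrap its body in acquire_scope(v, neighbors) / release_scope(order); Sec. 4 develops the distributed, pipelined form of this locking.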
3.5 Sync Operation and Global Values
In many MLDM algorithms it is necessary to maintain global statistics describing data stored in the data graph. For example, many statistical inference algorithms require tracking global convergence estimators. To address this need, the GraphLab abstraction defines global values that may be read by update functions, but are written using sync operations. Similar to aggregates in Pregel, the sync operation is an associative commutative sum:

    Z = Finalize( ⊕_{v ∈ V} Map(S_v) )    (2)
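The shape of Eq. (2), a map over every scope, an associative commutative combine, and a finalization step, can be sketched as follows. This is our own illustration; the map/combine/finalize functions here compute a global mean over per-vertex values and are invented for the example.

```python
def sync(scopes, map_fn, combine_fn, finalize_fn):
    """Z = Finalize( combine over v of Map(S_v) ), per Eq. (2)."""
    acc = None
    for s in scopes:
        acc = map_fn(s) if acc is None else combine_fn(acc, map_fn(s))
    return finalize_fn(acc)

# Example global value: the mean of a per-vertex quantity (e.g., rank).
ranks = [0.2, 0.4, 0.6]
mean = sync(
    ranks,
    map_fn=lambda r: (r, 1),                             # Map: (sum, count)
    combine_fn=lambda a, b: (a[0] + b[0], a[1] + b[1]),  # associative combine
    finalize_fn=lambda a: a[0] / a[1])                   # Finalize: divide
```

Because the combine is associative and commutative, the runtime is free to evaluate the sum over vertices in any order, or in parallel across machines.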