Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud

Yucheng Low (Carnegie Mellon University, ylow@cs.cmu.edu), Joseph Gonzalez (Carnegie Mellon University, jegonzal@cs.cmu.edu), Aapo Kyrola (Carnegie Mellon University, akyrola@cs.cmu.edu), Danny Bickson (Carnegie Mellon University, bickson@cs.cmu.edu), Carlos Guestrin (Carnegie Mellon University, guestrin@cs.cmu.edu), Joseph M. Hellerstein (UC Berkeley, hellerstein@cs.berkeley.edu)

ABSTRACT

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

1. INTRODUCTION

With the exponential growth in the scale of Machine Learning and Data Mining (MLDM) problems and the increasing sophistication of MLDM techniques, there is an increasing need for systems that can execute MLDM algorithms efficiently in parallel on large clusters. Simultaneously, the availability of Cloud computing services like Amazon EC2 provides the promise of on-demand access to affordable large-scale computing and storage resources without substantial upfront investments. Unfortunately, designing, implementing, and debugging the distributed MLDM algorithms needed to fully utilize the Cloud can be prohibitively challenging, requiring MLDM experts to address race conditions, deadlocks, distributed state, and communication protocols while simultaneously developing mathematically complex models and algorithms.

Nonetheless, the demand for large-scale computational and storage resources has driven many [2, 14, 15, 27, 30, 35] to develop new parallel and distributed MLDM systems targeted at individual models and applications. This time-consuming and often redundant effort slows the progress of the field as different research groups repeatedly solve the same parallel/distributed computing problems. Therefore, the MLDM community needs a high-level distributed abstraction that specifically targets the asynchronous, dynamic, graph-parallel computation found in many MLDM applications while hiding the complexities of parallel/distributed system design. Unfortunately, existing high-level parallel abstractions (e.g. MapReduce [8, 9], Dryad [19] and Pregel [25]) fail to support these critical properties. To help fill this void we introduced [24] the GraphLab abstraction, which directly targets asynchronous, dynamic, graph-parallel computation in the shared-memory setting.
In this paper we extend the multi-core GraphLab abstraction to the distributed setting and provide a formal description of the distributed execution model. We then explore several methods to implement an efficient distributed execution model while preserving strict consistency requirements. To achieve this goal we incorporate data versioning to reduce network congestion and pipelined distributed locking to mitigate the effects of network latency. To address the challenges of data locality and ingress we introduce the atom graph for rapidly placing graph structured data in the distributed setting. We also add fault tolerance to the GraphLab framework by adapting the classic Chandy-Lamport [6] snapshot algorithm and demonstrate how it can be easily implemented within the GraphLab abstraction.

We conduct a comprehensive performance analysis of our optimized C++ implementation on the Amazon Elastic Compute Cloud (EC2) computing service. We show that applications created using GraphLab outperform equivalent Hadoop/MapReduce [9] implementations by 20-60x and match the performance of carefully constructed MPI implementations. Our main contributions are the following:

• A summary of common properties of MLDM algorithms and the limitations of existing large-scale frameworks. (Sec. 2)
• A modified version of the GraphLab abstraction and execution model tailored to the distributed setting. (Sec. 3)
• Two substantially different approaches to implementing the new distributed execution model (Sec. 4):
  ◦ Chromatic Engine: uses graph coloring to achieve efficient sequentially consistent execution for static schedules.
  ◦ Locking Engine: uses pipelined distributed locking and latency hiding to support dynamically prioritized execution.
• Fault tolerance through two snapshotting schemes. (Sec. 4.3)
• Implementations of three state-of-the-art machine learning algorithms on top of distributed GraphLab. (Sec. 5)
• An extensive evaluation of Distributed GraphLab using a 512-processor (64-node) EC2 cluster, including comparisons to Hadoop, Pregel, and MPI implementations. (Sec. 5)

2. MLDM ALGORITHM PROPERTIES

In this section we describe several key properties of efficient large-scale parallel MLDM systems addressed by the GraphLab abstraction [24] and how other parallel frameworks fail to address these properties. A summary of these properties and parallel frameworks can be found in Table 1.

Graph Structured Computation: Many of the recent advances in MLDM have focused on modeling the dependencies between data. By modeling data dependencies, we are able to extract more signal from noisy data. For example, modeling the dependencies between similar shoppers allows us to make better product recommendations than treating shoppers in isolation.
Unfortunately, data parallel abstractions like MapReduce [9] are not generally well suited for the dependent computation typically required by more advanced MLDM algorithms. Although it is often possible to map algorithms with computational dependencies into the MapReduce abstraction, the resulting transformations can be challenging and may introduce substantial inefficiency.

As a consequence, there has been a recent trend toward graph-parallel abstractions like Pregel [25] and GraphLab [24] which naturally express computational dependencies. These abstractions adopt a vertex-centric model in which computation is defined as kernels that run on each vertex. For instance, Pregel is a bulk synchronous message passing abstraction where vertices communicate through messages. On the other hand, GraphLab is a sequential shared memory abstraction where each vertex can read and write to data on adjacent vertices and edges. The GraphLab runtime is then responsible for ensuring a consistent parallel execution. Consequently, GraphLab simplifies the design and implementation of graph-parallel algorithms by freeing the user to focus on sequential computation rather than the parallel movement of data (i.e., messaging).

Asynchronous Iterative Computation: Many important MLDM algorithms iteratively update a large set of parameters. Because of the underlying graph structure, parameter updates (on vertices or edges) depend (through the graph adjacency structure) on the values of other parameters. In contrast to synchronous systems, which update all parameters simultaneously (in parallel) using parameter values from the previous time step as input, asynchronous systems update parameters using the most recent parameter values as input. As a consequence, asynchronous execution provides many MLDM algorithms with significant algorithmic benefits. For example, linear systems (common to many MLDM algorithms) have been shown to converge faster when solved asynchronously [4]. Additionally, there are numerous other cases (e.g., belief propagation [13], expectation maximization [28], and stochastic optimization [35, 34]) where asynchronous procedures have been empirically shown to significantly outperform synchronous procedures. In Fig. 1(a) we demonstrate how asynchronous computation can substantially accelerate the convergence of PageRank.

Synchronous computation incurs costly performance penalties since the runtime of each phase is determined by the slowest machine. The poor performance of the slowest machine may be caused by a multitude of factors including: load and network imbalances, hardware variability, and multi-tenancy (a principal concern in the Cloud). Even in typical cluster settings, each compute node may also provide other services (e.g., distributed file systems). Imbalances in the utilization of these other services will result in substantial performance penalties if synchronous computation is used.

In addition, variability in the complexity and convergence of the individual vertex kernels can produce additional variability in execution time, even when the graph is uniformly partitioned. For example, natural graphs encountered in real-world applications have power-law degree distributions which can lead to highly skewed running times even with a random partition [36]. Furthermore, the actual work required for each vertex could depend on the data in a problem specific manner (e.g., local rate of convergence).
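To make the synchronous/asynchronous distinction above concrete, here is a minimal C++ sketch. It is our illustration, not code from the paper or from GraphLab; the adjacency layout, weights, and damping constant are assumptions. The synchronous sweep double-buffers (every update reads last sweep's values), while the asynchronous sweep updates in place so each vertex consumes the freshest neighbor values.

```cpp
// Minimal, self-contained sketch (not GraphLab code): synchronous vs.
// asynchronous PageRank-style sweeps on a toy graph.
#include <cstdio>
#include <utility>
#include <vector>

struct Graph {
    // in_edges[v] holds pairs (u, w_uv): pages u linking to v with weight w_uv.
    std::vector<std::vector<std::pair<int, double>>> in_edges;
};

constexpr double kAlpha = 0.15;  // random-jump probability, as in Eq. (1) below

// Synchronous (Jacobi-style): every new rank is computed from the previous
// sweep's ranks; all updates could run in parallel, but none sees fresh
// information until the barrier at the end of the sweep.
void sync_sweep(const Graph& g, std::vector<double>& rank) {
    const double n = static_cast<double>(rank.size());
    std::vector<double> next(rank.size());
    for (std::size_t v = 0; v < rank.size(); ++v) {
        double sum = 0.0;
        for (const auto& [u, w] : g.in_edges[v]) sum += w * rank[u];  // old values
        next[v] = kAlpha / n + (1.0 - kAlpha) * sum;
    }
    rank.swap(next);  // barrier: publish all new ranks at once
}

// Asynchronous (Gauss-Seidel-style): updates are applied in place, so each
// vertex immediately consumes the most recent ranks of its neighbors. This
// is the property that often yields faster convergence.
void async_sweep(const Graph& g, std::vector<double>& rank) {
    const double n = static_cast<double>(rank.size());
    for (std::size_t v = 0; v < rank.size(); ++v) {
        double sum = 0.0;
        for (const auto& [u, w] : g.in_edges[v]) sum += w * rank[u];  // fresh values
        rank[v] = kAlpha / n + (1.0 - kAlpha) * sum;  // visible immediately
    }
}

int main() {
    Graph g;
    g.in_edges = {{{2, 1.0}}, {{0, 1.0}}, {{1, 1.0}}};  // toy 3-page cycle
    std::vector<double> rank(3, 1.0 / 3.0);
    for (int i = 0; i < 20; ++i) async_sweep(g, rank);
    std::printf("R = %.3f %.3f %.3f\n", rank[0], rank[1], rank[2]);
    return 0;
}
```

In a distributed deployment the barrier in sync_sweep is exactly the phase whose runtime is dictated by the slowest machine, which is why the paper argues against purely synchronous execution.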
While abstractions based on bulk data processing, such as MapReduce [9] and Dryad [19], were not designed for iterative computation, recent projects such as Spark [38] extend MapReduce and other data parallel abstractions to the iterative setting. However, these abstractions still do not support asynchronous computation. Bulk Synchronous Parallel (BSP) abstractions such as Pregel [25], Piccolo [33], and BPGL [16] do not naturally express asynchronicity. On the other hand, the shared memory GraphLab abstraction was designed to efficiently and naturally express the asynchronous iterative algorithms common to advanced MLDM.

Dynamic Computation: In many MLDM algorithms, iterative computation converges asymmetrically. For example, in parameter optimization, often a large number of parameters will quickly converge in a few iterations, while the remaining parameters will converge slowly over many iterations [11, 10]. In Fig. 1(b) we plot the distribution of updates required to reach convergence for PageRank. Surprisingly, the majority of the vertices required only a single update while only about 3% of the vertices required more than 10 updates. Additionally, prioritizing computation can further accelerate convergence as demonstrated by Zhang et al. [39] for a variety of graph algorithms including PageRank. If we update all parameters equally often, we waste time recomputing parameters that have effectively converged. Conversely, by focusing early computation on more challenging parameters, we can potentially accelerate convergence. In Fig. 1(c) we empirically demonstrate how dynamic scheduling can accelerate convergence of loopy belief propagation (a popular MLDM algorithm).

Several recent abstractions have incorporated forms of dynamic computation. For example, Pregel [25] supports a limited form of dynamic computation by allowing some vertices to skip computation on each super-step. Other abstractions like Pearce et al. [32] and GraphLab allow the user to adaptively prioritize computation. While both Pregel and GraphLab support dynamic computation, only GraphLab permits prioritization as well as the ability to adaptively pull information from adjacent vertices (see Sec. 3.2 for more details). In this paper we relax some of the original GraphLab scheduling requirements described in [24] to enable efficient distributed FIFO and priority scheduling.

Serializability: By ensuring that all parallel executions have an equivalent sequential execution, serializability eliminates many challenges associated with designing, implementing, and testing parallel MLDM algorithms. In addition, many algorithms converge faster if serializability is ensured, and some even require serializability for correctness. For instance, Dynamic ALS (Sec. 5.1) is unstable when allowed to race (Fig. 1(d)). Gibbs sampling, a very popular MLDM algorithm, requires serializability for statistical correctness.

Table 1: Comparison of large-scale computation frameworks. (a) [38] describes an iterative extension of MapReduce.

Framework              Computation Model   Sparse Depend.   Async. Comp.   Iterative        Prioritized Ordering   Enforce Consistency   Distributed
MPI                    Messaging           Yes              Yes            Yes              N/A                    No                    Yes
MapReduce [9]          Par. data-flow      No               No             extensions (a)   No                     Yes                   Yes
Dryad [19]             Par. data-flow      Yes              No             extensions (b)   No                     Yes                   Yes
Pregel [25]/BPGL [16]  Graph BSP           Yes              No             Yes              No                     Yes                   Yes
Piccolo [33]           Distr. map          No               No             Yes              No                     Partially (c)         Yes
Pearce et al. [32]     Graph Visitor       Yes              Yes            Yes              Yes                    No                    No
GraphLab               GraphLab            Yes              Yes            Yes              Yes                    Yes                   Yes
(b) [18] proposes an iterative extension for Dryad. (c) Piccolo does not provide a mechanism to ensure consistency but instead exposes a mechanism for the user to attempt to recover from simultaneous writes.

[Figure 1 (four convergence plots) could not be recovered from this transcript; the caption below summarizes its panels.]

Figure 1: (a) Rate of convergence, measured in L1 error to the true PageRank vector versus time, of the PageRank algorithm on a 25M vertex, 355M edge web graph on 16 processors. (b) The distribution of update counts after running dynamic PageRank to convergence. Notice that the majority of the vertices converged in only a single update. (c) Rate of convergence of loopy belief propagation on web-spam detection. (d) Comparing serializable and non-serializable (racing) execution of the dynamic ALS algorithm in Sec. 5.1 on the Netflix movie recommendation problem. Non-serializable execution exhibits unstable convergence behavior.

An abstraction that enforces serializable computation eliminates much of the complexity introduced by concurrency, allowing the MLDM expert to focus on the algorithm and model design. Debugging mathematical code in a concurrent program with data corruption caused by data races is difficult and time consuming. Surprisingly, many asynchronous abstractions like [32] do not ensure serializability or, like Piccolo [33], provide only basic mechanisms to recover from data races. GraphLab supports a broad range of consistency settings, allowing a program to choose the level of consistency needed for correctness. In Sec. 4 we describe several techniques we developed to enforce serializability in the distributed setting.

3. DIST. GRAPHLAB ABSTRACTION

The GraphLab abstraction consists of three main parts: the data graph, the update function, and the sync operation. The data graph (Sec. 3.1) represents user modifiable program state; it stores the mutable user-defined data and encodes the sparse computational dependencies. The update function (Sec. 3.2) represents the user computation and operates on the data graph by transforming data in small overlapping contexts called scopes. Finally, the sync operation (Sec. 3.5) concurrently maintains global aggregates. To ground the GraphLab abstraction in a concrete problem, we will use the PageRank algorithm [31] as a running example.

EXAMPLE 1 (PAGERANK). The PageRank algorithm recursively defines the rank of a webpage v:

$$R(v) = \frac{\alpha}{n} + (1 - \alpha) \sum_{u \text{ links to } v} w_{u,v} \times R(u) \qquad (1)$$

in terms of the ranks R(u) of the pages u that link to v, weighted by w_{u,v}, as well as some probability α of randomly jumping to that page. The PageRank algorithm iterates Eq. (1) until the PageRank changes by less than some small value ε.
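As a quick worked instance of Eq. (1), with numbers of our own choosing rather than from the paper: take n = 3 pages, α = 0.15, and a page v with two in-links of weight 1 from pages with current ranks R(u₁) = 0.4 and R(u₂) = 0.2. One application of Eq. (1) gives

$$R(v) = \frac{0.15}{3} + (1 - 0.15)\,(1 \cdot 0.4 + 1 \cdot 0.2) = 0.05 + 0.85 \cdot 0.6 = 0.56.$$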
3.1 Data Graph

The GraphLab abstraction stores the program state as a directed graph called the data graph. The data graph G = (V, E, D) is a container that manages the user-defined data D. Here we use the term data broadly to refer to model parameters, algorithm state, and even statistical data. Users can associate arbitrary data with each vertex {D_v : v ∈ V} and edge {D_{u→v} : {u, v} ∈ E} in the graph. However, as the GraphLab abstraction is not dependent on edge directions, we also use D_{u↔v} to denote the data on both edge directions u → v and v → u. Finally, while the graph data is mutable, the structure is static and cannot be changed during execution.

EXAMPLE 2 (PAGERANK: EX. 1). The data graph is directly obtained from the web graph, where each vertex corresponds to a web page and each edge represents a link. The vertex data D_v stores R(v), the current estimate of the PageRank, and the edge data D_{u→v} stores w_{u,v}, the directed weight of the link.

3.2 Update Functions

Computation is encoded in the GraphLab abstraction in the form of update functions. An update function is a stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of update functions on other vertices. The scope of vertex v (denoted by S_v) is the data stored in v, as well as the data stored in all adjacent vertices and adjacent edges (Fig. 2(a)). A GraphLab update function takes as input a vertex v and its scope S_v and returns the new versions of the data in the scope as well as a set of vertices T:

    Update: f(v, S_v) → (S_v, T)

After executing an update function, the modified data in S_v is written back to the data graph. The vertices u ∈ T are eventually executed by applying the update function f(u, S_u) following the execution semantics described later in Sec. 3.3.

Rather than adopting a message passing or data flow model as in [25, 19], GraphLab allows the user-defined update functions complete freedom to read and modify any of the data on adjacent vertices and edges. This simplifies user code and eliminates the need for the users to reason about the movement of data. By controlling which vertices are returned in T, and thus scheduled for execution, GraphLab update functions can efficiently express adaptive computation. For example, an update function may choose to return (schedule) its neighbors only when it has made a substantial change to its local data.

There is an important difference between Pregel and GraphLab in how dynamic computation is expressed. GraphLab decouples the scheduling of future computation from the movement of data. As a consequence, GraphLab update functions have access to data on adjacent vertices even if the adjacent vertices did not schedule the current update. Conversely, Pregel update functions are initiated by messages and can only access the data in the message, limiting what can be expressed. For instance, dynamic PageRank is difficult to express in Pregel since the PageRank computation for a given page requires the PageRank values of all adjacent pages even if some of the adjacent pages have not recently changed. Therefore, the decision to send data (PageRank values) to neighboring vertices cannot be made by the sending vertex (as required by Pregel) but instead must be made by the receiving vertex. GraphLab naturally expresses the pull model, since adjacent vertices are only responsible for scheduling, and update functions can directly read adjacent vertex values even if they have not changed.
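Before the PageRank example that follows, here is a hypothetical C++ rendering of these pieces. It is our sketch of the shapes involved, not the actual GraphLab API; the type names and the flat pointer-based scope layout are assumptions made for brevity.

```cpp
// Hypothetical rendering of the abstraction's core types; the real
// GraphLab C++ API differs. VertexData/EdgeData stand for the user
// payloads D_v and D_{u->v}.
#include <vector>

struct VertexData { double rank; };    // e.g., R(v) for PageRank
struct EdgeData   { double weight; };  // e.g., w_{u,v}

// The scope S_v: vertex v's own data plus views of the data on all
// adjacent vertices and edges (read/write rights depend on the
// consistency model of Sec. 3.4).
struct Scope {
    int vertex_id;
    VertexData* center;                   // D_v
    std::vector<VertexData*> neighbors;   // D_u for each u adjacent to v
    std::vector<EdgeData*> in_edges;      // D_{u->v}, aligned with neighbors
};

// Update: f(v, S_v) -> (S_v, T). The scope is modified in place; the
// returned set T names vertices to schedule for future execution.
// (Algorithm 1 below gives the concrete PageRank body.)
using VertexSet = std::vector<int>;
VertexSet update(Scope& scope);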
EXAMPLE 3 (PAGERANK: EX. 1). The update function for PageRank (defined in Alg. 1) computes a weighted sum of the current ranks of neighboring vertices and assigns it as the rank of the current vertex. The algorithm is adaptive: neighbors are scheduled for update only if the value of the current vertex changes by more than a predefined threshold.

Algorithm 1: PageRank update function
  Input: Vertex data R(v) from S_v
  Input: Edge data {w_{u,v} : u ∈ N[v]} from S_v
  Input: Neighbor vertex data {R(u) : u ∈ N[v]} from S_v
  R_old(v) ← R(v)                          // Save old PageRank
  R(v) ← α/n
  foreach u ∈ N[v] do                      // Loop over neighbors
    R(v) ← R(v) + (1 − α) × w_{u,v} × R(u)
  if |R(v) − R_old(v)| > ε then            // If the PageRank changed sufficiently
    return {u : u ∈ N[v]}                  // Schedule neighbors to be updated
  Output: Modified scope S_v with new R(v)

3.3 The GraphLab Execution Model

The GraphLab execution model, presented in Alg. 2, follows simple single-loop semantics. The input to the GraphLab abstraction consists of the data graph G = (V, E, D), an update function, and an initial set of vertices T to be executed. While there are vertices remaining in T, the algorithm removes (Line 1) and executes (Line 2) vertices, adding any new vertices back into T (Line 3). Duplicate vertices are ignored. The resulting data graph and global values are returned to the user on completion.

Algorithm 2: GraphLab Execution Model
  Input: Data graph G = (V, E, D)
  Input: Initial vertex set T = {v_1, v_2, ...}
  while T is not Empty do
1   v ← RemoveNext(T)
2   (T′, S_v) ← f(v, S_v)
3   T ← T ∪ T′
  Output: Modified data graph G = (V, E, D′)

To enable a more efficient distributed execution, we relax the execution ordering requirements of the shared-memory GraphLab abstraction and allow the GraphLab run-time to determine the best order to execute vertices. For example, RemoveNext(T) (Line 1) may choose to return vertices in an order that minimizes network communication or latency (see Sec. 4.2.2). The only requirement imposed by the GraphLab abstraction is that all vertices in T are eventually executed. Because many MLDM applications benefit from prioritization, the GraphLab abstraction allows users to assign priorities to the vertices in T. The GraphLab run-time may use these priorities in conjunction with system level objectives to optimize the order in which the vertices are executed.
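The following single-threaded C++ sketch renders Alg. 2 directly; it is our illustration, not the GraphLab run-time. RemoveNext is realized with a max-priority queue, one of the orderings the run-time is free to choose (a FIFO via std::queue would satisfy the abstraction equally well), and the update(v) signature is abbreviated and hypothetical.

```cpp
// Sequential sketch of the GraphLab execution model (Alg. 2).
#include <queue>
#include <unordered_set>
#include <vector>

struct Task { int vertex; double priority; };
struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority < b.priority;  // highest priority first
    }
};

std::vector<int> update(int v);  // user-supplied f(v, S_v) -> T' (stub)

void run(const std::vector<int>& initial) {
    std::priority_queue<Task, std::vector<Task>, ByPriority> T;
    std::unordered_set<int> pending;  // duplicate vertices are ignored
    for (int v : initial)
        if (pending.insert(v).second) T.push({v, 1.0});
    while (!T.empty()) {                     // while T is not Empty
        int v = T.top().vertex; T.pop();     // Line 1: v <- RemoveNext(T)
        pending.erase(v);
        for (int u : update(v)) {            // Line 2: (T', S_v) <- f(v, S_v)
            if (pending.insert(u).second) {  // Line 3: T <- T ∪ T'
                T.push({u, 1.0});
            }
        }
    }
}
```

The pending set implements the "duplicate vertices are ignored" rule; swapping the priority queue for a FIFO, or for the network-aware orderings of Sec. 4.2.2, changes only RemoveNext while preserving the eventual-execution guarantee.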
3.4 Ensuring Serializability

The GraphLab abstraction presents a rich sequential model which is automatically translated into a parallel execution by allowing multiple processors to execute the same loop on the same graph, removing and executing different vertices simultaneously. To retain the sequential execution semantics we must ensure that overlapping computation is not run simultaneously. We introduce several consistency models that allow the runtime to optimize the parallel execution while maintaining serializability.

The GraphLab runtime ensures a serializable execution: there exists a corresponding serial schedule of update functions that, when executed by Alg. 2, produces the same values in the data graph. By ensuring serializability, GraphLab simplifies reasoning about highly asynchronous dynamic computation in the distributed setting.

A simple method to achieve serializability is to ensure that the scopes of concurrently executing update functions do not overlap. In [24] we call this the full consistency model (see Fig. 2(b)). However, full consistency limits the potential parallelism since concurrently executing update functions must be at least two vertices apart (see Fig. 2(c)). For many machine learning algorithms, though, the update functions do not need full read/write access to all of the data within the scope. For instance, the PageRank update in Eq. (1) only requires read access to edges and neighboring vertices. To provide greater parallelism while retaining serializability, GraphLab defines the edge consistency model. The edge consistency model ensures each update function has exclusive read-write access to its vertex and adjacent edges but read-only access to adjacent vertices (see Fig. 2(b)). As a consequence, the edge consistency model increases parallelism by allowing update functions with slightly overlapping scopes to safely run in parallel (see Fig. 2(c)). Finally, the vertex consistency model allows all update functions to be run in parallel, providing maximum parallelism.

3.5 Sync Operation and Global Values

In many MLDM algorithms it is necessary to maintain global statistics describing data stored in the data graph. For example, many statistical inference algorithms require tracking global convergence estimators. To address this need, the GraphLab abstraction defines global values that may be read by update functions, but are written using sync operations. Similar to aggregates in Pregel, the sync operation is an associative, commutative sum:

$$Z = \text{Finalize}\left(\bigoplus_{v \in V} \text{Map}(S_v)\right) \qquad (2)$$
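A small C++ sketch of Eq. (2) as a sequential fold follows; it is our illustration under assumptions, not GraphLab's implementation (which runs the sync concurrently with update functions). The per-vertex state and the concrete aggregate, a mean PageRank residual, are hypothetical choices; any Map with an associative, commutative ⊕ fits the same shape.

```cpp
// Sketch of Eq. (2): Map per vertex, an associative/commutative
// combine (⊕), then Finalize over the folded value.
#include <cmath>
#include <cstdio>
#include <vector>

struct VertexState { double rank; double last_rank; };

double map_fn(const VertexState& s) {        // Map(S_v)
    return std::fabs(s.rank - s.last_rank);  // per-vertex residual
}
double combine(double a, double b) { return a + b; }        // ⊕: assoc. + comm.
double finalize(double z, std::size_t n) { return z / n; }  // Finalize: mean

double sync(const std::vector<VertexState>& vertices) {
    double z = 0.0;  // identity element of ⊕
    for (const auto& s : vertices) z = combine(z, map_fn(s));
    return finalize(z, vertices.size());
}

int main() {
    std::vector<VertexState> v = {{0.50, 0.40}, {0.30, 0.32}, {0.20, 0.28}};
    std::printf("global residual = %.3f\n", sync(v));  // (0.10+0.02+0.08)/3
    return 0;
}
```

Because ⊕ is associative and commutative, the fold can be partitioned across machines and combined in any order, which is what makes this aggregate cheap to maintain in the distributed setting.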