ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table

Tonglin Li 1, Xiaobing Zhou 1, Kevin Brandstatter 1, Dongfang Zhao 1, Ke Wang 1, Anupam Rajendran 1, Zhao Zhang 2, Ioan Raicu 1,3
tli13@hawk.iit.edu, xzhou40@hawk.iit.edu, kbrandst@iit.edu, dzhao8@hawk.iit.edu, kwang22@hawk.iit.edu, arajend5@hawk.iit.edu, zhaozhang@uchicago.edu, iraicu@cs.iit.edu
1 Department of Computer Science, Illinois Institute of Technology, Chicago IL, USA
2 Department of Computer Science, University of Chicago, Chicago IL, USA
3 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL, USA

Abstract — This paper presents ZHT, a zero-hop distributed hash table, which has been tuned for the requirements of high-end computing systems. ZHT aims to be a building block for future distributed systems, such as parallel and distributed file systems, distributed job management systems, and parallel programming systems. The goals of ZHT are delivering high availability, good fault tolerance, high throughput, and low latencies, at extreme scales of millions of nodes. ZHT has some important properties, such as being light-weight, dynamically allowing nodes to join and leave, being fault tolerant through replication, being persistent and scalable, and supporting unconventional operations such as append (providing lock-free concurrent key/value modifications) in addition to insert/lookup/remove. We have evaluated ZHT's performance on a variety of systems, ranging from a Linux cluster with 512 cores to an IBM Blue Gene/P supercomputer with 160K cores. Using micro-benchmarks, we scaled ZHT up to 32K cores with latencies of only 1.1ms and 18M operations/sec throughput. This work presents three real systems that have integrated ZHT, and evaluates them at modest scales. 1) ZHT was used in the FusionFS distributed file system to deliver distributed metadata management at over 60K operations (e.g. file create) per second at 2K-core scales.
2) ZHT was used in IStore, an information dispersal algorithm enabled distributed object storage system, to manage chunk locations, delivering more than 500 chunks/sec at 32-node scales. 3) ZHT was also used as a building block for MATRIX, a distributed job scheduling system, delivering 5000 jobs/sec throughput at 2K-core scales. We compared ZHT against other distributed hash tables and key/value stores and found that it offers superior performance for the features and portability it supports.

Keywords — distributed hash tables, key/value stores, high-end computing

I. INTRODUCTION

Exascale computers (e.g. capable of 10^18 ops/sec) [1], with a processing capability similar to that of the human brain, will enable the unraveling of significant scientific mysteries and present new challenges and opportunities. Major scientific opportunities arise in many fields (such as weather modeling, understanding global warming, national security, drug discovery, and economics) and may rely on revolutionary advances that will enable exascale computing.

"A supercomputer is a device for turning compute-bound problems into I/O-bound problems." -- Ken Batcher

This quote [46] from Ken Batcher reveals the essence of modern high performance computing and implies an ever-growing shift in bottlenecks from compute to I/O. For exascale computers, the challenges are even more radical, as the only viable approaches to achieving exascale computing in the next decade all involve extremely high parallelism and concurrency. As of 2012, some of the biggest systems already have more than 700,000 general purpose cores. Many experts predict [1] that exascale computing will be a reality by 2019; an exascale system is expected to have millions of nodes, billions of threads of execution, hundreds of petabytes of memory, and exabytes of persistent storage.
In the current decades-old architecture of HPC systems, storage is completely segregated from the compute resources and connected via a network interconnect (e.g. parallel file systems running on network attached storage, such as GPFS [2], PVFS [3], and Lustre [4]). This approach cannot scale several orders of magnitude in terms of concurrency and throughput, and will thus prevent the move from petascale to exascale. If we do not solve the storage problem with new storage architectures, it could be a "show-stopper" in building exascale systems. The need for building efficient and scalable distributed storage for high-end computing (HEC) systems that will scale three to four orders of magnitude is on the horizon.

One of the major bottlenecks in current state-of-the-art storage systems is metadata management. Metadata operations on parallel file systems can be inefficient at large scale. Experiments on the Blue Gene/P system at 16K-core scales show the various costs (wall-clock time measured at a remote processor) for file/directory create on GPFS (see [2]).

Figure 1: Time per operation (touch) on GPFS at various numbers of processors on an IBM Blue Gene/P (time per operation in ms vs. scale in # of cores; series: File Create (GPFS Many Dir), File Create (GPFS One Dir))

Ideal performance would have been constant, but we can see the cost of these basic metadata operations (e.g. create file) growing from tens of milliseconds on a single node (four cores), to tens of seconds at 16K-core scales; at the full machine scale of 160K cores, we expect a file create to take over two minutes for the many-directory case, and over 10 minutes for the single-directory case.

[2013 IEEE 27th International Symposium on Parallel & Distributed Processing, 1530-2075/13 $26.00 © 2013 IEEE, DOI 10.1109/IPDPS.2013.110]
Previous work [5, 6] shows these times to be even worse, putting full-system-scale metadata operations in the hour range, though the testbed as well as GPFS may have improved over the last several years. Whether the time per metadata operation is minutes or hours on a large-scale HEC system, the conclusion is that the distributed metadata management in GPFS does not have a sufficient degree of distribution, and not enough emphasis was placed on avoiding lock contention. GPFS's metadata performance degrades rapidly under concurrent operations, reaching saturation at only 4- to 32-core scales (on a 160K-core machine). Other distributed file systems (e.g. Google's GFS [7] and Yahoo's HDFS [8]) that have centralized metadata management make the problems observed with GPFS even worse from the scalability perspective. Future storage systems for high-end computing should support distributed metadata management, leveraging distributed data structures tailored for this environment. These distributed data structures share some characteristics with structured distributed hash tables [9], having resilience in the face of failures with high availability; however, they should support close-to-constant-time inserts/lookups/removes, delivering the low latencies typically found in centralized metadata management (under light load). Metadata should be reliable and highly available, for which replication (a widely used mechanism) could be used. This work presents a zero-hop distributed hash table (ZHT), which has been tuned for the specific requirements of high-end computing (e.g. trustworthy/reliable hardware, fast networks, non-existent "churn", low latencies, and scientific computing data-access patterns). ZHT aims to be a building block for future distributed systems, with the goal of delivering excellent availability, fault tolerance, high throughput, scalability, persistence, and low latencies.
ZHT has several important features making it a better candidate than other distributed hash tables and key/value stores: it is light-weight; it dynamically allows nodes to join and leave; it is fault tolerant through replication, handling failures gracefully and efficiently propagating events throughout the system; it has a customizable consistent hashing function; it supports persistence for better recoverability in case of faults; it is scalable; and it supports unconventional operations such as append (providing lock-free concurrent key/value modifications) in addition to insert/lookup/remove. We have evaluated ZHT's performance on a variety of systems, ranging from a Linux cluster with 512 cores to an IBM Blue Gene/P supercomputer with 160K cores. Using micro-benchmarks, we scaled ZHT up to 32K cores with latencies of only 1.1ms and 18M operations/sec throughput. We compared ZHT against two other systems, Cassandra [38] and Memcached [20], and found it to offer superior performance for the features and portability it supports, at large scales up to 16K nodes. This work presents three real systems that have integrated ZHT, and evaluates them at modest scales. 1) ZHT was used in the FusionFS distributed file system to deliver distributed metadata management at over 60K operations (e.g. file create) per second at 2K-core scales. 2) ZHT was used in IStore [50, 65], an information dispersal algorithm enabled distributed object storage system, to manage chunk locations, delivering more than 500 chunks/sec at 32-node scales. 3) ZHT was also used as a building block for MATRIX, a distributed job scheduling system, delivering 5000 jobs/sec throughput at 2K-core scales. The contributions of this paper are as follows:

• Design and implementation of ZHT, a light-weight, high performance, fault tolerant, persistent, dynamic, and highly scalable distributed hash table, optimized for high-end computing.
• Support for unconventional operations, such as append, allowing data to be incrementally added to an existing value, delivering lock-free concurrent modification of key/value pairs.
• Micro-benchmarks up to 32K-core scales, achieving latencies of 1.1ms and throughput of 18M ops/sec.
• Integration and evaluation with three real systems (FusionFS, IStore, and MATRIX), managing distributed storage metadata and distributed job scheduling information.

II. RELATED WORK

There have been many distributed hash table (DHT) algorithms and implementations proposed over the years. We discuss DHTs in this section due to their important role in building support for scalable metadata services across extreme-scale systems. Some of the DHTs from the literature are Kademlia [15], CAN [16], Chord [17], Pastry [18], Tapestry [19], Memcached, Dynamo [21], Cycloid [22], Ketama [23], RIAK [24], Maidsafe-dht [25], Cassandra, and C-MPI [26]. Most of these DHTs scale logarithmically with system scale, but some (e.g. Cycloid) go as far as reducing the number of operations to O(c), where c is a constant related to the maximum size of the network (instead of the actual size of the network), which in practice still results in c ~ log(N) [22]. There has been some uptake recently in using traditional DHTs in HEC, namely the C-MPI [26] project, in which the Kademlia DHT has been implemented and shown to run well on 1K nodes of a Blue Gene/P supercomputer. C-MPI is used to perform data management operations for the Swift project [27, 57], but it is rather simplistic (e.g. no support for data replication, data persistence, or fault tolerance via stateless protocols). C-MPI adopted the Message Passing Interface (MPI) for communication, making it brittle at large scale and prone to system-wide failures due to single node failures. Although MPI is attractive from a performance perspective on these HEC systems, it makes it hard to implement a fault tolerant system.
Furthermore, C-MPI is based on a new implementation of the Kademlia (with log(N) routing time) distributed hash table. Another recent project using DHTs on an HEC system is DataSpaces [28], which deploys a DHT on a Cray XT5 to coordinate in-memory data management for simulation workflows. DataSpaces has similar drawbacks to C-MPI. In future work, we will consider supporting MPI, in addition to protocols such as TCP and UDP, as MPI 3.0 [29] promises to address many of the current MPI fault tolerance limitations.

Dynamo [21] is a key-value storage system that some of Amazon's core services use to provide an "always-on" experience. Dynamo describes itself as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly. Dynamo is targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or concurrent writes. A significant drawback of Dynamo is the fact that it is an internal Amazon project, which cannot be used outside of the Amazon infrastructure. Cassandra, an implementation inspired by Amazon's Dynamo, strives to be an always-writable system, in that the system is designed to always accept writes even in light of node failures. It accomplishes this by deferring consistency until the time when data is read and resolving conflicts at that time; this means that Cassandra needs to offer different levels of consistency on reads. Cassandra's drawbacks include poor support on many supercomputers due to the lack of a Java stack. Cassandra also uses a logarithmic routing strategy, which makes it less scalable. Memcached is an in-memory implementation of a key/value store. It was designed as a cache to accelerate distributed application execution. It is rather simplistic, in that there is no data persistence, no data replication, and no dynamic membership. There are strict limitations on the size of keys and values (250B and 1MB respectively).
All these limit the use of Memcached as a building block for large-scale distributed systems, but it offers a good baseline for comparison. In Section IV we compare the performance of ZHT, Cassandra, and Memcached. A brief overview of the differences between Cassandra, Memcached, C-MPI, Dynamo, and ZHT can be found in Table 1.

Table 1: Comparison between ZHT and other DHT implementations

Name            Impl.   Routing Time  Persistence  Dynamic membership  Append
Cassandra [38]  Java    log(N)        Yes          Yes                 No
Memcached [20]  C       2             No           No                  No
C-MPI [26]      C/MPI   log(N)        No           No                  No
Dynamo [21]     Java    0 to log(N)   Yes          Yes                 No
ZHT [14]        C++     0 to 2        Yes          Yes                 Yes

III. ZHT DESIGN AND IMPLEMENTATION

Most HEC environments are batch oriented, which implies that a system configured at run time generally has information about the compute and storage resources that will be available. This means that the amount of resources (e.g. number of nodes) would not increase or decrease dynamically, and the only reasons to decrease the allocation are either to handle failed nodes or to terminate the allocation. By making dynamic membership optional, the complexity of the system can be reduced and a low average number of hops per operation can be achieved. We do believe that dynamic membership is important for some environments, especially for cloud computing systems, and hence have made efforts to support it without affecting basic operations' time complexity. Because nodes in HEC are generally reliable and have predictable uptime (nodes start on allocation, and shut down on de-allocation), node churn in HEC is virtually non-existent. This in principle guided our design of the proposed dynamic membership support in ZHT. It is also important to point out that nodes in an HEC system are generally trustworthy, and that stringent requirements to encrypt communication and/or data would simply add overhead.
HEC systems are generally locked down from the outside world, behind login nodes and firewalls, and although authentication and authorization are still needed, full communication encryption is wasteful for a large class of scientific computing applications that run on many HEC systems. Most storage systems used in HEC communicate between the client nodes and storage servers without any encryption.

A. Overview

The primary goal of ZHT is to get all the benefits of DHTs, namely excellent availability and fault tolerance, while concurrently achieving the minimal latencies normally associated with idle centralized indexes. The data structure is kept as simple as possible for ease of analysis and efficient implementation. The application programming interface (API) of ZHT is kept simple and follows similar interfaces for hash tables. The four operations ZHT supports are 1. int insert(key, value); 2. value lookup(key); 3. int remove(key); and 4. int append(key, value). Keys are typically a variable-length ASCII text string. Values can be complex objects, with varying size, number of elements, and types of elements. Integer return values return 0 for a successful operation, or a non-zero return code that includes information about the error that occurred. In static membership, every node at bootstrap time has all information about how to contact every other node in ZHT. In a dynamic environment, nodes may join (for system performance enhancement) and leave (node failure or scheduled maintenance) at any time, although in HEC systems this "churn" occurs much less frequently than in traditional DHTs.

Figure 2: ZHT architecture design showing namespace, hash function, and replication

The ID space and membership table are shown in Figure 2 as a ring-shaped key name space. The node ids in ZHT can be randomly distributed throughout the network, or they can be closely correlated with the network distance between nodes.
The correlation can generally be computed from information such as MPI rank or IP address. The random distribution of the ID space has worked well up to 32K cores, but we will explore a network-aware topology in future work. The hash function maps an arbitrarily long string to an index value, which can then be used to efficiently retrieve the communication address (e.g. host name, IP address, MPI rank) from a membership table (a local in-memory vector). Depending on the level of information that is stored (e.g. IP: 4 bytes; name: under 100 bytes; socket: depends on buffer size), storing the entire membership table should consume only a small (less than 1%) portion of the available memory of each node. At 1K-node scale, one ZHT instance has a memory footprint of only 10MB (out of 2GB of available memory), achieving our desired sub-1% memory footprint. The memory footprint consists of the ZHT server binary in memory, the entries in the hash table, the membership table, and the ZHT server-side socket connection buffers. Among these, only the membership table and socket buffers grow with the number of nodes, and entries in the hash table are eventually flushed to disk. The membership table is very small: at 32 bytes per entry (one per node), 1 million nodes need only 32MB of memory. By tuning the number of key/value pairs that are allowed to stay in memory, users can strike a balance between performance and memory consumption.

B. Terminologies

Physical node: A physical node is an independent physical machine. Each physical node may host several ZHT instances, which are differentiated by IP address and port. By adjusting the number of instances, ZHT can fit heterogeneous systems with various computing power.

Instance: A ZHT instance is a process which handles requests from clients. Each instance takes care of some partitions. By adjusting the number of instances, ZHT can fit heterogeneous systems with various storage capacities and computing power.
A ZHT instance can be identified by a combination of IP address and port, and each ZHT instance maintains many partitions. We only need to store addresses for ZHT instances, not for partitions; therefore the number of partitions can be much larger than the number of addresses.

Partition: A partition is a contiguous range of the key address space.

Manager: A manager is a service running on each physical node that takes charge of starting and shutting down ZHT instances. The manager is also responsible for managing the membership table, starting/stopping instances, and partition migration.

As in traditional consistent hashing, initially we assign each of the k physical nodes a manager and one or more ZHT instances, each with a universally unique id (UUID) in the ring-shaped space. The entire name space N (a 64-bit integer) is evenly distributed into n partitions, where n is a fixed large number indicating the maximal number of nodes that can be used in the system. It is worth noting that while n (the number of partitions, also the maximal number of physical nodes) cannot be changed without potentially rehashing all the key/value pairs stored in ZHT, i (the number of ZHT instances) as well as k (the number of physical nodes) can be changed with changes only to the membership table. Each physical node has one manager, holds n/k partitions (each partition storing N/n key-value pairs), and runs i/k ZHT instances serving requests. Each partition (which can be persisted to disk) can be moved across different physical nodes when nodes join, leave, or fail.

Figure 3: ZHT architecture per node

For example, in an initial system of 1000 ZHT instances (potentially running on 1000 nodes), where each instance contains 1000 partitions, the overall system could scale up to 1 million instances on 1 million physical nodes. Experiments validate this approach, showing that there is little impact (0.73ms vs.
0.77ms per request) on the performance of partitions as we increase the number of partitions per instance (see Figure 4). This design allows us to avoid a potentially expensive rehash of many key/value pairs when the need arises to migrate partitions.

Figure 4: Concurrent performance from 1 to 1K partitions per ZHT instance (average latency in ms vs. number of partitions per instance)

C. Membership management

ZHT supports both static and dynamic node membership. In the static case, the bootstrapping phase gets information from the batch job scheduler about the allocated nodes (or perhaps the information could be extracted from the nodes at job start time). Once the membership is established, no new nodes are allowed to join the system. Nodes may leave the system due to failures; we assume failed nodes do not recover. Under dynamic membership, nodes are allowed to dynamically join and leave the system. Most DHTs support dynamic membership, but typically deliver it through logarithmic routing. DHTs use consistent hashing, which sacrifices performance in order to achieve scalability under potentially extremely dynamic conditions. We address this issue with a zero-hop consistent hashing mechanism. With this novel design, we offer the desired flexibility of dynamic membership while maintaining high performance with constant-time routing.

Node joins: On a node join operation, the joining node checks out a copy of the membership table from the ZHT manager on a