Filename Size Access Description License


In the era of cloud computing, users move their data and computation from local machines to cloud, thus the services are expected to be 24/7 dependable. Cloud services must be accessible anytime and anywhere, not lose or corrupt users data, and scale as user base continues to grow. Unfortunately, guaranteeing cloud services’ dependability is challenging because these cloud services are backed by large sophisticated distributed systems such as scalable data stores, data-parallel frame- works, and cluster management systems. Such cloud-scale distributed systems remain difficult to get right because they need to address data races among nodes, complex failures in commodity hardware, tremendous user requests, and much more. Addressing these cloud-specific challenges makes the systems more complex and new intricate bugs continue to create dependability problems.,This dissertation tries to answer a vital question of cloud dependability: “how can we make cloud-scale distributed systems more dependable?” We try to answer this question by focusing on the problems of distributed concurrency bugs and scalability bugs. We focus on these two problems because they are novel issues that occur in cloud-scale environment only and not many works addressing them.,Distributed concurrency bug (DC bug) is one unsolved reliability problem in cloud systems. DC bugs are caused by non-deterministic order of distributed events such as message arrivals, machine crashes, and reboots. Cloud systems execute multiple complicated distributed protocols concurrently. The possible interleavings of the distributed events are beyond developer’s anticipations and some interleavings might not be handled properly that can lead to catastrophic failures. To combat DC bugs, we make two contributions. First, we conduct a formal study on DC bugs to gain foundation knowledge for DC-bug combating research. We study 104 DC bugs from various widely-deployed cloud-scale distributed systems in many characteristics along several axes of analysis such as the triggering timing condition, input preconditions, error and failure symptoms, and fix strategies. We present the first complete taxonomy of DC bugs, TaxDC, along with many findings on DC bugs that can guide future research.,Second, we advance state of the art of distributed system model checking by introducing “semantic-aware model checking” (SAMC). Distributed system model checkers (dmck) are used to test system reliability of real systems. Existing dmcks however rarely exercise multiple faults due to the state-space explosion problem, and thus do not address present reliability challenges of cloud systems in dealing with complex faults. SAMC pushes the boundary of dmcks by introducing a white-box principle that takes simple semantic information of the target system and incorporates that knowledge into state-space reduction policies. We show that SAMC can find deep bugs one to two orders of magnitude faster compared to state-of-the-art techniques.,And for the second aspect of system dependability, we focus on scalability bugs. Scale surpasses the limit of a single machine in meeting users’ increasing demands for computing and storage. On the negative side, scale creates new development and deployment issues. Developers must ensure that their algorithms and protocol designs to be scalable. However, until real deployment takes place, unexpected bugs in the actual implementations are unforeseen. This new era of cloud- scale distributed systems has given birth to “scalability bugs”, latent bugs that are scale-dependent, and only surface in large scale.,To address scalability bugs, we conduct a study on scalability bugs to understand how they manifest and what their root causes are, and introduce SCK, a methodology that enables developers to scale-check distributed systems and find scalability bugs economically on one machine. SCK helps developers identify potential buggy code and allows developers to colocate a large number of nodes to test the potential buggy code without sacrificing accuracy. We remove a problem of hardware contentions (i.e., CPU, memory, and thread) with four novel strategies, and we successfully integrate SCK to Cassandra, Riak, and Voldemort. With SCK, we achieve a high colocation factor (500 nodes), and can reproduce six scalability bugs and identify two new hidden bugs.


Additional Details


Download Full History