Today, distributed software infrastructures have become a dominant backbone for cloud computing and modern applications. Large-scale distributed systems such as scalable frameworks, storage systems, synchronization, and cluster management services have emerged as the data center operating system. Unfortunately, the reliability of distributed systems is threatened by non-deterministic concurrency bugs as it executes many complicated distributed protocols on thousands of machines with no common clocks and it must face a variety of random hardware failures. Facing the challenge of concurrency bugs, this dissertation proposes efficient approaches to empower software model checking to quickly unearth concurrency bugs based on the study of the comprehensive characteristics of real-world concurrency bugs in distributed systems. This dissertation makes three main contributions. First, we present TaxDC, the largest and most comprehensive taxonomy of distributed concurrency (DC) bugs. We study the characteristics of 104 real-world distributed concurrency (DC) bugs along several axes of analysis, such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies from four widely-deployed cloud-scale distributed systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. The study of DC bugs characteristics provides many motivation and guidelines for DC bugs detection, testing, and tools design. Second, we present FLYMC, a fast, scalable, and systematic software model checker for testing distributed systems implementations. To overcome the path/state-space explosion problem in testing complex interleavings of messages and faults, FLYMC introduces three powerful algorithms: state symmetry, event independence, and parallel flips. As a result, collectively, these algorithms make our approach on average 16×(up to 78×) faster than other state-of-the-art solutions. FLYMC is integrated with eight distributed systems, successfully reproduced twelve known DC bugs, and found ten new DC bugs which all have been confirmed by developers. Third, we present HMC, a heuristic software model checker to unearth DC bugs faster. By exploiting the properties of distributed systems and software model checking, HMC introduces four novel algorithms: blocking state-event, last state-event, miss-prediction step, and prioritized node crash, to prioritize more-likely test scenarios that will reach new corner cases. With HMC, we reproduced five DC bugs in Cassandra on average 6× (up to 15×) faster than FLYMC which we consider as the state-of-the-art of systematic software model checker.




Downloads Statistics

Download Full History