Multi-threaded software and distributed cloud software are prevalent as a dominant backbone for modern applications. Although it is extremely important, their reliability is severely threatened by software bugs. Among all types of software bugs, timing bugs are among the most troublesome due to their inherent non-deterministic nature and the huge interleaving space. Timing bugs are caused by unexpected timing among local events in multi-threaded systems (local concurrent bugs or LCbugs) or distributed events, such as message or faults, in distributed systems (distributed concurrency bugs or DCbugs). A timing bug model is critical to guide the design of automated tackling tools, which includes three parts: concurrent source, synchronization mechanisms, and sharing resources. Existing timing bug models mainly focus on the thread interleaving concurrent source, the lock-related synchronization mechanism, and shared global memory resources for LCbugs. To fight timing bugs and improve the concurrent software reliability in multi-threaded systems and distributed systems, this dissertation works on these three parts and makes the following contributions: First, this dissertation conducts an empirical study of timing bugs in multi-threaded systems and distributed cloud service systems to understand how common are timing bugs, what's the resource being competed, and how were they resolved or fixed. Our empirical study includes two parts. (1) we conduct a comprehensive characteristic study on real-world incidents in Microsoft Azure production-run cloud services. The study reveals several main findings: (a) about 15% software bug incidents in our study set are caused by timing bugs; (b) 60% timing bugs in our study set are DCbugs (message timing bugs or fault timing bugs); (c) half of the timing bugs in our study set are racing on persistent data instead of shared global memory variables; (d) mitigation strategy, especially running-environment mitigation, is widely used to resolve timing bug incidents in the cloud. (2) we conduct an empirical study of manual patches for real-world LCbugs in multi-threaded systems to understand the gap between automatically generated patches and manually generated patches. The study finds that (a) lock is the dominant synchronization primitive for enforcing atomicity; lock-related signals/waits are not the dominant primitive for enforcing pairwise ordering in patches. (b) leveraging existing synchronization in software is as common as adding extra primitives. These findings provide many motivation and guidelines for the design of timing bug tackling tools in this field. Second, guided by the empirical study, this thesis proposes new models and detection tools for message timing bugs and fault timing bugs in distributed systems. Our new model captures two new concurrent sources (message interleavings and random faults), the new synchronization mechanisms introduced by them, and new sharing resources, persistent data. Guided by the proposed model, detection tools are designed to predict message timing bugs and fault timing bugs from correct runs. Each step of our detection tool is carefully customized to address the unique challenges for DCbugs in distributed systems. The evaluation result shows that our tool can effectively and efficiently detect message timing bugs and fault timing bugs with low false positive rates. Third, motivated by the findings of \lcbug fixing strategies, we design a fixing tool to model and enforce timing relationship by leveraging existing non-lock synchronization primitives. Evaluation using real-world bugs shows that our tool can automatically generate patches that have matching quality with manual patches and are much simpler than those generated by the previous state of the art techniques.




Downloads Statistics

Download Full History