Abstract
Distributed systems are now the backbone of the computing society and are expected to have high availability. Unfortunately, distributed timing bugs, a type of bug triggered by
the non-deterministic timing of messages and node crashes, are widespread. They lead to many
production-run failures, and are difficult to reason about and patch. Although recently
proposed techniques can automatically detect these bugs, how to automatically and correctly
fix them remains an open problem. I designed DFix, a tool that automatically
processes distributed timing bug reports, statically analyzes the buggy system, and produces
patches. Our evaluation shows that DFix is effective in fixing real-world distributed timing
bugs.
Concurrency bugs are hard to find, reproduce, and debug. They often escape rigorous in-house testing, but result in large-scale outages in production. Existing concurrency bug
detection techniques unfortunately cannot be part of industry's integrated build and test
environment due to some open challenges: how to handle code developed by thousands of
engineering teams that uses a wide variety of synchronization mechanisms, how to report
few or no false positives, and how to avoid excessive testing resource consumption. TSVD is a
thread-safety violation detector that addresses these challenges through a new design point
in the domain of active testing. Unlike previous techniques that inject delays randomly or
employ expensive synchronization analysis, TSVD uses lightweight monitoring of the calling
behaviors of thread-unsafe methods, not any synchronization operations, to dynamically
identify bug suspects. It then injects corresponding delays to drive the program towards
thread-unsafe behaviors, actively learns from its ability or inability to do so, and persists its
learning from one test run to the next. TSVD is deployed and regularly used at Microsoft,
and it has already found over 1000 thread-safety violations in thousands of projects. It
detects more bugs than state-of-the-art techniques, mostly with just one test run.
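The monitoring and delay-injection loop described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not TSVD's implementation: the class `ToyDetector`, the `on_call` hook, the call-site labels, and the window and delay values are all hypothetical. It shows only the general idea of observing calls to thread-unsafe methods, marking near misses as suspects, and delaying suspect calls so that a conflicting call can be caught while the first one is paused.

```python
import threading
import time


class ToyDetector:
    """Toy monitor in the spirit of the text; every name here is hypothetical."""

    def __init__(self, near_miss_window=1.0, delay=0.5):
        self.near_miss_window = near_miss_window  # seconds (assumed value)
        self.delay = delay                        # seconds (assumed value)
        self.lock = threading.Lock()
        self.last_call = {}      # object id -> (thread, call site, timestamp)
        self.traps = {}          # object id -> (thread, call site) being delayed
        self.suspects = set()    # call sites worth delaying; kept across runs
        self.violations = set()  # frozensets of two conflicting call sites

    def on_call(self, obj, location):
        """Invoke just before a thread-unsafe method of `obj` is called."""
        me = threading.current_thread()
        now = time.monotonic()
        key = id(obj)
        with self.lock:
            trap = self.traps.get(key)
            if trap is not None and trap[0] is not me:
                # Another thread is paused inside a call on the same object.
                self.violations.add(frozenset((trap[1], location)))
            prev = self.last_call.get(key)
            if (prev is not None and prev[0] is not me
                    and now - prev[2] < self.near_miss_window):
                # Near miss: two threads touched the object close in time.
                self.suspects.update((prev[1], location))
            self.last_call[key] = (me, location, now)
            inject = location in self.suspects and trap is None
            if inject:
                self.traps[key] = (me, location)
        if inject:
            time.sleep(self.delay)  # widen the window for a conflicting call
            with self.lock:
                del self.traps[key]


def call_from_new_thread(detector, obj, location):
    """Run one monitored call on a fresh thread and return that thread."""
    t = threading.Thread(target=detector.on_call, args=(obj, location))
    t.start()
    return t


def run_demo():
    detector = ToyDetector()
    shared = {}  # stands in for a thread-unsafe collection

    # Run 1: the two call sites execute close together but never overlap,
    # so they are recorded as suspects without any violation being seen.
    call_from_new_thread(detector, shared, "writer.py:10").join()
    call_from_new_thread(detector, shared, "reader.py:20").join()

    # Run 2: the writer is now a suspect, so it is paused inside on_call.
    writer = call_from_new_thread(detector, shared, "writer.py:10")
    while not detector.traps:  # wait until the injected delay has started
        time.sleep(0.001)
    call_from_new_thread(detector, shared, "reader.py:20").join()
    writer.join()
    return detector.suspects, detector.violations
```

Calling `run_demo()` reuses one detector object for both runs, which stands in for persisting what was learned from one test run to the next. The sketch omits the part where the detector learns from delays that fail to expose a violation; a fuller version might lower the chance of delaying such call sites again.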
Synchronizations are fundamental to the correctness and performance of concurrent software. Unfortunately, correctly identifying all synchronizations has become extremely difficult in modern software systems because of the variety of synchronization types. Previous work either infers only specific types of synchronization through code analysis or relies on manual
effort to annotate synchronizations. SherLock is a tool that uses unsupervised inference to
identify synchronizations. SherLock leverages the fact that most synchronizations appear
around the conflicting operations and formulates the inference problem as a linear system with a set of synchronization
properties and hypotheses. To collect enough observations, SherLock runs the unit tests
a small number of times with feedback-based delay injection. I applied SherLock to 8
C# open-source applications. Without any prior knowledge, SherLock inferred 122 unique
synchronizations, with few false positives. These inferred synchronizations cover a wide
variety of types, including lock operations, fork-join operations, asynchronous operations,
framework synchronization, and custom synchronization.