Abstract
Modern software necessarily implements task-level mechanisms designed to anticipate and handle transient errors. Such mechanisms include retry, cancellation, checkpointing, and timeout, and they are critical to the smooth operation of almost every type of software application, especially the distributed and large-scale applications in use everywhere today. At the same time, these mechanisms are not trivial to implement correctly and are prone to defects for a variety of reasons: they require nuanced handling of partial execution states, are contingent on difficult-to-determine timing and error-handling policies, use non-standard implementations that are not well supported by existing libraries or frameworks, and are frequently disabled or excluded from application testing. Broken implementations are common and often result in severe software issues. This dissertation aims to analyze and detect problems in two widely used mechanisms: cancellation and retry. It conducts empirical studies of real-world problems associated with cancellation and retry and, guided by these studies, develops approaches to detect policy- and implementation-related problems in these mechanisms using static and complementary large language model (LLM)-aided program analysis techniques. These techniques find hundreds of problems in popular open-source distributed applications.