Abstract

Modern distributed systems ("cloud systems") have emerged as a dominant backbone for many of today's applications. As these systems collectively become the "cloud operating system", users expect high dependability, including performance stability and availability. Small jitters in system performance or minutes of service downtime can have a large impact on companies and on user satisfaction. In this dissertation, we tackle these challenges. We aim to improve cloud system dependability by mitigating disruptive cascading effects in two aspects: performance stability and availability. For the performance stability aspect, we focus on mitigating cascading performance failures by improving the tail tolerance of data-parallel frameworks. One popular technique for reducing tail latency is speculative execution (SE). Existing SE implementations, such as those in Hadoop and Spark, are considered quite robust. However, we found an important source of tail latency that current SE implementations cannot handle gracefully: node-level network throughput degradation. We reveal the loopholes of current SE implementations under this unique fault model and show how the problem can cascade to the entire cluster. We then address the problem with PBSE, a robust, path-based speculative execution that employs three key ingredients: path progress, path diversity, and path-straggler detection and speculation. For the availability aspect, we aim to improve cloud system availability by detecting and eliminating cascading outage bugs (CO bugs). A CO bug is a bug that can cause simultaneous or cascading failures across the individual nodes of a system, eventually leading to a major outage. While hardware arguably is no longer a single point of failure, our large-scale studies of cloud bugs and outages reveal that CO bugs have emerged as a new class of outage-causing bugs and as a single point of failure in software.
We address the CO bug problem with the Cascading Outage Bugs Elimination (COBE) project. In this project, we (1) study the anatomy of CO bugs and (2) develop detection tools to unearth them.
