Abstract
In this document, we present our approaches for understanding and discovering scalability faults, i.e., faults whose symptoms appear at larger scales but are not visible at smaller scales.
First, we present a study of over 350 scalability faults collected from the repositories of 10 popular
open-source distributed systems. We analyze the symptoms they produce, the scenarios in which they
manifest, their root causes, the effectiveness of existing testing tools in detecting them,
and the solutions and effort involved in tackling them.
Then, we present ScaleCheck, an emulation-based approach for discovering scalability
faults in large-scale distributed storage systems. ScaleCheck employs a set of
black-box and white-box techniques that allow developers to “deploy” a cluster on a single machine
and accurately observe the behavior of their systems as if they were deployed on multiple machines.
Moreover, ScaleCheck includes a collection-tracking mechanism that allows developers to discover
potentially harmful code paths affected by the increase in the number of nodes in the cluster. We
integrated this approach into 4 popular distributed storage systems and accurately reproduced the
symptoms of 10 known scalability faults using a single machine.
Finally, we present SView, a framework for identifying and analyzing potential scalability
faults in large-scale distributed systems. SView combines instrumentation and statistical concepts
to identify dimensional code fragments (DCFs), i.e., pieces of code whose number of executions
(e.g., # loop iterations, # method executions) is positively correlated with the increase in size of
one or more system dimensions (e.g., # files, # clients, # requests), with static analysis
modules that detect faulty code patterns involving the DCFs. SView's lightweight approach does
not require modifications to the system under test, is portable without effort across different
versions of the same system, and focuses on the root causes of scalability faults rather
than the symptoms they produce. We evaluate SView on 15 different versions of 4 popular
distributed systems and use our analysis modules to detect known and unknown scalability faults.
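
To make the notion of a dimensional code fragment concrete, the following is a minimal sketch of how a positive correlation between execution counts and a system dimension might be tested. It is not SView's implementation: the abstract does not name the statistical method used, so the Pearson coefficient, the 0.9 threshold, the function names (pearson, find_dcfs), the fragment names, and the measurements are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def find_dcfs(dimension_sizes, execution_counts, threshold=0.9):
    """Return the fragments whose execution counts grow with the dimension.

    threshold is an assumed cut-off, not a value taken from the source.
    """
    return sorted(
        name
        for name, counts in execution_counts.items()
        if pearson(dimension_sizes, counts) >= threshold
    )


# Hypothetical measurements: the number of files in each run, and how often
# each instrumented fragment executed during that run.
files = [100, 200, 400, 800]
counts = {
    "scan_all_files_loop": [100, 200, 400, 800],  # grows with # files
    "heartbeat_handler": [12, 11, 12, 12],        # roughly constant
    "config_loader": [1, 1, 1, 1],                # constant
}

dcfs = find_dcfs(files, counts)
print(dcfs)
```

In this sketch only the fragment whose execution count tracks the number of files is flagged; a fragment flagged this way would then be a candidate for the static analysis modules described above.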