Towards Scale-Checkable Systems

Stuardo Moraga, Cesar Andres

doi:10.6082/uchicago.5276

2022

Abstract

In this document, we present our approaches for understanding and discovering scalability faults, i.e., faults whose symptoms appear at larger scales but are not visible at smaller scales. First, we present a study of over 350 scalability faults collected from the repositories of 10 popular open-source distributed systems. We analyze the symptoms they produce, the scenarios in which they manifest, their root causes, the effectiveness of existing testing tools in detecting them, and the solutions and effort involved in tackling them. Then, we present ScaleCheck, an emulation-based approach for discovering scalability faults in large-scale distributed storage systems. ScaleCheck employs a set of black-box and white-box techniques that allow developers to “deploy” a cluster on a single machine and accurately observe the behavior of their systems as if they were deployed on multiple machines. Moreover, ScaleCheck includes a collection-tracking mechanism that allows developers to discover potentially harmful code paths affected by an increase in the number of nodes in the cluster. We integrated this approach into 4 popular distributed storage systems and accurately reproduced the symptoms of 10 known scalability faults using a single machine. Finally, we present SView, a framework for identifying and analyzing potential scalability faults in large-scale distributed systems. SView combines instrumentation and statistical techniques, which identify dimensional code fragments (DCFs), i.e., pieces of code whose number of executions (e.g., # loop iterations, # method executions) is positively correlated with the increase in size of one or more system dimensions (e.g., # files, # clients, # requests), with static analysis modules that detect faulty code patterns involving the DCFs. SView's lightweight approach does not require modifications to the system under test, is portable without effort across different versions of the same system, and focuses on the root cause of scalability faults rather than the symptoms they produce. We evaluate SView on 15 different versions of 4 popular distributed systems and use our analysis modules to detect both known and previously unknown scalability faults.
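
The DCF definition above can be illustrated with a minimal sketch. It assumes that execution counts for each code fragment have already been collected at several sizes of one system dimension, and it uses the Pearson correlation coefficient with an arbitrary 0.9 threshold as the statistical test. The abstract does not specify which statistic or threshold SView uses, so `pearson`, `find_dcfs`, the fragment names, and all counts here are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series does not vary with the dimension at all.
        return 0.0
    return cov / sqrt(var_x * var_y)


def find_dcfs(dimension_sizes, execution_counts, threshold=0.9):
    """Return fragments whose execution count grows with the dimension.

    dimension_sizes: sizes of one dimension across runs (e.g. # nodes).
    execution_counts: fragment name -> executions observed in each run.
    threshold: assumed cut-off, not a value taken from the thesis.
    """
    return sorted(
        name
        for name, counts in execution_counts.items()
        if pearson(dimension_sizes, counts) >= threshold
    )


# Hypothetical measurements from four runs with growing cluster size.
cluster_sizes = [8, 16, 32, 64]
counts = {
    "rebalance_loop": [64, 256, 1024, 4096],  # grows quadratically
    "heartbeat_handler": [8, 16, 32, 64],     # grows linearly
    "config_reload": [3, 3, 3, 3],            # independent of scale
}

dcfs = find_dcfs(cluster_sizes, counts)
print(dcfs)
```

In this sketch, fragments flagged as DCFs would then be candidates for the pattern-based static analysis the abstract describes; fragments such as the hypothetical `config_reload`, whose counts do not change with scale, are filtered out.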