Abstract
As new data-intensive applications and storage hardware emerge, sustaining the performance and robustness of data and storage systems is becoming more intricate and challenging. Users expect a growing list of demands to be met, while service providers face the hard task of delivering acceptable service-level objectives (SLOs). Both parties essentially want the same outcome, yet the gap between them continues to widen. Customers keep introducing new data paradigms and bombarding providers with application-specific requirements, making it increasingly difficult to design generic systems that consistently deliver fast performance.
This dissertation aims to build fast and stable next-generation data and storage systems. Specifically, we architect these systems generically so that they deliver low-latency responses even in the most turbulent scenarios. Because systems keep growing in complexity, this dissertation tackles the problem from four different angles:
1. Data approach: A thorough, large-scale understanding of increasingly complex real-world issues is needed to pinpoint root causes and potential solutions. Here, we present TAILATSTORE, a study that mines performance logs tracking half a million disks and thousands of SSDs. TAILATSTORE reveals that storage performance instability is not uncommon and that the primary causes of slowdowns are the internal characteristics and idiosyncrasies of modern storage devices, motivating the design of tail-tolerant mechanisms.
2. Hardware-level approach: While other work attempts to reduce performance variability at the application level with techniques such as speculation, we take a different view: cutting performance variability “at the source” is more effective. Specifically, in TINYTAILFLASH, we re-architect SSDs to collaborate with the host and circumvent almost all of the noise induced by background operations.
3. OS-level approach: The OS sits at the heart of the system stack; hence, the question is how the OS should evolve to provide stable performance for today's deep stack. In tackling this problem, our insight is that the OS is no longer just the OS for a personal computer, but rather the OS for the “datacenter”. In this context, we present MITTOS, an OS that is SLO-aware and capable of predicting the latency of every I/O and failing over slow I/Os to peer OSs. MITTOS’s no-wait approach helps reduce I/O completion time by up to 35% compared to wait-then-speculate approaches.
4. ML-for-system approach: Current systems are growing too complex for human designers to devise heuristic-based policies for optimal system control. This situation raises the question of whether machine learning can help. To answer it, we present LINNOS, which uses neural networks to predict the performance of every request and every I/O. LINNOS supports black-box devices and real production traces without requiring any extra input from users. Compared to hedging and heuristic-based methods, LINNOS improves average I/O latency by 9.6-79.6% with 87-97% inference accuracy and 4-6μs of inference overhead per I/O, demonstrating that it is possible to incorporate machine learning inside operating systems for real-time decision-making.
Lastly, this dissertation discusses future research directions for building fast and stable data and storage systems and for helping storage applications achieve performance predictability in the milli/microsecond era.