Abstract
In modern cloud computing environments, ephemeral cloud resources are becoming increasingly prevalent. Ephemeral resources exhibit two distinct characteristics: (1) they can be terminated, either predictably or unexpectedly, by resource providers while in use, and (2) their prices, while often lower than those of reserved or on-demand cloud resources, can fluctuate over time. Deploying data-intensive systems on ephemeral cloud resources in multi-tenant settings presents two challenges. First, resource availability can fall well below the number of jobs when resources become inaccessible due to their ephemeral nature or impractical due to steep price spikes. Second, when resources are terminated or their prices rise unreasonably, the affected jobs must be terminated and their ongoing progress is lost. To address these challenges, we present three optimizations: maximizing resource utilization, preempting and reallocating scarce resources to the most appropriate jobs, and suspending jobs when necessary or advantageous. In this dissertation, we develop prototype systems that realize these optimizations.
First, we propose and implement Repack for deep learning training, which shares common I/O and computation among models packed onto the same computing device. We present a comprehensive empirical study of Repack together with end-to-end experiments, which suggest that: (1) packing two models can improve the performance of a single training step by up to 40%, and the improvement grows as more models are packed; (2) the benefit of the pack primitive largely depends on factors including memory capacity, chip architecture, neural network structure, and batch size; and (3) a pack-aware Hyperband is up to 2.7× faster, with this improvement growing as memory size, and consequently the density of packed models, increases.
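The abstract does not spell out Repack's implementation, so the following is only a minimal PyTorch-style sketch of the pack idea under stated assumptions: two hypothetical models on one device reuse a single data-loading and host-to-device transfer per batch, which is the kind of shared I/O cost that packing amortizes. The dataset, model architectures, and training loop are illustrative, not Repack's actual interface.

```python
# Illustrative sketch only; not the Repack implementation.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset and two small models packed onto the same device.
data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=64)
device = "cuda" if torch.cuda.is_available() else "cpu"
models = [nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
          for _ in range(2)]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:                      # one shared data-loading pass per step
    x, y = x.to(device), y.to(device)    # host-to-device transfer paid once
    for model, opt in zip(models, optimizers):
        opt.zero_grad()
        loss = loss_fn(model(x), y)      # each packed model trains on the same batch
        loss.backward()
        opt.step()
```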
Second, we propose and design Rotary, a resource arbitration framework that continuously prioritizes progressive iterative analytics jobs and determines if and when to reallocate and preempt resources for them. Progressive iterative analytics provide approximate or partial results to users by computing over subsets of the entire dataset until either the users are satisfied with the results or predefined completion criteria are met. We consider two prevalent cases of progressive iterative analytics, approximate query processing (AQP) and deep learning training (DLT), and implement two resource arbitration systems, Rotary-AQP and Rotary-DLT. We build a TPC-H based AQP workload and a survey-based DLT workload to evaluate Rotary.
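Rotary's actual prioritization policy and interfaces are not described in the abstract; the sketch below only illustrates the general shape of such an arbitration loop. The Job fields, the priority formula, and the job names are all hypothetical stand-ins.

```python
# Hypothetical arbitration-loop sketch; not Rotary's real policy or API.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    progress: float         # fraction of its completion criteria already met (0..1)
    est_remaining_s: float  # estimated time to satisfy the completion criteria
    holds_resource: bool

def priority(job: Job) -> float:
    # Stand-in scoring: favor jobs that are close to their completion criteria
    # and cheap to finish.
    return job.progress / max(job.est_remaining_s, 1.0)

def arbitrate(jobs: list[Job], capacity: int) -> list[Job]:
    """Return the jobs that hold resources in the next epoch; running jobs
    that are not selected are preempted and their resources reallocated."""
    return sorted(jobs, key=priority, reverse=True)[:capacity]

jobs = [
    Job("aqp-q3", progress=0.8, est_remaining_s=30, holds_resource=False),
    Job("dlt-resnet", progress=0.4, est_remaining_s=600, holds_resource=True),
    Job("aqp-q7", progress=0.1, est_remaining_s=900, holds_resource=True),
]
selected = arbitrate(jobs, capacity=2)
preempted = [j for j in jobs if j.holds_resource and j not in selected]
```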
Finally, we present Riveter, an adaptive query execution framework for deploying cloud-native databases on ephemeral cloud resources. Within Riveter, we implement three strategies: (1) a redo strategy that terminates queries and subsequently re-runs them from scratch, (2) a pipeline-level strategy that suspends a query only after its currently running pipeline has completed, thereby reducing the storage required for intermediate data, and (3) a process-level strategy that can suspend query execution at any moment but generates a substantial volume of intermediate data for query resumption. We devise a cost model to determine the strategy that incurs the minimum latency. To evaluate Riveter, we conduct a performance study, an end-to-end analysis, and a cost model evaluation using the TPC-H benchmark.
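The cost model itself is not given in the abstract; the toy sketch below only illustrates how the three strategies could be compared by estimated latency. Every parameter name and value is hypothetical, and the sketch assumes the unavailability window begins when the query is suspended under each strategy.

```python
# Toy latency comparison of the three suspension strategies; not Riveter's cost model.
def estimated_latency(strategy: str, work_done_s: float, work_left_s: float,
                      downtime_s: float, pipeline_wait_s: float,
                      pipeline_dump_s: float, process_dump_s: float,
                      process_load_s: float) -> float:
    """Time from the suspension decision until the query finishes (toy model)."""
    if strategy == "redo":
        # Terminate now, wait out the outage, then re-run the whole query.
        return downtime_s + work_done_s + work_left_s
    if strategy == "pipeline":
        # Run until the current pipeline boundary, persist its small output,
        # wait out the outage, then execute the remaining pipelines.
        return pipeline_wait_s + pipeline_dump_s + downtime_s + (work_left_s - pipeline_wait_s)
    if strategy == "process":
        # Snapshot the full execution state immediately; more data to save and reload.
        return process_dump_s + downtime_s + process_load_s + work_left_s
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical estimates (seconds) for one in-flight query.
params = dict(work_done_s=300, work_left_s=120, downtime_s=60, pipeline_wait_s=20,
              pipeline_dump_s=2, process_dump_s=15, process_load_s=15)
best = min(("redo", "pipeline", "process"),
           key=lambda s: estimated_latency(s, **params))
print(best)  # the strategy with the minimum estimated latency
```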