Abstract

Large-scale data centers store vast amounts of user data across numerous disks, necessitating redundancy mechanisms such as erasure coding (EC) to protect against disk failures. As storage systems scale in size, complexity, and layering, disk failures become more frequent and rebuild times grow longer. For systems managing tens or hundreds of thousands of disks, traditional single-level erasure coding (SLEC) does not scale well, as it struggles to balance repair overhead with rack- and enclosure-level failure tolerance. Multi-level erasure coding (MLEC), which applies EC at both the network and local levels, has been deployed in large-scale systems. However, its design considerations at scale have not been studied in depth, leaving many research questions open. This dissertation provides a comprehensive analysis of MLEC at scale, focusing on its design considerations and its relationship to deep learning (DL) workloads.

We begin with a detailed analysis of MLEC's design space across multiple dimensions, including code parameter selection, chunk placement schemes, and repair methods. We quantify their performance and durability, identifying which MLEC schemes and repair methods best tolerate independent and correlated failures and reduce repair network traffic by orders of magnitude. Our evaluation methods include simulation, splitting, dynamic programming, and mathematical modeling. We also compare MLEC's performance and durability with those of other EC schemes such as SLEC and LRC, showing that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.

We then examine the relationship between MLEC and DL workloads. As DL workloads become increasingly data-intensive, training datasets often exceed local storage capacity, requiring access to remote erasure-coded storage. To cost-effectively evaluate whether MLEC can meet the throughput demands of DL workloads, we develop an emulation-based approach. We introduce GPEmu, a GPU emulator designed for efficient evaluation of DL systems without physical GPUs; GPEmu supports over 30 DL models and 6 GPU models, providing time emulation, memory emulation, distributed-system support, and GPU sharing. We also develop MLECEmu, which emulates the read throughput of erasure-coded disk arrays using I/O-throttled in-memory file systems. Using these tools, our end-to-end experiments show that MLEC storage with wider stripes can improve GPU utilization, and that our optimized MLEC repair methods reduce training performance degradation during the repair of catastrophic local failures.

While MLEC storage provides high aggregate intra-cluster read throughput for DL workloads, the network between the GPU cluster and the MLEC storage cluster can become a bottleneck during training, as inter-cluster bandwidth is typically more constrained than intra-cluster bandwidth. Since many samples shrink significantly in size during preprocessing, we explore selectively offloading preprocessing tasks to remote MLEC storage to reduce data traffic, and present a case study of this approach's potential benefits and challenges. Based on our findings, we propose SOPHON, a framework that selectively offloads preprocessing tasks at fine granularity to reduce data traffic. SOPHON uses online profiling and adaptive algorithms to optimize the preprocessing of each sample in each training scenario. Evaluations using GPEmu and MLECEmu show that SOPHON reduces data traffic and training time by 1.2x to 2.2x compared to existing solutions.
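To make the two-level structure concrete, the following back-of-envelope sketch shows how a nested network-level/local-level code trades storage overhead against repair traffic, and why a local repair avoids cross-network reads entirely. The parameters and helper names are hypothetical illustrations, not taken from the dissertation:

    # Back-of-envelope arithmetic for a hypothetical nested MLEC scheme:
    # a (k_net + p_net) code across racks, over a (k_loc + p_loc) code
    # within each rack. Parameter values are illustrative examples only.

    def storage_overhead(k_net, p_net, k_loc, p_loc):
        """Physical bytes stored per logical byte under nested EC."""
        return ((k_net + p_net) / k_net) * ((k_loc + p_loc) / k_loc)

    def repair_reads(k_net, k_loc):
        """Chunks read to rebuild one lost chunk.
        A local repair reads k_loc surviving chunks inside the rack and
        sends nothing across the network; falling back to a network-level
        repair reads k_net * k_loc chunks across racks."""
        return {"local": k_loc, "network": k_net * k_loc}

    # Hypothetical MLEC with 8+2 across racks and 10+2 within each rack:
    print(storage_overhead(8, 2, 10, 2))   # 1.5x storage overhead
    print(repair_reads(8, 10))             # {'local': 10, 'network': 80}

Under these assumed parameters, most single-disk failures can be repaired entirely within a rack, which is the intuition behind MLEC's large reduction in repair network traffic.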
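GPEmu's time-emulation capability can be understood through a minimal sketch: replace the GPU compute of each training step with a sleep of a profiled per-batch duration, so the data path (storage, network, preprocessing) runs for real while no physical GPU is needed. The function names and timing numbers below are assumptions for illustration, not GPEmu's actual interface:

    import time

    # Hypothetical profiled compute times (seconds per batch) for a given
    # (model, GPU) pair; real profiles would come from one-time
    # measurements on physical hardware.
    PROFILED_BATCH_TIME = {
        ("resnet50", "v100"): 0.180,
        ("bert-base", "v100"): 0.095,
    }

    def emulated_train_step(model, gpu, batch):
        """Consume a batch, then sleep for the profiled GPU compute time."""
        _ = batch  # the batch was fetched and preprocessed for real
        time.sleep(PROFILED_BATCH_TIME[(model, gpu)])

    def run_epoch(loader, model="resnet50", gpu="v100"):
        """Time a full epoch: real I/O and preprocessing, emulated compute."""
        start = time.perf_counter()
        for batch in loader:
            emulated_train_step(model, gpu, batch)
        return time.perf_counter() - start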
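Finally, the per-sample decision SOPHON makes can be sketched as a simple cost comparison: offload a preprocessing step to the storage cluster only when the bytes it saves on the constrained inter-cluster link outweigh the storage-side CPU it consumes. This is a hedged illustration of the idea, not SOPHON's actual algorithm; all names and numbers are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class SampleProfile:
        raw_bytes: int           # size as stored in remote MLEC storage
        preprocessed_bytes: int  # size after the candidate step (profiled online)
        remote_cpu_secs: float   # estimated storage-side CPU cost of the step

    def should_offload(p, link_bw_bytes_per_s, remote_cpu_budget_secs):
        """Offload iff the step saves transfer time and the storage
        cluster can spare the CPU; samples that grow (or barely shrink)
        during preprocessing stay on the GPU-side client."""
        saved = (p.raw_bytes - p.preprocessed_bytes) / link_bw_bytes_per_s
        return saved > 0 and p.remote_cpu_secs <= remote_cpu_budget_secs

    # A 6 MB image that a resize step shrinks to 0.6 MB is worth
    # offloading over a constrained 1 GB/s inter-cluster link:
    sample = SampleProfile(6_000_000, 600_000, remote_cpu_secs=0.002)
    print(should_offload(sample, link_bw_bytes_per_s=1e9,
                         remote_cpu_budget_secs=0.01))  # True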
