Abstract

Large-scale data centers store vast amounts of user data across numerous disks, necessitating redundancy mechanisms such as erasure coding (EC) to protect against disk failures. As storage systems scale in size, complexity, and layering, disk failures become more frequent and rebuild times grow longer. For systems managing tens or hundreds of thousands of disks, traditional single-level erasure coding (SLEC) does not scale well, as it struggles to balance repair overhead with rack- and enclosure-level failure tolerance. Multi-level erasure coding (MLEC), which applies EC at both the network and local levels, has been deployed in large-scale systems. However, its design considerations at scale have not been studied in depth, leaving many research questions open. This dissertation provides a comprehensive analysis of MLEC at scale, focusing on its design considerations and its relationship to deep learning (DL) workloads.

We begin with a detailed analysis of MLEC's design space across multiple dimensions, including code parameter selection, chunk placement schemes, and repair methods. We quantify their performance and durability, identifying which MLEC schemes and repair methods best tolerate independent and correlated failures and reduce repair network traffic by orders of magnitude. Our evaluation methods include simulation, splitting, dynamic programming, and mathematical modeling. We also compare MLEC's performance and durability with those of other EC schemes such as SLEC and LRC, showing that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.

We then examine the relationship between MLEC and DL workloads. As DL workloads become increasingly data-intensive, training datasets often exceed local storage capacity, requiring access to remote erasure-coded storage. To cost-effectively evaluate whether MLEC can meet the throughput demands of DL workloads, we develop an emulation-based approach. We introduce GPEmu, a GPU emulator designed for efficient evaluation of DL systems without physical GPUs; GPEmu supports over 30 DL models and 6 GPU models, providing time emulation, memory emulation, distributed-system support, and GPU sharing. We also develop MLECEmu, which emulates the read throughput of erasure-coded disk arrays using I/O-throttled in-memory file systems. Using these tools, our end-to-end experiments show that MLEC storage with wider stripes can improve GPU utilization, and that our optimized MLEC repair methods reduce training performance degradation during the repair of catastrophic local failures.

While MLEC storage provides high aggregate intra-cluster read throughput for DL workloads, the network between the GPU cluster and the MLEC storage cluster can become a bottleneck during training, as inter-cluster bandwidth is typically more constrained than intra-cluster bandwidth. Since many samples shrink significantly in size during preprocessing, we explore selectively offloading preprocessing tasks to remote MLEC storage to reduce data traffic, and present a case study of this approach's potential benefits and challenges. Based on our findings, we propose SOPHON, a framework that selectively offloads preprocessing tasks at fine granularity to reduce data traffic. SOPHON uses online profiling and adaptive algorithms to optimize the preprocessing of each sample in each training scenario. Evaluations using GPEmu and MLECEmu show that SOPHON reduces data traffic and training time by 1.2x to 2.2x compared to existing solutions.
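To make the two-level structure concrete, the following back-of-envelope sketch shows how a nested network-level/local-level code trades storage overhead against repair traffic, and why a local repair avoids cross-network reads entirely. The parameters and helper names are hypothetical illustrations, not taken from the dissertation:

    # Back-of-envelope arithmetic for a hypothetical nested MLEC scheme:
    # a (k_net + p_net) code across racks, over a (k_loc + p_loc) code
    # within each rack. Parameter values are illustrative examples only.

    def storage_overhead(k_net, p_net, k_loc, p_loc):
        """Physical bytes stored per logical byte under nested EC."""
        return ((k_net + p_net) / k_net) * ((k_loc + p_loc) / k_loc)

    def repair_reads(k_net, k_loc):
        """Chunks read to rebuild one lost chunk.
        A local repair reads k_loc surviving chunks inside the rack and
        sends nothing across the network; falling back to a network-level
        repair reads k_net * k_loc chunks across racks."""
        return {"local": k_loc, "network": k_net * k_loc}

    # Hypothetical MLEC with 8+2 across racks and 10+2 within each rack:
    print(storage_overhead(8, 2, 10, 2))   # 1.5x storage overhead
    print(repair_reads(8, 10))             # {'local': 10, 'network': 80}

Under these assumed parameters, most single-disk failures can be repaired entirely within a rack, which is the intuition behind MLEC's large reduction in repair network traffic.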
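GPEmu's time-emulation capability can be understood through a minimal sketch: replace the GPU compute of each training step with a sleep of a profiled per-batch duration, so the data path (storage, network, preprocessing) runs for real while no physical GPU is needed. The function names and timing numbers below are assumptions for illustration, not GPEmu's actual interface:

    import time

    # Hypothetical profiled compute times (seconds per batch) for a given
    # (model, GPU) pair; real profiles would come from one-time
    # measurements on physical hardware.
    PROFILED_BATCH_TIME = {
        ("resnet50", "v100"): 0.180,
        ("bert-base", "v100"): 0.095,
    }

    def emulated_train_step(model, gpu, batch):
        """Consume a batch, then sleep for the profiled GPU compute time."""
        _ = batch  # the batch was fetched and preprocessed for real
        time.sleep(PROFILED_BATCH_TIME[(model, gpu)])

    def run_epoch(loader, model="resnet50", gpu="v100"):
        """Time a full epoch: real I/O and preprocessing, emulated compute."""
        start = time.perf_counter()
        for batch in loader:
            emulated_train_step(model, gpu, batch)
        return time.perf_counter() - start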
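Finally, the per-sample decision SOPHON makes can be sketched as a simple cost comparison: offload a preprocessing step to the storage cluster only when the bytes it saves on the constrained inter-cluster link outweigh the storage-side CPU it consumes. This is a hedged illustration of the idea, not SOPHON's actual algorithm; all names and numbers are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class SampleProfile:
        raw_bytes: int           # size as stored in remote MLEC storage
        preprocessed_bytes: int  # size after the candidate step (profiled online)
        remote_cpu_secs: float   # estimated storage-side CPU cost of the step

    def should_offload(p, link_bw_bytes_per_s, remote_cpu_budget_secs):
        """Offload iff the step saves transfer time and the storage
        cluster can spare the CPU; samples that grow (or barely shrink)
        during preprocessing stay on the GPU-side client."""
        saved = (p.raw_bytes - p.preprocessed_bytes) / link_bw_bytes_per_s
        return saved > 0 and p.remote_cpu_secs <= remote_cpu_budget_secs

    # A 6 MB image that a resize step shrinks to 0.6 MB is worth
    # offloading over a constrained 1 GB/s inter-cluster link:
    sample = SampleProfile(6_000_000, 600_000, remote_cpu_secs=0.002)
    print(should_offload(sample, link_bw_bytes_per_s=1e9,
                         remote_cpu_budget_secs=0.01))  # True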
