Eliminating the Capacity Variation Penalty for Cloud Resource Management

Zhang, Chaojie

doi:10.6082/uchicago.5719

2023

Abstract

Growing power grid challenges, driven by rapid decarbonization and pressure to reduce carbon emissions, compel data centers to operate with capacity that varies over hours or days, perhaps dynamically in concert with renewable generation. With data centers exceeding 10% of load in many grids, the implied capacity variation may approach 50%. For today’s computing, variable resource capacity is problematic, causing severe loss in resource efficiency.

Our approach is to create intelligent resource management for variable capacity resources. Traditional resource managers were built on the assumption of constant capacity; when capacity decreases, the jobs they have scheduled fail abruptly and waste resources. To understand scheduling performance under variable capacity, we define three key dimensions of variation that lead to performance loss. We use cloud and HPC workloads to explore the multi-dimensional capacity change space, characterizing performance in terms of goodput, job failures, and waiting time. To improve performance, we consider intelligent termination policies to cope with capacity loss. Then, we take a broader view and prepare for capacity variation altogether. We consider two dimensions of uncertainty, capacity and workload, and explore the information space. We propose new scheduling techniques that exploit this information to prevent job failures and increase resource efficiency.

We evaluate traditional schedulers under varying resource capacities using a diverse set of HPC and cloud workloads. Results show that capacity variation decreases goodput by 60% and incurs 40% job failures. Among the variability dimensions, dynamic range, structures, and change frequency are all important; each in some cases produces 10-40% goodput losses. A drill-down with Google cloud workloads shows that variable capacity can cause serious problems, including up to 70% goodput loss and 20% job failures.
Careful study of performance versus variability shows that avoiding major harm, such as goodput loss, requires limiting the variation range to <10%. Such a limit prevents the cloud from performing significant temporal load shifting to reduce carbon emissions. We designed and compared intelligent termination policies to cope with capacity loss. Results demonstrate that these techniques achieve significant improvements under variation, with a 10-66% goodput increase and a 1.6-3x reduction in job failures. Using job progress to minimize wasted computation produces a 44% average goodput increase. Realistic examples show that with these scheduling techniques, a typical data center can achieve a 15% carbon emission reduction by exploiting variation.

Then, we take a broader view and design new scheduling schemes that prepare for variation, using Google cloud workloads. These new schedulers exploit a variety of information about workload and capacity to reduce uncertainty, increasing goodput by up to 180% and decreasing the job failure rate by 5-15x. Within the information space, runtime classification is critical. Exploiting this information, the LongShort algorithm can drastically increase the variation range from <10% to 50% while maintaining performance. These results demonstrate large scheduling improvements for capacity variation but require validation with complex workload constraints. While capacity variation poses serious challenges to conventional resource managers, our intelligent resource management shows significant improvement, eliminating the variation penalty and demonstrating the promising benefits of future variable capacity data centers.
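
To make the idea of using job progress to minimize wasted computation concrete, here is a minimal illustrative sketch. It is not the dissertation's implementation: the names `Job`, `wasted_work`, and `choose_victims` are hypothetical, and the greedy "least progress first" rule is only one simple assumption about how such a termination policy could be written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record, for illustration only.
    name: str
    nodes: int      # nodes the job currently occupies
    elapsed: float  # hours already run; lost if the job is terminated


def wasted_work(job: Job) -> float:
    """Node-hours thrown away if this job is terminated now."""
    return job.nodes * job.elapsed


def choose_victims(running: list[Job], nodes_to_free: int) -> list[Job]:
    """Greedy sketch: terminate the jobs with the least progress first
    until at least `nodes_to_free` nodes have been released."""
    victims: list[Job] = []
    freed = 0
    for job in sorted(running, key=lambda j: j.elapsed):
        if freed >= nodes_to_free:
            break
        victims.append(job)
        freed += job.nodes
    return victims


# Capacity drops by 5 nodes while four jobs are running.
running = [
    Job("a", nodes=4, elapsed=10.0),
    Job("b", nodes=2, elapsed=0.5),
    Job("c", nodes=4, elapsed=1.0),
    Job("d", nodes=2, elapsed=6.0),
]
victims = choose_victims(running, nodes_to_free=5)
lost = sum(wasted_work(j) for j in victims)
print([j.name for j in victims], lost)
```

In this toy case the policy stops the two most recently started jobs and leaves the long-running ones intact, so only a small amount of completed work is discarded. A policy that ignored progress could instead terminate job "a" and lose 40 node-hours.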