Abstract

As Deep Recommendation Systems (DRSs) are deployed into ever more online services and grow in complexity, the size of their Embedding Vector (EV) tables is tripling every two years, reaching tens of terabytes. However, current state-of-the-art DRSs are ill-equipped to handle this exponential growth in EV table sizes. Although some open-source DRS platforms store full EV tables in DRAM, this approach presents drawbacks, including reduced resource utilization and increased operational costs. Consequently, there is growing interest in migrating large EV tables to SSDs as backend storage, prompting recent research to focus on optimizing backend storage for EV table lookups. However, existing storage solutions see limited adoption because they require customized devices such as custom SSDs or FPGA implementations. While incorporating off-the-shelf SSDs to reduce DRAM footprints may seem cost-effective, it can introduce performance instability, particularly for large-scale DRSs that must meet stringent microsecond-scale tail-latency Service Level Objectives (SLOs). Moreover, the current trend toward microservices and Machine Learning (ML) deployment exacerbates these challenges, with tail-latency SLOs expected to become even tighter over time. Additionally, achieving deterministic SSD latency remains difficult due to the unpredictable behavior of internal activities such as garbage collection, wear leveling, and internal buffer flushes.

Rooted in a deep understanding and analysis of the existing software/hardware stack, this dissertation focuses on building fundamental storage support mechanisms that achieve predictable performance, cost-efficiency, and adaptability to the ever-changing workload patterns of the age of AI/ML. By revisiting the challenges across different layers in a holistic manner, we build three storage support mechanisms: EVStore (caching support), which reduces the memory footprint by up to 94%; Heimdall (I/O admission support), which delivers 15-35% lower average I/O latency than the state of the art and is up to 2× faster than a baseline; and TinFetch (prefetching support), which improves hit rate by 3% to 27% over the state of the art.
