Abstract
A wide range of machine learning applications, from video analytics to large language model (LLM) inference, are becoming distributed. In both settings, data must be loaded from a remote source to the machine learning model before inference can proceed. This thesis focuses on two concrete applications: video analytics, where analytical DNNs must load video feeds from remote cameras, and LLM inference, where the inference engine must load KV caches from storage for faster processing. Our observation is that, by properly identifying the important parts of the data and loading them with high priority, end-to-end latency can be greatly reduced without sacrificing other performance metrics (accuracy in video analytics and throughput in LLM serving). Concretely, the pixels associated with objects in video analytics are more important than other pixels, and the KV caches associated with requests that have shorter job completion times (JCTs) are more important than other caches. Existing approaches, however, estimate this importance either too slowly or too inaccurately. This thesis leverages application-driven insights to identify the important data quickly and accurately. Our evaluation shows that this reduces latency by 2–3× without sacrificing accuracy or throughput.