Today, significant research effort is spent on designing machine learning (ML) models to extract useful information from data. While ML models have shown superior performance in research studies, deploying them in practice is a challenging process. My dissertation considers the task of integrating ML models into real-world distributed systems, focusing on a number of sensing and monitoring applications. For such applications, ML deployments must overcome the dual challenges of heterogeneity and scalability. Heterogeneity refers to the need for ML models to be customized for individual instances of the application (e.g., a specific geographic location). Scalability refers to the challenges of training and deploying distinct models across a large number of nodes. In this dissertation, through specific case studies of high-impact problems, I demonstrate how I address these challenges using a systems approach. My research considers both physical sensing and cyber monitoring applications. In physical sensing applications, heterogeneity mainly comes from sensors' local environments, while in cyber monitoring scenarios, human factors are the key source of heterogeneity. In both types of applications, I study how heterogeneity affects ML system design and evaluation, and propose new methods to address the dual challenges of heterogeneity and scalability. First, for the task of spectrum monitoring in cellular networks, I present a novel system design that can efficiently train and deploy deep neural network (DNN) models on a large number of spectrum sensors. Since each sensor observes complex, heterogeneous, and time-varying spectrum data, it is hard to train and deploy accurate DNN-based spectrum sensing models at scale. My work addresses this challenge by leveraging the hierarchical structure of cellular networks: I build context-agnostic models for spectrum usage and apply transfer learning to reduce training cost and ease dataset constraints.
Then, I consider the problem of evaluating ML-based video analytics pipelines (VAPs) with heterogeneous workloads, under the scenario of vehicle detection for smart city monitoring. The difficulty of proper VAP evaluation lies in the diverse video workloads caused by heterogeneous camera locations. Existing VAP evaluations do not consider such heterogeneity and, as a result, produce premature or ambiguous results. My work addresses this gap by building the first VAP benchmark, which provides proper and comprehensive evaluation of VAPs by characterizing the complex dependencies of VAP performance on video content characteristics. Next, I consider a challenging problem in the area of cyber monitoring, where ML models are applied to detect unacceptable face edits in online face images. Through a user study, we find that the definition of an unacceptable face edit varies by user and application context, indicating the need for personalized protection. Meeting such diverse user needs is extremely challenging, since it requires ML models to accurately recognize face edits in each photo. We address this challenge using a systems approach. By integrating the system function of tracking original image copies, we convert the extremely hard problem of recognizing any edit in an image into a feasible ML problem of comparing two images. The end result is an efficient photo moderation tool that allows users to define their own face edit policies and provides personalized protection accordingly. Finally, I summarize my work and discuss future directions.