Abstract
The widespread adoption of SSDs has made stable performance difficult to ensure due to their high tail latencies, which are amplified in large systems. A promising approach to improving tail tolerance is to use machine learning to predict the latency of each request and, based on that prediction, decide whether to serve the request or fail over to a replica. Deciding whether to fail over is not as simple as setting a deadline and failing over whenever the predicted latency exceeds it (and not only because the predictions are sometimes inaccurate). To achieve the best performance, we must consider all aspects of the system, including how many replicas are available, the cost of failing over, and the latency distribution of the workload. With this information, we can estimate the expected latency of failing over to a replica and use it to decide when failing over is likely to improve performance. We incorporate this information into the system in two ways: by biasing the loss function of the learner to increase or decrease the predicted latencies (or, for a classification problem, to increase or decrease the rate of positive or negative predictions), and by changing the threshold used to classify which requests are too slow. We find that both methods achieve comparable performance improvements, but that there is little benefit to combining them. This suggests that decomposing the problem and improving the machine learning predictions separately from tuning other aspects of the system may add unnecessary complexity compared to optimizing the whole system together. A holistic approach and a deep understanding of the system are necessary to provide the best performance while minimizing system complexity.
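To make the two tuning mechanisms concrete, the sketch below illustrates one possible form of each: an asymmetric loss that biases the predictor toward over-predicting latency (and therefore failing over more aggressively), and a failover rule whose threshold is derived from the expected cost of going to a replica. All names and constants here (FAILOVER_COST_US, asymmetric_loss, should_failover, the numeric values) are hypothetical and only illustrate the idea; they are not taken from the system described above.

```python
# Illustrative sketch only; names and numbers are assumptions, not the paper's design.
import numpy as np

FAILOVER_COST_US = 150.0          # assumed overhead of abandoning a request and retrying
REPLICA_MEDIAN_LATENCY_US = 90.0  # assumed typical latency on a lightly loaded replica


def expected_failover_latency_us() -> float:
    """Expected latency if we abandon the local request and go to a replica."""
    return FAILOVER_COST_US + REPLICA_MEDIAN_LATENCY_US


def asymmetric_loss(y_true: np.ndarray, y_pred: np.ndarray, under_weight: float = 3.0) -> float:
    """Biased squared loss: under-predictions (y_pred < y_true) are penalized
    more heavily, nudging the learner toward over-predicting latency."""
    err = y_pred - y_true
    weights = np.where(err < 0, under_weight, 1.0)
    return float(np.mean(weights * err ** 2))


def should_failover(predicted_latency_us: float, threshold_scale: float = 1.0) -> bool:
    """Fail over only when the predicted local latency exceeds the (tunable)
    expected latency of serving the request from a replica."""
    return predicted_latency_us > threshold_scale * expected_failover_latency_us()


if __name__ == "__main__":
    # Toy check: a request predicted to take 1 ms locally is redirected,
    # while a 100 us request is served locally.
    print(should_failover(1000.0))  # True
    print(should_failover(100.0))   # False
```

The two knobs in the sketch, under_weight in the loss and threshold_scale in the decision rule, correspond to the two tuning mechanisms described above; the abstract's finding is that adjusting either one yields comparable gains, while tuning both adds complexity without much additional benefit.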