DistMind: Efficient Resource Disaggregation for Deep Learning Workloads

Abstract

Deep learning (DL) systems suffer from low resource utilization because (i) the monolithic server model tightly couples compute and memory, and (ii) strict service-level objectives (SLOs) limit sharing across inference applications and between inference and training. To address this problem, we present DistMind, a disaggregated DL system that enables efficient multiplexing of DL applications with near-optimal resource utilization. DistMind decouples compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be provisioned independently. The key challenge is to dynamically allocate GPU resources to applications based on their real-time demands while meeting strict SLOs. We tackle this challenge by exploiting high-speed 100 Gbps networks and designing three-stage pipelining, cache-aware load balancing, and DNN-aware sharding mechanisms tailored to the characteristics of DL workloads, achieving millisecond-scale application loading overhead and improved system efficiency. We have implemented a prototype of DistMind and integrated it with PyTorch. Experimental results on AWS EC2 show that DistMind achieves near-100% resource utilization; compared with NVIDIA MPS and Ray, it improves throughput by up to 279% and reduces inference latency by up to 94%.
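
To make the cache-aware load-balancing idea concrete, below is a minimal Python sketch of a routing policy in the spirit described above: prefer a GPU worker that already holds the requested model so the request skips the loading path, and otherwise fall back to the least-loaded worker. This is not DistMind's actual interface; the Worker class, the route function, and the model names are hypothetical, and the real mechanism in the paper also interacts with the memory pool and pipelined loading.

    # Hypothetical sketch of cache-aware load balancing (not DistMind's real API).
    from dataclasses import dataclass, field

    @dataclass
    class Worker:
        name: str
        cached_models: set = field(default_factory=set)  # models resident on this GPU
        queue_len: int = 0                                # outstanding requests

    def route(model_id: str, workers: list) -> Worker:
        warm = [w for w in workers if model_id in w.cached_models]
        pool = warm if warm else workers                  # prefer cache hits, else any worker
        target = min(pool, key=lambda w: w.queue_len)     # break ties by current load
        target.queue_len += 1
        target.cached_models.add(model_id)                # model is loaded / kept warm
        return target

    workers = [Worker("gpu-0", {"resnet50"}), Worker("gpu-1"), Worker("gpu-2", {"bert-base"})]
    print(route("bert-base", workers).name)               # -> gpu-2 (cache hit)
    print(route("resnet152", workers).name)               # -> gpu-0 (least loaded, cold start)

In this toy version a cache miss simply falls back to the least-loaded worker; the paper's design instead bounds that miss penalty with three-stage pipelining and DNN-aware sharding so the model can be streamed over the 100 Gbps network in milliseconds.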

Publication
In IEEE/ACM Transactions on Networking (CCF-A)
Yinmin Zhong