A Survey on the Scheduling of DL and LLM Training Jobs in GPU Clusters
Abstract
As deep learning (DL) technology rapidly advances in areas such as computer vision (CV), natural language processing (NLP), and, more recently, large language models (LLMs), the demand for computing resources has grown substantially. In particular, scheduling deep learning training (DLT) jobs on GPU clusters has become crucial for effectively utilizing computing resources and accelerating model training. However, resource management and scheduling in GPU clusters face computing and communication challenges, including job sharing, interference, elastic scheduling, heterogeneous resources, and fairness. This survey investigates the scheduling of DLT jobs in GPU clusters, focusing on scheduling optimizations at both the job-characteristic and cluster-resource levels. We analyze the structures and training characteristics of traditional DL models and LLMs, as well as their requirements for iterative computation, communication, GPU sharing, and resource elasticity. In addition, we compare the main contributions of this survey with those of related reviews and discuss research directions, including scheduling based on job characteristics and optimization strategies for cluster resources. This survey aims to provide researchers and practitioners with a comprehensive understanding of DLT job scheduling in GPU clusters and to point out directions for future research.