Tianshi Wang, Yiran Zhang, Yiao Zhang, Qiyang Zhang, Ao Zhou, Shangguang Wang. Priority-Driven Flow Scheduling for Distributed Large Language Model Training[J]. Chinese Journal of Electronics.

Priority-Driven Flow Scheduling for Distributed Large Language Model Training

Abstract: Large Language Model (LLM) training relies heavily on efficient communication coordination between distributed accelerators. Existing approaches optimize specific parallelism strategies independently and lack systematic prioritization across different communication patterns, leading to suboptimal training performance. In this paper, we propose HyPA, a hybrid parallelism priority assignment framework. HyPA employs offline surrogate modeling for closed-form priority optimization and online parameter sensing for dynamic environmental adaptation, enabling adaptive bandwidth allocation and congestion control without modifying network infrastructure. In comprehensive evaluations on realistic training workloads, HyPA improves job completion time (JCT) for both dense and sparse LLM models, reducing JCT by up to 18.59% in micro-benchmarks and up to 16% in large-scale training deployments.
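To make the core idea concrete, the sketch below illustrates strict-priority flow scheduling over a shared link, where communication flows from different parallelism strategies (pipeline-, tensor-, and data-parallel traffic) receive different priorities. This is a minimal illustration only, not HyPA's actual algorithm: the abstract does not specify the priority values, flow classes, or scheduling discipline, so the class names, priority ordering, and flow sizes here are assumptions for demonstration.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Flow:
    """A communication flow; lower priority value is scheduled first (hypothetical convention)."""
    priority: int
    name: str = field(compare=False)
    size_mb: float = field(compare=False)

def schedule(flows: list[Flow], link_bw_mbps: float) -> dict[str, float]:
    """Drain flows in strict priority order over one shared link; return finish times in seconds."""
    heapq.heapify(flows)
    t = 0.0
    finish = {}
    while flows:
        f = heapq.heappop(flows)
        t += f.size_mb * 8 / link_bw_mbps  # transmission time = bits / bandwidth
        finish[f.name] = t
    return finish

# Hypothetical priority assignment: pipeline-parallel activations are latency-critical,
# tensor-parallel all-reduces come next, data-parallel gradient sync is most tolerant.
flows = [
    Flow(0, "PP-activation", 64),
    Flow(1, "TP-allreduce", 256),
    Flow(2, "DP-gradient", 1024),
]
print(schedule(flows, link_bw_mbps=100_000))
```

Under this ordering, the latency-critical pipeline flow finishes first instead of queuing behind the large gradient synchronization, which is the intuition behind prioritizing across communication patterns rather than optimizing each parallelism strategy in isolation.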
