Tianshi Wang, Yiran Zhang, Yiao Zhang, Qiyang Zhang, Ao Zhou, Shangguang Wang. Priority-Driven Flow Scheduling for Distributed Large Language Model Training[J]. Chinese Journal of Electronics.

Priority-Driven Flow Scheduling for Distributed Large Language Model Training

Abstract: Large Language Model (LLM) training relies heavily on efficient communication coordination between distributed accelerators. Existing approaches optimize specific parallelism strategies independently and lack systematic prioritization across different communication patterns, leading to suboptimal training performance. In this paper, we propose HyPA, a hybrid parallelism priority assignment framework. HyPA employs offline surrogate modeling for closed-form priority optimization and online parameter sensing for dynamic environmental adaptation, enabling adaptive bandwidth allocation and congestion control without modifying network infrastructure. In comprehensive evaluations on realistic training workloads, HyPA improves job completion time (JCT) for both dense and sparse LLM models, reducing JCT by up to 18.59% in micro-benchmarks and up to 16% in large-scale training deployments.
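To make the core idea concrete, the sketch below illustrates strict-priority flow scheduling over a shared link, where communication flows from different parallelism strategies (pipeline-, tensor-, and data-parallel traffic) receive different priorities. This is a minimal illustration only, not HyPA's actual algorithm: the abstract does not specify the priority values, flow classes, or scheduling discipline, so the class names, priority ordering, and flow sizes here are assumptions for demonstration.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Flow:
    """A communication flow; lower priority value is scheduled first (hypothetical convention)."""
    priority: int
    name: str = field(compare=False)
    size_mb: float = field(compare=False)

def schedule(flows: list[Flow], link_bw_mbps: float) -> dict[str, float]:
    """Drain flows in strict priority order over one shared link; return finish times in seconds."""
    heapq.heapify(flows)
    t = 0.0
    finish = {}
    while flows:
        f = heapq.heappop(flows)
        t += f.size_mb * 8 / link_bw_mbps  # transmission time = bits / bandwidth
        finish[f.name] = t
    return finish

# Hypothetical priority assignment: pipeline-parallel activations are latency-critical,
# tensor-parallel all-reduces come next, data-parallel gradient sync is most tolerant.
flows = [
    Flow(0, "PP-activation", 64),
    Flow(1, "TP-allreduce", 256),
    Flow(2, "DP-gradient", 1024),
]
print(schedule(flows, link_bw_mbps=100_000))
```

Under this ordering, the latency-critical pipeline flow finishes first instead of queuing behind the large gradient synchronization, which is the intuition behind prioritizing across communication patterns rather than optimizing each parallelism strategy in isolation.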
