Tianshi Wang, Yiran Zhang, Yiao Zhang, et al., "Priority-driven flow scheduling for distributed large language model training," Chinese Journal of Electronics, vol. 8, no. 8, pp. 1–9, 2025. DOI: 10.23919/cje.2025.00.349

Priority-Driven Flow Scheduling for Distributed Large Language Model Training

Large Language Model (LLM) training relies heavily on efficient communication coordination between distributed accelerators. While existing approaches focus on optimizing specific parallelism strategies independently, they lack systematic prioritization across different communication patterns, leading to suboptimal training performance. In this paper, we propose HyPA, a hybrid parallelism priority assignment framework. HyPA employs offline surrogate modeling for closed-form priority optimization and online parameter sensing for dynamic environmental adaptation, enabling adaptive bandwidth allocation and congestion control without modifying network infrastructure. Through comprehensive evaluation on realistic training workloads, HyPA achieves significant improvements in job completion time (JCT) for both dense and sparse LLM models, with up to 18.59% reduction in micro-benchmarks and up to 16% reduction in large-scale training deployments.
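The abstract does not spell out HyPA's surrogate model or its priority classes, so the sketch below is only an illustration of the general idea of priority-driven flow scheduling across parallelism patterns: score each parallelism-induced flow with a toy closed-form surrogate, then map the scores onto a small number of switch priority queues. Every name and formula here (ParallelismKind, surrogate_priority, the class mapping, the slack feature) is a hypothetical stand-in for exposition, not HyPA's actual design.

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical parallelism-aware flow descriptor; HyPA's real
# feature set and surrogate form are not given in the abstract.
class ParallelismKind(IntEnum):
    TENSOR = 0    # intra-layer all-reduce, typically latency-critical
    PIPELINE = 1  # activation/gradient point-to-point between stages
    DATA = 2      # gradient synchronization, often overlappable

@dataclass
class Flow:
    kind: ParallelismKind
    size_bytes: int
    slack_us: float  # estimated time before this flow blocks compute

def surrogate_priority(flow: Flow) -> float:
    """Toy closed-form surrogate: flows that will block compute
    soonest and carry less data get a higher score."""
    urgency = 1.0 / (1.0 + flow.slack_us)
    size_penalty = 1.0 / (1.0 + flow.size_bytes / 2**20)  # MiB scale
    return urgency * size_penalty

def to_priority_class(flows: list[Flow], num_classes: int = 4) -> list[int]:
    """Map continuous scores onto a fixed number of switch priority
    queues (e.g., DSCP-tagged classes); highest score -> class 0."""
    order = sorted(range(len(flows)),
                   key=lambda i: surrogate_priority(flows[i]),
                   reverse=True)
    classes = [0] * len(flows)
    bucket = max(1, len(flows) // num_classes)
    for rank, i in enumerate(order):
        classes[i] = min(rank // bucket, num_classes - 1)
    return classes

if __name__ == "__main__":
    flows = [
        Flow(ParallelismKind.TENSOR, 4 << 20, slack_us=5.0),
        Flow(ParallelismKind.PIPELINE, 16 << 20, slack_us=50.0),
        Flow(ParallelismKind.DATA, 512 << 20, slack_us=2000.0),
    ]
    print(to_priority_class(flows))  # tensor flow lands in the top class
```

Tagging flows at the end host and letting existing priority queues enforce the ordering is what lets a scheme like this run, as the abstract claims, without modifying network infrastructure.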
