ZHANG Jun, HE Yanxiang, SHEN Fanfan, LI Qing'an, TAN Hai. Memory Request Priority Based Warp Scheduling for GPUs[J]. Chinese Journal of Electronics, 2018, 27(5): 985-994. doi: 10.1049/cje.2018.05.003

Memory Request Priority Based Warp Scheduling for GPUs

doi: 10.1049/cje.2018.05.003
Funds: This work is supported by the National Natural Science Foundation of China (No.61662002, No.61373039, No.61462004), the Science and Technology Project of the Educational Department in Jiangxi Province, China (No.GJJ150605), the Natural Science Foundation of Jiangxi Province (No.20151BAB207042, No.20161BAB212056), and the Key Research and Development Plan of the Scientific Department in Jiangxi Province, China (No.20161BBE50063).
More Information
  • Corresponding author: HE Yanxiang was born in 1952. He received the B.S. and M.S. degrees from the Department of Mathematics, Wuhan University, China, in 1973 and 1975, respectively, and the Ph.D. degree from the Computer School of Wuhan University, China, in 1999. He is a professor at Wuhan University. His research interests include trustworthy software engineering, performance optimization of multi-core processors, and distributed computing. (Email: yxhe@whu.edu.cn)
  • Received Date: 2017-08-07
  • Rev Recd Date: 2018-01-26
  • Publish Date: 2018-09-10
  • The high performance of GPGPUs comes from massive multithreading, which has made them increasingly widely used, especially for throughput-oriented applications. Data locality is one of the important factors affecting GPGPU performance. Although a GPGPU can exploit some intra-warp and inter-warp locality on its own, there is still considerable room for improvement. In this work, we analyze the characteristics of different applications and propose memory request based warp scheduling to better exploit inter-warp spatial locality. The method lets warps with good inter-warp locality run faster, which improves overall performance. Experimental results show that the proposed method achieves average performance improvements of 24.7% and 11.9% over LRR and MRPB, respectively.
  • P. Du, J. Zhao, W. Pan, et al., “GPU accelerated real-time collision handling in virtual disassembly”, Journal of Computer Science and Technology, Vol.30, No.3, pp.511-518, 2015.
    L. Cheng and T. Li, “Efficient data redistribution to speedup big data analytics in large systems”, Proc. of the 23rd International Conference on High Performance Computing, Hyderabad, India, pp.91-100, 2016.
    E. Lindholm, J. Nickolls, S. Oberman, et al., “NVIDIA Tesla: A unified graphics and computing architecture”, IEEE Micro, Vol.28, No.2, pp.39-55, 2008.
    J. Nickolls and W.J. Dally, “The GPU computing era”, IEEE Micro, Vol.30, No.2, pp.56-69, 2010.
    Y. He, J. Zhang, F. Shen, et al., “Thread scheduling optimization of general purpose graphics processing unit: A survey”, Chinese Journal of Computers, Vol.39, No.9, pp.1733-1749, 2016.
    C. Nugteren, G. van den Braak and H. Corporaal, “Future of GPGPU micro-architectural parameters”, Proc. of the Conference on Design, Automation and Test in Europe, Grenoble, France, pp.392-395, 2013.
    J. Meng, D. Tarjan and K. Skadron, “Dynamic warp subdivision for integrated branch and memory divergence tolerance”, Proc. of the 37th International Symposium on Computer Architecture, Saint Malo, France, pp.235-246, 2010.
    D. Tarjan, J. Meng and K. Skadron, “Increasing memory miss tolerance for SIMD cores”, Proc. of the ACM/IEEE Conference on High Performance Computing, Portland, USA, Page 22, 2009.
    J. Meng, J.W. Sheaffer and K. Skadron, “Robust SIMD: Dynamically adapted SIMD width and multi-threading depth”, Proc. of the 26th International Parallel & Distributed Processing Symposium, Shanghai, China, pp.107-118, 2012.
    T.G. Rogers, M. O’Connor and T.M. Aamodt, “Divergence-aware warp scheduling”, Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, California, USA, pp.99-110, 2013.
    T.G. Rogers, M. O’Connor and T.M. Aamodt, “Cache-conscious wavefront scheduling”, Proc. of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Vancouver, British Columbia, Canada, pp.72-83, 2012.
    Z. Zheng, Z. Wang and M. Lipasti, “Adaptive cache and concurrency allocation on GPGPUs”, IEEE Computer Architecture Letters, Vol.14, No.2, pp.90-93, 2015.
    D. Li, M. Rhu, D.R. Johnson, et al., “Priority-based cache allocation in throughput processors”, Proc. of the 21st IEEE International Symposium on High Performance Computer Architecture, Burlingame, CA, USA, pp.89-100, 2015.
    O. Kayiran, A. Jog, M.T. Kandemir, et al., “Neither more nor less: Optimizing thread-level parallelism for GPGPUs”, Proc. of the 22nd International Conference on Parallel Architecture and Compilation Techniques, Edinburgh, United Kingdom, pp.157-166, 2013.
    X. Chen, L.W. Chang, C.I. Rodrigues, et al., “Adaptive cache management for energy-efficient GPU computing”, Proc. of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, United Kingdom, pp.343-355, 2014.
    M. Lee, S. Song, J. Moon, et al., “Improving GPGPU resource utilization through alternative thread block scheduling”, Proc. of the 20th International Symposium on High Performance Computer Architecture, Orlando, Florida, USA, pp.260-271, 2014.
    M. Gebhart, D. Johnson, D. Tarjan, et al., “Energy-efficient mechanisms for managing thread context in throughput processors”, Proc. of the 38th International Symposium on Computer Architecture, San Jose, California, USA, pp.235-246, 2011.
    V. Narasiman, C.J. Lee, M. Shebanow, et al., “Improving GPU performance via large warps and two-level warp scheduling”, Proc. of the 44th International Symposium on Microarchitecture, Porto Alegre, Brazil, pp.308-317, 2011.
    A. Jog, O. Kayıran, N.C. Nachiappan, et al., “OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance”, Proc. of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, Houston, Texas, USA, pp.395-406, 2013.
    W. Jia, K. Shaw and M. Martonosi, “MRPB: Memory request prioritization for massively parallel processors”, Proc. of the 20th IEEE International Symposium on High Performance Computer Architecture, Orlando, Florida, USA, pp.272-283, 2014.
    X. Xie, Y. Liang, Y. Wang, et al., “Coordinated static and dynamic cache bypassing for GPUs”, Proc. of the 21st IEEE International Symposium on High Performance Computer Architecture, Burlingame, CA, USA, pp.76-88, 2015.
    S.Y. Lee and C.J. Wu, “Ctrl-C: Instruction-aware control loop based adaptive cache bypassing for GPUs”, Proc. of the 34th IEEE International Conference on Computer Design, Scottsdale, AZ, USA, pp.133-140, 2016.
    C. Li, S.L. Song, H. Dai, et al., “Locality-driven dynamic GPU cache bypassing”, Proc. of the 29th ACM International Conference on Supercomputing, Newport Beach/Irvine, CA, USA, pp.67-77, 2015.
    Y. Tian, S. Puthoor, J.L. Greathouse, et al., “Adaptive GPU cache bypassing”, Proc. of the 8th Workshop on General Purpose Processing Using GPUs, San Francisco, CA, USA, pp.25-35, 2015.
    S. Che, M. Boyer, J. Meng, et al., “Rodinia: A benchmark suite for heterogeneous computing”, Proc. of IEEE International Symposium on Workload Characterization, Austin, TX, USA, pp.44-54, 2009.
    A. Bakhoda, G. Yuan, W.L. Fung, et al., “Analyzing CUDA workloads using a detailed GPU simulator”, Proc. of IEEE International Symposium on Performance Analysis of Systems and Software, Boston, Massachusetts, USA, pp.163-174, 2009.
    NVIDIA, CUDA C Programming Guide PG-02829-001 v6.5, 2014.
    NVIDIA, NVIDIA Compute PTX: Parallel Thread Execution ISA version 1.4, 2014.