LI Bingchao, WEI Jizeng, GUO Wei, et al., “Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU,” Chinese Journal of Electronics, vol. 24, no. 4, pp. 684-688, 2015, doi: 10.1049/cje.2015.10.004

Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU

doi: 10.1049/cje.2015.10.004
Funds: This work is supported by the National Natural Science Foundation of China (No. 61402321) and the Doctoral Fund of the Ministry of Education of China (No. 20110032120037).
  • Corresponding author: WEI Jizeng (corresponding author) received the Ph.D. degree in computer science and technology from Tianjin University in 2010. He is now an assistant professor at the School of Computer Science and Technology, Tianjin University. His research interests include GPU architecture, non-volatile memory, and GPU for mobile. (Email: weijizeng@tju.edu.cn)
  • Received Date: 2013-12-30
  • Revision Received Date: 2014-07-14
  • Publish Date: 2015-10-10
  • GPGPUs adopt the SIMT execution model, in which each logical thread in a warp corresponds to a SIMD lane yet can still follow an independent control flow. When branch divergence occurs and threads within a warp take different execution paths, GPGPUs must execute each path serially through SIMD lane masking, which can decrease SIMD utilization and performance. We propose an efficient thread compaction mechanism that handles branch divergence with a novel register file structure. We also develop a new thread scheduling policy that cooperates with our compaction mechanism. Simulation results show that our approach improves the SIMD utilization up to 74.4% and achieves a maximum performance speedup of 11.1% with small hardware overhead.
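The effect of lane masking on SIMD utilization can be illustrated with a toy model. The Python sketch below is not the paper's mechanism or simulator. It only counts issue slots for a single two-way branch under three assumptions: serial masked execution, compaction where a thread may change warp but must stay in its own lane, and an idealized compaction where a thread may be placed in any lane. The warp size of 4 and all function names are hypothetical, chosen for readability.

```python
from math import ceil
from typing import Callable, List

WARP_SIZE = 4  # assumption: tiny warps for readability (real GPUs commonly use 32)

Warp = List[bool]  # per-lane branch outcome: True = "then" path, False = "else" path


def masked_utilization(warps: List[Warp]) -> float:
    """Utilization when each warp runs every non-empty path serially under a lane mask."""
    active = 0
    slots = 0
    for taken in warps:
        n_then = sum(taken)
        n_else = len(taken) - n_then
        passes = (n_then > 0) + (n_else > 0)  # one full-width issue per non-empty path
        active += n_then + n_else
        slots += passes * len(taken)
    return active / slots


def lane_bound_passes(warps: List[Warp], path: bool) -> int:
    """Issues needed for `path` if threads may change warp but must keep their lane."""
    per_lane = [sum(1 for w in warps if w[lane] == path) for lane in range(WARP_SIZE)]
    return max(per_lane)


def free_lane_passes(warps: List[Warp], path: bool) -> int:
    """Issues needed for `path` if threads may occupy any lane (idealized)."""
    total = sum(1 for w in warps for t in w if t == path)
    return ceil(total / WARP_SIZE)


def compacted_utilization(warps: List[Warp],
                          passes_for_path: Callable[[List[Warp], bool], int]) -> float:
    """Utilization after regrouping threads of the same path across warps."""
    active = sum(len(w) for w in warps)
    passes = passes_for_path(warps, True) + passes_for_path(warps, False)
    return active / (passes * WARP_SIZE)


# Two warps that diverge with the same lane pattern: lanes 0-1 take "then",
# lanes 2-3 take "else".
warps = [[True, True, False, False],
         [True, True, False, False]]

print(masked_utilization(warps))                        # 0.5
print(compacted_utilization(warps, lane_bound_passes))  # 0.5
print(compacted_utilization(warps, free_lane_passes))   # 1.0
```

In this example the lane-bound regrouping gains nothing, because the threads on each path sit in the same lanes of both warps, whereas regrouping with free lane placement fills every lane. The model ignores register file access, scheduling, and memory effects, so it says nothing about the reported 74.4% and 11.1% figures.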