Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU

LI Bingchao; WEI Jizeng; GUO Wei; SUN Jizhou

doi:10.1049/cje.2015.10.004

Volume 24 Issue 4

LI Bingchao, WEI Jizeng, GUO Wei, et al., “Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU,” Chinese Journal of Electronics, vol. 24, no. 4, pp. 684-688, 2015, doi: 10.1049/cje.2015.10.004

School of Computer Science and Technology, Tianjin University, Tianjin 300072, China

Funds: This work is supported by the National Natural Science Foundation of China (No. 61402321) and the Doctoral Fund of the Ministry of Education of China (No. 20110032120037).

Corresponding author: WEI Jizeng received the Ph.D. degree in computer science and technology from Tianjin University in 2010. He is now an assistant professor at the School of Computer Science and Technology, Tianjin University. His research interests include GPU architecture, non-volatile memory, and GPU for mobile. (Email: weijizeng@tju.edu.cn)
Received Date: 2013-12-30
Revision Received Date: 2014-07-14
Publish Date: 2015-10-10

Abstract

GPGPUs adopt the SIMT execution model, in which each logical thread in a warp corresponds to a SIMD lane yet can still follow an independent control flow. When branch divergence occurs and threads within a warp take different execution paths, GPGPUs have to execute each path serially through SIMD lane masking, which can decrease both SIMD utilization and performance. We propose an efficient thread compaction mechanism that handles branch divergence with a novel register file structure. We also develop a new thread scheduling policy that cooperates with our compaction mechanism. The simulation results show that our approach improves the SIMD utilization up to 74.4% and achieves a maximum performance speedup of 11.1% with small hardware overhead.
Keywords:
- Graphics processing unit (GPU)
- Single-instruction multiple-data (SIMD)
- Branch divergence
- Register file
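
To make the cost of lane masking concrete, the following is a minimal sketch of how SIMD utilization falls when a warp diverges, together with an idealized upper bound for what regrouping threads could recover. It is not the compaction mechanism proposed in the paper: the function names, the toy 4-lane warps, and the assumption that threads can be repacked freely (ignoring lane and register file constraints) are all hypothetical and chosen only for illustration.

```python
from math import ceil


def masked_utilization(warps, len_taken, len_not_taken):
    """SIMD utilization when each warp runs both branch paths serially
    with lane masking.

    warps: list of per-warp lists of bools (True = thread takes the branch).
    len_taken / len_not_taken: instruction counts of the two paths
    (hypothetical parameters for this sketch).
    """
    active = 0  # lane-slots doing useful work
    issued = 0  # lane-slots occupied by issued instructions
    for mask in warps:
        width = len(mask)
        taken = sum(mask)
        not_taken = width - taken
        if taken:  # the taken path is issued across the full SIMD width
            active += taken * len_taken
            issued += width * len_taken
        if not_taken:  # then the not-taken path, with the other lanes masked
            active += not_taken * len_not_taken
            issued += width * len_not_taken
    return active / issued


def ideal_compacted_utilization(warps, len_taken, len_not_taken):
    """Idealized upper bound: threads on the same path are repacked into
    as few warps as possible, ignoring any lane or register constraints."""
    width = len(warps[0])
    taken = sum(sum(mask) for mask in warps)
    not_taken = sum(len(mask) for mask in warps) - taken
    active = taken * len_taken + not_taken * len_not_taken
    issued = width * (ceil(taken / width) * len_taken
                      + ceil(not_taken / width) * len_not_taken)
    return active / issued


# Two toy 4-lane warps in which half the threads take the branch.
warps = [[True, False, True, False], [False, True, False, True]]
print(masked_utilization(warps, 1, 1))           # 0.5
print(ideal_compacted_utilization(warps, 1, 1))  # 1.0
```

In this toy case every warp issues both paths with half of its lanes masked, so utilization is 50%, while perfect regrouping would fill every lane. Real hardware cannot move threads between lanes freely, which is the kind of constraint a compaction mechanism has to address.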
