Citation: ZHANG Jun, HE Yanxiang, SHEN Fanfan, et al., "Memory Request Priority Based Warp Scheduling for GPUs", Chinese Journal of Electronics, Vol.27, No.5, pp.985-994, 2018, doi: 10.1049/cje.2018.05.003.
P. Du, J. Zhao, W. Pan, et al., "GPU accelerated real-time collision handling in virtual disassembly", Journal of Computer Science and Technology, Vol.30, No.3, pp.511-518, 2015.

L. Cheng and T. Li, "Efficient data redistribution to speedup big data analytics in large systems", Proc. of the 23rd International Conference on High Performance Computing, Hyderabad, India, pp.91-100, 2016.

E. Lindholm, J. Nickolls, S. Oberman, et al., "NVIDIA Tesla: A unified graphics and computing architecture", IEEE Micro, Vol.28, No.2, pp.39-55, 2008.

J. Nickolls and W.J. Dally, "The GPU computing era", IEEE Micro, Vol.30, No.2, pp.56-69, 2010.

Y. He, J. Zhang, F. Shen, et al., "Thread scheduling optimization of general purpose graphics processing unit: A survey", Chinese Journal of Computers, Vol.39, No.9, pp.1733-1749, 2016.

C. Nugteren, G. van den Braak and H. Corporaal, "Future of GPGPU micro-architectural parameters", Proc. of the Conference on Design, Automation and Test in Europe, Grenoble, France, pp.392-395, 2013.

J. Meng, D. Tarjan and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance", Proc. of the 37th International Symposium on Computer Architecture, Saint Malo, France, pp.235-246, 2010.

D. Tarjan, J. Meng and K. Skadron, "Increasing memory miss tolerance for SIMD cores", Proc. of the ACM/IEEE Conference on High Performance Computing, Portland, OR, USA, p.22, 2009.

J. Meng, J.W. Sheaffer and K. Skadron, "Robust SIMD: Dynamically adapted SIMD width and multi-threading depth", Proc. of the 26th International Parallel & Distributed Processing Symposium, Shanghai, China, pp.107-118, 2012.

T.G. Rogers, M. O'Connor and T.M. Aamodt, "Divergence-aware warp scheduling", Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, pp.99-110, 2013.

T.G. Rogers, M. O'Connor and T.M. Aamodt, "Cache-conscious wavefront scheduling", Proc. of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, Vancouver, British Columbia, Canada, pp.72-83, 2012.

Z. Zheng, Z. Wang and M. Lipasti, "Adaptive cache and concurrency allocation on GPGPUs", IEEE Computer Architecture Letters, Vol.14, No.2, pp.90-93, 2015.

D. Li, M. Rhu, D.R. Johnson, et al., "Priority-based cache allocation in throughput processors", Proc. of the 21st IEEE International Symposium on High Performance Computer Architecture, Burlingame, CA, USA, pp.89-100, 2015.

O. Kayiran, A. Jog, M.T. Kandemir, et al., "Neither more nor less: Optimizing thread-level parallelism for GPGPUs", Proc. of the 22nd International Conference on Parallel Architecture and Compilation Techniques, Edinburgh, United Kingdom, pp.157-166, 2013.

X. Chen, L.W. Chang, C.I. Rodrigues, et al., "Adaptive cache management for energy-efficient GPU computing", Proc. of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, United Kingdom, pp.343-355, 2014.

M. Lee, S. Song, J. Moon, et al., "Improving GPGPU resource utilization through alternative thread block scheduling", Proc. of the 20th International Symposium on High Performance Computer Architecture, Orlando, FL, USA, pp.260-271, 2014.

M. Gebhart, D. Johnson, D. Tarjan, et al., "Energy-efficient mechanisms for managing thread context in throughput processors", Proc. of the 38th International Symposium on Computer Architecture, San Jose, CA, USA, pp.235-246, 2011.

V. Narasiman, C.J. Lee, M. Shebanow, et al., "Improving GPU performance via large warps and two-level warp scheduling", Proc. of the 44th International Symposium on Microarchitecture, Porto Alegre, Brazil, pp.308-317, 2011.

A. Jog, O. Kayiran, N.C. Nachiappan, et al., "OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance", Proc. of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, Houston, TX, USA, pp.395-406, 2013.

W. Jia, K. Shaw and M. Martonosi, "MRPB: Memory request prioritization for massively parallel processors", Proc. of the 20th IEEE International Symposium on High Performance Computer Architecture, Orlando, FL, USA, pp.272-283, 2014.

X. Xie, Y. Liang, Y. Wang, et al., "Coordinated static and dynamic cache bypassing for GPUs", Proc. of the 21st IEEE International Symposium on High Performance Computer Architecture, Burlingame, CA, USA, pp.76-88, 2015.

S.Y. Lee and C.J. Wu, "Ctrl-C: Instruction-aware control loop based adaptive cache bypassing for GPUs", Proc. of the 34th IEEE International Conference on Computer Design, Scottsdale, AZ, USA, pp.133-140, 2016.

C. Li, S.L. Song, H. Dai, et al., "Locality-driven dynamic GPU cache bypassing", Proc. of the 29th ACM International Conference on Supercomputing, Newport Beach/Irvine, CA, USA, pp.67-77, 2015.

Y. Tian, S. Puthoor, J.L. Greathouse, et al., "Adaptive GPU cache bypassing", Proc. of the 8th Workshop on General Purpose Processing Using GPUs, San Francisco, CA, USA, pp.25-35, 2015.

S. Che, M. Boyer, J. Meng, et al., "Rodinia: A benchmark suite for heterogeneous computing", Proc. of the IEEE International Symposium on Workload Characterization, Austin, TX, USA, pp.44-54, 2009.

A. Bakhoda, G. Yuan, W.L. Fung, et al., "Analyzing CUDA workloads using a detailed GPU simulator", Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, pp.163-174, 2009.

NVIDIA, CUDA C Programming Guide, PG-02829-001 v6.5, 2014.

NVIDIA, Parallel Thread Execution ISA, Version 1.4, 2014.