Citation: PENG Xiyuan, YU Jinxiang, YAO Bowen, LIU Liansheng, PENG Yu, "A review of FPGA-based custom computing architecture for convolutional neural network inference", Chinese Journal of Electronics, Vol. 30, No. 1, pp. 1–17, 2021. doi: 10.1049/cje.2020.11.002
[1] Y. LeCun, L. Bottou, Y. Bengio, et al., "Gradient-based learning applied to document recognition", Proceedings of the IEEE, Vol. 86, No. 11, pp. 2278–2324, 1998. doi: 10.1109/5.726791
[2] S. Guo, B. Zhang, T. Yang, et al., "Multi-task convolutional neural network and information fusion for fault diagnosis and localization", IEEE Transactions on Industrial Electronics, Vol. 67, No. 9, pp. 8005–8015, 2020. doi: 10.1109/TIE.2019.2942548
[3] A. Krizhevsky, I. Sutskever and G.E. Hinton, "ImageNet classification with deep convolutional neural networks", International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, pp. 1097–1105, 2012.
[4] Z. Li, W. Yang, S. Peng, et al., "A survey of convolutional neural networks: analysis, applications, and prospects", arXiv preprint, arXiv: 2004.02806, 2020.
[5] J.T. Huang, J. Li and Y. Gong, "An analysis of convolutional neural networks for speech recognition", IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Queensland, Australia, pp. 4989–4993, 2015.
[6] D. Hernandez and T.B. Brown, "Measuring the algorithmic efficiency of neural networks", arXiv preprint, arXiv: 2005.04305, 2020.
[7] J. Zhou, H. Dai and H. Wang, "Lightweight convolution neural networks for mobile edge computing in transportation cyber physical systems", ACM Transactions on Intelligent Systems and Technology, Vol. 10, No. 6, pp. 1–20, 2019. doi: 10.1145/3339308
[8] W.J. Dally, Y. Turakhia and S. Han, "Domain-specific hardware accelerators", Communications of the ACM, Vol. 63, No. 7, pp. 48–57, 2020. doi: 10.1145/3361682
[9] N.P. Jouppi, C. Young, N. Patil, et al., "In-datacenter performance analysis of a tensor processing unit", Annual International Symposium on Computer Architecture, Toronto, ON, Canada, pp. 1–12, 2017.
[10] N.P. Jouppi, C. Young, N. Patil, et al., "A domain-specific architecture for deep neural networks", Communications of the ACM, Vol. 61, No. 9, pp. 50–59, 2018. doi: 10.1145/3154484
[11] Google AI, "Cloud TPUs: Google's second-generation tensor processing unit is coming to cloud", http://g.co/tpu, 2020-9-23.
[12] X. Niu, W. Luk and Y. Wang, "EURECA: On-chip configuration generation for effective dynamic data access", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 74–83, 2015.
[13] T. Chen, Z. Du, N. Sun, et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning", ACM SIGARCH Computer Architecture News, Vol. 41, No. 1, pp. 269–284, 2014. http://dl.acm.org/citation.cfm?id=2644865.2541967
[14] Y. Chen, T. Luo, S. Liu, et al., "DaDianNao: A machine-learning supercomputer", Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, pp. 609–622, 2014.
[15] D. Liu, T. Chen, S. Liu, et al., "PuDianNao: A polyvalent machine learning accelerator", ACM SIGARCH Computer Architecture News, Vol. 43, No. 1, pp. 369–381, 2015. doi: 10.1145/2786763.2694358
[16] Z. Du, R. Fasthuber, T. Chen, et al., "ShiDianNao: Shifting vision processing closer to the sensor", Annual International Symposium on Computer Architecture, Portland, OR, USA, pp. 92–104, 2015.
[17] S. Venkataramani, A. Ranjan, S. Banerjee, et al., "ScaleDeep: A scalable compute architecture for learning and evaluating deep networks", Annual International Symposium on Computer Architecture, Toronto, ON, Canada, pp. 13–26, 2017.
[18] Y.H. Chen, T. Krishna, J.S. Emer, et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks", IEEE Journal of Solid-State Circuits, Vol. 52, No. 1, pp. 127–138, 2016. http://ieeexplore.ieee.org/document/7418007
[19] Huawei, "Huawei Ascend 910 AI processor: High-performance AI processor for training", https://e.huawei.com/ascend-910, 2020-9-23.
[20] S. Wei and Y. Liu, "The principle and progress of dynamically reconfigurable computing technologies", Chinese Journal of Electronics, Vol. 29, No. 4, pp. 595–607, 2020. doi: 10.1049/cje.2020.05.002
[21] S.M. Trimberger, Field-Programmable Gate Array Technology, Springer Science+Business Media, New York, USA, pp. 34–35, 2012.
[22] T.J. Todman, G.A. Constantinides, S.J.E. Wilton, et al., "Reconfigurable computing: Architectures and design methods", IEE Proceedings-Computers and Digital Techniques, Vol. 152, No. 2, pp. 193–207, 2005. doi: 10.1049/ip-cdt:20045086
[23] V. Kathail, "Xilinx Vitis unified software platform", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, pp. 173–174, 2020.
[24] V. Kustikova, E. Vasiliev, A. Khvatov, et al., "Intel distribution of OpenVINO toolkit: A case study of semantic segmentation", International Conference on Analysis of Images, Social Networks and Texts, Kazan, Russia, pp. 11–23, 2019.
[25] J.E. Stone, D. Gohara and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems", Computing in Science and Engineering, Vol. 12, No. 3, pp. 66–73, 2010. doi: 10.1109/MCSE.2010.69
[26] K. Guo, S. Zeng, J. Yu, et al., "A survey of FPGA-based neural network inference accelerators", ACM Transactions on Reconfigurable Technology and Systems, Vol. 12, No. 1, pp. 1–26, 2019. doi: 10.1145/3289185
[27] E. Wang, J.J. Davis, R. Zhao, et al., "Deep neural network approximation for custom hardware: Where we've been, where we're going", ACM Computing Surveys, Vol. 52, No. 2, pp. 1–39, 2019. http://arxiv.org/abs/1901.06955
[28] A. Khan, A. Sohail, U. Zahoora, et al., "A survey of the recent architectures of deep convolutional neural networks", Artificial Intelligence Review, Vol. 53, No. 8, pp. 5455–5516, 2020. doi: 10.1007/s10462-020-09825-6
[29] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks", Neural Computing and Applications, Vol. 32, No. 4, pp. 1109–1139, 2020. doi: 10.1007/s00521-018-3761-1
[30] K. Abdelouahab, M. Pelcat, J. Serot, et al., "Accelerating CNN inference on FPGAs: A survey", arXiv preprint, arXiv: 1806.01683, 2018.
[31] M.P. Véstias, "A survey of convolutional neural networks on edge with reconfigurable computing", Algorithms, Vol. 12, No. 8, Article No. 154, 2019. doi: 10.3390/a12080154
[32] T. Wang, C. Wang, X. Zhou, et al., "A survey of FPGA based deep learning accelerators: Challenges and opportunities", arXiv preprint, arXiv: 1901.04988, 2019.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv preprint, arXiv: 1409.1556, 2014.
[34] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition", IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770–778, 2016.
[35] F. Iandola, M. Moskewicz, S. Karayev, et al., "DenseNet: Implementing efficient convnet descriptor pyramids", arXiv preprint, arXiv: 1404.1869, 2014.
[36] C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions", IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 1–9, 2015.
[37] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", International Conference on Machine Learning, Lille, France, pp. 448–456, 2015.
[38] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., "Rethinking the Inception architecture for computer vision", IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 2818–2826, 2016.
[39] F.N. Iandola, S. Han, M.W. Moskewicz, et al., "SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size", arXiv preprint, arXiv: 1602.07360, 2016.
[40] A. Gholami, K. Kwon, B. Wu, et al., "SqueezeNext: A hardware-aware neural network design", IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 1638–1647, 2018.
[41] A.G. Howard, M. Zhu, B. Chen, et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications", arXiv preprint, arXiv: 1704.04861, 2017.
[42] M. Sandler, A. Howard, M. Zhu, et al., "MobileNetV2: Inverted residuals and linear bottlenecks", IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 4510–4520, 2018.
[43] F. Chollet, "Xception: Deep learning with depthwise separable convolutions", IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 1251–1258, 2017.
[44] X. Zhang, X. Zhou, M. Lin, et al., "ShuffleNet: An extremely efficient convolutional neural network for mobile devices", IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 6848–6856, 2018.
[45] N. Ma, X. Zhang, H.T. Zheng, et al., "ShuffleNetV2: Practical guidelines for efficient CNN architecture design", European Conference on Computer Vision, Munich, Germany, pp. 116–131, 2018.
[46] J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks", International Conference on Artificial Neural Networks, Hamburg, Germany, pp. 281–290, 2014.
[47] C. Peng, X. Zhang, G. Yu, et al., "Large kernel matters - improve semantic segmentation by global convolutional network", IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 1743–1751, 2017.
[48] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks", arXiv preprint, arXiv: 1509.09308, 2015.
[49] J.H. Ko, B.A. Mudassar, T. Na, et al., "Design of an energy efficient accelerator for training of convolutional neural networks using frequency-domain computation", Annual Design Automation Conference, Austin, TX, USA, pp. 87–96, 2017.
[50] B. Zoph and Q.V. Le, "Neural architecture search with reinforcement learning", arXiv preprint, arXiv: 1611.01578, 2017.
[51] H. Pham, M.Y. Guan, B. Zoph, et al., "Efficient neural architecture search via parameter sharing", arXiv preprint, arXiv: 1802.03268, 2018.
[52] M. Tan, B. Chen, R. Pang, et al., "MnasNet: Platform-aware neural architecture search for mobile", IEEE International Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 2815–2823, 2019.
[53] M. Denil, B. Shakibi, L. Dinh, et al., "Predicting parameters in deep learning", International Conference on Neural Information Processing Systems, Stateline, NV, USA, pp. 2148–2156, 2013.
[54] J. Teich, "Hardware/software codesign: The past, the present, and predicting the future", Proceedings of the IEEE, Vol. 100, Special Centennial Issue, pp. 1411–1430, 2012.
[55] M. Courbariaux, Y. Bengio and J.P. David, "Training deep neural networks with low precision multiplications", arXiv preprint, arXiv: 1412.7024, 2014.
[56] B. Jacob, S. Kligys, B. Chen, et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference", IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 2704–2713, 2018.
[57] C. Zhang, G. Sun, Z. Fang, et al., "Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 38, No. 11, pp. 2072–2085, 2018. http://dl.acm.org/citation.cfm?id=2967011
[58] J. Qiu, J. Wang, S. Yao, et al., "Going deeper with embedded FPGA platform for convolutional neural network", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 26–35, 2016.
[59] L. Lai, N. Suda and V. Chandra, "Deep convolutional neural network inference with floating-point weights and fixed-point activations", arXiv preprint, arXiv: 1703.03073, 2017.
[60] E.H. Lee, D. Miyashita, E. Chai, et al., "LogNet: Energy-efficient neural networks using logarithmic computation", IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, pp. 5900–5904, 2017.
[61] A. Zhou, A. Yao, Y. Guo, et al., "Incremental network quantization: Towards lossless CNNs with low-precision weights", arXiv preprint, arXiv: 1702.03044, 2017.
[62] J. Wang, Q. Lou, X. Zhang, et al., "Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA", International Conference on Field Programmable Logic and Applications, Dublin, Ireland, pp. 163–1636, 2018.
[63] M. Courbariaux, Y. Bengio and J.P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations", International Conference on Neural Information Processing Systems, Montreal, QC, Canada, pp. 3123–3131, 2015.
[64] I. Hubara, M. Courbariaux, D. Soudry, et al., "Binarized neural networks", International Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 4107–4115, 2016.
[65] Z. Lin, M. Courbariaux, R. Memisevic, et al., "Neural networks with few multiplications", arXiv preprint, arXiv: 1510.03009, 2015.
[66] S. Han, H. Mao and W.J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding", arXiv preprint, arXiv: 1510.00149, 2015.
[67] Y. Guo, A. Yao and Y. Chen, "Dynamic network surgery for efficient DNNs", International Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 1379–1387, 2016.
[68] N. Lee, T. Ajanthan and P. Torr, "SNIP: Single-shot network pruning based on connection sensitivity", arXiv preprint, arXiv: 1810.02340, 2018.
[69] S. Srinivas and R.V. Babu, "Data-free parameter pruning for deep neural networks", arXiv preprint, arXiv: 1507.06149, 2015.
[70] W. Deng, W. Yin and Y. Zhang, "Group sparse optimization by alternating direction method", Proceedings of the International Society for Optical Engineering, Vol. 8858, 2013. doi: 10.1117/12.2024410
[71] B.Y. Liu, M. Wang, H. Foroosh, et al., "Sparse convolutional neural networks", IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 806–814, 2015.
[72] C. Tai, T. Xiao, Y. Zhang, et al., "Convolutional neural networks with low-rank regularization", arXiv preprint, arXiv: 1511.06067, 2015.
[73] M. Masana, J.V.D. Weijer, L. Herranz, et al., "Domain-adaptive deep network compression", IEEE International Conference on Computer Vision, Venice, Italy, pp. 4289–4297, 2017.
[74] J. Xue, J. Li, D. Yu, et al., "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network", IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, pp. 6359–6363, 2014.
[75] X. Zhang, J. Zou, X. Ming, et al., "Efficient and accurate approximations of nonlinear convolutional networks", IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 1984–1992, 2015.
[76] V. Lebedev and V. Lempitsky, "Fast convnets using group-wise brain damage", IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 2554–2564, 2016.
[77] M. Janzamin, H. Sedghi and A. Anandkumar, "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods", arXiv preprint, arXiv: 1506.08473, 2015.
[78] Y.D. Kim, E. Park, S. Yoo, et al., "Compression of deep convolutional neural networks for fast and low power mobile applications", arXiv preprint, arXiv: 1511.06530, 2015.
[79] G. Hinton, O. Vinyals and J. Dean, "Distilling the knowledge in a neural network", arXiv preprint, arXiv: 1503.02531, 2015.
[80] A. Romero, N. Ballas, S.E. Kahou, et al., "FitNets: Hints for thin deep nets", arXiv preprint, arXiv: 1412.6550, 2014.
[81] P. Luo, Z.Y. Zhu, Z.W. Liu, et al., "Face model compression by distilling knowledge from neurons", AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, pp. 3560–3566, 2016.
[82] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer", arXiv preprint, arXiv: 1612.03928, 2016.
[83] L. Theis, I. Korshunova, A. Tejani, et al., "Faster gaze prediction with dense networks and Fisher pruning", arXiv preprint, arXiv: 1801.05787, 2018.
[84] J. Yim, D. Joo, J. Bae, et al., "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning", IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 4133–4141, 2017.
[85] Y. He, J. Lin, Z. Liu, et al., "AMC: AutoML for model compression and acceleration on mobile devices", European Conference on Computer Vision, Munich, Germany, pp. 784–800, 2018.
[86] N. Liu, X.L. Ma, Z.Y. Xu, et al., "AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates", arXiv preprint, arXiv: 1907.03141, 2019.
[87] J.X. Wu, Y. Zhang, J.L. Hou, et al., "PocketFlow: An automated framework for compressing and accelerating deep neural networks", NIPS 2018 Workshop on Compact Deep Neural Networks with Industrial Applications, Montreal, QC, Canada, https://openreview.net/forum?id=H1fWoYhdim, 2020-2-26.
[88] G. Ofenbeck, R. Steinmann, V. Caparros, et al., "Applying the roofline model", IEEE International Symposium on Performance Analysis of Systems and Software, Monterey, CA, USA, pp. 76–85, 2014.
[89] E. Wu, X.Q. Zhang, D. Berman, et al., "A high-throughput reconfigurable processing array for neural networks", International Conference on Field Programmable Logic and Applications, Ghent, Belgium, pp. 1–4, 2017.
[90] J. Zhang, W. Zhang, G. Luo, et al., "Frequency improvement of systolic array-based CNNs on FPGAs", IEEE International Symposium on Circuits and Systems, Sapporo, Hokkaido, Japan, pp. 1–4, 2019.
[91] C. Zhang, P. Li, G. Sun, et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 161–170, 2015.
[92] Y. Ma, Y. Cao, S. Vrudhula, et al., "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 45–54, 2017.
[93] Q.C. Xiao, Y. Liang, L.Q. Lu, et al., "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs", Annual Design Automation Conference, Austin, TX, USA, pp. 1–6, 2017.
[94] K.Y. Guo, L.Z. Sui, J.T. Qiu, et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 37, No. 1, pp. 35–47, 2017. http://ieeexplore.ieee.org/document/7930521/
[95] Y. Shen, M. Ferdman and P. Milder, "Maximizing CNN accelerator efficiency through resource partitioning", Annual International Symposium on Computer Architecture, Toronto, ON, Canada, pp. 535–547, 2017.
[96] X. Wei, Y. Liang, X. Li, et al., "TGPA: Tile-grained pipeline architecture for low latency CNN inference", International Conference on Computer-Aided Design, San Diego, CA, USA, pp. 1–8, 2018.
[97] X. Wei, C.H. Yu, P. Zhang, et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs", Annual Design Automation Conference, Austin, TX, USA, pp. 1–6, 2017.
[98] Y. Li, S. Lu, J. Luo, et al., "High-performance convolutional neural network accelerator based on systolic arrays and quantization", International Conference on Signal and Image Processing, Wuxi, China, pp. 335–339, 2019.
[99] M. Alwani, H. Chen, M. Ferdman, et al., "Fused-layer CNN accelerators", Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, China, pp. 1–12, 2016.
[100] A. Erdem, D. Babic and C. Silvano, "A tile-based fused-layer approach to accelerate DCNNs on low-density FPGAs", IEEE International Conference on Electronics, Circuits and Systems, Genoa, Italy, pp. 37–40, 2019.
[101] M. Samragh, M. Ghasemzadeh and F. Koushanfar, "Customizing neural networks for efficient FPGA implementation", International Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, pp. 85–92, 2017.
[102] J. Gustafson and I. Yonemoto, "Beating floating point at its own game: Posit arithmetic", Supercomputing Frontiers and Innovations, Vol. 4, No. 2, pp. 71–86, 2017. http://www.researchgate.net/publication/322151112_Beating_floating_point_at_its_own_game_Posit_arithmetic
[103] S.H.F. Langroudi, T. Pandit and D. Kudithipudi, "Deep learning inference on embedded devices: Fixed-point vs posit", arXiv preprint, arXiv: 1805.08624, 2018.
[104] J. Johnson, "Rethinking floating point for deep learning", arXiv preprint, arXiv: 1811.01721, 2018.