YOLO-Drone: A Scale-Aware Detector for Drone Vision

LI Yutong; MA Miao; LIU Shichang; YAO Chao; GUO Longjiang

doi:10.23919/cje.2023.00.254

Yutong LI, Miao MA, Shichang LIU, et al., “YOLO-Drone: A Scale-Aware Detector for Drone Vision,” Chinese Journal of Electronics, vol. 33, no. 5, pp. 1–13, 2024. doi: 10.23919/cje.2023.00.254

LI Yutong^1, MA Miao^{1,2}, LIU Shichang^{1,3}, YAO Chao^1, GUO Longjiang^{1,2}

1. School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
2. Key Laboratory of Modern Teaching Technology, Ministry of Education, Xi’an 710062, China
3. College of Computer Science, Sichuan University, Chengdu 610065, China

Author Bio:
Yutong LI is currently an M.E. candidate at the School of Computer Science, Shaanxi Normal University, Xi’an, China. Her research interests include image processing, action recognition and temporal action localization. (Email: liyutongstu@snnu.edu.cn)

Miao MA received the Ph.D. degree in Signal and Information Processing from Northwestern Polytechnical University, Xi’an, China, in 2005. She was a Visiting Scholar at Carnegie Mellon University, USA, from 2013 to 2014. She is currently a Professor at the School of Computer Science, Shaanxi Normal University, Xi’an, China. Her research interests include image processing and video analysis on educational big data. (Email: mmthp@snnu.edu.cn)

Shichang LIU received the M.E. degree from Shaanxi Normal University, Xi’an, China, in 2023. He is currently a Ph.D. candidate at the College of Computer Science, Sichuan University. His research interests include object detection, pose estimation, and low-light image enhancement. (Email: lsc@ieee.org)

Chao YAO received the B.Sc. degree in Telecommunications Engineering in 2007 and the Ph.D. degree in Communication and Information Systems in 2014, both from Xidian University, Xi’an, China. He was a visiting student at the Center for Pattern Recognition and Machine Intelligence, Montreal, Canada, from 2010 to 2011. His research interests include feature extraction, handwritten character recognition, machine learning, and pattern recognition. (Email: 2002yaochao@gmail.com)

Longjiang GUO received the Ph.D. degree from the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He was a Visiting Scholar with Georgia State University, Atlanta, GA, USA, in 2009 and 2013. He is currently a Professor at the School of Computer Science, Shaanxi Normal University, Xi’an, China. His current research interests include AI education and learning technologies. (Email: longjiangguo@snnu.edu.cn)
Corresponding author: Email: mmthp@snnu.edu.cn
Received Date: 2023-07-20
Accepted Date: 2022-11-10

Available Online: 2022-03-22

Abstract

Object detection is an important task in drone vision. However, because the number of objects and their scales vary greatly in drone-captured video, extracting features for small objects becomes the bottleneck of model performance, and most existing object detectors underperform in drone-vision scenes. To address these problems, we propose a novel detector named YOLO-Drone. First, the backbone of YOLO is replaced with ConvNeXt, a state-of-the-art backbone, to extract more discriminative features. Then, a novel scale-aware attention (SAA) module is designed in the detection head to handle the large disparity in object scale. A scale-sensitive loss (SSL) is also introduced to place more emphasis on object scale and enhance the discriminative ability of the detector. Experimental results on the latest VisDrone 2022 test-challenge dataset (detection track) show that our detector achieves an AP of 39.43%, which ties the previous state of the art (SOTA) while reducing the computational cost by 39.8%.
Keywords:
- Drone vision
- object detection
- scale-aware attention
- scale-sensitive loss
- VisDrone dataset
