Citation: Jin ZHENG, Botao JIANG, Wei PENG, et al., “Multi-scale binocular stereo matching based on semantic association,” Chinese Journal of Electronics, vol. 33, no. 5, pp. 1–13, 2024. doi: 10.23919/cje.2022.00.338
[1] Z. L. Shen, X. B. Song, Y. C. Dai, et al., “Digging into uncertainty-based pseudo-label for robust stereo matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 14301–14320, 2023. doi: 10.1109/TPAMI.2023.3300976
[2] J. P. Jing, J. K. Li, P. F. Xiong, et al., “Uncertainty guided adaptive warping for robust and efficient stereo matching,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 3295–3304, 2023.
[3] G. W. Xu, X. Q. Wang, X. H. Ding, et al., “Iterative geometry encoding volume for stereo matching,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, pp. 21919–21928, 2023.
[4] C. Y. Chen, A. Seff, A. Kornhauser, et al., “DeepDriving: Learning affordance for direct perception in autonomous driving,” in Proceedings of 2015 IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2722–2730, 2015.
[5] Y. Wang, B. Yang, R. Hu, et al., “PLUMENet: Efficient 3D object detection from stereo images,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, pp. 3383–3390, 2021.
[6] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008. doi: 10.1109/TPAMI.2007.1166
[7] A. Klaus, M. Sormann, and K. Karner, “Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure,” in 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, pp. 15–18, 2006.
[8] W. J. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 5695–5703, 2016.
[9] J. H. Pang, W. X. Sun, J. S. Ren, et al., “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” in Proceedings of 2017 IEEE International Conference on Computer Vision Workshops, Venice, Italy, pp. 878–886, 2017.
[10] M. Poggi, F. Tosi, K. Batsos, et al., “On the synergies between machine learning and binocular stereo for depth estimation from images: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5314–5334, 2021. doi: 10.1109/TPAMI.2021.3070917
[11] V. Kolmogorov and R. Zabih, “Computing visual correspondence with occlusions using graph cuts,” in Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vancouver, BC, Canada, pp. 508–515, 2001.
[12] J. Sun, N. N. Zheng, and H. Y. Shum, “Stereo matching using belief propagation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787–800, 2003. doi: 10.1109/TPAMI.2003.1206509
[13] A. Hosni, C. Rhemann, M. Bleyer, et al., “Fast cost-volume filtering for visual correspondence and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 504–511, 2013. doi: 10.1109/TPAMI.2012.156
[14] K. J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 650–656, 2006. doi: 10.1109/TPAMI.2006.70
[15] J. Sun, Y. Li, S. B. Kang, et al., “Symmetric stereo matching for occlusion handling,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, pp. 399–406, 2005.
[16] Q. Yang, L. Wang, R. Yang, et al., “Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 492–504, 2009. doi: 10.1109/TPAMI.2008.99
[17] J. R. Chang and Y. S. Chen, “Pyramid stereo matching network,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 5410–5418, 2018.
[18] F. H. Zhang, V. Prisacariu, R. G. Yang, et al., “GA-Net: Guided aggregation net for end-to-end stereo matching,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 185–194, 2019.
[19] X. Y. Guo, K. Yang, W. K. Yang, et al., “Group-wise correlation stereo network,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 3268–3277, 2019.
[20] Y. M. Zhang, Y. M. Chen, X. Bai, et al., “Adaptive unimodal cost volume filtering for deep stereo matching,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, pp. 12926–12934, 2020.
[21] Z. B. Rao, M. Y. He, Y. C. Dai, et al., “NLCA-Net: A non-local context attention network for stereo matching,” APSIPA Transactions on Signal and Information Processing, vol. 9, article no. e18, 2020. doi: 10.1017/ATSIP.2020.16
[22] S. Y. Chen, Z. Y. Xiang, C. Y. Qiao, et al., “SGNet: Semantics guided deep stereo matching,” in Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, Japan, pp. 106–122, 2020.
[23] H. Laga, L. V. Jospin, F. Boussaid, et al., “A survey on deep learning techniques for stereo-based depth estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1738–1764, 2020. doi: 10.1109/TPAMI.2020.3032602
[24] X. D. Gu, Z. W. Fan, S. Y. Zhu, et al., “Cascade cost volume for high-resolution multi-view stereo and stereo matching,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 2492–2501, 2020.
[25] F. J. H. Wang, S. Galliani, C. Vogel, et al., “PatchmatchNet: Learned multi-view patchmatch stereo,” in Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp. 14189–14198, 2021.
[26] H. Liu, R. Wang, Y. P. Xia, et al., “Improved cost computation and adaptive shape guided filter for local stereo matching of low texture stereo images,” Applied Sciences, vol. 10, no. 5, article no. 1869, 2020. doi: 10.3390/app10051869
[27] B. L. Lu, Y. He, and H. N. Wang, “Stereo disparity optimization with depth change constraint based on a continuous video,” Displays, vol. 69, article no. 102073, 2021. doi: 10.1016/j.displa.2021.102073
[28] S. Gidaris and N. Komodakis, “Detect, replace, refine: Deep structured prediction for pixel wise labeling,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 7187–7196, 2017.
[29] A. Kendall, H. Martirosyan, S. Dasgupta, et al., “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, pp. 66–75, 2017.
[30] F. Güney and A. Geiger, “Displets: Resolving stereo ambiguities using object knowledge,” in Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 4165–4175, 2015.
[31] G. R. Yang, H. S. Zhao, J. P. Shi, et al., “SegStereo: Exploiting semantic information for disparity estimation,” in Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, pp. 660–676, 2018.
[32] J. M. Zhang, K. A. Skinner, R. Vasudevan, et al., “DispSegNet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1162–1169, 2019. doi: 10.1109/LRA.2019.2894913
[33] Z. Y. Wu, X. Y. Wu, X. P. Zhang, et al., “Semantic stereo matching with pyramid cost volumes,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), pp. 7483–7492, 2019.
[34] S. Y. Chen, Z. Y. Xiang, C. Y. Qiao, et al., “PGNet: Panoptic parsing guided deep stereo matching,” Neurocomputing, vol. 463, pp. 609–622, 2021. doi: 10.1016/j.neucom.2021.08.041
[35] W. Y. Liu, Y. D. Wen, Z. D. Yu, et al., “Large-margin softmax loss for convolutional neural networks,” in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, pp. 507–516, 2016.
[36] H. W. Sang, Q. H. Wang, and Y. Zhao, “Multi-scale context attention network for stereo matching,” IEEE Access, vol. 7, pp. 15152–15161, 2019. doi: 10.1109/ACCESS.2019.2895271
[37] G. H. Zhang, D. C. Zhu, W. J. Shi, et al., “Multi-dimensional residual dense attention network for stereo matching,” IEEE Access, vol. 7, pp. 51681–51690, 2019. doi: 10.1109/ACCESS.2019.2911618
[38] G. Y. Huang, Y. Y. Gong, Q. Z. Xu, et al., “A convolutional attention residual network for stereo matching,” IEEE Access, vol. 8, pp. 50828–50842, 2020. doi: 10.1109/ACCESS.2020.2980243
[39] S. Woo, J. Park, J. Y. Lee, et al., “CBAM: Convolutional block attention module,” in Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, pp. 3–19, 2018.
[40] X. L. Wang, R. Girshick, A. Gupta, et al., “Non-local neural networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 7794–7803, 2018.
[41] Y. Yao, Z. Z. Luo, S. W. Li, et al., “Recurrent MVSNet for high-resolution multi-view stereo depth inference,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 5520–5529, 2019.
[42] Q. S. Xu, W. H. Kong, W. B. Tao, et al., “Multi-scale geometric consistency guided and planar prior assisted multi-view stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4945–4963, 2023. doi: 10.1109/TPAMI.2022.3200074
[43] S. Cheng, Z. X. Xu, S. L. Zhu, et al., “Deep stereo using adaptive thin volume representation with uncertainty awareness,” in Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 2521–2531, 2020.
[44] G. Zhang, Z. Y. Li, J. M. Li, et al., “CFNet: Cascade fusion network for dense prediction,” arXiv preprint, arXiv: 2302.06052, 2023.
[45] Z. L. Shen, Y. C. Dai, and Z. B. Rao, “CFNet: Cascade and fused cost volume for robust stereo matching,” in Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp. 13901–13910, 2021.
[46] G. W. Xu, J. D. Cheng, P. Guo, et al., “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 12971–12980, 2022.
[47] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, pp. 3354–3361, 2012.
[48] J. Ku, A. Harakeh, and S. L. Waslander, “In defense of classical image processing: Fast depth completion on the CPU,” in 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, pp. 16–22, 2018.
[49] K. M. He, X. Y. Zhang, S. Q. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770–778, 2016.