Volume 32 Issue 3
May 2023
Citation: FAN Jiaqing, ZHANG Kaihua, ZHAO Yaqian, et al., “Unsupervised Video Object Segmentation via Weak User Interaction and Temporal Modulation,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 507-518, 2023, doi: 10.23919/cje.2022.00.139

Unsupervised Video Object Segmentation via Weak User Interaction and Temporal Modulation

doi: 10.23919/cje.2022.00.139
Funds: This work was supported by the National Key Research and Development Program (2021ZD0112200) and the National Natural Science Foundation of China (U21B2044)
More Information
  • Author Bios:

    Jiaqing FAN received the M.S. degree from the School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China, in 2019. He is currently pursuing the Ph.D. degree with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China. His research interests include video object segmentation. (Email: jqfan@nuaa.edu.cn)

    Qingshan LIU (corresponding author) is a Professor with Nanjing University of Information Science and Technology, Nanjing, China. He received the Ph.D. degree from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, in 2003. From 2010 to 2011, he was an Assistant Research Professor with the Department of Computer Science, Computational Biomedicine Imaging and Modeling Center, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA. His current research interests are image and vision analysis, including face image analysis and graph- and hypergraph-based image and video understanding. (Email: qsliu@nuist.edu.cn)

  • Received Date: 2022-05-18
  • Accepted Date: 2022-09-26
  • Available Online: 2022-10-09
  • Publish Date: 2023-05-05
  • Abstract: In unsupervised video object segmentation (UVOS), the lack of initial prior information may cause the wrong target to be segmented throughout the whole video. In semi-supervised video object segmentation (SVOS), a fine-grained pixel-level mask in the initial video frame is essential to good segmentation accuracy, yet providing accurate pixel-level masks for every training sequence is expensive and laborious. To address this issue, we present a weakly interactive UVOS approach guided by a simple human-drawn rectangle annotation in the initial frame. We first interactively draw a rectangle around the region of interest, and then leverage Mask R-CNN (region-based convolutional neural network) to generate a set of coarse reference labels for subsequent mask propagation. To establish temporal correspondence between coherent frames, we further design two novel temporal modulation modules to enhance the target representations. We compute an earth mover's distance (EMD)-based similarity between coherent frames to mine the co-occurrent objects in the two images, which modulates the target representation to highlight the foreground target. We also design a cross-squeeze temporal modulation module that emphasizes co-occurrent features across frames, further enhancing the foreground target representation. Finally, we augment the temporally modulated representations with the original representation to obtain composite spatio-temporal information, producing a more accurate video object segmentation (VOS) model. Experimental results on UVOS and SVOS datasets, including DAVIS 2016, FBMS, YouTube-VOS, and DAVIS 2017, show that our method yields favorable accuracy and complexity. The related code is available.
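
    To make the cross-squeeze idea concrete, the sketch below is a hypothetical PyTorch rendering, not the authors' released implementation: channel descriptors squeezed from each frame gate the channels of the other frame, so channels responding to co-occurrent objects are emphasized, and the modulated features are added back onto the originals (the residual augmentation mentioned in the abstract). The class name, the shared gating MLP, and the exact residual form are illustrative assumptions in the spirit of squeeze-and-excitation gating; the EMD-based module would additionally need a differentiable optimal-transport solver and is omitted here.

    ```python
    import torch
    import torch.nn as nn

    class CrossSqueezeModulation(nn.Module):
        """Hypothetical sketch of cross-squeeze temporal modulation.

        Each frame's features are squeezed into a per-channel descriptor,
        which then re-weights (excites) the channels of the *other* frame,
        emphasizing channels that respond to co-occurrent objects.
        """

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # SE-style bottleneck MLP producing per-channel gates in (0, 1);
            # sharing it across both directions is an assumption.
            self.gate = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, feat_ref: torch.Tensor, feat_cur: torch.Tensor):
            # feat_ref, feat_cur: (B, C, H, W) features of two coherent frames.
            # Squeeze: global average pooling gives one descriptor per channel.
            desc_ref = feat_ref.mean(dim=(2, 3))            # (B, C)
            desc_cur = feat_cur.mean(dim=(2, 3))            # (B, C)
            # Cross-excite: each frame is gated by the other frame's descriptor.
            gate_for_cur = self.gate(desc_ref)[:, :, None, None]  # (B, C, 1, 1)
            gate_for_ref = self.gate(desc_cur)[:, :, None, None]
            # Residual augmentation keeps the original representation intact.
            out_cur = feat_cur + feat_cur * gate_for_cur
            out_ref = feat_ref + feat_ref * gate_for_ref
            return out_ref, out_cur

    if __name__ == "__main__":
        # Smoke test on random features; shapes are preserved.
        m = CrossSqueezeModulation(channels=256)
        ref = torch.randn(2, 256, 64, 64)
        cur = torch.randn(2, 256, 64, 64)
        ref_mod, cur_mod = m(ref, cur)
        print(ref_mod.shape, cur_mod.shape)  # torch.Size([2, 256, 64, 64]) twice
    ```

    In this toy form the gating is purely channel-wise and symmetric; any spatial alignment between frames (which the paper's EMD-based similarity provides) would have to come from a separate matching step.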
