Citation: FAN Jiaqing, ZHANG Kaihua, ZHAO Yaqian, et al., “Unsupervised Video Object Segmentation via Weak User Interaction and Temporal Modulation,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 507–518, 2023, doi: 10.23919/cje.2022.00.139