Volume 31 Issue 1
Jan.  2022
LI Yanshan, GUO Tianyu, LIU Xing, LUO Wenhan, XIE Weixin. Action Status Based Novel Relative Feature Representations for Interaction Recognition[J]. Chinese Journal of Electronics, 2022, 31(1): 168-180. doi: 10.1049/cje.2020.00.088

Action Status Based Novel Relative Feature Representations for Interaction Recognition

doi: 10.1049/cje.2020.00.088
Funds: This work was partially supported by the National Natural Science Foundation of China (62076165, 61771319, 61871154), the Natural Science Foundation of Guangdong Province (2019A1515011307), the Shenzhen Science and Technology Project (JCYJ20180507182259896), and other projects (2020KCXTD004, WDZC20195500201).
More Information
  • Author Bios:

    LI Yanshan (corresponding author) received the Ph.D. degree from the South China University of Technology. He is currently a Researcher and Doctoral Supervisor with the ATR National Key Laboratory of Defense Technology, Shenzhen University. His research interests include computer vision, machine learning, and image analysis. (Email: lys@szu.edu.cn)

    GUO Tianyu received the B.E. degree in information engineering from Shenzhen University. He is a Member of the ATR National Key Laboratory of Defense Technology, Shenzhen University. His research interests include computer vision, machine learning, and action recognition. (Email: 2016130145@email.szu.edu.cn)

    LIU Xing received the Ph.D. degree from the Huazhong University of Science and Technology. She is currently a Post-doctoral Fellow with the ATR National Key Laboratory of Defense Technology, Shenzhen University. Her research interests include computer vision, machine learning, and activity recognition. (Email: xingliu@szu.edu.cn)

    LUO Wenhan received the Ph.D. degree from Imperial College London, UK, in 2016. His research interests include several topics in computer vision and machine learning, such as motion analysis (especially object tracking), image/video quality restoration, object detection and recognition, and reinforcement learning.

    XIE Weixin received the degree from Xidian University, Xi’an, China. He is currently with the School of Information Engineering, Shenzhen University, China. His research interests include intelligent information processing, fuzzy information processing, image processing, and pattern recognition.

  • Received Date: 2020-03-30
  • Accepted Date: 2020-06-22
  • Available Online: 2021-09-28
  • Publish Date: 2022-01-05
  • Skeleton-based action recognition has long been an important research topic in computer vision. Most research in this field focuses on actions performed by a single person, while little work is dedicated to recognizing interactions between two people. Interaction recognition is arguably the more critical problem in practice, since actions in everyday scenes are often performed by multiple people. Designing an effective scheme to learn discriminative spatial and temporal representations for skeleton-based interaction recognition remains challenging. Focusing on the characteristics of skeleton data for interactions, we first define the moving distance to distinguish the action status of the participants. We then propose view-invariant relative features that fully represent the spatial and temporal relationships in a skeleton sequence, together with a new coding method that produces the novel relative feature representations. Finally, we design a three-stream CNN model to learn deep features for interaction recognition. We evaluate our method on the SBU dataset, the NTU RGB+D 60 dataset, and the NTU RGB+D 120 dataset. The experimental results verify that our method is effective and exhibits strong robustness compared with current state-of-the-art methods.
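  • The abstract's two key ideas — a per-person moving distance that separates the active from the passive participant, and view-invariant relative features between the two skeletons — can be sketched in code. The following is only an illustration under assumed definitions, not the paper's exact formulation: the function names, the sum-of-frame-displacements definition of moving distance, and the use of pairwise inter-person joint distances (invariant to rigid camera rotation and translation) are all assumptions for this sketch.

```python
import numpy as np

def moving_distance(skeleton):
    """Total joint displacement of one person over a skeleton sequence.

    skeleton: array of shape (T, J, 3) -- T frames, J joints, 3-D coordinates.
    """
    # Per-joint Euclidean displacement between consecutive frames: (T-1, J)
    frame_diffs = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=-1)
    return frame_diffs.sum()

def action_status(person_a, person_b):
    """Label the participant who moves more as 'active', the other as 'passive'."""
    da, db = moving_distance(person_a), moving_distance(person_b)
    return ("active", "passive") if da >= db else ("passive", "active")

def relative_features(person_a, person_b):
    """Per-frame pairwise distances between the joints of the two skeletons.

    Euclidean distances are unchanged by rigid camera rotation and
    translation, which is one simple route to view invariance.
    Returns an array of shape (T, J, J).
    """
    # Broadcast to (T, J, J, 3), then reduce the coordinate axis.
    diff = person_a[:, :, None, :] - person_b[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

    With a moving person A and a stationary person B, `action_status(A, B)` returns `("active", "passive")`, and `relative_features` yields a T-by-J-by-J distance volume that a CNN could consume as an image-like input.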

    Figures(9)  / Tables(6)

    Article Metrics

    Article views (277) PDF downloads(28) Cited by()
    Proportional views
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return