Citation: Hao LIU, Zhiquan FENG, and Qingbei GUO, “Multimodal Cross-Attention Mechanism-Based Algorithm for Elderly Behavior Monitoring and Recognition,” Chinese Journal of Electronics, vol. 34, no. 1, pp. 1–13, 2025. doi: 10.23919/cje.2023.00.263

Multimodal Cross-Attention Mechanism-Based Algorithm for Elderly Behavior Monitoring and Recognition

doi: 10.23919/cje.2023.00.263
More Information
  • Author Bio:

    Hao LIU is currently pursuing the M.S. degree at the University of Jinan, Jinan, China. His primary research interests include deep learning, computer vision, and neural networks. (Email: 837515265@qq.com)

    Zhiquan FENG received the Ph.D. degree in computer software and theory from Shandong University, Jinan, China, in 2006. He is currently the Executive Deputy Director of the Shandong Province Key Laboratory of Intelligent Computing Technology for Networked Environments, Jinan, China. His research interests include intelligent perception, natural interaction, and virtual reality. (Email: ise_fengzq@ujn.edu.cn)

    Qingbei GUO received the M.S. degree in computer science and technology from Shandong University, Jinan, China, in 2006, and the Ph.D. degree in artificial intelligence from Jiangnan University, Wuxi, China, in 2021. He is currently an Associate Professor at the School of Information Science and Engineering, University of Jinan, Jinan, China, and a Member of the Shandong Provincial Key Laboratory of Network-Based Intelligent Computing. His main research interests include wireless sensor networks, deep learning/machine learning, computer vision, and neural networks. (Email: ise_guoqb@ujn.edu.cn)

  • Corresponding author: 837515265@qq.com
  • Received Date: 2023-07-29
  • Accepted Date: 2023-11-10
  • Available Online: 2024-02-29
  • In contrast to the general population, behavior recognition for the elderly is more specific and more difficult, which makes the reliability and usability of elderly safety monitoring systems particularly challenging. This study therefore proposes a multimodal perception-based solution for an elderly safety monitoring and recognition system. The proposed approach introduces a recognition algorithm based on a multimodal cross-attention mechanism that innovatively incorporates complex information such as scene context and voice to achieve more accurate behavior recognition. By fusing four modalities, namely image, skeleton, sensor data, and audio, the recognition accuracy is further improved. In addition, a novel human-robot interaction mode is introduced in which the system maps directly recognized intentions to robotic actions without explicit commands, delivering a more natural and efficient elderly assistance paradigm. This mode not only raises the level of safety monitoring for the elderly but also supports a more natural caregiving approach. Experimental results demonstrate a significant improvement in recognition accuracy for 11 typical elderly behaviors compared with existing methods.
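
To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of cross-attention fusion over the four named modalities, in which each modality's features query the other modalities before a joint classifier. The feature dimensions, sequence lengths, module layout, and the 11-class linear head are illustrative assumptions only; the paper's actual backbones and fusion details are not reproduced here.

import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    # Illustrative sketch: each modality's token sequence attends to the
    # concatenated tokens of the other three modalities, so complementary
    # cues (e.g., scene context or voice) can re-weight its features
    # before classification over 11 assumed behavior classes.
    def __init__(self, dim=256, num_heads=4, num_classes=11):
        super().__init__()
        self.modalities = ["image", "skeleton", "sensor", "audio"]
        self.cross_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for m in self.modalities
        })
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim * len(self.modalities), num_classes)

    def forward(self, feats):
        # feats[m]: (batch, seq_len_m, dim) feature sequence for modality m
        fused = []
        for m in self.modalities:
            query = feats[m]
            # Keys/values: all other modalities concatenated along the token axis.
            context = torch.cat([feats[o] for o in self.modalities if o != m], dim=1)
            attended, _ = self.cross_attn[m](query, context, context)
            # Residual connection, then mean-pool over tokens.
            fused.append(self.norm(query + attended).mean(dim=1))
        return self.classifier(torch.cat(fused, dim=-1))  # (batch, num_classes)


# Example usage with random per-modality features for a batch of 2 clips.
model = CrossModalFusion()
batch = {
    "image": torch.randn(2, 16, 256),
    "skeleton": torch.randn(2, 32, 256),
    "sensor": torch.randn(2, 64, 256),
    "audio": torch.randn(2, 20, 256),
}
print(model(batch).shape)  # torch.Size([2, 11])

This per-modality query pattern is only one plausible reading of "multimodal cross-attention"; the intention-to-action mapping for the human-robot interaction mode described in the abstract is outside the scope of this sketch.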