Citation: Hao LIU, Zhiquan FENG, and Qingbei GUO, “Multimodal Cross-Attention Mechanism-Based Algorithm for Elderly Behavior Monitoring and Recognition,” Chinese Journal of Electronics, vol. 34, no. 1, pp. 1–13, 2025. doi: 10.23919/cje.2023.00.263
[1] United Nations, “World population prospects 2022,” Available at: https://population.un.org/wpp/, 2022.
[2] J. Bohg, K. Hausman, B. Sankaran, et al., “Interactive perception: Leveraging action in perception and perception in action,” IEEE Transactions on Robotics, vol. 33, no. 6, pp. 1273–1291, 2017. doi: 10.1109/TRO.2017.2721939
[3] Y. Dobrev, S. Flores, and M. Vossiek, “Multi-modal sensor fusion for indoor mobile robot pose estimation,” in IEEE/ION Position, Location and Navigation Symposium (PLANS), Savannah, GA, USA, pp. 553–556, 2016.
[4] X. K. Deng, Z. X. Zhang, A. Sintov, et al., “Feature-constrained active visual SLAM for mobile robot navigation,” in IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, pp. 7233–7238, 2018.
[5] T. Baltrušaitis, C. Ahuja, and L. P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019. doi: 10.1109/TPAMI.2018.2798607
[6] J. Ngiam, A. Khosla, M. Kim, et al., “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, pp. 689–696, 2011.
[7] N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, pp. 2222–2230, 2012.
[8] F. Faghri, D. J. Fleet, J. R. Kiros, et al., “VSE++: Improving visual-semantic embeddings with hard negatives,” in Proceedings of the British Machine Vision Conference, Newcastle, UK, 2018.
[9] S. Mittal, S. C. Raparthy, I. Rish, et al., “Compositional attention: Disentangling search and retrieval,” in Proceedings of the 10th International Conference on Learning Representations, Available at: https://arxiv.org/abs/2110.09419, 2022.
[10] L. Yao, A. Torabi, K. Cho, et al., “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, pp. 4507–4515, 2015.
[11] X. Y. Zhang, X. S. Sun, Y. P. Luo, et al., “RSTNet: Captioning with adaptive attention on visual and non-visual words,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, pp. 15460–15469, 2021.
[12] J. Carreira and A. Zisserman, “Quo Vadis, action recognition? A new model and the Kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 4724–4733, 2017.
[13] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 6546–6555, 2018.
[14] C. Feichtenhofer, H. Q. Fan, J. Malik, et al., “SlowFast networks for video recognition,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, pp. 6201–6210, 2019.
[15] H. D. Jiang, Y. H. Li, S. J. Song, et al., “Rethinking fusion baselines for multi-modal human action recognition,” in Proceedings of the 19th Pacific Rim Conference on Multimedia (PCM), Hefei, China, pp. 178–187, 2018.
[16] F. Baradel, C. Wolf, and J. Mille, “Human action recognition: Pose-based attention draws focus to hands,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, pp. 604–613, 2017.
[17] J. F. Hu, W. S. Zheng, J. H. Pan, et al., “Deep bilinear learning for RGB-D action recognition,” in Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, pp. 346–362, 2018.
[18] Z. L. Luo, J. T. Hsieh, L. Jiang, et al., “Graph distillation for action detection with privileged modalities,” in Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, pp. 174–192, 2018.
[19] J. N. Li, X. M. Xie, Q. Z. Pan, et al., “SGM-Net: Skeleton-guided multimodal network for action recognition,” Pattern Recognition, vol. 104, article no. 107356, 2020. doi: 10.1016/j.patcog.2020.107356
[20] C. Ding, Y. Tie, and L. Qi, “Multi-information complementarity neural networks for multi-modal action recognition,” in Proceedings of the 8th International Symposium on Next Generation Electronics (ISNE), Zhengzhou, China, pp. 1–3, 2019.
[21] S. J. Song, J. Y. Liu, Y. H. Li, et al., “Modality compensation network: Cross-modal adaptation for action recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 3957–3969, 2020. doi: 10.1109/TIP.2020.2967577
[22] B. X. B. Yu, Y. Liu, and K. C. C. Chan, “Multimodal fusion via teacher-student network for indoor action recognition,” in Proceedings of the 35th AAAI Conference on Artificial Intelligence, Online, pp. 3199–3207, 2021.
[23] F. Negin, S. Cogar, F. Bremond, et al., “Generating unsupervised models for online long-term daily living activity recognition,” in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, pp. 186–190, 2015.
[24] F. Negin, A. Goel, A. G. Abubakr, et al., “Online detection of long-term daily living activities by weakly supervised recognition of sub-activities,” in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, pp. 1–6, 2018.
[25] H. D. Duan, Y. Zhao, K. Chen, et al., “Revisiting skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, pp. 2959–2968, 2022.
[26] C. Y. Yang, Y. H. Xu, J. P. Shi, et al., “Temporal pyramid network for action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 588–597, 2020.
[27] K. C. Thakkar and P. J. Narayanan, “Part-based graph convolutional network for action recognition,” in Proceedings of the British Machine Vision Conference, Newcastle, UK, 2018.
[28] N. C. Garcia, S. A. Bargal, V. Ablavsky, et al., “Distillation multiple choice learning for multimodal action recognition,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 2754–2763, 2021.
[29] P. Zhou, W. Shi, J. Tian, et al., “Attention-based bidirectional long short-term memory networks for relation classification,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 207–212, 2016.
[30] T. Mikolov, K. Chen, G. Corrado, et al., “Efficient estimation of word representations in vector space,” in Proceedings of the 1st International Conference on Learning Representations, Scottsdale, AZ, USA, 2013.
[31] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. doi: 10.1038/nature14539
[32] D. Y. Tang, B. Qin, and T. Liu, “Aspect level sentiment classification with deep memory network,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA, pp. 214–224, 2016.
[33] Y. W. Pan, T. Mei, T. Yao, et al., “Jointly modeling embedding and translation to bridge video and language,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 4594–4602, 2016.
[34] Q. Chen, X. D. Zhu, Z. H. Ling, et al., “Enhanced LSTM for natural language inference,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1657–1668, 2017.
[35] H. A. El-Ghaish, M. E. Hussein, and A. A. Shoukry, “Human action recognition using a multi-modal hybrid deep learning model,” in Proceedings of the British Machine Vision Conference (BMVC), London, UK, 2017.
[36] S. J. Song, C. L. Lan, J. L. Xing, et al., “Skeleton-indexed deep multi-modal feature learning for high performance human action recognition,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, pp. 1–6, 2018.
[37] W. Y. Xu, M. Q. Wu, M. Zhao, et al., “Fusion of skeleton and RGB features for RGB-D human action recognition,” IEEE Sensors Journal, vol. 21, no. 17, pp. 19157–19164, 2021. doi: 10.1109/JSEN.2021.3089705