FMR-GNet: Forward Mix-Hop Spatial-Temporal Residual Graph Network for 3D Pose Estimation
Abstract
Graph convolutional networks that leverage spatial-temporal information from skeletal data have emerged as a popular approach for 3D human pose estimation. However, comprehensively modeling consistent spatial-temporal dependencies among body joints remains challenging. Existing approaches are limited in that they perform graph convolutions only on immediate neighbors, deploy separate spatial and temporal modules, and rely on single-pass feedforward architectures. To address these limitations, we propose a forward mix-hop spatial-temporal residual graph network (FMR-GNet) for 3D pose estimation from monocular video. First, we introduce a mix-hop spatial-temporal attention graph convolution layer that effectively aggregates neighboring features with learnable weights over large receptive fields; its attention mechanism dynamically computes edge weights at each layer. Second, we devise a cross-domain spatial-temporal residual module that fuses multi-scale spatial-temporal convolutional features through residual connections, explicitly modeling interdependencies across the spatial and temporal domains. Third, we integrate a forward dense connection block that propagates spatial-temporal representations across network layers, enabling high-level semantic skeleton information to enrich lower-level features. Comprehensive experiments on two challenging 3D human pose estimation benchmarks, Human3.6M and MPI-INF-3DHP, demonstrate that the proposed FMR-GNet achieves superior performance, surpassing state-of-the-art methods.
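
To make the mix-hop aggregation with per-layer attention concrete, the following minimal PyTorch sketch shows one plausible form of such a layer: joint features are propagated over 1..K-hop neighborhoods of the skeleton graph, and an input-conditioned dot-product attention term is added to the structural adjacency so edge weights adapt at every layer. The class name MixHopAttentionGConv, the attention formulation, and all shapes are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn


class MixHopAttentionGConv(nn.Module):
    """Illustrative mix-hop graph convolution with dynamic edge attention.

    Aggregates joint features over 1..K-hop neighborhoods of the skeleton
    graph; a dot-product attention term lets the edge weights depend on the
    input at each layer. Names and shapes are assumptions, not the paper's
    actual implementation.
    """

    def __init__(self, in_dim, out_dim, adj, max_hops=3):
        super().__init__()
        self.max_hops = max_hops
        # Fixed skeleton adjacency (J x J) with self-loops, row-normalized.
        adj = adj + torch.eye(adj.size(0))
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        # One projection per hop order, so each hop keeps its own weights.
        self.hop_proj = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(max_hops)]
        )
        # Query/key maps for the input-conditioned edge weights.
        self.query = nn.Linear(in_dim, out_dim)
        self.key = nn.Linear(in_dim, out_dim)
        self.scale = out_dim ** 0.5

    def forward(self, x):
        # x: (batch, joints, in_dim) features for one frame.
        scores = self.query(x) @ self.key(x).transpose(1, 2) / self.scale
        # Structural prior plus learned, input-dependent attention.
        edge = self.adj + torch.softmax(scores, dim=-1)
        out, hop_feat = 0.0, x
        for k in range(self.max_hops):
            hop_feat = edge @ hop_feat            # propagate one hop further
            out = out + self.hop_proj[k](hop_feat)
        return torch.relu(out)


# Usage sketch: 17 Human3.6M joints, a toy (all-zero) adjacency standing in
# for the real bone-connectivity matrix, and a batch of 8 frames of 2D input.
joints = 17
adj = torch.zeros(joints, joints)
layer = MixHopAttentionGConv(in_dim=2, out_dim=64, adj=adj, max_hops=3)
y = layer(torch.randn(8, joints, 2))  # -> (8, 17, 64)

A full model along the lines described above would stack such layers inside the cross-domain residual module and the forward dense connection block, with temporal convolutions handling the frame dimension; those components are omitted here for brevity.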