PSA-NeRF: Personalized Spatial Attention Neural Rendering for Audio-Driven Talking Portraits Generation
Abstract
Audio-driven talking head animation has drawn considerable attention due to its wide range of applications, such as virtual reality.
Recent work models the scene of a talking portrait with a neural radiance field (NeRF) conditioned on speech representations and positional embeddings. Such implicit methods can generate high-fidelity talking heads, but they fail to produce natural and personalized facial dynamics because it is difficult to capture the correlation between the audio and visual modalities.
In this work, we present Personalized Spatial Attention Neural Rendering (PSA-NeRF), a novel framework for NeRF-based audio-driven talking portrait generation. By incorporating semantic constraints from a spatial attention map, our framework explicitly learns the correlation between the audio and visual modalities during neural rendering.
Specifically, we first extract a spatially correlated speech representation via an audio-to-lip proxy task.
Then we generate the corresponding spatial attention map as a semantic prior and use it to derive a spatial-aware speech representation that treats different semantic regions differently. Furthermore, we support personalized editing of the attention map to customize facial attributes. Finally, we generate talking heads with natural and personalized facial dynamics through a spatial-aware NeRF. Extensive evaluations demonstrate that our method generates more photorealistic talking portraits, with audio-synchronized and personalized facial attributes, than state-of-the-art methods.
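For intuition only, the following is a minimal, self-contained PyTorch sketch of the spatial-aware conditioning idea: a per-sample-point attention weight gates the speech feature before it enters the NeRF MLP, so audio influences mouth-region geometry and appearance more than other regions. This is not the authors' implementation; the module names, layer sizes, and the scalar-gating scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    """Standard NeRF sinusoidal encoding of 3D points, (N, 3) -> (N, 60)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                      # (N, 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)

class SpatialAwareNeRF(nn.Module):
    """Toy NeRF MLP whose audio conditioning is gated per sample point by a
    scalar attention weight (e.g., looked up from a projected 2D attention
    map). Layer widths and the gating scheme are hypothetical choices."""
    def __init__(self, audio_dim=64, num_freqs=10, hidden=256):
        super().__init__()
        pos_dim = 3 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3 channels) + volume density (1)
        )

    def forward(self, pts, audio_feat, attn):
        # pts: (N, 3) sample points; audio_feat: (N, audio_dim);
        # attn: (N, 1) in [0, 1], high near the mouth, low elsewhere.
        gated_audio = attn * audio_feat        # spatial-aware speech feature
        h = torch.cat([positional_encoding(pts), gated_audio], dim=-1)
        out = self.mlp(h)
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

# Smoke test with random inputs.
model = SpatialAwareNeRF()
rgb, sigma = model(torch.rand(8, 3), torch.randn(8, 64), torch.rand(8, 1))
print(rgb.shape, sigma.shape)  # torch.Size([8, 3]) torch.Size([8, 1])
```

In the full framework, the attention map is predicted as a semantic prior and can be edited per region to personalize facial attributes; the scalar gate above only illustrates how such a map could modulate audio conditioning spatially during rendering.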