Fan Xie, Dan Zeng, Qiaomu Shen, et al., “A comprehensive survey on text-to-video generation,” Chinese Journal of Electronics, vol. 34, no. 3, pp. 1–27, 2025. DOI: 10.23919/cje.2024.00.151

A Comprehensive Survey on Text-to-Video Generation

More Information
  • Author Bio:

    Xie Fan: Fan Xie is a master’s student at Southern University of Science and Technology. His research interests include text-to-video generation and multimodality. (Email: 12232387@mail.sustech.edu.cn)

    Zeng Dan: Dan Zeng received the B.E. and Ph.D. degrees in computer science and technology from Sichuan University in 2013 and 2018, respectively. From 2018 to 2020, she was a Post-Doctoral Research Fellow with the Data Management and Biometrics Group, University of Twente, The Netherlands. She is currently a Tenure-track Assistant Professor with the School of Artificial Intelligence, Sun Yat-sen University. Before that, she was a Research Associate Professor at the Southern University of Science and Technology. Her research interests are in the field of pattern recognition, computer vision, and deep learning. (Email: danzeng1990@gmail.com)

    Shen Qiaomu: Qiaomu Shen received the Ph.D. degree in computer science from The Hong Kong University of Science and Technology in 2020. He is currently an assistant professor at Beijing Institute of Technology (Zhuhai). Before that, he was a research assistant professor at Southern University of Science and Technology (SUSTech). His current research interests include spatial-temporal visualization, urban computing, and visual analytics of complex systems. (Email: joyshen06@gmail.com)

    Tang Bo: Bo Tang received the Ph.D. degree in computer science from The Hong Kong Polytechnic University in 2017. He is currently a tenured associate professor at Southern University of Science and Technology. He was a visiting researcher at Centrum Wiskunde & Informatica and Microsoft Research Asia, respectively. His research interest is architecting and implementing the data foundation for the Large Language Model era. (Email: tangb3@sustech.edu.cn)

  • Corresponding author:

    Zeng Dan, Email: danzeng1990@gmail.com

  • Received Date: June 08, 2024
  • Accepted Date: November 26, 2024
  • Available Online: January 09, 2025
  • Since the release of Sora, Text-to-Video (T2V) generation has brought profound changes to AI-generated content. T2V generation aims to generate high-quality videos from a given text description, which is challenging due to the lack of large-scale, high-quality text-video pairs for training and the complexity of modeling high-dimensional video data. Although there have been some valuable and impressive surveys on T2V generation, they introduce approaches in a relatively isolated way, overlook the development of evaluation metrics, and miss the latest advances in T2V generation since 2023. Due to the rapid expansion of the field, a comprehensive review of the relevant studies is both necessary and challenging. This survey attempts to connect and systematize existing research in a comprehensive way. Unlike previous surveys, it reviews nearly ninety representative T2V generation approaches, including the latest methods published in March 2024, from the perspectives of model, data, evaluation metrics, and available open-source resources. It may help readers better understand the current research status and ideas and get a quick start with accessible open-source models. Finally, the future challenges and method trends of T2V generation are thoroughly discussed.

  • Artificial Intelligence Generated Content (AIGC) is developing rapidly and has become one of the most popular topics in AI. The generative modalities of AIGC include image [1]–[3], video [4]–[6], audio [7]–[9], and more. Figure 1 counts the number of papers published on each generated modality over the past five years (2019 to 2023). As illustrated in Figure 1(a), text-to-image (T2I) generation started early and has dominated AIGC research for many years. Nevertheless, Figure 1(b) shows that text-to-video (T2V) generation, despite its relatively late start, has exploded in recent years, which may fundamentally shift the research emphasis in the future.

    Figure  1.  AIGC developments in the last five years, including Text-to-Image, Text-to-Video and Text-to-Audio.

    T2V generation aims to generate high-quality videos based on a given text description, and the generated videos typically contain 16 frames with a duration of about two seconds. It is challenging for two reasons. First, there is a lack of large-scale, high-quality text-video pairs for training; tens of millions of paired samples are usually required. Second, modeling high-dimensional video data is complex because 1) the semantic space of the text is much smaller than the generation space of the video frames, 2) the semantics must be retained correctly while keeping continuity between frames, and 3) the computational demands are high: training a T2V model like InternVid [10] typically requires 64 NVIDIA GPUs for three days.

    The release of Sora [11] this year has profoundly pushed the frontier of T2V generation. Even before that, both academia and industry put great effort into improving T2V generation models because of their wide application prospects. At this point, a comprehensive review of the relevant studies is both necessary and challenging. Although there have been some valuable and impressive surveys on T2V generation, they introduce approaches in a relatively isolated way, overlook the development of evaluation metrics, and miss the latest advances in T2V generation since 2023. Unlike previous surveys, this survey reviews nearly ninety representative T2V generation approaches and includes the latest methods published in March 2024, from the perspectives of model, data, evaluation metrics, and available open-source resources.

    Our survey is illustrated in Figure 2, and the organization is as follows: Section 2 clarifies the differences between this survey and others. Section 3 explores existing methods and reviews their strengths and weaknesses. Section 4 introduces current T2V datasets, while Section 5 reviews the development of metrics for evaluating T2V generation. Section 6 provides the results of the experiment on representative methods. Section 7 discusses challenges and future trends, and the last section concludes this review.

    Figure  2.  The organization of our survey.

    Table 1 presents the differences between this survey and the existing surveys. Unlike previous surveys, this survey paper reviews nearly one hundred representative T2V generation approaches and includes the latest method published in July 2024. Also, more T2V datasets and metrics are comprehensively reviewed.

    Table  1.  Compare our survey with existing surveys.
    Survey #Methods Latest Pub. Year #T2V Datasets #Metrics
    [12] 6 Oct. 2022 NA NA
    [13] 16 Dec. 2022 15 5
    [14] 28 Oct. 2023 31 4
    [15] 81 May 2024 24 5
    Ours 97 Jul. 2024 40 20

    Singh [12] presents and compares popular T2I and T2V generation methods, discussing their ideas, advantages, and disadvantages. The survey offers an overview of T2V generation techniques but lacks a comprehensive exploration of datasets and evaluation metrics.

    Xing et al. [14] provide a detailed overview of T2V generation methods, including datasets and evaluation metrics. However, this survey is somewhat outdated and focuses primarily on diffusion model-based architectures. In contrast, our survey covers all types of architectures for T2V generation, not only diffusion models.

    Cho et al. [13] provide an excellent introduction for beginners, covering T2V applications, technical limitations, ethical conflicts, and future directions. However, their work has limitations, including an insufficient introduction to mainstream methods, incomplete coverage of datasets, especially newly proposed ones [10], [16], and the absence of metrics such as EvalCrafter [17] and FETV [18].

    Sun et al. [15] present a comprehensive review of current T2V work, introducing 81 research works categorized by different architectures. It is the latest review paper on the T2V task. However, it does not provide comparative statistics of the methods or their open-source resources, which makes it harder for readers to get started with T2V quickly.

    In contrast, our survey not only comprehensively introduces related research, including core ideas, strengths, and weaknesses, but also introduces T2V datasets, evaluation metrics, experimental results, and open-source methods in detail, overcoming existing surveys' limitations.

    The primary generation procedure is illustrated in Figure 3. First, a text encoder encodes the input text into features. These features are then used by a generative model to produce the corresponding video.

    Figure  3.  A brief diagram of text-to-video generation.

    Text Encoder. Existing text encoders can be divided into two categories: pre-trained multimodal models such as CLIP [19], and pre-trained large language models (LLMs) such as BERT [20], T5 [21], and Llama-2 [22].

    Pre-trained multimodal models, exemplified by CLIP [19], learn their matching relationship by training on large-scale text-image pairs, thereby aligning image and text in an embedding space. However, CLIP cannot handle text with complex meanings, which may limit its effectiveness for long and complex text input.

    Pre-trained LLMs excel in various tasks after being trained on large-scale corpora. BERT [20] can learn from unlabeled data and exhibits impressive performance, which can be further improved as model size and training data expand. T5 [21] and Llama-2 [22] are favored for their superior performance and open-source availability. Usually, LLMs outperform CLIP in understanding long text inputs.
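    As a concrete illustration of the two encoder families, the sketch below obtains text features from a CLIP text encoder and a T5 encoder via Hugging Face transformers; the small checkpoint names are illustrative, and real T2V systems typically use larger, frozen encoders.

```python
# Minimal sketch: text features from CLIP and T5 (Hugging Face transformers).
# Checkpoint names are illustrative examples, not the ones used by any specific T2V model.
import torch
from transformers import CLIPTokenizer, CLIPTextModel, T5Tokenizer, T5EncoderModel

prompt = "a corgi running on the beach at sunset"

# CLIP text encoder: features aligned with image features in a joint space.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
clip_in = clip_tok(prompt, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    clip_feats = clip_enc(**clip_in).last_hidden_state   # (1, seq_len, 512)

# T5 encoder: per-token features, often preferred for long or complex prompts.
t5_tok = T5Tokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
t5_in = t5_tok(prompt, return_tensors="pt")
with torch.no_grad():
    t5_feats = t5_enc(**t5_in).last_hidden_state          # (1, seq_len, 768)
```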

    Generation Model. Existing generation methods can be divided into five categories: 1) VAE-based [23] approaches, 2) GAN-based [24] approaches, 3) autoregressive transformer-based [4] approaches, 4) diffusion model-based [25] approaches, and 5) approaches that adapt T2I models for video generation. Figure 4 shows the timeline of representative T2V generation methods in academia and industry. Figure 5 shows the categorization of existing methods for T2V generation. Figure 6 compares the network architectures of the different generative models.

    Figure  4.  A timeline of representative text-to-video generation methods in academia and industry.
    Figure  5.  The categorization of existing methods for text-to-video generation.
    Figure  6.  Comparison of the network architectures with respect to different generative models.

    The Variational Autoencoder (VAE) [23] is a groundbreaking method for generating images. It consists of an encoder and a decoder. The encoder maps the input data into a probability distribution, while the decoder generates new data by sampling from the learned probability distribution. Sync-DRAW [26] and GODIVA [27] are representative T2V generation methods based on VAE.
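    The sketch below illustrates the basic VAE recipe described above (an encoder predicting a Gaussian over latents, reparameterized sampling, and a decoder reconstruction); the layer sizes are arbitrary and not tied to any cited model.

```python
# Minimal VAE sketch in PyTorch: encode to a Gaussian over latents, sample via
# reparameterization, decode, and train with reconstruction + KL terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction="sum")                   # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return rec + kl
```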

    Recurrent VAEs. Sync-DRAW [26] combines a VAE with a recurrent attention mechanism for generating videos. It generates temporally coherent video frames by focusing on individual frames through the attention mechanism, while using the VAE to globally learn the latent distribution of the video. In addition, it keeps full attention to the object through a gating mechanism, which can generate videos that maintain the structural integrity of the object.

    VQ-VAE. GODIVA [27] is the first to use VQ-VAE [117] for open-domain T2V generation, as illustrated in Figure 7. It combines VQ-VAE and 3D sparse attention to generate video, where 3D sparse attention can significantly reduce the computational cost. First, a VQ-VAE autoencoder is trained to represent continuous video pixels as discrete tokens. Then, a 3D sparse attention model is trained using language as input, with the discrete video tokens used as labels for video generation.

    Figure  7.  The architecture of GODIVA [27].
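    A hedged sketch of the VQ-VAE quantization step that GODIVA builds on: each continuous encoder output is replaced by its nearest codebook entry, turning a video into discrete tokens that an attention model can predict. Shapes and codebook size are illustrative.

```python
# Vector-quantization sketch: map continuous encoder outputs to discrete token ids.
import torch

def vector_quantize(z_e, codebook):
    """z_e: (N, D) encoder outputs; codebook: (K, D) learned embeddings."""
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    tokens = dists.argmin(dim=1)         # discrete token ids, shape (N,)
    z_q = codebook[tokens]               # quantized vectors passed to the decoder
    return tokens, z_q

codebook = torch.randn(1024, 64)         # e.g., K = 1024 codes of dimension 64
z_e = torch.randn(16 * 32 * 32, 64)      # e.g., 16 frames of 32x32 latent positions
tokens, z_q = vector_quantize(z_e, codebook)
```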

    VAE has a simple structure and is easy to use, but it is limited to generating basic videos and suffers from “posterior collapse”. VQ-VAE, used in GODIVA, combines the advantages of discrete representations with the efficacy of continuous models. In current state-of-the-art systems, VAE is mainly used to encode conditional information, such as text, rather than for video modeling. A comparison of pros and cons is illustrated in Figure 8.

    Figure  8.  Pros and Cons of VAE-based approaches.

    Generative Adversarial Networks (GANs) [24] have been ruling image generation for a decade. In contrast to VAE, the core idea of GAN is to estimate the generator via an adversarial process. GANs usually produce images with good perceptual quality and are widely used in T2V generation methods. However, GAN-based models can only generate videos with moving digits or simple human actions and cannot scale to more complex and diverse videos. Moreover, GANs often suffer from mode collapse, and it is also difficult to scale these methods to complex, large-scale video datasets.
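    The adversarial process mentioned above boils down to alternating two updates, sketched below with placeholder generator G and discriminator D (their architectures and the data pipeline are omitted).

```python
# One adversarial training step: D separates real from fake, G tries to fool D.
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=128):
    b = real.size(0)
    z = torch.randn(b, z_dim)
    fake = G(z)

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: push D(fake) toward 1.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```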

    Temporal GAN. TGANs-C [28] is the first GAN-based work for T2V generation, proposing a temporal GAN where the generator takes text embeddings and noise vectors to produce video frames. It enhances the traditional 2D generator to a 3D model for better spatio-temporal dynamics and incorporates motion analysis in the discriminator to ensure coherent frame transitions, as detailed in Figure 9.

    Figure  9.  The architecture of TGANs-C [28].

    Text-Filter conditioning GAN. TFGAN [29] introduces a new text conditioning method for discriminator feature maps through convolutional operations, as depicted in Figure 10. Meanwhile, the authors also created the Moving Shapes Dataset, where the text describes the shapes moving along a trajectory.

    Figure  10.  The architecture of TFGAN [29].

    GAN combined with VAE. VGFT [30] proposes a hybrid generator that combines GAN with VAE [23] to extract statistical and dynamic information from text, thereby generating diverse and smooth videos that correspond well to the input text.

    Leverage the previous frame for generation. IRC-GAN [31] proposes the introspective recurrent convolutional GAN, consisting of the Recurrent Transconvolutional Generator (RTG) and Mutual-information Introspection (MI). RTG generates each frame based on the previous one for better coherence. MI uses mutual information to compute the semantic distance between the generated video and the corresponding text and tries to minimize it. TiVGAN [32] proposes a new training framework that initially focuses on learning the relationship between text and image to create high-quality single video frames. As training progresses, the model is gradually trained on more successive frames, which stabilizes the training and allows for clearer video generation.

    Story visualization. StoryGAN [33] proposes a new task called “story visualization”. The input is a multi-sentence paragraph, a story, and the output is a series of visualization images, with one for each sentence. Compared to other T2V works, this task can focus less on the continuity of the generated image frames and more on the global consistency between dynamic scenes and characters. Word-Level [34] expands on StoryGAN [33] by introducing a new sentence representation that combines word information from all story sentences. Also, a new fusion feature discriminator is proposed, extending spatial attention to improve image quality and story consistency.

    GAN-based methods, like VAE-based methods, are able to generate realistic images and videos. However, they are limited to generating simple videos, such as moving figures or basic human movements, and cannot generate more complex and diverse videos. In addition, GANs frequently suffer from mode collapse, making it challenging to apply these methods to complex, large-scale video datasets. Figure 11 outlines the pros and cons of each subcategory in this section.

    Figure  11.  Pros and Cons of GAN-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    The autoregressive approach generates a sequence step by step, with the latest content conditioned on previously generated content, which naturally fits the idea of generating coherent videos. Compared with GAN-based methods, autoregressive transformer-based methods avoid mode collapse and generate better video quality. However, they require more computational and memory resources because the intermediate results must be stored and are constantly involved in the computation.
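    The step-by-step generation described above can be summarized by the sampling loop below; `model` is assumed to be a decoder-only transformer returning next-token logits over the discrete video vocabulary.

```python
# Autoregressive sampling sketch: each video token is conditioned on the text
# tokens and on all previously generated video tokens.
import torch

@torch.no_grad()
def generate_video_tokens(model, text_tokens, num_video_tokens, temperature=1.0):
    seq = text_tokens.clone()                        # (1, T_text) conditioning prefix
    for _ in range(num_video_tokens):
        logits = model(seq)[:, -1, :] / temperature  # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, 1)       # sample rather than argmax
        seq = torch.cat([seq, next_tok], dim=1)      # feed it back as context
    return seq[:, text_tokens.size(1):]              # discrete tokens for the video decoder
```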

    Transformers with 3D structure. NUWA [35] introduces a 3D transformer encoder and decoder framework that provides a unified representation space for images and video, supporting both T2I and T2V generation. A 3D Nearby Attention (3DNA) mechanism is proposed to reduce the computational complexity. The architecture is shown in Figure 12.

    Figure  12.  The architecture of Nuwa [35].

    Multimodality. VideoPoet [36] utilizes the unified architecture of LLMs to perform unified autoregressive learning over text, video, image, and audio modalities. Each modality has a separate tokenizer that converts the data into a discrete sequence of tokens. In addition, it incorporates a super-resolution module in token space to improve video quality. MMVID [37] takes multimodal inputs for video generation. It consists of an auto-encoder for encoding images and videos and a non-autoregressive transformer for predicting video tokens from multimodal inputs; together with designs such as a special VID token, textual embeddings, and improved mask prediction, it generates videos with better temporal consistency. Moreover, it proposes a new dataset called Multimodal VoxCeleb, whose video sources are VoxCeleb’s [118] 19,522 videos with 36 manually labeled facial attributes. LWM [38] proposes the RingAttention technique, which extends the context window of the model so that it can handle sequences up to one million tokens long. The context length is gradually increased during training, and the training data include text-image pairs, text-video pairs, and chat data for downstream tasks.

    Figure  13.  The architecture of Phenaki [39].

    Generate variable-length videos. Phenaki [39] is the first model to generate videos of variable length. It trains a transformer while randomly masking the video tokens. At generation time, a video of arbitrary length is produced by freezing the already generated (past) tokens.

    Transformers, but not autoregressive. Some methods are not based on autoregressive transformers, but their architectures are still transformer-based, such as MAGVIT [40] and WorldDreamer [41]. MAGVIT can handle multiple video synthesis tasks simultaneously and significantly outperforms contemporaneous diffusion and autoregressive methods in inference speed. WorldDreamer is the first generalized world model built for video generation. It proposes the spatial-temporal patchwise transformer (STPT), which performs attentional manipulation of local patches within a spatio-temporal window. STPT facilitates the learning of visual signal dynamics and accelerates the convergence of training, making it about three times faster than diffusion-based methods.

    Compared to GAN-based methods, autoregressive transformer-based methods avoid mode collapse and produce better-quality videos. However, these methods require more computational and memory resources, as the intermediate results must be stored and constantly involved in computation. Figure 14 displays the pros and cons of each subcategory under this section.

    Figure  14.  Pros and Cons of auto-regressive transformer based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    The denoising diffusion probabilistic model (DDPM) [25] avoids the mode collapse of GANs and the low generation quality of VAEs. At its core, a diffusion model adds random noise to existing data and learns to reverse the process to generate high-quality samples. Through this process, the model learns to create synthetic data.
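    The add-noise-then-learn-to-reverse idea can be written down in a few lines. The sketch below shows the standard epsilon-prediction training objective of DDPM with a simple linear noise schedule; the model interface (x_t, t, text embedding) is an assumption for illustration.

```python
# DDPM training sketch: noise a clean sample at a random timestep and train the
# network to predict the injected noise.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative product of (1 - beta_t)

def ddpm_loss(model, x0, text_emb):
    """x0: clean samples, shape (B, C, H, W); add one more dim for video frames."""
    t = torch.randint(0, T, (x0.size(0),))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward (noising) process
    eps_pred = model(x_t, t, text_emb)                # network predicts the noise
    return F.mse_loss(eps_pred, eps)
```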

    Since the diffusion model operates on images in pixel space, the computational consumption is particularly large for high-resolution images. Rombach et al. [119] proposed an image generation model based on the Latent Diffusion Model, whose core idea is to use an encoder to map an image to a latent vector and a decoder to decode the latent vector back into an image. The advantage of this approach is that the diffusion process runs in the latent space, whose dimension is much smaller than the original pixel space, so the computational consumption is significantly reduced.
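    A hedged sketch of the latent-diffusion idea using the `diffusers` AutoencoderKL (the Stable Diffusion VAE checkpoint and the 0.18215 scaling factor are taken from that release and serve only as an example): pixels are compressed to a much smaller latent, diffusion runs there, and the decoder maps the result back to pixels.

```python
# Latent-space diffusion sketch: encode to latents, denoise there, decode back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

frame = torch.randn(1, 3, 512, 512)          # stand-in for a real (normalized) frame
with torch.no_grad():
    latent = vae.encode(frame).latent_dist.sample() * 0.18215   # (1, 4, 64, 64)
    # ... run the text-conditioned diffusion model on `latent` here ...
    recon = vae.decode(latent / 0.18215).sample                 # back to pixel space
```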

    The research works introduced below are all based on either the diffusion model or the latent diffusion model, and no specific distinction is made in this survey.

    First 3D UNet. VDM [6] is the first work to employ a diffusion model for video generation. It extends the traditional 2D U-Net [120] architecture to a 3D spatio-temporal form and supports joint training on images and videos. It further proposes conditional sampling for spatio-temporal video, which is capable of generating long, high-resolution videos. With the introduction of the 3D U-Net architecture, the use of diffusion models for video generation has increased.

    Temporal modeling exploration. LVDM [42] is one of the representative works applying the latent diffusion model to video generation. It innovatively proposes hierarchical diffusion in the latent space. The framework is shown in Figure 15, where t and s are randomly sampled diffusion timesteps for generated latents and conditional latents, respectively, and pc and pu are the probabilities of the conditional and unconditional input, respectively. To overcome the performance degradation caused by long video generation, LVDM [42] further proposes conditional latent perturbation and unconditional guidance, which effectively mitigate the cumulative error while extending generation to more than one thousand frames.

    Figure  15.  The architecture of LVDM [42].
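    The probabilities pc and pu in Figure 15 correspond to randomly dropping the condition during training, which enables guided sampling at inference time. Below is a generic classifier-free guidance sketch (not LVDM's exact conditional latent perturbation); w is the guidance weight.

```python
# Generic classifier-free guidance: blend conditional and unconditional predictions.
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, text_emb, null_emb, w=7.5):
    eps_cond = model(x_t, t, text_emb)    # prediction with the text condition
    eps_uncond = model(x_t, t, null_emb)  # prediction with the dropped/empty condition
    return eps_uncond + w * (eps_cond - eps_uncond)
```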

    Arguing that the oversimplification of other works for temporal modeling limits the spatio-temporal performance, VersVideo [43] proposes multi-excitation paths for spatio-temporal convolution across a pool of dimensions with different axes and multi-expert spatio-temporal attention blocks, which improves the spatio-temporal performance of the model without significantly increased training and inference costs. It also integrates the temporal module into the decoder to solve the problem of information loss due to the latent space.

    Multi-stage T2V methods. Show-1 [44] innovatively combines the respective advantages of pixel-based and latent-based video diffusion models. Pixel-based video diffusion models perform well but have high computational costs. While the latent-based model effectively reduces the computational effort, it is not easy to accurately align text and video. Show-1 first generates a low-resolution video using the pixel-based diffusion model to accurately align the text and video. Then, the latent video diffusion model is used to upsample the low-resolution video to high-resolution video. Its framework is shown in Figure 16. LaVie [45] consists of three modules: a basic T2V model, a temporal interpolation (TI) model, and a video super-resolution (VSR) model. The basic model generates keyframes, the TI model generates smoother results, and the VSR model further improves resolution. MoVideo [46] generates the video in two steps, first generating the depth and optical flow of the video and then generating the final video by combining the keyframes generated by the T2I model under these two conditions. Mora [47] utilizes a variety of advanced large models for T2V generation that can replicate Sora’s [11] generative capabilities. Specifically, video generation is decomposed into several subtasks, each assigned to a specialized large-scale model. VideoElevator [48] uses encapsulated T2V models to improve temporal consistency and T2I models to provide high-quality detail.

    Figure  16.  The architecture of Show-1 [44].

    Multi-stage methods achieve better generated-video quality than single-stage methods. However, the drawbacks are a more complex generation process and an increased training burden.

    Noise prior exploration. VideoFusion [49] decomposes the standard diffusion process into adding basic and residual noise, where consecutive frames share the basic noise. This way, frames in the same video clip are encoded as related noises, allowing the denoising network to reconstruct coherent video more easily. InstructVideo [50] combines human preferences with text into noise. POS [51] proposes optimal noise approximators and semantic-preserving rewriters. The optimal noise approximator first searches for the video closely related to the text and then inverts it into the noise space as an improved noise for the text input. The semantic preservation rewriter rewrites the original text while preserving the semantics.

    The generated video is denoised from noise, which can directly affect the video quality. Improving the initial noise without changing other modules can further improve the model’s performance.
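    A minimal sketch of the shared-noise idea behind VideoFusion: all frames of a clip share one base noise and differ only by a per-frame residual, so the noised frames stay correlated over time. The mixing weight below is illustrative, not the paper's exact parameterization.

```python
# Correlated video noise: shared base component plus per-frame residual.
import torch

def correlated_video_noise(batch, frames, channels, h, w, lam=0.8):
    base = torch.randn(batch, 1, channels, h, w)           # shared across all frames
    residual = torch.randn(batch, frames, channels, h, w)  # frame-specific component
    # lam**2 + (1 - lam**2) = 1 keeps the mixture at unit variance.
    return lam * base + (1 - lam ** 2) ** 0.5 * residual   # (B, T, C, H, W)
```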

    Leverage new training data. VidRD [52] proposes a set of strategies for combining video-text data that involves different elements of several existing datasets, including video datasets for action recognition and image-text datasets. VideoFactory [16] collected videos on YouTube and labeled them using BLIP-2, building a large video dataset called HD-VG-130M. InternVid [10] proposes a method for autonomously building high-quality video text datasets using LLMs and publicly releasing the collected datasets.

    High-quality T2V datasets are essential for improving model performance. Compared to other generative tasks, the available datasets in T2V generation are currently quite scarce, which limits the model performance to some extent.

    Efficient training. F3-Pruning [53] proposes a training-free generalized pruning strategy to prune redundant spatio-temporal attention weights. This speeds up the inference of the T2V model and ensures video quality. VideoLCM [54] applies consistency models (CM) [121] to the video generation domain. It achieves high-fidelity and smooth video synthesis using only four sampling steps, demonstrating the potential of real-time synthesis. ART-V [55] learns only simple continuous motions between neighboring frames and generates videos autoregressively, thus reducing the enormous computational overhead of training. AdaDiff [56] arranges denoising steps according to different samples. It uses the gradient method for optimization to maximize a well-designed reward function. It reduces the inference time by at least one-third while achieving similar results to other methods. HPDM [57] studies patch diffusion models (PDMs) that model the distribution of patches instead of the whole input, keeping 0.7% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. Mobius [58] proposes a highly efficient spatial-temporal parallel training paradigm for T2V tasks. In its 3D-Unet, the temporal and spatial layers are parallel, optimizing the feature flow and backpropagation. This method reduces GPU memory usage by 24% and training time by 12%, offering significant improvement for T2V fine-tuning task.

    Incorporating transformers into diffusion models. VDT [59] is the pioneer in using transformers for diffusion-based video generation; transformers in diffusion models can leverage rich spatio-temporal representations well. Similar to VDT, Latte [60] is also a transformer-based diffusion model and achieves performance beyond that of VDT. W.A.L.T [61] devised a transformer framework that maps the latent vectors of images and videos into the same latent space. The framework is a window attention architecture consisting of self-attention layers that alternate between non-overlapping, window-constrained spatial and spatio-temporal attention. Snap Video [62] replaces the U-Net in the traditional diffusion model with a transformer-based FITs [122] structure and further scales the number of parameters, significantly improving temporal consistency and motion modeling. Sora [11] is the first large-scale general-purpose video generation model to attract widespread attention in the community. It is based on a DiT [123] structure similar to Latte and has several features. First, it can be trained on videos and images with different resolutions, durations, and aspect ratios, at their original sizes. Second, it converts short user prompts into longer, detailed instructions using GPT [124] to improve video quality. Third, it supports generation conditioned on images and videos, including image-to-video generation, extending generated videos, and video editing.

    Multimodality. Considering the limited scale of publicly available text-video pairs, TF-T2V [63] proposes a new T2V generation framework that allows direct learning using text-free videos. The basic principle is to separate the process of text decoding from the process of temporal modeling. For this purpose, it employs a content and a motion branch, jointly optimized with shared weights. In the content branch, paired image-text data is leveraged to learn text-conditioned and image-conditioned spatial appearance generation. The motion branch supports the training of motion dynamic synthesis by feeding text-free videos (or partially paired video-text data if available). VideoCrafter2 [64] separates appearance and motion by utilizing low-quality videos for motion learning and high-quality images for appearance learning. It also suggests using synthetic images with complex concepts instead of real images for fine-tuning.

    Personalized video generation. Animate Anyone [65] proposes a new framework customized for character animation, capable of converting character photos into animated videos controlled by a required sequence of poses while ensuring a consistent appearance and temporal stability. GEN-1 [66] proposes a structure- and content-oriented video diffusion model that can modify existing videos based on text. DynamiCrafter [67] animates open-domain images using a pre-trained video diffusion prior, a proposed dual-stream image injection mechanism, and a dedicated training paradigm.

    Spatio-temporal decoupling. HiGen [68] improves performance by decoupling the spatial and temporal elements of video from both structure and content perspectives. At the structural level, it uses a unified noise reducer to decompose the T2V task into two steps: spatial inference and temporal inference. At the content level, it extracts motion and appearance changes from the content of the input video, respectively. LAMP [69] proposes a new setting for the T2V generation task to balance the generation freedom with the training cost. It learns the motion patterns only from a training set consisting of 8 to 16 videos and later generates subsequent frames using the images generated by the T2I model as the first frame. MotionDirector [70] utilizes a dual-path LoRAs architecture to decouple the learning of appearance and motion and designs a new appearance debiasing temporal loss to mitigate the effect of appearance on the temporal training objective.

    MagicTime [71] designs a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from time-lapse videos, and transform pre-trained T2V models to generate metamorphic videos. Compared with general videos, metamorphic videos capture the entire transformation of the subject and exhibit much larger motion changes. ChronoMagic, a time-lapse video dataset, has been released as the training data. VideoTetris [72] introduces a novel spatio-temporal compositional diffusion, which manipulates the cross-attention values of the denoising network temporally and spatially, synthesizing videos that faithfully follow complex or progressive instructions.

    Controllable T2V generation. PEEKABOO [73] expects users to control video generation interactively. It proposes a new spatial-temporal masked attention module to achieve spatio-temporal control without the extra overhead of training and inference for current video generation models. ControlVideo [74] is adapted from ControlNet [125] and introduces three new modules to improve video generation. First, full cross-frame interaction is added to the self-attention module. Second, it uses frame interpolation to mitigate flicker effects. Finally, it synthesizes multiple consistent short clips. VideoComposer [75] offers a variety of methods for controllable video generation at once. It can simultaneously control spatial and temporal patterns in composited video through textual descriptions, sketch sequences, reference videos, and even simple manual movements. StyleCrafter [76] augments the pre-trained T2V model with a style control adapter, which can generate videos in any style by providing reference images. It uses a style-rich image dataset to train the style control adapter. MobileVidFactory [77] is a system that uses text input to generate videos for mobile devices automatically. The system first generates high-quality videos using a video generator. Then, the user can enrich the visual presentation by adding specified text. Finally, it matches the generated video with the appropriate audio in an audio database. Boximator [78] is a method for fine-grained video motion control that provides two types of boxes, allowing the user to select any object and define its motion without entering additional text. It trains only the control module, which retains knowledge of the underlying model, so its performance improves as the underlying model evolves.

    CTRL-Adapter [79] is an efficient and versatile framework that adds diverse controls to any image or video diffusion model through the adaptation of pretrained ControlNets. Training CTRL-Adapter is much more efficient than training a ControlNet for a new backbone, and it can outperform or match strong baselines in visual quality and spatial control. CameraCtrl [80] implements precise camera pose control for T2V models. After accurately parameterizing the camera trajectory, it can train a plug-and-play camera module on the T2V model and keep the other modules untouched. MotionClone [81] is a training-free framework that performs motion cloning from reference videos to control text-to-video generation. It robustly maintains motion fidelity while assimilating novel textual semantics.

    Remove flicker and artifacts. Flickering and artifacts in the generated video are due to the current model’s lack of learning and generative capabilities. Removing the flickering and artifacts can make the generated video more realistic. DiffSynth [82] proposes a latent in-iteration deflickering framework and a video deflickering algorithm to mitigate the flickering. The latent in-iteration deflickering framework applies the video deflickering algorithm to the latent space of the diffusion model, effectively preventing the accumulation of flicker in intermediate steps. The video deflickering algorithm remaps objects in different frames and blends them to enhance video consistency. Like DiffSynth [82], DSDN [83] also reduces flicker and artifacts in the generated video. It designs two diffusion streams, one for the video content and one for the motion variations, so that the content and the motion can be better aligned. Experiments show that this decomposition also reduces the generation of flicker.

    Complex dynamics modeling. Dysen-VDM [84] proposes a dynamic scene manager module to enhance the dynamics of generated videos. The module consists of 1) extracting key actions from the input text in chronological order, 2) converting the action schedule into a dynamic scene graph representation, and 3) enriching the scenes in the DSG with sufficiently reasonable details. VideoDirGPT [85] inputs the text prompts into GPT-4 [124] to output a video plan, which includes generating scene descriptions, entities with their respective layouts, backgrounds for each scene, and consistent grouping of entities and backgrounds. Finally, the video generator generates the video based on the video plan. LVD [86] utilizes LLMs to generate dynamic scene layouts based on the prompt and then uses the generated layouts to guide the diffusion model to generate video. Such a process does not involve any updates to the parameters of the LLM and the diffusion model.

    All three methods leverage the comprehension capabilities of large language models to guide generative models toward better generation results.

    Domain-specific T2V generation. Text2Performer [87] focuses on the generation of human videos. It has two novel designs: decomposed human representations and a diffusion-based motion sampler. Video Adapter [88] decomposes domain-specific video distributions into pre-trained priors and trainable components, which significantly reduces the cost of tuning large pre-trained video models. DrivingDiffusion [89] generates realistic multi-view driving videos from prompts and 3D layouts.

    Generating longer videos. NUWA-XL [90] is a follow-up work of NUWA [35] that generates long videos from text. It employs a coarse-to-fine generation paradigm: a global diffusion model generates keyframes over the entire period, and then a local diffusion model recursively fills in the content between nearby frames. SEINE [91] proposes a short-to-long (S2L) video diffusion model. It automatically generates transitions based on textual descriptions: transition videos are produced by providing images of different scenes as inputs, combined with text-based control. MTVG [92] proposes multi-text video generation that directly utilizes a pre-trained diffusion-based T2V generation model without additional fine-tuning. FreeNoise [93], like MTVG [92], studies the generation of long videos conditioned on multiple texts. Instead of initializing noise for all frames, FreeNoise rearranges a series of noises for long-range correlation and applies temporal attention to them through window-based fusion. Gen-L-Video [94] extends existing short video diffusion models to generate long videos consisting of hundreds of clips with different semantics, without introducing additional training, while maintaining content consistency. StreamingT2V [95] proposes an autoregressive approach that utilizes novel short-term and long-term dependency blocks to seamlessly carry over video chunks with high motion while preserving high-level scene and object features during the generation process. Vlogger [96] is a system that generates vlogs longer than five minutes from text. It utilizes an LLM as a director and breaks down vlog generation into four phases: Script, Actor, ShowMaker, and Voicer. FIFO-Diffusion [97] efficiently generates very long videos from models trained on short clips (16 frames) by iteratively performing diagonal denoising. This is achieved without additional training and without degrading video quality, while preserving the dynamics and semantics of the scene.

    The diffusion model relies on a long Markov chain of diffusion steps to generate samples, allowing more complex, nonlinear distributions to be modeled than with other architectures. Besides, its training process is stable, converges well, and does not suffer from mode collapse. However, it is notorious for its high time and computational cost. Figure 17 displays the pros and cons of each subcategory under this section.

    Figure  17.  Pros and Cons of diffusion model-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    Training a T2V model from scratch requires tremendous computational cost. Thus, many works focus on how pre-trained T2I models can be utilized to contribute to video generation. The T2I-based model reduces the training cost and ensures the image quality of the generated video.

    Temporal layer. ModelScopeT2V [98] is the first open-source diffusion-based T2V generation model. Spatio-temporal blocks are added to a T2I synthesis model to ensure consistent frame generation and smooth motion transitions. Video LDM [99] first pre-trains the image generator on images. It then introduces a temporal layer and fine-tunes the encoded image sequence to convert the image generator into a video generator. Imagen Video [100] utilizes the mature T2I model Imagen [126] to generate the base video. Six diffusion models are then cascaded, three for spatial super-resolution and three for temporal super-resolution. Each model is trained independently, and the cascade maximizes the performance benefits. The framework is shown in Figure 19. SVD [101] proposes a three-step paradigm for training video generation models: T2I pre-training, video pre-training, and high-quality video fine-tuning. In addition, it provides a series of processes to generate high-quality T2V datasets.

    Figure  19.  The architecture of Imagen Video [100].
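    The common "temporal layer" recipe mentioned above can be sketched as follows: spatial layers process each frame independently, and an inserted temporal attention layer mixes information across frames at every spatial location. The layer below is a generic illustration, not the exact module of any cited model.

```python
# Temporal attention sketch for inflating a T2I backbone into a T2V one.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                   # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # sequence along time
        x = x + self.attn(x, x, x)[0]                          # residual temporal mixing
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```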

    Frame adapter. While previous approaches usually added a 1D temporal layer to model the time, MagicVideo [102] considered it unnecessary to use such a complex operation and proposed the concept of the frame adapter, which uses only two sets of parameters to model the relationship between the images and the video. Similar to MagicVideo [102], SimDA [103] employs adapters to transform T2I models into T2V models. It not only includes a lightweight spatial adapter to transfer visual information for T2V learning but also introduces a temporal adapter to model temporal relationships for lower feature dimensions.

    Frame interpolation. CogVideo [4] generates several key frames using a T2I model called CogView2 [3]. Based on the keyframes, several rounds of frame interpolation are performed to form a final video. The process of frame interpolation is an autoregressive process. It also proposes multi-frame rate hierarchical training to align text-video pairs better. The framework is shown in Figure 18. Make-A-Video [5] incorporates a super-resolution module based on frame interpolation to improve video quality. There is no need for text-video pairs for its training, and only video data is needed to learn the motion. MagicVideo-V2 [104] integrates a T2I model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. GridDiffusion [105] generates videos using the grid diffusion model. It first generates key grid images, including four images inside a grid image. After that, masked grid images are inserted into the grid, allowing the interpolation model to generate the masked images autoregressively.

    Figure  18.  The architecture of CogVideo [4].
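    The keyframe-then-interpolate strategy can be summarized by the loop below, where `interp_model` stands in for a learned interpolator (CogVideo's is an autoregressive token model, not a pixel averager); each round roughly doubles the frame count.

```python
# Recursive frame densification sketch: insert a predicted frame between neighbors.
import torch

def densify(frames, interp_model, rounds=2):
    """frames: list of (C, H, W) keyframe tensors."""
    for _ in range(rounds):
        out = [frames[0]]
        for prev, nxt in zip(frames[:-1], frames[1:]):
            out.append(interp_model(prev, nxt))   # predicted in-between frame
            out.append(nxt)
        frames = out
    return torch.stack(frames)                    # (T_dense, C, H, W)
```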

    Optimizing the latent representation of video. Unlike methods that generate keyframes and then interpolate them, the STUNet proposed by Lumiere [106] directly generates all frames in one step and then applies spatial super-resolution on overlapping windows to obtain higher-resolution video. VideoGen [107] utilizes the T2I model to generate a reference image based on the prompt. Then, an efficient cascaded latent diffusion model is introduced, which conditions on the reference image and prompt to generate the latent representation of the video. PYoCo [108] proposes a video diffusion noise for fine-tuning T2I models into T2V models. It fine-tunes eDiff-I [127] to construct a large-scale T2V diffusion model. Text2Video-Zero [109] utilizes a pre-trained T2I model to generate the latent space representation of the image. After that, the latent representation of each frame is generated using a dynamics method and a cross-attention mechanism that only attends to the first frame. Finally, the video is generated by the decoder.

    One-shot video tuning. Tune-A-Video [110] proposes a new task of training a T2V model using only a single text-video pair and a pre-trained T2I model.

    Parameter-free. Latent-Shift [111] proposes a parameter-free temporal shift module that can generate videos based on the T2I model. The module shifts part of the feature-map channels forward and another part backward along the temporal dimension, letting each frame exchange information with its neighbors.
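    A sketch of a temporal shift in this spirit is given below: a slice of channels is shifted one step forward in time and another slice one step backward, so each frame's features mix with its neighbors without any extra parameters. The fold ratio is illustrative.

```python
# Parameter-free temporal shift sketch.
import torch

def temporal_shift(x, fold_div=8):
    """x: (B, T, C, H, W); shifts C // fold_div channels each way along T."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the rest unchanged
    return out
```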

    LLM as a helper. DirecT2V [112] and Free-Bloom [113] use language models to transform user prompts into detailed frame descriptions, then employ a T2I model to generate each frame. DirecT2V enhances frame consistency using novel value mapping and dual softmax filtering, while Free-Bloom proposes joint noise sampling and dual-path interpolation. FlowZero [114] utilizes an LLM to generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. The DSS then guides the image diffusion model in generating videos. In particular, it proposes a self-refining iterative process that enhances the alignment of the video with the text. GPT4Motion [115] utilizes GPT-4 [124] to generate Blender scripts based on user prompts, producing coherent physical motion across frames. Blender is an open-source 3D creation suite that provides modeling, animation, and rendering tools that facilitate the creation of detailed 3D scenes.

    Adds motion to the image. Motion-I2V [116], a new framework for image-to-video generation (I2V), decomposes image-to-video generation into two stages. In the first phase, a diffusion-based motion field predictor predicts motion from the text and the image. For the second stage, a video latent diffusion model generates the final video based on the image and motion. Note that when using this approach to generate video, it is necessary to use other models to generate the image corresponding to the text as a first step.

    Implementing a T2V model based on a pre-trained T2I model reduces training costs and improves image quality. However, generating video from text is split into two parts and is no longer an end-to-end process. At the same time, the degree of freedom and the magnitude of motion are slightly worse than those of end-to-end T2V models. Figure 20 displays the pros and cons of each subcategory under this section.

    Figure  20.  Pros and Cons of T2I-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    Compared to other research fields, T2V requires a lot of computational and data resources, and the models are usually released by industry. For commercial reasons, many models and training details are not open source. We summarize the existing open-source methods in Table 2 to help researchers quickly get started with the experiments.

    Table  2.  Open source T2V methods collation.
    Method Venue Frames Resolution Code Official Release
    Follow Your Pose [128] AAAI24 8 512×512 https://github.com/mayuelala/FollowYourPose
    ConditionVideo [129] AAAI24 24 512×512 https://github.com/pengbo807/ConditionVideo
    Make-A-Video [5] Arxiv22 16 256×256 https://github.com/lucidrains/make-a-video-pytorch ×
    LVDM [42] Arxiv22 16 256×256 https://github.com/YingqingHe/LVDM
    DirecT2V [112] Arxiv23 16 512×512 https://github.com/KU-CVLAB/DirecT2V
    LaVie [45] Arxiv23 61 1280×2048 https://github.com/Vchitect/LaVie
    ModelScope [98] Arxiv23 16 256×256 https://modelscope.cn/models/iic/text-to-video-synthesis/summary
    VidRD [52] Arxiv23 16 256×256 https://github.com/anonymous0x233/ReuseAndDiffuse
    VideoDirectorGPT [85] Arxiv23 16 256×256 https://github.com/HL-hanlin/VideoDirectorGPT
    Show-1 [44] Arxiv23 29 320×576 https://github.com/showlab/Show-1
    VideoFusion [49] Arxiv23 33 256×256 https://github.com/ai-forever/KandinskyVideo
    HiGen [68] Arxiv23 32 448×256 https://github.com/ali-vilab/VGen
    Animate Anyone [65] Arxiv23 24 768×768 https://github.com/HumanAIGC/AnimateAnyone
    StyleCrafter [76] Arxiv23 16 320×512 https://github.com/GongyeLiu/StyleCrafter
    DynamiCrafter [67] Arxiv23 16 576×1024 https://github.com/Doubiiu/DynamiCrafter
    MotionDirector [70] Arxiv23 16 384×384 https://github.com/showlab/MotionDirector
    FlowZero [114] Arxiv23 8 512×512 https://github.com/aniki-ly/FlowZero
    MagicTime [71] Arxiv24 16 512×512 https://github.com/PKU-YuanGroup/MagicTime
    Ctrl-Adapter [79] Arxiv24 16 256×256 https://github.com/HL-hanlin/Ctrl-Adapter
    CameraCtrl [80] Arxiv24 16 256×384 https://github.com/hehao13/CameraCtrl
    VideoTetris [72] Arxiv24 16 320×512 https://github.com/YangLing0818/VideoTetris
    MotionClone [81] Arxiv24 16 512×512 https://github.com/Bujiazi/MotionClone/
    Latte [60] Arxiv24 16 256×256 https://github.com/Vchitect/Latte
    VideoCrafter2 [64] Arxiv24 16 320×512 https://github.com/AILab-CVC/VideoCrafter
    MMVID [37] CVPR22 8 128×128 https://github.com/snap-research/MMVID
    MAGVIT [40] CVPR23 16 128×128 https://github.com/google-research/magvit
    Text2Performer [87] CVPR23 20 512×256 https://github.com/yumingj/Text2Performer
    Dysen-VDM [84] CVPR24 16 256×256 https://github.com/scofield7419/Dysen
    BIVDiff [130] CVPR24 8 512×512 https://github.com/MCG-NJU/BIVDiff
    LAMP [69] CVPR24 16 320×512 https://github.com/RQ-Wu/LAMP
    Tune-A-Video [110] ICCV23 32 512×512 https://github.com/showlab/Tune-A-Video
    Text2Video-Zero [109] ICCV23 8 512×512 https://github.com/Picsart-AI-Research/Text2Video-Zero
    CogVideo [4] ICLR23 16 480×480 https://github.com/THUDM/CogVideo
    LVD [86] ICLR24 16 512×512 https://github.com/TonyLianLong/LLM-groundedVideoDiffusion
    AnimateDiff [131] ICLR24 16 256×256 https://github.com/guoyww/AnimateDiff
    FreeNoise [93] ICLR24 64 1024×576 https://github.com/AILab-CVC/FreeNoise
    VDM [6] NeurIPS22 16 64×64 https://github.com/lucidrains/video-diffusion-pytorch ×
    Free-Bloom [113] NeurIPS23 6 512×512 https://github.com/SooLab/Free-Bloom

    Datasets for the T2V task can be categorized into two classes based on their text [14]: caption-level datasets, where the text is a detailed description of the corresponding video, and category-level datasets, where the text is a category label of the video.

    We list the common caption-level datasets in the T2V task in Table 3. From the table, we can observe that early datasets were manually annotated with text (Manual), and their videos are small in number, limited to a single domain (e.g., movie, action, cooking), and low in resolution (e.g., 240P). With the release of WebVid-10M [132], T2V datasets entered an era of rapid development, and WebVid-10M has become the most dominant dataset in the T2V task. However, its resolution is too low and its videos carry a watermark, leading to poor video quality. Therefore, subsequent datasets have increased the video resolution and added algorithms to filter inappropriate videos (e.g., those with watermarks or subtitles).

    Table  3.  The comparison of main caption-level video datasets.
    Dataset Text Domain Clips Res.
    MSVD/2011 [146] Manual Open 2K -
    MSR-VTT/2016 [147] Manual Open 10K 240P
    DideMo/2017 [148] Manual Flickr 27K -
    LSMDC/2017 [149] Manual Movie 118K 1080P
    ActivityNet/2017 [150] Manual Action 100K -
    YouCook2/2018 [151] Manual Cooking 14K -
    How2/2018 [152] Manual Instruct 80K -
    VATEX/2019 [153] Manual Action 41K 240P
    HowTo100M/2019 [133] ASR Instruct 136M 240P
    WTS70M/2020 [134] Metadata Action 70M -
    YT-Temporal/2021 [154] ASR Open 180M -
    WebVid10M/2021 [132] Alt-text Open 10.7M 360P
    Echo-Dynamic/2021 [155] Manual ECG 10K -
    Tiktok/2021 [156] Manual Action 0.3K -
    HD-VILA/2022 [157] ASR Open 103M 720P
    VideoCC3M/2022 [135] Transfer Open 10.3M -
    HD-VG-130M/2023 [16] Generated Open 130M 720P
    InternVid/2023 [10] Generated Open 234M 720P
    CelebV-Text/2023 [158] Generated Face 70K 480P
    Vimeo25M/2023 [45] Generated Open 25M -
    Panda-70M/2024 [140] Generated Open 70M 720P
    VidProM/2024 [144] Collected Open 6M -
    MiraData/2024 [159] Generated Game 57K -

    In addition to gradually improving the quality of the videos in the dataset, the newly released datasets also pay more attention to the alignment between text and video. Improving the alignment between text and video improves the generation performance of the model, which has been demonstrated in recent work [10], [101].

    Manual annotation can provide high-quality text, but as the number of videos rises, the burden of manual labor becomes unbearable. HowTo100M [133] and other datasets collect videos from YouTube and use the automatic speech recognition (ASR) transcripts provided by YouTube as the texts, but the semantic relevance is low. WebVid10M [132] uses Alt-text, and WTS70M [134] uses metadata (titles, descriptions, tags, and channel names). VideoCC3M [135] transfers a text-image dataset into a text-video dataset, using Conceptual Captions 3M [136] as the source: for each image-text pair, it finds video frames similar to the image, extracts short video clips around the matching frames, and pairs the text with those clips.

    The latest datasets all use different generative methods to get the texts, which saves labor and also ensures that the quality of the texts is high.

    HD-VG-130M [16] first cuts the videos using PySceneDetect so that each clip contains only one scene. It then selects the middle frame of each clip and uses BLIP-2 [137] to generate a textual description, which is used to describe the whole clip. InternVid [10] generates text at two scales, coarse and fine; the coarse scale is generated in the same way as HD-VG-130M [16]. At the fine scale, Tag2Text [138] generates a text description for each frame of the video, and these descriptions are then synthesized into a comprehensive caption using a pre-trained language model. CelebV-Text [158] utilizes a semi-automatic, template-based text generation strategy: an algorithm automatically labels attributes that are easy to label, attributes that are difficult to label are annotated manually, and the attributes are then filled into a template to obtain the final description of the video. Vimeo25M [45] uses VideoChat [139] to generate text automatically. Panda-70M [140] utilizes multiple models (including VideoLLaMA [141], VideoChat [139], VideoChat-Text [139], BLIP-2 [137], and MiniGPT-4 [142]) to generate texts and then fine-tunes Unmasked Teacher (UMT) [143] to select the best one. To minimize the computational requirements, it proposes a student model that distills knowledge from the teacher models. VidProM [144] collected 1.67 million T2V prompts from real users; based on these prompts, 6.69 million videos were generated by Pika, Text2Video-Zero [109], VideoCrafter2 [64], and ModelScope [98]. MiraData [159] uniformly samples eight frames from each video and arranges them into a large 2×4 grid image. A one-sentence caption is then generated for each video using Panda-70M's [140] caption model, and this caption is fed into GPT-4V [145] together with the 2×4 grid image to output multi-dimensional captions efficiently in one dialog round.
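    The "caption the middle frame" recipe described for HD-VG-130M can be sketched with BLIP-2 from Hugging Face transformers; the checkpoint name and the frame extraction step are illustrative assumptions.

```python
# Caption one representative frame per clip with BLIP-2.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_clip(frames):                      # frames: list of PIL images for one clip
    middle = frames[len(frames) // 2]          # single representative frame
    inputs = processor(images=middle, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
```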

As shown in Figure 21, we give examples of text-video pairs from MSVD, MSR-VTT, WebVid10M, and Panda70M to illustrate the development of T2V datasets. For each selected video, we show four frames sampled uniformly over time; if a video has multiple text annotations, we show two of them. For comparison, the videos are resized to the same size. Both MSVD and MSR-VTT have multiple text annotations per video, and MSVD even contains incorrect annotations. The video from MSR-VTT contains multiple scenes, while the others are single scenes. From WebVid10M to Panda70M, the text annotations become noticeably more precise.

    Figure  21.  Showcase of different datasets.

Without a suitable caption-level dataset, early T2V work used category-level datasets from other tasks to train models, e.g., UCF101 [160], Kinetics [161], and Something-Something [162] from action recognition, and DAVIS [163] from video editing. We list the category-level datasets that have been used for the T2V task in Table 4.

    Table  4.  The comparison of main Category-level video datasets.
    Datasets Categories Clips Res.
    KTH/2004 [164] 6 2K 160×120
    MUG/2010 [165] 6 1K 896×896
    UCF-101/2012 [160] 101 13K 256×256
    Cityscapes/2015 [166] 30 3K 256×256
    Moving MNIST/2016 [167] 10 10K 64×64
    Kinetics-400/2017 [168] 400 260K 256×256
    BAIR/2017 [169] 2 45K 64×64
    DAVIS/2017 [163] - 90 1280×720
    Sky Time-Lapse/2018 [170] 1 38K 256×256
    Ssthv2/2018 [162] 174 220K 256×256
    Kinetics-600/2018 [171] 600 495K 256×256
    MiT/2018 [172] 339 1M 340×256
    Tai-Chi-HD/2019 [173] 1 3K 256×256
    iPER/2019 [174] 10 206 256×256
    Bridge Data/2021 [175] 10 7K 256×256
    Mountain Bike/2022 [176] 1 1K 576×1024
    RDS/2023 [99] 2 683K 512×1024

Quantitative metrics cover the visual quality of generated videos and the alignment between text and video. To better evaluate the performance of T2V models, EvalCrafter [17] further refines the metrics on visual quality and text-video alignment and proposes metrics on motion quality and temporal consistency. These are introduced in the following four subsections. Qualitative metrics, i.e., subjective human evaluations, are introduced in Section 5.5.

The traditional metrics for measuring the visual quality of video are FVD [177] and IS [178], both developed from image-level visual metrics.

    Fréchet Video Distance (FVD) [177] builds on the principle of FID [179]. It measures the visual quality of the generated video by calculating the distance between the generated video’s distribution and the real video’s distribution. The calculation formula is shown in Eq. (1),

$d(P_R, P_G) = \lvert \mu_R - \mu_G \rvert^2 + \mathrm{Tr}\big(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\big)$
(1)

where $\mu_R$ and $\mu_G$ are the means, and $\Sigma_R$ and $\Sigma_G$ are the covariance matrices of $P_R$ and $P_G$, respectively. FVD [177] adopts Inflated 3D ConvNets (I3D) [168] pretrained on Kinetics [161] to extract features from videos.
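Given the I3D features of real and generated videos, Eq. (1) reduces to a few lines of linear algebra. The sketch below is a minimal illustration assuming two NumPy feature matrices; it is not the reference FVD implementation, which also fixes the feature extractor and sampling protocol.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Eq. (1): distance between two Gaussians fitted to video features.

    feats_real, feats_gen: (num_videos, feat_dim) I3D features.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```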

Inception Score (IS) [178] uses the Inception network [180], pre-trained on the ImageNet [181] dataset, as the feature extractor to evaluate image quality. When evaluating video quality, the feature extractor is replaced by 3D ConvNets (C3D) [182]. The calculation formula is shown in Eq. (2),

$\mathrm{IS} = \exp\big(\mathbb{E}_{x \sim P_G}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\big)$
(2)

where $p(y)$ is the marginal distribution over all videos and $p(y \mid x)$ denotes the output distribution of the classifier given a generated video $x$. IS measures the diversity of the generated videos, with larger scores indicating more variety in the generated content.
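Eq. (2) is likewise straightforward once the per-video class probabilities from the C3D classifier are available. The following is a small illustrative sketch that assumes a `(num_videos, num_classes)` probability matrix; it is not tied to any particular official implementation.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Eq. (2): IS from per-video class probabilities.

    probs: (num_videos, num_classes) softmax outputs of the video classifier.
    """
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # KL(p(y|x) || p(y)) per class
    return float(np.exp(kl.sum(axis=1).mean()))               # exp of the mean KL
```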

A recent study, EvalCrafter [17], utilizes Dover [183] to assess the visual quality of generated videos. Dover consists of two components, VQA$_A$ and VQA$_T$, the aesthetic and technical scores, respectively. The technical perspective quantifies the perception of distortions, while the aesthetic perspective focuses on content preferences and recommendations.

    In addition to video quality assessment, measuring the alignment between input text and generated video is another important perspective for evaluating T2V generation. The traditional evaluation metric is CLIPSIM [27], and EvalCrafter [17] further proposes more metrics to measure the text-video alignment more comprehensively. These evaluation metrics will be described below.

CLIPSIM [27] is calculated by first encoding each frame and the text with the CLIP [19] model to obtain embeddings and then computing the cosine similarity between them. The similarities between the frames and the input text are averaged to give the final similarity between the video and the text, as described in Eq. (3),

$\mathrm{CLIPSIM}(p, x) = \frac{1}{T}\sum_{t=1}^{T} C\big(\mathrm{emb}(x_t), \mathrm{emb}(p)\big)$
(3)

where $x_t$ denotes the $t$-th frame of the video, $\mathrm{emb}(\cdot)$ the CLIP embedding, $C(\cdot,\cdot)$ the cosine similarity, and $p$ the text.
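As a concrete illustration of Eq. (3), the sketch below computes CLIPSIM with the Hugging Face CLIP checkpoint `openai/clip-vit-base-patch32`, taking a list of PIL frames as input; the original GODIVA implementation may use a different CLIP variant and frame-sampling scheme.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipsim(frames, prompt: str) -> float:
    """Eq. (3): average CLIP cosine similarity between each frame and the prompt.

    frames: list of PIL.Image frames sampled from the generated video.
    """
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**image_inputs)
    txt_emb = model.get_text_features(**text_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # L2-normalize
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).mean())               # average over frames
```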

It is worth mentioning that the accuracy of CLIPSIM depends entirely on the CLIP [19] model. To reduce this side effect, the Relative Matching (RM) [27] metric calculates the ratio of the CLIPSIM of the generated video to that of the ground-truth video. There are three other CLIPSIM-like metrics: CLIPScore-ft is based on a CLIP model fine-tuned on the MSR-VTT dataset [147], while BLIPScore and UMTScore use BLIP [137] and UMT [143] instead of CLIP.

In practical scenarios, limited by the performance of the CLIP model and the complexity of the prompts, the traditional metrics above do not work well. Therefore, a series of metrics is proposed in EvalCrafter [17].

SD-Score uses SDXL [184] to generate $N_1$ images per prompt and extracts visual embeddings to compute the similarity between the generated video and the SDXL images. Essentially, SDXL [184] acts as the teacher and the video generation model as the student, with the student's results expected to be close to the teacher's. The calculation is shown in Eq. (4),

$S_{SD} = \frac{1}{M}\sum_{i=1}^{M}\bigg(\frac{1}{N}\sum_{t=1}^{N}\Big(\frac{1}{N_1}\sum_{k=1}^{N_1} C\big(\mathrm{emb}(x_t^i), \mathrm{emb}(d_k^i)\big)\Big)\bigg)$
(4)

where $x_t^i$ denotes the $t$-th frame of the $i$-th video and $d_k^i$ the $k$-th SDXL-generated image for prompt $i$. $N_1$ is typically set to 5.

BLIP-BLEU uses BLIP2 [185] to generate captions for the generated video and computes the BLEU [186] similarity between these captions and the prompt, as shown in Eq. (5),

$S_{BB} = \frac{1}{M}\sum_{i=1}^{M}\Big(\frac{1}{N_2}\sum_{k=1}^{N_2} B\big(p_i, l_k^i\big)\Big)$
(5)

where $B(\cdot,\cdot)$ is the BLEU similarity scoring function, $\{l_k^i\}_{k=1}^{N_2}$ are the BLIP-generated captions for the $i$-th video, and $N_2$ is typically set to 5.
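Assuming the BLIP-generated captions are already available, Eq. (5) for a single video reduces to averaging BLEU scores. The sketch below uses NLTK's `sentence_bleu` and treats the prompt as the reference, which is one reasonable choice but not necessarily the one made in EvalCrafter.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def blip_bleu(prompt: str, generated_captions: list[str]) -> float:
    """Eq. (5) for a single video: average BLEU between the prompt and each
    BLIP-generated caption of the generated video."""
    smooth = SmoothingFunction().method1        # avoid zero scores on short texts
    reference = [prompt.lower().split()]
    scores = [
        sentence_bleu(reference, caption.lower().split(), smoothing_function=smooth)
        for caption in generated_captions
    ]
    return sum(scores) / len(scores)
```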

OCR-Score checks whether text that is required to appear in the video actually appears in the generated video, testing the model's ability to render text. PaddleOCR is used to detect the English text in the generated video; the word error rate (WER) [187], the normalized edit distance (NED) [188], and the character error rate (CER) [189] are then calculated, and their average is the OCR-Score.
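The error rates behind OCR-Score are all edit-distance based. The sketch below gives illustrative WER and CER implementations, assuming the recognized text has already been obtained from PaddleOCR; the exact normalization used in EvalCrafter may differ.

```python
def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between the requested text and the OCR output."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate on the same pair of strings."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```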

    Detection-Score detects whether the requested objects appear in the video,

$S_{Det} = \frac{1}{M_1}\sum_{i=1}^{M_1}\Big(\frac{1}{N}\sum_{t=1}^{N}\sigma_t^i\Big)$
(6)

where $M_1$ is the number of prompts containing objects and $\sigma_t^i$ is the detection result for frame $t$ of video $i$ (1 if an object is detected and 0 otherwise).

    Count-Score detects whether the number of objects in the video is correct,

$S_{Count} = \frac{1}{M_2}\sum_{i=1}^{M_2}\Big(1 - \frac{1}{N}\sum_{t=1}^{N}\frac{\lvert c_t^i - \hat{c}_i\rvert}{\hat{c}_i}\Big)$
(7)

where $M_2$ is the number of prompts with object counts, $c_t^i$ is the detected object count in frame $t$ of video $i$, and $\hat{c}_i$ is the ground-truth object count for video $i$.
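Given the per-frame detection results, Eq. (7) is plain arithmetic. The following sketch assumes a matrix of detected counts and a vector of ground-truth counts (variable names are illustrative).

```python
import numpy as np

def count_score(detected_counts: np.ndarray, gt_counts: np.ndarray) -> float:
    """Eq. (7): detected_counts has shape (M2, N) with the object count found in
    each frame; gt_counts has shape (M2,) with the count requested by each prompt
    (assumed to be positive)."""
    rel_err = np.abs(detected_counts - gt_counts[:, None]) / gt_counts[:, None]
    return float((1.0 - rel_err.mean(axis=1)).mean())
```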

    Color-Score detects whether the color in the video matches the description in the prompt,

$S_{Color} = \frac{1}{M_3}\sum_{i=1}^{M_3}\Big(\frac{1}{N}\sum_{t=1}^{N} s_t^i\Big)$
(8)

where $M_3$ is the number of prompts with object colors and $s_t^i$ is the color accuracy for frame $t$ of video $i$ (1 if the detected color matches the ground-truth color, 0 otherwise).

    Celebrity ID Score calculates the distance between the celebrity in the generated video and the real image of the celebrity,

$S_{CIS} = \frac{1}{M_4}\sum_{i=1}^{M_4}\Big(\frac{1}{N}\sum_{t=1}^{N}\min_{k \in \{1,\dots,N_3\}} D\big(x_t^i, f_k^i\big)\Big)$
(9)

where $M_4$ is the number of prompts that contain celebrities, $D(\cdot,\cdot)$ is DeepFace's [190] distance function, $\{f_k^i\}_{k=1}^{N_3}$ are the collected images of the celebrity in prompt $i$, and $N_3$ is set to 3.

Previous T2V studies did not consider metrics for evaluating the motion quality of the generated video. EvalCrafter [17] proposes Action-Score, Flow-Score, and Motion AC-Score for motion quality assessment.

    Action Recognition (Action-Score) recognizes human actions in the generated video using the MMAction2 toolbox [191]. The action score is calculated as accuracy by comparing the recognized action with the action in the original prompt.

Average Flow (Flow-Score) uses the pre-trained optical flow estimator RAFT [192] to extract dense flow at two-frame intervals and then averages the flow magnitude over the whole video clip. This helps identify static videos.
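To illustrate the averaging step of Flow-Score, the sketch below substitutes OpenCV's Farnebäck estimator for RAFT purely so that the example is self-contained; EvalCrafter itself uses RAFT flow at two-frame intervals.

```python
import cv2
import numpy as np

def average_flow_score(frames: list[np.ndarray]) -> float:
    """Average optical-flow magnitude over a clip (RAFT replaced here by
    OpenCV's Farneback estimator purely for illustration).

    frames: list of HxWx3 uint8 RGB frames.
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    magnitudes = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(magnitudes))   # near-zero values indicate a static video
```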

    Amplitude Classification Score (Motion AC-Score). Based on the average flow, Motion AC-Score calculates the motion amplitude of the generated video and determines whether the amplitude is the same as the amplitude specified by the prompt. This gives us a clearer picture of the motion changes in the video.

Warping Error. A pre-trained optical flow estimation network [192] is first used to obtain the optical flow between every two frames; the difference between the warped frame and the predicted frame is then computed pixel by pixel, and the final score is the average over all frame pairs.
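A minimal sketch of the warping step is shown below, assuming the forward optical flow between consecutive frames has already been estimated. Frame $t{+}1$ is warped back to frame $t$ by backward sampling with `cv2.remap`, and the mean absolute difference serves as the per-pair error; the exact aggregation in EvalCrafter may differ.

```python
import cv2
import numpy as np

def warping_error(frame_t: np.ndarray, frame_t1: np.ndarray,
                  flow_t_to_t1: np.ndarray) -> float:
    """Warp frame t+1 back to frame t with the forward flow and measure the
    per-pixel difference (lower means more temporally consistent).

    frame_t, frame_t1: HxWx3 float32 frames in [0, 1].
    flow_t_to_t1:      HxWx2 optical flow from frame t to frame t+1.
    """
    h, w = flow_t_to_t1.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_t_to_t1[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_t_to_t1[..., 1]).astype(np.float32)
    warped = cv2.remap(frame_t1, map_x, map_y, cv2.INTER_LINEAR)
    return float(np.abs(warped - frame_t).mean())
```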

Semantic Consistency (CLIP-Temp). The CLIP embedding is computed for each frame of the generated video, the similarity between every two consecutive frames is calculated, and the average of these similarities gives the final score.

    Face Consistency. This metric evaluates the human identity consistency of the generated video. It is calculated by selecting the first frame as the reference frame and calculating the cosine similarity between the embedding of the reference frame and the embeddings of other frames. The average of these similarities is taken as the final score.
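Both temporal-consistency metrics reduce to averaging cosine similarities between frame embeddings. The sketch below assumes precomputed, L2-normalized embeddings (CLIP embeddings for CLIP-Temp, face embeddings for Face Consistency) and is only illustrative.

```python
import numpy as np

def clip_temp(frame_embs: np.ndarray) -> float:
    """CLIP-Temp: average cosine similarity between consecutive frame embeddings.
    frame_embs: (T, D) L2-normalized CLIP embeddings of the generated frames."""
    sims = (frame_embs[:-1] * frame_embs[1:]).sum(axis=-1)
    return float(sims.mean())

def face_consistency(face_embs: np.ndarray) -> float:
    """Face Consistency: similarity of every frame's face embedding to the
    first (reference) frame, averaged. face_embs: (T, D), L2-normalized."""
    sims = face_embs[1:] @ face_embs[0]
    return float(sims.mean())
```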

Although many automated evaluation metrics have been introduced in the previous sections, some of them have been found to be inconsistent with human judgments [3], [193], [194], indicating that automated metrics may not always be reliable. Therefore, the human perspective remains essential for evaluating generated videos.

    There are four main benchmarks that are widely used by the public: DrawBench [195], FETV [18], EvalCrafter [17] and VBench [196].

DrawBench [195] is a benchmark for T2I generation, but it can also be used for T2V generation. It was proposed to compensate for COCO's [197] limited range of prompts, following efforts such as the newly proposed PaintSkills [198] that systematically assess visual reasoning skills and social biases beyond COCO [197]. DrawBench has eleven evaluation categories with a total of 200 prompts, covering color, count, spatial positioning, conflicting interactions, long descriptions, misspellings, rare words, quoted words, and so on.

    FETV [18] is a fine-grained evaluation benchmark for T2V generation. It consists of 619 prompts, with 541 prompts sourced from existing datasets and 78 unique prompts created by the authors. Each prompt is categorized based on three aspects: the main content, attributes, and complexity. The feature referred to as “main content” was further divided into spatial and temporal categories. Similarly, “attribute control” encompasses both spatial and temporal qualities. The feature of “prompt complexity” is categorized into three levels: “simple”, “medium”, and “complex”, which are determined by the number of consecutive words in the prompts. By employing classification, the FETV benchmark can be subdivided into distinct subsets, enabling fine-grained evaluation.

    EvalCrafter [17] aims to create a list of reliable prompts to assess the capabilities of various T2V models fairly. To achieve its goal, EvalCrafter collected and analyzed a large number of prompts from the real world and selected more than 500. Afterward, EvalCrafter proposes an automated pipeline to increase the diversity of the selected prompts. In total, there are 50 styles and 20 camera motion prompts in the benchmark, and the average length of the prompts is 12.5 words, similar to real-world prompts.

    VBench [196] is a comprehensive benchmark suite for video generative models. It decomposes video generation quality into 16 dimensions, and each evaluation dimension assesses one aspect of video generation quality. To reduce the overhead of generating videos, it accurately filters the set of tested prompts; for each metric, there are only 100 prompts. Experiments show that VBench’s evaluation results align well with human perception.

    GenAI-Arena [199] is an open platform designed to rank generative models across text-to-image, image editing, and text-to-video tasks based on user preferences. Unlike other platforms, it is driven by community voting to ensure transparency and sustainable operation. It adopts the side-by-side human voting method to evaluate the models and releases the human preference voting as GenAI-Bench.

Based on the benchmarks mentioned above, researchers present the videos generated by their model alongside those generated by other models and ask observers to choose the best one according to certain aspects. The commonly examined aspects are video frame quality, semantic relevance, motion realism, etc.

To demonstrate the consistency of automatic assessment results with human assessment, some studies [17], [18] calculate Spearman's rank correlation coefficient [200] and Kendall's rank correlation coefficient [201]. These coefficients reveal the direction and strength of the relationship between automatic and human assessment scores.
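Both coefficients are available in SciPy. The sketch below compares an automatic metric's scores with human ratings collected on the same set of generated videos; the function and dictionary-key names are illustrative.

```python
from scipy.stats import spearmanr, kendalltau

def metric_human_agreement(metric_scores, human_scores):
    """Rank correlations between automatic metric scores and human ratings
    collected on the same generated videos."""
    rho, rho_p = spearmanr(metric_scores, human_scores)
    tau, tau_p = kendalltau(metric_scores, human_scores)
    return {"spearman": rho, "spearman_p": rho_p,
            "kendall": tau, "kendall_p": tau_p}
```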

Dataset. Currently, the T2V task is mainly evaluated in a zero-shot manner on the MSR-VTT [147] and UCF-101 [160] datasets. MSR-VTT [147] consists of 10,000 video clips in 20 categories, each described by approximately 20 natural-language sentences. Typically, the textual descriptions of the 2,990 video clips in the test set are used as prompts to generate the corresponding videos. UCF-101 [160] consists of 13,320 video clips divided into 101 categories.

    Evaluation Metrics. For the MSR-VTT [147] dataset, the FVD [177] and FID [179] metrics are used to evaluate the video quality, and CLIPSIM [27] is used to measure the alignment between text and video. For the UCF-101 [160] dataset, the Inception Score, FVD [177], and FID [179] are used to evaluate the quality of the generated video and its frames. Many of the metrics mentioned are not yet widely used and are therefore not included in the statistics.

Comparison of Results. We summarize the experimental results of most existing methods in Table 5.

    Table  5.  Organization of experimental results on video generation methods.
    Method Venue Training Dataset Res. Params MSRVTT UCF-101
    FID(↓) FVD(↓) CLIPSIM(↑) FID(↓) FVD(↓) IS(↑)
    [UCF-101] Generate videos directly using category names.
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    MagicVideo [102] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 36.5 998 145 655
    MicroCinema [202] Arxiv23 WebVid [132] 448 × 448 377.4 0.2967 342.86 37.46
    [UCF-101] Generate videos using the template sentences corresponding to category names.
    Make-A-Video [5] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 9.72B 13.17 0.3049 367.23 33
    VideoFactory [16] Arxiv24 256 × 256 2.04B 0.3005 410
    POS [51] Arxiv23 training free 256 × 256 42.29 0.2993 566.68 38.19
    VideoGen [107] Arxiv23 WebVid [132] 256 × 256 0.3127 554 71.61
    PYoCo [108] ICCV23 256 × 256 9.73 355.19 47.76
    Latent-shift [111] Arxiv23 256 × 256 1.53B 15.23 0.2773
    PixelDance [203] Arxiv23 336 × 596 1.5B 381 0.3125 49.36 242.82 42.1
    [MSRVTT] Generate videos for all sentences.
    Make-A-Video [5] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 9.72B 13.17 0.3049 367.23 33
    VideoGen [107] Arxiv23 WebVid [132] 256 × 256 0.3127 554 71.61
    Latent-shift [111] Arxiv23 256 × 256 1.53B 15.23 0.2773
    [MSRVTT] Generate videos by randomly selecting one out of every 20 sentences.
    VideoFactory [16] Arxiv24 WebVid [132]
    HD-VILA [157]
    256 × 256 2.04B 0.3005 410
    VersVideo [43] ICLR24 WebVid [132] 256 × 256 2B 421 0.3014 119 81.3
    POS [51] Arxiv23 training free 256 × 256 42.29 0.2993 566.68 38.19
    SimDA [103] CVPR24 WebVid [132] 256 × 256 1.08B 456 0.2945
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    PixelDance [203] Arxiv23 WebVid [132] 336 × 596 1.5B 381 0.3125 49.36 242.82 42.1
    UniVG [204] Arxiv24 1280 × 720 336 0.3014
    [FVD] Sample size with 2048.
    CogVideo [4] ICLR23 WebVid [132] 480 × 480 9.4B 23.59 1294 0.2631 179 701.59 25.27
    Show-1 [44] Arxiv23 320 × 576 13.08 538 0.3072 394.46 35.42
    ModelScope [98] Arxiv23 256 × 256 1.7B 11.09 550 0.293 410
    LVD [86] ICLR24 training free 512 × 512 1.7B 521 861
    [FVD] Sample size with 10000.
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    MicroCinema [202] Arxiv23 WebVid [132] 448 × 448 377.4 0.2967 342.86 37.46
    [Others]
    VDM [6] NeurIPS22 UCF101 [160] 64 × 64 9.72B 298 57.62
    NUWA [35] ECCV22 VATEX [153] 256 × 256 47.68 0.2439
    CogVideo [4] ICLR23 WebVid [132] 480 × 480 9.4B 23.59 1294 0.2631 179 701.59 25.27
    LVDM [42] Arxiv22 256 × 256 1.16B 742 0.2381 641.8
    Dysen-VDM [84] CVPR24 256 × 256 12.64 0.3204 325.42 35.57
    VideoDirGPT [85] Arxiv23 256 × 256 1.92B 12.22 550 0.286
    VideoFusion [49] CVPR23 256 × 256 1.83B 581 0.2795 75.77 639.9 17.49
    Video-LDM [99] CVPR23 256 × 256 4.2B 0.2929 550.61 33.45
    HiGen [68] Arxiv23 448 × 256 8.6 406 0.2947
    VideoComposer [75] NeurIPS23 256 × 256 1.85B 580 0.2932
    TF-T2V [63] Arxiv23 448 × 256 8.19 441 0.2991
    ART•V [55] Arxiv23 768 × 768 291.08 0.2859 315.69 50.34
    MoVideo [46] Arxiv23 256 × 256 12.71 0.3213 313.41 34.13
    InternVid [10] Arxiv23 WebVid [132]
    InternVid [10]
    256 × 256 0.2951 60.25 616.51 21.04
    VidRD [52] Arxiv23 WebVid [132]
    Kinetics [168]
    VideoLT [205]
    256 × 256 363.19 39.37
    Imagen Video [100] Arxiv23 1280 × 768
    FusionFrames [206] Arxiv23 256 × 256 0.2976 433.05 24.33
    W.A.L.T [61] Arxiv23 128 × 224 3B 258.1 35.1
    HPDM [57] CVPR24 144 × 256 383.3 21.15

All experimental results in Table 5 are taken from the original papers. It is worth noting that there is no consensus on experimental settings for T2V, so the results of these works are only partially comparable. Nevertheless, listing them in one table gives readers a brief impression of the trend of T2V evaluation. To make the most of the comparable results, we group comparable works together to give more precise insights.

    Generate videos directly using category names on UCF-101 [45], [102], [202]: [202] performs the best with the lowest FVD value.

    Generate videos using the template sentences corresponding to category names on UCF-101 [5], [16], [51], [107], [108], [111], [203]: [203] gives the best FVD value and [107] gives the best IS value.

    Generate videos for all sentences on MSRVTT [5], [107], [111]: [107] shows the best CLIPSIM value. [5] shows the best FID value.

    Generate videos by randomly selecting one out of every 20 sentences on MSRVTT [16], [43], [45], [51], [103], [203], [204]: [203] has the best CLIPSIM value and [204] has the best FVD value.

    Calculate FVD with a sample size of 2048 [4], [44], [86], [98]: On the MSRVTT dataset, [86] has the smallest FVD. On the UCF-101 dataset, the FVD of [44] is the minimum.

    Calculate FVD with a sample size of 10000 [45], [202]: [202] achieved minimal FVD on both the MSRVTT dataset and the UCF-101 dataset.

    For readers interested in the experimental settings, we recommend checking the original paper for detailed information.

    Quantitative relationships in video. When a fixed number of objects is specified in the prompt, it is sometimes incorrectly reflected in the generated video. For example, the prompt mentions that two people are present, but the generated video has only one person throughout, or it changes from two people to some other number of people.

    Causality of events. The model has difficulty understanding how actions and behaviors will drive events. An example would be coloring and painting a wall, but the wall color does not change.

    Object interactions. The model has trouble understanding the boundaries between objects and modeling their interactions. For example, after throwing a ball, the ball and the basket merge instead of being bounced off.

    Scale and proportionality. The model experiences difficulty understanding the relationship between scale size and proportion of different objects in different parts of the scene. For example, one person in the same scene is particularly short while another is particularly tall.

    Object illusion. The objects generated by the model are unstable, appearing or disappearing suddenly in the video.

    Large-scale open-source T2V datasets. Although many datasets have been proposed recently, the number is insufficient for the model to learn. Also, the quality of the videos and texts in the datasets needs to be continuously improved so that the model performance can be further enhanced. It is also essential to open source collected datasets, which can effectively accelerate the progress of the research.

Efficient training methods and model architecture. Training a T2V model takes a great deal of computation and time. More efficient training methods and architectures reduce the time required for inference and lower the hardware requirements, which can significantly facilitate the application of such models.

Comprehensive metrics for evaluation. While the recent EvalCrafter [17] and FETV [18] have largely filled this gap, the newly proposed metrics are expected to be included in future method comparisons.

    Abstract text generation. Existing T2V generation methods all assume the input text is concrete, which is not always practical in the real world. For abstract words or abstract sentences, it is difficult for the model to generate well, and the quality of the generated video will drop significantly. For example, the prompt is “Hard work is a virtue”. However, such a demand is reasonable because people think abstractly, and abstract ideas can be challenging to describe. We hope that the results generated by the model can conform to our abstract thinking or help our abstract thinking to become more concrete.

Long video generation. Most of the research works mentioned in this survey can only generate short videos of about 2 seconds (16 frames), which limits their applications. If long videos of relatively high quality can be generated, T2V generation will have excellent application prospects.

In this article, we present a thorough survey of text-to-video generation techniques and systematically categorize methods into 1) VAE-based approaches, 2) GAN-based approaches, 3) auto-regressive transformer-based approaches, 4) diffusion-based approaches, and 5) T2I-for-video-generation approaches. This survey comprehensively reviews nearly one hundred representative T2V generation approaches and includes the latest method published in July 2024. In addition, we introduce 40 video datasets, 20 evaluation metrics, and available open-source T2V models, making it easy for readers who would like to work on T2V generation research. Furthermore, we report comparative performance evaluations. Finally, we discuss challenges and future trends that move the field forward.

    This work was supported by the National Natural Science Foundation of China No.62206123, and a research gift from the Office of the Cyberspace Administration of Shenzhen Municipal Committee of the Communist Party of China.

  • [1]
    A. Ramesh, M. Pavlov, G. Goh, et al., “Zero-shot text-to-image generation,” in Proceedings of the 38th International Conference on Machine Learning, virtual event, pp. 8821–8831, 2021.
    [2]
    A. Ramesh, P. Dhariwal, A. Nichol, et al., “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint, arXiv: 2204.06125, 2022.
    [3]
    M. Ding, W. D. Zheng, W. Y. Hong, et al., “CogView2: Faster and better text-to-image generation via hierarchical transformers,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 1229, 2022.
    [4]
    W. Y. Hong, M. Ding, W. D. Zheng, et al., “CogVideo: Large-scale pretraining for text-to-video generation via transformers,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–24, 2023.
    [5]
    U. Singer, A. Polyak, T. Hayes, et al., “Make-a-video: Text-to-video generation without text-video data,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–16, 2023.
    [6]
    J. Ho, T. Salimans, A. Gritsenko, et al., “Video diffusion models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 628, 2022.
    [7]
    M. J. Chen, X. Tan, B. H. Li, et al., “AdaSpeech: Adaptive text to speech for custom voice,” in Proceedings of the 9th International Conference on Learning Representations, virtual event, pp. 1–10, 2021.
    [8]
    Y. Zhang, R. J. Weiss, H. G. Zen, et al., “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 2080–2084, 2019.
    [9]
    D. Paul, M. P. Shifas, Y. Pantazis, et al., “Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, pp. 1361–1365, 2020.
    [10]
    Y. Wang, Y. N. He, Y. Z. Li, et al., “InternVid: A large-scale video-text dataset for multimodal understanding and generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–25, 2024.
    [11]
    T. Brooks, B. Peebles, C. Holmes, et al., “Video generation models as world simulators,” Available at: https://openai.com/research/video-generation-models-as-world-simulators, 2024-02-15.
    [12]
    A. Singh, “A survey of AI text-to-image and AI text-to-video generators,” in Proceedings of 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC), Cairo, Egypt, pp. 32–36, 2023.
    [13]
    J. Cho, F. D. Puspitasari, S. Zheng, et al., “Sora as an AGI world model? A complete survey on text-to-video generation,” arXiv preprint, arXiv: 2403.05131, 2024.
    [14]
    Z. Xing, Q. J. Feng, H. R. Chen, et al., “A survey on video diffusion models,” ACM Computing Surveys, vol. 57, no. 2, article no. 41, 2025. DOI: 10.1145/3696415
    [15]
    R. Sun, Y. M. Zhang, T. Shah, et al., “From Sora what we can see: A survey of text-to-video generation,” arXiv preprint, arXiv: 2405.10674, 2024.
    [16]
W. J. Wang, H. Yang, Z. X. Tuo, et al., “VideoFactory: Swap attention in spatiotemporal diffusions for text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–30, 2024.
    [17]
    Y. F. Liu, X. D. Cun, X. B. Liu, et al., “EvalCrafter: Benchmarking and evaluating large video generation models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 22139–22149, 2024.
    [18]
    Y. X. Liu, L. Li, S. H. Ren, et al., “FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2723, 2023.
    [19]
    A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, virtual event, pp. 8748–8763, 2021.
    [20]
    J. Devlin, M. W. Chang, K. Lee, et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186, 2019.
    [21]
    C. Raffel, N. Shazeer, A. Roberts, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 1, article no. 140, 2020.
    [22]
    H. Touvron, T. Lavril, G. Izacard, et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint, arXiv: 2302.13971, 2023.
    [23]
D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, pp. 1–14, 2014.
    [24]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020. DOI: 10.1145/3422622
    [25]
    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no. 574, 2020.
    [26]
    G. Mittal, T. Marwah, and V. N. Balasubramanian, “Sync-DRAW: Automatic video generation using deep recurrent attentive architectures,” in Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, pp. 1096–1104, 2017.
    [27]
    C. F. Wu, L. Huang, Q. X. Zhang, et al., “GODIVA: Generating open-domain videos from natural descriptions,” arXiv preprint, arXiv: 2104.14806, 2021.
    [28]
    Y. W. Pan, Z. F. Qiu, T. Yao, et al., “To create what you tell: Generating videos from captions,” in Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, pp. 1789–1798, 2017.
    [29]
    Y. Balaji, M. R. Min, B. Bai, et al., “Conditional GAN with discriminative filter generation for text-to-video synthesis,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 1995–2001, 2019.
    [30]
    Y. T. Li, M. Min, D. H. Shen, et al., “Video generation from text,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp. 7065–7072, 2018.
    [31]
    K. L. Deng, T. Y. Fei, X. Huang, et al., “IRC-GAN: Introspective recurrent convolutional GAN for text-to-video generation,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 2216–2222, 2019.
    [32]
    D. Kim, D. Joo, and J. Kim, “TiVGAN: Text to image to video generation with step-by-step evolutionary generator,” IEEE Access, vol. 8, pp. 153113–153122, 2020. DOI: 10.1109/ACCESS.2020.3017881
    [33]
    Y. Li, Z. Gan, Y. Shen, et al., “StoryGAN: A sequential conditional GAN for story visualization,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 6322–6331, 2019.
    [34]
    B. W. Li, “Word-level fine-grained story visualization,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 347–362, 2022.
    [35]
    C. F. Wu, J. Liang, L. Ji, et al., “NÜWA: Visual synthesis pre-training for neural visual world creation,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 720–736, 2022.
    [36]
    D. Kondratyuk, L. J. Yu, X. Y. Gu, et al., “VideoPoet: A large language model for zero-shot video generation,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, article no. 1005, 2024.
    [37]
    L. G. Han, J. Ren, H. Y. Lee, et al., “Show me what and tell me how: Video synthesis via multimodal conditioning,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 3605–3615, 2022.
    [38]
    H. Liu, W. Yan, M. Zaharia, et al., “World model on million-length video and language with Blockwise RingAttention,” arXiv preprint, arXiv: 2402.08268, 2024.
    [39]
    R. Villegas, M. Babaeizadeh, P. J. Kindermans, et al., “Phenaki: Variable length video generation from open domain textual descriptions,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–14, 2023.
    [40]
    L. J. Yu, Y. Cheng, K. Sohn, et al., “MAGVIT: Masked generative video transformer,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 10459–10469, 2023.
    [41]
    X. F. Wang, Z. Zhu, G. Huang, et al., “WorldDreamer: Towards general world models for video generation via predicting masked tokens,” arXiv preprint, arXiv: 2401.09985, 2024.
    [42]
    Y. Q. He, T. Y. Yang, Y. Zhang, et al., “Latent video diffusion models for high-fidelity long video generation,” arXiv preprint, arXiv: 2211.13221, 2022.
    [43]
    J. X. Xiang, R. C. Huang, J. Zhang, et al., “VersVideo: Leveraging enhanced temporal diffusion models for versatile video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–19, 2024.
    [44]
    D. J. Zhang, J. Z. Wu, J. W. Liu, et al., “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, in press.
    [45]
    Y. H. Wang, X. Y. Chen, X. Ma, et al., “LaVie: High-quality video generation with cascaded latent diffusion models,” International Journal of Computer Vision, in press.
    [46]
    J. Y. Liang, Y. C. Fan, K. Zhang, et al., “MoVideo: Motion-aware video generation with diffusion model,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 56–74, 2024.
    [47]
    Z. Q. Yuan, Y. X. Liu, Y. H. Cao, et al., “Mora: Enabling generalist video generation via a multi-agent framework,” arXiv preprint, arXiv: 2403.13248, 2024.
    [48]
    Y. B. Zhang, Y. X. Wei, X. H. Lin, et al., “VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models,” arXiv preprint, arXiv: 2403.05438, 2024.
    [49]
Z. X. Luo, D. Y. Chen, Y. Y. Zhang, et al., “Notice of Removal: VideoFusion: Decomposed diffusion models for high-quality video generation,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, pp. 10209–10218, 2023.
    [50]
H. J. Yuan, S. W. Zhang, X. Wang, et al., “InstructVideo: Instructing video diffusion models with human feedback,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 6463–6474, 2024.
    [51]
    S. J. Ma, H. Y. Xu, M. J. Li, et al., “POS: A prompts optimization suite for augmenting text-to-video generation,” arXiv preprint, arXiv: 2311.00949, 2023.
    [52]
    J. X. Gu, S. C. Wang, H. Y. Zhao, et al., “Reuse and diffuse: Iterative denoising for text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–16, 2024.
    [53]
    S. T. Su, J. Z. Liu, L. L. Gao, et al., “F³-pruning: A training-free and generalized pruning strategy towards faster and finer text-to-video synthesis,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4961–4969, 2024.
    [54]
    X. Wang, S. W. Zhang, H. Zhang, et al., “VideoLCM: Video latent consistency model,” arXiv preprint, arXiv: 2312.09109, 2023.
    [55]
    W. M. Weng, R. Y. Feng, Y. H. Wang, et al., “ART · V: Auto-regressive text-to-video generation with diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 2024, pp. 7395–7405.
    [56]
    H. Zhang, Z. X. Wu, Z. Xing, et al., “AdaDiff: Adaptive step selection for fast diffusion,” arXiv preprint, arXiv: 2311.14768, 2023.
    [57]
    I. Skorokhodov, W. Menapace, A. Siarohin, et al., “Hierarchical patch diffusion models for high-resolution video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 7569–7579, 2024.
    [58]
    Y. R. Yang, J. C. Zhang, Y. Deng, et al., “Mobius: A high efficient spatial-temporal parallel training paradigm for text-to-video generation task,” arXiv preprint, arXiv: 2407.06617, 2024.
    [59]
    H. Y. Lu, G. X. Yang, N. Y. Fei, et al., “VDT: General-purpose video diffusion transformers via mask modeling,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–28, 2024.
    [60]
    X. Ma, Y. H. Wang, G. Y. Jia, et al., “Latte: Latent diffusion transformer for video generation,” arXiv preprint, arXiv: 2401.03048, 2024.
    [61]
    A. Gupta, L. J. Yu, K. Sohn, et al., “Photorealistic video generation with diffusion models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 393–411, 2024.
    [62]
    W. Menapace, A. Siarohin, I. Skorokhodov, et al., “Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7038–7048, 2024.
    [63]
    X. Wang, S. W. Zhang, H. J. Yuan, et al., “A recipe for scaling up text-to-video generation with text-free videos,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 6572–6582, 2024.
    [64]
    H. X. Chen, Y. Zhang, X. D. Cun, et al., “VideoCrafter2: Overcoming data limitations for high-quality video diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7310–7320, 2024.
    [65]
    H. Li, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2024, pp. 8153–8163.
    [66]
    P. Esser, J. Chiu, P. Atighehchian, et al., “Structure and content-guided video synthesis with diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 7312–7322, 2023.
    [67]
    J. B. Xing, M. H. Xia, Y. Zhang, et al., “DynamiCrafter: Animating open-domain images with video diffusion priors,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 399–417, 2024.
    [68]
Z. W. Qing, S. W. Zhang, J. Y. Wang, et al., “Hierarchical spatio-temporal decoupling for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 6635–6645, 2024.
    [69]
    R. Q. Wu, L. Y. Chen, T. Yang, et al., “LAMP: Learn a motion pattern for few-shot video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7089–7098, 2024.
    [70]
    R. Zhao, Y. C. Gu, J. Z. Wu, et al., “MotionDirector: Motion customization of text-to-video diffusion models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 273–290, 2024.
    [71]
    S. H. Yuan, J. F. Huang, Y. J. Shi, et al., “MagicTime: Time-lapse video generation models as metamorphic simulators,” arXiv preprint, arXiv: 2404.05014, 2024.
    [72]
    Y. Tian, L. Yang, H. T. Yang, et al., “VideoTetris: Towards compositional text-to-video generation,” arXiv preprint, arXiv: 2406.04277, 2024.
    [73]
    Y. Jain, A. Nasery, V. Vineet, et al., “PEEKABOO: Interactive video generation via masked-diffusion,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8079–8088, 2024.
    [74]
Y. B. Zhang, Y. X. Wei, D. S. Jiang, et al., “ControlVideo: Training-free controllable text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–21, 2024.
    [75]
    X. Wang, H. J. Yuan, S. W. Zhang, et al., “VideoComposer: Compositional video synthesis with motion controllability,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 334, 2023.
    [76]
    G. Y. Liu, M. H. Xia, Y. Zhang, et al., “StyleCrafter: Enhancing stylized text-to-video generation with style adapter,” arXiv preprint, arXiv: 2312.00330, 2023.
    [77]
    J. C. Zhu, H. Yang, W. J. Wang, et al., “MobileVidFactory: Automatic diffusion-based social media video generation for mobile devices from text,” in Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, pp. 9371–9373, 2023.
    [78]
    J. W. Wang, Y. C. Zhang, J. X. Zou, et al., “Boximator: Generating rich and controllable motions for video synthesis,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, article no. 2142, 2024.
    [79]
    H. Lin, J. Cho, A. Zala, et al., “Ctrl-Adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model,” arXiv preprint, arXiv: 2404.09967, 2024.
    [80]
    H. He, Y. H. Xu, Y. W. Guo, et al., “CameraCtrl: Enabling camera control for text-to-video generation,” arXiv preprint, arXiv: 2404.02101, 2024.
    [81]
    P. Y. Ling, J. Z. Bu, P. Zhang, et al., “MotionClone: Training-free motion cloning for controllable video generation,” arXiv preprint, arXiv: 2406.05338, 2024.
    [82]
    Z. J. Duan, L. Z. You, C. Y. Wang, et al., “DiffSynth: Latent in-iteration deflickering for realistic video synthesis,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, Vilnius, Lithuania, pp. 332–347, 2024.
    [83]
    B. H. Liu, X. Liu, A. B. Dai, et al., “Dual-stream diffusion net for text-to-video generation,” arXiv preprint, arXiv: 2308.08316, 2023.
    [84]
    H. Fei, S. Q. Wu, W. Ji, et al., “Dysen-VDM: Empowering dynamics-aware text-to-video diffusion with LLMs,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7641–7653, 2024.
    [85]
    H. Lin, A. Zala, J. Cho, et al., “VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–25, 2024.
    [86]
    L. Lian, B. F. Shi, A. Yala, et al., “LLM-grounded video diffusion models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–21, 2024.
    [87]
    Y. M. Jiang, S. Yang, T. L. Koh, et al., “Text2Performer: Text-driven human video generation,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 22690–22700, 2023.
    [88]
    M. J. Yang, Y. L. Du, B. Dai, et al., “Probabilistic adaptation of text-to-video models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–17, 2024.
    [89]
    X. F. Li, Y. F. Zhang, and X. Q. Ye, “DrivingDiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 469–485, 2024.
    [90]
    S. M. Yin, C. F. Wu, H. Yang, et al., “NUWA-XL: Diffusion over diffusion for extremely long video generation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 1309–1320, 2023.
    [91]
    X. Y. Chen, Y. H. Wang, L. J. Zhang, et al., “Seine: Short-to-long video diffusion model for generative transition and prediction,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–15, 2024.
    [92]
    G. Oh, J. Jeong, S. Kim, et al., “MTVG: Multi-event video generation with text-to-video models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 401–418, 2024.
    [93]
H. N. Qiu, M. H. Xia, Y. Zhang, et al., “FreeNoise: Tuning-free longer video diffusion via noise rescheduling,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–15, 2024.
    [94]
    F. Y. Wang, W. S. Chen, G. L. Song, et al., “Gen-L-Video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint, arXiv: 2305.18264, 2023.
    [95]
    R. Henschel, L. Khachatryan, D. Hayrapetyan, et al., “StreamingT2V: Consistent, dynamic, and extendable long video generation from text,” in Proceedings of the 13th International Conference on Learning Representations, Singapore, Singapore, pp. 1–28, 2025.
    [96]
    S. B. Zhuang, K. C. Li, X. Y. Chen, et al., “Vlogger: Make your dream a vlog,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8806–8817, 2024.
    [97]
    J. Kim, J. Kang, J. Choi, et al., “FIFO-diffusion: Generating infinite videos from text without training,” arXiv preprint, arXiv: 2405.11473, 2024.
    [98]
    J. N. Wang, H. J. Yuan, D. Y. Chen, et al., “ModelScope text-to-video technical report,” arXiv preprint, arXiv: 2308.06571, 2023.
    [99]
    A. Blattmann, R. Rombach, H. Ling, et al., “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 22563–22575, 2023.
    [100]
    J. Ho, W. Chan, C. Saharia, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint, arXiv: 2210.02303, 2022.
    [101]
    A. Blattmann, T. Dockhorn, S. Kulal, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint, arXiv: 2311.15127, 2023.
    [102]
D. Q. Zhou, W. M. Wang, H. S. Yan, et al., “MagicVideo: Efficient video generation with latent diffusion models,” arXiv preprint, arXiv: 2211.11018, 2022.
    [103]
    Z. Xing, Q. Dai, H. Hu, et al., “SimDA: Simple diffusion adapter for efficient video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7827–7839, 2024.
    [104]
    W. M. Wang, J. W. Liu, Z. J. Lin, et al., “MagicVideo-V2: Multi-stage high-aesthetic video generation,” arXiv preprint, arXiv: 2401.04468, 2024.
    [105]
    T. Lee, S. Kwon, and T. Kim, “Grid diffusion models for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8734–8743, 2024.
    [106]
    O. Bar-Tal, H. Chefer, O. Tov, et al., “Lumiere: A space-time diffusion model for video generation,” in Proceedings of the SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan, article no. 94, 2024.
    [107]
    X. Li, W. Q. Chu, Y. Wu, et al., “VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation,” arXiv preprint, arXiv: 2309.00398, 2023.
    [108]
    S. W. Ge, S. Nah, G. L. Liu, et al., “Preserve your own correlation: A noise prior for video diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 22873–22884, 2023.
    [109]
    L. Khachatryan, A. Movsisyan, V. Tadevosyan, et al., “Text2Video-zero: Text-to-image diffusion models are zero-shot video generators,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 15908–15918, 2023.
    [110]
    J. Z. Wu, Y. X. Ge, X. T. Wang, et al., “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 7589–7599, 2023.
    [111]
    J. An, S. Y. Zhang, H. Yang, et al., “Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation,” arXiv preprint, arXiv: 2304.08477, 2023.
    [112]
    S. Hong, J. Seo, H. Shin, et al., “DirecT2V: Large language models are frame-level directors for zero-shot text-to-video generation,” arXiv preprint, arXiv: 2305.14330, 2024.
    [113]
    H. Z. Huang, Y. F. Feng, C. Shi, et al., “Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 1138, 2023.
    [114]
    Y. Lu, L. C. Zhu, H. H. Fan, et al., “FlowZero: Zero-shot text-to-video synthesis with LLM-driven dynamic scene syntax,” arXiv preprint, arXiv: 2311.15813, 2023.
    [115]
    J. Lv, Y. Huang, M. Yan, et al., “GPT4motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, pp. 1430–1440, 2024.
    [116]
    X. Y. Shi, Z. Y. Huang, F. Y. Wang, et al., “Motion-I2V: Consistent and controllable image-to-video generation with explicit motion modeling,” in Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, article no. 111, 2024.
    [117]
    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 6309–6318, 2017.
    [118]
    A. Nagrani, J. S. Chung, W. Xie, et al., “Voxceleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, article no. 101027, 2020. DOI: 10.1016/j.csl.2019.101027
    [119]
    R. Rombach, A. Blattmann, D. Lorenz, et al., “High-resolution image synthesis with latent diffusion models,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 10674–10685, 2022.
    [120]
    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Munich, Germany, pp. 234–241, 2015.
    [121]
    Y. Song, P. Dhariwal, M. Chen, et al., “Consistency models,” in Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, article no. 1335, 2023.
    [122]
    T. Chen and L. L. Li, “FIT: Far-reaching interleaved transformers,” arXiv preprint, arXiv: 2305.12689, 2023.
    [123]
    W. Peebles and S. N. Xie, “Scalable diffusion models with transformers,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 4172–4182, 2023.
    [124]
    OpenAI, J. Achiam, S. Adler, et al., “GPT-4 technical report,” arXiv preprint, arXiv: 2303.08774, 2023.
    [125]
    L. M. Zhang, A. Y. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 3813–3824, 2023.
    [126]
    C. Saharia, W. Chan, S. Saxena, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2643, 2022.
    [127]
    Y. Balaji, S. Nah, X. Huang, et al., “eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint, arXiv: 2211.01324, 2023.
    [128]
    Y. Ma, Y. He, X. Cun, X. Wang, et al., “Follow your pose: Pose-guided text-to-video generation using pose-free videos,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4117–4125, 2024.
    [129]
    B. Peng, X. Y. Chen, Y. H. Wang, et al., “ConditionVideo: Training-free condition-guided video generation,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4459–4467, 2024.
    [130]
    F. Y. Shi, J. X. Gu, H. Xu, et al., “BIVDiff: A training-free framework for general-purpose video synthesis via bridging image and video diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7393–7402, 2024.
    [131]
    Y. W. Guo, C. Y. Yang, A. Y. Rao, et al., “AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–19, 2024.
    [132]
    M. Bain, A. Nagrani, G. Varol, et al., “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 1708–1718, 2021.
    [133]
    A. Miech, D. Zhukov, J. B. Alayrac, et al., “HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 2630–2640, 2019.
    [134]
    J. C. Stroud, Z. C. Lu, C. Sun, et al., “Learning video representations from textual web supervision,” arXiv preprint, arXiv: 2007.14937, 2020.
    [135]
    A. Nagrani, P. H. Seo, B. Seybold, et al., “Learning audio-video modalities from image captions,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 407–426, 2022.
    [136]
    P. Sharma, N. Ding, S. Goodman, et al., “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2556–2565, 2018.
    [137]
J. N. Li, D. X. Li, C. M. Xiong, et al., “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, pp. 12888–12900, 2022.
    [138]
    X. Y. Huang, Y. C. Zhang, J. Y. Ma, et al., “Tag2Text: Guiding vision-language model via image tagging,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–20, 2024.
    [139]
    K. C. Li, Y. N. He, Y. Wang, et al., “VideoChat: Chat-centric video understanding,” arXiv preprint, arXiv: 2305.06355, 2023.
    [140]
    T. S. Chen, A. Siarohin, W. Menapace, et al., “Panda-70M: Captioning 70M videos with multiple cross-modality teachers,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 13320–13331, 2024.
    [141]
    H. Zhang, X. Li, and L. D. Bing, “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” in Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Singapore, Singapore, pp. 543–553, 2023.
    [142]
    D. Y. Zhu, J. Chen, X. Q. Shen, et al., “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–17, 2024.
    [143]
    K. C. Li, Y. L. Wang, Y. Z. Li, et al., “Unmasked teacher: Towards training-efficient video foundation models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 19891–19903, 2023.
    [144]
    W. H. Wang and Y. Yang, “VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models,” arXiv preprint, arXiv: 2403.06098, 2024.
    [145]
    Z. Y. Yang, L. J. Li, K. Lin, et al., “The dawn of LMMs: Preliminary explorations with GPT-4V(ision),” arXiv preprint, arXiv: 2309.17421, 2023.
    [146]
    D. L. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, pp. 190–200, 2011.
    [147]
    J. Xu, T. Mei, T. Yao, et al., “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 5288–5296, 2016.
    [148]
    L. A. Hendricks, O. Wang, E. Shechtman, et al., “Localizing moments in video with natural language,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 5804–5813, 2017.
    [149]
    A. Rohrbach, A. Torabi, M. Rohrbach, et al., “Movie description,” International Journal of Computer Vision, vol. 123, no. 1, pp. 94–120, 2017. DOI: 10.1007/s11263-016-0987-1
    [150]
    R. Krishna, K. Hata, F. Ren, et al., “Dense-captioning events in videos,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 706–715, 2017.
    [151]
    L. W. Zhou, C. L. Xu, and J. Corso, “Towards automatic learning of procedures from web instructional videos,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp. 7590–7598, 2018.
    [152]
    R. Sanabria, O. Caglayan, S. Palaskar, et al., “How2: A large-scale dataset for multimodal language understanding,” arXiv preprint, arXiv: 1811.00347, 2018.
    [153]
    X. Wang, J. W. Wu, J. K. Chen, et al., “VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 4580–4590, 2019.
    [154]
    R. Zellers, X. M. Lu, J. Hessel, et al., “MERLOT: Multimodal neural script knowledge models,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, article no. 1810, 2021.
    [155]
    H. Reynaud, M. Y. Qiao, M. Dombrowski, et al., “Feature-conditioned cascaded video diffusion models for precise echocardiogram synthesis,” in Proceedings of the 26th International Conference on Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, Vancouver, BC, Canada, pp. 142–152, 2023.
    [156]
    T. Wang, L. J. Li, K. Lin, et al., “DisCo: Disentangled control for realistic human dance generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 9326–9336, 2024.
    [157]
    H. W. Xue, T. K. Hang, Y. H. Zeng, et al., “Advancing high-resolution video-language representation with large-scale video transcriptions,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 5026–5035, 2022.
    [158]
    J. H. Yu, H. Zhu, L. M. Jiang, et al., “CelebV-Text: A large-scale facial text-video dataset,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 14805–14814, 2023.
    [159]
    X. Ju, “Mira: A mini-step towards Sora-like long video generation,” https://github.com/mira-space.
    [160]
    K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint, arXiv: 1212.0402, 2012.
    [161]
    W. Kay, J. Carreira, K. Simonyan, et al., “The kinetics human action video dataset,” arXiv preprint, arXiv: 1705.06950, 2017.
    [162]
    R. Goyal, S. E. Kahou, V. Michalski, et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 5843–5851, 2017.
    [163]
    J. Pont-Tuset, F. Perazzi, S. Caelles, et al., “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint, arXiv: 1704.00675, 2017.
    [164]
    C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, pp. 32–36, 2004.
    [165]
    N. Aifanti, C. Papachristou, and A. Delopoulos, “The MUG facial expression database,” in Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Desenzano del Garda, Italy, pp. 1–4, 2010.
    [166]
    M. Cordts, M. Omran, S. Ramos, et al., “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 3213–3223, 2016.
    [167]
    N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 843–852, 2015.
    [168]
    J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 4724–4733, 2017.
    [169]
    F. Ebert, C. Finn, A. X. Lee, et al., “Self-supervised visual planning with temporal skip connections,” in Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, pp. 344–356, 2017.
    [170]
    W. Xiong, W. H. Luo, L. Ma, et al., “Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 2364–2373, 2018.
    [171]
    J. Carreira, E. Noland, A. Banki-Horvath, et al., “A short note about kinetics-600,” arXiv preprint, arXiv: 1808.01340, 2018.
    [172]
    M. Monfort, A. Andonian, B. L. Zhou, et al., “Moments in time dataset: One million videos for event understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 502–508, 2020. DOI: 10.1109/TPAMI.2019.2901464
    [173]
    A. Siarohin, S. Lathuilière, S. Tulyakov, et al., “First order motion model for image animation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no. 641, 2019.
    [174]
    W. Liu, Z. Piao, Z. Tu, et al., “Liquid warping GAN with attention: A unified framework for human image synthesis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5114–5133, 2022. DOI: 10.1109/TPAMI.2021.3078270
    [175]
    F. Ebert, Y. L. Yang, K. Schmeckpeper, et al., “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” in Proceedings of the Robotics: Science and Systems 2022, New York City, NY, USA, 2022.
    [176]
    T. Brooks, J. Hellsten, M. Aittala, et al., “Generating long videos of dynamic scenes,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2303, 2022.
    [177]
    T. Unterthiner, S. van Steenkiste, K. Kurach, et al., “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint, arXiv: 1812.01717, 2018.
    [178]
    M. Saito, S. Saito, M. Koyama, et al., “Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN,” International Journal of Computer Vision, vol. 128, no. 10, pp. 2586–2606, 2020. DOI: 10.1007/s11263-020-01333-y
    [179]
    M. Heusel, H. Ramsauer, T. Unterthiner, et al., “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 6629–6640, 2017.
    [180]
    C. Szegedy, W. Liu, Y. Q. Jia, et al., “Going deeper with convolutions,” in Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 1–9, 2015.
    [181]
    J. Deng, W. Dong, R. Socher, et al., “ImageNet: A large-scale hierarchical image database,” in Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, pp. 248–255, 2009.
    [182]
    D. Tran, L. Bourdev, R. Fergus, et al., “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 4489–4497, 2015.
    [183]
    H. N. Wu, E. L. Zhang, L. Liao, et al., “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 20087–20097, 2023.
    [184]
    D. Podell, Z. English, K. Lacey, et al., “SDXL: Improving latent diffusion models for high-resolution image synthesis,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–13, 2024.
    [185]
    J. N. Li, D. X. Li, S. Savarese, et al., “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, pp. 19730–19742, 2023.
    [186]
    K. Papineni, S. Roukos, T. Ward, et al., “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 311–318, 2002.
    [187]
    D. Klakow and J. Peters, “Testing the correlation of word error rate and perplexity,” Speech Communication, vol. 38, no. 1-2, pp. 19–28, 2002. DOI: 10.1016/S0167-6393(01)00041-3
    [188]
    Y. P. Sun, Z. H. Ni, C. K. Chng, et al., “ICDAR 2019 competition on large-scale street view text with partial labeling – RRC-LSVT,” in Proceedings of 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, pp. 1557–1562, 2019.
    [189]
    A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition,” in Proceedings of the 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004.
    [190]
    S. I. Serengil and A. Ozpinar, “Hyperextended lightface: A facial attribute analysis framework,” in Proceedings of 2021 International Conference on Engineering and Emerging Technologies (ICEET), Istanbul, Turkey, pp. 1–4, 2021.
    [191]
    MMAction2 Contributors, “OpenMMLab’s next generation video understanding toolbox and benchmark,” https://github.com/open-mmlab/mmaction2, 2020.
    [192]
    Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in Proceedings of the 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, pp. 402–419, 2020.
    [193]
    M. Otani, R. Togashi, Y. Sawai, et al., “Toward verifiable and reproducible human evaluation for text-to-image generation,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 14277–14286, 2023.
    [194]
    G. Parmar, R. Zhang, and J. Y. Zhu, “On aliased resizing and surprising subtleties in GAN evaluation,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 11400–11410, 2022.
    [195]
    C. Saharia, W. Chan, S. Saxena, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2643, 2022.
    [196]
    Z. Q. Huang, Y. N. He, J. S. Yu, et al., “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 21807–21818, 2024.
    [197]
    T. Y. Lin, M. Maire, S. Belongie, et al., “Microsoft COCO: Common objects in context,” in Proceedings of the 13th European Conference on Computer Vision – ECCV 2014, Zurich, Switzerland, pp. 740–755, 2014.
    [198]
    J. Cho, A. Zala, and M. Bansal, “DALL-EVAL: Probing the reasoning skills and social biases of text-to-image generation models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 3020–3031, 2023.
    [199]
    D. F. Jiang, M. Ku, T. L. Li, et al., “GenAI arena: An open evaluation platform for generative models,” in Proceedings of the 38th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, pp. 1–20, 2024.
    [200]
    J. H. Zar, “Spearman rank correlation,” in Encyclopedia of Biostatistics, 2nd ed., P. Armitage and T. Colton, Eds. John Wiley & Sons, Ltd, Chichester, 2005.
    [201]
    M. G. Kendall, Rank Correlation Methods. Griffin, London, UK, 1948.
    [202]
    Y. H. Wang, J. M. Bao, W. M. Weng, et al., “MicroCinema: A divide-and-conquer approach for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8414–8424, 2024.
    [203]
    Y. Zeng, G. Q. Wei, J. N. Zheng, et al., “Make pixels dance: High-dynamic video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8850–8860, 2024.
    [204]
    L. D. Ruan, L. Tian, C. W. Huang, et al., “UniVG: Towards unified-modal video generation,” arXiv preprint, arXiv: 2401.09084, 2024.
    [205]
    X. Zhang, Z. X. Wu, Z. J. Weng, et al., “VideoLT: Large-scale long-tailed video recognition,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 7940–7949, 2021.
    [206]
    V. Arkhipkin, Z. Shaheen, V. Vasilev, et al., “FusionFrames: Efficient architectural aspects for text-to-video generation pipeline,” arXiv preprint, arXiv: 2311.13073, 2023.