Fan Xie, Dan Zeng, Qiaomu Shen, et al., “A comprehensive survey on text-to-video generation,” Chinese Journal of Electronics, vol. 34, no. 3, pp. 1–27, 2025. DOI: 10.23919/cje.2024.00.151

A Comprehensive Survey on Text-to-Video Generation

More Information
  • Author Bio:

    Xie Fan: Fan Xie is a master’s student at Southern University of Science and Technology. His research interests include text-to-video generation and multimodality. (Email: 12232387@mail.sustech.edu.cn)

    Zeng Dan: Dan Zeng received the B.E. and Ph.D. degrees in computer science and technology from Sichuan University in 2013 and 2018, respectively. From 2018 to 2020, she was a Post-Doctoral Research Fellow with the Data Management and Biometrics Group, University of Twente, The Netherlands. She is currently a Tenure-track Assistant Professor with the School of Artificial Intelligence, Sun Yat-sen University. Before that, she was a Research Associate Professor at the Southern University of Science and Technology. Her research interests are in the field of pattern recognition, computer vision, and deep learning. (Email: danzeng1990@gmail.com)

    Shen Qiaomu: Qiaomu Shen received the Ph.D. degree in computer science from The Hong Kong University of Science and Technology in 2020. He is currently an assistant professor at Beijing Institute of Technology (Zhuhai). Before that, he was a research assistant professor at Southern University of Science and Technology (SUSTech). His current research interests include spatial-temporal visualization, urban computing, and visual analytics of complex systems. (Email: joyshen06@gmail.com)

    Tang Bo: Bo Tang received the Ph.D. degree in computer science from The Hong Kong Polytechnic University in 2017. He is currently a tenured associate professor at Southern University of Science and Technology. He was a visiting researcher at Centrum Wiskunde & Informatica and Microsoft Research Asia, respectively. His research interest is architecting and implementing the data foundation for the Large Language Model era. (Email: tangb3@sustech.edu.cn)

  • Corresponding author:

    Zeng Dan, Email: danzeng1990@gmail.com

  • Received Date: June 08, 2024
  • Accepted Date: November 26, 2024
  • Available Online: January 09, 2025
  • Since the release of Sora, Text-to-Video (T2V) generation has brought profound changes to AI-generated content. T2V generation aims to generate high-quality videos from a given text description, which is challenging due to the lack of large-scale, high-quality text-video pairs for training and the complexity of modeling high-dimensional video data. Although there have been some valuable and impressive surveys on T2V generation, they introduce approaches in a relatively isolated way, overlook the development of evaluation metrics, and miss the latest advances in T2V generation since 2023. Due to the rapid expansion of the field, a comprehensive review of the relevant studies is both necessary and challenging. This survey attempts to connect and systematize existing research in a comprehensive way. Unlike previous surveys, it reviews nearly ninety representative T2V generation approaches, including the latest methods published in March 2024, from the perspectives of model, data, evaluation metrics, and available open-source resources. It may help readers better understand the current research status and ideas and get a quick start with accessible open-source models. Finally, the future challenges and method trends of T2V generation are thoroughly discussed.

  • Artificial Intelligence Generated Content (AIGC) is developing rapidly and has become one of the most popular topics in AI. The generative modalities of AIGC include image [1]–[3], video [4]–[6], audio [7]–[9], and more. Figure 1 counts the number of papers published on each generated modality over the past five years (2019 to 2023). As illustrated in Figure 1(a), text-to-image (T2I) generation started early and has dominated AIGC research for many years. Nevertheless, Figure 1(b) shows that text-to-video (T2V) generation, despite its relatively late start, has exploded in recent years, which may fundamentally shift the research emphasis in the future.

    Figure  1.  AIGC developments in the last five years, including Text-to-Image, Text-to-Video and Text-to-Audio.

    T2V generation aims to generate high-quality videos based on a given text description, and the generated videos typically contain 16 frames with a duration of about two seconds. It is challenging for two reasons. First, there is a lack of large-scale, high-quality text-video pairs for training; tens of millions of paired samples are usually required. Second, modeling high-dimensional video data is complex because 1) the semantic space of the text is much smaller than the generation space of the video frames, 2) the semantics must be retained correctly while keeping continuity between frames, and 3) the computational demands are high: training a T2V model like InternVid [10] typically requires 64 NVIDIA GPUs for three days.

    The release of Sora [11] this year has profoundly pushed the frontier of T2V generation. Even before that, both academia and industry put great effort into improving T2V generation models because of their wide application prospects. At this point, a comprehensive review of the relevant studies is both necessary and challenging. Although there have been some valuable and impressive surveys on T2V generation, they introduce approaches in a relatively isolated way, overlook the development of evaluation metrics, and miss the latest advances in T2V generation since 2023. Unlike previous surveys, this survey reviews nearly ninety representative T2V generation approaches and includes the latest methods published in March 2024, from the perspectives of model, data, evaluation metrics, and available open-source resources.

    Our survey is illustrated in Figure 2, and the organization is as follows: Section 2 clarifies the differences between this survey and others. Section 3 explores existing methods and reviews their strengths and weaknesses. Section 4 introduces current T2V datasets, while Section 5 reviews the development of metrics for evaluating T2V generation. Section 6 provides the results of the experiment on representative methods. Section 7 discusses challenges and future trends, and the last section concludes this review.

    Figure  2.  The organization of our survey.

    Table 1 presents the differences between this survey and the existing surveys. Unlike previous surveys, this survey paper reviews nearly one hundred representative T2V generation approaches and includes the latest method published in July 2024. Also, more T2V datasets and metrics are comprehensively reviewed.

    Table  1.  Compare our survey with existing surveys.
    Survey #Methods Latest Pub. Year #T2V Datasets #Metrics
    [12] 6 Oct. 2022 NA NA
    [13] 16 Dec. 2022 15 5
    [14] 28 Oct. 2023 31 4
    [15] 81 May 2024 24 5
    Ours 97 Jul. 2024 40 20

    Singh [12] presents and compares popular T2I and T2V generation methods, discussing their ideas, advantages, and disadvantages. The survey offers an overview of T2V generation techniques but lacks a comprehensive exploration of datasets and evaluation metrics.

    Xing et al. [14] provide a detailed overview of T2V generation methods, including datasets and evaluation metrics. However, this survey is somewhat outdated and focuses primarily on diffusion model-based architectures. In contrast, our survey covers all types of architectures for T2V generation, not only diffusion models.

    Cho et al. [13] provide an excellent introduction for beginners, covering T2V applications, technical limitations, ethical conflicts, and future directions. However, their work has limitations, including an insufficient introduction to mainstream methods, incomplete coverage of datasets, especially newly proposed ones [10], [16], and the absence of metrics such as EvalCrafter [17] and FETV [18].

    Sun et al. [15] present a comprehensive review of current T2V work, introducing 81 research works categorized by different architectures. It is the latest review paper on the T2V task. However, it does not provide comparative statistics of the methods or their open-source resources, which makes it harder for readers to get started with T2V quickly.

    In contrast, our survey not only comprehensively introduces related research, including core ideas, strengths, and weaknesses, but also introduces T2V datasets, evaluation metrics, experimental results, and open-source methods in detail, overcoming existing surveys' limitations.

    The primary generation procedure is illustrated in Figure 3. First, a text encoder encodes the input text into features. These features are then used by a generative model to produce the corresponding video.

    Figure  3.  A brief diagram of text-to-video generation.

    Text Encoder. Existing text encoders can be divided into two categories: pre-trained multimodal models such as CLIP [19], and pre-trained large language models (LLMs) such as BERT [20], T5 [21], and Llama-2 [22].

    Pre-trained multimodal models, exemplified by CLIP [19], learn their matching relationship by training on large-scale text-image pairs, thereby aligning image and text in an embedding space. However, CLIP cannot handle text with complex meanings, which may limit its effectiveness for long and complex text input.

    Pre-trained LLMs excel in various tasks after being trained on large-scale corpora. BERT [20] can learn from unlabeled data and exhibits impressive performance, which can be further improved as model size and training data expand. T5 [21] and Llama-2 [22] are favored for their superior performance and open-source availability. Usually, LLMs outperform CLIP in understanding long text inputs.
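    As a concrete illustration of the two encoder families, the sketch below obtains text features from a CLIP text encoder and a T5 encoder via Hugging Face transformers; the small checkpoint names are illustrative, and real T2V systems typically use larger, frozen encoders.

```python
# Minimal sketch: text features from CLIP and T5 (Hugging Face transformers).
# Checkpoint names are illustrative examples, not the ones used by any specific T2V model.
import torch
from transformers import CLIPTokenizer, CLIPTextModel, T5Tokenizer, T5EncoderModel

prompt = "a corgi running on the beach at sunset"

# CLIP text encoder: features aligned with image features in a joint space.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
clip_in = clip_tok(prompt, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    clip_feats = clip_enc(**clip_in).last_hidden_state   # (1, seq_len, 512)

# T5 encoder: per-token features, often preferred for long or complex prompts.
t5_tok = T5Tokenizer.from_pretrained("t5-base")
t5_enc = T5EncoderModel.from_pretrained("t5-base")
t5_in = t5_tok(prompt, return_tensors="pt")
with torch.no_grad():
    t5_feats = t5_enc(**t5_in).last_hidden_state          # (1, seq_len, 768)
```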

    Generation Model. Existing generation methods can be divided into five categories: 1) VAE-based [23] approaches, 2) GAN-based [24] approaches, 3) autoregressive transformer-based [4] approaches, 4) diffusion model-based [25] approaches, and 5) approaches that adapt T2I models for video generation. Figure 4 shows the timeline of representative T2V generation methods in academia and industry. Figure 5 shows the categorization of existing methods for T2V generation. Figure 6 compares the network architectures of the different generative models.

    Figure  4.  A timeline of representative text-to-video generation methods in academia and industry.
    Figure  5.  The categorization of existing methods for text-to-video generation.
    Figure  6.  Comparison of the network architectures with respect to different generative models.

    The Variational Autoencoder (VAE) [23] is a groundbreaking method for generating images. It consists of an encoder and a decoder. The encoder maps the input data into a probability distribution, while the decoder generates new data by sampling from the learned probability distribution. Sync-DRAW [26] and GODIVA [27] are representative T2V generation methods based on VAE.
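    The sketch below illustrates the basic VAE recipe described above (an encoder predicting a Gaussian over latents, reparameterized sampling, and a decoder reconstruction); the layer sizes are arbitrary and not tied to any cited model.

```python
# Minimal VAE sketch in PyTorch: encode to a Gaussian over latents, sample via
# reparameterization, decode, and train with reconstruction + KL terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction="sum")                   # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return rec + kl
```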

    Recurrent VAEs. Sync-DRAW [26] combines a VAE with a recurrent attention mechanism for generating videos. It generates temporally coherent video frames by focusing on individual frames through the attention mechanism, while using the VAE to globally learn the latent distribution of the video. In addition, it keeps full attention to the object through a gating mechanism, which can generate videos that maintain the structural integrity of the object.

    VQ-VAE. GODIVA [27] is the first to use VQ-VAE [117] for open-domain T2V generation, as illustrated in Figure 7. It combines VQ-VAE and 3D sparse attention to generate video, where 3D sparse attention can significantly reduce the computational cost. First, a VQ-VAE autoencoder is trained to represent continuous video pixels as discrete tokens. Then, a 3D sparse attention model is trained using language as input, with the discrete video tokens used as labels for video generation.

    Figure  7.  The architecture of GODIVA [27].
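    A hedged sketch of the VQ-VAE quantization step that GODIVA builds on: each continuous encoder output is replaced by its nearest codebook entry, turning a video into discrete tokens that an attention model can predict. Shapes and codebook size are illustrative.

```python
# Vector-quantization sketch: map continuous encoder outputs to discrete token ids.
import torch

def vector_quantize(z_e, codebook):
    """z_e: (N, D) encoder outputs; codebook: (K, D) learned embeddings."""
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    tokens = dists.argmin(dim=1)         # discrete token ids, shape (N,)
    z_q = codebook[tokens]               # quantized vectors passed to the decoder
    return tokens, z_q

codebook = torch.randn(1024, 64)         # e.g., K = 1024 codes of dimension 64
z_e = torch.randn(16 * 32 * 32, 64)      # e.g., 16 frames of 32x32 latent positions
tokens, z_q = vector_quantize(z_e, codebook)
```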

    VAE has a simple structure and is easy to use, but it is limited to generating basic videos and suffers from “posterior collapse”. VQ-VAE, used in GODIVA, combines the advantages of discrete representations with the efficacy of continuous models. In current state-of-the-art systems, VAE is mainly used to encode conditional information, such as text, rather than for video modeling. A comparison of pros and cons is illustrated in Figure 8.

    Figure  8.  Pros and Cons of VAE-based approaches.

    Generative Adversarial Networks (GANs) [24] have been ruling image generation for a decade. In contrast to VAE, the core idea of GAN is to estimate the generator via an adversarial process. GANs usually produce images with good perceptual quality and are widely used in T2V generation methods. However, GAN-based models can only generate videos with moving digits or simple human actions and cannot scale to more complex and diverse videos. Moreover, GANs often suffer from mode collapse, and it is also difficult to scale these methods to complex, large-scale video datasets.
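    The adversarial process mentioned above boils down to alternating two updates, sketched below with placeholder generator G and discriminator D (their architectures and the data pipeline are omitted).

```python
# One adversarial training step: D separates real from fake, G tries to fool D.
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=128):
    b = real.size(0)
    z = torch.randn(b, z_dim)
    fake = G(z)

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(b, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: push D(fake) toward 1.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```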

    Temporal GAN. TGANs-C [28] is the first GAN-based work for T2V generation, proposing a temporal GAN where the generator takes text embeddings and noise vectors to produce video frames. It enhances the traditional 2D generator to a 3D model for better spatio-temporal dynamics and incorporates motion analysis in the discriminator to ensure coherent frame transitions, as detailed in Figure 9.

    Figure  9.  The architecture of TGANs-C [28].

    Text-Filter conditioning GAN. TFGAN [29] introduces a new text conditioning method for discriminator feature maps through convolutional operations, as depicted in Figure 10. Meanwhile, the authors also created the Moving Shapes Dataset, where the text describes the shapes moving along a trajectory.

    Figure  10.  The architecture of TFGAN [29].

    GAN combined with VAE. VGFT [30] proposes a hybrid generator that combines GAN with VAE [23] to extract statistical and dynamic information from text, thereby generating diverse and smooth videos that correspond well to the input text.

    Leverage the previous frame for generation. IRC-GAN [31] proposes the introspective recurrent convolutional GAN, consisting of the Recurrent Transconvolutional Generator (RTG) and Mutual-information Introspection (MI). RTG generates each frame based on the previous one for better coherence. MI uses mutual information to compute the semantic distance between the generated video and the corresponding text and tries to minimize it. TiVGAN [32] proposes a new training framework that initially focuses on learning the relationship between text and image to create high-quality single video frames. As training progresses, the model is gradually trained on more successive frames, which stabilizes the training and allows for clearer video generation.

    Story visualization. StoryGAN [33] proposes a new task called “story visualization”. The input is a multi-sentence paragraph, a story, and the output is a series of visualization images, with one for each sentence. Compared to other T2V works, this task can focus less on the continuity of the generated image frames and more on the global consistency between dynamic scenes and characters. Word-Level [34] expands on StoryGAN [33] by introducing a new sentence representation that combines word information from all story sentences. Also, a new fusion feature discriminator is proposed, extending spatial attention to improve image quality and story consistency.

    GAN-based methods, like VAE-based methods, are able to generate realistic images and videos. However, they are limited to generating simple videos, such as moving figures or basic human movements, and cannot generate more complex and diverse videos. In addition, GANs frequently suffer from mode collapse, making it challenging to apply these methods to complex, large-scale video datasets. Figure 11 outlines the pros and cons of each subcategory in this section.

    Figure  11.  Pros and Cons of GAN-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    The autoregressive approach generates a sequence step by step, with the latest content conditioned on previously generated content, which naturally fits the idea of generating coherent videos. Compared with GAN-based methods, autoregressive transformer-based methods avoid mode collapse and generate better video quality. However, they require more computational and memory resources because the intermediate results must be stored and are constantly involved in the computation.
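    The step-by-step generation described above can be summarized by the sampling loop below; `model` is assumed to be a decoder-only transformer returning next-token logits over the discrete video vocabulary.

```python
# Autoregressive sampling sketch: each video token is conditioned on the text
# tokens and on all previously generated video tokens.
import torch

@torch.no_grad()
def generate_video_tokens(model, text_tokens, num_video_tokens, temperature=1.0):
    seq = text_tokens.clone()                        # (1, T_text) conditioning prefix
    for _ in range(num_video_tokens):
        logits = model(seq)[:, -1, :] / temperature  # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, 1)       # sample rather than argmax
        seq = torch.cat([seq, next_tok], dim=1)      # feed it back as context
    return seq[:, text_tokens.size(1):]              # discrete tokens for the video decoder
```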

    Transformers with 3D structure. NUWA [35] introduces a 3D transformer encoder and decoder framework that provides a unified representation space for images and video, supporting both T2I and T2V generation. A 3D Nearby Attention (3DNA) mechanism is proposed to reduce the computational complexity. The architecture is shown in Figure 12.

    Figure  12.  The architecture of Nuwa [35].

    Multimodality. VideoPoet [36] utilizes the unified architecture of LLMs to perform unified autoregressive learning over text, video, image, and audio modalities. Each modality has a separate tokenizer that converts the data into a discrete sequence of tokens. In addition, it incorporates a super-resolution module in token space to improve video quality. MMVID [37] takes multimodal inputs for video generation. It consists of an auto-encoder for encoding images and videos and a non-autoregressive transformer for predicting video tokens from multimodal inputs; together with designs such as a special VID token, textual embeddings, and improved mask prediction, it generates videos with better temporal consistency. Moreover, it proposes a new dataset called Multimodal VoxCeleb, whose video sources are VoxCeleb’s [118] 19,522 videos with 36 manually labeled facial attributes. LWM [38] proposes the RingAttention technique, which extends the context window of the model so that it can handle sequences up to one million tokens long. The context length is gradually increased during training, and the training data include text-image pairs, text-video pairs, and chat data for downstream tasks.

    Figure  13.  The architecture of Phenaki [39].

    Generate variable-length videos. Phenaki [39] is the first model to generate videos of variable length. It trains a transformer while randomly masking the video tokens. At generation time, a video of arbitrary length is produced by freezing the already generated (past) tokens.

    Transformers, but not autoregressive. Some methods are not based on autoregressive transformers, but their architectures are still transformer-based, such as MAGVIT [40] and WorldDreamer [41]. MAGVIT can handle multiple video synthesis tasks simultaneously and significantly outperforms contemporaneous diffusion and autoregressive methods in inference speed. WorldDreamer is the first generalized world model built for video generation. It proposes the spatial-temporal patchwise transformer (STPT), which performs attentional manipulation of local patches within a spatio-temporal window. STPT facilitates the learning of visual signal dynamics and accelerates the convergence of training, making it about three times faster than diffusion-based methods.

    Compared to GAN-based methods, autoregressive transformer-based methods avoid mode collapse and produce better-quality videos. However, these methods require more computational and memory resources, as the intermediate results must be stored and constantly involved in computation. Figure 14 displays the pros and cons of each subcategory under this section.

    Figure  14.  Pros and Cons of auto-regressive transformer based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    The denoising diffusion probabilistic model (DDPM) [25] avoids the mode collapse of GANs and the low generation quality of VAEs. At its core, a diffusion model adds random noise to existing data and learns to reverse the process to generate high-quality samples. Through this process, the model learns to create synthetic data.
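    The add-noise-then-learn-to-reverse idea can be written down in a few lines. The sketch below shows the standard epsilon-prediction training objective of DDPM with a simple linear noise schedule; the model interface (x_t, t, text embedding) is an assumption for illustration.

```python
# DDPM training sketch: noise a clean sample at a random timestep and train the
# network to predict the injected noise.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative product of (1 - beta_t)

def ddpm_loss(model, x0, text_emb):
    """x0: clean samples, shape (B, C, H, W); add one more dim for video frames."""
    t = torch.randint(0, T, (x0.size(0),))
    a = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward (noising) process
    eps_pred = model(x_t, t, text_emb)                # network predicts the noise
    return F.mse_loss(eps_pred, eps)
```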

    Since the diffusion model operates on images in pixel space, the computational consumption is particularly large for high-resolution images. Rombach et al. [119] proposed an image generation model based on the Latent Diffusion Model, whose core idea is to use an encoder to map an image to a latent vector and a decoder to decode the latent vector back into an image. The advantage of this approach is that the diffusion process runs in the latent space, whose dimension is much smaller than the original pixel space, so the computational consumption is significantly reduced.
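    A hedged sketch of the latent-diffusion idea using the `diffusers` AutoencoderKL (the Stable Diffusion VAE checkpoint and the 0.18215 scaling factor are taken from that release and serve only as an example): pixels are compressed to a much smaller latent, diffusion runs there, and the decoder maps the result back to pixels.

```python
# Latent-space diffusion sketch: encode to latents, denoise there, decode back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

frame = torch.randn(1, 3, 512, 512)          # stand-in for a real (normalized) frame
with torch.no_grad():
    latent = vae.encode(frame).latent_dist.sample() * 0.18215   # (1, 4, 64, 64)
    # ... run the text-conditioned diffusion model on `latent` here ...
    recon = vae.decode(latent / 0.18215).sample                 # back to pixel space
```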

    The research works introduced below are all based on either the diffusion model or the latent diffusion model, and no specific distinction is made in this survey.

    First 3D UNet. VDM [6] is the first work to employ a diffusion model for video generation. It extends the traditional 2D U-Net [120] architecture to a 3D spatio-temporal form and supports joint training on images and videos. It further proposes conditional sampling for spatio-temporal video, which is capable of generating long, high-resolution videos. With the introduction of the 3D U-Net architecture, the use of diffusion models for video generation has increased.

    Temporal modeling exploration. LVDM [42] is one of the representative works applying the latent diffusion model to video generation. It innovatively proposes hierarchical diffusion in the latent space. The framework is shown in Figure 15, where t and s are randomly sampled diffusion timesteps for generated latents and conditional latents, respectively, and pc and pu are the probabilities of the conditional and unconditional input, respectively. To overcome the performance degradation caused by long video generation, LVDM [42] further proposes conditional latent perturbation and unconditional guidance, which effectively mitigate the cumulative error while extending generation to more than one thousand frames.

    Figure  15.  The architecture of LVDM [42].
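    The probabilities pc and pu in Figure 15 correspond to randomly dropping the condition during training, which enables guided sampling at inference time. Below is a generic classifier-free guidance sketch (not LVDM's exact conditional latent perturbation); w is the guidance weight.

```python
# Generic classifier-free guidance: blend conditional and unconditional predictions.
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, text_emb, null_emb, w=7.5):
    eps_cond = model(x_t, t, text_emb)    # prediction with the text condition
    eps_uncond = model(x_t, t, null_emb)  # prediction with the dropped/empty condition
    return eps_uncond + w * (eps_cond - eps_uncond)
```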

    Arguing that the oversimplification of other works for temporal modeling limits the spatio-temporal performance, VersVideo [43] proposes multi-excitation paths for spatio-temporal convolution across a pool of dimensions with different axes and multi-expert spatio-temporal attention blocks, which improves the spatio-temporal performance of the model without significantly increased training and inference costs. It also integrates the temporal module into the decoder to solve the problem of information loss due to the latent space.

    Multi-stage T2V methods. Show-1 [44] innovatively combines the respective advantages of pixel-based and latent-based video diffusion models. Pixel-based video diffusion models perform well but have high computational costs. While the latent-based model effectively reduces the computational effort, it is not easy to accurately align text and video. Show-1 first generates a low-resolution video using the pixel-based diffusion model to accurately align the text and video. Then, the latent video diffusion model is used to upsample the low-resolution video to high-resolution video. Its framework is shown in Figure 16. LaVie [45] consists of three modules: a basic T2V model, a temporal interpolation (TI) model, and a video super-resolution (VSR) model. The basic model generates keyframes, the TI model generates smoother results, and the VSR model further improves resolution. MoVideo [46] generates the video in two steps, first generating the depth and optical flow of the video and then generating the final video by combining the keyframes generated by the T2I model under these two conditions. Mora [47] utilizes a variety of advanced large models for T2V generation that can replicate Sora’s [11] generative capabilities. Specifically, video generation is decomposed into several subtasks, each assigned to a specialized large-scale model. VideoElevator [48] uses encapsulated T2V models to improve temporal consistency and T2I models to provide high-quality detail.

    Figure  16.  The architecture of Show-1 [44].

    Multi-stage methods achieve better generated-video quality than single-stage methods. However, the drawbacks are a more complex generation process and an increased training burden.

    Noise prior exploration. VideoFusion [49] decomposes the standard diffusion process into adding basic and residual noise, where consecutive frames share the basic noise. This way, frames in the same video clip are encoded as related noises, allowing the denoising network to reconstruct coherent video more easily. InstructVideo [50] combines human preferences with text into noise. POS [51] proposes optimal noise approximators and semantic-preserving rewriters. The optimal noise approximator first searches for the video closely related to the text and then inverts it into the noise space as an improved noise for the text input. The semantic preservation rewriter rewrites the original text while preserving the semantics.

    The generated video is denoised from noise, which can directly affect the video quality. Improving the initial noise without changing other modules can further improve the model’s performance.
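    A minimal sketch of the shared-noise idea behind VideoFusion: all frames of a clip share one base noise and differ only by a per-frame residual, so the noised frames stay correlated over time. The mixing weight below is illustrative, not the paper's exact parameterization.

```python
# Correlated video noise: shared base component plus per-frame residual.
import torch

def correlated_video_noise(batch, frames, channels, h, w, lam=0.8):
    base = torch.randn(batch, 1, channels, h, w)           # shared across all frames
    residual = torch.randn(batch, frames, channels, h, w)  # frame-specific component
    # lam**2 + (1 - lam**2) = 1 keeps the mixture at unit variance.
    return lam * base + (1 - lam ** 2) ** 0.5 * residual   # (B, T, C, H, W)
```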

    Leverage new training data. VidRD [52] proposes a set of strategies for combining video-text data that involves different elements of several existing datasets, including video datasets for action recognition and image-text datasets. VideoFactory [16] collected videos on YouTube and labeled them using BLIP-2, building a large video dataset called HD-VG-130M. InternVid [10] proposes a method for autonomously building high-quality video text datasets using LLMs and publicly releasing the collected datasets.

    High-quality T2V datasets are essential for improving model performance. Compared to other generative tasks, the available datasets in T2V generation are currently quite scarce, which limits the model performance to some extent.

    Efficient training. F3-Pruning [53] proposes a training-free generalized pruning strategy to prune redundant spatio-temporal attention weights. This speeds up the inference of the T2V model and ensures video quality. VideoLCM [54] applies consistency models (CM) [121] to the video generation domain. It achieves high-fidelity and smooth video synthesis using only four sampling steps, demonstrating the potential of real-time synthesis. ART-V [55] learns only simple continuous motions between neighboring frames and generates videos autoregressively, thus reducing the enormous computational overhead of training. AdaDiff [56] arranges denoising steps according to different samples. It uses the gradient method for optimization to maximize a well-designed reward function. It reduces the inference time by at least one-third while achieving similar results to other methods. HPDM [57] studies patch diffusion models (PDMs) that model the distribution of patches instead of the whole input, keeping 0.7% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. Mobius [58] proposes a highly efficient spatial-temporal parallel training paradigm for T2V tasks. In its 3D-Unet, the temporal and spatial layers are parallel, optimizing the feature flow and backpropagation. This method reduces GPU memory usage by 24% and training time by 12%, offering significant improvement for T2V fine-tuning task.

    Incorporating transformers into diffusion models. VDT [59] is the pioneer in using transformers for diffusion-based video generation; transformers in diffusion models can leverage rich spatio-temporal representations well. Similar to VDT, Latte [60] is also a transformer-based diffusion model and achieves performance beyond that of VDT. W.A.L.T [61] devised a transformer framework that maps the latent vectors of images and videos into the same latent space. The framework is a window attention architecture consisting of self-attention layers that alternate between non-overlapping, window-constrained spatial and spatio-temporal attention. Snap Video [62] replaces the U-Net in the traditional diffusion model with a transformer-based FITs [122] structure and further scales the number of parameters, significantly improving temporal consistency and motion modeling. Sora [11] is the first large-scale general-purpose video generation model to attract widespread attention in the community. It is based on a DiT [123] structure similar to Latte and has several features. First, it can be trained on videos and images with different resolutions, durations, and aspect ratios, at their original sizes. Second, it converts short user prompts into longer, detailed instructions using GPT [124] to improve video quality. Third, it supports generation conditioned on images and videos, including image-to-video generation, extending generated videos, and video editing.

    Multimodality. Considering the limited scale of publicly available text-video pairs, TF-T2V [63] proposes a new T2V generation framework that allows direct learning using text-free videos. The basic principle is to separate the process of text decoding from the process of temporal modeling. For this purpose, it employs a content and a motion branch, jointly optimized with shared weights. In the content branch, paired image-text data is leveraged to learn text-conditioned and image-conditioned spatial appearance generation. The motion branch supports the training of motion dynamic synthesis by feeding text-free videos (or partially paired video-text data if available). VideoCrafter2 [64] separates appearance and motion by utilizing low-quality videos for motion learning and high-quality images for appearance learning. It also suggests using synthetic images with complex concepts instead of real images for fine-tuning.

    Personalized video generation. Animate Anyone [65] proposes a new framework customized for character animation, capable of converting character photos into animated videos controlled by a required sequence of poses while ensuring a consistent appearance and temporal stability. GEN-1 [66] proposes a structure- and content-oriented video diffusion model that can modify existing videos based on text. DynamiCrafter [67] animates open-domain images using a pre-trained video diffusion prior, a proposed dual-stream image injection mechanism, and a dedicated training paradigm.

    Spatio-temporal decoupling. HiGen [68] improves performance by decoupling the spatial and temporal elements of video from both structure and content perspectives. At the structural level, it uses a unified noise reducer to decompose the T2V task into two steps: spatial inference and temporal inference. At the content level, it extracts motion and appearance changes from the content of the input video, respectively. LAMP [69] proposes a new setting for the T2V generation task to balance the generation freedom with the training cost. It learns the motion patterns only from a training set consisting of 8 to 16 videos and later generates subsequent frames using the images generated by the T2I model as the first frame. MotionDirector [70] utilizes a dual-path LoRAs architecture to decouple the learning of appearance and motion and designs a new appearance debiasing temporal loss to mitigate the effect of appearance on the temporal training objective.

    MagicTime [71] designs a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from time-lapse videos, and transform pre-trained T2V models to generate metamorphic videos. Compared with general videos, metamorphic videos capture the entire transformation of the subject and exhibit much larger motion changes. ChronoMagic, a time-lapse video dataset, has been released as the training data. VideoTetris [72] introduces a novel spatio-temporal compositional diffusion, which manipulates the cross-attention values of the denoising network temporally and spatially, synthesizing videos that faithfully follow complex or progressive instructions.

    Controllable T2V generation. PEEKABOO [73] expects users to control video generation interactively. It proposes a new spatial-temporal masked attention module to achieve spatio-temporal control without the extra overhead of training and inference for current video generation models. ControlVideo [74] is adapted from ControlNet [125] and introduces three new modules to improve video generation. First, full cross-frame interaction is added to the self-attention module. Second, it uses frame interpolation to mitigate flicker effects. Finally, it synthesizes multiple consistent short clips. VideoComposer [75] offers a variety of methods for controllable video generation at once. It can simultaneously control spatial and temporal patterns in composited video through textual descriptions, sketch sequences, reference videos, and even simple manual movements. StyleCrafter [76] augments the pre-trained T2V model with a style control adapter, which can generate videos in any style by providing reference images. It uses a style-rich image dataset to train the style control adapter. MobileVidFactory [77] is a system that uses text input to generate videos for mobile devices automatically. The system first generates high-quality videos using a video generator. Then, the user can enrich the visual presentation by adding specified text. Finally, it matches the generated video with the appropriate audio in an audio database. Boximator [78] is a method for fine-grained video motion control that provides two types of boxes, allowing the user to select any object and define its motion without entering additional text. It trains only the control module, which retains knowledge of the underlying model, so its performance improves as the underlying model evolves.

    CTRL-Adapter [79] is an efficient and versatile framework that adds diverse controls to any image or video diffusion model through the adaptation of pretrained ControlNets. Training CTRL-Adapter is much more efficient than training a ControlNet for a new backbone, and it can outperform or match strong baselines in visual quality and spatial control. CameraCtrl [80] implements precise camera pose control for T2V models. After accurately parameterizing the camera trajectory, it can train a plug-and-play camera module on the T2V model and keep the other modules untouched. MotionClone [81] is a training-free framework that performs motion cloning from reference videos to control text-to-video generation. It robustly maintains motion fidelity while assimilating novel textual semantics.

    Remove flicker and artifacts. Flickering and artifacts in the generated video are due to the current model’s lack of learning and generative capabilities. Removing the flickering and artifacts can make the generated video more realistic. DiffSynth [82] proposes a latent in-iteration deflickering framework and a video deflickering algorithm to mitigate the flickering. The latent in-iteration deflickering framework applies the video deflickering algorithm to the latent space of the diffusion model, effectively preventing the accumulation of flicker in intermediate steps. The video deflickering algorithm remaps objects in different frames and blends them to enhance video consistency. Like DiffSynth [82], DSDN [83] also reduces flicker and artifacts in the generated video. It designs two diffusion streams, one for the video content and one for the motion variations, so that the content and the motion can be better aligned. Experiments show that this decomposition also reduces the generation of flicker.

    Complex dynamics modeling. Dysen-VDM [84] proposes a dynamic scene manager module to enhance the dynamics of generated videos. The module consists of 1) extracting key actions from the input text in chronological order, 2) converting the action schedule into a dynamic scene graph representation, and 3) enriching the scenes in the DSG with sufficiently reasonable details. VideoDirGPT [85] inputs the text prompts into GPT-4 [124] to output a video plan, which includes generating scene descriptions, entities with their respective layouts, backgrounds for each scene, and consistent grouping of entities and backgrounds. Finally, the video generator generates the video based on the video plan. LVD [86] utilizes LLMs to generate dynamic scene layouts based on the prompt and then uses the generated layouts to guide the diffusion model to generate video. Such a process does not involve any updates to the parameters of the LLM and the diffusion model.

    All three methods leverage the comprehension capabilities of large language models to guide generative models toward better generation results.

    Domain-specific T2V generation. Text2Performer [87] focuses on the generation of human videos. It has two novel designs: decomposed human representations and a diffusion-based motion sampler. Video Adapter [88] decomposes domain-specific video distributions into pre-trained priors and trainable components, which significantly reduces the cost of tuning large pre-trained video models. DrivingDiffusion [89] generates realistic multi-view driving videos from prompts and 3D layouts.

    Generating longer videos. NUWA-XL [90] is a follow-up work of NUWA [35] that generates long videos from text. It employs a coarse-to-fine generation paradigm: a global diffusion model generates keyframes over the entire period, and then a local diffusion model recursively fills in the content between nearby frames. SEINE [91] proposes a short-to-long (S2L) video diffusion model. It automatically generates transitions based on textual descriptions: transition videos are produced by providing images of different scenes as inputs, combined with text-based control. MTVG [92] proposes multi-text video generation that directly utilizes a pre-trained diffusion-based T2V generation model without additional fine-tuning. FreeNoise [93], like MTVG [92], studies the generation of long videos conditioned on multiple texts. Instead of initializing noise for all frames, FreeNoise rearranges a series of noises for long-range correlation and applies temporal attention to them through window-based fusion. Gen-L-Video [94] extends existing short video diffusion models to generate long videos consisting of hundreds of clips with different semantics, without introducing additional training, while maintaining content consistency. StreamingT2V [95] proposes an autoregressive approach that utilizes novel short-term and long-term dependency blocks to seamlessly carry over video chunks with high motion while preserving high-level scene and object features during the generation process. Vlogger [96] is a system that generates vlogs longer than five minutes from text. It utilizes an LLM as a director and breaks down vlog generation into four phases: Script, Actor, ShowMaker, and Voicer. FIFO-Diffusion [97] efficiently generates very long videos from models trained on short clips (16 frames) by iteratively performing diagonal denoising. This is achieved without additional training and without degrading video quality, while preserving the dynamics and semantics of the scene.

    The diffusion model relies on a long Markov chain of diffusion steps to generate samples, allowing more complex, nonlinear distributions to be modeled than with other architectures. Besides, its training process is stable, converges well, and does not suffer from mode collapse. However, it is notorious for its high time and computational cost. Figure 17 displays the pros and cons of each subcategory under this section.

    Figure  17.  Pros and Cons of diffusion model-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    Training a T2V model from scratch requires tremendous computational cost. Thus, many works focus on how pre-trained T2I models can be utilized to contribute to video generation. The T2I-based model reduces the training cost and ensures the image quality of the generated video.

    Temporal layer. ModelScopeT2V [98] is the first open-source diffusion-based T2V generation model. Spatio-temporal blocks are added to a T2I synthesis model to ensure consistent frame generation and smooth motion transitions. Video LDM [99] first pre-trains the image generator on images. It then introduces a temporal layer and fine-tunes the encoded image sequence to convert the image generator into a video generator. Imagen Video [100] utilizes the mature T2I model Imagen [126] to generate the base video. Six diffusion models are then cascaded, three for spatial super-resolution and three for temporal super-resolution. Each model is trained independently, and the cascade maximizes the performance benefits. The framework is shown in Figure 19. SVD [101] proposes a three-step paradigm for training video generation models: T2I pre-training, video pre-training, and high-quality video fine-tuning. In addition, it provides a series of processes to generate high-quality T2V datasets.

    Figure  19.  The architecture of Imagen Video [100].
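    The common "temporal layer" recipe mentioned above can be sketched as follows: spatial layers process each frame independently, and an inserted temporal attention layer mixes information across frames at every spatial location. The layer below is a generic illustration, not the exact module of any cited model.

```python
# Temporal attention sketch for inflating a T2I backbone into a T2V one.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                   # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)  # sequence along time
        x = x + self.attn(x, x, x)[0]                          # residual temporal mixing
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```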

    Frame adapter. While previous approaches usually added a 1D temporal layer to model the time, MagicVideo [102] considered it unnecessary to use such a complex operation and proposed the concept of the frame adapter, which uses only two sets of parameters to model the relationship between the images and the video. Similar to MagicVideo [102], SimDA [103] employs adapters to transform T2I models into T2V models. It not only includes a lightweight spatial adapter to transfer visual information for T2V learning but also introduces a temporal adapter to model temporal relationships for lower feature dimensions.

    Frame interpolation. CogVideo [4] generates several key frames using a T2I model called CogView2 [3]. Based on the keyframes, several rounds of frame interpolation are performed to form a final video. The process of frame interpolation is an autoregressive process. It also proposes multi-frame rate hierarchical training to align text-video pairs better. The framework is shown in Figure 18. Make-A-Video [5] incorporates a super-resolution module based on frame interpolation to improve video quality. There is no need for text-video pairs for its training, and only video data is needed to learn the motion. MagicVideo-V2 [104] integrates a T2I model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. GridDiffusion [105] generates videos using the grid diffusion model. It first generates key grid images, including four images inside a grid image. After that, masked grid images are inserted into the grid, allowing the interpolation model to generate the masked images autoregressively.

    Figure  18.  The architecture of CogVideo [4].
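    The keyframe-then-interpolate strategy can be summarized by the loop below, where `interp_model` stands in for a learned interpolator (CogVideo's is an autoregressive token model, not a pixel averager); each round roughly doubles the frame count.

```python
# Recursive frame densification sketch: insert a predicted frame between neighbors.
import torch

def densify(frames, interp_model, rounds=2):
    """frames: list of (C, H, W) keyframe tensors."""
    for _ in range(rounds):
        out = [frames[0]]
        for prev, nxt in zip(frames[:-1], frames[1:]):
            out.append(interp_model(prev, nxt))   # predicted in-between frame
            out.append(nxt)
        frames = out
    return torch.stack(frames)                    # (T_dense, C, H, W)
```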

    Optimizing the latent representation of video. Unlike methods that generate keyframes and then interpolate them, the STUNet proposed by Lumiere [106] directly generates all frames in one step and then applies spatial super-resolution on overlapping windows to obtain higher-resolution video. VideoGen [107] utilizes the T2I model to generate a reference image based on the prompt. Then, an efficient cascaded latent diffusion model is introduced, which conditions on the reference image and prompt to generate the latent representation of the video. PYoCo [108] proposes a video diffusion noise for fine-tuning T2I models into T2V models. It fine-tunes eDiff-I [127] to construct a large-scale T2V diffusion model. Text2Video-Zero [109] utilizes a pre-trained T2I model to generate the latent space representation of the image. After that, the latent representation of each frame is generated using a dynamics method and a cross-attention mechanism that only attends to the first frame. Finally, the video is generated by the decoder.

    One-shot video tuning. Tune-A-Video [110] proposes a new task of training a T2V model using only a single text-video pair and a pre-trained T2I model.

    Parameter-free. Latent-Shift [111] proposes a parameter-free temporal shift module that can generate videos based on the T2I model. The module shifts part of the feature-map channels forward and another part backward along the temporal dimension, letting each frame exchange information with its neighbors.
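    A sketch of a temporal shift in this spirit is given below: a slice of channels is shifted one step forward in time and another slice one step backward, so each frame's features mix with its neighbors without any extra parameters. The fold ratio is illustrative.

```python
# Parameter-free temporal shift sketch.
import torch

def temporal_shift(x, fold_div=8):
    """x: (B, T, C, H, W); shifts C // fold_div channels each way along T."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the rest unchanged
    return out
```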

    LLM as a helper. DirecT2V [112] and Free-Bloom [113] use language models to transform user prompts into detailed frame descriptions, then employ a T2I model to generate each frame. DirecT2V enhances frame consistency using novel value mapping and dual softmax filtering, while Free-Bloom proposes joint noise sampling and dual-path interpolation. FlowZero [114] utilizes an LLM to generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. The DSS then guides the image diffusion model in generating videos. In particular, it proposes a self-refining iterative process that enhances the alignment of the video with the text. GPT4Motion [115] utilizes GPT-4 [124] to generate Blender scripts based on user prompts, producing coherent physical motion across frames. Blender is an open-source 3D creation suite that provides modeling, animation, and rendering tools that facilitate the creation of detailed 3D scenes.

    Adds motion to the image. Motion-I2V [116], a new framework for image-to-video generation (I2V), decomposes image-to-video generation into two stages. In the first phase, a diffusion-based motion field predictor predicts motion from the text and the image. For the second stage, a video latent diffusion model generates the final video based on the image and motion. Note that when using this approach to generate video, it is necessary to use other models to generate the image corresponding to the text as a first step.

    Implementing a T2V model based on a pre-trained T2I model reduces training costs and improves image quality. However, generating video from text is split into two parts and is no longer an end-to-end process. At the same time, the degree of freedom and the magnitude of motion are slightly worse than those of end-to-end T2V models. Figure 20 displays the pros and cons of each subcategory under this section.

    Figure  20.  Pros and Cons of T2I-based approaches. Along the red arrows, the order of subcategories corresponds to the order of the narrative.

    Compared to other research fields, T2V requires a lot of computational and data resources, and the models are usually released by industry. For commercial reasons, many models and training details are not open source. We summarize the existing open-source methods in Table 2 to help researchers quickly get started with the experiments.

    Table  2.  Open source T2V methods collation.
    Method Venue Frames Resolution Code Official Release
    Follow Your Pose [128] AAAI24 8 512×512 https://github.com/mayuelala/FollowYourPose
    ConditionVideo [129] AAAI24 24 512×512 https://github.com/pengbo807/ConditionVideo
    Make-A-Video [5] Arxiv22 16 256×256 https://github.com/lucidrains/make-a-video-pytorch ×
    LVDM [42] Arxiv22 16 256×256 https://github.com/YingqingHe/LVDM
    DirecT2V [112] Arxiv23 16 512×512 https://github.com/KU-CVLAB/DirecT2V
    LaVie [45] Arxiv23 61 1280×2048 https://github.com/Vchitect/LaVie
    ModelScope [98] Arxiv23 16 256×256 https://modelscope.cn/models/iic/text-to-video-synthesis/summary
    VidRD [52] Arxiv23 16 256×256 https://github.com/anonymous0x233/ReuseAndDiffuse
    VideoDirectorGPT [85] Arxiv23 16 256×256 https://github.com/HL-hanlin/VideoDirectorGPT
    Show-1 [44] Arxiv23 29 320×576 https://github.com/showlab/Show-1
    VideoFusion [49] Arxiv23 33 256×256 https://github.com/ai-forever/KandinskyVideo
    HiGen [68] Arxiv23 32 448×256 https://github.com/ali-vilab/VGen
    Animate Anyone [65] Arxiv23 24 768×768 https://github.com/HumanAIGC/AnimateAnyone
    StyleCrafter [76] Arxiv23 16 320×512 https://github.com/GongyeLiu/StyleCrafter
    DynamiCrafter [67] Arxiv23 16 576×1024 https://github.com/Doubiiu/DynamiCrafter
    MotionDirector [70] Arxiv23 16 384×384 https://github.com/showlab/MotionDirector
    FlowZero [114] Arxiv23 8 512×512 https://github.com/aniki-ly/FlowZero
    MagicTime [71] Arxiv24 16 512×512 https://github.com/PKU-YuanGroup/MagicTime
    Ctrl-Adapter [79] Arxiv24 16 256×256 https://github.com/HL-hanlin/Ctrl-Adapter
    CameraCtrl [80] Arxiv24 16 256×384 https://github.com/hehao13/CameraCtrl
    VideoTetris [72] Arxiv24 16 320×512 https://github.com/YangLing0818/VideoTetris
    MotionClone [81] Arxiv24 16 512×512 https://github.com/Bujiazi/MotionClone/
    Latte [60] Arxiv24 16 256×256 https://github.com/Vchitect/Latte
    VideoCrafter2 [64] Arxiv24 16 320×512 https://github.com/AILab-CVC/VideoCrafter
    MMVID [37] CVPR22 8 128×128 https://github.com/snap-research/MMVID
    MAGVIT [40] CVPR23 16 128×128 https://github.com/google-research/magvit
    Text2Performer [87] CVPR23 20 512×256 https://github.com/yumingj/Text2Performer
    Dysen-VDM [84] CVPR24 16 256×256 https://github.com/scofield7419/Dysen
    BIVDiff [130] CVPR24 8 512×512 https://github.com/MCG-NJU/BIVDiff
    LAMP [69] CVPR24 16 320×512 https://github.com/RQ-Wu/LAMP
    Tune-A-Video [110] ICCV23 32 512×512 https://github.com/showlab/Tune-A-Video
    Text2Video-Zero [109] ICCV23 8 512×512 https://github.com/Picsart-AI-Research/Text2Video-Zero
    CogVideo [4] ICLR23 16 480×480 https://github.com/THUDM/CogVideo
    LVD [86] ICLR24 16 512×512 https://github.com/TonyLianLong/LLM-groundedVideoDiffusion
    AnimateDiff [131] ICLR24 16 256×256 https://github.com/guoyww/AnimateDiff
    FreeNoise [93] ICLR24 64 1024×576 https://github.com/AILab-CVC/FreeNoise
    VDM [6] NeurIPS22 16 64×64 https://github.com/lucidrains/video-diffusion-pytorch ×
    Free-Bloom [113] NeurIPS23 6 512×512 https://github.com/SooLab/Free-Bloom

    Datasets for the T2V task can be categorized into two classes based on their text [14]: caption-level datasets, where the text is a detailed description of the corresponding video, and category-level datasets, where the text is a category label of the video.

    We list the common caption-level datasets in the T2V task in Table 3. From the table, we can observe that early datasets were manually annotated with text (Manual), and their videos are small in number, limited to a single domain (e.g., movie, action, cooking), and low in resolution (e.g., 240P). With the release of WebVid-10M [132], T2V datasets entered an era of rapid development, and WebVid-10M has become the most dominant dataset in the T2V task. However, its resolution is too low and its videos carry a watermark, leading to poor video quality. Therefore, subsequent datasets have increased the video resolution and added algorithms to filter inappropriate videos (e.g., those with watermarks or subtitles).

    Table  3.  The comparison of main caption-level video datasets.
    Dataset Text Domain Clips Res.
    MSVD/2011 [146] Manual Open 2K -
    MSR-VTT/2016 [147] Manual Open 10K 240P
    DideMo/2017 [148] Manual Flickr 27K -
    LSMDC/2017 [149] Manual Movie 118K 1080P
    ActivityNet/2017 [150] Manual Action 100K -
    YouCook2/2018 [151] Manual Cooking 14K -
    How2/2018 [152] Manual Instruct 80K -
    VATEX/2019 [153] Manual Action 41K 240P
    HowTo100M/2019 [133] ASR Instruct 136M 240P
    WTS70M/2020 [134] Metadata Action 70M -
    YT-Temporal/2021 [154] ASR Open 180M -
    WebVid10M/2021 [132] Alt-text Open 10.7M 360P
    Echo-Dynamic/2021 [155] Manual ECG 10K -
    Tiktok/2021 [156] Manual Action 0.3K -
    HD-VILA/2022 [157] ASR Open 103M 720P
    VideoCC3M/2022 [135] Transfer Open 10.3M -
    HD-VG-130M/2023 [16] Generated Open 130M 720P
    InternVid/2023 [10] Generated Open 234M 720P
    CelebV-Text/2023 [158] Generated Face 70K 480P
    Vimeo25M/2023 [45] Generated Open 25M -
    Panda-70M/2024 [140] Generated Open 70M 720P
    VidProM/2024 [144] Collected Open 6M -
    MiraData/2024 [159] Generated Game 57K -

    In addition to gradually improving the quality of the videos in the dataset, the newly released datasets also pay more attention to the alignment between text and video. Improving the alignment between text and video improves the generation performance of the model, which has been demonstrated in recent work [10], [101].

    Manual annotation can provide high-quality text, but as the number of videos rises, the burden of manual labor becomes unbearable. HowTo100M [133] and other datasets collect videos from YouTube and use the automatic speech recognition (ASR) transcripts provided by YouTube as the texts, but the semantic relevance is low. WebVid10M [132] uses Alt-text, and WTS70M [134] uses metadata (titles, descriptions, tags, and channel names). VideoCC3M [135] transfers a text-image dataset into a text-video dataset, using Conceptual Captions 3M [136] as the source: for each image-text pair, it finds video frames similar to the image, extracts short video clips around the matching frames, and pairs the text with those clips.

    The latest datasets all use different generative methods to get the texts, which saves labor and also ensures that the quality of the texts is high.

    HD-VG-130M [16] first cuts the videos using PySceneDetect so that each clip contains only one scene. It then selects the middle frame of each clip and uses BLIP-2 [137] to generate a textual description, which is used to describe the whole clip. InternVid [10] generates text at two scales, coarse and fine; the coarse scale is generated in the same way as HD-VG-130M [16]. At the fine scale, Tag2Text [138] generates a text description for each frame of the video, and these descriptions are then synthesized into a comprehensive caption using a pre-trained language model. CelebV-Text [158] utilizes a semi-automatic, template-based text generation strategy: an algorithm automatically labels attributes that are easy to label, attributes that are difficult to label are annotated manually, and the attributes are then filled into a template to obtain the final description of the video. Vimeo25M [45] uses VideoChat [139] to generate text automatically. Panda-70M [140] utilizes multiple models (including VideoLLaMA [141], VideoChat [139], VideoChat-Text [139], BLIP-2 [137], and MiniGPT-4 [142]) to generate texts and then fine-tunes Unmasked Teacher (UMT) [143] to select the best one. To minimize the computational requirements, it proposes a student model that distills knowledge from the teacher models. VidProM [144] collected 1.67 million T2V prompts from real users; based on these prompts, 6.69 million videos were generated by Pika, Text2Video-Zero [109], VideoCrafter2 [64], and ModelScope [98]. MiraData [159] uniformly samples eight frames from each video and arranges them into a large 2×4 grid image. A one-sentence caption is then generated for each video using Panda-70M's [140] caption model, and this caption is fed into GPT-4V [145] together with the 2×4 grid image to output multi-dimensional captions efficiently in one dialog round.
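    The "caption the middle frame" recipe described for HD-VG-130M can be sketched with BLIP-2 from Hugging Face transformers; the checkpoint name and the frame extraction step are illustrative assumptions.

```python
# Caption one representative frame per clip with BLIP-2.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_clip(frames):                      # frames: list of PIL images for one clip
    middle = frames[len(frames) // 2]          # single representative frame
    inputs = processor(images=middle, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
```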

As shown in Figure 21, we give examples of text-video pairs from MSVD, MSR-VTT, WebVid10M, and Panda70M to illustrate the development of T2V datasets. For each selected video, we show four frames sampled uniformly over time; if a video has multiple text annotations, we show two of them. For comparison, the videos are resized to the same size. Both MSVD and MSR-VTT have multiple text annotations per video, and MSVD even contains incorrect annotations. The video from MSR-VTT contains multiple scenes, while the others are single scenes. From WebVid10M to Panda70M, the text annotations become noticeably more precise.

    Figure  21.  Showcase of different datasets.

Without a suitable caption-level dataset, early T2V work used category-level datasets from other tasks to train models, e.g., UCF101 [160], Kinetics [161], and Something-Something [162] from action recognition, and DAVIS [163] from video editing. We list the category-level datasets that have been used for the T2V task in Table 4.

    Table  4.  The comparison of main Category-level video datasets.
    Datasets Categories Clips Res.
    KTH/2004 [164] 6 2K 160×120
    MUG/2010 [165] 6 1K 896×896
    UCF-101/2012 [160] 101 13K 256×256
    Cityscapes/2015 [166] 30 3K 256×256
    Moving MNIST/2016 [167] 10 10K 64×64
    Kinetics-400/2017 [168] 400 260K 256×256
    BAIR/2017 [169] 2 45K 64×64
    DAVIS/2017 [163] - 90 1280×720
    Sky Time-Lapse/2018 [170] 1 38K 256×256
    Ssthv2/2018 [162] 174 220K 256×256
    Kinetics-600/2018 [171] 600 495K 256×256
    MiT/2018 [172] 339 1M 340×256
    Tai-Chi-HD/2019 [173] 1 3K 256×256
    iPER/2019 [174] 10 206 256×256
    Bridge Data/2021 [175] 10 7K 256×256
    Mountain Bike/2022 [176] 1 1K 576×1024
    RDS/2023 [99] 2 683K 512×1024

Quantitative metrics cover the visual quality of generated videos and the alignment between text and video. To better evaluate the performance of T2V models, EvalCrafter [17] further refines the metrics on visual quality and text-video alignment and proposes metrics on motion quality and temporal consistency. These are introduced in the following four subsections. Qualitative metrics, i.e., subjective human evaluations, are introduced in Section 5.5.

The traditional metrics for measuring the visual quality of video are FVD [177] and IS [178], both developed from image-level visual metrics.

    Fréchet Video Distance (FVD) [177] builds on the principle of FID [179]. It measures the visual quality of the generated video by calculating the distance between the generated video’s distribution and the real video’s distribution. The calculation formula is shown in Eq. (1),

$d(P_R, P_G) = \lvert \mu_R - \mu_G \rvert^2 + \mathrm{Tr}\big(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\big)$
(1)

where $\mu_R$ and $\mu_G$ are the means, and $\Sigma_R$ and $\Sigma_G$ are the covariance matrices of $P_R$ and $P_G$, respectively. FVD [177] adopts Inflated 3D ConvNets (I3D) [168] pretrained on Kinetics [161] to extract features from videos.
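Given the I3D features of real and generated videos, Eq. (1) reduces to a few lines of linear algebra. The sketch below is a minimal illustration assuming two NumPy feature matrices; it is not the reference FVD implementation, which also fixes the feature extractor and sampling protocol.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Eq. (1): distance between two Gaussians fitted to video features.

    feats_real, feats_gen: (num_videos, feat_dim) I3D features.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```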

Inception Score (IS) [178] uses the Inception network [180], pre-trained on the ImageNet [181] dataset, as the feature extractor to evaluate image quality. When evaluating video quality, the feature extractor is replaced by 3D ConvNets (C3D) [182]. The calculation formula is shown in Eq. (2),

$\mathrm{IS} = \exp\big(\mathbb{E}_{x \sim P_G}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\big)$
(2)

where $p(y)$ is the marginal distribution over all videos and $p(y \mid x)$ denotes the output distribution of the classifier given a generated video $x$. IS measures the diversity of the generated videos, with larger scores indicating more variety in the generated content.
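Eq. (2) is likewise straightforward once the per-video class probabilities from the C3D classifier are available. The following is a small illustrative sketch that assumes a `(num_videos, num_classes)` probability matrix; it is not tied to any particular official implementation.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Eq. (2): IS from per-video class probabilities.

    probs: (num_videos, num_classes) softmax outputs of the video classifier.
    """
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # KL(p(y|x) || p(y)) per class
    return float(np.exp(kl.sum(axis=1).mean()))               # exp of the mean KL
```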

A recent study, EvalCrafter [17], utilizes Dover [183] to assess the visual quality of generated videos. Dover consists of two components, VQA$_A$ and VQA$_T$, the aesthetic and technical scores, respectively. The technical perspective quantifies the perception of distortions, while the aesthetic perspective focuses on content preferences and recommendations.

    In addition to video quality assessment, measuring the alignment between input text and generated video is another important perspective for evaluating T2V generation. The traditional evaluation metric is CLIPSIM [27], and EvalCrafter [17] further proposes more metrics to measure the text-video alignment more comprehensively. These evaluation metrics will be described below.

CLIPSIM [27] is calculated by first encoding each frame and the text with the CLIP [19] model to obtain embeddings and then computing the cosine similarity between them. The similarities between the frames and the input text are averaged to give the final similarity between the video and the text, as described in Eq. (3),

$\mathrm{CLIPSIM}(p, x) = \frac{1}{T}\sum_{t=1}^{T} C\big(\mathrm{emb}(x_t), \mathrm{emb}(p)\big)$
(3)

where $x_t$ denotes the $t$-th frame of the video, $\mathrm{emb}(\cdot)$ the CLIP embedding, $C(\cdot,\cdot)$ the cosine similarity, and $p$ the text.
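As a concrete illustration of Eq. (3), the sketch below computes CLIPSIM with the Hugging Face CLIP checkpoint `openai/clip-vit-base-patch32`, taking a list of PIL frames as input; the original GODIVA implementation may use a different CLIP variant and frame-sampling scheme.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipsim(frames, prompt: str) -> float:
    """Eq. (3): average CLIP cosine similarity between each frame and the prompt.

    frames: list of PIL.Image frames sampled from the generated video.
    """
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**image_inputs)
    txt_emb = model.get_text_features(**text_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # L2-normalize
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).mean())               # average over frames
```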

It is worth mentioning that the accuracy of CLIPSIM depends entirely on the CLIP [19] model. To reduce this side effect, the Relative Matching (RM) [27] metric calculates the ratio of the CLIPSIM of the generated video to that of the ground-truth video. There are three other CLIPSIM-like metrics: CLIPScore-ft is based on a CLIP model fine-tuned on the MSR-VTT dataset [147], while BLIPScore and UMTScore use BLIP [137] and UMT [143] instead of CLIP.

In practical scenarios, limited by the performance of the CLIP model and the complexity of the prompts, the traditional metrics above do not work well. Therefore, a series of metrics is proposed in EvalCrafter [17].

SD-Score uses SDXL [184] to generate $N_1$ images per prompt and extracts visual embeddings to compute the similarity between the generated video and the SDXL images. Essentially, SDXL [184] acts as the teacher and the video generation model as the student, with the student's results expected to be close to the teacher's. The calculation is shown in Eq. (4),

$S_{SD} = \frac{1}{M}\sum_{i=1}^{M}\bigg(\frac{1}{N}\sum_{t=1}^{N}\Big(\frac{1}{N_1}\sum_{k=1}^{N_1} C\big(\mathrm{emb}(x_t^i), \mathrm{emb}(d_k^i)\big)\Big)\bigg)$
(4)

where $x_t^i$ denotes the $t$-th frame of the $i$-th video and $d_k^i$ the $k$-th SDXL-generated image for prompt $i$. $N_1$ is typically set to 5.

BLIP-BLEU uses BLIP2 [185] to generate captions for the generated video and computes the BLEU [186] similarity between these captions and the prompt, as shown in Eq. (5),

$S_{BB} = \frac{1}{M}\sum_{i=1}^{M}\Big(\frac{1}{N_2}\sum_{k=1}^{N_2} B\big(p_i, l_k^i\big)\Big)$
(5)

where $B(\cdot,\cdot)$ is the BLEU similarity scoring function, $\{l_k^i\}_{k=1}^{N_2}$ are the BLIP-generated captions for the $i$-th video, and $N_2$ is typically set to 5.
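Assuming the BLIP-generated captions are already available, Eq. (5) for a single video reduces to averaging BLEU scores. The sketch below uses NLTK's `sentence_bleu` and treats the prompt as the reference, which is one reasonable choice but not necessarily the one made in EvalCrafter.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def blip_bleu(prompt: str, generated_captions: list[str]) -> float:
    """Eq. (5) for a single video: average BLEU between the prompt and each
    BLIP-generated caption of the generated video."""
    smooth = SmoothingFunction().method1        # avoid zero scores on short texts
    reference = [prompt.lower().split()]
    scores = [
        sentence_bleu(reference, caption.lower().split(), smoothing_function=smooth)
        for caption in generated_captions
    ]
    return sum(scores) / len(scores)
```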

OCR-Score checks whether text that is required to appear in the video actually appears in the generated video, testing the model's ability to render text. PaddleOCR is used to detect the English text in the generated video; the word error rate (WER) [187], the normalized edit distance (NED) [188], and the character error rate (CER) [189] are then calculated, and their average is the OCR-Score.
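The error rates behind OCR-Score are all edit-distance based. The sketch below gives illustrative WER and CER implementations, assuming the recognized text has already been obtained from PaddleOCR; the exact normalization used in EvalCrafter may differ.

```python
def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between the requested text and the OCR output."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate on the same pair of strings."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```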

    Detection-Score detects whether the requested objects appear in the video,

$S_{Det} = \frac{1}{M_1}\sum_{i=1}^{M_1}\Big(\frac{1}{N}\sum_{t=1}^{N}\sigma_t^i\Big)$
(6)

where $M_1$ is the number of prompts containing objects and $\sigma_t^i$ is the detection result for frame $t$ of video $i$ (1 if an object is detected and 0 otherwise).

    Count-Score detects whether the number of objects in the video is correct,

$S_{Count} = \frac{1}{M_2}\sum_{i=1}^{M_2}\Big(1 - \frac{1}{N}\sum_{t=1}^{N}\frac{\lvert c_t^i - \hat{c}_i\rvert}{\hat{c}_i}\Big)$
(7)

where $M_2$ is the number of prompts with object counts, $c_t^i$ is the detected object count in frame $t$ of video $i$, and $\hat{c}_i$ is the ground-truth object count for video $i$.
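Given the per-frame detection results, Eq. (7) is plain arithmetic. The following sketch assumes a matrix of detected counts and a vector of ground-truth counts (variable names are illustrative).

```python
import numpy as np

def count_score(detected_counts: np.ndarray, gt_counts: np.ndarray) -> float:
    """Eq. (7): detected_counts has shape (M2, N) with the object count found in
    each frame; gt_counts has shape (M2,) with the count requested by each prompt
    (assumed to be positive)."""
    rel_err = np.abs(detected_counts - gt_counts[:, None]) / gt_counts[:, None]
    return float((1.0 - rel_err.mean(axis=1)).mean())
```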

    Color-Score detects whether the color in the video matches the description in the prompt,

$S_{Color} = \frac{1}{M_3}\sum_{i=1}^{M_3}\Big(\frac{1}{N}\sum_{t=1}^{N} s_t^i\Big)$
(8)

where $M_3$ is the number of prompts with object colors and $s_t^i$ is the color accuracy for frame $t$ of video $i$ (1 if the detected color matches the ground-truth color, 0 otherwise).

    Celebrity ID Score calculates the distance between the celebrity in the generated video and the real image of the celebrity,

$S_{CIS} = \frac{1}{M_4}\sum_{i=1}^{M_4}\Big(\frac{1}{N}\sum_{t=1}^{N}\min_{k \in \{1,\dots,N_3\}} D\big(x_t^i, f_k^i\big)\Big)$
(9)

where $M_4$ is the number of prompts that contain celebrities, $D(\cdot,\cdot)$ is DeepFace's [190] distance function, $\{f_k^i\}_{k=1}^{N_3}$ are the collected images of the celebrity in prompt $i$, and $N_3$ is set to 3.

Previous T2V studies did not consider metrics for evaluating the motion quality of the generated video. EvalCrafter [17] proposes Action-Score, Flow-Score, and Motion AC-Score for motion quality assessment.

    Action Recognition (Action-Score) recognizes human actions in the generated video using the MMAction2 toolbox [191]. The action score is calculated as accuracy by comparing the recognized action with the action in the original prompt.

Average Flow (Flow-Score) uses the pre-trained optical flow estimator RAFT [192] to extract dense flow at two-frame intervals and then averages the flow magnitude over the whole video clip. This helps identify static videos.
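To illustrate the averaging step of Flow-Score, the sketch below substitutes OpenCV's Farnebäck estimator for RAFT purely so that the example is self-contained; EvalCrafter itself uses RAFT flow at two-frame intervals.

```python
import cv2
import numpy as np

def average_flow_score(frames: list[np.ndarray]) -> float:
    """Average optical-flow magnitude over a clip (RAFT replaced here by
    OpenCV's Farneback estimator purely for illustration).

    frames: list of HxWx3 uint8 RGB frames.
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    magnitudes = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(magnitudes))   # near-zero values indicate a static video
```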

    Amplitude Classification Score (Motion AC-Score). Based on the average flow, Motion AC-Score calculates the motion amplitude of the generated video and determines whether the amplitude is the same as the amplitude specified by the prompt. This gives us a clearer picture of the motion changes in the video.

Warping Error. A pre-trained optical flow estimation network [192] is first used to obtain the optical flow between every two frames; the difference between the warped frame and the predicted frame is then computed pixel by pixel, and the final score is the average over all frame pairs.
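A minimal sketch of the warping step is shown below, assuming the forward optical flow between consecutive frames has already been estimated. Frame $t{+}1$ is warped back to frame $t$ by backward sampling with `cv2.remap`, and the mean absolute difference serves as the per-pair error; the exact aggregation in EvalCrafter may differ.

```python
import cv2
import numpy as np

def warping_error(frame_t: np.ndarray, frame_t1: np.ndarray,
                  flow_t_to_t1: np.ndarray) -> float:
    """Warp frame t+1 back to frame t with the forward flow and measure the
    per-pixel difference (lower means more temporally consistent).

    frame_t, frame_t1: HxWx3 float32 frames in [0, 1].
    flow_t_to_t1:      HxWx2 optical flow from frame t to frame t+1.
    """
    h, w = flow_t_to_t1.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_t_to_t1[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_t_to_t1[..., 1]).astype(np.float32)
    warped = cv2.remap(frame_t1, map_x, map_y, cv2.INTER_LINEAR)
    return float(np.abs(warped - frame_t).mean())
```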

Semantic Consistency (CLIP-Temp). The CLIP embedding is computed for each frame of the generated video, the similarity between every two consecutive frames is calculated, and the average of these similarities gives the final score.

    Face Consistency. This metric evaluates the human identity consistency of the generated video. It is calculated by selecting the first frame as the reference frame and calculating the cosine similarity between the embedding of the reference frame and the embeddings of other frames. The average of these similarities is taken as the final score.
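Both temporal-consistency metrics reduce to averaging cosine similarities between frame embeddings. The sketch below assumes precomputed, L2-normalized embeddings (CLIP embeddings for CLIP-Temp, face embeddings for Face Consistency) and is only illustrative.

```python
import numpy as np

def clip_temp(frame_embs: np.ndarray) -> float:
    """CLIP-Temp: average cosine similarity between consecutive frame embeddings.
    frame_embs: (T, D) L2-normalized CLIP embeddings of the generated frames."""
    sims = (frame_embs[:-1] * frame_embs[1:]).sum(axis=-1)
    return float(sims.mean())

def face_consistency(face_embs: np.ndarray) -> float:
    """Face Consistency: similarity of every frame's face embedding to the
    first (reference) frame, averaged. face_embs: (T, D), L2-normalized."""
    sims = face_embs[1:] @ face_embs[0]
    return float(sims.mean())
```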

Although many automated evaluation metrics have been introduced in the previous sections, some of them have been found to be inconsistent with human judgments [3], [193], [194], indicating that automated metrics may not always be reliable. Therefore, the human perspective remains essential for evaluating generated videos.

    There are four main benchmarks that are widely used by the public: DrawBench [195], FETV [18], EvalCrafter [17] and VBench [196].

DrawBench [195] is a benchmark for T2I generation, but it can also be used for T2V generation. It was proposed to compensate for COCO's [197] limited range of prompts, following efforts such as the newly proposed PaintSkills [198] that systematically assess visual reasoning skills and social biases beyond COCO [197]. DrawBench has eleven evaluation categories with a total of 200 prompts, covering color, count, spatial positioning, conflicting interactions, long descriptions, misspellings, rare words, quoted words, and so on.

    FETV [18] is a fine-grained evaluation benchmark for T2V generation. It consists of 619 prompts, with 541 prompts sourced from existing datasets and 78 unique prompts created by the authors. Each prompt is categorized based on three aspects: the main content, attributes, and complexity. The feature referred to as “main content” was further divided into spatial and temporal categories. Similarly, “attribute control” encompasses both spatial and temporal qualities. The feature of “prompt complexity” is categorized into three levels: “simple”, “medium”, and “complex”, which are determined by the number of consecutive words in the prompts. By employing classification, the FETV benchmark can be subdivided into distinct subsets, enabling fine-grained evaluation.

    EvalCrafter [17] aims to create a list of reliable prompts to assess the capabilities of various T2V models fairly. To achieve its goal, EvalCrafter collected and analyzed a large number of prompts from the real world and selected more than 500. Afterward, EvalCrafter proposes an automated pipeline to increase the diversity of the selected prompts. In total, there are 50 styles and 20 camera motion prompts in the benchmark, and the average length of the prompts is 12.5 words, similar to real-world prompts.

    VBench [196] is a comprehensive benchmark suite for video generative models. It decomposes video generation quality into 16 dimensions, and each evaluation dimension assesses one aspect of video generation quality. To reduce the overhead of generating videos, it accurately filters the set of tested prompts; for each metric, there are only 100 prompts. Experiments show that VBench’s evaluation results align well with human perception.

    GenAI-Arena [199] is an open platform designed to rank generative models across text-to-image, image editing, and text-to-video tasks based on user preferences. Unlike other platforms, it is driven by community voting to ensure transparency and sustainable operation. It adopts the side-by-side human voting method to evaluate the models and releases the human preference voting as GenAI-Bench.

Based on the benchmarks mentioned above, researchers present the videos generated by their model alongside those generated by other models and ask observers to choose the best one according to certain aspects. The commonly examined aspects are video frame quality, semantic relevance, motion realism, etc.

To demonstrate the consistency of automatic assessment results with human assessment, some studies [17], [18] calculate Spearman's rank correlation coefficient [200] and Kendall's rank correlation coefficient [201]. These coefficients reveal the direction and strength of the relationship between automatic and human assessment scores.
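Both coefficients are available in SciPy. The sketch below compares an automatic metric's scores with human ratings collected on the same set of generated videos; the function and dictionary-key names are illustrative.

```python
from scipy.stats import spearmanr, kendalltau

def metric_human_agreement(metric_scores, human_scores):
    """Rank correlations between automatic metric scores and human ratings
    collected on the same generated videos."""
    rho, rho_p = spearmanr(metric_scores, human_scores)
    tau, tau_p = kendalltau(metric_scores, human_scores)
    return {"spearman": rho, "spearman_p": rho_p,
            "kendall": tau, "kendall_p": tau_p}
```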

Dataset. Currently, the T2V task is mainly evaluated in a zero-shot manner on the MSR-VTT [147] and UCF-101 [160] datasets. MSR-VTT [147] consists of 10,000 video clips in 20 categories, each described by approximately 20 natural-language sentences. Typically, the textual descriptions of the 2,990 video clips in the test set are used as prompts to generate the corresponding videos. UCF-101 [160] consists of 13,320 video clips divided into 101 categories.

    Evaluation Metrics. For the MSR-VTT [147] dataset, the FVD [177] and FID [179] metrics are used to evaluate the video quality, and CLIPSIM [27] is used to measure the alignment between text and video. For the UCF-101 [160] dataset, the Inception Score, FVD [177], and FID [179] are used to evaluate the quality of the generated video and its frames. Many of the metrics mentioned are not yet widely used and are therefore not included in the statistics.

Comparison of Results. We summarize the experimental results of most existing methods in Table 5.

    Table  5.  Organization of experimental results on video generation methods.
    Method Venue Training Dataset Res. Params MSRVTT UCF-101
    FID(↓) FVD(↓) CLIPSIM(↑) FID(↓) FVD(↓) IS(↑)
    [UCF-101] Generate videos directly using category names.
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    MagicVideo [102] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 36.5 998 145 655
    MicroCinema [202] Arxiv23 WebVid [132] 448 × 448 377.4 0.2967 342.86 37.46
    [UCF-101] Generate videos using the template sentences corresponding to category names.
    Make-A-Video [5] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 9.72B 13.17 0.3049 367.23 33
    VideoFactory [16] Arxiv24 256 × 256 2.04B 0.3005 410
    POS [51] Arxiv23 training free 256 × 256 42.29 0.2993 566.68 38.19
    VideoGen [107] Arxiv23 WebVid [132] 256 × 256 0.3127 554 71.61
    PYoCo [108] ICCV23 256 × 256 9.73 355.19 47.76
    Latent-shift [111] Arxiv23 256 × 256 1.53B 15.23 0.2773
    PixelDance [203] Arxiv23 336 × 596 1.5B 381 0.3125 49.36 242.82 42.1
    [MSRVTT] Generate videos for all sentences.
    Make-A-Video [5] Arxiv22 WebVid [132]
    HD-VILA [157]
    256 × 256 9.72B 13.17 0.3049 367.23 33
    VideoGen [107] Arxiv23 WebVid [132] 256 × 256 0.3127 554 71.61
    Latent-shift [111] Arxiv23 256 × 256 1.53B 15.23 0.2773
    [MSRVTT] Generate videos by randomly selecting one out of every 20 sentences.
    VideoFactory [16] Arxiv24 WebVid [132]
    HD-VILA [157]
    256 × 256 2.04B 0.3005 410
    VersVideo [43] ICLR24 WebVid [132] 256 × 256 2B 421 0.3014 119 81.3
    POS [51] Arxiv23 training free 256 × 256 42.29 0.2993 566.68 38.19
    SimDA [103] CVPR24 WebVid [132] 256 × 256 1.08B 456 0.2945
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    PixelDance [203] Arxiv23 WebVid [132] 336 × 596 1.5B 381 0.3125 49.36 242.82 42.1
    UniVG [204] Arxiv24 1280 × 720 336 0.3014
    [FVD] Sample size with 2048.
    CogVideo [4] ICLR23 WebVid [132] 480 × 480 9.4B 23.59 1294 0.2631 179 701.59 25.27
    Show-1 [44] Arxiv23 320 × 576 13.08 538 0.3072 394.46 35.42
    ModelScope [98] Arxiv23 256 × 256 1.7B 11.09 550 0.293 410
    LVD [86] ICLR24 training free 512 × 512 1.7B 521 861
    [FVD] Sample size with 10000.
    LaVie [45] Arxiv23 WebVid [132]
    Vimeo [45]
    320 × 512 3B 0.2949 526.3
    MicroCinema [202] Arxiv23 WebVid [132] 448 × 448 377.4 0.2967 342.86 37.46
    [Others]
    VDM [6] NeurIPS22 UCF101 [160] 64 × 64 9.72B 298 57.62
    NUWA [35] ECCV22 VATEX [153] 256 × 256 47.68 0.2439
    CogVideo [4] ICLR23 WebVid [132] 480 × 480 9.4B 23.59 1294 0.2631 179 701.59 25.27
    LVDM [42] Arxiv22 256 × 256 1.16B 742 0.2381 641.8
    Dysen-VDM [84] CVPR24 256 × 256 12.64 0.3204 325.42 35.57
    VideoDirGPT [85] Arxiv23 256 × 256 1.92B 12.22 550 0.286
    VideoFusion [49] CVPR23 256 × 256 1.83B 581 0.2795 75.77 639.9 17.49
    Video-LDM [99] CVPR23 256 × 256 4.2B 0.2929 550.61 33.45
    HiGen [68] Arxiv23 448 × 256 8.6 406 0.2947
    VideoComposer [75] NeurIPS23 256 × 256 1.85B 580 0.2932
    TF-T2V [63] Arxiv23 448 × 256 8.19 441 0.2991
    ART•V [55] Arxiv23 768 × 768 291.08 0.2859 315.69 50.34
    MoVideo [46] Arxiv23 256 × 256 12.71 0.3213 313.41 34.13
    InternVid [10] Arxiv23 WebVid [132]
    InternVid [10]
    256 × 256 0.2951 60.25 616.51 21.04
    VidRD [52] Arxiv23 WebVid [132]
    Kinetics [168]
    VideoLT [205]
    256 × 256 363.19 39.37
    Imagen Video [100] Arxiv23 1280 × 768
    FusionFrames [206] Arxiv23 256 × 256 0.2976 433.05 24.33
    W.A.L.T [61] Arxiv23 128 × 224 3B 258.1 35.1
    HPDM [57] CVPR24 144 × 256 383.3 21.15

All experimental results in Table 5 are taken from the original papers. It is worth noting that there is no consensus on experimental settings for T2V, so the results of these works are only partially comparable. Nevertheless, listing them in one table gives readers a brief impression of the trend of T2V evaluation. To make the most of the comparable results, we group comparable works together to give more precise insights.

    Generate videos directly using category names on UCF-101 [45], [102], [202]: [202] performs the best with the lowest FVD value.

    Generate videos using the template sentences corresponding to category names on UCF-101 [5], [16], [51], [107], [108], [111], [203]: [203] gives the best FVD value and [107] gives the best IS value.

    Generate videos for all sentences on MSRVTT [5], [107], [111]: [107] shows the best CLIPSIM value. [5] shows the best FID value.

    Generate videos by randomly selecting one out of every 20 sentences on MSRVTT [16], [43], [45], [51], [103], [203], [204]: [203] has the best CLIPSIM value and [204] has the best FVD value.

    Calculate FVD with a sample size of 2048 [4], [44], [86], [98]: On the MSRVTT dataset, [86] has the smallest FVD. On the UCF-101 dataset, the FVD of [44] is the minimum.

    Calculate FVD with a sample size of 10000 [45], [202]: [202] achieved minimal FVD on both the MSRVTT dataset and the UCF-101 dataset.

    For readers interested in the experimental settings, we recommend checking the original paper for detailed information.

    Quantitative relationships in video. When a fixed number of objects is specified in the prompt, it is sometimes incorrectly reflected in the generated video. For example, the prompt mentions that two people are present, but the generated video has only one person throughout, or it changes from two people to some other number of people.

    Causality of events. The model has difficulty understanding how actions and behaviors will drive events. An example would be coloring and painting a wall, but the wall color does not change.

    Object interactions. The model has trouble understanding the boundaries between objects and modeling their interactions. For example, after throwing a ball, the ball and the basket merge instead of being bounced off.

    Scale and proportionality. The model experiences difficulty understanding the relationship between scale size and proportion of different objects in different parts of the scene. For example, one person in the same scene is particularly short while another is particularly tall.

    Object illusion. The objects generated by the model are unstable, appearing or disappearing suddenly in the video.

    Large-scale open-source T2V datasets. Although many datasets have been proposed recently, the number is insufficient for the model to learn. Also, the quality of the videos and texts in the datasets needs to be continuously improved so that the model performance can be further enhanced. It is also essential to open source collected datasets, which can effectively accelerate the progress of the research.

Efficient training methods and model architecture. Training a T2V model takes a great deal of computation and time. More efficient training methods and architectures reduce the time required for inference and lower the hardware requirements, which can significantly facilitate the application of such models.

Comprehensive metrics for evaluation. While the recent EvalCrafter [17] and FETV [18] have largely filled this gap, the newly proposed metrics are expected to be included in future method comparisons.

    Abstract text generation. Existing T2V generation methods all assume the input text is concrete, which is not always practical in the real world. For abstract words or abstract sentences, it is difficult for the model to generate well, and the quality of the generated video will drop significantly. For example, the prompt is “Hard work is a virtue”. However, such a demand is reasonable because people think abstractly, and abstract ideas can be challenging to describe. We hope that the results generated by the model can conform to our abstract thinking or help our abstract thinking to become more concrete.

Long video generation. Most of the research works mentioned in this survey can only generate short videos of about 2 seconds (16 frames), which limits their applications. If long videos of relatively high quality can be generated, T2V generation will have excellent application prospects.

In this article, we present a thorough survey of text-to-video generation techniques and systematically categorize methods into 1) VAE-based approaches, 2) GAN-based approaches, 3) auto-regressive transformer-based approaches, 4) diffusion-based approaches, and 5) T2I-for-video-generation approaches. This survey comprehensively reviews nearly one hundred representative T2V generation approaches and includes the latest method published in July 2024. In addition, we introduce 40 video datasets, 20 evaluation metrics, and available open-source T2V models, making it easy for readers who would like to work on T2V generation research. Furthermore, we report comparative performance evaluations. Finally, we discuss challenges and future trends that move the field forward.

    This work was supported by the National Natural Science Foundation of China No.62206123, and a research gift from the Office of the Cyberspace Administration of Shenzhen Municipal Committee of the Communist Party of China.

  • [1]
    A. Ramesh, M. Pavlov, G. Goh, et al., “Zero-shot text-to-image generation,” in Proceedings of the 38th International Conference on Machine Learning, virtual event, pp. 8821–8831, 2021.
    [2]
    A. Ramesh, P. Dhariwal, A. Nichol, et al., “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint, arXiv: 2204.06125, 2022.
    [3]
    M. Ding, W. D. Zheng, W. Y. Hong, et al., “CogView2: Faster and better text-to-image generation via hierarchical transformers,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 1229, 2022.
    [4]
    W. Y. Hong, M. Ding, W. D. Zheng, et al., “CogVideo: Large-scale pretraining for text-to-video generation via transformers,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–24, 2023.
    [5]
    U. Singer, A. Polyak, T. Hayes, et al., “Make-a-video: Text-to-video generation without text-video data,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–16, 2023.
    [6]
    J. Ho, T. Salimans, A. Gritsenko, et al., “Video diffusion models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 628, 2022.
    [7]
    M. J. Chen, X. Tan, B. H. Li, et al., “AdaSpeech: Adaptive text to speech for custom voice,” in Proceedings of the 9th International Conference on Learning Representations, virtual event, pp. 1–10, 2021.
    [8]
    Y. Zhang, R. J. Weiss, H. G. Zen, et al., “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 2080–2084, 2019.
    [9]
    D. Paul, M. P. Shifas, Y. Pantazis, et al., “Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, pp. 1361–1365, 2020.
    [10]
    Y. Wang, Y. N. He, Y. Z. Li, et al., “InternVid: A large-scale video-text dataset for multimodal understanding and generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–25, 2024.
    [11]
    T. Brooks, B. Peebles, C. Holmes, et al., “Video generation models as world simulators,” Available at: https://openai.com/research/video-generation-models-as-world-simulators, 2024-02-15.
    [12]
    A. Singh, “A survey of AI text-to-image and AI text-to-video generators,” in Proceedings of 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC), Cairo, Egypt, pp. 32–36, 2023.
    [13]
    J. Cho, F. D. Puspitasari, S. Zheng, et al., “Sora as an AGI world model? A complete survey on text-to-video generation,” arXiv preprint, arXiv: 2403.05131, 2024.
    [14]
    Z. Xing, Q. J. Feng, H. R. Chen, et al., “A survey on video diffusion models,” ACM Computing Surveys, vol. 57, no. 2, article no. 41, 2025. DOI: 10.1145/3696415
    [15]
    R. Sun, Y. M. Zhang, T. Shah, et al., “From Sora what we can see: A survey of text-to-video generation,” arXiv preprint, arXiv: 2405.10674, 2024.
    [16]
W. J. Wang, H. Yang, Z. X. Tuo, et al., “VideoFactory: Swap attention in spatiotemporal diffusions for text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–30, 2024.
    [17]
    Y. F. Liu, X. D. Cun, X. B. Liu, et al., “EvalCrafter: Benchmarking and evaluating large video generation models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 22139–22149, 2024.
    [18]
    Y. X. Liu, L. Li, S. H. Ren, et al., “FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2723, 2023.
    [19]
    A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, virtual event, pp. 8748–8763, 2021.
    [20]
    J. Devlin, M. W. Chang, K. Lee, et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186, 2019.
    [21]
    C. Raffel, N. Shazeer, A. Roberts, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 1, article no. 140, 2020.
    [22]
    H. Touvron, T. Lavril, G. Izacard, et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint, arXiv: 2302.13971, 2023.
    [23]
D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, pp. 1–14, 2014.
    [24]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020. DOI: 10.1145/3422622
    [25]
    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no. 574, 2020.
    [26]
    G. Mittal, T. Marwah, and V. N. Balasubramanian, “Sync-DRAW: Automatic video generation using deep recurrent attentive architectures,” in Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, pp. 1096–1104, 2017.
    [27]
    C. F. Wu, L. Huang, Q. X. Zhang, et al., “GODIVA: Generating open-domain videos from natural descriptions,” arXiv preprint, arXiv: 2104.14806, 2021.
    [28]
    Y. W. Pan, Z. F. Qiu, T. Yao, et al., “To create what you tell: Generating videos from captions,” in Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, pp. 1789–1798, 2017.
    [29]
    Y. Balaji, M. R. Min, B. Bai, et al., “Conditional GAN with discriminative filter generation for text-to-video synthesis,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 1995–2001, 2019.
    [30]
    Y. T. Li, M. Min, D. H. Shen, et al., “Video generation from text,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp. 7065–7072, 2018.
    [31]
    K. L. Deng, T. Y. Fei, X. Huang, et al., “IRC-GAN: Introspective recurrent convolutional GAN for text-to-video generation,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 2216–2222, 2019.
    [32]
    D. Kim, D. Joo, and J. Kim, “TiVGAN: Text to image to video generation with step-by-step evolutionary generator,” IEEE Access, vol. 8, pp. 153113–153122, 2020. DOI: 10.1109/ACCESS.2020.3017881
    [33]
    Y. Li, Z. Gan, Y. Shen, et al., “StoryGAN: A sequential conditional GAN for story visualization,” in Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 6322–6331, 2019.
    [34]
    B. W. Li, “Word-level fine-grained story visualization,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 347–362, 2022.
    [35]
    C. F. Wu, J. Liang, L. Ji, et al., “NÜWA: Visual synthesis pre-training for neural visual world creation,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 720–736, 2022.
    [36]
    D. Kondratyuk, L. J. Yu, X. Y. Gu, et al., “VideoPoet: A large language model for zero-shot video generation,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, article no. 1005, 2024.
    [37]
    L. G. Han, J. Ren, H. Y. Lee, et al., “Show me what and tell me how: Video synthesis via multimodal conditioning,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 3605–3615, 2022.
    [38]
    H. Liu, W. Yan, M. Zaharia, et al., “World model on million-length video and language with Blockwise RingAttention,” arXiv preprint, arXiv: 2402.08268, 2024.
    [39]
    R. Villegas, M. Babaeizadeh, P. J. Kindermans, et al., “Phenaki: Variable length video generation from open domain textual descriptions,” in Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, pp. 1–14, 2023.
    [40]
    L. J. Yu, Y. Cheng, K. Sohn, et al., “MAGVIT: Masked generative video transformer,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 10459–10469, 2023.
    [41]
    X. F. Wang, Z. Zhu, G. Huang, et al., “WorldDreamer: Towards general world models for video generation via predicting masked tokens,” arXiv preprint, arXiv: 2401.09985, 2024.
    [42]
    Y. Q. He, T. Y. Yang, Y. Zhang, et al., “Latent video diffusion models for high-fidelity long video generation,” arXiv preprint, arXiv: 2211.13221, 2022.
    [43]
    J. X. Xiang, R. C. Huang, J. Zhang, et al., “VersVideo: Leveraging enhanced temporal diffusion models for versatile video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–19, 2024.
    [44]
    D. J. Zhang, J. Z. Wu, J. W. Liu, et al., “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, in press.
    [45]
    Y. H. Wang, X. Y. Chen, X. Ma, et al., “LaVie: High-quality video generation with cascaded latent diffusion models,” International Journal of Computer Vision, in press.
    [46]
    J. Y. Liang, Y. C. Fan, K. Zhang, et al., “MoVideo: Motion-aware video generation with diffusion model,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 56–74, 2024.
    [47]
    Z. Q. Yuan, Y. X. Liu, Y. H. Cao, et al., “Mora: Enabling generalist video generation via a multi-agent framework,” arXiv preprint, arXiv: 2403.13248, 2024.
    [48]
    Y. B. Zhang, Y. X. Wei, X. H. Lin, et al., “VideoElevator: Elevating video generation quality with versatile text-to-image diffusion models,” arXiv preprint, arXiv: 2403.05438, 2024.
    [49]
Z. X. Luo, D. Y. Chen, Y. Y. Zhang, et al., “Notice of Removal: VideoFusion: Decomposed diffusion models for high-quality video generation,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, pp. 10209–10218, 2023.
    [50]
H. J. Yuan, S. W. Zhang, X. Wang, et al., “InstructVideo: Instructing video diffusion models with human feedback,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 6463–6474, 2024.
    [51]
    S. J. Ma, H. Y. Xu, M. J. Li, et al., “POS: A prompts optimization suite for augmenting text-to-video generation,” arXiv preprint, arXiv: 2311.00949, 2023.
    [52]
    J. X. Gu, S. C. Wang, H. Y. Zhao, et al., “Reuse and diffuse: Iterative denoising for text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–16, 2024.
    [53]
    S. T. Su, J. Z. Liu, L. L. Gao, et al., “F³-pruning: A training-free and generalized pruning strategy towards faster and finer text-to-video synthesis,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4961–4969, 2024.
    [54]
    X. Wang, S. W. Zhang, H. Zhang, et al., “VideoLCM: Video latent consistency model,” arXiv preprint, arXiv: 2312.09109, 2023.
    [55]
    W. M. Weng, R. Y. Feng, Y. H. Wang, et al., “ART · V: Auto-regressive text-to-video generation with diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 2024, pp. 7395–7405.
    [56]
    H. Zhang, Z. X. Wu, Z. Xing, et al., “AdaDiff: Adaptive step selection for fast diffusion,” arXiv preprint, arXiv: 2311.14768, 2023.
    [57]
    I. Skorokhodov, W. Menapace, A. Siarohin, et al., “Hierarchical patch diffusion models for high-resolution video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 7569–7579, 2024.
    [58]
    Y. R. Yang, J. C. Zhang, Y. Deng, et al., “Mobius: A high efficient spatial-temporal parallel training paradigm for text-to-video generation task,” arXiv preprint, arXiv: 2407.06617, 2024.
    [59]
    H. Y. Lu, G. X. Yang, N. Y. Fei, et al., “VDT: General-purpose video diffusion transformers via mask modeling,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–28, 2024.
    [60]
    X. Ma, Y. H. Wang, G. Y. Jia, et al., “Latte: Latent diffusion transformer for video generation,” arXiv preprint, arXiv: 2401.03048, 2024.
    [61]
    A. Gupta, L. J. Yu, K. Sohn, et al., “Photorealistic video generation with diffusion models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 393–411, 2024.
    [62]
    W. Menapace, A. Siarohin, I. Skorokhodov, et al., “Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7038–7048, 2024.
    [63]
    X. Wang, S. W. Zhang, H. J. Yuan, et al., “A recipe for scaling up text-to-video generation with text-free videos,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 6572–6582, 2024.
    [64]
    H. X. Chen, Y. Zhang, X. D. Cun, et al., “VideoCrafter2: Overcoming data limitations for high-quality video diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7310–7320, 2024.
    [65]
    H. Li, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2024, pp. 8153–8163.
    [66]
    P. Esser, J. Chiu, P. Atighehchian, et al., “Structure and content-guided video synthesis with diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 7312–7322, 2023.
    [67]
    J. B. Xing, M. H. Xia, Y. Zhang, et al., “DynamiCrafter: Animating open-domain images with video diffusion priors,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 399–417, 2024.
    [68]
Z. W. Qing, S. W. Zhang, J. Y. Wang, et al., “Hierarchical spatio-temporal decoupling for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 6635–6645, 2024.
    [69]
    R. Q. Wu, L. Y. Chen, T. Yang, et al., “LAMP: Learn a motion pattern for few-shot video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7089–7098, 2024.
    [70]
    R. Zhao, Y. C. Gu, J. Z. Wu, et al., “MotionDirector: Motion customization of text-to-video diffusion models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 273–290, 2024.
    [71]
    S. H. Yuan, J. F. Huang, Y. J. Shi, et al., “MagicTime: Time-lapse video generation models as metamorphic simulators,” arXiv preprint, arXiv: 2404.05014, 2024.
    [72]
    Y. Tian, L. Yang, H. T. Yang, et al., “VideoTetris: Towards compositional text-to-video generation,” arXiv preprint, arXiv: 2406.04277, 2024.
    [73]
    Y. Jain, A. Nasery, V. Vineet, et al., “PEEKABOO: Interactive video generation via masked-diffusion,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8079–8088, 2024.
    [74]
Y. B. Zhang, Y. X. Wei, D. S. Jiang, et al., “ControlVideo: Training-free controllable text-to-video generation,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–21, 2024.
    [75]
    X. Wang, H. J. Yuan, S. W. Zhang, et al., “VideoComposer: Compositional video synthesis with motion controllability,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 334, 2023.
    [76]
    G. Y. Liu, M. H. Xia, Y. Zhang, et al., “StyleCrafter: Enhancing stylized text-to-video generation with style adapter,” arXiv preprint, arXiv: 2312.00330, 2023.
    [77]
    J. C. Zhu, H. Yang, W. J. Wang, et al., “MobileVidFactory: Automatic diffusion-based social media video generation for mobile devices from text,” in Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, pp. 9371–9373, 2023.
    [78]
    J. W. Wang, Y. C. Zhang, J. X. Zou, et al., “Boximator: Generating rich and controllable motions for video synthesis,” in Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, article no. 2142, 2024.
    [79]
    H. Lin, J. Cho, A. Zala, et al., “Ctrl-Adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model,” arXiv preprint, arXiv: 2404.09967, 2024.
    [80]
    H. He, Y. H. Xu, Y. W. Guo, et al., “CameraCtrl: Enabling camera control for text-to-video generation,” arXiv preprint, arXiv: 2404.02101, 2024.
    [81]
    P. Y. Ling, J. Z. Bu, P. Zhang, et al., “MotionClone: Training-free motion cloning for controllable video generation,” arXiv preprint, arXiv: 2406.05338, 2024.
    [82]
    Z. J. Duan, L. Z. You, C. Y. Wang, et al., “DiffSynth: Latent in-iteration deflickering for realistic video synthesis,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, Vilnius, Lithuania, pp. 332–347, 2024.
    [83]
    B. H. Liu, X. Liu, A. B. Dai, et al., “Dual-stream diffusion net for text-to-video generation,” arXiv preprint, arXiv: 2308.08316, 2023.
    [84]
    H. Fei, S. Q. Wu, W. Ji, et al., “Dysen-VDM: Empowering dynamics-aware text-to-video diffusion with LLMs,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7641–7653, 2024.
    [85]
    H. Lin, A. Zala, J. Cho, et al., “VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–25, 2024.
    [86]
    L. Lian, B. F. Shi, A. Yala, et al., “LLM-grounded video diffusion models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–21, 2024.
    [87]
    Y. M. Jiang, S. Yang, T. L. Koh, et al., “Text2Performer: Text-driven human video generation,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, pp. 22690–22700, 2023.
    [88]
    M. J. Yang, Y. L. Du, B. Dai, et al., “Probabilistic adaptation of text-to-video models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–17, 2024.
    [89]
    X. F. Li, Y. F. Zhang, and X. Q. Ye, “DrivingDiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 469–485, 2024.
    [90]
    S. M. Yin, C. F. Wu, H. Yang, et al., “NUWA-XL: Diffusion over diffusion for extremely long video generation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 1309–1320, 2023.
    [91]
    X. Y. Chen, Y. H. Wang, L. J. Zhang, et al., “Seine: Short-to-long video diffusion model for generative transition and prediction,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–15, 2024.
    [92]
    G. Oh, J. Jeong, S. Kim, et al., “MTVG: Multi-event video generation with text-to-video models,” in Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 401–418, 2024.
    [93]
H. N. Qiu, M. H. Xia, Y. Zhang, et al., “FreeNoise: Tuning-free longer video diffusion via noise rescheduling,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–15, 2024.
    [94]
    F. Y. Wang, W. S. Chen, G. L. Song, et al., “Gen-L-Video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint, arXiv: 2305.18264, 2023.
    [95]
    R. Henschel, L. Khachatryan, D. Hayrapetyan, et al., “StreamingT2V: Consistent, dynamic, and extendable long video generation from text,” in Proceedings of the 13th International Conference on Learning Representations, Singapore, Singapore, pp. 1–28, 2025.
    [96]
    S. B. Zhuang, K. C. Li, X. Y. Chen, et al., “Vlogger: Make your dream a vlog,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8806–8817, 2024.
    [97]
    J. Kim, J. Kang, J. Choi, et al., “FIFO-diffusion: Generating infinite videos from text without training,” arXiv preprint, arXiv: 2405.11473, 2024.
    [98]
    J. N. Wang, H. J. Yuan, D. Y. Chen, et al., “ModelScope text-to-video technical report,” arXiv preprint, arXiv: 2308.06571, 2023.
    [99]
    A. Blattmann, R. Rombach, H. Ling, et al., “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 22563–22575, 2023.
    [100]
    J. Ho, W. Chan, C. Saharia, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint, arXiv: 2210.02303, 2022.
    [101]
    A. Blattmann, T. Dockhorn, S. Kulal, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint, arXiv: 2311.15127, 2023.
    [102]
D. Q. Zhou, W. M. Wang, H. S. Yan, et al., “MagicVideo: Efficient video generation with latent diffusion models,” arXiv preprint, arXiv: 2211.11018, 2022.
    [103]
    Z. Xing, Q. Dai, H. Hu, et al., “SimDA: Simple diffusion adapter for efficient video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7827–7839, 2024.
    [104]
    W. M. Wang, J. W. Liu, Z. J. Lin, et al., “MagicVideo-V2: Multi-stage high-aesthetic video generation,” arXiv preprint, arXiv: 2401.04468, 2024.
    [105]
    T. Lee, S. Kwon, and T. Kim, “Grid diffusion models for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8734–8743, 2024.
    [106]
    O. Bar-Tal, H. Chefer, O. Tov, et al., “Lumiere: A space-time diffusion model for video generation,” in Proceedings of the SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan, article no. 94, 2024.
    [107]
    X. Li, W. Q. Chu, Y. Wu, et al., “VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation,” arXiv preprint, arXiv: 2309.00398, 2023.
    [108]
    S. W. Ge, S. Nah, G. L. Liu, et al., “Preserve your own correlation: A noise prior for video diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 22873–22884, 2023.
    [109]
    L. Khachatryan, A. Movsisyan, V. Tadevosyan, et al., “Text2Video-zero: Text-to-image diffusion models are zero-shot video generators,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 15908–15918, 2023.
    [110]
    J. Z. Wu, Y. X. Ge, X. T. Wang, et al., “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 7589–7599, 2023.
    [111]
    J. An, S. Y. Zhang, H. Yang, et al., “Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation,” arXiv preprint, arXiv: 2304.08477, 2023.
    [112]
    S. Hong, J. Seo, H. Shin, et al., “DirecT2V: Large language models are frame-level directors for zero-shot text-to-video generation,” arXiv preprint, arXiv: 2305.14330, 2024.
    [113]
    H. Z. Huang, Y. F. Feng, C. Shi, et al., “Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 1138, 2023.
    [114]
    Y. Lu, L. C. Zhu, H. H. Fan, et al., “FlowZero: Zero-shot text-to-video synthesis with LLM-driven dynamic scene syntax,” arXiv preprint, arXiv: 2311.15813, 2023.
    [115]
    J. Lv, Y. Huang, M. Yan, et al., “GPT4motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, pp. 1430–1440, 2024.
    [116]
    X. Y. Shi, Z. Y. Huang, F. Y. Wang, et al., “Motion-I2V: Consistent and controllable image-to-video generation with explicit motion modeling,” in Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, article no. 111, 2024.
    [117]
    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 6309–6318, 2017.
    [118]
    A. Nagrani, J. S. Chung, W. Xie, et al., “Voxceleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, article no. 101027, 2020. DOI: 10.1016/j.csl.2019.101027
    [119]
    R. Rombach, A. Blattmann, D. Lorenz, et al., “High-resolution image synthesis with latent diffusion models,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 10674–10685, 2022.
    [120]
    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Munich, Germany, pp. 234–241, 2015.
    [121]
    Y. Song, P. Dhariwal, M. Chen, et al., “Consistency models,” in Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, article no. 1335, 2023.
    [122]
    T. Chen and L. L. Li, “FIT: Far-reaching interleaved transformers,” arXiv preprint, arXiv: 2305.12689, 2023.
    [123]
    W. Peebles and S. N. Xie, “Scalable diffusion models with transformers,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 4172–4182, 2023.
    [124]
    OpenAI, J. Achiam, S. Adler, et al., “GPT-4 technical report,” arXiv preprint, arXiv: 2303.08774, 2023.
    [125]
    L. M. Zhang, A. Y. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 3813–3824, 2023.
    [126]
    C. Saharia, W. Chan, S. Saxena, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2643, 2022.
    [127]
    Y. Balaji, S. Nah, X. Huang, et al., “eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint, arXiv: 2211.01324, 2023.
    [128]
    Y. Ma, Y. He, X. Cun, X. Wang, et al., “Follow your pose: Pose-guided text-to-video generation using pose-free videos,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4117–4125, 2024.
    [129]
    B. Peng, X. Y. Chen, Y. H. Wang, et al., “ConditionVideo: Training-free condition-guided video generation,” in Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 4459–4467, 2024.
    [130]
    F. Y. Shi, J. X. Gu, H. Xu, et al., “BIVDiff: A training-free framework for general-purpose video synthesis via bridging image and video diffusion models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 7393–7402, 2024.
    [131]
    Y. W. Guo, C. Y. Yang, A. Y. Rao, et al., “AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–19, 2024.
    [132]
    M. Bain, A. Nagrani, G. Varol, et al., “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 1708–1718, 2021.
    [133]
    A. Miech, D. Zhukov, J. B. Alayrac, et al., “HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 2630–2640, 2019.
    [134]
    J. C. Stroud, Z. C. Lu, C. Sun, et al., “Learning video representations from textual web supervision,” arXiv preprint, arXiv: 2007.14937, 2020.
    [135]
    A. Nagrani, P. H. Seo, B. Seybold, et al., “Learning audio-video modalities from image captions,” in Proceedings of the 17th European Conference on Computer Vision – ECCV 2022, Tel Aviv, Israel, pp. 407–426, 2022.
    [136]
    P. Sharma, N. Ding, S. Goodman, et al., “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2556–2565, 2018.
    [137]
J. N. Li, D. X. Li, C. M. Xiong, et al., “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, pp. 12888–12900, 2022.
    [138]
    X. Y. Huang, Y. C. Zhang, J. Y. Ma, et al., “Tag2Text: Guiding vision-language model via image tagging,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–20, 2024.
    [139]
    K. C. Li, Y. N. He, Y. Wang, et al., “VideoChat: Chat-centric video understanding,” arXiv preprint, arXiv: 2305.06355, 2023.
    [140]
    T. S. Chen, A. Siarohin, W. Menapace, et al., “Panda-70M: Captioning 70M videos with multiple cross-modality teachers,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 13320–13331, 2024.
    [141]
    H. Zhang, X. Li, and L. D. Bing, “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” in Proceedings of 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Singapore, Singapore, pp. 543–553, 2023.
    [142]
    D. Y. Zhu, J. Chen, X. Q. Shen, et al., “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–17, 2024.
    [143]
    K. C. Li, Y. L. Wang, Y. Z. Li, et al., “Unmasked teacher: Towards training-efficient video foundation models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 19891–19903, 2023.
    [144]
    W. H. Wang and Y. Yang, “VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models,” arXiv preprint, arXiv: 2403.06098, 2024.
    [145]
    Z. Y. Yang, L. J. Li, K. Lin, et al., “The dawn of LMMs: Preliminary explorations with GPT-4V(ision),” arXiv preprint, arXiv: 2309.17421, 2023.
    [146]
    D. L. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, pp. 190–200, 2011.
    [147]
    J. Xu, T. Mei, T. Yao, et al., “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 5288–5296, 2016.
    [148]
    L. A. Hendricks, O. Wang, E. Shechtman, et al., “Localizing moments in video with natural language,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 5804–5813, 2017.
    [149]
    A. Rohrbach, A. Torabi, M. Rohrbach, et al., “Movie description,” International Journal of Computer Vision, vol. 123, no. 1, pp. 94–120, 2017. DOI: 10.1007/s11263-016-0987-1
    [150]
    R. Krishna, K. Hata, F. Ren, et al., “Dense-captioning events in videos,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 706–715, 2017.
    [151]
    L. W. Zhou, C. L. Xu, and J. Corso, “Towards automatic learning of procedures from web instructional videos,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, pp. 7590–7598, 2018.
    [152]
    R. Sanabria, O. Caglayan, S. Palaskar, et al., “How2: A large-scale dataset for multimodal language understanding,” arXiv preprint, arXiv: 1811.00347, 2018.
    [153]
    X. Wang, J. W. Wu, J. K. Chen, et al., “VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research,” in Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 4580–4590, 2019.
    [154]
    R. Zellers, X. M. Lu, J. Hessel, et al., “MERLOT: Multimodal neural script knowledge models,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, article no. 1810, 2021.
    [155]
    H. Reynaud, M. Y. Qiao, M. Dombrowski, et al., “Feature-conditioned cascaded video diffusion models for precise echocardiogram synthesis,” in Proceedings of the 26th International Conference on Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, Vancouver, BC, Canada, pp. 142–152, 2023.
    [156]
    T. Wang, L. J. Li, K. Lin, et al., “DisCo: Disentangled control for realistic human dance generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 9326–9336, 2024.
    [157]
    H. W. Xue, T. K. Hang, Y. H. Zeng, et al., “Advancing high-resolution video-language representation with large-scale video transcriptions,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 5026–5035, 2022.
    [158]
    J. H. Yu, H. Zhu, L. M. Jiang, et al., “CelebV-Text: A large-scale facial text-video dataset,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 14805–14814, 2023.
    [159]
    X. Ju, “Mira: A mini-step towards Sora-like long video generation,” https://github.com/mira-space.
    [160]
    K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint, arXiv: 1212.0402, 2012.
    [161]
    W. Kay, J. Carreira, K. Simonyan, et al., “The kinetics human action video dataset,” arXiv preprint, arXiv: 1705.06950, 2017.
    [162]
    R. Goyal, S. E. Kahou, V. Michalski, et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 5843–5851, 2017.
    [163]
    J. Pont-Tuset, F. Perazzi, S. Caelles, et al., “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint, arXiv: 1704.00675, 2017.
    [164]
    C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, pp. 32–36, 2004.
    [165]
    N. Aifanti, C. Papachristou, and A. Delopoulos, “The MUG facial expression database,” in Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Desenzano del Garda, Italy, pp. 1–4, 2010.
    [166]
    M. Cordts, M. Omran, S. Ramos, et al., “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 3213–3223, 2016.
    [167]
    N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 843–852, 2015.
    [168]
    J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 4724–4733, 2017.
    [169]
    F. Ebert, C. Finn, A. X. Lee, et al., “Self-supervised visual planning with temporal skip connections,” in Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, pp. 344–356, 2017.
    [170]
    W. Xiong, W. H. Luo, L. Ma, et al., “Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 2364–2373, 2018.
    [171]
    J. Carreira, E. Noland, A. Banki-Horvath, et al., “A short note about kinetics-600,” arXiv preprint, arXiv: 1808.01340, 2018.
    [172]
    M. Monfort, A. Andonian, B. L. Zhou, et al., “Moments in time dataset: One million videos for event understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 502–508, 2020. DOI: 10.1109/TPAMI.2019.2901464
    [173]
    A. Siarohin, S. Lathuilière, S. Tulyakov, et al., “First order motion model for image animation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, article no. 641, 2019.
    [174]
    W. Liu, Z. Piao, Z. Tu, et al., “Liquid warping GAN with attention: A unified framework for human image synthesis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5114–5133, 2022. DOI: 10.1109/TPAMI.2021.3078270
    [175]
    F. Ebert, Y. L. Yang, K. Schmeckpeper, et al., “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” in Proceedings of the Robotics: Science and Systems 2022, New York City, NY, USA, 2022.
    [176]
    T. Brooks, J. Hellsten, M. Aittala, et al., “Generating long videos of dynamic scenes,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2303, 2022.
    [177]
    T. Unterthiner, S. van Steenkiste, K. Kurach, et al., “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint, arXiv: 1812.01717, 2018.
    [178]
    M. Saito, S. Saito, M. Koyama, et al., “Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN,” International Journal of Computer Vision, vol. 128, no. 10, pp. 2586–2606, 2020. DOI: 10.1007/s11263-020-01333-y
    [179]
    M. Heusel, H. Ramsauer, T. Unterthiner, et al., “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 6629–6640, 2017.
    [180]
    C. Szegedy, W. Liu, Y. Q. Jia, et al., “Going deeper with convolutions,” in Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 1–9, 2015.
    [181]
    J. Deng, W. Dong, R. Socher, et al., “ImageNet: A large-scale hierarchical image database,” in Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, pp. 248–255, 2009.
    [182]
    D. Tran, L. Bourdev, R. Fergus, et al., “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 4489–4497, 2015.
    [183]
    H. N. Wu, E. L. Zhang, L. Liao, et al., “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 20087–20097, 2023.
    [184]
    D. Podell, Z. English, K. Lacey, et al., “SDXL: Improving latent diffusion models for high-resolution image synthesis,” in Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, pp. 1–13, 2024.
    [185]
    J. N. Li, D. X. Li, S. Savarese, et al., “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, pp. 19730–19742, 2023.
    [186]
    K. Papineni, S. Roukos, T. Ward, et al., “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 311–318, 2002.
    [187]
    D. Klakow and J. Peters, “Testing the correlation of word error rate and perplexity,” Speech Communication, vol. 38, no. 1-2, pp. 19–28, 2002. DOI: 10.1016/S0167-6393(01)00041-3
    [188]
    Y. P. Sun, Z. H. Ni, C. K. Chng, et al., “ICDAR 2019 competition on large-scale street view text with partial labeling – RRC-LSVT,” in Proceedings of 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, pp. 1557–1562, 2019.
    [189]
    A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition,” in Proceedings of the 8th International Conference on Spoken Language Processing, Jeju Island, Korea, 2004.
    [190]
    S. I. Serengil and A. Ozpinar, “Hyperextended lightface: A facial attribute analysis framework,” in Proceedings of 2021 International Conference on Engineering and Emerging Technologies (ICEET), Istanbul, Turkey, pp. 1–4, 2021.
    [191]
    MMAction2 Contributors, “OpenMMLab’s next generation video understanding toolbox and benchmark,” https://github.com/open-mmlab/mmaction2, 2020.
    [192]
    Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in Proceedings of the 16th European Conference on Computer Vision – ECCV 2020, Glasgow, UK, pp. 402–419, 2020.
    [193]
    M. Otani, R. Togashi, Y. Sawai, et al., “Toward verifiable and reproducible human evaluation for text-to-image generation,” in Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, pp. 14277–14286, 2023.
    [194]
    G. Parmar, R. Zhang, and J. Y. Zhu, “On aliased resizing and surprising subtleties in GAN evaluation,” in Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 11400–11410, 2022.
    [195]
    C. Saharia, W. Chan, S. Saxena, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, article no. 2643, 2022.
    [196]
    Z. Q. Huang, Y. N. He, J. S. Yu, et al., “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 21807–21818, 2024.
    [197]
    T. Y. Lin, M. Maire, S. Belongie, et al., “Microsoft COCO: Common objects in context,” in Proceedings of the 13th European Conference on Computer Vision – ECCV 2014, Zurich, Switzerland, pp. 740–755, 2014.
    [198]
    J. Cho, A. Zala, and M. Bansal, “DALL-EVAL: Probing the reasoning skills and social biases of text-to-image generation models,” in Proceedings of 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, pp. 3020–3031, 2023.
    [199]
    D. F. Jiang, M. Ku, T. L. Li, et al., “GenAI arena: An open evaluation platform for generative models,” in Proceedings of the 38th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, pp. 1–20, 2024.
    [200]
    J. H. Zar, “Spearman rank correlation,” in Encyclopedia of Biostatistics, 2nd ed., P. Armitage and T. Colton, Eds. John Wiley & Sons, Ltd, Chichester, 2005.
    [201]
    M. G. Kendall, Rank Correlation Methods. Griffin, London, UK, 1948.
    [202]
    Y. H. Wang, J. M. Bao, W. M. Weng, et al., “MicroCinema: A divide-and-conquer approach for text-to-video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8414–8424, 2024.
    [203]
    Y. Zeng, G. Q. Wei, J. N. Zheng, et al., “Make pixels dance: High-dynamic video generation,” in Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 8850–8860, 2024.
    [204]
    L. D. Ruan, L. Tian, C. W. Huang, et al., “UniVG: Towards unified-modal video generation,” arXiv preprint, arXiv: 2401.09084, 2024.
    [205]
    X. Zhang, Z. X. Wu, Z. J. Weng, et al., “VideoLT: Large-scale long-tailed video recognition,” in Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 7940–7949, 2021.
    [206]
    V. Arkhipkin, Z. Shaheen, V. Vasilev, et al., “FusionFrames: Efficient architectural aspects for text-to-video generation pipeline,” arXiv preprint, arXiv: 2311.13073, 2023.