Recent advancements in generative models for text-to-image (T2I) tasks have led to impressive results in producing high-resolution, realistic images from textual prompts. However, extending this capability to text-to-video (T2V) models poses challenges due to the complexities introduced by motion. Current T2V models face limitations in video duration, visual quality, and realistic motion generation, primarily due to the difficulty of modeling natural motion, high memory and compute requirements, and the need for extensive training data.
State-of-the-art T2I diffusion models excel at synthesizing high-resolution, photo-realistic images from complex text prompts and offer versatile image editing capabilities. However, extending these advancements to large-scale T2V models is complicated by the added dimension of motion. Existing T2V models often employ a cascaded design, in which a base model generates distant keyframes and subsequent temporal super-resolution (TSR) models fill in the frames between them, an approach that makes globally coherent motion difficult to achieve.
Researchers from Google Research, Weizmann Institute, Tel-Aviv University, and Technion present Lumiere, a novel text-to-video diffusion model addressing the challenge of realistic, diverse, and coherent motion synthesis. They introduce a Space-Time U-Net architecture that uniquely generates the entire temporal duration of a video in a single pass, contrasting with existing models that synthesize distant keyframes followed by temporal super-resolution. By incorporating spatial and temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, Lumiere achieves state-of-the-art text-to-video results, efficiently supporting various content creation and video editing tasks.
Employing a Space-Time U-Net architecture, Lumiere downsamples the signal in both space and time and processes the entire clip at this compact space-time representation, generating the full video at a coarse resolution in one pass. Temporal blocks with factorized space-time convolutions and attention mechanisms keep computation tractable. The model builds on a pre-trained text-to-image architecture, reusing its weights to maintain visual quality and text alignment. For spatial super-resolution, MultiDiffusion is applied over overlapping temporal segments, ensuring smooth transitions between them while staying within memory constraints.
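The two ideas above can be illustrated with a short sketch: a factorized space-time convolution (a spatial 2D conv per frame followed by a temporal 1D conv per pixel location, approximating a full 3D conv at lower cost) and a MultiDiffusion-style blend that averages overlapping temporal windows into one sequence. All class, function, and tensor names here are illustrative assumptions, not Lumiere's actual implementation.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeConv(nn.Module):
    # Hypothetical factorized space-time convolution: a (1, k, k) spatial
    # conv followed by a (k, 1, 1) temporal conv, instead of one costly
    # full 3D (k, k, k) convolution.
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width)
        return self.temporal(self.spatial(x))

def blend_temporal_windows(windows, starts, total_frames):
    # MultiDiffusion-style blending: overlapping temporal segments are
    # accumulated and averaged so the stitched result transitions smoothly.
    # windows: list of tensors shaped (channels, window_frames, H, W)
    acc = torch.zeros(windows[0].shape[0], total_frames, *windows[0].shape[2:])
    count = torch.zeros(total_frames)
    for w, start in zip(windows, starts):
        t = w.shape[1]
        acc[:, start:start + t] += w
        count[start:start + t] += 1
    return acc / count.view(1, -1, 1, 1)

block = FactorizedSpaceTimeConv(8)
x = torch.randn(1, 8, 16, 32, 32)  # batch=1, 8 channels, 16 frames, 32x32
print(block(x).shape)  # shape is preserved

# Two 8-frame windows starting at frames 0 and 4 cover a 12-frame clip.
windows = [torch.ones(4, 8, 16, 16), torch.ones(4, 8, 16, 16)]
video = blend_temporal_windows(windows, starts=[0, 4], total_frames=12)
print(video.shape)
```

Factorizing the convolution is what makes full-clip processing affordable: the temporal kernel only mixes information across frames at each pixel, so the cost grows linearly rather than multiplicatively with kernel size in space and time.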
Lumiere surpasses existing models in video synthesis. Trained on a dataset of 30M 80-frame videos, Lumiere outperforms ImagenVideo, AnimateDiff, and ZeroScope in qualitative and quantitative evaluations. With competitive Fréchet Video Distance and Inception Score in zero-shot testing on UCF101, Lumiere demonstrates superior motion coherence, generating 5-second videos at higher quality. User studies confirm that Lumiere is preferred over various baselines, including commercial models, highlighting its strength in visual quality and alignment with text prompts.
To sum up, the researchers from Google Research and other institutes have introduced Lumiere, an innovative text-to-video generation framework based on a pre-trained text-to-image diffusion model. They addressed the limitation of globally coherent motion in existing models by proposing a space-time U-Net architecture. This design, incorporating spatial and temporal down- and up-sampling, enables the direct generation of full-frame-rate video clips. The demonstrated state-of-the-art results highlight the versatility of the approach for various applications, such as image-to-video, video inpainting, and stylized generation.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.