LTX-2: Lightricks' Open-Source 4K Audio-Video Foundation Model

LTX-2 is the first open-source DiT-based audio-video foundation model generating synchronized 4K audio and video at up to 50fps on consumer GPUs.


The generative AI landscape has been transformed by diffusion models for images and, more recently, for video. But generating video that sounds as good as it looks has remained a stubbornly separate problem – until now. LTX-2 changes that equation entirely.

Developed by Lightricks, the company behind the popular creative tools Facetune and LTX Studio, LTX-2 is the first open-source Diffusion Transformer (DiT) based audio-video foundation model capable of generating synchronized 4K audio-video content at up to 50 frames per second. Unlike previous approaches that required stitching together separate video and audio generation pipelines, LTX-2 produces both modalities simultaneously, with the audio naturally aligned to the visual content.

Running on consumer GPUs – a deliberate design goal – LTX-2 brings professional-grade video generation to individual creators, small studios, and researchers. The model has already garnered significant attention on GitHub and in the AI research community for its combination of quality, speed, and openness.

This guide explores the LTX-2 architecture, supported pipelines, hardware requirements, and how it compares to other video generation models.


What Makes LTX-2 Architecturally Distinct?

LTX-2 is built on a Diffusion Transformer (DiT) architecture, which replaces the traditional U-Net backbone used in models like Stable Diffusion with a transformer-based design. This architectural choice brings several advantages:

| Feature | LTX-2 (DiT-based) | Traditional U-Net Models |
|---|---|---|
| Audio-Video Sync | Native joint generation | Separate pipelines |
| Resolution Scaling | Scales to 4K | Typically limited to 1080p |
| Frame Rate | Up to 50fps | Typically 24-30fps |
| Temporal Coherence | Transformer attention across frames | Temporal layers bolted on |
| Consumer GPU Support | Yes (16-24 GB VRAM) | Varies widely |

The DiT architecture processes video as a sequence of spatiotemporal patches, allowing the transformer’s self-attention mechanism to learn long-range temporal dependencies that are essential for coherent video generation.
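To make the patch idea concrete, here is a toy sketch of splitting a video tensor into flattened spatiotemporal patches, the token sequence a DiT attends over. The patch sizes are illustrative assumptions, not LTX-2's actual configuration:

```python
import numpy as np

def patchify(video: np.ndarray, pt: int = 2, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) video into flattened spatiotemporal patch tokens.

    Patch sizes (pt, ph, pw) are illustrative, not LTX-2's real values.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    return (
        video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch-grid axes first
             .reshape(-1, pt * ph * pw * C)    # one row per patch token
    )

video = np.zeros((8, 64, 64, 3), dtype=np.float32)  # tiny 8-frame clip
tokens = patchify(video)
print(tokens.shape)  # (64, 1536): 4x4x4 patches, each 2*16*16*3 values
```

Self-attention over this single sequence is what lets every patch attend to every other patch across both space and time, rather than relying on separately added temporal layers.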

What Pipelines Does LTX-2 Support?

LTX-2 is not limited to a single generation mode. It supports multiple input-conditioned pipelines:

Text-to-Video

The foundational pipeline. Generate a video and synchronized audio from a text prompt:

"A cinematic drone shot flying over a misty mountain range at sunrise, birds chirping, wind howling"

LTX-2 produces both the visual sequence and the corresponding ambient audio.

Image-to-Video

Animate a static image with generated motion and context-appropriate sound. This pipeline is particularly useful for bringing photographs to life with subtle motion and environmental audio.

Video-to-Video

Apply style transfers, modifications, or extensions to existing video footage. The audio pipeline adapts to the modified visual content.

Audio-to-Video

Generate video that matches a provided audio track. This pipeline enables music video production, lip-sync content, and sound-driven animations.

| Pipeline | Input | Output Resolution | Typical Generation Time (24 GB GPU) |
|---|---|---|---|
| Text-to-Video | Text prompt | Up to 4K | 2-5 minutes |
| Image-to-Video | Image + optional text | Up to 4K | 1-4 minutes |
| Video-to-Video | Video + style prompt | Up to 4K | 3-8 minutes |
| Audio-to-Video | Audio track + text | Up to 1080p | 2-6 minutes |

How Does Audio-Video Synchronization Work?

The most impressive technical achievement in LTX-2 is its native audio-video synchronization. Previous video generation models treated audio as an afterthought – generate the video, then add audio through a separate model or manual post-processing. LTX-2 generates both modalities from the same latent representation:

The shared latent representation ensures that the audio and video share the same temporal structure. When the video shows a door slamming, the audio naturally produces a slamming sound at the exact same frame. This alignment emerges from training, not from post-processing heuristics.
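A toy sketch can illustrate why a shared latent forces alignment. Below, one latent sequence with a single time axis is decoded by two heads; because both heads index the same latent frame, the audio for frame *t* always lands at the same position in the waveform. The shapes and linear "decoders" are illustrative assumptions, not LTX-2's real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 32                                   # latent frames, latent channels
latent = rng.standard_normal((T, D))            # one shared latent sequence

W_video = rng.standard_normal((D, 8 * 8 * 3))   # toy per-frame 8x8 RGB head
W_audio = rng.standard_normal((D, 160))         # toy head: 160 samples per frame

video = (latent @ W_video).reshape(T, 8, 8, 3)  # T frames of pixels
audio = (latent @ W_audio).reshape(T * 160)     # one contiguous waveform

# An event in latent frame t appears in video[t] and in audio samples
# [t*160, (t+1)*160) by construction - no post-hoc alignment step.
print(video.shape, audio.shape)
```

In the real model the alignment is learned rather than hard-wired, but the structural point is the same: both modalities inherit one temporal grid from the shared representation.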

What Hardware Do You Need to Run LTX-2?

Lightricks designed LTX-2 with consumer-grade hardware as a target. Here are the practical requirements:

| Generation Quality | Minimum VRAM | Recommended VRAM | GPU Examples |
|---|---|---|---|
| 480p (SD) | 8 GB | 12 GB | RTX 3060, RTX 4060 |
| 1080p (Full HD) | 12 GB | 16 GB | RTX 4070 Ti, RTX 4080 |
| 4K (Ultra HD) | 16 GB | 24 GB | RTX 4090, RTX 5090 |

These requirements are substantially lower than competing models like Sora or Kling, which typically require datacenter GPUs for high-resolution generation.
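For a quick pre-flight check, the VRAM guidance above can be encoded as a small lookup. This helper is written for this article, not part of any official Lightricks tooling:

```python
# VRAM guidance from the table above (article values, not an official API).
VRAM_GUIDE = {
    "480p":  {"min_gb": 8,  "recommended_gb": 12},
    "1080p": {"min_gb": 12, "recommended_gb": 16},
    "4k":    {"min_gb": 16, "recommended_gb": 24},
}

def can_generate(resolution: str, vram_gb: float) -> bool:
    """Return True if a GPU meets the minimum VRAM for the target resolution."""
    return vram_gb >= VRAM_GUIDE[resolution.lower()]["min_gb"]

print(can_generate("4K", 16))    # True: meets the 16 GB minimum for 4K
print(can_generate("1080p", 8))  # False: below the 12 GB minimum
```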

How Does LTX-2 Compare to Other Video Generation Models?

The open-source video generation landscape has several notable entries, but LTX-2 occupies a unique position:

| Model | Open Source | Max Resolution | Audio Sync | Consumer GPU |
|---|---|---|---|---|
| LTX-2 (Lightricks) | Yes | 4K | Native | Yes |
| Stable Video Diffusion (Stability AI) | Yes | 1080p | No | Yes |
| Open-Sora (HPC-AI) | Yes | 1080p | No | Limited |
| CogVideo (THUDM) | Yes | 720p | No | Yes |

LTX-2 is the only open-source model that combines 4K resolution, native audio synchronization, and consumer GPU support.

What Are the Limitations?

As impressive as LTX-2 is, it has boundaries worth noting:

  • Generation time: High-resolution 4K generation can take several minutes even on top-tier GPUs, limiting real-time or near-real-time applications.
  • Content consistency: Like all generative models, LTX-2 can produce temporal inconsistencies in long-duration clips, especially in complex scenes with multiple moving elements.
  • Audio quality ceiling: While the synchronized audio is a breakthrough, its quality may not match dedicated audio generation models for complex sound design (multiple simultaneous sound sources, precise timing control).
  • Training data biases: The model reflects biases present in its training data, which affects how it represents different scenes, cultures, and scenarios.

Frequently Asked Questions

What is LTX-2?

LTX-2 is an open-source DiT-based audio-video foundation model by Lightricks that generates synchronized 4K video and audio at up to 50fps on consumer GPUs.

What pipelines does LTX-2 support?

LTX-2 supports text-to-video, image-to-video, video-to-video, and audio-to-video generation pipelines, all with native synchronized audio output.

What are the hardware requirements for LTX-2?

4K generation requires at least 16 GB of VRAM (24 GB recommended), 1080p requires 12 GB (16 GB recommended), and 480p requires 8 GB (12 GB recommended). Consumer GPUs like the RTX 4090 are fully supported.

How does LTX-2 handle audio synchronization?

LTX-2 generates audio and video from a shared latent representation, ensuring temporal alignment without post-processing. This is the first open-source model to achieve this natively.

What is LTX-2’s license?

LTX-2 is released as open source by Lightricks. The exact license terms are documented in the GitHub repository and may allow both research and commercial use.
