The generative AI landscape has been transformed by diffusion models for images and, more recently, for video. But generating video that sounds as good as it looks has remained a stubbornly separate problem – until now. LTX-2 changes that equation entirely.
Developed by Lightricks, the company behind the popular creative tools Facetune and LTX Studio, LTX-2 is the first open-source Diffusion Transformer (DiT) based audio-video foundation model, capable of generating synchronized video and audio at up to 4K resolution and 50 frames per second. Unlike previous approaches that required stitching together separate video and audio generation pipelines, LTX-2 produces both modalities simultaneously, with the audio naturally aligned to the visual content.
Running on consumer GPUs – a deliberate design goal – LTX-2 brings professional-grade video generation to individual creators, small studios, and researchers. The model has already garnered significant attention on GitHub and in the AI research community for its combination of quality, speed, and openness.
This guide explores the LTX-2 architecture, supported pipelines, hardware requirements, and how it compares to other video generation models.
What Makes LTX-2 Architecturally Distinct?
LTX-2 is built on a Diffusion Transformer (DiT) architecture, which replaces the traditional U-Net backbone used in models like Stable Diffusion with a transformer-based design. This architectural choice brings several advantages:
| Feature | LTX-2 (DiT-based) | Traditional U-Net Models |
|---|---|---|
| Audio-Video Sync | Native joint generation | Separate pipelines |
| Resolution Scaling | Scales to 4K | Typically limited to 1080p |
| Frame Rate | Up to 50fps | Typically 24-30fps |
| Temporal Coherence | Transformer attention across frames | Temporal layers bolted on |
| Consumer GPU Support | Yes (16-24 GB VRAM) | Varies widely |
The DiT architecture processes video as a sequence of spatiotemporal patches, allowing the transformer’s self-attention mechanism to learn long-range temporal dependencies that are essential for coherent video generation.
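To make the patch representation concrete, here is a minimal PyTorch sketch of spatiotemporal patchification as used in DiT-style models generally. The patch sizes and embedding dimension are illustrative choices, not LTX-2's actual configuration:

```python
import torch
import torch.nn as nn

class SpatiotemporalPatchify(nn.Module):
    """Split a video tensor into patch tokens for a DiT backbone.

    Illustrative only: the patch sizes and embedding dimension here
    are hypothetical, not LTX-2's published configuration.
    """
    def __init__(self, in_channels=3, patch_t=2, patch_hw=16, embed_dim=1024):
        super().__init__()
        # A 3D convolution with stride == kernel size carves the video
        # into non-overlapping (time, height, width) patches and embeds
        # each patch as a single token.
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(patch_t, patch_hw, patch_hw),
            stride=(patch_t, patch_hw, patch_hw),
        )

    def forward(self, video):  # video: (batch, channels, frames, H, W)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence

tokens = SpatiotemporalPatchify()(torch.randn(1, 3, 16, 256, 256))
print(tokens.shape)  # torch.Size([1, 2048, 1024])
```

The resulting token sequence is what the transformer backbone attends over; the diagram below shows where this step sits in the full pipeline.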
```mermaid
graph TD
    subgraph "LTX-2 Architecture"
        A[Input: Text / Image / Video / Audio] --> B[Spatiotemporal Encoder]
        B --> C[DiT Backbone]
        C --> D[Video Decoder]
        C --> E[Audio Decoder]
        D --> F[Output: 4K Video up to 50fps]
        E --> G[Output: Synchronized Audio]
    end
```

What Pipelines Does LTX-2 Support?
LTX-2 is not limited to a single generation mode. It supports multiple input-conditioned pipelines:
Text-to-Video
The foundational pipeline. Generate a video and synchronized audio from a text prompt:
"A cinematic drone shot flying over a misty mountain range at sunrise, birds chirping, wind howling"
LTX-2 produces both the visual sequence and the corresponding ambient audio.
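If the inference code in the LTX-2 GitHub repository follows the diffusers-style interface of Lightricks' earlier LTX-Video release, a text-to-video call might look like the sketch below. The class name LTX2Pipeline, the import path, the checkpoint ID, and the frame-count parameter are all assumptions for illustration; consult the official repository for the actual API:

```python
import torch
from diffusers.utils import export_to_video  # real diffusers helper

# Hypothetical entry point: "LTX2Pipeline" and the checkpoint ID are
# assumed names modeled on LTX-Video's diffusers integration.
from ltx2 import LTX2Pipeline  # assumed import path

pipe = LTX2Pipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.to("cuda")

result = pipe(
    prompt=(
        "A cinematic drone shot flying over a misty mountain range "
        "at sunrise, birds chirping, wind howling"
    ),
    num_frames=121,  # hypothetical: ~2.4 s of video at 50 fps
    height=1024,
    width=1792,
)
export_to_video(result.frames[0], "drone.mp4", fps=50)
# The synchronized audio track would be exposed on the same result
# object; the exact attribute name depends on the released API.
```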
Image-to-Video
Animate a static image with generated motion and context-appropriate sound. This pipeline is particularly useful for bringing photographs to life with subtle motion and environmental audio.
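Under the same assumed interface as the text-to-video sketch above, image-to-video would condition generation on an input frame (load_image is a real diffusers utility; the pipeline arguments remain hypothetical):

```python
from diffusers.utils import load_image

image = load_image("mountain_photo.jpg")
result = pipe(  # same hypothetical pipeline object as above
    image=image,
    prompt="slow camera push-in, mist drifting, distant birdsong",
    num_frames=121,
)
```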
Video-to-Video
Apply style transfers, modifications, or extensions to existing video footage. The audio pipeline adapts to the modified visual content.
Audio-to-Video
Generate video that matches a provided audio track. This pipeline enables music video production, lip-sync content, and sound-driven animations.
| Pipeline | Input | Output Resolution | Typical Generation Time (24 GB GPU) |
|---|---|---|---|
| Text-to-Video | Text prompt | Up to 4K | 2-5 minutes |
| Image-to-Video | Image + optional text | Up to 4K | 1-4 minutes |
| Video-to-Video | Video + style prompt | Up to 4K | 3-8 minutes |
| Audio-to-Video | Audio track + text | Up to 1080p | 2-6 minutes |
How Does Audio-Video Synchronization Work?
The most impressive technical achievement in LTX-2 is its native audio-video synchronization. Previous video generation models treated audio as an afterthought – generate the video, then add audio through a separate model or manual post-processing. LTX-2 generates both modalities from the same latent representation:
```mermaid
graph LR
    A[Input Conditioning] --> B[Shared Latent Space]
    B --> C[Video Pathway]
    B --> D[Audio Pathway]
    C --> E[Video Frames]
    D --> F[Audio Waveform]
    E --> G{Temporal Alignment}
    F --> G
    G --> H[Synchronized Output]
```

The shared latent representation ensures that the audio and video share the same temporal structure. When the video shows a door slamming, the audio naturally produces a slamming sound at the exact same frame. This alignment emerges from training, not from post-processing heuristics.
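In code terms, joint generation means a single denoising loop updates the video and audio latents together, with the backbone attending across both modalities. The sketch below is a schematic of that idea under a diffusers-style scheduler, not LTX-2's actual implementation; every name in it is illustrative:

```python
import torch

def joint_denoise(backbone, scheduler, steps, video_latent, audio_latent, cond):
    """Schematic joint denoising loop (illustrative, not LTX-2's code).

    Both latents are indexed by the same temporal axis, so attention
    inside `backbone` can align a visual event (a door slamming) with
    its sound at the same timestep.
    """
    for t in scheduler.timesteps[:steps]:
        # One backbone pass predicts noise for BOTH modalities at once;
        # this shared pass is what enforces synchronization.
        video_eps, audio_eps = backbone(video_latent, audio_latent, t, cond)
        video_latent = scheduler.step(video_eps, t, video_latent).prev_sample
        audio_latent = scheduler.step(audio_eps, t, audio_latent).prev_sample
    return video_latent, audio_latent
```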
What Hardware Do You Need to Run LTX-2?
Lightricks designed LTX-2 with consumer-grade hardware as a target. Here are the practical requirements:
| Generation Quality | Minimum VRAM | Recommended VRAM | GPU Examples |
|---|---|---|---|
| 480p (SD) | 8 GB | 12 GB | RTX 3060, RTX 4060 |
| 1080p (Full HD) | 12 GB | 16 GB | RTX 4070 Ti, RTX 4080 |
| 4K (Ultra HD) | 16 GB | 24 GB | RTX 4090, RTX 5090 |
These requirements are substantially lower than competing models like Sora or Kling, which typically require datacenter GPUs for high-resolution generation.
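To check a local GPU against these minimums, a few lines of standard PyTorch suffice (the thresholds are taken directly from the table above):

```python
import torch

# Minimum VRAM (GiB) per quality tier, from the table above.
VRAM_MINIMUMS = {"480p": 8, "1080p": 12, "4K": 16}

def max_supported_quality():
    """Return the highest quality tier the local GPU minimally supports."""
    if not torch.cuda.is_available():
        return None
    vram_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    supported = [q for q, need in VRAM_MINIMUMS.items() if vram_gib >= need]
    return supported[-1] if supported else None

print(max_supported_quality())  # e.g. "4K" on an RTX 4090 (24 GiB)
```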
How Does LTX-2 Compare to Other Video Generation Models?
The open-source video generation landscape has several notable entries, but LTX-2 occupies a unique position:
| Model | Open Source | Max Resolution | Audio Sync | Consumer GPU |
|---|---|---|---|---|
| LTX-2 (Lightricks) | Yes | 4K | Native | Yes |
| Stable Video Diffusion (Stability AI) | Yes | 1080p | No | Yes |
| Open-Sora (HPC-AI) | Yes | 1080p | No | Limited |
| CogVideo (THUDM) | Yes | 720p | No | Yes |
LTX-2 is the only open-source model that combines 4K resolution, native audio synchronization, and consumer GPU support.
What Are the Limitations?
As impressive as LTX-2 is, it has boundaries worth noting:
- Generation time: High-resolution 4K generation can take several minutes even on top-tier GPUs, limiting real-time or near-real-time applications.
- Content consistency: Like all generative models, LTX-2 can produce temporal inconsistencies in long-duration clips, especially in complex scenes with multiple moving elements.
- Audio quality ceiling: While the synchronized audio is a breakthrough, its quality may not match dedicated audio generation models for complex sound design (multiple simultaneous sound sources, precise timing control).
- Training data biases: The model reflects biases present in its training data, which affects how it represents different scenes, cultures, and scenarios.
Frequently Asked Questions
What is LTX-2?
LTX-2 is an open-source DiT-based audio-video foundation model by Lightricks that generates synchronized 4K video and audio at up to 50fps on consumer GPUs.
What pipelines does LTX-2 support?
LTX-2 supports text-to-video, image-to-video, video-to-video, and audio-to-video generation pipelines, all with native synchronized audio output.
What are the hardware requirements for LTX-2?
4K generation needs at least 16 GB of VRAM (24 GB recommended), 1080p needs 12 GB (16 GB recommended), and 480p needs 8 GB (12 GB recommended). Consumer GPUs like the RTX 4090 are fully supported.
How does LTX-2 handle audio synchronization?
LTX-2 generates audio and video from a shared latent representation, ensuring temporal alignment without post-processing. This is the first open-source model to achieve this natively.
What is LTX-2’s license?
LTX-2 is released as open source by Lightricks. The exact license terms are documented in the GitHub repository and may allow both research and commercial use.