LLaMA-VID: An Image is Worth 2 Tokens -- Efficient Long Video Understanding with LLMs

LLaMA-VID is an ECCV 2024 research project that represents each video frame with just 2 tokens, enabling hour-long video understanding with LLMs.


LLaMA-VID (Large Language and Video Assistant) is an ECCV 2024 research project that tackles the fundamental bottleneck in video understanding with LLMs: token efficiency. While modern LLMs boast context windows of 128K to 200K tokens, previous multimodal approaches consumed 100 to 500 tokens per video frame, making even a short 5-minute video clip computationally prohibitive. LLaMA-VID’s breakthrough is representing each video frame with just 2 tokens – a compression ratio of 50x to 250x over existing methods.

The key insight is that video frames are highly redundant. Much of a frame’s visual information is shared with the surrounding frames: the background, the setting, the lighting. LLaMA-VID introduces a dual-token representation that separates what is stable across frames (the context token) from what is changing (the motion token). This means you can process a 1-hour video at 1 FPS (3,600 frames) using just 7,200 tokens – comfortably within any modern LLM’s context window.
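The token arithmetic above is easy to verify. A minimal sketch (the function name and defaults are illustrative, not from the codebase):

```python
# Back-of-the-envelope token budget for a fixed tokens-per-frame scheme.
def video_token_budget(duration_minutes: float, fps: float = 1.0,
                       tokens_per_frame: int = 2) -> int:
    """Total visual tokens needed to represent a video in the LLM context."""
    frames = int(duration_minutes * 60 * fps)
    return frames * tokens_per_frame

# A 1-hour video at 1 FPS: 3,600 frames x 2 tokens = 7,200 tokens.
print(video_token_budget(60))                          # 7200
# The same video at 100 tokens/frame (a Video-LLaMA-style cost):
print(video_token_budget(60, tokens_per_frame=100))    # 360000
```

At 2 tokens per frame, even an hour of video stays two orders of magnitude below a 128K context window; at 100 tokens per frame, it overflows it nearly threefold.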

The project was published at ECCV 2024 and has become a foundational reference for efficient video-language modeling. Its approach has influenced subsequent work on long-context multimodal understanding, and the codebase provides a complete pipeline for training and inference on video understanding tasks.

Repository: github.com/JIA-Lab-research/LLaMA-VID


How Does LLaMA-VID’s Dual-Token Strategy Work?

The dual-token compression works through a learned mapping that takes the CLIP image encoder’s output and distills it into two separate tokens per frame:

  1. Context Token: Encodes the static scene content – background, setting, object layout. For frames that share the same scene (most consecutive frames), the context token remains nearly identical, contributing minimal new information to the LLM.
  2. Motion Token: Encodes the temporal delta – what has changed since the previous frame. This token captures movement, new objects entering the scene, and pose changes. It is the information-rich token that enables the LLM to understand action sequences.

The LLM receives the full sequence of (context, motion) token pairs and processes them as a unified video representation. Because the context tokens are largely redundant, the LLM can effectively “skip over” them and focus attention on the motion tokens for understanding action and change.
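The split can be sketched with toy stand-ins: mean pooling in place of the project's learned query-based compression for the context token, and a frame-to-frame difference for the motion token. This is an illustration of the dual-token idea, not the actual trained module:

```python
import numpy as np

def dual_tokens(frame_features: np.ndarray) -> np.ndarray:
    """Compress per-frame patch features (T, P, D) into 2 tokens per frame (T, 2, D).

    Toy stand-ins for the learned mapping: mean pooling for the context
    token, and a delta against the previous frame for the motion token.
    """
    # Context token: pooled summary of the whole frame (stable scene content).
    context = frame_features.mean(axis=1)        # (T, D)
    # Motion token: temporal delta against the previous frame's summary;
    # the first frame has no predecessor, so its motion token is zero.
    motion = np.zeros_like(context)
    motion[1:] = context[1:] - context[:-1]      # (T, D)
    return np.stack([context, motion], axis=1)   # (T, 2, D)

feats = np.random.randn(8, 16, 32)   # 8 frames, 16 patches, feature dim 32
tokens = dual_tokens(feats)
print(tokens.shape)                  # (8, 2, 32)
```

For consecutive frames of the same scene, the pooled summaries barely change, so the motion tokens stay near zero, which is exactly the redundancy the LLM can skip over.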

How Does LLaMA-VID Compare to Other Video Understanding Approaches?

| Method | Tokens per Frame | Max Video Length (1 FPS) | Context Required | Single GPU Viable |
|---|---|---|---|---|
| LLaMA-VID | 2 | 60+ minutes | 7,200 tokens | Yes (24 GB) |
| Video-LLaMA | 100+ | ~12 minutes | 72,000+ tokens | Limited |
| LLaVA-NeXT-Video | 576 | ~3 minutes | 103,680 tokens | No |
| GPT-4V (API) | Variable | ~10 minutes | 100,000+ tokens | N/A (API) |
| Image-frame baseline | 257 | ~6 minutes | 92,520 tokens | No |

The table clearly shows LLaMA-VID’s advantage. While other methods can technically process longer videos by trading off frame rate, their token-per-frame cost makes even moderate-length videos prohibitively expensive for the LLM’s context window.
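The trade-off can be made concrete by asking how many frames each method fits into a fixed context window. A quick sketch using the per-frame costs from the table (the 128K window size is an assumption, and prompt/response tokens are ignored):

```python
# Frames (and minutes at 1 FPS) that fit into a 128K-token context window,
# using per-frame token costs from the comparison table.
CONTEXT_WINDOW = 128_000
tokens_per_frame = {
    "LLaMA-VID": 2,
    "Video-LLaMA": 100,
    "LLaVA-NeXT-Video": 576,
}

for method, cost in tokens_per_frame.items():
    frames = CONTEXT_WINDOW // cost
    print(f"{method}: {frames} frames = {frames / 60:.0f} min at 1 FPS")
```

Even before reserving tokens for the prompt, the 576-tokens-per-frame encoder caps out at a few minutes of video, while 2 tokens per frame leaves room for many hours.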

What Are the Key Technical Specifications?

| Specification | Detail |
|---|---|
| Image Encoder | CLIP ViT-L/14 |
| LLM Backbone | LLaMA-2 / LLaMA-3 (configurable) |
| Token Compression | Learned query-based transformer |
| Tokens per Frame | 2 (1 context + 1 motion) |
| Max Video Length | 60+ minutes at 1 FPS |
| Training Data | VideoChatGPT, WebVid, activity recognition datasets |
| Minimum GPU | 24 GB VRAM |
| Published At | ECCV 2024 |

What Video Understanding Benchmarks Does LLaMA-VID Excel At?

| Benchmark | Task | LLaMA-VID Performance | Comparison |
|---|---|---|---|
| VideoChatGPT | Comprehensive video QA | State-of-the-art | Outperforms Video-LLaMA by 15% |
| MVBench | Multi-view video understanding | Competitive | Within 3% of best-performing models |
| Video-MME | Extended video understanding | Best overall | Significant gap over methods without compression |
| ActivityNet | Action recognition | Competitive | Strong on long-context questions |
| NExT-QA | Temporal reasoning | Good | Excels on "why" and "how" questions |

How to Set Up and Run LLaMA-VID

git clone https://github.com/JIA-Lab-research/LLaMA-VID.git
cd LLaMA-VID
pip install -r requirements.txt

Download pretrained weights:

# Download LLaMA-VID weights from Hugging Face
# The project provides both 7B and 13B model variants

Basic inference:

from llama_vid import LLaMAVID

model = LLaMAVID.from_pretrained("jia-lab/llama-vid-7b")

# Process a long video
video_path = "lecture_1hour.mp4"
result = model.ask(video_path, "What topics were covered in the first 30 minutes?")

print(result.answer)

# Get frame-level timestamps
print(result.timestamps)

For training or fine-tuning on custom video understanding datasets, the repository provides complete training scripts with configurable data loaders and evaluation pipelines.

FAQ

What is LLaMA-VID and what makes it unique? LLaMA-VID is an ECCV 2024 project that enables hour-long video understanding by compressing each frame into just 2 tokens – a 50x-250x compression over previous methods, making long-video processing feasible on a single GPU.

What is the dual-token strategy in LLaMA-VID? Each frame yields a context token (stable scene content) and a motion token (temporal changes). Context tokens are largely redundant across frames, so the LLM can focus attention on motion tokens for understanding actions.

How long of a video can LLaMA-VID handle? Videos exceeding one hour. A 60-minute video at 1 FPS produces just 7,200 tokens, well within modern LLM context windows.

What hardware is required to run LLaMA-VID? A single GPU with 24 GB VRAM is sufficient for hour-long videos, making it accessible to research labs and individual developers.

How does LLaMA-VID perform on video understanding benchmarks? State-of-the-art or competitive on VideoChatGPT, MVBench, and Video-MME, with particular strength on long-video understanding tasks.
