What if editing a video was as simple as telling an AI what you want, in plain English, and watching it happen?
No dragging clips along a timeline. No hunting through menus for color correction filters. No manually scrubbing through hours of footage to find the dead space. Just a conversation with a coding agent that understands video — cuts, colors, audio, subtitles, and all.
That is the promise of Video Use, an open-source project (currently at approximately 4,200 GitHub stars) that extends the browser-use ecosystem into video editing territory. Instead of an AI agent controlling a web browser, Video Use has an AI agent controlling FFmpeg, subtitle burners, animation renderers, and color grading pipelines — all driven by natural language prompts from agents like Claude Code, OpenAI Codex, Hermes, or OpenClaw.
Answer Capsule: Video Use is an open-source tool that lets coding agents edit videos through natural language commands. It handles filler word removal, color grading, subtitles, animations, and audio fades — all while being dramatically more token-efficient than traditional video processing approaches.
How Does Video Use Let an LLM Edit Video Without Watching It?
The single biggest obstacle to AI-driven video editing is obvious: large language models cannot watch video. They cannot see the content that a human editor would see on a timeline. This is not a minor limitation — it is the core problem that Video Use was built to solve.
Most naive approaches to LLM-based video editing would attempt to send raw video frames to the model, frame by frame. A standard 10-minute 1080p video at 30fps contains approximately 18,000 frames. At a conservative estimate, processing those frames through current LLM tokenizers would consume around 45 million tokens — and that is before any actual editing logic is applied. The cost alone makes the approach impractical.
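The arithmetic behind that estimate is simple (the per-frame token cost below is an assumption; actual costs vary by model and tokenizer):

```python
# Back-of-envelope token cost of sending every frame to an LLM.
# ~2,500 tokens per frame is an assumed average for current vision tokenizers.
frames = 10 * 60 * 30      # 10 minutes at 30fps -> 18,000 frames
tokens = frames * 2_500    # -> 45,000,000 tokens, before any editing logic
print(f"{frames:,} frames, {tokens:,} tokens")
```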
Video Use takes a fundamentally different approach based on a layered representation, and this is the core innovation of the project:
The LLM never watches the video. It reads the video.
Layer 1: Audio Transcript via ElevenLabs Scribe
The first layer is a dense but compact audio transcript. Video Use sends the audio track to ElevenLabs Scribe, which returns a full word-level transcription with precise timestamps. Every word is accounted for — filler words like “umm,” “uh,” “like,” and “you know” are identified alongside content words, each one keyed to the exact moment it was spoken.
This output is written to a file called takes_packed.md. For a typical 10-minute video, the transcript clocks in at roughly 12KB of text, a tiny payload compared with the video it describes.
Why this matters: the LLM can now read every single word in the video, know exactly when it was said, detect patterns (filler word density, pacing, awkward pauses), and make editing decisions based on text — the medium it understands best.
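As a rough sketch, the transcription step could look like the following, using the official elevenlabs Python SDK (the response field names and the takes_packed.md layout shown here are assumptions, not the project's actual format):

```python
# Sketch: word-level transcription with ElevenLabs Scribe.
# Assumes the audio track has already been extracted from the video.
from elevenlabs.client import ElevenLabs

client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment

with open("raw_take_audio.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        file=audio,
        model_id="scribe_v1",  # Scribe model with per-word timestamps
    )

# Pack the words into a compact, LLM-readable file.
with open("takes_packed.md", "w") as out:
    for word in result.words:  # field names assumed; check the SDK response
        out.write(f"{word.start:.2f}\t{word.end:.2f}\t{word.text}\n")
```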
Layer 2: Visual Composite via Timeline View
Transcript alone is not enough. An LLM also needs to see what the video looks like at key moments. But sending all 18,000 frames of a 10-minute video is a non-starter.
Instead, Video Use generates a visual composite — a PNG filmstrip image — only at decision points. These are the moments where a cut, transition, or visual treatment might be appropriate. Instead of 18,000 frames, the LLM sees perhaps 20 to 50 composited PNGs.
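A minimal sketch of how such a filmstrip could be assembled with ffmpeg and Pillow (a hypothetical helper, not the project's actual implementation):

```python
# Sketch: build a filmstrip PNG from frames at decision-point timestamps.
import subprocess
from PIL import Image

def filmstrip(video: str, timestamps: list[float], out_png: str, thumb_w: int = 320) -> None:
    frames = []
    for i, t in enumerate(timestamps):
        path = f"/tmp/frame_{i}.png"
        # -ss before -i seeks fast; -frames:v 1 grabs a single frame at time t
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(t), "-i", video, "-frames:v", "1", path],
            check=True, capture_output=True,
        )
        img = Image.open(path)
        frames.append(img.resize((thumb_w, img.height * thumb_w // img.width)))

    # Paste the thumbnails side by side into one composite image.
    strip = Image.new("RGB", (thumb_w * len(frames), frames[0].height))
    for i, frame in enumerate(frames):
        strip.paste(frame, (i * thumb_w, 0))
    strip.save(out_png)

filmstrip("raw_take.mp4", [3.2, 18.7, 42.0], "decision_points.png")
```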
The result? The LLM has everything it needs to make informed editorial decisions:
- From the transcript: precise word-level timing, pause detection, filler word locations
- From the composite: visual context at every cut boundary
The Efficiency Ratio
| Approach | Data Volume | Feasible for LLM? |
|---|---|---|
| Raw video frames | ~45M tokens | No — cost-prohibitive |
| ElevenLabs transcript only | ~12KB text | Partial — no visual context |
| Transcript + visual composite | ~12KB text + a handful of PNGs | Yes — the sweet spot |
What Editing Features Does Video Use Support?
With the transcript and visual composite available, the coding agent can orchestrate a wide range of editing operations through FFmpeg and companion tools. These are the features that currently ship with Video Use.
Auto Filler Word and Dead Space Removal
This is the feature that generates the most immediate value for content creators. The LLM reads the transcript, identifies every instance of filler language (“umm,” “uh,” “like,” “you know,” and similar hesitation markers), and surgically removes them from the edit. Alongside this, dead space — pauses longer than a configurable threshold — is automatically trimmed.
The result is a compressed, punchier version of the original recording, with no awkward silences and none of the verbal tics that make unscripted content sound unpolished. The LLM applies 30ms audio fades at every cut boundary, so the audio does not click or pop at the edit points.
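A simplified sketch of the cut-planning side of this feature (the filler list and pause threshold are configurable in Video Use; the values and the helper below are illustrative assumptions):

```python
# Sketch: turn transcript words into "keep" segments, dropping fillers and
# dead space. Cut edges are padded so a 30ms fade fits at each boundary.
FILLERS = {"umm", "uh", "like"}  # multi-word fillers like "you know" would
                                 # need bigram matching, omitted here
FADE = 0.030                     # 30ms fade applied at every cut boundary

def keep_segments(words, duration, max_pause=0.8):
    """words: list of (start, end, text) tuples from the transcript."""
    cuts = []
    prev_end = 0.0
    for start, end, text in words:
        if start - prev_end > max_pause:              # dead space
            cuts.append((prev_end + FADE, start - FADE))
        if text.lower().strip(".,") in FILLERS:       # hesitation marker
            cuts.append((start, end))
        prev_end = end
    # Invert the cut list into the segments to keep.
    keep, pos = [], 0.0
    for c_start, c_end in sorted(cuts):
        if c_start > pos:
            keep.append((pos, c_start))
        pos = max(pos, c_end)
    if pos < duration:
        keep.append((pos, duration))
    return keep
```

Each kept segment would then be trimmed and re-joined with ffmpeg, with the 30ms fade applied on both sides of every join.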
Auto Color Grading
Video Use ships with preset color grading pipelines that can be applied to the entire video or to specific segments:
- Warm Cinematic: boosts warmth, adds a subtle teal-orange split, and applies a gentle film curve
- Neutral Punch: increases contrast and vibrance without introducing a color cast — suitable for talking-head content that should not look stylized
- Custom FFmpeg Chains: advanced users can define arbitrary ffmpeg -vf filter chains and reference them by name from the agent's prompt
The LLM selects the grading based on the content it reads from the transcript and the visual composite. A dramatic monologue might get Warm Cinematic; a product demo might get Neutral Punch.
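As an illustration, a named preset could map to an ffmpeg -vf chain like this (the filter values are assumptions, not Video Use's actual presets):

```python
# Sketch: applying a named grading preset as an ffmpeg -vf filter chain.
import subprocess

PRESETS = {
    "warm_cinematic": (
        "colorbalance=rs=0.05:bs=-0.05,"  # push reds up, blues down (warmth)
        "curves=preset=medium_contrast,"  # gentle film-style curve
        "eq=saturation=1.1"               # subtle vibrance lift
    ),
    "neutral_punch": "eq=contrast=1.15:saturation=1.12",
}

def grade(src: str, dst: str, preset: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", PRESETS[preset], "-c:a", "copy", dst],
        check=True,
    )

grade("cut.mp4", "graded.mp4", "warm_cinematic")
```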
Burned-In Subtitles
Video Use generates subtitle tracks and burns them directly into the video output. The subtitle style is fully configurable:
- Font family and size
- Position on screen (bottom-center, top-left, etc.)
- Background box opacity and color
- Text color and stroke width
Because the LLM has word-level timestamps from the ElevenLabs transcript, the subtitles are perfectly synchronized with the spoken audio — no manual alignment needed.
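Under the hood, burning styled subtitles is a single ffmpeg invocation. A hedged sketch (the style values here are illustrative; Video Use's actual defaults may differ):

```python
# Sketch: burn a generated SRT file into the video with ffmpeg's subtitles
# filter. force_style takes ASS style keys; Alignment=2 is bottom-center.
import subprocess

style = "FontName=Inter,FontSize=28,PrimaryColour=&H00FFFFFF,Outline=2,Alignment=2"
subprocess.run(
    [
        "ffmpeg", "-y", "-i", "graded.mp4",
        "-vf", f"subtitles=subs.srt:force_style='{style}'",
        "-c:a", "copy", "subtitled.mp4",
    ],
    check=True,
)
```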
Animation Overlays
For creators who want to add visual polish, Video Use supports animation overlays generated by three different renderers:
| Engine | Best For | Output |
|---|---|---|
| Manim | Mathematical animations, chalk-talk style | High-quality programmatic motion graphics |
| Remotion | Complex composited scenes | React-based video components rendered to frames |
| PIL | Simple overlay graphics | Still-image overlays and lower-thirds |
The LLM writes the animation script (Python for Manim or PIL, React for Remotion), renders it, and composites it over the video track.
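For the simplest case, a PIL lower-third might look like this (an illustrative example of the kind of script the agent would write, not code from the project):

```python
# Sketch: render a translucent lower-third as a transparent PNG with Pillow.
from PIL import Image, ImageDraw, ImageFont

overlay = Image.new("RGBA", (1920, 1080), (0, 0, 0, 0))  # fully transparent
draw = ImageDraw.Draw(overlay)
draw.rectangle([80, 900, 900, 980], fill=(0, 0, 0, 160))  # translucent box
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)      # font path assumed
draw.text((100, 915), "Jane Doe / Lead Engineer", font=font, fill="white")
overlay.save("lower_third.png")

# Composited over the video for seconds 5-12, e.g.:
#   ffmpeg -i in.mp4 -i lower_third.png \
#     -filter_complex "overlay=0:0:enable='between(t,5,12)'" out.mp4
```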
Self-Evaluation: How Video Use Checks Its Own Work
One of the most interesting design choices in Video Use is the self-evaluation loop. After the agent applies an edit — a cut, a color grade, a subtitle burn — the system does not simply assume success. It renders the output at every cut boundary and evaluates it.
The evaluation checks:
- Audio continuity: Is there a click or pop at the cut point? (The 30ms fade is the first defense, but the evaluation confirms.)
- Visual consistency: Does the color grade transition smoothly? Are there flash frames or dropped frames?
- Subtitle sync: Are the subtitles still aligned after the cut? Did a filler-word removal shift the audio relative to the visuals?
If the evaluation detects an issue, the agent loops back and corrects it. This makes the editing process iterative and corrective rather than a single pass of “generate and hope it works.”
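The audio-continuity check, for instance, can be as simple as looking for an abnormal sample-to-sample jump right at the cut point. A crude sketch, assuming numpy and already-decoded mono PCM (not the project's actual evaluator):

```python
# Sketch: flag a likely click at a cut boundary in normalized [-1, 1] audio.
import numpy as np

def click_at(samples: np.ndarray, sr: int, cut_t: float, window_ms: int = 5) -> bool:
    i = int(cut_t * sr)                      # sample index of the cut
    w = int(sr * window_ms / 1000)           # inspect a few ms on each side
    region = samples[max(0, i - w): i + w]
    jump = np.max(np.abs(np.diff(region)))   # largest instantaneous step
    return jump > 0.3                        # threshold is an assumption; tune
```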
Session Memory via project.md
Video Use persists all editorial decisions and context in a project.md file that lives alongside the video project. This file acts as session memory — the coding agent can reference it across multiple sessions or conversations to maintain continuity.
The project.md file contains:
- The original file paths and encode settings
- Every cut that was made, with timestamps
- Color grading decisions applied to each segment
- Subtitle style configuration
- A list of filler words that were removed (customizable per project)
- Notes from the self-evaluation loop
This means you can start an edit with Claude Code, pause, pick it up with Codex the next day, and the new agent will know exactly what has been done and what remains.
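A project.md might look something like this (the layout is illustrative; the real schema may differ):

```markdown
# project.md (illustrative)
source: raw_take.mp4 (1080p, 30fps, h264)
cuts:
  - 00:01:12.400 to 00:01:12.910  removed filler "umm"
  - 00:03:45.020 to 00:03:46.900  trimmed dead space (1.9s pause)
grading:
  - full video: warm_cinematic
subtitles: Inter 28px, bottom-center, 60% black box
fillers_removed: umm, uh, like, you know
evaluation: pass (no clicks; subtitle drift under 40ms)
```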
Getting Started with Video Use
The setup is straightforward for anyone familiar with Python and FFmpeg:
# Clone the repository
git clone https://github.com/browser-use/video-use
cd video-use
# Create a virtual environment and install dependencies
uv sync
# or: pip install -r requirements.txt
# Install FFmpeg (if not already installed)
brew install ffmpeg
You will also need access to an LLM provider — Claude Code, OpenAI Codex, Hermes, or OpenClaw — and an ElevenLabs API key for the Scribe transcription layer.
Once the environment is set up, the workflow is:
- Place your raw video file in the project directory
- Tell the agent: “Edit this video — remove filler words, apply warm cinematic grading, add subtitles”
- The agent transcribes the audio, generates the visual composite, and begins editing
- Review the output and provide follow-up instructions
FAQ
What is Video Use?
Video Use is an open-source video editing tool that lets you edit videos by chatting with coding agents like Claude Code, Codex, or OpenClaw instead of using traditional timeline editors.
How does Video Use understand video content?
The LLM never watches the video. Instead, it reads a word-level audio transcript produced by ElevenLabs Scribe and looks at visual composite PNGs generated only at decision points.
What editing features does Video Use support?
It supports auto-removal of filler words and dead space, auto color grading, 30ms audio fades, customizable subtitles, and animation overlays via Manim, Remotion, or PIL.
What is the token efficiency of Video Use?
Instead of the roughly 45 million tokens needed to process raw video frames, Video Use works from approximately 12KB of transcript text plus a handful of PNG images.
Is Video Use free to use?
Yes, Video Use is open source and free. Requirements include FFmpeg and a Python environment with uv or pip.
Further Reading
- Video Use GitHub Repository — Source code, documentation, and community issues
- browser-use — The browser automation framework that inspired the video editing extension
- ElevenLabs Scribe — The speech-to-text API used for audio transcription
- Manim — Mathematical animation engine for programmatic motion graphics
- Remotion — Write videos in React with programmatic compositing
