AudioGhost AI: Open-Source Object-Oriented Audio Separation with Meta's SAM-Audio

AudioGhost AI wraps Meta's SAM-Audio model in a user-friendly GUI for text-guided sound separation, running on consumer GPUs with 4-6GB VRAM.

For decades, isolating a single instrument from a mixed recording required either expensive multi-track access from the original studio session or painstaking spectral editing by an experienced audio engineer. AudioGhost AI rewrites this workflow by bringing Meta’s state-of-the-art SAM-Audio model to the desktop with a straightforward graphical interface, letting anyone separate sounds with nothing more than a text prompt.

Developed by the open-source contributor 0x0funky, AudioGhost AI is a purpose-built wrapper around Meta AI’s SAM-Audio research model. SAM-Audio extends the “Segment Anything” philosophy — originally developed for image segmentation — into the audio domain. The original SAM model made it possible to click on any pixel in an image and isolate that object; SAM-Audio applies the same principle to sound. Describe the sound source you want (“the lead vocal,” “the snare drum,” “the acoustic guitar”) and the model isolates it from the rest of the mix with impressive fidelity.

What makes AudioGhost AI particularly notable is its accessibility. Consumer-grade audio separation tools have historically required either cloud API subscriptions or powerful server-grade GPUs. AudioGhost AI runs comfortably on GPUs with 4 to 6 GB of VRAM — a range that covers the vast majority of consumer and gaming GPUs currently in use. This opens professional-quality audio separation to independent musicians, podcasters, video editors, and hobbyists who lack access to high-end compute resources.


What Exactly Is AudioGhost AI and Why Was It Created?

AudioGhost AI was created to bridge the gap between Meta’s research output and practical, everyday use. Meta published SAM-Audio as a research model with command-line inference scripts, but no user-friendly interface. 0x0funky built AudioGhost AI to provide a Gradio-based GUI that eliminates the need to touch any terminal commands or Python inference code.

The tool is opinionated in the best way: it focuses on doing one thing well — text-guided object-oriented audio separation — rather than trying to be a full digital audio workstation. Users describe the sound they want to extract, adjust the region of interest on a waveform display, and export the isolated track.


How Does SAM-Audio’s Object-Oriented Approach Compare to Traditional Source Separation?

Traditional source separation models — such as Demucs or Spleeter — are classifier-based. They are trained to recognize specific categories (vocals, drums, bass, other) and can only output those predefined stems. If you want to isolate “just the hi-hat” rather than the entire drum bus, or “the rhythm guitar on the left channel” rather than all guitars, these models fall short.

SAM-Audio takes a fundamentally different approach. Instead of classifying sounds into fixed buckets, it uses a text-conditioned diffusion model that can attend to any sound described in natural language. This is the same architectural philosophy behind Meta’s Segment Anything Model, but adapted for the spectrogram domain rather than the pixel domain.
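To build intuition for spectrogram-domain separation, here is a deliberately simple toy in pure Python: it mixes two sine tones, then “separates” one by masking the other’s frequency bins. The real model learns such masks from text conditioning; everything here (the frequencies, the hand-picked bin indices) is illustrative only and is not how SAM-Audio is implemented.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2); fine for a toy signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

SR, N = 8000, 400  # 50 ms of audio at 8 kHz
# Mixture: a 440 Hz "target" tone plus a quieter 880 Hz "background" tone.
mix = [math.sin(2 * math.pi * 440 * t / SR) +
       0.5 * math.sin(2 * math.pi * 880 * t / SR) for t in range(N)]

spectrum = dft(mix)
target_bin = round(440 * N / SR)  # DFT bin holding the 440 Hz component

# Mask: keep only the target's bins (and their conjugate mirrors), zero the rest.
masked = [X if min(abs(k - target_bin), abs((N - k) - target_bin)) <= 1 else 0
          for k, X in enumerate(spectrum)]
isolated = idft(masked)

# The isolated signal's dominant frequency is now 440 Hz.
iso_spec = dft(isolated)
peak_bin = max(range(1, N // 2), key=lambda k: abs(iso_spec[k]))
print(peak_bin * SR / N)  # -> 440.0
```

A learned model replaces the hand-written mask with one predicted from the text prompt, which is what lets it target arbitrary sound categories instead of fixed bins.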

| Separation Approach | Category Flexibility | Output Quality | VRAM Requirement | GUI Availability |
| --- | --- | --- | --- | --- |
| AudioGhost AI + SAM-Audio | Unlimited (any text prompt) | High | 4-6 GB | Yes (Gradio) |
| Meta SAM-Audio (CLI) | Unlimited (any text prompt) | High | 4-6 GB | No (terminal only) |
| Demucs (Hybrid) | Fixed (vocals, drums, bass, other) | Very High | 2-4 GB | Third-party only |
| Spleeter | Fixed (2/4/5 stems) | Moderate | 1-2 GB | Third-party only |
| Cloud APIs (Pyannote, etc.) | Varies by provider | High | None (server-side) | Yes (web) |

What Hardware Do You Need to Run AudioGhost AI?

One of AudioGhost AI’s strongest selling points is its modest hardware appetite. The SAM-Audio model uses a distilled architecture that achieves strong separation quality without the VRAM demands of larger audio foundation models.

| GPU Model | VRAM | Expected Performance |
| --- | --- | --- |
| NVIDIA GTX 1060 / 1070 | 6 GB / 8 GB | Full inference, ~15-30 sec per clip |
| NVIDIA RTX 2060 / 3060 | 6 GB / 12 GB | Full inference, faster with CUDA cores |
| NVIDIA RTX 4060 / 4070 | 8 GB / 12 GB | Full inference, near real-time |
| Apple M1/M2/M3 (Metal) | 8 GB+ unified | Supported via PyTorch MPS backend |
| Cloud (RunPod, Colab, etc.) | N/A | Full performance |

The application supports CUDA (NVIDIA), Metal Performance Shaders (Apple Silicon), and CPU-only fallback, though the CPU path is significantly slower and recommended only for short clips.
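That fallback order is typically implemented with PyTorch’s standard availability checks. A minimal sketch of the usual pattern — not AudioGhost AI’s actual source code:

```python
def pick_device() -> str:
    """Return the best available compute backend: CUDA, Apple MPS, or CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: CPU-only path
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)  # guard for older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon via Metal Performance Shaders
    return "cpu"  # slow fallback; best reserved for short clips

print(pick_device())
```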


What Does the AudioGhost AI GUI Look Like and How Do You Use It?

AudioGhost AI provides a clean, three-panel interface built on Gradio, making it accessible both locally and remotely via a browser:

  1. Input panel on the left: Upload an audio file (WAV, MP3, FLAC up to several minutes in length) and type a text description of the sound to isolate.
  2. Visualization panel in the center: A waveform display with spectrogram overlay. Users can select a time region to restrict separation to a specific section of the audio.
  3. Output panel on the right: Two downloadable audio files — the isolated sound source and the residual background audio.

The workflow is straightforward: upload, describe, select region, separate, and export. No configuration files, no command-line arguments, and no Python scripting knowledge required.
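The region-selection step boils down to simple index arithmetic: a selection in seconds on the waveform maps to a range of sample offsets. A sketch under assumed behavior (the function name and clamping rules are illustrative, not AudioGhost AI’s API):

```python
def region_to_samples(start_s: float, end_s: float,
                      sample_rate: int, total_samples: int) -> tuple[int, int]:
    """Convert a [start_s, end_s] waveform selection (in seconds) into a
    clamped, half-open range of sample indices."""
    start = max(0, round(start_s * sample_rate))
    end = min(total_samples, round(end_s * sample_rate))
    if end <= start:
        raise ValueError("selection is empty after clamping")
    return start, end

# Selecting 1.5 s - 3.0 s of a 10 s clip at 44.1 kHz:
print(region_to_samples(1.5, 3.0, 44100, 441000))  # -> (66150, 132300)
```

Restricting separation to the selected range also shortens inference time, since the model only processes the samples inside it.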



Getting Started with AudioGhost AI

To run AudioGhost AI locally, you need Python 3.10 or later, a compatible GPU (optional but recommended), and the following setup steps:

  1. Clone the repository from github.com/0x0funky/audioghost-ai
  2. Install dependencies with pip install -r requirements.txt
  3. Launch the GUI with python app.py
  4. Open the provided local URL in your browser

The first launch downloads the SAM-Audio model weights automatically (approximately 2 GB). Subsequent launches are instant.
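That skip-if-cached behavior usually amounts to a file check before downloading. A sketch under assumed paths — the real cache directory, filename, and size threshold are whatever the app configures, not these:

```python
from pathlib import Path

# Hypothetical cache location; the actual directory and filename may differ.
WEIGHTS = Path.home() / ".cache" / "audioghost" / "sam_audio_weights.bin"

def needs_download(path: Path, min_bytes: int = 1_500_000_000) -> bool:
    """Re-download only if the weights file is missing or looks truncated
    (smaller than the expected ~2 GB checkpoint)."""
    return not path.exists() or path.stat().st_size < min_bytes

print(needs_download(WEIGHTS))
```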


Limitations and Current Development Status

As a wrapper around a research model, AudioGhost AI inherits some limitations from SAM-Audio itself. The current version works best with clean mixes where the target sound source has distinct spectral characteristics. Very dense mixes with heavy reverb or multiple similar instruments (e.g., two electric guitars playing the same chord progression) can produce artifacts. The model also has a practical limit of approximately 3 to 5 minutes of audio per inference run due to attention window constraints.
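A common workaround for such a length cap is to run inference over overlapping windows and cross-fade the results back together. A sketch of just the windowing arithmetic — the 3-minute window and 2-second overlap are assumed numbers, and the cross-fade itself is omitted:

```python
def chunk_ranges(n_samples: int, sample_rate: int,
                 window_s: float = 180.0, overlap_s: float = 2.0):
    """Split a long clip into overlapping (start, end) sample ranges that
    each fit within the model's attention window."""
    win = int(window_s * sample_rate)
    hop = win - int(overlap_s * sample_rate)  # advance by window minus overlap
    ranges, start = [], 0
    while True:
        end = min(start + win, n_samples)
        ranges.append((start, end))
        if end == n_samples:
            break
        start += hop
    return ranges

# A 10-minute clip at 16 kHz splits into four ~3-minute windows:
print(len(chunk_ranges(10 * 60 * 16000, 16000)))  # -> 4
```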

Development is active, with the community contributing improvements to the Gradio interface, adding batch processing support, and experimenting with fine-tuned variants of SAM-Audio for specific use cases like podcast dialogue extraction and field recording cleanup.

