For decades, isolating a single instrument from a mixed recording required either expensive multi-track access from the original studio session or painstaking spectral editing by an experienced audio engineer. AudioGhost AI rewrites this workflow by bringing Meta’s state-of-the-art SAM-Audio model to the desktop with a straightforward graphical interface, letting anyone separate sounds with nothing more than a text prompt.
Developed by the open-source contributor 0x0funky, AudioGhost AI is a purpose-built wrapper around Meta AI’s SAM-Audio research model. SAM-Audio extends the “Segment Anything” philosophy — originally developed for image segmentation — into the audio domain. The original SAM model made it possible to click on any pixel in an image and isolate that object; SAM-Audio applies the same principle to sound. Describe the sound source you want (“the lead vocal,” “the snare drum,” “the acoustic guitar”) and the model isolates it from the rest of the mix with impressive fidelity.
What makes AudioGhost AI particularly notable is its accessibility. Consumer-grade audio separation tools have historically required either cloud API subscriptions or powerful server-grade GPUs. AudioGhost AI runs comfortably on GPUs with 4 to 6 GB of VRAM — a range that covers the vast majority of consumer and gaming GPUs currently in use. This opens professional-quality audio separation to independent musicians, podcasters, video editors, and hobbyists who lack access to high-end compute resources.
What Exactly Is AudioGhost AI and Why Was It Created?
AudioGhost AI was created to bridge the gap between Meta’s research output and practical, everyday use. Meta published SAM-Audio as a research model with command-line inference scripts, but no user-friendly interface. 0x0funky built AudioGhost AI to provide a Gradio-based GUI that removes the need to touch a terminal or write any Python inference code.
The tool is opinionated in the best way: it focuses on doing one thing well — text-guided object-oriented audio separation — rather than trying to be a full digital audio workstation. Users describe the sound they want to extract, adjust the region of interest on a waveform display, and export the isolated track.
```mermaid
graph LR
    A[Input mixed audio file] --> B[AudioGhost AI GUI]
    C[Text prompt describing target sound] --> B
    B --> D[SAM-Audio inference engine]
    D --> E[Isolated sound source]
    D --> F[Residual background audio]
    E --> G[Export WAV/MP3]
    F --> G
```

How Does SAM-Audio’s Object-Oriented Approach Compare to Traditional Source Separation?
Traditional source separation models — such as Demucs or Spleeter — are classifier-based. They are trained to recognize specific categories (vocals, drums, bass, other) and can only output those predefined stems. If you want to isolate “just the hi-hat” rather than the entire drum bus, or “the rhythm guitar on the left channel” rather than all guitars, these models fall short.
SAM-Audio takes a fundamentally different approach. Instead of classifying sounds into fixed buckets, it uses a text-conditioned diffusion model that can attend to any sound described in natural language. This is the same architectural philosophy behind Meta’s Segment Anything Model, but adapted for the spectrogram domain rather than the pixel domain.
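In pseudocode terms, the conditioning pattern looks roughly like the sketch below. This is an illustrative Python sketch of text-conditioned, mask-based separation in general, not SAM-Audio’s actual API; the `separate`, `model`, and `text_encoder` names are hypothetical.

```python
# Illustrative sketch of text-conditioned separation, NOT the real
# SAM-Audio API: the model sees a spectrogram plus a text embedding
# and predicts a soft mask over time-frequency bins.
import torch

def separate(model, text_encoder, mixture_spec: torch.Tensor, prompt: str):
    """mixture_spec: (freq, time) magnitude spectrogram of the full mix."""
    text_emb = text_encoder(prompt)        # embed the natural-language prompt
    mask = model(mixture_spec, text_emb)   # predicted mask, values in [0, 1]
    target = mask * mixture_spec           # the sound the prompt describes
    residual = (1 - mask) * mixture_spec   # everything else in the mix
    return target, residual
```

A diffusion-based model such as the one the paper describes would generate the target iteratively rather than predicting a single mask, but the conditioning pattern is the same: audio plus text embedding in, separated source out.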
| Separation Approach | Category Flexibility | Output Quality | VRAM Requirement | GUI Availability |
|---|---|---|---|---|
| AudioGhost AI + SAM-Audio | Unlimited (any text prompt) | High | 4-6 GB | Yes (Gradio) |
| Meta SAM-Audio (CLI) | Unlimited (any text prompt) | High | 4-6 GB | No (terminal only) |
| Demucs (Hybrid) | Fixed (vocals, drums, bass, other) | Very High | 2-4 GB | Third-party only |
| Spleeter | Fixed (2/4/5 stems) | Moderate | 1-2 GB | Third-party only |
| Cloud APIs (Pyannote, etc.) | Varies by provider | High | None (server-side) | Yes (web) |
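To make the fixed-category limitation in the table concrete, here is how a Demucs separation is typically invoked from Python. This assumes Demucs v4 installed via pip; note that you choose from the model’s trained stems rather than describing an arbitrary sound.

```python
# Fixed-stem separation with Demucs, for contrast: you pick one of the
# trained stem categories; there is no free-text prompt.
# Assumes `pip install demucs` (v4). Equivalent to the CLI call:
#   demucs --two-stems=vocals song.wav
from demucs import separate

separate.main(["--two-stems", "vocals", "song.wav"])
```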
What Hardware Do You Need to Run AudioGhost AI?
One of AudioGhost AI’s strongest selling points is its modest hardware appetite. The SAM-Audio model uses a distilled architecture that achieves strong separation quality without the VRAM demands of larger audio foundation models.
| GPU Model | VRAM | Expected Performance |
|---|---|---|
| NVIDIA GTX 1060 / 1070 | 6 GB / 8 GB | Full inference, ~15-30 sec per clip |
| NVIDIA RTX 2060 / 3060 | 6 GB / 12 GB | Full inference, noticeably faster than GTX-class cards |
| NVIDIA RTX 4060 / 4070 | 8 GB / 12 GB | Full inference, near real-time |
| Apple M1/M2/M3 (Metal) | 8 GB+ unified | Supported via PyTorch MPS backend |
| Cloud (RunPod, Colab, etc.) | N/A | Full performance |
The application supports CUDA (NVIDIA), Metal Performance Shaders (Apple Silicon), and CPU-only fallback, though the CPU path is significantly slower and recommended only for short clips.
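In PyTorch, that fallback chain amounts to a few lines; the following is a generic pattern, not AudioGhost AI’s exact code:

```python
# Device selection mirroring the fallback order described above:
# CUDA first, then Apple's MPS (Metal) backend, then CPU.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
# model = model.to(device)  # move the model weights to the chosen device
```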
What Does the AudioGhost AI GUI Look Like and How Do You Use It?
AudioGhost AI provides a clean, three-panel interface built on Gradio, making it accessible both locally and remotely via a browser:
- Input panel on the left: Upload an audio file (WAV, MP3, FLAC up to several minutes in length) and type a text description of the sound to isolate.
- Visualization panel in the center: A waveform display with spectrogram overlay. Users can select a time region to restrict separation to a specific section of the audio.
- Output panel on the right: Two downloadable audio files — the isolated sound source and the residual background audio.
The workflow is straightforward: upload, describe, select region, separate, and export. No configuration files, no command-line arguments, and no Python scripting knowledge required.
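For readers curious how such a layout comes together, here is a heavily simplified Gradio sketch of the upload, describe, and export flow. It is illustrative only, not AudioGhost AI’s actual source; the center waveform panel and region selection are omitted for brevity.

```python
# Minimal two-column Gradio sketch of an upload/describe/export flow.
import gradio as gr

def run_separation(audio_path, prompt):
    # Placeholder: a real app would call the separation model here and
    # return paths to the isolated and residual audio files.
    return audio_path, audio_path

with gr.Blocks(title="AudioGhost AI (sketch)") as demo:
    with gr.Row():
        with gr.Column():  # input panel
            audio_in = gr.Audio(type="filepath", label="Mixed audio")
            prompt = gr.Textbox(label="Sound to isolate")
            go = gr.Button("Separate")
        with gr.Column():  # output panel
            isolated = gr.Audio(label="Isolated source")
            residual = gr.Audio(label="Residual background")
    go.click(run_separation, [audio_in, prompt], [isolated, residual])

demo.launch()
```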
Getting Started with AudioGhost AI
To run AudioGhost AI locally, you need Python 3.10 or later, a compatible GPU (optional but recommended), and the following setup steps:
- Clone the repository from github.com/0x0funky/audioghost-ai
- Install dependencies with `pip install -r requirements.txt`
- Launch the GUI with `python app.py`
- Open the provided local URL in your browser
The first launch automatically downloads the SAM-Audio model weights (approximately 2 GB); subsequent launches reuse the locally cached copy and start in seconds.
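If the weights are distributed through the Hugging Face Hub, which is the common pattern for models like this (an assumption; the repo id and filename below are hypothetical), the download-and-cache behavior would look like:

```python
# Typical auto-download pattern via the Hugging Face Hub; the repo id
# and filename are hypothetical placeholders.
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="facebook/sam-audio",      # hypothetical repo id
    filename="sam_audio_weights.pt",   # hypothetical filename
)
# hf_hub_download caches under ~/.cache/huggingface, so later launches
# resolve to the local file instead of re-downloading the ~2 GB weights.
```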
```mermaid
sequenceDiagram
    participant User
    participant GUI as AudioGhost GUI
    participant Model as SAM-Audio Model
    participant Disk as Local Storage
    User->>GUI: Upload audio file
    User->>GUI: Enter text prompt
    GUI->>Model: Send spectrogram + text embedding
    Model->>Model: Diffusion-based separation
    Model-->>GUI: Return isolated waveform
    GUI-->>User: Display results + export buttons
    User->>GUI: Click Export
    GUI->>Disk: Save WAV/MP3 files
```

Limitations and Current Development Status
As a wrapper around a research model, AudioGhost AI inherits some limitations from SAM-Audio itself. The current version works best with clean mixes where the target sound source has distinct spectral characteristics. Very dense mixes with heavy reverb or multiple similar instruments (e.g., two electric guitars playing the same chord progression) can produce artifacts. The model also has a practical limit of approximately 3 to 5 minutes of audio per inference run due to attention window constraints.
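A common workaround for such length limits is chunked inference with crossfaded overlaps. The sketch below shows the general technique; it is not a built-in AudioGhost AI feature, and `separate_fn` is a stand-in for a call into the model.

```python
# Generic chunk-and-crossfade workaround for a fixed inference length.
# `separate_fn` must return an array the same length as its input.
import numpy as np

def chunked_separate(audio: np.ndarray, sr: int, separate_fn,
                     chunk_s: float = 180.0, overlap_s: float = 2.0) -> np.ndarray:
    chunk, overlap = int(chunk_s * sr), int(overlap_s * sr)
    hop = chunk - overlap
    out = np.zeros(len(audio), dtype=np.float32)
    weight = np.zeros(len(audio), dtype=np.float32)
    fade = np.ones(chunk, dtype=np.float32)
    fade[:overlap] = np.linspace(0.0, 1.0, overlap)   # fade-in ramp
    fade[-overlap:] = np.linspace(1.0, 0.0, overlap)  # fade-out ramp
    for start in range(0, len(audio), hop):
        piece = audio[start:start + chunk]
        w = fade[:len(piece)].copy()
        if start == 0:
            w[:overlap] = 1.0                 # no fade-in on the first chunk
        if start + chunk >= len(audio):
            w[overlap:] = 1.0                 # no fade-out on the final chunk
        out[start:start + len(piece)] += separate_fn(piece) * w
        weight[start:start + len(piece)] += w
    return out / np.maximum(weight, 1e-8)     # normalize overlapped regions
```

Splitting at quiet points rather than fixed offsets further reduces audible seams.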
Development is active, with the community contributing improvements to the Gradio interface, adding batch processing support, and experimenting with fine-tuned variants of SAM-Audio for specific use cases like podcast dialogue extraction and field recording cleanup.
Further Reading
- AudioGhost AI GitHub Repository — Source code, installation guide, and issue tracker
- Meta AI’s SAM-Audio Paper — The research publication behind the underlying model
- Meta SAM-Audio GitHub — Official model weights and CLI inference scripts
- Gradio Documentation — Framework used for the GUI interface
- Demucs: Music Source Separation in the Waveform Domain — Alternative open-source separation approach