AudioGhost AI: Open-Source Object-Oriented Audio Separation with Meta's SAM-Audio

AudioGhost AI wraps Meta's SAM-Audio model in a user-friendly GUI for text-guided sound separation, running on consumer GPUs with 4-6GB VRAM.

For decades, isolating a single instrument from a mixed recording required either expensive multi-track access from the original studio session or painstaking spectral editing by an experienced audio engineer. AudioGhost AI rewrites this workflow by bringing Meta’s state-of-the-art SAM-Audio model to the desktop with a straightforward graphical interface, letting anyone separate sounds with nothing more than a text prompt.

Developed by the open-source contributor 0x0funky, AudioGhost AI is a purpose-built wrapper around Meta AI’s SAM-Audio research model. SAM-Audio extends the “Segment Anything” philosophy — originally developed for image segmentation — into the audio domain. The original SAM model made it possible to click on any pixel in an image and isolate that object; SAM-Audio applies the same principle to sound. Describe the sound source you want (“the lead vocal,” “the snare drum,” “the acoustic guitar”) and the model isolates it from the rest of the mix with impressive fidelity.

What makes AudioGhost AI particularly notable is its accessibility. Consumer-grade audio separation tools have historically required either cloud API subscriptions or powerful server-grade GPUs. AudioGhost AI runs comfortably on GPUs with 4 to 6 GB of VRAM — a range that covers the vast majority of consumer and gaming GPUs currently in use. This opens professional-quality audio separation to independent musicians, podcasters, video editors, and hobbyists who lack access to high-end compute resources.


What Exactly Is AudioGhost AI and Why Was It Created?

AudioGhost AI was created to bridge the gap between Meta’s research output and practical, everyday use. Meta published SAM-Audio as a research model with command-line inference scripts, but no user-friendly interface. 0x0funky built AudioGhost AI to provide a Gradio-based GUI that eliminates the need to touch any terminal commands or Python inference code.

The tool is opinionated in the best way: it focuses on doing one thing well — text-guided object-oriented audio separation — rather than trying to be a full digital audio workstation. Users describe the sound they want to extract, adjust the region of interest on a waveform display, and export the isolated track.


How Does SAM-Audio’s Object-Oriented Approach Compare to Traditional Source Separation?

Traditional source separation models — such as Demucs or Spleeter — are classifier-based. They are trained to recognize specific categories (vocals, drums, bass, other) and can only output those predefined stems. If you want to isolate “just the hi-hat” rather than the entire drum bus, or “the rhythm guitar on the left channel” rather than all guitars, these models fall short.

SAM-Audio takes a fundamentally different approach. Instead of classifying sounds into fixed buckets, it uses a text-conditioned diffusion model that can attend to any sound described in natural language. This is the same architectural philosophy behind Meta’s Segment Anything Model, but adapted for the spectrogram domain rather than the pixel domain.
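To build intuition for spectrogram-domain separation, here is a deliberately simple toy in pure Python: it mixes two sine tones, then “separates” one by masking the other’s frequency bins. The real model learns such masks from text conditioning; everything here (the frequencies, the hand-picked bin indices) is illustrative only and is not how SAM-Audio is implemented.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2); fine for a toy signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

SR, N = 8000, 400  # 50 ms of audio at 8 kHz
# Mixture: a 440 Hz "target" tone plus a quieter 880 Hz "background" tone.
mix = [math.sin(2 * math.pi * 440 * t / SR) +
       0.5 * math.sin(2 * math.pi * 880 * t / SR) for t in range(N)]

spectrum = dft(mix)
target_bin = round(440 * N / SR)  # DFT bin holding the 440 Hz component

# Mask: keep only the target's bins (and their conjugate mirrors), zero the rest.
masked = [X if min(abs(k - target_bin), abs((N - k) - target_bin)) <= 1 else 0
          for k, X in enumerate(spectrum)]
isolated = idft(masked)

# The isolated signal's dominant frequency is now 440 Hz.
iso_spec = dft(isolated)
peak_bin = max(range(1, N // 2), key=lambda k: abs(iso_spec[k]))
print(peak_bin * SR / N)  # -> 440.0
```

A learned model replaces the hand-written mask with one predicted from the text prompt, which is what lets it target arbitrary sound categories instead of fixed bins.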

| Separation Approach | Category Flexibility | Output Quality | VRAM Requirement | GUI Availability |
| --- | --- | --- | --- | --- |
| AudioGhost AI + SAM-Audio | Unlimited (any text prompt) | High | 4-6 GB | Yes (Gradio) |
| Meta SAM-Audio (CLI) | Unlimited (any text prompt) | High | 4-6 GB | No (terminal only) |
| Demucs (Hybrid) | Fixed (vocals, drums, bass, other) | Very High | 2-4 GB | Third-party only |
| Spleeter | Fixed (2/4/5 stems) | Moderate | 1-2 GB | Third-party only |
| Cloud APIs (Pyannote, etc.) | Varies by provider | High | None (server-side) | Yes (web) |

What Hardware Do You Need to Run AudioGhost AI?

One of AudioGhost AI’s strongest selling points is its modest hardware appetite. The SAM-Audio model uses a distilled architecture that achieves strong separation quality without the VRAM demands of larger audio foundation models.

| GPU Model | VRAM | Expected Performance |
| --- | --- | --- |
| NVIDIA GTX 1060 / 1070 | 6 GB / 8 GB | Full inference, ~15-30 sec per clip |
| NVIDIA RTX 2060 / 3060 | 6 GB / 12 GB | Full inference, faster with CUDA cores |
| NVIDIA RTX 4060 / 4070 | 8 GB / 12 GB | Full inference, near real-time |
| Apple M1/M2/M3 (Metal) | 8 GB+ unified | Supported via PyTorch MPS backend |
| Cloud (RunPod, Colab, etc.) | N/A | Full performance |

The application supports CUDA (NVIDIA), Metal Performance Shaders (Apple Silicon), and CPU-only fallback, though the CPU path is significantly slower and recommended only for short clips.
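That fallback order is typically implemented with PyTorch’s standard availability checks. A minimal sketch of the usual pattern — not AudioGhost AI’s actual source code:

```python
def pick_device() -> str:
    """Return the best available compute backend: CUDA, Apple MPS, or CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: CPU-only path
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)  # guard for older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon via Metal Performance Shaders
    return "cpu"  # slow fallback; best reserved for short clips

print(pick_device())
```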


What Does the AudioGhost AI GUI Look Like and How Do You Use It?

AudioGhost AI provides a clean, three-panel interface built on Gradio, making it accessible both locally and remotely via a browser:

  1. Input panel on the left: Upload an audio file (WAV, MP3, FLAC up to several minutes in length) and type a text description of the sound to isolate.
  2. Visualization panel in the center: A waveform display with spectrogram overlay. Users can select a time region to restrict separation to a specific section of the audio.
  3. Output panel on the right: Two downloadable audio files — the isolated sound source and the residual background audio.

The workflow is straightforward: upload, describe, select region, separate, and export. No configuration files, no command-line arguments, and no Python scripting knowledge required.
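The region-selection step boils down to simple index arithmetic: a selection in seconds on the waveform maps to a range of sample offsets. A sketch under assumed behavior (the function name and clamping rules are illustrative, not AudioGhost AI’s API):

```python
def region_to_samples(start_s: float, end_s: float,
                      sample_rate: int, total_samples: int) -> tuple[int, int]:
    """Convert a [start_s, end_s] waveform selection (in seconds) into a
    clamped, half-open range of sample indices."""
    start = max(0, round(start_s * sample_rate))
    end = min(total_samples, round(end_s * sample_rate))
    if end <= start:
        raise ValueError("selection is empty after clamping")
    return start, end

# Selecting 1.5 s - 3.0 s of a 10 s clip at 44.1 kHz:
print(region_to_samples(1.5, 3.0, 44100, 441000))  # -> (66150, 132300)
```

Restricting separation to the selected range also shortens inference time, since the model only processes the samples inside it.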



Getting Started with AudioGhost AI

To run AudioGhost AI locally, you need Python 3.10 or later, a compatible GPU (optional but recommended), and the following setup steps:

  1. Clone the repository from github.com/0x0funky/audioghost-ai
  2. Install dependencies with pip install -r requirements.txt
  3. Launch the GUI with python app.py
  4. Open the provided local URL in your browser

The first launch downloads the SAM-Audio model weights automatically (approximately 2 GB). Subsequent launches are instant.
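That skip-if-cached behavior usually amounts to a file check before downloading. A sketch under assumed paths — the real cache directory, filename, and size threshold are whatever the app configures, not these:

```python
from pathlib import Path

# Hypothetical cache location; the actual directory and filename may differ.
WEIGHTS = Path.home() / ".cache" / "audioghost" / "sam_audio_weights.bin"

def needs_download(path: Path, min_bytes: int = 1_500_000_000) -> bool:
    """Re-download only if the weights file is missing or looks truncated
    (smaller than the expected ~2 GB checkpoint)."""
    return not path.exists() or path.stat().st_size < min_bytes

print(needs_download(WEIGHTS))
```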


Limitations and Current Development Status

As a wrapper around a research model, AudioGhost AI inherits some limitations from SAM-Audio itself. The current version works best with clean mixes where the target sound source has distinct spectral characteristics. Very dense mixes with heavy reverb or multiple similar instruments (e.g., two electric guitars playing the same chord progression) can produce artifacts. The model also has a practical limit of approximately 3 to 5 minutes of audio per inference run due to attention window constraints.
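A common workaround for such a length cap is to run inference over overlapping windows and cross-fade the results back together. A sketch of just the windowing arithmetic — the 3-minute window and 2-second overlap are assumed numbers, and the cross-fade itself is omitted:

```python
def chunk_ranges(n_samples: int, sample_rate: int,
                 window_s: float = 180.0, overlap_s: float = 2.0):
    """Split a long clip into overlapping (start, end) sample ranges that
    each fit within the model's attention window."""
    win = int(window_s * sample_rate)
    hop = win - int(overlap_s * sample_rate)  # advance by window minus overlap
    ranges, start = [], 0
    while True:
        end = min(start + win, n_samples)
        ranges.append((start, end))
        if end == n_samples:
            break
        start += hop
    return ranges

# A 10-minute clip at 16 kHz splits into four ~3-minute windows:
print(len(chunk_ranges(10 * 60 * 16000, 16000)))  # -> 4
```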

Development is active, with the community contributing improvements to the Gradio interface, adding batch processing support, and experimenting with fine-tuned variants of SAM-Audio for specific use cases like podcast dialogue extraction and field recording cleanup.

