MLX-Audio: TTS, STT, and STS Library Optimized for Apple Silicon

MLX-Audio is a text-to-speech, speech-to-text, and speech-to-speech library built on Apple's MLX framework with support for dozens of models.

Apple Silicon Macs equipped with M-series chips, from the M1 through the latest M4 generation, pack extraordinary computational power, particularly for machine learning workloads. Their unified memory architecture allows models to access large amounts of fast memory without the bottlenecks of traditional CPU-GPU data transfer. MLX-Audio, an open-source Python library built on Apple’s MLX framework, is purpose-built to exploit this hardware advantage for audio AI.

MLX-Audio provides a unified interface for text-to-speech, speech-to-text, and speech-to-speech conversion, supporting dozens of models from OpenAI’s Whisper (for transcription) to Kokoro and VoiceCraft (for synthesis). It brings together capabilities that are typically scattered across multiple libraries and frameworks, all optimized to run efficiently on Mac hardware.

The library was motivated by a simple observation: while powerful audio AI models exist, running them on Mac hardware was often a frustrating experience. PyTorch-based implementations ran slowly or not at all on Apple Silicon, and users had to juggle incompatible dependencies. MLX-Audio solves this by leveraging MLX’s native Apple Silicon optimization and providing a consistent, well-documented interface for all supported models.


How Does MLX-Audio Leverage Apple’s MLX Framework?

The MLX framework is Apple’s answer to PyTorch and JAX – an array framework designed specifically for Apple Silicon. Unlike PyTorch, which runs on Apple Silicon through the MPS (Metal Performance Shaders) backend with varying compatibility, MLX is natively designed for the M-series architecture.

The library's layered architecture, in Mermaid notation:

graph LR
    A[MLX-Audio Library] --> B[Text-to-Speech\nKokoro, Parler,\nVoiceCraft, CosyVoice]
    A --> C[Speech-to-Text\nWhisper, Distil-Whisper,\nWhisper-Timestamped]
    A --> D[Speech-to-Speech\nVoiceCraft STS,\ncustom pipelines]
    B --> E[MLX Framework\nApple Silicon Native]
    C --> E
    D --> E
    E --> F[Unified Memory\nM-series Chip]
    F --> G[CPU Cores]
    F --> H[GPU Cores]
    F --> I[Neural Engine]

The unified memory architecture is the key advantage. On a traditional system, data must be copied from system RAM to GPU VRAM before inference can begin, adding latency and limiting the effective batch size. On Apple Silicon, the CPU, GPU, and Neural Engine all share the same memory pool, eliminating this copy overhead. MLX-Audio exploits this to efficiently run large audio models that would struggle on discrete GPU setups with limited VRAM.
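For a sense of scale, here is a back-of-envelope calculation of the copy cost that unified memory eliminates. The bandwidth figure is an assumed rough value for PCIe 4.0 x16, not from the article, and ignores real-world factors like pinned memory and overlapped transfers.

```python
# Illustrative back-of-envelope for the copy overhead unified memory removes.
# The bandwidth figure is a rough PCIe 4.0 x16 estimate, used only for scale.
model_gb = 3.0            # ~3 GB of Whisper Large weights
pcie_gb_per_s = 32.0      # assumed theoretical PCIe 4.0 x16 bandwidth
copy_seconds = model_gb / pcie_gb_per_s
print(f"host-to-VRAM copy: ~{copy_seconds * 1000:.0f} ms per full weight transfer")
```

On Apple Silicon this transfer simply never happens: the weights sit in the one memory pool that the CPU, GPU, and Neural Engine all address directly.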

A practical implication: a Whisper Large model that requires approximately 3 GB of GPU VRAM on an NVIDIA card runs comfortably on a Mac with 16 GB of unified memory, and the MLX-optimized implementation often achieves comparable or better throughput due to reduced data transfer overhead.


What Audio Models Does MLX-Audio Support?

MLX-Audio’s model support spans the major categories of audio AI, with each model offering different trade-offs between quality, speed, and resource usage.

| Model | Category | Key Strengths | Speed on M3 Max | Model Size |
| --- | --- | --- | --- | --- |
| Whisper Large V3 | STT | Best accuracy, 99+ languages | 3-4x realtime | 3.0 GB |
| Distil-Whisper Large V3 | STT | 2x faster than Whisper, minimal accuracy loss | 6-8x realtime | 1.5 GB |
| Whisper Tiny | STT | Near-instant, very low resource | 50x+ realtime | 150 MB |
| Kokoro | TTS | Fast, lightweight, 10+ languages | 15x+ realtime | 350 MB |
| Parler | TTS | Expressive, controllable voice | 5-8x realtime | 800 MB |
| VoiceCraft | TTS/STS | Zero-shot voice cloning from a 3 s sample | 2-3x realtime | 1.2 GB |
| CosyVoice | TTS | Chinese + English, natural prosody | 3-5x realtime | 900 MB |
| F5-TTS | TTS | High-quality English, prompt-based style | 4-6x realtime | 600 MB |

The “speed” figures represent how much faster than real-time the model runs on an M3 Max MacBook Pro with 64 GB memory. Whisper Tiny, for example, can transcribe a one-minute audio clip in about one second.
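Those realtime factors convert directly into wall-clock estimates. A small helper (illustrative, not part of the library) makes the arithmetic explicit:

```python
# Converting the table's realtime factors into wall-clock estimates.
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to transcribe, given an Nx-realtime speed factor."""
    return audio_seconds / realtime_factor

print(transcription_seconds(60, 50))  # Whisper Tiny, 1-minute clip: 1.2 s
print(transcription_seconds(60, 3))   # Whisper Large V3, slower bound: 20.0 s
```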


How Do You Get Started with MLX-Audio?

Getting started with MLX-Audio is straightforward, especially if you have a recent Apple Silicon Mac and a Python environment set up.

| Step | Command / Code | Notes |
| --- | --- | --- |
| Install | pip install mlx-audio | Requires macOS with Apple Silicon |
| Transcribe audio | import mlx_audio; text = mlx_audio.transcribe("audio.mp3") | Uses Whisper by default |
| Generate speech | mlx_audio.tts("Hello world", voice="kokoro") | Text-to-speech with Kokoro |
| Voice cloning | mlx_audio.tts("Custom voice", voice_ref="sample.wav", model="voicecraft") | Clone voice from a 3-second sample |
| STS pipeline | mlx_audio.sts("input.mp3", target_voice="speaker.wav") | Speech-to-speech with voice transfer |

# Basic transcription example
import mlx_audio

# Transcribe an audio file
result = mlx_audio.transcribe(
    "meeting_recording.mp3",
    model="whisper-large-v3",
    language="en"
)
print(result.text)

# Text-to-speech
mlx_audio.tts(
    "Welcome to the MLX-Audio tutorial.",
    voice="af_bella",  # Kokoro voice
    output="output.wav"
)

# Voice cloning with VoiceCraft
mlx_audio.tts(
    "This is a cloned voice from a short sample.",
    voice_ref="sample_speech.wav",
    model="voicecraft",
    output="cloned_output.wav"
)

The library includes command-line tools for quick use without writing Python code, and supports batch processing for transcribing or generating multiple audio files in one command.
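A batch run can also be scripted directly in Python. The sketch below is a minimal helper, assuming the mlx_audio.transcribe API shown in the article's examples; only the file-collection logic is library-independent.

```python
# Sketch of a batch-transcription helper. The mlx_audio.transcribe call in the
# usage note mirrors the article's examples; treat that exact API as an assumption.
from pathlib import Path
from typing import Callable, Dict, List

def collect_audio_files(folder: str, exts=(".mp3", ".wav", ".m4a")) -> List[Path]:
    """Gather audio files in a folder, sorted for a stable processing order."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in exts)

def transcribe_all(folder: str, transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Apply a transcription function to every audio file in `folder`."""
    return {p.name: transcribe(str(p)) for p in collect_audio_files(folder)}

# Usage with the article's (assumed) API:
# import mlx_audio
# texts = transcribe_all("recordings", lambda f: mlx_audio.transcribe(f).text)
```

Passing the transcription function in as a callable keeps the loop testable and lets you swap models (or a cloud fallback) without touching the file handling.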


What Are Practical Applications of MLX-Audio?

MLX-Audio’s combination of local processing, broad model support, and Apple Silicon optimization makes it suitable for a wide range of applications.

| Application | Models Used | Key Advantage |
| --- | --- | --- |
| Podcast transcription | Whisper Large V3 | Runs locally, preserves privacy |
| Voice assistants | Kokoro TTS + Whisper Tiny | Low latency, always-on capability |
| Audiobook narration | VoiceCraft or Parler | Natural, expressive voices |
| Meeting transcription | Distil-Whisper | Fast processing of long recordings |
| Dubbing / localization | Whisper + CosyVoice | STT then TTS in target language |
| Accessibility tools | Kokoro + Whisper Tiny | Real-time captioning and screen reading |
| Voice cloning for content | VoiceCraft | Custom synthetic voices from samples |

A particularly compelling use case is podcast and meeting transcription. Because everything runs locally on the Mac, there are no API costs, no data leaving the device, and no internet connectivity requirements. At the speeds in the table above, a one-hour meeting recording can be transcribed in roughly 15-20 minutes with Whisper Large V3, or in 8-10 minutes with Distil-Whisper.

Privacy-sensitive industries like legal, healthcare, and finance benefit significantly from local-only processing. Audio data never needs to be uploaded to cloud APIs, which is often a regulatory requirement under GDPR, HIPAA, or similar frameworks.


FAQ

What is MLX-Audio? MLX-Audio is an open-source Python library that provides text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) capabilities, built on Apple’s MLX framework for optimized performance on Apple Silicon (M-series) hardware. It supports dozens of models including Whisper, Kokoro, Parler, VoiceCraft, and more.

How does MLX-Audio differ from other audio AI libraries? MLX-Audio is specifically optimized for Apple Silicon through the MLX framework, which leverages the unified memory architecture of M-series chips for efficient model inference. This means it runs significantly faster and more efficiently on Mac hardware compared to generic PyTorch-based implementations, often achieving near real-time performance for complex models.

What TTS models does MLX-Audio support? MLX-Audio supports a growing collection of TTS models including Kokoro (lightweight, fast), Parler (expressive TTS), VoiceCraft (zero-shot voice cloning), CosyVoice, ChatTTS, F5-TTS, and Edge-TTS for multilingual and multi-speaker synthesis.

What STT/speech recognition models does MLX-Audio support? For speech-to-text, MLX-Audio supports OpenAI Whisper (all model sizes from tiny to large), Distil-Whisper (faster, distilled variants), and Whisper-Timestamped (with word-level timestamps). These provide robust transcription across dozens of languages.

Can MLX-Audio be used for real-time applications? Yes, MLX-Audio is suitable for real-time applications. The MLX framework’s efficient memory management and model quantization enable low-latency inference on Apple Silicon. Smaller TTS models like Kokoro can generate speech faster than real-time, while Whisper tiny and base models provide near-instant transcription.
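To make "suitable for real-time" concrete, here is a rough latency-budget calculation for a streaming caption loop, using the realtime factors from the model table. The fixed per-call overhead figure is an assumption for illustration only.

```python
# Rough latency budget for a streaming caption loop (overhead figure assumed).
def chunk_latency_ms(chunk_seconds: float, realtime_factor: float,
                     overhead_ms: float = 20.0) -> float:
    """Inference time for one audio chunk plus an assumed fixed per-call overhead."""
    return chunk_seconds / realtime_factor * 1000 + overhead_ms

# 1-second chunks with Whisper Tiny (~50x realtime from the table above):
print(chunk_latency_ms(1.0, 50))  # 40.0 ms, well under the 1 s chunk interval
```

As long as the result stays below the chunk interval, the pipeline keeps up with live audio; Whisper Tiny leaves generous headroom, while larger models narrow the margin.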

