MLX-Audio: TTS, STT, and STS Library Optimized for Apple Silicon

MLX-Audio is a text-to-speech, speech-to-text, and speech-to-speech library built on Apple's MLX framework with support for dozens of models.

Apple Silicon Macs equipped with M-series chips, from the M1 through the latest M4 generation, pack extraordinary computational power, particularly for machine learning workloads. Their unified memory architecture allows models to access large amounts of fast memory without the bottlenecks of traditional CPU-GPU data transfer. MLX-Audio, an open-source Python library built on Apple’s MLX framework, is purpose-built to exploit this hardware advantage for audio AI.

MLX-Audio provides a unified interface for text-to-speech, speech-to-text, and speech-to-speech conversion, supporting dozens of models from OpenAI’s Whisper (for transcription) to Kokoro and VoiceCraft (for synthesis). It brings together capabilities that are typically scattered across multiple libraries and frameworks, all optimized to run efficiently on Mac hardware.

The library was motivated by a simple observation: while powerful audio AI models exist, running them on Mac hardware was often a frustrating experience. PyTorch-based implementations ran slowly or not at all on Apple Silicon, and users had to juggle incompatible dependencies. MLX-Audio solves this by leveraging MLX’s native Apple Silicon optimization and providing a consistent, well-documented interface for all supported models.


How Does MLX-Audio Leverage Apple’s MLX Framework?

The MLX framework is Apple’s answer to PyTorch and JAX – an array framework designed specifically for Apple Silicon. Unlike PyTorch, which runs on Apple Silicon through the MPS (Metal Performance Shaders) backend with varying compatibility, MLX is natively designed for the M-series architecture.

The library's layered architecture, in Mermaid notation:

graph LR
    A[MLX-Audio Library] --> B[Text-to-Speech\nKokoro, Parler,\nVoiceCraft, CosyVoice]
    A --> C[Speech-to-Text\nWhisper, Distil-Whisper,\nWhisper-Timestamped]
    A --> D[Speech-to-Speech\nVoiceCraft STS,\ncustom pipelines]
    B --> E[MLX Framework\nApple Silicon Native]
    C --> E
    D --> E
    E --> F[Unified Memory\nM-series Chip]
    F --> G[CPU Cores]
    F --> H[GPU Cores]
    F --> I[Neural Engine]

The unified memory architecture is the key advantage. On a traditional system, data must be copied from system RAM to GPU VRAM before inference can begin, adding latency and limiting the effective batch size. On Apple Silicon, the CPU, GPU, and Neural Engine all share the same memory pool, eliminating this copy overhead. MLX-Audio exploits this to efficiently run large audio models that would struggle on discrete GPU setups with limited VRAM.
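For a sense of scale, here is a back-of-envelope calculation of the copy cost that unified memory eliminates. The bandwidth figure is an assumed rough value for PCIe 4.0 x16, not from the article, and ignores real-world factors like pinned memory and overlapped transfers.

```python
# Illustrative back-of-envelope for the copy overhead unified memory removes.
# The bandwidth figure is a rough PCIe 4.0 x16 estimate, used only for scale.
model_gb = 3.0            # ~3 GB of Whisper Large weights
pcie_gb_per_s = 32.0      # assumed theoretical PCIe 4.0 x16 bandwidth
copy_seconds = model_gb / pcie_gb_per_s
print(f"host-to-VRAM copy: ~{copy_seconds * 1000:.0f} ms per full weight transfer")
```

On Apple Silicon this transfer simply never happens: the weights sit in the one memory pool that the CPU, GPU, and Neural Engine all address directly.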

A practical implication: a Whisper Large model that requires approximately 3 GB of GPU VRAM on an NVIDIA card runs comfortably on a Mac with 16 GB of unified memory, and the MLX-optimized implementation often achieves comparable or better throughput due to reduced data transfer overhead.


What Audio Models Does MLX-Audio Support?

MLX-Audio’s model support spans the major categories of audio AI, with each model offering different trade-offs between quality, speed, and resource usage.

| Model | Category | Key Strengths | Speed on M3 Max | Model Size |
| --- | --- | --- | --- | --- |
| Whisper Large V3 | STT | Best accuracy, 99+ languages | 3-4x realtime | 3.0 GB |
| Distil-Whisper Large V3 | STT | 2x faster than Whisper, minimal accuracy loss | 6-8x realtime | 1.5 GB |
| Whisper Tiny | STT | Near-instant, very low resource | 50x+ realtime | 150 MB |
| Kokoro | TTS | Fast, lightweight, 10+ languages | 15x+ realtime | 350 MB |
| Parler | TTS | Expressive, controllable voice | 5-8x realtime | 800 MB |
| VoiceCraft | TTS/STS | Zero-shot voice cloning from a 3 s sample | 2-3x realtime | 1.2 GB |
| CosyVoice | TTS | Chinese + English, natural prosody | 3-5x realtime | 900 MB |
| F5-TTS | TTS | High-quality English, prompt-based style | 4-6x realtime | 600 MB |

The “speed” figures represent how much faster than real-time the model runs on an M3 Max MacBook Pro with 64 GB memory. Whisper Tiny, for example, can transcribe a one-minute audio clip in about one second.
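Those realtime factors convert directly into wall-clock estimates. A small helper (illustrative, not part of the library) makes the arithmetic explicit:

```python
# Converting the table's realtime factors into wall-clock estimates.
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time to transcribe, given an Nx-realtime speed factor."""
    return audio_seconds / realtime_factor

print(transcription_seconds(60, 50))  # Whisper Tiny, 1-minute clip: 1.2 s
print(transcription_seconds(60, 3))   # Whisper Large V3, slower bound: 20.0 s
```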


How Do You Get Started with MLX-Audio?

Getting started with MLX-Audio is straightforward, especially if you have a recent Apple Silicon Mac and a Python environment set up.

| Step | Command / Code | Notes |
| --- | --- | --- |
| Install | pip install mlx-audio | Requires macOS with Apple Silicon |
| Transcribe audio | import mlx_audio; text = mlx_audio.transcribe("audio.mp3") | Uses Whisper by default |
| Generate speech | mlx_audio.tts("Hello world", voice="kokoro") | Text-to-speech with Kokoro |
| Voice cloning | mlx_audio.tts("Custom voice", voice_ref="sample.wav", model="voicecraft") | Clone voice from a 3-second sample |
| STS pipeline | mlx_audio.sts("input.mp3", target_voice="speaker.wav") | Speech-to-speech with voice transfer |

# Basic transcription example
import mlx_audio

# Transcribe an audio file
result = mlx_audio.transcribe(
    "meeting_recording.mp3",
    model="whisper-large-v3",
    language="en"
)
print(result.text)

# Text-to-speech
mlx_audio.tts(
    "Welcome to the MLX-Audio tutorial.",
    voice="af_bella",  # Kokoro voice
    output="output.wav"
)

# Voice cloning with VoiceCraft
mlx_audio.tts(
    "This is a cloned voice from a short sample.",
    voice_ref="sample_speech.wav",
    model="voicecraft",
    output="cloned_output.wav"
)

The library includes command-line tools for quick use without writing Python code, and supports batch processing for transcribing or generating multiple audio files in one command.
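A batch run can also be scripted directly in Python. The sketch below is a minimal helper, assuming the mlx_audio.transcribe API shown in the article's examples; only the file-collection logic is library-independent.

```python
# Sketch of a batch-transcription helper. The mlx_audio.transcribe call in the
# usage note mirrors the article's examples; treat that exact API as an assumption.
from pathlib import Path
from typing import Callable, Dict, List

def collect_audio_files(folder: str, exts=(".mp3", ".wav", ".m4a")) -> List[Path]:
    """Gather audio files in a folder, sorted for a stable processing order."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in exts)

def transcribe_all(folder: str, transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Apply a transcription function to every audio file in `folder`."""
    return {p.name: transcribe(str(p)) for p in collect_audio_files(folder)}

# Usage with the article's (assumed) API:
# import mlx_audio
# texts = transcribe_all("recordings", lambda f: mlx_audio.transcribe(f).text)
```

Passing the transcription function in as a callable keeps the loop testable and lets you swap models (or a cloud fallback) without touching the file handling.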


What Are Practical Applications of MLX-Audio?

MLX-Audio’s combination of local processing, broad model support, and Apple Silicon optimization makes it suitable for a wide range of applications.

| Application | Models Used | Key Advantage |
| --- | --- | --- |
| Podcast transcription | Whisper Large V3 | Runs locally, preserves privacy |
| Voice assistants | Kokoro TTS + Whisper Tiny | Low latency, always-on capability |
| Audiobook narration | VoiceCraft or Parler | Natural, expressive voices |
| Meeting transcription | Distil-Whisper | Fast processing of long recordings |
| Dubbing / localization | Whisper + CosyVoice | STT then TTS in target language |
| Accessibility tools | Kokoro + Whisper Tiny | Real-time captioning and screen reading |
| Voice cloning for content | VoiceCraft | Custom synthetic voices from samples |

A particularly compelling use case is podcast and meeting transcription. Because everything runs locally on the Mac, there are no API costs, no data leaving the device, and no internet connectivity requirements. At the speeds in the table above, a one-hour meeting recording can be transcribed in roughly 15-20 minutes with Whisper Large V3, or in 8-10 minutes with Distil-Whisper.

Privacy-sensitive industries like legal, healthcare, and finance benefit significantly from local-only processing. Audio data never needs to be uploaded to cloud APIs, which is often a regulatory requirement under GDPR, HIPAA, or similar frameworks.


FAQ

What is MLX-Audio? MLX-Audio is an open-source Python library that provides text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) capabilities, built on Apple’s MLX framework for optimized performance on Apple Silicon (M-series) hardware. It supports dozens of models including Whisper, Kokoro, Parler, VoiceCraft, and more.

How does MLX-Audio differ from other audio AI libraries? MLX-Audio is specifically optimized for Apple Silicon through the MLX framework, which leverages the unified memory architecture of M-series chips for efficient model inference. This means it runs significantly faster and more efficiently on Mac hardware compared to generic PyTorch-based implementations, often achieving near real-time performance for complex models.

What TTS models does MLX-Audio support? MLX-Audio supports a growing collection of TTS models including Kokoro (lightweight, fast), Parler (expressive TTS), VoiceCraft (zero-shot voice cloning), CosyVoice, ChatTTS, F5-TTS, and Edge-TTS for multilingual and multi-speaker synthesis.

What STT/speech recognition models does MLX-Audio support? For speech-to-text, MLX-Audio supports OpenAI Whisper (all model sizes from tiny to large), Distil-Whisper (faster, distilled variants), and Whisper-Timestamped (with word-level timestamps). These provide robust transcription across dozens of languages.

Can MLX-Audio be used for real-time applications? Yes, MLX-Audio is suitable for real-time applications. The MLX framework’s efficient memory management and model quantization enable low-latency inference on Apple Silicon. Smaller TTS models like Kokoro can generate speech faster than real-time, while Whisper tiny and base models provide near-instant transcription.
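To make "suitable for real-time" concrete, here is a rough latency-budget calculation for a streaming caption loop, using the realtime factors from the model table. The fixed per-call overhead figure is an assumption for illustration only.

```python
# Rough latency budget for a streaming caption loop (overhead figure assumed).
def chunk_latency_ms(chunk_seconds: float, realtime_factor: float,
                     overhead_ms: float = 20.0) -> float:
    """Inference time for one audio chunk plus an assumed fixed per-call overhead."""
    return chunk_seconds / realtime_factor * 1000 + overhead_ms

# 1-second chunks with Whisper Tiny (~50x realtime from the table above):
print(chunk_latency_ms(1.0, 50))  # 40.0 ms, well under the 1 s chunk interval
```

As long as the result stays below the chunk interval, the pipeline keeps up with live audio; Whisper Tiny leaves generous headroom, while larger models narrow the margin.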

