The landscape of AI music generation has been dominated by commercial services like Suno and Udio, but the open-source ecosystem now has a serious challenger. ACE-Step 1.5 is a cascaded diffusion transformer model that generates full-length songs in under 2 seconds on an NVIDIA A100 and supports LoRA fine-tuning on consumer GPUs, a combination of speed, quality, and accessibility that is new to open-source music generation.
Developed by ace-step, version 1.5 represents a significant leap over its predecessor. The model uses a cascaded architecture where multiple diffusion transformers work in sequence to progressively refine the audio output, from coarse structure to fine detail. This approach allows ACE-Step 1.5 to achieve generation quality that rivals commercial alternatives while remaining fully open source under the MIT License.
The repository provides pre-trained weights, inference scripts, a Gradio web interface, and comprehensive documentation for training, fine-tuning, and deployment. With model sizes ranging from 780M to 5.5B parameters, users can choose the right balance of quality and speed for their hardware.
How Does ACE-Step 1.5 Generate Music So Quickly?
The secret to ACE-Step 1.5’s speed lies in its cascaded diffusion transformer architecture and an optimized inference pipeline that minimizes the number of diffusion steps needed for high-quality output.
graph LR
A[Text Prompt] --> B[Text Encoder]
B --> C[Cascaded Diffusion Transformer L]
C --> D[Cascaded Diffusion Transformer M]
D --> E[Cascaded Diffusion Transformer S]
E --> F[Vocoder / Decoder]
F --> G[Audio Output]
H[Reference Audio] --> I[Audio Encoder]
I --> C
G --> J[< 2 seconds on A100]
The cascaded design means each sub-model refines the output of the previous stage. The large transformer (L) establishes the broad musical structure, the medium transformer (M) adds harmonic detail, and the small transformer (S) polishes the fine-grained audio quality. This progressive refinement is far more efficient than generating high-quality audio in a single pass.
| Stage | Model Size | Purpose | Approximate Inference Time |
|---|---|---|---|
| First | ACE-Step-1.5-L (5.5B) | Coarse structure generation | ~0.8s on A100 |
| Second | ACE-Step-1.5-M (2.4B) | Harmonic refinement | ~0.6s on A100 |
| Third | ACE-Step-1.5-S (780M) | Fine detail polishing | ~0.4s on A100 |
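As a quick check on the table, the per-stage figures can be added up. The sketch below assumes the three stages run strictly one after another and that the table's approximate A100 timings are the only costs; text encoding and audio decoding are not listed in the table and are left out.

```python
# Approximate per-stage latencies on an A100, copied from the table above.
STAGE_LATENCY_S = [
    ("ACE-Step-1.5-L", 0.8),  # coarse structure generation
    ("ACE-Step-1.5-M", 0.6),  # harmonic refinement
    ("ACE-Step-1.5-S", 0.4),  # fine detail polishing
]


def total_latency(stages):
    """Sum stage latencies, assuming the stages run sequentially."""
    return sum(seconds for _, seconds in stages)


total = total_latency(STAGE_LATENCY_S)
print(f"end-to-end: ~{total:.1f}s")  # prints: end-to-end: ~1.8s
```

The sum of roughly 1.8 seconds is consistent with the "under 2 seconds on A100" figure, with little headroom for any step the table does not itemize.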
What Model Variants Are Available and How Do They Compare?
ACE-Step 1.5 offers multiple model sizes to accommodate different hardware and quality requirements, from research-grade large models to lightweight consumer variants.
| Variant | Parameters | Recommended GPU | Generation Quality | Speed on RTX 4090 |
|---|---|---|---|---|
| ACE-Step-1.5-L | 5.5B | A100 / H100 | Best | ~4s |
| ACE-Step-1.5-M | 2.4B | RTX 4090 / A10G | High | ~3s |
| ACE-Step-1.5-S | 780M | RTX 3090 / RTX 4080 | Good | ~2s |
| LoRA Module | ~10-50M | RTX 4090 | Custom styles | Training: ~30 min |
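The quality-versus-speed trade-off in the table can be written as a small lookup. This is only an illustrative sketch: the `Variant` class and `pick_variant` helper are hypothetical names, not part of the ACE-Step repository, and the timings are the approximate RTX 4090 figures from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    params_b: float   # parameter count, in billions
    rtx4090_s: float  # approximate generation time on an RTX 4090, in seconds


# Ordered from highest to lowest quality, following the table above.
VARIANTS = [
    Variant("ACE-Step-1.5-L", 5.5, 4.0),
    Variant("ACE-Step-1.5-M", 2.4, 3.0),
    Variant("ACE-Step-1.5-S", 0.78, 2.0),
]


def pick_variant(budget_s: float) -> Optional[Variant]:
    """Return the highest-quality variant whose RTX 4090 time fits the budget."""
    for variant in VARIANTS:
        if variant.rtx4090_s <= budget_s:
            return variant
    return None  # no listed variant is fast enough


choice = pick_variant(3.5)
print(choice.name if choice else "no variant fits")  # prints: ACE-Step-1.5-M
```

A real selection would also need to account for GPU memory, which the table does not specify.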
The LoRA module is particularly notable because it lets users fine-tune the model on specific genres, instruments, or artists with modest GPU memory requirements. A full LoRA training run takes roughly 30 minutes on an RTX 4090 with a dataset of 50-100 short audio clips.
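To see why a LoRA module stays small, recall that LoRA freezes each adapted weight matrix W and learns a low-rank update B·A in its place, so the trainable parameter count grows with the rank rather than with the size of W. The sketch below counts those parameters. The layer dimensions, layer count, and rank are hypothetical values chosen for illustration and are not taken from ACE-Step.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); W itself stays frozen.
    """
    return rank * (d_in + d_out)


# Hypothetical transformer shape, NOT the actual ACE-Step configuration.
HIDDEN = 2048
LAYERS = 24
ADAPTED_PER_LAYER = 4  # e.g. four attention projection matrices per layer
RANK = 64

per_matrix = lora_params(HIDDEN, HIDDEN, RANK)
lora_total = per_matrix * ADAPTED_PER_LAYER * LAYERS
frozen_total = HIDDEN * HIDDEN * ADAPTED_PER_LAYER * LAYERS

print(f"LoRA trainable: {lora_total / 1e6:.1f}M")
print(f"Frozen weights in the adapted matrices: {frozen_total / 1e6:.1f}M")
```

With these made-up dimensions the adapter comes to about 25M parameters, inside the ~10-50M range quoted in the table. The actual size depends on the rank and on which layers are adapted.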
How Do You Use ACE-Step 1.5 for Music Generation?
Getting started with ACE-Step 1.5 is straightforward, with multiple interfaces available depending on your workflow.
graph TD
A[ACE-Step 1.5 Usage] --> B[Gradio Web UI]
A --> C[Python API]
A --> D[Command Line]
B --> E[Text-to-Music]
B --> F[Reference-to-Music]
C --> G[Batch Generation]
C --> H[LoRA Training]
D --> I[Script Integration]
The Gradio web interface provides an intuitive way to experiment with the model, supporting both text prompts and reference audio inputs. For developers, the Python API offers programmatic access for batch generation, custom pipelines, and integration with larger applications.
| Generation Mode | Input | Output | Use Case |
|---|---|---|---|
| Text-to-Music | “Upbeat electronic dance with synth bass” | Full song | Creative exploration |
| Reference-to-Music | Prompt + 30s audio clip | Styled continuation | Genre adaptation |
| LoRA Fine-tuning | Custom dataset + base model | Fine-tuned weights | Personalized styles |
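Batch generation through a Python API could be organized along the following lines. This is a sketch under stated assumptions, not the ACE-Step API: `generate_fn` is a hypothetical callable that you would wrap around whatever entry point the repository actually exposes, and it is assumed to return encoded audio bytes.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional

# Hypothetical signature: (prompt, optional reference-audio path) -> audio bytes.
GenerateFn = Callable[[str, Optional[Path]], bytes]


def batch_generate(
    prompts: Iterable[str],
    generate_fn: GenerateFn,
    out_dir: Path,
    reference: Optional[Path] = None,
) -> List[Path]:
    """Call generate_fn once per prompt and write each result into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, prompt in enumerate(prompts):
        # reference=None corresponds to text-to-music; a path corresponds to
        # reference-to-music (assumed mapping onto the two modes in the table).
        audio = generate_fn(prompt, reference)
        path = out_dir / f"track_{index:03d}.wav"  # assumes WAV-encoded bytes
        path.write_bytes(audio)
        written.append(path)
    return written
```

Passing the generator in as an argument keeps the batching logic independent of the model, so it can be exercised with a stub function before any weights are loaded.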
FAQ
What is ACE-Step 1.5? ACE-Step 1.5 is an open-source music generation model developed by ace-step that uses cascaded diffusion transformers to generate full-length songs in under 2 seconds on an NVIDIA A100 GPU. It supports both text-to-music and text-with-reference-to-music generation.
How fast is ACE-Step 1.5 at generating music? ACE-Step 1.5 generates a full song in under 2 seconds on an A100 GPU and under 7 seconds on a consumer RTX 4090. This dramatic speed improvement over earlier versions comes from architectural optimizations in the cascaded diffusion transformer pipeline.
What model variants are available? The repository offers several variants: ACE-Step-1.5-L (large, 5.5B parameters), ACE-Step-1.5-M (medium, 2.4B parameters), ACE-Step-1.5-S (small, 780M parameters), and the LoRA module for custom training. The large model provides the highest quality while smaller variants trade some fidelity for faster generation.
Does ACE-Step 1.5 support LoRA training? Yes, ACE-Step 1.5 includes LoRA (Low-Rank Adaptation) training support, allowing users to fine-tune the model on custom music datasets with minimal computational overhead. This enables personalized music generation styles without full model retraining.
What is the license for ACE-Step 1.5? ACE-Step 1.5 is released under the MIT License, which permits both research and commercial use. Users can freely use, modify, and distribute the model and its weights, provided the copyright and license notice is retained.