ACE-Step 1.5: Open-Source Music Generation Model Outperforming Commercial Solutions

ACE-Step 1.5 is an open-source music generation model generating full songs in under 2 seconds, with LoRA training and consumer GPU support.


The landscape of AI music generation has been dominated by commercial services like Suno and Udio, but the open-source ecosystem just received a powerful challenger. ACE-Step 1.5 is a cascaded diffusion transformer model that generates full-length songs in under 2 seconds while supporting LoRA fine-tuning on consumer GPUs, a combination of speed, quality, and accessibility not previously available in open-source music generation.

Developed by ace-step, version 1.5 represents a significant leap over its predecessor. The model uses a cascaded architecture where multiple diffusion transformers work in sequence to progressively refine the audio output, from coarse structure to fine detail. This approach allows ACE-Step 1.5 to achieve generation quality that rivals commercial alternatives while remaining fully open source under the MIT License.

The repository provides pre-trained weights, inference scripts, a Gradio web interface, and comprehensive documentation for training, fine-tuning, and deployment. With model sizes ranging from 780M to 5.5B parameters, users can choose the right balance of quality and speed for their hardware.


How Does ACE-Step 1.5 Generate Music So Quickly?

The secret to ACE-Step 1.5’s speed lies in its cascaded diffusion transformer architecture and an optimized inference pipeline that minimizes the number of diffusion steps needed for high-quality output.

```mermaid
graph LR
    A[Text Prompt] --> B[Text Encoder]
    B --> C[Cascaded Diffusion Transformer L]
    C --> D[Cascaded Diffusion Transformer M]
    D --> E[Cascaded Diffusion Transformer S]
    E --> F[Vocoder / Decoder]
    F --> G[Audio Output]
    H[Reference Audio] --> I[Audio Encoder]
    I --> C
    G --> J[< 2 seconds on A100]
```

The cascaded design means each sub-model refines the output of the previous stage. The large transformer (L) establishes the broad musical structure, the medium transformer (M) adds harmonic detail, and the small transformer (S) polishes the fine-grained audio quality. This progressive refinement is far more efficient than generating high-quality audio in a single pass.

| Stage | Model | Size | Purpose | Approximate Inference Time |
|---|---|---|---|---|
| First | ACE-Step-1.5-L | 5.5B | Coarse structure generation | ~0.8s on A100 |
| Second | ACE-Step-1.5-M | 2.4B | Harmonic refinement | ~0.6s on A100 |
| Third | ACE-Step-1.5-S | 780M | Fine detail polishing | ~0.4s on A100 |
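The coarse-to-fine idea behind the cascade can be sketched in a few lines of plain Python. The toy "stage" below just doubles temporal resolution by interpolation; it stands in for a diffusion transformer pass and is not the actual ACE-Step implementation, only an illustration of why each stage can be cheap.

```python
# Toy coarse-to-fine cascade: each stage doubles the temporal resolution of
# the latent and interpolates, standing in for one diffusion transformer pass.
# Stage names match the table above; the refinement logic is illustrative only.

def upsample_and_smooth(latent):
    """Double the resolution by linear interpolation between neighbors
    (a stand-in for one cascaded refinement stage)."""
    out = []
    for a, b in zip(latent, latent[1:] + latent[-1:]):
        out.append(a)
        out.append((a + b) / 2)
    return out

STAGES = [
    "ACE-Step-1.5-L (coarse structure)",
    "ACE-Step-1.5-M (harmonic detail)",
    "ACE-Step-1.5-S (fine polish)",
]

def run_cascade(seed_latent):
    latent = seed_latent
    for name in STAGES:
        latent = upsample_and_smooth(latent)
        print(f"{name}: {len(latent)} latent frames")
    return latent

# 4 -> 8 -> 16 -> 32 frames: each stage refines the previous stage's output
# at a finer granularity instead of generating everything in one pass.
final = run_cascade([0.0, 1.0, 0.0, -1.0])
```

Because each stage only has to add detail on top of an already-structured latent, none of them needs the many diffusion steps a single monolithic model would require.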

What Model Variants Are Available and How Do They Compare?

ACE-Step 1.5 offers multiple model sizes to accommodate different hardware and quality requirements, from research-grade large models to lightweight consumer variants.

| Variant | Parameters | Recommended GPU | Generation Quality | Speed on RTX 4090 |
|---|---|---|---|---|
| ACE-Step-1.5-L | 5.5B | A100 / H100 | Best | ~4s |
| ACE-Step-1.5-M | 2.4B | RTX 4090 / A10G | High | ~3s |
| ACE-Step-1.5-S | 780M | RTX 3090 / RTX 4080 | Good | ~2s |
| LoRA Module | ~10-50M | RTX 4090 | Custom styles | Training: ~30 min |

The LoRA module is particularly notable because it allows users to fine-tune the model on specific genres, instruments, or artists with minimal GPU memory requirements. A full LoRA training run takes roughly 30 minutes on an RTX 4090 with a dataset of 50-100 short audio clips.
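The reason LoRA fits in consumer GPU memory is the low-rank decomposition itself: instead of updating a full weight matrix W, training only touches two small factors A and B, and the effective weight becomes W + alpha * (A @ B). The pure-Python sketch below illustrates the parameter-count arithmetic; the real module applies this to the diffusion transformer's weights.

```python
# Minimal LoRA arithmetic: W_eff = W + alpha * (A @ B), where W is frozen and
# only the low-rank factors A (d x r) and B (r x d) are trained.
# Pure-Python matmul for illustration; not the repository's implementation.

def matmul(X, Y):
    """Naive matrix multiply over lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 4, 1, 0.5                                  # rank-1 adapter on a 4x4 weight
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base (identity)
A = [[1.0], [0.0], [0.0], [0.0]]                         # d x r, trained
B = [[0.0, 2.0, 0.0, 0.0]]                               # r x d, trained

delta = matmul(A, B)                                     # rank-1 update, d x d
W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(d)] for i in range(d)]

# Trainable values: d*r + r*d = 8 instead of d*d = 16. At transformer scale
# (d in the thousands, r of 8-64), this gap is what keeps a LoRA run in the
# ~10-50M parameter range cited in the table above.
```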


How Do You Use ACE-Step 1.5 for Music Generation?

Getting started with ACE-Step 1.5 is straightforward, with multiple interfaces available depending on your workflow.

```mermaid
graph TD
    A[ACE-Step 1.5 Usage] --> B[Gradio Web UI]
    A --> C[Python API]
    A --> D[Command Line]
    B --> E[Text-to-Music]
    B --> F[Reference-to-Music]
    C --> G[Batch Generation]
    C --> H[LoRA Training]
    D --> I[Script Integration]
```

The Gradio web interface provides an intuitive way to experiment with the model, supporting both text prompts and reference audio inputs. For developers, the Python API offers programmatic access for batch generation, custom pipelines, and integration with larger applications.
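The repository's actual Python API surface is not reproduced here, so the sketch below uses a mock pipeline class to show only how batch generation might be structured. `MockPipeline`, `generate`, and all parameter names are illustrative assumptions; consult the repository's documentation for the real interface.

```python
# Hypothetical shape of programmatic batch generation. MockPipeline and
# generate() are stand-ins, NOT the repository's actual API.

class MockPipeline:
    def __init__(self, variant: str):
        # e.g. "ACE-Step-1.5-S" for consumer GPUs, per the variant table
        self.variant = variant

    def generate(self, prompt: str, duration_s: int = 60) -> dict:
        # A real pipeline would return audio samples; the mock returns
        # only the request metadata so the sketch stays self-contained.
        return {"variant": self.variant,
                "prompt": prompt,
                "duration_s": duration_s}

pipe = MockPipeline("ACE-Step-1.5-S")
prompts = [
    "upbeat electronic dance with synth bass",
    "slow acoustic folk ballad",
]
tracks = [pipe.generate(p, duration_s=90) for p in prompts]
```

The point of a programmatic entry like this is that generation becomes an ordinary function call, so batching, custom pipelines, and integration into larger applications are just Python control flow.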

| Generation Mode | Input | Output | Use Case |
|---|---|---|---|
| Text-to-Music | "Upbeat electronic dance with synth bass" | Full song | Creative exploration |
| Reference-to-Music | Prompt + 30s audio clip | Styled continuation | Genre adaptation |
| LoRA Fine-tuning | Custom dataset + base model | Fine-tuned weights | Personalized styles |

FAQ

What is ACE-Step 1.5? ACE-Step 1.5 is an open-source music generation model developed by ace-step that uses cascaded diffusion transformers to generate full-length songs in under 2 seconds on an NVIDIA A100 GPU. It supports both text-to-music and text-with-reference-to-music generation.

How fast is ACE-Step 1.5 at generating music? ACE-Step 1.5 generates a full song in under 2 seconds on an A100 GPU and under 7 seconds on a consumer RTX 4090. This dramatic speed improvement over earlier versions comes from architectural optimizations in the cascaded diffusion transformer pipeline.

What model variants are available? The repository offers several variants: ACE-Step-1.5-L (large, 5.5B parameters), ACE-Step-1.5-M (medium, 2.4B parameters), ACE-Step-1.5-S (small, 780M parameters), and the LoRA module for custom training. The large model provides the highest quality while smaller variants trade some fidelity for faster generation.

Does ACE-Step 1.5 support LoRA training? Yes, ACE-Step 1.5 includes LoRA (Low-Rank Adaptation) training support, allowing users to fine-tune the model on custom music datasets with minimal computational overhead. This enables personalized music generation styles without full model retraining.

What is the license for ACE-Step 1.5? ACE-Step 1.5 is released under the MIT License, making it fully permissive for both research and commercial use. Users can freely use, modify, and distribute the model and its weights without restrictions.

