ACE-Step 1.5: Open-Source Music Generation Model Outperforming Commercial Solutions

ACE-Step 1.5 is an open-source music generation model generating full songs in under 2 seconds, with LoRA training and consumer GPU support.


The landscape of AI music generation has been dominated by commercial services like Suno and Udio, but the open-source ecosystem just received a powerful challenger. ACE-Step 1.5 is a cascaded diffusion transformer model that generates full-length songs in under 2 seconds while supporting LoRA fine-tuning on consumer GPUs, a combination of speed, quality, and accessibility not previously available in open-source music generation.

Developed by ace-step, version 1.5 represents a significant leap over its predecessor. The model uses a cascaded architecture where multiple diffusion transformers work in sequence to progressively refine the audio output, from coarse structure to fine detail. This approach allows ACE-Step 1.5 to achieve generation quality that rivals commercial alternatives while remaining fully open source under the MIT License.

The repository provides pre-trained weights, inference scripts, a Gradio web interface, and comprehensive documentation for training, fine-tuning, and deployment. With model sizes ranging from 780M to 5.5B parameters, users can choose the right balance of quality and speed for their hardware.


How Does ACE-Step 1.5 Generate Music So Quickly?

The secret to ACE-Step 1.5’s speed lies in its cascaded diffusion transformer architecture and an optimized inference pipeline that minimizes the number of diffusion steps needed for high-quality output.

```mermaid
graph LR
    A[Text Prompt] --> B[Text Encoder]
    B --> C[Cascaded Diffusion Transformer L]
    C --> D[Cascaded Diffusion Transformer M]
    D --> E[Cascaded Diffusion Transformer S]
    E --> F[Vocoder / Decoder]
    F --> G[Audio Output]
    H[Reference Audio] --> I[Audio Encoder]
    I --> C
    G --> J[< 2 seconds on A100]
```

The cascaded design means each sub-model refines the output of the previous stage. The large transformer (L) establishes the broad musical structure, the medium transformer (M) adds harmonic detail, and the small transformer (S) polishes the fine-grained audio quality. This progressive refinement is far more efficient than generating high-quality audio in a single pass.

| Stage | Model | Size | Purpose | Approximate Inference Time |
|---|---|---|---|---|
| First | ACE-Step-1.5-L | 5.5B | Coarse structure generation | ~0.8s on A100 |
| Second | ACE-Step-1.5-M | 2.4B | Harmonic refinement | ~0.6s on A100 |
| Third | ACE-Step-1.5-S | 780M | Fine detail polishing | ~0.4s on A100 |
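The coarse-to-fine idea behind the cascade can be sketched in a few lines of plain Python. The toy "stage" below just doubles temporal resolution by interpolation; it stands in for a diffusion transformer pass and is not the actual ACE-Step implementation, only an illustration of why each stage can be cheap.

```python
# Toy coarse-to-fine cascade: each stage doubles the temporal resolution of
# the latent and interpolates, standing in for one diffusion transformer pass.
# Stage names match the table above; the refinement logic is illustrative only.

def upsample_and_smooth(latent):
    """Double the resolution by linear interpolation between neighbors
    (a stand-in for one cascaded refinement stage)."""
    out = []
    for a, b in zip(latent, latent[1:] + latent[-1:]):
        out.append(a)
        out.append((a + b) / 2)
    return out

STAGES = [
    "ACE-Step-1.5-L (coarse structure)",
    "ACE-Step-1.5-M (harmonic detail)",
    "ACE-Step-1.5-S (fine polish)",
]

def run_cascade(seed_latent):
    latent = seed_latent
    for name in STAGES:
        latent = upsample_and_smooth(latent)
        print(f"{name}: {len(latent)} latent frames")
    return latent

# 4 -> 8 -> 16 -> 32 frames: each stage refines the previous stage's output
# at a finer granularity instead of generating everything in one pass.
final = run_cascade([0.0, 1.0, 0.0, -1.0])
```

Because each stage only has to add detail on top of an already-structured latent, none of them needs the many diffusion steps a single monolithic model would require.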

What Model Variants Are Available and How Do They Compare?

ACE-Step 1.5 offers multiple model sizes to accommodate different hardware and quality requirements, from research-grade large models to lightweight consumer variants.

| Variant | Parameters | Recommended GPU | Generation Quality | Speed on RTX 4090 |
|---|---|---|---|---|
| ACE-Step-1.5-L | 5.5B | A100 / H100 | Best | ~4s |
| ACE-Step-1.5-M | 2.4B | RTX 4090 / A10G | High | ~3s |
| ACE-Step-1.5-S | 780M | RTX 3090 / RTX 4080 | Good | ~2s |
| LoRA Module | ~10-50M | RTX 4090 | Custom styles | Training: ~30 min |

The LoRA module is particularly notable because it allows users to fine-tune the model on specific genres, instruments, or artists with minimal GPU memory requirements. A full LoRA training run takes roughly 30 minutes on an RTX 4090 with a dataset of 50-100 short audio clips.
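The reason LoRA fits in consumer GPU memory is the low-rank decomposition itself: instead of updating a full weight matrix W, training only touches two small factors A and B, and the effective weight becomes W + alpha * (A @ B). The pure-Python sketch below illustrates the parameter-count arithmetic; the real module applies this to the diffusion transformer's weights.

```python
# Minimal LoRA arithmetic: W_eff = W + alpha * (A @ B), where W is frozen and
# only the low-rank factors A (d x r) and B (r x d) are trained.
# Pure-Python matmul for illustration; not the repository's implementation.

def matmul(X, Y):
    """Naive matrix multiply over lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 4, 1, 0.5                                  # rank-1 adapter on a 4x4 weight
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base (identity)
A = [[1.0], [0.0], [0.0], [0.0]]                         # d x r, trained
B = [[0.0, 2.0, 0.0, 0.0]]                               # r x d, trained

delta = matmul(A, B)                                     # rank-1 update, d x d
W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(d)] for i in range(d)]

# Trainable values: d*r + r*d = 8 instead of d*d = 16. At transformer scale
# (d in the thousands, r of 8-64), this gap is what keeps a LoRA run in the
# ~10-50M parameter range cited in the table above.
```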


How Do You Use ACE-Step 1.5 for Music Generation?

Getting started with ACE-Step 1.5 is straightforward, with multiple interfaces available depending on your workflow.

```mermaid
graph TD
    A[ACE-Step 1.5 Usage] --> B[Gradio Web UI]
    A --> C[Python API]
    A --> D[Command Line]
    B --> E[Text-to-Music]
    B --> F[Reference-to-Music]
    C --> G[Batch Generation]
    C --> H[LoRA Training]
    D --> I[Script Integration]
```

The Gradio web interface provides an intuitive way to experiment with the model, supporting both text prompts and reference audio inputs. For developers, the Python API offers programmatic access for batch generation, custom pipelines, and integration with larger applications.
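The repository's actual Python API surface is not reproduced here, so the sketch below uses a mock pipeline class to show only how batch generation might be structured. `MockPipeline`, `generate`, and all parameter names are illustrative assumptions; consult the repository's documentation for the real interface.

```python
# Hypothetical shape of programmatic batch generation. MockPipeline and
# generate() are stand-ins, NOT the repository's actual API.

class MockPipeline:
    def __init__(self, variant: str):
        # e.g. "ACE-Step-1.5-S" for consumer GPUs, per the variant table
        self.variant = variant

    def generate(self, prompt: str, duration_s: int = 60) -> dict:
        # A real pipeline would return audio samples; the mock returns
        # only the request metadata so the sketch stays self-contained.
        return {"variant": self.variant,
                "prompt": prompt,
                "duration_s": duration_s}

pipe = MockPipeline("ACE-Step-1.5-S")
prompts = [
    "upbeat electronic dance with synth bass",
    "slow acoustic folk ballad",
]
tracks = [pipe.generate(p, duration_s=90) for p in prompts]
```

The point of a programmatic entry like this is that generation becomes an ordinary function call, so batching, custom pipelines, and integration into larger applications are just Python control flow.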

| Generation Mode | Input | Output | Use Case |
|---|---|---|---|
| Text-to-Music | "Upbeat electronic dance with synth bass" | Full song | Creative exploration |
| Reference-to-Music | Prompt + 30s audio clip | Styled continuation | Genre adaptation |
| LoRA Fine-tuning | Custom dataset + base model | Fine-tuned weights | Personalized styles |

FAQ

What is ACE-Step 1.5? ACE-Step 1.5 is an open-source music generation model developed by ace-step that uses cascaded diffusion transformers to generate full-length songs in under 2 seconds on an NVIDIA A100 GPU. It supports both text-to-music and text-with-reference-to-music generation.

How fast is ACE-Step 1.5 at generating music? ACE-Step 1.5 generates a full song in under 2 seconds on an A100 GPU and under 7 seconds on a consumer RTX 4090. This dramatic speed improvement over earlier versions comes from architectural optimizations in the cascaded diffusion transformer pipeline.

What model variants are available? The repository offers several variants: ACE-Step-1.5-L (large, 5.5B parameters), ACE-Step-1.5-M (medium, 2.4B parameters), ACE-Step-1.5-S (small, 780M parameters), and the LoRA module for custom training. The large model provides the highest quality while smaller variants trade some fidelity for faster generation.

Does ACE-Step 1.5 support LoRA training? Yes, ACE-Step 1.5 includes LoRA (Low-Rank Adaptation) training support, allowing users to fine-tune the model on custom music datasets with minimal computational overhead. This enables personalized music generation styles without full model retraining.

What is the license for ACE-Step 1.5? ACE-Step 1.5 is released under the MIT License, making it fully permissive for both research and commercial use. Users can freely use, modify, and distribute the model and its weights without restrictions.

