StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

StoryDiffusion is a research project from Nankai University and ByteDance that enables consistent long-range comic generation and video creation through a novel self-attention mechanism.

StoryDiffusion is a research project from Nankai University and ByteDance that tackles one of the hardest problems in generative AI: maintaining visual consistency across long sequences of images and videos. It introduces a novel consistent self-attention (CSA) mechanism that enables pretrained diffusion models to generate coherent comic strips, animations, and videos – all without finetuning or per-sequence training.

The core challenge StoryDiffusion addresses is simple to state but extremely difficult to solve: how do you generate a sequence of images where the same character looks consistently the same in every frame? Previous diffusion models could produce stunning single images, but when asked to generate a multi-panel comic or a video clip, characters would subtly change appearance between frames – a different nose shape, a changed outfit, a shifted background style.

StoryDiffusion’s CSA mechanism solves this by expanding the self-attention computation across the entire sequence of generated images simultaneously, rather than computing attention within each image independently. The result is a training-free approach that works with existing pretrained diffusion models and scales to sequences of arbitrary length through a sliding window technique.

Repository: github.com/HVision-NKU/StoryDiffusion


How Does Consistent Self-Attention Work?

In standard diffusion models, each image is generated independently. Self-attention computes relationships between pixels within a single image, so there is no mechanism for one generated frame to “know about” another generated frame. StoryDiffusion changes this by modifying the self-attention layer to operate across the entire sequence:

  1. Cross-Frame Attention: The key and value matrices in the self-attention layers are constructed from all frames in the sequence, not just the current frame. This means each pixel’s attention computation considers pixels from every other frame (see the sketch after this list).
  2. Sliding Window Scaling: For very long sequences (hundreds of frames), a sliding window approach limits the attention to neighboring frames, balancing consistency with computational cost.
  3. Training-Free Integration: CSA is injected into existing pretrained diffusion models through architectural modification to the attention layers, requiring no additional training or finetuning.
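
Below is a minimal PyTorch sketch of the cross-frame idea. It is a simplified illustration, not the repository’s actual implementation: the function name, random token-sampling rate, and bare projection matrices are all assumptions made for brevity.

import torch
import torch.nn.functional as F

def consistent_self_attention(x, wq, wk, wv, sample_rate=0.5):
    # Illustrative sketch, not StoryDiffusion's actual code.
    # x: (frames, tokens, dim) hidden states for one generated sequence.
    # Queries stay per-frame; keys and values also see tokens sampled
    # from every frame, so appearance features are shared across frames.
    frames, tokens, dim = x.shape
    q = x @ wq
    pool = x.reshape(frames * tokens, dim)            # tokens from all frames
    n_shared = int(sample_rate * frames * tokens)
    idx = torch.randperm(frames * tokens)[:n_shared]
    shared = pool[idx].unsqueeze(0).expand(frames, -1, -1)
    kv_input = torch.cat([x, shared], dim=1)          # own + cross-frame tokens
    k, v = kv_input @ wk, kv_input @ wv
    return F.scaled_dot_product_attention(q, k, v)

# Smoke test with random weights: 4 frames of 16 tokens each.
d = 64
x = torch.randn(4, 16, d)
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
print(consistent_self_attention(x, wq, wk, wv).shape)  # torch.Size([4, 16, 64])

Because the sampled tokens are shared by every frame’s key/value set, identity features propagate across the whole sequence in a single forward pass.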

What Are the Key Capabilities of StoryDiffusion?

| Capability | Description | Output Quality |
| --- | --- | --- |
| Comic Strip Generation | Multi-panel comics with consistent characters and style | High – character identity preserved across 10+ panels |
| Video Generation | Temporal sequences with smooth frame-to-frame transitions | High – minimal flickering, consistent appearance |
| Style-Consistent Batch | Multiple images sharing the same artistic style | Very High – style lock across arbitrary batch size |
| Character Interaction | Multiple characters interacting in a single consistent scene | High – each character maintains unique identity |
| Long Sequence Scaling | Sequences of 100+ frames with sliding window | Medium-High – slight quality degradation at extreme lengths |

How Does StoryDiffusion Compare to Other Approaches?

| Method | Training Required | Consistency | Sequence Length | Inference Speed |
| --- | --- | --- | --- | --- |
| StoryDiffusion (CSA) | No (training-free) | High | Arbitrary (sliding window) | Fast (no retraining) |
| Fine-tuned Character Models | Yes, per character | Very High | Limited to training data | Moderate |
| IP-Adapter | Lightweight finetuning | Medium | Any | Fast |
| Frame-by-frame SD | No | Low | Any | Fast |
| Video Diffusion Models | Yes, large-scale | High | Fixed length | Slow |

The key advantage of StoryDiffusion is that it achieves training-free consistency – you can generate a 20-page comic featuring a character that never appeared in any training data, without finetuning or additional model training.

What Types of Content Can StoryDiffusion Generate?

Comic Strip Generation

StoryDiffusion excels at multi-panel comic generation. Users provide a sequence of text prompts describing each panel, and the system generates a complete comic strip where characters maintain identity throughout.

(Diagram: dotted arrows represent consistent self-attention connections between panels – each panel’s generation is aware of the others, ensuring the wizard looks the same in every panel.)

Video Generation

For video, StoryDiffusion extends the same CSA mechanism across temporal frames. The result is video output where characters maintain consistent appearance without the jarring identity shifts that occur when each frame is generated independently.
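
For long sequences, the sliding window mentioned earlier keeps this tractable: attention is computed jointly only within overlapping groups of neighboring frames. A minimal sketch, assuming illustrative window and stride values:

def sliding_windows(num_frames, window=8, stride=4):
    # Yield (start, end) pairs of overlapping windows covering the sequence.
    # Consistent self-attention is applied jointly inside each window, so a
    # frame shares features with its temporal neighbors while the attention
    # cost grows with window**2 rather than num_frames**2.
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += stride

for lo, hi in sliding_windows(20):
    print(f"attend jointly over frames [{lo}, {hi})")

The overlap between consecutive windows is what carries identity information down the full sequence.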

How to Install StoryDiffusion

git clone https://github.com/HVision-NKU/StoryDiffusion.git
cd StoryDiffusion
pip install -r requirements.txt

Basic usage for generating a consistent comic strip:

python comic_generation.py \
  --prompts "a young wizard casts a spell" \
             "the wizard conjures a magical shield" \
             "the wizard faces a fearsome dragon" \
             "the wizard stands triumphant" \
  --output ./comic_output \
  --style fantasy

For video generation:

python video_generation.py \
  --prompt "a samurai walking through a bamboo forest" \
  --frames 48 \
  --output ./video_output

The system supports configurable resolution, guidance scale, and the CSA sliding window size for balancing consistency and performance.
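
As an illustration, a run that raises the resolution and widens the CSA window might look like the following; the flag names here are hypothetical stand-ins, so check the scripts’ --help output for the actual options:

# Flag names below are illustrative, not verified against the repo.
python comic_generation.py \
  --prompts "a young wizard casts a spell" "the wizard stands triumphant" \
  --height 768 --width 768 \
  --guidance_scale 7.5 \
  --window_size 8 \
  --output ./comic_output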

FAQ

What is StoryDiffusion and what problem does it solve? StoryDiffusion from Nankai University and ByteDance solves the consistency problem in long-range generation. It introduces consistent self-attention (CSA) that maintains character, style, and scene identity across arbitrary-length image and video sequences without finetuning.

How does the consistent self-attention mechanism work? CSA expands the attention receptive field across multiple frames by computing self-attention across the entire sequence simultaneously. A sliding window approach scales to arbitrarily long sequences, ensuring features like character appearance and background style remain consistent.

Can StoryDiffusion generate full comic strips? Yes, it is purpose-built for comic generation, producing multi-panel strips with consistent characters, backgrounds, and art styles across all panels without requiring comic-specific training data.

Does StoryDiffusion support video generation too? Yes, the video branch applies CSA across temporal frames, maintaining character and scene coherence with smooth transitions and minimal flickering artifacts.

How do I install and run StoryDiffusion? Clone the GitHub repo, install requirements, and run the provided scripts for comic or video generation. A GPU with 8 GB+ VRAM is recommended.
