Work done in collaboration with Stanford, Together AI, UC San Diego, Northwestern University, Google DeepMind and Salesforce Research.
Designing model architectures is a core part of building modern AI systems, alongside data, algorithms, compute, and benchmarks. Model architecture defines a learnable function and involves key choices—such as which operators to use (e.g., attention, convolution) and how to configure them (e.g., model depth, width). Despite its critical role, insight into architectures—what works and what doesn’t—is difficult to obtain, due to the prohibitive cost of training models from scratch, especially in today’s foundation model era. As a result, exploring new architectures remains a major challenge, particularly for generative models.
Much like how new software is built on existing code rather than written from scratch, can pretrained models serve as scaffolds for exploring new architectural designs? We investigate architectural editing of pretrained models. We focus on diffusion transformers (DiTs), a class of generative transformers widely used for image and video generation [1][2][3].
A pretrained model implements a computational graph to perform tasks such as image or video generation. Think of it like an electrical circuit wired to light up a bulb. Given a new architectural idea and a pretrained model, we investigate whether the idea can be materialized by modifying the model's computational graph under a small compute budget. For example, one might hypothesize that a convolutional design could replace Multi-Head Attention (MHA) or the Multi-Layer Perceptron (MLP) in a DiT. A simple way to test this idea is to swap the MHA or MLP operators with convolutional ones while preserving model quality (a minimal code sketch of such a swap follows the questions below). This raises two key questions:
- (Q1) operator initialization: how should a new operator be initialized before integrating it into the computational graph?
- (Q2) error accumulation: how can we mitigate error propagation as multiple operators are replaced?
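To make the kind of edit we have in mind concrete, here is a minimal sketch of such a swap. The simplified block, its dimensions, and the depthwise-convolution replacement are illustrative stand-ins, not the exact operators used in our experiments:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Simplified DiT-style block (conditioning omitted): attention + MLP with residuals."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)          # self-attention returns (output, weights)
        x = x + a
        return x + self.mlp(self.norm2(x))

class LocalConvMixer(nn.Module):
    """Illustrative replacement operator: a short depthwise convolution over tokens."""
    def __init__(self, dim=384, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, q, k=None, v=None):
        # Mimic nn.MultiheadAttention's (output, weights) return signature.
        y = self.conv(q.transpose(1, 2)).transpose(1, 2)
        return y, None

block = Block()
block.attn = LocalConvMixer()              # edit the computational graph in place
x = torch.randn(2, 64, 384)                # (batch, tokens, dim)
print(block(x).shape)                      # torch.Size([2, 64, 384])
```

The swap itself is a one-line edit; the interesting part is making the edited model work well, which is exactly what the two questions above are about.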
To address these questions, we present grafting, a simple two-stage approach to architecture editing. Grafting works as follows:
- (i) activation distillation: This stage transfers the functionality of the original operator to the new one by distilling its activations using a regression objective.
- (ii) lightweight finetuning: This stage mitigates the error that accumulates when multiple new operators are integrated, by finetuning the edited model on a limited amount of data (a toy sketch of both stages follows).
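Here is a toy sketch of the two stages. The stand-in operators, dimensions, and losses are illustrative: in practice, stage 1 regresses on cached activations from real data, and stage 2 uses the model's original training objective on limited data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, depth = 256, 4

def make_op():
    # Stand-in for one pretrained DiT operator (an MLP here, for brevity).
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

pretrained = nn.ModuleList(make_op() for _ in range(depth))          # frozen "pretrained" operators
new_ops = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))   # cheaper illustrative replacements

def forward(x, ops):
    for op in ops:
        x = x + op(x)        # residual wiring, mirroring the original computational graph
    return x

# Stage 1: activation distillation. Each new operator is initialized by regressing
# onto the input/output behavior of the operator it replaces, one at a time.
for old_op, new_op in zip(pretrained, new_ops):
    opt = torch.optim.AdamW(new_op.parameters(), lr=1e-3)
    for _ in range(200):
        x = torch.randn(64, dim)              # in practice: cached activations from real data
        with torch.no_grad():
            target = old_op(x)
        loss = F.mse_loss(new_op(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 2: lightweight finetuning. Once several distilled operators are spliced in,
# errors compound across layers, so the grafted model is finetuned on limited data.
# Regressing on the frozen model's output is a crude stand-in here for the original
# training objective (the diffusion loss).
replace = {1, 3}                               # interleaved design: swap every other operator
grafted = [new_ops[i] if i in replace else pretrained[i] for i in range(depth)]
opt = torch.optim.AdamW([p for i in replace for p in new_ops[i].parameters()], lr=1e-4)
for _ in range(100):
    x = torch.randn(64, dim)
    with torch.no_grad():
        target = forward(x, pretrained)
    loss = F.mse_loss(forward(x, grafted), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```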
We test grafting across a series of increasingly challenging generative modeling tasks.
Result I: Hybrid architectures for class-conditional image generation.
We first validate grafting on class-conditional image generation using DiT-XL/2 at 256×256 resolution. In this setup, we replace softmax attention (MHA) with alternatives like local gated convolutions (Hyena-SE and our proposed Hyena-X/Y), local attention (sliding window), and linear attention (Mamba-2). For MLPs, we test variants with different expansion ratios (e.g., 3x, 6x) as well as a convolutional version (Hyena-X).
Interestingly, several interleaved hybrid designs achieve good generation quality, with FID scores between 2.38 and 2.64 (lower is better; DiT-XL/2 baseline: 2.27). Grafting is both simple and lightweight: each experiment completes in under 24 hours on 8×H100 GPUs, using less than 2% of the original model’s pretraining compute.
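For intuition, here is a generic local gated convolution written in the spirit of Hyena-SE; the exact Hyena-X and Hyena-Y operators differ in their details, which are given in the paper:

```python
import torch
import torch.nn as nn

class LocalGatedConv(nn.Module):
    """Generic local gated convolution (a Hyena-style sketch, not the exact
    Hyena-X/Y operators): dense projections, a short depthwise convolution
    over the token dimension, and elementwise gating."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)
        self.short_conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, tokens, dim)
        q, k, v = self.in_proj(x).chunk(3, dim=-1)
        h = self.short_conv((k * v).transpose(1, 2)).transpose(1, 2)
        return self.out_proj(q * h)              # gate with q, then project out

x = torch.randn(2, 256, 1152)                    # DiT-XL/2: 256 tokens at 256x256, width 1152
print(LocalGatedConv(1152)(x).shape)             # torch.Size([2, 256, 1152])
```

Operators like this are attractive drop-in candidates because their cost grows linearly with sequence length while keeping the token mixing local.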
Here are some high-quality samples generated using our grafted models:

Result II: Efficient high-resolution text-to-image (T2I) generation.
Grafting scales to real-world, high-resolution tasks. We apply it to 2048×2048 text-to-image generation using PixArt-Σ (a DiT). This setting is particularly challenging: it involves long sequences (16,384 tokens), multimodal text conditioning, and no access to training data. We target the self-attention operators, which account for over 62% of generation latency, and apply grafting using just 12k synthetic samples. The resulting model runs 1.43x faster with less than a 2-point drop in GenEval score (47.78 vs. 49.75), demonstrating that grafting remains effective even at large scale.
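A back-of-the-envelope FLOP count shows why self-attention dominates at this resolution. The dimensions below are assumed DiT-XL-like values, and FLOPs are only a rough proxy for the measured latency share quoted above:

```python
# Assumed DiT-XL-like dimensions; cross-attention, normalization, and other ops are ignored.
L, d = 16_384, 1152                              # tokens at 2048x2048, hidden width
attn_flops = 4 * L * L * d + 8 * L * d * d       # QK^T and AV, plus QKV/output projections
mlp_flops = 16 * L * d * d                       # two linear layers at 4x expansion
frac = attn_flops / (attn_flops + mlp_flops)
print(f"self-attention share of per-block FLOPs: {frac:.0%}")   # ~80% by this estimate
```

The quadratic term dwarfs everything else once the sequence reaches tens of thousands of tokens, which is why replacing self-attention is where the latency savings are.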
Here’s an image generated by our grafted model for the prompt “Picasso, fractal, mosaic, face and body in mosaic pattern, beautiful woman, background white palace Great Hall, photorealistic.”

Result III: Converting model depth to width via grafting.
Beyond swapping operators, grafting also enables more structural edits. Motivated by our MLP grafting results, we try something more radical: since modern GPUs favor parallel over sequential computation, we rewire DiT-XL/2 by parallelizing every pair of transformer blocks. This halves model depth (28→14). The grafted model achieves an FID of 2.77, outperforming other models of similar depth. To our knowledge, this is the first attempt to convert sequential transformer blocks into parallel ones in a pretrained DiT, showing that architectures can be restructured via grafting.
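Here is a minimal sketch of the rewiring with stand-in residual branches; the real edit operates on pretrained DiT-XL/2 blocks, and the rewired pairs are then fit with the same two-stage grafting procedure described above:

```python
import torch
import torch.nn as nn

class ParallelPair(nn.Module):
    """Runs two formerly sequential residual branches on the same input and sums
    their contributions (a sketch; the exact rewiring is described in the paper)."""
    def __init__(self, branch_a, branch_b):
        super().__init__()
        self.branch_a, self.branch_b = branch_a, branch_b

    def forward(self, x):
        # Sequential: y = x + f_a(x); out = y + f_b(y)
        # Parallel  : both branches read the same input, so the pair runs as one "layer".
        return x + self.branch_a(x) + self.branch_b(x)

def make_branch(dim=1152):
    # Stand-in residual branch for one DiT block (attention and conditioning omitted).
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

branches = [make_branch() for _ in range(28)]                       # depth-28 stack, as in DiT-XL/2
pairs = nn.ModuleList(ParallelPair(a, b) for a, b in zip(branches[0::2], branches[1::2]))
print(len(pairs))                                                   # 14: depth halved

x = torch.randn(2, 256, 1152)
for pair in pairs:
    x = pair(x)
print(x.shape)                                                      # torch.Size([2, 256, 1152])
```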
Here are some high-quality samples generated using our depth-to-width restructured model:

Looking Forward: Applications
Grafting shows potential in settings where efficiency is important—such as extending model capabilities (e.g., from short- to long-form video understanding or generation) or building efficient inference stacks for interactive applications like image editing. We hope our findings encourage the community to explore new architectural designs using grafting.
Try it yourself!
References:
[1] Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[2] Brooks, Tim, et al. "Video Generation Models as World Simulators." OpenAI, 2024. openai.com/index/video-generation-models-as-world-simulators/
[3] Gupta, Agrim, et al. "Photorealistic Video Generation with Diffusion Models." European Conference on Computer Vision (ECCV), 2024.