ExplicitShortCut: On the Design of One-step Diffusion
via Shortcutting Flow Paths

One-step generation showcase on ImageNet-256x256 with ESC (SiT-XL/2 architecture).

Abstract

Recent one-step diffusion models trained from scratch achieve strong empirical efficiency by shortcutting probability flow paths, but their method designs are often tightly coupled with implementation details, which makes it difficult to tell what is essential and what is replaceable.

We present a unified design space for shortcut models through flow-map learning: a one-step prediction is regularized by a two-step target under sampled times. This view unifies representative discrete-time and continuous-time methods and enables component-level analysis of path choice, time sampler, flow-map solver, and loss metric. Building on this perspective, our variant ESC (ExplicitShortCut) improves training stability and generation quality, reaching FID50k 2.85 on ImageNet-256x256 with 1-NFE, and 2.53 with extended training.

Unifying the Design Choices

We follow the common training template from the paper's section "Learning to shortcut flow paths": construct a two-step flow-map target \(\hat{X}_{s,r} \circ \hat{X}_{t,s}(x_t)\), then learn a one-step predictor \(X^\theta_{t,r}(x_t)\) to match it. Under this template, existing methods can be compared through modular choices rather than isolated derivations.
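As a concrete toy illustration of this template, the sketch below builds the two-step target from two flow-map steps and compares it to the one-step prediction under an \(l_2\) metric. The function names (`euler_step`, `two_step_target`, `shortcut_l2_loss`) and the stand-in velocity fields are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def euler_step(x, t, r, u):
    """One flow-map step from time t to time r with average velocity u:
    x_r = x_t - (t - r) * u(x_t, t, r)."""
    return x - (t - r) * u(x, t, r)

def two_step_target(x_t, t, s, r, u):
    """Two-step target X_{s,r} o X_{t,s}(x_t): step t -> s, then s -> r.
    In actual training this target would be treated as stop-gradient."""
    x_s = euler_step(x_t, t, s, u)
    return euler_step(x_s, s, r, u)

def shortcut_l2_loss(x_t, t, s, r, u):
    """l2 mismatch between the one-step prediction X_{t,r}(x_t)
    and the two-step target."""
    one_step = euler_step(x_t, t, r, u)
    return float(np.mean((one_step - two_step_target(x_t, t, s, r, u)) ** 2))

# Toy stand-ins for a trained network (illustrative only):
u_const = lambda x, t, r: np.ones_like(x)  # constant field -> straight trajectories
u_state = lambda x, t, r: -x               # state-dependent field -> curved trajectories
```

For a constant velocity field the one-step map already reproduces the two-step target exactly; state dependence of the field is what creates the mismatch the loss penalizes.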

| Design Choice | CT | SCD | IMM | sCT | MeanFlow |
| --- | --- | --- | --- | --- | --- |
| Model Type | Discrete | Discrete | Discrete | Continuous | Continuous |
| Flow Path | Cosine | Linear | Linear | Cosine | Linear |
| Network Output | \(v^\theta\) | \(u^\theta\) | \(v^\theta\) | \(v^\theta\) | \(u^\theta\) |
| 1st-step Target | DDIM with \(v_{t|0}\) | Euler with \(u^\theta\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) |
| 2nd-step Target | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_s\) | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) |
| One-step Prediction | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_t\) | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) |
| Loss Metric | LPIPS | \(l_2\) | Grouped kernel (MMD) | \(l_2\) | \(l_2\) |
Unified design view of representative shortcut models, where \(u^\theta\) is the network parameterization of average velocity and \(v^\theta\) is the network parameterization of instantaneous velocity. \(r \le s \le t\) are the three time points for constructing the two-step target.
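To make the solver entries in the table concrete, here is a minimal sketch of the two update rules they refer to, assuming a variance-preserving cosine path with \(\alpha_t^2 + \sigma_t^2 = 1\) and the common v-prediction convention \(v = \alpha_t \epsilon - \sigma_t x_0\); function names and conventions are our assumptions, not necessarily the paper's:

```python
import numpy as np

alpha = lambda t: np.cos(0.5 * np.pi * t)   # cosine-path signal coefficient
sigma = lambda t: np.sin(0.5 * np.pi * t)   # cosine-path noise coefficient

def ddim_step_v(x_t, v, t, s):
    """DDIM update from a v-prediction, assuming alpha(t)^2 + sigma(t)^2 = 1
    and v = alpha(t) * eps - sigma(t) * x0: recover the (x0, eps) estimates
    implied by v, then re-compose the state at time s."""
    x0_hat = alpha(t) * x_t - sigma(t) * v
    eps_hat = sigma(t) * x_t + alpha(t) * v
    return alpha(s) * x0_hat + sigma(s) * eps_hat

def euler_step_u(x_t, u, t, s):
    """Euler update from an average-velocity prediction u on a linear path:
    x_s = x_t - (t - s) * u."""
    return x_t - (t - s) * u
```

With the true conditional \(v\) (or the true average velocity on a linear path), both updates land exactly on the interpolant at time \(s\), which is why either can serve as the one-step and two-step solver in the template.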

Elucidating the Design Space

1. Theoretical Intuition

Q1: Why share a common design frame? Shortcut models fundamentally simulate PF-ODE trajectories under the marginal velocity field \(v_t(x)\). Ideally, we would supervise with true trajectory pairs \((x_t, x_r)\), but \(x_r\) is intractable once \(x_t\) is sampled. The practical and general solution is therefore to construct a two-step target through an intermediate state \(\hat{x}_s\), and train a one-step map to match it.

Q2: What is the challenge in target construction? In practice, \(\hat{x}_s\) and \(\hat{x}_r\) deviate from the ideal trajectory states \((x_s, x_r)\), introducing bias and variance in supervision. These deviations explain why different shortcut designs can yield noticeably different performance, even if they follow the same high-level principle.

Q3: Why does distillation from pretrained fields work better? Distillation uses a stronger pretrained velocity field that better approximates the marginal velocity, reducing target-construction error and providing cleaner supervision than pure training-from-scratch settings.

Physical picture of ideal vs. practical learning for discrete-time (DTSC) and continuous-time (CTSC) shortcutting (from the paper figure).

2. Empirical Illustration

Q1: Following linear or cosine paths? Empirically, linear-path variants are more competitive in shortcut settings. A key intuition is lower transport curvature and smaller deviation in two-step targets, which makes one-step approximation easier.
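The curvature intuition can be checked numerically: the linear interpolant has zero second time-derivative, while the cosine interpolant bends. The sketch below (our own illustration, with standard interpolant formulas) compares the two via a finite-difference curvature estimate:

```python
import numpy as np

def linear_path(x0, eps, t):
    """Linear interpolant x_t = (1 - t) * x0 + t * eps: straight conditional paths."""
    return (1.0 - t) * x0 + t * eps

def cosine_path(x0, eps, t):
    """Cosine interpolant x_t = cos(pi t / 2) * x0 + sin(pi t / 2) * eps."""
    return np.cos(0.5 * np.pi * t) * x0 + np.sin(0.5 * np.pi * t) * eps

def path_curvature(path, x0, eps, t, h=1e-3):
    """Central-difference estimate of |d^2 x_t / dt^2|: zero for straight
    paths, nonzero where the path bends."""
    return abs(path(x0, eps, t - h) - 2.0 * path(x0, eps, t) + path(x0, eps, t + h)) / h**2
```

Straight conditional paths mean the two-step composition deviates less from the one-step map, which is the stated intuition for why linear-path variants are easier to shortcut.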

Q2: Discrete-time or continuous-time shortcutting? Under unified implementation, continuous-time variants (e.g., sCT / MeanFlow family) consistently outperform discrete-time variants in one-step generation quality.

Q3: Fixing terminal time or making it random? Fixing \(r=0\) can improve early-stage convergence, while randomizing terminal time \(r\) better captures global shortcut patterns and often yields stronger late-stage performance.
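The two regimes differ only in how the terminal time is drawn. A minimal sampler sketch, under the assumption of uniform sampling (the paper's exact sampler may differ):

```python
import numpy as np

def sample_times(batch, rng, fix_terminal=True):
    """Sample time triplets r <= s <= t in [0, 1] for two-step targets.
    fix_terminal=True pins the terminal time r = 0 (boundary-anchored
    shortcuts); False draws r uniformly in [0, s] (global shortcuts).
    Hypothetical sketch, not the paper's exact sampler."""
    t = rng.uniform(0.0, 1.0, size=batch)
    s = rng.uniform(0.0, t)                 # elementwise upper bound t
    r = np.zeros(batch) if fix_terminal else rng.uniform(0.0, s)
    return t, s, r
```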

FID50k comparison curves across CIFAR-10 and ImageNet settings.

Improvements to Training

Building on the theoretical and empirical analysis above, we adopt a continuous-time, linear-path baseline (MeanFlow with SiT-B/2) and improve it in three directions. First, we use a plug-in velocity to reduce the supervision variance introduced by conditional velocity estimates. Second, we introduce a gradual time sampler, which starts with easier supervision and smoothly transitions to the full CTSC sampling regime. Third, we integrate practical optimization tricks (adaptive weighting and warmup-style stabilization) to improve training robustness.
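One way to realize a gradual time sampler is to anneal the maximum shortcut span from near-local to global. The linear annealing rule below is a hypothetical stand-in for the paper's sampler, illustrating the easy-to-full transition only:

```python
import numpy as np

def gradual_time_sampler(batch, step, warmup_steps, rng):
    """Annealed (t, s, r) sampler: early in training the shortcut span t - r
    stays small (easy, near-local targets), then grows linearly until the
    full CTSC regime r in [0, t] is reached. The linear schedule here is a
    hypothetical stand-in for the paper's gradual sampler."""
    frac = min(1.0, (step + 1) / warmup_steps)   # allowed fraction of full span
    t = rng.uniform(0.0, 1.0, size=batch)
    r = t - rng.uniform(0.0, frac * t)           # r stays within frac of t
    s = rng.uniform(r, t)                        # intermediate time in [r, t]
    return t, s, r
```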

In addition, under classifier-free guidance training, we apply a mixed plug-in probability and class-consistent batching to preserve class information while still benefiting from lower-variance supervision. Overall, these modifications form ESC and consistently improve one-step generation fidelity.

| SiT-B/2 Setting (1-NFE, ImageNet-256) | FID50k |
| --- | --- |
| MeanFlow baseline (CFG) | 6.09 |
| + A1 Plug-in velocity (p=1.0) | 6.01 |
| + A2 Plug-in velocity (p=0.5) | 5.98 |
| + B1 Plug-in (p=1.0) + class-consistent batching | 6.08 |
| + B2 Plug-in (p=0.5) + class-consistent batching | 5.96 |
| + C Gradual time sampler | 5.99 |
| + D Additional stabilization techniques | 5.95 |
| ESC (Baseline + B2 + C + D) | 5.77 |
SiT-B/2 ablation under one-step generation (organized from the paper's ImageNet-256 study).

Experiments

Scaling-up setting. We evaluate ESC on ImageNet-256x256 in latent space with SiT-XL/2 (about 676M parameters). Following the MeanFlow training protocol under classifier-free guidance, ESC is trained from scratch for 240 epochs (about 1.2M iterations), and ESC+ extends training to 480 epochs (about 2.4M iterations).

Main results. Under 1-NFE on ImageNet-256x256, MeanFlow reports FID50k 3.43, while ESC improves to 2.85; with longer training, ESC+ reaches 2.53. This is a substantial gain over previous shortcut models trained from scratch and even surpasses MeanFlow's two-step result (2-NFE, FID50k 2.93). This indicates that better target construction and lower-variance supervision remain effective at large model scales.

| Method | Params | NFE | FID50k |
| --- | --- | --- | --- |
| iCT | 675M | 1 | 34.24 |
| SCD | 675M | 1 | 10.60 |
| IMM | 675M | 1x2 | 7.77 |
| MeanFlow | 676M | 1 | 3.43 |
| MeanFlow | 676M | 2 | 2.93 |
| ESC (class-consistent) | 676M | 1 | 2.85 |
| ESC+ (longer training) | 676M | 1 | 2.53 |
Scaling-up comparison on ImageNet-256x256 (shortcut-model-focused subset).

Additional observations. In our ablation and scaling runs, class-consistent batching improves convergence speed, and plug-in velocity adds little computational overhead while improving stability. Performance gains are also more pronounced on larger backbones, suggesting that variance reduction becomes increasingly important as model capacity grows.

BibTeX

@inproceedings{lin2026shortcut,
  title={On the Design of One-step Diffusion via Shortcutting Flow Paths},
  author={Haitao Lin and Peiyan Hu and Minsi Ren and Zhifeng Gao and Zhi-Ming Ma and Guolin Ke and Tailin Wu and Stan Z. Li},
  year={2026},
  booktitle={The Fourteenth International Conference on Learning Representations},
  url={https://openreview.net/forum?id=k6q8rRYVQR}
}