Recent one-step diffusion models trained from scratch have shown strong empirical efficiency by shortcutting probability-flow paths, but their designs are often tightly coupled with implementation details, making it hard to tell what is essential and what is replaceable.
We present a unified design space for shortcut models through flow-map learning: a one-step prediction is regularized toward a two-step target at sampled time pairs. This view unifies representative discrete-time and continuous-time methods and enables component-level analysis of the flow path, time sampler, flow-map solver, and loss metric. Building on this perspective, our ESC variant improves training stability and generation quality, reaching an FID50k of 2.85 on ImageNet-256x256 with 1-NFE, and 2.53 with extended training.
We follow the common training template from the paper's section *Learning to shortcut flow paths*: construct a two-step flow-map target \(\hat{X}_{s,r} \circ \hat{X}_{t,s}(x_t)\), then learn a one-step predictor \(X^\theta_{t,r}(x_t)\) to match it. Under this template, existing methods differ by modular choices rather than isolated derivations.
| Design Choice | CT | SCD | IMM | sCT | MeanFlow |
|---|---|---|---|---|---|
| Model Type | Discrete | Discrete | Discrete | Continuous | Continuous |
| Flow Path | Cosine | Linear | Linear | Cosine | Linear |
| Network Output | \(v^\theta\) | \(u^\theta\) | \(v^\theta\) | \(v^\theta\) | \(u^\theta\) |
| 1st-step Target | DDIM with \(v_{t|0}\) | Euler with \(u^\theta\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) |
| 2nd-step Target | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_s\) | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) |
| One-step Prediction | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_t\) | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) |
| Loss Metric | LPIPS | \(l_2\) | Grouped kernel (MMD) | \(l_2\) | \(l_2\) |
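The shared template in the table can be made concrete with a small numerical sketch. The snippet below is a toy 1-D illustration, not the paper's implementation, of the continuous-time linear-path row: the first sub-step uses the conditional velocity \(v_{t|0}\), the second an Euler step with a frozen copy of \(u^\theta\), and the one-step prediction is fit with an \(l_2\) loss. The names `u_theta` and `shortcut_loss`, and the tiny linear parameterization, are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup on the linear path x_t = (1 - t) * x0 + t * eps,
# whose conditional velocity is v_{t|0} = eps - x0.
x0 = rng.normal(loc=2.0, scale=0.5, size=256)   # "data"
eps = rng.normal(size=256)                      # noise

def u_theta(theta, x, t, r):
    # Hypothetical average-velocity model over [r, t]: a tiny linear
    # function of (x, t, r, 1); real methods use a deep network here.
    return theta[0] * x + theta[1] * t + theta[2] * r + theta[3]

def shortcut_loss(theta, theta_sg, t, s, r):
    """l_2 gap between the two-step target (frozen params theta_sg,
    i.e. a stop-gradient) and the one-step prediction."""
    x_t = (1.0 - t) * x0 + t * eps
    x_s = x_t + (s - t) * (eps - x0)                        # 1st step: conditional velocity
    x_r_tgt = x_s + (r - s) * u_theta(theta_sg, x_s, s, r)  # 2nd step: Euler with frozen u^theta
    x_r_pred = x_t + (r - t) * u_theta(theta, x_t, t, r)    # one-step prediction
    return np.mean((x_r_pred - x_r_tgt) ** 2)

# One finite-difference gradient step, just to show the loop shape.
theta = np.zeros(4)
t, s, r = 1.0, 0.5, 0.0
loss_before = shortcut_loss(theta, theta, t, s, r)
grad = np.zeros_like(theta)
h = 1e-5
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = h
    grad[i] = (shortcut_loss(theta + d, theta, t, s, r)
               - shortcut_loss(theta - d, theta, t, s, r)) / (2 * h)
theta = theta - 0.05 * grad
loss_after = shortcut_loss(theta, theta, t, s, r)
```

Note that the stop-gradient of the second sub-step is realized here simply by passing the frozen parameters `theta_sg` so the target does not receive gradients.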
Q1: Why share a common design frame? Shortcut models fundamentally simulate PF-ODE trajectories under the marginal velocity field \(v_t(x)\). Ideally, we would supervise with true trajectory pairs \((x_t, x_r)\), but \(x_r\) is intractable once \(x_t\) is sampled. The practical and general solution is therefore to construct a two-step target through an intermediate state \(\hat{x}_s\), and train a one-step map to match it.
Q2: What is the challenge in target construction? In practice, \(\hat{x}_s\) and \(\hat{x}_r\) deviate from the ideal trajectory states \((x_s, x_r)\), introducing bias and variance in supervision. These deviations explain why different shortcut designs can yield noticeably different performance, even if they follow the same high-level principle.
Q3: Why does distillation from pretrained fields work better? Distillation uses a stronger pretrained velocity field that better approximates the marginal velocity, reducing target-construction error and providing cleaner supervision than pure training-from-scratch settings.
Q1: Linear or cosine paths? Empirically, linear-path variants are more competitive in shortcut settings. A key intuition is that linear paths have lower transport curvature, so two-step targets deviate less from the true trajectory and the one-step map is easier to fit.
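The curvature intuition is easy to verify on a single conditional pair: along the linear path the conditional velocity is constant, so one Euler step from \(t=1\) to \(r=0\) lands exactly on the endpoint, while along the cosine path the velocity rotates with \(t\) and a single step misses it. A minimal sketch (our own toy check, not from the paper):

```python
import numpy as np

x0, eps = 1.5, -0.7   # one data/noise pair, scalars for clarity

def linear_path(t):   # x_t = (1 - t) x0 + t eps
    return (1 - t) * x0 + t * eps

def linear_vel(t):    # d x_t / dt = eps - x0, constant in t
    return eps - x0

def cosine_path(t):   # x_t = cos(pi t / 2) x0 + sin(pi t / 2) eps
    a = np.pi * t / 2
    return np.cos(a) * x0 + np.sin(a) * eps

def cosine_vel(t):    # d x_t / dt rotates with t
    a = np.pi * t / 2
    return (np.pi / 2) * (-np.sin(a) * x0 + np.cos(a) * eps)

def euler_error(path, vel, t, r):
    """Distance between one Euler step t -> r and the true state x_r."""
    x_hat_r = path(t) + (r - t) * vel(t)
    return abs(x_hat_r - path(r))

err_lin = euler_error(linear_path, linear_vel, 1.0, 0.0)   # exact: 0
err_cos = euler_error(cosine_path, cosine_vel, 1.0, 0.0)   # overshoots
```

The same effect compounds inside the two-step target construction, which is one reading of why linear-path shortcut variants are easier to supervise.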
Q2: Discrete-time or continuous-time shortcutting? Under a unified implementation, continuous-time variants (e.g., the sCT / MeanFlow family) consistently outperform discrete-time variants in one-step generation quality.
Q3: Fixed or random terminal time? Fixing \(r=0\) improves early-stage convergence, while randomizing the terminal time \(r\) better captures global shortcut patterns and often yields stronger late-stage performance.
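One way to realize the two regimes is a sampler that either pins \(r=0\) or draws \(r\) freely with \(0 \le r < s < t \le 1\). The sketch below is a hypothetical implementation; the paper's exact sampler may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_times(n, fix_terminal):
    """Draw n triples (t, s, r) with r <= s <= t in [0, 1].

    fix_terminal=True pins r = 0 (the easier, early-convergence regime);
    otherwise r is drawn freely, exposing the model to arbitrary spans.
    """
    if fix_terminal:
        t = rng.uniform(0.0, 1.0, n)
        r = np.zeros(n)
    else:
        a = rng.uniform(0.0, 1.0, (n, 2))
        t, r = a.max(axis=1), a.min(axis=1)   # order the pair so r <= t
    s = rng.uniform(r, t)                     # intermediate time in [r, t)
    return t, s, r
```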
Building on the theoretical and empirical analysis above, we adopt a continuous-time linear-path baseline (MeanFlow with SiT-B/2) and improve it in three directions. First, we use plug-in velocity to reduce supervision variance from conditional velocity estimates. Second, we introduce a gradual time sampler, which starts with easier supervision and smoothly transitions to the full CTSC sampling regime. Third, we integrate practical optimization tricks (adaptive weighting and warmup-style stabilization) to improve training robustness.
In addition, under classifier-free guidance training, we apply plug-in velocity with a mixing probability together with class-consistent batching, preserving class information while still benefiting from lower-variance supervision. Together, these modifications form ESC and consistently improve one-step generation fidelity.
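To make the per-batch randomness concrete, here is a hypothetical sketch: the probability of a free terminal time ramps up linearly over training (gradual time sampler), and each sample independently switches to the plug-in velocity with probability `p_plugin` (the mixing probability). The function name, the linear ramp, and the Bernoulli mixing are our own assumptions, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

def esc_batch_choices(step, total_steps, batch, p_plugin=0.5):
    """Hypothetical per-batch randomness for an ESC-style training step.

    - Gradual time sampler: the probability of drawing a free terminal
      time r ramps from 0 to 1 over training, so early batches mostly
      use the easier fixed r = 0 supervision.
    - Mixed plug-in: each sample uses the (lower-variance) plug-in
      velocity for the first sub-step with probability p_plugin, else
      the conditional velocity.
    """
    p_random_r = min(step / total_steps, 1.0)       # linear ramp 0 -> 1
    use_random_r = rng.random(batch) < p_random_r
    t = rng.uniform(0.0, 1.0, batch)
    r = np.where(use_random_r, rng.uniform(0.0, t), 0.0)
    use_plugin = rng.random(batch) < p_plugin
    return t, r, use_plugin
```

Class-consistent batching would additionally group samples by class label when forming each batch; that bookkeeping is omitted here.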
| SiT-B/2 Setting (1-NFE, ImageNet-256) | FID50k |
|---|---|
| MeanFlow baseline (CFG) | 6.09 |
| + A1 Plug-in velocity (p=1.0) | 6.01 |
| + A2 Plug-in velocity (p=0.5) | 5.98 |
| + B1 Plug-in (p=1.0) + class-consistent batching | 6.08 |
| + B2 Plug-in (p=0.5) + class-consistent batching | 5.96 |
| + C Gradual time sampler | 5.99 |
| + D Additional stabilization techniques | 5.95 |
| ESC (Baseline + B2 + C + D) | 5.77 |
Scaling-up setting. We evaluate ESC on ImageNet-256x256 in latent space with SiT-XL/2 (about 676M parameters). Following the MeanFlow training protocol under classifier-free guidance, ESC is trained from scratch for 240 epochs (about 1.2M iterations), and ESC+ extends training to 480 epochs (about 2.4M iterations).
Main results. Under 1-NFE on ImageNet-256x256, MeanFlow reports an FID50k of 3.43, while ESC improves to 2.85; with longer training, ESC+ reaches 2.53. This is a substantial gain over previous shortcut models trained from scratch, and even surpasses MeanFlow's two-step result (2-NFE, FID50k 2.93). The result indicates that better target construction and lower-variance supervision remain effective at large model scales.
| Method | Params | NFE | FID50k |
|---|---|---|---|
| iCT | 675M | 1 | 34.24 |
| SCD | 675M | 1 | 10.60 |
| IMM | 675M | 1x2 | 7.77 |
| MeanFlow | 676M | 1 | 3.43 |
| MeanFlow | 676M | 2 | 2.93 |
| ESC (class-consistent) | 676M | 1 | 2.85 |
| ESC+ (longer training) | 676M | 1 | 2.53 |
Additional observations. In our ablation and scaling runs, class-consistent batching improves convergence speed, and plug-in velocity adds little computational overhead while improving stability. Performance gains are also more pronounced on larger backbones, suggesting that variance reduction becomes increasingly important as model capacity grows.
@inproceedings{lin2026shortcut,
  title={On the Design of One-step Diffusion via Shortcutting Flow Paths},
  author={Haitao Lin and Peiyan Hu and Minsi Ren and Zhifeng Gao and Zhi-Ming Ma and Guolin Ke and Tailin Wu and Stan Z. Li},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=k6q8rRYVQR}
}