ExplicitShortCut: On the Design of One-step Diffusion
via Shortcutting Flow Paths

One-step generation showcase on ImageNet-256x256 with ESC (SiT-XL/2 architecture).

Abstract

Recent one-step diffusion models trained from scratch achieve strong empirical efficiency by shortcutting probability flow paths, but their method designs are often tightly coupled with implementation details, which makes it difficult to tell what is essential and what is replaceable.

We present a unified design space for shortcut models through flow-map learning: a one-step prediction is regularized by a two-step target under sampled times. This view unifies representative discrete-time and continuous-time methods and enables component-level analysis of path choice, time sampler, flow-map solver, and loss metric. Building on this perspective, our variant ESC (ExplicitShortCut) improves training stability and generation quality, reaching FID50k 2.85 on ImageNet-256x256 with 1-NFE, and 2.53 with extended training.

Unifying the Design Choices

We follow the common training template from the paper's section "Learning to shortcut flow paths": construct a two-step flow-map target \(\hat{X}_{s,r} \circ \hat{X}_{t,s}(x_t)\), then learn a one-step predictor \(X^\theta_{t,r}(x_t)\) to match it. Under this template, existing methods can be compared through modular choices rather than isolated derivations.
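As a concrete toy illustration of this template, the sketch below builds the two-step target from two flow-map steps and compares it to the one-step prediction under an \(l_2\) metric. The function names (`euler_step`, `two_step_target`, `shortcut_l2_loss`) and the stand-in velocity fields are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def euler_step(x, t, r, u):
    """One flow-map step from time t to time r with average velocity u:
    x_r = x_t - (t - r) * u(x_t, t, r)."""
    return x - (t - r) * u(x, t, r)

def two_step_target(x_t, t, s, r, u):
    """Two-step target X_{s,r} o X_{t,s}(x_t): step t -> s, then s -> r.
    In actual training this target would be treated as stop-gradient."""
    x_s = euler_step(x_t, t, s, u)
    return euler_step(x_s, s, r, u)

def shortcut_l2_loss(x_t, t, s, r, u):
    """l2 mismatch between the one-step prediction X_{t,r}(x_t)
    and the two-step target."""
    one_step = euler_step(x_t, t, r, u)
    return float(np.mean((one_step - two_step_target(x_t, t, s, r, u)) ** 2))

# Toy stand-ins for a trained network (illustrative only):
u_const = lambda x, t, r: np.ones_like(x)  # constant field -> straight trajectories
u_state = lambda x, t, r: -x               # state-dependent field -> curved trajectories
```

For a constant velocity field the one-step map already reproduces the two-step target exactly; state dependence of the field is what creates the mismatch the loss penalizes.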

| Design Choice | CT | SCD | IMM | sCT | MeanFlow |
| --- | --- | --- | --- | --- | --- |
| Model Type | Discrete | Discrete | Discrete | Continuous | Continuous |
| Flow Path | Cosine | Linear | Linear | Cosine | Linear |
| Network Output | \(v^\theta\) | \(u^\theta\) | \(v^\theta\) | \(v^\theta\) | \(u^\theta\) |
| 1st-step Target | DDIM with \(v_{t|0}\) | Euler with \(u^\theta\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) | DDIM with \(v_{t|0}\) |
| 2nd-step Target | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_s\) | DDIM with \(v^\theta_s\) | Euler with \(u^\theta\) |
| One-step Prediction | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) | DDIM with \(v^\theta_t\) | DDIM with \(v^\theta_t\) | Euler with \(u^\theta\) |
| Loss Metric | LPIPS | \(l_2\) | Grouped kernel (MMD) | \(l_2\) | \(l_2\) |
Unified design view of representative shortcut models, where \(u^\theta\) is the network parameterization of average velocity and \(v^\theta\) is the network parameterization of instantaneous velocity. \(r \le s \le t\) are the three time points for constructing the two-step target.
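To make the solver entries in the table concrete, here is a minimal sketch of the two update rules they refer to, assuming a variance-preserving cosine path with \(\alpha_t^2 + \sigma_t^2 = 1\) and the common v-prediction convention \(v = \alpha_t \epsilon - \sigma_t x_0\); function names and conventions are our assumptions, not necessarily the paper's:

```python
import numpy as np

alpha = lambda t: np.cos(0.5 * np.pi * t)   # cosine-path signal coefficient
sigma = lambda t: np.sin(0.5 * np.pi * t)   # cosine-path noise coefficient

def ddim_step_v(x_t, v, t, s):
    """DDIM update from a v-prediction, assuming alpha(t)^2 + sigma(t)^2 = 1
    and v = alpha(t) * eps - sigma(t) * x0: recover the (x0, eps) estimates
    implied by v, then re-compose the state at time s."""
    x0_hat = alpha(t) * x_t - sigma(t) * v
    eps_hat = sigma(t) * x_t + alpha(t) * v
    return alpha(s) * x0_hat + sigma(s) * eps_hat

def euler_step_u(x_t, u, t, s):
    """Euler update from an average-velocity prediction u on a linear path:
    x_s = x_t - (t - s) * u."""
    return x_t - (t - s) * u
```

With the true conditional \(v\) (or the true average velocity on a linear path), both updates land exactly on the interpolant at time \(s\), which is why either can serve as the one-step and two-step solver in the template.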

Elucidating the Design Space

1. Theoretical Intuition

Q1: Why share a common design frame? Shortcut models fundamentally simulate PF-ODE trajectories under the marginal velocity field \(v_t(x)\). Ideally, we would supervise with true trajectory pairs \((x_t, x_r)\), but \(x_r\) is intractable once \(x_t\) is sampled. The practical and general solution is therefore to construct a two-step target through an intermediate state \(\hat{x}_s\), and train a one-step map to match it.

Q2: What is the challenge in target construction? In practice, \(\hat{x}_s\) and \(\hat{x}_r\) deviate from the ideal trajectory states \((x_s, x_r)\), introducing bias and variance in supervision. These deviations explain why different shortcut designs can yield noticeably different performance, even if they follow the same high-level principle.

Q3: Why does distillation from pretrained fields work better? Distillation uses a stronger pretrained velocity field that better approximates the marginal velocity, reducing target-construction error and providing cleaner supervision than pure training-from-scratch settings.

Physical picture of ideal vs. practical learning for discrete-time (DTSC) and continuous-time (CTSC) shortcutting (from the paper figure).

2. Empirical Illustration

Q1: Following linear or cosine paths? Empirically, linear-path variants are more competitive in shortcut settings. A key intuition is lower transport curvature and smaller deviation in two-step targets, which makes one-step approximation easier.
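The curvature intuition can be checked numerically: the linear interpolant has zero second time-derivative, while the cosine interpolant bends. The sketch below (our own illustration, with standard interpolant formulas) compares the two via a finite-difference curvature estimate:

```python
import numpy as np

def linear_path(x0, eps, t):
    """Linear interpolant x_t = (1 - t) * x0 + t * eps: straight conditional paths."""
    return (1.0 - t) * x0 + t * eps

def cosine_path(x0, eps, t):
    """Cosine interpolant x_t = cos(pi t / 2) * x0 + sin(pi t / 2) * eps."""
    return np.cos(0.5 * np.pi * t) * x0 + np.sin(0.5 * np.pi * t) * eps

def path_curvature(path, x0, eps, t, h=1e-3):
    """Central-difference estimate of |d^2 x_t / dt^2|: zero for straight
    paths, nonzero where the path bends."""
    return abs(path(x0, eps, t - h) - 2.0 * path(x0, eps, t) + path(x0, eps, t + h)) / h**2
```

Straight conditional paths mean the two-step composition deviates less from the one-step map, which is the stated intuition for why linear-path variants are easier to shortcut.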

Q2: Discrete-time or continuous-time shortcutting? Under unified implementation, continuous-time variants (e.g., sCT / MeanFlow family) consistently outperform discrete-time variants in one-step generation quality.

Q3: Fixing terminal time or making it random? Fixing \(r=0\) can improve early-stage convergence, while randomizing terminal time \(r\) better captures global shortcut patterns and often yields stronger late-stage performance.
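The two regimes differ only in how the terminal time is drawn. A minimal sampler sketch, under the assumption of uniform sampling (the paper's exact sampler may differ):

```python
import numpy as np

def sample_times(batch, rng, fix_terminal=True):
    """Sample time triplets r <= s <= t in [0, 1] for two-step targets.
    fix_terminal=True pins the terminal time r = 0 (boundary-anchored
    shortcuts); False draws r uniformly in [0, s] (global shortcuts).
    Hypothetical sketch, not the paper's exact sampler."""
    t = rng.uniform(0.0, 1.0, size=batch)
    s = rng.uniform(0.0, t)                 # elementwise upper bound t
    r = np.zeros(batch) if fix_terminal else rng.uniform(0.0, s)
    return t, s, r
```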

FID50k comparison curves across CIFAR-10 and ImageNet settings.

Improvements to Training

Building on the theoretical and empirical analysis above, we adopt a continuous-time, linear-path baseline (MeanFlow with SiT-B/2) and improve it in three directions. First, we use a plug-in velocity to reduce the supervision variance introduced by conditional velocity estimates. Second, we introduce a gradual time sampler, which starts with easier supervision and smoothly transitions to the full CTSC sampling regime. Third, we integrate practical optimization tricks (adaptive weighting and warmup-style stabilization) to improve training robustness.
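One way to realize a gradual time sampler is to anneal the maximum shortcut span from near-local to global. The linear annealing rule below is a hypothetical stand-in for the paper's sampler, illustrating the easy-to-full transition only:

```python
import numpy as np

def gradual_time_sampler(batch, step, warmup_steps, rng):
    """Annealed (t, s, r) sampler: early in training the shortcut span t - r
    stays small (easy, near-local targets), then grows linearly until the
    full CTSC regime r in [0, t] is reached. The linear schedule here is a
    hypothetical stand-in for the paper's gradual sampler."""
    frac = min(1.0, (step + 1) / warmup_steps)   # allowed fraction of full span
    t = rng.uniform(0.0, 1.0, size=batch)
    r = t - rng.uniform(0.0, frac * t)           # r stays within frac of t
    s = rng.uniform(r, t)                        # intermediate time in [r, t]
    return t, s, r
```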

In addition, under classifier-free guidance training, we apply a mixed plug-in probability and class-consistent batching to preserve class information while still benefiting from lower-variance supervision. Overall, these modifications form ESC and consistently improve one-step generation fidelity.

| SiT-B/2 Setting (1-NFE, ImageNet-256) | FID50k |
| --- | --- |
| MeanFlow baseline (CFG) | 6.09 |
| + A1 Plug-in velocity (p=1.0) | 6.01 |
| + A2 Plug-in velocity (p=0.5) | 5.98 |
| + B1 Plug-in (p=1.0) + class-consistent batching | 6.08 |
| + B2 Plug-in (p=0.5) + class-consistent batching | 5.96 |
| + C Gradual time sampler | 5.99 |
| + D Additional stabilization techniques | 5.95 |
| ESC (Baseline + B2 + C + D) | 5.77 |
SiT-B/2 ablation under one-step generation (organized from the paper's ImageNet-256 study).

Experiments

Scaling-up setting. We evaluate ESC on ImageNet-256x256 in latent space with SiT-XL/2 (about 676M parameters). Following the MeanFlow training protocol under classifier-free guidance, ESC is trained from scratch for 240 epochs (about 1.2M iterations), and ESC+ extends training to 480 epochs (about 2.4M iterations).

Main results. Under 1-NFE on ImageNet-256x256, MeanFlow reports FID50k 3.43, while ESC improves to 2.85; with longer training, ESC+ reaches 2.53. This is a substantial gain over previous shortcut models trained from scratch and even surpasses MeanFlow's two-step result (2-NFE, FID50k 2.93). This indicates that better target construction and lower-variance supervision remain effective at large model scales.

| Method | Params | NFE | FID50k |
| --- | --- | --- | --- |
| iCT | 675M | 1 | 34.24 |
| SCD | 675M | 1 | 10.60 |
| IMM | 675M | 1x2 | 7.77 |
| MeanFlow | 676M | 1 | 3.43 |
| MeanFlow | 676M | 2 | 2.93 |
| ESC (class-consistent) | 676M | 1 | 2.85 |
| ESC+ (longer training) | 676M | 1 | 2.53 |
Scaling-up comparison on ImageNet-256x256 (shortcut-model-focused subset).

Additional observations. In our ablation and scaling runs, class-consistent batching improves convergence speed, and plug-in velocity adds little computational overhead while improving stability. Performance gains are also more pronounced on larger backbones, suggesting that variance reduction becomes increasingly important as model capacity grows.

BibTeX

@inproceedings{lin2026shortcut,
  title={On the Design of One-step Diffusion via Shortcutting Flow Paths},
  author={Haitao Lin and Peiyan Hu and Minsi Ren and Zhifeng Gao and Zhi-Ming Ma and Guolin Ke and Tailin Wu and Stan Z. Li},
  year={2026},
  booktitle={The Fourteenth International Conference on Learning Representations},
  url={https://openreview.net/forum?id=k6q8rRYVQR}
}