The main practical bottlenecks in high-quality text-to-image diffusion are attention cost, which grows quadratically with the number of tokens, and the expense of denoising steps at high resolution. The core insight of this paper's method, MrFlow, is that global structure can be generated cheaply at low resolution, then upsampled with a pixel-space GAN super-resolution (SR) model and perturbed with a small amount of latent noise, so that the pretrained flow prior can correct SR artifacts in only a few high-resolution steps.
Key Findings
- Staged low-to-high pipeline: performing the bulk of denoising at low resolution reduces per-step compute (fewer tokens) and required timesteps, so end-to-end latency falls dramatically while preserving global layout.
- Pixel-space GAN-based super-resolution: upsampling in pixel space (Real-ESRGAN x2 in their reference implementation) preserves LR structure and supplies high-frequency signals that the subsequent latent-stage resampling can refine.
- Low-strength latent noise injection: adding a small amount of scheduler-consistent noise (e.g., sigma≈0.1) after SR lets the high-resolution flow prior resample and correct the high-frequency errors introduced by SR rather than blindly trusting them.
- Empirical tradeoff points: configurations like 12 low-res steps + 1 high-res refinement achieve ≈10× end-to-end speedup with OneIG quality within ~1% of native sampling; combining MrFlow with timestep-distilled models can compound speedups up to ≈25×.
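To make the compute argument behind these findings concrete, here is a back-of-the-envelope cost model. It is only a sketch: the function names, the 8× VAE downsampling, the 2×2 patch size, the hidden width of 3072, the 1024 px target resolution, and the 28-step native baseline are all illustrative assumptions rather than values from the paper. It also counts denoising cost only and ignores the VAE and SR overhead, so it is not expected to reproduce the reported end-to-end speedups.

```python
def num_tokens(height, width, vae_factor=8, patch=2):
    """Tokens the backbone sees for an image of the given pixel size."""
    stride = vae_factor * patch  # pixels covered by one token along each axis
    return (height // stride) * (width // stride)


def step_cost(n_tokens, hidden=3072):
    """Rough per-step FLOP proxy: linear layers ~ n*d^2, attention ~ n^2*d."""
    return n_tokens * hidden**2 + n_tokens**2 * hidden


def schedule_cost(stages):
    """Total denoising cost for a list of (side_length_px, num_steps) stages."""
    return sum(steps * step_cost(num_tokens(side, side)) for side, steps in stages)


NATIVE_STEPS = 28  # hypothetical baseline step count, not taken from the paper

native = schedule_cost([(1024, NATIVE_STEPS)])
# 12 low-res steps at half the side length (x2 SR), then 1 high-res step.
staged = schedule_cost([(512, 12), (1024, 1)])

lr_tokens = num_tokens(512, 512)
hr_tokens = num_tokens(1024, 1024)
print(f"tokens: {lr_tokens} (LR) vs {hr_tokens} (HR)")
print(f"per-step cost ratio HR/LR: {step_cost(hr_tokens) / step_cost(lr_tokens):.1f}")
print(f"denoising-only speedup under these assumptions: {native / staged:.1f}x")
```

Halving the side length cuts the token count by 4×, so the per-step saving lies between 4× (if linear layers dominate) and 16× (if attention dominates). Where a real model falls in that range depends on its width and sequence length.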
Who It's For and Tradeoffs
A good fit if you need much faster text-to-image sampling from pretrained flow-matching or diffusion backbones without finetuning, custom kernels, or training-time changes, especially when latency matters more than the finest visual detail. Look elsewhere if your pipeline requires pixel-level fidelity to natively sampled images or cannot take on an external SR model dependency: MrFlow shifts some trust to the SR network, although the latent noise and the short HR refinement mitigate many of its artifacts. Also expect some additional overhead from the VAE encode/decode and SR steps, though it is small compared with HR denoising.
How It Works (Brief)
The method runs in seven stages: (1) sample latents at low resolution to produce the coarse image structure; (2) decode them to pixels with the VAE; (3) upscale in pixel space with a pretrained, lightweight GAN SR model; (4) re-encode the upscaled image into latents with the VAE; (5) inject low-strength, scheduler-consistent noise; (6) run a few high-resolution denoising/refinement steps with the original flow prior; (7) decode the final latents with the VAE. The design deliberately does most of the work in cheap LR sampling and uses one or very few HR steps to finish the details.
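The seven stages can be sketched as a short NumPy skeleton. This is a minimal illustration under stated assumptions, not the paper's implementation. Every function name here is hypothetical. "Scheduler-consistent" noise is assumed to mean the rectified-flow interpolation x_sigma = (1 - sigma) * x + sigma * eps, and sampling is assumed to use plain Euler steps. The velocity model, VAE, and upscaler are toy stand-ins so that the script runs on its own. A real pipeline would plug in the pretrained backbone (with prompt conditioning), its VAE, and an SR model such as Real-ESRGAN.

```python
import numpy as np


def inject_noise(latents, sigma, rng):
    """Assumed rectified-flow noising: blend clean latents with Gaussian noise."""
    noise = rng.standard_normal(latents.shape)
    return (1.0 - sigma) * latents + sigma * noise


def euler_flow_sample(velocity_fn, x, sigmas):
    """Euler integration of dx/dsigma = v(x, sigma) along a decreasing sigma grid."""
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s) * velocity_fn(x, s)
    return x


def staged_sample(velocity_fn, vae_decode, vae_encode, upscale, lr_shape,
                  lr_steps=12, hr_steps=1, sigma_hr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(lr_shape)                  # start from pure noise at LR
    x = euler_flow_sample(velocity_fn, x,              # (1) low-res latent sampling
                          np.linspace(1.0, 0.0, lr_steps + 1))
    img = vae_decode(x)                                # (2) latents -> pixels
    img = upscale(img)                                 # (3) pixel-space SR
    z = vae_encode(img)                                # (4) pixels -> HR latents
    z = inject_noise(z, sigma_hr, rng)                 # (5) low-strength noise
    z = euler_flow_sample(velocity_fn, z,              # (6) short HR refinement
                          np.linspace(sigma_hr, 0.0, hr_steps + 1))
    return vae_decode(z)                               # (7) final decode


# Toy stand-ins (assumptions for illustration only).
def toy_velocity(x, sigma):
    # Velocity of a flow whose clean target is the constant 0.5 everywhere.
    return (x - 0.5) / sigma


def toy_vae(x):
    # Identity map standing in for both VAE encode and decode.
    return x


def toy_upscale(img):
    # Nearest-neighbour x2 upscaling standing in for a GAN SR model.
    return np.repeat(np.repeat(img, 2, axis=-2), 2, axis=-1)


out = staged_sample(toy_velocity, toy_vae, toy_vae, toy_upscale,
                    lr_shape=(1, 4, 16, 16))
print(out.shape)
```

Starting the HR sigma grid at `sigma_hr` rather than 1.0 is what keeps the refinement cheap: the sampler only has to integrate the last small stretch of the trajectory, so one or two steps can suffice.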
