Why this matters
Interleaved generation (alternating text instructions and images) is central to visual storytelling, stepwise editing, and embodied manipulation, yet most image generators are architecturally limited to single-shot outputs. InterleaveThinker reframes the problem: instead of designing a new generator, it wraps a multi-agent pipeline (planner, generator, critic) around any existing image model so that the model can execute multi-step, instruction-driven trajectories.
Key Findings
- Multi-agent pipeline: a planner produces the instruction for each generation step; the image generator executes it; a critic inspects the output and, when a step needs correction, issues a refined instruction. This decomposition separates planning, execution, and evaluation.
- Cold-start SFT datasets: the authors build Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to initialize planner and critic behaviors in the target interleaved format.
- RL refinement with GRPO: to improve step-wise correction, they build Interleave-Critic-RL-13k and train the critic on it with GRPO. Because full-trajectory optimization is costly (trajectories can exceed 25 generator calls), they introduce accuracy and step-wise rewards that let single-step RL updates effectively guide long trajectories.
- Broad improvements: applying the pipeline consistently improves multiple base image generators on interleaved-generation benchmarks (authors report results comparable to Nano Banana and GPT-5 on the evaluated tasks). It also unexpectedly enhances reasoning-style multi-step generation: for example, the 4-step FLUX.2-klein generator shows measurable gains on benchmarks such as WISE and RISE.
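The planner, generator, and critic loop described in the findings above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Verdict`, `run_trajectory`, the `max_retries` parameter, and the toy stub functions are all hypothetical names, and images are represented as plain strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = str  # stand-in for a real image object (assumption for this sketch)


@dataclass
class Verdict:
    accepted: bool
    refined_instruction: Optional[str] = None  # critic's corrective rewrite, if any


def run_trajectory(
    task: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[str, Optional[Image]], Image],
    critique: Callable[[str, Image], Verdict],
    max_retries: int = 2,  # hypothetical cap on critic-driven retries per step
) -> Tuple[List[Image], int]:
    """Run every planned step, retrying with the critic's refined instruction."""
    images: List[Image] = []
    calls = 0  # generator calls, the main runtime cost
    previous: Optional[Image] = None
    for instruction in plan(task):
        current = instruction
        for _ in range(max_retries + 1):
            image = generate(current, previous)
            calls += 1
            # The critic judges the image against the planner's original instruction.
            verdict = critique(instruction, image)
            if verdict.accepted or verdict.refined_instruction is None:
                break
            current = verdict.refined_instruction
        images.append(image)
        previous = image
    return images, calls


# Toy stand-ins so the sketch runs end to end.
def toy_plan(task: str) -> List[str]:
    return [f"{task}: step {i}" for i in (1, 2, 3)]


def toy_generate(instruction: str, previous: Optional[Image]) -> Image:
    return f"img<{instruction}>"


def toy_critique(instruction: str, image: Image) -> Verdict:
    if "refined" in image:
        return Verdict(True)
    if instruction.endswith("step 2"):  # pretend step 2 fails on the first try
        return Verdict(False, instruction + " (refined)")
    return Verdict(True)


images, calls = run_trajectory("draw a cat", toy_plan, toy_generate, toy_critique)
```

In this toy run, only the second step is rejected once, so the generator is called one extra time. With a real generator, the same call counter is a convenient way to track the runtime cost of a trajectory.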
Where it fits
Great fit if you need stepwise, instruction-driven visual outputs without swapping your image backbone: visual narratives, guided editing workflows, robotic/embodied instruction pipelines, or tools that require iterative image feedback. It’s a pragmatic adapter layer that leverages existing generators while adding control and evaluative feedback.
Look elsewhere if you need a single monolithic model with end-to-end learned interleaved capabilities (costly to train from scratch) or if your use case cannot tolerate the runtime overhead of many generator calls per trajectory.
Practical trade-offs and mechanics
- Compute vs. control: interleaved trajectories can exceed 25 generator calls; the method accepts this extra runtime in exchange for stepwise control and higher fidelity to multi-step instructions.
- Engineering surface: integrates as an external pipeline that speaks to the generator via prompts/instructions; requires adapting the planner/critic to the target generator’s instruction style.
- Evaluation orientation: the critic both filters bad outputs and generates refined instructions, so its reward design (accuracy and step-wise rewards) is key; the authors supply SFT and RL datasets to bootstrap this behavior.
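To make the reward discussion concrete, here is a small sketch of the group-relative advantage that GRPO-style training computes: several critic responses are sampled for the same step, each is scored, and each score is normalized against the group's mean and standard deviation. The weighted sum in `combined_reward` and the weight `w_step` are assumptions for illustration; this summary does not specify how the accuracy and step-wise rewards are combined. The choice of population standard deviation is also an assumption, since implementations differ.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, stepwise: float, w_step: float = 0.5) -> float:
    # Assumption: a plain weighted sum of the two reward terms.
    return accuracy + w_step * stepwise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group (mean and std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical critic responses sampled for one step: (accuracy, step-wise).
samples = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
rewards = [combined_reward(a, s) for a, s in samples]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so a single-step update needs only the rewards of that step's sampled group rather than a full trajectory rollout.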
Quick takeaway
InterleaveThinker is a practical pattern for retrofitting interleaved, agentic control onto existing image generators by combining a planner, a generator, and a critic refined with single-step RL. Expect clearer stepwise guidance and better adherence to multi-step plans, at the cost of higher runtime and the engineering effort needed to adapt the planner and critic to your generator.
