InterleaveThinker: Reinforcing Agentic Interleaved Generation

By Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo

HF Daily Papers · Jun 11, 2026

InterleaveThinker uses a multi-agent pipeline to add interleaved text-image generation capabilities to existing image generators.

Categories: Research

Excerpt

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-w
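The planner, generator, and critic loop described in the excerpt can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Step`, `Verdict`, `interleave`, and `max_retries` are hypothetical, and the three stub agents stand in for what the paper describes as trained models wrapped around an existing image generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    text: str         # text shown to the reader at this step
    instruction: str  # instruction handed to the image generator


@dataclass
class Verdict:
    ok: bool                  # does the image follow the instruction?
    refined_instruction: str  # replacement instruction if it does not


# Hypothetical agent signatures; images are plain strings in this sketch.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]
Critic = Callable[[Step, str], Verdict]


def interleave(prompt: str, planner: Planner, generator: Generator,
               critic: Critic, max_retries: int = 2) -> List[Tuple[str, str]]:
    """Plan the sequence, generate each image, let the critic request retries."""
    outputs = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            verdict = critic(Step(step.text, instruction), image)
            if verdict.ok:
                break
            # Critic rejected the sample: regenerate from the refined instruction.
            instruction = verdict.refined_instruction
            image = generator(instruction)
        outputs.append((step.text, image))
    return outputs


# Stub agents, purely for demonstration.
calls: List[str] = []


def stub_planner(prompt: str) -> List[Step]:
    return [Step("a", "draw a"), Step("b", "draw b detailed")]


def stub_generator(instruction: str) -> str:
    calls.append(instruction)
    return f"img<{instruction}>"


def stub_critic(step: Step, image: str) -> Verdict:
    # Toy rule: accept only instructions that mention "detailed".
    ok = "detailed" in step.instruction
    return Verdict(ok, step.instruction + " (detailed)")


result = interleave("tell a story", stub_planner, stub_generator, stub_critic)
```

In this toy run the first step is rejected once and regenerated, which is why a trajectory can need more generator calls than it has steps. The excerpt cites over 25 calls per trajectory as the reason whole-trajectory optimization is impractical.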

Read at source: https://arxiv.org/abs/2606.13679