RiT: Vanilla Diffusion Transformers Suffice in Representation Space
RiT shows that flow matching in DINOv2 representation space achieves 7.3x higher effective rank and 35x better conditioning than pixel space with comparable intrinsic dimensionality.
Excerpt
Le Zhang, Ning Mang, Aishwarya Agrawal — Flow matching with x-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space li2025back. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both d!approx!33) yet DINOv2 exhibits 7.3times higher effective rank, 35times better covariance conditioning, 11.5times lower excess kurtosis, and 1.7times lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256{times}256, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with
Read at source: https://arxiv.org/abs/2605.21981