The State-Prediction Separation Hypothesis
A two-stream Transformer architecture separates state storage from token prediction and improves data and compute efficiency in pretraining experiments.
Excerpt
Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi
Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiency, improving validation loss and outperforming standard Transformers by 2 to 3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.
Read at source: https://arxiv.org/abs/2607.01218