Geometric Factual Recall in Transformers

By Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti

HF Daily Papers · May 12, 2026

Geometric memorization theory shows transformers can encode facts with logarithmic embedding dimension via linear superpositions, challenging standard associative memory views.

Categories: Research

Excerpt

How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting (chains of relational queries such as "Who is the mother of the wife of x?"), providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic sele
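The mechanism described in the excerpt (subject embeddings as linear superpositions of attribute vectors, read out by a relation-conditioned ReLU gate) can be illustrated with a small NumPy sketch. This is a hand-built toy under stated assumptions, not the paper's construction: attributes get hypothetical ±1 binary codes of length ceil(log2 N), each relation writes into its own block of the embedding, and the names `binary_codes`, `build_embeddings`, `selector_mlp`, and `recall` are invented for illustration. Nothing here is trained, and the resulting dimension is R * ceil(log2 N) rather than whatever bound the paper proves.

```python
import numpy as np

def binary_codes(n):
    """(n, k) matrix of +/-1 codes with k = ceil(log2 n); one distinct row per attribute."""
    k = max(1, int(np.ceil(np.log2(n))))
    bits = (np.arange(n)[:, None] >> np.arange(k)[None, :]) & 1
    return 2.0 * bits - 1.0

def build_embeddings(perms, codes):
    """Subject embedding = sum over relations of the code of that relation's attribute,
    written into a relation-specific block (an assumed, simple form of superposition)."""
    n_rel, n_subj = perms.shape
    k = codes.shape[1]
    emb = np.zeros((n_subj, n_rel * k))
    for r in range(n_rel):
        emb[:, r * k:(r + 1) * k] += codes[perms[r]]
    return emb

def relu(x):
    return np.maximum(x, 0.0)

def selector_mlp(e, r, n_rel, k, bound=1.0):
    """Relation-conditioned ReLU gate. It contains no subject-to-attribute table:
    it only passes the queried relation's block and zeroes the others."""
    gate = np.repeat(np.eye(n_rel)[r], k)      # 1 on the queried block, 0 elsewhere
    bias = bound * (gate - 1.0)                # 0 where open, -bound where closed
    passed = relu(e + bias) - relu(-e + bias)  # equals e where open, 0 where closed
    return passed.reshape(n_rel, k).sum(axis=0)

def recall(e, r, codes, n_rel):
    """Decode the selected code to an attribute index by nearest code (max dot product)."""
    out = selector_mlp(e, r, n_rel, codes.shape[1])
    return int(np.argmax(codes @ out))

rng = np.random.default_rng(0)
n_subj, n_rel = 64, 3
codes = binary_codes(n_subj)

# Random bijections from subjects to the shared attribute set, one per relation.
perms = np.stack([rng.permutation(n_subj) for _ in range(n_rel)])
emb = build_embeddings(perms, codes)
all_correct = all(recall(emb[s], r, codes, n_rel) == perms[r, s]
                  for r in range(n_rel) for s in range(n_subj))

# New bijections: only the embeddings are rebuilt; the selector is reused unchanged.
new_perms = np.stack([rng.permutation(n_subj) for _ in range(n_rel)])
new_emb = build_embeddings(new_perms, codes)
transfer_correct = all(recall(new_emb[s], r, codes, n_rel) == new_perms[r, s]
                       for r in range(n_rel) for s in range(n_subj))
```

In this toy, all facts live in the embeddings, so swapping in new bijections requires rebuilding only `emb`. That mirrors, in a trivial way, the zero-shot transfer described in the excerpt. The gate relies on embedding entries being bounded by `bound`, which holds here because the codes are ±1.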

Read at source: https://arxiv.org/abs/2605.12426