You can now read Gemma 3's mind

· r/LocalLLaMA ·

Anthropic released Natural Language Autoencoder (NLA) research enabling interpretability of LLM internal representations, with Gemma 3 27b NLA weights publicly available on HuggingFace and Neuronpedia for the research community.

Categories: Model Releases, OSS & Tools, Research

Excerpt

Anthropic has released new research showing what an LLM is "thinking" when it generates the next token, using NLAs, or "Natural Language Autoencoders". An NLA is a pair of LLMs that can translate the internal activations of a model for any specific token. Neuronpedia, in partnership with Anthropic, has also released NLA model weights for Gemma 3 27b instruct:

- Auto Verbalizer (AV): [https://huggingface.co/kitft/nla-gemma3-27b-L41-av](https://huggingface.co/kitft/nla-gemma3-27b-L41-av)
- Activation Reconstructor (AR): [https://huggingface.co/kitft/nla-gemma3-27b-L41-ar](https://huggingface.co/kitft/nla-gemma3-27b-L41-ar)

Neuronpedia is currently hosting them at [https://www.neuronpedia.org/gemma-3-27b-it/nla](https://www.neuronpedia.org/gemma-3-27b-it/nla).

To try it, go to the Neuronpedia link above, ask Gemma 3 a question, click on any token, then click "explain" — the site will show you what the model was thinking when generating that token. The Auto Verbalizer is the LLM that translates the model's activations into readable text; the Activation Reconstructor verifies whether the text generated by the AV can be translated back into the original activations.

Edit (added example below): I prompted Gemma 3 with "I am Elon Musk", and at the very first tokens the LLM was already marking the chat as "fabricated" and "satirical": https://preview.redd.it/f648tz17utzg1.png?width=1827&format=png&auto=webp&s=4c9aca885f2f9383e026263b3c524ac2d15b1a89
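The AV/AR round trip described above can be sketched conceptually. This is a toy illustration only — `auto_verbalize`, `reconstruct`, and the vectors are made up stand-ins, not the real API; actual usage would load the HuggingFace checkpoints linked above:

```python
# Conceptual sketch of the NLA round-trip check:
# AV maps an activation vector to text, AR maps the text back, and
# cosine similarity tells you how faithful the verbalization was.
# Everything here (labels, vectors, function names) is invented for
# illustration — the real AV/AR are full LLMs, not lookup tables.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "concept directions" an activation might encode.
CONCEPTS = {
    "fabricated": [0.9, 0.1, 0.0],
    "satirical":  [0.1, 0.9, 0.2],
}

def auto_verbalize(activation):
    # AV stand-in: pick the natural-language label closest to the activation.
    return max(CONCEPTS, key=lambda c: cosine(activation, CONCEPTS[c]))

def reconstruct(label):
    # AR stand-in: map the label back to an activation vector.
    return CONCEPTS[label]

if __name__ == "__main__":
    act = [0.85, 0.15, 0.05]      # pretend layer-41 residual activation
    label = auto_verbalize(act)   # AV: activation -> text
    rebuilt = reconstruct(label)  # AR: text -> activation
    print(label, round(cosine(act, rebuilt), 3))
```

A high cosine similarity between the original and reconstructed activation is the signal that the AV's text actually captured what the activation encoded, which is the verification role the post describes for the AR.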

Discussions