Needle: We Distilled Gemini Tool Calling Into a 26M Model

· r/LocalLLaMA ·

Needle, a 26M parameter function-calling model distilled from Gemini, runs at 6000 tok/s prefill on consumer devices using a Simple Attention Network architecture with no MLPs.

Categories: Model Releases, OSS & Tools

Excerpt

We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We have long been frustrated by how little effort goes into building agentic models that run on budget phones, so we investigated and came away with one observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling on consumer devices (phones, watches, glasses...).

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: [https://github.com/cactus-compute/needle](https://github.com/cactus-compute/needle)

The full writeup on the architecture is here: [https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md)

We found that the "no FFN" finding
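To make the "retrieval-and-assembly" framing concrete, here is a toy, hand-written sketch of the three steps named in the post (match query to tool name, extract argument values, emit JSON). It is not how Needle works internally: Needle learns this mapping with attention, whereas the sketch uses keyword overlap and regular expressions. The `TOOLS` registry, the tool names, the keyword sets, and the `call_tool` helper are all hypothetical and exist only for illustration.

```python
import json
import re

# Hypothetical tool registry: names, keywords, and patterns are illustrative only.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "args": {"minutes": r"(\d+)\s*(?:min|minute|minutes)\b"},
    },
    "send_message": {
        "keywords": {"message", "text"},
        "args": {"recipient": r"\b(?:to|tell|text)\s+([A-Z][a-z]+)"},
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def call_tool(query):
    """Return a JSON tool call for the query, or None if no tool matches."""
    words = tokenize(query)

    # Retrieval: pick the tool whose keywords overlap the query the most.
    name, spec = max(TOOLS.items(), key=lambda kv: len(kv[1]["keywords"] & words))
    if not spec["keywords"] & words:
        return None

    # Assembly: copy argument values straight out of the query text.
    arguments = {}
    for arg_name, pattern in spec["args"].items():
        match = re.search(pattern, query)
        if match:
            arguments[arg_name] = match.group(1)

    # Emit: serialize the call as JSON.
    return json.dumps({"name": name, "arguments": arguments})


if __name__ == "__main__":
    print(call_tool("Set a timer for 10 minutes"))
    print(call_tool("Send a message to Alice"))
```

Nothing in the sketch computes a new value: every field in the output is either selected from the registry or copied from the query, which is the sense in which the post calls the task retrieval rather than reasoning.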
