BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

· r/LocalLLaMA ·

BeeLlama v0.2.0 adds a major DFlash update, speeding Qwen and Gemma inference on single-GPU consumer hardware.

Categories: OSS & Tools

Excerpt

**BeeLlama v0.2.0 is here!** >Not quite a pegasus, but close enough. [**GitHub**](https://github.com/Anbeeld/beellama.cpp) **|** [**Qwen 3.6 27B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md) **|** [**Gemma 4 31B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-gemma-4-31b-dflash.md) * Full Gemma 4 31B support with efficient DFlash implementation and vision. * Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution. * DFlash GGUFs with upstream architecture are now supported. * Fixes to adaptive profit behavior around baseline probing. * Reduced verifier path is stricter now, with safer fallback to full logits when grammar, sampler state, or reasoning requires it. * Reasoning and tool-call boundaries were tightened. * Stricter draft/target validation and better draft-model discovery. * ...and many more improvements! **Benchmarks** * Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB * Config: same as in quick start docs, but with reasoning off for non-chat prompts * Baseline and MTP server in comparison: llama.cpp [b9275](https://github.com/ggml-org/llama.cpp/releases/tag/b9275) CUDA 13.1 Windows prebuilt * The full text of the benchmark prompts is in [README.md on GitHub](https://github.com/Anbeeld/beellama.cpp/blob/main/README.md#dflash-speedup) **Qwen 3.6 27B** Target m

Discussions