FastDMS: 6.4x KV-cache compression running faster than vLLM BF16/FP8
FastDMS implements dynamic KV-cache sparsification achieving 6.4x compression with near-lossless perplexity on Llama 3.2 1B.
Last year, researchers affiliated with NVIDIA, the University of Warsaw, and the University of Edinburgh published [Dynamic Memory Sparsification (DMS)](https://arxiv.org/abs/2506.05345), a KV-cache sparsification technique using learned per-head token eviction, reporting up to 8x KV-cache compression.
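If you haven't read the paper: each attention head learns a tiny predictor that scores cached tokens for eviction, with a recent window that is always kept (delayed eviction). Here's a simplified PyTorch sketch of the idea; the predictor shape and the `keep_mask` helper are my own simplifications, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class EvictionPredictor(nn.Module):
    """One tiny linear probe per attention head that scores each cached
    token for eviction from that head's KV cache."""
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        # One scalar logit per (head, token), computed from the key vector.
        self.evict_proj = nn.Parameter(torch.zeros(num_heads, head_dim))

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: [batch, heads, seq, head_dim] -> eviction logits [batch, heads, seq]
        return torch.einsum("bhsd,hd->bhs", keys, self.evict_proj)

def keep_mask(logits: torch.Tensor, window: int = 128, thresh: float = 0.0) -> torch.Tensor:
    """True = keep. Tokens whose eviction logit exceeds `thresh` get dropped,
    but the most recent `window` tokens are always protected (delayed eviction)."""
    mask = logits <= thresh
    mask[..., -window:] = True
    return mask
```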
I found the results intriguing enough to build a small reference implementation and trainer to sanity-check the idea. On WikiText-2 with Llama 3.2 1B, I got a rough replication:
| Configuration | PPL | Delta | KLD (nats/tok) | Compression |
|---|---:|---:|---:|---:|
| Vanilla Llama-3.2-1B | 9.226 | - | - | 1x |
| DMS (trained, eviction active) | 9.200 | -0.28% | 0.026 | 6.4x |
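(PPL here is standard next-token perplexity, and KLD is per-token KL against the vanilla model's logits. Roughly how these are computed, though the repo's exact eval windowing may differ:)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_chunk(base_model, dms_model, input_ids):
    """input_ids: [1, seq]. Returns (DMS perplexity, KL(base || DMS) in nats/token)."""
    base_logits = base_model(input_ids).logits[0]   # [seq, vocab]
    dms_logits = dms_model(input_ids).logits[0]
    # Perplexity: next-token cross-entropy, shifted by one position.
    nll = F.cross_entropy(dms_logits[:-1], input_ids[0, 1:])
    # KLD: how far the compressed model's distribution drifts per token.
    kld = F.kl_div(
        F.log_softmax(dms_logits, dim=-1),
        F.log_softmax(base_logits, dim=-1),
        log_target=True,
        reduction="batchmean",   # sums over vocab, averages over tokens
    )
    return nll.exp().item(), kld.item()
```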
Training the DMS predictors took about 20 minutes on the PRO 6000, and the compression looked basically lossless. One small problem, though: my HF reference implementation ran at about... 18 tok/s.
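(Quick aside on the training, for the curious: the predictors are trained against the frozen vanilla model as a teacher, with eviction decisions relaxed via a Gumbel-sigmoid so the compression ratio is differentiable. A rough sketch of the objective; see the paper for the exact retrofitting recipe:)

```python
import torch
import torch.nn.functional as F

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable relaxation of a hard keep/evict decision."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)          # logistic noise
    return torch.sigmoid((logits + noise) / tau)    # soft "evict" probability

def dms_loss(student_logits, teacher_logits, evict_probs,
             target_evict=1 - 1 / 6.4, alpha=1.0):
    """Distill the frozen vanilla model into the eviction-enabled one while
    pushing the expected eviction rate toward the target compression.
    Logits: [tokens, vocab]; evict_probs: output of gumbel_sigmoid."""
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    # One-sided penalty: only punish evicting *less* than the target.
    ratio = F.relu(target_evict - evict_probs.mean())
    return kd + alpha * ratio
```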
So, after a few weeks of kernel grinding, I'm pleased to announce **FastDMS**, an MIT-licensed implementation of DMS with compact KV storage that physically reclaims evicted slots. It is tested on NVIDIA's original Qwen 3 8B DMS checkpoint as well as my own Llama 3.2 1B DMS checkpoint (the original HF reference implementation and my trainer are in the repo as well): https://github.com/shisa-ai/FastDMS
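By "compact KV storage" I mean kept tokens are gathered into genuinely smaller buffers rather than masked in place. FastDMS does this with custom kernels; the naive version of the idea looks like this (hypothetical `compact_kv` helper, not the repo's API):

```python
import torch

def compact_kv(k: torch.Tensor, v: torch.Tensor, keep: torch.Tensor):
    """k, v: [heads, seq, head_dim]; keep: bool [heads, seq].
    Each head evicts independently, so the result is ragged per head."""
    new_k, new_v = [], []
    for h in range(k.shape[0]):
        idx = keep[h].nonzero(as_tuple=True)[0]
        new_k.append(k[h].index_select(0, idx))   # fresh, smaller allocation
        new_v.append(v[h].index_select(0, idx))
    return new_k, new_v   # once the old k/v are dropped, the allocator reclaims them
```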
On my benchmark setup, FastDMS uses **5-8x** less KV memory than vLLM BF16 KV at 8K context while also decoding **1.5-2x** faster than vLLM.
Compact DMS saves real allocator/device memory, not just theoretical KV bytes.
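You can sanity-check that on your own GPU in a few lines using the `compact_kv` sketch above (this is not the repo's benchmark harness):

```python
import torch

k = torch.randn(32, 8192, 128, device="cuda", dtype=torch.bfloat16)  # [heads, seq, dim]
v = torch.randn_like(k)
print(torch.cuda.memory_allocated() / 2**20, "MiB before")
keep = torch.rand(32, 8192, device="cuda") < (1 / 6.4)  # keep ~15.6% of slots
ks, vs = compact_kv(k, v, keep)
del k, v                                  # drop the full-size cache
print(torch.cuda.memory_allocated() / 2**20, "MiB after")
```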