The exact KV cache usage of DeepSeek V4

· r/LocalLLaMA ·

Technical analysis of DeepSeek V4's KV cache efficiency: V4 Pro uses ~10GiB at 1m context versus V3.2's ~84GiB, roughly an 8x improvement per calculations from the V4 paper and the vLLM implementation.

Categories: Model Releases

Excerpt

Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50GB at 1m context and DSV4 uses ~5GB: [https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)

***Numbers updated with the KV cache breakdown from vLLM:*** [https://vllm.ai/blog/deepseek-v4](https://vllm.ai/blog/deepseek-v4)

From my own calculations (reproduced in the sketch below), the correct FP16 KV cache usage should be:

|Model|Params|128k|160k|1m|KV%|
|:-|:-|:-|:-|:-|:-|
|V3/3.1|671B|8.58GiB|10.72GiB|68.63GiB|5.11%|
|V3.2|671B|10.48GiB|13.11GiB|83.88GiB|6.25%|
|V4 Flash|284B|0.84GiB|1.05GiB|6.72GiB|1.18%|
|V4 Pro|1600B|1.20GiB|1.50GiB|9.62GiB|0.3%|

(KV% is the 1m KV cache as a share of the model's FP16 weight size.)

So the KV cache saving is not 9.5x but 7.879x, which is still very impressive. If you look at the KV% metric, we are seeing close to a 20x gain. This basically obliterates all current transformer-SSM hybrid models' KV cache usage. But the transformer-SSM crowd can just use DSV4's CSA and HCA on their transformer layers to catch up.

At this KV cache usage, once DSV4 is supported in llama.cpp, we can easily run 1m context for DSV4 Flash on 256GB RAM and a 3090, or for DSV4 Pro on 1.5TB RAM and an RTX 6000 Blackwell (see the budget check at the end). I suppose the various speed gains mentioned in the paper can make this viable. While DSV4 Pro doesn't do well on Artificial Analysis, we can expect Kimi and Zhipu to make derivatives of it, giving us a beast that uses very little KV cache. All in al…
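A minimal sketch of the MLA KV-cache arithmetic behind the table, assuming each token caches one compressed latent (kv_lora_rank = 512) plus a decoupled RoPE key (64 dims) per layer across V3's 61 layers (values from the public DeepSeek-V3 config); the 128-dim DSA indexer keys for V3.2 are counted at FP16 to match the all-FP16 accounting above, and "1m" is treated as 2^20 tokens. The V4 rows can't be derived here because V4's layer count and per-layer cached dims aren't in the post.

```python
GIB = 2 ** 30

def mla_kv_cache_gib(num_layers: int, latent_dim: int, rope_dim: int,
                     tokens: int, indexer_dim: int = 0,
                     bytes_per_elem: int = 2) -> float:
    """FP16 KV cache in GiB: per token, each layer stores the compressed
    latent plus the decoupled RoPE key (plus optional indexer keys)."""
    per_token_bytes = num_layers * (latent_dim + rope_dim + indexer_dim) * bytes_per_elem
    return per_token_bytes * tokens / GIB

for label, tokens in [("128k", 128 * 1024), ("160k", 160 * 1024), ("1m", 1024 * 1024)]:
    v3 = mla_kv_cache_gib(61, 512, 64, tokens)                    # V3/V3.1
    v32 = mla_kv_cache_gib(61, 512, 64, tokens, indexer_dim=128)  # V3.2 with DSA indexer keys
    print(f"{label}: V3/3.1 = {v3:.3f} GiB, V3.2 = {v32:.3f} GiB")

# 128k: V3/3.1 = 8.578 GiB,  V3.2 = 10.484 GiB   (table: 8.58 / 10.48)
# 160k: V3/3.1 = 10.723 GiB, V3.2 = 13.105 GiB   (table: 10.72 / 13.11)
# 1m:   V3/3.1 = 68.625 GiB, V3.2 = 83.875 GiB   (table: 68.63 / 83.88)
```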

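As a rough sanity check on the local-inference claim, here is a back-of-envelope memory budget, assuming a Q4_K_M-style GGUF quant at roughly 4.8 bits/weight (an assumption; actual GGUF sizes vary with the quant mix) plus the 1m-context KV figures from the table:

```python
GIB = 2 ** 30

def budget_gib(params: float, bits_per_weight: float, kv_gib: float) -> float:
    """Quantized weight footprint plus KV cache, in GiB."""
    return params * bits_per_weight / 8 / GIB + kv_gib

# (params from the post, 1m-context KV cache from the table above)
flash = budget_gib(284e9, 4.8, 6.72)    # -> ~165 GiB vs 256GB RAM + 24GB VRAM (3090)
pro   = budget_gib(1600e9, 4.8, 9.62)   # -> ~904 GiB vs 1.5TB RAM + RTX 6000 Blackwell
print(f"V4 Flash @1m: ~{flash:.0f} GiB, V4 Pro @1m: ~{pro:.0f} GiB")
```

Both totals leave headroom on the quoted hardware, which is why the 1m-context claim looks plausible on paper.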