DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

· r/MachineLearning ·

DeepSeek released the full V4 technical paper detailing FP4 quantization-aware training for MoE models, achieving 2x speedup on QK selectors with 99.7% recall, and reducing 1M context FLOPs to 27% (Pro) / 10% (Flash) of baseline while cutting KV cache by 90%+.

Categories: Model Releases, Research

Excerpt

DeepSeek dropped the full V4 paper this week. The April preview was 58 pages; this version adds a lot of technical depth. What stood out for me:

FP4 quantization-aware training. They're running FP4 QAT directly in late-stage training. MoE expert weights are quantized to FP4 (the main GPU memory consumer), and the QK path in the CSA indexer uses FP4 activations, giving a 2x speedup on the QK selector with 99.7% recall preserved. Inference runs directly on the FP4 weights.

The efficiency table is striking:

|Model|1M-context FLOPs|KV cache|
|:-|:-|:-|
|V3.2|baseline|baseline|
|V4-Pro|27% of baseline|10% of baseline|
|V4-Flash|10% of baseline|7% of baseline|

Training stability, two mechanisms. Trillion-parameter MoE training has the loss-spike problem: divergence and unpredictable failures. They document two fixes.

Anticipatory routing. They deliberately desync main-model and router updates: the current step uses the latest parameters for features, but routing uses cached older parameters. This breaks the feedback loop that amplifies anomalies. It costs 20% overhead but only kicks in during loss spikes.

SwiGLU clamping. Hard limits on the SwiGLU linear path (-10 to 10) and gate path (max 10) suppress extreme values that would otherwise cascade.

Generative reward model. Instead of separate reward models for RLHF, they use the same model to generate and evaluate. Trained on scored data, the model learns to judge its own outputs with reasoning attached: minimal human labeling, reasoning-grounded evaluation, unified training.

Human eval results. Chinese writ
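For anyone unfamiliar with how FP4 QAT works mechanically, here's a minimal NumPy sketch of fake quantization to an FP4 grid. This assumes the E2M1 format (16 values: sign × {0, 0.5, 1, 1.5, 2, 3, 4, 6}) and per-tensor scaling; the paper's actual format and scaling granularity may differ, and a real QAT setup would also use a straight-through estimator so gradients bypass the rounding.

```python
import numpy as np

# Representable magnitudes of the assumed E2M1 FP4 format (plus a sign bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w: np.ndarray) -> np.ndarray:
    """Snap w to the nearest FP4 value after per-tensor scaling.

    In QAT the forward pass sees these snapped weights, so the model
    learns to be robust to FP4 rounding; inference can then run on
    the quantized weights directly.
    """
    scale = np.abs(w).max() / FP4_GRID[-1] + 1e-12  # map max |w| to 6.0
    mag = np.abs(w) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)  # nearest grid point
    return np.sign(w) * FP4_GRID[idx] * scale

w = np.array([0.03, -0.11, 0.25, -0.6])
wq = fake_quant_fp4(w)  # each entry lands on the scaled FP4 grid
```

The key property is that the quantization error is bounded by half a grid step at the tensor's scale, which is what late-stage QAT trains the model to tolerate.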
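The anticipatory-routing idea is easy to sketch: routing decisions use a stale copy of the router parameters while the live copy keeps training. The class name, the refresh interval, and the update rule below are all illustrative assumptions, not the paper's implementation; the point is just the decoupling of routing from the latest (possibly spiking) parameters.

```python
import numpy as np

class DesyncedRouter:
    """Toy router that routes with cached (older) parameters.

    Because an anomalous gradient step perturbs only w_latest, the
    routing decision cannot immediately amplify the anomaly; this is
    the feedback-loop break described in the post.
    """

    def __init__(self, w_router: np.ndarray, cache_every: int = 4):
        self.w_latest = w_router           # updated every step
        self.w_cached = w_router.copy()    # stale copy used for routing
        self.cache_every = cache_every
        self.step = 0

    def route(self, x: np.ndarray) -> int:
        # Routing logits come from the *cached* parameters.
        logits = x @ self.w_cached
        return int(np.argmax(logits))

    def update(self, grad: np.ndarray, lr: float = 1e-2) -> None:
        self.w_latest -= lr * grad
        self.step += 1
        # Periodically sync the routing copy with the trained copy.
        if self.step % self.cache_every == 0:
            self.w_cached = self.w_latest.copy()
```

A production version would presumably gate the desync on a spike detector (the post says the ~20% overhead only kicks in during loss spikes) rather than running it every step.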
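The SwiGLU clamping is the simplest of the fixes to write down. Below is a sketch using the clamp ranges from the post (linear path in [-10, 10], gate path capped at 10); whether the gate clamp applies before or after the activation, and whether it also has a lower bound, is my assumption here.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def swiglu_clamped(x: np.ndarray, w_lin: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """SwiGLU with hard clamps on both branches.

    Extreme pre-activations on either path are capped, so a single
    outlier activation cannot cascade into a loss spike downstream.
    """
    lin = np.clip(x @ w_lin, -10.0, 10.0)   # linear path clamped to [-10, 10]
    gate = np.minimum(x @ w_gate, 10.0)     # gate pre-activation capped at 10 (assumed upper bound only)
    return lin * silu(gate)
```

With these clamps the output magnitude is bounded by 10 * silu(10) < 100 regardless of how large the input activations get, which is the whole point.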

Discussions