llama.cpp ngram-based speculative decoding was merged

· r/LocalLLaMA ·

llama.cpp merged ngram-based speculative decoding, which drafts tokens from n-gram matches in the existing context, yielding 0–50% speedups on coding tasks depending on draft acceptance rates.

Categories: OSS & Tools

Excerpt

[https://github.com/ggml-org/llama.cpp/pull/19493](https://github.com/ggml-org/llama.cpp/pull/19493) Some prompts get a speedup, others don't (cases with low draft-acceptance streaks). Good working params depend on the task type and repetition patterns. For coding, I got roughly 0%–50% speedups with these params: `--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64`
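The core idea behind ngram drafting can be sketched in a few lines: find an earlier occurrence of the last *n* tokens in the context and copy what followed it as a draft, which the target model then verifies. Below is a minimal toy sketch of that scheme, assuming a simplified integer-token model and a stand-in `target_next_token` callback; it is not llama.cpp's actual implementation, and the names are illustrative only.

```python
def ngram_draft(context, n, draft_max):
    """Propose up to draft_max draft tokens by finding the most recent
    earlier occurrence of the last n tokens and copying what followed it."""
    if len(context) < n:
        return []
    key = tuple(context[-n:])
    # Scan backwards for an earlier match of the trailing n-gram.
    for i in range(len(context) - n - 1, -1, -1):
        if tuple(context[i:i + n]) == key:
            start = i + n
            return context[start:start + draft_max]
    return []  # no repetition found: nothing to draft

def speculative_step(context, n, draft_max, target_next_token):
    """Verify a draft against the target model: keep the longest accepted
    prefix, then append one token sampled from the target itself."""
    draft = ngram_draft(context, n, draft_max)
    accepted = []
    for tok in draft:
        if tok == target_next_token(context + accepted):
            accepted.append(tok)  # draft token matches the target's choice
        else:
            break  # first mismatch: discard the rest of the draft
    # The target always contributes at least one token per step.
    accepted.append(target_next_token(context + accepted))
    return accepted
```

On repetitive text the draft is accepted wholesale and one verification pass yields many tokens; when nothing in the context repeats, the draft is empty and the step degrades to ordinary one-token decoding, which is why speedups range from 0% upward depending on repetition patterns.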

Discussions