2026-06-11 — Dan Billings

q8_0 KV Cache and Batch Size: 54% faster decode, 2x prefill on 5090

The 5090 post ran --cache-type-k q4_0 --cache-type-v q4_0 to fit 256k context on 32 GB. That was the right call for capacity, but the q4_0 quantization cost decode speed. Switching to q8_0 gave 54% faster decode (106 vs 69 tok/s) and 2x faster prefill (1,036 tok/s at -ub 3072 vs ~500 tok/s at -ub 2048).

This post documents the KV cache mechanics, the batch size sweep, and the actual flags that make it work.

The KV cache: q4_0 vs q8_0

Same model, same server, same context. Only the KV cache quantization changed:

--cache-type-k q4_0 --cache-type-v q4_0   # before
--cache-type-k q8_0 --cache-type-v q8_0   # now

                     q4_0 KV           q8_0 KV            Change
Decode (tok/s)       69                106                +54%
Prefill (tok/s)      ~500 (ub=2048)    ~1,036 (ub=3072)   +107%
KV size per token    ~20 KB            ~40 KB             2× memory

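Prefill went from 0.35 ms/token to 0.44 ms/token (at larger batch sizes). Decode went from 14.5 ms/token (69 tok/s) to 10.6 ms/token (94 tok/s, the -ub 3072 figure in the sweep below); the 106 tok/s headline rate corresponds to 9.4 ms/token. Both phases dequantize KV cache entries for every token, and q4_0 packs two values into each byte, so there is more unpacking work per byte read. Over a long context that adds up.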

The KV cache doesn't touch VRAM

After a cold prefill of 124,849 tokens, VRAM use on the 5090 rose by only about 1.1 GiB, to 29,661 MiB (roughly 29.0 GiB):

Before: 28,547 MiB
After:  29,661 MiB  (batch increase, not cache)

The KV cache lives in system RAM as checkpoints. From the journal:

19:53:27  prompt_save: 92,839 tokens, 3,598 MiB (draft: 364 MiB)
19:53:38  cache state: 2 prompts, 4,972 MiB (limits: 8,192 MiB)
            prompt A: 1,473 tokens, 1 checkpoint, 354 MiB
            prompt B: 92,839 tokens, 2 checkpoints, 4,618 MiB

93K tokens at q8_0 = 4.6 GB in RAM. The GPU only loaded a 312 MiB active window for generation, then paged it back out immediately. VRAM is dominated by model weights (14 GB NVFP4), not the cache.
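
As a rough cross-check, the prompt_save line above can be turned into a per-token cost. This is a minimal sketch, assuming the 3,598 MiB figure covers only the main model's saved state for those 92,839 tokens (the draft state is reported separately); the function name is made up for illustration.

```python
MIB = 1024 * 1024  # bytes per MiB


def bytes_per_token(saved_mib: float, n_tokens: int) -> float:
    """Average saved-state size per token, in bytes."""
    return saved_mib * MIB / n_tokens


# Figures copied from the prompt_save journal line.
per_token = bytes_per_token(3598, 92839)
print(f"{per_token / 1000:.1f} KB/token")
```

On these numbers the saved state works out to roughly 40 KB per token at q8_0.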

Batch size sweep: 2048 → 3072 → 4096

The llama.cpp defaults are -b 2048 -ub 512; the 512-token physical batch is conservative for a card this size. On the 5090 we swept:

Batch               Prefill         Decode      VRAM used
-b 2048 -ub 2048    ~500 tok/s      87 tok/s    87.5%
-b 3072 -ub 3072    ~1,036 tok/s    94 tok/s    91%
-b 4096 -ub 4096    ~1,074 tok/s    77 tok/s    94%
-b 8192 -ub 8192    stalled

2048 was the safe starting point. Prefill and decode both worked, but at ~500 tok/s prefill was leaving most of the card's throughput unused.

3072 is the winner. Prefill doubles over 2048 (2,272 tok/s on 92K tokens), decode stays at 94 tok/s, and VRAM at 91% has headroom. This is what we ship with.

4096 pushes prefill slightly higher (2,302 tok/s on 118K tokens) but decode drops to 77 tok/s and VRAM hits 94% — too tight. The batch buffer is eating into the model weights' VRAM.

8192 stalled. We did not pin down the cause; the likely candidates are the prefill buffer overflowing or the GPU being unable to sustain a batch that large.

The lesson: batch size is a tradeoff curve. Bigger batch = faster prefill, but decode suffers because the batch buffer displaces weights from fast GPU memory. 3072 is the knee of the curve on a 32 GB card with a 27B NVFP4 model.
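
One way to make "knee of the curve" concrete is to filter the sweep by headroom and decode speed, then take the fastest prefill that survives. This is a sketch, not how we actually picked the value: the 92% VRAM ceiling and 85 tok/s decode floor are assumed thresholds, and the class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SweepRow:
    ubatch: int
    prefill_tps: float
    decode_tps: float
    vram_pct: float


# Rows copied from the sweep table (8192 omitted because it stalled).
SWEEP = [
    SweepRow(2048, 500, 87, 87.5),
    SweepRow(3072, 1036, 94, 91.0),
    SweepRow(4096, 1074, 77, 94.0),
]


def pick_batch(rows: List[SweepRow],
               max_vram_pct: float = 92.0,      # assumed headroom limit
               min_decode_tps: float = 85.0     # assumed decode floor
               ) -> Optional[SweepRow]:
    """Fastest prefill among rows that keep VRAM headroom and decode speed."""
    ok = [r for r in rows
          if r.vram_pct <= max_vram_pct and r.decode_tps >= min_decode_tps]
    return max(ok, key=lambda r: r.prefill_tps) if ok else None


best = pick_batch(SWEEP)
print(best.ubatch if best else "no config meets the limits")
```

With those thresholds, 4096 is rejected on both VRAM and decode, and 3072 beats 2048 on prefill.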

How llama.cpp checkpoints work

The cache is a ring of up to 32 checkpoints in system RAM. Each checkpoint holds a slice of the conversation — typically 250-350 MiB for 26-29K tokens at q8_0.

When a new request arrives, llama.cpp scans all cached prompts for prefix overlap:

looking for better prompt, base f_keep = 0.002
found better prompt with f_keep = 0.706, sim = 0.734

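f_keep is the fraction of the cached prompt that can be kept as a shared prefix; sim is the fraction of the new prompt that this prefix covers. Here sim = 0.734 means about 73% of the new prompt was already cached, so only the 331-token delta needed evaluation, taking 537 ms at 617 tok/s.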
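
The two ratios in that log line can be sketched in a few lines. This assumes, from my reading of the server code, that f_keep is the common prefix length over the cached prompt length and sim is the same prefix over the new prompt length; the function names and the toy token lists are illustrative only.

```python
from typing import Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def overlap_scores(cached: Sequence[int], new: Sequence[int]) -> Tuple[float, float, int]:
    """Return (f_keep, sim, tokens still to evaluate) for a cached/new prompt pair."""
    lcp = common_prefix_len(cached, new)
    f_keep = lcp / len(cached) if cached else 0.0  # share of the cached prompt reused
    sim = lcp / len(new) if new else 0.0           # share of the new prompt covered
    return f_keep, sim, len(new) - lcp


# Toy token ids: 8 shared tokens, then the prompts diverge.
cached = list(range(10))
new = list(range(8)) + [100, 101, 102, 103]
f_keep, sim, delta = overlap_scores(cached, new)
print(f"f_keep={f_keep:.3f} sim={sim:.3f} delta={delta}")
```

The delta is what actually gets prefilled, which is why a high sim turns a long prompt into a sub-second evaluation.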

When there's no overlap (fresh conversation), the old cache is erased:

20:02:41  cache: 1 prompt, 6,754 MiB (54K tokens)
20:02:50  new prompt: 32,559 tokens, 1,359 MiB
20:02:52  cache: 1 prompt, 2,620 MiB  <- 54K prompt erased

The LRU eviction is aggressive: it drops the old conversation even when the total cache is well under the 8,192 MiB limit. The trigger is a lack of prefix overlap, not memory pressure.

The math: how much KV fits in RAM

At q8_0, the KV cache costs ~40 KB/token:

Context        q8_0 (RAM)    q4_0 (RAM)    f16 (RAM)
100K tokens    4.0 GB        2.0 GB        8.0 GB
200K tokens    8.0 GB        4.0 GB        16.0 GB
256K tokens    10.2 GB       5.1 GB        20.5 GB

With 96 GB of system RAM on WSL2 and an 8,192 MiB cache limit, there is enormous headroom. The constraint is the cache limit, not physical RAM. Bump it to --cache-size 16g and you can hold 256K at q8_0 or 200K at f16.
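
The table is just tokens times a per-token cost, so it can be reproduced and extended to other limits. A minimal sketch, assuming the per-token costs implied by the 100K row (4.0 GB over 100K tokens is about 40 KB per token at q8_0, half that for q4_0, double for f16) and decimal GB; the helper names are made up.

```python
# Per-token KV cost in KB, back-derived from the 100K-token row of the table.
KB_PER_TOKEN = {"q4_0": 20.0, "q8_0": 40.0, "f16": 80.0}


def kv_cache_gb(n_tokens: int, kb_per_token: float) -> float:
    """Cache size in decimal GB for a given context length."""
    return n_tokens * kb_per_token * 1000 / 1e9


def max_tokens(limit_gb: float, kb_per_token: float) -> int:
    """Largest context that fits under a cache limit given in decimal GB."""
    return int(limit_gb * 1e9 // (kb_per_token * 1000))


for ctx in (100_000, 200_000, 256_000):
    row = "  ".join(f"{name}={kv_cache_gb(ctx, kb):.1f} GB"
                    for name, kb in KB_PER_TOKEN.items())
    print(f"{ctx:>7} tokens  {row}")

print(max_tokens(16, KB_PER_TOKEN["f16"]))
```

The last line shows where the 200K-at-f16 figure comes from: a 16 GB limit divided by 80 KB per token.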

The flags behind the Grafana numbers

The rates you see on the Grafana dashboard come from two different machines with different models but the same KV cache setting. Here are the actual llama.cpp arguments from ansible-scala that make it happen:

5090 (danwin) — Qwen 3.6, 27B, NVFP4:

--ctx-size 200000          # 200k context; single slot
--cache-type-k q8_0 --cache-type-v q8_0  # q8_0 KV (the speed unlock)
-b 3072 -ub 3072           # batch=3072: sweet spot for prefill/decode tradeoff
--spec-type draft-mtp --spec-draft-n-max 3  # MTP n=3, peak (no cliff on Blackwell)
-no-mmap                   # WSL2: free RAM copy after offloading to VRAM
-fit off                   # skip common_fit_params precheck that bails on --ngl 99

The -no-mmap flag is the WSL2 hack: without it, llama.cpp keeps a system RAM copy of the weights alongside the VRAM copy, which can OOM WSL2's memory allocator. The -fit off flag skips a precheck that incorrectly rejects --ngl 99 with NVFP4 models. The batch size was tuned across four values: 2048 (safe), 3072 (winner), 4096 (decode penalty), 8192 (stalled).

4090 (danarch) — Gemma 4, 12B, QAT-Q4_0:

--ctx-size 262144           # 256k context; 2 slots
--cache-type-k q8_0 --cache-type-v q8_0  # same q8_0 KV
--spec-type draft-mtp --spec-draft-n-max 3  # MTP n=3
--model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf  # separate draft model
--spec-draft-ngl 99         # draft model fully on GPU
-np 2                       # 2 parallel slots

Key difference: Gemma 4 on the 4090 uses a separate draft GGUF (--model-draft) for MTP, while Qwen 3.6 on the 5090 has MTP baked into the same model file. The --spec-draft-ngl 99 pins the entire draft model to VRAM — on a 24 GB card with a 12B Q4_0 target, that leaves tight headroom (~22 GB used total). The 2 slots (-np 2) trade per-slot speed for concurrency — measured at ~95 tok/s per slot vs ~178 tok/s single slot.

Both run --cont-batching, --jinja, --fa on, --metrics. The Prometheus exporter on danarch scrapes danwin:8080/metrics (5090), localhost:8080/metrics (4090), and localhost:8081/metrics (nomic-embed on 3070Ti). The Grafana dashboard compares prefill rate, decode rate, VRAM usage, context fill, and cache hit rate across all three GPUs.
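
For anyone wiring up their own dashboard, the /metrics output is plain Prometheus text and easy to read without a client library. The sketch below parses an inline sample rather than hitting a live server. The two metric names are what I believe llama-server exports, so treat them as assumptions and check your own endpoint; the sample values simply reuse the rates quoted in this post.

```python
from typing import Dict

# Hypothetical sample payload in Prometheus text exposition format.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 1036.0
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 94.0
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse label-free 'name value' samples, skipping comments and blank lines."""
    out: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out


metrics = parse_metrics(SAMPLE)
print(metrics["llamacpp:prompt_tokens_seconds"],
      metrics["llamacpp:predicted_tokens_seconds"])
```

To read a live server, fetch the text with urllib.request.urlopen against the /metrics URL and pass the decoded body to parse_metrics.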

The tradeoff: memory vs speed

q8_0 costs 2× RAM per token vs q4_0. At 93K tokens that's 4.6 GB instead of 2.3 GB. On a machine with 16 GB system RAM, q8_0 would pressure swap. On 96 GB, it's irrelevant.

The dequant overhead is the real differentiator:

                          q4_0                              q8_0
Decode ms/token           14.5                              10.6
Prefill ms/token          0.35                              0.44 (ub=3072)
Dequant work per value    unpack 4-bit, re-center, scale    scale only (already 8-bit)

At decode time, every token requires dequantizing KV cache entries to compute attention. In ggml both formats store one fp16 scale per block of 32 values; q4_0 additionally has to unpack two 4-bit values from each byte and re-center them, while q8_0 values are already signed 8-bit integers. The ~4 ms/token difference adds up: a 200-token response saves ~0.8 seconds.
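
To show where the extra work comes from, here is a pure-Python sketch of dequantizing one 32-value block in each format. It follows my understanding of the ggml layouts (one scale d per block; q4_0 packs values 0-15 in the low nibbles and 16-31 in the high nibbles, stored with an offset of 8). The real kernels are vectorized GPU code and look nothing like this, so read it as an illustration of the arithmetic only.

```python
from typing import List, Sequence

BLOCK = 32  # values per quantization block


def dequant_q4_0(d: float, qs: Sequence[int]) -> List[float]:
    """16 packed bytes -> 32 floats: split nibbles, subtract 8, apply scale."""
    assert len(qs) == BLOCK // 2
    low = [((b & 0x0F) - 8) * d for b in qs]   # values 0..15
    high = [((b >> 4) - 8) * d for b in qs]    # values 16..31
    return low + high


def dequant_q8_0(d: float, qs: Sequence[int]) -> List[float]:
    """32 signed 8-bit ints -> 32 floats: apply scale only."""
    assert len(qs) == BLOCK
    return [q * d for q in qs]


# 0x80 packs a low nibble of 0 and a high nibble of 8.
q4 = dequant_q4_0(0.5, bytes([0x80] * 16))
q8 = dequant_q8_0(0.5, [2] * 32)
print(q4[0], q4[16], q8[0])
```

q8_0 is a single multiply per value; q4_0 needs a mask or shift plus a subtract before the same multiply.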

What this means for daily use

At 94 tok/s with MTP at n=3 (draft acceptance 0.77-0.91), a typical Hermes Agent response (2-3K tokens) generates in roughly 21-32 seconds. Prefill for a 30K token context window takes about 29 seconds at 1,036 tok/s. The model didn't change; the KV cache and batch settings did.
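
Wait times follow directly from the rates, so it is worth having a one-liner for them. A minimal sketch, assuming throughput stays constant over the whole request (it will not exactly, since decode slows as context fills); the function name is made up.

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to process a token count at a steady rate."""
    return tokens / tok_per_s


decode_low = seconds_for(2_000, 94)     # short response at 94 tok/s
decode_high = seconds_for(3_000, 94)    # long response at 94 tok/s
prefill_30k = seconds_for(30_000, 1036) # cold 30K-token prompt at 1,036 tok/s

print(f"decode: {decode_low:.0f}-{decode_high:.0f} s, prefill: {prefill_30k:.0f} s")
```

Swap in 69 tok/s to see what the same responses cost on the old q4_0 setup.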

The qualitative threshold matters: at 69 tok/s you wait for the thinking section to finish. At 94 tok/s it's faster than the reading bottleneck, which changes the interaction pattern from "watch the model think" to "read the result."

Economic reality: 2026

The 5090 costs $3,500 at Micro Center. The 4090 costs $1,600. That's the hardware. Add a system to run it on — I'm using WSL2 on Windows 11 for the 5090, Arch Linux for the 4090 — both consumer workstations that already exist in the house.

Running locally is not free. It's $3,500 once, then electricity and your time. The cloud alternative: DeepSeek V3 at $0.28/M input, $1.10/M output. At 10K tokens in and 3K tokens out per day, that's well under a cent per day (~$0.006). Even at 100 times that volume, 1M in and 300K out per day, it's ~$0.61/day or ~$220/year, so on token cost alone the 5090 would take about 16 years to pay for itself.
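
The break-even arithmetic is simple enough to script, which makes it easy to plug in your own volume. This sketch uses the DeepSeek V3 prices quoted above and deliberately ignores electricity and time; the function names and the two volume scenarios are illustrative.

```python
def daily_cost_usd(tokens_in: int, tokens_out: int,
                   usd_per_m_in: float = 0.28,
                   usd_per_m_out: float = 1.10) -> float:
    """API cost per day at per-million-token prices."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out


def payback_years(hardware_usd: float, per_day_usd: float) -> float:
    """Years of daily use before API spend equals the hardware price."""
    return hardware_usd / (per_day_usd * 365)


light = daily_cost_usd(10_000, 3_000)        # light daily use
heavy = daily_cost_usd(1_000_000, 300_000)   # agent-scale daily use

print(f"light: ${light:.4f}/day")
print(f"heavy: ${heavy:.2f}/day, payback {payback_years(3500, heavy):.1f} years")
```

Token cost alone rarely justifies the card, which is why the rest of this section is about capabilities.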

But the comparison isn't just cost per token. It's data sovereignty, latency, and context window. Cloud APIs have hard context limits (128K for DeepSeek, 200K for Claude). Locally I can run 200K context on a single slot with the full conversation history. The cloud also doesn't let you run a continuous background worker extracting observations from every conversation — the Honcho deriver needs 8K-token batches to stream through the model, and that's expensive on an API.

The local setup also has a ceiling. A single 5090 can't run two 256K slots (needs 34 GB, has 32). Multi-model serving requires either multiple GPUs or accepting lower per-slot throughput. The 4090 + 5090 fleet solves this: Gemma 4 handles the foreground agent loop, Qwen 3.6 handles deep reasoning and dreaming. Total hardware cost: $5,100.

The question isn't whether local is cheaper than cloud — it's whether the capabilities you need (long context, continuous background processing, data control) justify the upfront investment. For a technical user running an AI agent full-time, the answer is yes. For occasional use, the cloud API is simpler and cheaper.

What didn't work


Playbook: src/main/scala/ansible/examples/Rtx5090Setup.scala (5090), danarch4090.scala (4090). Grafana dashboard at danarch:3000/d/0098691a-cc1d-49a6-a25e-3b26f8882ca0.