2026-06-07 — Dan Billings

Gemma 4 MTP on a 4090: 1.9× at single slot, slower at parallel 4

Dan Billings — 2026-06-07

Two weeks ago, llama.cpp refused to load Gemma 4's drafter GGUF — unknown architecture gemma4_assistant, missing embedding tensors, six permutations, six failures. Then vLLM worked on the 26B MoE at 1.13× speedup, but that was the MoE-blunts-MTP result I expected.

The vLLM post ended with a prediction: a dense Gemma 4 of similar parameter count should show speedup closer to Qwen 3.6's 1.85×. The hypothesis was that MoE amortizes the same memory bus that MTP wants to amortize, so the two don't stack.

The hypothesis holds, and it's even better than expected. gemma-4-12b-it-qat (dense, QAT-trained) with Gemma4 MTP on this same 4090 delivers 1.9× at single slot. But there's a trap: if you run at --parallel > 1, MTP makes the service slower than MTP-off. The gain only exists at --parallel 1.

Headline measured on this 4090 (single slot, gemma-4-12b-it-qat-q4_0 target + qat-assistant-MTP-Q8_0 draft):

So if you're going to remember two things from this post: (1) --parallel 1 or MTP hurts you, and (2) Gemma 4 MTP uses a separate drafter GGUF, not layers inside the target.

What changed since the failure post

Two things landed between the two posts:

  1. llama.cpp PR #23398 merged on June 7 (HEAD 04eb4c4). This adds the gemma4_assistant loader and teaches the draft-mtp speculative path to pair a separate drafter GGUF with a Gemma 4 target at runtime. The drafter's missing input embedding tensors are now sourced from the target, exactly as Google's MTP design intended.

  2. The gemma-4-12b-it-qat model family. This is a dense (not MoE) 12B model QAT-trained by Google. The QAT process produces two GGUFs: the target (gemma-4-12b-it-qat-q4_0.gguf) and the assistant drafter (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf). The drafter is Q8_0 quantized — larger than you'd expect for a drafter, but that's the calibration that comes with QAT.

The 26B-A4B-it was MoE, which explains why vLLM only gave 1.13×. The 12B dense variant has the full parameter load on every forward pass — exactly the profile where MTP shines.

The working recipe

llama-server \
  --model /home/dan/models/gemma-4-12b-it-qat-q4_0.gguf \
  --model-draft /home/dan/models/gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-ngl 99 \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 1 \
  -fit off \
  --flash-attn \
  --cont-batching \
  --jinja \
  --port 8080

Two flags need explanation:

-fit off — The common_fit_params precheck at startup sees --n-gpu-layers 99 and calculates that the combined target + drafter weights won't fit the 24 GB. It refuses to start. But the calculation is overly conservative: the KV cache allocates lazily (empty context uses ~12 GB, climbs toward ~22 GB on a full 256k session), and whisper/tts run on GPU1, leaving the full 4090 for gemma4. -fit off bypasses the precheck; if it really doesn't fit, you'll get an OOM at allocation instead of a preflight rejection.

--cache-type-k q8_0 --cache-type-v q8_0 — q8_0 KV gives higher long-context fidelity than q4_0 at essentially the same throughput. Measured: ~130 tok/s with q8_0 vs ~110-130 tok/s with q4_0 — no measurable difference in speed. At a full 256k session, q8_0 KV climbs to ~22 GB used on the 4090, leaving ~2 GB free. Tight but fits.

The parallel trap

This is the one that will bite you. MTP speculative decoding only pays off at --parallel 1. With --parallel > 1, the draft overhead makes the live service slower than MTP-off.

Configuration tok/s Notes
np=4, MTP off ~90 tok/s baseline, 512k total, 128k/slot
np=4, MTP on (n=3) ~64 tok/s SLOWER than MTP-off
np=1, MTP off ~95 tok/s baseline
np=1, MTP on, 8k ctx ~178 tok/s 1.85×
np=1, MTP on, 256k live ~121 tok/s 60-67% acceptance, sustained

The old np=4/512k MTP config was running at ~64 tok/s. That's slower than the np=1 MTP-off baseline of 95 tok/s. The parallel trap isn't subtle — you're losing real throughput by enabling MTP at multiple slots.

The reason: speculative decoding with a separate drafter adds a full forward pass of the drafter model per batch step. At --parallel 1, the drafter forward pass runs once, and the target verifies N tokens from a single batch — the overhead is amortized. At --parallel 4, the drafter still runs once per step, but the verification cost scales with the batch size, and the acceptance rate drops because the drafter was trained for single-token prediction, not batched prediction. The net result: the drafter forward pass becomes dead weight that slows everything down.

This is hardware-specific to the 4090. The 5090 post also settled on np=1, but for a different reason — VRAM accounting (256k context at parallel=2 needs 34 GB, which OOMs on 32 GB). On the 4090, it's a performance cliff, not a memory cliff.

Dense vs MoE: the bandwidth argument

The MoE-blunts-MTP hypothesis from the vLLM post now has its confirmation:

The 12B dense model at Q4_0 is ~6 GB of weights. At batch=1, every decode token reads all 6 GB from VRAM. The 4090's ~1 TB/s bandwidth ceiling is far from saturated at that weight size — so baseline throughput (95 tok/s) has room to grow. MTP fills that room.

The 26B-A4B MoE only activates ~4B params per token (~3.5 GB at AWQ). That's already eating a significant fraction of the 1 TB/s bus at 165 tok/s baseline. MTP has less room to recover.

KV cache: q8_0 is free

One unexpected result: upgrading the KV cache from q4_0 to q8_0 costs nothing in throughput. The q8_0 quantization gives higher fidelity for long contexts — important when you're running 256k tokens and the tail of the context matters for retrieval accuracy. The measurement is unambiguous: ~130 tok/s at q8_0 vs ~110-130 tok/s at q4_0 on the same 256k live load. The throughput difference is within measurement noise, but the KV precision is objectively higher.

KV allocates lazily — at load with empty context, only ~12 GB is used. A full 256k session climbs toward ~22 GB. The ~2 GB headroom is tight enough that you don't want Whisper or TTS sharing GPU0. That's why the playbook puts audio services on GPU1 (CUDA_VISIBLE_DEVICES=1).

Acceptance rate

Draft acceptance on the 4090: 60-67% at n=3, averaging ~62% in live service. At 8k context the acceptance is higher (~66%), dropping toward 60% as the context fills. This is comparable to Qwen 3.6's 58-61% at n=3 on the same hardware, and well above the 50% threshold where speculative decoding stops being worthwhile.

VRAM accounting

Target (qat-q4_0):          ~6 GB
Drafter (assistant Q8_0):   ~8 GB
KV cache (q8_0, 256k):      ~10 GB (lazy, peaks at full context)
-------------------------------
Total at full context:      ~24 GB
4090 VRAM:                  24 GB (driver holds ~1.5 GB, ~22.5 GB allocatable)

Wait — that doesn't add up to 24 GB and leave room. The actual measured usage at 256k with q8_0 KV is ~22 GB, with ~2 GB free. The discrepancy is that the drafter and target share the input embedding table (the MTP design), so the total weight footprint is less than the naive sum of both files on disk. Plus the KV cache grows lazily and the 10 GB figure is a theoretical maximum — actual usage depends on the fill fraction.

Reproducibility

The playbook is danarch4090.scala. The surgical redeploy that adds MTP to a running box is DanarchGemma4Mtp.scala. Both are typed Scala 3 playbooks — run sbt --client "runMain ansible.examples.DanarchGemma4Mtp" and it rewrites just the gemma4 systemd unit.

The drafter GGUF (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf) is placed manually at /home/dan/models/ — not auto-downloaded by the playbook. Get it from the Google QAT Gemma 4 release.


Previous: Gemma 4 mtp, vllm (MoE at 1.13×, dense prediction). Next post: fleet observability — Prometheus, Grafana, Jaeger across three GPUs.