my 5090 and I

Observations on local LLMs

192k Context on a Single 5090: Gemma-4-31B with Q8/Q5_1 KV Cache Quantization

2026-06-28 — Dan Billings

Tuning KV cache quantization to push Gemma-4-31B context window limits on the RTX 5090. Compiling with GGML_CUDA_FA_ALL_QUANTS, progressive benchmark sweeps at 64k, 128k, and 192k, and how q8_0 Key / q5_1 Value quantization saves 1.3 GB of VRAM, enough to run a 192k context safely in production.

100 tok/s at 100k Context: Hosting Gemma-4-31B QAT + MTP + Stable Diffusion on the RTX 5090

2026-06-26 — Dan Billings

Configuring the RTX 5090 optimally for local AI. Google's Gemma-4-31B with QAT-calibrated UD-Q4_K_XL weights and MTP drafting, hitting ~100 tokens/s at near-100k context. Plus, compiling stable-diffusion.cpp with CUDA for private, high-performance image generation, and monitoring it all with Prometheus.

q8_0 KV Cache and Batch Size: 54% faster decode, 2x prefill on 5090

2026-06-11 — Dan Billings

KV cache quantization determines both capacity and speed. q8_0 over q4_0 gave 54% faster decode. Batch size 3072 is the sweet spot on a 32 GB 5090 — 2x prefill over 2048, better decode than 4096. The cache lives in RAM as checkpoints, not VRAM. The 4090 needed a separate llama.cpp build (SM_86 vs SM_89) to avoid Ampere crashes.

Cognitive Load Balancing: The Hybrid Cloud-Local AI Architecture

2026-06-09 — Dan Billings

An optimized hybrid AI setup: DeepSeek v4 Pro for lightning-fast frontend queries, with asynchronous Honcho meta-queries offloaded to local consumer hardware. Fact extraction runs on the 24 GB 4090 (Gemma 4), and dialectic reasoning/dreaming runs on the 32 GB 5090 (Qwen 3.6), locked to a single slot for the full 256k context.

Speed for the Hot Loop, Local for the Rest: DeepSeek API + Tiered Dual-GPU Memory

2026-06-07 — Dan Billings

Combining the lightning-fast DeepSeek v4 Pro cloud API for the foreground agent hot loop with a tiered local GPU cluster (RTX 4090 + 5090) for Honcho deriver, dreaming, and context compression background tasks. High performance meets zero marginal cost.

Port 0 doesn't exist in my infrastructure: typed Scala for home cluster management

2026-06-07 — Dan Billings

Managing a heterogeneous fleet of machines with a typed Scala 3 domain language — Free Monads for platform abstraction, Iron refinement types for compile-time validation, and LLMs for maintainability. If the infrastructure program compiles, the deployment is structurally sound.

Gemma 4 MTP on a 4090: 1.9× at single slot, slower at parallel 4

2026-06-07 — Dan Billings

The gemma-4-12b-it-qat dense model with Gemma 4 MTP on a single RTX 4090: a separate drafter GGUF, a 1.9× speedup at a single slot, and why parallel > 1 makes MTP slower than running with MTP off.

Tracing requests across three GPUs and two operating systems: Jaeger without containers

2026-06-07 — Dan Billings

Setting up OpenTelemetry and Jaeger to track request lifecycles across macOS (Hermes client), Arch Linux (Honcho server, nomic-embed, PostgreSQL), and WSL2 (llama-server inference). How to configure, verify, and read Jaeger traces to find performance bottlenecks.

Documentation that compiles: Free Monads and mdoc as provable specs

2026-06-06 — Dan Billings

Documentation usually lies. With mdoc and Free Monads, it compiles and runs. If the docs don't compile, the build fails — making your documentation a living, mathematically provable specification of your DSL.

500 bytes instead of raw video: YOLO11 pose estimation as typed IaC

2026-06-06 — Dan Billings

Privacy-preserving edge vision via typed Infrastructure-as-Code. Deploying YOLO11 pose estimation to extract 17 COCO keypoints, render a glowing neon skeleton, and stream 500-byte telemetry instead of raw video.

Dreaming on the 5090, serving on the 4090: split-GPU cognitive load balancing

2026-06-06 — Dan Billings

Configuring Honcho memory to run fast when it matters (local Gemma 4 12B at 75 tok/s) and slow when it doesn't (asynchronous Qwen 3.6 27B dreaming). Plus, an architectural blueprint for offloading STT/TTS voice pipelines to a secondary RTX 3070.

111 tok/s at n=3 on 5090: NVFP4 MTP without the Blackwell cliff

2026-06-01 — Dan Billings

A speculative forecast for running Qwen3.6-27B on an RTX 5090 with NVFP4 TurboQuant quantization and MTP. The bandwidth math, VRAM accounting, and why 200 tok/s is a qualitative shift — not just a faster number.

Honcho's dream cycle: what happens when your AI memory system reasons about itself

2026-06-01 — Dan Billings

After weeks of running Hermes with Honcho memory, the deriver is accumulating observations and the dream cycle is approaching. An explanation of what deduction and induction passes actually do, surprisal sampling, the peer card, and what changes when dreaming finally fires.

Persistent memory for a local LLM: Honcho on Hermes Agent

2026-05-30 — Dan Billings

Giving Hermes Agent actual persistent memory via a self-hosted Honcho instance. Dialectical reasoning, VRAM contention, embedding dimension mismatches, and why DeepSeek v4 Pro's cheap tokens are the right engine for this kind of work.

Gemma 4 MTP on vLLM: 1.13× and why MoE blunts speculative decoding

2026-05-25 — Dan Billings

After llama.cpp refused to load the Gemma 4 drafter, I tried what Google's announcement actually said to use. vLLM serves it — 1.13× over baseline at n=2 on a 24 GB 4090. The interesting question is why the speedup is modest compared to Qwen 3.6's 1.85×: MoE blunts MTP.

Gemma 4 MTP on llama.cpp: the drafter architecture llama.cpp doesn't recognize

2026-05-24 — Dan Billings

Google shipped Gemma 4 MTP for transformers / MLX / vLLM / SGLang / Ollama. llama.cpp isn't on the list. I tried anyway. It doesn't work yet — here's exactly where it breaks and what upstream would need to change.

Qwen 3.6 MTP on a 4090: 84 tok/s at n=3, 6 tok/s at n=4

2026-05-24 — Dan Billings

A reproducible walkthrough for getting llama.cpp MTP working on a single RTX 4090 with unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL. Every llama-server flag explained.
