my 4090 and I
Observations on local LLMs
2026-06-11 — Dan Billings
KV cache quantization determines both capacity and speed: using q8_0 instead of q4_0 gave 54% faster decode. A batch size of 3072 is the sweet spot on a 32 GB 5090, with 2x the prefill speed of 2048 and better decode than 4096. The cache lives in RAM as checkpoints, not VRAM. The 4090 needed a separate llama.cpp build (SM_86 vs SM_89) to avoid Ampere crashes.
2026-06-09 — Dan Billings
We have an optimized local AI setup: DeepSeek V4 Pro handles lightning-fast frontend queries, while asynchronous Honcho meta-queries are offloaded to local consumer hardware. Fact extraction runs on the 24 GB 4090 (Gemma 4), and dialectic reasoning and dreaming run on a 32 GB 5090 (Qwen 3.6) locked to a single slot for the full 256k context.
2026-06-07 — Dan Billings
Combining the lightning-fast DeepSeek V4 Pro cloud API for the foreground agent hot loop with a tiered local GPU cluster (RTX 4090 + 5090) for Honcho deriver, dreaming, and context-compression background tasks. High performance meets zero marginal cost.
2026-06-07 — Dan Billings
Managing a heterogeneous fleet of machines with a typed Scala 3 domain language — Free Monads for platform abstraction, Iron refinement types for compile-time validation, and LLMs for maintainability. If the infrastructure program compiles, the deployment is structurally sound.
2026-06-07 — Dan Billings
The gemma-4-12b-it-qat dense model with Gemma 4 MTP on a single RTX 4090. A separate drafter GGUF, a 1.9x speedup at a single slot, and why parallel > 1 makes MTP slower than MTP-off.
2026-06-07 — Dan Billings
Setting up OpenTelemetry and Jaeger to track request lifecycles across macOS (Hermes client), Arch Linux (Honcho server, nomic-embed, PostgreSQL), and WSL2 (llama-server inference). How to configure, verify, and read Jaeger traces to find performance bottlenecks.
2026-06-06 — Dan Billings
Documentation usually lies. With mdoc and Free Monads, it compiles and runs. If the docs don't compile, the build fails — making your documentation a living, mathematically provable specification of your DSL.
2026-06-06 — Dan Billings
Privacy-preserving edge vision via typed Infrastructure-as-Code. Deploying YOLO11 pose estimation to extract 17 COCO keypoints, render a glowing neon skeleton, and stream 500-byte telemetry instead of raw video.
2026-06-06 — Dan Billings
Configuring Honcho memory to run fast when it matters (local Gemma 4 12B at 75 tok/s) and slow when it doesn't (asynchronous Qwen 3.6 27B dreaming). Plus, an architectural blueprint for offloading STT/TTS voice pipelines to a secondary RTX 3070.
2026-06-01 — Dan Billings
A speculative forecast for running Qwen3.6-27B on an RTX 5090 with NVFP4 TurboQuant quantization and MTP. The bandwidth math, VRAM accounting, and why 200 tok/s is a qualitative shift — not just a faster number.
2026-06-01 — Dan Billings
After weeks of running Hermes with Honcho memory, the deriver is accumulating observations and the dream cycle is approaching. An explanation of what deduction and induction passes actually do, surprisal sampling, the peer card, and what changes when dreaming finally fires.
2026-05-30 — Dan Billings
Giving Hermes Agent actual persistent memory via a self-hosted Honcho instance. Dialectical reasoning, VRAM contention, embedding dimension mismatches, and why DeepSeek V4 Pro's cheap tokens are the right engine for this kind of work.
2026-05-25 — Dan Billings
After llama.cpp refused to load the Gemma 4 drafter, I tried what Google's announcement actually said to use. vLLM serves it — 1.13× over baseline at n=2 on a 24 GB 4090. The interesting question is why the speedup is modest compared to Qwen 3.6's 1.85×: MoE blunts MTP.
2026-05-24 — Dan Billings
Google shipped Gemma 4 MTP for transformers / MLX / vLLM / SGLang / Ollama. llama.cpp isn't on the list. I tried anyway. It doesn't work yet — here's exactly where it breaks and what upstream would need to change.
2026-05-24 — Dan Billings
A reproducible walkthrough for getting llama.cpp MTP working on a single RTX 4090 with unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL. Every llama-server flag explained.