my 4090 and I

Observations on local LLMs

q8_0 KV Cache and Batch Size: 54% faster decode, 2x prefill on 5090

KV cache quantization determines both capacity and speed. q8_0 over q4_0 gave 54% faster decode. Batch size 3072 is the sweet spot on a 32 GB 5090 — 2x prefill over 2048, better decode than 4096. The cache lives in RAM as checkpoints, not VRAM. The 4090 needed a separate llama.cpp build (SM_86 vs SM_89) to avoid Ampere crashes.
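
A rough sketch of why the cache format sets capacity, as back-of-the-envelope Python. The bytes-per-element figures follow ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model dimensions are hypothetical placeholders, not the configuration measured in the post.

```python
# Rough KV-cache sizing. Model dimensions below are hypothetical placeholders.

# Bytes per cached element for each ggml cache type:
# f16 is 2 bytes; q8_0 packs 32 values into 34 bytes; q4_0 packs 32 into 18.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Size of the K and V caches together for one sequence of n_ctx tokens."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K plus V
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical model: 48 layers, 8 KV heads, head_dim 128, 32k context.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(48, 8, 128, 32768, cache_type)
        print(f"{cache_type:>5}: {size / 2**30:.2f} GiB")
```

Under these assumptions q8_0 needs roughly half the memory of f16 and q4_0 a little over a quarter. The decode speed difference reported above is a measurement; this arithmetic only covers capacity and does not predict it.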


Cognitive Load Balancing: The Hybrid Cloud-Local AI Architecture

We have an optimized hybrid AI setup: the DeepSeek v4 Pro cloud API for lightning-fast frontend queries, with asynchronous Honcho meta-queries offloaded to local consumer hardware. Fact extraction runs on the 24 GB 4090 (Gemma 4), and dialectic reasoning and dreaming run on a 32 GB 5090 (Qwen 3.6) locked to a single slot for the full 256k context.


Speed for the Hot Loop, Local for the Rest: DeepSeek API + Tiered Dual-GPU Memory

Combining the lightning-fast DeepSeek v4 Pro cloud API for the foreground agent hot loop with a tiered local GPU cluster (RTX 4090 + 5090) for Honcho deriver, dreaming, and context compression background tasks. High performance meets zero marginal cost.


Port 0 doesn't exist in my infrastructure: typed Scala for home cluster management

Managing a heterogeneous fleet of machines with a typed Scala 3 domain language — Free Monads for platform abstraction, Iron refinement types for compile-time validation, and LLMs for maintainability. If the infrastructure program compiles, the deployment is structurally sound.


Gemma 4 MTP on a 4090: 1.9× at single slot, slower at parallel 4

The dense gemma-4-12b-it-qat model with Gemma 4 MTP on a single RTX 4090: a separate drafter GGUF, a 1.9× speedup at single slot, and why parallel > 1 makes MTP slower than running with MTP off.


Tracing requests across three GPUs and two operating systems: Jaeger without containers

Setting up OpenTelemetry and Jaeger to track request lifecycles across macOS (Hermes client), Arch Linux (Honcho server, nomic-embed, PostgreSQL), and WSL2 (llama-server inference). How to configure, verify, and read Jaeger traces to find performance bottlenecks.


Documentation that compiles: Free Monads and mdoc as provable specs

Documentation usually lies. With mdoc and Free Monads, it compiles and runs. If the docs don't compile, the build fails — making your documentation a living, mathematically provable specification of your DSL.


500 bytes instead of raw video: YOLO11 pose estimation as typed IaC

Privacy-preserving edge vision via typed Infrastructure-as-Code. Deploying YOLO11 pose estimation to extract 17 COCO keypoints, render a glowing neon skeleton, and stream 500-byte telemetry instead of raw video.
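
To give a feel for why keypoint telemetry is so much smaller than video, here is a minimal Python sketch that packs one 17-keypoint pose into a fixed-size binary frame. The wire format (timestamp, person id, then x, y, confidence per keypoint) and the function names are hypothetical illustrations, not the format the post actually uses.

```python
import struct
import time

NUM_KEYPOINTS = 17  # COCO pose keypoints

# Hypothetical wire format: little-endian timestamp (float64), person id
# (uint16), then x, y, confidence as float32 for each keypoint.
FRAME_FORMAT = "<dH" + "fff" * NUM_KEYPOINTS


def pack_pose(timestamp, person_id, keypoints):
    """Pack one pose into a fixed-size binary frame."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} keypoints, got {len(keypoints)}")
    flat = [value for point in keypoints for value in point]
    return struct.pack(FRAME_FORMAT, timestamp, person_id, *flat)


def unpack_pose(frame):
    """Inverse of pack_pose: returns (timestamp, person_id, keypoints)."""
    values = struct.unpack(FRAME_FORMAT, frame)
    timestamp, person_id, flat = values[0], values[1], values[2:]
    keypoints = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return timestamp, person_id, keypoints


if __name__ == "__main__":
    pose = [(0.5, 0.5, 0.9)] * NUM_KEYPOINTS  # placeholder normalized coordinates
    frame = pack_pose(time.time(), 1, pose)
    print(len(frame), "bytes")
```

In this sketch one pose takes a couple of hundred bytes, whereas a single uncompressed 640×480 RGB frame is 921,600 bytes. The receiver can redraw a skeleton from the keypoints alone, so no image data needs to leave the device.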


Dreaming on the 5090, serving on the 4090: split-GPU cognitive load balancing

Configuring Honcho memory to run fast when it matters (local Gemma 4 12B at 75 tok/s) and slow when it doesn't (asynchronous Qwen 3.6 27B dreaming). Plus, an architectural blueprint for offloading STT/TTS voice pipelines to a secondary RTX 3070.


111 tok/s at n=3 on 5090: NVFP4 MTP without the Blackwell cliff

A speculative forecast for running Qwen3.6-27B on an RTX 5090 with NVFP4 TurboQuant quantization and MTP. The bandwidth math, VRAM accounting, and why 200 tok/s is a qualitative shift — not just a faster number.
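
The kind of bandwidth math involved can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the post's actual calculation: the bandwidth figure is the spec-sheet number for the RTX 5090, the effective bits per weight and the draft acceptance rate are placeholders, and the estimate ignores drafter cost, KV cache reads, and kernel overhead.

```python
# Back-of-the-envelope decode ceiling for a memory-bandwidth-bound dense model.
# All inputs are illustrative assumptions, not measurements.


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_bytes_s, n_params, bits_per_weight):
    """Upper bound on tokens/s if every weight is read once per forward pass."""
    return bandwidth_bytes_s / weight_bytes(n_params, bits_per_weight)


def expected_tokens_per_pass(n_draft, accept_rate):
    """Expected tokens emitted per verification pass with n_draft speculative
    tokens, assuming each draft is accepted independently with accept_rate."""
    if accept_rate >= 1.0:
        return n_draft + 1
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


if __name__ == "__main__":
    bandwidth = 1.792e12   # assumed: RTX 5090 spec-sheet memory bandwidth, bytes/s
    params = 27e9          # 27B dense parameters
    bits = 4.5             # assumed effective bits per weight for a 4-bit format
    base = decode_ceiling_tok_s(bandwidth, params, bits)
    for n in (1, 2, 3):
        gain = expected_tokens_per_pass(n, accept_rate=0.7)  # hypothetical rate
        print(f"n={n}: ceiling {base * gain:.0f} tok/s (x{gain:.2f})")
```

The point of the sketch is the shape of the argument: a dense model's single-stream decode rate is capped by how fast the weights can be streamed from VRAM, and MTP raises that cap only as far as the drafted tokens are accepted.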


Honcho's dream cycle: what happens when your AI memory system reasons about itself

After weeks of running Hermes with Honcho memory, the deriver is accumulating observations and the dream cycle is approaching. An explanation of what deduction and induction passes actually do, surprisal sampling, the peer card, and what changes when dreaming finally fires.


Persistent memory for a local LLM: Honcho on Hermes Agent

Giving Hermes Agent actual persistent memory via a self-hosted Honcho instance. Dialectical reasoning, VRAM contention, embedding dimension mismatches, and why DeepSeek v4 Pro's cheap tokens are the right engine for this kind of work.


Gemma 4 MTP on vLLM: 1.13× and why MoE blunts speculative decoding

After llama.cpp refused to load the Gemma 4 drafter, I tried what Google's announcement actually said to use. vLLM serves it — 1.13× over baseline at n=2 on a 24 GB 4090. The interesting question is why the speedup is modest compared to Qwen 3.6's 1.85×: MoE blunts MTP.


Gemma 4 MTP on llama.cpp: the drafter architecture llama.cpp doesn't recognize

Google shipped Gemma 4 MTP for transformers / MLX / vLLM / SGLang / Ollama. llama.cpp isn't on the list. I tried anyway. It doesn't work yet — here's exactly where it breaks and what upstream would need to change.


Qwen 3.6 MTP on a 4090: 84 tok/s at n=3, 6 tok/s at n=4

A reproducible walkthrough for getting llama.cpp MTP working on a single RTX 4090 with unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL. Every llama-server flag explained.