2026-06-06 — Dan Billings

Dreaming on the 5090, serving on the 4090: split-GPU cognitive load balancing

Dan Billings — 2026-06-06

When building an agentic workspace, there is a constant tension between latency and intelligence. You want the system to feel immediate, responding to your keystrokes or voice prompts without friction. But you also want it to reason deeply, utilizing large-parameter models that analyze context, draw inferences, and remember patterns across days of conversation.

In my home lab, this tension manifests in how Honcho (my semantic memory server) and the voice assistant subsystems are structured. After upgrading my main Arch server (danarch) to a single RTX 4090, I ran into a classic resource bottleneck: trying to cram an LLM, an embedding server, an ASR (Speech-to-Text) pipeline, and a TTS (Text-to-Speech) engine onto the same card.

The solution is a form of cognitive load balancing: splitting execution so things are fast when they need to be, slow when they can afford to be, and isolating spiky real-time pipelines onto dedicated silicon.

1. Fast vs. Slow: Splitting the Honcho Mind

Most memory configurations treat the LLM as a single endpoint. Every call—whether it is summarizing a 1,000-token session, answering a quick query, or dreaming up deep deductive conclusions—goes to the same model.

In my setup, I split Honcho's brain in two:

Fast when it matters: Local Gemma 4 12B

Normal deriver actions, real-time message observations, session summaries, and the crucial startup prewarm query (which fetches what Honcho knows about me to prime a new Hermes Agent session) are latency-critical. If these take 5 seconds to load, the interaction feels broken.

For these tasks, Honcho is configured to talk to gemma-4-12b-it-qat-q4_0.gguf running locally on danarch's RTX 4090:

Port: localhost:8080
Performance: ~75 tokens per second (utilizing Flash Attention and --metrics in llama-server).
VRAM budget: Comfortable context size (128K) at Q4_0 quantization.

Real-time observations get summarized and logged immediately with zero perceived lag.

Slow when it doesn't: Remote Qwen 3.6 27B

The dreaming cycle (deduction and induction passes) is a different beast. The dreamer worker wakes up, gathers hundreds of recent observations, and tries to synthesize them:

Deduction: Evaluates observations to extract implicit facts and spot logical contradictions.
Induction: Formulates generalized patterns about my working style, project architecture, and tool preferences, while merging duplicate conclusions.

This requires maximum reasoning capabilities. A 12B model can do it, but Qwen 3.6 27B is far more stable at compiling high-fidelity markdown peer cards.

However, running a 27B model at 200k context requires massive VRAM and is slower. Because dreaming runs asynchronously in the background via a systemd task queue, latency is irrelevant. The deriver worker can take 80 seconds to complete a dream pass, and I will never notice.

So, I updated the IaC configuration (danarch4090.scala) to split the Honcho environment:

_ <- Honcho.setup(
       llmBaseUrl    = "http://localhost:8080/v1", // Local Gemma 4 (75 tok/s)
       llmModel      = "gemma-4-12b-it-qat-q4_0.gguf",
       embedMessages = true,
       embedBaseUrl  = Some("http://localhost:8081/v1"),
       embedModel    = Some("nomic-embed-text-v1.5"),
       
       // Asynchronous deep dreaming routed to the larger model on danwin
       dreamBaseUrl  = Some("http://danwin:8080/v1"),
       dreamModel    = Some("Qwen3.6-27B-NVFP4-MTP-GGUF.gguf")
     )

During a manual dream cycle triggered today:

Deduction Pass: Succeeded on Qwen 3.6 27B over Tailscale, generating high-fidelity markdown cards outlining my Scala 3 architectural preferences (no primitives, Iron refinements, Free Monad interpreters).
User Impact: Zero impact on the local 4090 VRAM or real-time conversational latencies.

2. The Next Step: Secondary GPU Offloading for Audio

While Honcho's split-model routing resolves the cognitive latency vs. intelligence trade-off, my local voice pipeline still faces a raw VRAM crunch.

In danarch4090.scala, I have three audio services defined to enable real-time speech interaction:

parakeet-asr (Port 8765): Runs NVIDIA's Parakeet TDT 0.6B model for fast streaming Speech-to-Text.
whisper-asr (Port 8766): Serves Whisper Turbo (~5 GB VRAM) for high-accuracy synchronous transcription.
nemo-tts (Port 8767): Serves NeMo FastPitch + HiFi-GAN for low-latency Text-to-Speech narration.

Right now, they are all bound to CUDA_VISIBLE_DEVICES=0—competing directly with the main Gemma 4 12B instance. If I load the Whisper Turbo server, it locks up 5 GB of VRAM. Add in a 128K context window for Gemma 4, and the 24 GB of the RTX 4090 is completely maxed out.

The 3070 Blueprint

I have an RTX 3070 Ti (8 GB VRAM) sitting in storage. It is the perfect card to act as a co-processor for the audio pipeline.

By installing it alongside the 4090 on danarch, we can isolate the VRAM and compute workloads completely:

┌───────────────────────────────────────┐
│              danarch Server           │
│                                       │
│  ┌─────────────────────────────────┐  │
│  │       RTX 4090 (GPU 0)          │  │
│  │  - Gemma 4 12B (128k context)   │  │
│  │  - nomic-embed-text-v1.5        │  │
│  └─────────────────────────────────┘  │
│                                       │
│  ┌─────────────────────────────────┐  │
│  │       RTX 3070 (GPU 1)          │  │
│  │  - Whisper Turbo (~5 GB VRAM)   │  │
│  │  - Parakeet TDT (~1 GB VRAM)    │  │
│  │  - NeMo FastPitch (~500 MB)     │  │
│  └─────────────────────────────────┘  │
└───────────────────────────────────────┘

In the typed Scala playbooks, changing this is a simple environment declaration in the systemd service definitions:

See also: Type-Safe Home Cluster — the architecture behind these playbooks.

val whisperService = GenericService(name = "whisper-asr", ...)
  .withEnvironment("CUDA_VISIBLE_DEVICES" -> "1") // Bound to the 3070

val ttsService = GenericService(name = "nemo-tts", ...)
  .withEnvironment("CUDA_VISIBLE_DEVICES" -> "1") // Bound to the 3070

This ensures that spiky latency spikes from real-time microphone dictation or TTS generation never interrupt the KV-cache generation of the local LLM. It also gives the main model a full 24 GB of dedicated VRAM, enabling massive 256k context limits.

By treating hardware topologies with the same modularity as software architectures, we can build a self-hosted agent environment that is fast when we talk, deep when it dreams, and completely local.

← All writings · Home