2026-06-07 — Dan Billings

Speed for the Hot Loop, Local for the Rest: DeepSeek API + Tiered Dual-GPU Memory

Dan Billings — 2026-06-07

When building an agentic workspace, you face a trade-off between conversational latency and background reasoning depth. You want instant, streaming feedback when you type or talk to the agent. At the same time, you want the agent to remember everything, synthesize long-term facts, and summarize massive context windows without sending your API bill into the stratosphere.

In my home lab setup, the agent is Hermes Agent (running on my Mac Mini) integrated with a self-hosted Honcho memory server (running on danarch, my RTX 4090 Linux box).

I've reorganized this architecture to achieve the ultimate trade-off: lightning-fast foreground chat via the DeepSeek cloud API, paired with unlimited, zero-cost background memory synthesis and context compression on my local GPUs.

The Bottleneck: Unified Model Routing

Previously, my configuration routed all LLM requests to a local model instance (e.g., Qwen 3.6 27B on my RTX 5090 or Gemma 4 12B on my RTX 4090). This presented two distinct issues:

Latency in the Hot Loop: During active turns, local generation speed was constrained by VRAM bandwidth and context size. A full context window of 64k+ tokens degraded local decode speeds.
API Costs vs. VRAM Pressure: If I moved the entire setup to a cloud provider like Anthropic or OpenAI to get speed, the background tasks—specifically Honcho's Deriver (which inspects every turn to extract JSON observations) and Dream Cycle (which summarizes blocks of observations to induce user preferences)—would swallow massive volumes of input tokens, driving up API costs.

The solution is Cognitive Load Balancing: split the agent's tasks across endpoints according to their execution urgency, latency sensitivity, and token footprint.

The Architecture: Hybrid Cloud-Local Routing

I restructured the setup to separate the foreground conversational hot loop from the background memory stack:

                  ┌──────────────────────┐
                  │   User Converses     │
                  └──────────┬───────────┘
                             │ (Hot Loop)
                             ▼
┌─────────────────────────────────────────────────────────┐
│              dans-mac-mini (Hermes Agent)               │
│                                                         │
│   Foreground Chat (Fast API)  ──► DeepSeek Cloud API    │
│   Context Compression (Slow)  ──► RTX 5090 Local Qwen   │
└────────────────────────────┬────────────────────────────┘
                             │ (Asynchronous logging)
                             ▼
┌─────────────────────────────────────────────────────────┐
│               danarch (Honcho Server)                   │
│                                                         │
│   Fact Extraction (Deriver)   ──► RTX 4090 Local Gemma  │
│   Memory Dreaming (Induction) ──► RTX 5090 Local Qwen   │
└─────────────────────────────────────────────────────────┘

1. Foreground Chat (The Hot Loop): DeepSeek v4 Pro

The conversation requires maximum speed and capability. I configured the main Hermes model to point to the DeepSeek API:

Endpoint: https://api.deepseek.com/v1
Model: deepseek-chat
Context Window: 262144 (256k tokens)
Prewarm Query: Summarizing the user's active "peer card" at session start is run synchronously here, ensuring the agent loads preferences instantly.

2. Client-Side Context Compression: RTX 5090 (Qwen 3.6 27B)

Context compression summarizes older turns when history fills up. It is a slow, token-heavy task. Instead of wasting DeepSeek API tokens, Hermes offloads this task:

Endpoint: http://danwin:8080/v1 (RTX 5090)
Model: Qwen3.6-27B-NVFP4-MTP-GGUF.gguf
VRAM footprint: Offloaded completely from the Mac Mini and the 4090.

3. Honcho Deriver (Fact Extraction): RTX 4090 (Gemma 4 12B)

Every time a message is exchanged, Honcho must parse the text and extract atomic observations.

Endpoint: http://localhost:8080/v1 (Local RTX 4090)
Model: gemma-4-12b-it-qat-q4_0.gguf
Performance: ~75 tokens/s. Gemma 4 is highly optimized for structured JSON extraction. Because this runs locally on the 4090, it costs zero API dollars, and the high local throughput ensures the logging queue never lags.

4. Honcho Dreaming (Generalization): RTX 5090 (Qwen 3.6 27B)

Every 8 hours (or after 50 new observations), Honcho runs a deep deduction/induction pass to update the user's peer card. This requires maximum reasoning capability.

Endpoint: http://danwin:8080/v1 (RTX 5090 over Tailscale)
Model: Qwen3.6-27B-NVFP4-MTP-GGUF.gguf
Latency: Since this runs asynchronously in a systemd background worker (honcho-deriver.service), the higher network latency of running over Tailscale has zero impact on the user's chat turns.

Infrastructure-as-Code Implementation

I updated the typed Scala 3 playbook (MacMini.scala) to declare this configuration cleanly. We expanded the HermesAuxiliaryCompressionConfig model to allow custom endpoint overrides for the compression model:

final case class HermesAuxiliaryCompressionConfig(
    context_length: ContextLength,
    provider:       Option[HermesProvider] = None,
    model:          Option[ModelName] = None,
    base_url:       Option[HttpUrl] = None
)

The resulting Mac Mini configuration task in the playbook binds DeepSeek as the primary driver while explicitly routing the auxiliary compression engine to the local RTX 5090:

_ <- Hermes.configure(
       modelUrl      = url, // https://api.deepseek.com/v1
       soul          = soulContent,
       honchoUrl     = honchoUrlOpt,
       contextLength = 262144, // 256k limit
       compression   = Some(HermesCompressionConfig(enabled = true, threshold = 0.90)),
       provider      = HermesProvider.Deepseek,
       defaultModel  = "deepseek-chat".assume[Not[Empty]],
       
       // Offload compression to the local 5090 Qwen instance
       auxiliaryCompressionContextLength = Some(262144),
       auxiliaryCompressionProvider       = Some(HermesProvider.Custom),
       auxiliaryCompressionModel          = Some("Qwen3.6-27B-NVFP4-MTP-GGUF.gguf".assume[Not[Empty]]),
       auxiliaryCompressionBaseUrl        = Some("http://danwin:8080/v1".assume[ValidHttpUrl]),
       
       sttUrl        = Some(defaultSttUrl),
       logLevel      = Some(HermesLogLevel.Debug)
     )

And the Honcho server environment (danarch4090.scala) registers the local Gemma instance for real-time derivation and the remote 5090 Qwen instance for dreaming:

_ <- Honcho.setup(
       llmBaseUrl    = "http://localhost:8080/v1", // Local Gemma 4 12B
       llmModel      = "gemma-4-12b-it-qat-q4_0.gguf",
       embedMessages = true,
       embedBaseUrl  = Some("http://localhost:8081/v1"),
       embedModel    = Some("nomic-embed-text-v1.5"),
       
       // Asynchronous dreaming offloaded to 5090
       dreamBaseUrl  = Some("http://danwin:8080/v1"),
       dreamModel    = Some("Qwen3.6-27B-NVFP4-MTP-GGUF.gguf")
     )

Conclusion: The Best of Both Worlds

By applying Cognitive Load Balancing, my agent's conversational loop streams back answers immediately using DeepSeek's high-context cloud API. Meanwhile, the heavy lifting of observation extraction, memory dreaming, and context compression runs entirely locally on my RTX 4090 and 5090 GPU hardware.

This hybrid architecture delivers low latency where it matters (in user interaction) and zero marginal cost where it matters (in background context processing and reasoning).

← All writings · Home