2026-06-09 — Dan Billings

Cognitive Load Balancing: The Hybrid Cloud-Local AI Architecture

[!TIP] Core Philosophy: Local inference is the default. We call a third-party cloud API only where it is strictly needed, which in practice means latency-sensitive foreground chat. Everything else runs locally at zero marginal cost.

When building a high-performance agentic workspace with Hermes and Honcho, balancing conversational latency against deep background reasoning is the central challenge. Sending every task to a cloud provider makes API costs grow with every token of background work, while running everything locally on consumer hardware bottlenecks the foreground chat experience.

The solution is Cognitive Load Balancing: dynamically splitting tasks based on their urgency, token footprint, and required reasoning depth.
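A minimal sketch of that split, for illustration only. The `Task` fields, the backend labels, and the `DERIVER_MAX_TOKENS` cutoff are hypothetical names and values chosen for this example; they are not part of Hermes or Honcho.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical cutoff: extraction jobs larger than this go to the
# long-context reasoner instead of the smaller deriver model.
DERIVER_MAX_TOKENS = 32_000


class Backend(Enum):
    CLOUD_CHAT = "deepseek-api"      # foreground hot loop
    LOCAL_DERIVER = "rtx4090-gemma"  # structured fact extraction
    LOCAL_REASONER = "rtx5090-qwen"  # compression and dreaming


@dataclass(frozen=True)
class Task:
    kind: str           # illustrative labels: "chat", "extract", "compress", "dream"
    interactive: bool   # True when a user is waiting on the reply
    prompt_tokens: int  # approximate token footprint of the job


def route(task: Task) -> Backend:
    """Route by urgency first, then by task type and token footprint."""
    if task.interactive:
        return Backend.CLOUD_CHAT
    if task.kind == "extract" and task.prompt_tokens <= DERIVER_MAX_TOKENS:
        return Backend.LOCAL_DERIVER
    return Backend.LOCAL_REASONER
```

Under these assumptions, `route(Task("chat", True, 500))` selects the cloud backend, while any non-interactive job stays on local hardware.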

1. The Hot Loop: DeepSeek v4 Pro

For the foreground chat (the "hot loop"), you need maximum speed and a large context window. Hermes Agent connects directly to the DeepSeek API (deepseek-chat) for all real-time conversational interactions.

[!NOTE] Why DeepSeek? The DeepSeek v4 Pro model acts as the fast, highly capable conversational interface. By using a cloud API exclusively for foreground turns, we get low-latency streaming responses while keeping API costs strictly bounded to active usage.
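For reference, a foreground turn can be sketched with the standard library alone. This assumes the OpenAI-compatible chat completions endpoint and server-sent-event streaming format that DeepSeek documents; `build_chat_request` and `iter_deltas` are hypothetical helper names, and this is not the client code Hermes itself uses.

```python
import json
import urllib.request

DEEPSEEK_URL = "https://api.deepseek.com/chat/completions"


def build_chat_request(messages, api_key, model="deepseek-chat"):
    """Build (but do not send) a streaming chat completion request."""
    body = {"model": model, "messages": messages, "stream": True}
    return urllib.request.Request(
        DEEPSEEK_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def iter_deltas(lines):
    """Yield text fragments from an iterable of raw SSE lines (bytes)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        text = chunk["choices"][0]["delta"].get("content")
        if text:
            yield text
```

To send the request, pass the result of `build_chat_request` to `urllib.request.urlopen` and feed the response object, which iterates line by line, into `iter_deltas`.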

2. Asynchronous Meta-Queries: The Local GPU Advantage

Hermes enables this hybrid architecture by offloading massive, token-heavy meta-queries asynchronously to the local GPU cluster.

Instead of burning DeepSeek API credits on constant context compression, fact extraction, and memory generalization, these tasks are sent to your local machines.
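The offloading pattern can be sketched with an `asyncio` queue: the foreground path enqueues a job and moves on, and a background worker drains the queue. The job strings and `fake_local_model` are placeholders standing in for calls to the local GPU servers; the real Hermes scheduling is not shown in this post.

```python
import asyncio


async def background_worker(queue, handler, results):
    """Drain jobs forever; each job runs off the foreground path."""
    while True:
        job = await queue.get()
        try:
            results.append(await handler(job))
        finally:
            queue.task_done()


async def fake_local_model(job):
    # Placeholder for a request to a local inference server.
    await asyncio.sleep(0.01)
    return f"done:{job}"


async def main():
    queue = asyncio.Queue()
    results = []
    worker = asyncio.create_task(
        background_worker(queue, fake_local_model, results)
    )

    # Foreground turn: enqueue meta-queries without waiting on them.
    queue.put_nowait("compress-turns-1-40")
    queue.put_nowait("extract-facts-turn-41")
    foreground_reply = "reply sent to user"

    # Only this demo waits for the backlog; a live agent would keep serving.
    await queue.join()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return foreground_reply, results
```

Running `asyncio.run(main())` returns the foreground reply together with the results the worker produced in the background.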

Fact Extraction (The Honcho Deriver)

Hardware: RTX 4090 (24GB VRAM)
Model: Gemma 4 12B (gemma-4-12b-it-qat-q4_0.gguf)
Why it's perfect: Gemma 4 is highly optimized for structured JSON extraction. The RTX 4090 chews through these extraction batches at ~75 tok/s, processing the active conversation history to identify atomic facts with no API charges at all.
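Structured output from a local model still needs defensive parsing. The sketch below assumes a hypothetical response shape, `{"facts": ["...", "..."]}`, which is an illustration and not Honcho's actual deriver schema. It drops malformed output, blank entries, and case-insensitive duplicates.

```python
import json


def parse_facts(raw: str) -> list[str]:
    """Return cleaned, de-duplicated facts from a model's JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model produced something that is not JSON
    if not isinstance(data, dict):
        return []
    facts = data.get("facts", [])
    if not isinstance(facts, list):
        return []

    seen = set()
    cleaned = []
    for fact in facts:
        if not isinstance(fact, str):
            continue
        text = " ".join(fact.split())  # collapse stray whitespace
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```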

Dialectic Reasoning & Context Compression

Hardware: RTX 5090 (32GB VRAM)
Model: Qwen 3.6 27B NVFP4 TurboQuant (Qwen3.6-27B-NVFP4-MTP-GGUF.gguf)
Why it's perfect: Deep inductive dreaming and context summarization require significant cognitive depth. Every 8 hours (or 50 observations), the system triggers a background process to synthesize facts and update user preferences. Because it runs completely asynchronously, network latency over Tailscale or slow generation speeds have zero impact on the user's active session.
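The "every 8 hours or 50 observations" rule can be sketched as a small trigger object. `DreamTrigger` is a hypothetical name, and the sketch assumes the two conditions are alternatives where whichever is met first fires the job. Timestamps are passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass


@dataclass
class DreamTrigger:
    max_observations: int = 50
    max_interval_s: float = 8 * 3600  # 8 hours in seconds
    last_run: float = 0.0             # timestamp of the last dreaming pass
    pending: int = 0                  # observations since the last pass

    def observe(self, now: float) -> bool:
        """Record one observation and report whether a pass is due."""
        self.pending += 1
        return self.due(now)

    def due(self, now: float) -> bool:
        too_many = self.pending >= self.max_observations
        too_old = (now - self.last_run) >= self.max_interval_s
        return too_many or too_old

    def mark_ran(self, now: float) -> None:
        """Reset both counters after a dreaming pass completes."""
        self.last_run = now
        self.pending = 0
```

In a live system `now` would typically come from `time.monotonic()`, and the caller would enqueue the dreaming job whenever `observe` returns `True`.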

3. The 5090 "Single Slot" Guarantee

To make the RTX 5090 the ultimate background reasoning and compression engine, we dedicate its resources to a single inference slot.

[!IMPORTANT] VRAM Accounting & Performance: The NVFP4 TurboQuant weights consume ~14 GB. A single 256k slot with Q4_0 KV cache takes ~10 GB. This combined 24 GB footprint fits within the RTX 5090's 32 GB of VRAM, leaving roughly 8 GB of headroom.
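The same accounting, written out as arithmetic, also shows why a second full-context slot is not an option. The figures are the approximate ones quoted above; the sketch ignores compute buffers and other runtime overhead, so real headroom will be somewhat smaller.

```python
GPU_VRAM_GB = 32.0           # RTX 5090
WEIGHTS_GB = 14.0            # NVFP4 weights, approximate
KV_CACHE_PER_SLOT_GB = 10.0  # one 256k slot with Q4_0 KV cache, approximate


def vram_headroom_gb(slots: int) -> float:
    """VRAM left after loading the weights and `slots` full-context caches."""
    return GPU_VRAM_GB - (WEIGHTS_GB + slots * KV_CACHE_PER_SLOT_GB)


def max_full_context_slots() -> int:
    """How many full 256k slots fit alongside the weights."""
    return int((GPU_VRAM_GB - WEIGHTS_GB) // KV_CACHE_PER_SLOT_GB)


print(vram_headroom_gb(1))       # 8.0 GB spare with one slot
print(vram_headroom_gb(2))       # -2.0, a second slot does not fit
print(max_full_context_slots())  # 1
```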

By sacrificing concurrency for context depth, we gain two major advantages:

Full 256k Context Guarantee: The single slot ensures Hermes has the model's entire native context length available to a single user. Lossy context summarization only triggers after massive volumes of tokens (e.g., 118k+ tokens), meaning an entire coding day easily fits into the fast KV cache without dropping important data.
Maximum Throughput: The RTX 5090 running MTP at n=3 on a single slot delivers up to 111 tok/s. This means the memory system works through large dreaming contexts and background compressions quickly enough that a backlog does not build up.
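A rough sketch of how those two numbers interact. It assumes "256k" means 262,144 tokens (256 × 1024) and treats 118k as a simple threshold; both function names are hypothetical, and the generation time covers output tokens only, not prompt processing.

```python
CONTEXT_TOKENS = 256 * 1024    # assumed size of the 256k window
COMPRESS_THRESHOLD = 118_000   # lossy summarization starts at this point
PEAK_TOK_PER_S = 111.0         # quoted single-slot generation speed


def should_compress(used_tokens: int) -> bool:
    """True once the conversation reaches the compression threshold."""
    return used_tokens >= COMPRESS_THRESHOLD


def generation_seconds(output_tokens: int, tok_per_s: float = PEAK_TOK_PER_S) -> float:
    """Best-case time to generate `output_tokens` at the given speed."""
    return output_tokens / tok_per_s


# Fraction of the window in use when compression first triggers (about 45%).
threshold_fraction = COMPRESS_THRESHOLD / CONTEXT_TOKENS
```

Under these assumptions compression begins with more than half the window still free, which leaves room for the summarization prompt and its output inside the same slot.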

Architecture Diagram

graph TD
    User([User Converses]) --> |Hot Loop| Hermes[Hermes Agent <br/> Mac Mini]
    
    Hermes --> |Foreground Chat| DeepSeek[DeepSeek API <br/> v4 Pro]
    Hermes --> |Asynchronous<br/>Context Compression| Qwen[RTX 5090: Qwen 3.6 27B <br/> 256k Context / Single Slot]
    
    Hermes -.-> |Async Turn Logging| Honcho[Honcho Memory Server <br/> danarch]
    
    Honcho --> |Fact Extraction| Gemma[RTX 4090: Gemma 4 12B <br/> JSON Deriver]
    Honcho --> |Background Dreaming| Qwen
    
    classDef cloud fill:#2d3436,stroke:#74b9ff,stroke-width:2px,color:#fff;
    classDef local fill:#0984e3,stroke:#fff,stroke-width:2px,color:#fff;
    classDef agent fill:#00b894,stroke:#fff,stroke-width:2px,color:#fff;
    
    class DeepSeek cloud;
    class Qwen,Gemma local;
    class Hermes,Honcho agent;

This hybrid setup ensures the cloud is used only for the latency-sensitive front-end, while your 32 GB 5090 and 24 GB 4090 provide a massive, free reservoir of compute for the deep dialectic reasoning that makes the agent truly smart.
