
Intelligent Context Routing: How We Built a Memory System That Decides What the Model Sees

14 Jan 2026

Applying recursive language model principles to build a general-purpose memory architecture that routes context intelligently across every conversation, task, and module





Every AI system that runs more than a single prompt has a memory problem.

Your model has a context window — 128K, 200K, maybe 1M tokens. That sounds generous until you consider what a real production system accumulates: conversation history, user preferences, prior task outputs, retrieved documents, system instructions, tool results, error logs, structured data from upstream modules. Stack all of that together and you're not just hitting token limits — you're drowning the model in context it doesn't need for the task at hand.

The research community calls this context rot: the empirically demonstrated phenomenon where model performance degrades as a function of context length, even when the input fits within the window. Recent work from MIT on Recursive Language Models (RLMs) showed that on information-dense benchmarks, models operating on curated, targeted context dramatically outperform the same models given the full input — by 28–58% on some tasks. The conclusion is counterintuitive: giving the model less information produces better results.

We took this principle and built it into the core of our platform. Not as a feature for one module. As the memory architecture that governs every conversation, every task execution, and every cross-module interaction across the entire system.

The Problem: Everything Competes for Context

Production AI systems aren't single-turn Q&A bots. They maintain state. They accumulate history. They process documents, generate outputs, handle errors, and carry forward context across multi-step workflows.

In our platform, a single user session might involve:

  • Uploading financial documents (10-Ks, 10-Qs, earnings transcripts)

  • Running a multi-module analysis pipeline (revenue extraction, COGS, working capital, debt scheduling, tax computation, DCF valuation)

  • Having a conversation about the results — asking follow-up questions, requesting adjustments, exploring scenarios

  • Triggering error recovery when something goes wrong (see our previous post on self-healing systems)

  • Referencing prior sessions — "use the same assumptions as last time"

Every one of these interactions generates context. Conversation turns pile up. Module outputs accumulate. Retrieved document chunks stack. Error-fix pairs from the self-healing system get loaded. And all of it competes for space in a finite context window.

The naive approach — concatenate everything and hope the model figures out what's relevant — fails in three predictable ways:

  1. Context rot degrades output quality. The model's attention is spread across thousands of tokens of conversation history when it only needs the last three turns and one specific module output. Irrelevant context doesn't just waste tokens — it actively harms performance.

  2. Token cost scales linearly with accumulated state. Every API call that ships the full conversation history, all retrieved chunks, and all module outputs burns tokens on content the model doesn't need for the current task. Across hundreds of user sessions per day, this compounds into significant cost.

  3. Cross-contamination between tasks. When a revenue extraction module's context includes residual conversation history about debt scheduling, subtle reasoning errors emerge. The model conflates information from different analytical contexts.

The Research Foundation: Recursive Language Models

Our architecture is grounded in the RLM framework from Zhang, Kraska, and Khattab at MIT CSAIL. The central insight of RLMs is that long inputs should not be fed into the model directly — they should be treated as part of an external environment that the model can programmatically interact with.

In RLMs, the input is loaded as a variable inside a Python REPL. The model writes code to peek into, filter, chunk, and recursively call itself over targeted snippets. It never sees the full input in its context window. Instead, it reasons about what it needs, retrieves it programmatically, and operates on curated slices.
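The pattern is easy to sketch. In the toy example below, plain functions stand in for the code an RLM would write inside its REPL: the long input lives in a variable, and only filtered slices are ever loaded. The helper names are illustrative, not the paper's API.

```python
def peek(text: str, start: int = 0, length: int = 200) -> str:
    """Return a small window of the input, as an RLM might via REPL code."""
    return text[start:start + length]

def grep(text: str, keyword: str) -> list[str]:
    """Return only the lines mentioning the keyword."""
    return [line for line in text.splitlines() if keyword in line]

def answer_over(snippets: list[str]) -> str:
    """Stand-in for a recursive sub-call over a curated slice."""
    return " | ".join(snippets)

# A long "document" the model never loads into context directly.
document = "\n".join(
    f"line {i}: revenue grew" if i % 50 == 0 else f"line {i}: filler"
    for i in range(200)
)

preview = peek(document, 0, 40)        # inspect a small window
relevant = grep(document, "revenue")   # filter to targeted lines
result = answer_over(relevant)         # recurse over the curated slice
```

The model reasons over `relevant` (a handful of lines) rather than the full 200-line input, which is the entire point of the framework.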

The performance implications are significant:

  • On BrowseComp-Plus (multi-hop QA over 6–11M tokens across 1,000 documents), RLM with GPT-5 achieved 91.33% accuracy. The base model scored 0% — it couldn't even fit the input.

  • On OOLONG (dense information aggregation), RLMs outperformed the base model by 28.4% even on inputs that fit within the context window.

  • On OOLONG-Pairs (quadratic complexity pairwise reasoning), base GPT-5 scored below 0.1% F1. RLM scored 58%.

The takeaway isn't about handling longer inputs. It's that intelligent context curation produces fundamentally better reasoning than raw context stuffing, regardless of whether you're at the token limit.

We applied this principle as a general-purpose memory layer across our entire platform.

Architecture: Three-Tier Memory Hierarchy

We model our memory system as a hierarchy — analogous to CPU cache, RAM, and disk in computer architecture. Each tier has different capacity, latency, and persistence characteristics.

Tier 1: Persistent Memory Store (Disk)

All accumulated state lives here permanently. This includes:

  • Conversation history: Every user message and model response, across all sessions.

  • Document store: Uploaded files — SEC filings, financial statements, supporting documents — chunked and indexed with domain-aware metadata.

  • Module outputs: Structured results from every pipeline execution — extracted revenue figures, computed margins, generated schedules.

  • Error-fix pairs: The self-healing system's learned solutions (error context → fix instruction mappings).

  • User context: Preferences, prior assumptions, recurring parameters ("always use 10% WACC for this client").

This tier is backed by our database (Supabase). Nothing enters the model's context window directly from here. It's the source of truth, but it's external to the model.

Each piece of stored memory carries metadata for routing:


{
  "memory_id": "conv-2025-0210-turn-47",
  "memory_type": "conversation",
  "session_id": "session-abc123",
  "timestamp": "2025-02-10T14:30:00Z",
  "module_context": "revenue_extraction",
  "entities": ["company_xyz", "segment_revenue", "FY2024"],
  "token_count": 480,
  "relevance_decay": 0.95,
  "dependencies": ["conv-2025-0210-turn-45", "module-output-revenue-xyz"]
}

The relevance_decay field is important — conversation turns from three hours ago are less likely to be relevant than turns from three minutes ago. The router uses this for temporal prioritisation.
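A minimal sketch of how temporal prioritisation might combine a base relevance score with exponential decay. The 30-minute half-life below is an illustrative assumption, not a production figure:

```python
HALF_LIFE_SECONDS = 30 * 60  # assumed 30-minute half-life

def decayed_score(base_relevance: float, age_seconds: float) -> float:
    """Exponential decay: effective relevance halves every HALF_LIFE_SECONDS."""
    return base_relevance * 0.5 ** (age_seconds / HALF_LIFE_SECONDS)

recent = decayed_score(0.95, 3 * 60)       # a turn from three minutes ago
stale = decayed_score(0.95, 3 * 60 * 60)   # a turn from three hours ago
```

Under this scheme the three-hour-old turn scores a small fraction of the three-minute-old one, so it loses ties in the router's ranking even when the content is similar.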

Tier 2: Context Router (The Decision Layer)

This is the core innovation. Before every LLM call — whether it's a module execution, a conversational response, an error diagnosis, or a follow-up query — the router determines exactly what memory should be loaded into context.

The router operates in two phases:

Phase 1: Static retrieval based on task type. Each task type declares a retrieval specification — a structured schema defining what categories of memory it needs and with what priority.

For a conversational follow-up response:


{
  "task_type": "conversation_response",
  "required_memory": [
    {"type": "conversation", "scope": "current_session", "recency": "last_5_turns", "priority": "essential"},
    {"type": "user_context", "scope": "global", "priority": "essential"},
    {"type": "module_output", "scope": "current_session", "priority": "supplementary"},
    {"type": "error_fix", "scope": "current_session", "priority": "conditional"}
  ],
  "max_context_tokens": 16000
}

For a revenue extraction module:


{
  "task_type": "revenue_extraction",
  "required_memory": [
    {"type": "document_chunk", "sections": ["Item 7 - MD&A", "Item 8 - Financial Statements"], "priority": "essential"},
    {"type": "module_output", "modules": ["prior_revenue_extraction"], "priority": "supplementary"},
    {"type": "user_context", "keys": ["default_assumptions", "client_preferences"], "priority": "supplementary"},
    {"type": "error_fix", "module": "revenue_extraction", "priority": "conditional"}
  ],
  "max_context_tokens": 32000
}

For an error diagnosis:


{
  "task_type": "error_diagnosis",
  "required_memory": [
    {"type": "error_context", "scope": "current_error", "priority": "essential"},
    {"type": "error_fix", "similarity_match": true, "priority": "essential"},
    {"type": "module_output", "scope": "current_pipeline", "priority": "supplementary"},
    {"type": "conversation", "scope": "current_session", "recency": "last_3_turns", "priority": "conditional"}
  ],
  "max_context_tokens": 24000
}

This produces a candidate set of memory items. The candidate set almost always exceeds the token budget.
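Phase 1 can be sketched as a straightforward filter of the memory store against the spec. The in-memory store and the matching logic below are simplified stand-ins for the real database query:

```python
# Simplified memory store; field names mirror the metadata examples above.
MEMORY_STORE = [
    {"memory_id": "conv-turn-47", "memory_type": "conversation",
     "session_id": "s1", "token_count": 480},
    {"memory_id": "conv-turn-46", "memory_type": "conversation",
     "session_id": "s1", "token_count": 350},
    {"memory_id": "doc-chunk-mda", "memory_type": "document_chunk",
     "session_id": "s1", "token_count": 2100},
]

SPEC = {
    "task_type": "conversation_response",
    "required_memory": [
        {"type": "conversation", "priority": "essential"},
    ],
    "max_context_tokens": 16000,
}

def static_retrieve(spec: dict, store: list[dict]) -> list[dict]:
    """Collect every stored item matching a rule, tagged with that rule's priority."""
    candidates = []
    for rule in spec["required_memory"]:
        for item in store:
            if item["memory_type"] == rule["type"]:
                candidates.append({**item, "priority": rule["priority"]})
    return candidates

candidates = static_retrieve(SPEC, MEMORY_STORE)
```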


Phase 2: Dynamic triage via LLM. The candidate set (presented as summaries and metadata, not full content) is passed to a lightweight LLM call that performs context triage. Given the specific task, the candidate set, and the token budget, the model decides:

  • Essential: Must include. The task cannot execute without this memory.

  • Supplementary: Include if budget allows. Improves output quality but isn't strictly required.

  • Redundant: Exclude. Information is already covered by another memory item in the payload.

  • Irrelevant: Exclude. Retrieved by the static rules but not actually useful for this specific task instance.

The router also determines presentation order — most relevant items first. This is a deliberate mitigation for position bias (the well-documented tendency of LLMs to attend more strongly to information at the beginning and end of context).

The output is a curated context payload:


{
  "context_payload": [
    {"memory_id": "user-context-assumptions", "classification": "essential", "tokens": 320},
    {"memory_id": "conv-turn-47", "classification": "essential", "tokens": 480},
    {"memory_id": "conv-turn-46", "classification": "essential", "tokens": 350},
    {"memory_id": "module-output-revenue-xyz", "classification": "supplementary", "tokens": 1200},
    {"memory_id": "error-fix-revenue-null-parser", "classification": "supplementary", "tokens": 280}
  ],
  "total_tokens": 2630,
  "budget_remaining": 13370,
  "excluded": [
    {"memory_id": "conv-turn-12", "reason": "Superseded by more recent conversation context"},
    {"memory_id": "module-output-cogs-xyz", "reason": "Not relevant to current conversational query about revenue"}
  ]
}
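A deterministic stand-in for the triage step, with classifications supplied up front rather than produced by the LLM: keep essentials, add supplementary items while the budget allows, and order the payload most-relevant-first:

```python
def triage(candidates: list[dict], budget: int) -> dict:
    """Pack essentials first, then supplementary items that fit the budget."""
    essential = [c for c in candidates if c["classification"] == "essential"]
    supplementary = [c for c in candidates if c["classification"] == "supplementary"]
    payload, used = [], 0
    for item in essential + supplementary:
        # Essentials are always included; supplementary items must fit.
        if item["classification"] == "essential" or used + item["tokens"] <= budget:
            payload.append(item)
            used += item["tokens"]
    # Most relevant first: essentials lead, then descending relevance score.
    payload.sort(key=lambda c: (c["classification"] != "essential", -c["relevance"]))
    excluded = [c for c in candidates if c not in payload]
    return {"context_payload": payload, "total_tokens": used, "excluded": excluded}

result = triage(
    [
        {"memory_id": "conv-turn-47", "classification": "essential",
         "tokens": 480, "relevance": 0.9},
        {"memory_id": "module-output", "classification": "supplementary",
         "tokens": 1200, "relevance": 0.6},
        {"memory_id": "old-chunk", "classification": "supplementary",
         "tokens": 5000, "relevance": 0.2},
    ],
    budget=2000,
)
```

The real Phase 2 produces these classifications via the lightweight LLM call; the packing and ordering logic afterwards is mechanical, as above.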

Tier 3: Active Context Window (Working Memory)

The curated payload is injected into the LLM's context window alongside the task-specific system prompt and the current user input. The model operates on a clean, targeted context — exactly the memory it needs, nothing more.

Each LLM call gets its own isolated context. A conversation response sees recent turns and relevant module outputs. A revenue module sees financial document chunks and user assumptions. An error diagnosis sees the error payload and similar prior fixes. There is no cross-contamination.

If the model needs information that wasn't in its initial payload — say, the user references something from an earlier session that the router didn't include — the model can issue a callback to the router, requesting specific additional memory. The router evaluates the request against the remaining token budget and either fulfils it or signals that the budget is exhausted.
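The callback path might look like the following sketch, where the class shape, method names, and simple budget check are illustrative assumptions:

```python
class RouterCallback:
    """Serve on-demand memory requests against the remaining token budget."""

    def __init__(self, store: dict[str, dict], budget_remaining: int):
        self.store = store
        self.budget_remaining = budget_remaining

    def request(self, memory_id: str) -> dict:
        item = self.store.get(memory_id)
        if item is None:
            return {"status": "not_found"}
        if item["token_count"] > self.budget_remaining:
            return {"status": "budget_exhausted"}
        self.budget_remaining -= item["token_count"]
        return {"status": "ok", "content": item["content"]}

router = RouterCallback(
    {"conv-turn-12": {"token_count": 400, "content": "earlier assumptions..."}},
    budget_remaining=500,
)
first = router.request("conv-turn-12")
second = router.request("conv-turn-12")  # only 100 tokens left; cannot refetch
```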

How This Connects to the Self-Healing System

Our self-healing architecture (detailed in our previous post) generates error-fix pairs and stores them in the database. The context routing system is what makes those learned fixes actionable.

When an error occurs, the router's retrieval specification for error_diagnosis tasks includes a similarity search over the error-fix database. If a matching fix exists with a high reward score, it's loaded into the diagnosis module's context alongside the current error. The model doesn't regenerate a fix from scratch — it references a proven solution.

This creates a compounding loop:

  1. Self-healing generates and validates fixes, persisting successful ones.

  2. Context routing retrieves relevant fixes when similar errors recur.

  3. The model applies proven fixes instantly (Tier 1 cached recovery) instead of running the full diagnosis loop (Tier 2 LLM-generated recovery).

Over time, the error-fix database grows, the router retrieves better matches, and Tier 2 diagnosis calls decrease. The system gets faster at recovering because it remembers what worked.

Why This Isn't RAG

Retrieval-Augmented Generation retrieves the top-K chunks by embedding similarity and concatenates them. It's a one-shot retrieval: query → rank → stuff into context.

Our routing system differs in several fundamental ways:

RAG retrieves by similarity. We route by necessity. Similarity doesn't equal relevance. A chunk about revenue recognition policy (accounting treatment) is semantically similar to a query about revenue figures, but the actual revenue table is what the task needs. Our router reasons about task requirements, not embedding distance.

RAG doesn't deduplicate. If three conversation turns discuss the same topic, RAG might retrieve all three, wasting tokens on redundant information. Our router identifies overlap and selects the most informative version.

RAG doesn't manage token budgets. Our router actively trades off between information completeness and context capacity. It classifies memory as essential vs. supplementary and makes deliberate exclusion decisions when budgets are tight. RAG truncates at top-K regardless of what's being cut.

RAG doesn't respect task topology. A DCF module needs structured outputs from upstream modules, not raw document chunks that those modules already processed. Our router understands module dependencies and composes context from a mix of derived data and raw sources. RAG has no concept of this.

RAG is stateless across calls. Each retrieval is independent. Our router carries forward routing decisions — if a chunk was classified as irrelevant for one module, that signal informs routing for subsequent modules in the same pipeline.

The Recursive Pattern: Cross-Module and Cross-Session Memory

The RLM paper's recursive sub-calling pattern maps directly to how our modules share memory.

When a downstream module (e.g., DCF valuation) needs upstream results, it doesn't receive the raw document chunks that upstream modules processed. It receives the structured outputs — revenue projections, COGS margins, working capital schedules — as compact, pre-processed memory items.

DCF Module Context:
├── Structured output: Revenue projections (from revenue module) — 800 tokens
├── Structured output: COGS margins (from COGS module) — 400 tokens
├── Structured output: Working capital schedule (from WC module) — 600 tokens
├── Structured output: Debt schedule (from debt module) — 500 tokens
├── Structured output: Tax rates (from tax module) — 300 tokens
├── Targeted chunks: Capex discussion from MD&A — 2,100 tokens
├── Targeted chunks: Management guidance on growth — 1,800 tokens
└── User context: Client-specific WACC assumption — 120 tokens
     Total: ~6,620 tokens (vs. 120,000+ for full document pass-through)
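The composition above can be checked mechanically. The sketch below mirrors the item names and token counts from the tree; the assembly logic itself is illustrative:

```python
# Upstream modules hand the DCF call compact structured outputs, not raw filings.
UPSTREAM_OUTPUTS = {
    "revenue": 800, "cogs": 400, "working_capital": 600, "debt": 500, "tax": 300,
}
TARGETED_CHUNKS = {"capex_mda": 2100, "growth_guidance": 1800}
USER_CONTEXT = {"wacc_assumption": 120}

def compose_dcf_context_tokens() -> int:
    """Sum the token cost of the curated DCF payload."""
    return (sum(UPSTREAM_OUTPUTS.values())
            + sum(TARGETED_CHUNKS.values())
            + sum(USER_CONTEXT.values()))

total_tokens = compose_dcf_context_tokens()  # matches the ~6,620 figure above
```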

The same pattern applies to conversation memory. When a user asks "what assumptions did we use last time?", the router doesn't load the entire prior session. It retrieves the specific module outputs and conversation turns where assumptions were discussed — structured, compact, relevant.

This is functionally equivalent to RLM's recursive decomposition: the root task decomposes into sub-tasks, each sub-task operates on curated context, and results flow back as structured variables rather than raw token streams.

Trade-Offs and Limitations

Router latency. Every LLM call is preceded by a routing decision. The Phase 1 static retrieval is fast (database query). The Phase 2 dynamic triage adds an LLM inference call. We mitigate this by using a smaller, faster model for routing and by caching routing plans for repeated task patterns.

Routing errors are silent failures. If the router excludes a memory item that turns out to be essential, the downstream module operates on incomplete information — and may not know it. The callback mechanism partially addresses this, but false negatives in routing are harder to detect than false negatives in retrieval.

Cold start problem. For new users or new document types, the router has limited metadata to work with. Routing quality improves as the system accumulates memory and the static rules get refined for specific use cases.

Metadata quality dependency. The router is only as good as the metadata attached to each memory item. If a document chunk is tagged with incorrect section labels or missing entity references, it may be excluded when it shouldn't be, or included when it's irrelevant.

What's Next

Confidence-driven re-routing. If a module reports low confidence in its output, automatically trigger a second pass with an expanded context payload — pull in supplementary memory items that were initially excluded. This closes the loop between output quality and context allocation.
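A minimal sketch of that loop, with an assumed confidence threshold and a stub module whose confidence rises with the amount of context supplied:

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff for triggering a second pass

def run_module(payload: list[str]) -> tuple[str, float]:
    """Stub module: confidence grows with the number of context items."""
    return ("output", 0.5 + 0.1 * len(payload))

def run_with_reroute(payload: list[str],
                     excluded: list[str]) -> tuple[str, float, list[str]]:
    """Re-run with previously excluded items if the first pass is low-confidence."""
    output, confidence = run_module(payload)
    if confidence < CONFIDENCE_THRESHOLD and excluded:
        expanded = payload + excluded  # pull excluded supplementary items back in
        output, confidence = run_module(expanded)
        return output, confidence, expanded
    return output, confidence, payload

out, conf, used = run_with_reroute(
    ["essential-1"], ["supplementary-1", "supplementary-2"]
)
```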

Learned routing policies. The static retrieval specifications are currently hand-crafted per task type. We're investigating whether an RL-trained routing policy — drawing from SEAL's learned self-edit generation — could discover more effective routing strategies by optimising for downstream task accuracy as the reward signal. The router would learn which memory combinations produce the best outputs for each task type.

Cross-session memory consolidation. As the persistent memory store grows across hundreds of sessions, we need strategies for consolidating and compressing old memory — identifying which historical items are still valuable and which can be archived. This parallels the human cognitive process of moving short-term memory into long-term storage, selectively retaining what matters.

Conclusion

Context windows are a constraint, not a feature. The research is clear: even when your input fits in context, putting everything in context degrades performance. Bigger windows don't solve the problem. Smarter memory management does.

By treating all accumulated state — conversations, documents, module outputs, learned fixes — as external memory governed by an intelligent routing layer, we give every LLM call exactly the context it needs and nothing more. The model reasons over signal, not noise.

The system that remembers everything but only recalls what matters — that's the system worth building.
