
7 Feb 2026
Self-Healing AI: How We Built a System That Fixes Itself
Applying reinforcement-learned self-adaptation to build production AI agents that autonomously diagnose, patch, and learn from runtime failures — without redeployment.
When you're operating AI systems in production — handling financial document parsing, multi-module DCF valuations, SEC filing extraction — failure isn't a question of if, but when. API rate limits get hit. Document schemas drift. Edge cases in revenue recognition logic surface at 2 AM when nobody's watching.
The standard playbook is reactive: error fires, alert triggers, engineer debugs, fix gets deployed. That cycle works at small scale. It doesn't work when you're building autonomous agents designed to run twenty-four seven across hundreds of financial documents.
We built a self-healing runtime layer that allows our system to autonomously diagnose errors, generate candidate fixes, validate them, and persist successful solutions — all without triggering a redeployment. Here's how it works under the hood.
The Theoretical Foundation: MIT's SEAL Framework
Our approach is grounded in recent work from MIT on Self-Adapting Language Models (SEAL), published at NeurIPS 2025 by Zweiger et al. The core contribution of SEAL is a reinforcement learning framework where an LLM learns to generate its own finetuning data — called "self-edits" — and applies them via supervised finetuning (SFT) with LoRA adapters to persistently update its weights.
SEAL operates as a nested optimization loop:
Outer loop (RL): The model generates candidate self-edits, applies them via parameter updates, evaluates downstream task performance, and uses the result as a binary reward signal to reinforce effective self-edit generation.
Inner loop (SFT): Each self-edit is applied as a lightweight LoRA finetuning step: θ′ ← SFT(θ, SE), where SE is the generated self-edit data.
The reward function is deliberately simple:
r(SE, τ, θ) = 1 if adaptation using SE improves LM_θ's performance on task τ
              0 otherwise

SEAL's key finding: a 7B-parameter model (Qwen2.5-7B) generating its own synthetic training data via RL-optimized self-edits outperformed synthetic data generated by GPT-4.1 on knowledge incorporation benchmarks (47.0% vs. 46.3% on no-context SQuAD). The model literally teaches itself better than a larger model can teach it.
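The nested loop can be sketched as follows. This is purely illustrative: `StubModel` and its methods are placeholders standing in for a real LM, SFT step, and evaluation harness, not the paper's actual API (SEAL's outer loop uses ReST-EM-style filtered behavior cloning).

```python
import random

class StubModel:
    """Toy stand-in for an LM so the loop runs end-to-end; 'skill' is a
    scalar proxy for task performance."""
    def __init__(self, skill=0.5):
        self.skill = skill

    def generate_self_edit(self, context):
        # SE ~ LM_theta(. | C): propose finetuning data from the context
        return f"self-edit for {context}"

    def apply_sft(self, self_edit):
        # Inner loop: theta' <- SFT(theta, SE), here a noisy skill bump
        return StubModel(skill=self.skill + random.uniform(-0.1, 0.2))

    def evaluate(self, task):
        return self.skill

    def finetune_on(self, reinforced):
        # Outer loop update: reinforce self-edits that earned reward 1
        self.skill += 0.05 * len(reinforced)

def seal_outer_loop(model, tasks, n_iterations=2, samples_per_task=2):
    for _ in range(n_iterations):
        reinforced = []
        for context, task in tasks:
            for _ in range(samples_per_task):
                se = model.generate_self_edit(context)
                adapted = model.apply_sft(se)
                # Binary reward: did adaptation improve task performance?
                r = 1 if adapted.evaluate(task) > model.evaluate(task) else 0
                if r == 1:
                    reinforced.append((context, se))
        model.finetune_on(reinforced)
    return model

trained = seal_outer_loop(StubModel(), [("passage about X", "qa on X")])
```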
We took this principle — self-generated adaptation data, validated by downstream performance — and applied it to a fundamentally different domain: runtime error recovery in production systems.
System Architecture
Our self-healing pipeline consists of five components operating as a closed-loop control system.
1. Error Capture and Context Serialization
When an exception propagates to our error boundary, we don't just log a stack trace. We serialize the full execution context into a structured payload:
{
"error_id": "uuid-v4",
"timestamp": "ISO-8601",
"error_type": "ExtractionError",
"error_message": "Failed to parse revenue line items from 10-K filing",
"stack_trace": "...",
"module": "revenue_extraction",
"input_context": {
"document_id": "sec-filing-xxxx",
"document_type": "10-K",
"extraction_stage": "revenue_breakdown",
"partial_output": { ... }
},
"system_state": {
"api_calls_remaining": 42,
"memory_usage_mb": 3200,
"active_modules": ["revenue", "cogs", "working_capital"]
},
"prior_fix_attempts": []
}

This payload gets written to our database (hosted on Supabase) immediately. The richness of the context matters — the more the diagnosing LLM knows about what was happening when the error occurred, the better the fix it can generate.
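A minimal sketch of the capture step, using only the standard library. `ExtractionError` and `build_error_payload` are illustrative names; the field names match the payload schema above.

```python
import json
import traceback
import uuid
from datetime import datetime, timezone

class ExtractionError(Exception):
    """Hypothetical domain exception, for illustration only."""

def build_error_payload(exc, module, input_context, system_state):
    """Serialize the full execution context at the error boundary."""
    return {
        "error_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "module": module,
        "input_context": input_context,
        "system_state": system_state,
        "prior_fix_attempts": [],
    }

# Usage: capture at the error boundary, then persist immediately.
try:
    raise ExtractionError("Failed to parse revenue line items from 10-K filing")
except ExtractionError as e:
    payload = build_error_payload(
        e,
        module="revenue_extraction",
        input_context={"document_type": "10-K",
                       "extraction_stage": "revenue_breakdown"},
        system_state={"api_calls_remaining": 42},
    )
    record = json.dumps(payload)  # ready to write to the database
```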
2. Diagnosis and Fix Generation via LLM
The error payload is passed to Claude's API as a structured prompt. The model receives the full error context along with any relevant prior fixes from our solution database (retrieved via similarity search over error type and module metadata).
The prompt follows a diagnostic chain:
Root cause analysis — What caused this error given the execution context?
Fix proposal — Generate a concrete, executable remediation strategy.
Risk assessment — What could go wrong if this fix is applied?
The model returns a structured fix object:
{
"diagnosis": "Revenue line item parser expected tabular format but received narrative disclosure. The 10-K filing uses an atypical inline format for segment revenue.",
"fix_type": "runtime_patch",
"fix_instruction": "Switch to fallback regex-based extraction for narrative-style revenue disclosures when primary table parser returns null.",
"affected_module": "revenue_extraction",
"confidence": 0.82,
"risk_level": "low"
}

This is analogous to SEAL's self-edit generation step — the model produces adaptation data (a fix instruction) conditioned on input context (the error payload).
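Before a fix object like the one above is allowed anywhere near the runtime, it should pass a schema gate. A sketch of that gate (the field names come from the object above; the thresholds and the function itself are illustrative assumptions):

```python
import json

REQUIRED_FIELDS = {"diagnosis", "fix_type", "fix_instruction",
                   "affected_module", "confidence", "risk_level"}
ALLOWED_RISK = {"low", "medium", "high"}

def parse_fix_response(raw: str, min_confidence: float = 0.7):
    """Parse and gate the LLM's fix object. Returns the fix dict, or
    None if it is malformed or below the confidence threshold (which
    routes the error to human escalation instead of auto-patching)."""
    try:
        fix = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(fix):
        return None
    if fix["risk_level"] not in ALLOWED_RISK:
        return None
    if fix["confidence"] < min_confidence:
        return None
    return fix

sample = json.dumps({
    "diagnosis": "Parser expected tabular format, got narrative disclosure.",
    "fix_type": "runtime_patch",
    "fix_instruction": "Fall back to regex-based extraction.",
    "affected_module": "revenue_extraction",
    "confidence": 0.82,
    "risk_level": "low",
})
fix = parse_fix_response(sample)  # passes the gate
```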
3. Execution and Reward Evaluation
The generated fix gets applied at runtime. This is the critical design decision: we do not redeploy the application. Our platform runs on AWS Elastic Beanstalk, and frequent redeployments would introduce instability — cold starts, connection drops, potential cascading failures across dependent modules.
Instead, fixes are stored in the database and loaded dynamically at runtime. When a module executes, it first queries the solution database for any active patches matching its current context. This is essentially a RAG (Retrieval-Augmented Generation) pattern applied to error recovery rather than knowledge retrieval.
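The lookup pattern can be sketched as follows. The in-memory dict stands in for the real solution table, and the keying by (module, document type) plus the 0.8 score threshold are illustrative assumptions:

```python
# Active runtime patches, keyed by execution context. In production this
# would be a database query, not a module-level dict.
ACTIVE_PATCHES = {
    ("revenue_extraction", "10-K"): {
        "fix_instruction": "Fall back to regex-based extraction for "
                           "narrative-style revenue disclosures.",
        "reward_score": 0.875,
    },
}

def load_patch(module: str, document_type: str, min_score: float = 0.8):
    """Return a proven patch for this context, or None. Modules call
    this before executing, so fixes apply without redeployment."""
    patch = ACTIVE_PATCHES.get((module, document_type))
    if patch and patch["reward_score"] >= min_score:
        return patch
    return None
```

The retrieval-before-execution shape is what makes this a RAG-style pattern: the "knowledge" being retrieved is remediation behavior rather than reference text.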
The reward signal mirrors SEAL's binary formulation:
r(fix, error_context) = 1 if the operation completes successfully after fix application
                        0 if the error persists or a new error is introduced

We track three validation criteria:
Primary resolution: Does the original error disappear?
No regression: Does the fix introduce any new exceptions within the same execution pipeline?
Output validity: Does the module produce structurally valid output (schema validation, not semantic correctness)?
A fix must pass all three to receive a reward of 1. This is stricter than SEAL's original formulation but necessary for production safety.
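The all-three requirement reduces to a conjunction. A one-function sketch (the argument names are ours, mapping one-to-one onto the criteria above):

```python
def fix_reward(original_error_gone: bool,
               no_new_exceptions: bool,
               output_schema_valid: bool) -> int:
    """Binary reward for an applied fix: 1 only if the original error is
    resolved, no regression appears in the pipeline, and the output
    passes schema validation. Stricter than SEAL's single-criterion
    formulation, by design."""
    return int(original_error_gone and no_new_exceptions and output_schema_valid)

# A fix that resolves the error but breaks a downstream module earns 0.
assert fix_reward(True, False, True) == 0
```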
4. Solution Persistence and Retrieval
Successful fixes are persisted as error-solution pairs in our database with full metadata:
{
"error_pattern": "ExtractionError::revenue_extraction::table_parser_null",
"fix_instruction": "...",
"success_count": 14,
"failure_count": 2,
"first_seen": "2025-02-01T08:30:00Z",
"last_applied": "2025-02-10T14:22:00Z",
"contexts_applied": ["10-K", "10-Q"],
"reward_score": 0.875
}

Over time, this database becomes a learned knowledge base — similar to SEAL's weight updates, but externalized as retrievable data rather than embedded in model parameters. The advantage is clear: no finetuning required, no model versioning headaches, and the knowledge is immediately available to all concurrent sessions.
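One natural way to maintain `reward_score` is as the empirical success rate of the stored fix. With the counts in the record above, 14 successes out of 16 applications gives 14 / 16 = 0.875. A sketch of that update (the helper name is ours):

```python
def update_solution(record: dict, reward: int) -> dict:
    """Fold one binary reward into a stored error-solution record and
    recompute its score as the empirical success rate."""
    if reward == 1:
        record["success_count"] += 1
    else:
        record["failure_count"] += 1
    total = record["success_count"] + record["failure_count"]
    record["reward_score"] = record["success_count"] / total
    return record

record = {"success_count": 13, "failure_count": 2}
record = update_solution(record, reward=1)  # 14 / 16 = 0.875
```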
When a new error occurs, the system performs a lookup against this database using error type, module, and contextual similarity. If a high-confidence match exists (reward_score > threshold), the stored fix is applied directly. If no match exists, the full LLM diagnosis loop is triggered.
This creates a tiered recovery strategy:
Tier 1 — Cached fix: Known error pattern, proven solution. Applied instantly. Milliseconds.
Tier 2 — LLM diagnosis: Unknown error. Full context sent to Claude API. Fix generated, validated, and persisted. Seconds.
Tier 3 — Human escalation: Fix generation fails or confidence is below threshold. Error flagged for manual review.
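The three tiers reduce to a small dispatcher. A sketch, with `lookup_cached_fix` and `llm_diagnose` as placeholders for the database query and the Claude API call respectively, and the shared 0.8 threshold as an illustrative assumption:

```python
def recover(error_payload, lookup_cached_fix, llm_diagnose, threshold=0.8):
    """Tiered recovery: cached fix, then LLM diagnosis, then human."""
    # Tier 1: known error pattern with a proven solution (milliseconds)
    cached = lookup_cached_fix(error_payload)
    if cached and cached["reward_score"] > threshold:
        return ("tier1_cached", cached)
    # Tier 2: unknown error, full LLM diagnosis loop (seconds)
    fix = llm_diagnose(error_payload)
    if fix and fix["confidence"] >= threshold:
        return ("tier2_llm", fix)
    # Tier 3: no confident fix; flag for manual review
    return ("tier3_human", None)

tier, fix = recover({}, lambda p: {"reward_score": 0.9}, lambda p: None)
```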
5. Human-in-the-Loop Review and Promotion
On a daily or weekly cadence, our engineering team reviews the accumulated error-fix logs. The review process categorizes each error into one of three buckets:
Transient noise: One-off errors that occurred once across thousands of operations. No action needed. The runtime patch handled it.
Recurring pattern: The same error type appears repeatedly across different documents or sessions. The LLM's fix works, but the frequency suggests an underlying issue worth addressing in code.
Architectural deficiency: The error points to a fundamental design flaw — a module assumption that doesn't hold, a missing integration path, a data model limitation. This gets promoted to a proper engineering ticket and permanent code fix.
This is where SEAL's limitation around catastrophic forgetting becomes relevant. In the original paper, sequential self-edits degraded performance on prior tasks. In our system, we sidestep this entirely because fixes are externalized in the database, not baked into model weights. There's no parameter interference. Each fix is independent and retrievable.
However, the trade-off is that our system doesn't truly "learn" in the weight-update sense. The LLM itself doesn't become a better debugger over time through parameter changes. Instead, the system becomes more robust through its growing database of solutions. The intelligence is distributed: the LLM provides reasoning capability, the database provides memory.
Connecting to SEAL's Reward Loop
The parallel to SEAL is direct:
| SEAL Component | Our Implementation |
| --- | --- |
| Context (C) | Error payload with full execution context |
| Self-Edit (SE) | Generated fix instruction |
| Inner loop SFT | Runtime application of the fix (no weight update — database patch) |
| Task evaluation (τ) | Does the operation complete successfully? |
| Reward signal (r) | Binary: error resolved without regression = 1, otherwise 0 |
| Outer loop RL | Accumulated error-fix pairs improve future retrieval and diagnosis |
The key divergence: SEAL updates model weights. We update an external knowledge base. Both achieve the same functional outcome — the system adapts to new failure modes — but our approach avoids the redeployment and catastrophic forgetting challenges that weight-based adaptation introduces in production.
Trade-Offs and Limitations
Latency cost. Tier 2 recovery (LLM diagnosis) adds seconds of latency to error recovery. For real-time systems, this may be unacceptable. For our use case — batch financial document processing — it's negligible.
Semantic correctness is not guaranteed. Our reward signal validates that the operation completes and produces structurally valid output. It does not validate that the output is semantically correct. A fix that silently produces wrong numbers is worse than a fix that fails loudly. This is why the human review loop is non-negotiable, not optional.
Database growth. As the solution database scales, retrieval quality becomes critical. Naive exact-match lookups won't generalize. We're exploring embedding-based similarity search over error contexts to improve fix retrieval for novel-but-related error patterns.
Single-model dependency. If the LLM API itself goes down, the self-healing loop breaks. We mitigate this with cached fixes (Tier 1) and fallback to human escalation (Tier 3), but the system's diagnostic capability is fundamentally dependent on LLM availability.
What's Next
We're exploring three extensions:
Predictive resilience. Using the accumulated error-fix database to identify failure-prone execution paths before errors occur. If a particular document type or filing format correlates with high error rates, pre-load the relevant fixes before processing begins.
Fix composition. Currently, each fix addresses a single error. We're investigating whether the LLM can compose multi-step fixes for cascading failures — where one error triggers downstream errors that require coordinated recovery.
Reward signal enrichment. Moving beyond binary success/failure toward a richer reward that incorporates fix durability (does it hold over days?), generalizability (does it work across document types?), and efficiency (does it add minimal latency?). This maps directly to SEAL's potential for reward shaping to improve adaptation quality.
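One possible shape for that enriched reward — purely speculative, with made-up weights, nothing here is deployed — hard-gates on resolution and then blends the three quality signals:

```python
def enriched_reward(resolved: bool, durability: float,
                    generalizability: float, efficiency: float) -> float:
    """Scalar reward in [0, 1]. A fix that does not resolve the error
    earns 0 regardless of other qualities; a resolving fix starts at
    0.5 and earns the rest from durability (holds over days),
    generalizability (works across document types), and efficiency
    (minimal added latency), each pre-normalized to [0, 1]."""
    if not resolved:
        return 0.0
    quality = 0.4 * durability + 0.4 * generalizability + 0.2 * efficiency
    return 0.5 + 0.5 * quality
```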
Conclusion
Static AI systems are a liability in production. They work until they don't, and when they don't, they need a human to come save them. Self-healing — grounded in the theoretical framework of self-adapting language models — offers a path toward AI systems that maintain themselves.
We're not claiming full autonomy. The human review loop is a feature, not a limitation. But the gap between "system breaks, everything stops" and "system breaks, system recovers, human reviews later" is the difference between a prototype and a production-grade platform.
The system that runs reliably at 3 AM without anyone watching — that's the system worth building.
