Insurance regulators don't accept "the model said so" as a control. When we migrated a Tier-1 P&C carrier's claims-intake queue from a Drools rules engine to a multi-agent FNOL pipeline, the audit team's question was never "is it accurate?" It was always "can you show me why this specific claim routed where it did?"
The constraint that drove the architecture
Annual claims volume: $2.4B. Eleven lines of business. Manual triage stood at 47% of claims, and the median time between first notice of loss (FNOL) and adjuster assignment was 6h 12m. The existing routing logic was a Drools engine deployed in 2014 that had ossified and was routinely overridden by hand. The compliance team's condition for sign-off was simple: every routing decision had to be explainable to a state-level auditor, in writing, with the input that produced it.
The topology we shipped
    FNOL ─▶ intent.classify ─▶ policy.lookup ─▶ guardrail.eval ─┐
                                                                ▼
                                                         routing.decide
                                                                │
                                              ┌─────────────────┤
                                              ▼                 ▼
                                         fast-track       human.review
                                                                │
                                                                ▼
                                                           (full trace
                                                            as input)

- intent.classify — narrow-scope NLU agent. Classifies into one of 14 known FNOL types. Refuses on out-of-distribution input.
- policy.lookup — retrieval agent connected to the policy admin system. Returns the policy version active at date-of-loss, not at present.
- guardrail.eval — deterministic policy check. No LLM. Enforces 23 state-specific rules and the carrier's internal control library.
- routing.decide — executor agent constrained to 6 routing classes. Cannot invent new classes.
- human.review — only for cases that fail any guardrail or fall outside the routing classes. Receives the full prior trace as input.
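The constraint on routing.decide and the escalation rule for human.review can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the carrier's code: the post names only the fast-track path, so `STANDARD` is a placeholder class, and `decide_route`, `proposed`, and `guardrail_failures` are hypothetical names.

```python
from enum import Enum


class RoutingClass(Enum):
    """Closed set of routing classes (the post says six; names here are assumed)."""
    FAST_TRACK = "fast-track"
    STANDARD = "standard"  # placeholder, not named in the post


HUMAN_REVIEW = "human.review"


def decide_route(proposed: str, guardrail_failures: list[str], trace: list[dict]) -> str:
    """Return an allowed routing class value, or escalate to human review.

    Escalates when any guardrail failed, or when the proposed class is not
    a member of the closed RoutingClass set.
    """
    allowed = {rc.value for rc in RoutingClass}
    if guardrail_failures:
        decision, reason = HUMAN_REVIEW, "guardrail_failed"
    elif proposed not in allowed:
        decision, reason = HUMAN_REVIEW, "unknown_routing_class"
    else:
        decision, reason = proposed, "ok"
    # Append the decision to the running trace so a reviewer sees every prior step.
    trace.append({
        "node": "routing.decide",
        "proposed": proposed,
        "guardrail_failures": list(guardrail_failures),
        "decision": decision,
        "reason": reason,
    })
    return decision
```

Because the allowed set is derived from the enum, an agent that proposes a class outside it cannot create a new route; the claim falls through to human review with the reason recorded.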
What the audit team wanted
Four artifacts. We were surprised how aligned this was with what we'd already need for an on-call rotation — the audit team and the SRE team want the same thing, just for different reasons.
- Replayable traces. We used JSON Lines, one file per claim, written to the carrier's existing log warehouse.
- Versioned policy library. Every guardrail check carries the policy version it evaluated against — so a 2025 audit on a 2024 claim uses the 2024 policies.
- "Why did this happen" lookup. Takes claim_id, returns the full ordered trace with timestamps, inputs, and outputs.
- Tool-scope manifest per agent. A printable doc that says "this agent can call these tools and cannot call these others." The audit team loved this — it gave them something to point at when asked about agent capability boundaries.
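The first three artifacts fit together naturally: one JSON Lines file per claim, a policy version stamped on guardrail steps, and a lookup keyed by claim_id. The sketch below assumes a local directory in place of the log warehouse; `append_step`, `why`, and the record field names (`seq`, `ts`, `policy_version`) are hypothetical, not the schema the team shipped.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


def _line_count(path: Path) -> int:
    """Number of records already in the trace file (0 if it does not exist)."""
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def append_step(trace_dir: Path, claim_id: str, node: str,
                inputs: Any, output: Any,
                policy_version: Optional[str] = None) -> None:
    """Append one pipeline step as a single JSON line to the claim's trace file."""
    path = trace_dir / f"{claim_id}.jsonl"
    record = {
        "seq": _line_count(path),  # explicit order, independent of clock skew
        "ts": datetime.now(timezone.utc).isoformat(),
        "claim_id": claim_id,
        "node": node,
        "inputs": inputs,
        "output": output,
    }
    if policy_version is not None:
        # Recorded so a later audit evaluates against the same policy version.
        record["policy_version"] = policy_version
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def why(trace_dir: Path, claim_id: str) -> list[dict]:
    """Return the full ordered trace for a claim, or an empty list if none exists."""
    path = trace_dir / f"{claim_id}.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        steps = [json.loads(line) for line in fh if line.strip()]
    return sorted(steps, key=lambda step: step["seq"])
```

Deriving `seq` from the current line count is adequate for a single writer per claim; concurrent writers would need an ordering guarantee from the storage layer instead.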
The numbers
Eight weeks after cutover: manual triage dropped from 47% to 8%. Median FNOL-to-assignment dropped from 6h 12m to 9m. Policy-check violations per 1,000 claims dropped from 23 to 0 (a hard threshold the carrier hadn't hit since 2019). The full case-study breakdown lives on the work page.
One thing we'd do differently
We built the policy library as code first. The guardrail.eval node was a Python package with a thousand-line policy module that engineers maintained. Six months in, the compliance team asked if they could author policies themselves. They were the ones who actually understood the regulations, but every change had to be routed through us, which was pure overhead for them.
We're now refactoring to a small DSL that compliance can read and write directly. Lesson we'll bring forward: figure out who actually authors the policy before you decide what shape the policy library should take. In hindsight it's obvious. In the moment, with a deadline, we picked the shape that was easiest for us, not the shape that would be easiest for the people who'd own it.
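The post does not describe the DSL itself. As a rough illustration of the underlying idea (rules kept as reviewable data and evaluated by a small, fixed interpreter), here is a sketch in which every rule id, field name, threshold, and state code is made up for the example.

```python
import operator

# Whitelisted comparison operators; any other operator name is rejected.
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "in": lambda actual, allowed: actual in allowed,
}

REQUIRED_KEYS = {"id", "version", "field", "op", "value"}

# Hypothetical rules: a claim passes a rule when the comparison holds.
RULES = [
    {"id": "EX-001", "version": "2024.1", "field": "claim_amount", "op": "le", "value": 10000},
    {"id": "EX-002", "version": "2024.1", "field": "state", "op": "in", "value": ["TX", "OH"]},
]


def validate(rules: list[dict]) -> None:
    """Raise ValueError if a rule is missing a key or uses an unknown operator."""
    for rule in rules:
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            raise ValueError(f"rule {rule.get('id', '?')} is missing {sorted(missing)}")
        if rule["op"] not in OPS:
            raise ValueError(f"rule {rule['id']} uses unknown operator {rule['op']!r}")


def evaluate(rules: list[dict], claim: dict) -> list[str]:
    """Return ids of the rules the claim fails; a missing field counts as a failure."""
    failures = []
    for rule in rules:
        field = rule["field"]
        if field not in claim or not OPS[rule["op"]](claim[field], rule["value"]):
            failures.append(rule["id"])
    return failures
```

The point of the shape is ownership: the rule list can live in a file that compliance edits and reviews, while engineers maintain only the validator and the evaluator.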