How I Built an LLM-Powered Insurance Support Classifier That Saved $200K Annually

From 3 Human Support agents manually triaging 16,000 emails/month to an $8/month AI pipeline

At Layr, we sell commercial insurance to small businesses. That means every day, hundreds of emails land in our support queue — billing questions, policy changes, mid-term endorsements, and one category that absolutely dominates the inbox: COI requests.

COI = Certificate of Insurance. It’s the document a landlord, vendor, or general contractor demands before they let your business set foot on their property. For a small business, it’s urgent. For a support team, it’s repetitive.

When I dug into our ticket volume, I found that COI-related emails accounted for nearly 40% of all inbound support traffic — and our team was hand-routing every single one. No auto-classification. No queue routing. Just a person reading emails and clicking buttons.

This is the story of how we changed that.

The Problem Worth Solving

Before touching any code, I needed to understand what “COI request” actually means in the wild. It’s not as simple as if email.contains("COI").

Real COI emails look like:

“Our new landlord at 450 Park Ave needs to be added as additional insured before we move in next Friday.”
“Forwarding a request from our general contractor — they need a certificate showing waiver of subrogation.”
“The certificate holder address on file is wrong. Can you update it?”
“Hi, can you re-send our COI? We misplaced it.”

And then there are the traps — emails that mention COI but aren’t COI requests:

“I paid my bill. Also my COI on file has our old address, will fix that later.” → This is a billing issue. The COI mention is deferred noise.

Keyword matching fails hard here. You need intent detection, not grep.

Why Claude Haiku We evaluated a few paths:

Rule-based / regex: Fast, but brittle. We’d be patching it forever as email language evolved. Fine-tuned classifier: High upfront cost, training data requirements, maintenance overhead. GPT-4o: Good quality, higher cost, no caching support that fit our pattern. Claude Haiku + structured output: Low latency (~400ms p95), cheap, excellent instruction-following, native prompt caching. At ~16,000 emails/month, cost mattered. And so did reliability — structured output with JSON schema enforcement meant no parsing failures, no hallucinated categories.

Haiku won.

Architecture

The pipeline is simple by design:

The Prompt Architecture This is where the real work happened.

System/User Split We split the prompt into two layers:

System prompt (static, cacheable): Role, definitions, redaction handling rules, classification logic, few-shot examples. ~1,050 tokens. This never changes between calls.

User turn (dynamic): Just the email subject and body, wrapped in XML tags.

Need updated COI for new lease - [redacted:address] Hi, our landlord at [redacted:address] requires we add them as additional insured. Policy number is [redacted:policy_number]. Can you send this by [redacted:date]? Thanks 

Why XML tags in the user turn? Prompt injection defense. If an email body contains “Ignore previous instructions,” the XML boundary makes it much harder for it to bleed into system context.

The Redaction Problem Our email pipeline scrubs PII before anything downstream touches it. So Haiku sees [redacted:phone] instead of an actual phone number.

This created a subtle problem: in testing, Haiku would occasionally lower its confidence because redaction tokens “looked like missing data.” We had to explicitly handle this in the system prompt:

Treat these tokens as valid, meaningful placeholders — they represent real data that existed in the original email. Do NOT treat them as missing information or reasons to lower confidence. A redacted phone number is still a phone number. Ignore redaction tokens when forming your decision unless the TYPE itself carries classification signal (e.g., [redacted:policy_number] confirms the sender is a policyholder). That last sentence matters: [redacted:policy_number] in a COI context is actually useful signal — it tells you this is a policyholder making an authenticated request.

Intent-Based Decision Framework The core classification logic is intent-driven, not keyword-driven:

COI if:

Explicit COI request (new issuance, re-send, download)
COI modification (holder update, address change, expiry extension)
Coverage verification (third party confirming active coverage)
Endorsement adds that trigger COI (additional insured, wavier of subrogation)
Third-party requests on behalf of a policyholder Decision rule for ambiguous cases:
For reply chains/forwards, weight the most recent message heaviest
When mixed-intent, classify by the PRIMARY action requested
"I paid my bill, also when does my COI expire?" → N/A (billing is primary) Few-Shot Examples We included exactly two examples — both adversarially chosen to represent the hard cases, not the easy ones:

Email: "Our new landlord at 450 Park Ave needs a certificate showing them as additional insured before we move in next Friday." Output: {"ai_category": "COI", "ai_confidence": 98, "ai_reasoning": "Explicit COI request with additional insured endorsement for landlord"} Example 2 (N/A — COI mentioned but not primary):

Email: "Following up on my payment from last week — the autopay didn't go through. Can someone check? Also my COI on file has our old address, will fix that later." Output: {"ai_category": "N/A", "ai_confidence": 88, "ai_reasoning": "Primary intent is billing issue; COI mention is deferred aside"} We deliberately didn’t include “easy” examples (obvious COI request = COI). Haiku doesn’t need help with those. The examples should teach it the hard edges.

Structured Output Implementation We use Anthropic’s native JSON schema enforcement via output_config. No prompt-based JSON instructions, no regex parsing on the output:

SCHEMA = { "type": "object", "properties": { "ai_category": { "type": "string", "enum": ["COI", "N/A"] }, "ai_confidence": { "type": "integer", "minimum": 0, "maximum": 100 }, "ai_reasoning": { "type": "string", "description": "12-15 words max, cite the signal driving the decision" } }, "required": ["ai_category", "ai_confidence", "ai_reasoning"], "additionalProperties": False } response = client.messages.create( model="claude-haiku-4-5", max_tokens=200, temperature=0, system=[ { "type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"} } ], messages=[ {"role": "user", "content": f"\n{email_text}\n"} ], output_config={ "format": { "type": "json_schema", "schema": SCHEMA } } )

Key decisions:

temperature=0 — classification should be deterministic max_tokens=200 — output is tiny; cap it to avoid runaway costs additionalProperties: False — strict schema enforcement Prompt Caching This is where the economics flip completely.

Our traffic is bursty — most emails arrive during business hours in clusters. The system prompt (~1,050 tokens) is static across every call. With Anthropic’s prompt caching at a 1-hour TTL (cache_control: {"type": "ephemeral"}), the static system prompt gets read from cache on ~90% of calls.

Cache hits cost 10× less on input tokens. At 16,000 emails/month:

Without Cache With Cache (90% hit rate) Monthly cost ~$80 ~$8 Per email ~$0.005 ~$0.0005

We monitor cache performance on every response via usage.cache_read_input_tokens. If that number drops significantly, something is wrong — likely a deploy that changed the system prompt hash.

One critical detail: Anthropic’s cache requires a minimum of 1,024 tokens in the cacheable block to activate. Our system prompt is ~1,050 tokens. We deliberately don’t trim it below the threshold, even if there are minor optimizations available.

Confidence Bands and Routing Logic We use three confidence bands for downstream routing:

Band Range Action High 90–100 Auto-route, no human review Medium 70–89 Route + flag for spot check Low 0–69 Route to general queue, manual review

Low-confidence classifications don’t fail — they degrade gracefully into manual review. The system never makes a hard wrong call; at worst it says “I’m not sure” and routes conservatively.

The Numbers Before this system, COI routing required ~3 hours/day of support agent time (across the team) just for triage. At fully-loaded support agent cost:

Annual triage cost: ~$200K Classifier cost: ~$100/year Build + integration time: ~3 weeks The ROI conversation was short.

What surprised us wasn’t the cost savings — it was the consistency. Human routing has variance: agents have bad days, edge cases get mis-categorized, new hires need weeks to learn the nuances. The classifier is the same at 9am Monday and 5pm Friday. It never misremembers that “additional insured” is a COI trigger.

What We’d Do Differently Test adversarially from day one. We built our test set from “easy” examples initially and were overconfident about accuracy. The hard cases — mixed-intent emails, reply chains, third-party forwards — are where it matters most and where you need explicit examples.

Log everything, especially cache metrics. usage.cache_read_input_tokens is the single most important operational metric for this system. A sudden drop tells you the system prompt changed (even accidentally) and you're burning uncached tokens.

Don’t fine-tune prematurely. We had early discussions about fine-tuning Haiku on our historical tickets. Unnecessary. The classification task is well-defined enough that a strong system prompt with good examples outperforms a weakly fine-tuned model, with zero maintenance overhead.

What’s Next This is the first production AI component we’ve shipped at Layr. COI classification is a narrow, well-bounded problem — a good first target for exactly that reason.

Next on the roadmap: extending the classifier to a multi-class router across more support categories (billing, mid-term changes, cancellations, claims). Same architecture, wider taxonomy, more few-shot examples per class.

The underlying principle holds: when you have a classification problem with a stable taxonomy and high volume, a well-prompted LLM with structured output and caching is almost always the right call over a bespoke model or a rules engine.

Start simple. Ship it. Measure it. Expand.

If you’re building AI systems in insurtech and want to compare notes, I’m on LinkedIn. Always happy to talk production LLM architecture with people working in the same domain.

How I Built an LLM-Powered Insurance Support Classifier That Saved $200K Annually

The Problem Worth Solving

Why Claude Haiku We evaluated a few paths:

Architecture

Comments

More from this blog

Stop Reinventing AI Guardrails: Build Reusable LLM Text Safety with the Builder Pattern

Is GPT Really That Scary? Let’s Break It Down with Attack on Titan

Command Palette

The Problem Worth Solving

Why Claude Haiku We evaluated a few paths:

Architecture

Comments

More from this blog