InsureBench

Insurance AI benchmarks for model + harness performance.

Not a general model-power contest: InsureBench is about finding which model + harness can do the insurance task you need, at the reliability and cost you can deploy.

Score vs. cost

Each point is a model + harness combination, scored on insurance tasks and priced per completed task. The useful question is not which model is strongest overall, but which combination clears the bar for a specific workflow. Up and left is better; the dashed line is the efficient frontier.
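
One way to read the efficient frontier is as a Pareto filter: a combination is on the frontier if no other combination is both at least as cheap and at least as accurate. The sketch below is only an illustration of that idea. The `Run` record, the combination names, and the numbers are hypothetical and are not InsureBench data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    name: str             # hypothetical model + harness label
    score: float          # fraction of tasks scored correct, 0..1
    cost_per_task: float  # USD per completed task

def efficient_frontier(runs):
    """Return runs not dominated on (lower cost, higher score), cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; on a cost tie, put the higher score first.
    for run in sorted(runs, key=lambda r: (r.cost_per_task, -r.score)):
        # A run joins the frontier only if it beats every cheaper run on score.
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier

# Made-up points for illustration only.
runs = [
    Run("small+reader", 0.71, 0.004),
    Run("small+plain", 0.55, 0.003),
    Run("mid+reader", 0.70, 0.020),
    Run("large+agentic", 0.88, 0.450),
]
frontier_names = [r.name for r in efficient_frontier(runs)]
```

In this made-up data, `mid+reader` falls off the frontier because `small+reader` is both cheaper and slightly more accurate.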

Cost is per completed task, not token list price. Where direct metering is available we use it; otherwise we estimate from observed tokens and published pricing.
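
When direct metering is unavailable, the estimate could be computed along the following lines. This is a sketch under stated assumptions, not InsureBench's actual accounting: the prices are placeholders, hidden reasoning tokens are assumed to be billed at the output rate, and "per completed task" is assumed to mean total spend across all attempts divided by the number of tasks that completed.

```python
def attempt_cost(input_tokens, output_tokens, reasoning_tokens,
                 price_in_per_m, price_out_per_m):
    """Estimated USD for one attempt from observed token counts.

    Assumption: hidden reasoning tokens are billed at the output rate.
    Prices are USD per million tokens.
    """
    billed_out = output_tokens + reasoning_tokens
    return (input_tokens * price_in_per_m + billed_out * price_out_per_m) / 1_000_000

def cost_per_completed_task(attempt_costs, completed):
    """Total spend over all attempts divided by the number of completed tasks."""
    if completed <= 0:
        raise ValueError("no completed tasks; cost per completed task is undefined")
    return sum(attempt_costs) / completed

# Placeholder numbers: three attempts, two of which completed the task.
costs = [
    attempt_cost(12_000, 800, 3_000, price_in_per_m=1.0, price_out_per_m=4.0),
    attempt_cost(12_000, 900, 2_500, price_in_per_m=1.0, price_out_per_m=4.0),
    attempt_cost(12_000, 700, 6_000, price_in_per_m=1.0, price_out_per_m=4.0),
]
per_task = cost_per_completed_task(costs, completed=2)
```

Charging failed attempts to the completed tasks is a design choice in this sketch: it keeps a harness that retries often from looking cheaper than it is.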

Harnesses:

tooled_reader: the model plus a file-reading tool; it opens a task's attachments (loss runs, a manual) on request, then answers. v3 retries a malformed answer, so a formatting slip isn't scored as wrong.
claude_code: Claude in its agentic CLI.
codex_plain: the model in OpenAI's Codex CLI.
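
The retry behavior described for the v3 reader might look roughly like the sketch below. It is not the actual tooled_reader code: `ask_model` is a hypothetical callable that takes a prompt and returns text, and the JSON answer shape is an assumption made for the example.

```python
import json

def answer_with_retry(ask_model, prompt, max_retries=1):
    """Ask for a JSON answer; re-ask if the reply cannot be parsed.

    `ask_model` is a hypothetical callable (prompt -> str). The JSON shape
    {"answer": ...} is an assumption made for this sketch.
    """
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        reply = ask_model(attempt_prompt)
        try:
            parsed = json.loads(reply)
            if isinstance(parsed, dict) and "answer" in parsed:
                return parsed["answer"]
        except json.JSONDecodeError:
            pass
        # Malformed reply: repeat the question with a format reminder.
        attempt_prompt = prompt + '\nReply only with JSON like {"answer": "..."}.'
    return None  # still malformed after all retries

# Stub model that slips on formatting once, then complies.
replies = iter(['The answer is B', '{"answer": "B"}'])
result = answer_with_retry(lambda p: next(replies), "Which form applies?")
```

The retry only repairs formatting. A well-formed but wrong answer is returned as-is and would still be scored as wrong.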

Method

InsureBench evaluates AI systems on insurance work — claims reasoning, actuarial analysis, underwriting review, coverage questions, document reading, form extraction, and workflow judgment. The unit of measurement is not just the model. It is the model plus the harness around it: tools, prompts, file readers, retries, skills, routing, structured-output handling.

Most benchmarks ask which model is best. Useful — but not the deployment question most insurance teams face. An insurer deciding how to use AI in production needs more specific answers:

Can this model classify this risk, read this claim file, or check this form reliably enough?
What does it cost per completed task — including tool calls and hidden reasoning tokens?
How fast is it?
Can it run behind a firewall, or on an open-weights model?
Does a better harness let a cheaper model perform well enough?
When should the workflow escalate to a stronger model or a human reviewer?

InsureBench is built around those questions. It is a public benchmark and lab project maintained by InsureThing, and it continues to expand with more tasks, models, harnesses, and workflow types.

Model + harness

In production, no one deploys a raw model in isolation; the deployed system is a model plus a harness: file reading, retrieval, output repair, domain skills, retries, caching, calculators, escalation rules, workflow-specific instructions. Those details materially change score and cost. That is why InsureBench scores combinations: a smaller model with the right file reader or task-specific harness may outperform a larger model in a simpler setup on a specific workflow; on other tasks, the model's underlying capability is the limiting factor. Both results matter.

Cost per task, not cost per token

Token price is only part of the economics; what matters operationally is the cost of a completed task. A model may be inexpensive per token but expensive per answer if it uses many output tokens, hidden reasoning tokens, or repeated tool calls. InsureBench tracks cost at the task level, closer to how these systems are actually deployed. That does not make frontier models a bad choice — for high-value, low-volume, expert-facing work, the strongest model may be the right tool. A senior actuary running one difficult analysis in an agentic environment has different economics from an automated scan run thousands of times a month.
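
A small worked example may make the per-token versus per-answer distinction concrete. Both models and all numbers below are hypothetical, and the sketch ignores input tokens and tool calls to keep the arithmetic short.

```python
def per_answer_cost(billed_output_tokens, price_out_per_m):
    """USD per answer from billed output tokens (visible plus hidden reasoning)."""
    return billed_output_tokens * price_out_per_m / 1_000_000

# Hypothetical models: A is 10x cheaper per token but far more verbose.
cost_a = per_answer_cost(40_000, price_out_per_m=0.5)  # cheap tokens, long reasoning
cost_b = per_answer_cost(1_500, price_out_per_m=5.0)   # pricier tokens, short answer

# The same per-answer gap at two monthly volumes.
gap = cost_a - cost_b
monthly_gap_low_volume = gap * 10        # a handful of expert analyses
monthly_gap_high_volume = gap * 50_000   # an automated scan
```

Here the model that is cheaper per token costs more per answer, and the gap is negligible for a few expert analyses but adds up to hundreds of dollars a month at scan volume.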

Reliability is not benchmark saturation

In many benchmarks, a score above 85% suggests a task family is starting to saturate. In insurance operations, 85% success may be undeployable if the failures are silent, costly, or hard to detect. We care about where repeated processes can be made reliable — and what has to happen around the model to make them so: confidence calibration, error detection, escalation, retries, human review. The benchmark looks not only for tasks that separate frontier models, but for task families where many models succeed consistently — the areas where lower-cost deployment becomes realistic.
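
To see why 85% can be undeployable, it may help to count the errors that nobody notices. The sketch below is a back-of-envelope model, not an InsureBench metric: it assumes errors are independent and that surrounding controls catch a fixed fraction of them, and the volume and rates are invented.

```python
def undetected_errors(volume, accuracy, detection_rate):
    """Expected wrong answers that slip through unnoticed.

    Assumes errors are independent and that a fixed fraction of them
    (detection_rate) is caught by checks, calibration, or review.
    """
    return volume * (1 - accuracy) * (1 - detection_rate)

# Invented monthly volume and rates.
silent = undetected_errors(10_000, accuracy=0.85, detection_rate=0.0)
guarded = undetected_errors(10_000, accuracy=0.85, detection_rate=0.9)
```

With no detection, roughly 1,500 of 10,000 tasks fail silently; if controls catch 90% of errors, about 150 do. The model's accuracy is identical in both cases, which is why what happens around the model matters.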

How to use the results

Start with the task type, not the model. For a repeated business process, look for model-plus-harness combinations that are accurate enough for the control environment, fast enough for the SLA, cheap enough for the volume, controllable enough for your data and compliance needs, and reliable enough that failures can be detected and escalated. The highest-scoring model overall may be the right answer for some work — and unnecessary for other work.
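
The selection process above can be sketched as a simple filter over results for one task type. The field names, thresholds, and candidates are hypothetical, and real requirements such as compliance or failure detectability will not always reduce to a single number or flag.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str             # hypothetical model + harness label
    accuracy: float       # on the relevant task type, 0..1
    p95_latency_s: float  # seconds
    cost_per_task: float  # USD
    self_hostable: bool   # can run inside a controlled environment

def shortlist(candidates, min_accuracy, max_latency_s, max_cost, need_self_hosting):
    """Keep candidates that meet every requirement, cheapest first."""
    ok = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and c.p95_latency_s <= max_latency_s
        and c.cost_per_task <= max_cost
        and (c.self_hostable or not need_self_hosting)
    ]
    return sorted(ok, key=lambda c: c.cost_per_task)

# Made-up candidates for illustration only.
candidates = [
    Candidate("open-small+reader", 0.93, 4.0, 0.01, True),
    Candidate("frontier+agentic", 0.97, 45.0, 0.60, False),
    Candidate("mid+plain", 0.89, 3.0, 0.02, False),
    Candidate("open-mid+reader", 0.95, 6.0, 0.03, True),
]
picks = [c.name for c in shortlist(candidates, min_accuracy=0.92, max_latency_s=10,
                                   max_cost=0.05, need_self_hosting=True)]
```

In this example the highest-scoring candidate is excluded on latency, cost, and hosting, which is the sense in which it can be unnecessary for a given workflow.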

V1 snapshot

V1 is built around a 160-task comparable core, with runs generated over roughly a week. Questions are insurance-specific and SME-reviewed. We use objective keys where possible, rubric scoring for judgment tasks, and attachment-aware harnesses for packets, spreadsheets, images, and documents.

InsureBench is a performance estimate, not a procurement recommendation. Scores depend on task specifics, so results on your own tasks may differ. Procurement decisions may also reflect speed, existing relationships, deployment, compliance, data control, and other concerns.

Findings

The harness changes the result

File readers, retry policies, output repair, batching, and domain skills can move results enough to change deployment decisions. On some tasks, harness design is the difference between an unusable cheap model and an economical production candidate.

Model choice varies by task

Frontier models are broadly ahead, as expected — but the interesting results are task-specific. In the current V1 results, Mimo v2.5 was especially strong on actuarial tasks, outperforming even the frontier models in that slice at roughly 1/400th of the cost. Gemini 3.5 Flash Low did well on claims reasoning while struggling more on actuarial; Nemotron led the tested claims set. There is no single answer to "which model should we use?" — it depends on the work.

Cost can move independently of quality

Two models with similar scores can have very different per-task cost. Hidden reasoning tokens, verbose outputs, and repeated tool calls all matter; for high-volume workflows, small differences in per-task cost become material.

The best deployment may be a route, not a model

Many workflows should not be assigned to one model forever. A lower-cost model can handle routine cases, with uncertain or high-risk cases escalated to a stronger model or a human reviewer — routing that is hard to design from model leaderboards alone.
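
A route of this kind could be sketched as below. Everything here is an assumption for illustration: the two models are hypothetical callables that return an answer and a confidence, the `high_risk` flag is a made-up field, and the sketch takes for granted that the reported confidence is calibrated well enough to act on.

```python
def route(case, cheap_model, strong_model, confidence_floor=0.8):
    """Try the cheap model first; escalate high-risk or low-confidence cases.

    Both models are hypothetical callables returning (answer, confidence).
    Returns (handler, answer), where handler names who resolved the case.
    """
    if case.get("high_risk"):
        return ("human_review", None)
    answer, confidence = cheap_model(case)
    if confidence >= confidence_floor:
        return ("cheap", answer)
    answer, confidence = strong_model(case)
    if confidence >= confidence_floor:
        return ("strong", answer)
    return ("human_review", None)

# Stub models: the cheap one is confident only on routine cases.
cheap = lambda case: ("approve", 0.95 if case.get("routine") else 0.4)
strong = lambda case: ("refer", 0.9)

routine = route({"routine": True}, cheap, strong)
unusual = route({"routine": False}, cheap, strong)
risky = route({"routine": True, "high_risk": True}, cheap, strong)
```

Choosing the confidence floor requires per-task data on how each model's confidence tracks its accuracy, which a single leaderboard score does not provide.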

Open-weights and firewall-friendly deployment matter

For some carriers and MGAs, cost is only one constraint. Data control, latency, auditability, and the ability to run in a controlled environment may matter as much. InsureBench is designed to make those tradeoffs visible.

What's in V1

V1 covers insurance-specific tasks across claims, underwriting, actuarial, coverage, forms, document reading, and workflow judgment, with deeper runs across a wider selection of industry models and harnesses. The question bank keeps growing; new tasks are reviewed for objective keys, self-containment, attachment handling, scoring reliability, and SME validity. Where answers can be checked by code, they are; where judgment is required, scoring uses concrete rubrics.
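
The two scoring modes can be illustrated with a short sketch. It is not InsureBench's grader: the tolerance, the text normalization, the criterion names, and the weights are all assumptions, and the rubric marks are taken as given (in practice they would come from a reviewer or a grading model).

```python
def score_keyed(answer, key, rel_tol=0.01):
    """Return 1.0 if the answer matches the objective key, else 0.0.

    Numeric keys match within a relative tolerance; text keys match
    case-insensitively after trimming whitespace.
    """
    if isinstance(key, (int, float)):
        try:
            value = float(answer)
        except (TypeError, ValueError):
            return 0.0
        return 1.0 if abs(value - key) <= rel_tol * abs(key) else 0.0
    return 1.0 if str(answer).strip().lower() == str(key).strip().lower() else 0.0

def score_rubric(marks, weights):
    """Weighted share of rubric criteria met; `marks` maps criterion -> bool."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if marks.get(name))
    return earned / total

# Hypothetical examples.
numeric_score = score_keyed("1254.0", 1250)        # within 1% of the key
text_score = score_keyed(" Covered ", "covered")   # same text after normalizing
rubric_score = score_rubric(
    {"cites_exclusion": True, "correct_limit": False, "flags_referral": True},
    {"cites_exclusion": 2, "correct_limit": 2, "flags_referral": 1},
)
```

Keeping each rubric criterion as a yes/no check with an explicit weight is one way to make judgment scoring concrete and repeatable.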

Roadmap

InsureBench will keep expanding in four directions:

more insurance task families,
more models and deployment modes,
more harness variants,
more analysis of cost, latency, calibration, and escalation behavior.

We are particularly interested in repeated workflows where cost, reliability, latency, and data-control constraints matter: underwriting intake, claims triage, forms processing, coverage checks, loss-run analysis, agency management, and portfolio monitoring.

Sample questions

A few public examples with the supplied facts, expected key, and scoring rubric.

Work with us

InsureBench is a public benchmark and an InsureThing lab project. We welcome suggestions for tasks, models, harnesses, and workflows to test, and we are glad to work with model providers, carriers, MGAs, brokers, and builders who want to know how their systems perform on realistic insurance work.

InsureThing also helps insurance organizations apply this analysis to their own workflows — selecting the model, designing the harness, reducing cost, improving reliability, and building escalation paths around real business processes.

Contact: don@insure-thing.com