InsureBench

The first LLM + harness leaderboard built for insurance-specific tasks.

Cost vs. quality

Each dot is a model + harness combo — up and left wins (accurate and cheap). The dashed line is the efficient frontier; filters re-score the field live.
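
One way to read the dashed line: a combo sits on the efficient frontier when no other combo is both cheaper and at least as accurate. The sketch below illustrates that definition only; it is not the leaderboard's own code, and the `Run` type, its field names, and the sample numbers are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str        # hypothetical "model + harness" label
    cost: float      # e.g. dollars per 100 tasks (lower is better)
    accuracy: float  # fraction of tasks correct, 0..1 (higher is better)


def efficient_frontier(runs: list[Run]) -> list[Run]:
    """Keep only runs that beat the accuracy of every cheaper-or-equal run."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to priciest; at equal cost, the most accurate comes first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Made-up sample points, for illustration only.
runs = [
    Run("a", 2.0, 0.80),
    Run("b", 5.0, 0.78),   # dominated by "a": pricier and less accurate
    Run("c", 9.0, 0.91),
    Run("e", 12.0, 0.90),  # dominated by "c"
    Run("d", 30.0, 0.94),
]
frontier_names = [r.name for r in efficient_frontier(runs)]
```

Applying a filter amounts to recomputing `accuracy` on the filtered task set and calling `efficient_frontier` again, which is why the line can move when the field is re-scored.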


The reasoning tax. A model can be cheap per token and pricey per task — it just thinks more. Reasoning tokens bill like output, and you often never see them. Compare $/1M tok with $/100 tasks; flip to metered for real billed cost.
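
The arithmetic behind the reasoning tax can be sketched as follows. This is a simplified model that assumes, as stated above, that reasoning tokens bill at the output rate; the function name, prices, and token counts are hypothetical and not taken from any provider's price list.

```python
def cost_per_100_tasks(
    price_in: float,            # dollars per 1M input tokens
    price_out: float,           # dollars per 1M output tokens
    input_tokens: int,          # average prompt size per task
    visible_output_tokens: int, # tokens you actually see in the answer
    reasoning_tokens: int,      # hidden thinking tokens, billed like output
) -> float:
    billed_output = visible_output_tokens + reasoning_tokens
    per_task = (input_tokens * price_in + billed_output * price_out) / 1_000_000
    return per_task * 100


# A model that is cheaper per token but thinks a lot on every task...
thinker = cost_per_100_tasks(1.0, 4.0, 4_000, 300, 8_000)   # about $3.72
# ...versus one that is pricier per token but answers directly.
terse = cost_per_100_tasks(3.0, 15.0, 4_000, 300, 0)        # about $1.65
```

With these made-up numbers, the model with the lower $/1M tok costs more than twice as much per 100 tasks, which is the gap the two views are meant to expose.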

Harnesses. tooled_reader — the model plus a file-reading tool; it opens a task's attachments (loss runs, a manual) on request, then answers. v3 retries a malformed answer, so a formatting slip isn't scored as wrong. claude_code — Claude in its agentic CLI. codex_plain — the model in OpenAI's Codex CLI.
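
The retry idea can be illustrated with a small sketch. It is not the actual tooled_reader implementation: the `ANSWER: <letter>` format, the `ask` callable, and the retry wording are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical answer format; the real harness may expect something else.
ANSWER_PATTERN = re.compile(r"ANSWER:\s*([A-D])\b")


def parse_answer(text: str) -> Optional[str]:
    """Return the answer letter, or None if the reply is malformed."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else None


def answer_with_retry(
    ask: Callable[[str], str],  # stand-in for a model call: prompt in, reply out
    prompt: str,
    max_retries: int = 1,
) -> Optional[str]:
    reply = ask(prompt)
    for _ in range(max_retries):
        parsed = parse_answer(reply)
        if parsed is not None:
            return parsed
        # Malformed reply: ask again with a format reminder instead of scoring it wrong.
        reply = ask(
            prompt
            + "\n\nYour last reply was not in the required format. "
            + "Reply with 'ANSWER: <letter>'."
        )
    return parse_answer(reply)
```

A reply that is still malformed after the retries returns `None`, which a scorer would count as wrong; only the formatting slip gets a second chance, not the answer itself.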

Why InsureBench

We're not just rating models for insurance — interesting, but academic. You don't need a benchmark to tell you Opus beats DeepSeek Flash. We built InsureBench to help insurance decision-makers find the model that's actually worth it for the job — where a cheaper one clears your bar, and where "cheap" quietly fails the work.

It's the model and the harness

The harness is everything around the model: skills, prompts, reading tools, caching. It's where the leverage is: a good harness pulls a cheap model toward frontier accuracy, and caching a document that rides every prompt (an underwriting manual, say) slashes cost. The model is the talent. The harness is the operation.
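
To see why caching a shared document matters, here is a rough cost sketch. It assumes the cache stays warm for the whole run, ignores any cache-write surcharge, and treats the cached-read discount as a parameter because it varies by provider; every number below is hypothetical.

```python
from typing import Optional


def prompt_cost_usd(
    n_tasks: int,
    manual_tokens: int,    # shared document sent with every prompt
    task_tokens: int,      # task-specific part of each prompt
    price_per_m: float,    # dollars per 1M input tokens
    cached_read_fraction: Optional[float] = None,  # e.g. 0.1 = cached reads cost 10%
) -> float:
    """Input-side cost of n_tasks prompts that all carry the same manual."""
    if cached_read_fraction is None:
        # No caching: the manual is billed in full on every request.
        manual_total = n_tasks * manual_tokens * price_per_m
    else:
        # First request pays full price; the rest read the manual from cache.
        manual_total = manual_tokens * price_per_m * (
            1 + (n_tasks - 1) * cached_read_fraction
        )
    task_total = n_tasks * task_tokens * price_per_m
    return (manual_total + task_total) / 1_000_000


uncached = prompt_cost_usd(100, 50_000, 1_000, 3.0)                          # about $15.30
cached = prompt_cost_usd(100, 50_000, 1_000, 3.0, cached_read_fraction=0.1)  # about $1.94
```

In this made-up case the manual dominates the prompt, so caching it cuts input cost by roughly a factor of eight without touching the model.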
"Saturated" depends on the job 94% is a victory lap for a benchmark and a liability for a claims bot. Easy items inflate the headline — filter Difficulty → Hard to see where models actually split. The last few points are where the differences, and the risk, live.

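A small sketch of how easy items can mask a split on hard ones. The two result sets are invented for illustration and the `accuracy` helper is hypothetical; it only mimics what a difficulty filter does to the score.

```python
from typing import Optional

# Each result is (difficulty, answered_correctly). Invented data, 100 tasks per model.
model_a = [("easy", True)] * 80 + [("hard", True)] * 14 + [("hard", False)] * 6
model_b = (
    [("easy", True)] * 75
    + [("easy", False)] * 5
    + [("hard", True)] * 18
    + [("hard", False)] * 2
)


def accuracy(results: list[tuple[str, bool]], difficulty: Optional[str] = None) -> float:
    """Share of correct answers, optionally restricted to one difficulty level."""
    chosen = [ok for level, ok in results if difficulty is None or level == difficulty]
    if not chosen:
        raise ValueError("no tasks match the filter")
    return sum(chosen) / len(chosen)


headline_a, headline_b = accuracy(model_a), accuracy(model_b)          # 0.94 vs 0.93
hard_a, hard_b = accuracy(model_a, "hard"), accuracy(model_b, "hard")  # 0.70 vs 0.90
```

On the headline the two are one point apart and model A leads; on hard items alone model B leads by twenty points.
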
What we're learning

Roadmap

Sample questions

A few public examples, each showing the exact prompt a model sees. Answer keys and the private question bank stay private.