The first LLM + harness leaderboard built for insurance-specific tasks.
Each dot is a model + harness combo — up and left wins (accurate and cheap). The dashed line is the efficient frontier; filters re-score the field live.
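How the frontier could be computed: the sketch below is a minimal illustration, not the leaderboard's actual code. It assumes each combo has a cost and an accuracy, and keeps only combos that no cheaper-or-equal combo matches or beats on accuracy. The `Combo` type, the `efficient_frontier` function, and every number in `EXAMPLE` are hypothetical.

```python
from typing import List, NamedTuple


class Combo(NamedTuple):
    name: str        # hypothetical label for a model + harness pair
    cost: float      # dollars per 100 tasks (made-up values)
    accuracy: float  # fraction of tasks answered correctly (made-up values)


def efficient_frontier(combos: List[Combo]) -> List[Combo]:
    """Return the combos that no cheaper-or-equal combo matches or beats on accuracy."""
    # Walk from cheapest to priciest; on a cost tie, see the more accurate one first.
    ordered = sorted(combos, key=lambda c: (c.cost, -c.accuracy))
    frontier: List[Combo] = []
    best_accuracy = float("-inf")
    for combo in ordered:
        # A combo earns a place only if it is more accurate than everything cheaper.
        if combo.accuracy > best_accuracy:
            frontier.append(combo)
            best_accuracy = combo.accuracy
    return frontier


# Made-up points, for illustration only.
EXAMPLE = [
    Combo("combo_a", 2.0, 0.60),
    Combo("combo_b", 5.0, 0.80),
    Combo("combo_c", 6.0, 0.75),   # costs more than combo_b and is less accurate
    Combo("combo_d", 20.0, 0.90),
]

if __name__ == "__main__":
    print([c.name for c in efficient_frontier(EXAMPLE)])
```

On these made-up points, combo_c drops off the frontier because combo_b is both cheaper and more accurate. Applying a filter would amount to re-running the same function on the filtered list.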
The reasoning tax. A model can be cheap per token and pricey per task: it just thinks more. Reasoning tokens bill like output, and you often don't see them. Compare $/1M tok with $/100 tasks; flip to metered for real billed cost.
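A worked version of that arithmetic, as a rough sketch. The only rule taken from the text above is that reasoning tokens are billed at the output rate. The function name, the token counts, and the prices are all made up for illustration and describe no real model.

```python
def cost_per_100_tasks(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of 100 tasks, given average token counts for one task."""
    # Reasoning tokens are billed like output, even when they are never shown.
    billed_output_tokens = visible_output_tokens + reasoning_tokens
    per_task = (
        input_tokens * input_price_per_million
        + billed_output_tokens * output_price_per_million
    ) / 1_000_000
    return 100 * per_task


# Hypothetical model A: half the per-token price, but it thinks a lot.
thinker = cost_per_100_tasks(2_000, 200, 8_000, 0.50, 2.00)

# Hypothetical model B: twice the per-token price, no hidden reasoning.
terse = cost_per_100_tasks(2_000, 200, 0, 1.00, 4.00)

if __name__ == "__main__":
    print(f"thinker: ${thinker:.2f} per 100 tasks")
    print(f"terse:   ${terse:.2f} per 100 tasks")
```

With these invented numbers, the model that is half the price per token comes out at about $1.74 per 100 tasks against about $0.28 for the other, roughly six times more. That gap is what the two cost views are meant to expose.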
Harnesses. tooled_reader — the model plus a file-reading tool; it opens a task's attachments (loss runs, a manual) on request, then answers. v3 retries a malformed answer, so a formatting slip isn't scored as wrong. claude_code — Claude in its agentic CLI. codex_plain — the model in OpenAI's Codex CLI.
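The retry step can be pictured roughly as follows. This is a sketch under stated assumptions, not the harness's real code: the `ANSWER:` line format, the `parse_answer` and `answer_with_retry` names, the reminder wording, and the single-retry default are all hypothetical, and `ask` stands in for whatever call actually reaches the model.

```python
import re
from typing import Callable, Optional

# Hypothetical answer format: a line that starts with "ANSWER:".
ANSWER_PATTERN = re.compile(r"^ANSWER:\s*(.+)$", re.MULTILINE)

FORMAT_REMINDER = (
    "\n\nYour last reply was not in the required format. "
    "Reply with a line that starts with 'ANSWER:'."
)


def parse_answer(reply: str) -> Optional[str]:
    """Extract the answer text, or return None if the reply is malformed."""
    match = ANSWER_PATTERN.search(reply)
    return match.group(1).strip() if match else None


def answer_with_retry(
    ask: Callable[[str], str], prompt: str, max_retries: int = 1
) -> Optional[str]:
    """Ask once, then re-ask up to max_retries times if the reply cannot be parsed."""
    message = prompt
    for _ in range(max_retries + 1):
        answer = parse_answer(ask(message))
        if answer is not None:
            return answer
        # Only the formatting failed, so repeat the task with a format reminder.
        message = prompt + FORMAT_REMINDER
    return None  # still malformed after all retries; the caller decides how to score it
```

The point of the design is to keep formatting errors separate from substantive ones: a reply that parses is scored on its content, and only a reply that never parses is left for the caller to handle.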
We're not just rating models for insurance — interesting, but academic. You don't need a benchmark to tell you Opus beats DeepSeek Flash. We built InsureBench to help insurance decision-makers find the model that's actually worth it for the job — where a cheaper one clears your bar, and where "cheap" quietly fails the work.
A few public examples — the exact prompt a model sees. Keys and the private bank stay private.