Last updated: June 2026 (rev. 2)

Leaderboard

Legal Benchmarks test AI systems on real-world legal tasks to find out which AI legal professionals should use for the job.

Get benchmarkedWe run this benchmark every month.

Key takeaways

  • Anthropic’s Fable 5 costs more than twice as much as Opus 4.8 without beating it on legal work. It trails Opus 4.8 on contract drafting reliability and only ties it on information extraction.
  • The most reliable answers are often the most bloated answers. In information extraction, Opus 4.8 and Fable 5 average about 1,200 words per answer, roughly two to three times the length of GPT-5.4-mini, Qwen3.7 Max, Gemini 3.1 Pro, and DeepSeek V4 Pro.
  • Open-source models still lag frontier models on legal reliability. Qwen3.7 Max and DeepSeek V4 Pro top out in the mid-40s on drafting and around 60% on information extraction, while the frontier leaders reach roughly two-thirds and above 80%.
  • The same model can be strong in one task category and weak in another. GPT-5.5 is one task behind Opus 4.8 and Fable 5 on information extraction (80.0% vs. 83.3%), but much weaker on drafting (41.2%).

Contract Drafting

We test how well AI systems do the drafting work commercial lawyers do on a frequent basis: writing a new clause from scratch, redlining a counterparty’s draft, preparing an amendment, building an agreement from a template, or tightening existing language.

We measure AI output quality on two axes: reliability (is the answer right) and usefulness (can you use it as written). The best outputs are both. Full definitions and scoring are in the methodology below.

On the chart, each dot is one system: reliability runs across, usefulness runs up, so the best results sit in the top-right.

Loading chart…
Each dot is one model: how often it is fully correct (further right) against how client-ready the output reads (further up). Colour shows the provider. The shaded top-right region covers models above the midpoint on both axes — both correct and client-ready. Hover any dot for its full breakdown.

Results, contract drafting

AI is still not reliable enough to draft contracts unsupervised. Under an all-pass standard, Opus 4.8 leads at 67.6%, with Fable 5 next at 61.8%. Even the leader fails roughly one drafting task in three, and the open-source models top out in the mid-40s. Contract drafting is far from solved.

Sort by
Contract Drafting results: reliability, usefulness, and cost per task for each model.
Claude Opus 4.8Anthropic67.6% (23/34)2.67~$0.29
Claude Fable 5NewAnthropic61.8% (21/34)2.66~$0.64
Gemini 3.5 FlashNewGoogleGoogle55.9% (19/34)2.60$0.08
Claude Sonnet 4.6Anthropic50.0% (17/34)2.63$0.13
Gemini 3.1 ProGoogleGoogle50.0% (17/34)2.69$0.07
Qwen3.7 MaxNewAAlibaba44.1% (15/34)2.67$0.03
GPT-5.5OpenAI41.2% (14/34)2.77$0.15
DeepSeek V4 ProNewDDeepSeek26.5% (9/34)2.68$0.03
GPT-5.4-miniNewOpenAI26.5% (9/34)2.55$0.02

Reliability is the share of 34 drafting tasks passed under an all-pass standard; usefulness is the overall 1–3 score across clarity, length, and structure. Tap any column header to re-sort. Methodology

Apps vs. Models

The harness around a model improves reliability with little cost to usefulness. Both leading apps passed more hard drafting tasks than their base model, with ChatGPT by 22%, while usefulness held flat.

A legal-AI app is a model plus everything wrapped around it: a system prompt, file handling, retrieval, a UI. To see whether that wrapper changes the work or just repackages it, we ran each app against the API model underneath it on the nine hardest drafting tasks, where raw models pass under half the time. Same all-pass standard, same usefulness judges. This table is carried over from the prior run; apps were not re-tested this cycle.

ProviderModelReliabilityΔ vs baseUsefulnessΔ vs base
AnthropicClaude app55.6% (5/9)↑ +11.1pp2.65↓ −0.07
AnthropicOpus 4.7 (API)44.4% (4/9)2.72
OpenAIChatGPT app22.2% (2/9)↑ +22.2pp2.80≈ −0.03
OpenAIGPT-5.5 (API)0.0% (0/9)2.83
GoogleGoogleGemini app22.2% (2/9)↓ −11.1pp2.69↑ +0.04
GoogleGoogleGemini 3.1 Pro (API)33.3% (3/9)2.65

Task scope

We test 64 tasks built around real-world legal requests lawyers are already using AI to help with. The set splits into two categories.

Contract drafting accounts for 34 tasks: producing or amending contract language from a lawyer’s instruction, including clauses, side letters, amendments, and “tidy this up” requests.

Information extraction accounts for 30 tasks: answering questions about a document, such as locating clauses, defined terms, governing law, values, and obligations. Tasks vary by how broad the question is, how clean the document is, whether one document or several, and whether the answer is present, absent, partial, or contradictory.

The set is heavy on commercial, IP, employment, M&A, and competition work. Every task is single-turn: one instruction, one answer, no follow-up.

Where the tasks come from

Practitioner-authored tasks make up the core of the set. Lawyers from the global Legal Benchmarks community wrote the instructions, reference files, and answer keys based on real scenarios from their own work.

Synthetic tasks make up the rest. We built a pipeline to generate additional tasks from the failure modes identified in our earlier Info Extraction and Contract Drafting reports, and from published legal-AI benchmarking work in the industry. Every synthetic task is validated by a human expert before it enters the set.

How this builds on our prior research

This is the third installment in a series we’ve been publishing since April 2025. Our Phase 1 report tested 6 AI assistants on 18 information extraction tasks; Phase 2 tested 13 AI tools and a human-lawyer baseline on 30 contract drafting tasks, and introduced the two-axis split between reliability and usefulness that this benchmark inherits. The 64-task set consolidates both into a single, repeatable evaluation, with the failure modes identified in Phases 1 and 2 now seeding the synthetic tasks.

Methodology

We built an evaluation pipeline to collect AI outputs and measure their quality on two independent axes, reliability and usefulness, scalably and repeatably.

Collecting each answer

For an AI model, we send the instruction and documents to the model directly via API through our evaluation pipeline. System prompts are standardized across API models so the only variable between runs is the model itself. Documents are sent in native form rather than pre-converted to text, so reading them is part of the capability we test: a model that can read a file does so with its own vision, and a model that cannot fails tasks that depend on the document.

For a commercial application like Claude or ChatGPT, we use a browser agent that drives the application’s own interface the way a lawyer would, and we grade whatever it produces, including any document it drafts and attaches, not just its chat reply.

Model variations tested

Every API model is called through OpenRouter at the provider’s default reasoning setting, with no reasoning or reasoning_effort override, so the only variable between runs is the model itself.

Two-axis scoring

Quality is reported on two independent axes, reliability and usefulness, kept separate rather than blended into a single score. A memo can be technically accurate and unreadable, or beautifully formatted and wrong; reporting the axes separately preserves that distinction.

Reliability answers the question “is the answer right?” It captures accuracy, completeness, relevance to what was asked, and honesty about what the AI does not know rather than inventing. We report the share of tasks the AI gets fully right.

Usefulness answers the question “could a practitioner use it as written?” It captures whether the output is clear, well formatted, and the right length. We report the average of three sub-scores (clarity, length, structure), each rated 1 to 3, where 1 means needs rework before use, 2 means usable with edits, and 3 means use as-is.

How each task is graded

Each task ships with a checklist of requirements written by the lawyer who authored it, covering both substance and wording. For reliability, each item is marked pass or fail, and missing even one fails the whole task. This follows error-analysis-first eval discipline: binary judgments, no false-precision Likert scales hiding inside the task definition.

The checklist is also where we encode the fact that there is rarely only one right answer in legal work. For example, where a lawyer instructs the AI to draft an unenforceable clause, the criteria are written to pass any of several legitimate response shapes: refusing to draft and explaining why, drafting the requested clause with an explicit risk flag, or proposing a compliant alternative with the reasoning for the substitution. What fails is silent compliance with the bad instruction. The same logic runs through the extraction tasks: where a defensible answer can be “yes,” “no,” or “uncertain,” the criteria accept the valid variants and fail only the stated must-avoids, such as hallucination, missing a core element, or flattening a conditional answer into a flat one. The benchmark records the shape the model chose, and the criteria meet practice where a competent lawyer would actually land.

Reliability is graded by a single LLM judge (Claude Sonnet 4.6) against the task’s checklist. Usefulness is scored by a panel of LLM judges (Claude Sonnet 4.6 and Gemini 3.1 Pro) running in parallel for every cell, and their scores are averaged into a consensus.

Grader validation

LLM graders are only as useful as the evidence that they agree with practitioners. We run two checks.

Cross-grader agreement. Across roughly 1,700 paired scores, the two usefulness graders gave the exact same 1 to 3 score about 82% of the time. Agreement is highest on clarity (around 97%) and structure (83% to 94%), and lowest on length (56% to 64%); length is the most subjective sub-criterion and the one most worth a human eye. Cells where the two judges disagreed by the maximum 2-point spread are flagged for human review.

Favoritism check. We test whether a grader scores its own maker’s model higher. It does not show up in the data. Sonnet does not top the usefulness leaderboard in either category, and on the length sub-criterion the Anthropic models actually sit at the bottom, the opposite of what a biased grader would produce.

Results and rankings

Each model run produces, for every task, a reliability verdict (a per-criterion pass/fail tally that resolves to “fully right” or not) and a usefulness score (the mean of clarity, length, and structure from the two-judge panel).

We then aggregate across the 64 tasks in two cuts. Per category, contract drafting and information extraction are ranked separately and never combined: a model’s reliability score in a category is the share of that category’s tasks it got fully right, and its usefulness score is the mean of its task-level usefulness means in that category. Per sub-criterion, clarity, length, and structure are also reported individually, because they fail differently and a single overall number hides that.

Rankings are refreshed each time we re-run the benchmark.

Accessing the tasks

The task set, reference documents, and per-task criteria are not publicly distributed. Releasing them in full would let vendors train or tune against the benchmark, which would quickly erode the signal we are trying to publish. Researchers and academics who want to inspect the methodology in more detail can request access by getting in touch. Vendors who want their tool measured against the set can submit it through the route below, and we run the evaluation on their behalf.

Limitations

The benchmark gives a useful read on where models stand today, but it has limits worth flagging.

Task count and coverage. The set is 64 tasks. It leans commercial, IP, employment, M&A, and competition work, and it is English only with a US and UK skew. Practice areas and jurisdictions outside that scope are thin or absent.

Single-turn only. Every task is one instruction and one answer. We do not yet test multi-turn workflows, longer matters, or back-and-forth refinement, which is closer to how lawyers actually use these tools.

Snapshot, not a trend. Models and applications change month to month. The leaderboard reflects a point-in-time run, and a result from one month may not hold the next.

LLM judges. Reliability is graded by a single LLM judge and usefulness by a panel of two. Cross-grader agreement is high overall but lower on length, and the judges are validated by human experts on a spot-check basis, not on every task.

Application coverage. Application testing is limited to Claude, ChatGPT, and Gemini, and the Apps vs. Models table is carried over from the prior run rather than re-tested this cycle. Other legal AI products, including purpose-built tools, are not yet represented on the leaderboard.

File handling. Documents are sent in native form, so a model that cannot read a file will fail file-dependent tasks. That is by design — reading the document is part of the capability we test — but it means text-only models score lower on file-heavy tasks than they might with pre-extracted text.

API system prompts. System prompts are standardized across API models so the only variable is the model itself. A model that benefits from a more tailored prompt may look weaker here than it does in practice.

Cost figures. Per-task cost is based on listed API pricing at the time of the run. It does not include retries, application overhead, or enterprise discounts.

Consistency and drift not tested. We grade a single run per task. We do not yet measure how much a model’s output varies across repeated runs of the same task, which matters for anyone relying on the tool day to day.

What’s next

We re-run the benchmark every month, and a few things are already in train.

  • Benchmark applications. Tool builders can submit their products to see where they rank across these task categories.
  • Wider coverage. We want to expand into more task categories that both in-house lawyers and private-practice lawyers face day-to-day, and, more importantly, into more geographic regions and languages.
  • More practice areas. The set leans commercial, IP, employment, M&A, and competition. We are adding tasks in the areas that are thin today.
  • Beyond single-turn. Every task today is one instruction and one answer. Multi-turn work, longer matters, and tool use are on the list.

Get benchmarked

Building a legal AI tool? We test products, not just raw models, and the results sit on the same leaderboard. The top-performing applications in each category are published on the leaderboard and refreshed monthly. If you want your tool run against the same tasks under the same grading, get in touch.

Get your tool benchmarked

Contribute to the benchmark

Have a legal task we should test? Spot something unclear? Want to help improve the methodology? The benchmark is built and refined by a community of in-house lawyers, private-practice lawyers, and AI technical experts. Send us feedback or contribute a task.

Share feedback or contribute a task

Changelog

June 2026 — rev. 2

  • Expanded the field from seven to nine ranked API models. Added Claude Fable 5, Gemini 3.5 Flash, GPT-5.4-mini, and two new vendors — DeepSeek V4 Pro and Qwen3.7 Max. Retired Claude Opus 4.7, Gemini 3 Flash, and GPT-4o-mini.
  • Re-ran the full leaderboard on 64 tasks, re-grading all nine models on the current rubric. Some incumbent numbers moved as a result, not just the new rows.
  • Switched document handling to native form, so models are tested on whether they can read files, not just extracted text. Vision-capable models read documents directly; text-only models fail file-dependent tasks.

June 2026 — initial

  • Added Claude Opus 4.8.
  • Added the Limitations and Model variations tested sections.