Last updated: May 2026

Leaderboard

Which AI models can a lawyer rely on? We tested six leading models on 70 real legal tasks, ranked separately for drafting contracts and for finding information in documents. The best model for one job is rarely the best for the other.

Contract Drafting (34 tasks)

Sonnet 4.6 leads on drafting reliability. The top three models sit within three points of each other, while GPT-5.5 places fifth, ahead of only GPT-4o-mini.

| # | Model | Provider | Reliability* | Tasks passed | Cost per task |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 55.9% | 19 / 34 | $0.13 |
| 2 | Claude Opus 4.7 | Anthropic | 52.9% | 18 / 34 | $0.29 |
| 2 | Gemini 3.1 Pro | Google | 52.9% | 18 / 34 | $0.07 |
| 4 | Gemini 3 Flash | Google | 38.2% | 13 / 34 | $0.01 |
| 5 | GPT-5.5 | OpenAI | 35.3% | 12 / 34 | $0.15 |
| 6 | GPT-4o-mini | OpenAI | 14.7% | 5 / 34 | $0.003 |

*Reliability is the share of tasks where the model passed 100% of the checkpoints, meaning every requirement we checked for that task.
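
The reliability figures can be reproduced from the pass counts. The following is a minimal Python sketch with the counts copied from the drafting table; the names `DRAFTING_PASSED`, `TOTAL_DRAFTING_TASKS` and `reliability` are ours for illustration and are not part of the benchmark tooling.

```python
# Pass counts copied from the drafting table above (34 tasks).
DRAFTING_PASSED = {
    "Claude Sonnet 4.6": 19,
    "Claude Opus 4.7": 18,
    "Gemini 3.1 Pro": 18,
    "Gemini 3 Flash": 13,
    "GPT-5.5": 12,
    "GPT-4o-mini": 5,
}
TOTAL_DRAFTING_TASKS = 34


def reliability(passed: int, total: int) -> float:
    """Share of tasks passed in full, as a percentage rounded to one decimal."""
    return round(100 * passed / total, 1)


for model, passed in DRAFTING_PASSED.items():
    print(f"{model}: {reliability(passed, TOTAL_DRAFTING_TASKS)}%")
```

The same function applies to the extraction leaderboard with a total of 31 tasks.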

Info Extraction (31 tasks)

Opus and GPT-5.5 tie at the top, each passing 25 of 31 extraction tasks in full. Drafting and extraction reward different models.

| # | Model | Provider | Reliability* | Tasks passed | Cost per task |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 80.6% | 25 / 31 | $0.29 |
| 1 | GPT-5.5 | OpenAI | 80.6% | 25 / 31 | $0.15 |
| 3 | Claude Sonnet 4.6 | Anthropic | 74.2% | 23 / 31 | $0.13 |
| 4 | Gemini 3.1 Pro | Google | 64.5% | 20 / 31 | $0.07 |
| 5 | Gemini 3 Flash | Google | 41.9% | 13 / 31 | $0.01 |
| 6 | GPT-4o-mini | OpenAI | 16.1% | 5 / 31 | $0.003 |

*Reliability is the share of tasks where the model passed 100% of the checkpoints, meaning every requirement we checked for that task.

A note on the numbers. Cost per task is the average we paid to run a model on one task. Two tasks count as failures for reasons outside a model's legal ability: one document could not be opened by any of the six models, and one document was too long for one model to read in full. We still count these as failures, because a lawyer using these tools in practice would run into the same wall. The methodology below explains why.

Key takeaways

  1. Drafting and extraction have different winners

    Sonnet 4.6 is the most reliable model for contract drafting, at 55.9%. For pulling information out of documents, Opus 4.7 and GPT-5.5 tie for the lead at 80.6%. No single model wins both jobs, which is why we publish two separate leaderboards rather than one combined score. If you pick one model for everything, you are compromising on at least one type of work.

  2. GPT-5.5 is the most uneven model

    GPT-5.5 comes 5th for drafting (35.3%) but ties for 1st on extraction (80.6%). That is a 45 percentage point swing, the widest of any model. It is excellent at finding specific information in a document and noticeably weaker at producing a full first draft. Use it for extraction work, not for drafting.

  3. The top of the extraction leaderboard is very close

    Opus 4.7 and GPT-5.5 tie for first. Sonnet 4.6 is 6 percentage points behind and Gemini 3.1 Pro 16 points behind. Extraction bunches the leaders together because the answer is either in the document or it is not, leaving less room for models to differ. Drafting spreads the field much wider, from 55.9% down to 35.3%.

  4. Gemini 3.1 Pro is the best value

    Gemini 3.1 Pro ties for 2nd on drafting (52.9%) and comes 4th on extraction (64.5%). It costs $0.07 per task, roughly a quarter of the cost of Opus and half that of Sonnet, while matching Opus on drafting reliability. It is a strong default if you want a single, affordable model, especially for drafting-heavy work.

  5. Every model struggles to leave good text alone

    On the two tasks where the existing clause was already sound and the right move was a light touch or no change at all, only 1 of 6 models passed each time, and it was a different model each time. These models are trained to rewrite, not to hold back. A legal product built on them needs its own check for when to leave text as it is, because the model will not provide one.

  6. A handful of tasks beat every model

    Six of the hardest tasks were failed by all six models. These include reading poor-quality scanned documents, working across two languages, and drafting from several documents at once. These are the tasks to watch as new models are released, because that is where progress will show up first.

  7. GPT-4o-mini is not suitable for legal work

    GPT-4o-mini scored 14.7% on drafting and 16.1% on extraction, passing roughly one task in six. It is useful only as a low-cost reference point, not as a serious option. If you need a cheaper model, Gemini 3 Flash is a far better choice at 38.2% for drafting and 41.9% for extraction, at $0.01 per task.

  8. The Anthropic lead is not just the grader favouring itself

    Our grader is itself one of the models being tested (Sonnet). If it were simply favouring its own family, you would expect Sonnet to win both leaderboards. It does not: Sonnet leads on drafting but places 3rd on extraction. Opus only ties for the extraction lead, and it ties with GPT-5.5, a model from another company that the same grader scored just as highly. That points to the grader rewarding correct work rather than house style. An independent check against human reviewers is still in progress, so treat the rankings as a strong guide rather than the final word.
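
The swing and cost comparisons in takeaways 2 and 4 can be checked directly from the leaderboard figures. The sketch below copies the published reliability and cost values into a dictionary; the names `RESULTS`, `swing` and `cost_ratio` are illustrative only.

```python
# Reliability (%) and cost per task (USD), copied from the two leaderboards.
RESULTS = {
    "Claude Sonnet 4.6": {"drafting": 55.9, "extraction": 74.2, "cost": 0.13},
    "Claude Opus 4.7": {"drafting": 52.9, "extraction": 80.6, "cost": 0.29},
    "Gemini 3.1 Pro": {"drafting": 52.9, "extraction": 64.5, "cost": 0.07},
    "Gemini 3 Flash": {"drafting": 38.2, "extraction": 41.9, "cost": 0.01},
    "GPT-5.5": {"drafting": 35.3, "extraction": 80.6, "cost": 0.15},
    "GPT-4o-mini": {"drafting": 14.7, "extraction": 16.1, "cost": 0.003},
}


def swing(model: str) -> float:
    """Gap between extraction and drafting reliability, in percentage points."""
    row = RESULTS[model]
    return round(row["extraction"] - row["drafting"], 1)


def cost_ratio(model: str, baseline: str) -> float:
    """Cost per task of `model` as a fraction of the cost of `baseline`."""
    return round(RESULTS[model]["cost"] / RESULTS[baseline]["cost"], 2)


widest = max(RESULTS, key=swing)
print(widest, swing(widest))
print(cost_ratio("Gemini 3.1 Pro", "Claude Opus 4.7"))
print(cost_ratio("Gemini 3.1 Pro", "Claude Sonnet 4.6"))
```

Run as written, this reports GPT-5.5 as the widest swing at 45.3 points, and puts Gemini 3.1 Pro at 0.24 of the cost of Opus and 0.54 of the cost of Sonnet.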

Methodology

What we tested

We gave six leading AI models 70 tasks drawn from real legal work: 34 contract drafting tasks (writing and amending clauses, filling in templates), 31 information extraction tasks (finding clauses, tracing defined terms, spotting conflicts across documents), and 5 longer analysis tasks. The 5 analysis tasks are not in the leaderboards above, because the sample is too small to rank fairly. Every task came with the real documents a lawyer would work from and a clear description of what the finished work should contain.

How we scored it

We report one number for each model: reliability, the share of tasks where it passed every checkpoint for that task. We chose a pass-or-fail score on purpose. “Would I send this to a client?” is a yes-or-no question. A draft that gets most things right but misses one point still needs a lawyer to find and fix that point, so partial credit would overstate how useful the model is. Scoring each task as a single pass or fail also stops one very detailed task from dominating the overall results.
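
To make the difference between the two scoring approaches concrete, here is a small Python sketch. The checkpoint results are invented for illustration and are not benchmark data, and the function names are ours.

```python
from typing import Dict, List

# Hypothetical checkpoint results for one model: task id -> one bool per checkpoint.
# These values are invented for illustration and are not benchmark data.
example_results: Dict[str, List[bool]] = {
    "task_a": [True, True, True],
    "task_b": [True, True, True, True, False],  # one missed point
    "task_c": [True, True],
}


def task_passed(checkpoints: List[bool]) -> bool:
    """A task passes only if it has checkpoints and every one of them passed."""
    return bool(checkpoints) and all(checkpoints)


def reliability(results: Dict[str, List[bool]]) -> float:
    """All-or-nothing score: share of tasks with every checkpoint passed."""
    return sum(task_passed(c) for c in results.values()) / len(results)


def partial_credit(results: Dict[str, List[bool]]) -> float:
    """For contrast: share of all checkpoints passed, pooled across tasks."""
    flat = [c for checks in results.values() for c in checks]
    return sum(flat) / len(flat)


print(f"reliability:    {reliability(example_results):.2f}")
print(f"partial credit: {partial_credit(example_results):.2f}")
```

In this invented example the pooled checkpoint score is 0.90 while reliability is 0.67, and the five-checkpoint task alone supplies half of the pooled checkpoints. That illustrates both concerns: partial credit hides the missed point, and a detailed task carries extra weight.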

A few tasks have marking questions we are still reviewing, where no model passed a particular point even though the rest of the task scored well. Until that review is finished, treat the closest rankings with a little caution.

We count errors as failures

If a model crashed, refused the task, could not open a document, or ran out of room to read a long file, we counted that task as a failure rather than setting it aside. A model that cannot get through the documents a lawyer actually works with is not reliable for legal work, even if it would have done well on cleaner input.
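
One way to express this rule in code is to keep every run in the denominator, whatever its outcome. The outcome labels below are hypothetical, chosen to mirror the cases listed above, and are not the benchmark's own labels.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    # Hypothetical labels mirroring the cases described in the text.
    PASSED_ALL_CHECKPOINTS = "passed_all_checkpoints"
    FAILED_CHECKPOINT = "failed_checkpoint"
    CRASHED = "crashed"
    REFUSED = "refused"
    COULD_NOT_OPEN_DOCUMENT = "could_not_open_document"
    CONTEXT_TOO_LONG = "context_too_long"


def counts_as_pass(outcome: Outcome) -> bool:
    """Only a full pass counts; every error outcome is scored as a failure."""
    return outcome is Outcome.PASSED_ALL_CHECKPOINTS


def reliability(outcomes: List[Outcome]) -> float:
    """Errors stay in the denominator instead of being set aside."""
    return sum(counts_as_pass(o) for o in outcomes) / len(outcomes)


example = [
    Outcome.PASSED_ALL_CHECKPOINTS,
    Outcome.PASSED_ALL_CHECKPOINTS,
    Outcome.COULD_NOT_OPEN_DOCUMENT,
    Outcome.CONTEXT_TOO_LONG,
]
print(reliability(example))
```

Setting the two error cases aside would score this invented run at 2 of 2; counting them gives 2 of 4.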

The model never sees the answer key

For every task we kept two separate documents: the instructions and source files the model receives, and the checklist the grader uses to mark the work. The grader's checklist is never shown to the model. This matters. If the model could see the marking checklist, we would be testing whether it can follow a checklist, not whether it can do the legal work.
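
The separation can be enforced structurally, so that the code building the model's prompt has no path to the checklist. This is a sketch of one possible layout, not a description of our actual harness; all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelInput:
    """Everything the model under test is allowed to see."""
    instructions: str
    source_files: List[str]


@dataclass(frozen=True)
class GraderChecklist:
    """Seen only by the grader, never by the model under test."""
    checkpoints: List[str]


@dataclass(frozen=True)
class Task:
    task_id: str
    model_input: ModelInput
    checklist: GraderChecklist


def build_model_prompt(task: Task) -> str:
    """Builds the prompt from model_input only; the checklist is never read here."""
    files = "\n".join(task.model_input.source_files)
    return f"{task.model_input.instructions}\n\nSource files:\n{files}"


demo = Task(
    task_id="demo-1",
    model_input=ModelInput("Amend clause 7 as instructed.", ["msa.docx"]),
    checklist=GraderChecklist(["Flags the unenforceable penalty"]),
)
print(build_model_prompt(demo))
```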

Who does the grading

A single AI model graded every piece of work, so that the marking stays consistent across all the models and tasks. One caveat: the grader is itself one of the models under test, which creates a risk that it favours its own family. We address that in the takeaways above. An independent check of the grader's marking against human reviewers is still in progress, so please treat the rankings as a strong guide rather than a final verdict.

Task source

Contract drafting tasks (34)

The drafting tasks were written by practising transactional lawyers. They cover the kind of work commercial lawyers do every day: master services agreements, NDAs, licensing and reseller terms, employment and non-compete clauses, partnership and M&A documents, IP and technology agreements, and dispute clauses such as arbitration and governing law. Each task was built around a specific trap a careful lawyer would want to catch. Examples include quietly redrafting a clause in one side's favour, writing a penalty that would not be enforceable without flagging the risk, and recognising that an existing clause already covers the client's concern and is best left alone.

Information extraction tasks (31)

The extraction tasks came from a separate set, also written by lawyers, each with an answer key. The answer key drives the marking, using three kinds of check: content the answer must include, wrong answers it must avoid (such as inventing the contents of a schedule the model cannot actually see), and bonus points for extra correct detail that is not required. The source documents are real or realistic legal materials: contracts, term sheets, board minutes, security summaries, scanned letters of intent, and email chains. Some tasks deliberately use poor-quality scans, where the honest answer is often “I cannot read this reliably” rather than a guessed value.
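
The three kinds of check can be pictured as a small data structure. The sketch below rests on two assumptions: plain substring matching stands in for the AI grader's judgment, and bonus detail is recorded without affecting pass or fail. Neither is confirmed as how the real grader works, and every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnswerKey:
    must_include: List[str]  # content the answer must contain
    must_avoid: List[str]    # wrong answers it must not contain
    bonus: List[str] = field(default_factory=list)  # optional extra detail


@dataclass
class Mark:
    passed: bool
    bonus_points: int


def mark_answer(answer: str, key: AnswerKey) -> Mark:
    """Assumption: substring matching stands in for the AI grader's judgment."""
    text = answer.lower()
    has_required = all(item.lower() in text for item in key.must_include)
    has_forbidden = any(item.lower() in text for item in key.must_avoid)
    bonus_points = sum(item.lower() in text for item in key.bonus)
    # Assumption: bonus detail is recorded but cannot rescue a failed task.
    return Mark(passed=has_required and not has_forbidden, bonus_points=bonus_points)


# Invented example key and answers, for illustration only.
key = AnswerKey(
    must_include=["30 days"],
    must_avoid=["schedule 2 lists"],
    bonus=["clause 14.2"],
)
good = mark_answer("Termination requires 30 days' notice under clause 14.2.", key)
bad = mark_answer("Notice is 30 days. Schedule 2 lists the fees as fixed.", key)
print(good, bad)
```

The second answer fails even though it contains the required content, because it describes a schedule the model was never shown.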

Analysis tasks (5, not in the leaderboards)

A smaller set of 5 tasks tests reasoning where the answer is genuinely uncertain, such as pension and regulatory-timing questions. We left these out of the leaderboards because five tasks are too few to rank on, but the per-task results are available for anyone who wants them.

What we deliberately left out

  • No confidential firm documents. Every source file is public, written for the test, or changed enough that it could not identify a real client.
  • No severity weighting in the score. We do not treat some checkpoints as more serious than others in the reliability number itself. The marking tells you what was checked; how much weight to give each point is your call as the reader.
  • The marking checklist, hidden from the model. As above, the model only ever sees the instructions and documents, never the checklist used to grade its work.