Legal AI Benchmarks Compared: Harvey Legal Agent Bench, Ivo, and GC AI
By Anna Guo

In the space of 4 weeks, 3 legal AI vendors published benchmarks.
Ivo’s Contract Review Comparison came out in April. Harvey released the Legal Agent Benchmark, or LAB, on May 6. GC AI published the In-House Legal Bench on May 15.
All 3 are positioned as serious contributions to legal AI evaluation. But there is an obvious tension here: these are vendor benchmarks, and vendor benchmarks tend to make the vendor look good.
Ivo ranked itself as the top AI, nearly matching the human baseline. GC AI ranked itself first against ChatGPT, Claude, and Gemini. Harvey has not released LAB leaderboard results yet, but in its earlier BigLaw Bench, published in August 2024, Harvey’s proprietary models outperformed all the public foundation models it tested.
So are these benchmarks even worth reading?
My answer: yes, but not because the headline rankings should be taken at face value.
Why Legal AI Benchmarks Are Appearing Now
The timing is not a coincidence. Legal AI vendors are publishing benchmarks now because the buyer question has changed from “can AI do legal work?” to “which tool deserves budget?”
The legal services market used to be easier to understand. If you needed legal work done, you went to human experts: outside counsel, in-house lawyers, or specialist service providers. Legal software sat in a different category. It helped with workflow, storage, signatures, billing, matter management, or contract lifecycle management. It did not seriously compete with lawyers on legal judgment or legal output.
Legal AI has scrambled that picture.
Law firms, legal AI vendors, CLMs, enterprise software tools, and foundation models are now competing for overlapping work: contract review, drafting, research, analysis, knowledge retrieval, workflow automation, and increasingly, legal judgment support.

The Legal Services Market Today
That makes benchmarks commercially useful.
They let vendors answer the same question in public: why should a legal team pay for this purpose-built legal AI tool instead of using Claude, ChatGPT, Gemini, Copilot, a CLM’s new AI feature, or an AI-enabled law firm?
This is especially urgent because foundation models are moving directly into the places lawyers already work. Once Claude is inside Word, Copilot is inside Microsoft 365, and ChatGPT can read documents, browse sources, and produce structured outputs, the easy “we own the workflow” answer becomes weaker.
Ivo Contract Review Comparison
What it is. 19 anonymized contracts, including NDAs, MSAs, and DPAs, reviewed by Ivo, Claude for Word, and 1 human Special Counsel from an Am Law 25 firm with 8 years of experience. 3 senior attorneys scored the outputs blind on a 1-to-10 scale across 5 criteria. Headline result: Ivo 4.52, Human 4.56, Claude 3.50.

How to read it. As the most transparent of the 3 legal AI benchmarks on what was done, and the most exposed on how it was scored. Ivo released the actual redlines and, importantly, the playbooks used to grade them.
The FAQ states the limitations plainly: single point in time, 1 human baseline, 19 contracts, and the playbook was provided as system configuration for Ivo but as a user prompt for Claude. That openness is rare in vendor benchmarking, and worth crediting.
What is interesting. The human baseline is the most thoughtful design choice in any of the 3 initiatives. Anchoring the scale against an experienced attorney tells you what a good score actually looks like, what humans get wrong, and where AI is and is not within the band of human professional output.
The published examples make this concrete in both directions. On Formatting Retention, meaning font, spacing, numbering, and defined terms in Word, the human attorney scored 10, Ivo scored 5, and Claude scored 4. The example shows Ivo inserting a CCPA clause with the right heading style but no section number. That is the kind of detail that is invisible until you try to ship the redline and the document numbering breaks.
The published failure modes differ by reviewer. Claude’s pattern is aggressive whole-paragraph rewrites instead of surgical edits, missed playbook instructions, and unnecessary substantive overreach. Ivo’s weakest area is formatting fidelity, with section numbering omissions the most common error.

The human lawyer preserved redline formatting best in Ivo’s report
The human’s failure mode was judgment calibration, specifically substantive overreach when the playbook said “leave it alone,” plus leaving internal comments on the document that a reviewer would have to manually delete before sending out.
What is missing. The scoring is judge opinion on a 1-to-10 scale, against a playbook. Ivo does not report how often the 3 judges agreed with each other, which is the basic check that would tell a reader whether the scoring is reliable or noisy. For a task like contract redlining, scoring is inherently judgmental, but that makes reporting judge agreement more important, not less.
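To make the missing check concrete, here is a minimal sketch of the kind of agreement number Ivo could report, or that a reader could compute after re-scoring the released redlines. The per-judge scores and the within-1-point agreement measure below are invented for illustration; Ivo has not published per-judge data.

```python
# Hypothetical sketch: checking whether three judges' 1-to-10 scores agree.
# The scores below are invented; this only shows the kind of reliability
# check a reader could run on re-scored redlines, not Ivo's actual data.

from itertools import combinations
from statistics import mean

# scores[judge] -> list of 1-10 ratings, one per contract, for one criterion
scores = {
    "judge_a": [8, 5, 7, 9, 4],
    "judge_b": [7, 5, 6, 9, 5],
    "judge_c": [9, 4, 7, 8, 4],
}

def within_one_point(a, b):
    """Share of contracts where two judges land within 1 point of each other."""
    return mean(abs(x - y) <= 1 for x, y in zip(a, b))

pairwise = {
    f"{j1} vs {j2}": within_one_point(scores[j1], scores[j2])
    for j1, j2 in combinations(scores, 2)
}

print(pairwise)                               # per-pair agreement rates
print("mean agreement:", mean(pairwise.values()))
```

Even a simple number like this would tell a reader whether a 1-point gap between Ivo and Claude is signal or noise.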
Quick takeaway.
- For builders, the lesson is that transparency on artifacts and limitations buys credibility even when the scoring is contested. Release the data, state the constraints, and let reviewers challenge it.
- For legal teams, Ivo’s redlines and playbooks are downloadable, which means you can pull them and re-score easily. That exercise will tell you more about fit than the headline numbers.
Read more: Ivo Contract Review Comparison
GC AI In-House Legal Bench
What it is. 100 tasks across 10 in-house legal categories, scored by an LLM-as-judge against attorney-written answer keys averaging 12 binary pass/fail criteria per task. GC AI was compared against ChatGPT, Claude, and Gemini. Headline result: GC AI 86.8%, ChatGPT 79.8%, Claude 68.4%, Gemini 57.5%.

GC AI Results
How to read it. Carefully, because there is very little to verify against.
No tasks are public, no answer keys are public, no example outputs are shown. Samples are “available upon request.” That means the entire methodology rests on trusting that the tasks reflect real in-house legal work, the answer keys reflect what a good in-house lawyer would actually expect, and the LLM judge is well-calibrated.
None of those things can be independently checked from what is published.
What is interesting. The task structure is sound on paper. Binary pass/fail per criterion is the right design. It forces atomic scoring instead of a vibe-based number on a 1-to-10 scale. 12 criteria per task gives meaningful granularity. The 10 categories cover most of what in-house counsel actually do, and the breakout by category is useful for thinking about where GC AI’s team believes their product is strong, such as regulatory tracking, checklists, and comparison, and weaker, such as extracting information and summarizing.
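To illustrate why atomic pass/fail scoring is the better design, here is a minimal sketch of how a single task might be graded under this scheme. The task name and criteria are invented, since GC AI has not published its tasks or answer keys, and the aggregation step is an assumption about how a headline percentage like 86.8% would be produced.

```python
# Minimal sketch of atomic pass/fail scoring, as opposed to a single
# holistic 1-to-10 rating. Task name and criteria are invented; GC AI
# has not published its tasks or answer keys.

task = {
    "name": "Review vendor DPA against company data-processing playbook",
    "criteria": {
        "flags missing sub-processor notice obligation": True,
        "identifies 30-day breach notification gap": True,
        "cites the correct playbook fallback position": False,
        "recommends audit-rights language": True,
        # ...in the published design, each task averages 12 such checks
    },
}

passed = sum(task["criteria"].values())
total = len(task["criteria"])
print(f"{task['name']}: {passed}/{total} = {passed / total:.1%}")

# A benchmark-level score is then, presumably, the average of per-task
# pass rates, which is how a headline figure like 86.8% would be built up.
```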
The comparison against ChatGPT, Claude, and Gemini is also a fair question to ask. Buyers do want to know whether the legal-AI premium is worth it over what they could get from a consumer chatbot.
The most interesting finding sits in the foundation-model rankings, not the headline. ChatGPT scored 79.8%. Claude scored 68.4%. Despite all the hype around Claude as the strongest model for legal work, ChatGPT outperformed it by more than 11 points on these tasks. That is a useful prompt for in-house teams who have standardized on Claude on the assumption it is the best generalist for legal work.
What is missing. Public tasks, or even a representative sample. Releasing 5 to 10 tasks and answer keys would be a low-cost way to let the community judge whether the design is fair.
The LLM judge is also not validated in any visible way. GC AI says there was “substantial alignment” between the LLM judge and human expert spot-checks, but no numbers are reported. How often did the LLM judge agree with the human reviewers? On which kinds of tasks did they disagree? Those are the basic questions a reader needs answered, and they are not.
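The validation number is also cheap to compute once human spot-checks exist. Below is a hedged sketch using Cohen’s kappa on binary verdicts; the verdicts are invented and do not reflect GC AI’s actual judge data, they only show what the reported statistic could look like.

```python
# Hedged sketch of judge validation: Cohen's kappa between an LLM judge and
# human spot-checks on binary pass/fail verdicts. Verdicts are invented.
# Kappa corrects raw agreement for the agreement expected by chance, which
# matters when most criteria pass anyway.

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a_pass = sum(a) / n
    p_b_pass = sum(b) / n
    expected = p_a_pass * p_b_pass + (1 - p_a_pass) * (1 - p_b_pass)
    return (observed - expected) / (1 - expected)

llm_judge   = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # pass/fail per criterion
human_check = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

raw = sum(x == y for x, y in zip(llm_judge, human_check)) / len(llm_judge)
print("raw agreement:", raw)
print("Cohen's kappa:", round(cohens_kappa(llm_judge, human_check), 2))
```

Reporting a figure like this per category would also surface where the LLM judge and humans diverge, which is exactly the disagreement question left open.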
Failure modes are not published either. The bench tells you what GC AI scored higher on. It does not tell you what kinds of mistakes the foundation models made, or what kinds GC AI made, which is the more useful question for a buyer evaluating legal AI tools.
Quick takeaway.
- For builders, releasing a sample of tasks and reporting how often the LLM judge agrees with human reviewers would convert this from a marketing paper to a research paper, and the cost of doing so is low.
- For legal teams, the category-level breakouts are useful as a scaffold for your own evaluations.
Read more: GC AI In-House Legal Bench
Harvey Legal Agent Bench
What it is. An open-source legal AI benchmark framework with over 1,200 tasks across 24 practice areas, scored against roughly 75,000 expert-written rubric criteria. Tasks are agentic, multi-document, and long-horizon. The average task ships with 7+ source documents, including docx files, emails, spreadsheets, and the occasional deck. The rubric runs binary pass/fail on each criterion.

How to read it. The resources that have gone into LAB are enormous, and it is by far the most serious of the 3 on infrastructure. It is a useful scaffold for thinking about what legal AI evaluation should look like at scale.
The agentic, deliverable-shaped framing is also closer to how legal work actually happens than the single-turn prompt-response shape most legal benchmarks default to.
What is interesting. Scale and structure. 1,200 tasks and 75,000 criteria is an order of magnitude beyond Ivo’s 19 contracts or GC AI’s 100 tasks. The 24 practice areas give meaningful breadth. Open-sourcing the harness and inviting model providers, startups, researchers, and law firms to audit the rubrics is the right posture, and the partnership list signals that the major labs are willing to participate.
What is missing. Two real concerns.
First, despite framing LAB as a benchmark for “real world work done by lawyers,” Harvey chose to publish a US-centric Big Law version first. Roughly two-thirds of tasks are US-leaning.
The practice-area mix itself, with corporate M&A, IP, governance, trusts and estates, and funds at the top, reads like it is tailor-made for a US Am Law 100 firm. Despite the “for in-house” positioning in Harvey’s broader messaging, LAB is currently an instrument for evaluating agents on US Big Law deliverable assembly.
If you are in any in-house function whose work is shaped by non-US substantive law, you should treat LAB results as one input, not as a transferable proxy for whether a tool will work for you.
Second, binary pass/fail rubrics at this scale tend to capture only the objective elements: did the response mention X, cite Y, identify Z. That can bias scoring toward verbose, comprehensive answers.
Take one task to make this concrete: “Draft ICC Arbitration Statement of Claim for Joint Venture Breach Dispute.”

Task criteria
The vast majority of the 66 criteria ask for specific dollar amounts, specific clause numbers, and specific dates.
The rubric does not focus on whether the statement of claim is a persuasive piece of advocacy, whether the ordering of arguments is strategically sound, or whether it is appropriately scoped.
Under all-pass grading, the math runs in one direction. A work product that lists everything has a structurally higher chance of hitting every checkpoint than one that picks the right 3 points.
Good legal writing is often the opposite: surgically correct, concise, and willing to leave out what does not matter.
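A toy calculation shows the direction of the pressure. Assume, purely for illustration, that each of the 66 criteria is graded independently as hit or miss; the hit counts below are invented, not drawn from LAB’s actual rubrics or results.

```python
# Toy illustration of the verbosity bias, under the assumption that
# criteria are graded independently as hit/miss. Numbers are invented.

criteria_total = 66

# A kitchen-sink draft that recites every figure, clause, and date it can find.
verbose_hits = 58
# A concise, well-judged draft that leads with the strongest points only.
concise_hits = 30

print("verbose draft:", round(verbose_hits / criteria_total, 2))   # ~0.88
print("concise draft:", round(concise_hits / criteria_total, 2))   # ~0.45

# Nothing in an all-pass rubric rewards the concise draft's restraint,
# so the verbose draft wins on score even if a reviewing partner would
# prefer the concise one as a piece of advocacy.
```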
Quick takeaway.
- For builders, LAB is the closest thing to a reusable evaluation infrastructure for legal AI agents, and contributing tasks or auditing rubrics is a high-leverage way to shape what the category measures, particularly the non-US tracks, where the gap is widest.
- For legal teams, the 24 practice areas and task structure are a strong starting point for thinking about which agent capabilities you actually need to test in pilots. Use the structure even before the leaderboard arrives, but read the leaderboard, when it lands, as a measurement of US Big Law work, not legal work in general.
Read more: Harvey Legal Agent Benchmark
The case for independent benchmarks
3 benchmarks in a month is what a market looks like when the buyer side is paying attention and the seller side is responding to it. Each of these projects adds something the category did not have a month ago: Ivo’s transparency on artifacts, GC AI’s category breakouts, Harvey’s scale and open infrastructure.
The shared limitation is that all 3 are vendor benchmarks. Each one was scoped, designed, and graded by the company whose product sits on the leaderboard. That conflict does not invalidate the work, but it does mean buyers need at least one source in the mix that is not selling them anything.
That is the gap Legal Benchmarks fills. We have published 2 studies so far. The April 2025 Info Extraction research tested AI on real legal extraction tasks. The September 2025 Contract Drafting Benchmark compared 13 AI tools and a human lawyer baseline across in-house drafting tasks, scored double-blind by expert reviewers, with the methodology and limitations published in full.
The next round of independent research is in scoping. If you want to contribute real workflows, submit a tool, or help shape the methodology, reach out.
About the Author
Anna Guo
Anna is the founder of Legal Benchmarks.


