Putting AI to the Test in Real-World Legal Work

An AI evaluation report for in-house counsel

OVERVIEW

This AI benchmarking project aims to answer the question:

Is AI ready to handle in-house legal work, or is it all marketing hype?

Our goal is to promote a transparent understanding of how these AI tools perform in real-world conditions and to encourage responsible adoption of AI in legal work.

This is the first report in a series examining AI tools across common in-house legal workflows. In this edition, we focus on information extraction tasks, questions lawyers often ask of their own document sets (Info Extraction Tasks). We tested 6 AI tools using real world queries submitted by in-house counsel and compared the outputs of purpose-built legal AI tools with general-purpose AI tools.

Discover how these tools performed and learn what every legal counsel needs to know before trusting AI with legal work.

Loading chart...

AI tools tested, covering both legal domain and general-purpose AI tools

108

AI outputs reviewed and scored by legal domain experts

Common AI failure modes identified, revealing where AI struggles with legal tasks

1.1 Introduction

AI is entering the in-house legal domain quickly, but it's still unclear how well it performs on the work lawyers actually do. Most resources available to in-house lawyers today on AI performance in the legal domain are polished vendor demos, academic papers, or ideal-condition benchmarks, which don't reflect everyday workflows.

And they rarely answer the question:

Can the tool do the job, and how much human input does it need?

This report is a step in that direction of answering these questions.

We focused on putting AI to the test on Information Extraction Tasks first because it underpins much of legal work from contract review to issue spotting, and is often a starting point for legal AI adoption.

We designed the evaluation around 2 principles:

Real-world inputs. Tasks were submitted by in-house counsel, using documents with redactions and formatting issues.
Practical usefulness. We assessed not just accuracy, but whether the output was usable—clear, scoped appropriately, and supported by features like citations or multi-document processing.

The key findings are as follows:

Copilot scored the lowest in overall accuracy (38.9%), underscoring the need for caution when relying on AI assistants, even for seemingly simple information extraction tasks.

Legal AI tools did not outperform general-purpose LLM chatbots in accuracy. ChatGPT and DeepSeek answered 12 out of 18 tasks (66.7%) accurately, matching or exceeding the overall accuracy of legal AI tools.

Legal AI tools ranked highest on qualitative measures of usefulness. Across 3 dimensions-helpfulness, adequate length, and feature support- legal AI tools outperformed all general-purpose AI tools, making them more suited to the needs of in-house lawyers handling daily tasks.

2.1 The AI Assistants We Evaluated

We tested 6 different AI tools that fall into 3 broad categories: legal AI assistants, productivity AI assistants and general-purpose AI chatbots (together, the AI Assistants).

Legal AI Assistants

GC AI

GC AI is a legal AI assistant designed specifically for in-house counsel, offering a comprehensive suite of assistants to streamline legal workflows and take a multi-model approach. It provides polished first drafts, counseling documents, and real-time legal guidance with citations and web search capabilities. For this evaluation, we accessed GC AI through a free trial.

Vecflow's Oliver

Oliver is an AI-powered legal assistant developed by Vecflow and tailored for legal work. It integrates with existing knowledge bases and external data sources to enhance legal research and analysis. For this evaluation, we accessed Oliver through a free trial.

Productivity AI Assistants

Google's NotebookLM

Google NotebookLM, powered by Google's Gemini, is an AI-powered assistant that allows users to upload documents and ask questions about them. It acts as a virtual research notebook, summarizing facts from uploaded sources and explaining complex ideas. For this evaluation, we accessed the free online version of NotebookLM.

Microsoft Copilot

Microsoft Copilot (Copilot), powered by OpenAI's GPTs, is a general-purpose AI assistant integrated into the Microsoft Office suite. It uses powerful large language models to help users draft documents, analyze content, and answer questions based on their files and communications. For this evaluation, we used the free web version of Copilot.

General-Purpose LLM Chatbots

DeepSeek

DeepSeek is a general-purpose LLM that offers a free, private AI interaction experience without requiring registration or storing user data. It supports all languages and provides an interface for tasks like semantic search and content generation. For this evaluation, we accessed the free online version of the DeepSeek-V3 model.

OpenAI's ChatGPT

OpenAI's ChatGPT, powered by GPT models, is a general AI assistant with broad capabilities in content generation and question answering. For this evaluation, we used the paid subscription to ensure unlimited access to the full GPT-4o model.

2.2 Task Dataset

In-house counsel working in the United States, the United Kingdom, Singapore and China submitted real-world Info Extraction Tasks to this evaluation.

Each task contribution consisted of:

a user query (in the in-house lawyer's own words);
one or more source documents (e.g. contracts, privacy policies, terms & conditions, or regulations) from which the answer must be extracted; and
attributes for an accurate answer (Accuracy Attributes)

We selected a diverse mix of 18 Info Extraction Tasks from these submissions. Some key dimensions of variation in the tasks included:

Dataset Characteristics

Labels	Tags
Query Scope	Open-Ended; Narrow/Binary
Query Clarity	Clear; Ambiguous
Document Quality	High Quality; Low Quality
Document Scope	Single Document; Multiple Documents
Information Match	Clear Single Match; Multiple Matches; Ambiguous/Partial Match; No Match; Contradictory Matches
Extraction Complexity	Non-interpretative (simple data extraction); Interpretative (requiring legal/contextual understanding)

This dataset variety was intentional. It allows us to observe how each AI Assistant handles easy vs. hard questions, clear vs. vague instructions, single vs. multiple documents, etc., similar to how real queries would vary in the legal practice.

The dataset tasks can be accessed upon email request addressed to aguozy@gmail.com

2.3 Methodology

We developed a human evaluation framework to assess each AI Assistant's performance on each task. 2 evaluators (with in-house legal backgrounds) reviewed the outputs. The evaluation was structured in 2 rounds, which consists of an Accuracy Assessment and Qualitative Assessment:

Accuracy Assessment (Pass/Fail)

For each Assistant's output, evaluators independently and blindly assessed whether the response met the minimum level of quality a competent in-house lawyer would expect for a satisfactory answer (Accuracy Standard) based on the Accuracy Attributes. The Accuracy Standard is characterized by 3 factors:

Factual Correctness: The response must be accurate and free from material errors.
Relevance to the Query: The response must directly address the query.
Completeness: The response must provide enough information to address the query.

If the evaluators disagreed on whether the output met the Accuracy Standard, they discussed their reasoning, and the decision was escalated to another evaluator, also a legal practitioner, for final judgment. In total, each task was reviewed by a minimum of 2 and up to 4 lawyers with diverse legal backgrounds to ensure fairness and reduce individual bias.

Qualitative Assessment

In addition to the binary pass/fail, each output was reviewed by 2 evaluators and scored based on quality dimensions that matter in the legal practice:

Helpfulness (0 to 2 points): Did the answer help solve the lawyer's problem? This covers whether the AI Assistant went beyond copy-pasting text to provide the information in a useful format/ structure.
Adequate Length (0 to 2 points): Was the answer appropriately detailed, not too terse to be unhelpful, but not overly verbose?
Feature Support (0 to 2 points): Did the AI Assistant effectively leveraged any special features during either input handling or output generation to make the response more useful, accurate, or trustworthy?

These 3 dimensions - Helpfulness, Adequate Length, and Feature Support - together comprise the Usefulness Factors (Usefulness Factors).

Scoring System

Based on this evaluation framework, each AI Assistant was assessed across all tasks using 2 key metrics:

The number of tasks it answered accurately (i.e. where both evaluators marked the answer Pass), used to compute an accuracy rate (percentage of tasks passed)
An overall performance score, which combines accuracy and qualitative metrics. Each passed task earned 6 points for meeting the Accuracy Standard (0 for a Fail), and up to 6 additional points were awarded based on the Usefulness Factors, yielding a maximum of 12 points per task.

This scoring system gives accuracy the greatest weight, as a pass is worth as much as all 3 quality dimensions combined. This reflects the feedback from in-house counsel we spoke to.

3.1 Accuracy Assessment Ranking

In this Accuracy Assessment, NotebookLM, an AI Assistant specifically designed for analyzing user-provided documents, came out on top, successfully meeting the Accuracy Standard for 14 of the 18 extractions.
Close behind were ChatGPT, DeepSeek, and Oliver, each of which answered 12 tasks correctly. GC AI was a bit lower with 10 passes, and Copilot lagged with only 7 correct tasks.

Loading chart...

3.2 Overall Performance Ranking

Looking at accuracy alone is not the full story. A simple pass/fail doesn't capture how the answer was given. In practice, 2 AI assistants might both be correct (earning a pass), but one assistant's answer might be phrased much more clearly or provide useful citations, making it far more valuable to the lawyer in daily routine work.

When we factored in the Usefulness Factors, the overall performance ranking of AI Assistants shifts significantly. Out of a theoretical maximum of ~432 points, the rankings are as follows:

Loading chart...

In this composite score, Oliver and GC AI, the 2 legal AI Assistants, ranked highest overall, despite NotebookLM achieving the most passes for accuracy. Their advantage lay in stronger performance on the Usefulness Factors, especially through clear, concise answers, pinpoint citations, and robust handling of multi-file queries. While NotebookLM also supported features like citations and multi-file input, its responses were occasionally too long or less structured. ChatGPT and DeepSeek lost points for overly verbose or generic outputs that lacked legal-specific framing.

Copilot, though powered by GPT models, underperformed on both accuracy and quality factors.

3.3 AI Assistant Strengths

Across tasks, all AI Assistants showed 2 clear strengths:

Enhanced Search Capabilities: Acting like a supercharged Ctrl+F, AI Assistants are able to quickly find answers when questions are clear and narrow and answers are in one document, and even understand semantically similar phrases, allowing them to pull answers that a simple keyword search might miss; and
Consistent Formatting: AI Assistants consistently produced structured first drafts that help lawye rs skip the drafting step and jump straight to review.

For example:

In Task 1, which asked each AI Assistant to extract the contract term clause in an MSA, all 6 AI Assistants correctly identified the "Term and Termination" clause.
If asked to provide clause citations or references, they would include clause numbers or headings from thedocument.
In Task 16, when asked to extract the governing law for multiple contracts in a table format, the AI Assistants that successfully processed all 11 documents were able to present the output in structured tables and accurately identify applicable law, even when explicit "governing law" language was absent. This interpretive strength is valuable but can become a risk when it overreaches and infers answers not grounded in the text (see discussion in the next section).

Through conversations with legal teams, we identified the following practical us e cases where AI assistants are already being successfully deployed for information extraction tasks:

Practical Use Cases

📝 Clause and Definition Retrieval

Quickly locate specific clauses or defined terms (e.g., termination, governing law) across contract libraries.

Commonly used by: legal teams

3.4 AI Assistant Weaknesses

While AI Assistants can streamline certain legal workflows, they frequently fail in 6 scenarios that legal teams should anticipate. Understanding where AI goes wrong and how to work around it is essential for in-house counsel seeking safe, efficient use of these tools.

Failure Mode 1: AI struggles with open-ended questions

When prompts are vague, broad, or lack clear boundaries, AI often returns incomplete answers.

Example (Task 13): The prompt instructed the AI Assistant to: "Identify all the commitments the Agent makes regarding regulatory compliance and conduct." Only 2 of 6 tools (NotebookLM and ChatGPT) correctly extracted all the relevant clauses from across the agreement. The others produced partial results, each omitting at least 1 major section.

Why it matters: AI may miss key clauses if the task isn't narrowly scoped, potentially leading counsel to overlook important rights or obligations and provide partial answers.

Failure Mode 2: AI hallucinates when the answer is missing

When the required information isn't present, but the question suggests it should be, some AI tools fabricate an answer rather than admit uncertainty.

Example (Task 9): In Task 9, 4 tools fabricated or misclassified the definition for "Critical Data", a term not defined in the source regulation.

Example (Task 10): In Task 10, 2 tools fabricated the liability cap amount and/or the time-based qualifier, despite the actual figures being redacted in the contract.

Task 10 Query: Extract the rules from the following limitation-of-liability clause. Specifically: 1) What is the liability cap and are there any special time-based provisions affecting it?

Liability Cap and Time-Based Provisions:

• The total cumulative liability of either party under the agreement is capped, but the specific cap amount is redacted

• There is a time-based provision affecting the cap:

If a claim arises within a certain period after the Effective Date, the liability cap is equal to the total amount of payments projected to be made by the Customer during that period

Correct Response (Pass)

Incorrect Response (Fail)

Why it matters: Confident, fabricated answers can easily slip past review and lead to incorrect legal advice or risk assessments.

Failure Mode 3: AI is error-prone when handling multiple documents

Despite marketing claims, most AI Assistants cannot properly and consistently analyze or extract information across multiple files, even when the documents are short and the task is seemingly simple.

Example (Task 16): In Task 16, where none of the AI Assistants succeeded in accurately extracting the governing laws from 11 contracts, it's worth noting that even for AI Assistants where the documents were within their context window limits, the provided answers were incomplete and/or partially inaccurate.

Why it matters: Multi-doc review is a standard part of legal work, and AI's inability to handle it reliably limits its usefulness in real-world scenarios.

Failure Mode 4: AI mirrors user assumptions in leading questions

If a question contains a false premise, AI Assistants tend to reinforce it rather than verifying it against the document.

Example (Task 6): When asked, "What are LinkedIn's indemnity obligations?" in a Linkedin MSA, only GC AI and Oliver correctly answered that LinkedIn had none. Other AI Assistants incorrectly described vendor indemnities as LinkedIn's, or referenced unrelated clauses. In a follow-up test, when asked a neutral version ("What are the indemnification obligations of the parties?"), all AI Assistants gave the correct answer.

Task 6 Query: What are clauses that detail the LinkedIn's indemnity obligations?

The indemnity obligations in LinkedIn's Master Services Agreement (MSA) are detailed under the Indemnity section.

Correct Response (Pass)

Incorrect Response (Fail)

Why it matters: Leading prompts can steer AI toward incorrect interpretations that look convincing on the surface.

Failure Mode 5: AI fails when technical or file-handling issues prevent content access

Some AI Assistants fail not because of poor reasoning but due to technical constraints, like file format issues, OCR failures, upload limits, or content filters.

Example (Task 16): In Task 16, tools hit file upload limits (e.g., Copilot allows only one file per query, GPT up to ten), which required AI Assistants to extract the governing law from 11 contracts. DeepSeek could only process 84% of the combined input.

Example (Task 14): In Task 14, Copilot refused to process a file entirely, likely due to content filters misclassifying it as sensitive.

Example (Task 7): When faced with low-quality scanned PDFs, some tools confidently returned incorrect answers, even when the underlying text was unreadable, instead of flagging OCR errors.

Task 7 Query: What's the total value contract value?

The total fixed contract value is $1,860,561 as stated in Article B.2. PRICES of the contract document.

Correct Response (Pass)

Incorrect Response (Fail)

Why it matters: Technical limitations can prevent AI from accessing or processing content, leading to incomplete or failed analysis regardless of the AI's capabilities. AI tools may silently skip unreadable content or refuse to process a file without pointing it out or explaining why, leaving lawyers to troubleshoot issues often without knowing whether the content was ever reviewed.

Failure Mode 6: AI struggles with contradictory information

When multiple conflicting references appear in the source, AI assistants may extract one and ignore the other, failing to recognize or flag the discrepancy.

Example (Task 11): When asked to extract the effective date from a contract that did not explicitly define it but mentioned 2 different dates in separate sections (each potentially valid depending on interpretation), some AI assistants returned only 1 date without acknowledging the other.

Why it matters: If an AI assistant doesn't flag conflicting information, a lawyer might take the answer at face value and miss a critical ambiguity.

3.5 General AI vs Legal AI

Do "purpose-built" Legal AI Assistants outperform gen eral-purpose ones?

Our findings suggest that while general-purpose tools can match legal AI tools in raw accuracy, purpose-bu ilt legal AI assistants often deliver more value where it matters most: in usabi lity and workflow integration.

Their advantage lies not just in what they generate but in how well they support the way legal teams actuall y work.

General-purpose AI tools can match or exceed legal AI tools in the accuracy

of the generated text.

General-purpose LLM chatbots like ChatGPT and DeepSeek performed just as well as the legal AI Assistants on the accuracy of the Info Extraction Tasks. NotebookLM, a productivity AI Assista nt not specifically optimized for legal use, achieved the highest accuracy score overall.

Legal AI tools offer stronger usability features for legal workflows.

Legal AI Assistants like GC AI and Oliver stood out by offering source-linked an swers, multi-document suppor t, and structured outputs tailored for legal review, features that streamline in -house workflows beyond just providing an accurate text output.

Loading chart...

Notably, GC AI delivered the answers with the most a ppropriate length and the ov erall most helpful answers out of all the AI Assistants, and when all 3 Usefulne ss Factors were considered together, the 2 legal AI assistants (GC AI and Oliver ) scored the highest and outperformed the general LLM applications.

As accuracy gaps narrow, usability and integration will drive the next wave

of legal AI adoption.

While current differences in accuracy between AI Assistants are still visible, t hese gaps are likely to clos e quickly as vendors adopt more advanced and powerful LLMs. As accuracy becomes a baseline, the real differentiators will shift to usability, workflow integrati on, and support. Features like an intuitive interface, integration with email or document systems, strong data security, and responsive support will increasingl y define which tools deliver real value to legal teams.

In-house legal teams evaluating AI tools should look beyond model accuracy perfo rmance today and focus on wh ich platforms will streamline legal work and scale with their needs tomorrow.

4.1 Conclusion

In a legal industry where AI is said to be replacing lawyers, where legal tech companies are raising record-breaking rounds, and where marketing claims often outpace reality, most in-house counsel are simply trying to keep up with what's happening while they're busy with limited time.

That's why we ran this independent evaluation: to offer a grounded, practical vi ew of how legal AI tools actually perform in real in-house legal tasks, and how much human oversight they still require.

So can AI do the job? In many cases, yes, and sometimes surprisingly well. B ut it still needs legal professionals to frame queries clearly, interpret ambiguous answers, and verify outputs. Human judgment remains essential, but AI is increasingly capable of lightening the load.

We hope these findings help legal teams cut through the hype, adopt AI more conf idently, and shape a more transparent, informed future for fellow lawyers.

If this mission resonates with you, we invite you to join us in shaping the future of legal AI adoption.

Limitations of the Study

Limitations

Narrow Task Scope

We tested only 18 Info Extraction Tasks. Broader legal tasks like legal research , drafting or redlining were not assessed, so findings may not generalize beyond similar use cases.

English-Only Evaluation

All tasks used English documents. Real-world scenarios often involve varied lang uages. We did not test multi lingual capabilities, even though some vendors offer them. One task involved bac k-translation to Chinese, but was offered in English.

Snapshot in Time

Results reflect AI performance as of early 2025. With rapid AI development, accu racy and features may evolve significantly within months.

Subjectivity in Human Evaluation

To ensure objectivity, each output was blind-reviewed by 2 independent human eva luators using a standardized rubric, with disagreements resolved by a 3rd reviewer. We did not conduct LLM-b ased reviews due to resource constraints and their known limitations, such as in consistent judgments, prompt sensitivity, and limited legal domain expertise.

Broader Factors Not Assessed

This evaluation did not assess the following broader dimensions of AI platforms:

Security & Privacy (e.g. data handling, storage, compliance with privacy laws)
Governance & Assurance (e.g. explainability, auditability, alignment with stan dards)
Pricing & Value (e.g. cost-effectiveness, pricing models)
Support & Reliability (e.g. uptime, vendor responsiveness)
Trust & Safety (e.g. bias, misuse risks)

These areas are critical for enterprise adoption, and we aim to consider them in future reviews.

No Multi-Turn Dialogue

Each AI Assistant had one shot per task. In practice, results may improve with p rompt iteration and continuo us dialogue. Our findings reflect a "cold start" scenario.

Limited Legal AI Vendor Coverage

We focused our evaluation on 2 legal AI vendors: GC AI and Vecflow (Oliver). We' re grateful to both for supp orting our independent review. While many other legal AI vendors were keen to pr ovide demos, several declined to participate in a structured evaluation, preferr ing to self-publish performance results or would only offer results collected by themselves, which we did not accept.

We are practicing lawyers and real users who utilize these tools in our day-to-d ay work. Our aim is to share honest, hands-on insights to help other lawyers like us better understand and a dopt these technologies. While this assessment doesn't replicate the formality o f a "Michelin-style" review or academic research paper, we remain committed to t ransparency, independence, and practical relevance.

We welcome legal AI vendors that believe in fostering transparency, informed ado ption, and open dialogue in the legal community to participate in future evaluations.

5.1 Next Steps

This benchmarking exercise is just a starting point. Going forward, we plan to:

Expand the evaluation to cover other legal functions and include new AI assist ants
We will gather additional tasks to test AI capabilities more broadly, with the next task category being su mmarization tasks, and publish updates to this report when relevant

We invite feedback and collaboration with legal practitioners, researchers, and vendors. If there are import ant use cases we missed or if a vendor believes their assistant would excel in o ur tasks, we welcome the opportunity to evaluate it. A more open and transparent benchmarking culture in the legal industry will benefit everyone.

5.1 Contributors

This project was made possible by legal professionals and AI engineering experts who generously contributed their time, expertise, and thoughtful feedback. Many were directly involved in shaping the methodology, participating in the evaluations, and/or reviewing AI Assistant outputs, all on a volunteer basis. We are deeply grateful.

Contributors

Gabriel Saunders

Director of Legal Ops

Marc Astbury

CPO at Jenni AI

Hui Xin Tan

Senior Legal Specialist

Mariette Clardy

Assistant General Counsel

Mathias Bock

Legal Consultant & Angel Investor

Patrick Gong

Lead Associate at King & Wood Mallesons

Rachel Chew

Senior Legal Counsel

Rodney Yap

Legal Technologist

Tan Xuan Ming

Legal Counsel

Uri Barak

Associate

Wei Yee Tan

Contract Manager

Advisors

Chris Holland

Partner

Jason Tamara Widjaja

AI Executive Director

Jordan Dea-Mattson

CPTO

5.2 Core Team

Anna Guo · Co-Author

Legal Counsel

Anna is the legal counsel at a health tech startup based in Singapore. She has both private practice and in-house experience in both Chinese (Ant Group) and US tech companies (Google), plus a stint as a startup founder herself. She is interested in exploring how technology is transforming the legal profession.

Arthur Souza Rodrigues · Co-Author

Securities and Technology Attorney

Arthur is a securities and technology attorney based in New York. He's the author of Prompting Techniques for Lawyers and several other AI-related projects, providing advice to various legal-tech entrepreneurs. Formerly at Carta, O'Melveny, and USP, Arthur currently serves as sole counsel for an education platform. A proud Wolverine with a passion for Portuguese language and Lusophone culture, he is developing a PII-removal tool called Tucano Voraz. Visit his blog for more.

The content of this report is made available under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Cite This Report

APA

Guo, A., & Rodrigues, A. (2025). Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel. Retrieved from https://www.legalbenchmarks.ai/research/phase-1-research

MLA

Guo, Anna, and Arthur Souza Rodrigues. "Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel." 2025. Web. Nov 17, 2025.

CHICAGO

Guo, Anna, and Arthur Souza Rodrigues. "Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel." Accessed November 17, 2025. https://www.legalbenchmarks.ai/research/phase-1-research.

BibTeX

@techreport{guo2025ai,
  title={Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel},
  author={Guo, Anna and Rodrigues, Arthur Souza},
  year={2025},
  url={https://www.legalbenchmarks.ai/research/phase-1-research},
  note={Accessed: 11/17/2025}
}

Note: This report is freely available for academic and professional use. Please cite appropriately when referencing this work in your research or professional materials.