How many frogs do you need to kiss to find the right legal AI vendor?

To give you some context on my perspective: I work in private practice at a Romanian law firm, handling Competition, Commercial, TMT, Consumer, and Transactional matters, while also advising on the AI regulatory landscape — including EU AI Act readiness and governance framework design. Early on in my career (how has it already been 10 years?!), I made a point of being a jack of many trades; this shapes how I now experience AI's value.

Every legal AI vendor demo tells a beautiful fairytale. While I don't doubt that happily-ever-after exists for some, I've found that the story on the screen doesn’t always match the work actually sitting on my desk. Here's what that experience has taught me.

Playbook Tools on Commercial Contracts

Let’s look at the absolute darling use case of legal AI demos: Contract Playbooks. The day when I had to sit down and review 20 NDAs never came. Nor did the day come when I had to review the same contract for the nth time. I can see the tool work in other practice areas. I can absolutely see how amazing a tool this could be for in-house work, and we can draft playbooks for clients to use in specific scenarios. However, when it comes to my own use cases, there’s a bit of a mismatch.

The jurisdictional typology matters

Maybe there are jurisdictions out there where contracts are beautifully standardized across the board. With US liability clauses, you get different flavors of the same ice cream vendor. In other jurisdictions, you expect to eat ice cream, and sometimes you get ice cream, sometimes you get gravel. Playbook tools are built on a premise of standardization: the idea that contracts of the same type look similar enough that a stored position can be applied reliably. In some jurisdictions, that’s a reasonable baseline. In others, it isn’t.

For the same type of transaction, I might see a massive difference between vendor contracts. One could be a breezy 10-page document, while another is 110+ pages. Take a standard product supply deal, for instance: one counterparty might throw 10 different annexes at you (General Terms and Conditions, Pricing and Payment Terms, Logistical Delivery Protocols, Product Specifications, Brand Guidelines, DPA, ABC, Supplier Code of Conduct, ESG… you name it). Meanwhile, another buyer hands you a bare-bones 10-page contract with zero schedules, covering only the essential terms. The variation is so vast that a "standard position" rarely exists.

The ghost of the Civil Code

This variation is partly driven by the growing influence of Anglo-American legal drafting over recent decades, as exhaustive, define-everything, leave-nothing-to-chance templates have been transplanted into local commercial Civil Law deals.

Civil Law contract review comes with a lot of baggage, and there's only so much of it you can bring into an AI agent's context window. You don't just look at the text written within the document itself. The ghost of the Civil Code always looms above you; the visible text is deeply intertwined with the mandatory and suppletive rules of the jurisdiction. Because of this, an AI tool cannot simply evaluate and overwrite clauses in a vacuum. Any playbook modifications must be compatible with this invisible legislative background. Otherwise, enforcing a standard playbook position risks violating mandatory legal provisions, which could render the newly drafted clause entirely null and void.

Furthermore, a playbook position developed against one template may be technically correct in isolation, but misapplied in a completely different contractual architecture — the AI finds the analogous clause, applies the stored position, and misses that the new agreement distributes the same legal concept across three separate provisions.

Sometimes, a "bad" contract is a good strategy

Sometimes, it is entirely in the client’s favor to sign on the counterparty’s template. In these cases, applying a playbook might lead to negotiating on points you didn’t really have to. Strategic legal work isn't just about matching a standard; it’s about knowing when to leave well enough alone.

Drafting a playbook takes time

"But wait," the vendor says, "Our AI can look at your past negotiations and build the playbook for you!". The jurisdictional variety of contracts makes this difficult, too.

The problem runs deeper than simply extracting conclusions from redlines. Negotiation positions aren't context-free rules: they're decisions made within a specific contractual architecture and under the legislation applicable at that time. What you accepted on a 10-page contract may reflect a completely different strategic calculus than what you pushed back on in a 110-page agreement. In my experience, it’s incredibly difficult for the AI to distinguish between 'I left this clause as is because it's acceptable' and 'I left this clause as is because the overall balance of this specific contract made it the right call.' It might also identify a blunt modification, such as the sudden addition of a termination clause, yet fail to catch a subtle change in a definition that carries significant legal consequences for the contract as a whole.

There is such a thing as too much red

In practice, my goal isn't to destroy a counterparty's clause and replace it entirely with a playbook equivalent. I want to adapt the substance of the existing clause to achieve the desired effect, modifying the original text only to the extent necessary. Even if instructed to 'minimize changes,' a playbook anchors to an endpoint—the target rule or stored clause. For non-standard contracts, when the counterparty's wording sits far from that anchor, in my experience, the tool doesn't patch it; it regresses to its own example, rewriting the clause to match the playbook. I prefer working dynamically with the AI through a Word add-in's chat in these scenarios, which produces better results for me because the instructions can be tailored to the specific contractual context at hand.

Volume matters

The volume of comparable contracts you handle repeatedly is a massive variable in how useful playbooks are — it shifts based on jurisdiction, law firm size, and practice area. Some practice areas hit the kind of one-type contract volume where playbooks make perfect sense. Others don't. This might also mean that a single AI vendor might not be the solution for an entire law firm — you might actually need different solutions for different practice areas. I know. A headache-inducing idea.

Due Diligence tools for Commercial Contracts

Let’s move on to the next demo sweetheart: Due Diligence tools.

In the demo, these tools are presented as magical scanners that analyze 100 contracts at once, instantly extracting risks, key clauses, and liabilities. Same issue as for playbooks. They are built on the premise of standardized agreements. If a US indemnity clause always vows to indemnify and hold harmless, in my jurisdiction, getting to the same legal effect is more of a write-your-own-vows situation — same commitment, completely different words.

For the LLM, “to indemnify and hold harmless” is a hook; it leads directly to where it should look and analyze. When there's no standard phrasing to match, it's harder for the tool to arrive at the correct answer. I'm not saying LLMs only match exact keywords — but standard phrasing is a powerful shortcut: easy to spot in the document, and seen so often in training that the model knows what it signals. If you’ve ever thought, "It took me longer to review the tool's analysis, and I ended up redoing it almost from scratch"—been there, done that, I feel you.

The prompting illusion

Due diligence tools for commercial agreements are built on the idea that the same prompt will work for the whole bunch of contracts. But if your batch contains a 10-page contract with no liability clause, where statutory law applies, and a 110-page agreement where the same concept is distributed across 20 different provisions, in my experience, the prompt that works for one will miss half of what matters in the other.

No matter how incredibly detailed or clever you make your master prompt, it simply won't perfectly fit every contract in the batch. If you add a specific qualifier to fix an extraction issue for Contract A, that same highly-tailored nuance might actively break the extraction when the tool applies it to Contract B.

For me, right now, interacting with the contract dynamically in a chat environment works better for due diligence on non-standard commercial agreements, given the vast variation in contractual architectures.

I need this approach for several reasons. First, I have to identify risks that stem from what is entirely missing from the agreement—an area where, in my experience, AI tools still struggle. Second, I have to manage the fact that AI is sometimes over-dramatic, confidently flagging an issue as “high risk” when, in practice, no lawyer would ever meaningfully raise it in a due diligence report. Let’s not forget that sometimes the mistakes have nothing to do with deep legal reasoning—they are basic extraction failures where the AI confidently gives you the wrong contract number or an incorrect effective date simply because the document wasn't parsed correctly.

Because these and other blind spots must be addressed, I need a setup that lets me view the agreement side by side with the chat, verify the sources, nuance the conclusions, ask follow-up questions about specific provisions, and iteratively refine the assessment. It’s not enough for me to see the document and the AI’s conclusions. If the conclusions are inaccurate, I need to be able to work with the tool to easily reach the correct ones. The path to the correct answer may vary drastically from one contract to the next, meaning these nuances simply cannot be addressed by modifying a master prompt applied to an entire batch simultaneously.

Once the substantive review is completed, the workflow can populate the DD report format. I assess tools on how well they assist me in my process.

Output length matters

DD tools with short outputs may be built this way to mimic the final format of a lawyer’s actual due diligence report, which can be written concisely. However, behind a lawyer's brief three-sentence summary is a massive amount of complex legal reasoning.

"But the DD solution also does extensive reasoning in the background! It just displays a short answer." I’m not saying it does not. The issue is that there is a possibility for crucial legal reasoning to get completely lost in the summarization process; a nuance actually relevant to the analysis might be trimmed out.

A model's output limit has, in my experience, a material impact on response quality and accuracy across all legal AI applications, including chat assistants, legal research, and due diligence. For example, when your input consists of a very large document or multiple documents and a complex request in the chat assistant, a low output limit forces the AI to artificially compress its answer. This can result in incomplete, and therefore potentially inaccurate, responses, simply because a comprehensive, correct answer would have required substantially more text to formulate. How efficiently a tool manages its output is a highly practical metric that I consider when evaluating vendors. It must have a high enough output limit to deliver comprehensive answers for complex tasks, while remaining concise when the task demands a short answer.

In Due Diligence tools, having an agent layered on top of the agents doing the batch review may help with the iteration needed to get to the correct response. In my experience, though, it might not be the same thing as a dedicated chat assistant analyzing a single document: an agent sitting on top of a massive DD tool processing dozens of files has entirely different, competing information in its context compared to an assistant focused on just one agreement.

Don’t let perfect be the enemy of good

Playbooks and due diligence are the two demos vendors lean on hardest, and, in some jurisdictions, both break down in the same place: they assume standardized, one-type, high-volume work that isn't what some of us do.

Just because a tool doesn't quite live up to the glossy demo doesn't mean I will abandon using AI for that use case altogether, or that the only way to evaluate the tool is by comparing it to vendor claims. A mismatch between the demo and reality can often stem from differences in jurisdiction or contract typology.

If the shoe doesn't fit, I’m not gonna chop off my toes. I’m mixing fairytales, but you get it. I’m not going to force legal work to fit the solution. Instead, I’m gonna map out the actual steps of my legal reasoning and ask: Can the vendor’s solution actually add value along my reasoning chain?

Here are some ways I structure my own evaluation process to test a tool against my daily reality.

Legal AI Evaluation Framework by Legal Benchmarks - Make it a living document

Building a vendor assessment matrix from scratch is a massive, time-consuming drain on a firm's internal resources. That’s why the Legal AI Evaluation Framework by Legal Benchmarks is such a lifesaver.

However, the framework can’t be a static checklist. It must be a living document, constantly tailored to the firm, its specific jurisdiction, its internal workflows, and the lessons learned from past tool evaluations.

How do you make it yours? Below, I’m going to break down some of the ways I have customized the Framework to fit my needs.

Factual accuracy of vendor tools

On a general note, legal AI tools need built-in contingency mechanisms. The relevant question, in my opinion, is not whether an AI system will make mistakes—every AI system does—but what happens when it does. I assess whether a tool is designed solely to deliver an answer—through extraction, summarization, review, or other legal tasks—or whether it is also designed with the assumption that some answers will inevitably be wrong.

In particular, I look at whether the developers have taken into account the reality that lawyers must verify the outputs — how easily a user can identify a potentially incorrect answer, trace it back to its source, understand the reasoning that led to it, and override or correct it where necessary. I also evaluate how seamlessly that correction can be incorporated into the broader legal workflow.

The Framework assesses the "factual accuracy" of vendor tools: specifically, that outputs must not contain fabricated clauses or citations. In my experience, what constitutes factual accuracy depends entirely on the specific task and the tools the AI actually has access to at that moment. (Note: For this part, I am setting aside models strictly post-trained on a specific law firm's curated data or local legislation. I mean true post-training of the model, not RAG or database connectors.)

Chat Assistant without tools (No database connection, no uploaded documents, no web search, no legal research tool).

Most common issue: you ask a legal research question, and the LLM responds by generating fabricated legal content: it invents non-existent legal provisions, misidentifies article numbers, attributes provisions to incorrect legal acts, and cites case law that does not exist.

I have seen models fail accuracy assessments simply because they were tested on legal questions without being connected to any legal databases, documents, or web searches. When errors occur in these cases, they stem from the fact that the model is forced to respond exclusively based on its training data.

In this "naked" configuration, you run into three limitations:

The training data cut-off: A model is bounded by its training data. If its cut-off is April 2023, it is blissfully unaware of any legislative changes or case law that happened after that date.
The "stochastic" problem: LLMs don't memorize data verbatim; they predict text. Even if the AI technically ingested your Civil Code during training, it doesn’t mean it will quote it accurately. It might confidently Frankenstein two different articles together simply because they look similar.
The "people pleaser" syndrome: Because of the way LLMs have been post-trained, they have become hyper-responsive, helpful assistants. They would much rather confidently invent a provision than admit they don't know the answer. Yes, vendors try to engineer the models to identify when they hit the limits of their knowledge, but overcoming the AI's inherent urge to "just give an answer" remains incredibly difficult to calibrate.

If you are using an AI with no tools, no documents, and no web access, I would evaluate it on rephrasing text, drafting emails, and structuring memos—tasks where the AI works exclusively with the information you provide in the prompt. I wouldn’t evaluate it for its legal research capabilities.

Chat Assistant with Uploaded Documents (grounded responses)

You upload a stack of documents into the chat. The full text of your uploaded documents is rarely copied integrally into the model's active context. Instead, tools typically use a form of RAG (Retrieval-Augmented Generation) to store the documents in vector format for fast semantic search.

This means only the specific snippets retrieved in response to your prompt actually enter the active context. Therefore, factual accuracy is still not guaranteed—it depends heavily on how well the system matches the request in the prompt to the document information.
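To make the retrieval step concrete, here is a minimal sketch of how snippet selection can miss the passage that matters. It is not how any particular vendor's tool works: real systems use learned embeddings rather than the crude word-overlap score below, which exaggerates the problem. The chunk texts and the names `tokens`, `score`, and `retrieve` are hypothetical, invented purely for illustration.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text: str) -> Counter:
    """Lower-case word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(query: str, chunk: str) -> float:
    """Cosine similarity between the word-count vectors of query and chunk."""
    q, c = tokens(query), tokens(chunk)
    dot = sum(q[w] * c[w] for w in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks by score; only these would reach the model."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_k]


# Hypothetical contract chunks (assumption: invented for this example).
CHUNKS = [
    "The Supplier shall indemnify and hold harmless the Buyer against third-party claims.",
    "The Seller shall bear any damage caused to the Purchaser by defective goods.",
    "Invoices are payable within thirty days of receipt.",
]

# Standard phrasing in the query lines up with standard phrasing in the text.
standard_hit = retrieve("indemnify and hold harmless", CHUNKS)[0]

# The second chunk has a similar legal effect but shares no words with this
# query, so under this toy scoring it gets no credit at all.
bespoke_score = score("who pays for losses from faulty products", CHUNKS[1])
```

Embedding-based search narrows this gap considerably, but the dependency is the same: what the model sees is decided by how well the prompt's wording lines up with the document's wording.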

Then we have the context window itself. Yes, capacities have increased significantly, but a model can still reference only what is within that active window. Once the limit is reached, earlier information simply drops out.

If you upload a large number of documents or let a conversation drag on for too long, there is a possibility for the AI to literally cease to "see" earlier documents or your previous instructions.
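A toy sketch of that drop-out effect, assuming the simplest possible strategy: keep the most recent messages that fit a fixed budget and discard the rest. Actual tools may summarize or compress older content instead, and they count real tokens rather than words; `fit_to_window` and the sample conversation are hypothetical.

```python
def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude proxy: one token per whitespace-separated word
        if used + cost > max_tokens:
            break  # everything older than this point is no longer visible
        kept.append(msg)
        used += cost
    return list(reversed(kept))


# Hypothetical conversation, oldest message first.
conversation = [
    "Instruction: always cite the article number.",
    "Document A: supply agreement, 2021.",
    "Document B: amendment no. 1, 2023.",
    "Question: which liability cap applies today?",
]

visible = fit_to_window(conversation, max_tokens=12)
# With a 12-word budget only the last two messages fit, so the original
# instruction and Document A have silently left the window.
```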

The practical reality is that there is a hard upper limit on the volume of documents and context you can use simultaneously. As you add more information to a single chat, the AI's reasoning quality may deteriorate. Cross-document confusion spikes, hallucinated errors increase, and outputs become incomplete.

Practical illustration:

Consider this scenario: drafting a complex insurance opinion that requires the analysis of three national laws, two EU regulations, and six regulatory authority guidelines. Expecting the model to analyze all these legal provisions in a single chat and to answer a complex legal question, one requiring the corroboration of all materials in a single pass, is, in my experience, an overextension of current model capacity. To succeed, the model would need to simultaneously process legal definitions across multiple documents, identify cross-references, apply hierarchy rules, and navigate terminological inconsistencies between statutory texts and regulatory guidelines.

The core issue arises precisely when these tools assess highly complex legislation. In Romanian insurance legislation, for example, correctly interpreting a single provision often requires correlating it with seven other interconnected articles. When forced to do this, in my experience, the AI frequently fails to interpret the legal framework accurately.

In my experience, legal AI tools perform better on EU legislation than on Romanian national law. This could be due to the volume of training data. Foundational models haven't had access to the same datasets of Romanian legislation and jurisprudence as they have for broader EU frameworks. It could also be related to legislative drafting history. When laws are amended countless times over the years, and a new definition or article is introduced, the legislator doesn't always review the entire text to ensure the new provision is perfectly correlated with the rest of the law. As an experienced lawyer, you know the historical context—you know the legislative intent was, for example, to apply an existing sanction to that newly introduced term, even if the drafting is imperfect. In my experience, LLMs struggle when the legislative text isn't perfectly correlated. If the literal text is disjointed and fails to explicitly link the concepts, the model concludes that the rule does not apply.

In these scenarios, I don’t expect 100% factual accuracy, I don’t expect the responses to be free of fabricated citations (if you push it hard enough, it will break and invent provisions even with access to legal text), and I don’t expect 100% correct legal interpretations on complex matters. I do compare tools to see which perform better on these tasks, but I usually take a different, modular approach. I split tasks into smaller projects to ensure per-document accuracy, and I evaluate not only how good the tool’s legal reasoning was, but also how much faster it makes mine. If, for the correct legal answer, I have to corroborate 7 legal provisions, what I need is a fast sounding board: an extraction tool that follows my instructions, keeps the law in the platform with sources for the legal provisions and highlights showing where the information was extracted, summarizes as I direct, captures the steps in my legal reasoning, and writes my conclusion in my style. Sometimes, the way the LLM gets the answer wrong makes me see what the right answer is.

How I test for factual accuracy in responses grounded in documents

I use a sequential approach: start with the simplest configuration and progressively crank up the complexity.

Level 1: The baseline

I start with one document. At this stage, I test pure factual extraction—not legal interpretation.

Extract specific data points (e.g., current administrators, share capital) from a trade registry excerpt.
Extract existing clauses, parties, effective dates, and pricing from a single contract.
Identify specific deadlines or notice periods explicitly stated in the text.
Extract the precise legal definition of a term from a statute.
Cross-reference a sanctioning provision with a penalty provision within the same law.
Extract the final ruling and the parties involved from a single court decision.

This is just how I establish my baseline. In my experience, if a tool struggles with this simple data extraction, it’s usually not worth moving it to the next stage. (The obvious caveat: document length and information density still matter. Treating the entire Civil Code as one document raises its own architectural challenges.)
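One way to keep this baseline honest is to score the tool's extraction against an answer key a lawyer has verified by hand. The sketch below assumes the tool's output has already been copied into a simple field-to-value mapping; the field names, the sample values, and the 0.9 pass threshold are all hypothetical choices, not part of any framework.

```python
def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial formatting differences don't count as errors."""
    return " ".join(value.split()).casefold()


def score_extraction(expected: dict[str, str], extracted: dict[str, str]) -> dict:
    """Compare tool output to a lawyer-verified answer key, field by field."""
    wrong = {
        field: (want, extracted.get(field))
        for field, want in expected.items()
        if normalize(extracted.get(field) or "") != normalize(want)
    }
    return {"accuracy": 1 - len(wrong) / len(expected), "wrong": wrong}


# Hypothetical answer key for one contract, checked by hand.
expected = {
    "contract_number": "127/2023",
    "effective_date": "2023-03-01",
    "notice_period": "30 days",
    "party_a": "Alpha SRL",
}
# Hypothetical tool output: day and month swapped in the effective date.
extracted = {
    "contract_number": "127/2023",
    "effective_date": "2023-01-03",
    "notice_period": "30  Days",
    "party_a": "Alpha SRL",
}

PASS_THRESHOLD = 0.9  # assumption: pick whatever bar fits your own practice
result = score_extraction(expected, extracted)
ready_for_level_2 = result["accuracy"] >= PASS_THRESHOLD
```

Exact matching after light normalization is deliberately strict. It suits dates, numbers, and party names, but it would be too blunt for free-text fields such as clause summaries, which still need a human read.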

Level 2: The step-up

Once the baseline is established, I increase the complexity to see how the AI handles basic correlation:

Extract and compare data points from several trade registry excerpts.
Correlate provisions between two different statutes.
Identify the specific textual changes between Version 1 and Version 2 of a contract.
Assess the legal implications of an amendment to the contract.
Cross-reference a main agreement with its attached schedule or annex to ensure definitions align.
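For the version-comparison test in particular, a deterministic text diff gives a ground truth to check the AI's answer against. This is a minimal sketch using Python's standard `difflib` module on plain text; it assumes both versions have already been converted to text with one clause per line, and the sample clauses are invented.

```python
import difflib


def changed_lines(v1: str, v2: str) -> list[str]:
    """List removed ('- ') and added ('+ ') lines between two plain-text versions."""
    diff = difflib.ndiff(v1.splitlines(), v2.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical contract versions (assumption: one clause per line).
VERSION_1 = (
    "1. Term: 12 months.\n"
    "2. Notice period: 30 days.\n"
    "3. Governing law: Romanian law."
)
VERSION_2 = (
    "1. Term: 12 months.\n"
    "2. Notice period: 60 days.\n"
    "3. Governing law: Romanian law."
)

changes = changed_lines(VERSION_1, VERSION_2)
```

A diff only shows what changed in the text. Whether a change matters legally is exactly the part the tool, and the lawyer, still have to assess.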

Level 3: The stress test

Finally, I move to multi-document scenarios. This is where I test the limits of the tool’s reasoning and context management. I evaluate how well the tool sustains its performance and manages its context as the complexity of the task increases:

Extract the IP clauses from five different contracts.
Check whether the current provisions of a contract can be identified based on several amendments.
Correlate multiple court decisions on a single legal issue.
Generate a correct chronological timeline from a scattered document dump.
Corroborate provisions across several different laws.

Evaluating Case Law Research tools (Romania & EU)

I focus on these metrics:

Volume: The number of relevant decisions identified for a specific legal issue.
Interpretive accuracy: How accurately identified decisions are interpreted and applied to your specific query.
Judicial hierarchy & status: Does the tool correctly distinguish between interim and final decisions, and between individual judicial opinions and the court's final ruling?
Temporal tracking: How well the tool flags whether a decision is still relevant or if it interprets an older version of a law that has since been amended.
Statute-to-Case linking: The ability to see which specific statutory articles are being interpreted by the retrieved case law.

Keep in mind that even the most robust solutions may return errors when dealing with very high-volume corpora (such as European Commission competition decisions) on particularly complex points. The limitation here is that the tool does not have direct, simultaneous access to every single decision within its active context window. It most likely relies on semantic vector search, meaning its performance depends entirely on how effectively the algorithm matches the language of your prompt to the actual content in the judgments. I always ask the vendor to clarify how their search and retrieval mechanisms actually function.

One essential requirement I have for case law and legislation is that the tool should make the full text of the law or decision available directly within the solution, seamlessly integrated into the workflow or chat assistant. I need to be able to stay in the same chat and immediately verify if the AI correctly correlated my question with the actual text. The source should be attributed to the specific passage where the cited information appears, not just to a generic link to the top of the document.

Most importantly, having the text right there must allow me to continue working with that specific decision or statute further within the chat. It is critical to be able to ask follow-up questions, add nuances, request specific summaries, and iteratively refine the analysis without breaking the workflow.
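Passage-level attribution is also something you can spot-check mechanically when the full text is available. The sketch below assumes the tool returns a verbatim quote alongside its answer; `locate_quote` and the statute text are hypothetical, and the check only catches quotes that do not appear in the source, not misreadings of quotes that do.

```python
import re
from typing import Optional


def locate_quote(source_text: str, quote: str) -> Optional[tuple[int, int]]:
    """Return (start, end) offsets of the quote in source_text, or None if absent.

    Whitespace and letter case are ignored, so line breaks in the source do not
    hide an otherwise exact quotation.
    """
    words = quote.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return match.span() if match else None


# Hypothetical statute text (invented for this example).
STATUTE = (
    "Art. 1. This law applies to insurers.\n"
    "Art. 2. The insurer shall notify\n"
    "the policyholder within 30 days."
)

genuine = locate_quote(STATUTE, "shall notify the policyholder within 30 days")
fabricated = locate_quote(STATUTE, "shall notify the policyholder within 15 days")
```

When a span comes back, the offsets point to the exact passage to highlight; when `None` comes back, the quotation is not in the text the tool claims to be citing.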

Finally, my favorite practical benchmark for evaluating a specialized legal tool is to test it directly against advanced, general-purpose platforms such as ChatGPT Deep Research, Gemini Deep Research, or Perplexity.

I look at three things:

Does the evaluated tool identify more relevant decisions than the free ones on a specific request?
How do the tools compare when it comes to critical mistakes? Specifically, I look at whether the general models have a higher rate of identification errors (hallucinating non-existent case law or fabricating file numbers) or interpretation errors (fundamentally misreading the court's ruling).
Does it deliver a structured output that requires less manual editing and formatting before it can be used in a professional draft?
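The first two questions can be turned into simple numbers once a lawyer has verified which decisions are genuinely relevant and which cited decisions actually exist. A minimal sketch under those assumptions: the decision identifiers are placeholders, and `compare_tool` is a hypothetical helper, not part of any framework.

```python
def compare_tool(
    found: set[str], verified_relevant: set[str], known_to_exist: set[str]
) -> dict[str, float]:
    """Share of relevant decisions found, and share of citations that do not exist."""
    relevant_hits = found & verified_relevant
    fabricated = found - known_to_exist
    return {
        "recall": len(relevant_hits) / len(verified_relevant) if verified_relevant else 0.0,
        "fabrication_rate": len(fabricated) / len(found) if found else 0.0,
    }


# Placeholder identifiers; a lawyer has checked both reference sets by hand.
verified_relevant = {"D1", "D2", "D3", "D4"}
known_to_exist = {"D1", "D2", "D3", "D4", "D5"}

specialist = compare_tool({"D1", "D2", "D3"}, verified_relevant, known_to_exist)
generalist = compare_tool({"D1", "D5", "X9"}, verified_relevant, known_to_exist)
# specialist finds 3 of the 4 relevant decisions and cites nothing fabricated;
# generalist finds 1 of 4, and 1 of its 3 citations ("X9") does not exist.
```

Interpretation errors do not reduce to set arithmetic like this. They still require reading what the tool said about each decision against the decision itself.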

Web Search tools for research

When I evaluate AI web search tools for legal research —especially considering that specialized legal tools do not cover all jurisdictions— my perspective on "accuracy" changes. Accuracy becomes a function of the sources it manages to identify.

For me, the first comparative metric is the sheer volume of sources accessed. I am not looking for a response that dumps information from hundreds of sources into the final text. However, a tool that limits its background analysis to just the first 10 search results it encounters is insufficient for legal research. (For context, tools like Gemini Deep Research can now access hundreds of sources for a single query, which is a great benchmark.) Casting this wide net is necessary because, in some jurisdictions, there is a material risk that the first search result encountered is an outdated version of a statute. I benchmark the tool on its capacity to locate diverse sources, including case law, regulatory authority decisions, official guidelines, and practitioner commentary.

However, volume is useless without my second critical metric: corroboration and reconciliation. I benchmark the tool on its capacity to cross-reference conflicting data and recognize when a source or legal provision is outdated. The value lies in the tool’s ability to resolve these inconsistencies to deliver a current legal position.

I disagree with the idea that having a specialized legal research tool eliminates the need for a powerful web search tool. Not all answers can be found in past case law, and for certain cases, case law does not even exist. Furthermore, not all legal tasks are based strictly on legislation. Sometimes you just need market information to complete an assessment.

Conclusion

Efficiency might not be gained on Day One. It takes time to find your rhythm, to find a system that works alongside your legal reasoning.

Sometimes, using AI doesn't translate to doing things faster — sometimes it might translate to actually reading more. It does, however, let you access, synthesize, and correlate volumes of information that would have been impossible to navigate just three years ago.

It also requires a team with the time and space to experiment, even when that means a task takes longer at first. What you gain from that initial friction is learning how to make the tech work for you — how to apply it to your specific use cases, and what your actual, practical needs are, because those needs won't be the same for every lawyer.

So, how many frogs do you need to kiss to find the right legal AI vendor? The honest answer is that the right vendor is never going to miraculously emerge from a perfect demo. In my opinion, it reveals itself only after you've kissed a few frogs — and your team has had the opportunity to use the tech long enough to figure out what actually works for them.