Most legal AI evaluation frameworks share a foundational assumption: that correct answers exist, can be identified, and can be used to score model output. This assumption holds reasonably well in high-adherence jurisdictions with stable statutory frameworks, dense appellate commentary, and convergent legal doctrine. It does not hold in Argentina.

Understanding why matters for anyone building or evaluating legal AI beyond the anglophone core.

Structural ambiguity as the baseline, not the exception

Argentine law operates under conditions that most legal AI benchmarks treat as edge cases. The country has undergone two major civil code overhauls in the last century; the most recent, the Codigo Civil y Comercial de la Nacion (CCyCN), entered into force in August 2015 and introduced a constitutionalized private law framework drawing simultaneously from German, Italian, and French doctrine. The pre-CCyCN / post-CCyCN split means that for a large class of disputes -- contracts formed before August 2015, ongoing obligations, transitional matters -- the applicable law is genuinely contested between legal schools. A model that confidently applies the new code to an old contract is not wrong in the way a hallucination is wrong. It is wrong in the way a first-year associate with excellent reading skills is wrong.

This is the first structural difference: interpretive divergence is not a failure mode in Argentine legal reasoning. It is the normal operating condition.

The temporal instability of statutory authority

A second problem is harder to solve with fine-tuning: Argentine statutes have lifespans. Many laws -- investment promotion regimes, emergency economic legislation, moratoriums, tax amnesties, sectoral incentives -- are enacted with fixed validity periods. When they expire, they may be renewed under a new statute number with modified terms, or allowed to lapse entirely. A model trained on any fixed corpus will have encoded statutory citations that may now be inert.

The forest investment regime illustrates the point. Law 25.080, enacted in 1998, was extended and modified by Law 27.487 in 2019. A model that cites Ley 25.080 for post-2019 analysis is not hallucinating: the law exists. But the authoritative source has shifted, and the error is invisible to any benchmark that tests only whether the citation is syntactically valid.

The same dynamic applies to BCRA communications (the central bank's regulatory output, amendable by a single communique), UIF resolutions on anti-money laundering obligations, and AFIP administrative rulings on tax treatment. These are not obscure edge cases. They are the daily operating environment of any Argentine corporate or tax lawyer.
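The staleness problem can be made concrete with a small sketch. The following is a minimal illustration, not a real legal database: the `Authority` record, the `REGISTRY` contents, and the `check_citation` function are all hypothetical, and the dates are year-level placeholders taken from the years mentioned above rather than verified effective dates. The point is only that a citation check needs an "as of" date, not just a lookup.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Authority:
    """One citable source plus what is known about later changes to it."""
    citation: str
    in_force_from: date
    modified_on: Optional[date] = None  # None: no later change recorded
    modified_by: Optional[str] = None


# Hypothetical registry. Dates are year-level placeholders based on the
# years mentioned in the text, not verified effective dates.
REGISTRY = {
    "Ley 25.080": Authority("Ley 25.080", date(1998, 1, 1), date(2019, 1, 1), "Ley 27.487"),
    "Ley 27.487": Authority("Ley 27.487", date(2019, 1, 1)),
}


def check_citation(citation: str, as_of: date) -> str:
    """Classify a citation relative to the date the legal question is about."""
    authority = REGISTRY.get(citation)
    if authority is None:
        return "unknown"
    if as_of < authority.in_force_from:
        return "not_yet_in_force"
    if authority.modified_on is not None and as_of >= authority.modified_on:
        return f"stale: check {authority.modified_by}"
    return "current"


print(check_citation("Ley 25.080", date(2024, 6, 1)))  # stale: check Ley 27.487
print(check_citation("Ley 25.080", date(2005, 3, 1)))  # current
```

A benchmark that only asks whether "Ley 25.080" exists would pass both calls. A check conditioned on the analysis date separates them.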

Jurisprudential authority without a clear hierarchy

In US law, the authority hierarchy is relatively legible: Supreme Court, circuit courts, district courts, with defined precedential weight. Argentine law has formal hierarchy -- the Corte Suprema de Justicia de la Nacion (CSJN) at the apex -- but the practical picture is considerably murkier.

The CSJN's doctrine is not stable over time. As the Court's composition changes, successive majorities reach inconsistent conclusions, and these inconsistencies can persist for years before a definitive ruling resolves them. Below the CSJN, the Camaras Nacionales in Buenos Aires (commercial, civil, labor, federal) produce jurisprudence that is influential but not binding on provincial courts. The Suprema Corte de la Provincia de Buenos Aires (SCBA), with jurisdiction over the most populous province, operates on a parallel doctrinal track that frequently diverges from the national appellate courts. In matters of labor liability, for instance, SCBA holdings on employer solidarity under subcontracting chains have diverged from Camara Nacional de Apelaciones del Trabajo criteria during overlapping periods, generating directly opposite outcomes for structurally identical cases depending solely on the forum.

No stare decisis -- and a court still trying to manufacture one

A point that foreign readers of Argentine law consistently underestimate: Argentina has no stare decisis. There is no constitutional provision, no statute, and no procedural rule that requires any court to follow the holdings of any other court, including the CSJN. The doctrine is civilian, not common law, and the formal position is that each case is decided on its own terms.

The CSJN has spent decades attempting to engineer a functional substitute. In Ceramica San Lorenzo (1985), the Court held that lower court decisions lack foundation if they depart from CSJN precedents without supplying new arguments not previously considered. This is what Argentine academic Santiago Legarre has called obligatoriedad atenuada -- an attenuated or presumptive form of bindingness: tribunals must conform to CSJN doctrine, but they can depart if they can articulate something the Court has not yet heard. The Court reinforced the structural rationale in Bussi (2007), linking precedent-following to equality before the law and the predictability requirements of legal certainty.

The doctrine has not resolved the underlying instability. In Freire Diaz (2019), the Court itself clarified that the general statements in any given ruling cannot be given mandatory force for subsequent cases; holdings bind through their reasoning applied to their specific facts, not as free-standing rules. This clarification -- formally a limitation on the mechanical use of precedent -- simultaneously licenses courts to distinguish away CSJN holdings by framing the facts differently.

The practical result is a system in which the CSJN's own precedents are presumptively binding, selectively followed, frequently distinguished, and periodically reversed by the same court. For legal AI systems, this creates a compounded evaluation problem: a model must not only retrieve the correct CSJN holding, but assess whether lower courts are currently following it, have distinguished it, or are waiting for the CSJN itself to revisit it. None of that is captured by citation accuracy alone.

A model that cites case law fluently may produce output that is authoritative in one forum and incorrect in another. Evaluating whether a model got the jurisprudence right requires specifying the jurisdiction, the year, the specific court, and the current state of reception of any given precedent. Most current legal AI benchmarks do not address any of these dimensions.
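One way to read the requirement above is as a schema constraint on benchmark items: a jurisprudence question is not gradable until its context is pinned down. The sketch below is a hypothetical illustration of that idea. The field names, the `Reception` labels, and both helper functions are assumptions introduced here, not part of any existing benchmark.

```python
from enum import Enum
from typing import Any, Dict, List


class Reception(Enum):
    """Hypothetical labels for how lower courts currently treat a holding."""
    FOLLOWED = "followed"
    DISTINGUISHED = "distinguished"
    AWAITING_REVISIT = "awaiting_revisit"
    REVERSED = "reversed"


# The four dimensions named in the paragraph above.
REQUIRED_FIELDS = ("jurisdiction", "year", "court", "reception")


def missing_dimensions(item: Dict[str, Any]) -> List[str]:
    """Return the context fields a benchmark item fails to specify."""
    return [field for field in REQUIRED_FIELDS if item.get(field) is None]


def is_gradable(item: Dict[str, Any]) -> bool:
    """An item can be scored only when every dimension is pinned down."""
    return not missing_dimensions(item)


item = {"court": "CSJN", "year": 2009, "reception": Reception.DISTINGUISHED}
print(missing_dimensions(item))  # ['jurisdiction']
print(is_gradable(item))         # False
```

Rejecting under-specified items at authoring time is a design choice: it forces the benchmark author, not the grader, to decide which forum and period the gold answer belongs to.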

Why standard evals do not capture this

Legal AI benchmarking has converged on a tractable set of tasks: bar exam questions, contract clause classification, legal question answering against statutory text, citation retrieval. These tasks share the property that correct answers can be adjudicated by human experts with reasonable inter-rater agreement. They were designed, explicitly or implicitly, around the US and UK legal systems.

The problem is not that these benchmarks are wrong. The problem is that they evaluate the wrong failure modes for high-ambiguity jurisdictions. In Argentina, the relevant failure modes are four.

First: false confidence on expired or superseded authority. The model is correct about what the law said; it does not know the law changed.
Second: cross-jurisdiction contamination. Spanish-language legal training data skews toward Spain, whose civil law tradition shares vocabulary with Argentina but diverges on doctrine, procedure, and institutional structure. A model trained predominantly on peninsular Spanish legal text will apply Tribunal Supremo reasoning to a Camara de Apelaciones context with no visible error signal.
Third: temporal ground truth collapse. For contested transitional matters, there is no single correct answer. The appropriate response is to surface the interpretive divergence, not resolve it. A model that collapses to one position is not more useful; it is less honest about the epistemic situation.
Fourth: authority forum-blindness. A holding that is correct for a national commercial court may be inapplicable, or actively wrong, for a provincial civil court on the same matter.

None of these failure modes surface in a benchmark designed around convergent, stable, anglophone legal systems.
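The four failure modes can be treated as labels that an expert annotator attaches to a model answer. The following is a minimal sketch under that assumption. The `Annotation` fields and the `tag_failures` rules are hypothetical and deliberately crude: they show how the taxonomy could be operationalized, not how it must be.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureMode(Enum):
    STALE_AUTHORITY = "false confidence on expired or superseded authority"
    CROSS_JURISDICTION = "cross-jurisdiction contamination"
    GROUND_TRUTH_COLLAPSE = "temporal ground truth collapse"
    FORUM_BLINDNESS = "authority forum-blindness"


@dataclass
class Annotation:
    """Hypothetical expert annotations for one model answer."""
    cites_changed_authority: bool   # cited source was later modified or lapsed
    flags_possible_change: bool     # answer warns that the source may be stale
    applies_foreign_doctrine: bool  # e.g. Tribunal Supremo reasoning
    question_is_contested: bool     # experts agree no single answer exists
    surfaces_divergence: bool       # answer presents the competing positions
    outcome_depends_on_forum: bool
    states_forum: bool


def tag_failures(a: Annotation) -> List[FailureMode]:
    """Map annotator flags to the failure modes listed above."""
    failures = []
    if a.cites_changed_authority and not a.flags_possible_change:
        failures.append(FailureMode.STALE_AUTHORITY)
    if a.applies_foreign_doctrine:
        failures.append(FailureMode.CROSS_JURISDICTION)
    if a.question_is_contested and not a.surfaces_divergence:
        failures.append(FailureMode.GROUND_TRUTH_COLLAPSE)
    if a.outcome_depends_on_forum and not a.states_forum:
        failures.append(FailureMode.FORUM_BLINDNESS)
    return failures


answer = Annotation(True, False, False, False, False, True, False)
print([f.name for f in tag_failures(answer)])  # ['STALE_AUTHORITY', 'FORUM_BLINDNESS']
```

Note that an answer citing changed authority is not penalized if it flags the possible change. The failure is the missing caveat, not the citation.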

Worked examples: failure modes in practice

The failure modes described above are not theoretical. The following prompt-and-output examples illustrate how three of them (stale authority, cross-jurisdiction contamination, and forum-blindness) manifest in typical practitioner queries, and what a well-calibrated system should produce instead.

Example 1: Stale authority

PROMPT

What are the current tax incentives available for afforestation projects in Argentina?

FLAWED MODEL OUTPUT

Under Law 25.080, investors in afforestation projects are entitled to a non-reimbursable fiscal contribution (apoyo economico no reembolsable) calculated on the basis of approved planting costs, with a stability guarantee of 30 years for the applicable tax regime...

WHAT WENT WRONG

Law 25.080 exists. The description of its incentive structure is historically accurate. But Law 27.487 extended and substantially modified the regime in 2019, introducing revised eligibility categories and updated benefit structures. A practitioner advising a client on a 2024 investment would be working from a superseded framework. The model produces no staleness signal and no caveat. The error is invisible to any benchmark that tests only whether the citation is syntactically valid and the description historically accurate -- both of which it is.

WHAT A WELL-CALIBRATED SYSTEM SHOULD SAY

Law 25.080 established the original afforestation incentive regime. You should verify whether the operative text for post-2019 analysis is Law 27.487 or a subsequent modification, and confirm that the specific benefit category you are advising on has not been amended by executive decree or AFIP ruling since the training cutoff.

Example 2: Spain contamination

PROMPT

Can a creditor in Argentina pierce the corporate veil to reach a sole shareholder for an unpaid commercial debt?

FLAWED MODEL OUTPUT

Spanish jurisprudence has developed a robust doctrine of levantamiento del velo, originating in landmark Tribunal Supremo case law and refined in subsequent decisions. Under this doctrine, courts may disregard the corporate form when it has been used as an instrument of fraud or to circumvent contractual obligations. Argentine courts follow a similar approach...

WHAT WENT WRONG

The framing, doctrinal genealogy, and procedural assumptions are Spanish. Argentine inoponibilidad de la persona juridica is governed by Art. 54, third paragraph, of the Ley General de Sociedades (LGS) and -- post-2015 -- by Art. 144 CCyCN, which has a distinct normative structure and a different activation threshold from the Spanish doctrine. Argentine courts require a showing that the corporate form was a mere resource to circumvent the law, violate public order, or harm third parties in bad faith. The Spanish Tribunal Supremo doctrine is broader. A practitioner relying on this output to assess litigation risk would systematically overestimate the viability of the claim. The error leaves no visible signal: the vocabulary is shared, the logic sounds coherent, and the reference to the Tribunal Supremo goes unquestioned by any benchmark that does not check forum.

WHAT A WELL-CALIBRATED SYSTEM SHOULD SAY

Note: Spanish-language legal sources on this topic are predominantly Spanish, and the Spanish and Argentine corporate veil doctrines are structurally distinct. Argentine inoponibilidad under Art. 54 LGS and Art. 144 CCyCN has its own normative architecture. The following analysis applies Argentine law specifically, and is not drawn from Tribunal Supremo precedent.

Example 3: Forum-blindness -- the SCBA / CNAT divergence on Art. 30 LCT

PROMPT

A construction company contracts a subcontractor to carry out work at a client site. A worker employed by the subcontractor is injured. Is the construction company jointly liable under Argentine labor law?

FLAWED MODEL OUTPUT

Under Art. 30 of the Labor Contract Law (LCT), when a company contracts or subcontracts work corresponding to its normal and specific activity, it is jointly and severally liable for labor obligations. The Camara Nacional de Apelaciones del Trabajo has held that construction principal contractors can fall within this provision when the subcontracted work is part of the main company's normal productive activity...

WHAT WENT WRONG

The CNAT analysis reflects a real line of decisions. But the output omits the forum dimension entirely, and that omission changes the outcome.

Art. 30 LCT solidarity has been a site of persistent doctrinal instability at the highest judicial levels. The CSJN applied a restrictive interpretation in Rodriguez c. Compania Embotelladora Argentina S.A. (1993), requiring a showing of unidad tecnica de ejecucion between principal and subcontractor. The CSJN effectively stepped back from that criterion in Benitez c. Plataforma Cero S.A. (2009), leaving interpretation to lower courts. In Bergonci c. YPF S.A. (October 2022), the CSJN signaled a return to restrictive criteria, finding that a supply contract does not per se constitute a partial delegation of the principal's actividad normal y especifica. The oscillation across these three landmarks -- spanning three decades -- is itself a demonstration of the attenuated bindingness described above: lower courts read each shift as permission to reopen the question.

The CNAT, operating in the national jurisdiction seated in the City of Buenos Aires, has applied varying criteria across its chambers during these same periods, with different salas extending solidarity more or less broadly, and at times in tension with the CSJN's own oscillating doctrine. The SCBA, operating in the most populous provincial jurisdiction, has maintained its own doctrinal track, applying the actividad normal y especifica test with criteria that do not always track the majority CNAT position for factually comparable cases.

For a company whose workforce and job sites are predominantly in the Province of Buenos Aires, the distinction between national and provincial forum is not procedural. It determines whether solidarity attaches. A model that presents one position as the answer -- without flagging forum, period, and which court's line of cases is being applied -- is producing output that cannot be safely relied upon for any actual file.

WHAT A WELL-CALIBRATED SYSTEM SHOULD SAY

Outcome depends on forum and on the applicable CSJN doctrine at the relevant period. If the matter will be litigated before national labor courts (CNAT), the solidarity analysis follows a different line than if litigated before provincial courts in Buenos Aires Province under SCBA jurisdiction. The CSJN's own position has shifted across Rodriguez (1993), Benitez (2009), and Bergonci (2022). Please confirm the relevant forum and time frame before advising.

What a useful benchmark would require

An evaluation framework adequate to high-ambiguity jurisdictions would need to test at least four things that current frameworks do not.

Temporal stability awareness: does the model flag that cited authority may have been superseded, or does it present potentially stale citations with the same confidence as stable ones? A well-calibrated system should distinguish between a CSJN holding from 2010 that has been consistently confirmed and a BCRA communication that may have been amended four times since the training cutoff.

Interpretive divergence recognition: when two credible doctrinal positions exist on the same question, does the model surface both, or does it collapse to one? The correct answer to a transitional CCyCN question is often a structured presentation of competing positions, not a single holding.

Forum specificity: does the model correctly identify the relevant jurisdictional context before applying jurisprudence? National versus provincial, civil versus commercial, federal versus ordinary -- these distinctions change outcomes, not just register.

Calibrated uncertainty: in cases where the legally correct answer is genuinely contested, does the model express appropriate uncertainty rather than fluent but false confidence? A first-year associate who admits not knowing is more useful than a senior associate who invents.

These are not exotic requirements. They describe what a competent Argentine lawyer does automatically on every file. The gap between what practitioners do and what benchmarks test is, itself, a finding about where evaluation methodology needs to go.
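As a rough illustration of how these four requirements could be reported, the sketch below records one pass/fail judgment per dimension for each benchmark item and summarizes pass rates per dimension. The `RubricResult` type and `pass_rates` function are hypothetical names introduced here, and the sample results are made up for the example.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class RubricResult:
    """Expert pass/fail judgments for one model answer (illustrative)."""
    temporal_stability: bool
    divergence_recognition: bool
    forum_specificity: bool
    calibrated_uncertainty: bool


def pass_rates(results: List[RubricResult]) -> Dict[str, float]:
    """Return the share of answers passing each dimension, kept separate."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        f.name: sum(getattr(r, f.name) for r in results) / len(results)
        for f in fields(RubricResult)
    }


# Made-up judgments for two answers, for illustration only.
results = [
    RubricResult(True, False, True, True),
    RubricResult(False, False, True, False),
]
print(pass_rates(results))
# {'temporal_stability': 0.5, 'divergence_recognition': 0.0,
#  'forum_specificity': 1.0, 'calibrated_uncertainty': 0.5}
```

Reporting the dimensions separately, rather than averaging them into one score, keeps a model that is strong on forum specificity but blind to divergence from looking merely average.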

---------

References

Case Law

Corte Suprema de Justicia de la Nacion (CSJN)
CSJN, Ceramica San Lorenzo S.A., July 4, 1985, Fallos 307:1094.
CSJN, Bussi, Antonio Domingo v. Congreso de la Nacion -- Camara de Diputados, July 13, 2007, Fallos 330:3160.
CSJN, Freire Diaz, Manuel Santos y otro s/ defraudacion, March 19, 2019, Fallos 342:278.
CSJN, Rodriguez, Juan R. v. Compania Embotelladora Argentina S.A. y otro, April 15, 1993, Fallos 316:713.
CSJN, Benitez, Horacio Omar y otros v. Plataforma Cero S.A. y otros s/ despido, December 22, 2009, Fallos 332:2815.
CSJN, Bergonci, Ilda Leonor v. YPF S.A. y otros s/ despido, October 18, 2022, CNT52304/2010/1/RH1.

Scholarly Works

Legarre, Santiago & Rivera (h.), Julio Cesar (2009). "La obligatoriedad atenuada de los fallos de la Corte Suprema y el stare decisis vertical" [The Attenuated Bindingness of Supreme Court Decisions and Vertical Stare Decisis]. La Ley, 2009-E, pp. 1-12. ISSN 0325-366X.

Legislation

Law 25.080 on Investment in Cultivated Forests, enacted December 16, 1998, promulgated January 15, 1999. Official Gazette 29,066, January 19, 1999.
Law 27.487 on Extension of Investments in Cultivated Forests, enacted January 8, 2019, promulgated January 23, 2019. Official Gazette, January 24, 2019.
Law 20.744 -- Labor Contract Law (Ley de Contrato de Trabajo, LCT), consolidated text 1976.
Law 26.994 -- Civil and Commercial Code of Argentina (Codigo Civil y Comercial de la Nacion, CCyCN), enacted October 1, 2014, in force since August 1, 2015.

Why Argentine Law Requires a Different Benchmark: Temporal Instability, Forum Divergence, and the Limits of Ground Truth

About the Author

Ignacio Adrian Lerer
