Recent Convening · December 3, 2025
🕛 12:00–1:30 PM Pacific · Wednesday

Custom Evals in the Legal Domain

A focused, online-first session hosted from Stanford to kick off a 2026 research and convening agenda on custom AI evaluations for legal teams and other organizations deploying AI systems and AI agents.

📍 Location: Hosted from Stanford University (Palo Alto) with full virtual participation
🗓 Date: Wednesday, December 3, 2025
🎥 Status: Event concluded. Watch the full recording below.


Why this kick-off matters

Legal and compliance teams are being asked to sign off on AI systems and AI agents that operate in high-stakes environments. Leaderboards and generic benchmarks are not enough. What they need are custom evaluations that reflect real tasks, risk tolerances, and institutional duties.

Looking toward 2026

This December 3 session served as a short, high-signal kick-off: aligning on goals, shared language, and priority use cases. The work continues in 2026 through a series of focused research sprints, working groups, and convenings on evaluation practices for legal AI and agentic systems.

The Dec 3 Brain Trust

This initial group helped set the direction for the 2026 agenda, bringing perspectives from legal practice, agent evaluation platforms, consumer protection, and computational law.

Host & Convenor

Dazza Greenwood

law.MIT.edu · Stanford CodeX · Stanford Digital Economy Lab

Opening remarks and framing: why legal teams need custom evals, not just benchmarks.
Legal Domain

Tara Waters

Vals AI · TLW Consulting

Reflections on legal AI evaluation work and what matters most to law firms and clients.
Agent Measurement

Roman Engeler

Co-founder & CTO, Atla

Overview of evaluating AI agents and what teams are learning in practice.
Agent Eval Infrastructure

Darius Emrani

Founder, Scorecard

Agent evaluation infrastructure for high-stakes legal systems.
Fiduciary & Loyalty

Dan Leininger

Consumer Reports Innovation Lab

Early thinking on measuring loyalty and duty-of-care for consumer-facing AI agents.
Computational Law

Robert Mahari

Associate Director, Stanford CodeX

Computational law perspective on how evaluation connects to legal systems, precedent, and governance.

Pre-reads to get up to speed

Quick reads on why custom evaluations matter in law and what practitioners are learning from real-world assessments.

Beyond AI Benchmarks

By Dazza Greenwood · Sep 2025

Argues that leaders must own “evaluation-as-policy,” turning domain standards into golden datasets and LLM-judge rubrics so AI systems are measured against what the organization truly values. Introduces the Lake Merritt open-source workbench and shows how small, expert-labeled eval packs, including process-level checks for agent workflows, can become a strategic asset. A minimal sketch of the pattern follows this entry.

Read the post →
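To make the “golden dataset + LLM-judge rubric” idea concrete, here is a minimal sketch in Python. Everything in it (GOLDEN_SET, RUBRIC, score_with_rubric) is a hypothetical illustration, not Lake Merritt's actual API, and the judge call is stubbed with a crude token-overlap check so the example runs as written.

```python
# Minimal sketch of a golden-dataset eval scored by an LLM-judge rubric.
# All names and data here are illustrative, not Lake Merritt's real API.

GOLDEN_SET = [
    {
        "input": "Summarize the indemnification clause in Section 8.",
        "expected": "Mutual indemnification, capped at 12 months of fees.",
    },
]

RUBRIC = (
    "Score the candidate against the expected answer from 0-5: "
    "5 = legally equivalent; 3 = partially correct; 0 = wrong or fabricated. "
    "Return only the integer."
)

def score_with_rubric(candidate: str, expected: str) -> int:
    # In practice, RUBRIC plus both answers would be sent to an LLM acting
    # as judge. A crude token-overlap stub keeps this sketch runnable.
    overlap = set(candidate.lower().split()) & set(expected.lower().split())
    return min(5, len(overlap))

def run_eval(system_under_test) -> float:
    """Return a 0-1 quality score over the golden dataset."""
    scores = [
        score_with_rubric(system_under_test(case["input"]), case["expected"])
        for case in GOLDEN_SET
    ]
    return sum(scores) / (5 * len(scores))

if __name__ == "__main__":
    # Trivial stand-in for the AI system under evaluation.
    fake_system = lambda prompt: "Indemnification is mutual, capped at 12 months of fees."
    print(f"eval score: {run_eval(fake_system):.2f}")
```

The point of the pattern is that the expensive, expert part is the small labeled dataset and the rubric, not the harness around them; swapping the stub for a real LLM judge leaves the rest of the loop unchanged.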

Vals Legal AI Report (Feb 2025)

CoCounsel, Vincent AI, Harvey Assistant, Oliver · 7 legal tasks

Independent, blind evaluation across seven core legal workflows (from redlining to EDGAR research) benchmarked against a lawyer control group. Harvey topped five of six tasks it entered, and CoCounsel stayed in the top tier across its tasks; AI systems collectively beat the lawyer baseline on four document-analysis tasks, with Document Q&A averaging 80.2%.

Read the Feb report →

Vals Legal AI Report (Oct 2025)

Alexi, Counsel Stack, Midpage, ChatGPT · 200 research questions

Focuses on legal research quality using weighted criteria (accuracy 50%, authoritativeness 40%, appropriateness 10%) against a lawyer baseline. All AI products clustered at 74–78% overall (lawyers 69%); legal-specialist tools led on authoritativeness, scoring six points above ChatGPT because of better sourcing and citations. The weighting is worked through below.

Read the Oct report →
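For readers who want to see how the weighting works, here is a small sketch that recomputes an overall score from the three weighted criteria. The weights come from the report; the sample sub-scores and function name are invented for illustration.

```python
# Illustrative recomputation of the Oct 2025 report's weighted scoring.
# The weights are from the report; the sample sub-scores are hypothetical.

WEIGHTS = {"accuracy": 0.50, "authoritativeness": 0.40, "appropriateness": 0.10}

def overall_score(subscores: dict) -> float:
    """Weighted overall score on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * subscores[name] for name, w in WEIGHTS.items())

# Hypothetical product: strong accuracy and sourcing, weaker appropriateness.
print(overall_score({"accuracy": 78, "authoritativeness": 80, "appropriateness": 70}))
# 78*0.5 + 80*0.4 + 70*0.1 = 39 + 32 + 7 = 78.0
```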

More resources will be added over time.

Previous Initiative: AI Agents x Law

April 8, 2025 · Stanford FutureLaw Workshop

View Full Recap & Videos →

In April 2025, the AI Agents x Law workshop at Stanford FutureLaw explored contract terms for agentic transactions, authenticated delegation, and legal frameworks for AI agents. That work provides important context for this new focus on evaluation.

Key Sessions

  • Setting the Context: Dazza Greenwood
  • Legal Issues for Agents: Diana Stern
  • Authenticated Delegation: Tobin South
  • Agent Error Handling Demo: Andor Kesselman (tracking and mitigating agent mistakes)
  • Legal Practice Innovation: Damien Riehl