A focused, online-first session hosted from Stanford to kick off a 2026 research and convening agenda on custom AI evaluations for legal teams and other organizations deploying AI systems and AI agents.
📍 Location: Hosted from Stanford University (Palo Alto) with full virtual participation
🗓 Date: Wednesday, December 3, 2025
🎥 Status: Event concluded. Watch the full recording below.
Legal and compliance teams are being asked to sign off on AI systems and AI agents that operate in high-stakes environments. Leaderboards and generic benchmarks are not enough. What they need are custom evaluations that reflect real tasks, risk tolerances, and institutional duties.
This December 3 session served as a short, high-signal kick-off: aligning on goals, shared language, and priority use cases. The work continues in 2026 through a series of focused research sprints, working groups, and convenings on evaluation practices for legal AI and agentic systems.
This initial group helped set the direction for the 2026 agenda, bringing perspectives from legal practice, agent evaluation platforms, consumer protection, and computational law.
law.MIT.edu · Stanford CodeX · Stanford Digital Economy Lab
Opening remarks and framing: why legal teams need custom evals, not just benchmarks.
Vals AI · TLW Consulting
Reflections on legal AI evaluation work and what matters most to law firms and clients.
Co-founder & CTO, Atla
Overview of evaluating AI agents and what teams are learning in practice.
Founder, Scorecard
Agent evaluation infrastructure for high-stakes legal systems.
Consumer Reports Innovation Lab
Early thinking on measuring loyalty and duty-of-care for consumer-facing AI agents.
Associate Director, Stanford CodeX
Computational law perspective on how evaluation connects to legal systems, precedent, and governance.Quick reads on why custom evaluations matter in law and what practitioners are learning from real-world assessments.
By Dazza Greenwood · Sep 2025
Argues that leaders must own “evaluation-as-policy,” turning domain standards into golden datasets and LLM-judge rubrics so AI systems are measured against what the organization truly values. Introduces the Lake Merritt open-source workbench and shows how small, expert-labeled eval packs can become a strategic asset, including process-level checks for agent workflows.
Read the post →
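To make the post's approach concrete, here is a minimal sketch of what a small, expert-labeled eval pack with an LLM-judge rubric could look like. Everything in it (GoldenCase, RUBRIC, the judge stub) is an illustrative assumption for this page, not Lake Merritt's actual interfaces; a real judge would send the rubric and answer to an LLM rather than do a keyword check.

```python
# Illustrative sketch only: a tiny "eval pack" pairing expert-labeled golden cases
# with an LLM-judge rubric. Names and structure are assumptions for this example,
# not Lake Merritt's actual interfaces.
from dataclasses import dataclass


@dataclass
class GoldenCase:
    prompt: str                  # the task given to the AI system
    expected_points: list[str]   # facts an expert says a good answer must cover


RUBRIC = (
    "Score the answer 1-5 on each criterion: "
    "(a) covers every expected point, (b) cites its sources, (c) flags uncertainty."
)

EVAL_PACK = [
    GoldenCase(
        prompt="Summarize the termination clause in the attached MSA.",
        expected_points=["30-day notice period", "survival of confidentiality obligations"],
    ),
]


def judge(answer: str, case: GoldenCase) -> dict:
    """Placeholder judge. A real implementation would send RUBRIC, the answer, and the
    expected points to an LLM and parse its scores; here we only check point coverage."""
    covered = [p for p in case.expected_points if p.lower() in answer.lower()]
    return {"coverage": len(covered) / len(case.expected_points), "rubric": RUBRIC}


if __name__ == "__main__":
    sample_answer = "Either party may terminate on a 30-day notice period."
    print(judge(sample_answer, EVAL_PACK[0]))  # coverage 0.5: one expected point missing
```

The point of the post is that the golden cases and rubric, not the judge, are the policy artifact: they encode what the organization values, and the evaluation simply applies them.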
CoCounsel, Vincent AI, Harvey Assistant, Oliver · 7 legal tasks

Independent, blind evaluation across seven core legal workflows (from redlining to EDGAR research) benchmarked against a lawyer control group. Harvey topped five of the six tasks it entered, and CoCounsel stayed in the top tier across its tasks; AI systems collectively beat the lawyer baseline on four document-analysis tasks, with Document Q&A averaging 80.2%.
Read the Feb report →

Alexi, Counsel Stack, Midpage, ChatGPT · 200 research questions
Focuses on legal research quality using weighted criteria (accuracy 50%, authoritativeness 40%, appropriateness 10%) against a lawyer baseline. All AI products clustered at 74–78% overall (lawyers 69%); legal-specialist tools led on authoritativeness, scoring six points above ChatGPT because of better sourcing and citations.
Read the Oct report →

More resources will be added over time.
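For a concrete sense of how the weighted criteria in the October report summary above combine into an overall score, here is a minimal, hypothetical sketch. The weights are the ones reported; the code, names, and example scores are illustrative and not the report's methodology.

```python
# Illustrative sketch only: combining per-criterion scores (0-100) into an overall
# score using the weights reported above. Variable names are hypothetical.
WEIGHTS = {"accuracy": 0.50, "authoritativeness": 0.40, "appropriateness": 0.10}


def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores."""
    return sum(WEIGHTS[name] * score for name, score in criterion_scores.items())


# Hypothetical tool scoring 80 / 75 / 70 on the three criteria:
print(overall_score({"accuracy": 80, "authoritativeness": 75, "appropriateness": 70}))  # 77.0
```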
April 8, 2025 · Stanford FutureLaw Workshop
In April 2025, the AI Agents x Law workshop at Stanford FutureLaw explored contract terms for agentic transactions, authenticated delegation, and legal frameworks for AI agents. That work provides important context for this new focus on evaluation.