=== Assistant: You’re not crazy – the benchmarks you’re citing really do show “human-level” performance on a bunch of legal tasks. ===
The nuance is where and how that’s true.

===== 1️⃣ VALS / VLAIR – legal research =====
* In the latest VLAIR legal research study (200 U.S. research questions, lawyer baseline vs AI tools), all the AI products clustered around 74–78% on the weighted score, with the lawyer baseline at ~69–71%, meaning the AI tools were on average ~7–9 points above the humans.<ref>{{cite web|title=Vals AI|url=https://www.vals.ai/industry-reports/vlair-10-14-25|publisher=Vals AI|access-date=2025-11-30}}</ref>
* Coverage of that report is blunt: “AI tools … outperform lawyers in legal research accuracy and authoritativeness” on that benchmark.<ref>{{cite web|title=Legal.io|url=https://www.legal.io/articles/5748627/AI-Tools-Surpass-Lawyers-in-Legal-Research-Accuracy-Vals-Report-Finds|publisher=legal.io|access-date=2025-11-30}}</ref>

So on that slice of work (well-posed research questions, clean scoring rubric), the top systems aren’t just “close” – they’re ahead of the human baseline.

===== 2️⃣ LegalBench / VALS LegalBench & similar – legal reasoning =====
* A July 2025 LegalBench release (as summarized by goHeather) reports several frontier models above 80% accuracy on contract interpretation, statutory reasoning, and hypo-style tasks (GPT-5 ~84.6%, Gemini 2.5 Pro ~83.6%, GPT-4.1 ~81.9%, etc.).<ref>{{cite web|title=goHeather|url=https://www.goheather.io/post/how-accurate-is-legal-ai|publisher=goheather.io|access-date=2025-11-30}}</ref>
* That’s squarely in the range we’d normally describe as “expert-level performance” on exam-type or structured reasoning benchmarks. You can argue that’s at least human-comparable on those specific tasks.

===== 3️⃣ Anna Guo’s contract drafting benchmark =====
Guo’s study is even more explicit about humans vs AI:<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Reliability rate (pass/fail on key elements)
** Average human lawyers: 56.7%
** Average AI tools: 57%
* Top performer vs top human
** Top AI (Gemini 2.5 Pro): 73.3% reliability
** GPT-5: ~73%
** Top human lawyer: 70%
** → “The best AI tools individually surpassed the best human lawyer” on first-draft reliability.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Usefulness score (clarity/helpfulness/length, max 9)
** Humans: 7.53/9
** AI tools: 7.25/9 – slightly behind on nuance/judgment.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Time per draft (rough arithmetic below)
** Humans: ~12 m 43 s per task
** AI: under 1 minute

Her executive bullets literally say things like “AI tools matched and, in some cases, outperformed lawyers in producing reliable first drafts,” and that top tools “now lead on first-draft reliability.”<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref> That’s about as “human-level” as it gets, within the bounds of that benchmark.
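As a back-of-the-envelope check on that time gap (the ~12 m 43 s human average and the sub-minute AI draft time are the benchmark’s own figures; the arithmetic here is only illustrative):

: <math>\frac{12 \times 60 + 43\,\text{s}}{60\,\text{s}} = \frac{763}{60} \approx 12.7</math>

Even crediting the AI side with a full minute per draft, that works out to roughly a 13× speed advantage over the human lawyers in that study – a separate axis from the reliability comparison.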
===== 4️⃣ GDPval – cross-occupation, including lawyers =====
GDPval is broader but points in the same direction:
* Across 44 occupations (including “Lawyers”), the authors find that the best frontier models are “approaching industry experts in deliverable quality.”<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>
* Popular coverage (e.g., Fortune) frames it as: models are already as good as experts on roughly half the tasks tested.<ref>{{cite web|title=Fortune|url=https://fortune.com/2025/09/30/ai-models-are-already-as-good-as-experts-at-half-of-tasks-a-new-openai-benchmark-gdpval-suggests/|publisher=fortune.com|date=2025-09-30|access-date=2025-11-30}}</ref>

So again: not “better than experts at everything,” but clearly in the human ballpark for a meaningful slice of real-world work product.

===== Your line: =====
:
Given:
* VALS legal research: AI ≥ lawyer baseline on accuracy/authoritativeness.<ref>{{cite web|title=Vals AI|url=https://www.vals.ai/industry-reports/vlair-10-14-25|publisher=Vals AI|access-date=2025-11-30}}</ref>
* LegalBench & related: multiple models reliably >80% on complex legal reasoning tasks.<ref>{{cite web|title=goHeather|url=https://www.goheather.io/post/how-accurate-is-legal-ai|publisher=goheather.io|access-date=2025-11-30}}</ref>
* Guo: top AI tools outperform top human lawyers on first-draft reliability in contract drafting, with similar usefulness scores.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* GDPval: best models approaching expert deliverable quality across occupations including law.<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>

I’d say your substantive claim is defensible, not hype, as long as readers understand you are talking about benchmarked tasks, not “AI is a fully general junior associate replacement.” The risk isn’t that it’s factually wrong; it’s that some lawyers may read “human-level capabilities” as “I can lean on this like I lean on a human colleague,” which all of these studies go out of their way to disclaim (limits, failure modes, narrow tasks, curated prompts, etc.).<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>

===== If you want to keep the punch but telegraph nuance, I’d nudge it to something like: =====
:
or slightly more cautious:
:
If you leave your current sentence as-is, I don’t think you’re out over your skis given VALS + GDPval + Guo. You’re just using “human-level” in the benchmark sense, which is exactly what those reports are about.