=== Assistant: You’re not crazy – the benchmarks you’re citing really do show “human-level” performance on a bunch of legal tasks. ===
The nuance is where and how that’s true.

===== 1️⃣ VALS / VLAIR – legal research =====
* In the latest VLAIR legal research study (200 U.S. research questions, lawyer baseline vs AI tools), all the AI products clustered around 74–78% on the weighted score, with the lawyer baseline at ~69–71%, meaning the AI tools were on average ~7–9 points above the humans.<ref>{{cite web|title=Vals AI|url=https://www.vals.ai/industry-reports/vlair-10-14-25|publisher=Vals AI|access-date=2025-11-30}}</ref>
* Coverage of that report is blunt: “AI tools … outperform lawyers in legal research accuracy and authoritativeness” on that benchmark.<ref>{{cite web|title=Legal.io|url=https://www.legal.io/articles/5748627/AI-Tools-Surpass-Lawyers-in-Legal-Research-Accuracy-Vals-Report-Finds|publisher=legal.io|access-date=2025-11-30}}</ref>

So on that slice of work (well-posed research questions, clean scoring rubric), the top systems aren’t just “close” – they’re ahead of the human baseline.

===== 2️⃣ LegalBench / VALS LegalBench & similar – legal reasoning =====
* A July 2025 LegalBench release (as summarized by goHeather) reports several frontier models above 80% accuracy on contract interpretation, statutory reasoning, and hypo-style tasks (GPT-5 ~84.6%, Gemini 2.5 Pro ~83.6%, GPT-4.1 ~81.9%, etc.).<ref>{{cite web|title=goHeather|url=https://www.goheather.io/post/how-accurate-is-legal-ai|publisher=goheather.io|access-date=2025-11-30}}</ref>
* That’s squarely in the range we’d normally describe as “expert-level performance” on exam-type or structured reasoning benchmarks. You can argue that’s at least human-comparable on those specific tasks.

===== 3️⃣ Anna Guo’s contract drafting benchmark =====
Guo’s study is even more explicit about humans vs AI:<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Reliability rate (pass/fail on key elements)
** Average human lawyers: 56.7%
** Average AI tools: 57%
* Top performer vs top human
** Top AI (Gemini 2.5 Pro): 73.3% reliability
** GPT-5: ~73%
** Top human lawyer: 70%
** → “The best AI tools individually surpassed the best human lawyer” on first-draft reliability.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Usefulness score (clarity/helpfulness/length, max 9)
** Humans: 7.53/9
** AI tools: 7.25/9 – slightly behind on nuance/judgment.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* Time per draft (rough arithmetic below)
** Humans: ~12 m 43 s per task
** AI: under 1 minute

Her executive bullets literally say things like “AI tools matched and, in some cases, outperformed lawyers in producing reliable first drafts,” and that top tools “now lead on first-draft reliability.”<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref> That’s about as “human-level” as it gets, within the bounds of that benchmark.
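As a back-of-the-envelope check on that time gap (the ~12 m 43 s human average and the sub-minute AI draft time are the benchmark’s own figures; the arithmetic here is only illustrative):

: <math>\frac{12 \times 60 + 43\,\text{s}}{60\,\text{s}} = \frac{763}{60} \approx 12.7</math>

Even crediting the AI side with a full minute per draft, that works out to roughly a 13× speed advantage over the human lawyers in that study – a separate axis from the reliability comparison.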
===== 4️⃣ GDPval – cross-occupation, including lawyers =====
GDPval is broader but points in the same direction:
* Across 44 occupations (including “Lawyers”), the authors find that the best frontier models are “approaching industry experts in deliverable quality.”<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>
* Popular coverage (e.g., Fortune) frames it as: models are already as good as experts on roughly half the tasks tested.<ref>{{cite web|title=Fortune|url=https://fortune.com/2025/09/30/ai-models-are-already-as-good-as-experts-at-half-of-tasks-a-new-openai-benchmark-gdpval-suggests/|publisher=fortune.com|date=2025-09-30|access-date=2025-11-30}}</ref>

So again: not “better than experts at everything,” but clearly in the human ballpark for a meaningful slice of real-world work product.

===== Your line: =====
:
Given:
* VALS legal research: AI ≥ lawyer baseline on accuracy/authoritativeness.<ref>{{cite web|title=Vals AI|url=https://www.vals.ai/industry-reports/vlair-10-14-25|publisher=Vals AI|access-date=2025-11-30}}</ref>
* LegalBench & related: multiple models reliably >80% on complex legal reasoning tasks.<ref>{{cite web|title=goHeather|url=https://www.goheather.io/post/how-accurate-is-legal-ai|publisher=goheather.io|access-date=2025-11-30}}</ref>
* Guo: top AI tools outperform top human lawyers on first-draft reliability in contract drafting, with similar usefulness scores.<ref>{{cite web|title=Legal AI Benchmarking|url=https://www.legalbenchmarks.ai/research/phase-2-research|publisher=Legal AI Benchmarking|access-date=2025-11-30}}</ref>
* GDPval: best models approaching expert deliverable quality across occupations including law.<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>

I’d say your substantive claim is defensible, not hype, as long as readers understand you are talking about benchmarked tasks, not “AI is a fully general junior associate replacement.” The risk isn’t that it’s factually wrong; it’s that some lawyers may read “human-level capabilities” as “I can lean on this like I lean on a human colleague,” which all of these studies go out of their way to disclaim (limits, failure modes, narrow tasks, curated prompts, etc.).<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2510.04374v1|publisher=arxiv.org|access-date=2025-11-30}}</ref>

===== If you want to keep the punch but telegraph nuance, I’d nudge it to something like: =====
:
or slightly more cautious:
:
If you leave your current sentence as-is, I don’t think you’re out over your skis given VALS + GDPval + Guo. You’re just using “human-level” in the benchmark sense, which is exactly what those reports are about.