
# Some benchmarks are missing, unlisted, or have no leaderboard entries.
#* For SciCode, I did not find a public verified leaderboard on LLM-stats (or a consistent ranking list). The benchmark appears in the broader “benchmark directory” (the site’s menu lists “SciCode”),<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks|publisher=LLM Stats|access-date=2025-12-07}}</ref> but I found no “SciCode leaderboard page” with stable rankings.
#* For some older or niche benchmarks (older versions of AMC, or variants of AIME), results are stale, absent, or only community/self-reported.
# Many “top results” are self-reported by model authors or the community, not independently audited.
#* For GPQA, LLM-stats explicitly states “with 0 verified results and 141 self-reported results.”<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/gpqa|publisher=LLM Stats|access-date=2025-12-07}}</ref>
#* That means the “top-5” for those benchmarks reflects what model providers or enthusiasts submitted, not necessarily what was re-tested under controlled conditions.
# For some benchmarks, the header indicates they exist, but clicking through yields no data, a “pending” notice, or only a handful of entries.
#* Example: the “Global-MMLU” page shows only 4 models evaluated, which is far from a robust large-scale leaderboard.<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/global-mmlu|publisher=LLM Stats|access-date=2025-12-07}}</ref>
#* For “LiveCodeBench,” the leaderboard lists 53 models, but again the results are “self-reported.”<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/livecodebench|publisher=LLM Stats|access-date=2025-12-07}}</ref>
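
The caveats above can be applied mechanically when reading a leaderboard. The following is a minimal sketch, not anything LLM-stats provides: the <code>LeaderboardSnapshot</code> type, the <code>caveats</code> helper, and the <code>MIN_ENTRIES</code> threshold are all hypothetical names chosen for illustration. The figures are the ones quoted above, entered by hand; <code>None</code> marks anything the pages did not state, and the zero for LiveCodeBench is a reading of "self-reported", not a published count.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff below which a leaderboard is treated as too small to rank from.
MIN_ENTRIES = 10


@dataclass
class LeaderboardSnapshot:
    benchmark: str
    entries: Optional[int]   # listed results; None if no leaderboard page was found
    verified: Optional[int]  # independently verified results; None if not stated


def caveats(snapshot: LeaderboardSnapshot) -> List[str]:
    """Return the caveats from the list above that apply to one leaderboard."""
    if snapshot.entries is None:
        return ["no leaderboard found"]
    notes = []
    if snapshot.entries < MIN_ENTRIES:
        notes.append("small sample")
    if snapshot.verified is None:
        notes.append("verification status not stated")
    elif snapshot.verified == 0:
        notes.append("self-reported only")
    return notes


# Figures as quoted in the text above (accessed 2025-12-07).
SNAPSHOTS = [
    LeaderboardSnapshot("SciCode", None, None),
    LeaderboardSnapshot("GPQA", 141, 0),
    LeaderboardSnapshot("Global-MMLU", 4, None),
    # Assumption: "self-reported" is read here as zero verified results.
    LeaderboardSnapshot("LiveCodeBench", 53, 0),
]

if __name__ == "__main__":
    for s in SNAPSHOTS:
        print(f"{s.benchmark}: {', '.join(caveats(s)) or 'no caveats flagged'}")
```

Keeping "not stated" separate from "zero verified" avoids reading more into a page than it actually says.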