
=== Assistant: Good initiative — using a public site like LLM‑stats.com to ground claims is smart. ===
Good initiative: using a public site like LLM‑stats.com to ground claims is smart. I looked through their leaderboard and other public sources, but there are limits. Here’s what I found, and why I couldn’t build a fully reliable “top-5 per benchmark” table for all the benchmarks you requested. I also highlight where the gaps and caveats lie.

LLM-stats does provide leaderboards for many of the benchmarks you asked about. For some benchmarks, I can list the top models (often top-3 or top-5) based on their published scores. But in many cases the results are “self-reported” (i.e. submitted by the model vendors or the community), which reduces how much confidence we can place in them. (LLM-stats itself notes that many are not “verified.”) LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/gpqa|publisher=LLM Stats|access-date=2025-12-07}}</ref>

Here are some of the leaderboards I could confidently access, with top models:

|     |     |
| --- | --- |
|     |     |
|     |     |
|     |     |

Additionally:
* LLM-stats lists LiveCodeBench among its benchmarks. LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks|publisher=LLM Stats|access-date=2025-12-07}}</ref>
* LLM-stats also enumerates MATH 500 (classic MATH benchmark) among its math benchmarks. LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks|publisher=LLM Stats|access-date=2025-12-07}}</ref>

# Some benchmarks are missing, not listed, or have no leaderboard entries.
#* For SciCode, I did not find a public verified leaderboard on LLM-stats (or a consistent ranking list). The benchmark is listed in the broader “benchmark directory” (the site’s menu lists “SciCode”), LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks|publisher=LLM Stats|access-date=2025-12-07}}</ref> but I found no “SciCode leaderboard page” with stable rankings.
#* For some older or niche benchmarks (older versions of AMC or variants of AIME), results are stale, missing, or only community/self-reported.
# Many “top results” are self-reported by model authors or the community, not independently audited.
#* For GPQA, LLM-stats explicitly states “with 0 verified results and 141 self-reported results.” LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/gpqa|publisher=LLM Stats|access-date=2025-12-07}}</ref>
#* That means the “top-5” for those benchmarks reflects what model providers or enthusiasts submitted, not necessarily what was re-tested under controlled conditions.
# For some benchmarks, the header indicates they exist, but clicking through yields no data or shows “pending.”
#* Example: the “Global-MMLU” page shows only 4 models evaluated, which is far from a robust large-scale leaderboard. LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/global-mmlu|publisher=LLM Stats|access-date=2025-12-07}}</ref>
#* For “LiveCodeBench,” the leaderboard lists 53 models, but again the results are “self-reported.” LLM Stats<ref>{{cite web|title=LLM Stats|url=https://llm-stats.com/benchmarks/livecodebench|publisher=LLM Stats|access-date=2025-12-07}}</ref>
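To make these caveats concrete, here is a minimal Python sketch of how I would rank leaderboard entries while keeping the verified/self-reported distinction visible. Everything in it is an assumption for illustration: the `Entry` record, the `top_k` and `reliability_note` helpers, the `min_models` threshold, and the placeholder model names and scores are hypothetical, and none of it reflects LLM-stats' actual data format or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, not LLM-stats' schema)."""
    model: str       # placeholder model name
    score: float     # benchmark score, higher is better
    verified: bool   # True only if the result was independently re-tested


def top_k(entries, k=5, verified_only=False):
    """Return up to k entries sorted by score, best first."""
    pool = [e for e in entries if e.verified or not verified_only]
    return sorted(pool, key=lambda e: e.score, reverse=True)[:k]


def reliability_note(entries, min_models=10):
    """Describe why a leaderboard may deserve limited trust."""
    notes = []
    if not any(e.verified for e in entries):
        notes.append("no verified results")
    if len(entries) < min_models:
        notes.append(f"only {len(entries)} models evaluated")
    return "; ".join(notes) or "ok"


# Placeholder data, invented purely to exercise the helpers.
entries = [
    Entry("model-a", 0.71, verified=False),
    Entry("model-b", 0.83, verified=False),
    Entry("model-c", 0.64, verified=False),
    Entry("model-d", 0.59, verified=False),
]

print([e.model for e in top_k(entries, k=3)])   # ['model-b', 'model-a', 'model-c']
print(top_k(entries, verified_only=True))       # []
print(reliability_note(entries))                # no verified results; only 4 models evaluated
```

The point of the design is that a "top-5" list and its reliability note travel together, so a ranking built entirely from self-reported scores is never shown without that warning.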

==== I can produce a partial table, but it will be very incomplete, and with warnings about reliability. Something like: ====

|     |     |     |
| --- | --- | --- |
|     |     |     |
|     |     |     |
|     |     |     |
|     |     |     |
|     |     |     |
|     |     |     |

* Yes, it is possible to use public data. LLM-stats and a few other sites give you a starting point.
* But “data-driven” is harder than it looks, because of self-reporting, missing entries, incomplete coverage, and lack of verification.
* Any claims you draw from such a table must include disclaimers about reliability and coverage.

==== If you want to seriously compare models in a reproducible way: ====
* Focus on benchmarks with open test sets + public scoring scripts (e.g. classic MATH, MMLU, AMC / AIME, open math competition data, some coding benchmarks) — avoid ones that depend on proprietary or self-reported data.
* For each benchmark, pick 3–5 of the best models and re-evaluate them yourself, under identical prompt conditions.
* Prefer benchmarks designed to avoid training-data contamination (or otherwise verify that test data was not in model training sets).
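As a rough illustration of "re-evaluate them yourself, under identical prompt conditions," here is a minimal Python sketch of an evaluation loop. It assumes each model is wrapped as a plain function from prompt string to answer string, and that scoring is exact match after light normalization. The names `PROMPT_TEMPLATE`, `normalize`, `evaluate`, and `compare`, and the stub models at the bottom, are all hypothetical; real benchmarks usually ship their own scoring scripts, which should take precedence over this simplification.

```python
from typing import Callable, Dict, Iterable, Tuple

# One shared template so every model sees exactly the same prompt (assumed wording).
PROMPT_TEMPLATE = "Question: {question}\nAnswer with the final result only."


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.strip().lower().split())


def evaluate(model_fn: Callable[[str], str],
             items: Iterable[Tuple[str, str]],
             template: str = PROMPT_TEMPLATE) -> float:
    """Return exact-match accuracy of model_fn over (question, reference) pairs."""
    items = list(items)
    if not items:
        raise ValueError("no benchmark items supplied")
    correct = 0
    for question, reference in items:
        prediction = model_fn(template.format(question=question))
        correct += normalize(prediction) == normalize(reference)
    return correct / len(items)


def compare(models: Dict[str, Callable[[str], str]],
            items: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score every model on the same items with the same template."""
    items = list(items)  # materialize once so each model sees identical data
    return {name: evaluate(fn, items) for name, fn in models.items()}


# Stub "models" standing in for real API calls (purely illustrative).
def always_four(prompt: str) -> str:
    return " 4 "


def keyword_model(prompt: str) -> str:
    return "paris" if "France" in prompt else "4"


items = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
print(compare({"always_four": always_four, "keyword_model": keyword_model}, items))
# {'always_four': 0.5, 'keyword_model': 1.0}
```

In practice you would replace the stubs with calls to each provider's API, fix the sampling settings (e.g. temperature) across models, and log the raw outputs so that others can re-score them.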

If you like, I can try to build a “clean benchmark pack” of 4–6 tests (math, reasoning, coding) with open-source data, and then show recent published performance for a handful of models (e.g. GPT-5.1, Gemini 3 Pro, Claude, Grok).
That way we’d base the comparison only on verifiable, open-data benchmarks.
Do you want me to build that pack for you now?