<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://freemwiki.com/index.php?action=history&amp;feed=atom&amp;title=Openai%2F694896a0-6814-800f-b484-4ed0ec0b6e9c</id>
	<title>Openai/694896a0-6814-800f-b484-4ed0ec0b6e9c - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://freemwiki.com/index.php?action=history&amp;feed=atom&amp;title=Openai%2F694896a0-6814-800f-b484-4ed0ec0b6e9c"/>
	<link rel="alternate" type="text/html" href="https://freemwiki.com/index.php?title=Openai/694896a0-6814-800f-b484-4ed0ec0b6e9c&amp;action=history"/>
	<updated>2026-04-05T22:46:18Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.3</generator>
	<entry>
		<id>https://freemwiki.com/index.php?title=Openai/694896a0-6814-800f-b484-4ed0ec0b6e9c&amp;diff=594726&amp;oldid=prev</id>
		<title>Lukegao: by freem.ai...</title>
		<link rel="alternate" type="text/html" href="https://freemwiki.com/index.php?title=Openai/694896a0-6814-800f-b484-4ed0ec0b6e9c&amp;diff=594726&amp;oldid=prev"/>
		<updated>2025-12-22T05:23:24Z</updated>

		<summary type="html">&lt;p&gt;by freem.ai...&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&lt;br /&gt;
{{DISPLAYTITLE:ChatGPT 5.2 Pro benchmarks}}&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;This conversation was exported from ChatGPT&amp;#039;&amp;#039;&amp;lt;ref&amp;gt;{{cite web|title=ChatGPT 5.2 Pro benchmarks|url=https://chatgpt.com/share/694896a0-6814-800f-b484-4ed0ec0b6e9c|publisher=ChatGPT|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== gpt-5-2-thinking ==&lt;br /&gt;
=== User: Why is ChatGPT 5.2 Pro not evaluated on benchmark sites? ===&lt;br /&gt;
Why is ChatGPT 5.2 Pro not evaluated on benchmark sites?  If it is, which sites?&lt;br /&gt;
&lt;br /&gt;
=== Assistant: “ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes. ===&lt;br /&gt;
“ChatGPT 5.2 Pro” is a bit of a shape-shifter as far as benchmarking goes.&lt;br /&gt;
&lt;br /&gt;
Most benchmark/leaderboard sites don’t (and often can’t) evaluate the ChatGPT product experience directly. They usually evaluate an API model with a stable, scriptable interface, because benchmarks require thousands of automated runs with controlled settings.&lt;br /&gt;
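&lt;br /&gt;
To make that concrete, here is a minimal sketch of the kind of scripted loop a harness needs: a fixed prompt set, fixed settings, and machine-readable output. It assumes the official openai Python SDK and uses the gpt-5.2-pro model id implied by the platform docs cited below; treat it as an illustration, not any leaderboard’s actual rig.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
# Illustrative benchmark-harness loop, not any leaderboard rig.&lt;br /&gt;
# Assumes the official openai Python SDK; the model id below is&lt;br /&gt;
# implied by the platform docs cited below on this page.&lt;br /&gt;
from openai import OpenAI&lt;br /&gt;
&lt;br /&gt;
client = OpenAI()  # reads OPENAI_API_KEY from the environment&lt;br /&gt;
&lt;br /&gt;
PROMPTS = [&lt;br /&gt;
    &amp;quot;Benchmark item 1 ...&amp;quot;,  # a real harness loads thousands of fixed items&lt;br /&gt;
    &amp;quot;Benchmark item 2 ...&amp;quot;,&lt;br /&gt;
]&lt;br /&gt;
&lt;br /&gt;
for prompt in PROMPTS:&lt;br /&gt;
    resp = client.responses.create(&lt;br /&gt;
        model=&amp;quot;gpt-5.2-pro&amp;quot;,  # GPT-5.2 Pro is served via the Responses API&lt;br /&gt;
        input=prompt,&lt;br /&gt;
    )&lt;br /&gt;
    print(resp.output_text)  # collected and scored offline&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;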
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Why you often don’t see “ChatGPT 5.2 Pro” on the big public leaderboards&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;ChatGPT (the app) isn’t a clean benchmark target.&amp;#039;&amp;#039;&amp;#039; The ChatGPT experience includes extra layers that benchmarks try to avoid: system instructions, safety layers, tool routing, UI behaviors, feature toggles, and sometimes “smart defaults” that aren’t exposed as simple knobs. That makes results harder to reproduce and compare apples-to-apples across vendors.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Automated evaluation via ChatGPT is generally a no-go.&amp;#039;&amp;#039;&amp;#039; OpenAI’s Terms of Use forbid automatically or programmatically extracting data or output from the services, which blocks the typical “benchmark harness logs into the website and blasts 50k prompts” approach.&amp;lt;ref&amp;gt;{{cite web|title=OpenAI|url=https://openai.com/policies/row-terms-of-use/|publisher=openai.com|access-date=2025-12-22}}&amp;lt;/ref&amp;gt; So serious evaluators generally stick to the API.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;GPT-5.2 Pro is API-accessible, but it’s a specialty endpoint.&amp;#039;&amp;#039;&amp;#039; OpenAI’s model page states GPT-5.2 Pro is available via the Responses API (not the older Chat Completions flow), which can be an extra integration hurdle for some benchmark rigs.&amp;lt;ref&amp;gt;{{cite web|title=OpenAI Platform|url=https://platform.openai.com/docs/models/gpt-5.2-pro|publisher=OpenAI Platform|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;It’s expensive to benchmark at scale.&amp;#039;&amp;#039;&amp;#039; The official model listing shows GPT-5.2 Pro priced at $21 per 1M input tokens and $168 per 1M output tokens, so benchmarks can become hilariously pricey, fast; a back-of-envelope sketch follows this list.&amp;lt;ref&amp;gt;{{cite web|title=OpenAI Platform|url=https://platform.openai.com/docs/models/gpt-5.2-pro|publisher=OpenAI Platform|access-date=2025-12-22}}&amp;lt;/ref&amp;gt; Even if a site could run it, many choose not to continuously retest the most expensive tiers.&lt;br /&gt;
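&lt;br /&gt;
A back-of-envelope sketch of why, using the listed prices and the 50k-prompt figure from above; the per-item token counts are assumptions for illustration, not measurements.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
# Rough cost estimate at the listed GPT-5.2 Pro prices:&lt;br /&gt;
# $21 per 1M input tokens, $168 per 1M output tokens.&lt;br /&gt;
# The prompt count comes from the 50k figure above; the&lt;br /&gt;
# per-item token counts are assumptions for illustration.&lt;br /&gt;
PRICE_IN = 21 / 1_000_000    # dollars per input token&lt;br /&gt;
PRICE_OUT = 168 / 1_000_000  # dollars per output token&lt;br /&gt;
&lt;br /&gt;
prompts = 50_000  # one full benchmark run&lt;br /&gt;
avg_in = 1_000    # assumed input tokens per item&lt;br /&gt;
avg_out = 4_000   # assumed output tokens per item (reasoning models write a lot)&lt;br /&gt;
&lt;br /&gt;
cost = prompts * (avg_in * PRICE_IN + avg_out * PRICE_OUT)&lt;br /&gt;
print(f&amp;quot;one run: ${cost:,.0f}&amp;quot;)  # prints: one run: $34,650&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;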
&lt;br /&gt;
So is it evaluated anywhere? Yep, just not everywhere.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Where GPT-5.2 Pro does show up (examples)&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;SimpleBench&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
SimpleBench’s public leaderboard includes “GPT-5.2 Pro (xhigh)” (and also “GPT-5.2 (high)”).&amp;lt;ref&amp;gt;{{cite web|title=SimpleBench|url=https://simple-bench.com/|publisher=simple-bench.com|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;ARC Prize leaderboard (ARC-AGI)&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
ARC Prize’s leaderboard includes GPT-5.2 Pro results (they track performance vs. cost per task for ARC-AGI-style evaluations).&amp;lt;ref&amp;gt;{{cite web|title=ARC Prize|url=https://arcprize.org/leaderboard|publisher=ARC Prize|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
Note: their page is heavily visual/interactive, so some text scrapers show less detail, but the leaderboard is a real benchmarking venue.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Where you’ll often NOT see GPT-5.2 Pro&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;LMArena / Chatbot Arena (LMSYS)&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
Arena currently lists GPT-5.2 variants like gpt-5.2, gpt-5.2-high, and gpt-5.2-search, but not a “Pro” entry.&amp;lt;ref&amp;gt;{{cite web|title=LMArena|url=https://lmarena.ai/leaderboard|publisher=lmarena.ai|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
That’s consistent with the cost + access + “which exact settings did you run?” problem.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Bonus: sites that track GPT-5.2 generally (may not label “Pro” separately)&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
Artificial Analysis has a GPT-5.2 model page and comparisons, but it commonly shows GPT-5.2 variants (like xhigh) rather than always breaking out “Pro” as its own named entry.&amp;lt;ref&amp;gt;{{cite web|title=Artificial Analysis|url=https://artificialanalysis.ai/models/gpt-5-2|publisher=Artificial Analysis|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
Epoch AI’s ECI aggregates benchmark scores and reports GPT-5.2 on its capability index (again, not always split into every tier the way OpenAI markets them).&amp;lt;ref&amp;gt;{{cite web|title=Substack|url=https://substack.com/%40epochai/note/c-187985664|publisher=substack.com|access-date=2025-12-22}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you want a clean mental model: benchmark sites are usually grading the underlying API model (gpt-5.2-pro) under very specific settings, not the full ChatGPT “Pro” experience you use in the app. That’s why the label you’re looking for often seems to “vanish” like a magician’s coin, even when the model itself is being tested. 🪄&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:ChatGPT Shares]]&lt;br /&gt;
[[Category:openai]]&lt;/div&gt;</summary>
		<author><name>Lukegao</name></author>
	</entry>
</feed>