AI Benchmark Discrepancy Reveals Gaps in Performance Claims

Last updated: April 22, 2025 11:36 am

FrontierMath accuracy for OpenAI’s o3 and o4-mini compared to leading models. Image: Epoch AI

The latest results from FrontierMath, a benchmark test for generative AI on advanced math problems, show OpenAI’s o3 model performed worse than OpenAI originally stated. While newer OpenAI models now outperform o3, the discrepancy highlights the need to scrutinize AI benchmarks closely.

Epoch AI, the research institute that created and administers the test, released its latest findings on April 18.

OpenAI claimed 25% completion of the test in December

In December, OpenAI announced o3’s FrontierMath score amid the flood of announcements and promotions in its 12-day holiday event. The company claimed that OpenAI o3, then its most powerful reasoning model, had solved more than 25% of the problems on FrontierMath. By comparison, most rival AI models scored around 2%, according to TechCrunch.

On April 18, Epoch AI released test results showing OpenAI o3 scored closer to 10%. Why the large gap? Both the model and the test may have changed since December. The version of OpenAI o3 submitted for benchmarking last year was a prerelease version, and FrontierMath itself has changed since December and now has a different number of math problems. The lesson is not that benchmarks cannot be trusted, but that readers should check which version of the model and which version of the test a score refers to.
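For readers who track benchmark numbers themselves, the version caveat can be made concrete with a short Python sketch. It is only an illustration: the `BenchmarkResult` record, the `version_mismatches` helper, the version labels, and the solved and total counts are all hypothetical placeholders, chosen so the ratios resemble the reported 25% and 10%. They are not Epoch AI or OpenAI data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkResult:
    """One reported score, tagged with the versions it was measured on."""
    model: str
    model_version: str      # hypothetical label, e.g. "prerelease"
    benchmark: str
    benchmark_version: str  # hypothetical label for the problem set used
    solved: int             # placeholder count, not real data
    total: int              # placeholder count, not real data

    @property
    def accuracy(self) -> float:
        return self.solved / self.total


def version_mismatches(a: BenchmarkResult, b: BenchmarkResult) -> List[str]:
    """Return the version fields that differ between two results."""
    fields = ["model_version", "benchmark_version"]
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


# Placeholder records: the counts are invented for illustration only.
december = BenchmarkResult("o3", "prerelease", "FrontierMath", "december-set", 50, 200)
april = BenchmarkResult("o3", "public-release", "FrontierMath", "april-set", 30, 300)

differences = version_mismatches(december, april)
if differences:
    print("Scores are not directly comparable; differing fields:", differences)
print(f"December: {december.accuracy:.0%}, April: {april.accuracy:.0%}")
```

Storing the version labels next to each score makes it harder to compare two percentages that were measured on different models or different problem sets without noticing.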

OpenAI o4-mini and o3-mini score highest on new FrontierMath results

The updated results show OpenAI o4-mini with reasoning performed best, scoring between 15% and 19%. It was followed by OpenAI o3-mini, with o3 in third. Other rankings include:

OpenAI o1
Grok-3 mini
Claude 3.7 Sonnet (16K)
Grok-3
Claude 3.7 Sonnet (64K)

Although Epoch AI independently administers the test, OpenAI originally commissioned FrontierMath and owns its content.

Criticisms of AI benchmarking

Benchmarks are a common way to compare generative AI models, but critics say the results can be influenced by test design or a lack of transparency. A July 2024 study raised concerns that benchmarks often overemphasize narrow task accuracy and suffer from non-standardized evaluation practices.
