
OpenAI’s latest AI models are outpacing competitors from Google, Anthropic, xAI, and Meta in keeping their facts straight, according to new rankings. The results show stark differences in “hallucination rates,” or how often these AI models invent details.
The results come from Vectara’s Hughes Hallucination Evaluation Model (HHEM) Leaderboard, which measures the “ratio of summaries that hallucinate” across leading large language models. In head-to-head tests, ChatGPT models outperformed Gemini, Claude, Grok, and Meta AI, landing near the top of the accuracy race.
How the top AI tools stack up when the facts matter
Vectara’s HHEM Leaderboard is based on a large-scale test designed to determine whether AI models can adhere to the facts when summarizing real news articles. Each AI model was given the same set of short documents and scored on how often its summaries included information not found in the original text.
Refusal rates were also tracked, capturing how often an AI model declined to answer. Because the conditions were identical for every model, the results show which AI tools handle the facts best under the same pressure. Here’s how they performed.
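To make the methodology concrete, here is a minimal Python sketch of how a hallucination rate and a refusal rate could be tallied from (document, summary) pairs. The score_consistency() function and the 0.5 cutoff are placeholders standing in for whatever factual-consistency judge an evaluation like HHEM uses; they are assumptions for illustration, not Vectara’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class EvalResult:
    hallucination_rate: float  # share of answered summaries judged inconsistent with the source
    refusal_rate: float        # share of prompts the model declined to summarize

def evaluate_model(
    documents: List[str],
    summaries: List[Optional[str]],                   # None marks a refusal to answer
    score_consistency: Callable[[str, str], float],   # placeholder judge: returns a 0..1 consistency score
    threshold: float = 0.5,                           # assumed cutoff below which a summary counts as hallucinated
) -> EvalResult:
    """Tally hallucination and refusal rates for one model over a fixed document set."""
    refusals = sum(1 for s in summaries if s is None)
    answered = [(doc, s) for doc, s in zip(documents, summaries) if s is not None]
    hallucinated = sum(1 for doc, s in answered if score_consistency(doc, s) < threshold)
    return EvalResult(
        hallucination_rate=hallucinated / len(answered) if answered else 0.0,
        refusal_rate=refusals / len(summaries) if summaries else 0.0,
    )
```

Because every model is scored on the same documents with the same judge and cutoff, the resulting rates are directly comparable, which is the point of the leaderboard’s fixed test set.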
OpenAI
OpenAI holds five of the lowest hallucination rates on the leaderboard, led by o3-mini at 0.795%, with GPT-4.5, GPT-5, o1-mini, and GPT-4o all clustered around the 1.2% to 1.49% mark.
That grounding in facts made GPT-5’s debut as the default model a strong move for the AI giant, until users pushed back and demanded the return of GPT-4o. CEO Sam Altman relented, letting Plus subscribers choose their model.
But there’s a trade-off. Once free users hit their GPT-5 limit, they’re switched to GPT-5 mini, a sharp drop in accuracy: its 4.9% hallucination rate is among the highest in OpenAI’s lineup. That could mean a sudden slide in how much you can trust the answers you get.
Google
Google’s Gemini 2.5 Pro Preview and Gemini 2.5 Flash Lite scored 2.6% and 2.9%, respectively. Not as low as OpenAI’s leaders, but still well clear of the highest-risk models. Pro Preview replaced the now-retired Gemini 2.5 Pro Experimental, which had once posted one of the lowest scores on the board at 1.1%.
Anthropic
Anthropic’s newest models, Claude Opus 4.1 and Claude Sonnet 4, post hallucination rates of 4.2% and 4.5%, respectively. Those scores place both among the more error-prone entries on the board, well behind the leading OpenAI and Gemini models.
Meta
Meta’s Llama 4 Maverick and Llama 4 Scout posted 4.6% and 4.7% hallucination rates, putting them in the same ballpark as Claude’s latest models and outside the group of most accurate performers on the board.
xAI
Grok 4 posts a high hallucination rate of 4.8%, placing it among the least accurate models on the leaderboard. Elon Musk has promoted the newly released model as “smarter than almost all graduate students, in all disciplines,” pointing to its 26.9% score on Humanity’s Last Exam.
The chatbot is also facing criticism for harmful and inappropriate outputs. The combination of a high error rate and ongoing content issues could make Grok a risky choice when factual reliability matters.
Keeping track of truth in the age of AI
When AI gets it wrong, it can sound right. And when those made-up details slip past unnoticed, they can bend facts, spread misinformation, and create serious risks in areas like health, law, finance, and politics. That’s why ongoing, transparent testing is more important than ever.
Vectara’s HHEM Leaderboard updates with every model change, tracking in real time which AIs are improving and which are falling behind. As these systems weave deeper into search, messaging, and everyday tools, knowing which AI model stays closest to the truth is knowing what to trust.
In our closer look at OpenAI’s GPT-5, we focus on the AI model’s health-related benchmarks and guidelines.