The AI startup Oumi, commissioned by The New York Times, analyzed 4,326 Google searches based on questions from the industry-standard SimpleQA benchmark. The tests ran in two phases: once in October, when Gemini 2 was the underlying AI model, and again in February, after the upgrade to Gemini 3.

The results showed that AI Overviews were correct in 85% of cases with Gemini 2, increasing to 91% with Gemini 3. While this appears to be a high success rate, at Google’s scale it still translates into millions of incorrect answers per hour.
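
To make the scale concrete, here is a rough back-of-envelope calculation. The query volume and the share of searches showing an AI Overview are outside assumptions used purely for illustration, not figures from the study:

```python
# Back-of-envelope estimate behind "millions of incorrect answers per hour".
# Assumptions (not from the study): Google has said it handles on the order
# of 5 trillion searches per year, and the share of queries that trigger an
# AI Overview is taken here, purely for illustration, as 20%.
SEARCHES_PER_YEAR = 5e12    # assumed annual query volume
OVERVIEW_SHARE = 0.20       # assumed fraction of searches with an AI Overview
ERROR_RATE = 1 - 0.91       # 91% accuracy measured with Gemini 3

searches_per_hour = SEARCHES_PER_YEAR / (365 * 24)
wrong_per_hour = searches_per_hour * OVERVIEW_SHARE * ERROR_RATE
print(f"~{wrong_per_hour / 1e6:.0f} million incorrect AI answers per hour")
# prints ~10 million per hour under these assumptions
```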

However, the study does not answer a key question: would users receive better information through traditional search results or alternative sources? Not everything published on websites is accurate either. The critical metric is not absolute correctness, but whether users receive more accurate information overall with AI than without it.

Better answers, weaker verification

Another key finding: although accuracy improved with Gemini 3, the verifiability of answers declined. Oumi examined whether the sources linked by Google actually supported the given answers.

With Gemini 2, 37% of correct answers were “ungrounded,” meaning the linked sources did not fully support the information. With Gemini 3, this rose to 56%. Combining the two figures, only about 40% of Gemini 3’s answers (91% × 44%) were both correct and fully backed by their cited sources. In many cases, users cannot verify an answer’s correctness from the sources provided.

The quality of sources is also debatable. Among the 5,380 cited sources, Facebook and Reddit were the second and fourth most frequent. For correct answers, Facebook was cited in 5% of cases; for incorrect ones, in 7%. This may reflect Google’s preference for sources less likely to pursue legal action over content usage.

Errors can occur even when the system identifies the correct source. For example, when asked about the Classical Music Hall of Fame, Google found the correct website listing Yo-Yo Ma as a member but still claimed there was no record of his induction.

In another case, when asked about a river west of Goldsboro, North Carolina, Google correctly identified a tourism website but misinterpreted the information, naming the Neuse River instead of the actual western river, the Little River.

Similarly, when asked about the Bob Marley Museum, Google’s AI Overview gave the incorrect opening year of 1987 instead of 1986, based on conflicting information from a Facebook post, a travel blog, and a Wikipedia page.

Google criticizes the study

To verify responses, Oumi used its own AI verification model, HallOumi, enabling large-scale evaluation. However, this introduces a key limitation: the verifying AI itself can make mistakes. Additionally, AI Overviews can generate different answers to identical queries, even seconds apart.
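
The article does not describe how HallOumi works internally. Conceptually, though, automated grounding checks of this kind can be framed as a natural-language-inference task: does the cited passage entail the claimed answer? Below is a minimal sketch using a public NLI model as a stand-in; the model choice, input format, and threshold are illustrative assumptions, not Oumi’s actual pipeline:

```python
from transformers import pipeline

# Illustrative grounding check: an off-the-shelf natural-language-inference
# (NLI) model judges whether a cited passage entails a claimed answer.
# This is a generic stand-in, not HallOumi itself.
nli = pipeline("text-classification", model="roberta-large-mnli")

def is_grounded(source_passage: str, claim: str, threshold: float = 0.5) -> bool:
    """Return True if the NLI model says the passage entails the claim."""
    result = nli([{"text": source_passage, "text_pair": claim}])[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

# Hypothetical example echoing the Bob Marley Museum case:
passage = "The Bob Marley Museum opened to the public in 1986."
print(is_grounded(passage, "The Bob Marley Museum opened in 1987."))  # expect False
print(is_grounded(passage, "The Bob Marley Museum opened in 1986."))  # expect True
```

A check like this scales to thousands of answers, but it inherits the blind spots of its own model, which is exactly the limitation noted above.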

Google criticized the study as flawed. Spokesperson Ned Adriance argued that the SimpleQA benchmark contains inaccuracies and does not reflect real user search behavior.

Despite its name, the SimpleQA benchmark consists of deliberately difficult questions, each chosen because at least one AI model previously failed to answer it correctly. It was also designed to test models answering from their own knowledge, without internet access.

Google’s latest model, Gemini 3.1 Pro, shows a 38-percentage-point lower hallucination rate than Gemini 3, according to the Artificial Analysis Intelligence Index. At the time of testing, Google Search likely used a lighter “Flash” version of Gemini 3. According to Google, results that incorporate web search are more accurate than those based solely on model knowledge.

The impact of AI answers on the web

The broader controversy around Google’s AI Overviews concerns their structural impact on the internet. By providing direct answers instead of directing users to external websites, Google reduces traffic to publishers, undermining their economic foundation.

The open web risks losing its role as a freely linked information network, gradually being replaced by a centralized AI interface controlled by Google. For most users, an accuracy rate of around 90% appears good enough, which further reduces the incentive to verify information against the original sources.