OpenAI's own research finds that even its best models … GIVE WRONG ANSWERS!!


OpenAI's latest models are shockingly bad at being right.


4 November 2024 (Washington, DC) — From the article:

OpenAI has released a new benchmark, dubbed “SimpleQA,” that’s designed to measure the accuracy of the output of its own and competing artificial intelligence models.

In doing so, the AI company has revealed just how bad its latest models are at providing correct answers. In its own tests, its cutting edge o1-preview model, which was released last month, scored an abysmal 42.7% success rate on the new benchmark.

In other words, even the cream of the crop of recently announced large language models (LLMs) is far more likely to provide an outright incorrect answer than a right one — a concerning indictment, especially as the tech is starting to pervade many aspects of our everyday lives.

Competing models, like Anthropic’s, scored even lower on OpenAI’s SimpleQA benchmark, with its recently released Claude-3.5-sonnet model getting only 28.9% of questions right. However, the model was far more inclined to reveal its own uncertainty and decline to answer — which, given the damning results, is probably for the best.
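The distinction matters because a model that declines to answer takes a hit on raw accuracy but avoids confidently stating falsehoods. As a rough sketch of how such a tally might work (the `score` function and field names here are illustrative, not OpenAI's actual grader; SimpleQA grades each response as correct, incorrect, or not attempted):

```python
# Hypothetical sketch of tallying SimpleQA-style results. Each graded
# answer is one of "correct", "incorrect", or "not_attempted".
# Function and key names are illustrative assumptions, not OpenAI's API.

def score(grades):
    n = len(grades)
    correct = grades.count("correct")
    attempted = n - grades.count("not_attempted")
    return {
        # right answers over ALL questions -- the headline number
        "overall_accuracy": correct / n,
        # right answers over only the questions the model tried
        "accuracy_when_attempted": correct / attempted if attempted else 0.0,
        # how often the model declined to answer
        "abstention_rate": (n - attempted) / n,
    }

# A model that abstains often scores lower overall but can look
# better on accuracy-when-attempted:
grades = ["correct", "incorrect", "not_attempted", "correct", "not_attempted"]
print(score(grades))
```

Under this kind of scoring, Claude-3.5-sonnet's habit of admitting uncertainty drags down its headline number while sparing users some confident fabrications.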

Worse yet, OpenAI found that its own AI models tend to vastly overestimate their own abilities, a characteristic that can lead to them being highly confident in the falsehoods they concoct.

LLMs have long suffered from “hallucinations,” an elegant term AI companies have come up with to denote their models’ well-documented tendency to produce answers that are complete BS.

Despite the very high chance of ending up with complete fabrications, the world has embraced the tech with open arms, from students generating homework assignments to developers employed by tech giants generating huge swathes of code.

And the cracks are starting to show. Case in point: an AI transcription tool used by hospitals and built on OpenAI tech was caught this week introducing frequent hallucinations and inaccuracies while transcribing patient interactions.

Yep. Nothing to see here, move along, it’s all fine. For the full article, click here.
