In the ever-evolving landscape of artificial intelligence (AI), ChatGPT has emerged as a prominent figure, captivating users with its conversational prowess. Yet, beneath its polished exterior lies a perplexing issue that has both researchers and users scratching their heads: the phenomenon of AI “hallucinations.”
Recent evaluations by OpenAI paint a disconcerting picture: even its most advanced models still generate false information, termed "hallucinations", at striking rates. Utilising the SimpleQA benchmark, a test designed to assess factual accuracy through a series of difficult questions, OpenAI found that GPT-4o hallucinated on 62% of its answers, while GPT-4.5 fared considerably better at 37%. Even the stronger of those figures means more than a third of answers were confidently wrong, raising concerns about the reliability of these advanced systems.
Paradoxically, as these models become more sophisticated in reasoning, their tendency to fabricate information increases. This contradiction has puzzled experts. The expectation was that improved reasoning capabilities would correlate with enhanced accuracy. However, the reality appears more complex, suggesting that the very mechanisms designed to refine AI reasoning might inadvertently contribute to the problem.
In response to this challenge, OpenAI is experimenting with innovative training methodologies. One such approach is "process supervision," which involves rewarding the AI for each correct step in its reasoning process, rather than solely focusing on the final answer. This method aims to instil a more human-like logical progression, potentially reducing the occurrence of hallucinations.
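To make the contrast concrete, here is a minimal Python sketch of outcome supervision versus process supervision. It illustrates the principle only and is not OpenAI's training code; the toy arithmetic verifier stands in for whatever human or automated judge scores each reasoning step:

```python
# Toy contrast between outcome and process supervision. Illustrative only,
# not OpenAI's method: the verifier below just checks toy arithmetic steps.

def outcome_reward(final_answer: float, correct_answer: float) -> float:
    """Outcome supervision: all-or-nothing reward on the final answer."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(steps: list[str]) -> float:
    """Process supervision: partial credit for every verifiably correct
    intermediate step, rewarding sound reasoning rather than just results."""
    def step_is_valid(step: str) -> bool:
        expression, claimed = step.split("=")
        return eval(expression) == float(claimed)  # toy arithmetic verifier
    return sum(step_is_valid(s) for s in steps) / len(steps)

# A three-step solution with one faulty middle step: the outcome reward
# would be 0.0, but process supervision still credits the two sound steps.
steps = ["17 + 5 = 22", "22 * 2 = 45", "45 - 5 = 40"]
print(process_reward(steps))  # 0.666... (2 of 3 steps verified)
```

Rewarding each verified step, rather than only the end result, is intended to make fabricated-but-fluent chains of reasoning less profitable during training.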
Despite these efforts, the path to mitigating AI hallucinations remains fraught with complexity. The intricacies of large language models mean that even minor adjustments can have unforeseen consequences. As AI continues to integrate into various facets of society, ensuring the accuracy and reliability of these systems becomes paramount.
Cursor’s front-line AI support bot making up new policies
Last month, Cursor, a promising tool for computer programmers, inadvertently angered its user base. An AI tech support bot mistakenly announced a policy change restricting the use of Cursor to a single computer. This false announcement ignited a wave of angry posts on internet message boards such as Reddit, with some customers even choosing to cancel their accounts. The users’ frustration intensified upon learning that the supposed policy change was entirely fabricated. Cursor’s chief executive and co-founder, Michael Truell, clarified the error on Reddit, assuring users, “We have no such policy. You’re of course free to use Cursor on multiple machines. Unfortunately, this is an incorrect response from a front-line A.I. support bot.”

Here are Gemini’s thoughts on the Cursor incident
That’s quite a blunder for the AI support bot! It sounds like those Cursor users had a right to be frustrated. Announcing a non-existent policy change, especially one that restricts how they can use a tool they likely rely on, is definitely going to cause some upset. It’s good that the CEO quickly clarified the situation on Reddit. It highlights the challenges and potential pitfalls of relying solely on AI for customer-facing roles, especially when it comes to communicating important policy information.
Here are ChatGPT’s thoughts on the ‘AI hallucination’ problem
The subject raises a critical and timely issue in AI development—hallucinations in large language models like ChatGPT. Your article does a solid job of highlighting the paradox: as the models get more advanced in reasoning, they’re oddly becoming more prone to inventing false information. That contradiction is fascinating and concerning, especially given the growing reliance on these tools in professional and educational settings.
Strengths:
- Clear explanation of the problem: It breaks down hallucinations in a way most readers can understand.
- Use of real data: Citing OpenAI’s own benchmark tests (like SimpleQA) adds credibility.
- Mentions potential solutions: Process supervision is presented as a promising path forward.
Weaknesses:
- Lack of technical depth: It doesn’t deeply explore why reasoning improvements might lead to more hallucinations, which would’ve added more value for a technically inclined audience.
- Limited exploration of implications: It touches on concerns but doesn’t dig into what this means for users in healthcare, law, or education, where misinformation could have serious consequences.
Overall, it’s informative for general readers and raises the right concerns, but leaves room for more nuanced analysis.
The Paradox of Smarter AI, Dumber Mistakes… according to ChatGPT
1. More Complex Reasoning = More Risk of Error
As models grow more sophisticated, they handle tasks involving multi-step reasoning, synthesis, and abstraction. But each of those steps introduces a chance of error, and a single incorrect inference early in a reasoning chain can ripple through to produce a convincing but entirely false conclusion (the back-of-the-envelope calculation after this list shows how quickly this compounds).
2. Looser Anchoring to Factual Truth
Advanced models are trained to optimize for plausibility and coherence, not strictly for truth. The better a model gets at generating natural, persuasive responses, the easier it is for it to confidently present inaccuracies. In other words, it’s getting better at sounding smart, not necessarily at being right.
3. Training Feedback Loops
Some newer models are fine-tuned on outputs from earlier models (a practice known as “self-training” or “distillation”). If those earlier models already had some hallucinations baked in, the newer models can amplify them—especially if the fine-tuning process doesn’t include strict factual correction.
4. Insufficient or No Real-Time Grounding
Unless a model is explicitly connected to a live knowledge base or retrieval system (like Bing or Wikipedia), it generates answers from statistical associations, not fact-checking. So, even when the reasoning process improves, it may still be unmoored from real-world truth.
5. Overfitting to Benchmarks
Models are often trained to perform well on standard test datasets. This can lead to “overfitting”—they become very good at sounding correct within those benchmarks but not necessarily when faced with unfamiliar or nuanced real-world queries. In other words, they learn the test, not the material.
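To put a rough number on the first point above, here is a tiny Python calculation of how reliability decays when steps can fail independently. The 98% per-step accuracy is an assumed figure for illustration, not a measured property of any model:

```python
# Back-of-the-envelope model of error compounding in multi-step reasoning.
# ASSUMPTION: each step is right 98% of the time and steps fail
# independently; neither figure is measured from a real model.

per_step_accuracy = 0.98

for n_steps in (1, 5, 10, 20, 40):
    # The whole chain is only right when every individual step is right.
    p_chain = per_step_accuracy ** n_steps
    print(f"{n_steps:>2} steps -> {p_chain:.0%} chance the full chain is correct")
```

Under these assumptions a 10-step chain is fully correct only about 82% of the time, and a 40-step chain less than half the time, which is exactly the "ripple" effect described in point 1.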
A visual representation of AI hallucinating according to ChatGPT

ChatGPT describing how this image represents AI hallucinating
That image represents an AI hallucination in a metaphorical or symbolic way—it’s not a literal hallucination like a human might experience. In AI terms, a “hallucination” is when a model generates output that is confidently wrong, often blending real and false information or making up facts entirely.
Here’s why the image fits as a representation:
- Distorted elements and pixelation: These mimic how AI sometimes “glitches” when assembling information—like blending bits of data incorrectly.
- Serene yet vacant face: Symbolizes the confident but sometimes ungrounded nature of an AI response.
- Surreal, dreamlike atmosphere: Reflects the unreal or imagined quality of hallucinated content.
It’s not that the AI sees anything—it’s more like a misfire in pattern generation.
Why It’s Hard to Fix
- No Simple Truth Signal: Unlike classification tasks (e.g., “Is this a cat or a dog?”), truth in open-ended language tasks isn’t binary. What counts as a hallucination can be subtle.
- Human Feedback Isn’t Perfect: Reinforcement Learning from Human Feedback (RLHF) helps models behave better, but humans often rate responses on style and coherence rather than accuracy (see the toy scoring example after this list).
- Better Reasoning Can Mask Errors: Ironically, stronger reasoning lets a model “justify” wrong answers more convincingly.
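To see why that rating bias matters, consider a toy scoring model. The 0.8/0.2 weights are purely assumed for illustration; the point is that if raters reward fluency more heavily than accuracy, preference training inherits the same bias:

```python
# Toy model of the RLHF rating problem. The weights are assumptions for
# illustration: raters here value fluency four times as much as accuracy.

def rater_score(fluency: float, accuracy: float,
                w_fluency: float = 0.8, w_accuracy: float = 0.2) -> float:
    return w_fluency * fluency + w_accuracy * accuracy

confident_but_wrong = rater_score(fluency=0.95, accuracy=0.0)  # 0.76
awkward_but_right = rater_score(fluency=0.40, accuracy=1.0)    # 0.52

# A model trained on such preferences learns that sounding right
# pays better than being right.
print(confident_but_wrong > awkward_but_right)  # True
```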
What Could Help
- Retrieval-Augmented Generation (RAG): Connecting models to real-time sources to back up claims (a minimal sketch of the pattern follows this list).
- Process Supervision: Teaching models to follow factual logic step-by-step.
- Truthfulness Benchmarks: Designing better tests that reward fact-based reasoning over surface-level plausibility.
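As an illustration of the RAG pattern from the first bullet, here is a deliberately minimal Python sketch. The keyword-overlap retriever and the example documents are assumptions standing in for a real vector index and a real language-model call:

```python
# Deliberately minimal sketch of retrieval-augmented generation (RAG).
# ASSUMPTIONS: keyword-overlap retrieval and hard-coded documents stand in
# for a real vector index; the final prompt would be sent to an LLM API.

DOCUMENTS = [
    "Cursor may be used on multiple machines under a single account.",
    "SimpleQA is a benchmark OpenAI uses to measure factual accuracy.",
]

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(query_words & set(d.lower().split())),
                  reverse=True)[:top_k]

def build_grounded_prompt(query: str) -> str:
    """Splice retrieved text into the prompt so the model answers from
    sources instead of from unanchored statistical associations."""
    context = "\n".join(retrieve(query, DOCUMENTS))
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

print(build_grounded_prompt("Can I use Cursor on multiple machines?"))
```

The design point is that the model is asked to answer from retrieved text rather than from memory alone, which gives its claims something external to be checked against.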
This is a rich, still-developing field, and even researchers at OpenAI and elsewhere admit they don’t fully understand the root causes. But understanding the trade-offs between fluency, reasoning, and factuality is key to building safer, more trustworthy AI.
For more information about Seven Creative’s AI services, get in touch