Is GPT-4 getting dumber?

Research indicates that whether GPT-4 is getting dumber depends on the task: accuracy on prime number identification fell from 97.6% to 2.4% during 2023, and the share of directly executable generated code dropped from 52% to 10% between March and June 2023. Creative performance also declined by 43-49% over 18-24 months, according to a 2025 University of Essex study.

Is GPT-4 getting dumber? The 97.6% to 2.4% accuracy drop

Recent studies suggest that GPT-4's declining output quality remains a significant concern for users. Understanding these shifts helps users adjust their expectations and prompting strategies to maintain productivity, and recognizing performance changes prevents frustration during complex technical or creative tasks.

The Great GPT-4 Debate: Is the AI Actually Getting Dumber?

The short answer is yes: GPT-4's performance on certain tasks has measurably declined since its early 2023 release, though not in every dimension. The perception that GPT-4 is getting dumber is backed by hard data from Stanford and UC Berkeley, which showed its ability to solve math problems (specifically prime number identification with step-by-step reasoning) collapsed from 97.6% accuracy in March 2023 to just 2.4% in June 2023.[1] However, the model's behavior is complex, and the reality involves a mix of LLM drift, architectural changes, and shifting priorities at OpenAI.

The Evidence: What the Numbers Say About GPT-4's Decline

Let's cut to the chase. The most dramatic evidence comes from the Stanford/UC Berkeley study. When asked to identify prime numbers with step-by-step reasoning, GPT-4's accuracy fell from 97.6% to 2.4% between March and June 2023.[6] Code generation capabilities also took a hit, with the share of directly executable code dropping from 52% to 10% during the same period. More recently, a 2025 study from the University of Essex found GPT-4's creative performance dropped by 43-49% over 18-24 months.

The model also became more cautious. Its response rate to sensitive questions plummeted from 21% to 5% in just one month (May to June 2023).[5] Users quickly noticed, flooding forums with complaints about GPT-4 being "lazy": leaving placeholder text, refusing to complete code, and giving vague answers. I've personally spent hours re-prompting the model to get it to finish a script it would have written fully a year ago. It's frustrating.
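To make the benchmark methodology concrete, here is a rough sketch of how a yes/no prime-identification evaluation like the one above can be scored. This is my own illustration, not code from the study: the `is_prime` ground-truth check, the random test set, and the always-"yes" stand-in for model answers are all hypothetical.

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth primality check via trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def score(numbers, answers):
    """Accuracy of yes/no answers against the ground truth."""
    hits = sum((a == "yes") == is_prime(n) for n, a in zip(numbers, answers))
    return hits / len(numbers)

# Hypothetical run: a 'model' that answers "yes" to everything scores
# exactly the fraction of primes in the test set.
random.seed(0)
numbers = [random.randrange(1_000, 20_000) for _ in range(500)]
print(f"always-'yes' baseline: {score(numbers, ['yes'] * 500):.1%}")
```

One caveat this sketch makes visible: on a yes/no benchmark, a model that merely flips its default answer can swing from near-perfect to near-zero accuracy, so the composition of the test set matters as much as the headline number.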

Why Is This Happening? Unpacking the Theories Behind the 'Dumbing Down'

So, what's the culprit? It's not one simple thing. The leading theory points to OpenAI quietly changing the underlying model architecture to cut costs. Many believe they shifted GPT-4 to a Mixture of Experts (MoE) model, where a network of smaller, specialized models is activated per query instead of the whole massive system. This makes the model faster and cheaper to run, but some argue it sacrificed quality. In other words, the ship of Theseus was gradually replaced, and we're now sailing on a different vessel.

Another major factor is model drift driven by aggressive safety alignment. As OpenAI fine-tuned GPT-4 to refuse harmful requests, they inadvertently made it overly cautious and lazy. The RLHF (Reinforcement Learning from Human Feedback) process, meant to make the model helpful, may have also rewarded shorter, more generic answers to reduce computational load.

Architectural Changes vs. Safety Alignment

Here's how the two primary drivers of the perceived decline compare: the suspected MoE shift trades raw capability for speed and lower serving costs, while safety alignment trades completeness for caution. Both pressures push in the same direction, toward shorter, more conservative, less capable outputs.

GPT-4 Performance: March 2023 vs. June 2023

The table below summarizes the key performance drops identified by the Stanford/UC Berkeley research.

Task: Math Problem Solving
  • March 2023: 97.6% accuracy on prime number identification.
  • June 2023: Only 2.4% accuracy on the same questions, showing a massive decline in reasoning.

Task: Code Generation
  • March 2023: 52% of generated code was directly executable.
  • June 2023: Executable code generation dropped to just 10%, with frequent formatting errors.

Task: Sensitive Questions
  • March 2023: Answered 21% of a set of 100 sensitive/controversial queries.
  • June 2023: Answer rate fell sharply to 5%, indicating much stronger safety filters.

These numbers from the Stanford/UC Berkeley study paint a clear picture: GPT-4's capabilities shifted dramatically in just a few months. The model became much worse at math and coding while simultaneously becoming far more cautious. This suggests a trade-off where safety and cost-efficiency were prioritized over raw reasoning power.
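The "directly executable" metric is easy to approximate. The sketch below is my own approximation (assuming Python output and using the built-in `compile()` as the executability check, not the study's exact harness); it shows how the "formatting errors" in the table often came down to the model wrapping code in markdown fences that a naive pipeline would feed straight to the interpreter:

```python
import re

def extract_code(response: str) -> str:
    """Strip a markdown fence if present, returning the inner code."""
    m = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return m.group(1) if m else response

def is_directly_executable(response: str) -> bool:
    """True if the raw response parses as Python with no cleanup at all."""
    try:
        compile(response, "<llm-response>", "exec")
        return True
    except SyntaxError:
        return False

raw = "```python\nprint('hello world')\n```"
print(is_directly_executable(raw))                # False: fences break parsing
print(is_directly_executable(extract_code(raw)))  # True: inner code is fine
```

A response can fail this check while still containing perfectly good code, which is why "dropped to 10% executable" and "frequent formatting errors" belong in the same row of the table.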

A Startup's Struggle with a 'Lazy' GPT-4

A small software startup in San Francisco relied on GPT-4 to auto-generate SQL queries from natural language prompts. In early 2023, the model worked like magic, outputting perfectly formatted, complex queries 90% of the time. The team was ecstatic, and their development velocity doubled.

By the summer, everything fell apart. The same prompts started yielding half-finished code with placeholder text like '...add your own logic here.' The team was furious. They were still paying for the API, but the model's output quality had tanked. One engineer spent three hours debugging a simple script because the AI refused to write more than 50 lines.

The founder later read about the 'lazy GPT' theory online and realized what had happened. The model they had built their product on had been fundamentally changed. The startup was forced to spend weeks adjusting their entire pipeline to work with the less reliable model, a costly lesson in the risks of relying on a black-box AI.

Knowledge to Take Away

GPT-4's math and coding ability measurably declined in 2023

Research from Stanford and UC Berkeley showed prime number identification accuracy fell from 97.6% to 2.4%, while executable code generation dropped from 52% to 10% in just a few months.

'LLM Drift' means the same model can change without warning

Researchers coined this term to describe how a model's behavior can shift underneath developers, even when the version number remains the same. This can break software stacks that rely on consistent outputs.
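A minimal defense against silent drift, sketched below under the assumption that you maintain a small fixed prompt suite with exact-match answers (the `model_fn` callable, the suite, and the 0.95 baseline are all hypothetical), is a canary check that fails loudly when accuracy slips below the level measured at build time:

```python
BASELINE_ACCURACY = 0.95   # hypothetical: measured when the pipeline shipped
TOLERANCE = 0.05           # allowed slack before we call it drift

def canary_accuracy(model_fn, suite):
    """suite: list of (prompt, expected) pairs with exact-match answers."""
    hits = sum(model_fn(prompt).strip() == expected for prompt, expected in suite)
    return hits / len(suite)

def check_drift(model_fn, suite):
    """Raise if the model's behavior has shifted beyond tolerance."""
    acc = canary_accuracy(model_fn, suite)
    if acc < BASELINE_ACCURACY - TOLERANCE:
        raise RuntimeError(
            f"possible model drift: canary accuracy {acc:.0%} "
            f"vs baseline {BASELINE_ACCURACY:.0%}"
        )
    return acc

# Demo with a stub 'model' that has drifted on one of four answers.
suite = [("2+2?", "4"), ("capital of France?", "Paris"),
         ("5*3?", "15"), ("7-2?", "5")]
drifted = {"2+2?": "4", "capital of France?": "Paris",
           "5*3?": "15", "7-2?": "I cannot help with that."}.get
print(canary_accuracy(lambda p: drifted(p), suite))  # 0.75 -> check_drift raises
```

Run on a schedule against a pinned model version, a check like this turns "the model changed underneath us" from a weeks-long debugging mystery into an alert.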

Cost-cutting and safety are likely the main culprits

The shift to a Mixture of Experts (MoE) architecture and stronger RLHF alignment appears to have traded reasoning depth and completeness for lower cost and safer, but lazier, responses.

Newer isn't always better for specific tasks

While models like GPT-4 Turbo may score higher on benchmarks, the unique 'character' and raw reasoning power of the original GPT-4 seems to have been lost. Users report the newer models are faster but more generic.

Need to Know More

Is GPT-4 actually getting dumber, or is it just my imagination?

It's not your imagination. Hard data from Stanford and UC Berkeley confirms a significant performance drop, particularly in math and code generation, between March and June 2023. Your experience of the model becoming 'lazier' is backed by research.

Will OpenAI fix GPT-4 and make it smart again?

That's uncertain. OpenAI has acknowledged performance fluctuations but hasn't committed to rolling back changes. Their focus may be on developing newer, more powerful models like GPT-5 rather than reviving the original GPT-4's capabilities.

Is GPT-4 Turbo better or worse than the original GPT-4?

GPT-4 Turbo was designed to be more efficient and often performs better on benchmark tests like MATH and HumanEval. However, some users still feel the original GPT-4 had a certain 'sharpness' that newer versions lack, describing it as a trade-off between cost and raw reasoning quality.

If you're interested in the ethics of these changes, it's worth exploring why OpenAI is no longer open source to understand the company's evolving strategy.

Why does GPT-4 sometimes refuse to answer or leave code incomplete?

This is likely due to two factors: 1) aggressive safety fine-tuning (RLHF) that makes it overly cautious, and 2) cost-cutting measures that cap the length of the model's responses. It's literally being trained to be 'lazy' to save money.

References

  • [1] arXiv - GPT-4's ability to solve math problems collapsed from 97.6% accuracy in March 2023 to just 2.4% in June 2023.
  • [5] arXiv - Its response rate to sensitive questions plummeted from 21% to 5% in just one month (May to June 2023).
  • [6] arXiv - GPT-4's accuracy on prime number identification fell from 97.6% to 2.4% between March and June 2023.