
ChatGPT struggles with accuracy, consistency on science quizzes


ChatGPT can sound confident, clear, and convincing. But a new study suggests that confidence may hide a deeper problem.

Researchers found that when the same question is asked multiple times, ChatGPT can give different answers – even when nothing in the prompt changes. In some cases, it flips between “true” and “false” on the exact same claim.

That kind of inconsistency raises a bigger concern. If an answer can change without a reason, how much can we trust it when the stakes are higher?

Testing ChatGPT accuracy

Across hundreds of hypotheses drawn from published scientific research papers, the system was repeatedly asked to decide whether each one was true or false.

By running the exact same question ten times, Mesut Cicek at Washington State University (WSU) showed that identical prompts could return opposite answers.

Some claims flipped back and forth between true and false across repeated runs, even though nothing in the input changed.

Such reversals expose a core limitation in how the system evaluates claims, and they raise an obvious question: where, and why, do those errors occur?

Where ChatGPT loses accuracy

Errors were most pronounced with unsupported hypotheses, revealing a persistent bias toward agreement that the model did not overcome.

In 2025, ChatGPT correctly identified those false claims just 16.4 percent of the time – far below its headline accuracy.

That pattern suggests the system often defaults to “yes,” because matching familiar language is easier than spotting a flawed idea.

At first glance, overall performance looked solid, rising from 76.5 percent in 2024 to 80 percent in 2025. But once random guessing was factored out, effective accuracy dropped to around 60 percent – closer to a low D.

That gap exists because a true-or-false task gives every answer a 50 percent chance of being right before any reasoning begins. When the chance-adjusted score shrinks that much, the system may still be useful for drafting ideas, but it becomes risky for real decisions.
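
To see the arithmetic, apply a standard chance correction for a two-option task (an assumption on our part; the paper may adjust differently): subtract the 50 percent guessing baseline and rescale by what remains.

\[
\text{adjusted accuracy} = \frac{p_{\text{observed}} - p_{\text{chance}}}{1 - p_{\text{chance}}} = \frac{0.80 - 0.50}{1 - 0.50} = 0.60
\]

The same formula puts the 2024 figure at (0.765 - 0.50)/0.50 = 0.53, so the year-over-year gain is real but modest.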

For readers using AI to judge evidence, the most reassuring answer may also be the least trustworthy.

Same ChatGPT question, different answers

Repetition exposed a second problem: identical ChatGPT prompts did not produce identical answers. In 2025, only 72.9 percent of claims were answered correctly on all ten repeated runs. Some flipped between true and false, even though nothing in the input changed.

“We’re not just talking about accuracy, we’re talking about inconsistency because if you ask the same question again and again, you come up with different answers,” said Cicek.

That instability means a single response can look reliable, while repeated checks reveal how fragile it really is.

Performance held up best on simple cause-and-effect chains, where one change leads directly to another. But it dropped on claims that depended on context, where outcomes shift with conditions rather than fixed rules.

These are the kinds of judgments people make every day – from pricing decisions to market strategy and policy tradeoffs. An AI system that misses those limits can still sound persuasive while quietly flattening the details that matter most.

When ChatGPT confidence beats accuracy

A large language model (LLM) is trained on massive text datasets and works by predicting likely next words, not by checking facts against the real world.

That design helps produce fluent, confident answers – even when the system has no grounded way to judge whether they are true.
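
One plausible mechanism behind that run-to-run variability (our framing, not the study's) is that the model samples each next word from a probability distribution rather than always taking the single most likely one. A toy Python sketch, with made-up probabilities:

```python
import random

# Toy illustration (not the study's method): a model that picks its
# next word by sampling from a probability distribution can answer
# the same true-or-false question differently on repeated runs.
# The numbers below are invented purely for demonstration.
next_word_probs = {"true": 0.55, "false": 0.45}

def sample_answer() -> str:
    # random.choices draws one item according to the given weights,
    # so "false" still comes up roughly 45 percent of the time.
    return random.choices(
        list(next_word_probs), weights=list(next_word_probs.values())
    )[0]

print([sample_answer() for _ in range(10)])
# Possible output: ['true', 'false', 'true', 'true', 'false', ...]
```

Because every run re-rolls the dice, agreement across repeated runs is a property of the underlying distribution, not a guarantee.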

OpenAI notes that ChatGPT can still produce “hallucinations,” or responses that sound certain but are factually incorrect.

That pairing of fluent confidence and hidden unreliability makes the system especially tricky to use. A wrong answer can feel solid enough to trust.

For science and business teams, that weakness turns a useful shortcut into a quiet risk. A polished summary can speed up planning, but a single flawed judgment can steer a product, budget, or campaign in the wrong direction.

“They just memorize, and they can give you some insight, but they don’t understand what they’re talking about,” Cicek said.

For now, the safest approach is to treat AI as a drafting partner – not as an unsupervised decision-maker.

Smarter ways to use ChatGPT

The takeaway is simple: use AI for speed, but don’t trust it without a second look.

Think of its answers as a first draft, not a final decision. Running the same prompt more than once can help reveal hidden instability because a reliable answer should not change without a clear reason.
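
For readers who want to automate that repeat check, here is a minimal sketch using the OpenAI Python client. The model name, prompt wording, and run count are our assumptions, not the study's setup; it requires only an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; substitute the claim you actually want to test.
PROMPT = "True or false: <claim under test>. Answer with one word."

def repeated_answers(prompt: str, runs: int = 10) -> list[str]:
    """Ask the same question several times and collect the answers."""
    answers = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o",  # our choice for illustration; use any model
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip().lower())
    return answers

answers = repeated_answers(PROMPT)
print(answers)
if len(set(answers)) > 1:
    print("Warning: the model disagreed with itself across runs.")
```

If the printed answers disagree, treat the claim as unsettled and verify it against primary sources before acting on it.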

It also helps to check sources, look for missing context, and compare the response with what experts already know. These small steps can catch problems that polished language might otherwise hide.

That extra effort may take a little more time, but it helps keep confident-sounding answers grounded in real evidence.

What comes next for AI

Even so, the paper does not close the case on every AI tool or every kind of reasoning. WSU’s team tested business hypotheses from open-access studies and repeated each prompt ten times on one platform.

Those limits leave room for broader comparisons, longer prompt runs, and tougher tasks that better mirror messy, real-world decisions.

Still, a result this consistent after a year of model updates tells readers not to confuse polish with judgment.

Across this test, ChatGPT appeared more polished in 2025 than in 2024, but it did not become a dependable reasoner.

The warning from WSU is clear: human experts still need to check the logic, especially when the answer sounds effortless.

The study is published in Rutgers Business Review.
