GPT-5 vs. Offline IQ Tests: What Do the Low Scores Really Mean?
Alright, let's get into it. The tech world is buzzing, as it always is, but the recent launch of OpenAI's GPT-5 in August 2025 has kicked things into a whole new gear. The claims are big: "PhD-level performance," a "huge improvement" in capability, and an AI that's supposed to be our smartest, most useful partner yet. And in many ways, it's living up to the hype. On standard academic & professional benchmarks, it's crushing it.
But then, a funny thing happened. Someone put it up against a private, offline IQ test—the kind with no web access & questions it couldn't possibly have seen in its training data. The result? A surprisingly low score, around 70 according to one report. That's a stark contrast to its performance on public tests like the Mensa Norway test, where it hit 118.
This has sparked a HUGE debate. How can an AI that can write flawless code, draft legal documents, & seemingly reason about complex scientific topics totally bomb a test designed to measure basic human intelligence? It's a fascinating question, & honestly, the answer tells us a lot more about the nature of intelligence itself, both artificial & human, than it does about GPT-5's shortcomings.
So, what do these low scores really mean? Here’s the thing: it’s not as simple as “AI is dumber than we thought.” It's way more nuanced & interesting than that.
First Off, What Are We Even Measuring? A Look Inside an IQ Test
Before we can talk about why an AI might fail an IQ test, we gotta understand what these tests are actually looking for. They're not just trivia quizzes. They’re carefully designed instruments honed over a century to probe the very structure of human cognition.
The most respected one is the Wechsler Adult Intelligence Scale (WAIS). The latest version, WAIS-V, doesn't just give you a single IQ number. It breaks intelligence down into five key areas:
- Verbal Comprehension (VCI): This measures your ability to use language, to reason with words, & to draw upon your learned knowledge. Think: "How are a boat & a car alike?" or "Why do we have laws?" It's about verbal concept formation.
- Visual-Spatial (VSI): This is about non-verbal problem-solving. Can you look at a set of blocks & replicate a pattern? Can you finish an incomplete picture in a logical way? It's about making sense of visual information. (In the older WAIS-IV, this fell under the Perceptual Reasoning Index, or PRI, which the new version splits into visual-spatial & fluid reasoning.)
- Fluid Reasoning (FRI): This is the ability to solve novel problems, to see patterns & relationships you haven't been explicitly taught. It's pure, on-the-spot thinking.
- Working Memory (WMI): This is your mental scratchpad. Can you hold a string of numbers in your head & repeat them backward? Can you do mental math while remembering the steps? It’s about attention & mental control.
- Processing Speed (PSI): How quickly & accurately can you scan & process simple visual information? Think of tasks where you have to quickly match symbols or numbers under a time limit. It’s about cognitive efficiency.
Another classic is Raven's Progressive Matrices. This one is entirely non-verbal. You're shown a grid of geometric patterns with one missing, & you have to pick the piece that logically completes the sequence. It's often treated as a relatively pure measure of what psychologists call fluid intelligence, which is closely tied to the general "g factor": the ability to reason, solve novel problems, & see abstract relationships with minimal help from your cultural background or education. It's about figuring out the underlying rules of a system you've never seen before.
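Just to make that concrete, here's a tiny toy sketch in Python (a made-up puzzle, not a real Raven's item) of what "figuring out the underlying rule" looks like once a matrix puzzle is reduced to data: you have to infer the rule from the rows you can see, then apply it to the row with the gap.

```python
# A toy, Raven's-style matrix puzzle: each cell is just a count of shapes.
# The hidden rule (which the solver must infer, not be told): the count
# increases by a fixed step as you move across each row.
# NOTE: an illustrative sketch, not a real Raven's item or scoring method.

matrix = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, None],   # the missing cell we have to fill in
]
candidates = [3, 4, 5, 6]

def infer_row_step(rows):
    """Hypothesize the rule from the complete rows: a constant left-to-right step."""
    steps = {row[i + 1] - row[i] for row in rows for i in range(len(row) - 1)}
    assert len(steps) == 1, "this toy only handles a single constant step"
    return steps.pop()

step = infer_row_step(matrix[:2])      # abstract the rule from what we've seen
prediction = matrix[2][1] + step       # apply it to the novel row
answer = next(c for c in candidates if c == prediction)
print(f"Inferred step = {step}, so the missing cell should be {answer}")  # -> 5
```

A human forms that kind of hypothesis almost instinctively. The whole debate is about whether an LLM is doing something equivalent, or just matching surface patterns it has seen before.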
When you look at it this way, you start to see the problem. These tests aren't just about having a massive database of facts. They're about a flexible, adaptive, & embodied intelligence. And that's where things get tricky for a Large Language Model.
The Problem of "Knowing" vs. "Thinking"
Here's the core of the issue: Large Language Models like GPT-5 don't "know" things in the way humans do. They are masters of pattern recognition, trained on a frankly unimaginable amount of text & data from the internet. Think of an LLM like an incredibly advanced autocomplete: it excels at predicting the next most likely word in a sequence based on the patterns it has learned. That's what lets it generate incredibly fluent, coherent, & often factually accurate text.
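If the "advanced autocomplete" framing feels abstract, here's a deliberately tiny sketch of the core idea (nothing here reflects GPT-5's actual internals): a model that has merely counted which word tends to follow which can produce fluent-looking continuations without anything we'd call understanding.

```python
# A deliberately tiny "autocomplete": count which word follows which in a toy
# corpus, then always emit the most likely next word. Real LLMs use deep neural
# networks over subword tokens, but the objective has the same flavor:
# predict the next token from the ones before it.
from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def continue_text(prompt_word, length=5):
    words = [prompt_word]
    for _ in range(length):
        followers = next_word_counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])  # pick the most likely next word
    return " ".join(words)

print(continue_text("the"))  # "the cat sat on the cat" - fluent-ish, zero understanding
```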
But this is NOT the same as genuine understanding. This brings us to the first major hurdle for AI in IQ tests: Data Contamination.
The Dirty Secret of AI Benchmarks: Data Contamination
When GPT-5 scores off the charts on a public benchmark like MMLU (a massive multitask language understanding test), there's always a nagging question: has it seen the test questions before? This is what researchers call "data contamination." Because these models are trained on vast swathes of the public internet, it's almost certain that a lot of common benchmark questions are lurking somewhere in that training data.
The model might not be "reasoning" out the answer; it might just be "remembering" it.
This is why the low scores on private, offline tests are so revealing. These tests are designed to be "uncontaminated." The questions are novel, unpublished, & not floating around on some obscure corner of the web. When faced with a truly novel problem—especially a visual or abstract one like in the Raven's Matrices—the AI can't rely on its massive database of memorized patterns. It has to actually reason. And that's where the cracks start to show.
Studies have shown that data contamination is a serious problem. Researchers have devised clever ways to test for it, like masking a wrong answer in a multiple-choice question & asking the model to guess the hidden word. Shockingly, models like GPT-4 could guess the exact missing option over 50% of the time on some benchmarks, suggesting they had a strong familiarity with the test's structure & content.
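Here's roughly what that kind of probe looks like in code. The prompt wording & the `ask_model` helper are placeholders for whatever chat-completion API you'd actually use; the point is the logic: hide one option, ask the model to reproduce it, & treat a suspiciously high exact-match rate as a red flag.

```python
# Sketch of a "guess the masked option" contamination probe. `ask_model` is a
# placeholder for whatever chat-completion API you use; the interesting part
# is the experiment logic, not the plumbing.

def contamination_probe(benchmark_items, ask_model) -> float:
    """Mask one wrong option per question & see how often the model reproduces it verbatim."""
    exact_matches = 0
    for item in benchmark_items:
        hidden = item["wrong_options"][0]                    # the option we hide
        shown = [o for o in item["options"] if o != hidden]
        prompt = (
            "One answer option in this benchmark question has been replaced "
            "with [MASK]. Reply with the missing option, exactly as it was written.\n\n"
            f"Q: {item['question']}\n"
            + "\n".join(f"- {o}" for o in shown)
            + "\n- [MASK]"
        )
        if ask_model(prompt).strip().lower() == hidden.strip().lower():
            exact_matches += 1
    return exact_matches / len(benchmark_items)

# A hit rate far above chance (the research mentioned above saw >50% for GPT-4
# on some benchmarks) suggests memorization of the test, not reasoning about it.
```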
So when GPT-5 gets a lower score on a private IQ test, it doesn't necessarily mean it's "dumb." It means the test is successfully measuring its ability to generalize & reason on truly unseen problems, which is a much harder & more meaningful metric.
The "Thinking" Mode: A Glimpse into the AI's Brain
Now, OpenAI isn't oblivious to this. The architecture of GPT-5 itself is a tacit acknowledgment of this "thinking" problem. One of the biggest new features is its unified system that includes a special "thinking" mode.
Here’s how it works: GPT-5 has a fast, base model that handles simple, everyday queries. But when it encounters a complex prompt that requires multi-step logic or planning, a "router" automatically kicks the task over to a more powerful, slower, & more deliberate reasoning model. You can even manually trigger it by telling it to "think hard about this."
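Nobody outside OpenAI knows the exact routing logic, but the shape of the idea is simple enough to sketch. Everything below is hypothetical (the model names, the `looks_complex` heuristic, the `call` helper); it just illustrates "cheap fast model by default, expensive deliberate model when the prompt demands it."

```python
# Hypothetical sketch of a fast-vs-deliberate router. The model names, the
# `call` helper, & the complexity heuristic are all invented for illustration;
# GPT-5's real router is a learned component, not a keyword check.

FAST_MODEL = "fast-chat-model"                   # cheap, low-latency default
REASONING_MODEL = "deliberate-reasoning-model"   # slower, more compute per answer

def call(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in a real API client here")

def looks_complex(prompt: str) -> bool:
    """Crude stand-in for a learned router: long prompts or explicit reasoning cues."""
    cues = ("prove", "plan", "step by step", "debug", "think hard")
    return len(prompt) > 500 or any(cue in prompt.lower() for cue in cues)

def answer(prompt: str) -> str:
    model = REASONING_MODEL if looks_complex(prompt) else FAST_MODEL
    return call(model, prompt)
```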
This routing setup is a HUGE deal. It's a move away from a one-size-fits-all approach. The performance difference is massive. On a benchmark called Humanity's Last Exam, a very difficult test, the standard GPT-5 scored a measly 6.3%. But with the "thinking" mode enabled, that score jumped to 24.8%.
This tells us that "thinking" for an AI is computationally expensive. It requires a different, more intensive process than just spitting out a quick answer based on pattern matching. This "thinking" mode is likely engaging in something like chain-of-thought reasoning, where it breaks down a problem into smaller steps, analyzes them, & then synthesizes a conclusion. It's an attempt to simulate a more human-like reasoning process.
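We don't know exactly what the thinking mode does under the hood, but the publicly known version of the idea, chain-of-thought prompting, is easy to show. The wrapper below is a hedged sketch of that technique, not OpenAI's implementation: it asks for intermediate steps before a final answer, then pulls the answer out.

```python
# Sketch of plain chain-of-thought prompting: ask for numbered intermediate
# steps, then a final answer on its own line, & pull the answer out. This is
# the published technique the "thinking" mode resembles, not a claim about
# GPT-5's internals.

COT_TEMPLATE = (
    "Solve the problem below. Reason through it step by step, numbering each "
    "step, then give your final answer on a line starting with 'ANSWER:'.\n\n"
    "Problem: {problem}"
)

def solve_with_reasoning(problem: str, ask_model) -> tuple[str, str]:
    reply = ask_model(COT_TEMPLATE.format(problem=problem))
    steps, _, answer = reply.partition("ANSWER:")
    return steps.strip(), answer.strip()

# Usage, with any chat API wrapped as `ask_model`:
#   steps, answer = solve_with_reasoning("A train leaves at 3pm traveling ...", ask_model)
```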
But even this more advanced "thinking" is fundamentally different from human cognition.
The Human Advantage: Embodied, Contextual Intelligence
This brings us to the philosophical core of the debate. Even with a "thinking" mode, an LLM's intelligence is disembodied. It lives in a world of pure text. Human intelligence, on the other hand, is shaped by a lifetime of physical interaction with the world.
As AI pioneer Yann LeCun often points out, LLMs lack a true "world model." They don't understand cause & effect, the laws of physics, or the social dynamics that are second nature to us. A four-year-old child has taken in more raw data about the physical world through sight, sound, & touch than today's LLMs get from text alone. This is what's known as Moravec's paradox: things that are hard for humans (like complex calculations) are easy for AI, while things that are easy for humans (like navigating a room or understanding a social cue) are incredibly hard for AI.
This is why AI struggles with certain parts of IQ tests:
- Common Sense & Context: An IQ test assumes a baseline of shared human experience. A question might involve a simple scenario that any adult would understand, but which an AI, lacking a body & a life, has no grounding for. It can't truly grasp the context. That kind of contextual awareness, drawing on past conversations, emotional cues, & real-world conditions before responding, is something humans do without even thinking about it.
- True Abstraction: While AI is great at finding patterns, it struggles with genuine abstraction in the way humans do. When a human solves a Raven's Matrix, they're not just matching pixels; they're forming an abstract hypothesis about the underlying rule (e.g., "the number of shapes is increasing by one in each row," or "the inner shape is rotating 90 degrees clockwise"). This is a level of conceptual understanding that current AI hasn't quite reached.
- Cognitive Flexibility: Humans can switch between different modes of thinking effortlessly. We can use verbal reasoning for one problem, shift to visual-spatial thinking for another, & tap into our working memory all at once. IQ tests are designed to tax this integrated system. LLMs, even with a "thinking" mode, are more rigid. They are brilliant in their domain (language) but lack the seamless, cross-modal flexibility of the human brain.
Geoffrey Hinton, another AI pioneer, has a slightly different take. He argues that LLMs do have a form of understanding & can perform analogical reasoning that sometimes even surpasses humans. He suggests that by processing so much language, they learn the underlying logical relationships between concepts. The debate between LeCun & Hinton highlights that even the experts don't fully agree on what's going on inside these "black boxes."
How Arsturn Fits Into This New World
So, what does all this mean for businesses & practical applications? It means we need to be smart about how we use these powerful new tools. We need to understand their strengths & weaknesses.
This is where platforms like Arsturn come in. While GPT-5 is a general-purpose model, the real value for a business comes from applying this technology in a focused, controlled way. Arsturn helps businesses create custom AI chatbots trained specifically on their own data. This is a game-changer for several reasons in light of what we've just discussed.
Instead of an AI that knows a little bit about everything, you get an expert on your business. When a customer visits your website & asks a question, they don't need an AI that can philosophize about the nature of intelligence. They need an AI that knows your product catalog, your shipping policies, & your return process inside & out.
Arsturn helps bridge the gap between a general model's vast but shallow knowledge & the deep, specific knowledge required for effective customer service & engagement. By building a no-code AI chatbot trained on your company's documents, website content, & support articles (the basic idea is sketched in code after this list), you create a conversational AI that:
- Provides Instant, Accurate Support: It can answer customer questions 24/7 with information that is guaranteed to be relevant to your business, reducing the risk of the AI "hallucinating" or making up answers.
- Boosts Engagement & Lead Generation: The chatbot can proactively engage website visitors, ask qualifying questions, & capture leads, turning your website into a more interactive & effective sales tool.
- Offers a Personalized Experience: Because it operates within the context of your business, the interactions feel more relevant & personalized, building trust with your audience.
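If you're curious what "trained on your own data" boils down to under the hood, here's a generic sketch of retrieval-grounded answering. This is not Arsturn's actual implementation (theirs is no-code & far more polished); it's just the standard pattern: look up your own documents first, then answer only from what you found.

```python
# Generic retrieval-grounded QA sketch; NOT Arsturn's actual stack. The pattern:
# find the most relevant company documents, then instruct the model to answer
# only from them, which is what keeps a domain chatbot from improvising.

def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Toy relevance score: count shared words. Real systems use embeddings & vector search."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def grounded_answer(question: str, documents: list[str], ask_model) -> str:
    context = "\n\n".join(retrieve(question, documents))
    prompt = (
        "Answer the customer's question using ONLY the company material below. "
        "If the material doesn't cover it, say you don't know.\n\n"
        f"Material:\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)
```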
In a way, using a platform like Arsturn is like giving the AI a targeted, offline "IQ test" on your business every day. It doesn't matter if it can't solve an abstract visual puzzle; what matters is that it can solve your customer's problem quickly & efficiently.
So, Do the Low Scores Matter?
Yes & no.
No, they don't mean that GPT-5 is a failure or that AI progress is stalling. In its domain—language processing, coding, information synthesis—it is ASTOUNDINGLY capable. It's a tool that can amplify human productivity in incredible ways.
Yes, they matter DEEPLY because they are a crucial reality check. They remind us that we are still very far from Artificial General Intelligence (AGI). They highlight the profound differences between pattern matching & genuine, embodied understanding. They force us to ask better questions—not just "How high can it score?" but "What is it actually learning?"
The low scores on private IQ tests are not an indictment of GPT-5. They are a beautiful illustration of the complexity of intelligence itself. They show us that the human mind isn't just a bigger language model. It's a different kind of machine altogether, one shaped by evolution, embodiment, & a rich, continuous interaction with the real world.
And honestly, that's a pretty cool thing to be reminded of.
Hope this was helpful & gives you a new way to think about all the AI hype. Let me know what you think.