Last week my daughter, who recently started reading on her own, outperformed GPT-5 — the latest, supposedly greatest frontier model — on a question I genuinely had not expected GPT-5 to lose.
The question was this: which days of the week contain the letter D?
My daughter tracked through the seven words with her finger. Monday — yes. Tuesday — yes. Wednesday — yes, and look, twice. Thursday, Friday, Saturday, Sunday — yes, yes, yes, yes. All seven. Every day of the week ends in “-day.” They all contain a D.
GPT-5 said two.
Two.
I didn’t invent this test. I got the idea from a YouTube video — someone running the exact same question, getting the exact same parade of confident-and-wrong replies. I tried it myself because I didn’t believe it. Then I asked again, gently, the way you do when you suspect you’ve been misunderstood. GPT-5 “corrected itself” — produced a different number, still wrong. I asked again. Another wrong answer. Confident, polite, completely incorrect, every iteration. You can reproduce it.
The pattern is this: AI scores brilliantly on the things we’ve told it are important, and trips over the things a child handles without thinking.
If we were just having a giggle at chatbots failing at children’s logic puzzles, this would be a fun pub conversation and nothing more.
We do not need to accept AI bugs or mistakes as ‘just hallucinations’
It’s a rather clever marketing move by AI companies, when you think about it: they’ve rebranded the plain word “mistake” as something soft and almost harmless. Hallucination. We laugh along. We make up silly excuses: oh, it’s just hallucinating, how cute. But wait. If a human contractor handed you a piece of code with mistakes like these and asked for a hundred euros, you’d send it back unpaid until it was fixed. A billion-dollar AI model gets a pass because we shrug and say “oh silly, it’s just hallucinating.” That’s an unacceptable double standard, and one that we humans will never be able to compete with.
We are letting these systems triage support tickets. We are letting them filter job applications. We are letting them write the first draft of medical summaries, legal contracts, insurance decisions, school report cards. We are letting them make calls about whose voice gets amplified online and whose gets buried. We are letting them sit between us and decisions that used to require a person. And overseas, militaries are already building AI into targeting and command chains.
And the systems doing all of that are the same systems that, on the right Tuesday, can’t count to seven when the seven items in question all end in the same three letters.
“But it’ll get better”… they say
Will it? Probably. The days-of-the-week question will be in a fine-tuning set by next quarter, and the model will pass it perfectly forever after, until we ask about the letter ‘y’. And somehow the benchmarks will keep failing forward forever.
But that doesn’t fix the underlying issue, because the underlying issue is not “this specific question.”
The underlying issue is that the model doesn’t actually understand the world the way a 7-year-old understands the world.
It is doing extremely sophisticated pattern-matching on text. When the pattern aligns with reality, you get a brilliant answer. When it doesn’t, you get a confident, well-formatted, completely wrong one.
We’re close to dystopia-by-stupidity
Hollywood keeps shipping the same plot: AI whiffs something obvious, humans pay. HAL won’t open the pod bay doors. In WarGames, the war AI needs a kid to show it that mutually assured destruction ends in stalemate. In Ex Machina, still one of my favourite films, the robot doesn’t understand humans. It just understands how to behave so that the human does what it wants. And if you want a modern version: just pick any Black Mirror episode from a few years back and you’ll see how close we already are to those stories that end in an AI dystopia-by-stupidity.
My benchmark isn’t a leaderboard score anymore. Stop staring at the newest benchmark graph every time a lab drops a model. Benchmarks overweight logical and statistical tricks and underweight the low-level common-sense work we actually rely on in daily life. If the tool can’t consistently clear the boring basics, the fancy stuff doesn’t save you… it’s just better at making the wrong answer sound more convincing.
So why integrate AI into your daily workflow at all? Because used correctly — as a draft machine, a brainstorming partner, a tool for boring glue code where you can verify the output — it genuinely saves time. The problem isn’t using AI. It’s treating it like it understands what it’s saying. Use it for the parts you can check. Keep your brain on the parts you can’t.
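Here is what “the parts you can check” can look like for the days-of-the-week question. This is only a minimal sketch: the names `DAYS`, `days_containing` and `check_claim` are made up for illustration, and the claimed count is typed in by hand rather than fetched from any real chatbot API.

```python
# Hypothetical helper names; nothing here calls a real model.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def days_containing(letter: str) -> list[str]:
    """Return the day names that contain the given letter, ignoring case."""
    return [day for day in DAYS if letter.lower() in day.lower()]


def check_claim(letter: str, claimed_count: int) -> bool:
    """Compare a claimed count (say, a chatbot's answer) with the real count."""
    return claimed_count == len(days_containing(letter))


if __name__ == "__main__":
    print(days_containing("d"))   # the ground truth, computed, not guessed
    print(check_claim("d", 2))    # the answer I was given
    print(check_claim("d", 7))    # the answer my daughter gave
```

The point is not the code itself. When a boring deterministic check like this exists, run it instead of trusting a fluent answer.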
For another, very different, concrete stupidity-test you can run yourself in five minutes, read “AI can’t agree on how tall a plant is.” It’s about how AI doesn’t even agree with itself, and how we solve that… using AI.


