AI is a great tool but...

AI can’t agree on how tall a plant is (try it)

Here’s a test you can run right now, in a few browser tabs, in five minutes. Pick a plant. Any plant. Find a halfway decent reference photo of it. Ask three different LLMs how tall this species typically grows. Then ask the same model the same question again, but reword it slightly.

You will get different answers. Different from each other. Different from themselves. Sometimes by a few centimetres. Sometimes by an entire order of magnitude.

This should be the easiest kind of question. Plants have Latin names. They have boring, well-documented facts: leaf shape, flower season, typical adult size. That’s what botany books are: one long catalogue of predictable answers. So of course I assumed AI would nail it.

A simple job for AI — until…

Because plants are named so precisely, I figured this would be mechanical: take the species name, fetch the known range, done. The specific run I did a few weeks ago looked like this. Same plant, fairly common houseplant. Three models, three answers:

Model A: “typically grows to about 1–2 metres indoors.”
Model B: “can reach 4 to 6 metres in the right conditions.”
Model C: “remains a small plant, usually under 60 cm.”

So which is it? Two-foot houseplant or six-metre monster? Pick a lane.
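To put a number on that gap, here is a quick back-of-the-envelope sketch. It takes the upper bound of each of the three quoted answers, converts it to centimetres, and divides the largest by the smallest. Reading each answer by its upper bound is my own simplification, not something the models stated.

```python
# Upper bounds of the three quoted answers, converted to centimetres.
# "1-2 metres" -> 200, "4 to 6 metres" -> 600, "under 60 cm" -> 60.
# Taking the upper bound of each answer is a simplifying assumption.
upper_bounds_cm = {
    "Model A": 200,
    "Model B": 600,
    "Model C": 60,
}

tallest = max(upper_bounds_cm.values())
shortest = min(upper_bounds_cm.values())
spread = tallest / shortest

print(f"Largest claim is {spread:.0f}x the smallest")
```

Model B's ceiling is ten times Model C's. That is the order of magnitude from the opening, for one common houseplant.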

Then I went back to Model A. Asked the same question. Different phrasing. “How tall does this plant get?” vs “What’s the typical adult height of this species?”

Different number. By a factor of three.

And the answers, importantly, were all delivered with exactly the same level of confidence. No hedging. No “it depends.” No “I’m not sure, but.” Just three different numbers, presented as facts, by tools we are increasingly being told to trust with our research.

This is not about hard questions

If we were debating where the best pizza in Ghent is, sure, disagreement is the point. If we were arguing about whether a strategy is wise, fine. Humans disagree about judgement calls.

But a plant has a height. The information exists. It’s in gardening books. It’s on the label in the garden centre. It’s on Wikipedia. It’s the kind of thing the internet has agreed on for decades.

If a system can’t consistently answer something this simple, what exactly are we trusting it to be consistent about?

It’s a trust problem, not a feature gap

Here’s the thing I want you to actually sit with. When a tool gives you a wrong answer once, that’s a bug. You can route around it. You learn to double-check that thing. Fine.

When a tool gives you a different wrong answer every time you ask, that’s not a bug. That’s a fundamentally different category of tool. That’s a tool you cannot trust at all, because the next call might be the wrong one and you have no way to know.

And here’s the part people miss: the higher the stakes, the less you should trust AI to be correct. Not because it’s malicious — because it can be confidently wrong in ten different ways, and you won’t get a warning label when it happens.

We are poisoning the well. Large portions of the internet are being written and rewritten by AI. AI uses the internet as its common source of truth, and “truth” should really get quotes here. Along the way it makes blatant mistakes: it gets historical facts wrong, misquotes people, botches botanical specs, you name it. Then a few weeks later another model passes through to “learn” from what its predecessor wrote. All of this happens on autopilot until we lose track of which source was the original, truthful one.

How do we fix this problem?

We stop asking the model to be an oracle, and start using it for what it’s good at: producing drafts, glue code, and code that runs deterministic logic. If you can, let AI write a small, boring script that always follows the same rules: validate input, look up known sources, compare answers, flag mismatches. That beats “ask again and hope”.
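Here is roughly what such a boring script could look like. Treat it as a minimal sketch, not a finished tool: the `KNOWN_SOURCES` table, the source names, the heights, and the 25% tolerance are all placeholders I made up for illustration. In real use the table would be filled from references a human has vetted.

```python
from statistics import median

# Hypothetical reference data: heights in cm from sources a human has vetted.
# The source names and numbers are placeholders, not real botanical data.
KNOWN_SOURCES = {
    "garden_centre_label": {"example plant": 180},
    "gardening_book": {"example plant": 200},
    "encyclopedia": {"example plant": 600},
}

def validate(species):
    """Normalise the species name and reject empty input."""
    if not isinstance(species, str) or not species.strip():
        raise ValueError("species name must be a non-empty string")
    return species.strip().lower()

def look_up(species):
    """Collect the height from every source that lists this species."""
    return {
        name: table[species]
        for name, table in KNOWN_SOURCES.items()
        if species in table
    }

def check_height(species, tolerance=0.25):
    """Return the median height and flag sources that stray too far from it."""
    key = validate(species)
    found = look_up(key)
    if not found:
        return {"species": key, "status": "unknown",
                "height_cm": None, "mismatches": []}
    mid = median(found.values())
    mismatches = sorted(
        name for name, height in found.items()
        if abs(height - mid) > tolerance * mid
    )
    status = "needs_review" if mismatches else "ok"
    return {"species": key, "status": status,
            "height_cm": mid, "mismatches": mismatches}

result = check_height("  Example Plant ")
print(result)
```

With the placeholder data, the script reports a median of 200 cm and flags the `encyclopedia` entry for review. The point is that it does so every single time you run it, and it says "unknown" instead of guessing when no source lists the species.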

Start with this rule: if a human can’t easily verify the answer, don’t let the model be the final judge. Letting the model have the last word is fine for drafting an email. It’s workable for summarising a meeting. It’s broken for decisions that hit customers, patients, students, or anyone who can’t check the facts themselves.

And here’s the practical move: have a vibe-coder on the team — someone who can quickly ship small checks, scrapers, validators, and spreadsheet formulas. Or hire us when it gets bigger than “one script”. Because the fix is rarely “one better prompt”. It’s putting guardrails around the model so the output becomes predictable enough to trust.
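One possible shape for such a guardrail, sketched under assumptions: the model's free-text answer is parsed into a number, and that number is only accepted if it falls inside a range a human has already signed off on. The `TRUSTED_RANGE_CM` table and its values are hypothetical, and the parser only understands metres and centimetres.

```python
import re

# Hypothetical trusted range in cm for one species; the numbers are placeholders.
TRUSTED_RANGE_CM = {"example plant": (100, 200)}

# Matches a number followed directly by a metric length unit.
HEIGHT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(cm|centimetres?|centimeters?|m|metres?|meters?)\b"
)

def parse_height_cm(answer):
    """Return the largest '<number> <unit>' height in the text, in cm."""
    matches = HEIGHT_PATTERN.findall(answer.lower())
    if not matches:
        return None
    return max(
        float(value) * (1.0 if unit.startswith("c") else 100.0)
        for value, unit in matches
    )

def guard(species, model_answer):
    """Accept the model's number only if it sits inside the trusted range."""
    height = parse_height_cm(model_answer)
    low, high = TRUSTED_RANGE_CM.get(species, (None, None))
    if height is None or low is None:
        # Nothing to parse or nothing to compare against: a person decides.
        return ("human_review", height)
    if low <= height <= high:
        return ("accepted", height)
    return ("rejected", height)

for answer in [
    "typically grows to about 1–2 metres indoors.",
    "can reach 4 to 6 metres in the right conditions.",
    "remains a small plant, usually under 60 cm.",
]:
    print(guard("example plant", answer))
```

Against the placeholder range, only the first of the three quoted answers gets through. The other two are rejected, and anything the parser cannot read goes to a person instead of straight to a customer.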

For the version of this same trust problem where you ask AI to do less and it does more anyway, read “Stop handing me AI slop”.
