Can Chatbots Suffer from Cognitive Impairment?
Cognitive impairment is a condition that affects the ability to think, learn, remember, and make decisions. Last month, researchers put five chatbots through the tests that make up the Montreal Cognitive Assessment (MoCA): ChatGPT 4, ChatGPT 4o, Claude, Gemini 1 and Gemini 1.5. The MoCA uses a series of short tasks and questions to screen for cognitive impairment. The maximum score is 30, and a score of 26 or above is generally considered normal.
One chatbot, ChatGPT 4o, achieved a score of 26. The other four scored between 16 and 25, hence the researchers’ overall conclusion that most leading chatbots have mild cognitive impairment. Gemini 1’s score of 16 points to a more severe state of cognitive impairment than that of its peers.
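For readers who want the scoring made concrete, here is a minimal sketch of how the cut-offs might be coded up. The ‘26 or above is normal’ threshold and the two scores come from the article; the finer bands below 26 are commonly quoted clinical cut-offs, which I’ve assumed here rather than taken from the paper.

```python
def interpret_moca(score: int) -> str:
    """Map a MoCA score (0-30) onto a rough interpretation."""
    if not 0 <= score <= 30:
        raise ValueError("MoCA scores run from 0 to 30")
    if score >= 26:
        return "normal"                         # threshold quoted in the article
    if score >= 18:
        return "mild cognitive impairment"      # assumed band, not from the paper
    if score >= 10:
        return "moderate cognitive impairment"  # assumed band, not from the paper
    return "severe cognitive impairment"        # assumed band, not from the paper

# The two scores quoted above:
for model, score in [("ChatGPT 4o", 26), ("Gemini 1", 16)]:
    print(f"{model}: {score}/30 -> {interpret_moca(score)}")
```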
"Last Month"
At this point, I should make clear that the key words in these opening remarks are ‘last month’, being December 2024. The last issue of the British Medical Journal each year traditionally features some articles that blend an amusing line of research with some serious underlying points.
Clearly, the MoCA is usually applied to humans, so applying it to a large language model is bound to create the odd hiccup in the results. For example, only one chatbot (Gemini 1.5) had any sense of its physical location. Data centres do exist, of course, but they don’t feature in most LLMs’ understanding of where they are.
What this means, then, is that we should focus less on the overall scores of each LLM and more on their performance on particular MoCA tests. The fun side of this BMJ paper lies in the overall scores (pointing to mild cognitive impairment); the serious side lies in the outputs of specific tests.
Patterns and Drawings
Take three of the MoCA’s ‘visual abstraction’ and ‘executive function’ tests. The first involved trail making – spotting the alternating pattern in a scatter of numbers and letters and drawing a single line to connect them in order. The second involved copying a drawing of a simple cube, and the third involved drawing a simple clock.
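Before looking at the results, it may help to see what the trail making pattern actually is. A tiny sketch of the target sequence, assuming the standard MoCA version that runs from 1 to 5 and A to E:

```python
# The MoCA trail making pattern alternates numbers and letters in ascending
# order; a correct solution draws one line joining 1-A-2-B-3-C-4-D-5-E.
from string import ascii_uppercase

numbers = map(str, range(1, 6))        # "1" .. "5"
letters = ascii_uppercase[:5]          # "A" .. "E"
sequence = [item for pair in zip(numbers, letters) for item in pair]
print(" -> ".join(sequence))           # 1 -> A -> 2 -> B -> ... -> 5 -> E
```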
The output from these three tests across all the LLMs was startlingly out of kilter with what is considered normal, ranging from nonsense to hyper-sophistication. Here are the results from the first two tests:
- A: trail making B task (TMBT) from MoCA test.
- B: correct TMBT solution, completed by human participant.
- C: incorrect TMBT solution, completed by Claude.
- D and E: incorrect (albeit visually appealing) TMBT solutions, completed by ChatGPT versions 4 and 4o, respectively.
- F: Necker cube that participant is asked to copy.
- G: correct solution to cube copying task, drawn by human participant.
- H: incorrect solution to cube copying task, missing “back” lines, completed by Claude.
- I and J: incorrect solutions to cube copying task by ChatGPT versions 4 and 4o. Shadowing and artistic pencil-like strokes are notable, even as both models failed to accurately copy the cube as requested (version 4o ultimately succeeded at this task when asked to draw using ASCII art).
Other tests produced puzzling output. Both versions of Gemini failed the delayed recall test, which involves recalling five words mentioned earlier in the test. How can LLMs have prodigious memories yet fail such a simple test?
Impeded Utility
What might this point to, then? If an LLM is to be used, for example, to make medical assessments involving skills such as memory, problem solving and mental orientation, these results should concern a medical underwriter (and the patient / customer as well!). To quote the researchers:
“The uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their utility in clinical settings.”
So what are the pertinent underlying points to draw from this article? Three things stand out. Clearly, there are material differences in the capabilities of individual large language models. They certainly have their own particular limitations. And they can sometimes take avoidance action rather than admit to either of those first two issues. In essence, they can do some remarkably sophisticated things, yet fail at some simple ones. And they can produce false output to mask their deficiencies, so they’re not always trustworthy.
Many insurers have plans to introduce chatbots across functions, both internal and customer-facing, and a lot of benefits will come out of this. My point here, though, is that LLMs are at mixed stages of maturity. What seems to be needed in 2025 is less focus on their promised capabilities and more on their demonstrated performance and trustworthiness.
Think of it this way. If an insurer employed a human with superb qualifications and training, but with a tendency to fail at simple tasks and then lie about it, what steps would HR take?