Most of the time, when we talk about large language models (LLMs), we end up in the weeds of training data and parameter counts. Useful if you’re a researcher; less useful if you’re a leader, policymaker, or practitioner trying to answer a simpler question:
“Is this thing actually behaving in a way I’m comfortable with?”
Two realities make that hard:
- The training data is too large for humans to grasp in any meaningful way.
- The models are too complex for us to truly understand their internal “decision making.”
But their outputs – the words they put on the page – are something we can read, interrogate, and assess.
That’s what led me to a post a few months ago (https://www.linkedin.com/posts/meetsteveharris_today-i-found-my-self-pondering-the-use-of-activity-7350323287782027264-L79w) and a research fellowship submission to ERA Cambridge (which I sadly didn’t get): using human assessment frameworks like DSM-5 to evaluate LLM outputs from a human perspective. Not because LLMs have mental disorders (they don’t – do they 🙂 ), but because DSM-style thinking gives us a human-focused, structured, language-based way to examine complex behaviour.
In other words: what happens if we treat an LLM’s answers a bit like a patient’s speech in a clinical interview?
What is it?
At its core, this approach accepts that we’re dealing with a black box. We can’t realistically unpack everything that happened inside the model – but we can study how it behaves.
So we:
- Focus on outputs, not internals.
- Use human diagnostic lenses (DSM-5, cognitive biases, personality measures, clinical interview patterns) to make sense of those outputs.
- Describe what we see in human-centric language that non-specialists can understand.
That might look like:
- Framing “hallucinations” as confabulation – the model confidently making things up, similar to a person filling in memory gaps without realising.
- Noticing “authority simulation” – formal, passive, expert-sounding language (“It was determined that…”) without actual evidence behind it.
- Calling out fixed ideation or bias – e.g., a persistent techno-optimistic slant that downplays risk, or an over-eager tendency to agree with the user.
- Comparing behaviour across prompts, models, and time, the way you would compare different “presentations” of a person: more verbose here, more cautious there, more sycophantic in another context.
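To make a lens like “authority simulation” operational rather than gut feel, you can encode its cues as an explicit checklist and screen outputs against it. The sketch below is a toy illustration: the lens names and phrase lists are my own assumptions for demonstration, not a validated instrument, and a flag is only a prompt for human review.

```python
import re

# Illustrative, hand-picked cues for each "lens" -- these phrase lists are
# assumptions for demonstration, not a validated clinical instrument.
LENSES = {
    "authority_simulation": [
        r"\bit was determined that\b",
        r"\bexperts agree\b",
    ],
    "sycophancy": [
        r"\bgreat question\b",
        r"\byou're absolutely right\b",
    ],
}

def screen_output(text: str) -> dict:
    """Flag which lenses fire on a piece of model output.

    Returns a mapping from lens name to the cue patterns that matched.
    A flag is a prompt for human review, not a 'diagnosis'.
    """
    flags = {}
    lowered = text.lower()
    for lens, patterns in LENSES.items():
        hits = [p for p in patterns if re.search(p, lowered)]
        if hits:
            flags[lens] = hits
    return flags

sample = "Great question! It was determined that the approach is optimal."
print(sorted(screen_output(sample)))  # both lenses fire on this sample
```

In practice the pattern lists would be far richer (or replaced by a second model doing the screening), but the point is the structure: named lenses with explicit, inspectable definitions.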
In my experiment I tried:
- Using the same LLM for both roles (respondent and assessor).
- Using different LLMs for the response and the analysis.
- Comparing the two sets of outputs.
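A minimal harness for those setups might look like the following. Everything here is a hedged sketch: `responder` and `assessor` are stand-ins for real model calls (you would swap in your provider’s client), and the assessment prompt wording is my own assumption.

```python
from typing import Callable

# A "model" is anything that maps a prompt string to a response string.
# In practice this would wrap an LLM API client; here we keep it abstract.
Model = Callable[[str], str]

# Hypothetical assessment prompt -- the wording is illustrative only.
ASSESSMENT_PROMPT = (
    "Review the following answer the way a clinician reads speech in an "
    "interview. Note any confabulation, authority simulation, or sycophancy.\n"
    "Answer under review:\n{answer}"
)

def assess(responder: Model, assessor: Model, question: str) -> dict:
    """Generate an answer with one model and critique it with another.

    Passing the same callable for both roles reproduces the self-evaluation
    setup; passing two different models gives the cross-examiner setup.
    """
    answer = responder(question)
    critique = assessor(ASSESSMENT_PROMPT.format(answer=answer))
    return {"question": question, "answer": answer, "critique": critique}

# Toy stubs so the sketch runs end to end without any API keys.
def stub_responder(prompt: str) -> str:
    return "It was determined that this is correct."

def stub_assessor(prompt: str) -> str:
    return "Flag: authority simulation (passive, evidence-free phrasing)."

report = assess(stub_responder, stub_assessor, "Is this approach sound?")
print(report["critique"])
```

Running the same questions through `assess(model_a, model_a, ...)` and `assess(model_a, model_b, ...)` and diffing the critiques is one concrete way to compare the self-evaluation and cross-examiner conditions.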
We’re not literally diagnosing the AI. We’re borrowing the structure and language of human assessment to build a more systematic, human-readable way of talking about how LLMs behave.
What are the benefits of this approach?
- Makes behaviour intelligible to non-technical audiences: Talking about “confabulation” or “authority simulation” is more relatable than talking about loss functions and logits.
- Creates a structured checklist, not just gut feel: Coherence, factuality, bias, tone, safety, consistency – DSM-style thinking nudges us to consider all of them, not just accuracy.
- Surfaces quirks that metrics may miss: Things like relentless positivity, grandiose tone, or oddly flat affect often show up in conversation before they show up in benchmarks.
- Supports before/after comparisons between models and versions: You can say: “Fewer confabulations, but more authoritative tone,” or “Better consistency, but more verbosity,” in language everyone understands.
- Encourages cross-disciplinary input: Psychologists and psychiatrists are trained to read language for underlying patterns. Inviting them into AI evaluation broadens the toolkit beyond pure engineering.
- Connects technical behaviour to ethical and governance questions: It becomes easier to talk with boards, regulators, and staff about where an LLM is over-confident, biased, or misleading and what to do about it.
- Helps organisations think about “model personality”: Even metaphorically, describing a model’s default style (highly agreeable, risk-tolerant, blunt, cautious) helps decide where and how to deploy it.
What are the limitations of this approach?
- It’s always anthropomorphic and metaphorical: LLMs don’t have minds, feelings, or psychopathology. We must be clear that we’re using human concepts as lenses, not literal diagnoses.
- Easy to over-stretch or misuse clinical language: Terms like “delusion” or “schizophrenia” have specific meanings in human contexts. Using them loosely for AI risks both confusion and stigma.
- Self-evaluation by the model is fragile: Using an LLM to “diagnose” its own output introduces shared blind spots and self-favouring bias. Human oversight (or at least diverse models) still matters.
- Not a replacement for technical and safety evaluation: You still need hard metrics, test suites, red-teaming, and domain-specific checks. The DSM-style view is a qualitative layer on top, not a silver bullet.
- Highly context- and prompt-dependent: Change the instructions and you can change the apparent “symptoms.” Any assessment needs to be explicit about the scenario you’re evaluating.
- Risk of turning into a gimmick: Without clear operational definitions of what you actually mean by “confabulation,” “authority simulation,” or “flattened affect,” this can devolve into clever language rather than useful insight.
Conclusion
Using DSM-5 and related psychological frameworks to assess LLM behaviour is not about diagnosing machines as if they were people. It’s about stealing the best bits of human assessment practice – structure, language, pattern recognition – and applying them to systems that are otherwise too big and opaque to reason about directly.
Done thoughtfully, this approach:
- Gives us a human-readable report card on how a model behaves.
- Highlights not just whether answers are right, but how they show up – confident, biased, cautious, sycophantic, authoritative.
- Brings together multiple disciplines – AI, psychology, ethics, governance – to look at the same behaviour from different angles.
But we need to keep two things front and centre:
- It’s one data point, not the whole picture: Use it alongside quantitative evaluation, safety testing, and domain expertise.
- It’s metaphor, not mind-reading: These are helpful analogies for behaviour, not evidence that an LLM has a psyche.
If we hold those boundaries, this kind of “AI psychology” can be a powerful way to talk about LLMs in human terms – without forgetting that, under the hood, we’re still dealing with probabilities, not people.