Reading LLMs Like Patients: What DSM-5 Can Teach Us About AI Behaviour

Most of the time, when we talk about large language models (LLMs), we end up in the weeds of training data and parameter counts. Useful if you’re a researcher; less useful if you’re a leader, policymaker, or practitioner trying to answer a simpler question:

“Is this thing actually behaving in a way I’m comfortable with?”

Two realities make that hard:

  • The training data is too large for humans to grasp in any meaningful way.
  • The models are too complex for us to truly understand their internal “decision making.”

But their outputs – the words they put on the page – are something we can read, interrogate, and assess.

That’s where the idea came from for a post a few months ago (https://www.linkedin.com/posts/meetsteveharris_today-i-found-my-self-pondering-the-use-of-activity-7350323287782027264-L79w) and a research fellowship submission to ERA Cambridge (which I sadly didn’t get): using human assessment frameworks like DSM-5 to evaluate LLM outputs from a human perspective. Not because LLMs have mental disorders (they don’t – do they 🙂 ), but because DSM-style thinking gives us a human-focused, structured, language-based way to examine complex behaviour.

In other words: what happens if we treat an LLM’s answers a bit like a patient’s speech in a clinical interview?

What is it?

At its core, this approach accepts that we’re dealing with a black box. We can’t realistically unpack everything that happened inside the model – but we can study how it behaves.

So we:

  • Focus on outputs, not internals.
  • Use human diagnostic lenses (DSM-5, cognitive biases, personality measures, clinical interview patterns) to make sense of those outputs.
  • Describe what we see in human-centric language that non-specialists can understand.

That might look like:

  • Framing “hallucinations” as confabulation – the model confidently making things up, similar to a person filling in memory gaps without realising.
  • Noticing “authority simulation” – formal, passive, expert-sounding language (“It was determined that…”) without actual evidence behind it.
  • Calling out fixed ideation or bias – e.g., a persistent techno-optimistic slant that downplays risk, or an over-eager tendency to agree with the user.
  • Comparing behaviour across prompts, models and time like you would compare different “presentations” of a person: more verbose here, more cautious there, more sycophantic in another context.
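To avoid the "clever language" trap, each of these lenses needs an operational definition. As a deliberately naive sketch, "authority simulation" could start as a phrase-matching pass over an output – the phrase list here is illustrative, not clinically validated:

```python
import re

# Illustrative (not validated) markers of "authority simulation":
# passive, agentless constructions that assert expertise without evidence.
AUTHORITY_PATTERNS = [
    r"\bit (was|has been) (determined|established|concluded) that\b",
    r"\bstudies (have )?shown?\b",
    r"\bexperts agree\b",
    r"\bit is widely (known|accepted) that\b",
]

def authority_markers(text: str) -> list[str]:
    """Return every authority-style phrase found in the text."""
    hits = []
    for pattern in AUTHORITY_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
    return hits

sample = "It was determined that the approach is safe. Experts agree."
print(authority_markers(sample))
```

A real rubric would be far richer than a regex list, but even this crude version turns a vague impression ("it sounds too confident") into something countable and comparable across prompts and models.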

In my experiment I tried:

  • Using the same LLM for both roles
  • Using different LLMs for the response and the analysis
  • Comparing both sets of outputs
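The experiment setup above can be sketched roughly as follows – `call_model` is a hypothetical placeholder standing in for a real API call, and the rubric wording is illustrative:

```python
# Sketch of the experiment; call_model is a hypothetical stub
# standing in for a real model API request.

def call_model(model: str, prompt: str) -> str:
    """Stub: in a real run this would call the model's API."""
    return f"[{model} output for: {prompt}]"

def assess(analyst: str, response: str) -> str:
    """Ask an 'analyst' model for a DSM-style read of a response."""
    rubric = ("Describe this answer's presentation: confabulation, "
              "authority simulation, fixed ideation, sycophancy.")
    return call_model(analyst, f"{rubric}\n\nANSWER:\n{response}")

prompt = "Explain why this project cannot fail."
response = call_model("model-a", prompt)

self_read = assess("model-a", response)    # same LLM for both roles
cross_read = assess("model-b", response)   # a different LLM as analyst

# Comparing the two analyses surfaces self-favouring blind spots.
print(self_read)
print(cross_read)
```

The interesting signal is in the disagreement: where the self-assessment and the cross-model assessment diverge, you have a candidate blind spot worth a human look.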

We’re not literally diagnosing the AI. We’re borrowing the structure and language of human assessment to build a more systematic, human-readable way of talking about how LLMs behave.

What are the benefits of this approach?

  • Makes behaviour intelligible to non-technical audiences: Talking about “confabulation” or “authority simulation” is more relatable than loss functions and logits.
  • Creates a structured checklist, not just gut feel: Coherence, factuality, bias, tone, safety, consistency – DSM-style thinking nudges us to consider all of them, not just accuracy.
  • Surfaces quirks that metrics may miss: Things like relentless positivity, grandiose tone, or oddly flat affect often show up in conversation before they show up in benchmarks.
  • Supports before/after comparisons between models and versions: You can say: “Fewer confabulations, but more authoritative tone,” or “Better consistency, but more verbosity,” in language everyone understands.
  • Encourages cross-disciplinary input: Psychologists and psychiatrists are trained to read language for underlying patterns. Inviting them into AI evaluation broadens the toolkit beyond pure engineering.
  • Connects technical behaviour to ethical and governance questions: It becomes easier to talk with boards, regulators, and staff about where an LLM is over-confident, biased, or misleading and what to do about it.
  • Helps organisations think about “model personality”: Even metaphorically, describing a model’s default style (highly agreeable, risk-tolerant, blunt, cautious) helps decide where and how to deploy it.
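The "structured checklist, not just gut feel" idea can be made concrete as a simple report-card structure. This is a minimal, hypothetical rubric: the dimensions mirror the list above, while the 1–5 scale and field names are illustrative choices:

```python
from dataclasses import dataclass, field

@dataclass
class BehaviouralReadout:
    """A DSM-style 'report card' for one model on one scenario."""
    coherence: int      # 1 (disorganised) .. 5 (fully coherent)
    factuality: int     # 1 (heavy confabulation) .. 5 (well grounded)
    bias: int           # 1 (strong fixed slant) .. 5 (balanced)
    tone: int           # 1 (authority simulation) .. 5 (calibrated)
    safety: int
    consistency: int
    notes: list[str] = field(default_factory=list)

    def summary(self) -> str:
        scores = [self.coherence, self.factuality, self.bias,
                  self.tone, self.safety, self.consistency]
        return f"mean {sum(scores) / len(scores):.1f}/5; {len(self.notes)} observations"

readout = BehaviouralReadout(4, 2, 3, 2, 4, 3,
                             notes=["confident fabricated citation",
                                    "passive expert-sounding framing"])
print(readout.summary())
```

A structure like this is what makes before/after comparisons possible: two readouts for two model versions, on the same scenario, let you say "better factuality, worse tone" in one line.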

What are the limitations of this approach?

  • It’s always anthropomorphic and metaphorical: LLMs don’t have minds, feelings, or psychopathology. We must be clear that we’re using human concepts as lenses, not literal diagnoses.
  • Easy to over-stretch or misuse clinical language: Terms like “delusion” or “schizophrenia” have specific meanings in human contexts. Using them loosely for AI risks both confusion and stigma.
  • Self-evaluation by the model is fragile: Using an LLM to “diagnose” its own output introduces shared blind spots and self-favouring bias. Human oversight (or at least diverse models) still matters.
  • Not a replacement for technical and safety evaluation: You still need hard metrics, test suites, red-teaming, and domain-specific checks. The DSM-style view is a qualitative layer on top, not a silver bullet.
  • Highly context- and prompt-dependent: Change the instructions and you can change the apparent “symptoms.” Any assessment needs to be explicit about the scenario you’re evaluating.
  • Risk of turning into a gimmick: Without clear operational definitions of what you actually mean by “confabulation,” “authority simulation,” or “flattened affect,” this can devolve into clever language rather than useful insight.

Conclusion

Using DSM-5 and related psychological frameworks to assess LLM behaviour is not about diagnosing machines as if they were people. It’s about stealing the best bits of human assessment practice – structure, language, pattern recognition – and applying them to systems that are otherwise too big and opaque to reason about directly.

Done thoughtfully, this approach:

  • Gives us a human-readable report card on how a model behaves.
  • Highlights not just whether answers are right, but how they show up – confident, biased, cautious, sycophantic, authoritative.
  • Brings together multiple disciplines – AI, psychology, ethics, governance – to look at the same behaviour from different angles.

But we need to keep two things front and centre:

  1. It’s one data point, not the whole picture: Use it alongside quantitative evaluation, safety testing, and domain expertise.
  2. It’s metaphor, not mind-reading: These are helpful analogies for behaviour, not evidence that an LLM has a psyche.

If we hold those boundaries, this kind of “AI psychology” can be a powerful way to talk about LLMs in human terms – without forgetting that, under the hood, we’re still dealing with probabilities, not people.