Evals for Humans - From Axial to Recall: a human’s guide to AI evaluation jargon
- Layla Foord

- Sep 10

Supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
If you say it loud enough you'll always sound precocious
Axial and TPR, toxicity and recall
Um-dittle-ittl-um-dittle-I
Evaluation, ground truth set, transition failure matrix
Criteria drift and R.A.G, hallucination scoring
Regression test and LLMs and multi-turn conversation
AI jargon makes no sense so get me a thesaurus.
Um-dittle-ittl-um-dittle-I
The language around AI evaluation can sound like a spell book written in a secret tongue. It dazzles, it intimidates, and it sometimes makes the rest of us feel like outsiders nodding along. But strip away the incantations, and what are we left with? Simple human practice: notice what broke, fix it, and try not to break it again.
Hamel Husain and Shreya Shankar, along with others, have done something important: they’ve taken this fuzzy business of testing AI and given it structure. That’s valuable. Their approach is rigorous, and it works. But the way it gets talked about can make the rest of us feel like we’re missing a class in a language we didn’t even know existed.
So here’s a plain-English guide to what they’re saying, what it means, and how you can use it without needing a new dictionary.
Step 1: Error Analysis (a.k.a. Watch It Fail)
What they say: Do an “error analysis” with “open coding” and then “axial coding.”
What it really means:
Watch your AI in action with about 100 real user examples.
Ask your domain expert (psychologist, lawyer, teacher, whoever knows the space) to mark each one: pass or fail. No half marks, no 1-to-5 scale. Just: did it work or not?
For fails, write down why it failed. For passes, note what worked, and what still could be improved.
Group the mistakes into a few main buckets (no more than 10). That’s your “taxonomy of failure modes.”
Everyday metaphor: Like cooking a new dish for friends. You don’t rate it 1–5 stars. Either it worked (everyone ate it) or it failed (burnt custard, no one touched it). Then you group the issues: “too salty,” “overcooked,” “recipe unclear.”
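If you like to see things as code (entirely optional), here’s a rough sketch of what that pass/fail log and those failure buckets could look like. Everything in it is made up for illustration: the rows, the notes, and the bucket names.

```python
from collections import Counter

# A hypothetical error-analysis log: one row per real user example,
# labelled pass/fail by a domain expert, with a short note on why.
labelled_examples = [
    {"input": "When is my appointment?", "verdict": "pass",
     "note": "Correct date, friendly tone", "failure_mode": None},
    {"input": "Cancel my subscription", "verdict": "fail",
     "note": "Invented a cancellation fee", "failure_mode": "hallucinated policy"},
    {"input": "Summarise this contract", "verdict": "fail",
     "note": "Missed the termination clause", "failure_mode": "missed key detail"},
    # roughly 100 of these in practice
]

# Group the failures into buckets: this is the "taxonomy of failure modes".
failure_counts = Counter(
    row["failure_mode"] for row in labelled_examples if row["verdict"] == "fail"
)

for mode, count in failure_counts.most_common():
    print(f"{mode}: {count}")
```

The point isn’t the code; it’s that the whole step is just labelling examples and counting what went wrong.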
Step 2: Build an Evaluation Suite (a.k.a. Write Tests)
What they say: Build a “reliable evaluation suite” with “code-based evaluators” and “LLM-as-a-judge.”
What it really means:
For simple, rule-based mistakes → write little code checks (like unit tests). Example: “Does this output include a date in the right format?”
For fuzzy, human-judgment mistakes (tone, helpfulness, relevance) → train another AI to be the judge. Feed it lots of examples from your human expert so it knows your quality bar.
Don’t overcomplicate it: code checks for the easy stuff, AI judges for the subjective stuff.
Everyday metaphor: Parenting. Some things are easy to check: “Did the lunchbox make it into the schoolbag?” That’s a rule-based test. Other things need judgment: “Was that text to your friend kind or unkind?” That’s subjective — you need a human (or a judge AI) to weigh in.
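For the curious, here’s a minimal sketch of both kinds of checks. The has_iso_date rule and the ask_llm helper are stand-ins I’ve invented for the example, not anyone’s real API; you’d swap in your own rules, rubric, and model client.

```python
import re

# 1. Code-based evaluator: a simple, rule-based check (like a unit test).
def has_iso_date(output: str) -> bool:
    """Pass if the output contains a date in YYYY-MM-DD format."""
    return bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", output))

# 2. LLM-as-a-judge: for fuzzy qualities like tone or helpfulness.
#    `ask_llm` is a placeholder for whatever model client you actually use.
JUDGE_PROMPT = """You are grading a customer-support reply.
Reply with exactly PASS or FAIL.

PASS means: polite, answers the question, no invented facts.
Here are examples your human expert labelled:
{examples}

Reply to grade:
{reply}
"""

def judge_tone(reply: str, expert_examples: str, ask_llm) -> bool:
    verdict = ask_llm(JUDGE_PROMPT.format(examples=expert_examples, reply=reply))
    return verdict.strip().upper().startswith("PASS")
```

Notice the split: the first check is boring, deterministic code; the second just packages your expert’s examples into a prompt and asks a model to apply the same bar.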
Step 3: Validate Your Judge (a.k.a. Check the Checker)
What they say: Build a “ground truth dataset,” split it into “train/dev/test sets,” measure “TPR and TNR.”
What it really means:
Get your domain expert to label a bunch of examples as pass/fail with explanations.
Use a small slice of those to teach your AI judge what “good” and “bad” look like.
Use the next chunk to test and refine the judge.
Keep a final chunk locked away to see if the judge actually holds up on unseen cases.
Don’t just look at “accuracy.” Check:
TPR (true positive rate): how often it correctly says “yes” when it should.
TNR (true negative rate): how often it correctly says “no” when it should.
Everyday metaphor: Driving. If you’re teaching a learner driver, you don’t just check if they “mostly” drove okay. You check: Did they actually stop at stop signs (TNR), and did they go when it was safe (TPR)? A driver who always stops but never goes isn’t safe. Neither is one who always goes but never stops.
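If the maths feels abstract, the whole calculation is a few lines of counting. The labels below are invented just to show the mechanics.

```python
# Compare the AI judge's verdicts against the held-out expert labels.
# True = "pass", False = "fail" in both lists; the data is illustrative.
expert_labels  = [True, True, False, True, False, False, True, False]
judge_verdicts = [True, False, False, True, False, True, True, False]

tp = sum(e and j for e, j in zip(expert_labels, judge_verdicts))          # judge said pass, expert said pass
tn = sum(not e and not j for e, j in zip(expert_labels, judge_verdicts))  # judge said fail, expert said fail
fn = sum(e and not j for e, j in zip(expert_labels, judge_verdicts))      # judge missed a real pass
fp = sum(not e and j for e, j in zip(expert_labels, judge_verdicts))      # judge let a real fail through

tpr = tp / (tp + fn)  # true positive rate: correctly says "yes" when it should
tnr = tn / (tn + fp)  # true negative rate: correctly says "no" when it should
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```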
Step 4: Keep It Running (a.k.a. Don’t Backslide)
What they say: “Operationalise your evals for continuous improvement.”
What it really means:
Bake your tests into the release process. Every time you ship an update, rerun the tests.
If something that used to work suddenly fails, you catch it before your users do.
Over time, this becomes a flywheel: spot the biggest problems → fix them → test again → find the next ones.
Everyday metaphor: Home maintenance. You don’t just fix the leaky tap once and forget about it. You keep an eye out for drips, squeaks, or cracks. Regular checks stop small problems becoming big disasters.
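In code, that flywheel can be as simple as a release gate: rerun the suite, compare with last time, and refuse to ship if anything slipped. The release_gate function and the previous_results.json file below are placeholder names, not a real tool.

```python
import json
import sys

def release_gate(current_results: dict[str, bool],
                 history_path: str = "previous_results.json") -> None:
    """Block the release if anything that passed last time now fails."""
    try:
        with open(history_path) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}  # first release: nothing to compare against yet

    # A regression = a check that passed on the last release but fails now.
    regressions = [name for name, passed in previous.items()
                   if passed and not current_results.get(name, False)]
    if regressions:
        print("Blocked - these used to pass and now fail:", regressions)
        sys.exit(1)

    # All clear: record this run as the new baseline and ship.
    with open(history_path, "w") as f:
        json.dump(current_results, f)
    print("All checks still passing - ship it.")
```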
Special Cases (a.k.a. The Fancy Bits)
They also talk about specific AI architectures:
Multi-turn conversations → Check if the whole chat achieved the user’s goal. If not, see if it’s because the AI lost context or just didn’t know the answer.
Everyday metaphor: Like a dinner party story that loses its thread halfway through. Did you forget the point, or never know it in the first place?
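If you want to check this automatically, one rough approach is to hand the whole transcript to a judge model and ask both questions at once. The ask_llm helper and the prompt wording below are just illustrative.

```python
# Judge a whole conversation, not just single turns; `ask_llm` is a placeholder.
MULTI_TURN_PROMPT = """Here is a full conversation between a user and an assistant:

{transcript}

1. Did the assistant achieve the user's overall goal? Reply ACHIEVED or NOT_ACHIEVED.
2. If NOT_ACHIEVED, was the cause LOST_CONTEXT (forgot earlier details)
   or MISSING_KNOWLEDGE (never knew the answer)?
Reply on one line, for example: NOT_ACHIEVED LOST_CONTEXT"""

def judge_conversation(transcript: str, ask_llm) -> str:
    return ask_llm(MULTI_TURN_PROMPT.format(transcript=transcript)).strip()
```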
RAG (retrieval-augmented generation) → Test the search part (did it fetch the right docs?) separately from the writing part (did it use them faithfully and answer the question?).
Everyday metaphor: Researching an essay. Did you grab the right books from the library? And then, did you actually use them to write the essay instead of making things up?
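Here’s a sketch of scoring the two halves separately. It assumes you’ve noted which documents should have been fetched for each question, and it reuses the same hypothetical ask_llm stand-in; the function names are mine, not a standard library.

```python
# Score the search half and the writing half of a RAG system separately.

def retrieval_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """What fraction of the documents we needed did the search actually fetch?"""
    if not relevant_ids:
        return 1.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

def faithfulness(answer: str, retrieved_docs: list[str], ask_llm) -> bool:
    """Did the answer stick to what the fetched documents actually say?"""
    prompt = ("Documents:\n" + "\n---\n".join(retrieved_docs) +
              f"\n\nAnswer:\n{answer}\n\n"
              "Is every claim in the answer supported by the documents? Reply YES or NO.")
    return ask_llm(prompt).strip().upper().startswith("YES")
```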
Agents → If the AI does multi-step tasks, map out where in the sequence it usually breaks. That’s a “transition failure matrix,” or in plain English: a chart of “what step broke most often.”
Everyday metaphor: Baking a cake. Did you forget the sugar, burn it in the oven, or drop it taking it out? Knowing the step helps you fix it.
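In its simplest form, that matrix is just a tally of which step-to-step handoff breaks most often. The step names and runs below are invented.

```python
from collections import Counter

# Each failed agent run is logged as (last step that worked, step where it broke).
failed_runs = [
    ("search_flights", "fill_booking_form"),
    ("search_flights", "fill_booking_form"),
    ("fill_booking_form", "send_confirmation"),
    ("search_flights", "fill_booking_form"),
]

# The transition failure matrix: how often each handoff in the sequence breaks.
matrix = Counter(failed_runs)
for (last_good, broke_at), count in matrix.most_common():
    print(f"{last_good} -> {broke_at}: broke {count} times")
```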
Why Bother With the Jargon?
Inside a big lab with hundreds of engineers, you need shared terms. “Error analysis” means the same thing to everyone, so they can coordinate. That’s useful. But if you’re a product team or a curious outsider, you don’t need to drown in acronyms to get the point.
At its core, evals are just humans doing what we’ve always done: notice what broke, fix it, and try not to do it again.
So, respect to Hamel, Shreya, and the labs for building a method that works. And also, permission granted for the rest of us to say: Evals = tests that matter. That’s all.
Whilst I am using levity to untangle the complexity and help us all feel less stupid, this post builds on the brilliant work of Hamel Husain and Shreya Shankar, who’ve trained thousands of PMs and engineers in these methods. Their course, AI Evals for Engineers & PMs, is worth checking out if you want the full system. And Lenny’s Newsletter is always a great source of inspiration and information; it’s where I found Hamel and Shreya’s work.
-Layla


