Why Evals Matter
The cost of untested AI features — and what a real eval pipeline looks like
Shipping an AI feature without evals is like deploying code without tests and hoping nothing breaks. The difference is that AI failures are subtle: the model still returns a response, just a worse one. A new model version, a prompt change, or a change in what ends up in the context window can degrade quality silently. Evals are the tests that catch these regressions before users report them.
What can go wrong without evals
- Prompt regressions after model updates — A new Claude or GPT version changes how it interprets instructions. Your prompts that worked last month produce subtly different output now. Without evals, you find out from user complaints.
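- Context poisoning — Retrieved chunks in RAG pipelines vary by query. Adding more documents to the knowledge base changes which chunks are retrieved, so a query that used to surface the correct passage may now pull in competing or irrelevant text, and the model ignores the correct answer. Without a test set, you would never know.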
- Tone and format drift — A "professional" system prompt that produced formal emails now produces casual ones after you tweaked the instructions for something else. Evals catch format changes.
- Latency and cost regressions — A prompt change that improves quality can triple the token count. Without measuring, you get a surprise bill at the end of the month.
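The cost regression in the last item can be caught with a very small check. The following is a minimal sketch, not a definitive implementation: `approx_tokens`, `cost_ratio`, and `check_cost` are hypothetical names, the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer, and the 1.5x budget is an arbitrary example threshold.

```python
from statistics import mean


def approx_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token.

    This is an assumption for illustration only; swap in your
    provider's tokenizer or reported usage numbers for real measurements.
    """
    return max(1, len(text) // 4)


def cost_ratio(baseline_texts: list[str], candidate_texts: list[str]) -> float:
    """Mean estimated tokens of the candidate run divided by the baseline run."""
    base = mean(approx_tokens(t) for t in baseline_texts)
    cand = mean(approx_tokens(t) for t in candidate_texts)
    return cand / base


def check_cost(baseline_texts: list[str], candidate_texts: list[str],
               max_ratio: float = 1.5) -> bool:
    """Return True if the candidate stays within the allowed token budget."""
    return cost_ratio(baseline_texts, candidate_texts) <= max_ratio
```

The texts passed in can be prompts, outputs, or both concatenated, depending on which part of the bill you want to watch. Comparing against a stored baseline rather than a fixed number keeps the check meaningful as the feature evolves.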
The eval pipeline
- Test set — a collection of inputs with expected outputs — Your golden dataset: real user queries (or representative ones) with human-validated expected responses. The foundation of every eval system.
- Runner — code that executes your AI feature on the test set — Calls your actual production code (not a simplified version) with each test input. Captures the output for scoring.
- Scorer — measures quality of each output — Exact match for structured outputs. LLM-as-judge for quality and correctness. Regex/contains checks for format and required content.
- Aggregator — summarises results and detects regressions — Computes pass rate, average score, and per-category breakdowns. Compares against a baseline. Flags when quality drops below threshold.
- CI integration — runs on every code change — Evals run in CI alongside your tests. A drop in eval score blocks the PR just like a failing test.
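The five stages above can be sketched in a few dozen lines. This is an illustrative outline under stated assumptions, not a reference implementation: `EvalCase`, `score`, `run_evals`, `aggregate`, `gate`, and `stub_feature` are all hypothetical names, the stub stands in for your real production code, and the LLM-as-judge scorer is left out because it needs a model call.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One entry in the test set."""
    input: str
    expected: str
    category: str = "general"
    check: str = "exact"  # "exact", "contains", or "regex"


def score(case: EvalCase, output: str) -> bool:
    """Scorer: pass/fail for a single output."""
    if case.check == "exact":
        return output.strip() == case.expected.strip()
    if case.check == "contains":
        return case.expected.lower() in output.lower()
    if case.check == "regex":
        return re.search(case.expected, output) is not None
    raise ValueError(f"unknown check: {case.check}")


def run_evals(feature: Callable[[str], str],
              cases: list[EvalCase]) -> list[tuple[EvalCase, bool]]:
    """Runner: call the feature on every input and score the result."""
    return [(case, score(case, feature(case.input))) for case in cases]


def aggregate(results: list[tuple[EvalCase, bool]]) -> dict:
    """Aggregator: overall pass rate plus a per-category breakdown."""
    if not results:
        raise ValueError("no results to aggregate")
    by_cat: dict[str, list[bool]] = {}
    for case, passed in results:
        by_cat.setdefault(case.category, []).append(passed)
    return {
        "pass_rate": sum(p for _, p in results) / len(results),
        "by_category": {c: sum(v) / len(v) for c, v in by_cat.items()},
    }


def gate(report: dict, baseline_pass_rate: float, tolerance: float = 0.02) -> int:
    """CI gate: exit code 0 if quality held, 1 if it dropped below baseline."""
    return 0 if report["pass_rate"] >= baseline_pass_rate - tolerance else 1


def stub_feature(text: str) -> str:
    """Hypothetical stand-in for the production AI feature."""
    canned = {
        "2+2": "4",
        "capital of France": "The capital is Paris.",
        "date": "soon",
    }
    return canned.get(text, "")
```

In a CI job, a small script could build the test set, call `aggregate(run_evals(your_feature, cases))`, and pass the result of `gate(...)` to `sys.exit`, so that a non-zero exit code fails the build the same way a failing unit test would. The `tolerance` parameter is one way to avoid blocking PRs on small fluctuations; whether you need it depends on how deterministic your scorers are.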
Try this
Write a simple manual eval. Take an AI feature you have already built, or the simplest possible one, such as a summarisation prompt. Create a text file with 5 test cases, each with an input and a hand-written description of what a good response looks like. Run your AI feature on all 5 inputs manually, then score each output against your expected description (pass/fail, no code yet). This is the conceptual foundation of everything that follows.
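One possible layout for that text file is shown below. The file name, field labels, and the example criteria are all hypothetical; any format works as long as each case pairs an input with a description you can judge an output against.

```text
# evals.txt (hypothetical file name)

CASE 1
Input: <paste a short news article here>
Good response: Two or three sentences. Names the main event and who was involved. Adds no facts that are not in the article.
Result: pass / fail

CASE 2
Input: <paste a long technical document here>
Good response: One short paragraph. Covers the document's main conclusion, not just its introduction. Keeps technical terms unchanged.
Result: pass / fail
```

Writing the "Good response" line before you run the feature keeps you from adjusting the criteria to fit whatever the model happened to produce.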