Evaluating LLM Outputs
Build an eval suite before you ship — LLM-as-judge, golden datasets, and regression testing
The biggest mistake in AI product development is skipping evaluations. You tweak a prompt, it "seems better" on the 3 examples you tested, you ship it — and it breaks on 15% of real inputs you never thought to test. Evals are your test suite for AI. Without them, you are flying blind every time you change a prompt, update a model, or add a feature.
What an eval is
- A dataset of inputs with expected outputs — Start with 50–100 real examples from your use case. Each example has an input (prompt + context) and a gold-standard output (what the correct response looks like).
- A scoring function — For structured outputs: exact match or schema validation. For text: human ratings, rule-based checks, or LLM-as-judge scoring. For code: execution and test pass rate.
- A runner that produces a score — Run all inputs through the current model/prompt, score each output, aggregate. The score is your benchmark. Regressions are easy to spot.
LLM-as-judge — scaling evaluation
Use Claude to score Claude's outputs
Write a scoring prompt
const scorePrompt = `You are evaluating a customer support AI response.
User question: ${question}
AI response: ${response}
Score the response 1-5 on:
- Accuracy: does it correctly answer the question?
- Helpfulness: does it give actionable next steps?
- Tone: is it appropriately professional?
Respond with JSON only: {"accuracy": N, "helpfulness": N, "tone": N, "overall": N, "reason": "..."}`Run the evaluator
const evaluation = await anthropic.messages.create({
model: "claude-haiku-4-5-20251001", // cheap model for evals
max_tokens: 200,
messages: [{ role: "user", content: scorePrompt }],
})
const scores = JSON.parse(evaluation.content[0].text)Aggregate across your eval set
const results = await Promise.all(evalSet.map(evaluateOne))
const avgScore = results.reduce((s, r) => s + r.overall, 0) / results.length
console.log(`Eval score: ${avgScore.toFixed(2)}/5.00`)Store results and compare across prompt versions
Save eval results to a database or CSV with a version tag. When you change the prompt, run the eval again and compare the scores. Any drop > 0.2 points warrants investigation before shipping.
Building your golden dataset
- Start with real examples — Sample actual user queries from your logs (or create realistic ones). Real distribution beats hypothetical examples.
- Include edge cases explicitly — Add examples that represent known failure modes: very short inputs, very long inputs, ambiguous queries, queries in unexpected languages, adversarial inputs.
- Label conservatively — When writing expected outputs, be precise. "Paris" not "Paris, France or just Paris". The tighter the expectation, the more useful the eval.
- Review and expand over time — Every time a real user hits a bug, add that example to the eval set. The eval set should grow from production failures, not just your imagination.
Run evals in CI before every deploy
Add your eval runner to your CI pipeline. If the eval score drops below a threshold (e.g., 0.3 points from baseline), the deploy fails. This catches prompt regressions automatically — the same way unit tests catch code regressions.
Try this
Build a 20-example eval set for a real or hypothetical AI feature (customer support chatbot, document summariser, code explainer). Write 5 edge case examples specifically. Build a simple scorer (even if it's just: does the output contain the expected keyword?). Run it. You now have the foundation of a production eval pipeline.