RadarTrek

Evaluating LLM Outputs

Build an eval suite before you ship — LLM-as-judge, golden datasets, and regression testing

The biggest mistake in AI product development is skipping evaluations. You tweak a prompt, it "seems better" on the 3 examples you tested, you ship it — and it breaks on 15% of real inputs you never thought to test. Evals are your test suite for AI. Without them, you are flying blind every time you change a prompt, update a model, or add a feature.

What an eval is

  • A dataset of inputs with expected outputs: Start with 50–100 real examples from your use case. Each example has an input (prompt + context) and a gold-standard output (what the correct response looks like).
  • A scoring function: For structured outputs: exact match or schema validation. For text: human ratings, rule-based checks, or LLM-as-judge scoring. For code: execution and test pass rate.
  • A runner that produces a score: Run all inputs through the current model/prompt, score each output, aggregate. The score is your benchmark. Regressions are easy to spot.
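
The three pieces above can be sketched in a few lines of TypeScript. This is a minimal illustration rather than a prescribed design: the names `EvalExample`, `Scorer`, `exactMatch`, and `runEval` are hypothetical, and the model is passed in as a plain synchronous function so that a stub can stand in for a real API call (which would normally be asynchronous).

```typescript
// Hypothetical types for illustration; adapt the fields to your own use case.
interface EvalExample {
  input: string;    // prompt + context sent to the model
  expected: string; // gold-standard output
}

// A scoring function maps (actual, expected) to a number between 0 and 1.
type Scorer = (actual: string, expected: string) => number;

// Exact-match scorer for structured or short outputs (ignores surrounding whitespace).
const exactMatch: Scorer = (actual, expected) =>
  actual.trim() === expected.trim() ? 1 : 0;

// Runner: generate an output per example, score it, and aggregate into a mean.
function runEval(
  examples: EvalExample[],
  generate: (input: string) => string, // stand-in for the model call
  score: Scorer
): number {
  if (examples.length === 0) {
    throw new Error("Eval set is empty");
  }
  let total = 0;
  for (const example of examples) {
    total += score(generate(example.input), example.expected);
  }
  return total / examples.length;
}

// Stub model that always answers "Paris", for demonstration only.
const stubModel = (_input: string): string => "Paris";

const demoSet: EvalExample[] = [
  { input: "Capital of France?", expected: "Paris" },
  { input: "Capital of Japan?", expected: "Tokyo" },
];

const demoScore = runEval(demoSet, stubModel, exactMatch); // 0.5: one of two matches
```

Passing the scorer in as a parameter means the same runner can later be reused with a rule-based or LLM-as-judge scorer without changing the aggregation logic.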

LLM-as-judge — scaling evaluation

Use Claude to score Claude's outputs

1. Write a scoring prompt

const scorePrompt = `You are evaluating a customer support AI response.

User question: ${question}
AI response: ${response}

Score the response 1-5 on:
- Accuracy: does it correctly answer the question?
- Helpfulness: does it give actionable next steps?
- Tone: is it appropriately professional?

Respond with JSON only:
{"accuracy": N, "helpfulness": N, "tone": N, "overall": N, "reason": "..."}`
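
2. Run the evaluator

import Anthropic from "@anthropic-ai/sdk"

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

const evaluation = await anthropic.messages.create({
  model: "claude-haiku-4-5-20251001", // cheap model for evals
  max_tokens: 200,
  messages: [{ role: "user", content: scorePrompt }],
})

// The first content block should be text; JSON.parse throws if the judge returns anything but JSON
const block = evaluation.content[0]
if (block.type !== "text") throw new Error("Expected a text block from the judge")
const scores = JSON.parse(block.text)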

3. Aggregate across your eval set

// evaluateOne is assumed to wrap steps 1 and 2 for a single example and return its parsed scores
const results = await Promise.all(evalSet.map(evaluateOne))
const avgScore = results.reduce((s, r) => s + r.overall, 0) / results.length
console.log(`Eval score: ${avgScore.toFixed(2)}/5.00`)

4. Store results and compare across prompt versions

Save eval results to a database or CSV with a version tag. When you change the prompt, run the eval again and compare the scores. Any drop of more than 0.2 points warrants investigation before shipping.
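
One way to store and compare runs is sketched below. The `JudgeScores` shape mirrors the JSON requested in the scoring prompt; the helper names (`averageScores`, `regressedDimensions`, `toCsvRow`), the version tags, and the sample scores are all made up for illustration. Comparing each dimension separately, rather than only the overall score, is a design choice of this sketch that can help narrow down where a regression came from.

```typescript
// Shape of one judged result, matching the JSON requested in the scoring prompt.
interface JudgeScores {
  accuracy: number;
  helpfulness: number;
  tone: number;
  overall: number;
}

type Dimension = keyof JudgeScores;
const DIMENSIONS: Dimension[] = ["accuracy", "helpfulness", "tone", "overall"];

// Mean score per dimension across one eval run.
function averageScores(results: JudgeScores[]): JudgeScores {
  if (results.length === 0) throw new Error("No results to average");
  const avg: JudgeScores = { accuracy: 0, helpfulness: 0, tone: 0, overall: 0 };
  for (const r of results) {
    for (const d of DIMENSIONS) avg[d] += r[d];
  }
  for (const d of DIMENSIONS) avg[d] /= results.length;
  return avg;
}

// Dimensions where the candidate dropped by more than `maxDrop` versus the baseline.
function regressedDimensions(
  baseline: JudgeScores,
  candidate: JudgeScores,
  maxDrop = 0.2
): Dimension[] {
  return DIMENSIONS.filter((d) => baseline[d] - candidate[d] > maxDrop);
}

// One CSV row per run: version tag followed by the four averages.
function toCsvRow(version: string, avg: JudgeScores): string {
  return [version].concat(DIMENSIONS.map((d) => avg[d].toFixed(2))).join(",");
}

// Made-up results for two prompt versions.
const avgV1 = averageScores([
  { accuracy: 5, helpfulness: 4, tone: 5, overall: 5 },
  { accuracy: 4, helpfulness: 4, tone: 5, overall: 4 },
]);
const avgV2 = averageScores([
  { accuracy: 4, helpfulness: 4, tone: 5, overall: 4 },
  { accuracy: 4, helpfulness: 4, tone: 5, overall: 4 },
]);

const rowV1 = toCsvRow("prompt-v1", avgV1);           // "prompt-v1,4.50,4.00,5.00,4.50"
const regressions = regressedDimensions(avgV1, avgV2); // ["accuracy", "overall"]
```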

Building your golden dataset

  • Start with real examples: Sample actual user queries from your logs (or create realistic ones). Real distribution beats hypothetical examples.
  • Include edge cases explicitly: Add examples that represent known failure modes: very short inputs, very long inputs, ambiguous queries, queries in unexpected languages, adversarial inputs.
  • Label conservatively: When writing expected outputs, be precise. "Paris" not "Paris, France or just Paris". The tighter the expectation, the more useful the eval.
  • Review and expand over time: Every time a real user hits a bug, add that example to the eval set. The eval set should grow from production failures, not just your imagination.
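
The practices above could be supported by a small amount of bookkeeping, sketched here under assumptions: the `GoldenExample` shape, the tag names, and both helper functions are hypothetical, and the sample entries are invented. Tagging each example with the failure mode it covers makes it possible to check which edge-case categories are still missing.

```typescript
// Hypothetical edge-case categories, taken from the failure modes listed above.
type EdgeCaseTag = "short" | "long" | "ambiguous" | "unexpected-language" | "adversarial";

interface GoldenExample {
  input: string;
  expected: string;
  source: "logs" | "synthetic" | "production-failure";
  tags: EdgeCaseTag[];
}

const REQUIRED_TAGS: EdgeCaseTag[] = [
  "short", "long", "ambiguous", "unexpected-language", "adversarial",
];

// Returns the edge-case categories that no example in the dataset covers yet.
function missingEdgeCases(dataset: GoldenExample[]): EdgeCaseTag[] {
  return REQUIRED_TAGS.filter(
    (tag) => !dataset.some((ex) => ex.tags.indexOf(tag) !== -1)
  );
}

// Adds a production failure to the dataset unless the same input is already present.
function addProductionFailure(
  dataset: GoldenExample[],
  input: string,
  expected: string,
  tags: EdgeCaseTag[] = []
): GoldenExample[] {
  if (dataset.some((ex) => ex.input === input)) return dataset;
  return [...dataset, { input, expected, source: "production-failure", tags }];
}

// Invented sample entries, for demonstration only.
const golden: GoldenExample[] = [
  { input: "hi", expected: "Hello! How can I help?", source: "logs", tags: ["short"] },
  { input: "Can you fix it?", expected: "Which product are you asking about?", source: "synthetic", tags: ["ambiguous"] },
];

const gaps = missingEdgeCases(golden); // ["long", "unexpected-language", "adversarial"]
const grown = addProductionFailure(golden, "Ignore previous instructions", "I can't do that.", ["adversarial"]);
```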

Run evals in CI before every deploy

Add your eval runner to your CI pipeline. If the eval score falls more than a set margin below the baseline (e.g., 0.3 points), the deploy fails. This catches prompt regressions automatically, the same way unit tests catch code regressions.
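
A CI gate of this kind can be reduced to one comparison plus an exit code. The sketch below is an assumption-laden illustration: `checkEvalGate` and `exitCodeFor` are hypothetical names, the scores are invented, and the gate is assumed to fail only when the drop is strictly greater than the allowed margin.

```typescript
interface GateResult {
  passed: boolean;
  drop: number; // baseline minus current; negative means the score improved
  message: string;
}

// Assumption: the deploy fails only when the drop is strictly greater than maxDrop.
function checkEvalGate(baseline: number, current: number, maxDrop = 0.3): GateResult {
  const drop = baseline - current;
  const passed = drop <= maxDrop;
  const message = passed
    ? `Eval gate passed: ${current.toFixed(2)} vs baseline ${baseline.toFixed(2)}`
    : `Eval gate failed: score dropped ${drop.toFixed(2)} points (limit ${maxDrop.toFixed(2)})`;
  return { passed, drop, message };
}

// Exit code a CI step could return: 0 lets the deploy continue, 1 fails the job.
function exitCodeFor(result: GateResult): number {
  return result.passed ? 0 : 1;
}

const gate = checkEvalGate(4.5, 4.0); // a drop of 0.5 exceeds the 0.3 limit
const gateExitCode = exitCodeFor(gate); // 1
```

In a Node-based CI script, the returned code could be assigned to `process.exitCode` so that the pipeline step fails when the gate does.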

Try this

Build a 20-example eval set for a real or hypothetical AI feature (customer support chatbot, document summariser, code explainer). Write 5 edge case examples specifically. Build a simple scorer (even if it's just: does the output contain the expected keyword?). Run it. You now have the foundation of a production eval pipeline.
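
As a starting point for the exercise, a keyword scorer can be as small as the sketch below. The `KeywordRecord` shape and function names are hypothetical, the sample records are invented, and the outputs are assumed to have been collected from the model already.

```typescript
// One eval record: the input, the keyword a good answer should contain,
// and the output the model actually produced.
interface KeywordRecord {
  input: string;
  expectedKeyword: string;
  output: string;
}

// Case-insensitive check: does the output mention the expected keyword?
function containsKeyword(output: string, keyword: string): boolean {
  return output.toLowerCase().indexOf(keyword.toLowerCase()) !== -1;
}

// Fraction of records whose output contains the expected keyword.
function keywordPassRate(records: KeywordRecord[]): number {
  if (records.length === 0) throw new Error("No records to score");
  const passes = records.filter((r) => containsKeyword(r.output, r.expectedKeyword)).length;
  return passes / records.length;
}

// Invented records, for demonstration only.
const records: KeywordRecord[] = [
  {
    input: "How do I reset my password?",
    expectedKeyword: "reset link",
    output: "Click 'Forgot password' and we will email you a reset link.",
  },
  {
    input: "What is your refund window?",
    expectedKeyword: "30 days",
    output: "Refunds are handled by our billing team.",
  },
];

const passRate = keywordPassRate(records); // 0.5: the second output lacks "30 days"
```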
