A rubric is a document. A judge is something that runs.
In Module 01 you built a rubric. It's a document: target, criteria, weights, anti-rubric. Useful — but only if someone reads it and applies it by hand to one piece of work at a time. Want to check fifty Skills at once? Fifty drafts? Fifty agent responses? You can't. The rubric is a static thing; taste-checking fifty pieces of work is not.
A rubric
A document. Human-readable. Static. Evaluates one thing at a time, by hand, slowly.
A judge
A runnable. Machine-executable. Dynamic. Evaluates any number of things, automatically, in seconds.
This module is about the jump — taking the rubric from Module 01 and turning it into a judge you can actually run. That's when the harness becomes real: the moment your taste is encoded into something that can catch bad work while you sleep.
A rubric in a Google Doc catches the work you remember to check. A judge that runs automatically catches the work you'd have missed — including work you made yourself when your standards slipped. That's not a small upgrade. That's the difference between "I have good taste" and "my good taste is protected."
A judge can be a person or an AI. Both work. Differently.
You have two tools for running a rubric: a human (yourself, a teacher, a peer reviewer) or an AI like Claude. Each one has real strengths and real weaknesses. The craft of Harness Studio is knowing which to use when — and most often, using both together so their weaknesses cancel.
A human judge
Smart, context-aware, and able to notice what the rubric forgot. Also slow and inconsistent: the fiftieth item never gets the same attention as the first.
An LLM judge
Fast and consistent across any number of items. Also prone to drift of its own, most often toward scoring everything kindly.
The right setup for most serious work is LLM first, human last. The LLM runs across everything and flags the stuff that scored low or looked weird. A human then reviews only the flagged pieces. That's how you get consistency AND catch the things the LLM missed — without a human having to look at every single item.
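To make that pipeline concrete, here's a minimal sketch in Python. Everything in it is illustrative rather than part of the course: the judge callable stands in for whatever runs your rubric and returns a score, and the 3.5 threshold is an arbitrary example, not a recommendation.

```python
from typing import Callable

def triage(items: list[str], judge: Callable[[str], float], flag_below: float = 3.5):
    """LLM first, human last: score everything automatically, then split the work
    into auto-passed items and items flagged for a human to review."""
    passed, flagged = [], []
    for work in items:
        score = judge(work)  # e.g. a weighted 1-5 score from the LLM judge built below
        (passed if score >= flag_below else flagged).append((work, score))
    return passed, flagged

# A human looks only at `flagged`: fifty items in, a handful out.
```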
Pick a real rubric to turn into a judge.
For the live runner in step 5, you need something concrete to judge. Pick a kind of thing you'd want to evaluate — each option comes with a pre-built rubric you can use as a starting point in the runner.
Notice that all four of these are text-based. Judges work best on text because text is easy to feed to an LLM and easy to reason about. You can build judges for images, code, or audio too — but text is where the craft is clearest and where you should start.
A judge is really a prompt with five parts.
When your judge is an LLM, it's not magic — it's a carefully written prompt that tells the model exactly what you want. Every good judge prompt has the same five parts in the same order. Miss any of them and the judge starts drifting.
# 1. Role
You are a careful, calibrated judge. You do not give
high scores to be kind. You give accurate scores.

# 2. Target
You are evaluating: {target}
# e.g. "the opening paragraph of a short story for a 13-year-old"

# 3. Rubric
Score each criterion 1–5, where 1 = clearly missing,
3 = present but weak, 5 = present and strong.
{criteria with weights}

# 4. Anti-rubric
Watch for these games: {anti-rubric notes}
If you see them, lower the score and say so.

# 5. Format
Reply with ONLY JSON in this exact shape:
{"scores": [{"name": "...", "score": N}], "reasoning": "..."}
Read through the five parts. Notice that the rubric parts come from Module 01 directly — you already know how to write them. The new parts for Module 02 are the role (tells the LLM to resist sycophancy), the anti-rubric being explicit in the prompt (catches games), and the JSON format (so your code can read the result reliably).
The single most important line in any LLM judge prompt is the role line. Without "do not give high scores to be kind", most LLMs will drift toward giving everything a 4 out of 5. Not because they're dumb — because they've been trained to be pleasant. You have to explicitly tell the judge: your job is accuracy, not kindness. That one sentence can change the entire distribution of scores.
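To show how the five parts fit together once your code has to actually call Claude, here's a minimal sketch using the Anthropic Python SDK. The target, criteria, weights, and anti-rubric below are placeholder examples standing in for your Module 01 rubric, the helper names are invented for this sketch, and you should swap in whichever Claude model you actually use.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder rubric contents: replace with your own from Module 01.
TARGET = "the opening paragraph of a short story for a 13-year-old"
CRITERIA = [
    {"name": "hooks the reader by the end of the first sentence", "weight": 0.4},
    {"name": "establishes a distinct voice", "weight": 0.35},
    {"name": "raises a question worth reading on for", "weight": 0.25},
]
ANTI_RUBRIC = "shock for its own sake posing as a hook; purple prose posing as voice"

def build_prompt(work: str) -> str:
    criteria_lines = "\n".join(f"- {c['name']} (weight {c['weight']})" for c in CRITERIA)
    parts = [
        # 1. Role: resist sycophancy
        "You are a careful, calibrated judge. You do not give high scores "
        "to be kind. You give accurate scores.",
        # 2. Target
        f"You are evaluating: {TARGET}",
        # 3. Rubric
        "Score each criterion 1-5, where 1 = clearly missing, "
        "3 = present but weak, 5 = present and strong.\n" + criteria_lines,
        # 4. Anti-rubric
        f"Watch for these games: {ANTI_RUBRIC}\n"
        "If you see them, lower the score and say so.",
        # 5. Format
        'Reply with ONLY JSON in this exact shape:\n'
        '{"scores": [{"name": "...", "score": N}], "reasoning": "..."}',
        "WORK TO EVALUATE:\n" + work,
    ]
    return "\n\n".join(parts)

def run_judge(work: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: substitute the model you use
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(work)}],
    )
    result = json.loads(message.content[0].text)  # trusts the ONLY-JSON instruction
    by_name = {s["name"]: s["score"] for s in result["scores"]}
    # Collapse the per-criterion scores into one weighted 1-5 number.
    result["weighted"] = sum(c["weight"] * by_name.get(c["name"], 0) for c in CRITERIA)
    return result
```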
Run a real judge on real work.
This is a real LLM judge. It calls Claude with a prompt built from your target, your criteria, and your anti-rubric — and Claude actually evaluates the work you paste in. The scores and reasoning come back live. Your rubric just stopped being a document and started being something that runs.
Live judge runner · powered by Claude
Run the judge on the same work twice. Do the scores stay the same? Run it on something obviously bad and something obviously good. Are the scores different enough? Those are the two most important tests of any judge — consistency and sensitivity. A judge that can't tell the difference between great and terrible isn't a judge. It's a rubber stamp.
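If you want to run those two tests outside the live runner, a rough sketch looks like this. It assumes the illustrative run_judge() helper from the sketch above; the sample texts are stand-ins for work you already know is strong or weak.

```python
# Consistency: the same work, judged twice, should land in roughly the same place.
good = "<an opening paragraph you already know is strong>"
bad = "<an opening paragraph you already know is weak>"

first = run_judge(good)["weighted"]
second = run_judge(good)["weighted"]
print(f"consistency drift: {abs(first - second):.2f}")  # want this near zero

# Sensitivity: obviously good vs. obviously bad should be far apart.
gap = run_judge(good)["weighted"] - run_judge(bad)["weighted"]
print(f"good-vs-bad gap: {gap:.2f}")  # want this wide, not a sliver
```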
Good judge, bad judge. Often just a few words apart.
Two rounds. Each round shows two judge prompts that look similar but produce completely different results. The difference is the whole craft of prompt-level taste.
Round 1. Two prompts to judge a student essay. Which is better?
Round 2. Your judge has scored twenty essays — and every one got 4/5 or higher. What should you do?
The single most important habit in judge-building is the sanity check: deliberately feed your judge something obviously bad, and see if it catches it. If your judge can't distinguish real work from obvious nonsense, it can't distinguish great work from mediocre work either — it's just producing numbers that look rigorous. A broken judge is worse than no judge, because it gives you false confidence in decisions you're about to make.
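One way to make that habit stick is to keep the sanity check as a tiny automated test that runs every time you touch the judge prompt. A sketch, again assuming the illustrative run_judge() helper; the nonsense sample and the 2.0 threshold are placeholders you'd tune to your own rubric.

```python
# Obvious nonsense should score near the floor. If it doesn't, stop trusting
# the judge's numbers until the prompt is fixed.
NONSENSE = "Lorem ipsum dolor sit amet. Banana banana banana. The end."

def test_judge_catches_nonsense():
    score = run_judge(NONSENSE)["weighted"]
    assert score <= 2.0, (
        f"Judge gave obvious nonsense {score:.1f}/5: "
        "it is rubber-stamping, not judging."
    )
```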
You just ran a real judge on real work.
Powered by Claude, built from your own rubric, with calibrated scoring and honest anti-rubric notes. A runnable judge is something most AI companies still haven't figured out, and you just built one from scratch.
What you just learned
- A rubric is a document. A judge is a runnable. Getting from one to the other is the whole of Module 02.
- Two kinds of judge: human (smart, slow, inconsistent) and LLM (fast, consistent, drifts). The serious setup is LLM first, human last.
- A judge prompt has five parts: role, target, rubric, anti-rubric, format.
- The most important line is the role line — "do not give high scores to be kind". Without it, LLMs default to sycophancy.
- Always run a sanity check: feed the judge something obviously bad. If it doesn't catch it, the judge is broken.
- Score compression (everything clustering high) is a red flag, not good news.
- A broken judge is worse than no judge — it gives you false confidence.
In Module 03, you'll learn what to do when the thing drifting isn't the judge — it's you. What happens when your own standards slowly slip? How do you notice? And what does a harness look like that can catch the builder rather than the built?
★ Before you call it done
Three questions. Same three. Every time.
These are the same three questions for every module in Kindling. They are how you check whether AI did the part it should and you did the part only you could. Tap each one to mark it true.
★ ★ ★