Academies Harness Studio ⚙️ Builders · Module 02

Build your own judge.

a seven-step walkthrough · turn a rubric into a living check · about forty-five minutes

⚠ The temptation

Ask Claude to grade things using your rubric.

But if you don't first run the rubric yourself on a few examples, you won't notice when Claude is grading them inconsistently — and the whole judge becomes a confidence game. Grade by hand first. Always.

Step 01 · The big jump

A rubric is a document. A judge is something that runs.

In Module 01 you built a rubric. It's a document: target, criteria, weights, anti-rubric. Useful — but only if someone reads it and applies it by hand to one piece of work at a time. Want to check fifty Skills at once? Fifty drafts? Fifty agent responses? You can't. The rubric is a static thing; taste-checking fifty pieces of work is not.

📄
A rubric

A document. Human-readable. Static. Evaluates one thing at a time, by hand, slowly.

⚙️
A judge

A runnable. Machine-executable. Dynamic. Evaluates any number of things, automatically, in seconds.

This module is about the jump — taking the rubric from Module 01 and turning it into a judge you can actually run. That's when the harness becomes real: the moment your taste is encoded into something that can catch bad work while you sleep.

A rubric in a Google Doc catches the work you remember to check. A judge that runs automatically catches the work you'd have missed — including work you made yourself when your standards slipped. That's not a small upgrade. That's the difference between "I have good taste" and "my good taste is protected."

Step 02 · Two kinds of judge

A judge can be a person or an AI. Both work. Differently.

You have two tools for running a rubric: a human (yourself, a teacher, a peer reviewer) or an AI like Claude. Each one has real strengths and real weaknesses. The craft of Harness Studio is knowing which to use when — and most often, using both together so their weaknesses cancel.

🧑‍⚖️
A human judge
Catches things a rubric doesn't list — weird, novel, "this is off but I can't name it" moments
Knows context the rubric can't express
Can argue with themselves — actually weigh tradeoffs
Slow. Expensive. Moody. Gets tired.
Inconsistent across days and moods
Can't run at 3am on fifty things at once
Best for: the final human check on work that matters. The last mile of quality.
🤖
An LLM judge
Consistent across runs — scores the same thing the same way
Fast. Can check 50 things before breakfast.
Never tired, never moody, never sick
Stuck inside the rubric's categories — misses what isn't named
Can be sycophantic: wants to give high scores
Drifts in quiet ways as the model changes
Best for: volume and consistency. First-pass triage on lots of work.

The right setup for most serious work is LLM first, human last. The LLM runs across everything and flags the stuff that scored low or looked weird. A human then reviews only the flagged pieces. That's how you get consistency AND catch the things the LLM missed — without a human having to look at every single item.
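The LLM-first, human-last flow can be sketched in a few lines of Python. This is a hypothetical helper (the names `flag_for_review` and `scored_items` are mine, not from any library): it assumes each item already carries a score from the LLM judge, and returns only the ones a human should look at.

```python
def flag_for_review(scored_items, threshold=3.0):
    """First-pass triage: the LLM judge has already scored everything.
    A human reviews only items at or below the threshold."""
    return [item for item in scored_items if item["score"] <= threshold]


# Fifty items went through the LLM judge; a human sees only the low scorers.
scored = [
    {"id": "story-01", "score": 4.6},
    {"id": "story-02", "score": 2.1},  # flagged
    {"id": "story-03", "score": 3.0},  # flagged (at threshold)
]
needs_human = flag_for_review(scored)
```

The threshold is a dial, not a constant: tighten it and the human sees more; loosen it and the LLM decides more on its own.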

Step 03 · Pick a rubric

Pick a real rubric to turn into a judge.

For the live runner in step 5, you need something concrete to judge. Pick a kind of thing you'd want to evaluate — each option comes with a pre-built rubric you can use as a starting point in the runner.

📖Short story opening
💌Apology text message
💡Explanation for a kid
🧭Piece of advice
You'll run a judge on whichever option you pick. The runner in step 5 will be pre-loaded with a rubric for it.

Notice that all four of these are text-based. Judges work best on text because text is easy to feed to an LLM and easy to reason about. You can build judges for images, code, or audio too — but text is where the craft is clearest and where you should start.

Step 04 · The judge prompt

A judge is really a prompt with five parts.

When your judge is an LLM, it's not magic — it's a carefully written prompt that tells the model exactly what you want. Every good judge prompt has the same five parts in the same order. Miss any of them and the judge starts drifting.

judge.prompt
# ── 1 · the role ──
You are a careful, calibrated judge. You do not give
high scores to be kind. You give accurate scores.

# ── 2 · the target ──
You are evaluating: {target}
# e.g. "the opening paragraph of a short story for a 13-year-old"

# ── 3 · the rubric (from Module 01) ──
Score each criterion 1–5, where 1 = clearly missing,
3 = present but weak, 5 = present and strong.
{criteria with weights}

# ── 4 · the anti-rubric ──
Watch for these games: {anti-rubric notes}
If you see them, lower the score and say so.

# ── 5 · the output format ──
Reply with ONLY JSON in this exact shape:
{"scores": [{"name": "...", "score": N}], "reasoning": "..."}
1 · Role · tells the model how to act
2 · Target · what's being judged
3 · Rubric · the criteria and weights
4 · Anti-rubric · catches the games
5 · Format · so you can parse it

Read through the five parts. Notice that the rubric parts come from Module 01 directly — you already know how to write them. The new parts for Module 02 are the role (tells the LLM to resist sycophancy), the anti-rubric being explicit in the prompt (catches games), and the JSON format (so your code can read the result reliably).
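Part 5 exists so your code can consume the verdict. Here is a minimal sketch of the parsing side, assuming the judge replied with the exact JSON shape above; the criterion names and weights are made up for illustration, and `weighted_score` is my name, not a library function.

```python
import json


def weighted_score(reply_text, weights):
    """Parse the judge's JSON reply and fold the per-criterion
    scores into a single weighted 1-5 number."""
    verdict = json.loads(reply_text)
    total = sum(weights[s["name"]] * s["score"] for s in verdict["scores"])
    return total / sum(weights.values())


reply = (
    '{"scores": [{"name": "hook", "score": 4}, {"name": "voice", "score": 2}],'
    ' "reasoning": "Strong opening line, flat narration."}'
)
score = weighted_score(reply, {"hook": 2, "voice": 1})
# (2*4 + 1*2) / 3 = 10/3 ≈ 3.33
```

If the model wraps its JSON in extra prose, `json.loads` raises immediately: that is the point of demanding "ONLY JSON" in the prompt.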

The single most important line in any LLM judge prompt is the role line. Without "do not give high scores to be kind", most LLMs will drift toward giving everything a 4 out of 5. Not because they're dumb — because they've been trained to be pleasant. You have to explicitly tell the judge: your job is accuracy, not kindness. That one sentence can change the entire distribution of scores.

Step 05 · The real thing

Run a real judge on real work.

This is a real LLM judge. It calls Claude with a prompt built from your target, your criteria, and your anti-rubric — and Claude actually evaluates the work you paste in. The scores and reasoning come back live. Your rubric just stopped being a document and started being something that runs.

Live judge runner · powered by Claude

The rubric
The work to judge
↳ Fill in the rubric and the work, then click Run.
The verdict
Click ▶ Run the judge to see real scores and reasoning.

Run the judge on the same work twice. Do the scores stay the same? Run it on something obviously bad and something obviously good. Are the scores different enough? Those are the two most important tests of any judge — consistency and sensitivity. A judge that can't tell the difference between great and terrible isn't a judge. It's a rubber stamp.
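Both tests are easy to automate once your judge is callable from code. A sketch, assuming you have a `judge(work) -> float` function wrapping the LLM call; the helper names and thresholds here are mine, not from any library.

```python
def is_consistent(judge, work, runs=3, tolerance=0.5):
    """Consistency: the same work should score roughly the same every run."""
    scores = [judge(work) for _ in range(runs)]
    return max(scores) - min(scores) <= tolerance


def is_sensitive(judge, good_work, bad_work, min_gap=1.5):
    """Sensitivity: obviously good and obviously bad work should land
    far enough apart that the judge isn't a rubber stamp."""
    return judge(good_work) - judge(bad_work) >= min_gap
```

If either check fails, fix the prompt before trusting any score the judge produces.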

Step 06 · Prompt taste

Good judge, bad judge. Often just a few words apart.

Two rounds. Each round shows two judge prompts that look similar but produce completely different results. The difference is the whole craft of prompt-level taste.

Round 1. Two prompts to judge a student essay. Which is better?

Round 2. Your judge has scored twenty essays — and every one got 4/5 or higher. What should you do?

The single most important habit in judge-building is the sanity check: deliberately feed your judge something obviously bad, and see if it catches it. If your judge can't distinguish real work from obvious nonsense, it can't distinguish great work from mediocre work either — it's just producing numbers that look rigorous. A broken judge is worse than no judge, because it gives you false confidence in decisions you're about to make.
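One way to make the sanity check a habit is to bolt it onto the judge itself. A sketch, assuming the same hypothetical `judge(work) -> float` callable as in step 5; the nonsense string and the `fail_below` threshold are illustrative choices, not fixed rules.

```python
NONSENSE = "purple monkey dishwasher " * 10  # obviously bad input


def sanity_check(judge, fail_below=2.5):
    """A judge that can't flunk gibberish is broken.
    Run this before trusting the judge on real work."""
    score = judge(NONSENSE)
    if score >= fail_below:
        raise RuntimeError(
            f"Judge gave nonsense a {score}; it's a rubber stamp, not a judge."
        )
    return score
```

Run it every time you change the prompt or the underlying model, since either change can quietly break a judge that used to work.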

Step 07 · You did it
⚙️

You just ran a real judge on real work.

Powered by Claude, built from your own rubric, with calibrated scoring and honest anti-rubric notes. This is the thing most AI companies haven't figured out yet — and you just built one from scratch.

What you just learned

  • A rubric is a document. A judge is a runnable. Jumping from one to the other is the whole Module 02.
  • Two kinds of judge: human (smart, slow, inconsistent) and LLM (fast, consistent, drifts). The serious setup is LLM first, human last.
  • A judge prompt has five parts: role, target, rubric, anti-rubric, format.
  • The most important line is the role line — "do not give high scores to be kind". Without it, LLMs default to sycophancy.
  • Always run a sanity check: feed the judge something obviously bad. If it doesn't catch it, the judge is broken.
  • Score compression (everything clustering high) is a red flag, not good news.
  • A broken judge is worse than no judge — it gives you false confidence.

In Module 03, you'll learn what to do when the thing drifting isn't the judge — it's you. What happens when your own standards slowly slip? How do you notice? And what does a harness look like that can catch the builder rather than the built?

★ Before you call it done

Three questions. Same three. Every time.

These are the same three questions for every module in Kindling. They are how you check whether AI did the part it should and you did the part only you could. Tap each one to mark it true.

★ ★ ★

This is yours. Ship it.