Taste · 5–7 hours · Harness Studio · 02 of 03

Build a Judge That Works

Take the rubric from Project 10 and build an LLM-as-judge: a Claude prompt that scores work against your rubric and explains why. The judge has to agree with you 80% of the time on your hand-scored examples before it ships. When it disagrees, you investigate which of you was right.

An LLM-as-judge is exactly what it sounds like: you give Claude a rubric and a piece of work, and Claude returns a score with reasons. It's an obvious idea. The non-obvious part is making one that actually agrees with you.

The reason this matters: once your judge works, you can grade your own work — and AI's work — at scale. You can score 100 essays in a minute. You can run your own writing past it weekly to see if you're drifting. That's the foundation of Project 12.

The 80% bar

Calibration matters. A judge that agrees with you 50% of the time is worse than no judge — it's noise dressed up as signal. The 80% threshold isn't arbitrary; it's where the judge becomes more reliable than your tired-eye self-review.

Step by step

  1. Build the judge prompt with the four required parts.

    The judge prompt has four parts: system (who the judge is), rubric (your YAML from Project 10), work (one piece of work to score), output format (strict JSON so you can parse it).

  2. Force evidence in the output schema.

    A judge that scores without quoting back is hand-waving. Require: for each dimension, return (score, evidence-quote, reason). The quote forces the judge to point at a specific sentence in the work, not just feel a vibe.

  3. Run the judge against your 5 hand-scored examples.

    Compare its scores to yours, dimension by dimension. Compute the agreement rate (within ±1 point on each dimension). 25 comparisons total.

  4. Investigate the disagreements (this is the work).

    For every dimension where the judge and you disagreed by 2+ points, ask: who was right? Sometimes the judge will be right (your eye was tired). Sometimes you will (the rubric missed something). Both are interesting.

  5. Iterate to 80%+ agreement.

    Loop. Run, investigate, sharpen, re-run. Expect about three rounds; by the last one the judge should agree with you on roughly 80% of dimensions. That's ship-ready. You'll also know your rubric better than you did before.

  6. Find the moment the judge changes your mind.

    Run the judge against one piece of work you've never scored before. Trust it. Read its evidence. Try to convince yourself it's wrong. If you can't — write up that case. That writeup is the artifact you ship.

A complete worked example, every file

The full college-essay judge: prompt, runner, schema validator, parsed output, and the case study where the judge changed the author's mind.

judge-prompt.md · the prompt template
## system

You are a careful judge of short personal essays for US college
applications. You score against the five-dimension rubric below,
each dimension 1–5, using the anchors provided. You quote one
specific sentence as evidence for each dimension. You do not pad,
soften, or hedge. You do not refuse to score.

If a dimension is genuinely ambiguous (the work doesn't give you
enough to judge), score it 3 and explain in `reason` why. Never
score 1 or 5 without quoted evidence.

## rubric

{insert rubric.yaml + anchors.md from Project 10 here}

## work

{insert the essay to be scored here}

## response

Return a single JSON object, no prose before or after, of this shape:

{
  "voice":       { "score": <1-5>, "evidence": "<...>", "reason": "<...>" },
  "specificity": { "score": <1-5>, "evidence": "<...>", "reason": "<...>" },
  "arc":         { "score": <1-5>, "evidence": "<...>", "reason": "<...>" },
  "pace":        { "score": <1-5>, "evidence": "<...>", "reason": "<...>" },
  "ending":      { "score": <1-5>, "evidence": "<...>", "reason": "<...>" },
  "total":       <sum of the five scores>,
  "headline":    "<one-sentence verdict>",
  "ship":        <total >= 18 with no dim < 3?>
}
judge.js · the runner with retry, validation, parsing
/**
 * The judge runner. Loads rubric + work, calls Claude, validates
 * the JSON, returns a score object. Used by the agreement
 * harness AND by Project 12's weekly drift check.
 */
import { Anthropic } from '@anthropic-ai/sdk';
import fs from 'node:fs/promises';
import { validateOutput } from './validate.js';

const client = new Anthropic();
const RUBRIC = await fs.readFile('rubric.yaml', 'utf8');
const ANCHORS = await fs.readFile('anchors.md', 'utf8');
const PROMPT_TEMPLATE = await fs.readFile('judge-prompt.md', 'utf8');

export async function judge(work, opts = {}) {
  const { model = 'claude-sonnet-4-6', maxRetries = 2 } = opts;

  const fullPrompt = PROMPT_TEMPLATE
    .replace('{insert rubric.yaml + anchors.md from Project 10 here}', RUBRIC + '\n\n' + ANCHORS)
    .replace('{insert the essay to be scored here}', work);

  for (let attempt = 0; attempt < maxRetries + 1; attempt++) {
    const msg = await client.messages.create({
      model,
      max_tokens: 1500,
      messages: [{ role: 'user', content: fullPrompt }],
    });

    const text = msg.content[0].text.trim();
    let parsed;
    try {
      // Some models wrap JSON in ```json``` — strip that
      const cleaned = text.replace(/^```(?:json)?\s*/m, '').replace(/\s*```$/m, '');
      parsed = JSON.parse(cleaned);
    } catch (e) {
      if (attempt === maxRetries) throw new Error('judge returned unparseable JSON after ' + maxRetries + ' retries: ' + e.message);
      continue;
    }

    const errors = validateOutput(parsed);
    if (errors.length === 0) return parsed;
    if (attempt === maxRetries) throw new Error('judge output failed validation: ' + errors.join('; '));
  }
}

// CLI: node judge.js path/to/essay.txt
if (import.meta.url === `file://${process.argv[1]}`) {
  const path = process.argv[2];
  if (!path) { console.error('usage: node judge.js path/to/essay.txt'); process.exit(1); }
  const work = await fs.readFile(path, 'utf8');
  const result = await judge(work);
  console.log(JSON.stringify(result, null, 2));
  console.log(`\n total: ${result.total}/25 · ship: ${result.ship} · ${result.headline}`);
}
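
The runner's header comment mentions an agreement harness, which isn't one of the files reproduced here. Below is a minimal sketch of what it could look like; the filename, hand-scores.json, and the essays/ folder are assumptions for illustration, not the project's actual layout.

agreement-check.js · a sketch of the agreement harness (assumed, not a shipped file)
/**
 * A sketch, not a shipped file. Runs the judge over the 5 hand-scored
 * essays and reports how many of the 25 dimension comparisons land
 * within ±1 of your own scores. Assumes hand-scores.json maps essay
 * filenames to your five scores, e.g.
 * { "e01.txt": { "voice": 4, "specificity": 5, "arc": 3, "pace": 3, "ending": 4 }, ... }
 */
import fs from 'node:fs/promises';
import { judge } from './judge.js';

const DIMS = ['voice', 'specificity', 'arc', 'pace', 'ending'];
const handScores = JSON.parse(await fs.readFile('hand-scores.json', 'utf8'));

let agree = 0;
let total = 0;
const disagreements = [];

for (const [file, mine] of Object.entries(handScores)) {
  const work = await fs.readFile(`essays/${file}`, 'utf8');
  const judged = await judge(work);
  for (const dim of DIMS) {
    total += 1;
    const gap = Math.abs(mine[dim] - judged[dim].score);
    if (gap <= 1) agree += 1;   // "agreement" means within ±1 on the dimension
    if (gap >= 2) disagreements.push({ file, dim, mine: mine[dim], judge: judged[dim].score });
  }
}

console.log(`agreement: ${agree}/${total} (${Math.round((agree / total) * 100)}%)`);
console.log('disagreements to investigate:', disagreements);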
output-schema.json · what every judge response must look like
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Judge output v1.0",
  "type": "object",
  "required": ["voice", "specificity", "arc", "pace", "ending", "total", "headline", "ship"],
  "properties": {
    "voice":       { "$ref": "#/definitions/dim" },
    "specificity": { "$ref": "#/definitions/dim" },
    "arc":         { "$ref": "#/definitions/dim" },
    "pace":        { "$ref": "#/definitions/dim" },
    "ending":      { "$ref": "#/definitions/dim" },
    "total":       { "type": "integer", "minimum": 5, "maximum": 25 },
    "headline":    { "type": "string", "minLength": 10, "maxLength": 200 },
    "ship":        { "type": "boolean" }
  },
  "definitions": {
    "dim": {
      "type": "object",
      "required": ["score", "evidence", "reason"],
      "properties": {
        "score":    { "type": "integer", "minimum": 1, "maximum": 5 },
        "evidence": { "type": "string", "minLength": 5 },
        "reason":   { "type": "string", "minLength": 10 }
      }
    }
  }
}
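
judge.js imports validateOutput from ./validate.js, which also isn't reproduced here. A hand-rolled sketch that enforces the same constraints as the schema above and returns an array of error strings (an empty array means valid, which is the contract judge.js relies on) could look like this:

validate.js · a sketch of the validator, mirroring output-schema.json by hand
/**
 * A sketch, not the project's actual file. Checks the parsed judge
 * output against the constraints in output-schema.json without pulling
 * in a JSON Schema library. Returns an array of error strings; an
 * empty array means the output is valid.
 */
const DIMS = ['voice', 'specificity', 'arc', 'pace', 'ending'];

export function validateOutput(parsed) {
  const errors = [];
  if (typeof parsed !== 'object' || parsed === null) return ['output is not a JSON object'];

  for (const dim of DIMS) {
    const d = parsed[dim];
    if (!d || typeof d !== 'object') { errors.push(`missing dimension: ${dim}`); continue; }
    if (!Number.isInteger(d.score) || d.score < 1 || d.score > 5) errors.push(`${dim}.score must be an integer 1-5`);
    if (typeof d.evidence !== 'string' || d.evidence.length < 5) errors.push(`${dim}.evidence must quote the work`);
    if (typeof d.reason !== 'string' || d.reason.length < 10) errors.push(`${dim}.reason is missing or too short`);
  }

  if (!Number.isInteger(parsed.total) || parsed.total < 5 || parsed.total > 25) errors.push('total must be an integer 5-25');
  if (typeof parsed.headline !== 'string' || parsed.headline.length < 10 || parsed.headline.length > 200) errors.push('headline must be 10-200 characters');
  if (typeof parsed.ship !== 'boolean') errors.push('ship must be a boolean');

  return errors;
}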
example-output.json · the judge scoring essay e02
{
  "voice": {
    "score": 5,
    "evidence": "She never wrote it down.",
    "reason": "Five-word sentence with the writer's signature spareness. Could only be this writer."
  },
  "specificity": {
    "score": 5,
    "evidence": "winter melon soup",
    "reason": "Named dish. Not 'a recipe' or 'a soup'. The specificity does work — without it the essay collapses."
  },
  "arc": {
    "score": 4,
    "evidence": "The first time I made it after she died, the smell came back exactly.",
    "reason": "Real arc: from the grandmother teaching to the writer cooking alone. The smell-coming-back is the earned shift. One half-point off because the arc happens entirely in the last sentence."
  },
  "pace": {
    "score": 5,
    "evidence": "She never wrote it down.",
    "reason": "No filler. Three sentences total. Every sentence pulls weight."
  },
  "ending": {
    "score": 5,
    "evidence": "the smell came back exactly.",
    "reason": "Lands. Specific. The word 'exactly' does the entire emotional work without naming the emotion. Rare."
  },
  "total": 24,
  "headline": "A three-sentence essay that earns its weight at every word — the kind of restraint admissions readers remember.",
  "ship": true
}
disagreement-case-3.md · where the judge changed my mind
## case 3: my essay e03 (the coffee shop one)

### my hand score (Project 10)

  voice: 3, specificity: 4, arc: 4, pace: 3, ending: 4 → total 18

### the judge's score

  voice: 3, specificity: 3, arc: 3, pace: 4, ending: 2 → total 15

### where we disagreed (>1 point)

  - ending: I scored 4. Judge scored 2.

### the judge's evidence

  evidence: "The coffee shop closed and I moved on, but I'll
             never forget what it taught me about people."
  reason: "The 'never forget what it taught me' construction is
           the closing-line equivalent of 'I learned the value
           of perseverance.' It claims an ending without earning
           one. The specific coffee shop detail does NOT carry
           into this sentence — the writer reverts to the
           generic register precisely at the moment they should
           land specifically."

### my reflection

  the judge is right.

  I had given the ending a 4 because I LIKED the rest of the
  essay. I let the strong middle inflate my score on the closer.
  Reading the judge's evidence, the sentence really is pointing
  at "I learned…" — exactly the failure mode my own anchor for
  ending=1 calls out.

  the rubric I wrote in Project 10 already said this. My eye
  forgave it because of halo effect.

  this is the moment I trust the judge more than my tired self.
  the judge has the rubric in front of it; I have my mood.

### what I changed

  - re-scored e03 to total 15.
  - re-scored e05 (which I'd given 21) — judge agreed at 21, no change.
  - Going forward, I won't score my own essays without first running
    the judge. Use the judge as a forcing function, then override
    only with cause.

Live demo 1: judge-vs-you agreement calculator

Enter your scores and the judge's scores for 5 examples × 5 dimensions. The calculator returns your agreement rate and lists the specific disagreements to investigate.

Live demo 2: assemble a real judge prompt

Paste your rubric and a piece of work. The widget assembles the full judge prompt — exactly what you'd send to the Claude API. Copy it, paste into console.anthropic.com Workbench, see what comes back.

Live demo 3: validate a judge response against the schema

Paste a judge response (the JSON Claude returned). The validator checks it against the strict schema. Most v0 judge prompts return invalid JSON about 20% of the time — this is the widget you'd run before trusting any score.

What makes this hard

The hardest move is investigating disagreements honestly. The strong human urge is to assume the judge is wrong because it's a machine. Sometimes it is. Often it isn't. The judge has the rubric in front of it; you might not. The judge isn't tired; you might be. The judge has no friendship with the author; you might. The judge is sometimes the more honest reader. Letting that be true is a taste move.

The second hard thing is keeping the judge prompt stable across the iteration. If you tweak the prompt every time you don't like a result, you're just curve-fitting to your existing scores. Hold the prompt; tighten the rubric. The judge should change because the rubric got sharper, not because the prompt got cleverer.

The third — and this catches engineers more than writers — is treating the judge's output as gospel. It isn't. It's a calibrated tool. It still drifts on edge cases. The agreement rate is a floor for trust, not a ceiling. Always look at the evidence quote, not just the score.

Self-check before you ship

  • Judge prompt requires evidence (a quoted sentence) for every dimension.
  • Output is strict JSON, validated against output-schema.json; my code parses it without surprises.
  • Agreement rate against my 5 hand-scored examples is ≥ 80% (within ±1 on each dimension).
  • I have at least one investigated disagreement where the judge changed my mind, written up.
  • I tightened the rubric, not the prompt, to close gaps.
  • The judge runner has retry logic for invalid JSON output.

Push further · for the harder end of 15+

A calibrated judge is real infrastructure. Here's how it scales.

  1. Multi-model ensemble. Run the judge with three different models (haiku, sonnet, opus). Aggregate the scores. Take median, not mean — outliers are interesting. Where the three models disagree wildly, that's where your rubric is genuinely ambiguous. Sharpen those anchors. (A sketch follows this list.)
  2. Build the chain-of-thought variant. Add a "first, walk through your reasoning step by step before giving the score" instruction. Compare against the direct-score version on the same 25 dimensions. Chain-of-thought usually wins by ~10% agreement on hard examples — but is 3x slower and 5x more expensive. Decide explicitly which version to ship and why.
  3. Self-eval the judge. Have the judge score 100 essays. Sample 10 of them. Hand-score those 10 yourself, blind. Compute the agreement on the 10. This is "ground truth eval of your judge" — exactly the loop OpenAI / Anthropic / DeepMind use to validate their own evaluation systems. You're now doing real LLM-evaluation research.
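
A sketch of the ensemble in item 1, reusing judge() from the worked example. The haiku and opus model IDs are placeholders to fill in, and the whole file is an illustration rather than part of the project; the per-dimension median and spread are the only new logic.

ensemble.js · a sketch of the three-model ensemble (assumed, not a shipped file)
/**
 * A sketch, not a shipped file. Runs the same judge prompt through
 * three model tiers and takes the per-dimension median. A large
 * spread on a dimension flags an ambiguous rubric anchor.
 */
import { judge } from './judge.js';

const MODELS = [
  'claude-haiku-…',      // placeholder: fill in a real cheap-tier model id
  'claude-sonnet-4-6',   // the default judge.js already uses
  'claude-opus-…',       // placeholder: fill in a real top-tier model id
];

const DIMS = ['voice', 'specificity', 'arc', 'pace', 'ending'];
const median = (xs) => xs.slice().sort((a, b) => a - b)[Math.floor(xs.length / 2)];

export async function ensembleJudge(work) {
  const results = await Promise.all(MODELS.map((model) => judge(work, { model })));
  const scores = {};
  const spread = {};
  for (const dim of DIMS) {
    const s = results.map((r) => r[dim].score);
    scores[dim] = median(s);
    spread[dim] = Math.max(...s) - Math.min(...s);   // 2+ here means sharpen that anchor
  }
  return { scores, spread, raw: results };
}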