Most people have taste. Few people can articulate it. The gap between the two is where this project lives. By the end you'll have a one-page rubric that says, in writing, what makes work good in your chosen domain. It will surprise you.
This is the first Harness Studio project — the project before any system, before any code, before any judge. The judge in Project 11 will only ever be as smart as the rubric you write today. So: write a real one.
The first rubric you write will be wrong. That's fine. Writing it down is what reveals where it's wrong. You will rewrite it three times before it's good. The rewrites are the project.
Step by step
- Pick the domain. Be sharp about it.
Not "good writing." Pick "a 200-word personal essay for a college application" or "a Slack message that gets a thoughtful reply instead of a thumbs-up" or "a code comment future-me will thank past-me for." The narrower the domain, the more honest your rubric will be.
- Pick five dimensions.
Five — not three (too vague), not eight (overlapping). Each dimension is one sharp thing the work either has or doesn't. Each must be checkable by reading the work, not by reading your mind.
- Write anchors at 1, 3, and 5 for each dimension.
Anchors are the calibration. A "5" on voice in your rubric should be unambiguous. Same for "1." This is where most rubrics die — people write the dimensions and skip the anchors, and then nobody (including them) can score anything consistently.
- Score five real examples by hand.
Find five real examples in your domain. Don't ask Claude to generate them. Use real ones — kid magazine essays, sample admissions essays online, your own past work, a friend's draft. Score each against your five dimensions. You'll get two surprises (described below).
- Investigate the two surprises.
Surprise 1: one example you thought was great will score 11/25. Decide who's right — your rubric or your gut. Surprise 2: two examples will score similarly but feel very different — meaning you're missing a dimension. Add it. Re-score everything.
- Write the reflection.
One paragraph: what surprised you about your own taste once you had to defend it on paper? Be honest. The number-one surprise is usually: "I had been calling things good when they didn't actually score well on my own rubric. My eye was being lazy." That sentence is the goal of this project.
A complete worked example, every file
The full rubric for "a 200-word personal essay for a college application." Every dimension, every anchor, five real-essay scores, and the reflection that surprised the author.
version: 1.0
domain: |
A 200-word personal essay for a US college application,
written by a high school senior, on a self-chosen topic.
Written for an admissions reader who reads ~80 of these per day.
dimensions:
- id: voice
weight: 1.0
asks: |
Could only this writer have written this? Or could any
thoughtful person?
- id: specificity
weight: 1.0
asks: |
Does the essay name a specific thing — a teacher, a moment,
a smell — that grounds it? Or does it speak in abstractions?
- id: arc
weight: 1.0
asks: |
Does the writer change between the first sentence and the
last? Does the change feel earned, or claimed?
- id: pace
weight: 1.0
asks: |
Are there sentences that aren't pulling weight? Could 25%
be cut without losing anything?
- id: ending
weight: 1.0
asks: |
Does the last sentence land — or peter out into "and so I
learned" platitudes?
scoring:
scale: 1-5 per dimension
total: 5-25
ship_threshold: >= 18 (with no dimension below 3)
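The ship rule in the scoring block can be sketched as a one-function check. This is a minimal sketch; the function name `should_ship` and the example scores are mine, not part of the rubric file — only the threshold (≥ 18, no dimension below 3) comes from the rubric.

```python
# Sketch of the ship rule: total >= 18 AND no dimension below 3.

def should_ship(scores: dict[str, int]) -> bool:
    """Apply the rubric's ship threshold to one essay's scores."""
    total = sum(scores.values())
    return total >= 18 and min(scores.values()) >= 3

# Hypothetical essays, scored 1-5 on each of the five dimensions.
solid = {"voice": 4, "specificity": 5, "arc": 4, "pace": 3, "ending": 4}
print(should_ship(solid))        # total 20, floor 3 -> True

weak_ending = {"voice": 5, "specificity": 5, "arc": 4, "pace": 4, "ending": 2}
print(should_ship(weak_ending))  # total 20, but ending < 3 -> False
```

The second example is the whole point of the "no dimension below 3" clause: a high total can still hide one disqualifying weakness.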
## voice
1: |
Reads like a competent stranger. AI could have written this.
Vocabulary generic — 'passionate', 'journey', 'taught me'.
The author is invisible. You finish the essay knowing nothing
about who wrote it.
3: |
Has a tic — a sentence rhythm or word choice that's the
writer's. But you'd have to look hard to find it. Mostly
the voice is the textbook voice of a competent applicant.
5: |
Could only have been written by this person. The voice is
unmistakable from sentence one. Vocabulary is specific to the
writer's world. You finish the essay and you'd recognize their
next paragraph anywhere.
## specificity
1: |
All abstractions. "I learned the value of perseverance from
my time on the soccer team." No moments, no people named,
no smells, no objects.
3: |
One specific detail, but treated as decoration, not load-
bearing. The detail is mentioned and then dropped.
5: |
The whole essay turns on a specific moment, sensory detail,
or named person. The detail does the work. Take the detail
away and the essay collapses.
## arc
1: |
Static. The writer at the end is the writer at the beginning.
No change earned. May claim a change ("and that's how I
learned…") but the claim isn't grounded in the essay.
3: |
A small change. The writer moves a step — but the step feels
like one a teacher told them they'd make, not one they had to.
5: |
A real change. By the last sentence the writer notices
something they couldn't have noticed at the start. The reader
can feel the shift.
## pace
1: |
  25%+ could be cut without losing anything. Sentences exist
  to fill word count or to serve as transitions: "Furthermore,"
  "in conclusion," padding adverbs.
3: |
Reads cleanly but has 1–2 paragraphs that could shrink. No
obvious filler, but no unusual economy either.
5: |
Every sentence pulls weight. Cut anything and you'd lose
something specific. Read it twice; the second time you
notice why each sentence is exactly as long as it is.
## ending
1: |
Peters out into "and that taught me…" or "I will carry this
with me forever." Generic moral. Reads as if the writer ran
out of word count.
3: |
The ending circles back to an image from earlier. Earned, but
expected — the move you can see coming three sentences out.
5: |
The last sentence lands. Either it surprises (a different
register, a smaller scale, a thing only this writer would say)
or it makes you re-read the first sentence with new eyes. Or both.
essay_id,voice,specificity,arc,pace,ending,total,my_gut_before,note
e01-eagle-scout,2,3,2,3,1,11,"high (i liked it)","my gut was wrong. it's well-written but generic. 'I learned leadership' three times in 200 words. SURPRISE 1."
e02-grandmother-recipe,5,5,4,4,5,23,"high","gut and rubric agree. specific dish, named grandmother, real moment, ending lands."
e03-summer-job-coffee-shop,3,4,4,3,4,18,"medium","slightly underrates this. the coffee shop detail does work."
e04-soccer-game-loss,2,3,3,4,2,14,"medium","gut said this was OK. rubric flags voice and ending. correctly."
e05-violin-practice,4,5,5,3,4,21,"high","gut and rubric agree. very specific. arc is real (he learns the violin is hers, not his)."
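Finding Surprise 1 in a table like this is mechanical: flag every row where the gut call and the rubric total point in opposite directions. A minimal sketch, assuming the column names above; `find_surprises` is my name, and the inline table is abbreviated to two rows for brevity.

```python
import csv
import io

# Abbreviated copy of the score table above (same column names).
TABLE = """essay_id,voice,specificity,arc,pace,ending,total,my_gut_before
e01-eagle-scout,2,3,2,3,1,11,high
e02-grandmother-recipe,5,5,4,4,5,23,high
"""

def find_surprises(table: str) -> list[str]:
    """Return essay ids where gut and rubric disagree about shipping."""
    out = []
    for row in csv.DictReader(io.StringIO(table)):
        rubric_ships = int(row["total"]) >= 18      # the rubric's ship threshold
        gut_ships = row["my_gut_before"] == "high"  # my pre-scoring call
        if rubric_ships != gut_ships:               # disagreement = a surprise
            out.append(row["essay_id"])
    return out

print(find_surprises(TABLE))  # ['e01-eagle-scout']
```

Every id this prints is an essay where you have to decide who's right — your rubric or your gut.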
## essay e01-eagle-scout (the one my gut got wrong)
excerpt:
"Becoming an Eagle Scout taught me leadership. I learned the
value of perseverance through the many challenges I faced.
Leading my troop through our final service project, I learned
that true leadership is not about commanding others, but about
serving them. I will carry these lessons with me as I begin
my journey at college."
## why my gut said HIGH (before scoring)
- Eagle Scout is impressive
- the writing is grammatical
- the moral is correct
- it's "the kind of essay that wins"
## why the rubric says LOW (11/25)
voice (2/5):
"I learned the value of perseverance" — this is the most
common sentence in college applications. AI generates this
by default. The author is invisible.
specificity (3/5):
"Eagle Scout" is one specific. Then nothing. No project named,
no troop member named, no specific moment from the service
project. The Eagle Scout fact is the only detail.
arc (2/5):
Claims an arc ("I learned"). No actual change shown. The
writer at the end is the writer at the beginning + one
completed merit badge.
pace (3/5):
  No obvious filler, but no economy either. "I will carry these
  lessons with me as I begin my journey at college" — the
  last sentence is exactly the closing line of 60% of these.
ending (1/5):
"I will carry these lessons with me as I begin my journey
at college." This is the lowest-scoring ending possible.
It says nothing only this writer could say.
## what I learned from scoring my own gut wrong
I had been weighting "is this the kind of essay I'd respect" instead
of "does this essay actually do what good essays do." The Eagle
Scout fact made me afraid to score it harshly. The rubric is what
broke through the fear. I'd ship e02 (grandmother recipe) and
not e01, even though my gut said both were good.
## what surprised me about my own taste

I thought I'd be most surprised by an essay I disliked but the rubric loved. Instead the bigger surprise went the other way: I had been calling essays "good" because they had impressive biographical facts attached (Eagle Scout, varsity captain, study abroad). The rubric — which only scores the writing — flagged those essays as 11/25 or 13/25 while still being grammatically clean. My eye was confusing "this writer has done impressive things" with "this writer wrote an impressive essay." Two different domains. The rubric wouldn't let me confuse them.

The harder thing the rubric showed me: my fifth dimension (ending) is the one I had been letting slide most. Every essay I scored, I instinctively forgave a weak last sentence. Once I had a 1-anchor for ending — "peters out into 'and that taught me'" — I couldn't unsee it. About 70% of college essays I've read in my life have that exact failure. I just hadn't named it.

The rubric is now the thing I'd hand to a friend asking "is my essay any good?" — not because it's the truth, but because it forces them to argue back. That argument is where their taste shows up. Mine certainly did.
Live demo 1: score a piece of work right here
Paste a piece of work. Score it on the five dimensions. The widget computes your total and flags any dimension below 3. This is the same scoring loop you'd use on every essay you read.
Hand scorer
Live demo 2: write an anchor right now
Pick a dimension. Write what a 1, a 3, and a 5 look like. The widget checks for the patterns of a real anchor: it's specific, it's testable, it's not just "more X" or "less X."
Anchor writer
Live demo 3: how consistent are you, really?
Score five short snippets. The widget compares your scores against your own previous scoring and flags whether you're calibrated. Real raters drift across a session — your future judge prompt has to deal with this.
Self-calibration check · score these 5 snippets
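The drift check behind demo 3 can be sketched in a few lines: score the same snippets twice, some time apart, and measure how far the two passes diverge. Everything here is an illustration — the function name `mean_abs_drift`, the scores, and the 0.6 tolerance are all my own choices, not a standard.

```python
# Sketch of a crude self-calibration check: score the same five
# snippets on two different days and measure the average drift.

def mean_abs_drift(pass1: list[int], pass2: list[int]) -> float:
    """Mean absolute difference between two scoring passes."""
    return sum(abs(a - b) for a, b in zip(pass1, pass2)) / len(pass1)

monday  = [3, 4, 2, 5, 3]   # my scores, first pass (hypothetical)
tuesday = [4, 4, 2, 4, 3]   # same snippets, next day

drift = mean_abs_drift(monday, tuesday)
print(f"drift = {drift:.1f}")  # 0.4
print("calibrated" if drift <= 0.6 else "drifting -- tighten your anchors")
```

If your drift is routinely above a point, the anchors aren't doing their job: you're scoring from mood, not from the written rubric.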
What makes this hard
The hardest move is giving a low score to something you like. You'll have a piece of work — maybe one you wrote — that you thought was good. You'll score it 12/25. You will be tempted to fudge it up to 18. Don't. The rubric is a tool to find out where your eye disagrees with your written taste. The disagreement is the gold.
The second hard thing is anchors. Writing an anchor at "5" feels easy. Writing an anchor at "1" feels mean. Skipping the "1" is the most common mistake — and it's exactly the anchor that lets you tell mediocre work apart from genuinely bad work. Be specific about both ends.
The third — and this catches the careful kids — is the temptation to add weights. "Voice should count more than pace." Don't, on v0.1. Equal weights force you to take every dimension seriously. You can add weights in v1.0, after the judge in Project 11 reveals which dimensions actually predict whether you'd ship.
Self-check before you ship
- Domain sentence is narrow enough that nobody else could have written it from scratch.
- Five dimensions, no overlaps; each one asks one sharp question.
- Anchors at 1, 3, AND 5 for every dimension — actually written, not implied.
- Five real (not generated) examples scored by hand.
- I noted at least one essay where my gut and the rubric disagreed, and decided who was right.
- Reflection paragraph names at least one specific thing my eye had been calling good that the rubric flags as not.
Push further · for the harder end of 15+
A scored rubric is the floor. Here's where it becomes the foundation of every later project.
- Inter-rater agreement with a friend. Hand your rubric to a friend. Have them independently score your 5 essays. Compute Cohen's kappa (a real statistical measure of two-rater agreement). If you and your friend agree at κ ≥ 0.6, the rubric is doing real work. If κ < 0.4, the anchors aren't tight enough — go back. This is how professional researchers measure rubric quality.
- Add weights via regression. Score 30 essays. For each, also note: "would I ship this?" (yes/no). Run a logistic regression of "ship" against your 5 dimension scores. The weights tell you which dimensions actually matter for the decision you care about. (You'll be surprised — often, two dimensions dominate.)
- Build a rubric for code, design, or agent behavior. Apply the same 5-dimensions + 1/3/5 anchor pattern to a domain you build IN, not just consume. "What makes a good React component?" "What makes a good error message?" "What makes a good agent refusal?" The transferability of this skill is the actual prize.
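The inter-rater check from the first stretch goal can be computed by hand — no statistics package needed. A minimal sketch of unweighted Cohen's kappa; the two raters' score lists are invented for illustration, and in practice they'd be your 25 per-dimension scores (5 essays × 5 dimensions) from you and your friend.

```python
from collections import Counter

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Unweighted Cohen's kappa between two raters' score lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

me     = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]   # hypothetical scores, rater 1
friend = [3, 4, 3, 5, 3, 4, 5, 2, 3, 4]   # same items, rater 2

k = cohen_kappa(me, friend)
print(f"kappa = {k:.2f}")  # 0.59
print("rubric holds" if k >= 0.6 else
      "anchors too loose" if k < 0.4 else "borderline")
```

One caveat: unweighted kappa treats a 2-vs-3 disagreement the same as a 1-vs-5 one. For ordinal 1–5 scores, a quadratic-weighted kappa (e.g. scikit-learn's `cohen_kappa_score` with `weights="quadratic"`) is the usual refinement.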