Most people have taste. Few people can articulate it. The gap between the two is where this project lives. By the end you'll have a one-page rubric that says, in writing, what makes work good in your chosen domain. It will surprise you.
This is the first Harness Studio project — the project before any system, before any code, before any judge. The judge in Project 11 will only ever be as smart as the rubric you write today. So: write a real one.
The first rubric you write will be wrong. That's fine. Writing it down is what reveals where it's wrong. You will rewrite it three times before it's good. The rewrites are the project.
Step by step
- Pick the domain. Be sharp about it.
Not "good writing." Pick "a 200-word personal essay for a college application" or "a Slack message that gets a thoughtful reply instead of a thumbs-up" or "a code comment future-me will thank past-me for." The narrower the domain, the more honest your rubric will be.
- Pick five dimensions.
Five — not three (too vague), not eight (overlapping). Each dimension is one sharp thing the work either has or doesn't. Each must be checkable by reading the work, not by reading your mind.
- Write anchors at 1, 3, and 5 for each dimension.
Anchors are the calibration. A "5" on voice in your rubric should be unambiguous. Same for "1." This is where most rubrics die — people write the dimensions and skip the anchors, and then nobody (including them) can score anything consistently.
- Score five real examples by hand.
Find five real examples in your domain. Don't ask Claude to generate them. Use real ones — kid magazine essays, sample admissions essays online, your own past work, a friend's draft. Score each against your five dimensions. You'll get two surprises (described below).
- Investigate the two surprises.
Surprise 1: one example you thought was great will score 11/25. Decide who's right — your rubric or your gut. Surprise 2: two examples will score similarly but feel very different — meaning you're missing a dimension. Add it. Re-score everything.
- Write the reflection.
One paragraph: what surprised you about your own taste once you had to defend it on paper? Be honest. The number-one surprise is usually: "I had been calling things good when they didn't actually score well on my own rubric. My eye was being lazy." That sentence is the goal of this project.
A complete worked example, every file
The full rubric for "a 200-word personal essay for a college application." Every dimension, every anchor, five real-essay scores, and the reflection that surprised the author.
version: 1.0
domain: |
A 200-word personal essay for a US college application,
written by a high school senior, on a self-chosen topic.
Written for an admissions reader who reads ~80 of these per day.
dimensions:
- id: voice
weight: 1.0
asks: |
Could only this writer have written this? Or could any
thoughtful person?
- id: specificity
weight: 1.0
asks: |
Does the essay name a specific thing — a teacher, a moment,
a smell — that grounds it? Or does it speak in abstractions?
- id: arc
weight: 1.0
asks: |
Does the writer change between the first sentence and the
last? Does the change feel earned, or claimed?
- id: pace
weight: 1.0
asks: |
Are there sentences that aren't pulling weight? Could 25%
be cut without losing anything?
- id: ending
weight: 1.0
asks: |
Does the last sentence land — or peter out into "and so I
learned" platitudes?
scoring:
scale: 1-5 per dimension
total: 5-25
ship_threshold: >= 18 (with no dimension below 3)
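The ship rule in the scoring block can be sketched as a one-function check. This is a minimal sketch; the function name `should_ship` and the example scores are mine, not part of the rubric file — only the threshold (≥ 18, no dimension below 3) comes from the rubric.

```python
# Sketch of the ship rule: total >= 18 AND no dimension below 3.

def should_ship(scores: dict[str, int]) -> bool:
    """Apply the rubric's ship threshold to one essay's scores."""
    total = sum(scores.values())
    return total >= 18 and min(scores.values()) >= 3

# Hypothetical essays, scored 1-5 on each of the five dimensions.
solid = {"voice": 4, "specificity": 5, "arc": 4, "pace": 3, "ending": 4}
print(should_ship(solid))        # total 20, floor 3 -> True

weak_ending = {"voice": 5, "specificity": 5, "arc": 4, "pace": 4, "ending": 2}
print(should_ship(weak_ending))  # total 20, but ending < 3 -> False
```

The second example is the whole point of the "no dimension below 3" clause: a high total can still hide one disqualifying weakness.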
## voice
1: |
Reads like a competent stranger. AI could have written this.
Vocabulary generic — 'passionate', 'journey', 'taught me'.
The author is invisible. You finish the essay knowing nothing
about who wrote it.
3: |
Has a tic — a sentence rhythm or word choice that's the
writer's. But you'd have to look hard to find it. Mostly
the voice is the textbook voice of a competent applicant.
5: |
Could only have been written by this person. The voice is
unmistakable from sentence one. Vocabulary is specific to the
writer's world. You finish the essay and you'd recognize their
next paragraph anywhere.
## specificity
1: |
All abstractions. "I learned the value of perseverance from
my time on the soccer team." No moments, no people named,
no smells, no objects.
3: |
One specific detail, but treated as decoration, not load-
bearing. The detail is mentioned and then dropped.
5: |
The whole essay turns on a specific moment, sensory detail,
or named person. The detail does the work. Take the detail
away and the essay collapses.
## arc
1: |
Static. The writer at the end is the writer at the beginning.
No change earned. May claim a change ("and that's how I
learned…") but the claim isn't grounded in the essay.
3: |
A small change. The writer moves a step — but the step feels
like one a teacher told them they'd make, not one they had to.
5: |
A real change. By the last sentence the writer notices
something they couldn't have noticed at the start. The reader
can feel the shift.
## pace
1: |
  25%+ could be cut without losing anything. Sentences exist
  to fill word count or to serve as transitions: "Furthermore,"
  "in conclusion," padding adverbs.
3: |
Reads cleanly but has 1–2 paragraphs that could shrink. No
obvious filler, but no unusual economy either.
5: |
Every sentence pulls weight. Cut anything and you'd lose
something specific. Read it twice; the second time you
notice why each sentence is exactly as long as it is.
## ending
1: |
Peters out into "and that taught me…" or "I will carry this
with me forever." Generic moral. Reads as if the writer ran
out of word count.
3: |
The ending circles back to an image from earlier. Earned, but
expected — the move you can see coming three sentences out.
5: |
The last sentence lands. Either it surprises (a different
register, a smaller scale, a thing only this writer would say)
or it makes you re-read the first sentence with new eyes. Or both.
essay_id,voice,specificity,arc,pace,ending,total,my_gut_before,note
e01-eagle-scout,2,3,2,3,1,11,"high (i liked it)","my gut was wrong. it's well-written but generic. 'I learned leadership' three times in 200 words. SURPRISE 1."
e02-grandmother-recipe,5,5,4,4,5,23,"high","gut and rubric agree. specific dish, named grandmother, real moment, ending lands."
e03-summer-job-coffee-shop,3,4,4,3,4,18,"medium","slightly underrates this. the coffee shop detail does work."
e04-soccer-game-loss,2,3,3,4,2,14,"medium","gut said this was OK. rubric flags voice and ending. correctly."
e05-violin-practice,4,5,5,3,4,21,"high","gut and rubric agree. very specific. arc is real (he learns the violin is hers, not his)."
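Finding Surprise 1 in a table like this is mechanical: flag every row where the gut call and the rubric total point in opposite directions. A minimal sketch, assuming the column names above; `find_surprises` is my name, and the inline table is abbreviated to two rows for brevity.

```python
import csv
import io

# Abbreviated copy of the score table above (same column names).
TABLE = """essay_id,voice,specificity,arc,pace,ending,total,my_gut_before
e01-eagle-scout,2,3,2,3,1,11,high
e02-grandmother-recipe,5,5,4,4,5,23,high
"""

def find_surprises(table: str) -> list[str]:
    """Return essay ids where gut and rubric disagree about shipping."""
    out = []
    for row in csv.DictReader(io.StringIO(table)):
        rubric_ships = int(row["total"]) >= 18      # the rubric's ship threshold
        gut_ships = row["my_gut_before"] == "high"  # my pre-scoring call
        if rubric_ships != gut_ships:               # disagreement = a surprise
            out.append(row["essay_id"])
    return out

print(find_surprises(TABLE))  # ['e01-eagle-scout']
```

Every id this prints is an essay where you have to decide who's right — your rubric or your gut.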
## essay e01-eagle-scout (the one my gut got wrong)
excerpt:
"Becoming an Eagle Scout taught me leadership. I learned the
value of perseverance through the many challenges I faced.
Leading my troop through our final service project, I learned
that true leadership is not about commanding others, but about
serving them. I will carry these lessons with me as I begin
my journey at college."
## why my gut said HIGH (before scoring)
- Eagle Scout is impressive
- the writing is grammatical
- the moral is correct
- it's "the kind of essay that wins"
## why the rubric says LOW (11/25)
voice (2/5):
"I learned the value of perseverance" — this is the most
common sentence in college applications. AI generates this
by default. The author is invisible.
specificity (3/5):
"Eagle Scout" is one specific. Then nothing. No project named,
no troop member named, no specific moment from the service
project. The Eagle Scout fact is the only detail.
arc (2/5):
Claims an arc ("I learned"). No actual change shown. The
writer at the end is the writer at the beginning + one
completed merit badge.
pace (3/5):
  No obvious filler, but no economy either. "I will carry these
  lessons with me as I begin my journey at college" — the
  last sentence is exactly the closing line of 60% of these.
ending (1/5):
"I will carry these lessons with me as I begin my journey
at college." This is the lowest-scoring ending possible.
It says nothing only this writer could say.
## what I learned from scoring my own gut wrong
I had been weighting "is this the kind of essay I'd respect" instead
of "does this essay actually do what good essays do." The Eagle
Scout fact made me afraid to score it harshly. The rubric is what
broke through the fear. I'd ship e02 (grandmother recipe) and
not e01, even though my gut said both were good.
## what surprised me about my own taste

I thought I'd be most surprised by an essay I disliked but the rubric loved. Instead the bigger surprise went the other way: I had been calling essays "good" because they had impressive biographical facts attached (Eagle Scout, varsity captain, study abroad). The rubric — which only scores the writing — flagged those essays as 11/25 or 13/25 while still being grammatically clean. My eye was confusing "this writer has done impressive things" with "this writer wrote an impressive essay." Two different domains. The rubric wouldn't let me confuse them.

The harder thing the rubric showed me: my fifth dimension (ending) is the one I had been letting slide most. Every essay I scored, I instinctively forgave a weak last sentence. Once I had a 1-anchor for ending — "peters out into 'and that taught me'" — I couldn't unsee it. About 70% of college essays I've read in my life have that exact failure. I just hadn't named it.

The rubric is now the thing I'd hand to a friend asking "is my essay any good?" — not because it's the truth, but because it forces them to argue back. That argument is where their taste shows up. Mine certainly did.
Live demo 1: score a piece of work right here
Paste a piece of work. Score it on the five dimensions. The widget computes your total and flags any dimension below 3. This is the same scoring loop you'd use on every essay you read.
Hand scorer
Live demo 2: write an anchor right now
Pick a dimension. Write what a 1, a 3, and a 5 look like. The widget checks for the patterns of a real anchor: it's specific, it's testable, it's not just "more X" or "less X."
Anchor writer
Live demo 3: how consistent are you, really?
Score five short snippets. The widget compares your scores against your own previous scoring and flags whether you're calibrated. Real raters drift across a session — your future judge prompt has to deal with this.
Self-calibration check · score these 5 snippets
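The drift check behind demo 3 can be sketched in a few lines: score the same snippets twice, some time apart, and measure how far the two passes diverge. Everything here is an illustration — the function name `mean_abs_drift`, the scores, and the 0.6 tolerance are all my own choices, not a standard.

```python
# Sketch of a crude self-calibration check: score the same five
# snippets on two different days and measure the average drift.

def mean_abs_drift(pass1: list[int], pass2: list[int]) -> float:
    """Mean absolute difference between two scoring passes."""
    return sum(abs(a - b) for a, b in zip(pass1, pass2)) / len(pass1)

monday  = [3, 4, 2, 5, 3]   # my scores, first pass (hypothetical)
tuesday = [4, 4, 2, 4, 3]   # same snippets, next day

drift = mean_abs_drift(monday, tuesday)
print(f"drift = {drift:.1f}")  # 0.4
print("calibrated" if drift <= 0.6 else "drifting -- tighten your anchors")
```

If your drift is routinely above a point, the anchors aren't doing their job: you're scoring from mood, not from the written rubric.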
What makes this hard
The hardest move is giving a low score to something you like. You'll have a piece of work — maybe one you wrote — that you thought was good. You'll score it 12/25. You will be tempted to fudge it up to 18. Don't. The rubric is a tool to find out where your eye disagrees with your written taste. The disagreement is the gold.
The second hard thing is anchors. Writing an anchor at "5" feels easy. Writing an anchor at "1" feels mean. Skipping the "1" is the most common mistake — and it's exactly the anchor that lets you tell mediocre work apart from genuinely bad work. Be specific about both ends.
The third — and this catches the careful kids — is the temptation to add weights. "Voice should count more than pace." Don't, on v0.1. Equal weights force you to take every dimension seriously. You can add weights in v1.0, after the judge in Project 11 reveals which dimensions actually predict whether you'd ship.
Self-check before you ship
- Domain sentence is narrow enough that nobody else could have written it from scratch.
- Five dimensions, no overlaps; each one asks one sharp question.
- Anchors at 1, 3, AND 5 for every dimension — actually written, not implied.
- Five real (not generated) examples scored by hand.
- I noted at least one essay where my gut and the rubric disagreed, and decided who was right.
- Reflection paragraph names at least one specific thing my eye had been calling good that the rubric flags as not.
Push further · for the harder end of 15+
A scored rubric is the floor. Here's where it becomes the foundation of every later project.
- Inter-rater agreement with a friend. Hand your rubric to a friend. Have them independently score your 5 essays. Compute Cohen's kappa (a real statistical measure of two-rater agreement). If you and your friend agree at κ ≥ 0.6, the rubric is doing real work. If κ < 0.4, the anchors aren't tight enough — go back. This is how professional researchers measure rubric quality.
- Add weights via regression. Score 30 essays. For each, also note: "would I ship this?" (yes/no). Run a logistic regression of "ship" against your 5 dimension scores. The weights tell you which dimensions actually matter for the decision you care about. (You'll be surprised — often, two dimensions dominate.)
- Build a rubric for code, design, or agent behavior. Apply the same 5-dimensions + 1/3/5 anchor pattern to a domain you build IN, not just consume. "What makes a good React component?" "What makes a good error message?" "What makes a good agent refusal?" The transferability of this skill is the actual prize.
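The inter-rater check from the first stretch goal can be computed by hand — no statistics package needed. A minimal sketch of unweighted Cohen's kappa; the two raters' score lists are invented for illustration, and in practice they'd be your 25 per-dimension scores (5 essays × 5 dimensions) from you and your friend.

```python
from collections import Counter

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Unweighted Cohen's kappa between two raters' score lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

me     = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]   # hypothetical scores, rater 1
friend = [3, 4, 3, 5, 3, 4, 5, 2, 3, 4]   # same items, rater 2

k = cohen_kappa(me, friend)
print(f"kappa = {k:.2f}")  # 0.59
print("rubric holds" if k >= 0.6 else
      "anchors too loose" if k < 0.4 else "borderline")
```

One caveat: unweighted kappa treats a 2-vs-3 disagreement the same as a 1-vs-5 one. For ordinal 1–5 scores, a quadratic-weighted kappa (e.g. scikit-learn's `cohen_kappa_score` with `weights="quadratic"`) is the usual refinement.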