Academies Harness Studio ⚙️ Builders · Module 01

What does good even mean?

a seven-step walkthrough · the Harness Studio opener · about forty minutes

⚠ The temptation

Borrow someone else's rubric — there are dozens online.

But borrowed rubrics produce borrowed taste, and borrowed taste is no taste at all. Write the rubric yourself. Even if it's wrong at first. Especially if it's wrong at first.

Step 01 · The paradox

Welcome to the hardest academy. It's also the smallest.

Harness Studio is the fourth and final academy in Kindling. It has only four modules. It's only for Builders. And every module in it wrestles with one problem that the other three academies quietly stepped around: how do you turn taste into something you can check?

In Skills Workshop you learned to make Skills. In Code Club you learned to build real tools. In Agent Lab you learned to design agents that treat people well. All three academies assumed you'd know, somehow, whether the thing you made was good. Harness Studio is the academy where that assumption finally gets unpacked — because "good" turns out to be a much harder word than it looks.

The measurement paradox

You can't improve what you can't measure. But the moment you measure it, you change what people are actually working on.

If you don't measure "good," you can't tell if you're getting better or worse — you're flying blind. So you write down a rubric: five criteria, each with a score. Now you can measure. Great.

Except now everyone working on the thing starts optimizing for the five criteria on the rubric, not for the actual good thing. Your rubric has quietly replaced the real goal with its own measurable shadow. Goodhart's Law again, from Agent Lab Builders 03: when a measure becomes a target, it stops being a good measure.

This is the paradox. You have to measure. The moment you measure, you corrupt what you're measuring. Harness Studio exists because a small number of builders have figured out how to do this carefully anyway.

A "harness" is the engineering word for a system that keeps another system honest. A test harness catches bugs. A safety harness catches climbers. A taste harness catches your own drift when your standards slip and you don't notice. This whole academy is about building one of those — for yourself, for your team, and eventually for AI that's doing things too fast for any human to personally check.

Step 02 · Two ways to judge

There are two ways to say "that's good." Neither alone is enough.

When someone asks "is this any good?", there are really only two tools people use — and they're opposites. Both are useful. Both fail in predictable ways. The whole craft of Harness Studio is knowing when to use which one, and then combining them so the weaknesses cancel out.

Method 01 · The vibe check

You look at the thing, take it in for a moment, and your gut tells you whether it's good or not. No list. No score. Just a feeling that comes from your accumulated taste — the sum of everything you've ever seen, read, liked, and disliked.

Good at: catching things a rubric would miss. The weird, the novel, the "I can't name it but it's wrong" moments. Most masters rely on vibe checks in their own field.
Bad at: being shareable. You can't hand your vibe to a teammate. You can't teach it directly. And your vibe slowly drifts without you noticing — which means even you can't always trust it over time.
Method 02 · The rubric

You write down the criteria that make something good — three to five specific things, each with a way to score it. Now everyone can check the same thing the same way, including Future You who might have forgotten what mattered.

Good at: being shared, taught, and trusted over time. A good rubric turns taste into a repeatable process. It lets a team agree on "good" even when the members have different guts.
Bad at: seeing anything outside its own categories. People stop judging the thing and start judging the rubric. Goodhart's Law moves in and the measurement starts to replace the thing being measured.

The trick is to use both, in sequence. Vibe check first, then rubric. If your gut says something's off and the rubric says it's fine, trust your gut and find out what the rubric is missing. If the rubric says something's off and your gut says it's fine, trust the rubric and find out what your taste is drifting toward. The two methods catch each other's mistakes — that's what makes them a harness instead of either one alone.
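The sequencing above can be sketched as a tiny decision procedure. Everything here is illustrative — `vibe_check`, `rubric_score`, and the `THRESHOLD` cutoff are assumed names standing in for a human gut and a real rubric, not part of any library:

```python
THRESHOLD = 0.6  # assumed cutoff for "good enough" on either method

def harness(work, vibe_check, rubric_score):
    """Run both judgments and surface disagreements instead of hiding them."""
    vibe_ok = vibe_check(work) >= THRESHOLD      # gut feeling, 0.0-1.0
    rubric_ok = rubric_score(work) >= THRESHOLD  # weighted criteria, 0.0-1.0

    if vibe_ok and rubric_ok:
        return "pass"
    if not vibe_ok and not rubric_ok:
        return "fail"
    if rubric_ok and not vibe_ok:
        # Gut says off, rubric says fine: find what the rubric is missing.
        return "investigate rubric"
    # Rubric says off, gut says fine: find where your taste is drifting.
    return "investigate taste"

# Usage: stub judges standing in for the two methods.
verdict = harness("draft paragraph",
                  vibe_check=lambda w: 0.4,    # gut: something feels off
                  rubric_score=lambda w: 0.8)  # rubric: looks fine
print(verdict)  # → investigate rubric
```

The point of the sketch is the two "investigate" branches: a disagreement is never resolved by averaging — it's a signal that one of the two methods has a blind spot worth finding.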

Step 03 · Pick your target

Pick something you'd actually want to judge for quality.

Rubrics are only useful when they're about something specific. "Is this good writing?" is too vague — good writing for a school essay is different from good writing for a text to a friend. Pick one specific target and we'll build a real rubric for it.

🛠️A Claude Skill
📖A short story
⚙️Some code
🤖An agent response
🎨A drawing
📝A homework essay
You'll build a rubric for whichever target you pick. Specific targets make good rubrics.

Notice that each of these targets has a different kind of "good." A Claude Skill needs to be correct AND reusable. A short story needs to be surprising AND emotionally true. A drawing needs to have its own voice, not match a template. The rubric has to be designed for the specific thing — one universal "quality rubric" is a fantasy.

Step 04 · Four parts

A real rubric has four parts.

Most "rubrics" you see in school or at work are missing two or three of these four parts. That's why they're so easy to game. A real rubric — one that can actually catch bad work without getting itself gamed — has all four.

The rubric anatomy

Four parts, in order. Skip any one and it breaks.

Part 1 · The target

What specific thing are you judging? Not "writing" — "an opening paragraph of a short story for a 13-year-old reader." The more specific the target, the less room for the rubric to drift.

Target: "opening paragraph of a short story for a 13-year-old"
Part 2 · The criteria

Three to five specific things that matter for this target. Not "is it good" — named properties you can point to. Each criterion should be checkable even by someone whose taste is different from yours.

Criteria: • Specific imagery · • Introduces a voice · • Earns curiosity · • Avoids cliché
Part 3 · The weights

Not every criterion matters equally. "Avoids cliché" might matter half as much as "earns curiosity." Assigning weights forces you to be honest about what really counts — no hiding behind a big list of equal criteria.

Weights: Imagery (2) · Voice (3) · Curiosity (3) · Cliché-free (1) — total: 9
Part 4 · The anti-rubric

The most-skipped part, and the most important. For each criterion, write down: what would a bad actor do to game this exact criterion? Then watch for that behavior. This is the only defense against Goodhart's Law — and without it, every rubric eventually decays.

Anti-rubric: "If someone stuffed imagery in just to pad the score, catch it — imagery only counts when it earns its place."

Without the anti-rubric, your rubric will, over time, stop measuring the good thing and start measuring the gameable shadow of the good thing. Everyone optimizes for the gameable shadow. The real thing slowly disappears, and the score keeps going up. The anti-rubric is how you catch this happening — it's a note to your future self: if you ever see this pattern, stop and think. It's a built-in alarm for your own taste drift.
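The four parts above can be sketched as a data structure, using the short-story example from this step. Everything here is a minimal illustration, not a required schema — and the anti-rubric notes beyond the imagery one are invented examples, not from the module:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    weight: int        # 1-5, as in Step 05
    anti_rubric: str   # how a bad actor would game this exact criterion

@dataclass
class Rubric:
    target: str
    criteria: list[Criterion] = field(default_factory=list)

    def score(self, marks: dict[str, float]) -> float:
        """Weighted score in 0.0-1.0. `marks` maps criterion name to 0.0-1.0."""
        total = sum(c.weight for c in self.criteria)
        earned = sum(c.weight * marks.get(c.name, 0.0) for c in self.criteria)
        return earned / total

story_rubric = Rubric(
    target="opening paragraph of a short story for a 13-year-old",
    criteria=[
        Criterion("Specific imagery", 2,
                  "stuffing imagery in just to pad the score"),
        Criterion("Introduces a voice", 3,
                  "mimicking a famous author's tics instead of having a voice"),
        Criterion("Earns curiosity", 3,
                  "a cliffhanger gimmick with no real question behind it"),
        Criterion("Avoids cliché", 1,
                  "swapping clichés for thesaurus words that mean the same thing"),
    ],
)

# Weights total 2 + 3 + 3 + 1 = 9, matching the example weights above.
marks = {"Specific imagery": 1.0, "Introduces a voice": 1.0,
         "Earns curiosity": 0.5, "Avoids cliché": 1.0}
print(round(story_rubric.score(marks), 2))  # → 0.83
```

Notice that the anti-rubric rides along with each criterion rather than living in a separate document — that's what makes it hard to skip.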

Step 05 · Build yours

Build a real rubric for your target.

Fill in the fields below. Add three to five criteria (three is often better than five). Give each a weight from 1 to 5. For each criterion, write one anti-rubric note describing how it could be gamed.


Three criteria is almost always better than five. Each criterion you add makes the rubric harder to use and easier to game. The master's move isn't adding more — it's picking the three that matter most and being honest about their weights.
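As a sketch, a completed `rubric.yaml` for the short-story target from Step 04 might look like this. The field names and structure are illustrative, not a format the studio requires, and the anti-rubric lines beyond the imagery one are invented examples:

```yaml
# rubric.yaml — a filled-in sketch; schema and field names are illustrative
target: opening paragraph of a short story for a 13-year-old
criteria:
  - name: Specific imagery
    weight: 2
    anti_rubric: stuffed imagery only pads the score; it must earn its place
  - name: Introduces a voice
    weight: 3
    anti_rubric: borrowed tics from a famous author are not a voice
  - name: Earns curiosity
    weight: 3
    anti_rubric: a gimmick cliffhanger is not a real question
  - name: Avoids cliché
    weight: 1
    anti_rubric: thesaurus swaps don't un-cliché an idea
```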

Step 06 · Cargo-cult rubrics

Most rubrics in the world are cargo cult.

A cargo-cult rubric is one that looks like a rubric — criteria, points, maybe a spreadsheet — but doesn't actually measure quality. It measures whether you filled out the form. You'll judge two rounds of examples to see the difference.

Round 1. Two rubrics for judging a Claude Skill. Which is actually useful?

Round 2. A homework essay rubric. Which is a better design?

Every rubric you write should pass this question: "If someone tried to game this on purpose, would they end up with something good, or something fake?" If they'd end up with something fake, the rubric is cargo cult, and you should burn it and start over. The goal of a rubric isn't to look rigorous. It's to make fake work impossible — and to do that, you have to think about fake work in advance.

Step 07 · You did it
⚖️

You now know what a real rubric looks like.

Which is rarer than you think. Most rubrics in the world are cargo-cult forms. You can now tell the difference at a glance — and build the real kind when you need to.

What you just learned

  • The measurement paradox: you can't improve what you can't measure, but the moment you measure it, you change what people optimize for.
  • Two ways to judge: vibe check (fast, untransferable) and rubric (shareable, gameable). Use both in sequence.
  • A harness is a system that keeps another system honest — including your own drifting taste.
  • A real rubric has four parts: target, criteria, weights, and anti-rubric. Skip any one and it breaks.
  • The anti-rubric is the most-skipped part — and the only defense against Goodhart's Law.
  • Three criteria usually beat five. Rubric inflation is a design mistake, not rigor.
  • The harness test: "if someone gamed this on purpose, would they end up with something good or something fake?"

In Module 02, you'll take this one step further: you'll build an actual judge — a runnable system that applies a rubric to real work automatically. This is where Harness Studio starts to feel like magic, because your rubric stops being a document and becomes a living check you can run on hundreds of things at once.

★ Before you call it done

Three questions. Same three. Every time.

These are the same three questions for every module in Kindling. They are how you check whether AI did the part it should and you did the part only you could.


This is yours. Ship it.