Welcome to the hardest academy. It's also the smallest.
Harness Studio is the fourth and final academy in Kindling. It has only four modules. It's only for Builders. And every module in it wrestles with one problem that the other three academies quietly stepped around: how do you turn taste into something you can check?
In Skills Workshop you learned to make Skills. In Code Club you learned to build real tools. In Agent Lab you learned to design agents that treat people well. All three academies assumed you'd know, somehow, whether the thing you made was good. Harness Studio is the academy where that assumption finally gets unpacked — because "good" turns out to be a much harder word than it looks.
You can't improve what you can't measure. But the moment you measure it, you change what people are actually working on.
If you don't measure "good," you can't tell if you're getting better or worse — you're flying blind. So you write down a rubric: five criteria, each with a score. Now you can measure. Great.
Except now everyone working on the thing starts optimizing for the five criteria on the rubric, not for the actual good thing. Your rubric has quietly replaced the real goal with its own measurable shadow. Goodhart's Law again, from Agent Lab Builders 03: when a measure becomes a target, it stops being a good measure.
This is the paradox. You have to measure. The moment you measure, you corrupt what you're measuring. Harness Studio exists because a small number of builders have figured out how to do this carefully anyway.
A "harness" is the engineering word for a system that keeps another system honest. A test harness catches bugs. A safety harness catches climbers. A taste harness catches your own drift when your standards slip and you don't notice. This whole academy is about building one of those — for yourself, for your team, and eventually for AI that's doing things too fast for any human to personally check.
There are two ways to say "that's good." Neither alone is enough.
When someone asks "is this any good?", there are really only two tools people use — and they're opposites. Both are useful. Both fail in predictable ways. The whole craft of Harness Studio is knowing when to use which one, and then combining them so the weaknesses cancel out.
The vibe check
You look at the thing, take it in for a moment, and your gut tells you whether it's good or not. No list. No score. Just a feeling that comes from your accumulated taste — the sum of everything you've ever seen, read, liked, and disliked.
The rubric
You write down the criteria that make something good — three to five specific things, each with a way to score it. Now everyone can check the same thing the same way, including Future You who might have forgotten what mattered.
The trick is to use both, in sequence. Vibe check first, then rubric. If your gut says something's off and the rubric says it's fine, trust your gut and find out what the rubric is missing. If the rubric says something's off and your gut says it's fine, trust the rubric and find out what your taste is drifting toward. The two methods catch each other's mistakes — that's what makes them a harness instead of either one alone.
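The two-step harness above can be sketched as a tiny program. This is an illustrative sketch, not part of the lesson: the function names, the 0-to-1 score, and the pass threshold are all my own framing of the "vibe check first, then rubric" sequence.

```python
# A minimal sketch of the vibe-check-then-rubric harness.
# Names, the score scale, and the threshold are illustrative assumptions.

def harness(work, vibe_check, rubric_score, threshold=0.7):
    """Run both judgments in sequence and flag disagreements.

    vibe_check(work)   -> True if your gut says it's good
    rubric_score(work) -> a weighted score from 0 to 1
    """
    gut_ok = vibe_check(work)
    score_ok = rubric_score(work) >= threshold

    if gut_ok and score_ok:
        return "pass"
    if not gut_ok and score_ok:
        # Gut says off, rubric says fine: the rubric is missing something.
        return "trust your gut: find what the rubric is missing"
    if gut_ok and not score_ok:
        # Rubric says off, gut says fine: your taste is drifting.
        return "trust the rubric: find what your taste is drifting toward"
    return "fail"

# Example: a draft the rubric likes but the gut rejects.
verdict = harness("draft", vibe_check=lambda w: False,
                  rubric_score=lambda w: 0.9)
print(verdict)  # -> "trust your gut: find what the rubric is missing"
```

The point of the disagreement branches is that neither judge gets to overrule the other silently: a mismatch always produces an instruction to investigate.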
Pick something you'd actually want to judge for quality.
Rubrics are only useful when they're about something specific. "Is this good writing?" is too vague — good writing for a school essay is different from good writing for a text to a friend. Pick one specific target and we'll build a real rubric for it.
Notice that each of these targets has a different kind of "good." A Claude Skill needs to be correct AND reusable. A short story needs to be surprising AND emotionally true. A drawing needs to have its own voice, not match a template. The rubric has to be designed for the specific thing — one universal "quality rubric" is a fantasy.
A real rubric has four parts.
Most "rubrics" you see in school or at work are missing two or three of these four parts. That's why they're so easy to game. A real rubric — one that can actually catch bad work without getting itself gamed — has all four.
Four parts, in order. Skip any one and it breaks.
The target
What specific thing are you judging? Not "writing" — "an opening paragraph of a short story for a 13-year-old reader." The more specific the target, the less room for the rubric to drift.
The criteria
Three to five specific things that matter for this target. Not "is it good" — named properties you can point to. Each criterion should be checkable even by someone whose taste is different from yours.
The weights
Not every criterion matters equally. "Avoids cliché" might matter half as much as "earns curiosity." Assigning weights forces you to be honest about what really counts — no hiding behind a big list of equal criteria.
The anti-rubric
The most-skipped part, and the most important. For each criterion, write down: what would a bad actor do to game this exact criterion? Then watch for that behavior. This is the only defense against Goodhart's Law — and without it, every rubric eventually decays.
Without the anti-rubric, your rubric will, over time, stop measuring the good thing and start measuring the gameable shadow of the good thing. Everyone optimizes for the gameable shadow. The real thing slowly disappears, and the score keeps going up. The anti-rubric is how you catch this happening — it's a note to your future self: if you ever see this pattern, stop and think. It's a built-in alarm for your own taste drift.
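The four parts fit naturally into a small data structure. This is a hypothetical sketch: the field names, the example criteria, and the specific weights are my own, though the weights follow the lesson's example of "avoids cliché" counting half as much as "earns curiosity."

```python
# A minimal sketch of the four-part rubric as a data structure.
# Field names and the example criteria/weights are illustrative.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str          # a named, checkable property
    weight: int        # 1..5: how much it really counts
    anti_rubric: str   # how a bad actor would game this exact criterion

@dataclass
class Rubric:
    target: str        # the specific thing being judged
    criteria: list     # three to five Criterion objects

    def score(self, marks: dict) -> float:
        """Weighted score from 0 to 1, given per-criterion marks from 0 to 1."""
        total = sum(c.weight for c in self.criteria)
        return sum(c.weight * marks[c.name] for c in self.criteria) / total

opening = Rubric(
    target="an opening paragraph of a short story for a 13-year-old reader",
    criteria=[
        Criterion("earns curiosity", 4,
                  "gamed by a cliffhanger the story never pays off"),
        Criterion("clear point of view", 3,
                  "gamed by stamping 'I' onto every sentence"),
        Criterion("avoids cliche", 2,
                  "gamed by swapping cliches for thesaurus words"),
    ],
)

marks = {"earns curiosity": 1.0, "clear point of view": 0.8, "avoids cliche": 0.5}
print(round(opening.score(marks), 2))  # -> 0.82
```

Notice that the anti-rubric rides along with each criterion instead of living in a separate document: the alarm is attached to the thing it guards.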
Build a real rubric for your target.
Fill in the fields below. Add three to five criteria (three is often better than five). Give each a weight from 1 to 5. Write one anti-rubric note describing how the criterion could be gamed. The rubric card on the right updates live.
Untitled rubric
↳ for something specific
Three criteria is almost always better than five. Each criterion you add makes the rubric harder to use and easier to game. The master's move isn't adding more — it's picking the three that matter most and being honest about their weights.
Most rubrics in the world are cargo cult.
A cargo-cult rubric is one that looks like a rubric — criteria, points, maybe a spreadsheet — but doesn't actually measure quality. It measures whether you filled out the form. Test your eye with two rounds of judgment.
Round 1. Two rubrics for judging a Claude Skill. Which is actually useful?
Round 2. A homework essay rubric. Which is a better design?
Every rubric you write should pass this question: "If someone tried to game this on purpose, would they end up with something good, or something fake?" If they'd end up with something fake, the rubric is cargo cult, and you should burn it and start over. The goal of a rubric isn't to look rigorous. It's to make fake work impossible — and to do that, you have to think about fake work in advance.
You now know what a real rubric looks like.
Which is rarer than you think. Most rubrics in the world are cargo-cult forms. You can now tell the difference at a glance — and build the real kind when you need to.
What you just learned
- The measurement paradox: you can't improve what you can't measure, but the moment you measure it, you change what people optimize for.
- Two ways to judge: vibe check (fast, untransferable) and rubric (shareable, gameable). Use both in sequence.
- A harness is a system that keeps another system honest — including your own drifting taste.
- A real rubric has four parts: target, criteria, weights, and anti-rubric. Skip any one and it breaks.
- The anti-rubric is the most-skipped part — and the only defense against Goodhart's Law.
- Three criteria usually beat five. Rubric inflation is a design mistake, not rigor.
- The harness test: "if someone gamed this on purpose, would they end up with something good or something fake?"
In Module 02, you'll take this one step further: you'll build an actual judge — a runnable system that applies a rubric to real work automatically. This is where Harness Studio starts to feel like magic, because your rubric stops being a document and becomes a living check you can run on hundreds of things at once.
★ Before you call it done
Three questions. Same three. Every time.
These are the same three questions for every module in Kindling. They are how you check whether AI did the part it should and you did the part only you could. Tap each one to mark it true.
★ ★ ★