Most "Skills" on the internet are written, deployed, and then quietly drift into uselessness. The author tweaks the prompt, things get a little worse, the author tweaks again, the Skill is now subtly broken in ways nobody noticed. This project is the cure for that.
You'll add the three things that turn a Skill from "a clever prompt" into "software": tests (so you can change things and know what broke), refusal logic (so the Skill says no when it shouldn't say anything), and a changelog (so future-you knows what past-you did and why).
Project 01 grew passion. Project 02 grew empathy. Project 03 grows taste — the discipline of holding your own work to a standard you can defend. This is where Skills Workshop turns into real engineering.
Step by step
- Write 12 tests for your Skill (from Project 01 or 02).
A "test" for a Skill is a triple: (input, expected behavior, why it matters). You're not testing the exact output — Claude varies. You're testing whether the Skill does the right shape of thing. Aim for at least 12 tests across four categories: 4 happy path, 3 edge cases, 3 refusal cases, 2 regression cases (things you broke once).
- Run them by hand on the first pass.
Yes, by hand. For each test: prompt Claude with the Skill, give it the input, read the response, write pass/fail/partial next to the test ID. The friction of running them by hand is the friction that grows your taste. You should fail at least 3 tests on the first run. If you don't, your tests are too easy.
- Build the test harness so future runs aren't manual.
After the first by-hand pass, write a small Python or Node script that runs each test through the Claude API and grades the response against your expected behavior. Treat this script as part of the Skill. Real engineers can re-run their entire test suite in 30 seconds, every time they change anything.
- Teach the Skill to refuse, gracefully.
Refusal is not a one-line afterthought. It's a small system: notice you're out of scope, refuse cleanly, redirect to a better tool, and never apologize so much it reads as performance. Add the refusal pattern to your response_style section explicitly (a sketch follows this list). Re-run the refusal tests. They should pass now.
- Evolve to v0.2, without regressing.
Now add one new feature. Examples: an extra category, a "compare these two" function, a "common mistakes" lookup. After you change the Skill, run all 12 tests again. Anything that used to pass and now fails is a regression. Fix it before you ship.
- Write a v1.0 changelog. Ship it.
One page. Three sections: What v1.0 does. What's new since v0.2. What it deliberately won't do (yet). Include your tests and one example output. Hand the bundle to a real person — a sibling, a cousin, a friend, a teacher — and ask them to use it for a week. Take notes when it disappoints them. Those notes are your v1.1 backlog.
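Here is a minimal sketch of what the refusal step's pattern can look like inside a SKILL.md response_style section. The wording and layout are illustrative, not a required format; the tools named match the refusal tests in the worked example below:

```markdown
## response_style (excerpt: the refusal pattern)

When a question falls outside the four warblers:
1. Say so in one plain sentence. At most one "sorry".
2. Point to a better tool: a field guide, Sibley, or Merlin Bird ID.
3. Offer the nearest in-scope help: "if it might be one of our four, describe the marks you saw."
Never guess at out-of-scope species, even with a confidence caveat.
```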
A complete worked example, every file
The full v1.0 release of warbler-id-coach. Tests, refusal logic, harness script, changelog, README — everything you'd push to a public repo.
tests.yaml
```yaml
## happy path (4 tests)
- id: t01-easy-start
  input: "What's the easiest of the four to start with?"
  expects:
    must_mention: ["black-throated-blue", "easy"]
    must_not_mention: ["yellow-rumped"]  # we say start with BTB, not yellow-rumped
  why: "Beginner orientation. The Skill must lean into BTB as the easy starter."

- id: t02-mnemonic-recall
  input: "How do I remember the black-throated-blue's call?"
  expects:
    must_mention: ["I-am-LAZY"]
    max_words: 60
  why: "Our specific mnemonic — proof the Skill carries our voice, not generic advice."

- id: t03-confusion-pair
  input: "Magnolia or prairie?"
  expects:
    must_mention: ["tail", "bobbing"]  # the specific behavior tell
    must_compare: true  # both birds named in response
  why: "Behavior tell is the load-bearing distinction. Skill must surface it."

- id: t04-fall-plumage
  input: "I saw a brown bird with a yellow patch on its back, fall."
  expects:
    must_mention: ["yellow-rumped", "rump"]
    must_explain_fade: true
  why: "Fall plumage is where 80% of beginners go wrong. Skill must catch it."

## edge cases (3 tests)
- id: t05-female-no-male-described
  input: "It was olive brown with a white wing patch."
  expects:
    must_mention: ["female", "black-throated-blue"]
    confidence_required: true  # must say "I think" or rate confidence
  why: "Females are the hardest case. Skill must hedge and identify."

- id: t06-multiple-marks-conflict
  input: "Yellow underneath but no streaks at all."
  expects:
    must_consider: ["yellow-rumped", "magnolia"]
    must_resolve_with: "rump"
  why: "Two birds initially fit. Skill must walk through how to disambiguate."

- id: t07-song-only-no-sighting
  input: "I heard zee-zee-zee but never saw it."
  expects:
    must_mention: ["prairie"]
    must_caveat: "without sighting, can't be 100%"
  why: "Audio-only IDs are hard. Skill must answer but not overclaim."

## refusal cases (3 tests)
- id: t08-out-of-set-warbler
  input: "What's a cerulean warbler?"
  expects:
    refuses: true
    redirects_to: "field guide / Sibley / Merlin"
    no_apology_count: true  # max 1 'sorry'
  why: "Cerulean is outside our four. Skill should redirect, not guess."

- id: t09-out-of-domain-bird
  input: "Identify this hawk for me."
  expects:
    refuses: true
    redirects_to: "Merlin Bird ID"
    explains_why: true
  why: "Out of domain entirely. Soft refusal + helpful redirect."

- id: t10-non-bird-question
  input: "What's the meaning of life?"
  expects:
    refuses: true
    politely: true
    short: true  # < 30 words
  why: "Skill should not engage with off-topic questions, not even charmingly."

## regression cases (2 tests — things we broke once)
- id: t11-r-prairie-overclaim  # broke in v0.3 when we added Cape May
  input: "yellow with thin streaks"
  expects:
    must_consider: ["prairie"]
    must_not_assume: true  # in v0.3 we'd jumped straight to prairie
  why: "v0.3 regression: removed the 'consider before concluding' step. Don't repeat."

- id: t12-r-fall-female-collision  # broke in v0.5 when we tightened response_style
  input: "Brown bird, fall, no obvious marks."
  expects:
    must_mention_at_least_one: ["yellow-rumped fall", "female", "look at rump"]
  why: "v0.5 regression: response_style 'be brief' caused Skill to skip the fall hint."
```
"""
Run the warbler-id-coach test suite against the live Claude API.
Usage: python harness/run.py [--verbose]
Every test passes/fails based on the assertions in tests.yaml.
Output: 12 / 12 passed (or list of failures with diffs).
"""
import os, sys, yaml, json, anthropic
from pathlib import Path
ROOT = Path(__file__).parent.parent
SKILL_MD = (ROOT / "SKILL.md").read_text()
KNOWLEDGE_MD = (ROOT / "knowledge.md").read_text()
TESTS = yaml.safe_load((ROOT / "tests.yaml").read_text())
SYSTEM = f"""
You are the warbler-id-coach Skill.
Manifest: {SKILL_MD}
Knowledge: {KNOWLEDGE_MD}
""".strip()
client = anthropic.Anthropic()
def run_one(t):
msg = client.messages.create(
model="claude-haiku-4-5",
max_tokens=400,
system=SYSTEM,
messages=[{"role": "user", "content": t["input"]}],
)
out = msg.content[0].text.lower()
return out
def grade(out, exp):
fails = []
if "must_mention" in exp:
for kw in exp["must_mention"]:
if kw.lower() not in out:
fails.append(f"missing '{kw}'")
if "must_not_mention" in exp:
for kw in exp["must_not_mention"]:
if kw.lower() in out:
fails.append(f"unexpected '{kw}'")
if "max_words" in exp:
wc = len(out.split())
if wc > exp["max_words"]:
fails.append(f"too long: {wc} > {exp['max_words']}")
if exp.get("refuses"):
signals = ["outside", "can't", "cannot", "not what i", "try", "instead"]
if not any(s in out for s in signals):
fails.append("expected a refusal — none detected")
if "no_apology_count" in exp:
ap = sum(out.count(w) for w in ["sorry", "apolog"])
if ap > 1:
fails.append(f"too apologetic: {ap} apology words")
return fails
def main():
verbose = "--verbose" in sys.argv
passed, failed = 0, []
for t in TESTS:
out = run_one(t)
fails = grade(out, t["expects"])
if fails:
failed.append((t["id"], fails, out if verbose else None))
else:
passed += 1
print(f" {'✓' if not fails else '✗'} {t['id']}")
total = len(TESTS)
print(f"\n{passed} / {total} passed")
if failed:
print("\nfailures:")
for fid, ff, out in failed:
print(f" {fid}:")
for f in ff: print(f" - {f}")
if out: print(f" output: {out[:200]}…")
sys.exit(1)
if __name__ == "__main__":
main()
CHANGELOG.md
```markdown
## v1.0 — 2026-04-22 · public release

What v1.0 does
--------------
The four-warbler ID coach is now production-ready. Beginner birders can hand it a song description, a field mark, or a confusion query and get a confident, calibrated answer. It refuses gracefully outside its four-warbler scope.

What's new since v0.5
---------------------
- Added confidence layer: every answer ends with (confidence: N/5).
- Added 12 tests in tests.yaml, all passing.
- Added harness/run.py — a re-runnable test suite via the Anthropic API.
- Tightened refusal language: max 1 apology, always with a redirect.
- knowledge.md grew the "Leo's classic mistakes" section to 5 entries.
- Fixed regression t11 (prairie overclaim, broke in v0.3).
- Fixed regression t12 (fall-female collision, broke in v0.5).

What v1.0 deliberately won't do (yet)
-------------------------------------
- Western U.S. warblers — different species set, different IDs.
- Photo identification — Merlin does it better, no point competing.
- Female fall plumage of every species — too much variability; the Skill would over-commit. Currently hedges with "I'd need to see the wing-bar count to be sure."
- Any species outside the four. Will refuse + redirect.

Installation
------------
1. Clone the folder into your skills/ directory.
2. Set ANTHROPIC_API_KEY in your environment.
3. Run python harness/run.py — it should print "12 / 12 passed".
4. Invoke from any Claude session by referencing the SKILL.md.

Real users
----------
Leo (age 11) — installed v1.0 on day 1. Used it 4 times in week 1. Caught a magnolia he'd previously called a yellow-rumped. He has notes on what he wants in v1.1 (Cape May warbler, please).
```
README.md
````markdown
# warbler-id-coach

A Claude Skill for distinguishing four common Eastern warblers (yellow-rumped, magnolia, prairie, black-throated-blue) by ear and field marks, in the order beginners actually confuse them.

Built for Leo, age 11. Tested by Leo in the field.

## quickstart

```bash
git clone https://example.com/warbler-id-coach
cd warbler-id-coach
export ANTHROPIC_API_KEY=sk-...
python harness/run.py   # 12 / 12 should pass
```

## file shape

```
warbler-id-coach/
├── SKILL.md          # manifest (5 fields, kept tight)
├── knowledge.md      # the four warblers + Leo's mistakes
├── examples/
│   ├── easy-id.md    # 3 successful interactions
│   └── refusal.md    # 2 refusal examples
├── tests.yaml        # 12 tests across 4 categories
├── harness/
│   └── run.py        # the test runner
├── CHANGELOG.md      # v1.0 release notes
└── README.md         # this file
```

## what it deliberately won't do

- identify Western U.S. warblers
- accept photos
- guess at female-fall-plumage IDs without specific marks

These are not bugs. They are deliberate scope decisions. See CHANGELOG.md for the reasoning.

## who built it

Built as part of Kindling · Builders Edition, Project 03. The "who" matters less than the "for whom" — this Skill exists because Leo asked one too many times "are these all the same bird?"
````
Live demo 1: a real test runner, in your browser
Below: a tiny test framework. Define test cases as JS, hit "run", get a green/red report. This isn't a toy — it's the same shape as Vitest, Jest, pytest. Once you can read this, you can read any test framework on Earth.
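Away from the browser, here is a Python sketch of the same shape (the demo's cases are defined in JS; the case names below are invented for illustration):

```python
# mini test runner: the pytest/Jest shape, stripped to the bone
def expect(cond, msg="expectation failed"):
    if not cond:
        raise AssertionError(msg)

def case_keyword_present():
    expect("rump" in "yellow-rumped warbler", "keyword missing")

def case_word_budget():
    expect(len("short answer".split()) <= 60, "answer too long")

def case_deliberate_failure():
    expect(False, "this one is meant to be red")

def run_all(cases):
    passed = 0
    for fn in cases:
        try:
            fn()
            print(f" ✓ {fn.__name__}")
            passed += 1
        except AssertionError as e:
            print(f" ✗ {fn.__name__}: {e}")
    print(f"\n{passed} / {len(cases)} passed")

run_all([case_keyword_present, case_word_budget, case_deliberate_failure])
```

Define a case, register it in the list, re-run: that loop is the whole idea, and it is the same loop Vitest or pytest runs at industrial scale.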
Live demo 2: regression sniffer (paste two outputs)
Paste your v0.2 output and your v0.3 output. The sniffer compares them on three crude dimensions and tells you whether you might have regressed. This is what you'd run after every Skill edit.
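A Python sketch of the sniffer's idea. The demo's exact heuristics aren't given above, so the three dimensions below (length drift, word overlap, apology count) are assumptions picked to match this Skill's failure modes:

```python
# regression sniffer: compare old vs. new output on three crude dimensions
def sniff(old: str, new: str) -> list[str]:
    warnings = []
    ow, nw = old.lower().split(), new.lower().split()
    # 1. length drift: a big swing often means the voice changed
    if abs(len(nw) - len(ow)) > 0.5 * max(len(ow), 1):
        warnings.append("length changed by more than 50%")
    # 2. word overlap: low Jaccard similarity means different content
    overlap = len(set(ow) & set(nw)) / max(len(set(ow) | set(nw)), 1)
    if overlap < 0.3:
        warnings.append(f"low word overlap ({overlap:.0%})")
    # 3. apology count: more 'sorry' usually means a worse refusal
    if new.lower().count("sorry") > old.lower().count("sorry"):
        warnings.append("more apologies than before")
    return warnings

v02 = "That's a prairie warbler. The tail bobbing gives it away."
v03 = "Sorry, I'm not sure. Sorry. It could be many things."
for w in sniff(v02, v03) or ["no obvious regression"]:
    print("-", w)
```

Crude on purpose: a heuristic that fires on every real regression you've shipped beats a sophisticated one you never run.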
Live demo 3: classify your refusal pattern
Three refusal styles cover almost every Skill output you'll see in the wild. The classifier reads your refusal text and tells you which one you've written — and which one is actually appropriate for the situation.
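A Python sketch of the classifier. The three style names (HARD_STOP, APOLOGY_SPIRAL, REDIRECT) are labels assumed for this sketch, not the demo's official taxonomy:

```python
# refusal classifier: three styles, keyed off crude text signals
def classify(refusal: str) -> str:
    text = refusal.lower()
    apologies = text.count("sorry") + text.count("apolog")
    redirects = any(s in text for s in ("instead", "try ", "merlin", "field guide"))
    if redirects:
        return "REDIRECT: says no, then points somewhere better (usually what you want)"
    if apologies >= 2:
        return "APOLOGY_SPIRAL: more performance than help; cut it down"
    return "HARD_STOP: clean but cold; fine for off-topic, harsh for near-misses"

print(classify("Sorry, so sorry, I really can't help with that."))
print(classify("That's outside my four warblers; try Merlin Bird ID instead."))
print(classify("No. That is not a question about warblers."))
```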
What makes this hard
Most builders skip step 1. They tell themselves they'll "just notice" if something breaks. They won't. Skills are non-deterministic — Claude generates a slightly different response each time — and the only way to keep your evolution honest is to write tests and run them.
The other hard thing: handing the v1.0 to a real person. The first time someone uses your work and is disappointed, you'll feel it. You'll want to defend the Skill. Don't. Take the disappointment as data. Real users are rarer and more valuable than any feedback Claude itself can give.
The third — and most subtle — is keeping response_style stable across versions. You'll be tempted to "improve" it on every release. Don't. Treat response_style as the one part of the Skill that should change least often, even if the knowledge grows. The Skill's voice is what users come to trust.
Self-check before you ship v1.0
- I have at least 12 tests written, and they cover at least 3 refusal cases.
- I ran the tests by hand and at least three failed on the first run.
- I built a test runner that re-runs everything in under a minute.
- The Skill refuses gracefully — not apologetically — when out of scope.
- I evolved at least once and confirmed no test regressed (at least one regression test exists for a real bug I once shipped).
- I shipped a one-page changelog and one real human is using v1.0.
- I have at least one note from that human telling me what to fix in v1.1.
Push further · for the harder end of 15+
Ship v1.0 and you've crossed a real threshold. Here's how serious it can get.
- Run the test suite in CI. Set up GitHub Actions so every push to your repo triggers python harness/run.py against a real API key (stored as a GitHub secret). Failed tests block the merge. This is exactly how production AI products at real companies work — your Skill now has the same release discipline. A workflow sketch follows this list.
- Add semantic versioning + a deprecation path. v1.0 → v1.1 (additive) → v2.0 (breaking change). Document a migration path for users on v1: how do they upgrade without losing anything? When you remove a feature, give 90 days' notice in the changelog. This is what real software teams do, and almost no kid prompt-author does. Stand out.
- Build the Skill registry submission. Many AI tools accept community Skills. Pick one (Claude's own ecosystem, or a third-party registry). Read their submission requirements end-to-end. Make your README, your tests, your changelog conform. Submit. The "submit a real artifact to a real registry" loop is something most working developers don't do until age 22 — and you can do it before you finish high school.
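For the CI bullet above, a minimal GitHub Actions sketch. The checkout and setup-python actions are standard; the workflow path and Python version are your choice, and the secret name ANTHROPIC_API_KEY matches what harness/run.py expects:

```yaml
# .github/workflows/tests.yml
name: skill-tests
on: [push, pull_request]
jobs:
  run-suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic pyyaml
      - run: python harness/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Because run.py exits non-zero on any failure, a red suite fails the workflow, and a branch protection rule on the repo turns that into a blocked merge.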