Empathy Taste 5–7 hours + 1 week observation Agent Lab · 01 of 03

Designed From Watching

Pick someone real. Watch them for a week. Note what they actually do, not what you assume. Build an agent for that, with an audit trail that makes every action reviewable. The agent's first job is to show what it would do, not to act.

Most agents are built from someone's imagination of "users." That's why most agents are bad. Agent Lab refuses that move on day one: your agent is built for one person, after observing them, not for an imagined audience.

The deeper move in this project is restraint. Your first version of the agent doesn't take any actions at all. It only writes down what it would have done. You and the user read the log together. Then, and only then, the agent gets the keys.

A definition

An agent, in the way we use the word here, is a system that takes real action on someone's behalf: schedules, sends, drafts, orders, deletes. Not a chatbot. The action is what makes the design decisions matter.

Step by step

  1. Pick the person. Get permission.

    This isn't optional. Watching someone in order to build for them is empathetic. Watching someone without telling them is creepy. Tell the person what you're doing. Ask them what they'd want help with. Often the answer surprises you.

  2. Watch for a week. Take field notes in a strict format.

    Five days. Twenty minutes a day. You're not measuring; you're noticing. Field notes have a strict shape: what they did, when, what came before it, what they said about it, what surprised you. See the worked example for the template.

  3. Write the agent spec before any code.

    The spec has 5 sections: does, does_not, audit_log, permission_path, refuse_routes. The does_not list is the hardest and the most important. If you can't write 4 things the agent will not do, you haven't thought hard enough.

  4. Build it in Shadow mode (logs, doesn't act).

    The agent runs the full pipeline, decides what it would do, and writes that to a log. It does not act. Read the log together with the user every two days. Expect to find, at least once or twice, that the agent was about to do something dumb. Save those moments: they're proof the audit trail isn't decoration.

  5. Build the audit trail viewer.

    Logs nobody reads aren't audits. Build a tiny dashboard: latest 20 actions, filterable, with full reasoning shown for each one. The user should be able to read it on their phone over breakfast.

  6. Promote to Assist mode. Then to Act.

    Move forward only when the user says "yes — I'd let you do that." Don't auto-promote on a timer. Permission has to be earned in observable behavior, not granted because a week passed.
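
The promotion rule in step 6 can be sketched as a tiny state machine. This is an illustration only, not part of the worked example: `MODES`, `nextMode`, and the `userSaidYes` / `userPaused` flags are names invented here, and dropping back to SHADOW on a pause is an assumption. The thing to notice is what the function does not take as input: a date.

```javascript
// Hypothetical promotion gate. All names and shapes are illustrative assumptions.
const MODES = ['SHADOW', 'ASSIST', 'ACT'];

/**
 * Return the mode the agent should run in next.
 * Promotion needs an explicit, recorded "yes" from the user.
 * There is deliberately no date or timer parameter.
 */
function nextMode(currentMode, review) {
  const i = MODES.indexOf(currentMode);
  if (i === -1) throw new Error(`unknown mode: ${currentMode}`);
  // Assumption: a pause request always wins and drops the agent back to SHADOW.
  if (review.userPaused === true) return 'SHADOW';
  const canPromote = review.userSaidYes === true && i < MODES.length - 1;
  return canPromote ? MODES[i + 1] : currentMode;
}

console.log(nextMode('SHADOW', { userSaidYes: false })); // SHADOW
console.log(nextMode('SHADOW', { userSaidYes: true }));  // ASSIST
console.log(nextMode('ACT', { userPaused: true }));      // SHADOW
```

Promotion moves one tier at a time, so a single "yes" in Shadow mode can never jump the agent straight to Act.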

A complete worked example, every file

The full parcel-and-calendar-bridge agent for "Mom (Ana)." Field notes, spec, runner, audit format.

field-notes.md · 5 days, strict shape
subject: Mom (Ana)
goal: figure out where the calendar pain actually is
observation_period: Mon Apr 6 → Fri Apr 10, ~20 min/day

day_1:
  observation: |
    9:02am — got Amazon parcel notification "ready for pickup"
    11:15am — read it again, said "ok later"
    5:48pm — re-read the notification. asked me where it was.
    6:10pm — drove 12 minutes to Whole Foods locker. it was closed.
  her_words: |
    "I keep doing this. The notification doesn't survive my morning."
  what_surprised_me: |
    The pain isn't "too many notifications" — it's "wrong moment".
    Morning notifications are dead by 4pm. She needs them then.

day_2:
  observation: |
    8:30am — RSVP'd to the school PTA meeting via the school portal
    8:32am — closed the tab
    7:11pm — asked me "wait, did I say yes to that thing?"
    7:14pm — I checked. she had. she'd forgotten by lunch.
  her_words: |
    "It's the calendar that already lives somewhere else that I always lose."
  what_surprised_me: |
    She has 4 calendars. iPhone, Google, school portal, doctor's portal.
    None of them talk. She doesn't trust any of them as the source.

day_3:
  observation: |
    Mid-morning fine. Afternoon she missed two reminders for medication
    refill. She had set them — they fired — she dismissed them mid-task
    and forgot.
  her_words: |
    "I always think I'll come back to it. Then I never do."
  what_surprised_me: |
    She doesn't ignore notifications because they're wrong. She dismisses
    them while busy and the brain skips the "remember to come back" step.

day_4:
  observation: |
    Reminded me twice that her sister's flight arrives Sunday. She had
    written it on a paper sticky note, not in any calendar app. It worked.
  her_words: |
    "If I want to be SURE I'll see it, I write it on paper and stick it
     on the fridge."
  what_surprised_me: |
    Paper is the most trusted system in the house. Any digital tool
    competes with paper, not with other apps.

day_5:
  observation: |
    Asked the iPad to "read aloud" the day's calendar. It read 12 events.
    She wanted ~3. She turned it off.
  her_words: |
    "Just tell me the things I'd actually forget."
  what_surprised_me: |
    She doesn't want the calendar. She wants the curated 3-thing
    digest of "things I'd otherwise miss." Different problem.

## summary going into agent-spec

Real pain (in priority order):
  1. Notifications fire at the wrong moment. (the 4pm parcel problem)
  2. The 4 calendars don't talk. She loses RSVPs across them.
  3. Dismissed reminders die. Need a second-chance digest.
  4. Paper is the trusted layer. Any digital tool earns trust slowly.
  5. She wants curation, not completeness. ~3 items, not 12.

Pain we will NOT solve:
  - merging the 4 calendar source systems (out of scope, untrustworthy)
  - replacing paper (paper wins, leave it alone)
  - voice interaction (she turned it off — it's not the medium)
agent-spec.yaml
name: parcel-and-calendar-bridge
version: 0.1
built_for: Mom (Ana)
based_on: field-notes.md (5 days observation)

job: |
  Watch incoming notifications from email + 3 portals. For anything
  that carries a date or a pickup window, propose:
    - move/keep on Mom's main calendar
    - send Mom a "this is your real reminder" SMS at 4pm

does:
  - watch 4 calendar sources (read-only)
  - propose an aggregated 3-item daily digest
  - send Mom a 4pm SMS with the day's curated list
  - log every proposed action with full reasoning

does_not:
  - cancel anything (ever)
  - reply to anyone on Mom's behalf (ever)
  - hide notifications from her (ever)
  - decide what's important — only flag candidates
  - merge the 4 calendars into one (paper-based truth wins)
  - add voice (she rejected it explicitly)
  - send more than ONE SMS per day

audit_log:
  format: JSON-Lines (one event per line)
  fields: [ts, source, raw, agent_proposed, would_act,
           reason, mode, user_review_status]
  retention: 90 days, then archive
  user_can_view: any time, on phone, with one tap

permission_path:
  week_1: SHADOW    # logs, doesn't act
  week_2: ASSIST    # acts only when she taps OK on each proposal
  week_3: ACT       # acts within scope, can be paused at any time

  promotion_rule: |
    Only promote to next tier when Mom explicitly says
    "yes — I'd let you do this." Never auto-promote on a timer.

refuse_routes:
  - if: "event involves spending money"
    response: SUGGEST only, never act, escalate to Mom
  - if: "event involves replying to a person Mom hasn't named"
    response: SUGGEST only, draft, never send
  - if: "event source is the school portal AND we haven't seen
         this sender before"
    response: flag as new, ask Mom to confirm trusted

honest_failure_modes_to_log:
  - misjudged 4pm window (e.g. Mom was traveling)
  - duplicate event from two sources (we should have deduped)
  - false positive (proposed something Mom rejected)
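
The two checks from step 3 (five sections present, at least 4 items in does_not) are easy to automate before any agent code exists. A minimal sketch, assuming the spec has already been parsed into a plain object; `lintSpec` and its messages are hypothetical names, not part of the project files.

```javascript
// Hypothetical spec check. The function name and messages are illustrative.
const REQUIRED_SECTIONS = ['does', 'does_not', 'audit_log', 'permission_path', 'refuse_routes'];

/** Return a list of problems. An empty list means the spec passes. */
function lintSpec(spec) {
  const problems = [];
  for (const key of REQUIRED_SECTIONS) {
    if (spec[key] === undefined) problems.push(`missing section: ${key}`);
  }
  const doesNot = Array.isArray(spec.does_not) ? spec.does_not : [];
  if (doesNot.length < 4) {
    problems.push(`does_not has ${doesNot.length} item(s); step 3 asks for at least 4`);
  }
  return problems;
}

// A deliberately thin draft, to show what the check reports.
const draft = {
  does: ['watch 4 calendar sources (read-only)'],
  does_not: ['cancel anything (ever)'],
  audit_log: { format: 'JSON-Lines' },
};
console.log(lintSpec(draft));
```

The draft above fails on three counts: no permission_path, no refuse_routes, and a does_not list with a single item.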
shadow-runner.js · runs the full pipeline, doesn't act
/**
 * Shadow-mode runner. Reads incoming events, asks Claude what it
 * would propose, writes to audit-trail.jsonl. Never sends an SMS.
 * Never modifies a calendar. Never replies to anyone.
 *
 * Usage:  node shadow-runner.js  (run in cron every 30 min)
 */
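import fs from 'node:fs/promises';
import Anthropic from '@anthropic-ai/sdk';
import { parse as parseYaml } from 'yaml'; // assumes the `yaml` npm package is installed
import { fetchAllSources } from './sources.js';

// Node cannot import a .yaml file directly, so read and parse the spec at startup.
const SPEC = parseYaml(await fs.readFile('agent-spec.yaml', 'utf8'));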

const client = new Anthropic();
const AUDIT = 'audit-trail.jsonl';

const SYSTEM = `
You are the parcel-and-calendar-bridge agent for Mom (Ana).
You propose actions; you do not take them. For each incoming
event, return strict JSON:
{
  "proposed_action": "<action name>",
  "reason":          "<one sentence>",
  "would_act":       true | false,
  "scope_ok":        true | false,
  "needs_user_ok":   true | false
}

Spec context:
${JSON.stringify(SPEC, null, 2)}
`.trim();

async function decide(event) {
  const msg = await client.messages.create({
    model: 'claude-haiku-4-5',
    max_tokens: 200,
    system: SYSTEM,
    messages: [{ role: 'user', content: JSON.stringify(event) }],
  });
  let parsed;
  try {
    parsed = JSON.parse(msg.content[0].text);
  } catch {
    // The reply was not valid JSON: fall back to a safe "ask Mom" proposal.
    parsed = {
      proposed_action: 'flag_to_mom',
      reason: 'agent returned invalid json',
      would_act: false,
      scope_ok: true,
      needs_user_ok: true,
    };
  }
  return parsed;
}

async function logEvent(event, decision) {
  const line = JSON.stringify({
    ts:     new Date().toISOString(),
    source: event.source,
    raw:    event,
    agent_proposed: decision.proposed_action,
    would_act: decision.would_act,
    reason:    decision.reason,
    mode:      'SHADOW',
    user_review_status: 'pending',
  });
  await fs.appendFile(AUDIT, line + '\n');
}

async function main() {
  const events = await fetchAllSources();
  for (const e of events) {
    const d = await decide(e);
    await logEvent(e, d);
    console.log(`[shadow] ${d.proposed_action} — ${d.reason}`);
  }
  console.log(`logged ${events.length} events to ${AUDIT}`);
}
main().catch(err => { console.error(err); process.exit(1); });
audit-trail.jsonl · what the user actually reads
{"ts":"2026-04-13T09:02:14Z","source":"amazon","raw":{"type":"parcel_ready","window":"4pm-6pm","location":"whole foods locker"},"agent_proposed":"send_4pm_sms","would_act":true,"reason":"window starts 4pm; morning notifications die in Mom's day (field-notes day_1)","mode":"SHADOW","user_review_status":"pending"}
{"ts":"2026-04-13T11:30:08Z","source":"school_portal","raw":{"type":"new_rsvp","event":"PTA meeting Apr 22 6pm"},"agent_proposed":"add_to_main_calendar","would_act":true,"reason":"cross-portal events are documented loss point (field-notes day_2)","mode":"SHADOW","user_review_status":"pending"}
{"ts":"2026-04-13T14:48:01Z","source":"email","raw":{"type":"sale","subject":"50% off vitamins!"},"agent_proposed":"ignore","would_act":false,"reason":"not actionable, not in the curated 3-thing budget","mode":"SHADOW","user_review_status":"pending"}
{"ts":"2026-04-13T16:00:00Z","source":"internal","raw":{"type":"daily_digest_window"},"agent_proposed":"send_4pm_sms","would_act":true,"reason":"summary of 2 events; under the 3-item budget","mode":"SHADOW","user_review_status":"pending"}
{"ts":"2026-04-14T08:14:22Z","source":"doctor_portal","raw":{"type":"appointment_reminder","when":"Apr 18 9am"},"agent_proposed":"add_to_main_calendar","would_act":true,"reason":"doctor portal is one of the 4 untrusted sources","mode":"SHADOW","user_review_status":"pending"}
{"ts":"2026-04-14T11:20:01Z","source":"venmo","raw":{"type":"payment_request","from":"Aunt Rita","amount":"$240"},"agent_proposed":"flag_to_mom","would_act":false,"reason":"event involves money — refuse_routes rule","mode":"SHADOW","user_review_status":"pending"}
README.md
# parcel-and-calendar-bridge

A Shadow-mode agent for Mom (Ana). Reads notifications from
4 calendar sources, proposes actions, never acts (yet).

## promotion path

  week 1  SHADOW   — runs, logs, never sends an SMS
  week 2  ASSIST   — sends if Mom taps OK on each proposal
  week 3  ACT      — acts within scope, can be paused

Mom and I review the audit log together every 2 days.
Promotion happens only on her explicit "yes."

## what it deliberately won't do

- never spends money on Mom's behalf
- never replies as Mom
- never merges the 4 calendars into one (paper wins)
- never sends more than one SMS per day

## file shape

  field-notes.md       # the 5-day observation
  agent-spec.yaml      # the contract
  shadow-runner.js     # cron'd every 30 min
  sources.js           # source adapters (one per system)
  audit-trail.jsonl    # JSON-lines log

Live demo 1: simulate a Shadow-mode decision

Type a hypothetical incoming notification. The "agent" walks through its decision tree and writes an audit-trail entry. Read the JSON it would have written.

Shadow-mode action proposer
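
A minimal sketch of what such a proposer might do, with hand-written rules standing in for the model call. Everything here is an assumption: `decideByRule` and `shadowEntry` are hypothetical names, and the mapping from event type to action is inferred from the sample audit-trail.jsonl, not taken from the real agent. The output has the same fields the runner logs.

```javascript
// Hypothetical stand-in for the model call. The type-to-action mapping is an
// assumption inferred from the sample audit log, not the agent's real logic.
const CALENDAR_TYPES = new Set(['new_rsvp', 'appointment_reminder']);

function decideByRule(raw) {
  if (raw.type === 'payment_request') {
    // Mirrors the refuse_routes money rule: suggest only, never act.
    return { action: 'flag_to_mom', wouldAct: false, reason: 'event involves money (refuse_routes rule)' };
  }
  if (raw.type === 'parcel_ready') {
    return { action: 'send_4pm_sms', wouldAct: true, reason: 'pickup reminders wait for the 4pm digest' };
  }
  if (CALENDAR_TYPES.has(raw.type)) {
    return { action: 'add_to_main_calendar', wouldAct: true, reason: 'cross-portal events get lost' };
  }
  return { action: 'ignore', wouldAct: false, reason: 'not actionable' };
}

/** Build the audit-trail entry Shadow mode would write. Nothing is sent. */
function shadowEntry(source, raw, now = new Date()) {
  const d = decideByRule(raw);
  return {
    ts: now.toISOString(),
    source,
    raw,
    agent_proposed: d.action,
    would_act: d.wouldAct,
    reason: d.reason,
    mode: 'SHADOW',
    user_review_status: 'pending',
  };
}

const entry = shadowEntry('venmo', { type: 'payment_request', from: 'Aunt Rita', amount: '$240' });
console.log(JSON.stringify(entry, null, 2));
```

The money rule is checked first on purpose: a refuse route should win before any "helpful" rule gets a chance to fire.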


Live demo 2: the audit log Mom actually reads

Below is a real audit log for Mom (5 entries). Click any entry to see the agent's reasoning and would_act flag. Use the filter to see only "would have acted" or only "refused."

Audit-trail viewer
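
The logic behind a viewer like this is small: parse the JSON-Lines file, filter, sort newest first, cap the list at 20. A rough sketch under stated assumptions: the function names are hypothetical, "refused" is taken to mean `would_act` is false, and the two sample lines are abbreviated from the log shown earlier.

```javascript
// Hypothetical viewer helpers. Names and the "refused" definition are assumptions.

/** Parse JSON-Lines text into entries, skipping blank lines. */
function parseAuditLog(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

/**
 * Pick what the viewer shows: newest first, at most `limit` entries.
 * filter: 'all' | 'would_act' | 'refused' (refused = would_act is false).
 */
function selectEntries(entries, filter = 'all', limit = 20) {
  const wanted = entries.filter((e) => {
    if (filter === 'would_act') return e.would_act === true;
    if (filter === 'refused') return e.would_act === false;
    return true;
  });
  // ISO-8601 UTC timestamps compare correctly as plain strings.
  const newestFirst = [...wanted].sort((a, b) => (a.ts < b.ts ? 1 : a.ts > b.ts ? -1 : 0));
  return newestFirst.slice(0, limit);
}

/** One readable line per entry, reasoning included. */
function formatEntry(e) {
  const verdict = e.would_act ? 'would act' : 'would not act';
  return `${e.ts}  ${e.agent_proposed} (${verdict}): ${e.reason}`;
}

// Two entries abbreviated from the sample log.
const sample = [
  '{"ts":"2026-04-13T09:02:14Z","agent_proposed":"send_4pm_sms","would_act":true,"reason":"window starts 4pm"}',
  '{"ts":"2026-04-14T11:20:01Z","agent_proposed":"flag_to_mom","would_act":false,"reason":"event involves money"}',
].join('\n');

for (const e of selectEntries(parseAuditLog(sample), 'refused')) {
  console.log(formatEntry(e));
}
```

Keeping the reason on the same line as the action is the point: the reasoning is what Mom is actually reviewing.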

Live demo 3: scaffold a day of field notes

Fill in what you observed. The templater generates the strict-format YAML you'd commit to your repo. Forces the four required fields: observation, her_words, what_surprised_me, plus tags.

Field-notes templater
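
A templater of this kind could be as simple as the sketch below: refuse to render until every required field is filled in, then print the day in the same shape as field-notes.md. `renderDay` and `asBlock` are hypothetical names, the `tags` values in the usage are made up for illustration, and the output is not escaped for YAML special characters, so treat it as a draft.

```javascript
// Hypothetical templater. Function names are assumptions. It does not escape
// YAML special characters, so review the output before committing it.
const TEXT_FIELDS = ['observation', 'her_words', 'what_surprised_me'];

/** Indent free text so it sits inside a YAML block scalar ("|"). */
function asBlock(text) {
  return text
    .trim()
    .split('\n')
    .map((line) => `    ${line.trim()}`)
    .join('\n');
}

/** Render one day of field notes, or throw if a required field is empty. */
function renderDay(dayNumber, note) {
  for (const field of TEXT_FIELDS) {
    const value = note[field];
    if (typeof value !== 'string' || value.trim() === '') {
      throw new Error(`missing required field: ${field}`);
    }
  }
  if (!Array.isArray(note.tags) || note.tags.length === 0) {
    throw new Error('missing required field: tags');
  }
  const lines = [`day_${dayNumber}:`];
  for (const field of TEXT_FIELDS) {
    lines.push(`  ${field}: |`, asBlock(note[field]));
  }
  lines.push(`  tags: [${note.tags.join(', ')}]`);
  return lines.join('\n');
}

console.log(renderDay(5, {
  observation: 'Asked the iPad to "read aloud" the day\'s calendar. It read 12 events.\nShe wanted ~3. She turned it off.',
  her_words: '"Just tell me the things I\'d actually forget."',
  what_surprised_me: 'She wants the curated 3-thing digest, not the calendar.',
  tags: ['curation', 'voice'], // illustrative tags, not from the original notes
}));
```

Throwing on an empty field is deliberate. A day with no what_surprised_me is usually a day you watched without noticing.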


What makes this hard

The hardest single discipline is the week of just watching. You'll want to start building on day two. Don't. The week of observation is what separates an agent for Mom from an agent about moms-in-general. Your field notes from Friday will contradict your assumptions from Monday. That contradiction is the project.

The second hard thing is keeping the audit trail honest. There's a strong urge to make Shadow-mode entries look clean and impressive. Resist. Log every silence, every wrong-decision-about-to-be-made, every confusion. The log is for the user, not for you.

Self-check before you promote out of Shadow

  • Field notes from at least 5 days, in the prescribed shape (observation / her_words / what_surprised_me).
  • The agent spec has both a "does" and "does NOT" list — and the NOT list has at least 4 items.
  • The audit log has captured at least one moment the agent would have done something wrong.
  • The user (the actual person, by name) has read the log on their phone.
  • I can describe one specific thing I assumed Monday that I unlearned by Friday.
  • The user has explicitly said: yes, I'd let it do this.

Push further · for the harder end of 15+

Shadow mode is the floor. Real agents have to handle the things you don't expect.

  1. Add a deduplication layer across sources. The same event arrives from email AND the school portal. Right now your agent proposes both. Add a fingerprinting step that detects duplicates with fuzzy matching (Levenshtein or token-overlap). Test it: feed in 3 duplicates with varying titles. The agent should propose one, log all three sources.
  2. Build the human-review queue. Audit logs only matter if Mom reads them. Build a tiny daily email or push notification: "you have 6 agent proposals to review. Open the link to approve / reject in 30 seconds." Many AI products that keep a human in the loop rest on the same principle: the system suggests, the person accepts or rejects (Gmail's Smart Compose, GitHub Copilot, etc.).
  3. Add post-hoc evals from real outcomes. A week after the agent ran, cross-reference what it proposed with what Mom actually did. Did the 4pm SMS actually help her catch the parcel? Did she ignore the school RSVP anyway? Compute a "useful action rate" and an "ignored action rate." This is real LLM eval work, the same kind engineers do at Anthropic, OpenAI, and Google.
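
For the deduplication item, here is one way the fingerprinting step might look, using token overlap (Jaccard similarity) rather than Levenshtein. It is a sketch under assumptions: `tokens`, `overlap`, and `dedupe` are hypothetical names, the 0.5 threshold is a starting guess to tune, and events are assumed to carry a `when` field already normalized to the same format across sources.

```javascript
// Hypothetical dedup layer. Names, threshold, and event shape are assumptions.

/** Lowercase a title and split it into a set of alphanumeric tokens. */
function tokens(title) {
  const words = title.toLowerCase().replace(/[^a-z0-9\s]/g, ' ').split(/\s+/);
  return new Set(words.filter(Boolean));
}

/** Jaccard similarity: shared tokens divided by all distinct tokens. */
function overlap(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared += 1;
  return shared / (ta.size + tb.size - shared);
}

/**
 * Group events that share a time and have similar titles.
 * Each new event is compared against the first title seen in a group.
 */
function dedupe(events, threshold = 0.5) {
  const groups = [];
  for (const e of events) {
    const match = groups.find(
      (g) => g.when === e.when && overlap(g.title, e.title) >= threshold
    );
    if (match) match.sources.push(e.source);
    else groups.push({ title: e.title, when: e.when, sources: [e.source] });
  }
  return groups;
}

const incoming = [
  { source: 'email', title: 'PTA meeting', when: '2026-04-22T18:00' },
  { source: 'school_portal', title: 'PTA Meeting - Apr 22', when: '2026-04-22T18:00' },
  { source: 'google', title: 'School PTA meeting', when: '2026-04-22T18:00' },
  { source: 'doctor_portal', title: 'Dentist appointment', when: '2026-04-18T09:00' },
];
console.log(dedupe(incoming));
```

Each group yields one proposal while `sources` keeps the record of everywhere the event came from, which is what the audit log needs.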