Does This Sound Like a Real Kid? · Showcase

The work

A judge that asks one question: "does this sound like a real kid?"

Leila built a harness that reads a short piece of writing and scores how much it sounds like something an actual child wrote. She trained it on samples from her own classmates. Then she ran it on three different AI models pretending to be kids and, in what she says was the hardest part, on three of her own recent essays.

The headline finding isn't about AI. It's about her. One of her own essays scored lower, meaning less like a real kid, than Model A, one of the three AI models. She left that finding in the report.

The thing itself · Leila's judge, running on 6 samples

The finding Leila left in the report

"My own essay from last month scored lower than Model A. I think it's because I've been writing what I thought my teacher wanted to read instead of what I noticed. The judge can't tell the difference between that and a chatbot. I think my teacher can't either. I'm going to fix my writing before I fix the judge."

An eval that catches AI writing pretending to be a kid. She benchmarks three models. Then she turns it on her own essays, which is the part I keep coming back to. Most harnesses are used to judge other people's work. Leila built one and then pointed it at herself, and then believed the result. That's the move almost nobody makes. — Prof. Weining Zhang, curator