We Taught Our AI to Grade Its Own Homework
Most AI email agents respond and forget. We built one that reviews every email it sends, grades its own performance, and gets better every week: no retraining, no rule rewrites, just a feedback loop.
Deploy an AI email agent. Watch it reply. Watch it make the same mistake next Tuesday that it made last Tuesday. Not because it's broken — it just has no way to look back.
We ran into this early. And the fix turned out to be simpler than I expected.
The "set it and forget it" problem
We built Anna for a real estate client. She handles the info@ inbox: reads incoming emails, classifies them, replies to the straightforward stuff, escalates anything that needs a human.
She worked. But for the first two weeks, she was mis-escalating: questions she could've handled on her own kept going to the team. Nothing catastrophic, just not tight.
The usual fix is to go back into the system prompt, try to anticipate the edge cases you missed, rewrite the rules, redeploy, and repeat. It's a guessing game. You're patching problems you heard about secondhand, usually a week after they happened.
So we tried something different: make the AI show us where it's going wrong, every week, on its own.
How the loop works
Every Sunday night, a script pulls Anna's sent messages from that week — every reply, every escalation — along with the original email that triggered each one.
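A minimal sketch of the pairing step, assuming the week's messages have already been fetched (for example through the Gmail API) and flattened into simple records. The `Message` type and `pair_with_originals` function are hypothetical names for illustration, not the production script.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Message:
    thread_id: str     # conversation the message belongs to
    sender: str
    sent_at: datetime
    body: str
    from_agent: bool   # True if the agent sent it, False if it came in


def pair_with_originals(messages, now, days=7):
    """Return (original, agent_message) pairs for agent messages from the last `days` days."""
    cutoff = now - timedelta(days=days)

    # Group messages by thread, oldest first within each thread.
    by_thread = {}
    for msg in sorted(messages, key=lambda m: m.sent_at):
        by_thread.setdefault(msg.thread_id, []).append(msg)

    pairs = []
    for thread in by_thread.values():
        last_inbound = None
        for msg in thread:
            if not msg.from_agent:
                # Remember the most recent incoming email in this thread.
                last_inbound = msg
            elif msg.sent_at >= cutoff and last_inbound is not None:
                # Agent message inside the window: pair it with its trigger.
                pairs.append((last_inbound, msg))
    return pairs
```

Pairing on thread ID is an assumption here. It keeps each reply or escalation next to the email that triggered it, which is what the grading step needs.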
Then it runs each message through a grading prompt. Three questions:
- Was this the right call? (Handle it vs. escalate)
- Was the reply accurate?
- Was the tone right for this sender?
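Those three questions could be turned into a grading prompt along these lines. This is a sketch under assumptions: the field names (`right_call`, `accurate`, `tone_ok`, `notes`) and both functions are made up for illustration, and the call to the model itself is left out.

```python
import json

# Hypothetical field names mapped to the three grading questions.
QUESTIONS = {
    "right_call": "Was this the right call? (Handle it vs. escalate)",
    "accurate": "Was the reply accurate?",
    "tone_ok": "Was the tone right for this sender?",
}


def build_grading_prompt(original: str, action: str, reply: str) -> str:
    """Assemble the text sent to the model for one graded message."""
    question_lines = "\n".join(f"- {key}: {text}" for key, text in QUESTIONS.items())
    sections = [
        "You are reviewing one decision made by an email agent.",
        f"Original email:\n{original}",
        f"Agent action: {action}",
        f"Agent message:\n{reply}",
        f"Answer each question with true or false:\n{question_lines}",
        'Respond with JSON only, using the keys above plus a short "notes" string.',
    ]
    return "\n\n".join(sections)


def parse_grade(raw: str) -> dict:
    """Parse the model's JSON answer and reject anything malformed."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("grade must be a JSON object")
    for key in QUESTIONS:
        if not isinstance(data.get(key), bool):
            raise ValueError(f"missing or non-boolean field: {key}")
    data.setdefault("notes", "")
    return data
```

Validating the answer before using it matters because a grade with a missing field would otherwise skew the weekly numbers without anyone noticing.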
The output is a JSON report. What went well, what should've been handled directly, what got escalated for no good reason. That lands in the operator's inbox Monday morning and gets pinged to Telegram.
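To make the report shape concrete, here is one way the individual grades might roll up into that weekly JSON. The bucket names and input layout are assumptions for illustration: needless escalations get their own bucket, and accuracy or tone problems land in a catch-all.

```python
import json


def build_report(graded: list[dict]) -> dict:
    """Sort graded messages into buckets and add per-bucket counts."""
    report = {"went_well": [], "needless_escalations": [], "other_issues": []}
    for item in graded:
        grade = item["grade"]
        if all(grade[k] for k in ("right_call", "accurate", "tone_ok")):
            bucket = "went_well"
        elif item["action"] == "escalate" and not grade["right_call"]:
            # Escalated, but the grader thinks the agent could have handled it.
            bucket = "needless_escalations"
        else:
            # Wrong call on a reply, or an accuracy or tone problem.
            bucket = "other_issues"
        report[bucket].append(
            {"subject": item["subject"], "notes": grade.get("notes", "")}
        )
    # Counts are computed from the three buckets before "totals" is added.
    report["totals"] = {name: len(entries) for name, entries in report.items()}
    return report


if __name__ == "__main__":
    sample = [
        {"subject": "Parking question", "action": "escalate",
         "grade": {"right_call": False, "accurate": True, "tone_ok": True,
                   "notes": "Answer was in the listing"}},
    ]
    print(json.dumps(build_report(sample), indent=2))
```

The resulting dict serializes straight to JSON, so the same payload can go into the Monday email and the Telegram ping.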
The operator reviews it, flags anything worth noting, adds a short comment on the bad calls. Those flagged examples get folded into Anna's context the following week.
That's it. That's the whole loop.
No retraining. No GPU. Just context.
Anna isn't a fine-tuned model. She's Claude, called from an agent running on a Mac Mini. There's no training pipeline.
When she gets something wrong and a human flags it, that example gets stored: here's the email, here's what she did, here's what she should've done. Next week it shows up as a few-shot example in her system prompt.
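Folding stored examples into the system prompt might look something like this. It's a sketch with hypothetical names (`FlaggedExample`, `build_system_prompt`), and the cap on how many examples to include is an assumption to keep the prompt from growing without limit.

```python
from dataclasses import dataclass


@dataclass
class FlaggedExample:
    email: str         # the email that came in
    did: str           # what the agent did
    should_have: str   # what the operator says it should have done


def build_system_prompt(
    base_prompt: str,
    examples: list[FlaggedExample],
    max_examples: int = 10,
) -> str:
    """Append the most recent flagged examples to the base prompt as few-shot examples."""
    if not examples or max_examples <= 0:
        return base_prompt

    recent = examples[-max_examples:]  # newest corrections are last in the list
    blocks = []
    for i, ex in enumerate(recent, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Email: {ex.email}\n"
            f"What you did: {ex.did}\n"
            f"What you should have done: {ex.should_have}"
        )
    return (
        base_prompt
        + "\n\nPast corrections from the operator:\n\n"
        + "\n\n".join(blocks)
    )
```

The base prompt stays untouched. Only the appended examples change from week to week, which is the whole point.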
The model doesn't change. The context she runs with gets more precise every week. For a client-specific use case, that's close enough to learning.
What actually happened
By week three, the mis-escalations dropped off. Not because we rewrote any rules — because Anna had her own past mistakes sitting in her context, showing her exactly where the line was.
Honestly, it moved faster than I expected. A few concrete examples did what hours of prompt iteration hadn't.
Why examples beat rules
When you write a system prompt in advance, you're guessing at the edge cases. You write rules for situations you imagine. The cases you didn't imagine are the ones that bite you.
The reflection loop flips it. The AI surfaces what actually happened. You review what actually went wrong. The fix goes back in as a specific example, not a general rule.
Specific examples tend to transfer better. It's the same reason few-shot prompting so often beats a long list of instructions, and it's why this works.
The setup
The pieces aren't complicated:
- An AI email agent (we use Claude + Gmail API)
- A weekly cron job that pulls sent messages with their original context
- A grading prompt with structured JSON output
- A report surface — email, Telegram, Slack, whatever the operator actually checks
- A feedback store — a simple markdown file with flagged examples is enough to start
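For the feedback store, a markdown file really can be this small. The sketch below assumes a made-up entry layout (a `## Flagged example` heading followed by three bullet fields) and hypothetical function names. For the weekly trigger, a crontab schedule such as `0 22 * * 0` (22:00 every Sunday) would fit the "Sunday night" timing, though the exact hour is an assumption.

```python
from pathlib import Path

# Hypothetical field labels used in each markdown entry.
FIELDS = ("Email", "Did", "Should have")


def _one_line(text: str) -> str:
    """Collapse whitespace so each field fits on one markdown bullet."""
    return " ".join(text.split())


def append_example(path: Path, email: str, did: str, should_have: str) -> None:
    """Append one flagged example to the markdown file."""
    entry = (
        "## Flagged example\n"
        f"- Email: {_one_line(email)}\n"
        f"- Did: {_one_line(did)}\n"
        f"- Should have: {_one_line(should_have)}\n\n"
    )
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)


def load_examples(path: Path) -> list[dict]:
    """Read every flagged example back as a dict keyed by field label."""
    if not path.exists():
        return []
    examples = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            examples.append({})  # a heading starts a new entry
        elif line.startswith("- ") and ": " in line and examples:
            key, value = line[2:].split(": ", 1)
            if key in FIELDS:
                examples[-1][key] = value
    return examples
```

A plain file like this is easy for the operator to read and edit by hand, which is most of the appeal at this scale.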
Total weekly cost to run: about $0.10. One of the higher-leverage things we do for client deployments.
The real lesson
Most people assume improving an AI means changing the model — a newer version, a different provider, fine-tuning.
In our experience, it rarely comes down to that. The gap is usually the feedback loop. The model is fine. It just doesn't know what went wrong.
Give it a structured way to look back at its own work, a human to catch what the grading misses, and a handful of concrete examples to carry forward — and you get an AI that tightens up every week without touching a line of code.
Anna is one part of the AI email system we build for clients. If you want to see how this works in practice — or you have an info@ inbox eating your team's time — reach out.