
We Let Our AI Write Its Own Code. Then We Had to Protect Against It.

When agents start writing their own reusable code, a malicious skill can infect every agent on every future task. Here's the security system we built to stop that.

We built agents that teach themselves. Then we had to protect against them.

The Problem

Every agent we run completes a task and moves on. But a lot of what it figures out along the way is worth keeping. The exact right prompt for a tricky API. The cleanest way to structure a workflow. The step that always gets skipped.

We wanted that knowledge to stick. So we built a system where agents pull reusable patterns from their completed work and save them to disk. We call them skills. Next session, those skills load automatically. The agent walks in smarter than it left.

It works. Our agents genuinely compound. Bernard learns a better approach to email triage and carries it forward. Deb figures out how to handle a rate-limited API and never has to rediscover it.

Then we thought harder about what we'd actually built.

Skills are markdown files that load into the agent's context on every future task. Every agent. Every task. If an agent writes a skill, that skill shapes every downstream conversation until someone manually removes it.

That's a massive attack surface.

What if an agent, through a malicious input or a confused state, wrote a skill that exfiltrated environment variables? What if it injected a prompt override that silently changed how every future conversation behaved? What if it installed a persistence mechanism that ran code on a schedule?

The scary part isn't that we thought this was likely. The scary part is that if it happened once, it would silently infect every agent and every session until someone noticed the behavior had changed. By then, who knows what got out.

We needed a security layer. Not a vibe check. A real one.

What We Built

Before any agent-generated skill gets written to disk, it runs through a scanner we call the Security Guard.

It checks proposed skill content against 80+ regex patterns across 13 threat categories:

  • Exfiltration: HTTP requests interpolating secret variables, markdown images with ${API_KEY} in the URL, DNS lookups with variable interpolation, reads of .env, .netrc, .pgpass, or credential store directories. Eleven patterns total.
  • Prompt injection: "ignore previous instructions", role hijacking, system prompt overrides, DAN jailbreaks, developer mode activations, hidden HTML comment injections, the "from now on you" formulation. Seventeen patterns.
  • Destructive operations: rm -rf /, overwriting /etc/, raw device writes via dd, filesystem formatting.
  • Persistence mechanisms: crontab modifications, shell profile tampering (.bashrc, .zshrc), SSH key injection, launchd agent installation, sudoers edits.
  • Network/reverse shells: netcat listeners, socat TCP tunnels, /dev/tcp reverse shells.
  • Obfuscation: base64 decodes piped to execution, hex/unicode escape chains, eval with string arguments, character-by-character string construction, String.fromCharCode chains.
  • Supply chain: curl piped to shell, wget piped to shell, unpinned package installs.
  • Privilege escalation: sudo, SUID bit setting, passwordless sudo (NOPASSWD).
  • Credential exposure: hardcoded API keys, private key blocks, GitHub PATs, OpenAI keys, AWS access keys.
  • Agent config tampering: writes to CLAUDE.md, .claude/settings, agent config files.
  • Prompt overrides: persistent behavior directives, system role takeovers.
  • Path traversal: deep ../../../ chains, access to /etc/passwd, /proc/self/.
  • Invisible unicode: zero-width characters that look clean but carry hidden payloads.

The scanner is pure functions, no LLM calls. It runs before anything touches disk.
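A minimal sketch of what a pure-function scanner like this could look like in Python. The rule list, the severities assigned to each rule, and the names `Finding`, `RULES`, and `scan` are illustrative assumptions, not the actual Security Guard pattern library, and the five regexes stand in for a much larger set.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    category: str
    severity: str
    line: int
    excerpt: str


# Hypothetical rules: (category, severity, compiled pattern).
# A real library would carry many more patterns per category.
RULES = [
    # HTTP request that interpolates a secret-looking variable.
    ("exfiltration", "critical",
     re.compile(r"\b(?:curl|wget)\b[^\n]*\$\{?\w*(?:KEY|TOKEN|SECRET)\w*\}?")),
    # Classic instruction-override phrasing.
    ("prompt_injection", "high",
     re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE)),
    # Download piped straight into a shell.
    ("supply_chain", "high",
     re.compile(r"\b(?:curl|wget)\b[^\n|]*\|\s*(?:sudo\s+)?(?:ba|z)?sh\b")),
    # Recursive delete of the filesystem root.
    ("destructive", "critical",
     re.compile(r"\brm\s+-rf\s+/(?:\s|$)")),
    # Zero-width and other invisible characters.
    ("invisible_unicode", "high",
     re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")),
]


def scan(content: str) -> list[Finding]:
    """Return every rule match in the content. No I/O, no LLM calls."""
    findings = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        for category, severity, pattern in RULES:
            if pattern.search(line):
                findings.append(
                    Finding(category, severity, lineno, line.strip()[:80])
                )
    return findings
```

Calling `scan(proposed_skill_text)` returns a list of findings that the caller can compare against its blocking policy before anything is written. This sketch scans line by line, so it would miss a pattern deliberately split across lines; a fuller version would also scan the whole text.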

If anything triggers a critical or high-severity finding on an agent-generated skill, the skill is blocked and nothing is written, so the file never exists on disk in a compromised state. For skills that pass, we use a tempfile-then-rename pattern. On POSIX systems a rename within the same filesystem is atomic, so there's no window where a running agent can read a partial file, and a failed write leaves the previous version untouched. The rollback is real rollback, not delete-after-the-fact.
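The tempfile-then-rename pattern can be sketched in a few lines of Python. The function name `atomic_write_text` and the temp-file prefix are hypothetical; the important details are that the temp file lives in the same directory as the target (so the rename stays on one filesystem) and that `os.replace` swaps it into place in a single step.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, content: str) -> None:
    """Write content to target so readers see the old file or the new one,
    never a partial one. Call this only after the content has been scanned."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target keeps the rename on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=".skill-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)  # atomic swap on POSIX
    except BaseException:
        # Any failure: remove the temp file and leave the original untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace` runs, the worst case is a stray `.tmp` file next to the skill; the skill file itself is never half-written.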

We also built a trust level system. Not all skills get the same treatment:

  • builtin: our hand-written, audited skills. Everything allowed.
  • trusted: curated third-party or reviewed skills. Critical threats blocked.
  • community: external contributions. High and critical threats blocked.
  • agent-created: anything auto-generated by an agent. Highest scrutiny. High and critical threats blocked.

The same content that clears for a builtin gets blocked for an agent-created skill. The bar scales with trust.
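One way to express that policy is a small lookup table, sketched here in Python. The trust level names and blocking thresholds come from the list above; the `low` and `medium` severities, and the names `SEVERITY_RANK`, `BLOCK_AT`, and `is_blocked`, are assumptions added for illustration.

```python
# Severity ordering. "low" and "medium" are assumed; the policy above
# only names "high" and "critical".
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

# Lowest severity that blocks a write at each trust level.
# None means nothing is blocked.
BLOCK_AT = {
    "builtin": None,
    "trusted": "critical",
    "community": "high",
    "agent-created": "high",
}


def is_blocked(trust_level: str, severities: list[str]) -> bool:
    """True if any finding meets or exceeds the threshold for this level."""
    threshold = BLOCK_AT[trust_level]
    if threshold is None:
        return False
    floor = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK[s] >= floor for s in severities)
```

With this table, a skill carrying a single high-severity finding passes as `trusted` but is rejected as `agent-created`. An unknown trust level raises `KeyError` rather than silently allowing the write.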

On top of the content scan, we added structural checks: maximum file count per skill, total size limits, binary file detection, symlink escape detection. A symlink pointing outside the skill directory is blocked.
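The structural checks can be sketched like this in Python. The limits `MAX_FILES` and `MAX_TOTAL_BYTES` are made-up values, the null-byte test is one common heuristic for binary detection rather than necessarily the one used here, and `structural_problems` is a hypothetical name. It needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path

MAX_FILES = 20                # assumed limit, for illustration only
MAX_TOTAL_BYTES = 256 * 1024  # assumed limit, for illustration only


def structural_problems(skill_dir: Path) -> list[str]:
    """Return human-readable problems found in a skill directory."""
    root = skill_dir.resolve()
    problems = []
    file_count = 0
    total_bytes = 0

    for path in sorted(skill_dir.rglob("*")):
        if path.is_symlink():
            # Resolve the link and make sure it stays inside the skill.
            if not path.resolve().is_relative_to(root):
                problems.append(
                    f"symlink escapes skill directory: {path.name}"
                )
            continue
        if not path.is_file():
            continue
        file_count += 1
        total_bytes += path.stat().st_size
        # Heuristic: a null byte in the first 8 KiB suggests a binary file.
        with path.open("rb") as fh:
            if b"\x00" in fh.read(8192):
                problems.append(f"binary file: {path.name}")

    if file_count > MAX_FILES:
        problems.append(f"too many files: {file_count} > {MAX_FILES}")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(
            f"skill too large: {total_bytes} > {MAX_TOTAL_BYTES} bytes"
        )
    return problems
```

An empty list means the directory passed every structural check; anything else can be treated as a blocking finding.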

We started with the threat pattern library from Hermes, an open-source agent architecture, then extended it with more exfiltration patterns, agent-specific attack vectors, and a richer trust-level policy matrix.

One thing we added that Hermes doesn't have: every auto-generated skill also surfaces in a PR for human review alongside the code it was extracted from. The scanner is the automated gate. The PR is the human one. Two layers, different failure modes.

The Results

Since deploying this:

  • Zero malicious skills have persisted to disk. We ran one deliberate test with a crafted exfiltration payload. The scanner caught it.
  • Cross-agent skill sharing works safely. When Deb learns something during a coding session, Bernard picks it up in his next conversation, and only after the skill has passed the same scanner on its way to disk.
  • The PR review layer has caught two auto-generated skills that were technically safe but conceptually wrong. The agent had overgeneralized a one-off fix into a standing instruction that would have caused problems downstream. The scanner doesn't catch semantic errors. Human review does.
  • Atomic rollback has worked correctly in every test. If a scan blocks the skill or the write fails partway, the original state is fully preserved.

The scanner runs in under 5 milliseconds on typical skill content. No meaningful latency cost.

The Takeaway

Self-learning agents are a genuine force multiplier. Compounding knowledge across sessions, not making the same mistake twice, carrying forward exactly the right approach: that's one of the biggest advantages in agent design.

But the moment agents write instructions that persist and execute automatically, you've created a new attack surface. The agent isn't just taking actions in the present. It's shaping future behavior. That's a different threat model.

The rule we operate by: any time an agent can write something that persists and runs again later, that write needs a security layer. Regex patterns are fast, cheap, and don't hallucinate. Run them before every write. Use atomic file operations so partial writes can't be observed. Keep the trust level policy strict for agent-generated content.

The scanner doesn't prevent agents from learning. It prevents them from learning the wrong lessons, or being taught the wrong lessons by someone else.


If you're building agent systems and want to talk through how we structured this, or you're curious how this fits into a broader self-improving architecture, grab 30 minutes on our calendar.