Your AI Wrote the Code. Who Checked It?

How to verify AI-written code so the bugs get caught before they ship, not after

In this article: AI coding tools are fast, confident, and occasionally wrong. The fix is not a smarter model. It is a verification system that runs whether or not anyone remembers to run it. This piece shows how to layer verifiers from the compiler up to production observability, how to wire them into hooks so they fire on every edit, why a fresh-context reviewer beats self-review, and how the official Code Review feature puts a second opinion on every pull request automatically.

There is a sentence every developer using an AI coding tool has heard, and learned to distrust: "This should work." The assistant writes a function, runs nothing, and signs off with a guess dressed up as a result. Sometimes the guess is right. Often enough, it is not, and the bug surfaces three days later in production, where it is far more expensive to fix.

The honest way to read that sentence is this: the code was written, but nobody checked it. And when nobody checks, "done" is just a hope. The discipline that closes the gap is verification, and the goal of this article is to make you good enough at it that you never have to rely on hope again. The trick is not to remember to verify AI-written code. The trick is to build a system where verification happens on its own.

Verification is the third phase of the loop

Claude Code, the agentic coding tool from Anthropic, works in a loop with three phases: gather context, take action, verify. The first phase reads the files, docs, and data the work needs. The second writes and edits code. The third checks whether the code actually does what it was supposed to do.

Most people are good at the first two and weak at the third. That weakness is expensive, because strengthening the third phase is the single highest-leverage thing you can do with an AI coding tool. Without a verifier, the loop never closes. Claude takes an action, declares success, and moves on, and the only thing standing between a confident wrong answer and your main branch is you, reading every line by hand.

Flowchart of the agentic loop: gather context, take action, then verify, with failures looping back into a revision cycle before shipping.

Here is the frame to carry through the rest of this article: a verifier you have to remember to run is a verifier that will not get run. Some Tuesday you will be tired, the change will look small, and you will skip it. Every technique below is designed to move verification from "thing I do" to "thing that happens." The target is not "I remember to test before committing." The target is "tests run automatically because the project is set up to run them."

The verification stack, cheapest to richest

No single verifier catches everything. The strong move is to layer them, because each layer catches a different class of mistake and the layers compound. Here is the progression, from cheapest and fastest to richest and slowest.

The compiler and type checker. Free, instant, and it catches a large share of the simple mistakes. Think tsc --noEmit, mypy --strict, cargo check. If your project has a type system and you are not running it on every change, that is the first thing to fix.

The linter. Style consistency, dead code, obvious anti-patterns. Think eslint, ruff, clippy. It catches what type checkers miss: unused variables, equality-versus-identity bugs, a missing await. It is fast.

Unit tests. The bedrock. They test a function in isolation, they are cheap to run, and they are easy to interpret when they fail. If you have exactly one verifier on a project, make it this one.

Integration tests. These test the function in context, wired to its dependencies, talking to a real database or a fixture. They are slower than unit tests, but they catch the bugs that only exist when components are real. Worth having for any path that crosses a network or process boundary.

End-to-end tests. The full system, exercised the way a user would exercise it. Slowest, most fragile, most expensive, and the most convincing when they pass. A browser-automation tool like Playwright turns these from "I will write some Playwright eventually" into "Claude runs them as part of the work."

Production observability. Once code is deployed, the verifier becomes the logs and the metrics. An errors integration like Sentry, a metrics integration like Datadog. This is the post-deploy check that says, plainly, "yes, the error rate dropped after the fix landed."

Block diagram of the six-layer verification stack, from compiler and linter at the cheap top down to production observability at the rich bottom.

The pattern across all six layers is the same: Claude needs a tool that returns truth, not opinion. The compiler succeeds or fails. The test passes or fails. The screenshot matches or it does not. The error query returns zero unresolved issues or it does not. Every layer hands back a verdict that Claude can read and respond to.

You do not need all six. Most projects get the most value from the first three, plus a sprinkle of the rest where it matters. The honest progression: today, get at least one runnable test command, even a thin one, as long as it exits with a useful code. This month, add lint and type-check to the chain. This quarter, add the integrations for browser, database, and logs, so the loop closes for every layer of the system.
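As a rough illustration of the layering idea, the sketch below runs a list of verifier commands cheapest-first and stops at the first one that exits non-zero. The commands in DEFAULT_STACK are assumptions for a TypeScript project, not something the stack prescribes, and run_stack is a hypothetical helper name; substitute whatever your project actually uses.

```python
import subprocess

# Hypothetical chain for a TypeScript project, cheapest layer first.
DEFAULT_STACK = [
    ("type-check", ["npx", "tsc", "--noEmit"]),
    ("lint", ["npx", "eslint", "."]),
    ("unit tests", ["npm", "test", "--silent"]),
]


def run_stack(stack):
    """Run each (name, argv) pair in order and return (ok, results).

    results holds (name, exit_code) for every layer that actually ran.
    The chain stops at the first failure, so cheap layers gate slow ones.
    """
    results = []
    for name, argv in stack:
        try:
            code = subprocess.run(argv).returncode
        except FileNotFoundError:
            code = 127  # a missing tool counts as a failed layer
        results.append((name, code))
        if code != 0:
            return False, results
    return True, results
```

Calling run_stack(DEFAULT_STACK) and exiting non-zero when ok is False gives a single command whose exit code summarizes the whole chain, which is the "useful code" the paragraph above asks for.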

Make the verifier impossible to skip

Knowing the stack is not enough. The whole problem is human memory, and the solution is to remove the human from the loop. Claude Code does this with hooks: small commands that fire automatically on specific events.

A PostToolUse hook runs a command every time Claude edits or writes a file. Wire your linter or type checker into one, and verification stops being optional. On failure the hook output goes back into Claude's context, so on its next action Claude sees the lint or type error and reacts to it. That is the closed loop in miniature: Claude edits, the hook runs, the failure surfaces, Claude reads the failure, Claude fixes it. No human intervention.

Here is a PostToolUse hook that lints and type-checks every edited file:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "jq -r '.tool_input.file_path' | xargs eslint --fix" },
          { "type": "command", "command": "if jq -r '.tool_input.file_path' | grep -q '\\.ts$'; then npx tsc --noEmit; fi" }
        ]
      }
    ]
  }
}
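When the one-liners outgrow a JSON string, the same logic can live in a small script that the hook calls. The sketch below is one possible shape, not the official approach. The payload shape (a JSON object on stdin carrying tool_input.file_path) is inferred from the jq expression in the hook above. The exit code 2 convention for surfacing stderr to Claude and the extension-to-command table are assumptions to check against the hooks reference for your version.

```python
import json
import subprocess
import sys

# Assumed mapping from file extension to checker commands.
# The edited file's path is appended as the final argument.
CHECKS = {
    ".ts": [["npx", "eslint", "--fix"]],
    ".tsx": [["npx", "eslint", "--fix"]],
    ".py": [["ruff", "check", "--fix"]],
}


def file_path_from_event(raw):
    """Pull tool_input.file_path out of the hook payload, or return None."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    tool_input = event.get("tool_input")
    if not isinstance(tool_input, dict):
        return None
    path = tool_input.get("file_path")
    return path if isinstance(path, str) else None


def commands_for(path):
    """Return the argv lists to run for this file; empty if none apply."""
    for ext, commands in CHECKS.items():
        if path.endswith(ext):
            return [cmd + [path] for cmd in commands]
    return []


def main(raw):
    """Return the exit code the hook should finish with."""
    path = file_path_from_event(raw)
    if path is None:
        return 0  # nothing to check
    for argv in commands_for(path):
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:
            sys.stderr.write(f"could not run {argv[0]}: {exc}\n")
            return 2
        if proc.returncode != 0:
            # Assumption: exit code 2 plus stderr is what reaches Claude.
            sys.stderr.write(proc.stdout + proc.stderr)
            return 2
    return 0
```

To use it as a hook command, end the script with sys.exit(main(sys.stdin.read())). Keeping the parsing and the command table in plain functions makes the hook itself something you can unit test.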

The other hook worth installing is a Stop hook, which fires when Claude finishes responding. Point it at your test suite and Claude cannot claim "done" until the tests have actually run. The cost is real: a turn that would have ended in one second now takes thirty. The payoff is real too. "Done" stops being a guess.

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "npm test --silent 2>&1 | tail -50" }
        ]
      }
    ]
  }
}

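The tail -50 matters. Trimming hook output aggressively keeps the failure context tight and useful instead of burying the signal under a thousand lines of passing tests. One caveat: a shell pipeline reports the exit status of its last command, so this hook exits with tail's status, not npm's. If anything downstream relies on the exit code to detect a failed run, enable pipefail (for example, by running the pipeline under bash with set -o pipefail) or trim the output in a wrapper script.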
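One way to get both a trimmed log and the test runner's own exit status is to do the trimming in a small wrapper rather than in a pipe. This is a minimal sketch; run_trimmed is a hypothetical helper name, and the default of 50 lines simply mirrors the hook above.

```python
import subprocess


def run_trimmed(argv, keep=50):
    """Run argv and return (exit_code, last `keep` lines of its output)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
        text=True,
    )
    lines = proc.stdout.splitlines()
    tail = lines[-keep:] if keep > 0 else []
    # The exit code is the command's own, not the trimming step's.
    return proc.returncode, "\n".join(tail)
```

A hook script built on this can print the tail and then exit with the returned code, so a failing suite is still reported as a failure.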

The strongest version of this combines a skill and a hook. A skill is a reusable, named command you can define once and call by name. Define a /ship skill that runs test, lint, type-check, and build in order, stopping on the first failure. Then wire a Stop hook to call it on every turn:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "claude -p '/ship' --output-format text" }
        ]
      }
    ]
  }
}

One skill defines the verification chain. One hook makes it run every turn. The two are independent: change what gets verified by editing the skill, change when it runs by editing the hook. That separation is the whole reason this scales.

The writer/reviewer split: a fresh context catches more

Hooks catch mechanical failures: a broken type, a failing test, a lint violation. They do not catch architectural mistakes, subtle race conditions, or a design that is technically correct and quietly wrong. For that, you need review. And here is the counterintuitive part: the best review does not come from asking the same session to check its own work.

Anthropic's official best-practices guidance makes a claim that is easy to underweight. A fresh context improves code review, because Claude will not be biased toward code it just wrote. This is the writer/reviewer pattern, and it produces noticeably better review than self-review.

The mechanism is straightforward. When Claude writes code, the writing process anchors it on the chosen approach. The session has just spent forty messages committing to a particular design, so it is invested. Reviewing your own work in that state catches typos and misses architecture. A fresh context, looking at the same code with none of the writing process attached, asks different questions.

Sequence diagram of the writer/reviewer pattern: Writer-Claude builds the code, a fresh-context Reviewer-Claude audits it, and the feedback loops back for revision and re-verification.

There are three ways to set this up, in increasing order of automation.

Two terminal sessions. Open a second terminal, start a fresh Claude session, and paste a review prompt: "Review the rate limiter in src/middleware/rateLimiter.ts. Look for edge cases, race conditions, and consistency with our existing middleware patterns." Manual, but it is the lowest-friction way to try the pattern once and feel the difference. Same model, different context, substantially better outcome.

A reviewer subagent. A subagent is a specialized helper with its own isolated context that Claude can delegate to. Promote the reviewer into a custom code-reviewer subagent, and the writer session can simply ask Claude to "use the code-reviewer agent to review the rate limiter changes." You get the second opinion without switching terminals.

A Stop hook that fires the reviewer. When even the delegation cost is too high, automate it: a Stop hook calls the reviewer subagent on every turn that includes file edits. Most projects do not need this for every turn, but for a high-stakes branch or an end-of-day sweep, it is the move.

The deeper principle: the more layers of independent review you stack, the more issues you catch. Writer plus reviewer is the minimum. Writer plus reviewer plus security-reviewer plus test-coverage-checker is the elite version. Anthropic itself runs subagents adversarially (the internal phrasing is "make subagents fight"), landing on three or four reviewers that disagree with each other before a human accepts.

Review skills and review subagents

Two artifacts show up in most mature setups, and the difference between them is worth understanding.

A /security-review skill is a user-triggered command. You run it deliberately, before pushing, when you decide a change needs a security pass. Its definition injects the actual diff so Claude reviews the real change, not its mental model of it:

---
name: security-review
description: Reviews code changes for security vulnerabilities
disable-model-invocation: true
argument-hint: <branch-or-path>
---

## Diff to review

!`git diff $ARGUMENTS`

Audit the changes above for:
1. Injection vulnerabilities (SQL, XSS, command)
2. Authentication and authorization gaps
3. Hardcoded secrets or credentials
4. Insecure dependencies

Report findings with severity ratings and remediation steps.

The disable-model-invocation: true line keeps Claude from triggering this on its own. A security review is something you fire deliberately. Run /security-review main before opening a pull request.

A code-reviewer subagent is a delegate Claude can spawn autonomously when the work calls for review. Same review logic, different invocation mode. Its definition gives it a checklist and a priority structure: critical issues that must be fixed, warnings that should be fixed, suggestions worth considering. Most projects benefit from both artifacts: the skill for deliberate pre-push review, the subagent for ad-hoc review during work.

There is also a productized version. The /code-review plugin from the official marketplace runs four parallel review agents: two for convention compliance, one for bug detection, one for git-blame context. It scores each finding's confidence and posts only those at 80 or above, which filters out false positives. That is the writer/reviewer pattern shipped as a product, four independent reviewers instead of one.
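The thresholding idea is easy to reuse in your own reviewer setup. The sketch below illustrates the technique only and does not describe the plugin's internals. The Finding fields and the 0 to 100 scale are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    reviewer: str    # hypothetical label for the agent that raised it
    message: str
    confidence: int  # assumed to be on a 0-100 scale


def keep_confident(findings, threshold=80):
    """Drop findings whose confidence falls below the threshold."""
    return [f for f in findings if f.confidence >= threshold]
```

Raising the threshold trades recall for fewer false alarms, so it is worth tuning against a handful of pull requests whose real bugs you already know.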

The official Code Review feature

The most automated version of code review available right now needs no CLI step and no manual trigger at all. Claude reviews every pull request you open, posts findings as inline comments on the affected lines, and tags each finding by severity. It is the writer/reviewer pattern at team scale.

The feature is in research preview, available on Team and Enterprise plans, and not available for organizations with Zero Data Retention enabled. Setup is admin-level: someone with permission to install GitHub Apps must install it on the repositories you want reviewed.

When a pull request opens, Anthropic's infrastructure spins up a fleet of agents that look at the diff and the surrounding code in parallel. Each agent hunts a different class of issue: logic errors, security vulnerabilities, broken edge cases, subtle regressions. A verification step then checks each candidate finding against actual code behavior to filter false positives. Whatever survives gets deduplicated, ranked by severity, and posted as inline GitHub comments.

State diagram of the Code Review pipeline: a PR triggers a parallel agent fleet, candidate findings pass a verification step, false positives drop out, and ranked findings post as inline comments.

Findings carry one of three severity tags. A red Important flag is a bug that should be fixed before merging. A yellow Nit is a minor issue, worth fixing but not blocking. A purple Pre-existing flag marks a bug that already lives in the codebase and was not introduced by this pull request. Each finding includes a collapsible extended reasoning section explaining why Claude flagged it and how it verified the problem. The findings do not approve or block the pull request, so your existing review workflow stays intact.
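To make the deduplicate-and-rank step concrete, here is a minimal sketch of how findings with these three tags could be merged and ordered. It is not Anthropic's pipeline. The Comment fields, the duplicate key, and the ordering (Important, then Nit, then Pre-existing) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ranking: lower number sorts first.
SEVERITY_ORDER = {"Important": 0, "Nit": 1, "Pre-existing": 2}


@dataclass(frozen=True)
class Comment:
    path: str
    line: int
    severity: str  # one of the keys in SEVERITY_ORDER
    message: str


def dedupe_and_rank(comments):
    """Keep one comment per (path, line, message), most severe first."""
    unique = {}
    for c in comments:
        # setdefault keeps the first comment seen for each key.
        unique.setdefault((c.path, c.line, c.message), c)
    return sorted(
        unique.values(),
        key=lambda c: (SEVERITY_ORDER[c.severity], c.path, c.line),
    )
```

Several agents looking at the same diff will often report the same problem, which is why the merge has to happen before anything is posted.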

Three trigger modes are worth knowing. Run review once after pull request creation for the lightest-weight option. Run it after every push for the most thorough and most expensive coverage, which also auto-resolves threads as you fix flagged issues. Or keep it manual, triggered only when someone comments @claude review, which is the right call for high-traffic repos where you want to opt specific pull requests in. You can also tune what Claude flags with a CLAUDE.md or REVIEW.md file: the conventions Claude uses to edit code become the conventions it uses to review it.

Pricing is pay-per-review and scales with pull request size, and a review completes in about 20 minutes on average. For repos that do not qualify, the alternative is /ultrareview: the same multi-agent pattern, run on demand from the CLI. The first runs are free, and subsequent runs cost roughly $5 to $20 each.

Mindmap comparing three review options: local /review for quick feedback, cloud /ultrareview for pre-merge confidence, and the official Code Review feature for automatic per-PR review.

The right move for most teams: use /review during work for fast local feedback, and /ultrareview or the Code Review feature before merging. Catching a bug at review time costs a small fraction of what it costs to catch it in production.

Treat AI output like a fast junior engineer's

Step back and notice the assumption underneath everything here. This whole article is built on the premise that Claude's first draft has bugs, that its self-review will miss what its first draft missed, and that the answer is layered verification rather than trusting any single pass.

That is a more honest framing than "prompt the AI better and it will get it right the first time." Better prompts help. Better context helps. Better skills and rules help. But an AI coding tool is a model running in a loop with imperfect inputs and stochastic outputs, and even excellent prompts produce occasional bugs.

So treat the output the way you would treat a fast junior engineer's output: reasonably trustworthy, often correct, and always reviewed before it goes anywhere important. We do not ship a junior engineer's pull request straight to production because we trust them. We ship it after CI and review. The verifier stack for AI-written code is the same idea, just automated more aggressively.

This is also why the layers compound. Tests catch one class of mistake. Lint catches another. The reviewer subagent catches a third. Code Review catches a fourth. None of them catch everything. Together, they catch almost everything, and human review handles the rest. The goal is not perfection. It is leverage: getting from "I have to review every line the AI writes" to "I review the diff and trust most of it because it survived four passes of automated checking."
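A back-of-the-envelope calculation shows why stacking imperfect layers pays off. The sketch below assumes each layer catches a bug independently of the others, which is a simplification: real layers overlap, so the true figure is lower. The rates passed in are invented for illustration, not measurements.

```python
from math import prod


def combined_catch_rate(rates):
    """Chance that at least one layer catches a bug, assuming independence.

    Each rate is the probability (0.0 to 1.0) that a single layer catches it.
    """
    if any(not 0.0 <= r <= 1.0 for r in rates):
        raise ValueError("each rate must be between 0 and 1")
    # A bug escapes only if every layer misses it.
    return 1.0 - prod(1.0 - r for r in rates)
```

With four layers that each catch only half of the bugs that reach them, combined_catch_rate([0.5, 0.5, 0.5, 0.5]) comes to 0.9375: no single layer is impressive, but together they leave far less for human review.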

Do this today

Three concrete moves, in order of payback.

Write the /ship skill. Even if the verifier chain is just npm test && npm run lint, that is enough to start. The pattern matters more than the comprehensiveness. You can deepen the chain later.

Add the code-reviewer subagent. Use /agents, choose Create new agent, describe the agent in plain English, pick read-only tools, and let Claude generate the definition. It takes five minutes. It pays back the first time you ask Claude to use the code-reviewer agent on a change and it catches something the first pass missed.

Turn on automatic review. If you are on Team or Enterprise, enable the Code Review feature on your most important repository. If you are on Pro, run /ultrareview once before a substantial merge to see what the multi-agent pattern catches. The first time it surfaces a bug you would have shipped is the moment this pays for itself.

The bug was not unlucky. It was unverified.

Think about the compounding effect over a few months. Every layer of verification you add is one more way AI-written code gets checked without anyone having to remember to check it. The compiler, the linter, the test suite, the reviewer subagent, the Code Review feature: each one is a net, and the nets overlap.

Six months in, the failure mode that used to be "the AI said it works and it does not" becomes "the test, the type check, the reviewer agent, and Code Review all agree this works." Your trust level moves accordingly, and it moves because it was earned by evidence, not granted on faith.

The next time an AI coding tool tells you a change "should work," do not argue with it. Just make sure the answer comes from a verifier, not a vibe. Code that nobody checked is not done. It is just written.

This is Part 12 of "Claude Code, Day-to-Day," a 19-part guide to mastering Claude Code for working engineers.