Code Review Is the Missing Agent
AI coding tools ship fast but skip review — like a team with no PR process. The fix is a dedicated review agent.
The Speed Trap
Every AI coding tool sells the same promise: write code faster. Copilot autocompletes your functions. Cursor rewrites your files. Claude Code scaffolds entire features in a single conversation. And they deliver — I've watched tasks that used to take an afternoon collapse into twenty minutes.
That speed is the feature. It's also the problem.
Generation speed without review is just faster accumulation of technical debt. Think about how a well-run engineering team works. You write a PR. Someone else reviews it. They catch the thing you missed — the convention you forgot, the edge case you didn't test, the security assumption you made at 2am. The review is where quality happens.
Now look at every AI coding workflow in production today. The model generates code. You glance at it. You accept it. Maybe you run the tests. Maybe. There's no structured review step. No second set of eyes. No checklist.
It's like having a team of ten developers who all write PRs and none of them ever review each other's work. You'd never run a team that way. But that's exactly how most people use AI coding tools.
What Happens Without Review
I've been building Agent Team — a multi-agent system where specialized AI agents collaborate on software projects — and the early versions had no dedicated review step. The coding agent would generate, I'd eyeball the output, and we'd ship. Here's what I learned the hard way.
Style drift is the first thing you notice. The AI generates working code, but it doesn't match your codebase's conventions. You use named exports; it uses default exports. You put error handling in a middleware layer; it puts try-catch blocks inline. You use a specific logger; it imports a different one. Each instance is minor. After a week, your codebase looks like it was written by five different teams.
Security blind spots are the scariest. Generated code that works perfectly in the happy path but has injection vulnerabilities, missing auth checks, or hardcoded secrets. I caught a generated API route that accepted user input and passed it directly into a database query without sanitization. It worked great in testing. It would have been a disaster in production. The model wasn't being careless — it was optimizing for the task I gave it, which was "build this endpoint," not "build this endpoint securely."
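To make that failure mode concrete, here's a minimal sketch of the pattern a reviewer should flag. The function names and query shapes are my illustration, not the actual generated route from the incident:

```typescript
// Illustrative only: hypothetical names, not the real endpoint described above.

// What the generated route effectively did: splice user input into the SQL string.
function unsafeQuery(userInput: string): string {
  return `SELECT * FROM users WHERE name = '${userInput}'`;
}

// What review should demand: keep SQL and values separate so the driver
// binds the input as inert data (placeholder syntax varies by driver).
function safeQuery(userInput: string): { text: string; values: string[] } {
  return { text: "SELECT * FROM users WHERE name = $1", values: [userInput] };
}

const malicious = "' OR '1'='1";
const injected = unsafeQuery(malicious).includes("OR '1'='1"); // true: attacker text lands inside the SQL
const bound = safeQuery(malicious).values[0] === malicious;    // true: the same text stays plain data
```

Both versions "work great in testing" with friendly input, which is exactly why the happy-path-focused coding agent never notices the difference.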
Dead code accumulates silently. The AI generates a utility function to solve a problem. Next session, it generates a slightly different utility function for a similar problem because it doesn't remember the first one. Now you have two functions that do almost the same thing, and neither will ever get cleaned up because no one knows the other exists. Multiply this across a few weeks and you've got a codebase littered with orphaned helpers, unused imports, and redundant abstractions.
Inconsistency is the most insidious. The same problem gets solved three different ways across the codebase. Date formatting done with a library in one file, a custom function in another, and inline logic in a third. Not because any approach is wrong, but because the AI doesn't check what already exists before generating something new. It's not lazy — it just doesn't have the context. And without a review step, nobody catches it.
Why Self-Review Fails
The obvious response is: "Just ask the AI to review its own code." I tried this extensively. It doesn't work — at least not well enough.
The model rationalizes its own decisions. When you ask Claude to generate code and then ask the same Claude session to review it, you get the AI equivalent of "looks good to me" on your own PR. The model has already committed to its approach. It will defend the patterns it chose rather than questioning them. It's confirmation bias, but computational.
Self-review catches surface-level issues — syntax errors, missing null checks, obvious bugs. That's useful but not sufficient. The hard problems are architectural: Does this code follow the patterns already established in the codebase? Is this the right abstraction? Are we duplicating something that already exists? Does this introduce a security surface we haven't accounted for?
These are judgment calls that require different context than the coding context. The coder is thinking about the task: "Build feature X." The reviewer is thinking about the system: "Does feature X fit into the codebase correctly?" Same code, fundamentally different lens.
This is the key insight that changed how I built Agent Team. Review isn't just a second pass over the same work. It's a different mode of evaluation with different inputs, different priorities, and different success criteria.
The Codex Agent Pattern
In Agent Team, the solution is a dedicated agent I call Codex — the code review agent. It has one job: evaluate code changes against an 8-point checklist before anything gets merged. The checklist isn't arbitrary; it's the distillation of every category of issue I kept catching in unreviewed AI-generated code.
1. Correctness. Does the code do what it claims to do? Not just in the happy path — are edge cases handled? Are error states accounted for? This is the baseline.
2. Security. Input validation. Auth checks. Secret management. SQL injection. XSS. The review agent explicitly looks for these because the coding agent's job is to make things work, not to make things safe. Different objective, different attention.
3. Style consistency. Does the code match the codebase's conventions? Naming, file structure, import patterns, error handling approach. The review agent reads the project's conventions file and checks against it. The coding agent might know the conventions exist but deprioritizes them when focused on solving a problem.
4. Test coverage. Are the critical paths tested? Not "does a test file exist" but "do the tests actually exercise the important behavior?" The review agent checks whether edge cases and error paths have coverage, not just the happy path.
5. Error handling. Are errors caught, logged, and surfaced appropriately? Is the error handling pattern consistent with the rest of the codebase? Do errors provide enough context to debug?
6. Performance. Any obvious N+1 queries, unnecessary re-renders, or O(n^2) loops? The review agent isn't running benchmarks — it's catching the things an experienced engineer would flag in a PR review.
7. Documentation. Are public APIs documented? Are complex logic blocks commented? Are there any magic numbers or non-obvious decisions that need explanation?
8. Dead code removal. Did this change leave behind unused imports, orphaned functions, or commented-out code? The review agent checks the diff and the surrounding context to flag anything that's no longer needed.
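A checklist like this can be encoded directly into the review prompt. The sketch below is my illustration of that structure, not Agent Team's actual implementation; every name and wording choice is an assumption:

```typescript
// Illustrative: the eight checklist points as data, rendered into a review prompt.
const CODEX_CHECKLIST: string[] = [
  "Correctness: edge cases and error states, not just the happy path",
  "Security: input validation, auth checks, secrets, SQL injection, XSS",
  "Style consistency: naming, file structure, imports, error-handling patterns",
  "Test coverage: do the tests exercise important behavior and error paths?",
  "Error handling: caught, logged, surfaced consistently, with debug context",
  "Performance: N+1 queries, unnecessary re-renders, accidental O(n^2) loops",
  "Documentation: public APIs, complex logic, magic numbers explained",
  "Dead code: unused imports, orphaned functions, commented-out blocks",
];

// Build the reviewer's prompt from its distinct context: conventions, the
// full diff, and the checklist, inputs the coding agent never sees together.
function buildReviewPrompt(conventions: string, diff: string): string {
  const items = CODEX_CHECKLIST.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    "You are a code reviewer. Evaluate the diff against every checklist item.",
    "For each issue, give a specific, actionable fix referencing existing patterns.",
    "",
    "Project conventions:",
    conventions,
    "",
    "Checklist:",
    items,
    "",
    "Diff:",
    diff,
  ].join("\n");
}
```

The point of the structure: the checklist is data, so the same prompt builder works whether it has eight items or three.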
Here's what makes this work: the review agent has different context than the coding agent. The coding agent has the task description and the relevant source files. The review agent has the codebase conventions, existing patterns, the full diff, and the checklist. They're looking at the same code through completely different lenses. That's why it catches things self-review misses.
The review agent doesn't just flag issues — it sends them back to the coding agent with specific instructions. "This endpoint accepts raw user input in the query parameter without sanitization — add input validation using the existing sanitize middleware from utils/security.ts." Actionable, specific, referencing existing codebase patterns.
Making It Practical
You don't need to build Agent Team to get this benefit. The principle is simple: separate generation from review, and give the reviewer different context.
Add a review step to any AI coding workflow. After the AI generates code, don't just accept it. Run the diff through a checklist-driven review before you merge, every time, even when the change looks trivial.
Use a separate conversation or session for that review. This is the single most impactful change you can make. Same model, different context: a fresh session has no memory of the generation decisions, so it hasn't committed to any approach and evaluates the code on its merits rather than defending its creation. The clean context is the point. I've seen a fresh session catch issues that three rounds of self-review in the same conversation missed entirely.
Define a checklist. It doesn't have to be eight items. Even three or four will catch most issues. My minimum viable checklist:
- Does this follow existing codebase patterns?
- Are there security concerns (input validation, auth, secrets)?
- Is there dead code or duplication?
- Are errors handled consistently?
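If you're pasting diffs into a fresh session by hand, even this four-item list is worth wiring into a reusable prompt. A minimal sketch, with wording that is mine and purely illustrative:

```typescript
// Illustrative: the minimum viable checklist as a reusable review prompt.
const MINIMUM_CHECKLIST: string[] = [
  "Does this follow existing codebase patterns?",
  "Are there security concerns (input validation, auth, secrets)?",
  "Is there dead code or duplication?",
  "Are errors handled consistently?",
];

// Paste the result into a fresh session; the clean context is the point.
function minimalReviewPrompt(diff: string): string {
  const items = MINIMUM_CHECKLIST.map((q, i) => `${i + 1}. ${q}`).join("\n");
  return `Review this diff. Answer each question with specific findings:\n${items}\n\nDiff:\n${diff}`;
}
```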
Put the checklist in your review prompt. Checklists beat open-ended "review this code" prompts because they direct attention to specific failure modes instead of letting the model generate vague approval.
Automate what you can and focus the review agent on judgment calls. Linting catches formatting issues. Type checking catches type errors. Test suites catch regressions. These are mechanical checks — let tools handle them. The review agent's value is in the things that require judgment: Is this the right abstraction? Does this match the codebase's conventions? Is this secure? Is this maintainable? Don't waste the review agent's attention on things a linter can catch.
Make it a gate, not a suggestion. In Agent Team, the coding agent's output doesn't ship until Codex approves it. That's not bureaucracy — that's quality infrastructure. If review is optional, it gets skipped when you're in a hurry, which is exactly when you need it most.
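The gate can be sketched as a loop. Here `generate` and `review` are stand-ins for the coding and review agents, not Agent Team's actual interfaces, and the whole shape is an assumption about how such a pipeline might look:

```typescript
// Illustrative gate: nothing ships until the reviewer approves.
type Verdict = { approved: boolean; issues: string[] };

function gatedChange(
  task: string,
  generate: (task: string, feedback: string[]) => string, // coding agent stand-in
  review: (diff: string) => Verdict,                      // fresh-context reviewer stand-in
  maxRounds: number = 3,
): string {
  let feedback: string[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const diff = generate(task, feedback); // coder works with the reviewer's feedback
    const verdict = review(diff);          // reviewer checks the diff against the checklist
    if (verdict.approved) return diff;     // gate passes: safe to merge
    feedback = verdict.issues;             // specific, actionable issues go back to the coder
  }
  throw new Error(`Review gate not passed after ${maxRounds} rounds`);
}
```

Making approval the only return path is what turns review from a suggestion into infrastructure: there is simply no code path that merges unreviewed output.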
The Cheapest Quality Investment You'll Make
Here's the math. A review agent costs one extra LLM call per coding task. At current API prices, that's pennies. The issues it catches — security vulnerabilities, convention violations, dead code, inconsistencies — cost hours to debug, days to refactor, and occasionally entire incidents to remediate.
Every serious engineering organization figured out decades ago that code review is non-negotiable. The AI coding era somehow forgot this. We got so excited about generation speed that we dropped the practice that makes generated code actually shippable.
The fix isn't complicated. It's a dedicated review step with fresh context and a clear checklist. Whether you build it as a formal agent or just open a second chat window, the principle is the same: don't let the same context that wrote the code be the only context that evaluates it.
Code review is the missing agent. Add it to your workflow and the code your AI tools generate goes from "probably fine" to "actually reviewed." That's a meaningful difference when you're shipping to production.