2026-02-20

Why Every AI Project Needs More Than One Agent

Monolithic AI sessions are the wrong abstraction for real work. Here's why multi-agent is the natural evolution.

The Single-Agent Ceiling

Here's a take I'm increasingly confident about: single-agent AI sessions are the wrong abstraction for real work. They're useful for quick questions and one-off tasks, but the moment you try to build anything substantial — plan a feature, write the code, review it, test it, deploy it — a single agent falls apart.

The reason is straightforward. When you ask one AI to do everything, you get a jack-of-all-trades that's master of none. It plans okay. It codes okay. It reviews its own code (badly). It tests what it remembers to test. And it deploys with whatever context it still has left after burning through its window on everything else.

This isn't a model capability problem. It's an architecture problem. The same lesson software engineering learned decades ago applies here: monoliths don't scale. Not because the code inside them is bad, but because the abstraction is wrong for the complexity of the work.

I've spent the last several years building AI agent platforms — the Teams AI SDK at Microsoft, Chat AI at Zoom, and most recently Agent Team, an open-source multi-agent orchestration plugin for Claude Code. Every project reinforced the same conclusion: the ceiling on what a single agent can do is much lower than people think, and the unlock is almost always splitting the work across specialized agents.

The Monolith Parallel

If you've been in software engineering long enough, you've watched the monolith-to-microservices transition play out. The forces that drove that shift are nearly identical to what's happening with AI agents right now.

Context accumulation. Monolithic codebases accumulate complexity until no single developer can hold the whole system in their head. Single-agent sessions accumulate context until the model can't effectively track what it's doing. The conversation grows, earlier instructions fade, and quality silently degrades. You don't notice until you're debugging something that should have been caught three steps ago.

No specialization. In a monolith, the same deployment pipeline handles user auth, payment processing, and email notifications. In a single-agent session, the same context window handles planning, implementation, code review, and testing. Neither arrangement lets any component do its job exceptionally well. Everything is adequate. Nothing is great.

No independent testing. You can't test the review quality of a single agent separately from its coding quality — they're the same session, the same context, the same run. In a monolith, you can't deploy the billing module without risking a regression in user onboarding. Same structural problem.

No graceful failure. When a monolith crashes, everything crashes. When a single agent goes off the rails — hallucinates a dependency, misunderstands the architecture, loses track of the goal — there's no circuit breaker. No other agent catches the mistake. The whole session is compromised.

The microservices insight was: decompose by responsibility, define clear interfaces, let each service do one thing well. That's exactly what multi-agent architecture does for AI-assisted development.

What I Saw Building Agent Team

I didn't start as a multi-agent zealot. I started by trying to get a single Claude Code session to handle full project development — planning, coding, reviewing, deploying. It worked for small things. For anything real, it hit a wall.

Here's what I observed over dozens of projects:

Context pollution was the silent killer. The longer a session ran, the worse its output got. Not dramatically — it didn't start producing garbage. It just got subtly less sharp. A planning discussion would leave residual context that biased the coding phase. Code generation would push out the architectural decisions made earlier. By the time you got to review, the agent had lost half the context that mattered. You'd find yourself re-explaining things you'd already discussed 40 messages ago.

There were no quality gates. Code went straight from generation to the filesystem with zero independent review. The same agent that wrote the code was the one "reviewing" it — which is like asking a student to grade their own exam. It would catch syntax issues and obvious bugs, but architectural problems, security concerns, subtle logic errors? Those sailed through because the agent had the same blind spots during review that it had during generation.

Role confusion produced measurably worse results. I tested this directly. The same Claude session switching between planning, coding, and reviewing produced worse output across all three tasks than dedicated sessions for each. Not marginally worse — significantly worse. The planner was less thorough because it was already thinking about implementation. The coder was less focused because it was still carrying planning context. The reviewer was more lenient because it had written the code itself.

The fix was decomposition. Agent Team splits work across up to seven specialized agents — Orchestrator, Coding, Design, Codex (code review), Deploy, Content, and QA — each with their own system prompt, their own context, and their own success criteria. The Orchestrator plans and dispatches. The Coding agent writes code without worrying about deployment. The Codex agent reviews code it didn't write, which turns out to be dramatically more effective. The Deploy agent handles infrastructure without being distracted by feature logic.
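To make the decomposition concrete, here's a minimal sketch of the "own system prompt, own context" idea. The role names come from Agent Team, but the prompts, the `AgentRole` structure, and the `dispatch` function are illustrative assumptions, not the plugin's actual schema; the model call itself is stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRole:
    """One specialized agent: its own system prompt and its own isolated context."""
    name: str
    system_prompt: str
    context: list = field(default_factory=list)  # never shared with other roles

# Role names from the article; prompts are illustrative, not Agent Team's actual ones.
ROLES = {
    "orchestrator": AgentRole("Orchestrator", "Plan the work and dispatch tasks; never write code."),
    "coding": AgentRole("Coding", "Implement the assigned task; ignore deployment concerns."),
    "codex": AgentRole("Codex", "Review code you did not write; flag architecture and security issues."),
    "deploy": AgentRole("Deploy", "Handle infrastructure and releases only."),
}

def dispatch(role_key: str, task: str) -> str:
    """Append a task to one role's isolated context (the model call is stubbed out)."""
    role = ROLES[role_key]
    role.context.append({"role": "user", "content": task})
    return f"[{role.name}] {task}"
```

The point of the structure is visible even in the stub: sending a task to the Coding role leaves the Codex role's context completely untouched, so the reviewer later sees the code without inheriting the coder's reasoning.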

The improvement was immediate and obvious. Code review actually caught real issues. Planning was more thorough because the planner wasn't context-switching into implementation mode. Deployments were more reliable because the Deploy agent had a clean context focused entirely on infrastructure.

When to Use 1 / 2-4 / 5-7 Agents

Multi-agent isn't always the right call. There's real overhead in agent coordination — handoff protocols, context passing, orchestration logic. For simple tasks, that overhead isn't worth it. Here's the framework I use:

One Agent

Use a single agent for simple, well-defined tasks where the entire job fits comfortably in one context window with room to spare. Quick scripts. Single-file edits. Q&A about a codebase. Generating a utility function. Explaining code. Anything where the scope is narrow and the quality bar is "correct" rather than "excellent."

The test: if you can fully describe the task in two sentences and the output is a single artifact, one agent is probably fine.

Two to Four Agents

This is the sweet spot for medium-complexity work — feature development, bug investigation across multiple files, refactoring with testing. The most common pattern here is coder + reviewer: one agent writes the code, another reviews it independently. That single split — separating generation from evaluation — catches an enormous number of issues that a single agent misses.

Other useful two-to-four-agent patterns: planner + executor (one agent designs the approach, another implements it), coder + tester (one writes the feature, another writes and runs the tests), or researcher + implementer (one investigates the codebase and gathers context, another makes the changes).
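The coder + reviewer split can be sketched in a few lines. Both agents are stubbed with placeholder functions here (a real version would call two fresh model sessions); the shape of the loop is what matters: generation and evaluation are separate steps, and the reviewer's findings drive another round.

```python
def coder(task: str) -> str:
    """Stand-in for a fresh coding-agent session."""
    return f"def handle():  # {task}\n    pass"

def reviewer(code: str) -> list:
    """Stand-in for an independent review session with no generation context."""
    issues = []
    if "pass" in code:
        issues.append("function body is a stub")
    return issues

def coder_plus_reviewer(task: str, max_rounds: int = 3):
    """Generate, review independently, and loop until the reviewer is satisfied."""
    code = coder(task)
    for _ in range(max_rounds):
        issues = reviewer(code)
        if not issues:
            break
        # a real orchestrator would feed the issues back to the coder agent;
        # here we just patch the stub to keep the example self-contained
        code = code.replace("pass", "raise NotImplementedError(task)")
    return code, reviewer(code)
```

The design choice worth noting: `reviewer` never sees the conversation that produced the code, only the code itself, which is the structural fix for the grade-your-own-exam problem described above.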

For personal projects, this is usually enough. You get the specialization benefits without the coordination overhead of a full team.

Five to Seven Agents

Full teams make sense for project-level work — multi-day features, new project scaffolding, sprint-based development, anything that touches architecture, implementation, quality, and deployment. This is where you get the full benefit of specialization: dedicated planning, coding, review, testing, deployment, and design agents, each with clear roles and handoff protocols.

The overhead is real — you need an orchestrator, you need handoff formats, you need to think about what context each agent needs. But for substantial work, the quality improvement more than justifies it. Code review actually catches architectural issues. Testing covers edge cases the coder didn't think about. Deployment is handled by an agent that understands infrastructure instead of one that's been coding features for the last hour.
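"You need handoff formats" can be made concrete with a small envelope type. This is a hypothetical format, not Agent Team's actual protocol: the idea is that each agent receives only the artifacts and constraints it needs, not the full transcript of everything that came before.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Handoff:
    """Envelope passed between agents (a hypothetical format, not Agent Team's schema)."""
    from_agent: str
    to_agent: str
    task: str
    artifacts: list    # files or outputs produced so far
    constraints: list  # earlier decisions the next agent must not violate

def serialize(handoff: Handoff) -> str:
    """Handoffs travel as plain JSON so any agent (or human) can inspect them."""
    return json.dumps(asdict(handoff), indent=2)
```

So the Coding agent might hand `artifacts=["auth.py"]` and `constraints=["no new dependencies"]` to the reviewer, instead of the reviewer inheriting an hour of coding context it has to sift through.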

The test: if the work would take a human team multiple days and involve multiple types of expertise, you want the full team.

The Future: Multi-Agent by Default

Here's my prediction: within two years, multi-agent will be the default way AI-assisted development works. Single-agent sessions will be the equivalent of writing code without version control — technically possible, and you can get things done, but nobody serious works that way.

The trajectory is clear. Models are getting better at following system prompts and maintaining role boundaries. Orchestration frameworks are maturing. The cost of running multiple agent sessions is dropping fast. And the quality gap between single-agent and multi-agent approaches is widening as people build more ambitious things with AI.

We're going to see multi-agent become invisible infrastructure. You won't "set up a multi-agent system" any more than you "set up version control" — it'll just be how AI development tools work. Your IDE will dispatch specialized agents for different tasks without you thinking about it. Code review agents will run automatically. Test generation will happen in parallel with implementation. Deployment agents will handle the release pipeline.

The companies and tools that figure out orchestration — how to decompose work, pass context between agents, handle failures, and reassemble results — will define the next generation of developer tooling. The ones that keep trying to make a single, all-purpose agent do everything will hit the same ceiling over and over.

I'm also convinced that the orchestration layer itself will become a product category. Right now, multi-agent coordination is bespoke — every team builds their own. That won't last. Standard protocols for agent handoff, shared context formats, and orchestration patterns will emerge, just like REST APIs and container orchestration standardized microservices.

Building in the Open

Agent Team is my attempt to prove this thesis in practice. It's an open-source multi-agent orchestration system built on Claude Code — seven specialized agents with defined roles, handoff protocols, sprint-based planning, and quality gates.

It's not the final answer. Multi-agent architecture is still early, and there's a lot to figure out around context efficiency, error recovery, and cost optimization. But the core insight — that decomposing AI work across specialized agents produces better results than monolithic sessions — is something I've now validated across enough projects to be very confident about.

If you're building anything substantial with AI, stop asking one agent to do everything. Split the work. Specialize the roles. Add quality gates between stages. The improvement is immediate, and it compounds as your projects get more ambitious.

The monolith era of AI development is ending. Multi-agent is what comes next.