What Is a Multi-Agent System?
A multi-agent system (MAS) is a software architecture in which several autonomous agents — in modern usage, typically multiple LLMs — collaborate, divide labor, and coordinate to solve a single task. The pattern that has dominated 2025-2026 is “orchestrator and workers”: a lead LLM plans and delegates, while subordinate LLMs execute focused sub-tasks in parallel. Anthropic’s multi-agent research system, which powers Claude’s deep-research feature, is the best-known commercial example.
The mental model is “an expert team running a project.” A project manager (lead agent) sets the plan, while specialists (subagents) carry out research, implementation, and review in parallel. In production, multi-agent setups shine when the work is too large or too varied for a single model call: deep research, complex coding workflows, and long-running automations. They are not a free upgrade — token costs, debugging difficulty, and orchestration overhead all rise — but for the right tasks, the gains in speed and quality can be dramatic.
How to Pronounce Multi-Agent
multi-agent (/ˈmʌlti ˈeɪdʒənt/)
multi-agent system (/ˈmʌlti ˈeɪdʒənt ˈsɪstəm/)
MAS (/mæs/)
How Multi-Agent Systems Work
The dominant production pattern is the orchestrator-worker hierarchy. A lead agent receives the user request, plans the work, and decomposes it into independent sub-tasks. It dispatches each sub-task to a subagent that runs in its own context window, with its own system prompt and possibly its own model. Each subagent returns a compact result, which the lead agent integrates into the final response. Because each subagent has its own context, the system effectively multiplies the available memory beyond what a single model could hold, at a corresponding cost in tokens.
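To make the loop concrete, here is a minimal sketch written directly against the standard Anthropic Python SDK. It is not Anthropic's production system: the decomposition prompt, the run_subagent helper, and the model names (taken from the article's later example) are illustrative assumptions.
# Minimal orchestrator-worker sketch using the plain Anthropic Python SDK.
# run_subagent() and all prompts are illustrative, not a product API.
import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment
LEAD_MODEL = "claude-opus-4-6"          # model names follow the article's example
WORKER_MODEL = "claude-sonnet-4-6"

def run_subagent(task: str) -> str:
    """Each subagent gets its own system prompt and its own fresh context window."""
    resp = client.messages.create(
        model=WORKER_MODEL,
        max_tokens=1024,
        system="You are a research specialist. Return a compact, source-grounded summary.",
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text

# 1. The lead agent decomposes the request into independent sub-tasks.
plan = client.messages.create(
    model=LEAD_MODEL,
    max_tokens=512,
    system="Split the user's request into 3 independent research sub-tasks, one per line.",
    messages=[{"role": "user", "content": "Survey the AI agent market."}],
)
sub_tasks = [line for line in plan.content[0].text.splitlines() if line.strip()]

# 2. Workers execute; 3. the lead integrates their compact results.
findings = "\n\n".join(run_subagent(t) for t in sub_tasks)
final = client.messages.create(
    model=LEAD_MODEL,
    max_tokens=1024,
    system="Integrate the findings into a single coherent answer.",
    messages=[{"role": "user", "content": findings}],
)
print(final.content[0].text)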
The Canonical Topology
Most production multi-agent systems converge on a similar structure: one central orchestrator, a fan-out of parallel subagents, and tool calls (web search, retrieval, code execution) at the leaves. Anthropic’s research system uses Claude Opus 4 as the lead and several Claude Sonnet 4 subagents in parallel; Anthropic reports that this configuration outperforms a single Opus 4 by 90.2% on internal research-task evaluations. Note, however, that this configuration also burns roughly 15-20x the tokens of a single-agent baseline. The performance lift is real; the cost lift is just as real.
Historical Background
The multi-agent concept predates LLMs by decades. Distributed AI research in the 1980s explored swarm intelligence, blackboard architectures, and agent communication languages, mostly for robotics and traffic simulation. The LLM revival began in 2023-2024 with frameworks like AutoGPT, BabyAGI, and LangChain’s agent abstractions, then matured through LangGraph, CrewAI, AutoGen, and Anthropic’s Agent SDK. The conceptual emphasis shifted from “autonomy” (agents pursuing their own goals) to “delegation and parallelism” (one orchestrator coordinating many workers), which is the dominant pattern today.
Coordination Mechanics
The mechanics of coordination matter as much as the topology. Most production systems coordinate via tool calls: a subagent is exposed to the orchestrator as if it were a tool, and the orchestrator calls it the same way it would call a web search or a database query. This gives the orchestrator a uniform mental model — everything is a tool — and lets it use familiar reasoning patterns to decide when to delegate. Other systems coordinate via shared state: subagents read from and write to a common scratchpad, and the orchestrator monitors the scratchpad to decide what to do next. The tool-call style is simpler and more common; the shared-state style is more powerful but harder to debug.
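The tool-call style can be sketched with the standard Anthropic tool-use protocol: the subagent is declared like any other tool, and the orchestrator decides when to invoke it. The tool name, schema, prompts, and model names are assumptions for illustration.
# Sketch: a subagent exposed to the orchestrator as an ordinary tool.
import anthropic

client = anthropic.Anthropic()

def research_subagent(query: str) -> str:
    """The 'tool' is just another model call with its own system prompt and context."""
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,        # worker model per the article
        system="You are a research specialist.",
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text

tools = [{
    "name": "research",
    "description": "Delegate a focused research question to a specialist subagent.",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]},
}]

messages = [{"role": "user", "content": "Compare the top three agent frameworks."}]
for _ in range(10):                                        # hard cap on orchestration turns
    resp = client.messages.create(
        model="claude-opus-4-6", max_tokens=1024,          # lead model per the article
        system="You are a project manager. Delegate research, then synthesize.",
        tools=tools, messages=messages,
    )
    if resp.stop_reason != "tool_use":                     # orchestrator produced its answer
        break
    messages.append({"role": "assistant", "content": resp.content})
    results = [{"type": "tool_result", "tool_use_id": b.id,
                "content": research_subagent(b.input["query"])}
               for b in resp.content if b.type == "tool_use"]
    messages.append({"role": "user", "content": results})

print(resp.content[0].text)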
Good multi-agent designs also explicitly bound what a subagent can do. A subagent with unrestricted tool access can make changes the orchestrator did not anticipate, especially in agentic workflows that touch real systems. The safest production pattern is to give each subagent the minimum tools it needs and to log every tool call for later audit. This discipline is what separates demo-grade multi-agent code from systems that hold up in production.
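A minimal sketch of that discipline, using a hypothetical allowlist wrapper with an append-only audit log (the class, field names, and file path are assumptions, not any framework's API):
# Sketch: bound what a subagent can do and log every tool call for later audit.
import datetime
import json

class GuardedToolbox:
    def __init__(self, allowed: dict, audit_path: str = "tool_audit.jsonl"):
        self.allowed = allowed            # name -> callable; the subagent's *only* tools
        self.audit_path = audit_path

    def call(self, agent_id: str, tool_name: str, **kwargs):
        if tool_name not in self.allowed:
            raise PermissionError(f"{agent_id} tried to call unapproved tool {tool_name!r}")
        result = self.allowed[tool_name](**kwargs)
        with open(self.audit_path, "a") as f:              # append-only audit trail
            f.write(json.dumps({
                "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                "agent": agent_id, "tool": tool_name, "args": kwargs,
            }) + "\n")
        return result

# Example: the research subagent gets web search only, never write access.
def web_search(query: str) -> str:
    return f"(search results for {query!r})"               # stand-in for a real search call

toolbox = GuardedToolbox(allowed={"web_search": web_search})
print(toolbox.call("research-subagent-1", "web_search", query="agent frameworks 2026"))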
Multi-Agent Usage and Examples
Quick Start
The simplest production-shape example uses Anthropic’s Claude Agent SDK, with a subagent registered as a tool the orchestrator can call:
from claude_agent_sdk import Agent, Tool

# Worker: a focused specialist with its own system prompt and context window.
researcher = Agent(
    model="claude-sonnet-4-6",
    system_prompt="You are a research specialist.",
)
# Expose the worker to the orchestrator as an ordinary tool.
research_tool = Tool.from_agent(
    researcher,
    name="research",
    description="Run a focused research task",
)
# Lead: plans the work and decides when to delegate to the research tool.
orchestrator = Agent(
    model="claude-opus-4-6",
    system_prompt="You are a project manager.",
    tools=[research_tool],
)
print(orchestrator.run("Survey the AI agent market and produce a competitive brief."))
Common Implementation Patterns
Pattern A: Fan-Out / Fan-In (Parallel Research)
orchestrator
→ splits the question into 3-5 independent sub-queries
→ dispatches each sub-query to a separate subagent in parallel
→ aggregates results into the final report
Best for: competitive research, market analyses, literature reviews — anywhere parallelism reduces wall-clock latency.
Avoid when: sub-queries are tightly coupled. Forced parallelism wastes the opportunity for any sub-query to inform the next.
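A sketch of the fan-out/fan-in shape using a thread pool for the parallel subagent calls. The sub-queries are hard-coded for brevity (a real orchestrator LLM would produce the split), and prompts and model names are illustrative assumptions.
# Sketch: Pattern A, fan-out / fan-in. Each sub-query runs in its own subagent call,
# in parallel, and the orchestrator aggregates the compact results.
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()

def subagent(query: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,        # worker model per the article
        system="Answer the sub-query concisely with sources.",
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text

# In a real system the orchestrator LLM produces this split; hard-coded here.
sub_queries = [
    "Who are the main vendors in the AI agent market?",
    "What pricing models do AI agent products use?",
    "What adoption barriers do enterprises report?",
]
with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
    findings = list(pool.map(subagent, sub_queries))       # fan-out, then fan-in

report = client.messages.create(
    model="claude-opus-4-6", max_tokens=1500,
    system="Merge the findings into one competitive brief.",
    messages=[{"role": "user", "content": "\n\n".join(findings)}],
)
print(report.content[0].text)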
Pattern B: Specialized Roles (Pipeline)
orchestrator
→ Planner agent: produces a plan
→ Coder agent: implements
→ Reviewer agent: checks quality
(sequential pipeline)
Best for: software development, content production, anything where stages have well-defined hand-offs.
Avoid when: the task is creative or exploratory and roles blur. Hard role boundaries make iterative work clumsy.
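A compact sketch of the sequential pipeline, where each stage is a separate call with its own system prompt and the output of one stage becomes the input of the next. All prompts and the single shared model are illustrative assumptions; in practice each stage can use a different model.
# Sketch: Pattern B, a Planner -> Coder -> Reviewer pipeline with explicit hand-offs.
import anthropic

client = anthropic.Anthropic()

def stage(system_prompt: str, payload: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": payload}],
    )
    return resp.content[0].text

task = "Add retry-with-backoff to our HTTP client wrapper."
plan = stage("You are a planner. Produce a short implementation plan.", task)
code = stage("You are a coder. Implement the plan. Return only code.", plan)
review = stage("You are a reviewer. List defects and required fixes.", task + "\n\n" + code)
print(review)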
Pattern C: Debate Pattern
orchestrator
→ Pro agent: argues for option A
→ Con agent: argues for option B
→ Judge agent: weighs both and decides
Best for: strategic decisions, design reviews, controversial topics where multiple viewpoints reduce blind spots.
Avoid when: the question has a single objectively correct answer. The debate adds noise without value.
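And a sketch of the debate shape, with two advocate calls feeding a judge. Prompts, the example question, and the model name are illustrative assumptions.
# Sketch: Pattern C, Pro vs Con advocates and a Judge that decides.
import anthropic

client = anthropic.Anthropic()

def agent(system_prompt: str, content: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1000,
        system=system_prompt,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text

question = "Should we migrate the monolith to microservices this year?"
pro = agent("Argue FOR the proposal as strongly as the evidence allows.", question)
con = agent("Argue AGAINST the proposal as strongly as the evidence allows.", question)
verdict = agent("You are the judge. Weigh both arguments and give a decision with reasons.",
                f"Question: {question}\n\nPro:\n{pro}\n\nCon:\n{con}")
print(verdict)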
The same orchestrator code can implement several of these patterns at once. A planning agent might fan out research subagents (Pattern A), then run a small Pro/Con debate (Pattern C) before handing the result to a Coder/Reviewer pair (Pattern B). Patterns are not mutually exclusive — they are building blocks you compose for the actual workflow you need.
Anti-Pattern: Multi-Agentifying Everything
# Don't do this
- "Summarize this email" → 5 agents in parallel
- "Fix this typo" → Planner + Coder + Reviewer pipeline
Anthropic’s own guidance is to “find the simplest solution possible” and that for many applications, “single LLM calls plus retrieval and examples are enough.” Multi-agent setups multiply token usage by 10-20x and add operational complexity. Note that unnecessary multi-agent designs often produce worse outcomes — more agents introduce more places for the system to misunderstand the user, and the orchestration prompts themselves can become a fragile dependency.
A practical question that comes up early in production is how to observe multi-agent runs. The standard approach is structured tracing: every model call, tool call, and agent transition is logged with timestamps and identifiers, and a viewer renders the trace as a tree. Tools like LangSmith, Weights & Biases Traces, and OpenTelemetry-based stacks have all added first-class support for multi-agent traces in 2025-2026. Without this kind of observability, multi-agent systems can be effectively impossible to debug; with it, debugging becomes about as tractable as debugging any other distributed system.
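A minimal sketch of this kind of tracing in plain Python, independent of any particular observability product. The span structure and field names are assumptions; the point is that every unit of work gets timestamps, an identifier, and a parent identifier so a viewer can render the tree.
# Sketch: structured tracing for multi-agent runs. Every model call, tool call, and
# agent transition becomes a span with a parent id.
import json
import time
import uuid

TRACE = []

def span(kind, name, parent_id=None):
    """Record start/end timestamps and identifiers for one unit of work."""
    class _Span:
        def __enter__(self):
            self.id = str(uuid.uuid4())
            self.start = time.time()
            return self
        def __exit__(self, *exc):
            TRACE.append({"id": self.id, "parent": parent_id, "kind": kind,
                          "name": name, "start": self.start, "end": time.time()})
    return _Span()

# Usage: wrap orchestrator, subagent, and tool calls in nested spans.
with span("agent", "orchestrator") as root:
    with span("agent", "research-subagent", parent_id=root.id) as sub:
        with span("tool", "web_search", parent_id=sub.id):
            time.sleep(0.1)                                # stand-in for the actual tool call

print(json.dumps(TRACE, indent=2))                         # feed this to your trace viewer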
Advantages and Disadvantages
Advantages
- Parallelism: independent sub-tasks run concurrently, slashing wall-clock latency.
- Context isolation: each subagent has its own context window, multiplying effective memory.
- Specialization: each role can use a different model or prompt (Opus for hard problems, Haiku for grunt work).
- Strong on deep research: empirically large gains on multi-source investigation (+90.2% in Anthropic’s evals).
- Dynamic planning: the lead can spin up or kill subagents based on what it finds.
Disadvantages
- Cost: 15-20x the token spend of a single-agent baseline, observed empirically.
- Control: agents can loop, misinterpret tasks, or fight each other without good guardrails.
- Debuggability: tracing failures across multiple agents is much harder than reading a single trace.
- Convergence: debate-style patterns can stall if no agent has the authority to decide.
- Latency tail: even with parallelism, the slowest subagent dictates wall-clock time.
- Over-engineering risk: the framework choice can dominate the design, distracting from the actual problem.
Multi-Agent vs Single-Agent vs Pipelines
The three architectures are easy to conflate because all of them involve “more than one step.” In practice, the differences in how flow control and context are managed make them suitable for very different problems.
| Aspect | Multi-Agent | Single-Agent | Pipeline |
|---|---|---|---|
| Number of agents | Multiple, parallel/hierarchical | One | Multiple, fixed by code |
| Flow control | Dynamic (lead decides) | Dynamic (model decides) | Static (developer decides) |
| Context | Isolated per agent | Single shared window | Per-stage swap |
| Token cost | High (15-20x) | Baseline | Medium |
| Typical use | Deep research, complex agents | Single-task chat or QA | Repeatable workflows |
Many production stacks combine all three. A pipeline kicks off a job, a multi-agent stage does deep research, and a single-agent finalization step writes the user-facing summary. The right answer is rarely “use only one of these”; it is “compose the simplest combination that meets the quality bar.”
Common Misconceptions About Multi-Agent Systems
Misconception 1: “Multi-agent always beats single-agent”
Why this confusion arises: human-team intuition transfers cleanly (more people, more output), Anthropic’s striking +90.2% result on research tasks gets quoted out of context and extrapolated to all tasks, and agile and scrum culture reinforces the assumption that adding parallel workers always helps.
The reality: gains accrue specifically to tasks where parallelism and context isolation matter — multi-source research, branching investigation, large code refactors. For straightforward QA or summarization, a single agent is faster, cheaper, and often higher quality. Anthropic’s own guidance explicitly recommends staying single-agent unless you have a clear reason not to.
Misconception 2: “Multi-agent systems are highly autonomous”
Why this confusion arises: the word “agent,” plus early high-profile demos (AutoGPT, BabyAGI), created an impression of agents negotiating peer-to-peer and pursuing their own goals, an image anchored more in science fiction than in current practice. Older multi-agent research papers, which did emphasize autonomy, reinforce the misconception.
The reality: production multi-agent systems are mostly centrally coordinated — one orchestrator plans and dispatches, subagents do focused work and return results. Genuine peer-to-peer negotiation and consensus protocols exist in research labs but are rare in production. Most commercial systems are best described as “structured delegation,” not “autonomous society.”
Misconception 3: “You need LangChain or LangGraph to build a multi-agent system”
Why this confusion arises: tutorials are dominated by LangChain-family code, so the word “Agent” became associated with LangChain’s specific abstractions. The framework’s prominence creates the impression that it is mandatory, and tutorials often bury the underlying patterns under framework-specific syntax.
The reality: a multi-agent system is just a design pattern — multiple LLM calls coordinated together. Anthropic’s “Building effective agents” post shows you can implement it with plain Python and the Anthropic SDK. Frameworks like CrewAI, LangGraph, and AutoGen are conveniences, not requirements. Many teams find their production systems simpler when they avoid heavy frameworks and write the orchestration directly.
It is worth noting how each misconception manifests operationally. Believing multi-agent always wins leads teams to retire perfectly good single-agent systems and absorb the cost overhead with no benefit. Believing the systems are highly autonomous leads to under-investment in evaluation and oversight. Believing LangChain is required leads to over-engineering. Each of these has cost real teams real money, which is why Anthropic’s “find the simplest solution possible” line shows up so prominently in their guidance.
Real-World Use Cases
- Deep research: Claude’s Research feature is a documented multi-agent system, and comparable products such as Perplexity Pro and ChatGPT Deep Research follow the same shape, dispatching sub-investigations as the user waits.
- Coding agents: Devin, Replit Agent, Bolt.new, and similar autonomous coding products are commonly described as using multi-agent decomposition across planning, implementation, and verification.
- Customer support automation: separate agents handle triage, knowledge lookup, escalation drafting, and summary, with each specializing in a single sub-step of the support workflow.
- Marketing automation: trend scraping, keyword extraction, content drafting, and SEO review handled by different specialists, often spanning hours of background work per output.
- R&D: literature search, experiment design, data analysis, and report writing chained as separate agents.
- RPA + AI hybrids: form extraction, rule matching, and reply drafting as specialized agents inside a larger pipeline.
- Compliance and audit workflows: each agent produces a separate intermediate artifact that auditors can review.
One pattern worth highlighting separately is the “supervisor + reporters” arrangement used in news-and-summarization workflows. A supervisor agent reads incoming articles, dispatches reporter agents to investigate angles, and assembles their findings into a daily digest. The supervisor’s prompt is small and stable; the reporters do the heavy lifting on each story. Teams running this pattern report cleaner outputs than monolithic alternatives, because each reporter focuses on a single article without the distraction of the others.
Another pattern is “model cascading” — using cheaper models for the bulk of the work and escalating to more expensive ones only when needed. The orchestrator might use Haiku for initial triage, Sonnet for medium-difficulty subtasks, and Opus or o3 for the hardest investigations. Properly tuned, cascading systems get most of the quality of an all-Opus stack at a fraction of the cost across many production workloads. Note that cascading requires careful evaluation to find the right thresholds for each tier; a poorly tuned cascade can be worse than either extreme.
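A sketch of a simple cascade, where a cheap triage call routes each subtask to a tier. The difficulty rubric, model names, and fallback are assumptions; as noted above, real thresholds need careful evaluation.
# Sketch: model cascading. A cheap triage model labels each subtask's difficulty,
# and only the hardest subtasks reach the most expensive tier.
import anthropic

client = anthropic.Anthropic()
TIERS = {"easy": "claude-haiku-4-5", "medium": "claude-sonnet-4-6", "hard": "claude-opus-4-6"}

def triage(subtask: str) -> str:
    resp = client.messages.create(
        model=TIERS["easy"], max_tokens=10,
        system="Label the task's difficulty as exactly one word: easy, medium, or hard.",
        messages=[{"role": "user", "content": subtask}],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in TIERS else "medium"           # fall back on an ambiguous label

def run(subtask: str) -> str:
    resp = client.messages.create(
        model=TIERS[triage(subtask)], max_tokens=1024,
        system="Complete the subtask.",
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.content[0].text

print(run("Extract all company names from this sentence: Acme acquired Initech."))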
One understated win of multi-agent designs is in regulated environments. When a single LLM produces a regulatory output (e.g., a draft compliance report), the auditor must trust the entire reasoning chain in one shot. Splitting the work across specialized agents — extractor, classifier, reviewer — produces a paper trail of intermediate artifacts that auditors can review individually. Several financial-services teams have publicly cited this auditability gain as the reason they adopted multi-agent designs even on tasks where a single agent would have been technically sufficient.
Frequently Asked Questions (FAQ)
Q1. What is the difference between an AI agent and a multi-agent system?
An AI agent is a single LLM equipped with tools that can autonomously execute a task. A multi-agent system is the broader architecture in which several such agents collaborate. The agent is the building block; the multi-agent system is the architecture you assemble from those blocks. The line between them is sometimes fuzzy: a sufficiently capable agent that delegates to itself recursively can look multi-agent, while a small multi-agent system whose subagents are stateless tools can look single-agent.
Q2. Which framework should I use?
As of 2026, the leading options are LangGraph, CrewAI, AutoGen, Anthropic’s Agent SDK, and the OpenAI Agents SDK (the successor to the experimental Swarm). Anthropic’s Agent SDK is favored for simplicity; LangGraph is the go-to when you need explicit state machines; CrewAI is popular for role-driven designs; AutoGen has the strongest support for code-execution agents. Pick the framework that matches your existing stack and operational practices, and remember that you can always start without one and adopt a framework only when complexity demands it.
Q3. How much more does multi-agent cost?
Anthropic’s measurements show roughly 15-20x the token spend of an equivalent single-agent baseline. The actual multiplier depends on subagent count and reasoning depth; production systems usually offset some of this with prompt caching and by using cheaper models (e.g., o4-mini, Haiku) for routine subagents. Detailed observability is essential — without good logging, you cannot tell whether a 20x cost is delivering 20x value.
Q4. How do I prevent multi-agent runaway behavior?
Set hard timeouts, token caps, recursion-depth limits, and tool-call budgets. Managed runtimes such as Anthropic’s Managed Agents bundle sandboxing, state management, and circuit breakers, which substantially reduces the operational burden of containing misbehaving agents. Treat these limits as production safety mechanisms rather than developer ergonomics: every multi-agent system in production has eventually triggered at least one of them, and being unprepared turns a contained incident into an outage.
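As a plain-Python illustration of the limits described above, here is a sketch of a budget guard the orchestrator charges before every model or tool call. The class, thresholds, and method names are illustrative assumptions, not a managed runtime's API.
# Sketch: hard limits that turn runaway behavior into a clean, contained failure.
import time

class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    def __init__(self, max_seconds=300, max_tokens=200_000, max_tool_calls=50, max_depth=3):
        self.deadline = time.time() + max_seconds
        self.tokens_left = max_tokens
        self.tool_calls_left = max_tool_calls
        self.max_depth = max_depth

    def charge(self, tokens=0, tool_calls=0, depth=0):
        """Call before every model/tool call; raises instead of letting the run drift."""
        if time.time() > self.deadline:
            raise BudgetExceeded("wall-clock timeout")
        self.tokens_left -= tokens
        self.tool_calls_left -= tool_calls
        if self.tokens_left < 0:
            raise BudgetExceeded("token cap exceeded")
        if self.tool_calls_left < 0:
            raise BudgetExceeded("tool-call budget exceeded")
        if depth > self.max_depth:
            raise BudgetExceeded("recursion depth limit exceeded")

budget = RunBudget()
budget.charge(tokens=1200, tool_calls=1, depth=1)          # orchestrator charges as it delegates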
Q5. When should I switch from single-agent to multi-agent?
Consider multi-agent when (a) the task decomposes naturally into independent sub-tasks, (b) the context exceeds what a single model can hold, or (c) different sub-tasks would benefit from different models or prompts. If none of these apply, staying single-agent is the safer default. A useful test is to imagine running both versions for two weeks and asking which one your operations team would rather support; the answer is usually the simpler one.
Conclusion
- A multi-agent system is multiple LLMs cooperating to solve one task; the orchestrator-plus-workers hierarchy dominates production use.
- Parallelism, context isolation, and specialization are the headline benefits, especially on deep research and complex agent workflows.
- Token cost runs 15-20x a single-agent baseline; reserve the architecture for tasks that genuinely need it.
- Frameworks like LangGraph, CrewAI, AutoGen, the Anthropic Agent SDK, and the OpenAI Agents SDK are conveniences, not requirements.
- Anthropic’s research configuration (Opus lead + Sonnet workers) reportedly outperforms a single Opus by 90.2% on Anthropic’s internal research evaluations.
- Default to the simplest design that meets your quality bar; reach for multi-agent only when single-agent or pipelines fall short.
The trajectory is clear: as Anthropic and others ship managed runtimes for agent execution, the operational cost of multi-agent systems is falling, and the design patterns are maturing. Teams entering this space in 2026 have a much smoother path than those that tried in 2024 — but the core advice has not changed: start simple, measure carefully, add agents only when the metrics show you need them. Multi-agent designs are a tool, not a destination, and the most effective teams treat them as such.
References
- Anthropic, “How we built our multi-agent research system.” https://www.anthropic.com/engineering/multi-agent-research-system
- Anthropic, “Building effective agents.” https://www.anthropic.com/research/building-effective-agents
- InfoQ, “Anthropic Introduces Managed Agents to Simplify AI Agent Deployment.” https://www.infoq.com/news/2026/04/anthropic-managed-agents/
- GuruSup, “Best Multi-Agent Frameworks in 2026.” https://gurusup.com/blog/best-multi-agent-frameworks-2026