title: AI Agent Architecture: Build Factories, Not Fake Teams
slug: ai-agent-architecture-build-agent-factories
description: Most AI agent architecture still imitates human teams. The better model is factories: queues, workcells, QA gates, and auditable async worker systems.
date: 2026-04-26
cluster: build-ai-agent
pageRole: authority
primaryKeyword: ai agent architecture
supportingKeywords:

  • build ai agent
  • ai agent tools
  • autonomous ai agent
  • ai coding agent
  • agent orchestration
  • async worker systems

AI Agent Architecture: Build Factories, Not Fake Teams

Most AI agent architecture still copies the wrong thing. Instead of designing production systems, people design little fake companies: a manager agent, a researcher agent, a reviewer agent, maybe a planner agent, all chatting in loops that look impressive in a demo and get expensive in real life.

That is the wrong metaphor.

If you want serious AI agent architecture, the better model is a factory: queues, workcells, explicit contracts, artifact handoff, QA gates, rework lanes, and clear human exception paths. The systems that hold up are usually not the ones that look most human. They are the ones that make work measurable, reviewable, and hard to fake.

This page is the broad architecture argument. If you are trying to build AI agent systems that survive contact with real workloads, this is the mental model I would start from.

What this page covers

  • why the fake-team metaphor breaks so often
  • what good AI agent architecture actually needs to optimize for
  • the factory model: queues, workcells, contracts, review, and rework
  • when to use generalist workers vs specialized workers
  • what Symphony and ClawSweeper teach about agent architecture in practice
  • how I would design AI agent architecture today

If you want the narrower implementation layers after this, go next to:

  • OpenAI Symphony review: what it actually does
  • ClawSweeper review: what it actually does

What this page is based on

  • direct Starkslab work on async agent workflows, internal queues, and review loops
  • published notes on OpenClaw, scheduling, and operator-grade agent workflows
  • source-backed teardown work on Symphony and ClawSweeper
  • practical observation of where agent systems stall: weak handoffs, fake autonomy, poor review discipline, and fuzzy ownership

This is not a trend-summary page. It is an operator view of AI agent architecture grounded in real systems and real workflow design.

Jump to

  • What AI agent architecture should actually optimize for
  • Why fake agent teams break in practice
  • What factory-style AI agent architecture looks like
  • Generalist vs specialized workers
  • What Symphony teaches about a generalist chassis
  • What ClawSweeper teaches about a specialized worker
  • How I would design AI agent architecture today
  • When not to build agent factories

What AI agent architecture should actually optimize for

A lot of AI agent architecture discussion starts with the wrong question.

It asks:

  • how do I make multiple agents collaborate?
  • how do I assign realistic roles?
  • how do I make the system feel autonomous?

Those are demo questions.

The more useful questions are:

  • how does work enter the system?
  • what exact artifact should each worker produce?
  • where does review happen?
  • what gets retried, and what gets escalated?
  • how do I know what changed, why it changed, and whether it was safe?
  • what is the cost per completed unit of useful work?

That is the shift from fake-team thinking to factory thinking.
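
To make that shift concrete, here is a minimal sketch of a work-item record that makes each of those questions answerable. Every field name and status below is a hypothetical choice, not a prescribed schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    ACTIVE = "active"
    BLOCKED = "blocked"
    DONE = "done"
    ESCALATED = "escalated"


@dataclass
class WorkItem:
    """One unit of work moving through the system.

    Each field answers one of the questions above: how work entered,
    what exact artifact the step produced, where review landed, what
    got retried, and what it all cost.
    """
    item_id: str
    intake_source: str                 # how the work entered the system
    artifact_ref: str | None = None    # the exact artifact this step produced
    review_verdict: str | None = None  # where review happened, and its outcome
    retries: int = 0                   # what got retried
    status: Status = Status.WAITING
    cost_tokens: int = 0               # cost per completed unit of useful work
    history: list[str] = field(default_factory=list)  # what changed, and why
```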

A real AI agent architecture should optimize for a handful of boring but critical things:

1. Throughput

Can the system keep turning inputs into useful outputs without conversational sludge building up between steps?

2. Bounded autonomy

Can workers act inside clear limits without quietly drifting into tasks they were never meant to own?

3. Handoff clarity

Does each step produce a durable artifact, decision object, or state transition that the next step can trust?

4. QA and rework

When output is weak, is there a clean rework lane, or does the whole system just keep talking until a human manually fixes it?

5. Observability

Can an operator tell what happened, what is blocked, and what is only pretending to move?

6. Auditability

If something goes wrong, can you inspect the decisions and state transitions afterward?

7. Cost and latency discipline

Does the architecture respect the economics of repeated work, or is “multi-agent” just a nicer name for burning tokens on coordination theater?

A good AI agent architecture is not one that feels intelligent. It is one that keeps these surfaces legible.

Why fake agent teams break in practice

The team metaphor is seductive because humans already understand teams.

So the architecture becomes obvious theater:

  • one planner agent breaks down the task
  • one researcher agent gathers context
  • one writer agent drafts
  • one reviewer agent critiques
  • one manager agent decides what happens next

You can visualize it instantly. You can pitch it on a slide. You can narrate it with human language.

But that same metaphor creates predictable failure modes.

Vague role boundaries

Human job titles are fuzzy. “Researcher” and “reviewer” sound clear until you ask what exact output each step is responsible for producing.

If a worker’s contract is vague, every downstream problem becomes hard to debug. Did the planner under-specify the task? Did the writer misunderstand? Did the reviewer overstep? Or did all three do half of each other’s jobs?

Conversational coordination replaces system design

A lot of agent orchestration is really just agents talking to each other because the designer never built a real handoff model.

That looks collaborative, but often it means:

  • too many context hops
  • too much repeated reading
  • no stable output shape
  • higher latency
  • higher cost
  • weaker accountability

No durable artifact between steps

If the handoff is just another message in a conversation, the system has no strong spine. It is hard to re-run, audit, diff, or review.

That makes the architecture feel “autonomous” right up until you need to trust it.

Hidden human cleanup

Many fake-team systems only work because a human is quietly doing the hard parts:

  • checking if the brief was actually usable
  • rewriting the prompt
  • deciding whether the output is publishable
  • resolving contradictions between agents
  • carrying the operational context the architecture failed to encode

In other words, the system is not autonomous. It is leaking work.

Token burn gets mistaken for collaboration

More agent dialogue often gets framed as deeper reasoning.

Sometimes it is. Often it is just a tax.

If the agents are exchanging information that could have been encoded once in a contract, schema, checklist, or queue state, the architecture is paying for coordination because it failed to design the process.

What breaks first

When fake-team architecture hits real workloads, the first cracks usually show up here:

  • the queue looks active, but nothing actually lands
  • the system cannot tell “waiting” from “stalled”
  • review becomes informal and inconsistent
  • outputs drift because the worker contract was never sharp
  • humans stop trusting the system and start bypassing it
  • cost rises faster than useful throughput

This is why I have become skeptical of agent-org-chart design as the default answer. The issue is not that multiple agents are always bad. The issue is that the human-team metaphor encourages weak engineering discipline.

What factory-style AI agent architecture looks like

The better model is not “one brilliant autonomous employee.”

It is a production system.

In factory-style AI agent architecture, work moves through explicit stages:

  1. intake
  2. routing
  3. bounded worker execution
  4. artifact emission
  5. review or QA
  6. rework or apply
  7. archive, metrics, and audit

That model sounds less romantic than “multi-agent collaboration,” but it is far more useful.
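
As a sketch, those stages can be an explicit state machine rather than an implied flow. The stage names follow the list above; splitting rework and apply into separate states, and the transition table itself, are illustrative assumptions:

```python
from enum import Enum


class Stage(Enum):
    INTAKE = 1
    ROUTING = 2
    EXECUTION = 3   # bounded worker execution
    ARTIFACT = 4    # artifact emission
    REVIEW = 5      # review or QA
    REWORK = 6
    APPLY = 7
    ARCHIVE = 8     # archive, metrics, and audit


# Legal moves only. Anything else is a bug, not a "conversation".
TRANSITIONS: dict[Stage, set[Stage]] = {
    Stage.INTAKE: {Stage.ROUTING},
    Stage.ROUTING: {Stage.EXECUTION},
    Stage.EXECUTION: {Stage.ARTIFACT},
    Stage.ARTIFACT: {Stage.REVIEW},
    Stage.REVIEW: {Stage.REWORK, Stage.APPLY},
    Stage.REWORK: {Stage.EXECUTION},   # rework re-enters execution
    Stage.APPLY: {Stage.ARCHIVE},
    Stage.ARCHIVE: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move an item to the next stage, refusing illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```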

The core components

Queue

A queue is where work becomes visible. It tells you what is waiting, what is active, what is blocked, and what is done.

Without a queue, agent systems often confuse conversation state with operating state.
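
One concrete payoff of queue state: timestamps let you tell "waiting" from "stalled", which conversation logs never will. A minimal sketch, with hypothetical field names and an arbitrary placeholder SLA:

```python
import time

STALL_THRESHOLD_SECONDS = 30 * 60  # arbitrary placeholder SLA


def stalled_items(queue: list[dict]) -> list[dict]:
    """Return active items that have made no progress within the SLA.

    Assumes each queue item carries a `status` and a `last_progress_at`
    epoch timestamp (hypothetical field names).
    """
    now = time.time()
    return [
        item for item in queue
        if item["status"] == "active"
        and now - item["last_progress_at"] > STALL_THRESHOLD_SECONDS
    ]
```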

Workcell

A workcell is a bounded worker or worker lane with one job. That job might still be broad, but it should have a clear contract.

Examples:

  • draft a support note from a brief and outline
  • review repo issues for conservative close proposals
  • fetch a specific evidence snapshot
  • reconcile a known ledger against live state

Contract

A contract defines what a worker receives and what it must produce.

Good contracts specify:

  • allowed inputs
  • required outputs
  • evidence expectations
  • safety limits
  • what happens on uncertainty
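
A contract can be a small typed object instead of prose buried in a prompt. A minimal sketch, assuming hypothetical field names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerContract:
    """What one workcell receives and what it must produce."""
    name: str
    allowed_inputs: tuple[str, ...]    # e.g. ("brief", "outline")
    required_outputs: tuple[str, ...]  # e.g. ("draft.md",)
    evidence_required: bool            # must outputs cite their sources?
    safety_limits: tuple[str, ...]     # hard boundaries on what it may do
    on_uncertainty: str                # "refuse" | "escalate" | "flag"


# Example: the support-note drafter from the workcell list above.
DRAFTER = WorkerContract(
    name="support-note-drafter",
    allowed_inputs=("brief", "outline"),
    required_outputs=("draft.md",),
    evidence_required=True,
    safety_limits=("no publication", "no external calls"),
    on_uncertainty="escalate",
)
```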

Artifact handoff

Each meaningful step should leave behind something real:

  • a draft
  • a structured decision object
  • a report
  • a patch
  • a review note
  • a state transition

That artifact is what makes review possible.

QA or review gate

Not every lane needs the same level of review, but every serious system needs an explicit answer to the question: who checks what before mutation, publication, or closure?

Rework lane

Weak output should not create architecture panic. It should go to rework cleanly.

That means the system needs a way to say:

  • this was incomplete
  • this violated the contract
  • this needs one more pass
  • this should be escalated to a human
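
Those four outcomes are enough to type the whole rework lane. A minimal sketch, with hypothetical names and an arbitrary retry limit:

```python
from enum import Enum


class ReworkReason(Enum):
    INCOMPLETE = "incomplete"
    CONTRACT_VIOLATION = "contract_violation"
    NEEDS_ANOTHER_PASS = "needs_another_pass"
    ESCALATE_TO_HUMAN = "escalate_to_human"


MAX_RETRIES = 2  # arbitrary placeholder


def route_rework(reason: ReworkReason, retries: int) -> str:
    """Decide where a weak output goes next, without a group discussion."""
    if reason is ReworkReason.ESCALATE_TO_HUMAN or retries >= MAX_RETRIES:
        return "human_exception_queue"
    if reason is ReworkReason.CONTRACT_VIOLATION:
        return "rework_with_contract_attached"
    return "rework_queue"
```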

Apply or mutation lane

Whenever possible, proposal and mutation should be separate.

The worker that says “this seems correct” does not always need to be the worker that changes live state.

That one separation improves a shocking number of systems.
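
A minimal sketch of that separation. The revalidation here is a naive state-snapshot comparison standing in for whatever check your domain actually needs; all names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A durable 'this seems correct' artifact. Proposing mutates nothing."""
    target: str          # what would be mutated
    action: str          # what would be done to it
    evidence: str        # why the proposer believed it was safe
    observed_state: str  # snapshot of the state the proposal was based on


def still_holds(proposal: Proposal, live_state: str) -> bool:
    """Re-check the proposal against live state before any mutation."""
    return proposal.observed_state == live_state


def apply_proposal(proposal: Proposal, live_state: str) -> bool:
    """A separate worker applies only if the earlier proposal still holds."""
    if not still_holds(proposal, live_state):
        return False  # route back to review instead of mutating
    # ... perform the actual mutation here ...
    return True
```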

Team metaphor vs factory metaphor

Here is the comparison I find most useful:

| Design question | Fake-team answer | Factory answer |
| --- | --- | --- |
| How does work move? | agents chat | queue and state transitions |
| What does each worker own? | a role name | a contract and output shape |
| How do steps hand off? | conversation context | artifacts and explicit state |
| How is quality controlled? | another agent “reviews” | formal review/QA gate |
| How is failure handled? | more discussion | rework, retry, escalate |
| How is trust earned? | model seems smart | system is inspectable |
| How is cost managed? | hope the loop converges | limit coordination, bound surfaces |

The fake-team answer optimizes for narrative. The factory answer optimizes for operations.

Generalist vs specialized workers

This is where a lot of architecture debates get mushy.

Not every worker should be specialized, and not every worker should stay general.

The useful split is this:

Generalist worker

A generalist worker is good when:

  • the task family is broad
  • the work is still changing
  • the queue needs flexible interpretation
  • the main value is throughput and orchestration rather than one narrow repeated judgment

Generalist workers are often best as queue movers, drafters, or first-pass operators.

Specialized worker

A specialized worker is good when:

  • the task family repeats cleanly
  • trust boundaries matter a lot
  • the failure cost is high
  • a typed decision contract would reduce ambiguity
  • the same review/apply pattern happens over and over

Specialized workers are often best for narrow audits, controlled maintenance, conservative mutation, or governed classification.

The mistake is assuming one worker shape should do everything.

A healthy agent architecture often starts broad and then specializes where the queue proves the need.

What Symphony teaches about a generalist chassis

OpenAI Symphony is a useful case because it behaves like a generalist chassis.

At a high level, Symphony shows how to run an issue-driven async worker loop with:

  • a repo-owned workflow contract
  • isolated workspaces
  • reconciliation before fresh dispatch
  • explicit long-running orchestration state

That makes it a good fit for broader work queues where the system still needs flexibility.

The strongest design lesson is not “use Symphony exactly.” The strongest design lesson is that a real worker chassis needs operating doctrine, reconciliation, and bounded workspaces, not just a prompt and a webhook.
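
The reconciliation-before-dispatch idea is easy to state in code. This is not Symphony's actual API, just a hedged sketch of the pattern with hypothetical callables:

```python
def run_cycle(queue, workspace_for, dispatch):
    """One orchestration cycle: reconcile existing work, then dispatch new work.

    `queue`, `workspace_for`, and `dispatch` are hypothetical stand-ins for
    whatever the real chassis provides.
    """
    # 1. Reconcile: compare recorded state against each isolated workspace.
    for item in queue.in_flight():
        workspace = workspace_for(item)
        if workspace.finished():
            queue.mark_done(item, artifact=workspace.collect_artifact())
        elif workspace.dead():
            queue.mark_blocked(item, reason="worker died mid-run")
        # otherwise: genuinely still in progress, leave it alone

    # 2. Only then dispatch fresh work into clean workspaces.
    for item in queue.ready():
        dispatch(item, workspace_for(item))
```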

If you want the detailed teardown, go to OpenAI Symphony review: what it actually does.

What ClawSweeper teaches about a specialized worker

ClawSweeper is useful for the opposite reason.

It is a narrow, governed maintenance worker. It reviews issues and PRs, emits typed decisions, stores durable artifacts, and only mutates later if the earlier proposal still holds.

That makes ClawSweeper a strong specialized-worker proof case.

Its best lessons are:

  • proposal/apply separation
  • typed decision schemas
  • artifact-first auditability
  • self-audit as a first-class lane
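
A hedged sketch of a typed, durable decision in that spirit. This is the lesson applied generically, not ClawSweeper's actual schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MaintenanceDecision:
    """One typed, durable decision: reviewable now, applicable later."""
    item: str                  # e.g. an issue or PR identifier
    verdict: str               # "close" | "keep" | "needs_human" (illustrative)
    reasons: tuple[str, ...]   # evidence the verdict points at
    observed_state: str        # what the world looked like at decision time


def store_decision(decision: MaintenanceDecision, path: str) -> None:
    """Persist the decision as an artifact that audit and self-audit can read."""
    with open(path, "w") as f:
        json.dump(asdict(decision), f, indent=2)
```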

If you want the full teardown, go to ClawSweeper review: what it actually does.

How I would design AI agent architecture today

If I were designing a serious system from scratch, I would start with these rules.

1. Make queue state the real source of truth

Not chat logs. Not inferred vibes. The queue.

2. Give each worker one obvious contract

Every worker should know what it gets, what it emits, and when it should refuse or escalate.

3. Emit artifacts, not just conversation

If a step matters, it should leave behind something inspectable.

4. Separate proposal from mutation whenever trust matters

Especially for code changes, issue closures, content publication, or stateful system actions.

5. Add review and rework explicitly

Do not hide quality control inside another vague “review agent.” Make it real.

6. Specialize only after the queue teaches you where to specialize

Do not pre-bake a whole fake org chart. Let the repeated failure modes tell you where harder boundaries pay off.

7. Treat observability as part of the architecture

A system that cannot explain its own state is not mature, no matter how slick the demo looks.

That is the architecture I trust more: boring where it should be boring, explicit where it should be explicit, and far less interested in pretending to be a tiny human company.

When not to build agent factories

The factory model is powerful, but it is not always the right answer.

Do not overbuild it when:

  • the task is small enough for one direct tool call
  • the queue is too small to justify orchestration overhead
  • the outputs are too fuzzy to benefit from contracts
  • the failure cost is low and speed matters more than auditability
  • a normal script would solve the problem more cleanly

Agent factories are for repeated, governed work. Not for making everything look more autonomous than it is.

Conclusion

Most AI agent architecture fails because it copies human teams instead of designing production systems.

If you want systems that survive real use, think less about roleplay and more about queues, workcells, contracts, artifacts, review, and rework. That shift makes the architecture less flashy and much more trustworthy.

The headline is simple: build factories, not fake teams.

And if you want to see the two concrete proof cases this page routes through:

  • OpenAI Symphony review: what it actually does
  • ClawSweeper review: what it actually does

Ready-for-review summary

This owner-page version keeps the page broad, opinionated, and architecture-first. The mechanics playbook shaped it through answer-first framing, clear internal-link routing, visible role discipline, and a clean operator verdict on the first screen. The gold-page checklist shaped it by forcing one obvious page job, strong question-match heading structure, multiple concrete next-click paths, and a durable distinction between this authority page and the two narrower teardown support pages.
