title: AI Agent Architecture: Build Factories, Not Fake Teams
slug: ai-agent-architecture-build-agent-factories
description: Most AI agent architecture still imitates human teams. The better model is factories: queues, workcells, QA gates, and auditable async worker systems.
date: 2026-04-26
cluster: build-ai-agent
pageRole: authority
primaryKeyword: ai agent architecture
supportingKeywords:

  • build ai agent
  • ai agent tools
  • autonomous ai agent
  • ai coding agent
  • agent orchestration
  • async worker systems

AI Agent Architecture: Build Factories, Not Fake Teams

Most AI agent architecture still copies the wrong thing. Instead of designing production systems, people design little fake companies: a manager agent, a researcher agent, a reviewer agent, maybe a planner agent, all chatting in loops that look impressive in a demo and get expensive in real life.

That is the wrong metaphor.

If you want serious AI agent architecture, the better model is a factory: queues, workcells, explicit contracts, artifact handoff, QA gates, rework lanes, and clear human exception paths. The systems that hold up are usually not the ones that look most human. They are the ones that make work measurable, reviewable, and hard to fake.

This page is the broad architecture argument. If you are trying to build AI agent systems that survive contact with real workloads, this is the mental model I would start from.

What this page covers

  • why the fake-team metaphor breaks so often
  • what good AI agent architecture actually needs to optimize for
  • the factory model: queues, workcells, contracts, review, and rework
  • when to use generalist workers vs specialized workers
  • what Symphony and ClawSweeper teach about agent architecture in practice
  • how I would design AI agent architecture today

If you want the narrower implementation layers after this, go next to:

  • OpenAI Symphony review: what it actually does
  • ClawSweeper review: what it actually does

What this page is based on

  • direct Starkslab work on async agent workflows, internal queues, and review loops
  • published notes on OpenClaw, scheduling, and operator-grade agent workflows
  • source-backed teardown work on Symphony and ClawSweeper
  • practical observation of where agent systems stall: weak handoffs, fake autonomy, poor review discipline, and fuzzy ownership

This is not a trend-summary page. It is an operator view of AI agent architecture grounded in real systems and real workflow design.

Jump to

  • What AI agent architecture should actually optimize for
  • Why fake agent teams break in practice
  • What factory-style AI agent architecture looks like
  • Generalist vs specialized workers
  • What Symphony teaches about a generalist chassis
  • What ClawSweeper teaches about a specialized worker
  • How I would design AI agent architecture today
  • When not to build agent factories

What AI agent architecture should actually optimize for

A lot of AI agent architecture discussion starts with the wrong question.

It asks:

  • how do I make multiple agents collaborate?
  • how do I assign realistic roles?
  • how do I make the system feel autonomous?

Those are demo questions.

The more useful questions are:

  • how does work enter the system?
  • what exact artifact should each worker produce?
  • where does review happen?
  • what gets retried, and what gets escalated?
  • how do I know what changed, why it changed, and whether it was safe?
  • what is the cost per completed unit of useful work?

That is the shift from fake-team thinking to factory thinking.
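
To make that shift concrete, here is a minimal sketch of a work-item record that makes each of those questions answerable. Every field name and status below is a hypothetical choice, not a prescribed schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    ACTIVE = "active"
    BLOCKED = "blocked"
    DONE = "done"
    ESCALATED = "escalated"


@dataclass
class WorkItem:
    """One unit of work moving through the system.

    Each field answers one of the questions above: how work entered,
    what exact artifact the step produced, where review landed, what
    got retried, and what it all cost.
    """
    item_id: str
    intake_source: str                 # how the work entered the system
    artifact_ref: str | None = None    # the exact artifact this step produced
    review_verdict: str | None = None  # where review happened, and its outcome
    retries: int = 0                   # what got retried
    status: Status = Status.WAITING
    cost_tokens: int = 0               # cost per completed unit of useful work
    history: list[str] = field(default_factory=list)  # what changed, and why
```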

A real AI agent architecture should optimize for a handful of boring but critical things:

1. Throughput

Can the system keep turning inputs into useful outputs without conversational sludge building up between steps?

2. Bounded autonomy

Can workers act inside clear limits without quietly drifting into tasks they were never meant to own?

3. Handoff clarity

Does each step produce a durable artifact, decision object, or state transition that the next step can trust?

4. QA and rework

When output is weak, is there a clean rework lane, or does the whole system just keep talking until a human manually fixes it?

5. Observability

Can an operator tell what happened, what is blocked, and what is only pretending to move?

6. Auditability

If something goes wrong, can you inspect the decisions and state transitions afterward?

7. Cost and latency discipline

Does the architecture respect the economics of repeated work, or is “multi-agent” just a nicer name for burning tokens on coordination theater?

A good AI agent architecture is not one that feels intelligent. It is one that keeps these surfaces legible.

Why fake agent teams break in practice

The team metaphor is seductive because humans already understand teams.

So the architecture becomes obvious theater:

  • one planner agent breaks down the task
  • one researcher agent gathers context
  • one writer agent drafts
  • one reviewer agent critiques
  • one manager agent decides what happens next

You can visualize it instantly. You can pitch it on a slide. You can narrate it with human language.

But that same metaphor creates predictable failure modes.

Vague role boundaries

Human job titles are fuzzy. “Researcher” and “reviewer” sound clear until you ask what exact output each step is responsible for producing.

If a worker’s contract is vague, every downstream problem becomes hard to debug. Did the planner under-specify the task? Did the writer misunderstand? Did the reviewer overstep? Or did all three do half of each other’s jobs?

Conversational coordination replaces system design

A lot of agent orchestration is really just agents talking to each other because the designer never built a real handoff model.

That looks collaborative, but often it means:

  • too many context hops
  • too much repeated reading
  • no stable output shape
  • higher latency
  • higher cost
  • weaker accountability

No durable artifact between steps

If the handoff is just another message in a conversation, the system has no strong spine. It is hard to re-run, audit, diff, or review.

That makes the architecture feel “autonomous” right up until you need to trust it.

Hidden human cleanup

Many fake-team systems only work because a human is quietly doing the hard parts:

  • checking if the brief was actually usable
  • rewriting the prompt
  • deciding whether the output is publishable
  • resolving contradictions between agents
  • carrying the operational context the architecture failed to encode

In other words, the system is not autonomous. It is leaking work.

Token burn gets mistaken for collaboration

More agent dialogue often gets framed as deeper reasoning.

Sometimes it is. Often it is just a tax.

If the agents are exchanging information that could have been encoded once in a contract, schema, checklist, or queue state, the architecture is paying for coordination because it failed to design the process.

What breaks first

When fake-team architecture hits real workloads, the first cracks usually show up here:

  • the queue looks active, but nothing actually lands
  • the system cannot tell “waiting” from “stalled”
  • review becomes informal and inconsistent
  • outputs drift because the worker contract was never sharp
  • humans stop trusting the system and start bypassing it
  • cost rises faster than useful throughput

This is why I have become skeptical of agent-org-chart design as the default answer. The issue is not that multiple agents are always bad. The issue is that the human-team metaphor encourages weak engineering discipline.

What factory-style AI agent architecture looks like

The better model is not “one brilliant autonomous employee.”

It is a production system.

In factory-style AI agent architecture, work moves through explicit stages:

  1. intake
  2. routing
  3. bounded worker execution
  4. artifact emission
  5. review or QA
  6. rework or apply
  7. archive, metrics, and audit

That model sounds less romantic than “multi-agent collaboration,” but it is far more useful.
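
As a sketch, those stages can be an explicit state machine rather than an implied flow. The stage names follow the list above; splitting rework and apply into separate states, and the transition table itself, are illustrative assumptions:

```python
from enum import Enum


class Stage(Enum):
    INTAKE = 1
    ROUTING = 2
    EXECUTION = 3   # bounded worker execution
    ARTIFACT = 4    # artifact emission
    REVIEW = 5      # review or QA
    REWORK = 6
    APPLY = 7
    ARCHIVE = 8     # archive, metrics, and audit


# Legal moves only. Anything else is a bug, not a "conversation".
TRANSITIONS: dict[Stage, set[Stage]] = {
    Stage.INTAKE: {Stage.ROUTING},
    Stage.ROUTING: {Stage.EXECUTION},
    Stage.EXECUTION: {Stage.ARTIFACT},
    Stage.ARTIFACT: {Stage.REVIEW},
    Stage.REVIEW: {Stage.REWORK, Stage.APPLY},
    Stage.REWORK: {Stage.EXECUTION},   # rework re-enters execution
    Stage.APPLY: {Stage.ARCHIVE},
    Stage.ARCHIVE: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move an item to the next stage, refusing illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```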

The core components

Queue

A queue is where work becomes visible. It tells you what is waiting, what is active, what is blocked, and what is done.

Without a queue, agent systems often confuse conversation state with operating state.
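
One concrete payoff of queue state: timestamps let you tell "waiting" from "stalled", which conversation logs never will. A minimal sketch, with hypothetical field names and an arbitrary placeholder SLA:

```python
import time

STALL_THRESHOLD_SECONDS = 30 * 60  # arbitrary placeholder SLA


def stalled_items(queue: list[dict]) -> list[dict]:
    """Return active items that have made no progress within the SLA.

    Assumes each queue item carries a `status` and a `last_progress_at`
    epoch timestamp (hypothetical field names).
    """
    now = time.time()
    return [
        item for item in queue
        if item["status"] == "active"
        and now - item["last_progress_at"] > STALL_THRESHOLD_SECONDS
    ]
```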

Workcell

A workcell is a bounded worker or worker lane with one job. That job might still be broad, but it should have a clear contract.

Examples:

  • draft a support note from a brief and outline
  • review repo issues for conservative close proposals
  • fetch a specific evidence snapshot
  • reconcile a known ledger against live state

Contract

A contract defines what a worker receives and what it must produce.

Good contracts specify:

  • allowed inputs
  • required outputs
  • evidence expectations
  • safety limits
  • what happens on uncertainty
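
A contract can be a small typed object instead of prose buried in a prompt. A minimal sketch, assuming hypothetical field names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerContract:
    """What one workcell receives and what it must produce."""
    name: str
    allowed_inputs: tuple[str, ...]    # e.g. ("brief", "outline")
    required_outputs: tuple[str, ...]  # e.g. ("draft.md",)
    evidence_required: bool            # must outputs cite their sources?
    safety_limits: tuple[str, ...]     # hard boundaries on what it may do
    on_uncertainty: str                # "refuse" | "escalate" | "flag"


# Example: the support-note drafter from the workcell list above.
DRAFTER = WorkerContract(
    name="support-note-drafter",
    allowed_inputs=("brief", "outline"),
    required_outputs=("draft.md",),
    evidence_required=True,
    safety_limits=("no publication", "no external calls"),
    on_uncertainty="escalate",
)
```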

Artifact handoff

Each meaningful step should leave behind something real:

  • a draft
  • a structured decision object
  • a report
  • a patch
  • a review note
  • a state transition

That artifact is what makes review possible.

QA or review gate

Not every lane needs the same level of review, but every serious system needs an explicit answer to the question: who checks what before mutation, publication, or closure?

Rework lane

Weak output should not create architecture panic. It should go to rework cleanly.

That means the system needs a way to say:

  • this was incomplete
  • this violated the contract
  • this needs one more pass
  • this should be escalated to a human
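
Those four outcomes are enough to type the whole rework lane. A minimal sketch, with hypothetical names and an arbitrary retry limit:

```python
from enum import Enum


class ReworkReason(Enum):
    INCOMPLETE = "incomplete"
    CONTRACT_VIOLATION = "contract_violation"
    NEEDS_ANOTHER_PASS = "needs_another_pass"
    ESCALATE_TO_HUMAN = "escalate_to_human"


MAX_RETRIES = 2  # arbitrary placeholder


def route_rework(reason: ReworkReason, retries: int) -> str:
    """Decide where a weak output goes next, without a group discussion."""
    if reason is ReworkReason.ESCALATE_TO_HUMAN or retries >= MAX_RETRIES:
        return "human_exception_queue"
    if reason is ReworkReason.CONTRACT_VIOLATION:
        return "rework_with_contract_attached"
    return "rework_queue"
```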

Apply or mutation lane

Whenever possible, proposal and mutation should be separate.

The worker that says “this seems correct” does not always need to be the worker that changes live state.

That one separation improves a shocking number of systems.
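
A minimal sketch of that separation. The revalidation here is a naive state-snapshot comparison standing in for whatever check your domain actually needs; all names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A durable 'this seems correct' artifact. Proposing mutates nothing."""
    target: str          # what would be mutated
    action: str          # what would be done to it
    evidence: str        # why the proposer believed it was safe
    observed_state: str  # snapshot of the state the proposal was based on


def still_holds(proposal: Proposal, live_state: str) -> bool:
    """Re-check the proposal against live state before any mutation."""
    return proposal.observed_state == live_state


def apply_proposal(proposal: Proposal, live_state: str) -> bool:
    """A separate worker applies only if the earlier proposal still holds."""
    if not still_holds(proposal, live_state):
        return False  # route back to review instead of mutating
    # ... perform the actual mutation here ...
    return True
```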

Team metaphor vs factory metaphor

Here is the comparison I find most useful:

| Design question | Fake-team answer | Factory answer |
| --- | --- | --- |
| How does work move? | agents chat | queue and state transitions |
| What does each worker own? | a role name | a contract and output shape |
| How do steps hand off? | conversation context | artifacts and explicit state |
| How is quality controlled? | another agent “reviews” | formal review/QA gate |
| How is failure handled? | more discussion | rework, retry, escalate |
| How is trust earned? | model seems smart | system is inspectable |
| How is cost managed? | hope the loop converges | limit coordination, bound surfaces |

The fake-team answer optimizes for narrative. The factory answer optimizes for operations.

Generalist vs specialized workers

This is where a lot of architecture debates get mushy.

Not every worker should be specialized, and not every worker should stay general.

The useful split is this:

Generalist worker

A generalist worker is good when:

  • the task family is broad
  • the work is still changing
  • the queue needs flexible interpretation
  • the main value is throughput and orchestration rather than one narrow repeated judgment

Generalist workers are often best as queue movers, drafters, or first-pass operators.

Specialized worker

A specialized worker is good when:

  • the task family repeats cleanly
  • trust boundaries matter a lot
  • the failure cost is high
  • a typed decision contract would reduce ambiguity
  • the same review/apply pattern happens over and over

Specialized workers are often best for narrow audits, controlled maintenance, conservative mutation, or governed classification.

The mistake is assuming one worker shape should do everything.

A healthy agent architecture often starts broad and then specializes where the queue proves the need.

What Symphony teaches about a generalist chassis

OpenAI Symphony is a useful case because it behaves like a generalist chassis.

At a high level, Symphony shows how to run an issue-driven async worker loop with:

  • a repo-owned workflow contract
  • isolated workspaces
  • reconciliation before fresh dispatch
  • explicit long-running orchestration state

That makes it a good fit for broader work queues where the system still needs flexibility.

The strongest design lesson is not “use Symphony exactly.” The strongest design lesson is that a real worker chassis needs operating doctrine, reconciliation, and bounded workspaces, not just a prompt and a webhook.
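
The reconciliation-before-dispatch idea is easy to state in code. This is not Symphony's actual API, just a hedged sketch of the pattern with hypothetical callables:

```python
def run_cycle(queue, workspace_for, dispatch):
    """One orchestration cycle: reconcile existing work, then dispatch new work.

    `queue`, `workspace_for`, and `dispatch` are hypothetical stand-ins for
    whatever the real chassis provides.
    """
    # 1. Reconcile: compare recorded state against each isolated workspace.
    for item in queue.in_flight():
        workspace = workspace_for(item)
        if workspace.finished():
            queue.mark_done(item, artifact=workspace.collect_artifact())
        elif workspace.dead():
            queue.mark_blocked(item, reason="worker died mid-run")
        # otherwise: genuinely still in progress, leave it alone

    # 2. Only then dispatch fresh work into clean workspaces.
    for item in queue.ready():
        dispatch(item, workspace_for(item))
```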

If you want the detailed teardown, go to OpenAI Symphony review: what it actually does.

What ClawSweeper teaches about a specialized worker

ClawSweeper is useful for the opposite reason.

It is a narrow, governed maintenance worker. It reviews issues and PRs, emits typed decisions, stores durable artifacts, and only mutates later if the earlier proposal still holds.

That makes ClawSweeper a strong specialized-worker proof case.

Its best lessons are:

  • proposal/apply separation
  • typed decision schemas
  • artifact-first auditability
  • self-audit as a first-class lane
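
A hedged sketch of a typed, durable decision in that spirit. This is the lesson applied generically, not ClawSweeper's actual schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MaintenanceDecision:
    """One typed, durable decision: reviewable now, applicable later."""
    item: str                  # e.g. an issue or PR identifier
    verdict: str               # "close" | "keep" | "needs_human" (illustrative)
    reasons: tuple[str, ...]   # evidence the verdict points at
    observed_state: str        # what the world looked like at decision time


def store_decision(decision: MaintenanceDecision, path: str) -> None:
    """Persist the decision as an artifact that audit and self-audit can read."""
    with open(path, "w") as f:
        json.dump(asdict(decision), f, indent=2)
```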

If you want the full teardown, go to ClawSweeper review: what it actually does.

How I would design AI agent architecture today

If I were designing a serious system from scratch, I would start with these rules.

1. Make queue state the real source of truth

Not chat logs. Not inferred vibes. The queue.

2. Give each worker one obvious contract

Every worker should know what it gets, what it emits, and when it should refuse or escalate.

3. Emit artifacts, not just conversation

If a step matters, it should leave behind something inspectable.

4. Separate proposal from mutation whenever trust matters

Especially for code changes, issue closures, content publication, or stateful system actions.

5. Add review and rework explicitly

Do not hide quality control inside another vague “review agent.” Make it real.

6. Specialize only after the queue teaches you where to specialize

Do not pre-bake a whole fake org chart. Let the repeated failure modes tell you where harder boundaries pay off.

7. Treat observability as part of the architecture

A system that cannot explain its own state is not mature, no matter how slick the demo looks.

That is the architecture I trust more: boring where it should be boring, explicit where it should be explicit, and far less interested in pretending to be a tiny human company.

When not to build agent factories

The factory model is powerful, but it is not always the right answer.

Do not overbuild it when:

  • the task is small enough for one direct tool call
  • the queue is too small to justify orchestration overhead
  • the outputs are too fuzzy to benefit from contracts
  • the failure cost is low and speed matters more than auditability
  • a normal script would solve the problem more cleanly

Agent factories are for repeated, governed work. Not for making everything look more autonomous than it is.

Conclusion

Most AI agent architecture fails because it copies human teams instead of designing production systems.

If you want systems that survive real use, think less about roleplay and more about queues, workcells, contracts, artifacts, review, and rework. That shift makes the architecture less flashy and much more trustworthy.

The headline is simple: build factories, not fake teams.

And if you want to see the two concrete proof cases this page routes through:

  • OpenAI Symphony review: what it actually does
  • ClawSweeper review: what it actually does

Ready-for-review summary

This owner-page version keeps the page broad, opinionated, and architecture-first. The mechanics playbook shaped it through answer-first framing, clear internal-link routing, visible role discipline, and a clean operator verdict on the first screen. The gold-page checklist shaped it by forcing one obvious page job, strong question-match heading structure, multiple concrete next-click paths, and a durable distinction between this authority page and the two narrower teardown support pages.
