
Building production software with six AI agents, zero human code, and a state machine that refuses to let bad code ship.

The Problem

AI can write code. That's no longer news. What AI struggles with is writing entire applications—the kind with architecture decisions, test suites, build pipelines, and documentation that all have to agree with each other.

Ask a single AI session to build a complex project and you'll hit these walls:

  • Context collapse. By the time you're debugging test #47, the AI has forgotten the architectural decisions it made 80,000 tokens ago.
  • Role confusion. The same AI that wrote the code is now reviewing it—and it's not going to challenge its own decisions.
  • No quality gates. There's nobody to reject bad output. The AI generates, you receive, and you debug.
  • Monolithic sessions. If anything fails halfway through, you start over.

Claude Orchestrator solves this by splitting the problem across six specialized agents, each with its own persona, tools, and system prompt. A state machine coordinates them, and human approval checkpoints prevent bad output from cascading downstream.

The core idea

Instead of one AI doing everything, six AIs each do one thing well. A Product Owner writes requirements. An Architect designs the system. A Story Author writes testable acceptance criteria. A Developer codes. An Executor runs tests. A Tester writes integration tests. Each agent reviews the previous agent's work.
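To make the division of labor concrete, here is a minimal sketch of how the roles and their artifacts line up. The class and most artifact names are illustrative (only requirements.md and architecture.md are named later in the text); this is not the orchestrator's actual code.

    from dataclasses import dataclass

    @dataclass
    class AgentRole:
        # Hypothetical role record: who the agent is and which artifact it owns
        name: str
        produces: str                 # artifact this agent writes
        reviews: str | None = None    # previous agent's artifact it critiques

    ROLES = [
        AgentRole("product_owner", produces="requirements.md"),
        AgentRole("architect",     produces="architecture.md", reviews="requirements.md"),
        AgentRole("story_author",  produces="stories.md",      reviews="architecture.md"),
        AgentRole("developer",     produces="src/",            reviews="stories.md"),
        AgentRole("executor",      produces="test_results.md", reviews="src/"),
        AgentRole("tester",        produces="tests/",          reviews="test_results.md"),
    ]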

How It Works

The Workflow State Machine

The orchestrator is a 17-state machine that drives six agents through a structured software development lifecycle. Every transition is deterministic—there's no ambiguity about what happens next.

PO → Approve → Architect → Approve → Stories → Dev → Execute → Review → Test → Execute → Final Review
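The state names below are illustrative except for DEV_WORKING and EXECUTOR_WORKING, which appear in the code excerpts that follow; what matters is the shape. Every state has exactly one successor, so advancing the workflow is a table lookup rather than a judgment call.

    from enum import Enum, auto

    class WorkflowState(Enum):
        PO_WORKING = auto()
        PO_APPROVAL = auto()
        ARCHITECT_WORKING = auto()
        ARCHITECT_APPROVAL = auto()
        STORY_WORKING = auto()
        DEV_WORKING = auto()
        EXECUTOR_WORKING = auto()
        REVIEW = auto()
        TESTER_WORKING = auto()
        TEST_EXECUTION = auto()
        FINAL_REVIEW = auto()
        # ...the real machine has 17 states (FAILED, COMPLETED, etc. omitted here)

    # Exactly one successor per state: no ambiguity about what happens next.
    NEXT_STATE = {
        WorkflowState.PO_WORKING: WorkflowState.PO_APPROVAL,
        WorkflowState.PO_APPROVAL: WorkflowState.ARCHITECT_WORKING,
        WorkflowState.ARCHITECT_WORKING: WorkflowState.ARCHITECT_APPROVAL,
        WorkflowState.ARCHITECT_APPROVAL: WorkflowState.STORY_WORKING,
        WorkflowState.STORY_WORKING: WorkflowState.DEV_WORKING,
        WorkflowState.DEV_WORKING: WorkflowState.EXECUTOR_WORKING,
        WorkflowState.EXECUTOR_WORKING: WorkflowState.REVIEW,
        WorkflowState.REVIEW: WorkflowState.TESTER_WORKING,
        WorkflowState.TESTER_WORKING: WorkflowState.TEST_EXECUTION,
        WorkflowState.TEST_EXECUTION: WorkflowState.FINAL_REVIEW,
    }

    def get_next_state(state: WorkflowState) -> WorkflowState:
        return NEXT_STATE[state]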

Each working state maps to exactly one agent. The orchestrator calls _execute_working_state(), which resolves artifacts, invokes the agent, and transitions to the next state based on the result:

orchestrator.py — state execution loop

def _execute_working_state(self, state, input_artifacts):
    agent_name = self.workflow.get_next_agent()

    # Record git state before Developer touches the project
    if self.improvement_mode and state == WorkflowState.DEV_WORKING:
        self._record_git_head()

    result = self._execute_agent(agent_name, input_artifacts)

    if result["status"] == "success":
        # Collect what the Developer changed via git diff
        if self.improvement_mode and state == WorkflowState.DEV_WORKING:
            self._collect_project_changes()

        next_state = self.workflow.get_next_state(state)
        self.workflow.transition(next_state)

Artifact Passing—Not Message Passing

Agents don't talk to each other through messages. They communicate through versioned files. The Product Owner writes requirements.md. The Architect reads it and writes architecture.md. The Developer reads both and writes source code. Each artifact is stored with metadata and version history.

The orchestrator resolves which artifacts each agent needs:

orchestrator.py — artifact resolution

def _build_agent_input(self, state):
    artifacts = {}

    if state == WorkflowState.DEV_WORKING:
        # Developer needs stories + architecture + skills + constraints
        artifacts["stories"] = self.artifact_store.list_artifacts(STORIES)[0]
        artifacts["architecture"] = self.artifact_store.list_artifacts(ARCHITECTURE)[0]
        artifacts["skills"] = self.artifact_store.list_artifacts(SKILL)[0]
        artifacts["constraints"] = self.artifact_store.list_artifacts(CONSTRAINTS)[0]

    return artifacts  # Agents receive filenames, not content

Key design decision

Agents receive filenames, not file content. Each agent reads the files it needs using its own tools. This keeps the orchestrator lightweight and lets agents decide how much context to load.

Phased Builds

For complex projects, the Architect splits work into phases. Each phase cycles through the full Dev → Execute → Review → Test loop independently. The orchestrator tracks phase state and passes cumulative artifacts forward, so Phase 3 builds on the code from Phases 1 and 2.
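A hedged sketch of that loop, with hypothetical helper names (run_state and its return shape are not the orchestrator's real API):

    def run_phased_build(orchestrator, phases):
        # Illustrative only: each phase cycles Dev -> Execute -> Review -> Test,
        # and its outputs are carried forward so later phases build on earlier ones.
        cumulative_artifacts = {}
        for number, phase in enumerate(phases, start=1):
            inputs = {**cumulative_artifacts, "phase_scope": phase}
            for state in ("DEV_WORKING", "EXECUTOR_WORKING", "REVIEW", "TESTER_WORKING"):
                result = orchestrator.run_state(state, inputs)  # hypothetical API
                if result["status"] != "success":
                    raise RuntimeError(f"Phase {number} failed in {state}")
            cumulative_artifacts.update(result["artifacts"])
        return cumulative_artifacts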

The portable_terminal project was built in 8 phases, each adding a layer of shell functionality.

Case Study: Portable Terminal

To validate the orchestrator, we pointed it at a non-trivial task: build a cross-platform terminal emulator in Rust with a Tauri frontend, implementing 23+ Unix shell commands from scratch, including piping, globbing, environment variables, tab completion, and command history.

One input description. Zero human-written code. Here's what came out.

23 Shell Commands
38 Rust Source Files
912 Total Tests
87% Test Pass Rate

What Was Built

Category | Commands Implemented | Source Files
File Operations | cat, cp, mv, rm, touch, mkdir, rmdir | 7 files (3-17 KB each)
Text Processing | grep, sort, head, tail, wc, diff | 6 files (11-24 KB each)
Navigation | ls, cd, pwd, find | 4 files (2-19 KB each)
Environment | echo, export, unset, env | 4 files (2-4.5 KB each)
Shell Features | help, history | 2 files (6-32 KB each)
Infrastructure | parser, pipeline, router, glob, completions | 14 core .rs files

Iteration History

The project evolved across 11 orchestrator sessions over 3 days. Here's how it progressed:

Feb 10 · Sessions 1-2
False starts

SSL connectivity issues and early termination. No code generated. Cost: 0 tokens wasted.

Feb 10-11 · Session 3 (Initial Build)
Full 8-phase build: 23 commands, 453 unit tests

2 hours 50 minutes. Product Owner generated 41K chars of requirements. Architect designed 8 phases. Developer wrote all 38 source files across 8 phases. Executor ran cargo test. 14 files exported to project root. Required one manual resume after a pause.

Feb 11 · Session 4 (Improvement #1)
Shell infrastructure: piping, redirections, globbing

Added pipeline execution, glob expansion, environment variable support, tab completion. Requirements expanded to 70K chars. Generated 8 additional integration test files. Test count jumped from 453 to 912.

Feb 12 · Sessions 5-7
Improvement mode debugging

Three sessions hit orchestrator bugs: path explosion (filenames exceeding Windows MAX_PATH), stale Claude session IDs, and SSL drops not triggering failure states. Each bug was fixed in the orchestrator code.

Feb 12 · Session 8 (Improvement #2)
Parser upgrade + infrastructure hardening

Largest architecture doc (49K chars). Upgraded command parser, improved piping, added cross-platform build scripts. Generated the most comprehensive skill profiles (12K developer, 15K tester).

Test Results Breakdown

Test Category | Pass | Fail | Pass Rate
Library unit tests (inline) | 453 | 0 | 100%
Integration tests (real impl) | 173 | 0 | 100%
Integration tests (stub impl) | 31 | 70 | 31%
Pre-existing failures | - | 2 | N/A
Total | 792 | 106 | 87%

The 100% pass rate on real unit and integration tests is the headline number. The 106 failures all trace back to test quality issues, not code bugs—stub test files that never called real code, and a few incomplete implementations.

Defects Leaked to Production

After the orchestrator finished, a manual review identified 8 defects:

ID | Defect | Severity | Root Cause | Status
D-001 | Test stubs not replaced with real implementations | High | Tester agent | Partially fixed
D-002 | Rust lifetime errors in test code | High | Tester agent | Fixed
D-003 | Stdin not forwarded between piped commands | Critical | Developer TODO | Open
D-004 | Variable expansion bypassed for simple commands | High | Developer shortcut | Open
D-005 | No variable expansion in double-quoted strings | High | Developer incomplete | Open
D-006 | $SHELL not read-only | Medium | Developer incomplete | Open
D-007 | Duplicate test files (stub + real) | Low | Multi-phase artifact duplication | Open
D-008 | Pre-existing test failures | Low | Pre-existing | N/A

The critical defect

D-003 is the most revealing. The Developer agent implemented the entire pipeline architecture—parser recognition of |, pipeline struct, execution loop—but left a // TODO: Pass stdin to router on the function that forwards output between piped commands. The function accepts stdin as a parameter but silently discards it. All 18 piping tests fail because of this single TODO.

Pitfalls and Lessons Learned

1. AI Will Leave TODOs on Critical Code Paths

The Developer agent's most dangerous behavior is implementing around a hard problem. It built the entire piping architecture but left the actual stdin forwarding as a TODO. The code compiles. Some tests even pass (the ones that don't need piping). But the core feature doesn't work.

Fix applied: Added a CRITICAL section to the Developer prompt banning TODOs, requiring every parameter to be used, and prohibiting "fast path" shortcuts that bypass core logic.

developer_base.txt — the anti-TODO rule

## CRITICAL: No Incomplete Implementations

1. No TODOs, FIXMEs, or "implement later" comments for required functionality.
2. Every parameter must be used. A function that accepts `stdin` but silently discards it is a critical defect.
3. No shortcut code paths that bypass core logic.
4. No stub functions that return hardcoded values.
5. All code paths must work.

2. The Tester Agent Will Write Fake Tests

The Tester generated test files with helper functions like this:

// BAD: This "tests" nothing—it always returns empty string
fn execute_command(cmd: &str) -> String {
    String::new()  // Never calls real code
}

// 56 tests used this helper. All "passed." None tested anything.

Fix applied: Added a CRITICAL section to the Tester prompt requiring all helpers to import and call actual codebase functions, and requiring a compile check before marking tests complete.

3. The Story Author Approved Failing Code

The Story Author's prompt said "reject if any tests fail." But with 792/912 tests passing, it approved anyway. The 106 failures were buried in stub test files that it couldn't distinguish from real failures.

Fix applied: Added failure categorization (code bugs vs. test stubs vs. compilation errors) and a 95% pass-rate threshold to the Story Author prompt.
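The fix itself is a prompt change, but the same rule could be enforced mechanically. Here is a hedged sketch of such a gate; the 95% threshold and failure categories come from the fix, everything else is hypothetical:

    PASS_RATE_THRESHOLD = 0.95

    def review_verdict(results):
        # results: [{"name": ..., "passed": bool,
        #            "category": "code_bug" | "test_stub" | "compile_error"}]
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        failures = {}
        for r in results:
            if not r["passed"]:
                failures.setdefault(r["category"], []).append(r["name"])

        pass_rate = passed / total if total else 0.0
        # Any genuine code bug, or a pass rate below the threshold, means rejection.
        approved = not failures.get("code_bug") and pass_rate >= PASS_RATE_THRESHOLD
        return {"approved": approved, "pass_rate": pass_rate, "failures": failures}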

4. Windows Path Length Explosion

Improvement mode collects changed files via git diff and saves them as artifacts. The original code flattened paths by replacing / with _:

# Session 1: .orchestrator/sessions/old/code/file.rs
#   Saved as: .orchestrator_sessions_old_code_file.rs
# Session 2 collects Session 1's flattened files:
#   Saved as: .orchestrator_sessions_new_code_.orchestrator_sessions_old_code_file.rs
# Session 3: the name doubles again...
# Eventually: EXCEEDS WINDOWS 260-CHAR PATH LIMIT

Fix applied: Three changes: filter .orchestrator/ from git diffs, preserve directory structure instead of flattening, and add a safety filter in baseline loading.
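A minimal sketch of the fixed collection step, assuming hypothetical helper names (the real code lives in the orchestrator's improvement-mode path):

    from pathlib import Path

    def save_changed_file(repo_root: Path, artifact_dir: Path, rel_path: str):
        # Skip orchestrator-internal files so one session never re-collects another's output
        if rel_path.startswith(".orchestrator/"):
            return None
        # Mirror the original directory layout instead of flattening '/' into '_',
        # so filenames can never grow from session to session
        destination = artifact_dir / rel_path
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes((repo_root / rel_path).read_bytes())
        return destination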

5. Agent Failure Didn't Stop the Workflow

When an SSL connectivity drop caused the Story Author to fail, the orchestrator caught the error but didn't transition to FAILED. It continued to the documentation pass, wasting API calls on a broken session.

orchestrator.py — the fix

# Before: retry_current_step() returned False from working states,
# but the return value was ignored
retried = (self.workflow.can_retry()
           and self.workflow.retry_current_step())
if not retried:
    self.workflow.fail_workflow(result["message"])

6. Context Window Overflow in Doc Generation

The documentation pass inlined all artifacts into the prompt. For an 8-phase project, this totaled 5.9 million characters—far exceeding Claude's context window.

Fix applied: In improvement mode, skip artifact inlining (Claude reads files directly from the project root). Added a 600K character budget with truncation as a safety net for normal mode.
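A hedged sketch of the budgeted inlining; the 600K figure comes from the fix, but the function and artifact format are assumptions:

    CHAR_BUDGET = 600_000  # safety net for normal mode

    def inline_artifacts(artifacts, improvement_mode):
        # artifacts: {"requirements.md": "<contents>", ...}
        if improvement_mode:
            return ""  # Claude reads the files directly from the project root instead
        pieces, used = [], 0
        for name, text in artifacts.items():
            remaining = CHAR_BUDGET - used
            if remaining <= 0:
                pieces.append(f"[{name} omitted: character budget exhausted]")
                continue
            snippet = text[:remaining]
            if len(snippet) < len(text):
                snippet += f"\n[{name} truncated at {remaining:,} characters]"
            pieces.append(f"### {name}\n{snippet}")
            used += len(snippet)
        return "\n\n".join(pieces)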

Architecture Decisions That Worked

Agents as Isolated Claude CLI Sessions

Each agent runs as a separate claude CLI process with its own system prompt written to .claude/CLAUDE.md. This means agents don't share context—the Developer doesn't know what the Product Owner thought about, only what it wrote down in requirements.md. This forces communication through artifacts, which is exactly how real teams work.

base_agent.py — Claude CLI invocation

response = self.claude_cli.call(
    prompt=full_prompt,
    system_prompt=system_prompt,            # From .claude/CLAUDE.md
    working_dir=effective_dir,              # Session workspace or project root
    model=self.model,                       # "sonnet" by default
    allowed_tools=self._register_tools(),   # Per-agent tool restrictions
    timeout=None,                           # Wait indefinitely
)

Session Resumability

All workflow state persists in SQLite. If the process crashes, the network drops, or you close your laptop, you can resume from exactly where you left off:

$ orchestrator resume session-20260210-200304-22029dec
# Reloads state machine from DB, continues from EXECUTOR_WORKING
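Resuming amounts to reading the persisted state back out of SQLite. A hedged sketch follows; the table and column names are assumptions, not the orchestrator's real schema:

    import sqlite3

    def load_last_state(db_path, session_id):
        # Reload the last recorded workflow state for a session
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute(
                "SELECT current_state, current_phase FROM workflow_state WHERE session_id = ?",
                (session_id,),
            ).fetchone()
        finally:
            conn.close()
        if row is None:
            raise ValueError(f"Unknown session: {session_id}")
        current_state, current_phase = row
        return {"state": current_state, "phase": current_phase}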

Improvement Mode

The orchestrator can improve existing projects, not just build new ones. In improvement mode, build agents (Developer, Executor, Tester) work directly in the project root. The orchestrator snapshots git HEAD before the Developer runs and collects changes via git diff afterward. Regression testing compares new test results against the baseline.
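The snapshot-and-diff step can be pictured with plain git plumbing. In this hedged sketch the helper names echo _record_git_head and _collect_project_changes from the excerpt above, but the bodies are illustrative:

    import subprocess

    def record_git_head(project_root):
        # Snapshot the commit the Developer starts from
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=project_root, capture_output=True, text=True, check=True,
        ).stdout.strip()

    def collect_project_changes(project_root, baseline_head):
        # List files touched since the snapshot, ignoring the orchestrator's own directory
        diff = subprocess.run(
            ["git", "diff", "--name-only", baseline_head],
            cwd=project_root, capture_output=True, text=True, check=True,
        ).stdout
        return [p for p in diff.splitlines() if p and not p.startswith(".orchestrator/")]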

What's Next

Near-Term

  • TODO scanning as a workflow gate. After the Developer finishes, the orchestrator should scan for TODO/FIXME comments on code paths required by acceptance criteria. If found, reject and send back to the Developer—don't wait for tests to fail (a sketch follows this list).
  • Compilation gate in the Executor. The Executor prompt now requires a compile check before running tests, but this should be enforced at the orchestrator level: if cargo check fails, don't waste time running 912 tests.
  • Test deduplication. Multi-phase builds create duplicate test files (stubs from Phase 1, real tests from Phase 3). The artifact store should track test coverage by feature and replace stubs when real implementations arrive.
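A hedged sketch of the first item, the TODO scan; the regex and helper are illustrative, not existing orchestrator code:

    import re
    from pathlib import Path

    TODO_PATTERN = re.compile(r"(//|#)\s*(TODO|FIXME)", re.IGNORECASE)

    def scan_for_todos(src_root, extensions=(".rs", ".py", ".ts")):
        # Return (file, line number, text) for every TODO/FIXME left in the source tree;
        # the orchestrator would reject the Dev step if any hit touches a required code path.
        hits = []
        for path in Path(src_root).rglob("*"):
            if not path.is_file() or path.suffix not in extensions:
                continue
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
                if TODO_PATTERN.search(line):
                    hits.append((str(path), lineno, line.strip()))
        return hits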

Longer-Term

  • Parallel agent execution. The Developer and Tester could work in parallel if the test suite is structured correctly. Currently all agents are sequential.
  • Cost tracking. Each Claude CLI call returns token usage. The orchestrator should aggregate and display total cost per session and per phase.
  • Self-healing loops. When the Executor reports test failures, automatically route back to the Developer with the failure output. Currently this requires a Story Author rejection and manual re-entry.
  • Multi-model routing. Use Opus for architecture decisions and Haiku for simple file operations. Currently all agents use the same model.

By the Numbers

11 Sessions Run
6 AI Agents
17 Workflow States
8 Build Phases
23 Commands Built
38 Source Files
792 Tests Passing
8 Defects Found

Bottom line

A multi-agent orchestrator can build real, working software from a natural-language description. The 87% test pass rate isn't perfect—but the 100% pass rate on real (non-stub) tests shows the code itself is solid. The remaining defects are in the orchestrator's quality gates, not in the generated code's fundamental correctness. Every defect we found led to a prompt or workflow fix that prevents it from happening again.