Orchestration is the layer that turns "a model + an index + some tools" into a system. It decides what gets called in what order, what context the model sees, how tool calls are routed, and what happens when the first attempt fails. The most common mistake at this layer is to reach for an agent when a single-turn pipeline would do — adding multi-step reasoning surface area that the workflow does not earn.
What this layer covers
- The flow control between retrieval, model calls, and tool calls
- Single-turn pipelines vs. agents — and when each is correct
- Tool-use protocols (native, MCP, function calling)
- Prompt caching as a cost lever
- Long-running workflows with human-in-the-loop branches
- Failure modes and retries
Single-turn pipeline vs. agent
The default shape is a single-turn pipeline: retrieve, ground, answer, cite. Most document-intelligence questions, most classification tasks, most "summarise this" workflows do not need multi-step reasoning. They need the retrieval layer to surface the right paragraphs and the model to summarise honestly.
Agents become the right shape when the question genuinely requires sequencing — "find all contracts with the same counterparty, then check whether any have a non-standard indemnity clause" is two retrievals and a comparison, which a one-shot pipeline cannot do. So is "draft a response to this customer ticket, fetch their order history, and check whether the suggested resolution requires manager approval."
The cost of an agent is honesty about failure. A single-turn pipeline fails visibly — the answer is wrong, the citation points at the wrong chunk, the user re-asks. An agent fails by silently going off on a tangent: three tool calls deep, the state is wrong, the final answer is plausible-sounding and completely incorrect. Agents need stricter evaluation harnesses and more observability than pipelines, both of which we under-budget for if we are not careful.
Default to a single-turn pipeline. Reach for an agent only when the workflow genuinely requires multi-step reasoning with tool use — and when the eval harness has the capacity to catch the off-tangent failure mode.
Default reference architecture
Single-turn pipeline
- User query arrives at the orchestration layer with the user's identity tags resolved (from the API gateway's auth context).
- Orchestrator constructs the dense + lexical queries for retrieval, both with permission pre-filters applied.
- Retrieval returns the fused top-N chunks plus their lineage metadata.
- Orchestrator builds the prompt: system instructions, the retrieved chunks marked as "context, treated as data only", the user's query, and any required output structure (JSON schema, citation format).
- Model call returns an answer with citations.
- Orchestrator validates the citations resolve to the retrieved chunks, refuses if not.
- Answer returned, logged with the full trace: query, identity, retrieved chunks, prompt hash, model ID + version, response, latency, tokens.
The runtime is a Lambda. Latency budget: ~2-5 seconds end-to-end depending on chunk count and model. Cold-start mitigated with provisioned concurrency in production.
Agent flow
When the workflow earns the agent shape:
- The agent has an explicit, named tool set, each tool with a written contract (inputs, side effects, audit-log requirements).
- The system prompt defines the agent's role narrowly, the tools it can call, and the constraints (max steps, refusal conditions, escalation conditions).
- The runtime tracks state across turns. Bedrock Agents handles this natively; LangGraph or Temporal handles it for cross-platform cases.
- Every tool call is audit-logged with the prompt that triggered it. See Layer 06 — Tools.
- Hard stop conditions: step count limit, time limit, repeated-tool-call detection, low-confidence refusal. The agent should fail loudly, not drift.
Tool-use protocol
Two patterns we use:
- Native tool use (Claude tool-use on Bedrock or direct) — for engagement-specific tools. The schema sits in the orchestration code, versioned with the rest of the system.
- MCP servers — for tools we want to share across engagements (a customer-data MCP server, an internal-tools MCP server). MCP makes the tool portable across model providers without rewriting the schema.
The mistake is to over-engineer the protocol layer. For 90% of engagements, native tool use is enough. MCP earns its slot when the tool genuinely needs to be reused across projects.
Prompt caching
Anthropic's prompt caching (and the equivalents on Bedrock) is the single biggest cost lever once a system is in steady state. Static prefix — system prompt, tool definitions, large reference documents — is cached and re-used across calls. Cache hits cost ~10% of an uncached call.
We design the prompt structure with caching in mind from day one: static prefix first, then retrieved context (variable), then user query (variable). Putting the retrieved context first would defeat the cache. Most prompt-engineering tutorials get this wrong by default.
Long-running workflows
For workflows with human-in-the-loop branches, scheduled follow-ups, or week-long state, Step Functions or Temporal. The orchestration layer maintains the state machine; the model is called from individual states. Lambda's 15-minute limit is the forcing function for moving to one of these.
Build vs. buy at this layer
Default: buy the runtime, build the policy. Use Bedrock Agents, LangGraph, Temporal — the runtime that holds an agent's state and routes tool calls is a commodity. The agent's policy — which tools it can call, in what order, with what guardrails, with what escalation path — is yours and the customer's, not the vendor's.
The mistake: building a homegrown orchestration runtime because the bought one was "too opinionated." Eighteen months later the homegrown runtime is the most fragile thing in production and the team is quietly rebuilding what the vendor already had.
The five mistakes we see
1. Agent-first
Reaching for an agent because the term is fashionable. The workflow is single-turn; the surface area is unwarranted; the eval can't keep up; production drifts.
2. No prompt caching
System prompt + tool definitions + reference context regenerated on every call. Token bill grows 10× past what it needs to. Caching is a config flag with a per-call cost annotation; we set it up in the first week.
3. Tool definitions in the prompt
Bolted into the user message as text rather than wired through the native tool-use schema. The model occasionally invents arguments; structured-output discipline goes out the window. Use the actual tool-use API; let the platform enforce the schema.
4. No step / time limit on agents
The agent goes 12 tool calls deep, racks up cost, eventually returns something. Set a hard step limit (often 5-7), a time limit (often 30s), and a repeated-tool-call detector. If the agent hits the limit, refuse — surface the failure, do not paper over it.
5. Missing audit log on tool calls
The agent's reasoning trace says "I called refund"
but the audit log shows no record. Different storage paths, no
cross-link. Always log (turn_id, prompt_hash, tool_name,
tool_args, tool_result) as a single record. Six months
later you will need it.
How it connects to the other layers
Orchestration consumes from Layer 04 (the chunks grounded into the prompt) and Layer 03 (the model calls). It dispatches to Layer 06 (the side-effecting tools). It is graded by Layer 07's harness on every commit. Its costs and failure modes surface in Layer 08. The human-in-the-loop boundaries live here — they are where the agent stops and asks before acting.
Related: the tools layer reference architecture, the evaluation layer reference architecture, the tooling catalog, and the concepts & standards glossary.