Prototyping got cheap. Production trust did not.
That is the tension I keep seeing in enterprise AI conversations. A team points coding agents at a workflow and gets a working prototype in a week. The prototype contains code, often lots of it: API calls, glue scripts, generated UI, database changes, maybe even deployment files. It drafts emails, calls services, searches documents, and updates a CRM, and everyone in the room can suddenly see the business value. Then someone asks the boring questions: who is allowed to trigger this, what happens when it is wrong, where is the audit trail, how do we roll it back, and who owns the decision?
The demo usually has no answer.
That does not make the demo useless. It means the demo did its job. It discovered value. It did not prove production readiness.
AI-Generated Prototypes Are Discovery Tools
Coding agents, workflow builders, and quick internal scripts are excellent for one thing: making workflow value visible before a team burns months on architecture. They help domain experts explain what they actually do. They reveal missing data, ambiguous handoffs, hidden approval steps, and the places where “automation” really means “someone makes a judgment call in the middle”.
The important distinction is not whether the prototype contains code. Most do now. The distinction is whether the code has crossed the line from exploration into an owned, testable, observable, secure production system. Generated code can be a useful artifact. It is not automatically an architecture.
That matters. Brownfield AI projects rarely fail because the model API was hard to call. They fail because the workflow is messier than the prototype assumed.
In my own work, the hard parts usually show up around old systems, unclear ownership, inconsistent data, missing observability, and risk boundaries that were never written down because humans had been quietly handling the exceptions. A prototype can surface those problems quickly. It should not be promoted into production just because the first demo was convincing.
Why This Is Urgent Now
Two things changed at the same time.
First, AI agents are getting better at longer tasks. METR’s task-horizon research proposes measuring AI capability by the length of tasks agents can complete, and reports that the task length completed with 50% reliability has been doubling around every 7 months. That is a directional planning signal, not a deterministic forecast, but it is enough to change how architects should think.
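To make the doubling claim concrete, here is a minimal arithmetic sketch. It assumes a clean exponential with a fixed 7-month doubling period, which simplifies the reported trend, and the 1-hour starting horizon is a made-up illustrative number, not a METR figure.

```python
def projected_horizon(start_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task horizon assuming steady exponential doubling."""
    return start_hours * 2 ** (months_ahead / doubling_months)


# Hypothetical starting point: a 1-hour task horizon today.
for months in (0, 7, 14, 21):
    print(months, projected_horizon(1.0, months))
```

Under those assumptions the horizon grows eightfold in 21 months. Treat that as a planning range, not a prediction.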
Second, the ecosystem is standardizing. The Linux Foundation announced the Agentic AI Foundation with MCP, goose, and AGENTS.md as founding project contributions. The same announcement says AGENTS.md had already been adopted by more than 60,000 open source projects and agent frameworks.
That combination matters. Capability is rising, and interfaces are consolidating. The temptation will be to connect agents to more tools, more files, more systems, and more workflows. The architecture question is not “can we?” It is “under what boundaries?”
The Production Bottleneck Is Trust
When people say they want enterprise agents, they often mean one of two very different things.
Build-time agents help teams create software faster. They write code, run tests, refactor modules, and prepare pull requests.
Run-time agents execute business workflows. They read customer records, classify documents, trigger payments, send emails, update tickets, or make operational recommendations.
Those two categories have different risk profiles. A bad coding-agent patch can be reviewed before deployment. A bad run-time agent action may already have emailed a customer, leaked a file, or changed a production record.
That is why run-time agent architecture needs a policy boundary before it needs another prompt template.
A Practical Reference Architecture
I think about governed enterprise agents as seven layers, not one magic box.
The first layer is the policy boundary. It defines what the agent may see, what it may do, which actions require approval, and which actions are never allowed. This is where identity, authorization, data classification, and risk classes belong.
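A policy boundary can start as something very small. The sketch below is one possible shape, not a reference implementation: the `Policy` and `Decision` types and the action names are hypothetical, and a real system would also key decisions on identity and data classification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Policy:
    allowed: FrozenSet[str]          # actions the agent may take on its own
    needs_approval: FrozenSet[str]   # actions that need human signoff
    forbidden: FrozenSet[str]        # actions that are never allowed


def evaluate(policy: Policy, action: str) -> Decision:
    # Forbidden takes precedence over every other list.
    if action in policy.forbidden:
        return Decision.DENY
    if action in policy.needs_approval:
        return Decision.REQUIRE_APPROVAL
    if action in policy.allowed:
        return Decision.ALLOW
    # Anything the policy does not mention is denied by default.
    return Decision.DENY


# Hypothetical action names, for illustration only.
policy = Policy(
    allowed=frozenset({"ticket.add_comment"}),
    needs_approval=frozenset({"email.send_to_customer"}),
    forbidden=frozenset({"account.close"}),
)
```

The design choice that matters is the last line of `evaluate`: an action nobody thought about is refused rather than allowed.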
The second layer is orchestration. This decides which workflow runs, which tools are available, when the agent should call a model, when it should ask a human, and when it should stop.
The third layer is model routing. Not every task deserves the same model. Cheap models can classify low-risk inputs, stronger models can handle ambiguous reasoning, and high-risk steps can require verification by a second path.
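As a rough illustration of that routing idea, the sketch below picks a model class from a risk label and an ambiguity flag. The model names, the risk labels, and the `Route` type are placeholders I am assuming for the example, not a recommendation for any specific provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    verify_with_second_path: bool


def route(risk: str, ambiguous: bool) -> Route:
    # High or unrecognized risk gets the strongest path plus a second check.
    if risk not in {"low", "medium"}:
        return Route("strong-model", True)
    # Clear, low-risk inputs can go to the cheap model.
    if risk == "low" and not ambiguous:
        return Route("cheap-model", False)
    # Everything else (ambiguous or medium risk) gets the stronger model.
    return Route("strong-model", False)
```

Routing unknown risk labels to the strictest path keeps a labeling bug from silently lowering the bar.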
The fourth layer is retrieval and context. This includes documents, database records, previous decisions, and user-provided material. Context is also a security boundary because retrieved content can be wrong, stale, malicious, or outside the user’s permission scope.
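Two of those context risks, permission scope and staleness, can be filtered before anything reaches the model. The sketch below assumes a hypothetical `Doc` record with group-based permissions and an arbitrary one-year freshness limit. It does not address wrong or malicious content, which needs separate handling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_groups: FrozenSet[str]   # groups permitted to read this document
    updated_at: datetime


def filter_context(docs: Iterable[Doc], user_groups: FrozenSet[str],
                   now: datetime,
                   max_age: timedelta = timedelta(days=365)) -> List[Doc]:
    kept = []
    for doc in docs:
        if not (doc.allowed_groups & user_groups):
            continue  # the requesting user may not see this document
        if now - doc.updated_at > max_age:
            continue  # too old to trust without review
        kept.append(doc)
    return kept


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
docs = [
    Doc("fresh-ok", frozenset({"support"}), now - timedelta(days=10)),
    Doc("no-access", frozenset({"finance"}), now - timedelta(days=10)),
    Doc("stale", frozenset({"support"}), now - timedelta(days=800)),
]
visible = filter_context(docs, frozenset({"support"}), now)
```

The filter runs on the requesting user's groups, not the agent's service account, so retrieval cannot widen what the user is allowed to see.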
The fifth layer is tool access. Tools need least privilege, structured inputs, clear side-effect semantics, and deny-by-default behavior for destructive actions.
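Those four properties can be enforced in a thin wrapper around every tool call. This is a minimal sketch under my own assumptions: the `Tool` record, the `ToolDenied` exception, and the example tools are all hypothetical, and real input validation would check types and values, not just field names.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Mapping, Set


class ToolDenied(Exception):
    """Raised when a call falls outside what the agent was granted."""


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., object]
    fields: FrozenSet[str]   # exact argument names the tool accepts
    destructive: bool        # declared side-effect class


def call_tool(tool: Tool, granted: Set[str], args: Mapping[str, object],
              allow_destructive: bool = False) -> object:
    # Least privilege: the agent only reaches tools it was explicitly granted.
    if tool.name not in granted:
        raise ToolDenied(f"{tool.name} is not granted to this agent")
    # Destructive tools stay off unless the caller opts in for this call.
    if tool.destructive and not allow_destructive:
        raise ToolDenied(f"{tool.name} is destructive and was not enabled")
    # Structured inputs: reject missing or unexpected arguments.
    if set(args) != set(tool.fields):
        raise ValueError(f"{tool.name} expects exactly {sorted(tool.fields)}")
    return tool.func(**args)


# Hypothetical tools, for illustration only.
read_ticket = Tool("read_ticket", lambda ticket_id: {"id": ticket_id},
                   frozenset({"ticket_id"}), destructive=False)
delete_ticket = Tool("delete_ticket", lambda ticket_id: None,
                     frozenset({"ticket_id"}), destructive=True)
```

Granting `delete_ticket` is not enough to run it here. The caller must also pass `allow_destructive=True`, which is where an approval step would plug in.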
The sixth layer is verification. The system should check claims against ground truth before marking work complete. If an agent updates a ticket, did the ticket actually change? If it says a document contains a clause, can it cite the exact location? If it says a payment is ready, did the policy check pass?
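The ticket question can be sketched as a read-after-write check. The in-memory `store` and both update functions below are stand-ins I made up for the example. In a real system the read would go to the system of record, ideally through a path the agent does not control.

```python
from typing import Callable, Dict


def honest_update(store: Dict[str, dict], ticket_id: str, status: str) -> None:
    store[ticket_id]["status"] = status


def silent_failure(store: Dict[str, dict], ticket_id: str, status: str) -> None:
    # Simulates a tool call that returns normally but changes nothing.
    return None


def update_and_verify(store: Dict[str, dict], ticket_id: str, status: str,
                      apply_update: Callable[[Dict[str, dict], str, str], None]) -> bool:
    apply_update(store, ticket_id, status)
    # Do not trust the call's return; re-read the state and compare.
    observed = store.get(ticket_id, {}).get("status")
    return observed == status


store = {"T-1": {"status": "open"}}
ok = update_and_verify(store, "T-1", "closed", honest_update)
store_2 = {"T-1": {"status": "open"}}
not_ok = update_and_verify(store_2, "T-1", "closed", silent_failure)
```

Work is marked complete only when the observed state matches the intended state, regardless of what the agent reports.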
The seventh layer is telemetry. You need traces of inputs, retrieved context, tool calls, model outputs, approvals, costs, latency, failures, and overrides. Without that, you cannot debug the system, tune the thresholds, or defend the rollout.
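A trace does not need a heavy platform to start. The sketch below emits one JSON line per event with a shared trace id. The field names and event kinds are assumptions for illustration, and a production setup would more likely use an existing tracing standard than a hand-rolled format.

```python
import json
import time
import uuid
from typing import Callable


def new_trace_id() -> str:
    return uuid.uuid4().hex


def trace_event(trace_id: str, kind: str, payload: dict,
                clock: Callable[[], float] = time.time) -> str:
    """Serialize one event (tool call, approval, override, ...) as a JSON line."""
    record = {
        "trace_id": trace_id,   # ties all events of one run together
        "ts": clock(),          # seconds since the epoch
        "kind": kind,
        "payload": payload,
    }
    return json.dumps(record, sort_keys=True)


trace_id = new_trace_id()
lines = [
    trace_event(trace_id, "tool_call", {"tool": "read_ticket", "latency_ms": 42}),
    trace_event(trace_id, "approval", {"approved_by": "reviewer-1"}),
]
```

Because every line carries the same `trace_id`, a single run can be reconstructed later from plain log storage.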
Autonomy Should Match Blast Radius
The wrong debate is whether agents should be autonomous. The useful debate is how much autonomy each workflow deserves.
A low-risk internal summarization task might run fully automatically. A customer-facing response might require human approval until measured quality is high enough. A payment, contract change, policy exception, or account closure should have a much stricter path.
I use four rough tiers.
| Tier | Agent Role | Example | Required Control |
|---|---|---|---|
| Assist | Suggest only | Draft a response | Human decides |
| Prepare | Assemble evidence | Extract fields from documents | Human approves |
| Execute low-risk | Act inside narrow bounds | Update a non-critical ticket | Policy gate plus audit trail |
| Execute high-risk | Trigger costly action | Payment, cancellation, legal notice | Step-up approval and ground-truth verification |
The goal is not to keep humans in every loop forever. It is to increase autonomy only where measured error, business risk, and rollback ability justify it.
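One way to make the tier table enforceable is to encode it as data and check it before an action runs. The tier keys and control names below are my own shorthand for the table rows, not an established vocabulary.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Each tier maps to the controls that must be satisfied before the agent acts.
REQUIRED_CONTROLS: Dict[str, FrozenSet[str]] = {
    "assist": frozenset({"human_decision"}),
    "prepare": frozenset({"human_approval"}),
    "execute_low_risk": frozenset({"policy_gate", "audit_trail"}),
    "execute_high_risk": frozenset({"step_up_approval", "ground_truth_verification"}),
}


def missing_controls(tier: str, satisfied: Iterable[str]) -> Set[str]:
    """Return the controls still outstanding for this tier (empty means go)."""
    return set(REQUIRED_CONTROLS[tier]) - set(satisfied)
```

For example, `missing_controls("execute_low_risk", ["policy_gate"])` returns `{"audit_trail"}`, so the action is held until the audit record exists. An unknown tier raises `KeyError`, which fails closed.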
Agents Saying “Done” Is Not Evidence
The strongest argument for verification comes from failure cases.
The Agents of Chaos paper studied autonomous language-model-powered agents in a live lab environment with persistent memory, email, Discord access, file systems, and shell execution. The abstract reports unauthorized compliance with non-owners, sensitive information disclosure, destructive actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing, and cases where agents reported task completion while the underlying system state contradicted those reports.
That paper is an existence proof, not a prevalence benchmark. It does not prove every enterprise agent will fail that way. But it does prove these failure classes are realistic enough to deserve architecture controls.
Prompt injection adds another reason for caution. PromptArmor demonstrated a Claude Cowork file exfiltration attack where indirect prompt injection manipulated the agent into uploading files to an attacker-controlled account. Their writeup says no human approval was required in the demonstrated chain.
The lesson is not “never use agents”. The lesson is that agents need boundaries designed for hostile and accidental inputs.
Governance Is Delivery Reliability
Enterprise teams often treat governance as the department that says no. That is the wrong frame.
Good governance makes delivery safer and faster because it removes ambiguity. The team knows which data can be used, which tools are approved, which actions need human signoff, what evidence is required, and how incidents are handled.
The operating model needs named owners.
Product owns the workflow value and user behavior. Architecture owns system boundaries and integration patterns. Platform owns runtime, observability, deployment, and cost controls. Security owns identity, egress, dependency, and threat-model review. Domain experts own edge cases and approval policy. Legal or compliance owns regulated outcomes where needed.
If nobody owns a boundary, the agent will eventually cross it.
A 90-Day Rollout Pattern
I would not start with a platform-wide agent strategy. I would start with one workflow.
Days 0 to 30 are discovery and baseline. Pick one high-value, low-to-medium-risk use case. Measure the current process: time, quality, cost, error rate, escalation rate, user trust. Build the prototype to learn, not to launch. Capture the risks early.
Days 31 to 60 are architecture and controls. Add the policy boundary, access model, structured evaluation, scenario tests, observability, approval paths, and rollback plan. Define the autonomy tier explicitly.
Days 61 to 90 are pilot and decision. Run with one user group, one KPI set, and weekly review. Tune thresholds. Log incidents. Track overrides. At the end, decide whether to scale, redesign, or stop.
The stop option matters. A failed pilot that saves you from scaling a fragile workflow is a good outcome.
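The end-of-pilot decision is easier to defend if the rule is written down before the pilot starts. The sketch below is one hypothetical rule, not a recommended standard: the inputs, the 10% override threshold, and the mapping from signals to outcomes are all assumptions that each team would set for itself.

```python
def pilot_decision(kpi_met: bool, severe_incidents: int, override_rate: float,
                   max_override_rate: float = 0.10) -> str:
    """Return 'scale', 'redesign', or 'stop' from pilot measurements."""
    if not kpi_met:
        return "stop"       # the workflow did not deliver measured value
    if severe_incidents > 0 or override_rate > max_override_rate:
        return "redesign"   # value is there, but controls or quality are not
    return "scale"
```

With these assumed thresholds, a pilot that hit its KPI but had humans overriding a quarter of the agent's outputs would come back as `"redesign"`, not `"scale"`.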
The Real Blueprint
The pattern is simple, but not easy.
Move fast in discovery. Move carefully in production. Use prototypes to find value, then convert that value into contracts, controls, telemetry, and accountable operations.
The companies that win with enterprise agents will not be the ones with the flashiest demos. They will be the ones that know exactly where autonomy is useful, where it is unsafe, and how to prove the difference.
That is the actual architecture work.
This is Part 1 of a two-part series on production AI architecture. Part 2 covers build-time agentic engineering and how to use coding agents without creating chaos.