Agentic Engineering Without Chaos

Code got cheap. Change risk did not.

That is the simplest way I can describe the shift from normal software engineering to agentic engineering. A coding agent can write a thousand lines, run the test suite, fix a failure, update docs, and open a patch faster than most humans can finish reading the ticket. That feels like productivity. Sometimes it is.

Sometimes it is just a faster way to create a mess.

The question is not whether coding agents are useful. They are. I use them constantly. The question is what operating model keeps their speed from turning into review overload, architecture drift, and false confidence.

Output Is Not The Bottleneck Anymore

Simon Willison puts the shift plainly in “Writing code is cheap now”: coding agents dramatically drop the cost of typing code into the computer. But he immediately draws the line that matters. Good code still has a cost.

Good code works. You know it works. It solves the right problem. It handles errors. It is simple enough to maintain. It has tests. It updates documentation where behavior changes. It preserves the future shape of the system.

None of that becomes free just because the first patch appears faster.

I see this most clearly in brownfield systems. A new module can look reasonable in isolation while violating an old boundary, duplicating an existing abstraction, missing a production edge case, or making a future migration harder. The agent does not feel the organizational cost of that drift. The team does.

The Speed Trap

The danger is not that agents make mistakes. Humans make mistakes too. The danger is volume.

Willison’s post on slowing down quotes Mario Zechner warning that agent mistakes accumulate faster because the human bottleneck has been removed. Willison calls the result cognitive debt: the codebase can evolve outside the team’s ability to reason about it.

That is the trap. You merge code faster than the team can understand the design consequences. Reviewers skim because the patch is too large. Tests pass but only cover the happy path. The agent adds a helper, then another helper, then a layer, then a configuration switch. Nobody intended the architecture to change. It just did.

The fix is not to ban agents. The fix is to put the bottleneck in the right place. Not typing. Thinking.

Proof Beats Speed

Willison’s rule in “Your job is to deliver code you have proven to work” should become the default norm for AI-assisted engineering. A change is not done because an agent says it is done. A change is done when the author provides evidence.

That evidence has two parts.

Manual proof shows the behavior in a real or realistic environment. For a CLI, that might be commands and output. For a UI, screenshots or a short recording. For an API, requests, responses, and database state. The point is simple: someone saw the thing work.

Automated proof protects the behavior after the merge. The test should fail if the implementation is reverted. If it would pass either way, it is not proof.

Agents can help produce both. In fact, they should. A useful instruction is not “implement this feature”. It is “implement this feature, prove it manually, add automated tests that fail before the fix, and summarize the evidence for review”.

Red First, Then Green

Test-first work is unusually effective with coding agents. Willison’s red/green TDD chapter explains why: write the automated tests first, confirm they fail, then implement until they pass.

That failing step matters. Without it, the agent can produce tests that pass for the wrong reason, or tests that do not exercise the new behavior at all.

In practice, I use red/green TDD most aggressively when the acceptance criteria are concrete: parsers, API behavior, permission checks, data transformations, regression fixes, and edge cases. It is less useful when the work is exploratory or visual, but even there the idea carries over. Create an observable failure before accepting the fix.
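The red/green loop can be sketched in a few lines. This is only an illustration: `parse_duration`, its stub, and the `check_parse_duration` test are hypothetical names invented for this example, not something from Willison’s chapter. The test is written first, run against a stub to confirm it fails, and only then run against the implementation.

```python
import re


def parse_duration_stub(text: str) -> int:
    """Placeholder that exists only so the test can run and fail (red)."""
    raise NotImplementedError


def parse_duration(text: str) -> int:
    """Convert strings such as '1h30m' or '45s' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def check_parse_duration(fn) -> bool:
    """The test, written before the implementation. Returns True when green."""
    try:
        if fn("1h30m") != 5400 or fn("45s") != 45:
            return False
    except NotImplementedError:
        return False
    try:
        fn("soon")  # invalid input must be rejected, not silently parsed
    except ValueError:
        return True
    return False


print("red:", check_parse_duration(parse_duration_stub))   # expected to fail
print("green:", check_parse_duration(parse_duration))      # expected to pass
```

The useful part is the first `print`: if the check reported green against the stub, the test would not be exercising the new behavior at all.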

Tests Are Necessary, Not Sufficient

StrongDM’s Software Factory writeup makes a useful point: tests can be reward hacked. If tests live in the same codebase and the agent is allowed to change them freely, it may make the tests match the implementation instead of making the implementation match the intent.

Their answer is scenario validation: end-to-end user stories, often stored outside the codebase, evaluated against observed behavior. They also use the idea of satisfaction, not just pass or fail, for workflows where the result is probabilistic.

Most teams do not need to copy that entire model. But the smaller lesson is practical: keep some validation outside the agent’s immediate editing path. That can be holdout examples, golden datasets, acceptance scripts, product-owner scenarios, or a separate evaluation suite.

If the agent can edit both the implementation and all of the evidence, you do not have independent evidence.
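One lightweight way to keep some evidence independent is to reject agent patches that touch the validation assets themselves. The sketch below assumes a hypothetical repository layout in which holdout material lives under `acceptance/` and `golden/`. Those directory names and the function are illustrative assumptions, not part of StrongDM’s model.

```python
from pathlib import PurePosixPath

# Hypothetical layout: directories the agent must not modify.
PROTECTED_DIRS = ("acceptance/", "golden/")


def touched_protected_paths(changed_files, protected=PROTECTED_DIRS):
    """Return the changed files that live under a protected directory.

    `changed_files` is a list of repository-relative paths, for example
    the file names reported for a patch by the version control system.
    """
    hits = []
    for name in changed_files:
        normalized = PurePosixPath(name).as_posix()  # drops a leading "./"
        if any(normalized.startswith(prefix) for prefix in protected):
            hits.append(normalized)
    return hits


patch_files = ["src/orders.py", "golden/orders.json"]
print(touched_protected_paths(patch_files))
```

A non-empty result would route the patch to a human rather than failing it outright, since some changes to holdout data are legitimate.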

Separate Contracts From Changes

One pattern I like is keeping stable behavior contracts separate from proposed changes.

The stable contract says what the system must do: API behavior, domain rules, security boundaries, data ownership, performance constraints, user-visible semantics. The proposed change says what will be modified, why, and how it affects the contract.

That split matters more with agents because agents are good at local implementation but weak at organizational memory. They can satisfy a narrow ticket while quietly violating a principle that lives in someone’s head.

Before a coding agent touches a significant brownfield area, I want it to read the contract. If no contract exists, writing a short one may be the most valuable part of the task.

This does not need to become heavyweight process. A useful change packet can be short:

Field	Question
Intent	What user or system behavior changes?
Boundary	Which module, API, or workflow owns this behavior?
Evidence	How will we prove it works?
Risk	What can break if the change is wrong?
Rollback	How do we undo it safely?

That is enough to keep the agent, reviewer, and future maintainer pointed at the same target.
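If the packet lives in code or in a pull request template, a small structure makes gaps visible. This is a minimal sketch under the assumption that each field is free text. The `ChangePacket` class and `missing_fields` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChangePacket:
    intent: str    # What user or system behavior changes?
    boundary: str  # Which module, API, or workflow owns this behavior?
    evidence: str  # How will we prove it works?
    risk: str      # What can break if the change is wrong?
    rollback: str  # How do we undo it safely?


def missing_fields(packet: ChangePacket) -> list[str]:
    """Return the names of fields left empty or whitespace-only."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name).strip()]


packet = ChangePacket(
    intent="Reject expired API tokens at the gateway",
    boundary="auth gateway",
    evidence="regression test plus a recorded curl session",
    risk="valid tokens rejected if clocks drift",
    rollback="",
)
print(missing_fields(packet))
```

The check only confirms that something was written in every field. Whether the answers are any good is still a review question.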

Add Supply-Chain Friction

AI-assisted teams tend to change dependencies casually. The agent sees a package, installs it, and moves on. That habit is dangerous.

The LiteLLM package compromise is a good reminder. A malicious PyPI release included credential-stealing behavior, and Willison notes it could collect secrets from SSH, AWS, Kubernetes, Docker, npm, database, shell history, and other local configuration paths.

One practical mitigation is release-age control. In “Package Managers Need to Cool Down”, Willison summarizes cooldown support across pnpm, Yarn, Bun, Deno, uv, pip, and npm. The idea is simple: do not install brand-new dependency releases immediately unless there is a reason.

Cooldowns do not solve supply-chain security. They buy time for the ecosystem to notice compromised releases. For AI-assisted engineering, that friction is valuable because agents otherwise optimize for immediate success.
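The rule behind a cooldown is small enough to state as code. The sketch below is not any package manager’s implementation or configuration. It assumes you already have a release’s publication timestamp from your registry, and the seven-day default is an arbitrary illustrative value.

```python
from datetime import datetime, timedelta, timezone


def past_cooldown(published_at: datetime, now: datetime,
                  min_age: timedelta = timedelta(days=7)) -> bool:
    """Return True when a release is old enough to install.

    Both timestamps must be timezone-aware so the subtraction is unambiguous.
    """
    if published_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    return now - published_at >= min_age


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
fresh = datetime(2026, 1, 8, tzinfo=timezone.utc)   # two days old
older = datetime(2025, 12, 1, tzinfo=timezone.utc)  # well past the window
print(past_cooldown(fresh, now), past_cooldown(older, now))
```

In practice the package manager’s own cooldown setting is preferable to a hand-rolled check. The function is here only to show how little logic the policy needs.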

The Anti-Chaos Control Stack

A practical control stack does not need to be complicated.

First, keep batches small. The patch should be small enough for a human to understand deeply. If the agent returns a giant diff, split it.

Second, require evidence. Every meaningful change should include manual proof, automated proof, or a clear explanation of why proof is not possible yet.

Third, protect architecture boundaries. Changes that alter APIs, data ownership, security assumptions, or long-term structure need human ownership before implementation, not after.

Fourth, slow dependency intake. Use lockfiles, provenance checks, release-age policies, and explicit review for new packages.

Fifth, make rollback boring. If the blast radius is meaningful, a change without a rollback path is not ready for production.

Sixth, archive decisions. Not every decision needs a formal ADR, but future engineers need to know why the agent-assisted change was accepted.
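The first control, small batches, is the easiest to automate. The sketch below counts changed lines in a git-style unified diff and flags patches over a threshold. The 400-line limit and the function names are assumptions for illustration, and the parser expects each file section to begin with a `diff ` header line as `git diff` produces.

```python
def changed_line_count(unified_diff: str) -> int:
    """Count added and removed lines inside the hunks of a git-style diff."""
    count = 0
    in_hunk = False
    for line in unified_diff.splitlines():
        if line.startswith("@@"):
            in_hunk = True       # hunk header: body lines follow
        elif line.startswith("diff "):
            in_hunk = False      # next file: skip its ---/+++ headers
        elif in_hunk and line.startswith(("+", "-")):
            count += 1
    return count


def needs_split(unified_diff: str, max_lines: int = 400) -> bool:
    """Return True when the patch is larger than the team's review budget."""
    return changed_line_count(unified_diff) > max_lines


sample = "\n".join([
    "diff --git a/app.py b/app.py",
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,2 +1,2 @@",
    " import os",
    "-x = 1",
    "+x = 2",
])
print(changed_line_count(sample), needs_split(sample))
```

Line count is a crude proxy for how hard a patch is to understand, so the threshold works better as a prompt to split than as a hard gate.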

Measure The Right Things

If you only measure lines of code, agents will look miraculous and your codebase may quietly decay.

Measure lead time, but pair it with change failure rate. Measure review load, not just review count. Track escaped defects. Track rollback frequency. Track how often AI-generated changes require human rework. Track dependency churn. Track how often reviewers say they cannot understand a patch.
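Several of these measures are simple ratios once changes are recorded consistently. The sketch below assumes a hypothetical `Change` record with four boolean flags. How those flags get populated (incident tooling, review labels, and so on) is outside the example.

```python
from dataclasses import dataclass


@dataclass
class Change:
    agent_authored: bool
    caused_incident: bool
    rolled_back: bool
    needed_rework: bool


def rate(changes, predicate) -> float:
    """Fraction of changes matching the predicate; 0.0 for an empty list."""
    changes = list(changes)
    if not changes:
        return 0.0
    return sum(1 for c in changes if predicate(c)) / len(changes)


def summarize(changes) -> dict:
    changes = list(changes)
    agent_changes = [c for c in changes if c.agent_authored]
    return {
        "change_failure_rate": rate(changes, lambda c: c.caused_incident),
        "rollback_rate": rate(changes, lambda c: c.rolled_back),
        # Rework is measured only over agent-authored changes.
        "agent_rework_rate": rate(agent_changes, lambda c: c.needed_rework),
    }


history = [
    Change(True, False, False, True),
    Change(True, True, True, False),
    Change(False, False, False, False),
    Change(False, False, False, False),
]
print(summarize(history))
```

Reading these next to lead time is the point: a falling lead time with a rising failure or rework rate is the speed trap showing up in the numbers.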

The important signal is not “agents made us faster”. The important signal is “agents helped us ship proven changes without increasing failure, review burden, or architectural drift”.

The Team Playbook

Here is the operating model I would start with.

For low-risk changes, let agents move quickly, but require tests or evidence in the final patch.

For medium-risk changes, require a short change packet before implementation: intent, boundary, evidence, risk, rollback.

For high-risk changes, keep architecture decisions human-owned. Let the agent explore options, generate tests, draft implementation, and produce evidence, but do not let it decide the boundary.

For dependency changes, require explicit review and release-age policy unless the package is already approved.

For large patches, split before review. Review is where quality is protected. If review becomes shallow, the process is already failing.
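The playbook above can be written down as a small policy table so that tooling and people read the same rules. This is a sketch under assumptions: the control names, the 400-line threshold, and the choice to make tiers cumulative (high risk also requires the medium-risk change packet) are illustrative, not prescribed by the text.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Assumed to be cumulative: each tier includes the controls of the one below.
BASE_CONTROLS = {
    Risk.LOW: {"evidence"},
    Risk.MEDIUM: {"evidence", "change_packet"},
    Risk.HIGH: {"evidence", "change_packet", "human_owned_architecture"},
}


def required_controls(risk: Risk, adds_unapproved_dependency: bool = False,
                      diff_lines: int = 0, max_diff_lines: int = 400) -> set:
    """Return the set of controls a change must satisfy before review."""
    controls = set(BASE_CONTROLS[risk])
    if adds_unapproved_dependency:
        controls.add("dependency_review")
    if diff_lines > max_diff_lines:
        controls.add("split_before_review")
    return controls


print(sorted(required_controls(Risk.MEDIUM, adds_unapproved_dependency=True)))
```

Deciding which tier a change belongs to remains a human call. The table only makes the consequences of that call explicit.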

What Good Looks Like

Good agentic engineering does not feel like chaos. It feels like a tighter loop between intent, implementation, proof, and review.

The agent writes more of the code. The human owns more of the judgment. The system records more of the evidence. The team moves faster because the work is clearer, not because everyone stopped looking.

The best teams will not be the ones generating the most code. They will be the ones proving the right changes fastest.

That is the difference between agentic engineering and agentic entropy.

This is Part 2 of a two-part series on production AI architecture. Part 1 covers runtime agents and the move from prototype discovery to governed production automation.