AI code needs better proof
07 May 2026 · 6 min read
AI has made code generation cheap, and everyone is all in, but nobody is thinking about how the review cycle should change.
A lot of AI-first engineering teams are quietly walking into the same problem.
They want AI to write more code, but they still want humans to review every line like nothing has changed.
That won't hold for long.
AI has made code generation cheap. Really cheap. A change that used to take half a day can now appear in a few minutes, with tests, migrations, edge cases, docs, and a confident little summary telling you everything is fine (adorable).
But the review model hasn't changed much.
We still open the diff, scroll through every file, inspect every line, spot-check the tests, ask a few questions, then decide whether we trust it enough to merge.
That worked when writing code was the slow part of the system. It breaks down when code output explodes.
The bottleneck moved
For a long time the bottleneck was writing the code.
You had to understand the system, make the change, wire everything together, test it, clean it up, then open the PR.
AI compresses a lot of that. It doesn't remove the engineering, but it changes the shape of the work.
The bottleneck moves from writing to verification.
Can we trust this change? Can we prove it does what it claims? Can we prove it doesn't break the boring stuff around the edges? Can we understand the risk fast enough to keep moving?
Right now, the default answer is still: read the diff harder.
That doesn't scale.
Reading every line won't be enough
I'm not saying human review disappears. I don't think that's realistic, and honestly it would be a stupid hill to die on.
But relying on manual line-by-line review as the main safety mechanism feels increasingly fragile.
When AI can generate 5x or 10x more code, asking humans to inspect every line with the same level of attention just moves the bottleneck downstream. The queue gets longer, reviewers get more tired, and eventually the quality of review drops anyway.
There is also a slightly uncomfortable truth here: line-by-line review was never perfect.
People miss things. Reviewers skim. Familiar-looking code gets trusted. Tests that look sensible don't always prove much. A PR can feel clean and still ship a subtle bug straight into production with a little party hat on.
Manual review is useful, but it isn't magic. Smaller PRs get better reviews in my experience. Larger PRs get skimmed more, or reviewed with less confidence. AI amplifies that problem: either the PRs get bigger, or the number of small PRs explodes. Same problem, different shape.
Trust has to become an engineering problem
"Trust the AI more" is wrong.
Trust should be earned by the system.
If we want AI to write more of the code, we need much stronger ways to prove the code is safe enough to merge.
That means investing in the boring, expensive stuff that actually catches bugs:
- exhaustive tests that cover actual behaviour, not just happy paths
- synthetic user flows that exercise real product journeys
- integration environments that resemble production instead of a little toy version of it
- contracts, types, invariants, and boundaries that make bad changes harder to express (see the sketch after this list)
- automated reviewers that check consistency, security patterns, migrations, permissions, accessibility, and common footguns
- observability that makes weirdness obvious fast
- small blast radius releases, feature flags, canaries, and rollback paths
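To make the contracts-and-invariants point concrete, here's a minimal sketch in Python. Everything in it is hypothetical (the `Transfer` type and its fields are made up for illustration); the point is that an invariant enforced at the boundary fails loudly no matter who, or what, wrote the calling code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    # Hypothetical domain object, purely for illustration.
    amount_cents: int
    source_account: str
    dest_account: str

    def __post_init__(self) -> None:
        # Invariants live at the boundary. A generated change that
        # violates them blows up at construction time, in tests,
        # instead of relying on a tired reviewer to spot it.
        if self.amount_cents <= 0:
            raise ValueError("amount_cents must be positive")
        if self.source_account == self.dest_account:
            raise ValueError("source and destination must differ")
```

The narrower the set of states the code can even express, the less line-by-line reading the human layer has to do.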
That is where the work is.
A human reading the diff should be one layer in the safety stack, not the whole stack.
Human review moves up a level
The human job doesn't go away. It moves up a level.
What is this PR trying to do?
Is the approach sensible?
Does it fit the system?
What could break?
What evidence proves it works?
What happens if we're wrong?
That is a better use of senior engineering attention than manually checking 700 lines of AI-generated plumbing for the third time this week.
The review becomes more like approving a release packet: intent, risk, evidence, rollout, rollback.
The diff still matters. It just stops being the only thing we trust.
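As a rough sketch of what that packet might look like (the shape and field names here are my own invention, not an existing tool):

```python
from dataclasses import dataclass

@dataclass
class ReleasePacket:
    # Hypothetical structure; the fields mirror the list above.
    intent: str          # what the change is trying to do
    risk: str            # what could break, and how badly
    evidence: list[str]  # links to test runs, synthetic flows, check output
    rollout: str         # flags, canary percentages, staged plan
    rollback: str        # the exact path back if it goes wrong

def ready_for_human_review(packet: ReleasePacket) -> bool:
    # The reviewer approves the packet, not just the diff,
    # so every section has to be filled in first.
    return all([packet.intent, packet.risk, packet.evidence,
                packet.rollout, packet.rollback])
```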
Start in low-risk places
This shift doesn't need to start in the scariest production codebase.
It shouldn't.
Start with low-risk repos. Internal tools. Docs sites. Non-critical services. Small libraries with good boundaries. Code where the cost of being wrong is low and the feedback loop is fast.
Then measure what breaks.
Which checks caught real issues? Which ones were noisy? Where did the agent keep doing something weird? Which parts still needed proper human judgement? Which parts were safe to automate further?
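One lightweight way to do that measurement, sketched in Python with made-up check names and data:

```python
from collections import Counter

# Hypothetical outcomes from automated checks in a low-risk repo:
# (check name, whether the flagged issue turned out to be real).
outcomes = [
    ("migration-safety", True),
    ("migration-safety", False),
    ("a11y-lint", False),
    ("a11y-lint", False),
    ("permissions-diff", True),
]

total = Counter(name for name, _ in outcomes)
real = Counter(name for name, was_real in outcomes if was_real)

# Noisy checks get tuned or dropped; high-signal checks
# earn the right to gate merges on their own.
for name in total:
    print(f"{name}: {real[name]}/{total[name]} flags were real issues")
```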
Build the trust slowly.
That feels more sensible than pretending nothing changes until the pressure gets high enough that everyone quietly starts rubber-stamping PRs anyway.
Because that is the risk.
If teams don't build better verification systems, review quality will degrade quietly. People will skim more. They will trust summaries they shouldn't trust. They will approve because the queue is massive and the tests are green enough.
That is worse than admitting the model is changing and designing around it.
The mental block
I think a lot of this is just psychological.
For years, professional responsibility has been tied to reading and understanding the code before it ships. That instinct is good. We shouldn't throw it away because a model can type quickly.
But understanding a system doesn't have to mean personally inspecting every generated line forever.
At some point, if the surrounding proof is strong enough, merging AI-written code without line-by-line human review will become normal.
I don't know exactly when that happens. I also don't think it happens everywhere at once.
But it feels inevitable.
The teams that get there safely won't be the ones shouting about replacing engineers. They'll be the ones building better proof around the code.
The short version
AI coding pushes the real bottleneck into verification.
If we keep treating manual diff review as the centre of the safety model, we're going to slow everything down, or pretend we're reviewing things properly when we aren't.
The better path is to build systems where code can prove itself: tests, contracts, synthetic flows, observability, rollout controls, and evidence a human can review quickly.
We need to stop asking "did a human read every line?" and start asking "what proves this is safe to merge?"