Agents in Production: Why Evaluation Matters More Than Model Choice

2026-05-04

agentic-ai, enterprise-ai, ai-strategy, governance, regulated-industries, ai-architecture


At a Glance

At PyCon DE & PyData 2026, teams are showing agent systems that have been running stably in production for three to six months, not just demos any more.
Three patterns connect those systems: strict context management, deterministic fallback paths, evaluation-coupled releases.
Three patterns reliably break: open-ended task specifications, free tool choice, test generation without a domain anchor.
The strategic consequence: in 2026/27, agent architectures should be measured by their harness, not by their model. Anyone putting the next euro into the base model is investing where the lever isn't.

From Demo to Production: A Quiet but Hard Cut

Something at PyCon DE & PyData 2026 was hard to capture in a headline: the tone of the agent talks has turned. Where in 2024 and 2025 almost every demo still opened with "look at what's possible", this year teams talked about "running since February", "had to be re-tuned three times", "fails in exactly these categories". That sounds unspectacular. It is the most important shift of the year.

What surfaced there was not new agent euphoria but its opposite: sobriety as a sign of maturity. The teams that ship don't talk about model magic any more. They talk about context budgets, fallbacks, tool boundaries and evaluation. In short: about the harness.

By harness here I mean the whole of context control, tool boundaries, eval pipeline, fallback logic and approval model: the operational layer that makes a model production-ready. It is at the same time technical architecture, governance frame and investment decision. Three patterns from the conference recurred in the systems that hold, and three in the ones that dazzle in the demo and reliably break in production.

Sebastian Raschka and Alexander C. S. Hendorf in the fireside chat "Stop Waiting, Start Shipping" at PyCon DE & PyData 2026

Three Patterns That Hold in 2026

Strict Context Management Instead of Open Memory

The systems that work treat context not as unlimited memory but as a scarce resource with explicit economics. They know what gets in, what stays out, and when it gets pruned. Sebastian Raschka put it dryly in the fireside chat: the secret behind working coding agents isn't the model: it's prompt and cache management. The repo history, the conversation, the plan: all of it has to be fed in, but not all at once. Without active curation, you build a system whose behaviour drifts from session to session. Context management, then, isn't a prompting technique. It is state management.
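
To make "context as a scarce resource" concrete, here is a minimal sketch of a budgeted context builder. It is not taken from any of the conference talks: the names `ContextItem`, `estimate_tokens` and `build_context` are hypothetical, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str             # illustrative label, e.g. "plan" or "repo_history"
    text: str
    priority: int         # higher means more important to keep
    pinned: bool = False  # pinned items are never pruned


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def build_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Select items under a fixed token budget, in a reproducible order."""
    pinned = [item for item in items if item.pinned]
    used = sum(estimate_tokens(item.text) for item in pinned)
    if used > budget:
        raise ValueError("pinned items alone exceed the context budget")

    selected = list(pinned)
    # Sort by priority (descending), then by name, so the same inputs
    # always produce the same context.
    candidates = sorted(
        (item for item in items if not item.pinned),
        key=lambda item: (-item.priority, item.name),
    )
    for item in candidates:
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            selected.append(item)
            used += cost
    return selected
```

The point of the sketch is the determinism: what gets in is decided by an explicit rule over an explicit budget, which is what makes this state management rather than prompting.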

Deterministic Fallback Paths

Every robust system has a path that works without an LLM. That isn't nostalgia; it is the honest acknowledgement that a language model does not become more available, cheaper or more explainable the deeper you bury it in the stack. In regulated contexts the point is non-negotiable: without a deterministic fallback there is no audit trail, no traceability, no sign-off. The fallback isn't the system's emergency exit. It is the proof that the system has been understood.
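
A minimal sketch of what such a path can look like, assuming a simple text-classification step. The labels, the keyword rules and the `llm_call` parameter are all hypothetical; the pattern is that the rule-based route works with no model at all, and that every result records which route produced it.

```python
from typing import Callable, Optional

# Hypothetical label set for illustration.
ALLOWED_LABELS = {"invoice", "complaint", "other"}


def rule_based_label(text: str) -> str:
    """Deterministic path: keyword rules that need no model."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "complain" in lowered or "refund" in lowered:
        return "complaint"
    return "other"


def classify(
    text: str, llm_call: Optional[Callable[[str], str]] = None
) -> tuple[str, str]:
    """Return (label, path); path records which route produced the label."""
    if llm_call is not None:
        try:
            label = llm_call(text).strip().lower()
            if label in ALLOWED_LABELS:
                return label, "llm"
        except Exception:
            pass  # any model failure falls through to the deterministic path
    # Reached when there is no model, the model failed, or its answer
    # was outside the allowed label set.
    return rule_based_label(text), "fallback"
```

Recording the path alongside the label is the part that matters for audit: it lets a reviewer see afterwards which decisions came from rules and which from the model.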

Evaluation-Coupled Releases

The production-stable teams don't release "when it looks good"; they release when the eval pipeline is green. That presupposes the pipeline exists, and in many programmes it is built late, often too late. Where eval discipline was built in from the start, the conference showed systems with clear version states and traceable improvement curves. Where it was missing, you saw teams that could no longer say with any precision when their system had actually got better. Eval is not quality assurance at the end. It is the only layer in which "better" has any meaning.
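
What "green" means has to be written down somewhere. The following sketch shows one way to do that, as a release gate that compares a candidate's eval results against the last released baseline. The names (`EvalResult`, `release_allowed`) and the thresholds are assumptions for illustration, not values reported by any team at the conference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    suite: str
    pass_rate: float  # fraction of eval cases passed, 0.0 to 1.0


def release_allowed(
    candidate: list[EvalResult],
    baseline: list[EvalResult],
    min_pass_rate: float = 0.9,    # assumed floor, tune per programme
    max_regression: float = 0.02,  # assumed tolerated drop per suite
) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). Only suites present in the baseline are
    checked; suites new in the candidate have no reference point yet."""
    reasons: list[str] = []
    cand = {r.suite: r.pass_rate for r in candidate}
    for result in baseline:
        if result.suite not in cand:
            reasons.append(f"{result.suite}: missing from candidate run")
            continue
        rate = cand[result.suite]
        if rate < min_pass_rate:
            reasons.append(
                f"{result.suite}: pass rate {rate:.2f} below floor {min_pass_rate:.2f}"
            )
        if result.pass_rate - rate > max_regression:
            reasons.append(
                f"{result.suite}: regressed from {result.pass_rate:.2f} to {rate:.2f}"
            )
    return (not reasons, reasons)
```

A gate like this is also what produces the "traceable improvement curve": each release carries a comparable set of numbers against the one before it.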

Three Patterns That Break in Production

Open-Ended Task Specifications

"Write tests for this module" produces tests. They are rarely good. Raschka put it plainly in the chat: agents have "no agency of their own". They respond precisely to precise instructions. Where open-ended tasks are set, you get shallow solutions, which look impressive in the demo because they do something, and fail in production because "something" is not enough.

Alina Dallmann dissected this precisely in her talk "Beyond Vibe-Coding: A Practitioner's Guide to Spec-Driven Development". Three recurring failure modes appear when you give the AI open-ended tasks: fragmented design decisions scattered across multiple chat sessions; prompt drift, where the conversation develops a life of its own; and hidden assumptions the model makes because no one stated them. Her conclusion is the same one that holds here as an architectural claim: the specification belongs before the code, not inside it. Task specification becomes architecture. Anyone who doesn't see that has a problem that isn't a model problem.

Free Tool Choice

When an agent gets to pick from an open toolbox, behaviour in practice becomes unpredictable: mostly elegant, occasionally catastrophic. Harald Nezbeda's talk "Building Secure Environments for CLI Code Agents" delivered concrete incidents from practice (more in the section "Trust Boundary as Contract, Not Slogan"). For non-critical applications, that spread is fine. For any regulated context, any critical pipeline, any automated operation against real data, it is an architecture that gets expensive sooner or later. The systems that hold restrict tool choice drastically, and they check every tool against a clear use-case contract.
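
One way to read "restrict tool choice and check every tool against a contract" is an allowlist in which each tool carries its own argument rules. The sketch below is an illustration under that reading; `ToolContract`, `ToolRegistry` and `ToolNotAllowed` are hypothetical names, not an API from any agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolNotAllowed(Exception):
    """Raised when a call falls outside the registered contracts."""


@dataclass(frozen=True)
class ToolContract:
    name: str
    func: Callable[..., Any]
    allowed_args: frozenset      # argument names the tool may receive
    validate: Callable[[dict], bool]  # use-case rule over the arguments


class ToolRegistry:
    def __init__(self, contracts: list[ToolContract]) -> None:
        self._contracts = {c.name: c for c in contracts}

    def call(self, name: str, args: dict) -> Any:
        contract = self._contracts.get(name)
        if contract is None:
            raise ToolNotAllowed(f"tool {name!r} is not on the allowlist")
        unexpected = set(args) - contract.allowed_args
        if unexpected:
            raise ToolNotAllowed(
                f"unexpected arguments for {name!r}: {sorted(unexpected)}"
            )
        if not contract.validate(args):
            raise ToolNotAllowed(f"arguments for {name!r} violate its contract")
        return contract.func(**args)
```

The agent never receives the underlying functions, only the registry. Anything not registered, or registered but called outside its contract, fails loudly instead of being improvised.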

Test Generation Without a Domain Anchor

The New York Times documented the case in April 2026: a financial services firm jumped from 25,000 to 250,000 lines of code per month with the AI coding tool Cursor. Within a short period, a review backlog of one million lines built up. Joni Klippert, co-founder and CEO of StackHawk (a security start-up working with the firm): "The sheer amount of code being delivered, and the increase in vulnerabilities, is something they can't keep up with." The consequence: senior software engineers are in urgent demand as reviewers, and the pressure cascades into sales, marketing and support, who have to keep pace. Tests that agents write are often shallow. Reviews that humans do don't scale by a factor of ten. That gap will not close on its own in 2026.

What This Means for Architecture Decisions

If the harness is the decisive layer, three concrete consequences follow for programmes starting in 2026:

Model choice becomes secondary

Cursor's Composer-3 (running in production) is one example of why: the production gain came from post-training on an open base model, not from the model choice. Accept this logic and your investment shifts from vendor comparison to harness engineering. The full case-study treatment of Composer-3 sits in the open-stack piece in this series.

Trust boundaries become explicit

Where can an agent act autonomously, where only suggest, where only inform? This is not a detail question. It is architecture. In regulated industries it defines the compliance frame. Everywhere else it decides whether the system can scale.

Review capacity becomes the bottleneck

The factor-of-ten jump in code volume happens automatically once agents write code in production. Reviewers to check it do not appear automatically. Programmes that don't address this in setup build themselves a piece of technical debt that costs more in twelve months than any savings ever return.
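
A back-of-the-envelope calculation shows how quickly this bites. The output figure and the backlog size are the ones from the New York Times case quoted earlier in this piece. The review capacity is my assumption, not a number from the report: I assume reviewers can still handle the old output of 25,000 lines per month and nothing more.

```python
import math
from typing import Optional


def months_until_backlog(
    output_per_month: int,
    review_capacity_per_month: int,
    backlog_limit: int,
) -> Optional[int]:
    """Whole months until unreviewed code reaches backlog_limit.

    Returns None if review capacity keeps up with output.
    """
    growth = output_per_month - review_capacity_per_month
    if growth <= 0:
        return None
    return math.ceil(backlog_limit / growth)


# 250,000 lines written per month, an assumed 25,000 reviewed per month:
# the backlog grows by 225,000 lines per month.
print(months_until_backlog(250_000, 25_000, 1_000_000))  # prints 5
```

Under that assumption the backlog passes one million lines during the fifth month. The exact number matters less than the shape: as long as output outruns review capacity, the backlog grows linearly and never shrinks.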

Trust Boundary as Contract, Not Slogan

Saying "trust boundaries" is easy; writing them as a contract is the actual work. This is exactly where most programmes stumble in 2025/26, not because the idea is wrong, but because it never gets made operational.

Concretely: for every agent step, three modes are worth distinguishing. Read and suggest (human decides), execute with downstream sign-off (four-eyes principle), or complete autonomously (no human in the loop). Which mode applies to which action is not a technical decision but an architecture and governance one. It has to be encoded in the pipeline, not in the prompt template.
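
"Encoded in the pipeline" can be as plain as a lookup table the orchestration code consults before anything runs. The sketch below assumes three hypothetical action names and a deliberately conservative default; none of this comes from a specific programme.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"                            # human decides
    EXECUTE_WITH_SIGNOFF = "execute_with_signoff"  # four-eyes principle
    AUTONOMOUS = "autonomous"                      # no human in the loop


# Hypothetical actions; a real table would list every agent step.
POLICY = {
    "summarise_ticket": Mode.AUTONOMOUS,
    "open_pull_request": Mode.EXECUTE_WITH_SIGNOFF,
    "change_customer_record": Mode.SUGGEST,
}


def mode_for(action: str) -> Mode:
    # Unlisted actions get the most restrictive mode, never autonomous.
    return POLICY.get(action, Mode.SUGGEST)


def disposition(action: str, signed_off: bool = False) -> str:
    """Decide what the pipeline does with the agent's output for an action."""
    mode = mode_for(action)
    if mode is Mode.AUTONOMOUS:
        return "apply"
    if mode is Mode.EXECUTE_WITH_SIGNOFF:
        return "apply" if signed_off else "await_signoff"
    return "propose_only"
```

The default in `mode_for` is the design choice that matters: an action nobody classified falls back to "suggest" rather than silently running as autonomous.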

The programmes delivering productively in 2026 have explicitly assigned these three modes for every agent step. In most cases the autonomous variant is excluded for more than half of the possible actions, and precisely that is what makes the rest defensible. Where this assignment is missing, every action runs implicitly as "autonomous" until an incident forces the discussion. From running architecture reviews I can add: where this mode assignment is explicitly part of the contractual setup of the programme, it becomes operational. Where it is carried along as an annex or as an implicit architectural decision, it collapses at the first stress moment.

Gabriela Bogk, CISO at Mobile.de, in her keynote "Honey, I vibe coded some crypto" at PyCon DE & PyData 2026, Darmstadt

Gabriela Bogk, CISO at Mobile.de and a long-time member of the Chaos Computer Club, captured this in her keynote "Honey, I vibe coded some crypto" with a formula that ought to become a contract clause in every architecture discussion: blast radius. The question that has to sit before every autonomous agent step is not "can the agent do this?", but "what is the worst that can happen if it gets it wrong, and can we absorb that?". Her own Claude Code setup runs in a VM with hand-curated API keys, no access to production data, and the code itself backed up in a Git repo. That is not paranoia, it is the translation of trust boundary into operational architecture.

Bogk's second point is central for regulated contexts: prompt-based guardrails are soft. "Everything that's prompt-based in terms of your guardrails is soft and can be worked around and is prone to injection attacks." Anyone implementing security through system prompts is building on sand. Hard-coded limits on tool permissions, filesystem access and API keys are the only load-bearing layer: the LLM sits on top, not underneath.
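
As an illustration of a hard limit that no prompt can talk its way around, here is a sketch of a filesystem guard that sits between the agent and the disk. The names `resolve_in_workspace` and `PathOutsideWorkspace` are hypothetical. The check runs in ordinary code, so an injected instruction can change what the model asks for but not what the guard allows.

```python
from pathlib import Path


class PathOutsideWorkspace(Exception):
    """Raised when a requested path escapes the agent's workspace."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve a path requested by the agent, or refuse it.

    resolve() normalises '..' segments and follows existing symlinks,
    so the check applies to where the path really points.
    """
    root = workspace.resolve()
    # Joining with an absolute path discards root, so absolute requests
    # such as "/etc/passwd" are caught by the same check.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathOutsideWorkspace(f"{requested!r} resolves outside {root}")
    return candidate
```

This covers only one of the three limits named above. Tool permissions and API keys need their own enforcement outside the model, and a guard like this complements container or VM isolation rather than replacing it.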

Harald Nezbeda in his talk "Building Secure Environments for CLI Code Agents" at PyCon DE & PyData 2026

Harald Nezbeda made the consequences very concrete in his talk "Building Secure Environments for CLI Code Agents". The risk profile of running a coding agent unsandboxed on a developer's machine falls into what Simon Willison calls the lethal trifecta: access to private data, the ability to communicate externally, and exposure to untrusted content. Documented incidents from real Claude Code use: wiped home directories, a crypto miner installed via a compromised npm package. His pattern for it: container isolation plus a man-in-the-middle proxy with its own SQLite-based observability. That isn't paranoid. In 2026 it is the minimum configuration whenever a coding agent goes into production or into regulated contexts. Having that conversation before an incident forces it is the discipline that is most expensive to skip in agent engineering in 2026.

So What

Reading the conference as "confirmation of the agent wave" misreads the picture. In 2026 agent programmes split into two camps: those with harness discipline and those without. The positioning statement "we are now also moving into agentic AI" (whether on board slides, in strategy papers or in the investor deck) is too cheap in 2026. It says something about the external presentation, nothing about the architecture.

The question that decides programmes over the next twelve months is more concrete: what is our harness, who builds it, how do we measure that it holds. Not the model choice, not the vendor, not the pilot budget. Take the harness seriously and you get agents that hold. Skip it and you get demos at production cost.

Is your agent programme built around the harness, or the next model?
