You Cannot Audit a Probability: The Agentic AI Trust Wall
The agentic AI market is sorting itself out. What the mature version requires is different from what the demo version assumed.
News peg: The IMF’s agentic payments policy note (April 24, 2026) and Amazon’s AgentCore Payments launch (May 7, 2026).
The AI agent market is maturing, and maturation is uncomfortable. The gap between what agents do in demos and what they do in production has become impossible to ignore, and the developers closest to the problem are saying so out loud.
On May 11th, the top post on r/AI_Agents, from a developer with over 40 production deployments, put it plainly: stop building AI agents. Automations outperform agents in production. Agents hallucinate, resist auditing, and collapse on unexpected inputs. In regulated industries, compliance reviewers need deterministic audit trails. Autonomous black boxes fail that requirement. The post drew 376 upvotes and a thread full of agreement from people who had learned the same thing the hard way.
This is not a fringe opinion. Surveys of teams running agents in production show 60% operating with no formal governance framework. Thousands of applications have leaked sensitive data because agentic systems shipped without deterministic audit infrastructure underneath them. Y Combinator’s current Request for Startups names agents with audit trails, deterministic behavior, and compliance capabilities as explicit investment priorities: the VC tier is now funding the fix for a failure mode it is watching play out in real time.
The market is converging on a conclusion: the winners will be builders who ship reliable automations with proper guardrails, not the loudest agentic demos.
Why Auditability Is Harder Than It Looks
The production problems are not all unsolvable. Structured outputs, function calling, evaluation harnesses, and hybrid architectures that wrap LLM reasoning inside deterministic workflows have all made agents meaningfully more reliable. These approaches work, and teams shipping in regulated industries are using them.
The harder problem is what happens to the record of what occurred. When an LLM-orchestrated agent produces a log of its actions, that log is itself a language model output: a reconstruction, produced by the same probabilistic engine that executed the action. It can be internally consistent, confidently written, and factually wrong. Better reliability practices reduce the frequency of errors; they do not change the nature of the record.
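The distinction is easiest to see in code. Here is a minimal sketch in Python, with illustrative names rather than any particular agent framework’s API: the wrapper produces the record at the call boundary, at the moment of execution. An agent-generated audit log, by contrast, is just another model completion.

```python
import hashlib
import json
import time

# Infrastructure-layer capture: the record is produced by the wrapper
# at the moment of the call, not reconstructed by the model afterward.
execution_log = []

def logged_tool(fn):
    """Record every invocation at the call boundary, before and
    independent of anything the model later says about it."""
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        execution_log.append({
            "tool": fn.__name__,
            "args_sha256": hashlib.sha256(
                json.dumps([args, kwargs], sort_keys=True, default=str).encode()
            ).hexdigest(),
            "result_sha256": hashlib.sha256(
                json.dumps(result, sort_keys=True, default=str).encode()
            ).hexdigest(),
            "ts": time.time(),
        })
        return result
    return wrapper

@logged_tool
def transfer(account: str, amount_cents: int) -> dict:
    # Stand-in for a real side-effecting action.
    return {"status": "ok", "account": account, "amount_cents": amount_cents}

transfer("acct-123", 5000)

# Contrast: an agent-generated audit trail would be the output of
# something like llm.generate("Summarize the actions you took") --
# another probabilistic completion, not a capture.
print(json.dumps(execution_log, indent=2))
```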
The IMF’s recent policy framework gestured at this without quite landing on it. It proposed separating probabilistic AI decision-making from deterministic payment execution, which is a sound instinct. But the framework assumes that once execution happens, the record of what happened is trustworthy enough to trace backward through, and that assumption is exactly what the reconstruction problem undermines. AWS’s AgentCore Payments infrastructure, launched in preview this month, stores its logs in CloudWatch: the record is whatever the agent reports, observable but not independently verified.
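The separation the IMF note proposes can be sketched in the same spirit. Everything below is hypothetical: `ProposedPayment`, the allowlist, and the limit are stand-ins, not any real payments API. The model’s only job is to produce the structured proposal; acceptance, execution, and the record of both live in deterministic code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposedPayment:
    payee: str
    amount_cents: int
    memo: str

# Deterministic policy layer: the model proposes, this code decides.
ALLOWED_PAYEES = {"vendor-a", "vendor-b"}
PER_TXN_LIMIT_CENTS = 50_000

def execute_if_permitted(proposal: ProposedPayment) -> str:
    """Validate a model-proposed payment against fixed policy before
    execution. Rejections are as auditable as approvals."""
    if proposal.payee not in ALLOWED_PAYEES:
        return f"REJECTED: payee {proposal.payee!r} not on allowlist"
    if proposal.amount_cents > PER_TXN_LIMIT_CENTS:
        return f"REJECTED: {proposal.amount_cents} exceeds per-transaction limit"
    # Execution and its record both happen here, in deterministic code,
    # not in the model's narration of what it did.
    return f"EXECUTED: {proposal.amount_cents} cents to {proposal.payee}"

# The proposal would arrive from the LLM as structured output; stubbed here.
print(execute_if_permitted(ProposedPayment("vendor-a", 12_500, "invoice 114")))
print(execute_if_permitted(ProposedPayment("vendor-x", 12_500, "invoice 115")))
```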
Trusted Execution Environments address part of this. TEEs are hardware-isolated compute regions where remote attestation can cryptographically prove which model ran against which input. That closes a real gap: you can verify the execution environment was not tampered with. What TEE attestation does not cover is whether the log the model generated faithfully represents the sequence of decisions across a multi-step workflow. The hardware integrity is provable; the semantic accuracy of the record is not.
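Schematically, attestation verification reduces to a measurement check. The sketch below is deliberately simplified and the field names are invented for illustration: real quotes (SGX, SEV-SNP, TDX) are binary structures validated against vendor certificate chains.

```python
import hashlib

# Schematic only: remote attestation proves that a measurement (a hash
# covering the code and model loaded in the enclave) matches an
# expected value published by the operator.
EXPECTED_MEASUREMENT = hashlib.sha256(b"model-weights-v3 + runtime-v1.2").hexdigest()

def verify_attestation(quote: dict) -> bool:
    """Check that the enclave ran the expected code and weights.
    This says nothing about whether the log the model *generated*
    inside that enclave matches the decisions it actually made."""
    return quote.get("measurement") == EXPECTED_MEASUREMENT

quote = {"measurement": EXPECTED_MEASUREMENT, "report_data": "..."}
print(verify_attestation(quote))  # True: environment integrity proven
# The semantic accuracy of a multi-step workflow log remains unverified.
```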
This is the specific gap that structured outputs and better observability tooling do not reach. It is not an argument that those approaches are inadequate for most purposes. It is an argument that for regulated industries where the log is evidence, the reconstruction problem is still open.
What the Correction Selects For
MIT economist Christian Catalini, writing independently in Some Simple Economics of AGI, identifies verification-grade infrastructure as the foundational requirement of the agentic economy: systems where agent actions produce receipts that travel with the data, verifiable by any party. When that infrastructure is absent, agents exploit the gap between what is measured and what was intended. The endgame is legal and financial accountability: agent outputs that can be defended, insured, and adjudicated.
The legal pressure is already arriving. A class action alleging UnitedHealth’s AI model had a 90% error rate on appealed claim denials is in federal discovery, with tens of thousands of internal documents being produced. The 90% figure is an allegation, not an adjudicated finding, and UnitedHealth disputes it. But the shape of the case is instructive regardless of outcome: logs existed, but they were assembled after the fact from records that did not travel with the decisions, and a court order was required to surface them. The question is not whether logs exist. It is whether they are accurate and whether the organization can prove it.
The EU Product Liability Directive classifies AI as a product subject to strict liability, effective December 2026. Gartner tracks AI governance platform spending reaching roughly $500 million in 2026, rising steeply as regulatory pressure increases. The organizations that will navigate this cleanly are the ones that built for auditability before it became a legal requirement, not after.
What Verifiable Infrastructure Actually Requires
Solving the reconstruction problem means removing the reconstruction step. The record of what happened needs to be bound to the transaction at the moment of execution, before any language model generates a summary of it. Several architectural approaches point in this direction: deterministic workflow engines that log at the infrastructure layer rather than the application layer, blockchain-based ledgers where transaction records are independently verifiable, and cryptographic provenance systems where proof of what occurred travels with the asset itself rather than being stored separately.
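A minimal sketch of binding at execution, using nothing beyond the standard library: each record commits to the hash of its predecessor at write time, so a retroactive edit is detectable by any verifier holding the chain.

```python
import hashlib
import json

def append_record(chain: list, action: dict) -> None:
    """Bind the record to the transaction at execution time: each entry
    commits to the hash of the previous entry, so retroactive edits
    break the chain for any verifier."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"action": action, "prev": prev_hash}, sort_keys=True)
    chain.append({"action": action, "prev": prev_hash,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(chain: list) -> bool:
    """Recompute every link; any rewrite of history breaks it."""
    prev_hash = "0" * 64
    for entry in chain:
        body = json.dumps({"action": entry["action"], "prev": prev_hash},
                          sort_keys=True)
        if entry["prev"] != prev_hash or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

chain: list = []
append_record(chain, {"step": 1, "tool": "quote", "amount_cents": 5000})
append_record(chain, {"step": 2, "tool": "pay", "amount_cents": 5000})
print(verify(chain))                     # True
chain[0]["action"]["amount_cents"] = 1   # tamper with history
print(verify(chain))                     # False: the rewrite is detectable
```

A bare hash chain still needs an anchor before a third party can trust it: a signature over the head, a published checkpoint, or a ledger. The point of the sketch is where the binding happens, at write time rather than at summary time.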
The TODA-file protocol takes the last approach. Provenance is bound at execution and verifiable by any party with no intermediary. The formal proof underlying the protocol, published by researchers at Cambridge’s Centre for Redecentralisation, uses structural induction to demonstrate that double-spend is not merely unlikely but excluded by the system’s structure. That is a meaningful property for regulated use cases where the audit trail needs to hold up under legal scrutiny rather than just operational review.
We are a small operation, for now, relative to the infrastructure being built around x402 and AgentCore. Volume today is thousands of transactions per day against x402’s aggregated 169 million. We note that not because the comparison favors us, but because the question worth asking of any agentic payment infrastructure is not just how much it has processed, but what it can prove about each transaction when asked. That question will become harder to avoid as the UnitedHealth discovery process continues and the EU Product Liability Directive takes effect in December. The organizations positioned to answer it will be the ones that treated auditability as a design requirement rather than a compliance checkbox added afterward.
Sources
IMF Policy Note, How Agentic AI Will Reshape Payments, Sonja Davidovic and Hervé Tourpe, April 24, 2026.
Amazon Web Services, Agents That Transact: Introducing Amazon Bedrock AgentCore Payments, May 7, 2026. Launched in preview.
x402 Foundation / Linux Foundation press release, April 2, 2026. By May 7, 2026 (per Coinbase): 69,000 active agents, 169 million transactions.
Estate of Gene B. Lokken v. UnitedHealth Group, Case 0:23-cv-03514-JRT-SGE. The 90% error rate is an allegation in the plaintiff complaint, not an adjudicated finding. UnitedHealth disputes the characterization.
Gartner AI governance platform spending figures, 2026. Gartner projects the market reaching approximately $492 million in 2026, growing toward $1 billion by 2030, driven by regulatory requirements. [Available via Gartner subscription.]
EU Product Liability Directive (2024/2853): implementation by December 9, 2026; includes AI software as a “product” subject to strict liability.
TODA Rigs Architecture, Formal Proof, Cambridge University CRDC. Kris Coward and Dann Toliver, co-authors of Rigging Specifications (T.R.I.E., 2023).
Catalini, Hui & Wu, Some Simple Economics of AGI, 2025.
r/AI_Agents, May 11, 2026. Top post (376 upvotes): “Stop Building AI Agents.”
On TEEs and LLM attestation: OLLM (Trusted Execution Environments in Confidential AI, 2026); Attestable Audits (Verifiable AI Safety Benchmarks Using Trusted Execution Environments, arXiv 2025).