OpenAI's newest Codex story begins in the least glamorous corner of artificial intelligence: tax corrections. On its news page, the company now lists "Building self-improving tax agents with Codex" as a May 27 engineering post. [1] Wednesday's paper argued that OpenAI and Google answered AI week with agents rather than papal grandeur. Thursday's OpenAI post gives the operational answer. The agent does not become useful by sounding autonomous. It becomes useful when a practitioner's correction turns into a test the system can climb. [2]
The post describes a collaboration among OpenAI, Thrive Holdings and Crete accountants to build Tax AI for a network of more than 30 accounting firms. [2] The product processed 7,000 tax returns across participating Crete firms during tax season, with practitioners preparing 1040 and 1041 returns from millions of underlying documents. [2] This is the kind of domain that punishes AI theater. A wrong field is not a funny hallucination. It is work someone must fix before a return can be filed.
OpenAI says Tax AI saves practitioners about a third of their time on tax preparation, drafts returns with up to 97 percent accuracy, and increases throughput by about 50 percent. [2] The more important claim is that the system improved measurably after launch. At launch, only a quarter of returns reached 75 percent correct field completion; within six weeks, 86 percent hit that threshold. [2] That is the article's real selling point: production use did not merely expose failures; it organized them.
The mechanism is a three-part loop. Practitioners correct the system and identify which errors matter. Product traces preserve the path from source material to extracted fields, downstream submission and expert correction. Codex then turns repeated, reviewed failures into tailored evals, scoped engineering tasks and candidate product changes. [2] In other words, expertise becomes evidence before it becomes automation.
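The step from corrections to evals can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the `Correction` record, the field names, the `eval_targets` function and the `min_repeats` threshold are all hypothetical. The post says only that repeated, reviewed failures become eval targets.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One practitioner correction captured in a product trace (hypothetical shape)."""
    return_id: str
    field: str       # e.g. "schedule_e.fair_rental_days" (illustrative name)
    predicted: str   # value the agent drafted
    filed: str       # value on the filed return
    reviewed: bool   # True once a human confirmed the difference matters


def eval_targets(corrections, min_repeats=3):
    """Return fields whose reviewed failures repeat often enough to justify an eval."""
    counts = Counter(
        c.field for c in corrections
        if c.reviewed and c.predicted != c.filed
    )
    return sorted(field for field, n in counts.items() if n >= min_repeats)


# Illustrative data only: three reviewed misses on one field, one unreviewed diff.
corrections = [
    Correction("r1", "schedule_e.fair_rental_days", "", "365", True),
    Correction("r2", "schedule_e.fair_rental_days", "", "180", True),
    Correction("r3", "schedule_e.fair_rental_days", "", "365", True),
    Correction("r4", "schedule_e.rents_received", "1200", "12000", False),
]
print(eval_targets(corrections))
```

The unreviewed difference is ignored even though it looks like an error, which is the point of the loop: nothing becomes a target until a human has said it matters.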
The rental-property example makes the abstraction concrete. Schedule E extraction has to interpret messy source material such as handwritten notes, emails, spreadsheets and client files, then map values into tax-engine concepts while preserving citations. [2] A difference between the agent's predicted value and the filed return might mean an extraction miss, a carried-forward value, a practitioner preference or expected workflow noise. The system must decide which differences are actually product failures. [2]
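A rough sketch shows why that decision is a classification problem rather than a simple comparison. The categories come from the post's own list; the `FieldDiff` inputs and the rules in `classify` are assumptions made for illustration, and a real system would need far more evidence than three signals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DiffKind(Enum):
    EXTRACTION_MISS = "extraction_miss"   # treated here as a product failure
    CARRIED_FORWARD = "carried_forward"   # value came from last year's return
    NEEDS_REVIEW = "needs_review"         # possibly a practitioner preference
    NOISE = "noise"                       # formatting-only difference


@dataclass(frozen=True)
class FieldDiff:
    """Hypothetical evidence about one predicted-versus-filed difference."""
    predicted: Optional[str]
    filed: str
    prior_year: Optional[str]
    filed_value_in_sources: bool  # filed value appears in the client's documents


def _norm(value):
    # Ignore thousands separators, surrounding whitespace and letter case.
    return (value or "").replace(",", "").strip().lower()


def classify(diff):
    """Assumed heuristic ordering: noise, then source evidence, then carry-forward."""
    if _norm(diff.predicted) == _norm(diff.filed):
        return DiffKind.NOISE
    if diff.filed_value_in_sources:
        return DiffKind.EXTRACTION_MISS
    if diff.prior_year is not None and _norm(diff.prior_year) == _norm(diff.filed):
        return DiffKind.CARRIED_FORWARD
    return DiffKind.NEEDS_REVIEW


print(classify(FieldDiff(None, "365", None, True)))
```

In this sketch only `EXTRACTION_MISS` would count against the product; everything the rules cannot explain is routed to a person instead of being guessed at.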
This is why the story belongs beside OpenAI's earlier Symphony post. Symphony is an open-source spec that turns a project-management board such as Linear into a control plane for coding agents, with workspaces, continuous runs and human review. [3] The tax post applies the same institutional instinct to production knowledge. It asks not whether an agent can complete a task in isolation, but whether an organization can feed the agent precise, reviewed, repeatable work.
The distinction is easy to miss because the word "self-improving" invites fantasy. OpenAI's post is more bounded than the phrase. It says a correction becomes useful only after repeated differences are reviewed and grouped into an actionable finding. Ambiguous cases route back to the product team instead of being forced through the loop. Engineers remain responsible for architecture, product decisions and shipping. Practitioners steer the loop through the corrections and approvals they already perform. [2]
That is less dramatic than the internet's favorite agent story and more credible. X wants the machine to wake up and improve itself. OpenAI's evidence shows something closer to a factory: source documents, product traces, review rows, eval targets, repos, tasks, regression suites and pull requests. [2] [3] The human does not disappear. The human's correction becomes structured enough that Codex can act on it.
The most revealing detail is the "hill to climb." If the eval pipeline flags that Tax AI repeatedly misses a fair-rental-days field while practitioners reliably fill it in, Codex can inspect the trace, eval, repo and skills together, then decide whether the problem is an unsupported field, a missed extraction pattern, a source-selection issue, a mapper gap or a grader flaw. [2] The agent is not asked to be generally smart. It is asked to reduce a named failure under validation.
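"Reducing a named failure under validation" can be read as an acceptance rule: a candidate change must raise the score on the targeted eval without lowering any other. The sketch below assumes that reading. The `accept_change` function, the eval names and the scores are invented for illustration and are not drawn from the post.

```python
def accept_change(baseline, candidate, target, tolerance=0.0):
    """Accept a candidate only if the target eval improves and no other eval regresses.

    `baseline` and `candidate` map eval names to scores between 0 and 1.
    """
    if candidate[target] <= baseline[target]:
        return False
    return all(
        candidate.get(name, 0.0) >= score - tolerance
        for name, score in baseline.items()
        if name != target
    )


# Invented scores: the first candidate climbs the hill cleanly,
# the second climbs higher but breaks another field on the way.
baseline = {"fair_rental_days": 0.40, "rents_received": 0.95}
clean_fix = {"fair_rental_days": 0.85, "rents_received": 0.95}
risky_fix = {"fair_rental_days": 0.90, "rents_received": 0.80}

print(accept_change(baseline, clean_fix, "fair_rental_days"))
print(accept_change(baseline, risky_fix, "fair_rental_days"))
```

An eval missing from the candidate's results is scored as zero, so a change that silently drops a check fails rather than passes.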
That connects directly to Symphony's thesis. In Symphony, OpenAI says the bottleneck shifted from coding agents' speed to human context switching, so the team moved work from sessions into issue-tracker objectives and agent workspaces. [3] In Tax AI, the bottleneck shifts from practitioner correction to product learning. The correction is already happening. The trick is to make it durable.
The mainstream frame will likely call this another enterprise-agent proof point. It is. But the better reading is narrower and more useful. OpenAI is showing how to turn a professional service into an eval-producing environment. That may be the most consequential enterprise AI pattern: not replacing the expert at first, but using the expert's ordinary corrections to teach the system what failure means.
There are obvious dangers. Tax work carries judgment, liability and client trust. A self-improving loop can launder bad assumptions if the review rows are wrong, the eval target is too narrow, or the organization starts treating throughput as quality. OpenAI's own description acknowledges the guardrail: ambiguous evidence routes back to humans and engineers still own shipping. [2] The reader should remember that guardrail when the marketing shortens the story.
For now, the document is more important than the slogan. OpenAI's news page shows Codex moving quickly across enterprise coding, tax agents, sandboxes and orchestration. [1] The tax post explains what that movement looks like inside a hard domain. [2] Symphony supplies the broader operating model for always-on agent work. [3] Together, they show an agent economy becoming less interested in chat and more interested in traces.
The next receipt to watch is whether OpenAI and its partners publish enough detail for outsiders to audit the loop: what corrections counted, how eval sets were built, which regressions blocked changes, and how practitioners contested the system when it learned the wrong lesson. Self-improvement is not magic. It is a governance problem wearing an engineering jacket.
-- DAVID CHEN, Beijing