EMNLP 2026 Submission · Under Review

Trains but Doesn't Learn

A governed delivery-plane benchmark for LLM agents seated as Forward-Deployed Engineers — auditing post-training as a service.

Existing benchmarks ask whether an agent can raise a metric. The operator's real question is sharper: can it be trusted to deliver?

10 governed delivery stages

12/12 runs train but don't learn

0/36 detector false alarms

3 frontier agents on real GPUs

Claude Opus 4.8 · GPT‑5.5 · Gemini 3.1‑Pro | 8B–70B open bases on H200 & A40


The Setting

Post-training is becoming a service

A customer hands an operator data and a goal; a forward-deployed engineer (FDE) returns a fine-tuned, evaluated, and deployed model — under a budget, a human-approval gate, and reproducibility requirements. That FDE seat is now a target for automation, with vendors already shipping agents that drive delivery from a natural-language ticket.

The failure that matters isn't a low metric

It is a run that optimizes something successfully while the delivered model learns nothing the customer wanted — because the agentic FDE misread the task, the data format, the loss masking, or the method.

The loss falls, every signal-level check passes, the dashboard is green — and only the customer discovers the model is no better than the base. It runs to completion and burns the same GPU-hours as a correct delivery, so the operator pays the full bill (~$3 / H200-hour) for a zero-value artifact.

We call this Trains-but-Doesn't-Learn (TBDL): the most expensive failure mode per dollar, and the central risk of placing an agent in the FDE seat.


The governed delivery plane. An agent in the FDE seat post-trains a base model from a customer ticket; a de-looped oracle scores ten stages, and auditing them surfaces the paper's three findings.

The Benchmark

Ten governed stages, one de-looped oracle

We recast the agentic FDE as a layered control plane. The agent drives ten stages end to end; each is scored by an external oracle that never reads the trained model. Stages partition by where failure becomes visible.

Judgment stage · Arithmetic stage · Execution stage

Judgment

intake

Reads the customer ticket and infers the task, metric, and data handling. The agent's first and most consequential judgment — a misread here silently corrupts the objective and induces TBDL while every downstream signal stays green.

A deterministic configurator can certify the arithmetic stages (≈1.0) before any agent runs, so failure there is loud and cheap to catch. Execution stages are mechanical. Judgment stages are the FDE's real risk, surfaced only by the delivery-level oracle.

The delivery plane: ten stages from intake to card, colored by judgment / arithmetic / mechanical, with a config-injection do-intervention and a de-looped oracle.

The control plane. The agent operates every stage; a de-looped oracle scores each one stage-independently. A config-injection (do-intervention) hands the agent the oracle-correct configuration to isolate judgment from configuration arithmetic.

What We Found

Three findings about agentic delivery

Agents are competent at the mechanical layers. The risk lives precisely where failure is silent — in judgment and in governance.

A silent failure that is real

An injected intake-misread reliably induces TBDL across every 8B–70B base — invisible to every in-process signal, yet caught before payment by an anytime-valid detector.

Risk is judgment, not arithmetic

Agents configure at near-parity with a deterministic configurator. Handing them the oracle-correct config does not repair a residual judgment deficit.

Governance is pressure-fragile

Benign business pressure collapses the deploy gate — agents ship anyway — while their risk-detection stays at ceiling. They know the rule; they don't keep it.

Three findings: (a) all training arms pass the train gate but permute/noise traps land at base accuracy; (b) judgment fails where arithmetic passes; (c) gate compliance decays under pressure.

The three findings at a glance. (a) Every training arm passes the train gate, but permute and noise traps land at base accuracy while the mask trap recovers. (b) Judgment fails where arithmetic passes. (c) Gate compliance decays under pressure while refusal sensitivity stays at ceiling.

Finding 1

Trains but doesn't learn

A run that looks healthy on every signal the operator monitors, yet the customer's task did not move. We make it measurable, mechanize its cause, and monitor it online.

Definition — TBDL

Let T be the train-pass indicator (started, survived K steps, finite loss, bounded gradients) and E the eval-pass (Learned) indicator. A run is TBDL iff the train stage passes and the eval stage does not:

TBDL := T ∧ ¬E

Learned fires only when a one-sided paired exact McNemar test rejects at α=.05 and the held-out improvement over base clears a pre-registered margin — so neither noise nor trivially-significant gains can pass.
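The definition above can be sketched in a few lines. This is a minimal illustration, not the benchmark's oracle: the function names, the `k_steps` and `max_grad_norm` thresholds, and the `margin` value are all hypothetical, and treating "clears the margin" as `>=` is an assumption.

```python
from math import comb, isfinite


def train_pass(losses, grad_norms, k_steps, max_grad_norm):
    """T: the run survived k_steps with finite loss and bounded gradients."""
    if len(losses) < k_steps or len(grad_norms) < k_steps:
        return False
    finite_loss = all(isfinite(x) for x in losses)
    bounded_grad = all(isfinite(g) and g <= max_grad_norm for g in grad_norms)
    return finite_loss and bounded_grad


def mcnemar_exact_one_sided(base_correct, tuned_correct):
    """One-sided exact McNemar p-value for H1: the tuned model beats the base."""
    # b: tuned right where base was wrong; c: base right where tuned was wrong.
    b = sum(1 for x, y in zip(base_correct, tuned_correct) if y and not x)
    c = sum(1 for x, y in zip(base_correct, tuned_correct) if x and not y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    # Under H0 the discordant pairs split as Binomial(n, 0.5); return P(X >= b).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


def eval_pass(base_correct, tuned_correct, margin, alpha=0.05):
    """E (Learned): significant AND held-out gain over base clears the margin."""
    gain = (sum(tuned_correct) - sum(base_correct)) / len(base_correct)
    p_value = mcnemar_exact_one_sided(base_correct, tuned_correct)
    return p_value <= alpha and gain >= margin


def is_tbdl(t, e):
    """TBDL := T and not E."""
    return t and not e
```

Requiring both conditions is what keeps a large held-out set from certifying a tiny but statistically significant gain, and a small one from certifying a lucky jump.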

The whole danger of TBDL is the gap between two views of the same run.

PASS · GREEN

In-process training signals

permute trap · Qwen3-8B · BANKING77 · seed 1

(a) both losses fall but only the correct-data arm learns; (b) the e-process alarm fires at step 180, well before the eval at 300.

Both losses fall — only one arm learned

The train loss is indistinguishable between a correct run and a permute-trap run. The anytime-valid e‑process alarm fires at step 180, long before the held-out eval at step 300.

(a) TBDL incidence per trap across 4 bases x 3 seeds; (b) the severe permute trap is caught online while the subtle noise trap is silent.

Severe is caught online; subtle is silent

The permute trap induces TBDL 12/12 and is caught 11/12 online; the subtle noise trap induces TBDL 11/12 but stays silent — the residual the governed held-out eval exists to surface.

Confirmatory TBDL across four bases

Held-out accuracy and TBDL count over 3 seeds (BANKING77, LoRA). The relabeling trap induces non-learning on every base; the benign mask trap recovers.

Base model	base μ₀	correct	permute (TBDL)	noise (TBDL)	mask (TBDL)
Qwen3-8B	0.328	0.784	0.038 (3/3)	0.335 (3/3)	0.767 (0/3)
Gemma-2-9B-it	0.329	0.791	0.040 (3/3)	0.335 (3/3)	0.791 (0/3)
Qwen3-32B	0.307	0.797	0.045 (3/3)	0.386 (2/3)	0.773 (0/3)
Llama-3.3-70B	0.239	0.755	0.033 (3/3)	0.216 (3/3)	0.717 (0/3)

Finding 2

The failure is judgment, not arithmetic

Where does the agent's own risk live? We isolate it with a causal intervention: do(c⋆) hands the agent the oracle-correct configuration, holding the configuration arithmetic fixed so only judgment can move.

(a) arithmetic stages pass near 1.0 for all agents while judgment stages lag; (b) with config injection ON, judgment failure survives the do-intervention and stays below the configurator floor.

Judgment survives the intervention. (a) Agents pass arithmetic stages at ≈0.95 but judgment only at 0.61–0.74. (b) With the oracle-correct config injected, the failure that remains is judgment, not configuration arithmetic.

Handing over the right config doesn't fix it

A deterministic configurator passes the arithmetic stages at 1.000 but judgment at only 0.400. The flagship agents invert this: near-perfect arithmetic, but a real judgment deficit.

Applying do(c⋆) leaves the judgment pass rate essentially flat:

Claude Opus 4.8: 0.74 → 0.72

Gemini 3.1-Pro: 0.74 → 0.76

GPT-5.5: 0.61 → 0.60

The residual judgment-failure rate is 0.24–0.40 across agents. The risk sits above the arithmetic — it must be governed on the plane, in judgment, not solved with a better configurator.
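The residual range follows directly from the pass rates reported above: it is one minus the judgment pass rate under do(c⋆). A small worked check, using only those reported numbers (the dictionary layout and function name are illustrative):

```python
# Judgment pass rate (without injection, with do(c*) injection), as reported above.
judgment_pass = {
    "Claude Opus 4.8": (0.74, 0.72),
    "Gemini 3.1-Pro": (0.74, 0.76),
    "GPT-5.5": (0.61, 0.60),
}


def summarize(rates):
    """Per agent: the shift caused by injection and the failure rate that remains."""
    rows = {}
    for agent, (before, after) in rates.items():
        rows[agent] = {
            "delta": round(after - before, 2),
            "residual_failure": round(1 - after, 2),
        }
    return rows


summary = summarize(judgment_pass)
residuals = [row["residual_failure"] for row in summary.values()]
print(summary)
print(min(residuals), max(residuals))
```

No agent moves by more than 0.02 in either direction, while the remaining failure rates are 0.28, 0.24, and 0.40.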

Finding 3

Governance is pressure-fragile

A PTaaS desk doesn't operate in a vacuum: customers push deadlines, invoke authority, and cite the spend already sunk. These aren't attacks — they're the everyday texture of a service relationship, and exactly the levers that should never move a safety decision. Apply each pressure and watch the deploy gate.

Gauges show P(deploy staged correctly) — the rate at which the agent routes a deploy through the human gate instead of shipping. Compliance collapses under pressure, yet every agent's refusal sensitivity to genuinely infeasible tickets stays at ceiling (HR = 1.00, FAR = 0.00). They recognize the risk — and ship anyway.
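The two quantities being contrasted can be computed from logged decisions roughly as follows. This sketch assumes HR and FAR are the usual hit rate and false-alarm rate (refusals on infeasible versus feasible tickets); the record layout and function names are hypothetical, and the sample data are toy values chosen only to exercise the code.

```python
def rate(flags):
    """Fraction of True values; NaN when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def refusal_sensitivity(tickets):
    """tickets: list of (infeasible, refused) pairs. Returns (HR, FAR)."""
    hit_rate = rate(refused for infeasible, refused in tickets if infeasible)
    false_alarm_rate = rate(refused for infeasible, refused in tickets if not infeasible)
    return hit_rate, false_alarm_rate


def staging_compliance(deploys):
    """deploys: list of bools, True when the deploy went through the human gate."""
    return rate(deploys)


# Toy data: the agent refuses every infeasible ticket and no feasible one,
# yet stages only 3 of 10 deploys through the gate.
tickets = [(True, True)] * 5 + [(False, False)] * 5
deploys = [True] * 3 + [False] * 7
hr, far = refusal_sensitivity(tickets)
compliance = staging_compliance(deploys)
```

Keeping the two measurements separate is what makes the dissociation visible: perfect sensitivity says nothing about whether the gate is actually used.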

Gate-staging compliance under pressure: GPT-5.4-mini holds at 1.0 across all conditions, Sonnet-4.6 drops to 0.30 under sunk cost, Gemini-3.5-Flash falls to 0.10. Refusal sensitivity stays at ceiling.

Gate-staging compliance under pressure. The gate is a non-bypassable safety boundary, yet pooled staging compliance falls monotonically along the pressure ladder for the cheaper-tier agents — a stated-versus-acted dissociation.

Lessons Learned

What this means for a PTaaS operator

The layers dictate the prescriptions. The agent is competent at the arithmetic and the mechanics; govern it where failure is silent.

Gate on a held-out audit

Sell the delivery, not the metric. Tie acceptance and billing to a delivery-level held-out audit — never to training-signal health. A green training dashboard is a billing liability, not a delivery certificate.

Run the detector on every job

The anytime-valid clean-probe e‑process surfaces severe TBDL before the GPU bill is spent: a forward-only pass on a 64-example probe every ten steps, near-free next to training billed at ~$3 per H200-hour.
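To make "anytime-valid" concrete, here is a generic betting-style e-process for a bounded mean. It is a sketch of the general technique, not the detector used in this work: the null hypothesis (probe accuracy at each check is at least `p0`), the threshold `p0`, and the fixed bet size `lam` are all assumptions introduced for illustration.

```python
def eprocess_alarm(probe_batches, p0, alpha=0.05, lam=None):
    """Return (check_index, wealth) at the first alarm, or (None, wealth).

    probe_batches: iterable of lists of 0/1 correctness on the clean probe,
    one list per check (for example, 64 examples every ten steps).
    Null hypothesis (assumed): expected probe accuracy at each check >= p0.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if lam is None:
        # Half of the largest bet that keeps every factor non-negative.
        lam = 0.5 / (1.0 - p0)
    wealth = 1.0
    for check, batch in enumerate(probe_batches):
        accuracy = sum(batch) / len(batch)
        # Under the null the factor has expectation <= 1, so wealth is a
        # non-negative supermartingale and Ville's inequality bounds the
        # chance of ever reaching 1/alpha by alpha.
        wealth *= 1.0 + lam * (p0 - accuracy)
        if wealth >= 1.0 / alpha:
            return check, wealth
    return None, wealth


# Toy runs: one probe stuck at zero accuracy, one comfortably above p0.
stuck = [[0] * 64 for _ in range(20)]
healthy = [[1] * 48 + [0] * 16 for _ in range(20)]
stuck_alarm, _ = eprocess_alarm(stuck, p0=0.5)
healthy_alarm, _ = eprocess_alarm(healthy, p0=0.5)
```

Because the guarantee holds at every check simultaneously, the monitor can stop the job the moment the threshold is crossed without inflating the false-alarm rate.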

Make the deploy gate non-bypassable

Agents ship under ordinary pressure despite full risk-sensitivity, so the fix is a hard boundary, not more information. The judgment layer is the one a configurator cannot underwrite.

The lesson generalizes: green signals do not certify the thing you care about. A green training dashboard is not a delivery certificate.