LLM Agents | Junfei Zhan's Website

Trains but Doesn't Learn: A Post-Training Delivery Benchmark for LLM Agents as Forward-Deployed Engineers

Wed, 17 Jun 2026 00:00:00 +0000

📄 EMNLP 2026 submission — under review.

Post-training is becoming a service (PTaaS): a customer hands an operator data and a goal, and a forward-deployed engineer (FDE) returns a fine-tuned, evaluated, and deployed model under a budget, a human-approval gate, and reproducibility requirements. The FDE is a natural target for automation, but putting an LLM agent in that seat raises a question existing benchmarks cannot answer: not whether an agent can raise a metric, but whether it can be trusted to deliver.

Interactive Demo

The companion page walks through the benchmark and its three findings visually, with an interactive stage pipeline, a training-dashboard flip, and a pressure ladder:

👉 Open the interactive demo

The Benchmark

We answer the operator’s question with a governed delivery-plane benchmark that recasts the agentic FDE as a layered control plane. The agent drives ten governed stages (intake, plan, config, schedule, train, eval, register, deploy, cost, card), each scored pass/fail by a de-looped oracle that never reads the trained model. The stages are partitioned by where failure becomes visible rather than by difficulty: a deterministic configurator certifies the arithmetic stages before any agent runs, so failure there is loud and cheap to catch, while failures in the judgment stages are surfaced only by the delivery-level oracle. We run a flagship tier (Claude Opus 4.8, GPT-5.5, Gemini 3.1-Pro) and a cheaper tier on real H200 and A40 hardware, across 8B–70B open base models from three families, with three seeds.
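The per-stage scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the stage names come from the text, while `StageResult`, `score_delivery`, `delivered`, and the dict-shaped artifacts are hypothetical names introduced here. The one property it tries to mirror is that each oracle sees only the stage's delivery artifact and never the trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stage names are taken from the text; everything else here is illustrative.
STAGES = ["intake", "plan", "config", "schedule", "train",
          "eval", "register", "deploy", "cost", "card"]

# Hypothetical oracle signature: it inspects a stage's delivery artifact
# (a plain dict in this sketch) and returns pass/fail. It is never handed
# the trained model.
Oracle = Callable[[dict], bool]


@dataclass(frozen=True)
class StageResult:
    stage: str
    passed: bool


def score_delivery(artifacts: Dict[str, dict],
                   oracles: Dict[str, Oracle]) -> List[StageResult]:
    """Score every stage pass/fail; a missing artifact counts as a fail."""
    results = []
    for stage in STAGES:
        artifact = artifacts.get(stage)
        passed = artifact is not None and oracles[stage](artifact)
        results.append(StageResult(stage, passed))
    return results


def delivered(results: List[StageResult]) -> bool:
    """In this sketch a delivery counts only if all ten stages pass."""
    return all(r.passed for r in results)
```

Which stages count as arithmetic and which as judgment is not listed in the text, so the sketch deliberately leaves that partition out. Treating "all stages pass" as the delivery criterion is also an assumption of the sketch.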

Three Findings

A silent training failure that is real. An injected intake-misread reliably induces a run that trains but doesn’t learn (TBDL) across every 8B–70B base — green on every in-process signal, yet delivering a base-level model. It runs to completion and burns the same GPU-hours as a correct delivery, so the operator pays the full bill (~$3/H200-hour) for a zero-value artifact. An anytime-valid clean-probe e-process flags severe cases before payment.
The risk is judgment, not arithmetic. Agents configure at near-parity with a deterministic configurator. Handing them the oracle-correct configuration does not repair a residual judgment deficit — the failure survives the do-intervention and sits above the arithmetic.
Governance is pressure-fragile. Benign business pressure — deadlines, authority, sunk cost — collapses deploy-gate compliance while the agents’ risk-detection stays at ceiling. They know the rule; they don’t keep it.
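To make the anytime-valid flag in the first finding concrete, here is a minimal sketch of one standard way to build an e-process. It is a generic likelihood-ratio construction and not necessarily the one used in the paper. The assumptions, all introduced here: each clean probe yields a binary outcome (1 if the checkpoint answers correctly), a healthy run passes probes with probability at least `p_healthy`, and a base-level model passes with probability `p_base`. Under the "healthy" null the running product is a nonnegative supermartingale, so by Ville's inequality the chance that it ever reaches `1 / alpha` is at most `alpha`, which is what allows stopping at any time.

```python
from typing import Iterable, Optional


def clean_probe_eprocess(outcomes: Iterable[int],
                         p_healthy: float = 0.8,
                         p_base: float = 0.5,
                         alpha: float = 0.05) -> Optional[int]:
    """Return the 1-based probe index at which the run is flagged, or None.

    Null hypothesis: the run is learning (probe success rate >= p_healthy).
    Alternative: the model is still at base level (success rate p_base).
    All parameter values are illustrative assumptions.
    """
    if not 0.0 < p_base < p_healthy < 1.0:
        raise ValueError("need 0 < p_base < p_healthy < 1")
    threshold = 1.0 / alpha
    e_value = 1.0
    for t, x in enumerate(outcomes, start=1):
        # Likelihood ratio of the base-level model against the healthy null.
        if x:
            e_value *= p_base / p_healthy
        else:
            e_value *= (1.0 - p_base) / (1.0 - p_healthy)
        if e_value >= threshold:
            return t  # enough evidence against "healthy": flag the run
    return None


# Four consecutive probe failures push the e-value past 1 / 0.05 = 20.
print(clean_probe_eprocess([0, 0, 0, 0, 0, 0]))  # prints 4
```

Because the guarantee holds at every step, the check can run after each probe and halt a run as soon as it fires, rather than waiting for a fixed sample size.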

Takeaway

The lesson generalizes: green signals do not certify the thing you care about. A green training dashboard is a billing liability, not a delivery certificate.

Trains but Doesn't Learn: A Post-Training Delivery Benchmark for LLM Agents as Forward-Deployed Engineers

Wed, 17 Jun 2026 00:00:00 +0000

📄 EMNLP 2026 submission, under review.

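Post-training is becoming a service (PTaaS): a customer hands data and a goal to an operator, and a forward-deployed engineer (FDE) delivers a fine-tuned, evaluated, and deployed model under a budget, a human-approval gate, and reproducibility requirements. The FDE is a natural target for automation, but putting an LLM agent in that position raises a question that existing benchmarks cannot answer: not whether the agent can push a metric up, but whether it can be trusted to deliver.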

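Interactive Demo

The companion page presents the whole benchmark and its three findings visually, with an interactive stage pipeline, a training-dashboard flip, and a pressure ladder:

👉 Open the interactive demo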

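Benchmark Design

We answer the operator's question with a governed delivery-plane benchmark that recasts the agentic FDE as a layered control plane. The agent drives ten governed stages end to end (intake, plan, config, schedule, train, eval, register, deploy, cost, card), and each stage is scored pass/fail by a de-looped oracle that never reads the trained model. The stages are partitioned by where failure becomes visible rather than by difficulty: a deterministic configurator can certify the arithmetic stages before any agent runs, so failure there is loud and cheap, while failures in the judgment stages are exposed only by the delivery-level oracle. We run a flagship tier (Claude Opus 4.8, GPT-5.5, Gemini 3.1-Pro) and a low-cost tier on real H200 and A40 hardware, across 8B–70B open base models spanning three families and three random seeds.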

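Three Findings

A silent training failure that is real. An injected intake misread reliably induces "trains but doesn't learn" (TBDL) on every 8B–70B base: every in-process signal is green, yet the delivered model is no different from the base. The run goes to completion and burns the same GPU-hours as a correct delivery, so the operator pays the full bill (about $3 per H200-hour) for a zero-value artifact. An anytime-valid clean-probe e-process can flag severe cases before payment.
The risk is judgment, not arithmetic. Agents configure almost on par with the deterministic configurator. Handing them the oracle-correct configuration directly does not repair the residual judgment deficit: the failure persists after the do-intervention and sits above the arithmetic.
Governance is fragile under pressure. Benign business pressure (deadlines, authority, sunk cost) breaks deploy-gate compliance while the agents' risk detection stays at ceiling. They know the rule but do not follow it.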

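Conclusion

The lesson generalizes: green signals do not certify what you actually care about. An all-green training dashboard is a billing liability, not a delivery certificate.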