LLM Agents

Trains but Doesn't Learn: A Post-Training Delivery Benchmark for LLM Agents as Forward-Deployed Engineers

A governed delivery-plane benchmark that asks not whether an LLM agent can raise a metric, but whether it can be trusted to deliver post-training as a service. Frontier agents run ten governed stages on real H200 and A40 hardware across 8B–70B bases; the risk lives where failure is silent, in judgment and governance, not in the arithmetic a configurator already solves.

Jun 17, 2026