<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM Agents | Junfei Zhan's Website</title><link>https://junfei-z.github.io/tags/llm-agents/</link><atom:link href="https://junfei-z.github.io/tags/llm-agents/index.xml" rel="self" type="application/rss+xml"/><description>LLM Agents</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 17 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://junfei-z.github.io/media/icon_hu70bcee51a3cd7a7338014254a2e0c844_1401285_512x512_fill_lanczos_center_3.png</url><title>LLM Agents</title><link>https://junfei-z.github.io/tags/llm-agents/</link></image><item><title>Trains but Doesn't Learn: A Post-Training Delivery Benchmark for LLM Agents as Forward-Deployed Engineers</title><link>https://junfei-z.github.io/research/trains-but-doesnt-learn/</link><pubDate>Wed, 17 Jun 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/trains-but-doesnt-learn/</guid><description>&lt;a href="https://junfei-z.github.io/fde/" target="_blank">
&lt;img src="https://img.shields.io/badge/Interactive%20Demo-Open-2563eb?logo=googlechrome&amp;logoColor=white" alt="Demo">
&lt;/a>
&lt;p>📄 &lt;em>EMNLP 2026 submission — under review.&lt;/em>&lt;/p>
&lt;p>Post-training is becoming a service (PTaaS): a customer hands an operator data and a goal, and a &lt;strong>forward-deployed engineer&lt;/strong> (FDE) returns a fine-tuned, evaluated, and deployed model under a budget, a human-approval gate, and reproducibility requirements. The FDE is a natural target for automation, but seating an LLM agent in that seat raises a question existing benchmarks cannot answer: not whether an agent can &lt;em>raise a metric&lt;/em>, but whether it can be &lt;em>trusted to deliver&lt;/em>.&lt;/p>
&lt;h2 id="interactive-demo">Interactive Demo&lt;/h2>
&lt;p>The companion page walks through the benchmark and its three findings visually, with an interactive stage pipeline, a training-dashboard flip, and a pressure ladder:&lt;/p>
&lt;p>👉 &lt;a href="https://junfei-z.github.io/fde/">&lt;strong>Open the interactive demo&lt;/strong>&lt;/a>&lt;/p>
&lt;h2 id="the-benchmark">The Benchmark&lt;/h2>
&lt;p>We answer the operator&amp;rsquo;s question with a &lt;strong>governed delivery-plane benchmark&lt;/strong> that recasts the agentic FDE as a layered control plane. The agent drives ten governed stages (intake, plan, config, schedule, train, eval, register, deploy, cost, card), each scored pass/fail by a &lt;strong>de-looped oracle that never reads the trained model&lt;/strong>. The stages partition by &lt;em>where failure becomes visible&lt;/em> rather than by difficulty: a deterministic configurator certifies the arithmetic stages before any agent runs, so failure there is loud and cheap to catch, while the judgment stages are surfaced only by the delivery-level oracle. We run a flagship tier (Claude Opus 4.8, GPT-5.5, Gemini 3.1-Pro) and a cheaper tier on real &lt;strong>H200 and A40&lt;/strong> hardware, across &lt;strong>8B–70B&lt;/strong> open bases spanning three families and three seeds.&lt;/p>
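&lt;p>As a rough illustration of per-stage pass/fail scoring, the sketch below scores each of the ten stages from its logged artifact alone, so the scorer never receives the trained model. This is not the benchmark's code: &lt;code>score_delivery&lt;/code>, &lt;code>first_failure&lt;/code>, the artifact dictionaries, and the check functions are all hypothetical names chosen for this example.&lt;/p>

```python
from typing import Callable, Dict, Optional

# Stage names as listed in the post; everything else here is hypothetical.
STAGES = ("intake", "plan", "config", "schedule", "train",
          "eval", "register", "deploy", "cost", "card")

Artifact = Dict[str, object]          # assumed shape of a logged stage artifact
Check = Callable[[Artifact], bool]    # assumed per-stage pass/fail predicate


def score_delivery(artifacts: Dict[str, Artifact],
                   checks: Dict[str, Check]) -> Dict[str, bool]:
    """Score every stage pass/fail using only its logged artifact.

    The trained model is never passed in, mirroring the idea of an
    oracle that does not read the model it is judging.
    """
    results = {}
    for stage in STAGES:
        artifact = artifacts.get(stage)
        # A stage with no logged artifact counts as a failure.
        results[stage] = artifact is not None and bool(checks[stage](artifact))
    return results


def first_failure(results: Dict[str, bool]) -> Optional[str]:
    """Return the earliest failing stage in pipeline order, or None."""
    for stage in STAGES:
        if not results[stage]:
            return stage
    return None
```

&lt;p>With one check per stage, &lt;code>first_failure(score_delivery(artifacts, checks))&lt;/code> names the earliest stage at which a delivery went wrong, or returns &lt;code>None&lt;/code> when every stage passes.&lt;/p>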
&lt;h2 id="three-findings">Three Findings&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>A silent training failure that is real.&lt;/strong> An injected intake-misread reliably induces a run that &lt;strong>trains but doesn&amp;rsquo;t learn (TBDL)&lt;/strong> across every 8B–70B base — green on every in-process signal, yet delivering a base-level model. It runs to completion and burns the same GPU-hours as a correct delivery, so the operator pays the full bill (~$3/H200-hour) for a zero-value artifact. An anytime-valid clean-probe e-process flags severe cases &lt;strong>before payment&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>The risk is judgment, not arithmetic.&lt;/strong> Agents configure at near-parity with a deterministic configurator. Handing them the &lt;strong>oracle-correct configuration&lt;/strong> does &lt;em>not&lt;/em> repair a residual judgment deficit — the failure survives the &lt;em>do&lt;/em>-intervention and sits above the arithmetic.&lt;/li>
&lt;li>&lt;strong>Governance is pressure-fragile.&lt;/strong> Benign business pressure — deadlines, authority, sunk cost — collapses deploy-gate compliance while the agents&amp;rsquo; risk-detection stays at ceiling. They know the rule; they don&amp;rsquo;t keep it.&lt;/li>
&lt;/ul>
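&lt;p>To make the idea of an anytime-valid check concrete, here is a minimal sketch of a generic likelihood-ratio e-process over binary probe outcomes. It is a textbook construction, not the paper's clean-probe e-process: the null hypothesis (probe accuracy at least &lt;code>p_learned&lt;/code>), the base-level alternative &lt;code>p_base&lt;/code>, and the function name &lt;code>flag_step&lt;/code> are all assumptions made for illustration. Under that null the running product has expectation at most 1, so by Ville's inequality it reaches &lt;code>1 / alpha&lt;/code> with probability at most &lt;code>alpha&lt;/code>, no matter when monitoring stops.&lt;/p>

```python
from typing import Iterable, Optional


def flag_step(outcomes: Iterable[int], p_learned: float, p_base: float,
              alpha: float = 0.05) -> Optional[int]:
    """Return the first probe index at which the run is flagged, else None.

    outcomes  : 1 if the checkpoint answered a probe correctly, 0 otherwise
    p_learned : assumed minimum probe accuracy of a run that is learning (null)
    p_base    : assumed probe accuracy of the untouched base model (alternative)
    """
    if not (1 > p_learned > p_base > 0):
        raise ValueError("need 0 < p_base < p_learned < 1")
    if not (1 > alpha > 0):
        raise ValueError("need 0 < alpha < 1")

    threshold = 1.0 / alpha
    e_value = 1.0
    for step, correct in enumerate(outcomes, start=1):
        # Bet on the base-level alternative against the learning null.
        if correct:
            e_value *= p_base / p_learned
        else:
            e_value *= (1 - p_base) / (1 - p_learned)
        # Anytime-valid stopping rule: flag as soon as the evidence is large.
        if e_value >= threshold:
            return step
    return None
```

&lt;p>With &lt;code>p_learned=0.8&lt;/code>, &lt;code>p_base=0.2&lt;/code>, and &lt;code>alpha=0.05&lt;/code>, each missed probe multiplies the evidence by about 4, so three consecutive misses cross the threshold of 20, while a run that keeps answering correctly is never flagged.&lt;/p>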
&lt;h2 id="takeaway">Takeaway&lt;/h2>
&lt;p>The lesson generalizes: green signals do not certify the thing you care about. &lt;strong>A green training dashboard is a billing liability, not a delivery certificate.&lt;/strong>&lt;/p></description></item><item><title>Trains but Doesn't Learn: A Post-Training Delivery Benchmark for LLM Agents as Forward-Deployed Engineers</title><link>https://junfei-z.github.io/zh/research/trains-but-doesnt-learn/</link><pubDate>Wed, 17 Jun 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/zh/research/trains-but-doesnt-learn/</guid><description>&lt;a href="https://junfei-z.github.io/fde/" target="_blank">
&lt;img src="https://img.shields.io/badge/Interactive%20Demo-Open-2563eb?logo=googlechrome&amp;logoColor=white" alt="Demo">
&lt;/a>
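&lt;p>📄 &lt;em>EMNLP 2026 submission, under review.&lt;/em>&lt;/p>
&lt;p>Post-training is becoming a service (PTaaS): a customer hands an operator data and a goal, and a &lt;strong>forward-deployed engineer&lt;/strong> (FDE) delivers a fine-tuned, evaluated, and deployed model under a budget, a human-approval gate, and reproducibility requirements. The FDE is a natural target for automation, but putting an LLM agent in that seat raises a question existing benchmarks cannot answer: not whether the agent can &lt;em>raise a metric&lt;/em>, but whether it can be &lt;em>trusted to deliver&lt;/em>.&lt;/p>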
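&lt;h2 id="交互式-demo">Interactive Demo&lt;/h2>
&lt;p>The companion page walks through the whole benchmark and its three findings visually, with an interactive stage pipeline, a training-dashboard flip, and a pressure ladder:&lt;/p>
&lt;p>👉 &lt;a href="https://junfei-z.github.io/fde/">&lt;strong>Open the interactive demo&lt;/strong>&lt;/a>&lt;/p>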
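&lt;h2 id="基准设计">The Benchmark&lt;/h2>
&lt;p>We answer the operator&amp;rsquo;s question with a &lt;strong>governed delivery-plane benchmark&lt;/strong> that recasts the agentic FDE as a layered control plane. The agent drives ten governed stages end to end (intake, plan, config, schedule, train, eval, register, deploy, cost, card), each scored pass/fail by a &lt;strong>de-looped oracle that never reads the trained model&lt;/strong>. The stages are partitioned by &lt;em>where failure becomes visible&lt;/em> rather than by difficulty: a deterministic configurator certifies the arithmetic stages before any agent runs, so failure there is loud and cheap, while failures in the judgment stages are exposed only by the delivery-level oracle. We run a flagship tier (Claude Opus 4.8, GPT-5.5, Gemini 3.1-Pro) and a low-cost tier on real &lt;strong>H200 and A40&lt;/strong> hardware, across &lt;strong>8B–70B&lt;/strong> open bases spanning three families and three random seeds.&lt;/p>
&lt;h2 id="三大发现">Three Findings&lt;/h2>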
&lt;ul>
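&lt;li>&lt;strong>A silent training failure that is real.&lt;/strong> An injected intake misread reliably induces a run that &lt;strong>trains but doesn&amp;rsquo;t learn (TBDL)&lt;/strong> on every 8B–70B base: every in-process signal is green, yet the delivered model is no better than the base. The run goes to completion and burns the same GPU-hours as a correct delivery, so the operator pays the full bill (about $3 per H200-hour) for a zero-value artifact. An anytime-valid clean-probe e-process flags severe cases &lt;strong>before payment&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>The risk is judgment, not arithmetic.&lt;/strong> Agents configure at near-parity with the deterministic configurator. Handing them the &lt;strong>oracle-correct configuration&lt;/strong> directly does &lt;em>not&lt;/em> repair the residual judgment deficit: the failure survives the &lt;em>do&lt;/em>-intervention and sits above the arithmetic.&lt;/li>
&lt;li>&lt;strong>Governance is fragile under pressure.&lt;/strong> Benign business pressure (deadlines, authority, sunk cost) collapses deploy-gate compliance while the agents&amp;rsquo; risk detection stays at ceiling. They know the rule but do not follow it.&lt;/li>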
&lt;/ul>
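&lt;h2 id="结论">Takeaway&lt;/h2>
&lt;p>The lesson generalizes: green signals do not certify the thing you actually care about. &lt;strong>A fully green training dashboard is a billing liability, not a delivery certificate.&lt;/strong>&lt;/p></description></item></channel></rss>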