<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Small Language Models | Junfei Zhan's Website</title><link>https://junfei-z.github.io/tags/small-language-models/</link><atom:link href="https://junfei-z.github.io/tags/small-language-models/index.xml" rel="self" type="application/rss+xml"/><description>Small Language Models</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 22 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://junfei-z.github.io/media/icon_hu70bcee51a3cd7a7338014254a2e0c844_1401285_512x512_fill_lanczos_center_3.png</url><title>Small Language Models</title><link>https://junfei-z.github.io/tags/small-language-models/</link></image><item><title>Stochastic Power Modeling and Constrained MDP Optimization for On-Device SLM Inference</title><link>https://junfei-z.github.io/research/power_modeling/</link><pubDate>Mon, 22 Sep 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/power_modeling/</guid><description>&lt;p>📄 [ICASSP 2026 Submission] — In Review&lt;/p>
&lt;p>This research introduces a &lt;strong>stochastic and interpretable framework&lt;/strong> for sustainable &lt;strong>on-device inference of small language models (SLMs)&lt;/strong> under strict energy and hardware constraints. By capturing fine-grained CPU/GPU power dynamics and optimizing inference scheduling with constrained MDPs, the work provides a principled foundation for &lt;strong>adaptive, resource-aware AI at the edge&lt;/strong>.&lt;/p>
&lt;h2 id="problem-and-motivation">Problem and Motivation&lt;/h2>
&lt;p>Running SLMs locally on smartphones, laptops, or IoT nodes promises &lt;strong>low-latency and privacy-preserving AI services&lt;/strong>, but these devices face &lt;strong>finite battery budgets&lt;/strong> and &lt;strong>strict power caps&lt;/strong>. Traditional energy models fail to capture the stochastic, phase-wise CPU/GPU behaviors of SLM inference, making them unsuitable for &lt;strong>multi-task adaptive deployment&lt;/strong>.&lt;/p>
&lt;h2 id="technical-contributions">Technical Contributions&lt;/h2>
&lt;h3 id="1-hsmm-based-energy-modeling">1. HSMM-Based Energy Modeling&lt;/h3>
&lt;ul>
&lt;li>Conducted fine-grained power measurements of &lt;strong>Gemma2-2B&lt;/strong> and &lt;strong>Qwen3-4B&lt;/strong> on MT-Bench.&lt;/li>
&lt;li>Modeled CPU and GPU traces separately with &lt;strong>Hidden Semi-Markov Models (HSMMs)&lt;/strong>:
&lt;ul>
&lt;li>GPU: ramp-up, plateau, decay phases.&lt;/li>
&lt;li>CPU: low-load and high-load bursts.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Achieved &lt;strong>higher fidelity than HMM and TCN baselines&lt;/strong> in predicting power fluctuations.&lt;/li>
&lt;/ul>
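&lt;p>As a toy illustration of the modeling idea (not the paper's fitted model), a hidden semi-Markov process draws an explicit duration for each phase and then emits power samples from a phase-specific distribution; the phase names mirror the GPU phases above, but all durations and wattages are assumed for illustration:&lt;/p>

```python
import numpy as np

# Minimal generative sketch of a hidden semi-Markov power model.
# Parameters are illustrative assumptions, not measured values.
rng = np.random.default_rng(0)

phases = ["ramp_up", "plateau", "decay"]
mean_duration = {"ramp_up": 5, "plateau": 20, "decay": 4}        # timesteps
mean_power    = {"ramp_up": 6.0, "plateau": 11.0, "decay": 3.0}  # watts
power_std = 0.5

def sample_trace():
    """Sample one GPU power trace: each phase persists for an explicitly
    sampled duration, unlike an HMM whose dwell times are implicitly
    geometric -- this is what lets an HSMM capture long plateaus."""
    trace = []
    for phase in phases:
        duration = max(1, rng.poisson(mean_duration[phase]))
        trace.extend(rng.normal(mean_power[phase], power_std, size=duration))
    return np.array(trace)

trace = sample_trace()
```

&lt;p>Fitting such a model to measured traces would additionally require estimating the duration and emission parameters per phase, which the sketch leaves out.&lt;/p>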
&lt;h3 id="2-constrained-mdp-formulation">2. Constrained MDP Formulation&lt;/h3>
&lt;ul>
&lt;li>Defined a &lt;strong>CMDP&lt;/strong> where each inference task selects an SLM configuration (model + quantization).&lt;/li>
&lt;li>State: remaining energy budget.&lt;/li>
&lt;li>Actions: candidate SLM setups.&lt;/li>
&lt;li>Reward: &lt;strong>LLM-as-a-Judge quality scores&lt;/strong>.&lt;/li>
&lt;li>Constraints: &lt;strong>finite energy budget&lt;/strong> and &lt;strong>instantaneous device-level power cap&lt;/strong>.&lt;/li>
&lt;/ul>
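&lt;p>A minimal sketch of these CMDP ingredients, with a hypothetical action set and invented cap/budget values (the model names echo those measured above, but the quantization configurations and all numbers are illustrative assumptions):&lt;/p>

```python
# Illustrative CMDP ingredients: each action is a (model, quantization)
# configuration. All numeric values are assumptions, not measurements.
ACTIONS = [
    ("gemma2-2b", "int4"), ("gemma2-2b", "int8"), ("gemma2-2b", "fp16"),
    ("qwen3-4b", "int4"), ("qwen3-4b", "int8"), ("qwen3-4b", "fp16"),
]

POWER_CAP_W = 15.0       # instantaneous device-level power cap (assumed)
ENERGY_BUDGET_J = 500.0  # finite per-session energy budget (assumed)

def feasible(peak_power_w, energy_cost_j, remaining_energy_j):
    """An action is admissible only if it violates neither constraint:
    its peak draw stays within the power cap, and its expected energy
    cost fits inside the remaining budget (the CMDP state)."""
    return POWER_CAP_W >= peak_power_w and remaining_energy_j >= energy_cost_j
```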
&lt;h3 id="3-policy-optimization-with-q-learning">3. Policy Optimization with Q-Learning&lt;/h3>
&lt;ul>
&lt;li>Constructed cost–reward pairs for six candidate actions.&lt;/li>
&lt;li>Solved the CMDP with tabular Q-learning:
&lt;ul>
&lt;li>Improved average reward from &lt;strong>~9 to ~15&lt;/strong> over 300 episodes.&lt;/li>
&lt;li>Maintained energy usage within &lt;strong>85–90% of budget&lt;/strong>.&lt;/li>
&lt;li>Guaranteed no violation of power caps.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
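&lt;p>The training loop above can be sketched as tabular Q-learning over a discretized remaining-energy state. The per-action energy costs and quality rewards below are invented for illustration; actions whose cost exceeds the remaining budget are masked out, so the budget constraint is never violated during learning:&lt;/p>

```python
import numpy as np

# Toy tabular Q-learning over a discretized energy budget with six
# candidate actions. Costs and rewards are illustrative assumptions.
rng = np.random.default_rng(1)

N_LEVELS = 21                                         # energy levels 0..20
COSTS   = np.array([1, 2, 3, 4, 5, 6])                # energy units/action
REWARDS = np.array([1.0, 1.8, 2.4, 3.0, 3.4, 3.6])    # judge-score proxies

Q = np.zeros((N_LEVELS, len(COSTS)))
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(300):
    energy = N_LEVELS - 1                 # start each episode at full budget
    while energy > 0:
        valid = np.flatnonzero(energy >= COSTS)   # mask infeasible actions
        if valid.size == 0:
            break
        if eps > rng.random():                    # epsilon-greedy exploration
            a = rng.choice(valid)
        else:
            a = valid[np.argmax(Q[energy, valid])]
        nxt = energy - COSTS[a]                   # state = remaining energy
        target = REWARDS[a] + gamma * Q[nxt].max()
        Q[energy, a] += alpha * (target - Q[energy, a])
        energy = nxt

greedy_first_action = int(np.argmax(Q[N_LEVELS - 1]))
```

&lt;p>Because the state is just the remaining budget, the learned policy spends cheap, low-reward actions when energy runs low and reserves expensive, high-reward configurations for earlier steps.&lt;/p>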
&lt;h2 id="results-and-insights">Results and Insights&lt;/h2>
&lt;ul>
&lt;li>HSMMs effectively capture &lt;strong>piecewise-stationary phases&lt;/strong> in edge inference.&lt;/li>
&lt;li>CMDP optimization reveals clear &lt;strong>energy–quality trade-offs&lt;/strong>.&lt;/li>
&lt;li>Learned policies significantly improve cumulative inference quality while &lt;strong>respecting real-world constraints&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This study establishes the first &lt;strong>unified mathematical framework&lt;/strong> linking SLM parameters, stochastic energy consumption, and inference quality. By integrating HSMM-based cost modeling with CMDP optimization, it enables &lt;strong>sustainable, adaptive deployment&lt;/strong> of SLMs in edge and IoT environments, paving the way for future extensions with deep RL and collaborative multi-device scheduling.&lt;/p></description></item><item><title>Stochastic Power Modeling and Constrained MDP Optimization for On-Device SLM Inference</title><link>https://junfei-z.github.io/zh/research/power_modeling/</link><pubDate>Mon, 22 Sep 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/zh/research/power_modeling/</guid><description>&lt;p>[ICASSP 2026 Submission] — In Review&lt;/p>
&lt;p>This research introduces a &lt;strong>stochastic and interpretable framework&lt;/strong> for sustainable &lt;strong>on-device inference&lt;/strong> of &lt;strong>small language models (SLMs)&lt;/strong> under strict energy and hardware constraints. By capturing fine-grained CPU/GPU power dynamics and optimizing inference scheduling with constrained MDPs, the work provides a principled foundation for &lt;strong>adaptive, resource-aware AI at the edge&lt;/strong>.&lt;/p>
&lt;h2 id="问题与动机">Problem and Motivation&lt;/h2>
&lt;p>Running SLMs locally on smartphones, laptops, or IoT nodes promises &lt;strong>low-latency and privacy-preserving AI services&lt;/strong>, but these devices face &lt;strong>finite battery budgets&lt;/strong> and &lt;strong>strict power caps&lt;/strong>. Traditional energy models fail to capture the stochastic, phase-wise CPU/GPU behaviors of SLM inference, making them unsuitable for &lt;strong>multi-task adaptive deployment&lt;/strong>.&lt;/p>
&lt;h2 id="技术贡献">Technical Contributions&lt;/h2>
&lt;h3 id="1-基于-hsmm-的能耗建模">1. HSMM-Based Energy Modeling&lt;/h3>
&lt;ul>
&lt;li>Conducted fine-grained power measurements of &lt;strong>Gemma2-2B&lt;/strong> and &lt;strong>Qwen3-4B&lt;/strong> on MT-Bench.&lt;/li>
&lt;li>Modeled CPU and GPU power traces separately with &lt;strong>Hidden Semi-Markov Models (HSMMs)&lt;/strong>:
&lt;ul>
&lt;li>GPU: ramp-up, plateau, and decay phases.&lt;/li>
&lt;li>CPU: low-load and high-load bursts.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Achieved &lt;strong>higher fidelity than HMM and TCN baselines&lt;/strong> in predicting power fluctuations.&lt;/li>
&lt;/ul>
&lt;h3 id="2-约束-mdp-建模">2. Constrained MDP Formulation&lt;/h3>
&lt;ul>
&lt;li>Defined a &lt;strong>CMDP&lt;/strong> where each inference task selects an SLM configuration (model + quantization scheme).&lt;/li>
&lt;li>State: remaining energy budget.&lt;/li>
&lt;li>Actions: candidate SLM configurations.&lt;/li>
&lt;li>Reward: &lt;strong>LLM-as-a-Judge quality scores&lt;/strong>.&lt;/li>
&lt;li>Constraints: &lt;strong>finite energy budget&lt;/strong> and &lt;strong>instantaneous device-level power cap&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="3-基于-q-learning-的策略优化">3. Policy Optimization with Q-Learning&lt;/h3>
&lt;ul>
&lt;li>Constructed cost–reward pairs for six candidate actions.&lt;/li>
&lt;li>Solved the CMDP with tabular Q-learning:
&lt;ul>
&lt;li>Improved average reward from &lt;strong>~9 to ~15&lt;/strong> over 300 episodes.&lt;/li>
&lt;li>Maintained energy usage within &lt;strong>85–90% of budget&lt;/strong>.&lt;/li>
&lt;li>Guaranteed no violation of power caps.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="结果与洞察">Results and Insights&lt;/h2>
&lt;ul>
&lt;li>HSMMs effectively capture &lt;strong>piecewise-stationary phases&lt;/strong> in edge inference.&lt;/li>
&lt;li>CMDP optimization reveals clear &lt;strong>energy–quality trade-offs&lt;/strong>.&lt;/li>
&lt;li>Learned policies significantly improve cumulative inference quality while &lt;strong>respecting real-world constraints&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="结论">Conclusion&lt;/h2>
&lt;p>This study establishes the first &lt;strong>unified mathematical framework&lt;/strong> linking SLM parameters, stochastic energy consumption, and inference quality. By integrating HSMM-based cost modeling with CMDP optimization, it enables &lt;strong>sustainable, adaptive deployment&lt;/strong> of SLMs in edge and IoT environments, laying the groundwork for future extensions with deep RL and collaborative multi-device scheduling.&lt;/p></description></item></channel></rss>