<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vision-Language Models | Junfei Zhan's Website</title><link>https://junfei-z.github.io/tags/vision-language-models/</link><atom:link href="https://junfei-z.github.io/tags/vision-language-models/index.xml" rel="self" type="application/rss+xml"/><description>Vision-Language Models</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 27 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://junfei-z.github.io/media/icon_hu70bcee51a3cd7a7338014254a2e0c844_1401285_512x512_fill_lanczos_center_3.png</url><title>Vision-Language Models</title><link>https://junfei-z.github.io/tags/vision-language-models/</link></image><item><title>Seeing is Free, Speaking is Not: Uncovering the True Energy Bottleneck in Edge VLM Inference</title><link>https://junfei-z.github.io/research/seeing-is-free-speaking-is-not/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/seeing-is-free-speaking-is-not/</guid><description>&lt;p>Vision-Language Models (VLMs) are the perceptual backbone of embodied AI, but their energy footprint on edge hardware remains poorly understood. Existing efficiency efforts focus predominantly on reducing visual tokens, implicitly treating visual processing as the dominant energy cost. We overturn this implicit assumption through the &lt;strong>first systematic energy profiling&lt;/strong> of on-device VLM inference, spanning five models across three architecture families, four input resolutions, and two hardware platforms (NVIDIA RTX 3070 and Jetson Orin NX).&lt;/p>
&lt;h2 id="key-findings">Key Findings&lt;/h2>
&lt;p>Our analysis yields three core findings:&lt;/p>
&lt;h3 id="1-power-is-a-model-fingerprint">1. Power is a Model Fingerprint&lt;/h3>
&lt;p>Average inference power is a &lt;strong>model-intrinsic constant&lt;/strong>, invariant to input resolution, image complexity, and prompt type, with less than 5% variation across all conditions. This means that all energy variation across inputs must arise from variation in &lt;strong>inference time&lt;/strong>, not from variation in power draw.&lt;/p>
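&lt;p>The relationship behind this finding is simply energy = power x time. The sketch below illustrates it with hypothetical numbers (the 20 W power and the two latencies are assumptions for illustration, not measurements from the study): when power is fixed, the energy ratio between two runs equals their latency ratio.&lt;/p>

```python
def energy_joules(avg_power_watts: float, latency_seconds: float) -> float:
    """Energy equals average power multiplied by elapsed time (E = P * t)."""
    return avg_power_watts * latency_seconds


# Hypothetical values for illustration only; not measurements from the study.
AVG_POWER_W = 20.0    # assumed constant for one model across inputs
FAST_LATENCY_S = 2.0  # assumed short response
SLOW_LATENCY_S = 6.0  # assumed long response

fast_energy = energy_joules(AVG_POWER_W, FAST_LATENCY_S)
slow_energy = energy_joules(AVG_POWER_W, SLOW_LATENCY_S)

# With power fixed, the energy ratio equals the latency ratio.
energy_ratio = slow_energy / fast_energy
latency_ratio = SLOW_LATENCY_S / FAST_LATENCY_S
print(energy_ratio, latency_ratio)
```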
&lt;h3 id="2-decode-dominates-energy">2. Decode Dominates Energy&lt;/h3>
&lt;p>Autoregressive decoding accounts for &lt;strong>86% to 97% of total energy&lt;/strong>. Each output token costs &lt;strong>11 to 39x more&lt;/strong> wall-clock time than each input token, because prefill is compute-bound and processes all input tokens in parallel, whereas decode is memory-bound and generates tokens one at a time. Output token count is therefore the dominant driver of both latency and energy.&lt;/p>
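&lt;p>To see how a per-token cost asymmetry turns into decode dominance, here is a minimal sketch. It assumes constant power (so energy share equals time share) and a fixed per-token cost in each phase. The token counts are hypothetical, and the function name &lt;code>decode_energy_share&lt;/code> is ours rather than anything from the paper.&lt;/p>

```python
def decode_energy_share(n_input: int, n_output: int, cost_ratio: float) -> float:
    """Fraction of total time (and, at constant power, energy) spent in decode.

    cost_ratio is the wall-clock cost of one output token expressed in units
    of the cost of one input token.
    """
    prefill_time = n_input * 1.0          # one time unit per input token
    decode_time = n_output * cost_ratio   # each output token costs more
    return decode_time / (prefill_time + decode_time)


# Hypothetical request: 300 input tokens and 200 output tokens.
share_low = decode_energy_share(n_input=300, n_output=200, cost_ratio=11.0)
share_high = decode_energy_share(n_input=300, n_output=200, cost_ratio=39.0)
print(round(share_low, 3), round(share_high, 3))
```

&lt;p>Even with more input tokens than output tokens, decode takes most of the time in this toy setting, and the share grows with the cost ratio.&lt;/p>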
&lt;h3 id="3-the-visual-token-pruning-illusion">3. The Visual Token Pruning Illusion&lt;/h3>
&lt;p>Even removing &lt;strong>all visual tokens&lt;/strong> saves at most &lt;strong>10% of total energy&lt;/strong> for fixed-token models. In contrast, reducing output length by 50% saves up to &lt;strong>97%&lt;/strong>. These findings expose a fundamental limitation of visual token pruning: it targets prefill, which already accounts for only a minority of total energy.&lt;/p>
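&lt;p>The ceiling on pruning savings follows from which phase pruning touches. The sketch below is a simplified estimate, not the paper's derivation: it assumes prefill energy scales linearly with input token count and that decode energy is unaffected by pruning. The function name and the example fractions are hypothetical.&lt;/p>

```python
def pruning_saving_estimate(prefill_share: float,
                            visual_fraction: float,
                            pruned_fraction: float = 1.0) -> float:
    """Estimated share of total energy saved by pruning visual tokens.

    Assumes prefill energy scales linearly with input token count and that
    decode energy is unchanged by pruning (both are simplifications).
    """
    saving = prefill_share * visual_fraction * pruned_fraction
    # Pruning only touches prefill, so the saving cannot exceed its share.
    return min(saving, prefill_share)


# Hypothetical case: prefill is 10% of total energy and 90% of the input
# tokens are visual tokens.
all_pruned = pruning_saving_estimate(prefill_share=0.10, visual_fraction=0.9)
half_pruned = pruning_saving_estimate(0.10, 0.9, pruned_fraction=0.5)
print(all_pruned, half_pruned)
```

&lt;p>Whatever the pruning ratio, the estimate stays below the prefill share, which is why shrinking the input cannot recover energy spent in decode.&lt;/p>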
&lt;h2 id="contributions">Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Energy decomposition&lt;/strong> into prefill vs. decode phases, showing decode dominance across all configurations&lt;/li>
&lt;li>&lt;strong>Theoretical upper bound&lt;/strong> on energy savings from visual token pruning&lt;/li>
&lt;li>&lt;strong>Cross-model energy predictor&lt;/strong>: a linear model with five features (model size, input token count, output token count, and interaction terms) that explains &lt;strong>98.6% of energy variance&lt;/strong> without per-model calibration (MAPE = 10.3%)&lt;/li>
&lt;li>&lt;strong>Deployment guidelines&lt;/strong>: budget output, not input; match token strategy to deployment scenario; anticipate content-driven energy variation&lt;/li>
&lt;/ul>
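&lt;p>As a rough illustration of what a five-feature linear energy predictor could look like, the sketch below builds a feature vector and applies a linear model. The summary does not say which interaction terms are used, so the two chosen here (model size x input tokens and model size x output tokens) are assumptions, as are the function names and the placeholder weights. None of these values come from the paper.&lt;/p>

```python
from typing import List, Sequence


def energy_features(model_size_b: float, n_input: int, n_output: int) -> List[float]:
    """Five features: three base terms plus two ASSUMED interaction terms."""
    return [
        model_size_b,
        float(n_input),
        float(n_output),
        model_size_b * n_input,   # assumed interaction term
        model_size_b * n_output,  # assumed interaction term
    ]


def predict_energy(weights: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Placeholder weights chosen only to exercise the code; not fitted values.
PLACEHOLDER_WEIGHTS = [0.0, 0.0, 0.0, 0.0, 1.0]

feats = energy_features(model_size_b=2.0, n_input=100, n_output=50)
estimate = predict_energy(PLACEHOLDER_WEIGHTS, bias=0.0, features=feats)
print(feats, estimate)
```

&lt;p>In practice the weights would be fitted by least squares on measured energy. The point of the sketch is only that output token count enters the model directly and through its interaction with model size.&lt;/p>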
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The true energy bottleneck in edge VLM inference is not &lt;em>seeing&lt;/em> but &lt;em>speaking&lt;/em>: not what the model sees, but how much it says. Our energy decomposition framework provides actionable guidelines for energy-aware VLM deployment on resource-constrained edge devices.&lt;/p>
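&lt;p>[ACM MM 2026 Submission] - In Review&lt;/p></description></item><item><title>Seeing is Free, Speaking is Not: Uncovering the True Energy Bottleneck in Edge VLM Inference</title><link>https://junfei-z.github.io/zh/research/seeing-is-free-speaking-is-not/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/zh/research/seeing-is-free-speaking-is-not/</guid><description>&lt;p>Vision-Language Models (VLMs) are the perceptual core of embodied AI, but their energy characteristics on edge hardware remain poorly understood. Existing efficiency work focuses mainly on reducing visual tokens, implicitly treating visual processing as the main source of energy consumption. We overturn this implicit assumption through the &lt;strong>first systematic energy profiling of on-device VLM inference&lt;/strong>, covering five models, three architecture families, four input resolutions, and two hardware platforms (NVIDIA RTX 3070 and Jetson Orin NX).&lt;/p>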
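&lt;h2 id="主要发现">Key Findings&lt;/h2>
&lt;p>Our analysis yields three core findings:&lt;/p>
&lt;h3 id="1-功率是模型的固有指纹">1. Power is an Intrinsic Model Fingerprint&lt;/h3>
&lt;p>Average inference power is a &lt;strong>model-intrinsic constant&lt;/strong>: it does not change with input resolution, image complexity, or prompt type, and varies by no more than 5% across all conditions. This means that all energy differences between inputs must come from changes in &lt;strong>inference time&lt;/strong>, not from changes in power draw.&lt;/p>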
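&lt;h3 id="2-decode-阶段主导能耗">2. The Decode Phase Dominates Energy&lt;/h3>
&lt;p>Autoregressive decoding accounts for &lt;strong>86% to 97% of total energy&lt;/strong>. Because of the compute-bound versus memory-bound asymmetry between the prefill and decode phases, each output token takes &lt;strong>11 to 39 times&lt;/strong> the wall-clock time of each input token. Output token count is the main driver of both latency and energy.&lt;/p>
&lt;h3 id="3-visual-token-剪枝的假象">3. The Visual Token Pruning Illusion&lt;/h3>
&lt;p>Even removing &lt;strong>all visual tokens&lt;/strong> saves at most &lt;strong>10% of total energy&lt;/strong> for fixed-token models. In contrast, reducing output length by 50% saves up to &lt;strong>97%&lt;/strong> of energy. These findings reveal a fundamental limitation of visual token pruning: it targets the prefill phase, which itself accounts for only a minority of total energy.&lt;/p>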
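&lt;h2 id="贡献">Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Energy decomposition&lt;/strong> into prefill and decode phases, showing that decode dominates in every configuration&lt;/li>
&lt;li>A &lt;strong>theoretical upper bound&lt;/strong> on the energy savings from visual token pruning&lt;/li>
&lt;li>&lt;strong>Cross-model energy predictor&lt;/strong>: a linear model with five features (model size, input token count, output token count, and interaction terms) that explains &lt;strong>98.6% of energy variance&lt;/strong> without per-model calibration (MAPE = 10.3%)&lt;/li>
&lt;li>&lt;strong>Deployment guidelines&lt;/strong>: budget output rather than input; match token strategy to the deployment scenario; anticipate content-driven energy variation&lt;/li>
&lt;/ul>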
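&lt;h2 id="结论">Conclusion&lt;/h2>
&lt;p>The true energy bottleneck in edge VLM inference lies not in &lt;em>seeing&lt;/em> but in &lt;em>speaking&lt;/em>: not what the model sees, but how much it says. Our energy decomposition framework provides actionable guidance for energy-efficient VLM deployment on resource-constrained edge devices.&lt;/p>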
&lt;p>[ACM MM 2026 Submission] - In Review&lt;/p></description></item></channel></rss>