Seeing is Free, Speaking is Not: Uncovering the True Energy Bottleneck in Edge VLM Inference

Fri, 27 Mar 2026 00:00:00 +0000

Vision-Language Models (VLMs) are the perceptual backbone of embodied AI, but their energy footprint on edge hardware remains poorly understood. Existing efficiency efforts focus predominantly on reducing visual tokens, implicitly treating visual processing as the dominant energy cost. We overturn this implicit assumption through the first systematic energy profiling of on-device VLM inference, spanning five models across three architecture families, four input resolutions, and two hardware platforms (NVIDIA RTX 3070 and Jetson Orin NX).

Key Findings

Our analysis yields three core findings:

1. Power is a Model Fingerprint

Average inference power is a model-intrinsic constant, invariant to input resolution, image complexity, and prompt type, with less than 5% variation across all conditions. This means that all energy variation across inputs must arise from variation in inference time, not from variation in power draw.
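The step from constant power to time-driven energy can be written out explicitly. The symbols below (E for energy, \bar{P} for a model's average power, T for inference time, subscripts a and b for two different inputs) are notation introduced here for illustration, not taken from the paper. The approximation holds only to the extent that power really is constant, which the finding above puts at within 5%.

```latex
E = \bar{P}\,T
\qquad\Longrightarrow\qquad
\frac{E_a}{E_b} \approx \frac{\bar{P}\,T_a}{\bar{P}\,T_b} = \frac{T_a}{T_b}
```

Under this reading, profiling latency for a given model is close to profiling energy, up to the fixed factor \bar{P}.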

2. Decode Dominates Energy

Autoregressive decoding accounts for 86% to 97% of total energy. Each output token costs 11 to 39 times more wall-clock time than each input token, because prefill is compute-bound while decode is memory-bound. Output token count is therefore the dominant driver of both latency and energy.
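A rough way to see how a per-token cost ratio turns into a decode-dominated energy budget is sketched below. It assumes constant power (so energy share equals time share) and time that is linear in token count. The function name, the 300 input tokens, and the 200 output tokens are hypothetical values chosen for illustration, not measurements from the paper.

```python
def decode_energy_share(n_in: int, n_out: int, cost_ratio: float) -> float:
    """Fraction of total energy spent in decode, assuming constant power.

    Assumes time is linear in token count: each input token costs one unit
    of time and each output token costs `cost_ratio` units.
    """
    if n_in < 0 or n_out < 0 or cost_ratio <= 0:
        raise ValueError("token counts must be non-negative and cost_ratio positive")
    prefill_time = float(n_in)
    decode_time = n_out * cost_ratio
    total = prefill_time + decode_time
    if total == 0:
        raise ValueError("at least one token is required")
    return decode_time / total


# Hypothetical request: 300 input tokens, 200 output tokens.
for ratio in (11, 39):
    share = decode_energy_share(300, 200, ratio)
    print(f"ratio {ratio}x -> decode share {share:.1%}")
```

For this made-up request the sketch gives decode shares of 88.0% and 96.3% at the two ends of the reported cost ratio. Other token counts would give different shares.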

3. The Visual Token Pruning Illusion

Even removing all visual tokens saves at most 10% of total energy for fixed-token models. In contrast, reducing output length by 50% saves up to 97%. These findings expose a fundamental limitation of visual token pruning: it targets prefill, which already accounts for only a minority of total energy.
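The reasoning behind the pruning limit can be sketched as a simple bound: pruning can only shrink prefill, so its savings cannot exceed the prefill share of energy. The function and the `visual_prefill_fraction` parameter (the share of prefill time attributable to visual tokens) are hypothetical names introduced here. The paper's 10% figure for fixed-token models is its own result and is not derived by this code.

```python
def pruning_savings_upper_bound(decode_share: float,
                                visual_prefill_fraction: float = 1.0) -> float:
    """Upper bound on the fraction of total energy that visual pruning can save.

    decode_share: fraction of total energy spent in decode.
    visual_prefill_fraction: assumed share of prefill cost due to visual
        tokens (1.0 means all of prefill is visual, the most generous case).
    """
    for name, value in (("decode_share", decode_share),
                        ("visual_prefill_fraction", visual_prefill_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    prefill_share = 1.0 - decode_share
    return prefill_share * visual_prefill_fraction


# Decode shares at the two ends of the reported 86% to 97% range.
for share in (0.86, 0.97):
    bound = pruning_savings_upper_bound(share)
    print(f"decode share {share:.0%} -> pruning saves at most {bound:.0%}")
```

Even in the most generous case, where all of prefill is attributed to visual tokens, the bound is 14% at an 86% decode share and 3% at a 97% decode share.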

Contributions

Energy decomposition into prefill and decode phases, showing that decode dominates across all configurations
A theoretical upper bound on the energy savings achievable through visual token pruning
A cross-model energy predictor: a linear model with five features (model size, input token count, output token count, and interaction terms) that explains 98.6% of energy variance without per-model calibration (MAPE = 10.3%)
Deployment guidelines: budget output, not input; match token strategy to deployment scenario; anticipate content-driven energy variation
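To make the cross-model predictor concrete, here is a minimal sketch of a five-feature linear energy model fitted by ordinary least squares. The source does not say which interaction terms are used, so the two chosen here (size × input tokens and size × output tokens) are assumptions, as are the function names and the intercept column. The data is synthetic and noiseless, generated only to exercise the code. It does not reproduce the paper's measurements or its 98.6% and 10.3% figures.

```python
import numpy as np


def build_features(size_b, n_in, n_out):
    """Intercept plus five features; the two interaction terms are assumed."""
    size_b, n_in, n_out = (np.asarray(a, dtype=float) for a in (size_b, n_in, n_out))
    return np.column_stack([
        np.ones_like(size_b),  # intercept
        size_b,                # model size (hypothetical unit: billions of parameters)
        n_in,                  # input token count
        n_out,                 # output token count
        size_b * n_in,         # assumed interaction: size x input tokens
        size_b * n_out,        # assumed interaction: size x output tokens
    ])


def fit_energy_model(X, energy):
    """Ordinary least-squares fit; returns one coefficient per column of X."""
    coef, *_ = np.linalg.lstsq(X, energy, rcond=None)
    return coef


def r2_and_mape(X, energy, coef):
    """Coefficient of determination and mean absolute percentage error (in %)."""
    pred = X @ coef
    ss_res = np.sum((energy - pred) ** 2)
    ss_tot = np.sum((energy - energy.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    mape = np.mean(np.abs((energy - pred) / energy)) * 100.0
    return r2, mape


# Synthetic, made-up data purely for illustration; not measurements.
rng = np.random.default_rng(0)
size_b = rng.choice([1.0, 2.0, 4.0, 7.0], size=200)
n_in = rng.integers(100, 2000, size=200)
n_out = rng.integers(10, 500, size=200)
X = build_features(size_b, n_in, n_out)
made_up_coef = np.array([1.0, 0.5, 0.001, 0.02, 0.0005, 0.05])
energy = X @ made_up_coef

coef = fit_energy_model(X, energy)
r2, mape = r2_and_mape(X, energy, coef)
print(f"R^2 = {r2:.4f}, MAPE = {mape:.4f}%")
```

Because the synthetic energy is an exact linear function of the features, the fit recovers it almost perfectly. On real measurements the same two metrics would show how much variance a single shared model explains without per-model calibration.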

Conclusion

The true energy bottleneck in edge VLM inference is not seeing but speaking: not what the model sees, but how much it says. Our energy decomposition framework provides actionable guidelines for energy-aware VLM deployment on resource-constrained edge devices.

[ACM MM 2026 Submission] — In Review

Vision-Language Models | Junfei Zhan's Website
