# Stochastic Power Modeling and Constrained MDP Optimization for On-Device SLM Inference
📄 [ICASSP 2026 Submission] — In Review
This research introduces a stochastic and interpretable framework for sustainable on-device inference of small language models (SLMs) under strict energy and hardware constraints. By capturing fine-grained CPU/GPU power dynamics and optimizing inference scheduling with constrained MDPs, the work provides a principled foundation for adaptive, resource-aware AI at the edge.
## Problem and Motivation

Running SLMs locally on smartphones, laptops, or IoT nodes promises low-latency, privacy-preserving AI services, but these devices operate under finite battery budgets and strict power caps. Traditional energy models fail to capture the stochastic, phase-wise CPU/GPU behavior of SLM inference, making them unsuitable for multi-task adaptive deployment.
## Technical Contributions
### 1. HSMM-Based Energy Modeling

- Conducted fine-grained power measurements of Gemma2-2B and Qwen3-4B on MT-Bench.
- Modeled CPU and GPU traces separately with Hidden Semi-Markov Models (HSMMs):
  - GPU: ramp-up, plateau, and decay phases.
  - CPU: low-load and high-load bursts.
- Achieved higher fidelity than HMM and TCN baselines in predicting power fluctuations.
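The phase-wise structure above can be sketched generatively: in an HSMM, each hidden state carries an explicit duration distribution (unlike an HMM's geometric dwell times) alongside its emission model. A minimal simulation of a GPU power trace, using illustrative phase parameters rather than values measured in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GPU phase parameters (illustrative, not measured values):
# each phase has an explicit duration range -- the "semi" in HSMM --
# and a Gaussian power emission in watts.
PHASES = {
    "ramp_up": {"mean_w": 3.0, "std_w": 0.5, "dur": (5, 15)},
    "plateau": {"mean_w": 8.0, "std_w": 0.8, "dur": (30, 60)},
    "decay":   {"mean_w": 2.0, "std_w": 0.4, "dur": (5, 10)},
}
ORDER = ["ramp_up", "plateau", "decay"]  # phase cycle for one inference

def simulate_trace(n_inferences=3):
    """Generate a synthetic phase-wise GPU power trace (watts)."""
    trace = []
    for _ in range(n_inferences):
        for name in ORDER:
            p = PHASES[name]
            d = rng.integers(*p["dur"])  # sampled phase duration (samples)
            trace.extend(rng.normal(p["mean_w"], p["std_w"], d))
    return np.array(trace)

trace = simulate_trace()
print(len(trace), round(float(trace.mean()), 2))
```

Fitting such a model to measured traces would replace these hand-set parameters with estimates per device and workload; the sketch only shows the generative structure the baselines (HMM, TCN) lack.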
### 2. Constrained MDP Formulation

- Defined a CMDP in which each inference task selects an SLM configuration (model + quantization):
  - State: remaining energy budget.
  - Actions: candidate SLM setups.
  - Reward: LLM-as-a-Judge quality scores.
  - Constraints: finite energy budget and an instantaneous device-level power cap.
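The ingredients above can be made concrete. The sketch below uses six hypothetical configurations with assumed energy costs, peak powers, and judge scores (none are the paper's measured values); it shows how the two constraints jointly restrict the admissible action set at each decision point:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """A candidate SLM configuration (model + quantization). Values are illustrative."""
    name: str
    energy_j: float   # expected energy cost per task (from the HSMM cost model)
    peak_w: float     # expected instantaneous peak power
    reward: float     # LLM-as-a-Judge quality score

# Six hypothetical candidate configurations (assumed numbers)
ACTIONS = [
    Action("gemma2-2b-int4", 40.0, 6.0, 6.5),
    Action("gemma2-2b-int8", 55.0, 7.5, 7.2),
    Action("gemma2-2b-fp16", 80.0, 9.0, 7.8),
    Action("qwen3-4b-int4", 70.0, 8.0, 8.0),
    Action("qwen3-4b-int8", 95.0, 9.5, 8.6),
    Action("qwen3-4b-fp16", 140.0, 11.0, 9.0),
]

POWER_CAP_W = 10.0  # assumed device-level instantaneous cap

def feasible(a: Action, remaining_j: float) -> bool:
    """Admissible iff the action fits the remaining budget AND the power cap."""
    return a.energy_j <= remaining_j and a.peak_w <= POWER_CAP_W

print([a.name for a in ACTIONS if feasible(a, 100.0)])
```

Note the asymmetry between the constraints: the power cap prunes actions regardless of state (here, the hypothetical fp16 4B model is never admissible), while the energy budget shrinks the admissible set as the state evolves.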
### 3. Policy Optimization with Q-Learning

- Constructed cost–reward pairs for the six candidate actions.
- Solved the CMDP with tabular Q-learning:
  - Improved the average reward from ~9 to ~15 over 300 episodes.
  - Kept energy usage within 85–90% of the budget.
  - Guaranteed no violations of the power cap.
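The tabular loop can be sketched with the remaining energy budget discretized into bins as the state. All costs, rewards, budgets, and hyperparameters below are illustrative assumptions; power-cap-violating actions are assumed pre-filtered, and the feasibility check enforces the budget constraint by construction:

```python
import random

random.seed(0)

# Illustrative cost-reward pairs for six candidate configurations:
# (energy cost in joules per task, LLM-as-a-Judge quality score).
COSTS = [40.0, 55.0, 80.0, 70.0, 95.0, 140.0]
REWARDS = [6.5, 7.2, 7.8, 8.0, 8.6, 9.0]
BUDGET = 500.0   # per-episode energy budget (J), assumed
BIN = 25.0       # energy discretization for the tabular state
N_BINS = int(BUDGET // BIN) + 1

Q = [[0.0] * len(COSTS) for _ in range(N_BINS)]
ALPHA, GAMMA, EPS = 0.3, 0.95, 0.2

def run_episode():
    """Schedule tasks until no action fits the remaining budget."""
    remaining, total = BUDGET, 0.0
    while True:
        s = int(remaining // BIN)
        feas = [i for i, c in enumerate(COSTS) if c <= remaining]
        if not feas:                      # budget exhausted: terminal state
            return total
        if random.random() < EPS:         # epsilon-greedy exploration
            a = random.choice(feas)
        else:
            a = max(feas, key=lambda i: Q[s][i])
        remaining -= COSTS[a]             # feasibility keeps remaining >= 0
        total += REWARDS[a]
        s2 = int(remaining // BIN)
        nxt = [Q[s2][i] for i, c in enumerate(COSTS) if c <= remaining]
        target = REWARDS[a] + GAMMA * (max(nxt) if nxt else 0.0)
        Q[s][a] += ALPHA * (target - Q[s][a])

returns = [run_episode() for _ in range(300)]
print(round(returns[0], 1), "->", round(sum(returns[-20:]) / 20, 1))
```

Restricting both action selection and the bootstrap target to feasible actions is what makes the budget constraint hold exactly rather than in expectation, mirroring the zero-violation guarantee reported above.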
## Results and Insights
- HSMMs effectively capture piecewise-stationary phases in edge inference.
- CMDP optimization reveals clear energy–quality trade-offs.
- Learned policies significantly improve cumulative inference quality while respecting real-world constraints.
## Conclusion
This study establishes the first unified mathematical framework linking SLM parameters, stochastic energy consumption, and inference quality. By integrating HSMM-based cost modeling with CMDP optimization, it enables sustainable, adaptive deployment of SLMs in edge and IoT environments, paving the way for future extensions with deep RL and collaborative multi-device scheduling.