Stochastic Power Modeling and Constrained MDP Optimization for On-Device SLM Inference

📄 [ICASSP 2026 Submission] — In Review

This research introduces a stochastic and interpretable framework for sustainable on-device inference of small language models (SLMs) under strict energy and hardware constraints. By capturing fine-grained CPU/GPU power dynamics and optimizing inference scheduling with constrained MDPs, the work provides a principled foundation for adaptive, resource-aware AI at the edge.

Problem and Motivation

Running SLMs locally on smartphones, laptops, or IoT nodes promises low-latency, privacy-preserving AI services, but these devices operate under finite battery budgets and strict power caps. Traditional energy models fail to capture the stochastic, phase-wise CPU/GPU behavior of SLM inference, making them unsuitable for multi-task adaptive deployment.

Technical Contributions

1. HSMM-Based Energy Modeling

  • Conducted fine-grained power measurements of Gemma2-2B and Qwen3-4B on MT-Bench.
  • Modeled CPU and GPU traces separately with Hidden Semi-Markov Models (HSMMs):
    • GPU: ramp-up, plateau, decay phases.
    • CPU: low-load and high-load bursts.
  • Achieved higher fidelity than HMM and TCN baselines in predicting power fluctuations.
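The phase structure above can be illustrated with a minimal generative sketch of a semi-Markov power model. All numbers (phase means, noise levels, sojourn times, sampling interval) are illustrative assumptions, not measurements from the paper; the key point is that each phase draws an explicit duration, which is what distinguishes an HSMM from a plain HMM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-phase GPU model: ramp-up -> plateau -> decay.
# Each phase has a mean power (W), a noise std, and a mean sojourn time
# in samples; all values are illustrative placeholders.
phases = {
    "ramp_up": {"mean_w": 8.0,  "std_w": 1.5, "mean_dur": 20},
    "plateau": {"mean_w": 14.0, "std_w": 0.8, "mean_dur": 120},
    "decay":   {"mean_w": 5.0,  "std_w": 1.0, "mean_dur": 30},
}

def sample_trace(order=("ramp_up", "plateau", "decay")):
    """Sample one power trace: draw an explicit duration per phase
    (the semi-Markov part), then Gaussian power samples within it."""
    segments = []
    for name in order:
        p = phases[name]
        dur = 1 + rng.poisson(p["mean_dur"])  # explicit sojourn time
        segments.append(rng.normal(p["mean_w"], p["std_w"], size=dur))
    return np.concatenate(segments)

trace = sample_trace()
energy_j = trace.sum() * 0.01  # assume a 10 ms sampling interval -> Joules
```

Fitting such a model to measured CPU/GPU traces (rather than sampling from fixed parameters) is what the paper's HSMM stage does; the generative direction shown here is the simplest way to see the phase-plus-duration structure.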

2. Constrained MDP Formulation

  • Defined a CMDP where each inference task selects an SLM configuration (model + quantization).
  • State: remaining energy budget.
  • Actions: candidate SLM setups.
  • Reward: LLM-as-a-Judge quality scores.
  • Constraints: finite energy budget and instantaneous device-level power cap.
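The CMDP ingredients listed above can be sketched as a small action specification. The concrete energy costs, peak powers, judge scores, and the 16 W cap below are invented for illustration; only the structure (state = remaining energy, actions = model + quantization pairs, hard power-cap constraint) follows the formulation.

```python
from dataclasses import dataclass

# Hypothetical (model, quantization) action set with illustrative
# per-task energy cost (J), peak power (W), and judge-score reward.
@dataclass(frozen=True)
class Action:
    model: str
    quant: str
    energy_j: float   # expected energy per inference task
    peak_w: float     # instantaneous power draw
    reward: float     # LLM-as-a-Judge quality score (illustrative)

ACTIONS = [
    Action("Gemma2-2B", "int4", 12.0,  9.0, 6.5),
    Action("Gemma2-2B", "int8", 18.0, 11.0, 7.2),
    Action("Gemma2-2B", "fp16", 30.0, 15.0, 7.8),
    Action("Qwen3-4B",  "int4", 22.0, 12.0, 7.6),
    Action("Qwen3-4B",  "int8", 34.0, 14.0, 8.3),
    Action("Qwen3-4B",  "fp16", 55.0, 18.0, 8.9),
]

POWER_CAP_W = 16.0  # assumed device-level cap (hard constraint)

def feasible(remaining_j: float):
    """Actions admissible in the current state: affordable within the
    remaining energy budget and below the instantaneous power cap."""
    return [a for a in ACTIONS
            if a.energy_j <= remaining_j and a.peak_w <= POWER_CAP_W]
```

Filtering the action set this way enforces the power-cap constraint by construction, so the learned policy can never violate it regardless of the reward landscape.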

3. Policy Optimization with Q-Learning

  • Constructed cost–reward pairs for six candidate actions.
  • Solved CMDP with tabular Q-learning:
    • Improved average reward from ~9 to ~15 over 300 episodes.
    • Maintained energy usage within 85–90% of budget.
    • Guaranteed no violation of power caps.
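A minimal tabular Q-learning loop over this CMDP might look as follows. The six cost–reward pairs, budget, discretization, and hyperparameters are illustrative assumptions (power-infeasible actions are taken as already filtered out); the structural points from the bullets are the feasibility mask, which makes cap violations impossible, and the remaining-energy state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative cost-reward table for six candidate actions:
# per-task energy cost (J) and judge-score reward.
COSTS   = np.array([12.0, 18.0, 30.0, 22.0, 34.0, 55.0])
REWARDS = np.array([6.5, 7.2, 7.8, 7.6, 8.3, 8.9])
BUDGET_J = 300.0
N_BINS = 31  # discretize remaining energy into tabular states

def to_state(remaining_j: float) -> int:
    return int(remaining_j / BUDGET_J * (N_BINS - 1))

Q = np.zeros((N_BINS, len(COSTS)))
alpha, gamma, eps = 0.1, 0.95, 0.2

for episode in range(300):
    remaining = BUDGET_J
    while True:
        feas = np.flatnonzero(COSTS <= remaining)
        if feas.size == 0:
            break                                 # budget exhausted
        s = to_state(remaining)
        if rng.random() < eps:
            a = rng.choice(feas)                  # explore
        else:
            a = feas[np.argmax(Q[s, feas])]       # exploit among feasible
        remaining -= COSTS[a]
        s2 = to_state(remaining)
        feas2 = np.flatnonzero(COSTS <= remaining)
        target = REWARDS[a] + (gamma * Q[s2, feas2].max()
                               if feas2.size else 0.0)
        Q[s, a] += alpha * (target - Q[s, a])     # TD update
```

Because each episode terminates exactly when no action fits the residual budget, the loop naturally reproduces the "spend most but not all of the budget" behavior reported above.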

Results and Insights

  • HSMMs effectively capture piecewise-stationary phases in edge inference.
  • CMDP optimization reveals clear energy–quality trade-offs.
  • Learned policies significantly improve cumulative inference quality while respecting real-world constraints.

Conclusion

This study establishes the first unified mathematical framework linking SLM parameters, stochastic energy consumption, and inference quality. By integrating HSMM-based cost modeling with CMDP optimization, it enables sustainable, adaptive deployment of SLMs in edge and IoT environments, paving the way for future extensions with deep RL and collaborative multi-device scheduling.