<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Research | Junfei Zhan's Website</title><link>https://junfei-z.github.io/research/</link><atom:link href="https://junfei-z.github.io/research/index.xml" rel="self" type="application/rss+xml"/><description>Research</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><image><url>https://junfei-z.github.io/media/icon_hu70bcee51a3cd7a7338014254a2e0c844_1401285_512x512_fill_lanczos_center_3.png</url><title>Research</title><link>https://junfei-z.github.io/research/</link></image><item><title>Scalable Node-Level Vaccine Allocation on Contact Networks: Bridging Optimal Control and Reinforcement Learning</title><link>https://junfei-z.github.io/research/scalable-node-level-vaccine-allocation-on-contact-networks/</link><pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/scalable-node-level-vaccine-allocation-on-contact-networks/</guid><description>&lt;a href="https://junfei-z.github.io/vaccine_rl/" target="_blank">
&lt;img src="https://img.shields.io/badge/Interactive%20Demo-Open-2563eb?logo=googlechrome&amp;logoColor=white" alt="Demo">
&lt;/a>
&lt;p>📄 &lt;em>Master&amp;rsquo;s Thesis, University of Pennsylvania (2026). Advisor: Prof. Saswati Sarkar.&lt;/em>&lt;/p>
&lt;p>In the first weeks of a pandemic, vaccines must be allocated across a large, heterogeneous population under a tight daily dose budget and over a horizon of weeks to months. A deployable policy must name specific individuals — not group-level proportions — and cope with three structural difficulties: sequential decisions over a long horizon with a delayed reward signal, a combinatorial daily action space of size $\binom{N}{K}$, and individual network position that matters as much as demographic group.&lt;/p>
&lt;h2 id="interactive-demo">Interactive Demo&lt;/h2>
&lt;p>The companion demo walks through the thesis visually:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Three-group population model&lt;/strong> — baseline (X), high-risk elderly (Y), and high-contact hubs (Z), each with group-specific symptomatic, hospitalisation, and case-fatality rates.&lt;/li>
&lt;li>&lt;strong>10-compartment SEPAILHRVD disease model&lt;/strong> — susceptible, latent, pre-symptomatic, asymptomatic, symptomatic, late-stage, hospitalised, recovered, vaccinated, and dead.&lt;/li>
&lt;li>&lt;strong>Barabási–Albert network construction&lt;/strong> — watch preferential attachment grow a scale-free contact graph and the characteristic power-law degree tail emerge.&lt;/li>
&lt;li>&lt;strong>Stochastic simulator&lt;/strong> — seed infections in any group mix and watch an unvaccinated outbreak unfold day by day, reporting cumulative deaths as the no-intervention baseline.&lt;/li>
&lt;li>&lt;strong>Method comparison&lt;/strong> &lt;em>(coming soon)&lt;/em> — OC-Random, OC-high, Naive RL, and Node RL on identical seeds.&lt;/li>
&lt;/ol>
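&lt;p>As a companion to step 3 above, the preferential-attachment rule is simple to sketch. The snippet below is a minimal stand-alone illustration, not the demo&amp;rsquo;s implementation; the network size &lt;code>n&lt;/code> and attachment count &lt;code>m&lt;/code> are illustrative.&lt;/p>

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=0):
    """Grow a scale-free graph: each new node attaches to m existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Seed graph: a complete core on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Each endpoint appears once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) != m:            # collect m distinct neighbours
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = barabasi_albert(n=500, m=2)
deg = Counter(v for e in edges for v in e)
# Scale-free signature: a few hubs sit far above the typical degree.
print(max(deg.values()), sorted(deg.values())[len(deg) // 2])
```

&lt;p>Sampling uniformly from the edge-endpoint list is what makes attachment probability proportional to degree — exactly the mechanism that produces the power-law degree tail shown in the demo.&lt;/p>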
&lt;p>👉 &lt;a href="https://junfei-z.github.io/vaccine_rl/">&lt;strong>Open the interactive demo&lt;/strong>&lt;/a>&lt;/p>
&lt;h2 id="contributions">Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>C1 — Stochastic node-level simulator&lt;/strong>: a high-fidelity environment integrating an explicit Barabási–Albert contact network with a 10-compartment SEPAILHRVD model, capturing intrinsic stochasticity of infection events and individual-level risk heterogeneity.&lt;/li>
&lt;li>&lt;strong>C2 — OC-high&lt;/strong>: augments principled group-level optimal control with a high-degree-first intra-group heuristic, bridging aggregate policy and individual action.&lt;/li>
&lt;li>&lt;strong>C3 — Node RL&lt;/strong>: an end-to-end actor–critic with a shared-parameter scoring MLP and Gumbel-Top-$K$ reparameterised sampling, yielding $O(K)$ gradient variance versus $\Theta(N)$ for independent Bernoulli baselines.&lt;/li>
&lt;li>&lt;strong>C4 — Regime map&lt;/strong>: systematic benchmarking across population size, horizon, and initial-infection ratio identifying when each method is preferable — and when the additional compute of node-level RL is justified.&lt;/li>
&lt;/ul>
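&lt;p>The Gumbel-Top-$K$ step in C3 is easy to sketch: perturb each node&amp;rsquo;s score with i.i.d. Gumbel noise and keep the $K$ largest, which draws $K$ distinct nodes with probabilities induced by the scores. The snippet below is an illustrative sketch with made-up scores and $K$, not the thesis code.&lt;/p>

```python
import math
import random

def gumbel_top_k(scores, k, rng=random.Random(0)):
    """Sample k distinct indices without replacement: perturb each score
    with standard Gumbel noise and keep the k largest perturbed scores."""
    keys = []
    for i, s in enumerate(scores):
        u = max(rng.random(), 1e-12)        # guard against log(0)
        g = -math.log(-math.log(u))         # standard Gumbel(0, 1) draw
        keys.append((s + g, i))
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]

# Toy example: 10 nodes with log-scores favouring later indices.
scores = [0.5 * i for i in range(10)]
chosen = gumbel_top_k(scores, k=3)
print(chosen)   # 3 distinct node indices, biased toward high scores
```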
&lt;h2 id="headline-findings">Headline Findings&lt;/h2>
&lt;ul>
&lt;li>OC-high matches or beats Node RL in most regimes at roughly &lt;strong>two orders of magnitude&lt;/strong> less preparation cost.&lt;/li>
&lt;li>Node RL&amp;rsquo;s advantage is real but &lt;strong>confined&lt;/strong> to short horizons and hub-heavy initial infections, where the mean-field assumption underlying OC-high structurally breaks down.&lt;/li>
&lt;li>The intra-group high-degree heuristic alone accounts for a &lt;strong>5–10% reduction in deaths&lt;/strong> on average, comparable to the contribution of the group-level OC rates themselves.&lt;/li>
&lt;/ul></description></item><item><title>Seeing is Free, Speaking is Not: Uncovering the True Energy Bottleneck in Edge VLM Inference</title><link>https://junfei-z.github.io/research/seeing-is-free-speaking-is-not/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/seeing-is-free-speaking-is-not/</guid><description>&lt;p>Vision-Language Models (VLMs) are the perceptual backbone of embodied AI, but their energy footprint on edge hardware remains poorly understood. Existing efficiency efforts focus predominantly on reducing visual tokens, implicitly treating visual processing as the dominant energy cost. We overturn this implicit assumption through the &lt;strong>first systematic energy profiling&lt;/strong> of on-device VLM inference, spanning five models across three architecture families, four input resolutions, and two hardware platforms (NVIDIA RTX 3070 and Jetson Orin NX).&lt;/p>
&lt;h2 id="key-findings">Key Findings&lt;/h2>
&lt;p>Our analysis yields three core findings:&lt;/p>
&lt;h3 id="1-power-is-a-model-fingerprint">1. Power is a Model Fingerprint&lt;/h3>
&lt;p>Average inference power is a &lt;strong>model-intrinsic constant&lt;/strong>, invariant to input resolution, image complexity, and prompt type, with less than 5% variation across all conditions. This means that all energy variation across inputs must arise from variation in &lt;strong>inference time&lt;/strong>, not from variation in power draw.&lt;/p>
&lt;h3 id="2-decode-dominates-energy">2. Decode Dominates Energy&lt;/h3>
&lt;p>Autoregressive decoding accounts for &lt;strong>86 to 97% of total energy&lt;/strong>. Each output token costs &lt;strong>11 to 39x more&lt;/strong> wall-clock time than each input token, because prefill is compute-bound while autoregressive decode is memory-bandwidth-bound. Output token count is therefore the dominant driver of both latency and energy.&lt;/p>
&lt;h3 id="3-the-visual-token-pruning-illusion">3. The Visual Token Pruning Illusion&lt;/h3>
&lt;p>Even removing &lt;strong>all visual tokens&lt;/strong> saves at most &lt;strong>10% of total energy&lt;/strong> for fixed-token models. In contrast, reducing output length by 50% saves up to &lt;strong>97%&lt;/strong>. These findings expose a fundamental limitation of visual token pruning: it targets prefill, which accounts for only a small fraction of total energy.&lt;/p>
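&lt;p>The pruning bound follows directly from the energy decomposition: pruning can only shrink prefill, so its savings are capped by the prefill share of energy times the visual fraction of prefill work. A back-of-envelope check, where the prefill share and visual-token fraction are illustrative assumptions rather than measured values:&lt;/p>

```python
def max_pruning_savings(prefill_share, visual_token_fraction):
    """Upper bound on total energy saved by visual token pruning: pruning
    only shrinks prefill, so savings are capped by the prefill share of
    energy times the fraction of prefill work that is visual tokens."""
    return prefill_share * visual_token_fraction

# Illustrative numbers: decode takes 86-97% of energy, so prefill is at
# most about 14%; assume visual tokens are about 70% of the prefill work.
bound = max_pruning_savings(prefill_share=0.14, visual_token_fraction=0.7)
print(round(bound, 3))   # -> 0.098, i.e. under 10% even when pruning all
```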
&lt;h2 id="contributions">Contributions&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Energy decomposition&lt;/strong> into prefill vs. decode phases, showing decode dominance across all configurations&lt;/li>
&lt;li>&lt;strong>Theoretical upper bound&lt;/strong> on energy savings from visual token pruning&lt;/li>
&lt;li>&lt;strong>Cross-model energy predictor&lt;/strong> — a linear model with five features (model size, input token count, output token count, and interaction terms) that explains &lt;strong>98.6% of energy variance&lt;/strong> without per-model calibration (MAPE = 10.3%)&lt;/li>
&lt;li>&lt;strong>Deployment guidelines&lt;/strong>: budget output not input; match token strategy to deployment scenario; anticipate content-driven energy variation&lt;/li>
&lt;/ul>
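&lt;p>The cross-model predictor amounts to ordinary least squares over five features. The sketch below illustrates only the fitting recipe: the feature list is paraphrased from the bullet above, and the data and coefficients are synthetic, not the measured ones.&lt;/p>

```python
import numpy as np

def energy_features(model_size_b, n_in, n_out):
    """Five-feature row in the spirit of the predictor described above:
    model size, input tokens, output tokens, and two interaction terms
    (the exact feature definitions here are assumptions)."""
    return [model_size_b, n_in, n_out,
            model_size_b * n_out, model_size_b * n_in]

# Synthetic ground truth: illustrative coefficients, not measured ones.
true_w = np.array([2.0, 0.01, 0.5, 0.3, 0.02])
rng = np.random.default_rng(0)
X = np.array([energy_features(s, i, o)
              for s, i, o in zip(rng.uniform(1, 8, 200),
                                 rng.integers(50, 1500, 200),
                                 rng.integers(10, 400, 200))])
y = X @ true_w + rng.normal(0, 0.1, 200)   # noisy energy readings (J)

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
print(np.round(w, 2))   # recovers the synthetic coefficients closely
```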
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The true energy bottleneck in edge VLM inference is not &lt;em>seeing&lt;/em> but &lt;em>speaking&lt;/em>: not what the model sees, but how much it says. Our energy decomposition framework provides actionable guidelines for energy-aware VLM deployment on resource-constrained edge devices.&lt;/p>
&lt;p>[ACM MM 2026 Submission] — In Review&lt;/p></description></item><item><title>Stochastic Power Modeling and Constrained MDP Optimization for On-Device SLM Inference</title><link>https://junfei-z.github.io/research/power_modeling/</link><pubDate>Mon, 22 Sep 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/power_modeling/</guid><description>&lt;p>📄 [ICASSP 2026 Submission] — In Review&lt;/p>
&lt;p>This research introduces a &lt;strong>stochastic and interpretable framework&lt;/strong> for sustainable &lt;strong>on-device inference of small language models (SLMs)&lt;/strong> under strict energy and hardware constraints. By capturing fine-grained CPU/GPU power dynamics and optimizing inference scheduling with constrained MDPs, the work provides a principled foundation for &lt;strong>adaptive, resource-aware AI at the edge&lt;/strong>.&lt;/p>
&lt;h2 id="problem-and-motivation">Problem and Motivation&lt;/h2>
&lt;p>Running SLMs locally on smartphones, laptops, or IoT nodes promises &lt;strong>low-latency and privacy-preserving AI services&lt;/strong>, but these devices face &lt;strong>finite battery budgets&lt;/strong> and &lt;strong>strict power caps&lt;/strong>. Traditional energy models fail to capture the stochastic, phase-wise CPU/GPU behaviors of SLM inference, making them unsuitable for &lt;strong>multi-task adaptive deployment&lt;/strong>.&lt;/p>
&lt;h2 id="technical-contributions">Technical Contributions&lt;/h2>
&lt;h3 id="1-hsmm-based-energy-modeling">1. HSMM-Based Energy Modeling&lt;/h3>
&lt;ul>
&lt;li>Conducted fine-grained power measurements of &lt;strong>Gemma2-2B&lt;/strong> and &lt;strong>Qwen3-4B&lt;/strong> on MT-Bench.&lt;/li>
&lt;li>Modeled CPU and GPU traces separately with &lt;strong>Hidden Semi-Markov Models (HSMMs)&lt;/strong>:
&lt;ul>
&lt;li>GPU: ramp-up, plateau, decay phases.&lt;/li>
&lt;li>CPU: low-load and high-load bursts.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Achieved &lt;strong>higher fidelity than HMM and TCN baselines&lt;/strong> in predicting power fluctuations.&lt;/li>
&lt;/ul>
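&lt;p>The generative side of this model is easy to illustrate: unlike an HMM, a hidden semi-Markov model draws an explicit duration for each phase before emitting observations. The sketch below samples a toy GPU power trace through the ramp-up, plateau, and decay phases; all power levels and durations are illustrative, not the measured values.&lt;/p>

```python
import random

# Illustrative GPU phases: (name, mean power W, power std W, mean duration).
PHASES = [("ramp-up", 18.0, 2.0, 5),
          ("plateau", 45.0, 3.0, 40),
          ("decay", 12.0, 2.0, 8)]

def sample_gpu_trace(rng=random.Random(0)):
    """Draw one power trace from a 3-phase hidden semi-Markov model:
    each phase first draws an explicit duration (the 'semi-Markov' part,
    unlike an HMM's implicit geometric durations), then emits Gaussian
    power samples for that many steps."""
    trace, labels = [], []
    for name, mu, sigma, mean_dur in PHASES:
        dur = 1 + int(rng.expovariate(1.0 / mean_dur))  # explicit duration
        for _ in range(dur):
            trace.append(rng.gauss(mu, sigma))
            labels.append(name)
    return trace, labels

trace, labels = sample_gpu_trace()
print(len(trace), labels[0], labels[-1])
```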
&lt;h3 id="2-constrained-mdp-formulation">2. Constrained MDP Formulation&lt;/h3>
&lt;ul>
&lt;li>Defined a &lt;strong>CMDP&lt;/strong> where each inference task selects an SLM configuration (model + quantization).&lt;/li>
&lt;li>State: remaining energy budget.&lt;/li>
&lt;li>Actions: candidate SLM setups.&lt;/li>
&lt;li>Reward: &lt;strong>LLM-as-a-Judge quality scores&lt;/strong>.&lt;/li>
&lt;li>Constraints: &lt;strong>finite energy budget&lt;/strong> and &lt;strong>instantaneous device-level power cap&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="3-policy-optimization-with-q-learning">3. Policy Optimization with Q-Learning&lt;/h3>
&lt;ul>
&lt;li>Constructed cost–reward pairs for six candidate actions.&lt;/li>
&lt;li>Solved CMDP with tabular Q-learning:
&lt;ul>
&lt;li>Improved average reward from &lt;strong>~9 to ~15&lt;/strong> over 300 episodes.&lt;/li>
&lt;li>Maintained energy usage within &lt;strong>85–90% of budget&lt;/strong>.&lt;/li>
&lt;li>Guaranteed no violation of power caps.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
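&lt;p>The tabular setup above can be sketched end to end: the state is the discretized remaining energy budget, the six actions carry illustrative (cost, quality, peak-power) tuples rather than the paper&amp;rsquo;s measurements, and both CMDP constraints are enforced by masking infeasible actions — one simple way to respect them exactly.&lt;/p>

```python
import random

# Six illustrative SLM configurations: (energy cost J, quality, peak power W).
ACTIONS = [(5, 3, 10), (8, 5, 14), (12, 7, 18),
           (15, 8, 22), (20, 9, 26), (30, 10, 35)]
POWER_CAP = 30      # instantaneous device-level power cap (W)
BUDGET = 100        # per-episode energy budget (J)
N_TASKS = 10

def feasible(energy_left, a):
    """Mask actions that break either CMDP constraint."""
    cost, _, power = ACTIONS[a]
    if power > POWER_CAP:           # hard power cap: action never allowed
        return False
    return not cost > energy_left   # cannot overspend the energy budget

def run_q_learning(episodes=300, alpha=0.1, gamma=0.95, eps=0.1,
                   rng=random.Random(0)):
    Q = {}
    for _ in range(episodes):
        energy = BUDGET
        for _task in range(N_TASKS):
            s = energy // 5         # discretized remaining-energy state
            acts = [a for a in range(len(ACTIONS)) if feasible(energy, a)]
            if not acts:
                break
            if rng.random() > eps:  # greedy with probability 1 - eps
                a = max(acts, key=lambda b: Q.get((s, b), 0.0))
            else:
                a = rng.choice(acts)
            cost, reward, _ = ACTIONS[a]
            energy -= cost
            s2 = energy // 5
            best_next = max(Q.get((s2, b), 0.0) for b in range(len(ACTIONS)))
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q

Q = run_q_learning()
# Greedy rollout under the learned policy.
energy, total_reward = BUDGET, 0
for _ in range(N_TASKS):
    s = energy // 5
    acts = [a for a in range(len(ACTIONS)) if feasible(energy, a)]
    if not acts:
        break
    a = max(acts, key=lambda b: Q.get((s, b), 0.0))
    energy -= ACTIONS[a][0]
    total_reward += ACTIONS[a][1]
print(total_reward, BUDGET - energy)   # quality earned, energy spent
```

&lt;p>Because the cap is checked at action-selection time, the learned policy can never violate it, mirroring the no-violation guarantee reported above.&lt;/p>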
&lt;h2 id="results-and-insights">Results and Insights&lt;/h2>
&lt;ul>
&lt;li>HSMMs effectively capture &lt;strong>piecewise-stationary phases&lt;/strong> in edge inference.&lt;/li>
&lt;li>CMDP optimization reveals clear &lt;strong>energy–quality trade-offs&lt;/strong>.&lt;/li>
&lt;li>Learned policies significantly improve cumulative inference quality while &lt;strong>respecting real-world constraints&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This study establishes the first &lt;strong>unified mathematical framework&lt;/strong> linking SLM parameters, stochastic energy consumption, and inference quality. By integrating HSMM-based cost modeling with CMDP optimization, it enables &lt;strong>sustainable, adaptive deployment&lt;/strong> of SLMs in edge and IoT environments, paving the way for future extensions with deep RL and collaborative multi-device scheduling.&lt;/p></description></item><item><title>PRISM: Privacy-Aware Routing for Adaptive Cloud–Edge LLM Inference with Semantic Sketch Collaboration</title><link>https://junfei-z.github.io/research/prism/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/prism/</guid><description>&lt;a href="https://junfei-z.github.io/prism_full.pdf" target="_blank">
&lt;img src="https://img.shields.io/badge/View%20Full%20Paper-PDF-red?logo=adobeacrobatreader&amp;logoColor=white" alt="PDF">
&lt;/a>
&lt;p>📄 [Accepted at 2026 AAAI Conference on Artificial Intelligence] — To appear&lt;/p>
&lt;p>This project introduces &lt;strong>PRISM&lt;/strong>, a context-aware cloud–edge inference framework that balances privacy, utility, and efficiency for &lt;strong>Large Language Model (LLM)&lt;/strong> services. It addresses the key limitations of uniform privacy mechanisms by adapting protection based on &lt;strong>semantic sensitivity&lt;/strong> of user inputs.&lt;/p>
&lt;h2 id="objectives">Objectives&lt;/h2>
&lt;p>The primary goal is to enable &lt;strong>privacy-preserving LLM inference&lt;/strong> in real-world deployments, where sensitive user prompts are routed intelligently between edge devices and the cloud. PRISM is designed to:&lt;/p>
&lt;ul>
&lt;li>Avoid unnecessary noise for benign inputs&lt;/li>
&lt;li>Preserve semantic coherence in sensitive prompts&lt;/li>
&lt;li>Reduce latency and energy consumption without compromising utility&lt;/li>
&lt;/ul>
&lt;h2 id="key-contributions">Key Contributions&lt;/h2>
&lt;h3 id="semantic-sensitive-execution-routing">Semantic-Sensitive Execution Routing&lt;/h3>
&lt;ul>
&lt;li>A &lt;strong>soft gating controller&lt;/strong> on the edge scores entity-level risk using contextual features (e.g., named entities, first-person references)&lt;/li>
&lt;li>Routes prompts to one of three execution paths:
&lt;ul>
&lt;li>&lt;strong>Edge-only&lt;/strong> for high-risk prompts&lt;/li>
&lt;li>&lt;strong>Cloud-only&lt;/strong> for low-risk prompts&lt;/li>
&lt;li>&lt;strong>Cloud–Edge Collaboration&lt;/strong> for mid-sensitivity prompts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
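&lt;p>A toy version of the gating rule makes the three-way split concrete. The cue list, weights, and thresholds below are hypothetical placeholders, not PRISM&amp;rsquo;s learned controller:&lt;/p>

```python
def sensitivity_score(prompt):
    """Toy stand-in for the soft gating controller: score a prompt with
    cheap contextual cues (cue list and weights are hypothetical)."""
    cues = {"my": 0.2, "i": 0.1, "diagnosis": 0.5, "account": 0.35, "ssn": 0.8}
    return min(1.0, sum(cues.get(w, 0.0) for w in prompt.lower().split()))

def route(prompt, low=0.2, high=0.6):
    """Map the risk score to one of the three execution paths
    (the thresholds are illustrative placeholders)."""
    s = sensitivity_score(prompt)
    if s > high:
        return "edge-only"      # high risk: the prompt never leaves the device
    if s > low:
        return "cloud-edge"     # mid risk: obfuscate, then collaborate
    return "cloud-only"         # low risk: full cloud quality

print(route("what is the capital of France"))               # -> cloud-only
print(route("my diagnosis came back, explain my results"))  # -> edge-only
```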
&lt;h3 id="adaptive-two-layer-local-differential-privacy-ldp">Adaptive Two-Layer Local Differential Privacy (LDP)&lt;/h3>
&lt;ul>
&lt;li>Each sensitive entity is obfuscated through:
&lt;ul>
&lt;li>Category-level perturbation (e.g., masking &amp;ldquo;Diagnosis&amp;rdquo;)&lt;/li>
&lt;li>Value-level perturbation (e.g., replacing &amp;ldquo;HIV&amp;rdquo; with &amp;ldquo;Flu&amp;rdquo;)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Privacy budget allocation is guided by a sensitivity weight model ensuring &lt;strong>fine-grained protection without semantic collapse&lt;/strong>&lt;/li>
&lt;/ul>
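&lt;p>The value-level layer can be illustrated with classical k-ary randomized response, a standard mechanism satisfying epsilon-LDP; PRISM&amp;rsquo;s actual perturbation and budget split may differ. The vocabulary and epsilon below are illustrative, and the category layer would apply the same idea over category labels:&lt;/p>

```python
import math
import random

def k_randomized_response(value, domain, epsilon, rng=random.Random(0)):
    """k-ary randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise report a uniformly random other
    value. This satisfies epsilon-LDP over the given domain."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() > p_keep:
        return rng.choice([v for v in domain if v != value])
    return value

diagnoses = ["HIV", "Flu", "Diabetes", "Asthma"]
# Two-layer idea: a higher sensitivity weight spends more of the budget on
# the category layer, leaving a smaller epsilon (stronger noise) here.
print(k_randomized_response("HIV", diagnoses, epsilon=0.5))
```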
&lt;h3 id="semantic-sketch-collaboration-protocol">Semantic Sketch Collaboration Protocol&lt;/h3>
&lt;ul>
&lt;li>Noisy prompts are processed in the cloud to generate &lt;strong>semantic sketches&lt;/strong> (e.g., high-level abstract responses)&lt;/li>
&lt;li>The edge-side &lt;strong>Small Language Model (SLM)&lt;/strong> refines these sketches using the original context&lt;/li>
&lt;li>Enables &lt;strong>high-utility responses under strong privacy constraints&lt;/strong>&lt;/li>
&lt;/ul>
&lt;h2 id="results--insights">Results &amp;amp; Insights&lt;/h2>
&lt;ul>
&lt;li>PRISM achieves &lt;strong>up to 3× lower latency&lt;/strong> and &lt;strong>2.5× lower energy consumption&lt;/strong> than baselines like Uniform and Selective LDP&lt;/li>
&lt;li>Delivers &lt;strong>higher LLM-Judge scores (up to 7.2)&lt;/strong> under strong privacy budgets&lt;/li>
&lt;li>Outperforms state-of-the-art methods (e.g., Split-and-Denoise, DP-Forward) in terms of both utility and efficiency&lt;/li>
&lt;li>Robust across &lt;strong>8 different model combinations&lt;/strong> (e.g., GPT-4o + StableLM)&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Method&lt;/th>
&lt;th>Completion Time (s)&lt;/th>
&lt;th>Energy Cost (J)&lt;/th>
&lt;th>Inference Quality&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>PRISM&lt;/td>
&lt;td>7.92&lt;/td>
&lt;td>687.2&lt;/td>
&lt;td>6.88&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Uniform LDP&lt;/td>
&lt;td>20.56&lt;/td>
&lt;td>1707.6&lt;/td>
&lt;td>5.72&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Selective LDP&lt;/td>
&lt;td>21.22&lt;/td>
&lt;td>1770.8&lt;/td>
&lt;td>5.94&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Edge-Only&lt;/td>
&lt;td>17.84&lt;/td>
&lt;td>1573.9&lt;/td>
&lt;td>5.09&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cloud-Only&lt;/td>
&lt;td>&lt;strong>5.13&lt;/strong>&lt;/td>
&lt;td>&lt;strong>296.3&lt;/strong>&lt;/td>
&lt;td>&lt;strong>8.14&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="broader-impact">Broader Impact&lt;/h2>
&lt;p>PRISM enables &lt;strong>selective privacy-preserving inference&lt;/strong> for sensitive domains such as &lt;strong>medical, financial, and personal assistants&lt;/strong>, paving the way for:&lt;/p>
&lt;ul>
&lt;li>Deploying LLMs responsibly in &lt;strong>privacy-critical environments&lt;/strong>&lt;/li>
&lt;li>Reducing energy costs in &lt;strong>cloud-edge infrastructure&lt;/strong>&lt;/li>
&lt;li>Bridging the tradeoff between &lt;strong>privacy and inference quality&lt;/strong>&lt;/li>
&lt;/ul></description></item><item><title>RL-Enhanced Disturbance-Aware MPC for Robust UAV Trajectory Tracking</title><link>https://junfei-z.github.io/research/rl-enhanced-disturbance-aware-mpc-for-robust-uav-trajectory-tracking/</link><pubDate>Wed, 07 May 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/rl-enhanced-disturbance-aware-mpc-for-robust-uav-trajectory-tracking/</guid><description>&lt;a href="https://junfei-z.github.io/uav_control.pdf" target="_blank">
&lt;img src="https://img.shields.io/badge/View%20Full%20Paper-PDF-red?logo=adobeacrobatreader&amp;logoColor=white" alt="PDF">
&lt;/a>
&lt;p>📄 [Accepted at IEEE SMC 2025] — To appear&lt;/p>
&lt;p>This research introduces &lt;strong>ROAM&lt;/strong>, a novel RL-enhanced, disturbance-aware MPC framework for &lt;strong>precise UAV trajectory tracking&lt;/strong> in uncertain and dynamic environments. The method combines the predictive strengths of MPC with the fast response of reinforcement learning (RL) and the robustness of an adaptive sliding mode observer (SMO).&lt;/p>
&lt;h2 id="problem-and-motivation">Problem and Motivation&lt;/h2>
&lt;p>Traditional UAV controllers using MPC struggle under &lt;strong>model mismatch&lt;/strong>, &lt;strong>wind disturbances&lt;/strong>, and &lt;strong>computational delays&lt;/strong>, resulting in residual tracking errors and slow convergence. This work addresses those challenges via two innovations:&lt;/p>
&lt;ul>
&lt;li>An &lt;strong>offline-trained RL warm-start policy&lt;/strong> to accelerate MPC convergence&lt;/li>
&lt;li>An &lt;strong>Adaptive Super-Twisting Sliding Mode Observer (AST-SMO)&lt;/strong> to estimate and reject real-time disturbances&lt;/li>
&lt;/ul>
&lt;h2 id="technical-contributions">Technical Contributions&lt;/h2>
&lt;h3 id="1-rl-based-warm-start">1. RL-Based Warm Start&lt;/h3>
&lt;ul>
&lt;li>A &lt;strong>direction-conditioned policy&lt;/strong> is trained via imitation learning on expert MPC trajectories.&lt;/li>
&lt;li>During real-time control, it provides &lt;strong>trajectory-consistent initial guesses&lt;/strong> to the MPC solver, reducing early-stage tracking error by &lt;strong>16.9%&lt;/strong> and computation time by &lt;strong>38.7%&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="2-ast-smo-for-disturbance-estimation">2. AST-SMO for Disturbance Estimation&lt;/h3>
&lt;ul>
&lt;li>The SMO estimates external disturbances in real time, replacing the discontinuous sign function with a smooth hyperbolic-tangent switching term to avoid chattering.&lt;/li>
&lt;li>An adaptive gain tuning mechanism adjusts sensitivity dynamically for better convergence.&lt;/li>
&lt;/ul>
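&lt;p>A minimal version of the observer can be written for a scalar toy plant $\dot{x} = u + d$: a square-root error correction plus an integral term that converges to the disturbance, with $\tanh(\cdot)$ standing in for the discontinuous sign function. The gains, step size, and constant disturbance below are illustrative, and the paper&amp;rsquo;s adaptive gain law is omitted.&lt;/p>

```python
import math

def run_observer(d=1.0, k1=2.0, k2=3.0, eps=0.01, dt=0.001, T=5.0):
    """Super-twisting disturbance observer for the toy plant x_dot = u + d.
    The integral state z converges to the unknown disturbance d; tanh(e/eps)
    replaces sign(e) to avoid chattering. Gains are fixed here, whereas the
    paper additionally adapts them online."""
    x = x_hat = z = 0.0
    u = 0.0                                  # control input held at zero
    for _ in range(int(T / dt)):
        e = x - x_hat                        # estimation error
        s = math.tanh(e / eps)               # smoothed sign of the error
        x += (u + d) * dt                    # true plant (d is unknown)
        x_hat += (u + k1 * math.sqrt(abs(e)) * s + z) * dt
        z += k2 * s * dt                     # disturbance estimate update
    return z

print(round(run_observer(), 2))   # the estimate settles near d = 1.0
```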
&lt;h3 id="3-disturbance-aware-mpc">3. Disturbance-Aware MPC&lt;/h3>
&lt;ul>
&lt;li>MPC is reformulated to incorporate real-time estimates from AST-SMO:
\[
x_{k+1} = Ax_k + Bu_k + E(\hat{d}_k)
\]&lt;/li>
&lt;li>Objective: minimize both tracking error and control effort, while maintaining system constraints.&lt;/li>
&lt;/ul>
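&lt;p>The role of the $E(\hat{d}_k)$ term is easy to verify numerically on a toy double integrator: a prediction that includes the estimated disturbance tracks the true plant far more closely than the nominal model. The matrices and numbers below are illustrative, not the 12-DOF quadrotor model from the paper.&lt;/p>

```python
import numpy as np

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])   # double-integrator dynamics
B = np.array([[0.0], [dt]])             # control input channel
E = np.array([[0.0], [dt]])             # disturbance enters like an input

x = np.array([[0.0], [0.0]])
u = np.array([[1.0]])
d_true = 0.5        # actual external disturbance (e.g. wind)
d_hat = 0.48        # observer estimate (illustrative, slightly imperfect)

x_true = A @ x + B @ u + E * d_true     # what the plant actually does
x_nominal = A @ x + B @ u               # disturbance-unaware prediction
x_aware = A @ x + B @ u + E * d_hat     # disturbance-aware prediction

# The disturbance-aware model tracks the true plant far more closely,
# which lets the MPC pre-compensate instead of reacting after the fact.
print(float(np.linalg.norm(x_true - x_nominal)),
      float(np.linalg.norm(x_true - x_aware)))
```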
&lt;h2 id="simulation-results">Simulation Results&lt;/h2>
&lt;ul>
&lt;li>Evaluated on a 12-DOF quadrotor model under sinusoidal and noisy disturbances.&lt;/li>
&lt;li>ROAM achieved:
&lt;ul>
&lt;li>16.9% improvement in early-stage tracking accuracy&lt;/li>
&lt;li>38.7% reduction in computation time&lt;/li>
&lt;li>Superior trajectory adherence under heavy external disturbances compared to classical MPC&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>ROAM demonstrates that &lt;strong>deep integration of RL, observers, and MPC&lt;/strong> yields a control system with faster convergence, better stability, and higher resilience. Its lightweight and modular design makes it highly suitable for &lt;strong>real-time deployment&lt;/strong> on embedded UAV platforms.&lt;/p>
</description></item><item><title>Can Large Language Models Credibly Stand in for Humans in Game-Theoretic Experiments?</title><link>https://junfei-z.github.io/research/can-large-language-models-credibly-stand-in-for-humans-in-game-theoretic-experiments/</link><pubDate>Thu, 17 Apr 2025 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/can-large-language-models-credibly-stand-in-for-humans-in-game-theoretic-experiments/</guid><description>&lt;p>This work investigates the feasibility of using Large Language Models (LLMs) as &lt;strong>proxies for human participants&lt;/strong> in behavioral game-theoretic experiments. We evaluated four LLMs (GPT-4o, Llama‑3.3‑70B‑Instruct, Llama‑3.3‑8B‑Instruct, and DeepSeek-R1) across &lt;strong>three canonical games&lt;/strong>: the &lt;strong>Prisoner’s Dilemma&lt;/strong>, the &lt;strong>Ultimatum Game&lt;/strong>, and the &lt;strong>Public Goods Game&lt;/strong>.&lt;/p>
&lt;h2 id="research-objectives">Research Objectives&lt;/h2>
&lt;ul>
&lt;li>Evaluate &lt;strong>behavioral alignment&lt;/strong>, &lt;strong>persona consistency&lt;/strong>, and &lt;strong>strategic adaptability&lt;/strong> of LLMs vs. human norms.&lt;/li>
&lt;li>Design a &lt;strong>modular, multi-agent framework (PRIME-Router)&lt;/strong> for improved consistency and adaptability.&lt;/li>
&lt;li>Benchmark LLM behavior using &lt;strong>MBTI-based persona prompts&lt;/strong>: Diplomat, Analyst, Sentinel, Explorer.&lt;/li>
&lt;/ul>
&lt;h2 id="core-contributions">Core Contributions&lt;/h2>
&lt;h3 id="1-behavioral-assessment-in-canonical-games">1. Behavioral Assessment in Canonical Games&lt;/h3>
&lt;p>LLMs were benchmarked against human behavior using three new metrics:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>BAM (Behavioral Alignment Measure)&lt;/strong>: similarity to human action distributions&lt;/li>
&lt;li>&lt;strong>PCI (Persona Consistency Index)&lt;/strong>: adherence to prompted social roles&lt;/li>
&lt;li>&lt;strong>ASP (Adaptive Strategic Profile)&lt;/strong>: responsiveness to evolving game contexts&lt;/li>
&lt;/ul>
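&lt;p>One plausible instantiation of BAM, shown purely for illustration, measures similarity as one minus the total variation distance between the LLM&amp;rsquo;s and humans&amp;rsquo; action distributions; the paper&amp;rsquo;s exact formula may differ.&lt;/p>

```python
def bam(llm_dist, human_dist):
    """Behavioral Alignment Measure, sketched as 1 minus the total
    variation distance between action distributions. This is one
    plausible instantiation; the paper's exact formula may differ."""
    actions = set(llm_dist) | set(human_dist)
    tv = 0.5 * sum(abs(llm_dist.get(a, 0.0) - human_dist.get(a, 0.0))
                   for a in actions)
    return 1.0 - tv

# Ultimatum Game responder facing a 30% offer (numbers are illustrative).
human = {"accept": 0.7, "reject": 0.3}   # human baseline rates
llm = {"accept": 0.9, "reject": 0.1}     # an LLM's observed rates
print(bam(llm, human))   # close to 1 means closely aligned behaviour
```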
&lt;p>Key findings:&lt;/p>
&lt;ul>
&lt;li>Most LLMs showed &lt;strong>high initial BAM&lt;/strong> but struggled with &lt;strong>adaptive consistency&lt;/strong> in repeated games.&lt;/li>
&lt;li>GPT-4o and Llama-3.3-70B demonstrated &lt;strong>excellent persona consistency&lt;/strong> in one-shot games.&lt;/li>
&lt;/ul>
&lt;h3 id="2-prime-router-framework">2. PRIME-Router Framework&lt;/h3>
&lt;p>To overcome adaptation and consistency limitations, we proposed &lt;strong>PRIME-Router&lt;/strong>, a modular MoE-style architecture that:&lt;/p>
&lt;ul>
&lt;li>Spawns &lt;strong>specialized subroles&lt;/strong> (e.g., Empathy Enforcer, Strategic Planner)&lt;/li>
&lt;li>Assigns the &lt;strong>most suitable LLM&lt;/strong> to each subrole based on empirical performance&lt;/li>
&lt;li>Aggregates multi-agent outputs via &lt;strong>collaboration patterns&lt;/strong> (e.g., star, debate, chain)&lt;/li>
&lt;/ul>
&lt;p>PRIME-Router improves:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>PCI&lt;/strong> by up to &lt;strong>0.23&lt;/strong>&lt;/li>
&lt;li>&lt;strong>ASP&lt;/strong> by up to &lt;strong>0.32&lt;/strong> across repeated games.&lt;/li>
&lt;/ul>
&lt;h3 id="3-implications-and-outlook">3. Implications and Outlook&lt;/h3>
&lt;ul>
&lt;li>LLMs can &lt;strong>simulate human-like behavior credibly&lt;/strong>, but &lt;strong>strategic depth&lt;/strong> and &lt;strong>long-horizon persona fidelity&lt;/strong> remain challenges.&lt;/li>
&lt;li>PRIME-Router paves the way for &lt;strong>cost-effective AI agents&lt;/strong> in &lt;strong>social science experimentation&lt;/strong>, &lt;strong>policy modeling&lt;/strong>, and &lt;strong>online platform simulation&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Our study highlights the promise and limitations of LLMs in behavioral game simulations. Structured multi-agent design like PRIME-Router significantly enhances realism, offering a new paradigm for &lt;strong>AI-driven human modeling&lt;/strong> in experimental social science.&lt;/p>
&lt;p>📄 [AAAI 2026 Submission] — In Review&lt;/p></description></item><item><title>Minimizing Maximum Age of Service in Virtualized Green IoT Networks</title><link>https://junfei-z.github.io/research/minimizing-maximum-age-of-service-in-virtualized-green-iot-networks/</link><pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/minimizing-maximum-age-of-service-in-virtualized-green-iot-networks/</guid><description>&lt;p>This project addresses the challenge of embedding and scheduling applications in solar-powered green IoT networks, with the goal of minimizing the &lt;strong>maximum Age of Service (AoS)&lt;/strong> — a freshness metric indicating the delay between data generation and service completion.&lt;/p>
&lt;h2 id="objectives">Objectives&lt;/h2>
&lt;p>The research focuses on virtualized, computation-enabled IoT infrastructures powered by &lt;strong>renewable energy&lt;/strong> (solar). The applications are modeled as &lt;strong>Directed Acyclic Graphs (DAGs)&lt;/strong> with &lt;strong>Virtual Network Functions (VNFs)&lt;/strong> that must be executed under fluctuating energy and computational constraints.&lt;/p>
&lt;h2 id="key-contributions">Key Contributions&lt;/h2>
&lt;h3 id="mixed-integer-linear-programming-milp-formulation">Mixed Integer Linear Programming (MILP) Formulation&lt;/h3>
&lt;ul>
&lt;li>Proposed the &lt;strong>first MILP model&lt;/strong> to jointly optimize:
&lt;ul>
&lt;li>Device selection and sampling time&lt;/li>
&lt;li>DAG request embedding decision&lt;/li>
&lt;li>Energy consumption at devices, gateways, and servers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Objective: minimize the &lt;strong>maximum AoS&lt;/strong> across all DAG requests.&lt;/li>
&lt;/ul>
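&lt;p>Min-max objectives of this kind are typically handled with an auxiliary bound variable; a sketch of the standard linearization (notation illustrative, not the thesis's exact formulation):&lt;/p>

```latex
\min_{x,\,T} \; T
\quad \text{s.t.} \quad
\mathrm{AoS}_r(x) \le T \;\; \forall r \in \mathcal{R},
\qquad x \in \mathcal{X}
```

&lt;p>where $x$ collects the device-selection, embedding, and scheduling variables, $\mathcal{R}$ is the set of DAG requests, and $\mathcal{X}$ the feasible region defined by the energy and capacity constraints.&lt;/p>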
&lt;h3 id="heuristic-and-predictive-control-solutions">Heuristic and Predictive Control Solutions&lt;/h3>
&lt;ul>
&lt;li>Developed &lt;strong>GreedyOL&lt;/strong>, a fast heuristic that embeds DAGs based on current AoS.&lt;/li>
&lt;li>Proposed &lt;strong>RHCOP&lt;/strong>, a &lt;strong>Receding Horizon Control Optimization&lt;/strong> framework:
&lt;ul>
&lt;li>Utilizes &lt;strong>Gaussian Mixture Models (GMMs)&lt;/strong> to predict solar energy arrivals and wireless channel gains.&lt;/li>
&lt;li>Enables real-time scheduling using only causal (non-future) information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
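&lt;p>As a rough illustration of the GMM-based forecasting step, the sketch below draws a predicted solar energy-arrival trace from a two-component mixture; the mixture parameters are placeholders, not values fitted in the paper:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM over hourly solar energy arrivals (Joules);
# the weights/means/stds here are made-up placeholders.
weights = np.array([0.6, 0.4])   # component probabilities
means   = np.array([5.0, 20.0])  # low- vs. high-irradiance regimes
stds    = np.array([2.0, 5.0])

def sample_solar(n_slots):
    """Draw a predicted energy-arrival trace of length n_slots from the GMM."""
    comps = rng.choice(len(weights), size=n_slots, p=weights)
    draws = rng.normal(means[comps], stds[comps])
    return np.clip(draws, 0.0, None)  # energy arrivals cannot be negative

forecast = sample_solar(24)  # one day of hourly predictions
```

&lt;p>In the paper the mixture would be fitted to historical traces; here only the sampling side is shown.&lt;/p>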
&lt;h3 id="results--insights">Results &amp;amp; Insights&lt;/h3>
&lt;ul>
&lt;li>RHCOP attains a min-max AoS within &lt;strong>1.07×&lt;/strong> of the MILP optimum, and GreedyOL within &lt;strong>1.13×&lt;/strong>.&lt;/li>
&lt;li>More gateways and servers reduce AoS due to enhanced redundancy and flexibility.&lt;/li>
&lt;li>Equal numbers of &lt;strong>VNF-Cs&lt;/strong> (collection) and &lt;strong>VNF-Ps&lt;/strong> (processing) yield optimal freshness.&lt;/li>
&lt;/ul>
&lt;h2 id="broader-impact">Broader Impact&lt;/h2>
&lt;p>The proposed system lays groundwork for &lt;strong>energy-aware, delay-sensitive IoT applications&lt;/strong>, especially in &lt;strong>remote or energy-constrained environments&lt;/strong>. The results provide insights into the tradeoffs between &lt;strong>computation freshness&lt;/strong>, &lt;strong>resource allocation&lt;/strong>, and &lt;strong>green network deployment&lt;/strong> strategies.&lt;/p>
&lt;p>📄 [IEEE Transactions on Services Computing Submission] — Coming Soon&lt;/p></description></item><item><title>Task Offloading and Approximate Computing in Solar Powered IoT Networks</title><link>https://junfei-z.github.io/research/task-offloading-and-approximate-computing-in-solar-powered-iot-networks/</link><pubDate>Sun, 07 Jan 2024 00:00:00 +0000</pubDate><guid>https://junfei-z.github.io/research/task-offloading-and-approximate-computing-in-solar-powered-iot-networks/</guid><description>&lt;p>This research proposes a novel framework for minimizing the &lt;strong>total energy consumption&lt;/strong> of solar-powered IoT networks through &lt;strong>task offloading and approximate computing&lt;/strong>. Devices can choose between local execution (exact or approximate) or offloading tasks to a solar-powered edge server.&lt;/p>
&lt;h2 id="core-objectives">Core Objectives&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Reduce energy usage&lt;/strong> by allowing approximate task execution when bounded errors are tolerable.&lt;/li>
&lt;li>&lt;strong>Leverage digital twins (DTs)&lt;/strong> to estimate future energy availability and channel conditions.&lt;/li>
&lt;li>&lt;strong>Optimize offloading decisions&lt;/strong> and resource allocation across time slots and channels.&lt;/li>
&lt;/ul>
&lt;h2 id="technical-highlights">Technical Highlights&lt;/h2>
&lt;h3 id="milp-formulation">MILP Formulation&lt;/h3>
&lt;ul>
&lt;li>Designed the &lt;strong>first MILP&lt;/strong> to jointly optimize:
&lt;ul>
&lt;li>Task offloading decisions&lt;/li>
&lt;li>Approximate vs. exact execution&lt;/li>
&lt;li>Channel allocation&lt;/li>
&lt;li>Virtual machine (VM) assignment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Captures constraints on energy arrivals, CPU cycles, approximation error bounds, and VM capacity.&lt;/li>
&lt;/ul>
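&lt;p>To make the decision space concrete, here is a minimal per-task sketch of the three execution modes (local exact, local approximate, offload) compared by energy cost; the per-unit costs are hypothetical, and unlike the MILP this greedy rule ignores the joint channel/VM-capacity constraints:&lt;/p>

```python
from dataclasses import dataclass

@dataclass
class Task:
    cycles: int          # CPU cycles for exact execution
    bits: int            # input size to upload if offloaded
    approx_ratio: float  # fraction of cycles kept under approximation (0..1]

# Illustrative per-unit energy costs; placeholders, not the paper's parameters.
E_CYCLE_LOCAL = 1e-9   # J per CPU cycle on the device
E_BIT_TX      = 5e-8   # J per bit transmitted to the edge server

def cheapest_mode(task, error_tolerable):
    """Pick the lowest-energy mode for one task in isolation. The full MILP
    instead optimizes all tasks jointly under energy-arrival, channel, and
    VM-capacity constraints."""
    options = {
        "local_exact": task.cycles * E_CYCLE_LOCAL,
        # server-side energy comes from solar and is not charged to the device
        "offload": task.bits * E_BIT_TX,
    }
    if error_tolerable:
        options["local_approx"] = task.cycles * task.approx_ratio * E_CYCLE_LOCAL
    return min(options, key=options.get)

mode = cheapest_mode(Task(cycles=10**7, bits=10**5, approx_ratio=0.4), True)
```

&lt;p>With these placeholder costs, approximation wins when errors are tolerable, and offloading wins otherwise.&lt;/p>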
&lt;h3 id="dt-assisted-receding-horizon-control-dt-rhc">DT-Assisted Receding Horizon Control (DT-RHC)&lt;/h3>
&lt;ul>
&lt;li>Introduced a &lt;strong>DT-based control algorithm&lt;/strong> using:
&lt;ul>
&lt;li>&lt;strong>Gaussian Mixture Models (GMMs)&lt;/strong> to predict energy and channel gain&lt;/li>
&lt;li>Sliding-window MILP optimization for dynamic scheduling&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Achieves energy usage within &lt;strong>1.62×&lt;/strong> of the MILP optimum while requiring only &lt;strong>causal (past) data&lt;/strong>.&lt;/li>
&lt;/ul>
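&lt;p>The sliding-window loop behind DT-RHC (and RHCOP above) follows a common receding-horizon pattern; a minimal sketch, with toy stand-ins for the forecaster and the windowed MILP solver:&lt;/p>

```python
def receding_horizon(horizon, window, predict, solve_window, apply_first):
    """Generic receding-horizon loop: at each slot, forecast `window` slots
    ahead, solve the small optimization over that window, commit only the
    first decision, then slide the window forward."""
    committed = []
    for t in range(horizon):
        forecast = predict(t, window)        # e.g. GMM energy/channel forecast
        plan = solve_window(t, forecast)     # e.g. windowed MILP (stubbed here)
        committed.append(apply_first(plan))  # execute only the first-slot action
    return committed

# Toy stand-ins: pick the index of the highest-energy slot in each window.
energy = [3, 1, 4, 1, 5, 9, 2, 6]
predict = lambda t, w: energy[t:t + w]
solve_window = lambda t, f: max(range(len(f)), key=f.__getitem__)
apply_first = lambda plan: plan

decisions = receding_horizon(horizon=5, window=3, predict=predict,
                             solve_window=solve_window, apply_first=apply_first)
```

&lt;p>Committing only the first decision each slot is what keeps the controller causal: future windows are re-solved once fresh measurements arrive.&lt;/p>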
&lt;h3 id="results--evaluation">Results &amp;amp; Evaluation&lt;/h3>
&lt;ul>
&lt;li>DT-RHC significantly outperforms random strategies across evaluation settings such as:&lt;/li>
&lt;ul>
&lt;li>Energy consumption vs. number of devices&lt;/li>
&lt;li>Impact of approximation ratios&lt;/li>
&lt;li>Task completion within extended time horizons&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Simulations conducted in Python + Gurobi over 100×100 m² deployments using realistic solar input and wireless models.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This study demonstrates the viability of integrating &lt;strong>approximate computing and intelligent offloading&lt;/strong> in &lt;strong>renewable-powered IoT environments&lt;/strong>. It provides a robust foundation for future &lt;strong>distributed optimization and adaptive energy-aware network control&lt;/strong>.&lt;/p>
&lt;p>&lt;a href="https://doi.org/10.1109/LNET.2023.3328893">IEEE Paper DOI: 10.1109/LNET.2023.3328893&lt;/a>&lt;/p></description></item></channel></rss>