Bridging Optimal Control and Reinforcement Learning for Node-Level Vaccine Allocation: A Regime-Based Comparative Analysis

📄 Master’s Thesis, University of Pennsylvania (2026). Advisor: Prof. Saswati Sarkar.

In the first weeks of a pandemic, vaccines must be allocated across a large, heterogeneous population under a tight daily dose budget, over a horizon of weeks to months. A deployable policy must name specific individuals rather than group-level proportions, and it must cope with three structural difficulties: sequential decisions over a long horizon with a delayed reward signal; a combinatorial daily action space of size $\binom{N}{K}$ (choosing $K$ recipients from $N$ individuals); and the fact that an individual's network position matters as much as their demographic group.
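To give a sense of scale for the $\binom{N}{K}$ action space, the short Python sketch below counts the possible daily allocations. The population sizes and dose budgets are illustrative assumptions, not figures from the thesis, and `action_space_size` is a hypothetical helper name.

```python
import math


def action_space_size(n_individuals: int, k_doses: int) -> int:
    """Number of distinct ways to pick k_doses recipients from n_individuals."""
    return math.comb(n_individuals, k_doses)


# Illustrative sizes only (assumed, not taken from the thesis).
for n, k in [(100, 5), (1_000, 10), (10_000, 100)]:
    size = action_space_size(n, k)
    # len(str(size)) - 1 is the order of magnitude (floor of log10).
    print(f"N={n:>6}, K={k:>3}: about 10^{len(str(size)) - 1} allocations")
```

Even modest values of $N$ and $K$ rule out enumerating daily actions, which is why the policy has to score individuals rather than whole allocations.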

Interactive Demo

The companion demo walks through the thesis visually:

Three-group population model — baseline (X), high-risk elderly (Y), and high-contact hubs (Z), each with group-specific symptomatic, hospitalisation, and case-fatality rates.
10-compartment SEPAILHRVD disease model — susceptible, latent, pre-symptomatic, asymptomatic, symptomatic, late-stage, hospitalised, recovered, vaccinated, and dead.
Barabási–Albert network construction — watch preferential attachment grow a scale-free contact graph and the characteristic power-law degree tail emerge.
Stochastic simulator — seed infections in any group mix and watch an unvaccinated outbreak unfold day by day, reporting cumulative deaths as the no-intervention baseline.
Method comparison (coming soon) — OC-Random, OC-high, Naive RL, and Node RL on identical seeds.

👉 Open the interactive demo
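The Barabási–Albert construction listed among the demo steps can be sketched in a few lines of standard-library Python. This is a minimal illustration of textbook preferential attachment, not the thesis or demo code; the function names and the parameters `n_nodes`, `m_links`, and `seed` are assumptions made for this example.

```python
import random
from collections import Counter


def barabasi_albert(n_nodes, m_links, seed=None):
    """Grow an undirected graph by preferential attachment.

    Each new node attaches to m_links distinct existing nodes, chosen with
    probability proportional to their current degree. Returns an edge list.
    """
    if not 1 <= m_links < n_nodes:
        raise ValueError("need 1 <= m_links < n_nodes")
    rng = random.Random(seed)
    edges = []
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = []
    # The first new node (index m_links) connects to nodes 0..m_links-1.
    targets = list(range(m_links))
    for new in range(m_links, n_nodes):
        for t in targets:
            edges.append((new, t))
        endpoints.extend(targets)
        endpoints.extend([new] * m_links)
        # Choose distinct targets for the next node.
        chosen = set()
        while len(chosen) < m_links:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges


def degrees(edges):
    """Count the degree of every node that appears in the edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


edges = barabasi_albert(n_nodes=2000, m_links=3, seed=0)
deg = sorted(degrees(edges).values())
print("median degree:", deg[len(deg) // 2], "max degree:", deg[-1])
```

Comparing the median degree with the maximum degree gives a quick view of the heavy tail: a few hubs collect far more contacts than the typical node.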

Contributions

C1 — Stochastic node-level simulator: a high-fidelity environment integrating an explicit Barabási–Albert contact network with a 10-compartment SEPAILHRVD model, capturing intrinsic stochasticity of infection events and individual-level risk heterogeneity.
C2 — OC-high: augments principled group-level optimal control with a high-degree-first intra-group heuristic, bridging aggregate policy and individual action.
C3 — Node RL: an end-to-end actor–critic with a shared-parameter scoring MLP and Gumbel-Top-$K$ reparameterised sampling, yielding $O(K)$ gradient variance versus $\Theta(N)$ for independent Bernoulli baselines.
C4 — Regime map: systematic benchmarking across population size, horizon, and initial-infection ratio identifying when each method is preferable — and when the additional compute of node-level RL is justified.
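The sampling step behind C3 can be illustrated with the standard Gumbel-Top-$K$ trick: add independent Gumbel(0, 1) noise to each node's score and keep the $K$ largest, which draws $K$ distinct nodes without replacement. The sketch below covers only this sampling step under simplifying assumptions. The scores are a plain list rather than the output of the thesis's scoring MLP, no gradients are computed, and `gumbel_top_k` is a hypothetical name.

```python
import math
import random


def gumbel_top_k(scores, k, rng=None):
    """Sample k distinct indices by perturbing scores with Gumbel noise.

    scores: per-node real-valued scores (treated as unnormalised logits).
    Returns the k indices with the largest perturbed scores, best first.
    """
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    rng = rng or random.Random()
    perturbed = []
    for s in scores:
        # Clamp away from 0 so both logarithms are defined.
        u = max(rng.random(), 1e-300)
        gumbel = -math.log(-math.log(u))
        perturbed.append(s + gumbel)
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


# Illustrative scores for six nodes (assumed values) with a budget of two doses.
node_scores = [0.2, 1.5, -0.3, 2.1, 0.0, 0.9]
print(gumbel_top_k(node_scores, k=2, rng=random.Random(0)))
```

Higher-scoring nodes are selected more often, but every node keeps a non-zero chance, which preserves exploration while still returning exactly $K$ recipients per day.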

Headline Findings

OC-high matches or beats Node RL in most regimes at roughly two orders of magnitude less preparation cost.
Node RL’s advantage is real but confined to short horizons and hub-heavy initial infections, where the mean-field assumption underlying OC-high structurally breaks down.
The intra-group high-degree heuristic alone accounts for a 5–10% reduction in deaths on average, comparable to the contribution of the group-level OC rates themselves.
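The intra-group high-degree heuristic can be sketched as follows: given a number of doses per group, vaccinate the highest-degree eligible nodes within each group first. This is a simplified illustration, not the thesis implementation. It assumes the group-level optimal-control rates have already been converted to integer dose counts, and every name in it (`allocate_high_degree_first`, `group_doses`, `node_group`, `degree`, `eligible`) is hypothetical.

```python
def allocate_high_degree_first(group_doses, node_group, degree, eligible):
    """Pick vaccine recipients group by group, highest degree first.

    group_doses: dict mapping group label -> integer doses for today.
    node_group:  dict mapping node id -> group label.
    degree:      dict mapping node id -> contact-network degree.
    eligible:    iterable of node ids that can still be vaccinated.
    """
    chosen = []
    for group, doses in group_doses.items():
        candidates = [v for v in eligible if node_group[v] == group]
        # Sort by descending degree; break ties by node id for determinism.
        candidates.sort(key=lambda v: (-degree[v], v))
        chosen.extend(candidates[:doses])
    return chosen


# Toy example with assumed values: five nodes across groups X, Y, and Z.
node_group = {0: "X", 1: "X", 2: "Z", 3: "Z", 4: "Y"}
degree = {0: 2, 1: 5, 2: 9, 3: 4, 4: 1}
todays_doses = {"X": 1, "Z": 1, "Y": 1}
print(allocate_high_degree_first(todays_doses, node_group, degree, eligible=range(5)))
# prints [1, 2, 4]
```

The group-level split stays whatever the optimal-control solution prescribes; only the choice of individuals within each group changes.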

Last updated on Apr 25, 2026
