Vaccine RL · Interactive Demo
Master's Thesis · University of Pennsylvania · 2026

Who gets the vaccine today?

A node-level reinforcement-learning approach to vaccine allocation on heterogeneous contact networks — bridging optimal control and deep RL under a hard daily dose budget.

Junfei Zhan · Advisor Prof. Saswati Sarkar

Scalable Node-Level Vaccine Allocation on Contact Networks

In the first weeks of a pandemic, who should get the vaccine today? This demo walks through the thesis — the population model, the disease dynamics, the contact network — and lets you run a stochastic outbreak with your own seed configuration.

10-state SEPAILHRVD model · Barabási–Albert scale-free contact network · Three-group risk stratification
Part 1

Three population groups

Every individual in the network is assigned to exactly one of three demographic groups: a hub set defined by a degree threshold, a high-risk elderly subpopulation drawn by Bernoulli sampling on the remainder, and the baseline majority. Each group has its own symptomatic, hospitalisation, and case-fatality rates.
(Demo profile — slightly more virulent than the paper's COVID-calibrated values to make outbreak dynamics visible at N=200–1500 scale.)
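
The assignment rule described above can be sketched in a few lines. This is a minimal illustration, not the thesis code: the function name `assign_groups`, the hub multiplier `alpha = 2.0`, and the Bernoulli probability `p_elderly = 0.17` are all assumptions chosen for the example.

```python
import random
from statistics import mean, pstdev

def assign_groups(degree, alpha=2.0, p_elderly=0.17, seed=0):
    """Assign each node to X (baseline), Y (high-risk) or Z (hub).

    degree: dict node -> degree. alpha and p_elderly are illustrative
    assumptions, not values taken from the thesis.
    """
    rng = random.Random(seed)
    mu = mean(degree.values())
    sigma = pstdev(degree.values())
    threshold = mu + alpha * sigma
    groups = {}
    for node, k in degree.items():
        if k >= threshold:
            groups[node] = "Z"      # hub set: degree >= mu + alpha * sigma
        elif rng.random() < p_elderly:
            groups[node] = "Y"      # Bernoulli draw on the remainder
        else:
            groups[node] = "X"      # baseline majority
    return groups

# Toy example: 99 low-degree nodes and one obvious hub.
toy_degree = {i: 2 for i in range(99)}
toy_degree[99] = 50
toy_groups = assign_groups(toy_degree)
```

Because hubs are removed before the Bernoulli draw, a node can never be both Z and Y, which matches the "exactly one of three groups" requirement.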

Group X — Baseline

The general population

The low-risk majority — working-age adults with typical contact patterns. Most of the population belongs here by default.

Symptomatic s_X = 0.50 · Hospitalisation p_X = 0.12 · Fatality d_X = 0.10 · Prevalence ~80%
Compound case-fatality ≈ 0.6%
Group Y — High-risk

Elderly (65+)

Elevated disease severity across every branching probability. Moderate contact patterns but markedly worse clinical outcomes once infected.

Symptomatic s_Y = 0.85 · Hospitalisation p_Y = 0.50 · Fatality d_Y = 0.65 · Prevalence ~17%
Compound case-fatality ≈ 27.6%, about 46× Group X
Group Z — Hub

High-contact individuals

Delivery workers, transit drivers, frontline retail — the hubs of the contact graph. Few in number but wired into a disproportionate share of all transmission pathways.

Symptomatic s_Z = 0.60 · Hospitalisation p_Z = 0.22 · Fatality d_Z = 0.20 · Degree ≥ µ + ασ
Structural role: drives outbreak velocity via rich-get-richer topology · compound CFR ≈ 2.6%
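
The compound case-fatality figures quoted for the three groups are consistent with multiplying the three branching probabilities (symptomatic, then hospitalised, then dead). The short check below assumes that product form; the helper name `compound_cfr` is made up for the example.

```python
# Branching probabilities quoted in the three group cards.
GROUPS = {
    "X": {"s": 0.50, "p": 0.12, "d": 0.10},
    "Y": {"s": 0.85, "p": 0.50, "d": 0.65},
    "Z": {"s": 0.60, "p": 0.22, "d": 0.20},
}

def compound_cfr(s, p, d):
    """P(death | infected), assuming death is reached only via the
    symptomatic -> hospitalised -> dead path."""
    return s * p * d

for name, params in GROUPS.items():
    print(f"Group {name}: {100 * compound_cfr(**params):.1f}%")

# Relative risk of the elderly group versus the baseline group.
ratio_y_over_x = compound_cfr(**GROUPS["Y"]) / compound_cfr(**GROUPS["X"])
```

Under this reading the Y-to-X ratio is 0.27625 / 0.006, or roughly 46.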
Part 2

The 10-compartment SEPAILHRVD model

A COVID-style disease model with distinct susceptible, latent (exposed), pre-symptomatic, asymptomatic, symptomatic, late-stage, hospitalised, recovered, vaccinated, and dead compartments. Three of them (P, A, I) drive onward transmission. Click a state to inspect it.

Click any state

Explore the 10 compartments of the model. States in red are infectious. The three infectious weights w_P = 0.8, w_A = 0.5, w_I = 1.0 modulate how much each one contributes to the per-node force of infection.

Stage durations: Latent 3 d · Pre-symptomatic 2 d · Asymptomatic / Late-stage 5 d · Hospitalised 10 d

S · Susceptible
E · Exposed
P · Pre-symptomatic
A · Asymptomatic
I · Symptomatic
L · Late-stage
H · Hospitalised
R · Recovered
V · Vaccinated
D · Dead
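
One way to read the per-node force of infection is as a weighted count of a node's infectious neighbours, using the three weights quoted above. The sketch below makes that concrete. The per-contact rate `beta` and the exponential link to a daily infection probability are assumptions for illustration, not details taken from the thesis.

```python
import math

# Relative infectiousness of the three transmitting states (from the text).
WEIGHTS = {"P": 0.8, "A": 0.5, "I": 1.0}

def force_of_infection(node, neighbors, state):
    """Weighted count of infectious neighbours of `node`."""
    return sum(WEIGHTS.get(state[v], 0.0) for v in neighbors[node])

def infection_probability(node, neighbors, state, beta=0.05):
    """Daily S -> E probability; beta is a hypothetical per-contact rate."""
    lam = beta * force_of_infection(node, neighbors, state)
    return 1.0 - math.exp(-lam)

# Node 0 is susceptible with one P, one A, one I and one S neighbour.
neighbors = {0: [1, 2, 3, 4]}
state = {0: "S", 1: "P", 2: "A", 3: "I", 4: "S"}
pressure = force_of_infection(0, neighbors, state)   # 0.8 + 0.5 + 1.0
```

Non-infectious neighbours (S, E, L, H, R, V, D) contribute zero, so only P, A, and I appear in the weight table.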
Part 3

Building a Barabási–Albert contact network

Real contact graphs are scale-free: most people have a few connections, a few people have many. The BA model reproduces this with a single rule: each new node v attaches to an existing node u with probability proportional to u's current degree, P(v → u) = k_u / Σ_w k_w. Hit play and watch the hubs emerge.
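
The attachment rule can be sketched without any graph library by keeping a list of edge endpoints: a uniform draw from that list selects node u with probability k_u / Σ_w k_w. This is a generic Barabási–Albert sketch, not the demo's own code. The function name, the choice of m isolated seed nodes, and the parameter name `m` (edges added per new node) are assumptions.

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=0):
    """Grow an n-node graph where each new node adds m edges to existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = []
    stubs = []                      # every edge endpoint, one entry per unit of degree
    targets = list(range(m))        # the first new node links to the m seed nodes
    for new in range(m, n):
        for u in targets:
            edges.append((new, u))
            stubs.extend((new, u))
        chosen = set()
        while len(chosen) < m:      # m distinct targets for the next node
            chosen.add(rng.choice(stubs))
        targets = list(chosen)
    return edges

edges = barabasi_albert(500, 2, seed=1)
degree = Counter(v for edge in edges for v in edge)
```

Sorting `degree.values()` in descending order shows the heavy tail: a handful of early nodes collect far more links than the median node.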

[Interactive panel: network growth animation (0 / 500 nodes) with a live degree-distribution plot. Slider defaults: 500 and 2.]

After growth completes, nodes are coloured by group: X, Y, or Z (hub).
Part 4

Run a stochastic outbreak

Pick your seed configuration — how many initial infections and which groups they land in — then press play. This is the no-vaccination baseline: what happens if nothing is done. The final death count is the upper bound that every allocation method must beat.
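
The seed configuration amounts to splitting a fixed number of initial infections across the three groups after normalising the shares. The sketch below does this with largest-remainder rounding so that the counts always sum to the requested total. The rounding rule and the name `allocate_seeds` are assumptions; the demo may round differently.

```python
def allocate_seeds(total, shares):
    """Split `total` initial infections across groups.

    shares: dict group -> non-negative weight (need not sum to 100).
    Uses largest-remainder rounding so the counts sum to `total`.
    """
    weight_sum = sum(shares.values())
    if weight_sum <= 0:
        raise ValueError("at least one share must be positive")
    exact = {g: total * w / weight_sum for g, w in shares.items()}
    counts = {g: int(x) for g, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining seeds to the groups with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts

demo_default = allocate_seeds(30, {"X": 50, "Y": 25, "Z": 25})
paper_default = allocate_seeds(300, {"X": 50, "Y": 17, "Z": 33})
```

With the 50:17:33 split used later in the results, 300 seeds divide exactly into 150, 51, and 99.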

Network: 600 · 2
Initial seed: 30
Share across groups (auto-normalised to 100%): X 50% · Y 25% · Z 25%
Simulation: 90 d
When the outbreak completes, the demo reports total deaths (as a % of the population) with no vaccination. This is the baseline the allocation methods must beat.

Part 5 · Final Results

Four allocation methods, head-to-head

All four methods are evaluated on N_eval = 30 stochastic rollouts of the paper's network-level simulator. The default configuration is N = 5 000, horizon T = 60 days, daily budget K = 10, and 300 initial infections split 50:17:33 across X:Y:Z. The RL methods are trained with PPO for 1 500 episodes; reported numbers are the best of three training seeds.
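
The reporting protocol can be sketched as follows. The page does not say whether the "±" figures are standard deviations or standard errors; this sketch assumes the sample standard deviation. It also assumes "best of three" means the training seed with the lowest mean deaths. Function names and the toy numbers are hypothetical.

```python
from statistics import mean, stdev

def summarise(deaths):
    """Mean and sample standard deviation of per-rollout death counts."""
    return mean(deaths), stdev(deaths)

def best_of_seeds(per_seed_deaths):
    """Pick the training seed with the lowest mean deaths, then summarise it.

    per_seed_deaths: dict seed -> list of death counts, one per rollout.
    """
    best = min(per_seed_deaths, key=lambda s: mean(per_seed_deaths[s]))
    return best, summarise(per_seed_deaths[best])

# Toy numbers for illustration only (not thesis results).
toy = {"seed_a": [30, 32, 34], "seed_b": [20, 22, 24]}
best_seed, (avg, spread) = best_of_seeds(toy)
```

Selecting the best seed on the same rollouts used for reporting is optimistic, so a held-out set of rollouts would be the more conservative design.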

Rank 1 · Optimal Control · Proposed

OC-high

22.0 ± 4.6 mean deaths over 30 rollouts
Group-level OC solved on the ODE, then intra-group doses given to the highest-degree susceptibles. Wins the default config and most practical regimes.
9 s offline · 153 ms / ep
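
The intra-group step can be sketched as a function that turns per-group dose quotas into a concrete node list. With `rule="high"` it picks the highest-degree susceptibles in each group (the OC-high behaviour described above). Any other value picks uniformly at random within the group, which is the OC-Random variant. The function and argument names are assumptions, and the group-level quotas are taken as given rather than computed.

```python
import random

def pick_doses(quota, group, degree, susceptible, rule="high", seed=0):
    """Turn per-group dose quotas into a list of nodes to vaccinate.

    quota: dict group -> doses today; group / degree: dict node -> value;
    susceptible: set of nodes still eligible for a dose.
    """
    rng = random.Random(seed)
    chosen = []
    for g, doses in quota.items():
        pool = [v for v in susceptible if group[v] == g]
        if rule == "high":
            pool.sort(key=lambda v: (-degree[v], v))   # highest degree first
        else:
            pool.sort()                                 # fixed order before shuffling
            rng.shuffle(pool)                           # uniform within the group
        chosen.extend(pool[:doses])
    return chosen

group = {0: "X", 1: "X", 2: "X", 3: "Z"}
degree = {0: 1, 1: 5, 2: 3, 3: 9}
picked = pick_doses({"X": 2, "Z": 1}, group, degree, {0, 1, 2, 3})
```

If a group has fewer susceptibles than its quota, the slice simply returns everyone available; how unused doses are reassigned is left out of this sketch.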
Rank 2 · Reinforcement Learning · Proposed

Node RL

22.4 ± 3.4 mean deaths over 30 rollouts
Shared MLP scorer with a Gumbel-Top-K action head, giving O(K) policy-gradient variance independent of N. Beats OC-high at short horizons or hub-overloaded starts.
625 s offline · 75 ms / ep
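
The Gumbel-Top-K trick itself is standard: add independent Gumbel noise to each node's score and keep the K largest, which samples K distinct nodes without replacement. The sketch below shows the sampling step and the matching Plackett-Luce log-probability, which has exactly K terms. It is a generic illustration with hypothetical function names, not the thesis's action head, and it omits the MLP scorer entirely.

```python
import math
import random

def gumbel_top_k(scores, k, rng=random):
    """Sample k distinct indices by perturbing scores with Gumbel noise."""
    noisy = []
    for i, s in enumerate(scores):
        u = rng.random()
        while u == 0.0:             # avoid log(0)
            u = rng.random()
        noisy.append((s - math.log(-math.log(u)), i))
    noisy.sort(reverse=True)
    return [i for _, i in noisy[:k]]

def log_prob_top_k(scores, picked):
    """Plackett-Luce log-probability of the ordered pick: k softmax terms."""
    remaining = set(range(len(scores)))
    total = 0.0
    for i in picked:
        log_z = math.log(sum(math.exp(scores[j]) for j in remaining))
        total += scores[i] - log_z
        remaining.remove(i)
    return total

sample = gumbel_top_k([0.0] * 50, 10, random.Random(1))
```

Every sample uses exactly K doses, so the daily budget is satisfied by construction rather than by a penalty term.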
Rank 3 · Optimal Control · Baseline

OC-Random

25.2 ± 4.1 mean deaths over 30 rollouts
Same group-level OC rates as OC-high, but doses go to uniformly random nodes within each group. Isolates the value of the degree heuristic.
9 s offline
Rank 4 · Reinforcement Learning · Baseline

Naive RL

26.6 ± 4.2 mean deaths over 30 rollouts
Same scorer, same PPO loop, but with an independent Bernoulli action head. Θ(N) gradient variance drowns out the K informative decisions per day.
792 s offline · 75 ms / ep
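
For contrast, one plausible reading of an "independent Bernoulli action head" is a separate vaccinate-or-skip coin per node, so the log-probability of a day's action is a sum over all N nodes. The sketch below illustrates only that structure. How the thesis reconciles the sampled mask with the daily budget K is not stated on this page, so no budget handling is shown. All names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_head(scores, rng=random):
    """Sample one independent vaccinate / skip bit per node."""
    return [1 if rng.random() < sigmoid(s) else 0 for s in scores]

def log_prob_bernoulli(scores, mask):
    """Log-probability of the mask: one term per node, N terms in total."""
    total = 0.0
    for s, a in zip(scores, mask):
        p = sigmoid(s)
        total += math.log(p) if a else math.log(1.0 - p)
    return total

scores = [0.0] * 1000
mask = bernoulli_head(scores, random.Random(0))
n_selected = sum(mask)      # random, and generally far from a budget of K = 10
```

Each of the N terms contributes noise to the policy gradient even though only about K of the decisions matter on a given day, which is the intuition behind the Θ(N) variance claim.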
Δ_heur: intra-group heuristic
+1.3 deaths (median)
Positive in 11 / 12 regime points, ≈ 5–10% of the OC-high reference. Reporting "OC" on a network without specifying the intra-group selection rule conflates two separable components. Counter-productive only at init = 1 200 on N = 5 000 (overloaded start).
Δ_arch: Top-K action space
+1.9 deaths (median)
Positive in 11 / 12 regime points. Node RL and Naive RL share everything else (scorer, PPO loop, hyperparameters), so the gap is attributable to the action-head reparametrisation. This is empirically consistent with the O(K) vs Θ(N) gradient-variance bound.
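
The two ablation quantities can be read as paired differences in mean deaths: baseline minus proposed, so that a positive value means the proposed component saves lives. That sign convention is an inference from the default-configuration numbers, not something the page states. The sketch below computes both deltas for the default point. The headline medians need all 12 regime points, which are not listed here.

```python
from statistics import median

# Default-configuration means quoted in the four method cards.
DEFAULT = {"OC-high": 22.0, "Node RL": 22.4, "OC-Random": 25.2, "Naive RL": 26.6}

def deltas(point):
    """Return (delta_heur, delta_arch) for one regime point's mean deaths."""
    d_heur = point["OC-Random"] - point["OC-high"]   # value of the degree heuristic
    d_arch = point["Naive RL"] - point["Node RL"]    # value of the Top-K head
    return d_heur, d_arch

def median_deltas(points):
    """Median of each delta across a list of regime points."""
    pairs = [deltas(p) for p in points]
    return median(h for h, _ in pairs), median(a for _, a in pairs)

d_heur_default, d_arch_default = deltas(DEFAULT)
```

At the default point the differences come out near 3.2 and 4.2 deaths, both larger than the reported medians, so the default configuration is a comparatively favourable case for both components.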

Regime analysis · three axes

Sweeping population size, horizon length, and initial infection ratio. The shaded band is the gap between Node RL and OC-high in each row: when it dips below zero, Node RL wins outright.

Population size

Ratios held fixed (K = 0.2% N, init = 6% N, T = 60)
[Plot: OC-high vs. Node RL across population sizes]
OC-high wins at every N. The gap grows from +0.4 (N=5k) to +3.8 (N=20k) — as the susceptible pool grows, the ODE mean-field becomes more accurate, not less.
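
Holding the ratios fixed means the daily budget and the initial infection count scale linearly with N. The helper below works out the implied configuration for each population size. It assumes the 50:17:33 seed split is also held fixed at every N, which the page does not state explicitly. The function name is hypothetical.

```python
def scaled_config(n, budget_frac=0.002, init_frac=0.06, split=(50, 17, 33)):
    """Daily budget, initial infections and their X:Y:Z split for population n."""
    k = round(n * budget_frac)
    init = round(n * init_frac)
    total = sum(split)
    x = init * split[0] // total
    y = init * split[1] // total
    z = init - x - y                # remainder goes to Z so the parts sum to init
    return {"K": k, "init": init, "X": x, "Y": y, "Z": z}

configs = {n: scaled_config(n) for n in (5_000, 10_000, 20_000)}
```

For N = 5 000 this reproduces the default configuration: K = 10 and 300 initial infections.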

Horizon length

Fixed N = 10 000, daily budget K = 20, init = 600
[Plot: OC-high vs. Node RL across horizon lengths]
Node RL wins at T = 30 (19.0 vs 20.1, gap −1.1). Short horizons are dominated by stochastic first-wave dynamics that the pre-computed OC plan can't react to.

Initial infections

Fixed N = 5 000, daily budget K = 10, T = 60, 50:17:33 split
[Plot: OC-high vs. Node RL across initial infection counts]
Node RL wins at init = 1 200 (27.6 vs 28.6, gap −1.0). A large fraction of hubs is already infected at t = 0, so high-degree targeting wastes doses on nodes behind an active wavefront.

Regime map · which method to use

The thesis's main practical deliverable: a decision rule mapping deployment conditions to the preferable method, with the measured gap and relative preparation cost.

Regime | Representative point | Gap | Cost ratio | Recommended
Default deployment | N = 5k, T = 60, init = 300 | +0.4 | ~70× | OC-high (tied)
Large population, ratios held | N = 10–20k, fixed % | +0.8 → +3.8 | ~70–120× | OC-high
Long horizon | T = 70–90 | +1.0 → +1.9 | ~70× | OC-high
Short horizon | T = 30 | −1.1 | ~180× | Node RL
Overloaded initial state | init = 1 200 on N = 5k | −1.0 | ~70× | Node RL
Intra-group equity priority | Any regime where Δ_heur < 0 | n/a | n/a | OC-Random or Node RL
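
The regime map can be turned into a small lookup function. The sketch below is a simplification: it uses the measured regime points (T = 30, and init = 1 200 on N = 5 000, i.e. 24% of the population) as hard cut-offs. Behaviour between measured points is an assumption, not a result. The function name and the `equity_priority` flag are hypothetical.

```python
def recommend(horizon_days, init_frac, equity_priority=False):
    """Map deployment conditions to the regime map's recommended method.

    The cut-offs are the measured regime points (T = 30, init = 1 200 of
    5 000); where the true crossover lies between measured points is unknown.
    """
    if equity_priority:
        return "OC-Random or Node RL"
    if horizon_days <= 30:
        return "Node RL"            # short horizon
    if init_frac >= 1200 / 5000:
        return "Node RL"            # overloaded initial state
    return "OC-high"                # default, large-N and long-horizon regimes

default_choice = recommend(horizon_days=60, init_frac=300 / 5000)
```

Population size does not appear as an argument because, with ratios held fixed, OC-high is recommended at every N in the table.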

Four practical rules

The takeaways from the regime map, ablations, and runtime analysis, taken together.

OC-high is the default

On moderate horizons and realistic initial-infection ratios, OC-high matches or beats Node RL at roughly 1/70 of the preparation cost (9 s vs 625 s offline). If you had to pick one method without regime information, it's this.

Node RL earns its compute when the ODE breaks down

This happens at short horizons (T ≲ 30), where stochastic first-wave dynamics dominate, and at overloaded starts, where too many hubs are already infected at t = 0 for the degree heuristic to bite.

The intra-group heuristic is not a free add-on

High-degree-first selection averts a median of 1.3 deaths, the same order as the group-level OC itself. Reports that say "OC on a network" without disclosing the post-hoc rule are conflating two separable contributions.

Action-space design decides trainability

Node RL's Top-K head beats Naive RL's Bernoulli head by a median of 1.9 deaths with every other component held identical. Without the O(K) reparametrisation, individual-level RL doesn't scale to this problem at all.