Can Large Language Models Credibly Stand in for Humans in Game-Theoretic Experiments?
This work investigates the feasibility of using Large Language Models (LLMs) as proxies for human participants in behavioral game-theoretic experiments. We evaluated four LLMs—GPT-4o, Llama‑3.3‑70B‑Instruct, Llama‑3.3‑8B‑Instruct, and DeepSeek-R1 across three canonical games: the Prisoner’s Dilemma, the Ultimatum Game, and the Public Goods Game.
Research Objectives
- Evaluate behavioral alignment, persona consistency, and strategic adaptability of LLMs vs. human norms.
- Design a modular, multi-agent framework (PRIME-Router) for improved consistency and adaptability.
- Benchmark LLM behavior using MBTI-based persona prompts: Diplomat, Analyst, Sentinel, Explorer.
Core Contributions
1. Behavioral Assessment in Canonical Games
LLMs were benchmarked against human behavior using three new metrics:
- BAM (Behavioral Alignment Measure): similarity to human action distributions
- PCI (Persona Consistency Index): adherence to prompted social roles
- ASP (Adaptive Strategic Profile): responsiveness to evolving game contexts
Key findings:
- Most LLMs showed high initial BAM but struggled with adaptive consistency in repeated games.
- GPT-4o and LLaMA-3.3-70B demonstrated excellent persona consistency in one-shot games.
2. PRIME-Router Framework
To overcome adaptation and consistency limitations, we proposed PRIME-Router, a modular MoE-style architecture that:
- Spawns specialized subroles (e.g., Empathy Enforcer, Strategic Planner)
- Assigns the most suitable LLM to each subrole based on empirical performance
- Aggregates multi-agent outputs via collaboration patterns (e.g., star, debate, chain)
PRIME-Router improves:
- PCI by up to 0.23
- ASP by up to 0.32 across repeated games.
3. Implications and Outlook
- LLMs can simulate human-like behavior credibly, but strategic depth and long-horizon persona fidelity remain challenges.
- PRIME-Router paves the way for cost-effective AI agents in social science experimentation, policy modeling, and online platform simulation.
Conclusion
Our study highlights the promise and limitations of LLMs in behavioral game simulations. Structured multi-agent design like PRIME-Router significantly enhances realism, offering a new paradigm for AI-driven human modeling in experimental social science.
📄 [AAAI 2026 Submission] — In Review