Can Large Language Models Credibly Stand in for Humans in Game-Theoretic Experiments?

This work investigates whether Large Language Models (LLMs) can serve as proxies for human participants in behavioral game-theoretic experiments. We evaluated four LLMs (GPT-4o, Llama‑3.3‑70B‑Instruct, Llama‑3.3‑8B‑Instruct, and DeepSeek-R1) across three canonical games: the Prisoner’s Dilemma, the Ultimatum Game, and the Public Goods Game.

Research Objectives

  • Evaluate behavioral alignment, persona consistency, and strategic adaptability of LLMs vs. human norms.
  • Design a modular, multi-agent framework (PRIME-Router) for improved consistency and adaptability.
  • Benchmark LLM behavior using MBTI-based persona prompts: Diplomat, Analyst, Sentinel, Explorer.

Core Contributions

1. Behavioral Assessment in Canonical Games

LLMs were benchmarked against human behavior using three new metrics:

  • BAM (Behavioral Alignment Measure): similarity to human action distributions
  • PCI (Persona Consistency Index): adherence to prompted social roles
  • ASP (Adaptive Strategic Profile): responsiveness to evolving game contexts
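
This summary does not give formulas for the metrics above, so the following is only a minimal sketch of how two of them could be computed. It assumes BAM is one minus the total variation distance between LLM and human action distributions, and PCI is the share of actions judged persona-consistent. The function names, the `is_consistent` predicate, and the toy inputs are all hypothetical and are not taken from the study.

```python
from collections import Counter


def action_distribution(actions, action_space):
    """Empirical probability of each action in action_space."""
    counts = Counter(actions)
    total = len(actions)
    return {a: counts[a] / total for a in action_space}


def bam(llm_actions, human_actions, action_space):
    """Assumed BAM: 1 - total variation distance (1.0 means identical distributions)."""
    p = action_distribution(llm_actions, action_space)
    q = action_distribution(human_actions, action_space)
    tv = 0.5 * sum(abs(p[a] - q[a]) for a in action_space)
    return 1.0 - tv


def pci(actions, is_consistent):
    """Assumed PCI: fraction of actions that the persona predicate accepts."""
    return sum(1 for a in actions if is_consistent(a)) / len(actions)


# Toy inputs for illustration only (C = cooperate, D = defect); not data from the study.
llm_actions = ["C", "C", "D", "C"]
human_actions = ["C", "D", "D", "C"]

bam_score = bam(llm_actions, human_actions, ["C", "D"])
pci_score = pci(llm_actions, lambda a: a == "C")  # hypothetical "cooperative persona" check
print(bam_score, pci_score)
```

Other distances (for example Jensen-Shannon divergence) would fit the description "similarity to human action distributions" equally well; total variation is used here only because it is bounded in [0, 1] and easy to read.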

Key findings:

  • Most LLMs showed high initial BAM but struggled with adaptive consistency in repeated games.
  • GPT-4o and Llama‑3.3‑70B‑Instruct demonstrated high persona consistency in one-shot games.

2. PRIME-Router Framework

To overcome these adaptation and consistency limitations, we proposed PRIME-Router, a modular, mixture-of-experts (MoE) style architecture that:

  • Spawns specialized subroles (e.g., Empathy Enforcer, Strategic Planner)
  • Assigns the most suitable LLM to each subrole based on empirical performance
  • Aggregates multi-agent outputs via collaboration patterns (e.g., star, debate, chain)
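
The three steps above can be sketched roughly as follows. This is not the PRIME-Router implementation: the score table, the model names `model_a` and `model_b`, the `call_model` callback, and the use of a majority vote as a stand-in for the star pattern (assumed here to mean a central hub that collects proposals from the subroles) are all hypothetical.

```python
from collections import Counter

# Hypothetical per-subrole validation scores; placeholders, not results from the study.
SCORES = {
    "Empathy Enforcer": {"model_a": 0.6, "model_b": 0.8},
    "Strategic Planner": {"model_a": 0.9, "model_b": 0.7},
}


def route(scores):
    """Assign each subrole the model with the highest empirical score."""
    return {role: max(models, key=models.get) for role, models in scores.items()}


def star_aggregate(proposals):
    """Stand-in for star aggregation: the hub returns the most common proposal."""
    return Counter(proposals.values()).most_common(1)[0][0]


def play_round(scores, call_model, game_state):
    """Route subroles to models, collect one proposal per subrole, then aggregate."""
    assignment = route(scores)
    proposals = {
        role: call_model(model, role, game_state)
        for role, model in assignment.items()
    }
    return star_aggregate(proposals)


def stub_call_model(model, role, game_state):
    """Stub in place of a real LLM call: every subrole proposes cooperation."""
    return "cooperate"


assignment = route(SCORES)
action = play_round(SCORES, stub_call_model, game_state={"round": 1})
print(assignment, action)
```

A debate or chain pattern would replace `star_aggregate` with a step that feeds each subrole's output into the next call; the routing step would stay the same.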

PRIME-Router improves:

  • PCI by up to 0.23
  • ASP by up to 0.32 across repeated games.

3. Implications and Outlook

  • LLMs can simulate human-like behavior credibly, but strategic depth and long-horizon persona fidelity remain challenges.
  • PRIME-Router paves the way for cost-effective AI agents in social science experimentation, policy modeling, and online platform simulation.

Conclusion

Our study highlights both the promise and the limitations of LLMs in behavioral game simulations. Structured multi-agent designs such as PRIME-Router improve persona consistency and strategic adaptability, offering a new paradigm for AI-driven human modeling in experimental social science.

📄 [AAAI 2026 Submission] — In Review