Research Overview

Robustness of Theory of Mind in Large Language Models

Evaluating whether LLMs genuinely reason about mental states or merely rely on memorized patterns and shallow heuristics, tested across 750 items, 3 models, and 3 prompting conditions.

Three Hypotheses

H1 Memorization

Do models memorize specific ToM patterns from training data, or can they generalize to novel surface forms?

Key Metric: GR = NSF / Baseline
H2 Shallow Heuristics

Do models rely on trigger words and surface-level cues rather than genuinely tracking causal belief changes?

Key Metrics: TDS, APER
H3 Approximative Reasoning

Do models merely approximate mental-state reasoning, or do they genuinely track beliefs across domain shifts?

Key Metric: DTS = DT / Baseline
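Both ratio metrics above share the same form: accuracy on a perturbed item set divided by accuracy on the matched baseline set. A minimal sketch, assuming NSF denotes accuracy on novel-surface-form items (H1) and DT denotes accuracy on domain-transfer items (H3); the function names and the item-level data below are illustrative, not part of the study:

```python
def accuracy(results):
    """Fraction of items answered correctly; results is a list of booleans."""
    return sum(results) / len(results)

def ratio_metric(variant_results, baseline_results):
    """Generic robustness ratio: variant accuracy over baseline accuracy.

    GR  = accuracy(novel surface forms) / accuracy(baseline)
    DTS = accuracy(domain-transfer items) / accuracy(baseline)

    A value near 1.0 suggests the capability survives the perturbation;
    a value well below 1.0 suggests reliance on memorized patterns (GR)
    or failure to transfer belief tracking across domains (DTS).
    """
    return accuracy(variant_results) / accuracy(baseline_results)

# Fabricated item-level correctness, for illustration only:
baseline = [True] * 9 + [False]      # 90% accuracy on baseline items
novel_sf = [True] * 6 + [False] * 4  # 60% accuracy on novel surface forms

gr = ratio_metric(novel_sf, baseline)
print(f"GR = {gr:.2f}")  # 0.60 / 0.90 -> GR = 0.67
```

The same `ratio_metric` call with domain-transfer results in place of `novel_sf` yields DTS.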