| Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | (Korbak, 2025) |
| Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs | (Wen, 2025) |
| Thinking fast, slow, and everywhere in between in humans and language models | (Prystawski, 2025) |
| Emergent Symbolic Cognition: A Unifying Computational Framework for Symbolic Thought in Humans and LLMs | (Huddleston, 2025) |
| AbsenceBench: Language Models Can’t Tell What’s Missing | (Fu, 2025) |
| Because we have LLMs, we Can and Should Pursue Agentic Interpretability | (Kim, 2025) |
| Open Problems in Mechanistic Interpretability | (Sharkey, 2025) |
| Values in the wild: Discovering and analyzing values in real-world language model interactions [url] | 2025 |
| Progress on Attention [url] | 2025 |
| Kolmogorov–Arnold Transformer | 2024 |
| Physics of Language Models: Part 1, Learning Hierarchical Language Structures | (Allen-Zhu, 2024) |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | (Allen-Zhu, 2024) |
| Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems | (Allen-Zhu, 2024) |
| Physics of Language Models: Part 3.1, Knowledge Storage and Extraction | (Allen-Zhu, 2024) |
| Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws | (Allen-Zhu, 2024) |
| Marker-Passing Inference in the Scone Knowledge-Base System | Fahlman (2006) |
| What Is Answer Set Programming? | (Lifschitz, 2008) |
| The Physical Symbol System Hypothesis: Status and Prospects | (Nilsson, 2007), PSSH |
| A Logical Framework for Default Reasoning | (Poole, 1988) |
| An Assumption-based TMS | (de Kleer, 1986) |
| Applications of Circumscription to Formalizing Common-Sense Knowledge | (McCarthy, 1986) |
| Computer science as empirical inquiry: symbols and search | (Newell & Simon, 1976), PSSH |
| Formalizing Nonmonotonic Reasoning Systems | (Etherington, 1987) |