2025

Open Problems in Mechanistic Interpretability Sharkey

2024

Kolmogorov–Arnold Transformer 2024
Physics of Language Models: Part 1, Learning Hierarchical Language Structures (Allen-Zhu, 2024)
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process (Allen-Zhu, 2024)
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems (Allen-Zhu, 2024)
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction (Allen-Zhu, 2024)
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws (Allen-Zhu, 2024)

2023

2022

2021

2006 ~ 2010

Marker-Passing Inference in the Scone Knowledge-Base System Fahlman (2006)

2001 ~ 2005

1990 ~ 2000

1980 ~ 1990

1970 ~ 1980

1960 ~ 1970

1950 ~ 1960

1900 ~ 1950

1800 ~ 1900

~ 1800

Propositional Interpretability in Artificial Intelligence 2025
RL + Transformer = A General-Purpose Problem Solver 2025
Harmonic Loss Trains Interpretable AI Models 2025
A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods 2025
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders 2025
Why do LLMs attend to the first token? 2025
KBLaM: Knowledge base augmented language model 2025
Investigating the Effectiveness of a Socratic Chain-of-Thoughts (SocraCoT) Reasoning Method for Task Planning in Robotics, A Case Study 2025
Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability 2025
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning 2025
LeanAgent: Lifelong Learning for Formal Theorem Proving 2025
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs 2025
Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia 2023
Transformers as Soft Reasoners over Language 2020
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks 2016
Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification 2024
CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text 2019
Adaptable Logical Control for Large Language Models 2024
A Foundation Model for Zero-shot Logical Query Reasoning 2024
Hybrid Declarative-Imperative Representations for Hybrid Discrete-Continuous Decision-Making 2024
Enhancing Automated Interpretability with Output-Centric Feature Descriptions 2025
Evaluating the Ripple Effects of Knowledge Editing in Language Models 2025
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes 2024
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks 2024
Visualizing and Understanding Recurrent Networks 2016
Building Machines That Learn and Think Like People 2016
UKnow: A Unified Knowledge Protocol with Multimodal Knowledge Graph Datasets for Reasoning and Vision-Language Pre-Training 2024
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents 2024
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? 2022
Understanding Epistemic Language with a Language-augmented Bayesian Theory of Mind 2024
The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis 2025
Identifying and Mitigating Social Bias Knowledge in Language Models 2025
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning 2025
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems 2025
Assessing and alleviating state anxiety in large language models 2025
Language Models use Lookbacks to Track Beliefs 2025
Theories of Mind as Languages of Thought for Thought about Thought 2025
An Affective-Taxis Hypothesis for Alignment and Interpretability 2025
Why LLMs Hallucinate, And How To Get Evidential Closure 2023
Do Large Language Models Know What They Don’t Know? 2023
Auditing Language Models for Hidden Objectives 2025
Distinguishing the Knowable from the Unknowable with Language Models 2024
Retrieval Head Mechanistically explains long-context factuality 2025
ReDeEP: Detecting hallucinations in retrieval-augmented generation via mechanistic interpretability 2025
Mechanistic Permutability: Match Features Across Layers 2025
Transcoders Find Interpretable LLM Feature Circuits 2024
Sparse feature circuits: Discovering and editing interpretable causal graphs in language models 2025
Confidence regulation neurons in language models 2024
Experts Don’t Cheat: Learning What you don’t know by predicting pairs 2024
Internal consistency and self-feedback in large language models: A survey 2024
The Internal State of an LLM Knows When It’s Lying 2023
Self-training large language models through knowledge detection 2024
Knowing what LLMs do not know: a simple yet effective self-detection method 2024
Program-Aided Reasoners Know What They Know 2024
Making retrieval-augmented language models robust to irrelevant context 2024
Combating misinformation in the age of LLMs 2024
DRAGIN: dynamic retrieval augmented Generation based on the Information Needs of LLMs 2024
RARR: Researching and Revising What Language Models Say, Using Language Models 2023
A truth maintenance system 1979
Jokes and the Logic of the Cognitive Unconscious 1980
SAEs Are Good for Steering - If You Select the Right Features 2025
Improved Representation Steering for Language Models 2025
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning 2025
A Survey on Verification and Validation, Testing and Evaluations of Neurosymbolic Artificial Intelligence 2024
Title Year
Mao et al, The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision 2019
Yi et al, Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding 2018
Qu et al, Probabilistic Logic Neural Networks for Reasoning 2019
Yang et al, Learn to Explain Efficiently via Neural Logic Inductive Learning 2020
Zhang et al, Efficient Probabilistic Logic Reasoning with Graph Neural Networks 2020
DeepProbLog: Neural Probabilistic Logic Programming 2018
Logic Tensor Networks (LTN)  
A Probabilistic Graphical Model Based on Neural-symbolic Reasoning for Visual Relationship Detection 2022
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First‑Order Logic Provers 2023
Symbolic metaprogram search improves learning efficiency and human‑like performance 2024
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought 2023
The Neuro‑Symbolic Inverse Planning Engine (NIPE): Modeling Probabilistic Social Inferences from Linguistic Inputs 2023
Bridging the Gap: Representation Spaces in Neuro-Symbolic AI 2024
Embedding Symbolic Knowledge into Deep Networks 2019
LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation 2024