NeurIPS 2026 Interpretability Papers
References: OpenReview NeurIPS
- GnnXemplar: Exemplars to Explanations - Natural Language Rules for Global GNN Interpretability — Burouj Armgaan et al.
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders — David Chanin et al.
- Identifiability of Deep Polynomial Neural Networks — Konstantin Usevich et al.
- A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning — Yuzheng Hu et al.
- Proxy-SPEX: Sample-Efficient Interpretability via Sparse Feature Interactions in LLMs — Landon Butler et al.
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? — Denis Sutter et al.
- Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few — Qishuai Wen et al.
- Measuring and Guiding Monosemanticity — Ruben Härle et al.
- Improved Representation Steering for Language Models — Zhengxuan Wu et al.
- Head Pursuit: Probing Attention Specialization in Multimodal Transformers — Lorenzo Basile et al.
- How do Transformers Learn Implicit Reasoning? — Jiaran Ye et al.
- Jacobian-Based Interpretation of Nonlinear Neural Encoding Model — Xiaohui Gao et al.
- What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains — Chanakya Ekbote et al.
- Controlling Thinking Speed in Reasoning Models — Zhengkai Lin et al.
- Minimax-Optimal Univariate Function Selection in Sparse Additive Models: Rates, Adaptation, and the Estimation-Selection Gap — Shixiang Liu et al.
- The Structure of Relation Decoding Linear Operators in Large Language Models — Miranda Anna Christ et al.
- SHAP values via sparse Fourier representation — Ali Gorji et al.
- Self-Assembling Graph Perceptrons — Jialong Chen et al.
- DeepHalo: A Neural Choice Model with Controllable Context Effects — Shuhan Zhang et al.
- Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition — Jongseo Lee et al.
- ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation — Jiatong Shi et al.
- Differentiable Decision Tree via “ReLU+Argmin” Reformulation — Qiangqiang Mao et al.
- The Fragile Truth of Saliency: Improving LLM Input Attribution via Attention Bias Optimization — Yihua Zhang et al.
- Transferring Linear Features Across Language Models With Model Stitching — Alan Chen et al.
- Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? — Yihao Li et al.
- Emergence and Evolution of Interpretable Concepts in Diffusion Models — Berk Tinaz et al.
- A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning — Guan Zhe Hong et al.
- GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving — Shuai Liu et al.
- GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments — Enjun Du et al.
- Q-Insight: Understanding Image Quality via Visual Reinforcement Learning — Weiqi Li et al.
- Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws — Lin Guo et al.
- Vision Transformers Don’t Need Trained Registers — Nicholas Jiang et al.
- Provable Gradient Editing of Deep Neural Networks — Zhe Tao et al.
- Decomposing stimulus-specific sensory neural information via diffusion models — Steeve Laquitaine et al.