NeurIPS 2026 Interpretability Papers
References: OpenReview NeurIPS
- GnnXemplar: Exemplars to Explanations - Natural Language Rules for Global GNN Interpretability — Burouj Armgaan et al.
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders — David Chanin et al.
- Identifiability of Deep Polynomial Neural Networks — Konstantin Usevich et al.
- A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning — Yuzheng Hu et al.
- Proxy-SPEX: Sample-Efficient Interpretability via Sparse Feature Interactions in LLMs — Landon Butler et al.
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? — Denis Sutter et al.
- Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few — Qishuai Wen et al.
- Measuring and Guiding Monosemanticity — Ruben Härle et al.
- Improved Representation Steering for Language Models — Zhengxuan Wu et al.
- Head Pursuit: Probing Attention Specialization in Multimodal Transformers — Lorenzo Basile et al.
- How do Transformers Learn Implicit Reasoning? — Jiaran Ye et al.
- Jacobian-Based Interpretation of Nonlinear Neural Encoding Model — Xiaohui Gao et al.
- What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains — Chanakya Ekbote et al.
- Controlling Thinking Speed in Reasoning Models — Zhengkai Lin et al.
- Minimax-Optimal Univariate Function Selection in Sparse Additive Models: Rates, Adaptation, and the Estimation-Selection Gap — Shixiang Liu et al.
- The Structure of Relation Decoding Linear Operators in Large Language Models — Miranda Anna Christ et al.
- SHAP values via sparse Fourier representation — Ali Gorji et al.
- Self-Assembling Graph Perceptrons — Jialong Chen et al.
- DeepHalo: A Neural Choice Model with Controllable Context Effects — Shuhan Zhang et al.
- Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition — Jongseo Lee et al.
- ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation — Jiatong Shi et al.
- Differentiable Decision Tree via “ReLU+Argmin” Reformulation — Qiangqiang Mao et al.
- The Fragile Truth of Saliency: Improving LLM Input Attribution via Attention Bias Optimization — Yihua Zhang et al.
- Transferring Linear Features Across Language Models With Model Stitching — Alan Chen et al.
- Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? — Yihao Li et al.
- Emergence and Evolution of Interpretable Concepts in Diffusion Models — Berk Tinaz et al.
- A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning — Guan Zhe Hong et al.
- GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving — Shuai Liu et al.
- GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments — Enjun Du et al.
- Q-Insight: Understanding Image Quality via Visual Reinforcement Learning — Weiqi Li et al.
- Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws — Lin Guo et al.
- Vision Transformers Don’t Need Trained Registers — Nicholas Jiang et al.
- Provable Gradient Editing of Deep Neural Networks — Zhe Tao et al.
- Decomposing stimulus-specific sensory neural information via diffusion models — Steeve Laquitaine et al.