| A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods | 2025 |
| AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | 2025 |
| Why Do LLMs Attend to the First Token? | 2025 |
| KBLaM: Knowledge Base Augmented Language Model | 2025 |
| Investigating the Effectiveness of a Socratic Chain-of-Thoughts (SocraCoT) Reasoning Method for Task Planning in Robotics, A Case Study | 2025 |
| Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks | 2025 |
| SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability | 2025 |
| Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | 2025 |
| LeanAgent: Lifelong Learning for Formal Theorem Proving | 2025 |
| Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs | 2025 |
| Generative Agent-Based Modeling with Actions Grounded in Physical, Social, or Digital Space Using Concordia | 2023 |
| Transformers as Soft Reasoners over Language | 2020 |
| Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks | 2016 |
| Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification | 2024 |
| CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text | 2019 |
| Adaptable Logical Control for Large Language Models | 2024 |
| A Foundation Model for Zero-shot Logical Query Reasoning | 2024 |
| Hybrid Declarative-Imperative Representations for Hybrid Discrete-Continuous Decision-Making | 2024 |
| Enhancing Automated Interpretability with Output-Centric Feature Descriptions | 2025 |
| Evaluating the Ripple Effects of Knowledge Editing in Language Models | 2025 |
| Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | 2024 |
| SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | 2024 |
| Visualizing and Understanding Recurrent Networks | 2016 |
| Building Machines That Learn and Think Like People | 2016 |
| UKnow: A Unified Knowledge Protocol with Multimodal Knowledge Graph Datasets for Reasoning and Vision-Language Pre-Training | 2024 |
| AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents | 2024 |
| Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? | 2022 |
| Understanding Epistemic Language with a Language-augmented Bayesian Theory of Mind | 2024 |
| The Effect of ChatGPT on Students’ Learning Performance, Learning Perception, and Higher-Order Thinking: Insights from a Meta-Analysis | 2025 |
| Identifying and Mitigating Social Bias Knowledge in Language Models | 2025 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | 2025 |
| ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features | 2025 |
| Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems | 2025 |
| Assessing and Alleviating State Anxiety in Large Language Models | 2025 |
| Language Models Use Lookbacks to Track Beliefs | 2025 |
| Theories of Mind as Languages of Thought for Thought about Thought | 2025 |
| An Affective-Taxis Hypothesis for Alignment and Interpretability | 2025 |
| Why LLMs Hallucinate, and How to Get Evidential Closure | 2023 |
| Do Large Language Models Know What They Don’t Know? | 2023 |
| Auditing Language Models for Hidden Objectives | 2025 |
| Distinguishing the Knowable from the Unknowable with Language Models | 2024 |
| Retrieval Head Mechanistically Explains Long-Context Factuality | 2025 |
| ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability | 2025 |
| Mechanistic Permutability: Match Features Across Layers | 2025 |
| Transcoders Find Interpretable LLM Feature Circuits | 2024 |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | 2025 |
| Confidence Regulation Neurons in Language Models | 2024 |
| Experts Don’t Cheat: Learning What You Don’t Know by Predicting Pairs | 2024 |
| Internal Consistency and Self-Feedback in Large Language Models: A Survey | 2024 |
| The Internal State of an LLM Knows When It’s Lying | 2023 |
| Self-Training Large Language Models through Knowledge Detection | 2024 |
| Knowing What LLMs Do Not Know: A Simple Yet Effective Self-Detection Method | 2024 |
| Program-Aided Reasoners Know What They Know | 2024 |
| A Survey on Verification and Validation | 2024 |
| Making Retrieval-Augmented Language Models Robust to Irrelevant Context | 2024 |
| Combating Misinformation in the Age of LLMs | 2024 |
| DRAGIN: Dynamic Retrieval Augmented Generation Based on the Information Needs of LLMs | 2024 |
| RARR: Researching and Revising What Language Models Say, Using Language Models | 2023 |
| A Truth Maintenance System | 1979 |
| Jokes and the Logic of the Cognitive Unconscious | 1980 |
| SAEs Are Good for Steering - If You Select the Right Features | 2025 |
| Improved Representation Steering for Language Models | 2025 |