Publications
Research
Papers and research posts, newest to oldest. Based on Google Scholar with manual arXiv additions. Last updated 2nd June 2026.
Subliminal Learning Is Steering Vector Distillation
How do LLMs compute verbal confidence
Automatically Finding Reward Model Biases
Simple LLM Baselines are Competitive for Model Diffing
Fluid Representations in Reasoning Models
Building Production-Ready Probes For Gemini
Where Do Olmo's Values Come From?
Gemma Scope 2: Technical Paper
A pragmatic vision for interpretability
Eliciting secret knowledge from language models
Thought Anchors: Which LLM Reasoning Steps Matter?
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Line of sight: On linear representations in VLLMs
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
An Approach to Technical AGI Safety and Security
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Open problems in mechanistic interpretability
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
How Interpretability Researchers Can Help AGI Go Well
Base Models Know How to Reason, Thinking Models Learn When
Progress Update #2 from the GDM Mech Interp Team (Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research)
LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders
Understanding Reasoning in Thinking Language Models via Steering Vectors
Scaling Sparse Feature Circuits For Studying In-Context Learning
Improving Steering Vectors by Targeting Sparse Autoencoder Features
SAEs are Highly Dataset Dependent: a case study on the Refusal Direction
Open source replication of Anthropic's crosscoder paper for model diffing
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
SAEs (Usually) Transfer Between Base and Chat Models
Interpreting Attention Layer Outputs with Sparse Autoencoders
Improving Dictionary Learning with Gated Sparse Autoencoders
Activation Steering with SAEs
Base LLMs Refuse Too
SAE Features for Refusal and Sycophancy Steering Vectors
Self-Explaining SAE Features
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
Progress Update #1 from the GDM Mech Interp Team
Stealing Part of a Production Language Model
Attribution Patching Outperforms Automated Circuit Discovery
Towards Automated Circuit Discovery for Mechanistic Interpretability
My best guess at the important tricks for training 1L SAEs
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small