Research Engineer @ Google DeepMind

Arthur Conmy

I work on post-training to better align Gemini. Previously, I did influential early work on mechanistic interpretability.

Research

Papers and research posts, newest to oldest. Based on Google Scholar with manual arXiv additions. Last updated 2nd June 2026.

Subliminal Learning Is Steering Vector Distillation

C Blank, A Bhatia, S Rajamanoharan, A Conmy, N Nanda

arXiv preprint arXiv:2606.00995 (2026)

How do LLMs compute verbal confidence

D Kumaran, A Conmy, F Barbero, S Osindero, V Patraucean, P Velickovic

ICML 2026

Automatically Finding Reward Model Biases

A Wang, I Arcuschin, A Conmy

ICML 2026

Simple LLM Baselines are Competitive for Model Diffing

E Kempf, S Schrodi, T Brox, N Nanda, A Conmy

arXiv preprint arXiv:2602.10371 (2026)

Fluid Representations in Reasoning Models

D Kharlapenko, A Stolfo, A Conmy, M Sachan, Z Jin

arXiv preprint arXiv:2602.04843 (2026)

Building Production-Ready Probes For Gemini

J Kramár, J Engels, Z Wang, B Chughtai, R Shah, N Nanda, A Conmy

arXiv preprint arXiv:2601.11516 (2026)

Where Do Olmo's Values Come From?

X Sun, A Conmy, J Engels

ICLR 2026 Workshop DATA-FM

Gemma Scope 2: Technical Paper

C McDougall, A Conmy, J Kramár, T Lieberum, S Rajamanoharan, ...

Google DeepMind Blog (2025)

A pragmatic vision for interpretability

N Nanda, J Engels, A Conmy, S Rajamanoharan, B Chughtai, ...

Alignment Forum (2025)

Eliciting secret knowledge from language models

B Cywiński, E Ryd, R Wang, S Rajamanoharan, N Nanda, A Conmy, ...

arXiv preprint arXiv:2510.01070 (2025)

Thought Anchors: Which LLM Reasoning Steps Matter?

PC Bogdan, U Macar, N Nanda, A Conmy

arXiv preprint arXiv:2506.19143 (2025)

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

GDM

arXiv preprint arXiv:2507.06261 (2025)

Line of Sight: On Linear Representations in VLLMs

A Rajaram, S Schwettmann, J Andreas, A Conmy

arXiv preprint arXiv:2506.04706 (2025)

Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning

S Shabalin, A Panda, D Kharlapenko, AR Ali, Y Hao, A Conmy

arXiv preprint arXiv:2505.24360 (2025)

An Approach to Technical AGI Safety and Security

R Shah, A Irpan, AM Turner, A Wang, A Conmy, D Lindner, ...

arXiv preprint arXiv:2504.01849 (2025)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, ...

ICML 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

I Arcuschin, J Janiak, R Krzyzanowski, S Rajamanoharan, N Nanda, ...

ICML 2026

Open problems in mechanistic interpretability

L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...

TMLR (2025)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, ...

GitHub repository (2025)

How Interpretability Researchers Can Help AGI Go Well

N Nanda, J Engels, S Rajamanoharan, A Conmy, B Chughtai, ...

Alignment Forum (2025)

Base Models Know How to Reason, Thinking Models Learn When

C Venhoff, I Arcuschin, P Torr, A Conmy, N Nanda

ICML 2026 Spotlight

Progress Update #2 from the GDM Mech Interp Team (Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research)

L Smith, S Rajamanoharan, A Conmy, C McDougall, J Kramár, ...

alignmentforum.org/posts/4uXCAJNuPKtKBsi28 (2025)

LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders

D Zhou, K Patil, Y Sun, S Rajamanoharan, A Conmy

ICLR 2025 Workshop on Building Trust

Understanding Reasoning in Thinking Language Models via Steering Vectors

C Venhoff, I Arcuschin, P Torr, A Conmy, N Nanda

ICLR 2025 Workshop on Reasoning and Planning for LLMs

Scaling Sparse Feature Circuits For Studying In-Context Learning

D Kharlapenko, S Shabalin, F Barez, A Conmy, N Nanda

ICML 2025

Improving Steering Vectors by Targeting Sparse Autoencoder Features

S Chalnev, M Siu, A Conmy

arXiv preprint arXiv:2411.02193 (2024)

SAEs are Highly Dataset Dependent: a case study on the Refusal Direction

C Kissane, R Krzyzanowski, N Nanda, A Conmy

Alignment Forum (2024)

Open source replication of Anthropic's crosscoder paper for model diffing

C Kissane, R Krzyzanowski, A Conmy, N Nanda

LessWrong (2024)

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...

BlackboxNLP 2024 Oral

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...

arXiv preprint arXiv:2407.14435 (2024)

SAEs (Usually) Transfer Between Base and Chat Models

C Kissane, R Krzyzanowski, A Conmy, N Nanda

alignmentforum.org/posts/fmwk6qxrpW8d4jvbd (2024)

Interpreting Attention Layer Outputs with Sparse Autoencoders

C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda

ICML 2024 Mechanistic Interpretability Workshop Spotlight

Improving Dictionary Learning with Gated Sparse Autoencoders

S Rajamanoharan*, A Conmy*, L Smith, T Lieberum, V Varma, J Kramár, ...

NeurIPS 2024

Activation Steering with SAEs

A Conmy, N Nanda

alignmentforum.org/posts/C5KAZQib3bzzpeyrg#Activation_Steering_with_SAEs (2024)

Base LLMs Refuse Too

C Kissane, R Krzyzanowski, A Conmy, N Nanda

alignmentforum.org/posts/YWo2cKJgL7Lg8xWjj/base-llms-refuse-too (2024)

SAE Features for Refusal and Sycophancy Steering Vectors

S Shabalin, D Kharlapenko, A Conmy, N Nanda

Alignment Forum (2024)

Self-Explaining SAE Features

D Kharlapenko, S Shabalin, N Nanda, A Conmy

alignmentforum.org/posts/self-explaining-sae-features (2024)

Applying Sparse Autoencoders to Unlearn Knowledge in Language Models

E Farrell, YT Lau, A Conmy

Safe Generative AI Workshop at NeurIPS 2024

Progress Update #1 from the GDM Mech Interp Team

N Nanda, A Conmy, L Smith, S Rajamanoharan, T Lieberum, J Kramár, ...

alignmentforum.org/posts/C5KAZQib3bzzpeyrg (2024)

Stealing Part of a Production Language Model

N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ...

ICML 2024 Best Paper

Attribution Patching Outperforms Automated Circuit Discovery

A Syed, C Rager, A Conmy

BlackboxNLP 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability

A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso

NeurIPS 2023 Spotlight

My best guess at the important tricks for training 1L SAEs

A Conmy

LessWrong (2023)

Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads

CS McDougall*, A Conmy*, C Rushing*, T McGrath, N Nanda

BlackboxNLP 2024

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

R Gould, E Ong, G Ogden, A Conmy

ICLR 2024

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small

KR Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt

ICLR 2023

StyleGAN-induced Data-Driven Regularization for Inverse Problems

A Conmy, S Mukherjee, CB Schönlieb

IEEE ICASSP 2022

Professional Experience

My journey through AI research and engineering roles

2023–Present
Research Engineer
Google DeepMind, Language Model Interpretability team

2023
Researcher
SERI MATS & Independent Research

2022–2023
Researcher
Redwood Research

2021
Software Engineering Intern
Meta

2019–2022
Mathematics (Upper First Class)
Trinity College Cambridge

Mathematics Notes

Educational materials and lecture notes from Cambridge