Arthur Conmy | Anthropic

New role

Member of Technical Staff, Anthropic.

I will work on aligning upcoming models as they are trained.

What I mean

Triaging signs of misalignment in training, then looking for root-cause fixes rather than whack-a-mole patches.

For a public example of the direction, see Anthropic's "Teaching Claude Why".

Mentorship

I have been a MATS mentor since MATS 6.0. MATS is the main way I mentor people; if you want to work with me, apply through MATS.

MATS usually runs winter and summer programs, so applications may not always be open.

Background

Past work.

2023-2026

Senior Research Engineer, Google DeepMind

Worked on post-training for Gemini and on interpretability tools close to production model work: probes, reward-model bias discovery, reasoning behavior, Gemma Scope, sparse autoencoders, model diffing, and steering.

Production Probes, Gemma Scope 1, Gemma Scope 2, Gated SAEs, Pragmatic Vision

Since MATS 6.0

MATS papers and mentoring

Apply Through MATS, Explaining Subliminal Learning, Reward-Model Biases, CoT Faithfulness, Base Models Reason, Thought Anchors

2022-2023

Early mechanistic interpretability

Worked on circuits and automated circuit discovery, including IOI in GPT-2 Small and ACDC, before the later wave of large-scale SAE work.

IOI Circuit, ACDC, Copy Suppression, Successor Heads

Earlier

Redwood, Meta, Cambridge

Redwood Research in 2022-2023; Meta software engineering internship in 2021; mathematics at Trinity College, Cambridge, upper first class honours.