A Microscope into the Dark Matter of Interpretability
We propose a new interpretability architecture that learns a hierarchy of features in LLMs. It matches or outperforms sparse autoencoders (SAEs) on standard evaluations, reduces reconstruction error, and naturally addresses feature splitting.