A Microscope into the Dark Matter of Interpretability

We propose a new interpretability architecture that learns a hierarchy of features in LLMs.

Too Long; Didn't Read

We propose the RQAE (Residual Quantization Autoencoder), a new architecture to interpret LLM representations which learns features hierarchically. You can control the specificity of a feature as you move through the hierarchy, and can naturally split features into more specific subfeatures. In evaluations, RQAE has better reconstruction than SAEs, and we find that RQAE features are more interpretable than Gemmascope SAE features using Eleuther AI's evaluation suite.

A single layer of RQAE projects the activation of a LLM into a learned (very small) subspace, quantizes the subspace, and projects back to the original dimension. Then, the residual between the original activation and this quantized representation is the input to the next layer. Each layer learns a new subspace, which depends on the previous residual, so later layers are dependent on earlier layers. This induces a hierarchy.

A recent paper finds that the residuals of SAEs (i.e. original activation - SAE reconstruction) contain extra features that even the largest SAEs do not find. RQAE reduces these pathological errors iteratively. Since we can show that vector quantization is a specific type of SAE, RQAE can be equivalently defined as stacking many (weak) SAEs on top of each other.

This post presents the motivation, architecture, and experiments justifying RQAE. We have also released the code, model weights, and a feature dashboard.

Introduction

The purpose of interpretability is to decompose a model into human-understandable features. This focus has led to some incredibly interesting work, such as visualizing what parts of an image a model uses to predict a class and seeing how RL agents identify friends and enemies in a game.

However, Large Language Models (LLMs) may be more complicated to interpret. Text is a much denser interpretable medium than visuals, and LLMs are orders of magnitude larger than any other model we’ve ever trained. This work is heavily inspired by transformer-circuits, which has laid out a foundation for how we can begin to interpret LLMs. If you’re new to interpretability, we would recommend reading these works first, but we’ll provide a quick rundown of the core concepts in the sections below.

Other work has explored how different layers of the LLM contribute to different functions (for example, the feedforward layers store facts and “knowledge”). However, the scope of this work is limited to considering the residual stream after the middle layer of an LLM (mostly, with the Gemma 2 2B LLM). We hope to extend RQAE to other layers and other LLMs, as Gemmascope has done with SAEs, in future work.

World Models

What is a human-interpretable feature, and why would such features exist in LLMs? We know that LLMs store “world models” because they are really good at predicting the next token, and compressing the training data to the extent that they do requires some latent understanding of the world. Other work explores this concept in detail, but roughly:

There are a large number of things that can happen in the world
We observe what happens, and we reason about what happens with a much smaller set of “latent” features.

For example, if we see someone drop a glass bottle, then we expect the bottle to shatter when it hits the ground. This is not because we have observed exactly that person dropping exactly that bottle before. It’s because we have learned a very small set of latent features about physics, which we can use to model the bottle breaking.

LLMs must do something similar - they just don’t have the capacity to fully memorize their training data! Thus, they must also be working with some set of latent features that they can manipulate to solve the next token prediction task. A lot of interpretability research is focused on finding these latent features, and figuring out how the model uses them.

Linear Representations and Superposition

How do LLMs organize and use these latent features? There’s a lot of empirical evidence that LLMs describe features as simply directions in space, known as the Linear Representation Hypothesis (LRH). If true, it’s a very powerful framework - directions are just vectors, and we have a lot of theory to work with vectors.

Then, a full LLM activation is the sum of multiple atomic feature vectors. For example, we might think of a dog as the ‘animal’ feature plus the ‘pet’ feature, minus the ‘feline’ feature. Features also scale differently depending on the subject: a dog will have more of the ‘smart’ feature than a goldfish.

We borrow a formal definition of the LRH from here:

Definition 1: A linear representation has the following two properties:

Composition as Addition: The presence of a feature is represented by adding that feature.

Intensity as Scaling: The intensity of a feature is represented by magnitude.

The Linear Representation Hypothesis states that an LLM linearly represents human-interpretable features.

There is one problem with the LRH: the residual stream of an LLM with width $d$ lies in the space $\mathbb{R}^d$. However, there can be at most $d$ orthogonal vectors in this space (the basis of the space), which means that there can be at most $d$ unique “feature directions”.

In order for the LRH to be true, models must learn features in superposition. This means that features interfere with each other, and you can not fully separate all features (e.g. any token with feature X must also have at least a little bit of feature Y, even if they’re two distinct features). However, you can also fit exponentially more features in a $d$-dimensional space which are only $\epsilon$-orthogonal to each other!
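To make the $\epsilon$-orthogonality claim concrete, here is a minimal NumPy sketch. It is only an illustration: the sizes `d` and `n`, the seed, and the use of random Gaussian directions are assumptions for the demo, not anything taken from an LLM or from RQAE. It draws four times more unit vectors than there are dimensions and measures the largest pairwise cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024  # illustrative sizes: 4x more directions than dimensions

# Random unit vectors in R^d (stand-ins for "feature directions")
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Pairwise cosine similarities, ignoring each vector's similarity with itself
cos = vecs @ vecs.T
np.fill_diagonal(cos, 0.0)
max_abs_cos = float(np.abs(cos).max())

print(f"{n} directions in {d} dims, max |cos| = {max_abs_cos:.3f}")
```

Even though only `d` of these vectors can be exactly orthogonal, every pair stays far from parallel, which is the sense in which many more than $d$ features can coexist with small interference.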

Similar features take similar directions, so interference is usually okay. Thus, when measuring by cosine similarity, we see clusters forming which correspond to similar features. For example, the “has fur” feature should be more aligned with the “cat” feature than with the “shark” feature - although in this example, interference might make it hard for the model to learn what a hairless cat is.

Sparse Autoencoders

Sparse Autoencoders (SAEs) model the LRH by learning an overcomplete basis to take features out of superposition. SAEs take inspiration from sparse coding theory, which suggests that learning a dictionary of features that only fire sparsely (i.e. a small subset of features that reconstruct each activation faithfully) also results in that dictionary being human-interpretable.

A figure taken from Toy Models of Superposition. In this case, the observed model is our LLM, and the disentangled model is what we are trying to learn with a SAE.

Thinking back to World Models, this actually makes a lot of sense! We can think of the “larger” model as the set of disentangled latent features that model the real world. The larger model only activates sparsely - only some neurons are activated for any given input. Sparse autoencoders try to reconstruct this larger model, in order to represent features out of superposition.

Definition 2: A sparse autoencoder $S$ of size $n$ is a model that takes in a LLM activation $r \in \mathbb{R}^d$ and does the following: $$ C(r) = \sigma(W_{in}r + b_{in}) $$ $$ S(r) = W_{out}C(r) +b_{out} $$ where $\sigma$ is some nonlinearity (usually ReLU), and $W_{out} \in \mathbb{R}^{d \times n}, W_{in} \in \mathbb{R}^{n \times d}$ (for $n \gg d$). $b_{in} \in \mathbb{R}^n, b_{out} \in \mathbb{R}^d$ are bias terms. The model is trained to minimize the reconstruction error $||r - S(r)||_2$. An additional loss term is added to induce sparsity in $C(r)$ - common methods to induce sparsity include TopK, L1 loss, or learning a threshold per feature .

To explicitly draw the connection to the LRH, a SAE performs the following algorithm (ignoring biases):

Begin with some entangled LLM activation $r$, and consider that $W_{in}$ consists of $n$ “encoding” vectors.
Compare $r$ against each encoding vector by calculating their dot product.
Apply a nonlinearity (usually ReLU) and some sparsity function (TopK, L1, etc.), to act as a filter for sparsity.
There are now $n$ coefficients $C(r) = [c_1, c_2, …]$, which are the intensities of each feature in $r$.
Multiply $C$ by the columns of $W_{out}$, which are the actual feature vectors (“decoding” vectors) that you have learned.
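The steps above can be sketched in a few lines of NumPy. This is a hedged illustration of Definition 2 with a TopK sparsity filter, not the Gemmascope implementation: the function name `sae_forward`, the toy sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def sae_forward(r, W_in, b_in, W_out, b_out, k):
    """Return (C(r), S(r)) for a ReLU SAE with a TopK sparsity filter."""
    pre = W_in @ r + b_in              # dot product of r with each encoding vector
    acts = np.maximum(pre, 0.0)        # ReLU nonlinearity
    coeffs = np.zeros_like(acts)
    top = np.argsort(acts)[-k:]        # indices of the k largest activations
    coeffs[top] = acts[top]            # keep them, zero out everything else
    recon = W_out @ coeffs + b_out     # weighted sum of decoding vectors (columns of W_out)
    return coeffs, recon

# Toy sizes and random weights (hypothetical, for illustration only)
rng = np.random.default_rng(0)
d, n, k = 16, 64, 4
W_out = rng.standard_normal((d, n))
W_out /= np.linalg.norm(W_out, axis=0, keepdims=True)  # unit-norm decoding vectors
W_in = W_out.T.copy()                                   # encoder starts as the decoder transpose
b_in, b_out = np.zeros(n), np.zeros(d)

r = rng.standard_normal(d)
coeffs, recon = sae_forward(r, W_in, b_in, W_out, b_out, k)
print("active features:", np.flatnonzero(coeffs))
```

With untrained weights the reconstruction is poor; the point is only to show where the encoding vectors, the intensities $C(r)$, and the decoding vectors enter.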

NOTE: Notice that each (encoder, decoder) pair of vectors are closely linked - the cosine similarity ($\propto$ dot product) that a representation has to an encoding vector directly defines the intensity of the feature. At the beginning of training, the encoder is initialized as the transpose of the decoder so that encoding vectors = features, and empirically the trained model has encoding and decoding vector pairs with high cosine similarity.

However, an SAE can only learn a set number $n$ of features, and it’s likely that $n \ll N$ for the true number of features $N$ the LLM has learned. To give an example, consider two “ground-truth” features $f_1$ and $f_2$ - these features are similar but not exactly the same (e.g. different breeds of dogs). How will the SAE learn these features?

The Feature Hierarchy

The rough answer to the question posed above is: the model will learn an average direction that will fire weakly for these features - for example, a “dog” feature. It might also learn a few features that weakly activate for related, more general features - for example, a “living” feature, or a “pet” feature. Empirically, SAEs only learn general features, and more specific features are indistinguishable from one another.

There are two widely observed issues with SAEs that stem from this intuition:

Feature splitting. If you train a wider SAE, you will notice that more general features split into smaller, more specific features.
Feature absorption. You learn two separate features in your SAE that are describing the same ground-truth feature - as a result, representations with that ground-truth feature are split across the two learned features without any discernible pattern (the difference between the two features is completely spurious).

These issues arise because SAE features are not necessarily atomic. They might need to be broken down or grouped together, but it’s difficult to tell which is which, and the training objective doesn’t bias the model one way or the other. This raises the question: how should we organize features?

A paper from UChicago attempts to answer this question. They split up features in two dimensions: hierarchical features, such as organism -> (plant -> (tree, bush, etc.), animal -> (bird, reptile, fish, etc.)), and categorical features, such as (dog, cat, hamster, etc.). Hierarchical features are organized orthogonally to each other, while categorical features organize themselves into polytopes.

SAEs treat features as if they are all categorical. They can learn hierarchies of features (for example, separate features for organism, plant, animal, etc.) - but there is nothing in the architecture that encourages them to learn such features. Our best guess is that they learn hierarchy based on the frequency of the concepts in the training data alone, based on other work that finds SAEs learn more granular features when trained on specialized datasets. However, this is still an open question, and we don’t propose a clear answer here.

Adding Inductive Bias

As mentioned above, SAEs do not have an inductive bias for how features should relate to one another. At the end of training, you are left with a flat dictionary of features, and it is up to you to interpret and organize them.

In contrast, this work introduces the inductive bias that features must be composed hierarchically. It’s not clear whether hierarchy is sufficient or necessary for all features a LLM represents, but feature hierarchies do seem like the natural answer to solving the problem of feature splitting/absorption.

We define a feature hierarchy to consist of parent and child features, such that any time a feature is activated, its parent will also be activated, and zero or one of its children will be activated. This addresses the two issues presented in the previous section:

If a feature splits, then we define the base feature as higher in the hierarchy, and the split features as lower in the hierarchy.
If two features should be absorbed, then only consider their common parent feature as the atomic feature.

This isn’t the only way to define a feature hierarchy (and probably not the best way) - for example, you can define a many-to-many relationship between parents and children - but it is simple and, as shown in the next section, easy to implement.

RQAE

In this section, we present the Residual Quantization Autoencoder (RQAE).

Figure 1a. How both SAE and RQAE decompose the activation of a LLM into interpretable parts.

Since RQAE tackles the same problem as SAEs (decomposing an activation into interpretable parts), it should be no surprise that the end result also looks similar. Fig 1a shows how both RQAE and SAE split up an activation from an LLM. We even use the same training loss (MSE)!

Of course, not all feature decompositions are equal. SAE relies on the bottleneck of sparsity to control the type of features that are learned. RQAE uses two bottlenecks: projecting into subspaces, and quantizing the subspaces into a relatively small number of codebooks (i.e. only a few directions in the subspace can be chosen).

Figure 1b. An overview of how RQAE decomposes a LLM activation from Fig. 1a.

RQAE uses residual vector quantization (RVQ) to autoencode the LLM representation - but we also add a linear in/out layer between each quantization step, which allows the model to iteratively choose different subspaces of the representation space to quantize. To implement RVQ, we use a variant of FSQ that uses hyperspheres instead of hypercubes to define codebooks. FSQ encourages all codebooks to be utilized fully, which is especially important when you have many layers.

Here is the equivalent algorithm (to the SAE algorithm above) for RQAE:

Begin with some LLM activation $r$. For $n_q$ layers, repeat the following:

Project $r$ into a small subspace with a linear layer, called projection $p$.
Normalize $p$ to get $p_{norm}$
Find the closest codebook to $p_{norm}$ from a set of codebooks using Euclidean distance. Call this codebook $\hat p$, with index $c_i$ for layer $i$.
Project $\hat p$ back into the activation space with a linear layer, called $\hat r$.
Set $r = r - \hat r$ as the residual.

The original activation $r$ has now been quantized with $n_q$ codebooks, and can be represented as a set of codebook indices $(c_1, c_2, …, c_{n_q})$
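The loop above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released code: biases are omitted, a single set of unit-norm codebook vectors is shared by every layer, and the name `rqae_forward`, the toy sizes, and the random (untrained) weights are all hypothetical.

```python
import numpy as np

def rqae_forward(r, W_in, W_out, codebooks):
    """Quantize activation r with n_q residual layers.

    W_in:      (n_q, sub_dim, d) projections into each layer's subspace
    W_out:     (n_q, d, sub_dim) projections back to the activation space
    codebooks: (n_codes, sub_dim) unit-norm codebook vectors (shared across layers here)
    Returns the chosen indices, the summed reconstruction, and the final residual.
    """
    residual = r.copy()
    recon = np.zeros_like(r)
    indices = []
    for w_in, w_out in zip(W_in, W_out):
        p = w_in @ residual                                  # project into the small subspace
        p_norm = p / (np.linalg.norm(p) + 1e-8)              # normalize the projection
        dists = np.linalg.norm(codebooks - p_norm, axis=1)   # Euclidean distance to each codebook
        c = int(np.argmin(dists))                            # index of the closest codebook
        r_hat = w_out @ codebooks[c]                         # project back to activation space
        recon += r_hat
        residual = residual - r_hat                          # next layer sees what is left over
        indices.append(c)
    return indices, recon, residual

# Toy sizes and random weights (hypothetical, for illustration only)
rng = np.random.default_rng(0)
d, sub_dim, n_q, n_codes = 32, 4, 8, 16
W_in = rng.standard_normal((n_q, sub_dim, d)) / np.sqrt(d)
W_out = rng.standard_normal((n_q, d, sub_dim)) / np.sqrt(d)
codebooks = rng.standard_normal((n_codes, sub_dim))
codebooks /= np.linalg.norm(codebooks, axis=1, keepdims=True)

r = rng.standard_normal(d)
indices, recon, residual = rqae_forward(r, W_in, W_out, codebooks)
print("codebook indices:", indices)
```

Because the weights here are untrained, the residual does not shrink; the sketch only shows the data flow, and that the reconstruction plus the final residual always adds back up to the original activation.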

Similar to how a SAE models the LRH, RQAE models the “feature hierarchy” mentioned above. For any given LLM activation, every layer chooses exactly one codebook. The subspaces that later layers choose are dependent on earlier layers (since the input to any given layer is original activation - sum(quantizations from previous layers)), encouraging a learned hierarchy between layers. Thus, a RQAE model with $n$ layers and $c$ codebooks per layer learns $c \times n$ unique values, but can represent $c^n$ different activations.
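As a worked instance of this count, take the values used later in the training section ($n = n_q = 1024$ layers and $c = 544$ codebooks per layer). The exponent in the second line is rounded, so read it as an order-of-magnitude estimate: $$ c \times n = 544 \times 1024 = 557{,}056 $$ $$ c^n = 544^{1024} = 10^{1024 \cdot \log_{10} 544} \approx 10^{1024 \times 2.7356} \approx 10^{2801} $$ So roughly half a million learned codebook vectors index a space of about $10^{2801}$ possible index sequences.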

RQAE = Stacked SAEs

We draw a connection between the layers of a RQAE and a SAE.

Conceptually, a single layer of RQAE consists of the same parts as the SAE algorithm mentioned above: an encoder (linear in layer), intensity coefficients (codebook values), and a decoder that corresponds to features (linear out layer). Formally:

Lemma 1: A single layer of RQAE can be equivalently defined as a Top1 SAE.
Proof: Still a WIP. Will follow this work closely

NOTE: This lemma does not make any claims about the training dynamics or quality of features learned by RQAE compared to an equivalent SAE. In fact, design decisions such as FSQ heavily restrict the type of features that RQAE can learn. The lemma only exists to illustrate the relationship between RQAE and SAE.

In practice, we use a very small codebook dimension (in the experiments section, $4$). This is needed for vector quantization (especially FSQ, whose codebook size grows exponentially with respect to codebook dimension). At first glance, this seems concerning: a SAE can find features in a feature manifold that lives in higher-dimensional subspaces, but a RQAE layer can not!

This is true theoretically. However, in practice there is a large body of work that suggests LLM features and feature manifolds are mostly represented in very low dimensional subspaces (i.e. they are rank deficient). It’s likely that RQAE won’t miss many features or manifolds because it projects on a lower dimension - we think that it still can learn those features across RQAE layers, but we leave that to future work.

Why quantization?

A natural question to ask is why quantization should be used in the first place. Especially if quantization = weak SAE, why not just use a better SAE? We don’t think quantization is necessary, and would be interested in future work that replicates RQAE using SAEs instead of quantization!

However, quantization turns out to be very useful:

Quantization (= Top1 SAE) allows only a 1:1 parent:child relationship. It’s not clear how to define a hierarchy with a TopK SAE where $K > 1$ efficiently. One option could be to only consider the highest $N$ activating sets of features for any input activation.
RVQ is a very well-studied field. As a result, there are many techniques already developed to help quantization. For example, it seemed to take the community several months to handle “dead features” in SAEs. RVQ has a handful of techniques that are empirically investigated across multiple domains to solve the problem of uniform codebook usage. Also, this problem becomes exponentially harder with more layers, so we’re not sure how SAEs will properly address it.
Quantization models are more than an order of magnitude smaller than SAEs. RQAE, which provides much better reconstruction than even the largest Gemmascope SAE with the highest L0, is also smaller than the smallest available Gemmascope SAE.

Modeling Dark Matter

Previous work shows that SAEs learn pathological errors - that is, the residual between the SAE reconstruction and the original activation is uniquely important for reconstruction (measured by cross entropy loss difference), compared to completely random residuals the same distance away. A following work goes further and shows that this pathological error includes features that SAEs miss - a concept that Chris Olah has coined as the “dark matter” of interpretability.

Interestingly, even the largest SAEs that were evaluated (1 million features) still miss a significant portion of total (linear) features found in the LLM (Figure 10a), and the error decays logarithmically with respect to the number of features. The core takeaway is that SAEs are grossly inefficient at uncovering this “dark matter”.

How would one fix this issue with SAEs alone? If you assume that sparse coding is enough of an inductive bias to find interpretable features at very high SAE sizes, then the real issue begins to look like a capacity problem. Specifically, a model with high enough capacity will also reduce reconstruction error to be very low, so that features don't exist anymore in its residual error. The capacity of existing SAEs (even with 1M features, which is already 4.6GB of model weights on a Gemma 2 2B LLM) does not seem to be enough.

This work provides a great motivation for RQAE! A single layer of RQAE is equivalent to a (weak) SAE, as shown above. The next layer of RQAE then learns on the residuals of the first layer, which we assume to have “missing features”, since reconstruction is not perfect. Applying this iteratively allows us to learn more “missing” features after each layer. This mimics the exact experiment run in the paper, where they train an SAE on the residuals of a Gemmascope SAE, and find that it uncovers more of the dark matter (linear error) that the Gemmascope SAE missed.

In the next section we show that RQAE empirically also reduces reconstruction error (i.e. higher capacity) compared to even very large SAEs, as expected.

Training a Model

We train a RQAE with $n_q=1024$ layers, and use a codebook dimension of $d=4$ with quantized values of $[-1, -0.5, 0, 0.5, 1]$ per dimension - resulting in $544$ codebooks per layer (see FSQ for more details). We use the following hyperparameters:
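One reading that reproduces the stated count of $544$: take the $5^4 = 625$ grid points, drop the origin, and count distinct directions after normalizing onto the hypersphere (points such as $(1, 0, 0, 0)$ and $(0.5, 0, 0, 0)$ collapse to the same direction). Whether the released code builds its codebooks exactly this way is an assumption; the sketch below only checks that the arithmetic is consistent.

```python
from itertools import product
from math import gcd

# The levels [-1, -0.5, 0, 0.5, 1], scaled by 2 so we can work in exact integers
levels = [-2, -1, 0, 1, 2]
dim = 4

directions = set()
for point in product(levels, repeat=dim):
    if not any(point):
        continue  # the origin has no direction, so it cannot be normalized
    g = 0
    for x in point:
        g = gcd(g, abs(x))
    # Dividing by the gcd maps all positive multiples of a vector to one representative
    directions.add(tuple(x // g for x in point))

num_grid_points = len(levels) ** dim
num_directions = len(directions)
print(num_grid_points, num_directions)
```

Under this reading, the hypersphere variant has fewer codebooks than the $625$ points of the hypercube grid because collinear grid points merge.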

Parameter	Value	Description
LLM	Gemma 2 2B	Base model to interpret
Layer	Residual Stream after the center layer (12)	What layer of the residual stream to train on
Training Data	FineWeb-Edu	Dataset used to train RQAE
Test Data	pile-uncopyrighted (Neuronpedia subset)	Dataset used for analysis
num_tokens	1B	How many tokens we train on
context_length	128	Context length we train on

We normalize activations passed into RQAE by using the final RMS norm layer of the LLM (note that this means all activations have exactly the same norm). Ablations are still a WIP, but will be added soon.

Let’s look at properties of the learned model, to validate assumptions we have based on the model architecture.

Figure 2a. (left) Distribution of pairwise cosine similarities between all learned codebooks (projected out to model dimension). (right) L2 norm of projected codebooks across RQAE layers.

Figure 2a shows the distribution of pairwise cosine similarities between all projected codebooks. Almost all projected codebooks are $\epsilon$-orthogonal with $\epsilon=0.2$, suggesting that they form an overcomplete basis. It also shows the L2 norm of features across RQAE layers - earlier layers have larger norms than later layers, suggesting that earlier layers choose more “confident” directions, and the majority of MSE loss is concentrated in the first few layers.

Figure 2b. (top) MSE Loss and Cross Entropy Loss Difference for RQAE trained on Gemma 9B and 2B models. Both RQAE models are trained with num_quantizers=1024. (bottom) Reconstruction loss of RQAE vs SAE with varying width and L0 for Gemma 2 2B. Note that the same RQAE model is used (num_quantizers=1024), and reconstruction is stopped early, rather than different RQAE models trained with different num_quantizers.

Figure 2b shows that RQAE has higher capacity (as discussed in the previous section), by reducing reconstruction loss. This is with an order of magnitude fewer parameters than an SAE (e.g. a 1M feature SAE takes up 4.6GB of space, while RQAE takes 100MB).

RQAE doesn’t have an equivalent concept of $L_0$, since all layers are used for every input. However, it still seems like an unfair comparison, since we are still using $1024$ unique vectors to reconstruct an activation with CE loss difference $< 0.01$.

We argue for this comparison for two reasons:

If reconstruction loss is low, then we know that we have certainly captured more of the features present in the activation (i.e. less dark matter). Higher $L_0$ may mean that the features become less interpretable in SAEs, but:
As mentioned at the beginning of this section, the RQAE inductive bias during training does not come from sparsity. Thus, we shouldn’t expect features to be learned by sparsity, and we shouldn’t expect higher sparsity to correspond to more interpretable features.

Thus, RQAE does reduce dark matter (as even an SAE with very high $L_0$ would), but still learns interpretable features (unlike an SAE with high $L_0$). To prove the latter part of that claim, let’s first define what a feature in RQAE is.

Defining a Feature

Features in SAEs are simply defined as the columns of the decoder matrix. When a feature is present in an activation, it also has an associated intensity, measured by ($\propto$) the cosine similarity of the activation to the corresponding row in the encoder matrix. We will use this to motivate the definition of a RQAE feature. Referring back to the algorithm at the beginning of this section:

Definition 3: Let $$C_i = \{\text{codebook indices for layer }i\} = \{1,2,...\}$$ $$\mathbb{C}_i = \{\text{codebook values for layer }i\} = \{c: c \in \mathbb{R}^{codebook\_dim}\}$$ where you can index $\mathbb{C}_i$ directly with elements of $C_i$ as $\mathbb{C}_i(c)$.
Then, a RQAE feature is defined as a set of (up to) $n_{q}$ codebook indices (one per layer): $$f = [c_1, c_2, \dots, c_k]\text{ where } k \leq n_{q}, c_i \in C_i$$

Definition 4: Consider a token's activation $t$, that has been quantized by RQAE (see algorithm): $$t = [t_1, t_2, t_3, \dots, t_{n_q}]\text{ where }t_i \in C_i$$ Then, the intensity of a feature $f$ in $t$ is defined as: $$C(f, t) = \frac{\sum_{i \leq k} cos\_sim(\mathbb{C}_i(t_i), \mathbb{C}_i(c_i)) \cdot L_i}{\sum L_i}$$ where $L_i$ is the average $L_2$ norm of the columns in the decoder of layer $i$ in RQAE.
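Definition 4 can be sketched directly in NumPy. Two assumptions are made here and are not confirmed by the text: the denominator sums $L_i$ over the same first $k$ layers as the numerator (so that a token whose first $k$ indices match the feature gets intensity $1$), and the function name, array layout, and toy codebooks are hypothetical.

```python
import numpy as np

def feature_intensity(feature, token, codebooks, layer_norms):
    """Intensity C(f, t) of a feature in a quantized token.

    feature:     list of k codebook indices [c_1, ..., c_k]
    token:       list of n_q codebook indices [t_1, ..., t_{n_q}], with n_q >= k
    codebooks:   array (n_q, n_codes, sub_dim) of codebook values per layer
    layer_norms: array (n_q,) holding L_i for each layer
    """
    k = len(feature)
    total = 0.0
    for i in range(k):
        a = codebooks[i][token[i]]     # codebook value the token chose at layer i
        b = codebooks[i][feature[i]]   # codebook value the feature specifies at layer i
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        total += cos * layer_norms[i]  # weight each layer by its decoder norm L_i
    # Assumption: normalize by the norms of the k layers actually used
    return total / float(np.sum(layer_norms[:k]))

# Toy example: 3 layers sharing four 2-D codebook values
base = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
codebooks = np.tile(base, (3, 1, 1))
layer_norms = np.array([3.0, 1.0, 0.5])

token = [0, 1, 2]
same_prefix = feature_intensity([0, 1], token, codebooks, layer_norms)
opposite_first = feature_intensity([2, 1], token, codebooks, layer_norms)
print(same_prefix, opposite_first)
```

In the toy example, a feature that disagrees with the token at the first layer is penalized more heavily than one that disagrees at a later layer, because earlier layers carry larger $L_i$.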

Similar to SAEs, we measure cosine similarity of the activation to a set of features (only in subspaces) to measure intensity. Since we use FSQ to create codebooks, we know that they are evenly spaced out.

Figure 2c. Visual of how features are defined in RQAE.

It’s important that RQAE uses most of the codebooks at any given layer across a large dataset. Otherwise, the model is not using different codebooks to represent different tokens, and thus measuring cosine similarity is useless. Fig. 3a shows that even by layer $4$, all tokens in a reasonably large dataset can be almost completely partitioned into unique sets of codebooks.

# Layers	1	2	3	4
# Used Codebooks	544	234,994	3,279,894	4,383,319
Average # Tokens per Codebook	8606.1	19.92	1.43	1.07

Figure 3a. Codebook usage for the first four layers, across a dataset of $4M$ tokens.
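Statistics like those in the table can be computed by counting distinct prefixes of codebook indices. The sketch below assumes that "# Used Codebooks" at depth $k$ means the number of distinct length-$k$ index prefixes seen in the dataset; the helper name `prefix_usage` and the toy data are hypothetical.

```python
from collections import Counter

def prefix_usage(quantized_tokens, depth):
    """Return (number of distinct prefixes, mean tokens per prefix) at a given depth."""
    counts = Counter(tuple(t[:depth]) for t in quantized_tokens)
    return len(counts), len(quantized_tokens) / len(counts)

# Toy dataset: five tokens, each quantized into three codebook indices
toy_tokens = [(0, 1, 2), (0, 1, 3), (0, 2, 2), (1, 1, 2), (0, 1, 2)]

usage_by_depth = {depth: prefix_usage(toy_tokens, depth) for depth in (1, 2, 3)}
for depth, (n_used, avg) in usage_by_depth.items():
    print(depth, n_used, round(avg, 2))
```

As in the table, the number of distinct prefixes grows with depth while the average number of tokens sharing a prefix falls towards $1$.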

Since RQAE is hierarchical, the number of unique sets of codebooks that RQAE learns grows exponentially with the number of layers. At first glance, this makes RQAE seem pretty useless - what’s the point of finding $544^{1024}$ new “features”?

Remember from Definition 3 that a single feature is defined by a set of codebook indices. When we choose random codebooks per layer, we get useless results, because all tokens in our evaluation dataset have nearly $0$ intensity with the feature. However, if we create a feature by choosing a token in the dataset, and creating a feature with the same codebooks as the token, that feature is highly interpretable. Having an even larger evaluation dataset would result in more unique interpretable features.

NOTE: Consider the way that SAEs are actually interpreted. The SAE is run on a test set, and you look at how each feature activates on different tokens on the test set to interpret the feature. If the test set does not include examples of the feature, then the feature can not be interpreted - in this sense, SAE features are *only* useful/interpretable given a sufficiently large test set.

When features are defined by a set of codebooks, a natural question becomes how the feature uses each codebook. We find that more codebooks directly lead to more specific features. This makes sense given our understanding of the feature hierarchy: earlier codebooks define coarse features in the hierarchy, while later codebooks define more refined and specific features.

Figure 3b. Example of a feature's specificity changing with number of layers. View Feature in Dashboard

What happens when we use all $n_q$ codebooks in the model? Since RQAE has low reconstruction loss, this becomes very similar to just measuring by cosine similarity of the original activations. This means that only the dominant “feature” in the activation will be modeled, but clearly, SAEs do something similar, at least for some features. Take the Gemmascope features for any Gemmascope SAE, and a large portion (from manual checks, 30+%) of them will be single-token features (i.e. the feature only fires on a single token).

Fig. 3c shows us that earlier layers are certainly doing something different than just ranking by cosine similarity. Specifically, even up to layer $64$ we do not find the Spearman correlation to be above $0.5$, indicating a weak-to-moderate correlation.

Figure 3c. For a set of RQAE features, we take the top $128$ max activating examples. Then, we rank them by cosine similarity (of their original activations) to the top-most activating example, and also rank them on feature intensity by each layer of the RQAE model. Then, we measure the Spearman correlation between all of these rankings to the ranking by cosine similarity. Even by layer $64$, the Spearman correlation is $< 0.5$, indicating that the feature is ranking tokens substantially differently than cosine similarity.
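For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation computed on ranks rather than raw scores. The sketch below is a minimal NumPy version that assumes no tied scores; the example scores are made up and are not taken from the experiment.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for two score lists without ties."""
    rank_x = np.argsort(np.argsort(x))  # rank of each element of x (0 = smallest)
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical scores for four examples
cosine_scores = [0.9, 0.7, 0.5, 0.3]
intensity_same_order = [4.0, 3.0, 2.0, 1.0]  # identical ranking
intensity_swapped = [4.0, 2.0, 3.0, 1.0]     # middle two examples swapped

rho_same = spearman(cosine_scores, intensity_same_order)
rho_swapped = spearman(cosine_scores, intensity_swapped)
print(rho_same, rho_swapped)
```

A value near $1$ means the two rankings agree almost exactly, so a correlation below $0.5$ over $128$ examples indicates substantial reordering.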

Feature Splitting and Absorption

Since RQAE models the feature hierarchy, we should define how features can split and be absorbed.

Definition 5: Given a RQAE feature $f$: $$f = [c_1, c_2, \dots, c_k]$$ The ancestors of $f$ are a set of RQAE features: $$A_m(f) = \{g: g = [c_1, c_2, \dots, c_m]\}\text{ } \forall m < k$$ Two features $f_1$ and $f_2$ are considered $m$-split features if $$C(a_1, a_2) \geq \theta\text{ } \forall a_1 \in A_m(f_1), a_2 \in A_m(f_2)$$ for some threshold $\theta$.

Essentially, two features are split if some subset of their ancestors are close, but they diverge at some point. This definition does mean that features can split at one layer, and then merge later on, but we couldn’t find interesting examples of this in practice. To illustrate what this looks like, let’s look at an example:

Figure 4a. An example of a feature splitting at different layers. Each branch is its own feature, and intensity is calculated at the layer at which the ancestors split.

We developed heuristics for how to choose $\theta$ based on intensities on a test dataset, but we think there’s much more detail that can be added to feature splitting in RQAE - for example, learning thresholds per feature like in JumpReLUs.

Comparison to Gemmascope

It’s hard to propose a new architecture, when the existing architecture already has a large body of work and empirical evidence: for example, Gemmascope used $O($training compute of GPT-3$)$ to train SAEs. As a result, we use this section to directly compare Gemmascope SAEs to RQAE.

Finding Equivalent Features

SAEs are useful because they develop a dictionary of features. Although this dictionary is static after training, the small dictionary can be evaluated in depth to come to promising interpretability results.

In contrast, as mentioned above, RQAE has exponentially many features - far too many to manually or even automatically interpret. Thus, it’s interesting to know if:

The same features that an SAE finds exist in RQAE
There is an easy way to identify those features automatically in RQAE.

To validate if a RQAE feature is “equivalent” to a SAE feature, we measure the Pearson correlation between SAE intensities and RQAE intensities on the top-activating and median-activating examples of the original SAE feature. If this number is high, then the two methods generally recognize the same tokens at the same intensities, and the features are very close. This gives us a way to determine whether the same features exist in RQAE.

An easy way to identify a feature would be looking at its top-activating example. If, using this example, RQAE can faithfully reconstruct the feature (as defined above), then we could simply create features from every token in a dataset (which would be tractable to evaluate, e.g. the dataset we use in this work is 4M tokens), and find the most interpretable ones.

Figure 5a. We choose a set of features from Gemmascope. For any given feature, we (1) create a RQAE feature based on the quantization of its top-activating example, and (2) create a RQAE feature by directly quantizing its encoding vector in the SAE. We then measure the Pearson correlation between the intensities of these two RQAE features and the original Gemmascope feature on the top-activating and median-activating examples of the feature.

This result certainly shows that Gemmascope features exist and can be represented with RQAE, since the Pearson correlation of the encoding vector approaches $1$. This means that for every Gemmascope feature, there exists some RQAE feature which ranks tokens very similarly to it.

Using the top-activating example to create a feature, however, gives a mixed result. As expected, it underperforms using the encoding vector. However, a Pearson correlation of $0.7$ still suggests that there is a strong correlation between the two. We consider this to be good enough evidence to continue using top-activating examples to search for RQAE features, although we discuss in the conclusion why this might need future work.

Finally, let’s qualitatively look at some examples of Gemmascope features, and how they compare to RQAE features that use their top activating example. We looked for three particularly complex features, to ensure that we don’t bias towards single-token features, which we expect RQAE to capture well already. All features were taken from the 16k SAE with $L_0=81$:

Figure 5b. We look at the three Gemmascope features mentioned above. We then define RQAE features on their top activating examples, and look at the corresponding top activating examples of these RQAE features.

Qualitatively, even for “complex” features, it seems that RQAE does a good job of representing the feature! We talk more in the conclusion about further ideas we find interesting which stem from these results.

Evaluations

Evaluating interpretability methods is still an open problem. In this work, we will use Eleuther’s auto-interpretation suite to explain and evaluate features. We compare directly against Gemmascope SAEs for all evaluations.

We perform all evaluations on the monology/pile-uncopyrighted dataset - specifically, the same subset of 36864 sequences (128 tokens each) that is also used in Neuronpedia to interpret Gemmascope SAEs. You can also see all features used, as well as the full evaluation traces, in the dashboard. As a result, you can directly compare all scores, explanations, and activating examples with Gemmascope easily.

For all evaluations, we throw out Gemmascope features without enough max-activating examples in the dataset - we do this by removing features with <bos> tokens as max-activating examples.

Creating a Dictionary of Features

To compare to SAEs, we need to actually define a single set of features. We have already motivated this method in the previous section, but to explicitly define how to make a dictionary of at least $k$ features using RQAE:

Begin with some test dataset of tokens. We use monology/pile-uncopyrighted.
Choose $k$ tokens with enough diversity from this dataset. We select uniformly from unique tokens.
Create a set of $b$ RQAE features per token by using the quantized codebooks of the layer, and then choosing $b$ subsets of layers (e.g. the first 512, the first 256, the first 128, etc.).
Optionally, filter features (we only consider layers $16$, $64$, and $256$ for evaluation).
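The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the released implementation: it assumes each dataset token has already been quantized into one codebook index per RQAE layer (the `codes` array), and the names `build_feature_dictionary`, `token_ids`, `codes`, and the codebook size of 1024 in the toy data are hypothetical.

```python
import numpy as np


def build_feature_dictionary(token_ids, codes, k, depths=(16, 64, 256), seed=0):
    """Build RQAE features from k uniformly chosen unique tokens.

    token_ids: (n,) int array, vocabulary id of each dataset token.
    codes:     (n, n_layers) int array, codebook index picked by each layer.
    Returns one feature per (chosen token, depth) pair, where a feature is
    the prefix of that token's codes over the first `depth` layers.
    """
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    codes = np.asarray(codes)

    # Select uniformly from unique tokens, not from token occurrences.
    unique_ids = np.unique(token_ids)
    chosen = rng.choice(unique_ids, size=min(k, len(unique_ids)), replace=False)

    features = []
    for tok in chosen:
        # Pick one occurrence of this token in the dataset.
        pos = int(rng.choice(np.flatnonzero(token_ids == tok)))
        for depth in depths:
            if depth <= codes.shape[1]:  # filter to depths the model has
                features.append({
                    "token_id": int(tok),
                    "position": pos,
                    "depth": depth,
                    "codes": codes[pos, :depth].copy(),
                })
    return features


# Toy data: 1000 tokens over a 50-word vocabulary, 256 layers.
token_ids = np.arange(1000) % 50
codes = np.random.default_rng(1).integers(0, 1024, size=(1000, 256))
features = build_feature_dictionary(token_ids, codes, k=10)
print(len(features), "features from", 10, "tokens")
```

Storing only a prefix of codes per feature reflects the hierarchy: a shorter prefix describes a coarser feature, and a longer prefix of the same token describes a more specific one.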

This method is a naive approach to selecting features. However, from manual inspection it does seem to find features that are (1) unique, (2) interpretable, and (3) consistent among top-activating examples. We discuss future work for selecting features in the conclusion.

Results

There are two axes that are commonly considered to affect SAE interpretability: model width and $L_0$. We sweep across both these dimensions:

Figure 6a. Evaluation sweeps across model width and $L_0$ for SAEs. When sweeping $L_0$, we fix width at $16k$. When sweeping width, we fix $L_0 \approx 80$ (closest one available in Gemmascope). All results are averaged over $100$ features. Error bars represent standard error.

Across models and both axes, we see that RQAE generally outperforms SAEs in detection, and performs similarly in fuzzing. Generally, we see that more layers is better for RQAE, although this might saturate (e.g. $256$ layers performs the same as $64$ layers).
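For readers reproducing the error bars, here is a small sketch of how a per-configuration mean and standard error could be computed from per-feature scores. The function name `mean_and_standard_error` and the example scores are assumptions for illustration; the sample standard deviation (`ddof=1`) is a common convention and may differ from what was used for the figure.

```python
import numpy as np


def mean_and_standard_error(scores):
    """Mean and standard error of the mean for a list of per-feature scores."""
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.size
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = scores.mean()
    # Sample standard deviation divided by sqrt(n).
    sem = scores.std(ddof=1) / np.sqrt(n)
    return float(mean), float(sem)


# Hypothetical detection scores for three features.
m, s = mean_and_standard_error([0.5, 0.7, 0.9])
print(f"{m:.3f} +/- {s:.3f}")
```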

Using RQAE

We provide all <a href="https://www.github.com/harish-kamath/rqae">code</a> and <a href="https://huggingface.co/harish-kamath/rqae">models</a> for RQAE. We’d love to see more work being done, and this work only serves to introduce RQAE. We briefly show some interesting results using RQAE:

Feature Visualization Dashboard

Many results in this work rely on qualitative claims made by looking at specific examples of features, activations, and dataset examples. We provide a hosted visualization dashboard that contains a large number of RQAE and Gemmascope features, as well as all traces from evaluations, so you can easily evaluate for yourself how well RQAE works.

Figure 7a. An example view of the dashboard presented. See here to try it out yourself!

Steering

We show preliminary results of steering using RQAE features. We could not find evaluations to quantitatively compare steering with SAE or activation steering - so we will leave this for future work. Steering is performed by:

During generation, take the latest given LLM activation. Quantize it with RQAE into a set of codebooks.
Take the codebooks of the feature you are steering with. Turn the activation codebooks towards the feature codebooks dependent on some strength $\kappa$ and the cosine similarity between the codebook and the feature’s codebook.
Continue for all tokens during generation.
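The steering steps above can be sketched as below. The exact update rule is not spelled out in this post, so the interpolation weight used here (`strength * (1 + cos) / 2`) is purely an assumed, illustrative choice, as is working with each layer's quantized vector already projected back to the model dimension. The names `steer_activation`, `act_vecs`, and `feat_vecs` are hypothetical.

```python
import numpy as np


def steer_activation(act_vecs, feat_vecs, strength, num_layers):
    """Move an activation's per-layer quantized vectors towards a feature's.

    act_vecs:  (n_layers, d) quantized vector chosen by each RQAE layer for
               the current LLM activation.
    feat_vecs: (n_layers, d) quantized vectors of the steering feature.
    Only the first `num_layers` layers are steered. Returns the steered
    activation, i.e. the sum of the per-layer vectors, with shape (d,).
    """
    act = np.array(act_vecs, dtype=np.float64)  # copy, input is not modified
    feat = np.asarray(feat_vecs, dtype=np.float64)
    for i in range(min(num_layers, act.shape[0])):
        a, f = act[i], feat[i]
        norm = np.linalg.norm(a) * np.linalg.norm(f)
        cos = float(a @ f / norm) if norm > 0 else 0.0
        # Assumed rule: weight grows with strength and with cosine similarity.
        w = strength * (1.0 + cos) / 2.0
        act[i] = (1.0 - w) * a + w * f
    return act.sum(axis=0)


# Toy example with two layers in a 2-D space.
act = np.array([[1.0, 0.0], [0.0, 1.0]])
feat = np.array([[0.0, 1.0], [0.0, 1.0]])
steered = steer_activation(act, feat, strength=0.8, num_layers=1)
print(steered)
```

In a generation loop, this function would be applied to the latest activation at every step, with the steered result written back into the model's forward pass.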

Here are some qualitative examples of steering:

Steering on Gemma 2 2B using the "Bodies of Water" feature. Use the tabs above to view outputs with/without steering.

Steering parameters: temperature=0.5, num_layers=5, strength=0.8, repetition_penalty=1.5
prompt="An interesting animal is the"

An interesting animal is the African elephant. The elephant is the largest land animal on the planet. The elephant is a large animal, but it can also be quite small. The smallest elephant is the dwarf elephant, which is only about 2 feet tall. The African elephant is the largest elephant on the planet. It can weigh up to 14,000 pounds and stand up to 13 feet tall. The elephant is an amazing animal. It is a very intelligent animal and can learn things very quickly. The elephant is also a very social animal. It lives in large herds and is very protective of its young. The elephant is also a

An interesting animal is the European rabbit, which is a symbol of the European continent. It is a member of the Leporidae family, which includes the hare and the hare. This animal is considered a wild animal, but it is a domestic animal in many countries. The European rabbit is a small animal that is not dangerous to humans. It is one of the most common animals in the world. It has many uses, including food, clothing, and shelter. The European rabbit is a small, furry animal that is native to Europe. It is also known as the European hare or the European rabbit. It is a member of the Leporidae

An interesting animal is the Psammodromus sceleratus, a desert lizard that lives in the sand dunes of the Sahara Desert. An interesting animal is the ***Psammodromus sceleratus***, a desert lizard that lives in the sand dunes of the Sahara Desert. The ``Psammodromus sceleratus`` is a burrowing lizard that has a special adaptation to live in the sand dunes of the Sahara Desert. It is a nocturnal lizard that is very well camouflaged in its natural habitat, which is the sand dunes of the Sahara Desert. It is a small lizard that can reach a length of 10 cm, and it has a very interesting color pattern that allows it to blend in with the sand.

An interesting animal is the giant earthworm. An interesting animal is the **giant earthworm**. This is a worm that can grow up to 3 meters long. It can be found in Australia, New Zealand, South America and Africa. It is a very strange looking creature. It has an elongated body, with a segmented structure. It has ten pairs of legs, which are used for locomotion. It also has a mouth, which is used for feeding. It is a scavenger, and it feeds on decaying matter. The giant earthworm is a very interesting animal. It is a great example of how evolution can create new and strange creatures.

An interesting animal is found in the desert. It is a small, dark-colored snake with a long tail. The snake is usually found in the desert, but it can also be found in other habitats. The snake is a carnivore and feeds on small animals. The snake is also a venomous snake and can be dangerous.

Notice that in this case, a very low top_k is used. We have found in practice that this is dependent on the feature being steered (similar to SAEs). Generally, top_k is a coarser parameter, that you can set as the maximum value before outputs start degenerating. Then, use strength to fine-tune the strength of steering.

Steering on Gemma 2 2B using a "Compound Word with Hyphen" feature (token below). Use the tabs above to view outputs with/without steering.

... know an Anglican-Catholic priest...

Steering parameters: temperature=0.5, num_layers=32, strength=0.8, repetition_penalty=1.5
prompt="An interesting thing"

An interesting thing happened in the last few days. I was reading a book and I came across a term that I had never heard before. I looked it up and found that it was a term that was used in the 19th century, and I found it interesting and intriguing. I was wondering if anyone else had heard of this term and what it meant. The term is called a “disjunction” and it means that something is not true. It is a term that is used in the context of a philosophy of science, and it is used to describe a situation where two things are not true. The term “disjunction”

An interesting thing happened the other day. I was in the middle of a meeting and someone asked me the question “What does your job entail?” I’ve been asked this question before, but I’ve never been able to answer it. It’s a very difficult question to answer because it doesn’t give any insight into what I do. I’ve tried to answer it before, but I’ve always felt like I was leaving something out. I’ve tried to answer it in a way that would make people think I’m doing something really important. I’ve tried to answer it in a way that would

An interesting thing happened in the first few days of the new year. I was in the middle of a very busy week, and I was so busy that I wasn’t able to sit down and watch the first few episodes of the new season of ``The Mandalorian``. I’ve been a fan of the show since it first aired, and I’ve been following it closely ever since. But I was so busy that I forgot to watch the first episode of the season. I’d always been interested in the show, but I wasn’t sure if I was going to be able to watch the first episode.

An interesting thing happened in the last couple of weeks of the 2013 season. The Cubs started to play better baseball. They were winning games, they were winning series, and they were winning the division. They had a lot of things to be happy about, and they should be. They had a lot of things to be happy about, and they should be. But they also had a lot of things to be unhappy about, and they should be. This is a team that has not won a playoff game since 1945. This is a team that has not won a World Series since

An interesting thing about the ``Star Wars`` franchise is that it has managed to keep its fanbase engaged for over 40 years now. It’s not just the original trilogy or the prequels that have kept the audience hooked, but also the spin-offs like ``Rogue One`` and ``The Mandalorian``. One of the most popular characters in the entire franchise is the bounty hunter Boba Fett, played by Temuera Morrison. He first appeared in ``The Empire Strikes Back`` and then made a comeback in ``The Book of Boba Fett``. In a recent interview, Temuera Morrison talked about his experience working

An interesting thing happened to me yesterday: I got a call from a man who was very distressed. He had a 16-year-old daughter who was going through a difficult time with her boyfriend. The boyfriend was a bully. He had been harassing the girl for some time, but it was getting worse. He was threatening to kill her. The daughter was afraid to tell her parents. She was afraid of the boyfriend, and she was afraid of her parents. I had a long talk with the girl’s father, and we decided to get the police involved. The young man was arrested and charged with assault and battery

An interesting thing happened while I was researching this post. I found a lot of interesting things about the history of the term “sissy”. I had no idea that it originated from the word “sissie”. I also did not know that the word “sissy” has been around for a long time. It has been around since the 1500’s, which is almost 500 years ago. The word “sissy” is used to describe someone who is feminine, weak, or cowardly. The word is used to describe a boy or man who is effeminate. The word has

An interesting thing about the 2020 election is that it is the first election in which the race for president is being decided by a majority of the voters. In 1960, John F. Kennedy defeated Richard Nixon by a margin of just 110,000 votes (0.2 percent) in a three-way race. In 1972, Richard Nixon defeated George McGovern by a margin of 520,000 votes (0.7 percent) in a two-way race. In 1980, Ronald Reagan defeated Jimmy Carter by a margin of

An interesting thing to observe about the two-volume set under review here is that, like the proverbial dove of the title, it is a rather small volume. It is a collection of essays that is as well-written and well-edited as the best-of-the-best in the genre, but it is not an academic-type-of-deal. It is not a scholarly-style-deal. It is, in a sense, an academic-style-deal, but a rather unusual one. It is a little like the 19th-century-type-of-deal of-a-man, with a-

An interesting thing to note about the late-19th century and early-twentieth-century English-language German-speaking literature is its penchant for the Latinate. The latter-day Germanophile is apt to be a bit of a snob, and the Romantic-tinged Germanophile is apt to be a bit of a snob as well (and not in the sense that one would have to be in order to be a proper Anglo-Saxon. The latter is a bit of a stretch, but the former is a bit more plausible. The point is that the whole business of-a-sort of-German-with-a-

An interesting thing that happened to me came to light as a result of the recent fuss about the re-establishment of the Russian Orthodox Church in the West. The first thing that comes to mind of course is the iconoccending 19th century Russian Orthodox Church of the Russian Empire. But it is the more-or-less orthodox Greek Orthodox Orthodox Church of the 18th century of the Orthodox Church of the 19th century in the West. The Orthodox Church of the 19th century in the West was the so-called autocephrite (or autoceph-something-to-something-something

An interesting thing about the English-speaking world is that it has a tendency to be rather parochial in its outlook. The British Isles has a rather narrow view of the worlda view that is in some ways a sort of a-historical a-historical, that is, in terms of the term British-ness, in the sense of the more-or-less-non-Anglo-Saxon speaking Anglo-Saxon. In the case of the United States, it has been a rather long-standing tendency to be rather provincialistic. The British-ness of the American-ness is a rather recent development, and in any case the American-ness

An interesting thing about this book is its origins. The author, Richard Llew-Jones, was a man of letters and a novelist, and a man of letters does not write novels. The fact that a man of letters is a man of letters could be a matter of course, but it has been a matter of course-ness for a century. The only thing about the affair that was a bit of a surprise was the extent to which the two were conjoined. The author's name was used as a pseudonym for the author of the "The Lord of the Rings" trilogy, which is to say that the author's name was used

An interesting thing about the English language is that it has been so widely used as a lingua franca of sorts for a long time, it's rather surprising how much stuff is in there in the sense that it's English-ish (and I use that term in the broadest sense possible sense sense of the word, which is to say that it's not just English-ish, but rather, a kind of English-ish in the sense that it's-ish, if you follow me-ish, you get the drift? -ishness. The point is that the language is rather, to use the technical term, a-centric

The above example is a more abstract feature. Note that in this case, we can increase top_k further without degeneration. Also, notice that even though all query tokens are the same (a hyphen), the steering feature can still be applied. Finally, let's look at how strength can affect steering.

Steering on Gemma 2 2B using a "Psychosis and Schizophrenia" feature (token shown below). Use the tabs above to view outputs with different values of strength.

... and not depressive psychosis) will...

Steering parameters: temperature=0.5, num_layers=128, repetition_penalty=1.5
prompt="Let's talk about"

Let's talk about the future of work and how it will affect your organization. In this session, we look at trends that are changing organizations today through an employee-centric lens to understand what employees want from their employers in a post COVID world About our Speaker Anant Agarwal is CEO & Founder @ edX , Board Member@ MITx . An expert on online learning with 20+ years experience leading technology companies including Microsoft Research Labs; Director for Internet Services - Windows Division (1985); Executive Vice President/Chief Technology Officer – Akamai Technologies Inc.; Chairman Emeritus– University College London Institute For Learning Science( UCLILS)

Let's talk about the 10th anniversary of this beautiful, wonderful series. I know that it was not on my list but if you think carefully and deeply then maybe there is a reason why we are going to discuss today one thing in particular which has been happening since last year or so... There have always been those who say "I don't like anything from Naruto" (which means Shippuden) because they were too young when all these things happened during their childhood days! It doesn't matter how old she gets; her work will never change for anyone else except herself alone forevermore without any explanation whatsoever either way around!! So let

Let's talk about the topic of today. Today we are going to be talking specifically on how you can get rid of back pain and I know that this is a subject matter where there has been so much misconception, especially in our time now because people don't understand what they should do when it comes to their body issues like for example with regards to neck pains or whatever else but let us just focus upon one aspect at hand which would definitely deal directly into your spine problems here as well if only those who suffer from them could have access towards some kind off relief through any type whatsoever then maybe things might actually work out alright without being too difficult afterwards either way

Let's talk about it. I was watching the first episode of season 3 and I noticed something that really bothered me, so much in fact...that is why this theory came into my mind.... This thing happened at least three times throughout all episodes: And then there were these moments which are also characteristic for schizophrenia (but not only): The following picture shows a part with an obvious schizophrenic-like aura A lot can be said here but let us come to another subject - depression! The main character has been depressed since childhood because she lost her mother when young girl during some kind of accident . And now we have what you see

Let's talk about. Now, I want to concentrate on the first part of this disorder which is called psychosis and then we will go through it all in detail now because there are some other disorders that exist like mania for example but they don’t have psychotic features associated with them so those ones would be left out here or something else might occur as well ,etc., etc.. But if you look at a person from an outside perspective without any underlying conditions such things usually happen during acute phase when someone gets affected by depression/anxiety anxiety-depression spectrum (ie depressive episodes). Now let me explain what i mean before going into details . So basically these symptoms

Conclusion

We think that RQAE is a reasonable approach that can either complement or replace SAEs in interpreting LLMs. The performance on evaluations, as well as qualitative inspection, suggests that RQAE is at least as good as SAEs at finding interpretable features, and addresses the concerns of feature splitting and dark matter that, up until now, have resisted progress with standard SAEs.

Furthermore, we hope that our code, model weights, and dashboard prove useful in evaluating and implementing RQAE, and we will actively be making updates to them in order to help the community adopt RQAE.

Future Work

There is an incredible amount of surface area to explore with interpretability and RQAEs. Here, we provide an ordered list of what we think is most important to validate and continue developing with RQAE:

1. Using SAE and RQAE together: If SAEs are telescopes, then RQAEs are microscopes. However, it's clear that SAEs are still great at finding a set of interpretable features. We think that a lot of work can be done using SAEs to develop a dictionary of features, and RQAE to further refine or organize these features. Previous work has done something similar by training SAEs on the residuals of existing SAEs - we think that RQAE is a more natural approach to doing this, since we have already shown that RQAE can reconstruct features in existing SAEs.

2. Finding better features: The approach presented in this work to choose features is incredibly naive. We're sure that better approaches exist, given that RQAE is so good at partitioning tokens already. One such approach would be finding diverse directions in the first four layers, and using that to choose distinct sets of tokens.

3. Fixing/Running More Evals: There's a good chance that detection and fuzzing are biased towards favoring RQAE - for example, they are likely also biased towards overly specific features and explanations by definition, and we know that more levels in RQAE correspond to ranking by cosine similarity $\implies$ overly specific features. New evals might be needed to properly evaluate this approach, although our qualitative examination gives us confidence that RQAE will still be competitive.

4. Rethinking the Eval Dataset: RQAE can almost completely partition all tokens in the current eval dataset of $4M$ tokens, within 4 layers. This seems to suggest that the eval dataset might be a bottleneck to finding more features, rather than the model itself! Since we know that RQAE has much higher capacity, we could see a shift from spending more time curating train datasets, to instead training on a large general dataset and spending time curating an evaluation dataset.

This is certainly not an exhaustive list of ideas to try with RQAE. We're excited to see what everyone comes up with, and thank you for reading this far!