A Microscope into the Dark Matter of Interpretability

We propose a new interpretability architecture that learns a hierarchy of features in LLMs. It matches/outperforms SAEs on evaluations, reduces reconstruction error, and naturally addresses feature splitting.

Too Long; Didn't Read


We propose the RQAE (Residual Quantization Autoencoder), a new architecture to interpret LLM representations. RQAE learns features hierarchically, by iteratively applying vector quantization on different subspaces of the residual stream of an LLM. As a result, you can control the specificity of a feature as you move through its hierarchy, and can naturally split features. In evaluations, RQAE has better reconstruction than SAEs, and we find that RQAE features are more interpretable than Gemmascope SAE features using Eleuther AI's evaluation suite.

A recent paper finds that the residuals of SAEs (i.e. original activation - SAE reconstruction) contain extra features that even the largest SAEs do not find. RQAE reduces these pathological errors iteratively. RQAE can be equivalently defined as stacking many weak SAEs on top of each other to iteratively reduce this "dark matter". We think that much of the theory for how SAEs work should also apply to RQAE.

This post presents the motivation, architecture, and experiments justifying RQAE. We have also released the code, model weights, and a feature dashboard.

Introduction

The purpose of interpretability is to decompose a deep model into human-understandable features. This focus has led to some incredibly interesting work, such as visualizing what parts of an image a model uses to predict a class and seeing how RL agents identify friends and enemies in a game.

However, Large Language Models (LLMs) may be more complicated to interpret. Text is a much denser interpretable medium than visuals, and LLMs are orders of magnitude larger than any other model we've ever trained. This work is heavily inspired by transformer-circuits, which has laid out a foundation for how we can begin to interpret LLMs. If you're new to interpretability, I would recommend reading these works first, but I'll provide a quick rundown of the core concepts in the sections below.

Other work has explored how different layers of the LLM contribute to different functions (for example, the feedforward layers store facts and "knowledge"). However, the scope of this work is limited to considering the residual stream after the middle layer of an LLM (most of the work only considers the Gemma 2 2B LLM). We hope to extend RQAE to all other layers and other LLMs, as e.g. Gemmascope has done with SAEs, in future work.

World Models

What is a human-interpretable feature, and why would such features exist in LLMs? We know that LLMs store world models because they are really good at predicting the next token, and compressing the training data to the extent that they do requires some latent understanding of the world. Other work explores this concept in detail, but roughly:

  1. There are a large number of things that can happen in the world
  2. We observe what happens, and we reason about what happens with a much smaller set of "latent" features.

For example, if we see someone drop a glass bottle, then we expect the bottle to shatter when it hits the ground. This is not because we have observed exactly that person dropping exactly that bottle before. It's because we have learned a very small set of latent features about physics, which we can use to model the bottle breaking.

LLMs must do something similar - they just don't have the capacity to fully memorize their training data! Thus, they must also be working with some set of latent features that they can manipulate to solve the next token prediction task. A lot of interpretability research is focused on finding these latent features, and figuring out how the model uses them.

Linear Representations and Superposition

How do LLMs organize and use these latent features? There's a lot of empirical evidence that LLMs represent features simply as directions in space, a claim known as the Linear Representation Hypothesis (LRH). If true, it's a very powerful framework - directions are just vectors, and we have a lot of theory to work with vectors.

Then, a full LLM activation is the sum of multiple atomic feature vectors. For example, we might think of a dog as the 'animal' feature plus the 'pet' feature, minus the 'feline' feature. Features also scale differently depending on the subject: a dog will have more of the 'smart' feature than a goldfish.

We borrow a formal definition of the LRH from here:

Definition 1: A linear representation has the following two properties:
  1. Composition as Addition: The presence of a feature is represented by adding that feature.
  2. Intensity as Scaling: The intensity of a feature is represented by magnitude.
The Linear Representation Hypothesis states that an LLM linearly represents human-interpretable features.
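To make the two properties concrete, here is a minimal sketch using the dog example from above. The three-dimensional space, the orthonormal feature directions, and the intensity values are all hypothetical choices made for illustration; real LLM features are neither this few nor exactly orthogonal.

```python
# Hypothetical 3-dimensional activation space with three orthonormal feature directions.
FEATURES = {
    "animal": (1.0, 0.0, 0.0),
    "pet": (0.0, 1.0, 0.0),
    "feline": (0.0, 0.0, 1.0),
}


def compose(intensities):
    """Composition as addition: sum the feature vectors, each scaled by its intensity."""
    activation = [0.0, 0.0, 0.0]
    for name, intensity in intensities.items():
        for i, component in enumerate(FEATURES[name]):
            activation[i] += intensity * component
    return activation


def read_intensity(activation, name):
    """Intensity as scaling: with orthonormal features, a dot product recovers the scale."""
    return sum(a * f for a, f in zip(activation, FEATURES[name]))


# A "dog" is animal plus pet, minus feline (the intensities are made up).
dog = compose({"animal": 1.0, "pet": 0.8, "feline": -0.5})
print(read_intensity(dog, "pet"))
```

Because the toy features are orthogonal, each intensity can be read back without interference, which is exactly the property that is lost once there are more features than dimensions.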

There is one problem with the LRH: the residual stream of an LLM with width $d$ lies in the space $\mathbb{R}^d$. However, there can be at most $d$ mutually orthogonal vectors in this space (a basis of the space), which means that there can be at most $d$ fully separable "feature directions".

In order for the LRH to be true, models must learn features in superposition. This means that features interfere with each other, and you can not fully separate all features. However, you can also fit exponentially more features in a $d$-dimensional space which are only $\epsilon$-orthogonal to each other!
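As a rough illustration of near-orthogonality, the sketch below samples more random unit vectors than there are dimensions and measures the largest pairwise cosine similarity. The dimension, the number of vectors, and the use of random Gaussian directions are assumptions for the demo; an LLM's learned features are not random, so this only shows that such packings exist.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via a normalized Gaussian."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst


rng = random.Random(0)
dim, num_features = 128, 256  # twice as many "features" as dimensions
features = [random_unit_vector(dim, rng) for _ in range(num_features)]
epsilon = max_abs_cosine(features)
print(f"{num_features} directions in {dim} dims, max |cos| = {epsilon:.3f}")
```

No pair is exactly orthogonal, but every pair stays well away from being parallel, which is the trade that superposition makes.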

Similar features also take similar directions. For example, the "dog" feature should be more aligned with the "cat" feature than with the "shark" feature. Thus, when measuring by cosine similarity, we see clusters forming which correspond to similar features.

Sparse Autoencoders

Sparse Autoencoders (SAEs) model the LRH by learning an overcomplete basis to take features out of superposition. SAEs take inspiration from sparse coding theory: it suggests that learning a dictionary of features that only fire sparsely (i.e. a small subset of features that reconstruct each activation faithfully) also results in that dictionary being human-interpretable.

A figure taken from Toy Models of Superposition. In this case, the observed model is our LLM, and the disentangled model is what we are trying to learn with a SAE.

Thinking back to World Models, this actually makes a lot of sense! We can think of the "larger" model as the set of disentangled latent features that model the real world. The larger model only activates sparsely - only some neurons are activated for any given input. Sparse autoencoders try to reconstruct this larger model, in order to represent features out of superposition.

Definition 2: A sparse autoencoder $S$ of size $n$ is a model that takes in a LLM activation $r \in \mathbb{R}^d$ and does the following: $$ C(r) = \sigma(W_{in}r + b_{in}) $$ $$ S(r) = W_{out}C(r) +b_{out} $$ where $\sigma$ is some nonlinearity (usually ReLU), and $W_{out} \in \mathbb{R}^{d \times n}, W_{in} \in \mathbb{R}^{n \times d}$ (for $n \gg d$). $b_{in} \in \mathbb{R}^n, b_{out} \in \mathbb{R}^d$ are bias terms. The model is trained to minimize the reconstruction error $||r - S(r)||_2$. An additional mechanism is used to induce sparsity in $C(r)$ - common methods include a TopK activation, an L1 loss term, or learning a threshold per feature.

To explicitly draw the connection to the LRH, a SAE performs the following algorithm (ignoring biases):

  1. Begin with some entangled LLM activation $r$, and consider that $W_{in}$ consists of $n$ "encoding" vectors.
  2. Compare $r$ against each encoding vector by calculating their dot product.
  3. Apply a nonlinearity (usually ReLU) and some sparsity function (TopK, L1, etc.), to act as a filter for sparsity.
  4. There are now $n$ coefficients $C(r) = [c_1, c_2, \dots]$, which are the intensities of each feature in $r$.
  5. Multiply $C(r)$ by the columns of $W_{out}$, which are the actual feature vectors ("decoding" vectors) that you have learned.
NOTE: Notice that each (encoder, decoder) pair of vectors is closely linked - the cosine similarity ($\propto$ dot product) that a representation has to an encoding vector directly defines the intensity of the feature. At the beginning of training, the encoder is initialized as the transpose of the decoder so that encoding vectors = features.
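The five steps above can be sketched in a few lines. This is a toy forward pass under stated assumptions, not any particular SAE implementation: it uses TopK as the sparsity function, plain Python lists instead of tensors, and hand-picked weights with the encoder set to the transpose of the decoder.

```python
def relu(x):
    return max(0.0, x)


def sae_forward(r, w_in, b_in, w_out, b_out, k):
    """TopK SAE forward pass for one activation r (a list of d floats).

    w_in:  n rows of length d (the encoding vectors).
    w_out: d rows of length n (its columns are the feature vectors).
    """
    # Steps 1-3: dot product with each encoding vector, then the nonlinearity.
    pre = [relu(sum(w * x for w, x in zip(row, r)) + b) for row, b in zip(w_in, b_in)]
    # Sparsity filter: keep the k largest coefficients and zero out the rest.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    codes = [c if i in keep else 0.0 for i, c in enumerate(pre)]
    # Step 5: weighted sum of the decoder columns.
    recon = [sum(w * c for w, c in zip(row, codes)) + b for row, b in zip(w_out, b_out)]
    return codes, recon


# Toy setup: d = 2, n = 3, encoder is the transpose of the decoder.
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
codes, recon = sae_forward([2.0, 0.5], w_in, [0.0] * 3, w_out, [0.0] * 2, k=1)
print(codes, recon)
```

With $k=1$ only the strongest feature survives, so the small second component of the input ends up in the residual rather than in the reconstruction.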

However, an SAE can only learn a set number $n$ of features, and it's likely that $n \ll N$ for the true number of features $N$ the LLM has learned. To give a toy example, consider two "ground-truth" features $f_1$ and $f_2$ - these features are similar but not exactly the same (e.g. different breeds of dogs). How will the SAE learn these features?

The Feature Hierarchy

The (high-level) answer to the question posed above is: the model will learn an average direction that will fire weakly for these features - for example, a "dog" feature. It might also learn a few features that weakly activate for related, more general features - for example, a "living" feature, or a "pet" feature. Clearly, we want some level of control on the specificity of features that SAEs learn.

There are three widely observed issues with SAEs that stem from this intuition:

  1. Feature splitting. If you train a wider SAE, you will notice that more general features split into smaller, more specific features.
  2. Feature absorption. You learn two separate features in your SAE that are describing the same ground-truth feature - as a result, representations with that ground-truth feature are split across the two learned features without any discernable pattern (i.e. the difference between the two features is spurious).
  3. Feature shrinkage. SAEs routinely underestimate the intensity of a given feature. This happens because of the sparsity penalty during training. JumpReLUs largely mitigate this issue, but it's related to the first two because it happens due to learning entangled features (an SAE will underestimate a feature's intensity because it wants to account for other features that will interfere).

The reason these happen is that SAE features are not necessarily atomic. They might need to be broken down or grouped together, but it's difficult to tell which is which, and the training objective doesn't bias the model one way or the other. This raises the question: how should we organize features?

A paper from UChicago attempts to answer this question. They split up features in two dimensions: hierarchical features, such as organism -> (plant -> (tree, bush, etc.), animal -> (bird, reptile, fish, etc.)), and categorical features, such as (dog, cat, hamster, etc.). Hierarchical features are organized orthogonally to each other, while categorical features organize themselves into polytopes.

SAEs treat features as if they are all categorical. They can learn hierarchies of features (for example, they can learn separate features for organism, plant, animal, etc.) - but there is nothing in the architecture that encourages them to learn such features. My best guess is that they learn hierarchy based on the frequency of the concepts in the training data alone, based on other work that finds SAEs learn more granular features when trained on specialized datasets. However, this is still an open question, and I don't propose a clear answer here.

Adding Inductive Bias

As mentioned above, SAEs do not have an inductive bias for how features should relate to one another. At the end of training, you are left with a flat dictionary of features, and it is up to you to interpret and organize them.

In contrast, this work introduces the inductive bias that features must be composed hierarchically. It's not clear whether hierarchy is sufficient or necessary for all features an LLM represents, but feature hierarchies do seem like the natural answer to solving the problem of feature splitting/absorption.

We define a feature hierarchy to consist of parent and child features, such that any time a feature is activated, its parent will also be activated, and zero or one of its children will be activated. This addresses the three issues presented in the previous section:

  1. If a feature splits, then we define the base feature as higher in the hierarchy, and the split features as lower in the hierarchy.
  2. If two features should be absorbed, then only consider their common parent feature as the atomic feature.
  3. Feature shrinkage should not occur, since exactly one feature will be active at each layer of the hierarchy, and there is no incentive to account for interference.

This isn't the only way to define a feature hierarchy (and probably not the best way) - for example, you can define a many-to-many relationship between parents and children - but it is simple, and in the next section we define an architecture that models this type of hierarchy.
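To make the parent/child rule concrete, here is a small sketch that checks whether a set of active features respects it. The toy hierarchy (borrowing the organism/animal/plant names from the example earlier) and the helper name `is_valid_activation` are hypothetical and exist only for illustration.

```python
# Hypothetical toy hierarchy, stored as child -> parent (None marks the root).
PARENT = {
    "organism": None,
    "animal": "organism",
    "plant": "organism",
    "dog": "animal",
    "cat": "animal",
}


def is_valid_activation(active):
    """Check both rules: every active feature's parent is active, and every
    active feature has zero or one active children."""
    active = set(active)
    for feature in active:
        parent = PARENT[feature]
        if parent is not None and parent not in active:
            return False
    for feature in active:
        active_children = [child for child in active if PARENT[child] == feature]
        if len(active_children) > 1:
            return False
    return True


print(is_valid_activation({"organism", "animal", "dog"}))         # a single root-to-leaf path
print(is_valid_activation({"organism", "animal", "dog", "cat"}))  # two children of "animal"
```

Under this definition every valid activation is a single path from the root downward, which is why exactly one feature is active at each level of the hierarchy.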

RQAE

In this section, we present the Residual Quantization Autoencoder (RQAE).

Figure 1a. How both SAE and RQAE decompose the activation of a LLM into interpretable parts.

Since RQAE tackles the same problem as SAEs (decomposing an activation into interpretable parts), it should be no surprise that the end result also looks similar. Fig 1a shows how both RQAE and SAE split up an activation from an LLM. We even use the same training loss (MSE)!

Of course, not all feature decompositions are equal. SAE relies on the bottleneck of sparsity to control the type of features that are learned. RQAE uses two bottlenecks: projecting into subspaces, and quantizing the subspaces into a relatively small number of codebooks (i.e. only a few directions in the subspace can be chosen).

Figure 1b. An overview of how RQAE decomposes a LLM activation from Fig. 1a.

RQAE uses residual vector quantization (RVQ) to autoencode the LLM representation - but we also add a linear in/out layer between each quantization step, which allows the model to iteratively choose different subspaces of the representation space to quantize. To implement RVQ, we use a variant of FSQ that uses hyperspheres instead of hypercubes to define codebooks. FSQ encourages all codebooks to be utilized fully, which is especially important when you have many layers.

Here is the equivalent algorithm (to SAEs) for RQAE:

  1. Begin with some LLM activation $r$. For $n_q$ layers, iteratively perform the following:
    • Project $r$ into a small subspace with a linear layer, called projection $p$.
    • Normalize $p$ to get $p_{norm}$
    • Find the closest codebook to $p_{norm}$ from a set of codebooks using Euclidean distance. Call this codebook $\hat p$, with index $c_i$ for layer $i$.
    • Project $\hat p$ back into the activation space with a linear layer, called $\hat r$.
    • Set $r = r - \hat r$ as the residual.
  2. The original activation $r$ has now been quantized with $n_q$ codebooks, and can be represented as $r \approx (c_1, c_2, \dots, c_{n_q})$.
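The loop above can be sketched as follows. This is a toy illustration under stated assumptions rather than the released implementation: weights are plain Python lists, biases are omitted, the helper names are made up, and the two hand-picked layers each quantize a one-dimensional subspace.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def nearest_index(point, codebooks):
    """Index of the codebook closest to point in Euclidean distance."""
    return min(range(len(codebooks)), key=lambda i: math.dist(point, codebooks[i]))


def rqae_encode(r, layers):
    """layers: list of (w_in, codebooks, w_out) tuples, one per RQAE layer.
    Returns the chosen codebook indices and the accumulated reconstruction."""
    residual = list(r)
    recon = [0.0] * len(r)
    indices = []
    for w_in, codebooks, w_out in layers:
        p = matvec(w_in, residual)                      # project into a small subspace
        norm = math.sqrt(sum(x * x for x in p)) or 1.0  # guard against a zero projection
        p_norm = [x / norm for x in p]                  # normalize
        c = nearest_index(p_norm, codebooks)            # quantize to the closest codebook
        r_hat = matvec(w_out, codebooks[c])             # project back to activation space
        residual = [a - b for a, b in zip(residual, r_hat)]
        recon = [a + b for a, b in zip(recon, r_hat)]
        indices.append(c)
    return indices, recon


# Two toy layers over a 2-d activation, each with codebooks {-1, +1} in a 1-d subspace.
layers = [
    ([[1.0, 0.0]], [[-1.0], [1.0]], [[2.0], [0.0]]),
    ([[0.0, 1.0]], [[-1.0], [1.0]], [[0.0], [0.5]]),
]
indices, recon = rqae_encode([2.0, -0.5], layers)
print(indices, recon)
```

Since the projection is normalized before quantization, the magnitude of each layer's contribution has to come from the linear out layer, which is why the toy out-weights carry the scales 2.0 and 0.5.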

Similar to how an SAE models the LRH, RQAE models the "feature hierarchy" inductive bias mentioned above. For any given LLM activation, every layer chooses exactly one codebook. The subspaces that later layers choose are dependent on earlier layers, encouraging a learned hierarchy between layers. Thus, a RQAE model with $n$ layers and $c$ codebooks per layer learns $c \times n$ unique codebooks, but can represent $c^n$ different activations.
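To get a feel for the gap between $c \times n$ and $c^n$, the arithmetic can be worked through with the sizes used for the model trained later in this post ($n = 1024$ layers, $c = 544$ codebooks per layer). The variable names are just for illustration.

```python
num_layers = 1024          # n
codebooks_per_layer = 544  # c

stored = codebooks_per_layer * num_layers          # codebooks the model actually learns
representable = codebooks_per_layer ** num_layers  # distinct index sequences it can emit

print(f"stored codebooks: {stored:,}")
print(f"representable sequences: a number with {len(str(representable))} decimal digits")
```

The stored count grows linearly with depth while the representable count grows exponentially, which is the source of both the capacity and the "too many features" problem.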

RQAE = Stacked SAEs

We draw a connection between the layers of a RQAE and a SAE.

Conceptually, a single layer of RQAE consists of the same parts as the SAE algorithm mentioned above: an encoder (linear in layer), intensity coefficients (codebook values), and a decoder that corresponds to features (linear out layer). Formally:

Lemma 1: A single layer of RQAE can be equivalently defined as a Top1 SAE.
Proof: Still a WIP. Will follow this work closely
NOTE: This lemma does not make any claims about the training dynamics or quality of features learned by RQAE compared to an equivalent SAE. In fact, design decisions such as FSQ heavily restrict the type of features that RQAE can learn. The lemma only exists to illustrate the relationship between RQAE and SAE.

In practice, we use a very small codebook dimension (in experiments, $4$). This is needed for vector quantization (especially FSQ, whose codebook size grows exponentially with respect to codebook dimension). At first glance, this seems concerning: an SAE can find features in a feature manifold that lives in higher-dimensional subspaces, but a RQAE layer can not!

This is true theoretically. However, in practice there is a large body of work that suggests LLM features and feature manifolds are mostly represented in very low dimensional subspaces (i.e. they are rank deficient). It's likely that RQAE won't miss many features or manifolds because it projects on a lower dimension - we think that it still can learn those features across RQAE layers, but we leave that to future work.

Modeling Dark Matter

Previous work shows that SAEs learn pathological errors - that is, the residuals between SAE reconstructions and the original activations are uniquely important for reconstruction (measured by cross entropy loss difference), compared to completely random residuals the same distance away. A following work goes further and shows that this pathological error includes features that SAEs miss - a concept that Chris Olah has coined as the "dark matter" of interpretability.

Interestingly, even the largest SAEs evaluated (1 million features) still miss a significant portion of total (linear) features found in the LLM (Figure 10a), and the error decays logarithmically with respect to the number of features. The core takeaway is that SAEs are grossly inefficient at uncovering this "dark matter".

How would one fix this issue with SAEs alone? If you assume that sparse coding is enough to find interpretable features at very high SAE sizes, then this begins to look like a capacity problem. Specifically, a model with high enough capacity will also reduce reconstruction error, so that features don't exist anymore in its residual error. The capacity of existing SAEs (even with 1M features, which is already 4.6GB of model weights on a Gemma 2 2B LLM) does not seem to be enough.

This work provides a great motivation for RQAE! A single layer of RQAE is equivalent to a (weak) SAE, as shown above. The next layer of RQAE then learns on the residuals of the first layer, which we assume to have "missing features", since reconstruction is not perfect. Applying this iteratively allows us to learn more "missing" features after each layer. This mimics the exact experiment run in the paper, where they train an SAE on the residuals of a Gemmascope SAE, and find that it uncovers more of the dark matter (linear error) that the Gemmascope SAE missed.

In the next section we show that RQAE empirically also reduces reconstruction error (i.e. higher capacity), as expected.

Training a Model

We train a RQAE with $n_q=1024$ layers, and use a codebook dimension of $d=4$ with quantized values of $[-1, -0.5, 0, 0.5, 1]$ per dimension - resulting in $544$ codebooks per layer (see FSQ for more details). We use the following hyperparameters:

| Parameter | Value | Description |
| --- | --- | --- |
| LLM | Gemma 2 2B | Base model to interpret |
| Layer | Residual stream after the center layer (12) | What layer of the residual stream to train on |
| Training Data | FineWeb-Edu | Dataset used to train RQAE |
| Test Data | pile-uncopyrighted (Neuronpedia subset) | Dataset used for analysis |
| num_tokens | 1B | How many tokens we train on |
| context_length | 128 | Context length we train on |

We normalize activations passed into RQAE by using the final RMS norm layer of the LLM (note that this means all activations have exactly the same norm). Ablations are still a WIP, but will be added soon.
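The figure of $544$ codebooks per layer can be reproduced with a short count. The sketch below assumes (this is my reading of the hypersphere variant, not something spelled out above) that each grid point is projected onto the unit hypersphere, so grid points lying on the same ray from the origin collapse into a single codebook and the zero vector is dropped.

```python
from functools import reduce
from itertools import product
from math import gcd

# The quantized values [-1, -0.5, 0, 0.5, 1], scaled by 2 so everything stays an integer.
levels = (-2, -1, 0, 1, 2)
codebook_dim = 4

directions = set()
for point in product(levels, repeat=codebook_dim):
    if not any(point):
        continue  # the zero vector has no direction on the hypersphere
    # Dividing by the gcd maps every grid point on a ray to the same primitive vector.
    divisor = reduce(gcd, (abs(x) for x in point))
    directions.add(tuple(x // divisor for x in point))

print(len(directions))
```

Of the $5^4 - 1 = 624$ nonzero grid points, the $80$ with all entries in $\{-1, 0, 1\}$ (after removing the scale factor) share a direction with their doubled copies, leaving $624 - 80 = 544$ distinct directions.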

Let's look at properties of the learned model, to validate assumptions we have based on the model architecture.

Figure 2a. (left) Distribution of pairwise cosine similarities between all learned codebooks (projected out to model dimension). (right) L2 norm of projected codebooks across RQAE layers.

Figure 2a shows the distribution of pairwise cosine similarities between all projected codebooks. Almost all projected codebooks are $\epsilon$-orthogonal with $\epsilon=0.2$, suggesting that they form a reasonable overcomplete basis. It also shows the L2 norm of features across RQAE layers - earlier layers have larger norms than later layers, suggesting that earlier layers choose more "confident" directions, and the majority of MSE loss is concentrated in the first few layers.

Figure 2b. (top) MSE Loss and Cross Entropy Loss Difference for RQAE trained on Gemma 9B and 2B models. Both RQAE models are trained with num_quantizers=1024. (bottom) Reconstruction loss of RQAE vs SAE with varying width and L0 for Gemma 2 2B. Note that the same RQAE model is used (num_quantizers=1024), and reconstruction is stopped early, rather than different RQAE models trained with different num_quantizers.

Figure 2b shows that RQAE has higher capacity (as discussed in the previous section), by reducing reconstruction loss. This is with an order of magnitude fewer parameters than an SAE (e.g. an equivalent 1M feature SAE takes up 4.6GB of space, while RQAE takes 100MB).

RQAE doesn't have an equivalent concept of $L_0$, since all layers are used for every input. However, it still seems like an unfair comparison, since we are still using $1024$ unique vectors to reconstruct an activation with CE loss difference $< 0.01$.

We argue for this comparison for two reasons:

  1. If reconstruction loss is low, then we know that we have certainly captured more of the features present in the activation (i.e. less dark matter). Higher $L_0$ may mean that the features become less interpretable, but:
  2. As mentioned at the beginning of this section, the RQAE inductive bias during training does not come from sparsity. Thus, we shouldn't expect features to be learned by sparsity, and we shouldn't expect higher sparsity to correspond to more interpretable features.

Thus, RQAE does reduce dark matter (e.g. even a SAE with very high $L_0$ would do the same), but still learns interpretable features (as opposed to an SAE with high $L_0$). To show this, let's first define what a feature in RQAE is.

Defining a Feature

Features in SAEs are simply defined as the columns of the decoder matrix. When a feature is present in an activation, it also has an associated intensity, measured by ($\propto$) the cosine similarity of the activation to the corresponding row in the encoder matrix. We will use this to motivate the definition of a RQAE feature. Referring back to the algorithm at the beginning of this section:

Definition 1: Let $$C_i = \{\text{codebook indices for layer }i\} = \{1,2,...\}$$ $$\mathbb{C}_i = \{\text{codebook values for layer }i\} = \{c: c \in \mathbb{R}^{codebook\_dim}\}$$ where you can index $\mathbb{C}_i$ directly with elements of $C_i$ as $\mathbb{C}_i(c)$.
Then, a RQAE feature is defined as an ordered list of (up to) $n_{q}$ codebook indices (one per layer): $$f = [c_1, c_2, \dots, c_k]\text{ where } k \leq n_{q}, c_i \in C_i$$
Definition 2: Consider a token's activation $t$, that has been quantized by RQAE (see algorithm): $$t = [t_1, t_2, t_3, \dots, t_{n_q}]\text{ where }t_i \in C_i$$ Then, the intensity of a feature $f$ in $t$ is defined as: $$C(f, t) = \frac{\sum_{i \leq k} cos\_sim(\mathbb{C}_i(t_i), \mathbb{C}_i(c_i)) \cdot L_i}{\sum L_i}$$ where $L_i$ is the average $L_2$ norm of the columns in the decoder of layer $i$ in RQAE.

Similar to SAEs, we measure cosine similarity of an encoded version of the activation to measure intensity. Since we use FSQ to create codebooks, we know that they are evenly spaced out. It's important that RQAE uses all of these codebooks at any given layer; otherwise, measuring cosine similarity between codebooks would be uninformative, as a feature would not be able to differentiate between different tokens effectively.
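The intensity definition above can be sketched directly. The data layout and names below are hypothetical, and one point is an assumption on my part: the denominator $\sum L_i$ is taken over the same $k$ layers as the numerator, so that a token whose first $k$ codes match the feature exactly gets intensity $1$.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def intensity(feature, token_codes, codebook_values, layer_norms):
    """Intensity C(f, t) of a feature in a quantized token.

    feature:         the feature's codebook indices [c_1, ..., c_k]
    token_codes:     the token's codebook indices [t_1, ..., t_{n_q}]
    codebook_values: codebook_values[i][c] is the vector for codebook c at layer i
    layer_norms:     layer_norms[i] is L_i, the average decoder column norm of layer i
    """
    k = len(feature)
    weighted = sum(
        cos_sim(codebook_values[i][token_codes[i]], codebook_values[i][feature[i]])
        * layer_norms[i]
        for i in range(k)
    )
    # Assumption: normalize over the k layers that the feature actually uses.
    return weighted / sum(layer_norms[:k])


# Toy model: two layers sharing the same three 2-d codebook vectors.
values = [[(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]] * 2
norms = [3.0, 1.0]  # the earlier layer has the larger norm
token = [0, 1]
print(intensity([0, 1], token, values, norms))
print(intensity([0, 2], token, values, norms))
```

Weighting by $L_i$ means that agreement in early layers, which carry most of the reconstruction, counts for more than agreement in later ones.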

Figure 2c. Visual of how features are defined in RQAE.

Fig. 3a shows that even by layer $4$, all tokens in a reasonably large dataset can be almost completely partitioned into unique sets of codebooks.

| # Layers | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| # Used Codebooks | 544 | 234,994 | 3,279,894 | 4,383,319 |
| Average # Tokens per Codebook | 8606.1 | 19.92 | 1.43 | 1.07 |

Figure 3a. Codebook usage for the first four layers, across a dataset of $4M$ tokens.
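Statistics like those in Fig. 3a can be computed by counting distinct prefixes of each token's codebook indices. The sketch below assumes that "used codebooks" at layer $k$ means distinct sequences of the first $k$ indices (which is consistent with the counts exceeding $544$ after the first layer); the function name and toy data are made up.

```python
def prefix_usage(token_codes, depth):
    """Count distinct codebook prefixes of the given depth, and the mean
    number of tokens that share each prefix."""
    prefixes = {tuple(codes[:depth]) for codes in token_codes}
    return len(prefixes), len(token_codes) / len(prefixes)


# Toy dataset: four tokens, each quantized by three layers.
toy_codes = [[0, 1, 2], [0, 1, 3], [0, 2, 2], [1, 1, 2]]
for depth in (1, 2, 3):
    used, tokens_per_prefix = prefix_usage(toy_codes, depth)
    print(depth, used, round(tokens_per_prefix, 2))
```

As in the table, the number of used prefixes grows with depth until almost every token sits in its own cell.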

Unfortunately, the number of unique sets of codebooks that RQAE learns grows exponentially with the number of layers. At first glance, this makes RQAE seem pretty useless - what's the point of finding $544^{1024}$ new "features"?

Consider the way that SAEs are actually interpreted. The SAE is run on a test set, and you look at how each feature activates on different tokens on the test set to interpret the feature. If the test set does not include examples of the feature, then the feature can not be interpreted - in this sense, SAE features are only useful/interpretable given a sufficiently large test set.

When we choose random features (i.e. random codebooks per layer), we get un-interpretable results. However, if we start with a token in the dataset, we find that the RQAE feature is highly interpretable. We conjecture that this is because at each layer, we are choosing a region of codebooks that has a high density, since it's likely that other similar tokens are in the dataset. Potentially, having an even larger test dataset would result in more unique interpretable features.



When features are defined by a set of codebooks, a natural question becomes how the feature uses each codebook. We find that more codebooks directly lead to more specific features. This makes sense given our understanding of the feature hierarchy: earlier codebooks define coarse features in the hierarchy, while later codebooks define more refined and specific features.

Figure 3b. Example of a feature's specificity changing with number of layers. View Feature in Dashboard

What happens when we use all $n_q$ codebooks in the model? Since RQAE has low reconstruction loss, this becomes similar to just measuring intensity by cosine similarity in the activation space. This means that only the dominant "feature" in the activation will be modeled, but:

  1. Clearly, SAEs do something similar, at least for some features. Take the Gemmascope features for any Gemmascope SAE, and a large portion (from manual testing, 30+%) of them will be single-token features (i.e. they only fire on a single token).
  2. This simply means that you should be looking at lower layers in RQAE, which will be less specific.

Fig. 3c shows us that earlier layers are certainly doing something different than just ranking by cosine similarity (but related, as is necessary). Specifically, even up to layer $64$ we do not find the Spearman correlation to be above $0.5$, indicating a moderate-weak correlation.

Figure 3c. For a set of RQAE features, we take the top $128$ max activating examples. Then, we rank them by cosine similarity (of their original activations) to the top-most activating example, and also rank them on feature intensity by each layer of the RQAE model. Then, we measure the Spearman correlation between all of these rankings to the ranking by cosine similarity.
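For reference, the Spearman correlation used in Fig. 3c is the correlation between two rankings of the same items. A minimal sketch, assuming no tied scores (ties would need average ranks, which this version does not handle), with made-up example scores:

```python
def ranks(values):
    """Rank of each value (0 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-item rank difference."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four examples: cosine similarity vs. feature intensity at some layer.
cosine_scores = [0.91, 0.85, 0.60, 0.42]
layer_intensity = [0.70, 0.95, 0.50, 0.30]
print(spearman(cosine_scores, layer_intensity))
```

A value near $1$ would mean a layer ranks examples the same way raw cosine similarity does; values at or below $0.5$ indicate the layer's ranking is only loosely related.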

Feature Splitting and Absorption

Since RQAE models the feature hierarchy, we should define how features can split and be absorbed.

Definition 3: Given a RQAE feature $f$: $$f = [c_1, c_2, \dots, c_k]$$ The ancestors of $f$ are a set of RQAE features: $$A_m(f) = \{g: g = [c_1, c_2, \dots, c_m]\}\text{ } \forall m < k$$ Two features $f_1$ and $f_2$ are considered $m$-split features if $$C(a_1, a_2) \geq \theta\text{ } \forall a_1 \in A_m(f_1), a_2 \in A_m(f_2)$$ for some threshold $\theta$.

Essentially, two features are split if some subset of their ancestors are close, but they diverge at some point. Interestingly, this definition does mean that features can split at one layer, and then merge later on, but we couldn't find an example of this in practice. To illustrate what this looks like, let's look at an example:

Figure 4a. An example of a feature splitting at different layers. Each branch is its own feature, and intensity is calculated at the layer at which the ancestors split.

We developed heuristics for how to choose $\theta$ based on intensities on a test dataset, but we think there's much more detail that can be added to feature splitting in RQAE. We leave that up to future work.

Comparison to Gemmascope

It's hard to propose a new architecture when the existing architecture already has a large body of work and empirical evidence: for example, Gemmascope used $O($training compute of GPT-3$)$ to train SAEs. As a result, we use this section to directly compare Gemmascope SAEs to RQAE in a variety of ways.

Finding Equivalent Features

SAEs are useful because they develop a dictionary of features. Although this dictionary is static after training, the (relatively small) dictionary can be evaluated in depth to come to promising interpretability conclusions.

In contrast, as mentioned above, RQAE has exponentially many features - far too many to manually or even automatically interpret. Thus, it's interesting to know if:

  1. The same features that an SAE finds exist in RQAE
  2. There is an easy way to identify those features automatically in RQAE.

To validate if a RQAE feature is "equivalent" to a SAE feature, we measure the Pearson correlation between SAE intensities and RQAE intensities on the top-activating and median-activating examples of the original SAE feature. If this number is high, then the two methods generally recognize the same tokens at the same intensities, and the features are very close. This gives us a measure of determining if the same features exist in RQAE.

An easy way to identify a feature would be looking at its top-activating example. If, using this example, RQAE can faithfully reconstruct the feature (as defined above), then we could simply create features from every token in a dataset (which would be tractable to evaluate, e.g. the dataset we use in this work is 4M tokens), and find the most interpretable ones.

Figure 5a. We choose a set of features from Gemmascope. For any given feature, we (1) create a RQAE feature based on the quantization of its top activating example, and (2) create a RQAE feature by directly quantizing its encoding vector in the SAE. We then measure the Pearson Correlation between the intensities of these two RQAE features and the original Gemmascope feature on the top-activating and median-activating examples of the feature.

This result certainly shows that Gemmascope features exist and can be represented with RQAE, since the Pearson correlation of the encoding vector approaches $1$. This means that for every Gemmascope feature, there exists some RQAE feature which ranks tokens very similarly to it.

Using the top-activating example to create a feature, however, is a mixed result. Clearly, as expected, it underperforms using the encoding vector. However, a Pearson correlation of $0.7$ still suggests that there is a strong correlation between the two. We consider this to be good enough evidence to continue using top-activating examples to search for RQAE features, although we discuss more in the conclusion why this might need future work.

Finally, let's qualitatively look at some examples of Gemmascope features, and how they compare to RQAE features that use their top activating example. We looked for three particularly complex features, to ensure that we don't bias towards single-token features, which we expect RQAE to capture well already. All features were taken from the 16k SAE with $L_0=81$:

Figure 5b. We look at the three Gemmascope features mentioned above. We then define RQAE features on their top activating examples, and look at the corresponding top activating examples of these RQAE features.

Qualitatively, even for “complex” features, it seems that RQAE does a good job of representing the feature! We talk more in the conclusion about further ideas we find interesting which stem from these results.

Evaluations

Evaluating interpretability methods is still an open problem. In this work, we will use Eleuther’s auto-interpretation suite to explain and evaluate features. We compare directly against Gemmascope SAEs for all evaluations.

We perform all evaluations on the monology/pile-uncopyrighted dataset - specifically, the same subset of 36864 sequences (128 tokens each) that is also used in Neuronpedia to interpret Gemmascope SAEs. You can also see all features used, as well as the full evaluation traces, in the dashboard. As a result, you can directly compare all scores, explanations, and activating examples with Gemmascope easily.

For all evaluations, we throw out Gemmascope features without enough max-activating examples in the dataset - we do this by removing features with <bos> tokens as max-activating examples.

Creating a Dictionary of Features

To compare to SAEs, we need to actually define a single set of features. We have already motivated this method in the previous section, but to explicitly define how to make a dictionary of at least $k$ features using RQAE:

  1. Begin with some test dataset of tokens. We use monology/pile-uncopyrighted.
  2. Choose $k$ tokens with enough diversity from this dataset. We select uniformly from unique tokens.
  3. Create a set of $b$ RQAE features per token by using the quantized codebooks of the layer, and then choosing $b$ subsets of layers (i.e. the first 512, the first 256, the first 128, etc.)
  4. Optionally, filter features (we only consider layers $16$, $64$, and $256$ for evaluation).
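
The steps above can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes each dataset position has already been quantized into a list of per-layer code indices (`codes`), represents a feature as a prefix of those codes, and picks the first occurrence of each sampled unique token. The names `build_dictionary`, `codes`, and `depths` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A feature here is (dataset position, prefix of that position's code indices).
Feature = Tuple[int, Tuple[int, ...]]


def build_dictionary(
    tokens: Sequence[str],
    codes: Sequence[Sequence[int]],
    k: int,
    depths: Sequence[int] = (16, 64, 256),
    seed: int = 0,
) -> List[Feature]:
    """Sample k unique tokens and emit one feature per token per depth."""
    rng = random.Random(seed)
    # Assumption: use the first occurrence of each unique token string.
    first_position: Dict[str, int] = {}
    for pos, tok in enumerate(tokens):
        first_position.setdefault(tok, pos)
    unique = sorted(first_position)  # sorted for reproducibility
    chosen = rng.sample(unique, min(k, len(unique)))

    features: List[Feature] = []
    for tok in chosen:
        pos = first_position[tok]
        for depth in depths:
            # Skip depths deeper than the number of available layers.
            if depth <= len(codes[pos]):
                features.append((pos, tuple(codes[pos][:depth])))
    return features


# Toy data: 4 tokens, each quantized into 4 layers of code indices.
toy_tokens = ["the", "river", "the", "bank"]
toy_codes = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 3, 9], [0, 0, 0, 0]]
print(build_dictionary(toy_tokens, toy_codes, k=2, depths=(1, 2, 4)))
```

With $k$ tokens and $b$ depths this yields up to $k \cdot b$ features, which is why the dictionary contains at least $k$ features.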

This method is a naive approach to selecting features. However, from manual inspection it does seem to find features that are (1) unique, (2) interpretable, and (3) consistent among top-activating examples. We discuss future work for selecting features in the conclusion.

Results

There are two axes that are commonly considered to affect SAE interpretability: model width and $L_0$. We sweep across both these dimensions:

Figure 6a. Evaluation sweeps across model width and $L_0$ for SAEs. When sweeping $L_0$, we fix width at $16k$. When sweeping width, we fix $L_0 \approx 80$ (closest one available in Gemmascope). All results are averaged over $100$ features. Error bars represent standard error.

Across models and both axes, we see that RQAE generally outperforms SAEs in detection, and performs similarly in fuzzing. Generally, we see that more layers is better for RQAE, although this might saturate (e.g. $256$ layers performs the same as $64$ layers).
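
For reference, the error bars in Figure 6a are standard errors over the evaluated features. A minimal sketch of that computation, assuming the usual sample standard deviation (with an $n-1$ denominator); the scores below are made up and the helper name is hypothetical:

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-feature scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    # statistics.stdev uses the sample (n - 1) standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical detection scores for five features.
print(mean_and_standard_error([0.71, 0.64, 0.80, 0.69, 0.75]))
```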

Using RQAE

We provide all code and models for RQAE. This work only serves to introduce RQAE, and we’d love to see more work built on it. We briefly show some interesting results using RQAE:

Feature Visualization Dashboard

Many results in this work rely on qualitative claims made by looking at specific examples of features, activations, and dataset examples. We provide a hosted visualization dashboard that contains a large number of RQAE and Gemmascope features, as well as all traces from evaluations, so you can easily evaluate for yourself how well RQAE works.

Figure 7a. An example view of the dashboard presented. See here to try it out yourself!

Steering

We show preliminary results of steering using RQAE features. We could not find evaluations to quantitatively compare steering with SAE or activation steering - so we will leave this for future work. Steering is performed by:

  1. During generation, take the latest given LLM activation. Quantize it with RQAE into a set of codebooks.
  2. Take the codebooks of the feature you are steering with. Turn the activation codebooks towards the feature codebooks dependent on some strength $\kappa$ and the cosine similarity between the codebook and the feature’s codebook.
  3. Continue for all tokens during generation.
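
The steps above can be sketched as follows. The exact update rule is not spelled out here, so this sketch assumes a simple linear blend: each of the first `num_layers` activation code vectors is moved towards the corresponding feature code vector by a weight equal to the strength times the (non-negative) cosine similarity. The function names and the blending rule are assumptions for illustration, not the released implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def steer_codes(
    activation_codes: Sequence[Vector],
    feature_codes: Sequence[Vector],
    strength: float,
    num_layers: int,
) -> List[List[float]]:
    """Blend the first num_layers activation codes towards the feature codes."""
    steered = [list(v) for v in activation_codes]
    depth = min(num_layers, len(activation_codes), len(feature_codes))
    for i in range(depth):
        a, f = activation_codes[i], feature_codes[i]
        # Assumed rule: weight grows with strength and with similarity;
        # dissimilar (or opposed) codes are left untouched.
        w = strength * max(0.0, cosine(a, f))
        steered[i] = [(1.0 - w) * x + w * y for x, y in zip(a, f)]
    return steered


# Toy example: two layers of 2-dimensional code vectors.
activation = [[1.0, 1.0], [0.0, 1.0]]
feature = [[2.0, 0.0], [1.0, 0.0]]
print(steer_codes(activation, feature, strength=0.8, num_layers=2))
```

In this sketch, the steered codes would then be decoded back into an activation and written into the residual stream before the next token is generated; that decoding step depends on the model and is omitted.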

Here are some qualitative examples of steering:

Steering on Gemma 2 2B using the "Bodies of Water" feature. Use the tabs above to view outputs with/without steering.

Steering parameters: temperature=0.5, num_layers=5, strength=0.8, repetition_penalty=1.5
prompt="An interesting animal is the"
An interesting animal is the African elephant. The elephant is the largest land animal on the planet. The elephant is a large animal, but it can also be quite small. The smallest elephant is the dwarf elephant, which is only about 2 feet tall. The African elephant is the largest elephant on the planet. It can weigh up to 14,000 pounds and stand up to 13 feet tall. The elephant is an amazing animal. It is a very intelligent animal and can learn things very quickly. The elephant is also a very social animal. It lives in large herds and is very protective of its young. The elephant is also a
An interesting animal is the European rabbit, which is a symbol of the European continent. It is a member of the Leporidae family, which includes the hare and the hare. This animal is considered a wild animal, but it is a domestic animal in many countries. The European rabbit is a small animal that is not dangerous to humans. It is one of the most common animals in the world. It has many uses, including food, clothing, and shelter. The European rabbit is a small, furry animal that is native to Europe. It is also known as the European hare or the European rabbit. It is a member of the Leporidae
An interesting animal is the Psammodromus sceleratus, a desert lizard that lives in the sand dunes of the Sahara Desert. The Psammodromus sceleratus is a burrowing lizard that has a special adaptation to live in the sand dunes of the Sahara Desert. It is a nocturnal lizard that is very well camouflaged in its natural habitat, which is the sand dunes of the Sahara Desert. It is a small lizard that can reach a length of 10 cm, and it has a very interesting color pattern that allows it to blend in with the sand.
An interesting animal is the giant earthworm. This is a worm that can grow up to 3 meters long. It can be found in Australia, New Zealand, South America and Africa. It is a very strange looking creature. It has an elongated body, with a segmented structure. It has ten pairs of legs, which are used for locomotion. It also has a mouth, which is used for feeding. It is a scavenger, and it feeds on decaying matter. The giant earthworm is a very interesting animal. It is a great example of how evolution can create new and strange creatures.
An interesting animal is found in the desert. It is a small, dark-colored snake with a long tail. The snake is usually found in the desert, but it can also be found in other habitats. The snake is a carnivore and feeds on small animals. The snake is also a venomous snake and can be dangerous. The desert snake is a species of snake that is found in the deserts of North America, Africa, and Asia. The desert snake is a small, slender snake that is about 12 inches long. The desert snake is a brown or gray color with a white or yellow belly. The desert snake is a venomous snake and
An interesting animal is the Olinguito, a small cat that lives in the mountains of Colombia, Ecuador, and Peru. It is the only member of the family Bassaridae, the only cat family in the world that is not in the Felidae family. The olinguito is a small cat that is about the size of a house cat, but it is a bit more compact. It is a little less than a foot tall and weighs about 1.5 pounds. Its fur is a light brown color with dark spots on it. The olinguito is
An interesting animal is the Japanese Giant Hornet. It is a species of wasp and is also known as the Japanese giant hornet or murder hornet. The Japanese giant hornet is a social insect and is one of the largest insects in the world. It is a large wasp that can grow up to 2.5 inches long and has a wingspan of 3 inches. The Japanese giant hornet is a social insect, meaning that it lives in large colonies with several queens and workers. The colonies can consist of up to
An interesting animal is the giant Australian octopus (Octopus tetricus). It is a member of the octopus family, the cephalopods, and has the largest brain of any invertebrate. Its brain is about the size of a human brain. The giant Australian octopus is a marine animal that lives in the ocean. It is found in the waters of Australia, New Zealand, and the Pacific Ocean. The giant Australian octopus is a large animal, measuring up to 2 meters (6.6 feet) in length. It has a body that is shaped like a football, with eight arms. The arms are long and thin, and they are used to catch prey.
An interesting animal is the sea cucumber. Its appearance is very unusual, but it is quite common. It is found in the oceans of the world, in the seas, and in the waters of the continental shelf. The body of a sea cucumber is soft, but not very elastic. It is covered with small, not very hard plates. This shell is composed of calcium carbonate. The body of a sea cucumber can be very long, and a very large one can be up to 1.5 meters long. The length of a sea cucumber can also vary. It depends on the species, as well as on the sex of the animal. In
An interesting animal is a giant squid. It is the largest invertebrate. The body length of an adult is approximately 12 meters, and the weight is 200 kg. The body of a giant squid is very similar to the body of a human. The size of its body is determined by the presence of an organ called a mantle. This organ is a bag that is located between the body and the beak. The mantle is a very important organ for the squid. It is responsible for the functioning of the organism. The giant squid is a large animal that lives in the ocean. It is a carnivore and eats fish,
An interesting animal is the sea turtle. It is a reptile, but it lives in the ocean. It is a very slow animal. It can’t swim very fast. That is why it is called a turtle. But it is a very good swimmer. It can swim very fast. It has a shell, which is like a roof. It is made of bone, but it is very hard. The shell is like a helmet, and it is very strong. It is also very protective. It has a long tail. It can also swim very fast. It is very fast and it has a long tail. It is a very interesting animal.
An interesting animal is the Platyhelminthes, which is also known as the flatworms. These flatworms have a single, simple cavity, and they are also known as acoelomates. The animals that belong to this phylum are bilaterally symmetrical, and they are also found in the phylum. The animals are also known as platyhelminth, and they are also known as the flatworms. The animals are also known as the turbellarians, and they are also known as the turbellarians. The animals are also known as the
An interesting animal is the tardigrade, or "water bear," which is a microscopic arachnid. It can survive in the vacuum of space. It can withstand temperatures as hot as 140 degrees Celsius and as cold as minus 200 degrees Celsius. It can survive without food for 10 years. It can survive in the vacuum of space. It can withstand temperatures as hot as 140 degrees Celsius and as cold as minus 200 degrees Celsius. It can survive without food for 10 years. The tardigrade is a microscopic arachnid. It is a microscopic arachnid. It has been found in
An interesting animal is the Starfish. They are often found attached to rocks, and they are a very important part of the ecosystem. They are also very unique in that they can regenerate their body parts. As a result, if a starfish loses a limb, it can grow a new one. The starfish is also known as the sea star. They are found in the ocean. What is a starfish? Starfish are invertebrates that live in the ocean. They are not fish, but they do have a similar appearance. Starfish are typically found in the northern hemisphere, but they can also
An interesting animal is a sea cucumber. It is a marine animal that is characterized by a soft body. In the water, it is an almost colorless creature, but when it is threatened, it becomes completely transparent. The animal is a hermaphrodite, but the females are larger than the males. The sea cucumber is a filter feeder, which means that it filters out food from the water. It is a filter feeder that is able to do this because of the tube that is located on its mouth. The tube is called a proboscis and it is used to filter out food. The sea cucumber is a filter feeder that is able to do
An interesting animal is the Spiny-tailed iguana, or Iguana serpentina. It is a species belonging to the Squamata family, more specifically to the Iguanaidae subfamily. It is a reptile, but it has some characteristics of amphibians, as it can also breathe through its skin. It is a cold-blooded animal, and its body temperature is always the same

Notice that in this case, a very low top_k is used. We have found in practice that this is dependent on the feature being steered (similar to SAEs). Generally, top_k is a coarser parameter that you can set to the maximum value before outputs start degenerating. Then, use strength to fine-tune the intensity of steering.

Steering on Gemma 2 2B using a "Compound Word with Hyphen" feature (token below). Use the tabs above to view outputs with/without steering.

... know an Anglican-Catholic priest...


Steering parameters: temperature=0.5, num_layers=32, strength=0.8, repetition_penalty=1.5
prompt="An interesting thing"
An interesting thing happened in the last few days. I was reading a book and I came across a term that I had never heard before. I looked it up and found that it was a term that was used in the 19th century, and I found it interesting and intriguing. I was wondering if anyone else had heard of this term and what it meant. The term is called a “disjunction” and it means that something is not true. It is a term that is used in the context of a philosophy of science, and it is used to describe a situation where two things are not true. The term “disjunction”
An interesting thing happened the other day. I was in the middle of a meeting and someone asked me the question “What does your job entail?” I’ve been asked this question before, but I’ve never been able to answer it. It’s a very difficult question to answer because it doesn’t give any insight into what I do. I’ve tried to answer it before, but I’ve always felt like I was leaving something out. I’ve tried to answer it in a way that would make people think I’m doing something really important. I’ve tried to answer it in a way that would
An interesting thing happened in the first few days of the new year. I was in the middle of a very busy week, and I was so busy that I wasn’t able to sit down and watch the first few episodes of the new season of The Mandalorian. I’ve been a fan of the show since it first aired, and I’ve been following it closely ever since. But I was so busy that I forgot to watch the first episode of the season. I’d always been interested in the show, but I wasn’t sure if I was going to be able to watch the first episode.
An interesting thing happened in the last couple of weeks of the 2013 season. The Cubs started to play better baseball. They were winning games, they were winning series, and they were winning the division. They had a lot of things to be happy about, and they should be. They had a lot of things to be happy about, and they should be. But they also had a lot of things to be unhappy about, and they should be. This is a team that has not won a playoff game since 1945. This is a team that has not won a World Series since
An interesting thing about the Star Wars franchise is that it has managed to keep its fanbase engaged for over 40 years now. It’s not just the original trilogy or the prequels that have kept the audience hooked, but also the spin-offs like Rogue One and The Mandalorian. One of the most popular characters in the entire franchise is the bounty hunter Boba Fett, played by Temuera Morrison. He first appeared in The Empire Strikes Back and then made a comeback in The Book of Boba Fett. In a recent interview, Temuera Morrison talked about his experience working
An interesting thing happened to me yesterday: I got a call from a man who was very distressed. He had a 16-year-old daughter who was going through a difficult time with her boyfriend. The boyfriend was a bully. He had been harassing the girl for some time, but it was getting worse. He was threatening to kill her. The daughter was afraid to tell her parents. She was afraid of the boyfriend, and she was afraid of her parents. I had a long talk with the girl’s father, and we decided to get the police involved. The young man was arrested and charged with assault and battery
An interesting thing happened while I was researching this post. I found a lot of interesting things about the history of the term “sissy”. I had no idea that it originated from the word “sissie”. I also did not know that the word “sissy” has been around for a long time. It has been around since the 1500’s, which is almost 500 years ago. The word “sissy” is used to describe someone who is feminine, weak, or cowardly. The word is used to describe a boy or man who is effeminate. The word has
An interesting thing about the 2020 election is that it is the first election in which the race for president is being decided by a majority of the voters. In 1960, John F. Kennedy defeated Richard Nixon by a margin of just 110,000 votes (0.2 percent) in a three-way race. In 1972, Richard Nixon defeated George McGovern by a margin of 520,000 votes (0.7 percent) in a two-way race. In 1980, Ronald Reagan defeated Jimmy Carter by a margin of
An interesting thing to observe about the two-volume set under review here is that, like the proverbial dove of the title, it is a rather small volume. It is a collection of essays that is as well-written and well-edited as the best-of-the-best in the genre, but it is not an academic-type-of-deal. It is not a scholarly-style-deal. It is, in a sense, an academic-style-deal, but a rather unusual one. It is a little like the 19th-century-type-of-deal of-a-man, with a-
An interesting thing to note about the late-19th century and early-twentieth-century English-language German-speaking literature is its penchant for the Latinate. The latter-day Germanophile is apt to be a bit of a snob, and the Romantic-tinged Germanophile is apt to be a bit of a snob as well (and not in the sense that one would have to be in order to be a proper Anglo-Saxon. The latter is a bit of a stretch, but the former is a bit more plausible. The point is that the whole business of-a-sort of-German-with-a-
An interesting thing that happened to me came to light as a result of the recent fuss about the re-establishment of the Russian Orthodox Church in the West. The first thing that comes to mind of course is the iconoccending 19th century Russian Orthodox Church of the Russian Empire. But it is the more-or-less orthodox Greek Orthodox Orthodox Church of the 18th century of the Orthodox Church of the 19th century in the West. The Orthodox Church of the 19th century in the West was the so-called autocephrite (or autoceph-something-to-something-something
An interesting thing about the English-speaking world is that it has a tendency to be rather parochial in its outlook. The British Isles has a rather narrow view of the worlda view that is in some ways a sort of a-historical a-historical, that is, in terms of the term British-ness, in the sense of the more-or-less-non-Anglo-Saxon speaking Anglo-Saxon. In the case of the United States, it has been a rather long-standing tendency to be rather provincialistic. The British-ness of the American-ness is a rather recent development, and in any case the American-ness
An interesting thing about this book is its origins. The author, Richard Llew-Jones, was a man of letters and a novelist, and a man of letters does not write novels. The fact that a man of letters is a man of letters could be a matter of course, but it has been a matter of course-ness for a century. The only thing about the affair that was a bit of a surprise was the extent to which the two were conjoined. The author's name was used as a pseudonym for the author of the "The Lord of the Rings" trilogy, which is to say that the author's name was used
An interesting thing about the English language is that it has been so widely used as a lingua franca of sorts for a long time, it's rather surprising how much stuff is in there in the sense that it's English-ish (and I use that term in the broadest sense possible sense sense of the word, which is to say that it's not just English-ish, but rather, a kind of English-ish in the sense that it's-ish, if you follow me-ish, you get the drift? -ishness. The point is that the language is rather, to use the technical term, a-centric

The above example is a more abstract feature. Note that in this case, we can set top_k higher without degeneration. Also, notice that even though all query tokens are the same (a hyphen), the steering feature can still be applied.

Finally, let’s look at how strength can affect steering.

Steering on Gemma 2 2B using a "Psychosis and Schizophrenia" feature (token shown below). Use the tabs above to view outputs with different values of strength.

... and not depressive psychosis) will...



Steering parameters: temperature=0.5, num_layers=128, repetition_penalty=1.5
prompt="Let's talk about"
Let's talk about the future of work and how it will affect your organization. In this session, we look at trends that are changing organizations today through an employee-centric lens to understand what employees want from their employers in a post COVID world About our Speaker Anant Agarwal is CEO & Founder @ edX , Board Member@ MITx . An expert on online learning with 20+ years experience leading technology companies including Microsoft Research Labs; Director for Internet Services - Windows Division (1985); Executive Vice President/Chief Technology Officer – Akamai Technologies Inc.; Chairman Emeritus– University College London Institute For Learning Science( UCLILS)
Let's talk about the 10th anniversary of this beautiful, wonderful series. I know that it was not on my list but if you think carefully and deeply then maybe there is a reason why we are going to discuss today one thing in particular which has been happening since last year or so... There have always been those who say "I don't like anything from Naruto" (which means Shippuden) because they were too young when all these things happened during their childhood days! It doesn't matter how old she gets; her work will never change for anyone else except herself alone forevermore without any explanation whatsoever either way around!! So let
Let's talk about the topic of today. Today we are going to be talking specifically on how you can get rid of back pain and I know that this is a subject matter where there has been so much misconception, especially in our time now because people don't understand what they should do when it comes to their body issues like for example with regards to neck pains or whatever else but let us just focus upon one aspect at hand which would definitely deal directly into your spine problems here as well if only those who suffer from them could have access towards some kind off relief through any type whatsoever then maybe things might actually work out alright without being too difficult afterwards either way
Let's talk about it. I was watching the first episode of season 3 and I noticed something that really bothered me, so much in fact...that is why this theory came into my mind.... This thing happened at least three times throughout all episodes: And then there were these moments which are also characteristic for schizophrenia (but not only): The following picture shows a part with an obvious schizophrenic-like aura A lot can be said here but let us come to another subject - depression! The main character has been depressed since childhood because she lost her mother when young girl during some kind of accident . And now we have what you see
Let's talk about. Now, I want to concentrate on the first part of this disorder which is called psychosis and then we will go through it all in detail now because there are some other disorders that exist like mania for example but they don’t have psychotic features associated with them so those ones would be left out here or something else might occur as well ,etc., etc.. But if you look at a person from an outside perspective without any underlying conditions such things usually happen during acute phase when someone gets affected by depression/anxiety anxiety-depression spectrum (ie depressive episodes). Now let me explain what i mean before going into details . So basically these symptoms

Conclusion

We think that RQAE is a reasonable approach that can either complement or replace SAEs in interpreting LLMs. The performance on evaluations, as well as qualitative inspection, suggests that RQAE is at least as good as SAEs at finding interpretable features, and addresses the concerns of feature splitting and dark matter that, up until now, have resisted progress with standard SAEs.

Furthermore, we hope that our code, model weights, and dashboard prove useful in evaluating and implementing RQAE, and we will actively be making updates to them in order to help the community adopt RQAE.

Future Work

There is an incredible amount of surface area to explore with interpretability and RQAEs. Here, we provide an ordered list of what we think is most important to validate and continue developing with RQAE:

  1. Using SAE and RQAE together: If SAEs are telescopes, then RQAEs are microscopes. However, it’s clear that SAEs are still great at finding a set of interpretable features. We think that a lot of work can be done using SAEs to develop a dictionary of features, and RQAE to further refine or organize these features. Previous work has done something similar by training SAEs on the residuals of existing SAEs - we think that RQAE is a more natural approach to doing this, since we have already shown that RQAE can reconstruct features in existing SAEs.
  2. Finding better features: The approach presented in this work to choose features is incredibly naive. We’re sure that better approaches exist, given that RQAE is so good at partitioning tokens already. One such approach would be finding diverse directions in the first four layers, and using that to choose distinct sets of tokens.
  3. Fixing/Running More Evals: There’s a good chance that detection and fuzzing are biased towards favoring RQAE - for example, they are likely also biased towards overly specific features and explanations by definition, and we know that more levels in RQAE correspond to ranking by cosine similarity $\implies$ overly specific features. New evals might be needed to properly evaluate this approach, although our qualitative examination gives us confidence that RQAE will still be competitive.
  4. Rethinking the Eval Dataset: RQAE can almost completely partition all tokens in the current eval dataset of $4M$ tokens, within 4 layers. This seems to suggest that the eval dataset might be a bottleneck to finding more features, rather than the model itself! Since we know that RQAE has much higher capacity, we could see a shift from spending more time curating train datasets, to instead training on a large general dataset and spending time curating an evaluation dataset.

This is certainly not an exhaustive list of ideas to try with RQAE. We’re excited to see what everyone comes up with, and thank you for reading this far!