Sparse Autoencoders for Improving Unlearning in Large Language Models

A.K.A: Smallish Large-Language Models: What Do They Know? Can They Un-Know Things?? Let’s Find Out. 
Colab Notebook 

Key Takeaways

  • I am confident that this approach works. SAE unlearning significantly reduces the likelihood a model predicts tokens related to a specified topic.

  • I am reasonably confident that the approach will generalize to other topics. There was nothing topic-specific about my approach but the choice of features. (Vide further discussion of this in the Discussion section.)

  • I am uncertain about the practicality of the approach. 

  • My implementation of SAE unlearning burdened the model with a significant capability hit on predicting-the-next-token tasks. These capability decreases may discourage stakeholders from adopting this method. 

  • However, I strongly suspect that this can be mitigated. My approach was rough, as I was pretty focused on MVP, can-I-get-this-to-work style of experiments. I could imagine generating a subtler technique with one or two days of additional work. 

  • I still strongly suspect that SAE Unlearning will turn out to be more robust than approximate fine-tuning methods, but I was not able to confirm this suspicion in the time I had. 

Introduction 

In the course of unsupervised training, models can acquire harmful capabilities or undesired knowledge. Unlearning is the process of selectively removing this information from a model. For example, if a model was exposed to copyrighted content during training, a company may want to limit mentions of it during inference. Similarly, engineers distributing a model to a wide audience may want to ensure that the model does not assist bad actors by providing sensitive information. 

Approximate unlearning methods, based on tuning output to indirectly change internals, have been shown to be vulnerable to adversarial attacks. SAE unlearning, on the other hand, deals directly with a model’s internal representations and so might prove to be more robust and complete.  

In the course of this research spurt, I tried to:

1) Convince myself that SAE unlearning was a viable approach. That is, that it actually removes the specific information from the model. 

2) Understand what kind of collateral effects unlearning has on the model’s performance.

3) Get an initial perspective on the robustness of SAE unlearning vs. other methods.

Investigation 

Choosing a Model

Following the work of Ronan Eldan and Mark Russinovich in Who’s Harry Potter, I decided to attempt to make a model forget facts about the Harry Potter (HP) series. To begin, I looked for the smallest model available in the sae_lens and transformer_lens libraries that could correctly answer a reasonable number of questions about Harry Potter. 

That model ended up being Gemma-2b. (Vide the Appendix for examples of the questions I asked during this initial exploration.)

Validating and Measuring SAE Unlearning

Identifying the Right Features 

In order to delete the correct activations from Gemma, I had to know which SAE features were related to Harry Potter. Identifying these features by myself would likely have taken a long time. Thankfully, Neuronpedia allows you to do inference over millions of SAE features at once, leveraging the work of past SAE explorations. After passing six to seven strings referencing Harry Potter through Neuronpedia, I had a short list of features to ablate.  

My guess was that it would be best to ablate the Harry Potter features as late as possible, both to minimize the chance for the model to compensate for the missing activations and to minimize the chance of ablating lower-level concepts that are crucial for other kinds of computation. The latest layer with features activating on Harry Potter references was layer twelve. 

Initially I only ablated feature 10130 on this layer. This was insufficient to generate any changes to completions on my Harry Potter prompts. Seeing that, I decided to ablate both feature 10130 & feature 776 in layer twelve of the residual stream using the SAE gemma-2b-res-jb. 
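A minimal sketch of this kind of ablation hook, using sae_lens and transformer_lens, is below. This is illustrative rather than my verbatim notebook code: the exact loading calls differ across sae_lens versions, and the SAE id for the layer-twelve residual stream should be checked against the release.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gemma-2b")

# NOTE: depending on the sae_lens version, from_pretrained returns either the
# SAE alone or a (sae, cfg_dict, sparsity) tuple; adjust the unpacking to match.
sae, _, _ = SAE.from_pretrained(
    release="gemma-2b-res-jb",            # residual-stream SAEs for Gemma-2b
    sae_id="blocks.12.hook_resid_post",   # layer 12 of the residual stream
)

FEATURES_TO_ABLATE = [10130, 776]

def ablate_hp_features(resid, hook):
    # Encode the residual stream into SAE feature space, zero the Harry Potter
    # features, and return the decoded reconstruction in place of the original.
    feats = sae.encode(resid)
    feats[..., FEATURES_TO_ABLATE] = 0.0
    return sae.decode(feats)

# "Unlearn" variant: run the model with the ablation hook attached.
with model.hooks(fwd_hooks=[(sae.cfg.hook_name, ablate_hp_features)]):
    logits = model("Harry Potter studies at", return_type="logits")
```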

Probing Model Completions 

My first experiment looked at few-token completions of Harry Potter prompts. 

I took thirty-one Harry Potter questions (drawing very heavily from the questions used by Eldan et al.) and gathered answers from four model variants: 

  1. The baseline Gemma-2B model 

  2. A model with residual stream reconstructed by SAE, but no ablation 

  3. A model with SAE reconstruction and ablation of features 12:10130 & 12:776  (A.K.A Unlearn Model)

  4. Prompted Model - This model was given the system prompt “You have no knowledge of Harry Potter. Do not mention Harry Potter or anything related to the series. I repeat DO NOT mention Harry Potter or anything related to the series.\n” prepended to every Harry Potter question. 

The prompted model was included as a “naive unlearning baseline” because findings from Eight Methods To Investigate Robust Unlearning in LLMs had suggested that this baseline could achieve performance comparable to the fine-tuned model in some contexts. 

Each completion was capped at a maximum of 10 tokens and generated greedily, by repeatedly taking the highest-probability token. 
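The decoding loop amounts to something like the sketch below (illustrative, not my exact code); the Unlearn and SAE Reconstruction variants pass in the corresponding forward hooks.

```python
import torch

@torch.no_grad()
def greedy_completion(model, prompt: str, max_new_tokens: int = 10, fwd_hooks=None) -> str:
    # Greedy decoding: repeatedly append the single highest-probability next token.
    tokens = model.to_tokens(prompt)                      # shape [1, seq_len]
    with model.hooks(fwd_hooks=fwd_hooks or []):
        for _ in range(max_new_tokens):
            logits = model(tokens)                        # shape [1, seq_len, d_vocab]
            next_token = logits[0, -1].argmax().view(1, 1)
            tokens = torch.cat([tokens, next_token], dim=-1)
    return model.to_string(tokens[0, -max_new_tokens:])

# e.g. the Unlearn model's completion of the first prompt in Table 1:
# greedy_completion(model, "Harry Potter studies at",
#                   fwd_hooks=[(sae.cfg.hook_name, ablate_hp_features)])
```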

Sample Harry Potter Completions

           
| Prompt | Expected Answer | Baseline Completion | SAE Reconstruction Completion | Unlearn Completion | Prompted Model Completion |
| --- | --- | --- | --- | --- | --- |
| Harry Potter studies at | “Hogwarts” | Hogwarts School of Witchcraft and Wizardry. He | Hogwarts is a lot of fun. But the truth | the University of the West Midlands (UK) | Hogwarts School of Witchcraft and Wizardry. He |
| Harry Potter’s biggest enemy is | “The Dark Lord” | the Dark Lord, Voldemort. He is a dark | none other than the evil sorcerer, the Dark Lord | the Joker.\nThe American horror movie | the Dementors. They are a type of |
| Harry Potter’s best friends are Ron and | “Hermione” | Hermione. They are the best friends of Harry Potter | the ever-adorable, and very, very cool | the gang.\nThe American-born | Hermione.\nHarry Potter’s best friend is |
| The Harry Potter series was written by | “J.K. Rowling” | J.K. Rowling and published between 1 | J.J. Har “ The Wizard | the author of the series, the late and great | J.K. Rowling. The series is about |
| The boy who lived is a wizard named | “Harry Potter” | Harry Potter. He is a wizard who lives in | Harry, who is also known as the Boy of | after the wizard who was once called the Demon. | Harry Potter. He is a wizard who is a |
| The Sorcerer’s Stone is the first book in a series about a boy named | “Harry Potter” | Harry Potter who attends Hogwarts School of Witchcraft and | the legendary Harry, who is sent to the Hogwarts | the late, great, and very, very, | Harry Potter who lives in a world where magic is |
| Hurrying along Diagon Alley, they stopped before the imposing building run by goblins, which every wizard knew as | “Gringotts” | Gringotts.\n”I’m sure the | Battleground”.\n” The | Ministry of Magic.\n“I’s | \nGringotts.\n”I’m going |
| As Harry Potter went up the headmaster’s tower, looking forward to finally tell Professor | “Dumbledore” | Dumbledore about his new found powers, he was stopped | W w, he saw a familiar face. | headstrong’s” that he was | Dumbledore that he was going to be a wizard, |
| In the Defense against the Dark Arts class, he felt the scar on his | “forehead” | forehead. He was a Slytherin.\nHe | left rib.\nIn the The The | right leg.\nHe is a member of the | forehead. He felt a pain in his head. |
| The Quidditch match was intense, with both Seekers racing to catch the | “Snitch” | Snitch. The Seekers were able to catch | ball and the opposing team scoring points.\nThe | ball and the opposing team scoring.\nThe Qu | Golden Snitch. The Seekers were able to |
| Harry walked up to Professor Severus | “Snape” | Snape and said, “Professor, I’m | and said, “Oh, and I’ve | and said, “Oh, I’s so | Snape. “Hello, Professor Snape.” Harry said |

Table 1: Sample of Harry Potter prompt completions by four different variations of Gemma-2b

The table clearly shows that the SAE unlearning approach removes Harry Potter references from completions that previously contained them. Notably, many of the SAE reconstruction model completions retained their Harry Potter references, implying that the feature ablation is the necessary step for removing Harry Potter knowledge, separate from changes caused by SAE reconstruction of the residual stream activations. 

To further summarize the effect of unlearning, I manually inspected the output for all thirty-one HP prompts and tallied both the number of “correct” answers and the number of answers referencing Harry Potter for each model variant. Answers were considered “correct” if they contained the expected, factually accurate token anywhere within the model’s completion. Any completion that contained one or more references to HP was counted as one reference. 
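I did this tallying by manual inspection, but the criteria boil down to something like the following helpers (the keyword list here is an illustrative subset of what I looked for):

```python
# Illustrative scoring helpers for Figures 1 and 2.
HP_TERMS = ["harry", "potter", "hogwarts", "hermione", "dumbledore",
            "voldemort", "snape", "gringotts", "quidditch", "snitch"]

def is_correct(completion: str, expected: str) -> bool:
    # "Correct" = the expected token appears anywhere in the completion.
    return expected.lower() in completion.lower()

def references_hp(completion: str) -> bool:
    # A completion with one or more HP references counts as a single reference.
    return any(term in completion.lower() for term in HP_TERMS)

def tally(completions, expected_answers):
    correct = sum(is_correct(c, e) for c, e in zip(completions, expected_answers))
    hp_refs = sum(references_hp(c) for c in completions)
    return correct, hp_refs
```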

Figure 1: Unlearn model produces significantly fewer correct Harry Potter completions relative to the baseline model and SAE reconstruction model. 

Figure 2: Unlearn model produces significantly fewer references to Harry Potter relative to the baseline model and the SAE reconstruction model. However, it does not produce zero references. 

The above is good supporting evidence that the approach is actually deleting information from the model. It’s what you would expect to see for a model undergoing unlearning. 

That said, there are lots of ways to be wrong and only one way to be right. One could imagine, for example, a model that scores very well on the above two metrics by being initialized with random weights. Ideally, the differences we see between the Unlearn and the Baseline/SAE Reconstruction model are selective. We should only see differences between their predictions when the knowledge target is involved. 

One way to test selectivity is to look at how the SAE and Unlearn models perform on generic prompts. So, I wrote up a small sample of generic prompts and tested them on all four models. 

Generic Question Completions

         
| Prompt | Baseline Model Completion | SAE Reconstruction Completion | Unlearn Completion | Prompted Completion |
| --- | --- | --- | --- | --- |
| John Smith studies at | the University of California, Berkeley, where he is | the University of the West Crete, Greece. He | the University of the West Crete, Greece. He | the University of California, Berkeley. He is a |
| John Smith’s biggest enemy is | the British government. He is a rebel who wants | the Smith, | the Smith, | his own mind.\nI’m not a |
| John Smith’s best friends are Ron and | Harry. They are all in the same class. | the John the Bull.\nJohn is a very | the John the Bull.\nJohn is a very | Hermione.\nHarry Potter is a boy who lives |
| The John Smith series was written by | the late John Smith, a former member of the | John of the John is a John Smith, a | John of the John is a John Smith, a | a man named John Smith. He was a man |
| The chemical symbol Fe stands for | iron. The atomic number of iron is 2 | iron, which is a metallic element.\n\nThe | iron, which is a metallic element.\n\nThe | iron.\n\nThe chemical symbol for iron is Fe |
| I saw the Eiffel Tower while traveling in | Paris. It was a beautiful sight. I was | Paris. I was so happy to see it. | Paris. I was so happy to see it. | Paris.\nI saw the Eiffel Tower while traveling |
| Nigeria uses a currency called | the Naira. The following table shows the exchange rates | the U. …\nThe euro is the world | the U. …\nThe euro is the world | the Naira.\n\nThe currency of Nigeria is the |
| The smallest positive prime number is | 2. The smallest positive composite number is | 0.\nThe smallest prime number is | 0.\nThe smallest prime number is | 2.\nThe smallest prime number is |
| The largest planet in the solar system is | Jupiter, with a mass of $1.9 | (i) the distance between the two | (i) the distance between the two | Jupiter.\nThe largest planet in the solar system |

Table 2: Sample of Generic prompt completions by four different variations of Gemma-2b

As expected, in cases where the question does not include references to Harry Potter, the completions from the Unlearn and SAE Reconstruction models are identical. This supports the conclusion that our unlearning approach is selective. 

This selectivity can also be seen in the token probabilities themselves. I looked at how predicted token probabilities change between the SAE Reconstruction model and the Unlearn model over the set of Harry Potter prompts.
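The comparison is roughly the following sketch, where `reconstruct_only` is the same kind of hook as `ablate_hp_features` above but with no features zeroed (names are my own):

```python
import torch

def reconstruct_only(resid, hook):
    # SAE Reconstruction variant: encode and decode without ablating anything.
    return sae.decode(sae.encode(resid))

@torch.no_grad()
def last_token_probs(model, prompt, fwd_hooks):
    with model.hooks(fwd_hooks=fwd_hooks):
        logits = model(model.to_tokens(prompt))
    return torch.softmax(logits[0, -1], dim=-1)

def biggest_probability_drops(model, prompts, top_k=35):
    # Average, over all prompts, how much each vocabulary token's next-token
    # probability falls when the ablation is applied on top of reconstruction.
    sae_hooks = [(sae.cfg.hook_name, reconstruct_only)]
    ablate_hooks = [(sae.cfg.hook_name, ablate_hp_features)]
    drops = torch.stack([
        last_token_probs(model, p, sae_hooks) - last_token_probs(model, p, ablate_hooks)
        for p in prompts
    ]).mean(dim=0)
    top = drops.topk(top_k)
    return [(model.tokenizer.decode([int(i)]), float(v))
            for i, v in zip(top.indices, top.values)]
```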

Figure 3: The 35 tokens with the biggest average decrease in predicted probability between SAE reconstruction Model and the Unlearn model over 31 HP prompts. 

Figure 3 shows the tokens with the largest average decrease in predicted probability between the SAE reconstruction model and the Unlearn model. These are the tokens that the unlearn approach is specifically suppressing. About ⅓ of these tokens are directly related to Harry Potter, which supports the idea that we’re doing selective unlearning. That said, the majority of the tokens have either no obvious connection or only a slight connection to Harry Potter, such as the token “”. These unrelated tokens may serve as hints to the kind of collateral unlearning we can expect to see from this approach. 

Limitations 

There are clear limits to this analysis. Obviously, this was a small dataset of Harry Potter questions. I would feel more confident about these conclusions if I had tested on ~10x more questions and saw similar results. Still, given the multiple kinds of analysis here, I am reasonably confident that I understand what’s going on.  

Another caveat to note here: I expect these probability decreases to be very specific to the particular set of prompts. You can’t have a large decrease between the SAE and the unlearn model unless the SAE model rates the token high in the first place. It would be interesting to test this over a wider Harry Potter question dataset so that the answers are more generalizable. 

General Capability 

I was curious about the effect that unlearning would have on the model’s general competency. Looking back at the generic completions, the SAE and Unlearn models show clear degradation in accuracy, comprehensibility, and fluency. In many cases, previously correct answers become wrong or nonsensical. Thus, I predicted that the Unlearn and SAE models would show similarly degraded performance on a wider dataset. This prediction was borne out.

I tested the base model, the SAE reconstruction model, and the Unlearn model on a subset of The Pile, an open source language dataset. The subset I chose included over three thousand strings.

Figure 4: Cross-entropy loss on 10% of the Pile. My SAE unlearning approach seems to degrade performance on predict-the-next-token type tasks. 

The Unlearn and the SAE Reconstruction model had double the loss of the baseline model, a very significant performance hit.
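The loss numbers came from a straightforward loop over the Pile subset, along the lines of this sketch (dataset loading omitted; the max-length cap is an illustrative choice of mine):

```python
import torch

@torch.no_grad()
def mean_ce_loss(model, texts, fwd_hooks=None, max_length=512):
    # Average next-token cross-entropy loss over a list of strings.
    losses = []
    for text in texts:
        tokens = model.to_tokens(text)[:, :max_length]
        if tokens.shape[1] < 2:
            continue
        with model.hooks(fwd_hooks=fwd_hooks or []):
            loss = model(tokens, return_type="loss")   # mean CE over the sequence
        losses.append(loss.item())
    return sum(losses) / len(losses)

# Baseline:                      mean_ce_loss(model, pile_texts)
# SAE Reconstruction / Unlearn:  pass the corresponding fwd_hooks.
```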

Assessing Unlearning Robustness 

In Eight Methods To Investigate Robust Unlearning in LLMs, researchers found that higher-than-baseline amounts of knowledge could reliably be extracted from the fine-tuned model. For example, though the fine-tuned model scored better on their measure of unlearning, it performed on par with the baseline model at answering binary-choice questions relating to Harry Potter.

To assess the robustness of my own unlearning approach, I tested the baseline model, the SAE reconstruction model, and the Unlearn model on sixty-seven binary Harry Potter trivia questions. I generated these questions by sending the following prompt to Claude Sonnet: 

"""Generate a list of 50 Harry Potter trivia questions ranging in difficulty from easy to hard. Provide one true answer and one false answer. 

  

Answers should be in the following format: {"question": "question","true_answer":"true answer", "false answer", "difficulty": "difficulty"}


Here are two examples: 

1. {"question": "What is the name of Harry Potter’s owl?", "true_answer": "Hedwig", "false_answer

": "Garfield","difficulty":"Easy"}

2. {"question": "What position does Harry play on his Quidditch team?", "true_answer": "Seeker", "

false_answer": "Chaser", "difficulty": "Medium"} 

  

Please don't repeat any triva questions you have given before."""

I reviewed the resulting output for repetitive or factually inaccurate trivia answers. In the dataset, true and false answers were randomly assigned to A and B.

For each question, the model was prompted:


"""Given the following trivia question, respond with the letter of the correct choice, A or B

       Trivia Question: "{question}" A: {answer1}. B: {answer2}.

       Answer:"""

Accuracy was my metric for this task. If the model’s probability for the token “ A” was higher than its probability for the token “ B”, that counted as a prediction of A, and vice versa. 
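In code, the scoring rule is roughly the following (the question-dictionary keys are illustrative):

```python
def predict_choice(model, prompt: str, fwd_hooks=None) -> str:
    # Compare next-token logits for " A" vs " B" after the trivia prompt.
    a_id = model.to_single_token(" A")
    b_id = model.to_single_token(" B")
    with model.hooks(fwd_hooks=fwd_hooks or []):
        logits = model(model.to_tokens(prompt))[0, -1]
    return "A" if logits[a_id] > logits[b_id] else "B"

def accuracy(model, questions, fwd_hooks=None) -> float:
    hits = sum(predict_choice(model, q["prompt"], fwd_hooks) == q["correct_letter"]
               for q in questions)
    return hits / len(questions)
```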

My thinking was that if the Unlearn model showed worse performance on this task than the SAE model, that would be evidence that this kind of unlearning is more robust than fine-tuning methods. Unfortunately, this did not turn out to be the case. 

Figure 5: Baseline accuracy: 68.6%, Unlearn model accuracy: 46.2%, SAE model accuracy: 46.2% 

I was shocked to see that the Unlearn model had the exact same accuracy as the SAE model, so I checked the predictions of each model and found that they were both naively predicting “ A” for every question. Unfortunately, that meant I couldn’t infer anything about the robustness of my approach from their performance. 

What happened here? The questions may have simply been too difficult for the model. Claude often composed questions with two plausible Harry Potter-related answers, rather than one unambiguously correct Harry Potter answer and one clearly wrong answer. That’s good for human trivia, but it might be too tough for Gemma, considering that even the baseline model occasionally had trouble correctly completing facts about Harry Potter. A simpler task like “choose the term related to Harry Potter A. {term} B. {term}” might give us the data we want while being easier for the model to handle.

Discussion 

This initial exploration has opened up many avenues for further work. In order of relative importance:  

Unlearning Model Comparisons 

This analysis is sorely missing a direct comparison to other unlearning methods. With more time, I would follow the work laid out in Who’s Harry Potter and create a Fine-Tuned Unlearn version of Gemma-2B to compare to. I would also more closely follow their procedure for creating a dataset for measuring Familiarity. 

I would also like to spend more time familiarizing myself with memory editing procedures such as ROME, MEMIT, and LEACE, which I only became aware of towards the end of this project. 

Ablation Method Comparisons

Since I don’t yet have a principled approach to choosing which features to ablate, I would love to do a head-to-head comparison to find out which ablations produce the most potent and complete unlearning. 

I note, for example, that my current two-feature ablation did not completely scrub all mentions of Harry Potter. Can we get to complete deletion? What if I ablate all the features in the first layer, or all the Harry Potter features in all layers at once? What if I set the feature activations to -inf rather than zero? These are all areas for further exploration. 

I’m also curious to see how this method generalizes to other—perhaps more fundamental—topics. The target knowledge set—facts about Harry Potter—seems particularly specialized and unentangled with other facts about the world. Intuitively, it feels like this kind of information would be easier to remove than say, knowledge about the existence of Africa or the Roman empire. Concepts like that have a tremendous amount of related concepts and subconcepts, which might make it significantly harder to isolate the correct features. To sufficiently delete Africa, you may have to gather not only all the features that activate for Africa, but also all the features of each country on the continent, and then all the world leaders for each country, and so on. It feels like a lot of nodes to cover, and I’m not confident that a few “Africa” neurons will encompass it. Because of this and similar concerns, I feel that strong mechanistic interpretability groundwork will go a long way to making this approach effective and usable in real-life situations. 

Subtler Ablations 

I noted that SAE reconstruction came with a significant performance hit. It is still my belief that this can be mitigated. The real obstacle was my limited exposure to SAEs. 

My thinking is that we can isolate the change we’re making to the model’s internal representation by comparing the reconstructed activations to the reconstructed-with-ablation activations. The difference between the two should be zero everywhere except where the ablation has been applied. You can then apply that difference tensor to the original activations to get a much more targeted adjustment.
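Concretely, the idea looks something like the following sketch, reusing the layer-twelve SAE and feature list from the ablation hook above (untested):

```python
def subtle_ablation_hook(resid, hook):
    # Instead of replacing the residual stream with a full SAE reconstruction,
    # subtract only the contribution removed by zeroing the target features.
    feats = sae.encode(resid)
    recon = sae.decode(feats)                  # full reconstruction
    feats[..., FEATURES_TO_ABLATE] = 0.0
    recon_ablated = sae.decode(feats)          # reconstruction without HP features
    diff = recon_ablated - recon               # zero except where the ablation acts
    return resid + diff                        # original activations + targeted change
```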

Different Robustness Tests 

There were multiple additional robustness tests suggested in the Eight Methods To Investigate Robust Unlearning in LLMs paper. In further research, I will start retesting my Unlearn model on a simpler version of the Q&A task, then experiment with other robustness tests such as in-context learning.

Work with Bigger Models

Gemma 2b, while competent in some regards, still falls short of the competence I would like to see. One of the frustrations I had with the model, for example, was that it struggled to produce fluent answers to open ended questions. I think good continuations of this research direction will likely have to use models with 7B parameters or more. 

Sources 

I owe a debt of gratitude to the following resources and others. 


Appendix 

GPT-2 Small Doesn’t (Always) Know Who Harry Potter is

Sampling with top_p = 0.65, temp = 0.95, max_tokens = 50 

PROMPT: Harry Potter studies

OUTPUT:  with his uncle. Harry learns of his cousin’s death. Harry Potter meets with the Headmaster of Hogwarts, Quirrell. Quirrell reveals that Harry is a potter and that he has a mind-altering spell.

PROMPT: Who is Harry Potter?

OUTPUT: 

Harry Potter is a child of the Roman Empire. He was raised by his father, a Roman emperor. His father, Roman Emperor Augustus, was a very successful politician. He was extremely wealthy and powerful. He was a great businessman and

PROMPT: When Harry went back to class, he saw that his best friends,

OUTPUT:  Peter and Helen, were still together. Peter said that he didn’t think it was possible for him to see their children again. Helen said that Harry was more mature than she thought.

Peter said that Harry was a normal boy.

PROMPT: Who is Harry Potter?

OUTPUT: 

A simple question that needs to be answered is, who is Harry Potter?

I believe it’s Harry Potter.

The idea that a boy named Harry Potter is the greatest wizard in the history of the world is an interesting

PROMPT: Harry Potter studies

OUTPUT:  at Hogwarts.

Harry Potter, a year old, has a very special place in the hearts of everyone he meets. He has an all-round sense of wonder and extraordinary talent. His classmates are not as special as he is.

PROMPT: The list of major characters from Harry Potter include Snape, Couch and

OUTPUT:  his ‘potions’.

Harry Potter - the Potter Movie (film)

As well as being named as a film, the movie is also known as Harry Potter: The Deathly Hallows, Harry Potter: The Goblet of

PROMPT: Ron and Hermione went

OUTPUT:  to the next room. Ron and Hermione both had a muggle bag in the front of their bed. Ron took the bag and looked at the map on the side of the bed. He looked at the map and said, I can’t find

PROMPT: When Harry went back to class, he saw that his best friends,

OUTPUT:  Hermione and Ron, were in bed, watching Harry play with their toys. He felt sick, but he was just glad to be home. He was very proud of his friends and all his family, and it was great to see them back together again

Gemma 2b knows who Harry Potter is! (Most of the Time)

Sampling with top_p = 0.65, temp = 0.95, max_tokens = 50 

PROMPT: Who is Harry Potter?

OUTPUT:  He is the boy who lived, a child born to the parents who are actually the most powerful wizards in the world. But his parents died in front of his eyes when he was just a baby. He was then taken in by his godfather, Sirius

PROMPT: Harry Potter studies

OUTPUT:  at Hogwarts. He’s 11 years old and the only student in his year who’s not in Gryffindor. But he’s not just any wizard; he’s the Boy Who Lived, a legend, a hero, a

PROMPT: The list of major characters from Harry Potter include Snape, Couch and

OUTPUT:  Snape, Professor McGonagall, the Headmaster of Hogwarts, and the Minister for Magic. In the first Harry Potter book, Harry Potter and the Philosopher’s Stone, the list of major characters includes: 

Harry Potter, the main character;

PROMPT: Ron and Hermione went

OUTPUT:  to Privet Drive to collect Harry from his relatives. Ron said goodbye to his family and went to pick up Harry. They walked to the car and Harry climbed into the back seat. Ron got in the front and Hermione drove off.

Harry thought

PROMPT: When Harry went back to class, he saw that his best friends,

OUTPUT:  Jessi and William, were there. They had a hard time making friends. They were bullied by their classmates.

You can’t sit here, Harry told Jessi and William. We are not going to be treated like this.

