
Generative AI in the Real World: Emmanuel Ameisen on LLM Interpretability

Reading Time: 17 minutes

In this episode, Ben Lorica and Anthropic interpretability researcher Emmanuel Ameisen get into the work Emmanuel’s team has been doing to better understand how LLMs like Claude work. Listen in to find out what they’ve uncovered by taking a microscopic look at how LLMs function—and just how far the analogy to the human brain holds.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Transcript

This transcript was created with the help of AI and has been lightly edited for clarity.

00.00
Today we have Emmanuel Ameisen. He works at Anthropic on interpretability research. And he also authored an O’Reilly book called Building Machine Learning Powered Applications. So welcome to the podcast, Emmanuel. 

00.22
Thanks, man. I’m glad to be here. 

00.24
As I go through what you and your team do, it’s almost like biology, right? You’re studying these models, but increasingly they look like biological systems. Why do you think that’s useful as an analogy? And am I actually accurate in calling this out?

00.50
Yeah, that’s right. Our team’s mandate is to basically understand how the models work, right? And one fact about language models is that they’re not really written like a program, where somebody sort of by hand described what should happen in that logical branch or this logical branch. Really, the way we think about it is that they’re almost grown. What that means is they’re trained over a large dataset, and on that dataset, they learn to adjust their parameters. They have many, many parameters—often, you know, billions—in order to perform well. And so the result of that is that when you get the trained model back, it’s sort of unclear to you how that model does what it does, because all you’ve done to create it is show it tasks and have it improve at how it does these tasks.

01.48
And so it feels similar to biology. I think the analogy is apt because for analyzing this, you kind of resort to the tools that you would use in that context, where you try to look inside the model [and] see which parts seem to light up in different contexts. You poke and prod in different parts to try to see, “Ah, I think this part of the model does this.” If I just turn it off, does the model stop doing the thing that I think it’s doing? It’s very much not what you would do in most cases if you were analyzing a program, but it is what you would do if you’re trying to understand how a mouse works. 

02.22
You and your team have discovered surprising things about how these models do problem-solving, the strategies they employ. What are some examples of these surprising problem-solving patterns? 

02.40
We’ve spent a bunch of time studying these models. And again I should say, whether it’s surprising or not depends on what you were expecting. So maybe there’s a few ways in which they’re surprising. 

There’s various bits of common knowledge about, for example, how models predict one token at a time. And it turns out if you actually look inside the model and try to see how it’s sort of doing its job of predicting text, you’ll find that actually a lot of the time it’s predicting multiple tokens ahead of time. It’s sort of deciding what it’s going to say a few tokens, and presumably a few sentences, from now in order to decide what it says now. That might be surprising to people who have heard that [models] are predicting one token at a time. 

03.28
Maybe another one that’s sort of interesting to people is that if you look inside these models and you try to understand what they represent in their artificial neurons, you’ll find that there are general concepts they represent.

So one example I like is you can say, “Somebody is tall,” and then, inside the model, you can find neurons activating for the concept of something being tall. And you can have the model read the same text, but translated into French: “Quelqu’un est grand.” And then you’ll find the same neurons that represent the concept of somebody being tall are active.

So you have these concepts that are shared across languages and that the model represents in one way, which is again, maybe surprising, maybe not surprising, in the sense that that’s clearly the optimal thing to do, or that’s the way that. . . You don’t want to repeat all of your concepts; like in your brain, you don’t want to have a separate French brain, an English brain, ideally. But surprising if you think that these models are mostly doing pattern matching. Then it is surprising that, when they’re processing English text or French text, they’re actually using the same representations rather than leveraging different patterns. 
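
You can get a rough feel for this finding without Claude-scale tooling. The sketch below compares a small multilingual model’s mean-pooled activations for the same sentence in English and French; the model choice, layer, and pooling are illustrative assumptions, not Anthropic’s actual methodology.

```python
# A rough, illustrative probe (not Anthropic's tooling): compare a small
# multilingual model's internal representations of the same sentence in two
# languages. High similarity hints at shared, language-agnostic features.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "distilbert-base-multilingual-cased"  # any small multilingual model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def mean_hidden_state(text: str, layer: int = -1) -> torch.Tensor:
    """Mean-pool one layer's activations over all tokens in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[layer].mean(dim=1).squeeze(0)

english = mean_hidden_state("Someone is tall.")
french = mean_hidden_state("Quelqu'un est grand.")
unrelated = mean_hidden_state("The stock market closed lower today.")

cos = torch.nn.functional.cosine_similarity
print("EN vs FR (same meaning):", cos(english, french, dim=0).item())
print("EN vs unrelated:        ", cos(english, unrelated, dim=0).item())
```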

04.41
[In] the text you just described, is there a material difference between the reasoning and nonreasoning models? 

04.51
We haven’t studied that in depth. I will say that the thing that’s interesting about reasoning models is that when you ask them a question, instead of answering right away, they spend a while writing some text thinking through the problem, oftentimes using math or code. You know, trying to think: “Ah, well, maybe this is the answer. Let me try to prove it. Oh no, it’s wrong.” And so they’ve proven to be good at a variety of tasks that models which immediately answer aren’t good at. 

05.22
And one thing that you might think if you look at reasoning models is that you could just read their reasoning and you would understand how they think. But one thing that we did find is that you can look at a model’s reasoning, the text that it writes down as it samples, right? It’s saying, “I’m now going to do this calculation,” and in some cases, when for example the calculation is too hard, if at the same time you look inside the model’s brain, inside its weights, you’ll find that actually it could be lying to you.

It’s not at all doing the math that it says it’s doing. It’s just kind of doing its best guess. It’s taking a stab at it, just based on either context clues from the rest or what it thinks is probably the right answer—but it’s totally not doing the computation. And so one thing that we found is that you can’t quite always trust the reasoning that is output by reasoning models.

06.19
Obviously one of the frequent complaints is around hallucination. So based on what you folks have been learning, are we getting close to a, I guess, much more principled mechanistic explanation for hallucination at this point? 

06.39
Yeah. I mean, I think we’re making progress. We study that in our recent paper, and we found something that’s pretty neat. So hallucinations are cases where the model will confidently say something that’s wrong. You might ask the model about some person. You’ll say, “Who’s Emmanuel Ameisen?” And it’ll be like “Ah, it’s the famous basketball player” or something. So it will say something when instead it should have said, “I don’t quite know. I’m not sure who you’re talking about.” And we looked inside the model’s neurons while it’s processing these kinds of questions, and we did a simple test: We asked the model, “Who’s Michael Jordan?” And then we made up some name. We asked it, “Who’s Michael Batkin?” (which it doesn’t know).

And if you look inside, there’s something really interesting that happens, which is that basically these models by default—because they’ve been trained to try not to hallucinate—have this default set of neurons that is just: If you ask me about anyone, I’ll just say no. I’ll just say, “I don’t know.” And the way that the models actually choose to answer is, if you mention somebody famous enough, like Michael Jordan, there are neurons for “Oh, this person is famous; I definitely know them” that activate, and those turn off the neurons that were going to promote the answer of “Hey, I’m not too sure.” And so that’s why the model answers in the Michael Jordan case. And that’s why it doesn’t answer by default in the Michael Batkin case.

08.09
But if instead you now force the neurons for “Oh, this is a famous person” to turn on, even when the person isn’t famous, the model is just going to answer the question. And in fact, what we found is that in some hallucination cases, this is exactly what happens. It’s that basically there’s a separate part of the model’s brain, essentially, that’s making the determination of “Hey, do I know this person or not?” And then that part can be wrong. And if it’s wrong, the model’s just going to go on and yammer about that person. And so it’s almost like you have a split mechanism here, where, “Well I guess the part of my brain that’s in charge of telling me I know says, ‘I know.’ So I’m just gonna go ahead and say stuff about this person.” And that’s, at least in some cases, how you get a hallucination. 
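
The intervention Emmanuel describes is done with Anthropic’s internal feature-level tooling, but a loosely related technique, activation steering, can be sketched on a small open model. Everything below (GPT-2, the choice of layer, the contrast prompts, the steering strength) is an illustrative assumption rather than the setup from the paper.

```python
# A crude sketch of the "force the feature on" idea via activation steering on
# GPT-2: estimate a rough "this name is familiar" direction from contrasting
# prompts, add it to the residual stream, and watch how generation changes.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary middle layer for the sketch

def block_activations(prompt: str) -> torch.Tensor:
    """Residual-stream activations at LAYER for the last token of `prompt`."""
    acts = {}
    def hook(module, inputs, output):
        acts["h"] = output[0][:, -1, :].detach()
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return acts["h"]

# Rough "familiar person" direction: famous-name minus made-up-name activations.
steer = block_activations("Michael Jordan is") - block_activations("Michael Batkin is")

def generate(prompt: str, strength: float = 0.0) -> str:
    def hook(module, inputs, output):
        # Add the steering vector to the block's hidden states.
        return (output[0] + strength * steer,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    handle.remove()
    return tokenizer.decode(out[0][ids.shape[1]:])

print("baseline:", generate("Michael Batkin is"))
print("steered: ", generate("Michael Batkin is", strength=4.0))
```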

08.54
That’s interesting because a person would go, “I know this person. Yes, I know this person.” But then if you actually don’t know this person, you have nothing more to say, right? It’s almost like you forget. Okay, so I’m supposed to know Emmanuel, but I guess I don’t have anything else to say. 

09.15
Yeah, exactly. So I think the way I’ve thought about it is there’s definitely a part of my brain that feels similar to this thing, where you might ask me, you know, “Who was the actor in the second movie of that series?” and I know I know; I just can’t quite recall it at the time. Like, “Ah, you know, this is how they look; they were also in that other movie”—but I can’t think of the name. But the difference is, if that happens, I’m going to say, “Well, listen, man, I think I know, but at the moment I just can’t quite recall it.” Whereas the models are like, “I think I know.” And so I guess I’m just going to say stuff. It’s not that the “Oh, I know” [and] “I don’t know” parts [are] separate. That’s not the problem. It’s that they don’t catch themselves sometimes early enough like you would, where, to your point exactly, you’d just be like, “Well, look, I think I know who this is, but honestly at this moment, I can’t really tell you. So let’s move on.” 

10.10
By the way, this is part of a bigger topic now in the AI space around reliability and predictability, the idea being, I can have a model that’s 95% [or] 99% accurate. And if I don’t know when the 5% or the 1% is inaccurate, it’s quite scary, right? So I’d rather have a model that’s 60% accurate, but I know exactly when that 60% is accurate. 

10.45
Models are getting better about hallucinations for that reason. That’s pretty important. People are training them to just be better calibrated. If you look at the rates of hallucinations for most models today, they’re so much lower than for previous models. But yeah, I agree. And I think there’s a hard question there, which is that, at least in some of the examples we looked at, it’s not that you can clearly see just from looking at the inside of the model, “Oh, the model is hallucinating.” What we can see is the model thinks it knows who this person is, and then it’s saying some stuff about this person. And so I think the key bit that would be interesting for future work is to then try to understand, well, when it’s saying things about people, when it’s saying, you know, this person won this championship or whatever, is there a way we can tell whether those are real facts or whether they’re confabulated? And I think that’s still an active area of research. 

11.51
So in the case where you hook up Claude to web search, presumably there’s some sort of citation trail where at least you can check, right? The model is saying it knows Emmanuel and then says who Emmanuel is and gives me a link. I can check, right? 

12.12
Yeah. And in fact, I feel like it’s even more fun than that sometimes. I had this experience yesterday where I was asking the model about some random detail, and it confidently said, “This is how you do this thing.” I was asking how to change the time on a device—it’s not important. And it was like, “This is how you do it.” And then it did a web search and it said, “Oh, actually, I was wrong. You know, according to the search results, that’s how you do it. The initial advice I gave you is wrong.” And so, yeah, I think grounding results in search is definitely helpful for hallucinations. Although, of course, then you have the other problem of making sure that the model doesn’t trust sources that are unreliable. But it does help. 

12.50
Case in point: science. There are tons and tons of scientific papers now that get retracted. So just doing a web search isn’t enough; what it should do is also cross-verify that search against whatever database there is of retracted papers.

13.08
And you know, as you think about these things, I think you get into effort-level questions, where right now, if you go to Claude, there’s a research mode where you can send it off on a quest and it’ll do research for a long time. It’ll cross-reference tens and tens and tens of sources.

But that will take, I don’t know, it depends: sometimes 10 minutes, sometimes 20 minutes. And so there’s a question like, when you’re asking, “Should I buy these running shoes?” you don’t care, [but] when you’re asking about something serious or you’re going to make an important life decision, maybe you do. I always feel like as the models get better, we also want them to get better at knowing when they should spend 10 seconds or 10 minutes on something. 

13.47
There’s a surprisingly growing number of people who go to these models to ask for help with medical questions. And as anyone who uses these models knows, a lot of it comes down to your prompt, right? A neurosurgeon will prompt this model about brain surgery very differently than you or I would, right? 

14.08
Of course. In fact, that was one of the cases that we studied actually, where we prompted the model with a case that’s similar to one that a doctor would see. Not in the language that you or I would use, but in the sort of “This patient is age 35, presenting symptoms A, B, and C,” because we wanted to try to understand how the model arrives at an answer. And so the question had all these symptoms. And then we asked the model, “Based on all these symptoms, answer in only one word: What other tests should we run?” Just to force it to do all of its reasoning in its head. It can’t write anything down. 

And what we found is that there were groups of neurons that were activating for each of the symptoms. And then there were two different groups of neurons that were activating for two potential diagnoses, two potential diseases. And then those were promoting a specific test to run, which is sort of what a practitioner would do in a differential diagnosis: The person either has A or B, and you want to run a test to know which one it is. And then the model suggested the test that would help you decide between A and B. And I found that quite striking because I think, again, setting aside the question of reliability for a second, there’s a depth of richness to just the internal representations of the model as it does all of this to produce one word. 

This makes me excited about continuing down this path of trying to understand the model; the model’s done a full round of diagnosing someone and proposing something to help with the diagnosis, just in one forward pass, in its head. As we use these models in a bunch of places, I really do want to understand all of the complex behavior like this that happens in their weights. 
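
The “answer in only one word” trick is easy to try on any small open instruct model: cap generation at a couple of tokens so whatever reasoning happens has to happen inside the forward pass. The model name and prompt below are purely illustrative, not the clinical prompt from the study.

```python
# A minimal sketch of the "answer in only one word" probing trick: cap generation
# at a few tokens so multistep reasoning has to happen in the forward pass,
# not in written-out text. Model name and prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small instruct model with a chat template
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

messages = [{"role": "user", "content":
             "A 35-year-old patient presents with symptoms A, B, and C. "
             "Answer in only one word: what test should we run next?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```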

16.01
In traditional software, we have debuggers and profilers. Do you think, as interpretability matures, our tools for building AI applications could include something like the equivalent of debuggers that flag when a model is going off the rails?

16.24
Yeah. I mean, that’s the hope. I think debuggers are a good comparison actually, because debuggers mostly get used by the person building the application. If I go to, I don’t know, claude.ai or something, I can’t really use a debugger to understand what’s going on in the backend. And so that’s the first stage for these tools: The people building the models use them to understand the models better. We’re hoping that we’re going to get there at some point. We’re making progress. I don’t want to be too optimistic, but I think we’re on a path here. With this work I’ve been describing, the vision was to build this big microscope, basically, where the model is doing something, it’s answering a question, and you just want to look inside. And just like a debugger will show you basically the states of all of the variables in your program, we want to see the state of all of the neurons in this model.

It’s like, okay: The “I definitely know this person” neuron is on and the “This person is a basketball player” neuron is on—that’s kind of interesting. How do they affect each other? Should they affect each other in that way? So I think in many ways we’re sort of getting to something close, where at least you can inspect the execution of your running program like you would with a debugger. You’re inspecting the execution of the language model. 
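
As a very loose analogue of that microscope, you can hook every layer of a small open model and dump which MLP neurons fire hardest on a given token, much like inspecting variable state at a breakpoint. This is a generic PyTorch-hooks sketch, not Anthropic’s tooling, and raw neurons are much noisier than the learned features the team actually works with.

```python
# A debugger-flavored sketch: hook every layer of GPT-2 and dump which MLP
# neurons fire hardest on the final token of a prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

records = {}
def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: post-activation MLP neurons, shape [batch, seq, 4 * hidden]
        records[layer_idx] = output[0, -1, :].detach()
    return hook

handles = [block.mlp.act.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

with torch.no_grad():
    model(**tokenizer("Michael Jordan played the sport of", return_tensors="pt"))

for handle in handles:
    handle.remove()

for layer, acts in records.items():
    top = torch.topk(acts, k=3)
    print(f"layer {layer:2d} | top neurons {top.indices.tolist()} "
          f"| activations {[round(v, 2) for v in top.values.tolist()]}")
```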

17.46
Of course, then there’s a question of, What do you do with it? That I think is another active area of research where, if you spend some time looking at your debugger, you can say, “Ah, okay, I get it. I initialized this variable the wrong way. Let me fix it.”

We’re not there yet with models, right? Even if I tell you, “This is exactly how this is happening and it’s wrong,” the way that we make them, again, is we train them. So really, you have to think, “Ah, can we give it other examples so that it would learn to do it the right way?” 

It’s almost like we’re doing neuroscience on a developing child or something. But then our only way to actually improve them is to change the curriculum of their school. So we have to translate from what we saw in their brain to “Maybe they need a little more math. Or maybe they need a little more English class.” I think we’re on that path. I’m pretty excited about it. 

18.33
We also open-sourced the tools to do this a couple months back. And so, you know, this is something that can now be run on open source models. And people have been doing a bunch of experiments with them, trying to see if they show some of the same behaviors that we saw in the Claude models that we studied. And so I think that also is promising. And there’s room for people to contribute if they want to. 

18.56
Do you folks internally inside Anthropic have special interpretability tools—not ones that only the interpretability team uses but [ones] that you can now push out to other people in Anthropic as they’re using these models? I don’t know what these tools would be. It could be what you describe, some sort of UX or some sort of microscope into a model. 

19.22
Right now we’re sort of at the stage where the interpretability team is doing most of the microscopic exploration, and we’re building all these tools and doing all of this research, and it mostly happens on the team for now. I think there’s a dream and a vision to have this. . . You know, I think the debugger metaphor is really apt. But we’re still in the early days. 

19.46
You used the example earlier [where] the part of the model “That is a basketball player” lights up. Is that what you would call a concept? And from what I understand, you folks have a lot of these concepts. And by the way, is a concept something that you have to consciously identify, or do you folks have an automatic way of, “Here’s millions and millions of concepts that we’ve identified and we don’t have actual names for some of them yet”?

20.21
That’s right, that’s right. The latter one is the way to think about it. The way that I like to describe it is basically, the model has a bunch of neurons. And for a second let’s just imagine that we can make the comparison to the human brain, [which] also has a bunch of neurons.

Usually it’s groups of neurons that mean something. So it’s like, I have these five neurons that are on; that means that the model’s reading text about basketball or something. And so we want to find all of these groups. And the way that we find them basically is in an automated, unsupervised way.

20.55
The way you can think about it, in terms of how we try to understand what they mean, is maybe the same way that you do in a human brain, where if I had full access to your brain, I could record all of your neurons. And [if] I wanted to know where the basketball neuron was, probably what I would do is I would put you in front of a screen and I would play some basketball videos, and I would see which part of your brain lights up, you know? And then I would play some videos of football and I’d hopefully see some common parts, like the sports part and then the football part would be different. And then I play a video of an apple and then it’d be a completely different part of the brain. 

And that’s basically exactly what we do to understand what these concepts mean in Claude: We just run a bunch of text through and see which parts of its weight matrices light up, and that tells us, okay, this is probably the basketball concept. 

The other way we can confirm that we’re right is we can just turn the concept off and see if Claude then stops talking about basketball, for example.
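
Here is a compact sketch of that loop (show the model “basketball” text, see what lights up, then turn it off) using raw MLP neurons in GPT-2. The layer, the contrast prompts, and the number of neurons are arbitrary illustrative choices; the real pipeline finds features in an unsupervised way and is far more careful.

```python
# Sketch of the identify-then-ablate loop: contrast activations on basketball
# text vs. unrelated text, pick the biggest movers, zero them, and compare the
# model's next-token prediction with and without the ablation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
torch.set_grad_enabled(False)
LAYER = 8  # arbitrary layer for the sketch

def mlp_acts(text):
    captured = {}
    def hook(module, inputs, output):
        captured["a"] = output[0].mean(dim=0)  # average over sequence positions
    h = model.transformer.h[LAYER].mlp.act.register_forward_hook(hook)
    model(**tokenizer(text, return_tensors="pt"))
    h.remove()
    return captured["a"]

# "Play basketball videos vs. something else" and diff the responses.
diff = mlp_acts("He dribbled past the defender and dunked the basketball.") \
     - mlp_acts("She peeled the apple and sliced it into the bowl.")
concept_neurons = torch.topk(diff, k=20).indices  # crude "basketball" neurons

def next_token(prompt, ablate=False):
    def hook(module, inputs, output):
        if ablate:
            output[:, :, concept_neurons] = 0.0  # turn the concept "off"
        return output
    h = model.transformer.h[LAYER].mlp.act.register_forward_hook(hook)
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
    h.remove()
    return tokenizer.decode(int(logits.argmax()))

print("intact: ", next_token("Michael Jordan is famous for playing"))
print("ablated:", next_token("Michael Jordan is famous for playing", ablate=True))
```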

21.52
Does the nature of the neurons change between model generations or between types of models—reasoning, nonreasoning, multimodal, nonmultimodal?

22.03
Yeah. I mean, at the base level all the weights of the model are different, so all of the neurons are going to be different. So the sort of trivial answer to your question [is] yes, everything’s changed. 

22.14
But you know, it’s kind of like [in] the brain, the basketball concept is close to the Michael Jordan concept.

22.21
Yeah, exactly. There’s basically commonalities, and you see things like that. We don’t at all have an in-depth understanding of anything like you’d have for the human brain, where it’s like “Ah, this is a map of where the concepts are in the model.” However, you do see that, provided that the models are trained on and doing kind of the same “being a helpful assistant” stuff, they’ll have similar concepts. They’ll all have the basketball concept, and they’ll have a concept for Michael Jordan. And these concepts will be using similar groups of neurons. So there’s a lot of overlap between the basketball concept and the Michael Jordan concept. You’re going to see similar overlap in most models.

23.03
So channeling your previous self, if I were to give you a keynote at a conference and I give you three slides—this is in front of developers, mind you, not ML researchers—what are the one to three things about interpretability research that developers should know about or potentially even implement or do something about today?

23.30
Oh man, it’s a good question. My first slide would say something like: Models, language models in particular, are complicated, interesting, and they can be understood, and it’s worth spending time to understand them. The point here being, we don’t have to treat them as this mysterious thing. We don’t have to fall back on approximations like “Oh, they’re just next-token predictors” or “They’re just pattern matchers; they’re black boxes.” We can look inside, and we can make progress on understanding them, and we can find a lot of rich structure. That would be slide one.

24.10
Slide two would be the stuff that we talked about at the start of this conversation, which would be, “Here’s three ways your intuitions are wrong.” You know, oftentimes this is, “Look at this example of a model planning many tokens ahead, not just waiting for the next token. And look at this example of the model having these rich representations showing that it’s sort of like actually doing multistep reasoning in its weights rather than just kind of matching to some training data example.” And then I don’t know what my third example would be. Maybe this universal language example we talked about. Complicated, interesting stuff. 

24.44
And then, three: What can you do about it? That’s the third slide. It’s an early research area. There’s not anything you can take away that will make whatever you’re building better today. Hopefully, if I’m giving this presentation in six months or a year, maybe this third slide is different. But for now, that’s what it is.

25.01
If you’re interested in this stuff, there are these open source libraries that let you do this tracing on open source models. Just go grab some small open source model, ask it some weird question, and then just look inside its brain and see what happens.

I think the thing that I respect the most and identify [with] the most about just being an engineer or developer is this willingness, this stubbornness, to understand: Your program has a bug. Like, I’m going to figure out what it is, and it doesn’t matter what level of abstraction it’s at.

And I would encourage people to use that same level of curiosity and tenacity to look inside these very weird models that are everywhere now. Those would be my three slides. 

25.49
Let me ask a follow up question. As you know, most teams are not going to be doing much pretraining. A lot of teams will do some form of posttraining, whatever that might be—fine-tuning, some form of reinforcement learning for the more advanced teams, a lot of prompt engineering, prompt optimization, prompt tuning, some sort of context grounding like RAG or GraphRAG.

You know more about how these models work than a lot of people. How would you approach these various things in a toolbox for a team? You’ve got prompt engineering, some fine-tuning, maybe distillation, I don’t know. So put on your posttraining hat, and based on what you know about interpretability or how these models work, how would you go about, systematically or in a principled way, approaching posttraining? 

26.54
Lucky for you, I also used to work on the posttraining team at Anthropic. So I have some experience there as well. I think it’s funny: What I’m going to say is the same thing I would have said before I studied these model internals, but maybe I’ll say it in a different way or something. The key takeaway I keep on having from looking at model internals is, “God, there’s a lot of complexity.” And that means, one, they’re able to do very complex reasoning just in latent space, inside their weights. There’s a lot of processing that can happen—more than I think most people have an intuition for. And two, it also means that usually, they’re running a bunch of different algorithms at once for everything they do.

So they’re solving problems in three different ways. And a lot of times, the weird mistakes you might see when you’re looking at your fine-tuning or just looking at the results model is, “Ah, well, there’s three different ways to solve this thing. And the model just kind of picked the wrong one this time.” 

Because these models are already so complicated, I find that the first thing to do is just pretty much always to build some sort of eval suite. That’s the thing that people fail at the most. It doesn’t take that long—it usually takes an afternoon. You just write down 100 examples of what you want and what you don’t want. And then you can get incredibly far by just prompt engineering and context engineering, or just giving the model the right context.
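
A minimal version of that eval suite can be just a list of prompts with pass/fail checks. In the sketch below, call_model is a placeholder for whatever API or local model you’re iterating on, and the cases are invented examples of “what you want and what you don’t want.”

```python
# A bare-bones eval suite: prompts paired with simple pass/fail checks, run
# against whatever completion function you're iterating on.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # returns True if the response is acceptable
    label: str

def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual API or local-model call."""
    raise NotImplementedError

CASES = [
    EvalCase("Who is Michael Batkin?",   # made-up name: we want a refusal
             lambda r: "don't know" in r.lower() or "not sure" in r.lower(),
             "declines unknown person"),
    EvalCase("Summarize our refund policy in two sentences.",
             lambda r: len(r.split(".")) <= 3,
             "stays concise"),
    # ...aim for ~100 of these covering what you want and what you don't want.
]

def run_suite():
    results = [(case.label, case.check(call_model(case.prompt))) for case in CASES]
    passed = sum(ok for _, ok in results)
    for label, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}  {label}")
    print(f"{passed}/{len(results)} cases passed")
```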

28.34
That’s my experience, having worked on fine-tuning models: Fine-tuning is the thing you only want to resort to if everything else fails. I mean, it’s pretty rare that everything else fails, especially with the models getting better. And so, yeah, understanding that, in principle, the models have an immense amount of capacity, and that it’s just your job to tease that capacity out, is the first thing I would say. Or the second thing, I guess, after “just build some evals.”

29.00
And with that, thank you, Emmanuel. 

29.03
Thanks, man.