
Decomposing Language Models Into Understandable Components (anthropic.com)

AI startup Anthropic, writing in a blog post: Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly -- each neuron in a neural network performs simple arithmetic -- but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe. Neuroscientists face a similar problem with understanding the biological basis for human behavior. The neurons firing in a person's brain must somehow implement their thoughts, feelings, and decision-making. Decades of neuroscience research has revealed a lot about how the brain works, and enabled targeted treatments for diseases such as epilepsy, but much remains mysterious. Luckily for those of us trying to understand artificial neural networks, experiments are much, much easier to run. We can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.

Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts. In our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
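
To make the summary's "features as linear combinations of neuron activations" idea concrete, here is a minimal NumPy sketch. The sizes echo the summary (512 neurons, ~4000 features), but the dictionary, the active feature indices, and the labels are made up for illustration; this is not Anthropic's data or code.

    import numpy as np

    rng = np.random.default_rng(0)
    n_neurons, n_features = 512, 4096                        # a layer's neurons vs. learned features
    dictionary = rng.normal(size=(n_features, n_neurons))    # one direction per feature (learned in practice)

    # A sparse feature vector: only a handful of features are active at once.
    # The indices and labels below are hypothetical placeholders.
    feature_acts = np.zeros(n_features)
    feature_acts[[12, 907, 3301]] = [1.3, 0.7, 0.2]          # e.g. "DNA", "legal language", "HTTP"

    # The layer's activation vector is then (approximately) the weighted sum
    # of the active features' directions.
    neuron_acts = feature_acts @ dictionary
    print(neuron_acts.shape)                                 # (512,)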


Comments:
  • by VeryFluffyBunny ( 5037285 ) on Monday October 09, 2023 @04:05PM (#63912875)
    ...AI neurons & systems are anything like human equivalents. Here's a quick explanation of why studying one level of analysis doesn't tell you much about another, & why we use different tools, theoretical frameworks, & methods to analyse each one. Essentially, we're talking about teaching AI models to do useful stuff, so it's not unlike teaching humans, although a lot, lot simpler: https://yt.artemislena.eu/watc... [artemislena.eu] (about a 9-minute video presentation).
    • by Rei ( 128717 ) on Monday October 09, 2023 @05:31PM (#63913163) Homepage

      1. Don't link a slow, poorly presented video that's over 8 minutes long and doesn't defend your point anyway as a substitute for arguing your claim.

      2. Artificial neurons don't work "the same" as natural neurons, but there is a lot that's convergent about their behavior.

      One of the prime differences is that natural neurons pulse in an on-off manner, while artificial neurons give a non-binary output that can vary in response to their inputs (although you can make it binary if you want with a step activation function). However, this isn't actually as different as it seems, because natural neurons pulse at varying frequencies, activating again and again in response to the weighted strength of their inputs, so the time-averaged activation that they pass on to subsequent neurons is roughly equivalent to a non-pulsing but non-binary feed-forward output.
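
      As a toy numerical check on that time-averaging point (arbitrary numbers, not a biophysical model), compare a pulsed unit whose firing probability tracks its input with a graded unit that just outputs the input strength directly:

          import numpy as np

          rng = np.random.default_rng(0)
          drive = 0.37                         # weighted input strength (made-up value)

          # "Natural" style: binary pulses whose frequency tracks the drive.
          spikes = rng.random(10_000) < drive  # fires on ~37% of time steps
          print(spikes.mean())                 # ~0.37 once averaged over time

          # "Artificial" style: a single graded output carrying the same information.
          print(drive)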

      An exception to this can be in the case where only a small handful of neurons dominate the input field of a subsequent neuron - in such cases, you're getting a somewhat different profile, with high probabilities of firing on each pulse and higher rates of activation decay immediately after the pulses. But such neurons with only one or a handful of dominant inputs are also effectively not tasked with solving as complex of a classification problem. Indeed, white matter is basically a data bus. Artificial neurons have no need for that.

      Natural neurons can form rhythmic timing patterns and cyclic circuits, while these artificial neurons only feed forward from one layer to the next. That said, there's generally no use for artificial neurons to form timing circuits, and it's not clear that encoding information in spike timing is any better than other forms of encoding. However, cyclic processing may be of use for artificial neurons in the future, for iteratively attacking tasks step by step.

      Natural neurons are constantly changing their weights. Artificial neurons only do this during training.

      Natural neurons can gain or lose connections. Artificial neurons don't, but they don't really need to, as they're connected to all neurons in the previous layer, and can just "lose" a connection by zeroing it or "gain" a connection by raising a zeroed one. That said, they are bound in specified sequential layers.

      The binary / pulsed approach of natural neurons inherently adds random noise to the system. Artificial networks often introduce random noise deliberately. For example, with LLMs, you end up with a list of probabilities, truncate it with top_p / top_k, and weight your selection among the remaining choices with temperature (high temperature = more "noise" / randomness / impulsiveness / clanging; low = predictable, repetitive). A rough sketch of that sampling step follows.
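
      To illustrate, here is a minimal sampler in plain NumPy; the logits are made up and this isn't any particular library's API, just the general shape of temperature scaling plus top-k / top-p truncation:

          import numpy as np

          def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
              """Toy next-token sampler: temperature scaling, then top-k / top-p truncation."""
              rng = rng or np.random.default_rng()
              # Temperature: divide logits before softmax; low T sharpens, high T flattens.
              scaled = logits / temperature
              probs = np.exp(scaled - np.max(scaled))
              probs /= probs.sum()
              # Top-k: keep only the k most probable tokens.
              keep = np.argsort(probs)[::-1][:top_k]
              # Top-p (nucleus): keep the smallest prefix of those whose mass reaches top_p.
              cum = np.cumsum(probs[keep])
              keep = keep[: int(np.searchsorted(cum, top_p)) + 1]
              trunc = probs[keep] / probs[keep].sum()
              return int(rng.choice(keep, p=trunc))

          # Example: a 5-token vocabulary with made-up logits.
          print(sample_next_token(np.array([2.0, 1.5, 0.3, -1.0, -2.0])))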

      • The point is about the levels of analysis & the difficulty of transferring findings from one level to effects on another, let alone across multiple levels. TFS gives the impression that they're trying to apply complicated systems thinking to complex adaptive systems analysis. If so, it won't fly.

        Your rather long & elaborate description of AI neuron activities doesn't address anything, let alone the point I made.
    • It was just an analogy...

  • I might be totally wrong, but the way I understand it, neural networks are kinda like a big fractal vise in many dimensions at once. During training it learns the "shapes" of different objects, kinda like the fractal vise tweaking each joint. When presented with new input, it can look up which remembered shape matches best.
    • by Rei ( 128717 ) on Monday October 09, 2023 @05:36PM (#63913173) Homepage

      There aren't really any "lookups". Think of it more like the most insane flow chart you've ever seen, with literally billions of lines, where each line isn't strictly yes / no but carries varying degrees, and each node (sometimes representing not just one question, but many superimposed questions) weighs the huge number of lines leading into it by different degrees to come up with the sort of answer it wants to pass on to the next part of the flow chart. A toy version of one such node is sketched below.

      And then picture all the weights and biases on that flow chart having been self-assembled.
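
      A minimal sketch of one such node (the weights here are made up; in a real model they are learned during training):

          import numpy as np

          # One "node" in the flow chart: it weighs all incoming lines (by degree,
          # not yes/no), adds a bias, and passes on a graded answer.
          def node(inputs, weights, bias):
              return np.tanh(inputs @ weights + bias)   # smooth, non-binary output

          rng = np.random.default_rng(0)
          incoming = rng.normal(size=512)            # activations from the previous layer
          weights = rng.normal(size=512) * 0.05      # "self-assembled" by training in a real model
          print(node(incoming, weights, bias=0.1))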

  • They found that neurons are usually not specific to a single purpose, and that's why they are hard to interpret. But using a simple method (Dictionary Learning) they could separate these overlapping roles, like picking individual voices out of a crowd. Dictionary learning here is a simple encoder-decoder that encodes the input into a sparse form, that is, a vector with most entries zero and just a few non-zero. That sparsity is what makes it interpretable: each dictionary entry ends up acting like a topic. A rough sketch follows.
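
    As a sketch of that encoder-decoder idea (shapes follow the summary's 512-neuron layer; the initialization and loss here are illustrative, not Anthropic's actual implementation):

        import numpy as np

        def relu(x):
            return np.maximum(x, 0.0)

        rng = np.random.default_rng(0)
        W_enc = rng.normal(size=(512, 4096)) * 0.01    # encoder: neuron activations -> wide sparse code
        W_dec = rng.normal(size=(4096, 512)) * 0.01    # decoder: sparse code -> neuron activations

        def loss(acts, l1_coeff=1e-3):
            code = relu(acts @ W_enc)                  # sparse feature activations
            recon = code @ W_dec                       # reconstruction of the neuron activations
            mse = np.mean((recon - acts) ** 2)         # reconstruction error
            sparsity = l1_coeff * np.abs(code).sum()   # L1 penalty pushes most entries to zero
            return mse + sparsity

        acts = rng.normal(size=512)                    # one sample of layer activations
        print(loss(acts))
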
    • by PPH ( 736903 )

      Yeah.

      But this is sort of a "Duh" issue for anyone who learns to read. The word token, which may map to a particular neuron, doesn't define semantics (as if there were a "cat" neuron that fires for a recognized instance of a cat). Words require a certain context to give them meaning.

      "He rose from the dead." "A rose by any other name would smell as sweet."

      It's something that we (meat sacks) are expected to learn early in our schooling (although not formalized until later). That LLMs do not have a layer of neurons that represent semantic concepts and are fired by suitable inputs from the various p

      • Define semantic understanding. ChatGPT does it well enough to understand vague or incomplete requests, or even offer a choice of what it thinks you mean. It couldn't do that without some level of semantic comprehension.

        • Re: Superposition (Score:5, Informative)

          by Rei ( 128717 ) on Monday October 09, 2023 @06:01PM (#63913233) Homepage

          Indeed. LLMs tend to be very good at semantic understanding. It's not even an innovation of the Transformer architecture; this goes way back.

          First you had work showing that if you train a neural network to reproduce its input as its output, and you pinch down the deep layers in the middle, what you get there is a dense latent representation of the concept. And while the original idea of the research was to use neural networks for compression, it was discovered that these latent representations have interesting properties. You can do math on them: you can dot-product two of them to see how similar their concepts are, and perform other operations. For example, KING minus MAN plus WOMAN will resemble QUEEN. It forms a conceptual-space representation of the training input, rather than a lexical one.

          So a lot of work switched to these dense representations. For example, training denoising algorithms on image latents is the research path that ultimately led to AI image generators. With text, they started representing tokens as dense representations, where each token gets a vector of floating-point numbers, and again, the dot product of any two shows how related the concepts are (for all examples, I'll just pretend words are tokens). For example, COW might be close to both MILK and GRASS, but MILK and GRASS won't be close to each other.
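
          With toy hand-made vectors (real embeddings are learned and much higher-dimensional), the arithmetic looks like this:

              import numpy as np

              def cosine(a, b):
                  return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

              # Hand-picked 4-d "embeddings" purely for illustration.
              vec = {
                  "KING":  np.array([0.9,  0.8, 0.1, 0.0]),
                  "QUEEN": np.array([0.9, -0.8, 0.1, 0.0]),
                  "MAN":   np.array([0.1,  0.9, 0.0, 0.0]),
                  "WOMAN": np.array([0.1, -0.9, 0.0, 0.0]),
                  "COW":   np.array([0.0,  0.0, 0.7, 0.7]),
                  "MILK":  np.array([0.0,  0.0, 1.0, 0.0]),
                  "GRASS": np.array([0.0,  0.0, 0.0, 1.0]),
              }

              print(cosine(vec["KING"] - vec["MAN"] + vec["WOMAN"], vec["QUEEN"]))      # ~0.99
              print(cosine(vec["COW"], vec["MILK"]), cosine(vec["COW"], vec["GRASS"]))  # ~0.71 each
              print(cosine(vec["MILK"], vec["GRASS"]))                                  # 0.0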

          But we want better. Say you have the phrase "THE BANK OF THE RIVER". You have the token "BANK", and it will be close to things like "MONEY", "CASHIER", and "ATM", but also to things like "WATER" and "EROSION". You want the watery kind, not the financial kind. This is done with "attention": for each token, you check its semantic distance to every other token (i.e. take the dot product) and use that as a weighting; the closer another token is, the more you shift this token's vector toward that token's vector. And it's that simple: after just one round, the dense representation that was "BANK" now represents something semantically much closer to river banks than to financial ones.
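
          A stripped-down version of that weighting step, with toy 2-d meanings (axis 0 = "finance", axis 1 = "water"; numbers made up, and this is just the intuition rather than the full transformer formula):

              import numpy as np

              def softmax(x):
                  e = np.exp(x - x.max())
                  return e / e.sum()

              tokens = {
                  "THE":   np.array([0.0, 0.0]),
                  "BANK":  np.array([0.7, 0.7]),   # ambiguous: both senses at once
                  "OF":    np.array([0.0, 0.0]),
                  "RIVER": np.array([0.0, 1.0]),
              }

              sentence = ["THE", "BANK", "OF", "THE", "RIVER"]
              scores = np.array([tokens["BANK"] @ tokens[w] for w in sentence])  # dot-product similarity
              weights = softmax(scores)
              updated_bank = sum(w * tokens[t] for w, t in zip(weights, sentence))
              print(updated_bank)   # pulled toward the "water" axis by RIVER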

          So right here on this trivial algorithm, we have semantic isolation of concepts.

          This is all well and good, but it can't handle much complexity. That's where (skipping ahead a bunch of papers) you end up at the Transformer architecture: attention is computed via query-key-value projections for all tokens in each head, many different heads analyze different aspects of the semantics, and, most critically, there are many layers, so increasingly complicated concepts and operations can be handled at each subsequent layer. You also have positional encodings so that order matters, and add-and-normalization after both the multi-headed attention and the feed-forward sublayer, to deal with the vanishing-gradient problem.
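
          The core of that query-key-value step, for a single head and leaving out positional encodings, masking, and add-and-norm (shapes here are arbitrary, just to show the mechanics):

              import numpy as np

              def softmax(x, axis=-1):
                  e = np.exp(x - x.max(axis=axis, keepdims=True))
                  return e / e.sum(axis=axis, keepdims=True)

              def attention(X, W_q, W_k, W_v):
                  """Scaled dot-product attention over a sequence of token vectors X."""
                  Q, K, V = X @ W_q, X @ W_k, X @ W_v
                  scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to each other token
                  return softmax(scores, axis=-1) @ V       # weighted mixture of value vectors

              rng = np.random.default_rng(0)
              d_model, d_head, seq_len = 64, 16, 5
              X = rng.normal(size=(seq_len, d_model))       # five token embeddings
              W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
              print(attention(X, W_q, W_k, W_v).shape)      # (5, 16)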

          But all this added complexity is beyond what's needed just to understand, say, the difference between "He rose from the dead" and "A rose by any other name would smell as sweet." Basic attention tells you that.

  • by Viol8 ( 599362 ) on Monday October 09, 2023 @04:48PM (#63913003) Homepage

    Think of the classic three-body problem: you can understand the physics of each body perfectly, but with 3 or more bodies it becomes impossible to follow how the input became the output. Now instead of 3, substitute 100 billion and you get a sense of the level of complexity involved. Emergent properties can't always be understood from individual component behaviour, because the interaction between components, whether atoms, planets or neurons, is often more important.

    • That's interesting -- can you expound on the jump between 2 and 3?

    • by Anonymous Coward

      The N-body problem does not mean that it's impossible to follow; it means there isn't a single closed-form equation for systems of 3 or more bodies into which you can plug time as a variable and compute the result. You can still simulate such a system some way into the future, and if you understood the physics of each body perfectly (i.e., your model could not be improved upon), you would be limited only by the precision of your simulation.
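
      Concretely, "simulate" here just means stepping the equations of motion numerically, along the lines of (crude Euler integrator, made-up initial conditions):

          import numpy as np

          # Three gravitating bodies with G = 1 and unit masses. No closed-form
          # solution exists, but stepping forward works, limited by step size
          # and floating-point precision.
          pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
          vel = np.array([[0.0, 0.1], [0.0, -0.1], [0.1, 0.0]])
          dt = 1e-3

          for _ in range(10_000):
              acc = np.zeros_like(pos)
              for i in range(3):
                  for j in range(3):
                      if i != j:
                          r = pos[j] - pos[i]
                          acc[i] += r / np.linalg.norm(r) ** 3   # inverse-square attraction
              vel += acc * dt
              pos += vel * dt

          print(pos)   # positions after 10 time units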

  • by Mspangler ( 770054 ) on Monday October 09, 2023 @06:05PM (#63913253)

    "We understand the math of the trained network exactly -- each neuron in a neural network performs simple arithmetic -- but we don't understand why those mathematical operations result in the behaviors we see."

    That's what I said in 1997 while doing my dissertation. Failure modes were unpredictable once the range of the training set was exceeded. The NN would decide to correct a low pH condition by adding more acid. So much for using that for process control, which was the point of the dissertation.

    Another example reported here on Slashdot was training an NN to recognize the office furniture from any position. Then they put a model elephant in the room, and the NN lost the ability to identify anything, because the elephant had never been there before.

    To paraphrase Yosemite Sam, "Neural networks is so stupid." (He had dragon problems, as well as camel, lion, and even elephant problems.)

    • by NobleNobbler ( 9626406 ) on Monday October 09, 2023 @07:46PM (#63913533)

      Might as well have described medical science. Find me a drug that doesn't have a description stating, somewhere, "Although the reasons that Lunestavagisilinitis works for the condition are not completely understood, it is theorized that by..."

      And then look up the condition itself: "Although the condition and what causes it is not fully known, it is..."

      We really are just flailing blindly, and we are very, very proud of ourselves when we hit a bunch of levers and something resembling a prize pops out.
