Attention! What lies at the Core of ChatGPT? (Also as a Video!)


Word embedding, self-attention, and next-word prediction lie at the core of LLMs like ChatGPT. If you are curious about how these techniques work and want to see a simple example in R, read on!

Last time we talked about LLMs like ChatGPT, we gave some intuition about how they work (see: Create Texts with a Markov Chain Text Generator… and what this has to do with ChatGPT!). This time we want to dig a little deeper and talk about the core concepts.


You can also watch the video for this post (in German):


Word embedding is a popular technique in Natural Language Processing (NLP) that represents words as numerical vectors in a high-dimensional space. In this representation, words with similar meanings are located close to each other. Word embeddings have been found to be very useful in many NLP tasks such as sentiment analysis, language translation, and text classification.

In this blog post, we will explore a simple example in R. Let’s consider a small vocabulary consisting of three words: “love”, “is”, and “wonderful”. We can represent each word as a vector of three dimensions, where each dimension represents a different attribute.

In this example, the first dimension represents the part of speech (noun, adjective, or verb), the second dimension represents the frequency of the word (rare, normal, or often), and the third dimension represents the sentiment of the word (negative, neutral, or positive).

We can represent the word vectors in a matrix called the embedding matrix. In R, we can create the embedding matrix as follows:

# dim 1: part of speech (noun = -1, adjective = 0, verb = 1)
# dim 2: frequency      (rare = -1, normal = 0, often = 1)
# dim 3: sentiment      (negative = -1, neutral = 0, positive = 1)
love <- c(-1, 0, 1)
is <- c(1, 1, 0)
wonderful <- c(0, 0, 1)

embedding_M <- rbind(love, is, wonderful)
embedding_M
##           [,1] [,2] [,3]
## love        -1    0    1
## is           1    1    0
## wonderful    0    0    1

The distance between two words in this space indicates how similar they are: the smaller the distance, the more similar the words. For example, the Euclidean distance between “love” and “wonderful” is 1, while the distance between “love” and “is” is √6 ≈ 2.45, indicating that “love” and “wonderful” are more similar in meaning than “love” and “is”.
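To check these distances numerically, one can use base R's dist() function, which by default computes Euclidean distances between the rows of a matrix. The following is a minimal sketch that rebuilds the embedding matrix from above so that it runs on its own; the name dist_M is just an illustrative choice.

```r
# rebuild the toy embedding matrix from above
love <- c(-1, 0, 1)
is <- c(1, 1, 0)
wonderful <- c(0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

# pairwise Euclidean distances between the word vectors (rows)
dist_M <- dist(embedding_M) |> as.matrix() |> round(2)
dist_M
##           love   is wonderful
## love      0.00 2.45      1.00
## is        2.45 0.00      1.73
## wonderful 1.00 1.73      0.00
```

The smallest off-diagonal entry is the one for “love” and “wonderful”, which matches the intuition that these two words are closest in meaning.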

Self-attention is a mechanism used in transformer-based models such as BERT and GPT to process and analyze sequences of words. Self-attention allows the model to focus on different parts of the input sequence and weigh their importance when making predictions. In our example, we can use self-attention to compute the similarity between each word and all the other words in the vocabulary. We do this by multiplying the embedding matrix with its transpose, which gives the dot product of every pair of word vectors, and then normalizing the result row-wise with the softmax function so that each row sums to 1.

In R, we can compute the self-attention matrix as follows:

softmax <- function(x) {
  exp_x <- exp(x)
  row_sums <- apply(exp_x, 1, sum)
  exp_x / row_sums
}

self_attn_M <- softmax(embedding_M %*% t(embedding_M)) |> round(2)
self_attn_M
##           love   is wonderful
## love      0.71 0.04      0.26
## is        0.04 0.84      0.11
## wonderful 0.42 0.16      0.42

The diagonal elements of the self-attention matrix represent the self-similarity of each word. The off-diagonal elements represent how much weight each word (row) puts on each of the other words (column). Because the softmax is applied row-wise, the matrix is not symmetric. We can see that “love” is more similar to “wonderful” (0.26) than to “is” (0.04).
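To see where these numbers come from, it can help to look at the two steps separately: first the raw dot-product scores, then the row-wise softmax. This is a small sketch that rebuilds the objects from above so that it is self-contained; scores_M and attn_M are names chosen here for illustration.

```r
# rebuild the toy embedding matrix and the softmax helper from above
love <- c(-1, 0, 1)
is <- c(1, 1, 0)
wonderful <- c(0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

softmax <- function(x) {
  exp_x <- exp(x)
  exp_x / rowSums(exp_x) # divide every row by its own sum
}

# step 1: raw similarity scores = dot products of all pairs of word vectors
scores_M <- embedding_M %*% t(embedding_M)
scores_M
##           love is wonderful
## love         2 -1         1
## is          -1  2         0
## wonderful    1  0         1

# step 2: row-wise softmax turns each row of scores into weights that sum to 1
attn_M <- softmax(scores_M)
all.equal(unname(rowSums(attn_M)), rep(1, 3))
## [1] TRUE
```

The raw scores are symmetric, but the weights after the softmax are not, because every row is normalized by its own sum.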

Next-word prediction is a task that involves predicting the most likely word to come next in a sequence given a context. In our example, we can use the self-attention matrix to predict the most likely next word given a context. We can achieve this using masked self-attention, where we mask out all the elements in the self-attention matrix that correspond to the words that come after the context.

In R, we can compute the masked self-attention matrix as follows:

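# next word prediction via masked self-attention
# (for illustration the mask is set on the weights computed above; in a
# transformer it is applied to the raw scores before the softmax)
masked_self_attn_M <- self_attn_M
masked_self_attn_M[upper.tri(masked_self_attn_M)] <- -Inf # -Inf -> softmax = 0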
masked_self_attn_M
##           love   is wonderful
## love      0.71 -Inf      -Inf
## is        0.04 0.84      -Inf
## wonderful 0.42 0.16      0.42

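With the mask in place, each word can only attend to itself and the words before it, because an entry of -Inf becomes a weight of zero once the softmax is applied. The masked self-attention matrix can therefore be used to learn to predict the most likely next word given only the context of the text before it.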
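In transformer models, the mask is usually applied to the raw scores before the softmax, so that the remaining weights in each row again sum to 1. The following sketch shows that ordering for our toy example. It is an illustration under that assumption, not the exact procedure of any specific model, and the names scores_M and causal_attn_M are made up here.

```r
# rebuild the toy embedding matrix and the softmax helper from above
love <- c(-1, 0, 1)
is <- c(1, 1, 0)
wonderful <- c(0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

softmax <- function(x) {
  exp_x <- exp(x)
  exp_x / rowSums(exp_x)
}

# raw dot-product scores between all pairs of words
scores_M <- embedding_M %*% t(embedding_M)

# mask the "future": everything above the diagonal is set to -Inf,
# and exp(-Inf) = 0, so these positions get zero weight
scores_M[upper.tri(scores_M)] <- -Inf

causal_attn_M <- softmax(scores_M) |> round(2)
causal_attn_M
##           love   is wonderful
## love      1.00 0.00      0.00
## is        0.05 0.95      0.00
## wonderful 0.42 0.16      0.42
```

Each row now distributes its weight only over the word itself and the words before it: “love” can only attend to itself, while “wonderful” sees all three words.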

In summary, we have explored a simple example of word embedding, self-attention, and next-word prediction via masked self-attention in R. It is important to note that the example presented here is just the core of these techniques, and the workings inside advanced language models like ChatGPT are much more complex.

In such models, word embeddings are not fixed but are learned together with the rest of the transformer. There are also many more degrees of freedom through additional mathematical transformations, and different layers of abstraction through multi-headed self-attention, but that would go beyond the scope of this post.
