Word embedding, self-attention, and next-word prediction lie at the core of LLMs like ChatGPT. If you are curious about how these techniques work and want to see a simple example in R, read on!
Last time we talked about LLMs like ChatGPT, we gave some intuition about how they work (see: Create Texts with a Markov Chain Text Generator… and what this has to do with ChatGPT!). This time we want to dig a little deeper and talk about the core concepts.
Word embedding is a popular technique in Natural Language Processing (NLP) that represents words as numerical vectors in a high-dimensional space. In this representation, words with similar meanings are located close to each other. Word embeddings have been found to be very useful in many NLP tasks such as sentiment analysis, language translation, and text classification.
In this blog post, we will explore a simple example in R. Let’s consider a small vocabulary consisting of three words: “love”, “is”, and “wonderful”. We can represent each word as a vector of three dimensions, where each dimension represents a different attribute.
In this example, the first dimension represents the part of speech (noun, adjective, or verb), the second dimension represents the frequency of the word (rare, normal, or often), and the third dimension represents the sentiment of the word (negative, neutral, or positive).
We can represent the word vectors in a matrix called the embedding matrix. In R, we can create the embedding matrix as follows:
# dim 1 = noun = -1, adjective = 0, verb = 1
# dim 2 = rare = -1, normal = 0, often = 1
# dim 3 = negative = -1, neutral = 0, positive = 1
love <- c(-1, 0, 1)
is <- c(1, 1, 0)
wonderful <- c(0, 0, 1)
embedding_M <- rbind(love, is, wonderful)
embedding_M
##           [,1] [,2] [,3]
## love        -1    0    1
## is           1    1    0
## wonderful    0    0    1
The distance between two words in this space indicates the similarity between them. For example, the distance between “love” and “wonderful” is smaller than the distance between “love” and “is”, indicating that “love” and “wonderful” are more similar in meaning than “love” and “is”.
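To make the distance claim concrete, here is a minimal sketch that computes the pairwise distances for the toy vocabulary. It assumes Euclidean distance (the post does not name a specific distance measure) and uses base R's `dist()`; the name `dist_M` is introduced here only for illustration.

```r
# Rebuild the toy embedding matrix so this block runs on its own
love      <- c(-1, 0, 1)
is        <- c( 1, 1, 0)
wonderful <- c( 0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

# Pairwise Euclidean distances between the rows (words)
# Assumption: Euclidean distance; other measures (e.g. cosine) are also common
dist_M <- as.matrix(dist(embedding_M, method = "euclidean"))
round(dist_M, 2)

# Smaller distance = more similar under this measure
dist_M["love", "wonderful"] < dist_M["love", "is"]
```

Under this assumption, the distance from “love” to “wonderful” is 1, while the distance from “love” to “is” is sqrt(6), roughly 2.45, so the comparison returns TRUE.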
Self-attention is a mechanism used in transformer-based models such as BERT and GPT to process and analyze sequences of words. Self-attention allows the model to focus on different parts of the input sequence and weigh their importance when making predictions. In our example, we can use self-attention to compute the similarity between each word and all the other words in the vocabulary. We do this by multiplying the embedding matrix with its own transpose, which gives the dot product of every pair of word vectors, and then normalizing the result row-wise with the softmax function.
In R, we can compute the self-attention matrix as follows:
softmax <- function(x) {
  exp_x <- exp(x)
  row_sums <- apply(exp_x, 1, sum)
  exp_x / row_sums
}
self_attn_M <- softmax(embedding_M %*% t(embedding_M)) |> round(2)
self_attn_M
##           love   is wonderful
## love      0.71 0.04      0.26
## is        0.04 0.84      0.11
## wonderful 0.42 0.16      0.42
The diagonal elements of the self-attention matrix represent the self-similarity of each word. The off-diagonal elements represent the similarity between each pair of words. We can see that “love” is more similar to “wonderful” than to “is” (0.26 versus 0.04). Row-wise, the result can be interpreted as a probability distribution over the different words.
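The two steps behind the self-attention matrix can also be looked at separately. The following sketch first shows the raw dot-product scores and then checks that each row of the softmax result sums to 1. The names `scores_M`, `softmax_rows`, and `attn_M` are introduced here for illustration, and `sweep()` is used only to make the row-wise division explicit; it is meant to be equivalent to the softmax used in the post, not a replacement for it.

```r
# Rebuild the toy embedding matrix so this block runs on its own
love      <- c(-1, 0, 1)
is        <- c( 1, 1, 0)
wonderful <- c( 0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

# Step 1: raw similarity scores (dot product of every pair of word vectors)
scores_M <- embedding_M %*% t(embedding_M)
scores_M

# Step 2: row-wise softmax, dividing each row by its own sum
softmax_rows <- function(x) {
  exp_x <- exp(x)
  sweep(exp_x, 1, rowSums(exp_x), "/")
}
attn_M <- softmax_rows(scores_M)

# Each row should sum to 1, i.e. form a probability distribution
rowSums(attn_M)
```

For example, the raw score for “love” and “is” is -1, and for “love” and “wonderful” it is 1, which is why the softmax assigns “wonderful” the larger weight in the “love” row.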
The concept of attention mechanisms in neural networks was introduced around 2014 to improve machine translation. Self-attention emerged in research between 2014 and 2016. In 2017, Vaswani et al., who were working for Google at the time, revolutionized the field with their paper “Attention Is All You Need,” which introduced the Transformer architecture and fully realized the potential of the self-attention mechanism.
Ironically, while Google pioneered this technology, OpenAI has since overtaken them in many aspects of AI development, particularly in Large Language Models (LLMs), like ChatGPT, based on the Transformer architecture.
Next-word prediction is a task that involves predicting the most likely word to come next in a sequence given a context. In our example, we can use the self-attention matrix to predict the most likely next word given a context. We can achieve this using masked self-attention, where we mask out all the elements in the self-attention matrix that correspond to the words that come after the context.
In R, we can compute the masked self-attention matrix as follows:
# next word prediction via masked self-attention
masked_self_attn_M <- self_attn_M
masked_self_attn_M[upper.tri(masked_self_attn_M)] <- -Inf # -Inf -> softmax = 0
masked_self_attn_M
##           love   is wonderful
## love      0.71 -Inf      -Inf
## is        0.04 0.84      -Inf
## wonderful 0.42 0.16      0.42
The masked self-attention matrix can be used to learn to predict the most likely next word given the context of the whole text before it.
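In transformer implementations, the mask is usually applied to the raw scores before the softmax, so that the masked positions receive a weight of exactly 0 and the remaining weights in each row still sum to 1. The following is a minimal sketch of that variant, not the code from the post. It assumes that the row order (“love”, “is”, “wonderful”) is also the word order of the sentence; the names `softmax_rows`, `masked_scores_M`, and `causal_attn_M` are introduced here for illustration.

```r
# Rebuild the toy embedding matrix so this block runs on its own
love      <- c(-1, 0, 1)
is        <- c( 1, 1, 0)
wonderful <- c( 0, 0, 1)
embedding_M <- rbind(love, is, wonderful)

# Row-wise softmax; exp(-Inf) is 0 in R, so masked entries drop out
softmax_rows <- function(x) {
  exp_x <- exp(x)
  sweep(exp_x, 1, rowSums(exp_x), "/")
}

# Raw dot-product scores
scores_M <- embedding_M %*% t(embedding_M)

# Causal mask: the word in row i may only look at words 1..i
masked_scores_M <- scores_M
masked_scores_M[upper.tri(masked_scores_M)] <- -Inf

# Softmax after masking: future positions get weight 0, rows still sum to 1
causal_attn_M <- softmax_rows(masked_scores_M)
round(causal_attn_M, 2)
```

With this ordering, the first word can only attend to itself (weight 1), the second word splits its weight between the first two words (about 0.05 and 0.95), and the last row is unchanged because nothing comes after it.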
In summary, we have explored a simple example of word embedding, self-attention, and next-word prediction via masked self-attention in R. It is important to note that the example presented here is just the core of these techniques, and the workings inside advanced language models like ChatGPT are much more complex.
In such models, word embeddings are not fixed but are learned by the transformer as well. There are also many more degrees of freedom via additional mathematical transformations, and different layers of abstraction via multi-headed self-attention, but that would go beyond the scope of this post.