Everybody and their dog is talking about ChatGPT from OpenAI. If you want to get an intuition for what lies at the core of such Language Models, read on!
I also created a video for this post (in German):
I will share a secret with you: at the core of the latest craze, Large Language Models (LLMs) such as GPT-3 and its sibling ChatGPT from OpenAI, or PaLM from Google, lies a (sophisticated) function for predicting the next best word, phrase, or sentence based on statistics! You will say: no way!?
First, try an experiment with your smartphone: open some messenger or social media app and start typing. Then tap on the suggested word in the middle above the keyboard on your display. Keep tapping and watch a sentence form. This sentence is also based on statistics about which word typically follows which other word, probably refined by your own usage of the phone.
Not impressed yet? Then consider the following simple algorithm, called a Markov chain algorithm. We won’t go into the mathematical details of why it is called that, but just take it as a simple way to create texts based on simple statistics.
The mathematical theory of Markov chains goes back to the Russian mathematician Andrey Markov and was developed at the beginning of the last century.
The idea of using this for analyzing and generating texts was first put forward by the polymath Claude Shannon in his landmark paper “A Mathematical Theory of Communication” from 1948.
The research area is called Natural Language Processing (NLP).
The basis for the following little project is taken from Rosetta Code (see also our category Rosetta Code).
A Markov chain algorithm is a way to predict which word is most likely to come next in a sequence of words based on the previous words (called the prefix).
To do this, the algorithm takes an input text and breaks it into a series of individual words. It then slides a window over the text and uses N consecutive words as the prefix and the word that follows them (the (N+1)-th word) as a potential next word, called the suffix. The algorithm stores all suffix words observed for each prefix and later randomly selects one of them for each prefix (see the little code sketch after the example table below).
As an example, take the following simple text with N = 2:
now he is gone she said he is gone for good
This would build the following table:
| PREFIX | SUFFIX |
|---|---|
| now he | is |
| he is | gone, gone |
| is gone | she, for |
| gone she | said |
| she said | he |
| said he | is |
| gone for | good |
| for good | (empty; if we get to this point, the program stops generating text) |
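To make this table-building step concrete, here is a minimal R sketch for the toy sentence above with N = 2. The helper name `build_table` is just made up for illustration; the full program follows further below:

```r
# Minimal sketch: build a prefix -> suffix lookup table for a toy sentence (N = 2)
build_table <- function(txt, n = 2) {
  words <- strsplit(txt, "\\s+")[[1]]
  tab <- list()
  for (i in 1:(length(words) - n)) {
    prefix <- paste(words[i:(i + n - 1)], collapse = " ")
    suffix <- words[i + n]
    tab[[prefix]] <- c(tab[[prefix]], suffix)  # collect every suffix seen after this prefix
  }
  tab
}

tab <- build_table("now he is gone she said he is gone for good")
tab[["he is"]]    # "gone" "gone"
tab[["is gone"]]  # "she"  "for"
```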
To generate the final text, the following steps are taken:
- Choose a random PREFIX from the list of PREFIXes.
- If the chosen PREFIX has more than one SUFFIX, select one of them at random.
- Append the chosen SUFFIX and drop the first word of the PREFIX to form the new PREFIX.
- Repeat steps 2 and 3 until the desired length of the text is reached.
Let us set N = 2 and create a text with 8 words:
random prefix: gone she, suffix: said
new prefix: she + said, new suffix: he
new prefix: said + he, new suffix: is
… and so on. In this simple example the created text reads:
gone she said he is gone she said
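The generation loop itself can be sketched in a few lines of R as well. This is only an illustration: the table is typed in by hand so the snippet runs on its own, and the seed and the random starting prefix are arbitrary.

```r
# Minimal sketch of the generation loop, using the example table from above (N = 2)
set.seed(42)
tab <- list(
  "now he"   = "is",
  "he is"    = c("gone", "gone"),
  "is gone"  = c("she", "for"),
  "gone she" = "said",
  "she said" = "he",
  "said he"  = "is",
  "gone for" = "good"
)

prefix <- sample(names(tab), 1)                  # step 1: pick a random starting PREFIX
out <- strsplit(prefix, " ")[[1]]
while (length(out) < 8) {
  suffixes <- tab[[prefix]]
  if (is.null(suffixes)) break                   # PREFIX with no SUFFIX (e.g. "for good"): stop
  nxt <- sample(suffixes, 1)                     # step 2: pick one SUFFIX at random
  out <- c(out, nxt)
  prefix <- paste(tail(out, 2), collapse = " ")  # step 3: new PREFIX = last N = 2 words
}
cat(paste(out, collapse = " "))
```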
Using a larger training text generally leads to better results when using a Markov chain algorithm.
For our little project here, we will use the following text file: alice_oz.txt, which begins as follows:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be late!’ (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural)
The following code was converted from the Julia version with the help of ChatGPT:
```r
markovtext <- function(txt, klen, maxchlen) {
  # split the input text into individual words
  words <- unlist(regmatches(txt, gregexpr("\\b\\w+\\b", txt)))

  # build the prefix -> suffix dictionary (prefix length = klen)
  dict <- list()
  for (i in 1:(length(words) - klen)) {
    k <- paste(words[i:(i + klen - 1)], collapse = " ")
    v <- words[i + klen]
    if (k %in% names(dict)) {
      dict[[k]] <- c(dict[[k]], v)
    } else {
      dict[[k]] <- c(v)
    }
  }

  # start with a random prefix and keep appending randomly chosen suffixes
  keytext <- sample(names(dict), size = 1)
  outtext <- keytext
  lasttext <- outtext
  while (nchar(outtext) < maxchlen) {
    lasttext <- outtext
    if (is.null(dict[[keytext]])) break  # prefix with no known suffix: stop generating
    valtext <- sample(dict[[keytext]], size = 1)
    outtext <- paste(outtext, valtext, sep = " ")
    # new prefix: drop the first word of the old prefix and append the chosen suffix
    keytext <- gsub("^\\w+\\s+(.+)", "\\1", keytext)
    keytext <- paste(keytext, valtext, sep = " ")
  }
  return(lasttext)
}

txt <- readLines(url("http://paulo-jorente.de/text/alice_oz.txt"), warn = FALSE)
txt <- paste(txt, collapse = " ")

set.seed(3141)
cat(markovtext(txt, 3, 300))
```
The created text reads:
anxious returned the Scarecrow It is such an uncomfortable feeling to know one is a crow or a man After the crows had gone I thought this over and decided I would try hard to get some brains By good luck you came along and pulled me off the stake and from what you say I am older than you and must
Although the text was automatically created with the simple rules stated above and doesn’t make any real sense, it reads very much like decent English!
Now, at the core of LLMs, very similar mechanisms are at work. Intuitively, those models also create text based on statistical distributions of word sequences… but of course, on steroids!
The basis are transformer neural networks, which operate with the so-called (self-)attention mechanism: the context of whole texts is processed in all directions (not only sequentially from left to right), the degrees of freedom are higher by several orders of magnitude (often billions of parameters, which represent the trainable weights of the neural net), and they are trained on huge amounts of text, literally many terabytes of it, mainly acquired by crawling the web.
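To give a flavour of what “(self-)attention” means computationally, here is a tiny, purely illustrative R sketch of scaled dot-product attention for four made-up token vectors. The dimensions and random weight matrices are assumptions for this toy example only, not how a real LLM is set up or trained:

```r
# Toy illustration of scaled dot-product (self-)attention -- not a real LLM!
set.seed(1)
d <- 3                                         # embedding dimension (toy value)
X  <- matrix(rnorm(4 * d), nrow = 4)           # 4 made-up token embeddings
Wq <- matrix(rnorm(d * d), nrow = d)           # random projection matrices
Wk <- matrix(rnorm(d * d), nrow = d)           # (in a real model these are learned)
Wv <- matrix(rnorm(d * d), nrow = d)

Q <- X %*% Wq; K <- X %*% Wk; V <- X %*% Wv    # queries, keys, values
scores  <- Q %*% t(K) / sqrt(d)                # how much each token "attends" to every other token
weights <- t(apply(scores, 1, function(r) exp(r) / sum(exp(r))))  # row-wise softmax
output  <- weights %*% V                       # each token becomes a weighted mix of all tokens

round(weights, 2)                              # attention weights: each row sums to 1
```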
The fascinating thing is that those systems scale exceptionally well and quantity translates into quality. Some speak of the “unreasonable effectiveness of data” which seems quite fitting in this context.
Yet most fascinating of all is that many tasks that demand human-level intelligence can obviously be reduced to some form of (statistical) text prediction with a sufficiently performant model!
On the other hand, it also shows us the limits of those models: just try to multiply two four-digit numbers with ChatGPT… and you will be surprised…