Build a Tiny LLM in Go, Part 1: Predicting the Next Letter
Here is a program that has read a book and learned to write. This is its output, generated one character at a time:
the. Tho Whele wre ig ousad thelyilertousasuty thangle the are
hin whadouthe wad wag ss werst stheane tiot, she sar rir, s, s y e
It is gibberish. But look closer. The spaces fall in plausible places. The letter pairs are the kind English actually uses: the, she, are, wag. There is not a single qz or wkx in sight. This program has never been told what a word is, never seen a dictionary, never had a rule of grammar explained to it. It has done exactly one thing: it counted.
That counting is the seed of every large language model. GPT and Claude are, at their core, doing a more elaborate version of what this fifty-line program does. Over the next five articles we will build the elaborate version, in Go, from nothing but the standard library, and publish the whole working thing as a repository you can run. But the honest place to start is here, with the dumbest model that already sort of works, because once you see what it gets right and the single thing it gets wrong, the rest of the series is just fixing that one flaw.
If you have read the earlier piece on neural networks and backpropagation in Go, you have the machinery for the later parts already. You do not need any of it yet. Part 1 is just counting.
The only question a language model ever answers
Strip away everything and a language model answers one question, over and over: given the text so far, what comes next?
That is the entire job. Feed it “the cat sat on the ”, and a good model puts high odds on “mat” and low odds on “helicopter”. To write a sentence, you ask the question, take the answer, add it to the text, and ask again. The model never plans a sentence. It just keeps guessing the next piece, and the pieces add up to something that reads like language.
We are going to work one character at a time rather than one word at a time. So the question becomes narrower: given the letter I just wrote, what letter comes next? A model that answers that is called a bigram model, and “bigram” just means “two letters in a row”.
Counting is a model
Take a pile of English text. Walk through it and, for every pair of adjacent characters, keep a tally. How many times does h follow t? How many times does e follow h? How many times does a space follow e?
After a few hundred kilobytes of text you have a big table. One row per character, one column per character, and each cell holds a count: how often the column-character followed the row-character. The row for t will have a fat number in the h column, because English is full of “th”. The row for q will have almost everything in the u column, because q is nearly always followed by u.
That table is the model. To predict what follows t, you look at the t row and see which characters showed up most. To generate text, you do the same thing but with a coin toss weighted by the counts: if h followed t 900 times and e followed it 100 times, then nine times out of ten you pick h.
There is no training in the sense people usually mean. Nothing is being optimized. You make one pass over the text, adding one to a cell each time you see a pair, and you are done.
The whole thing in Go
First we need to turn characters into numbers, because a table is indexed by integers, not letters. That is the job of a tokenizer, and at the character level it is almost trivial: assign every allowed character an id, and keep a map back. Our whole vocabulary is about seventy characters (the letters, digits, a space, and a little punctuation), so id 0 might be a, id 1 might be b, and so on.
With that in place, the counting model is a square table and two loops. Here is the real thing from the repo, unedited:
// Bigram is the simplest possible language model: a table counting how often
// each character follows each other character. No learning, no math, just
// tallies. It already "works", which is the surprise. Its flaw: it only ever
// looks one character back.
type Bigram struct {
counts [][]float64 // counts[a][b] = times b followed a
size int
}
func (b *Bigram) Train(ids []int) {
for i := 0; i+1 < len(ids); i++ {
b.counts[ids[i]][ids[i+1]]++
}
}
Train is the entire learning procedure. Walk the text, and for each character at position i, add one to the cell for “the character at i+1 followed the character at i”. That is it. A production language model’s training loop runs for weeks on thousands of machines. This one runs in the time it takes to read a file.
Generating text is the reverse. Start with some character, look at its row of counts, and pick the next character with probability proportional to those counts:
// Sample generates n characters starting from `start`, each drawn in proportion
// to how often it followed the previous character in the training text.
func (b *Bigram) Sample(rng *rand.Rand, start, n int) []int {
out := make([]int, 0, n)
cur := start
for i := 0; i < n; i++ {
nxt := b.sampleRow(rng, b.counts[cur])
out = append(out, nxt)
cur = nxt
}
return out
}
The “proportional to the counts” part is a weighted coin toss: sum the row, pick a random point along that total, and walk the row until you land on a character. Common followers win most of the time, rare ones win occasionally, and characters that never once appeared after cur never appear now either. That last property is why the output has no qz: if the pair never happened in the book, its count is zero, and zero-count characters can never be chosen.
You can run this yourself. The repo ships with the text already bundled:
go run ./cmd/stage1_bigram
What it got right, and the one thing it cannot do
Look again at the sample from the top: the. Tho Whele wre ig ousad. The model has, purely from counting, discovered a surprising amount:
- Letter frequencies. It uses
e,t,a,oconstantly andz,q,xalmost never, matching English. - Plausible pairs. Every two-letter combination it produces is one that actually occurs in the text. It never writes
bkorfq. - Word-ish rhythm. Spaces land at believable intervals, so the output breaks into chunks the length of real words.
That is a lot to get from a program that does not know what a word is. And yet it will never write a real sentence, and the reason is baked into its single line of logic. Look at Sample again: to pick the next character it looks at b.counts[cur], and cur is only ever the one character it just produced. The model’s entire memory of everything it has ever written is a single letter.
So when it writes t, it knows h often follows t, and it might write h. Now its memory is h, and the t is already forgotten. It cannot know it is three letters into “the”. It cannot know it is halfway through a word, or a sentence, or a quotation. Every decision is made by a model with amnesia, remembering only the last thing it did.
That is why the output is locally plausible and globally nonsense. Each pair of letters is fine. Strung together, they wander, because nothing carries context forward.
Here is the trap, and it is the whole reason this series exists. The obvious fix is: remember more than one letter. Instead of “what follows t”, ask “what follows sa”, or “what follows the ca”. Keep a bigger table, look further back, and surely the text gets better.
It does get better. It also, very quickly, becomes impossible. In Part 2 we will try exactly this fix, watch it work, and then watch the table explode past the number of atoms in your body before we have remembered even a short phrase. That explosion is the wall that counting hits, and getting over it is the moment a model stops memorizing and starts to learn.
The full code for this part, and every part to come, is at github.com/erubboli/go-tiny-llm. Part 1 lives in cmd/stage1_bigram.