Build a Tiny LLM in Go, Part 2: The Wall That Counting Hits

Part 1 left us with a model that writes almost-words and then wanders, because it remembers exactly one letter of what it has written. The fix seems obvious: remember more. So let us try it, honestly, and see how far it gets before it falls off a cliff.

Remembering two letters, then three

The bigram model kept a table indexed by one character: for each of our seventy characters, a row of counts for what came next. To remember two letters, we index the table by pairs of characters instead. Now the question is not “what follows t” but “what follows th”, and the answer for th is dominated by e, because “the” is the most common word in English.

This genuinely works better. “What follows th” is a sharper question than “what follows t”, so the guesses are better and the text holds together for a beat longer. Three letters is better still: “what follows the” points hard at a space. Every letter of context you add makes the prediction less of a shrug.

So we should just keep going. Remember ten letters, remember twenty, and the model should get smarter and smarter. It should. The problem is not that it stops helping. The problem is the size of the table.

Do the arithmetic

The one-letter table had one row per character: seventy rows. The two-letter table needs one row per pair of characters. How many pairs are there? Seventy choices for the first letter, seventy for the second, so seventy times seventy, which is 4,900 rows. Still fine.

Three letters: seventy times seventy times seventy, about 343,000 rows. Getting large, but a computer will not blink.

The pattern is that each extra letter of memory multiplies the number of rows by seventy. That is the entire problem in one sentence. It is the single function at the heart of this part, and it is short enough to read in full:

// NGramTableSize returns how many rows a full lookup table would need to
// remember the previous n characters: vocabSize^n. Returned as float64 because
// the true count overflows int64 almost immediately, which is exactly the
// point. Counting cannot scale to real context; the model must learn instead.
func NGramTableSize(vocabSize, n int) float64 {
	return math.Pow(float64(vocabSize), float64(n))
}

Multiplying by seventy over and over is exponential growth, and exponential growth is the most underestimated force in computing. Run the demo and watch it happen:

go run ./cmd/stage2_context

context | table rows needed
      1 | 70
      2 | 4.9e+03
      3 | 3.43e+05
      5 | 1.68e+09
     10 | 2.82e+18
     20 | 7.98e+36

Ten characters of memory, which is not even one long word, needs 2.82 billion billion rows. Twenty characters, a short phrase, needs a number with thirty-seven digits: roughly 8 followed by thirty-six zeros. Your whole body contains only around 10^27 atoms. A lookup table that remembers a twenty-character phrase would need billions of times more rows than you have atoms, and the demo says as much when you run it.

And remember what the target is. The model we finish the series with has a context window of 128 characters. A counting table for that would need 70^128 rows, a number so large that writing it out would take longer than the paragraph you are reading. The counting approach does not get expensive. It becomes physically impossible, and it does so almost immediately.

Two ways it fails, not one

The explosion is not only about storage. It hides a second, quieter failure.

Suppose you somehow had the storage. To fill in the row for a ten-character context like “the cat sa”, you would need to have seen those exact ten characters in your training text, many times, to get a reliable count of what follows. But most ten-character strings never appear even once in any book, because language is endlessly recombinable. So almost every row in your gigantic table would be blank, and a blank row tells you nothing. The table is not just too big to store. It is too big to ever fill.

This is the wall. Counting works beautifully for one or two characters and is finished as an idea by the time you want to remember a word. You cannot memorize your way to language, because language has more possible contexts than the universe has room for, and you will never see most of them twice.

The escape hatch

Here is the shift that everything after this depends on, and it is genuinely a change in kind, not degree.

The counting table treats every context as unrelated. The row for “the cat sa” and the row for “the dog sa” are completely separate cells that share nothing, even though any human can see they should predict almost the same next letter. Counting has no notion that two contexts might be similar. Each one is just an address in a table.

What if, instead of a table with a row for every possible context, we had a smaller set of numbers that could look at any context and compute a prediction? Numbers that capture that “the cat sa” and “the dog sa” are near each other, so that learning about one teaches you something about the other? Then we would not need a row for every context. We would need enough numbers to represent the patterns in language, and there are vastly fewer patterns than there are contexts.

Those numbers have a name. They are called weights, and finding good values for them is called learning. A model with a few thousand weights can answer “what comes next” for contexts it has never seen, by generalizing from patterns rather than looking up an exact match. That is the difference between memorizing and learning, and it is the difference between a lookup table and a neural network.

But this raises a hard question we have dodged so far. If the model is not just counting, if it is a pile of numbers that compute a guess, how do we find the right numbers? There are thousands of them. You cannot try every combination. You cannot count them. You have to somehow steer them, from random junk toward values that make good predictions.

That steering is called gradient descent, and it is the real engine inside every neural network. In Part 3 we build it: what it means for a model to be “wrong” as a single number, and how to nudge thousands of weights, all at once, in the direction of being a little less wrong. It is the one part of the series where we look the math in the eye. It is also the part where our program stops counting and starts to learn.

Code for this part is in cmd/stage2_context at github.com/erubboli/go-tiny-llm.