Build a Tiny LLM in Go, Part 3: What Learning Actually Means

Part 2 ended at a wall. Counting cannot scale, because a table that remembers real context is larger than the universe. The way out was a promise: replace the giant table with a small pile of numbers, called weights, that compute a prediction and can generalize to contexts they have never seen. This part is about the only hard question that pile of numbers raises. There are thousands of them, they start as random junk, and we need to find good values. How?

The answer is the beating heart of every neural network, and it has three moving parts: a way to measure how wrong the model is, a way to know which direction reduces that wrongness, and the patience to take a small step in that direction a few thousand times. That is all training is. Let us build each part.

Wrongness as a single number

You cannot improve what you cannot measure, so first we need to turn “the model made a bad guess” into a number. That number is called the loss, and the smaller it is, the better the model.

Recall the one question the model answers: given the text so far, what comes next? It does not answer with a single letter. It answers with a confidence for every possible letter: 60% sure it is a space, 15% sure it is e, and so on across all seventy characters. Now suppose the letter that actually came next was e. The model gave e only 15% confidence. It was not wrong exactly, but it was underconfident about the truth, and we want to punish that.

The standard way to score this is called cross-entropy, and the idea behind it is simple even though the name is not. Look at the probability the model assigned to the character that actually came next, and take the negative logarithm of it. If that probability is high, near 1, the loss is near zero: the model was confident and correct. If that probability is low, the loss is large: the model spent its confidence on the wrong characters. Averaged over many predictions, this gives one number for how surprised the model is by reality. Training is the search for weights that minimize that surprise.

In the repo this is one function, CrossEntropy, and the number it hands back is the single quantity the entire training process is trying to push down.
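As a rough illustration of the scoring rule, here is a minimal sketch that averages the negative log-probability of the true next character. The name crossEntropySketch, the slice-of-slices layout, and the three-character vocabulary are assumptions made for this example; the repo's CrossEntropy may be shaped quite differently.

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropySketch returns the average negative log-probability that the
// model assigned to the characters that actually came next.
// probs[i] is the predicted distribution for example i, and targets[i] is
// the index of the true next character.
// Hypothetical helper for illustration, not the repo's CrossEntropy.
func crossEntropySketch(probs [][]float64, targets []int) float64 {
	total := 0.0
	for i, p := range probs {
		total += -math.Log(p[targets[i]])
	}
	return total / float64(len(probs))
}

func main() {
	// Toy vocabulary (an assumption): index 0 = ' ', 1 = 'e', 2 = 't'.
	confident := [][]float64{{0.05, 0.90, 0.05}}
	unsure := [][]float64{{0.60, 0.15, 0.25}}
	target := []int{1} // the true next character is 'e'

	fmt.Printf("90%% on the truth: %.4f\n", crossEntropySketch(confident, target))
	fmt.Printf("15%% on the truth: %.4f\n", crossEntropySketch(unsure, target))
}
```

With these made-up numbers the confident prediction scores about 0.105 and the 15% prediction about 1.897, so the same true character costs roughly eighteen times more when the model hedged away from it.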

Which way is downhill

Now the real question. We have thousands of weights and one number, the loss, that depends on all of them. We want to change the weights to make the loss smaller. But we cannot try every combination, because there are more combinations than atoms in the universe: the same wall from Part 2 in a new outfit.

Here is the trick, and it is worth slowing down for, because it is the whole game. For each individual weight, we can ask a local question: if I nudge this one weight up a tiny bit, does the loss go up or down, and by how much? That number, the rate at which the loss changes as you wiggle one weight, is called the gradient with respect to that weight. It is a slope. A positive slope means “increasing this weight makes things worse, so decrease it”. A negative slope means “increasing this weight helps, so increase it”. The size of the slope tells you how much this particular weight matters right now.

Compute that slope for every weight, and you know, for all thousands of them at once, which way to move: stepping against the slopes is the direction that most reduces the loss. So you take a small step and move every weight a little bit against its slope. Nudge the harmful ones down, the helpful ones up, each in proportion to how much it matters. The loss drops a little. Do it again. And again, a few thousand times.

That is gradient descent. The mental picture that never fails is a ball on a hilly landscape, where altitude is the loss and your position is the current setting of all the weights. The gradient points uphill; you step downhill; you repeat; the ball rolls into a valley where the loss is low. The learning rate is how big a step you take. Too big and you bound across the valley and overshoot. Too small and you creep down over an age.

$w \leftarrow w - \alpha \cdot \frac{\partial L}{\partial w}$

That line is the entire update: each weight $w$ moves against its own slope $\frac{\partial L}{\partial w}$, scaled by a learning rate $\alpha$. Every neural network ever trained, including the ones that cost hundreds of millions of dollars, is running that line in a loop.
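To see the update line and the learning rate at work, here is a small sketch on a one-weight landscape. The loss $(w-3)^2$, its hand-derived slope, and the three learning rates are all assumptions chosen for illustration; nothing here comes from the repo.

```go
package main

import "fmt"

// loss is a toy one-weight landscape with its minimum at w = 3.
func loss(w float64) float64 { return (w - 3) * (w - 3) }

// grad is the slope of loss at w, derived by hand: d/dw (w-3)^2 = 2(w-3).
func grad(w float64) float64 { return 2 * (w - 3) }

// descend applies w <- w - alpha * dL/dw for the given number of steps.
func descend(w, alpha float64, steps int) float64 {
	for i := 0; i < steps; i++ {
		w -= alpha * grad(w)
	}
	return w
}

func main() {
	// Same starting point and step count, three different learning rates.
	for _, alpha := range []float64{0.01, 0.1, 1.1} {
		w := descend(0, alpha, 50)
		fmt.Printf("alpha=%.2f  w=%.4g  loss=%.4g\n", alpha, w, loss(w))
	}
}
```

On this landscape each step multiplies the distance to the minimum by $1 - 2\alpha$. With $\alpha = 0.1$ the weight lands very close to 3 after fifty steps, with $\alpha = 0.01$ it is still well short, and with $\alpha = 1.1$ every step overshoots by more than it corrects, so the weight flies off.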

The engine that computes the slopes

There is one catch, and it is the reason this part exists at all. Computing the slope of the loss with respect to one weight is easy. Computing it with respect to every weight, when the loss is the end of a long chain of multiplications, additions, and nonlinear squashings, is fiddly and desperately easy to get wrong by hand.

The earlier article on this site, neural networks and backpropagation in Go, works that computation out by hand for a small network, deriving every gradient with the chain rule and coding it directly. It is worth reading if you want to see the calculus in full. It also ends by admitting the obvious problem: doing that by hand does not scale. For a real model with attention and many layers, hand-derived gradients are a nightmare of bookkeeping, and one sign error anywhere silently poisons the whole thing.

So instead of deriving gradients by hand, we build a small machine that derives them for us. It is called an automatic differentiation engine, or autograd, and the idea is elegant. Every number in the model is wrapped in a little object that remembers not just its value but also how it was computed: which numbers it came from, and by what operation. As the model computes its prediction, these objects link up into a graph recording the entire calculation. Then, to get all the gradients, you walk that graph backward from the loss, and at each step you apply the one local rule for that operation. Addition passes the slope through unchanged to each of its inputs. Multiplication hands each input the slope scaled by the other factor. That is the chain rule, applied mechanically, node by node, all the way back to the weights.

The whole engine is about a hundred lines of Go. Its core is a single type that holds a value, a slot for its gradient, and a closure that knows how to push that gradient back to whatever produced it:

// Tensor is one node in the computation graph. It holds a 2D matrix of values
// (Data) and the gradient of the final loss with respect to each value (Grad).
// This is "micrograd, but the value is a matrix": every operation records a
// backward closure that pushes gradient from this node to the nodes it was
// built from. Backward() runs them in reverse.
type Tensor struct {
	Data []float64
	Grad []float64
	Rows int
	Cols int

	backward func()
	parents  []*Tensor
}

You build a prediction by combining tensors with operations like matrix multiply and add. Each operation, as a side effect, records how to send gradient backward. When you finally call Backward() on the loss, the engine visits every node in reverse order and every weight ends up with its slope filled in, ready for that one update line above. No calculus by hand. The engine does the chain rule for you, correctly, every time.
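The mechanism is easier to see with plain numbers instead of matrices. The following is a scalar sketch of the same idea, not the repo's code: the value type and the leaf, add, mul, and backprop functions are hypothetical names invented for this example, and the real Tensor operations in tensor.go and ops.go may be organized differently.

```go
package main

import "fmt"

// value is a scalar stand-in for a Tensor: a number, a slot for its
// gradient, and a closure that pushes that gradient to its parents.
// All names here are hypothetical, not the repo's API.
type value struct {
	data, grad float64
	backward   func()
	parents    []*value
}

// leaf wraps a plain number, such as a weight or an input.
func leaf(x float64) *value { return &value{data: x, backward: func() {}} }

// add: d(a+b)/da = 1 and d(a+b)/db = 1, so each input receives the full
// upstream gradient.
func add(a, b *value) *value {
	out := &value{data: a.data + b.data, parents: []*value{a, b}}
	out.backward = func() {
		a.grad += out.grad
		b.grad += out.grad
	}
	return out
}

// mul: d(a*b)/da = b and d(a*b)/db = a, so each input receives the upstream
// gradient scaled by the other factor.
func mul(a, b *value) *value {
	out := &value{data: a.data * b.data, parents: []*value{a, b}}
	out.backward = func() {
		a.grad += b.data * out.grad
		b.grad += a.data * out.grad
	}
	return out
}

// backprop orders the graph so every node comes after its parents, then
// runs the closures in reverse, starting from the output with gradient 1.
func backprop(root *value) {
	var order []*value
	seen := map[*value]bool{}
	var visit func(v *value)
	visit = func(v *value) {
		if seen[v] {
			return
		}
		seen[v] = true
		for _, p := range v.parents {
			visit(p)
		}
		order = append(order, v)
	}
	visit(root)
	root.grad = 1
	for i := len(order) - 1; i >= 0; i-- {
		order[i].backward()
	}
}

func main() {
	w, x, b := leaf(2), leaf(3), leaf(1)
	y := add(mul(w, x), b) // y = w*x + b
	backprop(y)
	fmt.Println(y.data, w.grad, x.grad, b.grad) // prints: 7 3 2 1
}
```

The gradients accumulate with += rather than being assigned, so a number that feeds into the result along several paths collects the contribution from each path.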

How do you know it is correct, when the whole point was that hand-derived gradients are error-prone? You check the engine against reality. For any weight, you can estimate its true slope the brute-force way: nudge it up a hair, see how much the loss changed, and divide the change by the size of the nudge. If the engine’s gradient and this measured slope disagree, the engine has a bug. Every operation in the repo ships with exactly this check as a test, which is why the math can be trusted even though it was written by hand.
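A gradient check of this kind might look like the sketch below. It is an illustration under assumptions, not the repo's test: the toy loss f, the hand-written analyticGrad standing in for the engine's output, and the helper names are all invented here. It also nudges each weight both up and down (a central difference) rather than only up, which is a common variant that gives a tighter estimate.

```go
package main

import (
	"fmt"
	"math"
)

// f is a toy loss over two weights: f(w) = w0*w1 + w1^2.
func f(w []float64) float64 { return w[0]*w[1] + w[1]*w[1] }

// analyticGrad is the gradient of f derived by hand. In this sketch it
// plays the role of the gradient an autograd engine would report.
func analyticGrad(w []float64) []float64 {
	return []float64{w[1], w[0] + 2*w[1]}
}

// numericGrad estimates each slope by nudging one weight up and down by
// eps and dividing the change in loss by the total nudge, 2*eps.
func numericGrad(loss func([]float64) float64, w []float64, eps float64) []float64 {
	g := make([]float64, len(w))
	for i := range w {
		orig := w[i]
		w[i] = orig + eps
		up := loss(w)
		w[i] = orig - eps
		down := loss(w)
		w[i] = orig // restore before moving to the next weight
		g[i] = (up - down) / (2 * eps)
	}
	return g
}

// maxAbsDiff returns the largest element-wise disagreement.
func maxAbsDiff(a, b []float64) float64 {
	m := 0.0
	for i := range a {
		m = math.Max(m, math.Abs(a[i]-b[i]))
	}
	return m
}

func main() {
	w := []float64{1.5, -2.0}
	diff := maxAbsDiff(analyticGrad(w), numericGrad(f, w, 1e-5))
	fmt.Printf("max disagreement: %.2e\n", diff)
}
```

A test built on this would fail whenever the disagreement exceeds a small tolerance. Introducing a deliberate sign error into analyticGrad is a quick way to confirm that the check really catches bugs.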

Watching it learn

With loss, gradients, and the update loop in hand, we can train the first model that genuinely learns rather than counts: a small network that takes a few characters of context, mixes them through a layer of weights, and predicts the next character. The full training loop is short. Compute the prediction, compute the loss, call Backward() to fill in every gradient, take one step downhill, repeat.
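The shape of that loop can be sketched without the autograd engine. The example below is a deliberately smaller model than the one in the repo: a single table of scores that predicts the next character from the current one, with the softmax and cross-entropy gradient written out by hand instead of obtained from Backward(). The names softmax and trainSketch, the training text, and the hyperparameters are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// softmax turns raw scores into probabilities that sum to 1.
func softmax(logits []float64) []float64 {
	maxv := logits[0]
	for _, v := range logits {
		maxv = math.Max(maxv, v)
	}
	out := make([]float64, len(logits))
	sum := 0.0
	for i, v := range logits {
		out[i] = math.Exp(v - maxv) // subtract the max for numerical stability
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// trainSketch fits a V x V table of scores, where row x scores every
// candidate next character after x. It returns the loss measured before
// each step. The text must contain at least two bytes.
func trainSketch(text string, steps int, alpha float64) []float64 {
	// Build a tiny vocabulary from the bytes that appear in text.
	index := map[byte]int{}
	for i := 0; i < len(text); i++ {
		if _, ok := index[text[i]]; !ok {
			index[text[i]] = len(index)
		}
	}
	v := len(index)
	// All-zero scores: the model starts by guessing uniformly.
	w := make([][]float64, v)
	for i := range w {
		w[i] = make([]float64, v)
	}
	n := float64(len(text) - 1)
	losses := make([]float64, 0, steps)
	for s := 0; s < steps; s++ {
		grad := make([][]float64, v)
		for i := range grad {
			grad[i] = make([]float64, v)
		}
		loss := 0.0
		for i := 0; i+1 < len(text); i++ {
			x, y := index[text[i]], index[text[i+1]]
			// Compute the prediction, then the loss.
			p := softmax(w[x])
			loss += -math.Log(p[y]) / n
			// Slope of the loss for each score: p[j], minus 1 for the truth.
			for j := range p {
				g := p[j]
				if j == y {
					g -= 1
				}
				grad[x][j] += g / n
			}
		}
		losses = append(losses, loss)
		// Take one step downhill.
		for i := range w {
			for j := range w[i] {
				w[i][j] -= alpha * grad[i][j]
			}
		}
	}
	return losses
}

func main() {
	losses := trainSketch("the cat sat on the mat. the cat sat on the mat.", 200, 1.0)
	fmt.Printf("first %.4f  last %.4f\n", losses[0], losses[len(losses)-1])
}
```

Because this sketch uses the whole text at every step, its loss falls smoothly. The repo's actual stage, run below, uses the autograd engine and a small network, so its numbers behave differently.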

go run ./cmd/stage3_mlp

step    0  loss 4.2904
step  200  loss 3.9708
step  400  loss 4.4864
step  600  loss 3.4800
step  800  loss 3.4600
step 1000  loss 3.9303
step 1200  loss 3.0483
step 1400  loss 2.4551
step 1600  loss 2.1318
step 1800  loss 2.8701

That bouncing, falling number is a model learning, live. It starts near 4.3, which is the loss of pure guessing among seventy characters, the number you get when the model knows nothing. It does not descend in a clean line: each step sees a different random slice of the text, so the loss jitters and even climbs for a stretch. But the trend is unmistakable, and within a couple of thousand steps it has roughly halved. Nobody told the model any rule. It found, by rolling downhill a few thousand times, weights that make English less surprising than random noise.
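The starting value can be checked by hand. Assuming the untrained model spreads its confidence evenly, so the true character gets probability $1/70$, and assuming the loss uses the natural logarithm:

$L = -\ln\frac{1}{70} = \ln 70 \approx 4.25$

The 4.29 logged at step 0 sits just above this, which is consistent with random initial weights that are close to uniform but not exactly so. Half of that starting 4.29 is about 2.15, close to the 2.13 logged at step 1600.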

This is the engine. Everything left in the series is about giving it a better body to work with. The little network we just trained looks at a fixed, tiny window of characters and treats them as an undifferentiated blob. It has no way to notice that in “the cat sat”, the word “cat” a few characters back is what makes “sat” likely, while in “the dog ran”, it is “dog” that matters. It cannot let the right earlier characters reach forward and influence the prediction.

Giving the model that ability, letting each position look back and decide which earlier positions matter, is the single idea that turned neural networks into large language models. It is called attention, and it is the subject of Part 4.

Code for this part is in cmd/stage3_mlp, with the autograd engine in tensor.go and ops.go, at github.com/erubboli/go-tiny-llm.