
Enrico Rubboli

essay Dec 1, 2015 3 min

Linear Regression in Go - Part 3

In the previous posts we talked about how to predict a continuous value using a linear function, and how to measure the error given a matrix of test data and a hypothesis vector of values $\theta$.

In this post we're describing gradient descent: a procedure that iteratively moves the vector $\theta$ toward its optimal values (or a local minimum).

Just to keep things simple, let's assume we have a vector $\theta$ with just two dimensions (the slope and intercept of a simple line). If we plot $\theta_0$, $\theta_1$ and the value of the cost function, we'll have something like:
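As a quick reminder of what we're minimizing, the squared-error cost from the previous post can be sketched in plain Go like this (function and variable names here are illustrative, not taken from the post's repository):

```go
package main

import "fmt"

// costJ computes the squared-error cost for a line with intercept theta0
// and slope theta1: J = 1/(2m) * sum((theta0 + theta1*x[i] - y[i])^2).
func costJ(theta0, theta1 float64, x, y []float64) float64 {
	m := float64(len(x))
	sum := 0.0
	for i := range x {
		d := theta0 + theta1*x[i] - y[i]
		sum += d * d
	}
	return sum / (2 * m)
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{2, 4, 6} // exactly y = 2x

	// A perfect fit gives zero cost; any other theta gives a positive cost.
	fmt.Println(costJ(0, 2, x, y))
	fmt.Println(costJ(0, 0, x, y) > 0)
}
```

Plotting this cost over a grid of $\theta_0$, $\theta_1$ values is what produces the bowl-shaped surface in the graph above.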

Cost Function Graph

Given any random starting point $\langle\theta_0, \theta_1\rangle$, the partial derivatives of the cost function give us the direction of steepest ascent, pointing away from the local minimum, so we can take a step of size $\alpha$ in the opposite direction. This is our gradient descent in mathematical terms:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

The size of the step we take toward the local minimum at every iteration, $\alpha$, is called the learning rate (or sometimes step size). For now, let's assume it's a fixed value, and the same goes for the number of iterations we need to perform in order to get to the local minimum.

Working out the partial derivatives gives the following:

$$\begin{align*} \text{repeat until convergence: } \lbrace & \\ \theta_0 := {} & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x_{i}) - y_{i}\right) \\ \theta_1 := {} & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x_{i}) - y_{i}\right) x_{i} \\ \rbrace & \end{align*}$$

or, in a more general way (if we assume $x_0^{(i)} = 1$):

$$\begin{align*} & \text{repeat until convergence: } \lbrace \\ & \quad \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} & \text{for } j := 0..n \\ & \rbrace \end{align*}$$
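Before vectorizing, the element-wise rule above can be sketched directly with plain Go slices. This is a hypothetical helper (not from the post's repository) that performs one simultaneous update of all $\theta_j$, assuming the bias term $x_0 = 1$ is already prepended to each row of X:

```go
package main

import "fmt"

// gradientStep returns theta after one simultaneous update following the
// element-wise rule: theta_j -= alpha/m * sum_i (h(x_i) - y_i) * x_ij.
// Each row of X is expected to start with the bias term x0 = 1.
func gradientStep(theta []float64, X [][]float64, y []float64, alpha float64) []float64 {
	m := float64(len(X))
	next := make([]float64, len(theta))
	for j := range theta {
		var grad float64
		for i, row := range X {
			// hypothesis h(x) = theta . x
			var h float64
			for k, v := range row {
				h += theta[k] * v
			}
			grad += (h - y[i]) * row[j]
		}
		// write into a fresh slice so all theta_j update simultaneously
		next[j] = theta[j] - alpha*grad/m
	}
	return next
}

func main() {
	X := [][]float64{{1, 1}, {1, 2}, {1, 3}}
	y := []float64{1, 2, 3} // the line y = x, so theta should approach <0, 1>
	theta := []float64{0, 0}
	for i := 0; i < 2000; i++ {
		theta = gradientStep(theta, X, y, 0.1)
	}
	fmt.Println(theta[0], theta[1])
}
```

Note that the new values are written into a separate slice: updating `theta` in place would let later components of the gradient see partially updated values, which breaks the "simultaneous update" required by the rule.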

But what we want to implement is a vectorized version of that formula, which is:

$$\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$$

Again, using vectors, the result is much more readable and easier to implement. We can finally get to our actual Go implementation:

// m = Number of Training Examples
// n = Number of Features
m, n := X.Dims()
h := mat64.NewVector(m, nil)
new_theta := mat64.NewVector(n, nil)
partials := mat64.NewVector(n, nil)

for i := 0; i < numIters; i++ {
    h.MulVec(X, new_theta)

    // h now holds (X*theta - y) / m: the prediction error scaled by 1/m
    for el := 0; el < m; el++ {
        val := (h.At(el, 0) - y.At(el, 0)) / float64(m)
        h.SetVec(el, val)
    }

    // partials = X^T * h, the gradient of the cost function
    partials.MulVec(X.T(), h)

    // Update theta values
    for el := 0; el < n; el++ {
        new_val := new_theta.At(el, 0) - (alpha * partials.At(el, 0))
        new_theta.SetVec(el, new_val)
    }
}

h is a *mat64.Vector and numIters is a constant int. You can find the full implementation here, and tests here.

A good visual explanation is in the following video by Prof. Alexander Ihler:

Edit (Feb 6, 2016): I've fixed the code so now h and new_theta are of type Vector instead of DenseMatrix.