Implementation of a Modern Machine Learning Library

If you want to understand how a modern machine learning library works, there is no better place to start than Flux.


I could have written about better-known machine learning libraries such as TensorFlow, Keras or PyTorch. But the benefit of talking about Flux is that it is a fairly small and modern machine learning library.

Flux is very small because it can be. Flux is trivially extended with custom user code, which is not possible with most existing machine learning libraries. That is why they are so large: they must ship a lot of built-in functionality, because adding custom functions is not trivial.

For beginners who want to understand machine learning properly, this is a major benefit. Flux has a much smaller and simpler implementation than the competition, which makes it actually feasible for mortals to look at the source code and understand it.

It may seem mysterious how Flux can do this while other libraries cannot. The magic is really just the Julia programming language. It is Julia that really does all the heavy lifting and makes it possible to write small machine learning libraries.

Before getting into the details, let's take a bird's-eye view and talk about the different tasks an ML library has to carry out.

It has to provide some facilities to do the following:

  1. Define a mathematical model and the parameters of that model.
  2. Define a loss/error/cost function. A loss function calculates the difference between the desired output of the model and the actual prediction the model makes.
  3. Calculate the gradient of the loss function with respect to its parameters, the same parameters used to define the model.
  4. Define different optimizers, or training strategies.
  5. Provide a training function which tries to minimize the loss function by adjusting the model parameters using the calculated gradient and the optimizer.
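To make these five tasks concrete, here is a from-scratch sketch of all of them for a tiny linear model. The gradient of the squared-error loss is worked out by hand instead of by Flux's automatic differentiation, and all the names (`train_demo`, `η` and so on) are my own illustration, not Flux API.

```julia
# A from-scratch sketch of the five tasks for a linear model.
# The gradient of the squared-error loss is computed by hand here;
# Flux would derive it automatically.
function train_demo()
    W = rand(1, 3); b = rand(1)              # 1. model parameters
    model(x) = W * x .+ b                    #    the model itself
    loss(x, y) = sum(abs2, model(x) .- y)    # 2. loss function
    η = 0.01                                 # 4. optimizer: plain gradient descent
    x = rand(3, 8); y = rand(1, 8)           # a batch of 8 inputs and outputs
    first_loss = loss(x, y)
    for _ in 1:200                           # 5. training loop
        e = model(x) .- y
        W -= 2η .* (e * x')                  # 3. gradient w.r.t. W
        b -= 2η .* vec(sum(e, dims = 2))     #    gradient w.r.t. b
    end
    return first_loss, loss(x, y)
end

before, after = train_demo()                 # the loss should have shrunk
```

Everything Flux adds is about automating steps 2 and 3 and making steps 4 and 5 pluggable.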

What all of this means may not be obvious to you, so let me cover this in more detail.

A mathematical model is a model defined in mathematics, or in our case in code. Its purpose is much the same as that of a physical model.

To describe the concept or idea of models I like to use an example from how the Palm Pilot was developed. It was a PDA, and for those of you too young to know what a PDA is, you can think of it as a smartphone without the ability to call someone.

One of the first models the Palm company made of it was just a block of wood. How is that a suitable model? It cannot do anything, or can it?

They had multiple blocks of wood in different shapes. What they wanted to test was how well it fit in your jeans pocket. Employees would walk around with one of these blocks of wood in their pocket all day.

The idea was to figure out what was the optimal shape and size. A shape and size which was comfortable for people to wear in their pocket.

So this is a key point to keep in mind about models: you only include the properties that help you answer the question you want answered. In this case the question was: "Is it comfortable to wear in my pocket the whole day?"

To answer that question we only need the exterior dimensions. Mathematical models are the same. Say we want to model a solar system. The question then is: what do we want answered about this solar system?

If what we want to know is the planetary orbits, then the mass and position of each planet may be the only properties we need to model the solar system.

However, if you want to model the temperature on each planet, you need far more properties.

In the Flux machine learning library, the model is simply a function which takes one input. The input is a matrix which needs to be organized in a particular way: every row represents a different property. In machine learning we refer to these properties as features. Another way to look at it is that you can simulate a function with multiple input arguments by treating each row as a separate argument.

Instead of evaluating the model three times with three different inputs, you can evaluate it once by concatenating the inputs horizontally. So basically every column is a separate input.

Here is a very simple model defined in Flux:

W = rand(2, 5)   # weights
b = rand(2)      # bias

model(x) = W*x .+ b
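To see the row/column convention in action, we can feed a model like this a single column and then a three-column batch. The sizes below simply follow from W being 2×5; nothing here is Flux-specific.

```julia
W = rand(2, 5)
b = rand(2)
model(x) = W * x .+ b

x1 = rand(5)          # one input: 5 features in one column
model(x1)             # one output vector of length 2

batch = rand(5, 3)    # three inputs side by side, one per column
y = model(batch)      # evaluates all three at once
size(y)               # (2, 3): one output column per input column
```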

Of course, real models are more complicated. Here is an example of a deep neural network (DNN) being defined:

model = Chain(
    Dense(784, 64, relu),
    Dense(64, 32, relu),
    Dense(32, 10)
)

From the user's perspective, this still ends up creating just a function, model, which you can supply a matrix to as an argument.

The parameters of the model are the variables you can adjust to change the behavior of your model. In our first example, the parameters were W and b. W is referred to as the weights, and will typically be a matrix as well. b is referred to as the bias. The bias is not affected by the input.

Flux.train!(loss, params, data, optimizer)

We will cover the training function in more detail later. But the training function is what reads input data and starts tweaking your parameters, modifying the model so that it produces the correct output.

To be able to do that, we need to tell Flux which parts of our model are the parameters. This is done with the Flux.params function.

params = Flux.params(W, b)
Flux.train!(loss, params, data, optimizer)

When using pre-made building blocks such as Dense to build a more complex model, you can simply pass the model itself to Flux.params to get its parameters.

params = Flux.params(model)

How to make this possible for custom models built from scratch I will not cover here. For custom models you know the parameters anyway.

Let us look at how Flux implements model parameters.

Params contains an array, order, holding the list of parameters added to it. The params member is primarily used to check that a parameter (typically an array object) has not already been added to the Params object.

struct Params
    order::Buffer{Any, Vector{Any}}
    params::IdSet{Any}
    Params() = new(Buffer([], false), IdSet())
end

You can see that adding a value x to a Params object just means adding it to the order array, which maintains the order in which elements were added. params is a set which only exists to avoid adding the same element twice to the order array.

function Base.push!(ps::Params, x)
    if !(x in ps.params)
        push!(ps.order, x)
        push!(ps.params, x)
    end
    return ps
end
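To see this identity-based deduplication in action without pulling in Zygote's Buffer and IdSet types, here is a simplified stand-in built from Base types only. SimpleParams is my own name for the sketch; only the push! logic mirrors the real code.

```julia
# Simplified stand-in for Zygote's Params, using only Base types.
struct SimpleParams
    order::Vector{Any}           # parameters in insertion order
    seen::IdDict{Any, Nothing}   # identity-based membership check
    SimpleParams() = new(Any[], IdDict{Any, Nothing}())
end

function Base.push!(ps::SimpleParams, x)
    if !haskey(ps.seen, x)       # compares by identity, not by value
        push!(ps.order, x)
        ps.seen[x] = nothing
    end
    return ps
end

W = rand(2, 5)
ps = SimpleParams()
push!(ps, W)
push!(ps, W)             # the very same array: skipped
push!(ps, copy(W))       # equal values but a different object: added
length(ps.order)         # 2
```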

This is the more typical usage of Params: you create a Params object and add elements to it in order.

Params(xs) = push!(Params(), xs...)

We set up parameters in Flux using the Flux.params call:

function params(m...)
    ps = Params()
    params!(ps, m)
    return ps
end

But it does not really do much beyond what Zygote.Params already provides.

The parameters of our model are usually referred to as weights in machine learning. The learning strategy is often referred to as an optimizer.

The reason for that is that what we are really trying to do is optimize a function. So the optimizer is a strategy for how to optimize a function, in our case the loss function.

We want to find the weights that give the lowest value of the loss function. Remember, the loss function expresses the error, the difference between what our model predicts the output should be for a given input and what output the real world tells us the model should produce. We want this error to be as small as possible.

For this reason, when we find the gradient (derivative) of the loss function with respect to one or more weights, we want to move towards the minimum of the function, not towards the maximum. That is why we subtract.

function update!(opt, x, x̄)
    x .-= apply!(opt, x, x̄)
end

You can see this function takes an optimizer opt, one or more weights x, and a change (gradient) x̄ which would cause the function value to grow. We want the negative of this, since we want it to shrink.

In Flux, every optimizer defines an apply! function. Let us look at how gradient descent works as an example. Gradient descent is based on moving in the opposite direction of the gradient with a learning rate η. The learning rate lets us move faster or slower towards the minimum.

It is defined as follows:

mutable struct Descent
    eta::Float64
end

function apply!(o::Descent, x, Δ)
    Δ .*= o.eta
end

If we inline this apply! call into update!, it may be easier to see how it works.

function update!(opt, x, x̄)
    x .-= (x̄ .*= opt.eta)
end

But you will likely find it easier when written over two lines:

function update!(opt, x, x̄)
    x̄ .*= opt.eta
    x .-= x̄
end

We can even simplify it further, if you don't find that clear enough:

function update!(opt, x, x̄)
    x .-= (x̄ .* opt.eta)
end
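A quick numeric check of this simplified rule, with a made-up learning rate and gradient:

```julia
η = 0.1
x = [2.0, -3.0]     # weights
x̄ = [4.0, -6.0]     # gradient of the loss with respect to x
x .-= x̄ .* η        # one gradient descent step
x                   # approximately [1.6, -2.4]: moved against the gradient
```

Each weight moved a small step opposite to its gradient component, which is exactly what the subtraction in update! encodes.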

Now it may seem odd that apply! takes the weight/parameter x as an argument when it is never used. However, we want a generic interface to optimizers. Just because gradient descent does not use the weight when calculating its update value does not mean that other optimizers don't use it. For instance, the Momentum optimizer uses it.
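To see why x belongs in the signature, here is a sketch of a momentum optimizer in the same style. It keeps per-parameter velocity state and uses x itself as the key. The struct below is my own simplified version, not necessarily identical to Flux's Momentum.

```julia
# Sketch of a stateful optimizer: the weight x is used as the key
# for looking up the per-parameter velocity state.
mutable struct SketchMomentum
    eta::Float64
    rho::Float64
    velocity::IdDict{Any, Any}
end
SketchMomentum(η = 0.01, ρ = 0.9) = SketchMomentum(η, ρ, IdDict())

function apply!(o::SketchMomentum, x, Δ)
    v = get!(() -> zero(x), o.velocity, x)   # state looked up by x's identity
    @. v = o.rho * v - o.eta * Δ             # accumulate velocity
    @. Δ = -v                                # the step update! will subtract
end
```

Without x in the signature, the optimizer would have no way to tell which weight's velocity to update.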

What we have looked at thus far are really just helper functions. The update! function the user would call in Flux is defined as follows.

function update!(opt, xs::Params, gs)
    for x in xs
        if gs[x] == nothing
            continue
        end
        update!(opt, x, gs[x])
    end
end
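Putting the pieces together: below I redefine the helpers in simplified form so the snippet is self-contained (a plain vector stands in for Params, and the gradient dictionary is hand-built; Flux's gradient call would normally produce it), then drive one update over a single weight matrix.

```julia
# Simplified, self-contained versions of the pieces described above.
mutable struct Descent
    eta::Float64
end
apply!(o::Descent, x, Δ) = (Δ .*= o.eta)
update!(opt, x, x̄) = (x .-= apply!(opt, x, x̄))

function update!(opt, xs, gs)
    for x in xs
        g = get(gs, x, nothing)   # nothing if no gradient was recorded for x
        g === nothing && continue
        update!(opt, x, g)
    end
end

W = [1.0 2.0; 3.0 4.0]
gs = IdDict{Any, Any}(W => ones(2, 2))   # hand-built "gradient": all ones
update!(Descent(0.5), [W], gs)
W                                        # every entry reduced by 0.5
```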

Here opt is an optimizer such as Descent, and xs contains the parameters of our model, also known as weights. gs is the gradient of the loss function with respect to these parameters. So gs expresses how the loss function increases or decreases in value when the parameters increase or decrease in value.

The way update! works is that we iterate over every parameter x in the model. We look up the derivative, gs[x], of the loss function with respect to this parameter x.

We call the helper update! for every one of these derivatives to update the corresponding parameter.

Normally the user does not call the update! function directly. Instead, update! gets called by the training function train!.

train! takes a loss(x, y) function, which calculates the difference between what our model predicts based on input x and the expected output y.

It also takes a collection ps of parameters used by our model, of type Params. data = [(x1, y1), (x2, y2), ...] is a list of pairs of inputs and expected outputs. Both the xs and the ys will be arrays.

Then we have our optimizer opt, which is an object for which an apply!(opt, x, dx) function has been defined. Descent is an example of such a type.

The callback cb is optional and can be a vector of callback functions. That is what the third line, cb = runall(cb), is all about: it takes an array of functions and turns it into one function which calls each of them in sequence. If there is no array, it just returns cb.

function train!(loss, ps, data, opt; cb = () -> ())
    ps = Params(ps)
    cb = runall(cb)
    for d in data
        gs = gradient(ps) do
            loss(d...)
        end
        update!(opt, ps, gs)
        cb()
    end
end

Let us go through this implementation in some more detail. We iterate over the data. In each iteration we work with a pair of input and expected output values, d = (x, y).

On each iteration, the parameters ps of our model get modified with update!(opt, ps, gs). Thus on each iteration we need to recalculate the gradients, because the model has changed. That is why the loop body begins by calculating the gradient with gs = gradient(ps). gs is basically a dictionary using the object identity of each parameter to map to the derivative of the loss function with respect to that parameter.

This is implemented with an IdDict. We cannot use a regular dictionary, since we don't want to look up based on the value of a parameter but on its identity.

So gs[x] gives me the derivative of the loss function with respect to parameter x.
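The difference between Dict and IdDict is easy to demonstrate: two arrays with equal contents are one key in a Dict, but two distinct keys in an IdDict.

```julia
a = [1.0, 2.0]
b = [1.0, 2.0]       # equal contents, but a different object than a

d = Dict(a => "grad of a")
d[b]                 # "grad of a": Dict compares keys by value

id = IdDict(a => "grad of a")
haskey(id, a)        # true
haskey(id, b)        # false: IdDict compares keys by identity
```

Identity is what we need here, because two different parameters of a model may happen to hold equal values while still needing separate gradients.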

After updating the weights/parameters, we call cb(), which is the optional callback function. As mentioned, this could be multiple functions. It gives users an opportunity to observe the training process. The callback function could, e.g., print out how the value of the loss function changes on each iteration.
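For reference, runall can be sketched in a few lines. This version captures the behavior described above, though Flux's own definition may differ slightly in details.

```julia
# Turn either a single function or a vector of functions
# into one zero-argument callable.
call(f) = f()
runall(f) = f
runall(fs::AbstractVector) = () -> foreach(call, fs)

msgs = String[]
cb = runall([() -> push!(msgs, "first"), () -> push!(msgs, "second")])
cb()
msgs             # ["first", "second"]: both callbacks ran, in order
```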

Final Remarks

This story is a bit of a mess, because I realized while writing it that the scope was too large. But rather than not publishing it, I decided to put it out there, because people who know what they are looking for may find sections of it useful.

Material like this needs more organizing and cross references than a single story can provide.

Written by

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.
