# Implementation of a Modern Machine Learning Library

## If you want to understand how a modern machine learning library works, there is no better alternative than Flux

I could have written about better-known machine learning libraries such as TensorFlow, Keras or PyTorch. But the benefit of talking about Flux is that it is a fairly small and modern machine learning library.

Flux is very small because it **can** be. Flux is trivially extended with custom user code, which is not possible with most existing machine learning libraries. That is why they are so large: they must ship a lot of built-in functionality because adding custom functions is not trivial.

For beginners who want to understand machine learning properly, this is a major benefit. Flux has a much smaller and simpler implementation than the competition, which makes it actually feasible for mortals to look at the source code and understand it.

It may seem mysterious how Flux can do this while other libraries cannot. The magic is really just the Julia programming language. It is Julia that really does all the heavy lifting and makes it possible to write small machine learning libraries.

## What does a Machine Learning Library do?

Before getting into the details, let's take a bird's-eye perspective and talk about the different tasks an ML library has to carry out.

It has to provide some facilities to do the following:

- Define a mathematical model and the parameters of this model.
- Define a loss/error/cost function. A loss function calculates the difference between the desired output of the model and the actual prediction the model makes.
- Calculate the gradient of the loss function with respect to its parameters. The parameters are the ones used to define the model.
- Define different optimizers, or training strategies.
- Provide a training function which tries to minimize the loss function by adjusting the model parameters using the calculated gradient and the optimizer.

What all of this means may not be obvious to you, so let me cover this in more detail.

## What is a Mathematical Model?

A mathematical model is a model defined in mathematics, or, in our case, in code. Its purpose is much the same as that of a physical model.

To describe the concept or idea of models I like to use an example from how the Palm Pilot was developed. It was a PDA, and for those of you too young to know what a PDA is, you can think of it as a smartphone without the ability to call someone.

One of the first models the Palm company made of it was just a block of wood. How is that a suitable model? It cannot do anything, or can it?

They had multiple blocks of wood in different shapes. What they wanted to test was how well each shape fit in your jeans pocket. Employees would walk around with one of these blocks of wood in their pocket all day.

The idea was to figure out the optimal shape and size: one that was comfortable for people to carry in their pocket.

So this is a key point to keep in mind about models: you only include the properties which help you answer the question you want answered. In this case the question was: “Is it comfortable to wear the whole day in my pocket?”

To answer that question we only need the exterior dimensions. Mathematical models are the same. Say we want to model a solar system. The question then is: what do we want to know about this solar system?

If what we want to know is planetary orbits, then the mass, position and velocity of the planets may be the only properties you need per planet to model the solar system.

However, if you want to model the temperature on each planet, you need far more properties.

## What is a Model in Flux?

In the Flux machine learning library, the *model* is simply a function which takes one input. The input is a matrix which needs to be organized in a particular way. Every row represents a different property. In machine learning we refer to these properties as *features*. Another way to look at it is that you can simulate a function with multiple inputs by treating each row as a separate input argument.

Instead of evaluating the model three times with three different inputs, you can evaluate the model once by concatenating the inputs horizontally. So basically every column is a separate input.

Here is a very simple model defined in Flux:

```julia
W = rand(2, 5)
b = rand(2)

model(x) = W*x .+ b
```
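To make the column-batching concrete, here is a small sketch in plain Julia (no Flux needed; the variable names are my own) showing that evaluating the model on a matrix of three columns gives the same result as three separate calls:

```julia
W = rand(2, 5)
b = rand(2)
model(x) = W*x .+ b

# Three separate 5-element inputs...
x1, x2, x3 = rand(5), rand(5), rand(5)

# ...can be evaluated in a single call by placing them as columns:
X = hcat(x1, x2, x3)   # a 5×3 matrix: one column per input
Y = model(X)           # a 2×3 matrix: column i is model(xi)
```

Each column of `Y` matches the result of calling `model` on the corresponding input vector, because `W*X` multiplies column by column and `.+ b` broadcasts the bias over every column.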

Of course real models are more complicated. Here is an example of a deep neural network (DNN) being defined:

```julia
model = Chain(
  Dense(784, 64, relu),
  Dense(64, 32, relu),
  Dense(32, 10)
)
```

From the user's perspective, this still just creates a function `model`, which can be called with a matrix as its argument.
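As a hedged usage sketch (assuming Flux is installed), you feed this network a batch of inputs the same way as before, one column per sample. Here 784 is the number of pixels in a flattened 28×28 image, a typical MNIST-style setup:

```julia
using Flux

model = Chain(
    Dense(784, 64, relu),
    Dense(64, 32, relu),
    Dense(32, 10)
)

X = rand(Float32, 784, 100)  # a batch of 100 flattened 28×28 "images"
Y = model(X)                 # a 10×100 matrix: one 10-element output per column
```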

## Model Parameters

The parameters of the model are variables you can adjust to change the behavior of your model. In our first example, the parameters were `W` and `b`. `W` is referred to as the weights, and will typically be a matrix as well. `b` is referred to as the bias. The bias is not affected by the input.

```julia
Flux.train!(loss, params, data, optimizer)
```

We will cover the training function in more detail later. But the training function is what reads input data and starts tweaking your parameters to modify the model so that it produces the correct output.

To be able to do that we need to tell Flux which parts of our model are the parameters. This is done with the `Flux.params` function.

```julia
params = Flux.params(W, b)
Flux.train!(loss, params, data, optimizer)
```

When using pre-made building blocks such as `Dense` to make a more complex model, you can simply use the model itself as input to `Flux.params` to get the parameters of the model.

```julia
params = Flux.params(model)
```

How we make this possible for custom models written from scratch I will not cover here. For custom models you know the parameters anyway.

Let us look at how Flux implements model parameters.

## Flux Implementation of Model Params

`Params` contains an array `order` holding a list of the parameters added to it. The `params` member is primarily used to check that a parameter (typically an array object) has not already been added to the `Params` object.

```julia
struct Params
  order::Buffer{Any, Vector{Any}}
  params::IdSet{Any}
  Params() = new(Buffer([], false), IdSet())
end
```

You can see that adding a value `x` to params just means adding it to the `order` array, which maintains the order in which elements were added. `params` is a set which only exists to avoid adding the same element twice to the `order` array.

```julia
function Base.push!(ps::Params, x)
  if !(x in ps.params)
    push!(ps.order, x)
    push!(ps.params, x)
  end
  return ps
end
```

This is the more typical usage of `Params`: you create a `Params` object and add elements to it in order.

```julia
Params(xs) = push!(Params(), xs...)
```

We set up parameters from Flux using the `Flux.params` call:

```julia
function params(m...)
  ps = Params()
  params!(ps, m)
  return ps
end
```

But it does not really do much that is different from `Zygote.Params`.

## Updating Weights and Optimizers

The parameters of our model are usually referred to as weights in machine learning. The learning strategy is often referred to as an optimizer.

The reason for that is that what we are really trying to do is optimize a function. So the optimizer is a strategy for how to optimize a function. In our case, it is the loss function which we attempt to optimize.

We want to find what weights will give the lowest value for the loss function. Remember, the loss function expresses the error, or difference, between what our model predicts the output should be for a given input and what output the real world tells us our model should have. We want this error to be as small as possible.

For this reason, when we find the gradient (derivative) of the loss function relative to one or more weights, we want to move in the direction of the minimum of the function and not towards the maximum. That is why we subtract.

```julia
function update!(opt, x, x̄)
  x .-= apply!(opt, x, x̄)
end
```

You can see this function takes an optimizer `opt` as argument, one or more weights `x`, and some change (gradient) `x̄` which would cause the function value to grow. We want the negative of this, since we want it to shrink.

In Flux, every optimizer adds an `apply!` function. Let us look at how gradient descent works, for example. Gradient descent is based on moving in the opposite direction of the gradient with a learning rate `η`. The learning rate lets us move faster or slower towards the minimum.

It is defined as follows:

```julia
mutable struct Descent
  eta::Float64
end

function apply!(o::Descent, x, Δ)
  Δ .*= o.eta
end
```

If we inline this `apply!` into `update!` it may be easier to see how it works.

```julia
function update!(opt, x, x̄)
  x .-= (x̄ .*= opt.eta)
end
```

But you will likely find it easier when written over two lines:

```julia
function update!(opt, x, x̄)
  x̄ .*= opt.eta
  x .-= x̄
end
```

We can even simplify it further, if you don't find that clear enough:

```julia
function update!(opt, x, x̄)
  x .-= (x̄ .* opt.eta)
end
```
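To see these pieces working end to end, here is a self-contained sketch in plain Julia (no Flux) that uses `Descent`, `apply!` and `update!` exactly as defined above to minimize a simple quadratic loss. The loss and its hand-computed gradient are my own example, not Flux code:

```julia
mutable struct Descent
    eta::Float64
end

apply!(o::Descent, x, Δ) = Δ .*= o.eta

function update!(opt, x, x̄)
    x .-= apply!(opt, x, x̄)
end

# Minimize loss(w) = sum((w .- 3).^2), whose gradient is 2 .* (w .- 3).
w = zeros(3)
opt = Descent(0.1)
for _ in 1:100
    grad = 2 .* (w .- 3)   # gradient of the loss at the current w
    update!(opt, w, grad)
end
# After 100 steps w has converged close to [3.0, 3.0, 3.0]
```

Each iteration shrinks the distance to the minimum by a constant factor, which is exactly the behavior `Descent` is meant to produce.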

Now it may seem odd that `apply!` takes the weight/parameter `x` as an argument when it is never used. However, we want a generic interface to optimizers. Just because gradient descent does not use the weight when calculating its update value does not mean that other optimizers don't use it. For instance, the `Momentum` optimizer uses it.
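Here is a sketch of a momentum optimizer in the same style (modeled on the structure of Flux's `Momentum`; the exact details may differ between Flux versions). It uses `x` as a key to keep a per-parameter velocity, which is why `apply!` needs the weight as an argument:

```julia
mutable struct Momentum
    eta::Float64
    rho::Float64
    velocity::IdDict{Any,Any}
end

Momentum(η = 0.01, ρ = 0.9) = Momentum(η, ρ, IdDict{Any,Any}())

function apply!(o::Momentum, x, Δ)
    # Look up (or initialize) the velocity belonging to this particular
    # parameter x. The IdDict keys on the identity of x.
    v = get!(o.velocity, x, zero(x))
    @. v = o.rho * v - o.eta * Δ   # accumulate a running update direction
    @. Δ = -v                      # update! will subtract this from x
    return Δ
end
```

Because the velocity is stored per parameter, calling `apply!` repeatedly with the same `x` builds up momentum across iterations.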

What we have looked at thus far are really just helper functions. The `update!` function the user would call in Flux is defined as follows:

```julia
function update!(opt, xs::Params, gs)
  for x in xs
    if gs[x] == nothing
      continue
    end
    update!(opt, x, gs[x])
  end
end
```

Here `opt` is an optimizer such as `Descent`, and `xs` holds the parameters in our model, also known as weights. `gs` is the gradient of the loss function with respect to these parameters. So `gs` expresses how the loss function increases or decreases in value when the parameters increase or decrease in value.

The way update works is that we iterate over every parameter `x` in the model. We look up the derivative, `gs[x]`, of the loss function with respect to this parameter `x`.

We call the helper `update!` for every one of these derivatives to update the corresponding parameter.

## Training Function

Normally the user does not call the `update!` function directly. Instead, `update!` gets called by the training function `train!`.

`train!` takes a `loss(x, y)` function, calculating the difference between what our model predicts based on input `x` and the expected output `y`.

It also takes a collection of parameters `ps` used by our model, of type `Params`. `data = [(x1, y1), (x2, y2), ...]` is a list of pairs of inputs and expected outputs. Both the `x`s and the `y`s will be arrays.

Then we have our optimizer `opt`, which is an object for which an `apply!(opt, x, dx)` function has been defined. `Descent` is an example of such a type.

The callback `cb` is optional and can be a vector of callback functions. That is what the third line, `cb = runall(cb)`, is all about. It takes an array of functions and turns it into one function, calling each of those functions in sequence. If there is no array it just returns `cb`.

```julia
function train!(loss, ps, data, opt; cb = () -> ())
  ps = Params(ps)
  cb = runall(cb)
  for d in data
    gs = gradient(ps) do
      loss(d...)
    end
    update!(opt, ps, gs)
    cb()
  end
end
```
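Putting it all together, here is a hedged usage sketch of `train!` with the simple linear model from earlier. It assumes Flux is installed and uses the implicit-parameter `Flux.train!` API described in this article (newer Flux versions have moved to a different training interface). The loss function and data are my own example:

```julia
using Flux

W = rand(2, 5)
b = rand(2)
model(x) = W*x .+ b

# A squared-error loss comparing prediction and expected output.
loss(x, y) = sum((model(x) .- y).^2)

# A single (input, expected output) pair; real training data has many.
x1, y1 = rand(5), rand(2)
data = [(x1, y1)]

ps = Flux.params(W, b)
opt = Descent(0.1)

# Each call to train! makes one pass over data, updating W and b.
for epoch in 1:100
    Flux.train!(loss, ps, data, opt)
end
```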

Let us go through this implementation in some more detail. We iterate over the data. In each iteration we work with a pair of input and expected output values, `d = (x, y)`.

On each iteration the parameters `ps` of our model get modified with `update!(opt, ps, gs)`. Thus on each iteration we need to recalculate the gradients, because the model has changed. That is why the for loop begins by calculating the gradient with `gs = gradient(ps)`. `gs` is basically a dictionary using the object identity of the parameters to map to the derivative of the loss function with respect to each parameter.

This is implemented with an `IdDict`. We cannot use a regular dictionary, since we don't want to look up based on the value of a parameter but on its identity.

So `gs[x]` gives us the derivative of the loss function with respect to parameter `x`.
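The difference between identity lookup and value lookup is easy to demonstrate in plain Julia. Two parameter arrays can be equal in value yet must be treated as distinct parameters, which is exactly what `IdDict` provides:

```julia
W1 = zeros(2, 2)
W2 = zeros(2, 2)   # equal in value, but a distinct parameter

W1 == W2    # true:  value equality
W1 === W2   # false: they are different objects

# A regular Dict keys on value, so the second entry overwrites the first:
d = Dict(W1 => ones(2, 2), W2 => fill(2.0, 2, 2))
length(d)   # 1

# An IdDict keys on identity, so each array keeps its own gradient:
gs = IdDict{Any,Any}(W1 => ones(2, 2), W2 => fill(2.0, 2, 2))
length(gs)  # 2
```

With a regular `Dict`, two weight matrices that happened to hold the same values would share one gradient entry, which would silently corrupt training.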

After updating the weights/parameters we call `cb()`, which is the optional callback function. As mentioned, this could be multiple functions. It gives users an opportunity to observe the training process. The callback function could, for example, print out how the value of the loss function changes on each iteration.

## Final Remarks

This story is a bit of a mess, because I realized while writing it that its scope was too large. But rather than not publishing my writing, I decided to put it out there, because people who know what they are looking for may find sections of this story useful.

Material like this needs more organizing and cross references than a single story can provide.