A gentle introduction to the Flux Machine Learning Package
Edit: This tutorial is not up to date. Flux.jl has changed in how it handles automatic differentiation and dual numbers, so the code examples need change. I will make a new tutorial in the future and link to from here.
The Julia programming language is very well tailored as a language for machine learning. It makes it a lot easier to write neural network e.g. However it does not offer the same kind of extensive documentation for beginners as Python.
Flux is a machine learning package for Julia with amazing potential. Unfortunately for people like me, getting into machine learning, it isn’t easy enough to read the tutorials. I am not a math major and it was 20 years since I studied machine learning. So I wrote this tutorial for people like me. People who have a very patchy idea of machine learning but who want to get into learning to use a cutting edge machine learning package such as Flux.
So why pick something like Flux over TensorFlow which much more documentation available? Because TensorFlow is huge and is based on the idea of you assembling a huge number of ready made pieces.
Flux follows an entirely different philosophy which I believe will make machine learning much easier to learn, understand and use even if you are a novice. It is a much smaller library which is made to let you create a neural network by simply assembling regular Julia code. It also integrates very easily with third party libraries so you can e.g. plug in the ability to run your neural network on a GPU.
Here I will go through the same basic examples covered in the Flux introduction manual, but assuming less prior knowledge on your part.
We got to look at a few math concepts first before we can look at Flux code examples.
What is a Gradient?
The core feature of Flux is to calculate the gradient of Julia code. What exactly is a gradient and why are they important?
If you remember high school calculus, you should have covered derivatives. You have some function
f(x) and you want to find the slope of that function for a given value
x. That is the derivative of
f and is written
df(x)/dx. The derivative is a scalar (single value), while the closely related gradient is a vector. That is because a gradient is the derivative of a function with multiple inputs such as
f(x, y). Imagine plotting
We could to this by making a point in 3D space. The point is located at coordinates
(x, y, f(x, y). Hence
f(x, y) represents the
z coordinate. Just like
f(x) represents the
y coordinate when you plot it.
Now if you want to find the slope at any point on this surface it is not longer a single number (scalar) because it slopes different amounts in different directions. Hence the slope is a vector. Thus we could say a gradient is a generalization of the derivative.
Why are Gradients Important in Machine Learning?
If you take a big step back and try to take a birds eye perspective of what one tries to achieve in machine learning it is basically about finding an unknown function, which provides a good model of something in the real world.
I like to think about this as the scientific method: You make a number of observations. Then you try to formulate a law which explain these observations, so you can make predictions about future events.
Let use the moon’s orbit as an example. You observe the moon’s position over a longer time interval and you want to determine if there is a law describing the moon’s orbit.
Based on time and position recorded ovet time you want to find some function
f(t) which tells you the position of the moon at time
t. However you have no idea what this function looks like.
But there is hope! Mathematicians have discovered that almost any complex function can be approximated by a combination of simpler functions. E.g. a polynomial can approximate many functions.
Below we have an example of a polynomial of second degree. Say we want to use this to approximate our unknown function
f(t). The trick is the find the coefficients
c typically referred to as weights in machine learning.
P(t) = at^2 + bt + c
Just to be clear, this function will not ever approximate the orbit of the moon. It is just chosen as an example because it is simple. For a more realistic example we would have to use a lot bigger polynomial with far more weights/coefficients.
Approximation theory is trying to minimize the difference between the real function
f(t) and our approximation
P(t). We can express the difference between the two function as a function itself. We usually call this function the error function or loss function:
loss(t) = |P(t) - f(t)|
In Julia syntax this would be written as
loss(t) = abs(P(t) - f(t))
This tells us how wrong our approximation is for each
t but what we really want to know is how wrong it is for each possible value of
Thus what we want to do is: For every combination of
c calculate the total amount of error for all values of
t. Imagine we have made a number of measurements of the position of the moon at time
t. Let an array
T contain all these measurements.
f(t) where t ∈ T is then exactly the same as our measurements.
Our loss function may then be defined as:
loss(a, b, c,) = sum(abs(P(t, a, b, c) - f(t)) for t in T)
This turns things on their head. We don’t have a function of time anymore. Instead we have a loss function where the arguments (inputs) are the coefficients of
P(t). We have also redefined
P(t) to make the coefficients arguments.
Instead we have a loss function where the arguments (inputs) are the coefficients
Imagine we plot the
loss(a, b, c) function. You can think of it as some sort of landscape, at least if it was just
loss(a, b), with 3 arguments it is harder to imagine. The point is that we really want to locate the bottom of the deepest valley. Because the
c coordinates of that point is exactly where the
loss function has the lowest value and hence our approximation has the smallest error. Thus if we locate this position we have basically discovered a good approximation
P(t) to our unknown function
Imagine you are standing in this landscape and want to figure out where you need to go to locate this point. What direction should you go? You should go where the landscape is sloping downwards most dramatically. But how do you know what direction that is?
Eureka! That is why you need to find the gradient! The gradient tells you the slope at every point. You just need to go the opposite direction of the gradient, because the gradient expresses the upward slope. This process of finding coefficients by following the negative gradient is called gradient descent.
In the real world
f(t) can represent any kind of phenomenon we want to approximate. It doesn't need to be about physics. It could be product sales as a function of how much you paid for ads. It could be political party somebody is most likely to vote for given age, income, gender and profession.
However for real world applications our function
P(t) can get really complex. It is not necessarily a simple polynomial like
P(t) = at^2 + bt + c. The function could be thousands of lines of code. That is when your normal derivation rules such as:
f(x) = x, f'(x) = 1
f(x) = ax, f'(x) = a
f(x) = x², f'(x) = 2x
f(x) = sin(x), f'(x) = cos(x)
fall flat on their face. There are many alternatives we could attempt to attempt to perform differentiation (determine the gradient of a function) such as:
- Numerical differentiation This requires evaluating
f(x)multiple times in small steps to approximate
f'(x). Easy to implement but very costly computationally speaking. Hence no good idea.
- Symbolic differentiation Which means you get a computer to basically apply the derivation rules I described. Initially I thought this is what Flux was doing. However this quickly turns into a mess. You end up with very complex expressions and it is not easy to accomplish on an arbitrary large chunk of code.
- Automatic differentiation This is the secret sauce of Flux. Instead of using regular number in your thousand line function, everything you are differentiating with respect to (such as the a, b, c weights) is stored in a special kind of numbers. These numbers are called dual numbers. Using dual numbers allows you to calculate the value of a function as well as the value of the derivative of the function (gradient).
You can read more about automatic differentiation and machine learning here.
Here is the neat way automatic differentiation works: A dual number contains two numbers. One of the numbers keeps track of the result of normal numerical calculations. However the second number calculates what the differential would be if performing that particular calculation. Dual numbers in a way accumulate what happens to them, so it doesn’t matter how long and complex your function is. As long as dual numbers are used for all your weights, everything will just work out magically.
Okay now we are ready to go through the code examples used in the basic introduction to Flux here. However I will refer more back to the background information just presented to help you understand what is going on.
Flux is all about determining the gradients of Julia functions. Under the hood it does that through automatic differentiation (using dual numbers). We can define a simple function and take the gradient with respect to the specified arguments.
julia> using Flux.Tracker
julia> f(x) = 3x^2 + 2x + 1;
julia> df(x) = Tracker.gradient(f, x; nest = true); # df/dx = 6x + 2
julia> d2f(x) = Tracker.gradient(df, x; nest = true); # d²f/dx² = 6
If you perform these calculations with a pen and paper yourself, you will see that the outputs are are correct. Notice how you don’t get regular numbers out. You get tracked numbers. Tracked numbers are what Flux calls dual numbers. When you write
Tracker.gradient(df, x; nest = true), Flux understands that you want to turn
x into a tracked number (dual number).
We can do this for multiple arguments (parameters) as well:
julia> f(W, b, x) = W * x + b;
julia> Tracker.gradient(f, 2, 3, 4)
(4.0 (tracked), 1.0 (tracked), 2.0 (tracked))
But machine learning models can have hundreds of parameters! Flux offers a nice way to handle this. We can tell Flux to treat something as a parameter via param. Then we can collect these together and tell gradient to collect the gradients of all params at once. This is basically a way of saying what numbers should be dual numbers.
julia> using Flux
julia> W = param(2)
julia> b = param(3)
julia> f(x) = W * x + b;
julia> grads = Tracker.gradient(() -> f(4), params(W, b));
There are a few things to notice here. Firstly,
b now show up as tracked. Tracked things behave like normal numbers or arrays, but keep records of everything you do with them, allowing Flux to calculate their gradients.
This is essentially like trying to find the derivate of a function defined as
f(W, b). The gradient returned is a vector. You get each component of the vector like this
Gradient takes a zero-argument function; no arguments are necessary because the params tell it what to differentiate.
This will come in really handy when dealing with big, complicated models. For now, though, let’s start with something simple.
Consider a simple linear regression, which tries to predict an output array
y from an input
W = rand(2, 5)
b = rand(2)
predict(x) = W*x .+ b
function loss(x, y)
ŷ = predict(x)
sum((y .- ŷ).^2)
x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
Linear regressions are useful to start with as examples because they are basically like an ultra simple neural network. The calculations look like you see below.
The use of matrices may confuse you if you are not used to it. You can think of this as two separate
predict functions each taking 5 arguments (
x is an array of 5 elements). Each of the 5 arguments are multiplied with a weight/coefficient. The weights are different for each version of the
predict function. Because we have two function, we get 2 output.
In each function we add a constant, the bias
b is a two element vector in our example to represent the bias for each of the two
I have labeled the out of
loss function takes in all the 5 input arguments stored in
x, as well as the actual expected outputs,
y that we in principle got from the real world.
We add up the error of all the output values with
sum((y .- ŷ).^2).
To improve the prediction we can take the gradients of
b with respect to the
loss and perform gradient descent. Let's tell Flux that
b are parameters, just like we did in the earlier code example.
W = param(W)
b = param(b)
gs = Tracker.gradient(() -> loss(x, y), params(W, b))
Now that we have gradients, we can pull them out and update
W to train the model. The
update!(W, Δ) function applies
W = W + Δ, which we can use for gradient descent.
using Flux.Tracker: update!
Δ = gs[W]
# Update the parameter and reset the gradient
loss(x, y) # ~ 2.5
The loss has decreased a little, meaning that our prediction x is closer to the target y. If we have some data we can already try training the model.
All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different — they might have millions of parameters or complex control flow. Let’s see how Flux handles more complex models.
At this point we are getting into describing layers in a neural network. If that does not make sense to you, I advice you to read up on neural networks here. The examples are in Python. However I hope to explain neural networks with Julia code in the future.
Neutral networks are made up of multiple linear layers sending their output into and activation function. Usually this activation function is a sigmoid(σ). It passes its output to the next layer in the neural network.
In the earlier coding style used we could write this network as:
W1 = param(rand(3, 5))
b1 = param(rand(3))
layer1(x) = W1 * x .+ b1
W2 = param(rand(2, 3))
b2 = param(rand(2))
layer2(x) = W2 * x .+ b2
model(x) = layer2(σ.(layer1(x)))
model(rand(5)) # => 2-element vector
We can see a visual representation of what is going on below. The first image shows the matrices involved in calculating the first layer.
W1 is a 3x5 matrix which means provides 5 weights to each of the 5 inputs in
x. The number of rows corresponds to the number of outputs. So we got 2 outputs.
z1 is not shown in the code, but is the output of first layer.
The second layer takes as input the sigmoid (σ) function applied to all output values from the first layer.
In the last line you can see we give 5 random input values producing two output values.
While this kind of code works it is unwieldy, with lots of repetition, especially as we add more layers. One way to factor this out is to create a function that returns linear layers. Let us create a helper function
linear to define a layer.
function linear(in, out)
W = param(randn(out, in))
b = param(randn(out))
x -> W * x .+ b
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)
model(x) = linear2(σ.(linear1(x)))
model(rand(5)) # => 2-element vector
Another (equivalent) way is to create a struct that explicitly represents the affine layer.
Affine(in::Integer, out::Integer) =
Affine(param(randn(out, in)), param(randn(out)))
# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
a = Affine(10, 5)
a(rand(10)) # => 5-element vector
(m::Affine)(x) = ... may seem unfamiliar to you. It basically allows you to make objects of type
Affine callable as is they were function objects.
Congratulations! You just built the Dense layer that comes with Flux. Flux has many interesting layers available, but they’re all things you could have built yourself very easily.
(There is one small difference with Dense — for convenience it also takes an activation function, like
Dense(10, 5, σ).) We call these layers
Dense because every node in the layer is connected to the next. I like to think about it as being analogous to how we call some arrays dense and other sparse.
Stacking It Up
It’s pretty common to write models that look something like:
layer1 = Dense(10, 5, σ)
model(x) = layer3(layer2(layer1(x)))
For long chains, it might be a bit more intuitive to have a list of layers, like this:
layers = [Dense(10, 5, σ), Dense(5, 2), softmax]
model(x) = foldl((x, m) -> m(x), layers, init = x)
model(rand(10)) # => 2-element vector
foldl may requite some explanation. It is basically a
reduce. You start with the input
x. Together with the first element in the
layers array they form the arguments to
(x, m) -> m(x). Since
m is a layer object which can be called as a function, we can do
m(x). This gives an output vector which is used as
x in the next iteration together with an
m which represent the next layer in the
layers array. Thus this is a concise way of chaining together the layers. The output of the preceding layer becomes the input of the next layer.
However you don’t have to write this yourself, because Flux already provides a
Chain constructor which does exactly this thing:
model2 = Chain(
Dense(10, 5, σ),
model2(rand(10)) # => 2-element vector
This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code.
A nice property of this approach is that because “models” are just functions (possibly with trainable parameters), you can also see this as simple function composition.
m = Dense(5, 2) ∘ Dense(10, 5, σ)
Likewise, Chain will happily work with any Julia function.
m = Chain(x -> x^2, x -> x+1)
m(5) # => 26
I hope this helps you get started. The more succinct version of this documentation, written by the creators of Flux is available here.
I hope to at some point be able to write a more comprehensive introduction for beginners to machine learning in the style use in excellent ML Cheatsheet which explains Machine Learning concepts through Python code.
It would be awesome to have something similar in Julia, so beginners will consider Julia as their first language for doing machine learning.