Working with Data in Tables for Machine Learning
Introduction to DataFrames
I have written a number of articles in the past on topics related to Machine Learning (ML), and this article is about filling a missing piece, which is about how to get, prepare and visualize data you wish to train a Machine Learning model on.
Here are some past articles which are useful for this topic:
- Core Idea of Linear Algebra. Linear Algebra is basically about vector and matrices, which is something you use a lot of in Machine Learning.
- Working with References. It may seem a bit peripheral, but understanding references in Julia is good when trying to understanding ML libraries like Flux.
- Implementation of Flux. A beginners guide to how Flux ML library is implemented.
- Intro to Machine Learning. Introduction to machine learning in general and Flux. Flux part is a bit outdated.
This article came about because I am in the process of learning to master Machine Learning and I realized that actually getting data to experiment with and preparing it is not a trivial matter.
What is a DataFrame?
What has become a sort of industry standard for data scientists is a data structured called DataFrames
. The idea originated with the R programming language, but quickly spread to other languages used in data science. Julia is no exception.
I never learned about data frames in my data structures and algorithms course in undergrad but the concept is not that complicated. Rather than using a matrix, where you address rows and columns by index to get to data, DataFrames
is like a matrix on steroids. Not only can you access columns by index, but you can also do it by name. Alternatively you can think of it as simply a table. Tables have named columns. DataFrames
just has a programmers interface to a table.
If named columns was all that DataFrames
was about then there would be no point writing a whole article about it.
DataFrames Ecosystem
The benefit is the comprehensive ecosystem built around DataFrames
. You got packages for running queries on a data frames table. There are packages like CSV
which will load data from a .csv
file and stick it in a DataFrame
object. Further you have packages like StatsPlots
which provide abstractions on top of the Plots
package to make it easy and quick to plot data from a DataFrame
object.
And finally when you want to just get going with quickly accessing some test data to play with, you can use the RDatasets
package which gives you easy access to a ton of example datasets originally made for R, but which you can easily load as a DataFrame
.
Setup
If you want to follow along with my examples. I advice you to install the following packages in Julia:
DataFrames
CSV
Plots
StatsPlots
RDatasets
The way you install a package in Julia is to launch Julia and do the following:
$ julia
julia> ]
(v1.3) pkg> add DataFrames
Loading and Preparing Data
Let us load some data from a CSV file. If you don’t have any you could like me download data from the periodic table here.
julia> using CSV, DataFrames, Plots
julia> atoms = CSV.read("Periodic Table of Elements.csv");
You will then get a peek at what this table contains. Since it is not easy to show all that here, I will display a subset of the data:
julia> atoms[1:9, [:AtomicNumber, :Symbol, :AtomicMass]]
9×3 DataFrame
│ Row │ AtomicNumber │ Symbol │ AtomicMass │
│ │ Int64 │ String │ Float64 │
├─────┼──────────────┼────────┼────────────┤
│ 1 │ 1 │ H │ 1.007 │
│ 2 │ 2 │ He │ 4.002 │
│ 3 │ 3 │ Li │ 6.941 │
│ 4 │ 4 │ Be │ 9.012 │
│ 5 │ 5 │ B │ 10.811 │
│ 6 │ 6 │ C │ 12.011 │
│ 7 │ 7 │ N │ 14.007 │
│ 8 │ 8 │ O │ 15.999 │
│ 9 │ 9 │ F │ 18.998 │
We’ll get back to how this work shortly. First let us look at a more convenient way of getting data. The RDatasets
package contains a huge amount of datasets you can play with when learning Machine Learning. Let us load a popular data set for ML tutorials, called "iris". It contains data about measured attributes of flowers of different species.
julia> using RDatasets
julia> iris = dataset("datasets", "iris")
150×5 DataFrame
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ Float64 │ Float64 │ Float64 │ Float64 │ │
┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
│ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
│ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │
│ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa |
Note, I edited the output slightly to make it fit. You will see slightly different output.
Often you cannot see all the columns of the data. So to get an overview we use the describe
function provided by DataFrames
.
julia> describe(iris)
5×8 DataFrame
│ variable │ mean │ min │ median │ max │ eltype │
│ Symbol │ │ │ │ │ DataType│
┼─────────────┼─────────┼─────┼────────┼─────┼─────────┤
│ SepalLength │ 5.84333 │ 4.3 │ 5.8 │ 7.9 │ Float64 │
│ SepalWidth │ 3.05733 │ 2.0 │ 3.0 │ 4.4 │ Float64 │
│ PetalLength │ 3.758 │ 1.0 │ 4.35 │ 6.9 │ Float64 │
│ PetalWidth │ 1.19933 │ 0.1 │ 1.3 │ 2.5 │ Float64 │
│ Species │ │ │ │ │ │
Again this needed some editing to fit. For all of these datasets you can find a description of the data, what the different columns represent on the Rdatasets website. The HTML index e.g. lets you lookup different sets. We can search for iris and see the description of the data.
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Create DataFrame from Scratch
If we cannot find any data we can of course construct some from scratch ourselves.
julia> atoms = DataFrame(
Number = [1, 2, 6],
Name = ["Hydrogen", "Helium", "Carbon"],
AtomicWeight = [1.0079, 4.0026, 12.0107],
Symbol = ["H", "He", "C"])
3×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼──────────┼──────────────┼────────┤
│ 1 │ 1 │ Hydrogen │ 1.0079 │ H │
│ 2 │ 2 │ Helium │ 4.0026 │ He │
│ 3 │ 6 │ Carbon │ 12.0107 │ C │
From Clipboard
Say you found some data online you liked and don’t want to save it to file, then obtain the path, paste the path into Julia etc, you can actually just load directly from clipboard.
Say I want to load this “failures of air-condition equipment” dataset. Just mark all the data and copy it to clipboard. On macOS that would be Command+C
, and Ctrl+C
on Linux and Windows.
julia> airco = CSV.read(IOBuffer(clipboard()))
24×2 DataFrame
│ Row │ Column1 │ hours │
│ │ Int64 │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 5 │
...
│ 21 │ 21 │ 139 │
│ 22 │ 22 │ 188 │
│ 23 │ 23 │ 197 │
│ 24 │ 24 │ 210 │
IOBuffer(s)
takes a string s
and turns it into an IO
object so the string can be read like a file.
Download URL
Another alternative is to use Julia’s download
function, and download a URL as a temporary file. In this case we are downloading data about beaver temperatures.
julia> beaver = CSV.read(download("http://vincentarelbundock.github.io/Rdatasets/csv/boot/beaver.csv"))
100×5 DataFrame
│ Row │ Column1 │ day │ time │ temp │ activ │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Int64 │
├─────┼─────────┼───────┼───────┼─────────┼───────┤
│ 1 │ 1 │ 307 │ 930 │ 36.58 │ 0 │
│ 2 │ 2 │ 307 │ 940 │ 36.73 │ 0 │
│ 3 │ 3 │ 307 │ 950 │ 36.93 │ 0 │
│ 4 │ 4 │ 307 │ 1000 │ 37.15 │ 0 │
The way this works is that download
uses tools such as wget
and curl
to download file to a temporary location. It then returns the name of this file. You can also optionally specify the file path to store the downloaded file.
Selecting Subset of DataFrames
The fundamental logic of how you select subset of data is not different from working with a matrix in Julia. Here is a comparison.
julia> A = reshape(1:16, 4, 4)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
julia> A[2:3, [1, 4]]
2×2 Array{Int64,2}:
2 14
3 15
In both cases we are specifying the rows we want and then the columns. Both the row and the column specification is anything which is iterable. That includes ranges, arrays, tuples and even scalars. Yet you can iterate over a single number in Julia.
julia> for x in 4
print(x)
end
4
julia> collect(5)
0-dimensional Array{Int64,0}:
5
You can use a boolean array of the same length as the number of rows or columns depending on whether you are selecting rows or columns:
julia> A[[true, false, true, false], :]
2×4 Array{Int64,2}:
1 5 9 13
3 7 11 15
That means you can insert boolean expressions to select elements. For instance here wee ask for the first column of matrix A
and then we compare every row to see if it is larger than 3.
julia> A[A[:, 1] .> 3, :]
1×4 Array{Int64,2}:
4 8 12 16
We can apply many of the same principles to DataFrames
. The official documentation gives you a description of all the functionality. However it is so much that you easily get lost, when your needs may be quite simple.
The Julia wiki book is also a great resources if you want to get into more of the details.
What I am doing here is trying to keep it simply while filling in the blanks and teach you some of the overall principles of how data frames work.
With DataFrames
we got some added functionality which make them very convenient to work with. In matrix fashion I could ask for a single column with all the rows like this:
julia> atoms[:, :Name]
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"
But since the column names are symbols we can also use them in an even more convenient fashion.
julia> atoms[:, :Name]
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"
julia> atoms.Number
3-element Array{Int64,1}:
1
2
6
julia> atoms.Name
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"
julia> atoms.Symbol
3-element Array{String,1}:
"H"
"He"
"C"
This gives us a compact syntax for selecting particular rows based on the values in a column.
julia> atoms[atoms.AtomicWeight .> 4, :]
2×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1 │ 2 │ Helium │ 4.0026 │ He │
│ 2 │ 6 │ Carbon │ 12.0107 │ C │
The atoms.AtomicWeight .> 4
expression turns into an array of boolean values. Every row with a 1
value gets selected.
julia> atoms.AtomicWeight .> 4
3-element BitArray{1}:
0
1
1
The reason for writing .> 4
rather than > 4
is because we want to do element-wise comparison against four, rather than comparing the whole column as if it was a single object with a value.
If you find this cryptic to read, you can use the normal Julia filter
function instead.
julia> filter(atoms) do row
row.AtomicWeight > 4
end
2×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1 │ 2 │ Helium │ 4.0026 │ He │
│ 2 │ 6 │ Carbon │ 12.0107 │ C │
I could do similar with the iris dataset if I only want to look at the data for one species.
julia> filter(iris) do row
row.Species == "virginica"
end
Or alternatively the one liner:
iris[iris.Species .== "virginica", :]
Plotting
We will just cover the pure basics. I will do more extensive coverage in another article.
I advice using the StatsPlots
package, because it provides a very handy macro @df
for plotting.
@df iris scatter(
:SepalLength,
:SepalWidth,
group = :Species
)
What the macro does is to translate this call into:
scatter(iris.SepalLength, iris.SepalWidth, group = iris.Species)
Because the scatter
, plot
and other functions in the Plots
package expect to be called like this:
plot(xs, ys, attribute1=value, attribute2=value)
scatter(xs, ys, attribute1=value, attribute2=value)
Here are some examples of using some attributes
julia> plot(iris.SepalLength, iris.SepalWidth,
seriestype=:scatter,
xaxis="sepal length",
yaxis="sepal width",
group = iris.Species,
markersize=6)
Which can be slightly shortened with StatsPlots
.
julia> @df iris plot(:SepalLength, :SepalWidth,
seriestype=:scatter,
xaxis="sepal length",
yaxis="sepal width",
group = :Species,
markersize=6)