Working with Data in Tables for Machine Learning

Introduction to DataFrames

I have written a number of articles in the past on topics related to Machine Learning (ML), and this article is about filling a missing piece, which is about how to get, prepare and visualize data you wish to train a Machine Learning model on.

Here are some past articles which are useful for this topic:

This article came about because I am in the process of learning to master Machine Learning and I realized that actually getting data to experiment with and preparing it is not a trivial matter.

What is a DataFrame?

I never learned about data frames in my data structures and algorithms course in undergrad but the concept is not that complicated. Rather than using a matrix, where you address rows and columns by index to get to data, DataFrames is like a matrix on steroids. Not only can you access columns by index, but you can also do it by name. Alternatively you can think of it as simply a table. Tables have named columns. DataFrames just has a programmers interface to a table.

If named columns was all that DataFrames was about then there would be no point writing a whole article about it.

DataFrames Ecosystem

And finally when you want to just get going with quickly accessing some test data to play with, you can use the RDatasets package which gives you easy access to a ton of example datasets originally made for R, but which you can easily load as a DataFrame.

Setup

  • DataFrames
  • CSV
  • Plots
  • StatsPlots
  • RDatasets

The way you install a package in Julia is to launch Julia and do the following:

$ julia
julia> ]
(v1.3) pkg> add DataFrames

Loading and Preparing Data

julia> using CSV, DataFrames, Plots
julia> atoms = CSV.read("Periodic Table of Elements.csv");

You will then get a peek at what this table contains. Since it is not easy to show all that here, I will display a subset of the data:

julia> atoms[1:9, [:AtomicNumber, :Symbol, :AtomicMass]]
9×3 DataFrame
│ Row │ AtomicNumber │ Symbol │ AtomicMass │
│ │ Int64 │ String │ Float64 │
├─────┼──────────────┼────────┼────────────┤
│ 1 │ 1 │ H │ 1.007 │
│ 2 │ 2 │ He │ 4.002 │
│ 3 │ 3 │ Li │ 6.941 │
│ 4 │ 4 │ Be │ 9.012 │
│ 5 │ 5 │ B │ 10.811 │
│ 6 │ 6 │ C │ 12.011 │
│ 7 │ 7 │ N │ 14.007 │
│ 8 │ 8 │ O │ 15.999 │
│ 9 │ 9 │ F │ 18.998 │

We’ll get back to how this work shortly. First let us look at a more convenient way of getting data. The RDatasets package contains a huge amount of datasets you can play with when learning Machine Learning. Let us load a popular data set for ML tutorials, called "iris". It contains data about measured attributes of flowers of different species.

julia> using RDatasets
julia> iris = dataset("datasets", "iris")
150×5 DataFrame
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ Float64 │ Float64 │ Float64 │ Float64 │ │
┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
│ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
│ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
│ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │
│ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa |

Note, I edited the output slightly to make it fit. You will see slightly different output.

Often you cannot see all the columns of the data. So to get an overview we use the describe function provided by DataFrames.

julia> describe(iris)
5×8 DataFrame
│ variable │ mean │ min │ median │ max │ eltype │
│ Symbol │ │ │ │ │ DataType│
┼─────────────┼─────────┼─────┼────────┼─────┼─────────┤
│ SepalLength │ 5.84333 │ 4.3 │ 5.8 │ 7.9 │ Float64 │
│ SepalWidth │ 3.05733 │ 2.0 │ 3.0 │ 4.4 │ Float64 │
│ PetalLength │ 3.758 │ 1.0 │ 4.35 │ 6.9 │ Float64 │
│ PetalWidth │ 1.19933 │ 0.1 │ 1.3 │ 2.5 │ Float64 │
│ Species │ │ │ │ │ │

Again this needed some editing to fit. For all of these datasets you can find a description of the data, what the different columns represent on the Rdatasets website. The HTML index e.g. lets you lookup different sets. We can search for iris and see the description of the data.

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Create DataFrame from Scratch

julia> atoms = DataFrame(
Number = [1, 2, 6],
Name = ["Hydrogen", "Helium", "Carbon"],
AtomicWeight = [1.0079, 4.0026, 12.0107],
Symbol = ["H", "He", "C"])
3×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼──────────┼──────────────┼────────┤
│ 1 │ 1 │ Hydrogen │ 1.0079 │ H │
│ 2 │ 2 │ Helium │ 4.0026 │ He │
│ 3 │ 6 │ Carbon │ 12.0107 │ C │

From Clipboard

Say I want to load this “failures of air-condition equipment” dataset. Just mark all the data and copy it to clipboard. On macOS that would be Command+C, and Ctrl+C on Linux and Windows.

julia> airco = CSV.read(IOBuffer(clipboard()))
24×2 DataFrame
│ Row │ Column1 │ hours │
│ │ Int64 │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 5 │
...
│ 21 │ 21 │ 139 │
│ 22 │ 22 │ 188 │
│ 23 │ 23 │ 197 │
│ 24 │ 24 │ 210 │

IOBuffer(s) takes a string s and turns it into an IO object so the string can be read like a file.

Download URL

julia> beaver = CSV.read(download("http://vincentarelbundock.github.io/Rdatasets/csv/boot/beaver.csv"))
100×5 DataFrame
│ Row │ Column1 │ day │ time │ temp │ activ │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Int64 │
├─────┼─────────┼───────┼───────┼─────────┼───────┤
│ 1 │ 1 │ 307 │ 930 │ 36.58 │ 0 │
│ 2 │ 2 │ 307 │ 940 │ 36.73 │ 0 │
│ 3 │ 3 │ 307 │ 950 │ 36.93 │ 0 │
│ 4 │ 4 │ 307 │ 1000 │ 37.15 │ 0 │

The way this works is that download uses tools such as wget and curl to download file to a temporary location. It then returns the name of this file. You can also optionally specify the file path to store the downloaded file.

Selecting Subset of DataFrames

julia> A = reshape(1:16, 4, 4)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16

julia> A[2:3, [1, 4]]
2×2 Array{Int64,2}:
2 14
3 15

In both cases we are specifying the rows we want and then the columns. Both the row and the column specification is anything which is iterable. That includes ranges, arrays, tuples and even scalars. Yet you can iterate over a single number in Julia.

julia> for x in 4
print(x)
end
4
julia> collect(5)
0-dimensional Array{Int64,0}:
5

You can use a boolean array of the same length as the number of rows or columns depending on whether you are selecting rows or columns:

julia> A[[true, false, true, false], :]
2×4 Array{Int64,2}:
1 5 9 13
3 7 11 15

That means you can insert boolean expressions to select elements. For instance here wee ask for the first column of matrix A and then we compare every row to see if it is larger than 3.

julia> A[A[:, 1] .> 3, :]
1×4 Array{Int64,2}:
4 8 12 16

We can apply many of the same principles to DataFrames. The official documentation gives you a description of all the functionality. However it is so much that you easily get lost, when your needs may be quite simple.

The Julia wiki book is also a great resources if you want to get into more of the details.

What I am doing here is trying to keep it simply while filling in the blanks and teach you some of the overall principles of how data frames work.

With DataFrames we got some added functionality which make them very convenient to work with. In matrix fashion I could ask for a single column with all the rows like this:

julia> atoms[:, :Name]
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"

But since the column names are symbols we can also use them in an even more convenient fashion.

julia> atoms[:, :Name]
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"

julia> atoms.Number
3-element Array{Int64,1}:
1
2
6

julia> atoms.Name
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"

julia> atoms.Symbol
3-element Array{String,1}:
"H"
"He"
"C"

This gives us a compact syntax for selecting particular rows based on the values in a column.

julia> atoms[atoms.AtomicWeight .> 4, :]
2×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1 │ 2 │ Helium │ 4.0026 │ He │
│ 2 │ 6 │ Carbon │ 12.0107 │ C │

The atoms.AtomicWeight .> 4 expression turns into an array of boolean values. Every row with a 1 value gets selected.

julia> atoms.AtomicWeight .> 4
3-element BitArray{1}:
0
1
1

The reason for writing .> 4 rather than > 4 is because we want to do element-wise comparison against four, rather than comparing the whole column as if it was a single object with a value.

If you find this cryptic to read, you can use the normal Julia filter function instead.

julia> filter(atoms) do row
row.AtomicWeight > 4
end
2×4 DataFrame
│ Row │ Number │ Name │ AtomicWeight │ Symbol │
│ │ Int64 │ String │ Float64 │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1 │ 2 │ Helium │ 4.0026 │ He │
│ 2 │ 6 │ Carbon │ 12.0107 │ C │

I could do similar with the iris dataset if I only want to look at the data for one species.

julia> filter(iris) do row
row.Species == "virginica"
end

Or alternatively the one liner:

iris[iris.Species .== "virginica", :]

Plotting

I advice using the StatsPlots package, because it provides a very handy macro @df for plotting.

@df iris scatter(
:SepalLength,
:SepalWidth,
group = :Species
)

What the macro does is to translate this call into:

scatter(iris.SepalLength, iris.SepalWidth, group = iris.Species)

Because the scatter, plot and other functions in the Plots package expect to be called like this:

plot(xs, ys, attribute1=value, attribute2=value)
scatter(xs, ys, attribute1=value, attribute2=value)

Here are some examples of using some attributes

julia> plot(iris.SepalLength, iris.SepalWidth,
seriestype=:scatter,
xaxis="sepal length",
yaxis="sepal width",
group = iris.Species,
markersize=6)
Image for post

Which can be slightly shortened with StatsPlots.

julia> @df iris plot(:SepalLength, :SepalWidth,
seriestype=:scatter,
xaxis="sepal length",
yaxis="sepal width",
group = :Species,
markersize=6)

Written by

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store