Working with Data in Tables for Machine Learning

Introduction to DataFrames

9 min readMar 21, 2020

I have written a number of articles in the past on topics related to Machine Learning (ML), and this article is about filling a missing piece, which is about how to get, prepare and visualize data you wish to train a Machine Learning model on.

Here are some past articles which are useful for this topic:

Core Idea of Linear Algebra. Linear Algebra is basically about vector and matrices, which is something you use a lot of in Machine Learning.
Working with References. It may seem a bit peripheral, but understanding references in Julia is good when trying to understanding ML libraries like Flux.
Implementation of Flux. A beginners guide to how Flux ML library is implemented.
Intro to Machine Learning. Introduction to machine learning in general and Flux. Flux part is a bit outdated.

This article came about because I am in the process of learning to master Machine Learning and I realized that actually getting data to experiment with and preparing it is not a trivial matter.

What is a DataFrame?

What has become a sort of industry standard for data scientists is a data structured called DataFrames. The idea originated with the R programming language, but quickly spread to other languages used in data science. Julia is no exception.

I never learned about data frames in my data structures and algorithms course in undergrad but the concept is not that complicated. Rather than using a matrix, where you address rows and columns by index to get to data, DataFrames is like a matrix on steroids. Not only can you access columns by index, but you can also do it by name. Alternatively you can think of it as simply a table. Tables have named columns. DataFrames just has a programmers interface to a table.

If named columns was all that DataFrames was about then there would be no point writing a whole article about it.

DataFrames Ecosystem

The benefit is the comprehensive ecosystem built around DataFrames. You got packages for running queries on a data frames table. There are packages like CSV which will load data from a .csv file and stick it in a DataFrame object. Further you have packages like StatsPlots which provide abstractions on top of the Plots package to make it easy and quick to plot data from a DataFrame object.

And finally when you want to just get going with quickly accessing some test data to play with, you can use the RDatasets package which gives you easy access to a ton of example datasets originally made for R, but which you can easily load as a DataFrame.

Setup

If you want to follow along with my examples. I advice you to install the following packages in Julia:

DataFrames
CSV
Plots
StatsPlots
RDatasets

The way you install a package in Julia is to launch Julia and do the following:

$ julia
julia> ]
(v1.3) pkg> add DataFrames

Loading and Preparing Data

Let us load some data from a CSV file. If you don’t have any you could like me download data from the periodic table here.

julia> using CSV, DataFrames, Plots
julia> atoms = CSV.read("Periodic Table of Elements.csv");

You will then get a peek at what this table contains. Since it is not easy to show all that here, I will display a subset of the data:

julia> atoms[1:9, [:AtomicNumber, :Symbol, :AtomicMass]]
9×3 DataFrame
│ Row │ AtomicNumber │ Symbol │ AtomicMass │
│     │ Int64        │ String │ Float64    │
├─────┼──────────────┼────────┼────────────┤
│ 1   │ 1            │ H      │ 1.007      │
│ 2   │ 2            │ He     │ 4.002      │
│ 3   │ 3            │ Li     │ 6.941      │
│ 4   │ 4            │ Be     │ 9.012      │
│ 5   │ 5            │ B      │ 10.811     │
│ 6   │ 6            │ C      │ 12.011     │
│ 7   │ 7            │ N      │ 14.007     │
│ 8   │ 8            │ O      │ 15.999     │
│ 9   │ 9            │ F      │ 18.998     │

We’ll get back to how this work shortly. First let us look at a more convenient way of getting data. The RDatasets package contains a huge amount of datasets you can play with when learning Machine Learning. Let us load a popular data set for ML tutorials, called "iris". It contains data about measured attributes of flowers of different species.

julia> using RDatasets
julia> iris = dataset("datasets", "iris")
150×5 DataFrame
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ Float64     │ Float64    │ Float64     │ Float64    │         │
┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │
│ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa  │
│ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa  │
│ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa  │
│ 4.6         │ 3.4        │ 1.4         │ 0.3        │ setosa  │
│ 5.0         │ 3.4        │ 1.5         │ 0.2        │ setosa  |

Note, I edited the output slightly to make it fit. You will see slightly different output.

Often you cannot see all the columns of the data. So to get an overview we use the describe function provided by DataFrames.

julia> describe(iris)
5×8 DataFrame
│ variable    │ mean    │ min │ median │ max │ eltype  │
│ Symbol      │         │     │        │     │ DataType│
┼─────────────┼─────────┼─────┼────────┼─────┼─────────┤
│ SepalLength │ 5.84333 │ 4.3 │ 5.8    │ 7.9 │ Float64 │
│ SepalWidth  │ 3.05733 │ 2.0 │ 3.0    │ 4.4 │ Float64 │
│ PetalLength │ 3.758   │ 1.0 │ 4.35   │ 6.9 │ Float64 │
│ PetalWidth  │ 1.19933 │ 0.1 │ 1.3    │ 2.5 │ Float64 │
│ Species     │         │     │        │     │         │

Again this needed some editing to fit. For all of these datasets you can find a description of the data, what the different columns represent on the Rdatasets website. The HTML index e.g. lets you lookup different sets. We can search for iris and see the description of the data.

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Create DataFrame from Scratch

If we cannot find any data we can of course construct some from scratch ourselves.

julia> atoms = DataFrame(
           Number       = [1,   2,   6],
           Name         = ["Hydrogen", "Helium", "Carbon"],
           AtomicWeight = [1.0079, 4.0026, 12.0107],
           Symbol       = ["H", "He", "C"])
3×4 DataFrame
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │
│     │ Int64  │ String   │ Float64      │ String │
├─────┼────────┼──────────┼──────────────┼────────┤
│ 1   │ 1      │ Hydrogen │ 1.0079       │ H      │
│ 2   │ 2      │ Helium   │ 4.0026       │ He     │
│ 3   │ 6      │ Carbon   │ 12.0107      │ C      │

From Clipboard

Say you found some data online you liked and don’t want to save it to file, then obtain the path, paste the path into Julia etc, you can actually just load directly from clipboard.

Say I want to load this “failures of air-condition equipment” dataset. Just mark all the data and copy it to clipboard. On macOS that would be Command+C, and Ctrl+C on Linux and Windows.

julia> airco = CSV.read(IOBuffer(clipboard()))
24×2 DataFrame
│ Row │ Column1 │ hours │
│     │ Int64   │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 3     │
│ 2   │ 2       │ 5     │
│ 3   │ 3       │ 5     │
...
│ 21  │ 21      │ 139   │
│ 22  │ 22      │ 188   │
│ 23  │ 23      │ 197   │
│ 24  │ 24      │ 210   │

IOBuffer(s) takes a string s and turns it into an IO object so the string can be read like a file.

Download URL

Another alternative is to use Julia’s download function, and download a URL as a temporary file. In this case we are downloading data about beaver temperatures.

julia> beaver = CSV.read(download("http://vincentarelbundock.github.io/Rdatasets/csv/boot/beaver.csv"))
100×5 DataFrame
│ Row │ Column1 │ day   │ time  │ temp    │ activ │
│     │ Int64   │ Int64 │ Int64 │ Float64 │ Int64 │
├─────┼─────────┼───────┼───────┼─────────┼───────┤
│ 1   │ 1       │ 307   │ 930   │ 36.58   │ 0     │
│ 2   │ 2       │ 307   │ 940   │ 36.73   │ 0     │
│ 3   │ 3       │ 307   │ 950   │ 36.93   │ 0     │
│ 4   │ 4       │ 307   │ 1000  │ 37.15   │ 0     │

The way this works is that download uses tools such as wget and curl to download file to a temporary location. It then returns the name of this file. You can also optionally specify the file path to store the downloaded file.

Selecting Subset of DataFrames

The fundamental logic of how you select subset of data is not different from working with a matrix in Julia. Here is a comparison.

julia> A = reshape(1:16, 4, 4)
4×4 reshape(::UnitRange{Int64}, 4, 4) with eltype Int64:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> A[2:3, [1, 4]]
2×2 Array{Int64,2}:
 2  14
 3  15

In both cases we are specifying the rows we want and then the columns. Both the row and the column specification is anything which is iterable. That includes ranges, arrays, tuples and even scalars. Yet you can iterate over a single number in Julia.

julia> for x in 4
           print(x)
       end
4
julia> collect(5)
0-dimensional Array{Int64,0}:
5

You can use a boolean array of the same length as the number of rows or columns depending on whether you are selecting rows or columns:

julia> A[[true, false, true, false], :]
2×4 Array{Int64,2}:
 1  5   9  13
 3  7  11  15

That means you can insert boolean expressions to select elements. For instance here wee ask for the first column of matrix A and then we compare every row to see if it is larger than 3.

julia> A[A[:, 1] .> 3, :]
1×4 Array{Int64,2}:
 4  8  12  16

We can apply many of the same principles to DataFrames. The official documentation gives you a description of all the functionality. However it is so much that you easily get lost, when your needs may be quite simple.

The Julia wiki book is also a great resources if you want to get into more of the details.

What I am doing here is trying to keep it simply while filling in the blanks and teach you some of the overall principles of how data frames work.

With DataFrames we got some added functionality which make them very convenient to work with. In matrix fashion I could ask for a single column with all the rows like this:

julia> atoms[:, :Name]
3-element Array{String,1}:
 "Hydrogen"
 "Helium"
 "Carbon"

But since the column names are symbols we can also use them in an even more convenient fashion.

julia> atoms[:, :Name]
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"

julia> atoms.Number
3-element Array{Int64,1}:
1
2
6

julia> atoms.Name
3-element Array{String,1}:
"Hydrogen"
"Helium"
"Carbon"

julia> atoms.Symbol
3-element Array{String,1}:
"H"
"He"
"C"

This gives us a compact syntax for selecting particular rows based on the values in a column.

julia> atoms[atoms.AtomicWeight .> 4, :]
2×4 DataFrame
│ Row │ Number │ Name   │ AtomicWeight │ Symbol │
│     │ Int64  │ String │ Float64      │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1   │ 2      │ Helium │ 4.0026       │ He     │
│ 2   │ 6      │ Carbon │ 12.0107      │ C      │

The atoms.AtomicWeight .> 4 expression turns into an array of boolean values. Every row with a 1 value gets selected.

julia> atoms.AtomicWeight .> 4
3-element BitArray{1}:
 0
 1
 1

The reason for writing .> 4 rather than > 4 is because we want to do element-wise comparison against four, rather than comparing the whole column as if it was a single object with a value.

If you find this cryptic to read, you can use the normal Julia filter function instead.

julia> filter(atoms) do row
          row.AtomicWeight > 4
      end
2×4 DataFrame
│ Row │ Number │ Name   │ AtomicWeight │ Symbol │
│     │ Int64  │ String │ Float64      │ String │
├─────┼────────┼────────┼──────────────┼────────┤
│ 1   │ 2      │ Helium │ 4.0026       │ He     │
│ 2   │ 6      │ Carbon │ 12.0107      │ C      │

I could do similar with the iris dataset if I only want to look at the data for one species.

julia> filter(iris) do row
           row.Species == "virginica"
       end

Or alternatively the one liner:

iris[iris.Species .== "virginica", :]

Plotting

We will just cover the pure basics. I will do more extensive coverage in another article.

I advice using the StatsPlots package, because it provides a very handy macro @df for plotting.

@df iris scatter(
           :SepalLength,
           :SepalWidth,
           group = :Species
       )

What the macro does is to translate this call into:

scatter(iris.SepalLength, iris.SepalWidth, group = iris.Species)

Because the scatter, plot and other functions in the Plots package expect to be called like this:

plot(xs, ys, attribute1=value, attribute2=value)
scatter(xs, ys, attribute1=value, attribute2=value)

Here are some examples of using some attributes

julia> plot(iris.SepalLength, iris.SepalWidth,
           seriestype=:scatter,
           xaxis="sepal length",
           yaxis="sepal width",
           group = iris.Species,
           markersize=6)

Which can be slightly shortened with StatsPlots.

julia> @df iris plot(:SepalLength, :SepalWidth,
           seriestype=:scatter,
           xaxis="sepal length",
           yaxis="sepal width",
           group = :Species,
           markersize=6)