Slicing and Dicing Pizza Data with Julia and Gadfly
Advantages of using Julia over MS Excel for Data Analysis
Whether you are a scientist trying to understand the spread of COIVD19, or in a sales organization trying to figure out what your most successful products are, you likely work with large tables of data.
The goto method for beginners is to use a spreadsheet application such as Microsoft Excel or Apple’s Numbers. These applications allow you to look at data in tables with named columns.

But if spreadsheet applications are so easy and quick to get started with, why would you care to do this with a programming language?
Data Cleanup
If data is not well structured and needs a lot of preparation and cleanup, then a Spreadsheet tool is not well suited for that. You would want a programming language. This is one of the things we will explore here.
Reproduction
By using a programming language you can easily reproduce analyses repeatedly and with very different datasets.
Since Excel’s user interface is point-and-click, you’ll need to rely on memory and repetition frequently. You cannot import code or scripts as you would with Julia, Python, R or another programming language.
Importantly when using Julia, you separate data from analysis, while in Excel and Numbers the data and code gets interwoven. This complicates thing such as version control, reusing the same analysis for different data sets etc.
Thus when you need to perform the same analysis multiple times you will see major gains from using a programming language over a spreadsheet application.
Visualization
Spreadsheets tend to have much simpler visualizations, that what Julia can offer. Below is example of a vectorfield.

But perhaps more importantly it is much easier to create a large number of plots with minor variations.

Okay we have established that using a programming language like Julia to analyze data is useful. So next we will do what is called Exploratory Data Analysis (EDA) on pizza sales data.
Exploratory Data Analysis of Pizza Sales
We are going to look at a dataset found in a popular collection of datasets called Rdatasets which typically comes bundled with the R programming language.
Every dataset has its own documentation file describing the data. You can read about the pizza sales here.
The dataset is stored as a CSV file at this URL:
https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv
Download
Using the download
function we download data to a temporary location. We can then use the read
function in the CSV
package to read this file afterwards:
julia> filename = download("https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv");
julia> pzs = CSV.read(filename);
Numbers of Rows and Columns
This is a lot of data, which we can quickly determine with size
:
julia> size(pzs)
(49574, 8)
julia> size(pzs, 1)
49574
julia> size(pzs, 2)
8
julia> nrow(pzs)
49574
julia> ncol(pzs)
8
size
is a standard Julia function that also works with regular Julia arrays, and not unique to DataFrames
. nrow
and ncol
is specifically for DataFrames
.
First and Last Rows
This tells us that we got 49 574 rows and 8 columns. We can use first
and last
to get a sense of the first and last rows in this data:
julia> first(pzs)
DataFrameRow. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼──────────┤
│ 1 │ 1 │ 2015-000001 │ 2015-01-01 │ 11:38:36 │ hawaiian │
julia> last(pzs)
DataFrameRow. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├───────┼─────────┼─────────────┼────────────┼──────────┼─────────┤
│ 49574 │ 49574 │ 2015-021350 │ 2015-12-31 │ 23:02:05 │ bbq_ckn │
first
and last
can also be used for regular Julia arrays, but with DataFrames
we also have the option to specify how many rows we want to look at from the top or bottom:
julia> first(pzs, 3)
3×8 DataFrame. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼─────────────┤
│ 1 │ 1 │ 2015-000001 │ 2015-01-01 │ 11:38:36 │ hawaiian │
│ 2 │ 2 │ 2015-000002 │ 2015-01-01 │ 11:57:40 │ classic_dlx │
│ 3 │ 3 │ 2015-000002 │ 2015-01-01 │ 11:57:40 │ mexicana │
julia> last(pzs, 2)
2×8 DataFrame. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼──────────┤
│ 1 │ 49573 │ 2015-021349 │ 2015-12-31 │ 22:09:54 │ mexicana │
│ 2 │ 49574 │ 2015-021350 │ 2015-12-31 │ 23:02:05 │ bbq_ckn │
If you see other DataFrames
tutorials you may see head
and tail
used instead, but this is deprecated. It was originally in Julia because R's data frames used it. However using first
and last
instead reduce your cognitive load, because you are already using these functions with regular Julia arrays. There are so many things you can do with DataFrames
that we want to keep the API as small as possible.
Now you may wonder why exactly are we looking at the first and last rows to begin with? This is just what you normally do to get a quick feel for the data. We already learned something from that. E.g. that there is a Column1
column which is useless. It is really just the row numbers. Thus let us get rid of it:
julia> pzs = pzs[:, 2:end];
The square brackets take two argument. The first is the range of rows we want and the second is the range of columns we want.
Description of Data
Another useful function to get an overview over our data is describe
:
julia> describe(pzs)
7×8 DataFrame. Omitted printing of 3 columns
│ Row │ variable │ mean │ min │ median │ max │
│ │ Symbol │ Union… │ Any │ Union… │ Any │
├─────┼──────────┼─────────┼─────────────┼────────┼─────────────┤
│ 1 │ id │ │ 2015-000001 │ │ 2015-021350 │
│ 2 │ date │ │ 2015-01-01 │ │ 2015-12-31 │
│ 3 │ time │ │ 09:52:21 │ │ 23:05:52 │
│ 4 │ name │ │ bbq_ckn │ │ veggie_veg │
│ 5 │ size │ │ L │ │ XXL │
│ 6 │ type │ │ chicken │ │ veggie │
│ 7 │ price │ 16.4978 │ 9.75 │ 16.5 │ 35.95 │
Selecting a Subset of Columns
One problem with showing this data in a document is that we often get more columns than we can show. Hence it is good to be able to select only the columns we are most interested in when displaying.
julia> select(describe(pzs), :variable, :nunique, :eltype)
7×3 DataFrame
│ Row │ variable │ nunique │ eltype │
│ │ Symbol │ Union… │ DataType │
├─────┼──────────┼─────────┼──────────┤
│ 1 │ id │ 21350 │ String │
│ 2 │ date │ 358 │ Date │
│ 3 │ time │ 16382 │ Time │
│ 4 │ name │ 32 │ String │
│ 5 │ size │ 5 │ String │
│ 6 │ type │ 4 │ String │
│ 7 │ price │ │ Float64 │
To make it easier to work with pizza data and show results in this story I will use select
to pick the most relevant/interesting columns:
julia> pz = select(pzs, :id, :name, :size, :price)
I also shorten variable to pz
to make it quicker to write.
Split into Multiple Sub-tables with GroupBy
With groupby
we can create multiple subtables where a given column has the same value.
Basically grouping a table by a particular column, means making sub-tables, one for each unique value found in that column.
E.g. in the example below we create one table for each pizza size: S
, M
, L
, XL
and XXL
.
julia> sizes = groupby(pz, :size);
We can look at these tables individually with first
and last
:
julia> first(sizes[1], 3)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │
│ 2 │ 2015-000002 │ classic_dlx │ M │ 16.0 │
│ 3 │ 2015-000002 │ mexicana │ M │ 16.0 │
julia> first(sizes[2], 3)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000002 │ thai_ckn │ L │ 20.75 │
│ 2 │ 2015-000002 │ five_cheese │ L │ 18.5 │
│ 3 │ 2015-000002 │ ital_supr │ L │ 20.75 │
Combining Sub-tables into a New Table
The benefits of having multiple tables grouped by each value in a particular column, is that you can combine these again. Using the combine
function we turn every array in one of the subtables into a single value, and then combine all these subtables into a new table.
So in the example below we call nrow(tbl)
on ever subtable tbl
in sizes
, which gives the number of rows in that table. This allows us to get how many pizzas of each size has been sold.
julia> combine(sizes, nrow)
5×2 DataFrame
│ Row │ size │ nrow │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ M │ 15635 │
│ 2 │ L │ 18956 │
│ 3 │ S │ 14403 │
│ 4 │ XL │ 552 │
│ 5 │ XXL │ 28 │
However this is more flexible. So instead of processing each subtable with a function, we can process a particular column of each subtable.
In the example below we pick the price
column of each subtable and run it through the mean
function which calculates the average value of that column.
julia> combine(sizes, :price => mean)
5×2 DataFrame
│ Row │ size │ price_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ M │ 15.9503 │
│ 2 │ L │ 19.7995 │
│ 3 │ S │ 12.3638 │
│ 4 │ XL │ 25.5 │
│ 5 │ XXL │ 35.95 │
Since each column is contains price, information, what we end up getting is the average price for each pizza size. We can expand on this to get the total sales, number of pizzas of each size etc all in one table:
julia> combine(sizes, :price => sum, :price => mean, nrow)
5×4 DataFrame
│ Row │ size │ price_sum │ price_mean │ nrow │
│ │ String │ Float64 │ Float64 │ Int64 │
├─────┼────────┼───────────┼────────────┼───────┤
│ 1 │ M │ 2.49382e5 │ 15.9503 │ 15635 │
│ 2 │ L │ 3.75319e5 │ 19.7995 │ 18956 │
│ 3 │ S │ 1.78076e5 │ 12.3638 │ 14403 │
│ 4 │ XL │ 14076.0 │ 25.5 │ 552 │
│ 5 │ XXL │ 1006.6 │ 35.95 │ 28 │
How Will I Remember All Of This?
The functionality for manipulating data frame can seem a bit overwhelming. I am just showing you one example after the other which become sort of like cooking receipts one may try to memorize. Thus it helps to instead understand the general pattern being applied over and over again.
Here are some things to keep in mind:
- Data frames try to have an API as similar to regular Julia 2D-arrays as possible (matrices). So what works with a matrix will usually work with a data frame.
- Data frames mimic a struct with multiple 1D arrays. Writing
pz.price
gives you theprice
column, and you can actually change this directly and cause a change in the data frame itself.
Functions such as select
, transform
, sort
, groupby
and combine
all have a basic form like this:
select(pzs, :price)
Some will take multiple column names, so you could write:
select(pzs, :name, :size, :price)
However groupby
will only take one column name naturally. Most will also allow you to replace a symbol with a pair pointing to another symbol or function.
select(pzs, :price => :cost)
Or you can point to a function which will get fed all the values in that column:
julia> bigly(xs) = xs .* 10
julia> select(pzs, :price => bigly)
However how this gets interpreted will vary. In the case above we produce a new table only containing the price
column because that is the one we selected. Every value in the price
column will get its value multiplied by 10. However if I used transform
instead I would get all the columns, plus an extra column resulting from the bigly
call.
julia> first(transform(pz, :price => bigly), 3)
3×5 DataFrame
│ Row │ id │ name │ size │ price │ price_bigly │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┼─────────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │ 132.5 │
│ 2 │ 2015-000002 │ classic_dlx │ M │ 16.0 │ 160.0 │
│ 3 │ 2015-000002 │ mexicana │ M │ 16.0 │ 160.0 │
In some cases such as with groupby
we have the option of providing functions such as mean
and sum
which turn multiple values into one value. But we don't have to. We could use bigly
as well which returns as many values as it gets in.
combine(sizes, :price => bigly)
However if we compare bigly
and mean
you can see that with the former we get out a table of the same size as the original one, while the latter collapse it.
julia> first(combine(sizes, :price => bigger), 8)
8×2 DataFrame
│ Row │ size │ price_bigger │
│ │ String │ Float64 │
├─────┼────────┼──────────────┤
│ 1 │ M │ 132.5 │
│ 2 │ M │ 160.0 │
│ 3 │ M │ 160.0 │
│ 4 │ M │ 165.0 │
│ 5 │ M │ 165.0 │
│ 6 │ M │ 165.0 │
│ 7 │ M │ 167.5 │
│ 8 │ M │ 165.0 │
julia> combine(sizes, :price => mean)
5×2 DataFrame
│ Row │ size │ price_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ M │ 15.9503 │
│ 2 │ L │ 19.7995 │
│ 3 │ S │ 12.3638 │
│ 4 │ XL │ 25.5 │
│ 5 │ XXL │ 35.95 │
julia> size(combine(sizes, :price => bigger))
(49574, 2)
Visualizing Tabulated Data
Of course looking at lots of numbers is not always a great way of getting a good overview of your data. The human brain is really tuned to looking at pictures and not numbers. Hence if we can turn numbers into images you can process more information at the same time. That is why plotting is powerful way of conveying information.
julia> using Gadfly
julia> plot(pzs, x=:size, Geom.histogram())

In this example I am using a powerful plotting library called Gadfly. If you haven’t installed it already, you would have to go into package mode and install it:
(@v1.5) pkg> add Gadfly
You can read a more proper overview of plotting terminology here, which goes into more details about data sources, data series and Gadfly concepts such as aesthetic.
Below is a simple illustration of the aesthetic concept in Gadfly.

plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)
Basically Gadfly is centered on this idea that every type of plot is a type of Geometry. A scatter plot will be a different geometry from a bar chart or histogram.
Every geometry has one or more aesthetics, which control how a particular geometry is visualized. For a simple line plot as shown here, there will be an x
and y
aesthetic, which says where each point will be placed in the coordinate system of the plot.
We explicitly connect columns in our source data, e.g. our data frame to specific aesthetics. We can also add transformations so that the values get scaled or transformed in some other way before they enter the geometry of the plot.
We can recreate a similar plot with the following code:
julia> xs = 1:2:12
1:2:11
julia> ys = xs.^2
6-element Array{Int64,1}:
1
9
25
49
81
121
julia> plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)

When working with DataFrames
we specify data as symbol name of columns instead. So if we want to get a histogram plot of the names of pizzas we would write:
plot(pz, x=:name, Geom.histogram())

This basically tells us how many of each pizza type we are selling, we could be useful information for a pizza restaurant trying to get a sense of what type of pizza they are selling most of.
But it should be even more interesting to know what pizza is actually bringing in the most income. To find out that, we would have to add up the price for each pizza of a given name sold. We can do this by using groupby
and combine
:
julia> nms = groupby(pz, :name);
julia> profitables = combine(nms, :price => sum)
The table we get from this isn’t what easy to get a quick overview over by just reading raw data. So let us do a bar plot to show that.
plot(profitables, x=:name, y=:price_sum, Geom.bar())
Why are we not using a histogram? Because a histogram is to show us how frequently a particular value occurs. But in this case we don’t want to see how frequently a pizza name occurs but rather how what the price sum is for each pizza name.

However this chart is still not that quick to read. To see the top pizzas it would be nicer if everything was sorted according to price, would it not?
And of course we can sort by a given column in Julia.
profitables = sort(profitable_pizzas, :price_sum)
plot(profitables, x=:name, y=:price_sum, Geom.bar())

From this we can see that our most profitable pizzas are:
- Thai chicken
- BBQ chicken
- California chicken
While Gadfly has some powerful concepts, it is lacking in examples in the documentation, so next story I will create more varied plots for out pizza data using the Plots
plotting package.