Slicing and Dicing Pizza Data with Julia and Gadfly

Advantages of using Julia over MS Excel for Data Analysis

Whether you are a scientist trying to understand the spread of COIVD19, or in a sales organization trying to figure out what your most successful products are, you likely work with large tables of data.

The goto method for beginners is to use a spreadsheet application such as Microsoft Excel or Apple’s Numbers. These applications allow you to look at data in tables with named columns.

Image for post

But if spreadsheet applications are so easy and quick to get started with, why would you care to do this with a programming language?

Data Cleanup

Reproduction

Since Excel’s user interface is point-and-click, you’ll need to rely on memory and repetition frequently. You cannot import code or scripts as you would with Julia, Python, R or another programming language.

Importantly when using Julia, you separate data from analysis, while in Excel and Numbers the data and code gets interwoven. This complicates thing such as version control, reusing the same analysis for different data sets etc.

Thus when you need to perform the same analysis multiple times you will see major gains from using a programming language over a spreadsheet application.

Visualization

Image for post

But perhaps more importantly it is much easier to create a large number of plots with minor variations.

Image for post

Okay we have established that using a programming language like Julia to analyze data is useful. So next we will do what is called Exploratory Data Analysis (EDA) on pizza sales data.

Exploratory Data Analysis of Pizza Sales

Every dataset has its own documentation file describing the data. You can read about the pizza sales here.

The dataset is stored as a CSV file at this URL:

https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv

Download

julia> filename = download("https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv");
julia> pzs = CSV.read(filename);

Numbers of Rows and Columns

julia> size(pzs)
(49574, 8)

julia> size(pzs, 1)
49574

julia> size(pzs, 2)
8

julia> nrow(pzs)
49574

julia> ncol(pzs)
8

size is a standard Julia function that also works with regular Julia arrays, and not unique to DataFrames. nrow and ncol is specifically for DataFrames.

First and Last Rows

julia> first(pzs)
DataFrameRow. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼──────────┤
│ 1 │ 1 │ 2015-000001 │ 2015-01-01 │ 11:38:36 │ hawaiian │

julia> last(pzs)
DataFrameRow. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├───────┼─────────┼─────────────┼────────────┼──────────┼─────────┤
│ 49574 │ 49574 │ 2015-021350 │ 2015-12-31 │ 23:02:05 │ bbq_ckn │

first and last can also be used for regular Julia arrays, but with DataFrames we also have the option to specify how many rows we want to look at from the top or bottom:

julia> first(pzs, 3)
3×8 DataFrame. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼─────────────┤
│ 1 │ 1 │ 2015-000001 │ 2015-01-01 │ 11:38:36 │ hawaiian │
│ 2 │ 2 │ 2015-000002 │ 2015-01-01 │ 11:57:40 │ classic_dlx │
│ 3 │ 3 │ 2015-000002 │ 2015-01-01 │ 11:57:40 │ mexicana │

julia> last(pzs, 2)
2×8 DataFrame. Omitted printing of 3 columns
│ Row │ Column1 │ id │ date │ time │ name │
│ │ Int64 │ String │ Dates.Date │ Time… │ String │
├─────┼─────────┼─────────────┼────────────┼──────────┼──────────┤
│ 1 │ 49573 │ 2015-021349 │ 2015-12-31 │ 22:09:54 │ mexicana │
│ 2 │ 49574 │ 2015-021350 │ 2015-12-31 │ 23:02:05 │ bbq_ckn │

If you see other DataFrames tutorials you may see head and tail used instead, but this is deprecated. It was originally in Julia because R's data frames used it. However using first and last instead reduce your cognitive load, because you are already using these functions with regular Julia arrays. There are so many things you can do with DataFrames that we want to keep the API as small as possible.

Now you may wonder why exactly are we looking at the first and last rows to begin with? This is just what you normally do to get a quick feel for the data. We already learned something from that. E.g. that there is a Column1 column which is useless. It is really just the row numbers. Thus let us get rid of it:

julia> pzs = pzs[:, 2:end];

The square brackets take two argument. The first is the range of rows we want and the second is the range of columns we want.

Description of Data

julia> describe(pzs)
7×8 DataFrame. Omitted printing of 3 columns
│ Row │ variable │ mean │ min │ median │ max │
│ │ Symbol │ Union… │ Any │ Union… │ Any │
├─────┼──────────┼─────────┼─────────────┼────────┼─────────────┤
│ 1 │ id │ │ 2015-000001 │ │ 2015-021350 │
│ 2 │ date │ │ 2015-01-01 │ │ 2015-12-31 │
│ 3 │ time │ │ 09:52:21 │ │ 23:05:52 │
│ 4 │ name │ │ bbq_ckn │ │ veggie_veg │
│ 5 │ size │ │ L │ │ XXL │
│ 6 │ type │ │ chicken │ │ veggie │
│ 7 │ price │ 16.4978 │ 9.75 │ 16.5 │ 35.95 │

Selecting a Subset of Columns

julia> select(describe(pzs), :variable, :nunique, :eltype)
7×3 DataFrame
│ Row │ variable │ nunique │ eltype │
│ │ Symbol │ Union… │ DataType │
├─────┼──────────┼─────────┼──────────┤
│ 1 │ id │ 21350 │ String │
│ 2 │ date │ 358 │ Date │
│ 3 │ time │ 16382 │ Time │
│ 4 │ name │ 32 │ String │
│ 5 │ size │ 5 │ String │
│ 6 │ type │ 4 │ String │
│ 7 │ price │ │ Float64 │

To make it easier to work with pizza data and show results in this story I will use select to pick the most relevant/interesting columns:

julia> pz = select(pzs, :id, :name, :size, :price)

I also shorten variable to pz to make it quicker to write.

Split into Multiple Sub-tables with GroupBy

Basically grouping a table by a particular column, means making sub-tables, one for each unique value found in that column.

E.g. in the example below we create one table for each pizza size: S, M, L, XL and XXL.

julia> sizes = groupby(pz, :size);

We can look at these tables individually with first and last:

julia> first(sizes[1], 3)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │
│ 2 │ 2015-000002 │ classic_dlx │ M │ 16.0 │
│ 3 │ 2015-000002 │ mexicana │ M │ 16.0 │

julia> first(sizes[2], 3)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000002 │ thai_ckn │ L │ 20.75 │
│ 2 │ 2015-000002 │ five_cheese │ L │ 18.5 │
│ 3 │ 2015-000002 │ ital_supr │ L │ 20.75 │

Combining Sub-tables into a New Table

So in the example below we call nrow(tbl) on ever subtable tbl in sizes, which gives the number of rows in that table. This allows us to get how many pizzas of each size has been sold.

julia> combine(sizes, nrow)
5×2 DataFrame
│ Row │ size │ nrow │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ M │ 15635 │
│ 2 │ L │ 18956 │
│ 3 │ S │ 14403 │
│ 4 │ XL │ 552 │
│ 5 │ XXL │ 28 │

However this is more flexible. So instead of processing each subtable with a function, we can process a particular column of each subtable.

In the example below we pick the price column of each subtable and run it through the mean function which calculates the average value of that column.

julia> combine(sizes, :price => mean)
5×2 DataFrame
│ Row │ size │ price_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ M │ 15.9503 │
│ 2 │ L │ 19.7995 │
│ 3 │ S │ 12.3638 │
│ 4 │ XL │ 25.5 │
│ 5 │ XXL │ 35.95 │

Since each column is contains price, information, what we end up getting is the average price for each pizza size. We can expand on this to get the total sales, number of pizzas of each size etc all in one table:

julia> combine(sizes, :price => sum, :price => mean, nrow)
5×4 DataFrame
│ Row │ size │ price_sum │ price_mean │ nrow │
│ │ String │ Float64 │ Float64 │ Int64 │
├─────┼────────┼───────────┼────────────┼───────┤
│ 1 │ M │ 2.49382e5 │ 15.9503 │ 15635 │
│ 2 │ L │ 3.75319e5 │ 19.7995 │ 18956 │
│ 3 │ S │ 1.78076e5 │ 12.3638 │ 14403 │
│ 4 │ XL │ 14076.0 │ 25.5 │ 552 │
│ 5 │ XXL │ 1006.6 │ 35.95 │ 28 │

How Will I Remember All Of This?

Here are some things to keep in mind:

  • Data frames try to have an API as similar to regular Julia 2D-arrays as possible (matrices). So what works with a matrix will usually work with a data frame.
  • Data frames mimic a struct with multiple 1D arrays. Writing pz.price gives you the price column, and you can actually change this directly and cause a change in the data frame itself.

Functions such as select, transform, sort, groupby and combine all have a basic form like this:

select(pzs, :price)

Some will take multiple column names, so you could write:

select(pzs, :name, :size, :price)

However groupby will only take one column name naturally. Most will also allow you to replace a symbol with a pair pointing to another symbol or function.

select(pzs, :price => :cost)

Or you can point to a function which will get fed all the values in that column:

julia> bigly(xs) = xs .* 10

julia> select(pzs, :price => bigly)

However how this gets interpreted will vary. In the case above we produce a new table only containing the price column because that is the one we selected. Every value in the price column will get its value multiplied by 10. However if I used transform instead I would get all the columns, plus an extra column resulting from the bigly call.

julia> first(transform(pz, :price => bigly), 3)
3×5 DataFrame
│ Row │ id │ name │ size │ price │ price_bigly │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┼─────────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │ 132.5 │
│ 2 │ 2015-000002 │ classic_dlx │ M │ 16.0 │ 160.0 │
│ 3 │ 2015-000002 │ mexicana │ M │ 16.0 │ 160.0 │

In some cases such as with groupby we have the option of providing functions such as mean and sum which turn multiple values into one value. But we don't have to. We could use bigly as well which returns as many values as it gets in.

combine(sizes, :price => bigly)

However if we compare bigly and mean you can see that with the former we get out a table of the same size as the original one, while the latter collapse it.

julia> first(combine(sizes, :price => bigger), 8)
8×2 DataFrame
│ Row │ size │ price_bigger │
│ │ String │ Float64 │
├─────┼────────┼──────────────┤
│ 1 │ M │ 132.5 │
│ 2 │ M │ 160.0 │
│ 3 │ M │ 160.0 │
│ 4 │ M │ 165.0 │
│ 5 │ M │ 165.0 │
│ 6 │ M │ 165.0 │
│ 7 │ M │ 167.5 │
│ 8 │ M │ 165.0 │

julia> combine(sizes, :price => mean)
5×2 DataFrame
│ Row │ size │ price_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ M │ 15.9503 │
│ 2 │ L │ 19.7995 │
│ 3 │ S │ 12.3638 │
│ 4 │ XL │ 25.5 │
│ 5 │ XXL │ 35.95 │

julia> size(combine(sizes, :price => bigger))
(49574, 2)

Visualizing Tabulated Data

julia> using Gadfly
julia> plot(pzs, x=:size, Geom.histogram())
Image for post

In this example I am using a powerful plotting library called Gadfly. If you haven’t installed it already, you would have to go into package mode and install it:

(@v1.5) pkg> add Gadfly

You can read a more proper overview of plotting terminology here, which goes into more details about data sources, data series and Gadfly concepts such as aesthetic.

Below is a simple illustration of the aesthetic concept in Gadfly.

Describes a plot like plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)
Describes a plot like plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)
Describes a plot like plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)

Basically Gadfly is centered on this idea that every type of plot is a type of Geometry. A scatter plot will be a different geometry from a bar chart or histogram.

Every geometry has one or more aesthetics, which control how a particular geometry is visualized. For a simple line plot as shown here, there will be an x and y aesthetic, which says where each point will be placed in the coordinate system of the plot.

We explicitly connect columns in our source data, e.g. our data frame to specific aesthetics. We can also add transformations so that the values get scaled or transformed in some other way before they enter the geometry of the plot.

We can recreate a similar plot with the following code:

julia> xs = 1:2:12
1:2:11

julia> ys = xs.^2
6-element Array{Int64,1}:
1
9
25
49
81
121

julia> plot(x=xs, y=ys, Geom.LineGeometry, Geom.PointGeometry)
Image for post

When working with DataFrames we specify data as symbol name of columns instead. So if we want to get a histogram plot of the names of pizzas we would write:

plot(pz, x=:name, Geom.histogram())
Image for post

This basically tells us how many of each pizza type we are selling, we could be useful information for a pizza restaurant trying to get a sense of what type of pizza they are selling most of.

But it should be even more interesting to know what pizza is actually bringing in the most income. To find out that, we would have to add up the price for each pizza of a given name sold. We can do this by using groupby and combine:

julia> nms = groupby(pz, :name);
julia> profitables = combine(nms, :price => sum)

The table we get from this isn’t what easy to get a quick overview over by just reading raw data. So let us do a bar plot to show that.

plot(profitables, x=:name, y=:price_sum, Geom.bar())

Why are we not using a histogram? Because a histogram is to show us how frequently a particular value occurs. But in this case we don’t want to see how frequently a pizza name occurs but rather how what the price sum is for each pizza name.

Image for post

However this chart is still not that quick to read. To see the top pizzas it would be nicer if everything was sorted according to price, would it not?

And of course we can sort by a given column in Julia.

profitables = sort(profitable_pizzas, :price_sum)
plot(profitables, x=:name, y=:price_sum, Geom.bar())
Image for post

From this we can see that our most profitable pizzas are:

  1. Thai chicken
  2. BBQ chicken
  3. California chicken

While Gadfly has some powerful concepts, it is lacking in examples in the documentation, so next story I will create more varied plots for out pizza data using the Plots plotting package.

Written by

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store