Ditch Excel and Use Julia Data Frames

Manipulating and visualizing pizza sales data using Julia DataFrames.jl and Plots.jl

Image for post
Image for post
Photo by Brett Jordan from Pexels
https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv

Loading Data

First we will load the data in Julia and pick a subset (id, name, size and price) of columns in the table to work with:

using DataFrames, CSV

url = "https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv"
filename = download(url)
all_pizzas = CSV.read(filename)

# Get rid of column with row numbers
all_pizzas = all_pizzas[:, 2:end]

# Pick most interesting columns
pz = select(all_pizzas, :id, :name, :size, :price)
julia> first(pz, 4)
4×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │
│ 2 │ 2015-000002 │ classic_dlx │ M │ 16.0 │
│ 3 │ 2015-000002 │ mexicana │ M │ 16.0 │
│ 4 │ 2015-000002 │ thai_ckn │ L │ 20.75 │

julia> nrow(pz)
49574

Sampling Data

We are going to pick a random sample of 16 rows from the 49 574 rows we have loaded in. To do that we will randomly shuffle the row indices from 1 to 49 574.

julia> using Random

julia> rows = shuffle(1:nrow(pz))
julia> sample = pz[rows[1:16], :]
16×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │
│ 4 │ 2015-017919 │ sicilian │ S │ 12.25 │
│ 5 │ 2015-015337 │ prsc_argla │ M │ 16.5 │
│ 6 │ 2015-006190 │ ital_veggie │ S │ 12.75 │
│ 7 │ 2015-015481 │ spin_pesto │ S │ 12.5 │
│ 8 │ 2015-007865 │ hawaiian │ L │ 16.5 │
│ 9 │ 2015-001928 │ bbq_ckn │ L │ 20.75 │
│ 10 │ 2015-017298 │ cali_ckn │ S │ 12.75 │
│ 11 │ 2015-018872 │ four_cheese │ L │ 17.95 │
│ 12 │ 2015-018036 │ four_cheese │ L │ 17.95 │
│ 13 │ 2015-011238 │ classic_dlx │ L │ 20.5 │
│ 14 │ 2015-013366 │ classic_dlx │ M │ 16.0 │
│ 15 │ 2015-014380 │ bbq_ckn │ M │ 16.75 │
│ 16 │ 2015-020245 │ ital_cpcllo │ S │ 12.0 │

Transfer Pizza Sales to Spreadsheet

However we cannot easily copy and paste data on this “pretty” format into a spreadsheet application such as Excel. We want it on CSV format. Fortunately Julia display system allows us to display the same data in many different formats.

julia> display(sample)
16×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │
│ 4 │ 2015-017919 │ sicilian │ S │ 12.25 │
│ 5 │ 2015-015337 │ prsc_argla │ M │ 16.5 │
│ 6 │ 2015-006190 │ ital_veggie │ S │ 12.75 │
│ 7 │ 2015-015481 │ spin_pesto │ S │ 12.5 │
│ 8 │ 2015-007865 │ hawaiian │ L │ 16.5 │
│ 9 │ 2015-001928 │ bbq_ckn │ L │ 20.75 │
│ 10 │ 2015-017298 │ cali_ckn │ S │ 12.75 │
│ 11 │ 2015-018872 │ four_cheese │ L │ 17.95 │
│ 12 │ 2015-018036 │ four_cheese │ L │ 17.95 │
│ 13 │ 2015-011238 │ classic_dlx │ L │ 20.5 │
│ 14 │ 2015-013366 │ classic_dlx │ M │ 16.0 │
│ 15 │ 2015-014380 │ bbq_ckn │ M │ 16.75 │
│ 16 │ 2015-020245 │ ital_cpcllo │ S │ 12.0 │
julia> display("text/csv", sample)
"id","name","size","price"
"2015-000348","thai_ckn","S",12.75
"2015-007731","green_garden","S",12.0
"2015-014409","hawaiian","S",10.5
"2015-017919","sicilian","S",12.25
"2015-015337","prsc_argla","M",16.5
"2015-006190","ital_veggie","S",12.75
"2015-015481","spin_pesto","S",12.5
"2015-007865","hawaiian","L",16.5
"2015-001928","bbq_ckn","L",20.75
"2015-017298","cali_ckn","S",12.75
"2015-018872","four_cheese","L",17.95
"2015-018036","four_cheese","L",17.95
"2015-011238","classic_dlx","L",20.5
"2015-013366","classic_dlx","M",16.0
"2015-014380","bbq_ckn","M",16.75
"2015-020245","ital_cpcllo","S",12.0
julia> CSV.write("pizza-sales.csv", sample)
"pizza-sales.csv"
julia> open("pizza-sales.csv", "w") do io
show(io, MIME("text/csv"), sample)
end
julia> buf = IOBuffer();
julia> CSV.write(buf, sample);
julia> clipboard(String(take!(buf)))
Image for post
julia> using Statistics

julia> sum(sample.price)
240.39999999999998

julia> mean(sample.price)
15.024999999999999

julia> minimum(sample.price)
10.5

julia> maximum(sample.price)
20.75

julia> nrow(sample)
16

Plotting Histograms

When doing Exploratory Data Analysis (EDA) one of the first useful things you can do to get a sense of the data, is to plot interesting data in histograms.

julia> histogram(sample.price, bins = 4)
julia> histogram(pz.price, bins = 6)
julia> p1 = histogram(sample.price,
bins = 4,
xlims = (0, 40),
xaxis = "price",
yaxis = "count",
legend=nothing)

julia> p2 = histogram(pz.price,
bins = 4,
xlims = (0, 40),
xaxis = "price",
yaxis = "count",
legend=nothing)

julia> p = plot(p1, p2)
Image for post
julia> savefig(p, "pizza-price-hist.png")

Interesting Questions to Ask About Our Data

After looking at the data we may develop some questions about the data which we want to have answered, such as:

  • What size of pizza, small, medium, large etc is sold the most?
  • What pizza brings in the most revenue?
  • How much is typically spent on each pizza order?

Slicing and Dicing Data Frames

Working with Data Frames we got several key functions that help us manipulate the data:

  • transform Similar to select except we are not removing any columns. We are just renaming and transforming selected columns.
  • groupby Turns a table into multiple tables. The splitting is done by making a table for each unique value of a particular selected column.
  • combine Takes multiple tables and turn them into one table again. Allows you collapse all rows in each table into a single row.
  • join Let you match up a column in two different tables to join them into one table.
julia> tiny = first(sample, 3)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │
julia> select(tiny, :price)
3×1 DataFrame
│ Row │ price │
│ │ Float64 │
├─────┼─────────┤
│ 1 │ 12.75 │
│ 2 │ 12.0 │
│ 3 │ 10.5 │

julia> transform(tiny, :price)
3×4 DataFrame
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │
julia> select(tiny, :price => :cost)
3×1 DataFrame
│ Row │ cost │
│ │ Float64 │
├─────┼─────────┤
│ 1 │ 12.75 │
│ 2 │ 12.0 │
│ 3 │ 10.5 │

julia> transform(tiny, :price => :cost)
3×5 DataFrame
│ Row │ id │ name │ size │ price │ cost │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │ 12.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │ 10.5 │
julia> select(tiny, :price => mean)
3×1 DataFrame
│ Row │ price_mean │
│ │ Float64 │
├─────┼────────────┤
│ 1 │ 11.75 │
│ 2 │ 11.75 │
│ 3 │ 11.75 │

julia> transform(tiny, :price => mean)
3×5 DataFrame
│ Row │ id │ name │ size │ price │ price_mean │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┼────────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │ 11.75 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │ 11.75 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │ 11.75 │
julia> round(3.4)
3.0

julia> rounder = ByRow(round);

julia> rounder([1.2, 4.8, 8.3])
3-element Array{Float64,1}:
1.0
5.0
8.0
julia> select(tiny, :price => ByRow(round))
3×1 DataFrame
│ Row │ price_round │
│ │ Float64 │
├─────┼─────────────┤
│ 1 │ 13.0 │
│ 2 │ 12.0 │
│ 3 │ 10.0 │

julia> transform(tiny, :price => ByRow(round))
3×5 DataFrame
│ Row │ id │ name │ size │ price │ price_round │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┼─────────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │ 13.0 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │ 10.0 │
julia> transform(tiny, :price => ByRow(round) => :rounded)
3×5 DataFrame
│ Row │ id │ name │ size │ price │ rounded │
│ │ String │ String │ String │ Float64 │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┼─────────┤
│ 1 │ 2015-000348 │ thai_ckn │ S │ 12.75 │ 13.0 │
│ 2 │ 2015-007731 │ green_garden │ S │ 12.0 │ 12.0 │
│ 3 │ 2015-014409 │ hawaiian │ S │ 10.5 │ 10.0 │
julia> combine(tiny, :price => mean)
1×1 DataFrame
│ Row │ price_mean │
│ │ Float64 │
├─────┼────────────┤
│ 1 │ 11.75 │
julia> combine(tiny, :name, :price => mean)
3×2 DataFrame
│ Row │ name │ price_mean │
│ │ String │ Float64 │
├─────┼──────────────┼────────────┤
│ 1 │ thai_ckn │ 11.75 │
│ 2 │ green_garden │ 11.75 │
│ 3 │ hawaiian │ 11.75 │
Image for post
Image for post
Creating categories in Apple Numbers, called Pivot Tables in Excel. This shows grouping of pizzas by type and calculating subtotal for each group. With grouping and combining you achieve the same in Julia.

Splitting and Recombining Tables

The real power of combine however comes from splitting a data frame up into groups with groupby and then recombining these again to a table.

  • sum Add up values in column.
  • mean Find the arithmetic mean (average).
julia> sizes = groupby(sample, :size)
GroupedDataFrame with 3 groups based on key: size
Group 1 (7 rows): size = "M"
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-018649 │ pepperoni │ M │ 12.5 │
│ 2 │ 2015-011258 │ hawaiian │ M │ 13.25 │
│ 3 │ 2015-009299 │ veggie_veg │ M │ 16.0 │
│ 4 │ 2015-010260 │ peppr_salami │ M │ 16.5 │
│ 5 │ 2015-017124 │ hawaiian │ M │ 13.25 │
│ 6 │ 2015-011800 │ thai_ckn │ M │ 16.75 │
│ 7 │ 2015-008107 │ ckn_alfredo │ M │ 16.75 │
Group 2 (5 rows): size = "L"
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1 │ 2015-000629 │ spin_pesto │ L │ 20.75 │
│ 2 │ 2015-011532 │ spinach_fet │ L │ 20.25 │
│ 3 │ 2015-019947 │ pepperoni │ L │ 15.25 │
│ 4 │ 2015-002630 │ thai_ckn │ L │ 20.75 │
│ 5 │ 2015-018629 │ cali_ckn │ L │ 20.75 │
Group 3 (4 rows): size = "S"
│ Row │ id │ name │ size │ price │
│ │ String │ String │ String │ Float64 │
├─────┼─────────────┼──────────────┼────────┼─────────┤
│ 1 │ 2015-017814 │ green_garden │ S │ 12.0 │
│ 2 │ 2015-012022 │ veggie_veg │ S │ 12.0 │
│ 3 │ 2015-010260 │ southw_ckn │ S │ 12.75 │
│ 4 │ 2015-010846 │ big_meat │ S │ 12.0 │
julia> combine(sizes, :size => length)
3×2 DataFrame
│ Row │ size │ size_length │
│ │ String │ Int64 │
├─────┼────────┼─────────────┤
│ 1 │ M │ 7 │
│ 2 │ L │ 5 │
│ 3 │ S │ 4 │
julia> combine(sizes, :size => length => :amount)
3×2 DataFrame
│ Row │ size │ amount │
│ │ String │ Int64 │
├─────┼────────┼────────┤
│ 1 │ M │ 7 │
│ 2 │ L │ 5 │
│ 3 │ S │ 4 │
julia> pz_sizes = combine(groupby(pz, :size), :size => length => :amount)

julia> bar(pz_sizes.size, pz_sizes.amount, bins=5, legend=nothing)
Image for post

Joining Tables

There are a lot of ways of defining an order for pizza sizes, but here I will solves this by joining tables, because it is something you should know how to do anyway. A simple solution is to give sizes a numerical value.

  • Medium — 31 cm
  • Large — 36 cm
  • X-Large — 41 cm
  • Jumbo — 46 cm
julia> diameters = DataFrame(size = ["S", "M", "L", "XL", "XXL"],
diameter = [26, 31, 36, 41, 46])
5×2 DataFrame
│ Row │ size │ diameter │
│ │ String │ Int64 │
├─────┼────────┼──────────┤
│ 1 │ S │ 26 │
│ 2 │ M │ 31 │
│ 3 │ L │ 36 │
│ 4 │ XL │ 41 │
│ 5 │ XXL │ 46 │
julia> pz = join(pz, diameters, on=:size);

julia> select(pz, :name, :size, :diameter, :price)
49574×4 DataFrame
│ Row │ name │ size │ diameter │ price │
│ │ String │ String │ Int64 │ Float64 │
├───────┼──────────────┼────────┼──────────┼─────────┤
│ 1 │ spin_pesto │ L │ 36 │ 20.75 │
│ 2 │ pepperoni │ M │ 31 │ 12.5 │
│ 3 │ hawaiian │ M │ 31 │ 13.25 │
│ 4 │ spinach_fet │ L │ 36 │ 20.25 │
│ 5 │ veggie_veg │ M │ 31 │ 16.0 │
julia> pz_sizes = combine(groupby(pz, :size), :size => length => :amount)

julia> pz_sizes = join(pz_sizes, diameters, on=:size)
5×3 DataFrame
│ Row │ size │ amount │ diameter │
│ │ String │ Int64 │ Int64 │
├─────┼────────┼────────┼──────────┤
│ 1 │ M │ 15635 │ 31 │
│ 2 │ L │ 18956 │ 36 │
│ 3 │ S │ 14403 │ 26 │
│ 4 │ XL │ 552 │ 41 │
│ 5 │ XXL │ 28 │ 46 │
julia> sort!(pz_sizes, :diameter)
5×3 DataFrame
│ Row │ size │ amount │ diameter │
│ │ String │ Int64 │ Int64 │
├─────┼────────┼────────┼──────────┤
│ 1 │ S │ 14403 │ 26 │
│ 2 │ M │ 15635 │ 31 │
│ 3 │ L │ 18956 │ 36 │
│ 4 │ XL │ 552 │ 41 │
│ 5 │ XXL │ 28 │ 46 │
bar(pz_sizes.size, pz_sizes.amount, bins=5, legend=nothing)
Image for post
pie(pz_sizes.size, pz_sizes.amount, title="Most sold pizza sizes")
Image for post

What pizza size is most popular?

Looking at our charts we can see that most pizzas sold is the large one with medium on a solid second. But how does it look in terms of pizza slices eaten?

julia> area(r) = π*r^2
julia> pz_sizes.area = area.(pz_sizes.diameter ./ 2);
julia> pz_sizes.area = round.(Int, pz_sizes.area)
julia> pz_sizes
5×4 DataFrame
│ Row │ size │ amount │ diameter │ area │
│ │ String │ Int64 │ Int64 │ Int64 │
├─────┼────────┼────────┼──────────┼───────┤
│ 1 │ S │ 14403 │ 26 │ 531 │
│ 2 │ M │ 15635 │ 31 │ 755 │
│ 3 │ L │ 18956 │ 36 │ 1018 │
│ 4 │ XL │ 552 │ 41 │ 1320 │
│ 5 │ XXL │ 28 │ 46 │ 1662 │
julia> pzs = pz_sizes
julia> pzs.total_area = pzs.area .* pzs.amount
julia> pzs
5×5 DataFrame
│ Row │ size │ amount │ diameter │ area │ total_area │
│ │ String │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼────────┼────────┼──────────┼───────┼────────────┤
│ 1 │ S │ 14403 │ 26 │ 531 │ 7647993 │
│ 2 │ M │ 15635 │ 31 │ 755 │ 11804425 │
│ 3 │ L │ 18956 │ 36 │ 1018 │ 19297208 │
│ 4 │ XL │ 552 │ 41 │ 1320 │ 728640 │
│ 5 │ XXL │ 28 │ 46 │ 1662 │ 46536 │
p1 = pie(pzs.size, pzs.amount, title="Pizzas by Numbers");
p2 = pie(pzs.size, pzs.total_area, title="Pizzas by Area");
p = plot(p1, p2)
Image for post

What pizza brings in the most revenue?

We begin by creating a subgroup for each pizza.

groupby(pz, :name)
combine(groupby(pz, :name), :price => sum => :revenue)
julia> pizzas = groupby(pz, :name);
julia> pizzas = combine(pizzas, :price => sum => :revenue);

julia> top10 = first(sort(pizzas, :revenue, rev=true), 10)
10×2 DataFrame
│ Row │ name │ revenue │
│ │ String │ Float64 │
├─────┼─────────────┼─────────┤
│ 1 │ thai_ckn │ 43434.2 │
│ 2 │ bbq_ckn │ 42768.0 │
│ 3 │ cali_ckn │ 41409.5 │
│ 4 │ classic_dlx │ 38180.5 │
│ 5 │ spicy_ital │ 34831.2 │
│ 6 │ southw_ckn │ 34705.8 │
│ 7 │ ital_supr │ 33476.8 │
│ 8 │ hawaiian │ 32273.2 │
│ 9 │ four_cheese │ 32265.7 │
│ 10 │ sicilian │ 30940.5 │
bar(top10.name, top10.revenue, label="pizza revenue")
Image for post

How much is typically spent on each pizza order?

To get how much is spent on each pizza order we create a subgroup of the data frame for each unique order id:

julia> groupby(pz, :id)
GroupedDataFrame with 21350 groups based on key: id
First Group (1 row): id = "2015-000001"
│ Row │ id │ name │ size │ price │ diameter │ area │
│ │ String │ String │ String │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼──────────┼────────┼─────────┼──────────┼─────────┤
│ 1 │ 2015-000001 │ hawaiian │ M │ 13.25 │ 0.31 │ 0.0755 │

Last Group (1 row): id = "2015-021350"
│ Row │ id │ name │ size │ price │ diameter │ area │
│ │ String │ String │ String │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼─────────┼────────┼─────────┼──────────┼─────────┤
│ 1 │ 2015-021350 │ bbq_ckn │ S │ 12.75 │ 0.26 │ 0.0531 │
julia> orders = combine(groupby(pz, :id), :price => sum => :sum)
21350×2 DataFrame
│ Row │ id │ sum │
│ │ String │ Float64 │
├───────┼─────────────┼─────────┤
│ 1 │ 2015-000001 │ 13.25 │
│ 2 │ 2015-000002 │ 92.0 │
│ 3 │ 2015-000003 │ 37.25 │
│ 4 │ 2015-000004 │ 16.5 │
julia> minimum(orders.sum)
9.75

julia> maximum(orders.sum)
444.2

julia> mean(orders.sum)
38.30726229508197

julia> median(orders.sum)
32.5
julia> describe(orders, :min, :max, :mean, :median)
2×5 DataFrame
│ Row │ variable │ min │ max │ mean │ median │
│ │ Symbol │ Any │ Any │ Union… │ Union… │
├─────┼──────────┼─────────────┼─────────────┼─────────┼────────┤
│ 1 │ id │ 2015-000001 │ 2015-021350 │ │ │
│ 2 │ sum │ 9.75 │ 444.2 │ 38.3073 │ 32.5 │
julia> histogram(
orders.sum,
xaxis="order sum",
yaxis="frequency",
xticks=0:50:400,
legend=nothing)
Image for post

Final Words

We have just scratched the surface of what is possible with Julia’s data frames and plotting libraries, but hopefully this gave you a flavor.

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store