Ditch Excel and Use Julia Data Frames

Manipulating and visualizing pizza sales data using Julia DataFrames.jl and Plots.jl

Image for post
Image for post
Photo by Brett Jordan from Pexels

In this story we will look at pizza sales data found here:

This kind of data can be manipulated in a spreadsheet application such as Excel and using data frames popular in languages such as R, Python (Pandas) and Julia (DataFrames.jl).

Loading Data

First we will load the data in Julia and pick a subset (id, name, size and price) of columns in the table to work with:

We can look at the first view rows to see what this looks like in the Julia REPL (Read Evaluate Program Loop):

However we are currently looking at the first 4 rows. But as you can see there are almost 50 thousand rows in this dataset so not very practical to paste into a spreadsheet. Also for educational reasons, will pick a smaller subset.

Sampling Data

We are going to pick a random sample of 16 rows from the 49 574 rows we have loaded in. To do that we will randomly shuffle the row indices from 1 to 49 574.

We can then pick the first 16 rows of these shuffled rows to get 16 random rows from our original data:

You read the syntax used in the square brackets as [rows, columns], where rows is a collection of rows you want. This could be a range, vector or single scalar value. The same applies to columns.

Transfer Pizza Sales to Spreadsheet

However we cannot easily copy and paste data on this “pretty” format into a spreadsheet application such as Excel. We want it on CSV format. Fortunately Julia display system allows us to display the same data in many different formats.

In the Julia REPL, what actually happens automatically is that the display function gets called like this:

However this function can also take a MIME type as an argument, which is used when e.g. Julia is used in a Notebook that supports richer graphical capabilities. In a notebook we supply the text/html MIME type, but in this case we want our data in CSV format:

You could just copy and paste this into a spreadsheet application. But there are many ways of doing this. We could write to file using the CSV package:

But you could achieve this even by leveraging MIME types with show:

Or what about being a bit crazy, and using an IOBuffer with a clipboard:

After this the pizza sales data will be on the clipboard and you can paste it into your favorite spreadsheet application.

Instead of take! we could have used seekstart to go to the beginning of the IOBuffer and read from it normally. However the benefit of using take! is that you empty the buffer so it no longer consumes any memory to hold the CSV data.

Eventually you get a table as shown below in your preferred spreadsheet application, whether that is Excel, Numbers or Google Sheets. What is useful in a spreadsheet is that you can e.g. click a column and get all sorts of statistics. Notice how we get the sum, average, min and max value.

Getting the same in Julia is easy. We just need to import the Statistics package.

Plotting Histograms

When doing Exploratory Data Analysis (EDA) one of the first useful things you can do to get a sense of the data, is to plot interesting data in histograms.

We can compare the histogram for the sample data with a similar histogram for the whole dataset:

We can combine plots and get them side by side. But when comparing plots it is good to have the data in similar ranges. So we will use the xaxis and yaxis attributes to specify the ranges.

If you want to include your plots into reports and documents you can save them to disk using the savefig function. It will pickup desired format to use based on your filename extension. .png, .pdf, .svg and several other formats are supported:

Interesting Questions to Ask About Our Data

After looking at the data we may develop some questions about the data which we want to have answered, such as:

  • What are the most sold pizzas?
  • What size of pizza, small, medium, large etc is sold the most?
  • What pizza brings in the most revenue?
  • How much is typically spent on each pizza order?

A lot of this kind of data cannot be pulled directly out of the data. We will have to group related data. E.g. to find out how much is spent on each order, we would have to group every row belonging to the same order and add up the sales price of every pizza in that order.

To find out which pizza brings in the most revenue, we would have to group rows by pizza name and sum up price for each row.

Thus to be able to answer such questions we need to learn more about how to filter, group and combine data in a DataFrame.

Slicing and Dicing Data Frames

Working with Data Frames we got several key functions that help us manipulate the data:

  • select Picks a subset of columns, and possibly rename or transform the values in column.
  • transform Similar to select except we are not removing any columns. We are just renaming and transforming selected columns.
  • groupby Turns a table into multiple tables. The splitting is done by making a table for each unique value of a particular selected column.
  • combine Takes multiple tables and turn them into one table again. Allows you collapse all rows in each table into a single row.
  • join Let you match up a column in two different tables to join them into one table.

For anyone who has worked with SQL, many of these operations will be familiar. These functions pick, rename and transform rows using similar syntax.

We can pick a specific column and do nothing with it:

For transform this operation is rather pointless, as you get all the same rows as you started with. With a rename it is even more pointless as you simply end up adding a column with the same values as the column you are renaming.

The benefit becomes more clear when you are performing value transformations using functions.

However these transformations are pointless as we end up with the same average for every row. The reason is that the functions supplied always take as argument the whole column, and needs to spit out a value/values for the whole column. mean only gives one output value.

If we want to apply a function to every single value we can easily produce such a function with ByRow. Here is a simple example demonstrating how it works:

Thus ByRow gives us a simple way of creating functions which can operate on each row:

We can combine value transformation and column renaming:

But does this mean that using functions without ByRow is pointless? No, in fact they are very useful when used with combine, because it works similar to select except when every column selected is transformed by a function that takes the whole column and returns a single value, such as mean you will end up with a single row.

But if you don’t then combine will basically have the same behavior as select:

Splitting and Recombining Tables

The real power of combine however comes from splitting a data frame up into groups with groupby and then recombining these again to a table.

The trick is to use use these with functions that take multiple elements in and produce a single output such as:

  • length Count number of elements. E.g. how many medium medium pizzas are there.
  • sum Add up values in column.
  • mean Find the arithmetic mean (average).

Let us use this to figure out how many pizzas there are of each size:

sizes is not a collection of groups represented by the GroupedDataFrame type. We can recombine these groups:

We could rename column for a nicer table:

We can use this for the whole dataset and plot a histogram:

One problem with this plot that you can see is that the sizes are not sorted in order S, M, L, XL, XXL which makes it confusing. How can we solve that?

Joining Tables

There are a lot of ways of defining an order for pizza sizes, but here I will solves this by joining tables, because it is something you should know how to do anyway. A simple solution is to give sizes a numerical value.

The standard diameters of pizzas are:

  • Small — 26 cm
  • Medium — 31 cm
  • Large — 36 cm
  • X-Large — 41 cm
  • Jumbo — 46 cm

We will use Jumbo size for XXL. Let us Make a table that maps pizza sizes to diameters:

To join we need to specify which column will be the primary key, or the value we will join on. That will be size since that is the only column that exists in both tables:

But let us get back to our original task. We want to insert diameters into table over how many of the pizzas sold are of a particular type:

And let us sort by diameter:

Now we got something better suited for plotting in a bar plot.

Plotting is of course an art. To get a sense of what dominates ones sales it may make more sense to use a pie chart:

What pizza size is most popular?

Looking at our charts we can see that most pizzas sold is the large one with medium on a solid second. But how does it look in terms of pizza slices eaten?

Basically what we want to know is what kind of pizza is consumed most. Since we know the diameter, we can calculate the area of each pizza.

To make it a bit nicer to display I will just round off the calculated areas:

Now we can calculate the total area for each pizza type:

And let us make it more interesting by plotting how many pizzas have been bought and compare it to the area of pizza consumed of each type:

What pizza brings in the most revenue?

We begin by creating a subgroup for each pizza.

We can then add up the price of each pizza sold with the same name.

But we want this sorted so we can find the top 10 most sold pizzas.

We can can visualize this with a bar chart:

How much is typically spent on each pizza order?

To get how much is spent on each pizza order we create a subgroup of the data frame for each unique order id:

We can then recombine these groups by summing up the price column, which allows us to collapse each sub-group into a single row in a new table. In this table, there is just one line for each order:

Let us do some simple exploration of this data first. It is always a good thing to look at minimums, maximums, mean and median value.

There is a shorthand for getting all of this data in one swoop:

We can confirm this initial impression by plotting the histogram showing us the value distribution. We can see the bulk of the orders are below 50 dollars.

Now you may wonder, how do I know all the settings which can be used to get the plots right? Part of this is really gradual experimentation. Tweaking setting repeatedly until you get a desired plot.

Final Words

We have just scratched the surface of what is possible with Julia’s data frames and plotting libraries, but hopefully this gave you a flavor.

You can look at a slightly different treatment of the same Pizza data using a different plotting library Gadfly, in the previous story on plotting that I wrote. Other resources:

Written by

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store