# Ditch Excel and Use Julia Data Frames

## Manipulating and visualizing pizza sales data using Julia DataFrames.jl and Plots.jl

---

In this story we will look at pizza sales data found here:

`https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv`

This kind of data can be manipulated either in a spreadsheet application such as Excel or with data frames, which are popular in languages such as R, Python (pandas) and Julia (DataFrames.jl).

## Loading Data

First we will load the data in Julia and pick a subset of the table's columns (id, name, size and price) to work with:

```julia
using DataFrames, CSV

url = "https://vincentarelbundock.github.io/Rdatasets/csv/gt/pizzaplace.csv"
filename = download(url)
all_pizzas = CSV.read(filename, DataFrame)

# Get rid of column with row numbers
all_pizzas = all_pizzas[:, 2:end]

# Pick most interesting columns
pz = select(all_pizzas, :id, :name, :size, :price)
```
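The same two steps, dropping the leading row-number column and then selecting columns by name, can be tried on a tiny hand-made table without downloading anything. This is only a sketch: the `rownum` and `extra` columns are hypothetical stand-ins for the columns we discard, and the `extra` values are made up for illustration.

```julia
using DataFrames

# Toy table. `rownum` and `extra` are invented stand-ins for the
# columns that get dropped; they are not names from the real CSV.
toy = DataFrame(
    rownum = 1:3,
    id     = ["2015-000001", "2015-000002", "2015-000002"],
    name   = ["hawaiian", "classic_dlx", "mexicana"],
    extra  = ["a", "b", "c"],
    size   = ["M", "M", "M"],
    price  = [13.25, 16.0, 16.0],
)

# Drop the first column by position, keeping all rows
toy = toy[:, 2:end]

# Keep only the columns of interest, selected by name
small = select(toy, :id, :name, :size, :price)

names(small)   # lists the remaining column names
```

Selecting by name rather than by position means the result does not change if the source file later gains or reorders columns.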

We can look at the first few rows of `pz` to see what it looks like in the Julia REPL (Read-Eval-Print Loop):

```
julia> first(pz, 4)
4×4 DataFrame
│ Row │ id          │ name        │ size   │ price   │
│     │ String      │ String      │ String │ Float64 │
├─────┼─────────────┼─────────────┼────────┼─────────┤
│ 1   │ 2015-000001 │ hawaiian    │ M      │ 13.25   │
│ 2   │ 2015-000002 │ classic_dlx │ M      │ 16.0    │
│ 3   │ 2015-000002 │ mexicana    │ M      │ 16.0    │
│ 4   │ 2015-000002 │ thai_ckn    │ L      │ 20.75   │

julia> nrow(pz)
49574
```

Here we are only looking at the first 4 rows. As you can see, there are almost 50 thousand rows in this dataset, so it is not very practical to paste into a spreadsheet. For educational reasons, we will also work with a smaller subset.

## Sampling Data

We are going to pick a random sample of 16 rows from the 49 574 rows we have loaded in. To do that we will randomly shuffle the row indices from 1 to 49 574.

```
julia> using Random

julia> rows = shuffle(1:nrow(pz))
```
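To see what `shuffle` gives us, here is a minimal sketch on a small range, using only the `Random` standard library. The values `n = 10`, `k = 4` and the seed are arbitrary choices for illustration; the seeded generator is an addition that makes the run repeatable and is not something the steps above require.

```julia
using Random

# Fixed seed so the shuffle is repeatable; the seed value is arbitrary
rng = MersenneTwister(42)

n = 10                      # stand-in for nrow(pz)
rows = shuffle(rng, 1:n)    # a random permutation of 1:n

# The first k entries of a permutation are k distinct random indices
k = 4
picked = rows[1:k]

@assert sort(rows) == collect(1:n)   # every index appears exactly once
@assert allunique(picked)            # no index is picked twice
```

Because the shuffled vector is a permutation, taking its first few entries samples rows without replacement.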

We can then pick the first 16 rows of these shuffled rows to get 16 random rows from our original data:

`julia> sample = pz[rows[1:16]…`