---
title: "Tidy Data with tidyr"
---
# Tidy Data with tidyr {#tidyr}
```{r}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
"Tidy" is a specific data shape that downstream tools — `dplyr`, `ggplot2`, `lm()` — all expect by default. Data in tidy form makes analyses straightforward; data in non-tidy form has to be reshaped before most analyses will work cleanly. The `tidyr` package is the reshaping toolkit: turning wide data long, long data wide, splitting columns that contain multiple values, and joining columns that have been split too far.
::: {.callout-warning}
## AI Pitfall: `pivot_wider()` silently overwrites duplicates
Reshaping data from long to wide is one of the most frequent uses of `tidyr`. The function looks straightforward:
```r
df %>% pivot_wider(names_from = month, values_from = sales)
```
But suppose `df` has two rows for the same customer in March — maybe a refund and a re-purchase, or a data-entry duplicate. `pivot_wider()` does not raise an error. By default it silently keeps one of the values (often the last one) in the wide output, and you never see the row that was overwritten. AI does not flag this. Your wide table looks plausible.
The defensive habit: before pivoting, check for duplicate row-column combinations.
```r
df %>%
count(customer_id, month) %>%
filter(n > 1)
```
If anything comes back, decide consciously how to handle the duplicates (sum them, average them, take the latest by date, etc.) using `values_fn = sum` or similar in `pivot_wider()`. Never pivot wide on data you have not first audited for unique row-column combinations.
:::
## What is Tidy Data?
Three rules:
1. **Each variable gets its own column.** Revenue, Region, Quarter --- each one lives alone.
2. **Each observation gets its own row.** One row per thing you measured.
3. **Each cell contains one value.** No smuggling "Q1/2024" into a single cell like contraband.
Here is what tidy data looks like:
```{r}
tidy_example <- tribble(
~country, ~year, ~cases, ~population,
"Afghanistan", 2000, 745, 20000000,
"Afghanistan", 2001, 2666, 20595360,
"Brazil", 2000, 37737, 172006362,
"Brazil", 2001, 80488, 174504898,
"China", 2000, 212258, 1280428583,
"China", 2001, 213766, 1272915272
)
tidy_example
```
Every variable is a column. Every row is one observation. Every cell has one value. It is the data equivalent of a well-organized filing cabinet --- boring, beautiful, and incredibly functional.
## Common Ways Data Gets Messy
Real-world data breaks these rules constantly. Here are the repeat offenders you will encounter in the wild.
### Column Headers That Are Actually Values
This is the number-one offender. Your boss hands you a spreadsheet where the column headers are months or years. Those columns are not variable *names* --- they are *values* of a variable pretending to be column headers.
```{r}
messy_wide <- tribble(
~country, ~`2000`, ~`2001`, ~`2002`,
"Afghanistan", 745, 2666, 2900,
"Brazil", 37737, 80488, 90125,
"China", 212258, 213766, 220000
)
messy_wide
```
The years `2000`, `2001`, and `2002` are values of a `year` variable, and the actual measurement (cases) is smeared across three columns. We need to **pivot** this into a longer format --- think of it as turning a wide Excel report sideways so the database can read it.
### Two Variables Jammed into One Column
Sometimes someone thought cramming two values into one cell with a slash was a time-saver. (It was not.)
```{r}
messy_combined <- tribble(
~country, ~year, ~rate,
"Afghanistan", 2000, "745/20000000",
"Afghanistan", 2001, "2666/20595360",
"Brazil", 2000, "37737/172006362",
"Brazil", 2001, "80488/174504898"
)
messy_combined
```
That `rate` column is smuggling both `cases` and `population`. We need to **separate** them.
### Values That Should Be Column Names Stuck in Rows
Sometimes what should be separate columns is listed as categories in a single column --- like putting "Revenue" and "Expenses" in the same field and hoping for the best.
```{r}
messy_both <- tribble(
~country, ~year, ~type, ~count,
"Afghanistan", 2000, "cases", 745,
"Afghanistan", 2000, "population", 20000000,
"Afghanistan", 2001, "cases", 2666,
"Afghanistan", 2001, "population", 20595360,
"Brazil", 2000, "cases", 37737,
"Brazil", 2000, "population", 172006362
)
messy_both
```
The `type` column contains what should be variable names. We need to **pivot wider** to give `cases` and `population` their own columns.
## pivot_longer() -- Fixing Wide Spreadsheets
This is the function you will use the most. Anytime your boss gives you a spreadsheet with months-as-columns, quarters-as-columns, or product-lines-as-columns, `pivot_longer()` is the fix. Think of it as converting that wide executive dashboard back into proper database rows.
### Basic Usage
```{r}
messy_wide
```
```{r}
messy_wide %>%
pivot_longer(
cols = `2000`:`2002`,
names_to = "year",
values_to = "cases"
)
```
Compare this to the messy version above: we went from 3 rows and 4 columns to 9 rows and 3 columns. Each row is now one country in one year --- a single observation. The wide format was convenient for a human glancing at a spreadsheet, but this long format is what ggplot2, dplyr, and every modeling function in R expects.
Three arguments to tattoo on your forearm:
- **`cols`**: Which columns to pivot (the ones that are really values in disguise).
- **`names_to`**: What to name the new column that holds the old column headers.
- **`values_to`**: What to name the new column that holds the numbers.
Notice `year` came in as text. Fix that automatically:
```{r}
messy_wide %>%
pivot_longer(
cols = `2000`:`2002`,
names_to = "year",
values_to = "cases",
names_transform = list(year = as.integer)
)
```
### Selecting Which Columns to Pivot
You have several ways to tell `pivot_longer()` which columns to grab:
```{r}
student_scores <- tribble(
~student, ~math_midterm, ~math_final, ~english_midterm, ~english_final,
"Alice", 88, 92, 78, 85,
"Bob", 76, 81, 90, 88,
"Carol", 95, 97, 82, 91
)
student_scores
```
**Pivot everything except the ID column** (the most common pattern --- "give me everything but the name"):
```{r}
student_scores %>%
pivot_longer(
cols = -student,
names_to = "exam",
values_to = "score"
)
```
**Use `starts_with()` to grab columns by prefix:**
```{r}
student_scores %>%
pivot_longer(
cols = starts_with("math"),
names_to = "math_exam",
values_to = "math_score"
)
```
### A Business Example: Quarterly Revenue
Here is a scenario you *will* encounter, probably within your first week on the job. Quarterly revenue, wide format, straight out of the CFO's favorite spreadsheet:
```{r}
quarterly_revenue <- tribble(
~company, ~Q1_revenue, ~Q2_revenue, ~Q3_revenue, ~Q4_revenue,
"Acme", 100, 120, 130, 150,
"Beta", 200, 210, 250, 270
)
quarterly_revenue %>%
pivot_longer(
cols = -company,
names_to = "quarter",
values_to = "revenue",
names_pattern = "Q(\\d)_revenue"
)
```
The `names_pattern` uses a regular expression to extract just the quarter number. Now it is ready to plot as a trend line instead of sitting awkwardly in a wide table.
### Dropping NAs While Pivoting
Got sparse data with blank cells everywhere? Use `values_drop_na = TRUE` and leave the ghosts behind:
```{r}
sparse_data <- tribble(
~id, ~week1, ~week2, ~week3,
"A", 10, NA, 30,
"B", NA, 25, 35,
"C", 15, 20, NA
)
sparse_data %>%
pivot_longer(
cols = -id,
names_to = "week",
values_to = "measurement",
values_drop_na = TRUE
)
```
### Pivoting Column Names with Multiple Pieces
Sometimes column names encode more than one thing. Our student scores have both subject *and* exam type baked into the name:
```{r}
student_scores %>%
pivot_longer(
cols = -student,
names_to = c("subject", "exam"),
names_sep = "_",
values_to = "score"
)
```
The `names_sep = "_"` splits each column name at the underscore and distributes the pieces into two new columns. One line of code, two new variables. Not bad.
## pivot_wider() -- Going the Other Way
Sometimes you need to spread data *out* instead of collapsing it down. Building a summary table for the quarterly all-hands? Prepping a cross-tab for a slide deck? That is `pivot_wider()` territory.
### Basic Usage
Remember our messy dataset where variable names were stuck in a column?
```{r}
messy_both
```
```{r}
messy_both %>%
pivot_wider(
names_from = type,
values_from = count
)
```
Now `cases` and `population` are proper columns instead of being stacked as values in the `type` column. Each row is one country in one year with both measurements side-by-side --- which means you can now easily compute a rate (`cases / population`) as a new column.
Two arguments:
- **`names_from`**: The column whose values become new column headers.
- **`values_from`**: The column whose values fill those new columns.
### Filling in Missing Combinations
When some combinations do not exist, you get NAs. Use `values_fill` to swap them for something friendlier:
```{r}
weekly_sales <- tribble(
~store, ~product, ~sales,
"North", "Apples", 100,
"North", "Bananas", 150,
"South", "Apples", 200,
"South", "Cherries", 80,
"East", "Bananas", 120
)
weekly_sales %>%
pivot_wider(
names_from = product,
values_from = sales,
values_fill = 0
)
```
Without `values_fill = 0`, the East store shows `NA` for Apples and Cherries. With it, you get zeros --- much better for a sales report that lands on a VP's desk.
### When to Use pivot_wider()
Reach for this when you need to:
- **Build summary tables** for a slide deck or report
- **Create cross-tabulations** (survey responses by question, products by region)
- **Match a required input format** for a specific function or tool
```{r}
survey <- tribble(
~respondent, ~question, ~answer,
"R1", "satisfaction", "High",
"R1", "recommend", "Yes",
"R2", "satisfaction", "Medium",
"R2", "recommend", "No",
"R3", "satisfaction", "High",
"R3", "recommend", "Yes"
)
survey %>%
pivot_wider(
names_from = question,
values_from = answer
)
```
Now each respondent is one tidy row with their answers spread across columns. Your manager can actually read this.
## separate() and unite() -- Splitting and Combining
### Splitting a Column Apart
When someone mashed two things into one column, `separate()` performs the surgery:
```{r}
locations <- tribble(
~name, ~city_state,
"Alice", "Portland_Oregon",
"Bob", "Austin_Texas",
"Carol", "Denver_Colorado",
"David", "Miami_Florida"
)
locations %>%
separate(city_state, into = c("city", "state"), sep = "_")
```
Works with dates too:
```{r}
dates_data <- tribble(
~event, ~date,
"Meeting", "2024-01-15",
"Workshop", "2024-03-22",
"Summit", "2024-07-10"
)
dates_data %>%
separate(date, into = c("year", "month", "day"), sep = "-", convert = TRUE)
```
The `convert = TRUE` automatically turns the text into numbers. The newer `separate_wider_delim()` does the same job with more explicit syntax:
```{r}
locations %>%
separate_wider_delim(city_state, delim = "_", names = c("city", "state"))
```
### Combining Columns with unite()
`unite()` does the reverse --- it glues columns together like a mail merge:
```{r}
date_parts <- tribble(
~year, ~month, ~day,
2024, 1, 15,
2024, 3, 22,
2024, 7, 10
)
date_parts %>%
unite(col = "date", year, month, day, sep = "-")
```
Handy for building addresses, composite keys, or file names from component parts.
## Handling Missing Values
Missing data is just a fact of corporate life. Customer skipped a survey question? A store did not report that week? Here are your tools.
### drop_na() -- Remove Rows with Missing Values
```{r}
customer_data <- tribble(
~customer, ~revenue, ~satisfaction, ~nps_score,
"C1", 1200, 8, 9.0,
"C2", NA, 7, 8.5,
"C3", 1500, NA, 7.2,
"C4", 980, 9, NA,
"C5", 1100, 8, 8.8
)
# Drop rows missing ANY value
customer_data %>%
drop_na()
```
```{r}
# Drop only rows missing revenue
customer_data %>%
drop_na(revenue)
```
### fill() -- Carry Values Down (or Up)
A lifesaver for spreadsheets where someone only typed the year once at the top of each group and left the rest blank (we have all seen this):
```{r}
quarterly_reports <- tribble(
~year, ~quarter, ~revenue,
2023, 1, 500,
NA, 2, 520,
NA, 3, 530,
NA, 4, 610,
2024, 1, 580,
NA, 2, 600,
NA, 3, 650,
NA, 4, 700
)
quarterly_reports %>%
fill(year)
```
By default it fills downward. Use `.direction = "up"` to fill upward.
### replace_na() -- Swap NAs for Specific Values
```{r}
customer_data %>%
replace_na(list(revenue = 0, satisfaction = 0, nps_score = 0))
```
### complete() -- Make Hidden Missing Data Visible
Sometimes entire rows are missing, not just cells. `complete()` generates all combinations so the gaps become obvious:
```{r}
experiment <- tribble(
~subject, ~treatment, ~outcome,
"S1", "A", 5.2,
"S1", "B", 6.1,
"S2", "A", 4.8,
"S3", "B", 7.0,
"S3", "A", 5.5
)
experiment %>%
complete(subject, treatment, fill = list(outcome = 0))
```
Subject S2 with treatment B was completely missing from the original data. Now you can see it --- and decide what to do about it.
## Nesting and Unnesting
Nesting collapses groups of rows into mini-data-frames stored inside a single column. It is an advanced technique used with the **purrr** package for running functions group-by-group. If that sounds like overkill for your current needs, skip it and circle back when you are ready.
## Putting It All Together
Let's walk through a realistic scenario. You get a spreadsheet of student test scores from a school district --- the kind of file that makes you question whether Excel was a net positive for humanity:
```{r}
messy_scores <- tribble(
~school, ~`math_2022_male`, ~`math_2022_female`, ~`math_2023_male`, ~`math_2023_female`, ~`reading_2022_male`, ~`reading_2022_female`, ~`reading_2023_male`, ~`reading_2023_female`,
"Lincoln", 78, 82, 80, 85, 72, 80, 75, 83,
"Washington", 85, 88, 87, 90, 80, 85, 82, 87,
"Jefferson", 70, 75, 73, 78, 68, 72, 71, 76
)
messy_scores
```
Eight columns of scores with subject, year, and gender all crammed into the column names. One pipeline to tidy it all:
```{r}
tidy_scores <- messy_scores %>%
pivot_longer(
cols = -school,
names_to = c("subject", "year", "gender"),
names_sep = "_",
values_to = "avg_score"
) %>%
mutate(year = as.integer(year)) %>%
arrange(school, subject, year, gender)
tidy_scores
```
Now we can actually *do* things with this data. Average score by subject and year?
```{r}
tidy_scores %>%
group_by(subject, year) %>%
summarize(
mean_score = mean(avg_score),
.groups = "drop"
)
```
Both math and reading scores improved from 2022 to 2023 (math from about 79.7 to 82.2, reading from 76.2 to 79.0). This kind of year-over-year comparison is only possible because we tidied the data first --- in the original messy format with eight columns, getting to this summary would have required painful manual work.
A quick visualization?
```{r}
tidy_scores %>%
group_by(subject, year, gender) %>%
summarize(mean_score = mean(avg_score), .groups = "drop") %>%
ggplot(aes(x = year, y = mean_score, color = gender)) +
geom_line() +
geom_point(size = 2) +
facet_wrap(~subject) +
labs(
title = "Average Test Scores by Subject, Year, and Gender",
x = "Year",
y = "Average Score",
color = "Gender"
) +
theme_minimal()
```
That whole analysis became possible because we invested a few lines in tidying the data first. The messier the original spreadsheet, the bigger the payoff.
## Summary
Your tidyr cheat sheet:
| Function | What It Does |
|:---------|:--------|
| `pivot_longer()` | Wide to long (the one you will use most) |
| `pivot_wider()` | Long to wide (for summary tables and reports) |
| `separate()` / `separate_wider_delim()` | Split one column into multiple columns |
| `unite()` | Glue multiple columns into one |
| `drop_na()` | Remove rows with missing values |
| `fill()` | Fill missing values with the previous (or next) value |
| `replace_na()` | Replace NAs with a value you choose |
| `complete()` | Make hidden missing rows visible |
The bottom line: getting your data into tidy form is the single best investment you can make before any analysis. Think of it like organizing your desk before a big project --- five minutes of setup saves hours of searching. Tidy first, analyze second.