---
title: "Data Wrangling with dplyr"
---
# Data Wrangling with dplyr {#dplyr}
```{r}
library(tidyverse)
library(knitr)
library(DT)
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
Data wrangling — filtering, selecting, sorting, mutating, grouping, summarizing — is where most analysts spend most of their time. The `dplyr` package replaces the menu-driven workflow of Excel pivot tables with short, readable commands that compose: "rows where revenue tops $10K," "add a profit-margin column," "average sales by region." Each operation is a single function call, and because every function takes a data frame in and returns a data frame back, you can chain them into pipelines.
The result is code that reads close to how you would describe the analysis in plain language, and that reviewers can follow without reverse-engineering a series of UI clicks.
## The Six Verbs You Need to Know {#core-verbs}
Almost everything in dplyr boils down to six functions. Once they are second nature, most data wrangling becomes a question of which verbs to chain in which order.
| Verb | What It Does | Excel Equivalent |
|:-------------|:-----------------------------------------|:--------------------------|
| `filter()` | Show me only the rows I care about | Filtering rows |
| `select()` | Show me only the columns I care about | Hiding columns |
| `mutate()` | Add a new column (or change one) | Adding a formula column |
| `summarize()` | Give me summary stats | SUM, AVERAGE, COUNT |
| `arrange()` | Sort my data | Sort A-Z / Z-A |
| `group_by()` | Do it separately for each group | Pivot tables! |
They all work the same way:
1. First argument: your data.
2. Next arguments: what you want to do, using column names directly (no quotes needed).
3. Result: a shiny new data frame.
Let's dig in.
## filter() — Show Me Only the Rows I Care About {#filter}
`filter()` keeps only rows that match a condition. It is the analyst's equivalent of opening a database and writing a `WHERE` clause: "rows where the customer is in California," "rows where the order amount is over a thousand dollars," "rows where the date falls in Q3."
::: {.callout-warning}
## AI Pitfall: long pipelines hide their own intermediate failures
Ask an AI assistant to "find the average order value for high-value customers in each region this quarter" and you get back a clean five-step pipeline:
```r
df %>%
filter(date >= "2026-07-01", date <= "2026-09-30") %>%
filter(customer_segment == "High-value") %>%
group_by(region) %>%
summarize(avg_order = mean(order_amount, na.rm = TRUE)) %>%
arrange(desc(avg_order))
```
The pipeline reads beautifully. It also makes it impossible to see what happened at each step. If the date filter accidentally dropped 80% of the rows because of a parsing issue, you would not know — the only output is the final summary table. If `customer_segment` has values like "high-value" and "High-Value" mixed in (case-sensitive, common in inherited data), half your high-value customers got silently excluded.
The discipline: when a pipeline matters, run it one verb at a time first. Check `nrow()` after each step. Only chain it once you trust each piece. Or use `dplyr::filter_at`-style debugging where you assign intermediate results and inspect them. AI-generated pipelines are convenient; the convenience is exactly what makes verification harder.
:::
### Basic Comparisons
```{r}
# Cars that get more than 30 miles per gallon
filter(mtcars, mpg > 30)
```
Only 4 cars out of the original 32 clear the 30 mpg bar --- the Toyota Corolla, Fiat 128, Lotus Europa, and Honda Civic. Filtering transformed a 32-row dataset into a 4-row answer.
```{r}
# Only setosa flowers
filter(iris, Species == "setosa")
```
```{r}
# Cars with exactly 6 cylinders
filter(mtcars, cyl == 6)
```
The comparison operators are what you'd expect: `==` (equals), `!=` (not equal), `>`, `<`, `>=`, `<=`.
**Pro tip that will save you 20 minutes of debugging:** Use `==` for comparison, not `=`. A single `=` is for assigning values. Everyone makes this mistake exactly once. Some of us make it twice.
### Combining Conditions
Need multiple conditions? Use `&` (and) or `|` (or).
```{r}
# Setosa flowers with long sepals
filter(iris, Species == "setosa" & Sepal.Length > 5)
```
Shortcut: commas inside `filter()` are treated as "and" automatically.
```{r}
# Same thing, cleaner syntax
filter(iris, Species == "setosa", Sepal.Length > 5)
```
```{r}
# Cars with 4 cylinders OR great gas mileage
filter(mtcars, cyl == 4 | mpg > 30)
```
### The %in% Operator
When you want to match against a list of values, `%in%` is your friend. Think of it as checking whether a job applicant's major is on your "approved majors" list.
```{r}
# Setosa or virginica (but not versicolor)
filter(iris, Species %in% c("setosa", "virginica"))
```
```{r}
# 4-cylinder or 8-cylinder cars
filter(mtcars, cyl %in% c(4, 8))
```
### Filtering Ranges
`between()` is perfect for "give me everything from X to Y" --- like pulling all transactions between Q1 and Q3.
```{r}
# Sepal length between 5.0 and 6.0
filter(iris, between(Sepal.Length, 5.0, 6.0))
```
### Dealing with Missing Values
Missing data (`NA`) is the business world's version of "the dog ate my homework." Someone didn't fill out the survey, a sensor died, a record got corrupted. You can't test for `NA` with `==`; you need `is.na()`.
```{r}
df_with_na <- tibble(
x = c(1, 2, NA, 4, 5),
y = c("a", NA, "c", "d", NA)
)
# Keep only rows where x is not missing
filter(df_with_na, !is.na(x))
```
```{r}
# Keep rows where BOTH x and y are present
filter(df_with_na, !is.na(x), !is.na(y))
```
### Putting It Together
Real-world filtering usually stacks several conditions. Here's the kind of thing you'll actually write at work:
```{r}
# Efficient cars: 4 cylinders with above-average mpg
filter(mtcars, cyl == 4, mpg > mean(mpg))
```
Notice something clever: `mean(mpg)` is computed on the full dataset (about 20.1 mpg), and then only 4-cylinder cars above that average are kept. All 4-cylinder cars are fuel-efficient by design, so most of them pass this bar.
## select() -- Show Me Only the Columns I Care About {#select}
Your dataset has 47 columns but you only need 5? `select()` is the Marie Kondo of data wrangling --- it keeps only what sparks joy (or, you know, what's relevant to your analysis).
### Picking Columns
```{r}
# Just these two columns, please
select(iris, Sepal.Length, Species)
```
```{r}
# A range of columns
select(iris, Sepal.Length:Petal.Length)
```
```{r}
# Everything EXCEPT Species
select(iris, -Species)
```
### Smart Selection Helpers
When your data has lots of similarly named columns (like `Q1_Revenue`, `Q2_Revenue`, `Q3_Revenue`...), these helpers save a ton of typing.
```{r}
# All columns that start with "Sepal"
select(iris, starts_with("Sepal"))
```
```{r}
# All columns that end with "Width"
select(iris, ends_with("Width"))
```
```{r}
# All columns that contain "pal"
select(iris, contains("pal"))
```
```{r}
# Only numeric columns
select(iris, where(is.numeric))
```
### Reordering and Renaming
Want to move a column to the front? Use `everything()` to keep the rest.
```{r}
# Put Species first, keep everything else
select(iris, Species, everything())
```
Rename as you select:
```{r}
select(iris, sepal_length = Sepal.Length, species = Species)
```
If you want to rename without dropping columns, use `rename()`:
```{r}
rename(iris, sepal_length = Sepal.Length, sepal_width = Sepal.Width) |> head(3)
```
## mutate() -- Add a New Column {#mutate}
Think of `mutate()` as adding a new column to your sales report. You've got Revenue and Costs? Great, now add a Profit column. That's `mutate()` in a nutshell.
### Creating New Columns
```{r}
# Approximate sepal area
iris_modified <- mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width)
select(iris_modified, Sepal.Length, Sepal.Width, Sepal.Area) |> head()
```
```{r}
# You can create multiple columns at once
# Later columns can reference earlier ones (like chained formulas)
mtcars_modified <- mutate(mtcars,
weight_kg = wt * 453.592,
hp_per_kg = hp / weight_kg
)
select(mtcars_modified, wt, weight_kg, hp, hp_per_kg) |> head()
```
### Overwriting Existing Columns
If you assign to a name that already exists, it overwrites. No drama, no pop-up asking "are you sure?"
```{r}
# Convert wt from thousands of pounds to actual pounds
mutate(mtcars, wt = wt * 1000) |> select(wt) |> head()
```
### Useful Functions Inside mutate()
**Arithmetic** --- the usual suspects: `+`, `-`, `*`, `/`.
```{r}
mutate(iris,
Sepal.Sum = Sepal.Length + Sepal.Width,
Petal.Ratio = Petal.Length / Petal.Width
) |> head()
```
**Lag and lead** --- great for "compared to last month" calculations. Your CFO will love these.
```{r}
example_df <- tibble(
day = 1:7,
sales = c(100, 120, 90, 150, 130, 170, 160)
)
mutate(example_df,
previous_day_sales = lag(sales),
daily_change = sales - lag(sales)
)
```
**Cumulative functions** --- running totals, running averages. Perfect for tracking year-to-date performance.
```{r}
mutate(example_df,
cumulative_sales = cumsum(sales),
running_average = cummean(sales)
)
```
### if_else() and case_when() -- Conditional Logic
`if_else()` is your basic "if this, then that."
```{r}
mutate(mtcars,
efficiency = if_else(mpg > 20, "efficient", "inefficient")
) |> select(mpg, efficiency) |> head(10)
```
`case_when()` handles multiple conditions --- like a series of "if... else if... else if..." checks. Way more readable than nesting a dozen IF statements in Excel. (We've all seen those. We've all cried.)
```{r}
mutate(mtcars,
mpg_category = case_when(
mpg >= 30 ~ "excellent",
mpg >= 20 ~ "good",
mpg >= 15 ~ "moderate",
.default = "poor"
)
) |> select(mpg, mpg_category) |> head(10)
```
This is incredibly useful for bucketing data --- customer segments, performance tiers, risk categories, you name it.
```{r}
mutate(iris,
sepal_size = case_when(
Sepal.Length > 6.5 ~ "large",
Sepal.Length > 5.5 ~ "medium",
.default = "small"
)
) |> count(sepal_size, Species)
```
But seriously --- take a beat here. The categories you define in `case_when()` are not just code. Labels like "high value," "poor performance," or "high risk" drive real decisions about real people: who gets the promotion, who gets flagged for review, who qualifies for a loan. The code itself is neutral; the choices baked into it are not. Pressure-test your categories the way you would want someone to pressure-test a category that *you* got sorted into.
## summarize() -- Give Me the Summary Stats {#summarize}
`summarize()` (or `summarise()` if you're feeling British) collapses your entire dataset into summary numbers. Alone, it gives you one row. Pair it with `group_by()` and it becomes a full-blown pivot table machine.
### The Basics
```{r}
summarize(mtcars,
avg_mpg = mean(mpg),
avg_hp = mean(hp),
min_wt = min(wt),
max_wt = max(wt)
)
```
One row, four numbers. The average car gets about 20 mpg with 147 horsepower, and weights range from 1.5 to 5.4 (in thousands of pounds). That is the entire 32-row dataset collapsed into an executive summary.
### Common Summary Functions
Your cheat sheet:
| Function | What It Does |
|:---------------|:---------------------------------------|
| `mean()` | Average |
| `median()` | Median |
| `sd()` | Standard deviation |
| `min()`, `max()` | Smallest / largest value |
| `sum()` | Total |
| `n()` | Count of rows |
| `n_distinct()` | Count of unique values |
```{r}
summarize(iris,
n_rows = n(),
n_species = n_distinct(Species),
iqr_sepal_length = IQR(Sepal.Length)
)
```
### Summarizing Multiple Columns with across()
Instead of writing out a summary for every single column (boring), `across()` lets you apply the same function to multiple columns at once.
```{r}
# Mean of all numeric columns
iris |>
summarize(across(where(is.numeric), mean))
```
One line, four averages. Petal measurements (mean length 3.76, mean width 1.20) are more variable across species than sepal measurements, which hints at petal dimensions being more useful for telling species apart.
```{r}
# Multiple stats at once
iris |>
summarize(across(
where(is.numeric),
list(mean = mean, sd = sd),
.names = "{.col}_{.fn}"
))
```
Now each numeric column gets both its mean and standard deviation. The `.names` argument controls the output column names --- `{.col}` is the original column name and `{.fn}` is the function name. Petal.Length has the highest standard deviation (1.77), confirming it varies the most across the dataset.
## arrange() -- Sort My Data {#arrange}
Sorting. Simple, essential, and surprisingly satisfying.
```{r}
# Sort by mpg (lowest first)
arrange(mtcars, mpg) |> head()
```
### Descending Order
Wrap the column in `desc()` to flip the sort --- like clicking that Z-to-A button.
```{r}
# Best gas mileage first
arrange(mtcars, desc(mpg)) |> head()
```
### Sorting by Multiple Columns
Ties in the first column get broken by the second. Just like sorting a spreadsheet by Region, then by Revenue within each region.
```{r}
# Sort by cylinders, then by mpg within each group
arrange(mtcars, cyl, desc(mpg)) |> head(12)
```
Note: `NA` values always sort to the end. Always.
## group_by() -- The Pivot Table of R {#groupby}
This is where dplyr goes from "neat tool" to "how did I ever live without this?" `group_by()` tells R: "For the next operation, do it separately for each group." It's the R equivalent of dragging a field into the Rows area of an Excel pivot table.
### Grouped Summaries
The bread and butter combo: `group_by()` + `summarize()`.
```{r}
iris |>
group_by(Species) |>
summarize(
mean_sepal_length = mean(Sepal.Length),
mean_petal_length = mean(Petal.Length),
count = n()
)
```
Three species, three rows. Setosa has the shortest petals (mean 1.46 cm) while virginica has the longest (mean 5.55 cm) --- a nearly fourfold difference. This makes Petal.Length a strong candidate for distinguishing species.
```{r}
mtcars |>
group_by(cyl) |>
summarize(
avg_mpg = mean(mpg),
avg_hp = mean(hp),
n_cars = n()
)
```
Boom --- pivot table. The pattern is crystal clear: 4-cylinder cars average 26.7 mpg with 82 hp, while 8-cylinder cars average only 15.1 mpg but pack 209 hp. More cylinders = more power but worse fuel economy. Your boss would frame this.
### Multiple Grouping Variables
Group by more than one variable and R computes summaries for every combination --- like a two-dimensional pivot table.
```{r}
mtcars |>
group_by(cyl, am) |>
summarize(
avg_mpg = mean(mpg),
n_cars = n(),
.groups = "drop"
)
```
Now we see six rows --- every combination of cylinder count and transmission type (am = 0 is automatic, am = 1 is manual). Manual 4-cylinder cars get the best mileage (28.1 mpg), while automatic 8-cylinder cars get the worst (15.1 mpg). The `.groups = "drop"` removes the grouping when you're done. Good habit --- leftover grouping causes sneaky bugs.
### Grouped Mutate
When you use `mutate()` on grouped data, calculations happen within each group. Perfect for "compared to the department average" analysis.
```{r}
# How does each flower compare to its species average?
iris |>
group_by(Species) |>
mutate(
sepal_deviation = Sepal.Length - mean(Sepal.Length)
) |>
select(Species, Sepal.Length, sepal_deviation) |>
head(10)
```
The `sepal_deviation` column shows how far each flower is from its species mean. Positive values are above average for that species; negative values are below. This is a common technique for benchmarking --- "how does this salesperson compare to their regional average?" --- without mixing up groups that have different baselines.
### Grouped Filter
Filter within groups: "Show me the top performers in each department."
```{r}
# Cars with above-average mpg for their cylinder class
mtcars |>
group_by(cyl) |>
filter(mpg > mean(mpg)) |>
select(cyl, mpg) |>
arrange(cyl, desc(mpg))
```
This keeps only the cars that beat the average for their cylinder class. The key insight is that "above average" is computed *within each group*, not across the whole dataset. A 6-cylinder car with 21 mpg is above average for 6-cylinder cars, even though it is below average for 4-cylinder cars. This is exactly the logic you need for questions like "which stores outperform their regional average?"
### Don't Forget to ungroup()
Grouping is sticky --- it hangs around until you explicitly remove it. If your results look weird, check for leftover grouping. It's like leaving a filter on in Excel and then wondering why your row count seems off.
```{r}
iris_grouped <- iris |> group_by(Species)
group_vars(iris_grouped)
iris_ungrouped <- iris_grouped |> ungroup()
group_vars(iris_ungrouped)
```
## Combining Operations -- Where the Magic Happens {#combining}
Here's the payoff. Everything we've learned chains together with the pipe operator (`|>`). Each step feeds into the next, creating a readable analysis pipeline.
### Example: Top Fuel-Efficient Cars by Class
Read this pipeline like a sentence: "Take the car data. Add car names. Keep only 4- and 6-cylinder cars. Select the columns I need. Within each cylinder group, rank by mpg. Show me the top 3 in each group."
```{r}
mtcars |>
mutate(car_name = rownames(mtcars)) |>
filter(cyl %in% c(4, 6)) |>
select(car_name, mpg, cyl, hp, wt) |>
group_by(cyl) |>
mutate(mpg_rank = min_rank(desc(mpg))) |>
arrange(cyl, mpg_rank) |>
filter(mpg_rank <= 3) |>
ungroup()
```
Eight clean lines that answer a real business question. Try doing that in Excel without a labyrinth of helper columns and VLOOKUP nightmares.
### Example: Performance Report
"Label the transmissions and engine types. Group by both. Compute averages. Sort by fuel efficiency."
```{r}
mtcars |>
mutate(
transmission = if_else(am == 1, "manual", "automatic"),
engine = if_else(vs == 1, "straight", "V-shaped")
) |>
group_by(transmission, engine) |>
summarize(
n = n(),
avg_mpg = round(mean(mpg), 1),
avg_hp = round(mean(hp), 1),
avg_qsec = round(mean(qsec), 1),
.groups = "drop"
) |>
arrange(desc(avg_mpg))
```
This is exactly the kind of summary table your manager wants on their desk by Friday. Manual transmissions with straight engines get the best mileage (28.4 mpg), while automatic V-shaped engines are the thirstiest (15.1 mpg). Clean, grouped, sorted, readable.
### Example: Species Comparison with Full Stats
```{r}
iris |>
group_by(Species) |>
summarize(
across(everything(), list(mean = mean, sd = sd)),
.groups = "drop"
) |>
mutate(across(where(is.numeric), \(x) round(x, 2)))
```
This produces a comprehensive statistical profile for each species --- mean and standard deviation for every numeric variable. Scanning across, virginica consistently has the largest petal measurements, while setosa has the smallest. This kind of "all stats for all groups" table is a staple of exploratory data analysis.
## Joining Data -- Connecting Your Tables {#joins}
In the real world, data doesn't live in one neat spreadsheet. Your company has a customer database, an orders database, a product catalog, and a returns log --- all in different tables. Joins are how you connect them. Think of it like matching up customer names across two different spreadsheets so you can see the full picture.
### Setting Up the Example
Let's use a scenario every business student can relate to: a CRM with customer info and a separate order tracking system.
```{r}
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Carol", "David", "Eve"),
region = c("East", "West", "East", "South", "West")
)
customers
orders <- tibble(
customer_id = c(1, 2, 3, 6, 7),
revenue = c(500, 320, 890, 150, 410),
quarter = c("Q1", "Q1", "Q2", "Q1", "Q2")
)
orders
```
Notice: David and Eve are in the CRM but haven't placed orders. Customers 6 and 7 placed orders but aren't in the CRM yet (maybe they signed up through a new channel).
### left_join() -- The One You'll Use 90% of the Time
`left_join()` keeps every row from your main table (left) and pulls in matching data from the second table (right). No match? You get `NA`. It's like doing a VLOOKUP from your customer list into the orders sheet.
```{r}
left_join(customers, orders, by = "customer_id") %>%
DT::datatable(options = list(pageLength = 10, dom = 't'),
caption = "Left join: all customers kept, with order data where available.")
```
All five customers are here. David and Eve show `NA` for revenue and quarter because they had no orders. Customers 6 and 7 from the orders table? Excluded --- they weren't in the customer list. This is your go-to join.
### inner_join() -- Only the Matches
`inner_join()` keeps only rows that appear in BOTH tables. It's like saying "show me only customers who have actually bought something."
```{r}
inner_join(customers, orders, by = "customer_id")
```
Only Alice, Bob, and Carol survive. They're the only ones in both tables.
### Other Joins (Quick Reference)
**right_join()** --- like `left_join()` but keeps everything from the right table instead. (Rarely used --- most people just swap the table order and use `left_join()`.)
```{r}
right_join(customers, orders, by = "customer_id")
```
**full_join()** --- keeps everything from both tables. The "I don't want to lose ANY data" option.
```{r}
full_join(customers, orders, by = "customer_id") |>
DT::datatable(options = list(pageLength = 10, dom = 't'),
caption = "Full join: every row from both tables, with NA where no match exists.")
```
**anti_join()** --- "Show me customers who DON'T have orders." Perfect for finding gaps in your data, like identifying leads who haven't converted.
```{r}
anti_join(customers, orders, by = "customer_id")
```
### When Column Names Don't Match
Sometimes the key column has different names in each table (because of course it does --- the sales team and the marketing team never agree on naming conventions). Use `join_by()`.
```{r}
orders_renamed <- orders |> rename(id = customer_id)
left_join(customers, orders_renamed, by = join_by(customer_id == id))
```
### Joining on Multiple Columns
You can join on more than one column --- useful when a single column isn't enough to uniquely identify a match (like needing both customer AND product to find the right price).
```{r}
order_details <- tibble(
customer = c("A", "A", "B", "B"),
product = c("X", "Y", "X", "Z"),
quantity = c(10, 5, 8, 12)
)
prices <- tibble(
customer = c("A", "A", "B", "C"),
product = c("X", "Y", "X", "X"),
price = c(2.50, 3.00, 2.75, 2.50)
)
inner_join(order_details, prices, by = c("customer", "product"))
```
## Other Handy Functions {#other-functions}
A few more tools that come up constantly in practice.
### count() -- Quick Frequency Tables
`count()` is shorthand for "group by this, then count the rows." You'll use it more than you think.
```{r}
count(iris, Species)
```
```{r}
# Sort by frequency
count(mtcars, cyl, sort = TRUE)
```
```{r}
# Count by multiple variables
count(mtcars, cyl, am)
```
### The slice() Family -- Rows by Position
```{r}
# First 5 rows
slice_head(iris, n = 5)
```
```{r}
# Top 3 by horsepower
slice_max(mtcars, hp, n = 3)
```
```{r}
# Bottom 3 by mpg
slice_min(mtcars, mpg, n = 3)
```
```{r}
# Random sample
set.seed(42)
slice_sample(iris, n = 5)
```
### distinct() -- Remove Duplicates
```{r}
# Unique combinations of cylinders and transmission
distinct(mtcars, cyl, am)
```
### pull() -- Extract a Column as a Vector
Sometimes you need a plain vector, not a data frame.
```{r}
mpg_values <- pull(mtcars, mpg)
mean(mpg_values)
```
### relocate() -- Move Columns Around
```{r}
# Move Species to the front
relocate(iris, Species) |> head(3)
```
### bind_rows() -- Stack Data Frames
When you have data split across separate tables with the same columns, stack them vertically --- like combining regional sales spreadsheets into one master file.
```{r}
df1 <- tibble(x = 1:3, y = c("a", "b", "c"))
df2 <- tibble(x = 4:6, y = c("d", "e", "f"))
bind_rows(df1, df2)
```
```{r}
# Track which table each row came from
bind_rows(first = df1, second = df2, .id = "source")
```
## Summary
Here's the cheat sheet you'll tape to your monitor:
- **`filter()`** --- pick the rows you want (sorting through job applicants)
- **`select()`** --- pick the columns you want (decluttering your spreadsheet)
- **`mutate()`** --- add or change columns (adding a formula to your sales report)
- **`summarize()`** --- compute summary statistics (the bottom of every pivot table)
- **`arrange()`** --- sort your data (A-to-Z, big-to-small)
- **`group_by()`** --- do any of the above by group (the actual pivot table)
Chain them together with the pipe (`|>`), add `left_join()` to connect your tables, `count()` for quick frequency tables, and `case_when()` for conditional logic, and you've got a toolkit that handles just about anything a business analyst needs.
The next chapter on tidyr (Chapter \@ref(tidyr)) will show you how to reshape data between wide and long formats --- the final piece of the data wrangling puzzle.