---
title: "Doing the Same Thing to Many Things at Once"
---
# Doing the Same Thing to Many Things at Once {#purrr}
```{r}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## The Problem: Repetition
Here's a scenario. You have 12 CSV files --- one for each month of sales data --- and you need to read them all into R and stack them into one data frame. Or you need to calculate the mean of every numeric column in a dataset. Or you need to run the same regression on three different customer segments.
In all of these cases, you're doing the **same thing to many things**. You could write a for loop, and that would work fine. But the **purrr** package gives you a cleaner, more expressive way to do it --- usually in a single line.
Let's see the difference. Here's computing the mean of each column in `mtcars` with a for loop:
```{r}
# Using a for loop
means <- numeric(ncol(mtcars))
for (i in seq_along(mtcars)) {
  means[i] <- mean(mtcars[[i]])
}
names(means) <- names(mtcars)
means
```
Now with purrr:
```{r}
map_dbl(mtcars, mean)
```
Same result: the mean of every column, neatly named. You can see that the average car has about 6.2 cylinders, gets 20.1 mpg, produces about 147 horsepower, and weighs about 3,217 lbs (`wt` is recorded in thousands of pounds, so the printed value is 3.217). One line replaced five lines of loop code, with no indexing, no pre-allocation, and no room for off-by-one errors.
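`mtcars` happens to be entirely numeric. On a data frame with mixed column types, `mean()` returns `NA` with a warning for the non-numeric columns, so it helps to filter first. A minimal sketch, assuming purrr's `keep()` as the filter and the built-in `iris` data as a stand-in for your own:

```r
library(purrr)

# iris has four numeric columns plus one factor column (Species)
numeric_means <- iris %>%
  keep(is.numeric) %>%   # keep only the columns where is.numeric() is TRUE
  map_dbl(mean)          # one mean per remaining column

numeric_means
```

The result is a named numeric vector with one entry per numeric column; `Species` is simply left out.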
::: {.callout-warning}
## AI Pitfall: the wrong `map()` variant can silently misalign results
The `purrr` map functions come in typed variants (`map_dbl`, `map_chr`, `map_lgl`, `map_int`) and an untyped `map()` that returns a list. AI assistants sometimes pick the wrong variant for what you actually want.
A concrete failure: you ask an AI assistant to "extract the model R-squared from each fitted model in this list," and it gives you `map(models, ~ summary(.x)$r.squared)`. The output is a list of length-1 numeric vectors, so you `unlist()` it and the values look right. But if any model failed to fit and its element comes back as `NULL` or with a different length, `unlist()` silently drops or expands that element. The flattened vector no longer lines up with your original list of models, so R-squared values end up associated with the wrong models.
The discipline:
- Use the typed variant (`map_dbl()`) for atomic results. It throws an error if any element's result is not a single value that fits in a numeric vector, and the error points to the position of the offending element.
- For results that may include errors, wrap the function in `possibly()` so failures become an explicit default such as `NA`, or in `safely()` so each result carries its error alongside it.
- Never `unlist()` a `map()` result without first checking that all elements have length 1.
:::
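The misalignment is easy to reproduce. A minimal sketch with a hypothetical list of R-squared values (`m1` to `m3` are made-up names), where the second model failed and left a `NULL` behind:

```r
library(purrr)

# Hypothetical results: the second model failed, so its slot is NULL
r_squared <- list(m1 = 0.81, m2 = NULL, m3 = 0.64)

# unlist() drops the NULL without complaint: three models in, two values out
flat <- unlist(r_squared)
length(flat)

# Checking element lengths first exposes the problem
lengths(r_squared)

# map_dbl() refuses to build a misaligned vector and raises an error instead
typed <- tryCatch(
  map_dbl(r_squared, identity),
  error = function(e) "map_dbl() raised an error"
)
typed
```

If you matched `flat` back to the models by position, the value for `m3` would land on `m2`.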
## The `map` Family: Apply a Function to Every Element
The idea is simple: take a list (or vector, or data frame), apply a function to each element, and collect the results. The members of the `map` family differ only in the type of output they return.
**`map()` --- returns a list** (the safest, most general option):
```{r}
simple_list <- list(a = 1:5, b = 6:10, c = 11:15)
map(simple_list, mean)
```
**`map_dbl()` --- returns a numeric vector** (use when each result is a single number):
```{r}
map_dbl(simple_list, mean)
map_dbl(simple_list, sd)
```
**`map_chr()` --- returns a character vector** (use when each result is a single string):
```{r}
map_chr(iris, class)
```
**`map_dfr()` --- returns a data frame** by stacking results as rows (great for building summary tables):
```{r}
map_dfr(simple_list, ~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x)),
        .id = "group")
```
The result is a tidy summary table with one row per group: group "a" has a mean of 3, "b" has 8, and "c" has 13 --- each with 5 observations. The `.id = "group"` argument adds the column that tells you which list element each row came from. This is exactly the kind of summary table you would build to compare business segments or product lines.
`map_dfr()` was superseded in purrr 1.0.0. The recommended replacement is `map()` followed by `list_rbind()`:
```{r}
simple_list %>%
  map(~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x))) %>%
  list_rbind(names_to = "group")
```
Same result, slightly more explicit.
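The same building blocks cover the regression scenario from the start of the chapter. A sketch, using the three cylinder groups in `mtcars` as stand-ins for customer segments and `mpg ~ wt` as an arbitrary example model:

```r
library(purrr)

# Split the data into one data frame per segment (here: number of cylinders)
segments <- split(mtcars, mtcars$cyl)

# Fit the same model to every segment; map() returns a list of lm objects
models <- map(segments, \(df) lm(mpg ~ wt, data = df))

# Pull one number out of each model; map_dbl() returns a named numeric vector
r_squared <- map_dbl(models, \(m) summary(m)$r.squared)
r_squared
```

`map()` is the right choice for the models because each result is a complex object, while `map_dbl()` fits the second step because each model yields exactly one number.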
## The Formula Shortcut: `~ .x`
When you need a quick, one-off function, purrr lets you write it as a formula. The `~` creates the function, and `.x` refers to each element.
```{r}
# Full function syntax
map_dbl(simple_list, function(x) sum(x^2))
# Formula shortcut --- same thing, less typing
map_dbl(simple_list, ~ sum(.x^2))
```
You can also use R 4.1's `\(x)` syntax if you prefer:
```{r}
map_dbl(simple_list, \(x) sum(x^2))
```
All three do the same thing. The formula shortcut (`~ .x`) appears throughout existing purrr code, so it's worth getting comfortable reading it, even though the purrr documentation now leans toward `\(x)` for new code.
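Anonymous functions are also a clear way to fix extra arguments of the function you are mapping. A small sketch with made-up scores that contain missing values:

```r
library(purrr)

# Hypothetical data with missing values
scores <- list(a = c(1, 2, NA), b = c(4, NA, 6))

# mean() returns NA whenever an element contains a missing value
map_dbl(scores, mean)

# Wrapping the call lets you set na.rm = TRUE for every element
clean_means <- map_dbl(scores, \(x) mean(x, na.rm = TRUE))
clean_means

# The formula shortcut expresses the same thing
map_dbl(scores, ~ mean(.x, na.rm = TRUE))
```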
## Extracting Elements from Lists
A super common use case: you have a list of lists (like API responses or JSON data) and need to pull out one field from each.
```{r}
customers <- list(
  list(name = "Alice", age = 30, city = "Portland"),
  list(name = "Bob", age = 25, city = "Austin"),
  list(name = "Carol", age = 35, city = "Denver")
)
map_chr(customers, "name")
map_int(customers, "age")
map_chr(customers, "city")
```
Just pass the name of the element as a string. Purrr handles the extraction.
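Real API responses are rarely this tidy: a field is often missing from some records, and the typed extractors then stop with an error. A sketch with hypothetical contact records, using the `.default` argument that purrr's extractor shorthand accepts to fill the gaps:

```r
library(purrr)

# Hypothetical records; Bob has no email field
contacts <- list(
  list(name = "Alice", email = "alice@example.com"),
  list(name = "Bob"),
  list(name = "Carol", email = "carol@example.com")
)

# Without a default, map_chr(contacts, "email") stops with an error at Bob.
# .default supplies the value to use when the field is absent.
emails <- map_chr(contacts, "email", .default = NA_character_)
emails
```

The output keeps one entry per record, so it still lines up with `contacts` by position.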
## Two Inputs at Once: `map2()`
Sometimes you need to iterate over two things in parallel. `map2()` takes two inputs and a function with `.x` and `.y`.
```{r}
x <- list(1:5, 6:10, 11:15)
y <- list(10, 20, 30)
map2_dbl(x, y, ~ mean(.x) + .y)
```
The result is three numbers: the mean of `1:5` (3) plus 10 = 13, the mean of `6:10` (8) plus 20 = 28, and the mean of `11:15` (13) plus 30 = 43. Inputs are paired by position: the first element of `x` goes with the first element of `y`, and so on.
Practical example --- generating samples from different distributions:
```{r}
set.seed(42)
means <- c(0, 5, 10)
sds <- c(1, 2, 3)
map2(means, sds, ~ rnorm(n = 5, mean = .x, sd = .y))
```
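`map2()` expects both inputs to have the same length, with one exception: a length-1 input is reused for every element. A sketch with made-up prices and quantities:

```r
library(purrr)

prices <- c(10, 20, 30)
quantities <- c(2, 5, 1)

# Element-wise pairs: 10 * 2, 20 * 5, 30 * 1
revenue <- map2_dbl(prices, quantities, \(p, q) p * q)
revenue

# A length-1 second input is recycled across all three prices
with_fee <- map2_dbl(prices, 5, \(p, fee) p + fee)
with_fee

# Lengths 3 and 2 cannot be paired up, so map2_dbl() raises an error
mismatch <- tryCatch(
  map2_dbl(prices, c(1, 2), \(p, q) p * q),
  error = function(e) "lengths do not match"
)
mismatch
```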
## The Killer Use Case: Reading Multiple CSV Files
This is probably the most common real-world use of purrr for business analysts. You have a folder full of data files and need to combine them.
```{r}
# First, let's create some example CSV files
temp_dir <- tempdir()
walk(1:3, function(i) {
  df <- tibble(
    id = ((i - 1) * 5 + 1):(i * 5),
    group = paste0("group_", i),
    value = rnorm(5, mean = i * 10, sd = 2)
  )
  write_csv(df, file.path(temp_dir, paste0("data_", i, ".csv")))
})
# List the files
csv_files <- list.files(temp_dir, pattern = "data_\\d+\\.csv$", full.names = TRUE)
csv_files
```
```{r}
# Read all files and combine into one data frame --- THE money pattern
all_data <- csv_files %>%
  set_names(basename(.)) %>%
  map(read_csv) %>%
  list_rbind(names_to = "source_file")
all_data
```
That three-step pattern (list the files, read them with `map()`, combine them with `list_rbind()`) scales to hundreds of files without changing a single thing. If your company sends you 52 weekly reports as separate CSV files, this is how you handle it.
## Handling Errors: `possibly()` and `safely()`
When you're applying a function to many inputs, some might fail. Maybe one CSV file is corrupted, or one data point makes your function throw an error. Instead of crashing the entire operation, you can use `possibly()` to return a default value when something goes wrong.
```{r}
inputs <- list(10, "oops", 100, -1, "nope", 1000)
# This would crash: map_dbl(inputs, log)
# This doesn't:
map_dbl(inputs, possibly(log, otherwise = NA_real_))
```
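The `NA` values tell you exactly which inputs failed, and your analysis continues with the rest. Note that `log(-1)` shows up as `NaN` rather than `NA`: it produces a warning, not an error, so `possibly()` never steps in for it. For more detailed error information, `safely()` captures both the result and the error message: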
```{r}
safe_log <- safely(log)
safe_log(10)
safe_log("oops")
```
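`safely()` becomes more useful when mapped over many inputs, because you can then separate the successes from the failures. A sketch, reusing a shortened version of the mixed inputs from above:

```r
library(purrr)

inputs <- list(10, "oops", 100)

# Each element of results is a list with two slots: result and error
results <- map(inputs, safely(log))

# An element failed if its error slot is not NULL
failed <- map_lgl(results, \(r) !is.null(r$error))
failed

# Error messages for the failures only
messages <- map_chr(results[failed], \(r) conditionMessage(r$error))
messages

# Values for the successes only
values <- map_dbl(results[!failed], "result")
values
```

Keeping `failed` as a logical vector of the same length as `inputs` means you can always trace a failure back to the input that caused it.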
`possibly()` is what you'll reach for most of the time. `safely()` is there when you need to debug what went wrong.
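Combining `possibly()` with the file-reading pattern keeps one bad file from sinking the whole import. A sketch, assuming a temporary folder with two small files and one path that does not exist (all file names are made up). It uses base R's `read.csv()` to stay self-contained; `read_csv()` can be wrapped the same way:

```r
library(purrr)

# Create two small example files in a temporary folder
demo_dir <- file.path(tempdir(), "purrr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(demo_dir, "jan.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(demo_dir, "feb.csv"), row.names = FALSE)

# The third path does not exist, so reading it fails
paths <- file.path(demo_dir, c("jan.csv", "feb.csv", "missing.csv"))

# Return NULL instead of stopping when a file cannot be read
read_or_null <- possibly(read.csv, otherwise = NULL)

results <- paths %>%
  set_names(basename(paths)) %>%
  map(\(p) suppressWarnings(read_or_null(p)))

# Record which files failed before combining
failed_files <- names(results)[map_lgl(results, is.null)]
failed_files

# list_rbind() skips the NULL entries and stacks the rest
combined <- list_rbind(results, names_to = "source_file")
combined
```

Checking `failed_files` before moving on matters: otherwise the missing file disappears from `combined` without a trace.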
## Quick Reference
| What you want | Function | Returns |
|:--------------|:---------|:--------|
| Apply function, get list | `map()` | list |
| Apply function, get numbers | `map_dbl()` | numeric vector |
| Apply function, get strings | `map_chr()` | character vector |
| Apply function, get data frame | `map_dfr()` or `map() %>% list_rbind()` | data frame |
| Two inputs at once | `map2()`, `map2_dbl()` | depends on variant |
| Handle errors gracefully | `possibly()`, `safely()` | default value or result+error |
The core pattern for business analysts:
```r
# Reading multiple files (the one you'll use most)
# `pattern` is a regular expression, not a glob, so anchor on the extension
list.files("data/", pattern = "\\.csv$", full.names = TRUE) %>%
  set_names(basename(.)) %>%
  map(read_csv) %>%
  list_rbind(names_to = "source_file")
```
That's purrr. It replaces repetitive loops with clean, one-line function calls. Start with `map_dbl()` for simple calculations and the file-reading pattern for combining data, and you'll cover most of what you need.