19 Doing the Same Thing to Many Things at Once

Code
library(tidyverse)
Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

19.1 The Problem: Repetition

Here’s a scenario. You have 12 CSV files — one for each month of sales data — and you need to read them all into R and stack them into one data frame. Or you need to calculate the mean of every numeric column in a dataset. Or you need to run the same regression on three different customer segments.

In all of these cases, you’re doing the same thing to many things. You could write a for loop, and that would work fine. But the purrr package gives you a cleaner, more expressive way to do it — usually in a single line.

Let’s see the difference. Here’s computing the mean of each column in mtcars with a for loop:

Code
# Using a for loop
means <- numeric(ncol(mtcars))
for (i in seq_along(mtcars)) {
 means[i] <- mean(mtcars[[i]])
}
names(means) <- names(mtcars)
means
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Now with purrr:

Code
map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Same result: the mean of every column, neatly named. The average car in mtcars has about 6.2 cylinders, gets 20.1 mpg, has roughly 147 horsepower (with a displacement of about 231 cubic inches), and weighs about 3,217 lbs.

One line replaced five lines of loop code, with no indexing, no pre-allocation, and no room for off-by-one errors.

Warning (AI Pitfall): map() family silently coerces output types

The purrr map functions come in typed variants (map_dbl, map_chr, map_lgl, map_int) and an untyped map() that returns a list. AI assistants sometimes pick the wrong variant for what you actually want.

A concrete failure: you ask AI to “extract the model R-squared from each fitted model in this list.” AI gives you map(models, ~ summary(.x)$r.squared). The output is a list of length-1 numeric vectors; you can unlist() it and the values look right. But if any model failed to fit and its slot holds NULL (or a result of a different length), unlist() silently drops or expands that element. The flattened vector no longer lines up position by position with your original list, so R-squared values end up associated with the wrong models.

The discipline:

- Use the typed variant (map_dbl()) for atomic results. It stops with an error if any element's output is not a single value that fits in a numeric vector, so failures surface immediately.
- For functions that may throw errors, wrap them with possibly() so failures become explicit NA values you can see and handle, or with safely() to capture the error itself.
- Never unlist() a map() result without first checking that all elements have length 1.
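To make the pitfall concrete, here is a minimal sketch. The setup is an assumption for illustration only: mtcars is split by cylinder count, one segment is emptied so that lm() fails on it, and get_r2 is a hypothetical helper name, not something from purrr.

```r
library(purrr)

# Hypothetical setup: one data frame per cylinder count, named "4", "6", "8"
segments <- split(mtcars, mtcars$cyl)
segments[["6"]] <- segments[["6"]][0, ]   # empty this segment so lm() fails on it

# Illustrative helper: fit a model and pull out R-squared
get_r2 <- function(df) summary(lm(mpg ~ wt, data = df))$r.squared

# Risky: failures become NULL, and unlist() drops them without complaint
r2_list <- map(segments, possibly(get_r2, otherwise = NULL))
flat <- unlist(r2_list)
length(flat)     # shorter than length(segments)

# Safer: failures become NA_real_, so every segment keeps its own slot
r2_dbl <- map_dbl(segments, possibly(get_r2, otherwise = NA_real_))
r2_dbl
```

In the first version the flattened vector has lost an element, so matching it back to the segments by position would be wrong. In the second, the failed segment is still present and clearly marked as NA.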

19.2 The map Family: Apply a Function to Every Element

The idea is simple: take a list (or vector, or data frame), apply a function to each element, and collect the results. The map family differs only in what type of output you want back.

map() — returns a list (the safest, most general option):

Code
simple_list <- list(a = 1:5, b = 6:10, c = 11:15)
map(simple_list, mean)
$a
[1] 3

$b
[1] 8

$c
[1] 13

map_dbl() — returns a numeric vector (use when each result is a single number):

Code
map_dbl(simple_list, mean)
 a  b  c 
 3  8 13 
Code
map_dbl(simple_list, sd)
       a        b        c 
1.581139 1.581139 1.581139 

map_chr() — returns a character vector (use when each result is a single string):

Code
map_chr(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

map_dfr() — returns a data frame by stacking results as rows (great for building summary tables):

Code
map_dfr(simple_list, ~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x)),
 .id = "group")
# A tibble: 3 × 4
  group  mean    sd     n
  <chr> <dbl> <dbl> <int>
1 a         3  1.58     5
2 b         8  1.58     5
3 c        13  1.58     5

The result is a tidy summary table with one row per group: group “a” has a mean of 3, “b” has 8, and “c” has 13 — each with 5 observations. The .id = "group" argument adds the column that tells you which list element each row came from. This is exactly the kind of summary table you would build to compare business segments or product lines.

map_dfr() was superseded in purrr 1.0.0. The recommended replacement is map() followed by list_rbind():

Code
simple_list %>%
 map(~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x))) %>%
 list_rbind(names_to = "group")
# A tibble: 3 × 4
  group  mean    sd     n
  <chr> <dbl> <dbl> <int>
1 a         3  1.58     5
2 b         8  1.58     5
3 c        13  1.58     5

Same result, slightly more explicit.

19.3 The Formula Shortcut: ~ .x

When you need a quick, one-off function, purrr lets you write it as a formula. The ~ creates the function, and .x refers to each element.

Code
# Full function syntax
map_dbl(simple_list, function(x) sum(x^2))
  a   b   c 
 55 330 855 
Code
# Formula shortcut --- same thing, less typing
map_dbl(simple_list, ~ sum(.x^2))
  a   b   c 
 55 330 855 

You can also use R 4.1’s \(x) syntax if you prefer:

Code
map_dbl(simple_list, \(x) sum(x^2))
  a   b   c 
 55 330 855 

All three do the same thing. The formula shortcut (~ .x) is everywhere in existing purrr code, so it’s worth getting comfortable reading it; recent purrr documentation leans toward the \(x) form.
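One place where an anonymous function earns its keep is when the mapped function needs an extra argument. The sketch below uses the built-in airquality dataset (chosen here only because two of its columns contain missing values) to compare a bare mean with a version that passes na.rm = TRUE.

```r
library(purrr)

# airquality ships with base R; its Ozone and Solar.R columns contain NAs
naive <- map_dbl(airquality, mean)

# Anonymous function form: the extra argument is visible at the call site
clean <- map_dbl(airquality, \(col) mean(col, na.rm = TRUE))

# Formula form does the same thing
clean_formula <- map_dbl(airquality, ~ mean(.x, na.rm = TRUE))

naive
clean
```

The naive version returns NA for any column that has a missing value, while the other two skip the missing values and return a mean for every column.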

19.4 Extracting Elements from Lists

A super common use case: you have a list of lists (like API responses or JSON data) and need to pull out one field from each.

Code
customers <- list(
 list(name = "Alice", age = 30, city = "Portland"),
 list(name = "Bob", age = 25, city = "Austin"),
 list(name = "Carol", age = 35, city = "Denver")
)

map_chr(customers, "name")
[1] "Alice" "Bob"   "Carol"
Code
map_int(customers, "age")
[1] 30 25 35
Code
map_chr(customers, "city")
[1] "Portland" "Austin"   "Denver"  

Just pass the element's name as a string and purrr handles the extraction.
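Real API responses often have missing fields. The sketch below is one way to handle that, using hypothetical records where only one has an email field. It calls pluck() explicitly with a .default so a missing field becomes NA instead of stopping the whole extraction.

```r
library(purrr)

# Hypothetical records: only the first one carries an "email" field
contacts <- list(
  list(name = "Alice", email = "alice@example.com"),
  list(name = "Bob"),
  list(name = "Carol")
)

# Without a default, map_chr() stops: the missing field yields NULL, not a string
strict <- tryCatch(map_chr(contacts, "email"), error = function(e) "error")

# With a default, missing fields become NA and every record keeps its position
emails <- map_chr(contacts, \(rec) pluck(rec, "email", .default = NA_character_))
emails
```

Because the result still has one entry per record, it can be put next to the names safely.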

19.5 Two Inputs at Once: map2()

Sometimes you need to iterate over two things in parallel. map2() takes two inputs and a function with .x and .y.

Code
x <- list(1:5, 6:10, 11:15)
y <- list(10, 20, 30)

map2_dbl(x, y, ~ mean(.x) + .y)
[1] 13 28 43

The result is three numbers: the mean of 1:5 (3) plus 10 = 13, the mean of 6:10 (8) plus 20 = 28, and the mean of 11:15 (13) plus 30 = 43. Inputs are paired by position: the first element of x with the first element of y, the second with the second, and so on.

Practical example — generating samples from different distributions:

Code
set.seed(42)
means <- c(0, 5, 10)
sds <- c(1, 2, 3)

map2(means, sds, ~ rnorm(n = 5, mean = .x, sd = .y))
[[1]]
[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

[[2]]
[1] 4.787751 8.023044 4.810682 9.036847 4.874572

[[3]]
[1] 13.914609 16.859936  5.833418  9.163634  9.600036
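Since map2() pairs its inputs by position, their lengths matter. The sketch below (the variable names are illustrative) shows the two cases worth knowing: a length-1 input is reused for every element, and inputs of other mismatched lengths cause an error, not a silent misalignment.

```r
library(purrr)

x <- list(1:5, 6:10, 11:15)

# A length-1 second input is recycled across every element of x
shifted <- map2_dbl(x, 100, \(v, offset) mean(v) + offset)
shifted
# 103 108 113

# Lengths 3 and 2 cannot be paired up, so map2_dbl() stops with an error
mismatch <- tryCatch(
  map2_dbl(x, c(10, 20), \(v, offset) mean(v) + offset),
  error = function(e) "length mismatch"
)
mismatch
```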

19.6 The Killer Use Case: Reading Multiple CSV Files

This is probably the most common real-world use of purrr for business analysts. You have a folder full of data files and need to combine them.

Code
# First, let's create some example CSV files
temp_dir <- tempdir()
walk(1:3, function(i) {
 df <- tibble(
 id = ((i - 1) * 5 + 1):(i * 5),
 group = paste0("group_", i),
 value = rnorm(5, mean = i * 10, sd = 2)
 )
 write_csv(df, file.path(temp_dir, paste0("data_", i, ".csv")))
})

# List the files
csv_files <- list.files(temp_dir, pattern = "data_\\d+\\.csv$", full.names = TRUE)
csv_files
[1] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_1.csv"
[2] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_2.csv"
[3] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_3.csv"
Code
# Read all files and combine into one data frame --- THE money pattern
all_data <- csv_files %>%
 set_names(basename(.)) %>%
 map(read_csv) %>%
 list_rbind(names_to = "source_file")

all_data
# A tibble: 15 × 4
   source_file    id group   value
   <chr>       <dbl> <chr>   <dbl>
 1 data_1.csv      1 group_1 11.3 
 2 data_1.csv      2 group_1  9.43
 3 data_1.csv      3 group_1  4.69
 4 data_1.csv      4 group_1  5.12
 5 data_1.csv      5 group_1 12.6 
 6 data_2.csv      6 group_2 19.4 
 7 data_2.csv      7 group_2 16.4 
 8 data_2.csv      8 group_2 19.7 
 9 data_2.csv      9 group_2 22.4 
10 data_2.csv     10 group_2 23.8 
11 data_3.csv     11 group_3 29.1 
12 data_3.csv     12 group_3 29.5 
13 data_3.csv     13 group_3 26.5 
14 data_3.csv     14 group_3 30.9 
15 data_3.csv     15 group_3 28.7 

That three-step pattern (list the files, read them with map(), combine with list_rbind()) scales to hundreds of files without any changes. If your company sends you 52 weekly reports as separate CSV files, this is how you handle it.

19.7 Handling Errors: possibly() and safely()

When you’re applying a function to many inputs, some might fail. Maybe one CSV file is corrupted, or one data point makes your function throw an error. Instead of crashing the entire operation, you can use possibly() to return a default value when something goes wrong.

Code
inputs <- list(10, "oops", 100, -1, "nope", 1000)

# This would crash: map_dbl(inputs, log)
# This doesn't:
map_dbl(inputs, possibly(log, otherwise = NA_real_))
[1] 2.302585       NA 4.605170      NaN       NA 6.907755

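The NA values mark the inputs where log() threw an error ("oops" and "nope"), and the rest of the computation carries on. The NaN for -1 is a different case: log(-1) returns NaN with a warning, not an error, so possibly() never stepped in. For more detailed error information, safely() captures both the result and the error object: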

Code
safe_log <- safely(log)
safe_log(10)
$result
[1] 2.302585

$error
NULL
Code
safe_log("oops")
$result
NULL

$error
<simpleError in .f(...): non-numeric argument to mathematical function>

possibly() is what you’ll use most of the time. safely() is there when you need to debug what went wrong.
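The same idea can be applied to the file-reading pattern so that one bad file does not stop the whole import. This is a sketch under stated assumptions: the folder, file names, and the read_or_null helper are all hypothetical, and a deliberately missing file stands in for a corrupted one.

```r
library(purrr)
library(readr)

# Hypothetical scratch folder with two good files
tmp <- file.path(tempdir(), "purrr_sketch")
dir.create(tmp, showWarnings = FALSE)
write_csv(data.frame(id = 1:2, value = c(10, 20)), file.path(tmp, "ok_1.csv"))
write_csv(data.frame(id = 3:4, value = c(30, 40)), file.path(tmp, "ok_2.csv"))

paths <- c(
  file.path(tmp, "ok_1.csv"),
  file.path(tmp, "missing.csv"),   # deliberately absent, so reading it fails
  file.path(tmp, "ok_2.csv")
)
paths <- set_names(paths, basename(paths))

# Illustrative helper: return NULL when a file cannot be read
read_or_null <- possibly(\(p) read_csv(p, show_col_types = FALSE), otherwise = NULL)

loaded <- map(paths, read_or_null)

# Record which files failed before combining
failed <- names(keep(loaded, is.null))

# list_rbind() skips NULL elements, so only the readable files are stacked
combined <- list_rbind(loaded, names_to = "source_file")

failed
combined
```

Keeping the list of failed files matters here: without it, a missing week of data would disappear from the combined table with no trace.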

19.8 Quick Reference

| What you want | Function | Returns |
|---|---|---|
| Apply function, get list | map() | list |
| Apply function, get numbers | map_dbl() | numeric vector |
| Apply function, get strings | map_chr() | character vector |
| Apply function, get data frame | map_dfr() or map() %>% list_rbind() | data frame |
| Two inputs at once | map2(), map2_dbl() | depends on variant |
| Handle errors gracefully | possibly(), safely() | default value or result+error |

The core pattern for business analysts:

# Reading multiple files (the one you'll use most)
list.files("data/", pattern = "\\.csv$", full.names = TRUE) %>%
 set_names(basename(.)) %>%
 map(read_csv) %>%
 list_rbind(names_to = "source_file")

That’s purrr. It replaces repetitive loops with clean, one-line function calls. Start with map_dbl() for simple calculations and the file-reading pattern for combining data, and you’ll cover most of what you need.