19 Doing the Same Thing to Many Things at Once

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

19.1 The Problem: Repetition

Here’s a scenario. You have 12 CSV files — one for each month of sales data — and you need to read them all into R and stack them into one data frame. Or you need to calculate the mean of every numeric column in a dataset. Or you need to run the same regression on three different customer segments.

In all of these cases, you’re doing the same thing to many things. You could write a for loop, and that would work fine. But the purrr package gives you a cleaner, more expressive way to do it — usually in a single line.

Let’s see the difference. Here’s computing the mean of each column in mtcars with a for loop:

Code

# Using a for loop
means <- numeric(ncol(mtcars))
for (i in seq_along(mtcars)) {
 means[i] <- mean(mtcars[[i]])
}
names(means) <- names(mtcars)
means

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

Now with purrr:

Code

map_dbl(mtcars, mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

Same result: the mean of every column, neatly named. The average car in mtcars gets about 20.1 mpg, has about 6.2 cylinders, 231 cubic inches of engine displacement, and 147 horsepower, and weighs about 3,217 lbs (wt is recorded in thousands of pounds).

One line replaced five lines of loop code: no indexing, no pre-allocation, no room for off-by-one errors.
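The regression scenario mentioned at the start of this section follows the same pattern. Here is a minimal sketch, assuming the "segments" are the three cylinder groups in mtcars and the model is mpg ~ wt (both are illustrative choices, not part of the original example):

```r
library(purrr)

# Assumption: treat each cylinder count in mtcars as a "customer segment"
segments <- split(mtcars, mtcars$cyl)

# Fit the same regression to every segment
models <- map(segments, function(df) lm(mpg ~ wt, data = df))

# Pull one number (R-squared) out of each fitted model
r_squared <- map_dbl(models, function(m) summary(m)$r.squared)
r_squared
```

The result is a named numeric vector with one R-squared per segment, named after the cylinder counts.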

AI Pitfall: map() family silently coerces output types

The purrr map functions come in typed variants (map_dbl, map_chr, map_lgl, map_int) and an untyped map() that returns a list. AI assistants sometimes pick the wrong variant for what you actually want.

A concrete failure: you ask AI to “extract the model R-squared from each fitted model in this list.” AI gives you map(models, ~ summary(.x)$r.squared). The output is a list of length-1 numeric vectors. You can unlist() it and the values look right. But if any model failed to fit and left a NULL (or a result of a different length) in its slot, unlist() silently drops the NULL or splices in the extra values. The flattened vector no longer has one entry per model, so matching it back to your original list by position pairs R-squared values with the wrong models.

The discipline:

- Use the typed variant (map_dbl()) for atomic results. It throws an error if any element is not a single numeric value, so a failed element surfaces immediately.
- For functions that may throw errors, wrap them with possibly() so failures become an explicit default such as NA, or with safely() so each result carries its error alongside it.
- Never unlist() a map() result without first checking that all elements have length 1.
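To make the pitfall concrete, here is a small sketch. The list r2_list is a hypothetical stand-in for per-model results in which the second model failed and left a NULL behind; the names and values are made up for illustration.

```r
library(purrr)

# Hypothetical per-model results: the second model failed and left NULL
r2_list <- list(m1 = 0.81, m2 = NULL, m3 = 0.64)

# unlist() silently drops the NULL: two values for three models
flat <- unlist(unname(r2_list))
length(flat)

# map_dbl() refuses to build a vector from a length-0 element
strict <- tryCatch(
  map_dbl(r2_list, function(r) r),
  error = function(e) "map_dbl() raised an error"
)
strict

# Check element lengths before flattening anything
lengths(r2_list)

# Make the failure explicit as NA so positions stay aligned
r2_safe <- map_dbl(r2_list, function(r) if (is.null(r)) NA_real_ else r)
r2_safe
```

The last vector still has three entries, so the NA sits exactly where the failed model was.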

19.2 The `map` Family: Apply a Function to Every Element

The idea is simple: take a list (or vector, or data frame), apply a function to each element, and collect the results. The map family differs only in what type of output you want back.

map() — returns a list (the safest, most general option):

Code

simple_list <- list(a = 1:5, b = 6:10, c = 11:15)
map(simple_list, mean)

$a
[1] 3

$b
[1] 8

$c
[1] 13

map_dbl() — returns a numeric vector (use when each result is a single number):

Code

map_dbl(simple_list, mean)

 a  b  c 
 3  8 13

Code

map_dbl(simple_list, sd)

       a        b        c 
1.581139 1.581139 1.581139
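The other two typed variants named in the pitfall box, map_int() and map_lgl(), work the same way. A short sketch using the same simple_list (the threshold of 10 is an arbitrary choice for illustration):

```r
library(purrr)

simple_list <- list(a = 1:5, b = 6:10, c = 11:15)

# map_int(): each result must be a single integer
n_values <- map_int(simple_list, length)
n_values

# map_lgl(): each result must be a single TRUE/FALSE
has_large <- map_lgl(simple_list, function(v) any(v > 10))
has_large
```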

map_chr() — returns a character vector (use when each result is a single string):

Code

map_chr(iris, class)

Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"

map_dfr() — returns a data frame by stacking results as rows (great for building summary tables):

Code

map_dfr(simple_list, ~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x)),
 .id = "group")

# A tibble: 3 × 4
  group  mean    sd     n
  <chr> <dbl> <dbl> <int>
1 a         3  1.58     5
2 b         8  1.58     5
3 c        13  1.58     5

The result is a tidy summary table with one row per group: group “a” has a mean of 3, “b” has 8, and “c” has 13 — each with 5 observations. The .id = "group" argument adds the column that tells you which list element each row came from. This is exactly the kind of summary table you would build to compare business segments or product lines.

map_dfr() is superseded as of purrr 1.0.0. The recommended replacement is map() followed by list_rbind():

Code

simple_list %>%
 map(~ tibble(mean = mean(.x), sd = sd(.x), n = length(.x))) %>%
 list_rbind(names_to = "group")

# A tibble: 3 × 4
  group  mean    sd     n
  <chr> <dbl> <dbl> <int>
1 a         3  1.58     5
2 b         8  1.58     5
3 c        13  1.58     5

Same result, slightly more explicit.

19.3 The Formula Shortcut: `~ .x`

When you need a quick, one-off function, purrr lets you write it as a formula. The ~ creates the function, and .x refers to each element.

Code

# Full function syntax
map_dbl(simple_list, function(x) sum(x^2))

  a   b   c 
 55 330 855

Code

# Formula shortcut --- same thing, less typing
map_dbl(simple_list, ~ sum(.x^2))

  a   b   c 
 55 330 855

You can also use R 4.1’s \(x) syntax if you prefer:

Code

map_dbl(simple_list, \(x) sum(x^2))

  a   b   c 
 55 330 855

All three do the same thing. The formula shortcut (~ .x) is the most common in purrr code, so it’s worth getting comfortable with.

19.4 Extracting Elements from Lists

A super common use case: you have a list of lists (like API responses or JSON data) and need to pull out one field from each.

Code

customers <- list(
 list(name = "Alice", age = 30, city = "Portland"),
 list(name = "Bob", age = 25, city = "Austin"),
 list(name = "Carol", age = 35, city = "Denver")
)

map_chr(customers, "name")

[1] "Alice" "Bob"   "Carol"

Code

map_int(customers, "age")

[1] 30 25 35

Code

map_chr(customers, "city")

[1] "Portland" "Austin"   "Denver"

Just pass the name of the element as a string, and purrr handles the extraction.
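Real API responses often have missing fields. The sketch below assumes a variant of the customers list in which one record has no city (a made-up case for illustration) and uses the .default argument to fill the gap instead of failing. It also shows extraction by position.

```r
library(purrr)

# Hypothetical variant of the customers list: Bob has no city field
customers_gappy <- list(
  list(name = "Alice", age = 30, city = "Portland"),
  list(name = "Bob", age = 25),
  list(name = "Carol", age = 35, city = "Denver")
)

# .default supplies a value when the field is absent
cities <- map_chr(customers_gappy, "city", .default = NA_character_)
cities

# An integer extracts by position instead of by name
first_fields <- map_chr(customers_gappy, 1)
first_fields
```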

19.5 Two Inputs at Once: `map2()`

Sometimes you need to iterate over two things in parallel. map2() takes two inputs and a function with .x and .y.

Code

x <- list(1:5, 6:10, 11:15)
y <- list(10, 20, 30)

map2_dbl(x, y, ~ mean(.x) + .y)

[1] 13 28 43

The result is three numbers: the mean of 1:5 (3) plus 10 = 13, the mean of 6:10 (8) plus 20 = 28, and the mean of 11:15 (13) plus 30 = 43. Each pair of inputs was processed in parallel.

Practical example — generating samples from different distributions:

Code

set.seed(42)
means <- c(0, 5, 10)
sds <- c(1, 2, 3)

map2(means, sds, ~ rnorm(n = 5, mean = .x, sd = .y))

[[1]]
[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

[[2]]
[1] 4.787751 8.023044 4.810682 9.036847 4.874572

[[3]]
[1] 13.914609 16.859936  5.833418  9.163634  9.600036
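One detail worth knowing about map2(): the two inputs must have the same length, except that a length-1 input is recycled across the other. A minimal sketch, reusing the x list from above with an arbitrary constant of 100:

```r
library(purrr)

x <- list(1:5, 6:10, 11:15)

# A length-1 second input is reused for every element of x
shifted <- map2_dbl(x, 100, ~ mean(.x) + .y)
shifted

# Any other length mismatch is an error rather than silent recycling
mismatch <- tryCatch(
  map2_dbl(x, c(10, 20), ~ mean(.x) + .y),
  error = function(e) "length mismatch error"
)
mismatch
```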

19.6 The Killer Use Case: Reading Multiple CSV Files

This is probably the most common real-world use of purrr for business analysts. You have a folder full of data files and need to combine them.

Code

# First, let's create some example CSV files
temp_dir <- tempdir()
walk(1:3, function(i) {
 df <- tibble(
 id = ((i - 1) * 5 + 1):(i * 5),
 group = paste0("group_", i),
 value = rnorm(5, mean = i * 10, sd = 2)
 )
 write_csv(df, file.path(temp_dir, paste0("data_", i, ".csv")))
})

# List the files
csv_files <- list.files(temp_dir, pattern = "data_\\d+\\.csv$", full.names = TRUE)
csv_files

[1] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_1.csv"
[2] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_2.csv"
[3] "C:\\Users\\patil\\AppData\\Local\\Temp\\Rtmp2ndem4/data_3.csv"

Code

# Read all files and combine into one data frame --- THE money pattern
all_data <- csv_files %>%
 set_names(basename(.)) %>%
 map(read_csv) %>%
 list_rbind(names_to = "source_file")

all_data

# A tibble: 15 × 4
   source_file    id group   value
   <chr>       <dbl> <chr>   <dbl>
 1 data_1.csv      1 group_1 11.3 
 2 data_1.csv      2 group_1  9.43
 3 data_1.csv      3 group_1  4.69
 4 data_1.csv      4 group_1  5.12
 5 data_1.csv      5 group_1 12.6 
 6 data_2.csv      6 group_2 19.4 
 7 data_2.csv      7 group_2 16.4 
 8 data_2.csv      8 group_2 19.7 
 9 data_2.csv      9 group_2 22.4 
10 data_2.csv     10 group_2 23.8 
11 data_3.csv     11 group_3 29.1 
12 data_3.csv     12 group_3 29.5 
13 data_3.csv     13 group_3 26.5 
14 data_3.csv     14 group_3 30.9 
15 data_3.csv     15 group_3 28.7

That three-line pattern — list files, read them with map(), combine with list_rbind() — scales to hundreds of files without changing a single thing. If your company sends you 52 weekly reports as separate CSV files, this is how you handle it.
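When the files come from outside your control, it can help to state the expected column types so that a file with a stray text value does not quietly change a column's type. The following is a sketch under assumptions: the folder name purrr_demo, the file names, and the two-column layout are all invented for the example.

```r
library(tidyverse)

# Hypothetical demo folder with two small monthly files
demo_dir <- file.path(tempdir(), "purrr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, value = c(1.5, 2.5)), file.path(demo_dir, "jan.csv"))
write_csv(tibble(id = 3:4, value = c(3.5, 4.5)), file.path(demo_dir, "feb.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Same pattern as above, with the column types spelled out for every file
combined <- files %>%
  set_names(basename(files)) %>%
  map(function(f) {
    read_csv(f, col_types = cols(id = col_integer(), value = col_double()))
  }) %>%
  list_rbind(names_to = "source_file")

combined
```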

19.7 Handling Errors: `possibly()` and `safely()`

When you’re applying a function to many inputs, some might fail. Maybe one CSV file is corrupted, or one data point makes your function throw an error. Instead of crashing the entire operation, you can use possibly() to return a default value when something goes wrong.

Code

inputs <- list(10, "oops", 100, -1, "nope", 1000)

# This would crash: map_dbl(inputs, log)
# This doesn't:
map_dbl(inputs, possibly(log, otherwise = NA_real_))

[1] 2.302585       NA 4.605170      NaN       NA 6.907755

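The NA values mark the inputs where log() threw an error ("oops" and "nope"), and your analysis continues with the rest. The NaN for -1 is different: log(-1) only raises a warning, so possibly() does not intercept it and the NaN passes through as an ordinary result. For more detailed error information, safely() captures both the result and the error message: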

Code

safe_log <- safely(log)
safe_log(10)

$result
[1] 2.302585

$error
NULL

Code

safe_log("oops")

$result
NULL

$error
<simpleError in .f(...): non-numeric argument to mathematical function>

possibly() is what you’ll use 90% of the time. safely() is there when you need to debug what went wrong.
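As a sketch of that debugging workflow, safely() can be mapped over the whole input list and the failures pulled out afterwards. The helper names results, failed, and messages are illustrative.

```r
library(purrr)

inputs <- list(10, "oops", 100, -1, "nope", 1000)

# Every element becomes a list with $result and $error
# (log(-1) only warns, so it counts as a success here)
results <- suppressWarnings(map(inputs, safely(log)))

# Flag the elements whose error slot is filled
failed <- map_lgl(results, function(r) !is.null(r$error))
which(failed)

# Collect the error messages for just the failures
messages <- map_chr(results[failed], function(r) conditionMessage(r$error))
messages
```

Positions 2 and 5 are flagged, matching the two character inputs.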

19.8 Quick Reference

What you want	Function	Returns
Apply function, get list	`map()`	list
Apply function, get numbers	`map_dbl()`	numeric vector
Apply function, get strings	`map_chr()`	character vector
Apply function, get data frame	`map_dfr()` or `map() %>% list_rbind()`	data frame
Two inputs at once	`map2()`, `map2_dbl()`	depends on variant
Handle errors gracefully	`possibly()`, `safely()`	default value or result+error

The core pattern for business analysts:

# Reading multiple files (the one you'll use most)
list.files("data/", pattern = "\\.csv$", full.names = TRUE) %>%
 set_names(basename(.)) %>%
 map(read_csv) %>%
 list_rbind(names_to = "source_file")

That’s purrr. It replaces repetitive loops with clean, one-line function calls. Start with map_dbl() for simple calculations and the file-reading pattern for combining data, and you’ll cover most of what you need.
