15 Working with Text Data

16 Working with Text Data

Code

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

16.1 Why You Need This

A surprising amount of business data is text that humans typed by hand: survey responses, product names, customer addresses, free-form comment fields. Strings like " GREAT product!! " and "great Product" are functionally identical to a human reader, but R treats them as completely different values until you normalize them. The stringr package is the toolkit for that normalization plus everything else you need to do with text — pattern matching, extraction, replacement, splitting, and joining.

Every function in the package starts with str_, takes the string as its first argument (so it composes cleanly in pipelines), and does what its name suggests. The naming consistency is one of the package’s biggest practical advantages.

AI Pitfall: AI’s regex looks right and matches the wrong things

Ask an AI assistant to “extract the customer’s last name from this column” and you reliably get back a regex like str_extract(name, "[A-Za-z]+$"). The pattern looks reasonable. It also fails on common cases:

Hyphenated names: “Smith-Jones” returns “Jones”
Names with apostrophes: “O’Brien” returns “Brien”
Names with spaces: “Van Der Berg” returns “Berg”
Names with non-ASCII characters: “Müller” returns “ller” because [A-Za-z] excludes accented letters

The AI’s pattern works on the easy cases and silently produces wrong results on everything else. The defensive habit: after applying any regex against real text data, sample the output (sample_n(df, 20) and look at the extracted values) for cases the AI did not anticipate. If you are working with international names, use [\\p{L}]+ instead of [A-Za-z]+ to match any Unicode letter — but the bigger lesson is that text data has long tails of edge cases, and trusting AI’s first-attempt regex on real data is a recipe for silent failure.

16.2 String Basics

Strings in R live inside quotes. Double quotes are standard.

Code

product <- "Wireless Bluetooth Speaker"
product

[1] "Wireless Bluetooth Speaker"

If your string itself contains a quote, wrap it in single quotes:

Code

tagline <- 'She said, "Take my money!"'
tagline

[1] "She said, \"Take my money!\""

str_length() counts characters — useful for flagging those one-word survey responses from people who clearly had somewhere else to be:

Code

responses <- c("Great product!", "OK", "This changed my life and I want to tell everyone", "N/A")
str_length(responses)

[1] 14  2 48  3

16.3 Combining Strings with `str_c()`

str_c() glues strings together. It is paste() but tidyverse-flavored.

Code

str_c("Q", 1:4)

[1] "Q1" "Q2" "Q3" "Q4"

Code

str_c("Hello", "World", sep = " ")

[1] "Hello World"

The collapse argument squashes a whole vector into one string — great for building labels or email subject lines:

Code

departments <- c("Marketing", "Finance", "Operations")
str_c(departments, collapse = ", ")

[1] "Marketing, Finance, Operations"

Code

str_c(departments, collapse = " & ")

[1] "Marketing & Finance & Operations"

16.4 Changing Case — Your Most-Used Trick

When survey respondents type “NEW YORK”, “new york”, and “New york” — R sees three different values. One line fixes all of them:

Code

cities <- c("new york", "LOS ANGELES", "chicago", "SAN FRANCISCO")
str_to_title(cities)

[1] "New York"      "Los Angeles"   "Chicago"       "San Francisco"

The full case family:

Code

sample_text <- "the QUICK brown FOX"
str_to_lower(sample_text)

[1] "the quick brown fox"

Code

str_to_upper(sample_text)

[1] "THE QUICK BROWN FOX"

Code

str_to_title(sample_text)

[1] "The Quick Brown Fox"

You will use these every single time you clean data where humans had a keyboard and zero standards. So… every time.

16.5 Trimming Whitespace

Extra spaces are the silent killer of data joins. If one spreadsheet has “Marketing” and another has ” Marketing “, your left_join() will silently drop rows and you will silently lose your mind trying to figure out why.

Code

messy <- " too much space "
str_trim(messy) # removes leading/trailing spaces

[1] "too much space"

Code

str_squish(messy) # also collapses internal spaces to one

[1] "too much space"

str_squish() handles everything. Just use it by default and move on with your life.

16.6 Detecting Patterns with `str_detect()`

“Does this string contain this word?” That is str_detect(). It returns TRUE/FALSE, which makes it a perfect partner for filter().

Code

products <- c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8", "iPhone SE")
str_detect(products, "iPhone")

[1]  TRUE FALSE  TRUE FALSE  TRUE

Inside a data frame — this is where it earns its paycheck:

Code

product_df <- tibble(
 product = c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8"),
 revenue = c(45000, 32000, 38000, 15000)
)

product_df %>% filter(str_detect(product, "iPhone"))

# A tibble: 2 × 2
  product       revenue
  <chr>           <dbl>
1 iPhone 15 Pro   45000
2 iPhone 15       38000

You just filtered to Apple products without needing a separate category column. Bill 3 hours for it anyway.

16.7 Replacing Text with `str_replace()`

str_replace() fixes the first match. str_replace_all() fixes every match. Think “Find and Replace” in Excel, but it actually works on 50,000 rows.

Code

reviews <- c("This product is gr8", "Loved it!! Will buy again!!", "Good product, good price")
str_replace_all(reviews, "!!", "!")

[1] "This product is gr8"       "Loved it! Will buy again!"
[3] "Good product, good price"

Standardizing how people wrote “percent”:

Code

responses <- c("50 percent", "30%", "about 20 percent", "75%")
str_replace_all(responses, " percent", "%")

[1] "50%"       "30%"       "about 20%" "75%"

str_remove() and str_remove_all() are shortcuts for replacing with nothing — perfect for stripping symbols:

Code

prices <- c("$29.99", "$149.50", "$5.00")
str_remove(prices, "\\$")

[1] "29.99"  "149.50" "5.00"

16.8 Extracting Text with `str_extract()`

Need to pull a number out of a messy string? str_extract() grabs the first match.

Code

order_notes <- c("Order 1234 placed", "Invoice 5678 pending", "Shipped item 91011")
str_extract(order_notes, "\\d+")

[1] "1234"  "5678"  "91011"

That \\d+ is a regular expression (regex). Do not panic — the next section covers the only ones you will ever need.

16.9 Regular Expressions: Just the 3 Patterns You Actually Need

Regex looks like a cat walked across your keyboard. The good news: for 90% of business data cleaning, you only need three patterns. Seriously.

Pattern 1: \\d+ — Find numbers

Code

sales_notes <- c("Sold 15 units", "Returned 3 items", "Ordered 200 widgets")
str_extract(sales_notes, "\\d+")

[1] "15"  "3"   "200"

Pattern 2: [A-Za-z]+ — Find words (letters only)

Code

codes <- c("SKU-12345", "PROD-6789", "ITEM-111")
str_extract(codes, "[A-Z]+")

[1] "SKU"  "PROD" "ITEM"

Pattern 3: [\\w.]+@[\\w.]+ — Find email-like patterns

Code

messages <- c("Contact alice@example.com for details",
 "Email bob.smith@company.org for info",
 "No email here")
str_extract(messages, "[\\w.]+@[\\w.]+")

[1] "alice@example.com"     "bob.smith@company.org" NA

That is the whole regex syllabus. If you need something fancier, Google “regex cheat sheet.” Everyone does it — including the people who wrote the cheat sheet.

16.10 Putting It All Together: A Survey Data Cleaning Pipeline

Here is a realistic scenario. You received survey data and it looks like it was filled out during an earthquake on a moving bus:

Code

survey_raw <- tibble(
 respondent = c(" JOHN doe ", "jane SMITH", " Bob Johnson ",
 "ALICE williams", " charlie Brown"),
 email = c("John.Doe@Gmail.COM", "jane_smith@YAHOO.com",
 "BOB@company.ORG", "alice.w@School.Edu", "charlie@EMAIL.com"),
 phone = c("(555) 123-4567", "555.234.5678", "555 345 6789",
 "5554567890", "(555)-567-8901"),
 favorite_color = c(" Blue ", "RED", " green ", "BLUE", " Green")
)

survey_clean <- survey_raw %>%
 mutate(
 # Clean names: squish + title case
 respondent = respondent %>% str_squish() %>% str_to_title(),

 # Emails: just lowercase everything
 email = str_to_lower(email),

 # Phone numbers: strip everything that isn't a digit, then reformat
 phone = phone %>%
 str_remove_all("[^0-9]") %>%
 str_replace("(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3"),

 # Colors: squish + lowercase for consistency
 favorite_color = favorite_color %>% str_squish() %>% str_to_lower()
 )

survey_clean

# A tibble: 5 × 4
  respondent     email                phone        favorite_color
  <chr>          <chr>                <chr>        <chr>         
1 John Doe       john.doe@gmail.com   555-123-4567 blue          
2 Jane Smith     jane_smith@yahoo.com 555-234-5678 red           
3 Bob Johnson    bob@company.org      555-345-6789 green         
4 Alice Williams alice.w@school.edu   555-456-7890 blue          
5 Charlie Brown  charlie@email.com    555-567-8901 green

Five lines of mutate() and the chaos is gone. Let’s verify the colors actually standardized:

Code

survey_clean %>% count(favorite_color)

# A tibble: 3 × 2
  favorite_color     n
  <chr>          <int>
1 blue               2
2 green              2
3 red                1

Blue and green each appear twice. Order restored. Sanity preserved.

16.11 Cleaning Product Names

Pulling structured data out of messy product descriptions — you will do this way more often than you think:

Code

products <- tibble(
 description = c(
 "Widget A (SKU: WDG-001) - Price: $29.99",
 "Gadget B (SKU: GDG-042) - Price: $149.50",
 "Thingamajig (SKU: THG-100) - Price: $5.00"
 )
)

products %>%
 mutate(
 sku = str_extract(description, "[A-Z]{3}-\\d{3}"),
 price_raw = str_extract(description, "\\$[\\d.]+"),
 price = as.numeric(str_remove(price_raw, "\\$")),
 product_name = str_extract(description, "^[^(]+") %>% str_trim()
 ) %>%
 select(product_name, sku, price)

# A tibble: 3 × 3
  product_name sku     price
  <chr>        <chr>   <dbl>
1 Widget A     WDG-001  30.0
2 Gadget B     GDG-042 150. 
3 Thingamajig  THG-100   5

One messy text column became three clean, analysis-ready columns. That is the kind of transformation that gets you a “nice work” in the group chat.

16.12 Extracting Numbers from Sales Notes

Pulling numbers out of free-text notes so you can actually do math with them:

Code

sales_notes <- tibble(
 region = c("North", "North", "South", "South"),
 notes = c("Sold 15 units of product A",
 "Sold 8 units of product B",
 "Sold 22 units of product A",
 "Sold 3 units of product C")
)

sales_notes %>%
 mutate(
 units = as.integer(str_extract(notes, "\\d+")),
 product = str_extract(notes, "product [A-C]")
 ) %>%
 group_by(region, product) %>%
 summarize(total_units = sum(units), .groups = "drop")

# A tibble: 4 × 3
  region product   total_units
  <chr>  <chr>           <int>
1 North  product A          15
2 North  product B           8
3 South  product A          22
4 South  product C           3

Text data became a proper summary table. Drop that into the quarterly deck and watch heads nod approvingly.

16.13 Quick Reference

The functions you will actually use, ranked by how often you will reach for them:

What you want to do	Function
Glue strings together	`str_c()`
Lowercase / uppercase / title case	`str_to_lower()`, `str_to_upper()`, `str_to_title()`
Remove extra whitespace	`str_trim()`, `str_squish()`
Does it contain this pattern?	`str_detect()`
Pull out matching text	`str_extract()`
Find and replace	`str_replace()`, `str_replace_all()`
Delete a pattern	`str_remove()`, `str_remove_all()`
How many characters?	`str_length()`

And the three regex patterns that handle 90% of business problems:

Pattern	What it finds	Example
`\\d+`	Numbers	“Order 1234 placed”
`[A-Za-z]+`	Words (letters)	“SKU-12345”
`[\\w.]+@[\\w.]+`	Email addresses	“alice@example.com”

That is stringr. Clean your text, extract what you need, get back to the analysis that actually matters. Your data does not care how it was entered — but your boss cares how it looks when you present it.

16 Working with Text Data

16.1 Why You Need This

16.2 String Basics

16.3 Combining Strings with str_c()

16.4 Changing Case — Your Most-Used Trick

16.5 Trimming Whitespace

16.6 Detecting Patterns with str_detect()

16.7 Replacing Text with str_replace()

16.8 Extracting Text with str_extract()

16.9 Regular Expressions: Just the 3 Patterns You Actually Need

16.10 Putting It All Together: A Survey Data Cleaning Pipeline

16.11 Cleaning Product Names

16.12 Extracting Numbers from Sales Notes

16.13 Quick Reference

16.3 Combining Strings with `str_c()`

16.6 Detecting Patterns with `str_detect()`

16.7 Replacing Text with `str_replace()`

16.8 Extracting Text with `str_extract()`