15  Working with Text Data

16 Working with Text Data

Code
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

16.1 Why You Need This

A surprising amount of business data is text that humans typed by hand: survey responses, product names, customer addresses, free-form comment fields. Strings like " GREAT product!! " and "great Product" are functionally identical to a human reader, but R treats them as completely different values until you normalize them. The stringr package is the toolkit for that normalization plus everything else you need to do with text — pattern matching, extraction, replacement, splitting, and joining.

Every function in the package starts with str_, takes the string as its first argument (so it composes cleanly in pipelines), and does what its name suggests. The naming consistency is one of the package’s biggest practical advantages.

WarningAI Pitfall: AI’s regex looks right and matches the wrong things

Ask an AI assistant to “extract the customer’s last name from this column” and you reliably get back a regex like str_extract(name, "[A-Za-z]+$"). The pattern looks reasonable. It also fails on common cases:

  • Hyphenated names: “Smith-Jones” returns “Jones”
  • Names with apostrophes: “O’Brien” returns “Brien”
  • Names with spaces: “Van Der Berg” returns “Berg”
  • Names with non-ASCII characters: “Müller” returns “ller” because [A-Za-z] excludes accented letters

The AI’s pattern works on the easy cases and silently produces wrong results on everything else. The defensive habit: after applying any regex against real text data, sample the output (sample_n(df, 20) and look at the extracted values) for cases the AI did not anticipate. If you are working with international names, use [\\p{L}]+ instead of [A-Za-z]+ to match any Unicode letter — but the bigger lesson is that text data has long tails of edge cases, and trusting AI’s first-attempt regex on real data is a recipe for silent failure.

16.2 String Basics

Strings in R live inside quotes. Double quotes are standard.

Code
product <- "Wireless Bluetooth Speaker"
product
[1] "Wireless Bluetooth Speaker"

If your string itself contains a quote, wrap it in single quotes:

Code
tagline <- 'She said, "Take my money!"'
tagline
[1] "She said, \"Take my money!\""

str_length() counts characters — useful for flagging those one-word survey responses from people who clearly had somewhere else to be:

Code
responses <- c("Great product!", "OK", "This changed my life and I want to tell everyone", "N/A")
str_length(responses)
[1] 14  2 48  3

16.3 Combining Strings with str_c()

str_c() glues strings together. It is paste() but tidyverse-flavored.

Code
str_c("Q", 1:4)
[1] "Q1" "Q2" "Q3" "Q4"
Code
str_c("Hello", "World", sep = " ")
[1] "Hello World"

The collapse argument squashes a whole vector into one string — great for building labels or email subject lines:

Code
departments <- c("Marketing", "Finance", "Operations")
str_c(departments, collapse = ", ")
[1] "Marketing, Finance, Operations"
Code
str_c(departments, collapse = " & ")
[1] "Marketing & Finance & Operations"

16.4 Changing Case — Your Most-Used Trick

When survey respondents type “NEW YORK”, “new york”, and “New york” — R sees three different values. One line fixes all of them:

Code
cities <- c("new york", "LOS ANGELES", "chicago", "SAN FRANCISCO")
str_to_title(cities)
[1] "New York"      "Los Angeles"   "Chicago"       "San Francisco"

The full case family:

Code
sample_text <- "the QUICK brown FOX"
str_to_lower(sample_text)
[1] "the quick brown fox"
Code
str_to_upper(sample_text)
[1] "THE QUICK BROWN FOX"
Code
str_to_title(sample_text)
[1] "The Quick Brown Fox"

You will use these every single time you clean data where humans had a keyboard and zero standards. So… every time.

16.5 Trimming Whitespace

Extra spaces are the silent killer of data joins. If one spreadsheet has “Marketing” and another has ” Marketing “, your left_join() will silently drop rows and you will silently lose your mind trying to figure out why.

Code
messy <- " too much space "
str_trim(messy) # removes leading/trailing spaces
[1] "too much space"
Code
str_squish(messy) # also collapses internal spaces to one
[1] "too much space"

str_squish() handles everything. Just use it by default and move on with your life.

16.6 Detecting Patterns with str_detect()

“Does this string contain this word?” That is str_detect(). It returns TRUE/FALSE, which makes it a perfect partner for filter().

Code
products <- c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8", "iPhone SE")
str_detect(products, "iPhone")
[1]  TRUE FALSE  TRUE FALSE  TRUE

Inside a data frame — this is where it earns its paycheck:

Code
product_df <- tibble(
 product = c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8"),
 revenue = c(45000, 32000, 38000, 15000)
)

product_df %>% filter(str_detect(product, "iPhone"))
# A tibble: 2 × 2
  product       revenue
  <chr>           <dbl>
1 iPhone 15 Pro   45000
2 iPhone 15       38000

You just filtered to Apple products without needing a separate category column. Bill 3 hours for it anyway.

16.7 Replacing Text with str_replace()

str_replace() fixes the first match. str_replace_all() fixes every match. Think “Find and Replace” in Excel, but it actually works on 50,000 rows.

Code
reviews <- c("This product is gr8", "Loved it!! Will buy again!!", "Good product, good price")
str_replace_all(reviews, "!!", "!")
[1] "This product is gr8"       "Loved it! Will buy again!"
[3] "Good product, good price" 

Standardizing how people wrote “percent”:

Code
responses <- c("50 percent", "30%", "about 20 percent", "75%")
str_replace_all(responses, " percent", "%")
[1] "50%"       "30%"       "about 20%" "75%"      

str_remove() and str_remove_all() are shortcuts for replacing with nothing — perfect for stripping symbols:

Code
prices <- c("$29.99", "$149.50", "$5.00")
str_remove(prices, "\\$")
[1] "29.99"  "149.50" "5.00"  

16.8 Extracting Text with str_extract()

Need to pull a number out of a messy string? str_extract() grabs the first match.

Code
order_notes <- c("Order 1234 placed", "Invoice 5678 pending", "Shipped item 91011")
str_extract(order_notes, "\\d+")
[1] "1234"  "5678"  "91011"

That \\d+ is a regular expression (regex). Do not panic — the next section covers the only ones you will ever need.

16.9 Regular Expressions: Just the 3 Patterns You Actually Need

Regex looks like a cat walked across your keyboard. The good news: for 90% of business data cleaning, you only need three patterns. Seriously.

Pattern 1: \\d+ — Find numbers

Code
sales_notes <- c("Sold 15 units", "Returned 3 items", "Ordered 200 widgets")
str_extract(sales_notes, "\\d+")
[1] "15"  "3"   "200"

Pattern 2: [A-Za-z]+ — Find words (letters only)

Code
codes <- c("SKU-12345", "PROD-6789", "ITEM-111")
str_extract(codes, "[A-Z]+")
[1] "SKU"  "PROD" "ITEM"

Pattern 3: [\\w.]+@[\\w.]+ — Find email-like patterns

Code
messages <- c("Contact alice@example.com for details",
 "Email bob.smith@company.org for info",
 "No email here")
str_extract(messages, "[\\w.]+@[\\w.]+")
[1] "alice@example.com"     "bob.smith@company.org" NA                     

That is the whole regex syllabus. If you need something fancier, Google “regex cheat sheet.” Everyone does it — including the people who wrote the cheat sheet.

16.10 Putting It All Together: A Survey Data Cleaning Pipeline

Here is a realistic scenario. You received survey data and it looks like it was filled out during an earthquake on a moving bus:

Code
survey_raw <- tibble(
 respondent = c(" JOHN doe ", "jane SMITH", " Bob Johnson ",
 "ALICE williams", " charlie Brown"),
 email = c("John.Doe@Gmail.COM", "jane_smith@YAHOO.com",
 "BOB@company.ORG", "alice.w@School.Edu", "charlie@EMAIL.com"),
 phone = c("(555) 123-4567", "555.234.5678", "555 345 6789",
 "5554567890", "(555)-567-8901"),
 favorite_color = c(" Blue ", "RED", " green ", "BLUE", " Green")
)

survey_clean <- survey_raw %>%
 mutate(
 # Clean names: squish + title case
 respondent = respondent %>% str_squish() %>% str_to_title(),

 # Emails: just lowercase everything
 email = str_to_lower(email),

 # Phone numbers: strip everything that isn't a digit, then reformat
 phone = phone %>%
 str_remove_all("[^0-9]") %>%
 str_replace("(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3"),

 # Colors: squish + lowercase for consistency
 favorite_color = favorite_color %>% str_squish() %>% str_to_lower()
 )

survey_clean
# A tibble: 5 × 4
  respondent     email                phone        favorite_color
  <chr>          <chr>                <chr>        <chr>         
1 John Doe       john.doe@gmail.com   555-123-4567 blue          
2 Jane Smith     jane_smith@yahoo.com 555-234-5678 red           
3 Bob Johnson    bob@company.org      555-345-6789 green         
4 Alice Williams alice.w@school.edu   555-456-7890 blue          
5 Charlie Brown  charlie@email.com    555-567-8901 green         

Five lines of mutate() and the chaos is gone. Let’s verify the colors actually standardized:

Code
survey_clean %>% count(favorite_color)
# A tibble: 3 × 2
  favorite_color     n
  <chr>          <int>
1 blue               2
2 green              2
3 red                1

Blue and green each appear twice. Order restored. Sanity preserved.

16.11 Cleaning Product Names

Pulling structured data out of messy product descriptions — you will do this way more often than you think:

Code
products <- tibble(
 description = c(
 "Widget A (SKU: WDG-001) - Price: $29.99",
 "Gadget B (SKU: GDG-042) - Price: $149.50",
 "Thingamajig (SKU: THG-100) - Price: $5.00"
 )
)

products %>%
 mutate(
 sku = str_extract(description, "[A-Z]{3}-\\d{3}"),
 price_raw = str_extract(description, "\\$[\\d.]+"),
 price = as.numeric(str_remove(price_raw, "\\$")),
 product_name = str_extract(description, "^[^(]+") %>% str_trim()
 ) %>%
 select(product_name, sku, price)
# A tibble: 3 × 3
  product_name sku     price
  <chr>        <chr>   <dbl>
1 Widget A     WDG-001  30.0
2 Gadget B     GDG-042 150. 
3 Thingamajig  THG-100   5  

One messy text column became three clean, analysis-ready columns. That is the kind of transformation that gets you a “nice work” in the group chat.

16.12 Extracting Numbers from Sales Notes

Pulling numbers out of free-text notes so you can actually do math with them:

Code
sales_notes <- tibble(
 region = c("North", "North", "South", "South"),
 notes = c("Sold 15 units of product A",
 "Sold 8 units of product B",
 "Sold 22 units of product A",
 "Sold 3 units of product C")
)

sales_notes %>%
 mutate(
 units = as.integer(str_extract(notes, "\\d+")),
 product = str_extract(notes, "product [A-C]")
 ) %>%
 group_by(region, product) %>%
 summarize(total_units = sum(units), .groups = "drop")
# A tibble: 4 × 3
  region product   total_units
  <chr>  <chr>           <int>
1 North  product A          15
2 North  product B           8
3 South  product A          22
4 South  product C           3

Text data became a proper summary table. Drop that into the quarterly deck and watch heads nod approvingly.

16.13 Quick Reference

The functions you will actually use, ranked by how often you will reach for them:

What you want to do Function
Glue strings together str_c()
Lowercase / uppercase / title case str_to_lower(), str_to_upper(), str_to_title()
Remove extra whitespace str_trim(), str_squish()
Does it contain this pattern? str_detect()
Pull out matching text str_extract()
Find and replace str_replace(), str_replace_all()
Delete a pattern str_remove(), str_remove_all()
How many characters? str_length()

And the three regex patterns that handle 90% of business problems:

Pattern What it finds Example
\\d+ Numbers “Order 1234 placed”
[A-Za-z]+ Words (letters) SKU-12345”
[\\w.]+@[\\w.]+ Email addresses alice@example.com

That is stringr. Clean your text, extract what you need, get back to the analysis that actually matters. Your data does not care how it was entered — but your boss cares how it looks when you present it.