---
title: "Working with Text Data"
---
# Working with Text Data {#stringr}
```{r}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Why You Need This
A surprising amount of business data is text that humans typed by hand: survey responses, product names, customer addresses, free-form comment fields. Strings like `" GREAT product!! "` and `"great Product"` are functionally identical to a human reader, but R treats them as completely different values until you normalize them. The `stringr` package is the toolkit for that normalization plus everything else you need to do with text — pattern matching, extraction, replacement, splitting, and joining.
Every function in the package starts with `str_`, takes the string as its first argument (so it composes cleanly in pipelines), and does what its name suggests. The naming consistency is one of the package's biggest practical advantages.
::: {.callout-warning}
## AI Pitfall: AI's regex looks right and matches the wrong things
Ask an AI assistant to "extract the customer's last name from this column" and you reliably get back a regex like `str_extract(name, "[A-Za-z]+$")`. The pattern looks reasonable. It also fails on common cases:
- Hyphenated names: "Smith-Jones" returns "Jones"
- Names with apostrophes: "O'Brien" returns "Brien"
- Names with spaces: "Van Der Berg" returns "Berg"
- Names with non-ASCII characters: "Müller" returns "ller" because `[A-Za-z]` excludes accented letters
The AI's pattern works on the easy cases and silently produces wrong results on everything else. The defensive habit: after applying any regex to real text data, sample the output (for example, `slice_sample(df, n = 20)`) and inspect the extracted values for cases the AI did not anticipate. If you are working with international names, use `[\\p{L}]+` instead of `[A-Za-z]+` to match any Unicode letter (hyphens, apostrophes, and spaces still need separate handling). The bigger lesson is that text data has long tails of edge cases, and trusting an AI's first-attempt regex on real data is a recipe for silent failure.
:::
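To see the failure modes from the callout, here is a minimal sketch on a handful of made-up names (`customer_names` is a hypothetical vector, not real data). The second pattern is one possible improvement, assuming a last name is the final run of letters, hyphens, and apostrophes. It still cannot know that "Van Der Berg" is a single surname, which is exactly why sampling the output matters.

```{r}
library(stringr)

# Hypothetical names, one per edge case from the callout
customer_names <- c("Ann Smith", "Bo Smith-Jones", "Cy O'Brien",
                    "Di Van Der Berg", "Ed Müller")

# The typical AI-suggested pattern: ASCII letters at the end of the string
ascii_last <- str_extract(customer_names, "[A-Za-z]+$")
ascii_last

# Any Unicode letter, plus apostrophes and hyphens
unicode_last <- str_extract(customer_names, "[\\p{L}'\\-]+$")
unicode_last
```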
## String Basics
Strings in R live inside quotes. Double quotes are standard.
```{r}
product <- "Wireless Bluetooth Speaker"
product
```
If the string itself contains double quotes, wrap it in single quotes:
```{r}
tagline <- 'She said, "Take my money!"'
tagline
```
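When a string needs both kinds of quotes, switching the outer quotes is not enough. A short sketch of the alternative, escaping the inner double quotes with a backslash (`review` is a made-up value): printing shows the escapes, while `writeLines()` shows the text as a reader would see it.

```{r}
# Made-up review containing both an apostrophe and double quotes
review <- "It's the \"best\" speaker I own"
review              # the printed representation keeps the backslashes
writeLines(review)  # the actual text, without escapes
```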
`str_length()` counts characters --- useful for flagging those one-word survey responses from people who clearly had somewhere else to be:
```{r}
responses <- c("Great product!", "OK", "This changed my life and I want to tell everyone", "N/A")
str_length(responses)
```
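As a sketch of that flagging idea, the chunk below puts hypothetical responses in a tibble and keeps only the short ones. The 5-character cutoff is an arbitrary assumption; choose one that suits your survey.

```{r}
library(dplyr)
library(stringr)

# Hypothetical survey responses
feedback <- tibble(
  id = 1:4,
  response = c("Great product!", "OK",
               "This changed my life and I want to tell everyone", "N/A")
)

# Keep responses shorter than 5 characters (arbitrary cutoff)
short_answers <- feedback %>%
  filter(str_length(response) < 5)
short_answers
```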
## Combining Strings with `str_c()`
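`str_c()` glues strings together. It behaves like `paste0()` (no separator by default), with tidyverse-style handling of missing values: an `NA` input yields an `NA` result.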
```{r}
str_c("Q", 1:4)
str_c("Hello", "World", sep = " ")
```
The `collapse` argument squashes a whole vector into one string --- great for building labels or email subject lines:
```{r}
departments <- c("Marketing", "Finance", "Operations")
str_c(departments, collapse = ", ")
str_c(departments, collapse = " & ")
```
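One practical difference from base R shows up with missing values. The sketch below uses a hypothetical `depts` vector to compare `paste0()` with `str_c()`, and then uses `str_replace_na()` to opt in to a placeholder; the label "Unknown" is only an illustration.

```{r}
library(stringr)

# Hypothetical department vector with one missing value
depts <- c("Marketing", NA, "Operations")

labels_base <- paste0("Dept: ", depts)   # the NA becomes the text "Dept: NA"
labels_str <- str_c("Dept: ", depts)     # the NA stays NA
labels_filled <- str_c("Dept: ", str_replace_na(depts, "Unknown"))

labels_base
labels_str
labels_filled
```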
## Changing Case --- Your Most-Used Trick
When survey respondents type "NEW YORK", "new york", and "New york" --- R sees three different values. One line fixes all of them:
```{r}
cities <- c("new york", "LOS ANGELES", "chicago", "SAN FRANCISCO")
str_to_title(cities)
```
The full case family:
```{r}
sample_text <- "the QUICK brown FOX"
str_to_lower(sample_text)
str_to_upper(sample_text)
str_to_title(sample_text)
```
You will use these every single time you clean data where humans had a keyboard and zero standards. So... every time.
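A quick way to confirm that case normalization did its job is to count distinct values before and after. This sketch reuses the three spellings from the start of this section; `city_raw` is just an illustrative name.

```{r}
library(stringr)

# The three spellings from the start of this section
city_raw <- c("NEW YORK", "new york", "New york")

n_raw <- length(unique(city_raw))                  # distinct values before
n_clean <- length(unique(str_to_title(city_raw)))  # distinct values after
c(before = n_raw, after = n_clean)
```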
## Trimming Whitespace
Extra spaces are the silent killer of data joins. If one spreadsheet has "Marketing" and another has " Marketing ", your `left_join()` will silently fail to match those rows and fill the joined columns with `NA` (an `inner_join()` would drop them entirely), and you will silently lose your mind trying to figure out why.
```{r}
messy <- "   too    much   space   "
str_trim(messy)    # removes leading/trailing whitespace only
str_squish(messy)  # also collapses internal runs of whitespace to one space
```
`str_squish()` trims both ends and collapses internal runs of whitespace, so it covers both cases. Just use it by default and move on with your life.
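The join problem is easy to reproduce. In this sketch, `budgets` and `headcount` are hypothetical tables: the padded key fails to match, so the joined column comes back as `NA`, and squishing the key first repairs the join.

```{r}
library(dplyr)
library(stringr)

# Hypothetical tables; one key in headcount has stray spaces
budgets <- tibble(dept = c("Marketing", "Finance"), budget = c(100, 200))
headcount <- tibble(dept = c(" Marketing ", "Finance"), staff = c(12, 8))

# Keys do not match exactly, so staff is NA for Marketing
joined_raw <- left_join(budgets, headcount, by = "dept")
joined_raw

# Clean the key first, then join
joined_clean <- left_join(
  budgets,
  mutate(headcount, dept = str_squish(dept)),
  by = "dept"
)
joined_clean
```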
## Detecting Patterns with `str_detect()`
"Does this string contain this word?" That is `str_detect()`. It returns TRUE/FALSE, which makes it a perfect partner for `filter()`.
```{r}
products <- c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8", "iPhone SE")
str_detect(products, "iPhone")
```
Inside a data frame --- this is where it earns its paycheck:
```{r}
product_df <- tibble(
  product = c("iPhone 15 Pro", "Samsung Galaxy S24", "iPhone 15", "Google Pixel 8"),
  revenue = c(45000, 32000, 38000, 15000)
)
product_df %>% filter(str_detect(product, "iPhone"))
```
You just filtered to Apple products without needing a separate category column. Bill 3 hours for it anyway.
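`str_detect()` is case-sensitive by default, which matters when the column was typed by hand. A minimal sketch, assuming a hypothetical `typed_products` vector, that wraps the pattern in `regex()` with `ignore_case = TRUE`:

```{r}
library(stringr)

# Hypothetical product names with inconsistent capitalization
typed_products <- c("iPhone 15 Pro", "IPHONE 15", "iphone se", "Google Pixel 8")

exact_match <- str_detect(typed_products, "iPhone")
any_case <- str_detect(typed_products, regex("iphone", ignore_case = TRUE))

exact_match
any_case
```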
## Replacing Text with `str_replace()`
`str_replace()` fixes the first match in each string. `str_replace_all()` fixes every match. Think "Find and Replace" in Excel, but it actually works on 50,000 rows.
```{r}
reviews <- c("This product is gr8", "Loved it!! Will buy again!!", "Good product, good price")
str_replace_all(reviews, "!!", "!")
```
Standardizing how people wrote "percent":
```{r}
responses <- c("50 percent", "30%", "about 20 percent", "75%")
str_replace_all(responses, " percent", "%")
```
`str_remove()` and `str_remove_all()` are shortcuts for replacing with nothing --- perfect for stripping symbols:
```{r}
prices <- c("$29.99", "$149.50", "$5.00")
str_remove(prices, "\\$")
```
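Because `str_remove()` only deletes the first match in each string, prices with thousands separators need `str_remove_all()`. This is a sketch on hypothetical values; the character class `[\\$,]` matches either a dollar sign or a comma.

```{r}
library(stringr)

# Hypothetical prices, two with thousands separators
big_prices <- c("$29.99", "$1,299.00", "$12,500")

# Remove every dollar sign and comma, then convert to numeric
price_num <- as.numeric(str_remove_all(big_prices, "[\\$,]"))
price_num
```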
## Extracting Text with `str_extract()`
Need to pull a number out of a messy string? `str_extract()` grabs the first match.
```{r}
order_notes <- c("Order 1234 placed", "Invoice 5678 pending", "Shipped item 91011")
str_extract(order_notes, "\\d+")
```
That `\\d+` is a regular expression (regex). Do not panic: the next section covers the few patterns you will use most.
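When a string holds more than one number, `str_extract()` returns only the first. A short sketch with a made-up `note` shows the difference from `str_extract_all()`, which returns a list with one character vector per input string.

```{r}
library(stringr)

# Hypothetical note containing two numbers
note <- "Order 1234 placed, replaces order 5678"

first_number <- str_extract(note, "\\d+")
all_numbers <- str_extract_all(note, "\\d+")[[1]]  # list output; take element 1

first_number
all_numbers
```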
## Regular Expressions: Just the 3 Patterns You Actually Need
Regex looks like a cat walked across your keyboard. The good news: for 90% of business data cleaning, you only need three patterns. Seriously.
**Pattern 1: `\\d+` --- Find numbers**
```{r}
sales_notes <- c("Sold 15 units", "Returned 3 items", "Ordered 200 widgets")
str_extract(sales_notes, "\\d+")
```
**Pattern 2: `[A-Za-z]+` --- Find words (letters only)**
```{r}
codes <- c("SKU-12345", "PROD-6789", "ITEM-111")
str_extract(codes, "[A-Za-z]+")
```
**Pattern 3: `[\\w.]+@[\\w.]+` --- Find email-like patterns**
```{r}
messages <- c("Contact alice@example.com for details",
              "Email bob.smith@company.org for info",
              "No email here")
str_extract(messages, "[\\w.]+@[\\w.]+")
```
That is the whole regex syllabus. If you need something fancier, Google "regex cheat sheet." Everyone does it --- including the people who wrote the cheat sheet.
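One caveat with the email pattern: because it allows dots anywhere, it also swallows a period that ends the sentence. The sketch below, on a made-up `msg`, compares it with a slightly stricter variant that requires the match to end in a dot followed by word characters. The stricter pattern is an illustration, not a full email validator.

```{r}
library(stringr)

# Hypothetical sentence where the address is followed by a period
msg <- "Please write to alice@example.com."

loose <- str_extract(msg, "[\\w.]+@[\\w.]+")           # keeps the trailing period
strict <- str_extract(msg, "[\\w.]+@[\\w.]+\\.\\w+")   # must end in .something

loose
strict
```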
## Putting It All Together: A Survey Data Cleaning Pipeline
Here is a realistic scenario. You received survey data and it looks like it was filled out during an earthquake on a moving bus:
```{r}
survey_raw <- tibble(
  respondent = c(" JOHN doe ", "jane SMITH", " Bob Johnson ",
                 "ALICE williams", " charlie Brown"),
  email = c("John.Doe@Gmail.COM", "jane_smith@YAHOO.com",
            "BOB@company.ORG", "alice.w@School.Edu", "charlie@EMAIL.com"),
  phone = c("(555) 123-4567", "555.234.5678", "555 345 6789",
            "5554567890", "(555)-567-8901"),
  favorite_color = c(" Blue ", "RED", " green ", "BLUE", " Green")
)
survey_clean <- survey_raw %>%
  mutate(
    # Clean names: squish + title case
    respondent = respondent %>% str_squish() %>% str_to_title(),
    # Emails: just lowercase everything
    email = str_to_lower(email),
    # Phone numbers: strip everything that isn't a digit, then reformat
    phone = phone %>%
      str_remove_all("[^0-9]") %>%
      str_replace("(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3"),
    # Colors: squish + lowercase for consistency
    favorite_color = favorite_color %>% str_squish() %>% str_to_lower()
  )
survey_clean
```
Four columns cleaned in a single `mutate()` and the chaos is gone. Let's verify the colors actually standardized:
```{r}
survey_clean %>% count(favorite_color)
```
Blue and green each appear twice, red once. Order restored. Sanity preserved.
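The phone step in the pipeline assumes every value contains exactly 10 digits. With a country code, the unanchored pattern would reformat the first 10 digits and leave the rest attached ("15552345678" becomes "155-523-45678"). A hedged sketch of one way to guard against that, using hypothetical values and setting anything that is not exactly 10 digits to `NA` for manual review:

```{r}
library(dplyr)
library(stringr)

# Hypothetical phone values: one valid, one with a country code, one too short
phones <- tibble(raw = c("(555) 123-4567", "+1 555 234 5678", "555-1234"))

phones_checked <- phones %>%
  mutate(
    digits = str_remove_all(raw, "[^0-9]"),
    # Reformat only when exactly 10 digits remain; otherwise flag as missing
    phone = if_else(
      str_detect(digits, "^\\d{10}$"),
      str_replace(digits, "(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3"),
      NA_character_
    )
  )
phones_checked
```

Whether to drop, flag, or repair the non-conforming values is a judgment call that depends on your data; `NA` is simply the most visible choice.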
## Cleaning Product Names
Pulling structured data out of messy product descriptions --- you will do this way more often than you think:
```{r}
products <- tibble(
  description = c(
    "Widget A (SKU: WDG-001) - Price: $29.99",
    "Gadget B (SKU: GDG-042) - Price: $149.50",
    "Thingamajig (SKU: THG-100) - Price: $5.00"
  )
)
products %>%
  mutate(
    sku = str_extract(description, "[A-Z]{3}-\\d{3}"),
    price_raw = str_extract(description, "\\$[\\d.]+"),
    price = as.numeric(str_remove(price_raw, "\\$")),
    product_name = str_extract(description, "^[^(]+") %>% str_trim()
  ) %>%
  select(product_name, sku, price)
```
One messy text column became three clean, analysis-ready columns. That is the kind of transformation that gets you a "nice work" in the group chat.
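Real descriptions will not always follow the pattern. `str_extract()` returns `NA` when nothing matches, so a quick filter on `is.na()` surfaces the rows that need attention. A minimal sketch, assuming a hypothetical `catalog` where one description has no SKU:

```{r}
library(dplyr)
library(stringr)

# Hypothetical descriptions; the second one has no SKU
catalog <- tibble(
  description = c("Widget A (SKU: WDG-001) - Price: $29.99",
                  "Mystery Box - Price: $10.00")
)

catalog_checked <- catalog %>%
  mutate(sku = str_extract(description, "[A-Z]{3}-\\d{3}"))

# Rows where the pattern found nothing
missing_sku <- catalog_checked %>% filter(is.na(sku))
missing_sku
```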
## Extracting Numbers from Sales Notes
Pulling numbers out of free-text notes so you can actually do math with them:
```{r}
sales_notes <- tibble(
  region = c("North", "North", "South", "South"),
  notes = c("Sold 15 units of product A",
            "Sold 8 units of product B",
            "Sold 22 units of product A",
            "Sold 3 units of product C")
)
sales_notes %>%
  mutate(
    units = as.integer(str_extract(notes, "\\d+")),
    product = str_extract(notes, "product [A-C]")
  ) %>%
  group_by(region, product) %>%
  summarize(total_units = sum(units), .groups = "drop")
```
Text data became a proper summary table. Drop that into the quarterly deck and watch heads nod approvingly.
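One edge case worth checking before trusting those totals: `\\d+` stops at the first non-digit, so a note with a thousands separator yields only the leading digits. The sketch below uses a made-up `big_note` and one possible workaround, allowing commas inside the number and stripping them before conversion.

```{r}
library(stringr)

# Hypothetical note with a thousands separator
big_note <- "Sold 1,200 units of product A"

digits_only <- str_extract(big_note, "\\d+")        # stops at the comma
with_commas <- str_extract(big_note, "\\d[\\d,]*")  # allows commas inside the number
units <- as.integer(str_remove_all(with_commas, ","))

digits_only
units
```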
## Quick Reference
The functions you will actually use, ranked by how often you will reach for them:
| What you want to do | Function |
|:---------------------|:---------|
| Glue strings together | `str_c()` |
| Lowercase / uppercase / title case | `str_to_lower()`, `str_to_upper()`, `str_to_title()` |
| Remove extra whitespace | `str_trim()`, `str_squish()` |
| Does it contain this pattern? | `str_detect()` |
| Pull out matching text | `str_extract()` |
| Find and replace | `str_replace()`, `str_replace_all()` |
| Delete a pattern | `str_remove()`, `str_remove_all()` |
| How many characters? | `str_length()` |
And the three regex patterns that handle 90% of business problems:
| Pattern | What it finds | Example |
|:--------|:-------------|:--------|
| `\\d+` | Numbers | "Order **1234** placed" |
| `[A-Za-z]+` | Words (letters) | "**SKU**-12345" |
| `[\\w.]+@[\\w.]+` | Email addresses | "**alice@example.com**" |
That is stringr. Clean your text, extract what you need, get back to the analysis that actually matters. Your data does not care how it was entered --- but your boss cares how it looks when you present it.