---
title: "Modifying Data"
---
# Modifying Data {#modifydata}
Real-world data is rarely clean. Customer databases collect typos: "N/A", "NA", and "none" become three different categories in the same column. Survey exports arrive with cryptic column names, sometimes with embedded spaces or special characters. Numeric columns hide non-numeric entries like "$" signs or trailing whitespace. Wrangling — the work of reshaping, cleaning, and reformatting data — is where most analysts spend most of their time. The skill that separates someone who ships analyses from someone who fights error messages is being good at this.
This chapter teaches you how to reshape, clean, and wrestle your data into submission. For the full dplyr deep-dive, see Chapter \@ref(dplyr).
## Creating new variables
In business, you constantly need variables that do not exist yet in your raw data. Total revenue? That is price times quantity, and no database on Earth ships with that pre-calculated (wouldn't that be nice). Customer lifetime value? A combination of purchase frequency, average order size, and tenure. The point is, you will spend a surprising chunk of your career creating new columns.
The `mutate()` function from **dplyr** provides a clean way to create new variables. The big win here is readability --- you do not have to keep typing the data frame name over and over like you are being charged per keystroke.
```{r}
library(dplyr)
iris <- iris %>%
mutate(Sepal.dimensions = Sepal.Length + Sepal.Width)
```
Note how `mutate()` does not require you to repeat the data frame name for each variable --- just use column names directly. Your future self will thank you when reviewing old code at 11 PM the night before a deliverable.
There might be times when you want to overwrite an existing variable instead of creating a new one. Maybe you realized your revenue column forgot to include tax. In that case, just use the same variable name. Fair warning though --- there is no undo button. This is not Google Docs. Once you overwrite it, it is gone.
```{r}
iris <- iris %>%
mutate(Sepal.Length = Sepal.Length + Sepal.Width)
```
## Subsetting data and dropping factor levels
Subsetting narrows a dataset to just the rows that matter for the analysis at hand. You have 500,000 customer records and only the ones who spent more than $100 last quarter are relevant. You are analyzing survey responses but only from respondents in the Southeast. The mechanic is the same in every case: keep the rows that match a condition, drop the rest.
::: {.callout-warning}
## AI Pitfall: `filter()` silently drops NA rows
Ask an AI assistant for "customers with revenue greater than 1000" and you get back:
```r
df %>% filter(revenue > 1000)
```
The code is correct. What it does *not* tell you: rows where `revenue` is missing (`NA`) are silently excluded from the output. You did not say "drop NAs" — but `NA > 1000` evaluates to `NA`, and `filter()` keeps only rows where the condition is `TRUE`, so all NA rows disappear.
If your data has missing revenues for half your customers (common in inherited datasets), the AI's pipeline gives you a confidence-inspiring number — "412 customers above threshold" — that completely ignores the half of the data you did not know was missing. The pre-flight habits from Chapter 3 (`summary()`, `glimpse()`, `skim()`) catch this *before* you trust the output. Always check missingness before filtering.
If you want to be explicit, write `filter(!is.na(revenue) & revenue > 1000)` so the intent is visible to the next reader (often you, in three months).
:::
```{r}
library(dplyr)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
The `filter()` function from **dplyr** reads almost like plain English, which is a lifesaver when you come back to your code six months later. Think of it as pulling only the VIP customers from your database --- the ones above a certain spending threshold.
```{r}
irisSubset1 <- iris %>% filter(Sepal.Length >= 5.8)
summary(irisSubset1$Sepal.Length)
```
You can also stack multiple conditions. Want customers who are both high-spending AND from the Northeast? Same idea --- just combine your filters:
```{r}
irisSubset2 <- iris %>% filter(Sepal.Length >= 5.8, Sepal.Width < 3)
summary(irisSubset2)
```
Subsetting based on a factor variable works with `%in%`. The following creates a subset of the `iris` dataset that has only two types of Species --- like filtering your CRM to show only "Enterprise" and "Mid-Market" clients while ignoring the "Small Business" segment.
```{r}
irisSubset3 <- iris %>% filter(Species %in% c("setosa", "versicolor"))
```
Let us verify that we indeed have only two levels of the `Species` variable.
```{r}
summary(irisSubset3$Species)
```
Here is a fun little gotcha that will trip you up at least once (consider it a rite of passage): even though we excluded `virginica`, R still remembers it existed as a factor level --- it just shows 0 observations. It is like removing a department from your company but the name still shows up in the HR dropdown menu. Nobody works there anymore, but the system insists it is still an option. To clean that up, we re-factor:
```{r}
irisSubset3$Species=factor(irisSubset3$Species)
```
Let us verify that did the trick.
```{r}
summary(irisSubset3$Species)
```
## Reordering factor levels
By default, R orders factor levels alphabetically. That is fine until you realize your customer satisfaction survey goes "Dissatisfied, Neutral, Satisfied, Very Dissatisfied, Very Satisfied" --- which is alphabetical nonsense. You need it to flow logically, from "Very Dissatisfied" to "Very Satisfied," because that is how human brains (and your manager) expect to read it. Here is how you take control.
```{r}
levels(iris$Species)
```
The **forcats** package (Chapter \@ref(forcats)) makes reordering factor levels straightforward. The function names actually tell you what they do.
```{r}
library(forcats)
# Reverse the levels
iris$Species <- fct_rev(iris$Species)
levels(iris$Species)
```
```{r}
# Set specific order
iris$Species <- fct_relevel(iris$Species, "setosa", "versicolor", "virginica")
levels(iris$Species)
```
## Reordering dataset entries
Sorting data is one of those things you will do approximately fourteen times a day. Maybe you want your sales reps ranked from highest revenue to lowest so you can see who is crushing it (and who needs a pep talk). Maybe you want survey responses sorted by completion time.
The `arrange()` function is so readable that it almost feels like cheating. You can practically read it as a sentence: "take this data and arrange it by this column." Your code reviews will be a breeze.
For example, the `Sepal.Length` of the first row of the `iris` dataset:
```{r}
iris$Sepal.Length[1]
```
And the range of the variable:
```{r}
range(iris$Sepal.Length)
```
Sorting in ascending order (smallest to largest --- think ranking your sales team from the newest hire to the top closer):
```{r}
ascendingiris <- iris %>% arrange(Sepal.Length)
head(ascendingiris$Sepal.Length)
```
And descending order --- the "show me the leaderboard" view:
```{r}
descendingiris <- iris %>% arrange(desc(Sepal.Length))
head(descendingiris$Sepal.Length)
```
## Changing column names
Let us talk about one of the most underrated quality-of-life improvements in data analysis: giving your columns names that a human being can actually understand. Maybe your data arrived from a legacy database with column names like `X_VAR_007_Q3_FINAL_v2` or some brave soul used spaces and emojis. (Yes, this happens.) Renaming columns is a small investment that pays enormous dividends every time you or anyone else has to read the code.
Current names of columns in dataset:
```{r}
names(irisSubset3)
```
The `rename()` function from **dplyr** uses a `new_name = old_name` syntax, which reads naturally --- like you are telling R "hey, this column used to be called this, now call it that":
```{r}
#| eval: false
dataname <- dataname %>% rename(newnamecol1 = oldnamecol1, newnamecol2 = oldnamecol2)
```
Use `rename_with()` to apply a function to column names --- great for bulk cleanup when every column name is SCREAMING AT YOU IN ALL CAPS:
```{r}
#| eval: false
# Make all column names lowercase
dataname <- dataname %>% rename_with(tolower)
```
## Renaming and collapsing factor levels
Sometimes the categories in your data need a rebrand. Maybe the marketing team decided "Premium Tier" sounds better than "Segment C," or maybe you have seventeen product subcategories and your presentation has room for four. Renaming and collapsing factor levels is the data equivalent of reorganizing your company's org chart --- same people, different labels.
Let us make a copy of the iris dataset to experiment with (always a good idea --- never experiment on production data, kids):
```{r}
playiris=iris
levels(playiris$Species)
```
The `fct_recode()` function from **forcats** lets you rename specific levels explicitly. It is the safer method because you are being explicit about which label you are changing. It is like addressing a memo to a specific person rather than "whoever is sitting in the third chair":
```{r}
library(forcats)
playiris$Species <- fct_recode(playiris$Species,
"factor3" = "virginica",
"factor2" = "versicolor"
)
levels(playiris$Species)
```
You can also rename all levels at once:
```{r}
playiris$Species <- fct_recode(playiris$Species,
"newlevel1" = "setosa",
"newlevel2" = "factor2",
"newlevel3" = "factor3"
)
levels(playiris$Species)
```
When you want to merge several categories into one --- say you have "Heavy," "Regul," and "Occas" smokers and you want to collapse them all into "Smoker" --- use `fct_collapse()`:
```{r}
#| eval: false
# Example: collapsing multiple levels into fewer groups
survey$Smoke <- fct_collapse(survey$Smoke,
Smoker = c("Heavy", "Regul", "Occas"),
NonSmoker = "Never"
)
```
## Converting data types
Roughly half the errors you encounter in R trace to a variable being one type when you expected another. Numeric values arrive as character strings because of a stray `$` or comma. Survey ratings arrive as factors when you wanted them treated as numeric. Date columns arrive as character because the upstream system never declared them as dates. Five minutes here saves hours of debugging cryptic error messages later.
### Factor to numeric
This comes in handy when you have a factor variable with numbers as values and you want to treat it as a numeric variable. Imagine a customer satisfaction survey where ratings (1 through 5) got imported as categories instead of numbers. R sees "1" and "5" as labels, not quantities --- like seeing jersey numbers on a basketball team. The number 23 does not mean that player is "more" than player 7. This discussion is [based on this Stack Overflow post.](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information)
Let us go back to the `playiris` dataset. Remember the factor levels?
```{r}
levels(playiris$Species)
```
If we summarize the variable, it gives us the number of observations per level --- a head count, not a statistical summary:
```{r}
summary(playiris$Species)
```
Let us convert the level labels to numbers:
```{r}
levels(playiris$Species)=c("1","2","3")
summary(playiris$Species)
```
Here is the sneaky part. Even though the labels look like numbers now, R is still treating them as categories. It is still counting how many observations fall in each bucket instead of computing a mean. R does not care what the labels look like --- it cares about the data type under the hood. It is like putting a "Ferrari" bumper sticker on your Civic: it might say Ferrari, but it is still a Civic. To actually make the switch, you need to explicitly convert:
```{r}
playiris$Species=as.numeric(levels(playiris$Species))[playiris$Species]
# This can also be accomplished by the following.
# playiris$Species=as.numeric(as.character(playiris$Species))
```
Now if we summarize the variable, we get the real numeric summary --- min, max, mean, quartiles, the whole executive dashboard:
```{r}
summary(playiris$Species)
```
### Numeric to Factor - Binning variables
Let us re-convert the Species variable back to a factor:
```{r}
playiris$Species=factor(playiris$Species)
summary(playiris$Species)
```
Sometimes you need to go the other direction --- take a continuous number and chop it into categories. In business, this is incredibly common and you have probably already seen it without knowing the name for it. Think "low / medium / high revenue customers" or "small / medium / large deal sizes" or the classic credit-score tiers. This process is called binning, and it is one of the most practical things you will learn today.
```{r}
summary(playiris$Petal.Length)
```
Our goal is to place all observations with `Petal.Length` at or below the 25th percentile (1.6) as "low" and those above the 75th percentile (5.1) as "high." Everything in between gets "medium." Think of it like segmenting your customer base: the bottom 25% by spending are "budget," the top 25% are "premium," and everyone else is "standard." The `cut()` function does the heavy lifting:
```{r}
playiris$Petal.LengthFactor=cut(playiris$Petal.Length,breaks = c(1,1.6,5.1,6.9),labels = c("low","medium","high"),include.lowest = TRUE,ordered_result = TRUE)
```
Another approach uses nested `ifelse` statements --- a bit harder to read (OK, a lot harder to read), but flexible for custom logic:
```{r}
playiris$Petallength.Factor=ifelse(playiris$Petal.Length<=1.6,"low",ifelse(playiris$Petal.Length>5.1,"high","medium"))
```
Binning may also be needed to create equal interval bins and equal frequency bins. Please see [this response](https://stackoverflow.com/questions/42037740/equal-frequency-and-equal-width-binning-in-r) and [this response](https://stackoverflow.com/questions/5731116/equal-frequency-discretization-in-r/5741495#5741495) on Stack Overflow for more details on the possible solutions.
### Factor and Numeric to Character and vice versa
There are times when you need your values as plain text --- maybe for string matching, text mining, or just to get a clean CSV export for that one colleague who insists on doing everything in Excel. Converting between data types is like exchanging currency: the underlying value stays the same, but the format matters depending on where you are trying to spend it.
Let us look at the structure of our playground dataset:
```{r}
str(playiris)
```
Convert `Species` and `Petal.LengthFactor` to character variables:
```{r}
playiris$Species=as.character(playiris$Species)
playiris$Petal.LengthFactor=as.character(playiris$Petal.LengthFactor)
str(playiris)
```
The results confirm our conversion worked. Converting them back to factors or numeric is straightforward --- just use the corresponding `as.*` function. Think of it as a round trip: you can always convert back.
```{r}
playiris$Species=as.numeric(playiris$Species)
playiris$Petal.LengthFactor=as.factor(playiris$Petal.LengthFactor)
str(playiris)
```
The following diagram summarizes the conversion paths between factor, numeric, and character variables. Bookmark this one --- you will come back to it more often than you think.
<img src="04files/conversion.png">