5 Modifying Data

6 Modifying Data

Real-world data is rarely clean. Customer databases collect typos: “N/A”, “NA”, and “none” become three different categories in the same column. Survey exports arrive with cryptic column names, sometimes with embedded spaces or special characters. Numeric columns hide non-numeric entries like “$” signs or trailing whitespace. Wrangling — the work of reshaping, cleaning, and reformatting data — is where most analysts spend most of their time. The skill that separates someone who ships analyses from someone who fights error messages is being good at this.

This chapter teaches you how to reshape, clean, and wrestle your data into submission. For the full dplyr deep-dive, see Chapter @ref(dplyr).

6.1 Creating new variables

In business, you constantly need variables that do not exist yet in your raw data. Total revenue? That is price times quantity, and no database on Earth ships with that pre-calculated (wouldn’t that be nice). Customer lifetime value? A combination of purchase frequency, average order size, and tenure. The point is, you will spend a surprising chunk of your career creating new columns.

The mutate() function from dplyr provides a clean way to create new variables. The big win here is readability — you do not have to keep typing the data frame name over and over like you are being charged per keystroke.

Code

library(dplyr)

Warning: package 'dplyr' was built under R version 4.5.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

iris <- iris %>%
 mutate(Sepal.dimensions = Sepal.Length + Sepal.Width)

Note how mutate() does not require you to repeat the data frame name for each variable — just use column names directly. Your future self will thank you when reviewing old code at 11 PM the night before a deliverable.

There might be times when you want to overwrite an existing variable instead of creating a new one. Maybe you realized your revenue column forgot to include tax. In that case, just use the same variable name. Fair warning though — there is no undo button. This is not Google Docs. Once you overwrite it, it is gone.

Code

iris <- iris %>%
 mutate(Sepal.Length = Sepal.Length + Sepal.Width)

6.2 Subsetting data and dropping factor levels

Subsetting narrows a dataset to just the rows that matter for the analysis at hand. You have 500,000 customer records and only the ones who spent more than $100 last quarter are relevant. You are analyzing survey responses but only from respondents in the Southeast. The mechanic is the same in every case: keep the rows that match a condition, drop the rest.

AI Pitfall: filter() silently drops NA rows

Ask an AI assistant for “customers with revenue greater than 1000” and you get back:

df %>% filter(revenue > 1000)

The code is correct. What it does not tell you: rows where revenue is missing (NA) are silently excluded from the output. You did not say “drop NAs” — but NA > 1000 evaluates to NA, and filter() keeps only rows where the condition is TRUE, so all NA rows disappear.

If your data has missing revenues for half your customers (common in inherited datasets), the AI’s pipeline gives you a confidence-inspiring number — “412 customers above threshold” — that completely ignores the half of the data you did not know was missing. The pre-flight habits from Chapter 3 (summary(), glimpse(), skim()) catch this before you trust the output. Always check missingness before filtering.

If you want to be explicit, write filter(!is.na(revenue) & revenue > 1000) so the intent is visible to the next reader (often you, in three months).

Code

library(dplyr)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

The filter() function from dplyr reads almost like plain English, which is a lifesaver when you come back to your code six months later. Think of it as pulling only the VIP customers from your database — the ones above a certain spending threshold.

Code

irisSubset1 <- iris %>% filter(Sepal.Length >= 5.8)
summary(irisSubset1$Sepal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  6.800   8.300   8.850   8.901   9.575  11.700

You can also stack multiple conditions. Want customers who are both high-spending AND from the Northeast? Same idea — just combine your filters:

Code

irisSubset2 <- iris %>% filter(Sepal.Length >= 5.8, Sepal.Width < 3)
summary(irisSubset2)

  Sepal.Length     Sepal.Width    Petal.Length    Petal.Width   
 Min.   : 6.800   Min.   :2.00   Min.   :1.300   Min.   :0.200  
 1st Qu.: 8.200   1st Qu.:2.50   1st Qu.:4.000   1st Qu.:1.200  
 Median : 8.500   Median :2.70   Median :4.500   Median :1.400  
 Mean   : 8.593   Mean   :2.64   Mean   :4.509   Mean   :1.449  
 3rd Qu.: 9.100   3rd Qu.:2.80   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :10.500   Max.   :2.90   Max.   :6.900   Max.   :2.400  
       Species   Sepal.dimensions
 setosa    : 2   Min.   : 6.800  
 versicolor:34   1st Qu.: 8.200  
 virginica :21   Median : 8.500  
                 Mean   : 8.593  
                 3rd Qu.: 9.100  
                 Max.   :10.500

Subsetting based on a factor variable works with %in%. The following creates a subset of the iris dataset that has only two types of Species — like filtering your CRM to show only “Enterprise” and “Mid-Market” clients while ignoring the “Small Business” segment.

Code

irisSubset3 <- iris %>% filter(Species %in% c("setosa", "versicolor"))

Let us verify that we indeed have only two levels of the Species variable.

Code

summary(irisSubset3$Species)

    setosa versicolor  virginica 
        50         50          0

Here is a fun little gotcha that will trip you up at least once (consider it a rite of passage): even though we excluded virginica, R still remembers it existed as a factor level — it just shows 0 observations. It is like removing a department from your company but the name still shows up in the HR dropdown menu. Nobody works there anymore, but the system insists it is still an option. To clean that up, we re-factor:

Code

irisSubset3$Species=factor(irisSubset3$Species)

Let us verify that did the trick.

Code

summary(irisSubset3$Species)

    setosa versicolor 
        50         50

6.3 Reordering factor levels

By default, R orders factor levels alphabetically. That is fine until you realize your customer satisfaction survey goes “Dissatisfied, Neutral, Satisfied, Very Dissatisfied, Very Satisfied” — which is alphabetical nonsense. You need it to flow logically, from “Very Dissatisfied” to “Very Satisfied,” because that is how human brains (and your manager) expect to read it. Here is how you take control.

Code

levels(iris$Species)

[1] "setosa"     "versicolor" "virginica"

The forcats package (Chapter @ref(forcats)) makes reordering factor levels straightforward. The function names actually tell you what they do.

Code

library(forcats)
# Reverse the levels
iris$Species <- fct_rev(iris$Species)
levels(iris$Species)

[1] "virginica"  "versicolor" "setosa"

Code

# Set specific order
iris$Species <- fct_relevel(iris$Species, "setosa", "versicolor", "virginica")
levels(iris$Species)

[1] "setosa"     "versicolor" "virginica"

6.4 Reordering dataset entries

Sorting data is one of those things you will do approximately fourteen times a day. Maybe you want your sales reps ranked from highest revenue to lowest so you can see who is crushing it (and who needs a pep talk). Maybe you want survey responses sorted by completion time.

The arrange() function is so readable that it almost feels like cheating. You can practically read it as a sentence: “take this data and arrange it by this column.” Your code reviews will be a breeze.

For example, the Sepal.Length of the first row of the iris dataset:

Code

iris$Sepal.Length[1]

[1] 8.6

And the range of the variable:

Code

range(iris$Sepal.Length)

[1]  6.8 11.7

Sorting in ascending order (smallest to largest — think ranking your sales team from the newest hire to the top closer):

Code

ascendingiris <- iris %>% arrange(Sepal.Length)
head(ascendingiris$Sepal.Length)

[1] 6.8 7.0 7.3 7.3 7.3 7.3

And descending order — the “show me the leaderboard” view:

Code

descendingiris <- iris %>% arrange(desc(Sepal.Length))
head(descendingiris$Sepal.Length)

[1] 11.7 11.5 10.8 10.7 10.6 10.5

6.5 Changing column names

Let us talk about one of the most underrated quality-of-life improvements in data analysis: giving your columns names that a human being can actually understand. Maybe your data arrived from a legacy database with column names like X_VAR_007_Q3_FINAL_v2 or some brave soul used spaces and emojis. (Yes, this happens.) Renaming columns is a small investment that pays enormous dividends every time you or anyone else has to read the code.

Current names of columns in dataset:

Code

names(irisSubset3)

[1] "Sepal.Length"     "Sepal.Width"      "Petal.Length"     "Petal.Width"     
[5] "Species"          "Sepal.dimensions"

The rename() function from dplyr uses a new_name = old_name syntax, which reads naturally — like you are telling R “hey, this column used to be called this, now call it that”:

Code

dataname <- dataname %>% rename(newnamecol1 = oldnamecol1, newnamecol2 = oldnamecol2)

Use rename_with() to apply a function to column names — great for bulk cleanup when every column name is SCREAMING AT YOU IN ALL CAPS:

Code

# Make all column names lowercase
dataname <- dataname %>% rename_with(tolower)

6.6 Renaming and collapsing factor levels

Sometimes the categories in your data need a rebrand. Maybe the marketing team decided “Premium Tier” sounds better than “Segment C,” or maybe you have seventeen product subcategories and your presentation has room for four. Renaming and collapsing factor levels is the data equivalent of reorganizing your company’s org chart — same people, different labels.

Let us make a copy of the iris dataset to experiment with (always a good idea — never experiment on production data, kids):

Code

playiris=iris
levels(playiris$Species)

[1] "setosa"     "versicolor" "virginica"

The fct_recode() function from forcats lets you rename specific levels explicitly. It is the safer method because you are being explicit about which label you are changing. It is like addressing a memo to a specific person rather than “whoever is sitting in the third chair”:

Code

library(forcats)
playiris$Species <- fct_recode(playiris$Species,
 "factor3" = "virginica",
 "factor2" = "versicolor"
)
levels(playiris$Species)

[1] "setosa"  "factor2" "factor3"

You can also rename all levels at once:

Code

playiris$Species <- fct_recode(playiris$Species,
 "newlevel1" = "setosa",
 "newlevel2" = "factor2",
 "newlevel3" = "factor3"
)
levels(playiris$Species)

[1] "newlevel1" "newlevel2" "newlevel3"

When you want to merge several categories into one — say you have “Heavy,” “Regul,” and “Occas” smokers and you want to collapse them all into “Smoker” — use fct_collapse():

Code

# Example: collapsing multiple levels into fewer groups
survey$Smoke <- fct_collapse(survey$Smoke,
 Smoker = c("Heavy", "Regul", "Occas"),
 NonSmoker = "Never"
)

6.7 Converting data types

Roughly half the errors you encounter in R trace to a variable being one type when you expected another. Numeric values arrive as character strings because of a stray $ or comma. Survey ratings arrive as factors when you wanted them treated as numeric. Date columns arrive as character because the upstream system never declared them as dates. Five minutes here saves hours of debugging cryptic error messages later.

6.7.1 Factor to numeric

This comes in handy when you have a factor variable with numbers as values and you want to treat it as a numeric variable. Imagine a customer satisfaction survey where ratings (1 through 5) got imported as categories instead of numbers. R sees “1” and “5” as labels, not quantities — like seeing jersey numbers on a basketball team. The number 23 does not mean that player is “more” than player 7. This discussion is based on this Stack Overflow post.

Let us go back to the playiris dataset. Remember the factor levels?

Code

levels(playiris$Species)

[1] "newlevel1" "newlevel2" "newlevel3"

If we summarize the variable, it gives us the number of observations per level — a head count, not a statistical summary:

Code

summary(playiris$Species)

newlevel1 newlevel2 newlevel3 
       50        50        50

Let us convert the level labels to numbers:

Code

levels(playiris$Species)=c("1","2","3")
summary(playiris$Species)

 1  2  3 
50 50 50

Here is the sneaky part. Even though the labels look like numbers now, R is still treating them as categories. It is still counting how many observations fall in each bucket instead of computing a mean. R does not care what the labels look like — it cares about the data type under the hood. It is like putting a “Ferrari” bumper sticker on your Civic: it might say Ferrari, but it is still a Civic. To actually make the switch, you need to explicitly convert:

Code

playiris$Species=as.numeric(levels(playiris$Species))[playiris$Species]
# This can also be accomplished by the following.
# playiris$Species=as.numeric(as.character(playiris$Species))

Now if we summarize the variable, we get the real numeric summary — min, max, mean, quartiles, the whole executive dashboard:

Code

summary(playiris$Species)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       2       2       3       3

6.7.2 Numeric to Factor - Binning variables

Let us re-convert the Species variable back to a factor:

Code

playiris$Species=factor(playiris$Species)
summary(playiris$Species)

 1  2  3 
50 50 50

Sometimes you need to go the other direction — take a continuous number and chop it into categories. In business, this is incredibly common and you have probably already seen it without knowing the name for it. Think “low / medium / high revenue customers” or “small / medium / large deal sizes” or the classic credit-score tiers. This process is called binning, and it is one of the most practical things you will learn today.

Code

summary(playiris$Petal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.600   4.350   3.758   5.100   6.900

Our goal is to place all observations with Petal.Length at or below the 25th percentile (1.6) as “low” and those above the 75th percentile (5.1) as “high.” Everything in between gets “medium.” Think of it like segmenting your customer base: the bottom 25% by spending are “budget,” the top 25% are “premium,” and everyone else is “standard.” The cut() function does the heavy lifting:

Code

playiris$Petal.LengthFactor=cut(playiris$Petal.Length,breaks = c(1,1.6,5.1,6.9),labels = c("low","medium","high"),include.lowest = TRUE,ordered_result = TRUE)

Another approach uses nested ifelse statements — a bit harder to read (OK, a lot harder to read), but flexible for custom logic:

Code

playiris$Petallength.Factor=ifelse(playiris$Petal.Length<=1.6,"low",ifelse(playiris$Petal.Length>5.1,"high","medium"))

Binning may also be needed to create equal interval bins and equal frequency bins. Please see this response and this response on Stack Overflow for more details on the possible solutions.

6.7.3 Factor and Numeric to Character and vice versa

There are times when you need your values as plain text — maybe for string matching, text mining, or just to get a clean CSV export for that one colleague who insists on doing everything in Excel. Converting between data types is like exchanging currency: the underlying value stays the same, but the format matters depending on where you are trying to spend it.

Let us look at the structure of our playground dataset:

Code

str(playiris)

'data.frame':   150 obs. of  8 variables:
 $ Sepal.Length      : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Sepal.Width       : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width       : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.dimensions  : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Petal.LengthFactor: Ord.factor w/ 3 levels "low"<"medium"<..: 1 1 1 1 1 2 1 1 1 1 ...
 $ Petallength.Factor: chr  "low" "low" "low" "low" ...

Convert Species and Petal.LengthFactor to character variables:

Code

playiris$Species=as.character(playiris$Species)
playiris$Petal.LengthFactor=as.character(playiris$Petal.LengthFactor)
str(playiris)

'data.frame':   150 obs. of  8 variables:
 $ Sepal.Length      : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Sepal.Width       : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width       : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species           : chr  "1" "1" "1" "1" ...
 $ Sepal.dimensions  : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Petal.LengthFactor: chr  "low" "low" "low" "low" ...
 $ Petallength.Factor: chr  "low" "low" "low" "low" ...

The results confirm our conversion worked. Converting them back to factors or numeric is straightforward — just use the corresponding as.* function. Think of it as a round trip: you can always convert back.

Code

playiris$Species=as.numeric(playiris$Species)
playiris$Petal.LengthFactor=as.factor(playiris$Petal.LengthFactor)
str(playiris)

'data.frame':   150 obs. of  8 variables:
 $ Sepal.Length      : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Sepal.Width       : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width       : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.dimensions  : num  8.6 7.9 7.9 7.7 8.6 9.3 8 8.4 7.3 8 ...
 $ Petal.LengthFactor: Factor w/ 3 levels "high","low","medium": 2 2 2 2 2 3 2 2 2 2 ...
 $ Petallength.Factor: chr  "low" "low" "low" "low" ...

The following diagram summarizes the conversion paths between factor, numeric, and character variables. Bookmark this one — you will come back to it more often than you think.