11 Summary Statistics

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(DT)

Code

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

Before you build models, create dashboards, or apply machine learning, summary statistics answer the foundational questions about a dataset. What is the average? How spread out are the values? Are there extreme values that warrant a second look at the data source? Getting these basics right is the difference between an analyst who knows the data and one who is operating on faith.

This chapter covers computing summary statistics using R’s built-in functions and ends with a brief look at grouped summaries using dplyr (the “give me these numbers broken down by region/segment/quarter” tool). For the full treatment of dplyr, see the dplyr chapter.

11.1 Measures of Central Tendency

Mean, median, and mode — the holy trinity of “what is typical?” in your data. Here is the business translation, because nobody in a boardroom says “measure of central tendency”:

Mean = your average order value. Add everything up, divide by the count. Simple, familiar, and what your CEO is probably asking about.
Median = the salary figure where half your employees earn more and half earn less. Less flashy than the mean, but way more honest when outliers are involved. (This is why job postings should report median salary, not mean — one VP making $800K really skews the average.) Choosing the right summary statistic is not just a math decision — it is a fairness decision. The number you report shapes what decision-makers believe is “normal,” and that belief drives real policy.
Mode = the most popular item in your product catalog. The best-seller. The thing customers keep coming back for.

One important note: not all three work for every variable type. For nominal variables (like customer region or product category), only the mode makes sense — you cannot average “Northeast” and “West Coast.” For ordinal variables (like satisfaction ratings from 1 to 5), mode and median work fine. All three are fair game for numeric variables like revenue, age, or number of units sold.
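To make the mean-versus-median point concrete, here is a minimal sketch. The salary figures are made up for illustration (five typical employees plus one very high earner) and do not come from any real dataset:

```r
# Hypothetical salaries: five typical employees and one VP (illustrative only)
salaries <- c(52000, 55000, 58000, 61000, 64000, 800000)

mean_salary   <- mean(salaries)    # pulled upward by the single large value
median_salary <- median(salaries)  # middle of the sorted values

mean_salary    # 181666.7
median_salary  # 59500
```

Nobody in this toy company earns anything close to the mean, which is why the median is usually the more honest "typical" figure when one value dominates.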

Code

mean(iris$Sepal.Length)

[1] 5.843333

The mean of Sepal.Length is about 5.84.

Code

median(iris$Petal.Length)

[1] 4.35

The median of Petal.Length is 4.35.

Here is a fun fact that will make you question everything: R does not have a built-in function for the statistical mode. Seriously. One of the three most fundamental statistics in existence, and R just… forgot to include it. (There is a function called mode(), but it reports how an object is stored, such as "numeric", not the most frequent value.) It is like buying a car and realizing it does not have a cup holder. No worries though, we can write our own. The function below, called mymode, handles it nicely, and it even works when there are multiple modes (ties for first place). (This solution is adapted from a question on Stack Overflow.)

Code

# Function to determine the mode. Can also identify multiple modes
mymode <- function(x) {
  ux <- unique(x)                 # distinct values
  tab <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[tab == max(tab)]             # every value tied for the highest count
}

# Function to give the frequency of occurrence of the mode
modecount <- function(x) {
  ux <- unique(x)
  max(tabulate(match(x, ux)))     # the highest count among distinct values
}

Let us put those functions to work. First, on a numeric variable:

Code

mymode(iris$Petal.Length) # Gives the mode(s) of Petal.Length

[1] 1.4 1.5

Code

modecount(iris$Petal.Length) # Gives the frequency of occurrence of the mode

[1] 13

Petal.Length has two modes, 1.4 cm and 1.5 cm, and each appears 13 times in the dataset. Both are setosa values. When the modes sit far from the mean (which is about 3.76 cm), that is a hint the distribution is skewed or multimodal.
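A frequency table is one way to cross-check a mode calculation. The sketch below uses only base R's table() and sort(); the variable names freq and top_values are introduced here for illustration:

```r
# Frequency table of Petal.Length, sorted so the most common values come first
freq <- sort(table(iris$Petal.Length), decreasing = TRUE)

head(freq, 3)  # the most frequent values and their counts

# All values tied for the highest count (returned as character labels)
top_values <- names(freq)[as.vector(freq) == max(freq)]
top_values
```

Seeing the counts side by side makes ties obvious, which a single "the mode is..." sentence can easily hide.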

And now, a factor variable:

Code

mymode(iris$Species)

[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

Code

modecount(iris$Species)

[1] 50

Note that all three species are identified as modes because there are 50 observations of each type. In business terms, imagine three products all tied for the best-seller spot with identical unit sales — they are all the mode. Nobody wins, everybody wins.

11.2 Other measures of central tendency

Trimmed Mean: Here is the thing about the regular mean — it is a people-pleaser. It gives every single data point equal say, including that one customer who bought $47,000 worth of printer ink (true story at many companies). When outliers like that show up, the mean gets yanked in their direction and stops representing what is “typical.” A trimmed mean handles this by chopping off the extremes before calculating — like grading on a curve but for your data. That said, if there is a legitimate reason for those outliers (maybe that printer-ink customer runs a copy shop), the regular mean might still be the right call. Context matters.

Regular mean:

Code

mean(iris$Sepal.Length)

[1] 5.843333

5% trimmed mean — this knocks off the top 5% and bottom 5% of values before averaging. Think of it as removing the one person who rated your product a 1 out of spite and the one who gave it a 10 because they are your mom:

Code

mean(iris$Sepal.Length, trim = .05) # knocks off 5% of the observations on each end.

[1] 5.820588

10% trimmed mean — even more aggressive trimming, for when your data has more drama than a reality TV show:

Code

mean(iris$Sepal.Length, trim = .1) # knocks off 10% of the observations on each end.

[1] 5.808333
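To see what trimming does under the hood, the sketch below reproduces a 10% trimmed mean by hand: sort the values, drop floor(n * trim) observations from each end, and average what is left. The order amounts are hypothetical, with one extreme value standing in for the printer-ink customer:

```r
# Hypothetical order amounts with one extreme value (illustrative only)
orders <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 47000)

trim <- 0.1
n <- length(orders)
k <- floor(n * trim)                   # observations dropped from each end
kept <- sort(orders)[(k + 1):(n - k)]  # the values that survive trimming

manual_trimmed  <- mean(kept)
builtin_trimmed <- mean(orders, trim = trim)

mean(orders)     # 4736, dominated by the outlier
manual_trimmed   # 42.5
builtin_trimmed  # 42.5
```

With only ten values, a 10% trim removes exactly one observation per side, which is enough to discard the outlier here.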

11.3 Measures of Dispersion (variation)

Knowing the average is great, but it only tells half the story. Consider this: two coffee shops both have an average wait time of 4 minutes. Sounds equal, right? But Shop A consistently serves everyone in 3 to 5 minutes, while Shop B ranges from 30 seconds to 15 minutes depending on whether the barista is having a good day. The averages are identical, but the experience is wildly different. Measures of dispersion tell you how spread out your data is — and in business, spread means risk, inconsistency, and unpredictability.
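The coffee shop comparison can be sketched with made-up wait times. The two vectors below are hypothetical and were chosen so that both shops average exactly 4 minutes:

```r
# Hypothetical wait times in minutes (illustrative only)
shop_a <- c(3, 4, 5, 4, 3, 5, 4, 4)      # consistent service
shop_b <- c(0.5, 1, 2, 15, 1.5, 3, 4, 5) # erratic service

mean(shop_a)   # 4
mean(shop_b)   # 4

range(shop_a)  # 3 5
range(shop_b)  # 0.5 15.0

sd(shop_a)     # small: waits cluster near the average
sd(shop_b)     # much larger: waits are unpredictable
```

The means cannot tell the shops apart; the range and standard deviation can.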

11.3.1 Range

The simplest measure of spread: what is the smallest value and what is the biggest? It is the “from $X to $Y” you see on every price comparison website.

Code

min(iris$Sepal.Length) # minimum

[1] 4.3

Code

max(iris$Sepal.Length) # maximum

[1] 7.9

Code

range(iris$Sepal.Length) # returns the min and max together

[1] 4.3 7.9

11.3.2 Interquartile range: Difference between 25th and 75th percentile (or quantile)

Percentiles tell you where a value stands relative to everyone else. If your sales numbers are at the 90th percentile, you are outperforming 90% of your peers. (Definitely put that on your LinkedIn.) The 5th percentile is the low end — only 5% of observations fall below it.

Code

quantile(iris$Sepal.Length,probs = .05)

 5% 
4.6

The median is just the 50th percentile — the exact middle of the pack. Not the best, not the worst, just… there.

Code

median(iris$Sepal.Length)

[1] 5.8

Code

quantile(iris$Sepal.Length,probs=.5)

50% 
5.8

25th percentile (the bottom of the “middle half”; if this were employee performance, it is the boundary between “needs improvement” and “meets expectations”):

Code

quantile(iris$Sepal.Length,probs=.25)

25% 
5.1

75th percentile (the top of the “middle half”; above this, you are in the “exceeds expectations” territory):

Code

quantile(iris$Sepal.Length,probs=.75)

75% 
6.4

The IQR (Interquartile Range) is the difference between the 75th and 25th percentiles. It captures where the middle 50% of your data lives and is way less sensitive to outliers than the full range. Think of it as the “where most normal customers fall” zone — ignoring the extreme penny-pinchers and the big spenders.

Code

IQR(iris$Sepal.Length)

[1] 1.3
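As a quick check on the definition, the sketch below requests all three quartiles in one quantile() call and subtracts the 25th percentile from the 75th. The names q and iqr_manual are introduced here for illustration:

```r
# Ask for several percentiles in one call
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.50, 0.75))
q

# IQR by hand: 75th percentile minus 25th percentile
iqr_manual <- unname(q["75%"] - q["25%"])

iqr_manual              # 1.3
IQR(iris$Sepal.Length)  # 1.3
```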

11.3.3 Variance and Standard Deviation

Variance and standard deviation are the workhorses of measuring spread. Variance is, roughly, the average squared distance from the mean (R’s var() computes the sample variance, which divides by n - 1 rather than n). It is mathematically important but comes in awkward units (squared dollars, squared minutes). Standard deviation is the square root of variance, which puts it back in the original units and makes it something you can communicate in a meeting.

Practically: if the standard deviation of delivery times is 1 day, most packages arrive within about a day of the average. If the standard deviation is 8 days, the operation is unpredictable and customer service is going to feel it.

Code

var(iris$Sepal.Length)

[1] 0.6856935

Code

sd(iris$Sepal.Length)

[1] 0.8280661

Note that the standard deviation is the square root of the variance: here, the square root of 0.6856935 is 0.8280661.
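For readers who want to see the arithmetic, here is a minimal sketch that computes the sample variance by hand and compares it with var() and sd(). It relies on var() using an n - 1 denominator; the names var_manual and sd_manual are illustrative:

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(x))  # TRUE
all.equal(sd_manual, sd(x))    # TRUE
```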

AI Pitfall: AI defaults to mean(), with no na.rm, on whatever you point at

Two failure modes to watch for in this chapter’s territory:

1. The default-to-mean problem. Ask an AI assistant for the “typical customer order value” and you get mean(orders$amount). For most retail data the order-amount distribution is heavily right-skewed — a long tail of large orders pulls the mean above what a typical customer actually spends. The median is almost always more honest in that situation. AI does not know that. If you want the typical value of a skewed variable, ask for the median, or compute both and decide.

2. The silent NA problem. R’s summary functions return NA if any input value is missing, unless you pass na.rm = TRUE. AI sometimes generates mean(x) without it. The output [1] NA is at least visible. Worse: an AI-generated pipeline like

df %>% summarize(avg = mean(amount))

with na.rm missing will return one row with NA and you will spend ten minutes wondering why your numbers vanished. Add na.rm = TRUE defensively, and use summary() or skim() from Chapter 3 first to know whether NAs are even present.
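The missing-value behavior is easy to reproduce in base R. The vector below is hypothetical; it stands in for an amount column with one missing entry:

```r
# Hypothetical order amounts with one missing value (illustrative only)
amount <- c(10, 20, NA, 40)

n_missing <- sum(is.na(amount))  # check for missing values first
n_missing                        # 1

mean(amount)                     # NA
mean(amount, na.rm = TRUE)       # 23.33333
median(amount, na.rm = TRUE)     # 20
```

Inside a dplyr pipeline the same argument goes in the summary call, for example summarize(avg = mean(amount, na.rm = TRUE)). It is worth reporting the number of dropped values alongside the result so that nobody mistakes a mean of three observations for a mean of four.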

11.4 Tidyverse approach: Grouped summaries with dplyr

OK, here is where things get genuinely exciting — and yes, you are allowed to be excited about summary statistics. One of the most powerful features of dplyr is the ability to compute summary statistics by group. This is the “break it down by…” capability that makes business analysis actually useful.

Want to compare average order values across customer segments? Revenue by region? Employee satisfaction scores by department? NPS by product line? Instead of manually subsetting your data fifteen times and running the same calculations, the group_by() and summarize() combo does it all in one elegant pipeline. This is the kind of thing that makes people say “wait, you did that in three lines of code?”

11.4.1 Summary by group

This is the “give me the numbers broken down by category” operation. You will use this approximately every single day of your analytics career:

Code

iris %>%
 group_by(Species) %>%
 summarize(
 n = n(),
 mean_sl = mean(Sepal.Length),
 sd_sl = sd(Sepal.Length),
 median_sl = median(Sepal.Length),
 min_sl = min(Sepal.Length),
 max_sl = max(Sepal.Length)
 ) %>%
 DT::datatable(options = list(pageLength = 5, dom = 't'),
 caption = "Grouped summary statistics for Sepal.Length by species.")

This table tells a complete story. Setosa has the smallest sepals (mean 5.01 cm, tight standard deviation of 0.35), while virginica has the largest (mean 6.59 cm, more variable at SD 0.64). Versicolor lands in the middle. The min-max range for virginica (4.9 to 7.9 cm, a span of 3.0 cm) is twice as wide as setosa’s (4.3 to 5.8 cm, a span of 1.5 cm), meaning virginica flowers are not only bigger on average but also much more variable in size.
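If you want to sanity-check a grouped summary without dplyr, base R's tapply() applies a function to one variable split by a grouping factor. This is a cross-check sketch, not a replacement for the pipeline above; the object names are illustrative:

```r
# Mean and SD of Sepal.Length for each species, using base R only
means_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
sds_by_species   <- tapply(iris$Sepal.Length, iris$Species, sd)

round(means_by_species, 2)
round(sds_by_species, 2)
```

Each result is a named vector with one element per species, so means_by_species["setosa"] pulls out a single group.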

11.4.2 Multiple variables at once with across()

When you want the same statistics for every numeric column — instead of typing them out one by one like it is 1997 — across() is an absolute lifesaver. Imagine needing mean and standard deviation for twenty columns. Without across(), that is forty lines of code. With it, you are done in two:

Code

iris %>%
 group_by(Species) %>%
 summarize(across(where(is.numeric), list(mean = mean, sd = sd), .names = "{.col}_{.fn}")) %>%
 DT::datatable(options = list(pageLength = 5, scrollX = TRUE, dom = 't'),
 caption = "Mean and SD for all numeric variables by species — scroll right to see all columns.")

The column names follow the pattern VariableName_Statistic (e.g., Sepal.Length_mean). Scanning across the row for setosa, you can see it has tiny petals (mean 1.46 cm length, 0.25 cm width) with very little variation (SD of 0.17 and 0.11). Virginica has the largest petals (mean 5.55 cm long, 2.03 cm wide) but also the most variation. This table gives you a comprehensive species profile in one glance — the kind of summary you would build to compare product lines or customer segments.
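Real business data usually has missing values, and across() accepts anonymous functions, so na.rm = TRUE can be passed to every summary in one place. The sketch below assumes dplyr 1.1 or later and R 4.1 or later (for the \(x) shorthand). iris has no missing values, so the numbers match the table above; the object name profile is illustrative:

```r
library(dplyr)

profile <- iris %>%
  group_by(Species) %>%
  summarize(
    across(
      where(is.numeric),
      list(
        mean = \(x) mean(x, na.rm = TRUE),  # NA-safe mean
        sd   = \(x) sd(x, na.rm = TRUE)     # NA-safe standard deviation
      ),
      .names = "{.col}_{.fn}"
    )
  )

names(profile)  # Species, then one mean and one sd column per numeric variable
```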

11.4.3 Quick counts

The count() function is your fastest path to “how many are in each group?” — a question that comes up roughly every five minutes in any business meeting. “How many customers in each segment?” “How many orders per region?” “How many support tickets per category?” This little function is about to become your best friend:

Code

iris %>% count(Species)

     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50

Each species has exactly 50 observations — a perfectly balanced dataset. In real business data, you almost never get this lucky. Knowing the count per group is critical because a “mean revenue of $1M” means something very different when it is based on 500 customers versus 5 customers.

For a detailed treatment of summarize(), group_by(), and across(), see the dplyr chapter.
