---
title: "Summary Statistics"
---
# Summary Statistics {#summarystats}
```{r}
library(tidyverse)
library(DT)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
Before you build models, create dashboards, or apply machine learning, summary statistics answer the foundational questions about a dataset. What is the average? How spread out are the values? Are there extreme values that warrant a second look at the data source? Getting these basics right is the difference between an analyst who knows the data and one who is operating on faith.
This chapter covers computing summary statistics using R's built-in functions, then closes with a short preview of grouped summaries in **dplyr** (the "give me these numbers broken down by region/segment/quarter" tool). For the full treatment of dplyr, see Chapter \@ref(dplyr).
## Measures of Central Tendency
Mean, median, and mode --- the holy trinity of "what is typical?" in your data. Here is the business translation, because nobody in a boardroom says "measure of central tendency":
- **Mean** = your average order value. Add everything up, divide by the count. Simple, familiar, and what your CEO is probably asking about.
- **Median** = the salary figure where half your employees earn more and half earn less. Less flashy than the mean, but way more honest when outliers are involved. (This is why job postings should report median salary, not mean --- one VP making $800K really skews the average.) Choosing the right summary statistic is not just a math decision --- it is a fairness decision. The number you report shapes what decision-makers believe is "normal," and that belief drives real policy.
- **Mode** = the most popular item in your product catalog. The best-seller. The thing customers keep coming back for.
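To see how far a single outlier can move the mean, here is a minimal sketch with made-up salary figures. The `salaries` vector is hypothetical, invented purely for illustration:

```{r}
# Hypothetical salaries: nine staff members plus one VP at $800K
salaries <- c(52000, 55000, 58000, 60000, 61000,
              63000, 65000, 68000, 70000, 800000)

mean(salaries)    # 135200: pulled upward by the single VP salary
median(salaries)  # 62000: close to what most employees actually earn
```

Nine of the ten people in this toy example earn far less than the mean, which is why the median is usually the safer headline number for skewed data.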
One important note: not all three work for every variable type. For nominal variables (like customer region or product category), only the mode makes sense --- you cannot average "Northeast" and "West Coast." For ordinal variables (like satisfaction ratings from 1 to 5), mode and median work fine. All three are fair game for numeric variables like revenue, age, or number of units sold.
```{r}
mean(iris$Sepal.Length)
```
The mean of Sepal.Length is `r mean(iris$Sepal.Length)`.
```{r}
median(iris$Petal.Length)
```
The median of Petal.Length is `r median(iris$Petal.Length)`.
Here is a fun fact that will make you question everything: R does not have a built-in function for the statistical mode. Seriously. (There is a function named `mode()`, but it reports how an object is stored, such as `"numeric"`, not the most frequent value.) One of the three most fundamental statistics in existence, and R just... forgot to include it. It is like buying a car and realizing it does not have a cup holder. No worries though: we can write our own. The `mymode` function below handles it nicely, and it even works when there are multiple modes (ties for first place). This solution is adapted from [a question on Stack Overflow](https://stackoverflow.com/questions/2547402/is-there-a-built-in-function-for-finding-the-mode).
```{r}
# Function to determine the mode. Can also identify multiple modes
mymode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
# Function to give the frequency of occurrence of the mode
modecount <- function(x) {
  ux <- unique(x)
  max(tabulate(match(x, ux)))
}
```
Let us put those functions to work. First, on a numeric variable:
```{r}
mymode(iris$Petal.Length) # Gives the mode(s) of Petal.Length
modecount(iris$Petal.Length) # Gives the frequency of occurrence of the mode
```
In R's copy of `iris`, `mymode()` returns two values here, 1.4 and 1.5, and `modecount()` returns 13: the two petal lengths tie, each appearing 13 times. Both are setosa values. When the mode sits far from the mean (which is about 3.76 cm), that is a hint the distribution is skewed or multimodal.
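One way to cross-check a hand-written mode function is to tabulate the values with base R's `table()` and sort the counts. This is a quick sketch; `petal_counts` is just an illustrative name:

```{r}
# Count how often each Petal.Length value occurs, most frequent first
petal_counts <- sort(table(iris$Petal.Length), decreasing = TRUE)

# The top few entries show the mode(s) and any ties
head(petal_counts, 3)
```

The names of the table are the measured values and the entries are their frequencies, so ties for first place are visible at a glance.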
And now, a factor variable:
```{r}
mymode(iris$Species)
modecount(iris$Species)
```
Note that all three species are identified as modes because there are 50 observations of each type. In business terms, imagine three products all tied for the best-seller spot with identical unit sales --- they are all the mode. Nobody wins, everybody wins.
## Other measures of central tendency
* **Trimmed Mean:** Here is the thing about the regular mean --- it is a people-pleaser. It gives every single data point equal say, including that one customer who bought $47,000 worth of printer ink (true story at many companies). When outliers like that show up, the mean gets yanked in their direction and stops representing what is "typical." A trimmed mean handles this by chopping off the extremes before calculating --- like grading on a curve but for your data. That said, if there is a legitimate reason for those outliers (maybe that printer-ink customer runs a copy shop), the regular mean might still be the right call. Context matters.
Regular mean:
```{r}
mean(iris$Sepal.Length)
```
5% trimmed mean --- this knocks off the top 5% and bottom 5% of values before averaging. Think of it as removing the one person who rated your product a 1 out of spite and the one who gave it a 10 because they are your mom:
```{r}
mean(iris$Sepal.Length, trim = .05) # knocks off 5% of the observations on each end.
```
10% trimmed mean --- even more aggressive trimming, for when your data has more drama than a reality TV show:
```{r}
mean(iris$Sepal.Length, trim = .1) # knocks off 10% of the observations on each end.
```
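To make the trimming step concrete, the sketch below reproduces the 5% trimmed mean by hand. It relies on documented behavior of `mean()`: the number of observations dropped from each end is `floor(n * trim)`, so with 150 rows and `trim = .05` that is 7 per end (slightly under 5%). The variable names are illustrative only:

```{r}
x <- iris$Sepal.Length
n <- length(x)

# Number of observations to drop from each end
k <- floor(n * 0.05)

# Sort, discard the k smallest and k largest values, then average the rest
trimmed <- sort(x)[(k + 1):(n - k)]
mean(trimmed)

# Should agree with the built-in trimmed mean
all.equal(mean(trimmed), mean(x, trim = 0.05))
```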
## Measures of Dispersion (variation)
Knowing the average is great, but it only tells half the story. Consider this: two coffee shops both have an average wait time of 4 minutes. Sounds equal, right? But Shop A consistently serves everyone in 3 to 5 minutes, while Shop B ranges from 30 seconds to 15 minutes depending on whether the barista is having a good day. The averages are identical, but the experience is wildly different. Measures of dispersion tell you how spread out your data is --- and in business, spread means risk, inconsistency, and unpredictability.
### Range
The simplest measure of spread: what is the smallest value and what is the biggest? It is the "from $X to $Y" you see on every price comparison website.
```{r}
min(iris$Sepal.Length) # minimum
```
```{r}
max(iris$Sepal.Length) # maximum
```
```{r}
range(iris$Sepal.Length) # returns the minimum and maximum as a two-element vector
```
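Note that `range()` returns two numbers rather than a single width. If one "spread" figure is wanted, one option is to wrap the result in `diff()`. A small sketch, with `sl_range` as an illustrative name:

```{r}
sl_range <- range(iris$Sepal.Length)  # c(minimum, maximum)
diff(sl_range)                        # maximum minus minimum
```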
### Interquartile range: Difference between 25th and 75th percentile (or quantile)
Percentiles tell you where a value stands relative to everyone else. If your sales numbers are at the 90th percentile, you are outperforming 90% of your peers. (Definitely put that on your LinkedIn.) The 5th percentile is the low end --- only 5% of observations fall below it.
```{r}
quantile(iris$Sepal.Length, probs = .05)
```
The median is just the 50th percentile --- the exact middle of the pack. Not the best, not the worst, just... there.
```{r}
median(iris$Sepal.Length)
```
```{r}
quantile(iris$Sepal.Length, probs = .5)
```
25th percentile (the bottom of the "middle half"; if this were employee performance, it would be the boundary between "needs improvement" and "meets expectations"):
```{r}
quantile(iris$Sepal.Length, probs = .25)
```
75th percentile (the top of the "middle half"; above this, you are in "exceeds expectations" territory):
```{r}
quantile(iris$Sepal.Length, probs = .75)
```
The IQR (Interquartile Range) is the difference between the 75th and 25th percentiles. It captures where the middle 50% of your data lives and is way less sensitive to outliers than the full range. Think of it as the "where most normal customers fall" zone --- ignoring the extreme penny-pinchers and the big spenders.
```{r}
IQR(iris$Sepal.Length)
```
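The link between `quantile()` and `IQR()` can be checked directly. The sketch below requests several quantiles in one call and subtracts the 25th percentile from the 75th; `q` is an illustrative name:

```{r}
# Request several quantiles in one call
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75))
q

# The IQR is the 75th percentile minus the 25th
unname(q["75%"] - q["25%"])
IQR(iris$Sepal.Length)
```

Passing a vector to `probs` saves repeating the call once per percentile.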
### Variance and Standard Deviation
Variance and standard deviation are the workhorses of measuring spread. Variance is the average squared distance from the mean (R's `var()` divides by n - 1 rather than n, which gives the sample variance). It is mathematically important but comes in awkward units (squared dollars, squared minutes). Standard deviation is the square root of variance, which puts it back in the original units and makes it something you can communicate in a meeting.
Practically: if the standard deviation of delivery times is 1 day, most packages arrive within about a day of the average. If the standard deviation is 8 days, the operation is unpredictable and customer service is going to feel it.
```{r}
var(iris$Sepal.Length)
```
```{r}
sd(iris$Sepal.Length)
```
Note that the standard deviation is the square root of variance.
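As a sanity check, both quantities can be computed by hand. This sketch assumes the sample (n - 1) definition that `var()` and `sd()` use; `manual_var` is an illustrative name:

```{r}
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))        # matches var()
all.equal(sqrt(manual_var), sd(x))   # sd() is the square root of var()
```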
::: {.callout-warning}
## AI Pitfall: AI defaults to `mean()` and often omits `na.rm`
Two failure modes to watch for in this chapter's territory:
**1. The default-to-mean problem.** Ask an AI assistant for the "typical customer order value" and you get `mean(orders$amount)`. For most retail data the order-amount distribution is heavily right-skewed — a long tail of large orders pulls the mean above what a typical customer actually spends. The median is almost always more honest in that situation. AI does not know that. If you want the typical value of a skewed variable, ask for the median, or compute both and decide.
**2. The silent NA problem.** R's summary functions return `NA` if any input value is missing, unless you pass `na.rm = TRUE`. AI sometimes generates `mean(x)` without it. The output `[1] NA` is at least visible. Worse: an AI-generated pipeline like
```r
df %>% summarize(avg = mean(amount))
```
with `na.rm` missing will return one row with `NA` and you will spend ten minutes wondering why your numbers vanished. Add `na.rm = TRUE` defensively, and use `summary()` or `skim()` from Chapter 3 first to know whether NAs are even present.
:::
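Both pitfalls show up in a few lines. The sketch below uses a made-up vector of order amounts (`amount` is hypothetical, not data from this book) that is right-skewed and contains one missing value:

```{r}
# Hypothetical order amounts: right-skewed, with one missing value
amount <- c(20, 25, 30, 35, 40, 45, 900, NA)

mean(amount)                  # NA, because one value is missing
mean(amount, na.rm = TRUE)    # pulled upward by the single 900 order
median(amount, na.rm = TRUE)  # 35: closer to a typical order
sum(is.na(amount))            # how many values are missing
```

Checking `sum(is.na(...))` first makes the decision to drop missing values a deliberate one rather than a silent default.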
## Tidyverse approach: Grouped summaries with dplyr
OK, here is where things get genuinely exciting --- and yes, you are allowed to be excited about summary statistics. One of the most powerful features of **dplyr** is the ability to compute summary statistics by group. This is the "break it down by..." capability that makes business analysis actually useful.
Want to compare average order values across customer segments? Revenue by region? Employee satisfaction scores by department? NPS by product line? Instead of manually subsetting your data fifteen times and running the same calculations, the `group_by()` and `summarize()` combo does it all in one elegant pipeline. This is the kind of thing that makes people say "wait, you did that in three lines of code?"
### Summary by group
This is the "give me the numbers broken down by category" operation. You will use this approximately every single day of your analytics career:
```{r}
iris %>%
  group_by(Species) %>%
  summarize(
    n = n(),
    mean_sl = mean(Sepal.Length),
    sd_sl = sd(Sepal.Length),
    median_sl = median(Sepal.Length),
    min_sl = min(Sepal.Length),
    max_sl = max(Sepal.Length)
  ) %>%
  DT::datatable(options = list(pageLength = 5, dom = 't'),
                caption = "Grouped summary statistics for Sepal.Length by species.")
```
This table tells a complete story. Setosa has the smallest sepals (mean 5.01 cm, tight standard deviation of 0.35), while virginica has the largest (mean 6.59 cm, more variable at SD 0.64). Versicolor lands in the middle. The min-max range for virginica (4.9 to 7.9 cm, a span of 3.0 cm) is twice as wide as setosa's (4.3 to 5.8 cm, a span of 1.5 cm), meaning virginica flowers are not only bigger on average but also much more variable in size.
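For comparison, base R can produce the same group means without dplyr. This is a sketch of two common options, `tapply()` and `aggregate()`; the object names are illustrative:

```{r}
# Base R: mean Sepal.Length within each species, as a named vector
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means

# aggregate() returns the same means as a data frame
species_df <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_df
```

These work well for a single statistic; once several statistics are needed per group, the `summarize()` pipeline tends to be easier to read.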
### Multiple variables at once with `across()`
When you want the same statistics for every numeric column --- instead of typing them out one by one like it is 1997 --- `across()` is an absolute lifesaver. Imagine needing mean and standard deviation for twenty columns. Without `across()`, that is forty lines of code. With it, you are done in two:
```{r}
iris %>%
  group_by(Species) %>%
  summarize(across(
    where(is.numeric),
    list(mean = mean, sd = sd),
    .names = "{.col}_{.fn}"
  )) %>%
  DT::datatable(options = list(pageLength = 5, scrollX = TRUE, dom = 't'),
                caption = "Mean and SD for all numeric variables by species (scroll right to see all columns).")
```
The column names follow the pattern `VariableName_Statistic` (e.g., `Sepal.Length_mean`). Scanning across the row for setosa, you can see it has tiny petals (mean 1.46 cm length, 0.25 cm width) with very little variation (SD of 0.17 and 0.11). Virginica has the largest petals (mean 5.55 cm long, 2.03 cm wide) but also the most variation. This table gives you a comprehensive species profile at a glance, the kind of summary you would build to compare product lines or customer segments.
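When columns may contain missing values, the functions passed to `across()` can be wrapped so that `na.rm = TRUE` is applied everywhere. The sketch below is one way to do it; `iris_na` is a hypothetical copy of `iris` with a single value blanked out for illustration, and `na_safe` is an illustrative name:

```{r}
library(dplyr)

# Hypothetical copy of iris with one missing value, for illustration only
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

na_safe <- iris_na %>%
  group_by(Species) %>%
  summarize(across(
    where(is.numeric),
    list(mean = function(x) mean(x, na.rm = TRUE),
         sd   = function(x) sd(x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"
  ))

na_safe
```

Without the wrappers, the setosa row would show `NA` for the `Sepal.Length` columns.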
### Quick counts
The `count()` function is your fastest path to "how many are in each group?" --- a question that comes up roughly every five minutes in any business meeting. "How many customers in each segment?" "How many orders per region?" "How many support tickets per category?" This little function is about to become your best friend:
```{r}
iris %>% count(Species)
```
Each species has exactly 50 observations --- a perfectly balanced dataset. In real business data, you almost never get this lucky. Knowing the count per group is critical because a "mean revenue of $1M" means something very different when it is based on 500 customers versus 5 customers.
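For a look at unbalanced groups, the built-in `mtcars` data is a handy stand-in. The sketch below sorts the counts and adds each group's share of the total; `cyl_counts` and `share` are illustrative names:

```{r}
library(dplyr)

# mtcars ships with R; its cylinder groups are not balanced
cyl_counts <- mtcars %>%
  count(cyl, sort = TRUE) %>%    # largest group first
  mutate(share = n / sum(n))     # each group's share of all rows

cyl_counts
```

Reporting the share next to the raw count makes it easier to spot groups that are too small to support a reliable average.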
For a detailed treatment of `summarize()`, `group_by()`, and `across()`, see Chapter \@ref(dplyr).