5 Describing Data

Before you run any analysis, sit down with the dataset and figure out what you are working with. How big is it? What kinds of variables are in there? Are there missing values, suspicious extremes, or category labels that should have been merged? The five minutes you spend exploring data before analysis routinely saves five hours of debugging after.

This chapter covers the small set of functions that do that exploration work — quick, low-effort commands that surface a surprising amount about a dataset before any modeling happens. They are also the steps an AI assistant will skip if you let it.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(DT)

Code

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

5.1 How big is this thing? Rows and columns

The dim() function gives you the number of rows and columns, in that order. Rows are individual records — customers, transactions, survey responses — and columns are the variables measured about each one. If someone hands you a customer database, dim() tells you something like “10,000 customers and 15 attributes about each.” That is useful context before any analysis.

Code

dim(iris) # dimensions

[1] 150   5

The output 150 5 tells you the iris dataset has 150 rows (observations) and 5 columns (variables). In a business context, this is your first sense of scale: “We have 150 records and 5 attributes to work with.”

5.2 Just the row count

Sometimes you only care about how many observations (rows) you have. Maybe you are checking if your survey got enough responses to be meaningful, or verifying that your data import did not accidentally drop half your records. The nrow() function is your quick headcount.

Code

nrow(iris)

[1] 150
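One way to turn that headcount into an automatic check is to compare nrow() against the number of records you expected. The sketch below is only an illustration: expected_rows is a hypothetical figure that, in practice, would come from the source system or from whoever sent you the file.

```r
# expected_rows is a hypothetical value used for illustration only.
expected_rows <- 150

actual_rows <- nrow(iris)

# Report a match quietly, and flag a mismatch loudly.
if (actual_rows == expected_rows) {
  message("Row count matches: ", actual_rows)
} else {
  warning("Expected ", expected_rows, " rows but found ", actual_rows)
}
```

If the import had dropped records, the warning would tell you before any analysis ran.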

5.3 Just the column count

And sometimes you want to know how many variables (columns) are in the dataset. This is particularly useful when someone hands you a spreadsheet with “just a few fields” and it turns out to be 87 columns wide. (This happens more than you would think in corporate data.)

Code

ncol(iris)

[1] 5

5.4 What are the variables called?

Before you can analyze anything, you need to know what the columns are named. The names() function is like reading the menu before you order – you want to know what is available before you start making requests.

Code

names(iris) # colnames(iris) also gives that information

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
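The same function can confirm that the columns you plan to use are actually present. This is a minimal sketch, assuming a hypothetical required_cols list; Petal.Area is deliberately a column that does not exist in iris.

```r
# required_cols is a hypothetical list of columns a later analysis needs.
# "Petal.Area" is included on purpose: it is not a column in iris.
required_cols <- c("Sepal.Length", "Species", "Petal.Area")

# setdiff() returns the required names that are absent from the data.
missing_cols <- setdiff(required_cols, names(iris))
missing_cols
# [1] "Petal.Area"
```

An empty result means every required column is there.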

5.5 Peeking at the first few rows

The head() function shows you the first 6 rows of your data. This is one of the most-used commands in R for good reason – it gives you a quick sense of what the data actually looks like without dumping the entire thing into your console. It is the data equivalent of skimming the executive summary before reading a 40-page report.

Code

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Want to explore the full dataset interactively? The table below lets you search, sort, and page through all 150 rows – try clicking a column header to sort, or typing “virginica” in the search box:

Code

DT::datatable(iris, options = list(pageLength = 10, autoWidth = TRUE),
 caption = "The complete iris dataset — search, sort, and explore.")

5.6 Just a taste — first 2 rows

You can specify exactly how many rows to see by passing a number as the second argument:

Code

head(iris, 2) # alternatively, iris[1:2, ] returns the same rows

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

5.7 Checking the end – last 6 rows

The tail() function is head()’s less popular but equally important sibling. It shows you the last rows of your dataset. Why bother? Because this is how you catch problems like data imports that got cut off early, or garbage rows that snuck in at the bottom of a CSV. Trust but verify, as they say.

Code

tail(iris) # pass a second argument to control the number of rows, as with head()

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
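tail() takes the same optional row count as head(). A short sketch, using the name last_three purely for illustration:

```r
# The second argument sets how many rows to return from the end.
last_three <- tail(iris, 3)
last_three

# The original row numbers are kept, so you can see where the rows came from.
rownames(last_three)
```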

5.8 Grabbing a specific row

Need to look at one particular record? Maybe a customer complaint came in and you want to pull up their exact data point. You can index directly into the data frame. Row 1, all columns:

Code

iris[1,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

5.9 A single cell – row 1, column 1

You can drill all the way down to a single value. This is like zooming into one cell of a spreadsheet – useful when you need to verify a specific number that does not look right in a report.

Code

iris[1,1]

[1] 5.1
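Column positions shift whenever someone adds or reorders columns, so it is often safer to refer to a column by name. The sketch below shows three equivalent ways to reach the same cell; the variable names are only for illustration.

```r
# Three equivalent ways to reach row 1 of the Sepal.Length column.
by_position <- iris[1, 1]
by_name     <- iris[1, "Sepal.Length"]
by_dollar   <- iris$Sepal.Length[1]

# All three hold the same value.
c(by_position, by_name, by_dollar)
```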

5.10 What is the third column called?

If you know a column by its position but not its name (happens more often than you would think, especially when you inherit someone else’s dataset), you can look it up:

Code

names(iris)[3]

[1] "Petal.Length"

5.11 Pulling a few entries from a specific column

Here we grab the first 3 values from the third column. Useful for a quick sanity check – are these numbers? Text? Something weird like “N/A” spelled out as a string? Five minutes of sanity checking now saves five hours of debugging later.

Code

head(iris[3], 3) # iris[1:3, 3] gives the same values, but as a plain vector

  Petal.Length
1          1.4
2          1.4
3          1.3
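The output above keeps its column header because iris[3] is still a data frame. A brief sketch of the difference between single and double brackets, with illustrative variable names:

```r
# Single brackets keep the data frame wrapper around the column.
as_data_frame <- iris[3]
class(as_data_frame)   # "data.frame"

# Double brackets return the bare vector of values.
as_vector <- iris[[3]]
class(as_vector)       # "numeric"

# The first three values are the same either way.
head(as_vector, 3)
iris[1:3, 3]
```

Which form you want depends on the next step: functions like mean() expect a vector, while most dplyr verbs expect a data frame.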

5.12 Understanding the structure of your data

This is where the detective work gets real. The str() function tells you the data type of every variable in your data frame. In the business world, you absolutely need to know the difference between a number (like revenue), a category (like customer segment), and text (like a product name) – because R treats them very differently, and so should you.

For the most part, we will be concerned with character, factor, and numeric variables. Character variables hold free text, such as product names. Think of factor variables as categories: customer segments, geographic regions, or satisfaction ratings on a 1-to-5 scale. Numeric variables are your quantitative measures, such as revenue, age, and order count. The str() function is your go-to for figuring out what R thinks each variable is (which, fair warning, is not always what you think it should be).

Code

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here is how to read this output. Each line shows one variable: the name (like Sepal.Length), the data type (num for numeric, Factor for categorical), and a preview of the first few values. The Factor w/ 3 levels line tells you that Species is a categorical variable with exactly three categories: setosa, versicolor, and virginica. If a variable shows up as chr (character) when you expected a number, that is your first clue that something went wrong during data import — a common “gotcha” that str() helps you catch early.
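To make the chr gotcha concrete, here is a minimal sketch with a small made-up data frame. The orders data and its column names are invented for illustration, and replacing "N/A" with NA before converting is one reasonable fix rather than the only one.

```r
# A made-up data frame: revenue was typed with "N/A" for a missing value,
# so the whole column is stored as text.
orders <- data.frame(
  order_id = 1:3,
  revenue  = c("100", "N/A", "250"),
  stringsAsFactors = FALSE
)
str(orders)   # revenue shows up as chr, not num

# Turn the "N/A" strings into real missing values, then convert to numeric.
orders$revenue[orders$revenue == "N/A"] <- NA
orders$revenue <- as.numeric(orders$revenue)
str(orders)   # revenue is now num, with one NA
```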

If you have a factor variable and want to see all its categories, the levels() function has you covered:

Code

levels(iris$Species)

[1] "setosa"     "versicolor" "virginica"

5.13 The quick-and-dirty summary

The summary() function is the workhorse for an initial look. For numeric variables it returns the mean, median, min, max, and quartile values. For factor variables it counts observations per category. One function call and you already know a surprising amount about a new dataset.

Code

summary(iris) # for factor/categorical variables, this gives a count of all categories

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50
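iris has no missing values, so the output above cannot show how summary() reports them. The sketch below blanks out three values in a copy of the data (an artificial change, made only for illustration) and then counts the gaps.

```r
# Copy iris and blank out a few values to imitate real-world gaps.
iris_gaps <- iris
iris_gaps$Sepal.Width[c(2, 10, 25)] <- NA

# summary() adds an NA's entry for any variable that has missing values.
summary(iris_gaps$Sepal.Width)

# Count missing values in every column at once.
na_counts <- colSums(is.na(iris_gaps))
na_counts
```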

5.14 Summarizing a single variable

You can also focus on just one variable at a time. Imagine you only care about one KPI – say average order value or customer satisfaction score. Here is how you would zero in on it:

Code

summary(iris$Sepal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900

5.15 Another useful exploration tool: glimpse() and tibble

The glimpse() function from dplyr is another great way to explore your data – it shows every column in a transposed format, making it easy to scan datasets that have a lot of variables. Think of it as the LinkedIn profile version of str() – same information, better presentation.

Code

glimpse(iris)

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

5.15.1 Converting to a tibble

Tibbles are the tidyverse's modern take on the data frame (see the tidyverse chapter). They print more cleanly and do not try to dump your entire dataset into the console. Converting to a tibble is like switching from a cluttered Excel spreadsheet to a well-formatted dashboard: same data, much easier to read.

Code

iris_tbl <- as_tibble(iris)
iris_tbl

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Notice how the tibble only shows the first 10 rows and tells you the dimensions up front — no more scrolling through thousands of rows in your console.

5.15.2 Selecting and viewing with dplyr

You can also use dplyr functions to explore specific slices of the data. Want to focus on just two columns? Or count how many observations fall into each group? These are exactly the kinds of questions you would ask about customer segments or product categories in a real business context:

Code

# View specific columns
iris_tbl %>% select(Species, Sepal.Length) %>% head()

# A tibble: 6 × 2
  Species Sepal.Length
  <fct>          <dbl>
1 setosa           5.1
2 setosa           4.9
3 setosa           4.7
4 setosa           4.6
5 setosa           5  
6 setosa           5.4

Code

# Count observations per group
iris_tbl %>% count(Species)

# A tibble: 3 × 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

All three species have exactly 50 observations each — a perfectly balanced dataset. In real-world data, this kind of balance is rare. You are more likely to see one dominant group (like 90% “Standard” customers and 10% “Premium”), and spotting that imbalance early shapes every analysis decision downstream.
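Raw counts are easier to judge as shares of the total. A minimal sketch, assuming dplyr is installed; the share column and the species_share name are introduced here for illustration.

```r
library(dplyr)

# Count rows per group, then convert the counts to shares of the total.
species_share <- iris %>%
  count(Species) %>%
  mutate(share = n / sum(n))

species_share
```

In iris every share is one third. In an imbalanced dataset, a share near 0.9 for one group is the signal to look for.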

Code

# Quick summary by group
iris_tbl %>%
 group_by(Species) %>%
 summarize(
 n = n(),
 mean_sepal_length = mean(Sepal.Length),
 mean_petal_length = mean(Petal.Length)
 ) %>%
 DT::datatable(options = list(pageLength = 5, dom = 't'),
 caption = "Summary by species — a quick grouped overview.")

This grouped summary reveals that the three species differ dramatically in petal length (from 1.46 cm for setosa to 5.55 cm for virginica) but less so in sepal length (5.01 to 6.59 cm). Even this quick table tells you that petal measurements are likely to be more useful than sepal measurements for distinguishing between species — a genuine analytical insight from three lines of code.

5.15.3 The skimr package

When you want a comprehensive single-call overview (mini histograms, top category counts, and missing-value counts all at once), the skimr package is a convenient option. One function call gives you what would otherwise be a fifteen-minute manual audit.

Code

# install.packages("skimr")
library(skimr)
skim(iris)

Data summary
Name                    iris
Number of rows          150
Number of columns       5
_______________________
Column type frequency:
  factor                1
  numeric               4
________________________
Group variables         None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Species                0              1  FALSE           3  set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  complete_rate  mean    sd   p0  p25   p50  p75  p100  hist
Sepal.Length               1  5.84  0.83  4.3  5.1  5.80  6.4   7.9  ▆▇▇▅▂
Sepal.Width                1  3.06  0.44  2.0  2.8  3.00  3.3   4.4  ▁▆▇▂▁
Petal.Length               1  3.76  1.77  1.0  1.6  4.35  5.1   6.9  ▇▁▆▇▂
Petal.Width                1  1.20  0.76  0.1  0.3  1.30  1.8   2.5  ▇▁▇▅▃

The output includes histograms for numeric variables and frequency tables for factors, plus completeness percentages so missing values cannot hide. If a stakeholder asks “what does this data look like?” — this is the function that answers it in one line.

AI Pitfall: AI skips the work this chapter is about

Ask an AI assistant to “compute the average sales by region for this dataset” and you will get a clean tidyverse pipeline back:

df %>% group_by(region) %>% summarize(avg = mean(sales))

The code is correct. The code may also be useless, depending on what is in df — and the AI did none of the work to find out. Notice what is missing:

No check for whether region has missing values, which group_by() keeps as a separate NA group rather than dropping
No na.rm = TRUE in mean(), which means the function returns NA for any region where even one row has a missing sales value
No check of the levels of region, which may include “North”, “north” (different case), and “North ” (trailing space) as distinct categories
No check on balance: “average sales per region” is misleading if 95% of rows come from one region

The functions in this chapter — dim(), head(), str(), glimpse(), summary(), skim() — are the exploration step that comes before an analysis pipeline. AI tends to skip them. Run them yourself, every time, before you trust any code generated against an unfamiliar dataset.
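The gaps listed above can be sketched on a small made-up dataset. Everything in this example is hypothetical: the df values, the choice to lower-case and trim the labels, and the decision to drop rows with no region (reporting them as their own group is equally defensible).

```r
library(dplyr)

# Hypothetical sales data containing the problems listed above: a missing
# region, a missing sales value, and inconsistent region labels.
df <- data.frame(
  region = c("North", "north", "North ", "South", "South", NA),
  sales  = c(100, 200, NA, 50, 150, 75),
  stringsAsFactors = FALSE
)

# Explore first: which labels exist, and where are the gaps?
df %>% count(region)
colSums(is.na(df))

# Then clean the labels and summarize with explicit handling of missing values.
avg_by_region <- df %>%
  filter(!is.na(region)) %>%
  mutate(region = tolower(trimws(region))) %>%
  group_by(region) %>%
  summarize(
    n_rows = n(),
    avg    = mean(sales, na.rm = TRUE)
  )

avg_by_region
```

Run on the raw df, the one-line pipeline would return five groups, including an NA group and three spellings of the same region. The cleaned version returns two.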
