---
title: "Describing Data"
---
# Describing Data {#describedata}
Before you run any analysis, sit down with the dataset and figure out what you are working with. How big is it? What kinds of variables are in there? Are there missing values, suspicious extremes, or category labels that should have been merged? The five minutes you spend exploring data before analysis routinely saves five hours of debugging after.
This chapter covers the small set of functions that do that exploration work — quick, low-effort commands that surface a surprising amount about a dataset before any modeling happens. They are also the steps an AI assistant will skip if you let it.
```{r}
library(tidyverse)
library(DT)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
## How big is this thing? Rows and columns
The `dim()` function gives you the number of rows and columns, in that order. Rows are individual records — customers, transactions, survey responses — and columns are the variables measured about each one. If someone hands you a customer database, `dim()` tells you something like "10,000 customers and 15 attributes about each." That is useful context before any analysis.
```{r}
dim(iris) # dimensions
```
The output `150 5` tells you the iris dataset has 150 rows (observations) and 5 columns (variables). In a business context, this is your first sense of scale: "We have 150 records and 5 attributes to work with."
## Just the row count
Sometimes you only care about how many observations (rows) you have. Maybe you are checking if your survey got enough responses to be meaningful, or verifying that your data import did not accidentally drop half your records. The `nrow()` function is your quick headcount.
```{r}
nrow(iris)
```
## Just the column count
And sometimes you want to know how many variables (columns) are in the dataset. This is particularly useful when someone hands you a spreadsheet with "just a few fields" and it turns out to be 87 columns wide. (This happens more than you would think in corporate data.)
```{r}
ncol(iris)
```
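These counts can also act as a guardrail right after an import. The sketch below uses the built-in `iris` data; `expected_rows` and `expected_cols` are hypothetical names standing in for whatever counts your source system or file specification reports.

```{r}
# Assumed expectations: in practice these come from the source system or file spec
expected_rows <- 150
expected_cols <- 5

# dim() returns c(rows, columns), so it agrees with nrow() and ncol()
stopifnot(identical(dim(iris), c(nrow(iris), ncol(iris))))

# Stop with an error if the import dropped (or added) rows or columns
stopifnot(nrow(iris) == expected_rows, ncol(iris) == expected_cols)
```

If either check fails, `stopifnot()` halts with an error that names the failing condition, which is usually preferable to discovering the problem three analyses later.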
## What are the variables called?
Before you can analyze anything, you need to know what the columns are named. The `names()` function is like reading the menu before you order -- you want to know what is available before you start making requests.
```{r}
names(iris) # colnames(iris) also gives that information
```
## Peeking at the first few rows
The `head()` function shows you the first 6 rows of your data. This is one of the most-used commands in R for good reason -- it gives you a quick sense of what the data actually looks like without dumping the entire thing into your console. It is the data equivalent of skimming the executive summary before reading a 40-page report.
```{r}
head(iris)
```
Want to explore the full dataset interactively? The table below lets you search, sort, and page through all 150 rows -- try clicking a column header to sort, or typing "virginica" in the search box:
```{r}
DT::datatable(
  iris,
  options = list(pageLength = 10, autoWidth = TRUE),
  caption = "The complete iris dataset — search, sort, and explore."
)
```
## Just a taste — first 2 rows
You can specify exactly how many rows to see by passing a number as the second argument:
```{r}
head(iris, 2) # alternately, can use iris[1:2,]
```
## Checking the end -- last 6 rows
The `tail()` function is `head()`'s less popular but equally important sibling. It shows you the last rows of your dataset. Why bother? Because this is how you catch problems like data imports that got cut off early, or garbage rows that snuck in at the bottom of a CSV. Trust but verify, as they say.
```{r}
tail(iris) # Number of rows can be controlled, see earlier example involving the head command
```
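Mirroring the `head(iris, 2)` example, here is a minimal sketch of `tail()` with an explicit row count. The variable name `last_rows` is only for illustration.

```{r}
# Last 3 rows instead of the default 6
last_rows <- tail(iris, 3)
last_rows

# The original row numbers are kept, which makes truncated imports easy to spot
rownames(last_rows)
```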
## Grabbing a specific row
Need to look at one particular record? Maybe a customer complaint came in and you want to pull up their exact data point. You can index directly into the data frame. Row 1, all columns:
```{r}
iris[1,]
```
## A single cell -- row 1, column 1
You can drill all the way down to a single value. This is like zooming into one cell of a spreadsheet -- useful when you need to verify a specific number that does not look right in a report.
```{r}
iris[1,1]
```
## What is the third column called?
If you know a column by its position but not its name (happens more often than you would think, especially when you inherit someone else's dataset), you can look it up:
```{r}
names(iris)[3]
```
## Pulling a few entries from a specific column
Here we grab the first 3 values from the third column. Useful for a quick sanity check -- are these numbers? Text? Something weird like "N/A" spelled out as a string?
```{r}
head(iris[3], 3) # one-column data frame; iris[1:3, 3] returns the same values as a plain vector
```
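The different indexing styles do not all return the same kind of object, which matters once you pass the result to another function. A small sketch using `iris` (the variable names are illustrative only):

```{r}
col_df  <- iris[1:3, 3, drop = FALSE]  # stays a one-column data frame
col_vec <- iris[1:3, 3]                # drops to a plain numeric vector
by_name <- iris$Petal.Length[1:3]      # same values, selected by column name

class(col_df)
class(col_vec)
identical(col_vec, by_name)
```

Selecting by name is usually the safer habit: column positions shift when someone adds a field upstream, while names tend to stay put.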
## Understanding the structure of your data
This is where the detective work gets real. The `str()` function tells you the data type of every variable in your data frame. In the business world, you absolutely need to know the difference between a number (like revenue), a category (like customer segment), and text (like a product name) -- because R treats them very differently, and so should you.
For the most part, we will be concerned with `character`, `factor`, and `numeric` variables. Think of `factor` variables as categories -- things like customer segments, geographic regions, or satisfaction ratings on a 1-to-5 scale. Variables of type `numeric` are your quantitative measures -- revenue, age, order count, and so on. The `str()` function is your go-to for figuring out what R thinks each variable is (which, fair warning, is not always what you think it should be).
```{r}
str(iris)
```
Here is how to read this output. Each line shows one variable: the name (like `Sepal.Length`), the data type (`num` for numeric, `Factor` for categorical), and a preview of the first few values. The `Factor w/ 3 levels` line tells you that `Species` is a categorical variable with exactly three categories: setosa, versicolor, and virginica. If a variable shows up as `chr` (character) when you expected a number, that is your first clue that something went wrong during data import --- a common "gotcha" that `str()` helps you catch early.
If you have a factor variable and want to see all its categories, the `levels()` function has you covered:
```{r}
levels(iris$Species)
```
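To make the import "gotcha" concrete, here is a minimal sketch with a made-up `orders` data frame (the data and column names are hypothetical). One stray "N/A" string is enough to turn a numeric column into `chr`.

```{r}
# Hypothetical import result: revenue arrived as text because one cell says "N/A"
orders <- data.frame(
  segment = c("Premium", "Standard", "Standard"),
  revenue = c("1200", "N/A", "950"),
  stringsAsFactors = FALSE
)
str(orders)  # revenue shows up as chr, not num

# Recode the placeholder to a real missing value, then convert the types
orders$revenue[orders$revenue == "N/A"] <- NA
orders$revenue <- as.numeric(orders$revenue)
orders$segment <- factor(orders$segment)
str(orders)  # revenue is now num, segment is a Factor w/ 2 levels
```

Recoding the placeholder first keeps the conversion quiet. Calling `as.numeric()` directly on the raw strings would also produce `NA`, but with a coercion warning that is easy to overlook.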
## The quick-and-dirty summary
The `summary()` function is the workhorse for an initial look. For numeric variables it returns the mean, median, min, max, and quartile values. For factor variables it counts observations per category. One function call and you already know a surprising amount about a new dataset.
```{r}
summary(iris) # for factor/categorical variables, this gives a count of all categories
```
## Summarizing a single variable
You can also focus on just one variable at a time. Imagine you only care about one KPI -- say average order value or customer satisfaction score. Here is how you would zero in on it:
```{r}
summary(iris$Sepal.Length)
```
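The `iris` data has no missing values, so it hides one useful feature of `summary()`: it reports how many `NA`s a variable contains. A small sketch with a hypothetical `order_value` vector:

```{r}
# Hypothetical KPI vector with two missing values
order_value <- c(120, 85, NA, 240, 60, NA, 150)

summary(order_value)             # includes an NA's entry counting the missing values
sum(is.na(order_value))          # direct count of missing values

mean(order_value)                # NA, because missing values propagate
mean(order_value, na.rm = TRUE)  # mean of the five observed values
```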
## Another useful exploration tool: `glimpse()` and `tibble`
The `glimpse()` function from **dplyr** is another great way to explore your data -- it shows every column in a transposed format, making it easy to scan datasets that have a lot of variables. Think of it as the LinkedIn profile version of `str()` -- same information, better presentation.
```{r}
glimpse(iris)
```
### Converting to a tibble
Tibbles are a modern, improved version of data frames (see Chapter \@ref(tidyverse)). They print more cleanly and do not try to dump your entire dataset into the console. Converting to a tibble is like switching from a cluttered Excel spreadsheet to a well-formatted dashboard -- same data, way easier to read.
```{r}
iris_tbl <- as_tibble(iris)
iris_tbl
```
Notice how the tibble only shows the first 10 rows and tells you the dimensions up front --- no more scrolling through thousands of rows in your console.
### Selecting and viewing with dplyr
You can also use **dplyr** functions to explore specific slices of the data. Want to focus on just two columns? Or count how many observations fall into each group? These are exactly the kinds of questions you would ask about customer segments or product categories in a real business context:
```{r}
# View specific columns
iris_tbl %>% select(Species, Sepal.Length) %>% head()
```
```{r}
# Count observations per group
iris_tbl %>% count(Species)
```
All three species have exactly 50 observations each --- a perfectly balanced dataset. In real-world data, this kind of balance is rare. You are more likely to see one dominant group (like 90% "Standard" customers and 10% "Premium"), and spotting that imbalance early shapes every analysis decision downstream.
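Raw counts are easier to judge as shares of the total. A minimal sketch that adds a proportion column to the count, using the built-in `iris` data (`species_share` and `prop` are illustrative names):

```{r}
library(dplyr)

# Share of rows in each group
species_share <- iris %>%
  count(Species) %>%
  mutate(prop = n / sum(n))

species_share
```

On an imbalanced dataset the `prop` column makes a dominant group obvious at a glance.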
```{r}
# Quick summary by group
iris_tbl %>%
  group_by(Species) %>%
  summarize(
    n = n(),
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_length = mean(Petal.Length)
  ) %>%
  DT::datatable(
    options = list(pageLength = 5, dom = 't'),
    caption = "Summary by species — a quick grouped overview."
  )
```
This grouped summary reveals that the three species differ dramatically in mean petal length (from 1.46 cm for setosa to 5.55 cm for virginica) but less so in mean sepal length (5.01 to 6.59 cm). Even this quick table suggests that petal measurements are likely to be more useful than sepal measurements for distinguishing between species: a genuine analytical insight from a few lines of code.
### The `skimr` package
When you want a comprehensive single-call overview (inline histograms, top category counts, and missingness counts all at once), the **skimr** package is a strong choice. One function call gives you what would otherwise be a fifteen-minute manual audit.
```{r}
# install.packages("skimr")
library(skimr)
skim(iris)
```
The output includes small inline histograms for numeric variables and the most frequent categories for factors, plus a missing-value count and completeness rate for every column, so missing values cannot hide. If a stakeholder asks "what does this data look like?", this is the function that answers it in one line.
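If **skimr** is not installed, the missingness part of that audit can be approximated in base R. The sketch below uses the built-in `airquality` dataset, because `iris` has no missing values to find; `missing_per_column` is an illustrative name.

```{r}
# Count missing values in every column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Share of rows that are complete in every column
mean(complete.cases(airquality))
```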
::: {.callout-warning}
## AI Pitfall: AI skips the work this chapter is about
Ask an AI assistant to "compute the average sales by region for this dataset" and you will get a clean tidyverse pipeline back:
```r
df %>% group_by(region) %>% summarize(avg = mean(sales))
```
The code is correct. The code may also be useless, depending on what is in `df` — and the AI did none of the work to find out. Notice what is missing:
- No check for whether `region` has missing values, which `group_by()` does not drop: they come back as their own `NA` group that is easy to overlook or misread
- No `na.rm = TRUE` in `mean()`, which means the function returns `NA` for any region where even one row has a missing `sales` value
- No check of the levels of `region` — which may include "North" and "north " (trailing space) as distinct categories
- No check on balance — "average sales per region" is misleading if 95% of rows come from one region
The functions in this chapter — `dim()`, `head()`, `str()`, `glimpse()`, `summary()`, `skim()` — are the exploration step that comes *before* an analysis pipeline. AI tends to skip them. Run them yourself, every time, before you trust any code generated against an unfamiliar dataset.
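The checks listed above can be sketched as follows. The data frame `df`, its values, and the names `clean` and `region_summary` are all made up for illustration; the point is the order of operations, not the specific cleaning rules.

```{r}
library(dplyr)

# Hypothetical sales data with messy labels and missing values
df <- tibble(
  region = c("North", "north ", "South", "South", NA),
  sales  = c(100, 200, NA, 300, 50)
)

# 1. Inspect the labels and group sizes, including any NA group
df %>% count(region)

# 2. Normalize labels (trim whitespace, unify case) before grouping
clean <- df %>%
  mutate(region = tolower(trimws(region)))

# 3. Summarize with explicit NA handling and keep the counts visible
region_summary <- clean %>%
  group_by(region) %>%
  summarize(
    n = n(),
    n_missing_sales = sum(is.na(sales)),
    avg = mean(sales, na.rm = TRUE)
  )

region_summary
```

Keeping `n` and `n_missing_sales` next to the average means anyone reading the table can see how much data each number actually rests on.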
:::