Warning: package 'knitr' was built under R version 4.5.2
Code
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
comment = "#>", collapse = TRUE)Warning: package 'knitr' was built under R version 4.5.2
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
comment = "#>", collapse = TRUE)Before any analysis can happen, the data has to get into R. The mechanics trip people up more often than you would expect. CSVs arrive over email, Excel workbooks come with six tabs, research teams hand over mysterious .sav files from SPSS, and almost every one of those files has at least one quirk that breaks default settings somewhere. This chapter covers the readers you will actually use day-to-day and the failure modes you will hit if you skip the verification step.
When you ask an AI assistant to “load this CSV,” you reliably get a one-liner like df <- read_csv("file.csv"). It runs. The data is now in R. The temptation is to move on to analysis.
What the AI did not do — and what you must — is check what actually got read. Three failure modes that surface only in verification:
1.234,56 to mean 1234.56. read_csv() reads it as NA or as a character string. Numbers look fine in the column header, but mean() returns NA."15/03/2026" are read as strings unless you specify a col_types or use lubridate::dmy() after import. Filtering by date silently fails because string comparison is alphabetical.José becomes Jos\xe9. Stringr operations against these silently miss matches.Always run glimpse() and summary() immediately after import. If the column types or values do not match what you expected, fix them at the read step (using col_types, locale = locale(decimal_mark = ","), encoding = "Latin1", etc.) rather than papering over the problem downstream.
We’ll use three packages: readr (for CSV and other text files), readxl (for Excel), and haven (for SPSS/Stata/SAS files). The first one loads automatically with the tidyverse; the other two you load separately.
CSV files are the lingua franca of data. If someone says “I’ll send you the data,” there’s a 90% chance it’s a CSV. The function you want is read_csv() from the readr package.
Base R has read.csv(). It works, but read_csv() is better in almost every way:
Let’s see it in action. read_csv() can even read data from an inline string, which is handy for quick demos.
sample_data <- read_csv(
"name,age,score,enrolled
Alice,25,92.5,TRUE
Bob,30,88.0,TRUE
Charlie,22,95.3,FALSE
Diana,28,79.1,TRUE
Eve,35,85.6,FALSE"
)
sample_data
#> # A tibble: 5 × 4
#> name age score enrolled
#> <chr> <dbl> <dbl> <lgl>
#> 1 Alice 25 92.5 TRUE
#> 2 Bob 30 88 TRUE
#> 3 Charlie 22 95.3 FALSE
#> 4 Diana 28 79.1 TRUE
#> 5 Eve 35 85.6 FALSESee that message? It’s telling you it read name as text, age and score as numbers, and enrolled as TRUE/FALSE. That’s the kind of transparency you want.
col_names – By default, the first row is treated as column headers. If your file doesn’t have headers (looking at you, legacy systems), set col_names = FALSE or supply your own names.
na – Different systems represent missing data differently. Some use blanks, some use “N/A”, some use “.” because… reasons. Tell R what to look for.
skip – If the file has junk at the top (report headers, timestamps, disclaimers), skip those lines.
read_csv(
"This is a comment line\nAnother comment\nname,value\nAlice,10\nBob,20",
skip = 2
)
#> # A tibble: 2 × 2
#> name value
#> <chr> <dbl>
#> 1 Alice 10
#> 2 Bob 20n_max – Got a 2-million-row file and just want to peek at it? Read only the first few rows.
read_csv(
"name,value\nAlice,10\nBob,20\nCharlie,30\nDiana,40",
n_max = 2
)
#> # A tibble: 2 × 2
#> name value
#> <chr> <dbl>
#> 1 Alice 10
#> 2 Bob 20Sometimes R guesses wrong. The classic example: zip codes. If a zip code starts with 0 (like 02134 for Boston), R will read it as the number 2134 and silently drop the leading zero. Not great.
# Problem: leading zero gets eaten
problem_data <- read_csv(
"id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300"
)
problem_data
#> # A tibble: 3 × 3
#> id zipcode amount
#> <dbl> <chr> <dbl>
#> 1 1 02134 100
#> 2 2 10001 200
#> 3 3 90210 300The fix: tell R that zipcode is text, not a number.
fixed_data <- read_csv(
"id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300",
col_types = cols(zipcode = col_character())
)
fixed_data
#> # A tibble: 3 × 3
#> id zipcode amount
#> <dbl> <chr> <dbl>
#> 1 1 02134 100
#> 2 2 10001 200
#> 3 3 90210 300You can also use a compact shorthand where each letter represents a type: c = character, i = integer, d = double, l = logical, _ = skip the column entirely.
read_csv(
"id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
col_types = "icdc"
)
#> # A tibble: 2 × 4
#> id name score active
#> <int> <chr> <dbl> <chr>
#> 1 1 Alice 92.5 TRUE
#> 2 2 Bob 88 FALSEIf you only need certain columns from a massive file, cols_only() lets you cherry-pick.
read_csv(
"id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
col_types = cols_only(
name = col_character(),
score = col_double()
)
)
#> # A tibble: 2 × 2
#> name score
#> <chr> <dbl>
#> 1 Alice 92.5
#> 2 Bob 88Not everything is comma-separated. Some files use tabs, pipes, or semicolons. readr has you covered.
# Tab-separated
read_tsv(
"name\tage\tscore\nAlice\t25\t92.5\nBob\t30\t88.0"
)
#> # A tibble: 2 × 3
#> name age score
#> <chr> <dbl> <dbl>
#> 1 Alice 25 92.5
#> 2 Bob 30 88# Pipe-delimited (sometimes you'll see this from older database exports)
read_delim(
"name|age|score\nAlice|25|92.5\nBob|30|88.0",
delim = "|"
)
#> # A tibble: 2 × 3
#> name age score
#> <chr> <dbl> <dbl>
#> 1 Alice 25 92.5
#> 2 Bob 30 88If you’re working with European-format CSVs (semicolons as delimiters, commas as decimal marks), use read_csv2(). This comes up more than you’d expect in international business.
read_csv2(
"name;age;score\nAlice;25;92,5\nBob;30;88,0"
)
#> # A tibble: 2 × 3
#> name age score
#> <chr> <dbl> <dbl>
#> 1 Alice 25 92.5
#> 2 Bob 30 88Let’s be honest: in business, Excel is king. Your quarterly reports, your budget forecasts, your customer lists – they’re probably all in .xlsx files. The readxl package handles these beautifully.
read_excel() auto-detects whether a file is .xls or .xlsx and just works.
# readxl comes with example files for practice
datasets_path <- readxl_example("datasets.xlsx")
datasets_data <- read_excel(datasets_path)
datasets_data
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rowsExcel workbooks usually have multiple sheets – maybe “Q1 Sales,” “Q2 Sales,” “Summary.” You can pick which sheet to read.
# See what sheets are available
excel_sheets(datasets_path)
#> [1] "mtcars" "chickwts" "quakes"# Read a specific sheet by name
mtcars_data <- read_excel(datasets_path, sheet = "mtcars")
head(mtcars_data)
#> # A tibble: 6 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1# Or by position (the 3rd sheet)
quakes_data <- read_excel(datasets_path, sheet = 3)
head(quakes_data)
#> # A tibble: 6 × 5
#> lat long depth mag stations
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -20.4 182. 562 4.8 41
#> 2 -20.6 181. 650 4.2 15
#> 3 -26 184. 42 5.4 43
#> 4 -18.0 182. 626 4.1 19
#> 5 -20.4 182. 649 4 11
#> 6 -19.7 184. 195 4 12Sometimes you don’t want the whole sheet. Maybe the data you need is in cells B2:F50, and there’s a bunch of formatting garbage around it. The range argument uses Excel-style cell references, so it feels familiar.
# Read a specific cell range
read_excel(datasets_path, range = "mtcars!B1:D5")
#> # A tibble: 4 × 3
#> cyl disp hp
#> <dbl> <dbl> <dbl>
#> 1 6 160 110
#> 2 6 160 110
#> 3 4 108 93
#> 4 6 258 110# Read specific rows
read_excel(datasets_path, sheet = "mtcars", range = cell_rows(1:6))
#> # A tibble: 5 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2# Read specific columns
read_excel(datasets_path, sheet = "mtcars", range = cell_cols("A:C"))
#> # A tibble: 32 × 3
#> mpg cyl disp
#> <dbl> <dbl> <dbl>
#> 1 21 6 160
#> 2 21 6 160
#> 3 22.8 4 108
#> 4 21.4 6 258
#> 5 18.7 8 360
#> 6 18.1 6 225
#> 7 14.3 8 360
#> 8 24.4 4 147.
#> 9 22.8 4 141.
#> 10 19.2 6 168.
#> # ℹ 22 more rowsJust like read_csv(), you can use na, skip, col_names, and n_max to handle messy Excel files.
read_excel(
datasets_path,
sheet = "chickwts",
na = c("", "NA"),
skip = 0,
n_max = 5
)
#> # A tibble: 5 × 2
#> weight feed
#> <dbl> <chr>
#> 1 179 horsebean
#> 2 160 horsebean
#> 3 136 horsebean
#> 4 227 horsebean
#> 5 217 horsebeanIf someone hands you a .sav file (SPSS), a .dta file (Stata), or a .sas7bdat file (SAS), don’t panic. The haven package reads all of them.
These files often have special “value labels” (like 1 = “Male”, 2 = “Female”). Haven preserves those as a special labelled class. To convert them to regular R factors, use as_factor().
You probably won’t use these every day, but it’s nice to know they exist when you need them.
R has two native formats for saving data: .RData and .rds. Here’s the quick version:
.RData files can store multiple objects. You load them with load(), and they pop back into your environment with their original names. The catch? They can overwrite things without warning..rds files store a single object. You read them with readRDS() and assign the result to whatever name you want. Much safer.# Saving and loading .RData
x <- 1:10
y <- data.frame(a = 1:3, b = letters[1:3])
save(x, y, file = "my_objects.RData")
load("my_objects.RData") # x and y are restored
# Saving and loading .rds (preferred)
my_data <- tibble(name = c("Alice", "Bob"), score = c(92.5, 88.0))
saveRDS(my_data, file = "my_data.rds")
restored_data <- readRDS("my_data.rds")For day-to-day work, prefer .rds. It’s explicit, it’s safe, and it won’t accidentally overwrite your variables.
You can also go the other direction – save your clean data for someone else (or for future you).
my_data <- tibble(
name = c("Alice", "Bob", "Charlie"),
score = c(92.5, 88.0, 95.3),
date = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10"))
)
# Write a CSV
write_csv(my_data, "my_data.csv")
# If someone will open it in Excel, use this version
# (it handles special characters better)
write_excel_csv(my_data, "my_data_excel.csv")
# Save as an R object (preserves column types perfectly)
write_rds(my_data, "my_data.rds")One thing to watch out for: when you save to CSV, you lose type information. Factors become plain text, dates become strings. If you need to preserve everything exactly, use write_rds().
Real-world data is messy. Here are the problems you’ll hit most often and how to fix them.
Missing values in disguise. Different systems use different codes for “no data.” Check your file and tell R what to look for.
Junk at the top of the file. Reports exported from enterprise systems love to put timestamps and headers above the actual data. Use skip or comment.
messy_file <- read_csv(
"Report generated on 2025-01-15\nSource: Internal Database\n\nname,score\nAlice,92\nBob,88\nCharlie,95",
skip = 3
)
messy_file
#> # A tibble: 3 × 2
#> name score
#> <chr> <dbl>
#> 1 Alice 92
#> 2 Bob 88
#> 3 Charlie 95Reading from a URL. All readr functions can read directly from web URLs. Very handy for publicly available datasets.
url_data <- read_csv("https://raw.githubusercontent.com/tidyverse/readr/main/inst/extdata/mtcars.csv")Encoding problems. If your data looks like garbled nonsense (especially with accented characters or non-English text), it’s probably an encoding issue.
# Try specifying the encoding
latin_data <- read_csv("data_latin1.csv", locale = locale(encoding = "Latin1"))
# Or let R guess
guess_encoding("mystery_file.csv")In the real world, you’ll often have data split across multiple files – monthly sales reports, regional data dumps, weekly survey exports. You need to read them all and stack them together.
The easiest approach (readr 2.0+) is to hand read_csv() a vector of file paths directly.
csv_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
all_data <- read_csv(csv_files, id = "source_file")The id argument adds a column telling you which file each row came from. Very useful for debugging.
Here’s a self-contained example showing the pattern:
file1 <- "name,score\nAlice,92\nBob,88"
file2 <- "name,score\nCharlie,95\nDiana,79"
file3 <- "name,score\nEve,85\nFrank,91"
files <- list(file1, file2, file3)
combined_data <- files %>%
set_names(paste0("file_", 1:3)) %>%
map(read_csv) %>%
list_rbind(names_to = "file_number")
combined_data
#> # A tibble: 6 × 3
#> file_number name score
#> <chr> <chr> <dbl>
#> 1 file_1 Alice 92
#> 2 file_1 Bob 88
#> 3 file_2 Charlie 95
#> 4 file_2 Diana 79
#> 5 file_3 Eve 85
#> 6 file_3 Frank 91Here’s your cheat sheet for getting data into R:
read_csv() from readr. This is your bread and butter.read_excel() from readxl. Supports sheets, cell ranges, all the good stuff.read_sav(), read_dta(), read_sas() from haven. There when you need them.readRDS() for .rds, load() for .RData.write_csv() for CSVs, write_rds() to preserve R types.read_csv(), or use map() with list_rbind().The most important habit: always inspect your data right after importing it. Check the column types, look for unexpected NA values, make sure nothing got mangled. Five minutes of checking now saves five hours of debugging later.
---
title: "Data Import"
---
# Data Import {#dataimport}
```{r}
library(knitr)
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
comment = "#>", collapse = TRUE)
```
Before any analysis can happen, the data has to get into R. The mechanics trip people up more often than you would expect. CSVs arrive over email, Excel workbooks come with six tabs, research teams hand over mysterious `.sav` files from SPSS, and almost every one of those files has at least one quirk that breaks default settings somewhere. This chapter covers the readers you will actually use day-to-day and the failure modes you will hit if you skip the verification step.
::: {.callout-warning}
## AI Pitfall: AI reads files without checking what came in
When you ask an AI assistant to "load this CSV," you reliably get a one-liner like `df <- read_csv("file.csv")`. It runs. The data is now in R. The temptation is to move on to analysis.
What the AI did *not* do — and what you must — is check what actually got read. Three failure modes that surface only in verification:
1. **Locale / decimal separator mismatch.** A CSV from a European source uses `1.234,56` to mean 1234.56. `read_csv()` reads it as `NA` or as a character string. Numbers look fine in the column header, but `mean()` returns `NA`.
2. **Date columns parsed as character.** Dates that come in as `"15/03/2026"` are read as strings unless you specify a `col_types` or use `lubridate::dmy()` after import. Filtering by date silently fails because string comparison is alphabetical.
3. **Encoding artifacts.** Files saved on Windows in non-UTF-8 encoding bring in weird characters where accented letters used to be. `José` becomes `Jos\xe9`. Stringr operations against these silently miss matches.
Always run `glimpse()` and `summary()` immediately after import. If the column types or values do not match what you expected, fix them at the read step (using `col_types`, `locale = locale(decimal_mark = ",")`, `encoding = "Latin1"`, etc.) rather than papering over the problem downstream.
:::
We'll use three packages: **readr** (for CSV and other text files), **readxl** (for Excel), and **haven** (for SPSS/Stata/SAS files). The first one loads automatically with the tidyverse; the other two you load separately.
```{r}
library(tidyverse)
library(readxl)
library(haven)
```
## Reading CSV Files {#csv}
CSV files are the lingua franca of data. If someone says "I'll send you the data," there's a 90% chance it's a CSV. The function you want is `read_csv()` from the readr package.
### Why read_csv() Instead of read.csv()?
Base R has `read.csv()`. It works, but `read_csv()` is better in almost every way:
- It's **faster**, especially on large files (like that 500,000-row sales export).
- It returns a **tibble** instead of a plain data frame, so it prints nicely.
- It **doesn't silently convert your text to factors**. (If you've ever had "Q1 Revenue" turn into a mystery number, you know the pain.)
- It **tells you how it parsed each column**, so you can catch problems early.
Let's see it in action. `read_csv()` can even read data from an inline string, which is handy for quick demos.
```{r}
sample_data <- read_csv(
"name,age,score,enrolled
Alice,25,92.5,TRUE
Bob,30,88.0,TRUE
Charlie,22,95.3,FALSE
Diana,28,79.1,TRUE
Eve,35,85.6,FALSE"
)
sample_data
```
See that message? It's telling you it read `name` as text, `age` and `score` as numbers, and `enrolled` as TRUE/FALSE. That's the kind of transparency you want.
### The Arguments You'll Actually Use
**col_names** -- By default, the first row is treated as column headers. If your file doesn't have headers (looking at you, legacy systems), set `col_names = FALSE` or supply your own names.
```{r}
read_csv(
"1,2,3\n4,5,6\n7,8,9",
col_names = c("x", "y", "z")
)
```
**na** -- Different systems represent missing data differently. Some use blanks, some use "N/A", some use "." because... reasons. Tell R what to look for.
```{r}
read_csv(
"name,value\nAlice,10\nBob,N/A\nCharlie,.",
na = c("", "NA", "N/A", ".")
)
```
**skip** -- If the file has junk at the top (report headers, timestamps, disclaimers), skip those lines.
```{r}
read_csv(
"This is a comment line\nAnother comment\nname,value\nAlice,10\nBob,20",
skip = 2
)
```
**n_max** -- Got a 2-million-row file and just want to peek at it? Read only the first few rows.
```{r}
read_csv(
"name,value\nAlice,10\nBob,20\nCharlie,30\nDiana,40",
n_max = 2
)
```
### Specifying Column Types
Sometimes R guesses wrong. The classic example: zip codes. If a zip code starts with 0 (like 02134 for Boston), R will read it as the number 2134 and silently drop the leading zero. Not great.
```{r}
# Problem: leading zero gets eaten
problem_data <- read_csv(
"id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300"
)
problem_data
```
The fix: tell R that `zipcode` is text, not a number.
```{r}
fixed_data <- read_csv(
"id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300",
col_types = cols(zipcode = col_character())
)
fixed_data
```
You can also use a compact shorthand where each letter represents a type: `c` = character, `i` = integer, `d` = double, `l` = logical, `_` = skip the column entirely.
```{r}
read_csv(
"id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
col_types = "icdc"
)
```
If you only need certain columns from a massive file, `cols_only()` lets you cherry-pick.
```{r}
read_csv(
"id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
col_types = cols_only(
name = col_character(),
score = col_double()
)
)
```
## Other Delimited Files
Not everything is comma-separated. Some files use tabs, pipes, or semicolons. readr has you covered.
```{r}
# Tab-separated
read_tsv(
"name\tage\tscore\nAlice\t25\t92.5\nBob\t30\t88.0"
)
```
```{r}
# Pipe-delimited (sometimes you'll see this from older database exports)
read_delim(
"name|age|score\nAlice|25|92.5\nBob|30|88.0",
delim = "|"
)
```
If you're working with European-format CSVs (semicolons as delimiters, commas as decimal marks), use `read_csv2()`. This comes up more than you'd expect in international business.
```{r}
read_csv2(
"name;age;score\nAlice;25;92,5\nBob;30;88,0"
)
```
## Reading Excel Files {#excel}
Let's be honest: in business, Excel is king. Your quarterly reports, your budget forecasts, your customer lists -- they're probably all in `.xlsx` files. The **readxl** package handles these beautifully.
### The Basics
`read_excel()` auto-detects whether a file is `.xls` or `.xlsx` and just works.
```{r}
# readxl comes with example files for practice
datasets_path <- readxl_example("datasets.xlsx")
datasets_data <- read_excel(datasets_path)
datasets_data
```
### Working with Sheets
Excel workbooks usually have multiple sheets -- maybe "Q1 Sales," "Q2 Sales," "Summary." You can pick which sheet to read.
```{r}
# See what sheets are available
excel_sheets(datasets_path)
```
```{r}
# Read a specific sheet by name
mtcars_data <- read_excel(datasets_path, sheet = "mtcars")
head(mtcars_data)
```
```{r}
# Or by position (the 3rd sheet)
quakes_data <- read_excel(datasets_path, sheet = 3)
head(quakes_data)
```
### Reading Specific Cell Ranges
Sometimes you don't want the whole sheet. Maybe the data you need is in cells B2:F50, and there's a bunch of formatting garbage around it. The `range` argument uses Excel-style cell references, so it feels familiar.
```{r}
# Read a specific cell range
read_excel(datasets_path, range = "mtcars!B1:D5")
```
```{r}
# Read specific rows
read_excel(datasets_path, sheet = "mtcars", range = cell_rows(1:6))
```
```{r}
# Read specific columns
read_excel(datasets_path, sheet = "mtcars", range = cell_cols("A:C"))
```
Just like `read_csv()`, you can use `na`, `skip`, `col_names`, and `n_max` to handle messy Excel files.
```{r}
read_excel(
datasets_path,
sheet = "chickwts",
na = c("", "NA"),
skip = 0,
n_max = 5
)
```
## SPSS, Stata, and SAS Files
If someone hands you a `.sav` file (SPSS), a `.dta` file (Stata), or a `.sas7bdat` file (SAS), don't panic. The **haven** package reads all of them.
```{r}
#| eval: false
# SPSS
spss_data <- read_sav("path/to/datafile.sav")
# Stata
stata_data <- read_dta("path/to/datafile.dta")
# SAS
sas_data <- read_sas("path/to/datafile.sas7bdat")
```
These files often have special "value labels" (like 1 = "Male", 2 = "Female"). Haven preserves those as a special `labelled` class. To convert them to regular R factors, use `as_factor()`.
```{r}
#| eval: false
spss_data <- read_sav("path/to/datafile.sav") %>%
mutate(across(where(is.labelled), as_factor))
```
You probably won't use these every day, but it's nice to know they exist when you need them.
## R's Own File Formats
R has two native formats for saving data: `.RData` and `.rds`. Here's the quick version:
- **`.RData`** files can store multiple objects. You load them with `load()`, and they pop back into your environment with their original names. The catch? They can overwrite things without warning.
- **`.rds`** files store a single object. You read them with `readRDS()` and assign the result to whatever name you want. Much safer.
```{r}
# Saving and loading .RData
x <- 1:10
y <- data.frame(a = 1:3, b = letters[1:3])
save(x, y, file = "my_objects.RData")
load("my_objects.RData") # x and y are restored
# Saving and loading .rds (preferred)
my_data <- tibble(name = c("Alice", "Bob"), score = c(92.5, 88.0))
saveRDS(my_data, file = "my_data.rds")
restored_data <- readRDS("my_data.rds")
```
For day-to-day work, prefer `.rds`. It's explicit, it's safe, and it won't accidentally overwrite your variables.
## Writing Data Out
You can also go the other direction -- save your clean data for someone else (or for future you).
```{r}
my_data <- tibble(
name = c("Alice", "Bob", "Charlie"),
score = c(92.5, 88.0, 95.3),
date = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10"))
)
# Write a CSV
write_csv(my_data, "my_data.csv")
# If someone will open it in Excel, use this version
# (it handles special characters better)
write_excel_csv(my_data, "my_data_excel.csv")
# Save as an R object (preserves column types perfectly)
write_rds(my_data, "my_data.rds")
```
One thing to watch out for: when you save to CSV, you lose type information. Factors become plain text, dates become strings. If you need to preserve everything exactly, use `write_rds()`.
## Handling Common Import Problems
Real-world data is messy. Here are the problems you'll hit most often and how to fix them.
**Missing values in disguise.** Different systems use different codes for "no data." Check your file and tell R what to look for.
```{r}
read_csv(
"name,value\nAlice,10\nBob,\nCharlie,NA\nDiana,N/A\nEve,-99",
na = c("", "NA", "N/A", "-99")
)
```
**Junk at the top of the file.** Reports exported from enterprise systems love to put timestamps and headers above the actual data. Use `skip` or `comment`.
```{r}
messy_file <- read_csv(
"Report generated on 2025-01-15\nSource: Internal Database\n\nname,score\nAlice,92\nBob,88\nCharlie,95",
skip = 3
)
messy_file
```
**Reading from a URL.** All readr functions can read directly from web URLs. Very handy for publicly available datasets.
```{r}
url_data <- read_csv("https://raw.githubusercontent.com/tidyverse/readr/main/inst/extdata/mtcars.csv")
```
**Encoding problems.** If your data looks like garbled nonsense (especially with accented characters or non-English text), it's probably an encoding issue.
```{r}
#| eval: false
# Try specifying the encoding
latin_data <- read_csv("data_latin1.csv", locale = locale(encoding = "Latin1"))
# Or let R guess
guess_encoding("mystery_file.csv")
```
## Reading Multiple Files at Once
In the real world, you'll often have data split across multiple files -- monthly sales reports, regional data dumps, weekly survey exports. You need to read them all and stack them together.
The easiest approach (readr 2.0+) is to hand `read_csv()` a vector of file paths directly.
```{r}
#| eval: false
csv_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
all_data <- read_csv(csv_files, id = "source_file")
```
The `id` argument adds a column telling you which file each row came from. Very useful for debugging.
Here's a self-contained example showing the pattern:
```{r}
file1 <- "name,score\nAlice,92\nBob,88"
file2 <- "name,score\nCharlie,95\nDiana,79"
file3 <- "name,score\nEve,85\nFrank,91"
files <- list(file1, file2, file3)
combined_data <- files %>%
set_names(paste0("file_", 1:3)) %>%
map(read_csv) %>%
list_rbind(names_to = "file_number")
combined_data
```
## Summary
Here's your cheat sheet for getting data into R:
- **CSV files**: `read_csv()` from readr. This is your bread and butter.
- **Excel files**: `read_excel()` from readxl. Supports sheets, cell ranges, all the good stuff.
- **SPSS/Stata/SAS**: `read_sav()`, `read_dta()`, `read_sas()` from haven. There when you need them.
- **R native files**: `readRDS()` for `.rds`, `load()` for `.RData`.
- **Writing data**: `write_csv()` for CSVs, `write_rds()` to preserve R types.
- **Multiple files**: Pass a vector of paths to `read_csv()`, or use `map()` with `list_rbind()`.
The most important habit: always inspect your data right after importing it. Check the column types, look for unexpected `NA` values, make sure nothing got mangled. Five minutes of checking now saves five hours of debugging later.