7 Data Import

8 Data Import

Code

library(knitr)

Warning: package 'knitr' was built under R version 4.5.2

Code

opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
 comment = "#>", collapse = TRUE)

Before any analysis can happen, the data has to get into R. The mechanics trip people up more often than you would expect. CSVs arrive over email, Excel workbooks come with six tabs, research teams hand over mysterious .sav files from SPSS, and almost every one of those files has at least one quirk that breaks default settings somewhere. This chapter covers the readers you will actually use day-to-day and the failure modes you will hit if you skip the verification step.

AI Pitfall: AI reads files without checking what came in

When you ask an AI assistant to “load this CSV,” you reliably get a one-liner like df <- read_csv("file.csv"). It runs. The data is now in R. The temptation is to move on to analysis.

What the AI did not do — and what you must — is check what actually got read. Three failure modes that surface only in verification:

Locale / decimal separator mismatch. A CSV from a European source uses 1.234,56 to mean 1234.56. read_csv() reads it as NA or as a character string. Numbers look fine in the column header, but mean() returns NA.
Date columns parsed as character. Dates that come in as "15/03/2026" are read as strings unless you specify a col_types or use lubridate::dmy() after import. Filtering by date silently fails because string comparison is alphabetical.
Encoding artifacts. Files saved on Windows in non-UTF-8 encoding bring in weird characters where accented letters used to be. José becomes Jos\xe9. Stringr operations against these silently miss matches.

Always run glimpse() and summary() immediately after import. If the column types or values do not match what you expected, fix them at the read step (using col_types, locale = locale(decimal_mark = ","), encoding = "Latin1", etc.) rather than papering over the problem downstream.

We’ll use three packages: readr (for CSV and other text files), readxl (for Excel), and haven (for SPSS/Stata/SAS files). The first one loads automatically with the tidyverse; the other two you load separately.

Code

library(tidyverse)
library(readxl)
library(haven)

8.1 Reading CSV Files

CSV files are the lingua franca of data. If someone says “I’ll send you the data,” there’s a 90% chance it’s a CSV. The function you want is read_csv() from the readr package.

8.1.1 Why read_csv() Instead of read.csv()?

Base R has read.csv(). It works, but read_csv() is better in almost every way:

It’s faster, especially on large files (like that 500,000-row sales export).
It returns a tibble instead of a plain data frame, so it prints nicely.
It doesn’t silently convert your text to factors. (If you’ve ever had “Q1 Revenue” turn into a mystery number, you know the pain.)
It tells you how it parsed each column, so you can catch problems early.

Let’s see it in action. read_csv() can even read data from an inline string, which is handy for quick demos.

Code

sample_data <- read_csv(
 "name,age,score,enrolled
 Alice,25,92.5,TRUE
 Bob,30,88.0,TRUE
 Charlie,22,95.3,FALSE
 Diana,28,79.1,TRUE
 Eve,35,85.6,FALSE"
)

sample_data
#> # A tibble: 5 × 4
#>   name      age score enrolled
#>   <chr>   <dbl> <dbl> <lgl>   
#> 1 Alice      25  92.5 TRUE    
#> 2 Bob        30  88   TRUE    
#> 3 Charlie    22  95.3 FALSE   
#> 4 Diana      28  79.1 TRUE    
#> 5 Eve        35  85.6 FALSE

See that message? It’s telling you it read name as text, age and score as numbers, and enrolled as TRUE/FALSE. That’s the kind of transparency you want.

8.1.2 The Arguments You’ll Actually Use

col_names – By default, the first row is treated as column headers. If your file doesn’t have headers (looking at you, legacy systems), set col_names = FALSE or supply your own names.

Code

read_csv(
 "1,2,3\n4,5,6\n7,8,9",
 col_names = c("x", "y", "z")
)
#> # A tibble: 3 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
#> 3     7     8     9

na – Different systems represent missing data differently. Some use blanks, some use “N/A”, some use “.” because… reasons. Tell R what to look for.

Code

read_csv(
 "name,value\nAlice,10\nBob,N/A\nCharlie,.",
 na = c("", "NA", "N/A", ".")
)
#> # A tibble: 3 × 2
#>   name    value
#>   <chr>   <dbl>
#> 1 Alice      10
#> 2 Bob        NA
#> 3 Charlie    NA

skip – If the file has junk at the top (report headers, timestamps, disclaimers), skip those lines.

Code

read_csv(
 "This is a comment line\nAnother comment\nname,value\nAlice,10\nBob,20",
 skip = 2
)
#> # A tibble: 2 × 2
#>   name  value
#>   <chr> <dbl>
#> 1 Alice    10
#> 2 Bob      20

n_max – Got a 2-million-row file and just want to peek at it? Read only the first few rows.

Code

read_csv(
 "name,value\nAlice,10\nBob,20\nCharlie,30\nDiana,40",
 n_max = 2
)
#> # A tibble: 2 × 2
#>   name  value
#>   <chr> <dbl>
#> 1 Alice    10
#> 2 Bob      20

8.1.3 Specifying Column Types

Sometimes R guesses wrong. The classic example: zip codes. If a zip code starts with 0 (like 02134 for Boston), R will read it as the number 2134 and silently drop the leading zero. Not great.

Code

# Problem: leading zero gets eaten
problem_data <- read_csv(
 "id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300"
)

problem_data
#> # A tibble: 3 × 3
#>      id zipcode amount
#>   <dbl> <chr>    <dbl>
#> 1     1 02134      100
#> 2     2 10001      200
#> 3     3 90210      300

The fix: tell R that zipcode is text, not a number.

Code

fixed_data <- read_csv(
 "id,zipcode,amount\n1,02134,100\n2,10001,200\n3,90210,300",
 col_types = cols(zipcode = col_character())
)

fixed_data
#> # A tibble: 3 × 3
#>      id zipcode amount
#>   <dbl> <chr>    <dbl>
#> 1     1 02134      100
#> 2     2 10001      200
#> 3     3 90210      300

You can also use a compact shorthand where each letter represents a type: c = character, i = integer, d = double, l = logical, _ = skip the column entirely.

Code

read_csv(
 "id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
 col_types = "icdc"
)
#> # A tibble: 2 × 4
#>      id name  score active
#>   <int> <chr> <dbl> <chr> 
#> 1     1 Alice  92.5 TRUE  
#> 2     2 Bob    88   FALSE

If you only need certain columns from a massive file, cols_only() lets you cherry-pick.

Code

read_csv(
 "id,name,score,active\n1,Alice,92.5,TRUE\n2,Bob,88.0,FALSE",
 col_types = cols_only(
 name = col_character(),
 score = col_double()
 )
)
#> # A tibble: 2 × 2
#>   name  score
#>   <chr> <dbl>
#> 1 Alice  92.5
#> 2 Bob    88

8.2 Other Delimited Files

Not everything is comma-separated. Some files use tabs, pipes, or semicolons. readr has you covered.

Code

# Tab-separated
read_tsv(
 "name\tage\tscore\nAlice\t25\t92.5\nBob\t30\t88.0"
)
#> # A tibble: 2 × 3
#>   name    age score
#>   <chr> <dbl> <dbl>
#> 1 Alice    25  92.5
#> 2 Bob      30  88

Code

# Pipe-delimited (sometimes you'll see this from older database exports)
read_delim(
 "name|age|score\nAlice|25|92.5\nBob|30|88.0",
 delim = "|"
)
#> # A tibble: 2 × 3
#>   name    age score
#>   <chr> <dbl> <dbl>
#> 1 Alice    25  92.5
#> 2 Bob      30  88

If you’re working with European-format CSVs (semicolons as delimiters, commas as decimal marks), use read_csv2(). This comes up more than you’d expect in international business.

Code

read_csv2(
 "name;age;score\nAlice;25;92,5\nBob;30;88,0"
)
#> # A tibble: 2 × 3
#>   name    age score
#>   <chr> <dbl> <dbl>
#> 1 Alice    25  92.5
#> 2 Bob      30  88

8.3 Reading Excel Files

Let’s be honest: in business, Excel is king. Your quarterly reports, your budget forecasts, your customer lists – they’re probably all in .xlsx files. The readxl package handles these beautifully.

8.3.1 The Basics

read_excel() auto-detects whether a file is .xls or .xlsx and just works.

Code

# readxl comes with example files for practice
datasets_path <- readxl_example("datasets.xlsx")
datasets_data <- read_excel(datasets_path)
datasets_data
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

8.3.2 Working with Sheets

Excel workbooks usually have multiple sheets – maybe “Q1 Sales,” “Q2 Sales,” “Summary.” You can pick which sheet to read.

Code

# See what sheets are available
excel_sheets(datasets_path)
#> [1] "mtcars"   "chickwts" "quakes"

Code

# Read a specific sheet by name
mtcars_data <- read_excel(datasets_path, sheet = "mtcars")
head(mtcars_data)
#> # A tibble: 6 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

Code

# Or by position (the 3rd sheet)
quakes_data <- read_excel(datasets_path, sheet = 3)
head(quakes_data)
#> # A tibble: 6 × 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> 4 -18.0  182.   626   4.1       19
#> 5 -20.4  182.   649   4         11
#> 6 -19.7  184.   195   4         12

8.3.3 Reading Specific Cell Ranges

Sometimes you don’t want the whole sheet. Maybe the data you need is in cells B2:F50, and there’s a bunch of formatting garbage around it. The range argument uses Excel-style cell references, so it feels familiar.

Code

# Read a specific cell range
read_excel(datasets_path, range = "mtcars!B1:D5")
#> # A tibble: 4 × 3
#>     cyl  disp    hp
#>   <dbl> <dbl> <dbl>
#> 1     6   160   110
#> 2     6   160   110
#> 3     4   108    93
#> 4     6   258   110

Code

# Read specific rows
read_excel(datasets_path, sheet = "mtcars", range = cell_rows(1:6))
#> # A tibble: 5 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

Code

# Read specific columns
read_excel(datasets_path, sheet = "mtcars", range = cell_cols("A:C"))
#> # A tibble: 32 × 3
#>      mpg   cyl  disp
#>    <dbl> <dbl> <dbl>
#>  1  21       6  160 
#>  2  21       6  160 
#>  3  22.8     4  108 
#>  4  21.4     6  258 
#>  5  18.7     8  360 
#>  6  18.1     6  225 
#>  7  14.3     8  360 
#>  8  24.4     4  147.
#>  9  22.8     4  141.
#> 10  19.2     6  168.
#> # ℹ 22 more rows

Just like read_csv(), you can use na, skip, col_names, and n_max to handle messy Excel files.

Code

read_excel(
 datasets_path,
 sheet = "chickwts",
 na = c("", "NA"),
 skip = 0,
 n_max = 5
)
#> # A tibble: 5 × 2
#>   weight feed     
#>    <dbl> <chr>    
#> 1    179 horsebean
#> 2    160 horsebean
#> 3    136 horsebean
#> 4    227 horsebean
#> 5    217 horsebean

8.4 SPSS, Stata, and SAS Files

If someone hands you a .sav file (SPSS), a .dta file (Stata), or a .sas7bdat file (SAS), don’t panic. The haven package reads all of them.

Code

# SPSS
spss_data <- read_sav("path/to/datafile.sav")

# Stata
stata_data <- read_dta("path/to/datafile.dta")

# SAS
sas_data <- read_sas("path/to/datafile.sas7bdat")

These files often have special “value labels” (like 1 = “Male”, 2 = “Female”). Haven preserves those as a special labelled class. To convert them to regular R factors, use as_factor().

Code

spss_data <- read_sav("path/to/datafile.sav") %>%
 mutate(across(where(is.labelled), as_factor))

You probably won’t use these every day, but it’s nice to know they exist when you need them.

8.5 R’s Own File Formats

R has two native formats for saving data: .RData and .rds. Here’s the quick version:

.RData files can store multiple objects. You load them with load(), and they pop back into your environment with their original names. The catch? They can overwrite things without warning.
.rds files store a single object. You read them with readRDS() and assign the result to whatever name you want. Much safer.

Code

# Saving and loading .RData
x <- 1:10
y <- data.frame(a = 1:3, b = letters[1:3])
save(x, y, file = "my_objects.RData")
load("my_objects.RData") # x and y are restored

# Saving and loading .rds (preferred)
my_data <- tibble(name = c("Alice", "Bob"), score = c(92.5, 88.0))
saveRDS(my_data, file = "my_data.rds")
restored_data <- readRDS("my_data.rds")

For day-to-day work, prefer .rds. It’s explicit, it’s safe, and it won’t accidentally overwrite your variables.

8.6 Writing Data Out

You can also go the other direction – save your clean data for someone else (or for future you).

Code

my_data <- tibble(
 name = c("Alice", "Bob", "Charlie"),
 score = c(92.5, 88.0, 95.3),
 date = as.Date(c("2025-01-15", "2025-02-20", "2025-03-10"))
)

# Write a CSV
write_csv(my_data, "my_data.csv")

# If someone will open it in Excel, use this version
# (it handles special characters better)
write_excel_csv(my_data, "my_data_excel.csv")

# Save as an R object (preserves column types perfectly)
write_rds(my_data, "my_data.rds")

One thing to watch out for: when you save to CSV, you lose type information. Factors become plain text, dates become strings. If you need to preserve everything exactly, use write_rds().

8.7 Handling Common Import Problems

Real-world data is messy. Here are the problems you’ll hit most often and how to fix them.

Missing values in disguise. Different systems use different codes for “no data.” Check your file and tell R what to look for.

Code

read_csv(
 "name,value\nAlice,10\nBob,\nCharlie,NA\nDiana,N/A\nEve,-99",
 na = c("", "NA", "N/A", "-99")
)
#> # A tibble: 5 × 2
#>   name    value
#>   <chr>   <dbl>
#> 1 Alice      10
#> 2 Bob        NA
#> 3 Charlie    NA
#> 4 Diana      NA
#> 5 Eve        NA

Junk at the top of the file. Reports exported from enterprise systems love to put timestamps and headers above the actual data. Use skip or comment.

Code

messy_file <- read_csv(
 "Report generated on 2025-01-15\nSource: Internal Database\n\nname,score\nAlice,92\nBob,88\nCharlie,95",
 skip = 3
)

messy_file
#> # A tibble: 3 × 2
#>   name    score
#>   <chr>   <dbl>
#> 1 Alice      92
#> 2 Bob        88
#> 3 Charlie    95

Reading from a URL. All readr functions can read directly from web URLs. Very handy for publicly available datasets.

Code

url_data <- read_csv("https://raw.githubusercontent.com/tidyverse/readr/main/inst/extdata/mtcars.csv")

Encoding problems. If your data looks like garbled nonsense (especially with accented characters or non-English text), it’s probably an encoding issue.

Code

# Try specifying the encoding
latin_data <- read_csv("data_latin1.csv", locale = locale(encoding = "Latin1"))

# Or let R guess
guess_encoding("mystery_file.csv")

8.8 Reading Multiple Files at Once

In the real world, you’ll often have data split across multiple files – monthly sales reports, regional data dumps, weekly survey exports. You need to read them all and stack them together.

The easiest approach (readr 2.0+) is to hand read_csv() a vector of file paths directly.

Code

csv_files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
all_data <- read_csv(csv_files, id = "source_file")

The id argument adds a column telling you which file each row came from. Very useful for debugging.

Here’s a self-contained example showing the pattern:

Code

file1 <- "name,score\nAlice,92\nBob,88"
file2 <- "name,score\nCharlie,95\nDiana,79"
file3 <- "name,score\nEve,85\nFrank,91"

files <- list(file1, file2, file3)

combined_data <- files %>%
 set_names(paste0("file_", 1:3)) %>%
 map(read_csv) %>%
 list_rbind(names_to = "file_number")

combined_data
#> # A tibble: 6 × 3
#>   file_number name    score
#>   <chr>       <chr>   <dbl>
#> 1 file_1      Alice      92
#> 2 file_1      Bob        88
#> 3 file_2      Charlie    95
#> 4 file_2      Diana      79
#> 5 file_3      Eve        85
#> 6 file_3      Frank      91

8.9 Summary

Here’s your cheat sheet for getting data into R:

CSV files: read_csv() from readr. This is your bread and butter.
Excel files: read_excel() from readxl. Supports sheets, cell ranges, all the good stuff.
SPSS/Stata/SAS: read_sav(), read_dta(), read_sas() from haven. There when you need them.
R native files: readRDS() for .rds, load() for .RData.
Writing data: write_csv() for CSVs, write_rds() to preserve R types.
Multiple files: Pass a vector of paths to read_csv(), or use map() with list_rbind().

The most important habit: always inspect your data right after importing it. Check the column types, look for unexpected NA values, make sure nothing got mangled. Five minutes of checking now saves five hours of debugging later.