3 Introduction to R

Code

# Packages used to build the figures and tables in this chapter
library(ggplot2)
library(dplyr)
library(knitr)
library(xtable)
library(gridExtra)

# Show code in the rendered output, but hide warnings and messages
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Let us start with the question every R book opens on: What is R?

R is a programming language built specifically for working with data — crunching numbers, running analyses, and producing charts good enough for production reports and academic papers. If you have used Excel for analysis, you have probably hit the same walls everyone hits: spreadsheets that slow to a crawl on a few hundred thousand rows, analyses that cannot be reproduced once the file is closed, charts that are functional but rarely publication-quality. R is built for the work that starts where Excel runs out.

R was created by two professors — Ross Ihaka and Robert Gentleman — at the University of Auckland in New Zealand. They started the project in 1992, and it is based on an older language called S, built by John Chambers at Bell Labs (S is now owned by TIBCO). Both R and S were originally written by statisticians for statisticians, but the audience has broadened considerably. Today R is used by marketing analysts tracking campaign performance, finance teams modeling risk, consultants building client dashboards, and anyone whose job involves making sense of messy data.

The official R website is here: https://www.r-project.org

Code

if (file.exists("01files/Rlogo.png")) knitr::include_graphics("01files/Rlogo.png")

3.1 Why Use R? (Or: Why Can’t I Just Stick with Excel?)

Look, we need to talk. Excel is great. We all love Excel. You’ve been using it since high school, and it has served you well. But at some point, you’re going to hit a wall – maybe you have 500,000 rows of customer transaction data, or you need to run the same weekly sales report every Monday, or your director wants a visualization that doesn’t look like it was made two decades ago. That’s where R comes in.

Here are 9 reasons why R is worth adding to your toolkit:

It is free. Actually free — no license fees, no subscriptions, no expiring trials. Because R is open source, thousands of people around the world contribute code and tools you can use instantly, and no purchasing department needs to approve anything before you start.
Your work is reproducible. Ever spend an hour clicking through menus in some software, get your results, and then your professor says “can you run that again with last quarter’s data?” In R, your entire analysis is saved as a script – basically a set of instructions. Change one line, hit run, done. It’s like having a recipe instead of trying to remember what you threw in the pot last Tuesday.
R is often on the cutting edge. New statistical methods and data science techniques frequently show up in R first, because researchers often release an R package alongside the paper that introduces a method. If something is new and exciting in the analytics world, there is a good chance R already has it. Your future employer will notice.

Code

# Approximate CRAN package counts at selected years
milestones <- tibble::tibble(
  year = c(2005, 2009, 2013, 2016, 2020, 2023, 2026),
  packages = c(1000, 2000, 5000, 8000, 16000, 20000, 23000),
  label = c("1K", "2K", "5K", "8K", "16K", "20K", "23K+")
)

# Line chart with a labeled point at each milestone
ggplot(milestones, aes(x = year, y = packages)) +
  geom_line(color = "#a80107", linewidth = 1.2) +
  geom_point(color = "#a80107", size = 3.5) +
  geom_text(aes(label = label), vjust = -1.3, fontface = "bold", size = 3.8) +
  scale_x_continuous(breaks = milestones$year) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 27000)) +
  labs(
    title = "Growth of CRAN Packages Over Time",
    subtitle = "From ~1,000 packages in 2005 to over 23,000 in 2026",
    x = NULL, y = "Packages on CRAN"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", color = "#a80107"),
    panel.grid.minor = element_blank()
  )

Source: Muenchen, Robert A., The Popularity of Data Analysis Software, https://r4stats.com/articles/popularity/

The graphics are excellent. R produces publication-quality charts that hold up in academic journals, corporate reports, and major newsrooms. The ggplot2 package in particular has become a standard: the BBC, The New York Times, and FiveThirtyEight have all used R for their data graphics, and the BBC’s data journalism team maintains its own R package (bbplot) on GitHub.
Updates are a breeze. Everything is distributed over the internet. No CDs, no USB drives, no submitting a help desk ticket and waiting three weeks for IT to install something. You download what you need, when you need it.
It works on everything. Mac, Windows, Linux – R doesn’t care what computer you have. Your group project members can all use different machines and still run the same code. (If only the rest of group projects were that easy.)
It plays well with others. R can connect to databases, read Excel files, pull in CSV data, handle SPSS files – basically whatever format your data lives in, R can probably import it. That random .xlsx file your internship supervisor emailed you at 5pm on a Friday? R’s got you.
It has been featured in The New York Times – and keeps making headlines. Ashlee Vance’s 2009 NYT article (https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html) first brought R to mainstream attention. Since then, the BBC built a custom R package, bbplot (https://github.com/bbc/bbplot), for its data journalism graphics, The Economist uses R for its charts, and in December 2025, multiple outlets – InfoWorld, Slashdot, FlowingData – covered R’s return to the TIOBE top 10.
The community is incredible. This might honestly be the best reason. R has more than 2 million users worldwide (per the R Consortium), and a huge number of them genuinely enjoy helping beginners. Got stuck at midnight before a deadline? Post a question online and someone – probably in a different time zone where it’s a perfectly reasonable hour – will likely answer within hours. It’s like having a giant study group that never sleeps. That spirit of people helping strangers for free — sharing code, answering questions, building tools for the common good — is what makes open-source communities remarkable.
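To make the reproducibility and data-import points in the list above concrete, here is a minimal sketch of a rerunnable report script. The file, the column names (`quarter`, `region`, `revenue`), and the numbers are all invented for illustration; in practice the CSV would come from your own systems.

```r
# Hypothetical sales data, written to a temporary CSV so the sketch runs anywhere.
sales_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    quarter = c("Q1", "Q1", "Q2", "Q2"),
    region  = c("East", "West", "East", "West"),
    revenue = c(120, 95, 140, 110)
  ),
  sales_file,
  row.names = FALSE
)

# The one line you change when someone asks for a different quarter.
report_quarter <- "Q2"

# Import, filter, and summarize: the same steps run identically every time.
sales <- read.csv(sales_file)
quarter_sales <- sales[sales$quarter == report_quarter, ]
total_revenue <- sum(quarter_sales$revenue)

total_revenue
```

Changing `report_quarter` to "Q1" and rerunning the script repeats the whole analysis for the other quarter, with no menus to click through again.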

Now, here’s the one thing that makes people nervous: R uses a command line. You type code instead of clicking buttons. I know, I know – that might sound intimidating if you’ve never programmed before. But here’s the good news: you don’t have to stare at a scary blank terminal. There’s a fantastic program called RStudio (made by a company now called Posit) that gives you a friendly, modern workspace with helpful features like auto-complete, built-in help files, and a window to see your plots right next to your code. It’s the difference between building furniture with just a hammer and building it with a fully equipped workshop. We’ll get you set up with RStudio in the next chapter – it’s painless, I promise.

3.1.1 Where R Stands in 2026

As of early 2026, R ranks #8 on the TIOBE Index (its highest position since 2020), #5 on the PYPL Index, and #5 on IEEE Spectrum (2024). In the 2025 Stack Overflow survey, 4.9% of developers reported using it. R ranks lower in general-purpose developer surveys because it is a domain-specific language – but in statistics, data science, and business analytics, it is one of the top two tools alongside Python.

Robert Muenchen has been tracking the popularity of data analysis software for years, and his regularly updated article is a great read if you’re curious about how R stacks up against SPSS, SAS, Stata, and others. Check it out at https://r4stats.com/articles/popularity/.

If you want to explore what R can do in specific business fields – marketing analytics, finance, econometrics, supply chain, you name it – the CRAN Task Views page (https://cran.r-project.org/web/views/) has a massive listing of packages organized by topic. And for the really bleeding-edge stuff, developers often host their latest work on GitHub.

3.2 R – The Environment and Software

Let me give you an analogy that’ll make this click.

Imagine it’s your first day at a new job. Someone hands you a company laptop. Out of the box, it comes with the basics – a web browser, a calculator, maybe Microsoft Notepad. Functional, but not exactly exciting. That’s base R: what you get built in. It can do a lot right out of the gate: math, basic statistics, simple charts. It’s solid.

But then your manager asks you to build a customer segmentation model, or create an interactive dashboard for the quarterly business review. You need specialized tools. So you go to the app store and download exactly what you need. In R, these specialized tools are called packages – and there are over 23,000 of them (and growing every month). They cover everything from sentiment analysis of Amazon reviews to stock portfolio optimization to mapping sales territories.

Sometimes a package needs other packages to work properly. (Kind of like how some apps require you to update your operating system first.) These are called dependencies, and R handles them automatically – when you install a package, it also downloads the other R packages that package needs, without you having to lift a finger.
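One way to see dependencies for yourself is to read a package’s DESCRIPTION metadata. The sketch below uses the built-in stats package only because it is present in every R installation; the same call should work for any package you have installed.

```r
# Read the "Imports" field from the stats package's DESCRIPTION file.
imports_raw <- utils::packageDescription("stats", fields = "Imports")

# Split the comma-separated string and drop any version constraints like "(>= 4.0)".
imports <- trimws(strsplit(imports_raw, ",")[[1]])
imports <- sub("\\s*\\(.*\\)$", "", imports)

imports
```

When you call `install.packages()`, R reads this same kind of metadata and, by default, also installs any packages the new one depends on or imports that you do not already have.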

And here’s the really cool part: if nobody on the entire planet has written a package for the specific, niche thing you need to do (unlikely, but possible), you can build your own and share it with the world. That’s the beauty of open source – everyone contributes, everyone benefits.

3.2.1 Key Milestones

R has evolved significantly over the years:

R 4.0 (2020): Introduced modern defaults (e.g., stringsAsFactors = FALSE), making data import less error-prone.
R 4.1 (2021): Added the native pipe operator |>, giving base R a feature that tidyverse users had loved for years via %>%.
RStudio, the company, renamed to Posit (2022): Reflecting the company’s growing support for Python alongside R. The RStudio software itself kept its name.
tidyverse 2.0 (2023): Added lubridate to the core tidyverse, making date-time handling a first-class citizen.
R 4.5 (2025): The current release, which includes the Palmer Penguins dataset built in – a modern, inclusive replacement for the classic iris data.
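Two of these milestones are easy to check in a console. A short sketch, assuming R 4.1 or later (the `|>` line will not parse on older versions); the small data frame is made up for illustration.

```r
# Since R 4.0, text columns stay as character vectors instead of becoming factors.
survey <- data.frame(answer = c("Agree", "Disagree", "Agree"))
answer_class <- class(survey$answer)
answer_class

# Since R 4.1, the native pipe passes the left-hand value to the next function.
total <- c(1, 4, 9) |> sqrt() |> sum()
total
```

On a current R installation the column class should come back as "character", and the piped expression takes the square roots 1, 2, and 3 and sums them.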

3.3 The Tidyverse

Okay, buckle up, because I’m about to introduce you to something that’s going to make your data life significantly better.

The tidyverse is a collection of R packages that were designed to work together seamlessly. Think of it like this: base R is a kitchen where all the utensils are scattered across random drawers. The tidyverse is that same kitchen after someone from The Container Store organized the whole thing. Everything has a place, everything works together, and you can actually find what you need.

The tidyverse was created primarily by Hadley Wickham – kind of a celebrity in the data world (he has fans, conference talks, the whole deal) – and is maintained by the team at Posit.

Here’s what’s in the toolkit:

ggplot2: Makes beautiful charts and graphs. This is the package that’ll make your classmates’ and coworkers’ jaws drop during presentations.
dplyr: Lets you filter, sort, group, and summarize data. It’s basically what you wish Excel pivot tables were. Seriously – once you use dplyr, you’ll feel betrayed by pivot tables.
tidyr: Helps you reshape messy data into clean, “tidy” formats. Because real-world data from CRM systems, surveys, and databases is almost always a disaster.
readr: Reads CSV files and other data formats quickly and cleanly. Perfect for loading that sales report your boss just emailed you.
purrr: Automates repetitive tasks so you don’t have to copy-paste the same analysis fifty times. (Also, the name is a cat pun. Data scientists are fun at parties, apparently.)
tibble: A smarter, more user-friendly version of R’s data frame – basically a spreadsheet that lives inside R and behaves itself.
stringr: Tools for working with text data. Great for cleaning up messy customer survey responses where someone typed “Strongly agrEE” instead of “Strongly Agree.”
forcats: Helps you work with categorical data – things like “Freshman/Sophomore/Junior/Senior,” product categories, or customer segments.
lubridate: Helps you work with dates and times – things like order dates and timestamps. Part of the core tidyverse since version 2.0.

Installing the whole collection takes one line of code:

Code

install.packages("tidyverse")

And loading it into your session is just as easy:

Code

library(tidyverse)

Throughout this book, we’ll use a mix of built-in functions and these add-on packages. The tidyverse gets its own deep dive in later chapters, starting with a tidyverse overview and running through purrr.

Once you start using the tidyverse, the productivity difference compared to base R alone is genuinely large. Most of this book leans on it.
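As a small taste of the difference, here is the same summary written both ways. This is a sketch using the built-in `mtcars` dataset; it assumes dplyr is installed and R is version 4.1 or later, and neither version is the only way to write it.

```r
library(dplyr)

# Tidyverse style: average miles per gallon for each cylinder count.
by_cyl_tidy <- mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))

# Base R style: the same summary with aggregate().
by_cyl_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

by_cyl_tidy
by_cyl_base
```

Both return one row per cylinder count with the same averages. The tidyverse version reads top to bottom as a sequence of verbs, which is the style most of this book uses.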

AI Pitfall: Summaries flatten what experienced practitioners weight

Asking an AI assistant “what is the difference between base R and the tidyverse?” returns a fluent, accurate-sounding paragraph. What the summary will not tell you is that in 2026 most professional R code is tidyverse-first, that mixing the two carelessly produces code that is harder to read and review, and that your future colleagues will expect you to know which is which when reading their work. AI flattens the distinctions experienced practitioners weight most heavily. Use it to clarify what you read here, not as a replacement for reading.

3.4 “Then why learn R when AI writes it for me?”

The preface answers this question at length. The short version, set up here so each subsequent chapter can build on it: AI assistants are extraordinary at turning a description into syntax that runs. They are weak at three things — knowing whether the code is appropriate for your specific data, verifying the output before you trust it, and defending the analysis when a stakeholder challenges it. Those three are most of what makes someone a good analyst.

Every chapter that follows includes at least one AI Pitfall callout: a specific scenario where an AI assistant produces plausible-but-wrong R code, paired with the verification habit that catches it. The Pitfalls compound. By the end of the book you will have a working library of failure modes — exactly what the analyst in the preface lacked.
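As a preview of what a verification habit can look like, consider missing values. This is an illustrative sketch with invented numbers, not one of the book’s Pitfall scenarios: adding `na.rm = TRUE` makes the code run, but only a quick count tells you how much data was quietly left out.

```r
# Hypothetical monthly revenue with two missing months.
revenue <- c(120, NA, 140, 110, NA, 95)

# Without na.rm the mean is NA; with it, the missing months are silently dropped.
mean(revenue)
mean_revenue <- mean(revenue, na.rm = TRUE)

# Verification: how many values were actually used, and how many were missing?
n_missing <- sum(is.na(revenue))
n_used <- sum(!is.na(revenue))

c(mean = mean_revenue, used = n_used, missing = n_missing)
```

If a third of the months are missing, the average may still be the right number to report, but that is a judgment you make after seeing the counts, not something the code decides for you.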
