Run the Numbers

Data Analysis with R and the Tidyverse

Author: Vivek H. Patil, Ph.D.

Published: December 31, 2025

Preface: Why R in an AI World

You may have wondered something while opening this book: why learn R now?

It is a fair question. AI assistants — Claude, ChatGPT, GitHub Copilot, Cursor — can produce R code faster than you can type. Posit’s newer flagship IDE, Positron, ships with AI completion built in. RStudio Desktop integrates Copilot. Claude Code runs in your terminal and writes, debugs, and explains entire R analyses on a single prompt. If a tool can generate dplyr pipelines and ggplot2 charts on demand, what is left for the human to learn?

The honest answer: everything that mattered before, and a few things that did not.

A small story. A first-year analyst at a regional retailer was asked to build a model that flagged customers likely to churn next quarter. She prompted an AI assistant for a logistic regression in R. The code ran. The reported accuracy was 95%. She put the analysis in front of her vice-president, who asked which customers the model recommended targeting for retention offers. She filtered the predictions and got back zero. The model had achieved 95% accuracy by predicting that nobody would churn — because the actual churn rate was around 5%, and predicting “everyone stays” gets you to 95% accuracy by default. The AI had produced a model. The analyst could not tell that the model was useless.

This is not a story about AI failing. The AI did what was asked. The story is about the analyst — who could write a prompt but could not read what came back. She did not know to check the class balance. She did not have the habits to spot a model that was technically correct and substantively empty. She shipped a production-grade analysis of a phantom result.
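The arithmetic behind the story can be reproduced in a few lines of base R. This is a minimal sketch, not the analyst's actual model: the 1,000 customers and the constant "nobody churns" prediction are made-up stand-ins for her data and her logistic regression, chosen only to match the 5% churn rate described above.

```r
# Hypothetical data: 1,000 customers, 5% of whom churned.
# (Illustrative numbers that mirror the story; not a real dataset.)
churned <- c(rep(TRUE, 50), rep(FALSE, 950))

# Stand-in for the model: it predicts that nobody churns.
predicted <- rep(FALSE, length(churned))

# Habit 1: check the class balance before reading any accuracy figure.
churn_rate <- mean(churned)

# Accuracy is the share of predictions that match the truth.
accuracy <- mean(predicted == churned)

# Habit 2: ask how many actual churners the model catches (recall)
# and how many customers it flags at all.
recall    <- sum(predicted & churned) / sum(churned)
n_flagged <- sum(predicted)

# A cross-tabulation makes the problem visible at a glance:
# there is no "predicted TRUE" column at all.
table(actual = churned, predicted = predicted)
```

Here `accuracy` matches the share of customers who stayed, while `recall` and `n_flagged` are both zero. Comparing a model's accuracy against the majority-class share is one quick way to catch a result like this before it reaches a vice-president.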

This book is about building those habits.

What this book teaches that AI cannot

AI assistants are extraordinary at one thing: turning a description of what you want into syntax that runs. They are weak at three things, and those three things are most of what makes someone a good analyst.

Knowing whether the code is appropriate. A dplyr::filter with correct syntax may silently drop the rows that mattered most. A lubridate parser that runs without errors may have quietly assumed the wrong date format. A ggplot2 default may visually misrepresent a skewed distribution. AI generates working code; whether the code does the right thing for your specific data is a judgment call the AI does not make.
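Two of these failure modes are easy to see in a small sketch. It assumes the dplyr and lubridate packages are installed; the `orders` table and the date string are hypothetical examples invented for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical orders table; one amount is missing.
orders <- data.frame(
  order_id = 1:4,
  amount   = c(120, NA, 80, 300)
)

# Syntactically fine, but filter() keeps only rows where the condition
# is TRUE, so the row with a missing amount is dropped along with the
# small order, and nothing signals that it happened.
large_orders <- dplyr::filter(orders, amount > 100)

# Make the missing rows visible before deciding what to do with them.
n_missing <- sum(is.na(orders$amount))

# The same string parses without error under two format assumptions.
as_month_first <- mdy("03/04/2025")  # read as March 4, 2025
as_day_first   <- dmy("03/04/2025")  # read as April 3, 2025
```

Neither call raises an error, and both results look reasonable on their own. Only knowledge of the data (can an order amount be missing? was this file exported with day-first dates?) tells you which result is the right one.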

Verifying the output. Did the model fit converge, or just stop iterating? Did the join match the rows you expected, or silently duplicate them? Are the group totals summing to what you know they should sum to? AI gives you the result; you have to know how to interrogate it before trusting it.
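As one possible illustration of the join and totals checks, here is a short sketch that assumes dplyr is installed. The `customers` and `regions` tables are hypothetical, with a duplicate key planted in the lookup on purpose.

```r
library(dplyr)

# Hypothetical tables: one row per customer, plus a lookup table that
# accidentally lists customer 2 twice.
customers <- data.frame(customer_id = c(1, 2, 3),
                        spend       = c(100, 200, 300))
regions   <- data.frame(customer_id = c(1, 2, 2, 3),
                        region      = c("East", "West", "South", "East"))

joined <- left_join(customers, regions, by = "customer_id")

# Check 1: a lookup join should not change the number of rows.
rows_added <- nrow(joined) - nrow(customers)

# Check 2: find the keys responsible for the extra rows.
dup_keys <- regions$customer_id[duplicated(regions$customer_id)]

# Check 3: totals should still match a number you already know.
total_before <- sum(customers$spend)
total_after  <- sum(joined$spend)
```

The join runs and returns a tidy-looking table, but customer 2 now appears twice and total spend is inflated by that customer's 200. Depending on your dplyr version you may or may not see a warning here, so comparing row counts and totals before and after the join is the more dependable habit.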

Defending the analysis. When a stakeholder asks “why did you pick this method?” or “what would change if you weighted these observations differently?” the AI cannot answer for you. It cannot sit in the room. The reader of this book is the person who has to.

The bottleneck has moved from typing to judgment. Anyone can produce R code now. The scarce skill is knowing what to ask for, what to verify, and what to reject.

What you will actually learn

This is a comprehensive R book. You will learn R at the syntax level: the tidyverse, ggplot2, R Markdown, the basics of statistical modeling, and the workflow tools that turn one-off scripts into reproducible projects. If you have used a spreadsheet, you can learn R from this book — no prior programming experience needed.

But every chapter follows the same rhythm:

  1. A real problem analysts solve in R. Not a toy dataset. Not a contrived example. A specific question worth answering, drawn from published research, public data, or a business decision someone actually made.
  2. The R that solves it. Tidyverse-first, modern conventions, no bad habits.
  3. What AI gets wrong here. A specific way an AI assistant could produce plausible-but-wrong R code in this context, and the verification habit that catches it.
  4. How you defend the result. What you would say if a colleague, client, or reviewer asked you to justify it.

The fourth piece is the niche. Most R books teach you to write code. This one teaches you to ship analyses that hold up under scrutiny.

On the IDE landscape

You have choices in 2026. RStudio Desktop is the long-time standard, mature, with GitHub Copilot integration available. Positron is Posit’s newer cross-language IDE, built on the VS Code platform, with first-class R support and AI completion via Posit’s own assistant. VS Code and Cursor work for R via extensions. Claude Code runs in your terminal alongside any of them and writes R when you ask it to.

This book does not pick a winner. Every example assumes you can copy code into whichever environment you prefer. What matters is that whichever IDE you choose, the AI working alongside you will write R code for you. The skill that separates an analyst from a code passthrough is what happens after the AI hands the code over.

Why R, not Python

A reasonable question. Python is excellent, and it is better than R for production machine-learning systems, web backends, and general-purpose programming. But for the kind of work this book is about (cleaning messy data, computing trustworthy summaries, building publication-quality charts, producing reproducible reports), R is unmatched. The tidyverse makes data wrangling more readable than any equivalent Python library. ggplot2 is the most thoughtful visualization grammar ever shipped. R Markdown produces reports that update whenever you re-render them against new data. R was built by statisticians for statisticians, and that bias toward defensibility shows in everything from how it handles missing values to how it surfaces model diagnostics.

If your career trajectory points toward production ML engineering, learn Python. If it points toward analysis, communication, and decisions made under uncertainty, learn R. Many people learn both. The skills here transfer.
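The point about missing values is easiest to see with a tiny base R example. The three measurements below are made up for illustration.

```r
# Three hypothetical measurements, one of them missing.
x <- c(4, NA, 10)

# By default the missing value propagates: the result is NA,
# not a mean quietly computed from the two known values.
mean_default <- mean(x)

# Dropping missing values has to be requested explicitly.
mean_complete <- mean(x, na.rm = TRUE)

# Record how many values were dropped, so you can report it.
n_dropped <- sum(is.na(x))
```

R makes you opt in to ignoring the gap, which means the decision to drop data shows up in your code where a reviewer can see it.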

How this book is organized

  • Chapters 1–2: Getting started. Installing R, finding your way around RStudio or Positron, and not panicking.
  • Chapters 3–4: Your data analysis toolkit and importing data. Meet the tidyverse and learn how to get data into R from CSVs, Excel files, and databases.
  • Chapters 5–8: Working with data. Describing variables, modifying and reshaping data with dplyr and tidyr.
  • Chapters 9–11: Summary statistics and visualization. Computing the numbers that matter and building charts with ggplot2.
  • Chapters 12–13: Statistical analysis. Hypothesis testing for single variables and relationships between variables.
  • Chapters 14–17: Specialized tools. Text (stringr), categories (forcats), dates (lubridate), and repetitive tasks (purrr).
  • Chapters 18–19: R Markdown. Combining writing and code into polished, self-updating reports, presentations, and dashboards.
  • Chapter 20: AI-Assisted R Programming. The patterns that make AI a productive partner — and the failure modes to watch for.

Each chapter has at least one AI Pitfall callout: a specific scenario where an AI assistant could produce wrong-but-plausible R code, paired with the verification habit that catches it.

A few ground rules

  • You will make errors. Everyone does. R’s error messages can look scary, but they are usually telling you something simple, such as a missing comma. Read them. Search them. You will be fine.
  • Run the code as you go. Reading code without running it is like reading a recipe without cooking. Some lessons only land when you watch the output appear.
  • Use AI, but verify. AI assistants are part of how you will work. The point of this book is not to keep you away from them — it is to make you the analyst whose judgment is worth more than the assistant’s syntax.
  • There is no dumb question. If you are confused, someone else is too.

A Note on AI as Research Partner

This book was researched, written, and produced with AI as a research partner. I used Claude, developed by Anthropic, throughout the process: to explore ideas, pressure-test reasoning, draft and revise passages, identify gaps in coverage, and build the technical infrastructure behind the companion website, interactive tools, and publishing pipeline. Claude Code helped construct the Quarto projects, the Shiny applications, and the automation that made a project of this scope feasible for a single author.

I want to be direct about this because I think readers deserve honesty, not theater. AI did not write this book in the way that matters. Every claim has been verified against primary sources. Every analytical position reflects my own judgment, shaped by two decades of teaching and research. Every sentence has been read, reconsidered, and revised by a human who cares whether it is right. The responsibility for what appears here, including any errors, is entirely mine.

The content will improve iteratively. If you find something that needs correcting, or if you have suggestions, I welcome them. Each book has its own feedback page (linked on the companion site and in the preface), or you can email me directly at patilv@gmail.com.

I mention this not as a disclaimer but as a matter of principle. The norms around AI use in scholarly work are being negotiated in real time. I would rather be transparent about my process than pretend the tools I used do not exist. If this book helps you think more clearly about data, the fact that an AI helped me write it does not make the thinking less clear. And if something in this book is wrong, the fact that an AI helped me write it does not make it less my fault.

© 2026 Vivek H. Patil, Ph.D. All rights reserved.

Published by Margin of Error Media LLC.
marginoferrormedia.com

No part of this book may be reproduced, distributed, or transmitted in any form or by any means without the prior written permission of the publisher, except for brief quotations in reviews or scholarly works with full attribution.

For permissions, licensing, classroom adoption, or bulk purchase:
patilv@gmail.com

First edition: 2026
ISBN (print): [ISBN-PENDING]
ISBN (ebook): [ISBN-PENDING]

Why data ethics matters

You might be wondering what a coding book has to do with ethics. Fair question. Here is the short answer: everything.

Good data education is about more than passing an exam — it is about going out into the world and actually doing something meaningful with what you learned. The goal is to develop people who lead with competence and conscience. Data literacy is a big part of that now.

Here is the thing about data: it is never neutral. Every dataset was collected by someone, for some reason, with some set of assumptions baked in. Every summary statistic you compute, every chart you build, every model you deploy — these shape decisions that affect real people. Which neighborhoods get investment. Which patients get treatment. Which applicants get interviews. When you know how to work with data responsibly, you have the power to make those decisions more fair, more transparent, and more grounded in evidence rather than gut feelings.

That is the idea of the “common good” in action — not as an abstract ideal you read about in a philosophy class, but as a practical skill you exercise every time you write a line of code.

So yes, this is a book about R. You will learn functions and packages and how to make charts that do not look like they were made in 1997. But underneath all of that is something bigger: the ability to think critically about data, ask whose story it tells (and whose it leaves out), and use your skills in ways that make the world a little more honest and a little more just.

That is why data ethics matters. Now let’s learn some R.

About the Author

Vivek H. Patil has over two decades of university teaching and research experience spanning marketing, data analytics, and research methodology.

His research integrates measurement theory and statistics with frameworks from economics, social psychology, and cognitive psychology to examine human behavior. He has authored or co-authored 25+ peer-reviewed articles in journals including the Journal of Business Research, PLOS ONE, Journal of Marketing Analytics, Scientometrics, and the American Journal of Health Promotion.

He holds a Ph.D. in Business (Marketing) from the University of Kansas, an M.Eng. in Software Systems from the Birla Institute of Technology and Science (BITS Pilani), and a Master of Management Studies from BITS Pilani. This combination of business, statistics, and software engineering backgrounds informs his approach to teaching data analysis — practical, code-first, and focused on real-world applications.

He teaches courses in marketing research, data visualization, and business analytics. He is also the founder of VeloxPortfolio.com, an AI-powered tool that transforms resumes into portfolio websites. More at patilv.com.

License

All rights reserved. No part of this book may be reproduced, distributed, or transmitted in any form or by any means without the prior written permission of the author, except for brief quotations in reviews or scholarly works. For permissions or licensing inquiries, contact the author.

Attributions and Data Sources

This book was built with open-source tools and relies on publicly available data, R packages, and published sources. Full attributions are listed here.

Software

  • R: R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. License: GPL-2 | GPL-3.
  • RStudio / Posit: Posit Software, PBC. https://posit.co/.
  • bookdown: Yihui Xie. Used to produce this book. License: GPL-3.
  • knitr: Yihui Xie. License: GPL-2 | GPL-3.
  • rmarkdown: Allaire, J.J., Xie, Y., et al. License: GPL-3.

R Packages Used in Examples

All R packages listed below are open source and available on CRAN. Package authors retain copyright to their respective works.

| Package | Author(s) | License | Used in |
|---|---|---|---|
| tidyverse (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats) | Hadley Wickham et al. | MIT | Throughout |
| lubridate | Vitalie Spinu, Garrett Grolemund, Hadley Wickham | GPL (>=2) | Ch 16 |
| DT | Yihui Xie et al. | GPL-3 | Ch 5, 7 |
| plotly | Carson Sievert et al. | MIT | Ch 10, 13 |
| Hmisc | Frank Harrell et al. | GPL (>=2) | Ch 13 |
| scales | Hadley Wickham, Dana Seidel | MIT | Ch 10–11 |
| readxl | Hadley Wickham, Jennifer Bryan | MIT | Ch 4 |
| kableExtra | Hao Zhu | MIT | Ch 18 |
| skimr | Elin Waring et al. | GPL-3 | Ch 5 |
| patchwork | Thomas Lin Pedersen | MIT | Ch 11 |
| ggthemes | Jeffrey B. Arnold | GPL-2 | Ch 11 |
| ggrepel | Kamil Slowikowski | GPL-3 | Ch 11 |
| haven | Hadley Wickham, Evan Miller | MIT | Ch 4 |
| RColorBrewer | Erich Neuwirth | Apache 2.0 | Ch 11 |
| MASS | Brian Ripley et al. | GPL-2 / GPL-3 | Ch 6 |

Datasets

| Dataset | Source | License / Terms |
|---|---|---|
| iris | Anderson, E. (1935) and Fisher, R.A. (1936). Built into R. | Public domain |
| mtcars | Henderson and Velleman (1981). Motor Trend magazine, 1974. Built into R. | Public domain |
| mpg | EPA fuel economy data (1999, 2008). Included in ggplot2. | Public domain (US government data) |
| diamonds | Included in ggplot2. Originally from diamondse.info. | Fair use for education |
| economics | Federal Reserve Economic Data (FRED). Included in ggplot2. | Public domain (US government data) |
| survey | Venables, W.N. and Ripley, B.D. (1999). Modern Applied Statistics with S-PLUS, 3rd ed., Springer. From the MASS package; collected from 237 Statistics I students at the University of Adelaide. | GPL-2 / GPL-3 (MASS package) |
| corrdata | Simulated correlation demonstration data created by the author. | Original to this book |

Images

| Image | Source | Terms |
|---|---|---|
| R logo (Rlogo.png) | The R Foundation. https://www.r-project.org/logo/ | CC BY-SA 4.0 |
| Type conversion diagram (04files/conversion.png) | Created by the author or adapted from public R documentation. | Original to this book |

Chapter 1 Sources

The following factual claims in Chapter 1 are drawn from publicly available sources:

Typography and Fonts

Acknowledgments

This book was developed with assistance from Claude (Anthropic), an AI language model used for drafting, editing, and code generation. All content was reviewed and approved by the author.