Telling Stories with Data
With applications primarily in R
Reviews
This clean and fun book covers a wide range of topics on statistical communication, programming, and modeling in a way that should be a useful supplement to any statistics course or self-learning program. I absolutely love this book!
Andrew Gelman, Columbia University
An excellent book. Communication and reproducibility are of increasing concern in statistics, and this book covers these topics and more in a practical, appealing, and truly unique way.
Daniela Witten, University of Washington
Many data science texts tell you how to perform perfunctory calculations. Instead, Telling Stories with Data tells you how to engage in the mindset and process of analysis. By arming students with the computational, statistical and philosophical skills needed to use data in sense-making and story-telling, this book stands out from the pack as uniquely actionable and empowering.
Emily Riederer, Capital One
Chapman and Hall/CRC will publish a hard copy of this book in the second half of 2023. If you would like to be notified when it is available, please add your details here.
Preface
This book will help you tell stories with data. It establishes a foundation on which you can build and share knowledge about an aspect of the world that interests you based on data that you observe. Telling stories in small groups around a fire played a critical role in the development of humans and society (Wiessner 2014). Today our stories, based on data, can influence millions.
In this book we will explore, prod, push, manipulate, knead, and, ultimately, try to understand the implications of data. A variety of features drive the choices in this book.
The motto of the university from which I took my PhD is naturam primum cognoscere rerum or roughly “learn the first nature of things”. But the original quote continues temporis aeterni quoniam, or roughly “for eternal time”. We will do both of these things. I focus on tools, approaches, and workflows that enable you to establish lasting and reproducible knowledge.
When I talk of data in this book, it will typically be related to humans. Humans will be at the center of most of our stories, and we will tell social, cultural, and economic stories. In particular, throughout this book I will draw attention to inequity both in social phenomena and in data. Most data analysis reflects the world as it is. Many of the least well-off face a double burden in this regard: not only are they disadvantaged, but the extent is more difficult to measure. Respecting those whose data are in our dataset is a primary concern, and so is thinking of those who are systematically not in our dataset.
While data are often specific to various contexts and disciplines, the approaches used to understand them tend to be similar. Data are also increasingly global, with resources and opportunities available from a variety of sources. Hence, I draw on examples from many disciplines and geographies.
To become knowledge, our findings must be communicated to, understood, and trusted by other people. Scientific and economic progress can only be made by building on the work of others. And this is only possible if we can understand what they did. Similarly, if we are to create knowledge about the world, then we must enable others to understand precisely what we did, what we found, and how we went about our tasks. As such, in this book I will be particularly prescriptive about communication and reproducibility.
Improving the quality of quantitative work is an enormous challenge, yet it is the challenge of our time. Data are all around us, but there is little enduring knowledge being created. This book hopes to contribute, in some small way, to changing that.
Audience and assumed background
The typical person reading this book has some familiarity with first-year undergraduate statistics; for instance, they have run a regression. But the book is not targeted at a particular level, instead providing aspects relevant to almost any quantitative course. I have taught from this book at undergraduate, graduate, and professional levels. Everyone has unique needs, but hopefully some aspect of this book speaks to you.
Enthusiasm and interest have taken people far. If you have those, then do not worry about too much else. Some of the most successful students have been those with no quantitative or coding background.
This book covers a lot of ground, but does not go into depth about any particular aspect. As such this book especially complements books such as: Data Science: A First Introduction (Timbers, Campbell, and Lee 2022), R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023), An Introduction to Statistical Learning (James et al. 2021), and Statistical Rethinking (McElreath 2020). If you are interested in those books, then this might be a good one to start with.
Structure and content
This book is structured around six parts: I) Foundations, II) Communication, III) Acquisition, IV) Preparation, V) Modelling, and VI) Applications.
Part I – Foundations – begins with Chapter 1 which provides an overview of what I am trying to achieve with this book and why you should read it. Chapter 2 provides three worked examples. The intention of these is that you can experience the full workflow recommended in this book without worrying too much about the specifics of what is happening. That workflow is: plan, simulate, acquire, model, and communicate. It is normal to not initially follow everything in this chapter, but you should go through it, typing out and executing the code yourself. If you only have time to read one chapter of this book, then I recommend that one. And Chapter 3 introduces some key tools for reproducibility used in the workflow that I advocate. These are aspects like Quarto, R Projects, Git and GitHub, and using R in practice.
Part II – Communication – considers three types of communication: written, static, and interactive. Chapter 4 details the features that quantitative writing should have and how to go about writing a crisp, quantitative, research paper. Static communication in Chapter 5 introduces features like graphs, tables, and maps.
Part III – Acquisition – focuses on three aspects: farming data, gathering data, and hunting data. Farming data in Chapter 6 begins with measurement, and then turns to essential concepts from sampling that govern our approach to data. It then focuses on datasets that are explicitly provided for us to use as data, for instance censuses and other government statistics. These are typically clean, well-documented, pre-packaged datasets. Gathering data in Chapter 7 covers aspects like using Application Programming Interfaces (APIs), scraping data, getting data from PDFs, and Optical Character Recognition (OCR). The idea is that data are available, but not necessarily designed to be datasets, and that we must go and get them. Finally, hunting data in Chapter 8 covers aspects where more is expected of us. For instance, we may need to conduct an experiment, run an A/B test, or do some surveys.
Part IV – Preparation – covers how to respectfully transform raw data into something that can be explored and shared. Chapter 9 begins by detailing some principles to follow when approaching the task of cleaning and preparing data, and then goes through specific steps to take and checks to implement. Chapter 10 focuses on methods of storing and retrieving those datasets, including the use of R data packages, and then continues onto considerations and steps to take when wanting to disseminate datasets as broadly as possible, while at the same time respecting those whose data they are based on. It also introduces some alternatives to the storage of data including Parquet and SQL.
Part V – Modelling – begins with exploratory data analysis in Chapter 11. This is the critical process of coming to understand the nature of a dataset, but not something that typically finds its way into the final product. In Chapter 12 the use of linear models to explore data is introduced. And Chapter 13 considers generalized linear models.
Part VI – Applications – begins with Chapter 14 which is the first of three applications of modelling. It focuses on attempts to make causal claims from observational data and covers approaches such as difference-in-differences, regression discontinuity, and instrumental variables. Chapter 15 is the second of the modelling applications chapters and focuses on multilevel regression with post-stratification where we use a statistical model to adjust a sample for known biases. Chapter 16 is the third and final modelling application and is focused on text-as-data.
Finally, Chapter 17 offers some concluding remarks, details some open problems, and suggests some next steps.
Online appendices offer critical aspects that are either a little too unwieldy for the size constraints of the page, or likely to need more frequent updating than is reasonable for a printed book. Appendix A goes through some essential tasks in R, which is the statistical programming language used in this book. It is more of a reference chapter and you may find yourself returning to it as you go through the rest of the book. The core of this book is centered around Quarto; however, its predecessor, R Markdown, has not yet been sunsetted and there is a lot of material available for it. As such, Appendix B contains a translation of the Quarto material in Chapter 3 for R Markdown. A set of papers is included in Appendix C. If you write these, you will be conducting original research on a topic that is of interest to you. Although open-ended research may be new to you, the extent to which you are able to develop your own questions, use quantitative methods to explore them, and communicate your findings is the measure of the success of this book. Interactive communication in Appendix D covers aspects such as websites, web applications, and maps that can be interacted with. Appendix E provides an example of a datasheet. Appendix F gives a brief overview of SQL essentials. Appendix H considers how to make model estimates and forecasts more widely available by using the cloud and deploying models. Finally, Appendix I provides some ideas for using this book in teaching.
Pedagogy and key features
You have to do the work. You should actively go through material and code yourself. As King (2000) says “[a]mateurs sit and wait for inspiration, the rest of us just get up and go to work”. Do not passively read this book. My role is best described by Hamming (2020, 2–3):
I am, as it were, only a coach. I cannot run the mile for you; at best I can discuss styles and criticize yours. You know you must run the mile if the athletics course is to be of benefit to you—hence you must think carefully about what you hear and read in this book if it is to be effective in changing you—which must obviously be the purpose…
This book is structured around a dense 12-week course. It provides enough material for advanced readers to be challenged, while establishing a core that all readers should master. Typical courses that I teach cover the material through to Chapter 12, and then pick another couple of chapters that are of particular interest. But it depends on the backgrounds and interests of the students.
From as early as Chapter 2 you will have a workflow—plan, simulate, acquire, model, and communicate—allowing you to tell a convincing story with data. In each subsequent chapter you will add aspects and depth to this workflow that will allow you to speak with increasing sophistication and credibility. As this workflow expands it addresses the skills that are typically sought in industry. For instance, features such as: communication, ethics, reproducibility, research question development, data collection, data cleaning, data protection and dissemination, exploratory data analysis, statistical modelling, and scaling.
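To give a flavor of that workflow, here is a minimal sketch in base R. It is only an illustration and is not one of the worked examples from Chapter 2: the scenario (hours studied and exam marks), the variable names, and the numbers are all assumptions made up for this purpose, and simulated data stand in for the acquire step.

```r
# Plan: a hypothetical question. Is study time related to exam marks?

# Simulate: create a made-up dataset with a known relationship
# (an assumed slope of 3 marks per hour) so we know what to expect.
set.seed(853)
n <- 200
hours <- runif(n, min = 0, max = 10)
marks <- 50 + 3 * hours + rnorm(n, mean = 0, sd = 5)
simulated <- data.frame(hours = hours, marks = marks)

# Acquire: in real work we would now gather actual data.
# Here the simulated data stand in for them.

# Model: fit a simple linear regression.
fit <- lm(marks ~ hours, data = simulated)
estimated_slope <- unname(coef(fit)["hours"])

# Communicate: report the estimate in plain language.
cat(
  "Each extra hour of study is associated with about",
  round(estimated_slope, 1),
  "more marks.\n"
)
```

Because the data were simulated with a slope of 3, the estimate should land close to that value. Checking that a model recovers what was built into simulated data is one reason to simulate before acquiring real data.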
One of the defining aspects of this book is that ethics and inequity concerns are integrated throughout, rather than being clustered in one, easily ignorable, chapter. These aspects are critical, yet it can be difficult to immediately see their value, hence their tight integration.
This book is also designed to enable you to build a portfolio of work that you could show to a potential employer. If you want a job in industry, then this is arguably the most important thing that you should be doing. Robinson and Nolis (2020, 55) describe how a portfolio is a collection of projects that show what you can do and is something that can help you be successful in a job search.
In the novel The Last Samurai (DeWitt 2000, 326), a character says:
[A] scholar should be able to look at any word in a passage and instantly think of another passage where it occurred; … [so a] text was like a pack of icebergs each word a snowy peak with a huge frozen mass of cross-references beneath the surface.
In an analogous way, this book not only provides you with text and instruction that is self-contained, but also helps develop critical masses of knowledge on which expertise is built. No chapter positions itself as the last word; instead, each is written in relation to other work.
Each chapter has the following features:
- A list of required materials that you should go through before you read that chapter. To be clear, you should first read that material and then return to this book. Each chapter also contains extensive references, and those who are particularly interested in the topic should use these as a starting place for further exploration.
- A summary of the key concepts and skills that are developed in that chapter. Technical chapters additionally contain a list of the main packages and functions that are used in the chapter. The combination of these features acts as a checklist for your learning, and you should return to them after completing the chapter.
- Hilary Hahn, the American violinist, publicly documents herself practicing the violin, often scales or similar exercises, almost every day. I recommend something similar for data science. Every chapter contains “Scales” where I provide a small scenario and ask you to work through the data science workflow advocated in this book. These will probably take 15-30 minutes.
- A series of short questions that you should complete after going through the required materials, but before going through the chapter, to test your knowledge. After completing the chapter, you should go back through the questions to make sure that you understand each aspect.
- One or two tutorial questions to further encourage you to actively engage with the material. You could consider forming small groups to discuss your answers to these questions.
Some chapters additionally feature:
- A section called “Oh, you think we have good data on that!” which focuses on a particular setting, such as cause of death, in which it is often assumed that there is unimpeachable and unambiguous data but the reality tends to be quite far from that.
- A section called “Shoulders of giants”, which focuses on some of those who created the intellectual foundation on which we build.
Software information and conventions
The software that I use in this book is R (R Core Team 2022). This language was chosen because it is open source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of well-developed features. I do not assume that you have used R before, and so another reason for selecting R for this book is the community of R users. The community is especially welcoming of newcomers and there is a lot of complementary beginner-friendly material available.
If you do not have a programming language, then R is a great one to start with. The ability to code is useful well beyond this book. If you have a preferred programming language already, then it would not hurt to also pick up R. That said, if you have a good reason to prefer another open-source programming language (for instance you use Python daily at work) then you may wish to stick with that. However, all examples in this book are in R.
Please download R and RStudio onto your own computer. You can download R for free here, and you can download RStudio Desktop for free here. Please also create an account on Posit Cloud here. This will allow you to run R in the cloud. And download Quarto here.
Packages are in typewriter text, for instance, tidyverse, while functions are also in typewriter text, but include brackets, for instance filter().
Land acknowledgment
This book was primarily written on the traditional lands of the Mississaugas of the Credit, the Huron-Wendat, and the Seneca. Data have long been used to oppress and harm, and the acknowledgement of the history of this land serves as a reminder of the need to try to be better in our own use of data.
Acknowledgments
Many people generously gave code, data, examples, guidance, opportunities, thoughts, and time that helped develop this book.
Thank you to David Grubbs, Curtis Hill, and the team at CRC Press for publishing this book, and providing invaluable guidance and support. Thank you to Isabella Ghement, who thoroughly went through an early draft of this book and provided detailed feedback that improved this book.
Thank you to Annie Collins who went through every word in this book, improving many of them, and helped to sharpen my thinking on much of the content covered in this book.
Thank you to Emily Riederer who provided detailed comments on the initial plans for the book. She then returned to the manuscript after it was drafted and went through it in minute detail. Her thoughtful comments greatly improved the book, but more broadly her work changed the way I thought about much of the content of this book.
I was fortunate to have many reviewers who read entire chapters, sometimes two, three, or even more. They very much went above and beyond and provided excellent suggestions for improving this book. For this I am indebted to Albert Rapp, Alex Luscombe (who also suggested the Police Violence “Oh, you think…” entry), Ariel Mundo, Benjamin Haibe-Kains, Dan Ryan, Erik Drysdale, Jack Bailey, Jae Hattrick-Simpers, Jon Khan, Jonathan Keane (who also generously shared their parquet expertise), Lauren Kennedy (who also generously shared code, data, and expertise, to develop my thoughts about MRP), Liam Welsh, Liza Bolton (who also helped develop my ideas around how this book should be taught), Luis Correia, Matt Ratto, Matthias Berger, Michael Moon, Roberto Lentini, Ryan Briggs, and Taylor Wright.
Many people made specific suggestions that greatly improved things. All these people contribute to the spirit of generosity that characterizes the open source programming languages communities this book builds on. I am grateful to them all. A Mahfouz made me realize that covering Poisson regression was critical. Aaron Miller suggested the FINER framework. Alison Presmanes Hill suggested Wordbank. Chris Warshaw suggested the Democracy Fund Voter Study Group survey data. Christina Wei pointed out many code errors. Ella Kaye suggested, and rightly insisted on, moving to Quarto. Faria Khandaker suggested what became the R Essentials chapter. Hareem Naveed generously shared her industry experience. Heath Priston provided assistance with Toronto homelessness data. Jesse Gronsbell gave invaluable suggestions around ethical statistical practice. Keli Chiu reinforced how essential text as data is. Leslie Root came up with the idea around “Oh, you think we have good data on that!”. Michael Chong shaped my approach to EDA. Michael Donnelly, Peter Hepburn, and Léo Raymond-Belzile provided pointers to classic papers that I was unaware of, in political science, sociology, and statistics, respectively. Nick Horton suggested the Hadley Wickham video in Chapter 11. Paul Hodgetts taught me how to make R packages and made the cover art for this book. Radu Craiu ensured sampling was afforded its appropriate place. Sharla Gelfand shaped some of the approaches I advocate around the appropriate use of R. Thomas William Rosenthal made me realize the potential of Shiny. Tom Cardoso and Zane Schwartz were excellent sources of data put together by journalists. Yanbo Tang assisted with Nancy Reid’s “Shoulders of giants” entry. Finally, Chris Maddison and Maia Balint suggested the closing poem.
Thank you to my PhD supervisory panel John Tang, Martine Mariotti, Tim Hatton, and Zach Ward who gave me the freedom to explore the intellectual space that was of interest to me, the support to follow through on those interests, and the guidance to ensure that it all resulted in something tangible. What I learnt during those years served as the foundation for everything in this book.
This book has greatly benefited from the notes and teaching materials of others that are freely available online, including: Chris Bail, Scott Cunningham, Andrew Heiss (who independently taught a course with the same name as this book, well before the book was available), Lisa Lendway, Grant McDermott, Nathan Matias, David Mimno, and Ed Rubin. Thank you to these people. The changed norm of academics making their materials freely available online is a great one and one that I hope the free online version of this book, available here, helps contribute to.
Thank you to Samantha-Jo Caetano who helped develop some of the assessment items. And also, to Lisa Romkey and Alan Chong who allowed me to adapt some aspects of their rubric. The catalysts for aspects of the Chapter 4 tutorial were McPhee (2017, 186) and Chelsea Parlett-Pelleriti. The idea behind the Appendix D tutorial was work by Mauricio Vargas Sepúlveda (“Pachá”) and Andrew Whitby.
I am grateful for the suggestions of: Amy Farrow, Arsh Lakhanpal, Cesar Villarreal Guzman, Finn Korol-O’Dwyer, Flavia López, Gregory Power, Hong Shi, Jayden Jung, John Hayes, Joyce Xuan, Laura Cline, Lorena Almaraz De La Garza, Michaela Drouillard, Mounica Thanam, Reem Alasadi, Tayedza Chikumbirike, Wijdan Tariq, Yang Wu, and Yewon Han.
Kelly Lyons provided support, guidance, mentorship, and friendship. Every day she demonstrates what an academic should be, and more broadly, what one should aspire to be as a person.
Thank you to Greg Wilson for providing a structure to think about teaching, suggesting the “Scales” style exercises, being the catalyst for this book, and giving helpful comments on drafts. Every day he provides an example of how to contribute to the intellectual community.
Thank you to Elle Côté for enabling this book to be written by looking after first one, and then two, children during a pandemic.
As at Christmas 2021 this book was a disparate collection of partially completed notes. Thank you to Mum and Dad, who dropped everything and came over from the other side of the world for two months to give me the opportunity to re-write it all and put together a cohesive draft.
Finally, thank you to Monica Alexander. Without you I would not have written a book; I would not have even thought it possible. Many of the best ideas in this book are yours, and those that are not, you made better, reading everything many times. Thank you for your inestimable help with writing this book, providing the base on which it builds (remember in the library showing me many times how to get certain rows in R!), giving me the time that I needed to write, encouragement when it turned out that writing a book just meant endlessly re-writing that which was perfect the day before, reading everything in this book many times, making coffee or cocktails as appropriate, and more.
You can contact me at: rohan.alexander@utoronto.ca.
Rohan Alexander
Toronto, Canada
March 2023