18 Concluding remarks

18.1 Concluding remarks

There is an old saying, something along the lines of “may you live in interesting times”. Maybe every generation feels this way, but we sure do live in interesting times. In this book, we have covered some essential skills for telling stories with data. And this is just the start.

In less than a generation, data science has gone from something that barely existed to a defining part of academia and industry. The extent and pace of this change have many implications for those learning data science. For instance, it may imply that one should not just make decisions that optimize for what data science looks like right now, but also for what it could become. While that is a little difficult, it is also one of the things that makes data science so exciting. That might mean choices like:

  • taking courses on fundamentals, not just fashionable applications;
  • reading core texts, not just whatever is trending; and
  • trying to be at the intersection of at least a few different areas, rather than hyper-specialized.

One of the most exciting moments when learning data science is realizing that you just love playing with data. A decade ago, this did not fit into any particular department or company. These days, it fits into almost any of them.

Data science needs to insist on diversity, both in terms of approaches and applications. It is increasingly the most important work in the world, and hegemonic approaches have no place. It is just such an exciting time to be enthusiastic about data and able to build things.

The central thesis of this book has been that a revolution is needed in data science, and we have proposed one view of what it could look like. This revolution builds on the long history of statistics, borrows heavily from computer science, and draws on other disciplines as needed, but is centered around reproducibility, workflows, and respect. When data science began it was nebulous and ill-defined. As it has matured, we now come to see it as able to stand on its own.

This book has been a reimagining of what data science is, and what it could be. In Chapter 1 we provided an informal definition of data science. We now revisit it. We consider data science to be the process of developing and applying a principled, tested, reproducible, end-to-end workflow that focuses on quantitative measures in and of themselves, and as a foundation to explore questions. We have known for a long time what rigor looks like in mathematical and statistical theory: theorems are accompanied by proofs (Horton et al. 2022). And we increasingly know what rigor looks like in data science: claims that are accompanied by verified, tested, reproducible code and data. Rigorous data science creates lasting understanding of the world.

18.2 Some outstanding issues

There are many issues that are outstanding as we think about data science. They are not the type of issues with a definitive answer. Instead, they are questions to be explored and played with. This work will move data science forward and, more importantly, help us tell better stories about the world. Here we detail some of them.

1. How do we write effective tests?

Computer science has built a thorough foundation around testing, and the importance of unit and functional tests is broadly accepted. One of the innovations of this book has been to integrate testing throughout the data science workflow, but this, like the first iteration of anything, needs considerable improvement and development.

We need to thoroughly integrate testing throughout data science. But it is unclear what this should look like, how we should do it, and what the end-state is. What does it mean to have well-tested code in data science? Code coverage, which measures the percentage of lines of code that are executed by tests, is not especially meaningful in data science, but what should we use instead? What do tests look like in data science? How are they written? The extensive use of simulation in statistics, which data science has adopted, provides groundwork, but a significant amount of work and investment is still needed.
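One way that simulation can provide groundwork for tests is sketched below. This is a minimal illustration under stated assumptions, not a settled answer to the questions above. It is written in Python, although the same idea carries over to R. The variable (respondent ages), the allowed range of 18 to 100, and the function names `simulate_ages` and `check_ages` are all hypothetical. The idea is to write the checks against simulated data first, and then apply the same checks to the actual data once they are available.

```python
import random


def simulate_ages(n, seed=853):
    """Simulate a hypothetical column of respondent ages between 18 and 100."""
    rng = random.Random(seed)
    return [rng.randint(18, 100) for _ in range(n)]


def check_ages(ages, expected_n):
    """Return a list of failed checks; an empty list means every check passed."""
    failures = []
    if len(ages) != expected_n:
        failures.append("unexpected number of rows")
    if not all(isinstance(a, int) for a in ages):
        failures.append("non-integer age")
    elif ages and (min(ages) < 18 or max(ages) > 100):
        failures.append("age out of range")
    return failures


# The checks are developed against simulated data...
simulated = simulate_ages(1000)
print(check_ages(simulated, expected_n=1000))

# ...and then re-used on (here, made-up) observed data.
observed = [25, 40, 17]
print(check_ages(observed, expected_n=3))
```

Checks like these say nothing about code coverage. They test whether the data match what we expected when we planned the analysis, which is one candidate for what "well-tested" could mean in data science.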

2. What is happening at the data cleaning and preparation stage?

We do not have a good understanding of how much data cleaning and preparation is driving estimates. Huntington-Klein et al. (2021) and Breznau et al. (2022), among others, have begun this work. They show that hidden research decisions have a big effect on subsequent estimates, sometimes greater than the standard errors. Statistics provides a good understanding of how modeling affects estimates, but we need more investigation of the influence of the earlier stages of the data science workflow. More specifically, we need to look for key points of failure and understand the ways in which failure can happen.

This is especially concerning as we scale to larger datasets. For instance, ImageNet is a dataset of more than 14 million hand-annotated images. The cost, in both time and money, makes it prohibitively difficult to go through every image to ensure the label is consistent with the needs of each user of the dataset. Yet without doing so it is difficult to have much faith in subsequent model forecasts, especially in non-obvious cases.
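The way that cleaning decisions can drive an estimate can be seen even in a toy example. The sketch below is written in Python and uses entirely made-up income responses. The `-99` "refused" code, the cap of 250,000, and the function names are assumptions for illustration and are not drawn from the studies cited above. Each cleaning rule is defensible on its own, yet the estimated mean differs substantially between them.

```python
from statistics import mean

# Hypothetical survey responses; -99 is assumed to be a "refused" code.
raw_incomes = [42_000, 51_000, -99, 38_000, 1_200_000, 47_000, -99, 55_000]


def drop_codes(values, code=-99):
    """Cleaning rule 1: remove the missing-value code and nothing else."""
    return [v for v in values if v != code]


def drop_codes_and_cap(values, code=-99, cap=250_000):
    """Cleaning rule 2: remove the code and also cap extreme values."""
    return [min(v, cap) for v in values if v != code]


# The same estimand, the mean income, under three preparation choices.
estimates = {
    "no cleaning": mean(raw_incomes),
    "drop codes": mean(drop_codes(raw_incomes)),
    "drop codes and cap": mean(drop_codes_and_cap(raw_incomes)),
}

for rule, estimate in estimates.items():
    print(f"{rule}: {estimate:,.0f}")
```

None of these choices would typically appear in a results table, which is part of why their influence is hard to see. Reporting an estimate under several reasonable cleaning rules is one possible way to make it visible.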

3. How do we create effective names?

One of the crowning achievements of biology is the binomial nomenclature. This is the formal systematic approach to names, established by Carolus Linnaeus, the eighteenth-century physician (Morange 2016, 81). Each species is referred to by two words with Latin grammatical form: the first is its genus, and the second is an adjective to characterize the species. Ensuring standardized nomenclature is given active consideration in biology. For instance, the use of nomenclature committees by researchers is recommended (McCarthy et al. 2023). As discussed in Chapter 9, names are a large source of friction in data science, and a similarly standardized approach is needed.

The reason this is so pressing is that it affects understanding, which impacts efficiency. The binomial nomenclature provides diagnostic information, not just a casual reference (Koerner 2000, 45). This is particularly the case when data science is conducted by a team, rather than by just one individual. A thorough understanding of what makes an effective name, and then infrastructure to encourage such names, would bring significant dividends.
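A small piece of the infrastructure mentioned above could be an automated check of variable names. The following Python sketch assumes, purely for illustration, that a team has agreed on lower snake case for column names. The pattern, the function name `nonconforming_names`, and the example columns are all hypothetical.

```python
import re

# Assumed convention: lower snake case, starting with a letter,
# with single underscores between parts.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def nonconforming_names(names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not SNAKE_CASE.fullmatch(name)]


columns = ["respondent_id", "Age", "income 2021", "state_of_residence", "q7"]
print(nonconforming_names(columns))
```

A check like this only enforces form. A name such as `q7` passes even though it tells a reader nothing about what the variable contains, which is why we also need a better understanding of what makes a name effective, not just consistent.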

4. What is the appropriate relationship between data science and its constituent parts?

We have described the origins of data science as lying in various disciplines. Moving forward we need to consider what role these constituent parts, especially statistics and computer science, should play. More generally, we also need to establish how data science relates to, and interacts with, econometrics, applied mathematics, and computational social science. These draw on data science to answer questions in their own discipline, but like statistics and computer science, they also contribute back to data science. For instance, applications of machine learning in computational social science need to focus on transparency, interpretability, uncertainty, and ethics, and this all advances the more theoretical machine learning research done in other disciplines (Wallach 2018).

We must be careful to continue to learn statistics from statisticians, computer science from computer scientists, etc. An example of the danger of not doing this is clear in the case of p-values, which we have not made much of in this book, but which dominate quantitative analysis even though statisticians have warned about their misapplication for decades. One issue with not learning statistics from statisticians is that statistical practice can become a recipe that is naively followed, because that is the easiest way to teach it, even though that is not how statisticians do statistics.

Data science must remain deeply connected to these disciplines. How we continue to ensure that data science has the best aspects, without also bringing bad practice, is an especially significant challenge. And this is not just technical, but also cultural (Meng 2021). It is particularly important to ensure that data science maintains an inclusive culture of excellence.

5. How do we teach data science?

We are beginning to have agreement on what the foundations of data science are. They involve developing comfort with: computational thinking, sampling, statistics, graphs, Git and GitHub, SQL, the command line, cleaning messy data, a few languages including R and Python, ethics, and writing. But we have very little agreement on how best to teach them. Partly this is because data science instructors often come from different fields, but it is also partly a difference in resources and priorities.

Complicating matters is that, given the demand for data science skills, we cannot limit data science education to graduate students, because undergraduate students need those skills when they enter the workforce. If data science is to be taught at the undergraduate level, then it needs to be robust enough to be taught in large classes. Developing teaching tools that scale is critical. For instance, GitHub Actions could be used to run checks of student code and suggest improvements without instructor involvement. However, it is especially difficult to scale case-study-style classes, which students often find so useful. Substantial innovation is needed.

6. What does the relationship between industry and academia look like?

Considerable innovation in data science occurs in industry, but sometimes this knowledge cannot be shared, and when it can it tends to be done slowly. The term data science has been used in academia since the 1960s, but it is because of industry that it has become popular in the past decade or so (Irizarry 2020).

Bringing academia and industry together is both a key challenge for data science and one of the easiest to overlook. The problems faced in industry, for instance scoping the needs of a client and operating at scale, are removed from typical academic concerns. There is a danger that academic research could be rendered moot unless academics establish and maintain one foot in industry, and enable industry to actively participate in academia. From the industry side, ensuring that best practice is quickly adopted can be challenging if there is no immediate payoff. Ensuring that industry experience is valued in academic hiring and grant evaluation would help, as would encouraging entrepreneurship in academia.

18.3 Next steps

This book has covered much ground, and while we are toward the end of it, as the butler Stevens is told in the novel The Remains of the Day by Kazuo Ishiguro:

The evening’s the best part of the day. You’ve done your day’s work. Now you can put your feet up and enjoy it.

Ishiguro (1989)

Chances are there are aspects that you want to explore further, building on the foundation that you have established. If so, then the book has accomplished its aim.

If you were new to data science at the start of this book, then the next step would be to backfill what we skipped over. Begin with Data Science: A First Introduction (Timbers, Campbell, and Lee 2022). After that go through R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund [2016] 2023). We used R in this book and only mentioned SQL and Python in passing, but it is important to develop comfort in these languages. Start with SQL for Data Scientists (Teate 2022), Python for Data Analysis (McKinney [2011] 2022), and the free Replit “100 Days of Code” Python course.

Sampling is a critical, but easy to overlook, aspect of data science. It would be sensible to go through Sampling: Design and Analysis (Lohr [1999] 2022). To deepen your understanding of surveys and experiments, go next to Field Experiments: Design, Analysis, and Interpretation (Gerber and Green 2012) and Trustworthy Online Controlled Experiments (Kohavi, Tang, and Xu 2020).

For developing better data visualization skills, begin by turning to Data Sketches (Bremer and Wu 2021) and Data Visualization (Healy 2018). After that, develop strong foundations, such as The Grammar of Graphics (Wilkinson 2005).

If you are interested in learning more about modeling, then the next steps are Statistical Rethinking: A Bayesian Course with Examples in R and Stan (McElreath [2015] 2020), which additionally has an excellent series of accompanying videos, Bayes Rules! An Introduction to Bayesian Modeling with R (Johnson, Ott, and Dogucu 2022), and Regression and Other Stories (Gelman, Hill, and Vehtari 2020). It would also be worthwhile to establish a foundation of probability with All of Statistics (Wasserman 2005).

There is only one natural next step if you are interested in machine learning, and that is An Introduction to Statistical Learning (James et al. [2013] 2021) followed by The Elements of Statistical Learning (Friedman, Tibshirani, and Hastie 2009).

To learn more about causality start with the economics perspective by going through Causal Inference: The Mixtape (Cunningham 2021) and The Effect: An Introduction to Research Design and Causality (Huntington-Klein 2021). Then turn to the health sciences perspective by going through What If (Hernán and Robins 2023).

For text as data, start with Text As Data (Grimmer, Roberts, and Stewart 2022). Then turn to Supervised Machine Learning for Text Analysis in R (Hvitfeldt and Silge 2021).

In terms of ethics, there are a variety of books. We have covered many chapters of Data Feminism (D’Ignazio and Klein 2020) throughout this book, but going through it end-to-end would be useful, as would Atlas of AI (Crawford 2021).

And finally, for writing, it would be best to turn inward. Force yourself to write every day for a month. Then do it again and again. You will get better. That said, there are some useful books, including Working (Caro 2019) and On Writing: A Memoir of the Craft (King 2000).

We often hear the phrase “let the data speak”. Hopefully it is clear this never happens. All that we can do is to acknowledge that we are the ones using data to tell stories, and strive and seek to make them worthy.

It was her voice that made
The sky acutest at its vanishing.
She measured to the hour its solitude.
She was the single artificer of the world
In which she sang. And when she sang, the sea,
Whatever self it had, became the self
That was her song, for she was the maker.

Extract from “The Idea of Order at Key West” (Stevens 1934)

18.4 Exercises

Questions

  1. What is data science?
  2. Who does data affect, and what affects data?
  3. Discuss the inclusion of “race” and/or “sexuality” in a model.
  4. What makes a story more or less convincing?
  5. What is the role of ethics when dealing with data?