17  Concluding remarks

Required material

17.1 Concluding remarks

There is an old saying, something along the lines of “may you live in interesting times”. Maybe every generation feels this way, but we sure do live in interesting times. In this book, we have covered some essential skills for telling stories with data. And this is just the start.

In less than a generation, data science has gone from something that barely existed to a defining part of academia and industry. The extent and pace of this change have many implications for those learning data science. For instance, it may imply that one should not just make decisions that optimize for what data science looks like right now, but also for what it could become. While that is a little difficult, it is also one of the things that makes data science so exciting. That might mean choices like:

  • taking courses on fundamentals, not just fashionable applications;
  • reading core texts, not just whatever is trending; and
  • trying to be at the intersection of at least a few different areas, rather than hyper-specialized.

One of the most exciting moments when learning data science is realizing that you just love playing with data. A decade ago, this did not fit into any particular department or company. These days, it fits into almost any of them. Data science needs diversity, both in terms of approaches and applications. It is increasingly the most important work in the world, and hegemonic approaches have no place. It is just such an exciting time to be enthusiastic about data and able to build things.

The central thesis of this book has been that a revolution is needed in data science, and we have proposed one view of what it could look like. This revolution builds on the long history of statistics, borrows heavily from computer science, and draws on other disciplines as needed, but is centered around reproducibility, workflows, and respect. When data science began it was nebulous and ill-defined. As it has matured, we now come to see it as able to stand on its own.

This book has been a reimagining of what data science is, and what it could be. In Chapter 1 we provided an informal definition of data science, but having built a foundation, it is time to revisit this. We consider data science to be the process of developing and applying a principled, tested, reproducible, end-to-end workflow that focuses on quantitative measures in and of themselves, and as a foundation to explore questions. We have known for a long time what rigor looks like in mathematical and statistical theory: theorems are accompanied by proofs (Horton et al. 2022). And we increasingly know what rigor looks like in data science: claims that are accompanied by verified, tested, reproducible code and data. Rigorous data science creates lasting understanding of the world.

17.2 Some issues

There are many issues that are outstanding as we think about data science. They are not the type of issues with a definitive answer. Instead, they are questions to be explored and played with. This work will move data science forward and, more importantly, help us tell better stories about the world. Here we detail some of the most pressing concerns. These range from specific matters, such as writing tests, evaluating data cleaning, and naming, to broader concerns, such as the state of machine learning, the relationship with computer science, statistics, and industry, as well as teaching.

1. How do we write effective tests?

Computer science has built a thorough foundation around testing and the importance of unit and functional tests is broadly accepted. One of the innovations of this book has been to integrate testing throughout the data science workflow, but this, like the first iteration of anything, needs considerable innovation and development.

We need to thoroughly integrate testing through data science, but it is far less clear what this should look like, how we should do this, and what the end-state is. What does it mean to have well-tested code in data science? Code coverage, which is a measure of the percentage of lines of code that are executed by tests, is not especially meaningful in data science, but it is less clear what would be. What do tests look like in data science? How are they written? The extensive use of simulation in statistics, which data science has adopted, provides groundwork, but there is a significant amount of work and investment that is needed.
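One possible shape for such tests is sketched below: simulate a dataset with the properties we expect, then write explicit checks that both the simulated and, later, the real data must pass. This is a minimal illustration rather than a settled answer to the questions above. The column names (`respondent_id`, `age`, `region`), the allowed ranges, and the function names are all assumptions made for the example.

```python
import random


def simulate_responses(n, seed=853):
    """Simulate a small survey dataset with hypothetical columns."""
    rng = random.Random(seed)
    regions = ["north", "south", "east", "west"]
    return [
        {
            "respondent_id": i,
            "age": rng.randint(18, 90),
            "region": rng.choice(regions),
        }
        for i in range(1, n + 1)
    ]


def find_problems(rows, expected_n):
    """Return a list of failed expectations; an empty list means every check passed."""
    problems = []
    # The dataset should have exactly the number of rows we planned for.
    if len(rows) != expected_n:
        problems.append(f"expected {expected_n} rows, found {len(rows)}")
    # Each respondent should appear once.
    ids = [row["respondent_id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("respondent_id is not unique")
    # Ages should be integers within the (assumed) sampled range.
    if any(
        not isinstance(row["age"], int) or not 18 <= row["age"] <= 90
        for row in rows
    ):
        problems.append("age is missing or outside 18-90")
    # Regions should come from the (assumed) known set.
    allowed = {"north", "south", "east", "west"}
    if any(row["region"] not in allowed for row in rows):
        problems.append("unexpected region")
    return problems


simulated = simulate_responses(100)
print(find_problems(simulated, expected_n=100))
```

The same `find_problems` function can then be run against the real dataset after cleaning. Checks like these test the data rather than the lines of code, which is one reason code coverage says little about them.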

2. What is happening at the data cleaning and preparation stage?

We basically do not have a good understanding of how much any of this matters. Huntington-Klein et al. (2021) and Breznau et al. (2022), among others, showed that hidden research decisions have a big effect on subsequent estimates, sometimes greater than the standard errors. There has been much exploration of the modelling aspect, but we need much more investigation of how the earlier stages of the data science workflow affect the conclusions. More specifically, we need to look for key points of failure and understand the ways in which failure can happen.

This is especially concerning as we scale to larger datasets. For instance, ImageNet is a dataset of 14 million images, which were hand-annotated. The cost, in both time and money, makes it prohibitively difficult to go through every image to ensure the label is consistent with the needs of each user of the dataset. Yet without undertaking this it is difficult to have much faith in subsequent model forecasts, especially in non-obvious cases.
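To make the idea of a hidden research decision concrete, the following toy sketch applies three defensible cleaning rules to the same raw values and compares the resulting estimates. The data are invented for illustration, and the ambiguity about whether 0 and 99 are real answers or missing-data codes is an assumption of the example, not a finding from the studies cited above.

```python
from statistics import mean

# Hypothetical raw survey values for "hours worked last week".
# It is ambiguous whether 0 and 99 are real answers or codes for non-response.
raw_hours = [35, 40, 38, 0, 45, 99, 42, 37]


def clean(values, missing_codes):
    """Drop any value that this cleaning rule treats as a missing-data code."""
    return [v for v in values if v not in missing_codes]


# Three cleaning rules that different analysts might each consider reasonable.
cleaning_rules = {
    "keep everything": set(),
    "treat 99 as missing": {99},
    "treat 0 and 99 as missing": {0, 99},
}

# The same estimator (the mean) applied after each cleaning rule.
estimates = {
    name: round(mean(clean(raw_hours, codes)), 2)
    for name, codes in cleaning_rules.items()
}

for name, estimate in estimates.items():
    print(f"{name}: mean hours = {estimate}")
```

Even in this tiny example the estimate moves by several hours depending on a choice that would rarely be reported. Recording each rule explicitly, as in `cleaning_rules`, at least makes the decision visible and easy to vary.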

3. How do we create effective names?

One of the crowning achievements of biology is the binomial nomenclature, which is the formal systematic approach to names, established by Carolus Linnaeus, the eighteenth-century physician (Morange 2016, 81). Each species is referred to by two words with Latin grammatical form: the first is its genus, and the second is an adjective to characterize the species. As discussed in Chapter 9, names are a large source of friction, and a generalized approach is needed in data science.

The reason this is so pressing is that it affects understanding, which impacts efficiency. The binomial nomenclature provides diagnostic information, not just a casual reference (Koerner 2000, 45). This is particularly the case when data science is conducted by a team, rather than by just one individual. A thorough understanding of what makes an effective name, and then infrastructure to encourage such names, would bring significant dividends.
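As one small example of what such infrastructure could look like, the sketch below flags column names that break a naming convention. The convention used here, lowercase snake_case, is an assumption chosen for illustration, as are the example column names.

```python
import re

# Assumed convention for illustration: lowercase snake_case, starting with a letter.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def flag_names(names):
    """Return the names that do not follow the assumed snake_case convention."""
    return [name for name in names if not SNAKE_CASE.fullmatch(name)]


columns = ["respondent_id", "Age", "income 2021", "region", "q17b"]
print(flag_names(columns))
```

A check like this only enforces form. A name such as `q17b` passes even though it tells the reader nothing, which is why an understanding of what makes a name effective has to come first.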

4. What happened to the machine learning revolution?

It is now clear that the machine learning revolution was oversold, especially when it is called “AI” (Pretz 2021). Some of this may be due to hidden technical debt that will come due at some point (Sculley et al. 2015). Some of this may be a disdain for data, which has hamstrung AI applications (Sambasivan et al. 2021). For instance, there is evidence that misapplied AI may have caused unnecessary death during the pandemic, partly due to a lack of focus on the data (Heaven 2021).

Another aspect of this is that it is difficult to apply machine learning properly. For instance, in the social sciences, we have yet to see a convincing application of machine learning methods, which are designed for prediction, to a social sciences problem where what we care about is understanding. Instead, we have seen machine learning methods misapplied (Kapoor and Narayanan 2022). And it may be that the AI researchers, who have expertise in these methods, are uninterested in the messy real world (Kerner 2020).

Machine learning is based on humans, not machines (Boykis 2019). This is true of both the underlying datasets, which require labeling, and of the implementation, which occurs within a politics and culture that impose certain values on machine learning research (Birhane et al. 2022). The statistical foundation of machine learning means this can never be fully automated without considerable danger of nonsensical results (Scicurious 2012).

Systematic evaluation of the track record, and potential, of these new approaches in established areas is needed.

5. What is the appropriate relationship for data science with the constituent parts?

We have described the origins of data science as being various disciplines. Moving forward we need to consider what role these constituent parts, especially statistics and computer science, should play. More generally, we also need to establish how data science relates to, and interacts with, econometrics, applied statistics, and computational social science. These draw on data science to answer questions in their own discipline, but like statistics and computer science, they also contribute back to data science.

We must be careful to continue to learn statistics from statisticians, computer science from computer scientists, etc. An example of the danger of not doing this is clear in the case of p-values, which we have not made much of in this book, but which dominate quantitative analysis even though statisticians have warned about their misapplication for decades. One issue with not learning statistics from statisticians is that statistical practice can become a recipe that is blindly followed, because that is the easiest way to teach it, even though that is not how statisticians do statistics. We see this, too, in conversations about the related concept of statistical power, as though there is one appropriate level that we need to meet.

Data science must remain deeply connected to these disciplines. How we continue to ensure that data science has the best aspects, without also bringing bad practice, is an especially significant challenge. And this is not just technical, but also cultural (Meng 2021). It is especially challenging to ensure that data science maintains an inclusive culture of excellence.

6. How do we teach data science?

We are beginning to have agreement on what the foundations of data science are. They involve developing comfort with: computational thinking, sampling, statistics, graphs, Git/GitHub, SQL, command line, cleaning messy data, a few languages including R and Python, ethics, and writing. But we have very little agreement on how best to teach it. Partly this is because data science instructors often come from different fields, but it is also partly a difference in resources and priorities.

Given the demand for data science skills, there is no point confining data science education to graduate students because undergraduate students will be doing it when they enter the workforce whether or not we train them. If data science is to be taught at the undergraduate level, then it needs to be robust enough to be taught in large classes. Developing teaching tools that scale is critical. For instance, GitHub Actions could be used to run checks of student code and suggest improvements without instructor involvement. However, it is especially difficult to scale case-study-style classes, which students find so useful.
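The kind of automated check mentioned above could be as simple as a script that a continuous integration job runs on each submission. The sketch below is one hypothetical version. It assumes the student code is written in Python, and the two checks (missing docstrings and hard-coded absolute paths) are illustrative choices, not an established rubric. The workflow configuration that would invoke it is not shown.

```python
import ast


def review_submission(source):
    """Return feedback messages for a student's Python script (hypothetical checks)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as error:
        return [f"line {error.lineno}: code does not parse"]

    findings = []
    for node in ast.walk(tree):
        # Flag functions that have no docstring.
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node) is None:
            findings.append(
                (node.lineno, f"function '{node.name}' has no docstring")
            )
        # Flag string literals that look like machine-specific absolute paths.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(("/Users/", "/home/", "C:\\")):
                findings.append((node.lineno, "avoid hard-coded absolute paths"))

    # Report in line order so feedback reads top to bottom.
    return [f"line {lineno}: {message}" for lineno, message in sorted(findings)]


example = '''
def load():
    return open("/Users/me/data.csv")
'''

for message in review_submission(example):
    print(message)
```

Feedback of this sort scales to large classes because it needs no instructor time per submission, although it can only cover the mechanical aspects of a workflow.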

7. What does the relationship between industry and academia look like?

Considerable innovation in data science occurs in industry, and yet sometimes this knowledge cannot be shared, and when it can it tends to be done slowly. The term data science has been used in academia since the 1960s, but it is because of industry that it has become popular (Irizarry 2020).

Lashing academia and industry together is both a key challenge for data science and one of the easiest to overlook. The problems faced in industry, for instance scoping the needs of a client and operating at scale, are removed from typical academic concerns. There is a danger that academic research could be rendered pointless unless academics in data science maintain one foot in industry, and those in industry actively participate in academia. From the industry side, ensuring that best practice is quickly adopted can be challenging if there is no immediate payoff. Ensuring that industry experience is valued in academic hiring and grant evaluation would help, as would encouraging entrepreneurship in academia.

17.3 Next steps

This book has covered much ground, and while we are toward the end of it, as the butler Stevens is told in the novel The Remains of the Day (Ishiguro 1989):

The evening’s the best part of the day. You’ve done your day’s work. Now you can put your feet up and enjoy it.

Chances are there are aspects that you want to explore further, building on the foundation that you have established. If so, then the book has accomplished its aim.

If you were new to data science at the start of this book, then the next step would be to backfill that which was skipped over. Begin with Data Science: A First Introduction (Timbers, Campbell, and Lee 2022). After that, you should learn more about R in terms of data science by going through R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023). We used R in this book and only mentioned SQL and Python, but it is important to develop comfort in these languages, starting, respectively, with Teate (2022) and Gries, Campbell, and Montojo (2017), as well as the free replit “100 Days of Code” Python course.

To learn more about causality start with Causal Inference: The Mixtape (Cunningham 2021) and The Effect: An Introduction to Research Design and Causality (Huntington-Klein 2021).

If you are interested in learning more about statistics, then begin with Statistical Rethinking: A Bayesian Course with Examples in R and Stan (McElreath 2020a). Then backfill with Bayes Rules! An Introduction to Bayesian Modeling with R (Johnson, Ott, and Dogucu 2022) and solidify the foundation with Bayesian Data Analysis (Gelman et al. 2014). And it would be worthwhile to bolster some of the probability foundations that we have skimmed over with All of Statistics (Wasserman 2005).

There is only one natural next step if you are interested in learning more about machine learning and that is An Introduction to Statistical Learning (James et al. 2021) followed by The Elements of Statistical Learning (Friedman, Tibshirani, and Hastie 2009). After that, learn how to implement it all with Tidy Modeling with R (Kuhn and Silge 2022).

If you are interested in sampling, then the next book to turn to is Sampling: Design and Analysis (Lohr 2022). To deepen your understanding of surveys and experiments, go next to Field Experiments: Design, Analysis, and Interpretation (Gerber and Green 2012) in combination with Trustworthy Online Controlled Experiments (Kohavi, Tang, and Xu 2020).

For developing better data visualization skills, begin by turning to Data Sketches (Bremer and Wu 2021) and Data Visualization (Healy 2018), but then after that, develop strong foundations, such as The Grammar of Graphics (Wilkinson 2005).

In terms of ethics, there are a variety of books. We have covered many chapters of it throughout this book, but going through Data Feminism (D’Ignazio and Klein 2020) end-to-end would be useful, as would Atlas of AI (Crawford 2021).

And finally, for writing, it would be best to turn inward. Force yourself to write and publish every day for a month. Then do it again and again. You will get better. That said, there are some useful books, including Working (Caro 2019) and On Writing: A Memoir of the Craft (King 2000).

We often hear the phrase “let the data speak”. Hopefully by this point you understand that never happens. All that we can do is to acknowledge that we are the ones using data to tell stories and strive and seek to make them worthy.

It was her voice that made
The sky acutest at its vanishing.
She measured to the hour its solitude.
She was the single artificer of the world
In which she sang. And when she sang, the sea,
Whatever self it had, became the self
That was her song, for she was the maker.

“The Idea of Order at Key West”, (Stevens 1934)

17.4 Exercises

Questions

  1. What is data science?
  2. Who does data affect, and what affects data?
  3. Discuss the inclusion of “race” and/or “sexuality” in a model.
  4. What makes a story more or less convincing?
  5. What is the role of ethics when dealing with data?