18  Concluding remarks

18.1 Concluding remarks

There is an old saying, something along the lines of “may you live in interesting times”. Maybe every generation feels this way, but we sure live in interesting times. In this book, we have covered some essential skills for telling stories with data. And this is just the start.

In less than a generation, data science has gone from something that barely existed to a defining part of academia and industry. The extent and pace of this change has many implications for those learning data science. For instance, it suggests that one should not just make decisions that optimize for what data science looks like right now, but also for what it could become. While that is difficult, it is also one of the things that makes data science so exciting. That might mean choices like:

  • taking courses on fundamentals, not just fashionable applications;
  • reading books, not just whatever is trending; and
  • trying to be at the intersection of at least a few different areas, rather than hyper-specializing.

One of the most exciting moments when learning data science is realizing that you just love playing with data. A decade ago, this enthusiasm did not fit into any particular department; these days it fits into almost any of them. Data science needs diversity, both in terms of approaches and applications. It is increasingly some of the most important work in the world, and hegemonic approaches have no place. It is just such an exciting time to be enthusiastic about data and able to build things.

The central thesis of this book has been that a revolution is needed in data science, and we have proposed one view of what it could look like. This revolution builds on the long history of statistics, takes heavily from computer science, and draws on other disciplines as needed, but is centered around reproducibility, workflows, and respect. When data science began, it was nebulous and ill-defined. As it has matured, it has come to be a separate discipline, able to stand on its own.

This book has been a reimagining of what data science is, and what it could be. In Chapter 1 we provided an informal definition of data science, but having built a foundation, it is time to revisit it. We consider data science to be the process of developing and applying a principled, tested, reproducible, end-to-end workflow that focuses on quantitative measures both in and of themselves and as a foundation from which to explore questions. Excellence in data science creates lasting understanding of the world.

18.2 Some issues

Many issues remain outstanding as we think about data science. They are not the type of issues that have a definitive answer. Instead, they are questions to be explored and played with. This work will move the discipline forward and, more importantly, help us tell better stories about the world. Here we detail some of the most pressing concerns.

1. How do we write effective tests?

Computer science has built a thorough foundation around testing, and the importance of unit and functional tests is broadly accepted. One of the innovations of this book has been to integrate testing throughout the data science workflow, but this, like the first iteration of anything, needs considerable refinement and development.

We need to thoroughly integrate testing throughout data science, but it is far less clear what this should look like, how we should do it, and what the end-state is. What does it mean to have well-tested code in data science? Code coverage, which measures the percentage of lines of code that have tests, is not especially meaningful in data science, but it is not clear what a better measure would be. What do tests look like in data science? How are they written? The extensive use of simulation in statistics, which data science has adopted, provides groundwork, but significant work and investment are needed.
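
As one hedged illustration of where that groundwork might lead, the following sketch simulates a dataset with known properties and then tests those properties with testthat. The variable names, seed, and bounds are assumptions made for the example; the same checks could later be run against a real dataset.

```r
library(testthat)

# A minimal sketch of simulation-based testing: simulate data with known
# properties, then write tests of those properties. The variables and
# bounds here are illustrative assumptions.
set.seed(853)

simulated_survey <- data.frame(
  age = sample(18:90, size = 1000, replace = TRUE),
  income = rexp(n = 1000, rate = 1 / 50000)
)

# These tests encode the properties we imposed on the simulation; the same
# tests could later be run against the actual dataset once it is gathered.
test_that("age is within the bounds we expect", {
  expect_true(all(simulated_survey$age >= 18))
  expect_true(all(simulated_survey$age <= 90))
})

test_that("income is numeric and non-negative", {
  expect_type(simulated_survey$income, "double")
  expect_true(all(simulated_survey$income >= 0))
})
```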

2. What is happening at the data cleaning and preparation stage?

We basically do not have a good understanding of how much any of this matters. Huntington-Klein et al. (2021) showed that hidden research decisions have a big effect on subsequent estimates, sometimes greater than the standard errors, which undermines the resulting claims. We need much more investigation of how these early stages of the data science workflow affect the conclusions. More specifically, we need to look for key points of failure and understand the ways in which failure can happen.

3. How do we create effective names?

One of the crowning achievements of biology is binomial nomenclature, the formal, systematic approach to naming established by the eighteenth-century physician Carolus Linnaeus (Morange 2016, 81). Each species is referred to by two words with Latin grammatical form: the first identifies its genus, and the second is an adjective that characterizes the species. As discussed in Chapter 11, names are a large source of friction, and a similarly generalized approach is needed in data science.

The reason this is so pressing is that it affects understanding, which in turn affects efficiency. The binomial nomenclature provides diagnostic information, not just a casual reference (Koerner 2000, 45). This is particularly the case when data science is conducted in a team, rather than by just one individual. A thorough understanding of what makes an effective name, and then infrastructure to encourage such names, would bring significant dividends.
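
To make the point concrete, here is a small, hypothetical R sketch contrasting opaque names with names that carry diagnostic information; the dataset, variable names, and threshold are invented purely for illustration.

```r
library(dplyr)

# A hypothetical dataset, used only to illustrate naming.
survey_respondents <- tibble(
  respondent_id = 1:5,
  age = c(17, 22, 34, 16, 45)
)

# Opaque names force a reader, or a teammate, to reconstruct intent.
d2 <- survey_respondents[survey_respondents$age >= 18, ]

# Names with a consistent structure carry diagnostic information,
# in the spirit of binomial nomenclature.
respondents_of_voting_age <- survey_respondents |>
  filter(age >= 18)
```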

4. What happened to the machine learning revolution?

It is now clear that the machine learning revolution was oversold, especially when it is called “AI” (Pretz 2021). Some of this stems from a disdain for data work, which has hamstrung AI applications (Sambasivan et al. 2021). For instance, there is evidence that misapplied AI may have caused unnecessary deaths during the pandemic, partly due to a lack of focus on the data (Heaven 2021).

Another aspect of this is that it is difficult to apply machine learning properly. For instance, in the social sciences we have yet to see a convincing application of machine learning methods, which are designed for prediction, to a problem where what we care about is understanding. Instead, we have seen machine learning methods misapplied (Kapoor and Narayanan 2022). And it may be that the AI researchers who have expertise in these methods are uninterested in the messy real world (Kerner 2020).

The current situation is untenable: researchers, especially those in fields that have historically been female, are considered inferior even when their results are no worse. And this means that any method that can conceivably be marketed as AI is being overfunded, at the expense of methods that may work better. This is not just the fault of academia, but also of industry, government, and the media (Bender 2022).

Machine learning is based on humans, not machines (Boykis 2019). This is true both of the underlying datasets, which require labeling, and of the implementation, which occurs within a politics and culture that impose certain values on machine learning research (Birhane et al. 2021). The statistical foundation of machine learning means it can never be fully automated without considerable danger of nonsensical results (Scicurious 2012).

5. What is the appropriate relationship between data science and its constituent parts?

We have described data science as having its origins in various disciplines. Moving forward, we need to consider what role these constituent parts, especially statistics and computer science, should play. More generally, we also need to establish how data science relates to, and interacts with, econometrics, applied statistics, and computational social science. These fields draw on data science to answer questions in their own disciplines but, like statistics and computer science, they also contribute back to data science.

While we have argued that data science is its own discipline, we must be careful to continue to learn statistics from statisticians, computer science from computer scientists, and so on. An example of the danger of not doing this is clear in the case of p-values, which we have not made much of in this book, but which mistakenly dominate quantitative analysis even though statisticians have warned about their misapplication for decades. One issue with not learning statistics from statisticians is that statistical practice can become a recipe that is blindly followed, because that is the easiest way to teach it, even though that is not how statisticians do statistics. We see this, too, in conversations about the related concept of statistical power, as though there were one appropriate level that must be met.

Data science must remain deeply connected to these other disciplines even as we develop our own. How we continue to ensure that data science takes on their best aspects, without also importing bad practice, is an especially significant challenge. And this is not just technical, but also cultural (Meng 2021). It is especially challenging to ensure that data science maintains an inclusive culture of excellence, even when some of the constituent parts lack it.

6. How do we teach data science?

We are beginning to have agreement on what the foundations of data science are: developing comfort with computational thinking, sampling, statistics, graphs, Git and GitHub, SQL, the command line, cleaning messy data, a few languages including R and Python, ethics, and writing. But we have very little agreement on how best to teach it. Partly this is because data science instructors often come from different fields, and partly it is a difference in resources and priorities.

Given the demand for data science skills, there is no point confining data science education to graduate students, because undergraduate students will be doing data science when they enter the workforce whether or not we train them. If data science is to be taught at the undergraduate level, then it needs to be robust enough to be taught in large classes. Developing teaching tools that scale is critical. For instance, GitHub Actions could be used to run checks of student code and suggest improvements without instructor involvement. However, it is especially difficult to scale the case-study-style classes that students find so useful.
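
As a rough sketch of what that might look like, a workflow file along the following lines could lint and test a student's R code on every push; the file name, the specific checks, and the package choices are assumptions for illustration rather than a prescribed setup.

```yaml
# Hypothetical workflow: .github/workflows/check-submission.yaml
on: push

jobs:
  check-submission:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Lint code style and run any tests in the submission
        shell: Rscript {0}
        run: |
          install.packages(c("lintr", "testthat"), repos = "https://cloud.r-project.org")
          print(lintr::lint_dir("."))
          if (dir.exists("tests")) testthat::test_dir("tests")
```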

7. What does the relationship between industry and academia look like?

Considerable innovation in data science occurs in industry, yet sometimes this knowledge cannot be shared, and when it can be, it tends to be shared slowly. The term “data science” has been used in academia since the 1960s, but it is because of industry that it has become popular (Irizarry 2020).

Lashing academia and industry together is both a key challenge for data science and one of the easiest to overlook. The problems faced in industry, for instance scoping the needs of a client and operating at scale, are so removed from typical academic concerns that academic research would quickly be rendered pointless unless academics in data science maintain one foot in industry. And from the industry side, ensuring that best practice is quickly adopted can be challenging if there is no immediate payoff. Ensuring that industry experience is valued in academic hiring and grant evaluation would help, as would encouraging entrepreneurship in academia.

18.3 Next steps

This book has covered much ground, and now that we are toward the end of it, it is worth recalling what the butler Stevens is told in the novel The Remains of the Day (Ishiguro 1989):

The evening’s the best part of the day. You’ve done your day’s work. Now you can put your feet up and enjoy it.

Chances are there are aspects that you want to explore further, building on the foundation that you have established. If so, then the book has accomplished its aim.

If you were new to data science at the start of this book, then the next step would be to backfill what was skipped over. Begin with Data Science: A First Introduction (Timbers, Campbell, and Lee 2022). After that, learn more about using R for data science by going through R for Data Science (Wickham and Grolemund 2022). To deepen your understanding of R itself, go next to Advanced R (Wickham 2019).

To learn more about causality, start with Causal Inference: The Mixtape (Cunningham 2021) and The Effect: An Introduction to Research Design and Causality (Huntington-Klein 2021).

If you are interested in learning more about statistics, then begin with Statistical Rethinking: A Bayesian Course with Examples in R and Stan (McElreath 2020a). Then backfill with Bayes Rules! An Introduction to Applied Bayesian Modeling (Johnson, Ott, and Dogucu 2022) and solidify the foundation with Bayesian Data Analysis (Gelman et al. 2014). It would also be worthwhile to bolster the probability foundations that we have skimmed over with All of Statistics (Wasserman 2005).

There is only one natural next step if you are interested in learning more about machine learning, and that is An Introduction to Statistical Learning (James et al. 2021), followed by The Elements of Statistical Learning (Friedman, Tibshirani, and Hastie 2009). After that, learn how to implement it all with Tidy Modeling with R (Kuhn and Silge 2022).

If you are interested in sampling, then the next book to turn to is Sampling: Design and Analysis (Lohr 2022). To deepen your understanding of surveys and experiments, go next to Field Experiments: Design, Analysis, and Interpretation (Gerber and Green 2012) in combination with Trustworthy Online Controlled Experiments (Kohavi, Tang, and Xu 2020).

To develop better data visualization skills, begin by turning to Data Sketches (Bremer and Wu 2021) and Data Visualization (Healy 2018), and after that build stronger foundations with The Grammar of Graphics (Wilkinson 2005).

And finally, for writing, it would be best to turn inward. Force yourself to write and publish every day for a month. Then do it again and again. You will get better. That said, there are some useful books, including Working (Caro 2019) and On Writing: A Memoir of the Craft (King 2000).

We often hear the phrase “let the data speak”. Hopefully by this point you understand that this never happens. All that we can do is acknowledge that we are the ones using data to tell stories, and strive to make those stories worthy.

It was her voice that made
The sky acutest at its vanishing.
She measured to the hour its solitude.
She was the single artificer of the world
In which she sang. And when she sang, the sea,
Whatever self it had, became the self
That was her song, for she was the maker.

“The Idea of Order at Key West” (Stevens 1934)