Appendix B — Datasets
In general, it is better to stay away from datasets on Kaggle, the UCI Machine Learning Repository, and other commonly used sources. From a data science perspective, using a dataset as it is provided by such a source means that almost all the important decisions have already been made, and they may be undocumented. And from a career perspective, it does not set your portfolio apart, because so many other people use the same datasets. Some alternatives include:
- Alex Cookson’s datasets.
- Andrews and Herzberg (2012) provides a variety of datasets, which are available here.
- APIs for social scientists provides a variety of APIs that could be used to gather data.
- BuzzFeed News provides access to many datasets underpinning their articles.
- The Data And Story Library provides access to hundreds of datasets.
- The Demographic and Health Surveys (DHS) Program provides survey data for 90 countries beginning in 1984.
- Duolingo provides access to datasets that underpin its research papers.
- The Economist provides access to many datasets underpinning their articles.
- Federal Reserve Economic Data (FRED) provides economic data.
- FiveThirtyEight provides access to many datasets underpinning their articles.
- Historical Statistics provides links to historical statistics.
- Human Mortality Database provides detailed mortality and population data for a variety of countries.
- IPCC Data Distribution Centre.
- The Irish Social Science Data Archive has a wide variety of datasets available.
- The J-PAL (Abdul Latif Jameel Poverty Action Lab) catalog of administrative data.
- The Markup’s Show Your Work series often includes links to GitHub repos with the data that underpin each article. One notable example is The Secret Bias Hidden in Mortgage-Approval Algorithms.
- The Massachusetts Water Resources Authority makes its Wastewater COVID-19 Tracking data available here, with the raw data available in a PDF that could be parsed.
- Microsoft Research Open Data has a large number of datasets across computer science, social science, information science, and other categories.
- The Museum of Modern Art (MoMA) makes datasets about their collection and exhibitions available.
- NASA’s Planetary Data System.
- The OECD provides economic data.
- The Prison Policy Initiative provides many datasets about US prisons and jails.
- The Rijksmuseum provides a variety of data about their collections.
- Tom Cardoso’s Bias behind bars provides data about Black and Indigenous inmates in Canada.
- The Washington Post provides access to many datasets underpinning their articles. Of particular interest may be congress slaveowners, fatal force shooting, school shootings, and Why FEMA is denying aid to Black disaster survivors in the Deep South.
- The Wordbank database is an open database of children’s vocabulary growth. Access is additionally available using the wordbankr R package (Braginsky 2020), and Alison Presmanes Hill provides useful background and cleaning code.
- The World Bank provides an extensive range of global development data and a Microdata Library.
- Yale’s International Center for Finance datasets: Historical Financial Research Data, and Stock Market Confidence Indices.