Appendix B — Datasets
In general, it is better to stay away from datasets on Kaggle, the UCI Machine Learning Repository, and other commonly used options. And for the papers in Appendix D, you must not use a dataset from either of those sources. From a data science perspective, using a dataset as it is available from such a source means that almost all the important decisions have already been made, and are potentially undocumented. And from a career perspective, it does not set your portfolio apart because so many other people use the same datasets. Some alternatives include:
- AidData provides a large number of datasets related to research on development and foreign aid.
- Alex Cookson’s datasets.
- Andrews and Herzberg (2012) provide a variety of datasets, which are available here.
- APIs for social scientists provides a variety of APIs that could be used to gather data.
- British Library’s catalogue of world newspapers contains information about the start and end years of publication, the places of publication, variant titles and editions, and the language of publication.
- BuzzFeed News provides access to many datasets underpinning their articles.
- The Canadian Municipal Elections Database contains complete municipal election results for municipalities across Canada (Lucas et al. 2020).
- Congressindata provides datasets about US Congress Members from 2005 to 2015.
- The Congress.gov API is an especially useful source of data about the US Congress, particularly bills and other text data.
- COVerAGE-DB is a global demographic database of COVID-19 cases and deaths (Riffe et al. 2021).
- cricketdata (Hyndman et al. 2022) provides functions for downloading data about international and other major cricket matches.
- The Data And Story Library provides access to hundreds of datasets.
- Data Is Plural provides a weekly newsletter of interesting datasets with archives back to 2015.
- The Demographic and Health Surveys (DHS) Program provides survey data for 90 countries beginning in 1984.
- Duolingo provides access to datasets that underpin its research papers.
- The Economist provides access to many datasets underpinning their articles.
- EH.net provides a variety of interesting historical economic datasets.
- European NUTS-Level Election Database (EU-NED) provides national and European parliamentary election results from 1990 to 2020.
- Federal Reserve Economic Data (FRED) provides US economic data, and there is an R package, fredr (Boysel and Vaughan 2021), for accessing the API.
- FiveThirtyEight provides access to many datasets underpinning their articles.
- Goodreads Datasets are a scrape from 2017 of public data about more than two million books including meta-data and reviews (Wan and McAuley 2018; Wan et al. 2019).
- Historical Social Conflict Database provides data about more than 20,000 conflicts, largely focused on Europe (Cédric and Maneuvrier-Hervieu 2022).
- Historical Statistics provides links to historical statistics.
- Human Mortality Database provides detailed mortality and population data for a variety of countries.
- ICANN’s Centralized Zone Data Service provides access to all domain names, after an application and approval process that can take a few days.
- IPCC Data Distribution Centre.
- The Irish Social Science Data Archive has a wide variety of datasets available.
- J-PAL (Abdul Latif Jameel Poverty Action Lab) maintains a catalog of administrative data.
- NFL Savant provides team-specific data about the NFL, including play-by-play data since 2013, combine data since 1999, and weather data.
- The Markup’s Show Your Work series often includes links to GitHub repos with the data that underpin the articles. A notable one is The Secret Bias Hidden in Mortgage-Approval Algorithms.
- The Massachusetts Water Resources Authority makes its Wastewater COVID-19 Tracking data available, with the raw data provided in a PDF that could be parsed.
- Microsoft Research Open Data has many datasets across computer science, social science, information science, and other categories.
- The Museum of Modern Art (MoMA) makes datasets about their collection and exhibitions available.
- NASA’s Planetary Data System.
- ProPublica Data Store provides an extensive number of datasets about the US, some of which are quite large. For instance, the Open Payments Data (2016) is 6 GB.
- The Notable People dataset of Laouenan et al. (2022) provides a cross-verified database of notable people from 3500BC to 2018AD.
- The OECD provides economic data.
- The Prison Policy Initiative provides many datasets about US prisons and jails.
- The Pudding makes many of the datasets underpinning their articles available. A few notable ones include: The Naked Truth, and The Evolution of the American Census.
- The Pushshift Reddit Dataset is a collection of Reddit posts since 2015 (Baumgartner et al. 2020).
- The Rijksmuseum provides a variety of data about their collections.
- Tom Cardoso’s Bias behind bars provides data about Black and Indigenous inmates in Canada.
- The US Centers for Disease Control and Prevention (CDC) National Vital Statistics System provides a variety of datasets, including Linked Birth and Infant Death Data.
- The United States Sentencing Commission Individual Offender Data Sets as cleaned and prepared by Kevin Wilson.
- Women’s Activities in Armed Rebellion provides access to measures of women’s participation in rebel organizations between 1946 and 2015 (Loken and Matfess 2023).
- The Washington Post provides access to many datasets underpinning their articles. Especially of interest may be congress slaveowners, fatal force shooting, school shootings, and Why FEMA is denying aid to Black disaster survivors in the Deep South.
- The Wordbank database is an open database of children’s vocabulary growth. Access is additionally available using wordbankr (Braginsky 2020), and Alison Presmanes Hill provides useful background and cleaning code.
- The World Bank provides an extensive range of global development data and a Microdata Library.
- Yale’s International Center for Finance datasets: Historical Financial Research Data, and Stock Market Confidence Indices.