Online Appendix D — Datasets
One thing students often struggle with is picking a dataset. In general, it is better to stay away from datasets on Kaggle, the UCI Machine Learning Repository, and other commonly used sources. From a data science perspective, using a dataset as it is available from such a source means that almost all the important decisions have already been made, and are potentially undocumented. And from a career perspective, these datasets do not set your portfolio apart, because so many other people use them. Some alternatives include:
- AidData provides a large number of datasets related to research on development and foreign aid.
- Alex Cookson’s datasets.
- Andrews and Herzberg (2012) provide a variety of datasets, which are available online.
- APIs for social scientists provides a variety of APIs that could be used to gather data.
- Bombieri et al. (2023) provide a dataset about more than 5,000 large carnivore attacks on humans.
- The British Library’s catalogue of world newspapers contains information about the start and end years of publication, the places of publication, variant titles and editions, and the language of publication.
- BuzzFeed News provides access to many datasets underpinning their articles.
- The Canadian Municipal Elections Database contains complete municipal election results for municipalities across Canada (Lucas et al. 2020).
- Congressindata provides datasets about US Congress Members from 2005 to 2015.
- The Congress.gov API is a useful source of data about the US Congress, especially bills and other text data.
- COVerAGE-DB is a global demographic database of COVID-19 cases and deaths (Riffe et al. 2021).
- cricketdata (Hyndman et al. 2022) provides functions for downloading data about international and other major cricket matches.
- The Data And Story Library provides access to hundreds of datasets.
- Data Is Plural provides a weekly newsletter of interesting datasets with archives back to 2015.
- The Data Liberation Project focuses on using FOI requests to build US government datasets.
- The Demographic and Health Surveys (DHS) Program provides survey data for 90 countries beginning in 1984.
- Duolingo provides access to datasets that underpin its research papers.
- The Economist provides access to many datasets underpinning their articles.
- EH.net provides a variety of interesting historical economic datasets.
- The EPA provides occurrence data from the Unregulated Contaminant Monitoring Rule.
- European NUTS-Level Election Database (EU-NED) provides national and European parliamentary election results from 1990 to 2020.
- Federal Reserve Economic Data (FRED) provides US economic data, and there is an R package fredr (Boysel and Vaughan 2021) for accessing the API.
- FiveThirtyEight provides access to many datasets underpinning their articles.
- Goodreads Datasets are a scrape from 2017 of public data about more than two million books including meta-data and reviews (Wan and McAuley 2018; Wan et al. 2019).
- Historical Social Conflict Database provides data about more than 20,000 conflicts, largely focused on Europe (Chambru and Maneuvrier-Hervieu 2022).
- Historical Statistics provides links to historical statistics.
- Human Mortality Database provides detailed mortality and population data for a variety of countries.
- ICANN’s Centralized Zone Data Service provides access to all domain names, after an application and approval process that can take a few days.
- IPCC Data Distribution Centre.
- The Irish Social Science Data Archive has a wide variety of datasets available.
- J-PAL (Abdul Latif Jameel Poverty Action Lab) maintains a catalog of administrative data.
- NFL Savant provides team-specific data about the NFL, including play-by-play data since 2013, combine data since 1999, and weather data.
- The Markup’s Show Your Work series often includes links to GitHub repos with the data that underpin the article. A notable example is The Secret Bias Hidden in Mortgage-Approval Algorithms.
- The Massachusetts Water Resources Authority makes its Wastewater COVID-19 Tracking data available online, with the raw data available in a PDF that could be parsed.
- The Museum of Modern Art (MoMA) makes datasets about their collection and exhibitions available.
- NASA’s Planetary Data System.
- ProPublica Data Store provides an extensive number of datasets about the US, some of which are quite large. For instance, the Open Payments Data (2016) is 6 GB.
- The Notable People dataset of Laouenan et al. (2022) provides a cross-verified database of notable people from 3500 BC to 2018 AD.
- The OECD provides economic data.
- The ParlEE dataset contains annotated full-text of millions of speeches in the EU legislative chambers (Sylvester et al. 2023).
- The Prison Policy Initiative provides many datasets about US prisons and jails.
- The Pudding makes many of the datasets underpinning their articles available. A few notable ones include: The Naked Truth, and The Evolution of the American Census.
- The Pushshift Reddit Dataset is a collection of Reddit posts since 2015 (Baumgartner et al. 2020).
- The Refugee Law Lab provides the full text of Supreme Court of Canada decisions in JSON format (Rehaag 2023).
- The Rijksmuseum provides a variety of data about their collections.
- The Socioeconomic High-resolution Rural-Urban Geographic Platform (SHRUG) is an open data platform that provides data about socioeconomic development across 600,000 villages and towns in India (Asher et al. 2021).
- Tom Cardoso’s Bias behind bars provides data about Black and Indigenous inmates in Canada.
- Tracking (In)Justice is a dataset that tracks police-involved deaths in Canada (Data and Justice Criminology Lab, Institute of Criminology and Criminal Justice, Carleton University; The Centre for Research & Innovation for Black Survivors of Homicide Victims (The CRIB), at the Factor-Inwentash Faculty of Social Work, University of Toronto; Canadian Civil Liberties Association; Ethics and Technology Lab, Queen’s University 2022).
- The US Centers for Disease Control and Prevention (CDC) National Vital Statistics System provides a variety of datasets, including Linked Birth and Infant Death Data.
- The United States Sentencing Commission Individual Offender Data Sets as cleaned and prepared by Kevin Wilson.
- Women’s Activities in Armed Rebellion provides access to measures of women’s participation in rebel organizations between 1946 and 2015 (Loken and Matfess 2023).
- The Washington Post provides access to many datasets underpinning their articles. Especially of interest may be congress slaveowners, fatal force shooting, school shootings, and Why FEMA is denying aid to Black disaster survivors in the Deep South.
- The Wordbank database is an open database of children’s vocabulary growth. Access is additionally available using wordbankr (Braginsky 2020), and Alison Presmanes Hill provides useful background and cleaning code.
- The World Bank provides an extensive range of global development data and a Microdata Library.
- Yale’s International Center for Finance datasets: Historical Financial Research Data, and Stock Market Confidence Indices.