Online Appendix C — Datasets

One thing students often struggle with is picking a dataset. In general, it is better to avoid datasets from Kaggle, the UCI Machine Learning Repository, and other commonly used sources. From a data science perspective, using a dataset as it comes from such a source means that almost all the important decisions have already been made, and may be undocumented. And from a career perspective, it does not set your portfolio apart, because everyone else uses these datasets too. Some alternatives include:

AidData provides a large number of datasets related to research on development and foreign aid.
Alex Cookson’s datasets.
Andrews and Herzberg (2012) provide a variety of datasets.
APIs for social scientists provides a variety of APIs that could be used to gather data.
Bombieri et al. (2023) provide a dataset about more than 5,000 large carnivore attacks on humans.
The British Library’s catalogue of world newspapers contains information about the start and end years of publication, the places of publication, variant titles and editions, and the language of publication.
BuzzFeed News provides access to many datasets underpinning their articles.
The Canadian Municipal Elections Database contains complete municipal election results for municipalities across Canada (Lucas et al. 2020).
Congressindata provides datasets about US Congress Members from 2005 to 2015.
The Congress.gov API is a useful source of data about the US Congress, especially bills and other text data.
COVerAGE-DB is a global demographic database of COVID-19 cases and deaths (Riffe et al. 2021).
cricketdata (Hyndman et al. 2022) provides functions for downloading data about international and other major cricket matches.
The Data And Story Library provides access to hundreds of datasets.
Data Is Plural provides a weekly newsletter of interesting datasets with archives back to 2015.
The Data Liberation Project focuses on using FOI requests to build US government datasets.
The Demographic and Health Surveys (DHS) Program provides survey data for 90 countries beginning in 1984.
Duolingo provides access to datasets that underpin its research papers.
The Economist provides access to many datasets underpinning their articles.
EH.net provides a variety of interesting historical economic datasets.
The EPA provides occurrence data from the Unregulated Contaminant Monitoring Rule.
European NUTS-Level Election Database (EU-NED) provides national and European parliamentary election results from 1990 to 2020.
Federal Reserve Economic Data (FRED) provides US economic data, and there is an R package fredr (Boysel and Vaughan 2021) for accessing the API.
FiveThirtyEight provides access to many datasets underpinning their articles.
Goodreads Datasets are a scrape from 2017 of public data about more than two million books including meta-data and reviews (Wan and McAuley 2018; Wan et al. 2019).
The Historical Social Conflict Database provides data about more than 20,000 conflicts, largely focused on Europe (Chambru and Maneuvrier-Hervieu 2022).
Historical Statistics provides links to historical statistics.
Human Mortality Database provides detailed mortality and population data for a variety of countries.
ICANN’s Centralized Zone Data Service provides access to all domain names, after an application and approval process that can take a few days.
IPCC Data Distribution Centre.
The Irish Social Science Data Archive has a wide variety of datasets available.
J-PAL (Abdul Latif Jameel Poverty Action Lab) maintains a catalog of administrative data.
NFL Savant provides team-specific data about the NFL, including play-by-play data since 2013, combine data since 1999, and weather data.
The Markup’s Show Your Work series often includes links to GitHub repos with the data that underpin the article. One notable example is The Secret Bias Hidden in Mortgage-Approval Algorithms.
The Massachusetts Water Resources Authority makes its Wastewater COVID-19 Tracking data available, with the raw data provided as a PDF that could be parsed.
The Museum of Modern Art (MoMA) makes datasets about their collection and exhibitions available.
NASA’s Planetary Data System.
ProPublica Data Store provides a large number of datasets about the US, some of which are quite large. For instance, the Open Payments Data (2016) is 6 GB.
The Notable People dataset of Laouenan et al. (2022) provides a cross-verified database of notable people from 3500BC to 2018AD.
The OECD provides economic data.
The ParlEE dataset contains annotated full-text of millions of speeches in the EU legislative chambers (Sylvester et al. 2023).
The Prison Policy Initiative provides many datasets about US prisons and jails.
The Pudding makes many of the datasets underpinning their articles available. A few notable ones include: The Naked Truth, and The Evolution of the American Census.
The Pushshift Reddit Dataset is a collection of Reddit posts since 2015 (Baumgartner et al. 2020).
The Refugee Law Lab provides the full text of Supreme Court of Canada decisions in JSON format (Rehaag 2023).
The Rijksmuseum provides a variety of data about their collections.
The Socioeconomic High-resolution Rural-Urban Geographic Platform (SHRUG) is an open data platform that provides data about socioeconomic development across 600,000 villages and towns in India (Asher et al. 2021).
Tom Cardoso’s Bias behind bars provides data about Black and Indigenous inmates in Canada.
Tracking (In)Justice is a dataset that tracks police-involved deaths in Canada (Data and Justice Criminology Lab, Institute of Criminology and Criminal Justice, Carleton University; The Centre for Research & Innovation for Black Survivors of Homicide Victims (The CRIB), at the Factor-Inwentash Faculty of Social Work, University of Toronto; Canadian Civil Liberties Association; Ethics and Technology Lab, Queen’s University 2022).
The US Centers for Disease Control and Prevention (CDC) National Vital Statistics System provides a variety of datasets, including Linked Birth and Infant Death Data.
The United States Sentencing Commission Individual Offender Data Sets as cleaned and prepared by Kevin Wilson.
Women’s Activities in Armed Rebellion provides access to measures of women’s participation in rebel organizations between 1946 and 2015 (Loken and Matfess 2023).
The Washington Post provides access to many datasets underpinning their articles. Especially of interest may be congress slaveowners, fatal force shooting, school shootings, and Why FEMA is denying aid to Black disaster survivors in the Deep South.
The Wordbank database is an open database of children’s vocabulary growth. Access is additionally available using wordbankr (Braginsky 2020), and Alison Presmanes Hill provides useful background and cleaning code.
The World Bank provides an extensive range of global development data and a Microdata Library.
Yale’s International Center for Finance datasets: Historical Financial Research Data, and Stock Market Confidence Indices.

Andrews, David, and Agnes Herzberg. 2012. Data: A Collection of Problems from Many Fields for the Student and Research Worker. New York: Springer Science & Business Media.

Asher, Sam, Tobias Lunt, Ryu Matsuura, and Paul Novosad. 2021. “Development Research at High Geographic Resolution: An Analysis of Night Lights, Firms, and Poverty in India Using the SHRUG Open Data Platform.” World Bank Economic Review 35 (4). https://shrug-assets-ddl.s3.amazonaws.com/static/main/assets/other/almn-shrug.pdf.

Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. “The Pushshift Reddit Dataset.” arXiv. https://doi.org/10.48550/arxiv.2001.08435.

Bombieri, Giulia, Vincenzo Penteriani, Kamran Almasieh, Hüseyin Ambarlı, Mohammad Reza Ashrafzadeh, Chandan Surabhi Das, Nishith Dharaiya, et al. 2023. “A Worldwide Perspective on Large Carnivore Attacks on Humans.” PLOS Biology 21 (1): e3001946. https://doi.org/10.1371/journal.pbio.3001946.

Boysel, Sam, and Davis Vaughan. 2021. fredr: An R Client for the “FRED” API. https://CRAN.R-project.org/package=fredr.

Braginsky, Mika. 2020. wordbankr: Accessing the Wordbank Database. https://CRAN.R-project.org/package=wordbankr.

Chambru, Cédric, and Paul Maneuvrier-Hervieu. 2022. “Introducing HiSCoD: A new gateway for the study of historical social conflict.” Working Paper Series, Department of Economics, University of Zurich. https://doi.org/10.5167/uzh-217109.

Data and Justice Criminology Lab, Institute of Criminology and Criminal Justice, Carleton University; The Centre for Research & Innovation for Black Survivors of Homicide Victims (The CRIB), at the Factor-Inwentash Faculty of Social Work, University of Toronto; Canadian Civil Liberties Association; Ethics and Technology Lab, Queen’s University. 2022. “Tracking (in)justice: A Living Data Set Tracking Canadian Police-Involved Deaths.” https://trackinginjustice.ca.

Hyndman, Rob, Timothy Hyndman, Charles Gray, Sayani Gupta, and Jacquie Tran. 2022. cricketdata: International Cricket Data. https://CRAN.R-project.org/package=cricketdata.

Laouenan, Morgane, Palaash Bhargava, Jean-Benoît Eyméoud, Olivier Gergaud, Guillaume Plique, and Etienne Wasmer. 2022. “A Cross-Verified Database of Notable People, 3500BC–2018AD.” Scientific Data 9 (290). https://doi.org/10.1038/s41597-022-01369-4.

Loken, Meredith, and Hilary Matfess. 2023. “Introducing the Women’s Activities in Armed Rebellion (WAAR) Project, 1946-2015.” Journal of Peace Research.

Lucas, Jack, Reed Merrill, Kelly Blidook, Sandra Breux, Laura Conrad, Gabriel Eidelman, Royce Koop, et al. 2020. “Canadian Municipal Elections Database.” Scholars Portal Dataverse. https://doi.org/10.5683/sp2/4mzjpq.

Rehaag, Sean. 2023. “Supreme Court of Canada Bulk Decisions Dataset.” Refugee Law Laboratory. https://refugeelab.ca/bulk-data/scc.

Riffe, Tim, Enrique Acosta, Enrique José Acosta, Diego Manuel Aburto, Anna Alburez-Gutierrez, Ainhoa Altová, Ugofilippo Alustiza, et al. 2021. “Data Resource Profile: COVerAGE-DB: A Global Demographic Database of COVID-19 Cases and Deaths.” International Journal of Epidemiology 50 (2): 390–390f. https://doi.org/10.1093/ije/dyab027.

Sylvester, Christine, Anastasia Ershova, Aleksandra Khokhlova, Nikoleta Yordanova, and Zachary Greene. 2023. “ParlEE plenary speeches V2 data set: Annotated full-text of 15.1 million sentence-level plenary speeches of six EU legislative chambers.” Harvard Dataverse. https://doi.org/10.7910/DVN/VOPK0E.

Wan, Mengting, and Julian J. McAuley. 2018. “Item Recommendation on Monotonic Behavior Chains.” In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, edited by Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan, 86–94. ACM. https://doi.org/10.1145/3240323.3240369.

Wan, Mengting, Rishabh Misra, Ndapa Nakashole, and Julian J. McAuley. 2019. “Fine-Grained Spoiler Detection from Large-Scale Review Corpora.” In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, edited by Anna Korhonen, David R. Traum, and Lluís Màrquez, 2605–10. Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1248.