Online Appendix E — Papers

One way to build understanding of material is by using it. The purpose of these papers is to give you a chance to implement what you have learnt in a real-world setting. Completing the papers is also important from the perspective of building a portfolio for job applications.

Expectations change from year to year so please treat the “previous examples” as examples rather than templates.

E.1 Donaldson Paper

E.1.1 Task

  • Working individually and in an entirely reproducible way, please find a dataset of interest on Open Data Toronto and write a short paper telling a story about the data.
    • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this starter folder.
    • Find a dataset of interest on Open Data Toronto.
      • Put together an R script, “scripts/00-simulate_data.R”, that simulates the dataset of interest and develops some tests. Push to GitHub and include an informative commit message.
      • Write an R script, “scripts/01-download_data.R”, to download the actual data in a reproducible way using opendatatoronto (Gelfand 2022). Save the data: “data/raw_data/unedited_data.csv” (use a meaningful name and appropriate file type). Push to GitHub and include an informative commit message.
    • Prepare a PDF using Quarto “paper/paper.qmd” with these sections: title, author, date, abstract, introduction, data, and references.
      • The title should be descriptive, informative, and specific.
      • The date should be in an unambiguous format. Add a link to the GitHub repo in the acknowledgments.
      • The abstract should be three or four sentences. The abstract must tell the reader the top-level finding. What is the one thing that we learn about the world because of this paper?
      • The introduction should be two or three paragraphs of content. And there should be an additional final paragraph that sets out the remainder of the paper.
      • The data section should thoroughly and precisely discuss the source of the data and the broader context that gave rise to it (ethical, statistical, and otherwise). Comprehensively describe and summarize the data using text, graphs, and tables. Graphs must be made with ggplot2 (Wickham 2016) and tables must be made with knitr (Xie 2023) or gt (Iannone et al. 2022). Graphs must show the actual data, or as close to it as possible, not summary statistics. Graphs and tables should be cross-referenced in the text e.g. “Table 1 shows…”.
      • References should be added using BibTeX. Be sure to reference R, and any R packages you use, as well as the dataset. Strong submissions will draw on related literature and reference those.
      • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
      • Use appendices for supporting, but not critical, material.
      • Push to GitHub and include an informative commit message.
  • Submit a link to your GitHub repo.
  • There should be no evidence that this is a class assignment.
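As a rough illustration of the simulation step in the task above, the following is a minimal sketch of what “scripts/00-simulate_data.R” might contain. The variable names (`date`, `ward`, `number_of_requests`), their ranges, and the seed value are all assumptions invented for this example; they should be replaced with whatever matches the dataset you actually choose.

```r
#### Preamble ####
# Purpose: Simulate a hypothetical dataset and test it (illustrative sketch only)
# Assumption: all variable names and values below are invented for this example

set.seed(853) # A fixed seed makes the simulation reproducible

number_of_rows <- 100

simulated_data <- data.frame(
  # A random day in 2023 for each row
  date = sample(
    seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day"),
    size = number_of_rows,
    replace = TRUE
  ),
  # A hypothetical ward identifier between 1 and 25
  ward = sample(1:25, size = number_of_rows, replace = TRUE),
  # A hypothetical count, drawn from a Poisson distribution
  number_of_requests = rpois(n = number_of_rows, lambda = 10)
)

#### Tests ####
# stopifnot() halts the script with an error if any condition is not TRUE
stopifnot(
  nrow(simulated_data) == number_of_rows,
  !any(is.na(simulated_data)),
  inherits(simulated_data$date, "Date"),
  all(simulated_data$ward %in% 1:25),
  all(simulated_data$number_of_requests >= 0)
)
```

The same tests could later be run against the downloaded data, which is one reason to write them at the simulation stage.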

E.1.2 Checks

  • There should be no R code or raw R output in the final PDF.
  • An example statement for the README on LLM usage that you could base yours on is: “Statement on LLM usage: Aspects of the code were written with the help of the autocomplete tool, Codriver. The abstract and introduction were written with the help of ChatHorse and the entire chat history is available in other/llm/usage.txt.”
  • Code should be entirely reproducible, well-documented, commented, and readable.
  • The paper should render directly to PDF i.e. use “Render to PDF”.
    • Do not use “Render to html” and then save as a PDF.
    • Do not use “Render to Word” and then save as a PDF.
  • Graphs, tables, and text should be clear, and of comparable quality to those of the Financial Times.
  • The date should be up-to-date and unambiguous (e.g. 2/3/2024 is ambiguous, 2 March 2024 is not).
  • The entire workflow should be entirely reproducible.
  • There should not be any typos.
  • There should be no sign this is a school paper.
  • There must be a link to the paper’s GitHub repo using a footnote.
  • The GitHub repo should be well-organized, and contain an informative README.
  • The paper should be well-written and able to be understood by the average reader of, say, the Financial Times. This means that you are allowed to use mathematical notation, but you must explain all of it in plain language. All statistical concepts and terminology must be explained. Your reader is someone with a university education, but not necessarily someone who understands what a p-value is.
  • Avoid titles with puns when the topic is serious because the reader will think you are not taking the topic seriously. (You can break this rule once you get experience.)
  • Abstracts need to be “tightly written”, almost terse. Remove unnecessary words. Do not include more than four sentences. (You can break this rule once you get experience.)
  • Introduction needs paragraphs (leave a space between lines in the Quarto Document).
  • In the introduction, please telegraph the rest of the paper: “Section 2…, Section 3….”. (You can break this rule once you get experience.)
  • Don’t use words like “advanced”, “apt”, “beg the question”, “drives forward”, “crucial”, “elucidate”, “elucidating”, “embark”, “embarks”, “exploration”, “fresh perspective”, “fresh perspectives”, “insights from”, “insights”, “interrogate”, “intricate”, “intriguing”, “key insights”, “kind of”, “meticulously”, “multifaceted”, “offering crucial insight”, “offers crucial insight”, “plummeted”, “rapidly”, “reveals”, “shed light”, “sheds light”, “shocking”, “soared”, “unveiling” and other imprecise words.
  • Please don’t read the data from their server into the Quarto Document, read the saved version. Submissions that do this receive 0 overall.
  • In the introduction, please be more specific about your findings.
  • The data section is not about data cleaning, it is about the data. Put data cleaning into an appendix. Unless there is something critical, do not discuss data cleaning in the data section.
  • Simulation needs a seed.
  • Do not call the repo “Paper 1” or similar.
  • Do not have sections called “graphs” or “tables” or similar.
  • Use usethis::git_vaccinate() to get a better gitignore file, and specifically to ignore .DS_Store.
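The check on unambiguous dates can be sketched in base R. This is one possible approach under stated assumptions, not a required one: the object names are hypothetical, and a fixed date is used so that the example always gives the same result (in a paper, `Sys.Date()` could be used instead).

```r
# A fixed date keeps this example reproducible; Sys.Date() would give today's date
paper_date <- as.Date("2024-03-02")

# ISO 8601 (year-month-day) is unambiguous
iso_date <- format(paper_date, "%Y-%m-%d")

# Day, month name, year. month.name is a built-in vector of English month
# names, so the result does not depend on the computer's locale
written_date <- paste(
  as.integer(format(paper_date, "%d")),
  month.name[as.integer(format(paper_date, "%m"))],
  format(paper_date, "%Y")
)

print(iso_date) # "2024-03-02"
print(written_date) # "2 March 2024"
```

Either form avoids the day/month ambiguity of something like 2/3/2024.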

E.1.3 FAQ

  • Can I use a dataset from Kaggle instead? No, because they have done the hard work for you.
  • I cannot use code to download my dataset, can I just manually download it? No, because your entire workflow needs to be reproducible. Please fix the download problem or pick a different dataset.
  • How much should I write? Most students submit something in the two-to-six-page range, but it is up to you. Be precise and thorough.
  • My data is about apartment blocks/NBA/League of Legends so there’s no broader context, what do I do? Please re-read the relevant chapter and readings to better understand bias and ethics. If you really cannot think of something, then it might be worth picking a different dataset.
  • Can I use Python? No. If you already know Python then it does not hurt to learn another language.
  • Why do I need to cite R, when I don’t need to cite Word? R is a free statistical programming language with academic origins, so it is appropriate to acknowledge the work of others. It is also important for reproducibility.
  • What reference style should I use? Any major reference style is fine (APA, Harvard, Chicago, etc); just pick one that you are used to.
  • The paper in the starter folder has a model section, so do I need to put together a model? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • The paper in the starter folder has a data sheets appendix, so do I need to put together a data sheet? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • What does “graph the actual data” mean? If you have, say 5,000 observations in the dataset and three variables, then for every variable there should be a graph that has 5,000 points in the case of dots, or adds up to 5,000 in the case of bar charts and histograms.
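The last answer above can be checked in code. The following is a minimal sketch, assuming a hypothetical dataset of 5,000 observations of one invented variable, `value`. It builds a histogram with ggplot2 and then confirms that the bin counts add up to the number of observations.

```r
library(ggplot2)

set.seed(853)

# Hypothetical data: 5,000 observations of one invented variable
example_data <- data.frame(value = rnorm(n = 5000, mean = 50, sd = 10))

# A histogram includes every observation, grouped into bins
histogram <- ggplot(example_data, aes(x = value)) +
  geom_histogram(bins = 30) +
  theme_minimal() +
  labs(x = "Value", y = "Number of observations")

# layer_data() returns the data computed for a layer of the plot;
# the bin counts should sum to the number of rows in the dataset
bin_counts <- layer_data(histogram, i = 1)$count
stopifnot(sum(bin_counts) == nrow(example_data))
```

A bar chart of group means, by contrast, would fail this kind of check because it shows one summary value per group rather than the observations themselves.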

E.1.4 Rubric

Component | Range | Requirement
R is appropriately cited | 0 - 'No'; 1 - 'Yes' | Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
LLM usage is documented | 0 - 'No'; 1 - 'Yes' | A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as Chat-GPT4 were used, then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement | 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' | A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomena in the world that happened to an entry in the dataset that you are interested in.
Prose | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' | All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references | 0 - 'Poor or not done'; 1 - 'Yes' | All figures, tables, and equations should be numbered and referred to in the text using cross-references.
Captions | 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' | All figures and tables have detailed and meaningful captions.
Graphs/tables/etc | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits | 0 - 'Poor or not done'; 2 - 'Excellent' | There are at least a handful of different commits, and they have meaningful commit messages.
Sketches | 0 - 'Poor or not done'; 2 - 'Exceptional' | Sketches are included in a labelled folder of the repo, and are appropriate and of high quality.
Simulation | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated.
Tests | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Data and code tests are appropriately used.
Reproducibility | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps, including to appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style | 0 - 'Poor or not done'; 1 - 'Exceptional' | Code is appropriately styled using styler or lintr.
General excellence | 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' | There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.1.5 Previous examples

E.2 Mawson Paper

E.2.1 Task

  • Working as part of a team of one to three people, please pick a paper of interest to you, with code and data that are available from:

    1. A paper published anytime since 2019, in an American Economic Association journal. These journals are: “American Economic Review”, “AER: Insights”, “AEJ: Applied Economics”, “AEJ: Economic Policy”, “AEJ: Macroeconomics”, “AEJ: Microeconomics”, “Journal of Economic Literature”, “Journal of Economic Perspectives”, “AEA Papers & Proceedings”.
    2. Any article from the Institute for Replication list available here that has a replicability status of “Looking for replicator”.
    3. One of Gilad Feldman’s papers.1
  • Following the Guide for Accelerating Computational Reproducibility in the Social Sciences, please complete a replication2 of at least three graphs, tables, or a combination, from that paper, using the Social Science Reproduction Platform. Note the DOI of your replication.

  • Then, working in an entirely reproducible way, conduct a reproduction based on two or three aspects of the paper, and write a short paper about that.

    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you should use this starter folder): title, author, date, abstract, introduction, data, results, discussion, and references.
    • The aspects that you focus on in your paper could be the same aspects that you replicated, but they do not need to be. Follow the direction of the paper, but make it your own. That means you should ask a slightly different question, or answer the same question in a slightly different way, but still use the same dataset.
    • Include the DOI of your replication in your paper and a link to the GitHub repo that underpins your paper.
    • The results section should convey findings.
    • The discussion should include three or four sub-sections that each focus on an interesting point, and there should be another sub-section on the weaknesses of your paper, and another on potential next steps for it.
    • In the discussion section, and any other relevant section, please be sure to discuss ethics and bias, with reference to relevant literature.
    • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
    • Use appendices for supporting, but not critical, material.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a PDF of your paper.

  • There should be no evidence that this is a class assignment.

E.2.2 Checks

  • The paper should not just copy/paste the code from the original paper, but should instead use that code as a foundation to work from.
  • Your paper should have a link to the associated GitHub repository and the DOI of the Social Science Reproduction Platform replication that you conducted.
  • Make sure you have referenced everything, including R. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, provided it is consistent.

E.2.3 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.
  • Do I have to focus on a model result? No, it is likely best to stay away from that at this point, and instead focus on tables or graphs of summary or explanatory statistics.
  • What if the paper I choose is in a language other than R? Both your replication and reproduction code should be in R. So you will need to translate the code into R for the replication. And the reproduction should be your own work, so that also should be in R. One common language is Stata, and Huntington-Klein (2022) might be useful as a “Rosetta Stone” of sorts, for R, Python, and Stata, or just use an LLM to help.
  • Can I work by myself? Yes.
  • Do the graphs/tables have to look identical to the original? No, you are welcome to, and should, make them look better as part of the reproduction. And even as part of the replication, they do not have to be identical, just similar enough.
  • One of my graphs has four panels, do I have to do all of them for this to be counted as one element? No, for the purpose of this paper, every panel counts as a separate element, so all you would need to do is three panels and that would be enough.
  • How do I automatically download the data if they are behind a sign-in? If the data are behind a sign-in, then instead of code, just add commented documentation to the download_data.R file explaining how to download them.
  • Do we need to commit our original, unedited data to GitHub if it is really big? No, you do not necessarily need to commit the original, unedited data to GitHub if it is too large; just add a note to the README explaining the situation and how to obtain the data.
  • What should the abstract and introduction be about? The abstract and introduction should reflect your own work and findings, rather than those of the original paper (even though those will necessarily play some role). You are (almost surely) not replicating their entire paper, and so your abstract should be different. See the examples for guidance.

E.2.4 Rubric

Component | Range | Requirement
R is appropriately cited | 0 - 'No'; 1 - 'Yes' | Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper | 0 - 'No'; 1 - 'Yes' | Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented | 0 - 'No'; 1 - 'Yes' | A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as Chat-GPT4 were used, then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Replication | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | SSRP submission needs to be filled out completely for three elements.
Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand | 0 - 'Poor or not done'; 1 - 'Done' | The estimand is clearly stated in the introduction.
Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement | 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' | A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomena in the world that happened to an entry in the dataset that you are interested in.
Results | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' | Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' | All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references | 0 - 'Poor or not done'; 1 - 'Yes' | All figures, tables, and equations should be numbered and referred to in the text using cross-references.
Captions | 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' | All figures and tables have detailed and meaningful captions.
Graphs/tables/etc | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits | 0 - 'Poor or not done'; 2 - 'Excellent' | There are at least a handful of different commits, and they have meaningful commit messages.
Sketches | 0 - 'Poor or not done'; 2 - 'Exceptional' | Sketches are included in a labelled folder of the repo, and are appropriate and of high quality.
Simulation | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated.
Tests | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | Data and code tests are appropriately used.
Reproducibility | 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' | The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps, including to appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style | 0 - 'Poor or not done'; 1 - 'Exceptional' | Code is appropriately styled using styler or lintr.
General excellence | 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' | There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.2.5 Previous examples

E.3 Howrah Paper

E.3.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please obtain data from the US General Social Survey3. (You are welcome to use a different government-run survey, but please obtain permission before starting.)
  • Obtain the data, focus on one aspect of the survey, and then use it to tell a story.
    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then use Quarto to prepare a PDF with these sections (you should use this starter folder): title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a survey, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include, but not be limited to:
      • A discussion of the survey’s methodology, and its key features, strengths, and weaknesses. For instance: what is the population, frame, and sample; how is the sample recruited; what sampling approach is taken, and what are some of the trade-offs of this; how is non-response handled.
      • A discussion of the questionnaire: what is good and bad about it?
      • If this becomes too detailed, then use appendices for supporting but not essential aspects.
    • In an appendix, please put together a supplementary survey that could be used to augment the general social survey the paper focuses on. The purpose of the supplementary survey is to gain additional information on the topic that is the focus of the paper, beyond that gathered by the general social survey. The survey would be distributed in the same manner as the general social survey but needs to stand independently. The supplementary survey should be put together using a survey platform. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
    • Please be sure to discuss ethics and bias, with reference to relevant literature.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a link to the GitHub repo.
  • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at a university-educated, but non-specialist, audience. Use survey, sampling, and statistical terminology, but be sure to explain it. The paper should flow, and be easy to follow and understand.
  • There should be no evidence that this is a class paper.

E.3.2 Checks

  • An appendix should contain both a link to the supplementary survey and the details of it, including questions (in case the link fails, and to make the paper self-contained).

E.3.3 FAQ

  • What should I focus on? You may focus on any year, aspect, or geography that is reasonable given the focus and constraints of the general social survey that you are interested in. Please consider the year and topics that you are interested in together, as some surveys focus on particular topics in some years.
  • Do I need to include the raw GSS data in the repo? For most of the general social surveys you will not have permission to share the GSS data. If that is the case, then you should add clear details in the README explaining how the data could be obtained.
  • How many graphs do I need? In general, you need at least as many graphs as you have variables, because you need to show all the observations for all variables. But you may be able to combine a few; or, vice versa, you may be interested in looking at different aspects or relationships.

E.3.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations, should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Idealized survey 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The survey should have an introductory section and include the details of a contact person. The survey questions should be well constructed and appropriate to the task. The questions should have an appropriate ordering. A final section should thank the respondent.
Pollster methodology overview and evaluation 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The deep dive provides a thorough understanding of how something goes from being a person's opinion to part of a result for this pollster. Please provide a thorough overview and evaluation of the pollster’s methodology, highlighting both its strengths and limitations.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
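
As a minimal sketch of what the Reproducibility row asks for, the following R script opens with a preamble, sets a seed, and builds a file path relative to the project root rather than calling setwd(). The preamble fields, the seed value, and the folder and file names are assumptions chosen for illustration, not a required layout.

```r
#### Preamble ####
# Purpose: Simulates a hypothetical dataset (illustrative only)
# Author: [name]
# Date: [unambiguous date, e.g. 2024-01-31]
# Contact: [email]
# Pre-requisites: none

# Setting the same seed before each draw makes the random numbers repeatable
set.seed(853)
first_draw <- rnorm(5)
set.seed(853)
second_draw <- rnorm(5)
identical(first_draw, second_draw) # TRUE

# A path relative to the project root (hypothetical folder and file names),
# so the script does not depend on setwd() or one person's computer
output_path <- file.path("data", "simulated_data", "simulated_data.csv")
```

Because the path is relative, the script should run unchanged for anyone who opens the R project, which is the point of avoiding setwd() and hard-coded paths.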

E.3.5 Previous examples

E.4 Dysart Paper

E.4.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please convert at least one full-page table from one DHS Program “Final Report”, from the 1980s or 1990s, as available here, into a usable dataset, then write a short paper telling a story with the data.
  • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this starter folder.
  • Create and document a dataset:
    • Save the PDF to “inputs”.
    • Put together a simulation of your plan for the usable dataset and save the script to “scripts/00-simulation.R”.
    • Write R code, saved as “scripts/01-gather_data.R”, to either OCR or parse the PDF, as appropriate, and save the output to “outputs/data/first_parse.csv”.
    • Write R code, saved as “scripts/02-clean_and_prepare_data.R”, that draws on “first_parse.csv” to clean and prepare the dataset. Use pointblank to put together tests that the dataset passes (at a minimum, every variable should have a test for class and another for content). Save the dataset to “outputs/data/cleaned_data.parquet”.
    • Following Gebru et al. (2021), put together a datasheet for the dataset you constructed (put this in the appendix of your paper). You are welcome to start from the template “inputs/data/datasheet_template.qmd” in the starter folder, although, again, you should add it to the appendix of your paper, rather than leave it as a stand-alone document.
  • Use the dataset to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a datasheet for the dataset, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include details of the methodology used by the DHS survey that you drew on, and its key features, strengths, and weaknesses.
  • Submit a link to the GitHub repo.
  • There should be no evidence that this is a class paper.
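
The simulation and testing steps above can be sketched as follows. This is only an illustration under stated assumptions: the variable names, categories, and ranges are hypothetical stand-ins for whatever table you choose, and the tests use base R stopifnot() as a simple substitute for pointblank, which the task asks for in the cleaning script.

```r
# Hypothetical sketch: simulate a small DHS-style table, then test it.
# All variable names, categories, and ranges are assumptions for illustration.
set.seed(853)

regions <- c("North", "South", "East", "West")
age_groups <- c("15-19", "20-24", "25-29")

simulated_data <- data.frame(
  region = sample(regions, size = 20, replace = TRUE),
  age_group = sample(age_groups, size = 20, replace = TRUE),
  percent_using = round(runif(n = 20, min = 0, max = 100), 1),
  stringsAsFactors = FALSE
)

# Class tests: one per variable
stopifnot(
  is.character(simulated_data$region),
  is.character(simulated_data$age_group),
  is.numeric(simulated_data$percent_using)
)

# Content tests: one per variable
stopifnot(
  all(simulated_data$region %in% regions),
  all(simulated_data$age_group %in% age_groups),
  all(simulated_data$percent_using >= 0 & simulated_data$percent_using <= 100)
)
```

The same class and content checks can then be applied to the cleaned dataset before it is saved, for instance with arrow::write_parquet(), so that the real data are held to the expectations written down during simulation.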

E.4.2 Checks

  • Use GitHub in a well-developed way by making at least a few commits and using descriptive commit messages.

E.4.3 FAQ

E.4.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations, should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Not done'; 1 - 'Done' The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
Datasheet 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A thorough datasheet for the dataset that was constructed is included.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.4.5 Previous examples

E.5 Spadina Paper

E.5.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model and then write a short paper telling a story. Some ideas for aspects you could tackle include:
    • Revisit the dataset that you used in Section E.1. Build a linear model for one of the variables, and consider the results.
    • Pick one of the examples in Chapter 13, change the situation slightly, and then build a generalized linear model.
  • You should use this starter folder.
  • Submit a link to the GitHub repo.
  • There should be no evidence that this is a class paper.
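
To make the modeling task concrete, here is a minimal sketch of a generalized linear model (logistic regression) fitted with base R's glm(). The data are simulated, and the variable names, sample size, and coefficients used to generate them are assumptions for illustration only, not taken from any of the chapter examples.

```r
# Hypothetical sketch: logistic regression on simulated survey data.
# Variable names and the data-generating values are assumptions.
set.seed(853)

n <- 500
age <- round(runif(n, min = 18, max = 90))
# Assumed relationship on the log-odds scale: support rises with age
supports <- rbinom(n, size = 1, prob = plogis(-2 + 0.04 * age))
simulated_survey <- data.frame(age = age, supports = supports)

# Fit a generalized linear model with a binomial family and logit link
logistic_model <- glm(
  supports ~ age,
  data = simulated_survey,
  family = binomial(link = "logit")
)

# Coefficient estimates and standard errors (on the log-odds scale)
coef(summary(logistic_model))

# Predicted probabilities of support at ages 30 and 70
predicted_support <- predict(
  logistic_model,
  newdata = data.frame(age = c(30, 70)),
  type = "response"
)
```

Fitting the model to simulated data first, where the true relationship is known, is one way to check that the model and code behave as expected before turning to the real dataset.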

E.5.2 Checks

  • Be careful to thoroughly explain the model. Also consider the assumptions of the model and the threats to its validity.

E.5.3 FAQ

  • What does “change the situation slightly” mean? You are welcome to use the same, or similar, data, but consider a different aspect. For instance:
    • In the logistic regression example of US political support, you may use the CES from a different year, and/or with slightly different explanatory variables.
    • In the Poisson regression example of the letters used in Jane Eyre, you may consider a different novel.
    • In the negative binomial regression of mortality in Alberta, you may consider a different geographic area.
  • Can I use Alberta mortality data? No.

E.5.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate. Detail the statistical model used, defining and explaining each aspect and its importance, and ensure that variables are well-defined and correspond with those discussed in the data section. The model should have an appropriate balance of complexity—neither overly simplistic nor unnecessarily complicated—and be justified as suitable for the situation. Explain how decisions made in modeling reflect the aspects discussed in the data section, including why specific features are included (e.g., why use age rather than age-groups, treating province effects as levels, categorizing gender, etc?). Present the model using appropriate mathematical notation supplemented with plain English explanations, defining every component. If applicable, define sensible priors for Bayesian models. Clearly discuss the underlying assumptions, potential limitations, and circumstances where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking—such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses—while addressing model convergence, diagnostics, and any alternative models or variants considered, including their strengths and weaknesses and the rationale for the final model choice.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations, should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar, would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Not done'; 1 - 'Done' The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
Datasheet 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A thorough datasheet for the dataset that was constructed is included.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.5.5 Previous examples

E.6 St George Paper

E.6.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model to forecast the winner of the upcoming US presidential election using “poll-of-polls” (Blumenthal 2014; Pasek 2015) and then write a short paper telling a story.
  • You should use this starter folder.
  • You can get data about polling outcomes from here (search for “Download the data”, then select Presidential general election polls (current cycle), then “Download”).
  • Pick one pollster in your sample, and deep-dive on their methodology in an appendix to your paper. In particular, in addition to conveying a sense of the pollster of interest, this appendix should include a discussion of the survey’s methodology, and its key features, strengths, and weaknesses. For instance:
    • what is the population, frame, and sample;
    • how is the sample recruited;
    • what sampling approach is taken, and what are some of the trade-offs of this;
    • how is non-response handled;
    • what is good and bad about the questionnaire.
  • In another appendix, please put together an idealized methodology and survey that you would run if you had a budget of $100K and the task of forecasting the US presidential election. You should detail the sampling approach that you would use, how you would recruit respondents, data validation, and any other relevant aspects of interest. Also be careful to address any poll aggregation or other features of your methodology. You should actually implement your survey using a survey platform like Google Forms. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
  • Submit a link to the GitHub repo.
  • There should be no evidence that this is a class paper.
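
As a rough illustration of the modeling step, the sketch below fits a linear model to simulated polls and forecasts support on election day. Everything in it is an assumption made for illustration: the data are simulated, and the column names (“pollster”, “days_to_election”, “pct”) are hypothetical rather than those of the real polling dataset.

```r
# Simulated polls only; column names are hypothetical, not those of the real data
set.seed(853)

num_polls <- 200
simulated_polls <- data.frame(
  pollster = sample(
    c("Pollster A", "Pollster B", "Pollster C"),
    size = num_polls,
    replace = TRUE
  ),
  days_to_election = sample(0:100, size = num_polls, replace = TRUE)
)

# Assumed truth: support drifts up towards 50 per cent by election day,
# and each poll adds its own noise
simulated_polls$pct <- 48 +
  0.02 * (100 - simulated_polls$days_to_election) +
  rnorm(num_polls, mean = 0, sd = 2)

# Linear model of support as a function of time and pollster
poll_model <- lm(pct ~ days_to_election + pollster, data = simulated_polls)

# Forecast support on election day for one pollster, with a prediction interval
election_day <- data.frame(days_to_election = 0, pollster = "Pollster A")
election_day_forecast <- predict(
  poll_model,
  newdata = election_day,
  interval = "prediction"
)
election_day_forecast
```

The prediction interval, rather than only the point estimate, is what lets a paper talk about uncertainty in the forecast. A real analysis would also need to decide how to treat pollster quality, sample size, and poll recency.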

E.6.2 Checks

  • Check that you have both required appendices.

E.6.3 FAQ

  • Do I need to use all the predictors in the dataset? No, you should be deliberate and thoughtful about the predictors that you use.
  • What about the electoral college? US presidential elections are won based on the electoral college. It is fine to just focus on the popular vote. But exceptional submissions would consider the popular vote, possibly by state, and then construct an electoral college estimate, being careful to propagate uncertainty.
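
One way to propagate uncertainty from state-level forecasts to an electoral college estimate is by simulation. The sketch below is a minimal example under strong assumptions: the states, vote shares, standard deviations, and electoral vote counts are all made up, and states are treated as independent, which real state-level errors are not, so it likely understates uncertainty.

```r
set.seed(853)

# Hypothetical state-level forecasts: mean and standard deviation of one
# candidate's two-party vote share, plus made-up electoral vote counts
state_forecasts <- data.frame(
  state = c("State A", "State B", "State C", "State D"),
  mean_share = c(0.53, 0.49, 0.505, 0.46),
  sd_share = c(0.02, 0.02, 0.03, 0.02),
  electoral_votes = c(10, 16, 19, 8)
)
votes_needed <- floor(sum(state_forecasts$electoral_votes) / 2) + 1

# Draw vote shares: one row per simulation, one column per state
num_sims <- 10000
simulated_shares <- sapply(
  seq_len(nrow(state_forecasts)),
  function(i) {
    rnorm(
      num_sims,
      mean = state_forecasts$mean_share[i],
      sd = state_forecasts$sd_share[i]
    )
  }
)

# A state's electoral votes go to the candidate when their share exceeds one half
state_won <- simulated_shares > 0.5
simulated_votes <- as.vector(state_won %*% state_forecasts$electoral_votes)

# Summarize the distribution of electoral votes and the chance of winning
win_probability <- mean(simulated_votes >= votes_needed)
quantile(simulated_votes, probs = c(0.05, 0.5, 0.95))
win_probability
```

Reporting the distribution of simulated electoral votes, not just the most likely winner, carries the state-level uncertainty through to the final estimate.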

E.6.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate. Detail the statistical model used, defining and explaining each aspect and its importance, and ensure that variables are well-defined and correspond with those discussed in the data section. The model should have an appropriate balance of complexity—neither overly simplistic nor unnecessarily complicated—and be justified as suitable for the situation. Explain how decisions made in modeling reflect the aspects discussed in the data section, including why specific features are included (e.g., why use age rather than age-groups, treating province effects as levels, categorizing gender, etc?). Present the model using appropriate mathematical notation supplemented with plain English explanations, defining every component. If applicable, define sensible priors for Bayesian models. Clearly discuss the underlying assumptions, potential limitations, and circumstances where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking—such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses—while addressing model convergence, diagnostics, and any alternative models or variants considered, including their strengths and weaknesses and the rationale for the final model choice.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Idealized methodology 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The proposed methodology is well thought through, realistic, and would achieve the goals.
Idealized survey 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The survey should have an introductory section and include the details of a contact person. The survey questions should be well constructed and appropriate to the task. The questions should have an appropriate ordering. A final section should thank the respondent.
Pollster methodology overview and evaluation 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The deep dive provides a thorough understanding of how something goes from being a person's opinion to part of a result for this pollster. Please provide a thorough overview and evaluation of the pollster’s methodology, highlighting both its strengths and limitations.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Not done'; 1 - 'Done' The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard-coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.7 Spofforth Paper

E.7.1 Task

  • Working as part of a team of one to three people, please forecast the popular vote of the upcoming US election using multilevel regression with post-stratification and then write a short paper telling a story.
  • This requires individual-level survey data, post-stratification data, and a model that brings them together. Given the expense of collecting these data, and the privilege of having access to them, please be sure to properly cite all datasets that you use.
  • You will need to:
    • Get access to an individual-level survey dataset.
    • Get access to a post-stratification dataset.
    • Clean and prepare both these datasets to make them useable together.
    • Estimate a model using the survey dataset.
    • Apply the trained model to the post-stratification dataset to forecast the election result.
  • You should use this starter folder.
  • Submit a link to the GitHub repo.
  • There should be no evidence that this is a class paper.
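
The steps above can be sketched end to end with simulated data. This is a minimal sketch under stated assumptions: the survey, the variables (“age_group”, “gender”), and the population counts are all made up, and a plain logistic regression fitted with glm() stands in for the multilevel model that the task actually asks for.

```r
set.seed(853)

# Simulated individual-level survey; variables and levels are hypothetical
num_respondents <- 2000
survey <- data.frame(
  age_group = sample(
    c("18-29", "30-44", "45-64", "65+"),
    size = num_respondents,
    replace = TRUE
  ),
  gender = sample(c("Female", "Male"), size = num_respondents, replace = TRUE)
)
support_probability <- plogis(
  -0.2 +
    0.3 * (survey$gender == "Female") +
    0.4 * (survey$age_group == "18-29")
)
survey$supports_candidate <- rbinom(
  num_respondents,
  size = 1,
  prob = support_probability
)

# Stand-in model: plain logistic regression, not a multilevel model
support_model <- glm(
  supports_candidate ~ age_group + gender,
  data = survey,
  family = binomial()
)

# Post-stratification table: one row per cell with a made-up population count
poststrat <- expand.grid(
  age_group = c("18-29", "30-44", "45-64", "65+"),
  gender = c("Female", "Male"),
  stringsAsFactors = FALSE
)
poststrat$population <- c(900, 1200, 1500, 1000, 950, 1150, 1400, 800)

# Predict support in each cell, then weight by each cell's population
poststrat$predicted_support <- predict(
  support_model,
  newdata = poststrat,
  type = "response"
)
poststratified_estimate <-
  sum(poststrat$predicted_support * poststrat$population) /
  sum(poststrat$population)
poststratified_estimate
```

The final step is a population-weighted average of the cell-level predictions. In a full analysis the model would include group-level effects, and the forecast would be reported with its uncertainty rather than as a single number.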

E.7.2 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.

E.7.3 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate. Detail the statistical model used, defining and explaining each aspect and its importance, and ensure that variables are well-defined and correspond with those discussed in the data section. The model should have an appropriate balance of complexity—neither overly simplistic nor unnecessarily complicated—and be justified as suitable for the situation. Explain how decisions made in modeling reflect the aspects discussed in the data section, including why specific features are included (e.g., why use age rather than age-groups, treating province effects as levels, categorizing gender, etc?). Present the model using appropriate mathematical notation supplemented with plain English explanations, defining every component. If applicable, define sensible priors for Bayesian models. Clearly discuss the underlying assumptions, potential limitations, and circumstances where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking—such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses—while addressing model convergence, diagnostics, and any alternative models or variants considered, including their strengths and weaknesses and the rationale for the final model choice.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Not done'; 1 - 'Done' The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard-coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

E.7.4 Previous examples

E.8 Final paper

E.8.1 Task

  • Working individually and in an entirely reproducible way, please write a paper that involves original work to tell a story with data.
  • Options include (pick one):
    • Develop a research question that is of interest to you based on your own interests, background, and expertise, then obtain or create a relevant dataset.
    • A reproduction, being sure to use the paper as a foundation rather than as an end-in-itself.
  • All the guidance and expectations from earlier papers apply to this one.

E.8.2 Checks

  • Do not use a dataset from Kaggle, UCI, or Statista. Mostly this is because everyone else uses these datasets and so it does nothing to make you stand out to employers, but there are sometimes also concerns that the data are old, or that you do not know the provenance.

E.8.3 FAQ

  • Can I work as part of a team? No. You must have some work that is entirely your own. You really need your own work to show off for job applications etc.
  • How much should I write? Most students submit something that has 10 to 20 pages of main content, with additional pages devoted to appendices, but it is up to you. Be precise and thorough.
  • Do I have to submit an initial paper in order to do the peer-review? Yes.
  • Can I use the same paper for the reproduction as in the Howrah Paper? No.
  • Can I use any model? You are welcome to use any model, but you need to thoroughly explain it and this can be difficult for more complicated models. Start small. Pick one or two predictors. Once you get that working, then complicate it. Remember that every predictor and the outcome variable needs to be graphed and explained in the data section.
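
To make the “start small” advice concrete, the sketch below fits a one-predictor model, then a two-predictor model, and compares them on held-out data. It is only an illustration under assumed conditions: the data are simulated, and the variable names (“x1”, “x2”, “outcome”) and the helper calculate_rmse() are hypothetical.

```r
set.seed(853)

# Simulated data with two predictors; names are hypothetical
num_obs <- 500
simulated_data <- data.frame(
  x1 = rnorm(num_obs),
  x2 = rnorm(num_obs)
)
simulated_data$outcome <- 1 +
  2 * simulated_data$x1 +
  0.5 * simulated_data$x2 +
  rnorm(num_obs)

# Hold out 20 per cent of rows for out-of-sample testing
train_rows <- sample(seq_len(num_obs), size = 0.8 * num_obs)
training <- simulated_data[train_rows, ]
testing <- simulated_data[-train_rows, ]

# Start small, then complicate
small_model <- lm(outcome ~ x1, data = training)
larger_model <- lm(outcome ~ x1 + x2, data = training)

# Root mean squared error of a model's predictions on a dataset
calculate_rmse <- function(model, data) {
  predictions <- predict(model, newdata = data)
  sqrt(mean((data$outcome - predictions)^2))
}

model_comparison <- c(
  small = calculate_rmse(small_model, testing),
  larger = calculate_rmse(larger_model, testing)
)
model_comparison
```

A comparison like this gives a reason, beyond preference, for keeping or dropping the added predictor, and it is the kind of validation evidence that the model section asks for.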

E.8.4 Rubric

Component Range Requirement
R is appropriately cited 0 - 'No'; 1 - 'Yes' Must be properly referred to in the main content and included in the reference list. If not, no need to continue marking, paper gets 0 overall.
Class paper 0 - 'No'; 1 - 'Yes' Check metadata such as project and folder names, as well as other aspects such as the title. If there is any sign this is a class paper then no need to continue marking, paper gets 0 overall.
LLM usage is documented 0 - 'No'; 1 - 'Yes' A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, no need to continue marking, paper gets 0 overall.
Title 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. There should be no evidence this is a school paper.
Author, date, and repo 0 - 'Poor or not done'; 2 - 'Yes' The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done'; 1 - 'Done' The estimand is clearly stated in the introduction.
Data 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Good'; 4 - 'Exceptional' A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' The model should be nicely written out, well-explained, justified, and appropriate. Detail the statistical model used, defining and explaining each aspect and its importance, and ensure that variables are well-defined and correspond with those discussed in the data section. The model should have an appropriate balance of complexity—neither overly simplistic nor unnecessarily complicated—and be justified as suitable for the situation. Explain how decisions made in modeling reflect the aspects discussed in the data section, including why specific features are included (e.g., why use age rather than age-groups, treating province effects as levels, categorizing gender, etc?). Present the model using appropriate mathematical notation supplemented with plain English explanations, defining every component. If applicable, define sensible priors for Bayesian models. Clearly discuss the underlying assumptions, potential limitations, and circumstances where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking—such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses—while addressing model convergence, diagnostics, and any alternative models or variants considered, including their strengths and weaknesses and the rationale for the final model choice.
Results 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars.
Discussion 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Good'; 8 - 'Great'; 10 - 'Exceptional' Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Good'; 6 - 'Exceptional' All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, and clear. Do not use filler phrases such as 'delve into' or 'shed light'. Remove unnecessary words.
Cross-references 0 - 'Poor or not done'; 1 - 'Yes' All figures, tables, and equations should be numbered, and referred to in the text using cross-references.
Captions 0 - 'Poor or not done'; 1 - 'Good'; 2 - 'Excellent' All figures and tables have detailed and meaningful captions.
Graphs/tables/etc 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. They should: 1) serve a clear purpose; 2) be fully self-contained through appropriate use of captions and sub-captions; 3) be appropriately sized and colored; and 4) have appropriate significant figures, in the case of tables.
Referencing 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list.
Commits 0 - 'Poor or not done'; 2 - 'Excellent' There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done'; 2 - 'Exceptional' Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The script is clearly commented and structured. All variables are appropriately simulated.
Tests 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' Data and code tests are appropriately used.
Parquet 0 - 'Not done'; 1 - 'Done' The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducibility 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' The paper and analysis should be fully reproducible. The repo should have a detailed README. All code should be thoroughly documented. An R project should be used. Code should be used to do all steps including appropriately read data, prepare it, create plots, conduct analysis, and generate documents. Seeds should be used where needed. Code should have a preamble and be well-documented including comments and layout. The repo should be appropriately organized and not contain extraneous files. setwd() and hard coded file paths must not be used.
Code style 0 - 'Poor or not done'; 1 - 'Exceptional' Code is appropriately styled using styler or lintr.
Enhancements 0 - 'Poor or not done'; 1 - 'Gets job done'; 2 - 'Fine'; 3 - 'Great'; 4 - 'Exceptional' You should pick at least one of the following and include it to enhance your submission: 1) A datasheet for the dataset; 2) A model card for the model; 3) A Shiny application; 4) An R package; or 5) An API for the model.
General excellence 0 - 'None'; 1 - 'Huh, that's interesting'; 2 - 'Wow'; 3 - 'Exceptional' There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
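
As a rough illustration of what the Simulation, Tests, Parquet, and Reproducibility items are asking for, here is a minimal sketch in R. The variable names (`ward`, `num_incidents`), the seed value, and the output location are hypothetical and chosen only for illustration; a real submission would simulate the variables in its own dataset and save into the repo's own folders.

```r
# Hypothetical example: simulate a small dataset and test it.
# Column names and the seed are illustrative assumptions, not requirements.

set.seed(853) # a seed makes the simulation reproducible

num_obs <- 100

simulated_data <- data.frame(
  id = seq_len(num_obs),
  ward = sample(1:25, size = num_obs, replace = TRUE),
  num_incidents = rpois(n = num_obs, lambda = 3)
)

# Data tests: stopifnot() raises an error if any expectation fails
stopifnot(
  nrow(simulated_data) == num_obs,
  !any(is.na(simulated_data)),
  all(simulated_data$ward %in% 1:25),
  all(simulated_data$num_incidents >= 0),
  !anyDuplicated(simulated_data$id)
)

# Build the path with file.path() instead of setwd() or a hard-coded
# absolute path. tempdir() is used here only so the sketch runs anywhere.
output_path <- file.path(tempdir(), "simulated_data.csv")
write.csv(simulated_data, output_path, row.names = FALSE)

# If the arrow package is installed, also save a parquet copy
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(
    simulated_data,
    file.path(tempdir(), "simulated_data.parquet")
  )
}
```

The same `stopifnot()` checks can be reused on the cleaned analysis data, which is one way of ensuring that the simulated and actual datasets meet the same expectations.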

E.8.5 Previous examples


  1. Gilad gave explicit permission and encouragement to be included in this list.

  2. This terminology is used following Barba (2018), but it is the opposite of that used by BITSS.

  3. The US GSS is recommended here because individual-level data are publicly available, and the dataset is well-documented. But university students in particular countries often have access to individual-level data that are not available to the public, and if this is the case then you are welcome to use that instead. Students at Australian universities will likely have access to individual-level data from the Australian General Social Survey, and could use that. Students at Canadian universities will likely have access to individual-level data from the Canadian General Social Survey and may like to use that.