Online Appendix F — Papers

One way to build understanding of material is by using it. The purpose of these papers is to give you a chance to implement what you have learnt in a real-world setting. Completing the papers is also important from the perspective of building a portfolio for job applications.

Expectations change from year to year so please treat the “previous examples” as examples rather than templates.

F.1 Donaldson Paper

F.1.1 Task

  • Working individually and in an entirely reproducible way, please find a dataset of interest on Open Data Toronto and write a short paper telling a story about the data.
    • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this starter folder.
    • Find a dataset of interest on Open Data Toronto. (While not banned, please don’t use a dataset about the pandemic.)
      • Put together an R script, “scripts/00-simulate_data.R”, that simulates the dataset of interest and develops some tests. Push to GitHub and include an informative commit message.
      • Write an R script, “scripts/01-download_data.R” to download the actual data in a reproducible way using opendatatoronto (Gelfand 2022). Save the data: “data/raw_data/unedited_data.csv” (use a meaningful name and appropriate file type). Push to GitHub and include an informative commit message.
    • Prepare a PDF using Quarto “paper/paper.qmd” with these sections: title, author, date, abstract, introduction, data, and references.
      • The title should be descriptive, informative, and specific.
      • The date should be in an unambiguous format. Add a link to the GitHub repo in the acknowledgments.
      • The abstract should be three or four sentences. The abstract must tell the reader the top-level finding. What is the one thing that we learn about the world because of this paper?
      • The introduction should be two or three paragraphs of content. And there should be an additional final paragraph that sets out the remainder of the paper.
      • The data section should thoroughly and precisely discuss the source of the data and the broader context that gave rise to it (ethical, statistical, and otherwise). Comprehensively describe and summarize the data using text, graphs, and tables. Graphs must be made with ggplot2 (Wickham 2016) and tables must be made with tinytable (Arel-Bundock 2024). Graphs must show the actual data, or as close to it as possible, not summary statistics. Graphs and tables should be cross-referenced in the text e.g. “Table 1 shows…”.
      • References should be added using BibTeX. Be sure to reference R, and any R packages you use, as well as the dataset. Strong submissions will draw on related literature and reference those.
      • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
      • Use appendices for supporting, but not critical, material.
      • Push to GitHub and include an informative commit message.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class assignment.
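
The simulate-then-test step can be sketched as follows. This is a minimal illustration rather than a template: the variable names (`ward`, `inspection_date`, `num_complaints`), the 25 wards, the Poisson rate, the date range, and the seed value are all assumptions made for the example. A real script would mirror the columns of the dataset chosen from Open Data Toronto.

```r
#### Preamble ####
# Purpose: Simulate a hypothetical dataset and run basic checks on it.
# Note: All variable names and parameter values are illustrative assumptions.

set.seed(853) # Fix the seed so the simulation is reproducible

num_obs <- 200

simulated_data <- data.frame(
  # Hypothetical ward identifier
  ward = sample(1:25, size = num_obs, replace = TRUE),
  # Hypothetical date within 2023
  inspection_date = as.Date("2023-01-01") +
    sample(0:364, size = num_obs, replace = TRUE),
  # Hypothetical count variable
  num_complaints = rpois(n = num_obs, lambda = 3)
)

# Basic tests of the simulated data; stopifnot() errors if any check fails
stopifnot(
  nrow(simulated_data) == num_obs,
  all(simulated_data$ward %in% 1:25),
  all(simulated_data$num_complaints >= 0),
  min(simulated_data$inspection_date) >= as.Date("2023-01-01"),
  max(simulated_data$inspection_date) <= as.Date("2023-12-31"),
  !anyNA(simulated_data)
)
```

Writing the checks against simulated data first means the same checks can later be pointed at the downloaded data, where a failure signals a difference between what was expected and what was obtained.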

F.1.2 Checks

  • There should be no R code or raw R output in the final PDF.
  • An example statement for the README on LLM usage that you could base yours on is: “Statement on LLM usage: Aspects of the code were written with the help of the autocomplete tool, Codriver. The abstract and introduction were written with the help of ChatHorse and the entire chat history is available in other/llm/usage.txt.”
  • The paper should render directly to PDF i.e. use “Render to PDF”.
  • Graphs, tables, and text should be clear, and of comparable quality to those of the Financial Times.
  • The date should be up-to-date and unambiguous (e.g. 2/3/2024 is ambiguous, 2 March 2024 is not).
  • The workflow should be entirely reproducible.
  • There should not be any typos.
  • There should be no sign this is a school paper.
  • There must be a link to the paper’s GitHub repo using a footnote.
  • The GitHub repo should be well-organized, and contain an informative README.
  • The paper should be well-written and able to be understood by the average reader of, say, the Financial Times. This means that you are allowed to use mathematical notation, but you must explain all of it in plain language. All statistical concepts and terminology must be explained. Your reader is someone with a university education, but not necessarily someone who understands what a p-value is.
  • Abstracts need to be “tightly written”, almost terse. Remove unnecessary words. Do not include more than four sentences. (You can break this rule once you get experience.)
  • The introduction needs paragraph breaks (leave a blank line between paragraphs in the Quarto Document).
  • In the introduction, please telegraph the rest of the paper: “Section 2…, Section 3….”. (You can break this rule once you get experience.)
  • Please don’t read the data from their server into the Quarto Document; read the saved version. Submissions that read the data directly from the server receive 0 overall.
  • In the introduction, please be more specific about your findings.
  • The data section is not about data cleaning, it is about the data. Put data cleaning into an appendix. Unless there is something critical, do not discuss data cleaning in the data section.
  • Simulation needs a seed.
  • Do not call the repo “Paper 1” or similar.
  • Do not have sections called “graphs” or “tables” or similar.
  • Use usethis::git_vaccinate() to get a better gitignore file, and specifically to ignore .DS_Store.
  • Please remember to cite both the dataset that you use and also opendatatoronto - they are separate things.

F.1.3 FAQ

  • Can I use a dataset from Kaggle instead? No, because they have done the hard work for you.
  • I cannot use code to download my dataset, can I just manually download it? No, because your entire workflow needs to be reproducible. Please fix the download problem or pick a different dataset.
  • How much should I write? Most students submit something in the two-to-six-page range, but it is up to you. Be precise and thorough.
  • My data is about apartment blocks/NBA/League of Legends so there’s no broader context, what do I do? Please re-read the relevant chapter and readings to better understand bias and ethics. If you really cannot think of something, then it might be worth picking a different dataset.
  • Can I use Python? No. If you already know Python then it does not hurt to learn another language.
  • Why do I need to cite R, when I don’t need to cite Word? R is a free statistical programming language with academic origins, so it is appropriate to acknowledge the work of others. It is also important for reproducibility.
  • What reference style should I use? Any major reference style is fine (APA, Harvard, Chicago, etc); just pick one that you are used to.
  • The paper in the starter folder has a model section, so do I need to put together a model? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • The paper in the starter folder has a data sheets appendix, so do I need to put together a data sheet? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need.
  • What does “graph the actual data” mean? If you have, say, 5,000 observations in the dataset and three variables, then for every variable there should be a graph that has 5,000 points in the case of dots, or adds up to 5,000 in the case of bar charts and histograms.
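
One way to check the last point is to compare the counts that a graph displays with the number of rows in the dataset. The sketch below uses a made-up data frame (the `trip_minutes` variable, the 5,000 rows, the bin width, and the axis labels are assumptions for illustration) and uses `layer_data()` from ggplot2 to recover the bar heights of a histogram.

```r
library(ggplot2)

set.seed(853)

# Hypothetical dataset with 5,000 observations
example_data <- data.frame(trip_minutes = rexp(n = 5000, rate = 1 / 20))

histogram <-
  ggplot(example_data, aes(x = trip_minutes)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  labs(x = "Trip duration (minutes)", y = "Number of trips")

# Recover the computed bar heights from the first (and only) layer
bar_counts <- layer_data(histogram, 1)$count

# TRUE if the bars account for every row in the dataset
sum(bar_counts) == nrow(example_data)
```

If axis limits or filters drop observations, the comparison would return FALSE, which is a prompt to look at what was excluded and why.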

F.1.4 Rubric

| Component | Range | Requirement |
| --- | --- | --- |
| R/Python cited | 0 - 'No'; 1 - 'Yes' | R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall. |
| LLM documentation | 0 - 'No'; 1 - 'Yes' | A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, then paper gets 0 overall. |
| Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced). |
| Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK'). |
| Abstract | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper. |
| Introduction | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total. |
| Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Acceptable'; 8 - 'Impressive'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section; you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables. |
| Measurement | 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Acceptable'; 4 - 'Exceptional' | A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world that happened to an entry in the dataset that you are interested in. |
| Prose | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Acceptable'; 6 - 'Exceptional' | All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'. |
| Cross-references | 0 - 'Poor or not done'; 1 - 'Yes' | All figures, tables, and equations should be numbered, and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper. |
| Captions | 0 - 'Poor or not done'; 1 - 'Acceptable'; 2 - 'Excellent' | All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is. |
| Graphs and tables | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data. |
| Referencing | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct e.g. use double braces appropriately in the BibTeX file. |
| Commits | 0 - 'Poor or not done'; 2 - 'Done' | There are at least a handful of different commits, and they have meaningful commit messages. |
| Sketches | 0 - 'Poor or not done'; 2 - 'Done' | Sketches are included in a labelled folder of the repo, appropriate, and of high-quality. |
| Simulation | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables. |
| Tests | 0 - 'Poor or not done'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or great expectations. |
| Reproducible workflow | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use the base pipe, not the magrittr pipe. Comment on and close all GitHub issues. Deal with all branches. |
| Miscellaneous | 0 - 'None'; 1 - 'Notable'; 2 - 'Remarkable'; 3 - 'Exceptional' | There are always students who excel in a way that is not anticipated in the rubric. This item accounts for that. |
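
A test suite of the kind described under the Tests component could start along these lines. This is a sketch using testthat, not a complete suite: the data frame is generated in place so that the example is self-contained, and its column names and permitted values are hypothetical. In a real repo the suite would read the saved simulated or downloaded data and live in its own script.

```r
library(testthat)

set.seed(853)

# Stand-in for a dataset that would normally be read from a saved file;
# the columns and their ranges are assumptions for illustration
analysis_data <- data.frame(
  ward = sample(1:25, size = 100, replace = TRUE),
  num_complaints = rpois(n = 100, lambda = 3)
)

test_that("ward identifiers are valid", {
  expect_type(analysis_data$ward, "integer")
  expect_true(all(analysis_data$ward %in% 1:25))
})

test_that("complaint counts are non-negative whole numbers", {
  expect_true(all(analysis_data$num_complaints >= 0))
  expect_true(all(analysis_data$num_complaints %% 1 == 0))
})

test_that("there are no missing values", {
  expect_false(anyNA(analysis_data))
})
```

Grouping expectations into named `test_that()` blocks means that a failure message identifies which property of the data was violated.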

F.1.5 Previous examples

F.2 Mawson Paper

F.2.1 Task

  • Working as part of a team of one to three people, please pick a paper of interest to you, with code and data that are available from:

    1. A paper published anytime since 2019, in an American Economic Association journal. These journals are: “American Economic Review”, “AER: Insights”, “AEJ: Applied Economics”, “AEJ: Economic Policy”, “AEJ: Macroeconomics”, “AEJ: Microeconomics”, “Journal of Economic Literature”, “Journal of Economic Perspectives”, “AEA Papers & Proceedings”.
    2. Any article from the Institute for Replication list available here that has a replicability status of “Looking for replicator”.
    3. One of Gilad Feldman’s papers.
  • Following the Guide for Accelerating Computational Reproducibility in the Social Sciences, please complete a replication of at least three graphs, tables, or a combination, from that paper, using the Social Science Reproduction Platform. Note the DOI of your replication.

  • Then, working in an entirely reproducible way, conduct a reproduction based on two or three aspects of the paper, and write a short paper about it.

    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you should use this starter folder): title, author, date, abstract, introduction, data, results, discussion, and references.
    • The aspects that you focus on in your paper could be the same aspects that you replicated, but they do not need to be. Follow the direction of the paper, but make it your own. That means you should ask a slightly different question, or answer the same question in a slightly different way, but still use the same dataset.
    • Include the DOI of your replication in your paper and a link to the GitHub repo that underpins your paper.
    • The results section should convey findings.
    • The discussion should include three or four sub-sections that each focus on an interesting point, and there should be another sub-section on the weaknesses of your paper, and another on potential next steps for it.
    • In the discussion section, and any other relevant section, please be sure to discuss ethics and bias, with reference to relevant literature.
    • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
    • Use appendices for supporting, but not critical, material.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.

  • There should be no evidence that this is a class assignment.

F.2.2 Checks

  • The paper should not just copy/paste the code from the original paper, but should instead use that code as a foundation to work from.
  • Your paper should have a link to the associated GitHub repository and the DOI of the Social Science Reproduction Platform replication that you conducted.
  • Make sure you have referenced everything, including R. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, provided it is consistent.

F.2.3 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.
  • Do I have to focus on a model result? No, it is likely best to stay away from that at this point, and instead focus on tables or graphs of summary or explanatory statistics.
  • What if the paper I choose is in a language other than R? Both your replication and reproduction code should be in R. So you will need to translate the code into R for the replication. And the reproduction should be your own work, so that also should be in R. One common language is Stata, and Huntington-Klein (2022) might be useful as a “Rosetta Stone” of sorts, for R, Python, and Stata, or just use an LLM to help.
  • Can I work by myself? Yes.
  • Do the graphs/tables have to look identical to the original? No, you are welcome to, and should, make them look better as part of the reproduction. And even as part of the replication, they do not have to be identical, just similar enough.
  • One of my graphs has four panels, do I have to do all of them for this to be counted as one element? No, for the purpose of this paper, every panel counts as a separate element, so all you would need to do is three panels and that would be enough.
  • How do I automatically download the data if they are behind a sign-in? If the data are behind a sign-in, add commented documentation explaining how to download them to the download_data.R script, rather than code.
  • Do we need to commit our original, unedited data to GitHub if it is really big? No, you do not necessarily need to commit the original, unedited data to GitHub if it is too large, just add a note explaining the situation in the README and how to obtain the data.
  • What should the abstract and introduction be about? The abstract and introduction should reflect your own work and findings, rather than those of the original paper (although those will necessarily play some role). You are (almost surely) not replicating their entire paper, and so your abstract should be different. See the examples for guidance.

F.2.4 Rubric

| Component | Range | Requirement |
| --- | --- | --- |
| R/Python cited | 0 - 'No'; 1 - 'Yes' | R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall. |
| Data cited | 0 - 'No'; 1 - 'Yes' | Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall. |
| Class paper | 0 - 'No'; 1 - 'Yes' | There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall. |
| LLM documentation | 0 - 'No'; 1 - 'Yes' | A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used then the entire chat must be included in the usage text file. If not, then paper gets 0 overall. |
| Replication | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Acceptable'; 8 - 'Impressive'; 10 - 'Exceptional' | SSRP submission needs to be filled out completely for three elements. |
| Title | 0 - 'Poor or not done'; 1 - 'Yes'; 2 - 'Exceptional' | An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced). |
| Author, date, and repo | 0 - 'Poor or not done'; 2 - 'Yes' | The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK'). |
| Abstract | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper. |
| Introduction | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total. |
| Estimand | 0 - 'Poor or not done'; 1 - 'Done' | The estimand is clearly stated, in its own paragraph, in the introduction. |
| Data | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Acceptable'; 8 - 'Impressive'; 10 - 'Exceptional' | A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section; you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables. |
| Measurement | 0 - 'Poor or not done'; 2 - 'Some issues'; 3 - 'Acceptable'; 4 - 'Exceptional' | A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon in the world that happened to an entry in the dataset that you are interested in. |
| Results | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Acceptable'; 8 - 'Impressive'; 10 - 'Exceptional' | Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates. |
| Discussion | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Some issues'; 6 - 'Acceptable'; 8 - 'Impressive'; 10 - 'Exceptional' | Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future? |
| Prose | 0 - 'Poor or not done'; 2 - 'Many issues'; 4 - 'Acceptable'; 6 - 'Exceptional' | All aspects of submission should be free of noticeable typos, spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'. |
| Cross-references | 0 - 'Poor or not done'; 1 - 'Yes' | All figures, tables, and equations should be numbered, and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper. |
| Captions | 0 - 'Poor or not done'; 1 - 'Acceptable'; 2 - 'Excellent' | All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is. |
| Graphs and tables | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data. |
| Referencing | 0 - 'Poor or not done'; 3 - 'One minor issue'; 4 - 'Perfect' | All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct e.g. use double braces appropriately in the BibTeX file. |
| Commits | 0 - 'Poor or not done'; 2 - 'Done' | There are at least a handful of different commits, and they have meaningful commit messages. |
| Sketches | 0 - 'Poor or not done'; 2 - 'Done' | Sketches are included in a labelled folder of the repo, appropriate, and of high-quality. |
| Simulation | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables. |
| Tests | 0 - 'Poor or not done'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or great expectations. |
| Reproducible workflow | 0 - 'Poor or not done'; 1 - 'Some issues'; 2 - 'Acceptable'; 3 - 'Impressive'; 4 - 'Exceptional' | Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use the base pipe, not the magrittr pipe. Comment on and close all GitHub issues. Deal with all branches. |
| Miscellaneous | 0 - 'None'; 1 - 'Notable'; 2 - 'Remarkable'; 3 - 'Exceptional' | There are always students who excel in a way that is not anticipated in the rubric. This item accounts for that. |

F.2.5 Previous examples

F.3 Howrah Paper

F.3.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please obtain data from the US General Social Survey. (You are welcome to use a different government-run survey, but please obtain permission before starting.)
  • Obtain the data, focus on one aspect of the survey, and then use it to tell a story.
    • Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then use Quarto to prepare a PDF with these sections (you should use this starter folder): title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a survey, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include, but not be limited to:
      • A discussion of the survey’s methodology, and its key features, strengths, and weaknesses. For instance: what is the population, frame, and sample; how is the sample recruited; what sampling approach is taken, and what are some of the trade-offs of this; how is non-response handled.
      • A discussion of the questionnaire: what is good and bad about it?
      • If this becomes too detailed, then use appendices for supporting but not essential aspects.
    • In an appendix, please put together a supplementary survey that could be used to augment the general social survey the paper focuses on. The purpose of the supplementary survey is to gain additional information on the topic that is the focus of the paper, beyond that gathered by the general social survey. The survey would be distributed in the same manner as the general social survey but needs to stand independently. The supplementary survey should be put together using a survey platform. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
    • Please be sure to discuss ethics and bias, with reference to relevant literature.
    • Code should be entirely reproducible, well-documented, and readable.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at a university-educated, but non-specialist, audience. Use survey, sampling, and statistical terminology, but be sure to explain it. The paper should flow, and be easy to follow and understand.
  • There should be no evidence that this is a class paper.

F.3.2 Checks

  • An appendix should contain both a link to the supplementary survey and the details of it, including questions (in case the link fails, and to make the paper self-contained).

F.3.3 FAQ

  • What should I focus on? You may focus on any year, aspect, or geography that is reasonable given the focus and constraints of the general social survey that you are interested in. Please consider the year and topics that you are interested in together, as some surveys focus on particular topics in some years.
  • Do I need to include the raw GSS data in the repo? For most of the general social surveys you will not have permission to share the GSS data. If that is the case, then you should add clear details in the README explaining how the data could be obtained.
  • How many graphs do I need? In general, you need at least as many graphs as you have variables, because you need to show all the observations for all variables. But you may be able to combine a few; or, vice versa, you may be interested in looking at different aspects or relationships.

F.3.4 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used, this must be mentioned. If chat tools such as ChatGPT were used, then the entire chat must be included in the usage text file. If not, then the paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section--you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon that happened in the world to an entry in the dataset that you are interested in.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations, should be numbered, and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Pollster review 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
The evaluation provides a thorough understanding of how something goes from being a person's opinion to part of a result for this pollster. Provide a thorough overview and evaluation of the pollster's methodology and sampling approach, highlighting both its strengths and limitations. Use this section to demonstrate knowledge of surveys and sampling and link your evaluation to the literature. Be precise and scientific--your review should not sound like an ad for the pollster.
Idealized survey 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The survey should have an introductory section and include the details of a contact person. The survey questions should be well constructed and appropriate to the task. The questions should have an appropriate ordering. A final section should thank the respondent. Question type should be varied and appropriate. Use this section to demonstrate knowledge of surveys.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow, or similar, would be acknowledged with just a comment in the script immediately preceding the use of the code, rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, and are appropriate and of high quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality, extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suites should be put together in a sophisticated way using packages like testthat, validate, pointblank, or Great Expectations.
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use base pipe not magrittr pipe. Comment on and close all GitHub issues. Deal with all branches.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
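
To illustrate the Tests item in this rubric, the following is a minimal sketch of a stand-alone test script. It uses only base R (`stopifnot()` with named conditions, which needs R 4.0 or later) so that it runs without extra packages; each condition could instead be wrapped in a testing package such as testthat. The data frame, its column names, and the valid ranges are hypothetical stand-ins for an analysis dataset that would normally be read from a file.

```r
#### Preamble ####
# Purpose: Test a (hypothetical) analysis dataset of survey responses.
# Note: The column names and valid ranges below are assumptions.

# Stand-in for data that would normally be read from the data folder
survey_data <- data.frame(
  respondent_id = 1:6,
  year = c(2018L, 2018L, 2021L, 2021L, 2022L, 2022L),
  age = c(23, 41, 67, 35, 58, 19),
  satisfaction = c("Low", "High", "Medium", "High", "Low", "Medium")
)

# Each name is the message shown if that condition fails
stopifnot(
  "respondent_id must be unique" =
    anyDuplicated(survey_data$respondent_id) == 0,
  "year must be an integer" =
    is.integer(survey_data$year),
  "year must be between 2018 and 2022" =
    all(survey_data$year >= 2018L & survey_data$year <= 2022L),
  "age must be numeric" =
    is.numeric(survey_data$age),
  "age must be between 18 and 100" =
    all(survey_data$age >= 18 & survey_data$age <= 100),
  "satisfaction must be a known category" =
    all(survey_data$satisfaction %in% c("Low", "Medium", "High")),
  "no values may be missing" =
    !any(is.na(survey_data))
)
```

Keeping one script of this kind for the simulated data and another for the actual data means the same expectations are checked in both places.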

F.3.5 Previous examples

F.4 Dysart Paper

F.4.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please convert at least one full-page table from one DHS Program “Final Report”, from the 1980s or 1990s, as available here, into a usable dataset, then write a short paper telling a story with the data.
  • Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this starter folder.
  • Create and document a dataset:
    • Save the PDF to “inputs”.
    • Put together a simulation of your plan for the usable dataset and save the script to “scripts/00-simulation.R”.
    • Write R code, saved as “scripts/01-gather_data.R”, to either OCR or parse the PDF, as appropriate, and save the output to “outputs/data/first_parse.csv”.
    • Write R code, saved as “scripts/02-clean_and_prepare_data.R”, that draws on “first_parse.csv” to clean and prepare the dataset. Use pointblank to put together tests that the dataset passes (at a minimum, every variable should have a test for class and another for content). Save the dataset to “outputs/data/cleaned_data.parquet”.
    • Following Gebru et al. (2021), put together a datasheet for the dataset you constructed. You are welcome to start from the template “inputs/data/datasheet_template.qmd” in the starter folder, but the datasheet should be added to the appendix of your paper, rather than being a stand-alone document.
  • Use the dataset to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a datasheet for the dataset, and references.
    • In addition to conveying a sense of the dataset of interest, the data section should include details of the methodology of the DHS that you used, and its key features, strengths, and weaknesses.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class paper.
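
The cleaning-and-testing step above can be sketched as follows. This is an illustration under stated assumptions, not the required implementation: the table `cleaned_data`, its column names, and the valid ranges are hypothetical, and the output is written to a temporary folder rather than “outputs/data/” so that the sketch is self-contained. The pointblank and arrow steps only run if those packages are installed.

```r
# Hypothetical cleaned table; in practice this would be built from first_parse.csv
cleaned_data <- data.frame(
  region = c("Urban", "Rural", "Urban", "Rural"),
  age_group = c("15-19", "15-19", "20-24", "20-24"),
  fertility_rate = c(0.051, 0.102, 0.187, 0.255)
)

# Base R version: one class test and one content test for every variable
stopifnot(
  is.character(cleaned_data$region),
  all(cleaned_data$region %in% c("Urban", "Rural")),
  is.character(cleaned_data$age_group),
  all(grepl("^[0-9]{2}-[0-9]{2}$", cleaned_data$age_group)),
  is.numeric(cleaned_data$fertility_rate),
  all(cleaned_data$fertility_rate >= 0 & cleaned_data$fertility_rate <= 1)
)

# The same checks expressed with pointblank, if it is installed
if (requireNamespace("pointblank", quietly = TRUE)) {
  agent <-
    pointblank::create_agent(tbl = cleaned_data) |>
    pointblank::col_is_character(columns = "region") |>
    pointblank::col_vals_in_set(columns = "region", set = c("Urban", "Rural")) |>
    pointblank::col_is_character(columns = "age_group") |>
    pointblank::col_vals_regex(columns = "age_group", regex = "^[0-9]{2}-[0-9]{2}$") |>
    pointblank::col_is_numeric(columns = "fertility_rate") |>
    pointblank::col_vals_between(columns = "fertility_rate", left = 0, right = 1) |>
    pointblank::interrogate()

  stopifnot(pointblank::all_passed(agent))
}

# Save as parquet, if arrow is installed (temporary path used for this sketch)
if (requireNamespace("arrow", quietly = TRUE)) {
  output_path <- file.path(tempdir(), "cleaned_data.parquet")
  arrow::write_parquet(cleaned_data, output_path)
}
```

Running the tests immediately before saving means that a dataset that fails a class or content check is never written out.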

F.4.2 Checks

  • Use GitHub in a well-developed way by making at least a few commits and using descriptive commit messages.

F.4.3 FAQ

F.4.4 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used, this must be mentioned. If chat tools such as ChatGPT were used, then the entire chat must be included in the usage text file. If not, then the paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section--you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon that happened in the world to an entry in the dataset that you are interested in.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations, should be numbered, and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow, or similar, would be acknowledged with just a comment in the script immediately preceding the use of the code, rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, and are appropriate and of high quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality, extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suites should be put together in a sophisticated way using packages like testthat, validate, pointblank, or Great Expectations.
Parquet 0 - 'No';
1 - 'Yes'
The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use base pipe not magrittr pipe. Comment on and close all GitHub issues. Deal with all branches.
Datasheet 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A thorough datasheet for the dataset that was constructed is included.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
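
The Reproducible workflow item asks that `install.packages()` not be left in the code unless handled carefully. One possible reading of that, sketched below in base R, is to check for missing packages and report them, rather than installing anything silently. The helper `find_missing_packages()` and the package list are hypothetical; base packages are listed here only so that the sketch runs anywhere, and a real project would list its own dependencies.

```r
# Hypothetical list of the packages that a project relies on
required_packages <- c("stats", "utils", "tools")

# Return the subset of `packages` that is not installed
find_missing_packages <- function(packages) {
  is_installed <- vapply(
    packages,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  packages[!is_installed]
}

missing_packages <- find_missing_packages(required_packages)

# Report what is missing instead of installing on someone else's computer
if (length(missing_packages) > 0) {
  message(
    "Please install the following packages before running this script: ",
    paste(missing_packages, collapse = ", ")
  )
}
```

The design choice here is that the script never changes the reader's library without their knowledge; tools that record package versions for a project are an alternative that this sketch does not cover.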

F.4.5 Previous examples

F.5 Spadina Paper

F.5.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model and then write a short paper telling a story. Some ideas for aspects you could tackle include:
    • Revisit the dataset that you used in Section F.1. Build a linear model for one of the variables, and consider the results.
    • Pick one of the examples in Chapter 13, and change the situation slightly, and then build a generalized linear model.
  • You should use this starter folder.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class paper.

F.5.2 Checks

  • Be careful to thoroughly explain the model. Also consider the assumptions of the model and the threats to its validity.
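
As an illustration of what explaining the model could look like, the following is a sketch of the notation for a logistic regression with one explanatory variable. The symbols and the priors are assumptions chosen for illustration; the priors only apply if a Bayesian approach is taken, and the values shown are not a recommendation.

```latex
\begin{aligned}
y_i \mid \pi_i &\sim \mbox{Bernoulli}(\pi_i) \\
\mbox{logit}(\pi_i) &= \beta_0 + \beta_1 x_i \\
\beta_0 &\sim \mbox{Normal}(0, 2.5) \\
\beta_1 &\sim \mbox{Normal}(0, 2.5)
\end{aligned}
```

Here $y_i$ is the binary outcome for observation $i$, $\pi_i$ is the probability that $y_i = 1$, $x_i$ is the explanatory variable, $\beta_0$ is the intercept, and $\beta_1$ is the change in the log-odds of the outcome associated with a one-unit change in $x_i$. Each of these would then be tied back to a variable described in the data section.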

F.5.3 FAQ

  • What does “change the situation slightly” mean? You are welcome to use the same, or similar, data, but consider a different aspect. For instance:
    • In the logistic regression example of US political support, you may use the CES from a different year, and/or with slightly different explanatory variables.
    • In the Poisson regression example of the letters used in Jane Eyre, you may consider a different novel.
    • In the negative binomial regression of mortality in Alberta, you may consider a different geographic area.
  • Can I use Alberta mortality data? No.
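
As a sketch of the Poisson case mentioned above, the following fits a generalized linear model to simulated counts using `glm()` from base R. The data, the variable names (`exposure`, `count`), and the true coefficient values are hypothetical; they are simulated so that the estimates can be compared with known values.

```r
set.seed(853)

n_obs <- 500

# Simulate an explanatory variable and counts that depend on it (log link)
simulated_counts <- data.frame(
  exposure = runif(n = n_obs, min = 0, max = 2)
)
simulated_counts$count <- rpois(
  n = n_obs,
  lambda = exp(0.5 + 0.8 * simulated_counts$exposure)
)

# Fit a Poisson regression
poisson_model <- glm(
  count ~ exposure,
  data = simulated_counts,
  family = poisson(link = "log")
)

# Estimates should be close to the simulated values of 0.5 and 0.8
coef(poisson_model)

# Rough overdispersion check: values well above 1 suggest that the Poisson
# assumption of equal mean and variance may not hold
dispersion <-
  sum(residuals(poisson_model, type = "pearson")^2) /
  df.residual(poisson_model)
dispersion
```

Fitting the model to simulated data first is one way to check that the code recovers known parameters before it is applied to the actual data. If the dispersion check suggested a problem with actual data, a negative binomial model would be one alternative to consider.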

F.5.4 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used, this must be mentioned. If chat tools such as ChatGPT were used, then the entire chat must be included in the usage text file. If not, then the paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section--you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon that happened in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Present the model clearly using appropriate mathematical notation and plain English explanations, defining every component. Ensure the model is well-explained, justified, appropriate, and balanced in complexity (neither overly simplistic nor unnecessarily complicated) for the situation. Variables should be well-defined and correspond with those in the data section. Explain how modeling decisions reflect aspects discussed in the data section, including why specific features are included (e.g., using age rather than age groups, treating province effects as levels, categorizing gender). If applicable, define and justify sensible priors for Bayesian models. Clearly discuss underlying assumptions, potential limitations, and situations where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking, such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses, and address model convergence and diagnostics (although much of the detail may be in the appendix). Include any alternative models or variants considered, their strengths and weaknesses, and the rationale for the final model choice.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations should be numbered and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or Great Expectations.
Parquet 0 - 'No';
1 - 'Yes'
The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use base pipe not magrittr pipe. Comment on and close all GitHub issues. Deal with all branches.
Datasheet 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A thorough datasheet for the dataset that was constructed is included.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.

F.5.5 Previous examples

F.6 St George Paper

F.6.1 Task

  • Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model to forecast the winner of the upcoming US presidential election using “poll-of-polls” (Blumenthal 2014; Pasek 2015) and then write a short paper telling a story.
  • You should use this starter folder.
  • You are welcome to use R, Python, or a combination.
  • You can get data about polling outcomes from here (search for “Download the data”, then select Presidential general election polls (current cycle), then “Download”).
  • Pick one pollster in your sample, and deep-dive on their methodology in an appendix to your paper. In particular, in addition to conveying a sense of the pollster of interest, this appendix should include a discussion of the survey’s methodology, and its key features, strengths, and weaknesses. For instance:
    • what is the population, frame, and sample;
    • how is the sample recruited;
    • what sampling approach is taken, and what are some of the trade-offs of this;
    • how is non-response handled;
    • what is good and bad about the questionnaire.
  • In another appendix, please put together an idealized methodology and survey that you would run if you had a budget of $100K and the task of forecasting the US presidential election. You should detail the sampling approach that you would use, how you would recruit respondents, data validation, and any other relevant aspects of interest. Also be careful to address any poll aggregation or other features of your methodology. You should actually implement your survey using a survey platform like Google Forms. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class paper.
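
One simple form of the "poll-of-polls" idea mentioned in the task is a sample-size-weighted average of individual polls. The sketch below is only a baseline under stated assumptions: the `Poll` class, the pollster labels, and all numbers are hypothetical, and the task itself asks for a linear or generalized linear model rather than a plain average.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    pollster: str      # hypothetical pollster label
    sample_size: int   # number of respondents
    pct: float         # candidate's share in this poll, in percent


def poll_of_polls(polls: list[Poll]) -> float:
    """Sample-size-weighted mean of the candidate's share across polls."""
    total_n = sum(p.sample_size for p in polls)
    if total_n == 0:
        raise ValueError("polls must contain at least one respondent")
    return sum(p.pct * p.sample_size for p in polls) / total_n


# Made-up polls, for illustration only.
example_polls = [
    Poll("Pollster A", 1000, 48.0),
    Poll("Pollster B", 500, 51.0),
    Poll("Pollster C", 1500, 47.0),
]

print(round(poll_of_polls(example_polls), 2))
```

Weighting by sample size is one choice among several. Weights could also reflect recency or pollster quality, and any such choice should be justified in the paper.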

F.6.2 Checks

  • Check that you have both appendices required.

F.6.3 FAQ

  • Do I need to use all the predictors in the dataset? No, you should be deliberate and thoughtful about the predictors that you use.
  • What about the electoral college? US presidential elections are won based on the electoral college. It is fine to just focus on the popular vote. But exceptional submissions would consider the popular vote, possibly by state, and then construct an electoral college estimate, being careful to propagate uncertainty.
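
One way to propagate uncertainty from state-level estimates to an electoral college estimate is by simulation. The sketch below assumes made-up states, electoral votes, vote shares, and standard errors, and it assumes state errors are independent and normally distributed. That independence assumption is a simplification that a real forecast would need to examine, because polling errors tend to be correlated across states.

```python
import random

# Hypothetical inputs: state -> (electoral votes, estimated two-party share
# for candidate A, standard error of that estimate). All values are made up.
STATES = {
    "State 1": (10, 0.52, 0.03),
    "State 2": (20, 0.49, 0.03),
    "State 3": (15, 0.50, 0.04),
}


def simulate_electoral_votes(states, n_sims=10_000, seed=853):
    """Draw each state's share from a normal distribution and tally votes."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    totals = []
    for _ in range(n_sims):
        votes = 0
        for electoral_votes, share, std_error in states.values():
            # Candidate A takes the state's votes if the drawn share exceeds half.
            if rng.gauss(share, std_error) > 0.5:
                votes += electoral_votes
        totals.append(votes)
    return totals


totals = simulate_electoral_votes(STATES)
needed = sum(ev for ev, _, _ in STATES.values()) // 2 + 1
win_probability = sum(t >= needed for t in totals) / len(totals)
print(f"Candidate A wins in {win_probability:.1%} of simulations")
```

The output is a distribution of electoral vote totals rather than a single number, so the paper can report an interval or a win probability instead of a point estimate alone.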

F.6.4 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used, then the entire chat must be included in the usage text file. If not, then paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section--you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs, explanation, and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomena in the world that happened to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Present the model clearly using appropriate mathematical notation and plain English explanations, defining every component. Ensure the model is well-explained, justified, appropriate, and balanced in complexity (neither overly simplistic nor unnecessarily complicated) for the situation. Variables should be well-defined and correspond with those in the data section. Explain how modeling decisions reflect aspects discussed in the data section, including why specific features are included (e.g., using age rather than age groups, treating province effects as levels, categorizing gender). If applicable, define and justify sensible priors for Bayesian models. Clearly discuss underlying assumptions, potential limitations, and situations where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking, such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses, addressing model convergence and diagnostics (although much of the detail may be in the appendix). Include any alternative models or variants considered, their strengths and weaknesses, and the rationale for the final model choice.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations should be numbered and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Pollster review 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
The evaluation provides a thorough understanding of how something goes from being a person's opinion to part of a result for this pollster. Provide a thorough overview and evaluation of the pollster's methodology, and sampling approach, highlighting both its strengths and limitations. Use this section to demonstrate knowledge of surveys and sampling and link your evaluation to the literature. Be precise and scientific--your review should not sound like an ad for the pollster.
Idealized methodology 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
The proposed methodology is well thought through, realistic, and would achieve the goals. Use this section to demonstrate knowledge of surveys and sampling and link your evaluation to the literature and simulation.
Idealized survey 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The survey should have an introductory section and include the details of a contact person. The survey questions should be well constructed and appropriate to the task. The questions should have an appropriate ordering. A final section should thank the respondent. Question type should be varied and appropriate. Use this section to demonstrate knowledge of surveys.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or Great Expectations.
Parquet 0 - 'No';
1 - 'Yes'
The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless handled sophisticatedly. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use base pipe not magrittr pipe. Comment on and close all GitHub issues. Deal with all branches.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
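
The Simulation and Tests rows of the rubric ask for simulated variables that relate to each other and for checks on the result. The sketch below illustrates the shape of that workflow in plain Python. The variable names, ranges, and the assumed one-point difference between polling methods are all hypothetical, and both steps are shown in one block only for brevity; in a submission they would live in separate scripts and use a testing package.

```python
import random


def simulate_polls(n=200, seed=853):
    """Simulate poll rows in which support depends on the polling method."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    rows = []
    for _ in range(n):
        method = rng.choice(["online", "phone"])
        sample_size = rng.randint(400, 2000)
        # Assumed relationship: phone polls run one point higher on average,
        # and larger samples have less noise.
        mean = 48.0 + (1.0 if method == "phone" else 0.0)
        noise_sd = 50.0 / sample_size ** 0.5
        pct = min(100.0, max(0.0, rng.gauss(mean, noise_sd)))
        rows.append({"method": method, "sample_size": sample_size, "pct": pct})
    return rows


def check_polls(rows):
    """Return a list of failed checks; an empty list means all checks passed."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    if any(r["method"] not in {"online", "phone"} for r in rows):
        failures.append("unexpected polling method")
    if any(not 400 <= r["sample_size"] <= 2000 for r in rows):
        failures.append("sample size out of range")
    if any(not 0.0 <= r["pct"] <= 100.0 for r in rows):
        failures.append("percentage outside 0-100")
    return failures


simulated = simulate_polls()
print(check_polls(simulated))
```

Writing the checks as a function that takes any list of rows means the same checks can be pointed at the actual dataset once it has been downloaded and cleaned.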

F.6.5 Previous examples

F.7 Spofforth Paper

F.7.1 Task

  • Working as part of a team of one to three people, please forecast the popular vote of the upcoming US election using multilevel regression with post-stratification and then write a short paper telling a story.
  • This requires individual-level survey data, post-stratification data, and a model that brings them together. Given the expense of collecting these data, and the privilege of having access to them, please be sure to properly cite all datasets that you use.
  • You will need to:
    • Get access to an individual-level survey dataset.
    • Get access to a post-stratification dataset.
    • Clean and prepare both these datasets to make them useable together.
    • Estimate a model using the survey dataset.
    • Apply the trained model to the post-stratification dataset to forecast the election result.
  • You should use this starter folder.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class paper.
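
The final modeling step in the list above, applying the model to the post-stratification dataset, amounts to a population-weighted average of cell-level predictions. The sketch below shows only that arithmetic. The cells, predicted shares, and population counts are hypothetical, and in practice the predictions would come from the fitted multilevel model rather than being typed in.

```python
# Hypothetical cell-level predictions from a fitted model: the estimated
# share supporting a candidate in each (age group, state) cell.
predicted_support = {
    ("18-29", "State 1"): 0.60,
    ("30+", "State 1"): 0.45,
    ("18-29", "State 2"): 0.55,
    ("30+", "State 2"): 0.40,
}

# Hypothetical population counts for the same cells, as would come from a
# post-stratification dataset such as a census.
population = {
    ("18-29", "State 1"): 200,
    ("30+", "State 1"): 800,
    ("18-29", "State 2"): 300,
    ("30+", "State 2"): 700,
}


def poststratify(predictions, counts):
    """Population-weighted average of the cell-level predictions."""
    if predictions.keys() != counts.keys():
        raise ValueError("cells must match between the two datasets")
    total = sum(counts.values())
    return sum(predictions[cell] * counts[cell] for cell in counts) / total


print(round(poststratify(predicted_support, population), 4))
```

The check that both datasets share the same cells reflects the cleaning step in the list: the survey and post-stratification variables must use matching categories before they can be combined.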

F.7.2 FAQ

  • How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.

F.7.3 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools such as Copilot were used this must be mentioned. If chat tools such as ChatGPT were used, then the entire chat must be included in the usage text file. If not, then paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section--you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there need to be graphs, explanation, and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomena in the world that happened to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Present the model clearly using appropriate mathematical notation and plain English explanations, defining every component. Ensure the model is well-explained, justified, appropriate, and balanced in complexity (neither overly simplistic nor unnecessarily complicated) for the situation. Variables should be well-defined and correspond with those in the data section. Explain how modeling decisions reflect aspects discussed in the data section, including why specific features are included (e.g., using age rather than age groups, treating province effects as levels, categorizing gender). If applicable, define and justify sensible priors for Bayesian models. Clearly discuss underlying assumptions, potential limitations, and situations where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking, such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses, addressing model convergence and diagnostics (although much of the detail may be in the appendix). Include any alternative models or variants considered, their strengths and weaknesses, and the rationale for the final model choice.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations should be numbered and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material, should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or great expectations.
Parquet 0 - 'No';
1 - 'Yes'
The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, and nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless it is handled in a sophisticated way. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use the base pipe (|>), not the magrittr pipe (%>%). Comment on and close all GitHub issues. Deal with all branches.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
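
As a rough illustration of what the Simulation row asks for, the sketch below simulates a small dataset in which one variable depends on two others and on their interaction. It is written in Python using only the standard library; the same structure carries over to R. The function name, the variable names (`age`, `has_degree`, `income`), the coefficients, and the seed are all hypothetical and chosen only for illustration.

```python
import random


def simulate_survey(n=1000, seed=853):
    """Simulate hypothetical respondents whose income depends on age,
    education, and the interaction between the two."""
    rng = random.Random(seed)  # local generator, so the seed is explicit
    rows = []
    for _ in range(n):
        age = rng.randint(18, 80)  # inclusive at both ends
        has_degree = rng.random() < 0.4
        # Interaction: each extra year of age adds more income for
        # respondents with a degree than for those without one.
        slope = 900 if has_degree else 400
        income = 20_000 + slope * (age - 18) + rng.gauss(0, 5_000)
        rows.append(
            {
                "age": age,
                "has_degree": has_degree,
                "income": round(max(income, 0)),  # income cannot be negative
            }
        )
    return rows


simulated = simulate_survey()
```

Because the generator is seeded, calling `simulate_survey()` twice with the same seed returns identical data, which is what makes the simulation script reproducible.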

F.7.4 Previous examples

F.8 Final paper

F.8.1 Task

  • Working individually and in an entirely reproducible way please write a paper that involves original work to tell a story with data.
  • Develop a research question that is of interest to you, then obtain or create a relevant dataset and put together a paper that answers it.
  • You should use this starter folder.
  • You are welcome to use R, Python, or a combination.
  • Please include an Appendix where you focus on an aspect of surveys, sampling or observational data, related to your paper. This should be an in-depth exploration, akin to the “idealized methodology/survey/pollster methodology” sections of Paper 2. Some aspect of this is likely covered in the Measurement sub-section of your Data section, but this Appendix would be much more detailed, and might include aspects like simulation, links to the literature, explorations and comparisons, among other aspects.
  • Some dataset ideas:
    • Jacob Filipp’s groceries dataset here.
    • The IJF procurement dataset here (you would then be eligible for the IJF best paper award).
    • Revisiting an Open Data Toronto dataset (you would then be eligible for the Open Data Toronto best paper award).
    • A dataset from Appendix D.
  • All the guidance and expectations from earlier papers apply to this one.
  • Submit a link to the GitHub repo. Please do not update the repo after the deadline.
  • There should be no evidence that this is a class paper.

F.8.2 Checks

  • Do not use a dataset from Kaggle, UCI, or Statistica. Mostly this is because everyone else uses these datasets and so it does nothing to make you stand out to employers, but there are sometimes also concerns that the data are old, or you do not know the provenance.

F.8.3 FAQ

  • Can I work as part of a team? No. You must have some work that is entirely your own. You really need your own work to show off for job applications etc.
  • How much should I write? Most students submit something that has 10 to 20 pages of main content, with additional pages devoted to appendices, but it is up to you. Be concise but thorough.
  • Can I use any model? You are welcome to use any model, but you need to thoroughly explain it and this can be difficult for more complicated models. Start small. Pick one or two predictors. Once you get that working, then complicate it. Remember that every predictor and the outcome variable needs to be graphed and explained in the data section.
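
One way to read the "start small" advice is sketched below: fit a model with a single predictor to simulated data where the true parameters are known, and check that they are recovered before adding complexity. This is an illustrative assumption about workflow rather than a requirement of the paper. The function `fit_simple_regression`, the variables `hours` and `score`, and the true values (intercept 50, slope 3) are hypothetical.

```python
import random


def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(853)
# Hypothetical predictor and outcome, simulated with known parameters.
hours = [rng.uniform(0, 10) for _ in range(200)]
score = [50 + 3 * h + rng.gauss(0, 2) for h in hours]

intercept, slope = fit_simple_regression(hours, score)
print(f"intercept: {intercept:.2f}, slope: {slope:.2f}")
```

If the estimates land close to the values used in the simulation, the one-predictor model is working and a second predictor can be added with more confidence.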

F.8.4 Rubric

Component Range Requirement
R/Python cited 0 - 'No';
1 - 'Yes'
R (and/or Python) is properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Data cited 0 - 'No';
1 - 'Yes'
Data are properly referenced in the main content and in the reference list. If not, then paper gets 0 overall.
Class paper 0 - 'No';
1 - 'Yes'
There is no sign this is a class project. Check the rproj and folder names, the README, the title, code comments, etc. If there is any sign this is a class paper, then paper gets 0 overall.
LLM documentation 0 - 'No';
1 - 'Yes'
A separate paragraph or dot point must be included in the README about whether LLMs were used, and if so how. If auto-complete tools, such as Copilot, were used, this must be mentioned. If chat tools, such as ChatGPT, were used, then the entire chat must be included in the usage text file. If not, then paper gets 0 overall.
Title 0 - 'Poor or not done';
1 - 'Yes';
2 - 'Exceptional'
An informative title is included that explains the story, and ideally tells the reader what happens at the end of it. 'Paper X' is not an informative title. Use a subtitle to convey the main finding. Do not use puns (you can break this rule once you're experienced).
Author, date, and repo 0 - 'Poor or not done';
2 - 'Yes'
The author, date of submission in unambiguous format, and a link to a GitHub repo are clearly included. (The latter likely, but not necessarily, through a statement such as: 'Code and data supporting this analysis are available at: LINK').
Abstract 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
An abstract is included and appropriately pitched to a non-specialist audience. The abstract answers: 1) what was done, 2) what was found, and 3) why this matters (all at a high level). Likely four sentences. Abstract must make clear what we learn about the world because of this paper.
Introduction 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The introduction is self-contained and tells a reader everything they need to know including: 1) broader context to motivate; 2) some detail about what the paper is about; 3) a clear gap that needs to be filled; 4) what was done; 5) what was found; 6) why it is important; 7) the structure of the paper. A reader should be able to read only the introduction and know what was done, why, and what was found. Likely 3 or 4 paragraphs, or 10 per cent of total.
Estimand 0 - 'Poor or not done';
1 - 'Done'
The estimand is clearly stated, in its own paragraph, in the introduction.
Data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
A sense of the dataset should be communicated to the reader. The broader context of the dataset should be discussed. All variables should be thoroughly examined and explained. Explain if there were similar datasets that could have been used and why they were not. If variables were constructed then this should be mentioned, and high-level cleaning aspects of note should be mentioned, but this section should focus on the destination, not the journey. It is important to understand what the variables look like by including graphs, and possibly tables, of all observations, along with discussion of those graphs and the other features of these data. Summary statistics should also be included, as well as any relationships between the variables. You are not doing EDA in this section; you are talking the reader through the variables that are of interest. If this becomes too detailed, then appendices could be used. Basically, for every variable in your dataset that is of interest to your paper there needs to be graphs and explanation and maybe tables.
Measurement 0 - 'Poor or not done';
2 - 'Some issues';
3 - 'Acceptable';
4 - 'Exceptional'
A thorough discussion of measurement, relating to the dataset, is provided in the data section. Please ensure that you explain how we went from some phenomenon that happened in the world to an entry in the dataset that you are interested in.
Model 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Present the model clearly using appropriate mathematical notation and plain English explanations, defining every component. Ensure the model is well-explained, justified, appropriate, and balanced in complexity (neither overly simplistic nor unnecessarily complicated) for the situation. Variables should be well-defined and correspond with those in the data section. Explain how modeling decisions reflect aspects discussed in the data section, including why specific features are included (e.g., using age rather than age groups, treating province effects as levels, categorizing gender). If applicable, define and justify sensible priors for Bayesian models. Clearly discuss underlying assumptions, potential limitations, and situations where the model may not be appropriate. Mention the software used to implement the model, and provide evidence of model validation and checking, such as out-of-sample testing, RMSE calculations, test/training splits, or sensitivity analyses, addressing model convergence and diagnostics (although much of the detail may be in the appendix). Include any alternative models or variants considered, their strengths and weaknesses, and the rationale for the final model choice.
Results 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Results will likely require summary statistics, tables, graphs, images, and possibly statistical analysis or maps. There should also be text associated with all these aspects. Show the reader the results by plotting them where possible. Talk about them. Explain them. That said, this section should strictly relay results. Regression tables must not contain stars. Use modelsummary to include a table and graph of the estimates.
Discussion 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Some questions that a good discussion would cover include (each of these would be a sub-section of something like half a page to a page): What is done in this paper? What is something that we learn about the world? What is another thing that we learn about the world? What are some weaknesses of what was done? What is left to learn or how should we proceed in the future?
Prose 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Acceptable';
6 - 'Exceptional'
All aspects of the submission should be free of noticeable typos and spelling mistakes, and be grammatically correct. Prose should be coherent, concise, clear, and mature. Remove unnecessary words. Do not use the following words/phrases: 'advanced', 'all-encompassing', 'apt', 'backdrop', 'beg the question', 'bridge/s the/a gap', 'comprehensive', 'critical', 'crucial', 'data-driven', 'delve/s', 'drastic', 'drives forward', 'elucidate/ing', 'embark/s', 'exploration', 'fill that/the/a gap', 'fresh perspective/s', 'hidden factor/s', 'imperative', 'insights from', 'insight/s', 'interrogate', 'intricate', 'intriguing', 'key insights', 'kind of', 'leverage', 'meticulous/ly', 'multifaceted', 'novel', 'nuance', 'offers/ing crucial insight', 'plummeted', 'profound', 'rapidly', 'reveals', 'shed/s light', 'shocking', 'soared', 'unparalleled', 'unveiling', 'valuable', 'wanna'.
Cross-references 0 - 'Poor or not done';
1 - 'Yes'
All figures, tables, and equations should be numbered and referred to in the text using cross-references. The telegraphing paragraph in the introduction should cross-reference the rest of the paper.
Captions 0 - 'Poor or not done';
1 - 'Acceptable';
2 - 'Excellent'
All figures and tables have detailed and meaningful captions. They should be sufficiently detailed so as to make the main point of the figure/table clear even without the accompanying text. Do not say 'Histogram of...' or whatever else the figure type is.
Graphs and tables 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Graphs and tables must be included in the paper and should be well-formatted, clear, and digestible. Graphs should be made using ggplot2 and tables should be made using tinytable. They should serve a clear purpose and be fully self-contained. Graphs and tables should be appropriately sized, colored, and labelled. Variable names should not be used as labels. Tables should have an appropriate number of decimal places and use comma separators for thousands. Don't use boxplots, but if you must then you must overlay the actual data.
Surveys, sampling, and observational data 0 - 'Poor or not done';
2 - 'Many issues';
4 - 'Some issues';
6 - 'Acceptable';
8 - 'Impressive';
10 - 'Exceptional'
Please include an appendix where you focus on some aspect of surveys, sampling or observational data, related to your paper. This should be an in-depth exploration, akin to the idealized methodology/survey/pollster methodology sections of Paper 2. Some aspect of this is likely covered in the Measurement sub-section of your Data section, but this would be much more detailed, and might include aspects like simulation and linkages to the literature, among other aspects.
Referencing 0 - 'Poor or not done';
3 - 'One minor issue';
4 - 'Perfect'
All data, software, literature, and any other relevant material should be cited in-text and included in a properly formatted reference list made using BibTeX. A few lines of code from Stack Overflow or similar would be acknowledged just with a comment in the script immediately preceding the use of the code rather than here. But larger chunks of code should be fully acknowledged with an in-text citation and appear in the reference list. Check in-text citations and that you have not accidentally used (@my_cite) when you needed [@my_cite]. R packages and all other aspects should be correctly capitalized, and names should be correct, e.g. use double braces appropriately in the BibTeX file.
Commits 0 - 'Poor or not done';
2 - 'Done'
There are at least a handful of different commits, and they have meaningful commit messages.
Sketches 0 - 'Poor or not done';
2 - 'Done'
Sketches are included in a labelled folder of the repo, appropriate, and of high-quality.
Simulation 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
The script is clearly commented and structured. All variables are appropriately simulated in a sophisticated way including appropriate interaction between simulated variables.
Tests 0 - 'Poor or not done';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
High-quality extensive suites of tests are written for both the simulated and actual datasets. These suites must be in separate scripts. The suite should be extensive and put together in a sophisticated way using packages like testthat, validate, pointblank, or great expectations.
Parquet 0 - 'No';
1 - 'Yes'
The analysis dataset is saved as a parquet file. (Note that the raw data should be saved in whatever format it came.)
Reproducible workflow 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
Use an organized repo with a detailed README and an R project. Thoroughly document code and include a preamble, comments, and nice structure, and style code with styler or lintr. Use seeds appropriately. Avoid leaving install.packages() in the code unless it is handled in a sophisticated way. Exclude unnecessary files from the repo; avoid hard-coded paths and setwd(). Use the base pipe (|>), not the magrittr pipe (%>%). Comment on and close all GitHub issues. Deal with all branches.
Enhancements 0 - 'Poor or not done';
1 - 'Some issues';
2 - 'Acceptable';
3 - 'Impressive';
4 - 'Exceptional'
You should pick at least one of the following and include it to enhance your submission: 1) a datasheet for the dataset; 2) a model card for the model; 3) a Shiny application; 4) an R package; or 5) an API for the model. If you would like to include an enhancement not on this list please email the instructor with your idea.
Miscellaneous 0 - 'None';
1 - 'Notable';
2 - 'Remarkable';
3 - 'Exceptional'
There are always students that excel in a way that is not anticipated in the rubric. This item accounts for that.
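
One possible shape for the suite described in the Tests row is sketched below. It uses Python's built-in `unittest` module rather than any of the packages named in the rubric, purely to keep the example self-contained. The dataset, its column names (`ward`, `year`, `count`), and the expected ranges are hypothetical stand-ins for whatever the analysis data actually contain.

```python
import unittest

# Hypothetical analysis data. In a real project this would be read from the
# saved analysis file rather than typed in.
analysis_data = [
    {"ward": "Ward A", "year": 2021, "count": 14},
    {"ward": "Ward B", "year": 2021, "count": 0},
    {"ward": "Ward A", "year": 2022, "count": 9},
    {"ward": "Ward B", "year": 2022, "count": 21},
]


class TestAnalysisData(unittest.TestCase):
    def test_expected_columns(self):
        for row in analysis_data:
            self.assertEqual(set(row), {"ward", "year", "count"})

    def test_no_missing_values(self):
        for row in analysis_data:
            self.assertTrue(all(value is not None for value in row.values()))

    def test_year_in_expected_range(self):
        for row in analysis_data:
            self.assertIn(row["year"], range(2021, 2023))

    def test_count_is_non_negative_integer(self):
        for row in analysis_data:
            self.assertIsInstance(row["count"], int)
            self.assertGreaterEqual(row["count"], 0)

    def test_one_row_per_ward_and_year(self):
        keys = [(row["ward"], row["year"]) for row in analysis_data]
        self.assertEqual(len(keys), len(set(keys)))


# Build and run the suite explicitly so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAnalysisData)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a repo, one script of this kind would target the simulated data and a separate script would target the actual data, in line with the requirement that the two suites live in separate scripts.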

F.8.5 Previous examples


  1. Gilad gave explicit permission and encouragement to be included in this list.

  2. This terminology is used following Barba (2018), but it is the opposite of that used by BITSS.

  3. The US GSS is recommended here because individual-level data are publicly available, and the dataset is well-documented. But university students in particular countries often have access to individual-level data that are not available to the public, and if this is the case then you are welcome to use that instead. Students at Australian universities will likely have access to individual-level data from the Australian General Social Survey, and could use that. Students at Canadian universities will likely have access to individual-level data from the Canadian General Social Survey and may like to use that.