Online Appendix B — Python essentials

Prerequisites

Key concepts and skills

Software and packages

B.1 Introduction

Python is a general-purpose programming language created by Guido van Rossum. Python version 0.9.0 was released in February 1991, and the current version, 3.13, was released in October 2024. It was named Python after Monty Python’s Flying Circus.

Python is a popular language in machine learning, but it was designed, and is more commonly used, for more general software applications. This means that we will especially rely on packages when we use Python for data science. The use of Python in this book is focused on data science, rather than the other, more general, uses for which it was developed.

Knowing R will allow you to pick up Python for data science quickly, because the main data science packages in both languages need to solve the same underlying problems.

B.2 Python, VS Code, and uv

We could use Python within RStudio, but another option is to use what the broader community uses, which is VS Code. You can download VS Code for free and then install it. If you have difficulties with this, then in the same way we started with Posit Cloud and then shifted to our local machine, you could initially use Google Colab.

Open VS Code (Figure B.1 (a)), and open a new Terminal: Terminal -> New Terminal (Figure B.1 (b)). We can then install uv, which is a Python package manager, by putting curl -LsSf https://astral.sh/uv/install.sh | sh into the Terminal and pressing “return/enter” afterwards (Figure B.1 (c)). Finally, to install Python we can use uv by putting uv python install into that Terminal and pressing “return/enter” afterwards (Figure B.1 (d)).

(a) Opening VS Code
(b) Opening a Terminal in VS Code
(c) Install uv
(d) Install Python
Figure B.1: Opening VS Code and a new terminal and then installing uv and Python

B.3 Getting started

B.3.1 Project set-up

We are going to get started with an example that downloads some data from Open Data Toronto. To start, we need to create a project, which will allow all our code to be self-contained.

Open VS Code and open a new Terminal: “Terminal” -> “New Terminal”. Then use Unix shell commands to navigate to where you want to create your folder. For instance, use ls to list all the folders in the current directory, then move to one using cd and then the name of the folder. If you need to go back up one level then use cd followed by two dots (cd ..).

Once you are happy with where you are going to create this new folder, we can use uv init in the Terminal to do this, pressing “return/enter” afterwards (cd then moves to the new folder “shelter_usage”).

uv init shelter_usage
cd shelter_usage

By default, there will be a script in the example folder. We want to use uv run to run that script, which will then create a project environment for us.

uv run hello.py

A project environment is specific to that project. We will use the package numpy to simulate data. We need to add this package to our environment with uv add.

uv add numpy

We can then modify hello.py to use numpy to simulate from the Normal distribution.

import numpy as np

def main():
    np.random.seed(853)

    mu, sigma = 0, 1
    sample_sizes = [10, 100, 1000, 10000]
    differences = []

    for size in sample_sizes:
        sample = np.random.normal(mu, sigma, size)
        sample_mean = np.mean(sample)
        diff = abs(mu - sample_mean)
        differences.append(diff)
        print(f"Sample size: {size}")
        print(f"  Difference between sample and population mean: {round(diff, 3)}")
        
if __name__ == "__main__":
    main()

After we have modified and saved hello.py we can run it with uv run in exactly the same way as before.
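The script above seeds NumPy's global random state with np.random.seed. As a minimal sketch of an alternative, the same experiment can be written with a Generator created by np.random.default_rng, which is the approach the simulation script later in this appendix takes. The function name mean_differences is hypothetical and is used only for illustration.

```python
import numpy as np


def mean_differences(sample_sizes, mu=0.0, sigma=1.0, seed=853):
    """Return |mu - sample mean| for each sample size (illustrative helper)."""
    # A Generator keeps its own state, so seeding it does not affect other code
    rng = np.random.default_rng(seed)
    differences = []
    for size in sample_sizes:
        sample = rng.normal(loc=mu, scale=sigma, size=size)
        differences.append(abs(mu - sample.mean()))
    return differences


if __name__ == "__main__":
    sizes = [10, 100, 1000, 10000]
    for size, diff in zip(sizes, mean_differences(sizes)):
        print(f"Sample size: {size}, difference: {round(diff, 3)}")
```

Because the seed is passed in each time, calling the function twice with the same arguments should give the same differences, which makes the result straightforward to check.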

At this point we should close VS Code. We want to re-open it to make sure that our project environment is working as it needs to. In VS Code, a project is a self-contained folder. You can open a folder with “File” -> “Open Folder…” and then select the relevant folder, in this case “shelter_usage”. You should then be able to re-run uv run hello.py and it should work.
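One way to confirm which environment a script is using is to ask Python itself. The following is a minimal sketch, not part of the original workflow: the file name check_env.py and the function name environment_report are hypothetical. It assumes uv's default behavior of keeping the project environment in a .venv folder inside the project, in which case running it with uv run check_env.py should report an executable path inside “shelter_usage”.

```python
import sys


def environment_report():
    """Describe the interpreter that is running this script."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a virtual environment, prefix differs from base_prefix
        "in_virtual_env": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```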

B.3.2 Plan

We first used this dataset in Chapter 2, but as a reminder, for each day and each shelter, it records the number of people who used the shelter. So the dataset that we want to simulate is something like Figure B.2 (a), and we want to create a table of the average daily number of occupied beds each month, along the lines of Figure B.2 (b).

(a) Quick sketch of a dataset
(b) Quick sketch of a table of the average number of beds occupied each month
Figure B.2: Sketches of a dataset and table related to shelter usage in Toronto

B.3.3 Simulate

We would like to more thoroughly simulate the dataset that we are interested in. We will use polars to provide a dataframe to store our simulated results, so we should add this to our environment with uv add.

uv add polars

Create a new Python file called 00-simulate_data.py.

#### Preamble ####
# Purpose: Simulates a dataset of daily shelter usage
# Author: Rohan Alexander
# Date: 12 November 2024
# Contact: rohan.alexander@utoronto.ca
# License: MIT
# Pre-requisites:
# - Add `polars`: uv add polars
# - Add `numpy`: uv add numpy
# - `datetime` is part of the Python standard library, so it does not need to be added


#### Workspace setup ####
import polars as pl
import numpy as np
from datetime import date

rng = np.random.default_rng(seed=853)


#### Simulate data ####
# Simulate 10 shelters and some set capacity
shelters_df = pl.DataFrame(
    {
        "Shelters": [f"Shelter {i}" for i in range(1, 11)],
        "Capacity": rng.integers(low=10, high=100, size=10),
    }
)

# Create a series of dates covering every day of 2024
dates = pl.date_range(
    start=date(2024, 1, 1), end=date(2024, 12, 31), interval="1d", eager=True
).alias("Dates")

# Convert dates into a data frame
dates_df = pl.DataFrame(dates)

# Combine dates and shelters
data = dates_df.join(shelters_df, how="cross")

# Add usage as a Poisson draw, capped at each shelter's capacity
capacity = data["Capacity"].to_numpy()
poisson_draw = rng.poisson(lam=capacity)
usage = np.minimum(poisson_draw, capacity)

data = data.with_columns(pl.Series("Usage", usage))

print(data)

Write tests

B.3.4 Acquire

Download data

Apply tests

B.3.5 Explore

Manipulate the data

import polars as pl

Make a graph

import matplotlib.pyplot as plt

B.3.6 Share

Add it all into Quarto

Add GitHub to VS Code. Why environments.

B.4 Python

For loops
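The simulation in hello.py already used a for loop to step through sample sizes. As a brief sketch of the common patterns, the following loops over a list directly, uses enumerate() to also get each item's position, and uses range() to loop over a sequence of integers. The variable names are illustrative.

```python
sample_sizes = [10, 100, 1000, 10000]

# Loop over the items of a list directly
for size in sample_sizes:
    print(f"Sample size: {size}")

# enumerate() also gives the position of each item, starting from 0
labels = []
for position, size in enumerate(sample_sizes):
    labels.append(f"{position}: {size}")

# range(1, 4) yields 1, 2, 3: the end point is excluded
total = 0
for i in range(1, 4):
    total += i

print(labels)
print(total)
```

One difference from R is that Python counts positions from 0, and range() excludes its end point.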

List comprehensions
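The simulation script built its shelter names with a list comprehension, which is a compact way to create a list from a loop. The sketch below shows that comprehension for three shelters, the equivalent for loop, and a comprehension that filters with a trailing if. The variable names are illustrative.

```python
# The comprehension used in the simulation script, for three shelters
shelters = [f"Shelter {i}" for i in range(1, 4)]

# The equivalent for loop
shelters_loop = []
for i in range(1, 4):
    shelters_loop.append(f"Shelter {i}")

# A comprehension can also filter with a trailing `if`
even_squares = [n**2 for n in range(1, 7) if n % 2 == 0]

print(shelters)
print(even_squares)
```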

B.5 Making graphs

matplotlib

seaborn

B.6 Exploring polars

B.6.1 Importing data

B.6.2 Dataset manipulation with joins and pivots

B.6.3 String manipulation

B.6.4 Factor variables

B.7 Exercises

Practice

Quiz

Task

Free Replit “100 Days of Code” Python course.