Online Appendix B — Python essentials
Prerequisites
Key concepts and skills
Software and packages
Python (Python Software Foundation 2024)
datetime>=5.5
uv
polars
B.1 Introduction
Python is a general-purpose programming language created by Guido van Rossum. Python version 0.9.0 was released in February 1991, and the current version, 3.13, was released in October 2024. It was named after Monty Python's Flying Circus.
Python is a popular language in machine learning, but it was designed, and is more commonly used, for more general software applications. This means that we will especially rely on packages when we use Python for data science. The use of Python in this book is focused on data science, rather than the other, more general, uses for which it was developed.
Knowing R will allow you to pick up Python for data science quickly, because the main data science packages in each language solve the same underlying problems.
B.2 Python, VS Code, and uv
We could use Python within RStudio, but another option is to use what is used more broadly by the community, which is VS Code. You can download VS Code for free here and then install it. If you have difficulties with this, then in the same way that we started with Posit Cloud and then shifted to our local machine, you could initially use Google Colab here.
Open VS Code (Figure B.1 (a)), and open a new Terminal: Terminal -> New Terminal (Figure B.1 (b)). We can then install uv, which is a Python package manager, by putting curl -LsSf https://astral.sh/uv/install.sh | sh into the Terminal and pressing "return/enter" afterwards (Figure B.1 (c)). Finally, to install Python we can use uv by putting uv python install into that Terminal and pressing "return/enter" afterwards (Figure B.1 (d)).
B.3 Getting started
B.3.1 Project set-up
We are going to get started with an example that downloads some data from Open Data Toronto. To start, we need to create a project, which will allow all our code to be self-contained.
Open VS Code and open a new Terminal: "Terminal" -> "New Terminal". Then use Unix shell commands to navigate to where you want to create your folder. For instance, use ls to list all the folders in the current directory, then move to one using cd followed by the name of the folder. If you need to go back up one level, use cd ..
Once you are happy with where you are going to create this new folder, we can use uv init in the Terminal to do this, pressing "return/enter" afterwards (cd then moves to the new folder "shelter_usage").

uv init shelter_usage
cd shelter_usage
By default, there will be a script, hello.py, in the new folder. We want to use uv run to run that script, which will then create a project environment for us.
uv run hello.py
A project environment is specific to that project. We will use the package numpy to simulate data. We need to add this package to our environment with uv add.
uv add numpy
We can then modify hello.py to use numpy to simulate from the Normal distribution.
import numpy as np


def main():
    np.random.seed(853)

    mu, sigma = 0, 1
    sample_sizes = [10, 100, 1000, 10000]
    differences = []

    for size in sample_sizes:
        sample = np.random.normal(mu, sigma, size)
        sample_mean = np.mean(sample)
        diff = abs(mu - sample_mean)
        differences.append(diff)
        print(f"Sample size: {size}")
        print(f" Difference between sample and population mean: {round(diff, 3)}")


if __name__ == "__main__":
    main()
After we have modified and saved hello.py, we can run it with uv run in exactly the same way as before.
At this point we should close VS Code. We want to re-open it to make sure that our project environment is working as it needs to. In VS Code, a project is a self-contained folder. You can open a folder with "File" -> "Open Folder…" and then select the relevant folder, in this case "shelter_usage". You should then be able to re-run uv run hello.py and it should work.
B.3.2 Plan
We first used this dataset in Chapter 2, but as a reminder, for each day, for each shelter, there is a count of the number of people that used the shelter. So the dataset that we want to simulate is something like Figure B.2 (a), and we want to create a table of the average daily number of occupied beds in each month, along the lines of Figure B.2 (b).
B.3.3 Simulate
We would like to more thoroughly simulate the dataset that we are interested in. We will use polars to provide a dataframe to store our simulated results, so we should add this to our environment with uv add.
uv add polars
Create a new Python file called 00-simulate_data.py.
#### Preamble ####
# Purpose: Simulates a dataset of daily shelter usage
# Author: Rohan Alexander
# Date: 12 November 2024
# Contact: rohan.alexander@utoronto.ca
# License: MIT
# Pre-requisites:
# - Add `polars`: uv add polars
# - Add `numpy`: uv add numpy
# - Add `datetime`: uv add datetime
#### Workspace setup ####
import polars as pl
import numpy as np
from datetime import date
rng = np.random.default_rng(seed=853)

#### Simulate data ####
# Simulate 10 shelters and some set capacity
shelters_df = pl.DataFrame(
    {
        "Shelters": [f"Shelter {i}" for i in range(1, 11)],
        "Capacity": rng.integers(low=10, high=100, size=10),
    }
)

# Create data frame of dates
dates = pl.date_range(
    start=date(2024, 1, 1), end=date(2024, 12, 31), interval="1d", eager=True
).alias("Dates")

# Convert dates into a data frame
dates_df = pl.DataFrame(dates)

# Combine dates and shelters
data = dates_df.join(shelters_df, how="cross")

# Add usage as a Poisson draw
poisson_draw = rng.poisson(lam=data["Capacity"])
usage = np.minimum(poisson_draw, data["Capacity"])

data = data.with_columns([pl.Series("Usage", usage)])
print(data)
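As before, we can run this script from the Terminal with uv run.

uv run 00-simulate_data.py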
Write tests
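One way to do this, as a rough sketch, is to add a few assertions to the end of 00-simulate_data.py. The particular checks below are our own choices rather than a fixed list, and they assume the data frame is called data, as above.

#### Test simulated data ####
# Usage should never be negative, and should never exceed capacity
assert (data["Usage"] >= 0).all()
assert (data["Usage"] <= data["Capacity"]).all()

# There should be 10 shelters and 366 dates (2024 is a leap year)
assert data["Shelters"].n_unique() == 10
assert data["Dates"].n_unique() == 366

# There should be no missing values
assert data.null_count().to_numpy().sum() == 0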
B.3.4 Acquire
Download data
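As a sketch of one approach, we could create a new file, say 01-download_data.py, that downloads a CSV of the data and saves a local copy. The URL below is a placeholder, not the actual link; it would need to be taken from the relevant dataset page on the Open Data Toronto portal.

#### Download data ####
import urllib.request

import polars as pl

# Placeholder URL: replace with the CSV link from Open Data Toronto
url = "https://example.com/daily-shelter-overnight-occupancy.csv"

# Save a local copy so that we do not need to download it again
urllib.request.urlretrieve(url, "shelter_usage_raw.csv")

raw_data = pl.read_csv("shelter_usage_raw.csv")
print(raw_data.head())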
Apply tests
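We could then apply checks of the same kind as for the simulated data to the downloaded dataset. This is only a sketch: the checks, and any column names they use, would need to be adapted to the actual dataset.

import polars as pl

raw_data = pl.read_csv("shelter_usage_raw.csv")

# The dataset should not be empty
assert raw_data.height > 0

# There should be no missing values (this may need relaxing for real data)
assert raw_data.null_count().to_numpy().sum() == 0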
B.3.5 Explore
Manipulate the data
import polars as pl
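# A sketch of one way to manipulate the data, assuming the simulated dataset
# from 00-simulate_data.py has been saved as "shelter_usage.csv" with the
# columns "Dates", "Shelters", "Capacity", and "Usage" (the file name here
# is an assumption).
data = pl.read_csv("shelter_usage.csv", try_parse_dates=True)

# Average daily number of occupied beds in each month
monthly_average = (
    data.with_columns(pl.col("Dates").dt.month().alias("Month"))
    .group_by("Month")
    .agg(pl.col("Usage").mean().alias("Average daily usage"))
    .sort("Month")
)

print(monthly_average)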
Make a graph
import matplotlib.pyplot as plt
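# A sketch of one way to graph the monthly averages, assuming the
# monthly_average data frame from the manipulation sketch above, and that
# matplotlib has been added with: uv add matplotlib
fig, ax = plt.subplots()

ax.plot(
    monthly_average["Month"].to_list(),
    monthly_average["Average daily usage"].to_list(),
)
ax.set_xlabel("Month")
ax.set_ylabel("Average daily number of occupied beds")

fig.savefig("monthly_average.png")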
B.4 Python
For loops
List comprehensions
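As a small sketch of the difference, the loop below builds a list of squared numbers one element at a time, while the list comprehension produces the same list in a single expression.

# For loop: build the list one element at a time
squares = []
for number in range(1, 6):
    squares.append(number**2)

# List comprehension: the same result in a single expression
squares = [number**2 for number in range(1, 6)]

print(squares)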
B.5 Making graphs
matplotlib
seaborn
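As a sketch of the two approaches, the code below draws the same histogram of simulated data with matplotlib and then with seaborn, which is a higher-level interface built on top of matplotlib. Both packages would need to be added with uv add matplotlib and uv add seaborn.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Simulate some data to plot
rng = np.random.default_rng(seed=853)
values = rng.normal(loc=0, scale=1, size=1000)

# matplotlib: build the plot directly on an Axes object
fig, ax = plt.subplots()
ax.hist(values, bins=30)
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("histogram_matplotlib.png")

# seaborn: a higher-level interface on top of matplotlib
plt.figure()
sns.histplot(values, bins=30)
plt.savefig("histogram_seaborn.png")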
B.6 Exploring polars
B.6.1 Importing data
B.6.2 Dataset manipulation with joins and pivots
B.6.3 String manipulation
B.6.4 Factor variables
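As a sketch that touches on each of these topics, the code below constructs two small data frames (standing in for imported data), joins them, pivots to a wide layout, manipulates a string column, and casts a column to a categorical (factor-like) type. The data frames and column names are made up for illustration; importing real data would typically use pl.read_csv().

import polars as pl

# Importing data would usually look something like:
# data = pl.read_csv("some_file.csv")

# Two small data frames to work with
products = pl.DataFrame(
    {
        "product": ["apple pie", "banana bread", "carrot cake"],
        "category": ["pie", "bread", "cake"],
        "price": [20, 12, 25],
    }
)

sales = pl.DataFrame(
    {
        "product": ["apple pie", "carrot cake", "apple pie"],
        "quarter": ["Q1", "Q1", "Q2"],
        "units": [3, 2, 5],
    }
)

# Join: add product details to each sale
sales_with_details = sales.join(products, on="product", how="left")

# Pivot: one row per product, one column per quarter
sales_wide = sales_with_details.pivot(
    on="quarter", index="product", values="units"
)

# String manipulation: upper-case the product names
products = products.with_columns(pl.col("product").str.to_uppercase())

# Factor variables: store category as a categorical type
products = products.with_columns(pl.col("category").cast(pl.Categorical))

print(sales_wide)
print(products)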
B.7 Exercises
Practice
Quiz
Task
Free Replit “100 Days of Code” Python course.