LearnPythonForExcel – Training Manual
The complete printable companion. 17 modules, 112 lessons, every walkthrough and exercise.
Save as PDF: In your browser, hit Ctrl+P (Windows) or Cmd+P (Mac), then choose "Save as PDF." Print-optimised styling kicks in automatically. Total length: roughly 200 pages.
Table of Contents
Module 01 – Introduction to Python
- What is Python, really?
- Why Python, why Excel, why now?
- How Python actually runs
- The Excel-to-Python mental model
- The tools you'll meet in this course
- How to use this course
Module 02 – Setting Up Your Environment
- Install Python (the easy way, with Anaconda)
- Install VS Code – your code editor
- The terminal – a friendly introduction
- Turn on Python inside Excel (=PY())
- A two-minute tour of Jupyter notebooks
- When things break – reading errors without panic
Module 03 – Variables and Data Types
- Variables – Python's version of a named cell
- Numbers – ints, floats, and the math operators
- Strings – text values and how to wrangle them
- Booleans and comparisons
- None, type conversions, and reading the type
- Talking to the user – input() and print()
- Comments – notes to your future self
Module 04 – Control Flow
- if / else – making decisions
- for loops – doing something for every item
- while loops – repeat until done
- List comprehensions – one-line loops
- try / except – handling errors gracefully
- Putting it together – a mini contact-cleaner
Module 05 – Functions
- def – defining your own function
- Returning multiple values & unpacking
- Scope – where variables live
- lambda, map, filter – the functional shortcuts
- Modules and imports – using other people's code
- Anatomy of a real script
Module 06 – Data Structures
- Lists – the everyday ordered collection
- Dictionaries – key-value lookups
- Tuples and when to use them
- Sets – unique values, fast
- Nested structures – lists of dicts, dicts of lists
- Working with dates
- Files and paths – the pathlib essentials
Module 07 – Working with Files
- Reading and writing CSV files
- Reading Excel files with pandas
- openpyxl – when you need cell-level control
- Looping through many files
- Text files, JSON, and other formats
- Encoding, BOMs, and other file gotchas
Module 08 – Python inside Excel – =PY()
- =PY() – the first ten formulas
- The xl() helper – getting Excel data into Python
- Pandas in =PY() – the killer combination
- Charts in =PY() with matplotlib and seaborn
- When =PY() is the right tool – and when it isn't
- Twelve =PY() recipes you'll use
- Limits, costs, and security of =PY()
Module 09 – Pandas Crash Course
- DataFrames – Python's spreadsheet
- Selecting rows and columns – loc and iloc
- Sorting, counting, and ranking
- Aggregation – sum, mean, agg
- groupby – the move that pays for the course
- Pivot tables – pivot_table()
- Joining and merging – pandas's VLOOKUP
- Applying functions – apply, map, vectorisation
Module 10 – Data Cleaning and Transformation
- Handling missing values
- Finding and removing duplicates
- String cleanup at scale
- Working with dates in pandas
- Numeric cleanup – currencies, percentages, outliers
- Reshape – melt and pivot
- Categoricals and ordered types
Module 11 – Data Visualization
- Matplotlib basics
- Plotting directly from a DataFrame
- Seaborn – prettier defaults, less code
- Formatting for stakeholders
- Inserting charts into Excel
- Tiny dashboards with subplots
Module 12 – Automating Boring Excel Tasks
- Merging many files into one
- Splitting one file into many
- Filling a template workbook
- Renaming and organising files in bulk
- Sending emails from Python
- Scheduling: run it automatically
- An automation recipe book
Module 13 – Scenario: Finance & Accounting
- User story: bank reconciliation
- User story: budget vs actual variance
- User story: consolidating subsidiary budgets
- User story: invoice batch processing
- User story: accounts receivable aging
- User story: multi-currency consolidation
- Project: the monthly close pack
Module 14 – Scenario: Data & Business Analyst
- User story: cleaning a 4-million-row export
- User story: joining six tables to answer one question
- User story: customer cohort retention
- User story: did the change work? A quick A/B test
- User story: trend, seasonality, and 'is this Tuesday weird?'
- User story: a refreshing dashboard sheet
- User story: shipping the analysis – narrative report
Module 15 – Scenario: HR & Operations
- User story: monthly headcount report
- User story: payroll prep
- User story: employee data hygiene
- User story: build the org chart programmatically
- User story: engagement survey analysis
- User story: PTO accrual and balances
- User story: new-hire onboarding tracker
Module 16 – Scenario: Sales & Marketing
- User story: dedupe a 50,000-lead list
- User story: roll up the marketing campaign report
- User story: funnel analysis
- User story: the Monday pipeline report
- User story: RFM segmentation
- User story: simple multi-touch attribution
- User story: sales territory rebalancing
Module 17 – Capstone Projects
- Capstone 1 – Monthly close pack (Finance)
- Capstone 2 – Self-refreshing dashboard (Analyst)
- Capstone 3 – Headcount + payroll prep + tracker (HR)
- Capstone 4 – Pipeline + campaign report (Sales/Marketing)
- Where to go from here
Module 01 – Introduction to Python
What Python is, why it matters for Excel users, what it can (and can't) do, and how to use this course.
This first module is short and friendly. You'll come out of it knowing what Python actually is, why it suddenly matters for spreadsheet users, and what kinds of problems it solves. No installation yet – we save that for Module 2.
💡 Before you start
You don't need to install anything for this module. Just read. Open the lessons in any order if you like. Each one is 5-10 minutes.
01.1 – What is Python, really?
A plain-English answer – no jargon – and why a spreadsheet user should care.
What you'll learn
- Explain in one sentence what Python is
- Name three things Python is famously good at
- Spot the difference between Python and a spreadsheet formula
Python in one sentence
Python is a way of writing instructions for your computer in language that looks a lot like English. That's it. Once you write the instructions down in a file, you tell the computer to run the file, and the computer does what you said.
Here's a complete, working piece of Python:
print("Hello, Excel friend!")
You can probably guess what that does before you've ever written a line of code: it prints the words "Hello, Excel friend!" to the screen. That readability is the whole point of Python.
Where Python sits in the world
Python is one of the two or three most-used programming languages on the planet. NASA uses it. Netflix uses it. Your bank uses it. The data science team at virtually every company uses it. And starting in 2023, Microsoft Excel itself uses it – there's a Python interpreter built right into the spreadsheet now, accessed through a formula called =PY().
That last bit is the reason this course exists. The world's most popular spreadsheet now ships with the world's most popular data language built in. If you've spent years living inside Excel, you've just been handed a superpower.
What is Python good at?
Roughly the kind of work an Excel power-user already does – and then a lot more:
- Working with data. Cleaning it, reshaping it, summarising it, joining tables, building charts. Python's pandas library is what most of the world's data analysts use.
- Automating repetitive computer tasks. Renaming a thousand files, merging a hundred spreadsheets, sending a personalised email to each row of a list.
- Getting data from places. Web pages, APIs, PDFs, databases, emails – Python can pull data from almost anything.
- Calculation and math. Anything from a budget reconciliation to a machine-learning model.
- Building things. Websites, mobile apps, games, scientific simulations. (We won't touch this side in this course – but it's there if you want it later.)
What's the difference between Python and an Excel formula?
Excel formulas live inside a cell. They look at the cells around them and compute a single value. They're a single-step calculator.
A piece of Python – usually called a script – lives in its own file. It can do a thousand things one after another: open a file, change some numbers, save the file, then open another file, then send an email. It's a recipe with as many steps as you want.
The new =PY() formula in Excel mashes the two together: you write a few lines of Python inside a spreadsheet cell and you get a value back, just like a formula. We cover that in Module 8.
🧠 The mental model
Think of Python as "what an Excel formula would be if it could do anything and weren't stuck in one cell."
What this course will (and won't) teach you
We focus 100% on Python for spreadsheet-style work. That means: data, files, automation, reports, charts, and the new =PY() in Excel. We don't cover web apps, game development, or computer science theory. If you want those later, you'll have plenty of Python background to pick them up easily.
Key takeaways
- Python is a programming language designed to be readable, like English.
- It's now built into Excel as the =PY() formula.
- It's especially good at data, automation, and pulling information from anywhere.
- You don't need to learn the whole language to be useful with it – this course teaches the parts that pay off fastest.
Try this β no setup required
- Open an online Python sandbox in another tab (search for "python online" – Google has one built in, or use replit.com).
- Type print("Hello, " + "world") and hit Run.
- Now change "world" to your name. Run it again.
- Congratulations – you've just written and run a Python program.
01.2 – Why Python, why Excel, why now?
The case for an Excel user learning Python in 2026 – five concrete reasons.
What you'll learn
- List five concrete advantages of using Python with Excel
- Identify two situations where Excel alone hits its limits
- Explain what changed in 2023 that made this course worth taking
The short version
Excel is brilliant for a lot of things – but it has limits. Python doesn't have most of those limits. And now you can use Python inside Excel, instead of as a separate tool. That's why this is suddenly a no-brainer for spreadsheet pros.
Five concrete reasons
1. Files that crash Excel are no problem for Python
The classic limit you've probably bumped into: Excel maxes out at 1,048,576 rows. You get a 4-million-row CSV from the data team, you open it in Excel, and… it shows you the first million. Python handles 4 million rows without breaking a sweat. 40 million? Fine.
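For a quick taste in code, here's a minimal sketch; the filename is made up for illustration:
import pandas as pd

# A hypothetical export far past Excel's row limit
df = pd.read_csv("huge_export.csv")
print(len(df))     # every row is loaded, not just the first million
print(df.head())   # peek at the first five rows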
2. The same task, done in one click instead of fifty
If you do the same Excel work every Monday – open a file, clean three columns, run a pivot, format a report, save it as a PDF, email it – Python turns that into one script you run once. The first time costs you an hour. Every week after costs you the time it takes to double-click an icon.
π‘ The "annual ROI" rule of thumb
A 30-minute weekly task is 26 hours a year. If automating it takes you four hours up front, you break even by Week 9 and save 22 hours every year after that. Most automations pay back faster than that.
3. Things Excel can't do at all
Excel cannot scrape a web page for you. It can't pull data from most APIs cleanly. It can't read most PDFs intelligently. It can't send a hundred personalised emails. Python does all four with a handful of lines.
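As a taste of the API case, here's a minimal sketch; the URL is a made-up placeholder, and you'd install the requests library first (pip install requests):
import requests

# Hypothetical endpoint that returns a JSON list of orders
response = requests.get("https://api.example.com/orders")
orders = response.json()   # now a plain Python list of dicts
print(len(orders), "orders fetched")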
4. Reusability – the script outlives the spreadsheet
A complex Excel workbook is fragile. Someone changes a column heading and three pivot tables break. A Python script for the same job is text – you can read it, version it, share it, fix it, and re-use it on next month's data without copy-pasting anything.
5. It's now inside Excel
In 2023 Microsoft added a built-in Python runtime to Excel (Microsoft 365). You write =PY() in a cell, type a few lines of Python, hit Enter, and the result lands in the spreadsheet like any other formula result. No installs, no separate tool, no copy-paste between apps. We dive into this in Module 8.
Where Excel is still the right answer
This is not a course that argues you should stop using Excel. Excel is unbeatable for:
- Interactive, click-and-explore data work.
- Sharing a single number-crunching artefact with non-technical colleagues.
- Quick what-if analysis where you tweak a value and watch totals update.
- Anything visual and small – a budget, a model, a tracker.
The right mental model: Excel is your workshop; Python is your power tools. Use both.
The big "what changed"
Three things changed in the past few years that make this course worth taking right now:
- 2023: Python landed inside Excel via =PY().
- The data world standardised on pandas. If you can use pandas, you can talk shop with any analyst in any company.
- AI assistants now write much of the boring code for you. You no longer need to memorise syntax. You need to know what's possible and how to read code – both of which this course teaches.
Key takeaways
- Python lets you blow past Excel's row limit and automate the repetitive stuff.
- It can do things Excel cannot – scrape sites, hit APIs, parse PDFs, send emails.
- It now lives inside Excel as =PY(), so you don't have to leave your spreadsheet.
- Use both: Excel for interactive work, Python for everything that should "just run."
Reflect for a minute
Write down, on paper or in a sticky note:
- One Excel task you do at least weekly.
- How long it takes each time.
- Multiply by 52. That's the time you'll get back per year by the time we're done.
Keep this note handy β you'll come back to it in Module 12 when you automate exactly this task.
01.3 – How Python actually runs
What 'running' Python means, the three places you'll run it, and the difference between a script and a notebook.
What you'll learn
- Describe what happens when you 'run' Python code
- Identify the three main places code gets run (REPL, script, notebook)
- Pick the right one for a given situation
What 'running' Python means
When you write Python code, it's just plain text in a file. The computer can't do anything with text on its own. To make it do something, you hand the text to a program called the Python interpreter. The interpreter reads the text from top to bottom, one line at a time, and acts on each instruction as it gets to it.
Two consequences worth knowing:
- Order matters. Line 3 can use a value created on Line 2, but Line 2 can't use a value created on Line 3.
- If a line breaks, everything after it stops. Python tells you the line number, which makes fixing it easier than it sounds.
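Both consequences, in a tiny sketch:
price = 10           # line 1: create price
total = price * 3    # line 2: fine – price already exists
print(total)         # 30
# print(discount)    # would crash: discount isn't defined yet,
#                    # and nothing after this line would run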
The three places you'll run code
1. The REPL – for trying things
REPL stands for Read–Eval–Print Loop. It's an interactive prompt: you type one line, hit Enter, and Python answers immediately. Great for experimenting.
>>> 2 + 2
4
>>> "hello".upper()
'HELLO'
You'll meet the REPL in Module 2.
2. A script – for things you want to run again
A script is a file ending in .py that holds Python code. You write the code once, save the file, and you can run it whenever you want, as many times as you want.
# clean_monday_report.py
import pandas as pd
df = pd.read_excel("monday_raw.xlsx")
df = df.dropna()
df.to_excel("monday_clean.xlsx", index=False)
print("Done.")
This is what 90% of "automating boring stuff" looks like.
3. A notebook – for analysis
A Jupyter notebook (file extension .ipynb) is somewhere between a script and a Word document. You write code in little blocks called cells, and the output appears right below each cell. You can mix code, charts, and notes in the same document.
Notebooks are the standard format for data analysis – you'll see them everywhere in finance, science, and business analytics.
And one more: =PY() inside Excel
A fourth place Python now runs is inside an Excel cell itself, via the new =PY() formula. The code runs on a server in the cloud and the result lands back in the cell. We cover this in Module 8.
Which should I use when?
| Situation | Use |
| "What does this single line do?" | REPL |
| "I want to run this same job every Monday." | Script (.py) |
| "I'm exploring data and want notes alongside." | Notebook (.ipynb) |
| "I want a Python value inside a spreadsheet cell." | =PY() in Excel |
You don't have to install anything to try this
If you want a taste before Module 2, you can run Python in a browser tab right now β try online-python.com or replit.com.
Key takeaways
- "Running" Python means handing your code to the interpreter, which reads it top-to-bottom.
- The REPL is for trying single lines. Scripts are for repeatable jobs. Notebooks are for analysis with notes.
- =PY() is a fourth option that runs Python inside an Excel cell.
- You can try all of this in a browser before you install anything.
Try it now (browser only, no install)
- Open online-python.com.
- Paste this code and hit Run:
name = "your name here"
print("Hi, " + name + "!")
print("Today's Python lesson is going well.")
- Change "your name here" to your actual name, and run it again.
- You just used a variable, a function, and string addition – three of Python's most-used features.
01.4 – The Excel-to-Python mental model
Translate things you already know – cells, formulas, ranges, tables – into their Python equivalents.
What you'll learn
- Map five Excel concepts to their Python equivalents
- Recognise a pandas DataFrame as 'just a sheet'
- Feel a lot less intimidated by Python syntax
You already know more than you think
People often say "Python is hard because it's a programming language." The truth is: if you've used Excel formulas, you've already met most of the concepts. They just have different names. Let's translate.
The translation table
| Excel concept | Python equivalent | Why they're the same |
| A cell holding a value: A1 = 100 | A variable: a1 = 100 | Both are a named container for a value. |
| A formula: =SUM(A1:A10) | A function call: sum(numbers) | Both take inputs and give back one output. |
| A range: A1:A10 | A list: [100, 200, 300, ...] | Both are an ordered group of values. |
| A table / structured range | A pandas DataFrame | Both are rows-and-columns with named headers. |
| A VLOOKUP table | A dictionary: {"apple": 1.20, ...} | Both are a key-to-value lookup. |
| IF(A1>10, "big", "small") | if a1 > 10: "big" else: "small" | Both branch on a condition. |
| Filling a formula down 100 rows | A for loop | Both apply the same logic across many items. |
| A macro (VBA) | A Python script | Both are saved instructions you run on demand. |
| An add-in | A library (pandas, numpy, etc.) | Both add capabilities the base tool doesn't have. |
Three quick "Wait, that's it?" examples
Adding numbers in a range
Excel:
=SUM(A1:A5)
Python:
numbers = [10, 20, 30, 40, 50]
total = sum(numbers)
print(total) # 150
Looking up a value
Excel:
=VLOOKUP("apple", A1:B10, 2, FALSE)
Python:
prices = {"apple": 1.20, "pear": 1.50, "banana": 0.40}
print(prices["apple"]) # 1.2
"If A1 is greater than 10, say 'big', otherwise 'small'"
Excel:
=IF(A1>10, "big", "small")
Python:
a1 = 14
if a1 > 10:
result = "big"
else:
result = "small"
print(result) # big
The DataFrame: the most important translation
If there's only one Python concept you remember from this lesson, make it the DataFrame. It's the pandas library's word for a table of data. It has rows. It has named columns. You can sort it, filter it, group it, summarise it. It is, for all practical purposes, an Excel sheet that lives inside Python.
Once you internalise that mental model – "DataFrame = Sheet" – every example in Modules 9 through 16 will feel familiar.
import pandas as pd
# This DataFrame is basically:
# Name Sales
# Alice 1200
# Bob 800
# Carol 1500
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Carol"],
"Sales": [1200, 800, 1500],
})
print(df["Sales"].sum()) # 3500 β like =SUM(B2:B4)
💡 If you keep one slide
Variable = cell. Function = formula. List = range. DataFrame = sheet. Dictionary = VLOOKUP. Loop = fill down.
Key takeaways
- Most Python concepts have a direct Excel equivalent.
- A DataFrame is the pandas word for a sheet – keep that mapping in your head.
- If you can write formulas, you can write Python. You're not starting from zero.
Translate these three formulas to Python
Try to translate each on paper before scrolling down. Answers below.
=A1 + A2
=AVERAGE(A1:A5)
=IF(A1="VIP", A2*0.9, A2)
Show answers
a1 + a2
sum(numbers) / len(numbers) – or with pandas: df["A"].mean()
a2 * 0.9 if a1 == "VIP" else a2
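If you want to check those in a sandbox, here's a runnable version with made-up sample values:
# Stand-ins for the worksheet cells
numbers = [4, 8, 15, 16, 23]   # A1:A5
a1, a2 = "VIP", 200            # for the IF example

print(numbers[0] + numbers[1])          # =A1 + A2
print(sum(numbers) / len(numbers))      # =AVERAGE(A1:A5)
print(a2 * 0.9 if a1 == "VIP" else a2)  # =IF(A1="VIP", A2*0.9, A2)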
01.5 – The tools you'll meet in this course
A field guide to the names you'll keep hearing – Python, pandas, Jupyter, VS Code, openpyxl, =PY(), and so on.
What you'll learn
- Recognise the names of the core tools and libraries
- Understand the role each one plays
- Stop being intimidated by the jargon
A field guide to the names
When you read Python tutorials online, you'll see the same dozen words over and over. Here's what each one is and why you'll meet it in this course.
Python
The language itself. When someone says "I wrote it in Python," they mean they wrote code in this language and ran it with the Python interpreter.
Anaconda
A free bundled installer that gives you Python plus the most-used libraries (pandas, NumPy, Jupyter, matplotlib) in one click. The friendliest way to get set up. We use it in Module 2.
pip
Python's package installer. To add a new library to your Python: pip install thelibrary. Anaconda gives you the common ones already, but pip is how you get anything else.
Library / package
A bundle of pre-written Python code that adds capabilities. Excel calls these "add-ins"; Python calls them libraries. There are hundreds of thousands of them for free.
pandas
The library that turns Python into a souped-up spreadsheet. You will use this constantly. It introduces the DataFrame β that "Python sheet" we just met. Full crash course in Module 9.
NumPy
A library for fast number-crunching. Pandas is built on top of NumPy. You'll touch it occasionally.
openpyxl
The library that reads and writes .xlsx Excel files at a low level β sheets, cells, formatting, formulas, charts. We use it in Modules 7 and 12 for things pandas can't do.
matplotlib & seaborn
The two most common libraries for making charts in Python. We use them in Module 11.
Jupyter notebook
The "code + notes + charts in one document" format we mentioned earlier. File extension .ipynb. Comes free with Anaconda.
VS Code (Visual Studio Code)
A free, modern code editor from Microsoft. It's what we'll use to write Python scripts. There are others (PyCharm, Sublime, etc.) but VS Code is the easiest starting point and works on Mac, Windows, and Linux.
Terminal / command line / PowerShell
The text-based way to talk to your computer. Macs call it Terminal; Windows calls it PowerShell or Command Prompt. You'll need it occasionally – mainly to install things with pip. We hand-hold you through it in Module 2.
=PY() in Excel
The formula that runs Python inside an Excel cell. Microsoft 365 only, as of 2026. Full module on this in Module 8.
Microsoft 365 / Excel for the web
You need the modern subscription version of Excel to use =PY(). The classic boxed Excel 2019 / 2021 does not include it.
How they fit together
A typical day in this course looks like:
- You open VS Code.
- You write a Python script that imports pandas.
- Pandas reads an Excel file (under the hood it uses openpyxl).
- You clean and reshape the data.
- You save the result back to Excel, or you build a chart with matplotlib.
- Optionally: you switch to Excel, type =PY(), and re-run a small piece of the analysis live in a cell.
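Strung together, that day might look something like the sketch below; the filenames and column names are invented for illustration:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("raw_data.xlsx")            # pandas reads the file (openpyxl under the hood)
df = df.dropna()                               # clean
summary = df.groupby("region")["sales"].sum()  # reshape and summarise
summary.to_excel("summary.xlsx")               # save the result back to Excel
summary.plot(kind="bar")                       # chart it with matplotlib
plt.savefig("summary_chart.png")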
💡 You don't need to memorise this
This page is a glossary, not a quiz. Bookmark it. As we use each tool you'll learn what it actually feels like.
Key takeaways
- The names you'll hear most: Python, Anaconda, pip, pandas, openpyxl, Jupyter, VS Code, =PY().
- Each one has one job – they fit together like Lego.
- You don't need to memorise them now; you'll meet each in context.
Spot the tool
For each of these jobs, name the tool from this lesson that does it:
- "Open and edit an .xlsx file at the cell level."
- "Write a chart with code."
- "Run a piece of Python inside an Excel cell."
- "Install a new library."
- "Work on data with code, output, and notes mixed in one document."
Show answers
- openpyxl
- matplotlib (or seaborn)
- =PY()
- pip
- A Jupyter notebook
01.6 – How to use this course
Three suggested paths through the material – total beginner, refresher, job-focused – and how to get the most out of each lesson.
What you'll learn
- Pick a route through the course that fits your level and goals
- Know what to expect on every lesson page
- Use the exercises effectively without getting stuck
Three suggested paths
🟢 Path A – Total beginner (recommended for most people)
Read Modules 1 through 12 in order, doing every exercise. Then pick whichever role module (13–16) fits your job, then Module 17 for the capstone.
Time: Roughly 12–20 hours, depending on how much you experiment.
🟡 Path B – "I've coded a little before"
Skim Modules 1 and 3–6, do all of Module 2 (setup) properly, then go full speed from Module 7 onwards.
Time: Roughly 6–10 hours.
🔴 Path C – "I have a specific Excel pain to fix today"
Do Module 2 (setup), Module 7 (files), Module 9 (pandas), and then jump straight to your role module. You can fill in the foundations later – most readers do.
Time: 4–6 hours.
What every lesson looks like
Every lesson page follows the same shape, so you always know where to look:
- Objectives box at the top – what you'll know by the end.
- Body – the actual lesson, broken into short sections with code samples.
- Step-by-step walkthroughs in numbered cards for any real task.
- Key takeaways – the three to five bullets you'd write on a sticky note.
- Exercise – a small, do-it-yourself task that's quick to get a win on.
- Prev / Next at the bottom – the lesson flows top-to-bottom through the course.
How to do the exercises (without giving up)
Three rules:
- Try first, then read the answer. Even five minutes of "I don't know what to do" builds the muscle that reading never will.
- It's fine to look things up. Real professional Python work involves Googling roughly every 90 seconds. We're not training you for an exam.
- An error message is a clue, not a verdict. The last line of an error usually tells you exactly what went wrong. We have a whole lesson on this (Module 2, Lesson 6).
💡 Pro tip: use AI as a study partner
You can paste any error message β or any lesson exercise β into ChatGPT, Claude, or Copilot and ask "explain this like I'm a spreadsheet user." That's not cheating; it's how everyone codes in 2026.
One small commitment
Do not skip Module 2 (Setup). It's the only module where the work isn't optional. If your Python isn't installed correctly, none of the rest of the course works on your machine. Take the 30 minutes; you'll never need to do it again.
What to do when you're done
You'll know you've "graduated" when you can:
- Open any spreadsheet in Python.
- Write a script that does the equivalent of three things you'd normally do by hand in Excel.
- Run a small =PY() formula inside Excel.
- Read an unfamiliar piece of Python code and roughly guess what it does.
From there, the world is your oyster – web scraping, dashboards, machine learning, automation, scripting your whole job. We list further-reading recommendations at the end of Module 17.
Key takeaways
- Pick the path that matches your level (A, B, or C).
- Every lesson has the same shape β objectives, body, walkthrough, takeaways, exercise.
- Don't skip Module 2.
- Look things up; that's how real Python work happens.
Set your goal
- Pick your path: A, B, or C.
- Block 30 minutes on your calendar in the next two days for Module 2 (setup).
- You're ready. Click "Next lesson →" and let's go.
Module 02 – Setting Up Your Environment
Install Python the easy way (Anaconda), set up VS Code, get a Jupyter notebook running, and turn on =PY() in Excel.
This is the most important module in the course because everything that follows depends on a working setup. The good news: we use the friendliest path on each platform, with screenshots-worth-of-description for every step.
⚠️ Do this module on a real computer
You can't do the rest of the course on a phone or tablet. Use a Mac, Windows, or Linux laptop or desktop.
02.1 – Install Python (the easy way, with Anaconda)
Get Python and the big data libraries on your machine in one installer. Works the same way on Mac and Windows.
What you'll learn
- Download and install Anaconda for your operating system
- Verify Python is installed correctly
- Know where Python lives on your computer
Why Anaconda?
You can install Python directly from python.org. For an Excel user with no programming background, that path leads to lots of head-scratching. Anaconda is a free bundled installer made by a data-science company. Click one button, wait ten minutes, and you have:
- Python itself
- pandas, NumPy, matplotlib, seaborn – everything we'll use
- Jupyter notebooks
- A graphical app called Anaconda Navigator so you don't have to use the command line for basic tasks
If you've already installed Python a different way and it's working, you can skip this lesson. Otherwise, use Anaconda.
Step-by-step walkthrough
Go to the Anaconda download page
Open https://www.anaconda.com/download in your browser.
You may be asked for an email. You can skip that and click "Download" directly.
Pick your operating system
The page detects your OS automatically. Confirm it's right (Windows, macOS, or Linux) and choose the 64-bit installer. If you have a recent Mac with an M-series chip, pick the Apple Silicon version.
Run the installer
Double-click the downloaded file. Click "Continue" / "Next" through every screen. The defaults are correct for what you need.
On Windows, when asked, leave "Add Anaconda to my PATH environment variable" unchecked (Anaconda's own recommendation). We'll use the Anaconda Prompt for command-line work.
The install takes 5–10 minutes and uses about 3 GB of disk space.
Open the Anaconda Navigator
On Mac: open Launchpad and find Anaconda-Navigator.
On Windows: hit the Start menu and search for Anaconda Navigator.
The Navigator is a friendly home screen for everything Anaconda installed. Take a look around. The "Home" tab shows tiles for tools you can launch – Jupyter Notebook, JupyterLab, Spyder, etc.
Confirm Python is working
Click the green Launch button under Jupyter Notebook. Your browser will open with a file-browser view. Click New → Python 3 (ipykernel) to make a new notebook.
In the first cell, type:
print("Python is alive!")
Hold Shift and press Enter. You should see Python is alive! printed below the cell. 🎉
Confirm pandas works
Make a new cell (click the + button) and run:
import pandas as pd
print(pd.__version__)
You should see a version number like 2.2.2. That means pandas is installed and importable. You're done.
⚠️ If Jupyter doesn't open
On Mac, you may need to allow Terminal access. On Windows, your antivirus might flag the launch – allow it. If nothing happens at all, open Anaconda Navigator again and try clicking Launch a second time.
Key takeaways
- Anaconda is the easiest one-click install of Python plus all the data libraries.
- Anaconda Navigator is your graphical home base.
- You've now run your first piece of Python.
Sanity-check exercise
In a new notebook cell, run:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("All three are installed!")
If you see "All three are installed!" with no red error, you're ready for the rest of the course.
02.2 – Install VS Code – your code editor
Set up Visual Studio Code, Microsoft's free code editor, with the Python extension.
What you'll learn
- Install Visual Studio Code
- Add the Python extension
- Open and run a first .py file
Why an editor on top of Anaconda?
Jupyter notebooks (which Anaconda gave you) are great for analysis. But when you start writing reusable scripts β the kind you run on a schedule β you'll want a proper code editor. VS Code is free, from Microsoft, and works the same way on Mac, Windows, and Linux.
Download VS Code
Go to code.visualstudio.com and click the big blue download button. Run the installer with default settings.
On Windows, when asked, tick the boxes for "Add 'Open with Code' action to Windows Explorer file context menu" and "Register Code as an editor for supported file types". Those make life easier later.
Open VS Code for the first time
Launch it. You'll see a welcome screen. The little icons on the far left are your activity bar β files, search, source control, and a "puzzle piece" for Extensions.
Install the Python extension
Click the puzzle-piece icon (Extensions). In the search box, type Python. Install the one made by Microsoft (it has millions of downloads).
That extension is what makes VS Code understand Python: syntax colours, error highlighting, "run this file" buttons, debugging.
Point VS Code at the Anaconda Python
Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) to open the Command Palette. Type Python: Select Interpreter and pick the option that says "Anaconda" or has anaconda3 in the path.
This tells VS Code: "When I press Run, use the Python that came with Anaconda."
Make your first script
In VS Code, go to File → New File. Save it as hello.py on your Desktop. Paste in:
print("Hello from VS Code!")
for i in range(5):
print(f" Line {i + 1}")
Click the green ▶ "Run" arrow in the top right (or press F5 and pick "Python File"). The output appears in the panel at the bottom.
💡 The two-window workflow
Many pros keep VS Code open for scripts and Jupyter open for exploration. You'll naturally fall into the same rhythm.
Key takeaways
- VS Code is the editor we use for .py scripts.
- The Microsoft Python extension turns it into a real Python IDE.
- Always confirm the interpreter – it must point at Anaconda's Python.
Two-minute exercise
Save a new file called about_me.py. Inside, write three lines that print your name, your job role, and what you hope to automate with Python. Run it. Notice how the output appears at the bottom of VS Code, instantly.
02.3 – The terminal – a friendly introduction
Five commands that are all you really need: cd, ls/dir, mkdir, pip, python.
What you'll learn
- Open the terminal on Mac and Windows
- Navigate folders with cd
- Install Python libraries with pip
The terminal is not as scary as it looks
The terminal is a text-based way to talk to your computer. You'll need it for exactly two things in this course: installing libraries (pip install ...), and occasionally running a Python script (python myscript.py). Five commands, and you're set.
How to open it
- Mac: Press Cmd+Space, type "Terminal", press Enter.
- Windows: Hit the Start menu and search for "Anaconda Prompt". Use that one, not regular PowerShell – it already knows about your Anaconda Python.
The five commands you actually need
1. pwd (Mac) / cd on its own (Windows) – "where am I?"
pwd
# /Users/yourname
Tells you which folder the terminal is currently looking at.
2. ls (Mac) / dir (Windows) – "what's in this folder?"
ls
# Documents Desktop Downloads Music ...
3. cd folder_name – "move into this folder"
cd Desktop # move into Desktop
cd .. # move one folder up
cd ~ # go to your home folder (Mac/Linux)
4. mkdir new_folder_name – "make a new folder"
mkdir python-projects
cd python-projects
5. pip install <library> – install a Python library
pip install openpyxl
You'll see a stream of text as pip downloads and installs the library. When the prompt comes back, it's done.
Bonus: python myscript.py – run a script
python hello.py
# Hello from VS Code!
# Line 1
# Line 2
# ...
Walkthrough: install a library
Open the terminal (or Anaconda Prompt on Windows)
Type a pip command
pip install xlsxwriter
Press Enter. Watch the lines scroll. When you see Successfully installed xlsxwriter-..., you're done.
Confirm it worked
Open a Jupyter notebook or a Python file and run:
import xlsxwriter
print(xlsxwriter.__version__)
You should see a version number, not a red error.
β οΈ "Command not found" on Mac
If
pip isn't recognised, try
pip3 instead, or open Anaconda Navigator β Environments β click the green βΆ next to
base (root) β "Open Terminal", then run
pip install ... from that terminal.
⚠️ Windows: must use the Anaconda Prompt
Regular Command Prompt may not see your Anaconda Python. Always launch the Anaconda Prompt from the Start menu for these commands.
Key takeaways
- Five commands cover 95% of what you need: pwd, ls/dir, cd, mkdir, pip install.
- On Windows, always use the Anaconda Prompt, not regular PowerShell.
- pip install is how you add any extra library.
Five-minute hands-on
- Open your terminal (or Anaconda Prompt).
- Make a new folder on your Desktop called python-for-excel.
- cd into it.
- Run pip install xlsxwriter requests beautifulsoup4.
- You just installed three libraries we'll use later. Easy.
02.4 – Turn on Python inside Excel (=PY())
Get the =PY() formula working in your Microsoft 365 Excel.
What you'll learn
- Check that your Excel version supports =PY()
- Enter your first =PY() formula
- Know the difference between regular Excel and =PY() cells
What you need
- Microsoft 365 subscription Excel (Windows or Mac). The standalone Excel 2019 / 2021 / 2024 does not include Python.
- An internet connection – the Python code runs on Microsoft's cloud servers, not on your laptop.
Don't have Microsoft 365?
You can still do this entire course. We use =PY() heavily in Module 8; everywhere else, regular Python + the Anaconda setup is enough. If you can get the company 365 license, do.
Quick setup walkthrough
Update Excel
In Excel, go to File → Account → Update Options → Update Now. Make sure you're on a recent build. Python in Excel has been generally available in Microsoft 365 since 2024.
Make a new blank workbook
Open a fresh workbook so you don't accidentally break a real one.
Click an empty cell and type the new formula
Click cell A1. Type =PY( – when you do this, Excel switches the cell into Python mode. You'll notice the formula bar turns into a multi-line area and shows a little green "PY" badge.
Inside the cell, type:
1 + 1
Press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to commit. After a brief "calculating in the cloud" moment, the cell shows 2.
Try a slightly bigger one
Click another empty cell, type =PY(, then:
import pandas as pd
df = pd.DataFrame({"city": ["NYC", "LA", "Chicago"], "pop_m": [8.3, 3.9, 2.7]})
df
Ctrl+Enter. Excel now shows the DataFrame as a special "Python object" cell. Right-click it → Output as → Excel values to spill it into proper Excel cells.
What just happened?
When you type =PY(, Excel sends your code to Microsoft's cloud, runs it in a real Python interpreter (with pandas, NumPy, matplotlib pre-installed), and returns the result back into your spreadsheet. Total round trip: usually 1–3 seconds.
💡 The two output modes
A Python cell can return its result either as a Python object (a single cell that holds, say, a whole DataFrame) or as Excel values (spilled into a range, one cell per value). Toggle with Right-click → Output as.
Key takeaways
- =PY() needs Microsoft 365 Excel.
- Type =PY(, then your Python, then Ctrl+Enter.
- Results can be a single Python object or spilled Excel values.
- Full coverage in Module 8 – this lesson is just "is it on?"
Quick check
In a fresh Excel sheet, get this to work end-to-end:
=PY(
import pandas as pd
ages = pd.Series([24, 31, 45, 22, 38])
ages.mean()
)
You should see 32 in the cell. If you do – Python in Excel is on, and Module 8 will be a breeze.
02.5 – A two-minute tour of Jupyter notebooks
Cells, run buttons, kernels, save, share – the bits of Jupyter you'll touch every day.
What you'll learn
- Create and delete cells in Jupyter
- Know the keyboard shortcuts that save time
- Save and export a notebook
Why you'll use Jupyter all the time
Jupyter is the notebook tool that came with Anaconda. It's the standard way to do data analysis in Python: you type a few lines of code, run them, see the result instantly, write a note about what you saw, type a few more lines. Repeat. Save the whole thing as a document you can share.
Open a notebook
Anaconda Navigator → Jupyter Notebook → Launch. Browser opens. Navigate to a folder. Click New → Python 3.
The anatomy of a notebook
- Cells. The boxes you type into. Two types: Code (the default – runs Python) and Markdown (formatted notes – same syntax as a README).
- The kernel. The Python interpreter behind the notebook. If your code stops working, restart the kernel (Kernel → Restart).
- The toolbar. The little ▶ button runs the selected cell.
The keyboard shortcuts that change your life
| Shortcut | What it does |
| Shift+Enter | Run the current cell and move to the next |
| Ctrl+Enter | Run the current cell and stay on it |
| Alt+Enter | Run and insert a new cell below |
| Esc then A | Insert a new cell above |
| Esc then B | Insert a new cell below |
| Esc then D D | Delete the current cell |
| Esc then M | Turn the current cell into a Markdown note |
| Esc then Y | Turn it back into code |
Walkthrough: build a tiny analysis notebook
Make a Markdown title cell
Press Esc M, then type # My first analysis. Shift+Enter. Notice it renders as a big heading.
Make a code cell that creates some data
import pandas as pd
sales = pd.DataFrame({
"rep": ["Aisha", "Ben", "Carlos", "Dora", "Eve"],
"deals": [12, 7, 19, 8, 14],
"value": [42000, 18000, 71000, 22000, 39000],
})
sales
Shift+Enter. You see the table.
A second code cell that summarises
sales["value"].sum()
Shift+Enter. You see 192000.
Save
Ctrl+S (or Cmd+S on Mac). The file is saved as Untitled.ipynb. Click the title at the top to rename it.
Export
File → Download as → HTML gives you a self-contained web page you can email. Download as → PDF gives you a printable version. Download as → Notebook (.ipynb) gives you the raw file someone else can open in their Jupyter.
π‘ The "last expression auto-prints" trick
A code cell automatically prints the result of its
last line, without needing
print(). That's why
sales on its own showed the whole table.
Key takeaways
- Jupyter = the analysis tool that came with Anaconda.
- Shift+Enter runs a cell. Esc M turns it into a note.
- The last line of a cell prints automatically.
- Save as .ipynb for editing, or export to HTML / PDF to share.
Exercise
Open Jupyter. Make a new notebook called warmup.ipynb. Build it with:
- A Markdown heading: "Warm-up exercise".
- A code cell that creates a list of your five favourite numbers.
- A code cell that prints their sum.
- A Markdown cell at the bottom that says how surprising the answer was (or wasn't).
- Save it. Export it as HTML. Open the HTML file in your browser.
02.6 – When things break – reading errors without panic
How to read a Python error message, the five most common errors, and what to do about each.
What you'll learn
- Read a Python traceback bottom-up
- Recognise the five errors you'll see most
- Know the standard fix for each
The good news about Python errors
Python errors look intimidating because they're long. They're actually friendly: the last line tells you exactly what went wrong, and the line above that tells you where. Read bottom-up and ignore the rest until you're more advanced.
Anatomy of a traceback
Traceback (most recent call last):
File "report.py", line 12, in <module>
total = price * quantitiy
NameError: name 'quantitiy' is not defined
Read bottom-up:
- The last line says what went wrong: NameError: name 'quantitiy' is not defined. We misspelled quantity.
- The line above tells you where: line 12 of report.py.
The five errors you'll see most
1. NameError: name 'x' is not defined
Meaning: Python doesn't know what x is.
Usual cause: Typo, or you forgot to create the variable, or you tried to use it before the line that defines it.
Fix: check spelling and make sure the variable is created on an earlier line.
2. SyntaxError: invalid syntax
Meaning: The code doesn't follow Python's grammar.
Usual cause: Missing colon at the end of an if/for/def line. Or a bracket that wasn't closed. Or a quote that wasn't closed.
# Wrong: no colon
if age > 18
    print("adult")

# Right
if age > 18:
    print("adult")
3. IndentationError
Meaning: Wrong amount of leading whitespace.
Usual cause: You used a tab where Python expected spaces, or you indented inconsistently.
Fix: pick four-space indentation everywhere. VS Code does this for you when you're editing a Python file.
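For instance, a minimal illustration (total is a made-up variable):
# Wrong: the line under the if isn't indented
if total > 100:
print("big order")

# Right: four spaces
if total > 100:
    print("big order")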
4. TypeError
Meaning: You're using a value of the wrong type.
Usual cause: Trying to add a number and a string.
# Wrong: "5" is a string, not a number
total = "5" + 10
# Right
total = int("5") + 10 # 15
5. FileNotFoundError
Meaning: The file you asked Python to open doesn't exist at that path.
Usual cause: Typo in the filename, wrong folder, or the file extension is wrong.
Fix: print the current folder first and double-check:
import os
print(os.getcwd()) # which folder am I in?
print(os.listdir()) # what's in this folder?
The "first response" checklist
- Read the last line of the error.
- Look at the line number just above it.
- Re-read that line in your code, slowly.
- If still stuck, paste the error into Google or an AI assistant verbatim.
- Rerun, see if the error changes. A different error = progress.
💡 Pro tip
Don't try to "guess" what's broken. Always re-read the actual error message. 95% of the time it tells you exactly what to fix.
Key takeaways
- Read tracebacks bottom-up; the last line is the actual error.
- The five common errors: NameError, SyntaxError, IndentationError, TypeError, FileNotFoundError.
- Each has a one-line fix. You'll memorise them after seeing each three times.
- Copy-paste the error into search; you're never the first to hit it.
Make these errors happen on purpose
In a Jupyter cell, run each of these and see what error you get. Then fix it.
# 1
print(secret_number)
# 2
if 5 > 3
print("yes")
# 3
"hello" + 5
# 4
open("not_a_real_file.txt")
Reading errors gets easier the more of them you see. Welcome to programming.
Module 03 – Variables and Data Types
Strings, numbers, booleans, None – the building blocks of every Python value, and how they map to Excel's cell types.
By the end of this module you'll know what every value in Python is made of: text, whole numbers, decimal numbers, true/false flags, and "nothing." You'll be reading and writing the most-common Python building blocks like an Excel formula.
03.1 – Variables – Python's version of a named cell
How to give a value a name, and the rules for what names are allowed.
What you'll learn
- Create variables and reassign them
- Follow Python's variable-naming rules
- Understand the difference between a variable and a value
What's a variable?
A variable is a name that points to a value. In Excel, you might put 100 in cell B2 and refer to it as =B2 from then on. In Python you write:
price = 100
From now on, anywhere you write price, Python substitutes in 100.
price = 100
tax = price * 0.2
print(price + tax) # 120
The single equals sign means "store this"
This trips people up at first: = in Python is not the same as = in algebra. It doesn't mean "equals." It means "take what's on the right and store it under the name on the left."
x = 5 # store 5 in x
x = x + 1 # take whatever x is (5), add 1, store back in x
print(x) # 6
The naming rules
- Start with a letter or underscore. Can't start with a number.
- Made of letters, numbers, underscores. No spaces, no dashes, no punctuation.
- Case-sensitive: Price, price, and PRICE are three different variables.
- Don't use Python's reserved words (if, for, def, class, True, None…). Your editor will warn you.
Style: snake_case
Python convention is lowercase with underscores: monthly_total, customer_name, tax_rate. Not monthlyTotal (that's Java) or MonthlyTotal (that's a class). Stick to snake_case and your code will look professional.
Walkthrough: build a tiny invoice
Set up the inputs
customer_name = "Acme Co."
quantity = 3
unit_price = 19.99
tax_rate = 0.08
Compute
subtotal = quantity * unit_price
tax = subtotal * tax_rate
total = subtotal + tax
Print the receipt
print("Customer:", customer_name)
print("Subtotal:", subtotal)
print("Tax: ", tax)
print("Total: ", total)
Output:
Customer: Acme Co.
Subtotal: 59.97
Tax: 4.7976
Total: 64.7676
Don't panic about the long decimal
We'll fix the rounding (round(total, 2) → 64.77) in the next lesson.
Key takeaways
- A variable is a name attached to a value: name = value.
- = means "store," not "is equal to."
- Use snake_case. Case matters.
- Reassigning a variable just overwrites whatever was there.
Build your own pocket calculator
- Create variables for: a meal price, a tip percentage, the number of people splitting.
- Compute total with tip.
- Compute per-person share.
- Print everything with labels.
03.2 – Numbers – ints, floats, and the math operators
Whole numbers, decimal numbers, and every math operator Python supports.
What you'll learn
- Distinguish ints from floats
- Use Python's arithmetic operators including //, %, **
- Round and format numeric output
Two flavours: int and float
int = a whole number: 0, 1, -42, 1_000_000.
float = a decimal number: 0.5, -3.14, 2.0.
You can check what kind a value is with type():
type(42) # <class 'int'>
type(3.14) # <class 'float'>
type(1_000_000) # <class 'int'> underscores are legal as thousand-separators
The math operators
| Operator | What it does | Example | Result |
| + | Add | 3 + 4 | 7 |
| - | Subtract | 10 - 3 | 7 |
| * | Multiply | 5 * 6 | 30 |
| / | Divide (always returns a float) | 10 / 4 | 2.5 |
| // | Integer divide (drops the remainder) | 10 // 4 | 2 |
| % | Modulo (the remainder) | 10 % 4 | 2 |
| ** | Power | 2 ** 10 | 1024 |
Two tricks worth knowing
Round to N decimals
round(3.14159, 2) # 3.14
round(1234.5678, 0) # 1235.0
Format with commas and currency
total = 1234567.891
print(f"${total:,.2f}") # $1,234,567.89
The f"..." string is called an f-string. We cover it properly in the next lesson; for now know that you can pop variables and format hints into one inside curly braces.
Integer vs float – when does it matter?
Mostly it doesn't. Python converts between them automatically. Two practical gotchas:
- 10 / 4 is 2.5 (a float). 10 // 4 is 2 (an int). If you need a clean whole number – for example, "how many full pages" – use // (see the sketch below).
- Some financial systems hate floats because of tiny rounding errors. We'll meet the decimal library in the Finance scenarios module.
Walkthrough: split a bill three ways
Inputs
bill = 87.45
tip_pct = 18
people = 3
Compute
tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people
Print neatly
print(f"Bill: ${bill:,.2f}")
print(f"Tip ({tip_pct}%): ${tip:,.2f}")
print(f"Total: ${total:,.2f}")
print(f"Per person: ${per_person:,.2f}")
Key takeaways
- int = whole, float = decimal.
- / always returns a float; // truncates; % gives the remainder.
- round(x, 2) rounds; f"${x:,.2f}" formats with commas and two decimals.
Power-of-two table
Print the first ten powers of two in a column, like this:
2^1 = 2
2^2 = 4
2^3 = 8
...
2^10 = 1024
Hint: use a for loop (we cover those in Module 4) or just write ten lines.
03.3 – Strings – text values and how to wrangle them
Quotes, f-strings, slicing, methods like .upper() and .replace().
What you'll learn
- Create strings with single, double, and triple quotes
- Build strings with f-strings
- Use the most common string methods
What's a string?
Text. Anything in quotes. Single or double – Python doesn't care, but pick one and be consistent.
name = "Alice"
greeting = 'Hello there'
long_text = """This is a multi-line string.
It can wrap across as many lines as you like."""
Joining strings together
first = "Alice"
last = "Smith"
full = first + " " + last # "Alice Smith"
F-strings – the modern way
Put an f right before the opening quote, and you can drop variables straight inside {}:
name = "Alice"
sales = 1234.567
print(f"{name} sold ${sales:,.2f}")
# Alice sold $1,234.57
F-strings replaced the older "{} sold ${:.2f}".format(name, sales) style. Use f-strings everywhere.
String methods you'll use weekly
| Method | What it does | Example | Result |
| .upper() | UPPERCASE | "hi".upper() | "HI" |
| .lower() | lowercase | "Hi".lower() | "hi" |
| .title() | Title Case | "alice smith".title() | "Alice Smith" |
| .strip() | Trim whitespace | " abc ".strip() | "abc" |
| .replace(a, b) | Substitute | "$1,200".replace(",", "") | "$1200" |
| .split(",") | Break into a list | "a,b,c".split(",") | ["a","b","c"] |
| " ".join([...]) | Glue list back into a string | " ".join(["a","b"]) | "a b" |
| .startswith("X") / .endswith("X") | Boolean check | "acme.csv".endswith(".csv") | True |
Slicing – pulling out a chunk
Like grabbing a substring with MID() or LEFT() in Excel.
name = "Anthropic"
name[0]    # 'A' – first character
name[-1]   # 'c' – last character
name[0:4]  # 'Anth' – first four
name[4:]   # 'ropic' – from index 4 to the end
name[::-1] # 'cipohtnA' – reversed (a Python party trick)
Walkthrough: clean a messy column of names
Sample data
raw = " alice SMITH "
Clean step by step
step1 = raw.strip() # "alice SMITH"
step2 = step1.lower() # "alice smith"
step3 = step2.title() # "Alice Smith"
print(step3)
Or chain it
cleaned = raw.strip().lower().title()
print(cleaned) # Alice Smith
Chaining like this – calling one method on the result of another – is a Python superpower. You'll see it constantly in pandas.
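As a peek ahead, here's the same chain applied to a whole pandas column at once – a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({"name": ["  alice SMITH ", "BOB jones  "]})
# .str gives you the string methods, column-wide
df["name"] = df["name"].str.strip().str.lower().str.title()
print(df["name"].tolist())   # ['Alice Smith', 'Bob Jones']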
Key takeaways
- Strings live in quotes (single, double, or triple for multi-line).
- F-strings are the modern way to mix variables into text: f"Hi {name}".
- Methods like .upper(), .strip(), .replace() do most of the cleaning work.
- Slicing – s[0:4] – pulls out substrings.
Clean a list of email addresses (on paper)
Given " Alice@Example.COM ", write the Python that turns it into a clean lowercase, trimmed email.
Answer
" Alice@Example.COM ".strip().lower()
# 'alice@example.com'
03.4 – Booleans and comparisons
True, False, and the operators that compare values.
What you'll learn
- Use comparison operators (==, !=, >, <, >=, <=)
- Combine conditions with and/or/not
- Recognise 'truthy' and 'falsy' values
True and False – capitalised
is_open = True
is_closed = False
type(is_open) # <class 'bool'>
Two values. Lowercase true/false won't work – Python is fussy about the capitals.
Comparison operators
| Operator | Meaning | Example | Result |
| == | Equal to | 5 == 5 | True |
| != | Not equal to | 5 != 5 | False |
| > | Greater than | 10 > 3 | True |
| < | Less than | 10 < 3 | False |
| >= | Greater or equal | 5 >= 5 | True |
| <= | Less or equal | 5 <= 4 | False |
⚠️ One equals vs two equals
= stores a value. == tests for equality. x = 5 is "store 5 in x". x == 5 is "is x equal to 5?". Mixing them up is the most common beginner bug.
Combining: and, or, not
age = 25
income = 80000
(age > 18) and (income > 50000) # True
(age < 18) or (income > 100000) # False
not (age < 18) # True
"Truthy" and "falsy"
Python treats some non-boolean values as if they were True or False:
- Falsy: 0, 0.0, "" (empty string), [] (empty list), None.
- Truthy: pretty much everything else.
name = ""
if name:
    print(f"Hi, {name}")
else:
    print("No name provided.")
# Output: No name provided.
The in operator – a bonus worth knowing
"alice" in ["alice", "bob", "carol"] # True
"@" in "alice@example.com" # True
Often replaces a chunky IF + OR formula in Excel.
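For instance, a quick sketch of that translation, with invented names:
# Excel: =IF(OR(A1="alice", A1="bob", A1="carol"), "known", "unknown")
user = "bob"
label = "known" if user in ["alice", "bob", "carol"] else "unknown"
print(label)   # known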
Key takeaways
- True and False with a capital letter.
- == compares; = assigns. Don't mix them up.
- Combine conditions with and, or, not.
- Empty containers ("", []) and the values 0 and None count as False.
Quick check
What does each of these print?
print(10 == 10.0)
print(10 == "10")
print("a" in "alpha")
print(0 or 5)
Answers
True – Python is happy comparing 10 (int) and 10.0 (float).
False – a number is never equal to a string of that number.
True – "a" appears in "alpha".
5 – or returns the first truthy value.
03.5 – None, type conversions, and reading the type
The special value None, plus the int(), str(), float() conversion functions.
What you'll learn
- Use None as an 'unknown' placeholder
- Convert between strings, ints, and floats
- Inspect the type of any value
None – Python's "nothing"
Sometimes you need a variable to mean "I don't have a value yet" or "this is blank." Python uses None for that.
last_login = None
if last_login is None:
print("New user")
Compare with is, not ==, when checking for None.
Type conversions
You'll constantly get data as the wrong type β a number that came in as text, a date that came in as a number. Convert with these:
| Function | What it does | Example |
| int(x) | Try to turn into a whole number | int("42") → 42 |
| float(x) | Try to turn into a decimal | float("3.14") → 3.14 |
| str(x) | Turn into text | str(42) → "42" |
| bool(x) | Turn into True/False | bool("hi") → True |
The "everything is a string when it comes from outside" rule
Any value that arrives from a CSV, a web form, the command line, or a user input box arrives as a string. If you want to do math with it, convert first.
# Pretend "12" came from a CSV cell
quantity = "12"
price = 5
# This is a bug β it repeats "12" five times
print(quantity * price) # "1212121212"
# Convert first
print(int(quantity) * price) # 60
Checking the type at runtime
type(42) # int
type("hi") # str
type([1,2,3]) # list
isinstance(42, int) # True
isinstance() is the cleaner version. Use it in real code.
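Two reasons it's cleaner, shown as quick checks you can paste into a notebook: isinstance() accepts a tuple of types, and it understands subclasses, which type() comparisons don't.
value = 3.14
isinstance(value, (int, float)) # True, matches either type
isinstance(True, int) # True, bool is a subclass of int
type(True) == int # False, type() ignores subclassing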
Walkthrough: a "safe" number parse
The problem
A user might type "42", or "forty-two", or accidentally hit space. We want a function that returns the number, or None if it can't.
The fix using try/except
def safe_int(text):
try:
return int(text.strip())
except (ValueError, AttributeError):
return None
safe_int("42") # 42
safe_int("forty-two") # None
safe_int(" 100 ") # 100
Don't worry about try/except in detail β we cover it in Module 4. The point: real-world data is messy, and Python gives you tools to handle "what if it's broken."
Key takeaways
None = "no value yet." Check with x is None.
- Convert with
int(), float(), str(), bool().
- Data coming from outside is almost always a string. Convert before you compute.
- Use
type() or isinstance() to inspect a value at runtime.
Spot the bug
response = input("How old are you? ")
if response > 18:
print("Adult")
What's broken? Fix it.
Answer
input() always returns a string. Comparing a string to an int crashes. Fix: if int(response) > 18:
03.6 β Talking to the user β input() and print()
How to ask the user a question and show them an answer.
What you'll learn
- Use input() to read text from the user
- Format print() output cleanly
- Combine the two into a working mini-program
input() β ask a question
name = input("What's your name? ")
print(f"Hello, {name}!")
When this runs in a script or terminal, the prompt appears, the program pauses, the user types something, hits Enter, and the typed text is stored in name. Always as a string β even if they typed digits.
print() β talk back
print("Hello")
print("a", "b", "c") # a b c
print("a", "b", "c", sep="-") # a-b-c
print("first")
print("second")
print("third", end=" ")
print("on the same line")
A complete mini-program
# tip_calculator.py
bill = float(input("Bill amount: "))
tip_pct = float(input("Tip percent: "))
people = int(input("How many people? "))
tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people
print()
print(f"Total: ${total:,.2f}")
print(f"Each: ${per_person:,.2f}")
π‘ input() in Jupyter
The same input() works in a Jupyter notebook β a little prompt appears under the cell.
Key takeaways
input("prompt: ") reads a line from the user; the result is always a string.
print() can take any number of values, separated by sep=, terminated by end=.
- Combine them for tiny command-line tools.
Build a unit converter
Ask the user for kilometres, convert to miles (km × 0.621), print the result.
03.7 β Comments β notes to your future self
How to leave notes in your code, and the difference between a comment and a docstring.
What you'll learn
- Write single-line and block comments
- Know when comments help (and when they're noise)
- Recognise a docstring
The # character
Anything after a # on a line is ignored by Python. Use this for notes.
# Read the raw export the finance team dumps in this folder every Monday
raw = pd.read_csv("monday_dump.csv")
# Drop the rows where the customer column is blank β they're rejected orders
clean = raw.dropna(subset=["customer"])
Block comments
Multiple lines, each starting with #. There's no /* ... */ in Python.
# --- Configuration ---
# Change these once a month after the price review meeting
TAX_RATE = 0.08
SHIPPING_FLAT = 7.50
Docstrings β comments for functions
A docstring is a triple-quoted string right under a function's def line. Tools and IDEs use it to show the function's help.
def cents_to_dollars(cents):
"""Convert an integer cents value to a float dollars value.
Example:
cents_to_dollars(1299) -> 12.99
"""
return cents / 100
When comments help β and when they're noise
Good comments explain why, not what:
# Bad β restates what the code already says
x = x + 1 # add 1 to x
# Good β explains the why
x = x + 1 # account for the off-by-one in the upstream CSV
Key takeaways
- # starts a comment, runs to end of line.
- Use comments to explain why, not what.
- Docstrings (triple-quoted) sit under def lines and describe the function.
Comment a tangled script
Take your tip calculator from the last lesson. Add three comments: one at the top explaining what the script does, one above the calculation block explaining the formula, one at the bottom explaining what's printed.
Module 04 β Control Flow
if / else / elif, for and while loops, and Python's gentle error-handling β try/except.
04.1 β if / else β making decisions
Python's IF() statement: indentation, elif, the comparison operators in action.
What you'll learn
- Write if / elif / else blocks
- Use indentation correctly
- Refactor nested ifs into clean code
The shape of an if-statement
age = 18
if age >= 18:
print("Adult")
else:
print("Minor")
Notice three things:
- Colon at the end of the if and else lines.
- Indentation (four spaces) marks what's "inside" the branch.
- There are no parentheses around the condition (you can add them but you don't have to).
elif β chained conditions
Like IFS() in Excel:
score = 73
if score >= 90:
grade = "A"
elif score >= 80:
grade = "B"
elif score >= 70:
grade = "C"
elif score >= 60:
grade = "D"
else:
grade = "F"
print(grade) # C
One-line if (when it improves readability)
status = "VIP" if total > 1000 else "regular"
Use sparingly. Multi-line is usually clearer.
The "early return / guard" pattern
Nested ifs get ugly fast. Flatten them with early returns:
# Tangled
def process(order):
if order is not None:
if order["status"] == "paid":
if order["amount"] > 0:
ship(order)
# Cleaner
def process(order):
if order is None: return
if order["status"] != "paid": return
if order["amount"] <= 0: return
ship(order)
Walkthrough: pricing tiers
Inputs
customer_type = "wholesale"
quantity = 250
Compute the discount
if customer_type == "wholesale":
if quantity >= 500:
discount = 0.20
elif quantity >= 100:
discount = 0.10
else:
discount = 0.05
elif customer_type == "vip":
discount = 0.15
else:
discount = 0.00
print(f"{discount*100:.0f}% discount") # 10% discount
Key takeaways
- Colon at the end of if/elif/else, indent the body.
- elif chains conditions; else catches everything left.
- Flatten deep nesting with early returns.
Shipping cost calculator
Free shipping over $100. $5 flat if weight < 1kg. Otherwise $15. Write the if/elif/else.
04.2 β for loops β doing something for every item
Loop through lists, ranges, dictionaries, and lines of a file.
What you'll learn
- Write a for loop over a list and over a range
- Use enumerate() to track position
- Recognise when not to write a loop (vectorised operations)
The shape of a for loop
for name in ["Alice", "Bob", "Carol"]:
print(f"Hello, {name}!")
Read it as: "for every name in this list, do the indented stuff." Three names β three greetings.
Looping a fixed number of times β range()
for i in range(5):
print(i)
# 0
# 1
# 2
# 3
# 4
range(5) = 0, 1, 2, 3, 4. (Stops before 5 β Python's standard half-open style.)
range(1, 6) # 1, 2, 3, 4, 5
range(0, 10, 2) # 0, 2, 4, 6, 8
enumerate() β when you need the index
items = ["bread", "milk", "eggs"]
for i, item in enumerate(items, start=1):
print(f"{i}. {item}")
# 1. bread
# 2. milk
# 3. eggs
Looping a dictionary
prices = {"bread": 3.50, "milk": 2.10, "eggs": 5.00}
for product, price in prices.items():
print(f"{product}: ${price}")
break and continue
break β stop the loop entirely.
continue β skip the rest of this iteration; jump to the next item.
for n in range(20):
if n % 2 != 0: # odd
continue
if n > 12:
break
print(n)
# 0 2 4 6 8 10 12
Walkthrough: total a list of invoices
The data
invoices = [
{"customer": "Acme", "amount": 1200},
{"customer": "Beta", "amount": 800},
{"customer": "Acme", "amount": 300},
{"customer": "Gamma", "amount": 1500},
]
Loop and accumulate
total = 0
for inv in invoices:
total += inv["amount"]
print(total) # 3800
total += inv["amount"] is shorthand for total = total + inv["amount"].
π‘ The shortcut
For totals, Python gives you a built-in: sum(inv["amount"] for inv in invoices). We meet that pattern (a "generator expression") later.
When NOT to write a loop
If you're using pandas, you almost never need a manual loop. df["amount"].sum() beats writing out the loop in both speed and clarity. We'll see this everywhere from Module 9 onwards.
Key takeaways
- for x in collection: β visit every item.
- range(n) β loop n times.
- enumerate() when you also need the index.
- break stops the loop; continue skips to the next item.
FizzBuzz
For numbers 1 to 30: print "Fizz" if divisible by 3, "Buzz" if divisible by 5, "FizzBuzz" if both, otherwise the number. Classic interview question. You now know enough to write it.
04.3 β while loops β repeat until done
When you don't know how many iterations you need: while, break, and the infinite-loop pitfall.
What you'll learn
- Write a while loop with a clean exit condition
- Use break to leave a loop early
- Avoid accidental infinite loops
The shape of a while loop
balance = 100
months = 0
while balance < 200:
balance = balance * 1.05 # 5% per month
months += 1
print(f"Took {months} months") # 15
"Keep going while the condition is true." When the condition flips false, the loop exits.
The classic infinite loop
x = 0
while x < 10:
print(x)
# forgot to change x β runs forever
If you ever fire one off by accident, hit the stop button in Jupyter, or Ctrl+C in the terminal.
while True + break β the "until" pattern
while True:
answer = input("Type 'quit' to stop: ")
if answer == "quit":
break
print(f"You typed: {answer}")
When to use while vs for
| If you know how many times you'll iterate⦠| use for |
| If you keep going until a condition flips⦠| use while |
Key takeaways
- while CONDITION: runs until the condition is false.
- Change the variable inside the loop or you'll spin forever.
- while True: ... break is the Pythonic "until" pattern.
Doubling game
Start at 1. Keep doubling until you exceed 1,000,000. Print how many doublings it took.
04.4 β List comprehensions β one-line loops
The compact way to build a list from another list. Less code, more clarity.
What you'll learn
- Read a list comprehension
- Convert a for-loop to a comprehension
- Add a filter (if-clause) to a comprehension
The pattern
# Long form
squares = []
for n in range(10):
squares.append(n * n)
# Comprehension β same result
squares = [n * n for n in range(10)]
Read it left to right: "a list of n * n for every n in range(10)."
With a filter
evens = [n for n in range(20) if n % 2 == 0]
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Transform + filter
names = ["alice", "bob", "carol", "dave"]
short_upper = [n.upper() for n in names if len(n) < 5]
# ['BOB', 'DAVE'] ("alice" and "carol" have five letters, so the filter drops them)
Dict and set comprehensions
# dict
prices = {"a": 1, "b": 2, "c": 3}
doubled = {k: v * 2 for k, v in prices.items()}
# {'a': 2, 'b': 4, 'c': 6}
# set (no duplicates)
words = ["apple", "ant", "banana"]
first_letters = {w[0] for w in words}
# {'a', 'b'}
π‘ When to use them
Use a comprehension when the loop body is one short transformation. If you're doing five things to each item, use a regular for-loop β comprehensions stop helping when they get long.
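As a contrived sketch of where that line sits:
names = [" Alice-Smith ", "", " Bob-Brown"]
# Legal, but dense
cleaned = [n.strip().lower().replace("-", "_") for n in names if n.strip()]
# The same logic as a loop: easier to read and extend
cleaned = []
for n in names:
    if not n.strip():
        continue
    cleaned.append(n.strip().lower().replace("-", "_"))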
Key takeaways
- [expression for item in collection] = one-line list builder.
- Add if condition at the end to filter.
- Also works for dicts and sets.
- Don't overuse β readability beats brevity.
Three quick conversions
Convert each loop to a list comprehension.
# 1
result = []
for n in range(20):
if n % 3 == 0:
result.append(n)
# 2
upper_names = []
for name in ["alice", "bob"]:
upper_names.append(name.upper())
# 3
totals = []
for r in [{"qty":3,"price":10}, {"qty":2,"price":15}]:
totals.append(r["qty"] * r["price"])
04.5 β try / except β handling errors gracefully
Catch exceptions so your script keeps running instead of dying on bad data.
What you'll learn
- Wrap risky code in try/except
- Catch specific exception types
- Decide when NOT to catch
The pattern
try:
n = int(user_input)
except ValueError:
n = 0
print("Couldn't parse the number, defaulting to 0.")
Python runs the try block. If a ValueError is raised, control jumps to the except block. If no error, the except is skipped.
The big four exception types
- ValueError β the value was the right type but wrong content (e.g. int("abc")).
- TypeError β the value was the wrong type.
- KeyError β that key didn't exist in the dict.
- FileNotFoundError β couldn't find that file.
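Each of these one-liners raises the matching exception, so try them one at a time (the file name here is made up):
int("abc") # ValueError
"5" + 3 # TypeError
{"a": 1}["b"] # KeyError
open("no_such_file.csv") # FileNotFoundError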
Catch specific, not generic
# OK but lazy β swallows bugs you'd want to know about
try:
do_thing()
except:
pass
# Better β explicit about what you expect
try:
do_thing()
except FileNotFoundError:
print("File missing, skipping.")
The finally block β cleanup that always runs
f = open("data.txt")
try:
work_with(f)
except Exception as e:
print(f"Failed: {e}")
finally:
f.close() # runs whether we succeeded or not
Walkthrough: process a folder of CSVs that might be broken
The plan
Loop through all CSVs in a folder. Try to read each. If one is broken, log it and keep going β don't let one bad file stop the rest.
The script
import pandas as pd
from pathlib import Path
results = []
for path in Path("monthly_files").glob("*.csv"):
try:
df = pd.read_csv(path)
results.append((path.name, len(df)))
except Exception as e:
print(f"Skipped {path.name}: {e}")
print(f"Processed {len(results)} files successfully")
β οΈ Don't catch bugs
The wrong place for try/except: hiding bugs. If your code has a typo, you want it to crash so you can fix it. Only catch errors you've thought about in advance.
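Here's that failure mode in miniature, sketched: a bare except swallows the NameError from a typo and quietly hands back wrong data.
total = 100
try:
    result = totall * 1.08 # typo for 'total': raises NameError
except:
    result = 0 # the bug silently becomes "result = 0"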
Key takeaways
- Wrap risky code (file I/O, type conversions, network calls) in try.
- Catch specific exception types β ValueError, FileNotFoundError etc.
- Use finally for cleanup that must always run.
- Don't use try/except to hide bugs.
Safe parser
Write a function to_float(text) that returns a float if it can, or None if it can't. Then test it with: "12.5", "abc", "", "100".
04.6 β Putting it together β a mini contact-cleaner
Combine if/for/comprehensions/try-except into your first useful script.
What you'll learn
- Combine the four control-flow tools in one script
- Read a tiny piece of real-feeling input
- Produce a clean output
The goal
You receive a list of contacts where each item looks like one of these:
raw = [
" Alice Smith,alice@example.com,42 ",
"Bob Brown,BOB@EXAMPLE.com,", # missing age
"carol,wrong-email,29", # no last name, bad email
",dave@example.com,55", # no name
" Eve Wells, EVE@example.com , 31 ",
]
You want a clean list of dicts with valid contacts only.
Walkthrough
Step 1: split the line and trim each piece
cleaned = []
for line in raw:
parts = [p.strip() for p in line.split(",")]
cleaned.append(parts)
Step 2: validate name, email, age
contacts = []
for name, email, age_text in cleaned:
if not name: continue # need a name
if "@" not in email: continue # need an @
try:
age = int(age_text)
except ValueError:
age = None
contacts.append({
"name": name.title(),
"email": email.lower(),
"age": age,
})
Step 3: inspect
for c in contacts:
print(c)
Output:
{'name': 'Alice Smith', 'email': 'alice@example.com', 'age': 42}
{'name': 'Bob Brown', 'email': 'bob@example.com', 'age': None}
{'name': 'Eve Wells', 'email': 'eve@example.com', 'age': 31}
Carol got dropped (bad email) and Dave got dropped (no name). Bob made it in with a None age. Real data, real result.
Key takeaways
- Real-world scripts always chain these tools β loops + comprehensions + ifs + try/except.
- Validation is just "skip if not what we want" (continue) and "guess if we can" (try/except).
- Clean inputs at the door; the rest of your code stays simple.
Add a feature
Extend the cleaner: also reject contacts under age 18. Add a counter for how many were rejected and print it at the end.
Module 05 β Functions
Define your own reusable mini-tools. Arguments, return values, defaults, and the rules that keep functions honest.
05.1 β def β defining your own function
Make a custom 'formula' you can call from anywhere in your code.
What you'll learn
- Write a function with parameters and a return value
- Call it with positional and keyword arguments
- Use a docstring to document it
Why functions matter
You write the same five-line calculation in three places, then need to fix a typo. With a function: fix it once, every caller gets the fix. Functions are how code stays maintainable.
The shape
def add_tax(amount, rate):
return amount * (1 + rate)
# Use it
total = add_tax(100, 0.08)
print(total) # 108.0
Five things to notice:
- def β keyword that starts a function definition.
- add_tax β the function's name. Use snake_case.
- (amount, rate) β the parameters the function expects.
- Colon, indented body β same as if/for.
- return β sends a value back to the caller.
Positional vs keyword arguments
add_tax(100, 0.08) # positional: amount=100, rate=0.08
add_tax(amount=100, rate=0.08) # keyword: same thing, more explicit
add_tax(rate=0.08, amount=100) # keyword: order doesn't matter
Use keyword arguments when you're calling a function with several parameters β it makes the call site readable.
Default values
def add_tax(amount, rate=0.08):
return amount * (1 + rate)
add_tax(100) # uses default rate of 0.08 β 108
add_tax(100, 0.10) # overrides β 110
Docstrings
Add a triple-quoted string right under def. Editors will show it as help when you hover the function name.
def add_tax(amount, rate=0.08):
"""Return amount + sales tax at the given rate."""
return amount * (1 + rate)
Functions that don't return a value
If you don't write return, Python returns None automatically. That's fine for functions whose job is to print something or save a file β their effect is what matters, not the return value.
def banner(msg):
line = "=" * len(msg)
print(line)
print(msg)
print(line)
banner("Monday report")
Key takeaways
- def name(params): defines a function.
- return sends a value back.
- Use keyword arguments for clarity.
- Default parameters give callers an opt-in override.
Write three functions
- celsius_to_fahrenheit(c) β return F.
- discounted_price(price, discount_pct=10) β return final price.
- full_name(first, last, middle="") β return a properly spaced full name.
05.2 β Returning multiple values & unpacking
How to send back more than one thing β using tuples and dict returns.
What you'll learn
- Return multiple values as a tuple
- Unpack a return into multiple variables
- Decide between a tuple return and a dict return
The tuple return
def split_full_name(full):
first, last = full.split(" ", 1)
return first, last
f, l = split_full_name("Alice Smith")
print(f) # Alice
print(l) # Smith
Behind the scenes, returning first, last is the same as returning the tuple (first, last). The parentheses are optional.
Unpacking
name, age, role = ("Alice", 32, "analyst")
The dict return β when there are many fields
def summary(numbers):
return {
"count": len(numbers),
"total": sum(numbers),
"mean": sum(numbers) / len(numbers),
"max": max(numbers),
"min": min(numbers),
}
stats = summary([10, 20, 30, 40, 50])
print(stats["mean"]) # 30.0
Better than returning a 5-tuple: at the call site you read stats["mean"], not stats[2].
π‘ Rule of thumb
Two return values β tuple. Three or more β dict (or a class, which we won't need in this course).
Key takeaways
- Return multiple values as a tuple: return a, b.
- Unpack at the call site: x, y = func().
- For many fields, return a dict β readability wins.
min_max_avg
Write a function that takes a list of numbers and returns min, max, and average. Try a tuple return first; then rewrite as a dict.
05.3 β Scope β where variables live
Local vs global, why your variable disappeared after the function ran, and how to do this safely.
What you'll learn
- Distinguish local from global variables
- Avoid the global keyword unless you really need it
- Pass values in / return values out β the clean pattern
Local scope
Variables created inside a function only exist inside that function. Once it returns, they're gone.
def calc():
temp = 42
return temp * 2
result = calc()
print(result) # 84
print(temp) # NameError β temp doesn't exist out here
Why this is good
It means a function is a self-contained box. Whatever it does inside can't accidentally rename or corrupt a variable you have outside. Imagine the alternative β every function in a million-line program clobbering each other's variables.
Reading a global is OK; changing one is sketchy
TAX_RATE = 0.08 # global constant
def add_tax(amount):
return amount * (1 + TAX_RATE) # OK to read
# Bad pattern β reaching outside to change a global
TOTAL = 0
def add(n):
global TOTAL
TOTAL += n
# Good pattern β return the new value, let the caller decide
def add(total, n):
return total + n
The mental model
A clean function takes its inputs as parameters and gives its output as a return value. It doesn't reach out into the world. It's predictable, testable, and reusable. Treat global like cigarettes β legal, occasionally unavoidable, mostly a sign you should reconsider.
Key takeaways
- Variables created in a function only exist while the function runs.
- Globals are fine for read-only constants; avoid global for assignment.
- Inputs in β outputs out: the cleanest function pattern.
Refactor
RUNNING_TOTAL = 0
def add_invoice(amount):
global RUNNING_TOTAL
RUNNING_TOTAL += amount
return RUNNING_TOTAL
Rewrite without the global. Pass the running total in, return the updated value.
05.4 β lambda, map, filter β the functional shortcuts
Tiny one-line functions and a quick tour of map() and filter().
What you'll learn
- Recognise a lambda
- Use sorted() with a key function
- Read map() and filter() expressions
What's a lambda?
A one-line function that doesn't get a name. Useful when you need to pass a tiny function to another function.
# With def
def double(x):
return x * 2
# Same thing as a lambda
double = lambda x: x * 2
You'll mostly see lambdas inline, not stored in variables. The most common use: key= in sorted().
sorted() with a key
people = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Carol", "age": 35},
]
by_age = sorted(people, key=lambda p: p["age"])
# [{'name':'Bob',...}, {'name':'Alice',...}, {'name':'Carol',...}]
by_name_desc = sorted(people, key=lambda p: p["name"], reverse=True)
map() β "apply this function to every item"
names = ["alice", "bob", "carol"]
upper = list(map(str.upper, names)) # ['ALICE', 'BOB', 'CAROL']
# Often clearer as a comprehension:
upper = [n.upper() for n in names]
filter() β "keep items where this is true"
ages = [12, 15, 18, 21, 17, 30]
adults = list(filter(lambda x: x >= 18, ages)) # [18, 21, 30]
# Also clearer as a comprehension:
adults = [x for x in ages if x >= 18]
π In practice
Python programmers reach for list comprehensions far more often than map() / filter(). But you'll see all three in the wild β recognise them and you can read any Python code.
Key takeaways
- lambda x: expression β a tiny anonymous function.
- Most common use: sorted(items, key=lambda i: i["something"]).
- map/filter exist; comprehensions are usually clearer.
Sort by multiple keys
Sort the people list above by age ascending, but use name as a tiebreaker. Hint: a lambda can return a tuple.
05.5 β Modules and imports β using other people's code
import, from-import, aliases, and the difference between standard library and third-party.
What you'll learn
- Import a module
- Use from-import for a specific function
- Recognise the standard library vs pip-installed libraries
The three import styles
# 1 β the whole module
import math
math.sqrt(16) # 4.0
# 2 β with an alias (the convention for big libraries)
import pandas as pd
pd.read_csv("file.csv")
# 3 β a specific name
from math import sqrt, pi
sqrt(16) # 4.0
pi # 3.14159...
Standard library vs third party
- Standard library β comes with Python. No install needed. Examples: math, os, pathlib, datetime, json, csv, re.
- Third party β installed with pip install ... (a typical session is sketched below). Examples: pandas, numpy, openpyxl, requests.
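Installing a third-party library happens in the terminal, not inside Python. A typical session might look like this:
pip install requests
pip install pandas openpyxl # several at once is fine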
The conventional aliases
| Library | Conventional alias |
| pandas | import pandas as pd |
| numpy | import numpy as np |
| matplotlib.pyplot | import matplotlib.pyplot as plt |
| seaborn | import seaborn as sns |
Use the conventions. Strangers reading your code will thank you.
Writing your own module
Any .py file is automatically importable from a script in the same folder. Put related functions in helpers.py:
# helpers.py
def clean_email(s):
return s.strip().lower()
Then in main.py next to it:
from helpers import clean_email
clean_email(" ALICE@example.com ") # 'alice@example.com'
Key takeaways
- import x / import x as y / from x import y β three styles.
- Standard library = free; third-party = pip install.
- Use the conventional aliases (pd, np, plt, sns).
- Any of your own .py files is importable.
Use the standard library
Use the datetime module to print today's date in the format YYYY-MM-DD. Hint: from datetime import date; date.today().isoformat().
05.6 β Anatomy of a real script
How professional Python scripts are organised β imports, constants, functions, and the if __name__ block.
What you'll learn
- Recognise the standard layout of a Python script
- Use the if __name__ == '__main__' idiom
- Write your own well-organised script
The standard layout
"""monthly_report.py β generate the monthly sales report.
Usage: python monthly_report.py
"""
# ---- 1. Imports (standard library first, then third-party, then your own) ----
import os
from datetime import date
import pandas as pd
from helpers import clean_email
# ---- 2. Constants ----
INPUT_FILE = "sales_raw.csv"
OUTPUT_FILE = "sales_clean.xlsx"
TAX_RATE = 0.08
# ---- 3. Functions ----
def load(path):
return pd.read_csv(path)
def clean(df):
df["email"] = df["email"].apply(clean_email)
df["total"] = df["price"] * (1 + TAX_RATE)
return df
def save(df, path):
df.to_excel(path, index=False)
def main():
df = load(INPUT_FILE)
df = clean(df)
save(df, OUTPUT_FILE)
print(f"Wrote {len(df)} rows to {OUTPUT_FILE}")
# ---- 4. The "run me" line ----
if __name__ == "__main__":
main()
What's that if __name__ == "__main__" thing?
It's the line that says: "Run main() only if I was launched directly, not when someone imports me from another file."
Why it matters: if you put loose code at the bottom of the file, that code runs even when you just import the file from somewhere else. The if __name__ idiom prevents that.
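A tiny demonstration, using two hypothetical files:
# loose.py: prints the moment anyone imports it
print("running!")
# guarded.py: silent on import, runs only via "python guarded.py"
def main():
    print("running!")
if __name__ == "__main__":
    main()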
The benefit of breaking into functions
- Each function is small and testable.
- You can call individual pieces from a Jupyter notebook while developing.
- You can swap pieces without rewriting the whole thing.
π‘ The 30-second rule
If you can describe a piece of code in one short sentence, it's probably a function. "Clean the email column" β clean_emails(df). "Read the input file" β load(path).
Key takeaways
- Imports β constants β functions β main() β if __name__ == "__main__".
- The if __name__ guard stops your main() from running on import.
- Small functions beat a giant script every time.
Refactor your tip calculator
Take the tip-calculator script you wrote in Module 3 and split it into three functions: get_inputs(), compute(bill, tip_pct, people), and print_results(...). Call them from main().
Module 06 β Data Structures
Lists, dicts, tuples, sets β Python's built-in collections and the Excel equivalents for each.
06.1 β Lists β the everyday ordered collection
Create, index, slice, modify. The Python equivalent of a single-column range.
What you'll learn
- Create and modify lists
- Index, slice, and iterate
- Use append, pop, insert, sort
Creating a list
fruits = ["apple", "banana", "cherry"]
numbers = [10, 20, 30, 40, 50]
mixed = [1, "two", 3.0, True]
empty = []
Indexing
fruits[0] # 'apple' β Python starts at 0
fruits[1] # 'banana'
fruits[-1] # 'cherry' β negative indexes count from the end
fruits[-2] # 'banana'
Slicing
nums = [10, 20, 30, 40, 50]
nums[1:4] # [20, 30, 40]
nums[:3] # [10, 20, 30]
nums[3:] # [40, 50]
nums[:] # [10, 20, 30, 40, 50] β a copy
nums[::2] # [10, 30, 50] β every other one
Modifying
fruits = ["apple", "banana"]
fruits.append("cherry") # add to the end
fruits.insert(0, "avocado") # insert at index 0
fruits.remove("banana") # remove by value
last = fruits.pop() # remove and return last
fruits[0] = "apricot" # replace by index
Useful list operations
len(fruits) # how many items
"apple" in fruits # True/False
sorted(numbers) # returns a sorted copy
numbers.sort() # sorts in place
numbers.reverse()
sum(numbers) # 150
max(numbers) # 50
min(numbers) # 10
Walkthrough: track daily sales
Start with last week's numbers
daily = [1200, 950, 1340, 1100, 1600, 0, 0] # Mon..Sun
This morning's sale comes in
daily.append(1450) # Monday of week 2
print(daily) # [1200, 950, 1340, 1100, 1600, 0, 0, 1450]
Print weekday totals only (drop the weekend)
weekdays = daily[:5]
print(sum(weekdays)) # 6190
β οΈ Lists are mutable
If you pass a list to a function and the function changes it, the original list is changed too. To pass a copy, use my_list[:] or list(my_list).
Key takeaways
- Lists hold an ordered collection of anything.
- Indexes start at 0; negative indexes count from the end.
- append, pop, insert, remove, sort β the workhorses.
- Lists are mutable β changes are in place unless you copy first.
Quick drill
Given nums = [5, 2, 8, 1, 9, 3], write one line each to:
- Get the average.
- Get the second-largest.
- Get a sorted list, descending.
- Get just the odd numbers.
06.2 β Dictionaries β key-value lookups
VLOOKUP, built-in to the language. Create, look up, update, iterate.
What you'll learn
- Create and access dictionaries
- Use .get() to avoid KeyError
- Iterate keys, values, and items
What's a dict?
A dictionary maps keys to values. It's a lookup table.
prices = {
"apple": 1.20,
"banana": 0.50,
"cherry": 3.00,
}
prices["apple"] # 1.20
prices["banana"] # 0.50
Adding and updating
prices["date"] = 5.00 # add a new key
prices["apple"] = 1.50 # update existing
del prices["cherry"] # remove
The .get() escape hatch
Accessing a missing key crashes:
prices["mango"] # KeyError!
.get() returns None (or a default) instead:
prices.get("mango") # None
prices.get("mango", 0) # 0 β supply a default
Iterating
for fruit in prices:
print(fruit) # just keys
for fruit, price in prices.items():
print(f"{fruit}: ${price}")
prices.keys() # dict_keys(['apple', 'banana', ...])
prices.values() # dict_values([1.20, 0.50, ...])
prices.items() # dict_items([('apple', 1.20), ...])
The "row as dict" pattern β this one's important
Real-world data often looks like one dict per row:
customers = [
{"id": 1, "name": "Alice", "city": "NYC", "ltv": 1200},
{"id": 2, "name": "Bob", "city": "Chicago", "ltv": 800},
{"id": 3, "name": "Carol", "city": "NYC", "ltv": 2200},
]
# Find all NYC customers
nyc = [c for c in customers if c["city"] == "NYC"]
# Total NYC LTV
total = sum(c["ltv"] for c in customers if c["city"] == "NYC")
Pandas DataFrames are basically a faster version of this same structure. Internalise the pattern and DataFrames will feel natural.
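As a preview of that connection, the same list of dicts drops straight into a DataFrame:
import pandas as pd
df = pd.DataFrame(customers)
df[df["city"] == "NYC"]["ltv"].sum() # 3400, same answer as the loop above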
Walkthrough: build a quick price lookup
Define the lookup
price_list = {
"SKU-001": 19.99,
"SKU-002": 29.99,
"SKU-003": 9.99,
}
Cost a basket
basket = [("SKU-001", 2), ("SKU-002", 1), ("SKU-099", 4)] # last one isn't in the list
total = 0
for sku, qty in basket:
price = price_list.get(sku)
if price is None:
print(f"Unknown SKU: {sku}")
continue
total += price * qty
print(f"Total: ${total:.2f}")
Key takeaways
- Dicts map keys to values: d[key].
- Use .get(key, default) to avoid KeyError on missing keys.
- Iterate with .items() when you need both key and value.
- Lists of dicts are the precursor to a DataFrame.
Word counter
Given a sentence, build a dict where each key is a word and each value is how many times it appears.
06.3 β Tuples and when to use them
Lists you can't change. Smaller, faster, and a perfect fit for fixed pairs.
What you'll learn
- Create tuples with and without parentheses
- Unpack tuples
- Pick tuple vs list correctly
A tuple is an immutable list
point = (10, 20)
rgb = (255, 0, 128)
point[0] # 10
point[0] = 5 # TypeError β tuples are immutable
The bare-comma shortcut
point = 10, 20 # also a tuple β parens are optional
x, y = point # unpack into two variables
When to use a tuple
- A fixed-size pair or triple where order matters:
(lat, lon), (year, month, day).
- Returning multiple values from a function.
- As a dictionary key (lists can't be keys, tuples can).
When to use a list
- A growing/shrinking collection.
- A homogeneous "many of the same thing" (names, prices, rows).
Key takeaways
- Tuples are immutable. Use them for fixed groupings.
- Parentheses around tuples are optional in most cases.
- Tuples can be dict keys; lists cannot.
Day-of-week dict
Use tuples of (year, month, day) as keys in a dict. Add three keys with notes as values. Print one of them.
06.4 β Sets β unique values, fast
When you need to deduplicate, check membership, or do union/intersection.
What you'll learn
- Create and modify sets
- Use set operations (union, intersection, difference)
- Use a set to deduplicate fast
A set is an unordered collection of unique values
colors = {"red", "green", "blue"}
colors.add("red") # already there β ignored
colors.add("yellow")
print(colors) # {'red','green','blue','yellow'} (order varies)
Deduplicating a list
raw = ["alice@x.com", "bob@y.com", "alice@x.com", "carol@z.com"]
unique = list(set(raw))
Set operations
last_month = {"Alice", "Bob", "Carol"}
this_month = {"Bob", "Carol", "Dave"}
last_month | this_month # union {Alice, Bob, Carol, Dave}
last_month & this_month # intersection {Bob, Carol}
last_month - this_month # difference {Alice} β churned
this_month - last_month # difference {Dave} β new
That last pair β "who churned" and "who's new" β is a classic real-world use.
Fast membership
Checking x in big_list scans every item. Checking x in big_set is nearly instant, regardless of size. If you're checking membership a lot, use a set.
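You can feel the difference with a rough timing sketch (exact numbers vary by machine):
import time
big_list = list(range(1_000_000))
big_set = set(big_list)
start = time.perf_counter()
999_999 in big_list # scans all million items
print(time.perf_counter() - start)
start = time.perf_counter()
999_999 in big_set # one hash lookup
print(time.perf_counter() - start)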
Key takeaways
- Sets hold unique values; adding a duplicate is silently ignored.
- list(set(x)) is the fastest dedupe trick.
- | union, & intersection, - difference.
- Membership checks on sets are O(1) β way faster than lists.
Churn analysis
Given last month's customers (a list) and this month's customers (a list), print the count of churned and new customers.
06.5 β Nested structures β lists of dicts, dicts of lists
How real-world data actually looks in Python before pandas comes along.
What you'll learn
- Build and read nested structures
- Loop through them cleanly
- Convert one to a pandas DataFrame
List of dicts β the most common shape
orders = [
{"id": 101, "customer": "Acme", "items": ["A1", "B7"], "total": 450},
{"id": 102, "customer": "Beta", "items": ["C3"], "total": 120},
{"id": 103, "customer": "Acme", "items": ["A1", "A1", "B7"], "total": 780},
]
Reading it
first = orders[0]
print(first["customer"]) # Acme
print(first["items"][0]) # A1
print(orders[2]["items"][-1]) # B7
Looping
for order in orders:
print(f"Order {order['id']} for {order['customer']}: ${order['total']}")
Aggregating across a list of dicts
# Total revenue
total = sum(o["total"] for o in orders) # 1350
# Revenue per customer
by_customer = {}
for o in orders:
cust = o["customer"]
by_customer[cust] = by_customer.get(cust, 0) + o["total"]
print(by_customer) # {'Acme': 1230, 'Beta': 120}
And here's why DataFrames are so popular
That last calculation took five lines. With pandas it's one:
import pandas as pd
df = pd.DataFrame(orders)
df.groupby("customer")["total"].sum()
Same answer. Same data. Less code. That's why Module 9 is the turning point in this course.
Key takeaways
- Real-world data is usually a list of dicts (or a dict of lists).
- You can do everything with plain Python β it's just verbose.
- pandas makes the same operations one-liners.
Group by city
Given a list of customer dicts each with city and ltv, build a dict mapping city to total LTV.
06.6 β Working with dates
The datetime module, parsing strings to dates, and date arithmetic.
What you'll learn
- Create date and datetime objects
- Parse a date string with strptime
- Do date arithmetic with timedelta
The two types you'll use
from datetime import date, datetime, timedelta
d = date(2026, 5, 16) # just a date
dt = datetime(2026, 5, 16, 14, 30, 0) # date + time
Today / now
date.today() # date(2026, 5, 16)
datetime.now() # full timestamp
Parsing a string
datetime.strptime("2026-05-16", "%Y-%m-%d")
datetime.strptime("16/05/2026", "%d/%m/%Y")
datetime.strptime("May 16, 2026", "%B %d, %Y")
The format codes are documented at strftime.org. The big ones: %Y year, %m month, %d day, %H hour, %M minute.
Formatting back to a string
d = date.today()
d.isoformat() # '2026-05-16'
d.strftime("%B %d, %Y") # 'May 16, 2026'
d.strftime("%A") # 'Saturday'
Date arithmetic
today = date.today()
tomorrow = today + timedelta(days=1)
last_week = today - timedelta(weeks=1)
age_days = (today - date(1990, 1, 1)).days # how many days old
Walkthrough: end-of-month dates for the next year
Build a list of month-ends
from datetime import date
from calendar import monthrange
year = 2026
month_ends = []
for m in range(1, 13):
last_day = monthrange(year, m)[1]
month_ends.append(date(year, m, last_day))
for d in month_ends:
print(d.isoformat())
Key takeaways
- date = just a date. datetime = date + time.
- strptime parses; strftime formats.
- Subtract two dates β timedelta; .days gives you the day count.
- Pandas has even better date tools β covered in Module 9.
Days until your birthday
Write a script that prints how many days until your next birthday.
06.7 β Files and paths β the pathlib essentials
Read and write text files; build paths that work on Mac, Windows, and Linux.
What you'll learn
- Read and write text files
- Use pathlib to build cross-platform paths
- Loop through every file in a folder
Reading a text file
with open("notes.txt", "r") as f:
contents = f.read()
print(contents)
with open(...) as f: is the safe pattern β Python closes the file for you automatically when the indented block ends.
Writing a text file
with open("output.txt", "w") as f:
f.write("Hello!\n")
f.write("Second line.\n")
"w" overwrites; "a" appends.
Reading line by line
with open("notes.txt") as f:
for line in f:
print(line.strip())
pathlib β the modern way to build paths
from pathlib import Path
home = Path.home() # ~/
desktop = home / "Desktop" # ~/Desktop
output = desktop / "report.xlsx" # ~/Desktop/report.xlsx
output.exists() # True/False
output.name # 'report.xlsx'
output.suffix # '.xlsx'
output.parent # PosixPath('~/Desktop')
The / operator joins paths in the correct way for Mac/Linux/Windows. Use this everywhere instead of string concatenation.
Looping through every file in a folder
from pathlib import Path
folder = Path("reports/2026/q1")
for path in folder.glob("*.xlsx"):
print(path.name, path.stat().st_size)
glob("*.xlsx") = "every .xlsx in this folder."
rglob("*.xlsx") = "every .xlsx in this folder and all subfolders."
Walkthrough: gather every CSV from a year's worth of monthly subfolders
The structure
data/
2026-01/
sales.csv
returns.csv
2026-02/
sales.csv
...
List them all
from pathlib import Path
for p in Path("data").rglob("*.csv"):
print(p)
Read and combine
import pandas as pd
frames = [pd.read_csv(p) for p in Path("data").rglob("sales.csv")]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))
Six lines of code, dozens of files, one DataFrame. This pattern alone has saved many an analyst's morning.
Key takeaways
- with open(...) as f: is the safe way to read/write text files.
- pathlib.Path + / builds cross-platform paths.
- glob("*.csv") and rglob("*.csv") iterate every file matching a pattern.
- Combine with pd.concat() to merge many files into one DataFrame.
Inventory script
Write a script that lists every file in your Downloads folder along with its size. Print a one-line total at the end.
Module 07 β Working with Files
Open CSVs, Excel workbooks, folders full of spreadsheets, and write back out. The bread-and-butter file operations.
07.1 β Reading and writing CSV files
The csv standard library module and the pandas one-liner equivalents.
What you'll learn
- Read a CSV into a list of dicts (standard library)
- Read a CSV into a DataFrame (pandas)
- Write a DataFrame back out
Two ways: csv module and pandas
The csv module is built into Python. pandas is the third-party library we'll lean on heavily. For 95% of work, pandas is what you want. For tiny scripts or constrained environments, the csv module is fine.
The pandas one-liner
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())
Done. You have a DataFrame.
Writing a CSV
df.to_csv("clean.csv", index=False)
index=False suppresses pandas' row-number column, which is usually what you want.
Useful read_csv options
pd.read_csv("sales.csv",
sep=",", # delimiter
encoding="utf-8", # "latin-1" for some legacy exports
skiprows=2, # skip first two rows
header=0, # row to use as headers
usecols=["date","amount"],# read only these columns
parse_dates=["date"], # turn date column into real dates
dtype={"id": str}, # force this column to be string
nrows=1000, # read only first 1000 rows
)
The csv module (for when you can't use pandas)
import csv
with open("sales.csv") as f:
reader = csv.DictReader(f)
rows = list(reader)
print(rows[0]) # {'date': '2026-01-05', 'amount': '120', ...}
Every value comes back as a string. You convert as you go.
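So numeric columns need an explicit conversion as you loop, sketched here with the amount column from the example above:
for row in rows:
    row["amount"] = float(row["amount"]) # '120' becomes 120.0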
Common CSV headaches and fixes
| Problem | Fix |
| UnicodeDecodeError | Try encoding="latin-1" or encoding="cp1252" |
| Numbers come in as strings (with commas) | df["amount"] = df["amount"].str.replace(",","").astype(float) |
| Dates won't sort | Use parse_dates=["col"] at read time |
| Wrong delimiter (semicolons or tabs) | sep=";" or sep="\t" |
| Leading/trailing spaces in headers | df.columns = df.columns.str.strip() |
Walkthrough: end-to-end CSV cleanup
Read
import pandas as pd
df = pd.read_csv("sales_raw.csv", parse_dates=["date"])
Inspect
df.head()
df.info()
df.describe()
Clean
df.columns = df.columns.str.strip().str.lower()
df = df.dropna(subset=["customer"])
df["amount"] = df["amount"].astype(float)
Save
df.to_csv("sales_clean.csv", index=False)
Key takeaways
- pd.read_csv() is the one-liner you'll use 95% of the time.
- The standard csv module is there if you can't use pandas.
- Watch out for encoding, delimiters, and number columns that came in as strings.
CSV inspector
Write a small script that takes a CSV path and prints: number of rows, number of columns, the column names, and the first three rows.
07.2 β Reading Excel files with pandas
read_excel: the easiest way to get a .xlsx into Python.
What you'll learn
- Read a specific sheet of an Excel file
- Read multiple sheets at once
- Write a DataFrame back to .xlsx
The basic read
import pandas as pd
df = pd.read_excel("monthly_report.xlsx")
By default this reads the first sheet.
Pick a specific sheet
df = pd.read_excel("workbook.xlsx", sheet_name="January")
df = pd.read_excel("workbook.xlsx", sheet_name=2) # by index, 0-based
Read every sheet into a dict
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
# sheets is a dict: {"January": df1, "February": df2, ...}
for name, df in sheets.items():
print(name, len(df))
Skip header rows / pick a range
pd.read_excel("workbook.xlsx",
sheet_name="Summary",
skiprows=3, # skip the title block
usecols="B:F", # read columns B through F
nrows=50)
Write back
# One sheet
df.to_excel("output.xlsx", index=False, sheet_name="Results")
# Multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
monthly.to_excel(writer, sheet_name="Monthly", index=False)
yearly.to_excel(writer, sheet_name="Yearly", index=False)
Walkthrough: combine 12 monthly tabs into one
Read every sheet
sheets = pd.read_excel("2026_monthly.xlsx", sheet_name=None)
Tag each with its month and stack
frames = []
for name, df in sheets.items():
df["month"] = name
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
Save the combined version
combined.to_excel("2026_combined.xlsx", index=False, sheet_name="All")
π‘ The dependency
pandas uses openpyxl under the hood for .xlsx. Anaconda includes it; if you used a different install: pip install openpyxl.
Key takeaways
- pd.read_excel() with sheet_name=... picks one sheet; sheet_name=None returns all as a dict.
- ExcelWriter writes multiple sheets to one file.
- Combine many sheets with pd.concat([...]).
Sheet inventory
Write a function inventory(path) that returns a DataFrame with one row per sheet in a workbook, with columns: sheet_name, n_rows, n_cols.
07.3 β openpyxl β when you need cell-level control
Read/write individual cells, formatting, formulas, and styling. The library pandas uses under the hood.
What you'll learn
- Open a workbook, write to specific cells
- Style cells (bold, colour, number format)
- Save and re-open
When you reach for openpyxl
Pandas treats Excel like a flat table. If you need the spreadsheety bits β colours, formulas, merged cells, comments, charts β drop down to openpyxl.
from openpyxl import Workbook, load_workbook
wb = Workbook()
ws = wb.active
ws.title = "Summary"
ws["A1"] = "Customer"
ws["B1"] = "Revenue"
ws["A2"] = "Acme"
ws["B2"] = 1200
ws["A3"] = "Beta"
ws["B3"] = 800
ws["B4"] = "=SUM(B2:B3)" # a real Excel formula
wb.save("openpyxl_demo.xlsx")
Reading existing files
wb = load_workbook("openpyxl_demo.xlsx")
ws = wb["Summary"]
print(ws["A1"].value) # 'Customer'
print(ws.cell(row=2, column=2).value) # 1200
for row in ws.iter_rows(min_row=2, values_only=True):
print(row)
Styling β bold, fill, number format
from openpyxl.styles import Font, PatternFill
bold_white = Font(bold=True, color="FFFFFF")
green_fill = PatternFill(start_color="217346", end_color="217346", fill_type="solid")
for cell in ws[1]: # first row = headers
cell.font = bold_white
cell.fill = green_fill
ws["B2"].number_format = '"$"#,##0.00'
Column widths and freeze panes
ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 14
ws.freeze_panes = "A2"
Walkthrough: turn a DataFrame into a styled report
Write the DataFrame with pandas
import pandas as pd
df = pd.DataFrame({"Customer": ["Acme","Beta","Gamma"], "Revenue": [1200,800,1500]})
df.to_excel("report.xlsx", index=False, sheet_name="Sales")
Re-open with openpyxl and style
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
wb = load_workbook("report.xlsx")
ws = wb["Sales"]
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill("solid", fgColor="217346")
for cell in ws[1]:
cell.font = header_font
cell.fill = header_fill
for cell in ws["B"][1:]: # data rows
cell.number_format = '"$"#,##0.00'
ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 16
ws.freeze_panes = "A2"
wb.save("report.xlsx")
Key takeaways
- pandas for the data; openpyxl for the spreadsheety bits.
- Cells are ws["A1"] or ws.cell(row=N, column=N).
- Style with Font, PatternFill, .number_format.
- Always save with wb.save("...xlsx").
Style your own report
Take a DataFrame with three columns, write it with pandas, then add: bold green header, currency format on the numeric column, autosized column widths, frozen header row.
07.4 β Looping through many files
Combine, transform, or report on a whole folder of files in one go.
What you'll learn
- Iterate files in a folder
- Build a combined DataFrame from many files
- Add a column tracking which file each row came from
The 'combine many files' pattern
from pathlib import Path
import pandas as pd
folder = Path("monthly_files")
frames = []
for path in sorted(folder.glob("*.csv")):
df = pd.read_csv(path)
df["source_file"] = path.name # track which file
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(f"Combined {len(frames)} files, {len(combined):,} total rows")
This pattern shows up everywhere β monthly reports, regional exports, daily dumps. Add a sort on path.stat().st_mtime if you need chronological order.
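That chronological variant is the same loop with a different sort key:
# Oldest file first, by modification time
for path in sorted(folder.glob("*.csv"), key=lambda p: p.stat().st_mtime):
    df = pd.read_csv(path)
    df["source_file"] = path.name
    frames.append(df)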
Adding parsed info from the filename
If files are called sales_2026-01.csv, you can pull the month out:
for path in folder.glob("sales_*.csv"):
month = path.stem.split("_")[1] # '2026-01'
df = pd.read_csv(path)
df["month"] = month
frames.append(df)
What if the files have inconsistent columns?
pd.concat() aligns columns by name; missing columns become NaN. If files have wildly different shapes, normalise each one before stacking:
def normalise(df):
df.columns = df.columns.str.strip().str.lower()
keep = ["date", "customer", "amount"]
return df.reindex(columns=keep)
frames = [normalise(pd.read_csv(p)) for p in folder.glob("*.csv")]
combined = pd.concat(frames, ignore_index=True)
Walkthrough: monthly Excel files β one yearly file
Inputs
from pathlib import Path
import pandas as pd
import re
folder = Path("monthly_xlsx")
frames = []
Read each Excel file, tag with month
for path in sorted(folder.glob("*.xlsx")):
m = re.search(r"(\d{4}-\d{2})", path.stem)
if not m: continue
df = pd.read_excel(path)
df["month"] = m.group(1)
frames.append(df)
Combine and save
year_df = pd.concat(frames, ignore_index=True)
year_df.to_excel("yearly_combined.xlsx", index=False)
print(f"Wrote {len(year_df):,} rows.")
Key takeaways
folder.glob("*.csv") + pd.concat() = combine many files into one.
- Always add a column tracking which file the data came from.
- Normalise columns before concat if files differ.
Combine your Downloads
Pretend you have ten CSVs in a folder. Write the combine-and-tag script. Bonus: print the row count per file before combining.
07.5 β Text files, JSON, and other formats
When you're not dealing with CSV/Excel: plain text, JSON, and a quick note on XML.
What you'll learn
- Read and write JSON
- Parse log-style text files
- Know when to reach for a different format
JSON in two lines
JSON is the format APIs and lots of modern tools use. Python's json module reads it straight into a dict.
import json
# Read
with open("config.json") as f:
config = json.load(f)
print(config["api_key"])
# Write
data = {"name": "Alice", "scores": [10, 20, 30]}
with open("out.json", "w") as f:
json.dump(data, f, indent=2)
JSON β DataFrame
import pandas as pd
df = pd.read_json("orders.json") # array of objects
df.to_json("clean.json", orient="records", indent=2)
Parsing a log file
lines = Path("server.log").read_text().splitlines()
errors = [l for l in lines if "ERROR" in l]
print(len(errors), "errors")
XML β yes, you'll occasionally see it
For old enterprise systems and government data, you'll get XML. The fastest way: pandas can read simple XML tables.
df = pd.read_xml("data.xml")
For gnarlier XML, the standard library's xml.etree.ElementTree works, and the third-party lxml is faster.
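A minimal ElementTree sketch, assuming a hypothetical orders.xml whose root holds <order> elements with <customer> and <amount> children:
import xml.etree.ElementTree as ET
tree = ET.parse("orders.xml") # hypothetical file
for order in tree.getroot().findall("order"):
    customer = order.findtext("customer")
    amount = float(order.findtext("amount"))
    print(customer, amount)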
Key takeaways
- JSON: json.load / json.dump β clean dict in, clean dict out.
- Plain text: Path(...).read_text().splitlines() is the one-liner.
- pandas can also read JSON and XML directly into DataFrames.
Config saver
Write a script that saves your settings (any three key/value pairs you like) to a JSON file, then reads them back and prints them.
07.6 β Encoding, BOMs, and other file gotchas
Why exports from old systems look like garbage, and how to fix them.
What you'll learn
- Recognise an encoding issue
- Try the most-likely fixes
- Strip BOMs and other invisible characters
The classic symptom
You read a CSV and see things like caf� instead of café, or â€™ instead of an apostrophe. That's an encoding mismatch β the file was saved with one character encoding and you're reading it as another.
The two-line fix
# Try utf-8 first (default)
df = pd.read_csv("export.csv")
# If garbled, try latin-1 (Windows / older systems)
df = pd.read_csv("export.csv", encoding="latin-1")
# Or cp1252 (also common from Windows Excel)
df = pd.read_csv("export.csv", encoding="cp1252")
If none of those work, ask the system that exported the file what encoding it uses.
The BOM problem
Some Excel CSV exports include an invisible "byte-order mark" character at the start of the file. Symptom: your first column header looks like ï»¿Customer with a weird prefix.
pd.read_csv("export.csv", encoding="utf-8-sig")
The -sig suffix tells pandas to strip the BOM.
Stray whitespace and invisible characters
# Strip columns and string values in one pass
df.columns = df.columns.str.strip()
for col in df.select_dtypes("object"):
df[col] = df[col].str.strip()
Date columns that come in as strings
df["date"] = pd.to_datetime(df["date"], errors="coerce")
errors="coerce" turns bad dates into NaT (not-a-time) instead of crashing.
Key takeaways
- If text looks garbled, try encoding="utf-8-sig", then latin-1, then cp1252.
- Strip whitespace from columns AND string values when loading messy data.
- pd.to_datetime(..., errors="coerce") handles unparsable dates gracefully.
Encoding sleuth
Write a function safe_read_csv(path) that tries utf-8, then utf-8-sig, then latin-1, and returns the first one that works.
Module 08 β Python inside Excel β =PY()
Use Python in a cell like a formula. Pass ranges, return DataFrames, chart inline, and decide when =PY() is the right tool.
This is the headline feature. The =PY() formula lets you write a few lines of Python directly inside an Excel cell and get the result back into the spreadsheet. Cloud-powered, no install required (other than Microsoft 365), and surprisingly powerful.
β οΈ Requires Microsoft 365
Standalone Excel 2019/2021/2024 doesn't include =PY(). If you don't have 365, you can still read this module for the concepts and use Python in Jupyter for the same outcomes.
08.1 β =PY() β the first ten formulas
Type =PY(, write Python, hit Ctrl+Enter. The basic muscle memory.
What you'll learn
- Enter a =PY() formula
- Commit with Ctrl+Enter (not just Enter)
- Output as Python object vs Excel values
The syntax
- Click an empty cell.
- Type =PY( β Excel switches the cell into Python mode. A green "PY" badge appears.
- Write your Python.
- Press Ctrl+Enter (Windows) / Cmd+Enter (Mac) to commit.
Don't just hit Enter β that adds a newline inside the cell. Ctrl+Enter is the "submit" key.
Ten warm-up formulas
# 1 β basic math
=PY(1 + 1) # 2
# 2 β a variable
=PY(
x = 100
x * 1.08
) # 108
# 3 β string
=PY("hello".upper()) # HELLO
# 4 β list aggregation
=PY(sum([10, 20, 30])) # 60
# 5 β current date
=PY(
from datetime import date
date.today()
)
# 6 β a Series
=PY(
import pandas as pd
pd.Series([1, 2, 3, 4, 5]).mean()
) # 3.0
# 7 β a small DataFrame
=PY(
import pandas as pd
pd.DataFrame({"x": [1,2,3], "y": [4,5,6]})
)
# 8 β random
=PY(
import random
random.choice(["heads", "tails"])
)
# 9 β string manipulation
=PY("alice@example.com".split("@")[1]) # example.com
# 10 β sorted list
=PY(sorted([3, 1, 4, 1, 5, 9, 2, 6]))
Output mode: Python object vs Excel values
By default, a =PY() cell shows a Python object β a single cell that holds (say) a whole DataFrame. To spill the contents into proper Excel cells:
- Right-click the cell.
- Output as β Excel values.
- The DataFrame spills into a range. Headers become column headers, rows become rows.
Toggle back to Python-object mode if you want to use the cell's value in another =PY() formula.
Walkthrough: build a summary cell
Put some numbers in A1:A10
Type any 10 numbers in cells A1 through A10.
In B1, write =PY()
=PY(
xl("A1:A10").mean()
)
The xl(...) helper grabs an Excel range as a pandas Series or DataFrame. .mean() averages it.
Try a more useful summary
=PY(
s = xl("A1:A10")
{"count": s.count(),
"sum": s.sum(),
"mean": s.mean(),
"min": s.min(),
"max": s.max()}
)
Right-click β Output as β Excel values. The dict spills into a two-column table.
Key takeaways
- =PY( + your Python + Ctrl+Enter.
- xl("A1:A10") brings an Excel range into Python as a Series/DataFrame.
- Right-click β Output as β Excel values to spill the result into cells.
Calculator cell
In a fresh workbook, put numbers in A1:A20. In B1, write a =PY() formula that computes the median. In B2, one that computes the standard deviation.
08.2 β The xl() helper β getting Excel data into Python
How to reference cells, ranges, and tables from inside =PY().
What you'll learn
- Reference single cells, ranges, and tables with xl()
- Choose between headers=True and headers=False
- Pull data from another sheet
The xl() forms
xl("A1") # single value
xl("A1:C5") # range β DataFrame (or Series if one col)
xl("A1:C5", headers=True) # treat first row as headers
xl("Sheet2!A1:C100") # range from another sheet
xl("MyTable[#All]") # an Excel Table by name
xl("MyTable[Revenue]") # one column of a Table
Single cells
=PY(
tax_rate = xl("B1")
amount = xl("B2")
amount * (1 + tax_rate)
)
Range with headers
Best practice: lay your data out with headers in the first row, then read with headers=True:
=PY(
df = xl("A1:D100", headers=True)
df.groupby("Region")["Revenue"].sum()
)
Cross-sheet references
=PY(
df = xl("RawData!A1:F1000", headers=True)
df.head()
)
Excel Tables β the cleanest way
If your data is a real Excel Table (Ctrl+T on a range), you can reference it by name. That's the most readable option and survives row insertions/deletions:
=PY(
sales = xl("SalesTable[#All]", headers=True)
sales["Total"] = sales["Qty"] * sales["Price"]
sales
)
Walkthrough: build a small dashboard cell
Set up the data
Put this in A1:C5 (headers in row 1):
Region Quarter Revenue
North Q1 1200
North Q2 1500
South Q1 900
South Q2 1100
Summary cell
=PY(
df = xl("A1:C5", headers=True)
df.groupby("Region")["Revenue"].agg(["sum", "mean"])
)
Right-click → Output as → Excel values. You get a clean two-column summary.
💡 The =PY() pattern that scales
Store your raw data as an Excel Table. Use =PY() cells purely for analysis and presentation. When new rows come in, the analysis updates automatically.
Key takeaways
- xl("A1:B5", headers=True) brings a labelled range in as a DataFrame.
- Excel Tables are referenced by name and survive resizing.
- You can read other sheets with Sheet2!A1:B5 syntax.
Top three
Lay out a list of 10 employees with a Sales column. In a =PY() cell, return the top three by sales as a DataFrame.
08.3 – Pandas in =PY() – the killer combination
Run real pandas operations inside Excel cells. Groupby, pivot, sort, filter.
What you'll learn
- Group, pivot, and aggregate in =PY()
- Sort and filter in =PY()
- Spill the result into a clean Excel range
The setup
Put this raw data in A1:D11 (10 rows + headers):
Date Region Product Revenue
2026-01-05 North Widget 1200
2026-01-12 North Gadget 800
2026-01-19 South Widget 950
2026-02-02 South Gadget 1100
2026-02-09 East Widget 1300
2026-02-16 East Gadget 700
2026-03-03 North Widget 1500
2026-03-10 South Widget 1050
2026-03-17 East Gadget 900
2026-03-24 North Gadget 1250
Pivot by Region × Product
=PY(
df = xl("A1:D11", headers=True)
df.pivot_table(index="Region", columns="Product",
values="Revenue", aggfunc="sum")
)
Group by month
=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M")
df.groupby("Month")["Revenue"].sum()
)
Top N
=PY(
df = xl("A1:D11", headers=True)
df.nlargest(3, "Revenue")
)
Filter
=PY(
df = xl("A1:D11", headers=True)
df[df["Region"] == "North"]
)
Walkthrough: build a region × month sales matrix
The =PY() cell
=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m")
m = df.pivot_table(index="Region", columns="Month",
values="Revenue", aggfunc="sum", fill_value=0)
m
)
Spill it
Right-click → Output as → Excel values. You get a clean Region × Month matrix that updates whenever the source data changes.
📌 Pandas is already imported
Inside =PY(), pandas is already available as pd by default. You don't need an import pandas as pd at the top – though it's harmless if you add it.
Key takeaways
- =PY() gives you the full power of pandas inside a cell.
- Groupby, pivot, sort, filter – all one-liners.
- Spill the result as Excel values when you want to use it elsewhere.
Build your own pivot
Using the table above, write a =PY() formula that returns total Revenue by Product, sorted descending.
08.4 – Charts in =PY() with matplotlib and seaborn
Return a chart from a =PY() cell. Lives inside the workbook like an image.
What you'll learn
- Return a matplotlib chart from =PY()
- Customise the chart's title, labels, and size
- Use seaborn for prettier defaults
The simplest chart cell
=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar")
)
The cell returns a chart image. Click it; Excel lets you resize and reposition it.
Styling
=PY(
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar", ax=ax, color="#217346")
ax.set_title("Revenue by Region")
ax.set_ylabel("Revenue ($)")
ax.set_xlabel("")
plt.tight_layout()
)
Seaborn for nicer defaults
=PY(
import seaborn as sns
df = xl("A1:D11", headers=True)
sns.barplot(data=df, x="Region", y="Revenue", estimator="sum", errorbar=None)
)
Walkthrough: monthly trend chart
Aggregate first, plot second
=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M").dt.to_timestamp()
monthly = df.groupby("Month")["Revenue"].sum()
fig, ax = plt.subplots(figsize=(7, 3))
monthly.plot(ax=ax, marker="o", color="#2b6cb0")
ax.set_title("Monthly Revenue")
ax.grid(alpha=0.3)
plt.tight_layout()
)
💡 Treat chart cells like images
A =PY() chart behaves like an image inside the spreadsheet – you can resize it, move it across sheets, and it'll update whenever the source data changes.
Key takeaways
- =PY() can return matplotlib/seaborn charts directly.
- For control: build a fig, ax and tweak labels/colours.
- Use plt.tight_layout() to avoid clipping.
Stacked bar
Build a stacked bar chart in =PY() showing Revenue by Region, with Product as the stack. Hint: df.pivot_table(...).plot(kind="bar", stacked=True).
08.5 – When =PY() is the right tool – and when it isn't
A decision guide so you reach for the right hammer.
What you'll learn
- Pick =PY() over native formulas when it pays off
- Stick with native formulas when they're faster/cleaner
- Combine both in the same workbook
When =PY() is the right tool
- Anything that wants a real DataFrame: pivots, groupbys, complex joins.
- Statistical analysis: regressions, t-tests, distributions, percentile bands.
- Any chart Excel can't make cleanly (heatmaps, violin plots, faceted small multiples).
- Date arithmetic that would need ten nested formulas.
- Working with data that doesn't fit on a normal sheet (rare in =PY(), but possible).
- Anything you'd want to re-do programmatically later.
When to stick with native formulas
- Simple lookups, sums, counts, and conditional formatting – XLOOKUP, SUMIFS, COUNTIFS are fast and obvious.
- Interactive what-ifs where every keystroke should ripple through the whole sheet.
- Workbooks shared with people who don't have Microsoft 365 (their cells will show #BLOCKED!).
- Anything that needs to recalc in < 100 ms.
The hybrid pattern
Use native formulas for the live, interactive parts. Use =PY() for the heavier analysis next to them. The two coexist beautifully.
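A sketch of the hybrid layout, with assumed cell addresses: the =PY() cell spills a per-region summary as Excel values, and an ordinary native formula next to the spill handles the live, interactive part.
=PY(
# in F1, spilled via Output as → Excel values: region totals from the Table
xl("SalesTable[#All]", headers=True).groupby("Region")["Revenue"].sum()
)
=XLOOKUP("North", F1:F4, G1:G4)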
Two performance notes
- =PY() runs in the cloud. Each calc is a round trip. Don't put 500 of them in one workbook.
- One bigger =PY() beats many small ones. Do as much per cell as you can; one cell that produces a 10×10 table is faster than 100 cells each producing one value.
Key takeaways
=PY() shines for pivots, stats, charts, and heavy date work.
- Native formulas are still best for simple lookups and live what-ifs.
- Mix them; prefer one big
=PY() over many small ones.
Decide
For each task, pick: native formulas, =PY(), or both?
- "Sum revenue where region = North."
- "Forecast next quarter's revenue with a linear regression."
- "Look up a customer's tier from a side table."
- "Build a heatmap of orders by hour-of-day Γ day-of-week."
08.6 – Twelve =PY() recipes you'll use
Copy-paste-ready snippets for the most common Python-in-Excel tasks.
What you'll learn
- Bookmark this lesson
- Adapt each recipe to your own data
- Spot which pandas method does what
1 – Sum a column with a condition
=PY(
df = xl("DataTable[#All]", headers=True)
df.loc[df["Region"]=="North", "Revenue"].sum()
)
2 – Count distinct values
=PY(
xl("A2:A1000").nunique()
)
3 – Group-by total
=PY(
df = xl("A1:D1000", headers=True)
df.groupby("Region")["Revenue"].sum().sort_values(ascending=False)
)
4 – Pivot table
=PY(
df = xl("A1:D1000", headers=True)
df.pivot_table(index="Region", columns="Quarter",
values="Revenue", aggfunc="sum", fill_value=0)
)
5 – Top N by metric
=PY(
df = xl("A1:E1000", headers=True)
df.nlargest(10, "Revenue")
)
6 – Running total
=PY(
df = xl("A1:B100", headers=True).sort_values("Date")
df["RunningTotal"] = df["Amount"].cumsum()
df
)
7 – Percentage of total
=PY(
s = xl("B2:B11")
(s / s.sum() * 100).round(1)
)
8 – Date difference in days
=PY(
df = xl("A1:C100", headers=True)
df["DaysOpen"] = (pd.to_datetime(df["Closed"])
- pd.to_datetime(df["Opened"])).dt.days
df
)
9 – Quartile bucketing
=PY(
df = xl("A1:B500", headers=True)
df["Bucket"] = pd.qcut(df["Score"], 4, labels=["Q1","Q2","Q3","Q4"])
df
)
10 – Inner join two ranges
=PY(
orders = xl("OrdersTable[#All]", headers=True)
customers = xl("CustomersTable[#All]", headers=True)
orders.merge(customers, on="CustomerID")
)
11 – Quick linear regression
=PY(
import numpy as np
df = xl("A1:B100", headers=True)
slope, intercept = np.polyfit(df["X"], df["Y"], 1)
{"slope": slope, "intercept": intercept}
)
12 – Heatmap chart
=PY(
import seaborn as sns
df = xl("A1:D1000", headers=True)
pivot = df.pivot_table(index="DayOfWeek", columns="Hour",
values="Orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu")
)
Key takeaways
- Most analysis in =PY() is two or three lines.
- Bookmark this lesson – you'll come back to it.
- If you keep wanting a recipe that isn't here, that's a great cue to get in touch.
Adapt one
Pick recipe #3, #5, or #10. Build a small dataset on a new sheet and make it run in your own workbook.
08.7 – Limits, costs, and security of =PY()
What =PY() can't do, where it stores your data, and how it's priced.
What you'll learn
- Know what =PY() doesn't support
- Understand where the code runs
- Decide what's appropriate to put in a =PY() cell
The runtime model
When a =PY() cell calculates, Excel sends the code and the relevant Excel data to Microsoft's cloud, runs it in a sandboxed Python environment, and sends the result back. Your laptop does almost nothing – most of the work happens on Microsoft Azure.
What's pre-installed
Microsoft pre-installs a curated set of libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn, statsmodels, and a few more. You can't pip install something extra into a =PY() cell – what's there is what's there.
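One quick way to see what you're working with (a sketch – the exact library set and versions vary by release):
=PY(
import pandas, numpy, matplotlib
{"pandas": pandas.__version__,
"numpy": numpy.__version__,
"matplotlib": matplotlib.__version__}
)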
What's not allowed
- No network calls. You can't fetch a web page or hit an API from =PY(). Use a Jupyter notebook for that.
- No reading local files. The sandbox can't see your hard disk. Everything has to come from xl(...).
- No installing libraries. Stick to the pre-installed set.
Calculation order
Python cells are calculated after regular Excel formulas, top-to-bottom, left-to-right by default. If you have one =PY() that depends on another, that's fine – Excel orders them correctly.
Privacy
The data in cells referenced by xl(...) is sent to Microsoft's servers. For most spreadsheet work this is fine; for highly sensitive data, check your organisation's policy. The data is not used to train AI models per Microsoft's published policy at launch – but verify the current policy at support.microsoft.com.
Cost
At launch, basic =PY() usage is included in Microsoft 365. Premium tiers offer faster compute and higher data limits. Check Microsoft's pricing page for the current numbers.
Performance gotchas
- Each =PY() is a network round-trip. Hundreds of them in one workbook will feel slow.
- Big DataFrames (> 50k rows) are slower over the wire – for very large data, do the analysis in Jupyter and paste results in.
Key takeaways
- =PY() runs in Microsoft's cloud – your data goes through their servers.
- No network, no local files, no extra libraries inside =PY().
- For network/file work, use a Jupyter notebook (the rest of this course).
Decide where each runs
Which of these should live in =PY(), and which in a Jupyter script?
- Scrape today's exchange rate from a website.
- Make a heatmap of last month's sales.
- Combine 50 monthly CSVs from a shared drive.
- Compute the correlation matrix of a 20-column table sitting in Excel.
Module 09 – Pandas Crash Course
The most important library in this course. DataFrames, selection, filtering, grouping, joining, pivoting – the spreadsheet superpowers.
09.1 – DataFrames – Python's spreadsheet
Create a DataFrame, inspect it, and learn the vocabulary.
What you'll learn
- Build a DataFrame from a dict or list of dicts
- Use .head(), .info(), .describe(), .shape
- Distinguish a DataFrame from a Series
Two ways to make a DataFrame
import pandas as pd
# From a dict of columns
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [30, 25, 35],
"city": ["NYC", "Chicago", "LA"],
})
# From a list of rows (each row a dict)
rows = [
{"name": "Alice", "age": 30, "city": "NYC"},
{"name": "Bob", "age": 25, "city": "Chicago"},
{"name": "Carol", "age": 35, "city": "LA"},
]
df = pd.DataFrame(rows)
The four "what's in this thing?" methods
df.head() # first 5 rows
df.tail(3) # last 3 rows
df.shape # (rows, columns)
df.info() # column names, dtypes, missing counts
df.describe() # numeric summary stats
Series vs DataFrame
A Series is a single column. A DataFrame is many columns aligned by row.
df["age"] # a Series
type(df["age"]) # <class 'pandas.core.series.Series'>
df[["age", "city"]] # a DataFrame (note: double brackets)
Renaming columns
df = df.rename(columns={"name": "full_name"})
df.columns = ["full_name", "age", "city"] # set all at once
Adding a column
df["is_adult"] = df["age"] >= 18
df["greeting"] = "Hi " + df["full_name"]
Walkthrough: from raw to summary in five lines
Read
df = pd.read_csv("sales.csv")
df.head()
Inspect
df.shape # (1000, 6)
df.info()
df.describe()
Quick summary
df.groupby("region")["amount"].sum()
Key takeaways
- DataFrame = table; Series = one column.
- .head(), .info(), .describe(), .shape are the inspection workhorses.
- Build from a dict-of-lists or a list-of-dicts.
Quick build
Create a DataFrame of five employees with name, department, salary. Print .describe() on it and notice what it shows for the numeric column versus the text ones.
09.2 – Selecting rows and columns – loc and iloc
Pull out the slice of data you actually want.
What you'll learn
- Select columns by name
- Filter rows by condition
- Use .loc and .iloc correctly
Selecting columns
df["name"] # one column β Series
df[["name", "city"]] # multi-column β DataFrame
Filtering rows by condition
df[df["age"] > 30]
df[df["city"] == "NYC"]
df[df["name"].str.startswith("A")]
Combining conditions
df[(df["age"] > 30) & (df["city"] == "NYC")] # AND β use &
df[(df["age"] < 30) | (df["city"] == "LA")] # OR β use |
df[~(df["city"] == "Chicago")] # NOT β use ~
Important: use & | ~, not and/or/not. And wrap each condition in parentheses.
.loc – label-based selection
df.loc[0] # first row
df.loc[0:3, "name"] # rows 0..3 (inclusive!), column 'name'
df.loc[df["age"] > 30, "name"] # name column for the over-30s
.iloc – position-based selection
df.iloc[0] # first row by position
df.iloc[:5, 0:2] # first 5 rows, first 2 columns
df.iloc[-1] # last row
The "is in this list" filter
df[df["city"].isin(["NYC", "LA"])]
df[~df["city"].isin(["NYC", "LA"])] # everything not in the list
Walkthrough: pulling apart a dataset
The data
df = pd.read_csv("orders.csv")
df.columns
# Index(['order_id','date','region','customer','amount','status'])
Big orders in the North region
big_north = df[(df["region"] == "North") & (df["amount"] > 1000)]
Just two columns of those
big_north[["date", "amount"]]
Or in one go with .loc
df.loc[(df["region"]=="North") & (df["amount"]>1000),
["date", "amount"]]
Key takeaways
- df[bool_mask] is the everyday filter pattern.
- Use &, |, ~ (with parens) when combining.
- .loc[rows, cols] for labels; .iloc[rows, cols] for positions.
- .isin([list]) for membership filters.
Drill
From a DataFrame of employees: select everyone in Sales with a salary over $80k, returning only the name and salary columns.
09.3 – Sorting, counting, and ranking
sort_values, value_counts, rank – the simple stuff you'll use a hundred times a day.
What you'll learn
- Sort by one or more columns
- Count occurrences of values
- Rank a column
Sorting
df.sort_values("amount", ascending=False).head(10)
df.sort_values(["region", "amount"], ascending=[True, False])
df.sort_index() # by row index
Counting
df["region"].value_counts() # counts per region
df["region"].value_counts(normalize=True) # as percentages
df["region"].nunique() # distinct count
Ranking
df["rank_by_amount"] = df["amount"].rank(ascending=False)
df["pct_rank"] = df["amount"].rank(pct=True)
Top / bottom N
df.nlargest(10, "amount")
df.nsmallest(5, "amount")
Key takeaways
- sort_values(["a","b"], ascending=[True,False]) handles multi-key sort.
- value_counts() is the one-liner for a quick frequency table.
- nlargest() / nsmallest() are faster and clearer than sort+head.
Bestseller list
From an orders DataFrame, print the top 5 products by total revenue.
09.4 – Aggregation – sum, mean, agg
How to summarise. Series methods, DataFrame methods, and .agg().
What you'll learn
- Compute basic aggregates on a Series
- Aggregate multiple columns at once
- Use .agg() for multiple stats per column
Single-column aggregates
df["amount"].sum()
df["amount"].mean()
df["amount"].median()
df["amount"].std()
df["amount"].max()
df["amount"].min()
df["amount"].count()
df["amount"].quantile(0.95)
Whole-DataFrame aggregates
df.sum(numeric_only=True)
df.mean(numeric_only=True)
df.describe()
.agg() – many at once
df["amount"].agg(["sum", "mean", "min", "max"])
df.agg({
"amount": ["sum", "mean"],
"quantity": ["sum", "max"],
})
Custom aggregations
def iqr(s):
return s.quantile(0.75) - s.quantile(0.25)
df["amount"].agg(["mean", iqr])
Key takeaways
- Series have direct methods: .sum(), .mean() etc.
- .agg(["sum","mean"]) runs many at once.
- .describe() is the quickest "show me everything" summary.
Summary box
For a numeric column of your choosing, print mean, median, min, max, and 95th percentile in a single .agg() call.
09.5 – groupby – the move that pays for the course
Split → apply → combine. Replace your SUMIFS forever.
What you'll learn
- Group by one or more columns
- Aggregate after groupby
- Get the result back as a clean DataFrame
The pattern
# Total revenue per region – one line
df.groupby("region")["amount"].sum()
That's it. That replaces a column of SUMIFS formulas, plus the formula maintenance, plus the broken refs when rows shift.
Multiple aggregations
df.groupby("region")["amount"].agg(["sum", "mean", "count"])
df.groupby("region").agg(
total_revenue = ("amount", "sum"),
avg_order = ("amount", "mean"),
n_orders = ("order_id", "count"),
)
The second form (named aggregations) is the cleanest – output columns are named exactly what you want.
Multiple group keys
df.groupby(["region", "quarter"])["amount"].sum()
Get a clean DataFrame back
By default, groupby returns a Series or a DataFrame with a hierarchical index. Add .reset_index() for a flat one:
summary = (df.groupby(["region", "quarter"])["amount"]
.sum()
.reset_index())
Filtering within groups
# Keep only groups with more than 10 orders
df.groupby("region").filter(lambda g: len(g) > 10)
# Top 3 per group
df.sort_values("amount", ascending=False).groupby("region").head(3)
Walkthrough: the classic regional summary
Read
df = pd.read_csv("orders.csv", parse_dates=["date"])
df["quarter"] = df["date"].dt.to_period("Q").astype(str)
Summary
summary = (df.groupby(["region", "quarter"])
.agg(total = ("amount", "sum"),
count = ("order_id", "count"),
avg = ("amount", "mean"))
.reset_index())
print(summary.head(10))
Save
summary.to_excel("regional_summary.xlsx", index=False)
💡 Mental shortcut
"Group by X then aggregate" = the moment you'd reach for a pivot table in Excel.
Key takeaways
- df.groupby("col")["other"].sum() = SUMIF for a whole column.
- Use named aggregations for clean output: .agg(name=("col", "func")).
- .reset_index() to get a flat DataFrame back.
Per-customer summary
From an orders DataFrame, build a per-customer summary with: total spend, number of orders, average order value.
09.6 – Pivot tables – pivot_table()
The Excel pivot table, one function call.
What you'll learn
- Build a pivot with index/columns/values
- Add multiple value columns and aggregations
- Handle missing cells with fill_value
The basic pivot
df.pivot_table(index="region",
columns="quarter",
values="amount",
aggfunc="sum")
Rows = region. Columns = quarter. Cells = total amount.
Multiple aggregations
df.pivot_table(index="region",
columns="quarter",
values="amount",
aggfunc=["sum", "mean", "count"])
Multiple value columns
df.pivot_table(index="region",
values=["amount", "quantity"],
aggfunc={"amount": "sum", "quantity": "sum"})
Fill empty cells
df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum", fill_value=0)
With margins (totals row + column)
df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum",
margins=True, margins_name="Total")
crosstab – quick frequency pivot
pd.crosstab(df["region"], df["status"])
pd.crosstab(df["region"], df["status"], normalize="index") # row percentages
Key takeaways
- pivot_table(index, columns, values, aggfunc) = Excel pivot.
- fill_value=0 stops NaN holes.
- margins=True adds totals.
- pd.crosstab is the shortcut for frequency tables.
Sales by product × month
Build a pivot of total sales with products as rows and months as columns. Add a totals row and column.
09.7 – Joining and merging – pandas's VLOOKUP
merge, concat, join – three ways to combine DataFrames.
What you'll learn
- Use pd.merge() for VLOOKUP-style joins
- Pick the right join type (inner/left/right/outer)
- Stack DataFrames with pd.concat()
The basic merge – like XLOOKUP
orders = pd.read_csv("orders.csv") # has customer_id
customers = pd.read_csv("customers.csv") # has customer_id and name, city, tier
combined = orders.merge(customers, on="customer_id")
Different column names on each side
orders.merge(customers, left_on="cust_id", right_on="id")
Join types
| How | What you get |
| how="inner" (default) | Only rows present on both sides. |
| how="left" | All rows from the left; matching info from the right (NaN if no match). |
| how="right" | All rows from the right. |
| how="outer" | Every row from either side; NaN where the other side has no match. |
combined = orders.merge(customers, on="customer_id", how="left")
This is the "VLOOKUP that doesn't lose rows when there's no match" – your default for enrichment joins.
Checking the join with indicator
combined = orders.merge(customers, on="customer_id",
how="left", indicator=True)
combined["_merge"].value_counts()
# left_only 42 – orders with no matching customer (a red flag)
# both 958
Stacking with concat
jan = pd.read_csv("jan.csv")
feb = pd.read_csv("feb.csv")
quarter = pd.concat([jan, feb], ignore_index=True)
Walkthrough: enrich orders with customer info
Load
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
Join
enriched = orders.merge(customers, on="customer_id",
how="left", indicator=True)
Sanity-check the join
print(enriched["_merge"].value_counts())
unmatched = enriched[enriched["_merge"] == "left_only"]
print(f"{len(unmatched)} orders without a matching customer")
Drop the indicator and save
enriched = enriched.drop(columns=["_merge"])
enriched.to_csv("orders_enriched.csv", index=False)
Key takeaways
- df.merge(other, on="key", how="left") is VLOOKUP without the heartbreak.
- Always use indicator=True on a one-off to sanity-check matches.
- pd.concat([...]) stacks DataFrames vertically.
Three-way join
Given orders, customers, and products tables, build a fully enriched DataFrame containing customer name and product name on each order.
09.8 – Applying functions – apply, map, vectorisation
Run a function across every row or every cell, and when to skip it.
What you'll learn
- Use .apply() on a Series and a DataFrame
- Use .map() for simple value-to-value substitutions
- Prefer vectorised operations when possible
The hierarchy: vectorised > apply > loop
The fastest pandas code uses built-in vectorised operations. Reach for apply when there's no built-in, and only loop manually as a last resort.
Vectorised – the goal
df["total"] = df["price"] * df["quantity"] * (1 + df["tax_rate"])
.map() – value-to-value lookup on a Series
tier_label = {"A": "Gold", "B": "Silver", "C": "Bronze"}
df["tier_name"] = df["tier"].map(tier_label)
.apply() on a Series
def classify(amount):
if amount > 1000: return "Big"
if amount > 100: return "Medium"
return "Small"
df["bucket"] = df["amount"].apply(classify)
.apply() on a DataFrame (row-wise)
def full_label(row):
return f"{row['region']}/{row['product']}"
df["label"] = df.apply(full_label, axis=1)
The vectorised replacement
Almost anything you'd write with apply has a faster vectorised form:
# Slower
df["bucket"] = df["amount"].apply(classify)
# Faster
import numpy as np
df["bucket"] = np.select(
[df["amount"] > 1000, df["amount"] > 100],
["Big", "Medium"],
default="Small")
💡 The lazy rule
Use apply when you're prototyping. Switch to vectorised when the data is large and performance matters.
Key takeaways
- Vectorised pandas is fast; apply is slow; manual loops are slowest.
- .map(dict) is the cleanest way to translate values.
- .apply(func, axis=1) runs func(row) on each row.
Tier classification
Given a column of customer LTV values, add a tier column: Gold > 5000, Silver > 1000, Bronze otherwise. Solve with apply; then rewrite with np.select.
Module 10 – Data Cleaning and Transformation
Fix dates that won't sort, names with stray spaces, half-empty columns, and every other mess that real data throws at you.
10.1 – Handling missing values
NaN, dropna, fillna – what's missing and what to do about it.
What you'll learn
- Detect missing values
- Drop rows or columns with too many gaps
- Fill missing values sensibly
Spotting missing values
df.isna().sum() # missing count per column
df.isna().mean() * 100 # percentage per column
df[df["email"].isna()] # rows where email is missing
Dropping
df.dropna() # drop any row with any NaN
df.dropna(subset=["email", "customer"]) # only drop if those cols are NaN
df.dropna(axis=1) # drop columns with NaN
df.dropna(thresh=5) # keep rows with ≥5 non-null values
Filling
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna("Unknown")
# Fill different columns differently
df = df.fillna({"age": 0, "country": "Unknown", "score": df["score"].mean()})
Forward / back fill (time series)
df["price"] = df["price"].ffill() # carry last known forward
df["price"] = df["price"].bfill() # use the next known backward
Walkthrough: clean a survey export
Inspect first
df.isna().sum().sort_values(ascending=False).head(10)
Drop unusable rows
df = df.dropna(subset=["respondent_id", "submitted_at"])
Fill the rest with sensible defaults
df["age"] = df["age"].fillna(df["age"].median())
df["region"] = df["region"].fillna("Unknown")
df["score"] = df["score"].fillna(0)
Key takeaways
- isna().sum() shows you where the holes are.
- dropna(subset=[...]) is safer than blanket dropna.
- fillna() can take a single value, a dict, or a forward/back fill.
Missing-value audit
Load any messy CSV. Print a table of column name, missing count, and missing percentage, sorted by worst first.
10.2 – Finding and removing duplicates
duplicated, drop_duplicates, and how to decide which copy to keep.
What you'll learn
- Find duplicate rows
- Drop duplicates by all or some columns
- Pick which duplicate to keep (first/last/by ranking)
Find them
df.duplicated().sum() # how many full-row dupes
df.duplicated(subset=["email"]).sum() # dupes by email
df[df.duplicated(subset=["email"], keep=False)] # show all dupe groups
Drop them
df.drop_duplicates() # exact-row dupes
df.drop_duplicates(subset=["email"]) # by email, keep first
df.drop_duplicates(subset=["email"], keep="last")
Keep the "best" duplicate
If a customer has multiple rows and you want the most recent:
df = (df.sort_values("date")
.drop_duplicates(subset=["email"], keep="last"))
Walkthrough: dedupe a contact list keeping the highest-value entry
Sort by what makes "best" mean something
df = df.sort_values("lifetime_value", ascending=False)
Drop, keeping the first (= the highest LTV)
df = df.drop_duplicates(subset=["email"], keep="first")
Key takeaways
- duplicated() returns a boolean Series; drop_duplicates() drops them.
- subset= lets you check duplicates by chosen columns only.
- Sort first, then drop_duplicates with keep="first" to keep the "best" one.
Dedupe orders
Given an orders table, dedupe by order_id, keeping the row with the latest last_modified.
10.3 – String cleanup at scale
str.strip, str.lower, str.replace, regex – apply to a whole column.
What you'll learn
- Apply string methods to whole columns
- Use regex via .str.replace
- Split a column into multiple columns
The .str accessor
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["country"] = df["country"].str.upper()
Anywhere you'd call a string method on one string, you call .str.method() on a column.
Replace
df["amount_text"] = df["amount_text"].str.replace(",", "")
df["amount"] = df["amount_text"].astype(float)
Regex replace
# Keep only digits
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
# Strip parentheses and hyphens
df["phone"] = df["phone"].str.replace(r"[()\-\s]", "", regex=True)
Split a column
# "Alice Smith" β first / last
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)
Extract with regex
# pull domain from email
df["domain"] = df["email"].str.extract(r"@(.+)$")
Walkthrough: clean a column of free-text phone numbers
Start
df["phone"]
# "(415) 555-1212"
# " 415.555.1213 "
# "415-555-1214 ext.500"
Normalise
df["phone_clean"] = (df["phone"].str.strip()
.str.replace(r"\D", "", regex=True)
.str[:10]) # keep first 10 digits
Key takeaways
- .str.method() applies any string method to a whole column.
- regex=True unlocks powerful pattern-based replacement.
- .str.split(..., expand=True) splits one column into many.
Email cleanup
Given a column of messy emails, produce a clean lowercase column AND a new column containing just the domain.
10.4 – Working with dates in pandas
to_datetime, .dt accessor, date arithmetic, resampling.
What you'll learn
- Convert string columns to datetime
- Pull year/month/day/weekday from a date column
- Resample a time series
Parse first
df["date"] = pd.to_datetime(df["date"], errors="coerce")
errors="coerce" turns unparsable strings into NaT (missing date) instead of crashing.
Pull parts out
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dow"] = df["date"].dt.day_name() # 'Saturday'
df["q"] = df["date"].dt.to_period("Q") # PeriodIndex 2026Q2
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()
Date arithmetic
df["days_open"] = (df["closed"] - df["opened"]).dt.days
df["due_date"] = df["created"] + pd.Timedelta(days=30)
df[df["date"] >= "2026-01-01"]
Resampling (group by time bucket)
monthly = (df.set_index("date")["amount"]
.resample("M")
.sum())
weekly = df.set_index("date")["amount"].resample("W").mean()
Walkthrough: monthly revenue trend
Parse and group
df["date"] = pd.to_datetime(df["date"])
monthly = (df.groupby(df["date"].dt.to_period("M"))["amount"]
.sum()
.reset_index())
monthly["date"] = monthly["date"].astype(str)
Save / plot
monthly.to_excel("monthly_revenue.xlsx", index=False)
Key takeaways
- pd.to_datetime(..., errors="coerce") is the safe parse.
- .dt gives you year, month, day, day_name, to_period.
- .resample("M") on a datetime index gives quick time-bucket aggregates.
Weekly average
From a transactions DataFrame with date + amount, compute the average daily revenue per week of the year.
10.5 – Numeric cleanup – currencies, percentages, outliers
Strip $ and %, clip outliers, fix obvious typos.
What you'll learn
- Convert currency-formatted strings to floats
- Convert percent-formatted strings to floats
- Cap or remove outliers
Currency strings → floats
df["amount"] = (df["amount"].str.replace("[$,]", "", regex=True)
.astype(float))
Percent strings → floats
# "12.5%" → 0.125
df["margin"] = (df["margin"].str.replace("%", "")
.astype(float) / 100)
Capping outliers (winsorise)
lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount_capped"] = df["amount"].clip(lower=lo, upper=hi)
Removing outliers
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)
clean = df[mask]
Catching obvious typos
# Negative ages, ages over 120, etc.
clean = df.query("0 < age < 120")
Key takeaways
- Strip $ and , with a regex, then .astype(float).
- .clip(lower, upper) caps without removing rows.
- IQR rule (Q1 − 1.5·IQR, Q3 + 1.5·IQR) catches typical outliers.
Cap and report
Cap the top 1% of values in a numeric column. Print how many rows were affected.
10.6 – Reshape – melt and pivot
Turn wide to long and back. The shape-shifting move you'll thank yourself for.
What you'll learn
- Use melt() to turn wide data long
- Use pivot() to turn long data wide
- Recognise when each shape is the right one
Wide vs long
Wide has one row per "thing" and many columns:
region Q1 Q2 Q3 Q4
North 1200 1500 1300 1700
South 900 1100 1050 950
Long has one row per observation:
region quarter amount
North Q1 1200
North Q2 1500
North Q3 1300
North Q4 1700
South Q1 900
...
Long is friendlier for grouping/plotting; wide is friendlier for humans.
Wide → long with melt
long = df.melt(id_vars="region",
value_vars=["Q1","Q2","Q3","Q4"],
var_name="quarter",
value_name="amount")
Long → wide with pivot
wide = long.pivot(index="region", columns="quarter", values="amount")
When to reshape
- Going to a chart? Long is what seaborn and plotly want.
- Going to a printable report? Wide reads better.
- Doing groupby? Long.
Key takeaways
- melt() turns wide to long.
- pivot() turns long to wide.
- Long shape is what most analysis tools prefer.
Quarterly to long
Take a wide DataFrame of quarterly figures by region. Melt it into long form. Then pivot it back.
10.7 – Categoricals and ordered types
Save memory and enable ordered comparisons.
What you'll learn
- Convert a column to categorical
- Define an ordered categorical
- Use the savings to handle bigger datasets
Why categoricals
If a column has 5 million rows but only 4 distinct values ("North", "South", "East", "West"), storing it as a categorical can cut its memory use by 90% and speed up groupby.
df["region"] = df["region"].astype("category")
Ordered categoricals
from pandas.api.types import CategoricalDtype
size_order = CategoricalDtype(categories=["XS","S","M","L","XL"], ordered=True)
df["size"] = df["size"].astype(size_order)
df.sort_values("size") # sorts in size order, not alphabetical
df[df["size"] >= "L"] # comparisons work
Memory check
df.memory_usage(deep=True)
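A quick sketch to see the saving for yourself (synthetic data; exact numbers will vary):
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(["North", "South", "East", "West"], size=1_000_000))
print(f"object:   {s.memory_usage(deep=True):,} bytes")
print(f"category: {s.astype('category').memory_usage(deep=True):,} bytes")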
Key takeaways
- Repeating-string columns benefit hugely from .astype("category").
- Ordered categoricals let you sort and compare logically.
- Useful when your DataFrame starts pushing your laptop's memory.
Memory diet
Take any DataFrame with a low-cardinality text column. Convert it to categorical. Print before/after memory usage.
Module 11 – Data Visualization
Matplotlib basics, pandas plotting, seaborn for prettier defaults, and how to drop charts into Excel.
11.1 – Matplotlib basics
fig, ax, plot, title, save. The foundation everything else builds on.
What you'll learn
- Create a figure and axes
- Plot lines, bars, and scatter
- Save the chart to PNG
The basic shape
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot([1, 2, 3, 4], [10, 25, 20, 35])
ax.set_title("My first chart")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("chart.png", dpi=150)
plt.show()
Every chart you make will follow this skeleton: build a figure, draw onto axes, label, save/show.
Three chart types in three lines
ax.plot(x, y) # line
ax.bar(categories, values) # bar
ax.scatter(x, y) # scatter
Style sample
ax.plot(x, y, color="#217346", linewidth=2, marker="o")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False) # cleaner look
Walkthrough: monthly trend
Data
months = ["Jan","Feb","Mar","Apr","May","Jun"]
revenue = [12, 15, 18, 14, 22, 27]
Plot
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, revenue, marker="o", color="#2b6cb0", linewidth=2)
ax.set_title("Monthly Revenue (in thousands)")
ax.set_ylabel("$k")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False)
plt.tight_layout()
plt.savefig("monthly_revenue.png", dpi=150)
Key takeaways
- fig, ax = plt.subplots(figsize=(w, h)) is the starting point.
- ax.plot / ax.bar / ax.scatter are the three you'll use most.
- Always set a title and axis labels. Save with plt.savefig(..., dpi=150).
Bar chart
Make a bar chart of the same monthly revenue data. Make Jan and Feb grey, the other months green.
11.2 – Plotting directly from a DataFrame
df.plot() – the fastest way from data to picture.
What you'll learn
- Use df.plot() to chart a Series
- Plot multiple columns at once
- Pick the right kind (line/bar/area/hist)
The one-liner
df.set_index("date")["amount"].plot(figsize=(8, 4), title="Daily Revenue")
Multiple columns
df.set_index("date")[["amount", "forecast"]].plot()
Kinds
df["amount"].plot(kind="line")
df["amount"].plot(kind="hist", bins=30)
df["region"].value_counts().plot(kind="bar")
df["region"].value_counts().plot(kind="pie") # use sparingly!
df.plot(kind="scatter", x="qty", y="amount")
Grouped bars from a pivot
pivot = df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum")
pivot.plot(kind="bar")
Stacked bars
pivot.plot(kind="bar", stacked=True)
Key takeaways
- df.plot() is the quickest path from data to chart.
- kind= picks the chart type.
- For grouped/stacked bars, pivot first then plot.
Side-by-side bars
Pivot orders by region (rows) and quarter (columns). Plot the result as a grouped bar chart.
11.3 – Seaborn – prettier defaults, less code
When you want a chart that looks publication-ready out of the box.
What you'll learn
- Use seaborn's most common charts
- Take advantage of automatic styling
- Plot from long-form DataFrames
Setup
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
Bar chart
sns.barplot(data=df, x="region", y="amount", estimator="sum", errorbar=None)
Scatter with colour-coding
sns.scatterplot(data=df, x="qty", y="amount", hue="region")
Distribution
sns.histplot(data=df, x="amount", hue="region", multiple="stack")
Heatmap
pivot = df.pivot_table(index="dow", columns="hour",
values="orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu", annot=False)
Faceted small-multiples
g = sns.FacetGrid(df, col="region", col_wrap=2)
g.map_dataframe(sns.lineplot, x="date", y="amount")
💡 Seaborn loves long data
Seaborn assumes your data is in long form (one row per observation). If your data is wide, melt it first (Module 10, Lesson 6), as in the sketch below.
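A minimal sketch of that melt-then-plot flow (tiny made-up table):
import pandas as pd
import seaborn as sns

wide = pd.DataFrame({"region": ["North", "South"],
                     "Q1": [1200, 900], "Q2": [1500, 1100]})
long = wide.melt(id_vars="region", var_name="quarter", value_name="amount")
sns.barplot(data=long, x="quarter", y="amount", hue="region")  # hue= does the grouping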
Key takeaways
- Seaborn = matplotlib with prettier defaults and fewer lines.
- Use long-form DataFrames; let hue= handle grouping.
- Heatmaps and facets are seaborn superpowers Excel can't easily match.
Hour-of-day heatmap
Given an orders table with date+hour, build a heatmap of total orders by day-of-week × hour-of-day.
11.4 – Formatting for stakeholders
Removing chart-junk, labelling totals, and saving at the right DPI.
What you'll learn
- Remove unnecessary chart-junk
- Add data labels on top of bars
- Save at print quality
Less is more
ax.spines[["top","right"]].set_visible(False)
ax.grid(axis="y", alpha=0.3)
ax.set_xlabel("") # X axis label often redundant
ax.tick_params(left=False) # hide ticks
Labels on bars
bars = ax.bar(months, revenue, color="#217346")
for bar, value in zip(bars, revenue):
ax.text(bar.get_x() + bar.get_width()/2,
bar.get_height() + 0.5,
f"${value:,.0f}",
ha="center", va="bottom", fontsize=9)
Save
plt.tight_layout()
plt.savefig("revenue.png", dpi=200, bbox_inches="tight")
Colours for accessibility
About 1 in 12 people has red-green colour blindness. When using colour as a meaningful signal, prefer a colour-blind-safe palette:
sns.set_palette("colorblind")
Key takeaways
- Drop the chart-junk – top/right spines, redundant labels.
- Label totals on bars; don't make the reader squint.
- Save at dpi=200 for slides / print.
- Use colour-blind-safe palettes.
Polish a chart
Take your monthly-revenue bar chart from the matplotlib lesson. Remove the spines. Add dollar-value labels on top of each bar. Save at 200 dpi.
11.5 – Inserting charts into Excel
Three ways: paste, =PY(), and the openpyxl image insert.
What you'll learn
- Save a chart as PNG and insert it into a sheet
- Use =PY() to embed a chart directly
- Build a sheet with multiple charts in code
Option 1 – Paste a PNG manually
Save the chart, then in Excel: Insert → Pictures → from file. Fastest for one-offs.
Option 2 – =PY() chart
Covered in Module 8. Best for charts that should refresh when the underlying Excel data changes.
Option 3 – Programmatic with openpyxl
import pandas as pd
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
# Generate the chart
df = pd.read_csv("monthly.csv")
fig, ax = plt.subplots(figsize=(7, 3))
df.plot(x="month", y="revenue", ax=ax, marker="o")
plt.tight_layout()
fig.savefig("monthly.png", dpi=150)
plt.close(fig)
# Insert into the workbook
df.to_excel("report.xlsx", index=False, sheet_name="Data")
wb = load_workbook("report.xlsx")
ws = wb.create_sheet("Chart")
img = Image("monthly.png")
ws.add_image(img, "B2")
wb.save("report.xlsx")
Walkthrough: monthly report builder
The plan
One script. Reads a CSV. Writes a 2-sheet Excel: a Data sheet and a Charts sheet with three charts.
The pattern
def make_chart(df, kind, filename, title):
fig, ax = plt.subplots(figsize=(7, 3))
df.plot(kind=kind, ax=ax, color="#217346")
ax.set_title(title)
plt.tight_layout()
fig.savefig(filename, dpi=150)
plt.close(fig)
Then call it three times and insert each via openpyxl.
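A sketch of those three calls plus the insert step (column names and anchor cells are assumptions, not part of the original script):
from openpyxl import load_workbook
from openpyxl.drawing.image import Image

make_chart(df.groupby("region")["revenue"].sum(), "bar", "by_region.png", "Revenue by Region")
make_chart(df.set_index("month")["revenue"], "line", "trend.png", "Monthly Trend")
make_chart(df["revenue"], "hist", "dist.png", "Order Distribution")

wb = load_workbook("report.xlsx")
ws = wb.create_sheet("Charts")
for i, png in enumerate(["by_region.png", "trend.png", "dist.png"]):
    ws.add_image(Image(png), f"B{2 + i * 18}")  # stack the charts roughly 18 rows apart
wb.save("report.xlsx")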
Key takeaways
- One-offs: save PNG, paste into Excel.
- Live: =PY() chart that refreshes.
- Bulk reports: openpyxl add_image in code.
Multi-chart workbook
Build a script that produces an Excel report with a Data sheet, a Bar Chart sheet, and a Line Chart sheet β all from one CSV input.
11.6 – Tiny dashboards with subplots
Multiple charts on one figure – your first "dashboard".
What you'll learn
- Build a 2×2 grid of charts
- Share axes appropriately
- Save the whole thing as one image
The subplots grid
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
ax_a, ax_b = axes[0]
ax_c, ax_d = axes[1]
ax_a.plot(x, y1); ax_a.set_title("Trend")
ax_b.bar(cats, vals); ax_b.set_title("By Category")
ax_c.scatter(x, y2); ax_c.set_title("Scatter")
ax_d.hist(values, bins=20); ax_d.set_title("Distribution")
plt.suptitle("Q1 Dashboard", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("dashboard.png", dpi=200)
Shared axes
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
Tip: give each panel breathing room
fig.subplots_adjust(hspace=0.4, wspace=0.3)
Key takeaways
- plt.subplots(rows, cols) gives you a grid of axes.
- sharey=True aligns scales for fair comparison.
- One suptitle(), one tight_layout(), save once.
Three-up dashboard
Make a 1×3 dashboard with: monthly revenue line, top-5 product bar, and a distribution histogram. Save as one PNG.
Module 12 – Automating Boring Excel Tasks
Merge 100 files, rename sheets, format dozens of workbooks, send personalised emails – set it once and walk away.
12.1 – Merging many files into one
The everyday "I have a folder of monthly reports" job.
What you'll learn
- Combine every file in a folder
- Tag each row with its source
- Handle a folder of mixed-format files gracefully
The script
from pathlib import Path
import pandas as pd
SRC = Path("monthly")
DST = Path("combined.xlsx")
frames = []
for path in sorted(SRC.glob("*.xlsx")):
df = pd.read_excel(path)
df["source_file"] = path.name
frames.append(df)
print(f" loaded {path.name}: {len(df):,} rows")
combined = pd.concat(frames, ignore_index=True)
combined.to_excel(DST, index=False, sheet_name="All")
print(f"Wrote {len(combined):,} rows to {DST}")
Mixed CSV + XLSX
def read_any(p):
if p.suffix == ".csv":
return pd.read_csv(p)
return pd.read_excel(p)
frames = [read_any(p) for p in SRC.iterdir() if p.suffix in (".csv", ".xlsx")]
Skip broken files
frames = []
errors = []
for path in SRC.glob("*"):
try:
df = read_any(path)
df["source_file"] = path.name
frames.append(df)
except Exception as e:
errors.append((path.name, str(e)))
print(f"Processed: {len(frames)}, Errors: {len(errors)}")
for name, msg in errors:
print(f" {name}: {msg}")
Key takeaways
- folder.glob("*.xlsx") + pd.concat() is the merge skeleton.
- Tag every row with source_file – you'll always want it later.
- Wrap in try/except so one broken file doesn't stop the whole job.
Build a merger
Put 3-5 small CSVs in a folder. Write a script that merges them into one Excel file with a "source" column.
12.2 – Splitting one file into many
Group by a column and save one workbook per group.
What you'll learn
- Iterate over groupby groups
- Write each to a named file
- Create the output folder if needed
The pattern
from pathlib import Path
import pandas as pd
df = pd.read_excel("all_orders.xlsx")
out = Path("per_region"); out.mkdir(exist_ok=True)
for region, group in df.groupby("region"):
safe = region.replace("/", "_")
group.to_excel(out / f"{safe}.xlsx", index=False)
print(f" wrote {safe}.xlsx ({len(group):,} rows)")
One workbook with one sheet per group
with pd.ExcelWriter("by_region.xlsx") as writer:
for region, group in df.groupby("region"):
safe = region.replace("/", "_")[:31] # Excel sheet name limit
group.to_excel(writer, sheet_name=safe, index=False)
⚠️ Sheet name limit
Excel sheet names cap at 31 characters and can't contain : \ / ? * [ ]. Always sanitise.
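A tiny helper along those lines (a sketch – safe_sheet_name is our own name, not an openpyxl function):
import re

def safe_sheet_name(name):
    # swap out every character Excel forbids, then trim to the 31-char cap
    return re.sub(r"[\\/?*\[\]:]", "_", str(name))[:31]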
Key takeaways
- df.groupby("col") yields (key, group_df) tuples.
- One file per group: loop and .to_excel().
- One workbook, many sheets: pd.ExcelWriter + sheet_name=.
Split a big workbook
Take a multi-region orders file and produce one workbook per salesperson, each containing only their rows.
12.3 – Filling a template workbook
Open a pre-formatted .xlsx, drop data into specific cells, save with a new name.
What you'll learn
- Load an Excel template with openpyxl
- Write values into named cells
- Save the filled copy with a unique name
Why templates beat from-scratch reports
Excel templates can have logos, frozen panes, formulas, colours, named ranges – all the spreadsheety polish. With openpyxl you can open the template, plug in values, and save a copy – keeping every bit of formatting.
The script
from openpyxl import load_workbook
from datetime import date
import pandas as pd
template = "templates/invoice.xlsx"
orders = pd.read_csv("orders_to_invoice.csv")
for _, row in orders.iterrows():
wb = load_workbook(template)
ws = wb.active
ws["B2"] = f"Invoice {row['invoice_no']}"
ws["B3"] = row["customer"]
ws["B4"] = row["billing_address"]
ws["B5"] = date.today().isoformat()
ws["E10"] = row["quantity"]
ws["E11"] = row["unit_price"]
ws["E13"] = row["tax_rate"]
# formulas in the template will recalc when the file is opened
out = f"invoices/{row['invoice_no']}.xlsx"
wb.save(out)
print(f" wrote {out}")
Use named ranges to make this readable
In Excel: Formulas → Define Name. Then in Python, resolve the name to its sheet and cell (openpyxl 3.1+):
name = wb.defined_names["CustomerName"]   # the defined name
sheet, coord = next(name.destinations)    # e.g. ("TemplateSheet", "$B$3")
wb[sheet][coord.replace("$", "")] = "Acme Ltd"
(For most everyday cases, the ws["B3"] style is fine.)
Key takeaways
- Templates keep formatting; openpyxl just stamps data into specific cells.
- Loop a DataFrame, one template-copy per row.
- Formulas in the template still work – they recalc on open.
Receipt printer
Build a tiny Excel "receipt" template with a logo and a totals formula. Loop a small DataFrame and generate one receipt per row.
12.4 – Renaming and organising files in bulk
pathlib.rename, glob, regex – your weekend file-cleanup script.
What you'll learn
- Bulk-rename files with a regex
- Move files into dated subfolders
- Always preview before renaming
Rename pattern
from pathlib import Path
import re
folder = Path("downloads")
for p in folder.glob("Report_*.xlsx"):
new_name = re.sub(r"Report_(\d{4})(\d{2})(\d{2})",
r"report-\1-\2-\3",
p.stem) + p.suffix
print(f"{p.name} β {new_name}")
# p.rename(p.with_name(new_name)) # uncomment when ready
⚠️ Always dry-run first
Print the planned renames in a first pass. Only enable the actual p.rename() after you've eyeballed the output.
Move into dated folders
from datetime import datetime
for p in folder.glob("*.csv"):
mtime = datetime.fromtimestamp(p.stat().st_mtime)
target = folder / mtime.strftime("%Y-%m")
target.mkdir(exist_ok=True)
p.rename(target / p.name)
Key takeaways
- Always dry-run renames before doing them.
- Use regex for structured filename patterns.
- Sort by modification date to organise into month folders.
Downloads cleaner
Sketch a script (don't enable rename yet) that moves every .pdf in your Downloads folder into Downloads/PDFs/ and every .xlsx into Downloads/Spreadsheets/.
12.5 – Sending emails from Python
smtplib for the standard library way; better libraries for Gmail/Outlook.
What you'll learn
- Send a plain-text email with smtplib
- Attach a file
- Use environment variables for credentials
The standard library way
import smtplib
import os
from email.message import EmailMessage
msg = EmailMessage()
msg["Subject"] = "Monthly Report"
msg["From"] = "you@yourdomain.com"
msg["To"] = "boss@yourdomain.com"
msg.set_content("Hi β attached is the monthly report.\n\nThanks,\nYou")
# Attach a file
with open("report.xlsx", "rb") as f:
msg.add_attachment(
f.read(),
maintype="application",
subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
filename="report.xlsx",
)
with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
smtp.send_message(msg)
print("Sent.")
⚠️ Never paste passwords in code
Use environment variables (os.environ["EMAIL_PASS"]), a .env file loaded with python-dotenv, or a system secret manager. Never check credentials into version control.
Gmail and Outlook today
Big providers now require app-specific passwords or OAuth. The cleanest routes:
- Gmail: enable 2-factor authentication, generate an app password, use it instead of your real password.
- Outlook 365: use the O365 library (pip install O365), which wraps the Microsoft Graph API.
Personalised mass email
from pathlib import Path  # for Path(...).name on the attachment below

for _, row in customers.iterrows():
msg = EmailMessage()
msg["Subject"] = f"Your {row['month']} statement"
msg["From"] = "you@yourdomain.com"
msg["To"] = row["email"]
msg.set_content(f"""Hi {row['first_name']},
Your statement for {row['month']} is attached.
Best,
You""")
with open(row["attachment"], "rb") as f:
msg.add_attachment(f.read(),
maintype="application",
subtype="pdf",
filename=Path(row["attachment"]).name)
with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
smtp.send_message(msg)
Key takeaways
- smtplib sends mail; EmailMessage formats it.
- Use app passwords or OAuth – never your real password.
- Personalised mass-mail = loop over a DataFrame, one message per row.
Test send
Send yourself a test email with an attached CSV. Confirm it lands. Then do the same for three rows of a small DataFrame.
12.6 – Scheduling: run it automatically
Make a script run on its own – cron (Mac/Linux), Task Scheduler (Windows), or a Python scheduler library.
What you'll learn
- Schedule a script daily on Mac
- Schedule a script daily on Windows
- Know when to use a Python scheduler instead
Mac and Linux – cron
In Terminal: crontab -e. Add a line:
# minute hour day month weekday command
0 8 * * 1-5 /opt/anaconda3/bin/python /Users/you/scripts/morning_report.py >> ~/cron.log 2>&1
This runs at 08:00 on weekdays.
Windows – Task Scheduler
- Start menu → Task Scheduler → Create Task.
- Triggers → Daily, 8:00 AM.
- Actions → Start a program → python.exe; arguments: C:\path\to\morning_report.py.
- Save.
The script must be self-sufficient
A scheduled script can't ask the user for input. Hard-code paths or read them from environment variables / a config file, as in the sketch below.
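A minimal sketch, assuming REPORT_SRC / REPORT_DST are environment variable names you define yourself:
import os
from pathlib import Path

# Fall back to hard-coded defaults when the variables aren't set
SRC = Path(os.environ.get("REPORT_SRC", "/Users/you/data/orders.csv"))
DST = Path(os.environ.get("REPORT_DST", "/Users/you/reports"))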
Logging – so you know it ran
import logging
from datetime import datetime

start = datetime.now()  # so the final log line can report the duration
logging.basicConfig(
    filename="morning_report.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s",
)
logging.info("Starting morning report")
# ... do work ...
logging.info(f"Completed in {(datetime.now()-start).total_seconds():.1f}s")
Python-native scheduling
For a long-running script that schedules itself:
import schedule, time
def job():
print("Running daily report")
schedule.every().day.at("08:00").do(job)
while True:
schedule.run_pending()
time.sleep(60)
Useful inside Docker containers; for laptop use, cron / Task Scheduler is simpler.
Key takeaways
- cron on Mac/Linux; Task Scheduler on Windows.
- Scheduled scripts must run without prompts – hard-code or config-file the inputs.
- Always log; you want to know if it silently failed.
Schedule a 'hello'
Schedule a tiny Python script (one line: append today's date to a log file) to run daily. Confirm it ran tomorrow.
12.7 – An automation recipe book
Eight common boring-Excel jobs and the 20-line scripts that automate them.
What you'll learn
- Recognise which of these matches your daily work
- Adapt the recipe to your own data
- Save them in a shared script folder
1 – Add today's date to every filename in a folder
from pathlib import Path
from datetime import date
stamp = date.today().isoformat()
for p in Path("inbox").glob("*.xlsx"):
p.rename(p.with_name(f"{stamp}_{p.name}"))
2 – Find every sheet across a folder of workbooks containing a keyword
from openpyxl import load_workbook
matches = []
for p in Path("archive").rglob("*.xlsx"):
wb = load_workbook(p, read_only=True, data_only=True)
for ws in wb.worksheets:
for row in ws.iter_rows(values_only=True):
if any(c == "URGENT" for c in row if isinstance(c, str)):
matches.append((p.name, ws.title))
break
print(matches)
3 – Roll a year of weekly reports into one DataFrame
import pandas as pd
frames = [pd.read_excel(p) for p in Path("weekly_2026").glob("week_*.xlsx")]
pd.concat(frames, ignore_index=True).to_excel("weekly_2026_combined.xlsx", index=False)
4 – Find rows that exist this month but didn't last month (new customers)
this_m = set(pd.read_excel("apr.xlsx")["email"])
last_m = set(pd.read_excel("mar.xlsx")["email"])
new = this_m - last_m
print(f"{len(new)} new customers")
5 – Convert all CSVs in a folder to Excel
for p in Path(".").glob("*.csv"):
pd.read_csv(p).to_excel(p.with_suffix(".xlsx"), index=False)
6 – Rename a sheet across every workbook in a folder
for p in Path("reports").glob("*.xlsx"):
wb = load_workbook(p)
if "Sheet1" in wb.sheetnames:
wb["Sheet1"].title = "Data"
wb.save(p)
7 – Email the boss a CSV every Monday morning
# See Lesson 5 – combine with Lesson 6 scheduling.
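A sketch of the glue, assuming send_report() is your own wrapper around the Lesson 5 smtplib code and the script runs daily via Lesson 6:
from datetime import date

if date.today().weekday() == 0:   # Monday is weekday 0
    send_report()                 # hypothetical wrapper around the Lesson 5 send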
8 – Build a daily PDF dashboard
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
df.groupby("region")["amount"].sum().plot(kind="bar", ax=ax)
ax.set_title(f"Revenue {date.today():%Y-%m-%d}")
fig.savefig(f"dashboards/{date.today():%Y-%m-%d}.pdf")
Key takeaways
- Most "automate the boring" jobs are 5-20 lines.
- Once written, schedule them and forget them.
- Build a personal recipe book – your future self will thank you.
Adapt one
Pick the recipe closest to a real task you do. Spend 30 minutes adapting it to your own data and folder.
Module 13 – Scenario: Finance & Accounting
Reconciliations, variance analysis, budget consolidation, invoice processing, and the small daily wins that compound.
This module follows a fictional accountant – call her Priya – through a real month. Each lesson is a story: a problem on Priya's desk, the Excel-only way, and the Python solution that saves an hour. Copy each script. Adapt it to your numbers.
13.1 – User story: bank reconciliation
Match GL transactions to a bank statement. Find the unmatched.
What you'll learn
- Load two tables and align them
- Use merge with indicator to find unmatched rows
- Handle near-matches (same amount, different date)
The story
Priya gets two files at month-end: gl_april.xlsx (the general ledger from the accounting system) and bank_april.csv (the bank statement). She has to find every transaction on one side that isn't on the other and investigate it.
In Excel: copy both into a workbook, build a helper column of date+amount, VLOOKUP across, scroll through #N/A results. Two hours.
In Python: 25 lines. Five minutes. Below.
The script
import pandas as pd
from pathlib import Path
gl = pd.read_excel("gl_april.xlsx")
bank = pd.read_csv("bank_april.csv")
# Normalise both sides to the same key
def keyify(df, date_col, amount_col):
df = df.copy()
df["date"] = pd.to_datetime(df[date_col]).dt.date
df["amount"] = df[amount_col].round(2)
df["key"] = df["date"].astype(str) + "_" + df["amount"].astype(str)
return df
gl_k = keyify(gl, "TxnDate", "Amount")
bank_k = keyify(bank, "PostDate", "Amount")
merged = gl_k.merge(bank_k, on="key", how="outer",
suffixes=("_gl", "_bank"), indicator=True)
unmatched_gl = merged[merged["_merge"] == "left_only"]
unmatched_bank = merged[merged["_merge"] == "right_only"]
with pd.ExcelWriter("reconciliation_april.xlsx") as w:
unmatched_gl.to_excel(w, sheet_name="GL only", index=False)
unmatched_bank.to_excel(w, sheet_name="Bank only", index=False)
merged[merged["_merge"]=="both"].to_excel(w, sheet_name="Matched", index=False)
print(f"GL only: {len(unmatched_gl)}")
print(f"Bank only: {len(unmatched_bank)}")
Handling near-matches
If amounts can be off by a small fee, allow a tolerance:
def near_match(amt_a, amt_b, tol=0.50):
return abs(amt_a - amt_b) <= tol
# For each unmatched GL row, search bank rows within Β±3 days for a near-amount match.
for _, gl_row in unmatched_gl.iterrows():
candidates = bank[
(bank["PostDate"] >= gl_row["TxnDate"] - pd.Timedelta(days=3)) &
(bank["PostDate"] <= gl_row["TxnDate"] + pd.Timedelta(days=3)) &
(bank["Amount"].apply(lambda a: near_match(a, gl_row["Amount"])))
]
if len(candidates) == 1:
print(f"Possible fuzzy match for GL {gl_row['TxnId']} β Bank {candidates.iloc[0]['BankRef']}")
Video walkthrough script (for recording)
Show both raw files in Excel
"Here's the GL. Here's the bank. Look β same month, but different columns and different date formats. Reconciling these manually is a nightmare."
Open Jupyter, load both
"pd.read_excel for the GL, pd.read_csv for the bank. Two lines."
Build the match key
"Both sides get a date and a 2-decimal amount, joined as a string. That's the merge key."
Outer merge with indicator
"This is the magic line. how='outer' keeps everything from both sides; indicator=True tells us which side each row came from."
Split into three sheets and save
"Matched, GL only, Bank only β each gets its own sheet. The file lands ready to email to the controller."
Key takeaways
- The reconciliation skeleton: build a merge key, outer-merge with indicator, split by _merge.
- Tolerance match handles bank fees and tiny rounding differences.
- One script, multiple sheets in the output — ready for the controller.
Your own recon
Build a recon between two of your own CSVs. The match key can be whatever uniquely identifies a transaction in both systems (txn ID, date+amount, invoice number).
13.2 — User story: budget vs actual variance
Compare actuals to budget by department, flag the worrying lines.
What you'll learn
- Join budget and actual data
- Compute variance and percent variance
- Flag and sort by exposure
The story
Priya needs a one-page variance report for the CFO. Every department, every GL account, budget vs actual, with the worst variances at the top.
The script
import pandas as pd
budget = pd.read_excel("budget_2026.xlsx") # cols: dept, account, month, budget
actual = pd.read_excel("actual_2026.xlsx") # cols: dept, account, month, actual
merged = budget.merge(actual,
on=["dept", "account", "month"],
how="outer").fillna(0)
merged["variance"] = merged["actual"] - merged["budget"]
merged["variance_pct"] = (merged["variance"] / merged["budget"].replace(0, pd.NA)) * 100
# Flag the worrying lines
def flag(row):
if abs(row["variance"]) > 50_000: return "MATERIAL"
if abs(row["variance_pct"] or 0) > 20: return "HIGH-PCT"
return ""
merged["flag"] = merged.apply(flag, axis=1)
# Sort: flagged first ("MATERIAL" > "HIGH-PCT" > ""), then by absolute variance
report = (merged.assign(abs_var=merged["variance"].abs())
          .sort_values(["flag", "abs_var"], ascending=[False, False])
          .drop(columns="abs_var"))
report.to_excel("variance_april.xlsx", index=False)
Make the output executive-friendly
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
wb = load_workbook("variance_april.xlsx")
ws = wb.active
# Headers: bold, green
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = PatternFill("solid", fgColor="217346")
# Highlight MATERIAL rows
red = PatternFill("solid", fgColor="FEE2E2")
for row in ws.iter_rows(min_row=2):
if row[ws.max_column - 1].value == "MATERIAL":
for cell in row:
cell.fill = red
# Number formats: currency for budget/actual/variance (cols D-F), percent for variance_pct (col G)
for col in "DEF":
    for cell in ws[col][1:]:
        cell.number_format = '"$"#,##0;[Red]("$"#,##0)'
for cell in ws["G"][1:]:
    cell.number_format = '0.0"%"'
# Freeze pane + widths
ws.freeze_panes = "A2"
for col, w in zip("ABCDEFGH", [10, 12, 10, 14, 14, 14, 10, 14]):
ws.column_dimensions[col].width = w
wb.save("variance_april.xlsx")
Walkthrough for video
Open both source files
"Budget. Actual. Same shape but different numbers."
One merge, one subtraction
"merge(on=[dept, account, month]) joins them. Two assignments compute variance and percent variance."
Apply the flag
"Anything over $50k absolute or 20% relative gets a label."
Save and style
"Save with pandas, restyle with openpyxl. Red rows for the lines that matter."
Key takeaways
- Variance reports are merge + arithmetic + sort.
- Always flag both absolute and percentage variance — a 50% variance on $200 doesn't matter.
- A styling pass at the end makes it CFO-ready.
Add a sparkline column
Extend the report: add a "trend" column with the last six months of actuals as a comma-separated list. (Hint: groupby dept+account, take last 6 months, agg into a list.)
13.3 — User story: consolidating subsidiary budgets
Six subsidiaries send six workbooks, slightly differently shaped. Combine them.
What you'll learn
- Load workbooks with mismatched columns
- Normalise to a single schema
- Produce a consolidated rollup
The story
Six subsidiaries send a budget workbook every quarter. Each is slightly different — different column orders, different account naming, sometimes an extra sheet. Priya needs one consolidated rollup.
Step 1 — load and inspect
from pathlib import Path
import pandas as pd
files = list(Path("subs_q2").glob("*.xlsx"))
for p in files:
print(p.name)
df = pd.read_excel(p, sheet_name="Budget")
print(" cols:", list(df.columns))
print(" rows:", len(df))
Step 2 — write a per-file normaliser
COL_MAP = {
# canonical_name : list of possible source names
"subsidiary": ["Sub", "Subsidiary", "Entity", "Company"],
"account": ["Account", "GL Account", "Ledger Acct", "AcctName"],
"month": ["Month", "Period", "MonthEnd"],
"amount": ["Amount", "Budget", "USD", "BudgetUSD"],
}
def rename(df):
out = {}
for canon, options in COL_MAP.items():
for opt in options:
if opt in df.columns:
out[opt] = canon
break
return df.rename(columns=out)
def normalise(df, sub_default):
df = rename(df)
if "subsidiary" not in df.columns:
df["subsidiary"] = sub_default
df["month"] = pd.to_datetime(df["month"]).dt.to_period("M").astype(str)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
return df[["subsidiary", "account", "month", "amount"]]
Step 3 — combine
frames = []
for p in files:
raw = pd.read_excel(p, sheet_name="Budget")
frames.append(normalise(raw, sub_default=p.stem))
combined = pd.concat(frames, ignore_index=True)
combined = combined.dropna(subset=["account", "amount"])
Step 4 — rollups
by_sub_account = (combined.groupby(["subsidiary", "account"])["amount"]
.sum()
.reset_index())
monthly_by_sub = combined.pivot_table(
index="subsidiary", columns="month",
values="amount", aggfunc="sum", fill_value=0,
margins=True, margins_name="Total")
with pd.ExcelWriter("consolidated_q2.xlsx") as w:
combined.to_excel(w, sheet_name="Raw", index=False)
    by_sub_account.to_excel(w, sheet_name="By Sub × Account", index=False)
    monthly_by_sub.to_excel(w, sheet_name="By Sub × Month")
💡 The COL_MAP pattern
A column-mapping dictionary makes new subsidiaries painless to onboard — just add their column names to the options list. The rest of the pipeline doesn't change.
Key takeaways
- Consolidation = load + normalise + concat + rollup.
- A COL_MAP dict absorbs the "everyone names things differently" problem.
- Output multiple sheets: raw, summary, pivot.
Add a sub
Add a seventh subsidiary's file with brand-new column names. Update COL_MAP. Re-run. Confirm nothing else needed to change.
13.4 — User story: invoice batch processing
Read a folder of supplier invoices (Excel), validate, post to the GL feed.
What you'll learn
- Read many invoices from a folder
- Validate each (totals, tax, supplier exists)
- Produce a clean GL import file plus an exceptions report
The story
Suppliers send Excel invoices into a shared folder. Priya checks each, keys them into the GL. She wants this scripted.
Layout
Each invoice has a fixed structure: supplier in B2, invoice no in B3, date in B4, subtotal/tax/total in E10/E11/E12, line items in B6:D8.
The script
import pandas as pd
from openpyxl import load_workbook
from pathlib import Path
valid_suppliers = set(pd.read_excel("master_data.xlsx", sheet_name="Suppliers")["name"])
TAX_RATE = 0.08
rows, errors = [], []
for path in Path("inbox").glob("*.xlsx"):
wb = load_workbook(path, data_only=True)
ws = wb.active
supplier = ws["B2"].value
inv_no = ws["B3"].value
inv_date = ws["B4"].value
subtotal = ws["E10"].value
tax = ws["E11"].value
total = ws["E12"].value
# Validate
if supplier not in valid_suppliers:
errors.append((path.name, f"Unknown supplier: {supplier}")); continue
    if subtotal is None or tax is None or total is None:
errors.append((path.name, "Missing totals")); continue
expected_tax = round(subtotal * TAX_RATE, 2)
if abs(tax - expected_tax) > 0.05:
errors.append((path.name, f"Tax mismatch: {tax} vs expected {expected_tax}"))
continue
if abs(total - (subtotal + tax)) > 0.05:
errors.append((path.name, "Total β subtotal+tax")); continue
rows.append({
"invoice_no": inv_no,
"supplier": supplier,
"date": inv_date,
"subtotal": subtotal,
"tax": tax,
"total": total,
"source": path.name,
})
gl_feed = pd.DataFrame(rows)
gl_feed.to_csv("gl_feed_april.csv", index=False)
if errors:
    pd.DataFrame(errors, columns=["file", "error"]).to_csv("invoice_errors.csv", index=False)
    print(f"⚠️ {len(errors)} errors — see invoice_errors.csv")
print(f"✅ Posted {len(gl_feed)} clean invoices")
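The script above reads only the header cells. If you also need the B6:D8 line items, here is a minimal sketch (it reuses the imports above; the output filename is illustrative):
line_rows = []
for path in Path("inbox").glob("*.xlsx"):
    ws = load_workbook(path, data_only=True).active
    inv_no = ws["B3"].value
    for row in ws["B6:D8"]:                 # each row: (description, qty, unit_price)
        desc, qty, price = (c.value for c in row)
        if desc:                            # skip blank line-item rows
            line_rows.append({"invoice_no": inv_no, "desc": desc,
                              "qty": qty, "unit_price": price})
pd.DataFrame(line_rows).to_csv("gl_lines_april.csv", index=False)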
Move processed files
from pathlib import Path
done = Path("processed"); done.mkdir(exist_ok=True)
for r in rows:
    (Path("inbox") / r["source"]).rename(done / r["source"])
Key takeaways
- Validate at the door — supplier exists, tax math checks, totals tie.
- Produce both a clean output and an exceptions report.
- Move successfully-processed files out of the inbox so re-runs don't double-post.
Add a duplicate check
Extend the script: refuse to post an invoice number that's already in the GL feed from previous runs.
13.5 — User story: accounts receivable aging
Bucket open invoices into Current, 1-30, 31-60, 61-90, and 90+.
What you'll learn
- Compute days since due date
- Bucket with pd.cut
- Produce a pivot by customer Γ age band
The story
Every Monday the controller asks for the AR aging. Priya wants it ready by 9 AM.
The script
import pandas as pd
from datetime import date
ar = pd.read_excel("ar_open.xlsx") # cols: customer, invoice_no, due_date, balance
today = pd.Timestamp(date.today())
ar["due_date"] = pd.to_datetime(ar["due_date"])
ar["days_overdue"] = (today - ar["due_date"]).dt.days.clip(lower=0)
bins = [-1, 0, 30, 60, 90, 99999]
labels = ["Current", "1-30", "31-60", "61-90", "90+"]
ar["bucket"] = pd.cut(ar["days_overdue"], bins=bins, labels=labels)
pivot = ar.pivot_table(index="customer", columns="bucket",
values="balance", aggfunc="sum",
fill_value=0, observed=True)
pivot["Total"] = pivot.sum(axis=1)
pivot = pivot.sort_values("Total", ascending=False)
pivot.to_excel("ar_aging.xlsx")
Highlight worst offenders
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
wb = load_workbook("ar_aging.xlsx")
ws = wb.active
red = PatternFill("solid", fgColor="FECACA")
yellow = PatternFill("solid", fgColor="FEF3C7")
for row in ws.iter_rows(min_row=2):
over90 = row[5].value or 0 # column F = "90+"
over60 = row[4].value or 0
if over90 > 0:
for cell in row: cell.fill = red
elif over60 > 0:
for cell in row: cell.fill = yellow
wb.save("ar_aging.xlsx")
Key takeaways
- pd.cut with explicit bins gives clean labelled buckets.
- Pivot with fill_value=0 avoids holes in the table.
- openpyxl styling at the end makes the report scannable.
Add aging by salesperson
Modify to also pivot by salesperson (one sheet per salesperson) so each can chase their own delinquents.
13.6 — User story: multi-currency consolidation
Convert a multi-currency P&L to USD using daily FX rates.
What you'll learn
- Merge a transactions table with an FX rates table
- Apply the correct daily rate to each transaction
- Roll up to USD totals
The story
Priya consolidates P&L from offices in London (GBP), Tokyo (JPY), and Berlin (EUR). The CFO wants USD totals using the rate on the date of each transaction.
The data
import pandas as pd
tx = pd.read_excel("multi_ccy_tx.xlsx")
# date, office, account, amount, currency
rates = pd.read_csv("fx_rates_2026.csv")
# date, currency, rate_to_usd
The join
tx["date"] = pd.to_datetime(tx["date"])
rates["date"] = pd.to_datetime(rates["date"])
merged = tx.merge(rates, on=["date", "currency"], how="left")
missing = merged["rate_to_usd"].isna().sum()
print(f"Transactions with no rate: {missing}")
# If date doesn't match exactly (weekend), forward-fill from previous business day
if missing > 0:
rates_indexed = rates.set_index(["currency", "date"]).sort_index()
def lookup(row):
try:
sub = rates_indexed.loc[row["currency"]]
avail = sub[sub.index <= row["date"]]
return avail.iloc[-1]["rate_to_usd"] if len(avail) else None
except KeyError:
return None
merged.loc[merged["rate_to_usd"].isna(), "rate_to_usd"] = merged[merged["rate_to_usd"].isna()].apply(lookup, axis=1)
merged["usd_amount"] = (merged["amount"] * merged["rate_to_usd"]).round(2)
Rollup
usd_pnl = (merged.groupby(["office", "account"])["usd_amount"]
.sum()
.unstack(fill_value=0))
usd_pnl["Total"] = usd_pnl.sum(axis=1)
usd_pnl.loc["Total"] = usd_pnl.sum(numeric_only=True)
usd_pnl.to_excel("pnl_usd.xlsx")
Key takeaways
- Multi-currency conversion = merge on (date, currency) + multiply.
- Forward-fill the previous business day's rate for weekend transactions.
- Roll up to totals at the end so the CFO has one number.
Sensitivity
Compute the same P&L using month-average rates instead of daily. By how much does the USD total differ? That difference is your FX translation risk.
13.7 — Project: the monthly close pack
Combine the previous lessons into one script that produces the entire close pack.
What you'll learn
- Orchestrate multiple analyses in one script
- Build a single output Excel with one sheet per analysis
- Add a cover sheet with summary KPIs
The story
Month-end. Priya wants one script to run that produces the whole close pack: a cover sheet, reconciliation, variance, AR aging, FX-translated P&L. One file. One double-click.
The skeleton
"""monthly_close.py β generate the month-end close pack."""
from datetime import date
from pathlib import Path
import pandas as pd
# 1) Recon
def make_recon():
# ... reuse code from Lesson 1
return matched, gl_only, bank_only
# 2) Variance
def make_variance():
# ... reuse code from Lesson 2
return variance_df
# 3) AR aging
def make_aging():
# ... reuse code from Lesson 5
return aging_pivot
# 4) FX P&L
def make_pnl():
# ... reuse code from Lesson 6
return pnl_df
def make_cover(matched, gl_only, bank_only, variance, aging, pnl):
return pd.DataFrame({
"Metric": [
"Period",
"GL transactions",
"Bank transactions",
"GL only", "Bank only",
"Material variances",
"AR over 60 days",
"Total revenue (USD)",
],
"Value": [
f"{date.today():%B %Y}",
len(matched) + len(gl_only),
len(matched) + len(bank_only),
len(gl_only), len(bank_only),
(variance["flag"] == "MATERIAL").sum(),
aging["61-90"].sum() + aging["90+"].sum(),
pnl.loc["Total", "Total"],
],
})
def main():
matched, gl_only, bank_only = make_recon()
variance = make_variance()
aging = make_aging()
pnl = make_pnl()
cover = make_cover(matched, gl_only, bank_only, variance, aging, pnl)
out = f"close_pack_{date.today():%Y-%m}.xlsx"
with pd.ExcelWriter(out) as w:
cover.to_excel(w, sheet_name="Cover", index=False)
matched.to_excel(w, sheet_name="Recon-Matched", index=False)
gl_only.to_excel(w, sheet_name="Recon-GL only", index=False)
        bank_only.to_excel(w, sheet_name="Recon-Bank only", index=False)
variance.to_excel(w, sheet_name="Variance", index=False)
aging.to_excel(w, sheet_name="AR Aging")
pnl.to_excel(w, sheet_name="USD P&L")
print(f"β
Wrote {out}")
if __name__ == "__main__":
main()
Schedule it
Set this script to run automatically on the 1st of every month (Module 12, Lesson 6). Wake up to a finished close pack in your inbox.
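A minimal sketch of the schedule, assuming a Unix-like machine with cron (on Windows, use Task Scheduler as covered in Module 12, Lesson 6); the interpreter and script paths are placeholders:
# crontab entry (minute hour day-of-month month weekday): 07:00 on the 1st of each month
0 7 1 * * /usr/bin/python3 /path/to/monthly_close/monthly_close.py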
Key takeaways
- Compose previous scripts into one orchestrator with a main().
- Each analysis is its own function; main() just glues them.
- A cover sheet with KPIs makes the pack scannable in 30 seconds.
Email it
Extend the script: after building the file, email it to the controller using the email recipe from Module 12, Lesson 5.
Module 14 — Scenario: Data & Business Analyst
Cleaning huge exports, joining tables, building dashboards, ad-hoc analyses, and shipping the results.
Meet Marcus, a business analyst at a mid-sized retailer. His job is roughly 'turn data into decisions'. Here's how Python earns its keep on his typical week.
14.1 — User story: cleaning a 4-million-row export
When the CSV is too big for Excel itself.
What you'll learn
- Open a multi-million-row CSV with pandas
- Filter and sample before doing heavy work
- Save a clean subset back to Excel
The story
Marcus gets a 4M-row CSV of transactions. Excel stops at 1,048,576 rows and drops the rest. He needs all of it cleaned and pivoted.
Step 1 — peek before loading the whole thing
import pandas as pd
sample = pd.read_csv("transactions.csv", nrows=1000)
sample.info() # columns + dtypes
sample.head()
Step 2 — load with the right options
df = pd.read_csv("transactions.csv",
usecols=["date", "store_id", "sku", "qty", "revenue"],
dtype={"store_id": "category", "sku": "category"},
parse_dates=["date"])
print(df.shape) # (4_000_000, 5)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
usecols drops unused columns at load time. dtype="category" on repeating strings cuts memory by ~90%.
Step 3 — filter early
# Last 90 days only
recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=90)]
print(len(recent)) # ~1M
Step 4 — aggregate down to something Excel-friendly
by_store_sku = (recent.groupby(["store_id", "sku"], observed=True)
.agg(units=("qty", "sum"),
revenue=("revenue", "sum"))
.reset_index()
.sort_values("revenue", ascending=False))
by_store_sku.head(20).to_excel("top_skus_per_store.xlsx", index=False)
If even pandas struggles — chunked read
aggregates = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000, parse_dates=["date"]):
chunk = chunk[chunk["date"] >= "2026-01-01"]
aggregates.append(chunk.groupby("store_id")["revenue"].sum())
total = pd.concat(aggregates).groupby(level=0).sum()
print(total)
Key takeaways
- usecols + dtype="category" makes huge files manageable.
- Filter and aggregate early; only the small result needs to land in Excel.
- chunksize lets you stream files bigger than RAM.
Your own big file
Find a CSV bigger than 100k rows in your work. Load it with usecols and dtype. Print memory savings.
14.2 — User story: joining six tables to answer one question
Stars, snowflakes, and the merge dance.
What you'll learn
- Plan a multi-table join
- Verify match rates at every step
- Avoid the silent row-explosion of a bad join
The story
"Who are our top customers in the Northeast, what did they buy last quarter, and what was their average margin?" Six tables: customers, orders, order_items, products, regions, costs.
Map it out first
customers ──(customer_id)── orders ──(order_id)── order_items ──(sku)── products ──(sku)── costs
customers ──(region_id)── regions
Load and check sizes
cu = pd.read_csv("customers.csv")
o = pd.read_csv("orders.csv", parse_dates=["order_date"])
oi = pd.read_csv("order_items.csv")
p = pd.read_csv("products.csv")
r = pd.read_csv("regions.csv")
co = pd.read_csv("costs.csv")
for name, df in [("cu",cu),("o",o),("oi",oi),("p",p),("r",r),("co",co)]:
print(name, df.shape)
Join one step at a time, with indicator
cu_r = cu.merge(r, on="region_id", how="left", indicator=True)
print(cu_r["_merge"].value_counts()) # all should be 'both'
cu_r = cu_r.drop(columns="_merge")
o_cu = o.merge(cu_r, on="customer_id", how="left", indicator=True)
print(o_cu["_merge"].value_counts())
o_cu = o_cu.drop(columns="_merge")
# Items per order — this is the row-multiplying join
items = oi.merge(o_cu, on="order_id", how="inner")
print("rows after items join:", len(items))
# Add product + cost
items = items.merge(p, on="sku", how="left").merge(co, on="sku", how="left")
⚠️ The row explosion
If your join key has duplicates on both sides, you get a Cartesian product. Check len(after) ≤ len(before) × max_dupes. If not, debug.
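A quick guard you can wrap around any step: a sketch with a hypothetical safe_merge helper, assuming a single join-key column:
import pandas as pd
def safe_merge(left, right, on, **kwargs):
    # Worst case: every left row matches every duplicate of its key on the right
    max_dupes = right[on].value_counts().max() if len(right) else 1
    out = left.merge(right, on=on, **kwargs)
    assert len(out) <= len(left) * max_dupes, (
        f"Row explosion on {on!r}: {len(left)} rows -> {len(out)}")
    return out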
Answer the question
last_q = items[items["order_date"] >= "2026-01-01"].copy()  # .copy() so the assignments below don't warn
last_q["revenue"] = last_q["qty"] * last_q["unit_price"]
last_q["margin"] = last_q["revenue"] - last_q["qty"] * last_q["unit_cost"]
ne = last_q[last_q["region_name"] == "Northeast"]
top = (ne.groupby(["customer_id", "customer_name"])
.agg(orders = ("order_id", "nunique"),
revenue = ("revenue", "sum"),
margin = ("margin", "sum"))
.assign(margin_pct = lambda d: (d["margin"] / d["revenue"] * 100).round(1))
.sort_values("revenue", ascending=False)
.head(20)
.reset_index())
top.to_excel("top20_ne_q1.xlsx", index=False)
Key takeaways
- Diagram joins before writing them.
- Use indicator=True on each step the first time you write the script.
- Watch for row explosion on many-to-many joins.
Reverse it
Same data, different question: which products have the highest margin% but the lowest sales volume? (Hidden gems to push.)
14.3 — User story: customer cohort retention
Group customers by signup month, track their retention over time.
What you'll learn
- Compute a cohort label per customer
- Build a cohort Γ period retention matrix
- Render it as a heatmap
The story
"How are our newer customers doing compared to older ones?" The classic cohort retention chart.
Step 1 — label each customer
orders["order_month"] = orders["order_date"].dt.to_period("M")
first_order_month = (orders.groupby("customer_id")["order_month"].min()
.rename("cohort"))
orders = orders.merge(first_order_month, left_on="customer_id", right_index=True)
orders["months_since_join"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)
Step 2 — build the cohort matrix
cohort_counts = (orders.groupby(["cohort", "months_since_join"])["customer_id"]
.nunique()
.unstack(fill_value=0))
cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0).round(3) * 100
Step 3 — heatmap
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(retention.iloc[:, :12], # first 12 months
annot=True, fmt=".0f",
cmap="YlGn", cbar_kws={"label": "Retention %"},
ax=ax)
ax.set_title("Cohort Retention β % of cohort active in each month after signup")
ax.set_xlabel("Months after signup")
ax.set_ylabel("Cohort (signup month)")
plt.tight_layout()
plt.savefig("cohort_retention.png", dpi=200)
Key takeaways
- Cohort = group by signup period; track by months since signup.
- .divide(..., axis=0) turns counts into percentages.
- A heatmap makes retention patterns obvious at a glance.
Revenue cohort
Change the analysis: instead of retention %, show average revenue per active customer by cohort Γ months-since-signup.
14.4 — User story: did the change work? A quick A/B test
Two groups, one metric, was the difference real or noise?
What you'll learn
- Run a two-sample t-test
- Interpret a p-value plainly
- Build a confidence interval for the lift
The story
Marketing tried a new landing page on half the traffic. The new version has 4.2% conversion vs the old 3.8%. Is that a real win or just luck?
The data
import pandas as pd
df = pd.read_csv("ab_test.csv") # visitor_id, group, converted (0/1)
print(df.groupby("group")["converted"].agg(["count","mean","sum"]))
Quick check with scipy
from scipy import stats
a = df.loc[df["group"]=="control", "converted"]
b = df.loc[df["group"]=="treatment", "converted"]
result = stats.ttest_ind(b, a, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
What the p-value really says
p = the probability of seeing a difference at least this big, if there was no real effect. p < 0.05 by convention means "this is unlikely to be noise." Not "definitely real." Not "important." Just "unlikely noise."
Confidence interval for the lift
import numpy as np
p_a, n_a = a.mean(), len(a)
p_b, n_b = b.mean(), len(b)
lift = p_b - p_a
se = (p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b) ** 0.5
ci = (lift - 1.96*se, lift + 1.96*se)
print(f"Lift: {lift*100:+.2f} pp ({ci[0]*100:+.2f}, {ci[1]*100:+.2f})")
The confidence interval is more useful than the p-value — it tells you the size of the effect, not just "is it real."
Quick visual
import matplotlib.pyplot as plt
groups = df.groupby("group")["converted"].mean()
errs = df.groupby("group")["converted"].apply(lambda s: 1.96 * (s.mean()*(1-s.mean())/len(s))**0.5)
fig, ax = plt.subplots(figsize=(5, 4))
groups.plot(kind="bar", yerr=errs, capsize=6, ax=ax, color=["#94a3b8","#217346"])
ax.set_title("Conversion rate by group (95% CI)")
ax.set_ylabel("Conversion %")
plt.tight_layout()
plt.savefig("ab_test.png", dpi=200)
Key takeaways
- Two-sample t-test is the workhorse for "is A different from B?"
- p-value: probability of seeing this if there's no real effect.
- Confidence intervals communicate size, not just significance.
Underpowered?
Re-run the analysis with only the first 1,000 rows of each group. How does the CI change? You've just seen why we need lots of data for small effects.
14.5 — User story: trend, seasonality, and 'is this Tuesday weird?'
Decompose a time series. Spot anomalies.
What you'll learn
- Resample daily data to weekly
- Decompose into trend + seasonality
- Flag anomaly days
The story
Daily sessions on the website have a clear weekly rhythm. Marcus needs to spot days that are unusually high or low after accounting for that rhythm.
Decomposition
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
df = pd.read_csv("daily_sessions.csv", parse_dates=["date"]).set_index("date").sort_index()
decomp = seasonal_decompose(df["sessions"], model="additive", period=7)
decomp.plot()
plt.tight_layout()
plt.savefig("decomposition.png", dpi=150)
Anomaly flag
# Residual = actual - trend - seasonal
resid = decomp.resid.dropna()
std = resid.std()
df["expected"] = (decomp.trend + decomp.seasonal)
df["anomaly"] = (resid.abs() > 2 * std)
anomalies = df[df["anomaly"]].sort_values("sessions", ascending=False)
print(anomalies)
Plot with anomalies highlighted
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df["sessions"], lw=0.8)
ax.plot(df.index, df["expected"], lw=1.2, color="#217346", label="expected")
mask = df["anomaly"].fillna(False)
ax.scatter(df.index[mask], df["sessions"][mask], color="red", s=20, zorder=3, label="anomaly")
ax.legend()
plt.tight_layout()
plt.savefig("anomalies.png", dpi=150)
Key takeaways
- seasonal_decompose splits a series into trend, seasonality, and residual.
- An anomaly is just a residual that's more than ~2 standard deviations from zero.
- Match period to the cycle: period=7 for daily data with a weekly rhythm, period=12 for monthly data with yearly seasonality, period=24 for hourly data with a daily cycle.
Find your anomalies
Take any daily metric you track (web sessions, support tickets, orders). Decompose it. List the anomalous days from the past 90.
14.6 — User story: a refreshing dashboard sheet
Build an Excel dashboard that auto-updates when the source data changes.
What you'll learn
- Build multiple analyses against a single source
- Use =PY() so the dashboard refreshes on data change
- Lay out the dashboard for at-a-glance reading
The story
Marcus's manager wants a single Excel file she can open every morning and see: today's revenue, week-on-week, top 5 products, top 5 stores, anomaly count.
The structure
- Sheet 1: Data — the raw transactions, refreshed nightly.
- Sheet 2: Dashboard — 5 =PY() cells, each producing one tile.
The =PY() cells
# Tile 1: Yesterday's revenue
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
y = df[df["date"].dt.date == (pd.Timestamp.today().date() - pd.Timedelta(days=1))]
f"${y['revenue'].sum():,.0f}"
)
# Tile 2: Week-on-week
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
this7 = df[df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=7)]["revenue"].sum()
prev7 = df[(df["date"] < pd.Timestamp.today() - pd.Timedelta(days=7)) &
(df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=14))]["revenue"].sum()
f"{(this7/prev7 - 1)*100:+.1f}%"
)
# Tile 3: Top 5 products
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("product")["revenue"].sum().nlargest(5)
)
# Tile 4: Top 5 stores
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("store")["revenue"].sum().nlargest(5)
)
# Tile 5: Anomaly count
=PY(
df = xl("Data!A1:F100000", headers=True)
daily = df.groupby(pd.to_datetime(df["date"]).dt.date)["revenue"].sum()
mu, sd = daily.mean(), daily.std()
int(((daily - mu).abs() > 2*sd).sum())
)
Lay it out
Put the small tiles (revenue, WoW, anomaly count) in big bold cells across the top. Put the Top 5 lists in the middle. Add a single chart at the bottom: =PY() returning a line chart of the last 30 days.
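A sketch for that chart cell, assuming the same Data layout (Python in Excel typically preloads pandas as pd and matplotlib.pyplot as plt):
# Chart tile: revenue line for the last 30 days (the cell returns a figure)
=PY(
df = xl("Data!A1:F100000", headers=True)
daily = df.groupby(pd.to_datetime(df["date"]).dt.date)["revenue"].sum().tail(30)
fig, ax = plt.subplots(figsize=(8, 2))
daily.plot(ax=ax)
ax.set_title("Revenue: last 30 days")
fig
)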
💡 Refresh discipline
The dashboard auto-recalculates when the Data sheet changes. Tell stakeholders to only edit the Data sheet and nothing else.
Key takeaways
- Dashboards are a stack of =PY() tiles, each producing one number or one table.
- Keep the data in a separate sheet so the dashboard layout is stable.
- Aim for one screen — if you have to scroll, it's a report, not a dashboard.
Add a tile
Add a sixth tile: yesterday's conversion rate vs the trailing 30-day average. Format as "X.XX% (Y.Y pp vs avg)".
14.7 — User story: shipping the analysis — narrative report
Turn an analysis notebook into a one-page narrative the boss will actually read.
What you'll learn
- Choose the one chart that tells the story
- Structure a memo: TL;DR, evidence, recommendation
- Export to PDF or DOCX
The story
Marcus finished his analysis. Now he has to convince the VP to act on it.
The 'analyst memo' template
- TL;DR (3 sentences). What you found, what it means, what you recommend.
- The headline chart. One image that proves the TL;DR.
- How you measured it. Data, method, limits, in plain English.
- What to do next. Specific, time-bound, owned.
Auto-generate a PDF memo
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
with PdfPages("memo_q1.pdf") as pdf:
# Page 1 β title + TL;DR
fig, ax = plt.subplots(figsize=(8.5, 11))
ax.axis("off")
    ax.text(0.05, 0.95, "Q1 Northeast Customers — Memo", fontsize=20, weight="bold", va="top")
ax.text(0.05, 0.85,
"TL;DR β The top 20 NE customers drove 62% of revenue but their\n"
"margin% trails the national average by 6 points. We recommend\n"
"a pricing review on the three SKUs that account for 80% of the gap.",
fontsize=12, va="top")
pdf.savefig(fig); plt.close(fig)
# Page 2 β the chart
fig, ax = plt.subplots(figsize=(8.5, 11))
# ... build the chart
pdf.savefig(fig); plt.close(fig)
Or generate a Word memo
from docx import Document
from docx.shared import Inches
doc = Document()
doc.add_heading("Q1 Northeast Customers — Memo", 0)
doc.add_paragraph("TL;DR — The top 20 NE customers drove 62% of revenue...")
doc.add_picture("ne_margins.png", width=Inches(6))
doc.add_heading("Recommendation", level=2)
doc.add_paragraph("Pricing review on SKUs 12345, 23456, 34567 by end of June.")
doc.save("memo_q1.docx")
Key takeaways
- Analysis without a recommendation is just trivia.
- One chart, one number, one ask — that's the memo.
- Python can generate the PDF/DOCX itself; the same script can rebuild it monthly.
Write your own memo
Take your last analysis. Distill it into a TL;DR, one chart, and a recommendation. Hand it to a colleague and ask: "did this make me sound certain enough?"
Module 15 — Scenario: HR & Operations
Headcount, payroll prep, timesheet cleanup, employee data hygiene, and the dozen small reports that pile up.
Meet Tomas, an HR coordinator. He's drowning in spreadsheets — exports from the HRIS, manager surveys, timesheet dumps. Here's how Python turns each one into a quiet five-minute job.
15.1 — User story: monthly headcount report
Active employees by department, with hires/leavers.
What you'll learn
- Compute active headcount as of any date
- Identify hires and leavers in a month
- Format the rollup for the leadership pack
The story
Tomas needs a monthly headcount file: active count by department on month-end, plus hires and leavers in the month, plus turnover %.
The data
import pandas as pd
emp = pd.read_excel("employees.xlsx")
# emp cols: employee_id, name, department, hire_date, term_date (NaT if active)
emp["hire_date"] = pd.to_datetime(emp["hire_date"])
emp["term_date"] = pd.to_datetime(emp["term_date"], errors="coerce")
Active as of any date
def active_as_of(df, asof):
asof = pd.Timestamp(asof)
return df[(df["hire_date"] <= asof) & ((df["term_date"].isna()) | (df["term_date"] > asof))]
eom = pd.Timestamp("2026-04-30")
active = active_as_of(emp, eom)
print(f"Headcount on {eom:%Y-%m-%d}: {len(active)}")
By department
hc_by_dept = active.groupby("department").size().rename("active").reset_index()
hc_by_dept = hc_by_dept.sort_values("active", ascending=False)
Hires and leavers in the month
month_start = pd.Timestamp("2026-04-01")
hires = emp[(emp["hire_date"] >= month_start) & (emp["hire_date"] <= eom)]
leavers = emp[(emp["term_date"] >= month_start) & (emp["term_date"] <= eom)]
events = pd.DataFrame({
"department": hc_by_dept["department"],
"hires": hires.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
"leavers": leavers.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
})
events["turnover_pct"] = (events["leavers"] / hc_by_dept["active"] * 100).round(1)
summary = hc_by_dept.merge(events, on="department")
summary.loc["Total"] = [
"Total", summary["active"].sum(), summary["hires"].sum(),
summary["leavers"].sum(),
round(summary["leavers"].sum() / summary["active"].sum() * 100, 1),
]
summary.to_excel(f"headcount_{eom:%Y-%m}.xlsx", index=False)
Key takeaways
- A point-in-time headcount = filter where hire ≤ asof and (term IS NULL or term > asof).
- Monthly events = hires/terms with date in the month.
- Turnover% = leavers / active.
Year-to-date version
Build a "YTD" version: rolling headcount at each month-end of the year, in one wide table.
15.2 — User story: payroll prep
Take the timesheet dump, validate it, output the payroll-ready file.
What you'll learn
- Aggregate hours per employee per pay period
- Apply overtime rules
- Output a payroll-system-friendly file
The data
ts = pd.read_csv("timesheets.csv", parse_dates=["date"])
# columns: employee_id, date, hours, project
Validate
# No more than 16 hours in a day
bad = ts.groupby(["employee_id","date"])["hours"].sum()
bad = bad[bad > 16]
if len(bad):
print(f"β οΈ {len(bad)} employee-days over 16 hours")
print(bad)
Aggregate to pay period (e.g., bi-weekly)
ts["pay_period"] = ts["date"].dt.to_period("2W-SUN") # bi-weekly ending Sunday
daily = ts.groupby(["employee_id","date"])["hours"].sum().reset_index()
# Apply 40h/week overtime rule
daily["week"] = daily["date"].dt.to_period("W-SUN")
weekly = (daily.groupby(["employee_id","week"])["hours"]
.sum()
.reset_index())
weekly["regular_hrs"] = weekly["hours"].clip(upper=40)
weekly["overtime_hrs"] = (weekly["hours"] - 40).clip(lower=0)
Roll up to pay period
weekly["pay_period"] = weekly["week"].apply(lambda w: pd.Period(w.start_time, freq="2W-SUN"))
payroll = (weekly.groupby(["employee_id","pay_period"])
.agg(regular = ("regular_hrs", "sum"),
overtime= ("overtime_hrs", "sum"))
.reset_index())
payroll["pay_period"] = payroll["pay_period"].astype(str)
payroll.to_csv("payroll_import.csv", index=False)
print(f"Wrote payroll for {payroll['employee_id'].nunique()} employees")
Key takeaways
- Validate before you compute β bad rows in the timesheet ruin the payroll.
- Overtime is a weekly cap, even if pay periods are bi-weekly.
- Always tag the output with employee_id + period.
Pay rate
Merge the payroll output with an employees table containing hourly rates. Compute total pay (regular×rate + overtime×rate×1.5).
15.3 — User story: employee data hygiene
Names with stray spaces, missing managers, duplicate IDs.
What you'll learn
- Profile a master employee file
- Catch the common dirty-data patterns
- Produce a 'to fix' list for the HRIS
Profile first
def profile(df):
return pd.DataFrame({
"dtype": df.dtypes.astype(str),
"missing": df.isna().sum(),
"missing_pct": (df.isna().mean()*100).round(1),
"distinct": df.nunique(),
})
print(profile(emp))
Common HRIS dirty patterns
issues = []
# Stray whitespace
for col in ["first_name", "last_name", "email"]:
bad = emp[col].astype(str).str.strip() != emp[col].astype(str)
issues += [(rowid, col, "whitespace") for rowid in emp.index[bad]]
# Mixed case emails
bad = emp["email"].astype(str) != emp["email"].astype(str).str.lower()
issues += [(r, "email", "mixed case") for r in emp.index[bad]]
# Missing manager
issues += [(r, "manager_id", "missing") for r in emp.index[emp["manager_id"].isna()]]
# Manager who doesn't exist
mgr_ids = set(emp["employee_id"])
bad = ~emp["manager_id"].isin(mgr_ids) & emp["manager_id"].notna()
issues += [(r, "manager_id", "non-existent") for r in emp.index[bad]]
# Duplicate IDs
dup = emp[emp.duplicated("employee_id", keep=False)]
issues += [(r, "employee_id", "duplicate") for r in dup.index]
fix = pd.DataFrame(issues, columns=["row_index", "column", "issue"]).merge(
emp[["employee_id", "first_name", "last_name"]],
left_on="row_index", right_index=True
)
fix.to_excel("hris_fix_list.xlsx", index=False)
print(f"β οΈ {len(fix)} issues to fix")
Key takeaways
- A profile() function tells you 80% of what's broken in 5 seconds.
- Track issues row-by-row so HR can fix them in the system, not the spreadsheet.
- Run this monthly — quality drifts.
Email validity
Add a check: flag emails that don't match a basic name@domain.tld pattern. Hint: str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").
15.4 — User story: build the org chart programmatically
From a flat employee table to a parent-child hierarchy.
What you'll learn
- Walk a parent-pointer hierarchy
- Count direct and indirect reports
- Render the org tree as text
The data
emp = pd.read_excel("employees.xlsx")[["employee_id","name","manager_id","title"]]
Build a parent-children map
from collections import defaultdict
children = defaultdict(list)
for _, r in emp.iterrows():
if pd.notna(r["manager_id"]):
children[r["manager_id"]].append(r["employee_id"])
names = dict(zip(emp["employee_id"], emp["name"] + " — " + emp["title"]))
def print_tree(emp_id, depth=0):
print(" " * depth + "β’ " + names[emp_id])
for kid in children[emp_id]:
print_tree(kid, depth + 1)
# Find roots (no manager) and print
roots = emp[emp["manager_id"].isna()]["employee_id"]
for r in roots:
print_tree(r)
Count direct + indirect reports
def total_reports(emp_id):
direct = children[emp_id]
return len(direct) + sum(total_reports(c) for c in direct)
emp["direct_reports"] = emp["employee_id"].apply(lambda e: len(children[e]))
emp["total_reports"] = emp["employee_id"].apply(total_reports)
emp.sort_values("total_reports", ascending=False).head(10)
Key takeaways
- Org chart = flat table with a parent-pointer column.
- defaultdict(list) makes the children-of-X lookup trivial.
- Recursion is the natural way to count indirect reports.
Span check
Flag managers with more than 12 direct reports (broad span) or fewer than 3 (too narrow).
15.5 — User story: engagement survey analysis
Turn 800 free-text + Likert responses into one summary file.
What you'll learn
- Score Likert columns and roll up by team
- Word-count the open-ended responses
- Output a summary deck-ready set of charts
The data
survey = pd.read_excel("engagement_2026.xlsx")
# cols: respondent_id, team, q1_likert (1-5), q2_likert ..., q10_open (free text)
Score the Likerts
likert_cols = [c for c in survey.columns if c.endswith("_likert")]
by_team = (survey.groupby("team")[likert_cols].mean().round(2))
by_team["overall"] = by_team.mean(axis=1).round(2)
by_team.to_excel("survey_scores_by_team.xlsx")
# Distribution per question
dist = survey.melt(id_vars="team", value_vars=likert_cols,
var_name="question", value_name="score")
dist_pct = (dist.groupby(["question","score"]).size()
.unstack(fill_value=0)
.pipe(lambda d: d.div(d.sum(axis=1), axis=0) * 100)
.round(1))
Free text: top words
import re
from collections import Counter
text = " ".join(survey["q10_open"].dropna().astype(str)).lower()
words = re.findall(r"[a-z']+", text)
stop = {"the","and","a","to","of","in","is","i","it","for","that","but","my","we","with","this","on","be","are"}
freq = Counter(w for w in words if w not in stop and len(w) > 3)
top20 = pd.DataFrame(freq.most_common(20), columns=["word","count"])
Output
with pd.ExcelWriter("survey_summary.xlsx") as w:
by_team.to_excel(w, sheet_name="By Team")
dist_pct.to_excel(w, sheet_name="Distributions")
top20.to_excel(w, sheet_name="Top Words", index=False)
Key takeaways
- Likerts roll up cleanly with mean by team.
- Free text gets a quick "top words after removing stopwords" pass — enough for a first cut.
- For deeper text analysis, level up to NLTK or spaCy (out of scope here).
Net Promoter Score
Add an NPS computation: % promoters (9-10) minus % detractors (0-6).
15.6 — User story: PTO accrual and balances
How much vacation does each employee have, with the accrual rules baked in?
What you'll learn
- Compute accrual based on tenure
- Subtract taken leave
- Flag anyone over the cap
The rules (example)
- Default accrual: 15 days/year, prorated.
- After 5 years tenure: 20 days/year.
- After 10 years: 25 days/year.
- Max balance carried into a year: 30 days.
Compute
from datetime import date
import pandas as pd
emp = pd.read_excel("employees.xlsx")
leave = pd.read_excel("leave_taken.xlsx") # employee_id, days_taken_ytd
today = pd.Timestamp(date.today())
emp["tenure_yrs"] = ((today - pd.to_datetime(emp["hire_date"])).dt.days / 365.25)
def annual_rate(years):
if years >= 10: return 25
if years >= 5: return 20
return 15
emp["annual_rate"] = emp["tenure_yrs"].apply(annual_rate)
ytd_fraction = (today.dayofyear / 365.25)
emp["accrued_ytd"] = (emp["annual_rate"] * ytd_fraction).round(1)
emp = emp.merge(leave, on="employee_id", how="left")
emp["days_taken_ytd"] = emp["days_taken_ytd"].fillna(0)
emp["balance"] = (emp["accrued_ytd"] - emp["days_taken_ytd"]).round(1)
emp["over_cap"] = emp["balance"] > 30
print(f"{emp['over_cap'].sum()} employees are over the 30-day cap.")
emp[["employee_id","name","tenure_yrs","annual_rate","accrued_ytd","days_taken_ytd","balance","over_cap"]].to_excel("pto_balances.xlsx", index=False)
Key takeaways
- Tiered accrual = a function of tenure, applied with .apply().
- YTD accrual is a simple proration.
- Always flag exceptions (over-cap, negative balance, > 1yr no leave).
Use-it-or-lose-it warning
Find employees who would lose more than 5 days at year-end if they don't take any more leave. Email each one a polite nudge using Module 12 Lesson 5.
15.7 — User story: new-hire onboarding tracker
Generate per-hire onboarding sheets from a master checklist.
What you'll learn
- Fill an Excel template per new hire
- Track progress in one master file
- Send a reminder when items are overdue
Per-hire onboarding workbook
from openpyxl import load_workbook
from datetime import date, timedelta
import pandas as pd
template = "onboarding_template.xlsx"
hires = pd.read_excel("new_hires_this_week.xlsx")
items = pd.read_excel(template, sheet_name="ItemDefs")  # checklist item definitions (read once, not per hire)
for _, h in hires.iterrows():
wb = load_workbook(template)
ws = wb["Checklist"]
ws["B2"] = h["name"]
ws["B3"] = h["start_date"]
ws["B4"] = h["manager"]
ws["B5"] = h["department"]
    # Auto-fill due dates from the item definitions
for i, item in items.iterrows():
row = 8 + i
ws.cell(row=row, column=1, value=item["item"])
ws.cell(row=row, column=2, value=pd.Timestamp(h["start_date"]) + pd.Timedelta(days=item["due_offset"]))
wb.save(f"onboarding/{h['name'].replace(' ', '_')}.xlsx")
Master tracker — read every per-hire workbook, summarise
from pathlib import Path
rows = []
for p in Path("onboarding").glob("*.xlsx"):
wb = load_workbook(p, data_only=True)
ws = wb["Checklist"]
name = ws["B2"].value
for r in ws.iter_rows(min_row=8, values_only=True):
item, due, done, owner = r[:4]
if item:
rows.append({"hire": name, "item": item, "due": due, "done": done, "owner": owner})
tracker = pd.DataFrame(rows)
tracker["overdue"] = (~tracker["done"].fillna(False)) & (tracker["due"] < pd.Timestamp.now())
print(f"Overdue items: {tracker['overdue'].sum()}")
tracker.to_excel("onboarding_master.xlsx", index=False)
Key takeaways
- Templates + Python loops = consistent per-hire artifacts.
- The master tracker reads back all the individual files — single source of truth.
- Schedule the tracker to run daily and email a one-page status.
Slack/Teams reminder
Instead of email, post the daily summary into a Teams/Slack channel via incoming webhook. (Both have free webhook URLs.)
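A minimal sketch of the webhook call itself; the URL is a placeholder you generate in Slack or Teams, and both accept a simple JSON "text" payload:
import requests
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder: paste your own
message = f"Onboarding tracker: {tracker['overdue'].sum()} overdue items"  # reuses `tracker` from above
resp = requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)
resp.raise_for_status()  # a non-2xx status means the webhook rejected the post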
Module 16 — Scenario: Sales & Marketing
Lead-list cleanup, CRM exports, campaign reporting, funnel analysis, and a weekly pipeline report you stop having to build.
Meet Lena, a sales-ops lead. Every Monday she rebuilds the same pipeline report. Every quarter she dedupes a giant lead list. Here's how Python earns its keep on each of these tasks.
16.1 — User story: dedupe a 50,000-lead list
Marketing pulled a list. Half of them are already in the CRM. Find the half that aren't.
What you'll learn
- Normalise emails and company names
- Match on multiple keys
- Output the net-new leads only
The data
import pandas as pd
new_list = pd.read_csv("conference_list.csv") # 50,000 rows
existing = pd.read_csv("crm_export.csv") # 220,000 rows
Normalise both sides
def norm_email(s):
return str(s).strip().lower()
def norm_company(s):
s = str(s).strip().lower()
for suffix in [" inc", " inc.", " llc", " ltd", " ltd.", " gmbh", " co", " co."]:
if s.endswith(suffix): s = s[:-len(suffix)]
return s.strip()
for df in (new_list, existing):
df["email_n"] = df["email"].apply(norm_email)
df["company_n"] = df["company"].apply(norm_company)
Match on email first, company second
known_emails = set(existing["email_n"])
known_companies = set(existing["company_n"])
new_list["match_by_email"] = new_list["email_n"].isin(known_emails)
new_list["match_by_company"] = new_list["company_n"].isin(known_companies)
new_list["already_known"] = new_list["match_by_email"] | new_list["match_by_company"]
net_new = new_list[~new_list["already_known"]].drop(columns=["match_by_email","match_by_company"])
print(f"Net-new leads: {len(net_new):,} of {len(new_list):,}")
net_new.to_excel("net_new_leads.xlsx", index=False)
Catch fuzzy duplicates within the new list itself
net_new = net_new.sort_values("score", ascending=False) # keep highest-scoring dupe
net_new = net_new.drop_duplicates(subset="email_n")
net_new = net_new.drop_duplicates(subset=["company_n", "last_name"])
Key takeaways
- Normalise (lowercase, trim, strip "Inc/LLC") before matching.
- Match on multiple keys; OR them together for "already known".
- Dedupe within the new list before importing.
Domain match
Add a third match: if the email domain (after @) matches a known company, treat it as a likely dupe.
16.2 — User story: roll up the marketing campaign report
Spend, impressions, clicks, conversions, by campaign and channel.
What you'll learn
- Combine exports from multiple ad platforms
- Compute the standard ratios (CTR, CPC, CPA)
- Rank by ROAS
The data
google = pd.read_csv("google_ads.csv")
meta = pd.read_csv("meta_ads.csv")
linked = pd.read_csv("linkedin_ads.csv")
Normalise each to a common schema
def normalise(df, channel, mapping):
return (df.rename(columns=mapping)
.assign(channel=channel)
[["channel","campaign","date","spend","impressions","clicks","conversions","revenue"]])
g = normalise(google, "Google", {"Campaign":"campaign","Day":"date","Cost":"spend",
"Impr.":"impressions","Clicks":"clicks",
"Conv.":"conversions","Conv. value":"revenue"})
m = normalise(meta, "Meta", {"Ad set name":"campaign","Reporting starts":"date",
"Amount spent (USD)":"spend","Impressions":"impressions",
"Link clicks":"clicks","Results":"conversions",
"Purchase conversion value":"revenue"})
l = normalise(linked, "LinkedIn", {"Campaign Name":"campaign","Start Date":"date",
"Total Spent":"spend","Impressions":"impressions",
"Clicks":"clicks","Conversions":"conversions",
"Conversion Value":"revenue"})
all_ads = pd.concat([g, m, l], ignore_index=True)
all_ads["date"] = pd.to_datetime(all_ads["date"])
Compute ratios
all_ads["CTR"] = (all_ads["clicks"] / all_ads["impressions"]).fillna(0)
all_ads["CPC"] = (all_ads["spend"] / all_ads["clicks"]).replace([float("inf")], 0).fillna(0)
all_ads["CPA"] = (all_ads["spend"] / all_ads["conversions"]).replace([float("inf")], 0).fillna(0)
all_ads["ROAS"] = (all_ads["revenue"] / all_ads["spend"]).replace([float("inf")], 0).fillna(0)
Roll up
by_channel = (all_ads.groupby("channel")
.agg(spend=("spend","sum"),
revenue=("revenue","sum"),
conversions=("conversions","sum"))
.assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2),
CPA= lambda d: (d["spend"]/d["conversions"]).round(2))
.sort_values("ROAS", ascending=False))
by_campaign = (all_ads.groupby(["channel","campaign"])
.agg(spend=("spend","sum"),
revenue=("revenue","sum"),
conversions=("conversions","sum"))
.assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2))
.sort_values("ROAS", ascending=False)
.reset_index())
with pd.ExcelWriter("campaign_report.xlsx") as w:
by_channel.to_excel(w, sheet_name="By Channel")
by_campaign.to_excel(w, sheet_name="By Campaign", index=False)
Key takeaways
- Three platforms, three rename maps, one DataFrame.
- Replace inf and NaN when computing ratios — every ad platform has zero-impression rows.
- Sort by the metric the team actually cares about (usually ROAS).
Channel ROI trend
Add a weekly line chart of ROAS by channel for the last 12 weeks.
16.3 — User story: funnel analysis
From session → product view → add to cart → checkout → purchase.
What you'll learn
- Compute step-to-step conversion rates
- Spot where the biggest drop-off is
- Build a funnel chart
The data
events = pd.read_csv("events.csv")
# user_id, event_name, timestamp
Per-step counts
steps = ["session_start","product_view","add_to_cart","checkout","purchase"]
step_users = {}
for s in steps:
step_users[s] = events.loc[events["event_name"]==s, "user_id"].nunique()
funnel = pd.DataFrame({"step": steps, "users": [step_users[s] for s in steps]})
funnel["pct_of_top"] = (funnel["users"] / funnel["users"].iloc[0] * 100).round(1)
funnel["pct_of_prev"] = (funnel["users"] / funnel["users"].shift().fillna(funnel["users"].iloc[0]) * 100).round(1)
print(funnel)
Strict sequential funnel (each step must have happened in order)
seq = events.sort_values(["user_id","timestamp"]).groupby("user_id")["event_name"].agg(list)
def reached(events_list, step_idx):
pos = 0
for ev in events_list:
if pos < len(steps) and ev == steps[pos]:
pos += 1
if pos > step_idx: return True
return False
reached_step = {s: int(seq.apply(lambda L: reached(L, i)).sum()) for i, s in enumerate(steps)}
print(reached_step)
Funnel chart
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(funnel["step"][::-1], funnel["users"][::-1], color="#2b6cb0")
for i, (s, n, pct) in enumerate(zip(funnel["step"][::-1], funnel["users"][::-1], funnel["pct_of_top"][::-1])):
ax.text(n, i, f" {n:,} ({pct}%)", va="center")
ax.set_title("Funnel β last 30 days")
plt.tight_layout()
plt.savefig("funnel.png", dpi=200)
Key takeaways
- Naive funnel = distinct users per step. Strict funnel = sequential check.
- Step-to-prev % shows where you're leaking the most.
- One bar chart, one number per bar — readable in 5 seconds.
Segment the funnel
Compute the funnel separately for desktop vs mobile users. Where does mobile drop off more?
16.4 — User story: the Monday pipeline report
Total pipeline, weighted forecast, deals moving by stage.
What you'll learn
- Compute pipeline by stage with stage-weighted forecast
- Spot deals stuck (no movement in 30+ days)
- Output a one-pager for the sales leader
The data
deals = pd.read_csv("crm_pipeline.csv", parse_dates=["close_date","last_modified"])
# id, owner, account, amount, stage, close_date, last_modified
stage_prob = {
"Discovery": 0.10,
"Qualified": 0.20,
"Proposal": 0.40,
"Negotiation": 0.70,
"Closing": 0.90,
}
deals["prob"] = deals["stage"].map(stage_prob).fillna(0)
deals["weighted"] = deals["amount"] * deals["prob"]
Pipeline by stage
by_stage = (deals.groupby("stage")
.agg(deals=("id","count"),
value=("amount","sum"),
weighted=("weighted","sum"))
.reindex(stage_prob.keys()))
Per-rep snapshot
by_rep = (deals.groupby("owner")
.agg(deals=("id","count"),
pipeline=("amount","sum"),
forecast=("weighted","sum"))
.sort_values("forecast", ascending=False))
Stuck deals
stuck = deals[(deals["last_modified"] < pd.Timestamp.now() - pd.Timedelta(days=30)) &
(deals["stage"] != "Closed")]
One-page output
with pd.ExcelWriter("monday_pipeline.xlsx") as w:
by_stage.to_excel(w, sheet_name="By Stage")
by_rep.to_excel(w, sheet_name="By Rep")
stuck.to_excel(w, sheet_name="Stuck Deals", index=False)
pd.DataFrame({
"Metric": ["Total pipeline","Weighted forecast","# stuck deals"],
"Value": [by_stage["value"].sum(), by_stage["weighted"].sum(), len(stuck)],
}).to_excel(w, sheet_name="Cover", index=False)
Schedule it
Schedule the script to run at 6 AM every Monday (Module 12, Lesson 6). Email it to the VP of Sales (Module 12, Lesson 5). Now your "Monday morning pipeline report" job is done before you wake up.
Key takeaways
- Stage probabilities turn raw pipeline into a forecast.
- "Stuck" = last_modified < today - 30 days, and not closed.
- Schedule + email = the report writes itself.
Forecast vs quota
Add a column to by_rep for quota; compute forecast / quota %, sort, flag < 70%.
16.5 — User story: RFM segmentation
Score every customer on Recency, Frequency, Monetary value. Build a champions list.
What you'll learn
- Compute R, F, M scores using quintiles
- Combine into a segment label
- Identify champions, at-risk, lost
The data
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
# customer_id, order_date, amount
Compute R, F, M
asof = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
recency = ("order_date", lambda d: (asof - d.max()).days),
frequency = ("order_date", "count"),
monetary = ("amount", "sum"),
)
Score 1-5 by quintile
# Recency: lower = better, so reverse
rfm["R"] = pd.qcut(rfm["recency"], 5, labels=[5,4,3,2,1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"], 5, labels=[1,2,3,4,5], duplicates="drop").astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 5, labels=[1,2,3,4,5]).astype(int)
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
Segment
def label(row):
if row["R"] >= 4 and row["F"] >= 4 and row["M"] >= 4: return "Champion"
if row["R"] >= 4 and row["F"] >= 3: return "Loyal"
if row["R"] >= 4 and row["F"] <= 2: return "New"
if row["R"] <= 2 and row["F"] >= 4: return "At Risk"
if row["R"] <= 2 and row["F"] <= 2: return "Lost"
return "Need Attention"
rfm["segment"] = rfm.apply(label, axis=1)
print(rfm["segment"].value_counts())
Output target lists
with pd.ExcelWriter("rfm_targets.xlsx") as w:
for seg in rfm["segment"].unique():
rfm[rfm["segment"]==seg].reset_index().to_excel(w, sheet_name=seg[:31], index=False)
Key takeaways
- RFM is the simplest, most powerful customer segmentation.
- Quintile scoring (qcut) auto-adjusts to your data's distribution.
- Run quarterly; hand the segments to the marketing team.
Send-list builder
Build a campaign email list of "At Risk" customers (R≤2, F≥4) with their last-order-amount as a personalisation token.
16.6 — User story: simple multi-touch attribution
Who gets credit when a customer touched four channels before buying?
What you'll learn
- Build a journey per customer
- Try first-touch, last-touch, and linear attribution
- Compare the answers
The data
touches = pd.read_csv("touches.csv", parse_dates=["ts"])
# user_id, ts, channel, conversion (bool), revenue
Build journeys
converters = touches.loc[touches["conversion"], "user_id"].unique()
j = (touches[touches["user_id"].isin(converters)]
.sort_values(["user_id","ts"])
.groupby("user_id")
.agg(path=("channel", list),
revenue=("revenue", "max")))
First-touch
j["first"] = j["path"].apply(lambda p: p[0])
first_touch = j.groupby("first")["revenue"].sum()
Last-touch
j["last"] = j["path"].apply(lambda p: p[-1])
last_touch = j.groupby("last")["revenue"].sum()
Linear (equal credit to every touch)
rows = []
for path, rev in zip(j["path"], j["revenue"]):
share = rev / len(path)
for c in path:
rows.append({"channel": c, "credit": share})
linear = pd.DataFrame(rows).groupby("channel")["credit"].sum()
Compare
compare = pd.concat([first_touch.rename("first"),
last_touch.rename("last"),
linear.rename("linear")], axis=1).fillna(0).round(0)
compare.sort_values("linear", ascending=False)
Key takeaways
- First-touch overrates discovery channels; last-touch overrates closing channels.
- Linear is naive but a good "everyone gets credit" baseline.
- Show all three side-by-side; the truth is usually somewhere in between.
Position-based (40/20/40)
Implement a position-based model: 40% to first, 40% to last, 20% split among the middle touches.
16.7 — User story: sales territory rebalancing
Split a country into roughly-equal account groups for new reps.
What you'll learn
- Sum account value by region/state
- Use a greedy assignment to balance territories
- Output one workbook per rep
The setup
Lena is hiring three new reps. She has to carve out three territories with roughly equal pipeline value.
Sum by geography
accts = pd.read_csv("accounts.csv") # account_id, state, pipeline_value
state_value = accts.groupby("state")["pipeline_value"].sum().sort_values(ascending=False)
print(state_value)
Greedy assignment
n_reps = 3
territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}
for state, value in state_value.items():
    # assign to whichever rep currently has the lowest total
    lowest = min(territories, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value
for rep, info in territories.items():
    print(f"{rep}: ${info['value']:,.0f} ({len(info['states'])} states)")
Output per-rep workbooks
for rep, info in territories.items():
    sub = accts[accts["state"].isin(info["states"])].sort_values("pipeline_value", ascending=False)
    sub.to_excel(f"territory_{rep}.xlsx", index=False)
How balanced did we end up?
values = [t["value"] for t in territories.values()]
print(f"min:{min(values):,.0f} max:{max(values):,.0f} spread:{(max(values)-min(values))/sum(values)*100:.1f}%")
Beyond greedy
For real territory optimisation (drive time, equal account count, named-account constraints), look at scipy.optimize.linprog or the pulp library. Greedy is a great first cut.
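For flavour, a minimal sketch with pulp (pip install pulp), reusing state_value from above: every state goes to exactly one rep, and the solver minimises the largest territory. A starting point, not a production optimiser.
import pulp

reps = [f"Rep{i+1}" for i in range(3)]
states = list(state_value.index)

prob = pulp.LpProblem("territories", pulp.LpMinimize)
assign = pulp.LpVariable.dicts("assign", (states, reps), cat="Binary")
biggest = pulp.LpVariable("biggest", lowBound=0)

prob += biggest  # objective: make the largest territory as small as possible
for s in states:
    prob += pulp.lpSum(assign[s][r] for r in reps) == 1  # each state to exactly one rep
for r in reps:
    prob += pulp.lpSum(state_value[s] * assign[s][r] for s in states) <= biggest

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for r in reps:
    total = sum(state_value[s] for s in states if assign[s][r].value() == 1)
    print(f"{r}: ${total:,.0f}")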
Key takeaways
- Greedy bin-packing is the 80% solution for territory carving.
- Always print the spread % – if it's huge, you need a smarter algorithm.
- One workbook per rep is the kindest deliverable.
Add a constraint
Modify the script so each rep gets at most 12 states (forces re-balancing in dense areas).
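A hint, if you want one: rebuild the greedy loop so only reps with room are eligible. The MAX_STATES constant and the fallback error are illustrative choices.
MAX_STATES = 12
territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}
for state, value in state_value.items():
    # only reps that still have room are eligible
    eligible = [r for r in territories if len(territories[r]["states"]) < MAX_STATES]
    if not eligible:
        raise ValueError("cap too tight: raise MAX_STATES or add reps")
    lowest = min(eligible, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value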
Module 17 – Capstone Projects
Four end-to-end projects that combine everything in the course. Pick one, ship it, show it.
You've learned the moves. Now build the dance. Each capstone is a real-world project you can finish in 2-6 hours, share with colleagues, and put on your portfolio.
You don't have to do all four
Pick the one closest to your job. Do that one well. Come back for the others when you need them.
17.1 – Capstone 1 – Monthly close pack (Finance)
End-to-end: load, reconcile, variance, AR aging, FX P&L, single Excel output, scheduled.
What you'll learn
- Combine everything from Module 13 into one orchestrator
- Add a styled cover sheet with KPIs
- Schedule it to run on the 1st of each month
Project brief
Build a single script monthly_close.py that, when run, produces close_pack_YYYY-MM.xlsx containing:
- A cover sheet with 6 KPIs.
- Bank reconciliation (matched, GL-only, bank-only).
- Budget vs actual variance (with material-variance flags).
- AR aging by customer × bucket.
- USD-translated P&L by office.
Schedule it to run automatically on the 1st of every month and email the result to your controller.
Suggested file layout
monthly_close/
├── monthly_close.py – the orchestrator
├── recon.py – reconciliation function
├── variance.py – variance function
├── aging.py
├── fx_pnl.py
├── style.py – shared openpyxl styling
├── inputs/ – raw files dropped here
└── outputs/ – finished pack lands here
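A skeleton sketch of what the orchestrator might look like; the function names imported below (reconcile, budget_vs_actual, ar_aging, usd_pnl) are placeholders, assuming each module exposes one function that returns a DataFrame.
from datetime import date

import pandas as pd

from recon import reconcile
from variance import budget_vs_actual
from aging import ar_aging
from fx_pnl import usd_pnl

tag = date.today().strftime("%Y-%m")

with pd.ExcelWriter(f"outputs/close_pack_{tag}.xlsx", engine="openpyxl") as w:
    reconcile("inputs/").to_excel(w, sheet_name="Reconciliation", index=False)
    budget_vs_actual("inputs/").to_excel(w, sheet_name="Variance", index=False)
    ar_aging("inputs/").to_excel(w, sheet_name="AR Aging", index=False)
    usd_pnl("inputs/").to_excel(w, sheet_name="FX P&L", index=False)

print(f"outputs/close_pack_{tag}.xlsx written")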
Acceptance criteria
- Runs end-to-end with one command: python monthly_close.py.
- Cover sheet headline KPIs match the detail sheets.
- All numeric cells formatted as currency.
- Material variance rows have red fill.
- AR rows over 90 days have red fill; 60-90 yellow.
- Email lands in the controller's inbox automatically every 1st.
Stretch goals
- Add a "previous month vs this month" comparison column on each summary.
- Embed a 12-month revenue trend chart on the cover sheet.
- Save the pack to a SharePoint or Google Drive folder so it's auto-shared.
What to show off
When it's done: a 90-second screen recording showing you double-click the script, the file appears, you open it, scroll the cover sheet, click into one detail tab. That's your portfolio piece.
17.2 – Capstone 2 – Self-refreshing dashboard (Analyst)
A daily-refresh dashboard from a database export, with anomaly detection.
What you'll learn
- Pipe a daily CSV export into a clean dashboard Excel
- Add anomaly flags and a 30-day trend
- Make it run automatically and email the file
Project brief
Build daily_dashboard.py that runs every morning and produces a one-page Excel dashboard for your manager. Source data: a CSV dump from the data warehouse (or any large CSV you have access to).
Required content
- Big number tiles: yesterday's revenue, WoW change, MoM change, anomaly count.
- Top 5 products by revenue.
- Top 5 stores by revenue.
- 30-day revenue trend chart.
- Heatmap: hour × day-of-week.
- Anomalies tab: list of dates where the residual was > 2σ.
Acceptance criteria
- One command: python daily_dashboard.py.
- Cover-page tiles auto-refresh against today's data.
- Charts saved as PNG and embedded into the workbook.
- Anomaly tab populated using seasonal_decompose from Module 14 Lesson 5 (see the sketch after this list).
- Scheduled to run daily at 6 AM (Module 12 Lesson 6).
- Result emailed to your manager (Module 12 Lesson 5).
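As flagged in the criteria, a sketch of the anomaly logic, assuming a daily revenue Series called daily indexed by date (statsmodels required):
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(daily, model="additive", period=7)  # weekly seasonality
resid = result.resid.dropna()
cutoff = 2 * resid.std()
anomalies = resid[resid.abs() > cutoff]  # dates where the residual exceeds 2 sigma
print(anomalies.sort_values(key=abs, ascending=False))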
Stretch goals
- Render a second copy as a single PNG suitable for posting in Slack.
- Add segmentation: same dashboard but filtered to one region per tab.
- Replace the file-emailing with a Slack/Teams webhook post.
What to show off
The dashboard your manager now refreshes themselves just by opening yesterday's file. The fact that they stopped asking you for it.
17.3 – Capstone 3 – Headcount + payroll prep + tracker (HR)
Three weekly HR jobs in one script.
What you'll learn
- Run all weekly HR jobs as one pipeline
- Produce a headcount summary, payroll-import file, and onboarding tracker
- Email each output to the right people
Project brief
Build hr_weekly.py that produces three artifacts and routes each to the right stakeholder:
- headcount_YYYY-MM-DD.xlsx – HR Director.
- payroll_import_YYYY-MM-DD.csv – Payroll vendor SFTP (or just a folder).
- onboarding_status_YYYY-MM-DD.xlsx – People Manager.
Required inputs
- employees.xlsx – HRIS export.
- timesheets.csv – last week's hours.
- onboarding/*.xlsx – per-new-hire onboarding workbooks.
Acceptance criteria
- Headcount file: current active count by dept, hires and leavers in last 7 days, turnover %.
- Payroll file: per-employee regular hours + overtime, ready to import (see the sketch after this list).
- Onboarding file: list of overdue items with the new-hire and owner.
- Each file is emailed to a different recipient with a templated subject and body.
- Scheduled for Monday 7 AM.
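A sketch of the regular/overtime split, assuming timesheets.csv carries hypothetical employee_id and hours columns and a 40-hour regular week:
import pandas as pd

timesheets = pd.read_csv("timesheets.csv")
weekly = timesheets.groupby("employee_id")["hours"].sum().reset_index()
weekly["regular"] = weekly["hours"].clip(upper=40)         # up to 40 hours at the normal rate
weekly["overtime"] = (weekly["hours"] - 40).clip(lower=0)  # anything above 40
weekly.to_csv(f"payroll_import_{pd.Timestamp.today():%Y-%m-%d}.csv", index=False)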
Stretch goals
- Add a "PTO over cap" alert section in the headcount file.
- Skip employees who didn't submit timesheets and email them a reminder.
- Log every send to hr_weekly.log so you can audit who got what.
What to show off
Three jobs collapsed to one calendar entry that runs while you sleep. Your Monday opens with a clean inbox.
17.4 – Capstone 4 – Pipeline + campaign report (Sales/Marketing)
A weekly Monday email that combines pipeline, marketing spend, and ROAS.
What you'll learn
- Combine CRM data and ad-platform exports
- Produce a weighted forecast and channel ROAS
- Email a 1-page PDF + the underlying Excel
Project brief
Build monday_gtm.py that runs every Sunday night and emails a 1-page PDF + Excel pack to the VP of Sales and the VP of Marketing.
The PDF cover page (auto-generated)
- Weighted pipeline forecast – total, by stage, by rep (see the sketch after this list).
- Last week's spend by channel + ROAS.
- 5 "stuck deals" that have not moved in 30+ days.
- Net-new leads added in the last week.
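A sketch of the weighted-forecast line; pipeline.csv, the stage/rep/amount columns, and the probabilities are all placeholders for your CRM's conventions.
import pandas as pd

deals = pd.read_csv("pipeline.csv")
stage_prob = {"Discovery": 0.1, "Demo": 0.3, "Proposal": 0.6, "Negotiation": 0.8}  # placeholder odds
deals["weighted"] = deals["amount"] * deals["stage"].map(stage_prob)
print(f"Total weighted pipeline: ${deals['weighted'].sum():,.0f}")
print(deals.groupby("stage")["weighted"].sum())
print(deals.groupby("rep")["weighted"].sum())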
The Excel underneath
- Tab 1: Pipeline by stage.
- Tab 2: Pipeline by rep, with quota %.
- Tab 3: Stuck deals (sorted by amount desc).
- Tab 4: Channel ROAS, last 4 weeks.
- Tab 5: Net-new leads (segmented by source).
Acceptance criteria
- One command, full pack built in < 90 seconds.
- The PDF prints cleanly on A4 / Letter without scaling.
- The Excel opens and looks "clean" – no extra empty tabs, no #DIV/0! cells.
- Email sent with both files attached, subject line: "GTM Weekly – <week-of>".
- Scheduled for Sunday 11 PM.
Stretch goals
- Add a small RFM segment summary on the PDF cover.
- Track week-over-week changes for every KPI on the cover.
- Add a "biggest deal closed last week" callout for morale.
What to show off
An automated Monday-morning report that two VPs now rely on. The fact that they can't tell you didn't hand-make it.
17.5 – Where to go from here
Books, sites, communities, and the next skills worth learning.
What you'll learn
- Find the right next book or site
- Join a community to keep learning
- Pick one more skill to add this quarter
Books to read next
- Automate the Boring Stuff with Python – Al Sweigart. Free online at automatetheboringstuff.com. Sister book to this course, more breadth, less Excel-focused.
- Python for Data Analysis – Wes McKinney (the creator of pandas). The deeper dive on every pandas concept we glossed over.
- Fluent Python – Luciano Ramalho. When you want to understand the language, not just use it.
Sites worth bookmarking
- pandas user guide – pandas.pydata.org. The official reference; surprisingly readable.
- Real Python – realpython.com. High-quality tutorials.
- Stack Overflow – for every error message you'll ever hit.
Skills to add this quarter
- SQL. If pandas is the Excel-replacement, SQL is the "where the data actually lives" language. Two weekends of practice pays for itself for the rest of your career.
- Git basics. Version control for your scripts. Two hours. You'll never lose work again.
- One specific library deeper. Depending on your job: requests for APIs, beautifulsoup4 for scraping, plotly for interactive charts, streamlit for tiny web apps.
Communities
- r/learnpython on Reddit – beginner-friendly, helpful answers.
- Python Discord – quick chat-style help.
- Pandas tag on Stack Overflow – search before posting; your question probably has an answer.
The most underrated next move
Pick one script you wrote during this course. Make it 50% better β add error handling, write a docstring for every function, add a small test. The difference between "I tinkered with Python" and "I write Python" is taste, and taste comes from polishing your own work.
And finally – keep this site bookmarked
This whole course is here whenever you need to look something up. New lessons get added when there's something worth saying. Send corrections and lesson requests via the contact page.
Key takeaways
- Read Automate the Boring Stuff next for breadth; Python for Data Analysis for pandas depth.
- Learn SQL and Git – two weekends, lifetime value.
- Polish one of your own scripts. That's the real skill jump.
Your last exercise
Pick one script you wrote during this course. Spend 30 minutes: add a docstring at the top, error handling on the I/O, and a print statement at the end that says how long it took. Save it somewhere safe. That's your first piece of personal infrastructure.
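If you haven't met time.perf_counter yet, the timing part is two lines:
import time

start = time.perf_counter()
# ... your script's existing work runs here ...
print(f"Finished in {time.perf_counter() - start:.1f} seconds")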
End of manual. © 2026 LearnPythonForExcel.com
The live, always-up-to-date version lives at learnpythonforexcel.com.