LearnPythonForExcel — Training Manual

The complete printable companion. 17 modules, 112 lessons, every walkthrough and exercise.

Table of Contents

Module 01 — Introduction to Python

  1. What is Python, really?
  2. Why Python, why Excel, why now?
  3. How Python actually runs
  4. The Excel-to-Python mental model
  5. The tools you'll meet in this course
  6. How to use this course

Module 02 — Setting Up Your Environment

  1. Install Python (the easy way, with Anaconda)
  2. Install VS Code — your code editor
  3. The terminal — a friendly introduction
  4. Turn on Python inside Excel (=PY())
  5. A two-minute tour of Jupyter notebooks
  6. When things break — reading errors without panic

Module 03 — Variables and Data Types

  1. Variables — Python's version of a named cell
  2. Numbers — ints, floats, and the math operators
  3. Strings — text values and how to wrangle them
  4. Booleans and comparisons
  5. None, type conversions, and reading the type
  6. Talking to the user — input() and print()
  7. Comments — notes to your future self

Module 04 — Control Flow

  1. if / else — making decisions
  2. for loops — doing something for every item
  3. while loops — repeat until done
  4. List comprehensions — one-line loops
  5. try / except — handling errors gracefully
  6. Putting it together — a mini contact-cleaner

Module 05 — Functions

  1. def — defining your own function
  2. Returning multiple values & unpacking
  3. Scope — where variables live
  4. lambda, map, filter — the functional shortcuts
  5. Modules and imports — using other people's code
  6. Anatomy of a real script

Module 06 — Data Structures

  1. Lists — the everyday ordered collection
  2. Dictionaries — key-value lookups
  3. Tuples and when to use them
  4. Sets — unique values, fast
  5. Nested structures — lists of dicts, dicts of lists
  6. Working with dates
  7. Files and paths — the pathlib essentials

Module 07 — Working with Files

  1. Reading and writing CSV files
  2. Reading Excel files with pandas
  3. openpyxl — when you need cell-level control
  4. Looping through many files
  5. Text files, JSON, and other formats
  6. Encoding, BOMs, and other file gotchas

Module 08 — Python inside Excel — =PY()

  1. =PY() — the first ten formulas
  2. The xl() helper — getting Excel data into Python
  3. Pandas in =PY() — the killer combination
  4. Charts in =PY() with matplotlib and seaborn
  5. When =PY() is the right tool — and when it isn't
  6. Twelve =PY() recipes you'll use
  7. Limits, costs, and security of =PY()

Module 09 — Pandas Crash Course

  1. DataFrames — Python's spreadsheet
  2. Selecting rows and columns — loc and iloc
  3. Sorting, counting, and ranking
  4. Aggregation — sum, mean, agg
  5. groupby — the move that pays for the course
  6. Pivot tables — pivot_table()
  7. Joining and merging — pandas's VLOOKUP
  8. Applying functions — apply, map, vectorisation

Module 10 — Data Cleaning and Transformation

  1. Handling missing values
  2. Finding and removing duplicates
  3. String cleanup at scale
  4. Working with dates in pandas
  5. Numeric cleanup — currencies, percentages, outliers
  6. Reshape — melt and pivot
  7. Categoricals and ordered types

Module 11 — Data Visualization

  1. Matplotlib basics
  2. Plotting directly from a DataFrame
  3. Seaborn — prettier defaults, less code
  4. Formatting for stakeholders
  5. Inserting charts into Excel
  6. Tiny dashboards with subplots

Module 12 — Automating Boring Excel Tasks

  1. Merging many files into one
  2. Splitting one file into many
  3. Filling a template workbook
  4. Renaming and organising files in bulk
  5. Sending emails from Python
  6. Scheduling: run it automatically
  7. An automation recipe book

Module 13 — Scenario: Finance & Accounting

  1. User story: bank reconciliation
  2. User story: budget vs actual variance
  3. User story: consolidating subsidiary budgets
  4. User story: invoice batch processing
  5. User story: accounts receivable aging
  6. User story: multi-currency consolidation
  7. Project: the monthly close pack

Module 14 — Scenario: Data & Business Analyst

  1. User story: cleaning a 4-million-row export
  2. User story: joining six tables to answer one question
  3. User story: customer cohort retention
  4. User story: did the change work? A quick A/B test
  5. User story: trend, seasonality, and 'is this Tuesday weird?'
  6. User story: a refreshing dashboard sheet
  7. User story: shipping the analysis — narrative report

Module 15 — Scenario: HR & Operations

  1. User story: monthly headcount report
  2. User story: payroll prep
  3. User story: employee data hygiene
  4. User story: build the org chart programmatically
  5. User story: engagement survey analysis
  6. User story: PTO accrual and balances
  7. User story: new-hire onboarding tracker

Module 16 — Scenario: Sales & Marketing

  1. User story: dedupe a 50,000-lead list
  2. User story: roll up the marketing campaign report
  3. User story: funnel analysis
  4. User story: the Monday pipeline report
  5. User story: RFM segmentation
  6. User story: simple multi-touch attribution
  7. User story: sales territory rebalancing

Module 17 — Capstone Projects

  1. Capstone 1 — Monthly close pack (Finance)
  2. Capstone 2 — Self-refreshing dashboard (Analyst)
  3. Capstone 3 — Headcount + payroll prep + tracker (HR)
  4. Capstone 4 — Pipeline + campaign report (Sales/Marketing)
  5. Where to go from here

Module 01

Introduction to Python

What Python is, why it matters for Excel users, what it can (and can't) do, and how to use this course.

This first module is short and friendly. You'll come out of it knowing what Python actually is, why it suddenly matters for spreadsheet users, and what kinds of problems it solves. No installation yet — we save that for Module 2.

💡 Before you start
You don't need to install anything for this module. Just read. Open the lessons in any order if you like. Each one is 5-10 minutes.

01.1 — What is Python, really?

A plain-English answer — no jargon — and why a spreadsheet user should care.

What you'll learn

Python in one sentence

Python is a way of writing instructions for your computer in language that looks a lot like English. That's it. Once you write the instructions down in a file, you tell the computer to run the file, and the computer does what you said.

Here's a complete, working piece of Python:

print("Hello, Excel friend!")

You can probably guess what that does before you've ever written a line of code: it prints the words "Hello, Excel friend!" to the screen. That readability is the whole point of Python.

Where Python sits in the world

Python is one of the two or three most-used programming languages on the planet. NASA uses it. Netflix uses it. Your bank uses it. The data science team at virtually every company uses it. And starting in 2023, Microsoft Excel itself uses it — there's a Python interpreter built right into the spreadsheet now, accessed through a formula called =PY().

That last bit is the reason this course exists. The world's most popular spreadsheet now ships with the world's most popular data language built in. If you've spent years living inside Excel, you've just been handed a superpower.

What is Python good at?

Roughly the kind of work an Excel power-user already does — and then a lot more:

What's the difference between Python and an Excel formula?

Excel formulas live inside a cell. They look at the cells around them and compute a single value. They're a single-step calculator.

A piece of Python — usually called a script — lives in its own file. It can do a thousand things one after another: open a file, change some numbers, save the file, then open another file, then send an email. It's a recipe with as many steps as you want.

The new =PY() formula in Excel mashes the two together: you write a few lines of Python inside a spreadsheet cell and you get a value back, just like a formula. We cover that in Module 8.

🧠 The mental model
Think of Python as "what an Excel formula would be if it could do anything and weren't stuck in one cell."

What this course will (and won't) teach you

We focus 100% on Python for spreadsheet-style work. That means: data, files, automation, reports, charts, and the new =PY() in Excel. We don't cover web apps, game development, or computer science theory. If you want those later, you'll have plenty of Python background to pick them up easily.

Key takeaways

Try this — no setup required

  1. Open an online Python sandbox in another tab (search for "python online" — Google has one built in, or use replit.com).
  2. Type print("Hello, " + "world") and hit Run.
  3. Now change "world" to your name. Run it again.
  4. Congratulations — you've just written and run a Python program.

01.2 — Why Python, why Excel, why now?

The case for an Excel user learning Python in 2026 — five concrete reasons.

What you'll learn

The short version

Excel is brilliant for a lot of things — but it has limits. Python doesn't have most of those limits. And now you can use Python inside Excel, instead of as a separate tool. That's why this is suddenly a no-brainer for spreadsheet pros.

Five concrete reasons

1. Files that crash Excel are no problem for Python

The classic limit you've probably bumped into: an Excel sheet maxes out at exactly 1,048,576 rows. You get a 4-million-row CSV from the data team, you open it in Excel, and… it shows you the first million. Python handles 4 million rows without breaking a sweat. 40 million? Fine.

2. The same task, done in one click instead of fifty

If you do the same Excel work every Monday — open a file, clean three columns, run a pivot, format a report, save it as a PDF, email it — Python turns that into one script you run once. The first time costs you an hour. Every week after costs you the time it takes to double-click an icon.

💡 The "annual ROI" rule of thumb
A 30-minute weekly task is 26 hours a year. If automating it takes you four hours up front, you break even after eight weeks and net 22 hours in year one, then the full 26 every year after. Most automations pay back faster than that.
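
If you want to sanity-check that rule of thumb for your own task, the arithmetic is a few lines of Python. A tiny back-of-envelope calculator, using the tip's numbers (swap in your own):

```python
# Back-of-envelope ROI for automating a recurring task.
task_minutes_per_week = 30     # how long the manual task takes each week
build_hours = 4                # one-off cost of writing the script

hours_saved_per_week = task_minutes_per_week / 60
hours_per_year = hours_saved_per_week * 52
breakeven_weeks = build_hours / hours_saved_per_week

print(f"Manual cost per year: {hours_per_year:.0f} hours")                # 26 hours
print(f"Break even after:     {breakeven_weeks:.0f} weeks")               # 8 weeks
print(f"Net saved in year 1:  {hours_per_year - build_hours:.0f} hours")  # 22 hours
```

Change the two numbers at the top and re-run: that is already the whole "script you run once" idea in miniature.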

3. Things Excel can't do at all

Excel cannot scrape a web page for you. It can't pull data from most APIs cleanly. It can't read most PDFs intelligently. It can't send a hundred personalised emails. Python does all four with a handful of lines.

4. Reusability — the script outlives the spreadsheet

A complex Excel workbook is fragile. Someone changes a column heading and three pivot tables break. A Python script for the same job is text — you can read it, version it, share it, fix it, and re-use it on next month's data without copy-pasting anything.

5. It's now inside Excel

In 2023 Microsoft added a built-in Python runtime to Excel (Microsoft 365). You write =PY() in a cell, type a few lines of Python, hit Enter, and the result lands in the spreadsheet like any other formula result. No installs, no separate tool, no copy-paste between apps. We dive into this in Module 8.

Where Excel is still the right answer

This is not a course that argues you should stop using Excel. Excel is unbeatable for:

The right mental model: Excel is your workshop; Python is your power tools. Use both.

The big "what changed"

Three things changed in the past few years that make this course worth taking right now:

  1. 2023: Python landed inside Excel via =PY().
  2. The data world standardised on pandas. If you can use pandas, you can talk shop with any analyst in any company.
  3. AI assistants now write much of the boring code for you. You no longer need to memorise syntax. You need to know what's possible and how to read code — both of which this course teaches.

Key takeaways

Reflect for a minute

Write down, on paper or in a sticky note:

  1. One Excel task you do at least weekly.
  2. How long it takes each time.
  3. Multiply by 52. That's the time you'll get back per year by the time we're done.

Keep this note handy — you'll come back to it in Module 12 when you automate exactly this task.

01.3 β€” How Python actually runs

What 'running' Python means, the three places you'll run it, and the difference between a script and a notebook.

What you'll learn

What 'running' Python means

When you write Python code, it's just plain text in a file. The computer can't do anything with text on its own. To make it do something, you hand the text to a program called the Python interpreter. The interpreter reads the text from top to bottom, one line at a time, and acts on each instruction as it gets to it.
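
Here is a minimal illustration of that top-to-bottom reading: three instructions, acted on strictly in order.

```python
print("Step 1 runs first")
total = 10 + 5                                # Step 2: computes quietly, prints nothing
print("Step 3 runs last, total =", total)     # by now, total already exists
```

Swap the first and last lines and the program breaks: at the moment the interpreter reaches print(total), the name total hasn't been created yet.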

Two consequences worth knowing:

The three places you'll run code

1. The REPL — for trying things

REPL stands for Read–Eval–Print Loop. It's an interactive prompt: you type one line, hit Enter, and Python answers immediately. Great for experimenting.

>>> 2 + 2
4
>>> "hello".upper()
'HELLO'

You'll meet the REPL in Module 2.

2. A script — for things you want to run again

A script is a file ending in .py that holds Python code. You write the code once, save the file, and you can run it whenever you want, as many times as you want.

# clean_monday_report.py
import pandas as pd
df = pd.read_excel("monday_raw.xlsx")
df = df.dropna()
df.to_excel("monday_clean.xlsx", index=False)
print("Done.")

This is what 90% of "automating boring stuff" looks like.

3. A notebook — for analysis

A Jupyter notebook (file extension .ipynb) is somewhere between a script and a Word document. You write code in little blocks called cells, and the output appears right below each cell. You can mix code, charts, and notes in the same document.

Notebooks are the standard format for data analysis — you'll see them everywhere in finance, science, and business analytics.

📝 And one more: =PY() inside Excel
A fourth place Python now runs is inside an Excel cell itself, via the new =PY() formula. The code runs on a server in the cloud and the result lands back in the cell. We cover this in Module 8.

Which should I use when?

  Situation                                            Use
  "What does this single line do?"                     REPL
  "I want to run this same job every Monday."          Script (.py)
  "I'm exploring data and want notes alongside."       Notebook (.ipynb)
  "I want a Python value inside a spreadsheet cell."   =PY() in Excel

You don't have to install anything to try this

If you want a taste before Module 2, you can run Python in a browser tab right now — try online-python.com or replit.com.

Key takeaways

Try it now (browser only, no install)

  1. Open online-python.com.
  2. Paste this code and hit Run:
    name = "your name here"
    print("Hi, " + name + "!")
    print("Today's Python lesson is going well.")
  3. Change "your name here" to your actual name, and run it again.
  4. You just used a variable, a function, and string addition — three of Python's most-used features.

01.4 — The Excel-to-Python mental model

Translate things you already know — cells, formulas, ranges, tables — into their Python equivalents.

What you'll learn

You already know more than you think

People often say "Python is hard because it's a programming language." The truth is: if you've used Excel formulas, you've already met most of the concepts. They just have different names. Let's translate.

The translation table

  Excel concept                      Python equivalent                    Why they're the same
  A cell holding a value: A1 = 100   A variable: a1 = 100                 Both are a named container for a value.
  A formula: =SUM(A1:A10)            A function call: sum(numbers)        Both take inputs and give back one output.
  A range: A1:A10                    A list: [100, 200, 300, ...]         Both are an ordered group of values.
  A table / structured range         A pandas DataFrame                   Both are rows-and-columns with named headers.
  A VLOOKUP table                    A dictionary: {"apple": 1.20, ...}   Both are a key-to-value lookup.
  IF(A1>10, "big", "small")          if a1 > 10: "big" else: "small"      Both branch on a condition.
  Filling a formula down 100 rows    A for loop                           Both apply the same logic across many items.
  A macro (VBA)                      A Python script                      Both are saved instructions you run on demand.
  An add-in                          A library (pandas, numpy, etc.)      Both add capabilities the base tool doesn't have.

Three quick "Wait, that's it?" examples

Adding numbers in a range

Excel:

=SUM(A1:A5)

Python:

numbers = [10, 20, 30, 40, 50]
total = sum(numbers)
print(total)   # 150

Looking up a value

Excel:

=VLOOKUP("apple", A1:B10, 2, FALSE)

Python:

prices = {"apple": 1.20, "pear": 1.50, "banana": 0.40}
print(prices["apple"])   # 1.2

"If A1 is greater than 10, say 'big', otherwise 'small'"

Excel:

=IF(A1>10, "big", "small")

Python:

a1 = 14
if a1 > 10:
    result = "big"
else:
    result = "small"
print(result)   # big
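
One row of the translation table has no example yet: filling a formula down is a for loop. Say column A holds five prices and you would normally drag =A1*1.2 down the column (the 20% markup here is just an invented example):

```python
prices = [10, 20, 30, 40, 50]     # column A
marked_up = []                    # column B, about to be "filled down"

for p in prices:                  # one pass per row, like dragging the fill handle
    marked_up.append(round(p * 1.2, 2))   # the =A1*1.2 step, rounded like a currency cell

print(marked_up)   # [12.0, 24.0, 36.0, 48.0, 60.0]
```

Five rows or five million, the loop is the same three lines.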

The DataFrame: the most important translation

If there's only one Python concept you remember from this lesson, make it the DataFrame. It's the pandas library's word for a table of data. It has rows. It has named columns. You can sort it, filter it, group it, summarise it. It is, for all practical purposes, an Excel sheet that lives inside Python.

Once you internalise that mental model — "DataFrame = Sheet" — every example in Modules 9 through 16 will feel familiar.

import pandas as pd

# This DataFrame is basically:
#   Name      Sales
#   Alice     1200
#   Bob       800
#   Carol     1500

df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Carol"],
    "Sales": [1200,    800,    1500],
})

print(df["Sales"].sum())   # 3500 — like =SUM(B2:B4)

💡 If you keep one slide
Variable = cell. Function = formula. List = range. DataFrame = sheet. Dictionary = VLOOKUP. Loop = fill down.

Key takeaways

Translate these three formulas to Python

Try to translate each on paper before scrolling down. Answers below.

  1. =A1 + A2
  2. =AVERAGE(A1:A5)
  3. =IF(A1="VIP", A2*0.9, A2)
Show answers
  1. a1 + a2
  2. sum(numbers) / len(numbers)  — or with pandas: df["A"].mean()
  3. a2 * 0.9 if a1 == "VIP" else a2
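
Answer 3 uses a one-line conditional you haven't formally met yet (it arrives in Module 4). If it looks odd, here it is as runnable code, with invented values for a1 and a2:

```python
# =IF(A1="VIP", A2*0.9, A2) as a Python conditional expression
a1 = "VIP"
a2 = 200.0

price = a2 * 0.9 if a1 == "VIP" else a2
print(price)   # 180.0  (VIPs get 10% off)

a1 = "regular"
price = a2 * 0.9 if a1 == "VIP" else a2
print(price)   # 200.0  (everyone else pays full price)
```

Read it left to right as a sentence: "the price is a2 times 0.9, if a1 is VIP, otherwise a2."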

01.5 — The tools you'll meet in this course

A field guide to the names you'll keep hearing — Python, pandas, Jupyter, VS Code, openpyxl, =PY(), and so on.

What you'll learn

A field guide to the names

When you read Python tutorials online, you'll see the same dozen words over and over. Here's what each one is and why you'll meet it in this course.

Python

The language itself. When someone says "I wrote it in Python," they mean they wrote code in this language and ran it with the Python interpreter.

Anaconda

A free bundled installer that gives you Python plus the most-used libraries (pandas, NumPy, Jupyter, matplotlib) in one click. The friendliest way to get set up. We use it in Module 2.

pip

Python's package installer. To add a new library to your Python: pip install thelibrary. Anaconda gives you the common ones already, but pip is how you get anything else.

Library / package

A bundle of pre-written Python code that adds capabilities. Excel calls these "add-ins"; Python calls them libraries. There are hundreds of thousands of them for free.

pandas

The library that turns Python into a souped-up spreadsheet. You will use this constantly. It introduces the DataFrame — that "Python sheet" we just met. Full crash course in Module 9.

NumPy

A library for fast number-crunching. Pandas is built on top of NumPy. You'll touch it occasionally.

openpyxl

The library that reads and writes .xlsx Excel files at a low level — sheets, cells, formatting, formulas, charts. We use it in Modules 7 and 12 for things pandas can't do.

matplotlib & seaborn

The two most common libraries for making charts in Python. We use them in Module 11.

Jupyter notebook

The "code + notes + charts in one document" format we mentioned earlier. File extension .ipynb. Comes free with Anaconda.

VS Code (Visual Studio Code)

A free, modern code editor from Microsoft. It's what we'll use to write Python scripts. There are others (PyCharm, Sublime, etc.) but VS Code is the easiest starting point and works on Mac, Windows, and Linux.

Terminal / command line / PowerShell

The text-based way to talk to your computer. Macs call it Terminal; Windows calls it PowerShell or Command Prompt. You'll need it occasionally — mainly to install things with pip. We hand-hold you through it in Module 2.

=PY() in Excel

The formula that runs Python inside an Excel cell. Microsoft 365 only, as of 2026. Full module on this in Module 8.

Microsoft 365 / Excel for the web

You need the modern subscription version of Excel to use =PY(). The classic boxed Excel 2019 / 2021 does not include it.

How they fit together

A typical day in this course looks like:

  1. You open VS Code.
  2. You write a Python script that imports pandas.
  3. Pandas reads an Excel file (under the hood it uses openpyxl).
  4. You clean and reshape the data.
  5. You save the result back to Excel, or you build a chart with matplotlib.
  6. Optionally: you switch to Excel, type =PY(), and re-run a small piece of the analysis live in a cell.
💡 You don't need to memorise this
This page is a glossary, not a quiz. Bookmark it. As we use each tool you'll learn what it actually feels like.

Key takeaways

Spot the tool

For each of these jobs, name the tool from this lesson that does it:

  1. "Open and edit an .xlsx file at the cell level."
  2. "Write a chart with code."
  3. "Run a piece of Python inside an Excel cell."
  4. "Install a new library."
  5. "Work on data with code, output, and notes mixed in one document."
Show answers
  1. openpyxl
  2. matplotlib (or seaborn)
  3. =PY()
  4. pip
  5. A Jupyter notebook

01.6 — How to use this course

Three suggested paths through the material — total beginner, refresher, job-focused — and how to get the most out of each lesson.

What you'll learn

Three suggested paths

🟢 Path A — Total beginner (recommended for most people)

Read Modules 1 through 12 in order, doing every exercise. Then pick whichever role module (13–16) fits your job, then Module 17 for the capstone.

Time: Roughly 12–20 hours, depending on how much you experiment.

🟡 Path B — "I've coded a little before"

Skim Modules 1 and 3–6, do all of Module 2 (setup) properly, then go full speed from Module 7 onwards.

Time: Roughly 6–10 hours.

🟠 Path C — "I have a specific Excel pain to fix today"

Do Module 2 (setup), Module 7 (files), Module 9 (pandas), and then jump straight to your role module. You can fill in the foundations later — most readers do.

Time: 4–6 hours.

What every lesson looks like

Every lesson page follows the same shape, so you always know where to look:

  1. Objectives box at the top — what you'll know by the end.
  2. Body — the actual lesson, broken into short sections with code samples.
  3. Step-by-step walkthroughs in numbered cards for any real task.
  4. Key takeaways — the three to five bullets you'd write on a sticky note.
  5. Exercise — a small, do-it-yourself task that's quick to get a win on.
  6. Prev / Next at the bottom — the lesson flows top-to-bottom through the course.

How to do the exercises (without giving up)

Three rules:

  1. Try first, then read the answer. Even five minutes of "I don't know what to do" builds the muscle that reading never will.
  2. It's fine to look things up. Real professional Python work involves Googling roughly every 90 seconds. We're not training you for an exam.
  3. An error message is a clue, not a verdict. The last line of an error usually tells you exactly what went wrong. We have a whole lesson on this (Module 2, Lesson 6).
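
Rule 3 is worth seeing once in a safe setting. This sketch triggers an error on purpose and prints the same one-line clue you'd find at the bottom of a real traceback:

```python
# Uncaught, this mistake would end a traceback with the line:
#   ZeroDivisionError: division by zero
try:
    average = 120 / 0              # deliberate mistake
except ZeroDivisionError as err:
    print("Error type:", type(err).__name__)   # Error type: ZeroDivisionError
    print("The clue:", err)                    # The clue: division by zero
```

The error's name plus that short message is usually enough to search on, or to paste into an AI assistant.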
💡 Pro tip: use AI as a study partner
You can paste any error message — or any lesson exercise — into ChatGPT, Claude, or Copilot and ask "explain this like I'm a spreadsheet user." That's not cheating; it's how everyone codes in 2026.

One small commitment

Do not skip Module 2 (Setup). It's the only module where the work isn't optional. If your Python isn't installed correctly, none of the rest of the course works on your machine. Take the 30 minutes; you'll never need to do it again.

What to do when you're done

You'll know you've "graduated" when you can:

From there, the world is your oyster — web scraping, dashboards, machine learning, automation, scripting your whole job. We list further-reading recommendations at the end of Module 17.

Key takeaways

Set your goal

  1. Pick your path: A, B, or C.
  2. Block 30 minutes on your calendar in the next two days for Module 2 (setup).
  3. You're ready. Click "Next lesson →" and let's go.

Module 02

Setting Up Your Environment

Install Python the easy way (Anaconda), set up VS Code, get a Jupyter notebook running, and turn on =PY() in Excel.

This is the most important module in the course, because everything that follows depends on a working setup. The good news: we use the friendliest path on each platform, with a screenshot's worth of description for every step.

⚠️ Do this module on a real computer
You can't do the rest of the course on a phone or tablet. Use a Mac, Windows, or Linux laptop or desktop.

02.1 — Install Python (the easy way, with Anaconda)

Get Python and the big data libraries on your machine in one installer. Works the same way on Mac and Windows.

What you'll learn

Why Anaconda?

You can install Python directly from python.org. For an Excel user with no programming background, that path leads to lots of head-scratching. Anaconda is a free bundled installer made by a data-science company. Click one button, wait ten minutes, and you have:

If you've already installed Python a different way and it's working, you can skip this lesson. Otherwise, use Anaconda.

Step-by-step walkthrough

Go to the Anaconda download page

Open https://www.anaconda.com/download in your browser.

You may be asked for an email. You can skip that and click "Download" directly.

Pick your operating system

The page detects your OS automatically. Confirm it's right (Windows, macOS, or Linux) and choose the 64-bit installer. If you have a recent Mac with an M-series chip, pick the Apple Silicon version.

Run the installer

Double-click the downloaded file. Click "Continue" / "Next" through every screen. The defaults are correct for what you need.

On Windows, when asked, leave "Add Anaconda to my PATH environment variable" unchecked (Anaconda's own recommendation). We'll use the Anaconda Prompt for command-line work.

The install takes 5–10 minutes and uses about 3 GB of disk space.

Open the Anaconda Navigator

On Mac: open Launchpad and find Anaconda-Navigator.
On Windows: hit the Start menu and search for Anaconda Navigator.

The Navigator is a friendly home screen for everything Anaconda installed. Take a look around. The "Home" tab shows tiles for tools you can launch — Jupyter Notebook, JupyterLab, Spyder, etc.

Confirm Python is working

Click the green Launch button under Jupyter Notebook. Your browser will open with a file-browser view. Click New → Python 3 (ipykernel) to make a new notebook.

In the first cell, type:

print("Python is alive!")

Hold Shift and press Enter. You should see Python is alive! printed below the cell. 🎉

Confirm pandas works

Make a new cell (click the + button) and run:

import pandas as pd
print(pd.__version__)

You should see a version number like 2.2.2. That means pandas is installed and importable. You're done.

⚠️ If Jupyter doesn't open
On Mac, you may need to allow Terminal access. On Windows, your antivirus might flag the launch β€” allow it. If nothing happens at all, open Anaconda Navigator again and try clicking Launch a second time.

Key takeaways

Sanity-check exercise

In a new notebook cell, run:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("All three are installed!")

If you see "All three are installed!" with no red error, you're ready for the rest of the course.

02.2 — Install VS Code — your code editor

Set up Visual Studio Code, Microsoft's free code editor, with the Python extension.

What you'll learn

Why an editor on top of Anaconda?

Jupyter notebooks (which Anaconda gave you) are great for analysis. But when you start writing reusable scripts — the kind you run on a schedule — you'll want a proper code editor. VS Code is free, from Microsoft, and works the same way on Mac, Windows, and Linux.

Download VS Code

Go to code.visualstudio.com and click the big blue download button. Run the installer with default settings.

On Windows, when asked, tick the boxes for "Add 'Open with Code' action to Windows Explorer file context menu" and "Register Code as an editor for supported file types". Those make life easier later.

Open VS Code for the first time

Launch it. You'll see a welcome screen. The little icons on the far left are your activity bar — files, search, source control, and a "puzzle piece" for Extensions.

Install the Python extension

Click the puzzle-piece icon (Extensions). In the search box, type Python. Install the one made by Microsoft (it has millions of downloads).

That extension is what makes VS Code understand Python: syntax colours, error highlighting, "run this file" buttons, debugging.

Point VS Code at the Anaconda Python

Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) to open the Command Palette. Type Python: Select Interpreter and pick the option that says "Anaconda" or has anaconda3 in the path.

This tells VS Code: "When I press Run, use the Python that came with Anaconda."

Make your first script

In VS Code, go to File → New File. Save it as hello.py on your Desktop. Paste in:

print("Hello from VS Code!")
for i in range(5):
    print(f"  Line {i + 1}")

Click the green ▶ "Run" arrow in the top right (or press F5 and pick "Python File"). The output appears in the panel at the bottom.

💡 The two-window workflow
Many pros keep VS Code open for scripts and Jupyter open for exploration. You'll naturally fall into the same rhythm.

Key takeaways

Two-minute exercise

Save a new file called about_me.py. Inside, write three lines that print your name, your job role, and what you hope to automate with Python. Run it. Notice how the output appears at the bottom of VS Code, instantly.

02.3 — The terminal — a friendly introduction

Five commands that are all you really need: cd, ls/dir, mkdir, pip, python.

What you'll learn

The terminal is not as scary as it looks

The terminal is a text-based way to talk to your computer. You'll need it for exactly two things in this course: installing libraries (pip install ...), and occasionally running a Python script (python myscript.py). Five commands, and you're set.

How to open it

The five commands you actually need

1. pwd (Mac) / cd on its own (Windows) — "where am I?"

pwd
# /Users/yourname

Tells you which folder the terminal is currently looking at.

2. ls (Mac) / dir (Windows) — "what's in this folder?"

ls
# Documents  Desktop  Downloads  Music ...

3. cd folder_name — "move into this folder"

cd Desktop      # move into Desktop
cd ..           # move one folder up
cd ~            # go to your home folder (Mac/Linux)

4. mkdir new_folder_name — "make a new folder"

mkdir python-projects
cd python-projects

5. pip install <library> — install a Python library

pip install openpyxl

You'll see a stream of text as pip downloads and installs the library. When the prompt comes back, it's done.

Bonus: python myscript.py — run a script

python hello.py
# Hello from VS Code!
#   Line 1
#   Line 2
#   ...

Walkthrough: install a library

Open the terminal (or Anaconda Prompt on Windows)

Type a pip command

pip install xlsxwriter

Press Enter. Watch the lines scroll. When you see Successfully installed xlsxwriter-..., you're done.

Confirm it worked

Open a Jupyter notebook or a Python file and run:

import xlsxwriter
print(xlsxwriter.__version__)

You should see a version number, not a red error.

⚠️ "Command not found" on Mac
If pip isn't recognised, try pip3 instead, or open Anaconda Navigator → Environments → click the green ▶ next to base (root) → "Open Terminal", then run pip install ... from that terminal.
⚠️ Windows: must use the Anaconda Prompt
Regular Command Prompt may not see your Anaconda Python. Always launch the Anaconda Prompt from the Start menu for these commands.

Key takeaways

Five-minute hands-on

  1. Open your terminal (or Anaconda Prompt).
  2. Make a new folder on your Desktop called python-for-excel.
  3. cd into it.
  4. Run pip install xlsxwriter requests beautifulsoup4.
  5. You just installed three libraries we'll use later. Easy.

02.4 — Turn on Python inside Excel (=PY())

Get the =PY() formula working in your Microsoft 365 Excel.

What you'll learn

What you need

📝 Don't have Microsoft 365?
You can still do this entire course. We use =PY() heavily in Module 8; everywhere else, regular Python with the Anaconda setup is enough. If you can get a company Microsoft 365 licence, though, it's worth doing.

Quick setup walkthrough

Update Excel

In Excel, go to File → Account → Update Options → Update Now. Make sure you're on a recent build. Python in Excel has been generally available in Microsoft 365 since 2024.

Make a new blank workbook

Open a fresh workbook so you don't accidentally break a real one.

Click an empty cell and type the new formula

Click cell A1. Type =PY( — when you do this, Excel switches the cell into Python mode. You'll notice the formula bar turns into a multi-line area and shows a little green "PY" badge.

Inside the cell, type:

1 + 1

Press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to commit. After a brief "calculating in the cloud" moment, the cell shows 2.

Try a slightly bigger one

Click another empty cell, type =PY(, then:

import pandas as pd
df = pd.DataFrame({"city": ["NYC", "LA", "Chicago"], "pop_m": [8.3, 3.9, 2.7]})
df

Ctrl+Enter. Excel now shows the DataFrame as a special "Python object" cell. Right-click it → Output as → Excel values to spill it into proper Excel cells.

What just happened?

When you type =PY(, Excel sends your code to Microsoft's cloud, runs it in a real Python interpreter (with pandas, NumPy, and matplotlib pre-installed), and returns the result to your spreadsheet. Total round trip: usually 1–3 seconds.

💡 The two output modes
A Python cell can return its result either as a Python object (a single cell that holds, say, a whole DataFrame) or as Excel values (spilled into a range, one cell per value). Toggle with Right-click → Output as.

Key takeaways

Quick check

In a fresh Excel sheet, get this to work end-to-end:

=PY(
import pandas as pd
ages = pd.Series([24, 31, 45, 22, 38])
ages.mean()
)

You should see 32 in the cell. If you do β€” Python in Excel is on, and Module 8 will be a breeze.

02.5 — A two-minute tour of Jupyter notebooks

Cells, run buttons, kernels, save, share — the bits of Jupyter you'll touch every day.

What you'll learn

Why you'll use Jupyter all the time

Jupyter is the notebook tool that came with Anaconda. It's the standard way to do data analysis in Python: you type a few lines of code, run them, see the result instantly, write a note about what you saw, type a few more lines. Repeat. Save the whole thing as a document you can share.

Open a notebook

Anaconda Navigator → Jupyter Notebook → Launch. Browser opens. Navigate to a folder. Click New → Python 3.

The anatomy of a notebook

The keyboard shortcuts that change your life

Shortcut        What it does
Shift+Enter     Run the current cell and move to the next
Ctrl+Enter      Run the current cell and stay on it
Alt+Enter       Run and insert a new cell below
Esc then A      Insert a new cell above
Esc then B      Insert a new cell below
Esc then D D    Delete the current cell
Esc then M      Turn the current cell into a Markdown note
Esc then Y      Turn it back into code

Walkthrough: build a tiny analysis notebook

Make a Markdown title cell

Press Esc M, then type # My first analysis. Shift+Enter. Notice it renders as a big heading.

Make a code cell that creates some data

import pandas as pd
sales = pd.DataFrame({
    "rep":   ["Aisha", "Ben", "Carlos", "Dora", "Eve"],
    "deals": [12, 7, 19, 8, 14],
    "value": [42000, 18000, 71000, 22000, 39000],
})
sales

Shift+Enter. You see the table.

A second code cell that summarises

sales["value"].sum()

Shift+Enter. You see 192000.

Save

Ctrl+S (or Cmd+S on Mac). The file is saved as Untitled.ipynb. Click the title at the top to rename it.

Export

File → Download as → HTML gives you a self-contained web page you can email. Download as → PDF gives you a printable version. Download as → Notebook (.ipynb) gives you the raw file someone else can open in their Jupyter.

💡 The "last expression auto-prints" trick
A code cell automatically prints the result of its last line, without needing print(). That's why sales on its own showed the whole table.

Key takeaways

Exercise

Open Jupyter. Make a new notebook called warmup.ipynb. Build it with:

  1. A Markdown heading: "Warm-up exercise".
  2. A code cell that creates a list of your five favourite numbers.
  3. A code cell that prints their sum.
  4. A Markdown cell at the bottom that says how surprising the answer was (or wasn't).
  5. Save it. Export it as HTML. Open the HTML file in your browser.

02.6 — When things break — reading errors without panic

How to read a Python error message, the five most common errors, and what to do about each.

What you'll learn

The good news about Python errors

Python errors look intimidating because they're long. They're actually friendly: the last line tells you exactly what went wrong, and the line above that tells you where. Read bottom-up and ignore the rest until you're more advanced.

Anatomy of a traceback

Traceback (most recent call last):
  File "report.py", line 12, in <module>
    total = price * quantitiy
NameError: name 'quantitiy' is not defined

Read bottom-up:

The five errors you'll see most

1. NameError: name 'x' is not defined

Meaning: Python doesn't know what x is.
Usual cause: Typo, or you forgot to create the variable, or you tried to use it before the line that defines it.

Fix: check spelling and make sure the variable is created on an earlier line.
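A minimal reproduction, reusing the typo from the traceback above (the variable names here are just examples):

```python
price = 19.99
quantity = 3

# Buggy: 'quantitiy' was never defined, so this line raises
#   NameError: name 'quantitiy' is not defined
# total = price * quantitiy

# Fixed: the spelling matches the variable created above
total = price * quantity
print(round(total, 2))   # 59.97
```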

2. SyntaxError: invalid syntax

Meaning: The code doesn't follow Python's grammar.
Usual cause: Missing colon at the end of an if/for/def line. Or a bracket that wasn't closed. Or a quote that wasn't closed.

# Wrong: no colon
if age > 18
    print("adult")

# Right
if age > 18:
    print("adult")

3. IndentationError

Meaning: Wrong amount of leading whitespace.
Usual cause: You used a tab where Python expected spaces, or you indented inconsistently.

Fix: pick four-space indentation everywhere. VS Code handles this for you automatically when the file is recognised as Python.
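To see it for yourself, the extra indent on the second print below is the mistake:

```python
age = 20

# Wrong: the second print is indented deeper for no reason.
# Uncommenting these lines raises IndentationError: unexpected indent
# if age > 18:
#     print("adult")
#         print("checked")

# Right: every line in the block gets the same four spaces
if age > 18:
    status = "adult"
    checked = True
print(status, checked)   # adult True
```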

4. TypeError

Meaning: You're using a value of the wrong type.
Usual cause: Trying to add a number and a string.

# Wrong: "5" is a string, not a number
total = "5" + 10

# Right
total = int("5") + 10   # 15

5. FileNotFoundError

Meaning: The file you asked Python to open doesn't exist at that path.
Usual cause: Typo in the filename, wrong folder, or the file extension is wrong.

Fix: print the current folder first and double-check:

import os
print(os.getcwd())     # which folder am I in?
print(os.listdir())    # what's in this folder?

The "first response" checklist

  1. Read the last line of the error.
  2. Look at the line number just above it.
  3. Re-read that line in your code, slowly.
  4. If still stuck, paste the error into Google or an AI assistant verbatim.
  5. Rerun, see if the error changes. A different error = progress.
💡 Pro tip
Don't try to "guess" what's broken. Always re-read the actual error message. 95% of the time it tells you exactly what to fix.

Key takeaways

Make these errors happen on purpose

In a Jupyter cell, run each of these and see what error you get. Then fix it.

# 1
print(secret_number)

# 2
if 5 > 3
    print("yes")

# 3
"hello" + 5

# 4
open("not_a_real_file.txt")

Reading errors gets easier the more of them you see. Welcome to programming.

Module 03

Variables and Data Types

Strings, numbers, booleans, None — the building blocks of every Python value, and how they map to Excel's cell types.

By the end of this module you'll know what every value in Python is made of: text, whole numbers, decimal numbers, true/false flags, and "nothing." You'll be reading and writing the most-common Python building blocks like an Excel formula.

03.1 — Variables — Python's version of a named cell

How to give a value a name, and the rules for what names are allowed.

What you'll learn

What's a variable?

A variable is a name that points to a value. In Excel, you might put 100 in cell B2 and refer to it as =B2 from then on. In Python you write:

price = 100

From now on, anywhere you write price, Python substitutes in 100.

price = 100
tax = price * 0.2
print(price + tax)   # 120.0

The single equals sign means "store this"

This trips people up at first: = in Python is not the same as = in algebra. It doesn't mean "equals." It means "take what's on the right and store it under the name on the left."

x = 5      # store 5 in x
x = x + 1  # take whatever x is (5), add 1, store back in x
print(x)   # 6

The naming rules

Style: snake_case

Python convention is lowercase with underscores: monthly_total, customer_name, tax_rate. Not monthlyTotal (that's Java) or MonthlyTotal (that's a class). Stick to snake_case and your code will look professional.

Walkthrough: build a tiny invoice

Set up the inputs

customer_name = "Acme Co."
quantity = 3
unit_price = 19.99
tax_rate = 0.08

Compute

subtotal = quantity * unit_price
tax = subtotal * tax_rate
total = subtotal + tax

Print the receipt

print("Customer:", customer_name)
print("Subtotal:", subtotal)
print("Tax:     ", tax)
print("Total:   ", total)

Output:

Customer: Acme Co.
Subtotal: 59.97
Tax:      4.7976
Total:    64.7676
πŸ“ Don't panic about the long decimal
We'll fix the rounding (round(total, 2) β†’ 64.77) in the next lesson.

Key takeaways

Build your own pocket calculator

  1. Create variables for: a meal price, a tip percentage, the number of people splitting.
  2. Compute total with tip.
  3. Compute per-person share.
  4. Print everything with labels.

03.2 — Numbers — ints, floats, and the math operators

Whole numbers, decimal numbers, and every math operator Python supports.

What you'll learn

Two flavours: int and float

int = a whole number: 0, 1, -42, 1_000_000.
float = a decimal number: 0.5, -3.14, 2.0.

You can check what kind a value is with type():

type(42)        # <class 'int'>
type(3.14)      # <class 'float'>
type(1_000_000) # <class 'int'>   underscores are legal as thousand-separators

The math operators

Operator   What it does                           Example    Result
+          Add                                    3 + 4      7
-          Subtract                               10 - 3     7
*          Multiply                               5 * 6      30
/          Divide (always returns a float)        10 / 4     2.5
//         Integer divide (drops the remainder)   10 // 4    2
%          Modulo (the remainder)                 10 % 4     2
**         Power                                  2 ** 10    1024

Two tricks worth knowing

Round to N decimals

round(3.14159, 2)   # 3.14
round(1234.5678, 0) # 1235.0

Format with commas and currency

total = 1234567.891
print(f"${total:,.2f}")     # $1,234,567.89

The f"..." string is called an f-string. We cover it properly in the next lesson; for now know that you can pop variables and format hints into one inside curly braces.

Integer vs float — when does it matter?

Mostly it doesn't. Python converts between them automatically. Two practical gotchas:
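For illustration, here are two gotchas that catch most spreadsheet users (a sketch, not an exhaustive list):

```python
# Gotcha 1: / always produces a float, even for an exact division
print(10 / 2)         # 5.0, not 5
print(type(10 / 2))   # <class 'float'>

# Gotcha 2: floats are binary approximations, so tiny errors creep in
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False
print(round(0.1 + 0.2, 2) == 0.3)   # True: round before comparing money
```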

Walkthrough: split a bill three ways

Inputs

bill = 87.45
tip_pct = 18
people = 3

Compute

tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people

Print neatly

print(f"Bill:       ${bill:,.2f}")
print(f"Tip ({tip_pct}%): ${tip:,.2f}")
print(f"Total:      ${total:,.2f}")
print(f"Per person: ${per_person:,.2f}")

Key takeaways

Power-of-two table

Print the first ten powers of two in a column, like this:

2^1 = 2
2^2 = 4
2^3 = 8
...
2^10 = 1024

Hint: use a for loop (we cover those in Module 4) or just write ten lines.

03.3 — Strings — text values and how to wrangle them

Quotes, f-strings, slicing, methods like .upper() and .replace().

What you'll learn

What's a string?

Text. Anything in quotes. Single or double — Python doesn't care, but pick one and be consistent.

name = "Alice"
greeting = 'Hello there'
long_text = """This is a multi-line string.
It can wrap across as many lines as you like."""

Joining strings together

first = "Alice"
last = "Smith"
full = first + " " + last     # "Alice Smith"

F-strings — the modern way

Put an f right before the opening quote, and you can drop variables straight inside {}:

name = "Alice"
sales = 1234.567
print(f"{name} sold ${sales:,.2f}")
# Alice sold $1,234.57

F-strings replaced the older "{} sold ${:.2f}".format(name, sales) style. Use f-strings everywhere.

String methods you'll use weekly

Method             What it does               Example                        Result
.upper()           UPPERCASE                  "hi".upper()                   "HI"
.lower()           lowercase                  "Hi".lower()                   "hi"
.title()           Title Case                 "alice smith".title()          "Alice Smith"
.strip()           Trim whitespace            " abc ".strip()                "abc"
.replace(a, b)     Substitute                 "$1,200".replace(",", "")      "$1200"
.split(",")        Break into a list          "a,b,c".split(",")             ["a","b","c"]
" ".join([...])    Glue list back together    " ".join(["a","b"])            "a b"
.endswith("X")     Boolean check              "acme.csv".endswith(".csv")    True

Slicing — pulling out a chunk

Like grabbing a substring with MID() or LEFT() in Excel.

name = "Anthropic"
name[0]      # 'A'        — first character
name[-1]     # 'c'        — last character
name[0:4]    # 'Anth'     — first four
name[4:]     # 'ropic'    — from index 4 to the end
name[::-1]   # 'cipohtnA' — reversed (a Python party trick)

Walkthrough: clean a messy column of names

Sample data

raw = "  alice SMITH  "

Clean step by step

step1 = raw.strip()           # "alice SMITH"
step2 = step1.lower()         # "alice smith"
step3 = step2.title()         # "Alice Smith"
print(step3)

Or chain it

cleaned = raw.strip().lower().title()
print(cleaned)   # Alice Smith

Chaining like this — calling one method on the result of another — is a Python superpower. You'll see it constantly in pandas.

Key takeaways

Clean a list of email addresses (on paper)

Given " Alice@Example.COM ", write the Python that turns it into a clean lowercase, trimmed email.

Answer
"  Alice@Example.COM ".strip().lower()
# 'alice@example.com'

03.4 — Booleans and comparisons

True, False, and the operators that compare values.

What you'll learn

True and False — capitalised

is_open = True
is_closed = False
type(is_open)   # <class 'bool'>

Two values. Lowercase true/false won't work — Python is fussy about the capitals.
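You can see the fussiness directly; lowercase true is treated as an undefined variable name:

```python
is_open = True       # fine: capital T

try:
    is_open = true   # lowercase: Python looks for a variable called 'true'
except NameError as e:
    print(e)         # mentions that 'true' is not defined
```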

Comparison operators

Operator   Meaning            Example    Result
==         Equal to           5 == 5     True
!=         Not equal to       5 != 5     False
>          Greater than       10 > 3     True
<          Less than          10 < 3     False
>=         Greater or equal   5 >= 5     True
<=         Less or equal      5 <= 4     False
⚠️ One equals vs two equals
= stores a value. == tests for equality. x = 5 is "store 5 in x". x == 5 is "is x equal to 5?". Mixing them up is the most common beginner bug.

Combining: and, or, not

age = 25
income = 80000

(age > 18) and (income > 50000)   # True
(age < 18) or (income > 100000)   # False
not (age < 18)                    # True

"Truthy" and "falsy"

Python treats some non-boolean values as if they were True or False:

name = ""
if name:
    print(f"Hi, {name}")
else:
    print("No name provided.")
# Output: No name provided.
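A sketch of the values Python treats as falsy; everything else counts as truthy:

```python
# Falsy: empty strings, empty containers, zero, and None
print(bool(""), bool(0), bool([]), bool(None))   # False False False False

# Truthy: everything else, even things that look like "nothing"
print(bool("0"))    # True: a non-empty string, even the text "0"
print(bool(-1))     # True: any non-zero number
print(bool([0]))    # True: a non-empty list, even one holding a zero
```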

The in operator — a useful bonus

"alice" in ["alice", "bob", "carol"]   # True
"@" in "alice@example.com"             # True

Often replaces a chunky IF + OR formula in Excel.

Key takeaways

Quick check

What does each of these print?

print(10 == 10.0)
print(10 == "10")
print("a" in "alpha")
print(0 or 5)
Answers
  1. True — Python is happy comparing 10 (int) and 10.0 (float).
  2. False — a number is never equal to a string of that number.
  3. True — "a" appears in "alpha".
  4. 5 — or returns the first truthy value.

03.5 — None, type conversions, and reading the type

The special value None, plus the int(), str(), float() conversion functions.

What you'll learn

None — Python's "nothing"

Sometimes you need a variable to mean "I don't have a value yet" or "this is blank." Python uses None for that.

last_login = None
if last_login is None:
    print("New user")

Compare with is, not ==, when checking for None.

Type conversions

You'll constantly get data as the wrong type — a number that came in as text, a date that came in as a number. Convert with these:

Function   What it does                      Example
int(x)     Try to turn into a whole number   int("42") → 42
float(x)   Try to turn into a decimal        float("3.14") → 3.14
str(x)     Turn into text                    str(42) → "42"
bool(x)    Turn into True/False              bool("hi") → True

The "everything is a string when it comes from outside" rule

Any value that arrives from a CSV, a web form, the command line, or a user input box arrives as a string. If you want to do math with it, convert first.

# Pretend "12" came from a CSV cell
quantity = "12"
price = 5

# This is a bug — it repeats "12" five times
print(quantity * price)        # "1212121212"

# Convert first
print(int(quantity) * price)   # 60

Checking the type at runtime

type(42)         # int
type("hi")       # str
type([1,2,3])    # list
isinstance(42, int)   # True

isinstance() is the cleaner version. Use it in real code.

Walkthrough: a "safe" number parse

The problem

A user might type "42", or "forty-two", or accidentally hit space. We want a function that returns the number, or None if it can't.

The fix using try/except

def safe_int(text):
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        return None

safe_int("42")          # 42
safe_int("forty-two")   # None
safe_int("  100 ")      # 100

Don't worry about try/except in detail — we cover it in Module 4. The point: real-world data is messy, and Python gives you tools to handle "what if it's broken."

Key takeaways

Spot the bug

response = input("How old are you? ")
if response > 18:
    print("Adult")

What's broken? Fix it.

Answer

input() always returns a string. Comparing a string to an int crashes. Fix: if int(response) > 18:

03.6 — Talking to the user — input() and print()

How to ask the user a question and show them an answer.

What you'll learn

input() — ask a question

name = input("What's your name? ")
print(f"Hello, {name}!")

When this runs in a script or terminal, the prompt appears, the program pauses, the user types something, hits Enter, and the typed text is stored in name. Always as a string — even if they typed digits.

print() — talk back

print("Hello")
print("a", "b", "c")           # a b c
print("a", "b", "c", sep="-")  # a-b-c
print("first")
print("second")
print("third", end=" ")
print("on the same line")

A complete mini-program

# tip_calculator.py
bill = float(input("Bill amount: "))
tip_pct = float(input("Tip percent: "))
people = int(input("How many people? "))

tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people

print()
print(f"Total: ${total:,.2f}")
print(f"Each:  ${per_person:,.2f}")
💡 input() in Jupyter
The same input() works in a Jupyter notebook — a little prompt appears under the cell.

Key takeaways

Build a unit converter

Ask the user for kilometres, convert to miles (km × 0.621), print the result.

03.7 — Comments — notes to your future self

How to leave notes in your code, and the difference between a comment and a docstring.

What you'll learn

The # character

Anything after a # on a line is ignored by Python. Use this for notes.

# Read the raw export the finance team dumps in this folder every Monday
raw = pd.read_csv("monday_dump.csv")

# Drop the rows where the customer column is blank — they're rejected orders
clean = raw.dropna(subset=["customer"])

Block comments

Multiple lines, each starting with #. There's no /* ... */ in Python.

# --- Configuration ---
# Change these once a month after the price review meeting
TAX_RATE = 0.08
SHIPPING_FLAT = 7.50

Docstrings — comments for functions

A docstring is a triple-quoted string right under a function's def line. Tools and IDEs use it to show the function's help.

def cents_to_dollars(cents):
    """Convert an integer cents value to a float dollars value.

    Example:
        cents_to_dollars(1299) -> 12.99
    """
    return cents / 100

When comments help — and when they're noise

Good comments explain why, not what:

# Bad — restates what the code already says
x = x + 1  # add 1 to x

# Good — explains the why
x = x + 1  # account for the off-by-one in the upstream CSV

Key takeaways

Comment a tangled script

Take your tip calculator from the last lesson. Add three comments: one at the top explaining what the script does, one above the calculation block explaining the formula, one at the bottom explaining what's printed.

Module 04

Control Flow

if / else / elif, for and while loops, and Python's gentle error-handling — try/except.

04.1 — if / else — making decisions

Python's IF() statement: indentation, elif, the comparison operators in action.

What you'll learn

The shape of an if-statement

age = 18

if age >= 18:
    print("Adult")
else:
    print("Minor")

Notice three things:

elif — chained conditions

Like IFS() in Excel:

score = 73

if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
elif score >= 70:
    grade = "C"
elif score >= 60:
    grade = "D"
else:
    grade = "F"

print(grade)   # C

One-line if (when it improves readability)

status = "VIP" if total > 1000 else "regular"

Use sparingly. Multi-line is usually clearer.

The "early return / guard" pattern

Nested ifs get ugly fast. Flatten them with early returns:

# Tangled
def process(order):
    if order is not None:
        if order["status"] == "paid":
            if order["amount"] > 0:
                ship(order)

# Cleaner
def process(order):
    if order is None: return
    if order["status"] != "paid": return
    if order["amount"] <= 0: return
    ship(order)

Walkthrough: pricing tiers

Inputs

customer_type = "wholesale"
quantity = 250

Compute the discount

if customer_type == "wholesale":
    if quantity >= 500:
        discount = 0.20
    elif quantity >= 100:
        discount = 0.10
    else:
        discount = 0.05
elif customer_type == "vip":
    discount = 0.15
else:
    discount = 0.00

print(f"{discount*100:.0f}% discount")   # 10% discount

Key takeaways

Shipping cost calculator

Free shipping over $100. $5 flat if weight < 1kg. Otherwise $15. Write the if/elif/else.

04.2 — for loops — doing something for every item

Loop through lists, ranges, dictionaries, and lines of a file.

What you'll learn

The shape of a for loop

for name in ["Alice", "Bob", "Carol"]:
    print(f"Hello, {name}!")

Read it as: "for every name in this list, do the indented stuff." Three names → three greetings.

Looping a fixed number of times — range()

for i in range(5):
    print(i)
# 0
# 1
# 2
# 3
# 4

range(5) = 0, 1, 2, 3, 4. (Stops before 5 — Python's standard half-open style.)

range(1, 6)        # 1, 2, 3, 4, 5
range(0, 10, 2)    # 0, 2, 4, 6, 8

enumerate() — when you need the index

items = ["bread", "milk", "eggs"]
for i, item in enumerate(items, start=1):
    print(f"{i}. {item}")
# 1. bread
# 2. milk
# 3. eggs

Looping a dictionary

prices = {"bread": 3.50, "milk": 2.10, "eggs": 5.00}

for product, price in prices.items():
    print(f"{product}: ${price}")

break and continue

for n in range(20):
    if n % 2 != 0:   # odd
        continue
    if n > 12:
        break
    print(n)
# 0 2 4 6 8 10 12

Walkthrough: total a list of invoices

The data

invoices = [
    {"customer": "Acme",  "amount": 1200},
    {"customer": "Beta",  "amount":  800},
    {"customer": "Acme",  "amount":  300},
    {"customer": "Gamma", "amount": 1500},
]

Loop and accumulate

total = 0
for inv in invoices:
    total += inv["amount"]
print(total)   # 3800

total += inv["amount"] is shorthand for total = total + inv["amount"].

💡 The shortcut
For totals, Python gives you a built-in: sum(inv["amount"] for inv in invoices). We meet that pattern (a "generator expression") later.

When NOT to write a loop

If you're using pandas, you almost never need a manual loop. df["amount"].sum() beats writing out the loop in both speed and clarity. We'll see this everywhere from Module 9 onwards.
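As a small preview (pandas arrives properly in Module 9; the amount column here mirrors the invoices above):

```python
import pandas as pd

df = pd.DataFrame({"amount": [1200, 800, 300, 1500]})

# Manual loop: works, but verbose
total = 0
for value in df["amount"]:
    total += value
print(total)                 # 3800

# Vectorised: one call does the same job
print(df["amount"].sum())    # 3800
```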

Key takeaways

FizzBuzz

For numbers 1 to 30: print "Fizz" if divisible by 3, "Buzz" if divisible by 5, "FizzBuzz" if both, otherwise the number. Classic interview question. You now know enough to write it.

04.3 — while loops — repeat until done

When you don't know how many iterations you need: while, break, and the infinite-loop pitfall.

What you'll learn

The shape of a while loop

balance = 100
months = 0
while balance < 200:
    balance = balance * 1.05   # 5% per month
    months += 1
print(f"Took {months} months")   # 15

"Keep going while the condition is true." When the condition flips false, the loop exits.

The classic infinite loop

x = 0
while x < 10:
    print(x)
    # forgot to change x — runs forever

If you ever fire one off by accident, hit the ⬛ stop button in Jupyter, or Ctrl+C in the terminal.

while True + break — the "until" pattern

while True:
    answer = input("Type 'quit' to stop: ")
    if answer == "quit":
        break
    print(f"You typed: {answer}")

When to use while vs for

If you know how many times you'll iterate, use for.
If you keep going until a condition flips, use while.

Key takeaways

Doubling game

Start at 1. Keep doubling until you exceed 1,000,000. Print how many doublings it took.

04.4 — List comprehensions — one-line loops

The compact way to build a list from another list. Less code, more clarity.

What you'll learn

The pattern

# Long form
squares = []
for n in range(10):
    squares.append(n * n)

# Comprehension — same result
squares = [n * n for n in range(10)]

Read it left to right: "a list of n * n for every n in range(10)."

With a filter

evens = [n for n in range(20) if n % 2 == 0]
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Transform + filter

names = ["alice", "bob", "carol", "dave"]
short_upper = [n.upper() for n in names if len(n) < 5]
# ['BOB', 'DAVE']

Dict and set comprehensions

# dict
prices = {"a": 1, "b": 2, "c": 3}
doubled = {k: v * 2 for k, v in prices.items()}
# {'a': 2, 'b': 4, 'c': 6}

# set (no duplicates)
words = ["apple", "ant", "banana"]
first_letters = {w[0] for w in words}
# {'a', 'b'}
💡 When to use them
Use a comprehension when the loop body is one short transformation. If you're doing five things to each item, use a regular for-loop — comprehensions stop helping when they get long.

Key takeaways

Three quick conversions

Convert each loop to a list comprehension.

# 1
result = []
for n in range(20):
    if n % 3 == 0:
        result.append(n)

# 2
upper_names = []
for name in ["alice", "bob"]:
    upper_names.append(name.upper())

# 3
totals = []
for r in [{"qty":3,"price":10}, {"qty":2,"price":15}]:
    totals.append(r["qty"] * r["price"])

04.5 — try / except — handling errors gracefully

Catch exceptions so your script keeps running instead of dying on bad data.

What you'll learn

The pattern

try:
    n = int(user_input)
except ValueError:
    n = 0
    print("Couldn't parse the number, defaulting to 0.")

Python runs the try block. If a ValueError is raised, control jumps to the except block. If no error, the except is skipped.

The big four exception types
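Which four earn the title varies from course to course, but these come up constantly. Here each is triggered on purpose so you can see it caught (a sketch, not an official list):

```python
# ValueError: the type is right, the content isn't
try:
    int("forty-two")
except ValueError as e:
    print("ValueError:", e)

# TypeError: mixing incompatible types
try:
    "5" + 10
except TypeError as e:
    print("TypeError:", e)

# KeyError: asking a dict for a key it doesn't have
try:
    {"a": 1}["b"]
except KeyError as e:
    print("KeyError:", e)

# FileNotFoundError: opening a path that doesn't exist
try:
    open("no_such_file_12345.csv")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
```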

Catch specific, not generic

# OK but lazy — swallows bugs you'd want to know about
try:
    do_thing()
except:
    pass

# Better — explicit about what you expect
try:
    do_thing()
except FileNotFoundError:
    print("File missing, skipping.")

The finally block — cleanup that always runs

f = open("data.txt")
try:
    work_with(f)
except Exception as e:
    print(f"Failed: {e}")
finally:
    f.close()      # runs whether we succeeded or not

Walkthrough: process a folder of CSVs that might be broken

The plan

Loop through all CSVs in a folder. Try to read each. If one is broken, log it and keep going — don't let one bad file stop the rest.

The script

import pandas as pd
from pathlib import Path

results = []
for path in Path("monthly_files").glob("*.csv"):
    try:
        df = pd.read_csv(path)
        results.append((path.name, len(df)))
    except Exception as e:
        print(f"Skipped {path.name}: {e}")

print(f"Processed {len(results)} files successfully")
⚠️ Don't catch bugs
The wrong place for try/except: hiding bugs. If your code has a typo, you want it to crash so you can fix it. Only catch errors you've thought about in advance.

Key takeaways

Safe parser

Write a function to_float(text) that returns a float if it can, or None if it can't. Then test it with: "12.5", "abc", "", "100".

04.6 — Putting it together — a mini contact-cleaner

Combine if/for/comprehensions/try-except into your first useful script.

What you'll learn

The goal

You receive a list of contacts where each item looks like one of these:

raw = [
    "  Alice Smith,alice@example.com,42  ",
    "Bob Brown,BOB@EXAMPLE.com,",     # missing age
    "carol,wrong-email,29",            # no last name, bad email
    ",dave@example.com,55",            # no name
    "  Eve  Wells, EVE@example.com , 31 ",
]

You want a clean list of dicts with valid contacts only.

Walkthrough

Step 1: split the line and trim each piece

cleaned = []
for line in raw:
    parts = [p.strip() for p in line.split(",")]
    cleaned.append(parts)

Step 2: validate name, email, age

contacts = []
for name, email, age_text in cleaned:
    if not name:                       continue       # need a name
    if "@" not in email:               continue       # need an @
    try:
        age = int(age_text)
    except ValueError:
        age = None
    contacts.append({
        "name":  name.title(),
        "email": email.lower(),
        "age":   age,
    })

Step 3: inspect

for c in contacts:
    print(c)

Output:

{'name': 'Alice Smith', 'email': 'alice@example.com', 'age': 42}
{'name': 'Bob Brown',   'email': 'bob@example.com',   'age': None}
{'name': 'Eve  Wells',  'email': 'eve@example.com',   'age': 31}

Carol got dropped (bad email) and Dave got dropped (no name). Bob made it in with a None age. Real data, real result.

Key takeaways

Add a feature

Extend the cleaner: also reject contacts under age 18. Add a counter for how many were rejected and print it at the end.

Module 05

Functions

Define your own reusable mini-tools. Arguments, return values, defaults, and the rules that keep functions honest.

05.1 — def — defining your own function

Make a custom 'formula' you can call from anywhere in your code.

What you'll learn

Why functions matter

You write the same five-line calculation in three places, then need to fix a typo. With a function: fix it once, every caller gets the fix. Functions are how code stays maintainable.

The shape

def add_tax(amount, rate):
    return amount * (1 + rate)

# Use it
total = add_tax(100, 0.08)
print(total)   # 108.0

Five things to notice:

  1. def — keyword that starts a function definition.
  2. add_tax — the function's name. Use snake_case.
  3. (amount, rate) — the parameters the function expects.
  4. Colon, indented body — same as if/for.
  5. return — sends a value back to the caller.

Positional vs keyword arguments

add_tax(100, 0.08)              # positional: amount=100, rate=0.08
add_tax(amount=100, rate=0.08)  # keyword: same thing, more explicit
add_tax(rate=0.08, amount=100)  # keyword: order doesn't matter

Use keyword arguments when you're calling a function with several parameters — it makes the call site readable.

Default values

def add_tax(amount, rate=0.08):
    return amount * (1 + rate)

add_tax(100)         # uses default rate of 0.08 → 108
add_tax(100, 0.10)   # overrides → 110

Docstrings

Add a triple-quoted string right under def. Editors will show it as help when you hover the function name.

def add_tax(amount, rate=0.08):
    """Return amount + sales tax at the given rate."""
    return amount * (1 + rate)

Functions that don't return a value

If you don't write return, Python returns None automatically. That's fine for functions whose job is to print something or save a file β€” their effect is what matters, not the return value.

def banner(msg):
    line = "=" * len(msg)
    print(line)
    print(msg)
    print(line)

banner("Monday report")
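
You can confirm the automatic None by capturing what banner gives back:

```python
def banner(msg):
    line = "=" * len(msg)
    print(line)
    print(msg)
    print(line)

result = banner("Monday report")
print(result)   # None  (banner never returns anything)
```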

Key takeaways

Write three functions

  1. celsius_to_fahrenheit(c) β€” return F.
  2. discounted_price(price, discount_pct=10) β€” return final price.
  3. full_name(first, last, middle="") β€” return a properly spaced full name.

05.2 β€” Returning multiple values & unpacking

How to send back more than one thing β€” using tuples and dict returns.

What you'll learn

The tuple return

def split_full_name(full):
    first, last = full.split(" ", 1)
    return first, last

f, l = split_full_name("Alice Smith")
print(f)   # Alice
print(l)   # Smith

Behind the scenes, returning first, last is the same as returning the tuple (first, last). The parentheses are optional.

Unpacking

name, age, role = ("Alice", 32, "analyst")
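
Two more unpacking tricks worth knowing: swapping variables, and unpacking inside a for loop (the SKU pairs here are just sample data):

```python
# Swap two variables, no temporary needed
a, b = 1, 2
a, b = b, a          # a == 2, b == 1

# Unpack pairs as you loop over a list of tuples
basket = [("SKU-001", 2), ("SKU-002", 1)]
for sku, qty in basket:
    print(f"{sku} x{qty}")
```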

The dict return β€” when there are many fields

def summary(numbers):
    return {
        "count": len(numbers),
        "total": sum(numbers),
        "mean":  sum(numbers) / len(numbers),
        "max":   max(numbers),
        "min":   min(numbers),
    }

stats = summary([10, 20, 30, 40, 50])
print(stats["mean"])   # 30.0

Better than returning a 5-tuple: at the call site you read stats["mean"], not stats[2].

πŸ’‘ Rule of thumb
Two return values β†’ tuple. Three or more β†’ dict (or a class, which we won't need in this course).

Key takeaways

min_max_avg

Write a function that takes a list of numbers and returns min, max, and average. Try a tuple return first; then rewrite as a dict.

05.3 β€” Scope β€” where variables live

Local vs global, why your variable disappeared after the function ran, and how to handle shared state safely.

What you'll learn

Local scope

Variables created inside a function only exist inside that function. Once it returns, they're gone.

def calc():
    temp = 42
    return temp * 2

result = calc()
print(result)   # 84
print(temp)     # NameError β€” temp doesn't exist out here

Why this is good

It means a function is a self-contained box. Whatever it does inside can't accidentally rename or corrupt a variable you have outside. Imagine the alternative β€” every function in a million-line program clobbering each other's variables.
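
A tiny demonstration: the same name inside and outside refers to two different variables:

```python
label = "outside"

def relabel():
    label = "inside"    # a brand-new local variable, not the one above
    return label

print(relabel())   # inside
print(label)       # outside  (untouched by the function)
```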

Reading a global is OK; changing one is sketchy

TAX_RATE = 0.08   # global constant

def add_tax(amount):
    return amount * (1 + TAX_RATE)   # OK to read
# Bad pattern β€” reaching outside to change a global
TOTAL = 0
def add(n):
    global TOTAL
    TOTAL += n

# Good pattern β€” return the new value, let the caller decide
def add(total, n):
    return total + n

The mental model

A clean function takes its inputs as parameters and gives its output as a return value. It doesn't reach out into the world. It's predictable, testable, and reusable. Treat global like cigarettes β€” legal, occasionally unavoidable, mostly a sign you should reconsider.

Key takeaways

Refactor

RUNNING_TOTAL = 0

def add_invoice(amount):
    global RUNNING_TOTAL
    RUNNING_TOTAL += amount
    return RUNNING_TOTAL

Rewrite without the global. Pass the running total in, return the updated value.

05.4 β€” lambda, map, filter β€” the functional shortcuts

Tiny one-line functions and a quick tour of map() and filter().

What you'll learn

What's a lambda?

A one-line function that doesn't get a name. Useful when you need to pass a tiny function to another function.

# With def
def double(x):
    return x * 2

# Same thing as a lambda
double = lambda x: x * 2

You'll mostly see lambdas inline, not stored in variables. The most common use: key= in sorted().

sorted() with a key

people = [
    {"name": "Alice", "age": 30},
    {"name": "Bob",   "age": 25},
    {"name": "Carol", "age": 35},
]

by_age = sorted(people, key=lambda p: p["age"])
# [{'name':'Bob',...}, {'name':'Alice',...}, {'name':'Carol',...}]

by_name_desc = sorted(people, key=lambda p: p["name"], reverse=True)

map() β€” "apply this function to every item"

names = ["alice", "bob", "carol"]
upper = list(map(str.upper, names))   # ['ALICE', 'BOB', 'CAROL']

# Often clearer as a comprehension:
upper = [n.upper() for n in names]

filter() β€” "keep items where this is true"

ages = [12, 15, 18, 21, 17, 30]
adults = list(filter(lambda x: x >= 18, ages))   # [18, 21, 30]

# Also clearer as a comprehension:
adults = [x for x in ages if x >= 18]
πŸ“ In practice
Python programmers reach for list comprehensions far more often than map() / filter(). But you'll see all three in the wild β€” recognise them and you can read any Python code.

Key takeaways

Sort by multiple keys

Sort the people list above by age ascending, but use name as a tiebreaker. Hint: a lambda can return a tuple.

05.5 β€” Modules and imports β€” using other people's code

import, from-import, aliases, and the difference between standard library and third-party.

What you'll learn

The three import styles

# 1 β€” the whole module
import math
math.sqrt(16)        # 4.0

# 2 β€” with an alias (the convention for big libraries)
import pandas as pd
pd.read_csv("file.csv")

# 3 β€” a specific name
from math import sqrt, pi
sqrt(16)             # 4.0
pi                   # 3.14159...

Standard library vs third party

The standard library (math, datetime, csv, pathlib, and friends) ships with every Python install; import it and go. Third-party libraries like pandas and openpyxl have to be installed first (Anaconda bundles the big ones; otherwise pip install pandas) before an import will work.

The conventional aliases

Library             Conventional alias
pandas              import pandas as pd
numpy               import numpy as np
matplotlib.pyplot   import matplotlib.pyplot as plt
seaborn             import seaborn as sns

Use the conventions. Strangers reading your code will thank you.

Writing your own module

Any .py file is automatically importable from a script in the same folder. Put related functions in helpers.py:

# helpers.py
def clean_email(s):
    return s.strip().lower()

Then in main.py next to it:

from helpers import clean_email
clean_email("  ALICE@example.com ")   # 'alice@example.com'

Key takeaways

Use the standard library

Use the datetime module to print today's date in the format YYYY-MM-DD. Hint: from datetime import date; date.today().isoformat().

05.6 β€” Anatomy of a real script

How professional Python scripts are organised β€” imports, constants, functions, and the if __name__ block.

What you'll learn

The standard layout

"""monthly_report.py β€” generate the monthly sales report.

Usage: python monthly_report.py
"""

# ---- 1. Imports (standard library first, then third-party, then your own) ----
import os
from datetime import date

import pandas as pd

from helpers import clean_email

# ---- 2. Constants ----
INPUT_FILE  = "sales_raw.csv"
OUTPUT_FILE = "sales_clean.xlsx"
TAX_RATE    = 0.08

# ---- 3. Functions ----
def load(path):
    return pd.read_csv(path)

def clean(df):
    df["email"] = df["email"].apply(clean_email)
    df["total"] = df["price"] * (1 + TAX_RATE)
    return df

def save(df, path):
    df.to_excel(path, index=False)

def main():
    df = load(INPUT_FILE)
    df = clean(df)
    save(df, OUTPUT_FILE)
    print(f"Wrote {len(df)} rows to {OUTPUT_FILE}")

# ---- 4. The "run me" line ----
if __name__ == "__main__":
    main()

What's that if __name__ == "__main__" thing?

It's the line that says: "Run main() only if I was launched directly, not when someone imports me from another file."

Why it matters: if you put loose code at the bottom of the file, that code runs even when you just import the file from somewhere else. The if __name__ idiom prevents that.
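
Here's the failure mode in miniature. Suppose a file (call it report.py; the name is just for illustration) ends with loose code:

```python
# report.py (no __name__ guard)
def build():
    return "monthly report"

print("building:", build())   # runs even on `import report`
```

Anyone who does import report just to reuse build() gets that print as a surprise side effect. Move the loose line inside if __name__ == "__main__": and the import stays silent.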

The benefit of breaking into functions

πŸ’‘ The 30-second rule
If you can describe a piece of code in one short sentence, it's probably a function. "Clean the email column" β†’ clean_emails(df). "Read the input file" β†’ load(path).

Key takeaways

Refactor your tip calculator

Take the tip-calculator script you wrote in Module 3 and split it into three functions: get_inputs(), compute(bill, tip_pct, people), and print_results(...). Call them from main().

Module 06

Data Structures

Lists, dicts, tuples, sets β€” Python's built-in collections and the Excel equivalents for each.

06.1 β€” Lists β€” the everyday ordered collection

Create, index, slice, modify. The Python equivalent of a single-column range.

What you'll learn

Creating a list

fruits = ["apple", "banana", "cherry"]
numbers = [10, 20, 30, 40, 50]
mixed = [1, "two", 3.0, True]
empty = []

Indexing

fruits[0]    # 'apple'  β€” Python starts at 0
fruits[1]    # 'banana'
fruits[-1]   # 'cherry' β€” negative indexes count from the end
fruits[-2]   # 'banana'

Slicing

nums = [10, 20, 30, 40, 50]
nums[1:4]     # [20, 30, 40]
nums[:3]      # [10, 20, 30]
nums[3:]      # [40, 50]
nums[:]       # [10, 20, 30, 40, 50]  β€” a copy
nums[::2]     # [10, 30, 50]  β€” every other one

Modifying

fruits = ["apple", "banana"]
fruits.append("cherry")        # add to the end
fruits.insert(0, "avocado")    # insert at index 0
fruits.remove("banana")        # remove by value
last = fruits.pop()            # remove and return last
fruits[0] = "apricot"          # replace by index

Useful list operations

len(fruits)         # how many items
"apple" in fruits   # True/False
sorted(numbers)     # returns a sorted copy
numbers.sort()      # sorts in place
numbers.reverse()
sum(numbers)        # 150
max(numbers)        # 50
min(numbers)        # 10

Walkthrough: track daily sales

Start with last week's numbers

daily = [1200, 950, 1340, 1100, 1600, 0, 0]   # Mon..Sun

This morning's sale comes in

daily.append(1450)        # Monday of week 2
print(daily)              # [1200, 950, 1340, 1100, 1600, 0, 0, 1450]

Print weekday totals only (drop the weekend)

weekdays = daily[:5]
print(sum(weekdays))      # 6190
⚠️ Lists are mutable
If you pass a list to a function and the function changes it, the original list is changed too. To pass a copy, use my_list[:] or list(my_list).
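
Here's that gotcha in miniature:

```python
def add_total(rows):
    rows.append(sum(rows))   # modifies the list it was handed

data = [1, 2, 3]
add_total(data)
print(data)    # [1, 2, 3, 6]  (the caller's list changed)

data2 = [1, 2, 3]
add_total(data2[:])          # pass a copy instead
print(data2)   # [1, 2, 3]  (original untouched)
```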

Key takeaways

Quick drill

Given nums = [5, 2, 8, 1, 9, 3], write one line each to:

  1. Get the average.
  2. Get the second-largest.
  3. Get a sorted list, descending.
  4. Get just the odd numbers.

06.2 β€” Dictionaries β€” key-value lookups

VLOOKUP, built-in to the language. Create, look up, update, iterate.

What you'll learn

What's a dict?

A dictionary maps keys to values. It's a lookup table.

prices = {
    "apple":  1.20,
    "banana": 0.50,
    "cherry": 3.00,
}

prices["apple"]    # 1.20
prices["banana"]   # 0.50

Adding and updating

prices["date"] = 5.00        # add a new key
prices["apple"] = 1.50       # update existing
del prices["cherry"]         # remove

The .get() escape hatch

Accessing a missing key crashes:

prices["mango"]   # KeyError!

.get() returns None (or a default) instead:

prices.get("mango")          # None
prices.get("mango", 0)       # 0  β€” supply a default

Iterating

for fruit in prices:
    print(fruit)             # just keys

for fruit, price in prices.items():
    print(f"{fruit}: ${price}")

prices.keys()    # dict_keys(['apple', 'banana', ...])
prices.values()  # dict_values([1.20, 0.50, ...])
prices.items()   # dict_items([('apple', 1.20), ...])

The "row as dict" pattern β€” this one's important

Real-world data often looks like one dict per row:

customers = [
    {"id": 1, "name": "Alice",  "city": "NYC",     "ltv": 1200},
    {"id": 2, "name": "Bob",    "city": "Chicago", "ltv":  800},
    {"id": 3, "name": "Carol",  "city": "NYC",     "ltv": 2200},
]

# Find all NYC customers
nyc = [c for c in customers if c["city"] == "NYC"]
# Total NYC LTV
total = sum(c["ltv"] for c in customers if c["city"] == "NYC")

Pandas DataFrames are basically a faster version of this same structure. Internalise the pattern and DataFrames will feel natural.

Walkthrough: build a quick price lookup

Define the lookup

price_list = {
    "SKU-001": 19.99,
    "SKU-002": 29.99,
    "SKU-003":  9.99,
}

Cost a basket

basket = [("SKU-001", 2), ("SKU-002", 1), ("SKU-099", 4)]   # last one isn't in the list

total = 0
for sku, qty in basket:
    price = price_list.get(sku)
    if price is None:
        print(f"Unknown SKU: {sku}")
        continue
    total += price * qty

print(f"Total: ${total:.2f}")

Key takeaways

Word counter

Given a sentence, build a dict where each key is a word and each value is how many times it appears.

06.3 β€” Tuples and when to use them

Lists you can't change. Smaller, faster, and a perfect fit for fixed pairs.

What you'll learn

A tuple is an immutable list

point = (10, 20)
rgb = (255, 0, 128)

point[0]      # 10
point[0] = 5  # TypeError β€” tuples are immutable

The bare-comma shortcut

point = 10, 20   # also a tuple β€” parens are optional
x, y = point     # unpack into two variables

When to use a tuple

Fixed-size records that shouldn't change: an (x, y) point, a (year, month, day) date, a function's multiple return values. Tuples can also be dict keys; lists can't.

When to use a list

Collections that grow, shrink, or get re-sorted: rows read from a file, results you append in a loop, anything whose length isn't fixed.

Key takeaways

Date-keyed dict

Use tuples of (year, month, day) as keys in a dict. Add three keys with notes as values. Print one of them.

06.4 β€” Sets β€” unique values, fast

When you need to deduplicate, check membership, or do union/intersection.

What you'll learn

A set is an unordered collection of unique values

colors = {"red", "green", "blue"}
colors.add("red")        # already there β€” ignored
colors.add("yellow")
print(colors)            # {'red','green','blue','yellow'}  (order varies)

Deduplicating a list

raw = ["alice@x.com", "bob@y.com", "alice@x.com", "carol@z.com"]
unique = list(set(raw))
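
One caveat: a set throws away the original order. If order matters, dict.fromkeys dedupes while keeping first-seen order:

```python
raw = ["alice@x.com", "bob@y.com", "alice@x.com", "carol@z.com"]

unique = list(set(raw))              # deduped, order not guaranteed
in_order = list(dict.fromkeys(raw))  # ['alice@x.com', 'bob@y.com', 'carol@z.com']
```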

Set operations

last_month = {"Alice", "Bob", "Carol"}
this_month = {"Bob", "Carol", "Dave"}

last_month | this_month   # union           {Alice, Bob, Carol, Dave}
last_month & this_month   # intersection    {Bob, Carol}
last_month - this_month   # difference      {Alice}      β€” churned
this_month - last_month   # difference      {Dave}       β€” new

That last pair β€” "who churned" and "who's new" β€” is a classic real-world use.

Fast membership

Checking x in big_list scans every item. Checking x in big_set is nearly instant, regardless of size. If you're checking membership a lot, use a set.
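
The practical pattern: build the set once, then check against it in a loop or comprehension. (The SKU values are made-up sample data.)

```python
valid_skus = {"SKU-001", "SKU-002", "SKU-003"}   # a set: fast lookups

orders = ["SKU-001", "SKU-099", "SKU-003", "SKU-001"]
unknown = [s for s in orders if s not in valid_skus]
print(unknown)   # ['SKU-099']
```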

Key takeaways

Churn analysis

Given last month's customers (a list) and this month's customers (a list), print the count of churned and new customers.

06.5 β€” Nested structures β€” lists of dicts, dicts of lists

How real-world data actually looks in Python before pandas comes along.

What you'll learn

List of dicts β€” the most common shape

orders = [
    {"id": 101, "customer": "Acme",  "items": ["A1", "B7"],         "total":  450},
    {"id": 102, "customer": "Beta",  "items": ["C3"],               "total":  120},
    {"id": 103, "customer": "Acme",  "items": ["A1", "A1", "B7"],   "total":  780},
]

Reading it

first = orders[0]
print(first["customer"])      # Acme
print(first["items"][0])      # A1
print(orders[2]["items"][-1]) # B7

Looping

for order in orders:
    print(f"Order {order['id']} for {order['customer']}: ${order['total']}")

Aggregating across a list of dicts

# Total revenue
total = sum(o["total"] for o in orders)        # 1350

# Revenue per customer
by_customer = {}
for o in orders:
    cust = o["customer"]
    by_customer[cust] = by_customer.get(cust, 0) + o["total"]

print(by_customer)   # {'Acme': 1230, 'Beta': 120}

And here's why DataFrames are so popular

That last calculation took five lines. With pandas it's one:

import pandas as pd
df = pd.DataFrame(orders)
df.groupby("customer")["total"].sum()

Same answer. Same data. Less code. That's why Module 9 is the turning point in this course.

Key takeaways

Group by city

Given a list of customer dicts each with city and ltv, build a dict mapping city to total LTV.

06.6 β€” Working with dates

The datetime module, parsing strings to dates, and date arithmetic.

What you'll learn

The two types you'll use

from datetime import date, datetime, timedelta

d = date(2026, 5, 16)           # just a date
dt = datetime(2026, 5, 16, 14, 30, 0)   # date + time

Today / now

date.today()                # date(2026, 5, 16)
datetime.now()              # full timestamp

Parsing a string

datetime.strptime("2026-05-16", "%Y-%m-%d")
datetime.strptime("16/05/2026", "%d/%m/%Y")
datetime.strptime("May 16, 2026", "%B %d, %Y")

The format codes are documented at strftime.org. The big ones: %Y year, %m month, %d day, %H hour, %M minute.

Formatting back to a string

d = date.today()
d.isoformat()                   # '2026-05-16'
d.strftime("%B %d, %Y")         # 'May 16, 2026'
d.strftime("%A")                # 'Saturday'

Date arithmetic

today = date.today()
tomorrow = today + timedelta(days=1)
last_week = today - timedelta(weeks=1)

age_days = (today - date(1990, 1, 1)).days   # how many days old

Walkthrough: end-of-month dates for the next year

Build a list of month-ends

from datetime import date
from calendar import monthrange

year = 2026
month_ends = []
for m in range(1, 13):
    last_day = monthrange(year, m)[1]
    month_ends.append(date(year, m, last_day))

for d in month_ends:
    print(d.isoformat())

Key takeaways

Days until your birthday

Write a script that prints how many days until your next birthday.

06.7 β€” Files and paths β€” the pathlib essentials

Read and write text files; build paths that work on Mac, Windows, and Linux.

What you'll learn

Reading a text file

with open("notes.txt", "r") as f:
    contents = f.read()
print(contents)

with open(...) as f: is the safe pattern β€” Python closes the file for you automatically when the indented block ends.

Writing a text file

with open("output.txt", "w") as f:
    f.write("Hello!\n")
    f.write("Second line.\n")

"w" overwrites; "a" appends.
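
Append mode is handy for simple run logs that grow over time. A quick sketch (the filename is arbitrary):

```python
from datetime import datetime

# "a" creates the file if missing, otherwise adds to the end
with open("run_log.txt", "a") as f:
    f.write(f"ran at {datetime.now().isoformat()}\n")
```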

Reading line by line

with open("notes.txt") as f:
    for line in f:
        print(line.strip())

pathlib β€” the modern way to build paths

from pathlib import Path

home = Path.home()                     # ~/
desktop = home / "Desktop"             # ~/Desktop
output = desktop / "report.xlsx"       # ~/Desktop/report.xlsx

output.exists()           # True/False
output.name               # 'report.xlsx'
output.suffix             # '.xlsx'
output.parent             # PosixPath('~/Desktop')

The / operator joins paths in the correct way for Mac/Linux/Windows. Use this everywhere instead of string concatenation.

Looping through every file in a folder

from pathlib import Path

folder = Path("reports/2026/q1")
for path in folder.glob("*.xlsx"):
    print(path.name, path.stat().st_size)

glob("*.xlsx") = "every .xlsx in this folder."
rglob("*.xlsx") = "every .xlsx in this folder and all subfolders."

Walkthrough: gather every CSV from a year's worth of monthly subfolders

The structure

data/
  2026-01/
    sales.csv
    returns.csv
  2026-02/
    sales.csv
  ...

List them all

from pathlib import Path

for p in Path("data").rglob("*.csv"):
    print(p)

Read and combine

import pandas as pd
frames = [pd.read_csv(p) for p in Path("data").rglob("sales.csv")]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))

A few lines of code, dozens of files, one DataFrame. This pattern alone has saved many an analyst's morning.

Key takeaways

Inventory script

Write a script that lists every file in your Downloads folder along with its size. Print a one-line total at the end.

Module 07

Working with Files

Open CSVs, Excel workbooks, folders full of spreadsheets, and write back out. The bread-and-butter file operations.

07.1 β€” Reading and writing CSV files

The csv standard library module and the pandas one-liner equivalents.

What you'll learn

Two ways: csv module and pandas

The csv module is built into Python. pandas is the third-party library we'll lean on heavily. For 95% of work, pandas is what you want. For tiny scripts or constrained environments, the csv module is fine.

The pandas one-liner

import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())

Done. You have a DataFrame.

Writing a CSV

df.to_csv("clean.csv", index=False)

index=False suppresses pandas' row-number column, which is usually what you want.

Useful read_csv options

pd.read_csv("sales.csv",
            sep=",",                  # delimiter
            encoding="utf-8",         # "latin-1" for some legacy exports
            skiprows=2,               # skip first two rows
            header=0,                 # row to use as headers
            usecols=["date", "amount"],  # read only these columns
            parse_dates=["date"],     # turn date column into real dates
            dtype={"id": str},        # force this column to be string
            nrows=1000,               # read only first 1000 rows
           )

The csv module (for when you can't use pandas)

import csv

with open("sales.csv") as f:
    reader = csv.DictReader(f)
    rows = list(reader)

print(rows[0])      # {'date': '2026-01-05', 'amount': '120', ...}

Every value comes back as a string. You convert as you go.
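
For example, summing an amount column means converting each value yourself. Here the rows are written out literally, shaped exactly as DictReader returns them:

```python
# What csv.DictReader hands you: every value is a string
rows = [
    {"date": "2026-01-05", "amount": "120"},
    {"date": "2026-01-06", "amount": "80"},
]

total = sum(float(r["amount"]) for r in rows)
print(total)   # 200.0
```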

Common CSV headaches and fixes

Problem                                    Fix
UnicodeDecodeError                         Try encoding="latin-1" or encoding="cp1252"
Numbers come in as strings (with commas)   df["amount"] = df["amount"].str.replace(",", "").astype(float)
Dates won't sort                           Use parse_dates=["col"] at read time
Wrong delimiter (semicolons or tabs)       sep=";" or sep="\t"
Leading/trailing spaces in headers         df.columns = df.columns.str.strip()

Walkthrough: end-to-end CSV cleanup

Read

import pandas as pd
df = pd.read_csv("sales_raw.csv", parse_dates=["date"])

Inspect

df.head()
df.info()
df.describe()

Clean

df.columns = df.columns.str.strip().str.lower()
df = df.dropna(subset=["customer"])
df["amount"] = df["amount"].astype(float)

Save

df.to_csv("sales_clean.csv", index=False)

Key takeaways

CSV inspector

Write a small script that takes a CSV path and prints: number of rows, number of columns, the column names, and the first three rows.

07.2 β€” Reading Excel files with pandas

read_excel: the easiest way to get a .xlsx into Python.

What you'll learn

The basic read

import pandas as pd
df = pd.read_excel("monthly_report.xlsx")

By default this reads the first sheet.

Pick a specific sheet

df = pd.read_excel("workbook.xlsx", sheet_name="January")
df = pd.read_excel("workbook.xlsx", sheet_name=2)   # by index, 0-based

Read every sheet into a dict

sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
# sheets is a dict: {"January": df1, "February": df2, ...}

for name, df in sheets.items():
    print(name, len(df))

Skip header rows / pick a range

pd.read_excel("workbook.xlsx",
              sheet_name="Summary",
              skiprows=3,            # skip the title block
              usecols="B:F",         # read columns B through F
              nrows=50)

Write back

# One sheet
df.to_excel("output.xlsx", index=False, sheet_name="Results")

# Multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
    monthly.to_excel(writer, sheet_name="Monthly", index=False)
    yearly.to_excel(writer,  sheet_name="Yearly",  index=False)

Walkthrough: combine 12 monthly tabs into one

Read every sheet

sheets = pd.read_excel("2026_monthly.xlsx", sheet_name=None)

Tag each with its month and stack

frames = []
for name, df in sheets.items():
    df["month"] = name
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)

Save the combined version

combined.to_excel("2026_combined.xlsx", index=False, sheet_name="All")
πŸ’‘ The dependency
pandas uses openpyxl under the hood for .xlsx. Anaconda includes it; if you used a different install: pip install openpyxl.

Key takeaways

Sheet inventory

Write a function inventory(path) that returns a DataFrame with one row per sheet in a workbook, with columns: sheet_name, n_rows, n_cols.

07.3 β€” openpyxl β€” when you need cell-level control

Read/write individual cells, formatting, formulas, and styling. The library pandas uses under the hood.

What you'll learn

When you reach for openpyxl

Pandas treats Excel like a flat table. If you need the spreadsheety bits β€” colours, formulas, merged cells, comments, charts β€” drop down to openpyxl.

from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.title = "Summary"

ws["A1"] = "Customer"
ws["B1"] = "Revenue"
ws["A2"] = "Acme"
ws["B2"] = 1200
ws["A3"] = "Beta"
ws["B3"] = 800

ws["B4"] = "=SUM(B2:B3)"   # a real Excel formula

wb.save("openpyxl_demo.xlsx")

Reading existing files

wb = load_workbook("openpyxl_demo.xlsx")
ws = wb["Summary"]

print(ws["A1"].value)         # 'Customer'
print(ws.cell(row=2, column=2).value)   # 1200

for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)

Styling β€” bold, fill, number format

from openpyxl.styles import Font, PatternFill

bold_white = Font(bold=True, color="FFFFFF")
green_fill = PatternFill(start_color="217346", end_color="217346", fill_type="solid")

for cell in ws[1]:                # first row = headers
    cell.font = bold_white
    cell.fill = green_fill

ws["B2"].number_format = '"$"#,##0.00'

Column widths and freeze panes

ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 14
ws.freeze_panes = "A2"

Walkthrough: turn a DataFrame into a styled report

Write the DataFrame with pandas

import pandas as pd
df = pd.DataFrame({"Customer": ["Acme","Beta","Gamma"], "Revenue": [1200,800,1500]})
df.to_excel("report.xlsx", index=False, sheet_name="Sales")

Re-open with openpyxl and style

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill

wb = load_workbook("report.xlsx")
ws = wb["Sales"]

header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill("solid", fgColor="217346")
for cell in ws[1]:
    cell.font = header_font
    cell.fill = header_fill

for cell in ws["B"][1:]:                # data rows
    cell.number_format = '"$"#,##0.00'

ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 16
ws.freeze_panes = "A2"

wb.save("report.xlsx")

Key takeaways

Style your own report

Take a DataFrame with three columns, write it with pandas, then add: bold green header, currency format on the numeric column, autosized column widths, frozen header row.

07.4 β€” Looping through many files

Combine, transform, or report on a whole folder of files in one go.

What you'll learn

The 'combine many files' pattern

from pathlib import Path
import pandas as pd

folder = Path("monthly_files")
frames = []
for path in sorted(folder.glob("*.csv")):
    df = pd.read_csv(path)
    df["source_file"] = path.name      # track which file
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(f"Combined {len(frames)} files, {len(combined):,} total rows")

This pattern shows up everywhere β€” monthly reports, regional exports, daily dumps. Add a sort on path.stat().st_mtime if you need chronological order.
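
That chronological variant looks like this (same folder as above, just a different sort key):

```python
from pathlib import Path

folder = Path("monthly_files")
# Oldest file first, by modification time
paths = sorted(folder.glob("*.csv"), key=lambda p: p.stat().st_mtime)
```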

Adding parsed info from the filename

If files are called sales_2026-01.csv, you can pull the month out:

for path in folder.glob("sales_*.csv"):
    month = path.stem.split("_")[1]   # '2026-01'
    df = pd.read_csv(path)
    df["month"] = month
    frames.append(df)

What if the files have inconsistent columns?

pd.concat() aligns columns by name; missing columns become NaN. If files have wildly different shapes, normalise each one before stacking:

def normalise(df):
    df.columns = df.columns.str.strip().str.lower()
    keep = ["date", "customer", "amount"]
    return df.reindex(columns=keep)

frames = [normalise(pd.read_csv(p)) for p in folder.glob("*.csv")]
combined = pd.concat(frames, ignore_index=True)

Walkthrough: monthly Excel files β†’ one yearly file

Inputs

from pathlib import Path
import pandas as pd
import re

folder = Path("monthly_xlsx")
frames = []

Read each Excel file, tag with month

for path in sorted(folder.glob("*.xlsx")):
    m = re.search(r"(\d{4}-\d{2})", path.stem)
    if not m:
        continue
    df = pd.read_excel(path)
    df["month"] = m.group(1)
    frames.append(df)

Combine and save

year_df = pd.concat(frames, ignore_index=True)
year_df.to_excel("yearly_combined.xlsx", index=False)
print(f"Wrote {len(year_df):,} rows.")

Key takeaways

Combine your Downloads

Pretend you have ten CSVs in a folder. Write the combine-and-tag script. Bonus: print the row count per file before combining.

07.5 β€” Text files, JSON, and other formats

When you're not dealing with CSV/Excel: plain text, JSON, and a quick note on XML.

What you'll learn

JSON in two lines

JSON is the format APIs and lots of modern tools use. Python's json module reads it straight into a dict.

import json

# Read
with open("config.json") as f:
    config = json.load(f)
print(config["api_key"])

# Write
data = {"name": "Alice", "scores": [10, 20, 30]}
with open("out.json", "w") as f:
    json.dump(data, f, indent=2)

JSON ↔ DataFrame

import pandas as pd
df = pd.read_json("orders.json")                  # array of objects
df.to_json("clean.json", orient="records", indent=2)

Parsing a log file

from pathlib import Path

lines = Path("server.log").read_text().splitlines()
errors = [l for l in lines if "ERROR" in l]
print(len(errors), "errors")

XML β€” yes, you'll occasionally see it

For old enterprise systems and government data, you'll get XML. The fastest way: pandas can read simple XML tables.

df = pd.read_xml("data.xml")

For gnarlier XML, the standard library's xml.etree.ElementTree works, and the third-party lxml is faster.

Key takeaways

Config saver

Write a script that saves your settings (any three key/value pairs you like) to a JSON file, then reads them back and prints them.

07.6 β€” Encoding, BOMs, and other file gotchas

Why exports from old systems look like garbage, and how to fix them.

What you'll learn

The classic symptom

You read a CSV and see things like cafοΏ½ instead of cafΓ©, or Ò€ℒ instead of an apostrophe. That's an encoding mismatch β€” the file was saved with one character encoding and you're reading it as another.

The two-line fix

# Try utf-8 first (default)
df = pd.read_csv("export.csv")

# If garbled, try latin-1 (Windows / older systems)
df = pd.read_csv("export.csv", encoding="latin-1")

# Or cp1252 (also common from Windows Excel)
df = pd.read_csv("export.csv", encoding="cp1252")

If none of those work, ask the system that exported the file what encoding it uses.

The BOM problem

Some Excel CSV exports include an invisible "byte-order mark" character at the start of the file. Symptom: your first column header looks like ο»ΏCustomer with a weird prefix.

pd.read_csv("export.csv", encoding="utf-8-sig")

The -sig suffix tells pandas to strip the BOM.

Stray whitespace and invisible characters

# Strip columns and string values in one pass
df.columns = df.columns.str.strip()
for col in df.select_dtypes("object"):
    df[col] = df[col].str.strip()

Date columns that come in as strings

df["date"] = pd.to_datetime(df["date"], errors="coerce")

errors="coerce" turns bad dates into NaT (not-a-time) instead of crashing.

Key takeaways

Encoding sleuth

Write a function safe_read_csv(path) that tries utf-8, then utf-8-sig, then latin-1, and returns the first one that works.

Module 08

Python inside Excel β€” =PY()

Use Python in a cell like a formula. Pass ranges, return DataFrames, chart inline, and decide when =PY() is the right tool.

This is the headline feature. The =PY() formula lets you write a few lines of Python directly inside an Excel cell and get the result back into the spreadsheet. Cloud-powered, no install required (other than Microsoft 365), and surprisingly powerful.

⚠️ Requires Microsoft 365
Standalone Excel 2019/2021/2024 doesn't include =PY(). If you don't have 365, you can still read this module for the concepts and use Python in Jupyter for the same outcomes.

08.1 β€” =PY() β€” the first ten formulas

Type =PY(, write Python, hit Ctrl+Enter. The basic muscle memory.

What you'll learn

The syntax

  1. Click an empty cell.
  2. Type =PY( β€” Excel switches the cell into Python mode. A green "PY" badge appears.
  3. Write your Python.
  4. Press Ctrl+Enter (Windows) / Cmd+Enter (Mac) to commit.

Don't just hit Enter β€” that adds a newline inside the cell. Ctrl+Enter is the "submit" key.

Ten warm-up formulas

# 1 β€” basic math
=PY(1 + 1)                # 2

# 2 β€” a variable
=PY(
x = 100
x * 1.08
)                         # 108

# 3 β€” string
=PY("hello".upper())      # HELLO

# 4 β€” list aggregation
=PY(sum([10, 20, 30]))    # 60

# 5 β€” current date
=PY(
from datetime import date
date.today()
)

# 6 β€” a Series
=PY(
import pandas as pd
pd.Series([1, 2, 3, 4, 5]).mean()
)                         # 3.0

# 7 β€” a small DataFrame
=PY(
import pandas as pd
pd.DataFrame({"x": [1,2,3], "y": [4,5,6]})
)

# 8 β€” random
=PY(
import random
random.choice(["heads", "tails"])
)

# 9 β€” string manipulation
=PY("alice@example.com".split("@")[1])   # example.com

# 10 β€” sorted list
=PY(sorted([3, 1, 4, 1, 5, 9, 2, 6]))

Output mode: Python object vs Excel values

By default, a =PY() cell shows a Python object β€” a single cell that holds (say) a whole DataFrame. To spill the contents into proper Excel cells:

  1. Right-click the cell.
  2. Output as β†’ Excel values.
  3. The DataFrame spills into a range. Headers become column headers, rows become rows.

Toggle back to Python-object mode if you want to use the cell's value in another =PY() formula.

Walkthrough: build a summary cell

Put some numbers in A1:A10

Type any 10 numbers in cells A1 through A10.

In B1, write =PY()

=PY(
xl("A1:A10").mean()
)

The xl(...) helper grabs an Excel range as a pandas Series or DataFrame. .mean() averages it.

Try a more useful summary

=PY(
s = xl("A1:A10")
{"count": s.count(),
 "sum":   s.sum(),
 "mean":  s.mean(),
 "min":   s.min(),
 "max":   s.max()}
)

Right-click β†’ Output as β†’ Excel values. The dict spills into a two-column table.

Key takeaways

Calculator cell

In a fresh workbook, put numbers in A1:A20. In B1, write a =PY() formula that computes the median. In B2, one that computes the standard deviation.

08.2 β€” The xl() helper β€” getting Excel data into Python

How to reference cells, ranges, and tables from inside =PY().

What you'll learn

The xl() forms

xl("A1")                       # single value
xl("A1:C5")                    # range β†’ DataFrame (or Series if one col)
xl("A1:C5", headers=True)      # treat first row as headers
xl("Sheet2!A1:C100")           # range from another sheet
xl("MyTable[#All]")            # an Excel Table by name
xl("MyTable[Revenue]")         # one column of a Table

Single cells

=PY(
tax_rate = xl("B1")
amount   = xl("B2")
amount * (1 + tax_rate)
)

Range with headers

Best practice: lay your data out with headers in the first row, then read with headers=True:

=PY(
df = xl("A1:D100", headers=True)
df.groupby("Region")["Revenue"].sum()
)

Cross-sheet references

=PY(
df = xl("RawData!A1:F1000", headers=True)
df.head()
)

Excel Tables β€” the cleanest way

If your data is a real Excel Table (Ctrl+T on a range), you can reference it by name. That's the most readable option and survives row insertions/deletions:

=PY(
sales = xl("SalesTable[#All]", headers=True)
sales["Total"] = sales["Qty"] * sales["Price"]
sales
)

Walkthrough: build a small dashboard cell

Set up the data

Put this in A1:C5 (headers in row 1):

Region      Quarter   Revenue
North       Q1        1200
North       Q2        1500
South       Q1         900
South       Q2        1100

Summary cell

=PY(
df = xl("A1:C5", headers=True)
df.groupby("Region")["Revenue"].agg(["sum", "mean"])
)

Right-click β†’ Output as β†’ Excel values. You get a clean two-column summary.

πŸ’‘ The =PY() pattern that scales
Store your raw data as an Excel Table. Use =PY() cells purely for analysis and presentation. When new rows come in, the analysis updates automatically.

Key takeaways

Top three

Lay out a list of 10 employees with a Sales column. In a =PY() cell, return the top three by sales as a DataFrame.

08.3 β€” Pandas in =PY() β€” the killer combination

Run real pandas operations inside Excel cells. Groupby, pivot, sort, filter.

What you'll learn

The setup

Put this raw data in A1:D11 (10 rows + headers):

Date         Region   Product   Revenue
2026-01-05   North    Widget    1200
2026-01-12   North    Gadget     800
2026-01-19   South    Widget     950
2026-02-02   South    Gadget    1100
2026-02-09   East     Widget    1300
2026-02-16   East     Gadget     700
2026-03-03   North    Widget    1500
2026-03-10   South    Widget    1050
2026-03-17   East     Gadget     900
2026-03-24   North    Gadget    1250
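No Microsoft 365 handy? The table above is easy to rebuild as a plain DataFrame in Jupyter, and the pandas lines in the formulas below run on it unchanged once you skip the xl(...) assignment:

```python
import pandas as pd

# The lesson's 10-row table, rebuilt by hand
df = pd.DataFrame({
    "Date":    ["2026-01-05", "2026-01-12", "2026-01-19", "2026-02-02",
                "2026-02-09", "2026-02-16", "2026-03-03", "2026-03-10",
                "2026-03-17", "2026-03-24"],
    "Region":  ["North", "North", "South", "South", "East",
                "East", "North", "South", "East", "North"],
    "Product": ["Widget", "Gadget", "Widget", "Gadget", "Widget",
                "Gadget", "Widget", "Widget", "Gadget", "Gadget"],
    "Revenue": [1200, 800, 950, 1100, 1300, 700, 1500, 1050, 900, 1250],
})
```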

Pivot by Region Γ— Product

=PY(
df = xl("A1:D11", headers=True)
df.pivot_table(index="Region", columns="Product",
               values="Revenue", aggfunc="sum")
)

Group by month

=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M")
df.groupby("Month")["Revenue"].sum()
)

Top N

=PY(
df = xl("A1:D11", headers=True)
df.nlargest(3, "Revenue")
)

Filter

=PY(
df = xl("A1:D11", headers=True)
df[df["Region"] == "North"]
)

Walkthrough: build a region Γ— month sales matrix

The =PY() cell

=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m")
m = df.pivot_table(index="Region", columns="Month",
                   values="Revenue", aggfunc="sum", fill_value=0)
m
)

Spill it

Right-click β†’ Output as β†’ Excel values. You get a clean Region Γ— Month matrix that updates whenever the source data changes.

πŸ“ Pandas is already imported
Inside =PY(), pandas is already available as pd by default. You don't need an import pandas as pd at the top β€” though it's harmless if you add it.

Key takeaways

Build your own pivot

Using the table above, write a =PY() that returns total Revenue by Product, sorted descending.

08.4 β€” Charts in =PY() with matplotlib and seaborn

Return a chart from a =PY() cell. Lives inside the workbook like an image.

What you'll learn

The simplest chart cell

=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar")
)

The cell returns a chart image. Click it; Excel lets you resize and reposition it.

Styling

=PY(
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar", ax=ax, color="#217346")
ax.set_title("Revenue by Region")
ax.set_ylabel("Revenue ($)")
ax.set_xlabel("")
plt.tight_layout()
)

Seaborn for nicer defaults

=PY(
import seaborn as sns
df = xl("A1:D11", headers=True)
sns.barplot(data=df, x="Region", y="Revenue", estimator="sum", errorbar=None)
)

Walkthrough: monthly trend chart

Aggregate first, plot second

=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M").dt.to_timestamp()
monthly = df.groupby("Month")["Revenue"].sum()

fig, ax = plt.subplots(figsize=(7, 3))
monthly.plot(ax=ax, marker="o", color="#2b6cb0")
ax.set_title("Monthly Revenue")
ax.grid(alpha=0.3)
plt.tight_layout()
)
πŸ’‘ Treat chart cells like images
A =PY() chart behaves like an image inside the spreadsheet β€” you can resize it, move it across sheets, and it'll update whenever the source data changes.

Key takeaways

Stacked bar

Build a stacked bar chart in =PY() showing Revenue by Region, with Product as the stack. Hint: df.pivot_table(...).plot(kind="bar", stacked=True).

08.5 β€” When =PY() is the right tool β€” and when it isn't

A decision guide so you reach for the right hammer.

What you'll learn

When =PY() is the right tool

When to stick with native formulas

The hybrid pattern

Use native formulas for the live, interactive parts. Use =PY() for the heavier analysis next to them. The two coexist beautifully.

Two performance notes

Key takeaways

Decide

For each task, pick: native formulas, =PY(), or both?

  1. "Sum revenue where region = North."
  2. "Forecast next quarter's revenue with a linear regression."
  3. "Look up a customer's tier from a side table."
  4. "Build a heatmap of orders by hour-of-day Γ— day-of-week."

08.6 β€” Twelve =PY() recipes you'll use

Copy-paste-ready snippets for the most common Python-in-Excel tasks.

What you'll learn

1 β€” Sum a column with a condition

=PY(
df = xl("DataTable[#All]", headers=True)
df.loc[df["Region"]=="North", "Revenue"].sum()
)

2 β€” Count distinct values

=PY(
xl("A2:A1000").nunique()
)

3 β€” Group-by total

=PY(
df = xl("A1:D1000", headers=True)
df.groupby("Region")["Revenue"].sum().sort_values(ascending=False)
)

4 β€” Pivot table

=PY(
df = xl("A1:D1000", headers=True)
df.pivot_table(index="Region", columns="Quarter",
               values="Revenue", aggfunc="sum", fill_value=0)
)

5 β€” Top N by metric

=PY(
df = xl("A1:E1000", headers=True)
df.nlargest(10, "Revenue")
)

6 β€” Running total

=PY(
df = xl("A1:B100", headers=True).sort_values("Date")
df["RunningTotal"] = df["Amount"].cumsum()
df
)

7 β€” Percentage of total

=PY(
s = xl("B2:B11")
(s / s.sum() * 100).round(1)
)

8 β€” Date difference in days

=PY(
df = xl("A1:C100", headers=True)
df["DaysOpen"] = (pd.to_datetime(df["Closed"])
                  - pd.to_datetime(df["Opened"])).dt.days
df
)

9 β€” Quartile bucketing

=PY(
df = xl("A1:B500", headers=True)
df["Bucket"] = pd.qcut(df["Score"], 4, labels=["Q1","Q2","Q3","Q4"])
df
)

10 β€” Inner join two ranges

=PY(
orders = xl("OrdersTable[#All]", headers=True)
customers = xl("CustomersTable[#All]", headers=True)
orders.merge(customers, on="CustomerID")
)

11 β€” Quick linear regression

=PY(
import numpy as np
df = xl("A1:B100", headers=True)
slope, intercept = np.polyfit(df["X"], df["Y"], 1)
{"slope": slope, "intercept": intercept}
)
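np.polyfit works on any two numeric sequences, inside Excel or out. A quick sanity check in Jupyter on data where the answer is known:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1                       # y is exactly linear in x
slope, intercept = np.polyfit(x, y, 1)
# least squares recovers slope = 2 and intercept = 1 (to float precision)
```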

12 β€” Heatmap chart

=PY(
import seaborn as sns
df = xl("A1:D1000", headers=True)
pivot = df.pivot_table(index="DayOfWeek", columns="Hour",
                       values="Orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu")
)

Key takeaways

Adapt one

Pick recipe #3, #5, or #10. Build a small dataset on a new sheet and make it run in your own workbook.

08.7 β€” Limits, costs, and security of =PY()

What =PY() can't do, where it stores your data, and how it's priced.

What you'll learn

The runtime model

When a =PY() cell calculates, Excel sends the code and the relevant Excel data to Microsoft's cloud, runs it in a sandboxed Python environment, and sends the result back. Your laptop does almost nothing β€” most of the work happens on Microsoft Azure.

What's pre-installed

Microsoft pre-installs a curated set of libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn, statsmodels, and a few more. You can't pip install something extra into a =PY() cell β€” what's there is what's there.

What's not allowed

Two big ones: network access (a =PY() cell can't call an API or scrape a page) and local file access (the sandbox can't see your drives, so data goes in through xl(...) and comes back as the cell result). And as noted above, no installing extra packages.

Calculation order

Python cells are calculated after regular Excel formulas, top-to-bottom, left-to-right by default. If you have one =PY() that depends on another, that's fine β€” Excel orders them correctly.

Privacy

The data in cells referenced by xl(...) is sent to Microsoft's servers. For most spreadsheet work this is fine; for highly sensitive data, check your organisation's policy. The data is not used to train AI models per Microsoft's published policy at launch β€” but verify the current policy at support.microsoft.com.

Cost

At launch, basic =PY() usage is included in Microsoft 365. Premium tiers offer faster compute and higher data limits. Check Microsoft's pricing page for the current numbers.

Performance gotchas

Key takeaways

Decide where each runs

Which of these should live in =PY(), and which in a Jupyter script?

  1. Scrape today's exchange rate from a website.
  2. Make a heatmap of last month's sales.
  3. Combine 50 monthly CSVs from a shared drive.
  4. Compute the correlation matrix of a 20-column table sitting in Excel.
Module 09

Pandas Crash Course

The most important library in this course. DataFrames, selection, filtering, grouping, joining, pivoting β€” the spreadsheet superpowers.

09.1 β€” DataFrames β€” Python's spreadsheet

Create a DataFrame, inspect it, and learn the vocabulary.

What you'll learn

Two ways to make a DataFrame

import pandas as pd

# From a dict of columns
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Carol"],
    "age":   [30, 25, 35],
    "city":  ["NYC", "Chicago", "LA"],
})

# From a list of rows (each row a dict)
rows = [
    {"name": "Alice", "age": 30, "city": "NYC"},
    {"name": "Bob",   "age": 25, "city": "Chicago"},
    {"name": "Carol", "age": 35, "city": "LA"},
]
df = pd.DataFrame(rows)

The five "what's in this thing?" checks

df.head()       # first 5 rows
df.tail(3)      # last 3 rows
df.shape        # (rows, columns)
df.info()       # column names, dtypes, missing counts
df.describe()   # numeric summary stats

Series vs DataFrame

A Series is a single column. A DataFrame is many columns aligned by row.

df["age"]           # a Series
type(df["age"])     # <class 'pandas.core.series.Series'>
df[["age", "city"]] # a DataFrame (note: double brackets)

Renaming columns

df = df.rename(columns={"name": "full_name"})
df.columns = ["full_name", "age", "city"]   # set all at once

Adding a column

df["is_adult"] = df["age"] >= 18
df["greeting"] = "Hi " + df["full_name"]

Walkthrough: from raw to summary in five lines

Read

df = pd.read_csv("sales.csv")
df.head()

Inspect

df.shape          # (1000, 6)
df.info()
df.describe()

Quick summary

df.groupby("region")["amount"].sum()

Key takeaways

Quick build

Create a DataFrame of five employees with name, department, salary. Print .describe() on it and notice what it shows for the numeric column versus the text ones.

09.2 β€” Selecting rows and columns β€” loc and iloc

Pull out the slice of data you actually want.

What you'll learn

Selecting columns

df["name"]             # one column β†’ Series
df[["name", "city"]]   # multi-column β†’ DataFrame

Filtering rows by condition

df[df["age"] > 30]
df[df["city"] == "NYC"]
df[df["name"].str.startswith("A")]

Combining conditions

df[(df["age"] > 30) & (df["city"] == "NYC")]   # AND  β€” use &
df[(df["age"] < 30) | (df["city"] == "LA")]    # OR   β€” use |
df[~(df["city"] == "Chicago")]                  # NOT  β€” use ~

Important: use & | ~, not and/or/not. And wrap each condition in parentheses.
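The difference isn't cosmetic: plain and actually raises an error. A toy frame makes it concrete:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "city": ["NYC", "NYC", "LA"]})

both = df[(df["age"] > 30) & (df["city"] == "NYC")]   # works: one row (age 35)

try:
    df[(df["age"] > 30) and (df["city"] == "NYC")]    # plain and: ValueError
    err = None
except ValueError as exc:
    err = str(exc)   # "The truth value of a Series is ambiguous..."
```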

.loc β€” label-based selection

df.loc[0]                       # first row
df.loc[0:3, "name"]             # rows 0..3 (inclusive!), column 'name'
df.loc[df["age"] > 30, "name"]  # name column for the over-30s

.iloc β€” position-based selection

df.iloc[0]            # first row by position
df.iloc[:5, 0:2]      # first 5 rows, first 2 columns
df.iloc[-1]           # last row

The "is in this list" filter

df[df["city"].isin(["NYC", "LA"])]
df[~df["city"].isin(["NYC", "LA"])]   # everything not in the list

Walkthrough: pulling apart a dataset

The data

df = pd.read_csv("orders.csv")
df.columns
# Index(['order_id','date','region','customer','amount','status'])

Big orders in the North region

big_north = df[(df["region"] == "North") & (df["amount"] > 1000)]

Just two columns of those

big_north[["date", "amount"]]

Or in one go with .loc

df.loc[(df["region"]=="North") & (df["amount"]>1000),
       ["date", "amount"]]

Key takeaways

Drill

From a DataFrame of employees: select everyone in Sales with a salary over $80k, returning only the name and salary columns.

09.3 β€” Sorting, counting, and ranking

sort_values, value_counts, rank β€” the simple stuff you'll use a hundred times a day.

What you'll learn

Sorting

df.sort_values("amount", ascending=False).head(10)
df.sort_values(["region", "amount"], ascending=[True, False])
df.sort_index()           # by row index

Counting

df["region"].value_counts()           # counts per region
df["region"].value_counts(normalize=True)   # as percentages
df["region"].nunique()                # distinct count

Ranking

df["rank_by_amount"] = df["amount"].rank(ascending=False)
df["pct_rank"] = df["amount"].rank(pct=True)

Top / bottom N

df.nlargest(10, "amount")
df.nsmallest(5, "amount")

Key takeaways

Bestseller list

From an orders DataFrame, print the top 5 products by total revenue.

09.4 β€” Aggregation β€” sum, mean, agg

How to summarise. Series methods, DataFrame methods, and .agg().

What you'll learn

Single-column aggregates

df["amount"].sum()
df["amount"].mean()
df["amount"].median()
df["amount"].std()
df["amount"].max()
df["amount"].min()
df["amount"].count()
df["amount"].quantile(0.95)

Whole-DataFrame aggregates

df.sum(numeric_only=True)
df.mean(numeric_only=True)
df.describe()

.agg() β€” many at once

df["amount"].agg(["sum", "mean", "min", "max"])

df.agg({
    "amount":   ["sum", "mean"],
    "quantity": ["sum", "max"],
})

Custom aggregations

def iqr(s):
    return s.quantile(0.75) - s.quantile(0.25)

df["amount"].agg(["mean", iqr])

Key takeaways

Summary box

For a numeric column of your choosing, print mean, median, min, max, and 95th percentile in a single .agg() call.

09.5 β€” groupby β€” the move that pays for the course

Split β†’ apply β†’ combine. Replace your SUMIFS forever.

What you'll learn

The pattern

# Total revenue per region β€” one line
df.groupby("region")["amount"].sum()

That's it. That replaces a column of SUMIFS formulas, plus the formula maintenance, plus the broken refs when rows shift.

Multiple aggregations

df.groupby("region")["amount"].agg(["sum", "mean", "count"])

df.groupby("region").agg(
    total_revenue = ("amount",   "sum"),
    avg_order     = ("amount",   "mean"),
    n_orders      = ("order_id", "count"),
)

The second form (named aggregations) is the cleanest β€” output columns are named exactly what you want.
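A self-contained example of the named form, using a hypothetical five-order table:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "region":   ["North", "North", "South", "South", "South"],
    "amount":   [100.0, 200.0, 50.0, 75.0, 25.0],
})

summary = orders.groupby("region").agg(
    total_revenue = ("amount",   "sum"),
    avg_order     = ("amount",   "mean"),
    n_orders      = ("order_id", "count"),
)
# rows indexed by region; columns named exactly as above
```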

Multiple group keys

df.groupby(["region", "quarter"])["amount"].sum()

Get a clean DataFrame back

By default, groupby returns a Series or a DataFrame with a hierarchical index. Add .reset_index() for a flat one:

summary = (df.groupby(["region", "quarter"])["amount"]
             .sum()
             .reset_index())

Filtering within groups

# Keep only groups with more than 10 orders
df.groupby("region").filter(lambda g: len(g) > 10)

# Top 3 per group
df.sort_values("amount", ascending=False).groupby("region").head(3)

Walkthrough: the classic regional summary

Read

df = pd.read_csv("orders.csv", parse_dates=["date"])
df["quarter"] = df["date"].dt.to_period("Q").astype(str)

Summary

summary = (df.groupby(["region", "quarter"])
             .agg(total = ("amount",   "sum"),
                  count = ("order_id", "count"),
                  avg   = ("amount",   "mean"))
             .reset_index())

print(summary.head(10))

Save

summary.to_excel("regional_summary.xlsx", index=False)
πŸ’‘ Mental shortcut
"Group by X then aggregate" = the moment you'd reach for a pivot table in Excel.

Key takeaways

Per-customer summary

From an orders DataFrame, build a per-customer summary with: total spend, number of orders, average order value.

09.6 β€” Pivot tables β€” pivot_table()

The Excel pivot table, one function call.

What you'll learn

The basic pivot

df.pivot_table(index="region",
               columns="quarter",
               values="amount",
               aggfunc="sum")

Rows = region. Columns = quarter. Cells = total amount.

Multiple aggregations

df.pivot_table(index="region",
               columns="quarter",
               values="amount",
               aggfunc=["sum", "mean", "count"])

Multiple value columns

df.pivot_table(index="region",
               values=["amount", "quantity"],
               aggfunc={"amount": "sum", "quantity": "sum"})

Fill empty cells

df.pivot_table(index="region", columns="quarter",
               values="amount", aggfunc="sum", fill_value=0)

With margins (totals row + column)

df.pivot_table(index="region", columns="quarter",
               values="amount", aggfunc="sum",
               margins=True, margins_name="Total")

crosstab β€” quick frequency pivot

pd.crosstab(df["region"], df["status"])
pd.crosstab(df["region"], df["status"], normalize="index")   # row percentages

Key takeaways

Sales by product Γ— month

Build a pivot of total sales with products as rows and months as columns. Add a totals row and column.

09.7 β€” Joining and merging β€” pandas's VLOOKUP

merge, concat, join β€” three ways to combine DataFrames.

What you'll learn

The basic merge β€” like XLOOKUP

orders    = pd.read_csv("orders.csv")        # has customer_id
customers = pd.read_csv("customers.csv")     # has customer_id and name, city, tier

combined = orders.merge(customers, on="customer_id")

Different column names on each side

orders.merge(customers, left_on="cust_id", right_on="id")

Join types

How                     What you get
how="inner" (default)   Only rows present on both sides.
how="left"              All rows from the left; matching info from the right (NaN if no match).
how="right"             All rows from the right.
how="outer"             Every row from either side; NaN where the other side has no match.

combined = orders.merge(customers, on="customer_id", how="left")

This is the "VLOOKUP that doesn't lose rows when there's no match" β€” your default for enrichment joins.

Checking the join with indicator

combined = orders.merge(customers, on="customer_id",
                        how="left", indicator=True)
combined["_merge"].value_counts()
# left_only   42    ← orders with no matching customer (a red flag)
# both       958
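The same check on a toy pair of tables, where one order points at a customer that doesn't exist:

```python
import pandas as pd

orders    = pd.DataFrame({"customer_id": [1, 1, 2, 99],
                          "amount":      [10, 20, 30, 40]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "tier":        ["Gold", "Silver"]})

combined = orders.merge(customers, on="customer_id",
                        how="left", indicator=True)
counts = combined["_merge"].value_counts()
# three orders matched ("both"); customer 99 shows up as "left_only"
```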

Stacking with concat

jan = pd.read_csv("jan.csv")
feb = pd.read_csv("feb.csv")
quarter = pd.concat([jan, feb], ignore_index=True)

Walkthrough: enrich orders with customer info

Load

orders    = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

Join

enriched = orders.merge(customers, on="customer_id",
                        how="left", indicator=True)

Sanity-check the join

print(enriched["_merge"].value_counts())
unmatched = enriched[enriched["_merge"] == "left_only"]
print(f"{len(unmatched)} orders without a matching customer")

Drop the indicator and save

enriched = enriched.drop(columns=["_merge"])
enriched.to_csv("orders_enriched.csv", index=False)

Key takeaways

Three-way join

Given orders, customers, and products tables, build a fully enriched DataFrame containing customer name and product name on each order.

09.8 β€” Applying functions β€” apply, map, vectorisation

Run a function across every row or every cell, and when to skip it.

What you'll learn

The hierarchy: vectorised > apply > loop

The fastest pandas code uses built-in vectorised operations. Reach for apply when there's no built-in, and only loop manually as a last resort.

Vectorised β€” the goal

df["total"] = df["price"] * df["quantity"] * (1 + df["tax_rate"])

.map() β€” value-to-value lookup on a Series

tier_label = {"A": "Gold", "B": "Silver", "C": "Bronze"}
df["tier_name"] = df["tier"].map(tier_label)

.apply() on a Series

def classify(amount):
    if amount > 1000: return "Big"
    if amount > 100:  return "Medium"
    return "Small"

df["bucket"] = df["amount"].apply(classify)

.apply() on a DataFrame (row-wise)

def full_label(row):
    return f"{row['region']}/{row['product']}"

df["label"] = df.apply(full_label, axis=1)

The vectorised replacement

Almost anything you'd write with apply has a faster vectorised form:

# Slower
df["bucket"] = df["amount"].apply(classify)

# Faster
import numpy as np
df["bucket"] = np.select(
    [df["amount"] > 1000, df["amount"] > 100],
    ["Big", "Medium"],
    default="Small")
πŸ’‘ The lazy rule
Use apply when you're prototyping. Switch to vectorised when the data is large and performance matters.
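A quick check that the two forms agree on a toy column:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([50, 500, 5000])

def classify(amount):
    if amount > 1000: return "Big"
    if amount > 100:  return "Medium"
    return "Small"

via_apply  = amounts.apply(classify)
via_select = pd.Series(np.select(
    [amounts > 1000, amounts > 100],
    ["Big", "Medium"],
    default="Small"))
# both give Small / Medium / Big
```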

Key takeaways

Tier classification

Given a column of customer LTV values, add a tier column: Gold > 5000, Silver > 1000, Bronze otherwise. Solve with apply; then rewrite with np.select.

Module 10

Data Cleaning and Transformation

Fix dates that won't sort, names with stray spaces, half-empty columns, and every other mess that real data throws at you.

10.1 β€” Handling missing values

NaN, dropna, fillna β€” what's missing and what to do about it.

What you'll learn

Spotting missing values

df.isna().sum()              # missing count per column
df.isna().mean() * 100       # percentage per column
df[df["email"].isna()]       # rows where email is missing

Dropping

df.dropna()                              # drop any row with any NaN
df.dropna(subset=["email", "customer"])  # only drop if those cols are NaN
df.dropna(axis=1)                        # drop columns with NaN
df.dropna(thresh=5)                      # keep rows with β‰₯5 non-null values

Filling

df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna("Unknown")

# Fill different columns differently
df = df.fillna({"age": 0, "country": "Unknown", "score": df["score"].mean()})

Forward / back fill (time series)

df["price"] = df["price"].ffill()   # carry last known forward
df["price"] = df["price"].bfill()   # use the next known backward
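What the carry-forward looks like on a gappy series:

```python
import pandas as pd

price  = pd.Series([10.0, None, None, 12.0])
filled = price.ffill()
# the two gaps take the last known value: 10.0, 10.0, 10.0, 12.0
```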

Walkthrough: clean a survey export

Inspect first

df.isna().sum().sort_values(ascending=False).head(10)

Drop unusable rows

df = df.dropna(subset=["respondent_id", "submitted_at"])

Fill the rest with sensible defaults

df["age"] = df["age"].fillna(df["age"].median())
df["region"] = df["region"].fillna("Unknown")
df["score"] = df["score"].fillna(0)

Key takeaways

Missing-value audit

Load any messy CSV. Print a table of column name, missing count, and missing percentage, sorted by worst first.

10.2 β€” Finding and removing duplicates

duplicated, drop_duplicates, and how to decide which copy to keep.

What you'll learn

Find them

df.duplicated().sum()                          # how many full-row dupes
df.duplicated(subset=["email"]).sum()          # dupes by email
df[df.duplicated(subset=["email"], keep=False)]   # show all dupe groups

Drop them

df.drop_duplicates()                          # exact-row dupes
df.drop_duplicates(subset=["email"])          # by email, keep first
df.drop_duplicates(subset=["email"], keep="last")

Keep the "best" duplicate

If a customer has multiple rows and you want the most recent:

df = (df.sort_values("date")
        .drop_duplicates(subset=["email"], keep="last"))
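Worked on a tiny table: alice appears twice, and the later row should win:

```python
import pandas as pd

df = pd.DataFrame({
    "email":  ["alice@x.com", "alice@x.com", "bob@x.com"],
    "date":   ["2026-01-01",  "2026-02-01",  "2026-01-15"],
    "amount": [10, 20, 30],
})

latest = (df.sort_values("date")
            .drop_duplicates(subset=["email"], keep="last"))
# alice keeps her 2026-02-01 row (amount 20); bob keeps his only row
```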

Walkthrough: dedupe a contact list keeping the highest-value entry

Sort by what makes "best" mean something

df = df.sort_values("lifetime_value", ascending=False)

Drop, keeping the first (= the highest LTV)

df = df.drop_duplicates(subset=["email"], keep="first")

Key takeaways

Dedupe orders

Given an orders table, dedupe by order_id, keeping the row with the latest last_modified.

10.3 β€” String cleanup at scale

str.strip, str.lower, str.replace, regex β€” apply to a whole column.

What you'll learn

The .str accessor

df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["country"] = df["country"].str.upper()

Anywhere you'd call a string method on one string, you call .str.method() on a column.

Replace

df["amount_text"] = df["amount_text"].str.replace(",", "")
df["amount"] = df["amount_text"].astype(float)

Regex replace

# Keep only digits
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Strip parentheses and hyphens
df["phone"] = df["phone"].str.replace(r"[()\-\s]", "", regex=True)

Split a column

# "Alice Smith" β†’ first / last
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)

Extract with regex

# pull domain from email
df["domain"] = df["email"].str.extract(r"@(.+)$")

Walkthrough: clean a column of free-text phone numbers

Start

df["phone"]
# "(415) 555-1212"
# " 415.555.1213 "
# "415-555-1214 ext.500"

Normalise

df["phone_clean"] = (df["phone"].str.strip()
                                .str.replace(r"\D", "", regex=True)
                                .str[:10])     # keep first 10 digits
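The same pipeline, runnable on the three samples above:

```python
import pandas as pd

phones = pd.Series(["(415) 555-1212", " 415.555.1213 ", "415-555-1214 ext.500"])
clean = (phones.str.strip()
               .str.replace(r"\D", "", regex=True)   # drop every non-digit
               .str[:10])                            # keep first 10 digits
# all three normalise to bare 10-digit strings
```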

Key takeaways

Email cleanup

Given a column of messy emails, produce a clean lowercase column AND a new column containing just the domain.

10.4 β€” Working with dates in pandas

to_datetime, .dt accessor, date arithmetic, resampling.

What you'll learn

Parse first

df["date"] = pd.to_datetime(df["date"], errors="coerce")

errors="coerce" turns unparsable strings into NaT (missing date) instead of crashing.

Pull parts out

df["year"]   = df["date"].dt.year
df["month"]  = df["date"].dt.month
df["dow"]    = df["date"].dt.day_name()         # 'Saturday'
df["q"]      = df["date"].dt.to_period("Q")     # PeriodIndex 2026Q2
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()

Date arithmetic

df["days_open"] = (df["closed"] - df["opened"]).dt.days
df["due_date"] = df["created"] + pd.Timedelta(days=30)
df[df["date"] >= "2026-01-01"]

Resampling (group by time bucket)

monthly = (df.set_index("date")["amount"]
             .resample("M")
             .sum())

weekly  = df.set_index("date")["amount"].resample("W").mean()

Walkthrough: monthly revenue trend

Parse and group

df["date"] = pd.to_datetime(df["date"])
monthly = (df.groupby(df["date"].dt.to_period("M"))["amount"]
             .sum()
             .reset_index())
monthly["date"] = monthly["date"].astype(str)
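The same pattern, self-contained, on a hypothetical three-transaction table:

```python
import pandas as pd

df = pd.DataFrame({
    "date":   pd.to_datetime(["2026-01-05", "2026-01-20", "2026-02-03"]),
    "amount": [100, 200, 50],
})

monthly = (df.groupby(df["date"].dt.to_period("M"))["amount"]
             .sum()
             .reset_index())
monthly["date"] = monthly["date"].astype(str)
# two rows: 2026-01 totals 300, 2026-02 totals 50
```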

Save / plot

monthly.to_excel("monthly_revenue.xlsx", index=False)

Key takeaways

Weekly average

From a transactions DataFrame with date + amount, compute the average daily revenue per week of the year.

10.5 β€” Numeric cleanup β€” currencies, percentages, outliers

Strip $ and %, clip outliers, fix obvious typos.

What you'll learn

Currency strings β†’ floats

df["amount"] = (df["amount"].str.replace("[$,]", "", regex=True)
                            .astype(float))

Percent strings β†’ floats

# "12.5%" β†’ 0.125
df["margin"] = (df["margin"].str.replace("%", "")
                            .astype(float) / 100)

Capping outliers (winsorise)

lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount_capped"] = df["amount"].clip(lower=lo, upper=hi)

Removing outliers

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)
clean = df[mask]
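On a toy series, the IQR fence does exactly what you'd hope:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 500])    # 500 is the obvious outlier
q1, q3 = s.quantile([0.25, 0.75])
iqr  = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = s[mask]
# the five sensible values survive; 500 is dropped
```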

Catching obvious typos

# Negative ages, ages over 120, etc.
clean = df.query("0 < age < 120")

Key takeaways

Cap and report

Cap the top 1% of values in a numeric column. Print how many rows were affected.

10.6 β€” Reshape β€” melt and pivot

Turn wide to long and back. The shape-shifting move you'll thank yourself for.

What you'll learn

Wide vs long

Wide has one row per "thing" and many columns:

region   Q1    Q2    Q3    Q4
North   1200  1500  1300  1700
South    900  1100  1050   950

Long has one row per observation:

region   quarter  amount
North    Q1       1200
North    Q2       1500
North    Q3       1300
North    Q4       1700
South    Q1        900
...

Long is friendlier for grouping/plotting; wide is friendlier for humans.

Wide β†’ long with melt

long = df.melt(id_vars="region",
               value_vars=["Q1","Q2","Q3","Q4"],
               var_name="quarter",
               value_name="amount")

Long β†’ wide with pivot

wide = long.pivot(index="region", columns="quarter", values="amount")
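The round trip on a two-region example:

```python
import pandas as pd

wide = pd.DataFrame({"region": ["North", "South"],
                     "Q1": [1200, 900],
                     "Q2": [1500, 1100]})

long = wide.melt(id_vars="region", value_vars=["Q1", "Q2"],
                 var_name="quarter", value_name="amount")

back = (long.pivot(index="region", columns="quarter", values="amount")
            .reset_index())
# back carries the same numbers as wide, one row per region again
```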

When to reshape

Key takeaways

Quarterly to long

Take a wide DataFrame of quarterly figures by region. Melt it into long form. Then pivot it back.

10.7 β€” Categoricals and ordered types

Save memory and enable ordered comparisons.

What you'll learn

Why categoricals

If a column has 5 million rows but only 4 distinct values ("North", "South", "East", "West"), storing it as a categorical can cut its memory use by 90% and speed up groupby.

df["region"] = df["region"].astype("category")

Ordered categoricals

from pandas.api.types import CategoricalDtype
size_order = CategoricalDtype(categories=["XS","S","M","L","XL"], ordered=True)
df["size"] = df["size"].astype(size_order)

df.sort_values("size")          # sorts in size order, not alphabetical
df[df["size"] >= "L"]           # comparisons work

Memory check

df.memory_usage(deep=True)
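A before/after you can run anywhere. The four label strings are stored once instead of 100,000 times (the column and sizes here are made up):

```python
import pandas as pd

region = pd.Series(["North", "South", "East", "West"] * 25_000)

as_object   = region.memory_usage(deep=True)
as_category = region.astype("category").memory_usage(deep=True)
# the categorical copy is a small fraction of the object-dtype one
```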

Key takeaways

Memory diet

Take any DataFrame with a low-cardinality text column. Convert it to categorical. Print before/after memory usage.

Module 11

Data Visualization

Matplotlib basics, pandas plotting, seaborn for prettier defaults, and how to drop charts into Excel.

11.1 β€” Matplotlib basics

fig, ax, plot, title, save. The foundation everything else builds on.

What you'll learn

The basic shape

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot([1, 2, 3, 4], [10, 25, 20, 35])
ax.set_title("My first chart")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("chart.png", dpi=150)
plt.show()

Every chart you make will follow this skeleton: build a figure, draw onto axes, label, save/show.

Three chart types in three lines

ax.plot(x, y)           # line
ax.bar(categories, values)  # bar
ax.scatter(x, y)        # scatter

Style sample

ax.plot(x, y, color="#217346", linewidth=2, marker="o")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False)   # cleaner look

Walkthrough: monthly trend

Data

months = ["Jan","Feb","Mar","Apr","May","Jun"]
revenue = [12, 15, 18, 14, 22, 27]

Plot

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, revenue, marker="o", color="#2b6cb0", linewidth=2)
ax.set_title("Monthly Revenue (in thousands)")
ax.set_ylabel("$k")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False)
plt.tight_layout()
plt.savefig("monthly_revenue.png", dpi=150)

Key takeaways

Bar chart

Make a bar chart of the same monthly revenue data. Make Jan and Feb grey, the other months green.

11.2 — Plotting directly from a DataFrame

df.plot() — the fastest way from data to picture.

What you'll learn

The one-liner

df.set_index("date")["amount"].plot(figsize=(8, 4), title="Daily Revenue")

Multiple columns

df.set_index("date")[["amount", "forecast"]].plot()

Kinds

df["amount"].plot(kind="line")
df["amount"].plot(kind="hist", bins=30)
df["region"].value_counts().plot(kind="bar")
df["region"].value_counts().plot(kind="pie")    # use sparingly!
df.plot(kind="scatter", x="qty", y="amount")

Grouped bars from a pivot

pivot = df.pivot_table(index="region", columns="quarter",
                       values="amount", aggfunc="sum")
pivot.plot(kind="bar")

Stacked bars

pivot.plot(kind="bar", stacked=True)

Key takeaways

Side-by-side bars

Pivot orders by region (rows) and quarter (columns). Plot the result as a grouped bar chart.

11.3 — Seaborn — prettier defaults, less code

When you want a chart that looks publication-ready out of the box.

What you'll learn

Setup

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

Bar chart

sns.barplot(data=df, x="region", y="amount", estimator="sum", errorbar=None)

Scatter with colour-coding

sns.scatterplot(data=df, x="qty", y="amount", hue="region")

Distribution

sns.histplot(data=df, x="amount", hue="region", multiple="stack")

Heatmap

pivot = df.pivot_table(index="dow", columns="hour",
                        values="orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu", annot=False)

Faceted small-multiples

g = sns.FacetGrid(df, col="region", col_wrap=2)
g.map_dataframe(sns.lineplot, x="date", y="amount")
💡 Seaborn loves long data
Seaborn assumes your data is in long form (one row per observation). If your data is wide, melt it first (Module 10, Lesson 6).

Key takeaways

Hour-of-day heatmap

Given an orders table with date+hour, build a heatmap of total orders by day-of-week × hour-of-day.

11.4 — Formatting for stakeholders

Removing chart-junk, labelling totals, and saving at the right DPI.

What you'll learn

Less is more

ax.spines[["top","right"]].set_visible(False)
ax.grid(axis="y", alpha=0.3)
ax.set_xlabel("")           # X axis label often redundant
ax.tick_params(left=False)  # hide ticks

Labels on bars

bars = ax.bar(months, revenue, color="#217346")
for bar, value in zip(bars, revenue):
    ax.text(bar.get_x() + bar.get_width()/2,
            bar.get_height() + 0.5,
            f"${value:,.0f}",
            ha="center", va="bottom", fontsize=9)

Save

plt.tight_layout()
plt.savefig("revenue.png", dpi=200, bbox_inches="tight")

Colours for accessibility

About 1 in 12 people has red-green colour blindness. When using colour as a meaningful signal, prefer a colour-blind-safe palette:

sns.set_palette("colorblind")

Key takeaways

Polish a chart

Take your monthly-revenue bar chart from the matplotlib lesson. Remove the spines. Add dollar-value labels on top of each bar. Save at 200 dpi.

11.5 — Inserting charts into Excel

Three ways: paste, =PY(), and the openpyxl image insert.

What you'll learn

Option 1 — Paste a PNG manually

Save the chart, then in Excel: Insert → Pictures → from file. Fastest for one-offs.

Option 2 — =PY() chart

Covered in Module 8. Best for charts that should refresh when the underlying Excel data changes.

Option 3 — Programmatic with openpyxl

import pandas as pd
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image

# Generate the chart
df = pd.read_csv("monthly.csv")
fig, ax = plt.subplots(figsize=(7, 3))
df.plot(x="month", y="revenue", ax=ax, marker="o")
plt.tight_layout()
fig.savefig("monthly.png", dpi=150)
plt.close(fig)

# Insert into the workbook
df.to_excel("report.xlsx", index=False, sheet_name="Data")
wb = load_workbook("report.xlsx")
ws = wb.create_sheet("Chart")
img = Image("monthly.png")
ws.add_image(img, "B2")
wb.save("report.xlsx")

Walkthrough: monthly report builder

The plan

One script. Reads a CSV. Writes a 2-sheet Excel: a Data sheet and a Charts sheet with three charts.

The pattern

def make_chart(df, kind, filename, title):
    fig, ax = plt.subplots(figsize=(7, 3))
    df.plot(kind=kind, ax=ax, color="#217346")
    ax.set_title(title)
    plt.tight_layout()
    fig.savefig(filename, dpi=150)
    plt.close(fig)

Then call it three times and insert each via openpyxl.
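A sketch of the full assembly, under assumptions the lesson doesn't pin down: the CSV has month and revenue columns (hypothetical names), matplotlib's Agg backend is used so no display is needed, and Pillow is installed so openpyxl can embed PNGs. One possible wiring, not the course's definitive script:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")                      # render off-screen; no window needed
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image

def make_chart(series, kind, filename, title):
    fig, ax = plt.subplots(figsize=(7, 3))
    series.plot(kind=kind, ax=ax, color="#217346")
    ax.set_title(title)
    plt.tight_layout()
    fig.savefig(filename, dpi=150)
    plt.close(fig)

def build_report(df, out="report.xlsx"):
    # Data sheet first, then one chart image per sheet
    df.to_excel(out, index=False, sheet_name="Data")
    monthly = df.set_index("month")["revenue"]
    charts = [
        (monthly,       "line", "trend.png",    "Trend"),
        (monthly,       "bar",  "by_month.png", "By Month"),
        (df["revenue"], "hist", "dist.png",     "Distribution"),
    ]
    wb = load_workbook(out)
    for series, kind, png, title in charts:
        make_chart(series, kind, png, title)
        ws = wb.create_sheet(title)
        ws.add_image(Image(png), "B2")
    wb.save(out)
    return out
```

Re-running the script regenerates every PNG and sheet from the CSV, so the workbook never drifts out of date.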

Key takeaways

Multi-chart workbook

Build a script that produces an Excel report with a Data sheet, a Bar Chart sheet, and a Line Chart sheet β€” all from one CSV input.

11.6 — Tiny dashboards with subplots

Multiple charts on one figure β€” your first 'dashboard'.

What you'll learn

The subplots grid

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
ax_a, ax_b = axes[0]
ax_c, ax_d = axes[1]

ax_a.plot(x, y1); ax_a.set_title("Trend")
ax_b.bar(cats, vals); ax_b.set_title("By Category")
ax_c.scatter(x, y2); ax_c.set_title("Scatter")
ax_d.hist(values, bins=20); ax_d.set_title("Distribution")

plt.suptitle("Q1 Dashboard", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("dashboard.png", dpi=200)

Shared axes

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

Tip: give each panel breathing room

fig.subplots_adjust(hspace=0.4, wspace=0.3)

Key takeaways

Three-up dashboard

Make a 1×3 dashboard with: monthly revenue line, top-5 product bar, and a distribution histogram. Save as one PNG.

Module 12

Automating Boring Excel Tasks

Merge 100 files, rename sheets, format dozens of workbooks, send personalised emails — set it once and walk away.

12.1 — Merging many files into one

The everyday 'I have a folder of monthly reports' job.

What you'll learn

The script

from pathlib import Path
import pandas as pd

SRC = Path("monthly")
DST = Path("combined.xlsx")

frames = []
for path in sorted(SRC.glob("*.xlsx")):
    df = pd.read_excel(path)
    df["source_file"] = path.name
    frames.append(df)
    print(f"  loaded {path.name}: {len(df):,} rows")

combined = pd.concat(frames, ignore_index=True)
combined.to_excel(DST, index=False, sheet_name="All")
print(f"Wrote {len(combined):,} rows to {DST}")

Mixed CSV + XLSX

def read_any(p):
    if p.suffix == ".csv":
        return pd.read_csv(p)
    return pd.read_excel(p)

frames = [read_any(p) for p in SRC.iterdir() if p.suffix in (".csv", ".xlsx")]

Skip broken files

frames = []
errors = []
for path in SRC.glob("*"):
    try:
        df = read_any(path)
        df["source_file"] = path.name
        frames.append(df)
    except Exception as e:
        errors.append((path.name, str(e)))

print(f"Processed: {len(frames)},  Errors: {len(errors)}")
for name, msg in errors:
    print(f"  {name}: {msg}")

Key takeaways

Build a merger

Put 3-5 small CSVs in a folder. Write a script that merges them into one Excel file with a "source" column.

12.2 — Splitting one file into many

Group by a column and save one workbook per group.

What you'll learn

The pattern

from pathlib import Path
import pandas as pd

df = pd.read_excel("all_orders.xlsx")
out = Path("per_region"); out.mkdir(exist_ok=True)

for region, group in df.groupby("region"):
    safe = region.replace("/", "_")
    group.to_excel(out / f"{safe}.xlsx", index=False)
    print(f"  wrote {safe}.xlsx ({len(group):,} rows)")

One workbook with one sheet per group

with pd.ExcelWriter("by_region.xlsx") as writer:
    for region, group in df.groupby("region"):
        safe = region.replace("/", "_")[:31]   # Excel sheet name limit
        group.to_excel(writer, sheet_name=safe, index=False)
⚠️ Sheet name limit
Excel sheet names cap at 31 characters and can't contain : \ / ? * [ ]. Always sanitise.
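A small helper that applies both rules before writing (the name safe_sheet_name is ours, not pandas' or openpyxl's):

```python
import re

def safe_sheet_name(name):
    # Replace the characters Excel forbids, then enforce the 31-char cap
    cleaned = re.sub(r"[\\/?*:\[\]]", "_", str(name))
    return cleaned[:31] or "Sheet"
```

Run every group key through it before using it as a sheet_name and the ExcelWriter loop never crashes on an odd value.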

Key takeaways

Split a big workbook

Take a multi-region orders file and produce one workbook per salesperson, each containing only their rows.

12.3 — Filling a template workbook

Open a pre-formatted .xlsx, drop data into specific cells, save with a new name.

What you'll learn

Why templates beat from-scratch reports

Excel templates can have logos, frozen panes, formulas, colours, named ranges β€” all the spreadsheety polish. With openpyxl you can open the template, plug in values, and save a copy β€” keeping every bit of formatting.

The script

from openpyxl import load_workbook
from datetime import date
import pandas as pd

template = "templates/invoice.xlsx"
orders = pd.read_csv("orders_to_invoice.csv")

for _, row in orders.iterrows():
    wb = load_workbook(template)
    ws = wb.active

    ws["B2"] = f"Invoice {row['invoice_no']}"
    ws["B3"] = row["customer"]
    ws["B4"] = row["billing_address"]
    ws["B5"] = date.today().isoformat()
    ws["E10"] = row["quantity"]
    ws["E11"] = row["unit_price"]
    ws["E13"] = row["tax_rate"]
    # formulas in the template will recalc when the file is opened

    out = f"invoices/{row['invoice_no']}.xlsx"
    wb.save(out)
    print(f"  wrote {out}")

Use named ranges to make this readable

In Excel: Formulas → Define Name. Then in Python, openpyxl resolves the name to (sheet, coordinate) pairs:

for sheet_name, coord in wb.defined_names["CustomerName"].destinations:
    wb[sheet_name][coord] = "Acme Ltd"

(For most everyday cases, the ws["B3"] style is fine.)

Key takeaways

Receipt printer

Build a tiny Excel "receipt" template with a logo and a totals formula. Loop a small DataFrame and generate one receipt per row.

12.4 — Renaming and organising files in bulk

pathlib.rename, glob, regex — your weekend file-cleanup script.

What you'll learn

Rename pattern

from pathlib import Path
import re

folder = Path("downloads")
for p in folder.glob("Report_*.xlsx"):
    new_name = re.sub(r"Report_(\d{4})(\d{2})(\d{2})",
                      r"report-\1-\2-\3",
                      p.stem) + p.suffix
    print(f"{p.name}  →  {new_name}")
    # p.rename(p.with_name(new_name))   # uncomment when ready
⚠️ Always dry-run first
Print the planned renames in a first pass. Only enable the actual p.rename() after you've eyeballed the output.

Move into dated folders

from datetime import datetime
for p in folder.glob("*.csv"):
    mtime = datetime.fromtimestamp(p.stat().st_mtime)
    target = folder / mtime.strftime("%Y-%m")
    target.mkdir(exist_ok=True)
    p.rename(target / p.name)

Key takeaways

Downloads cleaner

Sketch a script (don't enable rename yet) that moves every .pdf in your Downloads folder into Downloads/PDFs/ and every .xlsx into Downloads/Spreadsheets/.

12.5 — Sending emails from Python

smtplib for the standard library way; better libraries for Gmail/Outlook.

What you'll learn

The standard library way

import smtplib
import os
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Monthly Report"
msg["From"] = "you@yourdomain.com"
msg["To"] = "boss@yourdomain.com"
msg.set_content("Hi — attached is the monthly report.\n\nThanks,\nYou")

# Attach a file
with open("report.xlsx", "rb") as f:
    msg.add_attachment(
        f.read(),
        maintype="application",
        subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        filename="report.xlsx",
    )

with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
    smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
    smtp.send_message(msg)
print("Sent.")
⚠️ Never paste passwords in code
Use environment variables (os.environ["EMAIL_PASS"]), a .env file loaded with python-dotenv, or a system secret manager. Never check credentials into version control.
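One way to fail fast when a variable is missing, instead of hitting a bare KeyError deep inside smtplib (require_env is a hypothetical helper, not a standard function):

```python
import os

def require_env(name):
    # Raise a readable error up front rather than a KeyError mid-send
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this script")
    return value
```

Call require_env("EMAIL_PASS") once at the top of the script; a scheduled run that's missing its credentials then logs one clear line instead of a traceback.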

Gmail and Outlook today

Big providers now require app-specific passwords or OAuth; your regular account password no longer works over plain SMTP. Generate an app password in your provider's security settings and pass it to smtp.login() exactly as above.

Personalised mass email

for _, row in customers.iterrows():
    msg = EmailMessage()
    msg["Subject"] = f"Your {row['month']} statement"
    msg["From"]    = "you@yourdomain.com"
    msg["To"]      = row["email"]
    msg.set_content(f"""Hi {row['first_name']},

Your statement for {row['month']} is attached.

Best,
You""")

    with open(row["attachment"], "rb") as f:
        msg.add_attachment(f.read(),
                           maintype="application",
                           subtype="pdf",
                           filename=Path(row["attachment"]).name)

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
        smtp.send_message(msg)

Key takeaways

Test send

Send yourself a test email with an attached CSV. Confirm it lands. Then do the same for three rows of a small DataFrame.

12.6 — Scheduling: run it automatically

Make a script run on its own — cron (Mac/Linux), Task Scheduler (Windows), or a Python scheduler library.

What you'll learn

Mac and Linux — cron

In Terminal: crontab -e. Add a line:

# minute hour day month weekday   command
0 8 * * 1-5  /opt/anaconda3/bin/python /Users/you/scripts/morning_report.py >> ~/cron.log 2>&1

This runs at 08:00 on weekdays.

Windows — Task Scheduler

  1. Start menu → Task Scheduler → Create Task.
  2. Triggers → Daily, 8:00 AM.
  3. Actions → Start a program → python.exe; arguments: C:\path\to\morning_report.py.
  4. Save.

The script must be self-sufficient

A scheduled script can't ask the user for input. Hard-code paths or read them from environment variables / a config file.
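A minimal sketch of that idea, assuming a JSON config file next to the script; the file name, keys, and environment variable names here are invented for illustration:

```python
import json
import os
from pathlib import Path

def load_config(path="config.json"):
    # Settings come from a JSON file when present, environment variables as
    # a fallback, and hard-coded defaults last, so the script never prompts.
    cfg = json.loads(Path(path).read_text()) if Path(path).exists() else {}
    cfg.setdefault("input_dir", os.environ.get("REPORT_INPUT", "data"))
    cfg.setdefault("output_dir", os.environ.get("REPORT_OUTPUT", "out"))
    return cfg
```

Changing a path then means editing config.json, not the script, which keeps the scheduled job stable.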

Logging — so you know it ran

import logging
from datetime import datetime

logging.basicConfig(
    filename="morning_report.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s",
)

start = datetime.now()
logging.info("Starting morning report")
# ... do work ...
logging.info(f"Completed in {(datetime.now()-start).total_seconds():.1f}s")

Python-native scheduling

For a long-running script that schedules itself:

import schedule, time
def job():
    print("Running daily report")
schedule.every().day.at("08:00").do(job)
while True:
    schedule.run_pending()
    time.sleep(60)

Useful inside Docker containers; for laptop use, cron / Task Scheduler is simpler.

Key takeaways

Schedule a 'hello'

Schedule a tiny Python script (one line: append today's date to a log file) to run daily. Confirm it ran tomorrow.

12.7 — An automation recipe book

Eight common boring-Excel jobs and the 20-line scripts that automate them.

What you'll learn

1 — Add today's date to every filename in a folder

from pathlib import Path
from datetime import date
stamp = date.today().isoformat()
for p in Path("inbox").glob("*.xlsx"):
    p.rename(p.with_name(f"{stamp}_{p.name}"))

2 — Find every sheet across a folder of workbooks containing a keyword

from openpyxl import load_workbook
matches = []
for p in Path("archive").rglob("*.xlsx"):
    wb = load_workbook(p, read_only=True, data_only=True)
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            if any(c == "URGENT" for c in row if isinstance(c, str)):
                matches.append((p.name, ws.title))
                break
    wb.close()   # read-only workbooks hold the file open until closed
print(matches)

3 — Roll a year of weekly reports into one DataFrame

import pandas as pd
frames = [pd.read_excel(p) for p in Path("weekly_2026").glob("week_*.xlsx")]
pd.concat(frames, ignore_index=True).to_excel("weekly_2026_combined.xlsx", index=False)

4 — Find rows that exist this month but didn't last month (new customers)

this_m = set(pd.read_excel("apr.xlsx")["email"])
last_m = set(pd.read_excel("mar.xlsx")["email"])
new = this_m - last_m
print(f"{len(new)} new customers")

5 — Convert all CSVs in a folder to Excel

for p in Path(".").glob("*.csv"):
    pd.read_csv(p).to_excel(p.with_suffix(".xlsx"), index=False)

6 — Rename a sheet across every workbook in a folder

for p in Path("reports").glob("*.xlsx"):
    wb = load_workbook(p)
    if "Sheet1" in wb.sheetnames:
        wb["Sheet1"].title = "Data"
        wb.save(p)

7 — Email the boss a CSV every Monday morning

# See Lesson 5 — combine with Lesson 6 scheduling.
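A sketch of what that scheduled script could contain. Splitting message-building from sending keeps the first half testable without a mail server; the addresses, subject, and file name are placeholders:

```python
import os
import smtplib
from email.message import EmailMessage
from pathlib import Path

def build_message(csv_path, sender, to_addr):
    # Assemble the email with the CSV attached
    msg = EmailMessage()
    msg["Subject"] = "Weekly report"
    msg["From"] = sender
    msg["To"] = to_addr
    msg.set_content("Hi, this week's CSV is attached.")
    msg.add_attachment(Path(csv_path).read_bytes(),
                       maintype="text", subtype="csv",
                       filename=Path(csv_path).name)
    return msg

def send(msg):
    # Same SMTP_SSL pattern as Lesson 5; run it every Monday via Lesson 6
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
        smtp.send_message(msg)
```

Point cron or Task Scheduler at this file and the Monday-morning CSV sends itself.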

8 — Build a daily PDF dashboard

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
df.groupby("region")["amount"].sum().plot(kind="bar", ax=ax)
ax.set_title(f"Revenue {date.today():%Y-%m-%d}")
fig.savefig(f"dashboards/{date.today():%Y-%m-%d}.pdf")

Key takeaways

Adapt one

Pick the recipe closest to a real task you do. Spend 30 minutes adapting it to your own data and folder.

Module 13

Scenario: Finance & Accounting

Reconciliations, variance analysis, budget consolidation, invoice processing, and the small daily wins that compound.

This module follows a fictional accountant — call her Priya — through a real month. Each lesson is a story: a problem on Priya's desk, the Excel-only way, and the Python solution that saves an hour. Copy each script. Adapt it to your numbers.

13.1 — User story: bank reconciliation

Match GL transactions to a bank statement. Find the unmatched.

What you'll learn

The story

Priya gets two files at month-end: gl_april.xlsx (the general ledger from the accounting system) and bank_april.csv (the bank statement). She has to find every transaction on one side that isn't on the other and investigate it.

In Excel: copy both into a workbook, build a helper column of date+amount, VLOOKUP across, scroll through #N/A results. Two hours.

In Python: 25 lines. Five minutes. Below.

The script

import pandas as pd
from pathlib import Path

gl   = pd.read_excel("gl_april.xlsx")
bank = pd.read_csv("bank_april.csv")

# Normalise both sides to the same key
def keyify(df, date_col, amount_col):
    df = df.copy()
    df["date"] = pd.to_datetime(df[date_col]).dt.date
    df["amount"] = df[amount_col].round(2)
    df["key"] = df["date"].astype(str) + "_" + df["amount"].astype(str)
    return df

gl_k   = keyify(gl,   "TxnDate", "Amount")
bank_k = keyify(bank, "PostDate", "Amount")

merged = gl_k.merge(bank_k, on="key", how="outer",
                    suffixes=("_gl", "_bank"), indicator=True)

unmatched_gl   = merged[merged["_merge"] == "left_only"]
unmatched_bank = merged[merged["_merge"] == "right_only"]

with pd.ExcelWriter("reconciliation_april.xlsx") as w:
    unmatched_gl.to_excel(w,   sheet_name="GL only",   index=False)
    unmatched_bank.to_excel(w, sheet_name="Bank only", index=False)
    merged[merged["_merge"]=="both"].to_excel(w, sheet_name="Matched", index=False)

print(f"GL only:   {len(unmatched_gl)}")
print(f"Bank only: {len(unmatched_bank)}")

Handling near-matches

If amounts can be off by a small fee, allow a tolerance:

def near_match(amt_a, amt_b, tol=0.50):
    return abs(amt_a - amt_b) <= tol

# For each unmatched GL row, search bank rows within ±3 days for a near-amount match.
for _, gl_row in unmatched_gl.iterrows():
    candidates = bank[
        (bank["PostDate"] >= gl_row["TxnDate"] - pd.Timedelta(days=3)) &
        (bank["PostDate"] <= gl_row["TxnDate"] + pd.Timedelta(days=3)) &
        (bank["Amount"].apply(lambda a: near_match(a, gl_row["Amount"])))
    ]
    if len(candidates) == 1:
        print(f"Possible fuzzy match for GL {gl_row['TxnId']} ↔ Bank {candidates.iloc[0]['BankRef']}")

Video walkthrough script (for recording)

Show both raw files in Excel

"Here's the GL. Here's the bank. Look — same month, but different columns and different date formats. Reconciling these manually is a nightmare."

Open Jupyter, load both

"pd.read_excel for the GL, pd.read_csv for the bank. Two lines."

Build the match key

"Both sides get a date and a 2-decimal amount, joined as a string. That's the merge key."

Outer merge with indicator

"This is the magic line. how='outer' keeps everything from both sides; indicator=True tells us which side each row came from."

Split into three sheets and save

"Matched, GL only, Bank only — each gets its own sheet. The file lands ready to email to the controller."

Key takeaways

Your own recon

Build a recon between two of your own CSVs. The match key can be whatever uniquely identifies a transaction in both systems (txn ID, date+amount, invoice number).

13.2 — User story: budget vs actual variance

Compare actuals to budget by department, flag the worrying lines.

What you'll learn

The story

Priya needs a one-page variance report for the CFO. Every department, every GL account, budget vs actual, with the worst variances at the top.

The script

import pandas as pd

budget = pd.read_excel("budget_2026.xlsx")    # cols: dept, account, month, budget
actual = pd.read_excel("actual_2026.xlsx")    # cols: dept, account, month, actual

merged = budget.merge(actual,
                      on=["dept", "account", "month"],
                      how="outer").fillna(0)

merged["variance"]     = merged["actual"] - merged["budget"]
merged["variance_pct"] = (merged["variance"] / merged["budget"].replace(0, float("nan"))) * 100

# Flag the worrying lines
def flag(row):
    if abs(row["variance"]) > 50_000:
        return "MATERIAL"
    if pd.notna(row["variance_pct"]) and abs(row["variance_pct"]) > 20:
        return "HIGH-PCT"
    return ""
merged["flag"] = merged.apply(flag, axis=1)

# Sort: flagged first ("MATERIAL", then "HIGH-PCT"), then by absolute variance
report = (merged.assign(abs_var=merged["variance"].abs())
                 .sort_values(["flag", "abs_var"], ascending=[False, False])
                 .drop(columns="abs_var"))

report.to_excel("variance_april.xlsx", index=False)

Make the output executive-friendly

from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill

wb = load_workbook("variance_april.xlsx")
ws = wb.active

# Headers: bold, green
for cell in ws[1]:
    cell.font = Font(bold=True, color="FFFFFF")
    cell.fill = PatternFill("solid", fgColor="217346")

# Highlight MATERIAL rows
red = PatternFill("solid", fgColor="FEE2E2")
for row in ws.iter_rows(min_row=2):
    if row[ws.max_column - 1].value == "MATERIAL":
        for cell in row:
            cell.fill = red

# Currency formats
for col in "DEF":   # budget, actual, variance (column G holds a percentage)
    for cell in ws[col][1:]:
        cell.number_format = '"$"#,##0;[Red]("$"#,##0)'

# Freeze pane + widths
ws.freeze_panes = "A2"
for col, w in zip("ABCDEFGH", [10, 12, 10, 14, 14, 14, 10, 14]):
    ws.column_dimensions[col].width = w

wb.save("variance_april.xlsx")

Walkthrough for video

Open both source files

"Budget. Actual. Same shape but different numbers."

One merge, one subtraction

"merge(on=[dept, account, month]) joins them. Two assignments compute variance and percent variance."

Apply the flag

"Anything over $50k absolute or 20% relative gets a label."

Save and style

"Save with pandas, restyle with openpyxl. Red rows for the lines that matter."

Key takeaways

Add a sparkline column

Extend the report: add a "trend" column with the last six months of actuals as a comma-separated list. (Hint: groupby dept+account, take last 6 months, agg into a list.)

13.3 — User story: consolidating subsidiary budgets

Six subsidiaries send six workbooks, slightly differently shaped. Combine them.

What you'll learn

The story

Six subsidiaries send a budget workbook every quarter. Each is slightly different — different column orders, different account naming, sometimes an extra sheet. Priya needs one consolidated rollup.

Step 1 — load and inspect

from pathlib import Path
import pandas as pd

files = list(Path("subs_q2").glob("*.xlsx"))
for p in files:
    print(p.name)
    df = pd.read_excel(p, sheet_name="Budget")
    print("  cols:", list(df.columns))
    print("  rows:", len(df))

Step 2 — write a per-file normaliser

COL_MAP = {
    # canonical_name : list of possible source names
    "subsidiary":  ["Sub", "Subsidiary", "Entity", "Company"],
    "account":     ["Account", "GL Account", "Ledger Acct", "AcctName"],
    "month":       ["Month", "Period", "MonthEnd"],
    "amount":      ["Amount", "Budget", "USD", "BudgetUSD"],
}

def rename(df):
    out = {}
    for canon, options in COL_MAP.items():
        for opt in options:
            if opt in df.columns:
                out[opt] = canon
                break
    return df.rename(columns=out)

def normalise(df, sub_default):
    df = rename(df)
    if "subsidiary" not in df.columns:
        df["subsidiary"] = sub_default
    df["month"] = pd.to_datetime(df["month"]).dt.to_period("M").astype(str)
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df[["subsidiary", "account", "month", "amount"]]

Step 3 — combine

frames = []
for p in files:
    raw = pd.read_excel(p, sheet_name="Budget")
    frames.append(normalise(raw, sub_default=p.stem))

combined = pd.concat(frames, ignore_index=True)
combined = combined.dropna(subset=["account", "amount"])

Step 4 — rollups

by_sub_account = (combined.groupby(["subsidiary", "account"])["amount"]
                            .sum()
                            .reset_index())

monthly_by_sub = combined.pivot_table(
    index="subsidiary", columns="month",
    values="amount", aggfunc="sum", fill_value=0,
    margins=True, margins_name="Total")

with pd.ExcelWriter("consolidated_q2.xlsx") as w:
    combined.to_excel(w, sheet_name="Raw", index=False)
    by_sub_account.to_excel(w, sheet_name="By Sub × Account", index=False)
    monthly_by_sub.to_excel(w, sheet_name="By Sub × Month")
💡 The COL_MAP pattern
A column-mapping dictionary makes new subsidiaries painless to onboard — just add their column names to the options list. The rest of the pipeline doesn't change.

Key takeaways

Add a sub

Add a seventh subsidiary's file with brand-new column names. Update COL_MAP. Re-run. Confirm nothing else needed to change.

13.4 — User story: invoice batch processing

Read a folder of supplier invoices (Excel), validate, post to the GL feed.

What you'll learn

The story

Suppliers send Excel invoices into a shared folder. Priya checks each, keys them into the GL. She wants this scripted.

Layout

Each invoice has a fixed structure: supplier in B2, invoice no in B3, date in B4, subtotal/tax/total in E10/E11/E12, line items in B6:D8.
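The header cells are read in the script below; the B6:D8 line items can be pulled the same way. A sketch, assuming the three columns hold description, quantity, and unit price (the layout above doesn't say, so treat those names as placeholders):

```python
from openpyxl import load_workbook

def read_line_items(ws):
    # ws["B6:D8"] yields three rows of three cells; skip rows left blank
    items = []
    for row in ws["B6:D8"]:
        desc, qty, price = (cell.value for cell in row)
        if desc is not None:
            items.append({"description": desc, "qty": qty, "unit_price": price})
    return items
```

Keeping the range logic in one helper means a template change only touches one place.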

The script

import pandas as pd
from openpyxl import load_workbook
from pathlib import Path

valid_suppliers = set(pd.read_excel("master_data.xlsx", sheet_name="Suppliers")["name"])
TAX_RATE = 0.08

rows, errors = [], []

for path in Path("inbox").glob("*.xlsx"):
    wb = load_workbook(path, data_only=True)
    ws = wb.active

    supplier = ws["B2"].value
    inv_no   = ws["B3"].value
    inv_date = ws["B4"].value
    subtotal = ws["E10"].value
    tax      = ws["E11"].value
    total    = ws["E12"].value

    # Validate
    if supplier not in valid_suppliers:
        errors.append((path.name, f"Unknown supplier: {supplier}")); continue
    if subtotal is None or tax is None or total is None:
        errors.append((path.name, "Missing totals")); continue
    expected_tax = round(subtotal * TAX_RATE, 2)
    if abs(tax - expected_tax) > 0.05:
        errors.append((path.name, f"Tax mismatch: {tax} vs expected {expected_tax}"))
        continue
    if abs(total - (subtotal + tax)) > 0.05:
        errors.append((path.name, "Total ≠ subtotal+tax")); continue

    rows.append({
        "invoice_no": inv_no,
        "supplier":   supplier,
        "date":       inv_date,
        "subtotal":   subtotal,
        "tax":        tax,
        "total":      total,
        "source":     path.name,
    })

gl_feed = pd.DataFrame(rows)
gl_feed.to_csv("gl_feed_april.csv", index=False)

if errors:
    pd.DataFrame(errors, columns=["file", "error"]).to_csv("invoice_errors.csv", index=False)
    print(f"⚠️  {len(errors)} errors — see invoice_errors.csv")

print(f"✅ Posted {len(gl_feed)} clean invoices")

Move processed files

from pathlib import Path
done = Path("processed"); done.mkdir(exist_ok=True)
for r in rows:
    (Path("inbox") / r["source"]).rename(done / r["source"])

Key takeaways

Add a duplicate check

Extend the script: refuse to post an invoice number that's already in the GL feed from previous runs.

13.5 — User story: accounts receivable aging

Bucket open invoices into 0-30, 31-60, 61-90, 90+.

What you'll learn

The story

Every Monday the controller asks for the AR aging. Priya wants it ready by 9 AM.

The script

import pandas as pd
from datetime import date

ar = pd.read_excel("ar_open.xlsx")        # cols: customer, invoice_no, due_date, balance

today = pd.Timestamp(date.today())
ar["due_date"] = pd.to_datetime(ar["due_date"])
ar["days_overdue"] = (today - ar["due_date"]).dt.days.clip(lower=0)

bins   = [-1, 0, 30, 60, 90, 99999]
labels = ["Current", "1-30", "31-60", "61-90", "90+"]
ar["bucket"] = pd.cut(ar["days_overdue"], bins=bins, labels=labels)

pivot = ar.pivot_table(index="customer", columns="bucket",
                        values="balance", aggfunc="sum",
                        fill_value=0, observed=True)
pivot["Total"] = pivot.sum(axis=1)
pivot = pivot.sort_values("Total", ascending=False)

pivot.to_excel("ar_aging.xlsx")

Highlight worst offenders

from openpyxl import load_workbook
from openpyxl.styles import PatternFill

wb = load_workbook("ar_aging.xlsx")
ws = wb.active

red = PatternFill("solid", fgColor="FECACA")
yellow = PatternFill("solid", fgColor="FEF3C7")

for row in ws.iter_rows(min_row=2):
    over90 = row[5].value or 0   # column F = "90+"
    over60 = row[4].value or 0
    if over90 > 0:
        for cell in row: cell.fill = red
    elif over60 > 0:
        for cell in row: cell.fill = yellow

wb.save("ar_aging.xlsx")

Key takeaways

Add aging by salesperson

Modify to also pivot by salesperson (one sheet per salesperson) so each can chase their own delinquents.

13.6 — User story: multi-currency consolidation

Convert a multi-currency P&L to USD using daily FX rates.

What you'll learn

The story

Priya consolidates P&L from offices in London (GBP), Tokyo (JPY), and Berlin (EUR). The CFO wants USD totals using the rate on the date of each transaction.

The data

tx = pd.read_excel("multi_ccy_tx.xlsx")
# date, office, account, amount, currency

rates = pd.read_csv("fx_rates_2026.csv")
# date, currency, rate_to_usd

The join

tx["date"] = pd.to_datetime(tx["date"])
rates["date"] = pd.to_datetime(rates["date"])

merged = tx.merge(rates, on=["date", "currency"], how="left")
missing = merged["rate_to_usd"].isna().sum()
print(f"Transactions with no rate: {missing}")

# If date doesn't match exactly (weekend), forward-fill from previous business day
if missing > 0:
    rates_indexed = rates.set_index(["currency", "date"]).sort_index()
    def lookup(row):
        try:
            sub = rates_indexed.loc[row["currency"]]
            avail = sub[sub.index <= row["date"]]
            return avail.iloc[-1]["rate_to_usd"] if len(avail) else None
        except KeyError:
            return None
    na = merged["rate_to_usd"].isna()
    merged.loc[na, "rate_to_usd"] = merged[na].apply(lookup, axis=1)

merged["usd_amount"] = (merged["amount"] * merged["rate_to_usd"]).round(2)

Rollup

usd_pnl = (merged.groupby(["office", "account"])["usd_amount"]
                  .sum()
                  .unstack(fill_value=0))

usd_pnl["Total"] = usd_pnl.sum(axis=1)
usd_pnl.loc["Total"] = usd_pnl.sum(numeric_only=True)
usd_pnl.to_excel("pnl_usd.xlsx")

Key takeaways

Sensitivity

Compute the same P&L using month-average rates instead of daily. By how much does the USD total differ? That difference is your FX translation risk.

13.7 — Project: the monthly close pack

Combine the previous lessons into one script that produces the entire close pack.

What you'll learn

The story

Month-end. Priya wants one script to run that produces the whole close pack: a cover sheet, reconciliation, variance, AR aging, FX-translated P&L. One file. One double-click.

The skeleton

"""monthly_close.py — generate the month-end close pack."""
from datetime import date
from pathlib import Path
import pandas as pd

# 1) Recon
def make_recon():
    # ... reuse code from Lesson 1
    return matched, gl_only, bank_only

# 2) Variance
def make_variance():
    # ... reuse code from Lesson 2
    return variance_df

# 3) AR aging
def make_aging():
    # ... reuse code from Lesson 5
    return aging_pivot

# 4) FX P&L
def make_pnl():
    # ... reuse code from Lesson 6
    return pnl_df

def make_cover(matched, gl_only, bank_only, variance, aging, pnl):
    return pd.DataFrame({
        "Metric": [
            "Period",
            "GL transactions",
            "Bank transactions",
            "GL only", "Bank only",
            "Material variances",
            "AR over 60 days",
            "Total revenue (USD)",
        ],
        "Value": [
            f"{date.today():%B %Y}",
            len(matched) + len(gl_only),
            len(matched) + len(bank_only),
            len(gl_only), len(bank_only),
            (variance["flag"] == "MATERIAL").sum(),
            aging["61-90"].sum() + aging["90+"].sum(),
            pnl.loc["Total", "Total"],
        ],
    })

def main():
    matched, gl_only, bank_only = make_recon()
    variance = make_variance()
    aging    = make_aging()
    pnl      = make_pnl()
    cover    = make_cover(matched, gl_only, bank_only, variance, aging, pnl)

    out = f"close_pack_{date.today():%Y-%m}.xlsx"
    with pd.ExcelWriter(out) as w:
        cover.to_excel(w,    sheet_name="Cover",    index=False)
        matched.to_excel(w,  sheet_name="Recon-Matched",  index=False)
        gl_only.to_excel(w,  sheet_name="Recon-GL only",  index=False)
        bank_only.to_excel(w, sheet_name="Recon-Bank only", index=False)
        variance.to_excel(w, sheet_name="Variance", index=False)
        aging.to_excel(w,    sheet_name="AR Aging")
        pnl.to_excel(w,      sheet_name="USD P&L")
    print(f"✅ Wrote {out}")

if __name__ == "__main__":
    main()

Schedule it

Set this script to run automatically on the 1st of every month (Module 12, Lesson 6). Wake up to a finished close pack in your inbox.

Key takeaways

Email it

Extend the script: after building the file, email it to the controller using the email recipe from Module 12, Lesson 5.

Module 14

Scenario: Data & Business Analyst

Cleaning huge exports, joining tables, building dashboards, ad-hoc analyses, and shipping the results.

Meet Marcus, a business analyst at a mid-sized retailer. His job is roughly 'turn data into decisions'. Here's how Python earns its keep across a typical week.

14.1 — User story: cleaning a 4-million-row export

When the CSV is too big for Excel itself.

What you'll learn

The story

Marcus gets a 4M-row CSV of transactions. Excel opens 1M and silently drops the rest. He needs all of it cleaned and pivoted.

Step 1 β€” peek before loading the whole thing

import pandas as pd
sample = pd.read_csv("transactions.csv", nrows=1000)
sample.info()                     # columns + dtypes
sample.head()

Step 2 β€” load with the right options

df = pd.read_csv("transactions.csv",
                 usecols=["date", "store_id", "sku", "qty", "revenue"],
                 dtype={"store_id": "category", "sku": "category"},
                 parse_dates=["date"])
print(df.shape)        # (4_000_000, 5)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

usecols drops unused columns at load time. dtype="category" on repeating strings can cut memory by roughly 90% on low-cardinality columns.
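You can verify the savings yourself on synthetic data (the exact percentage depends on string length and cardinality):

```python
import pandas as pd

# 100,000 rows but only 50 distinct store labels
s_obj = pd.Series(["STORE-%03d" % (i % 50) for i in range(100_000)])
s_cat = s_obj.astype("category")

print(f"object:   {s_obj.memory_usage(deep=True) / 1e6:.2f} MB")
print(f"category: {s_cat.memory_usage(deep=True) / 1e6:.2f} MB")
```

The category version stores each label once plus a small integer code per row, which is where the saving comes from.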

Step 3 β€” filter early

# Last 90 days only
recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=90)]
print(len(recent))   # ~1M

Step 4 β€” aggregate down to something Excel-friendly

by_store_sku = (recent.groupby(["store_id", "sku"], observed=True)
                       .agg(units=("qty", "sum"),
                            revenue=("revenue", "sum"))
                       .reset_index()
                       .sort_values("revenue", ascending=False))

by_store_sku.head(20).to_excel("top_skus_per_store.xlsx", index=False)

If even pandas struggles β€” chunked read

aggregates = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000, parse_dates=["date"]):
    chunk = chunk[chunk["date"] >= "2026-01-01"]
    aggregates.append(chunk.groupby("store_id")["revenue"].sum())

total = pd.concat(aggregates).groupby(level=0).sum()
print(total)

Key takeaways

Your own big file

Find a CSV bigger than 100k rows in your work. Load it with usecols and dtype. Print memory savings.

14.2 — User story: joining six tables to answer one question

Stars, snowflakes, and the merge dance.

What you'll learn

The story

"Who are our top customers in the Northeast, what did they buy last quarter, and what was their average margin?" Six tables: customers, orders, order_items, products, regions, costs.

Map it out first

customers ──(customer_id)── orders ──(order_id)── order_items ──(sku)── products ──(sku)── costs
customers ──(region_id)── regions

Load and check sizes

cu = pd.read_csv("customers.csv")
o  = pd.read_csv("orders.csv", parse_dates=["order_date"])
oi = pd.read_csv("order_items.csv")
p  = pd.read_csv("products.csv")
r  = pd.read_csv("regions.csv")
co = pd.read_csv("costs.csv")
for name, df in [("cu",cu),("o",o),("oi",oi),("p",p),("r",r),("co",co)]:
    print(name, df.shape)

Join one step at a time, with indicator

cu_r = cu.merge(r, on="region_id", how="left", indicator=True)
print(cu_r["_merge"].value_counts())   # all should be 'both'
cu_r = cu_r.drop(columns="_merge")

o_cu = o.merge(cu_r, on="customer_id", how="left", indicator=True)
print(o_cu["_merge"].value_counts())
o_cu = o_cu.drop(columns="_merge")

# Items per order β€” this is the row-multiplying join
items = oi.merge(o_cu, on="order_id", how="inner")
print("rows after items join:", len(items))

# Add product + cost
items = items.merge(p,  on="sku", how="left").merge(co, on="sku", how="left")
⚠️ The row explosion
If your join key has duplicates on both sides, you get a Cartesian product. Check len(after) ≤ len(before) × max_dupes. If not, debug.
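The explosion is easy to demonstrate on a toy pair of frames where the key is duplicated on both sides:

```python
import pandas as pd

# "A" is duplicated on BOTH sides, so the join multiplies rows: 2 x 2 = 4
left  = pd.DataFrame({"sku": ["A", "A"], "qty": [1, 2]})
right = pd.DataFrame({"sku": ["A", "A"], "cost": [5, 6]})

joined = left.merge(right, on="sku")
print(len(joined))  # 4
```

Every left "A" row pairs with every right "A" row, which is why a duplicated key on one side only (the lookup-table case) is safe but duplicates on both sides are not.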

Answer the question

last_q = items[items["order_date"] >= "2026-01-01"].copy()   # .copy() avoids SettingWithCopyWarning
last_q["revenue"] = last_q["qty"] * last_q["unit_price"]
last_q["margin"]  = last_q["revenue"] - last_q["qty"] * last_q["unit_cost"]

ne = last_q[last_q["region_name"] == "Northeast"]

top = (ne.groupby(["customer_id", "customer_name"])
         .agg(orders   = ("order_id", "nunique"),
              revenue  = ("revenue", "sum"),
              margin   = ("margin",  "sum"))
         .assign(margin_pct = lambda d: (d["margin"] / d["revenue"] * 100).round(1))
         .sort_values("revenue", ascending=False)
         .head(20)
         .reset_index())

top.to_excel("top20_ne_q1.xlsx", index=False)

Key takeaways

Reverse it

Same data, different question: which products have the highest margin% but the lowest sales volume? (Hidden gems to push.)

14.3 — User story: customer cohort retention

Group customers by signup month, track their retention over time.

What you'll learn

The story

"How are our newer customers doing compared to older ones?" The classic cohort retention chart.

Step 1 β€” label each customer

orders["order_month"]  = orders["order_date"].dt.to_period("M")
first_order_month = (orders.groupby("customer_id")["order_month"].min()
                            .rename("cohort"))
orders = orders.merge(first_order_month, on="customer_id")
orders["months_since_join"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)
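The .apply(lambda d: d.n) step works because subtracting two monthly Periods returns a month offset whose .n attribute is the whole-month gap:

```python
import pandas as pd

# Subtracting monthly Periods gives an offset; .n is the gap in months
gap = pd.Period("2026-04", freq="M") - pd.Period("2026-01", freq="M")
print(gap.n)  # 3
```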

Step 2 β€” build the cohort matrix

cohort_counts = (orders.groupby(["cohort", "months_since_join"])["customer_id"]
                        .nunique()
                        .unstack(fill_value=0))

cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0).round(3) * 100

Step 3 β€” heatmap

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(retention.iloc[:, :12],   # first 12 months
            annot=True, fmt=".0f",
            cmap="YlGn", cbar_kws={"label": "Retention %"},
            ax=ax)
ax.set_title("Cohort Retention — % of cohort active in each month after signup")
ax.set_xlabel("Months after signup")
ax.set_ylabel("Cohort (signup month)")
plt.tight_layout()
plt.savefig("cohort_retention.png", dpi=200)

Key takeaways

Revenue cohort

Change the analysis: instead of retention %, show average revenue per active customer by cohort × months-since-signup.

14.4 — User story: did the change work? A quick A/B test

Two groups, one metric, was the difference real or noise?

What you'll learn

The story

Marketing tried a new landing page on half the traffic. The new version has 4.2% conversion vs the old 3.8%. Is that a real win or just luck?

The data

import pandas as pd
df = pd.read_csv("ab_test.csv")   # visitor_id, group, converted (0/1)

print(df.groupby("group")["converted"].agg(["count","mean","sum"]))

Quick check with scipy

from scipy import stats

a = df.loc[df["group"]=="control",   "converted"]
b = df.loc[df["group"]=="treatment", "converted"]

result = stats.ttest_ind(b, a, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

What the p-value really says

p = the probability of seeing a difference at least this big if there were no real effect. p < 0.05 by convention means "this is unlikely to be noise." Not "definitely real." Not "important." Just "unlikely noise."
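You can see the definition in action with a quick simulation: give both groups the identical conversion rate and count how often random noise alone produces a lift as big as 0.4 pp (sample sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, base, sims = 20_000, 0.04, 10_000

# Both groups share the SAME 4% rate, so any measured lift is pure noise
a = rng.binomial(n, base, size=sims) / n
b = rng.binomial(n, base, size=sims) / n
lift = b - a

# How often does noise alone produce a 0.4 pp lift or more?
frac = np.mean(np.abs(lift) >= 0.004)
print(f"{frac:.3f} of no-effect experiments show a lift that big")
```

That fraction is, in effect, a simulated two-sided p-value for a 0.4 pp observed lift at these sample sizes.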

Confidence interval for the lift

import numpy as np
p_a, n_a = a.mean(), len(a)
p_b, n_b = b.mean(), len(b)
lift = p_b - p_a
se = (p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b) ** 0.5
ci = (lift - 1.96*se, lift + 1.96*se)
print(f"Lift: {lift*100:+.2f} pp ({ci[0]*100:+.2f}, {ci[1]*100:+.2f})")

The confidence interval is more useful than the p-value β€” it tells you the size of the effect, not just "is it real."

Quick visual

import matplotlib.pyplot as plt
groups = df.groupby("group")["converted"].mean()
errs = df.groupby("group")["converted"].apply(lambda s: 1.96 * (s.mean()*(1-s.mean())/len(s))**0.5)

fig, ax = plt.subplots(figsize=(5, 4))
groups.plot(kind="bar", yerr=errs, capsize=6, ax=ax, color=["#94a3b8","#217346"])
ax.set_title("Conversion rate by group (95% CI)")
ax.set_ylabel("Conversion %")
plt.tight_layout()
plt.savefig("ab_test.png", dpi=200)

Key takeaways

Underpowered?

Re-run the analysis with only the first 1,000 rows of each group. How does the CI change? You've just seen why we need lots of data for small effects.

14.5 — User story: trend, seasonality, and 'is this Tuesday weird?'

Decompose a time series. Spot anomalies.

What you'll learn

The story

Daily sessions on the website have a clear weekly rhythm. Marcus needs to spot days that are unusually high or low after accounting for that rhythm.

Decomposition

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("daily_sessions.csv", parse_dates=["date"]).set_index("date").sort_index()

decomp = seasonal_decompose(df["sessions"], model="additive", period=7)
decomp.plot()
plt.tight_layout()
plt.savefig("decomposition.png", dpi=150)

Anomaly flag

# Residual = actual - trend - seasonal
resid = decomp.resid.dropna()
std = resid.std()
df["expected"] = (decomp.trend + decomp.seasonal)
df["anomaly"]  = (resid.abs() > 2 * std)

anomalies = df[df["anomaly"]].sort_values("sessions", ascending=False)
print(anomalies)

Plot with anomalies highlighted

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df["sessions"], lw=0.8)
ax.plot(df.index, df["expected"], lw=1.2, color="#217346", label="expected")
mask = df["anomaly"].fillna(False)
ax.scatter(df.index[mask], df["sessions"][mask], color="red", s=20, zorder=3, label="anomaly")
ax.legend()
plt.tight_layout()
plt.savefig("anomalies.png", dpi=150)

Key takeaways

Find your anomalies

Take any daily metric you track (web sessions, support tickets, orders). Decompose it. List the anomalous days from the past 90.

14.6 — User story: a refreshing dashboard sheet

Build an Excel dashboard that auto-updates when the source data changes.

What you'll learn

The story

Marcus's manager wants a single Excel file she can open every morning and see: today's revenue, week-on-week, top 5 products, top 5 stores, anomaly count.

The structure

The =PY() cells

# Tile 1: Yesterday's revenue
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
y = df[df["date"].dt.date == (pd.Timestamp.today().date() - pd.Timedelta(days=1))]
f"${y['revenue'].sum():,.0f}"
)

# Tile 2: Week-on-week
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
this7 = df[df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=7)]["revenue"].sum()
prev7 = df[(df["date"] < pd.Timestamp.today() - pd.Timedelta(days=7)) &
           (df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=14))]["revenue"].sum()
f"{(this7/prev7 - 1)*100:+.1f}%"
)

# Tile 3: Top 5 products
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("product")["revenue"].sum().nlargest(5)
)

# Tile 4: Top 5 stores
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("store")["revenue"].sum().nlargest(5)
)

# Tile 5: Anomaly count
=PY(
df = xl("Data!A1:F100000", headers=True)
daily = df.groupby(pd.to_datetime(df["date"]).dt.date)["revenue"].sum()
mu, sd = daily.mean(), daily.std()
int(((daily - mu).abs() > 2*sd).sum())
)

Lay it out

Put the small tiles (revenue, WoW, anomaly count) in big bold cells across the top. Put the Top 5 lists in the middle. Add a single chart at the bottom: =PY() returning a line chart of the last 30 days.

💡 Refresh discipline
The dashboard auto-recalculates when the Data sheet changes. Tell stakeholders to edit the Data sheet and nothing else.

Key takeaways

Add a tile

Add a sixth tile: yesterday's conversion rate vs the trailing 30-day average. Format as "X.XX% (Y.Y pp vs avg)".

14.7 — User story: shipping the analysis — narrative report

Turn an analysis notebook into a one-page narrative the boss will actually read.

What you'll learn

The story

Marcus finished his analysis. Now he has to convince the VP to act on it.

The 'analyst memo' template

  1. TL;DR (3 sentences). What you found, what it means, what you recommend.
  2. The headline chart. One image that proves the TL;DR.
  3. How you measured it. Data, method, limits, in plain English.
  4. What to do next. Specific, time-bound, owned.

Auto-generate a PDF memo

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

with PdfPages("memo_q1.pdf") as pdf:
    # Page 1 β€” title + TL;DR
    fig, ax = plt.subplots(figsize=(8.5, 11))
    ax.axis("off")
    ax.text(0.05, 0.95, "Q1 Northeast Customers — Memo", fontsize=20, weight="bold", va="top")
    ax.text(0.05, 0.85,
            "TL;DR — The top 20 NE customers drove 62% of revenue but their\n"
            "margin% trails the national average by 6 points. We recommend\n"
            "a pricing review on the three SKUs that account for 80% of the gap.",
            fontsize=12, va="top")
    pdf.savefig(fig); plt.close(fig)

    # Page 2 β€” the chart
    fig, ax = plt.subplots(figsize=(8.5, 11))
    # ... build the chart
    pdf.savefig(fig); plt.close(fig)

Or generate a Word memo

from docx import Document
from docx.shared import Inches

doc = Document()
doc.add_heading("Q1 Northeast Customers — Memo", 0)
doc.add_paragraph("TL;DR — The top 20 NE customers drove 62% of revenue...")
doc.add_picture("ne_margins.png", width=Inches(6))
doc.add_heading("Recommendation", level=2)
doc.add_paragraph("Pricing review on SKUs 12345, 23456, 34567 by end of June.")
doc.save("memo_q1.docx")

Key takeaways

Write your own memo

Take your last analysis. Distill it into a TL;DR, one chart, and a recommendation. Hand it to a colleague and ask: "did this make me sound certain enough?"

Module 15

Scenario: HR & Operations

Headcount, payroll prep, timesheet cleanup, employee data hygiene, and the dozen small reports that pile up.

Meet Tomas, an HR coordinator. He's drowning in spreadsheets — exports from the HRIS, manager surveys, timesheet dumps. Here's how Python turns each one into a quiet five-minute job.

15.1 — User story: monthly headcount report

Active employees by department, with hires/leavers.

What you'll learn

The story

Tomas needs a monthly headcount file: active count by department on month-end, plus hires and leavers in the month, plus turnover %.

The data

import pandas as pd
emp = pd.read_excel("employees.xlsx")
# emp cols: employee_id, name, department, hire_date, term_date (NaT if active)
emp["hire_date"] = pd.to_datetime(emp["hire_date"])
emp["term_date"] = pd.to_datetime(emp["term_date"], errors="coerce")

Active as of any date

def active_as_of(df, asof):
    asof = pd.Timestamp(asof)
    return df[(df["hire_date"] <= asof) & ((df["term_date"].isna()) | (df["term_date"] > asof))]

eom = pd.Timestamp("2026-04-30")
active = active_as_of(emp, eom)
print(f"Headcount on {eom:%Y-%m-%d}: {len(active)}")
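A quick sanity check of the active_as_of rule on three made-up employees: one who left before the as-of date, one active, and one not yet hired:

```python
import pandas as pd

# Point-in-time rule: hired on or before the date, not yet terminated
def active_as_of(df, asof):
    asof = pd.Timestamp(asof)
    return df[(df["hire_date"] <= asof) &
              (df["term_date"].isna() | (df["term_date"] > asof))]

emp = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo"],
    "hire_date": pd.to_datetime(["2024-01-10", "2026-02-01", "2026-05-15"]),
    "term_date": pd.to_datetime(["2026-03-31", None, None]),   # None -> NaT (still employed)
})

# Ana left in March, Cleo starts in May, so only Ben is active on 30 April
print(active_as_of(emp, "2026-04-30")["name"].tolist())  # ['Ben']
```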

By department

hc_by_dept = active.groupby("department").size().rename("active").reset_index()
hc_by_dept = hc_by_dept.sort_values("active", ascending=False)

Hires and leavers in the month

month_start = pd.Timestamp("2026-04-01")
hires   = emp[(emp["hire_date"] >= month_start) & (emp["hire_date"] <= eom)]
leavers = emp[(emp["term_date"] >= month_start) & (emp["term_date"] <= eom)]

events = pd.DataFrame({
    "department": hc_by_dept["department"],
    "hires":   hires.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
    "leavers": leavers.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
})
events["turnover_pct"] = (events["leavers"] / hc_by_dept["active"] * 100).round(1)

summary = hc_by_dept.merge(events, on="department")
summary.loc["Total"] = [
    "Total", summary["active"].sum(), summary["hires"].sum(),
    summary["leavers"].sum(),
    round(summary["leavers"].sum() / summary["active"].sum() * 100, 1),
]
summary.to_excel(f"headcount_{eom:%Y-%m}.xlsx", index=False)

Key takeaways

Year-to-date version

Build a "YTD" version: rolling headcount at each month-end of the year, in one wide table.

15.2 — User story: payroll prep

Take the timesheet dump, validate it, output the payroll-ready file.

What you'll learn

The data

ts = pd.read_csv("timesheets.csv", parse_dates=["date"])
# columns: employee_id, date, hours, project

Validate

# No more than 16 hours in a day
bad = ts.groupby(["employee_id","date"])["hours"].sum()
bad = bad[bad > 16]
if len(bad):
    print(f"⚠️  {len(bad)} employee-days over 16 hours")
    print(bad)

Aggregate daily hours (overtime is computed per week, then rolled to the pay period)

daily = ts.groupby(["employee_id","date"])["hours"].sum().reset_index()

# Apply 40h/week overtime rule
daily["week"] = daily["date"].dt.to_period("W-SUN")
weekly = (daily.groupby(["employee_id","week"])["hours"]
                .sum()
                .reset_index())
weekly["regular_hrs"]  = weekly["hours"].clip(upper=40)
weekly["overtime_hrs"] = (weekly["hours"] - 40).clip(lower=0)
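The two clip calls do the regular/overtime split; a tiny example makes the arithmetic visible:

```python
import pandas as pd

# clip() caps regular hours at 40 and floors overtime at 0
hours = pd.Series([38.0, 40.0, 47.5])
regular  = hours.clip(upper=40)
overtime = (hours - 40).clip(lower=0)
print(regular.tolist(), overtime.tolist())  # [38.0, 40.0, 40.0] [0.0, 0.0, 7.5]
```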

Roll up to pay period

weekly["pay_period"] = weekly["week"].apply(lambda w: pd.Period(w.start_time, freq="2W-SUN"))
payroll = (weekly.groupby(["employee_id","pay_period"])
                  .agg(regular = ("regular_hrs", "sum"),
                       overtime= ("overtime_hrs", "sum"))
                  .reset_index())

payroll["pay_period"] = payroll["pay_period"].astype(str)
payroll.to_csv("payroll_import.csv", index=False)
print(f"Wrote payroll for {payroll['employee_id'].nunique()} employees")

Key takeaways

Pay rate

Merge the payroll output with an employees table containing hourly rates. Compute total pay (regularΓ—rate + overtimeΓ—rateΓ—1.5).

15.3 — User story: employee data hygiene

Names with stray spaces, missing managers, duplicate IDs.

What you'll learn

Profile first

def profile(df):
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean()*100).round(1),
        "distinct": df.nunique(),
    })

print(profile(emp))

Common HRIS dirty patterns

issues = []

# Stray whitespace
for col in ["first_name", "last_name", "email"]:
    bad = emp[col].astype(str).str.strip() != emp[col].astype(str)
    issues += [(rowid, col, "whitespace") for rowid in emp.index[bad]]

# Mixed case emails
bad = emp["email"].astype(str) != emp["email"].astype(str).str.lower()
issues += [(r, "email", "mixed case") for r in emp.index[bad]]

# Missing manager
issues += [(r, "manager_id", "missing") for r in emp.index[emp["manager_id"].isna()]]

# Manager who doesn't exist
mgr_ids = set(emp["employee_id"])
bad = ~emp["manager_id"].isin(mgr_ids) & emp["manager_id"].notna()
issues += [(r, "manager_id", "non-existent") for r in emp.index[bad]]

# Duplicate IDs
dup = emp[emp.duplicated("employee_id", keep=False)]
issues += [(r, "employee_id", "duplicate") for r in dup.index]

fix = pd.DataFrame(issues, columns=["row_index", "column", "issue"]).merge(
    emp[["employee_id", "first_name", "last_name"]],
    left_on="row_index", right_index=True
)
fix.to_excel("hris_fix_list.xlsx", index=False)
print(f"⚠️  {len(fix)} issues to fix")

Key takeaways

Email validity

Add a check: flag emails that don't match a basic name@domain.tld pattern. Hint: str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").

15.4 — User story: build the org chart programmatically

From a flat employee table to a parent-child hierarchy.

What you'll learn

The data

emp = pd.read_excel("employees.xlsx")[["employee_id","name","manager_id","title"]]

Build a parent-children map

from collections import defaultdict
children = defaultdict(list)
for _, r in emp.iterrows():
    if pd.notna(r["manager_id"]):
        children[r["manager_id"]].append(r["employee_id"])

names = dict(zip(emp["employee_id"], emp["name"] + " — " + emp["title"]))

def print_tree(emp_id, depth=0):
    print("  " * depth + "• " + names[emp_id])
    for kid in children[emp_id]:
        print_tree(kid, depth + 1)

# Find roots (no manager) and print
roots = emp[emp["manager_id"].isna()]["employee_id"]
for r in roots:
    print_tree(r)

Count direct + indirect reports

def total_reports(emp_id):
    direct = children[emp_id]
    return len(direct) + sum(total_reports(c) for c in direct)

emp["direct_reports"]   = emp["employee_id"].apply(lambda e: len(children[e]))
emp["total_reports"]    = emp["employee_id"].apply(total_reports)
emp.sort_values("total_reports", ascending=False).head(10)
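The recursion bottoms out because an employee with no entry in children gets an empty list from the defaultdict. A toy chain shows it counting indirect reports:

```python
from collections import defaultdict

# Toy hierarchy: 1 manages 2 and 3; 3 manages 4
children = defaultdict(list, {1: [2, 3], 3: [4]})

def total_reports(emp_id):
    direct = children[emp_id]
    return len(direct) + sum(total_reports(c) for c in direct)

print(total_reports(1))  # 3: two direct (2, 3) plus one indirect (4)
```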

Key takeaways

Span check

Flag managers with more than 12 direct reports (broad span) or fewer than 3 (too narrow).

15.5 — User story: engagement survey analysis

Turn 800 free-text + Likert responses into one summary file.

What you'll learn

The data

survey = pd.read_excel("engagement_2026.xlsx")
# cols: respondent_id, team, q1_likert (1-5), q2_likert ..., q10_open (free text)

Score the Likerts

likert_cols = [c for c in survey.columns if c.endswith("_likert")]

by_team = (survey.groupby("team")[likert_cols].mean().round(2))
by_team["overall"] = by_team.mean(axis=1).round(2)
by_team.to_excel("survey_scores_by_team.xlsx")

# Distribution per question
dist = survey.melt(id_vars="team", value_vars=likert_cols,
                    var_name="question", value_name="score")
dist_pct = (dist.groupby(["question","score"]).size()
                  .unstack(fill_value=0)
                  .pipe(lambda d: d.div(d.sum(axis=1), axis=0) * 100)
                  .round(1))
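The .pipe step above converts counts to row percentages, so each question's scores sum to 100. In isolation:

```python
import pandas as pd

# Row-wise percentages: divide each row by its own total
counts = pd.DataFrame({"1": [2, 0], "5": [8, 10]}, index=["q1", "q2"])
pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(pct)
```

axis=0 tells div to align the row totals with the frame's index, i.e. divide each row by its own sum rather than dividing column-wise.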

Free text: top words

import re
from collections import Counter

text = " ".join(survey["q10_open"].dropna().astype(str)).lower()
words = re.findall(r"[a-z']+", text)
stop = {"the","and","a","to","of","in","is","i","it","for","that","but","my","we","with","this","on","be","are"}
freq = Counter(w for w in words if w not in stop and len(w) > 3)
top20 = pd.DataFrame(freq.most_common(20), columns=["word","count"])

Output

with pd.ExcelWriter("survey_summary.xlsx") as w:
    by_team.to_excel(w, sheet_name="By Team")
    dist_pct.to_excel(w, sheet_name="Distributions")
    top20.to_excel(w, sheet_name="Top Words", index=False)

Key takeaways

Net Promoter Score

Add an NPS computation: % promoters (9-10) minus % detractors (0-6).

15.6 — User story: PTO accrual and balances

How much vacation does each employee have, with the accrual rules baked in?

What you'll learn

The rules (example)

Compute

from datetime import date
import pandas as pd

emp = pd.read_excel("employees.xlsx")
leave = pd.read_excel("leave_taken.xlsx")   # employee_id, days_taken_ytd

today = pd.Timestamp(date.today())
emp["tenure_yrs"] = ((today - pd.to_datetime(emp["hire_date"])).dt.days / 365.25)

def annual_rate(years):
    if years >= 10: return 25
    if years >= 5:  return 20
    return 15

emp["annual_rate"] = emp["tenure_yrs"].apply(annual_rate)
ytd_fraction = (today.dayofyear / 365.25)
emp["accrued_ytd"] = (emp["annual_rate"] * ytd_fraction).round(1)

emp = emp.merge(leave, on="employee_id", how="left")
emp["days_taken_ytd"] = emp["days_taken_ytd"].fillna(0)
emp["balance"] = (emp["accrued_ytd"] - emp["days_taken_ytd"]).round(1)

emp["over_cap"] = emp["balance"] > 30
print(f"{emp['over_cap'].sum()} employees are over the 30-day cap.")

emp[["employee_id","name","tenure_yrs","annual_rate","accrued_ytd","days_taken_ytd","balance","over_cap"]].to_excel("pto_balances.xlsx", index=False)

Key takeaways

Use-it-or-lose-it warning

Find employees who would lose more than 5 days at year-end if they don't take any more leave. Email each one a polite nudge using Module 12 Lesson 5.

15.7 — User story: new-hire onboarding tracker

Generate per-hire onboarding sheets from a master checklist.

What you'll learn

Per-hire onboarding workbook

from openpyxl import load_workbook
from pathlib import Path
import pandas as pd

template = "onboarding_template.xlsx"
hires = pd.read_excel("new_hires_this_week.xlsx")
items = pd.read_excel(template, sheet_name="ItemDefs")   # read the checklist once, not per hire
Path("onboarding").mkdir(exist_ok=True)                  # make sure the output folder exists

for _, h in hires.iterrows():
    wb = load_workbook(template)
    ws = wb["Checklist"]
    ws["B2"] = h["name"]
    ws["B3"] = h["start_date"]
    ws["B4"] = h["manager"]
    ws["B5"] = h["department"]

    # Auto-fill due dates from the item definitions
    for i, item in items.iterrows():
        row = 8 + i
        ws.cell(row=row, column=1, value=item["item"])
        ws.cell(row=row, column=2, value=pd.Timestamp(h["start_date"]) + pd.Timedelta(days=item["due_offset"]))

    wb.save(f"onboarding/{h['name'].replace(' ', '_')}.xlsx")

Master tracker — read every per-hire workbook, summarise

from pathlib import Path
rows = []
for p in Path("onboarding").glob("*.xlsx"):
    wb = load_workbook(p, data_only=True)
    ws = wb["Checklist"]
    name = ws["B2"].value
    for r in ws.iter_rows(min_row=8, values_only=True):
        item, due, done, owner = r[:4]
        if item:
            rows.append({"hire": name, "item": item, "due": due, "done": done, "owner": owner})

tracker = pd.DataFrame(rows)
tracker["overdue"] = (~tracker["done"].fillna(False).astype(bool)) & (tracker["due"] < pd.Timestamp.now())
print(f"Overdue items: {tracker['overdue'].sum()}")
tracker.to_excel("onboarding_master.xlsx", index=False)

Key takeaways

Slack/Teams reminder

Instead of email, post the daily summary into a Teams/Slack channel via incoming webhook. (Both have free webhook URLs.)

Module 16

Scenario: Sales & Marketing

Lead-list cleanup, CRM exports, campaign reporting, funnel analysis, and a weekly pipeline report you stop having to build.

Meet Lena, a sales-ops lead. Every Monday she rebuilds the same pipeline report. Every quarter she dedupes a giant lead list. Here's how Python earns its keep on each of these tasks.

16.1 — User story: dedupe a 50,000-lead list

Marketing pulled a list. Half of them are already in the CRM. Find the half that aren't.

What you'll learn

The data

import pandas as pd

new_list = pd.read_csv("conference_list.csv")        # 50,000 rows
existing = pd.read_csv("crm_export.csv")             # 220,000 rows

Normalise both sides

def norm_email(s):
    return str(s).strip().lower()

def norm_company(s):
    s = str(s).strip().lower()
    for suffix in [" inc", " inc.", " llc", " ltd", " ltd.", " gmbh", " co", " co."]:
        if s.endswith(suffix): s = s[:-len(suffix)]
    return s.strip()
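Spot-check the normaliser before trusting it on 50,000 rows; a few assertions catch regressions early (company names here are made up):

```python
# The same normaliser as above, checked on a few vendor spellings
def norm_company(s):
    s = str(s).strip().lower()
    for suffix in [" inc", " inc.", " llc", " ltd", " ltd.", " gmbh", " co", " co."]:
        if s.endswith(suffix):
            s = s[:-len(suffix)]
    return s.strip()

print(norm_company("  Acme Inc. "))   # acme
print(norm_company("Globex LLC"))     # globex
print(norm_company("Initech"))        # initech (no suffix, untouched)
```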

for df in (new_list, existing):
    df["email_n"]   = df["email"].apply(norm_email)
    df["company_n"] = df["company"].apply(norm_company)

Match on email first, company second

known_emails    = set(existing["email_n"])
known_companies = set(existing["company_n"])

new_list["match_by_email"]   = new_list["email_n"].isin(known_emails)
new_list["match_by_company"] = new_list["company_n"].isin(known_companies)
new_list["already_known"]    = new_list["match_by_email"] | new_list["match_by_company"]

net_new = new_list[~new_list["already_known"]].drop(columns=["match_by_email","match_by_company"])
print(f"Net-new leads: {len(net_new):,} of {len(new_list):,}")
net_new.to_excel("net_new_leads.xlsx", index=False)

Catch duplicates within the new list itself

# Assumes the list has a lead "score" column; keep the highest-scoring copy of each dupe
net_new = net_new.sort_values("score", ascending=False)
net_new = net_new.drop_duplicates(subset="email_n")
net_new = net_new.drop_duplicates(subset=["company_n", "last_name"])

Key takeaways

Domain match

Add a third match: if the email domain (after @) matches a known company, treat it as a likely dupe.

16.2 — User story: roll up the marketing campaign report

Spend, impressions, clicks, conversions, by campaign and channel.

What you'll learn

The data

google = pd.read_csv("google_ads.csv")
meta   = pd.read_csv("meta_ads.csv")
linked = pd.read_csv("linkedin_ads.csv")

Normalise each to a common schema

def normalise(df, channel, mapping):
    return (df.rename(columns=mapping)
              .assign(channel=channel)
              [["channel","campaign","date","spend","impressions","clicks","conversions","revenue"]])

g = normalise(google, "Google",   {"Campaign":"campaign","Day":"date","Cost":"spend",
                                    "Impr.":"impressions","Clicks":"clicks",
                                    "Conv.":"conversions","Conv. value":"revenue"})
m = normalise(meta,   "Meta",     {"Ad set name":"campaign","Reporting starts":"date",
                                    "Amount spent (USD)":"spend","Impressions":"impressions",
                                    "Link clicks":"clicks","Results":"conversions",
                                    "Purchase conversion value":"revenue"})
l = normalise(linked, "LinkedIn", {"Campaign Name":"campaign","Start Date":"date",
                                    "Total Spent":"spend","Impressions":"impressions",
                                    "Clicks":"clicks","Conversions":"conversions",
                                    "Conversion Value":"revenue"})

all_ads = pd.concat([g, m, l], ignore_index=True)
all_ads["date"] = pd.to_datetime(all_ads["date"])

Compute ratios

all_ads["CTR"]  = (all_ads["clicks"]      / all_ads["impressions"]).fillna(0)
all_ads["CPC"]  = (all_ads["spend"]       / all_ads["clicks"]).replace([float("inf")], 0).fillna(0)
all_ads["CPA"]  = (all_ads["spend"]       / all_ads["conversions"]).replace([float("inf")], 0).fillna(0)
all_ads["ROAS"] = (all_ads["revenue"]     / all_ads["spend"]).replace([float("inf")], 0).fillna(0)

Roll up

by_channel = (all_ads.groupby("channel")
                       .agg(spend=("spend","sum"),
                            revenue=("revenue","sum"),
                            conversions=("conversions","sum"))
                       .assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2),
                               CPA= lambda d: (d["spend"]/d["conversions"]).round(2))
                       .sort_values("ROAS", ascending=False))

by_campaign = (all_ads.groupby(["channel","campaign"])
                       .agg(spend=("spend","sum"),
                            revenue=("revenue","sum"),
                            conversions=("conversions","sum"))
                       .assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2))
                       .sort_values("ROAS", ascending=False)
                       .reset_index())

with pd.ExcelWriter("campaign_report.xlsx") as w:
    by_channel.to_excel(w, sheet_name="By Channel")
    by_campaign.to_excel(w, sheet_name="By Campaign", index=False)

Key takeaways

Channel ROI trend

Add a weekly line chart of ROAS by channel for the last 12 weeks.
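A possible starting point for this exercise. A tiny inline frame stands in for all_ads so the sketch runs on its own; with the real data, just skip the DataFrame construction:

```python
import pandas as pd

# Stand-in for the all_ads frame built earlier in this lesson.
all_ads = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-05", "2026-01-12", "2026-01-05", "2026-01-12"]),
    "channel": ["Google", "Google", "Meta", "Meta"],
    "spend":   [100.0, 200.0, 50.0, 50.0],
    "revenue": [300.0, 500.0, 100.0, 200.0],
})

# Keep the trailing 12 weeks, roll up to week x channel, compute ROAS.
recent = all_ads[all_ads["date"] >= all_ads["date"].max() - pd.Timedelta(weeks=12)]
weekly = (recent.groupby([pd.Grouper(key="date", freq="W"), "channel"])
                .agg(spend=("spend", "sum"), revenue=("revenue", "sum")))
weekly["ROAS"] = weekly["revenue"] / weekly["spend"]
roas = weekly["ROAS"].unstack("channel")   # one column per channel

# roas.plot() followed by plt.savefig("roas_trend.png") draws the line
# chart, the same way the funnel chart is saved later in this module.
```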

16.3 — User story: funnel analysis

From session → product view → add to cart → checkout → purchase.

What you'll learn

The data

events = pd.read_csv("events.csv")
# user_id, event_name, timestamp

Per-step counts

steps = ["session_start","product_view","add_to_cart","checkout","purchase"]

step_users = {}
for s in steps:
    step_users[s] = events.loc[events["event_name"]==s, "user_id"].nunique()

funnel = pd.DataFrame({"step": steps, "users": [step_users[s] for s in steps]})
funnel["pct_of_top"]   = (funnel["users"] / funnel["users"].iloc[0] * 100).round(1)
funnel["pct_of_prev"]  = (funnel["users"] / funnel["users"].shift().fillna(funnel["users"].iloc[0]) * 100).round(1)
print(funnel)

Strict sequential funnel (each step must have happened in order)

seq = events.sort_values(["user_id","timestamp"]).groupby("user_id")["event_name"].agg(list)

def reached(events_list, step_idx):
    """Did this user hit the funnel steps, in order, at least up to step_idx?"""
    pos = 0
    for ev in events_list:
        if pos < len(steps) and ev == steps[pos]:
            pos += 1
            if pos > step_idx:
                return True
    return False

reached_step = {s: int(seq.apply(lambda evs: reached(evs, i)).sum()) for i, s in enumerate(steps)}
print(reached_step)

Funnel chart

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(funnel["step"][::-1], funnel["users"][::-1], color="#2b6cb0")
for i, (s, n, pct) in enumerate(zip(funnel["step"][::-1], funnel["users"][::-1], funnel["pct_of_top"][::-1])):
    ax.text(n, i, f"  {n:,} ({pct}%)", va="center")
ax.set_title("Funnel — last 30 days")
plt.tight_layout()
plt.savefig("funnel.png", dpi=200)

Key takeaways

Segment the funnel

Compute the funnel separately for desktop vs mobile users. Where does mobile drop off more?
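A sketch under one assumption: events.csv also carries a device column ("desktop"/"mobile"). If yours doesn't, join it in from a sessions table first. An inline frame stands in for the real file:

```python
import pandas as pd

# Stand-in for events.csv, with the assumed `device` column.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2, 3],
    "event_name": ["session_start", "purchase", "session_start",
                   "add_to_cart", "purchase", "session_start"],
    "device":     ["mobile", "mobile", "desktop", "desktop", "desktop", "mobile"],
})
steps = ["session_start", "add_to_cart", "purchase"]

# One per-step funnel per device, same recipe as the unsegmented version.
funnels = {}
for device, grp in events.groupby("device"):
    users = [grp.loc[grp["event_name"] == s, "user_id"].nunique() for s in steps]
    f = pd.DataFrame({"step": steps, "users": users})
    f["pct_of_top"] = (f["users"] / f["users"].iloc[0] * 100).round(1)
    funnels[device] = f
```

Comparing funnels["mobile"]["pct_of_top"] against the desktop column shows where mobile drops off harder.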

16.4 — User story: the Monday pipeline report

Total pipeline, weighted forecast, deals moving by stage.

What you'll learn

The data

deals = pd.read_csv("crm_pipeline.csv", parse_dates=["close_date","last_modified"])
# id, owner, account, amount, stage, close_date, last_modified

stage_prob = {
    "Discovery":   0.10,
    "Qualified":   0.20,
    "Proposal":    0.40,
    "Negotiation": 0.70,
    "Closing":     0.90,
}
deals["prob"]      = deals["stage"].map(stage_prob).fillna(0)
deals["weighted"]  = deals["amount"] * deals["prob"]

Pipeline by stage

by_stage = (deals.groupby("stage")
                  .agg(deals=("id","count"),
                       value=("amount","sum"),
                       weighted=("weighted","sum"))
                  .reindex(stage_prob.keys()))

Per-rep snapshot

by_rep = (deals.groupby("owner")
                .agg(deals=("id","count"),
                     pipeline=("amount","sum"),
                     forecast=("weighted","sum"))
                .sort_values("forecast", ascending=False))

Stuck deals

stuck = deals[(deals["last_modified"] < pd.Timestamp.now() - pd.Timedelta(days=30)) &
              (deals["stage"] != "Closed")]

One-page output

with pd.ExcelWriter("monday_pipeline.xlsx") as w:
    by_stage.to_excel(w, sheet_name="By Stage")
    by_rep.to_excel(w, sheet_name="By Rep")
    stuck.to_excel(w, sheet_name="Stuck Deals", index=False)
    pd.DataFrame({
        "Metric": ["Total pipeline","Weighted forecast","# stuck deals"],
        "Value":  [by_stage["value"].sum(), by_stage["weighted"].sum(), len(stuck)],
    }).to_excel(w, sheet_name="Cover", index=False)

Schedule it

Schedule the script to run at 6 AM every Monday (Module 12, Lesson 6). Email it to the VP of Sales (Module 12, Lesson 5). Now your "Monday morning pipeline report" job is done before you wake up.
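Building the message itself is stdlib-only; here is a hedged sketch (the addresses, subject, and server are placeholders, and actually sending it is Module 12's smtplib recipe):

```python
from email.message import EmailMessage
from pathlib import Path

def build_report_email(attachment: Path) -> EmailMessage:
    """Wrap the finished workbook in an email, ready to hand to smtplib."""
    msg = EmailMessage()
    msg["From"] = "reports@example.com"        # placeholder sender
    msg["To"] = "vp.sales@example.com"         # placeholder recipient
    msg["Subject"] = f"Monday pipeline report: {attachment.name}"
    msg.set_content("Morning! This week's pipeline pack is attached.")
    msg.add_attachment(attachment.read_bytes(),
                       maintype="application", subtype="octet-stream",
                       filename=attachment.name)
    return msg

# To send: smtplib.SMTP("smtp.example.com") in a `with` block, log in,
# then .send_message(msg), exactly as in Module 12, Lesson 5.
```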

Key takeaways

Forecast vs quota

Add a column to by_rep for quota; compute forecast / quota %, sort, flag < 70%.
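One way the exercise might come together (the rep names and quota numbers are invented; with the real by_rep table, start at the join):

```python
import pandas as pd

# Stand-in for the by_rep table built above.
by_rep = pd.DataFrame({"forecast": [120_000.0, 80_000.0, 30_000.0]},
                      index=["Ana", "Ben", "Cara"])
quota = pd.Series({"Ana": 100_000, "Ben": 150_000, "Cara": 50_000}, name="quota")

by_rep = by_rep.join(quota)
by_rep["attainment_pct"] = (by_rep["forecast"] / by_rep["quota"] * 100).round(1)
by_rep["flag"] = by_rep["attainment_pct"] < 70   # behind-pace reps
by_rep = by_rep.sort_values("attainment_pct", ascending=False)
```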

16.5 — User story: RFM segmentation

Score every customer on Recency, Frequency, Monetary value. Build a champions list.

What you'll learn

The data

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
# customer_id, order_date, amount

Compute R, F, M

asof = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency   = ("order_date", lambda d: (asof - d.max()).days),
    frequency = ("order_date", "count"),
    monetary  = ("amount",     "sum"),
)

Score 1-5 by quintile

# Recency: lower = better, so reverse the labels
rfm["R"] = pd.qcut(rfm["recency"],   5, labels=[5,4,3,2,1]).astype(int)
# Frequency is usually full of ties (lots of 1s and 2s), which makes qcut's
# bin edges collapse; duplicates="drop" plus explicit labels then raises a
# ValueError. Ranking first breaks the ties so five bins always exist.
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1,2,3,4,5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"],  5, labels=[1,2,3,4,5]).astype(int)
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)

Segment

def label(row):
    if row["R"] >= 4 and row["F"] >= 4 and row["M"] >= 4: return "Champion"
    if row["R"] >= 4 and row["F"] >= 3:                   return "Loyal"
    if row["R"] >= 4 and row["F"] <= 2:                   return "New"
    if row["R"] <= 2 and row["F"] >= 4:                   return "At Risk"
    if row["R"] <= 2 and row["F"] <= 2:                   return "Lost"
    return "Need Attention"

rfm["segment"] = rfm.apply(label, axis=1)
print(rfm["segment"].value_counts())

Output target lists

with pd.ExcelWriter("rfm_targets.xlsx") as w:
    for seg in rfm["segment"].unique():
        rfm[rfm["segment"]==seg].reset_index().to_excel(w, sheet_name=seg[:31], index=False)

Key takeaways

Send-list builder

Build a campaign email list of "At Risk" customers (R≤2, F≥4) with their last-order-amount as a personalisation token.
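A possible sketch of the send list, with tiny inline stand-ins for the rfm and orders frames so it runs on its own:

```python
import pandas as pd

# Stand-ins for the rfm scores and orders frame built above.
rfm = pd.DataFrame({"R": [1, 5, 2], "F": [5, 5, 1],
                    "segment": ["At Risk", "Champion", "Lost"]},
                   index=pd.Index([101, 102, 103], name="customer_id"))
orders = pd.DataFrame({"customer_id": [101, 101, 102],
                       "order_date": pd.to_datetime(["2026-01-01", "2026-02-01", "2026-02-05"]),
                       "amount": [40.0, 55.0, 90.0]})

# Last order amount per customer (the personalisation token).
last_order = (orders.sort_values("order_date")
                    .groupby("customer_id")["amount"].last()
                    .rename("last_order_amount"))

send_list = (rfm[(rfm["R"] <= 2) & (rfm["F"] >= 4)]
                .join(last_order)
                .reset_index()[["customer_id", "last_order_amount"]])
```

From here, send_list.to_csv(...) feeds straight into most email tools' merge-field upload.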

16.6 — User story: simple multi-touch attribution

Who gets credit when a customer touched four channels before buying?

What you'll learn

The data

touches = pd.read_csv("touches.csv", parse_dates=["ts"])
# user_id, ts, channel, conversion (bool), revenue

Build journeys

converters = touches.loc[touches["conversion"], "user_id"].unique()
j = (touches[touches["user_id"].isin(converters)]
        .sort_values(["user_id","ts"])
        .groupby("user_id")
        .agg(path=("channel", list),
             revenue=("revenue", "max")))

First-touch

j["first"] = j["path"].apply(lambda p: p[0])
first_touch = j.groupby("first")["revenue"].sum()

Last-touch

j["last"] = j["path"].apply(lambda p: p[-1])
last_touch = j.groupby("last")["revenue"].sum()

Linear (equal credit to every touch)

rows = []
for path, rev in zip(j["path"], j["revenue"]):
    share = rev / len(path)
    for c in path:
        rows.append({"channel": c, "credit": share})
linear = pd.DataFrame(rows).groupby("channel")["credit"].sum()

Compare

compare = pd.concat([first_touch.rename("first"),
                     last_touch.rename("last"),
                     linear.rename("linear")], axis=1).fillna(0).round(0)
print(compare.sort_values("linear", ascending=False))

Key takeaways

Position-based (40/20/40)

Implement a position-based model: 40% to first, 40% to last, 20% split among the middle touches.
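One hedged sketch of the model, per journey. Paths with only one or two touches have no middle, so a convention is needed; a 50/50 split for two-touch paths is an assumption here:

```python
def position_credit(path, revenue):
    """Return {channel: credit}: 40% first, 40% last, 20% across the middle."""
    if len(path) == 1:
        return {path[0]: revenue}
    credit = {}
    if len(path) == 2:
        # No middle touches: split 50/50 (a common convention, not a rule).
        credit[path[0]] = credit.get(path[0], 0) + revenue * 0.5
        credit[path[-1]] = credit.get(path[-1], 0) + revenue * 0.5
        return credit
    credit[path[0]] = credit.get(path[0], 0) + revenue * 0.4
    credit[path[-1]] = credit.get(path[-1], 0) + revenue * 0.4
    mid_share = revenue * 0.2 / (len(path) - 2)
    for c in path[1:-1]:
        credit[c] = credit.get(c, 0) + mid_share
    return credit
```

Summing these dicts over every row of j gives a `position` column to sit beside first, last, and linear in the compare table.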

16.7 — User story: sales territory rebalancing

Split a country into roughly equal account groups for new reps.

What you'll learn

The setup

Lena is hiring three new reps. She has to carve out three territories with roughly equal pipeline value.

Sum by geography

accts = pd.read_csv("accounts.csv")   # account_id, state, pipeline_value
state_value = accts.groupby("state")["pipeline_value"].sum().sort_values(ascending=False)
print(state_value)

Greedy assignment

n_reps = 3
territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}

for state, value in state_value.items():
    # assign to whichever rep currently has the lowest total
    lowest = min(territories, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value

for rep, info in territories.items():
    print(f"{rep}: ${info['value']:,.0f}  ({len(info['states'])} states)")

Output per-rep workbooks

for rep, info in territories.items():
    sub = accts[accts["state"].isin(info["states"])].sort_values("pipeline_value", ascending=False)
    sub.to_excel(f"territory_{rep}.xlsx", index=False)

How balanced did we end up?

values = [t["value"] for t in territories.values()]
print(f"min:{min(values):,.0f}  max:{max(values):,.0f}  spread:{(max(values)-min(values))/sum(values)*100:.1f}%")
📝 Beyond greedy
For real territory optimisation (drive time, equal account count, named-account constraints), look at scipy.optimize.linprog or the pulp library. Greedy is a great first cut.

Key takeaways

Add a constraint

Modify the script so each rep gets at most 12 states (forces re-balancing in dense areas).
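A sketch of the capped greedy pass. MAX_STATES is 2 here only to keep the demo tiny; set it to 12 for the exercise. If every rep hits the cap before the states run out, `eligible` comes up empty, so size the cap to cover all your states:

```python
MAX_STATES = 2   # use 12 for the real exercise
n_reps = 2

# Stand-in for the value-sorted state_value Series from above.
state_value = {"CA": 900, "TX": 800, "NY": 700, "FL": 600}

territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}

for state, value in state_value.items():
    # Only reps still under the cap are eligible for this state.
    eligible = [r for r in territories if len(territories[r]["states"]) < MAX_STATES]
    lowest = min(eligible, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value
```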

Module 17

Capstone Projects

Four end-to-end projects that combine everything in the course. Pick one, ship it, show it.

You've learned the moves. Now build the dance. Each capstone is a real-world project you can finish in 2-6 hours, share with colleagues, and put on your portfolio.

💡 You don't have to do all four
Pick the one closest to your job. Do that one well. Come back for the others when you need them.

17.1 — Capstone 1 — Monthly close pack (Finance)

End-to-end: load, reconcile, variance, AR aging, FX P&L, single Excel output, scheduled.

What you'll learn

Project brief

Build a single script monthly_close.py that, when run, produces close_pack_YYYY-MM.xlsx containing:

  1. A cover sheet with 6 KPIs.
  2. Bank reconciliation (matched, GL-only, bank-only).
  3. Budget vs actual variance (with material-variance flags).
  4. AR aging by customer × bucket.
  5. USD-translated P&L by office.

Schedule it to run automatically on the 1st of every month and email the result to your controller.

Suggested file layout

monthly_close/
├── monthly_close.py        ← the orchestrator
├── recon.py                ← reconciliation function
├── variance.py             ← variance function
├── aging.py
├── fx_pnl.py
├── style.py                ← shared openpyxl styling
├── inputs/                 ← raw files dropped here
└── outputs/                ← finished pack lands here
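One possible skeleton for the orchestrator. The recon/variance/aging/fx_pnl calls are left as comments because those modules are yours to write; the filename logic assumes the pack is named for the month just closed:

```python
from datetime import date
from pathlib import Path

def output_name(today=None):
    """close_pack_YYYY-MM.xlsx, named for the previous month (run on the 1st)."""
    today = today or date.today()
    year, month = (today.year, today.month - 1) if today.month > 1 else (today.year - 1, 12)
    return f"close_pack_{year}-{month:02d}.xlsx"

def main(inputs: Path, outputs: Path) -> Path:
    target = outputs / output_name()
    # Each piece returns a DataFrame; write them to one pd.ExcelWriter(target):
    #   recon.run(inputs), variance.run(inputs), aging.run(inputs), fx_pnl.run(inputs)
    return target
```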

Acceptance criteria

Stretch goals

What to show off

When it's done: a 90-second screen recording showing you double-click the script, the file appears, you open it, scroll the cover sheet, click into one detail tab. That's your portfolio piece.

17.2 — Capstone 2 — Self-refreshing dashboard (Analyst)

A daily-refresh dashboard from a database export, with anomaly detection.

What you'll learn

Project brief

Build daily_dashboard.py that runs every morning and produces a one-page Excel dashboard for your manager. Source data: a CSV dump from the data warehouse (or any large CSV you have access to).

Required content

  1. Big number tiles: yesterday's revenue, WoW change, MoM change, anomaly count.
  2. Top 5 products by revenue.
  3. Top 5 stores by revenue.
  4. 30-day revenue trend chart.
  5. Heatmap: hour × day-of-week.
  6. Anomalies tab: list of dates where the residual was > 2σ.
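The 2-sigma rule in item 6 can be sketched with the standard library alone. The trailing-mean model is an assumption; any fitted expectation plugs into the same residual-then-threshold shape:

```python
from statistics import mean, stdev

def flag_anomalies(revenue_by_day, window=7):
    """Indices of days whose residual vs the trailing mean exceeds 2 sigma."""
    residuals = []
    for i in range(window, len(revenue_by_day)):
        expected = mean(revenue_by_day[i - window:i])   # simple trailing model
        residuals.append(revenue_by_day[i] - expected)
    sigma = stdev(residuals)
    # residuals[j] belongs to day j + window
    return [j + window for j, r in enumerate(residuals) if abs(r) > 2 * sigma]
```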

Acceptance criteria

Stretch goals

What to show off

The dashboard your manager now refreshes themselves just by opening yesterday's file. The fact that they stopped asking you for it.

17.3 — Capstone 3 — Headcount + payroll prep + tracker (HR)

Three monthly HR jobs in one weekly script.

What you'll learn

Project brief

Build hr_weekly.py that produces three artifacts and routes each to the right stakeholder:

  1. headcount_YYYY-MM-DD.xlsx → HR Director.
  2. payroll_import_YYYY-MM-DD.csv → Payroll vendor SFTP (or just a folder).
  3. onboarding_status_YYYY-MM-DD.xlsx → People Manager.

Required inputs

Acceptance criteria

Stretch goals

What to show off

Three jobs collapsed to one calendar entry that runs while you sleep. Your Monday opens with a clean inbox.

17.4 — Capstone 4 — Pipeline + campaign report (Sales/Marketing)

A weekly Monday email that combines pipeline, marketing spend, and ROAS.

What you'll learn

Project brief

Build monday_gtm.py that runs every Sunday night and emails a 1-page PDF + Excel pack to the VP of Sales and the VP of Marketing.

The PDF cover page (auto-generated)

The Excel underneath

Acceptance criteria

Stretch goals

What to show off

An automated Monday-morning report that two VPs now rely on. The fact that they can't tell you didn't hand-make it.

17.5 — Where to go from here

Books, sites, communities, and the next skills worth learning.

What you'll learn

Books to read next

Sites worth bookmarking

Skills to add this quarter

  1. SQL. If pandas is the Excel-replacement, SQL is the "where the data actually lives" language. Two weekends of practice pays for itself for the rest of your career.
  2. Git basics. Version control for your scripts. Two hours. You'll never lose work again.
  3. One specific library deeper. Depending on your job: requests for APIs, beautifulsoup4 for scraping, plotly for interactive charts, streamlit for tiny web apps.

Communities

The most underrated next move

Pick one script you wrote during this course. Make it 50% better — add error handling, write a docstring for every function, add a small test. The difference between "I tinkered with Python" and "I write Python" is taste, and taste comes from polishing your own work.

And finally β€” keep this site bookmarked

This whole course is here whenever you need to look something up. New lessons get added when there's something worth saying. Send corrections and lesson requests via the contact page.

Key takeaways

Your last exercise

Pick one script you wrote during this course. Spend 30 minutes: add a docstring at the top, error handling on the I/O, and a print statement at the end that says how long it took. Save it somewhere safe. That's your first piece of personal infrastructure.


End of manual. © 2026 LearnPythonForExcel.com
The live, always-up-to-date version lives at learnpythonforexcel.com.