LearnPythonForExcel – Training Manual
The complete printable companion. 17 modules, 112 lessons, every walkthrough and exercise.
Save as PDF: In your browser, hit Ctrl+P (Windows) or Cmd+P (Mac), then choose "Save as PDF." Print-optimised styling kicks in automatically. Total length: roughly 200 pages.
Table of Contents
Module 01 – Introduction to Python
- What is Python, really?
- Why Python, why Excel, why now?
- How Python actually runs
- The Excel-to-Python mental model
- The tools you'll meet in this course
- How to use this course
Module 02 – Setting Up Your Environment
- Install Python (the easy way, with Anaconda)
- Install VS Code – your code editor
- The terminal – a friendly introduction
- Turn on Python inside Excel (=PY())
- A two-minute tour of Jupyter notebooks
- When things break – reading errors without panic
Module 03 – Variables and Data Types
- Variables – Python's version of a named cell
- Numbers – ints, floats, and the math operators
- Strings – text values and how to wrangle them
- Booleans and comparisons
- None, type conversions, and reading the type
- Talking to the user – input() and print()
- Comments – notes to your future self
Module 04 – Control Flow
- if / else – making decisions
- for loops – doing something for every item
- while loops – repeat until done
- List comprehensions – one-line loops
- try / except – handling errors gracefully
- Putting it together – a mini contact-cleaner
Module 05 – Functions
- def – defining your own function
- Returning multiple values & unpacking
- Scope – where variables live
- lambda, map, filter – the functional shortcuts
- Modules and imports – using other people's code
- Anatomy of a real script
Module 06 – Data Structures
- Lists – the everyday ordered collection
- Dictionaries – key-value lookups
- Tuples and when to use them
- Sets – unique values, fast
- Nested structures – lists of dicts, dicts of lists
- Working with dates
- Files and paths – the pathlib essentials
Module 07 – Working with Files
- Reading and writing CSV files
- Reading Excel files with pandas
- openpyxl – when you need cell-level control
- Looping through many files
- Text files, JSON, and other formats
- Encoding, BOMs, and other file gotchas
Module 08 – Python inside Excel – =PY()
- =PY() – the first ten formulas
- The xl() helper – getting Excel data into Python
- Pandas in =PY() – the killer combination
- Charts in =PY() with matplotlib and seaborn
- When =PY() is the right tool – and when it isn't
- Twelve =PY() recipes you'll use
- Limits, costs, and security of =PY()
Module 09 – Pandas Crash Course
- DataFrames – Python's spreadsheet
- Selecting rows and columns – loc and iloc
- Sorting, counting, and ranking
- Aggregation – sum, mean, agg
- groupby – the move that pays for the course
- Pivot tables – pivot_table()
- Joining and merging – pandas's VLOOKUP
- Applying functions – apply, map, vectorisation
Module 10 – Data Cleaning and Transformation
- Handling missing values
- Finding and removing duplicates
- String cleanup at scale
- Working with dates in pandas
- Numeric cleanup – currencies, percentages, outliers
- Reshape – melt and pivot
- Categoricals and ordered types
Module 11 – Data Visualization
- Matplotlib basics
- Plotting directly from a DataFrame
- Seaborn – prettier defaults, less code
- Formatting for stakeholders
- Inserting charts into Excel
- Tiny dashboards with subplots
Module 12 – Automating Boring Excel Tasks
- Merging many files into one
- Splitting one file into many
- Filling a template workbook
- Renaming and organising files in bulk
- Sending emails from Python
- Scheduling: run it automatically
- An automation recipe book
Module 13 – Scenario: Finance & Accounting
- User story: bank reconciliation
- User story: budget vs actual variance
- User story: consolidating subsidiary budgets
- User story: invoice batch processing
- User story: accounts receivable aging
- User story: multi-currency consolidation
- Project: the monthly close pack
Module 14 – Scenario: Data & Business Analyst
- User story: cleaning a 4-million-row export
- User story: joining six tables to answer one question
- User story: customer cohort retention
- User story: did the change work? A quick A/B test
- User story: trend, seasonality, and 'is this Tuesday weird?'
- User story: a refreshing dashboard sheet
- User story: shipping the analysis – narrative report
Module 15 – Scenario: HR & Operations
- User story: monthly headcount report
- User story: payroll prep
- User story: employee data hygiene
- User story: build the org chart programmatically
- User story: engagement survey analysis
- User story: PTO accrual and balances
- User story: new-hire onboarding tracker
Module 16 – Scenario: Sales & Marketing
- User story: dedupe a 50,000-lead list
- User story: roll up the marketing campaign report
- User story: funnel analysis
- User story: the Monday pipeline report
- User story: RFM segmentation
- User story: simple multi-touch attribution
- User story: sales territory rebalancing
Module 17 – Capstone Projects
- Capstone 1 – Monthly close pack (Finance)
- Capstone 2 – Self-refreshing dashboard (Analyst)
- Capstone 3 – Headcount + payroll prep + tracker (HR)
- Capstone 4 – Pipeline + campaign report (Sales/Marketing)
- Where to go from here
Module 01 – Introduction to Python
What Python is, why it matters for Excel users, what it can (and can't) do, and how to use this course.
This first module is short and friendly. You'll come out of it knowing what Python actually is, why it suddenly matters for spreadsheet users, and what kinds of problems it solves. No installation yet – we save that for Module 2.
💡 Before you start
You don't need to install anything for this module. Just read. Open the lessons in any order if you like. Each one is 5-10 minutes.
01.1 – What is Python, really?
A plain-English answer – no jargon – and why a spreadsheet user should care.
What you'll learn
- Explain in one sentence what Python is
- Name three things Python is famously good at
- Spot the difference between Python and a spreadsheet formula
Python in one sentence
Python is a way of writing instructions for your computer in language that looks a lot like English. That's it. Once you write the instructions down in a file, you tell the computer to run the file, and the computer does what you said.
Here's a complete, working piece of Python:
print("Hello, Excel friend!")
You can probably guess what that does before you've ever written a line of code: it prints the words "Hello, Excel friend!" to the screen. That readability is the whole point of Python.
Where Python sits in the world
Python is one of the two or three most-used programming languages on the planet. NASA uses it. Netflix uses it. Your bank uses it. The data science team at virtually every company uses it. And starting in 2023, Microsoft Excel itself uses it – there's a Python interpreter built right into the spreadsheet now, accessed through a formula called =PY().
That last bit is the reason this course exists. The world's most popular spreadsheet now ships with the world's most popular data language built in. If you've spent years living inside Excel, you've just been handed a superpower.
What is Python good at?
Roughly the kind of work an Excel power-user already does – and then a lot more:
- Working with data. Cleaning it, reshaping it, summarising it, joining tables, building charts. Python's pandas library is what most of the world's data analysts use.
- Automating repetitive computer tasks. Renaming a thousand files, merging a hundred spreadsheets, sending a personalised email to each row of a list.
- Getting data from places. Web pages, APIs, PDFs, databases, emails – Python can pull data from almost anything.
- Calculation and math. Anything from a budget reconciliation to a machine-learning model.
- Building things. Websites, mobile apps, games, scientific simulations. (We won't touch this side in this course – but it's there if you want it later.)
What's the difference between Python and an Excel formula?
Excel formulas live inside a cell. They look at the cells around them and compute a single value. They're a single-step calculator.
A piece of Python – usually called a script – lives in its own file. It can do a thousand things one after another: open a file, change some numbers, save the file, then open another file, then send an email. It's a recipe with as many steps as you want.
The new =PY() formula in Excel mashes the two together: you write a few lines of Python inside a spreadsheet cell and you get a value back, just like a formula. We cover that in Module 8.
🧠 The mental model
Think of Python as "what an Excel formula would be if it could do anything and weren't stuck in one cell."
What this course will (and won't) teach you
We focus 100% on Python for spreadsheet-style work. That means: data, files, automation, reports, charts, and the new =PY() in Excel. We don't cover web apps, game development, or computer science theory. If you want those later, you'll have plenty of Python background to pick them up easily.
Key takeaways
- Python is a programming language designed to be readable, like English.
- It's now built into Excel as the =PY() formula.
- It's especially good at data, automation, and pulling information from anywhere.
- You don't need to learn the whole language to be useful with it – this course teaches the parts that pay off fastest.
Try this β no setup required
- Open an online Python sandbox in another tab (search for "python online" – Google has one built in, or use replit.com).
- Type print("Hello, " + "world") and hit Run.
- Now change "world" to your name. Run it again.
- Congratulations – you've just written and run a Python program.
01.2 – Why Python, why Excel, why now?
The case for an Excel user learning Python in 2026 – five concrete reasons.
What you'll learn
- List five concrete advantages of using Python with Excel
- Identify two situations where Excel alone hits its limits
- Explain what changed in 2023 that made this course worth taking
The short version
Excel is brilliant for a lot of things – but it has limits. Python doesn't have most of those limits. And now you can use Python inside Excel, instead of as a separate tool. That's why this is suddenly a no-brainer for spreadsheet pros.
Five concrete reasons
1. Files that crash Excel are no problem for Python
The classic limit you've probably bumped into: Excel maxes out at 1,048,576 rows. You get a 4-million-row CSV from the data team, you open it in Excel, and… it shows you the first million. Python handles 4 million rows without breaking a sweat. 40 million? Fine.
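For a quick taste in code, here's a minimal sketch; the filename is made up for illustration:
import pandas as pd

# A hypothetical export far past Excel's row limit
df = pd.read_csv("huge_export.csv")
print(len(df))     # every row is loaded, not just the first million
print(df.head())   # peek at the first five rows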
2. The same task, done in one click instead of fifty
If you do the same Excel work every Monday – open a file, clean three columns, run a pivot, format a report, save it as a PDF, email it – Python turns that into one script you run once. The first time costs you an hour. Every week after costs you the time it takes to double-click an icon.
π‘ The "annual ROI" rule of thumb
A 30-minute weekly task is 26 hours a year. If automating it takes you four hours up front, you break even by Week 9 and save 22 hours every year after that. Most automations pay back faster than that.
3. Things Excel can't do at all
Excel cannot scrape a web page for you. It can't pull data from most APIs cleanly. It can't read most PDFs intelligently. It can't send a hundred personalised emails. Python does all four with a handful of lines.
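As a taste of the API case, here's a minimal sketch; the URL is a made-up placeholder, and you'd install the requests library first (pip install requests):
import requests

# Hypothetical endpoint that returns a JSON list of orders
response = requests.get("https://api.example.com/orders")
orders = response.json()   # now a plain Python list of dicts
print(len(orders), "orders fetched")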
4. Reusability – the script outlives the spreadsheet
A complex Excel workbook is fragile. Someone changes a column heading and three pivot tables break. A Python script for the same job is text – you can read it, version it, share it, fix it, and re-use it on next month's data without copy-pasting anything.
5. It's now inside Excel
In 2023 Microsoft added a built-in Python runtime to Excel (Microsoft 365). You write =PY() in a cell, type a few lines of Python, hit Enter, and the result lands in the spreadsheet like any other formula result. No installs, no separate tool, no copy-paste between apps. We dive into this in Module 8.
Where Excel is still the right answer
This is not a course that argues you should stop using Excel. Excel is unbeatable for:
- Interactive, click-and-explore data work.
- Sharing a single number-crunching artefact with non-technical colleagues.
- Quick what-if analysis where you tweak a value and watch totals update.
- Anything visual and small – a budget, a model, a tracker.
The right mental model: Excel is your workshop; Python is your power tools. Use both.
The big "what changed"
Three things changed in the past few years that make this course worth taking right now:
- 2023: Python landed inside Excel via =PY().
- The data world standardised on pandas. If you can use pandas, you can talk shop with any analyst in any company.
- AI assistants now write much of the boring code for you. You no longer need to memorise syntax. You need to know what's possible and how to read code – both of which this course teaches.
Key takeaways
- Python lets you blow past Excel's row limit and automate the repetitive stuff.
- It can do things Excel cannot – scrape sites, hit APIs, parse PDFs, send emails.
- It now lives inside Excel as =PY(), so you don't have to leave your spreadsheet.
- Use both: Excel for interactive work, Python for everything that should "just run."
Reflect for a minute
Write down, on paper or in a sticky note:
- One Excel task you do at least weekly.
- How long it takes each time.
- Multiply by 52. That's the time you'll get back per year by the time we're done.
Keep this note handy β you'll come back to it in Module 12 when you automate exactly this task.
01.3 – How Python actually runs
What 'running' Python means, the three places you'll run it, and the difference between a script and a notebook.
What you'll learn
- Describe what happens when you 'run' Python code
- Identify the three main places code gets run (REPL, script, notebook)
- Pick the right one for a given situation
What 'running' Python means
When you write Python code, it's just plain text in a file. The computer can't do anything with text on its own. To make it do something, you hand the text to a program called the Python interpreter. The interpreter reads the text from top to bottom, one line at a time, and acts on each instruction as it gets to it.
Two consequences worth knowing:
- Order matters. Line 3 can use a value created on Line 2, but Line 2 can't use a value created on Line 3.
- If a line breaks, everything after it stops. Python tells you the line number, which makes fixing it easier than it sounds.
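Both consequences, in a tiny sketch:
price = 10           # line 1: create price
total = price * 3    # line 2: fine – price already exists
print(total)         # 30
# print(discount)    # would crash: discount isn't defined yet,
#                    # and nothing after this line would run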
The three places you'll run code
1. The REPL – for trying things
REPL stands for Read–Eval–Print Loop. It's an interactive prompt: you type one line, hit Enter, and Python answers immediately. Great for experimenting.
>>> 2 + 2
4
>>> "hello".upper()
'HELLO'
You'll meet the REPL in Module 2.
2. A script – for things you want to run again
A script is a file ending in .py that holds Python code. You write the code once, save the file, and you can run it whenever you want, as many times as you want.
# clean_monday_report.py
import pandas as pd
df = pd.read_excel("monday_raw.xlsx")
df = df.dropna()
df.to_excel("monday_clean.xlsx", index=False)
print("Done.")
This is what 90% of "automating boring stuff" looks like.
3. A notebook – for analysis
A Jupyter notebook (file extension .ipynb) is somewhere between a script and a Word document. You write code in little blocks called cells, and the output appears right below each cell. You can mix code, charts, and notes in the same document.
Notebooks are the standard format for data analysis – you'll see them everywhere in finance, science, and business analytics.
And one more: =PY() inside Excel
A fourth place Python now runs is inside an Excel cell itself, via the new =PY() formula. The code runs on a server in the cloud and the result lands back in the cell. We cover this in Module 8.
Which should I use when?
| Situation | Use |
| "What does this single line do?" | REPL |
| "I want to run this same job every Monday." | Script (.py) |
| "I'm exploring data and want notes alongside." | Notebook (.ipynb) |
| "I want a Python value inside a spreadsheet cell." | =PY() in Excel |
You don't have to install anything to try this
If you want a taste before Module 2, you can run Python in a browser tab right now β try online-python.com or replit.com.
Key takeaways
- "Running" Python means handing your code to the interpreter, which reads it top-to-bottom.
- The REPL is for trying single lines. Scripts are for repeatable jobs. Notebooks are for analysis with notes.
- =PY() is a fourth option that runs Python inside an Excel cell.
- You can try all of this in a browser before you install anything.
Try it now (browser only, no install)
- Open online-python.com.
- Paste this code and hit Run:
name = "your name here"
print("Hi, " + name + "!")
print("Today's Python lesson is going well.")
- Change "your name here" to your actual name, and run it again.
- You just used a variable, a function, and string addition – three of Python's most-used features.
01.4 – The Excel-to-Python mental model
Translate things you already know – cells, formulas, ranges, tables – into their Python equivalents.
What you'll learn
- Map five Excel concepts to their Python equivalents
- Recognise a pandas DataFrame as 'just a sheet'
- Feel a lot less intimidated by Python syntax
You already know more than you think
People often say "Python is hard because it's a programming language." The truth is: if you've used Excel formulas, you've already met most of the concepts. They just have different names. Let's translate.
The translation table
| Excel concept | Python equivalent | Why they're the same |
| A cell holding a value: A1 = 100 | A variable: a1 = 100 | Both are a named container for a value. |
| A formula: =SUM(A1:A10) | A function call: sum(numbers) | Both take inputs and give back one output. |
| A range: A1:A10 | A list: [100, 200, 300, ...] | Both are an ordered group of values. |
| A table / structured range | A pandas DataFrame | Both are rows-and-columns with named headers. |
| A VLOOKUP table | A dictionary: {"apple": 1.20, ...} | Both are a key-to-value lookup. |
| IF(A1>10, "big", "small") | if a1 > 10: "big" else: "small" | Both branch on a condition. |
| Filling a formula down 100 rows | A for loop | Both apply the same logic across many items. |
| A macro (VBA) | A Python script | Both are saved instructions you run on demand. |
| An add-in | A library (pandas, numpy, etc.) | Both add capabilities the base tool doesn't have. |
Three quick "Wait, that's it?" examples
Adding numbers in a range
Excel:
=SUM(A1:A5)
Python:
numbers = [10, 20, 30, 40, 50]
total = sum(numbers)
print(total) # 150
Looking up a value
Excel:
=VLOOKUP("apple", A1:B10, 2, FALSE)
Python:
prices = {"apple": 1.20, "pear": 1.50, "banana": 0.40}
print(prices["apple"]) # 1.2
"If A1 is greater than 10, say 'big', otherwise 'small'"
Excel:
=IF(A1>10, "big", "small")
Python:
a1 = 14
if a1 > 10:
result = "big"
else:
result = "small"
print(result) # big
The DataFrame: the most important translation
If there's only one Python concept you remember from this lesson, make it the DataFrame. It's the pandas library's word for a table of data. It has rows. It has named columns. You can sort it, filter it, group it, summarise it. It is, for all practical purposes, an Excel sheet that lives inside Python.
Once you internalise that mental model – "DataFrame = Sheet" – every example in Modules 9 through 16 will feel familiar.
import pandas as pd
# This DataFrame is basically:
# Name Sales
# Alice 1200
# Bob 800
# Carol 1500
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Carol"],
"Sales": [1200, 800, 1500],
})
print(df["Sales"].sum()) # 3500 β like =SUM(B2:B4)
💡 If you keep one slide
Variable = cell. Function = formula. List = range. DataFrame = sheet. Dictionary = VLOOKUP. Loop = fill down.
Key takeaways
- Most Python concepts have a direct Excel equivalent.
- A DataFrame is the pandas word for a sheet – keep that mapping in your head.
- If you can write formulas, you can write Python. You're not starting from zero.
Translate these three formulas to Python
Try to translate each on paper before scrolling down. Answers below.
=A1 + A2
=AVERAGE(A1:A5)
=IF(A1="VIP", A2*0.9, A2)
Show answers
a1 + a2
sum(numbers) / len(numbers) – or with pandas: df["A"].mean()
a2 * 0.9 if a1 == "VIP" else a2
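If you want to check those in a sandbox, here's a runnable version with made-up sample values:
# Stand-ins for the worksheet cells
numbers = [4, 8, 15, 16, 23]   # A1:A5
a1, a2 = "VIP", 200            # for the IF example

print(numbers[0] + numbers[1])          # =A1 + A2
print(sum(numbers) / len(numbers))      # =AVERAGE(A1:A5)
print(a2 * 0.9 if a1 == "VIP" else a2)  # =IF(A1="VIP", A2*0.9, A2)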
01.5 – The tools you'll meet in this course
A field guide to the names you'll keep hearing – Python, pandas, Jupyter, VS Code, openpyxl, =PY(), and so on.
What you'll learn
- Recognise the names of the core tools and libraries
- Understand the role each one plays
- Stop being intimidated by the jargon
A field guide to the names
When you read Python tutorials online, you'll see the same dozen words over and over. Here's what each one is and why you'll meet it in this course.
Python
The language itself. When someone says "I wrote it in Python," they mean they wrote code in this language and ran it with the Python interpreter.
Anaconda
A free bundled installer that gives you Python plus the most-used libraries (pandas, NumPy, Jupyter, matplotlib) in one click. The friendliest way to get set up. We use it in Module 2.
pip
Python's package installer. To add a new library to your Python: pip install thelibrary. Anaconda gives you the common ones already, but pip is how you get anything else.
Library / package
A bundle of pre-written Python code that adds capabilities. Excel calls these "add-ins"; Python calls them libraries. There are hundreds of thousands of them for free.
pandas
The library that turns Python into a souped-up spreadsheet. You will use this constantly. It introduces the DataFrame β that "Python sheet" we just met. Full crash course in Module 9.
NumPy
A library for fast number-crunching. Pandas is built on top of NumPy. You'll touch it occasionally.
openpyxl
The library that reads and writes .xlsx Excel files at a low level β sheets, cells, formatting, formulas, charts. We use it in Modules 7 and 12 for things pandas can't do.
matplotlib & seaborn
The two most common libraries for making charts in Python. We use them in Module 11.
Jupyter notebook
The "code + notes + charts in one document" format we mentioned earlier. File extension .ipynb. Comes free with Anaconda.
VS Code (Visual Studio Code)
A free, modern code editor from Microsoft. It's what we'll use to write Python scripts. There are others (PyCharm, Sublime, etc.) but VS Code is the easiest starting point and works on Mac, Windows, and Linux.
Terminal / command line / PowerShell
The text-based way to talk to your computer. Macs call it Terminal; Windows calls it PowerShell or Command Prompt. You'll need it occasionally – mainly to install things with pip. We hand-hold you through it in Module 2.
=PY() in Excel
The formula that runs Python inside an Excel cell. Microsoft 365 only, as of 2026. Full module on this in Module 8.
Microsoft 365 / Excel for the web
You need the modern subscription version of Excel to use =PY(). The classic boxed Excel 2019 / 2021 does not include it.
How they fit together
A typical day in this course looks like:
- You open VS Code.
- You write a Python script that imports pandas.
- Pandas reads an Excel file (under the hood it uses openpyxl).
- You clean and reshape the data.
- You save the result back to Excel, or you build a chart with matplotlib.
- Optionally: you switch to Excel, type =PY(), and re-run a small piece of the analysis live in a cell.
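Strung together, that day might look something like the sketch below; the filenames and column names are invented for illustration:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("raw_data.xlsx")            # pandas reads the file (openpyxl under the hood)
df = df.dropna()                               # clean
summary = df.groupby("region")["sales"].sum()  # reshape and summarise
summary.to_excel("summary.xlsx")               # save the result back to Excel
summary.plot(kind="bar")                       # chart it with matplotlib
plt.savefig("summary_chart.png")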
💡 You don't need to memorise this
This page is a glossary, not a quiz. Bookmark it. As we use each tool you'll learn what it actually feels like.
Key takeaways
- The names you'll hear most: Python, Anaconda, pip, pandas, openpyxl, Jupyter, VS Code, =PY().
- Each one has one job – they fit together like Lego.
- You don't need to memorise them now; you'll meet each in context.
Spot the tool
For each of these jobs, name the tool from this lesson that does it:
- "Open and edit an .xlsx file at the cell level."
- "Write a chart with code."
- "Run a piece of Python inside an Excel cell."
- "Install a new library."
- "Work on data with code, output, and notes mixed in one document."
Show answers
- openpyxl
- matplotlib (or seaborn)
- =PY()
- pip
- A Jupyter notebook
01.6 – How to use this course
Three suggested paths through the material – total beginner, refresher, job-focused – and how to get the most out of each lesson.
What you'll learn
- Pick a route through the course that fits your level and goals
- Know what to expect on every lesson page
- Use the exercises effectively without getting stuck
Three suggested paths
🟢 Path A – Total beginner (recommended for most people)
Read Modules 1 through 12 in order, doing every exercise. Then pick whichever role module (13–16) fits your job, then Module 17 for the capstone.
Time: Roughly 12–20 hours, depending on how much you experiment.
🟡 Path B – "I've coded a little before"
Skim Modules 1 and 3–6, do all of Module 2 (setup) properly, then go full speed from Module 7 onwards.
Time: Roughly 6–10 hours.
🔴 Path C – "I have a specific Excel pain to fix today"
Do Module 2 (setup), Module 7 (files), Module 9 (pandas), and then jump straight to your role module. You can fill in the foundations later – most readers do.
Time: 4–6 hours.
What every lesson looks like
Every lesson page follows the same shape, so you always know where to look:
- Objectives box at the top – what you'll know by the end.
- Body – the actual lesson, broken into short sections with code samples.
- Step-by-step walkthroughs in numbered cards for any real task.
- Key takeaways – the three to five bullets you'd write on a sticky note.
- Exercise – a small, do-it-yourself task that's quick to get a win on.
- Prev / Next at the bottom – the lesson flows top-to-bottom through the course.
How to do the exercises (without giving up)
Three rules:
- Try first, then read the answer. Even five minutes of "I don't know what to do" builds the muscle that reading never will.
- It's fine to look things up. Real professional Python work involves Googling roughly every 90 seconds. We're not training you for an exam.
- An error message is a clue, not a verdict. The last line of an error usually tells you exactly what went wrong. We have a whole lesson on this (Module 2, Lesson 6).
💡 Pro tip: use AI as a study partner
You can paste any error message β or any lesson exercise β into ChatGPT, Claude, or Copilot and ask "explain this like I'm a spreadsheet user." That's not cheating; it's how everyone codes in 2026.
One small commitment
Do not skip Module 2 (Setup). It's the only module where the work isn't optional. If your Python isn't installed correctly, none of the rest of the course works on your machine. Take the 30 minutes; you'll never need to do it again.
What to do when you're done
You'll know you've "graduated" when you can:
- Open any spreadsheet in Python.
- Write a script that does the equivalent of three things you'd normally do by hand in Excel.
- Run a small =PY() formula inside Excel.
- Read an unfamiliar piece of Python code and roughly guess what it does.
From there, the world is your oyster – web scraping, dashboards, machine learning, automation, scripting your whole job. We list further-reading recommendations at the end of Module 17.
Key takeaways
- Pick the path that matches your level (A, B, or C).
- Every lesson has the same shape β objectives, body, walkthrough, takeaways, exercise.
- Don't skip Module 2.
- Look things up; that's how real Python work happens.
Set your goal
- Pick your path: A, B, or C.
- Block 30 minutes on your calendar in the next two days for Module 2 (setup).
- You're ready. Click "Next lesson →" and let's go.
Module 02 – Setting Up Your Environment
Install Python the easy way (Anaconda), set up VS Code, get a Jupyter notebook running, and turn on =PY() in Excel.
This is the most important module in the course because everything that follows depends on a working setup. The good news: we use the friendliest path on each platform, with screenshots-worth-of-description for every step.
⚠️ Do this module on a real computer
You can't do the rest of the course on a phone or tablet. Use a Mac, Windows, or Linux laptop or desktop.
02.1 – Install Python (the easy way, with Anaconda)
Get Python and the big data libraries on your machine in one installer. Works the same way on Mac and Windows.
What you'll learn
- Download and install Anaconda for your operating system
- Verify Python is installed correctly
- Know where Python lives on your computer
Why Anaconda?
You can install Python directly from python.org. For an Excel user with no programming background, that path leads to lots of head-scratching. Anaconda is a free bundled installer made by a data-science company. Click one button, wait ten minutes, and you have:
- Python itself
- pandas, NumPy, matplotlib, seaborn – everything we'll use
- Jupyter notebooks
- A graphical app called Anaconda Navigator so you don't have to use the command line for basic tasks
If you've already installed Python a different way and it's working, you can skip this lesson. Otherwise, use Anaconda.
Step-by-step walkthrough
Go to the Anaconda download page
Open https://www.anaconda.com/download in your browser.
You may be asked for an email. You can skip that and click "Download" directly.
Pick your operating system
The page detects your OS automatically. Confirm it's right (Windows, macOS, or Linux) and choose the 64-bit installer. If you have a recent Mac with an M-series chip, pick the Apple Silicon version.
Run the installer
Double-click the downloaded file. Click "Continue" / "Next" through every screen. The defaults are correct for what you need.
On Windows, when asked, leave "Add Anaconda to my PATH environment variable" unchecked (Anaconda's own recommendation). We'll use the Anaconda Prompt for command-line work.
The install takes 5–10 minutes and uses about 3 GB of disk space.
Open the Anaconda Navigator
On Mac: open Launchpad and find Anaconda-Navigator.
On Windows: hit the Start menu and search for Anaconda Navigator.
The Navigator is a friendly home screen for everything Anaconda installed. Take a look around. The "Home" tab shows tiles for tools you can launch – Jupyter Notebook, JupyterLab, Spyder, etc.
Confirm Python is working
Click the green Launch button under Jupyter Notebook. Your browser will open with a file-browser view. Click New → Python 3 (ipykernel) to make a new notebook.
In the first cell, type:
print("Python is alive!")
Hold Shift and press Enter. You should see Python is alive! printed below the cell. 🎉
Confirm pandas works
Make a new cell (click the + button) and run:
import pandas as pd
print(pd.__version__)
You should see a version number like 2.2.2. That means pandas is installed and importable. You're done.
⚠️ If Jupyter doesn't open
On Mac, you may need to allow Terminal access. On Windows, your antivirus might flag the launch – allow it. If nothing happens at all, open Anaconda Navigator again and try clicking Launch a second time.
Key takeaways
- Anaconda is the easiest one-click install of Python plus all the data libraries.
- Anaconda Navigator is your graphical home base.
- You've now run your first piece of Python.
Sanity-check exercise
In a new notebook cell, run:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("All three are installed!")
If you see "All three are installed!" with no red error, you're ready for the rest of the course.
02.2 – Install VS Code – your code editor
Set up Visual Studio Code, Microsoft's free code editor, with the Python extension.
What you'll learn
- Install Visual Studio Code
- Add the Python extension
- Open and run a first .py file
Why an editor on top of Anaconda?
Jupyter notebooks (which Anaconda gave you) are great for analysis. But when you start writing reusable scripts β the kind you run on a schedule β you'll want a proper code editor. VS Code is free, from Microsoft, and works the same way on Mac, Windows, and Linux.
Download VS Code
Go to code.visualstudio.com and click the big blue download button. Run the installer with default settings.
On Windows, when asked, tick the boxes for "Add 'Open with Code' action to Windows Explorer file context menu" and "Register Code as an editor for supported file types". Those make life easier later.
Open VS Code for the first time
Launch it. You'll see a welcome screen. The little icons on the far left are your activity bar β files, search, source control, and a "puzzle piece" for Extensions.
Install the Python extension
Click the puzzle-piece icon (Extensions). In the search box, type Python. Install the one made by Microsoft (it has millions of downloads).
That extension is what makes VS Code understand Python: syntax colours, error highlighting, "run this file" buttons, debugging.
Point VS Code at the Anaconda Python
Press Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) to open the Command Palette. Type Python: Select Interpreter and pick the option that says "Anaconda" or has anaconda3 in the path.
This tells VS Code: "When I press Run, use the Python that came with Anaconda."
Make your first script
In VS Code, go to File → New File. Save it as hello.py on your Desktop. Paste in:
print("Hello from VS Code!")
for i in range(5):
print(f" Line {i + 1}")
Click the green ▶ "Run" arrow in the top right (or press F5 and pick "Python File"). The output appears in the panel at the bottom.
💡 The two-window workflow
Many pros keep VS Code open for scripts and Jupyter open for exploration. You'll naturally fall into the same rhythm.
Key takeaways
- VS Code is the editor we use for .py scripts.
- The Microsoft Python extension turns it into a real Python IDE.
- Always confirm the interpreter – it must point at Anaconda's Python.
Two-minute exercise
Save a new file called about_me.py. Inside, write three lines that print your name, your job role, and what you hope to automate with Python. Run it. Notice how the output appears at the bottom of VS Code, instantly.
02.3 – The terminal – a friendly introduction
Five commands that are all you really need: cd, ls/dir, mkdir, pip, python.
What you'll learn
- Open the terminal on Mac and Windows
- Navigate folders with cd
- Install Python libraries with pip
The terminal is not as scary as it looks
The terminal is a text-based way to talk to your computer. You'll need it for exactly two things in this course: installing libraries (pip install ...), and occasionally running a Python script (python myscript.py). Five commands, and you're set.
How to open it
- Mac: Press Cmd+Space, type "Terminal", press Enter.
- Windows: Hit the Start menu and search for "Anaconda Prompt". Use that one, not regular PowerShell – it already knows about your Anaconda Python.
The five commands you actually need
1. pwd (Mac) / cd on its own (Windows) – "where am I?"
pwd
# /Users/yourname
Tells you which folder the terminal is currently looking at.
2. ls (Mac) / dir (Windows) – "what's in this folder?"
ls
# Documents Desktop Downloads Music ...
3. cd folder_name – "move into this folder"
cd Desktop # move into Desktop
cd .. # move one folder up
cd ~ # go to your home folder (Mac/Linux)
4. mkdir new_folder_name – "make a new folder"
mkdir python-projects
cd python-projects
5. pip install <library> – install a Python library
pip install openpyxl
You'll see a stream of text as pip downloads and installs the library. When the prompt comes back, it's done.
Bonus: python myscript.py – run a script
python hello.py
# Hello from VS Code!
# Line 1
# Line 2
# ...
Walkthrough: install a library
Open the terminal (or Anaconda Prompt on Windows)
Type a pip command
pip install xlsxwriter
Press Enter. Watch the lines scroll. When you see Successfully installed xlsxwriter-..., you're done.
Confirm it worked
Open a Jupyter notebook or a Python file and run:
import xlsxwriter
print(xlsxwriter.__version__)
You should see a version number, not a red error.
β οΈ "Command not found" on Mac
If
pip isn't recognised, try
pip3 instead, or open Anaconda Navigator β Environments β click the green βΆ next to
base (root) β "Open Terminal", then run
pip install ... from that terminal.
⚠️ Windows: must use the Anaconda Prompt
Regular Command Prompt may not see your Anaconda Python. Always launch the Anaconda Prompt from the Start menu for these commands.
Key takeaways
- Five commands cover 95% of what you need: pwd, ls/dir, cd, mkdir, pip install.
- On Windows, always use the Anaconda Prompt, not regular PowerShell.
- pip install is how you add any extra library.
Five-minute hands-on
- Open your terminal (or Anaconda Prompt).
- Make a new folder on your Desktop called python-for-excel.
- cd into it.
- Run pip install xlsxwriter requests beautifulsoup4.
- You just installed three libraries we'll use later. Easy.
02.4 – Turn on Python inside Excel (=PY())
Get the =PY() formula working in your Microsoft 365 Excel.
What you'll learn
- Check that your Excel version supports =PY()
- Enter your first =PY() formula
- Know the difference between regular Excel and =PY() cells
What you need
- Microsoft 365 subscription Excel (Windows or Mac). The standalone Excel 2019 / 2021 / 2024 does not include Python.
- An internet connection – the Python code runs on Microsoft's cloud servers, not on your laptop.
Don't have Microsoft 365?
You can still do this entire course. We use =PY() heavily in Module 8; everywhere else, regular Python + the Anaconda setup is enough. If you can get the company 365 license, do.
Quick setup walkthrough
Update Excel
In Excel, go to File → Account → Update Options → Update Now. Make sure you're on a recent build. Python in Excel has been generally available in Microsoft 365 since 2024.
Make a new blank workbook
Open a fresh workbook so you don't accidentally break a real one.
Click an empty cell and type the new formula
Click cell A1. Type =PY( – when you do this, Excel switches the cell into Python mode. You'll notice the formula bar turns into a multi-line area and shows a little green "PY" badge.
Inside the cell, type:
1 + 1
Press Ctrl+Enter (Windows) or Cmd+Enter (Mac) to commit. After a brief "calculating in the cloud" moment, the cell shows 2.
Try a slightly bigger one
Click another empty cell, type =PY(, then:
import pandas as pd
df = pd.DataFrame({"city": ["NYC", "LA", "Chicago"], "pop_m": [8.3, 3.9, 2.7]})
df
Ctrl+Enter. Excel now shows the DataFrame as a special "Python object" cell. Right-click it → Output as → Excel values to spill it into proper Excel cells.
What just happened?
When you type =PY(, Excel sends your code to Microsoft's cloud, runs it in a real Python interpreter (with pandas, NumPy, matplotlib pre-installed), and returns the result back into your spreadsheet. Total round trip: usually 1–3 seconds.
💡 The two output modes
A Python cell can return its result either as a Python object (a single cell that holds, say, a whole DataFrame) or as Excel values (spilled into a range, one cell per value). Toggle with Right-click → Output as.
Key takeaways
- =PY() needs Microsoft 365 Excel.
- Type =PY(, then your Python, then Ctrl+Enter.
- Results can be a single Python object or spilled Excel values.
- Full coverage in Module 8 – this lesson is just "is it on?"
Quick check
In a fresh Excel sheet, get this to work end-to-end:
=PY(
import pandas as pd
ages = pd.Series([24, 31, 45, 22, 38])
ages.mean()
)
You should see 32 in the cell. If you do – Python in Excel is on, and Module 8 will be a breeze.
02.5 – A two-minute tour of Jupyter notebooks
Cells, run buttons, kernels, save, share – the bits of Jupyter you'll touch every day.
What you'll learn
- Create and delete cells in Jupyter
- Know the keyboard shortcuts that save time
- Save and export a notebook
Why you'll use Jupyter all the time
Jupyter is the notebook tool that came with Anaconda. It's the standard way to do data analysis in Python: you type a few lines of code, run them, see the result instantly, write a note about what you saw, type a few more lines. Repeat. Save the whole thing as a document you can share.
Open a notebook
Anaconda Navigator → Jupyter Notebook → Launch. Browser opens. Navigate to a folder. Click New → Python 3.
The anatomy of a notebook
- Cells. The boxes you type into. Two types: Code (the default – runs Python) and Markdown (formatted notes – same syntax as a README).
- The kernel. The Python interpreter behind the notebook. If your code stops working, restart the kernel (Kernel → Restart).
- The toolbar. The little ▶ button runs the selected cell.
The keyboard shortcuts that change your life
| Shortcut | What it does |
| Shift+Enter | Run the current cell and move to the next |
| Ctrl+Enter | Run the current cell and stay on it |
| Alt+Enter | Run and insert a new cell below |
| Esc then A | Insert a new cell above |
| Esc then B | Insert a new cell below |
| Esc then D D | Delete the current cell |
| Esc then M | Turn the current cell into a Markdown note |
| Esc then Y | Turn it back into code |
Walkthrough: build a tiny analysis notebook
Make a Markdown title cell
Press Esc M, then type # My first analysis. Shift+Enter. Notice it renders as a big heading.
Make a code cell that creates some data
import pandas as pd
sales = pd.DataFrame({
"rep": ["Aisha", "Ben", "Carlos", "Dora", "Eve"],
"deals": [12, 7, 19, 8, 14],
"value": [42000, 18000, 71000, 22000, 39000],
})
sales
Shift+Enter. You see the table.
A second code cell that summarises
sales["value"].sum()
Shift+Enter. You see 192000.
Save
Ctrl+S (or Cmd+S on Mac). The file is saved as Untitled.ipynb. Click the title at the top to rename it.
Export
File → Download as → HTML gives you a self-contained web page you can email. Download as → PDF gives you a printable version. Download as → Notebook (.ipynb) gives you the raw file someone else can open in their Jupyter.
π‘ The "last expression auto-prints" trick
A code cell automatically prints the result of its
last line, without needing
print(). That's why
sales on its own showed the whole table.
Key takeaways
- Jupyter = the analysis tool that came with Anaconda.
- Shift+Enter runs a cell. Esc M turns it into a note.
- The last line of a cell prints automatically.
- Save as .ipynb for editing, or export to HTML / PDF to share.
Exercise
Open Jupyter. Make a new notebook called warmup.ipynb. Build it with:
- A Markdown heading: "Warm-up exercise".
- A code cell that creates a list of your five favourite numbers.
- A code cell that prints their sum.
- A Markdown cell at the bottom that says how surprising the answer was (or wasn't).
- Save it. Export it as HTML. Open the HTML file in your browser.
02.6 – When things break – reading errors without panic
How to read a Python error message, the five most common errors, and what to do about each.
What you'll learn
- Read a Python traceback bottom-up
- Recognise the five errors you'll see most
- Know the standard fix for each
The good news about Python errors
Python errors look intimidating because they're long. They're actually friendly: the last line tells you exactly what went wrong, and the line above that tells you where. Read bottom-up and ignore the rest until you're more advanced.
Anatomy of a traceback
Traceback (most recent call last):
File "report.py", line 12, in <module>
total = price * quantitiy
NameError: name 'quantitiy' is not defined
Read bottom-up:
- The last line says what went wrong: NameError: name 'quantitiy' is not defined. We misspelled quantity.
- The line above tells you where: line 12 of report.py.
The five errors you'll see most
1. NameError: name 'x' is not defined
Meaning: Python doesn't know what x is.
Usual cause: Typo, or you forgot to create the variable, or you tried to use it before the line that defines it.
Fix: check spelling and make sure the variable is created on an earlier line.
2. SyntaxError: invalid syntax
Meaning: The code doesn't follow Python's grammar.
Usual cause: Missing colon at the end of an if/for/def line. Or a bracket that wasn't closed. Or a quote that wasn't closed.
# Wrong: no colon
if age > 18
    print("adult")

# Right
if age > 18:
    print("adult")
3. IndentationError
Meaning: Wrong amount of leading whitespace.
Usual cause: You used a tab where Python expected spaces, or you indented inconsistently.
Fix: pick four-space indentation everywhere. VS Code does this for you when you're editing a Python file.
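For instance, a minimal illustration (total is a made-up variable):
# Wrong: the line under the if isn't indented
if total > 100:
print("big order")

# Right: four spaces
if total > 100:
    print("big order")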
4. TypeError
Meaning: You're using a value of the wrong type.
Usual cause: Trying to add a number and a string.
# Wrong: "5" is a string, not a number
total = "5" + 10
# Right
total = int("5") + 10 # 15
5. FileNotFoundError
Meaning: The file you asked Python to open doesn't exist at that path.
Usual cause: Typo in the filename, wrong folder, or the file extension is wrong.
Fix: print the current folder first and double-check:
import os
print(os.getcwd()) # which folder am I in?
print(os.listdir()) # what's in this folder?
The "first response" checklist
- Read the last line of the error.
- Look at the line number just above it.
- Re-read that line in your code, slowly.
- If still stuck, paste the error into Google or an AI assistant verbatim.
- Rerun, see if the error changes. A different error = progress.
💡 Pro tip
Don't try to "guess" what's broken. Always re-read the actual error message. 95% of the time it tells you exactly what to fix.
Key takeaways
- Read tracebacks bottom-up; the last line is the actual error.
- The five common errors: NameError, SyntaxError, IndentationError, TypeError, FileNotFoundError.
- Each has a one-line fix. You'll memorise them after seeing each three times.
- Copy-paste the error into search; you're never the first to hit it.
Make these errors happen on purpose
In a Jupyter cell, run each of these and see what error you get. Then fix it.
# 1
print(secret_number)
# 2
if 5 > 3
print("yes")
# 3
"hello" + 5
# 4
open("not_a_real_file.txt")
Reading errors gets easier the more of them you see. Welcome to programming.
Module 03 – Variables and Data Types
Strings, numbers, booleans, None – the building blocks of every Python value, and how they map to Excel's cell types.
By the end of this module you'll know what every value in Python is made of: text, whole numbers, decimal numbers, true/false flags, and "nothing." You'll be reading and writing the most-common Python building blocks like an Excel formula.
03.1 – Variables – Python's version of a named cell
How to give a value a name, and the rules for what names are allowed.
What you'll learn
- Create variables and reassign them
- Follow Python's variable-naming rules
- Understand the difference between a variable and a value
What's a variable?
A variable is a name that points to a value. In Excel, you might put 100 in cell B2 and refer to it as =B2 from then on. In Python you write:
price = 100
From now on, anywhere you write price, Python substitutes in 100.
price = 100
tax = price * 0.2
print(price + tax) # 120
The single equals sign means "store this"
This trips people up at first: = in Python is not the same as = in algebra. It doesn't mean "equals." It means "take what's on the right and store it under the name on the left."
x = 5 # store 5 in x
x = x + 1 # take whatever x is (5), add 1, store back in x
print(x) # 6
The naming rules
- Start with a letter or underscore. Can't start with a number.
- Made of letters, numbers, underscores. No spaces, no dashes, no punctuation.
- Case-sensitive: Price, price, and PRICE are three different variables.
- Don't use Python's reserved words (if, for, def, class, True, None…). Your editor will warn you.
Style: snake_case
Python convention is lowercase with underscores: monthly_total, customer_name, tax_rate. Not monthlyTotal (that's Java) or MonthlyTotal (that's a class). Stick to snake_case and your code will look professional.
Walkthrough: build a tiny invoice
Set up the inputs
customer_name = "Acme Co."
quantity = 3
unit_price = 19.99
tax_rate = 0.08
Compute
subtotal = quantity * unit_price
tax = subtotal * tax_rate
total = subtotal + tax
Print the receipt
print("Customer:", customer_name)
print("Subtotal:", subtotal)
print("Tax: ", tax)
print("Total: ", total)
Output:
Customer: Acme Co.
Subtotal: 59.97
Tax: 4.7976
Total: 64.7676
Don't panic about the long decimal
We'll fix the rounding (round(total, 2) → 64.77) in the next lesson.
Key takeaways
- A variable is a name attached to a value: name = value.
- = means "store," not "is equal to."
- Use snake_case. Case matters.
- Reassigning a variable just overwrites whatever was there.
Build your own pocket calculator
- Create variables for: a meal price, a tip percentage, the number of people splitting.
- Compute total with tip.
- Compute per-person share.
- Print everything with labels.
03.2 – Numbers – ints, floats, and the math operators
Whole numbers, decimal numbers, and every math operator Python supports.
What you'll learn
- Distinguish ints from floats
- Use Python's arithmetic operators including //, %, **
- Round and format numeric output
Two flavours: int and float
int = a whole number: 0, 1, -42, 1_000_000.
float = a decimal number: 0.5, -3.14, 2.0.
You can check what kind a value is with type():
type(42) # <class 'int'>
type(3.14) # <class 'float'>
type(1_000_000) # <class 'int'> underscores are legal as thousand-separators
The math operators
| Operator | What it does | Example | Result |
| + | Add | 3 + 4 | 7 |
| - | Subtract | 10 - 3 | 7 |
| * | Multiply | 5 * 6 | 30 |
| / | Divide (always returns a float) | 10 / 4 | 2.5 |
| // | Integer divide (drops the remainder) | 10 // 4 | 2 |
| % | Modulo (the remainder) | 10 % 4 | 2 |
| ** | Power | 2 ** 10 | 1024 |
Two tricks worth knowing
Round to N decimals
round(3.14159, 2) # 3.14
round(1234.5678, 0) # 1235.0
Format with commas and currency
total = 1234567.891
print(f"${total:,.2f}") # $1,234,567.89
The f"..." string is called an f-string. We cover it properly in the next lesson; for now know that you can pop variables and format hints into one inside curly braces.
Integer vs float – when does it matter?
Mostly it doesn't. Python converts between them automatically. Two practical gotchas:
- 10 / 4 is 2.5 (a float). 10 // 4 is 2 (an int). If you need a clean whole number – for example, "how many full pages" – use // (see the sketch below).
- Some financial systems hate floats because of tiny rounding errors. We'll meet the decimal library in the Finance scenarios module.
Walkthrough: split a bill three ways
Inputs
bill = 87.45
tip_pct = 18
people = 3
Compute
tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people
Print neatly
print(f"Bill: ${bill:,.2f}")
print(f"Tip ({tip_pct}%): ${tip:,.2f}")
print(f"Total: ${total:,.2f}")
print(f"Per person: ${per_person:,.2f}")
Key takeaways
- int = whole, float = decimal.
- / always returns a float; // truncates; % gives the remainder.
- round(x, 2) rounds; f"${x:,.2f}" formats with commas and two decimals.
Power-of-two table
Print the first ten powers of two in a column, like this:
2^1 = 2
2^2 = 4
2^3 = 8
...
2^10 = 1024
Hint: use a for loop (we cover those in Module 4) or just write ten lines.
03.3 – Strings – text values and how to wrangle them
Quotes, f-strings, slicing, methods like .upper() and .replace().
What you'll learn
- Create strings with single, double, and triple quotes
- Build strings with f-strings
- Use the most common string methods
What's a string?
Text. Anything in quotes. Single or double – Python doesn't care, but pick one and be consistent.
name = "Alice"
greeting = 'Hello there'
long_text = """This is a multi-line string.
It can wrap across as many lines as you like."""
Joining strings together
first = "Alice"
last = "Smith"
full = first + " " + last # "Alice Smith"
F-strings – the modern way
Put an f right before the opening quote, and you can drop variables straight inside {}:
name = "Alice"
sales = 1234.567
print(f"{name} sold ${sales:,.2f}")
# Alice sold $1,234.57
F-strings replaced the older "{} sold ${:.2f}".format(name, sales) style. Use f-strings everywhere.
String methods you'll use weekly
| Method | What it does | Example | Result |
| .upper() | UPPERCASE | "hi".upper() | "HI" |
| .lower() | lowercase | "Hi".lower() | "hi" |
| .title() | Title Case | "alice smith".title() | "Alice Smith" |
| .strip() | Trim whitespace | " abc ".strip() | "abc" |
| .replace(a, b) | Substitute | "$1,200".replace(",", "") | "$1200" |
| .split(",") | Break into a list | "a,b,c".split(",") | ["a","b","c"] |
| " ".join([...]) | Glue list back into a string | " ".join(["a","b"]) | "a b" |
| .startswith("X") / .endswith("X") | Boolean check | "acme.csv".endswith(".csv") | True |
Slicing – pulling out a chunk
Like grabbing a substring with MID() or LEFT() in Excel.
name = "Anthropic"
name[0]    # 'A' – first character
name[-1]   # 'c' – last character
name[0:4]  # 'Anth' – first four
name[4:]   # 'ropic' – from index 4 to the end
name[::-1] # 'cipohtnA' – reversed (a Python party trick)
Walkthrough: clean a messy column of names
Sample data
raw = " alice SMITH "
Clean step by step
step1 = raw.strip() # "alice SMITH"
step2 = step1.lower() # "alice smith"
step3 = step2.title() # "Alice Smith"
print(step3)
Or chain it
cleaned = raw.strip().lower().title()
print(cleaned) # Alice Smith
Chaining like this – calling one method on the result of another – is a Python superpower. You'll see it constantly in pandas.
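As a peek ahead, here's the same chain applied to a whole pandas column at once – a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({"name": ["  alice SMITH ", "BOB jones  "]})
# .str gives you the string methods, column-wide
df["name"] = df["name"].str.strip().str.lower().str.title()
print(df["name"].tolist())   # ['Alice Smith', 'Bob Jones']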
Key takeaways
- Strings live in quotes (single, double, or triple for multi-line).
- F-strings are the modern way to mix variables into text: f"Hi {name}".
- Methods like .upper(), .strip(), .replace() do most of the cleaning work.
- Slicing – s[0:4] – pulls out substrings.
Clean a list of email addresses (on paper)
Given " Alice@Example.COM ", write the Python that turns it into a clean lowercase, trimmed email.
Answer
" Alice@Example.COM ".strip().lower()
# 'alice@example.com'
03.4 – Booleans and comparisons
True, False, and the operators that compare values.
What you'll learn
- Use comparison operators (==, !=, >, <, >=, <=)
- Combine conditions with and/or/not
- Recognise 'truthy' and 'falsy' values
True and False – capitalised
is_open = True
is_closed = False
type(is_open) # <class 'bool'>
Two values. Lowercase true/false won't work – Python is fussy about the capitals.
Comparison operators
| Operator | Meaning | Example | Result |
| == | Equal to | 5 == 5 | True |
| != | Not equal to | 5 != 5 | False |
| > | Greater than | 10 > 3 | True |
| < | Less than | 10 < 3 | False |
| >= | Greater or equal | 5 >= 5 | True |
| <= | Less or equal | 5 <= 4 | False |
⚠️ One equals vs two equals
= stores a value. == tests for equality. x = 5 is "store 5 in x". x == 5 is "is x equal to 5?". Mixing them up is the most common beginner bug.
Combining: and, or, not
age = 25
income = 80000
(age > 18) and (income > 50000) # True
(age < 18) or (income > 100000) # False
not (age < 18) # True
"Truthy" and "falsy"
Python treats some non-boolean values as if they were True or False:
- Falsy: 0, 0.0, "" (empty string), [] (empty list), None.
- Truthy: pretty much everything else.
name = ""
if name:
    print(f"Hi, {name}")
else:
    print("No name provided.")
# Output: No name provided.
The in operator – a bonus worth knowing
"alice" in ["alice", "bob", "carol"] # True
"@" in "alice@example.com" # True
Often replaces a chunky IF + OR formula in Excel.
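For instance, a quick sketch of that translation, with invented names:
# Excel: =IF(OR(A1="alice", A1="bob", A1="carol"), "known", "unknown")
user = "bob"
label = "known" if user in ["alice", "bob", "carol"] else "unknown"
print(label)   # known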
Key takeaways
- True and False with a capital letter.
- == compares; = assigns. Don't mix them up.
- Combine conditions with and, or, not.
- Empty containers ("", []) and the values 0 and None count as False.
Quick check
What does each of these print?
print(10 == 10.0)
print(10 == "10")
print("a" in "alpha")
print(0 or 5)
Answers
True – Python is happy comparing 10 (int) and 10.0 (float).
False – a number is never equal to a string of that number.
True – "a" appears in "alpha".
5 – or returns the first truthy value.
03.5 – None, type conversions, and reading the type
The special value None, plus the int(), str(), float() conversion functions.
What you'll learn
- Use None as an 'unknown' placeholder
- Convert between strings, ints, and floats
- Inspect the type of any value
None – Python's "nothing"
Sometimes you need a variable to mean "I don't have a value yet" or "this is blank." Python uses None for that.
last_login = None
if last_login is None:
print("New user")
Compare with is, not ==, when checking for None.
Type conversions
You'll constantly get data as the wrong type β a number that came in as text, a date that came in as a number. Convert with these:
| Function | What it does | Example |
| int(x) | Try to turn into a whole number | int("42") → 42 |
| float(x) | Try to turn into a decimal | float("3.14") → 3.14 |
| str(x) | Turn into text | str(42) → "42" |
| bool(x) | Turn into True/False | bool("hi") → True |
The "everything is a string when it comes from outside" rule
Any value that arrives from a CSV, a web form, the command line, or a user input box arrives as a string. If you want to do math with it, convert first.
# Pretend "12" came from a CSV cell
quantity = "12"
price = 5
# This is a bug β it repeats "12" five times
print(quantity * price) # "1212121212"
# Convert first
print(int(quantity) * price) # 60
Checking the type at runtime
type(42) # int
type("hi") # str
type([1,2,3]) # list
isinstance(42, int) # True
isinstance() is the cleaner version. Use it in real code.
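Two reasons it's cleaner, shown as quick checks you can paste into a notebook: isinstance() accepts a tuple of types, and it understands subclasses, which type() comparisons don't.
value = 3.14
isinstance(value, (int, float)) # True, matches either type
isinstance(True, int) # True, bool is a subclass of int
type(True) == int # False, type() ignores subclassing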
Walkthrough: a "safe" number parse
The problem
A user might type "42", or "forty-two", or accidentally hit space. We want a function that returns the number, or None if it can't.
The fix using try/except
def safe_int(text):
try:
return int(text.strip())
except (ValueError, AttributeError):
return None
safe_int("42") # 42
safe_int("forty-two") # None
safe_int(" 100 ") # 100
Don't worry about try/except in detail β we cover it in Module 4. The point: real-world data is messy, and Python gives you tools to handle "what if it's broken."
Key takeaways
None = "no value yet." Check with x is None.
- Convert with
int(), float(), str(), bool().
- Data coming from outside is almost always a string. Convert before you compute.
- Use
type() or isinstance() to inspect a value at runtime.
Spot the bug
response = input("How old are you? ")
if response > 18:
print("Adult")
What's broken? Fix it.
Answer
input() always returns a string. Comparing a string to an int crashes. Fix: if int(response) > 18:
03.6 β Talking to the user β input() and print()
How to ask the user a question and show them an answer.
What you'll learn
- Use input() to read text from the user
- Format print() output cleanly
- Combine the two into a working mini-program
input() β ask a question
name = input("What's your name? ")
print(f"Hello, {name}!")
When this runs in a script or terminal, the prompt appears, the program pauses, the user types something, hits Enter, and the typed text is stored in name. Always as a string β even if they typed digits.
print() β talk back
print("Hello")
print("a", "b", "c") # a b c
print("a", "b", "c", sep="-") # a-b-c
print("first")
print("second")
print("third", end=" ")
print("on the same line")
A complete mini-program
# tip_calculator.py
bill = float(input("Bill amount: "))
tip_pct = float(input("Tip percent: "))
people = int(input("How many people? "))
tip = bill * tip_pct / 100
total = bill + tip
per_person = total / people
print()
print(f"Total: ${total:,.2f}")
print(f"Each: ${per_person:,.2f}")
π‘ input() in Jupyter
The same input() works in a Jupyter notebook β a little prompt appears under the cell.
Key takeaways
input("prompt: ") reads a line from the user; the result is always a string.
print() can take any number of values, separated by sep=, terminated by end=.
- Combine them for tiny command-line tools.
Build a unit converter
Ask the user for kilometres, convert to miles (km × 0.621), print the result.
03.7 β Comments β notes to your future self
How to leave notes in your code, and the difference between a comment and a docstring.
What you'll learn
- Write single-line and block comments
- Know when comments help (and when they're noise)
- Recognise a docstring
The # character
Anything after a # on a line is ignored by Python. Use this for notes.
# Read the raw export the finance team dumps in this folder every Monday
raw = pd.read_csv("monday_dump.csv")
# Drop the rows where the customer column is blank β they're rejected orders
clean = raw.dropna(subset=["customer"])
Block comments
Multiple lines, each starting with #. There's no /* ... */ in Python.
# --- Configuration ---
# Change these once a month after the price review meeting
TAX_RATE = 0.08
SHIPPING_FLAT = 7.50
Docstrings β comments for functions
A docstring is a triple-quoted string right under a function's def line. Tools and IDEs use it to show the function's help.
def cents_to_dollars(cents):
"""Convert an integer cents value to a float dollars value.
Example:
cents_to_dollars(1299) -> 12.99
"""
return cents / 100
When comments help β and when they're noise
Good comments explain why, not what:
# Bad β restates what the code already says
x = x + 1 # add 1 to x
# Good β explains the why
x = x + 1 # account for the off-by-one in the upstream CSV
Key takeaways
- # starts a comment, runs to end of line.
- Use comments to explain why, not what.
- Docstrings (triple-quoted) sit under def lines and describe the function.
Comment a tangled script
Take your tip calculator from the last lesson. Add three comments: one at the top explaining what the script does, one above the calculation block explaining the formula, one at the bottom explaining what's printed.
Module 04 β Control Flow
if / else / elif, for and while loops, and Python's gentle error-handling β try/except.
04.1 β if / else β making decisions
Python's IF() statement: indentation, elif, the comparison operators in action.
What you'll learn
- Write if / elif / else blocks
- Use indentation correctly
- Refactor nested ifs into clean code
The shape of an if-statement
age = 18
if age >= 18:
print("Adult")
else:
print("Minor")
Notice three things:
- Colon at the end of the if and else lines.
- Indentation (four spaces) marks what's "inside" the branch.
- There are no parentheses around the condition (you can add them but you don't have to).
elif β chained conditions
Like IFS() in Excel:
score = 73
if score >= 90:
grade = "A"
elif score >= 80:
grade = "B"
elif score >= 70:
grade = "C"
elif score >= 60:
grade = "D"
else:
grade = "F"
print(grade) # C
One-line if (when it improves readability)
status = "VIP" if total > 1000 else "regular"
Use sparingly. Multi-line is usually clearer.
The "early return / guard" pattern
Nested ifs get ugly fast. Flatten them with early returns:
# Tangled
def process(order):
if order is not None:
if order["status"] == "paid":
if order["amount"] > 0:
ship(order)
# Cleaner
def process(order):
if order is None: return
if order["status"] != "paid": return
if order["amount"] <= 0: return
ship(order)
Walkthrough: pricing tiers
Inputs
customer_type = "wholesale"
quantity = 250
Compute the discount
if customer_type == "wholesale":
if quantity >= 500:
discount = 0.20
elif quantity >= 100:
discount = 0.10
else:
discount = 0.05
elif customer_type == "vip":
discount = 0.15
else:
discount = 0.00
print(f"{discount*100:.0f}% discount") # 10% discount
Key takeaways
- Colon at the end of if/elif/else, indent the body.
- elif chains conditions; else catches everything left.
- Flatten deep nesting with early returns.
Shipping cost calculator
Free shipping over $100. $5 flat if weight < 1kg. Otherwise $15. Write the if/elif/else.
04.2 β for loops β doing something for every item
Loop through lists, ranges, dictionaries, and lines of a file.
What you'll learn
- Write a for loop over a list and over a range
- Use enumerate() to track position
- Recognise when not to write a loop (vectorised operations)
The shape of a for loop
for name in ["Alice", "Bob", "Carol"]:
print(f"Hello, {name}!")
Read it as: "for every name in this list, do the indented stuff." Three names β three greetings.
Looping a fixed number of times β range()
for i in range(5):
print(i)
# 0
# 1
# 2
# 3
# 4
range(5) = 0, 1, 2, 3, 4. (Stops before 5 β Python's standard half-open style.)
range(1, 6) # 1, 2, 3, 4, 5
range(0, 10, 2) # 0, 2, 4, 6, 8
enumerate() β when you need the index
items = ["bread", "milk", "eggs"]
for i, item in enumerate(items, start=1):
print(f"{i}. {item}")
# 1. bread
# 2. milk
# 3. eggs
Looping a dictionary
prices = {"bread": 3.50, "milk": 2.10, "eggs": 5.00}
for product, price in prices.items():
print(f"{product}: ${price}")
break and continue
break β stop the loop entirely.
continue β skip the rest of this iteration; jump to the next item.
for n in range(20):
if n % 2 != 0: # odd
continue
if n > 12:
break
print(n)
# 0 2 4 6 8 10 12
Walkthrough: total a list of invoices
The data
invoices = [
{"customer": "Acme", "amount": 1200},
{"customer": "Beta", "amount": 800},
{"customer": "Acme", "amount": 300},
{"customer": "Gamma", "amount": 1500},
]
Loop and accumulate
total = 0
for inv in invoices:
total += inv["amount"]
print(total) # 3800
total += inv["amount"] is shorthand for total = total + inv["amount"].
π‘ The shortcut
For totals, Python gives you a built-in: sum(inv["amount"] for inv in invoices). We meet that pattern (a "generator expression") later.
When NOT to write a loop
If you're using pandas, you almost never need a manual loop. df["amount"].sum() beats writing out the loop in both speed and clarity. We'll see this everywhere from Module 9 onwards.
Key takeaways
- for x in collection: β visit every item.
- range(n) β loop n times.
- enumerate() when you also need the index.
- break stops the loop; continue skips to the next item.
FizzBuzz
For numbers 1 to 30: print "Fizz" if divisible by 3, "Buzz" if divisible by 5, "FizzBuzz" if both, otherwise the number. Classic interview question. You now know enough to write it.
04.3 β while loops β repeat until done
When you don't know how many iterations you need: while, break, and the infinite-loop pitfall.
What you'll learn
- Write a while loop with a clean exit condition
- Use break to leave a loop early
- Avoid accidental infinite loops
The shape of a while loop
balance = 100
months = 0
while balance < 200:
balance = balance * 1.05 # 5% per month
months += 1
print(f"Took {months} months") # 15
"Keep going while the condition is true." When the condition flips false, the loop exits.
The classic infinite loop
x = 0
while x < 10:
print(x)
# forgot to change x β runs forever
If you ever fire one off by accident, hit the stop button in Jupyter, or Ctrl+C in the terminal.
while True + break β the "until" pattern
while True:
answer = input("Type 'quit' to stop: ")
if answer == "quit":
break
print(f"You typed: {answer}")
When to use while vs for
| If you know how many times you'll iterate⦠| use for |
| If you keep going until a condition flips⦠| use while |
Key takeaways
- while CONDITION: runs until the condition is false.
- Change the variable inside the loop or you'll spin forever.
- while True: ... break is the Pythonic "until" pattern.
Doubling game
Start at 1. Keep doubling until you exceed 1,000,000. Print how many doublings it took.
04.4 β List comprehensions β one-line loops
The compact way to build a list from another list. Less code, more clarity.
What you'll learn
- Read a list comprehension
- Convert a for-loop to a comprehension
- Add a filter (if-clause) to a comprehension
The pattern
# Long form
squares = []
for n in range(10):
squares.append(n * n)
# Comprehension β same result
squares = [n * n for n in range(10)]
Read it left to right: "a list of n * n for every n in range(10)."
With a filter
evens = [n for n in range(20) if n % 2 == 0]
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Transform + filter
names = ["alice", "bob", "carol", "dave"]
short_upper = [n.upper() for n in names if len(n) < 5]
# ['BOB', 'DAVE'] ("alice" and "carol" have five letters, so the filter drops them)
Dict and set comprehensions
# dict
prices = {"a": 1, "b": 2, "c": 3}
doubled = {k: v * 2 for k, v in prices.items()}
# {'a': 2, 'b': 4, 'c': 6}
# set (no duplicates)
words = ["apple", "ant", "banana"]
first_letters = {w[0] for w in words}
# {'a', 'b'}
π‘ When to use them
Use a comprehension when the loop body is one short transformation. If you're doing five things to each item, use a regular for-loop β comprehensions stop helping when they get long.
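As a contrived sketch of where that line sits:
names = [" Alice-Smith ", "", " Bob-Brown"]
# Legal, but dense
cleaned = [n.strip().lower().replace("-", "_") for n in names if n.strip()]
# The same logic as a loop: easier to read and extend
cleaned = []
for n in names:
    if not n.strip():
        continue
    cleaned.append(n.strip().lower().replace("-", "_"))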
Key takeaways
- [expression for item in collection] = one-line list builder.
- Add if condition at the end to filter.
- Also works for dicts and sets.
- Don't overuse β readability beats brevity.
Three quick conversions
Convert each loop to a list comprehension.
# 1
result = []
for n in range(20):
if n % 3 == 0:
result.append(n)
# 2
upper_names = []
for name in ["alice", "bob"]:
upper_names.append(name.upper())
# 3
totals = []
for r in [{"qty":3,"price":10}, {"qty":2,"price":15}]:
totals.append(r["qty"] * r["price"])
04.5 β try / except β handling errors gracefully
Catch exceptions so your script keeps running instead of dying on bad data.
What you'll learn
- Wrap risky code in try/except
- Catch specific exception types
- Decide when NOT to catch
The pattern
try:
n = int(user_input)
except ValueError:
n = 0
print("Couldn't parse the number, defaulting to 0.")
Python runs the try block. If a ValueError is raised, control jumps to the except block. If no error, the except is skipped.
The big four exception types
- ValueError β the value was the right type but wrong content (e.g. int("abc")).
- TypeError β the value was the wrong type.
- KeyError β that key didn't exist in the dict.
- FileNotFoundError β couldn't find that file.
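Each of these one-liners raises the matching exception, so try them one at a time (the file name here is made up):
int("abc") # ValueError
"5" + 3 # TypeError
{"a": 1}["b"] # KeyError
open("no_such_file.csv") # FileNotFoundError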
Catch specific, not generic
# OK but lazy β swallows bugs you'd want to know about
try:
do_thing()
except:
pass
# Better β explicit about what you expect
try:
do_thing()
except FileNotFoundError:
print("File missing, skipping.")
The finally block β cleanup that always runs
f = open("data.txt")
try:
work_with(f)
except Exception as e:
print(f"Failed: {e}")
finally:
f.close() # runs whether we succeeded or not
Walkthrough: process a folder of CSVs that might be broken
The plan
Loop through all CSVs in a folder. Try to read each. If one is broken, log it and keep going β don't let one bad file stop the rest.
The script
import pandas as pd
from pathlib import Path
results = []
for path in Path("monthly_files").glob("*.csv"):
try:
df = pd.read_csv(path)
results.append((path.name, len(df)))
except Exception as e:
print(f"Skipped {path.name}: {e}")
print(f"Processed {len(results)} files successfully")
β οΈ Don't catch bugs
The wrong place for try/except: hiding bugs. If your code has a typo, you want it to crash so you can fix it. Only catch errors you've thought about in advance.
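Here's that failure mode in miniature, sketched: a bare except swallows the NameError from a typo and quietly hands back wrong data.
total = 100
try:
    result = totall * 1.08 # typo for 'total': raises NameError
except:
    result = 0 # the bug silently becomes "result = 0"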
Key takeaways
- Wrap risky code (file I/O, type conversions, network calls) in try.
- Catch specific exception types β ValueError, FileNotFoundError etc.
- Use finally for cleanup that must always run.
- Don't use try/except to hide bugs.
Safe parser
Write a function to_float(text) that returns a float if it can, or None if it can't. Then test it with: "12.5", "abc", "", "100".
04.6 β Putting it together β a mini contact-cleaner
Combine if/for/comprehensions/try-except into your first useful script.
What you'll learn
- Combine the four control-flow tools in one script
- Read a tiny piece of real-feeling input
- Produce a clean output
The goal
You receive a list of contacts where each item looks like one of these:
raw = [
" Alice Smith,alice@example.com,42 ",
"Bob Brown,BOB@EXAMPLE.com,", # missing age
"carol,wrong-email,29", # no last name, bad email
",dave@example.com,55", # no name
" Eve Wells, EVE@example.com , 31 ",
]
You want a clean list of dicts with valid contacts only.
Walkthrough
Step 1: split the line and trim each piece
cleaned = []
for line in raw:
parts = [p.strip() for p in line.split(",")]
cleaned.append(parts)
Step 2: validate name, email, age
contacts = []
for name, email, age_text in cleaned:
if not name: continue # need a name
if "@" not in email: continue # need an @
try:
age = int(age_text)
except ValueError:
age = None
contacts.append({
"name": name.title(),
"email": email.lower(),
"age": age,
})
Step 3: inspect
for c in contacts:
print(c)
Output:
{'name': 'Alice Smith', 'email': 'alice@example.com', 'age': 42}
{'name': 'Bob Brown', 'email': 'bob@example.com', 'age': None}
{'name': 'Eve Wells', 'email': 'eve@example.com', 'age': 31}
Carol got dropped (bad email) and Dave got dropped (no name). Bob made it in with a None age. Real data, real result.
Key takeaways
- Real-world scripts always chain these tools β loops + comprehensions + ifs + try/except.
- Validation is just "skip if not what we want" (continue) and "guess if we can" (try/except).
- Clean inputs at the door; the rest of your code stays simple.
Add a feature
Extend the cleaner: also reject contacts under age 18. Add a counter for how many were rejected and print it at the end.
Module 05 β Functions
Define your own reusable mini-tools. Arguments, return values, defaults, and the rules that keep functions honest.
05.1 β def β defining your own function
Make a custom 'formula' you can call from anywhere in your code.
What you'll learn
- Write a function with parameters and a return value
- Call it with positional and keyword arguments
- Use a docstring to document it
Why functions matter
You write the same five-line calculation in three places, then need to fix a typo. With a function: fix it once, every caller gets the fix. Functions are how code stays maintainable.
The shape
def add_tax(amount, rate):
return amount * (1 + rate)
# Use it
total = add_tax(100, 0.08)
print(total) # 108.0
Five things to notice:
- def β keyword that starts a function definition.
- add_tax β the function's name. Use snake_case.
- (amount, rate) β the parameters the function expects.
- Colon, indented body β same as if/for.
- return β sends a value back to the caller.
Positional vs keyword arguments
add_tax(100, 0.08) # positional: amount=100, rate=0.08
add_tax(amount=100, rate=0.08) # keyword: same thing, more explicit
add_tax(rate=0.08, amount=100) # keyword: order doesn't matter
Use keyword arguments when you're calling a function with several parameters β it makes the call site readable.
Default values
def add_tax(amount, rate=0.08):
return amount * (1 + rate)
add_tax(100) # uses default rate of 0.08 β 108
add_tax(100, 0.10) # overrides β 110
Docstrings
Add a triple-quoted string right under def. Editors will show it as help when you hover the function name.
def add_tax(amount, rate=0.08):
"""Return amount + sales tax at the given rate."""
return amount * (1 + rate)
Functions that don't return a value
If you don't write return, Python returns None automatically. That's fine for functions whose job is to print something or save a file β their effect is what matters, not the return value.
def banner(msg):
line = "=" * len(msg)
print(line)
print(msg)
print(line)
banner("Monday report")
Key takeaways
- def name(params): defines a function.
- return sends a value back.
- Use keyword arguments for clarity.
- Default parameters give callers an opt-in override.
Write three functions
- celsius_to_fahrenheit(c) β return F.
- discounted_price(price, discount_pct=10) β return final price.
- full_name(first, last, middle="") β return a properly spaced full name.
05.2 β Returning multiple values & unpacking
How to send back more than one thing β using tuples and dict returns.
What you'll learn
- Return multiple values as a tuple
- Unpack a return into multiple variables
- Decide between a tuple return and a dict return
The tuple return
def split_full_name(full):
first, last = full.split(" ", 1)
return first, last
f, l = split_full_name("Alice Smith")
print(f) # Alice
print(l) # Smith
Behind the scenes, returning first, last is the same as returning the tuple (first, last). The parentheses are optional.
Unpacking
name, age, role = ("Alice", 32, "analyst")
The dict return β when there are many fields
def summary(numbers):
return {
"count": len(numbers),
"total": sum(numbers),
"mean": sum(numbers) / len(numbers),
"max": max(numbers),
"min": min(numbers),
}
stats = summary([10, 20, 30, 40, 50])
print(stats["mean"]) # 30.0
Better than returning a 5-tuple: at the call site you read stats["mean"], not stats[2].
π‘ Rule of thumb
Two return values β tuple. Three or more β dict (or a class, which we won't need in this course).
Key takeaways
- Return multiple values as a tuple: return a, b.
- Unpack at the call site: x, y = func().
- For many fields, return a dict β readability wins.
min_max_avg
Write a function that takes a list of numbers and returns min, max, and average. Try a tuple return first; then rewrite as a dict.
05.3 β Scope β where variables live
Local vs global, why your variable disappeared after the function ran, and how to do this safely.
What you'll learn
- Distinguish local from global variables
- Avoid the global keyword unless you really need it
- Pass values in / return values out β the clean pattern
Local scope
Variables created inside a function only exist inside that function. Once it returns, they're gone.
def calc():
temp = 42
return temp * 2
result = calc()
print(result) # 84
print(temp) # NameError β temp doesn't exist out here
Why this is good
It means a function is a self-contained box. Whatever it does inside can't accidentally rename or corrupt a variable you have outside. Imagine the alternative β every function in a million-line program clobbering each other's variables.
Reading a global is OK; changing one is sketchy
TAX_RATE = 0.08 # global constant
def add_tax(amount):
return amount * (1 + TAX_RATE) # OK to read
# Bad pattern β reaching outside to change a global
TOTAL = 0
def add(n):
global TOTAL
TOTAL += n
# Good pattern β return the new value, let the caller decide
def add(total, n):
return total + n
The mental model
A clean function takes its inputs as parameters and gives its output as a return value. It doesn't reach out into the world. It's predictable, testable, and reusable. Treat global like cigarettes β legal, occasionally unavoidable, mostly a sign you should reconsider.
Key takeaways
- Variables created in a function only exist while the function runs.
- Globals are fine for read-only constants; avoid global for assignment.
- Inputs in β outputs out: the cleanest function pattern.
Refactor
RUNNING_TOTAL = 0
def add_invoice(amount):
global RUNNING_TOTAL
RUNNING_TOTAL += amount
return RUNNING_TOTAL
Rewrite without the global. Pass the running total in, return the updated value.
05.4 β lambda, map, filter β the functional shortcuts
Tiny one-line functions and a quick tour of map() and filter().
What you'll learn
- Recognise a lambda
- Use sorted() with a key function
- Read map() and filter() expressions
What's a lambda?
A one-line function that doesn't get a name. Useful when you need to pass a tiny function to another function.
# With def
def double(x):
return x * 2
# Same thing as a lambda
double = lambda x: x * 2
You'll mostly see lambdas inline, not stored in variables. The most common use: key= in sorted().
sorted() with a key
people = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Carol", "age": 35},
]
by_age = sorted(people, key=lambda p: p["age"])
# [{'name':'Bob',...}, {'name':'Alice',...}, {'name':'Carol',...}]
by_name_desc = sorted(people, key=lambda p: p["name"], reverse=True)
map() β "apply this function to every item"
names = ["alice", "bob", "carol"]
upper = list(map(str.upper, names)) # ['ALICE', 'BOB', 'CAROL']
# Often clearer as a comprehension:
upper = [n.upper() for n in names]
filter() β "keep items where this is true"
ages = [12, 15, 18, 21, 17, 30]
adults = list(filter(lambda x: x >= 18, ages)) # [18, 21, 30]
# Also clearer as a comprehension:
adults = [x for x in ages if x >= 18]
π In practice
Python programmers reach for list comprehensions far more often than map() / filter(). But you'll see all three in the wild β recognise them and you can read any Python code.
Key takeaways
- lambda x: expression β a tiny anonymous function.
- Most common use: sorted(items, key=lambda i: i["something"]).
- map/filter exist; comprehensions are usually clearer.
Sort by multiple keys
Sort the people list above by age ascending, but use name as a tiebreaker. Hint: a lambda can return a tuple.
05.5 β Modules and imports β using other people's code
import, from-import, aliases, and the difference between standard library and third-party.
What you'll learn
- Import a module
- Use from-import for a specific function
- Recognise the standard library vs pip-installed libraries
The three import styles
# 1 β the whole module
import math
math.sqrt(16) # 4.0
# 2 β with an alias (the convention for big libraries)
import pandas as pd
pd.read_csv("file.csv")
# 3 β a specific name
from math import sqrt, pi
sqrt(16) # 4.0
pi # 3.14159...
Standard library vs third party
- Standard library β comes with Python. No install needed. Examples: math, os, pathlib, datetime, json, csv, re.
- Third party β installed with pip install ... (a typical session is sketched below). Examples: pandas, numpy, openpyxl, requests.
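Installing a third-party library happens in the terminal, not inside Python. A typical session might look like this:
pip install requests
pip install pandas openpyxl # several at once is fine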
The conventional aliases
| Library | Conventional alias |
| pandas | import pandas as pd |
| numpy | import numpy as np |
| matplotlib.pyplot | import matplotlib.pyplot as plt |
| seaborn | import seaborn as sns |
Use the conventions. Strangers reading your code will thank you.
Writing your own module
Any .py file is automatically importable from a script in the same folder. Put related functions in helpers.py:
# helpers.py
def clean_email(s):
return s.strip().lower()
Then in main.py next to it:
from helpers import clean_email
clean_email(" ALICE@example.com ") # 'alice@example.com'
Key takeaways
- import x / import x as y / from x import y β three styles.
- Standard library = free; third-party = pip install.
- Use the conventional aliases (pd, np, plt, sns).
- Any of your own .py files is importable.
Use the standard library
Use the datetime module to print today's date in the format YYYY-MM-DD. Hint: from datetime import date; date.today().isoformat().
05.6 β Anatomy of a real script
How professional Python scripts are organised β imports, constants, functions, and the if __name__ block.
What you'll learn
- Recognise the standard layout of a Python script
- Use the if __name__ == '__main__' idiom
- Write your own well-organised script
The standard layout
"""monthly_report.py β generate the monthly sales report.
Usage: python monthly_report.py
"""
# ---- 1. Imports (standard library first, then third-party, then your own) ----
import os
from datetime import date
import pandas as pd
from helpers import clean_email
# ---- 2. Constants ----
INPUT_FILE = "sales_raw.csv"
OUTPUT_FILE = "sales_clean.xlsx"
TAX_RATE = 0.08
# ---- 3. Functions ----
def load(path):
return pd.read_csv(path)
def clean(df):
df["email"] = df["email"].apply(clean_email)
df["total"] = df["price"] * (1 + TAX_RATE)
return df
def save(df, path):
df.to_excel(path, index=False)
def main():
df = load(INPUT_FILE)
df = clean(df)
save(df, OUTPUT_FILE)
print(f"Wrote {len(df)} rows to {OUTPUT_FILE}")
# ---- 4. The "run me" line ----
if __name__ == "__main__":
main()
What's that if __name__ == "__main__" thing?
It's the line that says: "Run main() only if I was launched directly, not when someone imports me from another file."
Why it matters: if you put loose code at the bottom of the file, that code runs even when you just import the file from somewhere else. The if __name__ idiom prevents that.
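A tiny demonstration, using two hypothetical files:
# loose.py: prints the moment anyone imports it
print("running!")
# guarded.py: silent on import, runs only via "python guarded.py"
def main():
    print("running!")
if __name__ == "__main__":
    main()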
The benefit of breaking into functions
- Each function is small and testable.
- You can call individual pieces from a Jupyter notebook while developing.
- You can swap pieces without rewriting the whole thing.
π‘ The 30-second rule
If you can describe a piece of code in one short sentence, it's probably a function. "Clean the email column" β clean_emails(df). "Read the input file" β load(path).
Key takeaways
- Imports β constants β functions β main() β if __name__ == "__main__".
- The if __name__ guard stops your main() from running on import.
- Small functions beat a giant script every time.
Refactor your tip calculator
Take the tip-calculator script you wrote in Module 3 and split it into three functions: get_inputs(), compute(bill, tip_pct, people), and print_results(...). Call them from main().
Module 06 β Data Structures
Lists, dicts, tuples, sets β Python's built-in collections and the Excel equivalents for each.
06.1 β Lists β the everyday ordered collection
Create, index, slice, modify. The Python equivalent of a single-column range.
What you'll learn
- Create and modify lists
- Index, slice, and iterate
- Use append, pop, insert, sort
Creating a list
fruits = ["apple", "banana", "cherry"]
numbers = [10, 20, 30, 40, 50]
mixed = [1, "two", 3.0, True]
empty = []
Indexing
fruits[0] # 'apple' β Python starts at 0
fruits[1] # 'banana'
fruits[-1] # 'cherry' β negative indexes count from the end
fruits[-2] # 'banana'
Slicing
nums = [10, 20, 30, 40, 50]
nums[1:4] # [20, 30, 40]
nums[:3] # [10, 20, 30]
nums[3:] # [40, 50]
nums[:] # [10, 20, 30, 40, 50] β a copy
nums[::2] # [10, 30, 50] β every other one
Modifying
fruits = ["apple", "banana"]
fruits.append("cherry") # add to the end
fruits.insert(0, "avocado") # insert at index 0
fruits.remove("banana") # remove by value
last = fruits.pop() # remove and return last
fruits[0] = "apricot" # replace by index
Useful list operations
len(fruits) # how many items
"apple" in fruits # True/False
sorted(numbers) # returns a sorted copy
numbers.sort() # sorts in place
numbers.reverse()
sum(numbers) # 150
max(numbers) # 50
min(numbers) # 10
Walkthrough: track daily sales
Start with last week's numbers
daily = [1200, 950, 1340, 1100, 1600, 0, 0] # Mon..Sun
This morning's sale comes in
daily.append(1450) # Monday of week 2
print(daily) # [1200, 950, 1340, 1100, 1600, 0, 0, 1450]
Print weekday totals only (drop the weekend)
weekdays = daily[:5]
print(sum(weekdays)) # 6190
β οΈ Lists are mutable
If you pass a list to a function and the function changes it, the original list is changed too. To pass a copy, use my_list[:] or list(my_list).
Key takeaways
- Lists hold an ordered collection of anything.
- Indexes start at 0; negative indexes count from the end.
- append, pop, insert, remove, sort β the workhorses.
- Lists are mutable β changes are in place unless you copy first.
Quick drill
Given nums = [5, 2, 8, 1, 9, 3], write one line each to:
- Get the average.
- Get the second-largest.
- Get a sorted list, descending.
- Get just the odd numbers.
06.2 β Dictionaries β key-value lookups
VLOOKUP, built-in to the language. Create, look up, update, iterate.
What you'll learn
- Create and access dictionaries
- Use .get() to avoid KeyError
- Iterate keys, values, and items
What's a dict?
A dictionary maps keys to values. It's a lookup table.
prices = {
"apple": 1.20,
"banana": 0.50,
"cherry": 3.00,
}
prices["apple"] # 1.20
prices["banana"] # 0.50
Adding and updating
prices["date"] = 5.00 # add a new key
prices["apple"] = 1.50 # update existing
del prices["cherry"] # remove
The .get() escape hatch
Accessing a missing key crashes:
prices["mango"] # KeyError!
.get() returns None (or a default) instead:
prices.get("mango") # None
prices.get("mango", 0) # 0 β supply a default
Iterating
for fruit in prices:
print(fruit) # just keys
for fruit, price in prices.items():
print(f"{fruit}: ${price}")
prices.keys() # dict_keys(['apple', 'banana', ...])
prices.values() # dict_values([1.20, 0.50, ...])
prices.items() # dict_items([('apple', 1.20), ...])
The "row as dict" pattern β this one's important
Real-world data often looks like one dict per row:
customers = [
{"id": 1, "name": "Alice", "city": "NYC", "ltv": 1200},
{"id": 2, "name": "Bob", "city": "Chicago", "ltv": 800},
{"id": 3, "name": "Carol", "city": "NYC", "ltv": 2200},
]
# Find all NYC customers
nyc = [c for c in customers if c["city"] == "NYC"]
# Total NYC LTV
total = sum(c["ltv"] for c in customers if c["city"] == "NYC")
Pandas DataFrames are basically a faster version of this same structure. Internalise the pattern and DataFrames will feel natural.
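As a preview of that connection, the same list of dicts drops straight into a DataFrame:
import pandas as pd
df = pd.DataFrame(customers)
df[df["city"] == "NYC"]["ltv"].sum() # 3400, same answer as the loop above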
Walkthrough: build a quick price lookup
Define the lookup
price_list = {
"SKU-001": 19.99,
"SKU-002": 29.99,
"SKU-003": 9.99,
}
Cost a basket
basket = [("SKU-001", 2), ("SKU-002", 1), ("SKU-099", 4)] # last one isn't in the list
total = 0
for sku, qty in basket:
price = price_list.get(sku)
if price is None:
print(f"Unknown SKU: {sku}")
continue
total += price * qty
print(f"Total: ${total:.2f}")
Key takeaways
- Dicts map keys to values: d[key].
- Use .get(key, default) to avoid KeyError on missing keys.
- Iterate with .items() when you need both key and value.
- Lists of dicts are the precursor to a DataFrame.
Word counter
Given a sentence, build a dict where each key is a word and each value is how many times it appears.
06.3 β Tuples and when to use them
Lists you can't change. Smaller, faster, and a perfect fit for fixed pairs.
What you'll learn
- Create tuples with and without parentheses
- Unpack tuples
- Pick tuple vs list correctly
A tuple is an immutable list
point = (10, 20)
rgb = (255, 0, 128)
point[0] # 10
point[0] = 5 # TypeError β tuples are immutable
The bare-comma shortcut
point = 10, 20 # also a tuple β parens are optional
x, y = point # unpack into two variables
When to use a tuple
- A fixed-size pair or triple where order matters:
(lat, lon), (year, month, day).
- Returning multiple values from a function.
- As a dictionary key (lists can't be keys, tuples can).
When to use a list
- A growing/shrinking collection.
- A homogeneous "many of the same thing" (names, prices, rows).
Key takeaways
- Tuples are immutable. Use them for fixed groupings.
- Parentheses around tuples are optional in most cases.
- Tuples can be dict keys; lists cannot.
Day-of-week dict
Use tuples of (year, month, day) as keys in a dict. Add three keys with notes as values. Print one of them.
06.4 β Sets β unique values, fast
When you need to deduplicate, check membership, or do union/intersection.
What you'll learn
- Create and modify sets
- Use set operations (union, intersection, difference)
- Use a set to deduplicate fast
A set is an unordered collection of unique values
colors = {"red", "green", "blue"}
colors.add("red") # already there β ignored
colors.add("yellow")
print(colors) # {'red','green','blue','yellow'} (order varies)
Deduplicating a list
raw = ["alice@x.com", "bob@y.com", "alice@x.com", "carol@z.com"]
unique = list(set(raw))
Set operations
last_month = {"Alice", "Bob", "Carol"}
this_month = {"Bob", "Carol", "Dave"}
last_month | this_month # union {Alice, Bob, Carol, Dave}
last_month & this_month # intersection {Bob, Carol}
last_month - this_month # difference {Alice} β churned
this_month - last_month # difference {Dave} β new
That last pair β "who churned" and "who's new" β is a classic real-world use.
Fast membership
Checking x in big_list scans every item. Checking x in big_set is nearly instant, regardless of size. If you're checking membership a lot, use a set.
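You can feel the difference with a rough timing sketch (exact numbers vary by machine):
import time
big_list = list(range(1_000_000))
big_set = set(big_list)
start = time.perf_counter()
999_999 in big_list # scans all million items
print(time.perf_counter() - start)
start = time.perf_counter()
999_999 in big_set # one hash lookup
print(time.perf_counter() - start)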
Key takeaways
- Sets hold unique values; adding a duplicate is silently ignored.
- list(set(x)) is the fastest dedupe trick.
- | union, & intersection, - difference.
- Membership checks on sets are O(1) β way faster than lists.
Churn analysis
Given last month's customers (a list) and this month's customers (a list), print the count of churned and new customers.
06.5 β Nested structures β lists of dicts, dicts of lists
How real-world data actually looks in Python before pandas comes along.
What you'll learn
- Build and read nested structures
- Loop through them cleanly
- Convert one to a pandas DataFrame
List of dicts β the most common shape
orders = [
{"id": 101, "customer": "Acme", "items": ["A1", "B7"], "total": 450},
{"id": 102, "customer": "Beta", "items": ["C3"], "total": 120},
{"id": 103, "customer": "Acme", "items": ["A1", "A1", "B7"], "total": 780},
]
Reading it
first = orders[0]
print(first["customer"]) # Acme
print(first["items"][0]) # A1
print(orders[2]["items"][-1]) # B7
Looping
for order in orders:
print(f"Order {order['id']} for {order['customer']}: ${order['total']}")
Aggregating across a list of dicts
# Total revenue
total = sum(o["total"] for o in orders) # 1350
# Revenue per customer
by_customer = {}
for o in orders:
cust = o["customer"]
by_customer[cust] = by_customer.get(cust, 0) + o["total"]
print(by_customer) # {'Acme': 1230, 'Beta': 120}
And here's why DataFrames are so popular
That last calculation took five lines. With pandas it's one:
import pandas as pd
df = pd.DataFrame(orders)
df.groupby("customer")["total"].sum()
Same answer. Same data. Less code. That's why Module 9 is the turning point in this course.
Key takeaways
- Real-world data is usually a list of dicts (or a dict of lists).
- You can do everything with plain Python β it's just verbose.
- pandas makes the same operations one-liners.
Group by city
Given a list of customer dicts each with city and ltv, build a dict mapping city to total LTV.
06.6 β Working with dates
The datetime module, parsing strings to dates, and date arithmetic.
What you'll learn
- Create date and datetime objects
- Parse a date string with strptime
- Do date arithmetic with timedelta
The two types you'll use
from datetime import date, datetime, timedelta
d = date(2026, 5, 16) # just a date
dt = datetime(2026, 5, 16, 14, 30, 0) # date + time
Today / now
date.today() # date(2026, 5, 16)
datetime.now() # full timestamp
Parsing a string
datetime.strptime("2026-05-16", "%Y-%m-%d")
datetime.strptime("16/05/2026", "%d/%m/%Y")
datetime.strptime("May 16, 2026", "%B %d, %Y")
The format codes are documented at strftime.org. The big ones: %Y year, %m month, %d day, %H hour, %M minute.
Formatting back to a string
d = date.today()
d.isoformat() # '2026-05-16'
d.strftime("%B %d, %Y") # 'May 16, 2026'
d.strftime("%A") # 'Saturday'
Date arithmetic
today = date.today()
tomorrow = today + timedelta(days=1)
last_week = today - timedelta(weeks=1)
age_days = (today - date(1990, 1, 1)).days # how many days old
Walkthrough: end-of-month dates for the next year
Build a list of month-ends
from datetime import date
from calendar import monthrange
year = 2026
month_ends = []
for m in range(1, 13):
last_day = monthrange(year, m)[1]
month_ends.append(date(year, m, last_day))
for d in month_ends:
print(d.isoformat())
Key takeaways
- date = just a date. datetime = date + time.
- strptime parses; strftime formats.
- Subtract two dates β timedelta; .days gives you the day count.
- Pandas has even better date tools β covered in Module 9.
Days until your birthday
Write a script that prints how many days until your next birthday.
06.7 β Files and paths β the pathlib essentials
Read and write text files; build paths that work on Mac, Windows, and Linux.
What you'll learn
- Read and write text files
- Use pathlib to build cross-platform paths
- Loop through every file in a folder
Reading a text file
with open("notes.txt", "r") as f:
contents = f.read()
print(contents)
with open(...) as f: is the safe pattern β Python closes the file for you automatically when the indented block ends.
Writing a text file
with open("output.txt", "w") as f:
f.write("Hello!\n")
f.write("Second line.\n")
"w" overwrites; "a" appends.
Reading line by line
with open("notes.txt") as f:
for line in f:
print(line.strip())
pathlib β the modern way to build paths
from pathlib import Path
home = Path.home() # ~/
desktop = home / "Desktop" # ~/Desktop
output = desktop / "report.xlsx" # ~/Desktop/report.xlsx
output.exists() # True/False
output.name # 'report.xlsx'
output.suffix # '.xlsx'
output.parent # PosixPath('~/Desktop')
The / operator joins paths in the correct way for Mac/Linux/Windows. Use this everywhere instead of string concatenation.
Looping through every file in a folder
from pathlib import Path
folder = Path("reports/2026/q1")
for path in folder.glob("*.xlsx"):
print(path.name, path.stat().st_size)
glob("*.xlsx") = "every .xlsx in this folder."
rglob("*.xlsx") = "every .xlsx in this folder and all subfolders."
Walkthrough: gather every CSV from a year's worth of monthly subfolders
The structure
data/
2026-01/
sales.csv
returns.csv
2026-02/
sales.csv
...
List them all
from pathlib import Path
for p in Path("data").rglob("*.csv"):
print(p)
Read and combine
import pandas as pd
frames = [pd.read_csv(p) for p in Path("data").rglob("sales.csv")]
combined = pd.concat(frames, ignore_index=True)
print(len(combined))
Six lines of code, dozens of files, one DataFrame. This pattern alone has saved many an analyst's morning.
Key takeaways
- with open(...) as f: is the safe way to read/write text files.
- pathlib.Path + / builds cross-platform paths.
- glob("*.csv") and rglob("*.csv") iterate every file matching a pattern.
- Combine with pd.concat() to merge many files into one DataFrame.
Inventory script
Write a script that lists every file in your Downloads folder along with its size. Print a one-line total at the end.
Module 07 β Working with Files
Open CSVs, Excel workbooks, folders full of spreadsheets, and write back out. The bread-and-butter file operations.
07.1 β Reading and writing CSV files
The csv standard library module and the pandas one-liner equivalents.
What you'll learn
- Read a CSV into a list of dicts (standard library)
- Read a CSV into a DataFrame (pandas)
- Write a DataFrame back out
Two ways: csv module and pandas
The csv module is built into Python. pandas is the third-party library we'll lean on heavily. For 95% of work, pandas is what you want. For tiny scripts or constrained environments, the csv module is fine.
The pandas one-liner
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())
Done. You have a DataFrame.
Writing a CSV
df.to_csv("clean.csv", index=False)
index=False suppresses pandas' row-number column, which is usually what you want.
Useful read_csv options
pd.read_csv("sales.csv",
sep=",", # delimiter
encoding="utf-8", # "latin-1" for some legacy exports
skiprows=2, # skip first two rows
header=0, # row to use as headers
usecols=["date","amount"],# read only these columns
parse_dates=["date"], # turn date column into real dates
dtype={"id": str}, # force this column to be string
nrows=1000, # read only first 1000 rows
)
The csv module (for when you can't use pandas)
import csv
with open("sales.csv") as f:
reader = csv.DictReader(f)
rows = list(reader)
print(rows[0]) # {'date': '2026-01-05', 'amount': '120', ...}
Every value comes back as a string. You convert as you go.
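So numeric columns need an explicit conversion as you loop, sketched here with the amount column from the example above:
for row in rows:
    row["amount"] = float(row["amount"]) # '120' becomes 120.0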
Common CSV headaches and fixes
| Problem | Fix |
| UnicodeDecodeError | Try encoding="latin-1" or encoding="cp1252" |
| Numbers come in as strings (with commas) | df["amount"] = df["amount"].str.replace(",","").astype(float) |
| Dates won't sort | Use parse_dates=["col"] at read time |
| Wrong delimiter (semicolons or tabs) | sep=";" or sep="\t" |
| Leading/trailing spaces in headers | df.columns = df.columns.str.strip() |
Walkthrough: end-to-end CSV cleanup
Read
import pandas as pd
df = pd.read_csv("sales_raw.csv", parse_dates=["date"])
Inspect
df.head()
df.info()
df.describe()
Clean
df.columns = df.columns.str.strip().str.lower()
df = df.dropna(subset=["customer"])
df["amount"] = df["amount"].astype(float)
Save
df.to_csv("sales_clean.csv", index=False)
Key takeaways
- pd.read_csv() is the one-liner you'll use 95% of the time.
- The standard csv module is there if you can't use pandas.
- Watch out for encoding, delimiters, and number columns that came in as strings.
CSV inspector
Write a small script that takes a CSV path and prints: number of rows, number of columns, the column names, and the first three rows.
07.2 β Reading Excel files with pandas
read_excel: the easiest way to get a .xlsx into Python.
What you'll learn
- Read a specific sheet of an Excel file
- Read multiple sheets at once
- Write a DataFrame back to .xlsx
The basic read
import pandas as pd
df = pd.read_excel("monthly_report.xlsx")
By default this reads the first sheet.
Pick a specific sheet
df = pd.read_excel("workbook.xlsx", sheet_name="January")
df = pd.read_excel("workbook.xlsx", sheet_name=2) # by index, 0-based
Read every sheet into a dict
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
# sheets is a dict: {"January": df1, "February": df2, ...}
for name, df in sheets.items():
print(name, len(df))
Skip header rows / pick a range
pd.read_excel("workbook.xlsx",
sheet_name="Summary",
skiprows=3, # skip the title block
usecols="B:F", # read columns B through F
nrows=50)
Write back
# One sheet
df.to_excel("output.xlsx", index=False, sheet_name="Results")
# Multiple sheets
with pd.ExcelWriter("output.xlsx") as writer:
monthly.to_excel(writer, sheet_name="Monthly", index=False)
yearly.to_excel(writer, sheet_name="Yearly", index=False)
Walkthrough: combine 12 monthly tabs into one
Read every sheet
sheets = pd.read_excel("2026_monthly.xlsx", sheet_name=None)
Tag each with its month and stack
frames = []
for name, df in sheets.items():
df["month"] = name
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
Save the combined version
combined.to_excel("2026_combined.xlsx", index=False, sheet_name="All")
π‘ The dependency
pandas uses openpyxl under the hood for .xlsx. Anaconda includes it; if you used a different install: pip install openpyxl.
Key takeaways
- pd.read_excel() with sheet_name=... picks one sheet; sheet_name=None returns all as a dict.
- ExcelWriter writes multiple sheets to one file.
- Combine many sheets with pd.concat([...]).
Sheet inventory
Write a function inventory(path) that returns a DataFrame with one row per sheet in a workbook, with columns: sheet_name, n_rows, n_cols.
07.3 β openpyxl β when you need cell-level control
Read/write individual cells, formatting, formulas, and styling. The library pandas uses under the hood.
What you'll learn
- Open a workbook, write to specific cells
- Style cells (bold, colour, number format)
- Save and re-open
When you reach for openpyxl
Pandas treats Excel like a flat table. If you need the spreadsheety bits β colours, formulas, merged cells, comments, charts β drop down to openpyxl.
from openpyxl import Workbook, load_workbook
wb = Workbook()
ws = wb.active
ws.title = "Summary"
ws["A1"] = "Customer"
ws["B1"] = "Revenue"
ws["A2"] = "Acme"
ws["B2"] = 1200
ws["A3"] = "Beta"
ws["B3"] = 800
ws["B4"] = "=SUM(B2:B3)" # a real Excel formula
wb.save("openpyxl_demo.xlsx")
Reading existing files
wb = load_workbook("openpyxl_demo.xlsx")
ws = wb["Summary"]
print(ws["A1"].value) # 'Customer'
print(ws.cell(row=2, column=2).value) # 1200
for row in ws.iter_rows(min_row=2, values_only=True):
print(row)
Styling β bold, fill, number format
from openpyxl.styles import Font, PatternFill
bold_white = Font(bold=True, color="FFFFFF")
green_fill = PatternFill(start_color="217346", end_color="217346", fill_type="solid")
for cell in ws[1]: # first row = headers
cell.font = bold_white
cell.fill = green_fill
ws["B2"].number_format = '"$"#,##0.00'
Column widths and freeze panes
ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 14
ws.freeze_panes = "A2"
Walkthrough: turn a DataFrame into a styled report
Write the DataFrame with pandas
import pandas as pd
df = pd.DataFrame({"Customer": ["Acme","Beta","Gamma"], "Revenue": [1200,800,1500]})
df.to_excel("report.xlsx", index=False, sheet_name="Sales")
Re-open with openpyxl and style
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
wb = load_workbook("report.xlsx")
ws = wb["Sales"]
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill("solid", fgColor="217346")
for cell in ws[1]:
cell.font = header_font
cell.fill = header_fill
for cell in ws["B"][1:]: # data rows
cell.number_format = '"$"#,##0.00'
ws.column_dimensions["A"].width = 22
ws.column_dimensions["B"].width = 16
ws.freeze_panes = "A2"
wb.save("report.xlsx")
Key takeaways
- pandas for the data; openpyxl for the spreadsheety bits.
- Cells are ws["A1"] or ws.cell(row=N, column=N).
- Style with Font, PatternFill, .number_format.
- Always save with wb.save("...xlsx").
Style your own report
Take a DataFrame with three columns, write it with pandas, then add: bold green header, currency format on the numeric column, autosized column widths, frozen header row.
07.4 β Looping through many files
Combine, transform, or report on a whole folder of files in one go.
What you'll learn
- Iterate files in a folder
- Build a combined DataFrame from many files
- Add a column tracking which file each row came from
The 'combine many files' pattern
from pathlib import Path
import pandas as pd
folder = Path("monthly_files")
frames = []
for path in sorted(folder.glob("*.csv")):
df = pd.read_csv(path)
df["source_file"] = path.name # track which file
frames.append(df)
combined = pd.concat(frames, ignore_index=True)
print(f"Combined {len(frames)} files, {len(combined):,} total rows")
This pattern shows up everywhere β monthly reports, regional exports, daily dumps. Add a sort on path.stat().st_mtime if you need chronological order.
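That chronological variant is the same loop with a different sort key:
# Oldest file first, by modification time
for path in sorted(folder.glob("*.csv"), key=lambda p: p.stat().st_mtime):
    df = pd.read_csv(path)
    df["source_file"] = path.name
    frames.append(df)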
Adding parsed info from the filename
If files are called sales_2026-01.csv, you can pull the month out:
for path in folder.glob("sales_*.csv"):
month = path.stem.split("_")[1] # '2026-01'
df = pd.read_csv(path)
df["month"] = month
frames.append(df)
What if the files have inconsistent columns?
pd.concat() aligns columns by name; missing columns become NaN. If files have wildly different shapes, normalise each one before stacking:
def normalise(df):
df.columns = df.columns.str.strip().str.lower()
keep = ["date", "customer", "amount"]
return df.reindex(columns=keep)
frames = [normalise(pd.read_csv(p)) for p in folder.glob("*.csv")]
combined = pd.concat(frames, ignore_index=True)
Walkthrough: monthly Excel files β one yearly file
Inputs
from pathlib import Path
import pandas as pd
import re
folder = Path("monthly_xlsx")
frames = []
Read each Excel file, tag with month
for path in sorted(folder.glob("*.xlsx")):
m = re.search(r"(\d{4}-\d{2})", path.stem)
if not m: continue
df = pd.read_excel(path)
df["month"] = m.group(1)
frames.append(df)
Combine and save
year_df = pd.concat(frames, ignore_index=True)
year_df.to_excel("yearly_combined.xlsx", index=False)
print(f"Wrote {len(year_df):,} rows.")
Key takeaways
folder.glob("*.csv") + pd.concat() = combine many files into one.
- Always add a column tracking which file the data came from.
- Normalise columns before concat if files differ.
Combine your Downloads
Pretend you have ten CSVs in a folder. Write the combine-and-tag script. Bonus: print the row count per file before combining.
07.5 β Text files, JSON, and other formats
When you're not dealing with CSV/Excel: plain text, JSON, and a quick note on XML.
What you'll learn
- Read and write JSON
- Parse log-style text files
- Know when to reach for a different format
JSON in two lines
JSON is the format APIs and lots of modern tools use. Python's json module reads it straight into a dict.
import json
# Read
with open("config.json") as f:
config = json.load(f)
print(config["api_key"])
# Write
data = {"name": "Alice", "scores": [10, 20, 30]}
with open("out.json", "w") as f:
json.dump(data, f, indent=2)
JSON β DataFrame
import pandas as pd
df = pd.read_json("orders.json") # array of objects
df.to_json("clean.json", orient="records", indent=2)
Parsing a log file
lines = Path("server.log").read_text().splitlines()
errors = [l for l in lines if "ERROR" in l]
print(len(errors), "errors")
XML β yes, you'll occasionally see it
For old enterprise systems and government data, you'll get XML. The fastest way: pandas can read simple XML tables.
df = pd.read_xml("data.xml")
For gnarlier XML, the standard library's xml.etree.ElementTree works, and the third-party lxml is faster.
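A minimal ElementTree sketch, assuming a hypothetical orders.xml whose root holds <order> elements with <customer> and <amount> children:
import xml.etree.ElementTree as ET
tree = ET.parse("orders.xml") # hypothetical file
for order in tree.getroot().findall("order"):
    customer = order.findtext("customer")
    amount = float(order.findtext("amount"))
    print(customer, amount)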
Key takeaways
- JSON: json.load / json.dump β clean dict in, clean dict out.
- Plain text: Path(...).read_text().splitlines() is the one-liner.
- pandas can also read JSON and XML directly into DataFrames.
Config saver
Write a script that saves your settings (any three key/value pairs you like) to a JSON file, then reads them back and prints them.
07.6 β Encoding, BOMs, and other file gotchas
Why exports from old systems look like garbage, and how to fix them.
What you'll learn
- Recognise an encoding issue
- Try the most-likely fixes
- Strip BOMs and other invisible characters
The classic symptom
You read a CSV and see things like caf� instead of café, or â€™ instead of an apostrophe. That's an encoding mismatch β the file was saved with one character encoding and you're reading it as another.
The two-line fix
# Try utf-8 first (default)
df = pd.read_csv("export.csv")
# If garbled, try latin-1 (Windows / older systems)
df = pd.read_csv("export.csv", encoding="latin-1")
# Or cp1252 (also common from Windows Excel)
df = pd.read_csv("export.csv", encoding="cp1252")
If none of those work, ask the system that exported the file what encoding it uses.
The BOM problem
Some Excel CSV exports include an invisible "byte-order mark" character at the start of the file. Symptom: your first column header looks like ï»¿Customer with a weird prefix.
pd.read_csv("export.csv", encoding="utf-8-sig")
The -sig suffix tells pandas to strip the BOM.
Stray whitespace and invisible characters
# Strip columns and string values in one pass
df.columns = df.columns.str.strip()
for col in df.select_dtypes("object"):
df[col] = df[col].str.strip()
Date columns that come in as strings
df["date"] = pd.to_datetime(df["date"], errors="coerce")
errors="coerce" turns bad dates into NaT (not-a-time) instead of crashing.
Key takeaways
- If text looks garbled, try encoding="utf-8-sig", then latin-1, then cp1252.
- Strip whitespace from columns AND string values when loading messy data.
- pd.to_datetime(..., errors="coerce") handles unparsable dates gracefully.
Encoding sleuth
Write a function safe_read_csv(path) that tries utf-8, then utf-8-sig, then latin-1, and returns the first one that works.
Module 08 β Python inside Excel β =PY()
Use Python in a cell like a formula. Pass ranges, return DataFrames, chart inline, and decide when =PY() is the right tool.
This is the headline feature. The =PY() formula lets you write a few lines of Python directly inside an Excel cell and get the result back into the spreadsheet. Cloud-powered, no install required (other than Microsoft 365), and surprisingly powerful.
β οΈ Requires Microsoft 365
Standalone Excel 2019/2021/2024 doesn't include =PY(). If you don't have 365, you can still read this module for the concepts and use Python in Jupyter for the same outcomes.
08.1 β =PY() β the first ten formulas
Type =PY(, write Python, hit Ctrl+Enter. The basic muscle memory.
What you'll learn
- Enter a =PY() formula
- Commit with Ctrl+Enter (not just Enter)
- Output as Python object vs Excel values
The syntax
- Click an empty cell.
- Type =PY( β Excel switches the cell into Python mode. A green "PY" badge appears.
- Write your Python.
- Press Ctrl+Enter (Windows) / Cmd+Enter (Mac) to commit.
Don't just hit Enter β that adds a newline inside the cell. Ctrl+Enter is the "submit" key.
Ten warm-up formulas
# 1 β basic math
=PY(1 + 1) # 2
# 2 β a variable
=PY(
x = 100
x * 1.08
) # 108
# 3 β string
=PY("hello".upper()) # HELLO
# 4 β list aggregation
=PY(sum([10, 20, 30])) # 60
# 5 β current date
=PY(
from datetime import date
date.today()
)
# 6 β a Series
=PY(
import pandas as pd
pd.Series([1, 2, 3, 4, 5]).mean()
) # 3.0
# 7 β a small DataFrame
=PY(
import pandas as pd
pd.DataFrame({"x": [1,2,3], "y": [4,5,6]})
)
# 8 β random
=PY(
import random
random.choice(["heads", "tails"])
)
# 9 β string manipulation
=PY("alice@example.com".split("@")[1]) # example.com
# 10 β sorted list
=PY(sorted([3, 1, 4, 1, 5, 9, 2, 6]))
Output mode: Python object vs Excel values
By default, a =PY() cell shows a Python object β a single cell that holds (say) a whole DataFrame. To spill the contents into proper Excel cells:
- Right-click the cell.
- Output as β Excel values.
- The DataFrame spills into a range. Headers become column headers, rows become rows.
Toggle back to Python-object mode if you want to use the cell's value in another =PY() formula.
Walkthrough: build a summary cell
Put some numbers in A1:A10
Type any 10 numbers in cells A1 through A10.
In B1, write =PY()
=PY(
xl("A1:A10").mean()
)
The xl(...) helper grabs an Excel range as a pandas Series or DataFrame. .mean() averages it.
Try a more useful summary
=PY(
s = xl("A1:A10")
{"count": s.count(),
"sum": s.sum(),
"mean": s.mean(),
"min": s.min(),
"max": s.max()}
)
Right-click β Output as β Excel values. The dict spills into a two-column table.
Key takeaways
- =PY( + your Python + Ctrl+Enter.
- xl("A1:A10") brings an Excel range into Python as a Series/DataFrame.
- Right-click β Output as β Excel values to spill the result into cells.
Calculator cell
In a fresh workbook, put numbers in A1:A20. In B1, write a =PY() formula that computes the median. In B2, one that computes the standard deviation.
08.2 β The xl() helper β getting Excel data into Python
How to reference cells, ranges, and tables from inside =PY().
What you'll learn
- Reference single cells, ranges, and tables with xl()
- Choose between headers=True and headers=False
- Pull data from another sheet
The xl() forms
xl("A1") # single value
xl("A1:C5") # range β DataFrame (or Series if one col)
xl("A1:C5", headers=True) # treat first row as headers
xl("Sheet2!A1:C100") # range from another sheet
xl("MyTable[#All]") # an Excel Table by name
xl("MyTable[Revenue]") # one column of a Table
Single cells
=PY(
tax_rate = xl("B1")
amount = xl("B2")
amount * (1 + tax_rate)
)
Range with headers
Best practice: lay your data out with headers in the first row, then read with headers=True:
=PY(
df = xl("A1:D100", headers=True)
df.groupby("Region")["Revenue"].sum()
)
Cross-sheet references
=PY(
df = xl("RawData!A1:F1000", headers=True)
df.head()
)
Excel Tables β the cleanest way
If your data is a real Excel Table (Ctrl+T on a range), you can reference it by name. That's the most readable option and survives row insertions/deletions:
=PY(
sales = xl("SalesTable[#All]", headers=True)
sales["Total"] = sales["Qty"] * sales["Price"]
sales
)
Walkthrough: build a small dashboard cell
Set up the data
Put this in A1:C5 (headers in row 1):
Region Quarter Revenue
North Q1 1200
North Q2 1500
South Q1 900
South Q2 1100
Summary cell
=PY(
df = xl("A1:C5", headers=True)
df.groupby("Region")["Revenue"].agg(["sum", "mean"])
)
Right-click → Output as → Excel values. You get a clean two-column summary.
💡 The =PY() pattern that scales
Store your raw data as an Excel Table. Use =PY() cells purely for analysis and presentation. When new rows come in, the analysis updates automatically.
Key takeaways
- xl("A1:B5", headers=True) brings a labelled range in as a DataFrame.
- Excel Tables are referenced by name and survive resizing.
- You can read other sheets with Sheet2!A1:B5 syntax.
Top three
Lay out a list of 10 employees with a Sales column. In a =PY() cell, return the top three by sales as a DataFrame.
08.3 – Pandas in =PY() – the killer combination
Run real pandas operations inside Excel cells. Groupby, pivot, sort, filter.
What you'll learn
- Group, pivot, and aggregate in =PY()
- Sort and filter in =PY()
- Spill the result into a clean Excel range
The setup
Put this raw data in A1:D11 (10 rows + headers):
Date Region Product Revenue
2026-01-05 North Widget 1200
2026-01-12 North Gadget 800
2026-01-19 South Widget 950
2026-02-02 South Gadget 1100
2026-02-09 East Widget 1300
2026-02-16 East Gadget 700
2026-03-03 North Widget 1500
2026-03-10 South Widget 1050
2026-03-17 East Gadget 900
2026-03-24 North Gadget 1250
Pivot by Region × Product
=PY(
df = xl("A1:D11", headers=True)
df.pivot_table(index="Region", columns="Product",
values="Revenue", aggfunc="sum")
)
Group by month
=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M")
df.groupby("Month")["Revenue"].sum()
)
Top N
=PY(
df = xl("A1:D11", headers=True)
df.nlargest(3, "Revenue")
)
Filter
=PY(
df = xl("A1:D11", headers=True)
df[df["Region"] == "North"]
)
Walkthrough: build a region × month sales matrix
The =PY() cell
=PY(
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m")
m = df.pivot_table(index="Region", columns="Month",
values="Revenue", aggfunc="sum", fill_value=0)
m
)
Spill it
Right-click → Output as → Excel values. You get a clean Region × Month matrix that updates whenever the source data changes.
📌 Pandas is already imported
Inside =PY(), pandas is already available as pd by default. You don't need an import pandas as pd at the top – though it's harmless if you add it.
Key takeaways
- =PY() gives you the full power of pandas inside a cell.
- Groupby, pivot, sort, filter – all one-liners.
- Spill the result as Excel values when you want to use it elsewhere.
Build your own pivot
Using the table above, write a =PY() formula that returns total Revenue by Product, sorted descending.
08.4 – Charts in =PY() with matplotlib and seaborn
Return a chart from a =PY() cell. Lives inside the workbook like an image.
What you'll learn
- Return a matplotlib chart from =PY()
- Customise the chart's title, labels, and size
- Use seaborn for prettier defaults
The simplest chart cell
=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar")
)
The cell returns a chart image. Click it; Excel lets you resize and reposition it.
Styling
=PY(
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
df = xl("A1:D11", headers=True)
df.groupby("Region")["Revenue"].sum().plot(kind="bar", ax=ax, color="#217346")
ax.set_title("Revenue by Region")
ax.set_ylabel("Revenue ($)")
ax.set_xlabel("")
plt.tight_layout()
)
Seaborn for nicer defaults
=PY(
import seaborn as sns
df = xl("A1:D11", headers=True)
sns.barplot(data=df, x="Region", y="Revenue", estimator="sum", errorbar=None)
)
Walkthrough: monthly trend chart
Aggregate first, plot second
=PY(
import matplotlib.pyplot as plt
df = xl("A1:D11", headers=True)
df["Month"] = pd.to_datetime(df["Date"]).dt.to_period("M").dt.to_timestamp()
monthly = df.groupby("Month")["Revenue"].sum()
fig, ax = plt.subplots(figsize=(7, 3))
monthly.plot(ax=ax, marker="o", color="#2b6cb0")
ax.set_title("Monthly Revenue")
ax.grid(alpha=0.3)
plt.tight_layout()
)
💡 Treat chart cells like images
A =PY() chart behaves like an image inside the spreadsheet – you can resize it, move it across sheets, and it'll update whenever the source data changes.
Key takeaways
- =PY() can return matplotlib/seaborn charts directly.
- For control: build a fig, ax and tweak labels/colours.
- Use plt.tight_layout() to avoid clipping.
Stacked bar
Build a stacked bar chart in =PY() showing Revenue by Region, with Product as the stack. Hint: df.pivot_table(...).plot(kind="bar", stacked=True).
08.5 – When =PY() is the right tool – and when it isn't
A decision guide so you reach for the right hammer.
What you'll learn
- Pick =PY() over native formulas when it pays off
- Stick with native formulas when they're faster/cleaner
- Combine both in the same workbook
When =PY() is the right tool
- Anything that wants a real DataFrame: pivots, groupbys, complex joins.
- Statistical analysis: regressions, t-tests, distributions, percentile bands.
- Any chart Excel can't make cleanly (heatmaps, violin plots, faceted small multiples).
- Date arithmetic that would need ten nested formulas.
- Working with data that doesn't fit on a normal sheet (rare in =PY(), but possible).
- Anything you'd want to re-do programmatically later.
When to stick with native formulas
- Simple lookups, sums, counts, and conditional formatting – XLOOKUP, SUMIFS, COUNTIFS are fast and obvious.
- Interactive what-ifs where every keystroke should ripple through the whole sheet.
- Workbooks shared with people who don't have Microsoft 365 (their cells will show #BLOCKED!).
- Anything that needs to recalc in < 100 ms.
The hybrid pattern
Use native formulas for the live, interactive parts. Use =PY() for the heavier analysis next to them. The two coexist beautifully.
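A sketch of the hybrid layout, with assumed cell addresses: the =PY() cell spills a per-region summary as Excel values, and an ordinary native formula next to the spill handles the live, interactive part.
=PY(
# in F1, spilled via Output as → Excel values: region totals from the Table
xl("SalesTable[#All]", headers=True).groupby("Region")["Revenue"].sum()
)
=XLOOKUP("North", F1:F4, G1:G4)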
Two performance notes
- =PY() runs in the cloud. Each calc is a round trip. Don't put 500 of them in one workbook.
- One bigger =PY() beats many small ones. Do as much per cell as you can; one cell that produces a 10×10 table is faster than 100 cells each producing one value.
Key takeaways
=PY() shines for pivots, stats, charts, and heavy date work.
- Native formulas are still best for simple lookups and live what-ifs.
- Mix them; prefer one big
=PY() over many small ones.
Decide
For each task, pick: native formulas, =PY(), or both?
- "Sum revenue where region = North."
- "Forecast next quarter's revenue with a linear regression."
- "Look up a customer's tier from a side table."
- "Build a heatmap of orders by hour-of-day Γ day-of-week."
08.6 – Twelve =PY() recipes you'll use
Copy-paste-ready snippets for the most common Python-in-Excel tasks.
What you'll learn
- Bookmark this lesson
- Adapt each recipe to your own data
- Spot which pandas method does what
1 – Sum a column with a condition
=PY(
df = xl("DataTable[#All]", headers=True)
df.loc[df["Region"]=="North", "Revenue"].sum()
)
2 – Count distinct values
=PY(
xl("A2:A1000").nunique()
)
3 – Group-by total
=PY(
df = xl("A1:D1000", headers=True)
df.groupby("Region")["Revenue"].sum().sort_values(ascending=False)
)
4 – Pivot table
=PY(
df = xl("A1:D1000", headers=True)
df.pivot_table(index="Region", columns="Quarter",
values="Revenue", aggfunc="sum", fill_value=0)
)
5 – Top N by metric
=PY(
df = xl("A1:E1000", headers=True)
df.nlargest(10, "Revenue")
)
6 – Running total
=PY(
df = xl("A1:B100", headers=True).sort_values("Date")
df["RunningTotal"] = df["Amount"].cumsum()
df
)
7 – Percentage of total
=PY(
s = xl("B2:B11")
(s / s.sum() * 100).round(1)
)
8 – Date difference in days
=PY(
df = xl("A1:C100", headers=True)
df["DaysOpen"] = (pd.to_datetime(df["Closed"])
- pd.to_datetime(df["Opened"])).dt.days
df
)
9 – Quartile bucketing
=PY(
df = xl("A1:B500", headers=True)
df["Bucket"] = pd.qcut(df["Score"], 4, labels=["Q1","Q2","Q3","Q4"])
df
)
10 – Inner join two ranges
=PY(
orders = xl("OrdersTable[#All]", headers=True)
customers = xl("CustomersTable[#All]", headers=True)
orders.merge(customers, on="CustomerID")
)
11 – Quick linear regression
=PY(
import numpy as np
df = xl("A1:B100", headers=True)
slope, intercept = np.polyfit(df["X"], df["Y"], 1)
{"slope": slope, "intercept": intercept}
)
12 – Heatmap chart
=PY(
import seaborn as sns
df = xl("A1:D1000", headers=True)
pivot = df.pivot_table(index="DayOfWeek", columns="Hour",
values="Orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu")
)
Key takeaways
- Most analysis in =PY() is two or three lines.
- Bookmark this lesson – you'll come back to it.
- If you keep wanting a recipe that isn't here, that's a great cue to get in touch.
Adapt one
Pick recipe #3, #5, or #10. Build a small dataset on a new sheet and make it run in your own workbook.
08.7 – Limits, costs, and security of =PY()
What =PY() can't do, where it stores your data, and how it's priced.
What you'll learn
- Know what =PY() doesn't support
- Understand where the code runs
- Decide what's appropriate to put in a =PY() cell
The runtime model
When a =PY() cell calculates, Excel sends the code and the relevant Excel data to Microsoft's cloud, runs it in a sandboxed Python environment, and sends the result back. Your laptop does almost nothing – most of the work happens on Microsoft Azure.
What's pre-installed
Microsoft pre-installs a curated set of libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn, statsmodels, and a few more. You can't pip install something extra into a =PY() cell – what's there is what's there.
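One quick way to see what you're working with (a sketch – the exact library set and versions vary by release):
=PY(
import pandas, numpy, matplotlib
{"pandas": pandas.__version__,
"numpy": numpy.__version__,
"matplotlib": matplotlib.__version__}
)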
What's not allowed
- No network calls. You can't fetch a web page or hit an API from =PY(). Use a Jupyter notebook for that.
- No reading local files. The sandbox can't see your hard disk. Everything has to come from xl(...).
- No installing libraries. Stick to the pre-installed set.
Calculation order
Python cells are calculated after regular Excel formulas, top-to-bottom, left-to-right by default. If you have one =PY() that depends on another, that's fine – Excel orders them correctly.
Privacy
The data in cells referenced by xl(...) is sent to Microsoft's servers. For most spreadsheet work this is fine; for highly sensitive data, check your organisation's policy. The data is not used to train AI models per Microsoft's published policy at launch – but verify the current policy at support.microsoft.com.
Cost
At launch, basic =PY() usage is included in Microsoft 365. Premium tiers offer faster compute and higher data limits. Check Microsoft's pricing page for the current numbers.
Performance gotchas
- Each =PY() is a network round-trip. Hundreds of them in one workbook will feel slow.
- Big DataFrames (> 50k rows) are slower over the wire – for very large data, do the analysis in Jupyter and paste results in.
Key takeaways
- =PY() runs in Microsoft's cloud – your data goes through their servers.
- No network, no local files, no extra libraries inside =PY().
- For network/file work, use a Jupyter notebook (the rest of this course).
Decide where each runs
Which of these should live in =PY(), and which in a Jupyter script?
- Scrape today's exchange rate from a website.
- Make a heatmap of last month's sales.
- Combine 50 monthly CSVs from a shared drive.
- Compute the correlation matrix of a 20-column table sitting in Excel.
Module 09 – Pandas Crash Course
The most important library in this course. DataFrames, selection, filtering, grouping, joining, pivoting – the spreadsheet superpowers.
09.1 – DataFrames – Python's spreadsheet
Create a DataFrame, inspect it, and learn the vocabulary.
What you'll learn
- Build a DataFrame from a dict or list of dicts
- Use .head(), .info(), .describe(), .shape
- Distinguish a DataFrame from a Series
Two ways to make a DataFrame
import pandas as pd
# From a dict of columns
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"age": [30, 25, 35],
"city": ["NYC", "Chicago", "LA"],
})
# From a list of rows (each row a dict)
rows = [
{"name": "Alice", "age": 30, "city": "NYC"},
{"name": "Bob", "age": 25, "city": "Chicago"},
{"name": "Carol", "age": 35, "city": "LA"},
]
df = pd.DataFrame(rows)
The four "what's in this thing?" methods
df.head() # first 5 rows
df.tail(3) # last 3 rows
df.shape # (rows, columns)
df.info() # column names, dtypes, missing counts
df.describe() # numeric summary stats
Series vs DataFrame
A Series is a single column. A DataFrame is many columns aligned by row.
df["age"] # a Series
type(df["age"]) # <class 'pandas.core.series.Series'>
df[["age", "city"]] # a DataFrame (note: double brackets)
Renaming columns
df = df.rename(columns={"name": "full_name"})
df.columns = ["full_name", "age", "city"] # set all at once
Adding a column
df["is_adult"] = df["age"] >= 18
df["greeting"] = "Hi " + df["full_name"]
Walkthrough: from raw to summary in five lines
Read
df = pd.read_csv("sales.csv")
df.head()
Inspect
df.shape # (1000, 6)
df.info()
df.describe()
Quick summary
df.groupby("region")["amount"].sum()
Key takeaways
- DataFrame = table; Series = one column.
- .head(), .info(), .describe(), .shape are the inspection workhorses.
- Build from a dict-of-lists or a list-of-dicts.
Quick build
Create a DataFrame of five employees with name, department, salary. Print .describe() on it and notice what it shows for the numeric column versus the text ones.
09.2 – Selecting rows and columns – loc and iloc
Pull out the slice of data you actually want.
What you'll learn
- Select columns by name
- Filter rows by condition
- Use .loc and .iloc correctly
Selecting columns
df["name"] # one column β Series
df[["name", "city"]] # multi-column β DataFrame
Filtering rows by condition
df[df["age"] > 30]
df[df["city"] == "NYC"]
df[df["name"].str.startswith("A")]
Combining conditions
df[(df["age"] > 30) & (df["city"] == "NYC")] # AND β use &
df[(df["age"] < 30) | (df["city"] == "LA")] # OR β use |
df[~(df["city"] == "Chicago")] # NOT β use ~
Important: use & | ~, not and/or/not. And wrap each condition in parentheses.
.loc – label-based selection
df.loc[0] # first row
df.loc[0:3, "name"] # rows 0..3 (inclusive!), column 'name'
df.loc[df["age"] > 30, "name"] # name column for the over-30s
.iloc – position-based selection
df.iloc[0] # first row by position
df.iloc[:5, 0:2] # first 5 rows, first 2 columns
df.iloc[-1] # last row
The "is in this list" filter
df[df["city"].isin(["NYC", "LA"])]
df[~df["city"].isin(["NYC", "LA"])] # everything not in the list
Walkthrough: pulling apart a dataset
The data
df = pd.read_csv("orders.csv")
df.columns
# Index(['order_id','date','region','customer','amount','status'])
Big orders in the North region
big_north = df[(df["region"] == "North") & (df["amount"] > 1000)]
Just two columns of those
big_north[["date", "amount"]]
Or in one go with .loc
df.loc[(df["region"]=="North") & (df["amount"]>1000),
["date", "amount"]]
Key takeaways
- df[bool_mask] is the everyday filter pattern.
- Use &, |, ~ (with parens) when combining.
- .loc[rows, cols] for labels; .iloc[rows, cols] for positions.
- .isin([list]) for membership filters.
Drill
From a DataFrame of employees: select everyone in Sales with a salary over $80k, returning only the name and salary columns.
09.3 – Sorting, counting, and ranking
sort_values, value_counts, rank – the simple stuff you'll use a hundred times a day.
What you'll learn
- Sort by one or more columns
- Count occurrences of values
- Rank a column
Sorting
df.sort_values("amount", ascending=False).head(10)
df.sort_values(["region", "amount"], ascending=[True, False])
df.sort_index() # by row index
Counting
df["region"].value_counts() # counts per region
df["region"].value_counts(normalize=True) # as percentages
df["region"].nunique() # distinct count
Ranking
df["rank_by_amount"] = df["amount"].rank(ascending=False)
df["pct_rank"] = df["amount"].rank(pct=True)
Top / bottom N
df.nlargest(10, "amount")
df.nsmallest(5, "amount")
Key takeaways
- sort_values(["a","b"], ascending=[True,False]) handles multi-key sort.
- value_counts() is the one-liner for a quick frequency table.
- nlargest() / nsmallest() are faster and clearer than sort+head.
Bestseller list
From an orders DataFrame, print the top 5 products by total revenue.
09.4 – Aggregation – sum, mean, agg
How to summarise. Series methods, DataFrame methods, and .agg().
What you'll learn
- Compute basic aggregates on a Series
- Aggregate multiple columns at once
- Use .agg() for multiple stats per column
Single-column aggregates
df["amount"].sum()
df["amount"].mean()
df["amount"].median()
df["amount"].std()
df["amount"].max()
df["amount"].min()
df["amount"].count()
df["amount"].quantile(0.95)
Whole-DataFrame aggregates
df.sum(numeric_only=True)
df.mean(numeric_only=True)
df.describe()
.agg() – many at once
df["amount"].agg(["sum", "mean", "min", "max"])
df.agg({
"amount": ["sum", "mean"],
"quantity": ["sum", "max"],
})
Custom aggregations
def iqr(s):
return s.quantile(0.75) - s.quantile(0.25)
df["amount"].agg(["mean", iqr])
Key takeaways
- Series have direct methods: .sum(), .mean() etc.
- .agg(["sum","mean"]) runs many at once.
- .describe() is the quickest "show me everything" summary.
Summary box
For a numeric column of your choosing, print mean, median, min, max, and 95th percentile in a single .agg() call.
09.5 – groupby – the move that pays for the course
Split → apply → combine. Replace your SUMIFS forever.
What you'll learn
- Group by one or more columns
- Aggregate after groupby
- Get the result back as a clean DataFrame
The pattern
# Total revenue per region – one line
df.groupby("region")["amount"].sum()
That's it. That replaces a column of SUMIFS formulas, plus the formula maintenance, plus the broken refs when rows shift.
Multiple aggregations
df.groupby("region")["amount"].agg(["sum", "mean", "count"])
df.groupby("region").agg(
total_revenue = ("amount", "sum"),
avg_order = ("amount", "mean"),
n_orders = ("order_id", "count"),
)
The second form (named aggregations) is the cleanest – output columns are named exactly what you want.
Multiple group keys
df.groupby(["region", "quarter"])["amount"].sum()
Get a clean DataFrame back
By default, groupby returns a Series or a DataFrame with a hierarchical index. Add .reset_index() for a flat one:
summary = (df.groupby(["region", "quarter"])["amount"]
.sum()
.reset_index())
Filtering within groups
# Keep only groups with more than 10 orders
df.groupby("region").filter(lambda g: len(g) > 10)
# Top 3 per group
df.sort_values("amount", ascending=False).groupby("region").head(3)
Walkthrough: the classic regional summary
Read
df = pd.read_csv("orders.csv", parse_dates=["date"])
df["quarter"] = df["date"].dt.to_period("Q").astype(str)
Summary
summary = (df.groupby(["region", "quarter"])
.agg(total = ("amount", "sum"),
count = ("order_id", "count"),
avg = ("amount", "mean"))
.reset_index())
print(summary.head(10))
Save
summary.to_excel("regional_summary.xlsx", index=False)
💡 Mental shortcut
"Group by X then aggregate" = the moment you'd reach for a pivot table in Excel.
Key takeaways
- df.groupby("col")["other"].sum() = SUMIF for a whole column.
- Use named aggregations for clean output: .agg(name=("col", "func")).
- .reset_index() to get a flat DataFrame back.
Per-customer summary
From an orders DataFrame, build a per-customer summary with: total spend, number of orders, average order value.
09.6 – Pivot tables – pivot_table()
The Excel pivot table, one function call.
What you'll learn
- Build a pivot with index/columns/values
- Add multiple value columns and aggregations
- Handle missing cells with fill_value
The basic pivot
df.pivot_table(index="region",
columns="quarter",
values="amount",
aggfunc="sum")
Rows = region. Columns = quarter. Cells = total amount.
Multiple aggregations
df.pivot_table(index="region",
columns="quarter",
values="amount",
aggfunc=["sum", "mean", "count"])
Multiple value columns
df.pivot_table(index="region",
values=["amount", "quantity"],
aggfunc={"amount": "sum", "quantity": "sum"})
Fill empty cells
df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum", fill_value=0)
With margins (totals row + column)
df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum",
margins=True, margins_name="Total")
crosstab – quick frequency pivot
pd.crosstab(df["region"], df["status"])
pd.crosstab(df["region"], df["status"], normalize="index") # row percentages
Key takeaways
- pivot_table(index, columns, values, aggfunc) = Excel pivot.
- fill_value=0 stops NaN holes.
- margins=True adds totals.
- pd.crosstab is the shortcut for frequency tables.
Sales by product × month
Build a pivot of total sales with products as rows and months as columns. Add a totals row and column.
09.7 – Joining and merging – pandas's VLOOKUP
merge, concat, join – three ways to combine DataFrames.
What you'll learn
- Use pd.merge() for VLOOKUP-style joins
- Pick the right join type (inner/left/right/outer)
- Stack DataFrames with pd.concat()
The basic merge – like XLOOKUP
orders = pd.read_csv("orders.csv") # has customer_id
customers = pd.read_csv("customers.csv") # has customer_id and name, city, tier
combined = orders.merge(customers, on="customer_id")
Different column names on each side
orders.merge(customers, left_on="cust_id", right_on="id")
Join types
| How | What you get |
| how="inner" (default) | Only rows present on both sides. |
| how="left" | All rows from the left; matching info from the right (NaN if no match). |
| how="right" | All rows from the right. |
| how="outer" | Every row from either side; NaN where the other side has no match. |
combined = orders.merge(customers, on="customer_id", how="left")
This is the "VLOOKUP that doesn't lose rows when there's no match" – your default for enrichment joins.
Checking the join with indicator
combined = orders.merge(customers, on="customer_id",
how="left", indicator=True)
combined["_merge"].value_counts()
# left_only 42 – orders with no matching customer (a red flag)
# both 958
Stacking with concat
jan = pd.read_csv("jan.csv")
feb = pd.read_csv("feb.csv")
quarter = pd.concat([jan, feb], ignore_index=True)
Walkthrough: enrich orders with customer info
Load
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
Join
enriched = orders.merge(customers, on="customer_id",
how="left", indicator=True)
Sanity-check the join
print(enriched["_merge"].value_counts())
unmatched = enriched[enriched["_merge"] == "left_only"]
print(f"{len(unmatched)} orders without a matching customer")
Drop the indicator and save
enriched = enriched.drop(columns=["_merge"])
enriched.to_csv("orders_enriched.csv", index=False)
Key takeaways
- df.merge(other, on="key", how="left") is VLOOKUP without the heartbreak.
- Always use indicator=True on a one-off to sanity-check matches.
- pd.concat([...]) stacks DataFrames vertically.
Three-way join
Given orders, customers, and products tables, build a fully enriched DataFrame containing customer name and product name on each order.
09.8 – Applying functions – apply, map, vectorisation
Run a function across every row or every cell, and when to skip it.
What you'll learn
- Use .apply() on a Series and a DataFrame
- Use .map() for simple value-to-value substitutions
- Prefer vectorised operations when possible
The hierarchy: vectorised > apply > loop
The fastest pandas code uses built-in vectorised operations. Reach for apply when there's no built-in, and only loop manually as a last resort.
Vectorised – the goal
df["total"] = df["price"] * df["quantity"] * (1 + df["tax_rate"])
.map() – value-to-value lookup on a Series
tier_label = {"A": "Gold", "B": "Silver", "C": "Bronze"}
df["tier_name"] = df["tier"].map(tier_label)
.apply() on a Series
def classify(amount):
if amount > 1000: return "Big"
if amount > 100: return "Medium"
return "Small"
df["bucket"] = df["amount"].apply(classify)
.apply() on a DataFrame (row-wise)
def full_label(row):
return f"{row['region']}/{row['product']}"
df["label"] = df.apply(full_label, axis=1)
The vectorised replacement
Almost anything you'd write with apply has a faster vectorised form:
# Slower
df["bucket"] = df["amount"].apply(classify)
# Faster
import numpy as np
df["bucket"] = np.select(
[df["amount"] > 1000, df["amount"] > 100],
["Big", "Medium"],
default="Small")
💡 The lazy rule
Use apply when you're prototyping. Switch to vectorised when the data is large and performance matters.
Key takeaways
- Vectorised pandas is fast; apply is slow; manual loops are slowest.
- .map(dict) is the cleanest way to translate values.
- .apply(func, axis=1) runs func(row) on each row.
Tier classification
Given a column of customer LTV values, add a tier column: Gold > 5000, Silver > 1000, Bronze otherwise. Solve with apply; then rewrite with np.select.
Module 10 – Data Cleaning and Transformation
Fix dates that won't sort, names with stray spaces, half-empty columns, and every other mess that real data throws at you.
10.1 – Handling missing values
NaN, dropna, fillna – what's missing and what to do about it.
What you'll learn
- Detect missing values
- Drop rows or columns with too many gaps
- Fill missing values sensibly
Spotting missing values
df.isna().sum() # missing count per column
df.isna().mean() * 100 # percentage per column
df[df["email"].isna()] # rows where email is missing
Dropping
df.dropna() # drop any row with any NaN
df.dropna(subset=["email", "customer"]) # only drop if those cols are NaN
df.dropna(axis=1) # drop columns with NaN
df.dropna(thresh=5) # keep rows with ≥5 non-null values
Filling
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna("Unknown")
# Fill different columns differently
df = df.fillna({"age": 0, "country": "Unknown", "score": df["score"].mean()})
Forward / back fill (time series)
df["price"] = df["price"].ffill() # carry last known forward
df["price"] = df["price"].bfill() # use the next known backward
Walkthrough: clean a survey export
Inspect first
df.isna().sum().sort_values(ascending=False).head(10)
Drop unusable rows
df = df.dropna(subset=["respondent_id", "submitted_at"])
Fill the rest with sensible defaults
df["age"] = df["age"].fillna(df["age"].median())
df["region"] = df["region"].fillna("Unknown")
df["score"] = df["score"].fillna(0)
Key takeaways
- isna().sum() shows you where the holes are.
- dropna(subset=[...]) is safer than blanket dropna.
- fillna() can take a single value, a dict, or a forward/back fill.
Missing-value audit
Load any messy CSV. Print a table of column name, missing count, and missing percentage, sorted by worst first.
10.2 – Finding and removing duplicates
duplicated, drop_duplicates, and how to decide which copy to keep.
What you'll learn
- Find duplicate rows
- Drop duplicates by all or some columns
- Pick which duplicate to keep (first/last/by ranking)
Find them
df.duplicated().sum() # how many full-row dupes
df.duplicated(subset=["email"]).sum() # dupes by email
df[df.duplicated(subset=["email"], keep=False)] # show all dupe groups
Drop them
df.drop_duplicates() # exact-row dupes
df.drop_duplicates(subset=["email"]) # by email, keep first
df.drop_duplicates(subset=["email"], keep="last")
Keep the "best" duplicate
If a customer has multiple rows and you want the most recent:
df = (df.sort_values("date")
.drop_duplicates(subset=["email"], keep="last"))
Walkthrough: dedupe a contact list keeping the highest-value entry
Sort by what makes "best" mean something
df = df.sort_values("lifetime_value", ascending=False)
Drop, keeping the first (= the highest LTV)
df = df.drop_duplicates(subset=["email"], keep="first")
Key takeaways
- duplicated() returns a boolean Series; drop_duplicates() drops them.
- subset= lets you check duplicates by chosen columns only.
- Sort first, then drop_duplicates with keep="first" to keep the "best" one.
Dedupe orders
Given an orders table, dedupe by order_id, keeping the row with the latest last_modified.
10.3 – String cleanup at scale
str.strip, str.lower, str.replace, regex – apply to a whole column.
What you'll learn
- Apply string methods to whole columns
- Use regex via .str.replace
- Split a column into multiple columns
The .str accessor
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["country"] = df["country"].str.upper()
Anywhere you'd call a string method on one string, you call .str.method() on a column.
Replace
df["amount_text"] = df["amount_text"].str.replace(",", "")
df["amount"] = df["amount_text"].astype(float)
Regex replace
# Keep only digits
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
# Strip parentheses and hyphens
df["phone"] = df["phone"].str.replace(r"[()\-\s]", "", regex=True)
Split a column
# "Alice Smith" β first / last
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)
Extract with regex
# pull domain from email
df["domain"] = df["email"].str.extract(r"@(.+)$")
Walkthrough: clean a column of free-text phone numbers
Start
df["phone"]
# "(415) 555-1212"
# " 415.555.1213 "
# "415-555-1214 ext.500"
Normalise
df["phone_clean"] = (df["phone"].str.strip()
.str.replace(r"\D", "", regex=True)
.str[:10]) # keep first 10 digits
Key takeaways
- .str.method() applies any string method to a whole column.
- regex=True unlocks powerful pattern-based replacement.
- .str.split(..., expand=True) splits one column into many.
Email cleanup
Given a column of messy emails, produce a clean lowercase column AND a new column containing just the domain.
10.4 – Working with dates in pandas
to_datetime, .dt accessor, date arithmetic, resampling.
What you'll learn
- Convert string columns to datetime
- Pull year/month/day/weekday from a date column
- Resample a time series
Parse first
df["date"] = pd.to_datetime(df["date"], errors="coerce")
errors="coerce" turns unparsable strings into NaT (missing date) instead of crashing.
Pull parts out
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dow"] = df["date"].dt.day_name() # 'Saturday'
df["q"] = df["date"].dt.to_period("Q") # PeriodIndex 2026Q2
df["month_start"] = df["date"].dt.to_period("M").dt.to_timestamp()
Date arithmetic
df["days_open"] = (df["closed"] - df["opened"]).dt.days
df["due_date"] = df["created"] + pd.Timedelta(days=30)
df[df["date"] >= "2026-01-01"]
Resampling (group by time bucket)
monthly = (df.set_index("date")["amount"]
.resample("M")
.sum())
weekly = df.set_index("date")["amount"].resample("W").mean()
Walkthrough: monthly revenue trend
Parse and group
df["date"] = pd.to_datetime(df["date"])
monthly = (df.groupby(df["date"].dt.to_period("M"))["amount"]
.sum()
.reset_index())
monthly["date"] = monthly["date"].astype(str)
Save / plot
monthly.to_excel("monthly_revenue.xlsx", index=False)
Key takeaways
- pd.to_datetime(..., errors="coerce") is the safe parse.
- .dt gives you year, month, day, day_name, to_period.
- .resample("M") on a datetime index gives quick time-bucket aggregates.
Weekly average
From a transactions DataFrame with date + amount, compute the average daily revenue per week of the year.
10.5 – Numeric cleanup – currencies, percentages, outliers
Strip $ and %, clip outliers, fix obvious typos.
What you'll learn
- Convert currency-formatted strings to floats
- Convert percent-formatted strings to floats
- Cap or remove outliers
Currency strings → floats
df["amount"] = (df["amount"].str.replace("[$,]", "", regex=True)
.astype(float))
Percent strings → floats
# "12.5%" → 0.125
df["margin"] = (df["margin"].str.replace("%", "")
.astype(float) / 100)
Capping outliers (winsorise)
lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount_capped"] = df["amount"].clip(lower=lo, upper=hi)
Removing outliers
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)
clean = df[mask]
Catching obvious typos
# Negative ages, ages over 120, etc.
clean = df.query("0 < age < 120")
Key takeaways
- Strip $ and , with a regex, then .astype(float).
- .clip(lower, upper) caps without removing rows.
- IQR rule (Q1 − 1.5·IQR, Q3 + 1.5·IQR) catches typical outliers.
Cap and report
Cap the top 1% of values in a numeric column. Print how many rows were affected.
10.6 – Reshape – melt and pivot
Turn wide to long and back. The shape-shifting move you'll thank yourself for.
What you'll learn
- Use melt() to turn wide data long
- Use pivot() to turn long data wide
- Recognise when each shape is the right one
Wide vs long
Wide has one row per "thing" and many columns:
region Q1 Q2 Q3 Q4
North 1200 1500 1300 1700
South 900 1100 1050 950
Long has one row per observation:
region quarter amount
North Q1 1200
North Q2 1500
North Q3 1300
North Q4 1700
South Q1 900
...
Long is friendlier for grouping/plotting; wide is friendlier for humans.
Wide → long with melt
long = df.melt(id_vars="region",
value_vars=["Q1","Q2","Q3","Q4"],
var_name="quarter",
value_name="amount")
Long → wide with pivot
wide = long.pivot(index="region", columns="quarter", values="amount")
When to reshape
- Going to a chart? Long is what seaborn and plotly want.
- Going to a printable report? Wide reads better.
- Doing groupby? Long.
Key takeaways
- melt() turns wide to long.
- pivot() turns long to wide.
- Long shape is what most analysis tools prefer.
Quarterly to long
Take a wide DataFrame of quarterly figures by region. Melt it into long form. Then pivot it back.
10.7 – Categoricals and ordered types
Save memory and enable ordered comparisons.
What you'll learn
- Convert a column to categorical
- Define an ordered categorical
- Use the savings to handle bigger datasets
Why categoricals
If a column has 5 million rows but only 4 distinct values ("North", "South", "East", "West"), storing it as a categorical can cut its memory use by 90% and speed up groupby.
df["region"] = df["region"].astype("category")
Ordered categoricals
from pandas.api.types import CategoricalDtype
size_order = CategoricalDtype(categories=["XS","S","M","L","XL"], ordered=True)
df["size"] = df["size"].astype(size_order)
df.sort_values("size") # sorts in size order, not alphabetical
df[df["size"] >= "L"] # comparisons work
Memory check
df.memory_usage(deep=True)
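A quick sketch to see the saving for yourself (synthetic data; exact numbers will vary):
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(["North", "South", "East", "West"], size=1_000_000))
print(f"object:   {s.memory_usage(deep=True):,} bytes")
print(f"category: {s.astype('category').memory_usage(deep=True):,} bytes")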
Key takeaways
- Repeating-string columns benefit hugely from .astype("category").
- Ordered categoricals let you sort and compare logically.
- Useful when your DataFrame starts pushing your laptop's memory.
Memory diet
Take any DataFrame with a low-cardinality text column. Convert it to categorical. Print before/after memory usage.
Module 11 – Data Visualization
Matplotlib basics, pandas plotting, seaborn for prettier defaults, and how to drop charts into Excel.
11.1 – Matplotlib basics
fig, ax, plot, title, save. The foundation everything else builds on.
What you'll learn
- Create a figure and axes
- Plot lines, bars, and scatter
- Save the chart to PNG
The basic shape
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot([1, 2, 3, 4], [10, 25, 20, 35])
ax.set_title("My first chart")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("chart.png", dpi=150)
plt.show()
Every chart you make will follow this skeleton: build a figure, draw onto axes, label, save/show.
Three chart types in three lines
ax.plot(x, y) # line
ax.bar(categories, values) # bar
ax.scatter(x, y) # scatter
Style sample
ax.plot(x, y, color="#217346", linewidth=2, marker="o")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False) # cleaner look
Walkthrough: monthly trend
Data
months = ["Jan","Feb","Mar","Apr","May","Jun"]
revenue = [12, 15, 18, 14, 22, 27]
Plot
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, revenue, marker="o", color="#2b6cb0", linewidth=2)
ax.set_title("Monthly Revenue (in thousands)")
ax.set_ylabel("$k")
ax.grid(alpha=0.3)
ax.spines[["top","right"]].set_visible(False)
plt.tight_layout()
plt.savefig("monthly_revenue.png", dpi=150)
Key takeaways
- fig, ax = plt.subplots(figsize=(w, h)) is the starting point.
- ax.plot / ax.bar / ax.scatter are the three you'll use most.
- Always set a title and axis labels. Save with plt.savefig(..., dpi=150).
Bar chart
Make a bar chart of the same monthly revenue data. Make Jan and Feb grey, the other months green.
11.2 – Plotting directly from a DataFrame
df.plot() – the fastest way from data to picture.
What you'll learn
- Use df.plot() to chart a Series
- Plot multiple columns at once
- Pick the right kind (line/bar/area/hist)
The one-liner
df.set_index("date")["amount"].plot(figsize=(8, 4), title="Daily Revenue")
Multiple columns
df.set_index("date")[["amount", "forecast"]].plot()
Kinds
df["amount"].plot(kind="line")
df["amount"].plot(kind="hist", bins=30)
df["region"].value_counts().plot(kind="bar")
df["region"].value_counts().plot(kind="pie") # use sparingly!
df.plot(kind="scatter", x="qty", y="amount")
Grouped bars from a pivot
pivot = df.pivot_table(index="region", columns="quarter",
values="amount", aggfunc="sum")
pivot.plot(kind="bar")
Stacked bars
pivot.plot(kind="bar", stacked=True)
Key takeaways
- df.plot() is the quickest path from data to chart.
- kind= picks the chart type.
- For grouped/stacked bars, pivot first then plot.
Side-by-side bars
Pivot orders by region (rows) and quarter (columns). Plot the result as a grouped bar chart.
11.3 – Seaborn – prettier defaults, less code
When you want a chart that looks publication-ready out of the box.
What you'll learn
- Use seaborn's most common charts
- Take advantage of automatic styling
- Plot from long-form DataFrames
Setup
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
Bar chart
sns.barplot(data=df, x="region", y="amount", estimator="sum", errorbar=None)
Scatter with colour-coding
sns.scatterplot(data=df, x="qty", y="amount", hue="region")
Distribution
sns.histplot(data=df, x="amount", hue="region", multiple="stack")
Heatmap
pivot = df.pivot_table(index="dow", columns="hour",
values="orders", aggfunc="sum")
sns.heatmap(pivot, cmap="YlGnBu", annot=False)
Faceted small-multiples
g = sns.FacetGrid(df, col="region", col_wrap=2)
g.map_dataframe(sns.lineplot, x="date", y="amount")
💡 Seaborn loves long data
Seaborn assumes your data is in long form (one row per observation). If your data is wide, melt it first (Module 10, Lesson 6), as in the sketch below.
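A minimal sketch of that melt-then-plot flow (tiny made-up table):
import pandas as pd
import seaborn as sns

wide = pd.DataFrame({"region": ["North", "South"],
                     "Q1": [1200, 900], "Q2": [1500, 1100]})
long = wide.melt(id_vars="region", var_name="quarter", value_name="amount")
sns.barplot(data=long, x="quarter", y="amount", hue="region")  # hue= does the grouping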
Key takeaways
- Seaborn = matplotlib with prettier defaults and fewer lines.
- Use long-form DataFrames; let hue= handle grouping.
- Heatmaps and facets are seaborn superpowers Excel can't easily match.
Hour-of-day heatmap
Given an orders table with date+hour, build a heatmap of total orders by day-of-week × hour-of-day.
11.4 – Formatting for stakeholders
Removing chart-junk, labelling totals, and saving at the right DPI.
What you'll learn
- Remove unnecessary chart-junk
- Add data labels on top of bars
- Save at print quality
Less is more
ax.spines[["top","right"]].set_visible(False)
ax.grid(axis="y", alpha=0.3)
ax.set_xlabel("") # X axis label often redundant
ax.tick_params(left=False) # hide ticks
Labels on bars
bars = ax.bar(months, revenue, color="#217346")
for bar, value in zip(bars, revenue):
ax.text(bar.get_x() + bar.get_width()/2,
bar.get_height() + 0.5,
f"${value:,.0f}",
ha="center", va="bottom", fontsize=9)
Save
plt.tight_layout()
plt.savefig("revenue.png", dpi=200, bbox_inches="tight")
Colours for accessibility
About 1 in 12 people has red-green colour blindness. When using colour as a meaningful signal, prefer a colour-blind-safe palette:
sns.set_palette("colorblind")
Key takeaways
- Drop the chart-junk – top/right spines, redundant labels.
- Label totals on bars; don't make the reader squint.
- Save at dpi=200 for slides / print.
- Use colour-blind-safe palettes.
Polish a chart
Take your monthly-revenue bar chart from the matplotlib lesson. Remove the spines. Add dollar-value labels on top of each bar. Save at 200 dpi.
11.5 – Inserting charts into Excel
Three ways: paste, =PY(), and the openpyxl image insert.
What you'll learn
- Save a chart as PNG and insert it into a sheet
- Use =PY() to embed a chart directly
- Build a sheet with multiple charts in code
Option 1 – Paste a PNG manually
Save the chart, then in Excel: Insert → Pictures → from file. Fastest for one-offs.
Option 2 – =PY() chart
Covered in Module 8. Best for charts that should refresh when the underlying Excel data changes.
Option 3 – Programmatic with openpyxl
import pandas as pd
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
# Generate the chart
df = pd.read_csv("monthly.csv")
fig, ax = plt.subplots(figsize=(7, 3))
df.plot(x="month", y="revenue", ax=ax, marker="o")
plt.tight_layout()
fig.savefig("monthly.png", dpi=150)
plt.close(fig)
# Insert into the workbook
df.to_excel("report.xlsx", index=False, sheet_name="Data")
wb = load_workbook("report.xlsx")
ws = wb.create_sheet("Chart")
img = Image("monthly.png")
ws.add_image(img, "B2")
wb.save("report.xlsx")
Walkthrough: monthly report builder
The plan
One script. Reads a CSV. Writes a 2-sheet Excel: a Data sheet and a Charts sheet with three charts.
The pattern
def make_chart(df, kind, filename, title):
fig, ax = plt.subplots(figsize=(7, 3))
df.plot(kind=kind, ax=ax, color="#217346")
ax.set_title(title)
plt.tight_layout()
fig.savefig(filename, dpi=150)
plt.close(fig)
Then call it three times and insert each via openpyxl.
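A sketch of those three calls plus the insert step (column names and anchor cells are assumptions, not part of the original script):
from openpyxl import load_workbook
from openpyxl.drawing.image import Image

make_chart(df.groupby("region")["revenue"].sum(), "bar", "by_region.png", "Revenue by Region")
make_chart(df.set_index("month")["revenue"], "line", "trend.png", "Monthly Trend")
make_chart(df["revenue"], "hist", "dist.png", "Order Distribution")

wb = load_workbook("report.xlsx")
ws = wb.create_sheet("Charts")
for i, png in enumerate(["by_region.png", "trend.png", "dist.png"]):
    ws.add_image(Image(png), f"B{2 + i * 18}")  # stack the charts roughly 18 rows apart
wb.save("report.xlsx")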
Key takeaways
- One-offs: save PNG, paste into Excel.
- Live: =PY() chart that refreshes.
- Bulk reports: openpyxl add_image in code.
Multi-chart workbook
Build a script that produces an Excel report with a Data sheet, a Bar Chart sheet, and a Line Chart sheet β all from one CSV input.
11.6 – Tiny dashboards with subplots
Multiple charts on one figure – your first "dashboard".
What you'll learn
- Build a 2×2 grid of charts
- Share axes appropriately
- Save the whole thing as one image
The subplots grid
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
ax_a, ax_b = axes[0]
ax_c, ax_d = axes[1]
ax_a.plot(x, y1); ax_a.set_title("Trend")
ax_b.bar(cats, vals); ax_b.set_title("By Category")
ax_c.scatter(x, y2); ax_c.set_title("Scatter")
ax_d.hist(values, bins=20); ax_d.set_title("Distribution")
plt.suptitle("Q1 Dashboard", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.savefig("dashboard.png", dpi=200)
Shared axes
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
Tip: give each panel breathing room
fig.subplots_adjust(hspace=0.4, wspace=0.3)
Key takeaways
- plt.subplots(rows, cols) gives you a grid of axes.
- sharey=True aligns scales for fair comparison.
- One suptitle(), one tight_layout(), save once.
Three-up dashboard
Make a 1×3 dashboard with: monthly revenue line, top-5 product bar, and a distribution histogram. Save as one PNG.
Module 12 – Automating Boring Excel Tasks
Merge 100 files, rename sheets, format dozens of workbooks, send personalised emails – set it once and walk away.
12.1 – Merging many files into one
The everyday "I have a folder of monthly reports" job.
What you'll learn
- Combine every file in a folder
- Tag each row with its source
- Handle a folder of mixed-format files gracefully
The script
from pathlib import Path
import pandas as pd
SRC = Path("monthly")
DST = Path("combined.xlsx")
frames = []
for path in sorted(SRC.glob("*.xlsx")):
df = pd.read_excel(path)
df["source_file"] = path.name
frames.append(df)
print(f" loaded {path.name}: {len(df):,} rows")
combined = pd.concat(frames, ignore_index=True)
combined.to_excel(DST, index=False, sheet_name="All")
print(f"Wrote {len(combined):,} rows to {DST}")
Mixed CSV + XLSX
def read_any(p):
if p.suffix == ".csv":
return pd.read_csv(p)
return pd.read_excel(p)
frames = [read_any(p) for p in SRC.iterdir() if p.suffix in (".csv", ".xlsx")]
Skip broken files
frames = []
errors = []
for path in SRC.glob("*"):
try:
df = read_any(path)
df["source_file"] = path.name
frames.append(df)
except Exception as e:
errors.append((path.name, str(e)))
print(f"Processed: {len(frames)}, Errors: {len(errors)}")
for name, msg in errors:
print(f" {name}: {msg}")
Key takeaways
- folder.glob("*.xlsx") + pd.concat() is the merge skeleton.
- Tag every row with source_file – you'll always want it later.
- Wrap in try/except so one broken file doesn't stop the whole job.
Build a merger
Put 3-5 small CSVs in a folder. Write a script that merges them into one Excel file with a "source" column.
12.2 – Splitting one file into many
Group by a column and save one workbook per group.
What you'll learn
- Iterate over groupby groups
- Write each to a named file
- Create the output folder if needed
The pattern
from pathlib import Path
import pandas as pd
df = pd.read_excel("all_orders.xlsx")
out = Path("per_region"); out.mkdir(exist_ok=True)
for region, group in df.groupby("region"):
safe = region.replace("/", "_")
group.to_excel(out / f"{safe}.xlsx", index=False)
print(f" wrote {safe}.xlsx ({len(group):,} rows)")
One workbook with one sheet per group
with pd.ExcelWriter("by_region.xlsx") as writer:
for region, group in df.groupby("region"):
safe = region.replace("/", "_")[:31] # Excel sheet name limit
group.to_excel(writer, sheet_name=safe, index=False)
⚠️ Sheet name limit
Excel sheet names cap at 31 characters and can't contain : \ / ? * [ ]. Always sanitise.
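A tiny helper along those lines (a sketch – safe_sheet_name is our own name, not an openpyxl function):
import re

def safe_sheet_name(name):
    # swap out every character Excel forbids, then trim to the 31-char cap
    return re.sub(r"[\\/?*\[\]:]", "_", str(name))[:31]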
Key takeaways
- df.groupby("col") yields (key, group_df) tuples.
- One file per group: loop and .to_excel().
- One workbook, many sheets: pd.ExcelWriter + sheet_name=.
Split a big workbook
Take a multi-region orders file and produce one workbook per salesperson, each containing only their rows.
12.3 – Filling a template workbook
Open a pre-formatted .xlsx, drop data into specific cells, save with a new name.
What you'll learn
- Load an Excel template with openpyxl
- Write values into named cells
- Save the filled copy with a unique name
Why templates beat from-scratch reports
Excel templates can have logos, frozen panes, formulas, colours, named ranges – all the spreadsheety polish. With openpyxl you can open the template, plug in values, and save a copy – keeping every bit of formatting.
The script
from openpyxl import load_workbook
from datetime import date
import pandas as pd
template = "templates/invoice.xlsx"
orders = pd.read_csv("orders_to_invoice.csv")
for _, row in orders.iterrows():
wb = load_workbook(template)
ws = wb.active
ws["B2"] = f"Invoice {row['invoice_no']}"
ws["B3"] = row["customer"]
ws["B4"] = row["billing_address"]
ws["B5"] = date.today().isoformat()
ws["E10"] = row["quantity"]
ws["E11"] = row["unit_price"]
ws["E13"] = row["tax_rate"]
# formulas in the template will recalc when the file is opened
out = f"invoices/{row['invoice_no']}.xlsx"
wb.save(out)
print(f" wrote {out}")
Use named ranges to make this readable
In Excel: Formulas → Define Name. Then in Python, resolve the name to its sheet and cell (openpyxl 3.1+):
name = wb.defined_names["CustomerName"]   # the defined name
sheet, coord = next(name.destinations)    # e.g. ("TemplateSheet", "$B$3")
wb[sheet][coord.replace("$", "")] = "Acme Ltd"
(For most everyday cases, the ws["B3"] style is fine.)
Key takeaways
- Templates keep formatting; openpyxl just stamps data into specific cells.
- Loop a DataFrame, one template-copy per row.
- Formulas in the template still work – they recalc on open.
Receipt printer
Build a tiny Excel "receipt" template with a logo and a totals formula. Loop a small DataFrame and generate one receipt per row.
12.4 – Renaming and organising files in bulk
pathlib.rename, glob, regex – your weekend file-cleanup script.
What you'll learn
- Bulk-rename files with a regex
- Move files into dated subfolders
- Always preview before renaming
Rename pattern
from pathlib import Path
import re
folder = Path("downloads")
for p in folder.glob("Report_*.xlsx"):
new_name = re.sub(r"Report_(\d{4})(\d{2})(\d{2})",
r"report-\1-\2-\3",
p.stem) + p.suffix
print(f"{p.name} β {new_name}")
# p.rename(p.with_name(new_name)) # uncomment when ready
⚠️ Always dry-run first
Print the planned renames in a first pass. Only enable the actual p.rename() after you've eyeballed the output.
Move into dated folders
from datetime import datetime
for p in folder.glob("*.csv"):
mtime = datetime.fromtimestamp(p.stat().st_mtime)
target = folder / mtime.strftime("%Y-%m")
target.mkdir(exist_ok=True)
p.rename(target / p.name)
Key takeaways
- Always dry-run renames before doing them.
- Use regex for structured filename patterns.
- Sort by modification date to organise into month folders.
Downloads cleaner
Sketch a script (don't enable rename yet) that moves every .pdf in your Downloads folder into Downloads/PDFs/ and every .xlsx into Downloads/Spreadsheets/.
12.5 – Sending emails from Python
smtplib for the standard library way; better libraries for Gmail/Outlook.
What you'll learn
- Send a plain-text email with smtplib
- Attach a file
- Use environment variables for credentials
The standard library way
import smtplib
import os
from email.message import EmailMessage
msg = EmailMessage()
msg["Subject"] = "Monthly Report"
msg["From"] = "you@yourdomain.com"
msg["To"] = "boss@yourdomain.com"
msg.set_content("Hi β attached is the monthly report.\n\nThanks,\nYou")
# Attach a file
with open("report.xlsx", "rb") as f:
msg.add_attachment(
f.read(),
maintype="application",
subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
filename="report.xlsx",
)
with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
smtp.send_message(msg)
print("Sent.")
⚠️ Never paste passwords in code
Use environment variables (os.environ["EMAIL_PASS"]), a .env file loaded with python-dotenv, or a system secret manager. Never check credentials into version control.
Gmail and Outlook today
Big providers now require app-specific passwords or OAuth. The cleanest routes:
- Gmail: enable 2-factor authentication, generate an app password, use it instead of your real password.
- Outlook 365: use the O365 library (pip install O365), which wraps the Microsoft Graph API.
Personalised mass email
from pathlib import Path  # for Path(...).name on the attachment below

for _, row in customers.iterrows():
msg = EmailMessage()
msg["Subject"] = f"Your {row['month']} statement"
msg["From"] = "you@yourdomain.com"
msg["To"] = row["email"]
msg.set_content(f"""Hi {row['first_name']},
Your statement for {row['month']} is attached.
Best,
You""")
with open(row["attachment"], "rb") as f:
msg.add_attachment(f.read(),
maintype="application",
subtype="pdf",
filename=Path(row["attachment"]).name)
with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
smtp.login(os.environ["EMAIL_USER"], os.environ["EMAIL_PASS"])
smtp.send_message(msg)
Key takeaways
- smtplib sends mail; EmailMessage formats it.
- Use app passwords or OAuth – never your real password.
- Personalised mass-mail = loop over a DataFrame, one message per row.
Test send
Send yourself a test email with an attached CSV. Confirm it lands. Then do the same for three rows of a small DataFrame.
12.6 – Scheduling: run it automatically
Make a script run on its own – cron (Mac/Linux), Task Scheduler (Windows), or a Python scheduler library.
What you'll learn
- Schedule a script daily on Mac
- Schedule a script daily on Windows
- Know when to use a Python scheduler instead
Mac and Linux – cron
In Terminal: crontab -e. Add a line:
# minute hour day month weekday command
0 8 * * 1-5 /opt/anaconda3/bin/python /Users/you/scripts/morning_report.py >> ~/cron.log 2>&1
This runs at 08:00 on weekdays.
Windows – Task Scheduler
- Start menu → Task Scheduler → Create Task.
- Triggers → Daily, 8:00 AM.
- Actions → Start a program → python.exe; arguments: C:\path\to\morning_report.py.
- Save.
The script must be self-sufficient
A scheduled script can't ask the user for input. Hard-code paths or read them from environment variables / a config file, as in the sketch below.
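A minimal sketch, assuming REPORT_SRC / REPORT_DST are environment variable names you define yourself:
import os
from pathlib import Path

# Fall back to hard-coded defaults when the variables aren't set
SRC = Path(os.environ.get("REPORT_SRC", "/Users/you/data/orders.csv"))
DST = Path(os.environ.get("REPORT_DST", "/Users/you/reports"))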
Logging – so you know it ran
import logging
from datetime import datetime

start = datetime.now()  # so the final log line can report the duration
logging.basicConfig(
    filename="morning_report.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s",
)
logging.info("Starting morning report")
# ... do work ...
logging.info(f"Completed in {(datetime.now()-start).total_seconds():.1f}s")
Python-native scheduling
For a long-running script that schedules itself:
import schedule, time
def job():
print("Running daily report")
schedule.every().day.at("08:00").do(job)
while True:
schedule.run_pending()
time.sleep(60)
Useful inside Docker containers; for laptop use, cron / Task Scheduler is simpler.
Key takeaways
- cron on Mac/Linux; Task Scheduler on Windows.
- Scheduled scripts must run without prompts – hard-code or config-file the inputs.
- Always log; you want to know if it silently failed.
Schedule a 'hello'
Schedule a tiny Python script (one line: append today's date to a log file) to run daily. Confirm it ran tomorrow.
12.7 – An automation recipe book
Eight common boring-Excel jobs and the 20-line scripts that automate them.
What you'll learn
- Recognise which of these matches your daily work
- Adapt the recipe to your own data
- Save them in a shared script folder
1 – Add today's date to every filename in a folder
from pathlib import Path
from datetime import date
stamp = date.today().isoformat()
for p in Path("inbox").glob("*.xlsx"):
p.rename(p.with_name(f"{stamp}_{p.name}"))
2 – Find every sheet across a folder of workbooks containing a keyword
from openpyxl import load_workbook
matches = []
for p in Path("archive").rglob("*.xlsx"):
wb = load_workbook(p, read_only=True, data_only=True)
for ws in wb.worksheets:
for row in ws.iter_rows(values_only=True):
if any(c == "URGENT" for c in row if isinstance(c, str)):
matches.append((p.name, ws.title))
break
print(matches)
3 – Roll a year of weekly reports into one DataFrame
import pandas as pd
frames = [pd.read_excel(p) for p in Path("weekly_2026").glob("week_*.xlsx")]
pd.concat(frames, ignore_index=True).to_excel("weekly_2026_combined.xlsx", index=False)
4 – Find rows that exist this month but didn't last month (new customers)
this_m = set(pd.read_excel("apr.xlsx")["email"])
last_m = set(pd.read_excel("mar.xlsx")["email"])
new = this_m - last_m
print(f"{len(new)} new customers")
5 – Convert all CSVs in a folder to Excel
for p in Path(".").glob("*.csv"):
pd.read_csv(p).to_excel(p.with_suffix(".xlsx"), index=False)
6 – Rename a sheet across every workbook in a folder
for p in Path("reports").glob("*.xlsx"):
wb = load_workbook(p)
if "Sheet1" in wb.sheetnames:
wb["Sheet1"].title = "Data"
wb.save(p)
7 – Email the boss a CSV every Monday morning
# See Lesson 5 – combine with Lesson 6 scheduling.
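A sketch of the glue, assuming send_report() is your own wrapper around the Lesson 5 smtplib code and the script runs daily via Lesson 6:
from datetime import date

if date.today().weekday() == 0:   # Monday is weekday 0
    send_report()                 # hypothetical wrapper around the Lesson 5 send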
8 – Build a daily PDF dashboard
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
df.groupby("region")["amount"].sum().plot(kind="bar", ax=ax)
ax.set_title(f"Revenue {date.today():%Y-%m-%d}")
fig.savefig(f"dashboards/{date.today():%Y-%m-%d}.pdf")
Key takeaways
- Most "automate the boring" jobs are 5-20 lines.
- Once written, schedule them and forget them.
- Build a personal recipe book – your future self will thank you.
Adapt one
Pick the recipe closest to a real task you do. Spend 30 minutes adapting it to your own data and folder.
Module 13 – Scenario: Finance & Accounting
Reconciliations, variance analysis, budget consolidation, invoice processing, and the small daily wins that compound.
This module follows a fictional accountant – call her Priya – through a real month. Each lesson is a story: a problem on Priya's desk, the Excel-only way, and the Python solution that saves an hour. Copy each script. Adapt it to your numbers.
13.1 – User story: bank reconciliation
Match GL transactions to a bank statement. Find the unmatched.
What you'll learn
- Load two tables and align them
- Use merge with indicator to find unmatched rows
- Handle near-matches (same amount, different date)
The story
Priya gets two files at month-end: gl_april.xlsx (the general ledger from the accounting system) and bank_april.csv (the bank statement). She has to find every transaction on one side that isn't on the other and investigate it.
In Excel: copy both into a workbook, build a helper column of date+amount, VLOOKUP across, scroll through #N/A results. Two hours.
In Python: 25 lines. Five minutes. Below.
The script
import pandas as pd
from pathlib import Path
gl = pd.read_excel("gl_april.xlsx")
bank = pd.read_csv("bank_april.csv")
# Normalise both sides to the same key
def keyify(df, date_col, amount_col):
df = df.copy()
df["date"] = pd.to_datetime(df[date_col]).dt.date
df["amount"] = df[amount_col].round(2)
df["key"] = df["date"].astype(str) + "_" + df["amount"].astype(str)
return df
gl_k = keyify(gl, "TxnDate", "Amount")
bank_k = keyify(bank, "PostDate", "Amount")
merged = gl_k.merge(bank_k, on="key", how="outer",
suffixes=("_gl", "_bank"), indicator=True)
unmatched_gl = merged[merged["_merge"] == "left_only"]
unmatched_bank = merged[merged["_merge"] == "right_only"]
with pd.ExcelWriter("reconciliation_april.xlsx") as w:
unmatched_gl.to_excel(w, sheet_name="GL only", index=False)
unmatched_bank.to_excel(w, sheet_name="Bank only", index=False)
merged[merged["_merge"]=="both"].to_excel(w, sheet_name="Matched", index=False)
print(f"GL only: {len(unmatched_gl)}")
print(f"Bank only: {len(unmatched_bank)}")
Handling near-matches
If amounts can be off by a small fee, allow a tolerance:
def near_match(amt_a, amt_b, tol=0.50):
return abs(amt_a - amt_b) <= tol
# For each unmatched GL row, search bank rows within Β±3 days for a near-amount match.
for _, gl_row in unmatched_gl.iterrows():
candidates = bank[
(bank["PostDate"] >= gl_row["TxnDate"] - pd.Timedelta(days=3)) &
(bank["PostDate"] <= gl_row["TxnDate"] + pd.Timedelta(days=3)) &
(bank["Amount"].apply(lambda a: near_match(a, gl_row["Amount"])))
]
if len(candidates) == 1:
print(f"Possible fuzzy match for GL {gl_row['TxnId']} β Bank {candidates.iloc[0]['BankRef']}")
Video walkthrough script (for recording)
Show both raw files in Excel
"Here's the GL. Here's the bank. Look β same month, but different columns and different date formats. Reconciling these manually is a nightmare."
Open Jupyter, load both
"pd.read_excel for the GL, pd.read_csv for the bank. Two lines."
Build the match key
"Both sides get a date and a 2-decimal amount, joined as a string. That's the merge key."
Outer merge with indicator
"This is the magic line. how='outer' keeps everything from both sides; indicator=True tells us which side each row came from."
Split into three sheets and save
"Matched, GL only, Bank only β each gets its own sheet. The file lands ready to email to the controller."
Key takeaways
- The reconciliation skeleton: build a merge key, outer-merge with indicator, split by _merge.
- Tolerance match handles bank fees and tiny rounding differences.
- One script, multiple sheets in the output — ready for the controller.
Your own recon
Build a recon between two of your own CSVs. The match key can be whatever uniquely identifies a transaction in both systems (txn ID, date+amount, invoice number).
13.2 — User story: budget vs actual variance
Compare actuals to budget by department, flag the worrying lines.
What you'll learn
- Join budget and actual data
- Compute variance and percent variance
- Flag and sort by exposure
The story
Priya needs a one-page variance report for the CFO. Every department, every GL account, budget vs actual, with the worst variances at the top.
The script
import pandas as pd
budget = pd.read_excel("budget_2026.xlsx") # cols: dept, account, month, budget
actual = pd.read_excel("actual_2026.xlsx") # cols: dept, account, month, actual
merged = budget.merge(actual,
on=["dept", "account", "month"],
how="outer").fillna(0)
merged["variance"] = merged["actual"] - merged["budget"]
merged["variance_pct"] = (merged["variance"] / merged["budget"].replace(0, pd.NA)) * 100
# Flag the worrying lines
def flag(row):
if abs(row["variance"]) > 50_000: return "MATERIAL"
if abs(row["variance_pct"] or 0) > 20: return "HIGH-PCT"
return ""
merged["flag"] = merged.apply(flag, axis=1)
# Sort: flagged first ("MATERIAL" > "HIGH-PCT" > ""), then by absolute variance
report = (merged.assign(abs_var=merged["variance"].abs())
          .sort_values(["flag", "abs_var"], ascending=[False, False])
          .drop(columns="abs_var"))
report.to_excel("variance_april.xlsx", index=False)
Make the output executive-friendly
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill
wb = load_workbook("variance_april.xlsx")
ws = wb.active
# Headers: bold, green
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = PatternFill("solid", fgColor="217346")
# Highlight MATERIAL rows
red = PatternFill("solid", fgColor="FEE2E2")
for row in ws.iter_rows(min_row=2):
if row[ws.max_column - 1].value == "MATERIAL":
for cell in row:
cell.fill = red
# Number formats: currency for budget/actual/variance (cols D-F), percent for variance_pct (col G)
for col in "DEF":
    for cell in ws[col][1:]:
        cell.number_format = '"$"#,##0;[Red]("$"#,##0)'
for cell in ws["G"][1:]:
    cell.number_format = '0.0"%"'
# Freeze pane + widths
ws.freeze_panes = "A2"
for col, w in zip("ABCDEFGH", [10, 12, 10, 14, 14, 14, 10, 14]):
ws.column_dimensions[col].width = w
wb.save("variance_april.xlsx")
Walkthrough for video
Open both source files
"Budget. Actual. Same shape but different numbers."
One merge, one subtraction
"merge(on=[dept, account, month]) joins them. Two assignments compute variance and percent variance."
Apply the flag
"Anything over $50k absolute or 20% relative gets a label."
Save and style
"Save with pandas, restyle with openpyxl. Red rows for the lines that matter."
Key takeaways
- Variance reports are merge + arithmetic + sort.
- Always flag both absolute and percentage variance — a 50% variance on $200 doesn't matter.
- A styling pass at the end makes it CFO-ready.
Add a sparkline column
Extend the report: add a "trend" column with the last six months of actuals as a comma-separated list. (Hint: groupby dept+account, take last 6 months, agg into a list.)
13.3 — User story: consolidating subsidiary budgets
Six subsidiaries send six workbooks, slightly differently shaped. Combine them.
What you'll learn
- Load workbooks with mismatched columns
- Normalise to a single schema
- Produce a consolidated rollup
The story
Six subsidiaries send a budget workbook every quarter. Each is slightly different — different column orders, different account naming, sometimes an extra sheet. Priya needs one consolidated rollup.
Step 1 — load and inspect
from pathlib import Path
import pandas as pd
files = list(Path("subs_q2").glob("*.xlsx"))
for p in files:
print(p.name)
df = pd.read_excel(p, sheet_name="Budget")
print(" cols:", list(df.columns))
print(" rows:", len(df))
Step 2 — write a per-file normaliser
COL_MAP = {
# canonical_name : list of possible source names
"subsidiary": ["Sub", "Subsidiary", "Entity", "Company"],
"account": ["Account", "GL Account", "Ledger Acct", "AcctName"],
"month": ["Month", "Period", "MonthEnd"],
"amount": ["Amount", "Budget", "USD", "BudgetUSD"],
}
def rename(df):
out = {}
for canon, options in COL_MAP.items():
for opt in options:
if opt in df.columns:
out[opt] = canon
break
return df.rename(columns=out)
def normalise(df, sub_default):
df = rename(df)
if "subsidiary" not in df.columns:
df["subsidiary"] = sub_default
df["month"] = pd.to_datetime(df["month"]).dt.to_period("M").astype(str)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
return df[["subsidiary", "account", "month", "amount"]]
Step 3 — combine
frames = []
for p in files:
raw = pd.read_excel(p, sheet_name="Budget")
frames.append(normalise(raw, sub_default=p.stem))
combined = pd.concat(frames, ignore_index=True)
combined = combined.dropna(subset=["account", "amount"])
Step 4 — rollups
by_sub_account = (combined.groupby(["subsidiary", "account"])["amount"]
.sum()
.reset_index())
monthly_by_sub = combined.pivot_table(
index="subsidiary", columns="month",
values="amount", aggfunc="sum", fill_value=0,
margins=True, margins_name="Total")
with pd.ExcelWriter("consolidated_q2.xlsx") as w:
combined.to_excel(w, sheet_name="Raw", index=False)
    by_sub_account.to_excel(w, sheet_name="By Sub × Account", index=False)
    monthly_by_sub.to_excel(w, sheet_name="By Sub × Month")
💡 The COL_MAP pattern
A column-mapping dictionary makes new subsidiaries painless to onboard — just add their column names to the options list. The rest of the pipeline doesn't change.
Key takeaways
- Consolidation = load + normalise + concat + rollup.
- A COL_MAP dict absorbs the "everyone names things differently" problem.
- Output multiple sheets: raw, summary, pivot.
Add a sub
Add a seventh subsidiary's file with brand-new column names. Update COL_MAP. Re-run. Confirm nothing else needed to change.
13.4 — User story: invoice batch processing
Read a folder of supplier invoices (Excel), validate, post to the GL feed.
What you'll learn
- Read many invoices from a folder
- Validate each (totals, tax, supplier exists)
- Produce a clean GL import file plus an exceptions report
The story
Suppliers send Excel invoices into a shared folder. Priya checks each, keys them into the GL. She wants this scripted.
Layout
Each invoice has a fixed structure: supplier in B2, invoice no in B3, date in B4, subtotal/tax/total in E10/E11/E12, line items in B6:D8.
The script
import pandas as pd
from openpyxl import load_workbook
from pathlib import Path
valid_suppliers = set(pd.read_excel("master_data.xlsx", sheet_name="Suppliers")["name"])
TAX_RATE = 0.08
rows, errors = [], []
for path in Path("inbox").glob("*.xlsx"):
wb = load_workbook(path, data_only=True)
ws = wb.active
supplier = ws["B2"].value
inv_no = ws["B3"].value
inv_date = ws["B4"].value
subtotal = ws["E10"].value
tax = ws["E11"].value
total = ws["E12"].value
# Validate
if supplier not in valid_suppliers:
errors.append((path.name, f"Unknown supplier: {supplier}")); continue
    if subtotal is None or tax is None or total is None:
errors.append((path.name, "Missing totals")); continue
expected_tax = round(subtotal * TAX_RATE, 2)
if abs(tax - expected_tax) > 0.05:
errors.append((path.name, f"Tax mismatch: {tax} vs expected {expected_tax}"))
continue
if abs(total - (subtotal + tax)) > 0.05:
errors.append((path.name, "Total β subtotal+tax")); continue
rows.append({
"invoice_no": inv_no,
"supplier": supplier,
"date": inv_date,
"subtotal": subtotal,
"tax": tax,
"total": total,
"source": path.name,
})
gl_feed = pd.DataFrame(rows)
gl_feed.to_csv("gl_feed_april.csv", index=False)
if errors:
    pd.DataFrame(errors, columns=["file", "error"]).to_csv("invoice_errors.csv", index=False)
    print(f"⚠️ {len(errors)} errors — see invoice_errors.csv")
print(f"✅ Posted {len(gl_feed)} clean invoices")
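The script above reads only the header cells. If you also need the B6:D8 line items, here is a minimal sketch (it reuses the imports above; the output filename is illustrative):
line_rows = []
for path in Path("inbox").glob("*.xlsx"):
    ws = load_workbook(path, data_only=True).active
    inv_no = ws["B3"].value
    for row in ws["B6:D8"]:                 # each row: (description, qty, unit_price)
        desc, qty, price = (c.value for c in row)
        if desc:                            # skip blank line-item rows
            line_rows.append({"invoice_no": inv_no, "desc": desc,
                              "qty": qty, "unit_price": price})
pd.DataFrame(line_rows).to_csv("gl_lines_april.csv", index=False)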
Move processed files
from pathlib import Path
done = Path("processed"); done.mkdir(exist_ok=True)
for r in rows:
    (Path("inbox") / r["source"]).rename(done / r["source"])
Key takeaways
- Validate at the door — supplier exists, tax math checks, totals tie.
- Produce both a clean output and an exceptions report.
- Move successfully-processed files out of the inbox so re-runs don't double-post.
Add a duplicate check
Extend the script: refuse to post an invoice number that's already in the GL feed from previous runs.
13.5 — User story: accounts receivable aging
Bucket open invoices into Current, 1-30, 31-60, 61-90, and 90+.
What you'll learn
- Compute days since due date
- Bucket with pd.cut
- Produce a pivot by customer Γ age band
The story
Every Monday the controller asks for the AR aging. Priya wants it ready by 9 AM.
The script
import pandas as pd
from datetime import date
ar = pd.read_excel("ar_open.xlsx") # cols: customer, invoice_no, due_date, balance
today = pd.Timestamp(date.today())
ar["due_date"] = pd.to_datetime(ar["due_date"])
ar["days_overdue"] = (today - ar["due_date"]).dt.days.clip(lower=0)
bins = [-1, 0, 30, 60, 90, 99999]
labels = ["Current", "1-30", "31-60", "61-90", "90+"]
ar["bucket"] = pd.cut(ar["days_overdue"], bins=bins, labels=labels)
pivot = ar.pivot_table(index="customer", columns="bucket",
values="balance", aggfunc="sum",
fill_value=0, observed=True)
pivot["Total"] = pivot.sum(axis=1)
pivot = pivot.sort_values("Total", ascending=False)
pivot.to_excel("ar_aging.xlsx")
Highlight worst offenders
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
wb = load_workbook("ar_aging.xlsx")
ws = wb.active
red = PatternFill("solid", fgColor="FECACA")
yellow = PatternFill("solid", fgColor="FEF3C7")
for row in ws.iter_rows(min_row=2):
over90 = row[5].value or 0 # column F = "90+"
over60 = row[4].value or 0
if over90 > 0:
for cell in row: cell.fill = red
elif over60 > 0:
for cell in row: cell.fill = yellow
wb.save("ar_aging.xlsx")
Key takeaways
- pd.cut with explicit bins gives clean labelled buckets.
- Pivot with fill_value=0 avoids holes in the table.
- openpyxl styling at the end makes the report scannable.
Add aging by salesperson
Modify to also pivot by salesperson (one sheet per salesperson) so each can chase their own delinquents.
13.6 — User story: multi-currency consolidation
Convert a multi-currency P&L to USD using daily FX rates.
What you'll learn
- Merge a transactions table with an FX rates table
- Apply the correct daily rate to each transaction
- Roll up to USD totals
The story
Priya consolidates P&L from offices in London (GBP), Tokyo (JPY), and Berlin (EUR). The CFO wants USD totals using the rate on the date of each transaction.
The data
import pandas as pd
tx = pd.read_excel("multi_ccy_tx.xlsx")
# date, office, account, amount, currency
rates = pd.read_csv("fx_rates_2026.csv")
# date, currency, rate_to_usd
The join
tx["date"] = pd.to_datetime(tx["date"])
rates["date"] = pd.to_datetime(rates["date"])
merged = tx.merge(rates, on=["date", "currency"], how="left")
missing = merged["rate_to_usd"].isna().sum()
print(f"Transactions with no rate: {missing}")
# If date doesn't match exactly (weekend), forward-fill from previous business day
if missing > 0:
rates_indexed = rates.set_index(["currency", "date"]).sort_index()
def lookup(row):
try:
sub = rates_indexed.loc[row["currency"]]
avail = sub[sub.index <= row["date"]]
return avail.iloc[-1]["rate_to_usd"] if len(avail) else None
except KeyError:
return None
merged.loc[merged["rate_to_usd"].isna(), "rate_to_usd"] = merged[merged["rate_to_usd"].isna()].apply(lookup, axis=1)
merged["usd_amount"] = (merged["amount"] * merged["rate_to_usd"]).round(2)
Rollup
usd_pnl = (merged.groupby(["office", "account"])["usd_amount"]
.sum()
.unstack(fill_value=0))
usd_pnl["Total"] = usd_pnl.sum(axis=1)
usd_pnl.loc["Total"] = usd_pnl.sum(numeric_only=True)
usd_pnl.to_excel("pnl_usd.xlsx")
Key takeaways
- Multi-currency conversion = merge on (date, currency) + multiply.
- Forward-fill the previous business day's rate for weekend transactions.
- Roll up to totals at the end so the CFO has one number.
Sensitivity
Compute the same P&L using month-average rates instead of daily. By how much does the USD total differ? That difference is your FX translation risk.
13.7 — Project: the monthly close pack
Combine the previous lessons into one script that produces the entire close pack.
What you'll learn
- Orchestrate multiple analyses in one script
- Build a single output Excel with one sheet per analysis
- Add a cover sheet with summary KPIs
The story
Month-end. Priya wants one script to run that produces the whole close pack: a cover sheet, reconciliation, variance, AR aging, FX-translated P&L. One file. One double-click.
The skeleton
"""monthly_close.py β generate the month-end close pack."""
from datetime import date
from pathlib import Path
import pandas as pd
# 1) Recon
def make_recon():
# ... reuse code from Lesson 1
return matched, gl_only, bank_only
# 2) Variance
def make_variance():
# ... reuse code from Lesson 2
return variance_df
# 3) AR aging
def make_aging():
# ... reuse code from Lesson 5
return aging_pivot
# 4) FX P&L
def make_pnl():
# ... reuse code from Lesson 6
return pnl_df
def make_cover(matched, gl_only, bank_only, variance, aging, pnl):
return pd.DataFrame({
"Metric": [
"Period",
"GL transactions",
"Bank transactions",
"GL only", "Bank only",
"Material variances",
"AR over 60 days",
"Total revenue (USD)",
],
"Value": [
f"{date.today():%B %Y}",
len(matched) + len(gl_only),
len(matched) + len(bank_only),
len(gl_only), len(bank_only),
(variance["flag"] == "MATERIAL").sum(),
aging["61-90"].sum() + aging["90+"].sum(),
pnl.loc["Total", "Total"],
],
})
def main():
matched, gl_only, bank_only = make_recon()
variance = make_variance()
aging = make_aging()
pnl = make_pnl()
cover = make_cover(matched, gl_only, bank_only, variance, aging, pnl)
out = f"close_pack_{date.today():%Y-%m}.xlsx"
with pd.ExcelWriter(out) as w:
cover.to_excel(w, sheet_name="Cover", index=False)
matched.to_excel(w, sheet_name="Recon-Matched", index=False)
gl_only.to_excel(w, sheet_name="Recon-GL only", index=False)
        bank_only.to_excel(w, sheet_name="Recon-Bank only", index=False)
variance.to_excel(w, sheet_name="Variance", index=False)
aging.to_excel(w, sheet_name="AR Aging")
pnl.to_excel(w, sheet_name="USD P&L")
print(f"β
Wrote {out}")
if __name__ == "__main__":
main()
Schedule it
Set this script to run automatically on the 1st of every month (Module 12, Lesson 6). Wake up to a finished close pack in your inbox.
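A minimal sketch of the schedule, assuming a Unix-like machine with cron (on Windows, use Task Scheduler as covered in Module 12, Lesson 6); the interpreter and script paths are placeholders:
# crontab entry (minute hour day-of-month month weekday): 07:00 on the 1st of each month
0 7 1 * * /usr/bin/python3 /path/to/monthly_close/monthly_close.py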
Key takeaways
- Compose previous scripts into one orchestrator with a main().
- Each analysis is its own function; main() just glues them.
- A cover sheet with KPIs makes the pack scannable in 30 seconds.
Email it
Extend the script: after building the file, email it to the controller using the email recipe from Module 12, Lesson 5.
Module 14 — Scenario: Data & Business Analyst
Cleaning huge exports, joining tables, building dashboards, ad-hoc analyses, and shipping the results.
Meet Marcus, a business analyst at a mid-sized retailer. His job is roughly 'turn data into decisions'. Here's how Python earns its keep on his typical week.
14.1 — User story: cleaning a 4-million-row export
When the CSV is too big for Excel itself.
What you'll learn
- Open a multi-million-row CSV with pandas
- Filter and sample before doing heavy work
- Save a clean subset back to Excel
The story
Marcus gets a 4M-row CSV of transactions. Excel stops at 1,048,576 rows and drops the rest. He needs all of it cleaned and pivoted.
Step 1 — peek before loading the whole thing
import pandas as pd
sample = pd.read_csv("transactions.csv", nrows=1000)
sample.info() # columns + dtypes
sample.head()
Step 2 — load with the right options
df = pd.read_csv("transactions.csv",
usecols=["date", "store_id", "sku", "qty", "revenue"],
dtype={"store_id": "category", "sku": "category"},
parse_dates=["date"])
print(df.shape) # (4_000_000, 5)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
usecols drops unused columns at load time. dtype="category" on repeating strings cuts memory by ~90%.
Step 3 — filter early
# Last 90 days only
recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=90)]
print(len(recent)) # ~1M
Step 4 — aggregate down to something Excel-friendly
by_store_sku = (recent.groupby(["store_id", "sku"], observed=True)
.agg(units=("qty", "sum"),
revenue=("revenue", "sum"))
.reset_index()
.sort_values("revenue", ascending=False))
by_store_sku.head(20).to_excel("top_skus_per_store.xlsx", index=False)
If even pandas struggles — chunked read
aggregates = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000, parse_dates=["date"]):
chunk = chunk[chunk["date"] >= "2026-01-01"]
aggregates.append(chunk.groupby("store_id")["revenue"].sum())
total = pd.concat(aggregates).groupby(level=0).sum()
print(total)
Key takeaways
- usecols + dtype="category" makes huge files manageable.
- Filter and aggregate early; only the small result needs to land in Excel.
- chunksize lets you stream files bigger than RAM.
Your own big file
Find a CSV bigger than 100k rows in your work. Load it with usecols and dtype. Print memory savings.
14.2 — User story: joining six tables to answer one question
Stars, snowflakes, and the merge dance.
What you'll learn
- Plan a multi-table join
- Verify match rates at every step
- Avoid the silent row-explosion of a bad join
The story
"Who are our top customers in the Northeast, what did they buy last quarter, and what was their average margin?" Six tables: customers, orders, order_items, products, regions, costs.
Map it out first
customers ──(customer_id)── orders ──(order_id)── order_items ──(sku)── products ──(sku)── costs
customers ──(region_id)── regions
Load and check sizes
cu = pd.read_csv("customers.csv")
o = pd.read_csv("orders.csv", parse_dates=["order_date"])
oi = pd.read_csv("order_items.csv")
p = pd.read_csv("products.csv")
r = pd.read_csv("regions.csv")
co = pd.read_csv("costs.csv")
for name, df in [("cu",cu),("o",o),("oi",oi),("p",p),("r",r),("co",co)]:
print(name, df.shape)
Join one step at a time, with indicator
cu_r = cu.merge(r, on="region_id", how="left", indicator=True)
print(cu_r["_merge"].value_counts()) # all should be 'both'
cu_r = cu_r.drop(columns="_merge")
o_cu = o.merge(cu_r, on="customer_id", how="left", indicator=True)
print(o_cu["_merge"].value_counts())
o_cu = o_cu.drop(columns="_merge")
# Items per order — this is the row-multiplying join
items = oi.merge(o_cu, on="order_id", how="inner")
print("rows after items join:", len(items))
# Add product + cost
items = items.merge(p, on="sku", how="left").merge(co, on="sku", how="left")
⚠️ The row explosion
If your join key has duplicates on both sides, you get a Cartesian product. Check len(after) ≤ len(before) × max_dupes. If not, debug.
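A quick guard you can wrap around any step: a sketch with a hypothetical safe_merge helper, assuming a single join-key column:
import pandas as pd
def safe_merge(left, right, on, **kwargs):
    # Worst case: every left row matches every duplicate of its key on the right
    max_dupes = right[on].value_counts().max() if len(right) else 1
    out = left.merge(right, on=on, **kwargs)
    assert len(out) <= len(left) * max_dupes, (
        f"Row explosion on {on!r}: {len(left)} rows -> {len(out)}")
    return out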
Answer the question
last_q = items[items["order_date"] >= "2026-01-01"].copy()  # .copy() so the assignments below don't warn
last_q["revenue"] = last_q["qty"] * last_q["unit_price"]
last_q["margin"] = last_q["revenue"] - last_q["qty"] * last_q["unit_cost"]
ne = last_q[last_q["region_name"] == "Northeast"]
top = (ne.groupby(["customer_id", "customer_name"])
.agg(orders = ("order_id", "nunique"),
revenue = ("revenue", "sum"),
margin = ("margin", "sum"))
.assign(margin_pct = lambda d: (d["margin"] / d["revenue"] * 100).round(1))
.sort_values("revenue", ascending=False)
.head(20)
.reset_index())
top.to_excel("top20_ne_q1.xlsx", index=False)
Key takeaways
- Diagram joins before writing them.
- Use indicator=True on each step the first time you write the script.
- Watch for row explosion on many-to-many joins.
Reverse it
Same data, different question: which products have the highest margin% but the lowest sales volume? (Hidden gems to push.)
14.3 — User story: customer cohort retention
Group customers by signup month, track their retention over time.
What you'll learn
- Compute a cohort label per customer
- Build a cohort Γ period retention matrix
- Render it as a heatmap
The story
"How are our newer customers doing compared to older ones?" The classic cohort retention chart.
Step 1 — label each customer
orders["order_month"] = orders["order_date"].dt.to_period("M")
first_order_month = (orders.groupby("customer_id")["order_month"].min()
.rename("cohort"))
orders = orders.merge(first_order_month, left_on="customer_id", right_index=True)
orders["months_since_join"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)
Step 2 — build the cohort matrix
cohort_counts = (orders.groupby(["cohort", "months_since_join"])["customer_id"]
.nunique()
.unstack(fill_value=0))
cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0).round(3) * 100
Step 3 — heatmap
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(retention.iloc[:, :12], # first 12 months
annot=True, fmt=".0f",
cmap="YlGn", cbar_kws={"label": "Retention %"},
ax=ax)
ax.set_title("Cohort Retention β % of cohort active in each month after signup")
ax.set_xlabel("Months after signup")
ax.set_ylabel("Cohort (signup month)")
plt.tight_layout()
plt.savefig("cohort_retention.png", dpi=200)
Key takeaways
- Cohort = group by signup period; track by months since signup.
- .divide(..., axis=0) turns counts into percentages.
- A heatmap makes retention patterns obvious at a glance.
Revenue cohort
Change the analysis: instead of retention %, show average revenue per active customer by cohort Γ months-since-signup.
14.4 — User story: did the change work? A quick A/B test
Two groups, one metric, was the difference real or noise?
What you'll learn
- Run a two-sample t-test
- Interpret a p-value plainly
- Build a confidence interval for the lift
The story
Marketing tried a new landing page on half the traffic. The new version has 4.2% conversion vs the old 3.8%. Is that a real win or just luck?
The data
import pandas as pd
df = pd.read_csv("ab_test.csv") # visitor_id, group, converted (0/1)
print(df.groupby("group")["converted"].agg(["count","mean","sum"]))
Quick check with scipy
from scipy import stats
a = df.loc[df["group"]=="control", "converted"]
b = df.loc[df["group"]=="treatment", "converted"]
result = stats.ttest_ind(b, a, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
What the p-value really says
p = the probability of seeing a difference at least this big, if there was no real effect. p < 0.05 by convention means "this is unlikely to be noise." Not "definitely real." Not "important." Just "unlikely noise."
Confidence interval for the lift
import numpy as np
p_a, n_a = a.mean(), len(a)
p_b, n_b = b.mean(), len(b)
lift = p_b - p_a
se = (p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b) ** 0.5
ci = (lift - 1.96*se, lift + 1.96*se)
print(f"Lift: {lift*100:+.2f} pp ({ci[0]*100:+.2f}, {ci[1]*100:+.2f})")
The confidence interval is more useful than the p-value — it tells you the size of the effect, not just "is it real."
Quick visual
import matplotlib.pyplot as plt
groups = df.groupby("group")["converted"].mean()
errs = df.groupby("group")["converted"].apply(lambda s: 1.96 * (s.mean()*(1-s.mean())/len(s))**0.5)
fig, ax = plt.subplots(figsize=(5, 4))
groups.plot(kind="bar", yerr=errs, capsize=6, ax=ax, color=["#94a3b8","#217346"])
ax.set_title("Conversion rate by group (95% CI)")
ax.set_ylabel("Conversion %")
plt.tight_layout()
plt.savefig("ab_test.png", dpi=200)
Key takeaways
- Two-sample t-test is the workhorse for "is A different from B?"
- p-value: probability of seeing this if there's no real effect.
- Confidence intervals communicate size, not just significance.
Underpowered?
Re-run the analysis with only the first 1,000 rows of each group. How does the CI change? You've just seen why we need lots of data for small effects.
14.5 — User story: trend, seasonality, and 'is this Tuesday weird?'
Decompose a time series. Spot anomalies.
What you'll learn
- Resample daily data to weekly
- Decompose into trend + seasonality
- Flag anomaly days
The story
Daily sessions on the website have a clear weekly rhythm. Marcus needs to spot days that are unusually high or low after accounting for that rhythm.
Decomposition
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
df = pd.read_csv("daily_sessions.csv", parse_dates=["date"]).set_index("date").sort_index()
decomp = seasonal_decompose(df["sessions"], model="additive", period=7)
decomp.plot()
plt.tight_layout()
plt.savefig("decomposition.png", dpi=150)
Anomaly flag
# Residual = actual - trend - seasonal
resid = decomp.resid.dropna()
std = resid.std()
df["expected"] = (decomp.trend + decomp.seasonal)
df["anomaly"] = (resid.abs() > 2 * std)
anomalies = df[df["anomaly"]].sort_values("sessions", ascending=False)
print(anomalies)
Plot with anomalies highlighted
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df["sessions"], lw=0.8)
ax.plot(df.index, df["expected"], lw=1.2, color="#217346", label="expected")
mask = df["anomaly"].fillna(False)
ax.scatter(df.index[mask], df["sessions"][mask], color="red", s=20, zorder=3, label="anomaly")
ax.legend()
plt.tight_layout()
plt.savefig("anomalies.png", dpi=150)
Key takeaways
- seasonal_decompose splits a series into trend, seasonality, and residual.
- An anomaly is just a residual that's more than ~2 standard deviations from zero.
- Match period to the cycle: period=7 for daily data with a weekly rhythm, period=12 for monthly data with yearly seasonality, period=24 for hourly data with a daily cycle.
Find your anomalies
Take any daily metric you track (web sessions, support tickets, orders). Decompose it. List the anomalous days from the past 90.
14.6 — User story: a refreshing dashboard sheet
Build an Excel dashboard that auto-updates when the source data changes.
What you'll learn
- Build multiple analyses against a single source
- Use =PY() so the dashboard refreshes on data change
- Lay out the dashboard for at-a-glance reading
The story
Marcus's manager wants a single Excel file she can open every morning and see: today's revenue, week-on-week, top 5 products, top 5 stores, anomaly count.
The structure
- Sheet 1: Data — the raw transactions, refreshed nightly.
- Sheet 2: Dashboard — 5 =PY() cells, each producing one tile.
The =PY() cells
# Tile 1: Yesterday's revenue
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
y = df[df["date"].dt.date == (pd.Timestamp.today().date() - pd.Timedelta(days=1))]
f"${y['revenue'].sum():,.0f}"
)
# Tile 2: Week-on-week
=PY(
df = xl("Data!A1:F100000", headers=True)
df["date"] = pd.to_datetime(df["date"])
this7 = df[df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=7)]["revenue"].sum()
prev7 = df[(df["date"] < pd.Timestamp.today() - pd.Timedelta(days=7)) &
(df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=14))]["revenue"].sum()
f"{(this7/prev7 - 1)*100:+.1f}%"
)
# Tile 3: Top 5 products
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("product")["revenue"].sum().nlargest(5)
)
# Tile 4: Top 5 stores
=PY(
df = xl("Data!A1:F100000", headers=True)
df.groupby("store")["revenue"].sum().nlargest(5)
)
# Tile 5: Anomaly count
=PY(
df = xl("Data!A1:F100000", headers=True)
daily = df.groupby(pd.to_datetime(df["date"]).dt.date)["revenue"].sum()
mu, sd = daily.mean(), daily.std()
int(((daily - mu).abs() > 2*sd).sum())
)
Lay it out
Put the small tiles (revenue, WoW, anomaly count) in big bold cells across the top. Put the Top 5 lists in the middle. Add a single chart at the bottom: =PY() returning a line chart of the last 30 days.
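A sketch for that chart cell, assuming the same Data layout (Python in Excel typically preloads pandas as pd and matplotlib.pyplot as plt):
# Chart tile: revenue line for the last 30 days (the cell returns a figure)
=PY(
df = xl("Data!A1:F100000", headers=True)
daily = df.groupby(pd.to_datetime(df["date"]).dt.date)["revenue"].sum().tail(30)
fig, ax = plt.subplots(figsize=(8, 2))
daily.plot(ax=ax)
ax.set_title("Revenue: last 30 days")
fig
)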
💡 Refresh discipline
The dashboard auto-recalculates when the Data sheet changes. Tell stakeholders to only edit the Data sheet and nothing else.
Key takeaways
- Dashboards are a stack of =PY() tiles, each producing one number or one table.
- Keep the data in a separate sheet so the dashboard layout is stable.
- Aim for one screen — if you have to scroll, it's a report, not a dashboard.
Add a tile
Add a sixth tile: yesterday's conversion rate vs the trailing 30-day average. Format as "X.XX% (Y.Y pp vs avg)".
14.7 — User story: shipping the analysis — narrative report
Turn an analysis notebook into a one-page narrative the boss will actually read.
What you'll learn
- Choose the one chart that tells the story
- Structure a memo: TL;DR, evidence, recommendation
- Export to PDF or DOCX
The story
Marcus finished his analysis. Now he has to convince the VP to act on it.
The 'analyst memo' template
- TL;DR (3 sentences). What you found, what it means, what you recommend.
- The headline chart. One image that proves the TL;DR.
- How you measured it. Data, method, limits, in plain English.
- What to do next. Specific, time-bound, owned.
Auto-generate a PDF memo
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
with PdfPages("memo_q1.pdf") as pdf:
# Page 1 β title + TL;DR
fig, ax = plt.subplots(figsize=(8.5, 11))
ax.axis("off")
    ax.text(0.05, 0.95, "Q1 Northeast Customers — Memo", fontsize=20, weight="bold", va="top")
ax.text(0.05, 0.85,
"TL;DR β The top 20 NE customers drove 62% of revenue but their\n"
"margin% trails the national average by 6 points. We recommend\n"
"a pricing review on the three SKUs that account for 80% of the gap.",
fontsize=12, va="top")
pdf.savefig(fig); plt.close(fig)
# Page 2 β the chart
fig, ax = plt.subplots(figsize=(8.5, 11))
# ... build the chart
pdf.savefig(fig); plt.close(fig)
Or generate a Word memo
from docx import Document
from docx.shared import Inches
doc = Document()
doc.add_heading("Q1 Northeast Customers — Memo", 0)
doc.add_paragraph("TL;DR — The top 20 NE customers drove 62% of revenue...")
doc.add_picture("ne_margins.png", width=Inches(6))
doc.add_heading("Recommendation", level=2)
doc.add_paragraph("Pricing review on SKUs 12345, 23456, 34567 by end of June.")
doc.save("memo_q1.docx")
Key takeaways
- Analysis without a recommendation is just trivia.
- One chart, one number, one ask — that's the memo.
- Python can generate the PDF/DOCX itself; the same script can rebuild it monthly.
Write your own memo
Take your last analysis. Distill it into a TL;DR, one chart, and a recommendation. Hand it to a colleague and ask: "did this make me sound certain enough?"
Module 15 — Scenario: HR & Operations
Headcount, payroll prep, timesheet cleanup, employee data hygiene, and the dozen small reports that pile up.
Meet Tomas, an HR coordinator. He's drowning in spreadsheets — exports from the HRIS, manager surveys, timesheet dumps. Here's how Python turns each one into a quiet five-minute job.
15.1 — User story: monthly headcount report
Active employees by department, with hires/leavers.
What you'll learn
- Compute active headcount as of any date
- Identify hires and leavers in a month
- Format the rollup for the leadership pack
The story
Tomas needs a monthly headcount file: active count by department on month-end, plus hires and leavers in the month, plus turnover %.
The data
import pandas as pd
emp = pd.read_excel("employees.xlsx")
# emp cols: employee_id, name, department, hire_date, term_date (NaT if active)
emp["hire_date"] = pd.to_datetime(emp["hire_date"])
emp["term_date"] = pd.to_datetime(emp["term_date"], errors="coerce")
Active as of any date
def active_as_of(df, asof):
asof = pd.Timestamp(asof)
return df[(df["hire_date"] <= asof) & ((df["term_date"].isna()) | (df["term_date"] > asof))]
eom = pd.Timestamp("2026-04-30")
active = active_as_of(emp, eom)
print(f"Headcount on {eom:%Y-%m-%d}: {len(active)}")
By department
hc_by_dept = active.groupby("department").size().rename("active").reset_index()
hc_by_dept = hc_by_dept.sort_values("active", ascending=False)
Hires and leavers in the month
month_start = pd.Timestamp("2026-04-01")
hires = emp[(emp["hire_date"] >= month_start) & (emp["hire_date"] <= eom)]
leavers = emp[(emp["term_date"] >= month_start) & (emp["term_date"] <= eom)]
events = pd.DataFrame({
"department": hc_by_dept["department"],
"hires": hires.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
"leavers": leavers.groupby("department").size().reindex(hc_by_dept["department"], fill_value=0).values,
})
events["turnover_pct"] = (events["leavers"] / hc_by_dept["active"] * 100).round(1)
summary = hc_by_dept.merge(events, on="department")
summary.loc["Total"] = [
"Total", summary["active"].sum(), summary["hires"].sum(),
summary["leavers"].sum(),
round(summary["leavers"].sum() / summary["active"].sum() * 100, 1),
]
summary.to_excel(f"headcount_{eom:%Y-%m}.xlsx", index=False)
Key takeaways
- A point-in-time headcount = filter where hire ≤ asof and (term IS NULL or term > asof).
- Monthly events = hires/terms with date in the month.
- Turnover% = leavers / active.
Year-to-date version
Build a "YTD" version: rolling headcount at each month-end of the year, in one wide table.
15.2 — User story: payroll prep
Take the timesheet dump, validate it, output the payroll-ready file.
What you'll learn
- Aggregate hours per employee per pay period
- Apply overtime rules
- Output a payroll-system-friendly file
The data
ts = pd.read_csv("timesheets.csv", parse_dates=["date"])
# columns: employee_id, date, hours, project
Validate
# No more than 16 hours in a day
bad = ts.groupby(["employee_id","date"])["hours"].sum()
bad = bad[bad > 16]
if len(bad):
print(f"β οΈ {len(bad)} employee-days over 16 hours")
print(bad)
Aggregate to pay period (e.g., bi-weekly)
ts["pay_period"] = ts["date"].dt.to_period("2W-SUN") # bi-weekly ending Sunday
daily = ts.groupby(["employee_id","date"])["hours"].sum().reset_index()
# Apply 40h/week overtime rule
daily["week"] = daily["date"].dt.to_period("W-SUN")
weekly = (daily.groupby(["employee_id","week"])["hours"]
.sum()
.reset_index())
weekly["regular_hrs"] = weekly["hours"].clip(upper=40)
weekly["overtime_hrs"] = (weekly["hours"] - 40).clip(lower=0)
Roll up to pay period
weekly["pay_period"] = weekly["week"].apply(lambda w: pd.Period(w.start_time, freq="2W-SUN"))
payroll = (weekly.groupby(["employee_id","pay_period"])
.agg(regular = ("regular_hrs", "sum"),
overtime= ("overtime_hrs", "sum"))
.reset_index())
payroll["pay_period"] = payroll["pay_period"].astype(str)
payroll.to_csv("payroll_import.csv", index=False)
print(f"Wrote payroll for {payroll['employee_id'].nunique()} employees")
Key takeaways
- Validate before you compute β bad rows in the timesheet ruin the payroll.
- Overtime is a weekly cap, even if pay periods are bi-weekly.
- Always tag the output with employee_id + period.
Pay rate
Merge the payroll output with an employees table containing hourly rates. Compute total pay (regular×rate + overtime×rate×1.5).
15.3 — User story: employee data hygiene
Names with stray spaces, missing managers, duplicate IDs.
What you'll learn
- Profile a master employee file
- Catch the common dirty-data patterns
- Produce a 'to fix' list for the HRIS
Profile first
def profile(df):
return pd.DataFrame({
"dtype": df.dtypes.astype(str),
"missing": df.isna().sum(),
"missing_pct": (df.isna().mean()*100).round(1),
"distinct": df.nunique(),
})
print(profile(emp))
Common HRIS dirty patterns
issues = []
# Stray whitespace
for col in ["first_name", "last_name", "email"]:
bad = emp[col].astype(str).str.strip() != emp[col].astype(str)
issues += [(rowid, col, "whitespace") for rowid in emp.index[bad]]
# Mixed case emails
bad = emp["email"].astype(str) != emp["email"].astype(str).str.lower()
issues += [(r, "email", "mixed case") for r in emp.index[bad]]
# Missing manager
issues += [(r, "manager_id", "missing") for r in emp.index[emp["manager_id"].isna()]]
# Manager who doesn't exist
mgr_ids = set(emp["employee_id"])
bad = ~emp["manager_id"].isin(mgr_ids) & emp["manager_id"].notna()
issues += [(r, "manager_id", "non-existent") for r in emp.index[bad]]
# Duplicate IDs
dup = emp[emp.duplicated("employee_id", keep=False)]
issues += [(r, "employee_id", "duplicate") for r in dup.index]
fix = pd.DataFrame(issues, columns=["row_index", "column", "issue"]).merge(
emp[["employee_id", "first_name", "last_name"]],
left_on="row_index", right_index=True
)
fix.to_excel("hris_fix_list.xlsx", index=False)
print(f"β οΈ {len(fix)} issues to fix")
Key takeaways
- A profile() function tells you 80% of what's broken in 5 seconds.
- Track issues row-by-row so HR can fix them in the system, not the spreadsheet.
- Run this monthly — quality drifts.
Email validity
Add a check: flag emails that don't match a basic name@domain.tld pattern. Hint: str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").
15.4 — User story: build the org chart programmatically
From a flat employee table to a parent-child hierarchy.
What you'll learn
- Walk a parent-pointer hierarchy
- Count direct and indirect reports
- Render the org tree as text
The data
emp = pd.read_excel("employees.xlsx")[["employee_id","name","manager_id","title"]]
Build a parent-children map
from collections import defaultdict
children = defaultdict(list)
for _, r in emp.iterrows():
if pd.notna(r["manager_id"]):
children[r["manager_id"]].append(r["employee_id"])
names = dict(zip(emp["employee_id"], emp["name"] + " — " + emp["title"]))
def print_tree(emp_id, depth=0):
print(" " * depth + "β’ " + names[emp_id])
for kid in children[emp_id]:
print_tree(kid, depth + 1)
# Find roots (no manager) and print
roots = emp[emp["manager_id"].isna()]["employee_id"]
for r in roots:
print_tree(r)
Count direct + indirect reports
def total_reports(emp_id):
direct = children[emp_id]
return len(direct) + sum(total_reports(c) for c in direct)
emp["direct_reports"] = emp["employee_id"].apply(lambda e: len(children[e]))
emp["total_reports"] = emp["employee_id"].apply(total_reports)
emp.sort_values("total_reports", ascending=False).head(10)
Key takeaways
- Org chart = flat table with a parent-pointer column.
- defaultdict(list) makes the children-of-X lookup trivial.
- Recursion is the natural way to count indirect reports.
Span check
Flag managers with more than 12 direct reports (broad span) or fewer than 3 (too narrow).
15.5 — User story: engagement survey analysis
Turn 800 free-text + Likert responses into one summary file.
What you'll learn
- Score Likert columns and roll up by team
- Word-count the open-ended responses
- Output a summary deck-ready set of charts
The data
survey = pd.read_excel("engagement_2026.xlsx")
# cols: respondent_id, team, q1_likert (1-5), q2_likert ..., q10_open (free text)
Score the Likerts
likert_cols = [c for c in survey.columns if c.endswith("_likert")]
by_team = (survey.groupby("team")[likert_cols].mean().round(2))
by_team["overall"] = by_team.mean(axis=1).round(2)
by_team.to_excel("survey_scores_by_team.xlsx")
# Distribution per question
dist = survey.melt(id_vars="team", value_vars=likert_cols,
var_name="question", value_name="score")
dist_pct = (dist.groupby(["question","score"]).size()
.unstack(fill_value=0)
.pipe(lambda d: d.div(d.sum(axis=1), axis=0) * 100)
.round(1))
Free text: top words
import re
from collections import Counter
text = " ".join(survey["q10_open"].dropna().astype(str)).lower()
words = re.findall(r"[a-z']+", text)
stop = {"the","and","a","to","of","in","is","i","it","for","that","but","my","we","with","this","on","be","are"}
freq = Counter(w for w in words if w not in stop and len(w) > 3)
top20 = pd.DataFrame(freq.most_common(20), columns=["word","count"])
Output
with pd.ExcelWriter("survey_summary.xlsx") as w:
by_team.to_excel(w, sheet_name="By Team")
dist_pct.to_excel(w, sheet_name="Distributions")
top20.to_excel(w, sheet_name="Top Words", index=False)
Key takeaways
- Likerts roll up cleanly with mean by team.
- Free text gets a quick "top words after removing stopwords" pass — enough for a first cut.
- For deeper text analysis, level up to NLTK or spaCy (out of scope here).
Net Promoter Score
Add an NPS computation: % promoters (9-10) minus % detractors (0-6).
15.6 — User story: PTO accrual and balances
How much vacation does each employee have, with the accrual rules baked in?
What you'll learn
- Compute accrual based on tenure
- Subtract taken leave
- Flag anyone over the cap
The rules (example)
- Default accrual: 15 days/year, prorated.
- After 5 years tenure: 20 days/year.
- After 10 years: 25 days/year.
- Max balance carried into a year: 30 days.
Compute
from datetime import date
import pandas as pd
emp = pd.read_excel("employees.xlsx")
leave = pd.read_excel("leave_taken.xlsx") # employee_id, days_taken_ytd
today = pd.Timestamp(date.today())
emp["tenure_yrs"] = ((today - pd.to_datetime(emp["hire_date"])).dt.days / 365.25)
def annual_rate(years):
if years >= 10: return 25
if years >= 5: return 20
return 15
emp["annual_rate"] = emp["tenure_yrs"].apply(annual_rate)
ytd_fraction = (today.dayofyear / 365.25)
emp["accrued_ytd"] = (emp["annual_rate"] * ytd_fraction).round(1)
emp = emp.merge(leave, on="employee_id", how="left")
emp["days_taken_ytd"] = emp["days_taken_ytd"].fillna(0)
emp["balance"] = (emp["accrued_ytd"] - emp["days_taken_ytd"]).round(1)
emp["over_cap"] = emp["balance"] > 30
print(f"{emp['over_cap'].sum()} employees are over the 30-day cap.")
emp[["employee_id","name","tenure_yrs","annual_rate","accrued_ytd","days_taken_ytd","balance","over_cap"]].to_excel("pto_balances.xlsx", index=False)
Key takeaways
- Tiered accrual = a function of tenure, applied with .apply().
- YTD accrual is a simple proration.
- Always flag exceptions (over-cap, negative balance, > 1yr no leave).
Use-it-or-lose-it warning
Find employees who would lose more than 5 days at year-end if they don't take any more leave. Email each one a polite nudge using Module 12 Lesson 5.
15.7 — User story: new-hire onboarding tracker
Generate per-hire onboarding sheets from a master checklist.
What you'll learn
- Fill an Excel template per new hire
- Track progress in one master file
- Send a reminder when items are overdue
Per-hire onboarding workbook
from openpyxl import load_workbook
from datetime import date, timedelta
import pandas as pd
template = "onboarding_template.xlsx"
hires = pd.read_excel("new_hires_this_week.xlsx")
items = pd.read_excel(template, sheet_name="ItemDefs")  # checklist item definitions (read once, not per hire)
for _, h in hires.iterrows():
wb = load_workbook(template)
ws = wb["Checklist"]
ws["B2"] = h["name"]
ws["B3"] = h["start_date"]
ws["B4"] = h["manager"]
ws["B5"] = h["department"]
    # Auto-fill due dates from the item definitions
for i, item in items.iterrows():
row = 8 + i
ws.cell(row=row, column=1, value=item["item"])
ws.cell(row=row, column=2, value=pd.Timestamp(h["start_date"]) + pd.Timedelta(days=item["due_offset"]))
wb.save(f"onboarding/{h['name'].replace(' ', '_')}.xlsx")
Master tracker — read every per-hire workbook, summarise
from pathlib import Path
rows = []
for p in Path("onboarding").glob("*.xlsx"):
wb = load_workbook(p, data_only=True)
ws = wb["Checklist"]
name = ws["B2"].value
for r in ws.iter_rows(min_row=8, values_only=True):
item, due, done, owner = r[:4]
if item:
rows.append({"hire": name, "item": item, "due": due, "done": done, "owner": owner})
tracker = pd.DataFrame(rows)
tracker["overdue"] = (~tracker["done"].fillna(False)) & (tracker["due"] < pd.Timestamp.now())
print(f"Overdue items: {tracker['overdue'].sum()}")
tracker.to_excel("onboarding_master.xlsx", index=False)
Key takeaways
- Templates + Python loops = consistent per-hire artifacts.
- The master tracker reads back all the individual files — single source of truth.
- Schedule the tracker to run daily and email a one-page status.
Slack/Teams reminder
Instead of email, post the daily summary into a Teams/Slack channel via incoming webhook. (Both have free webhook URLs.)
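A minimal sketch of the webhook call itself; the URL is a placeholder you generate in Slack or Teams, and both accept a simple JSON "text" payload:
import requests
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder: paste your own
message = f"Onboarding tracker: {tracker['overdue'].sum()} overdue items"  # reuses `tracker` from above
resp = requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)
resp.raise_for_status()  # a non-2xx status means the webhook rejected the post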
Module 16 — Scenario: Sales & Marketing
Lead-list cleanup, CRM exports, campaign reporting, funnel analysis, and a weekly pipeline report you stop having to build.
Meet Lena, a sales-ops lead. Every Monday she rebuilds the same pipeline report. Every quarter she dedupes a giant lead list. Here's how Python earns its keep on each of these tasks.
16.1 — User story: dedupe a 50,000-lead list
Marketing pulled a list. Half of them are already in the CRM. Find the half that aren't.
What you'll learn
- Normalise emails and company names
- Match on multiple keys
- Output the net-new leads only
The data
import pandas as pd
new_list = pd.read_csv("conference_list.csv") # 50,000 rows
existing = pd.read_csv("crm_export.csv") # 220,000 rows
Normalise both sides
def norm_email(s):
return str(s).strip().lower()
def norm_company(s):
s = str(s).strip().lower()
for suffix in [" inc", " inc.", " llc", " ltd", " ltd.", " gmbh", " co", " co."]:
if s.endswith(suffix): s = s[:-len(suffix)]
return s.strip()
for df in (new_list, existing):
df["email_n"] = df["email"].apply(norm_email)
df["company_n"] = df["company"].apply(norm_company)
Match on email first, company second
known_emails = set(existing["email_n"])
known_companies = set(existing["company_n"])
new_list["match_by_email"] = new_list["email_n"].isin(known_emails)
new_list["match_by_company"] = new_list["company_n"].isin(known_companies)
new_list["already_known"] = new_list["match_by_email"] | new_list["match_by_company"]
net_new = new_list[~new_list["already_known"]].drop(columns=["match_by_email","match_by_company"])
print(f"Net-new leads: {len(net_new):,} of {len(new_list):,}")
net_new.to_excel("net_new_leads.xlsx", index=False)
Catch fuzzy duplicates within the new list itself
net_new = net_new.sort_values("score", ascending=False) # keep highest-scoring dupe
net_new = net_new.drop_duplicates(subset="email_n")
net_new = net_new.drop_duplicates(subset=["company_n", "last_name"])
Key takeaways
- Normalise (lowercase, trim, strip "Inc/LLC") before matching.
- Match on multiple keys; OR them together for "already known".
- Dedupe within the new list before importing.
Domain match
Add a third match: if the email domain (after @) matches a known company, treat it as a likely dupe.
16.2 — User story: roll up the marketing campaign report
Spend, impressions, clicks, conversions, by campaign and channel.
What you'll learn
- Combine exports from multiple ad platforms
- Compute the standard ratios (CTR, CPC, CPA)
- Rank by ROAS
The data
google = pd.read_csv("google_ads.csv")
meta = pd.read_csv("meta_ads.csv")
linked = pd.read_csv("linkedin_ads.csv")
Normalise each to a common schema
def normalise(df, channel, mapping):
return (df.rename(columns=mapping)
.assign(channel=channel)
[["channel","campaign","date","spend","impressions","clicks","conversions","revenue"]])
g = normalise(google, "Google", {"Campaign":"campaign","Day":"date","Cost":"spend",
"Impr.":"impressions","Clicks":"clicks",
"Conv.":"conversions","Conv. value":"revenue"})
m = normalise(meta, "Meta", {"Ad set name":"campaign","Reporting starts":"date",
"Amount spent (USD)":"spend","Impressions":"impressions",
"Link clicks":"clicks","Results":"conversions",
"Purchase conversion value":"revenue"})
l = normalise(linked, "LinkedIn", {"Campaign Name":"campaign","Start Date":"date",
"Total Spent":"spend","Impressions":"impressions",
"Clicks":"clicks","Conversions":"conversions",
"Conversion Value":"revenue"})
all_ads = pd.concat([g, m, l], ignore_index=True)
all_ads["date"] = pd.to_datetime(all_ads["date"])
Compute ratios
all_ads["CTR"] = (all_ads["clicks"] / all_ads["impressions"]).fillna(0)
all_ads["CPC"] = (all_ads["spend"] / all_ads["clicks"]).replace([float("inf")], 0).fillna(0)
all_ads["CPA"] = (all_ads["spend"] / all_ads["conversions"]).replace([float("inf")], 0).fillna(0)
all_ads["ROAS"] = (all_ads["revenue"] / all_ads["spend"]).replace([float("inf")], 0).fillna(0)
Roll up
by_channel = (all_ads.groupby("channel")
.agg(spend=("spend","sum"),
revenue=("revenue","sum"),
conversions=("conversions","sum"))
.assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2),
CPA= lambda d: (d["spend"]/d["conversions"]).round(2))
.sort_values("ROAS", ascending=False))
by_campaign = (all_ads.groupby(["channel","campaign"])
.agg(spend=("spend","sum"),
revenue=("revenue","sum"),
conversions=("conversions","sum"))
.assign(ROAS=lambda d: (d["revenue"]/d["spend"]).round(2))
.sort_values("ROAS", ascending=False)
.reset_index())
with pd.ExcelWriter("campaign_report.xlsx") as w:
by_channel.to_excel(w, sheet_name="By Channel")
by_campaign.to_excel(w, sheet_name="By Campaign", index=False)
Key takeaways
- Three platforms, three rename maps, one DataFrame.
- Replace inf and NaN when computing ratios — every ad platform has zero-impression rows.
- Sort by the metric the team actually cares about (usually ROAS).
Channel ROI trend
Add a weekly line chart of ROAS by channel for the last 12 weeks.
16.3 — User story: funnel analysis
From session → product view → add to cart → checkout → purchase.
What you'll learn
- Compute step-to-step conversion rates
- Spot where the biggest drop-off is
- Build a funnel chart
The data
events = pd.read_csv("events.csv")
# user_id, event_name, timestamp
Per-step counts
steps = ["session_start","product_view","add_to_cart","checkout","purchase"]
step_users = {}
for s in steps:
step_users[s] = events.loc[events["event_name"]==s, "user_id"].nunique()
funnel = pd.DataFrame({"step": steps, "users": [step_users[s] for s in steps]})
funnel["pct_of_top"] = (funnel["users"] / funnel["users"].iloc[0] * 100).round(1)
funnel["pct_of_prev"] = (funnel["users"] / funnel["users"].shift().fillna(funnel["users"].iloc[0]) * 100).round(1)
print(funnel)
Strict sequential funnel (each step must have happened in order)
seq = events.sort_values(["user_id","timestamp"]).groupby("user_id")["event_name"].agg(list)
def reached(events_list, step_idx):
pos = 0
for ev in events_list:
if pos < len(steps) and ev == steps[pos]:
pos += 1
if pos > step_idx: return True
return False
reached_step = {s: int(seq.apply(lambda L: reached(L, i)).sum()) for i, s in enumerate(steps)}
print(reached_step)
Funnel chart
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(funnel["step"][::-1], funnel["users"][::-1], color="#2b6cb0")
for i, (s, n, pct) in enumerate(zip(funnel["step"][::-1], funnel["users"][::-1], funnel["pct_of_top"][::-1])):
ax.text(n, i, f" {n:,} ({pct}%)", va="center")
ax.set_title("Funnel β last 30 days")
plt.tight_layout()
plt.savefig("funnel.png", dpi=200)
Key takeaways
- Naive funnel = distinct users per step. Strict funnel = sequential check.
- Step-to-prev % shows where you're leaking the most.
- One bar chart, one number per bar — readable in 5 seconds.
Segment the funnel
Compute the funnel separately for desktop vs mobile users. Where does mobile drop off more?
16.4 — User story: the Monday pipeline report
Total pipeline, weighted forecast, deals moving by stage.
What you'll learn
- Compute pipeline by stage with stage-weighted forecast
- Spot deals stuck (no movement in 30+ days)
- Output a one-pager for the sales leader
The data
deals = pd.read_csv("crm_pipeline.csv", parse_dates=["close_date","last_modified"])
# id, owner, account, amount, stage, close_date, last_modified
stage_prob = {
"Discovery": 0.10,
"Qualified": 0.20,
"Proposal": 0.40,
"Negotiation": 0.70,
"Closing": 0.90,
}
deals["prob"] = deals["stage"].map(stage_prob).fillna(0)
deals["weighted"] = deals["amount"] * deals["prob"]
Pipeline by stage
by_stage = (deals.groupby("stage")
.agg(deals=("id","count"),
value=("amount","sum"),
weighted=("weighted","sum"))
.reindex(stage_prob.keys()))
Per-rep snapshot
by_rep = (deals.groupby("owner")
.agg(deals=("id","count"),
pipeline=("amount","sum"),
forecast=("weighted","sum"))
.sort_values("forecast", ascending=False))
Stuck deals
stuck = deals[(deals["last_modified"] < pd.Timestamp.now() - pd.Timedelta(days=30)) &
(deals["stage"] != "Closed")]
One-page output
with pd.ExcelWriter("monday_pipeline.xlsx") as w:
by_stage.to_excel(w, sheet_name="By Stage")
by_rep.to_excel(w, sheet_name="By Rep")
stuck.to_excel(w, sheet_name="Stuck Deals", index=False)
pd.DataFrame({
"Metric": ["Total pipeline","Weighted forecast","# stuck deals"],
"Value": [by_stage["value"].sum(), by_stage["weighted"].sum(), len(stuck)],
}).to_excel(w, sheet_name="Cover", index=False)
Schedule it
Schedule the script to run at 6 AM every Monday (Module 12, Lesson 6). Email it to the VP of Sales (Module 12, Lesson 5). Now your "Monday morning pipeline report" job is done before you wake up.
Key takeaways
- Stage probabilities turn raw pipeline into a forecast.
- "Stuck" = last_modified < today - 30 days, and not closed.
- Schedule + email = the report writes itself.
Forecast vs quota
Add a column to by_rep for quota; compute forecast / quota %, sort, flag < 70%.
16.5 — User story: RFM segmentation
Score every customer on Recency, Frequency, Monetary value. Build a champions list.
What you'll learn
- Compute R, F, M scores using quintiles
- Combine into a segment label
- Identify champions, at-risk, lost
The data
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
# customer_id, order_date, amount
Compute R, F, M
asof = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
recency = ("order_date", lambda d: (asof - d.max()).days),
frequency = ("order_date", "count"),
monetary = ("amount", "sum"),
)
Score 1-5 by quintile
# Recency: lower = better, so reverse
rfm["R"] = pd.qcut(rfm["recency"], 5, labels=[5,4,3,2,1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"], 5, labels=[1,2,3,4,5], duplicates="drop").astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 5, labels=[1,2,3,4,5]).astype(int)
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
Segment
def label(row):
if row["R"] >= 4 and row["F"] >= 4 and row["M"] >= 4: return "Champion"
if row["R"] >= 4 and row["F"] >= 3: return "Loyal"
if row["R"] >= 4 and row["F"] <= 2: return "New"
if row["R"] <= 2 and row["F"] >= 4: return "At Risk"
if row["R"] <= 2 and row["F"] <= 2: return "Lost"
return "Need Attention"
rfm["segment"] = rfm.apply(label, axis=1)
print(rfm["segment"].value_counts())
Output target lists
with pd.ExcelWriter("rfm_targets.xlsx") as w:
for seg in rfm["segment"].unique():
rfm[rfm["segment"]==seg].reset_index().to_excel(w, sheet_name=seg[:31], index=False)
Key takeaways
- RFM is the simplest, most powerful customer segmentation.
- Quintile scoring (qcut) auto-adjusts to your data's distribution.
- Run quarterly; hand the segments to the marketing team.
Send-list builder
Build a campaign email list of "At Risk" customers (R≤2, F≥4) with their last-order-amount as a personalisation token.
16.6 — User story: simple multi-touch attribution
Who gets credit when a customer touched four channels before buying?
What you'll learn
- Build a journey per customer
- Try first-touch, last-touch, and linear attribution
- Compare the answers
The data
touches = pd.read_csv("touches.csv", parse_dates=["ts"])
# user_id, ts, channel, conversion (bool), revenue
Build journeys
converters = touches.loc[touches["conversion"], "user_id"].unique()
j = (touches[touches["user_id"].isin(converters)]
.sort_values(["user_id","ts"])
.groupby("user_id")
.agg(path=("channel", list),
revenue=("revenue", "max")))
First-touch
j["first"] = j["path"].apply(lambda p: p[0])
first_touch = j.groupby("first")["revenue"].sum()
Last-touch
j["last"] = j["path"].apply(lambda p: p[-1])
last_touch = j.groupby("last")["revenue"].sum()
Linear (equal credit to every touch)
rows = []
for path, rev in zip(j["path"], j["revenue"]):
share = rev / len(path)
for c in path:
rows.append({"channel": c, "credit": share})
linear = pd.DataFrame(rows).groupby("channel")["credit"].sum()
Compare
compare = pd.concat([first_touch.rename("first"),
last_touch.rename("last"),
linear.rename("linear")], axis=1).fillna(0).round(0)
compare.sort_values("linear", ascending=False)
Key takeaways
- First-touch overrates discovery channels; last-touch overrates closing channels.
- Linear is naive but a good "everyone gets credit" baseline.
- Show all three side-by-side; the truth is usually somewhere in between.
Position-based (40/20/40)
Implement a position-based model: 40% to first, 40% to last, 20% split among the middle touches.
16.7 — User story: sales territory rebalancing
Split a country into roughly-equal account groups for new reps.
What you'll learn
- Sum account value by region/state
- Use a greedy assignment to balance territories
- Output one workbook per rep
The setup
Lena is hiring three new reps. She has to carve out three territories with roughly equal pipeline value.
Sum by geography
accts = pd.read_csv("accounts.csv") # account_id, state, pipeline_value
state_value = accts.groupby("state")["pipeline_value"].sum().sort_values(ascending=False)
print(state_value)
Greedy assignment
n_reps = 3
territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}
for state, value in state_value.items():
    # assign to whichever rep currently has the lowest total
    lowest = min(territories, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value
for rep, info in territories.items():
    print(f"{rep}: ${info['value']:,.0f} ({len(info['states'])} states)")
Output per-rep workbooks
for rep, info in territories.items():
    sub = accts[accts["state"].isin(info["states"])].sort_values("pipeline_value", ascending=False)
    sub.to_excel(f"territory_{rep}.xlsx", index=False)
How balanced did we end up?
values = [t["value"] for t in territories.values()]
print(f"min:{min(values):,.0f} max:{max(values):,.0f} spread:{(max(values)-min(values))/sum(values)*100:.1f}%")
Beyond greedy
For real territory optimisation (drive time, equal account count, named-account constraints), look at scipy.optimize.linprog or the pulp library. Greedy is a great first cut.
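For flavour, a minimal sketch with pulp (pip install pulp), reusing state_value from above: every state goes to exactly one rep, and the solver minimises the largest territory. A starting point, not a production optimiser.
import pulp

reps = [f"Rep{i+1}" for i in range(3)]
states = list(state_value.index)

prob = pulp.LpProblem("territories", pulp.LpMinimize)
assign = pulp.LpVariable.dicts("assign", (states, reps), cat="Binary")
biggest = pulp.LpVariable("biggest", lowBound=0)

prob += biggest  # objective: make the largest territory as small as possible
for s in states:
    prob += pulp.lpSum(assign[s][r] for r in reps) == 1  # each state to exactly one rep
for r in reps:
    prob += pulp.lpSum(state_value[s] * assign[s][r] for s in states) <= biggest

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for r in reps:
    total = sum(state_value[s] for s in states if assign[s][r].value() == 1)
    print(f"{r}: ${total:,.0f}")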
Key takeaways
- Greedy bin-packing is the 80% solution for territory carving.
- Always print the spread % – if it's huge, you need a smarter algorithm.
- One workbook per rep is the kindest deliverable.
Add a constraint
Modify the script so each rep gets at most 12 states (forces re-balancing in dense areas).
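A hint, if you want one: rebuild the greedy loop so only reps with room are eligible. The MAX_STATES constant and the fallback error are illustrative choices.
MAX_STATES = 12
territories = {f"Rep{i+1}": {"states": [], "value": 0} for i in range(n_reps)}
for state, value in state_value.items():
    # only reps that still have room are eligible
    eligible = [r for r in territories if len(territories[r]["states"]) < MAX_STATES]
    if not eligible:
        raise ValueError("cap too tight: raise MAX_STATES or add reps")
    lowest = min(eligible, key=lambda r: territories[r]["value"])
    territories[lowest]["states"].append(state)
    territories[lowest]["value"] += value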
Module 17 – Capstone Projects
Four end-to-end projects that combine everything in the course. Pick one, ship it, show it.
You've learned the moves. Now build the dance. Each capstone is a real-world project you can finish in 2-6 hours, share with colleagues, and put on your portfolio.
You don't have to do all four
Pick the one closest to your job. Do that one well. Come back for the others when you need them.
17.1 – Capstone 1 – Monthly close pack (Finance)
End-to-end: load, reconcile, variance, AR aging, FX P&L, single Excel output, scheduled.
What you'll learn
- Combine everything from Module 13 into one orchestrator
- Add a styled cover sheet with KPIs
- Schedule it to run on the 1st of each month
Project brief
Build a single script monthly_close.py that, when run, produces close_pack_YYYY-MM.xlsx containing:
- A cover sheet with 6 KPIs.
- Bank reconciliation (matched, GL-only, bank-only).
- Budget vs actual variance (with material-variance flags).
- AR aging by customer × bucket.
- USD-translated P&L by office.
Schedule it to run automatically on the 1st of every month and email the result to your controller.
Suggested file layout
monthly_close/
├── monthly_close.py – the orchestrator
├── recon.py – reconciliation function
├── variance.py – variance function
├── aging.py
├── fx_pnl.py
├── style.py – shared openpyxl styling
├── inputs/ – raw files dropped here
└── outputs/ – finished pack lands here
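A skeleton sketch of what the orchestrator might look like; the function names imported below (reconcile, budget_vs_actual, ar_aging, usd_pnl) are placeholders, assuming each module exposes one function that returns a DataFrame.
from datetime import date

import pandas as pd

from recon import reconcile
from variance import budget_vs_actual
from aging import ar_aging
from fx_pnl import usd_pnl

tag = date.today().strftime("%Y-%m")

with pd.ExcelWriter(f"outputs/close_pack_{tag}.xlsx", engine="openpyxl") as w:
    reconcile("inputs/").to_excel(w, sheet_name="Reconciliation", index=False)
    budget_vs_actual("inputs/").to_excel(w, sheet_name="Variance", index=False)
    ar_aging("inputs/").to_excel(w, sheet_name="AR Aging", index=False)
    usd_pnl("inputs/").to_excel(w, sheet_name="FX P&L", index=False)

print(f"outputs/close_pack_{tag}.xlsx written")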
Acceptance criteria
- Runs end-to-end with one command: python monthly_close.py.
- Cover sheet headline KPIs match the detail sheets.
- All numeric cells formatted as currency.
- Material variance rows have red fill.
- AR rows over 90 days have red fill; 60-90 yellow.
- Email lands in the controller's inbox automatically every 1st.
Stretch goals
- Add a "previous month vs this month" comparison column on each summary.
- Embed a 12-month revenue trend chart on the cover sheet.
- Save the pack to a SharePoint or Google Drive folder so it's auto-shared.
What to show off
When it's done: a 90-second screen recording showing you double-click the script, the file appears, you open it, scroll the cover sheet, click into one detail tab. That's your portfolio piece.
17.2 – Capstone 2 – Self-refreshing dashboard (Analyst)
A daily-refresh dashboard from a database export, with anomaly detection.
What you'll learn
- Pipe a daily CSV export into a clean dashboard Excel
- Add anomaly flags and a 30-day trend
- Make it run automatically and email the file
Project brief
Build daily_dashboard.py that runs every morning and produces a one-page Excel dashboard for your manager. Source data: a CSV dump from the data warehouse (or any large CSV you have access to).
Required content
- Big number tiles: yesterday's revenue, WoW change, MoM change, anomaly count.
- Top 5 products by revenue.
- Top 5 stores by revenue.
- 30-day revenue trend chart.
- Heatmap: hour × day-of-week.
- Anomalies tab: list of dates where the residual was > 2σ.
Acceptance criteria
- One command: python daily_dashboard.py.
- Cover-page tiles auto-refresh against today's data.
- Charts saved as PNG and embedded into the workbook.
- Anomaly tab populated using seasonal_decompose from Module 14 Lesson 5 (see the sketch after this list).
- Scheduled to run daily at 6 AM (Module 12 Lesson 6).
- Result emailed to your manager (Module 12 Lesson 5).
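As flagged in the criteria, a sketch of the anomaly logic, assuming a daily revenue Series called daily indexed by date (statsmodels required):
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(daily, model="additive", period=7)  # weekly seasonality
resid = result.resid.dropna()
cutoff = 2 * resid.std()
anomalies = resid[resid.abs() > cutoff]  # dates where the residual exceeds 2 sigma
print(anomalies.sort_values(key=abs, ascending=False))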
Stretch goals
- Render a second copy as a single PNG suitable for posting in Slack.
- Add segmentation: same dashboard but filtered to one region per tab.
- Replace the file-emailing with a Slack/Teams webhook post.
What to show off
The dashboard your manager now refreshes themselves just by opening yesterday's file. The fact that they stopped asking you for it.
17.3 – Capstone 3 – Headcount + payroll prep + tracker (HR)
Three weekly HR jobs in one script.
What you'll learn
- Run all weekly HR jobs as one pipeline
- Produce a headcount summary, payroll-import file, and onboarding tracker
- Email each output to the right people
Project brief
Build hr_weekly.py that produces three artifacts and routes each to the right stakeholder:
- headcount_YYYY-MM-DD.xlsx – HR Director.
- payroll_import_YYYY-MM-DD.csv – Payroll vendor SFTP (or just a folder).
- onboarding_status_YYYY-MM-DD.xlsx – People Manager.
Required inputs
- employees.xlsx – HRIS export.
- timesheets.csv – last week's hours.
- onboarding/*.xlsx – per-new-hire onboarding workbooks.
Acceptance criteria
- Headcount file: current active count by dept, hires and leavers in last 7 days, turnover %.
- Payroll file: per-employee regular hours + overtime, ready to import (see the sketch after this list).
- Onboarding file: list of overdue items with the new-hire and owner.
- Each file is emailed to a different recipient with a templated subject and body.
- Scheduled for Monday 7 AM.
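A sketch of the regular/overtime split, assuming timesheets.csv carries hypothetical employee_id and hours columns and a 40-hour regular week:
import pandas as pd

timesheets = pd.read_csv("timesheets.csv")
weekly = timesheets.groupby("employee_id")["hours"].sum().reset_index()
weekly["regular"] = weekly["hours"].clip(upper=40)         # up to 40 hours at the normal rate
weekly["overtime"] = (weekly["hours"] - 40).clip(lower=0)  # anything above 40
weekly.to_csv(f"payroll_import_{pd.Timestamp.today():%Y-%m-%d}.csv", index=False)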
Stretch goals
- Add a "PTO over cap" alert section in the headcount file.
- Skip employees who didn't submit timesheets and email them a reminder.
- Log every send to hr_weekly.log so you can audit who got what.
What to show off
Three jobs collapsed to one calendar entry that runs while you sleep. Your Monday opens with a clean inbox.
17.4 – Capstone 4 – Pipeline + campaign report (Sales/Marketing)
A weekly Monday email that combines pipeline, marketing spend, and ROAS.
What you'll learn
- Combine CRM data and ad-platform exports
- Produce a weighted forecast and channel ROAS
- Email a 1-page PDF + the underlying Excel
Project brief
Build monday_gtm.py that runs every Sunday night and emails a 1-page PDF + Excel pack to the VP of Sales and the VP of Marketing.
The PDF cover page (auto-generated)
- Weighted pipeline forecast – total, by stage, by rep (see the sketch after this list).
- Last week's spend by channel + ROAS.
- 5 "stuck deals" that have not moved in 30+ days.
- Net-new leads added in the last week.
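A sketch of the weighted-forecast line; pipeline.csv, the stage/rep/amount columns, and the probabilities are all placeholders for your CRM's conventions.
import pandas as pd

deals = pd.read_csv("pipeline.csv")
stage_prob = {"Discovery": 0.1, "Demo": 0.3, "Proposal": 0.6, "Negotiation": 0.8}  # placeholder odds
deals["weighted"] = deals["amount"] * deals["stage"].map(stage_prob)
print(f"Total weighted pipeline: ${deals['weighted'].sum():,.0f}")
print(deals.groupby("stage")["weighted"].sum())
print(deals.groupby("rep")["weighted"].sum())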
The Excel underneath
- Tab 1: Pipeline by stage.
- Tab 2: Pipeline by rep, with quota %.
- Tab 3: Stuck deals (sorted by amount desc).
- Tab 4: Channel ROAS, last 4 weeks.
- Tab 5: Net-new leads (segmented by source).
Acceptance criteria
- One command, full pack built in < 90 seconds.
- The PDF prints cleanly on A4 / Letter without scaling.
- The Excel opens and looks "clean" – no extra empty tabs, no #DIV/0! cells.
- Email sent with both files attached, subject line: "GTM Weekly – <week-of>".
- Scheduled for Sunday 11 PM.
Stretch goals
- Add a small RFM segment summary on the PDF cover.
- Track week-over-week changes for every KPI on the cover.
- Add a "biggest deal closed last week" callout for morale.
What to show off
An automated Monday-morning report that two VPs now rely on. The fact that they can't tell you didn't hand-make it.
17.5 – Where to go from here
Books, sites, communities, and the next skills worth learning.
What you'll learn
- Find the right next book or site
- Join a community to keep learning
- Pick one more skill to add this quarter
Books to read next
- Automate the Boring Stuff with Python – Al Sweigart. Free online at automatetheboringstuff.com. Sister book to this course, more breadth, less Excel-focused.
- Python for Data Analysis – Wes McKinney (the creator of pandas). The deeper dive on every pandas concept we glossed over.
- Fluent Python – Luciano Ramalho. When you want to understand the language, not just use it.
Sites worth bookmarking
- pandas user guide – pandas.pydata.org. The official reference; surprisingly readable.
- Real Python – realpython.com. High-quality tutorials.
- Stack Overflow – for every error message you'll ever hit.
Skills to add this quarter
- SQL. If pandas is the Excel-replacement, SQL is the "where the data actually lives" language. Two weekends of practice pays for itself for the rest of your career.
- Git basics. Version control for your scripts. Two hours. You'll never lose work again.
- One specific library deeper. Depending on your job: requests for APIs, beautifulsoup4 for scraping, plotly for interactive charts, streamlit for tiny web apps.
Communities
- r/learnpython on Reddit – beginner-friendly, helpful answers.
- Python Discord – quick chat-style help.
- Pandas tag on Stack Overflow – search before posting; your question probably has an answer.
The most underrated next move
Pick one script you wrote during this course. Make it 50% better β add error handling, write a docstring for every function, add a small test. The difference between "I tinkered with Python" and "I write Python" is taste, and taste comes from polishing your own work.
And finally – keep this site bookmarked
This whole course is here whenever you need to look something up. New lessons get added when there's something worth saying. Send corrections and lesson requests via the contact page.
Key takeaways
- Read Automate the Boring Stuff next for breadth; Python for Data Analysis for pandas depth.
- Learn SQL and Git – two weekends, lifetime value.
- Polish one of your own scripts. That's the real skill jump.
Your last exercise
Pick one script you wrote during this course. Spend 30 minutes: add a docstring at the top, error handling on the I/O, and a print statement at the end that says how long it took. Save it somewhere safe. That's your first piece of personal infrastructure.
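If you haven't met time.perf_counter yet, the timing part is two lines:
import time

start = time.perf_counter()
# ... your script's existing work runs here ...
print(f"Finished in {time.perf_counter() - start:.1f} seconds")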
End of manual. © 2026 LearnPythonForExcel.com
The live, always-up-to-date version lives at learnpythonforexcel.com.