
User story: cleaning a 4-million-row export

Module 14 · Scenario: Data & Business Analyst · 10 min read · Intermediate

What you'll learn

  • Open a multi-million-row CSV with pandas
  • Filter and sample before doing heavy work
  • Save a clean subset back to Excel

The story

Marcus gets a 4-million-row CSV of transactions. Excel caps out at 1,048,576 rows and drops the rest, with only an easy-to-miss warning. He needs all of it cleaned and pivoted.

Step 1 — peek before loading the whole thing

import pandas as pd
sample = pd.read_csv("transactions.csv", nrows=1000)   # read only the first 1,000 rows
sample.info()                     # columns + dtypes
sample.head()                     # eyeball a few rows
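
Before committing to a full load, it can also help to know roughly how many rows the file holds. A minimal sketch, assuming one record per line (quoted fields with embedded newlines would make this an overcount):

# Count rows without loading the file into memory.
with open("transactions.csv", "rb") as f:
    row_count = sum(1 for _ in f) - 1   # subtract the header line
print(f"{row_count:,} data rows")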

Step 2 — load with the right options

df = pd.read_csv("transactions.csv",
                 usecols=["date", "store_id", "sku", "qty", "revenue"],
                 dtype={"store_id": "category", "sku": "category"},
                 parse_dates=["date"])
print(df.shape)        # (4_000_000, 5)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

usecols drops unused columns at load time. Mapping repeating string columns like store_id and sku to "category" in dtype typically cuts their memory footprint by 90% or more, because each unique value is stored only once.
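
To verify that claim on your own data, compare the same column loaded both ways. A quick sketch using the sku column from the example schema:

# Memory for one repeating-string column: object vs. category.
import pandas as pd

obj_col = pd.read_csv("transactions.csv", usecols=["sku"])["sku"]
cat_col = obj_col.astype("category")

mb = lambda s: s.memory_usage(deep=True) / 1e6
print(f"object:   {mb(obj_col):.1f} MB")
print(f"category: {mb(cat_col):.1f} MB")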

Step 3 — filter early

# Last 90 days only
recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=90)]
print(len(recent))   # ~1M
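
Once the slice exists, the full frame is dead weight. Boolean indexing returns a copy, so recent does not depend on df, and dropping the reference lets Python reclaim the memory:

del df   # the full 4M-row frame is no longer referenced and can be freed
print(recent.memory_usage(deep=True).sum() / 1e6, "MB")   # only the 90-day slice remains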

Step 4 — aggregate down to something Excel-friendly

by_store_sku = (recent.groupby(["store_id", "sku"], observed=True)
                       .agg(units=("qty", "sum"),
                            revenue=("revenue", "sum"))
                       .reset_index()
                       .sort_values("revenue", ascending=False))

by_store_sku.head(20).to_excel("top_skus_per_store.xlsx", index=False)   # top 20 rows overall
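
The story also calls for a pivot, and that is cheap once the data is filtered. A sketch that runs on the 90-day slice, with stores down the rows and months across the columns (the output filename is illustrative):

# Excel-style pivot: store_id as rows, calendar months as columns.
pivot = pd.pivot_table(recent,
                       index="store_id",
                       columns=recent["date"].dt.to_period("M").astype(str),
                       values="revenue",
                       aggfunc="sum",
                       fill_value=0,
                       observed=True)
pivot.to_excel("revenue_by_store_month.xlsx")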

If even pandas struggles — chunked read

aggregates = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000, parse_dates=["date"]):
    chunk = chunk[chunk["date"] >= "2026-01-01"]                     # filter inside the loop
    aggregates.append(chunk.groupby("store_id")["revenue"].sum())    # partial sum per chunk

total = pd.concat(aggregates).groupby(level=0).sum()                 # combine the partial sums
print(total)
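
This works because partial sums add up to the overall sum. Aggregations that don't combine that way need their components carried separately; a sketch for a per-store mean, built from per-chunk sums and counts:

# A mean of chunk means would be wrong (chunks differ in size after filtering),
# so carry sum and count per chunk and divide once at the end.
parts = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000):
    parts.append(chunk.groupby("store_id")["revenue"].agg(["sum", "count"]))

combined = pd.concat(parts).groupby(level=0).sum()
mean_revenue = combined["sum"] / combined["count"]
print(mean_revenue)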

Key takeaways

  • usecols + dtype="category" makes huge files manageable.
  • Filter and aggregate early; only the small result needs to land in Excel.
  • chunksize lets you stream files bigger than RAM.

Your own big file

Find a CSV with more than 100k rows in your own work. Load it once naively and once with usecols and dtype, then print the memory savings.
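
A starter template, with placeholder file and column names to swap for your own:

# Placeholder names: replace with your own file and columns.
import pandas as pd

naive = pd.read_csv("your_file.csv")                    # everything, default dtypes
lean = pd.read_csv("your_file.csv",
                   usecols=["col_a", "col_b"],          # keep only what you need
                   dtype={"col_a": "category"})         # categories for repeating strings

mb = lambda d: d.memory_usage(deep=True).sum() / 1e6
print(f"naive: {mb(naive):.1f} MB, lean: {mb(lean):.1f} MB")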

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.