Marcus gets a 4M-row CSV of transactions. Excel tops out at 1,048,576 rows, so it opens the first million and silently drops the rest. He needs all of it cleaned and pivoted.
import pandas as pd
sample = pd.read_csv("transactions.csv", nrows=1000)
sample.info() # columns + dtypes
sample.head()
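Sampling the first thousand rows is cheap and tells you two things before you commit to the full load: which columns you actually need, and which string columns repeat enough to be worth converting to category.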
df = pd.read_csv("transactions.csv",
                 usecols=["date", "store_id", "sku", "qty", "revenue"],
                 dtype={"store_id": "category", "sku": "category"},
                 parse_dates=["date"])
print(df.shape) # (4_000_000, 5)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
usecols drops unused columns at load time. dtype="category" stores each distinct string once plus a small integer code per row, which can cut memory by roughly 90% on low-cardinality columns.
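You can verify the category savings directly. A minimal sketch, using made-up SKU values, comparing one repeating string column stored as object versus category:
import pandas as pd
skus = pd.Series(["SKU-001", "SKU-002", "SKU-003"] * 100_000)  # 300k rows, 3 distinct values
as_object = skus.memory_usage(deep=True) / 1e6
as_category = skus.astype("category").memory_usage(deep=True) / 1e6
print(f"object: {as_object:.1f} MB, category: {as_category:.1f} MB")
# With only 3 categories the codes fit in int8, so the drop is dramatic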
# Last 90 days only
recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=90)]
print(len(recent)) # ~1M
by_store_sku = (recent.groupby(["store_id", "sku"], observed=True)
                      .agg(units=("qty", "sum"),
                           revenue=("revenue", "sum"))
                      .reset_index()
                      .sort_values("revenue", ascending=False))
by_store_sku.head(20).to_excel("top_skus_per_store.xlsx", index=False)
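observed=True matters once the group keys are categoricals: it restricts the result to store_id/sku pairs that actually occur in the data, instead of emitting a row for every possible combination of categories. (Writing .xlsx also requires openpyxl to be installed.)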
aggregates = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000, parse_dates=["date"]):
    chunk = chunk[chunk["date"] >= "2026-01-01"]  # filter each chunk before aggregating
    aggregates.append(chunk.groupby("store_id")["revenue"].sum())  # partial sums per chunk
total = pd.concat(aggregates).groupby(level=0).sum()  # merge the partial sums
print(total)
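Each chunk contributes a partial per-store sum; pd.concat stacks those Series, and grouping by level=0 (the store_id index) adds the partials into final totals. Peak memory stays near the size of one chunk, no matter how big the file is.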
usecols + dtype="category" makes huge files manageable. chunksize lets you stream files bigger than RAM. Find a CSV bigger than 100k rows in your work, load it with usecols and dtype, and print the memory savings.
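A minimal sketch of that exercise, assuming a hypothetical data.csv with columns region, product, and amount (swap in your own file and column names):
import pandas as pd
naive = pd.read_csv("data.csv")  # everything loaded, strings as object
lean = pd.read_csv("data.csv",
                   usecols=["region", "product", "amount"],
                   dtype={"region": "category", "product": "category"})
naive_mb = naive.memory_usage(deep=True).sum() / 1e6
lean_mb = lean.memory_usage(deep=True).sum() / 1e6
print(f"naive: {naive_mb:.1f} MB, lean: {lean_mb:.1f} MB ({1 - lean_mb / naive_mb:.0%} smaller)")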