"Who are our top customers in the Northeast, what did they buy last quarter, and what was their average margin?" Six tables: customers, orders, order_items, products, regions, costs.
customers ──(customer_id)── orders ──(order_id)── order_items ──(sku)── products ──(sku)── costs
customers ──(region_id)── regions
cu = pd.read_csv("customers.csv")
o = pd.read_csv("orders.csv", parse_dates=["order_date"])
oi = pd.read_csv("order_items.csv")
p = pd.read_csv("products.csv")
r = pd.read_csv("regions.csv")
co = pd.read_csv("costs.csv")
for name, df in [("cu",cu),("o",o),("oi",oi),("p",p),("r",r),("co",co)]:
print(name, df.shape)
cu_r = cu.merge(r, on="region_id", how="left", indicator=True)
print(cu_r["_merge"].value_counts()) # all should be 'both'
cu_r = cu_r.drop(columns="_merge")
o_cu = o.merge(cu_r, on="customer_id", how="left", indicator=True)
print(o_cu["_merge"].value_counts())
o_cu = o_cu.drop(columns="_merge")
# Items per order — this is the row-multiplying join
items = oi.merge(o_cu, on="order_id", how="inner")
print("rows after items join:", len(items))
# Add product + cost
items = items.merge(p, on="sku", how="left").merge(co, on="sku", how="left")
len(after) ≤ len(before) × max_dupes. If not, debug.
last_q = items[items["order_date"] >= "2026-01-01"]
last_q["revenue"] = last_q["qty"] * last_q["unit_price"]
last_q["margin"] = last_q["revenue"] - last_q["qty"] * last_q["unit_cost"]
ne = last_q[last_q["region_name"] == "Northeast"]
top = (ne.groupby(["customer_id", "customer_name"])
.agg(orders = ("order_id", "nunique"),
revenue = ("revenue", "sum"),
margin = ("margin", "sum"))
.assign(margin_pct = lambda d: (d["margin"] / d["revenue"] * 100).round(1))
.sort_values("revenue", ascending=False)
.head(20)
.reset_index())
top.to_excel("top20_ne_q1.xlsx", index=False)
indicator=True on each step the first time you write the script.Same data, different question: which products have the highest margin% but the lowest sales volume? (Hidden gems to push.)