HomeCourseModule 14 › User story: joining six tables to answer one question

User story: joining six tables to answer one question

Module 14 · Scenario: Data & Business Analyst10 min readIntermediate

What you'll learn

  • Plan a multi-table join
  • Verify match rates at every step
  • Avoid the silent row-explosion of a bad join

The story

"Who are our top customers in the Northeast, what did they buy last quarter, and what was their average margin?" Six tables: customers, orders, order_items, products, regions, costs.

Map it out first

customers ──(customer_id)── orders ──(order_id)── order_items ──(sku)── products ──(sku)── costs
customers ──(region_id)── regions

Load and check sizes

cu = pd.read_csv("customers.csv")
o  = pd.read_csv("orders.csv", parse_dates=["order_date"])
oi = pd.read_csv("order_items.csv")
p  = pd.read_csv("products.csv")
r  = pd.read_csv("regions.csv")
co = pd.read_csv("costs.csv")
for name, df in [("cu",cu),("o",o),("oi",oi),("p",p),("r",r),("co",co)]:
    print(name, df.shape)

Join one step at a time, with indicator

cu_r = cu.merge(r, on="region_id", how="left", indicator=True)
print(cu_r["_merge"].value_counts())   # all should be 'both'
cu_r = cu_r.drop(columns="_merge")

o_cu = o.merge(cu_r, on="customer_id", how="left", indicator=True)
print(o_cu["_merge"].value_counts())
o_cu = o_cu.drop(columns="_merge")

# Items per order — this is the row-multiplying join
items = oi.merge(o_cu, on="order_id", how="inner")
print("rows after items join:", len(items))

# Add product + cost
items = items.merge(p,  on="sku", how="left").merge(co, on="sku", how="left")
⚠️ The row explosion
If your join key has duplicates on both sides, you get a Cartesian product. Check len(after) ≤ len(before) × max_dupes. If not, debug.

Answer the question

last_q = items[items["order_date"] >= "2026-01-01"]
last_q["revenue"] = last_q["qty"] * last_q["unit_price"]
last_q["margin"]  = last_q["revenue"] - last_q["qty"] * last_q["unit_cost"]

ne = last_q[last_q["region_name"] == "Northeast"]

top = (ne.groupby(["customer_id", "customer_name"])
         .agg(orders   = ("order_id", "nunique"),
              revenue  = ("revenue", "sum"),
              margin   = ("margin",  "sum"))
         .assign(margin_pct = lambda d: (d["margin"] / d["revenue"] * 100).round(1))
         .sort_values("revenue", ascending=False)
         .head(20)
         .reset_index())

top.to_excel("top20_ne_q1.xlsx", index=False)

Key takeaways

  • Diagram joins before writing them.
  • Use indicator=True on each step the first time you write the script.
  • Watch for row explosion on many-to-many joins.

Reverse it

Same data, different question: which products have the highest margin% but the lowest sales volume? (Hidden gems to push.)

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.