HomeCourseModule 16 › User story: dedupe a 50,000-lead list

User story: dedupe a 50,000-lead list

Module 16 · Scenario: Sales & Marketing10 min readIntermediate

What you'll learn

  • Normalise emails and company names
  • Match on multiple keys
  • Output the net-new leads only

The data

import pandas as pd

new_list = pd.read_csv("conference_list.csv")        # 50,000 rows
existing = pd.read_csv("crm_export.csv")             # 220,000 rows

Normalise both sides

def norm_email(s):
    return str(s).strip().lower()

def norm_company(s):
    s = str(s).strip().lower()
    for suffix in [" inc", " inc.", " llc", " ltd", " ltd.", " gmbh", " co", " co."]:
        if s.endswith(suffix): s = s[:-len(suffix)]
    return s.strip()

for df in (new_list, existing):
    df["email_n"]   = df["email"].apply(norm_email)
    df["company_n"] = df["company"].apply(norm_company)

Match on email first, company second

known_emails    = set(existing["email_n"])
known_companies = set(existing["company_n"])

new_list["match_by_email"]   = new_list["email_n"].isin(known_emails)
new_list["match_by_company"] = new_list["company_n"].isin(known_companies)
new_list["already_known"]    = new_list["match_by_email"] | new_list["match_by_company"]

net_new = new_list[~new_list["already_known"]].drop(columns=["match_by_email","match_by_company"])
print(f"Net-new leads: {len(net_new):,} of {len(new_list):,}")
net_new.to_excel("net_new_leads.xlsx", index=False)

Catch fuzzy duplicates within the new list itself

net_new = net_new.sort_values("score", ascending=False)   # keep highest-scoring dupe
net_new = net_new.drop_duplicates(subset="email_n")
net_new = net_new.drop_duplicates(subset=["company_n", "last_name"])

Key takeaways

  • Normalise (lowercase, trim, strip "Inc/LLC") before matching.
  • Match on multiple keys; OR them together for "already known".
  • Dedupe within the new list before importing.

Domain match

Add a third match: if the email domain (after @) matches a known company, treat it as a likely dupe.

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.