df.duplicated().sum() # how many full-row dupes
df.duplicated(subset=["email"]).sum() # dupes by email
df[df.duplicated(subset=["email"], keep=False)] # show all dupe groups
df.drop_duplicates() # exact-row dupes
df.drop_duplicates(subset=["email"]) # by email, keep first
df.drop_duplicates(subset=["email"], keep="last") # by email, keep last
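To see how the calls above differ, here is a minimal sketch on a toy frame. The column names and values are made up for illustration; only the pandas calls themselves come from the notes.

```python
import pandas as pd

# Hypothetical sample data: two rows share the email a@x.com
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "name": ["Ann", "Bob", "Ann B.", "Cy"],
})

# Default keep="first": only the later repeat is flagged as a duplicate
first_mask = df.duplicated(subset=["email"])

# keep=False: every member of a duplicate group is flagged
all_mask = df.duplicated(subset=["email"], keep=False)

# Full-row check: no two rows match on every column here
n_full_dupes = int(df.duplicated().sum())

# Dropping by email keeps one row per address
kept_first = df.drop_duplicates(subset=["email"])               # keeps "Ann"
kept_last = df.drop_duplicates(subset=["email"], keep="last")   # keeps "Ann B."
```

Note that drop_duplicates() returns a new DataFrame and leaves the original index labels in place; chain reset_index(drop=True) if a clean 0..n-1 index is needed.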
If a customer has multiple rows and you want the most recent:
df = (df.sort_values("date")
        .drop_duplicates(subset=["email"], keep="last"))
To keep each customer's highest-value row instead, sort descending and keep the first:
df = df.sort_values("lifetime_value", ascending=False)
df = df.drop_duplicates(subset=["email"], keep="first")
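One caveat with the sort-then-dedupe pattern above: sort_values() places missing values last by default, so an ascending sort followed by keep="last" can end up keeping a row whose date is missing. The sketch below illustrates this on hypothetical data (the "plan" column and all values are invented) and shows one possible workaround using na_position="first".

```python
import pandas as pd

# Hypothetical data: one a@x.com row has no date recorded
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com"],
    "date": pd.to_datetime(["2024-01-05", None, "2024-03-01", "2024-02-10"]),
    "plan": ["free", "unknown", "pro", "free"],
})

# Default na_position="last": the NaT row sorts to the end and wins keep="last"
naive = df.sort_values("date").drop_duplicates(subset=["email"], keep="last")

# na_position="first" pushes missing dates to the front,
# so the row with the latest real date is the one kept
latest = (df.sort_values("date", na_position="first")
            .drop_duplicates(subset=["email"], keep="last"))
```

Whether a row with a missing date should ever win is a data decision; dropping such rows beforehand with dropna(subset=["date"]) is another reasonable option.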
duplicated() returns a boolean Series marking duplicate rows; drop_duplicates() removes them.
subset= lets you check for duplicates on chosen columns only.
Sort first, then use keep="first" (or keep="last") to keep the "best" one.
Given an orders table, dedupe by order_id, keeping the row with the latest last_modified.
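One possible approach to the orders exercise, sketched under assumptions: the sample table, the "status" column, and all values are hypothetical; only order_id and last_modified come from the prompt.

```python
import pandas as pd

# Hypothetical orders table; only order_id and last_modified are named in the exercise
orders = pd.DataFrame({
    "order_id": [101, 102, 101, 103, 102],
    "status": ["created", "created", "shipped", "created", "cancelled"],
    "last_modified": pd.to_datetime([
        "2024-05-01 09:00", "2024-05-01 10:00", "2024-05-03 14:30",
        "2024-05-02 08:15", "2024-05-02 16:45",
    ]),
})

deduped = (
    orders
    # Oldest first; missing timestamps (if any) go to the front so they never win
    .sort_values("last_modified", na_position="first")
    # Within each order_id, the last row after sorting is the most recent one
    .drop_duplicates(subset=["order_id"], keep="last")
    # Restore a readable order and a clean index
    .sort_values("order_id")
    .reset_index(drop=True)
)
```

If two rows for the same order share an identical last_modified, which one survives depends on the sort; passing kind="stable" to sort_values() keeps tied rows in their original relative order.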