
Finding and removing duplicates

Module 10 · Data Cleaning and Transformation · 6 min read · Beginner

What you'll learn

  • Find duplicate rows
  • Drop duplicates by all or some columns
  • Pick which duplicate to keep (first/last/by ranking)

Find them

df.duplicated().sum()                          # how many full-row dupes
df.duplicated(subset=["email"]).sum()          # dupes by email
df[df.duplicated(subset=["email"], keep=False)]   # show all dupe groups
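The three calls above can be seen end to end on a tiny, hypothetical contact table (the emails and plans here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: a@x.com appears three times, twice as an exact-row dupe
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan":  ["free",    "pro",     "free",    "pro"],
})

n_full_dupes = df.duplicated().sum()                   # rows identical in every column → 1
n_email_dupes = df.duplicated(subset=["email"]).sum()  # repeats of an email after its first appearance → 2
dupe_groups = df[df.duplicated(subset=["email"], keep=False)]  # all 3 a@x.com rows
```

Note that `keep=False` marks every member of a duplicate group, which is why it's the form to use when you want to inspect the groups before deciding what to drop.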

Drop them

df.drop_duplicates()                          # exact-row dupes
df.drop_duplicates(subset=["email"])          # by email, keep first
df.drop_duplicates(subset=["email"], keep="last")

Keep the "best" duplicate

If a customer has multiple rows and you want the most recent:

df = (df.sort_values("date")
        .drop_duplicates(subset=["email"], keep="last"))
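A minimal sketch of that sort-then-drop pattern, using made-up customer rows (the `date` and `plan` values are assumptions for the example):

```python
import pandas as pd

# Hypothetical data: a@x.com has an old "free" row and a newer "pro" row
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "date":  pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"]),
    "plan":  ["free", "pro", "pro"],
})

latest = (df.sort_values("date")                             # oldest first
            .drop_duplicates(subset=["email"], keep="last"))  # last = newest per email
```

After the sort, the newest row for each email sits last in its group, so `keep="last"` retains exactly that row.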

Walkthrough: dedupe a contact list keeping the highest-value entry

Sort by the column that defines "best"

df = df.sort_values("lifetime_value", ascending=False)

Drop, keeping the first (= the highest LTV)

df = df.drop_duplicates(subset=["email"], keep="first")
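Putting both walkthrough steps together on a small, invented contact list (the `lifetime_value` figures are assumptions for illustration):

```python
import pandas as pd

# Hypothetical contact list: a@x.com appears twice with different lifetime values
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "lifetime_value": [120.0, 450.0, 90.0],
})

best = (df.sort_values("lifetime_value", ascending=False)      # highest LTV first
          .drop_duplicates(subset=["email"], keep="first"))    # first = highest per email
```

Because the sort puts the highest-value row first within each email group, `keep="first"` (which is also the default) retains the 450.0 row for a@x.com.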

Key takeaways

  • duplicated() returns a boolean Series marking duplicate rows; drop_duplicates() returns a copy with them removed.
  • subset= lets you check duplicates by chosen columns only.
  • Sort first, then drop_duplicates with keep="first" to keep the "best" one.

Dedupe orders

Given an orders table, dedupe by order_id, keeping the row with the latest last_modified.
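One possible solution sketch, following the same sort-then-drop pattern from the walkthrough. The table below is hypothetical; the exercise only specifies the `order_id` and `last_modified` columns, and this assumes `last_modified` is (or can be parsed as) a datetime:

```python
import pandas as pd

# Hypothetical orders table: order 101 was modified twice
orders = pd.DataFrame({
    "order_id": [101, 101, 102],
    "last_modified": pd.to_datetime(["2024-05-01", "2024-05-03", "2024-05-02"]),
    "status": ["pending", "shipped", "shipped"],
})

deduped = (orders.sort_values("last_modified")                    # oldest first
                 .drop_duplicates(subset=["order_id"], keep="last"))  # keep latest per order
```

Sorting ascending and keeping the last row per `order_id` leaves one row per order, the one with the latest `last_modified`.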

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.