HomeCourseModule 10 › String cleanup at scale

String cleanup at scale

Module 10 · Data Cleaning and Transformation8 min readBeginner

What you'll learn

  • Apply string methods to whole columns
  • Use regex via .str.replace
  • Split a column into multiple columns

The .str accessor

df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["country"] = df["country"].str.upper()

Anywhere you'd call a string method on one string, you call .str.method() on a column.

Replace

df["amount_text"] = df["amount_text"].str.replace(",", "")
df["amount"] = df["amount_text"].astype(float)

Regex replace

# Keep only digits
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Strip parentheses and hyphens
df["phone"] = df["phone"].str.replace(r"[()\-\s]", "", regex=True)

Split a column

# "Alice Smith" → first / last
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)

Extract with regex

# pull domain from email
df["domain"] = df["email"].str.extract(r"@(.+)$")

Walkthrough: clean a column of free-text phone numbers

Start

df["phone"]
# "(415) 555-1212"
# " 415.555.1213 "
# "415-555-1214 ext.500"

Normalise

df["phone_clean"] = (df["phone"].str.strip()
                                .str.replace(r"\D", "", regex=True)
                                .str[:10])     # keep first 10 digits

Key takeaways

  • .str.method() applies any string method to a whole column.
  • regex=True unlocks powerful pattern-based replacement.
  • .str.split(..., expand=True) splits one column into many.

Email cleanup

Given a column of messy emails, produce a clean lowercase column AND a new column containing just the domain.

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.