HomeCourseModule 10 › Numeric cleanup — currencies, percentages, outliers

Numeric cleanup — currencies, percentages, outliers

Module 10 · Data Cleaning and Transformation7 min readBeginner

What you'll learn

  • Convert currency-formatted strings to floats
  • Convert percent-formatted strings to floats
  • Cap or remove outliers

Currency strings → floats

df["amount"] = (df["amount"].str.replace("[$,]", "", regex=True)
                              .astype(float))

Percent strings → floats

# "12.5%" → 0.125
df["margin"] = (df["margin"].str.replace("%", "")
                              .astype(float) / 100)

Capping outliers (winsorise)

lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount_capped"] = df["amount"].clip(lower=lo, upper=hi)

Removing outliers

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5*iqr, q3 + 1.5*iqr)
clean = df[mask]

Catching obvious typos

# Negative ages, ages over 120, etc.
clean = df.query("0 < age < 120")

Key takeaways

  • Strip $ and , with a regex, then .astype(float).
  • .clip(lower, upper) caps without removing rows.
  • IQR rule (Q1 − 1.5·IQR, Q3 + 1.5·IQR) catches typical outliers.

Cap and report

Cap the top 1% of values in a numeric column. Print how many rows were affected.

📹 Video walkthrough
A video walkthrough of this lesson will be embedded here. Until then, the written walkthrough above mirrors what the video will cover step-for-step.