Wrangling Missing Data: Techniques


Understanding Missing Data

Understanding why data is missing helps in choosing the best strategy for handling it.

Missing data can be categorized as:

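  • Missing completely at random (MCAR): the chance that a value is missing is unrelated to any data, observed or unobserved
  • Missing at random (MAR): the chance that a value is missing depends on the observed values of other variables, but not on the missing value itself
  • Missing not at random (MNAR): the chance that a value is missing depends on the unobserved value itself, or on something that was not recorded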
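
To make the three categories concrete, the following is a minimal sketch that simulates each one on synthetic data. The columns ("age", "income"), the probabilities, and the sample size are all assumptions chosen for illustration and have nothing to do with the Titanic dataset used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: income is partly driven by age
df = pd.DataFrame({"age": rng.integers(18, 80, size=n)})
df["income"] = 20_000 + 500 * df["age"] + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends on the observed age column
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.5, 0.05)

# MNAR: the chance of a missing income depends on the income value itself
above_median = df["income"] > df["income"].median()
mnar_mask = rng.random(n) < np.where(above_median, 0.4, 0.05)

# Series.mask() replaces the entries where the mask is True with NaN
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)

# Compare the mean of the observed values with the true mean
print("true:", df["income"].mean())
print("MCAR:", income_mcar.mean())
print("MAR: ", income_mar.mean())
print("MNAR:", income_mnar.mean())
```

Under MNAR, high incomes go missing more often in this setup, so the mean of the remaining values underestimates the true mean. Under MCAR, the observed mean should stay close to it.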

Identifying Missing Values in the Titanic Dataset

Missing values can be identified by chaining the isnull() and sum() methods of a Pandas DataFrame:

import seaborn as sns
import pandas as pd


# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)

Output:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
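
Raw counts are easier to judge as a share of all rows. The sketch below uses isnull().mean() for that. It runs on a small hypothetical frame (toy_df, with made-up values) rather than the Titanic data, and the 50% cut-off is an assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns where more than half of the entries are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

Here only "deck" is flagged. Applied to titanic_df, the same call would show that "deck" is missing in 688 of the 891 rows of the seaborn copy of the dataset, roughly 77%.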

Strategies to Handle Missing Data

Broadly, there are three main strategies:

  • Deletion: removing rows or columns that contain missing data
    • can discard valuable information
  • Imputation: filling missing values with substitutes, such as the column's mean, median, or mode (its most common value)
  • Prediction: using a predictive model to estimate missing values
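
The three strategies can be sketched side by side on a small hypothetical frame. The column names and values below are invented for illustration. The straight-line fit via np.polyfit is only a stand-in for "a predictive model", assuming a roughly linear relationship between the two numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: height is fully observed, weight and city have gaps
toy_df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, np.nan, 70.0, np.nan, 90.0],
    "city": ["A", "B", None, "B", "B"],
})

# Deletion: keep only rows without any missing value
deleted = toy_df.dropna()

# Imputation: median for a numeric column, mode for a categorical one
imputed = toy_df.copy()
imputed["weight"] = imputed["weight"].fillna(imputed["weight"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Prediction: fit weight ~ height on the complete rows, then fill the gaps
known = toy_df.dropna(subset=["weight"])
slope, intercept = np.polyfit(known["height"], known["weight"], deg=1)
predicted = toy_df.copy()
gaps = predicted["weight"].isna()
predicted.loc[gaps, "weight"] = slope * predicted.loc[gaps, "height"] + intercept

print(deleted)
print(imputed)
print(predicted)
```

Deletion keeps only two of the five rows here, which illustrates the information loss noted above. Median imputation fills both weight gaps with the same value, while the fitted line gives each gap a value that follows the height of that row.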

Handling Missing Data in the Titanic Dataset

In the following example:

  • for "age", fill missing entries with the median passenger age
  • for "deck", delete the entire column (since most entries are missing)

# Dealing with missing values

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

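# Imputing median age for missing age data
# (assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions)
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())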

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)


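# Alternatively, remove every row that still has a missing value
# (this MIGHT remove a significant portion of the data).
# dropna() returns a new DataFrame, so the result has to be assigned
complete_rows_df = new_titanic_df.dropna()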

Output:

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64