It is important to understand why data might be missing, because the cause helps in choosing the best strategy for handling it. Missing data is commonly categorized as:
- Missing Completely at Random (MCAR): the missingness is unrelated to any observed or unobserved values.
- Missing at Random (MAR): the missingness is related to other observed variables.
- Missing Not at Random (MNAR): the missingness depends on the unobserved value itself.
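For instance, one quick way to probe whether missingness in one column tracks another column (which would point toward MAR rather than MCAR) is to compare missing-value rates across groups. This is only a sketch, using the Titanic dataset that the rest of this section works with:
import seaborn as sns
import pandas as pd
# Load the Titanic dataset used throughout this section
titanic_df = sns.load_dataset('titanic')
# Fraction of missing 'age' values within each passenger class;
# clearly different rates suggest the missingness is not completely at random
age_missing_rate = titanic_df['age'].isnull().groupby(titanic_df['pclass']).mean()
print(age_missing_rate)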
We can easily identify missing data using the isnull() and sum() methods from the Pandas library:
import seaborn as sns
import pandas as pd
# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')
# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)

Output:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64

Broadly, there are three main strategies, all of which appear in the example below:
- Drop rows that contain missing values.
- Drop columns where most of the values are missing.
- Impute (fill in) missing values, for example with the column's median.
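Before choosing among these, it can help to look at missing values as a fraction of each column rather than as raw counts. A minimal sketch, assuming the titanic_df loaded above:
# Fraction of missing values per column, largest first
missing_fraction = titanic_df.isnull().mean().sort_values(ascending=False)
print(missing_fraction.head())
Here, "deck" is missing for the large majority of rows, while "age" is missing for a much smaller share, which is why the example below treats them differently.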
In the following example:
- "age": fill missing entries with the median passenger age
- "deck": delete the entire column (since most entries are missing)
# Dealing with missing values
# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])
# Imputing median age for missing age data
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())
# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)
# dropna() returns a new DataFrame with rows containing missing values removed
# (this MIGHT remove a significant portion of the data)
new_titanic_df.dropna()

Output:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
embark_town 2
alive 0
alone 0
dtype: int64
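Note that "embarked" and "embark_town" still contain two missing values each. As one possible follow-up (a sketch, not part of the original example), a categorical column can be filled with its most frequent value, or the two affected rows can simply be dropped:
# Fill the remaining categorical gaps with the most frequent value (the mode)
new_titanic_df['embarked'] = new_titanic_df['embarked'].fillna(new_titanic_df['embarked'].mode()[0])
new_titanic_df['embark_town'] = new_titanic_df['embark_town'].fillna(new_titanic_df['embark_town'].mode()[0])
# Verify that no missing values remain
print(new_titanic_df.isnull().sum())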