Wrangling Missing Data: Techniques Applied to the Titanic Dataset

Understanding Missing Data

Understanding why data is missing matters, because the cause determines which handling strategy is appropriate.

Missing data can be categorized as:

Missing completely at random (MCAR): the chance that a value is missing is unrelated to any other data, observed or unobserved
Missing at random (MAR): the chance that a value is missing depends on the observed values of other variables
Missing not at random (MNAR): the chance that a value is missing depends on the missing value itself, or on something that was not recorded
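
To make the three categories concrete, here is a minimal sketch that simulates each mechanism on a synthetic table. The column names mirror the Titanic data, but the values, the probabilities, and the names toy, mcar_mask, mar_mask, and mnar_mask are all assumptions chosen for illustration, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic passengers (illustrative values only)
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "age": rng.normal(30, 12, size=n).clip(1, 80),
})

# MCAR: every age has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends on another observed column (pclass)
mar_mask = rng.random(n) < np.where(toy["pclass"] == 3, 0.4, 0.05)

# MNAR: the chance depends on the age value that goes missing
mnar_mask = rng.random(n) < np.where(toy["age"] > 40, 0.5, 0.05)

# Series.mask() replaces the entries where the mask is True with NaN
age_mcar = toy["age"].mask(mcar_mask)
age_mar = toy["age"].mask(mar_mask)
age_mnar = toy["age"].mask(mnar_mask)

# Under MAR, the missing rate differs between passenger classes
print(age_mar.isnull().groupby(toy["pclass"]).mean())
```

In real data the mechanism is rarely known for certain. Comparing missing rates across groups, as in the last line, can suggest MAR, but MNAR cannot be confirmed from the observed values alone.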

Identifying Missing Values in the Titanic Dataset

You can identify missing data by chaining the isnull() and sum() methods of a Pandas DataFrame:

import seaborn as sns
import pandas as pd


# Import Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify missing values
missing_values = titanic_df.isnull().sum()
print(missing_values)

Output:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
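
Raw counts are easier to judge as a share of all rows. The sketch below uses a small hand-made DataFrame (sample_df and its values are assumptions for illustration) so that it runs without downloading anything; the same isnull().mean() call should work on titanic_df.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample, not real Titanic records
sample_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of missing entries per column
missing_share = sample_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high share of missing values are candidates for removal, while columns with a low share are usually better suited to imputation.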

Strategies to Handle Missing Data

Broadly, there are three main strategies:

Deletion: involves removing the rows or columns that contain missing data
- might lead to the loss of valuable information
Imputation: involves filling missing values with substitutes such as the column's mean, median, or mode (its most common value)
Prediction: involves using a predictive model to estimate missing values
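
The three strategies can be compared side by side on a tiny table. This is a sketch under stated assumptions: the data is made up, and the "predictive model" is a plain straight-line fit with numpy.polyfit, chosen only because it is short. A real project would likely use a richer model and more features.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which age happens to equal fare + 10
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [20.0, 30.0, np.nan, 50.0, np.nan],
})

# Deletion: drop the rows where age is missing
deleted = toy.dropna(subset=["age"])

# Imputation: fill missing ages with the median of the known ages
imputed = toy.assign(age=toy["age"].fillna(toy["age"].median()))

# Prediction: fit age as a linear function of fare on the known rows,
# then use the fitted line to estimate the missing ages
known = toy.dropna(subset=["age"])
slope, intercept = np.polyfit(known["fare"], known["age"], deg=1)
predicted = toy.assign(age=toy["age"].fillna(slope * toy["fare"] + intercept))

print(deleted, imputed, predicted, sep="\n\n")
```

Deletion shrinks the table, median imputation gives every missing row the same value, and the fitted line gives each missing row its own estimate based on fare.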

Handling Missing Data in the Titanic Dataset

In the following example:

for "age", fill the missing entries with the median passenger age
for "deck", delete the entire column (since most of its entries are missing)

# Dealing with missing values

# Dropping columns with excessive missing data
new_titanic_df = titanic_df.drop(columns=['deck'])

# Imputing median age for missing age data
# (assign the result back: fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions)
new_titanic_df['age'] = new_titanic_df['age'].fillna(new_titanic_df['age'].median())

# Display the number of missing values post-imputation
missing_values_updated = new_titanic_df.isnull().sum()
print(missing_values_updated)

# Alternative: remove rows that still contain missing values
# (dropna() returns a new DataFrame and MIGHT remove a significant portion of data)
complete_rows_df = new_titanic_df.dropna()

Output:

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64
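
The output still shows two missing entries in "embarked" and "embark_town". For a categorical column like these, one option is mode imputation. The sketch below uses a made-up column (ports and its values are assumptions for illustration); the same pattern should apply to new_titanic_df['embarked'].

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes with two gaps
ports = pd.DataFrame({"embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN by default and returns a Series, because ties
# are possible; [0] takes the first of the most common values
most_common_port = ports["embarked"].mode()[0]

# Assign the filled column back to the DataFrame
ports["embarked"] = ports["embarked"].fillna(most_common_port)
print(ports["embarked"].isnull().sum())
```

With only two gaps, dropping those two rows would also be reasonable, and it avoids inventing values.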
