By Ajiboye Toluwalase · Jun 09, 2026


Introduction

The goal of this blog is to shift how we think about data cleaning and feature engineering. Often, data cleaning is viewed simply as a necessary chore to tidy up a dataset, while feature engineering is seen as a way to create new rows and insights to make a model more accurate. That type of thinking isn't wrong, but I propose a different perspective.

Imagine making your data so clean, processed, and insightful that your baseline machine learning algorithm becomes highly accurate before you even touch hyper-parameter tuning. While achieving a perfect baseline is rare, thinking about it from this angle changes the entire workflow.

Consider how babies are fed. When they are born, they are given highly specialized, easily digestible food—first colostrum, then milk, and eventually pureed foods. The trend is clear: the food is heavily processed into a smooth, flowing state so the baby's system can effortlessly process it.

That is exactly how I propose we perform data cleaning and feature engineering. We need to process the data so our "baby" (the model) can digest it without struggling. There are two primary ways to approach this:

1. The General Way

This approach involves cleaning and engineering features generally, keeping a variety of machine learning algorithms in mind. We typically use this method in the early stages of a project when we are still experimenting and searching for the best algorithm to solve the problem at hand. However, once a specific algorithm is selected, this broad approach becomes less effective.

2. The Specialized Way

This approach is used when we know exactly which algorithm we will be deploying. For instance, if we know we are using a linear model like Logistic Regression, we must take its specific quirks into consideration. Because Logistic Regression relies on linear decision boundaries, we clean and engineer our features strictly with that mathematical limitation in mind. By tailoring the data to the algorithm's specific needs, we often see a significant boost in baseline accuracy.

A Case Scenario: The Titanic Dataset

Let's use the Titanic dataset as an example. This is famously the "Hello World" of data science, largely because it serves as a great benchmark to revisit and track your progress as an engineer.

When working with this data, you can apply basic cleaning and feature engineering: dropping columns with too many null values, filling remaining missing values with the mean or mode, and encoding categorical columns. This standard approach works well enough, and usually, engineers move straight to hyper-parameter tuning from there.

However, if you decide to take the Specialized Way, the golden rule applies: you are engineering this data for a specific algorithm, so you don't want to leave the data in a state that makes it difficult for the model to map relationships.

Take a look at the "Gender" column:

Gender
Female
Male
Female
Male
Female
Male

Using a standard one-hot encoding method like pandas.get_dummies, you might transform the data to look like this:

Gender_male Gender_female
0 1
1 0
0 1
1 0
0 1
1 0

While this is technically correct and the algorithm will run, I propose a better way when considering the algorithm's mechanics. Instead, format it like this:

is_female
1
0
1
0
1
0

Instead of forcing the model to evaluate two distinct columns and navigate the "dummy variable trap" (where columns are perfectly collinear), the model now has a single, highly efficient column to determine if the passenger is female or not.

It is simpler, reduces dimensionality, and is much easier for the model to digest.

Related Posts