DropMissingData#

Removing rows with nan values from a dataset is a common practice in data science and machine learning projects.

You are probably familiar with the use of pandas dropna. You basically take a pandas dataframe or a pandas series, apply dropna, and eliminate those rows that contain nan values in one or more columns.

Here, we have an example of that syntax:

import numpy as np
import pandas as pd

X = pd.DataFrame(dict(
       x1 = [np.nan,1,1,0,np.nan],
       x2 = ["a", np.nan, "b", np.nan, "a"],
       ))

X.dropna(inplace=True)
print(X)

The previous code returns a dataframe without missing values:

    x1 x2
2  1.0  b

Feature-engine’s DropMissingData() wraps pandas dropna in a transformer that will remove rows with nan values while adhering to scikit-learn’s fit and transform functionality.

Here we have a snapshot of DropMissingData()’s syntax:

import pandas as pd
import numpy as np
from feature_engine.imputation import DropMissingData

X = pd.DataFrame(dict(
       x1 = [np.nan,1,1,0,np.nan],
       x2 = ["a", np.nan, "b", np.nan, "a"],
       ))

dmd = DropMissingData()
dmd.fit(X)
dmd.transform(X)

The previous code returns a dataframe without missing values:

    x1 x2
2  1.0  b

DropMissingData() therefore allows you to remove null values as part of any scikit-learn feature engineering workflow.

DropMissingData#

DropMissingData() has some advantages over pandas:

It learns and stores the variables for which rows with nan values should be deleted.
It can be used within a Scikit-learn-like pipeline.

With DropMissingData(), you can drop nan values from numerical and categorical variables. In other words, you can remove null values from numerical, categorical or object datatypes.

You have the option to remove nan values from all columns or only from a subset of them. Alternatively, you can remove rows if they have more than a certain percentage of nan values.

Let’s better illustrate DropMissingData()’s functionality through code examples.

Python implementation#

Let’s start by importing pandas and numpy, and creating a toy dataframe with nan values in 2 columns:

import numpy as np
import pandas as pd

from feature_engine.imputation import DropMissingData

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

print(X.head())

Below we see the new dataframe:

    x1   x2  x3
0  2.0    a   2
1  1.0  NaN   3
2  1.0    b   4
3  0.0  NaN   5
4  NaN    a   5

We can drop nan values across all columns as follows:

dmd = DropMissingData()
Xt = dmd.fit_transform(X)
Xt.head()

We see the transformed dataframe without null values:

    x1 x2  x3
0  2.0  a   2
2  1.0  b   4

By default, DropMissingData() will find and store the columns that had missing data during fit, that is, in the training set. They are stored here:

dmd.variables_

Below we see the variables that contained nan values in the training set:

['x1', 'x2']

That means that every time that we apply transform() to a new dataframe, the transformer will remove rows with nan values only in those columns.
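To make the split between fit and transform concrete, here is a minimal pandas-only sketch of the behaviour described above. The class name TinyDropNA and the `new` dataframe are hypothetical, and this is not Feature-engine's implementation; it only assumes that fit records the columns with nan and that transform drops rows with nan in those columns.

```python
import numpy as np
import pandas as pd


class TinyDropNA:
    """Hypothetical stand-in that mimics the fit/transform behaviour described above."""

    def fit(self, X):
        # Remember only the columns that contain nan in the training data.
        self.variables_ = [col for col in X.columns if X[col].isna().any()]
        return self

    def transform(self, X):
        # Drop rows with nan, looking only at the columns learned during fit.
        return X.dropna(subset=self.variables_)


train = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)

# New data: row 0 has nan only in x3, row 1 has nan in x1, row 2 is complete.
new = pd.DataFrame(
    dict(
        x1=[1.0, np.nan, 3.0],
        x2=["a", "b", "c"],
        x3=[np.nan, 2.0, 3.0],
    )
)

tiny = TinyDropNA().fit(train)
new_t = tiny.transform(new)

print(tiny.variables_)
print(new_t)
```

In this sketch, the row with nan in "x3" survives, because "x3" had no nan during fit and is therefore not examined.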

If we want to force DropMissingData() to drop na across all columns, regardless of whether they had nan values during fit, we need to set up the class like this:

dmd = DropMissingData(missing_only=False)
Xt = dmd.fit_transform(X)

Now, when we explore the attribute variables_, we see that all the variables in the train set are stored and hence will be used to remove nan values:

dmd.variables_

Below we see the variables seen in the training set, regardless of whether they had nan:

['x1', 'x2', 'x3']

Adjust target after dropna#

DropMissingData() has the option to remove rows with nan from both the training set and the target variable. This way, we obtain a target that is aligned with the resulting dataframe after the transformation.

The method transform_x_y removes rows with null values from the train set, and then realigns the target. Let’s take a look:

Xt, yt = dmd.transform_x_y(X, y)
Xt

Below we see the dataframe without nan:

    x1 x2  x3
0  2.0  a   2
2  1.0  b   4

Let’s now display the transformed target:

yt

Here we see the target with those rows corresponding to the rows in the transformed dataframe:

0    1
2    3
dtype: int64

Let’s check that the shape of the transformed dataframe and target are the same:

Xt.shape, yt.shape

We see that the resulting training set and target each have 2 rows, instead of the 5 original rows.

((2, 3), (2,))
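For intuition, the realignment can be sketched with pandas alone. This is an illustration that assumes the target is matched to the surviving rows through the index; it is not necessarily how transform_x_y is implemented.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

# Drop rows with nan in any column.
Xt = X.dropna()

# Keep only the target entries whose index survived in Xt.
yt = y.loc[Xt.index]

print(Xt.shape, yt.shape)
print(yt)
```

Matching on the index only works if X and y share the same index to begin with, which is the case for the toy data used here.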

Return the rows with nan#

When we have a model in production, it might be useful to know which rows are being removed by the transformer. We can obtain that information as follows:

dmd.return_na_data(X)

The previous command returns the rows with nan. In other words, it does the opposite of transform(), or pandas.dropna.

    x1   x2  x3
1  1.0  NaN   3
3  0.0  NaN   5
4  NaN    a   5
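The same selection can be approximated in plain pandas with a boolean mask. The sketch below assumes that all three columns are examined, as is the case after fitting with missing_only=False; the names `variables`, `mask` and `na_rows` are only illustrative.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)

# Columns to examine (assumed here to be all of them).
variables = ["x1", "x2", "x3"]

# True for every row that has nan in at least one examined column.
mask = X[variables].isna().any(axis=1)

# Rows that would be removed, and rows that would be kept.
na_rows = X[mask]
kept_rows = X[~mask]

print(na_rows)
```

Together, `na_rows` and `kept_rows` account for every row of the original dataframe.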

Dropna from subset of variables#

We can choose to remove missing data only from a specific column or group of columns. We just need to pass the column name or names to the variables parameter:

Here, we’ll dropna from the variables “x1” and “x3”.

dmd = DropMissingData(variables=["x1", "x3"], missing_only=False)
Xt = dmd.fit_transform(X)
Xt.head()

Below, we see the transformed dataframe. It removed the rows with nan in “x1”, and we see that those rows with nan in “x2” are still in the dataframe:

    x1   x2  x3
0  2.0    a   2
1  1.0  NaN   3
2  1.0    b   4
3  0.0  NaN   5

Only rows with nan in “x1” and “x3” are removed. We can corroborate that by examining the variables_ attribute, which in this case stores ['x1', 'x3'].

Note

When you indicate which variables should be examined to remove rows with nan, make sure you set the parameter missing_only to the boolean False. Otherwise, DropMissingData() will select from your list only those variables that showed nan values in the train set.

See for example what happens when we set up the class like this:

dmd = DropMissingData(variables=["x1", "x3"], missing_only=True)
Xt = dmd.fit_transform(X)
dmd.variables_

Note that we indicated that we wanted to remove nan from “x1” and “x3”. Yet, only “x1” has nan in X. So the transformer learns that nan should only be dropped from “x1”:

['x1']

DropMissingData() took the 2 variables indicated in the list, and stored only the one that showed nan during fit. That means that when transforming future dataframes, it will only remove rows with nan in “x1”.

In other words, if you pass a list of variables to examine and set missing_only=True, and some of the variables in your list do not have missing data in the train set, rows with missing data in those particular variables will not be removed during transform.

When missing_only=True, the transformer “double checks” that the entered variables have missing data in the train set. If not, it ignores them during transform().

It is recommended to use missing_only=True only when not passing a list of variables.
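The interaction between the variables list and missing_only can be summarised with a small pandas-only sketch. The helper select_variables is hypothetical and only mirrors the behaviour described in this section; it is not part of Feature-engine.

```python
import numpy as np
import pandas as pd


def select_variables(X, variables, missing_only):
    """Hypothetical helper: decide which columns will be examined for nan."""
    if missing_only:
        # Keep only the listed columns that show nan in the training data.
        return [var for var in variables if X[var].isna().any()]
    # Otherwise keep every listed column, with or without nan.
    return list(variables)


X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)

with_check = select_variables(X, ["x1", "x3"], missing_only=True)
without_check = select_variables(X, ["x1", "x3"], missing_only=False)

print(with_check)
print(without_check)
```

With the check enabled, "x3" is dropped from the list because it has no nan in X, so future nan values in "x3" would go unnoticed.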

Dropna based on percentage of non-nan values#

We can set DropMissingData() to require a percentage of non-NA values in a row to keep it. We can control this behaviour through the threshold parameter, which plays the same role as pandas.dropna’s thresh parameter, but is expressed as a fraction of the variables rather than as a count.

If threshold=1, all variables need to have data to keep a row. If threshold=0.5, 50% of the variables need to have data to keep a row. If threshold=0.1, 10% of the variables need to have data to keep the row. If threshold=None, rows with NA in any of the variables will be dropped.

Let’s see this with an example. We create a new dataframe that has different proportion of non-nan values in every row.

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, np.nan, np.nan],
        x2=["a", np.nan, "b", np.nan, np.nan],
        x3=[2, 3, 4, 5, np.nan],
    )
)

X

We see that the bottom row has nan in all columns, row 3 has nan in 2 of 3 columns, and row 1 has nan in 1 variable:

    x1   x2   x3
0  2.0    a  2.0
1  1.0  NaN  3.0
2  1.0    b  4.0
3  NaN  NaN  5.0
4  NaN  NaN  NaN

Now, we can set DropMissingData() to drop rows if more than 50% of their values are nan:

dmd = DropMissingData(threshold=.5)
dmd.fit(X)
dmd.transform(X)

We see that the last 2 rows are dropped, because they have more than 50% nan values.

    x1   x2   x3
0  2.0    a  2.0
1  1.0  NaN  3.0
2  1.0    b  4.0

Instead, we can set DropMissingData() to drop rows if more than 70% of their values are nan as follows:

dmd = DropMissingData(threshold=.3)
dmd.fit(X)
dmd.transform(X)

Now we see that only the last row was removed.

    x1   x2   x3
0  2.0    a  2.0
1  1.0  NaN  3.0
2  1.0    b  4.0
3  NaN  NaN  5.0
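The arithmetic behind both results can be checked by computing the share of non-nan values per row. The helper drop_by_threshold is hypothetical, and the sketch assumes that a row is kept when its share of non-nan values is at least the threshold.

```python
import numpy as np
import pandas as pd


def drop_by_threshold(X, threshold):
    """Hypothetical helper: keep rows with enough non-nan values."""
    # Share of non-nan values in each row, between 0 and 1.
    non_nan_share = X.notna().mean(axis=1)
    return X[non_nan_share >= threshold]


X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, np.nan, np.nan],
        x2=["a", np.nan, "b", np.nan, np.nan],
        x3=[2, 3, 4, 5, np.nan],
    )
)

# Non-nan shares per row are 1, 2/3, 1, 1/3 and 0.
half = drop_by_threshold(X, 0.5)
third = drop_by_threshold(X, 0.3)
full = drop_by_threshold(X, 1)

print(half.index.tolist())
print(third.index.tolist())
print(full.index.tolist())
```

Row 3 has 1 of 3 values present (about 33%), so it is dropped at threshold=0.5 and kept at threshold=0.3.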

Scikit-learn compatible#

DropMissingData() is fully compatible with the Scikit-learn API, so you will find common methods that you also find in Scikit-learn transformers, like, for example, the get_feature_names_out() method to obtain the variable names in the transformed dataframe.

Pipeline#

When we dropna from a dataframe, we then need to realign the target. We saw previously that we can do that by using the method transform_x_y.

We can align the target with the resulting dataframe automatically from within a pipeline as well, by utilizing Feature-engine’s pipeline.

Let’s start by importing the necessary libraries:

import numpy as np
import pandas as pd

from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline

Let’s create a new dataframe with nan values in some rows, two numerical and one categorical variable, and its corresponding target variable:

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

X.head()

Below, we see the resulting dataframe:

    x1   x2  x3
0  2.0    a   2
1  1.0  NaN   3
2  1.0    b   4
3  0.0  NaN   5
4  NaN    a   5

Let’s now set up a pipeline to dropna first, and then encode the categorical variable by using ordinal encoding:

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OrdinalEncoder(encoding_method="arbitrary")),
    ]
)

pipe.fit_transform(X, y)

When we apply fit and transform or fit_transform, we will obtain the transformed training set only:

    x1  x2  x3
0  2.0   0   2
2  1.0   1   4

To obtain the transformed training set and target, we use transform_x_y:

pipe.fit(X,y)
Xt, yt = pipe.transform_x_y(X, y)
Xt

Here we see the transformed training set:

    x1  x2  x3
0  2.0   0   2
2  1.0   1   4

Let’s now display the transformed target:

yt

Here we see the re-aligned target variable:

0    1
2    3

To wrap up, let’s add an estimator to the pipeline:

import numpy as np
import pandas as pd

from sklearn.linear_model import Lasso

from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline

df = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
        x3=[2, 3, 4, 5, 5],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = Pipeline(
    [
        ("drop", DropMissingData()),
        ("enc", OrdinalEncoder(encoding_method="arbitrary")),
        ("lasso", Lasso(random_state=2))
    ]
)

pipe.fit(df, y)
pipe.predict(df)

In the following output we see the predictions made by the pipeline:

array([2., 2.])

Dropna or fillna?#

DropMissingData() has the same functionality as pandas.Series.dropna or pandas.DataFrame.dropna. If you want functionality compatible with pandas.fillna instead, check out our missing data imputation transformers.

Drop columns with nan#

At the moment, Feature-engine does not have transformers that will find columns with a certain percentage of missing values and drop them. Instead, you can find those columns manually, and then drop them with the help of DropFeatures from the selection module.
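As a rough illustration of the manual step, the following pandas-only sketch finds the columns whose share of nan exceeds a cut-off and drops them. The 0.5 cut-off and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, np.nan, np.nan],
        x2=["a", np.nan, "b", np.nan, np.nan],
        x3=[2, 3, 4, 5, np.nan],
    )
)

# Share of nan values in each column: x1 -> 0.4, x2 -> 0.6, x3 -> 0.2.
nan_share = X.isna().mean()

# Hypothetical cut-off: drop columns with more than 50% nan.
max_nan_share = 0.5
to_drop = nan_share[nan_share > max_nan_share].index.tolist()

# Here the columns are dropped with pandas directly.
X_reduced = X.drop(columns=to_drop)

print(to_drop)
print(X_reduced.columns.tolist())
```

The list `to_drop` could then be handed to DropFeatures instead of calling pandas drop, so that the same columns are removed consistently inside a pipeline.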

Additional resources#

For tutorials about missing data imputation methods check out these resources:

Feature Engineering for Machine Learning, online course.
Feature Engineering for Time Series Forecasting, online course.
Python Feature Engineering Cookbook, book.

Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
