.. _drop_missing_data:

.. currentmodule:: feature_engine.imputation


DropMissingData
===============
Removing rows with nan values from a dataset is a common practice in data science and
machine learning projects.

You are probably familiar with pandas `dropna()`: you take a pandas dataframe or
series, apply `dropna()`, and eliminate the rows that contain nan values in one or
more columns.

Here is an example of that syntax:
.. code:: python

    import numpy as np
    import pandas as pd

    X = pd.DataFrame(dict(
        x1=[np.nan, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    ))

    X.dropna(inplace=True)

    print(X)
The previous code returns a dataframe without missing values:

.. code:: python

        x1 x2
    2  1.0  b
Feature-engine's :class:`DropMissingData()` wraps pandas `dropna()` in a transformer
that removes rows with nan values while adhering to scikit-learn's `fit` and
`transform` functionality.

Here we have a snapshot of :class:`DropMissingData()`'s syntax:
.. code:: python

    import pandas as pd
    import numpy as np

    from feature_engine.imputation import DropMissingData

    X = pd.DataFrame(dict(
        x1=[np.nan, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    ))

    dmd = DropMissingData()
    dmd.fit(X)
    dmd.transform(X)
The previous code returns a dataframe without missing values:

.. code:: python

        x1 x2
    2  1.0  b
:class:`DropMissingData()` therefore allows you to remove null values as part of any
scikit-learn feature engineering workflow.
:class:`DropMissingData()` has some advantages over pandas:

- It learns and stores the variables for which rows with nan values should be deleted.
- It can be used within a scikit-learn pipeline.
With :class:`DropMissingData()`, you can drop nan values from numerical and categorical
variables. In other words, you can remove null values from numerical, categorical or
object data types.

You have the option to remove nan values from all columns or only from a subset of
them. Alternatively, you can remove rows if they have more than a certain percentage of
nan values.

Let's illustrate :class:`DropMissingData()`'s functionality through code examples.
Dropna
^^^^^^

Let's start by importing pandas and numpy, and creating a toy dataframe with nan values
in 2 columns:
.. code:: python

    import numpy as np
    import pandas as pd

    from feature_engine.imputation import DropMissingData

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    print(X.head())
Below we see the new dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5
    4  NaN    a   5
We can drop nan values across all columns as follows:

.. code:: python

    dmd = DropMissingData()

    Xt = dmd.fit_transform(X)

    Xt.head()
We see the transformed dataframe without null values:

.. code:: python

        x1 x2  x3
    0  2.0  a   2
    2  1.0  b   4
By default, :class:`DropMissingData()` will find and store the columns that had
missing data during `fit()`, that is, in the training set. They are stored here:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x2']
That means that every time we apply `transform()` to a new dataframe, the transformer
will remove rows with nan values only in those columns.

If we want to force :class:`DropMissingData()` to drop nan values across all columns,
regardless of whether they had nan values during `fit()`, we need to set up the class
like this:

.. code:: python

    dmd = DropMissingData(missing_only=False)

    Xt = dmd.fit_transform(X)
Now, when we explore the parameter `variables_`, we see that all the variables in the
train set are stored, and hence will be used to remove nan values:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x2', 'x3']
Adjust target after dropna
^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` can also remove rows with nan from both the training set and
the target variable. This way, we obtain a target that is aligned with the resulting
dataframe after the transformation.

The method `transform_x_y` removes rows with null values from the training set, and
then realigns the target. Let's take a look:

.. code:: python

    Xt, yt = dmd.transform_x_y(X, y)

    Xt
Below we see the dataframe without nan:

.. code:: python

        x1 x2  x3
    0  2.0  a   2
    2  1.0  b   4

.. code:: python

    yt

And here we see the target, with only the rows corresponding to the remaining rows in
the transformed dataframe:

.. code:: python

    0    1
    2    3
    dtype: int64
Let's check that the shape of the transformed dataframe and target are the same:

.. code:: python

    Xt.shape, yt.shape

We see that the resulting training set and target each have 2 rows, instead of the 5
original rows:

.. code:: python

    ((2, 3), (2,))
Return the rows with nan
^^^^^^^^^^^^^^^^^^^^^^^^

When we have a model in production, it might be useful to know which rows are being
removed by the transformer. We can obtain that information as follows:

.. code:: python

    dmd.return_na_data(X)

The previous command returns the rows with nan. In other words, it does the opposite
of `transform()`, or pandas `dropna()`.
.. code:: python

        x1   x2  x3
    1  1.0  NaN   3
    3  0.0  NaN   5
    4  NaN    a   5
Dropna from a subset of variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can choose to remove missing data only from a specific column or group of columns.
We just need to pass the column name or names to the `variables` parameter.

Here, we'll drop nan values from the variables "x1" and "x3":

.. code:: python

    dmd = DropMissingData(variables=["x1", "x3"], missing_only=False)

    Xt = dmd.fit_transform(X)

    Xt.head()
Below, we see the transformed dataframe. The rows with nan in "x1" were removed, while
the rows with nan in "x2" are still in the dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5

Only rows with nan in "x1" and "x3" are removed. We can corroborate that by examining
the `variables_` parameter:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x3']
**Important**

When you indicate which variables should be examined to remove rows with nan, make sure
you set the parameter `missing_only` to `False`. Otherwise,
:class:`DropMissingData()` will select from your list only those variables that showed
nan values in the train set.

See, for example, what happens when we set up the class like this:

.. code:: python

    dmd = DropMissingData(variables=["x1", "x3"], missing_only=True)

    Xt = dmd.fit_transform(X)

    dmd.variables_
Note that we indicated that we wanted to remove nan values from "x1" and "x3". Yet,
only "x1" has nan in X. So the transformer learns that nan values should only be
dropped from "x1":

.. code:: python

    ['x1']

:class:`DropMissingData()` took the 2 variables indicated in the list, and stored only
the one that showed nan during `fit()`. That means that when transforming future
dataframes, it will only remove rows with nan in "x1".

In other words, if you pass a list of variables and set `missing_only=True`, and some
of the variables in your list do not have missing data in the train set, missing data
will not be removed during `transform()` for those particular variables.

When `missing_only=True`, the transformer "double checks" that the entered variables
have missing data in the train set. If not, it ignores them during `transform()`.

It is recommended to use `missing_only=True` when not passing a list of variables to
impute.
Dropna based on percentage of non-nan values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can set :class:`DropMissingData()` to require a percentage of non-nan values in a
row to keep it. We control this behaviour through the `threshold` parameter, which is
equivalent to pandas `dropna`'s `thresh` parameter.

If `threshold=1`, all variables need to have data to keep the row. If `threshold=0.5`,
50% of the variables need to have data to keep the row. If `threshold=0.1`, 10% of the
variables need to have data to keep the row. If `threshold=None`, rows with nan in any
of the variables will be dropped.
Let's see this with an example. We create a new dataframe that has a different
proportion of non-nan values in every row:

.. code:: python

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, np.nan, np.nan],
            x2=["a", np.nan, "b", np.nan, np.nan],
            x3=[2, 3, 4, 5, np.nan],
        )
    )

    X
We see that the bottom row has nan in all columns, row 3 has nan in 2 of the 3
columns, and row 1 has nan in 1 variable:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
    3  NaN  NaN  5.0
    4  NaN  NaN  NaN
Now, we can set :class:`DropMissingData()` to drop rows if more than 50% of their
values are nan:

.. code:: python

    dmd = DropMissingData(threshold=.5)

    dmd.fit(X)
    dmd.transform(X)
We see that the last 2 rows are dropped, because they have more than 50% nan values:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
Instead, we can set :class:`DropMissingData()` to drop rows if more than 70% of their
values are nan, as follows:

.. code:: python

    dmd = DropMissingData(threshold=.3)

    dmd.fit(X)
    dmd.transform(X)
Now we see that only the last row was removed:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
    3  NaN  NaN  5.0
Scikit-learn compatible
^^^^^^^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` is fully compatible with the scikit-learn API, so you will
find common methods that you also find in scikit-learn transformers, like, for
example, the `get_feature_names_out()` method to obtain the variable names of the
transformed dataframe.
Pipeline
^^^^^^^^

When we drop nan values from a dataframe, we then need to realign the target. We saw
previously that we can do that by using the method `transform_x_y`.

We can also align the target with the resulting dataframe automatically from within a
pipeline, by utilizing Feature-engine's pipeline.

Let's start by importing the necessary libraries:
.. code:: python

    import numpy as np
    import pandas as pd

    from feature_engine.imputation import DropMissingData
    from feature_engine.encoding import OrdinalEncoder
    from feature_engine.pipeline import Pipeline
Let's create a new dataframe with nan values in some rows, with two numerical and one
categorical variable, and its corresponding target variable:

.. code:: python

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    X.head()
Below, we see the resulting dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5
    4  NaN    a   5
Let's now set up a pipeline to drop nan values first, and then encode the categorical
variable using ordinal encoding:

.. code:: python

    pipe = Pipeline(
        [
            ("drop", DropMissingData()),
            ("enc", OrdinalEncoder(encoding_method="arbitrary")),
        ]
    )

    pipe.fit_transform(X, y)
When we apply `fit` and `transform`, or `fit_transform`, we obtain the transformed
training set only:

.. code:: python

        x1  x2  x3
    0  2.0   0   2
    2  1.0   1   4
To obtain the transformed training set and target, we use `transform_x_y`:

.. code:: python

    pipe.fit(X, y)

    Xt, yt = pipe.transform_x_y(X, y)

    Xt

Here we see the transformed training set:

.. code:: python

        x1  x2  x3
    0  2.0   0   2
    2  1.0   1   4

.. code:: python

    yt

And here we see the realigned target variable:

.. code:: python

    0    1
    2    3
    dtype: int64
And to wrap up, let's add an estimator to the pipeline:

.. code:: python

    import numpy as np
    import pandas as pd

    from sklearn.linear_model import Lasso

    from feature_engine.imputation import DropMissingData
    from feature_engine.encoding import OrdinalEncoder
    from feature_engine.pipeline import Pipeline

    df = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    pipe = Pipeline(
        [
            ("drop", DropMissingData()),
            ("enc", OrdinalEncoder(encoding_method="arbitrary")),
            ("lasso", Lasso(random_state=2)),
        ]
    )

    pipe.fit(df, y)

    pipe.predict(df)

.. code:: python

    array([2., 2.])
Dropna or fillna?
^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` has the same functionality as `pandas.DataFrame.dropna` or
`pandas.Series.dropna`. If you want functionality compatible with `pandas.fillna`
instead, check out our other imputation transformers.
Drop columns with nan
^^^^^^^^^^^^^^^^^^^^^

At the moment, Feature-engine does not have transformers that will find columns with a
certain percentage of missing values and drop them. Instead, you can find those columns
manually, and then drop them with the help of `DropFeatures` from the selection module.
See also
^^^^^^^^

Check out our tutorials on `LagFeatures` and `WindowFeatures` to see how to combine
:class:`DropMissingData()` with lags or rolling windows, to create features for
forecasting.
Tutorials, books and courses
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the following Jupyter notebook, in our accompanying Github repository, you will find
more examples of :class:`DropMissingData()`:

- `Jupyter notebook `_

For tutorials about this and other feature engineering methods, check out our online
course:
.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning










Or read our book:

.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook













Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of
Feature-engine.