# DropMissingData#

Removing rows with nan values from a dataset is a common practice in data science and machine learning projects.

You are probably familiar with the use of pandas dropna. You basically take a pandas dataframe or a pandas series, apply dropna, and eliminate those rows that contain nan values in one or more columns.

Here, we have an example of that syntax:

```
import numpy as np
import pandas as pd
X = pd.DataFrame(dict(
x1 = [np.nan,1,1,0,np.nan],
x2 = ["a", np.nan, "b", np.nan, "a"],
))
X.dropna(inplace=True)
print(X)
```

The previous code returns a dataframe without missing values:

```
x1 x2
2 1.0 b
```

Feature-engine’s `DropMissingData()`

wraps pandas dropna in a transformer that
will remove rows with na values while adhering to scikit-learn’s `fit`

and `transform`

functionality.

Here we have a snapshot of `DropMissingData()`

’s syntax:

```
import pandas as pd
import numpy as np
from feature_engine.imputation import DropMissingData
X = pd.DataFrame(dict(
x1 = [np.nan,1,1,0,np.nan],
x2 = ["a", np.nan, "b", np.nan, "a"],
))
dmd = DropMissingData()
dmd.fit(X)
dmd.transform(X)
```

The previous code returns a dataframe without missing values:

```
x1 x2
2 1.0 b
```

`DropMissingData()`

allows you therefore to remove null values as part of any
scikit-learn feature engineering workflow.

## DropMissingData#

`DropMissingData()`

has some advantages over pandas:

It learns and stores the variables for which rows with nan values should be deleted.

It can be used within a Scikit-learn like pipeline.

With `DropMissingData()`

, you can drop nan values from numerical and categorical
variables. In other words, you can remove null values from numerical, categorical or
object datatypes.

You have the option to remove nan values from all columns or only from a subset of them. Alternatively, you can remove rows if they have more than a certain percentage of nan values.

Let’s better illustrate `DropMissingData()`

’s functionality through code examples.

### Dropna#

Let’s start by importing pandas and numpy, and creating a toy dataframe with nan values in 2 columns:

```
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
X = pd.DataFrame(
dict(
x1=[2, 1, 1, 0, np.nan],
x2=["a", np.nan, "b", np.nan, "a"],
x3=[2, 3, 4, 5, 5],
)
)
y = pd.Series([1, 2, 3, 4, 5])
print(X.head())
```

Below we see the new dataframe:

```
x1 x2 x3
0 2.0 a 2
1 1.0 NaN 3
2 1.0 b 4
3 0.0 NaN 5
4 NaN a 5
```

We can drop nan values across all columns as follows:

```
dmd = DropMissingData()
Xt = dmd.fit_transform(X)
Xt.head()
```

We see the transformed dataframe without null values:

```
x1 x2 x3
0 2.0 a 2
2 1.0 b 4
```

By default, `DropMissingData()`

will find and store the columns that had
missing data during fit, that is, in the training set. They are stored here:

```
dmd.variables_
```

```
['x1', 'x2']
```

That means that every time that we apply `transform()`

to a new dataframe, the
transformer will remove rows with nan values only in those columns.

If we want to force `DropMissingData()`

to drop na across all columns, regardless
of whether they had nan values during fit, we need to set up the class like this:

```
dmd = DropMissingData(missing_only=False)
Xt = dmd.fit_transform(X)
```

Now, when we explore the paramter `variables_`

, we see that all the variables in the
train set are stored, and hence, will be used to remove nan values:

```
dmd.variables_
```

```
['x1', 'x2', 'x3']
```

### Adjust target after dropna#

`DropMissingData()`

has the option to remove rows with nan from both training set
and target variable. Like this, we can obtain a target that is aligned with the
resulting dataframe after the transformation.

The method `transform_x_y`

removes rows with null values from the train set, and then
realigns the target. Let’s take a look:

```
Xt, yt = dmd.transform_x_y(X, y)
Xt
```

Below we see the dataframe without nan:

```
x1 x2 x3
0 2.0 a 2
2 1.0 b 4
```

```
yt
```

And here we see the target with those rows corresponing to the remaining rows in the transformed dataframe:

```
0 1
2 3
dtype: int64
```

Let’s check that the shape of the transformed dataframe and target are the same:

```
Xt.shape, yt.shape
```

We see that the resulting training set and target have each 2 rows, instead of the 5 original rows.

```
((2, 3), (2,))
```

### Return the rows with nan#

When we have a model in production, it might be useful to know which rows are being removed by the transformer. We can obtain that information as follows:

```
dmd.return_na_data(X)
```

The previous command returns the rows with nan. In other words, it does the opposite
of `transform()`

, or pandas.dropna.

```
x1 x2 x3
1 1.0 NaN 3
3 0.0 NaN 5
4 NaN a 5
```

### Dropna from subset of variables#

We can choose to remove missing data only from a specific column or group of columns.
We just need to pass the column name or names to the `variables`

parameter:

Here, we’ll dropna from the variables “x1”, “x3”.

```
dmd = DropMissingData(variables=["x1", "x3"], missing_only=False)
Xt = dmd.fit_transform(X)
Xt.head()
```

Below, we see the transformed dataframe. It removed the rows with nan in “x1”, and we see that those rows with nan in “x2” are still in the dataframe:

```
x1 x2 x3
0 2.0 a 2
1 1.0 NaN 3
2 1.0 b 4
3 0.0 NaN 5
```

Only rows with nan in “x1” and “x3” are removed. We can corroborate that by examining
the `variables_`

parameter:

**Important**

When you indicate which variables should be examined to remove rows with nan, make sure
you set the parameter `missing_only`

to the boolean `False`

. Otherwise,
`DropMissingData()`

will select from your list only those variables that showed
nan values in the train set.

See for example what happens when we set up the class like this:

```
dmd = DropMissingData(variables=["x1", "x3"], missing_only=True)
Xt = dmd.fit_transform(X)
dmd.variables_
```

Note, that we indicated that we wanted to remove nan from “x1”, “x3”. Yet, only “x1” has nan in X. So the transformer learns that nan should be only dropped from “x1”:

```
['x1']
```

`DropMissingData()`

took the 2 variables indicated in the list, and stored
only the one that showed nan in during fit. That means that when transforming future
dataframes, it will only remove rows with nan in “x1”.

In other words, if you pass a list of variables to impute and set `missing_only=True`

,
and some of the variables in your list do not have missing data in the train set,
missing data will not be removed during transform for those particular variables.

When `missing_only=True`

, the transformer “double checks” that the entered
variables have missing data in the train set. If not, it ignores them during
`transform()`

.

It is recommended to use `missing_only=True`

when not passing a list of variables to
impute.

### Dropna based on percentage of non-nan values#

We can set `DropMissingData()`

to require a percentage of non-NA values in a
row to keep it. We can control this behaviour through the `threshold`

parameter, which
is equivalent to pandas.dropna’s `thresh`

parameter.

If `threshold=1`

, all variables need to have data to keep a row. If `threshold=0.5`

,
50% of the variables need to have data to keep a row. If `threshold=0.01`

, 10% of the
variables need to have data to keep the row. If `threshold=None`

, rows with NA in any
of the variables will be dropped.

Let’s see this with an example. We create a new dataframe that has different proportion of non-nan values in every row.

```
X = pd.DataFrame(
dict(
x1=[2, 1, 1, np.nan, np.nan],
x2=["a", np.nan, "b", np.nan, np.nan],
x3=[2, 3, 4, 5, np.nan],
)
)
X
```

We see that the bottom row has nan in all columns, row 3 has nan in 2 of 3 columns, and row 1 has nan in 1 variable:

```
x1 x2 x3
0 2.0 a 2.0
1 1.0 NaN 3.0
2 1.0 b 4.0
3 NaN NaN 5.0
4 NaN NaN NaN
```

Now, we can set `DropMissingData()`

to drop rows if >50% of its values are nan:

```
dmd = DropMissingData(threshold=.5)
dmd.fit(X)
dmd.transform(X)
```

We see that the last 2 rows are dropped, because they have more than 50% nan values.

```
x1 x2 x3
0 2.0 a 2.0
1 1.0 NaN 3.0
2 1.0 b 4.0
```

Instead, we can set class:`DropMissingData()`

to drop rows if >70% of its values are
nan as follows:

```
dmd = DropMissingData(threshold=.3)
dmd.fit(X)
dmd.transform(X)
```

Now we see that only the last row was removed.

```
x1 x2 x3
0 2.0 a 2.0
1 1.0 NaN 3.0
2 1.0 b 4.0
3 NaN NaN 5.0
```

### Scikit-learn compatible#

`DropMissingData()`

is fully compatible with the Scikit-learn API, so you will
find common methods that you also find in Scikit-learn transformers, like, for example,
the `get_feature_names_out()`

method to obtain the variable names in the transformed
dataframe.

### Pipeline#

When we dropna from a dataframe, we then need to realign the target. We saw previously
that we can do that by using the method `transform_x_y`

.

We can align the target with the resulting dataframe automatically from within a pipeline as well, by utilizing Feature-engine’s pipeline.

Let’s start by importing the necessary libraries:

```
import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline
```

Let’s create a new dataframe with nan values in some rows, two numerical and one categorical variable, and its corresponding target variable:

```
X = pd.DataFrame(
dict(
x1=[2, 1, 1, 0, np.nan],
x2=["a", np.nan, "b", np.nan, "a"],
x3=[2, 3, 4, 5, 5],
)
)
y = pd.Series([1, 2, 3, 4, 5])
X.head()
```

Below, we see the resulting dataframe:

```
x1 x2 x3
0 2.0 a 2
1 1.0 NaN 3
2 1.0 b 4
3 0.0 NaN 5
4 NaN a 5
```

Let’s now set up a pipeline to dropna first, and then encode the categorical variable by using ordinal encoding:

```
pipe = Pipeline(
[
("drop", DropMissingData()),
("enc", OrdinalEncoder(encoding_method="arbitrary")),
]
)
pipe.fit_transform(X, y)
```

When we apply `fit`

and `transform`

or `fit_transform`

, we will obtain the transformed
training set only:

```
x1 x2 x3
0 2.0 0 2
2 1.0 1 4
```

To obtain the transform training set and target, we use `transform_x_y`

:

```
pipe.fit(X,y)
Xt, yt = pipe.transform_x_y(X, y)
Xt
```

Here we see the transformed training set:

```
x1 x2 x3
0 2.0 0 2
2 1.0 1 4
```

```
yt
```

And here we see the re-aligned target variable:

```
0 1
2 3
```

And to wrap up, let’s add an estimator to the pipeline:

```
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import Pipeline
df = pd.DataFrame(
dict(
x1=[2, 1, 1, 0, np.nan],
x2=["a", np.nan, "b", np.nan, "a"],
x3=[2, 3, 4, 5, 5],
)
)
y = pd.Series([1, 2, 3, 4, 5])
pipe = Pipeline(
[
("drop", DropMissingData()),
("enc", OrdinalEncoder(encoding_method="arbitrary")),
("lasso", Lasso(random_state=2))
]
)
pipe.fit(df, y)
pipe.predict(df)
```

```
array([2., 2.])
```

### Dropna or fillna?#

`DropMissingData()`

has the same functionality than `pandas.series.dropna`

or
`pandas.dataframe.dropna``

. If you want functionality compatible with `pandas.fillna`

instead, check our other imputation transformers.

### Drop columns with nan#

At the moment, Feature-engine does not have transformers that will find columns with a
certain percentage of missing values and drop them. Instead, you can find those columns
manually, and then drop them with the help of `DropFeatures`

from the selection module.

### See also#

Check out our tutorials on `LagFeatures`

and `WindowFeatures`

to see how to combine
`DropMissingData()`

with lags or rolling windows, to create features for
forecasting.

### Tutorials, books and courses#

In the following Jupyter notebook, in our accompanying Github repository, you will find
more examples using `DropMissingData()`

.

For tutorials about this and other feature engineering methods check out our online course:

Or read our book:

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.