.. _drop_missing_data:

.. currentmodule:: feature_engine.imputation


DropMissingData
===============
Removing rows with nan values from a dataset is a common practice in data science and
machine learning projects.

You are probably familiar with pandas `dropna()`: you take a pandas dataframe or
series, apply `dropna()`, and eliminate the rows that contain nan values in one or
more columns.

Here is an example of that syntax:
.. code:: python

    import numpy as np
    import pandas as pd

    X = pd.DataFrame(dict(
        x1=[np.nan, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    ))

    X.dropna(inplace=True)

    print(X)
The previous code returns a dataframe without missing values:

.. code:: python

        x1 x2
    2  1.0  b
Feature-engine's :class:`DropMissingData()` wraps pandas `dropna()` in a transformer
that removes rows with nan values while adhering to scikit-learn's `fit` and
`transform` functionality.

Here we have a snapshot of :class:`DropMissingData()`'s syntax:
.. code:: python

    import pandas as pd
    import numpy as np

    from feature_engine.imputation import DropMissingData

    X = pd.DataFrame(dict(
        x1=[np.nan, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    ))

    dmd = DropMissingData()
    dmd.fit(X)
    dmd.transform(X)
The previous code returns a dataframe without missing values:

.. code:: python

        x1 x2
    2  1.0  b
:class:`DropMissingData()` therefore allows you to remove null values as part of any
scikit-learn feature engineering workflow.
:class:`DropMissingData()` has some advantages over pandas:

- It learns and stores the variables for which rows with nan values should be deleted.
- It can be used within a scikit-learn pipeline.
With :class:`DropMissingData()`, you can drop nan values from numerical and categorical
variables. In other words, you can remove null values from numerical, categorical or
object data types.

You have the option to remove nan values from all columns or only from a subset of
them. Alternatively, you can remove rows if they have more than a certain percentage of
nan values.

Let's illustrate :class:`DropMissingData()`'s functionality through code examples.
Dropna
^^^^^^

Let's start by importing pandas and numpy, and creating a toy dataframe with nan values
in 2 columns:
.. code:: python

    import numpy as np
    import pandas as pd

    from feature_engine.imputation import DropMissingData

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    print(X.head())
Below we see the new dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5
    4  NaN    a   5
We can drop nan values across all columns as follows:

.. code:: python

    dmd = DropMissingData()

    Xt = dmd.fit_transform(X)

    Xt.head()
We see the transformed dataframe without null values:

.. code:: python

        x1 x2  x3
    0  2.0  a   2
    2  1.0  b   4
By default, :class:`DropMissingData()` will find and store the columns that had
missing data during `fit()`, that is, in the training set. They are stored here:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x2']
That means that every time we apply `transform()` to a new dataframe, the transformer
will remove rows with nan values only in those columns.

If we want to force :class:`DropMissingData()` to drop nan values across all columns,
regardless of whether they had nan values during `fit()`, we need to set up the class
like this:

.. code:: python

    dmd = DropMissingData(missing_only=False)

    Xt = dmd.fit_transform(X)
Now, when we explore the parameter `variables_`, we see that all the variables in the
train set are stored, and hence will be used to remove nan values:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x2', 'x3']
Adjust target after dropna
^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` can also remove rows with nan from both the training set and
the target variable. This way, we obtain a target that is aligned with the resulting
dataframe after the transformation.

The method `transform_x_y` removes rows with null values from the training set, and
then realigns the target. Let's take a look:

.. code:: python

    Xt, yt = dmd.transform_x_y(X, y)

    Xt
Below we see the dataframe without nan:

.. code:: python

        x1 x2  x3
    0  2.0  a   2
    2  1.0  b   4

.. code:: python

    yt

And here we see the target, with only the rows corresponding to the remaining rows in
the transformed dataframe:

.. code:: python

    0    1
    2    3
    dtype: int64
Let's check that the shape of the transformed dataframe and target are the same:

.. code:: python

    Xt.shape, yt.shape

We see that the resulting training set and target each have 2 rows, instead of the 5
original rows:

.. code:: python

    ((2, 3), (2,))
Return the rows with nan
^^^^^^^^^^^^^^^^^^^^^^^^

When we have a model in production, it might be useful to know which rows are being
removed by the transformer. We can obtain that information as follows:

.. code:: python

    dmd.return_na_data(X)

The previous command returns the rows with nan. In other words, it does the opposite
of `transform()`, or pandas `dropna()`.
.. code:: python

        x1   x2  x3
    1  1.0  NaN   3
    3  0.0  NaN   5
    4  NaN    a   5
Dropna from a subset of variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can choose to remove missing data only from a specific column or group of columns.
We just need to pass the column name or names to the `variables` parameter.

Here, we'll drop nan values from the variables "x1" and "x3":

.. code:: python

    dmd = DropMissingData(variables=["x1", "x3"], missing_only=False)

    Xt = dmd.fit_transform(X)

    Xt.head()
Below, we see the transformed dataframe. The rows with nan in "x1" were removed, while
the rows with nan in "x2" are still in the dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5

Only rows with nan in "x1" and "x3" are removed. We can corroborate that by examining
the `variables_` parameter:

.. code:: python

    dmd.variables_

.. code:: python

    ['x1', 'x3']
**Important**

When you indicate which variables should be examined to remove rows with nan, make sure
you set the parameter `missing_only` to `False`. Otherwise,
:class:`DropMissingData()` will select from your list only those variables that showed
nan values in the train set.

See, for example, what happens when we set up the class like this:

.. code:: python

    dmd = DropMissingData(variables=["x1", "x3"], missing_only=True)

    Xt = dmd.fit_transform(X)

    dmd.variables_
Note that we indicated that we wanted to remove nan values from "x1" and "x3". Yet,
only "x1" has nan in X. So the transformer learns that nan values should only be
dropped from "x1":

.. code:: python

    ['x1']

:class:`DropMissingData()` took the 2 variables indicated in the list, and stored only
the one that showed nan during `fit()`. That means that when transforming future
dataframes, it will only remove rows with nan in "x1".

In other words, if you pass a list of variables and set `missing_only=True`, and some
of the variables in your list do not have missing data in the train set, missing data
will not be removed during `transform()` for those particular variables.

When `missing_only=True`, the transformer "double checks" that the entered variables
have missing data in the train set. If not, it ignores them during `transform()`.

It is recommended to use `missing_only=True` when not passing a list of variables to
impute.
Dropna based on percentage of non-nan values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can set :class:`DropMissingData()` to require a percentage of non-nan values in a
row to keep it. We control this behaviour through the `threshold` parameter, which is
equivalent to pandas `dropna`'s `thresh` parameter.

If `threshold=1`, all variables need to have data to keep the row. If `threshold=0.5`,
50% of the variables need to have data to keep the row. If `threshold=0.1`, 10% of the
variables need to have data to keep the row. If `threshold=None`, rows with nan in any
of the variables will be dropped.
Let's see this with an example. We create a new dataframe that has a different
proportion of non-nan values in every row:

.. code:: python

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, np.nan, np.nan],
            x2=["a", np.nan, "b", np.nan, np.nan],
            x3=[2, 3, 4, 5, np.nan],
        )
    )

    X
We see that the bottom row has nan in all columns, row 3 has nan in 2 of the 3
columns, and row 1 has nan in 1 variable:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
    3  NaN  NaN  5.0
    4  NaN  NaN  NaN
Now, we can set :class:`DropMissingData()` to drop rows if more than 50% of their
values are nan:

.. code:: python

    dmd = DropMissingData(threshold=.5)

    dmd.fit(X)
    dmd.transform(X)
We see that the last 2 rows are dropped, because they have more than 50% nan values:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
Instead, we can set :class:`DropMissingData()` to drop rows if more than 70% of their
values are nan, as follows:

.. code:: python

    dmd = DropMissingData(threshold=.3)

    dmd.fit(X)
    dmd.transform(X)
Now we see that only the last row was removed:

.. code:: python

        x1   x2   x3
    0  2.0    a  2.0
    1  1.0  NaN  3.0
    2  1.0    b  4.0
    3  NaN  NaN  5.0
Scikit-learn compatible
^^^^^^^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` is fully compatible with the scikit-learn API, so you will
find common methods that you also find in scikit-learn transformers, like, for
example, the `get_feature_names_out()` method to obtain the variable names of the
transformed dataframe.
Pipeline
^^^^^^^^

When we drop nan values from a dataframe, we then need to realign the target. We saw
previously that we can do that by using the method `transform_x_y`.

We can also align the target with the resulting dataframe automatically from within a
pipeline, by utilizing Feature-engine's pipeline.

Let's start by importing the necessary libraries:
.. code:: python

    import numpy as np
    import pandas as pd

    from feature_engine.imputation import DropMissingData
    from feature_engine.encoding import OrdinalEncoder
    from feature_engine.pipeline import Pipeline
Let's create a new dataframe with nan values in some rows, with two numerical and one
categorical variable, and its corresponding target variable:

.. code:: python

    X = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    X.head()
Below, we see the resulting dataframe:

.. code:: python

        x1   x2  x3
    0  2.0    a   2
    1  1.0  NaN   3
    2  1.0    b   4
    3  0.0  NaN   5
    4  NaN    a   5
Let's now set up a pipeline to drop nan values first, and then encode the categorical
variable using ordinal encoding:

.. code:: python

    pipe = Pipeline(
        [
            ("drop", DropMissingData()),
            ("enc", OrdinalEncoder(encoding_method="arbitrary")),
        ]
    )

    pipe.fit_transform(X, y)
When we apply `fit` and `transform`, or `fit_transform`, we obtain the transformed
training set only:

.. code:: python

        x1  x2  x3
    0  2.0   0   2
    2  1.0   1   4
To obtain the transformed training set and target, we use `transform_x_y`:

.. code:: python

    pipe.fit(X, y)

    Xt, yt = pipe.transform_x_y(X, y)

    Xt

Here we see the transformed training set:

.. code:: python

        x1  x2  x3
    0  2.0   0   2
    2  1.0   1   4

.. code:: python

    yt

And here we see the realigned target variable:

.. code:: python

    0    1
    2    3
    dtype: int64
And to wrap up, let's add an estimator to the pipeline:

.. code:: python

    import numpy as np
    import pandas as pd

    from sklearn.linear_model import Lasso

    from feature_engine.imputation import DropMissingData
    from feature_engine.encoding import OrdinalEncoder
    from feature_engine.pipeline import Pipeline

    df = pd.DataFrame(
        dict(
            x1=[2, 1, 1, 0, np.nan],
            x2=["a", np.nan, "b", np.nan, "a"],
            x3=[2, 3, 4, 5, 5],
        )
    )

    y = pd.Series([1, 2, 3, 4, 5])

    pipe = Pipeline(
        [
            ("drop", DropMissingData()),
            ("enc", OrdinalEncoder(encoding_method="arbitrary")),
            ("lasso", Lasso(random_state=2)),
        ]
    )

    pipe.fit(df, y)

    pipe.predict(df)

.. code:: python

    array([2., 2.])
Dropna or fillna?
^^^^^^^^^^^^^^^^^

:class:`DropMissingData()` has the same functionality as `pandas.DataFrame.dropna` or
`pandas.Series.dropna`. If you want functionality compatible with `pandas.fillna`
instead, check out our other imputation transformers.
Drop columns with nan
^^^^^^^^^^^^^^^^^^^^^

At the moment, Feature-engine does not have transformers that will find columns with a
certain percentage of missing values and drop them. Instead, you can find those columns
manually, and then drop them with the help of `DropFeatures` from the selection module.
See also
^^^^^^^^

Check out our tutorials on `LagFeatures` and `WindowFeatures` to see how to combine
:class:`DropMissingData()` with lags or rolling windows, to create features for
forecasting.
Tutorials, books and courses
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the following Jupyter notebook, in our accompanying Github repository, you will find
more examples of :class:`DropMissingData()`:

- `Jupyter notebook `_

For tutorials about this and other feature engineering methods, check out our online
course:
.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning










Or read our book:

.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook













Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of
Feature-engine.