.. _string_similarity: .. currentmodule:: feature_engine.encoding StringSimilarityEncoder ======================= The :class:`StringSimilarityEncoder()` replaces categorical variables with a set of float variables that capture the similarity between the category names. The new variables have values between 0 and 1, where 0 indicates no similarity and 1 is an exact match between the names of the categories. To calculate the similarity between the categories, :class:`StringSimilarityEncoder()` uses Gestalt pattern matching. Under the hood, :class:`StringSimilarityEncoder()` uses the `quick_ratio` method from the `SequanceMatcher()` from `difflib`. The similarity is calculated as: .. math:: GPM = 2 M / T where T is the total number of elements in both sequences and M is the number of matches. For example, the similarity between the categories "dog" and "dig" is 0.66. T is the total number of elements in both categories, that is 6. There are 2 matches between the words, the letters d and g, so: 2 * M / T = 2 * 2 / 6 = 0.66. Output of the :class:`StringSimilarityEncoder()` ------------------------------------------------ Let's create a dataframe with the categories "dog", "dig" and "cat": .. code:: python import pandas as pd from feature_engine.encoding import StringSimilarityEncoder df = pd.DataFrame({"words": ["dog", "dig", "cat"]}) df We see the dataframe in the following output: .. code:: python words 0 dog 1 dig 2 cat Let's now encode the variable: .. code:: python encoder = StringSimilarityEncoder() dft = encoder.fit_transform(df) dft We see the encoded variables below: .. code:: python words_dog words_dig words_cat 0 1.000000 0.666667 0.0 1 0.666667 1.000000 0.0 2 0.000000 0.000000 1.0 Note that :class:`StringSimilarityEncoder()` replaces the original variables by the distance variables. :class:`StringSimilarityEncoder()` vs One-hot encoding ------------------------------------------------------ String similarity encoding is similar to one-hot encoding, in the sense that each category is encoded as a new variable. But the values, instead of 1 or 0, are the similarity between the observation's category and the dummy variable. It is suitable for poorly defined (or 'dirty') categorical variables. Encoding only popular categories -------------------------------- The :class:`StringSimilarityEncoder()` can also create similarity variables for the *n* most popular categories, *n* being determined by the user. For example, if we encode only the 6 more popular categories, by setting the parameter `top_categories=6`, the transformer will add variables only for the 6 most frequent categories. The most frequent categories are those with the largest number of observations. This behaviour is useful when the categorical variables are highly cardinal, to control the expansion of the feature space. Specifying how :class:`StringSimilarityEncoder()` should deal with missing values --------------------------------------------------------------------------------- The :class:`StringSimilarityEncoder()` has three options for dealing with missing values, which can be specified with the parameter `missing_values`: 1. Ignore NaNs (option `ignore`) - will leave the NaN in the resulting dataframe after transformation. Could be useful, if the next step in the pipeline is imputation or if the machine learning algorithm can handle missing data out-of-the-box. 2. Impute NaNs (option `impute`) - will impute NaN with an empty string, and then calculate the similarity between the empty string and the variable's categories. Most of the time, the similarity value will be 0 in resulting dataframe. This is the default option. 3. Raise an error (option `raise`) - will raise an error if NaN is present during `fit`, `transform` or `fit_transform`. Could be useful for debugging and monitoring purposes. Important --------- :class:`StringSimilarityEncoder()` will encode unseen categories by out-of-the-box, by measuring the string similarity to the seen categories. No text preprocessing is applied by :class:`StringSimilarityEncoder()`. Be mindful of preparing string categorical variables if needed. :class:`StringSimilarityEncoder()` works with categorical variables by default. And it has the option to encode numerical variables as well. This is useful, when the values of the numerical variables are more useful as strings, than as numbers. For example, for variables like barcode. Examples -------- Let's look at an example using the Titanic Dataset. First we load the data and divide it into a train and a test set: .. code:: python import string from sklearn.model_selection import train_test_split from feature_engine.datasets import load_titanic from feature_engine.encoding import StringSimilarityEncoder def clean_titanic(): translate_table = str.maketrans('' , '', string.punctuation) data = load_titanic() data['home.dest'] = ( data['home.dest'] .str.strip() .str.translate(translate_table) .str.replace(' ', ' ') .str.lower() ) data['name'] = ( data['name'] .str.strip() .str.translate(translate_table) .str.replace(' ', ' ') .str.lower() ) data['ticket'] = ( data['ticket'] .str.strip() .str.translate(translate_table) .str.replace(' ', ' ') .str.lower() ) return data data = clean_titanic() # Separate into train and test sets X_train, X_test, y_train, y_test = train_test_split( data.drop(['survived', 'sex', 'cabin', 'embarked'], axis=1), data['survived'], test_size=0.3, random_state=0 ) X_train.head() Below, we see the first rows of the dataset: .. code:: python pclass name age sibsp parch \ 501 2 mellinger miss madeleine violet 13 0 1 588 2 wells miss joan 4 1 1 402 2 duran y more miss florentina 30 1 0 1193 3 scanlan mr james NaN 0 0 686 3 bradley miss bridget delia 22 0 0 ticket fare boat body \ 501 250644 19.5 14 NaN 588 29103 23 14 NaN 402 scparis 2148 13.8583 12 NaN 1193 36209 7.725 NaN NaN 686 334914 7.725 13 NaN home.dest 501 england bennington vt 588 cornwall akron oh 402 barcelona spain havana cuba 1193 NaN 686 kingwilliamstown co cork ireland glens falls ny Now, we set up the encoder to encode only the 2 most frequent categories of each of the 3 indicated categorical variables: .. code:: python # set up the encoder encoder = StringSimilarityEncoder( top_categories=2, variables=['name', 'home.dest', 'ticket'], ignore_format=True ) # fit the encoder encoder.fit(X_train) With `fit()` the encoder will learn the most popular categories of the variables, which are stored in the attribute `encoder_dict_`. .. code:: python encoder.encoder_dict_ .. code:: python { 'name': ['mellinger miss madeleine violet', 'barbara mrs catherine david'], 'home.dest': ['', 'new york ny'], 'ticket': ['ca 2343', 'ca 2144'] } The `encoder_dict_` contains the categories that will derive similarity variables for each categorical variable. With transform, we go ahead and encode the variables. Note that the :class:`StringSimilarityEncoder()` will drop the original variables. .. code:: python # transform the data train_t = encoder.transform(X_train) test_t = encoder.transform(X_test) test_t.head() Below, we see the resulting dataframe: .. code:: python pclass age sibsp parch fare boat body \ 1139 3 38 0 0 7.8958 NaN NaN 533 2 21 0 1 21 12 NaN 459 2 42 1 0 27 NaN NaN 1150 3 NaN 0 0 14.5 NaN NaN 393 2 25 0 0 31.5 NaN NaN name_mellinger miss madeleine violet name_barbara mrs catherine david \ 1139 0.454545 0.550000 533 0.615385 0.524590 459 0.596491 0.603774 1150 0.641509 0.693878 393 0.408163 0.666667 home.dest_nan home.dest_new york ny ticket_ca 2343 ticket_ca 2144 1139 1.0 0.000000 0.461538 0.461538 533 0.0 0.370370 0.307692 0.307692 459 0.0 0.352941 0.461538 0.461538 1150 1.0 0.000000 0.307692 0.307692 393 0.0 0.437500 0.666667 0.666667 More details ^^^^^^^^^^^^ For more details into :class:`StringSimilarityEncoder()`'s functionality visit: - `Jupyter notebook `_ All notebooks can be found in a `dedicated repository `_.