StringSimilarityEncoder#
The StringSimilarityEncoder() replaces categorical variables with a set of float
variables that capture the similarity between the category names. The new variables
have values between 0 and 1, where 0 indicates no similarity and 1 is an exact
match between the names of the categories.
To calculate the similarity between the categories, StringSimilarityEncoder() uses Gestalt pattern matching. Under the hood, StringSimilarityEncoder() uses the quick_ratio method of the SequenceMatcher class from difflib.
The similarity is calculated as:
\[ \text{similarity} = \frac{2M}{T} \]
where T is the total number of elements in both sequences and M is the number of matches.
For example, the similarity between the categories “dog” and “dig” is approximately 0.67. T is the total number of elements in both categories, that is, 6. There are 2 matches between the words, the letters d and g, so: 2 * M / T = 2 * 2 / 6 ≈ 0.67.
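The arithmetic above can be checked with Python's standard library alone. The following is a minimal sketch that calls quick_ratio() on difflib's SequenceMatcher; the helper name `similarity` is only used here for illustration and is not part of the encoder.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return 2 * M / T, with M counted without regard to character order."""
    return SequenceMatcher(None, a, b).quick_ratio()


print(similarity("dog", "dig"))  # 2 * 2 / 6
print(similarity("dog", "dog"))  # identical strings
print(similarity("dog", "cat"))  # no characters in common
```

Because quick_ratio() only counts how many characters the two strings share, it is cheap to compute, which matters when every observation is compared against every category.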
Output of the StringSimilarityEncoder()#
Let’s create a dataframe with the categories “dog”, “dig” and “cat”:
import pandas as pd
from feature_engine.encoding import StringSimilarityEncoder
df = pd.DataFrame({"words": ["dog", "dig", "cat"]})
df
We see the dataframe in the following output:
words
0 dog
1 dig
2 cat
Let’s now encode the variable:
encoder = StringSimilarityEncoder()
dft = encoder.fit_transform(df)
dft
We see the encoded variables below:
words_dog words_dig words_cat
0 1.000000 0.666667 0.0
1 0.666667 1.000000 0.0
2 0.000000 0.000000 1.0
Note that StringSimilarityEncoder() replaces the original variables with the similarity variables.
StringSimilarityEncoder() vs One-hot encoding#
String similarity encoding is similar to one-hot encoding, in the sense that each category is encoded as a new variable. But the values, instead of 1 or 0, are the similarity between the observation’s category and the dummy variable. It is suitable for poorly defined (or ‘dirty’) categorical variables.
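To make the contrast concrete, here is a small sketch that encodes a misspelled value both ways. It uses only difflib rather than the encoder itself, and the category list, the helper names `one_hot` and `string_similarity`, and the misspelling are all made up for illustration.

```python
from difflib import SequenceMatcher

categories = ["dog", "dig", "cat"]  # categories seen during training (illustrative)


def one_hot(value: str) -> list:
    # 1 only where the value equals the category exactly, 0 otherwise.
    return [int(value == cat) for cat in categories]


def string_similarity(value: str) -> list:
    # Similarity of the value to every category, each between 0 and 1.
    return [SequenceMatcher(None, value, cat).quick_ratio() for cat in categories]


dirty_value = "dogg"  # a misspelling of "dog"
print(one_hot(dirty_value))            # all zeros: the typo matches no category
print(string_similarity(dirty_value))  # highest score for "dog"
```

With one-hot encoding the misspelled value loses all information, whereas the similarity variables still place it close to the category it most resembles.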
Encoding only popular categories#
The StringSimilarityEncoder() can also create similarity variables for the n most popular categories, n being determined by the user. For example, if we encode only the 6 most popular categories, by setting the parameter top_categories=6, the transformer will add variables only for the 6 most frequent categories. The most frequent categories are those with the largest number of observations. This behaviour is useful when the categorical variables have high cardinality, to control the expansion of the feature space.
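The idea can be sketched without the encoder: count the categories, keep the n most frequent ones, and compute similarities against those only. This is an illustration of the behaviour described above, not the library's implementation; the toy column and the variable `top_n` (standing in for top_categories) are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy column with several distinct values (illustrative data).
column = ["london", "london", "london", "paris", "paris", "lyon", "rome"]

top_n = 2  # plays the role of top_categories
top_categories = [cat for cat, _ in Counter(column).most_common(top_n)]
print(top_categories)  # ['london', 'paris']

# One similarity variable per retained category, not per distinct value.
encoded = [
    {cat: SequenceMatcher(None, value, cat).quick_ratio() for cat in top_categories}
    for value in column
]
print(len(encoded[0]))  # 2 variables instead of 4
```

Rare values such as "lyon" and "rome" do not get their own variable, but they are still described by their similarity to the retained categories.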
Specifying how StringSimilarityEncoder() should deal with missing values#
The StringSimilarityEncoder() has three options for dealing with missing values, which can be specified with the parameter missing_values:
Ignore NaNs (option ignore) - will leave the NaN in the resulting dataframe after transformation. This could be useful if the next step in the pipeline is imputation or if the machine learning algorithm can handle missing data out of the box.
Impute NaNs (option impute) - will impute NaN with an empty string, and then calculate the similarity between the empty string and the variable’s categories. Most of the time, the similarity value will be 0 in the resulting dataframe. This is the default option.
Raise an error (option raise) - will raise an error if NaN is present during fit, transform or fit_transform. This could be useful for debugging and monitoring purposes.
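The three options can be mimicked for a single value with a few lines of plain Python. The helper `encode_value` below is hypothetical and only mirrors the behaviour described above; it is not how the encoder is implemented, and it represents a missing value with None for simplicity.

```python
import math
from difflib import SequenceMatcher


def encode_value(value, category, missing_values="impute"):
    """Hypothetical helper mimicking the three documented options."""
    if missing_values not in ("raise", "ignore", "impute"):
        raise ValueError("missing_values must be 'raise', 'ignore' or 'impute'.")
    if value is None:
        if missing_values == "raise":
            raise ValueError("Missing value found in the variable.")
        if missing_values == "ignore":
            return math.nan  # leave the missing value in place
        value = ""  # "impute": replace the missing value with an empty string
    return SequenceMatcher(None, value, category).quick_ratio()


print(encode_value(None, "dog", missing_values="impute"))  # empty string vs "dog"
print(encode_value(None, "dog", missing_values="ignore"))  # stays missing
```

An empty string shares no characters with a non-empty category, so the imputed similarity is 0. The exception is a comparison of an empty string with another empty string, for which difflib returns 1.0.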
Important#
StringSimilarityEncoder() will encode unseen categories out of the box, by measuring their string similarity to the seen categories.
No text preprocessing is applied by StringSimilarityEncoder(). Be mindful of preparing string categorical variables if needed.
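The effect of skipping preprocessing is easy to see on a single pair of strings. The sketch below compares a raw value with a reference category before and after a simple cleaning step; the `clean` helper and the example strings are illustrative assumptions, and other cleaning steps may suit your data better.

```python
import string
from difflib import SequenceMatcher


def clean(text: str) -> str:
    # Strip whitespace, drop punctuation, and lower-case (illustrative choices).
    table = str.maketrans("", "", string.punctuation)
    return text.strip().translate(table).lower()


reference = "new york ny"
raw = " New York, NY "

raw_score = SequenceMatcher(None, raw, reference).quick_ratio()
clean_score = SequenceMatcher(None, clean(raw), reference).quick_ratio()
print(raw_score < clean_score)  # True
print(clean_score)              # 1.0
```

Differences in case, punctuation and surrounding whitespace lower the similarity even when two values refer to the same thing, so it is worth normalizing the text before encoding.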
StringSimilarityEncoder() works with categorical variables by default. It can also encode numerical variables, which is useful when the values of those variables are more meaningful as strings than as numbers, for example, barcodes.
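As a rough illustration of why a number can be more useful as a string, the sketch below casts a few barcodes to text and compares their digits as characters. The barcode values are made up, and the comparison uses difflib directly rather than the encoder.

```python
from difflib import SequenceMatcher

# Illustrative barcodes stored as integers (made-up values).
barcodes = [4006381333931, 4006381333948, 9780201379624]

# Cast to string so that similarity is measured on the digits as characters.
as_text = [str(code) for code in barcodes]

for other in as_text[1:]:
    score = SequenceMatcher(None, as_text[0], other).quick_ratio()
    print(as_text[0], other, round(score, 3))
```

Treated as numbers, the difference between two barcodes carries no meaning, whereas the share of digits they have in common can hint that the codes are related.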
Examples#
Let’s look at an example using the Titanic Dataset. First we load the data and divide it into a train and a test set:
import string
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import StringSimilarityEncoder
def clean_titanic():
    # Translation table that removes all punctuation characters
    translate_table = str.maketrans('', '', string.punctuation)
    data = load_titanic()
    data['home.dest'] = (
        data['home.dest']
        .str.strip()
        .str.translate(translate_table)
        .str.replace('  ', ' ')  # collapse double spaces
        .str.lower()
    )
    data['name'] = (
        data['name']
        .str.strip()
        .str.translate(translate_table)
        .str.replace('  ', ' ')  # collapse double spaces
        .str.lower()
    )
    data['ticket'] = (
        data['ticket']
        .str.strip()
        .str.translate(translate_table)
        .str.replace('  ', ' ')  # collapse double spaces
        .str.lower()
    )
    return data

data = clean_titanic()
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'sex', 'cabin', 'embarked'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)
X_train.head()
Below, we see the first rows of the dataset:
pclass name age sibsp parch \
501 2 mellinger miss madeleine violet 13 0 1
588 2 wells miss joan 4 1 1
402 2 duran y more miss florentina 30 1 0
1193 3 scanlan mr james NaN 0 0
686 3 bradley miss bridget delia 22 0 0
ticket fare boat body \
501 250644 19.5 14 NaN
588 29103 23 14 NaN
402 scparis 2148 13.8583 12 NaN
1193 36209 7.725 NaN NaN
686 334914 7.725 13 NaN
home.dest
501 england bennington vt
588 cornwall akron oh
402 barcelona spain havana cuba
1193 NaN
686 kingwilliamstown co cork ireland glens falls ny
Now, we set up the encoder to encode only the 2 most frequent categories of each of the 3 indicated categorical variables:
# set up the encoder
encoder = StringSimilarityEncoder(
    top_categories=2,
    variables=['name', 'home.dest', 'ticket'],
    ignore_format=True
)
# fit the encoder
encoder.fit(X_train)
With fit() the encoder will learn the most popular categories of the variables, which are stored in the attribute encoder_dict_.
encoder.encoder_dict_
{
'name': ['mellinger miss madeleine violet', 'barbara mrs catherine david'],
'home.dest': ['', 'new york ny'],
'ticket': ['ca 2343', 'ca 2144']
}
The encoder_dict_ contains, for each categorical variable, the categories from which the similarity variables will be derived.
With transform(), we go ahead and encode the variables. Note that the StringSimilarityEncoder() will drop the original variables.
# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
test_t.head()
Below, we see the resulting dataframe:
pclass age sibsp parch fare boat body \
1139 3 38 0 0 7.8958 NaN NaN
533 2 21 0 1 21 12 NaN
459 2 42 1 0 27 NaN NaN
1150 3 NaN 0 0 14.5 NaN NaN
393 2 25 0 0 31.5 NaN NaN
name_mellinger miss madeleine violet name_barbara mrs catherine david \
1139 0.454545 0.550000
533 0.615385 0.524590
459 0.596491 0.603774
1150 0.641509 0.693878
393 0.408163 0.666667
home.dest_nan home.dest_new york ny ticket_ca 2343 ticket_ca 2144
1139 1.0 0.000000 0.461538 0.461538
533 0.0 0.370370 0.307692 0.307692
459 0.0 0.352941 0.461538 0.461538
1150 1.0 0.000000 0.307692 0.307692
393 0.0 0.437500 0.666667 0.666667
More details#
For more details about StringSimilarityEncoder()’s functionality visit:
All notebooks can be found in a dedicated repository.