Extracting Features from Text#

Short pieces of text are often found among the variables in our datasets. For example, in insurance, a text variable can describe the circumstances of an accident. Customer feedback is also stored as a text variable.

While text data as such can’t be used to train machine learning models, we can extract a lot of numerical information from these texts, which can provide predictive features to train machine learning models.

Feature-engine allows you to quickly extract numerical features from short pieces of text, to complement your predictive models. These features aim to capture a piece of text’s complexity by looking at some statistical parameters of the text, such as the word length and count, the number of words and unique words used, the number of sentences, and so on.

TextFeatures() extracts many numerical features from text out-of-the-box.

TextFeatures#

TextFeatures() extracts numerical features from text/string variables. This transformer is useful for extracting basic text statistics that can be used as features in machine learning models. Users must explicitly specify which columns contain text data via the variables parameter.

Unlike scikit-learn’s CountVectorizer or TfidfVectorizer which create sparse matrices, TextFeatures() extracts metadata features that remain in DataFrame format and can be easily combined with other Feature-engine or sklearn transformers in a pipeline.

Text Features#

TextFeatures() can extract the following features from a text piece:

  • char_count: Number of characters in the text

  • word_count: Number of words (whitespace-separated tokens)

  • sentence_count: Number of sentences (based on .!? punctuation)

  • avg_word_length: Average length of words

  • digit_count: Number of digit characters

  • letter_count: Number of alphabetic characters (a-z, A-Z)

  • uppercase_count: Number of uppercase letters

  • lowercase_count: Number of lowercase letters

  • special_char_count: Number of special characters (non-alphanumeric)

  • whitespace_count: Number of whitespace characters

  • whitespace_ratio: Ratio of whitespace to total characters

  • digit_ratio: Ratio of digits to total characters

  • uppercase_ratio: Ratio of uppercase to total characters

  • has_digits: Binary indicator if text contains digits

  • has_uppercase: Binary indicator if text contains uppercase

  • is_empty: Binary indicator if text is empty

  • starts_with_uppercase: Binary indicator if text starts with uppercase

  • ends_with_punctuation: Binary indicator if text ends with .!?

  • unique_word_count: Number of unique words (case-insensitive)

  • lexical_diversity: Ratio of unique words to total words

The number of sentences is inferred by TextFeatures() by counting blocks of sentence-ending punctuation (., !, ?) as a proxy for sentence boundaries. This means that multiple consecutive punctuation marks (e.g., “!!!” or “??”) are counted as a single sentence-ending, which avoids overestimating the count in emphatic text.

However, this is still a simple heuristic. It won’t handle edge cases like abbreviations (e.g., ‘Dr.’, ‘U.S.’, ‘e.g.’, ‘i.e.’) or text without punctuation. These abbreviations will be counted as sentence endings, resulting in an overestimate of the actual sentence count.

The features number of unique words and lexical diversity are intended to capture the complexity of the text. Simpler texts have few unique words and tend to repeat them. More complex texts use a wider array of words and tend not to repeat them. Hence, in more complex texts, both the number of unique words and the lexical diversity are greater.

Handling missing values#

By default, TextFeatures() ignores missing values by treating them as empty strings (missing_values='ignore'). You can change this behavior by setting the parameter to 'raise' if you prefer the transformer to raise an error when encountering missing data.

In this case, missing values will be treated as empty strings, and the numerical features will be calculated accordingly (e.g., word count and character count will be 0) as shown in the following example:

import pandas as pd
import numpy as np
from feature_engine.text import TextFeatures

# Create sample data with NaN
X = pd.DataFrame({
    'text': ['Hello', np.nan, 'World']
})

# Set up the transformer (defaults to ignore missing values)
tf = TextFeatures(
    variables=['text'],
    features=['char_count']
)

# Transform
X_transformed = tf.fit_transform(X)

print(X_transformed)

In the resulting dataframe, we see that the row with NaN returned 0 in the character count:

text  text_char_count
0  Hello                5
1    NaN                0
2  World                5

Python demo#

In this section, we’ll show how to use TextFeatures(). Let’s create a dataframe with text data:

import pandas as pd
from feature_engine.text import TextFeatures

# Create sample data
X = pd.DataFrame({
    'review': [
        'This product is AMAZING! Best purchase ever.',
        'Not great. Would not recommend.',
        'OK for the price. 3 out of 5 stars.',
        'TERRIBLE!!! DO NOT BUY!',
    ],
    'title': [
        'Great Product',
        'Disappointed',
        'Average',
        'Awful',
    ]
})

print(X)

The input dataframe looks like this:

                                       review          title
0  This product is AMAZING! Best purchase ever.  Great Product
1             Not great. Would not recommend.   Disappointed
2       OK for the price. 3 out of 5 stars.        Average
3                     TERRIBLE!!! DO NOT BUY!          Awful

Now let’s extract 5 specific text features: the number of words, the number of characters, the number of sentences, whether the text has digits, and the ratio of upper- to lowercase:

# Set up the transformer with specific features
tf = TextFeatures(
    variables=['review'],
    features=[
        'word_count',
        'char_count',
        'sentence_count',
        'has_digits',
        'uppercase_ratio',
    ])

# Fit and transform
X_transformed = tf.fit_transform(X)

print(X_transformed)

In the following output, we see the resulting dataframe containing the numerical features extracted from the pieces of text:

                                         review          title  review_word_count  review_char_count
0  This product is AMAZING! Best purchase ever.  Great Product                  7                 38
1               Not great. Would not recommend.   Disappointed                  5                 27
2           OK for the price. 3 out of 5 stars.        Average                  9                 27
3                       TERRIBLE!!! DO NOT BUY!          Awful                  4                 20

   review_sentence_count  review_has_digits  review_uppercase_ratio
0                      2                  0                0.236842
1                      2                  0                0.074074
2                      2                  1                0.074074
3                      2                  0                0.800000

Extracting all features#

By default, if no text features are specified, all available features will be extracted:

# Extract all features from a single text column
tf = TextFeatures(variables=['review'])
X_transformed = tf.fit_transform(X)

print(X_transformed.head())

The output dataframe contains all 20 text features extracted from the review column:

                                         review          title  review_char_count  review_word_count
0  This product is AMAZING! Best purchase ever.  Great Product                 38                  7
1               Not great. Would not recommend.   Disappointed                 27                  5
2           OK for the price. 3 out of 5 stars.        Average                 27                  9
3                       TERRIBLE!!! DO NOT BUY!          Awful                 20                  4

   review_sentence_count  review_avg_word_length  review_digit_count  review_letter_count
0                      2                6.285714                   0                   36
1                      2                6.200000                   0                   25
2                      2                3.888889                   2                   23
3                      2                5.750000                   0                   16

   review_uppercase_count  review_lowercase_count  review_special_char_count  review_whitespace_count
0                       9                      27                          2                        6
1                       2                      23                          2                        4
2                       2                      21                          2                        8
3                      16                       0                          4                        3

   review_whitespace_ratio  review_digit_ratio  review_uppercase_ratio  review_has_digits
0                 0.136364            0.000000                0.236842                  0
1                 0.129032            0.000000                0.074074                  0
2                 0.228571            0.074074                0.074074                  1
3                 0.130435            0.000000                0.800000                  0

   review_has_uppercase  review_is_empty  review_starts_with_uppercase  review_ends_with_punctuation
0                     1                0                             1                             1
1                     1                0                             1                             1
2                     1                0                             1                             1
3                     1                0                             1                             1

   review_unique_word_count  review_lexical_diversity
0                         7                       1.0
1                         4                      1.25
2                         9                       1.0
3                         4                       1.0

Dropping original columns#

You can drop the original text columns after extracting features, by setting the parameter drop_original to True:

tf = TextFeatures(
    variables=['review'],
    features=['word_count', 'char_count'],
    drop_original=True
)

X_transformed = tf.fit_transform(X)

print(X_transformed)

The original 'review' column has been removed, and only the 'title' column and the extracted features remain:

          title  review_word_count  review_char_count
0  Great Product                  7                 38
1  Disappointed                   5                 27
2       Average                   9                 27
3         Awful                   4                 20

Combining with scikit-learn Bag-of-Words#

In most NLP tasks, it is common to use bag-of-words (e.g., CountVectorizer) or TF-IDF (e.g., TfidfVectorizer) to represent the text. TextFeatures() can be used alongside these transformers to provide additional metadata that might improve model performance.

In the following example, we compare a baseline model using only TF-IDF with a model that combines TF-IDF and TextFeatures() metadata:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

from feature_engine.text import TextFeatures

# Load and split data
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.sport.hockey'])
df = pd.DataFrame({'text': data.data, 'target': data.target})
X_train, X_test, y_train, y_test = train_test_split(
    df[['text']], df['target'], test_size=0.3, random_state=42
)

print(X_train.head())

The input dataframe contains the raw text of newsgroup posts:

                                                 text
562  From: xxx@yyy.zzz (John Smith)\nSubject: Re:...
459  From: aaa@bbb.ccc (Jane Doe)\nSubject: Shutt...
21   From: ddd@eee.fff\nSubject: Space Station Fr...
892  From: ggg@hhh.iii\nSubject: NHL Scores\nOrga...
317  From: jjj@kkk.lll (Bob Wilson)\nSubject: Re:...

Now let’s set up two pipelines to compare a baseline model using only TF-IDF with a model that combines TF-IDF and TextFeatures() metadata:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1. Baseline: TF-IDF only
tfidf_pipe = Pipeline([
    ('vec', ColumnTransformer([
        ('tfidf', TfidfVectorizer(max_features=500), 'text')
    ])),
    ('clf', LogisticRegression())
])
tfidf_pipe.fit(X_train, y_train)
print(f"TF-IDF Accuracy: {tfidf_pipe.score(X_test, y_test):.3f}")

# 2. Combined: TextFeatures + TF-IDF
combined_pipe = Pipeline([
    ('features', ColumnTransformer([
        ('text_meta', TextFeatures(variables=['text']), 'text'),
        ('tfidf', TfidfVectorizer(max_features=500), 'text')
    ])),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
combined_pipe.fit(X_train, y_train)
print(f"Combined Accuracy: {combined_pipe.score(X_test, y_test):.3f}")

Below we see the accuracy of a model trained using only the bag of words, respect to a model trained using both the bag of words and the additional meta data:

TF-IDF Accuracy: 0.957
Combined Accuracy: 0.963

By adding statistical metadata through TextFeatures(), we provided the model with information about text length, complexity, and style that is not explicitly captured by a word-count-based approach like TF-IDF, leading to a small but noticeable improvement in performance.