TextFeatures
- class feature_engine.text.TextFeatures(variables, features=None, missing_values='ignore', drop_original=False)
TextFeatures() extracts numerical features from text/string variables. This transformer is useful for extracting basic text statistics that can be used as features in machine learning models.
The text variables to process must be passed as an argument, either as a single variable name or as a list.
More details in the User Guide.
- Parameters
- variables: string, list
The list of text/string variables to extract features from.
- features: list, default=None
List of text features to extract. Available features are:
‘char_count’: Number of characters in the text
‘word_count’: Number of words (whitespace-separated tokens)
‘sentence_count’: Number of sentences (based on .!? punctuation)
‘avg_word_length’: Average length of words
‘digit_count’: Number of digit characters
‘letter_count’: Number of alphabetic characters (a-z, A-Z)
‘uppercase_count’: Number of uppercase letters
‘lowercase_count’: Number of lowercase letters
‘special_char_count’: Number of special characters (non-alphanumeric)
‘whitespace_count’: Number of whitespace characters
‘whitespace_ratio’: Ratio of whitespace to total characters
‘digit_ratio’: Ratio of digits to total characters
‘uppercase_ratio’: Ratio of uppercase to total characters
‘has_digits’: Binary indicator if text contains digits
‘has_uppercase’: Binary indicator if text contains uppercase
‘is_empty’: Binary indicator if text is empty
‘starts_with_uppercase’: Binary indicator if text starts with uppercase
‘ends_with_punctuation’: Binary indicator if text ends with .!?
‘unique_word_count’: Number of unique words (case-insensitive)
‘lexical_diversity’: Ratio of unique words to total words
If None, extracts all available features.
- missing_values: string, default='ignore'
If ‘ignore’, NaNs will be filled with an empty string before feature extraction. If ‘raise’, the transformer will raise an error if missing data is found.
- drop_original: bool, default=False
Whether to drop the original text columns after transformation.
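To make a few of the feature definitions above concrete, here is a minimal plain-Python sketch. It is not the library's implementation: the helper name `sketch_text_features`, the use of `None` to stand in for a missing value, the choice of `ValueError`, and returning 0.0 for the lexical diversity of a text with no words are all assumptions made for illustration.

```python
def sketch_text_features(texts, missing_values="ignore"):
    """Illustrative re-derivation of a few features; not feature_engine code."""
    rows = []
    for text in texts:
        if text is None:  # None stands in for a missing value (assumption)
            if missing_values == "raise":
                # The exception type is an assumption.
                raise ValueError("missing data found in text variable")
            text = ""  # 'ignore': treat the missing entry as an empty string
        words = text.split()  # whitespace-separated tokens
        unique_words = {w.lower() for w in words}  # case-insensitive
        rows.append({
            "word_count": len(words),
            "digit_count": sum(c.isdigit() for c in text),
            "uppercase_count": sum(c.isupper() for c in text),
            "has_digits": int(any(c.isdigit() for c in text)),
            "is_empty": int(text == ""),
            "ends_with_punctuation": int(text.endswith((".", "!", "?"))),
            "unique_word_count": len(unique_words),
            # Returning 0.0 when there are no words is an assumption.
            "lexical_diversity": len(unique_words) / len(words) if words else 0.0,
        })
    return rows


rows = sketch_text_features(["The cat saw the dog.", None, "ML rocks 123"])
for row in rows:
    print(row)
```

With `missing_values="raise"`, the same input would stop at the `None` entry instead of treating it as an empty string.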
- Attributes
- variables_:
The list of text variables that will be transformed.
- features_:
The list of features that will be extracted.
- feature_names_in_:
List with the names of features seen during fit.
- n_features_in_:
The number of features in the train set used in fit.
See also
feature_engine.encoding.StringSimilarityEncoder: Encodes categorical variables based on string similarity.
Examples
>>> import pandas as pd
>>> from feature_engine.text import TextFeatures
>>> X = pd.DataFrame({
...     'text': ['Hello World!', 'Python is GREAT.', 'ML rocks 123']
... })
>>> tf = TextFeatures(
...     variables=['text'],
...     features=['char_count', 'word_count', 'has_digits']
... )
>>> tf.fit(X)
TextFeatures(features=['char_count', 'word_count', 'has_digits'], variables=['text'])
>>> X = tf.transform(X)
>>> pd.options.display.max_columns = 10
>>> print(X)
               text  text_char_count  text_word_count  text_has_digits
0      Hello World!               11                2                0
1  Python is GREAT.               14                3                0
2      ML rocks 123               10                3                1
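The numbers in the example output can be re-derived by hand. The plain-Python sketch below is an illustration, not the library's code. It assumes that `char_count` ignores whitespace, which is what the values 11, 14 and 10 in the example suggest (the parameter description only says "Number of characters in the text"), and that new columns are named `<variable>_<feature>`, as the printed header indicates. The helper `example_row` is hypothetical.

```python
texts = ["Hello World!", "Python is GREAT.", "ML rocks 123"]


def example_row(text, variable="text"):
    # Assumption: char_count skips whitespace, matching the example output.
    return {
        f"{variable}_char_count": sum(not c.isspace() for c in text),
        f"{variable}_word_count": len(text.split()),
        f"{variable}_has_digits": int(any(c.isdigit() for c in text)),
    }


derived = [example_row(t) for t in texts]
for row in derived:
    print(row)
```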
Methods
fit:
This transformer does not learn parameters.
fit_transform:
Fit to data, then transform it.
transform:
Extract text features and add them to the dataframe.
get_feature_names_out:
Get output feature names for transformation.
- fit(X, y=None)
This transformer does not learn any parameters.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training input samples. Can be the entire dataframe, not just the variables to transform.
- y: pandas Series, or np.array. Defaults to None.
The target. It is not needed in this transformer. You can pass y or None.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns
- X_new: ndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns
- routing: MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
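The `<component>__<parameter>` form accepted by `set_params` can be illustrated without any estimator. The sketch below uses only the standard library; `group_nested_params` is a hypothetical helper, not a scikit-learn or feature_engine function, and the component name `text_features` is made up for the example. It splits each key at the first double underscore, so keys without one are treated as parameters of the outer object.

```python
def group_nested_params(params):
    """Group parameter names by the component named before the first '__'."""
    own, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            # 'component__parameter': belongs to the named component
            nested.setdefault(name, {})[sub_key] = value
        else:
            # plain name: belongs to the outer object itself
            own[key] = value
    return own, nested


own, nested = group_nested_params({
    "verbose": True,
    "text_features__drop_original": True,
    "text_features__missing_values": "raise",
})
print(own)
print(nested)
```

In a pipeline whose step is named `text_features`, a key such as `text_features__drop_original` would address that step's `drop_original` parameter.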