Feature Processing

Feature metadata and preprocessing helpers.

class Feature(ftype, *, levels=(), thresholds=(), codes=())

Bases: object

Description of a single feature after OCEAN preprocessing.

class Type(*values)

Bases: Enum

Supported feature categories.

CONTINUOUS = 'continuous'

DISCRETE = 'discrete'

ONE_HOT_ENCODED = 'one-hot-encoded'

BINARY = 'binary'

add(*levels)

property codes

property ftype

property is_binary

property is_continuous

property is_discrete

property is_numeric

property is_one_hot_encoded

property levels

property thresholds
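
The relationship between ftype and the is_* predicates can be sketched with a small stand-in class. This is not the library's implementation: FeatureType and FeatureSketch are hypothetical names, add() is not modeled, and the rule that is_numeric covers continuous and discrete features is an assumption, since the reference above does not define it.

```python
from enum import Enum


class FeatureType(Enum):
    """Stand-in for Feature.Type, using the documented member values."""

    CONTINUOUS = "continuous"
    DISCRETE = "discrete"
    ONE_HOT_ENCODED = "one-hot-encoded"
    BINARY = "binary"


class FeatureSketch:
    """Hypothetical, simplified stand-in for Feature (not the library class)."""

    def __init__(self, ftype, *, levels=(), thresholds=(), codes=()):
        self._ftype = ftype
        self._levels = tuple(levels)
        self._thresholds = tuple(thresholds)
        self._codes = tuple(codes)

    @property
    def ftype(self):
        return self._ftype

    @property
    def levels(self):
        return self._levels

    @property
    def thresholds(self):
        return self._thresholds

    @property
    def codes(self):
        return self._codes

    @property
    def is_binary(self):
        return self._ftype is FeatureType.BINARY

    @property
    def is_continuous(self):
        return self._ftype is FeatureType.CONTINUOUS

    @property
    def is_discrete(self):
        return self._ftype is FeatureType.DISCRETE

    @property
    def is_one_hot_encoded(self):
        return self._ftype is FeatureType.ONE_HOT_ENCODED

    @property
    def is_numeric(self):
        # Assumption: "numeric" groups continuous and discrete features.
        return self.is_continuous or self.is_discrete


# Enum members can be looked up by their documented string value.
age = FeatureSketch(FeatureType("continuous"))
color = FeatureSketch(FeatureType.ONE_HOT_ENCODED, codes=("red", "green"))
print(age.is_numeric, color.is_one_hot_encoded, color.codes)
```

Deriving each predicate from a single ftype field keeps the flags mutually consistent, which is presumably why they are exposed as read-only properties.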

parse_features(data, *, discretes=(), encoded=(), drop_na=True, drop_constant=True, scale=True)

Parse a tabular dataset into OCEAN’s feature representation.

Parameters:

data (pd.DataFrame) – The DataFrame to be processed.
discretes (tuple[Key, ...], optional) – Column names to treat as ordered discrete (ordinal) features, such as integer-valued counts or ranked buckets. Defaults to (). If None, no column is treated as discrete.
encoded (tuple[Key, ...], optional) – Column names to treat as one-hot encoded features, typically unordered nominal categories. Defaults to ().
drop_na (bool, optional) – Whether to drop columns that contain NaN values. Defaults to True.
drop_constant (bool, optional) – Whether to drop columns that hold a single constant value. Defaults to True.
scale (bool, optional) – Whether to scale continuous features to the centered interval [-0.5, 0.5]. Defaults to True.

Returns:

A tuple (processed_data, mapper), where processed_data is ready to train a tree ensemble and mapper records the mapping between original feature names and transformed columns.

Return type:

Parsed

Raises:

ValueError – If a column in discretes is not found in the input frame.
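
The effect of drop_na, drop_constant and scale can be illustrated on a plain dictionary of columns. This is a minimal sketch, not OCEAN's code: the helper names are hypothetical, the order of the steps is assumed, and min-max scaling followed by a shift of 0.5 is only one plausible way to reach the interval [-0.5, 0.5].

```python
import math


def drop_na_columns(table):
    """Keep only the columns that contain no None or NaN value."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        name: values
        for name, values in table.items()
        if not any(is_missing(v) for v in values)
    }


def drop_constant_columns(table):
    """Keep only the columns that take at least two distinct values."""
    return {name: values for name, values in table.items() if len(set(values)) > 1}


def scale_centered(values):
    """Min-max scale a column, then shift it so that it spans [-0.5, 0.5]."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("a constant column cannot be scaled")
    return [(v - low) / (high - low) - 0.5 for v in values]


table = {
    "age": [20.0, 30.0, 40.0],           # kept and scaled
    "income": [1.0, float("nan"), 3.0],  # removed: contains NaN
    "country": ["FR", "FR", "FR"],       # removed: constant
}

cleaned = drop_constant_columns(drop_na_columns(table))
scaled = {name: scale_centered(values) for name, values in cleaned.items()}
print(scaled)  # {'age': [-0.5, 0.0, 0.5]}
```

Dropping constant columns before scaling also avoids a division by zero, since a constant column has equal minimum and maximum.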