Data Preparation
OCEAN expects a numerical design matrix, but it also needs enough metadata to
recover the meaning of each transformed column when it builds or prints an
explanation. The `ocean.feature.parse_features()` function handles both at
once by returning `(processed_data, mapper)`.
Automatic feature typing
For each input column, `ocean.feature.parse_features()` applies the following rules.
| Input pattern | Resulting feature type | Notes |
|---|---|---|
| Column listed in `discretes` | Discrete | Kept numeric as an ordered or ordinal feature, with explicit levels and thresholds. |
| Column explicitly listed for one-hot encoding | One-hot encoded | Treated as an unordered nominal feature and expanded into indicator columns. |
| Non-numeric or non-binary column | One-hot encoded | OCEAN assumes no natural order and encodes the column as unordered categories automatically. |
| Column with exactly two unique values | Binary | Encoded as a single 0/1 column. |
| Remaining numeric column | Continuous | Optionally scaled to the centered interval `[-0.5, 0.5]`. |
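The typing rules above can be approximated in a few lines of pandas. This is only a sketch for building intuition, not OCEAN's implementation: `guess_feature_type` and its `one_hot` argument are hypothetical names, and the rules are simply tried in the order the table lists them, which is an assumption about how ties are resolved.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_feature_type(name, column, discretes=(), one_hot=()):
    # Hypothetical helper, not part of OCEAN: checks the table rows top to bottom.
    if name in discretes:
        return "discrete"
    if name in one_hot:
        return "one-hot"
    if not is_numeric_dtype(column):
        return "one-hot"
    if column.nunique() == 2:
        return "binary"
    return "continuous"


raw = pd.DataFrame({
    "age_bucket": [18, 25, 35, 45],
    "owns_home": [0, 1, 1, 0],
    "income_ratio": [0.1, 0.4, 0.7, 0.3],
    "job_type": ["office", "manual", "service", "office"],
})

types = {
    name: guess_feature_type(name, raw[name], discretes=("age_bucket",))
    for name in raw.columns
}
print(types)
```

Without the `discretes` argument, `age_bucket` would fall through to the last rule and be treated as continuous, which is why ordered integer columns need to be declared explicitly.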
Cleaning behavior
By default, preprocessing removes columns that the explainers cannot use and rescales continuous features.
- `drop_na=True` removes columns containing missing values.
- `drop_constant=True` removes constant columns.
- `scale=True` centers continuous features in `[-0.5, 0.5]`.
If you need to preserve a column, disable the relevant option explicitly.
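To make the effect of the three options concrete, here is a rough pandas equivalent. It is a sketch under assumptions: `clean_sketch` and its `continuous` argument are hypothetical, and the min-max formula used for scaling is one way to land in `[-0.5, 0.5]`, not necessarily the formula OCEAN applies.

```python
import pandas as pd


def clean_sketch(frame, continuous=(), drop_na=True, drop_constant=True, scale=True):
    # Hypothetical stand-in for the three cleaning options; not OCEAN's code.
    out = frame.copy()
    if drop_na:
        # Drop every column that contains at least one missing value.
        out = out.dropna(axis="columns")
    if drop_constant:
        # Keep only columns with more than one distinct value.
        out = out.loc[:, out.nunique() > 1].copy()
    if scale:
        for name in continuous:
            if name not in out.columns:
                continue
            low, high = out[name].min(), out[name].max()
            if high > low:
                # Assumed min-max scaling, shifted so the range is [-0.5, 0.5].
                out[name] = (out[name] - low) / (high - low) - 0.5
    return out


frame = pd.DataFrame({
    "income": [10.0, 20.0, 30.0],
    "with_gap": [1.0, None, 3.0],
    "always_one": [1, 1, 1],
})

cleaned = clean_sketch(frame, continuous=("income",))
kept = clean_sketch(frame, continuous=("income",), drop_na=False, drop_constant=False)
print(list(cleaned.columns))
print(list(kept.columns))
```

With the defaults only `income` survives; switching off `drop_na` and `drop_constant` keeps all three columns.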
Discrete versus one-hot encoded
In OCEAN, these two categories are intentionally different.
- Discrete: Ordered or ordinal values that still carry a notion of rank or distance, such as integer counts, age buckets, or credit levels. These should usually stay numeric and be passed through `discretes=...` when they are not continuous.
- One-hot encoded: Unordered nominal categories, such as job titles, regions, or product labels, where no ordering should be assumed. These are expanded into binary indicator columns.
This distinction matters because the explainers may move along ordered discrete levels differently than they switch between unordered categories.
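The two representations can be compared with plain pandas. This is an illustration only: `pd.get_dummies` stands in for OCEAN's own encoder, and the indicator column names it produces follow pandas' convention, which may differ from OCEAN's.

```python
import pandas as pd

raw = pd.DataFrame({
    "age_bucket": [18, 25, 35, 45],
    "job_type": ["office", "manual", "service", "office"],
})

# Discrete: one numeric column whose sorted levels keep their order.
levels = sorted(raw["age_bucket"].unique().tolist())

# One-hot: one 0/1 indicator column per category, with no order implied.
indicators = pd.get_dummies(raw["job_type"], prefix="job_type", dtype=int)

print(levels)
print(list(indicators.columns))
```

A change from 25 to 35 in `age_bucket` is a step to the next level, whereas a change from `office` to `service` switches off one indicator and switches on another.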
Example
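```python
import pandas as pd

from ocean.feature import parse_features

raw = pd.DataFrame({
    "age_bucket": [18, 25, 35, 45],
    "owns_home": [0, 1, 1, 0],
    "income_ratio": [0.1, 0.4, 0.7, 0.3],
    "job_type": ["office", "manual", "service", "office"],
})

# Declare age_bucket as discrete; the other columns are typed automatically.
data, mapper = parse_features(
    raw,
    discretes=("age_bucket",),
)

print(data.columns)                # columns of the processed matrix
print(mapper["age_bucket"].ftype)  # feature type recorded for age_bucket
print(mapper["job_type"].codes)    # codes recorded for the one-hot encoded job_type
```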
Why the mapper matters
The processed matrix seen by the ensemble may contain more columns than the raw input because one-hot encoding expands categorical variables. The mapper stores that relation and is required by every explainer constructor.
Without the mapper, OCEAN cannot correctly:
- associate split decisions with the right original feature,
- decode a one-hot explanation back into a category label,
- rebuild readable explanations from solver variables.
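The decoding step can be pictured with a plain dictionary playing the role of the mapper. Everything in this sketch is hypothetical (the `column_groups` structure, the `decode` helper, and the indicator column names); the real mapper object exposes its own interface.

```python
# Hypothetical stand-in for the mapper: each original feature maps its
# processed columns to the label they represent (None for a plain value).
column_groups = {
    "age_bucket": {"age_bucket": None},
    "job_type": {
        "job_type_manual": "manual",
        "job_type_office": "office",
        "job_type_service": "service",
    },
}


def decode(processed_row, groups):
    decoded = {}
    for feature, columns in groups.items():
        if len(columns) == 1:
            # Single column: the processed value is the feature value.
            (column,) = columns
            decoded[feature] = processed_row[column]
        else:
            # One-hot group: report the label of the active indicator.
            active = max(columns, key=lambda column: processed_row[column])
            decoded[feature] = columns[active]
    return decoded


processed_row = {
    "age_bucket": 35,
    "job_type_manual": 0,
    "job_type_office": 0,
    "job_type_service": 1,
}
explanation = decode(processed_row, column_groups)
print(explanation)
```

Without the grouping information, the three indicator columns would look like three unrelated binary features, and the label `service` could not be recovered.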
Recommended practice
- Keep the mapper next to the trained model.
- Train and explain on the exact same processed columns.
- If you write your own preprocessing pipeline around OCEAN, make sure the final column order stays stable between training and explanation time.
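One way to guard against column drift in a custom pipeline is to record the processed column names at training time and check them before explaining. The `align_columns` helper below is a hypothetical sketch built on pandas, not an OCEAN function.

```python
import pandas as pd


def align_columns(frame, training_columns):
    # Hypothetical helper: enforce the column set and order seen at training time.
    expected = list(training_columns)
    missing = [name for name in expected if name not in frame.columns]
    extra = [name for name in frame.columns if name not in expected]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return frame.loc[:, expected]


# Column names recorded when the model was trained.
training_columns = ["age_bucket", "owns_home", "income_ratio"]

# Same columns, different order, as might happen after a refactored pipeline.
shuffled = pd.DataFrame({
    "income_ratio": [0.2],
    "age_bucket": [25],
    "owns_home": [1],
})

aligned = align_columns(shuffled, training_columns)
print(list(aligned.columns))
```

Raising on a mismatch, rather than silently dropping or filling columns, surfaces the problem before the model sees features in the wrong positions.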