Data Preparation

OCEAN expects a numeric design matrix, but it also needs enough metadata to recover the meaning of each transformed column when it builds or prints an explanation. The ocean.feature.parse_features() function provides both in one call by returning the pair (processed_data, mapper).

Automatic feature typing

For each input column, ocean.feature.parse_features() applies these rules.

Parsing rules

| Input pattern | Resulting feature type | Notes |
|---|---|---|
| Column listed in discretes | Discrete | Kept numeric as an ordered or ordinal feature, with explicit levels and thresholds. |
| Column listed in encoded | One-hot encoded | Treated as an unordered nominal feature and expanded into indicator columns. |
| Non-numeric or non-binary column | One-hot encoded | OCEAN assumes no natural order and encodes the column as unordered categories automatically. |
| Column with exactly two unique values | Binary | Encoded as a single 0/1 column. |
| Remaining numeric column | Continuous | Optionally scaled to the centered interval [-0.5, 0.5]. |
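The rules can be read as a decision procedure. The following is a minimal sketch of that reading in plain Python, not OCEAN's implementation: infer_feature_type is a hypothetical helper, and it assumes the rules are checked in the order listed, so that a non-numeric column is one-hot encoded even when it has only two values. The library itself may resolve such edge cases differently.

```python
from numbers import Number


def infer_feature_type(name, values, discretes=(), encoded=()):
    """Classify one column, checking the rules in the order listed (assumed)."""
    if name in discretes:
        return "discrete"
    if name in encoded:
        return "one-hot"
    # bool is a Number subclass, so 0/1 and True/False both count as numeric.
    if not all(isinstance(v, Number) for v in values):
        return "one-hot"
    if len(set(values)) == 2:
        return "binary"
    return "continuous"


columns = {
    "age_bucket": [18, 25, 35, 45],
    "owns_home": [0, 1, 1, 0],
    "income_ratio": [0.1, 0.4, 0.7, 0.3],
    "job_type": ["office", "manual", "service", "office"],
}

inferred = {
    name: infer_feature_type(name, values, discretes=("age_bucket",))
    for name, values in columns.items()
}
print(inferred)
```

Without the discretes argument, age_bucket would fall through to the last rule and be typed as continuous, which is why ordinal columns need to be declared explicitly.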

Cleaning behavior

By default, preprocessing removes columns that would not be useful for the explainers and rescales continuous features.

  • drop_na=True removes columns containing missing values.

  • drop_constant=True removes constant columns.

  • scale=True centers continuous features in [-0.5, 0.5].

To keep a column that would otherwise be dropped, set the corresponding option to False explicitly (for example, drop_constant=False).
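To make the effect of the three options concrete, here is a rough sketch in plain Python. It is an illustration, not OCEAN's code: clean and is_missing are hypothetical helpers, and min-max scaling shifted by 0.5 is an assumption, since the text above states only the target interval [-0.5, 0.5].

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean(columns, continuous, drop_na=True, drop_constant=True, scale=True):
    """Apply the three cleaning options to a dict of column name -> values."""
    cleaned = {}
    for name, values in columns.items():
        if drop_na and any(is_missing(v) for v in values):
            continue  # column contains missing values
        if drop_constant and len(set(values)) <= 1:
            continue  # column carries no information
        if scale and name in continuous:
            present = [v for v in values if not is_missing(v)]
            if present and max(present) > min(present):
                lo, hi = min(present), max(present)
                # Assumed formula: min-max scale to [0, 1], then shift by -0.5.
                values = [
                    v if is_missing(v) else (v - lo) / (hi - lo) - 0.5
                    for v in values
                ]
        cleaned[name] = list(values)
    return cleaned


columns = {
    "income_ratio": [0.1, 0.4, 0.7, 0.3],
    "country": ["FR", "FR", "FR", "FR"],   # constant
    "score": [1.0, None, 3.0, 2.0],        # has a missing value
}
continuous = {"income_ratio", "score"}

default = clean(columns, continuous)
kept = clean(columns, continuous, drop_na=False, drop_constant=False, scale=False)

print(list(default))  # ['income_ratio']
print(list(kept))     # ['income_ratio', 'country', 'score']
```

With the defaults only income_ratio survives, rescaled so that its minimum maps to -0.5 and its maximum to 0.5. With all three options disabled, every column is returned unchanged.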

Discrete versus one-hot encoded

In OCEAN, these two categories are intentionally different.

Discrete

Ordered or ordinal values that still carry a notion of rank or distance, such as integer counts, age buckets, or credit levels. These should usually stay numeric and be passed through discretes=... when they are not continuous.

One-hot encoded

Unordered nominal categories, such as job titles, regions, or product labels, where no ordering should be assumed. These are expanded into binary indicator columns.

This distinction matters because the explainers may move along ordered discrete levels differently than they switch between unordered categories.
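The difference in representation can be sketched in a few lines of plain Python. The helpers ordinal_levels and one_hot, and the "name=category" column naming, are hypothetical and chosen only for illustration; OCEAN's actual column names and internal structures may differ.

```python
def ordinal_levels(values):
    """Return the sorted distinct levels of an ordered feature."""
    return sorted(set(values))


def one_hot(name, values):
    """Expand an unordered feature into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {
        f"{name}={category}": [int(v == category) for v in values]
        for category in categories
    }


age_bucket = [18, 25, 35, 45]
job_type = ["office", "manual", "service", "office"]

# Discrete: still a single numeric column, with an explicit ordered level set.
levels = ordinal_levels(age_bucket)
print(levels)  # [18, 25, 35, 45]

# One-hot: one indicator column per category, with no order between them.
indicators = one_hot("job_type", job_type)
for column, bits in indicators.items():
    print(column, bits)
# job_type=manual [0, 1, 0, 0]
# job_type=office [1, 0, 0, 1]
# job_type=service [0, 0, 1, 0]
```

In the discrete case, 25 sits between 18 and 35, so a change can be described as a step up or down. In the one-hot case, every category is equally far from every other one.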

Example

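import pandas as pd

from ocean.feature import parse_features

# One column per feature type: ordinal, binary, continuous, and nominal.
raw = pd.DataFrame({
    "age_bucket": [18, 25, 35, 45],
    "owns_home": [0, 1, 1, 0],
    "income_ratio": [0.1, 0.4, 0.7, 0.3],
    "job_type": ["office", "manual", "service", "office"],
})

# Declare age_bucket as discrete so it is not treated as continuous;
# the other columns are typed automatically.
data, mapper = parse_features(
    raw,
    discretes=("age_bucket",),
)

# Inspect the processed columns and the metadata the mapper keeps per feature.
print(data.columns)
print(mapper["age_bucket"].ftype)
print(mapper["job_type"].codes)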

Why the mapper matters

The processed matrix seen by the ensemble may contain more columns than the raw input because one-hot encoding expands categorical variables. The mapper records which processed columns came from which original feature, and every explainer constructor requires it.

Without the mapper, OCEAN cannot correctly:

  • associate split decisions with the right original feature,

  • decode a one-hot explanation back into a category label,

  • rebuild readable explanations from solver variables.
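The role of the mapper can be illustrated with a plain dictionary standing in for it. This is a sketch under stated assumptions: column_map, owner_of, decode_row, and the "name=category" column naming are all hypothetical and do not reflect the structure of OCEAN's mapper object.

```python
# Hypothetical stand-in for the mapper: original feature -> processed columns.
column_map = {
    "age_bucket": ["age_bucket"],
    "owns_home": ["owns_home"],
    "job_type": ["job_type=manual", "job_type=office", "job_type=service"],
}

# Reverse lookup: which original feature does a processed column belong to?
owner_of = {
    column: feature
    for feature, columns in column_map.items()
    for column in columns
}


def decode_row(row, column_map):
    """Collapse a processed row (column -> value) back to original features."""
    decoded = {}
    for feature, columns in column_map.items():
        if len(columns) == 1:
            decoded[feature] = row[columns[0]]
        else:
            # One-hot group: the active indicator names the category.
            active = max(columns, key=lambda c: row[c])
            decoded[feature] = active.split("=", 1)[1]
    return decoded


processed_row = {
    "age_bucket": 35,
    "owns_home": 1,
    "job_type=manual": 0,
    "job_type=office": 0,
    "job_type=service": 1,
}

print(owner_of["job_type=office"])  # job_type
decoded = decode_row(processed_row, column_map)
print(decoded)  # {'age_bucket': 35, 'owns_home': 1, 'job_type': 'service'}
```

The reverse lookup corresponds to attributing a split on a processed column to its original feature, and decode_row corresponds to turning a one-hot explanation back into a category label.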