Data Preparation
================

OCEAN expects a numerical design matrix, but it also needs enough metadata to recover the meaning of each transformed column when it builds or prints an explanation. The :func:`ocean.feature.parse_features` function handles both at once by returning ``(processed_data, mapper)``.

Automatic feature typing
------------------------

For each input column, :func:`ocean.feature.parse_features` applies these rules.

.. list-table:: Parsing rules
   :header-rows: 1

   * - Input pattern
     - Resulting feature type
     - Notes
   * - Column listed in ``discretes``
     - Discrete
     - Kept numeric as an ordered or ordinal feature, with explicit levels and thresholds.
   * - Column listed in ``encoded``
     - One-hot encoded
     - Treated as an unordered nominal feature and expanded into indicator columns.
   * - Non-numeric or non-binary column
     - One-hot encoded
     - OCEAN assumes no natural order and encodes the column as unordered categories automatically.
   * - Column with exactly two unique values
     - Binary
     - Encoded as a single 0/1 column.
   * - Remaining numeric column
     - Continuous
     - Optionally scaled to the centered interval ``[-0.5, 0.5]``.

Cleaning behavior
-----------------

By default, preprocessing removes columns that would not be useful for the explainers.

- ``drop_na=True`` removes columns containing missing values.
- ``drop_constant=True`` removes constant columns.
- ``scale=True`` centers continuous features in ``[-0.5, 0.5]``.

If you need to preserve a column, disable the relevant option explicitly.

Discrete versus one-hot encoded
-------------------------------

In OCEAN, these two categories are intentionally different.

``Discrete``
    Ordered or ordinal values that still carry a notion of rank or distance, such as integer counts, age buckets, or credit levels. These should usually stay numeric and be passed through ``discretes=...`` when they are not continuous.
``One-hot encoded``
    Unordered nominal categories, such as job titles, regions, or product labels, where no ordering should be assumed. These are expanded into binary indicator columns.

This distinction matters because the explainers may move along ordered discrete levels differently than they switch between unordered categories.

Example
-------

.. code-block:: python

    import pandas as pd

    from ocean.feature import parse_features

    raw = pd.DataFrame({
        "age_bucket": [18, 25, 35, 45],                         # ordinal, declared below
        "owns_home": [0, 1, 1, 0],                              # two unique values: binary
        "income_ratio": [0.1, 0.4, 0.7, 0.3],                   # remaining numeric: continuous
        "job_type": ["office", "manual", "service", "office"],  # non-numeric: one-hot encoded
    })

    # Declare the ordinal column explicitly so it is typed as discrete.
    data, mapper = parse_features(
        raw,
        discretes=("age_bucket",),
    )

    # Inspect the processed columns and the metadata kept for each feature.
    print(data.columns)
    print(mapper["age_bucket"].ftype)
    print(mapper["job_type"].codes)

Why the mapper matters
----------------------

The processed matrix seen by the ensemble may contain more columns than the raw input because one-hot encoding expands categorical variables. The mapper stores that relation and is required by every explainer constructor.

Without the mapper, OCEAN cannot correctly:

- associate split decisions with the right original feature,
- decode a one-hot explanation back into a category label,
- rebuild readable explanations from solver variables.

Recommended practice
--------------------

- Keep the mapper next to the trained model.
- Train and explain on the exact same processed columns.
- If you write your own preprocessing pipeline around OCEAN, make sure the final column order stays stable between training and explanation time.
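
To illustrate the last recommendation, here is a minimal sketch of a column-order check that uses only pandas. The helper ``align_columns`` is hypothetical and not part of OCEAN. The sketch also assumes flat string column labels; the frame returned by :func:`ocean.feature.parse_features` may use a different column index, in which case the comparison would need to be adapted.

```python
import pandas as pd


def align_columns(frame: pd.DataFrame, expected: list[str]) -> pd.DataFrame:
    """Return `frame` with exactly the columns in `expected`, in that order.

    Hypothetical helper for illustration; it is not part of OCEAN.
    """
    missing = [c for c in expected if c not in frame.columns]
    extra = [c for c in frame.columns if c not in expected]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    # Selecting with the expected list restores the training-time order.
    return frame.loc[:, expected]


# Column order recorded when the model was trained.
train = pd.DataFrame({"age_bucket": [18, 25], "owns_home": [0, 1]})
train_columns = list(train.columns)

# Same columns in a different order, as a custom pipeline step might produce.
query = pd.DataFrame({"owns_home": [1], "age_bucket": [35]})
aligned = align_columns(query, train_columns)
print(list(aligned.columns))
```

Raising on a mismatch instead of silently reindexing is a deliberate choice in this sketch: a missing or unexpected column usually means that preprocessing at explanation time differs from the preprocessing used for training.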