Custom Dataset Example
======================

This page turns the synthetic custom-dataset example into a notebook-style
walkthrough. It starts from a raw pandas dataframe, parses the feature types
with OCEAN, trains a random forest, and uses the constraint-programming
backend to find a class-``1`` counterfactual for one query that the model
assigns to class ``0``.

Why this example matters
------------------------

The packaged dataset loaders are useful when you want a quick start, but most
real integrations begin with a dataframe that you already own. This example
shows the full workflow on mixed feature types:

- ordered discrete values through ``credit_lines``,
- binary flags through ``owns_home`` and ``has_guarantor``,
- continuous ratios through ``income_ratio``, ``debt_ratio``, and
  ``savings_ratio``,
- unordered nominal features through ``job_type`` and ``region``.

Cell 1: Build a mixed-type dataframe
------------------------------------

.. code-block:: python

   import numpy as np
   import pandas as pd

   rng = np.random.default_rng(42)
   raw = pd.DataFrame({
       "credit_lines": rng.choice([0, 1, 2, 4], size=300),
       "owns_home": rng.integers(0, 2, size=300),
       "has_guarantor": rng.integers(0, 2, size=300),
       "income_ratio": rng.uniform(-0.4, 0.8, size=300),
       "debt_ratio": rng.uniform(0.0, 1.0, size=300),
       "savings_ratio": rng.uniform(-0.5, 0.6, size=300),
       "job_type": rng.choice(
           ["office", "manual", "service", "student"],
           size=300,
       ),
       "region": rng.choice(
           ["north", "south", "east", "west"],
           size=300,
       ),
   })

   score = (
       (raw["credit_lines"] >= 2).astype(int)
       + raw["owns_home"].astype(int)
       + raw["has_guarantor"].astype(int)
       + (raw["income_ratio"] > 0.1).astype(int)
       + (raw["savings_ratio"] > 0.0).astype(int)
       + raw["job_type"].isin(["office", "service"]).astype(int)
       + raw["region"].isin(["north", "east"]).astype(int)
       - (raw["debt_ratio"] > 0.55).astype(int)
   )
   target = (score >= 4).astype(int).rename("approved")

Cell 2: Parse the features with OCEAN
-------------------------------------

.. code-block:: python

   from ocean.feature import parse_features

   data, mapper = parse_features(raw, discretes=("credit_lines",))
   print(data.columns)

.. code-block:: text

   MultiIndex([(  'credit_lines',        ''),
               (     'owns_home',        ''),
               ( 'has_guarantor',        ''),
               (  'income_ratio',        ''),
               (    'debt_ratio',        ''),
               ( 'savings_ratio',        ''),
               (      'job_type',  'manual'),
               (      'job_type',  'office'),
               (      'job_type', 'service'),
               (      'job_type', 'student'),
               (        'region',    'east'),
               (        'region',   'north'),
               (        'region',   'south'),
               (        'region',    'west')],
              )

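Every column now carries a two-level label. ``credit_lines``, declared through
``discretes``, stays ordered and numeric, and the binary and continuous
features also remain single columns with an empty second level. ``job_type``
and ``region`` expand into one-hot blocks with one column per category.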
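
To make the column layout above concrete, the following pandas-only sketch
builds the same kind of two-level index by hand on a toy frame. It is not
OCEAN's implementation: the ``toy``, ``numeric``, ``dummies``, and ``encoded``
names are hypothetical, and ``parse_features`` may do more than this (for
example, type inference or scaling).

.. code-block:: python

   import pandas as pd

   # Toy frame with one ordered discrete feature and one nominal feature.
   toy = pd.DataFrame({
       "credit_lines": [0, 2, 4],
       "job_type": ["office", "student", "manual"],
   })

   # The discrete feature stays a single numeric column: (feature, "").
   numeric = toy[["credit_lines"]].copy()
   numeric.columns = pd.MultiIndex.from_tuples([("credit_lines", "")])

   # The nominal feature becomes a one-hot block: (feature, category).
   dummies = pd.get_dummies(toy["job_type"], dtype=float)
   dummies.columns = pd.MultiIndex.from_product([["job_type"], dummies.columns])

   encoded = pd.concat([numeric, dummies], axis=1)
   print(encoded.columns.tolist())

Each row of the ``job_type`` block sums to one, which is the property a
counterfactual must preserve when it switches categories.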

Cell 3: Fit a classifier and choose a query
-------------------------------------------

.. code-block:: python

   import pandas as pd
   from sklearn.ensemble import RandomForestClassifier

   model = RandomForestClassifier(
       n_estimators=40,
       max_depth=4,
       random_state=42,
   )
   model.fit(data, target)

   predictions = pd.Series(model.predict(data), index=data.index)
   query_index = predictions[predictions == 0].index[0]
   query = data.loc[query_index].to_numpy(dtype=float).flatten()
   query_frame = data.loc[[query_index]]
   raw_query = raw.loc[query_index]

   print(raw_query)
   print()
   print("Model prediction:", int(model.predict(query_frame).item()))

.. code-block:: text

   credit_lines            0
   owns_home               0
   has_guarantor           1
   income_ratio    -0.349179
   debt_ratio       0.260349
   savings_ratio    0.234634
   job_type          student
   region               west
   Name: 0, dtype: object

   Model prediction: 0

Cell 4: Explain the query
-------------------------

.. code-block:: python

   from ocean import ConstraintProgrammingExplainer

   explainer = ConstraintProgrammingExplainer(model, mapper=mapper)
   explanation = explainer.explain(
       query,
       y=1,
       norm=1,
       max_time=10,
       num_workers=1,
       random_seed=42,
   )
   if explanation is None:
       raise RuntimeError("No counterfactual was found for the synthetic example.")

   counterfactual_frame = pd.DataFrame(
       [explanation.to_numpy()],
       columns=data.columns,
   )

   print("Target class:", 1)
   print("Counterfactual prediction:", int(model.predict(counterfactual_frame).item()))

.. code-block:: text

   Target class: 1
   Counterfactual prediction: 1

Cell 5: Inspect the decoded explanation
---------------------------------------

.. code-block:: python

   print(explanation)

.. code-block:: text

   Explanation:
   credit_lines   : 0.0
   owns_home      : 0
   has_guarantor  : 1
   income_ratio   : -0.2833683341741562
   debt_ratio     : -0.1887158378958702
   savings_ratio  : 0.29842646420001984
   job_type       : student
   region         : north

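This decoded view is usually the most readable one: categorical one-hot blocks
are mapped back to labels, and the keys match the original dataframe columns.
Compared with the query printed in Cell 3, ``region`` moves from ``west`` to
``north`` and the three ratio features shift, while ``credit_lines``,
``owns_home``, ``has_guarantor``, and ``job_type`` keep their values.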
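
The mapping from one-hot blocks back to labels can be sketched in plain pandas.
This is an illustration under stated assumptions, not the code OCEAN runs: the
``decode`` helper and the ``processed`` series are hypothetical, and the sketch
simply picks the category with the largest value in each block.

.. code-block:: python

   import pandas as pd

   # A processed vector with one scalar feature and one one-hot block.
   index = pd.MultiIndex.from_tuples([
       ("credit_lines", ""),
       ("region", "east"),
       ("region", "north"),
       ("region", "south"),
       ("region", "west"),
   ])
   processed = pd.Series([0.0, 0.0, 1.0, 0.0, 0.0], index=index)

   def decode(series):
       """Collapse each one-hot block to the label of its largest entry."""
       decoded = {}
       best = {}
       for (feature, category), value in series.items():
           if category == "":
               # Scalar feature: keep the numeric value as-is.
               decoded[feature] = value
           elif feature not in best or value > best[feature]:
               # One-hot block: remember the strongest category so far.
               best[feature] = value
               decoded[feature] = category
       return decoded

   decoded = decode(processed)
   print(decoded)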

Cell 6: Inspect the processed vector and the final distance
-----------------------------------------------------------

.. code-block:: python

   print(explanation.to_series())
   print()
   print("Distance:", explainer.get_distance())

.. code-block:: text

   credit_lines              0.000000
   owns_home                 0.000000
   has_guarantor             1.000000
   income_ratio             -0.283368
   debt_ratio               -0.188716
   savings_ratio             0.298426
   job_type       manual     0.000000
                  office     0.000000
                  service    0.000000
                  student    1.000000
   region         east       0.000000
                  north      1.000000
                  south      0.000000
                  west       0.000000
   dtype: float64

   Distance: 1.3566914800296987

``get_distance()`` is the user-facing metric to report here: it reconstructs
the post-processed :math:`L_1` distance between the original query and the
decoded counterfactual, including the half-weight treatment for one-hot blocks.
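
One way to read the half-weight treatment: switching category flips two
entries of a one-hot block, so weighting those entries by one half counts the
switch as a single unit of change. The sketch below illustrates that reading on
toy vectors. The ``l1_with_half_weight`` helper and its ``one_hot_mask``
argument are hypothetical, and because this page does not show how OCEAN scales
features internally, the sketch is not expected to reproduce the distance
printed above.

.. code-block:: python

   import numpy as np

   def l1_with_half_weight(query, counterfactual, one_hot_mask):
       """L1 distance where one-hot columns contribute with weight 0.5."""
       query = np.asarray(query, dtype=float)
       counterfactual = np.asarray(counterfactual, dtype=float)
       weights = np.where(one_hot_mask, 0.5, 1.0)
       return float(np.sum(weights * np.abs(counterfactual - query)))

   # Columns: [income_ratio, region=east, region=north, region=west]
   query = [0.25, 0.0, 0.0, 1.0]
   counterfactual = [0.50, 0.0, 1.0, 0.0]
   one_hot_mask = np.array([False, True, True, True])

   # 0.25 from the continuous feature plus 0.5 + 0.5 from the category switch.
   distance = l1_with_half_weight(query, counterfactual, one_hot_mask)
   print(distance)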

Full script
-----------

The exact runnable version behind this page is included below:

.. literalinclude:: ../examples/custom_dataset.py
   :language: python
   :linenos:
   :caption: examples/custom_dataset.py