
Drift Detection

quprep.core.drift.DriftDetector(mean_threshold=3.0, std_threshold=2.0, warn=True)

Detects statistical drift between training data and new data.

Fitted during Pipeline.fit() on the post-cleaning, post-reduction feature matrix. On transform(), compares new data against stored training statistics and issues a QuPrepWarning if drift is detected.

Two signals are checked per feature:

  • Mean shift: the difference in feature means, expressed in units of the training standard deviation. Flagged when |new_mean - train_mean| / train_std > mean_threshold (default 3σ).
  • Std ratio: the ratio of new std to training std. Flagged when the ratio falls outside [1/std_threshold, std_threshold] (with the default of 2.0, when the std has more than doubled or dropped below half).
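The two signals can be worked through by hand for a single feature. The sketch below is illustrative only: `drift_signals` is a hypothetical helper, not part of quprep, and it uses the standard-library `statistics` module instead of NumPy. It assumes a non-zero training std and the population standard deviation (ddof=0), which is also NumPy's default.

```python
import statistics


def drift_signals(train, new):
    """Return (mean_shift_sigmas, std_ratio) for one feature.

    Hypothetical helper for illustration; assumes the training std is non-zero.
    """
    t_mean = statistics.fmean(train)
    t_std = statistics.pstdev(train)  # population std, like np.std's default
    n_mean = statistics.fmean(new)
    n_std = statistics.pstdev(new)
    return abs(n_mean - t_mean) / t_std, n_std / t_std


train = [1.0, 2.0, 3.0, 4.0, 5.0]        # mean 3.0, std sqrt(2)
shifted = [x + 6.0 for x in train]       # same spread, mean moved by 6.0

shift, ratio = drift_signals(train, shifted)

# Apply the default thresholds described above.
mean_flag = shift > 3.0                         # 6 / sqrt(2) is about 4.24 sigmas
std_flag = ratio > 2.0 or ratio < 1.0 / 2.0     # ratio stays at 1.0
```

Here only the mean-shift signal fires: the new data has the same spread as the training data but sits roughly 4.24 training standard deviations away.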

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mean_threshold | float | Number of training standard deviations a mean shift must exceed to be flagged. | 3.0 |
| std_threshold | float | Largest factor by which the new std may differ from the training std, in either direction, before the feature is flagged. With 2.0, a feature is flagged when its std has more than doubled or has dropped below half. | 2.0 |
| warn | bool | Whether to issue a QuPrepWarning when drift is detected. Set to False to call check() programmatically without emitting a warning. | True |
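Because drift is reported through Python's `warnings` machinery, callers can record or escalate it with the standard filters. The sketch below is a stand-alone illustration: `DemoDriftWarning` and `check_with_warning` are hypothetical stand-ins, and the base class of the real QuPrepWarning is an assumption (it is not shown in this page).

```python
import warnings


class DemoDriftWarning(UserWarning):
    """Hypothetical stand-in for QuPrepWarning (base class is an assumption)."""


def check_with_warning(drifted):
    """Toy function that warns when given a non-empty list of drifted features."""
    if drifted:
        warnings.warn(
            f"Data drift detected in {len(drifted)} feature(s)",
            DemoDriftWarning,
            stacklevel=2,
        )
    return drifted


# Record warnings instead of printing them to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_with_warning(["age", "income"])
messages = [
    str(w.message) for w in caught if issubclass(w.category, DemoDriftWarning)
]

# Escalate drift warnings to exceptions, for example in a batch job.
escalated = False
with warnings.catch_warnings():
    warnings.simplefilter("error", DemoDriftWarning)
    try:
        check_with_warning(["age"])
    except DemoDriftWarning:
        escalated = True
```

With the real detector, the same filters would be applied to QuPrepWarning; passing warn=False avoids the warning altogether and leaves the returned report as the only signal.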
Source code in quprep/core/drift.py
def __init__(
    self,
    mean_threshold: float = 3.0,
    std_threshold: float = 2.0,
    warn: bool = True,
):
    self.mean_threshold = mean_threshold
    self.std_threshold = std_threshold
    self.warn = warn
    self._train_mean: np.ndarray | None = None
    self._train_std: np.ndarray | None = None
    self._feature_names: list[str] | None = None
    self._fitted = False

Functions

check(dataset)

Check new data for drift against the training distribution.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | New data to check. Must have the same number of features as the training data. | required |

Returns:

Type Description
DriftReport

Raises:

| Type | Description |
|---|---|
| RuntimeError | If called before fit(). |
| ValueError | If dataset has a different number of features than the training data. |

Source code in quprep/core/drift.py
def check(self, dataset) -> DriftReport:
    """
    Check new data for drift against the training distribution.

    Parameters
    ----------
    dataset : Dataset
        New data to check. Must have the same number of features as the
        training data.

    Returns
    -------
    DriftReport

    Raises
    ------
    RuntimeError
        If called before ``fit()``.
    ValueError
        If ``dataset`` has a different number of features than training data.
    """
    if not self._fitted:
        raise RuntimeError(
            "DriftDetector has not been fitted. Call fit() first."
        )

    data = dataset.data
    n_train_features = len(self._train_mean)
    if data.shape[1] != n_train_features:
        raise ValueError(
            f"Feature count mismatch: training had {n_train_features} features, "
            f"new data has {data.shape[1]}."
        )

    new_mean = np.nanmean(data, axis=0)
    new_std = np.nanstd(data, axis=0)

    drifted = []
    feature_stats = {}

    for i, name in enumerate(self._feature_names):
        t_mean = float(self._train_mean[i])
        t_std = float(self._train_std[i])
        n_mean = float(new_mean[i])
        n_std = float(new_std[i])

        # Mean shift in units of training std
        if t_std > 0:
            mean_shift_sigmas = abs(n_mean - t_mean) / t_std
        else:
            mean_shift_sigmas = 0.0 if abs(n_mean - t_mean) < 1e-9 else float("inf")

        # Std ratio (avoid div-by-zero)
        if t_std > 0:
            std_ratio = n_std / t_std
        else:
            std_ratio = 1.0

        # Use a small epsilon on the reciprocal check to avoid floating-point
        # false positives when std_ratio is marginally below 1/std_threshold.
        is_drifted = (
            mean_shift_sigmas > self.mean_threshold
            or std_ratio > self.std_threshold
            or (std_ratio > 0 and (1.0 / std_ratio) > self.std_threshold + 1e-9)
        )

        feature_stats[name] = {
            "train_mean": t_mean,
            "new_mean": n_mean,
            "train_std": t_std,
            "new_std": n_std,
            "mean_shift_sigmas": mean_shift_sigmas,
            "std_ratio": std_ratio,
        }

        if is_drifted:
            drifted.append(name)

    overall_drift = len(drifted) > 0
    report = DriftReport(
        drifted_features=drifted,
        feature_stats=feature_stats,
        n_features_drifted=len(drifted),
        overall_drift=overall_drift,
    )

    if overall_drift and self.warn:
        from quprep.validation.input_validator import QuPrepWarning
        warnings.warn(
            f"Data drift detected in {len(drifted)} feature(s): "
            f"{', '.join(drifted[:5])}"
            + (" ..." if len(drifted) > 5 else "")
            + ". New data may be outside the training distribution.",
            QuPrepWarning,
            stacklevel=3,
        )

    return report
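The per-feature rule in the check() listing above has a few edge cases that are easier to see in isolation. The following is a stand-alone restatement of that rule, not library code: `is_drifted` is a hypothetical function name, and the input values are made up for illustration.

```python
def is_drifted(t_mean, t_std, n_mean, n_std,
               mean_threshold=3.0, std_threshold=2.0):
    """Restate the per-feature drift rule from the check() listing."""
    if t_std > 0:
        mean_shift = abs(n_mean - t_mean) / t_std
        std_ratio = n_std / t_std
    else:
        # Constant training feature: any real change in the mean counts as an
        # infinite shift, and the std ratio is pinned to 1.0.
        mean_shift = 0.0 if abs(n_mean - t_mean) < 1e-9 else float("inf")
        std_ratio = 1.0
    return (
        mean_shift > mean_threshold
        or std_ratio > std_threshold
        # Epsilon keeps a ratio of exactly 1/std_threshold from being flagged.
        or (std_ratio > 0 and (1.0 / std_ratio) > std_threshold + 1e-9)
    )


cases = {
    "unchanged": is_drifted(0.0, 1.0, 0.0, 1.0),
    "std exactly halved": is_drifted(0.0, 1.0, 0.0, 0.5),
    "std below half": is_drifted(0.0, 1.0, 0.0, 0.4),
    "constant feature moved": is_drifted(5.0, 0.0, 5.1, 0.0),
    "constant feature gained spread": is_drifted(5.0, 0.0, 5.0, 2.0),
    "new data collapsed to constant": is_drifted(0.0, 1.0, 0.0, 0.0),
}
```

Reading the listing as written, the last two cases are not flagged: a feature that was constant in training keeps a std ratio of 1.0 even if it gains spread, and a new std of exactly zero fails the `std_ratio > 0` guard on the reciprocal check. Callers who care about those situations can inspect train_std and new_std in the report's feature_stats directly.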

fit(dataset)

Record training distribution statistics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | Training data (post-cleaning, post-reduction). NaN values are excluded from statistics using nan-safe functions. | required |

Returns:

| Type | Description |
|---|---|
| DriftDetector | Returns self. |

Source code in quprep/core/drift.py
def fit(self, dataset) -> DriftDetector:
    """
    Record training distribution statistics.

    Parameters
    ----------
    dataset : Dataset
        Training data (post-cleaning, post-reduction). NaN values are
        excluded from statistics using nan-safe functions.

    Returns
    -------
    DriftDetector
        Returns ``self``.
    """
    data = dataset.data
    self._train_mean = np.nanmean(data, axis=0)
    self._train_std = np.nanstd(data, axis=0)
    self._feature_names = list(dataset.feature_names) if dataset.feature_names else [
        f"feature[{i}]" for i in range(data.shape[1])
    ]
    self._fitted = True
    return self
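What fit() stores can be sketched without NumPy. The helpers below are hypothetical and only approximate the behaviour of np.nanmean and np.nanstd for columns that contain at least one non-NaN value; the fallback naming mirrors the listing above.

```python
import math
import statistics


def nan_safe_column_stats(rows):
    """Per-column (mean, population std), skipping NaN entries.

    Unlike np.nanmean, this sketch raises statistics.StatisticsError
    for a column that is entirely NaN.
    """
    stats = []
    for column in zip(*rows):
        values = [v for v in column if not math.isnan(v)]
        stats.append((statistics.fmean(values), statistics.pstdev(values)))
    return stats


def resolve_feature_names(names, n_features):
    """Use the given names, or fall back to feature[0], feature[1], ..."""
    return list(names) if names else [f"feature[{i}]" for i in range(n_features)]


rows = [
    [1.0, 10.0],
    [3.0, float("nan")],   # the NaN is ignored for the second column
    [5.0, 30.0],
]

stats = nan_safe_column_stats(rows)
names = resolve_feature_names(None, len(rows[0]))
```

The second column's statistics are computed from two values only (10.0 and 30.0), so a feature with many missing entries is summarized from a smaller sample than its neighbours.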

quprep.core.drift.DriftReport(drifted_features=list(), feature_stats=dict(), n_features_drifted=0, overall_drift=False) dataclass

Summary of drift detected between training and new data.

Attributes:

| Name | Type | Description |
|---|---|---|
| drifted_features | list of str | Names of the features where drift was detected. Features without names are reported with fallback labels of the form feature[i]. |
| feature_stats | dict | Per-feature drift details: {name: {"train_mean", "new_mean", "train_std", "new_std", "mean_shift_sigmas", "std_ratio"}}. |
| n_features_drifted | int | Number of features that exceeded a drift threshold. |
| overall_drift | bool | True if any feature exceeded a threshold. |
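Since feature_stats is a plain nested dict, it can be post-processed with ordinary Python. The sketch below assumes a dict with the documented keys; the feature names and numbers are invented for illustration, and `rank_by_mean_shift` is a hypothetical helper.

```python
# Invented example data in the documented feature_stats shape.
feature_stats = {
    "age": {
        "train_mean": 40.0, "new_mean": 47.5,
        "train_std": 10.0, "new_std": 11.0,
        "mean_shift_sigmas": 0.75, "std_ratio": 1.1,
    },
    "income": {
        "train_mean": 50000.0, "new_mean": 90000.0,
        "train_std": 10000.0, "new_std": 25000.0,
        "mean_shift_sigmas": 4.0, "std_ratio": 2.5,
    },
}


def rank_by_mean_shift(stats):
    """Return feature names ordered from largest to smallest mean shift."""
    return sorted(
        stats,
        key=lambda name: stats[name]["mean_shift_sigmas"],
        reverse=True,
    )


ranked = rank_by_mean_shift(feature_stats)
summary = [
    f"{name}: {feature_stats[name]['mean_shift_sigmas']:.2f} sigma, "
    f"std x{feature_stats[name]['std_ratio']:.2f}"
    for name in ranked
]
```

A ranking like this is one way to decide which features to investigate first when overall_drift is True and several features are listed.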