Validation, Schema & Cost Estimation¶

QuPrep v0.4.0 adds a full validation layer so problems are caught early — before encoding, before circuits are built, before any framework is involved.

Quick example¶

import quprep as qd

schema = qd.DataSchema([
    qd.FeatureSpec("age",    dtype="continuous", min_value=0,   max_value=120),
    qd.FeatureSpec("income", dtype="continuous", min_value=0),
    qd.FeatureSpec("flag",   dtype="binary"),
])

pipeline = qd.Pipeline(encoder=qd.AngleEncoder(), schema=schema)
result = pipeline.fit_transform(df)

print(result.cost.nisq_safe)   # True / False
result.summary()               # audit table + cost breakdown

Input validation¶

validate_dataset() runs automatically at pipeline entry and checks:

Dataset is 2-D and non-empty
Data dtype is float (not int, object, etc.)
NaN coverage: warns if any column has missing values (QuPrepWarning)
Qubit mismatch: warns if n_features > n_qubits for the chosen encoder

import warnings
from quprep.validation import validate_dataset, QuPrepWarning

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    validate_dataset(dataset)

for warning in w:
    if issubclass(warning.category, QuPrepWarning):
        print(warning.message)

To suppress QuPrep warnings selectively:

import warnings
from quprep.validation import QuPrepWarning
warnings.filterwarnings("ignore", category=QuPrepWarning)

Schema enforcement¶

A DataSchema defines what the pipeline expects at entry. Violations are collected and reported together so you get the full picture in one error.

Define a schema¶

import quprep as qd

schema = qd.DataSchema([
    qd.FeatureSpec("age",     dtype="continuous", min_value=0, max_value=120),
    qd.FeatureSpec("income",  dtype="continuous", min_value=0),
    qd.FeatureSpec("is_employed", dtype="binary"),
    qd.FeatureSpec("score",   dtype="continuous", nullable=True),
])

FeatureSpec parameters:

Parameter	Type	Default	Description
`name`	str	required	Expected column name
`dtype`	str	required	`'continuous'`, `'discrete'`, or `'binary'`
`min_value`	float	None	Minimum allowed value (inclusive)
`max_value`	float	None	Maximum allowed value (inclusive)
`nullable`	bool	False	Whether NaN is permitted

Attach to a pipeline¶

pipeline = qd.Pipeline(encoder=qd.AngleEncoder(), schema=schema)
# raises SchemaViolationError at fit() / fit_transform() if violated

Validate standalone¶

try:
    schema.validate(dataset)
except qd.SchemaViolationError as e:
    print(e)
# DataSchema validation failed with 2 violation(s):
#   - Feature 'age': min value -5.0 < allowed minimum 0
#   - Feature 'is_employed': expected binary {0, 1}, got values [-1.0, 0.0, 1.0]

Infer from data¶

schema = qd.DataSchema.infer(train_dataset)
# auto-detects names, types, and min/max from the training set

Save and load (JSON)¶

# Save
with open("schema.json", "w") as f:
    f.write(schema.to_json())

# Load
with open("schema.json") as f:
    schema = qd.DataSchema.from_json(f.read())

The JSON format is a plain list of dicts — easy to edit by hand:

[
  {"name": "age",    "dtype": "continuous", "min_value": 0, "max_value": 120},
  {"name": "income", "dtype": "continuous", "min_value": 0},
  {"name": "flag",   "dtype": "binary"}
]

CLI: `quprep validate`¶

Inspect any CSV file without writing any Python:

# Structural report — shape, NaN counts, value ranges
quprep validate dataset.csv

# Infer schema and save to JSON
quprep validate dataset.csv --infer-schema schema.json

# Print inferred schema to stdout
quprep validate dataset.csv --infer-schema -

# Validate against a saved schema
quprep validate new_data.csv --schema schema.json

Typical output:

Dataset : dataset.csv
Shape   : 150 samples × 4 features
Columns : sepal_length, sepal_width, petal_length, petal_width
NaN     : none
Ranges  :
          'sepal_length': [4.3, 7.9]
          'sepal_width':  [2.0, 4.4]
          'petal_length': [1.0, 6.9]
          'petal_width':  [0.1, 2.5]

Schema  : checking ...
Schema  : OK — no violations

Exits 0 on success, 1 on violation (safe to use in CI).

Cost estimation¶

Know your circuit complexity before encoding:

cost = qd.estimate_cost(qd.AngleEncoder(), n_features=8)

print(cost.encoding)       # "angle"
print(cost.n_qubits)       # 8
print(cost.gate_count)     # 8
print(cost.circuit_depth)  # 1
print(cost.two_qubit_gates) # 0
print(cost.nisq_safe)      # True   (depth < 200 and CNOTs < 50)
print(cost.warning)        # None   (or a warning string if unsafe)

NISQ thresholds: circuit_depth < 200 and two_qubit_gates < 50. Both must hold for nisq_safe=True.

Cost on PipelineResult¶

PipelineResult.cost is populated automatically — computed after all reduction stages are applied so the qubit count reflects the actual dimensionality, not the raw input:

result = pipeline.fit_transform(df)
print(result.cost.nisq_safe)
print(result.cost.gate_count)

Preprocessing audit log¶

PipelineResult.audit_log records what happened to the data at each stage:

result = qd.Pipeline(
    cleaner=qd.Imputer(),
    reducer=qd.PCAReducer(n_components=4),
    encoder=qd.AngleEncoder(),
).fit_transform(df)

for entry in result.audit_log:
    print(entry)
# {'stage': 'cleaner',    'n_samples_in': 150, 'n_features_in': 10, 'n_samples_out': 148, 'n_features_out': 10}
# {'stage': 'reducer',    'n_samples_in': 148, 'n_features_in': 10, 'n_samples_out': 148, 'n_features_out':  4}
# {'stage': 'normalizer', 'n_samples_in': 148, 'n_features_in':  4, 'n_samples_out': 148, 'n_features_out':  4}

audit_log is None when no preprocessing stages ran.

Intermediate stage datasets¶

PipelineResult.stages gives you the actual Dataset after each pipeline step — useful for debugging transforms or inspecting what the data looks like before encoding:

result = qd.Pipeline(
    cleaner=qd.OutlierHandler(),
    reducer=qd.PCAReducer(n_components=4),
    encoder=qd.AngleEncoder(),
).fit_transform(df)

# Keys present for every stage that ran
print(result.stages.keys())
# dict_keys(['input', 'after_cleaner', 'after_reducer', 'after_normalizer'])

# Inspect data shape at each step
print(result.stages["input"].data.shape)          # (150, 10)
print(result.stages["after_cleaner"].data.shape)  # (148, 10)
print(result.stages["after_reducer"].data.shape)  # (148, 4)
print(result.stages["after_normalizer"].data.shape)  # (148, 4)

Only stages that actually ran appear as keys — an encoder-only pipeline will have just 'input'.

Summary output¶

Both Pipeline and PipelineResult have a .summary() method useful in notebooks and scripts:

pipeline.fit(df)
print(pipeline.summary())

result = pipeline.transform(df)
result.summary()

str(pipeline) also calls .summary().

sklearn compatibility¶

Every stage and the Pipeline itself now support the full fit / transform split:

# Fit once on training data
pipeline.fit(X_train)

# Transform any number of test sets
X_val_result   = pipeline.transform(X_val)
X_test_result  = pipeline.transform(X_test)

get_params() and set_params() are also implemented for hyperparameter search:

from sklearn.model_selection import GridSearchCV

pipeline.set_params(encoder=qd.BasisEncoder())
params = pipeline.get_params()

Encoding compatibility check¶

check_compatibility catches problems before encoding — NaN values, out-of-range data, and wrong binary assumptions — with clear, actionable messages:

import quprep as qd

report = qd.check_compatibility(qd.AngleEncoder(rotation="ry"), dataset)

if not report.is_compatible:
    print("Errors (must fix):")
    for e in report.errors:
        print(" •", e)
else:
    print("Compatible")

if report.warnings:
    print("Warnings (consider fixing):")
    for w in report.warnings:
        print(" •", w)

Typical output with out-of-range angles:

Compatible
Warnings (consider fixing):
 • Values outside [0, π] for AngleEncoder(ry); suggest minmax_pi normalizer

Errors are raised as hard incompatibilities (is_compatible = False):

Errors (must fix):
 • NaN values detected in dataset; use Imputer before encoding

What each encoder checks¶

Encoder	Warnings	Errors
`AngleEncoder(rotation="ry")`	values outside `[0, π]`	NaN
`AngleEncoder(rotation="rx")`	values outside `[-π, π]`	NaN
`AmplitudeEncoder`	features not power-of-two	NaN
`BasisEncoder`	non-binary values	NaN
`ZZFeatureMapEncoder`	values outside `[0, 2π]`	NaN

Encoding verification (post-encode)¶

verify_encoding checks that the output of an encoder satisfies the expected mathematical invariants — unit norm for amplitude encoding, angle ranges for angle encoding:

import quprep as qd

enc = qd.AmplitudeEncoder()
encoded = enc.encode_batch(dataset)   # must pass L2-normalised input

report = qd.verify_encoding(encoded, enc)
print(report.passed)              # True / False
for check in report.checks:
    print(check["name"], check["passed"], check.get("detail", ""))

For an empty list of encoded samples, verify_encoding returns passed=True with an empty checks list.

Full preprocessing audit¶

For a single combined pre-encoding report — NaN fractions, outliers, qubit-budget shortfall, class imbalance, and encoder-range violations in one object — use preprocessing_report(dataset, encoder=..., qubit_budget=...). It returns n_issues and a list of human-readable recommendations. See the Auto-suggest guide for details.