Encoding Comparison & Smart Recommendations¶
QuPrep can compare all encoding methods side-by-side and recommend the best one for your dataset — analytically, before a single circuit is built.
Encoding comparison¶
compare_encodings() profiles your dataset and runs cost estimation for every encoder, returning a table of qubit count, gate count, circuit depth, 2-qubit gates, and NISQ safety.
Encoding Qubits Gate Count Depth 2Q Gates NISQ Safe
------------------ -------- ----------- ------- ---------- ----------
angle 8 8 1 0 Yes
amplitude 3 16 8 8 Yes
basis 8 8 1 0 Yes
iqp 8 84 64 28 Yes
reupload 8 24 24 0 Yes
entangled_angle 8 15 8 7 Yes
hamiltonian 8 8 8 0 Yes
No circuits are generated — costs are estimated analytically, so comparison is fast even for large datasets.
Filtering encoders¶
# Only compare a subset
result = qd.compare_encodings(X, include=["angle", "iqp", "amplitude"])
# Exclude encoders you know won't fit
result = qd.compare_encodings(X, exclude=["amplitude", "hamiltonian"])
Task-aware recommendation¶
Pass task= to highlight the best encoding for your use case:
result = qd.compare_encodings("data.csv", task="classification")
print(result)
# The recommended encoding is starred in the table
Valid tasks: classification, regression, qaoa, kernel, simulation.
Qubit budget¶
result = qd.compare_encodings("data.csv", qubits=8)
# Encoders requiring more than 8 qubits have nisq_safe=False and a budget warning
Picking the best¶
best = result.best(prefer="nisq") # NISQ-safe, then lowest depth (default)
best = result.best(prefer="depth") # Globally shallowest circuit
best = result.best(prefer="gates") # Fewest total gates
best = result.best(prefer="qubits") # Fewest qubits
print(best.encoding, best.n_qubits, best.circuit_depth)
Programmatic access¶
CLI¶
quprep compare data.csv
quprep compare data.csv --task classification --qubits 8
quprep compare data.csv --include angle,iqp,amplitude
quprep compare data.csv --exclude amplitude,hamiltonian
Smart recommendation engine¶
recommend() goes beyond a fixed lookup table — it adapts its scores based on what it finds in your data:
| Signal | How it affects the recommendation |
|---|---|
n_samples > 500 |
Penalises amplitude (expensive state prep per sample); rewards reupload |
n_samples < 20 |
Penalises reupload (high expressivity → overfitting risk) |
missing_rate > 10% |
Penalises amplitude (requires exact unit norm) |
| Negative values in data | Rewards amplitude (handles negatives naturally via superposition); penalises basis (all negatives → 0 after binarization) |
| Sparse data (many zeros) | Boosts basis encoding |
| Correlated features | Boosts IQP and entangled angle (entanglement captures inter-feature structure) |
n_features > 15 |
Penalises IQP (depth grows as O(d²)) |
rec = qd.recommend("data.csv", task="classification", qubits=8)
print(rec)
# Recommended encoding : iqp
# Qubits needed : 8
# Circuit depth : O(d²·reps)
# NISQ safe : yes
# Score : 54.0
# Reason : best fit for classification tasks; continuous features
# map naturally to rotation angles; NISQ-safe (shallow circuit).
# Alternatives :
# angle score=45.0 O(d)
# reupload score=45.0 O(d·layers)
# ...
The reason field always explains which dataset signals drove the recommendation.
Simulation-based re-ranking (use_metrics=True)¶
Pass use_metrics=True to augment the heuristic scores with circuit-level metrics computed by simulating each encoding on samples from your data:
When n_features ≤ 12, QuPrep simulates the candidate encodings with a lightweight numpy statevector backend and adds the following bonuses to each encoding's heuristic score:
| Metric | Bonus range | Direction |
|---|---|---|
| Expressibility (KL divergence) | up to +8 | lower KL → more expressive → higher bonus |
| Entanglement capability | up to +6 | classification / kernel tasks only |
| Kernel alignment | up to ±12 | higher alignment → better class separation |
The recommendation is then re-ranked by the combined score. For datasets with more than 12 features the metrics pass is skipped and the heuristic scores are used unchanged.
Combining both¶
# Compare first, then apply the best one
result = qd.compare_encodings("data.csv", task="classification", qubits=8)
best = result.best(prefer="nisq")
pipeline = qd.Pipeline(
encoder=getattr(qd, f"{best.encoding.title().replace('_', '')}Encoder")(),
exporter=qd.QASMExporter(),
)
pipeline_result = pipeline.fit_transform("data.csv")
Or use recommend() directly with .apply():
Reproducibility fingerprinting¶
Once you've chosen an encoding, lock in the exact configuration with a deterministic hash — useful for paper methods sections and experiment logs.
import quprep as qd
pipeline = qd.Pipeline(
cleaner=qd.Imputer(strategy="knn"),
reducer=qd.PCAReducer(n_components=4),
encoder=qd.AngleEncoder(rotation="ry"),
exporter=qd.QASMExporter(),
)
fp = pipeline.fingerprint()
print(fp.hash) # sha256 hex — same config always produces the same hash
Or via the standalone function:
What the hash captures¶
- Class name and all constructor parameters for every configured stage
- Installed versions of key dependencies (numpy, scikit-learn, scipy, qiskit, pennylane, etc.)
- QuPrep version and Python version
The hash is stable across runs — the timestamp is excluded so that the same configuration always produces the same hash regardless of when or where it runs.
Exporting for a paper¶
# JSON (default) — attach to supplementary material
fp.save("experiment.json")
# YAML — requires pyyaml
fp.save("experiment.yaml", format="yaml")
# Inline JSON string
print(fp.to_json())
Example JSON output:
{
"hash": "sha256:a3f7c1...",
"timestamp": "2026-04-10T10:23:00+00:00",
"quprep_version": "0.9.0",
"python_version": "3.12.0",
"stages": {
"cleaner": { "class": "Imputer", "params": { "strategy": "knn" } },
"reducer": { "class": "PCAReducer", "params": { "n_components": 4 } },
"encoder": { "class": "AngleEncoder","params": { "rotation": "ry" } },
"exporter": { "class": "QASMExporter","params": {} }
},
"dependencies": {
"numpy": "1.26.4",
"scikit-learn": "1.4.2"
}
}
Include the hash value in your paper's methods section. Readers can reproduce your exact setup by inspecting the JSON and recreating the pipeline.