Neural Recombination Runbook
Step-by-step guide for repeating the full distillation → quantization → crossover → fine-tune → validation pipeline from AgentFarm#8.
Table of Contents
- Prerequisites
- Quick Reference: Pipeline Order
- Stage 1 — Distillation
- Stage 2 — Post-Training Quantization (PTQ)
- Stage 3 — Quantization-Aware Training (QAT) — Optional
- Stage 4 — Crossover + Fine-tuning
- Stage 5 — Validation
- Optional: Compare Distillation Modes
- Parameter Reference
- Tuning Guide
- Copy-Paste Recipes
- Generalization: Holdout & Domain-Shift Evaluation
- Publication Ablations
- Qualitative Error Analysis for Recombined Networks
- Multi-Generation Crossover Search
1. Prerequisites
Environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
Python 3.8+ required; 3.9+ recommended. All scripts below assume the repo root is your working directory and that the venv is active.
Parent checkpoints
Each pipeline run requires two pre-trained parent BaseQNetwork state-dicts — parent_A.pt and parent_B.pt. These are the teachers for distillation. Architecture dimensions (input_dim, output_dim, hidden_size) must stay consistent across every stage.
The default architecture shipped with farm/config/default.yaml is:
| Dimension | Default |
|---|---|
| `input_dim` | 8 |
| `output_dim` | 4 |
| `hidden_size` / `parent_hidden` | 64 |
If your parents use different dimensions, pass the corresponding flags to every script.
State buffer
All scripts share a common evaluation dataset: an (N, input_dim) float32 NumPy array.
- Synthetic (quickest): generated at runtime via `--n-states` + `--seed`; always reproducible but not representative of real agent behaviour.
- Real replay buffer: pass `--states-file path/to/states.npy`. The `.npy` file must have shape `(N, input_dim)` in float32. Using the same distribution as training/deployment gives the most realistic validation metrics.

Reproducibility rule: use the same `--states-file` (or the same `--n-states` / `--seed` pair) across all stages so that metrics are comparable.
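Before launching a stage, the buffer requirements above can be checked with a few lines of NumPy. This is a minimal sketch, not part of the repository: the helper name `check_state_buffer` and the temporary file location are hypothetical, and the standard-normal generator only stands in for a real replay buffer.

```python
import os
import tempfile

import numpy as np


def check_state_buffer(path, input_dim=8):
    """Load a .npy state buffer and verify shape (N, input_dim) and float32 dtype."""
    states = np.load(path)
    if states.ndim != 2 or states.shape[1] != input_dim:
        raise ValueError(f"expected shape (N, {input_dim}), got {states.shape}")
    if states.dtype != np.float32:
        raise ValueError(f"expected float32, got {states.dtype}")
    return states


# Build a small synthetic buffer (hypothetical stand-in for a real replay buffer).
rng = np.random.default_rng(42)
synthetic = rng.standard_normal((2000, 8)).astype(np.float32)

buffer_path = os.path.join(tempfile.mkdtemp(), "states.npy")
np.save(buffer_path, synthetic)

states = check_state_buffer(buffer_path, input_dim=8)
print(states.shape, states.dtype)
```

Running the same check against every file passed to `--states-file` catches shape or dtype drift before it surfaces as a mismatch inside a script.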
2. Quick Reference: Pipeline Order
parent_A.pt ─┐
             ├─► run_distillation.py ─► student_A.pt ─┐
parent_B.pt ─┘                          student_B.pt ─┘
                         │
  (PTQ) quantize_distilled.py ─► student_A_int8.pt
  (QAT) qat_distilled.py      ─► student_A_qat_int8.pt
                         │
  finetune_child.py (crossover inside) ─► child_finetuned.pt
                         │
  validate_distillation.py · validate_quantized.py
  validate_recombination.py
Each stage writes checkpoints and companion JSON metadata (*.pt.json) that become inputs for the next stage.
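Because each stage hands its metadata to the next, a mismatch in architecture dimensions can be caught early by comparing the companion JSON files. The sketch below is illustrative only: the key names `input_dim`, `output_dim`, and `hidden_size` are assumptions about what the `*.pt.json` files contain, and the demo files are made up.

```python
import json
import os
import tempfile

ARCH_KEYS = ("input_dim", "output_dim", "hidden_size")  # assumed key names


def load_arch(meta_path):
    """Return the architecture fields found in one *.pt.json file."""
    with open(meta_path, "r", encoding="utf-8") as fh:
        meta = json.load(fh)
    return {key: meta.get(key) for key in ARCH_KEYS}


def find_arch_mismatches(meta_paths):
    """Compare every metadata file against the first one; return mismatching paths."""
    reference = load_arch(meta_paths[0])
    return [p for p in meta_paths[1:] if load_arch(p) != reference]


# Demo with two throwaway metadata files (contents are made up for illustration).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, hidden in (("student_A.pt.json", 64), ("student_B.pt.json", 32)):
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"input_dim": 8, "output_dim": 4, "hidden_size": hidden}, fh)
    paths.append(path)

mismatches = find_arch_mismatches(paths)
print([os.path.basename(p) for p in mismatches])
```

If the real metadata nests these values under a config object, adjust `load_arch` accordingly.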
3. Stage 1 — Distillation
Script: scripts/run_distillation.py
Goal: train a smaller StudentQNetwork to reproduce the Q-value distribution of each frozen parent (BaseQNetwork).
Minimal synthetic run
python scripts/run_distillation.py \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--n-states 2000 \
--seed 42 \
--output-dir checkpoints/distillation
With a real replay buffer
python scripts/run_distillation.py \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--states-file data/replay_states.npy \
--seed 42 \
--output-dir checkpoints/distillation
Expected shape of replay_states.npy: (N, 8) float32 (or whatever your input_dim is).
Key parameters
| Flag | Default | Description |
|---|---|---|
| `--pair` | `both` | `A`, `B`, or `both` |
| `--input-dim` | `8` | State feature dimension |
| `--output-dim` | `4` | Number of actions |
| `--parent-hidden` | `64` | Teacher hidden width |
| `--parent-a-ckpt` | (required) | Path to `parent_A.pt` |
| `--parent-b-ckpt` | (required) | Path to `parent_B.pt` |
| `--n-states` | `1000` | Synthetic state count (ignored if `--states-file` is given) |
| `--states-file` | (none) | `.npy` replay buffer path |
| `--temperature` | `3.0` | Softmax temperature for soft labels |
| `--alpha` | `1.0` | Soft/hard blend: `1.0` = pure soft KL, `0.0` = pure hard CE |
| `--epochs` | `10` | Training epochs |
| `--lr` | `1e-3` | Adam learning rate |
| `--batch-size` | `32` | Mini-batch size |
| `--max-grad-norm` | `1.0` | Gradient clipping norm |
| `--val-fraction` | `0.1` | Held-out validation split |
| `--loss-fn` | `kl` | Soft loss: `kl` (recommended) or `mse` |
| `--seed` | `None` | RNG seed |
| `--output-dir` | `checkpoints/distillation` | Output directory |
Outputs
checkpoints/distillation/
student_A.pt # StudentQNetwork state dict
student_A.pt.json # config + per-epoch metrics
student_B.pt
student_B.pt.json
4. Stage 2 — Post-Training Quantization (PTQ)
Script: scripts/quantize_distilled.py
Goal: compress student_*.pt to int8 with no re-training. Start here before trying QAT.
Minimal dynamic PTQ
python scripts/quantize_distilled.py \
--checkpoint-dir checkpoints/distillation \
--input-dim 8 --output-dim 4 --parent-hidden 64 \
--mode dynamic \
--output-dir checkpoints/quantized
Static PTQ (with calibration data)
python scripts/quantize_distilled.py \
--checkpoint-dir checkpoints/distillation \
--states-file data/replay_states.npy \
--mode static \
--calibration-batches 10 \
--calibration-batch-size 64 \
--output-dir checkpoints/quantized
Key parameters
| Flag | Default | Description |
|---|---|---|
| `--pair` | `both` | `A`, `B`, or `both` |
| `--checkpoint-dir` | (none) | Directory containing `student_A.pt` / `student_B.pt` |
| `--student-a-ckpt` / `--student-b-ckpt` | (none) | Explicit paths (override `--checkpoint-dir`) |
| `--input-dim` | `8` | Must match distillation |
| `--output-dim` | `4` | Must match distillation |
| `--parent-hidden` | `64` | Must match distillation |
| `--mode` | `dynamic` | `dynamic` (weight-only, no calibration) or `static` |
| `--dtype` | `qint8` | Quantization dtype |
| `--backend` | `auto` | `auto`, `x86`, `fbgemm`, `qnnpack` |
| `--calibration-batches` | `10` | Static mode: number of calibration batches |
| `--calibration-batch-size` | `64` | Static mode: batch size for calibration |
| `--states-file` | (none) | Calibration states (static mode); also used for output comparison |
| `--n-states` | `1000` | Synthetic calibration states if no file |
| `--seed` | `42` | RNG seed |
| `--output-dir` | `checkpoints/quantized` | Output directory |
Outputs
checkpoints/quantized/
student_A_int8.pt # Quantized model (CPU int8 pickle)
student_A_int8.pt.json # QuantizationConfig + timing
student_B_int8.pt
student_B_int8.pt.json
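To build intuition for why int8 compression perturbs Q-values slightly, here is a generic symmetric per-tensor quantization round-trip in plain Python. It is a sketch under simplifying assumptions and not what `quantize_distilled.py` does internally: PyTorch backends may use per-channel scales, zero points, and different rounding.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.82, -0.31, 0.05, -1.27, 0.64]  # toy weights, illustration only
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The round-trip error per weight is bounded by half the scale, which is why tensors with a few large outliers tend to quantize worse than well-conditioned ones.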
5. Stage 3 — Quantization-Aware Training (QAT) — Optional
Script: scripts/qat_distilled.py
When to use QAT instead of PTQ:
Use QAT when PTQ action agreement (reported by `validate_quantized.py`) falls below your target threshold (e.g. < 90%). QAT adds a short training pass with fake quantization, recovering accuracy at the cost of a few extra minutes.
Minimal QAT run
python scripts/qat_distilled.py \
--checkpoint-dir checkpoints/distillation \
--input-dim 8 --output-dim 4 --parent-hidden 64 \
--epochs 5 \
--n-states 2000 \
--seed 42 \
--output-dir checkpoints/qat
By default (that is, without `--no-convert`), the script also produces the converted int8 checkpoint.
Key parameters
| Flag | Default | Description |
|---|---|---|
| `--pair` | `both` | `A`, `B`, or `both` |
| `--checkpoint-dir` | (none) | Directory with `parent_<pair>.pt` + `student_<pair>.pt` |
| `--teacher-a-ckpt` / `--student-a-ckpt` | (none) | Explicit path overrides |
| `--input-dim` | `8` | Must match distillation |
| `--output-dim` | `4` | Must match distillation |
| `--parent-hidden` | `64` | Must match distillation |
| `--epochs` | `5` | QAT fine-tuning epochs |
| `--learning-rate` | `1e-4` | Adam LR (lower than distillation; model is already trained) |
| `--batch-size` | `32` | Mini-batch size |
| `--max-grad-norm` | `1.0` | Gradient clipping norm |
| `--val-fraction` | `0.1` | Validation split |
| `--loss-fn` | `mse` | `mse` (default for QAT) or `kl` |
| `--temperature` | `3.0` | Temperature for `kl` mode |
| `--alpha` | `1.0` | Soft/hard blend for `kl` mode |
| `--no-convert` | `False` | Skip int8 conversion; save only float QAT checkpoint |
| `--states-file` / `--n-states` / `--seed` | (none) | Same semantics as other stages |
| `--output-dir` | `checkpoints/qat` | Output directory |
Outputs
checkpoints/qat/
student_A_qat.pt # Float QAT checkpoint
student_A_qat.pt.json
student_A_qat_int8.pt # Converted int8 (same format as PTQ output)
student_A_qat_int8.pt.json
6. Stage 4 — Crossover + Fine-tuning
Script: scripts/finetune_child.py
Goal: blend two parent state dicts into a child via a crossover strategy, then fine-tune the child against a frozen reference (parent A) using a distillation-style loss.
Minimal synthetic run
python scripts/finetune_child.py \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--crossover-mode weighted \
--crossover-alpha 0.5 \
--crossover-seed 42 \
--n-states 2000 \
--seed 42 \
--output-dir checkpoints/finetune
With a replay buffer and YAML overrides
python scripts/finetune_child.py \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--crossover-mode random \
--crossover-alpha 0.5 \
--crossover-seed 0 \
--states-file data/replay_states.npy \
--config-yaml farm/config/default.yaml \
--epochs 10 \
--lr 5e-4 \
--output-dir checkpoints/finetune
YAML defaults (farm/config/default.yaml)
The crossover_child_finetune section provides all fine-tuning defaults:
crossover_child_finetune:
  learning_rate: 0.001
  epochs: 5
  batch_size: 32
  max_grad_norm: 1.0
  val_fraction: 0.1
  seed: null
  loss_fn: kl
  temperature: 3.0
  temp_decay: 1.0              # per-epoch temperature multiplier; 1.0 = no decay
  alpha: 1.0                   # soft/hard blend (1.0 = pure KL, 0.0 = pure CE)
  lr_schedule_patience: 0      # ReduceLROnPlateau patience; 0 = disabled
  lr_schedule_factor: 0.5      # LR reduction factor when plateau detected
  quantization_applied: none   # none | ptq_dynamic | ptq_static | qat_float
  optimizer: adam
  optimizer_kwargs: {}
  early_stopping_patience: 0   # 0 = disabled
CLI flags such as --lr, --epochs, --alpha override the YAML values when specified.
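One common way to get this precedence is to give every overridable flag a default of `None` and copy only explicitly passed values over the YAML defaults. The sketch below shows that pattern under stated assumptions: the flag-to-key mapping is hypothetical, only three settings are shown, and `finetune_child.py` may implement the override differently.

```python
import argparse

# Defaults mirroring a subset of the YAML section shown above.
YAML_DEFAULTS = {"learning_rate": 0.001, "epochs": 5, "alpha": 1.0}

# Hypothetical mapping from CLI destination names to YAML keys.
CLI_TO_YAML = {"lr": "learning_rate", "epochs": "epochs", "alpha": "alpha"}


def build_parser():
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not specified" apart from an explicit value.
    parser.add_argument("--lr", type=float, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--alpha", type=float, default=None)
    return parser


def resolve_config(argv, yaml_defaults):
    """Start from YAML defaults, then apply only the flags the user passed."""
    args = vars(build_parser().parse_args(argv))
    config = dict(yaml_defaults)
    for cli_name, yaml_key in CLI_TO_YAML.items():
        if args[cli_name] is not None:
            config[yaml_key] = args[cli_name]
    return config


config = resolve_config(["--lr", "5e-4", "--epochs", "10"], YAML_DEFAULTS)
print(config)
```

Here `alpha` keeps its YAML value because `--alpha` was not passed, while `learning_rate` and `epochs` take the CLI values.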
Key parameters
| Flag | Default (YAML) | Description |
|---|---|---|
| `--input-dim` | `8` | Must match parent architecture |
| `--output-dim` | `4` | Must match parent architecture |
| `--hidden-size` | `64` | Must match parent architecture |
| `--parent-a-ckpt` | (required) | Parent A checkpoint (also the fine-tune teacher) |
| `--parent-b-ckpt` | (required) | Parent B checkpoint |
| `--crossover-mode` | (required) | `random`, `layer`, or `weighted` |
| `--crossover-alpha` | (none) | Blend/selection coefficient (see below) |
| `--crossover-seed` | (none) | RNG seed for `random` mode |
| `--n-states` / `--states-file` | (none) | State buffer (same as other stages) |
| `--config-yaml` | `farm/config/default.yaml` | YAML with `crossover_child_finetune` section |
| `--lr` | `1e-3` | Adam learning rate |
| `--epochs` | `5` | Fine-tuning epochs |
| `--batch-size` | `32` | Mini-batch size |
| `--max-grad-norm` | `1.0` | Gradient clipping norm |
| `--val-fraction` | `0.1` | Validation split |
| `--loss-fn` | `kl` | Distillation loss (`kl` or `mse`) |
| `--temperature` | `3.0` | Softmax temperature |
| `--alpha` | `1.0` | Soft/hard blend |
| `--lr-patience` | `0` | ReduceLROnPlateau patience (0 = off) |
| `--lr-factor` | `0.5` | LR reduction factor |
| `--seed` | `null` | Fine-tuning RNG seed (separate from crossover seed) |
| `--quantization-applied` | `none` | `none`, `ptq_dynamic`, `ptq_static`, or `qat_float` |
| `--optimizer` | `adam` | `adam`, `adamw`, `sgd`, or `rmsprop` |
| `--early-stopping-patience` | `0` | Validation-loss patience (0 = off) |
| `--output-dir` | (required) | Output directory |
Crossover modes
| Mode | `alpha` meaning | Deterministic? | Notes |
|---|---|---|---|
| `random` | Probability of selecting from parent A per tensor | No (needs `--crossover-seed`) | High diversity; use multiple seeds to estimate variance |
| `layer` | Ignored | Yes | Even blocks from A, odd blocks from B; structurally coherent |
| `weighted` | Linear blend weight: `child = alpha*A + (1-alpha)*B` | Yes | Smooth interpolation; 0.5 = midpoint |
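The three modes in the table can be illustrated on toy state dicts made of plain Python lists. This is a sketch of the semantics as described above, not the repository's crossover code: the function name `crossover`, the tensor names, and the rule that the text before the first `.` identifies a block are all assumptions.

```python
import random


def crossover(parent_a, parent_b, mode, alpha=0.5, seed=None):
    """Blend two state dicts (name -> list of floats) into a child."""
    if parent_a.keys() != parent_b.keys():
        raise ValueError("parents must share the same tensor names")
    rng = random.Random(seed)
    # Assumption: the text before the first "." identifies a block, e.g. "layer0".
    blocks = []
    for name in parent_a:
        prefix = name.split(".")[0]
        if prefix not in blocks:
            blocks.append(prefix)
    child = {}
    for name in parent_a:
        a, b = parent_a[name], parent_b[name]
        if mode == "weighted":
            child[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
        elif mode == "random":
            # Pick the whole tensor from A with probability alpha.
            child[name] = list(a if rng.random() < alpha else b)
        elif mode == "layer":
            block_index = blocks.index(name.split(".")[0])
            child[name] = list(a if block_index % 2 == 0 else b)
        else:
            raise ValueError(f"unknown crossover mode: {mode}")
    return child


parent_a = {"layer0.weight": [1.0, 1.0], "layer0.bias": [1.0],
            "layer1.weight": [1.0, 1.0], "layer1.bias": [1.0]}
parent_b = {name: [0.0] * len(values) for name, values in parent_a.items()}

weighted_child = crossover(parent_a, parent_b, "weighted", alpha=0.25)
layer_child = crossover(parent_a, parent_b, "layer")
random_child = crossover(parent_a, parent_b, "random", alpha=0.5, seed=42)
```

With all-ones and all-zeros parents, each child value directly shows how much of parent A survived, which makes the effect of `alpha` and of the seed easy to inspect.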
Outputs
checkpoints/finetune/
child.pt # Raw crossover child (pre fine-tune)
child.pt.json
child_finetuned.pt # Fine-tuned child
child_finetuned.pt.json
7. Stage 5 — Validation
7.1 Distillation quality
Script: scripts/validate_distillation.py
Checks KL divergence, MSE, MAE, cosine similarity, top-k agreement, latency ratio, and robustness slices between parent and student.
python scripts/validate_distillation.py \
--checkpoint-dir checkpoints/distillation \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--n-states 2000 --seed 42 \
--report-dir reports/distillation
Key threshold flags (all have sensible defaults; override to tighten):
| Flag | Default | Description |
|---|---|---|
| `--min-action-agreement` | `0.8` | Minimum top-1 action agreement |
| `--max-kl-divergence` | `0.1` | Maximum KL divergence |
| `--max-mse` | `0.01` | Maximum mean-squared error |
| `--min-cosine-similarity` | `0.9` | Minimum cosine similarity |
| `--max-latency-ratio` | `2.0` | Maximum student/parent latency ratio |
| `--min-robustness-action-agreement` | `0.7` | Agreement on noisy/out-of-distribution slices |
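The pass/fail logic implied by these flags is a set of minimum and maximum bounds on report metrics. The sketch below applies the default values from the table to a made-up metrics dict; the helper `check_thresholds` is hypothetical, the robustness threshold is omitted for brevity, and the script's own gating code may differ.

```python
# Default distillation thresholds from the table above.
THRESHOLDS = {
    "min_action_agreement": 0.8,
    "max_kl_divergence": 0.1,
    "max_mse": 0.01,
    "min_cosine_similarity": 0.9,
    "max_latency_ratio": 2.0,
}


def check_thresholds(metrics, thresholds):
    """Return a list of human-readable failures; an empty list means the gate passed."""
    failures = []
    for name, limit in thresholds.items():
        bound, metric_key = name.split("_", 1)  # "min"/"max" + metric name
        value = metrics[metric_key]
        if bound == "min" and value < limit:
            failures.append(f"{metric_key}={value} is below minimum {limit}")
        elif bound == "max" and value > limit:
            failures.append(f"{metric_key}={value} is above maximum {limit}")
    return failures


# Made-up metrics for illustration only, not recorded results.
example_metrics = {
    "action_agreement": 0.86,
    "kl_divergence": 0.14,
    "mse": 0.004,
    "cosine_similarity": 0.97,
    "latency_ratio": 0.6,
}

failures = check_thresholds(example_metrics, THRESHOLDS)
print(failures)
```

In this example only the KL bound fails, even though top-1 agreement passes.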
7.2 Quantization fidelity
Script: scripts/validate_quantized.py
Compares float student vs int8 student on agreement, Q-error, latency, and memory.
python scripts/validate_quantized.py \
--float-dir checkpoints/distillation \
--quant-dir checkpoints/quantized \
--n-states 2000 --seed 42 \
--report-dir reports/quantized
Key threshold flags:
| Flag | Default | Description |
|---|---|---|
| `--min-action-agreement` | `0.9` | Quantized vs float top-1 agreement |
| `--max-mean-q-error` | `0.05` | Mean absolute Q-value error |
| `--min-cosine-similarity` | `0.95` | Cosine similarity |
| `--max-latency-ratio` | `1.5` | Int8 / float latency ratio |
7.3 Recombination quality
Script: scripts/validate_recombination.py
Evaluates the fine-tuned child against both parents.
python scripts/validate_recombination.py \
--checkpoint-dir checkpoints/finetune \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--n-states 2000 --seed 42 \
--include-parent-baseline \
--report-dir reports/recombination
For quantized children add --child-quantized (and --parent-a-quantized / --parent-b-quantized if parents are also int8).
Key threshold flags:
| Flag | Default | Description |
|---|---|---|
| `--min-action-agreement` | `0.7` | Child vs parent action agreement |
| `--max-kl-divergence` | `0.3` | Child vs parent KL |
| `--max-mse` | `0.05` | Child vs parent MSE |
| `--min-cosine-similarity` | `0.8` | Child vs parent cosine similarity |
“Good enough” heuristic: aim for child-vs-A and child-vs-B top-1 agreement both ≥ 0.7, with neither collapsing to one parent. The `primary_metric = min(agreement_A, agreement_B)` used by `run_crossover_search.py` captures this directly.

Case-level analysis: for per-state disagreements, logit summaries, worst-k states, and hidden-layer activations see § 14 (Qualitative Error Analysis) and `scripts/analyze_recombination.py`.
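The heuristic can be made concrete with a few lines of plain Python. This sketch computes top-1 agreement and `min(agreement_A, agreement_B)` on toy Q-values; the function names are illustrative and the numbers are invented, so it shows the idea rather than the evaluator's actual code.

```python
def argmax(values):
    """Index of the largest Q-value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def action_agreement(q_model, q_reference):
    """Fraction of states where both models pick the same top-1 action."""
    matches = sum(argmax(q) == argmax(r) for q, r in zip(q_model, q_reference))
    return matches / len(q_model)


def primary_metric(q_child, q_parent_a, q_parent_b):
    """min(agreement_A, agreement_B): low whenever the child ignores one parent."""
    return min(action_agreement(q_child, q_parent_a),
               action_agreement(q_child, q_parent_b))


# Toy Q-values for four states and two actions (illustrative numbers only).
q_parent_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
q_parent_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
q_child = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]]

agreement_a = action_agreement(q_child, q_parent_a)
agreement_b = action_agreement(q_child, q_parent_b)
score = primary_metric(q_child, q_parent_a, q_parent_b)
print(agreement_a, agreement_b, score)
```

The toy child copies parent A on every state, so agreement with A is perfect while the minimum exposes the weak agreement with B.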
Validation report layout
Each script writes a JSON report alongside a human-readable markdown summary. JSON keys of interest:
- `action_agreement`: top-1 agreement between models
- `oracle_action_agreement`: agreement when both models are uncertain
- `kl_divergence`, `mse`, `mae`, `cosine_similarity`: Q-value closeness metrics
- `latency_ms_median` / `latency_ratio`: speed comparison
8. Optional: Compare Distillation Modes
Script: scripts/compare_distillation_modes.py
Runs hard-only (alpha=0), soft-only (alpha=1), and blended distillation back-to-back with a shared frozen teacher and state buffer so results are directly comparable.
python scripts/compare_distillation_modes.py \
--seed 42 --epochs 10 --n-states 2000 \
--json-out reports/distillation_mode_comparison.json
See docs/distillation_soft_label_comparison.md for recorded results and discussion.
9. Parameter Reference
Architecture consistency
The following dimensions must match across all stages:
| Parameter | CLI flag (all scripts) | YAML key | Notes |
|---|---|---|---|
| State feature dim | `--input-dim` | (none) | Parent network input size |
| Action count | `--output-dim` | (none) | Parent network output size |
| Teacher hidden width | `--parent-hidden` / `--hidden-size` | (none) | Used to reconstruct `BaseQNetwork` |
Seed inventory
| Stage | Flag | Purpose |
|---|---|---|
| Distillation | `--seed` | State generation + training RNG |
| Quantization (PTQ) | `--seed` | Synthetic calibration state generation |
| QAT | `--seed` | Synthetic state generation + training RNG |
| Crossover | `--crossover-seed` | Per-tensor selection RNG (`random` mode) |
| Fine-tuning | `--seed` | Training batch shuffle + dropout |
crossover_child_finetune YAML keys
| Key | Type | Safe range | Description |
|---|---|---|---|
| `learning_rate` | float | `5e-5 – 5e-3` | Adam LR |
| `epochs` | int | `3 – 20` | Training epochs |
| `batch_size` | int | `16 – 128` | Mini-batch size |
| `max_grad_norm` | float | `0.5 – 5.0`; `0` = off | Gradient clipping |
| `val_fraction` | float | `0.05 – 0.2` | Held-out validation fraction |
| `seed` | int or null | any | `null` = non-deterministic |
| `loss_fn` | `kl` / `mse` | n/a | `kl` preferred for soft distillation |
| `temperature` | float | `1.0 – 10.0` | Softmax temperature for `kl` loss |
| `temp_decay` | float | `0.9 – 1.0` | Per-epoch temperature multiplier |
| `alpha` | float | `0.0 – 1.0` | Soft/hard blend; `1.0` = pure soft |
| `lr_schedule_patience` | int | `0 – 5` | ReduceLROnPlateau; `0` = off |
| `lr_schedule_factor` | float | `0.1 – 0.9` | LR multiplier on plateau |
| `quantization_applied` | string | see below | Triggers QAT-aware fine-tune path |
| `optimizer` | string | `adam`, `adamw`, `sgd`, `rmsprop` | Optimizer choice |
| `early_stopping_patience` | int | `0 – 10` | Val-loss patience; `0` = off |
quantization_applied values: none (default, float path), ptq_dynamic, ptq_static, qat_float — when not none, FineTuner replaces Linear layers with WeightOnlyFakeQuantLinear; call convert() + save_quantized() after finetune() for int8 output.
10. Tuning Guide
State data choice
- Use a real replay buffer whenever possible. Synthetic standard-normal states cover the full input range but may not reflect the distribution your agents encounter at inference time. Validation metrics on synthetic states can be optimistic.
- The `.npy` file should be float32, shape `(N, input_dim)`. Prefer N ≥ 2 000 for stable statistics; 10 000+ for final validation.
- Keep the same states file across all stages (or at least the same `--seed`) so that reported metrics are on a common distribution.
Distillation
- Temperature (`--temperature`, default `3.0`): higher values flatten the teacher’s distribution, making inter-action confidence ordering more visible. Range `2.0–6.0` is typical; reduce toward `1.0` if the student converges too slowly.
- Alpha (`--alpha`, default `1.0`): `1.0` = pure KL soft loss; `0.0` = pure cross-entropy on the argmax. Start with `1.0`; blend toward `0.5` if action agreement is high but hard-label accuracy is poor.
- Learning rate instability: if training loss oscillates or diverges, lower `--lr` (try `5e-4`) and ensure `--max-grad-norm 1.0` is in effect.
- Low agreement after training: increase `--epochs` (try `20`), or provide a larger / more representative state buffer.
- `--loss-fn mse`: simpler objective; useful for debugging, but KL is generally better for Q-value distributions.
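To see what temperature and alpha do numerically, the following pure-Python sketch computes a temperature-scaled softmax and a blended soft/hard loss for a single state. It is a generic knowledge-distillation formulation written under assumptions: the repository's loss may scale the KL term differently (for example by the squared temperature), and the Q-values are invented.

```python
import math


def softmax(q_values, temperature=1.0):
    """Temperature-scaled softmax over a list of Q-values."""
    scaled = [q / temperature for q in q_values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def blended_loss(teacher_q, student_q, temperature=3.0, alpha=1.0):
    """alpha * soft KL term + (1 - alpha) * hard cross-entropy on the teacher argmax."""
    soft = kl_divergence(softmax(teacher_q, temperature),
                         softmax(student_q, temperature))
    hard_label = max(range(len(teacher_q)), key=lambda i: teacher_q[i])
    hard = -math.log(softmax(student_q)[hard_label])
    return alpha * soft + (1 - alpha) * hard


teacher_q = [2.0, 1.0, 0.5, -1.0]  # toy values, illustration only
student_q = [1.5, 1.2, 0.1, -0.5]

sharp = softmax(teacher_q, temperature=1.0)
flat = softmax(teacher_q, temperature=3.0)
soft_only = blended_loss(teacher_q, student_q, alpha=1.0)
hard_only = blended_loss(teacher_q, student_q, alpha=0.0)
print(max(sharp), max(flat), soft_only, hard_only)
```

Raising the temperature lowers the peak probability, so the student receives more signal about the relative ranking of non-greedy actions.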
Post-Training Quantization
- Try dynamic PTQ first: zero training cost, typically ≥ 90 % action agreement. Only invest in static PTQ or QAT if `validate_quantized.py` reports agreement below your target.
- Static PTQ: requires calibration data that reflects real input distribution; use `--states-file`. Use `--calibration-batches 10–50` and `--calibration-batch-size 64–256`.
- Backend choice: on Intel CPUs `fbgemm` is often fastest; on ARM/mobile use `qnnpack`; `auto` selects automatically. Benchmark with `--throughput-batch-size` in `validate_quantized.py`.
- QAT vs PTQ decision rule: if PTQ `action_agreement < 0.90`, try QAT with `--epochs 5 --learning-rate 1e-4`. Use `--loss-fn mse` for QAT (default) unless the teacher distribution is very soft.
Crossover
- `weighted` mode is the smoothest starting point: `--crossover-alpha 0.5` gives a midpoint blend. Move `alpha` toward `0.3` or `0.7` to bias toward one parent.
- `random` mode produces the most diverse children but highest variance across seeds. Run three seeds (`--crossover-seed 0`, `1`, `2` via separate invocations) to gauge variance before committing to one.
- `layer` mode preserves structural coherence (each Linear + LayerNorm pair from the same parent) at the cost of diversity: only two possible children per pair. Useful when the network is sensitive to feature-scaling mismatches.
- If child collapses to one parent: `primary_metric = min(agreement_A, agreement_B)` will be low. Try `weighted` at `0.5`, or `random` with a different seed.
Fine-tuning
- Reference teacher: `finetune_child.py` always uses parent A as the fine-tune teacher. This biases the child toward A’s behaviour; if you want a more balanced child, check parent B agreement explicitly with `validate_recombination.py --include-parent-baseline`.
- Loss function: `kl` with `temperature 3.0` and `alpha 1.0` mirrors the distillation loss and tends to produce soft, calibrated Q-values. Use `mse` for a simpler target.
- LR schedule: enable `--lr-patience 3 --lr-factor 0.5` if validation loss plateaus early. This is disabled by default to keep runs short.
- Early stopping: `--early-stopping-patience 5` prevents overfitting on small state buffers; safe to enable for production runs.
- `quantization_applied`: set to `ptq_dynamic`, `ptq_static`, or `qat_float` only when you intend to produce an int8 fine-tuned child. Requires calling `FineTuner.convert()` + `save_quantized()` programmatically after the script; for a pure float32 child leave as `none`.
Validation
- Dimension/architecture mismatches are the most common failure mode. Always pass the same `--input-dim`, `--output-dim`, and `--parent-hidden` (or `--hidden-size`) that were used at distillation time. The checkpoint JSON metadata (`*.pt.json`) records these for reference.
- Agreement vs oracle agreement: `action_agreement` counts top-1 matches; `oracle_action_agreement` counts matches only when both models are confident. Low oracle agreement with high action agreement usually indicates one model is uncertain overall.
- Threshold calibration: default thresholds in the validation scripts are conservative. After a successful baseline run you can tighten `--min-action-agreement` toward `0.85–0.95` for production gates.
- Using the same state buffer as training: evaluation metrics will be optimistically inflated if the same states are used for both training and validation. Prefer a held-out test split by passing a separate `.npy` file to the validation scripts.
11. Copy-Paste Recipes
Recipe A — Minimal synthetic run (no parent checkpoints required for architecture test)
# 1. Distil (synthetic states, no real parents — useful for smoke test)
python scripts/run_distillation.py \
--n-states 2000 --seed 42 \
--epochs 5 \
--output-dir checkpoints/distillation
# 2. PTQ
python scripts/quantize_distilled.py \
--checkpoint-dir checkpoints/distillation \
--n-states 2000 --seed 42 \
--mode dynamic \
--output-dir checkpoints/quantized
# 3. Validate distillation
python scripts/validate_distillation.py \
--checkpoint-dir checkpoints/distillation \
--n-states 2000 --seed 42 \
--report-dir reports/distillation
# 4. Validate quantization
python scripts/validate_quantized.py \
--float-dir checkpoints/distillation \
--quant-dir checkpoints/quantized \
--n-states 2000 --seed 42 \
--report-dir reports/quantized
# 5. Crossover + fine-tune
python scripts/finetune_child.py \
--parent-a-ckpt checkpoints/distillation/student_A.pt \
--parent-b-ckpt checkpoints/distillation/student_B.pt \
--crossover-mode weighted --crossover-alpha 0.5 \
--n-states 2000 --seed 42 \
--output-dir checkpoints/finetune
# 6. Validate recombination
python scripts/validate_recombination.py \
--parent-a-ckpt checkpoints/distillation/student_A.pt \
--parent-b-ckpt checkpoints/distillation/student_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--n-states 2000 --seed 42 \
--report-dir reports/recombination
Recipe B — Full run with real parent checkpoints and replay buffer
PARENTS=checkpoints/parents
STATES=data/replay_states.npy # shape (N, 8) float32
OUT=checkpoints/run1
# 1. Distil
python scripts/run_distillation.py \
--parent-a-ckpt $PARENTS/parent_A.pt \
--parent-b-ckpt $PARENTS/parent_B.pt \
--states-file $STATES --seed 42 \
--temperature 3.0 --alpha 1.0 \
--epochs 20 --lr 1e-3 \
--output-dir $OUT/distillation
# 2. Validate distillation
python scripts/validate_distillation.py \
--checkpoint-dir $OUT/distillation \
--parent-a-ckpt $PARENTS/parent_A.pt \
--parent-b-ckpt $PARENTS/parent_B.pt \
--states-file $STATES --seed 42 \
--report-dir $OUT/reports/distillation
# 3. PTQ (try dynamic first)
python scripts/quantize_distilled.py \
--checkpoint-dir $OUT/distillation \
--states-file $STATES --seed 42 \
--mode dynamic \
--output-dir $OUT/quantized
# 4. Validate quantization
python scripts/validate_quantized.py \
--float-dir $OUT/distillation \
--quant-dir $OUT/quantized \
--states-file $STATES --seed 42 \
--report-dir $OUT/reports/quantized
# (Optional) If PTQ agreement < 0.90, run QAT instead:
# python scripts/qat_distilled.py \
# --checkpoint-dir $OUT/distillation \
# --states-file $STATES --seed 42 \
# --epochs 5 --learning-rate 1e-4 \
# --output-dir $OUT/qat
# 5. Crossover + fine-tune (using float parents)
python scripts/finetune_child.py \
--parent-a-ckpt $PARENTS/parent_A.pt \
--parent-b-ckpt $PARENTS/parent_B.pt \
--crossover-mode weighted --crossover-alpha 0.5 \
--crossover-seed 42 \
--states-file $STATES --seed 42 \
--epochs 10 --lr 5e-4 \
--lr-patience 3 --early-stopping-patience 5 \
--output-dir $OUT/finetune
# 6. Validate recombination
python scripts/validate_recombination.py \
--parent-a-ckpt $PARENTS/parent_A.pt \
--parent-b-ckpt $PARENTS/parent_B.pt \
--child-ckpt $OUT/finetune/child_finetuned.pt \
--states-file $STATES --seed 42 \
--include-parent-baseline \
--report-dir $OUT/reports/recombination
12. Generalization: Holdout & Domain-Shift Evaluation
Script: scripts/eval_generalization.py
Standard validation metrics (Sections 7.1–7.3) are measured on a single state buffer that may overlap with calibration or training data. For publication-grade generalization claims, you need:
- A held-out test split that was never used for training or calibration.
- An optional domain-shift evaluation generated by perturbing the holdout states (for example, sensor noise or input scaling).
eval_generalization.py automates both steps by:
- Splitting the state buffer into an in-distribution (ID) and holdout subset.
- Optionally perturbing the holdout set with Gaussian noise or input scaling.
- Running `RecombinationEvaluator` on each subset and writing a per-set JSON report plus a combined `generalization_summary.json`.
At present, the CLI documents perturbation-based domain shift only; it does not expose a flag for loading a separate shifted .npy states file.
Library helpers
The split and perturbation logic is available as standalone functions in `farm.core.decision.training.holdout_utils`:

| Function | Purpose |
|---|---|
| `split_replay_buffer(states, holdout_fraction, seed)` | Random train/holdout split |
| `apply_gaussian_noise(states, std, seed)` | Add i.i.d. Gaussian noise |
| `apply_input_scaling(states, scale_factor)` | Multiply all features by a scalar |
| `make_shifted_states(states, shift_type, **kwargs)` | Factory dispatcher for the above |
All helpers are also re-exported from farm.core.decision.training.
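For readers without the repository at hand, the behaviour described in the table can be approximated with NumPy. The functions below are stand-ins with deliberately different names (`split_states`, `add_gaussian_noise`, `scale_inputs`); they are not the `holdout_utils` implementations, whose return order, dtype handling, and RNG usage may differ.

```python
import numpy as np


def split_states(states, holdout_fraction=0.2, seed=0):
    """Shuffle row indices, then split into (in_distribution, holdout)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(states))
    n_holdout = int(round(len(states) * holdout_fraction))
    return states[order[n_holdout:]], states[order[:n_holdout]]


def add_gaussian_noise(states, std=0.1, seed=0):
    """Add i.i.d. zero-mean Gaussian noise to every feature."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=states.shape)
    return (states + noise).astype(states.dtype)


def scale_inputs(states, scale_factor=2.0):
    """Multiply all features by one scalar."""
    return (states * scale_factor).astype(states.dtype)


# Synthetic stand-in for a replay buffer of shape (N, input_dim).
states = np.random.default_rng(42).standard_normal((2000, 8)).astype(np.float32)
in_dist, holdout = split_states(states, holdout_fraction=0.2, seed=42)
noisy = add_gaussian_noise(holdout, std=0.1, seed=0)
scaled = scale_inputs(holdout, scale_factor=2.0)
print(in_dist.shape, holdout.shape, noisy.dtype)
```

With 2000 states and a holdout fraction of 0.2, the split yields 1600 in-distribution rows and 400 holdout rows.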
Minimal synthetic run (synthetic states, existing checkpoints required)
This example uses synthetic evaluation states, but it still requires trained parent A, parent B, and child checkpoints from the Recipe A workflow in Section 11. Replace paths as needed.
python scripts/eval_generalization.py \
--parent-a-ckpt checkpoints/distillation/student_A.pt \
--parent-b-ckpt checkpoints/distillation/student_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--n-states 2000 --seed 42 \
--holdout-fraction 0.2 \
--report-dir reports/generalization
Output:
reports/generalization/
id_report.json # in-distribution split report
holdout_report.json # held-out split report
generalization_summary.json
With a real replay buffer
python scripts/eval_generalization.py \
--parent-a-ckpt checkpoints/parents/parent_A.pt \
--parent-b-ckpt checkpoints/parents/parent_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--states-file data/replay_states.npy \
--holdout-fraction 0.2 \
--report-dir reports/generalization
With Gaussian-noise domain shift
python scripts/eval_generalization.py \
--parent-a-ckpt checkpoints/parents/parent_A.pt \
--parent-b-ckpt checkpoints/parents/parent_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--states-file data/replay_states.npy \
--holdout-fraction 0.2 \
--shift-type gaussian_noise \
--shift-std 0.1 \
--shift-seed 0 \
--report-dir reports/generalization
Output adds reports/generalization/shifted_report.json.
With input-scaling domain shift
python scripts/eval_generalization.py \
--parent-a-ckpt checkpoints/parents/parent_A.pt \
--parent-b-ckpt checkpoints/parents/parent_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--states-file data/replay_states.npy \
--shift-type input_scaling \
--shift-scale-factor 2.0 \
--report-dir reports/generalization
Key flags
| Flag | Default | Description |
|---|---|---|
| `--holdout-fraction` | `0.2` | Fraction of states reserved for holdout |
| `--no-shuffle` | off | Skip shuffle before split (for pre-randomised buffers) |
| `--shift-type` | (none) | `gaussian_noise` or `input_scaling`; omit to skip shifted eval |
| `--shift-std` | `0.1` | Gaussian noise standard deviation |
| `--shift-scale-factor` | `2.0` | Input scaling multiplier |
| `--shift-seed` | `0` | Noise RNG seed |
| `--report-only` | off | Write reports without applying pass/fail thresholds |
Reading the generalization_summary.json
{
"overall_passed": true,
"report_only": false,
"holdout_fraction": 0.2,
"shift_type": "gaussian_noise",
"sets": {
"in_distribution": {
"child_agrees_with_parent_a": 0.82,
"child_agrees_with_parent_b": 0.79,
"oracle_agreement": 0.91,
"n_states": 1600,
"passed": true
},
"holdout": {
"child_agrees_with_parent_a": 0.80,
"child_agrees_with_parent_b": 0.77,
"oracle_agreement": 0.89,
"n_states": 400,
"passed": true
},
"shifted": {
"child_agrees_with_parent_a": 0.73,
"child_agrees_with_parent_b": 0.70,
"oracle_agreement": 0.84,
"n_states": 400,
"passed": true,
"shift_type": "gaussian_noise"
}
}
}
A meaningful generalization drop is when holdout or shifted agreement falls more than about 5 percentage points (pp) below the ID score. If this happens, consider:
- Training on a larger or more diverse replay buffer.
- Increasing the holdout fraction to detect over-fitting earlier in development.
- Tuning the crossover alpha or fine-tuning LR to reduce ID–holdout gap.
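The 5 pp rule of thumb is easy to automate against the summary file. The sketch below embeds a trimmed copy of the example summary shown earlier and flags every agreement score that drops too far below its in-distribution counterpart; the helper name `generalization_drops` and the default `max_drop` are illustrative choices, not part of the script.

```python
import json

# Trimmed copy of the example summary above (agreement fields only).
SUMMARY_JSON = """
{
  "sets": {
    "in_distribution": {"child_agrees_with_parent_a": 0.82, "child_agrees_with_parent_b": 0.79},
    "holdout": {"child_agrees_with_parent_a": 0.80, "child_agrees_with_parent_b": 0.77},
    "shifted": {"child_agrees_with_parent_a": 0.73, "child_agrees_with_parent_b": 0.70}
  }
}
"""

AGREEMENT_KEYS = ("child_agrees_with_parent_a", "child_agrees_with_parent_b")


def generalization_drops(summary, max_drop=0.05):
    """Return {(set_name, key): drop} for every agreement that falls too far below ID."""
    sets = summary["sets"]
    baseline = sets["in_distribution"]
    flagged = {}
    for set_name, report in sets.items():
        if set_name == "in_distribution":
            continue
        for key in AGREEMENT_KEYS:
            drop = baseline[key] - report[key]
            if drop > max_drop:
                flagged[(set_name, key)] = round(drop, 4)
    return flagged


flagged = generalization_drops(json.loads(SUMMARY_JSON))
for (set_name, key), drop in flagged.items():
    print(f"{set_name}: {key} dropped by {drop:.2f}")
```

In the example data the holdout set stays within 2 pp of the ID scores, while both shifted agreements fall by about 9 pp and are flagged, even though the shifted set still passes its absolute thresholds.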
13. Publication Ablations
Script: scripts/run_recombination_ablation.py
For reproducible paper tables and CI-style regression of the full pipeline,
use the unified ablation runner. A single invocation sweeps multiple
conditions (e.g. distill-only, distill+quantize, or the full pipeline)
across a list of seeds and writes every result into a structured
results/ tree together with a consolidated CSV and Markdown summary
table. If you provide a shared states_file, that same state buffer is
reused across seeds and conditions. If states_file is omitted, the
runner generates synthetic states per seed, so cross-seed results are not
directly comparable unless you supply a common state buffer.
Quick start (no config file needed)
# Dry-run: validate plan, write stub summary, no training
python scripts/run_recombination_ablation.py --smoke-test --dry-run
# Smoke-test: tiny synthetic run (2 seeds × 3 conditions, 50 states, 2 epochs)
python scripts/run_recombination_ablation.py --smoke-test --results-dir /tmp/ablation_smoke
Full run from a config file
python scripts/run_recombination_ablation.py --config ablation.yaml
The config file is YAML (recommended) or JSON. A minimal example:
seeds: [0, 1, 2]
n_states: 2000
states_file: ""   # leave empty to synthesise per-seed (supply a .npy path for comparable cross-seed results)
input_dim: 8
output_dim: 4
hidden_size: 64
results_dir: results/ablation
conditions:
  - name: distill_only
    stages: [distill]
  - name: distill_quantize
    stages: [distill, quantize]
  - name: full_pipeline
    stages: [distill, quantize, crossover, compare]
distillation:
  epochs: 20
  temperature: 3.0
  alpha: 1.0
  lr: 0.001
  batch_size: 32
quantization:
  mode: dynamic
crossover:
  mode: weighted
  alpha: 0.5
comparison:
  report_only: true
Output layout
results/ablation/
  distill_only/
    seed_0/student_A.pt student_B.pt
    seed_1/...
    seed_2/...
  distill_quantize/
    seed_0/student_A.pt student_B.pt student_A_int8.pt student_B_int8.pt
    ...
  full_pipeline/
    seed_0/student_A.pt student_B.pt student_A_int8.pt student_B_int8.pt
           child_finetuned.pt compare_child_vs_students.json
    ...
  ablation_summary.csv ← consolidated table (paste into spreadsheet)
  ablation_summary.md ← Markdown version (paste into GitHub issues)
Per-condition stage overrides
Each condition can override any global distillation / quantization / crossover / comparison setting:
conditions:
- name: high_temp_distill
stages: [distill, crossover, compare]
distillation:
temperature: 6.0 # overrides global temperature: 3.0
epochs: 30
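One way to read the override rule is as a per-section shallow merge: keys set on a condition replace the global value, and everything else is inherited. The sketch below illustrates that reading only. `resolve_condition` and `SECTIONS` are hypothetical names, not part of the runner, and the script's actual merge semantics may differ.

```python
from copy import deepcopy

# Section names taken from the config example; the helper itself is hypothetical.
SECTIONS = ("distillation", "quantization", "crossover", "comparison")

def resolve_condition(global_cfg, condition):
    """Return the effective per-section settings for one condition."""
    resolved = {}
    for section in SECTIONS:
        merged = deepcopy(global_cfg.get(section, {}))  # start from global values
        merged.update(condition.get(section, {}))       # condition keys win
        resolved[section] = merged
    return resolved

global_cfg = {
    "distillation": {"epochs": 20, "temperature": 3.0, "alpha": 1.0,
                     "lr": 0.001, "batch_size": 32},
    "crossover": {"mode": "weighted", "alpha": 0.5},
}
condition = {
    "name": "high_temp_distill",
    "stages": ["distill", "crossover", "compare"],
    "distillation": {"temperature": 6.0, "epochs": 30},
}

resolved = resolve_condition(global_cfg, condition)
print(resolved["distillation"])
```

Under this reading, `lr`, `alpha`, and `batch_size` keep their global values for `high_temp_distill`, while `temperature` and `epochs` take the overridden ones.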
Valid stages
| Stage | What it runs |
|---|---|
| distill | DistillationTrainer for both A and B pairs; writes student_A.pt, student_B.pt |
| quantize | PostTrainingQuantizer on both students; writes student_A_int8.pt, student_B_int8.pt |
| crossover | When quantize is included, int8 parents are loaded and dequantized to float weights, then crossover_quantized_state_dict blends them into a float child; otherwise float student_*.pt parents are blended. FineTuner always uses float student A as KD teacher. Writes child_finetuned.pt. |
| compare | RecombinationEvaluator (float child vs float or int8 parents matching the pipeline); writes compare_child_vs_students.json |
Stages are always applied in the order listed above regardless of declaration
order in the config. Parse-time rules: quantize and crossover require
distill; compare requires crossover (there must be a child to score).
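The ordering and parse-time rules above can be sketched as a small validator. This is an illustration of the stated rules, not the runner's code: `normalize_stages`, `CANONICAL_ORDER`, and `REQUIRES` are hypothetical names, and the real script may report errors differently.

```python
# Canonical execution order, as listed in the Valid stages table.
CANONICAL_ORDER = ("distill", "quantize", "crossover", "compare")

# Each stage maps to the stage it requires (per the parse-time rules).
REQUIRES = {"quantize": "distill", "crossover": "distill", "compare": "crossover"}

def normalize_stages(declared):
    """Validate a condition's stage list and return it in canonical order."""
    unknown = [s for s in declared if s not in CANONICAL_ORDER]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    for stage in declared:
        needed = REQUIRES.get(stage)
        if needed is not None and needed not in declared:
            raise ValueError(f"stage '{stage}' requires '{needed}'")
    # Declaration order is ignored; canonical order is applied.
    return [s for s in CANONICAL_ORDER if s in declared]

print(normalize_stages(["compare", "crossover", "distill"]))
```

A list such as `[distill, compare]` is rejected under these rules because `compare` has no child to score without `crossover`.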
Dry-run mode
python scripts/run_recombination_ablation.py --config ablation.yaml --dry-run
Prints the full execution plan (conditions × seeds × stages × directories)
and writes a stub ablation_summary.md / ablation_summary.csv without
running any training. Use this to verify the config before a long run.
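To get a feel for what the plan contains, the conditions × seeds expansion can be approximated in a few lines. The sketch assumes run directories follow the `<results_dir>/<condition>/seed_<n>` pattern shown under Output layout; `build_plan` is a hypothetical helper and the runner's printed plan format is not reproduced here.

```python
from itertools import product
from pathlib import PurePosixPath

def build_plan(results_dir, conditions, seeds):
    """Expand conditions x seeds into one plan entry per run (hypothetical helper)."""
    plan = []
    for cond, seed in product(conditions, seeds):
        run_dir = PurePosixPath(results_dir) / cond["name"] / f"seed_{seed}"
        plan.append({
            "condition": cond["name"],
            "seed": seed,
            "stages": list(cond["stages"]),
            "run_dir": str(run_dir),
        })
    return plan

conditions = [
    {"name": "distill_only", "stages": ["distill"]},
    {"name": "full_pipeline", "stages": ["distill", "quantize", "crossover", "compare"]},
]
plan = build_plan("results/ablation", conditions, seeds=[0, 1, 2])
for entry in plan:
    print(entry["run_dir"], entry["stages"])
```

Counting the entries before a long run is a quick sanity check: two conditions across three seeds should yield six runs.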
Using a shared real replay buffer
Set states_file in the config to a .npy file of shape (N, input_dim)
float32. All seeds and conditions will use the same state file, ensuring
metrics are comparable across the ablation.
states_file: data/replay_states.npy
Reading the summary table
The Markdown summary table (ablation_summary.md) contains one row per
(condition, seed) pair. Key columns:
| Column | Meaning |
|---|---|
| child_vs_ref_a_agreement | Top-1 action agreement of child vs parent A (float student, or int8 checkpoint when the condition includes quantize) |
| child_vs_ref_b_agreement | Top-1 action agreement of child vs parent B (same rule) |
| oracle_agreement | Fraction where child matches at least one reference |
| elapsed_s | Wall-clock seconds for the (condition, seed) run |
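The three agreement columns can be illustrated with a pure-Python sketch that takes per-state Q-values for the child and both references. It follows the column definitions above (top-1 means the argmax action), but `agreement_metrics` is a hypothetical function and the toy Q-values are made up; RecombinationEvaluator's actual implementation is not shown in this runbook.

```python
def argmax(values):
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def agreement_metrics(child_q, ref_a_q, ref_b_q):
    """Top-1 agreement of the child with each reference, plus oracle agreement."""
    n = len(child_q)
    if n == 0:
        raise ValueError("need at least one state")
    match_a = match_b = match_any = 0
    for c, a, b in zip(child_q, ref_a_q, ref_b_q):
        child_action, a_action, b_action = argmax(c), argmax(a), argmax(b)
        match_a += child_action == a_action
        match_b += child_action == b_action
        # Oracle: the child matches at least one of the two references.
        match_any += child_action in (a_action, b_action)
    return {
        "child_vs_ref_a_agreement": match_a / n,
        "child_vs_ref_b_agreement": match_b / n,
        "oracle_agreement": match_any / n,
    }

# Toy Q-values for 4 states x 4 actions (illustrative only).
child = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
ref_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
ref_b = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
metrics = agreement_metrics(child, ref_a, ref_b)
print(metrics)
```

Oracle agreement is never lower than either per-parent agreement, since any state that matches one reference also counts toward it.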
child_vs_ref_*_agreement columns are populated only when the compare
stage is included. Conditions without a compare stage show n/a.
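For paper tables you will usually want per-condition means over seeds. The sketch below averages one column of a summary CSV and skips rows without a value. It assumes the CSV has a `condition` column and uses the same `n/a` marker as the Markdown table; neither is confirmed here, so check the header of your own ablation_summary.csv first. The sample rows are made-up values for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def mean_by_condition(csv_text, column="oracle_agreement"):
    """Average `column` per condition, ignoring empty and n/a cells."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        if cell == "" or cell.lower() == "n/a":
            continue  # condition has no compare stage
        values[row["condition"]].append(float(cell))  # "condition" is an assumed column name
    return {name: mean(vals) for name, vals in values.items()}

# Made-up sample rows; real files are written by the ablation runner.
sample = """condition,seed,oracle_agreement,elapsed_s
distill_only,0,n/a,1.0
full_pipeline,0,0.5,2.0
full_pipeline,1,0.75,2.0
"""
summary = mean_by_condition(sample)
print(summary)
```

To use it on a real run, read the file with `open(path, newline="").read()` and pass the text in.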
14. Qualitative Error Analysis for Recombined Networks
Script: scripts/analyze_recombination.py
Python API: farm.core.decision.training.recombination_analysis
The aggregate fidelity report from validate_recombination.py (§ 7.3) shows
mean agreement across all states. For publication or debugging you often
need case-level insight: which states does the child get wrong, and is
the disagreement systematic? analyze_recombination.py provides this.
What it produces
| Output | Description |
|---|---|
| disagreements.csv | One row per evaluation state. Columns: actions, agreement flags, per-state KL / MSE / cosine similarity, top-k mismatch flags. |
| disagreements.json | Same records in JSON with summary counts; includes raw logits when --include-logits is set. |
| worst_<k>_states.json | The k states with the largest errors, sorted by the chosen criterion. |
| <activations>.npy | (Optional) NumPy array of shape (N, activation_dim): hidden-layer activations for a memory-bounded probe set. |
Minimal run
python scripts/analyze_recombination.py \
--checkpoint-dir checkpoints/finetune \
--parent-a-ckpt checkpoints/parent_A.pt \
--parent-b-ckpt checkpoints/parent_B.pt \
--child-ckpt checkpoints/finetune/child_finetuned.pt \
--states-file data/replay_states.npy \
--output-dir reports/analysis
With logits, worst-10 states, and activation export
python scripts/analyze_recombination.py \
--checkpoint-dir checkpoints/finetune \
--states-file data/replay_states.npy \
--include-logits \
--worst-k 10 \
--worst-k-criterion max_kl \
--activations-out reports/analysis/child_activations.npy \
--activation-layer-index 4 \
--activation-max-states 500 \
--output-dir reports/analysis
--activation-layer-index selects a sub-module by its index in
list(model.modules()). For BaseQNetwork:
| Index | Layer |
|---|---|
| 4 | First hidden ReLU (post LayerNorm) |
| 8 | Second hidden ReLU (post LayerNorm) |
Python API
import numpy as np
from farm.core.decision.training.recombination_analysis import (
extract_disagreements,
worst_k_states,
export_disagreements_csv,
export_disagreements_json,
extract_activations,
)
states = np.load("data/replay_states.npy")
records = extract_disagreements(
parent_a, parent_b, child, states,
include_logits=True,
k_values=[1, 2, 3],
)
# Worst-10 states by maximum KL divergence across the two parents
worst = worst_k_states(records, k=10, criterion="max_kl")
export_disagreements_csv(records, "reports/analysis/disagreements.csv")
export_disagreements_json(records, "reports/analysis/disagreements.json")
# Memory-bounded activation export (first hidden ReLU, max 500 states)
acts = extract_activations(child, states, layer_index=4, max_states=500)
np.save("reports/analysis/child_activations.npy", acts)
Worst-k criteria
| Criterion | Sorts by |
|---|---|
| max_kl (default) | max(KL_vs_A, KL_vs_B) |
| kl_parent_a | KL divergence vs parent A |
| kl_parent_b | KL divergence vs parent B |
| max_mse | max(MSE_vs_A, MSE_vs_B) |
| mse_parent_a | MSE vs parent A |
| mse_parent_b | MSE vs parent B |
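The criteria in this table amount to different sort keys over the per-state records. The sketch below shows that idea on plain dicts. `select_worst_k` is a hypothetical stand-in, not the library's `worst_k_states`; the `mse_child_vs_parent_*` and `state_index` field names are assumptions made by analogy with the KL fields, and the record values are made up.

```python
# Sort keys for each criterion; larger value means a worse state.
CRITERIA = {
    "max_kl": lambda r: max(r["kl_child_vs_parent_a"], r["kl_child_vs_parent_b"]),
    "kl_parent_a": lambda r: r["kl_child_vs_parent_a"],
    "kl_parent_b": lambda r: r["kl_child_vs_parent_b"],
    "max_mse": lambda r: max(r["mse_child_vs_parent_a"], r["mse_child_vs_parent_b"]),
    "mse_parent_a": lambda r: r["mse_child_vs_parent_a"],
    "mse_parent_b": lambda r: r["mse_child_vs_parent_b"],
}

def select_worst_k(records, k, criterion="max_kl"):
    """Return the k records with the largest error under `criterion`."""
    if criterion not in CRITERIA:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(records, key=CRITERIA[criterion], reverse=True)[:k]

# Made-up records for three states.
records = [
    {"state_index": 0, "kl_child_vs_parent_a": 0.1, "kl_child_vs_parent_b": 0.9,
     "mse_child_vs_parent_a": 0.2, "mse_child_vs_parent_b": 0.1},
    {"state_index": 1, "kl_child_vs_parent_a": 0.5, "kl_child_vs_parent_b": 0.2,
     "mse_child_vs_parent_a": 0.05, "mse_child_vs_parent_b": 0.3},
    {"state_index": 2, "kl_child_vs_parent_a": 0.01, "kl_child_vs_parent_b": 0.02,
     "mse_child_vs_parent_a": 0.6, "mse_child_vs_parent_b": 0.0},
]
worst_by_kl = [r["state_index"] for r in select_worst_k(records, 2, "max_kl")]
worst_by_mse = [r["state_index"] for r in select_worst_k(records, 2, "max_mse")]
print(worst_by_kl, worst_by_mse)
```

The two criteria can rank states quite differently, so it is worth exporting both when a child looks fine on one metric and poor on the other.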
KL columns: kl_child_vs_parent_a / _b are KL(parent ‖ child) over action softmaxes, with the parent as the reference distribution. The field names are historical, so take care when comparing against tools that report KL(child ‖ parent): the two directions generally give different values.
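Because KL divergence is not symmetric, the direction matters when comparing numbers across tools. The following stdlib-only sketch computes both directions over action softmaxes for one state. The logits are arbitrary illustrative values, and the helper names are not from the library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; p is the reference distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

parent_logits = [2.0, 0.5, 0.0, -1.0]   # illustrative values
child_logits = [0.0, 0.0, 0.0, 0.0]     # child is uniform over actions

parent_probs = softmax(parent_logits)
child_probs = softmax(child_logits)

kl_parent_child = kl_divergence(parent_probs, child_probs)  # direction used by the kl_* columns
kl_child_parent = kl_divergence(child_probs, parent_probs)  # direction some other tools report
print(kl_parent_child, kl_child_parent)
```

Both values are non-negative and are zero only when the two distributions coincide, but they are not interchangeable.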
Integration with validate_recombination.py
Run validate_recombination.py first to check that aggregate fidelity meets
thresholds, then run analyze_recombination.py to drill into problem states.
Both share the same architecture flags and checkpoint conventions.
15. Multi-Generation Crossover Search
Script: scripts/run_multi_gen_search.py
Python API: farm.core.decision.training.crossover_search.run_multi_generation_search, GenerationConfig
This mode runs run_crossover_search (farm.core.decision.training.crossover_search.run_crossover_search) repeatedly. After each generation, the best child checkpoint becomes parent A for the next generation; parent B is chosen from the leaderboard according to selection_strategy (see below). Optional Gaussian mutation perturbs promoted parents' weights before the next generation's crossovers.
Semantics (parent selection)
| Role | Rule |
|---|---|
| Parent A (next gen) | Always the globally best child of the current generation (highest primary_metric on the full manifest). |
| Parent B under selection_strategy="best" | The rank-2 child in the sorted leaderboard within the first keep_top_k entries when at least two distinct children exist there. If keep_top_k is 1, or only one child exists in that prefix, parent B falls back to the original parent B from generation 0. |
| Parent B under "best_vs_original" | Always the original parent B (same in-memory / checkpoint object as at start). |
| Mutation | Applied only to loaded child checkpoints promoted as parents. The original parent B reference is never mutated when it is reused as parent B. |
| Lineage | lineage.json lists every child from every generation; keep_top_k does not trim stored lineage—it only affects which leaderboard prefix is used to pick rank-2 for parent B. |
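The selection rules in the table can be sketched as a small function over a leaderboard of (child_id, metric) pairs. This mirrors the table only: `pick_next_parents` is a hypothetical helper, the leaderboard representation is an assumption, and mutation and lineage handling are left out.

```python
def pick_next_parents(leaderboard, original_parent_b,
                      selection_strategy="best", keep_top_k=2):
    """Pick (parent_a, parent_b) for the next generation.

    leaderboard: list of (child_id, primary_metric) pairs, higher is better.
    """
    if not leaderboard:
        raise ValueError("leaderboard is empty")
    ranked = sorted(leaderboard, key=lambda entry: entry[1], reverse=True)
    parent_a = ranked[0][0]  # always the globally best child

    if selection_strategy == "best_vs_original":
        return parent_a, original_parent_b
    if selection_strategy != "best":
        raise ValueError(f"unknown selection_strategy: {selection_strategy}")

    # Distinct children within the first keep_top_k leaderboard entries.
    distinct = []
    for child_id, _ in ranked[:keep_top_k]:
        if child_id not in distinct:
            distinct.append(child_id)
    # Rank-2 child if available, otherwise fall back to the original parent B.
    parent_b = distinct[1] if len(distinct) >= 2 else original_parent_b
    return parent_a, parent_b

board = [("c1", 0.8), ("c2", 0.9), ("c3", 0.7)]
print(pick_next_parents(board, "orig_B", "best", keep_top_k=2))
print(pick_next_parents(board, "orig_B", "best", keep_top_k=1))
print(pick_next_parents(board, "orig_B", "best_vs_original"))
```

Note that `keep_top_k` only narrows the prefix used to find the rank-2 child; it does not affect which child becomes parent A.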
Per-generation RNG (GenerationConfig.seed)
In the CLI, --seed is forwarded to GenerationConfig.seed; it also seeds synthetic parent/state generation when you omit checkpoint/state files. When seed is set, generation g offsets every non-None crossover recipe seed and fine-tune regime seed in the shared SearchConfig by seed + g. When seed is None, recipe and regime seeds are used exactly as configured. Mutation RNG seeds combine MutationConfig.seed, GenerationConfig.seed, and the generation index; see the GenerationConfig docstring in code.
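The recipe and regime seed rule can be written down as a one-line offset. The sketch below assumes the generation index starts at 0, which this runbook does not state; `effective_seed` is a hypothetical helper. The mutation seed combination is deliberately not sketched because its formula lives only in the GenerationConfig docstring.

```python
def effective_seed(configured_seed, generation_seed, generation_index):
    """Seed actually used for a recipe or fine-tune regime in one generation.

    configured_seed:  the recipe/regime seed from SearchConfig (may be None).
    generation_seed:  GenerationConfig.seed (may be None).
    generation_index: g, assumed here to start at 0.
    """
    if configured_seed is None or generation_seed is None:
        # None seeds stay None; without a generation seed nothing is offset.
        return configured_seed
    return configured_seed + generation_seed + generation_index

# With --seed 1000, a recipe seed of 7 shifts by one per generation.
seeds_per_generation = [effective_seed(7, 1000, g) for g in range(3)]
print(seeds_per_generation)
```

The per-generation shift keeps successive generations from replaying identical crossover draws while remaining reproducible from a single `--seed`.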
Minimal CLI example
python scripts/run_multi_gen_search.py \
--search-space minimal \
--max-runs 3 \
--num-generations 3 \
--seed 1000 \
--run-dir runs/multi_gen_smoke
For a single generation with the full run_crossover_search.py flag surface (eval batch size, workers, recombination thresholds, etc.), use scripts/run_crossover_search.py. run_multi_gen_search.py uses a smaller argument set focused on multi-gen knobs; extend the Python API if you need full parity.
Related Documentation
| Document | Contents |
|---|---|
| docs/design/distill_quantize_crossover_finetune.md | Architecture overview, Mermaid pipeline diagram, module map, and recorded experimental results |
| docs/design/crossover_strategies.md | Detailed semantics of random, layer, and weighted crossover strategies with code examples |
| docs/design/crossover_search_space.md | Grid definitions, pre-defined search presets, and leaderboard format for run_crossover_search.py / run_multi_gen_search.py |
| docs/distillation_soft_label_comparison.md | Hard vs blended vs soft distillation objective comparison with reproducible results |
| farm/config/default.yaml | All YAML defaults including the crossover_child_finetune section |