Neural Recombination Runbook

Step-by-step guide for repeating the full distillation → quantization → crossover → fine-tune → validation pipeline from AgentFarm#8.

Prerequisites
Quick Reference: Pipeline Order
Stage 1 — Distillation
Stage 2 — Post-Training Quantization (PTQ)
Stage 3 — Quantization-Aware Training (QAT) — Optional
Stage 4 — Crossover + Fine-tuning
Stage 5 — Validation
Optional: Compare Distillation Modes
Parameter Reference
Tuning Guide
Copy-Paste Recipes
Generalization: Holdout & Domain-Shift Evaluation
Publication Ablations
Qualitative Error Analysis for Recombined Networks
Multi-Generation Crossover Search

1. Prerequisites

Environment

python -m venv venv
source venv/bin/activate       # Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Python 3.8+ required; 3.9+ recommended. All scripts below assume the repo root is your working directory and that the venv is active.

Parent checkpoints

Each pipeline run requires two pre-trained parent BaseQNetwork state-dicts — parent_A.pt and parent_B.pt. These are the teachers for distillation. Architecture dimensions (input_dim, output_dim, hidden_size) must stay consistent across every stage.

The default architecture shipped with farm/config/default.yaml is:

Dimension	Default
`input_dim`	8
`output_dim`	4
`hidden_size` / `parent_hidden`	64

If your parents use different dimensions, pass the corresponding flags to every script.

State buffer

All scripts share a common evaluation dataset: an (N, input_dim) float32 NumPy array.

Synthetic (quickest): generated at runtime via --n-states + --seed; always reproducible but not representative of real agent behaviour.
Real replay buffer: pass --states-file path/to/states.npy. The .npy file must have shape (N, input_dim) in float32. Using the same distribution as training/deployment gives the most realistic validation metrics.

Reproducibility rule: use the same --states-file (or the same --n-states/--seed pair) across all stages so that metrics are comparable.
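
The buffer contract above can be checked before any stage runs. The following is a minimal sketch, assuming NumPy is available; `make_synthetic_states` and `check_states` are hypothetical helpers, not part of the repo, and the scripts' own `--n-states`/`--seed` generator may use a different RNG, so the values will not necessarily match theirs.

```python
import os
import tempfile

import numpy as np


def make_synthetic_states(n_states: int, input_dim: int, seed: int) -> np.ndarray:
    """Draw standard-normal states with a fixed seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_states, input_dim)).astype(np.float32)


def check_states(states: np.ndarray, input_dim: int) -> None:
    """Raise if the buffer is not an (N, input_dim) float32 array."""
    if states.ndim != 2 or states.shape[1] != input_dim:
        raise ValueError(f"expected shape (N, {input_dim}), got {states.shape}")
    if states.dtype != np.float32:
        raise ValueError(f"expected float32, got {states.dtype}")


states = make_synthetic_states(n_states=2000, input_dim=8, seed=42)
check_states(states, input_dim=8)

# Round-trip through a .npy file, the format that --states-file expects.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "states.npy")
    np.save(path, states)
    reloaded = np.load(path)

check_states(reloaded, input_dim=8)
```

Saving one file and passing it to every stage via --states-file avoids relying on each script generating identical synthetic states.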

2. Quick Reference: Pipeline Order

parent_A.pt ─┐
             ├─► run_distillation.py ─► student_A.pt ─┐
parent_B.pt ─┘                          student_B.pt ─┘
                                               │
                              (PTQ) quantize_distilled.py ─► student_A_int8.pt
                         (QAT) qat_distilled.py    ─► student_A_qat_int8.pt
                                               │
                         finetune_child.py (crossover inside) ─► child_finetuned.pt
                                               │
                      validate_distillation.py · validate_quantized.py
                      validate_recombination.py

Each stage writes checkpoints and companion JSON metadata (*.pt.json) that become inputs for the next stage.

3. Stage 1 — Distillation

Script: scripts/run_distillation.py

Goal: train a smaller StudentQNetwork to reproduce the Q-value distribution of each frozen parent (BaseQNetwork).

Minimal synthetic run

python scripts/run_distillation.py \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --n-states 2000 \
  --seed 42 \
  --output-dir checkpoints/distillation

With a real replay buffer

python scripts/run_distillation.py \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --states-file data/replay_states.npy \
  --seed 42 \
  --output-dir checkpoints/distillation

Expected shape of replay_states.npy: (N, 8) float32 (or whatever your input_dim is).

Key parameters

Flag	Default	Description
`--pair`	`both`	`A`, `B`, or `both`
`--input-dim`	`8`	State feature dimension
`--output-dim`	`4`	Number of actions
`--parent-hidden`	`64`	Teacher hidden width
`--parent-a-ckpt`	(required)	Path to `parent_A.pt`
`--parent-b-ckpt`	(required)	Path to `parent_B.pt`
`--n-states`	`1000`	Synthetic state count (ignored if `--states-file`)
`--states-file`	—	`.npy` replay buffer path
`--temperature`	`3.0`	Softmax temperature for soft labels
`--alpha`	`1.0`	Soft/hard blend: `1.0` = pure soft KL, `0.0` = pure hard CE
`--epochs`	`10`	Training epochs
`--lr`	`1e-3`	Adam learning rate
`--batch-size`	`32`	Mini-batch size
`--max-grad-norm`	`1.0`	Gradient clipping norm
`--val-fraction`	`0.1`	Held-out validation split
`--loss-fn`	`kl`	Soft loss: `kl` (recommended) or `mse`
`--seed`	`None`	RNG seed
`--output-dir`	`checkpoints/distillation`	Output directory

Outputs

checkpoints/distillation/
  student_A.pt          # StudentQNetwork state dict
  student_A.pt.json     # config + per-epoch metrics
  student_B.pt
  student_B.pt.json

4. Stage 2 — Post-Training Quantization (PTQ)

Script: scripts/quantize_distilled.py

Goal: compress student_*.pt to int8 with no re-training. Start here before trying QAT.

Minimal dynamic PTQ

python scripts/quantize_distilled.py \
  --checkpoint-dir checkpoints/distillation \
  --input-dim 8 --output-dim 4 --parent-hidden 64 \
  --mode dynamic \
  --output-dir checkpoints/quantized

Static PTQ (with calibration data)

python scripts/quantize_distilled.py \
  --checkpoint-dir checkpoints/distillation \
  --states-file data/replay_states.npy \
  --mode static \
  --calibration-batches 10 \
  --calibration-batch-size 64 \
  --output-dir checkpoints/quantized

Key parameters

Flag	Default	Description
`--pair`	`both`	`A`, `B`, or `both`
`--checkpoint-dir`	—	Dir containing `student_A.pt` / `student_B.pt`
`--student-a-ckpt` / `--student-b-ckpt`	—	Explicit paths (override `--checkpoint-dir`)
`--input-dim`	`8`	Must match distillation
`--output-dim`	`4`	Must match distillation
`--parent-hidden`	`64`	Must match distillation
`--mode`	`dynamic`	`dynamic` (weight-only, no calibration) or `static`
`--dtype`	`qint8`	Quantization dtype
`--backend`	`auto`	`auto`, `x86`, `fbgemm`, `qnnpack`
`--calibration-batches`	`10`	Static mode: number of calibration batches
`--calibration-batch-size`	`64`	Static mode: batch size for calibration
`--states-file`	—	Calibration states (static mode); also used for output comparison
`--n-states`	`1000`	Synthetic calibration states if no file
`--seed`	`42`	RNG seed
`--output-dir`	`checkpoints/quantized`	Output directory

Outputs

checkpoints/quantized/
  student_A_int8.pt        # Quantized model (CPU int8 pickle)
  student_A_int8.pt.json   # QuantizationConfig + timing
  student_B_int8.pt
  student_B_int8.pt.json

5. Stage 3 — Quantization-Aware Training (QAT) — Optional

Script: scripts/qat_distilled.py

When to use QAT instead of PTQ:

Use QAT when PTQ action agreement (reported by validate_quantized.py) falls below your target threshold (e.g. < 90%). QAT adds a short training pass with fake quantization, recovering accuracy at the cost of a few extra minutes.
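
To make "fake quantization" concrete: during QAT the weights stay in float but are rounded to the int8 grid and mapped back on every forward pass, so training sees the rounding error. The sketch below shows that round trip in plain Python, assuming symmetric per-tensor scaling; it illustrates the general technique only and is not the repo's QAT implementation.

```python
def fake_quantize(weights, num_bits=8):
    """Snap floats to a symmetric signed integer grid, then map them back to float."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax  # one float step per integer level
    dequantized = []
    for w in weights:
        q = round(w / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized


weights = [0.4, -1.0, 0.003, 0.25]
dequantized = fake_quantize(weights)
# Rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
```

Small weights (here 0.003) collapse to zero because they fall below half a step, which is one reason agreement can drop after quantization.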

Minimal QAT run

python scripts/qat_distilled.py \
  --checkpoint-dir checkpoints/distillation \
  --input-dim 8 --output-dim 4 --parent-hidden 64 \
  --epochs 5 \
  --n-states 2000 \
  --seed 42 \
  --output-dir checkpoints/qat

The int8 conversion runs by default, so leave out --no-convert to get the converted int8 checkpoint as well; pass --no-convert to save only the float QAT checkpoint.

Key parameters

Flag	Default	Description
`--pair`	`both`	`A`, `B`, or `both`
`--checkpoint-dir`	—	Dir with `parent_<pair>.pt` + `student_<pair>.pt`
`--teacher-a-ckpt` / `--student-a-ckpt`	—	Explicit path overrides
`--input-dim`	`8`	Must match distillation
`--output-dim`	`4`	Must match distillation
`--parent-hidden`	`64`	Must match distillation
`--epochs`	`5`	QAT fine-tuning epochs
`--learning-rate`	`1e-4`	Adam LR (lower than distillation; model is already trained)
`--batch-size`	`32`	Mini-batch size
`--max-grad-norm`	`1.0`	Gradient clipping norm
`--val-fraction`	`0.1`	Validation split
`--loss-fn`	`mse`	`mse` (default for QAT) or `kl`
`--temperature`	`3.0`	Temperature for `kl` mode
`--alpha`	`1.0`	Soft/hard blend for `kl` mode
`--no-convert`	`False`	Skip int8 conversion; save only float QAT checkpoint
`--states-file` / `--n-states` / `--seed`	—	Same semantics as other stages
`--output-dir`	`checkpoints/qat`	Output directory

Outputs

checkpoints/qat/
  student_A_qat.pt           # Float QAT checkpoint
  student_A_qat.pt.json
  student_A_qat_int8.pt      # Converted int8 (same format as PTQ output)
  student_A_qat_int8.pt.json

6. Stage 4 — Crossover + Fine-tuning

Script: scripts/finetune_child.py

Goal: blend two parent state dicts into a child via a crossover strategy, then fine-tune the child against a frozen reference (parent A) using a distillation-style loss.

Minimal synthetic run

python scripts/finetune_child.py \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --crossover-mode weighted \
  --crossover-alpha 0.5 \
  --crossover-seed 42 \
  --n-states 2000 \
  --seed 42 \
  --output-dir checkpoints/finetune

With a replay buffer and YAML overrides

python scripts/finetune_child.py \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --crossover-mode random \
  --crossover-alpha 0.5 \
  --crossover-seed 0 \
  --states-file data/replay_states.npy \
  --config-yaml farm/config/default.yaml \
  --epochs 10 \
  --lr 5e-4 \
  --output-dir checkpoints/finetune

YAML defaults (`farm/config/default.yaml`)

The crossover_child_finetune section provides all fine-tuning defaults:

crossover_child_finetune:
  learning_rate: 0.001
  epochs: 5
  batch_size: 32
  max_grad_norm: 1.0
  val_fraction: 0.1
  seed: null
  loss_fn: kl
  temperature: 3.0
  temp_decay: 1.0          # per-epoch temperature multiplier; 1.0 = no decay
  alpha: 1.0               # soft/hard blend (1.0 = pure KL, 0.0 = pure CE)
  lr_schedule_patience: 0  # ReduceLROnPlateau patience; 0 = disabled
  lr_schedule_factor: 0.5  # LR reduction factor when plateau detected
  quantization_applied: none   # none | ptq_dynamic | ptq_static | qat_float
  optimizer: adam
  optimizer_kwargs: {}
  early_stopping_patience: 0   # 0 = disabled

CLI flags such as --lr, --epochs, --alpha override the YAML values when specified.
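
The override rule can be pictured as a dictionary merge. This is a minimal sketch, assuming that an unspecified CLI flag arrives as `None`; `resolve_config` is a hypothetical name and the real script's argument handling may differ.

```python
from typing import Any, Dict

# Subset of the crossover_child_finetune defaults listed above.
YAML_DEFAULTS: Dict[str, Any] = {
    "learning_rate": 0.001,
    "epochs": 5,
    "alpha": 1.0,
    "temperature": 3.0,
}


def resolve_config(defaults: Dict[str, Any], cli_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the defaults updated with every CLI value that was supplied."""
    resolved = dict(defaults)
    for key, value in cli_overrides.items():
        if value is not None:  # None stands for "flag not given"
            resolved[key] = value
    return resolved


# Equivalent of passing --epochs 10 --lr 5e-4 and leaving --alpha unset.
config = resolve_config(
    YAML_DEFAULTS,
    {"epochs": 10, "learning_rate": 5e-4, "alpha": None},
)
```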

Key parameters

Flag	Default (YAML)	Description
`--input-dim`	`8`	Must match parent architecture
`--output-dim`	`4`	Must match parent architecture
`--hidden-size`	`64`	Must match parent architecture
`--parent-a-ckpt`	(required)	Parent A checkpoint (also the fine-tune teacher)
`--parent-b-ckpt`	(required)	Parent B checkpoint
`--crossover-mode`	(required)	`random`, `layer`, or `weighted`
`--crossover-alpha`	—	Blend/selection coefficient (see below)
`--crossover-seed`	—	RNG seed for `random` mode
`--n-states` / `--states-file`	—	State buffer (same as other stages)
`--config-yaml`	`farm/config/default.yaml`	YAML with `crossover_child_finetune` section
`--lr`	`1e-3`	Adam learning rate
`--epochs`	`5`	Fine-tuning epochs
`--batch-size`	`32`	Mini-batch size
`--max-grad-norm`	`1.0`	Gradient clipping norm
`--val-fraction`	`0.1`	Validation split
`--loss-fn`	`kl`	Distillation loss (`kl` or `mse`)
`--temperature`	`3.0`	Softmax temperature
`--alpha`	`1.0`	Soft/hard blend
`--lr-patience`	`0`	ReduceLROnPlateau patience (0 = off)
`--lr-factor`	`0.5`	LR reduction factor
`--seed`	`null`	Fine-tuning RNG seed (separate from crossover seed)
`--quantization-applied`	`none`	`none`, `ptq_dynamic`, `ptq_static`, or `qat_float`
`--optimizer`	`adam`	`adam`, `adamw`, `sgd`, or `rmsprop`
`--early-stopping-patience`	`0`	Validation-loss patience (0 = off)
`--output-dir`	(required)	Output directory

Crossover modes

Mode	`alpha` meaning	Deterministic?	Notes
`random`	Probability of selecting from parent A per tensor	No (needs `--crossover-seed`)	High diversity; use multiple seeds to estimate variance
`layer`	Ignored	Yes	Even blocks from A, odd blocks from B; structurally coherent
`weighted`	Linear blend weight: `child = alpha*A + (1-alpha)*B`	Yes	Smooth interpolation; `0.5` = midpoint
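
The three modes in the table can be sketched on plain dictionaries of NumPy arrays. This is an illustration under stated assumptions, not the repo's implementation: `crossover` is a hypothetical function, and a "block" is taken to be the tensor-name prefix before the final dot, which may not match how the real code groups layers.

```python
from typing import Dict, Optional

import numpy as np

StateDict = Dict[str, np.ndarray]


def crossover(parent_a: StateDict, parent_b: StateDict, mode: str,
              alpha: float = 0.5, seed: Optional[int] = None) -> StateDict:
    """Build a child state dict from two parents with matching tensor names."""
    if parent_a.keys() != parent_b.keys():
        raise ValueError("parents must share the same tensor names")
    rng = np.random.default_rng(seed)
    # Assumption: block = name prefix before the last dot, in first-seen order.
    blocks = []
    for name in parent_a:
        prefix = name.rsplit(".", 1)[0]
        if prefix not in blocks:
            blocks.append(prefix)
    child: StateDict = {}
    for name in parent_a:
        a, b = parent_a[name], parent_b[name]
        if mode == "weighted":
            child[name] = alpha * a + (1.0 - alpha) * b
        elif mode == "random":
            # Take the whole tensor from A with probability alpha.
            child[name] = (a if rng.random() < alpha else b).copy()
        elif mode == "layer":
            index = blocks.index(name.rsplit(".", 1)[0])
            child[name] = (a if index % 2 == 0 else b).copy()
        else:
            raise ValueError(f"unknown crossover mode: {mode!r}")
    return child


names = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]
parent_a = {n: np.ones(3) for n in names}
parent_b = {n: np.zeros(3) for n in names}

weighted = crossover(parent_a, parent_b, "weighted", alpha=0.5)
layered = crossover(parent_a, parent_b, "layer")
random_1 = crossover(parent_a, parent_b, "random", alpha=0.5, seed=0)
random_2 = crossover(parent_a, parent_b, "random", alpha=0.5, seed=0)
```

With the same seed, the `random` child is reproducible even though the mode itself is stochastic.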

Outputs

checkpoints/finetune/
  child.pt                  # Raw crossover child (pre fine-tune)
  child.pt.json
  child_finetuned.pt        # Fine-tuned child
  child_finetuned.pt.json

7. Stage 5 — Validation

7.1 Distillation quality

Script: scripts/validate_distillation.py

Checks KL divergence, MSE, MAE, cosine similarity, top-k agreement, latency ratio, and robustness slices between parent and student.

python scripts/validate_distillation.py \
  --checkpoint-dir checkpoints/distillation \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --n-states 2000 --seed 42 \
  --report-dir reports/distillation

Key threshold flags (all have sensible defaults; override to tighten):

Flag	Default	Description
`--min-action-agreement`	`0.8`	Minimum top-1 action agreement
`--max-kl-divergence`	`0.1`	Maximum KL divergence
`--max-mse`	`0.01`	Maximum mean-squared error
`--min-cosine-similarity`	`0.9`	Minimum cosine similarity
`--max-latency-ratio`	`2.0`	Maximum student/parent latency ratio
`--min-robustness-action-agreement`	`0.7`	Agreement on noisy/out-of-distribution slices
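
For intuition about what these thresholds measure, here is a minimal NumPy sketch of the closeness metrics on two (N, actions) Q-value arrays. The function names, the softmax temperature of 1.0 used for KL, and the averaging are assumptions; validate_distillation.py may compute each metric differently.

```python
import numpy as np


def softmax(q: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = q / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def closeness_metrics(parent_q: np.ndarray, student_q: np.ndarray) -> dict:
    """Compare two (N, n_actions) Q-value arrays row by row and average."""
    eps = 1e-12
    p, s = softmax(parent_q), softmax(student_q)
    kl = np.sum(p * (np.log(p + eps) - np.log(s + eps)), axis=1)
    diff = parent_q - student_q
    norms = np.linalg.norm(parent_q, axis=1) * np.linalg.norm(student_q, axis=1)
    cosine = np.sum(parent_q * student_q, axis=1) / (norms + eps)
    return {
        "action_agreement": float(np.mean(parent_q.argmax(axis=1) == student_q.argmax(axis=1))),
        "kl_divergence": float(kl.mean()),
        "mse": float(np.mean(diff ** 2)),
        "mae": float(np.mean(np.abs(diff))),
        "cosine_similarity": float(cosine.mean()),
    }


rng = np.random.default_rng(0)
parent_q = rng.standard_normal((256, 4))
student_q = parent_q + 0.01 * rng.standard_normal((256, 4))  # slightly perturbed copy

metrics = closeness_metrics(parent_q, student_q)
identical = closeness_metrics(parent_q, parent_q)
```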

7.2 Quantization fidelity

Script: scripts/validate_quantized.py

Compares float student vs int8 student on agreement, Q-error, latency, and memory.

python scripts/validate_quantized.py \
  --float-dir checkpoints/distillation \
  --quant-dir checkpoints/quantized \
  --n-states 2000 --seed 42 \
  --report-dir reports/quantized

Key threshold flags:

Flag	Default	Description
`--min-action-agreement`	`0.9`	Quantized vs float top-1 agreement
`--max-mean-q-error`	`0.05`	Mean absolute Q-value error
`--min-cosine-similarity`	`0.95`	Cosine similarity
`--max-latency-ratio`	`1.5`	Int8 / float latency ratio

7.3 Recombination quality

Script: scripts/validate_recombination.py

Evaluates the fine-tuned child against both parents.

python scripts/validate_recombination.py \
  --checkpoint-dir checkpoints/finetune \
  --parent-a-ckpt checkpoints/parent_A.pt \
  --parent-b-ckpt checkpoints/parent_B.pt \
  --child-ckpt checkpoints/finetune/child_finetuned.pt \
  --n-states 2000 --seed 42 \
  --include-parent-baseline \
  --report-dir reports/recombination

For quantized children add --child-quantized (and --parent-a-quantized / --parent-b-quantized if parents are also int8).

Key threshold flags:

Flag	Default	Description
`--min-action-agreement`	`0.7`	Child vs parent action agreement
`--max-kl-divergence`	`0.3`	Child vs parent KL
`--max-mse`	`0.05`	Child vs parent MSE
`--min-cosine-similarity`	`0.8`	Child vs parent cosine similarity

“Good enough” heuristic: aim for child-vs-A and child-vs-B top-1 agreement both ≥ 0.7, with neither collapsing to one parent. The primary_metric = min(agreement_A, agreement_B) used by run_crossover_search.py captures this directly.
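
A toy example shows why the minimum is the useful summary. The sketch below assumes greedy actions are already available as integer lists; the helper names are hypothetical and the five-state action lists are made up purely for illustration.

```python
def top1_agreement(actions_x, actions_y):
    """Fraction of states on which two models pick the same greedy action."""
    if len(actions_x) != len(actions_y) or not actions_x:
        raise ValueError("action lists must be non-empty and equally long")
    matches = sum(1 for x, y in zip(actions_x, actions_y) if x == y)
    return matches / len(actions_x)


def primary_metric(child, parent_a, parent_b):
    """min(agreement_A, agreement_B): low whenever the child ignores one parent."""
    return min(top1_agreement(child, parent_a), top1_agreement(child, parent_b))


# Illustrative greedy actions on five states.
parent_a_actions = [0, 1, 2, 3, 0]
parent_b_actions = [0, 1, 2, 0, 1]
balanced_child = [0, 1, 2, 3, 1]          # matches each parent on 4 of 5 states
collapsed_child = list(parent_a_actions)  # a copy of A: 5 of 5 with A, 3 of 5 with B

balanced_score = primary_metric(balanced_child, parent_a_actions, parent_b_actions)
collapsed_score = primary_metric(collapsed_child, parent_a_actions, parent_b_actions)
```

The collapsed child has perfect agreement with A, yet its score is lower than the balanced child's because the minimum is driven by the neglected parent.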

Case-level analysis: for per-state disagreements, logit summaries, worst-k states, and hidden-layer activations see § 14 — Qualitative Error Analysis and scripts/analyze_recombination.py.

Validation report layout

Each script writes a JSON report alongside a human-readable markdown summary. JSON keys of interest:

action_agreement — top-1 agreement between models
oracle_action_agreement — agreement counted only on states where both models are confident
kl_divergence, mse, mae, cosine_similarity — Q-value closeness metrics
latency_ms_median / latency_ratio — speed comparison
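
If you want to gate on a report from your own tooling, a small threshold check is enough. This sketch assumes the keys sit at the top level of the report, which may not hold for the real files; the limits are the validate_quantized.py defaults from Section 7.2 and the report values are placeholders.

```python
import json

# (kind, limit) per key; limits taken from the quantization threshold table.
THRESHOLDS = {
    "action_agreement": ("min", 0.9),
    "cosine_similarity": ("min", 0.95),
    "latency_ratio": ("max", 1.5),
}


def failed_checks(report, thresholds):
    """Return one message per metric that violates its threshold."""
    failures = []
    for key, (kind, limit) in thresholds.items():
        value = report[key]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{key}={value} violates {kind} {limit}")
    return failures


# Placeholder report text; a real report has more keys and may nest them.
report = json.loads(
    '{"action_agreement": 0.93, "cosine_similarity": 0.91, "latency_ratio": 1.2}'
)
failures = failed_checks(report, THRESHOLDS)
```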

8. Optional: Compare Distillation Modes

Script: scripts/compare_distillation_modes.py

Runs hard-only (alpha=0), soft-only (alpha=1), and blended distillation back-to-back with a shared frozen teacher and state buffer so results are directly comparable.

python scripts/compare_distillation_modes.py \
  --seed 42 --epochs 10 --n-states 2000 \
  --json-out reports/distillation_mode_comparison.json

See docs/distillation_soft_label_comparison.md for recorded results and discussion.

9. Parameter Reference

Architecture consistency

The following dimensions must match across all stages:

Parameter	CLI flag (all scripts)	YAML key	Notes
State feature dim	`--input-dim`	—	Parent network input size
Action count	`--output-dim`	—	Parent network output size
Teacher hidden width	`--parent-hidden` / `--hidden-size`	—	Used to reconstruct `BaseQNetwork`
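
A quick cross-stage check can catch dimension drift before a run fails. The sketch below works on already-loaded metadata dictionaries; the key names `input_dim`, `output_dim`, and `hidden_size` are assumptions about what the *.pt.json files contain, so adjust them to the actual schema.

```python
ARCH_KEYS = ("input_dim", "output_dim", "hidden_size")  # assumed metadata keys


def find_mismatches(metadata_by_stage):
    """Compare every stage's architecture fields against the first stage."""
    stages = list(metadata_by_stage)
    reference = metadata_by_stage[stages[0]]
    mismatches = []
    for stage in stages[1:]:
        for key in ARCH_KEYS:
            expected = reference.get(key)
            actual = metadata_by_stage[stage].get(key)
            if actual != expected:
                mismatches.append((stage, key, expected, actual))
    return mismatches


# Illustrative metadata: the fine-tune stage was run with the wrong hidden width.
metadata = {
    "distillation": {"input_dim": 8, "output_dim": 4, "hidden_size": 64},
    "quantized": {"input_dim": 8, "output_dim": 4, "hidden_size": 64},
    "finetune": {"input_dim": 8, "output_dim": 4, "hidden_size": 128},
}
mismatches = find_mismatches(metadata)
```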

Seed inventory

Stage	Flag	Purpose
Distillation	`--seed`	State generation + training RNG
Quantization (PTQ)	`--seed`	Synthetic calibration state generation
QAT	`--seed`	Synthetic state generation + training RNG
Crossover	`--crossover-seed`	Per-tensor selection RNG (`random` mode)
Fine-tuning	`--seed`	Training batch shuffle + dropout

`crossover_child_finetune` YAML keys

Key	Type	Safe range	Description
`learning_rate`	float	`5e-5` – `5e-3`	Adam LR
`epochs`	int	`3` – `20`	Training epochs
`batch_size`	int	`16` – `128`	Mini-batch size
`max_grad_norm`	float	`0.5` – `5.0`; `0` = off	Gradient clipping
`val_fraction`	float	`0.05` – `0.2`	Held-out validation fraction
`seed`	int or null	any	`null` = non-deterministic
`loss_fn`	`kl` / `mse`	—	`kl` preferred for soft distillation
`temperature`	float	`1.0` – `10.0`	Softmax temperature for `kl` loss
`temp_decay`	float	`0.9` – `1.0`	Per-epoch temperature multiplier
`alpha`	float	`0.0` – `1.0`	Soft/hard blend; `1.0` = pure soft
`lr_schedule_patience`	int	`0` – `5`	ReduceLROnPlateau; `0` = off
`lr_schedule_factor`	float	`0.1` – `0.9`	LR multiplier on plateau
`quantization_applied`	string	see below	Triggers QAT-aware fine-tune path
`optimizer`	string	`adam`, `adamw`, `sgd`, `rmsprop`	Optimizer choice
`early_stopping_patience`	int	`0` – `10`	Val-loss patience; `0` = off

quantization_applied values: none (default, float path), ptq_dynamic, ptq_static, qat_float. When the value is not none, FineTuner replaces Linear layers with WeightOnlyFakeQuantLinear; call convert() and then save_quantized() after finetune() to produce int8 output.

10. Tuning Guide

State data choice

Use a real replay buffer whenever possible. Synthetic standard-normal states cover the full input range but may not reflect the distribution your agents encounter at inference time. Validation metrics on synthetic states can be optimistic.
The .npy file should be float32, shape (N, input_dim). Prefer N ≥ 2 000 for stable statistics; 10 000+ for final validation.
Keep the same states file across all stages (or at least the same --seed) so that reported metrics are on a common distribution.

Distillation

Temperature (--temperature, default 3.0): higher values flatten the teacher’s distribution, making inter-action confidence ordering more visible. Range 2.0–6.0 is typical; reduce toward 1.0 if the student converges too slowly.
Alpha (--alpha, default 1.0): 1.0 = pure KL soft loss; 0.0 = pure cross-entropy on the argmax. Start with 1.0; blend toward 0.5 if action agreement is high but hard-label accuracy is poor.
Learning rate instability: if training loss oscillates or diverges, lower --lr (try 5e-4) and ensure --max-grad-norm 1.0 is in effect.
Low agreement after training: increase --epochs (try 20), or provide a larger / more representative state buffer.
--loss-fn mse: simpler objective; useful for debugging, but KL is generally better for Q-value distributions.
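
The interaction between temperature and alpha is easier to see on a single state. The following plain-Python sketch blends a temperature-softened KL term with a cross-entropy term on the teacher's argmax; it is an assumption about the general shape of the loss, and details such as T² scaling or batch reduction in the real trainer are not shown in this runbook.

```python
import math


def softmax(values, temperature=1.0):
    """Softmax over a list of Q-values, softened by the temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_q, student_q, temperature=3.0, alpha=1.0):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * CE on teacher argmax."""
    p = softmax(teacher_q, temperature)
    q = softmax(student_q, temperature)
    soft = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    hard_target = max(range(len(teacher_q)), key=lambda i: teacher_q[i])
    hard = -math.log(softmax(student_q)[hard_target])
    return alpha * soft + (1.0 - alpha) * hard


teacher_q = [1.0, 2.0, 0.5, -1.0]
student_q = [0.5, 1.0, 2.0, 0.0]

soft_only = distillation_loss(teacher_q, student_q, alpha=1.0)
hard_only = distillation_loss(teacher_q, student_q, alpha=0.0)
self_loss = distillation_loss(teacher_q, teacher_q, alpha=1.0)

# Higher temperature flattens the teacher distribution.
peak_at_t1 = max(softmax(teacher_q, 1.0))
peak_at_t3 = max(softmax(teacher_q, 3.0))
```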

Post-Training Quantization

Try dynamic PTQ first — zero training cost, typically ≥ 90 % action agreement. Only invest in static PTQ or QAT if validate_quantized.py reports agreement below your target.
Static PTQ: requires calibration data that reflects real input distribution; use --states-file. Use --calibration-batches 10–50 and --calibration-batch-size 64–256.
Backend choice: on Intel CPUs fbgemm is often fastest; on ARM/mobile use qnnpack; auto selects automatically. Benchmark with --throughput-batch-size in validate_quantized.py.
QAT vs PTQ decision rule: if PTQ action_agreement < 0.90, try QAT with --epochs 5 --learning-rate 1e-4. Use --loss-fn mse for QAT (default) unless the teacher distribution is very soft.

Crossover

weighted mode is the smoothest starting point: --crossover-alpha 0.5 gives a midpoint blend. Move alpha toward 0.3 or 0.7 to bias toward one parent.
random mode produces the most diverse children but also the highest variance across seeds. Run three separate invocations with --crossover-seed 0, 1, and 2 to gauge variance before committing to one.
layer mode preserves structural coherence (each Linear + LayerNorm pair from the same parent) at the cost of diversity — only two possible children per pair. Useful when the network is sensitive to feature-scaling mismatches.
If child collapses to one parent: primary_metric = min(agreement_A, agreement_B) will be low. Try weighted at 0.5, or random with a different seed.
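
For the multi-seed check in random mode, the separate invocations can be generated rather than typed by hand. This sketch only builds and prints the command lines using flags documented in Section 6; the per-seed output directory naming is a hypothetical convention.

```python
import shlex


def finetune_command(crossover_seed: int) -> list:
    """Build one finetune_child.py invocation for a given crossover seed."""
    return [
        "python", "scripts/finetune_child.py",
        "--parent-a-ckpt", "checkpoints/parent_A.pt",
        "--parent-b-ckpt", "checkpoints/parent_B.pt",
        "--crossover-mode", "random",
        "--crossover-alpha", "0.5",
        "--crossover-seed", str(crossover_seed),
        "--n-states", "2000",
        "--seed", "42",  # fine-tuning seed stays fixed so only the crossover varies
        # Hypothetical layout: one output directory per crossover seed.
        "--output-dir", f"checkpoints/finetune_seed{crossover_seed}",
    ]


commands = [finetune_command(seed) for seed in (0, 1, 2)]
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the repo root once the printed commands look right.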

Fine-tuning

Reference teacher: finetune_child.py always uses parent A as the fine-tune teacher. This biases the child toward A’s behaviour; if you want a more balanced child, check parent B agreement explicitly with validate_recombination.py --include-parent-baseline.
Loss function: kl with temperature 3.0 and alpha 1.0 mirrors the distillation loss and tends to produce soft, calibrated Q-values. Use mse for a simpler target.
LR schedule: enable --lr-patience 3 --lr-factor 0.5 if validation loss plateaus early. This is disabled by default to keep runs short.
Early stopping: --early-stopping-patience 5 prevents overfitting on small state buffers; safe to enable for production runs.
quantization_applied: set to ptq_dynamic, ptq_static, or qat_float only when you intend to produce an int8 fine-tuned child. Requires calling FineTuner.convert() + save_quantized() programmatically after the script; for a pure float32 child leave as none.
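
The patience semantics behind --early-stopping-patience can be sketched in a few lines. This assumes that any strict decrease in validation loss counts as an improvement and resets the counter; the real FineTuner may apply a minimum delta or different bookkeeping.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    if patience <= 0:  # 0 = disabled
        return None
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Illustrative validation curve: best value at epoch 2, no improvement afterwards.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]

stop_with_patience_5 = early_stop_epoch(val_losses, patience=5)
stop_with_patience_3 = early_stop_epoch(val_losses, patience=3)
stop_disabled = early_stop_epoch(val_losses, patience=0)
```

Note that with --epochs 5 and a patience of 5 the stop condition can never trigger, so early stopping only matters when the epoch budget exceeds the patience.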

Validation

Dimension/architecture mismatches are the most common failure mode. Always pass the same --input-dim, --output-dim, and --parent-hidden (or --hidden-size) that were used at distillation time. The checkpoint JSON metadata (*.pt.json) records these for reference.
Agreement vs oracle agreement: action_agreement counts top-1 matches; oracle_action_agreement counts matches only when both models are confident. Low oracle agreement with high action agreement usually indicates one model is uncertain overall.
Threshold calibration: default thresholds in the validation scripts are conservative. After a successful baseline run you can tighten --min-action-agreement toward 0.85–0.95 for production gates.
Using the same state buffer as training: evaluation metrics will be optimistically inflated if the same states are used for both training and validation. Prefer a held-out test split by passing a separate .npy file to the validation scripts.

11. Copy-Paste Recipes

Recipe A — Minimal synthetic run (no parent checkpoints required for architecture test)

# 1. Distil (synthetic states, no real parents — useful for smoke test)
python scripts/run_distillation.py \
  --n-states 2000 --seed 42 \
  --epochs 5 \
  --output-dir checkpoints/distillation

# 2. PTQ
python scripts/quantize_distilled.py \
  --checkpoint-dir checkpoints/distillation \
  --n-states 2000 --seed 42 \
  --mode dynamic \
  --output-dir checkpoints/quantized

# 3. Validate distillation
python scripts/validate_distillation.py \
  --checkpoint-dir checkpoints/distillation \
  --n-states 2000 --seed 42 \
  --report-dir reports/distillation

# 4. Validate quantization
python scripts/validate_quantized.py \
  --float-dir checkpoints/distillation \
  --quant-dir checkpoints/quantized \
  --n-states 2000 --seed 42 \
  --report-dir reports/quantized

# 5. Crossover + fine-tune
python scripts/finetune_child.py \
  --parent-a-ckpt checkpoints/distillation/student_A.pt \
  --parent-b-ckpt checkpoints/distillation/student_B.pt \
  --crossover-mode weighted --crossover-alpha 0.5 \
  --n-states 2000 --seed 42 \
  --output-dir checkpoints/finetune

# 6. Validate recombination
python scripts/validate_recombination.py \
  --parent-a-ckpt checkpoints/distillation/student_A.pt \
  --parent-b-ckpt checkpoints/distillation/student_B.pt \
  --child-ckpt checkpoints/finetune/child_finetuned.pt \
  --n-states 2000 --seed 42 \
  --report-dir reports/recombination

Recipe B — Full run with real parent checkpoints and replay buffer

PARENTS=checkpoints/parents
STATES=data/replay_states.npy   # shape (N, 8) float32
OUT=checkpoints/run1

# 1. Distil
python scripts/run_distillation.py \
  --parent-a-ckpt $PARENTS/parent_A.pt \
  --parent-b-ckpt $PARENTS/parent_B.pt \
  --states-file $STATES --seed 42 \
  --temperature 3.0 --alpha 1.0 \
  --epochs 20 --lr 1e-3 \
  --output-dir $OUT/distillation

# 2. Validate distillation
python scripts/validate_distillation.py \
  --checkpoint-dir $OUT/distillation \
  --parent-a-ckpt $PARENTS/parent_A.pt \
  --parent-b-ckpt $PARENTS/parent_B.pt \
  --states-file $STATES --seed 42 \
  --report-dir $OUT/reports/distillation

# 3. PTQ (try dynamic first)
python scripts/quantize_distilled.py \
  --checkpoint-dir $OUT/distillation \
  --states-file $STATES --seed 42 \
  --mode dynamic \
  --output-dir $OUT/quantized

# 4. Validate quantization
python scripts/validate_quantized.py \
  --float-dir $OUT/distillation \
  --quant-dir $OUT/quantized \
  --states-file $STATES --seed 42 \
  --report-dir $OUT/reports/quantized

# (Optional) If PTQ agreement < 0.90, run QAT instead:
# python scripts/qat_distilled.py \
#   --checkpoint-dir $OUT/distillation \
#   --states-file $STATES --seed 42 \
#   --epochs 5 --learning-rate 1e-4 \
#   --output-dir $OUT/qat

# 5. Crossover + fine-tune (using float parents)
python scripts/finetune_child.py \
  --parent-a-ckpt $PARENTS/parent_A.pt \
  --parent-b-ckpt $PARENTS/parent_B.pt \
  --crossover-mode weighted --crossover-alpha 0.5 \
  --crossover-seed 42 \
  --states-file $STATES --seed 42 \
  --epochs 10 --lr 5e-4 \
  --lr-patience 3 --early-stopping-patience 5 \
  --output-dir $OUT/finetune

# 6. Validate recombination
python scripts/validate_recombination.py \
  --parent-a-ckpt $PARENTS/parent_A.pt \
  --parent-b-ckpt $PARENTS/parent_B.pt \
  --child-ckpt $OUT/finetune/child_finetuned.pt \
  --states-file $STATES --seed 42 \
  --include-parent-baseline \
  --report-dir $OUT/reports/recombination

12. Generalization: Holdout & Domain-Shift Evaluation

Script: scripts/eval_generalization.py

Standard validation metrics (Sections 7.1–7.3) are measured on a single state buffer that may overlap with calibration or training data. For publication-grade generalization claims, you need:

A held-out test split that was never used for training or calibration.
An optional domain-shift evaluation generated by perturbing the holdout states (for example, sensor noise or input scaling).

eval_generalization.py automates both steps by:

Splitting the state buffer into an in-distribution (ID) and holdout subset.
Optionally perturbing the holdout set with Gaussian noise or input scaling.
Running RecombinationEvaluator on each subset and writing a per-set JSON report plus a combined generalization_summary.json.

At present, the CLI supports perturbation-based domain shift only; it does not expose a flag for loading a separate shifted .npy states file.

Library helpers

The split and perturbation logic is available as standalone functions in farm.core.decision.training.holdout_utils:

Function	Purpose
`split_replay_buffer(states, holdout_fraction, seed)`	Random train/holdout split
`apply_gaussian_noise(states, std, seed)`	Add i.i.d. Gaussian noise
`apply_input_scaling(states, scale_factor)`	Multiply all features by a scalar
`make_shifted_states(states, shift_type, **kwargs)`	Factory dispatcher for the above

All helpers are also re-exported from farm.core.decision.training.
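
To show what these helpers do conceptually, here are NumPy stand-ins with deliberately different names. They are a sketch under assumptions (shuffle-then-slice split, zero-mean noise, dtype preserved) and are not the code in farm.core.decision.training.holdout_utils, whose return order and edge-case handling may differ.

```python
import numpy as np


def split_states(states: np.ndarray, holdout_fraction: float, seed: int):
    """Shuffle row indices and return (in_distribution, holdout) subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(states))
    n_holdout = int(round(len(states) * holdout_fraction))
    return states[order[n_holdout:]], states[order[:n_holdout]]


def add_gaussian_noise(states: np.ndarray, std: float, seed: int) -> np.ndarray:
    """Add i.i.d. zero-mean Gaussian noise to every feature."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=states.shape)
    return (states + noise).astype(states.dtype)


def scale_inputs(states: np.ndarray, scale_factor: float) -> np.ndarray:
    """Multiply all features by one scalar."""
    return (states * scale_factor).astype(states.dtype)


# Rows are unique, which makes it easy to confirm the two subsets do not overlap.
states = np.arange(2000 * 8, dtype=np.float32).reshape(2000, 8)
in_dist, holdout = split_states(states, holdout_fraction=0.2, seed=42)
noisy = add_gaussian_noise(holdout, std=0.1, seed=0)
scaled = scale_inputs(holdout, scale_factor=2.0)
```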

Minimal synthetic run (synthetic states, existing checkpoints required)

This example uses synthetic evaluation states, but it still requires trained parent A, parent B, and child checkpoints from the Recipe A workflow in Section 11. Replace paths as needed.

python scripts/eval_generalization.py \
  --parent-a-ckpt checkpoints/distillation/student_A.pt \
  --parent-b-ckpt checkpoints/distillation/student_B.pt \
  --child-ckpt    checkpoints/finetune/child_finetuned.pt \
  --n-states 2000 --seed 42 \
  --holdout-fraction 0.2 \
  --report-dir reports/generalization

Output:

reports/generalization/
  id_report.json           # in-distribution split report
  holdout_report.json      # held-out split report
  generalization_summary.json

With a real replay buffer

python scripts/eval_generalization.py \
  --parent-a-ckpt checkpoints/parents/parent_A.pt \
  --parent-b-ckpt checkpoints/parents/parent_B.pt \
  --child-ckpt    checkpoints/finetune/child_finetuned.pt \
  --states-file   data/replay_states.npy \
  --holdout-fraction 0.2 \
  --report-dir    reports/generalization

With Gaussian-noise domain shift

python scripts/eval_generalization.py \
  --parent-a-ckpt checkpoints/parents/parent_A.pt \
  --parent-b-ckpt checkpoints/parents/parent_B.pt \
  --child-ckpt    checkpoints/finetune/child_finetuned.pt \
  --states-file   data/replay_states.npy \
  --holdout-fraction 0.2 \
  --shift-type    gaussian_noise \
  --shift-std     0.1 \
  --shift-seed    0 \
  --report-dir    reports/generalization

Output adds reports/generalization/shifted_report.json.

With input-scaling domain shift

python scripts/eval_generalization.py \
  --parent-a-ckpt checkpoints/parents/parent_A.pt \
  --parent-b-ckpt checkpoints/parents/parent_B.pt \
  --child-ckpt    checkpoints/finetune/child_finetuned.pt \
  --states-file   data/replay_states.npy \
  --shift-type    input_scaling \
  --shift-scale-factor 2.0 \
  --report-dir    reports/generalization

Key flags

Flag	Default	Description
`--holdout-fraction`	`0.2`	Fraction of states reserved for holdout
`--no-shuffle`	off	Skip shuffle before split (for pre-randomised buffers)
`--shift-type`	—	`gaussian_noise` or `input_scaling`; omit to skip shifted eval
`--shift-std`	`0.1`	Gaussian noise standard deviation
`--shift-scale-factor`	`2.0`	Input scaling multiplier
`--shift-seed`	`0`	Noise RNG seed
`--report-only`	off	Write reports without applying pass/fail thresholds

Reading the `generalization_summary.json`

{
  "overall_passed": true,
  "report_only": false,
  "holdout_fraction": 0.2,
  "shift_type": "gaussian_noise",
  "sets": {
    "in_distribution": {
      "child_agrees_with_parent_a": 0.82,
      "child_agrees_with_parent_b": 0.79,
      "oracle_agreement": 0.91,
      "n_states": 1600,
      "passed": true
    },
    "holdout": {
      "child_agrees_with_parent_a": 0.80,
      "child_agrees_with_parent_b": 0.77,
      "oracle_agreement": 0.89,
      "n_states": 400,
      "passed": true
    },
    "shifted": {
      "child_agrees_with_parent_a": 0.73,
      "child_agrees_with_parent_b": 0.70,
      "oracle_agreement": 0.84,
      "n_states": 400,
      "passed": true,
      "shift_type": "gaussian_noise"
    }
  }
}

Treat a drop as meaningful when holdout or shifted agreement falls more than ~5 percentage points (pp) below the ID score. If this happens, consider:

Training on a larger or more diverse replay buffer.
Increasing the holdout fraction to detect over-fitting earlier in development.
Tuning the crossover alpha or fine-tuning LR to reduce ID–holdout gap.
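
The ~5 pp rule can be applied mechanically to the summary. The sketch below uses the agreement values from the example generalization_summary.json above; `agreement_gaps` is a hypothetical helper, and a set can be flagged here even when its own `passed` field is true, because the rule compares against the ID score rather than the absolute thresholds.

```python
import json

# Agreement values copied from the example summary above (other fields omitted).
SUMMARY_JSON = """
{
  "sets": {
    "in_distribution": {"child_agrees_with_parent_a": 0.82, "child_agrees_with_parent_b": 0.79},
    "holdout":         {"child_agrees_with_parent_a": 0.80, "child_agrees_with_parent_b": 0.77},
    "shifted":         {"child_agrees_with_parent_a": 0.73, "child_agrees_with_parent_b": 0.70}
  }
}
"""

AGREEMENT_KEYS = ("child_agrees_with_parent_a", "child_agrees_with_parent_b")


def agreement_gaps(summary, max_drop_pp=5.0):
    """Return {(set_name, key): drop_in_pp} for drops larger than max_drop_pp."""
    sets = summary["sets"]
    baseline = sets["in_distribution"]
    flagged = {}
    for name, metrics in sets.items():
        if name == "in_distribution":
            continue
        for key in AGREEMENT_KEYS:
            drop_pp = (baseline[key] - metrics[key]) * 100.0
            if drop_pp > max_drop_pp:
                flagged[(name, key)] = round(drop_pp, 1)
    return flagged


flagged = agreement_gaps(json.loads(SUMMARY_JSON))
```

On the example values the holdout set drops by 2 pp and is not flagged, while the shifted set drops by 9 pp against both parents.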

13. Publication Ablations

Script: scripts/run_recombination_ablation.py

For reproducible paper tables and CI-style regression of the full pipeline, use the unified ablation runner. A single invocation sweeps multiple conditions (e.g. distill-only, distill+quantize, or the full pipeline) across a list of seeds and writes every result into a structured results/ tree together with a consolidated CSV and Markdown summary table. If you provide a shared states_file, that same state buffer is reused across seeds and conditions. If states_file is omitted, the runner generates synthetic states per seed, so cross-seed results are not directly comparable unless you supply a common state buffer.

Quick start (no config file needed)

# Dry-run: validate plan, write stub summary, no training
python scripts/run_recombination_ablation.py --smoke-test --dry-run

# Smoke-test: tiny synthetic run (2 seeds × 3 conditions, 50 states, 2 epochs)
python scripts/run_recombination_ablation.py --smoke-test --results-dir /tmp/ablation_smoke

Full run from a config file

python scripts/run_recombination_ablation.py --config ablation.yaml

The config file is YAML (recommended) or JSON. A minimal example:

seeds: [0, 1, 2]
n_states: 2000
states_file: ""           # leave empty to synthesise per-seed (supply a .npy path for comparable cross-seed results)
input_dim: 8
output_dim: 4
hidden_size: 64
results_dir: results/ablation

conditions:
  - name: distill_only
    stages: [distill]
  - name: distill_quantize
    stages: [distill, quantize]
  - name: full_pipeline
    stages: [distill, quantize, crossover, compare]

distillation:
  epochs: 20
  temperature: 3.0
  alpha: 1.0
  lr: 0.001
  batch_size: 32

quantization:
  mode: dynamic

crossover:
  mode: weighted
  alpha: 0.5

comparison:
  report_only: true

Output layout

results/ablation/
  distill_only/
    seed_0/student_A.pt  student_B.pt
    seed_1/...
    seed_2/...
  distill_quantize/
    seed_0/student_A.pt  student_B.pt  student_A_int8.pt  student_B_int8.pt
    ...
  full_pipeline/
    seed_0/student_A.pt  student_B.pt  student_A_int8.pt  student_B_int8.pt
             child_finetuned.pt  compare_child_vs_students.json
    ...
  ablation_summary.csv          ← consolidated table (paste into spreadsheet)
  ablation_summary.md           ← Markdown version (paste into GitHub issues)

Per-condition stage overrides

Each condition can override any global distillation / quantization / crossover / comparison setting:

conditions:
  - name: high_temp_distill
    stages: [distill, crossover, compare]
    distillation:
      temperature: 6.0   # overrides global temperature: 3.0
      epochs: 30

Valid stages

Stage	What it runs
`distill`	`DistillationTrainer` for both A and B pairs; writes `student_A.pt`, `student_B.pt`
`quantize`	`PostTrainingQuantizer` on both students; writes `student_A_int8.pt`, `student_B_int8.pt`
`crossover`	When `quantize` is included, int8 parents are loaded and dequantized to float weights, then `crossover_quantized_state_dict` blends them into a float child; otherwise float `student_*.pt` parents are blended. `FineTuner` always uses float student A as KD teacher. Writes `child_finetuned.pt`.
`compare`	`RecombinationEvaluator` (float child vs float or int8 parents matching the pipeline); writes `compare_child_vs_students.json`

Stages are always applied in the order listed above regardless of declaration order in the config. Parse-time rules: quantize and crossover require distill; compare requires crossover (there must be a child to score).
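The ordering and dependency rules above can be expressed in a few lines. This is a minimal sketch of the rules as stated here, not the runner's parser; `normalize_stages`, `CANONICAL_ORDER`, and `REQUIRES` are hypothetical names.

```python
CANONICAL_ORDER = ["distill", "quantize", "crossover", "compare"]

# Each stage maps to the stage it requires, per the parse-time rules above.
REQUIRES = {"quantize": "distill", "crossover": "distill", "compare": "crossover"}


def normalize_stages(stages):
    """Validate a condition's stage list and return it in canonical execution order."""
    unknown = [s for s in stages if s not in CANONICAL_ORDER]
    if unknown:
        raise ValueError(f"unknown stages: {unknown}")
    for stage in stages:
        needed = REQUIRES.get(stage)
        if needed is not None and needed not in stages:
            raise ValueError(f"stage '{stage}' requires '{needed}'")
    # Declaration order is ignored; stages always run in canonical order.
    return [s for s in CANONICAL_ORDER if s in stages]


print(normalize_stages(["compare", "crossover", "distill"]))
# ['distill', 'crossover', 'compare']
```

A list such as `["quantize"]` or `["distill", "compare"]` raises `ValueError` in this sketch, mirroring the parse-time rejection described above.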

Dry-run mode

python scripts/run_recombination_ablation.py --config ablation.yaml --dry-run

Prints the full execution plan (conditions × seeds × stages × directories) and writes a stub ablation_summary.md / ablation_summary.csv without running any training. Use this to verify the config before a long run.
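To get a feel for the size of a plan before launching it, the conditions × seeds grid can be enumerated directly from the config values. The sketch below assumes the `<results_dir>/<condition>/seed_<n>` layout shown under Output layout; `execution_plan` is a hypothetical helper, not part of the runner.

```python
from pathlib import PurePosixPath


def execution_plan(results_dir, conditions, seeds):
    """List (condition, seed, stages, run_dir) tuples for every planned run."""
    plan = []
    for cond in conditions:
        for seed in seeds:
            run_dir = PurePosixPath(results_dir) / cond["name"] / f"seed_{seed}"
            plan.append((cond["name"], seed, list(cond["stages"]), str(run_dir)))
    return plan


conditions = [
    {"name": "distill_only", "stages": ["distill"]},
    {"name": "full_pipeline", "stages": ["distill", "quantize", "crossover", "compare"]},
]
plan = execution_plan("results/ablation", conditions, seeds=[0, 1, 2])
for name, seed, stages, run_dir in plan:
    print(f"{name:15s} seed={seed} stages={stages} -> {run_dir}")
```

The plan length is simply the number of conditions times the number of seeds, which is a quick sanity check against the dry-run output.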

Using a shared real replay buffer

Set states_file in the config to a .npy file of shape (N, input_dim) float32. All seeds and conditions will use the same state file, ensuring metrics are comparable across the ablation.

states_file: data/replay_states.npy
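Before a long ablation it can be worth confirming that the shared buffer really has the required shape and dtype. The following is a small sketch using NumPy; `check_states` is a hypothetical helper, and the default `input_dim=8` simply mirrors the default architecture.

```python
import numpy as np


def check_states(path, input_dim=8):
    """Load a state buffer and verify it is an (N, input_dim) float32 array."""
    states = np.load(path)
    if states.ndim != 2 or states.shape[1] != input_dim:
        raise ValueError(f"expected shape (N, {input_dim}), got {states.shape}")
    if states.dtype != np.float32:
        raise ValueError(f"expected float32, got {states.dtype}")
    return states
```

Calling `check_states("data/replay_states.npy")` either returns the array or raises `ValueError` with the offending shape or dtype. A float64 buffer can be converted with `states.astype(np.float32)` and saved again.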

Reading the summary table

The Markdown summary table (ablation_summary.md) contains one row per (condition, seed) pair. Key columns:

Column	Meaning
`child_vs_ref_a_agreement`	Top-1 action agreement of child vs parent A (float student, or int8 checkpoint when the condition includes `quantize`)
`child_vs_ref_b_agreement`	Top-1 action agreement of child vs parent B (same rule)
`oracle_agreement`	Fraction where child matches at least one reference
`elapsed_s`	Wall-clock seconds for the (condition, seed) run

child_vs_ref_*_agreement columns are populated only when the compare stage is included. Conditions without a compare stage show n/a.
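For paper tables you usually want one row per condition rather than one per (condition, seed). The sketch below aggregates a metric across seeds using only the standard library. It assumes the CSV has a `condition` column and uses the literal `n/a` for missing values, as the Markdown table does; check the header of your ablation_summary.csv before relying on it. The sample rows are made up for illustration.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(lines, metric="oracle_agreement"):
    """Return {condition: (mean, sample_std)} for a metric, skipping n/a rows."""
    by_condition = defaultdict(list)
    for row in csv.DictReader(lines):
        value = row.get(metric) or "n/a"
        if value == "n/a":
            continue  # condition had no compare stage
        by_condition[row["condition"]].append(float(value))
    return {
        cond: (statistics.mean(vals), statistics.stdev(vals) if len(vals) > 1 else 0.0)
        for cond, vals in by_condition.items()
    }


# Made-up rows standing in for ablation_summary.csv.
sample = io.StringIO(
    "condition,seed,oracle_agreement,elapsed_s\n"
    "distill_only,0,n/a,1.0\n"
    "full_pipeline,0,0.80,2.0\n"
    "full_pipeline,1,0.90,2.0\n"
)
summary = summarize(sample)
for cond, (mean, std) in summary.items():
    print(f"{cond}: {mean:.3f} ± {std:.3f}")
```

For a real run, open the file with `open(path, newline="")` and pass the handle to `summarize`.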

14. Qualitative Error Analysis for Recombined Networks

Script: scripts/analyze_recombination.py
Python API: farm.core.decision.training.recombination_analysis

The aggregate fidelity report from validate_recombination.py (§ 7.3) shows mean agreement across all states. For publication or debugging you often need case-level insight: which states does the child get wrong, and is the disagreement systematic? analyze_recombination.py provides this.

What it produces

Output	Description
`disagreements.csv`	One row per evaluation state. Columns: actions, agreement flags, per-state KL / MSE / cosine similarity, top-k mismatch flags.
`disagreements.json`	Same records in JSON with summary counts; includes raw logits when `--include-logits` is set.
`worst_<k>_states.json`	The k states with the largest errors, sorted by the chosen criterion.
`<activations>.npy`	(Optional) NumPy array of shape `(N, activation_dim)` — hidden-layer activations for a memory-bounded probe set.

Minimal run

python scripts/analyze_recombination.py \
  --checkpoint-dir checkpoints/finetune \
  --parent-a-ckpt  checkpoints/parent_A.pt \
  --parent-b-ckpt  checkpoints/parent_B.pt \
  --child-ckpt     checkpoints/finetune/child_finetuned.pt \
  --states-file    data/replay_states.npy \
  --output-dir     reports/analysis

With logits, worst-10 states, and activation export

python scripts/analyze_recombination.py \
  --checkpoint-dir      checkpoints/finetune \
  --states-file         data/replay_states.npy \
  --include-logits \
  --worst-k             10 \
  --worst-k-criterion   max_kl \
  --activations-out     reports/analysis/child_activations.npy \
  --activation-layer-index 4 \
  --activation-max-states  500 \
  --output-dir          reports/analysis

--activation-layer-index selects a sub-module by its index in list(model.modules()). For BaseQNetwork:

Index	Layer
4	First hidden ReLU (post LayerNorm)
8	Second hidden ReLU (post LayerNorm)

Python API

import numpy as np
from farm.core.decision.training.recombination_analysis import (
    extract_disagreements,
    worst_k_states,
    export_disagreements_csv,
    export_disagreements_json,
    extract_activations,
)

states = np.load("data/replay_states.npy")

# parent_a, parent_b, and child are BaseQNetwork instances with their checkpoints already loaded.
records = extract_disagreements(
    parent_a, parent_b, child, states,
    include_logits=True,
    k_values=[1, 2, 3],
)

# Worst-10 states by maximum KL divergence across the two parents
worst = worst_k_states(records, k=10, criterion="max_kl")

export_disagreements_csv(records, "reports/analysis/disagreements.csv")
export_disagreements_json(records, "reports/analysis/disagreements.json")

# Memory-bounded activation export (first hidden ReLU, max 500 states)
acts = extract_activations(child, states, layer_index=4, max_states=500)
np.save("reports/analysis/child_activations.npy", acts)

Worst-k criteria

Criterion	Sorts by
`max_kl` (default)	`max(KL_vs_A, KL_vs_B)`
`kl_parent_a`	KL divergence vs parent A
`kl_parent_b`	KL divergence vs parent B
`max_mse`	`max(MSE_vs_A, MSE_vs_B)`
`mse_parent_a`	MSE vs parent A
`mse_parent_b`	MSE vs parent B

KL columns: kl_child_vs_parent_a / _b are KL(parent ‖ child) over action softmaxes (parent as the reference distribution). The field names are historical, so take care when comparing against tools that report KL(child ‖ parent).
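Because KL divergence is asymmetric, the direction matters when comparing numbers across tools. The sketch below computes both directions for a single state using plain softmax over logits. It is a standalone illustration with made-up logits, not the library's implementation, which may differ in details such as temperature or numerical smoothing.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p ‖ q) for two discrete distributions; p is the reference."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Made-up logits for one state with four actions.
parent = softmax([2.0, 0.5, 0.1, -1.0])
child = softmax([1.0, 1.2, 0.0, -0.5])

kl_parent_child = kl(parent, child)  # direction reported in the kl_child_vs_parent_* columns
kl_child_parent = kl(child, parent)  # the reverse direction some other tools report
print(kl_parent_child, kl_child_parent)
```

The two printed values differ, which is why a KL figure from another tool should only be compared after confirming which distribution it treats as the reference.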

Integration with validate_recombination.py

Run validate_recombination.py first to check that aggregate fidelity meets thresholds, then run analyze_recombination.py to drill into problem states. Both share the same architecture flags and checkpoint conventions.

15. Multi-Generation Crossover Search

Script: scripts/run_multi_gen_search.py
Python API: farm.core.decision.training.crossover_search.run_multi_generation_search, GenerationConfig

This mode runs farm.core.decision.training.crossover_search.run_crossover_search repeatedly. After each generation, the best child checkpoint becomes parent A for the next generation; parent B is chosen from the leaderboard according to selection_strategy (see below). Optional Gaussian mutation perturbs promoted parents’ weights before the next generation’s crossovers.

Semantics (parent selection)

Role	Rule
Parent A (next gen)	Always the globally best child of the current generation (highest `primary_metric` on the full manifest).
Parent B under `selection_strategy="best"`	The rank-2 child in the sorted leaderboard within the first `keep_top_k` entries when at least two distinct children exist there. If `keep_top_k` is `1`, or only one child exists in that prefix, parent B falls back to the original parent B from generation 0.
Parent B under `"best_vs_original"`	Always the original parent B (same in-memory / checkpoint object as at start).
Mutation	Applied only to loaded child checkpoints promoted as parents. The original parent B reference is never mutated when it is reused as parent B.
Lineage	`lineage.json` lists every child from every generation; `keep_top_k` does not trim stored lineage—it only affects which leaderboard prefix is used to pick rank-2 for parent B.
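The selection rules in the table can be summarised as a small function. This is a sketch of the rules as described here, not the code in crossover_search; `pick_next_parents` is a hypothetical name, and the leaderboard is assumed to be a list of child identifiers sorted best-first by `primary_metric`.

```python
def pick_next_parents(leaderboard, original_parent_b, selection_strategy="best", keep_top_k=2):
    """Return (parent_a, parent_b) for the next generation.

    leaderboard: child identifiers sorted best-first (assumed non-empty).
    original_parent_b: the generation-0 parent B, used as the fallback.
    """
    parent_a = leaderboard[0]  # always the globally best child
    if selection_strategy == "best_vs_original":
        return parent_a, original_parent_b
    if selection_strategy != "best":
        raise ValueError(f"unknown selection_strategy: {selection_strategy}")
    # Only the first keep_top_k entries are eligible to supply parent B.
    prefix = list(dict.fromkeys(leaderboard[:keep_top_k]))  # de-duplicate, keep order
    if len(prefix) >= 2:
        return parent_a, prefix[1]
    # keep_top_k == 1, or only one distinct child in the prefix: fall back.
    return parent_a, original_parent_b


board = ["child_07", "child_03", "child_11"]
print(pick_next_parents(board, "parent_B_gen0"))                # ('child_07', 'child_03')
print(pick_next_parents(board, "parent_B_gen0", keep_top_k=1))  # ('child_07', 'parent_B_gen0')
```

Note that `keep_top_k` only narrows the pool for parent B in this sketch; it never removes entries from the leaderboard itself, matching the lineage rule above.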

Per-generation RNG (`GenerationConfig.seed`)

In the CLI, --seed is forwarded to GenerationConfig.seed and seeds synthetic parent/state generation when you omit checkpoint/state files. When seed is set, generation g adds seed + g to every non-None crossover recipe seed and fine-tune regime seed in the shared SearchConfig. When seed is None, recipe and regime seeds are used exactly as configured. Mutation RNG seeds combine MutationConfig.seed, GenerationConfig.seed, and the generation index; see the GenerationConfig docstring in code.
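The offset rule for recipe and regime seeds is easiest to see with concrete numbers. The sketch below follows the description above literally; `effective_seed` is a hypothetical helper, and whether the generation index starts at 0 is an assumption to verify against the GenerationConfig docstring.

```python
from typing import Optional


def effective_seed(configured: Optional[int], generation_seed: Optional[int], generation: int) -> Optional[int]:
    """Seed used by a recipe or fine-tune regime in a given generation.

    configured: the seed written in the SearchConfig recipe/regime (may be None).
    generation_seed: GenerationConfig.seed (may be None).
    generation: generation index g (assumed 0-based here).
    """
    if configured is None or generation_seed is None:
        # None seeds are left alone; with no GenerationConfig.seed nothing is offset.
        return configured
    return configured + generation_seed + generation


for g in range(3):
    print(g, effective_seed(7, 1000, g))
```

With `--seed 1000` and a recipe seed of 7, the three generations would use 1007, 1008, and 1009 under this reading, so each generation gets distinct but reproducible randomness.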

Minimal CLI example

python scripts/run_multi_gen_search.py \
  --search-space minimal \
  --max-runs 3 \
  --num-generations 3 \
  --seed 1000 \
  --run-dir runs/multi_gen_smoke

For a single generation with the full run_crossover_search.py flag surface (eval batch size, workers, recombination thresholds, etc.), use scripts/run_crossover_search.py. run_multi_gen_search.py uses a smaller argument set focused on multi-gen knobs; extend the Python API if you need full parity.

Related Documentation

Document	Contents
`docs/design/distill_quantize_crossover_finetune.md`	Architecture overview, Mermaid pipeline diagram, module map, and recorded experimental results
`docs/design/crossover_strategies.md`	Detailed semantics of `random`, `layer`, and `weighted` crossover strategies with code examples
`docs/design/crossover_search_space.md`	Grid definitions, pre-defined search presets, and leaderboard format for `run_crossover_search.py` / `run_multi_gen_search.py`
`docs/distillation_soft_label_comparison.md`	Hard vs blended vs soft distillation objective comparison with reproducible results
`farm/config/default.yaml`	All YAML defaults including the `crossover_child_finetune` section
