Soft-label distillation: hard vs soft vs blended (issue #596)
This note documents a reproducible three-way comparison of knowledge-distillation objectives implemented in farm.core.decision.training.trainer_distill and runnable via CLI (scripts/run_distillation.py). It addresses the “results documented” acceptance item from issue #596.
Method
- Script: scripts/compare_distillation_modes.py
- Controlled variables: One frozen BaseQNetwork teacher (fixed seed), one shared synthetic state buffer (N, input_dim) from NumPy, one fixed StudentQNetwork initialization (reloaded for each run).
- What changes: Only DistillationConfig.alpha:
  - hard_only: alpha = 0 → cross-entropy on teacher argmax only.
  - blended: alpha = 0.7 → 0.7 * L_soft + 0.3 * L_hard (KL soft term, temperature 3).
  - soft_only: alpha = 1 → temperature-scaled KL distillation only.
- Metrics (validation split, last epoch):
  - Action agreement: fraction of states where student argmax equals teacher argmax.
  - Mean probability similarity: 1 - mean(|p_teacher - p_student|) over actions and samples (temperature-1 softmax in the evaluator; see DistillationTrainer._evaluate).
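The two metrics above can be sketched in plain Python. This is an illustration of the definitions as stated in this note, not the code in DistillationTrainer._evaluate; the helper names and the example Q-values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one state's Q-values."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def action_agreement(teacher_q, student_q):
    """Fraction of states where student argmax equals teacher argmax."""
    hits = sum(argmax(t) == argmax(s) for t, s in zip(teacher_q, student_q))
    return hits / len(teacher_q)


def mean_prob_similarity(teacher_q, student_q):
    """1 - mean(|p_teacher - p_student|) over actions and samples, at T=1."""
    total, count = 0.0, 0
    for t, s in zip(teacher_q, student_q):
        p_t, p_s = softmax(t), softmax(s)
        total += sum(abs(a - b) for a, b in zip(p_t, p_s))
        count += len(p_t)
    return 1.0 - total / count


# Hypothetical Q-values: two states, four actions each.
teacher_q = [[1.0, 0.9, 0.2, -0.5], [0.1, 0.4, 0.3, 0.0]]
# Same argmax as the teacher, but far too confident.
sharp_student_q = [[6.0, 0.0, 0.0, 0.0], [0.0, 6.0, 0.0, 0.0]]
# Same argmax as the teacher, and a similar distribution shape.
close_student_q = [[1.1, 0.8, 0.2, -0.4], [0.1, 0.5, 0.3, 0.0]]

for name, student_q in [("sharp", sharp_student_q), ("close", close_student_q)]:
    print(name,
          action_agreement(teacher_q, student_q),
          round(mean_prob_similarity(teacher_q, student_q), 3))
```

Both hypothetical students agree with the teacher on every top action, yet only the second one scores high on probability similarity, which is why the two metrics are reported side by side.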
Hyperparameters (default run): seed_base=42, input_dim=8, output_dim=4, parent_hidden=64, n_states=5000, temperature=3, epochs=25, batch_size=64, lr=1e-3, val_fraction=0.1, loss_fn=kl.
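As a rough sketch of how alpha interpolates between the three modes, the per-state objective can be written as alpha * L_soft + (1 - alpha) * L_hard. The function names below are hypothetical, and the T² factor on the KL term is an assumption (it is the common convention for temperature-scaled distillation); check trainer_distill for the scaling it actually applies.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one state's Q-values."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def soft_loss(teacher_q, student_q, temperature):
    """KL(teacher_T || student_T), multiplied by T^2 (assumed scaling)."""
    log_p_t = log_softmax(teacher_q, temperature)
    log_p_s = log_softmax(student_q, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return temperature ** 2 * kl


def hard_loss(teacher_q, student_q):
    """Cross-entropy of the student (T=1) against the teacher's argmax action."""
    target = max(range(len(teacher_q)), key=teacher_q.__getitem__)
    return -log_softmax(student_q)[target]


def distillation_loss(teacher_q, student_q, alpha, temperature=3.0):
    """alpha * L_soft + (1 - alpha) * L_hard for a single state."""
    return (alpha * soft_loss(teacher_q, student_q, temperature)
            + (1.0 - alpha) * hard_loss(teacher_q, student_q))


teacher_q = [1.0, 0.9, 0.2, -0.5]  # hypothetical teacher Q-values
student_q = [0.6, 1.0, 0.1, -0.2]  # hypothetical student Q-values

for name, alpha in [("hard_only", 0.0), ("blended", 0.7), ("soft_only", 1.0)]:
    print(name, round(distillation_loss(teacher_q, student_q, alpha), 4))
```

The three printed values live on different scales because the two terms do, which is the same reason raw validation loss should not be compared across modes.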
Results (2026-04-08, local run)
| Mode | α | Final action agreement | Final mean prob. similarity | Best val loss* |
|---|---|---|---|---|
| hard_only | 0.0 | 93.4% | 0.814 | 0.145 |
| blended | 0.7 | 93.2% | 0.981 | 0.162 |
| soft_only | 1.0 | 93.2% | 0.989 | 0.015 |
*Best val loss is the trainer’s validation objective at the best epoch. It is not comparable across rows: hard-only optimizes CE (scale ~0.14 here), soft-only optimizes scaled KL (~0.015 here), and blended optimizes a weighted mix of the two (~0.16 here). Use agreement and probability similarity for cross-mode fidelity, not raw val loss.
Interpretation
- Distribution match: Soft-only and blended training produce much closer full-action distributions to the teacher than hard-only (probability similarity ~0.98–0.99 vs ~0.81), which is exactly what soft labels are meant to preserve when Q-values are close across actions.
- Top-1 agreement: On this synthetic setup, hard-only is marginally higher (~0.2 percentage points) than soft-only/blended at the last epoch. With n_states=5000 and val_fraction=0.1 the validation split is about 500 states, so the gap amounts to roughly one state and should not be read as a ranking. The direction is still plausible: argmax CE directly targets the teacher’s top action, while KL spreads gradient across the distribution. On real replay data and longer training, rankings can differ; re-run the script with --states-file and your checkpoints for a task-specific read.
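To make the distribution-match point concrete, the sketch below compares the target a student sees under hard labels with the one it sees under soft labels when two actions are nearly tied. The Q-values and helper names are hypothetical and chosen only for illustration.

```python
import math


def softmax(q_values, temperature=1.0):
    """Numerically stable softmax; higher temperature flattens the output."""
    scaled = [q / temperature for q in q_values]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(q_values):
    """Hard label: all probability mass on the teacher's top action."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(q_values))]


# Hypothetical teacher Q-values where actions 0 and 1 are nearly tied.
teacher_q = [1.0, 0.9, 0.2, -0.5]

hard_target = one_hot_argmax(teacher_q)
soft_target_t1 = softmax(teacher_q, temperature=1.0)
soft_target_t3 = softmax(teacher_q, temperature=3.0)

print("hard    :", hard_target)
print("soft T=1:", [round(p, 3) for p in soft_target_t1])
print("soft T=3:", [round(p, 3) for p in soft_target_t3])
```

The hard target discards the near-tie between the first two actions, while both soft targets keep it; a student trained only on the hard target has no signal telling it that action 1 is almost as good as action 0.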
Reproduce
source venv/bin/activate
python scripts/compare_distillation_modes.py --json-out reports/distillation_mode_comparison.json
Machine-readable summary: reports/distillation_mode_comparison.json (regenerate locally; paths may be gitignored depending on your reports/ layout).
To mirror production-style usage (pairs A/B, optional parent checkpoints), use:
python scripts/run_distillation.py --alpha 0.0 # hard-only
python scripts/run_distillation.py --alpha 1.0 # soft-only (default)
python scripts/run_distillation.py --alpha 0.7 # blended
with shared --seed, --states-file, and parent checkpoints for a fair comparison on your data.
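One way to keep the comparison fair is to generate the three command lines from a single template so that only --alpha varies. The sketch below only builds and prints the commands; the seed value and states path are placeholders, and it assumes --seed takes an integer and --states-file takes a path. Parent-checkpoint options are left out because their flag names are not given in this note.

```python
import shlex

# Mode name -> alpha, matching the three-way comparison in this note.
MODES = {"hard_only": 0.0, "blended": 0.7, "soft_only": 1.0}


def build_commands(seed, states_file):
    """Return one run_distillation.py command per mode, differing only in --alpha."""
    commands = {}
    for name, alpha in MODES.items():
        commands[name] = [
            "python", "scripts/run_distillation.py",
            "--alpha", str(alpha),
            "--seed", str(seed),
            "--states-file", states_file,
        ]
    return commands


# Hypothetical seed and states path; substitute your own.
for name, cmd in build_commands(seed=42, states_file="data/states.npy").items():
    print(f"{name}: {shlex.join(cmd)}")
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run if you prefer to launch the runs from Python.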