Hyperparameter Evolution Convergence
This note documents how to capture and interpret learning-rate convergence from the hyperparameter chromosome evolution runner.
Looking for intra-population (in-situ) evolution where each agent carries its own chromosome and selection emerges from survival? See Intrinsic Evolution Experiment. The two runners are complementary.
Quick start: stable preset
The recommended way to run an experiment for the first time is via the stable_hyper_evo named preset. It encodes the configuration found to prevent lower-bound collapse and diversity collapse in the closure runs described below:
source venv/bin/activate
python scripts/run_evolution_experiment.py \
--preset stable_hyper_evo \
--generations 8 \
--population-size 10 \
--steps-per-candidate 80 \
--output-dir experiments/evolution_smoke
The preset sets --selection-method tournament, --boundary-mode reflect, --mutation-rate 0.20, --mutation-scale 0.15, and enables adaptive mutation. Any flag you pass explicitly overrides the preset value.
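The override rule can be pictured as a two-layer lookup. The sketch below is not the runner's code: the dictionary keys and the resolve_settings helper are hypothetical, and only the preset values themselves come from the description above.

```python
# Hypothetical sketch of "explicit flags override preset values".
# Key names and resolve_settings() are illustrative assumptions,
# not the runner's actual argument handling.

STABLE_HYPER_EVO = {
    "selection_method": "tournament",
    "boundary_mode": "reflect",
    "mutation_rate": 0.20,
    "mutation_scale": 0.15,
    "adaptive_mutation": True,
}


def resolve_settings(preset, explicit):
    """Merge preset values with explicit flags; None means 'flag not passed'."""
    resolved = dict(preset)
    for key, value in explicit.items():
        if value is not None:
            resolved[key] = value
    return resolved


# Passing --mutation-rate 0.10 explicitly while leaving --boundary-mode unset.
settings = resolve_settings(
    STABLE_HYPER_EVO,
    {"mutation_rate": 0.10, "boundary_mode": None},
)
print(settings["mutation_rate"], settings["boundary_mode"])
```

Under this reading, the explicit mutation rate wins while the boundary mode falls back to the preset's reflect.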
Methodology
- Runner: scripts/run_evolution_experiment.py
- Generational metrics source: evolution_generation_summaries.json
- Lineage source: evolution_lineage.json
- Typical search controls:
  - parent selection: tournament or roulette
  - mutation rate: --mutation-rate (default 0.25)
  - mutation scale: --mutation-scale (default 0.2)
  - boundary mode: --boundary-mode (clamp or reflect, default clamp)
  - optional soft boundary penalty: --boundary-penalty-enabled, --boundary-penalty-strength (default 0.01), --boundary-penalty-threshold (default 0.05)
  - fitness metric: final_population, total_births, or final_resources
Example:
source venv/bin/activate
python scripts/run_evolution_experiment.py \
--generations 8 \
--population-size 10 \
--steps-per-candidate 80 \
--selection-method tournament \
--mutation-rate 0.25 \
--mutation-scale 0.2 \
--boundary-mode clamp \
--fitness-metric final_population \
--output-dir experiments/evolution_convergence
Visualization
Use the plotting helper to generate a convergence figure from persisted summaries:
source venv/bin/activate
python scripts/plot_hyperparameter_evolution.py \
--summary-json experiments/evolution_convergence/evolution_generation_summaries.json \
--output experiments/evolution_convergence/hyperparameter_evolution.png
The chart contains:
- best fitness per generation
- per-gene mean and ±1 std trend over generations
Adaptive Mutation
A static mutation rate forces a trade-off between boundary collapse (too
exploitative) and stagnation (too exploratory). The runner supports an
opt-in adaptive mutation schedule driven by
AdaptiveMutationConfig. See farm/runners/adaptive_mutation.py for
the controller implementation.
When --adaptive-mutation is passed, each generation’s best fitness and
population diversity are observed and used to adjust the mutation rate and
scale that will produce the next generation:
- Fitness adaptation (use_fitness_adaptation): if best fitness did not improve over the trailing stall_window generations by more than improvement_threshold, the rate/scale multipliers are grown by stall_multiplier. When fitness improves clearly, the multipliers are shrunk by improve_multiplier.
- Diversity adaptation (use_diversity_adaptation): if the mean normalized gene standard deviation falls at or below diversity_threshold, the rate/scale multipliers are boosted by diversity_multiplier to escape a collapsed population.
- Per-gene multipliers (per_gene_rate_multipliers / per_gene_scale_multipliers, also exposed as --adaptive-per-gene-rate / --adaptive-per-gene-scale): constant weights applied to specific loci to give individual genes stronger or weaker mutation pressure.
Multipliers are always clamped to [min_*_multiplier, max_*_multiplier],
and the effective rate is clamped to [0, 1]. When a clamp actually
moves a value, the controller adds a rate_clamped or scale_clamped
tag to adaptive_event so saturation is visible in telemetry.
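The rules above can be sketched as a single update step. This is a minimal illustration, not the controller in farm/runners/adaptive_mutation.py: the class and function names, the "improved" tag, the multiplier bounds of 0.25 and 4.0, and the choice to measure a stall as the gain against the best fitness stall_window generations earlier are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Values taken from the example flags used in this note.
    stall_window: int = 3
    improvement_threshold: float = 1e-6
    stall_multiplier: float = 1.5
    improve_multiplier: float = 0.8
    diversity_threshold: float = 0.05
    diversity_multiplier: float = 1.5
    # Assumed bounds; the real defaults are not stated in this note.
    min_multiplier: float = 0.25
    max_multiplier: float = 4.0


def next_multiplier(current, best_history, diversity, cfg, clamp_tag="rate_clamped"):
    """Return (multiplier, event) for the mutation step that builds the next generation."""
    events = []
    value = current
    # Fitness rule: compare the latest best fitness with the one stall_window ago.
    if len(best_history) > cfg.stall_window:
        gain = best_history[-1] - best_history[-1 - cfg.stall_window]
        if gain > cfg.improvement_threshold:
            value *= cfg.improve_multiplier
            events.append("improved")  # hypothetical tag name
        else:
            value *= cfg.stall_multiplier
            events.append("stalled")
    # Diversity rule: boost when the population has collapsed.
    if diversity <= cfg.diversity_threshold:
        value *= cfg.diversity_multiplier
        events.append("diversity_collapse")
    # Clamp, and tag the event only when the clamp actually moved the value.
    clamped = min(max(value, cfg.min_multiplier), cfg.max_multiplier)
    if clamped != value:
        events.append(clamp_tag)
    return clamped, "+".join(events)


def effective_rate(base_rate, multiplier):
    """Scale the base rate and keep it a valid probability in [0, 1]."""
    return min(max(base_rate * multiplier, 0.0), 1.0)
```

With a flat fitness history and diversity below the threshold, both rules fire in the same step, which is how a combined tag such as stalled+diversity_collapse can arise.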
Telemetry
Every entry in evolution_generation_summaries.json records the mutation
parameters that produced this generation’s population (not the ones
that will produce the next), along with the multipliers in force, the
measured diversity of this generation, and a short string describing which
adaptation rules fired:
{
"generation": 2,
"best_fitness": 72.0,
"mutation_rate_used": 0.30,
"mutation_scale_used": 0.18,
"mutation_rate_multiplier": 1.5,
"mutation_scale_multiplier": 1.5,
"diversity": 0.034,
"adaptive_event": "stalled+diversity_collapse"
}
Because generation 0 is seeded by EvolutionExperiment._initialize_population
(which uses mutation_rate=1.0 to spread seed candidates) rather than by
the adaptive controller, its mutation_rate_used /
mutation_scale_used / mutation_*_multiplier fields are null and
its adaptive_event is "initial_seeding". diversity is still
recorded for every generation since it describes the evaluated population.
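When scanning a run for adaptation activity, it helps to skip the seeded generation and tally the tags in adaptive_event. The sketch below assumes the summaries file is a top-level JSON list of per-generation objects with the field names shown above; the list shape and the helper name are assumptions, and the sample data is inlined rather than read from disk.

```python
import json
from collections import Counter


def adaptive_event_counts(summaries):
    """Count adaptive_event tags, ignoring generations seeded without the controller."""
    counts = Counter()
    for entry in summaries:
        # Generation 0 records null mutation fields, so it is skipped here.
        if entry.get("mutation_rate_used") is None:
            continue
        event = entry.get("adaptive_event") or ""
        counts.update(tag for tag in event.split("+") if tag)
    return counts


# Inline sample mirroring the telemetry fields described in this note.
sample = json.loads("""
[
  {"generation": 0, "best_fitness": 70.0, "mutation_rate_used": null,
   "mutation_scale_used": null, "diversity": 0.21,
   "adaptive_event": "initial_seeding"},
  {"generation": 2, "best_fitness": 72.0, "mutation_rate_used": 0.30,
   "mutation_scale_used": 0.18, "diversity": 0.034,
   "adaptive_event": "stalled+diversity_collapse"}
]
""")

counts = adaptive_event_counts(sample)
print(dict(counts))
```

For a real run, the same function could be applied to the result of json.load on evolution_generation_summaries.json, provided the file has the assumed list shape.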
Example
source venv/bin/activate
python scripts/run_evolution_experiment.py \
--generations 12 \
--population-size 10 \
--steps-per-candidate 80 \
--selection-method tournament \
--mutation-rate 0.2 \
--mutation-scale 0.15 \
--adaptive-mutation \
--adaptive-stall-window 3 \
--adaptive-stall-multiplier 1.5 \
--adaptive-improve-multiplier 0.8 \
--adaptive-improve-threshold 1e-6 \
--adaptive-diversity-threshold 0.05 \
--adaptive-diversity-multiplier 1.5 \
--adaptive-per-gene-rate learning_rate=0.5 \
--fitness-metric final_population \
--output-dir experiments/evolution_adaptive
Tuning guidance
- Start with fitness adaptation only (--adaptive-disable-diversity) on short runs to confirm the stall/improve rules fire as expected in your fitness regime.
- Keep stall_multiplier close to 1.5 and improve_multiplier close to 0.8: values further from 1 can cause oscillation between exploit and explore regimes.
- stall_window should be at least 2 but smaller than the number of generations; a value of 3 is a good default for 8-20 generation runs.
- Set diversity_threshold after inspecting the diversity values logged by a non-adaptive baseline run. Typical collapsing populations report diversity < 0.05 on normalized gene ranges.
- Use per_gene_rate_multipliers (or --adaptive-per-gene-rate learning_rate=0.5) to dampen mutation on genes that are known to be sensitive while letting coarser knobs keep exploring.
- Watch for rate_clamped / scale_clamped in adaptive_event: if these appear repeatedly, the multiplier is saturating against max_*_multiplier. Either widen the bound or rebalance stall_multiplier and improve_multiplier so their geometric mean is close to 1 (the defaults of 1.5 and 0.8 multiply to 1.2, so the multiplier net-grows over equal numbers of stalls and improvements).
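The net-growth remark in the last bullet is simple arithmetic and can be checked directly. The helper names below are hypothetical, and the calculation ignores clamping and the diversity boost.

```python
def net_growth(stall_multiplier, improve_multiplier, pairs=1):
    """Multiplier drift after `pairs` stalls and `pairs` improvements, clamping ignored."""
    return (stall_multiplier * improve_multiplier) ** pairs


def balanced_improve_multiplier(stall_multiplier):
    """improve_multiplier that exactly cancels one stall (product of 1)."""
    return 1.0 / stall_multiplier


drift_one_pair = net_growth(1.5, 0.8)
drift_five_pairs = net_growth(1.5, 0.8, pairs=5)
balanced = balanced_improve_multiplier(1.5)
print(drift_one_pair, drift_five_pairs, balanced)
```

With the defaults, each stall/improve pair grows the multiplier by a factor of about 1.2, so five such pairs roughly multiply it by 2.5. An improve_multiplier near 0.67 would cancel a stall_multiplier of 1.5.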
Interpreting Results
When reviewing learning_rate convergence:
- tightening standard deviation over generations suggests convergence pressure
- unstable or growing spread suggests mutation pressure dominates selection
- rising best fitness with shrinking spread usually indicates useful convergence
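These three rules can be turned into a rough first-pass classifier over the start and end values of a run. The function name and the label strings are hypothetical, and the example inputs are the first-to-last values reported for the two closure runs and the smoke run later in this note; treat the output as a prompt for inspection, not a verdict.

```python
def classify_trend(fitness_start, fitness_end, std_start, std_end, tol=1e-9):
    """Label a run from first/last best fitness and learning_rate std."""
    improved = (fitness_end - fitness_start) > tol
    tightened = std_end < std_start
    if improved and tightened:
        return "useful convergence"
    if improved:
        return "fitness gain with growing spread"
    if tightened:
        return "contraction without fitness gain"
    return "mutation-dominated"


# Start -> end values as reported in this note.
tournament = classify_trend(68.0, 72.0, 0.0612, 0.1372)
roulette = classify_trend(72.0, 72.0, 0.1443, 0.2526)
smoke = classify_trend(6.0, 6.0, 0.0017334593, 0.0004325797)
print(tournament, roulette, smoke, sep="\n")
```

A two-point comparison hides what happens in between, so per-generation series from evolution_generation_summaries.json remain the better basis for a final narrative.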
Artifact Refresh for Multi-Gene Reporting (2026-04-18)
To close the multi-gene acceptance criteria, all checked-in convergence artifacts
under experiments/evolution_convergence were regenerated from the current
chromosome schema and runner wiring.
What changed in persisted outputs:
- every evolution_generation_summaries.json now includes per-gene stats for all loci in the active schema: learning_rate, gamma, epsilon_decay, and memory_size
- every best_chromosome snapshot now includes gamma and epsilon_decay alongside learning_rate
- evolution_lineage.json remains intentionally compact and still stores top-level learning_rate + metadata (full per-gene stats live in summaries)
Current final-generation snapshots from regenerated runs:
- run_clamp_baseline_g6: final best fitness 76.0; best chromosome (learning_rate=1e-06, gamma=1.0, epsilon_decay=0.7241990478892087)
- run_clamp_penalty_g6: final best fitness 66.99; best chromosome (learning_rate=0.20027491749951848, gamma=1.0, epsilon_decay=0.7592339511733013)
- run_clamp_penalty002_g6: final best fitness 76.98; best chromosome (learning_rate=1e-06, gamma=0.95, epsilon_decay=0.9320406841297989)
- run_clamp_penalty005_g6: final best fitness 68.95; best chromosome (learning_rate=1e-06, gamma=0.8369660526170754, epsilon_decay=0.8288524477063025)
- run_clamp_penalty010_g6: final best fitness 72.9; best chromosome (learning_rate=1e-06, gamma=0.8618623974764232, epsilon_decay=0.5743829534957358)
- run_reflect_g6: final best fitness 78.0; best chromosome (learning_rate=0.19271257710715683, gamma=0.9751036551406521, epsilon_decay=0.8970057808457064)
- run_roulette_mut040_g6: final best fitness 76.0; best chromosome (learning_rate=1e-06, gamma=1.0, epsilon_decay=1.0)
- run_tournament_mut020_g6: final best fitness 71.0; best chromosome (learning_rate=0.03602658902654344, gamma=0.9751036551406521, epsilon_decay=0.9688639803415231)
- run_tournament_mut025: final best fitness 71.0; best chromosome (learning_rate=0.2269703882742677, gamma=1.0, epsilon_decay=1.0)
Findings From Current Smoke Run
Using the checked-in artifacts in experiments/evolution_smoke:
- Generations evaluated: 2 (generation=0 and generation=1)
- Candidates evaluated: 8 total (4 per generation)
- Fitness metadata present in lineage: final_population
- Fitness behavior: flat (min=mean=max=6.0 in both generations)
- Learning-rate spread (computed from evolution_lineage.json):
  - generation 0: mean 0.0013067191, std 0.0017334593, min 1e-06, max 0.0042248765
  - generation 1: mean 0.00025075, std 0.0004325797, min 1e-06, max 0.001
Interpretation:
- The run shows learning-rate contraction toward smaller values across one generation.
- Because fitness is completely flat, there is no meaningful selection gradient in this run; observed contraction is likely driven by initialization/mutation + elitism dynamics rather than clear fitness improvement.
- This is acceptable as a smoke validation of the pipeline (encoding, mutation/crossover, lineage persistence), but it is not yet evidence of optimization convergence.
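The spread figures above are ordinary population statistics over the candidates in each generation. As a check, the sketch below recomputes the generation 1 row from one set of four values that is consistent with the reported min, max, and mean; those four values are a reconstruction, not a read-out of evolution_lineage.json, and the helper name is hypothetical.

```python
from statistics import fmean, pstdev


def spread(values):
    """Mean, population standard deviation, min, and max of one generation."""
    return {
        "mean": fmean(values),
        "std": pstdev(values),  # population std (divides by N, not N - 1)
        "min": min(values),
        "max": max(values),
    }


# Reconstructed generation 1: one candidate at 0.001, three at the lower bound.
generation_1 = [0.001, 1e-06, 1e-06, 1e-06]
stats = spread(generation_1)
print(stats)
```

The reconstructed population reproduces the reported mean of 0.00025075 and std of 0.0004325797, which suggests the note uses the population (not sample) standard deviation.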
Closure Run Comparison (Completed)
Two multi-generation runs were executed and persisted under experiments/evolution_convergence:
- run_tournament_mut020_g6
  - selection: tournament
  - mutation: rate 0.20, scale 0.2
  - seed: 42
  - settings: --generations 6 --population-size 8 --steps-per-candidate 40
- run_roulette_mut040_g6
  - selection: roulette
  - mutation: rate 0.40, scale 0.35
  - seed: 99
  - settings: --generations 6 --population-size 8 --steps-per-candidate 40
Reproducibility manifests were saved at:
- experiments/evolution_convergence/run_tournament_mut020_g6/run_manifest.json
- experiments/evolution_convergence/run_roulette_mut040_g6/run_manifest.json
Generated convergence figures:
- experiments/evolution_convergence/run_tournament_mut020_g6/hyperparameter_evolution.png
- experiments/evolution_convergence/run_roulette_mut040_g6/hyperparameter_evolution.png
Observed outcomes from persisted summaries:
- Tournament (run_tournament_mut020_g6)
  - best fitness: 68.0 -> 72.0
  - learning-rate mean: 0.0557 -> 0.0519
  - learning-rate std: 0.0612 -> 0.1372
  - best-candidate learning rate: 0.001 -> 1e-06
  - narrative: partial optimization with boundary collapse (fitness improved, but the winning learning rate moved to the lower bound and spread increased).
- Roulette (run_roulette_mut040_g6)
  - best fitness: 72.0 -> 72.0 (flat)
  - learning-rate mean: 0.1449 -> 0.2654
  - learning-rate std: 0.1443 -> 0.2526
  - best-candidate learning rate: 0.2100 -> 0.6895
  - narrative: oscillation / mutation-dominated regime (no fitness gain, increasing spread and drift toward larger rates).
Overall interpretation:
- The pipeline is working end-to-end and can change population-level behavior under different evolutionary settings.
- Lower mutation pressure with tournament selection produced better optimization signal (fitness improvement), but also collapsed the winning learning_rate to the lower bound, which indicates an over-strong attractor at the boundary.
- Higher mutation pressure with roulette selection maintained diversity but did not improve fitness, consistent with exploration overpowering selection.
- Across both runs, learning_rate appears to be a sensitive but noisy control variable for final_population; current settings show trade-offs between exploitation (collapse risk) and exploration (stagnation risk).
- Practical next tuning step: keep tournament selection, reduce mutation pressure further, and add a soft lower-bound guard or penalty so improvement does not depend on boundary collapse.
Commands used:
source venv/bin/activate
python scripts/run_evolution_experiment.py \
--generations 6 --population-size 8 --steps-per-candidate 40 \
--selection-method tournament --mutation-rate 0.20 --mutation-scale 0.2 \
--fitness-metric final_population --seed 42 \
--output-dir experiments/evolution_convergence/run_tournament_mut020_g6
python scripts/run_evolution_experiment.py \
--generations 6 --population-size 8 --steps-per-candidate 40 \
--selection-method roulette --mutation-rate 0.40 --mutation-scale 0.35 \
--fitness-metric final_population --seed 99 \
--output-dir experiments/evolution_convergence/run_roulette_mut040_g6
For issue closure write-ups, include:
- selected mutation/selection settings
- one generated convergence figure
- a short narrative of whether convergence, oscillation, or collapse occurred
Boundary-Handling Comparison Plan
To specifically evaluate boundary-collapse risk, keep all settings identical and toggle only boundary handling. Suggested A/B matrix:
- clamp baseline
- reflect mutation
- clamp + soft boundary penalty
Use a shared seed and identical generation/population settings:
source venv/bin/activate
# A) Clamp baseline
python scripts/run_evolution_experiment.py \
--generations 6 --population-size 8 --steps-per-candidate 40 \
--selection-method tournament --mutation-rate 0.20 --mutation-scale 0.2 \
--boundary-mode clamp \
--fitness-metric final_population --seed 42 \
--output-dir experiments/evolution_convergence/run_clamp_baseline_g6
# B) Reflective mutation
python scripts/run_evolution_experiment.py \
--generations 6 --population-size 8 --steps-per-candidate 40 \
--selection-method tournament --mutation-rate 0.20 --mutation-scale 0.2 \
--boundary-mode reflect \
--fitness-metric final_population --seed 42 \
--output-dir experiments/evolution_convergence/run_reflect_g6
# C) Clamp + soft boundary penalty
python scripts/run_evolution_experiment.py \
--generations 6 --population-size 8 --steps-per-candidate 40 \
--selection-method tournament --mutation-rate 0.20 --mutation-scale 0.2 \
--boundary-mode clamp \
--boundary-penalty-enabled \
--boundary-penalty-strength 0.01 \
--boundary-penalty-threshold 0.05 \
--fitness-metric final_population --seed 42 \
--output-dir experiments/evolution_convergence/run_clamp_penalty_g6
Compare each run’s evolution_generation_summaries.json and lineage outputs for:
- best-candidate learning-rate trajectory (especially boundary hits)
- learning-rate min/max and standard deviation trends
- fitness gains relative to boundary occupancy
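Boundary hits can be counted per generation once learning rates are extracted from the lineage. The sketch below assumes the lineage can be flattened into (generation, learning_rate) pairs and that the lower bound is 1e-06, the minimum value reported throughout this note; the record shape, the helper name, and the sample records are illustrative assumptions.

```python
from collections import defaultdict

LOWER_BOUND = 1e-06  # learning_rate lower bound as reported in this note


def min_boundary_occupancy(records, lower_bound=LOWER_BOUND):
    """Count candidates sitting exactly at (or below) the lower bound per generation."""
    counts = defaultdict(int)
    for generation, learning_rate in records:
        # Adding 0 still registers the generation, so hit-free generations report 0.
        counts[generation] += int(learning_rate <= lower_bound)
    return [counts[g] for g in sorted(counts)]


# Illustrative records only, not taken from a real run.
records = [
    (0, 0.0042), (0, 1e-06), (0, 0.001), (0, 0.2),
    (1, 1e-06), (1, 1e-06), (1, 0.001), (1, 1e-06),
    (2, 0.05), (2, 0.3), (2, 0.12), (2, 0.07),
]
occupancy = min_boundary_occupancy(records)
print(occupancy)
```

A list that climbs toward the population size, as in the clamp runs reported below, is the signature of lower-bound collapse; a list of zeros indicates no candidate sat exactly on the bound.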
Boundary-Handling Comparison Results (Completed)
Executed on 2026-04-18 with the exact commands above and shared seed (42).
Outputs were written to:
- experiments/evolution_convergence/run_clamp_baseline_g6
- experiments/evolution_convergence/run_reflect_g6
- experiments/evolution_convergence/run_clamp_penalty_g6
Observed outcomes from persisted summaries + lineage:
- Clamp baseline (run_clamp_baseline_g6)
  - best fitness: 72.0 -> 69.0
  - best-candidate learning rate: 0.001 -> 1e-06
  - learning-rate std: 0.0612 -> 0.1372
  - exact min-boundary occupancy by generation: [2, 3, 3, 5, 8, 7] (out of 8 candidates)
- Reflect (run_reflect_g6)
  - best fitness: 74.0 -> 75.0
  - best-candidate learning rate: 0.3771 -> 0.3771
  - learning-rate std: 0.1143 -> 0.1638
  - exact min-boundary occupancy by generation: [0, 0, 0, 0, 0, 0]
- Clamp + penalty (run_clamp_penalty_g6)
  - adjusted best fitness: 75.99 -> 73.99
  - raw best fitness (metadata): 76.0 (max over lineage)
  - best-candidate learning rate: 1e-06 -> 1e-06
  - exact min-boundary occupancy by generation: [2, 4, 5, 6, 8, 7]
  - mean boundary penalty across candidates: 0.00764 (max 0.01)
Interpretation:
- Reflective mutation clearly reduced boundary-collapse risk in this comparison: no candidates landed exactly at the lower bound, while clamp variants showed repeated and increasing lower-bound occupancy.
- Reflect also preserved/improved optimization signal (best fitness reached 75.0) without relying on boundary-hugging winners.
- The small soft-penalty setting (0.01, threshold 0.05) was not strong enough to dislodge clamp dynamics in this setup; it reduced adjusted fitness but did not prevent lower-bound collapse. Increasing penalty strength and/or threshold is the next tuning step if clamp must be retained.
Penalty Strength Sensitivity (Completed)
A follow-up sweep kept clamp mode fixed and varied only
boundary_penalty_strength (threshold=0.05, same seed/settings):
- experiments/evolution_convergence/run_clamp_penalty002_g6
- experiments/evolution_convergence/run_clamp_penalty005_g6
- experiments/evolution_convergence/run_clamp_penalty010_g6
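The exact sweep commands are not recorded in this note. The sketch below rebuilds plausible ones from the clamp + penalty command in the comparison plan, changing only the penalty strength and output directory; it uses only flags that appear elsewhere in this document, and the helper name and the assumption that nothing else differed are mine.

```python
import shlex

# Shared settings copied from the clamp + penalty command in the comparison plan.
BASE_ARGS = [
    "python", "scripts/run_evolution_experiment.py",
    "--generations", "6", "--population-size", "8", "--steps-per-candidate", "40",
    "--selection-method", "tournament",
    "--mutation-rate", "0.20", "--mutation-scale", "0.2",
    "--boundary-mode", "clamp",
    "--boundary-penalty-enabled",
    "--boundary-penalty-threshold", "0.05",
    "--fitness-metric", "final_population", "--seed", "42",
]


def sweep_commands(strengths):
    """Build one shell command per penalty strength (reconstruction, not a log)."""
    commands = []
    for strength in strengths:
        tag = f"{round(strength * 100):03d}"  # 0.02 -> "002", 0.10 -> "010"
        output_dir = f"experiments/evolution_convergence/run_clamp_penalty{tag}_g6"
        argv = BASE_ARGS + [
            "--boundary-penalty-strength", f"{strength:.2f}",
            "--output-dir", output_dir,
        ]
        commands.append(shlex.join(argv))
    return commands


commands = sweep_commands([0.02, 0.05, 0.10])
for command in commands:
    print(command)
```

Printing the commands rather than executing them keeps the sketch side-effect free; each line can be pasted into an activated virtual environment.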
Observed outcomes:
- strength=0.02 (run_clamp_penalty002_g6)
  - adjusted best fitness: 70.0 -> 79.0
  - best-candidate learning rate: 0.1594 -> 0.1594
  - min-boundary hits by generation: [2, 3, 2, 2, 1, 0]
  - mean penalty: 0.00530 (max 0.02)
- strength=0.05 (run_clamp_penalty005_g6)
  - adjusted best fitness: 68.951 -> 67.95
  - best-candidate learning rate: 0.001 -> 1e-06
  - min-boundary hits by generation: [2, 3, 0, 1, 7, 7]
  - mean penalty: 0.03262 (max 0.05)
- strength=0.10 (run_clamp_penalty010_g6)
  - adjusted best fitness: 74.902 -> 73.0
  - best-candidate learning rate: 0.001 -> 0.1594
  - min-boundary hits by generation: [2, 3, 2, 4, 2, 2]
  - mean penalty: 0.04303 (max 0.10)
Sensitivity takeaway:
- Penalty impact is non-monotonic in this stochastic setting.
- 0.02 and 0.10 reduced lower-bound collapse relative to prior clamp runs, while 0.05 still collapsed late.
- A higher penalty (0.10) prevented the final winner from collapsing to 1e-06, but reflective mutation remains the most consistent anti-collapse strategy in this set of experiments.
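This note does not spell out the penalty formula. One form that is consistent with the reported figures is a linear ramp: full strength at the bound, falling to zero once the normalized distance from the bound reaches the threshold. The sketch below is that guess for a single gene, not the runner's implementation. The learning_rate upper bound of 1.0, the raw fitness values of 69.0 and 75.0, the function names, and the treatment of only one gene are all assumptions.

```python
def boundary_penalty(value, lower, upper, strength, threshold):
    """Linear-ramp penalty for one gene (assumed form, not the runner's code)."""
    # Distance to the nearest bound, normalized by the gene's range.
    distance = min(value - lower, upper - value) / (upper - lower)
    if distance >= threshold:
        return 0.0
    return strength * (1.0 - distance / threshold)


def adjusted_fitness(raw, value, lower=1e-06, upper=1.0, strength=0.01, threshold=0.05):
    """Raw fitness minus the boundary penalty for the given gene value."""
    return raw - boundary_penalty(value, lower, upper, strength, threshold)


# Winner at the bound with strength 0.01: raw 76.0 as reported in the metadata.
at_bound = adjusted_fitness(76.0, 1e-06)
# Winner at 0.001 with strengths 0.05 and 0.10: raw values are assumed.
near_bound_005 = adjusted_fitness(69.0, 0.001, strength=0.05)
near_bound_010 = adjusted_fitness(75.0, 0.001, strength=0.10)
print(at_bound, near_bound_005, near_bound_010)
```

Under these assumptions the sketch lands on 75.99, about 68.951, and about 74.902, which match the adjusted values reported above. That agreement is suggestive rather than conclusive, since other penalty shapes or a different combination across genes could yield the same numbers.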