Multi-Seed Cohort Runner

Single evolution runs are noisy. A configuration that looks best in one run can appear significantly worse (or better) in another simply because of random seed variance. The cohort runner eliminates this ambiguity by executing the same evolution configuration over N independent random seeds and aggregating the results into a single summary with mean, standard deviation, and convergence statistics.

Quick start

source venv/bin/activate
python scripts/run_cohort_experiment.py \
  --preset stable_hyper_evo \
  --generations 8 \
  --population-size 10 \
  --steps-per-candidate 80 \
  --num-seeds 5 \
  --base-seed 0 \
  --output-dir experiments/cohort_smoke

This runs 5 seeds (0, 1, 2, 3, 4) and writes four artifact types to experiments/cohort_smoke/:

File	Contents
`cohort_manifest.json`	Resolved configuration snapshot (written before the run)
`cohort_aggregate.json`	Per-seed detail + aggregate statistics
`cohort_aggregate.csv`	One row per seed (notebook-ready)
`seed_<N>/`	Full per-seed evolution artifacts (same layout as `run_evolution_experiment.py`)

Command-line flags

All evolution flags from run_evolution_experiment.py are available plus two cohort-specific flags:

Flag	Default	Description
`--num-seeds`	`3`	Number of seeds to run
`--base-seed`	`0`	Seeds are `[base_seed, …, base_seed+num_seeds-1]`

Every other flag (generations, population-size, preset, adaptive-mutation, convergence, etc.) applies identically to each seed run.

Artifact schema

`cohort_aggregate.json`

{
  "config": { ... },
  "num_seeds": 5,
  "seeds": [0, 1, 2, 3, 4],

  "best_fitness_mean": 7.8,
  "best_fitness_std":  1.2,
  "best_fitness_min":  6.0,
  "best_fitness_max":  9.5,

  "convergence_rate": 0.4,
  "convergence_reason_counts": {
    "fitness_plateau": 2
  },
  "mean_generation_of_convergence": 5.5,
  "std_generation_of_convergence":  0.7,

  "lower_bound_occupancy_mean": 0.125,
  "lower_bound_occupancy_std":  0.05,

  "mean_elapsed_seconds": 12.3,
  "total_elapsed_seconds": 61.5,

  "seed_results": [
    {
      "seed": 0,
      "best_fitness": 8.0,
      "num_generations_completed": 8,
      "converged": true,
      "convergence_reason": "fitness_plateau",
      "generation_of_convergence": 6,
      "elapsed_seconds": 11.9,
      "lower_bound_occupancy": 0.125
    }
  ]
}

Field reference

Field	Type	Description
`config`	object	Serialised `EvolutionExperimentConfig` template (seed field is the template value before per-seed override)
`num_seeds`	int	Total seeds executed
`seeds`	list[int]	Seed values in execution order
`best_fitness_mean`	float	Mean of per-seed best fitness values
`best_fitness_std`	float	Population standard deviation of best fitness
`best_fitness_min`	float	Minimum best fitness across seeds
`best_fitness_max`	float	Maximum best fitness across seeds
`convergence_rate`	float	Fraction (0–1) of seeds that satisfied a convergence criterion
`convergence_reason_counts`	object	Mapping of `ConvergenceReason` value → count
`mean_generation_of_convergence`	float\|null	Mean 0-based generation index at convergence (converged seeds only); `null` when no seed converged
`std_generation_of_convergence`	float\|null	Standard deviation of the same; `null` when fewer than 2 seeds converged
`lower_bound_occupancy_mean`	float\|null	Mean fraction of generations where the best chromosome’s `learning_rate` was at its lower boundary
`lower_bound_occupancy_std`	float\|null	Standard deviation of the same
`mean_elapsed_seconds`	float	Average wall-clock seconds per seed
`total_elapsed_seconds`	float	Total wall-clock seconds for the cohort
`seed_results`	list	One entry per seed (see below)

seed_results entry

Field	Type	Description
`seed`	int	Seed used for this run
`best_fitness`	float	Best fitness observed across all generations
`num_generations_completed`	int	Generations that ran (may be less than budget when `early_stop=True`)
`converged`	bool	Whether a convergence criterion was satisfied
`convergence_reason`	str\|null	`"fitness_plateau"`, `"diversity_collapse"`, `"budget_exhausted"`, or `null`
`generation_of_convergence`	int\|null	0-based generation of first convergence event
`elapsed_seconds`	float	Wall-clock seconds for this seed
`lower_bound_occupancy`	float\|null	Fraction of generations the best chromosome hit the `learning_rate` lower boundary

`cohort_aggregate.csv`

One row per seed with the same columns as the seed_results entries above. Load directly into pandas:

import pandas as pd
df = pd.read_csv("experiments/cohort_smoke/cohort_aggregate.csv")
print(df[["seed", "best_fitness", "converged", "lower_bound_occupancy"]])

Notebook ingestion

A minimal loading snippet for notebooks/hyperparameter_evolution_results.ipynb or any new notebook:

import json, pandas as pd

with open("experiments/cohort_smoke/cohort_aggregate.json") as f:
    cohort = json.load(f)

# Top-level aggregates
print(f"best_fitness  mean={cohort['best_fitness_mean']:.3f} "
      f"± {cohort['best_fitness_std']:.3f}  "
      f"[{cohort['best_fitness_min']:.3f}, {cohort['best_fitness_max']:.3f}]")
print(f"convergence_rate={cohort['convergence_rate']:.0%}")
print(f"lower_bound_occupancy mean={cohort['lower_bound_occupancy_mean']}")

# Per-seed DataFrame
df = pd.DataFrame(cohort["seed_results"])
df.plot(x="seed", y="best_fitness", marker="o", title="Best fitness per seed")

Interpreting results with statistical confidence

Using mean ± std for fitness comparisons

When comparing two configurations A and B:

If best_fitness_mean_A − best_fitness_std_A > best_fitness_mean_B + best_fitness_std_B, configuration A is reliably superior.
Overlapping 1-σ bands indicate that the apparent winner may invert on different seeds. Run more seeds (--num-seeds 10+) or use a paired t-test on the seed_results values.

Convergence rate

A high convergence_rate (> 0.7) combined with a low std_generation_of_convergence means the search reliably converges in a predictable number of generations. A low rate suggests the configuration rarely escapes the search space within the generation budget — consider increasing --generations or adjusting mutation parameters.

Lower-bound occupancy

lower_bound_occupancy_mean close to 1.0 indicates that winning candidates consistently collapse to the minimum learning_rate boundary. This is a sign of lower-bound collapse — the optimizer is effectively not searching the learning-rate space. Mitigations:

Switch to --boundary-mode reflect (part of the stable_hyper_evo preset) to let gene values bounce off boundaries instead of sticking.
Enable --boundary-penalty-enabled to add a soft fitness penalty near boundaries.
Widen the learning_rate gene’s minimum bound in the chromosome config.

Comparing multiple configurations

Run each configuration as a separate cohort and store results in separate --output-dir directories:

# Config A
python scripts/run_cohort_experiment.py \
  --preset stable_hyper_evo --num-seeds 10 --base-seed 0 \
  --output-dir experiments/cohort_A

# Config B
python scripts/run_cohort_experiment.py \
  --selection-method roulette --num-seeds 10 --base-seed 0 \
  --output-dir experiments/cohort_B

Then load both cohort_aggregate.json files in a notebook and compare the best_fitness_mean / best_fitness_std side-by-side:

import json, pandas as pd

configs = {"A": "experiments/cohort_A", "B": "experiments/cohort_B"}
rows = []
for name, path in configs.items():
    with open(f"{path}/cohort_aggregate.json") as f:
        d = json.load(f)
    rows.append({
        "config": name,
        "mean": d["best_fitness_mean"],
        "std": d["best_fitness_std"],
        "convergence_rate": d["convergence_rate"],
        "lb_occupancy": d["lower_bound_occupancy_mean"],
    })
comparison = pd.DataFrame(rows)
print(comparison)

Programmatic API

The CohortRunner class is exported from farm.runners and can be used directly without the CLI:

from farm.config import SimulationConfig
from farm.runners import (
    AdaptiveMutationConfig,
    CohortRunner,
    ConvergenceCriteria,
    EvolutionExperimentConfig,
    EvolutionFitnessMetric,
    EvolutionSelectionMethod,
)
from farm.core.hyperparameter_chromosome import BoundaryMode

base_config = SimulationConfig.from_centralized_config(environment="development")
template = EvolutionExperimentConfig(
    num_generations=8,
    population_size=10,
    num_steps_per_candidate=80,
    selection_method=EvolutionSelectionMethod.TOURNAMENT,
    boundary_mode=BoundaryMode.REFLECT,
    adaptive_mutation=AdaptiveMutationConfig(enabled=True),
    convergence_criteria=ConvergenceCriteria(enabled=True, early_stop=True),
    seed=None,  # overridden per seed
)

runner = CohortRunner(
    base_config=base_config,
    experiment_config_template=template,
    seeds=list(range(5)),          # seeds 0..4
    output_dir="experiments/cohort_api",
)
aggregate = runner.run()

print(f"fitness  {aggregate.best_fitness_mean:.3f} ± {aggregate.best_fitness_std:.3f}")
print(f"converged {aggregate.convergence_rate:.0%} of seeds")

The CohortAggregateResult and CohortSeedResult dataclasses are also exported from farm.runners if you need to type-annotate your own analysis code.