Is the DQN actually learning? A diagnostic, four bugs, and a sobering answer
I had a suspicion that the DQN decision module wasn’t actually
learning: agents weren’t getting better at making decisions, and I
wondered whether they were just dying before their internal model had a
chance to adapt. That hypothesis did not survive the first diagnostic
run. The more interesting story turned out to be that the DQN was
quietly half-broken in four places at once: a global training throttle,
a never-applied epsilon schedule, a YAML-to-config mapping that dropped
knobs on the floor, and a hidden-size field that did nothing. The
diagnostic also ended with an honest finding: even after fixing all
four, agents only barely start making better decisions in late life vs
early life at the default simulation horizon. The cause is no longer the
code; it is the environment’s signal-to-noise ratio.
This is the walkthrough.
Setup
The active decision stack is:
| Piece | File |
|---|---|
| Agent-facing API | farm/core/decision/decision.py |
| Tianshou integration | farm/core/decision/algorithms/tianshou.py |
| Replay buffer (PER) | farm/core/decision/algorithms/rl_base.py |
| Per-agent wiring | farm/core/agent/behaviors/learning.py |
| Per-step agent loop | farm/core/agent/core.py |
| Deferred-training scheduler | farm/core/simulation.py |
I wrote a diagnostic harness, scripts/diagnose_dqn_learning.py, that
wraps DecisionModule and TianshouWrapper and records per agent:
replay-buffer stores, should_train() outcomes, real policy.learn()
calls, the L2 movement of the Q-network’s first parameter tensor (so we
can prove weights actually moved, not just that learn returned), buffer
size and policy.eps at end of run, and lifespan. Later in the session I
extended it to record (env_time, action, reward) per store so I could
do a proper residualised late-vs-early reward analysis. Everything ran
with PYTHONHASHSEED=0 for determinism on the default config (30
starting learning agents, 100–500 steps).
The point throughout was the same one I keep making to myself: do not guess at RL bugs from the code; instrument, run, look at numbers.
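The weight-movement probe is the least obvious of those measurements, so here is a minimal sketch of the idea. It uses plain Python lists as stand-ins for a flattened parameter tensor; the function name and the sample values are hypothetical, and the real harness presumably snapshots a tensor instead.

```python
import math

def l2_movement(before, after):
    """L2 norm of the element-wise difference between two parameter snapshots."""
    if len(before) != len(after):
        raise ValueError("snapshots must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))

# Hypothetical flattened weights captured at the start and end of a run.
snapshot_start = [0.10, -0.20, 0.05]
snapshot_end = [0.13, -0.16, 0.05]

moved = l2_movement(snapshot_start, snapshot_end)        # about 0.05
unchanged = l2_movement(snapshot_start, snapshot_start)  # exactly 0.0
```

A value of zero means learn() may have returned but the parameters did not change, which is the case the probe exists to catch.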
What the first run found
Default config, 100 steps:
| Per-agent metric | Median | p75 | Notes |
|---|---|---|---|
| Stores per agent | 93 | 100 | replay fills fine |
| policy.learn() calls | 2 | 4.5 | barely any training |
| train-ready=False | 61 | 69 | almost every chance, skipped |
| ‖Δw‖₂ of first Q-param | ~0 | 0.1 | weights barely move |
| Final policy.eps | 0.000 | 0.000 | exploration broken |
| Lifespan | 95 | 101 | mean 81 steps |
That is enough to falsify my initial hypothesis right away. Mean
lifespan is 81 steps, the batch threshold is 32, and 58/63 agents
filled their buffer past it. Lifespan is not the bottleneck. The
bottleneck is somewhere between the buffer being ready and policy.learn()
actually getting called. And the policy.eps = 0.000 for every
trained agent at end of run is suspicious — none of them are exploring.
So I went looking. Four real bugs surfaced.
Bug 1: the global training throttle
farm/core/simulation.py runs deferred RL updates via
_run_deferred_learning_updates(env, max_updates, rr_cursor), with
max_updates = performance.max_learning_updates_per_step and a default
of 4. With ~30 alive agents and the round-robin scheduler, that caps
total gradient updates per env step at 4, no matter how many agents are
ready. Over 100 steps that’s at most 400 updates spread across 30+
agents — median 2 per agent. When I disabled deferred training entirely
(--no-defer, training happens inline on every store) the same agents
ran median 18 gradient steps. So the throttle was costing ~9× the
training volume.
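To make the ceiling concrete, here is a toy round-robin scheduler with a global per-step cap. It is a sketch, not the code in farm/core/simulation.py: the function name is hypothetical, and it assumes a fixed population in which every agent is always ready to train, so it gives an upper bound rather than the observed counts.

```python
from collections import Counter

def simulate_round_robin(n_agents, n_steps, max_updates_per_step):
    """Count gradient updates per agent under a global per-step cap."""
    updates = Counter({agent: 0 for agent in range(n_agents)})
    cursor = 0  # round-robin position carried across steps
    for _ in range(n_steps):
        for _ in range(min(max_updates_per_step, n_agents)):
            updates[cursor] += 1
            cursor = (cursor + 1) % n_agents
    return updates

capped = simulate_round_robin(n_agents=30, n_steps=100, max_updates_per_step=4)
total_updates = sum(capped.values())          # 400
best_case_per_agent = max(capped.values())    # 14
```

Even in this best case no agent gets more than 14 updates in 100 steps. The observed median of 2 sits well below that ceiling, which fits agents also needing a full batch and a positive should_train() before they can use a slot.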
Bug 2: epsilon-greedy was always off
Tianshou’s DQNPolicy.__init__ sets self.eps = 0.0 and only changes
it when someone calls policy.set_eps(...). The wrapper was passing
eps_train, eps_test, eps_train_final into the policy’s
algorithm_config, but set_eps() was never called anywhere in the
codebase. Direct repro:
```python
DQNWrapper(..., algorithm_config={"eps_train": 0.5, ...}).policy.eps
# -> 0.0
```
So the configured epsilon_start: 1.0, epsilon_min: 0.01,
epsilon_decay: 0.995 in default.yaml were all dead. The agent’s
“epsilon-greedy” was pure greedy on a near-random Q-network. There was
still some exploration in the system, but it came from a hard-coded
predict_proba heuristic (0.8 mass on the policy’s argmax + 0.2
uniform) multiplied by per-action priors from action_weights, not
from the configured schedule. The DQN policy never got a “test/greedy”
mode and never had its exploration anneal over time.
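For reference, this is what the configured schedule would look like if it were applied. The sketch assumes one multiplicative decay per action selection and a hard floor at epsilon_min; the function names are hypothetical and only the three YAML values come from the config above.

```python
def epsilon_after(n_calls, start=1.0, floor=0.01, decay=0.995):
    """Exploration rate after n_calls multiplicative decays, clamped at the floor."""
    return max(floor, start * decay ** n_calls)

def calls_until_floor(start=1.0, floor=0.01, decay=0.995):
    """Number of decays needed before the rate first reaches the floor."""
    n, eps = 0, start
    while eps > floor:
        eps *= decay
        n += 1
    return n

eps_100 = epsilon_after(100)   # about 0.606
eps_300 = epsilon_after(300)   # about 0.222
floor_at = calls_until_floor()
```

Under these assumptions the floor is only reached after roughly 900 selections, so a 100-step agent should still be exploring more than half the time. A reading of 0.000 is therefore not a decayed schedule; it is a schedule that never started.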
Bug 3: YAML-to-DecisionConfig wiring silently dropped knobs
AgentComponentConfig.from_simulation_config mapped:
```text
learning.memory_size     -> decision.memory_size
learning.batch_size      -> decision.batch_size
learning.dqn_hidden_size -> decision.dqn_hidden_size
```
But the Tianshou wrapper reads decision.rl_buffer_size and
decision.rl_batch_size, not memory_size / batch_size (those are
consumed by the legacy BaseDQNModule, which the production stack
doesn’t use). So learning.memory_size: 2000 in YAML actually produced
a 10000-entry replay buffer (the rl_buffer_size default). The
dqn_hidden_size: 24 knob was mapped but still did nothing, because the network never read it (see Bug 4).
Anyone tuning these from YAML was tuning ghosts.
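The failure mode is easy to reproduce in miniature. The sketch below uses a stand-in dataclass rather than the real DecisionConfig; the class and function names are hypothetical, and every default except the 10000-entry rl_buffer_size is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class DecisionConfigSketch:
    # Legacy fields, read only by the older DQN module (placeholder defaults).
    memory_size: int = 10000
    batch_size: int = 32
    # Fields the Tianshou wrapper reads, per the text above.
    rl_buffer_size: int = 10000
    rl_batch_size: int = 32  # placeholder default

def map_legacy_only(learning):
    """Mapping as described above: only the legacy fields are populated."""
    return DecisionConfigSketch(
        memory_size=learning["memory_size"],
        batch_size=learning["batch_size"],
    )

def map_both(learning):
    """One possible repair: populate the legacy and the rl_* fields together."""
    return DecisionConfigSketch(
        memory_size=learning["memory_size"],
        batch_size=learning["batch_size"],
        rl_buffer_size=learning["memory_size"],
        rl_batch_size=learning["batch_size"],
    )

yaml_learning = {"memory_size": 2000, "batch_size": 32}
ghost = map_legacy_only(yaml_learning).rl_buffer_size   # 10000, YAML ignored
wired = map_both(yaml_learning).rl_buffer_size          # 2000
```

Nothing raises in the first mapping, which is why the mismatch stayed invisible: the config object is valid, it just is not the one the wrapper consults.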
Bug 4: dqn_hidden_size was decorative
AdaptiveQNet in tianshou.py had hard-coded FC widths of 512 / 256 /
128 in the constructor. DecisionConfig.dqn_hidden_size was declared
as a real Pydantic field but never plumbed through. Width was always
512/256/128 regardless of the config.
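A sketch of what deriving the widths from the config could look like, using the 4h / 2h / h scheme and the floor of 8 listed under the fixes below. Whether the floor applies to h or to each width is not stated, so applying it to h is an assumption, and the function name is hypothetical.

```python
def layer_widths(hidden_size, minimum=8):
    """Three fully connected widths derived from one hidden-size knob."""
    h = max(minimum, int(hidden_size))  # assumption: the floor clamps h itself
    return (h * 4, h * 2, h)

old_hardcoded = layer_widths(128)  # (512, 256, 128)
yaml_default = layer_widths(24)    # (96, 48, 24)
clamped = layer_widths(2)          # (32, 16, 8)
```

Under this scheme a hidden size of 128 reproduces the old hard-coded network, while the YAML value of 24 gives a network roughly five times narrower, so honoring the knob is a real behavioral change and not just a cleanup.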
Bug 5 (cleanup): swallowed exceptions in the agent step loop
AgentCore.step had:
```python
try:
    action = self.behavior.decide_action(self, state_tensor, enabled_actions)
    self._execute_action(action, state_tensor)
except Exception:
    pass  # every failure is silently discarded
```
So any exception during decision or execution looked exactly like “the agent simply isn’t learning”, with no log entry to distinguish a quiet step from a crashing one. This wasn’t the cause of my symptom, but it would absolutely hide future versions of it.
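A minimal sketch of the alternative, logging the failure with enough context to tell a crash from a quiet step. It uses the standard library logging module as a stand-in for structlog, and guarded_step, failing_decide, and the agent id are hypothetical names for illustration.

```python
import logging

logger = logging.getLogger("agent_step_sketch")

def guarded_step(agent_id, decide, execute):
    """Run one decide/execute cycle; log failures instead of hiding them."""
    try:
        action = decide()
        execute(action)
        return True
    except Exception as exc:
        logger.warning(
            "agent step failed: agent_id=%s error=%s: %s",
            agent_id, type(exc).__name__, exc,
            exc_info=True,  # attach the traceback to the log record
        )
        return False

def failing_decide():
    raise ValueError("state tensor has wrong shape")

ok = guarded_step("agent-7", failing_decide, lambda action: None)  # False, and logged
```

The step still survives the exception, so simulation behavior is unchanged; the only difference is that the failure now leaves a trace.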
The fixes
Six commits on cursor/diagnose-dqn-learning-a451, PR
#878:
- Throttle auto-scale. Change max_learning_updates_per_step to default to 0, and re-interpret 0 as the auto-scale sentinel: every alive agent gets one gradient step per env step. Positive ints remain a hard cap; negatives short-circuit to no training.
- Epsilon-greedy actually wired. TianshouWrapper snapshots eps_train / eps_train_final / eps_test / eps_decay at init, calls policy.set_eps(epsilon_start) immediately, decays multiplicatively on every select_action_with_mask call (training mode), floors at epsilon_min, and exposes set_train_mode(False) to switch the policy to eps_test for evaluation.
- YAML wiring. from_simulation_config now maps learning.memory_size and learning.batch_size to both the legacy fields (so the deprecated DQN module still works) and the rl_buffer_size / rl_batch_size fields that the Tianshou wrapper actually reads.
- dqn_hidden_size plumbed. AdaptiveQNet takes a hidden_size parameter and uses h*4 / h*2 / h widths, with a floor of 8. dqn_hidden_size is added to _EXCLUDED_PARAMS so it doesn’t leak into DQNPolicy’s constructor.
- Exceptions logged, not swallowed. AgentCore.step failures from decide_action / _execute_action now log at warning with the agent id, exception type, message, and traceback (structlog exc_info=True).
- Tests. New unit tests in tests/test_decision_config_wiring.py and tests/test_rl_training_batching.py lock down: initial policy.eps == epsilon_start, the multiplicative decay, the epsilon_min floor, set_train_mode(False) switching to eps_test, dqn_hidden_size actually changing the network width, the YAML-to-rl_buffer_size mapping, and the auto-scale sentinel.
Full pytest run: 6492 passed, 0 failed, 18 skipped (pre-existing CUDA / openpyxl skips).
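The three-way meaning of max_learning_updates_per_step is the part most likely to be misread later, so here is a sketch of the semantics as described above. The function name is hypothetical, and clamping a positive cap to the number of alive agents is an assumption.

```python
def resolve_update_budget(max_updates, n_alive):
    """Gradient-update budget for one env step."""
    if max_updates < 0:
        return 0                      # negative: training disabled
    if max_updates == 0:
        return n_alive                # sentinel: one update per alive agent
    return min(max_updates, n_alive)  # positive: hard cap (clamp is assumed)

auto = resolve_update_budget(0, n_alive=30)     # 30
legacy = resolve_update_budget(4, n_alive=30)   # 4
off = resolve_update_budget(-1, n_alive=30)     # 0
```

The design choice worth noting is that 0 no longer means "no training"; anyone who previously set 0 to disable updates would need a negative value instead.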
Did “is the DQN learning?” change after the fixes?
Yes. Same simulation, same seed, before-and-after at 100 steps:
| Per-agent metric | Before | After | Δ |
|---|---|---|---|
| Median policy.learn() calls | 2 | 18 | 9× |
| Median ‖Δw‖₂ of first Q-param | ~0 | 0.1 | weights actually move |
| Final policy.eps | 0.000 | 0.606 | schedule alive (1.0 × 0.995^100) |
And at 300 steps, simulating the legacy throttle + broken eps on the
current code (--legacy) vs the post-fix defaults:
| Per-agent metric | Legacy | Current | Δ |
|---|---|---|---|
| Median policy.learn() calls | 5 | 34 | 6.8× |
| ‖Δw‖₂ p75 | 0.1 | 0.2 | 2× |
| Mean lifespan (steps) | 213 | 262 | +23% |
| Final policy.eps | 0.000 | 0.222 | decayed correctly |
That lifespan delta is the cleanest behavioral signal. Same seed, same population, same world — agents under the fixed system survive ~23% longer.
But are agents actually making better decisions?
Higher training volume, larger weight movement, longer survival. Those are inputs and outcomes. What I really wanted to see — and what I actually asked — was policy quality over the lifetime of one agent. Is a 300-step-old agent picking better actions than a 50-step-old agent?
To answer that I extended the diagnostic with:
- (env_time, action, reward) per stored experience.
- A cohort baseline: for each env time, the mean reward across all alive agents at that moment. An agent’s residualised reward is its own reward minus that cohort mean, so anything common to the cohort (food depleting, density, weather) cancels out.
- Per-agent late-vs-early residualised reward, restricted to long-lived agents (lifespan ≥ 100), with a one-sample t-stat on the per-agent delta.
- Action-distribution entropy and top-action share in the first vs the last quartile of each agent’s experience. A policy that is actually learning to commit should show lower entropy / higher top-action share in late life.
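The residualisation and the t-stat can be sketched in a few lines of standard-library Python. This is an illustration of the method, not the harness code: the function names and the toy records are hypothetical, and splitting each agent's life into halves is an assumption, since the exact late/early windows for reward are not spelled out here.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def residualise(records):
    """records: list of (agent_id, env_time, reward) -> {agent_id: [(env_time, residual)]}."""
    by_time = defaultdict(list)
    for _, t, r in records:
        by_time[t].append(r)
    cohort_mean = {t: mean(rs) for t, rs in by_time.items()}
    out = defaultdict(list)
    for agent, t, r in records:
        out[agent].append((t, r - cohort_mean[t]))
    return out

def late_minus_early(series):
    """Mean residual in the second half of life minus the first half (needs >= 2 samples)."""
    ordered = [res for _, res in sorted(series)]
    half = len(ordered) // 2
    return mean(ordered[half:]) - mean(ordered[:half])

def one_sample_t(deltas):
    """t-statistic for the null hypothesis that the mean per-agent delta is zero."""
    return mean(deltas) / (stdev(deltas) / math.sqrt(len(deltas)))

# Toy data: both agents see rewards fall over time, but "a" falls more slowly.
records = [
    ("a", 0, 1.0), ("b", 0, 1.0),
    ("a", 1, 0.8), ("b", 1, 0.6),
    ("a", 2, 0.6), ("b", 2, 0.2),
    ("a", 3, 0.4), ("b", 3, 0.0),
]
residuals = residualise(records)
delta_a = late_minus_early(residuals["a"])  # about +0.15
delta_b = late_minus_early(residuals["b"])  # about -0.15
```

Agent "a" has a falling raw reward and still gets a positive delta, which is the point of subtracting the cohort mean: the shared decline cancels and only the relative trend remains.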
Three configurations at 500 steps:
| Long-lived agents (lifespan ≥ 100) | Legacy | Current | --train-freq 1 |
|---|---|---|---|
| Long-lived agents evaluated | 67 | 61 | 63 |
| Median policy.learn() calls | 7 | 33 | 117 |
| Max policy.learn() calls | 110 | 118 | 469 |
| Mean Δ residualised reward (late − early) | +0.0082 | −0.0055 | +0.0069 |
| t-stat for Δ > 0 | +1.15 | −0.79 | +0.97 |
| Agents with Δ > 0 | 35/67 (52%) | 28/61 (46%) | 32/63 (51%) |
| Δ action entropy (nats, late − early) | +0.013 | +0.002 | −0.024 |
| Δ top-action share | −0.014 | +0.002 | +0.003 |
| Mean lifespan (steps) | 306 | 355 | 391 |
| ‖Δw‖₂ p75 | 0.1 | 0.2 | 0.3 |
The numbers do not let me claim “agents make demonstrably better
per-action decisions late in life.” The largest t-stat is +1.15, well
short of the t ≈ 2 threshold for p < 0.05. Even at the most aggressive
training setting (rl_train_freq=1, a median of 117 policy.learn() calls
against 7 under legacy and 33 under the current defaults), the
residualised-reward t-stat is only +0.97.
What I can claim:
- The action distribution starts to commit when training is aggressive enough. At rl_train_freq=1, late-life action entropy drops by 0.024 nats (1.659 → 1.635), a modest move toward a less uniform policy. Under the default rl_train_freq=4, entropy barely changes (+0.002). More gradient steps, more commitment: that is the qualitative signature of learning, just at small amplitude.
- Lifespan keeps improving with training budget: 306 → 355 → 391 steps across legacy → current → max-training. Better policies survive longer, even when the per-action reward signal is noisy.
- Weight movement scales with gradient steps: p75 ‖Δw‖₂ grows 0.1 → 0.2 → 0.3.
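For readers who want to reproduce the entropy and top-action-share proxies, here is a small sketch over a list of action indices. The function names and the toy history are hypothetical; the first-versus-last-quartile split follows the description of the diagnostic above.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy in nats of the empirical action distribution."""
    total = len(actions)
    return -sum((c / total) * math.log(c / total) for c in Counter(actions).values())

def top_action_share(actions):
    """Fraction of choices that went to the single most frequent action."""
    return Counter(actions).most_common(1)[0][1] / len(actions)

def quartile_shift(actions):
    """(entropy delta, top-share delta) between the last and the first quartile."""
    k = max(1, len(actions) // 4)
    early, late = actions[:k], actions[-k:]
    return (action_entropy(late) - action_entropy(early),
            top_action_share(late) - top_action_share(early))

# Toy history of 16 choices: spread out early, concentrated on action 1 late.
history = [0, 1, 2, 3, 0, 2, 1, 3, 1, 1, 2, 1, 1, 1, 1, 0]
d_entropy, d_share = quartile_shift(history)  # entropy falls, top share rises by 0.5
```

For scale, a uniform policy over n actions has entropy ln n, so a shift of 0.024 nats is small next to the 1.6-nat baseline reported above.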
Why per-action reward is so noisy
The dominant variance in per-store reward isn’t the agent’s choice — it’s environment-level shocks (a nearby resource patch depleting, a neighbor moving in, a gather race) that every co-located agent experiences. Residualising against the cohort mean removes a lot of that but not all of it. The per-agent residual standard deviation is ~0.06 across agents, while the policy-driven signal (the mean of the late-minus-early deltas) is ~0.007 — an SNR of roughly 0.1.
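Those two figures can be cross-checked against the table, since a one-sample t-stat is just the mean divided by its standard error. The sketch below backs out the spread from the legacy column (mean +0.0082, t = +1.15, n = 67) and asks how many agents the same effect would need to reach t = 2. It is back-of-envelope arithmetic that assumes the effect size and spread stay fixed; the function names are hypothetical.

```python
import math

def implied_sd(mean_delta, t_stat, n):
    """Sample standard deviation implied by a one-sample t-statistic."""
    return mean_delta / t_stat * math.sqrt(n)

def agents_needed(mean_delta, sd, target_t=2.0):
    """Sample size at which the same mean and spread would reach target_t."""
    return math.ceil((target_t * sd / mean_delta) ** 2)

sd_legacy = implied_sd(mean_delta=0.0082, t_stat=1.15, n=67)  # about 0.058
snr = 0.0082 / sd_legacy                                      # about 0.14
n_required = agents_needed(0.0082, sd_legacy)                 # about 200 agents
```

The implied spread of about 0.058 agrees with the ~0.06 quoted above, and under these assumptions it would take roughly three times as many long-lived agents to push the same effect past t = 2.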
The structural reason in the code is that LearningAgentBehavior never
calls the policy in pure-greedy mode. The active path is
predict_proba(state) × action_weights → np.random.choice, where
predict_proba is a 0.8/0.2 heuristic around the policy’s argmax.
So even a perfectly trained Q-network’s choice goes through one round
of weighted sampling against per-action config priors before being
executed. That dilutes whatever the policy has actually learned. It is
not a bug — action_weights is the chromosome-A
move_weight / gather_weight / ... prior — but it does make the
signal harder to read on per-action reward.
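The dilution is easy to quantify with a toy version of that path. The sketch assumes the 0.2 is spread uniformly over all actions including the argmax, that the product with action_weights is renormalised before sampling, and a six-action space with made-up weights; all names and numbers here are hypothetical.

```python
def heuristic_probs(best_action, n_actions, greedy_mass=0.8):
    """greedy_mass on the argmax plus the remainder spread uniformly over all actions."""
    uniform = (1.0 - greedy_mass) / n_actions
    probs = [uniform] * n_actions
    probs[best_action] += greedy_mass
    return probs

def apply_priors(probs, action_weights):
    """Multiply by per-action priors and renormalise to a distribution."""
    weighted = [p * w for p, w in zip(probs, action_weights)]
    total = sum(weighted)
    return [w / total for w in weighted]

probs = heuristic_probs(best_action=2, n_actions=6)
flat = apply_priors(probs, [1.0] * 6)
skewed = apply_priors(probs, [0.3, 0.3, 0.05, 0.15, 0.1, 0.1])

p_best_flat = flat[2]      # about 0.833
p_best_skewed = skewed[2]  # about 0.568
```

With a prior that disfavors the Q-network's preferred action, the chance of actually executing it falls from about 83% to about 57% in this example, regardless of how well the network is trained.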
What survived
- The diagnostic harness, with the late-life-vs-early-life analysis baked in: scripts/diagnose_dqn_learning.py. --legacy reproduces the pre-fix behavior on current code; --train-freq N overrides the gradient-step cadence; --no-defer runs inline training.
- The four bug fixes plus the exception-logging cleanup, behind one PR and one extended test file (tests/test_decision_config_wiring.py).
- The auto-scale default for max_learning_updates_per_step. The previous default cost ~9× the training volume and was not the kind of thing that should ship as the recommended setting.
- A clearer mental model of where in the stack the signal-to-noise problem lives: not in the optimizer, not in the buffer, but in the predict_proba × action_weights action-selection path that sits between the Q-network and the executed action.
What didn’t survive
The hope behind the fixes: “if I unblock training, late-life decisions will measurably improve at the default horizon.” They don’t, not in a statistically defensible way. The proxies lean in the right direction (entropy down a hair at rl_train_freq=1, lifespan up, residualised reward up by roughly one standard error in two of the three configurations), but none of the reward deltas has a confidence interval that excludes zero.
My original hypothesis: “maybe agents aren’t living long enough.” Falsified. Mean lifespan was already 81 steps with a batch threshold of 32. The bottleneck was global throttling, not individual lifespans.
What’s next
- Longer simulations. 500 steps is enough to see training happen and enough to see lifespan respond; it is not enough to see per-action reward signal beat per-step environment noise. Multi-thousand-step runs are the obvious next experiment, on a smaller population so per-agent gradient budgets stay high.
- Greedy-evaluation probe states. Cache a fixed set of observation tensors early in the run; at intervals, freeze each agent’s Q-net and report argmax + max-Q on the probe set. That gives a deterministic per-policy quality metric independent of stochastic sampling and environment dynamics.
- Audit the predict_proba × action_weights dilution. If we want the trained policy to actually drive behavior, the 0.8/0.2 heuristic in TianshouWrapper.predict_proba probably needs to be replaced with a real softmax over Q-values, and the multiplication by action_weights should probably go through a log-additive bias on the logits rather than a multiplicative reweighting in probability space. This is a design call, not a clear bug.
- Smaller-population runs as a learning sanity check. With 1–3 agents and a fixed resource layout, the cohort baseline goes away and the per-action reward signal should be clean enough to show a classic learning curve.
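As a starting point for the action-selection audit, here is a sketch of a softmax over Q-values with a log-additive prior. It is one possible design, not a proposal for the final API: softmax_policy, temperature, and bias_scale are hypothetical names, the sample Q-values and weights are made up, and all weights are assumed to be strictly positive.

```python
import math

def softmax_policy(q_values, action_weights, temperature=1.0, bias_scale=1.0):
    """Softmax over Q-values with log(action_weight) added to each logit."""
    logits = [q / temperature + bias_scale * math.log(w)
              for q, w in zip(q_values, action_weights)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 0.5]
weights = [0.5, 0.25, 0.25]

biased = softmax_policy(q, weights)
plain = softmax_policy(q, [1.0, 1.0, 1.0])

# Multiplying the plain softmax by the weights and renormalising gives the same result.
reweighted = [p * w for p, w in zip(plain, weights)]
reweighted = [r / sum(reweighted) for r in reweighted]
```

One thing the sketch makes visible: with bias_scale=1.0 the log-additive form is identical to multiplying softmax probabilities by the weights and renormalising, so the practical gain over today's path comes from replacing the 0.8/0.2 heuristic with Q-value-sensitive probabilities. A bias_scale below 1 (or a lower temperature) is what would actually weaken the prior relative to the learned values.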
In short: the DQN is now actually training, but the easiest way to see it learning is no longer to fix code — it’s to design a simulation where the policy has enough room to matter.
Related docs
- Deep Q-learning module reference
- Hyperparameter chromosome design
- PR #878
- Diagnostic script: scripts/diagnose_dqn_learning.py