CPNS Lab demo and method overview

Hierarchical Polyphonic Active Inference in a Partially Observed Gridworld

This demo shows a polyphonic active inference agent navigating a gridworld with goals, walls, hazards, charging stations, partial observability, and battery-constrained action. Rather than following a single reward signal, behaviour emerges from the interaction of multiple voices that evaluate candidate policies under different priorities such as safety, goal pursuit, uncertainty reduction, energy preservation, and habit.

Overview

The model places an active inference agent in a discrete two-dimensional world containing a current goal, impassable walls, charging stations, and hidden hazards. The environment is only partially observed, so the agent must maintain and update probabilistic beliefs about local threat structure as it moves. This is not simply path planning. It is a belief-guided control problem in which action depends on inferred context, uncertainty, and internal constraints as much as on the goal itself.

A central idea of the polyphonic framework is that action should not be treated as the output of a single monolithic utility function. Instead, distinct control units or voices express partially competing priorities. A higher-level latent mode then shapes how much influence each voice has at a given moment, allowing the overall policy to change as the situation changes.

World and internal architecture

Environment

The grid contains walls, one current goal, charging stations, and true hazards that are not directly visible unless sampled through noisy local observation.

State

The agent maintains a position in the grid together with a battery state that influences both preferences and transition success.

Observations

At each step the agent sees only a local patch around itself and receives noisy hazard cues, forcing it to infer danger rather than read it off directly.
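This local, noisy observation step can be sketched as follows. The grid layout, patch radius, and cue flip probability below are illustrative assumptions, not the demo's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_local(true_hazard, pos, radius=1, flip_prob=0.15):
    """Sample a noisy binary hazard cue for each cell in a local patch.

    true_hazard : 2D bool array of ground-truth hazards (hypothetical layout).
    pos         : (row, col) agent position.
    Returns a dict {(r, c): cue} restricted to in-bounds cells.
    """
    H, W = true_hazard.shape
    r0, c0 = pos
    cues = {}
    for r in range(r0 - radius, r0 + radius + 1):
        for c in range(c0 - radius, c0 + radius + 1):
            if 0 <= r < H and 0 <= c < W:
                cue = bool(true_hazard[r, c])
                if rng.random() < flip_prob:  # sensor noise flips the cue
                    cue = not cue
                cues[(r, c)] = cue
    return cues

# A small hypothetical 5x5 world with one true hazard.
world = np.zeros((5, 5), dtype=bool)
world[2, 3] = True
obs = observe_local(world, (2, 2))
```

Because the cues can be flipped, the agent cannot read danger off directly and must accumulate evidence over repeated visits.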

Beliefs

A hazard belief map stores the posterior probability that each location is dangerous, and a separate reset memory map marks recently catastrophic outcomes as temporarily more aversive.
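The cellwise belief update can be sketched as a single Bayesian step under an assumed binary sensor model, followed by mild regularisation toward the prior. The sensor probabilities and regularisation rate below are illustrative, not taken from the demo:

```python
def update_hazard_belief(p, cue, p_true=0.85, p_false=0.15, prior=0.1, reg=0.02):
    """One cellwise Bayesian update of P(hazard) from a noisy binary cue.

    p       : current belief P(hazard) for one cell.
    cue     : observed binary hazard cue.
    p_true  : assumed P(cue=1 | hazard)    -- sensor model assumption.
    p_false : assumed P(cue=1 | no hazard).
    """
    like_h = p_true if cue else 1.0 - p_true
    like_s = p_false if cue else 1.0 - p_false
    post = like_h * p / (like_h * p + like_s * (1.0 - p))
    # Mild regularisation: drift softly back toward the prior so stale
    # beliefs do not harden permanently.
    return (1.0 - reg) * post + reg * prior
```

Repeated consistent cues drive the belief toward certainty, while the regularisation term keeps unvisited cells anchored near the prior.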

Modes

The agent infers a posterior over behavioural modes such as Explore, Pursue Goal, Recharge, Avoid Threat, and Verify, based on context including battery urgency, uncertainty, and local threat.
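One minimal way to realise this mode inference is to score each mode from the context features and normalise with a softmax. The linear scoring rules and temperature below are assumptions for the sketch, not the demo's actual parameters:

```python
import numpy as np

MODES = ["Explore", "PursueGoal", "Recharge", "AvoidThreat", "Verify"]

def infer_mode_posterior(battery_urgency, local_threat, uncertainty,
                         goal_dist, temp=1.0):
    """Softmax posterior over behavioural modes from context features.

    All inputs are assumed to be normalised to roughly [0, 1].
    The weights in each score are illustrative assumptions.
    """
    scores = np.array([
        uncertainty - 0.5 * goal_dist,    # Explore: favoured when uncertain
        1.0 - goal_dist - local_threat,   # PursueGoal: short, safe routes
        2.0 * battery_urgency,            # Recharge: dominated by battery
        2.0 * local_threat,               # AvoidThreat: dominated by danger
        uncertainty + local_threat,       # Verify: uncertain *and* risky
    ])
    z = np.exp((scores - scores.max()) / temp)  # stable softmax
    return dict(zip(MODES, z / z.sum()))
```

With this scoring, high battery urgency pushes the posterior toward Recharge, and high local threat toward Avoid Threat, which is the qualitative behaviour the text describes.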

Voices

Safety, Goal, Epistemic, Energy, and Habit voices each evaluate the same candidate futures using a different weighting over cost and information terms.

Expected free energy and policy evaluation

For each voice, the agent evaluates a finite set of short-horizon policies. These candidate futures are rolled forward under approximate transition dynamics and scored with an expected-free-energy-style decomposition. In this implementation, the planner combines three broad terms: risk, ambiguity, and epistemic value.

G_k(π) = w_risk R_k(π) + w_amb A_k(π) - w_epi E_k(π)

Risk captures mismatch with preferences, including hazard exposure, distance from the current subgoal, low battery, charger viability failure, reset-memory penalties, and control costs. Ambiguity reflects uncertainty in outcomes, approximated here from the entropy of hazard beliefs at predicted locations. Epistemic value acts as an information-seeking term, favouring locations that remain uncertain and relatively unvisited.
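Under this decomposition, the voices can share the same rolled-out risk, ambiguity, and epistemic estimates and differ only in their weights. A sketch with illustrative weight values (the demo's actual parameters are not given in the text):

```python
# Illustrative per-voice weights over the (risk, ambiguity, epistemic)
# terms of G_k(π) = w_risk R_k(π) + w_amb A_k(π) - w_epi E_k(π).
# These values are assumptions for the sketch, not the demo's parameters.
VOICE_WEIGHTS = {
    #             w_risk  w_amb  w_epi
    "Safety":    (3.0,    1.5,   0.2),
    "Goal":      (1.0,    0.5,   0.2),
    "Epistemic": (0.5,    0.5,   2.0),
    "Energy":    (2.0,    0.5,   0.1),
    "Habit":     (0.8,    0.3,   0.1),
}

def voice_efe(voice, risk, ambiguity, epistemic):
    """Expected free energy of one candidate policy as seen by one voice.

    Lower G means the voice prefers the policy; the epistemic term enters
    with a minus sign, so informative policies are favoured.
    """
    w_risk, w_amb, w_epi = VOICE_WEIGHTS[voice]
    return w_risk * risk + w_amb * ambiguity - w_epi * epistemic
```

The same rollout statistics thus yield five different rankings over candidate futures, which is what gives the later mixing step something to arbitrate between.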

q_k(π) ∝ exp(-β_π G_k(π))

Each voice therefore forms its own posterior over policies. These voice-wise posteriors are then mixed using adaptive voice weights w_t(k), derived from the current mode posterior, giving a polyphonic posterior over action rather than a single-system decision rule.

q(π) = Σ_k w_t(k) q_k(π)
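These two steps, the per-voice softmax and the weighted mixture, can be sketched directly. The precision β and the example weights are assumptions:

```python
import numpy as np

def policy_posterior(G, beta=4.0):
    """Per-voice posterior over policies: q_k(π) ∝ exp(-β G_k(π)).

    Subtracting the minimum before exponentiating keeps the softmax
    numerically stable without changing the normalised result.
    """
    z = np.exp(-beta * (G - G.min()))
    return z / z.sum()

def polyphonic_posterior(G_by_voice, voice_weights):
    """Mix voice-wise policy posteriors with the current voice weights."""
    q = np.zeros_like(next(iter(G_by_voice.values())), dtype=float)
    for voice, G in G_by_voice.items():
        q += voice_weights[voice] * policy_posterior(np.asarray(G, float))
    return q / q.sum()
```

Because the mixture is over posteriors rather than over raw scores, a voice with a strongly peaked preference can dominate even when its weight is moderate, which is one source of the arbitration dynamics discussed later.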

Why this matters

The model sits between a simple reward-maximising controller and a full exact active inference scheme. It is more structured than conventional utility maximisation because it separates competing imperatives, explicitly represents uncertainty, uses hierarchical context inference, and supports subgoal-based route restructuring.

At the same time, it remains computationally tractable and visually interpretable, which makes it useful both as a working agent and as a conceptual platform for studying how inference, arbitration, and planning interact.

Algorithm at a glance

1. Observe locally. The agent samples a small neighbourhood and receives noisy binary hazard cues.
2. Update hazard beliefs. Cellwise beliefs are revised approximately using Bayesian updates and mild regularisation toward a prior.
3. Infer behavioural mode. Contextual variables such as battery urgency, threat, uncertainty, and distance to goal determine a posterior over modes.
4. Select or retain a subgoal. Depending on mode, the internal target may be the true goal, a safe refuge, a charger route, or an uncertainty-relevant location.
5. Update voice weights. The mode posterior is converted into a smoothed mixture over Safety, Goal, Epistemic, Energy, and Habit voices.
6. Evaluate policies. Each voice rolls out candidate short-horizon action sequences and computes an approximate expected free energy.
7. Integrate polyphonically. Voice-specific policy posteriors are combined into a global posterior over policies and then over first actions.
8. Act under battery-gated dynamics. The selected action may still fail when battery is low, creating a distinction between intended and realised transitions.
9. Learn from consequences. If the agent hits a true hazard it is reset, the local threat representation is strengthened, and reset-memory is boosted.
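The battery-gated dynamics of step 8, where intended and realised transitions can diverge, can be sketched as follows. The grid size, drain rate, and failure probability are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action, battery, walls, fail_below=0.3):
    """Battery-gated transition on an assumed 5x5 grid.

    Each step drains charge; below the fail_below threshold the intended
    move may fail, so the realised transition differs from the intended one.
    Moves into walls or off the grid leave the agent in place.
    """
    battery = max(0.0, battery - 0.02)  # assumed per-step drain
    if battery < fail_below and rng.random() < 0.5:
        return pos, battery  # action fails: intended != realised
    dr, dc = MOVES[action]
    nxt = (pos[0] + dr, pos[1] + dc)
    if nxt in walls or not (0 <= nxt[0] < 5 and 0 <= nxt[1] < 5):
        return pos, battery  # blocked: stay in place
    return nxt, battery
```

This stochastic failure channel is what makes the Energy voice and the Recharge mode more than decorative: plans that ignore battery state become unreliable exactly when the agent can least afford it.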

Interpretation

One useful aspect of this framework is that failures are often informative rather than merely undesirable. Oscillations between goal pursuit and threat avoidance, deadlocks near bottlenecks, repeated visits to risky regions, or unstable subgoal switching can reveal which representational or planning ingredients are still missing from the model.

In that sense, the gridworld functions as more than a toy benchmark. It becomes a transparent setting in which theoretical assumptions about inference, arbitration, and internal control structure can be inspected directly.

Possible next steps

Natural extensions include smoother probabilistic threat fields, more expressive route-level subgoal inference, explicit belief updates over path structure, policy pruning or tree search for longer horizons, and a fuller discrete-state active inference formalism with explicit likelihood and transition matrices.

The broader motivation is to move toward richer active inference agents in which planning, uncertainty, internal drives, and hierarchical arbitration interact in a reusable and interpretable way.