Hierarchical polyphonic active inference in a 3D PyBullet world

A drone that acts on beliefs, not truth

This project implements a physically embodied drone agent that pursues a beacon in a cluttered 3D environment using a polyphonic active-inference-style architecture. The agent does not act directly on simulator truth. Instead, simulator state is converted into noisy observations, posterior beliefs are updated online, and action is selected by a short-horizon controller whose objectives are modulated by a slower hierarchical latent layer over visibility, progress, affordance, and behavioural context.

The result is an interpretable control system that can maintain a remembered target under occlusion, switch between search, approach, repositioning, recovery, and stabilisation, and use uncertainty and scene interpretation to arbitrate behaviour in a way that is much closer to a real inferential agent than a hand-scripted planner.

PyBullet · Belief-state control · Egocentric sensing · Approximate VMP / Laplace updates · Polyphonic control · Hierarchical latent scene inference

Demonstration run from the final Stage C1.1 controller. The overlay shows target belief, observation status, and higher-level latent context driving behavioural arbitration.
  • 1152 control steps in the attached successful run
  • 0.27 m minimum distance to target achieved
  • 538 steps with direct target visibility
  • 6 distinct behavioural regimes represented in the mode layer

Why this system is interesting

Many active inference demos either operate in very small discrete worlds or rely on hidden shortcuts that make control easier than it first appears. This system was built to push beyond that. The drone lives in a continuous 3D environment, receives imperfect local observations, and must maintain a target belief under partial occlusion while avoiding obstacles and stabilising flight. It does not simply chase ground-truth coordinates.

The more distinctive ingredient is the polyphonic control architecture. Instead of reducing behaviour to a single scalar objective, the controller combines multiple pressures: pragmatic target pursuit, safety, altitude and heading stabilisation, open-space preference, uncertainty reduction, and mode-specific contextual drives. These pressures do not remain fixed. In the final version, they are modulated by a slower latent layer that tries to infer what kind of situation the agent is currently in.

In other words, the agent does not only ask “where is the target?”. It also asks “am I stalled?”, “is the target merely occluded or effectively lost?”, and “should I exploit my current target memory or gather better evidence first?”.

System architecture

1. World and sensors

The drone moves in a PyBullet environment with obstacles and a target beacon. The simulator provides truth internally, but the controller only receives noisy derived observations.

  • noisy self-pose and yaw observation
  • egocentric target cues: range, bearing, elevation
  • ray-based obstacle distances
  • soft target cues under chronic occlusion

2. Fast belief layer

Posterior beliefs are maintained over self state, target state, and local obstacle structure. This is the first layer that converts raw observations into hidden-state estimates.

  • self belief: pose, velocity, yaw
  • target belief: mean, covariance, confidence
  • local map belief: ray-space free-space structure
  • mode beliefs: search, rollout, reposition, recovery, hold, hover
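The target belief above can be pictured as a small container of sufficient statistics. The following is a minimal sketch (the class and method names are hypothetical, not the project's actual `belief_state.py` API) showing the mean/covariance/confidence triple and the occlusion-decay behaviour described later:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class TargetBelief:
    """Hypothetical sketch of the target belief: Gaussian mean and
    covariance in world frame, plus a scalar confidence."""
    mean: np.ndarray = field(default_factory=lambda: np.zeros(3))
    cov: np.ndarray = field(default_factory=lambda: np.eye(3))
    confidence: float = 1.0

    def decay(self, inflate: float = 1.05, conf_decay: float = 0.98) -> None:
        """Under chronic occlusion: inflate covariance and decay confidence,
        so the target stays remembered without becoming overconfident."""
        self.cov = self.cov * inflate
        self.confidence = self.confidence * conf_decay
```

The point of keeping covariance and confidence separate is that the controller can arbitrate on confidence while the rollout scoring consumes the full covariance.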

3. Slow scene/context layer

A slower latent layer summarises what kind of situation the agent is in and feeds top-down biases back into control.

  • visibility regime: visible / soft-occluded / lost
  • progress regime: advancing / stalled / trapped
  • affordance regime: good-vantage / poor-vantage / blocked
  • context precisions: exploit vs explore, memory, safety, epistemic, goal

Behavioural modes

The controller does not issue commands from a single global optimiser. It arbitrates among a small family of interpretable behavioural regimes, each of which can dominate under different belief and context conditions.

  • search_reacquire: broad scanning when target memory is weak
  • efe_rollout: short-horizon forward evaluation of candidate commands
  • vantage_reposition: sidestepping or reorientation to recover line of sight
  • recovery: conservative control under poor local geometry
  • goal_hold: local stabilisation near target
  • success_hover: terminal hovering once the success condition is satisfied

The key point is that these are no longer brittle if-else switches. In the Stage B5 and Stage C variants, the controller maintains posterior-like probabilities over modes and uses those probabilities as an arbitration layer.

Polyphonic control

“Polyphonic” here means that action is not determined by one monolithic score. Several behavioural voices contribute pressure simultaneously, and their relative influence changes with the inferred situation.

  • pragmatic voice: reduce distance and alignment error to the beacon
  • safety voice: avoid local obstacle risk and high-collision trajectories
  • stability voice: maintain altitude, heading, and smooth control
  • epistemic voice: prefer actions expected to reduce uncertainty or improve visibility
  • open-space voice: drift toward more informative / less constrained local geometry

The scene/context layer acts mainly by modulating the effective precision of these voices, rather than replacing the controller with a second heavy planner.

Mathematical formulation

Fast hidden states

Let the fast hidden state factor as

$$x_t = \{x_t^{\mathrm{self}},\; x_t^{\mathrm{target}},\; x_t^{\mathrm{map}}\}.$$

In the current implementation these correspond approximately to:

$$x_t^{\mathrm{self}} = [p_x,p_y,p_z,v_x,v_y,v_z,\psi,\dot\psi],$$ $$x_t^{\mathrm{target}} = [g_x,g_y,g_z],$$ $$x_t^{\mathrm{map}} = [d_1,\dots,d_K],$$

where $d_k$ denotes the inferred obstacle distance along ray direction $k$.

Observations

The controller sees a noisy observation vector

$$o_t = \{o_t^{\mathrm{self}},\; o_t^{\mathrm{target}},\; o_t^{\mathrm{rays}}\}.$$

These are generated from simulator truth but presented to the controller as noisy measurements:

$$o_t^{\mathrm{self}} = H_s x_t^{\mathrm{self}} + \omega_t^{s},$$ $$o_t^{\mathrm{rays}} = x_t^{\mathrm{map}} + \omega_t^{r},$$ $$o_t^{\mathrm{target}} = h(x_t^{\mathrm{self}}, x_t^{\mathrm{target}}) + \omega_t^{g},$$

where $h(\cdot)$ returns egocentric range, bearing, and elevation when the target is visible or softly observable.
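As a concrete sketch of $h(\cdot)$, the egocentric cue can be computed from relative geometry: range is the Euclidean distance, bearing is the target azimuth relative to the drone's yaw, and elevation is the angle above the horizontal plane. The function below is illustrative (names and noise model are assumptions, not the project's `observation_model.py`):

```python
import numpy as np

def egocentric_target_obs(self_pos, self_yaw, target_pos,
                          noise_std=(0.05, 0.02, 0.02), rng=None):
    """Sketch of h(x_self, x_target): noisy egocentric range,
    bearing (relative to heading), and elevation."""
    if rng is None:
        rng = np.random.default_rng()
    delta = np.asarray(target_pos, dtype=float) - np.asarray(self_pos, dtype=float)
    dist = np.linalg.norm(delta)
    bearing = np.arctan2(delta[1], delta[0]) - self_yaw
    bearing = (bearing + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi]
    elevation = np.arcsin(delta[2] / max(dist, 1e-9))
    noise = rng.normal(0.0, np.asarray(noise_std))
    return np.array([dist, bearing, elevation]) + noise
```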

Approximate posterior updates

The fast belief layer maintains approximate Gaussian posteriors over continuous latent states,

$$q(x_t^{\mathrm{self}}) = \mathcal{N}(\mu_t^{s}, \Sigma_t^{s}), \qquad q(x_t^{\mathrm{target}}) = \mathcal{N}(\mu_t^{g}, \Sigma_t^{g}).$$

In the implementation, these are updated by lightweight prediction-correction steps rather than a fully general symbolic message-passing engine. The spirit is variational: beliefs are predicted forward under a transition model, corrected by precision-weighted observation errors, and retained as the sufficient statistics driving control.

$$\mu_{t|t-1} = f(\mu_{t-1}, u_{t-1}),$$ $$\Sigma_{t|t-1} = A_t \Sigma_{t-1} A_t^\top + Q,$$ $$\mu_t = \mu_{t|t-1} + K_t \big(o_t - h(\mu_{t|t-1})\big),$$ $$\Sigma_t = (I - K_t H_t)\Sigma_{t|t-1}.$$

For the target, egocentric cues are reconstructed into world-frame target estimates using the current self belief. Under chronic occlusion, soft cues are treated as weak evidence: confidence decays, covariance inflates, and the system is prevented from becoming overconfident under poor visibility.
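For the linear-Gaussian case, one prediction-correction cycle from the equations above is a standard Kalman step. This is a stand-in sketch (assuming linear $f$ and $h$ given by matrices $A$, $B$, $H$), not the project's actual `vmp_filter.py`:

```python
import numpy as np

def kalman_step(mu, Sigma, u, o, A, B, Q, H, R):
    """One prediction-correction cycle: predict the belief forward under
    the transition model, then apply a precision-weighted correction."""
    # Prediction: mu_{t|t-1}, Sigma_{t|t-1}
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correction: Kalman gain weights the observation error by precision
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (o - H @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_pred
    return mu_new, Sigma_new
```

Observed dimensions contract in variance while unobserved dimensions keep accumulating process noise, which is exactly the behaviour the target belief relies on under occlusion.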

Polyphonic policy scoring

For each candidate short-horizon control sequence $\pi_k$, the controller rolls the dynamics forward and computes a composite score with pragmatic, safety, stability, and epistemic components. A stylised form is:

$$G(\pi_k) = w_g J_{\mathrm{goal}}(\pi_k) + w_s J_{\mathrm{safety}}(\pi_k) + w_h J_{\mathrm{stability}}(\pi_k) + w_e J_{\mathrm{epistemic}}(\pi_k) + w_o J_{\mathrm{open}}(\pi_k).$$

Representative terms include:

$$J_{\mathrm{goal}}(\pi_k) \approx \mathbb{E}_{q}[\,\|p_t - g_t\|\,],$$ $$J_{\mathrm{safety}}(\pi_k) \approx \mathbb{E}_{q}[\,\phi(d_{\min})\,],$$ $$J_{\mathrm{epistemic}}(\pi_k) \approx \mathrm{tr}(\Sigma_{t+H}^{g}) + \lambda_{\mathrm{occ}}\,C_{\mathrm{occ}}(\pi_k),$$

where $C_{\mathrm{occ}}$ is an occlusion proxy and the weights are modulated by the slow latent context state. Lower $G(\pi_k)$ is preferred.
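A stripped-down version of this scoring can be sketched with a point-mass rollout. The code below keeps only the goal, safety, and stability terms (the epistemic and open-space terms are omitted for brevity), and all names and the barrier shape $\phi$ are illustrative assumptions rather than the project's `autopilot.py`:

```python
import numpy as np

def rollout_score(p0, v_cmd, goal, obstacles, w, horizon=10, dt=0.1):
    """Composite score G(pi_k) for one candidate velocity command,
    rolled forward on a point mass. Lower is better."""
    p = np.asarray(p0, dtype=float)
    J_goal = 0.0
    J_safety = 0.0
    for _ in range(horizon):
        p = p + dt * np.asarray(v_cmd, dtype=float)
        J_goal += np.linalg.norm(p - goal)
        if len(obstacles):
            d_min = min(np.linalg.norm(p - np.asarray(ob)) for ob in obstacles)
            J_safety += np.exp(-4.0 * d_min)   # barrier-like penalty phi(d_min)
    J_stab = float(np.linalg.norm(v_cmd))      # prefer gentle commands
    return (w["goal"] * J_goal
            + w["safety"] * J_safety
            + w["stability"] * J_stab)

def best_command(p0, candidates, goal, obstacles, w):
    """Select the candidate with the lowest composite score."""
    return min(candidates, key=lambda v: rollout_score(p0, v, goal, obstacles, w))
```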

Discrete mode beliefs

Behaviour is also mediated by a discrete posterior-like belief over controller modes,

$$q(m_t) = \mathrm{softmax}(\ell_t),$$

where the mode logits $\ell_t$ depend on target confidence, visibility status, occlusion duration, local free-space structure, progress, and proximity to the target. This gives a compact inferential layer over behavioural regime selection.
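A minimal sketch of this mode belief, assuming a hand-chosen linear feature map for the logits (the real controller's features are richer, and the coefficients below are purely illustrative):

```python
import numpy as np

MODES = ["search_reacquire", "efe_rollout", "vantage_reposition",
         "recovery", "goal_hold", "success_hover"]

def mode_belief(logits):
    """q(m_t) = softmax(l_t), computed with the usual stability shift."""
    l = np.asarray(logits, dtype=float)
    l = l - l.max()
    q = np.exp(l)
    return q / q.sum()

def mode_logits(target_conf, visible, occl_steps, progress, dist_to_goal):
    """Illustrative logit features over the six modes."""
    return np.array([
        2.0 * (1.0 - target_conf),            # search when memory is weak
        2.0 * target_conf + 1.0 * visible,    # rollout when confident
        1.5 * min(occl_steps / 20.0, 1.0),    # reposition under occlusion
        1.0 * (1.0 - progress),               # recovery when stalled
        3.0 * (dist_to_goal < 0.5),           # hold when close
        0.0,                                  # hover handled by success check
    ])
```

Because the output is a full distribution rather than a hard switch, downstream arbitration can blend voices from several modes at once.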

Slow scene and context inference

The Stage C extension introduces a slower latent state

$$z_t = \{z_t^{\mathrm{vis}},\; z_t^{\mathrm{prog}},\; z_t^{\mathrm{aff}},\; z_t^{\mathrm{ctx}}\},$$

capturing visibility regime, progress regime, affordance regime, and contextual precision control. In practice this layer is updated from smoothed evidence over multiple timesteps and feeds back into control via precision-like variables such as exploit-vs-explore balance, target-memory precision, epistemic precision, safety precision, and goal precision.

$$w_i^{\mathrm{eff}} = w_i \cdot \rho_i(z_t),$$

so the higher layer changes how strongly different control voices are trusted without replacing the lower layer entirely.
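The precision modulation itself is a small operation. A sketch, assuming the context is summarised as a dictionary of multipliers (the field names are hypothetical):

```python
def effective_weights(base_w, ctx):
    """w_i_eff = w_i * rho_i(z_t): the slow context scales, but never
    replaces, the base voice weights. Missing fields default to 1.0."""
    rho = {
        "goal":      ctx.get("exploit", 1.0),
        "epistemic": ctx.get("explore", 1.0),
        "safety":    ctx.get("safety", 1.0),
        "open":      ctx.get("explore", 1.0),
        "stability": 1.0,
    }
    return {k: v * rho.get(k, 1.0) for k, v in base_w.items()}
```

Keeping the modulation multiplicative means a degenerate context estimate can weaken a voice but cannot invert its sign, which bounds how badly the slow layer can mislead the fast controller.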

Variational free energy view

Although the implementation is deliberately lightweight, the architecture can be interpreted in standard active inference terms. The fast layer minimises a variational free energy over hidden states,

$$F[q] = \mathbb{E}_{q(x_t)}\big[\ln q(x_t) - \ln p(o_t, x_t \mid u_{1:t-1})\big],$$

while action selection approximately minimises an expected free energy surrogate over candidate policies,

$$\pi^* = \arg\min_{\pi} G(\pi).$$

In the present system, $G(\pi)$ is not derived from a fully exact deep generative model, but it is close enough in form and function to support the key active inference intuition: policies are selected not only for immediate pragmatic value, but also for their relationship to uncertainty, visibility, safety, and future evidence.

Implementation sketch

The codebase is organised around a clear separation between world, sensing, inference, and control:

```
env/
  drone.py              # drone body and world dynamics
sensing/
  raycast_sensor.py     # local obstacle sensing
  observation_model.py  # noisy self and target observations
belief/
  belief_state.py       # fast, discrete, and slow latent beliefs
  vmp_filter.py         # approximate inference / prediction-correction
control/
  autopilot.py          # polyphonic controller and mode arbitration
utils/
  logging_utils.py      # structured logs and overlays
main.py                 # full loop: truth → observations → beliefs → action
```

The central engineering decision is that the controller never consumes simulator truth directly. Truth exists only to generate noisy observations, which are then filtered into posterior beliefs. This is what makes the project an inferential control system rather than simply a sophisticated heuristic tracker.

Control loop

```
for each timestep t:
    truth_state  = simulator.read_state()
    observations = observation_model.sample(truth_state)
    belief_state = vmp_filter.step(belief_state, observations, previous_action)
    command, diagnostics, voice_info = controller.compute_command(belief_state)
    simulator.apply(command)
    logger.log(truth_state, observations, belief_state, diagnostics)
```

This separation proved crucial during development. Earlier versions became brittle when the target was treated as either fully known or too quickly “lost”. The final Stage C1.1 system works because the target can remain remembered under uncertainty without becoming unrealistically certain under chronic occlusion.

Observed behaviour in the attached run

In the successful demonstration run attached to this page, the drone executes a mixture of search, rollout-based pursuit, recovery, and local stabilisation. Logs from that run show:

  • 1152 total control steps
  • 782 steps in rollout mode
  • 198 steps in search/reacquisition
  • 98 steps in goal-hold stabilisation
  • 73 steps in recovery
  • minimum target distance of approximately 0.27 m
  • direct target visibility on 538 steps and soft target cues on 614 steps

These numbers matter because they show that the final controller is not doing one thing all the time. It is switching intelligently between exploitative pursuit, evidence-seeking, and local stabilisation.