Hierarchical priors enable neural prediction of perceived biological motion

  1. Centre for Mind/Brain Sciences, University of Trento, Rovereto, Italy
  2. Donders Centre for Cognitive Neuroimaging, Donders Institute for Brain, Cognition, and Behaviour, Radboud University, Nijmegen, Netherlands

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Iris Groen
    University of Amsterdam, Amsterdam, Netherlands
  • Senior Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America

Reviewer #1 (Public review):

Summary

The authors apply dynamic representational similarity analysis (dRSA), a method introduced in de Vries and Wurm 2023, to source-reconstructed MEG data from 40 participants who viewed ballet dancing sequences under three conditions: normal viewing, up-down inversion, and temporal piecewise scrambling. In normal viewing, they replicate their previous finding of a hierarchical pattern of leading-edge neural representations, with view-invariant body motion represented earliest in time (around 500 ms before the corresponding stimulus state), followed by view-dependent body motion (around 200 ms) and pixelwise motion (around 150 ms). Inversion selectively attenuates the leading-edge representation of view-invariant body motion while enhancing view-dependent body motion. Scrambling abolishes all leading-edge motion representations and instead increases post-stimulus representations of body posture. The authors interpret these findings as evidence that biological motion perception relies on a hierarchy of priors operating within a predictive-processing framework, with inversion specifically disrupting holistic priors and scrambling disrupting kinematics priors.
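For readers unfamiliar with the method, the core idea of a lagged representational similarity analysis can be sketched in a few lines. This is a simplified illustration on synthetic data, not the authors' pipeline: plain Pearson correlation stands in for the principal component regression used in the manuscript, and all names and parameters (`rdm_vectors`, `shift`, the array sizes) are assumptions made for the example. Lag is defined here as neural time minus model time, so a peak at a negative lag means the neural geometry matches a stimulus state that has not yet occurred.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_times, n_feat, shift = 12, 60, 5, 4  # hypothetical sizes

def rdm_vectors(data):
    """(items, times, features) -> (times, pairs) of pairwise Euclidean distances."""
    iu = np.triu_indices(data.shape[0], k=1)
    diff = data[:, None] - data[None, :]      # (items, items, times, features)
    dist = np.linalg.norm(diff, axis=-1)      # (items, items, times)
    return dist[iu].T                         # (times, pairs)

def zscore(x):
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# Synthetic "model" features: a random walk per item (temporally autocorrelated).
full = np.cumsum(rng.standard_normal((n_items, n_times + shift, n_feat)), axis=1)
# Synthetic "neural" data: the model `shift` samples ahead of time, plus noise.
neural = full[:, shift:] + 0.1 * rng.standard_normal((n_items, n_times, n_feat))
model = full[:, :n_times]

# Cross-temporal similarity: correlate the neural RDM at each time point
# with the model RDM at each time point.
zn, zm = zscore(rdm_vectors(neural)), zscore(rdm_vectors(model))
sim = zn @ zm.T / zn.shape[1]                 # (neural time, model time)

# Collapse to a lag curve by averaging along diagonals.
max_lag = 10
lags = np.arange(-max_lag, max_lag + 1)       # lag = neural time - model time
curve = np.array([np.diagonal(sim, offset=-int(l)).mean() for l in lags])
peak_lag = int(lags[np.argmax(curve)])
```

In this toy case the neural data were constructed to run ahead of the model, so `peak_lag` falls at `-shift`. The sketch says nothing about which mechanism produces such a lead in real data.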

Strengths

The empirical work is careful and technically ambitious. The dRSA framework introduced in the 2023 paper is a useful methodological contribution to the study of dynamic neural representations, and the present manuscript extends it in well-motivated directions. The dataset is substantial: 40 participants, source-reconstructed MEG, three within-subject conditions. The replication of the 2023 normal-condition findings in an independent 40-subject sample is solid, which is increasingly rare and welcome in the field. The inversion and scrambling manipulations are well-motivated, and the conditions are matched on stimulus identity. Principal component regression is used appropriately to handle the genuine challenge of correlated and autocorrelated stimulus features, and the authors validate this choice through simulations. Eye position is included as a covariate and successfully regressed out, addressing a common confound in MEG decoding work. Behavioral catch trials demonstrate that participants attended to the stimuli across conditions. Both frequentist and Bayesian statistics are reported with appropriate corrections for multiple comparisons. The inversion result, in particular, is striking, and the asymmetry between view-invariant and view-dependent representations is informative.

Weaknesses

The central interpretive step in the manuscript treats a negative-lag dRSA peak as direct evidence for active hierarchical predictive inference. The data are equally consistent with at least three other accounts that the manuscript does not engage with, and the conclusion is therefore stronger than the data support.

First, the leading-edge dRSA signature is a natural consequence of nonlinear temporal integration of autocorrelated stimulus features. A long line of work from the Winawer and Grill-Spector labs (Zhou et al. 2018, Zhou et al. 2019, Stigliani et al. 2017, Kim et al. 2024) has established that the human visual cortex implements compressive temporal summation with delayed divisive normalization and that temporal integration windows progressively increase from early to higher visual areas. A nonlinear-summation response to an autocorrelated feature encodes deviations from the recent baseline. For smooth trajectories, this is essentially a local derivative, and the derivative inherits the trajectory's leading edge as a free consequence - no predictive machinery required. The integration-window hierarchy that Kim et al. (2024) recovered from voxelwise spatiotemporal pRFs maps onto the 150 / 200 / 500 ms hierarchy reported here almost one-for-one. That alignment is unlikely to be coincidental and deserves explicit treatment.
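The derivative argument above can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration, not a model of visual cortex and not an analysis of the manuscript's data: the "stimulus feature" is Gaussian-smoothed white noise, the "response" is simply its local temporal derivative, and the names `stim`, `resp`, and `lagged_corr` are hypothetical. Note the sign convention, which is specific to this example: a positive `lag` means the response at time t is compared with the stimulus at a later time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth, autocorrelated 1-D feature: white noise convolved with a Gaussian kernel.
n, sigma = 20000, 25                      # samples and kernel width (arbitrary)
t = np.arange(-4 * sigma, 4 * sigma + 1)
kernel = np.exp(-0.5 * (t / sigma) ** 2)
stim = np.convolve(rng.standard_normal(n), kernel, mode="same")

# Toy response: the local temporal derivative of the feature.
# It is computed from neighbouring samples only; no forward model is involved.
resp = np.gradient(stim)

def lagged_corr(resp, stim, lag):
    """Pearson correlation of resp[t] with stim[t + lag]."""
    if lag > 0:
        a, b = resp[:-lag], stim[lag:]
    elif lag < 0:
        a, b = resp[-lag:], stim[:lag]
    else:
        a, b = resp, stim
    return np.corrcoef(a, b)[0, 1]

lags = np.arange(-100, 101)
corr = np.array([lagged_corr(resp, stim, int(k)) for k in lags])
best_lag = int(lags[np.argmax(corr)])     # lag at which the response best matches the stimulus
```

Under these assumptions the correlation is near zero at lag 0 and peaks at a positive lag, that is, the derivative-like response resembles future states of the feature more than current ones. Whether the same holds for representational geometries of multidimensional features, as measured by dRSA, would need to be tested directly.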

Second, the experimental design places participants firmly in the regime where Dayan's successor representation (SR) predicts that the brain holds a precompiled associative cache of trajectory structure. Each unique sequence is presented approximately 47 times across the experiment. An SR in Dayan's original formulation is a precompiled lookup table, not an online inference engine - querying it during familiar trajectories produces leading-edge representations through passive associative retrieval, mechanistically distinct from active prediction despite producing similar signatures. The senior author's own lab has demonstrated SR-like representations in V1 (Ekman, Kusch, de Lange 2023 eLife), but this paper is not cited or engaged with in the present manuscript despite its direct relevance.
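A minimal sketch of the lookup-table character of the SR may help here. It assumes a deterministic, fully learned sequence of discrete states and uses the standard closed form M = (I - γT)^(-1); the sequence length, discount factor, and variable names are hypothetical choices for illustration and are not taken from the manuscript or from Ekman et al.

```python
import numpy as np

n_states, gamma = 8, 0.8          # hypothetical sequence length and discount factor

# Deterministic transitions for one well-learned sequence: state i -> i + 1.
T = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    T[i, i + 1] = 1.0             # the last state is terminal (all-zero row)

# Successor representation: M = sum_k gamma^k T^k = (I - gamma T)^(-1).
# In this sketch it is computed once, standing in for what repeated exposure would have learned.
M = np.linalg.inv(np.eye(n_states) - gamma * T)

# "Querying" the cache at the current state is a row lookup.
current = 2
readout = M[current]              # discounted weights over the states to come
```

Here `readout` places weight `gamma ** k` on the state k steps ahead of `current` and zero weight on past states, so a pattern that reflects upcoming states is obtained by retrieval alone.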

Third, the canonical computational model of biological motion perception (Giese and Poggio 2003 Nat Rev Neurosci) is a fully feedforward template-matching architecture that predates the predictive-coding framing of biological motion. It accommodates the inversion effect (templates tuned to upright statistics), the hierarchy of timescales (graded leaky integrator time constants), and the scrambling effect (broken sequence-neuron activation) without invoking generative models or prediction errors. The manuscript cites Giese-tradition work for the inversion-effect literature but does not engage with the model itself, even though it is the field standard.

The inversion result, while empirically striking, has a simpler interpretation than the one offered. Inversion makes viewpoint-invariant body computation fail because the underlying machinery is tuned to upright body statistics. A weaker representation produces a weaker dRSA signature at every lag, including the leading edge - no appeal to priors in the active-inference sense is required. The view-dependent enhancement under inversion fits this reading naturally: when viewpoint abstraction fails, processing falls back to viewpoint-specific representations that remain extractable. The manuscript implicitly acknowledges this when it states that "predictions were channeled to the level at which prediction was still possible," but does not notice that this concession softens the strong predictive-coding inference.

The scrambling result is internally awkward on the predictive-coding framing. The paper acknowledges that pixelwise motion prediction should, in principle, survive 200-500 ms scrambled segments (typical latency around 150 ms) but reports that it does not. The proposed save - that segments are "too short to start up prediction" - undercuts the framework, since by the same logic, most of normal viewing would also be pre-prediction. A cleaner reading is that scrambling destroys the temporal autocorrelation of stimulus features, which is the prerequisite both for nonlinear-summation neural responses to produce leading-edge representations and for SR-style associative retrieval to operate.

A further concern is that the experimental design and analysis pipeline are structurally biased toward producing the cleanest possible predictive signature. The 14 stimuli are repeated extensively, and trials are averaged across repetitions before dRSA is computed, filtering out exactly the variability that would distinguish online prediction from amortized retrieval. The 2023 paper reports a control comparing the first and last thirds of the experiment, but this test is in the post-saturation regime for any plausible associative-learning rate and does not actually adjudicate the question. A first-exposure or first-run analysis would be diagnostic. Finally, the behavioral task changed between the 2023 paper and the present manuscript. The earlier paradigm asked participants to recognize the current motion ("arms moving up?"), while the present paradigm asks participants to judge whether an occluded video continues correctly. The latter explicitly demands prediction. This change transforms the experimental context from naturalistic viewing into one that actively incentivizes predictive engagement, potentially inflating the very signatures the paper interprets as spontaneous prediction.

The 2023 Nature Communications paper actually navigated these interpretive questions more carefully than the present manuscript does, explicitly stating that the approach "does not provide conclusive evidence for predictive processing/coding theory but leaves the door open for related theories such as adaptive resonance or Bayesian inference without predictive coding." The current manuscript would benefit from restoring that epistemic discipline. The data and methods are valuable; the interpretive frame is overstated relative to what the evidence supports.

Impact and utility

The dataset and dRSA framework are useful contributions to the study of neural representation of dynamic stimuli, and the inversion and scrambling conditions open productive lines of inquiry. The interpretive over-commitment to predictive processing risks limiting the paper's reach into adjacent literatures - temporal integration, successor representations, template-matching biological motion models, encoding-model approaches - where the findings could land productively. With a more pluralistic interpretive frame, this work would speak to a substantially broader audience and connect more naturally with existing mechanistic accounts of dynamic visual processing.

Reviewer #2 (Public review):

Summary:

In this manuscript, de Vries and colleagues apply successful probabilistic inference and predictive coding frameworks to the question of biological motion perception. In contrast to most studies of predictive processing in humans, which rely on the presentation of discrete events, they instead aimed to track continuous predictions in the context of more naturalistic inputs such as biological motion. In these settings, the authors have previously demonstrated an inverted temporal hierarchy of prediction whereby high-level movement features (e.g., view-invariant body motion) are predicted earlier than lower-level ones (e.g., pixelwise motion). The specific question they set out to address in this manuscript is whether these predictions derive from prior beliefs about the biological and physical organization of biological movements versus the local extrapolation of motion from past observations.

The authors used anatomical MRI-driven source reconstruction of MEG activity recorded from human participants watching either normal, vertically-mirrored, or temporally scrambled movies. They then aimed to correlate activity in preselected ROIs with summary representations of these movies based on different visual features at 3 different hierarchical levels using RSA. Doing so, they could confirm that predictive processes could be identified prior to the change in the stimulus and organized anatomically along the visual cortical hierarchy. Critically, they report that mirrored movies selectively disrupted the highest processing level while the lowest level remained largely unaffected. Interestingly, the predictions at the intermediate level were boosted in mirrored movies, suggesting a possible channeling of predictions at this level when highest-level predictions are unavailable. Finally, disrupting all predictive aspects with the scrambled movies entirely abolished predictions at all levels, with signals mainly reflecting reactive bottom-up processing of inputs.

In sum, biological motion perception relies on a tight coordination of multi-level predictions based on both motion-related holistic and kinematics priors.

Strengths:

Overall, this is a very strong manuscript, and the text is clearly written. I liked the fact that the authors not only compared responses to normal videos against the same videos flipped upside-down, but also to temporal piecewise scrambling of that same video, allowing them to identify the respective roles of holistic motion priors vs. temporal predictions. Of course, more work is needed to tease apart what key quantities are represented in these holistic priors. For now, the authors argue that they likely combine prior beliefs about the biological organization of bodies, such as the likely angle of joint movements, and about the physics of reality, such as gravity. Further work teasing apart these aspects would be interesting to read!

All analyses seem well executed and, while some aspects of the presentation of results could be slightly improved (see below), the manuscript is very clear and the conclusions are supported by the data. Finally, I liked the words of caution the authors added to the discussion. For instance, while they largely used negative vs. positive latency as a proxy for top-down vs. bottom-up processing respectively throughout the manuscript, they also accurately acknowledge that predictive computations could also modulate processes at positive lags, through, for instance, latency modulation.

Weaknesses:

The main aspect of the work I was left to struggle with is this idea that priors can be read out directly from large patterns of activity rates as measured with MEG. While some past experimental work does support this view, theoretical proposals also suggest that one benefit of predictive coding lies in its computational and energy-efficient properties, whereby only novel, unpredicted aspects are encoded in the rate of neural activity. Some other research lines, for instance, focusing on silent working memory, also report the brain's ability to store important computations in ways that are not reflected in costly increases in overall activity. The authors do not really unpack why they expect predictions to be encoded in such a way in the first place. They also do not discuss what that implies in terms of neural organization and whether other aspects of neural activity (e.g., oscillations, synaptic weights) could support predictive processing in this context. At the end of the day, this activity change is clearly there in the data, so it is entirely reasonable to interpret it; it just would be helpful to unpack what such an implementation of prior beliefs would imply in terms of neural organization.

The other weakness I see is how little consideration is given to behavior throughout the paper. Behavior is indeed mostly treated as a negative control, ensuring that differences between conditions at the neural level do not follow from different behavioral strategies or other peripheral factors. Critically, the task design nicely incorporates two types of tasks: one that is related to motion (occlusion of movement) and one that's independent of it (color change of fixation cross). Yet, these conditions are not directly compared at the neural level. It would be useful to see whether the neural signatures of prediction are largely independent of the ongoing task or whether behavior gates the types of priors and prediction processes that are applied to incoming sensory inputs. Moreover, the text says that "neither in accuracy nor in reaction time was there a significant difference between conditions", yet significance stars in Figure 1d seem to suggest there is a difference in the fixation cross task. What am I missing? If there is indeed a difference in overall performance, can the results (esp. the reduced dRSA correlation strength in normal < inverted < scrambled movie) be interpreted in terms of a multi-tasking cognitive cost?

I also have some other minor questions and comments:

(1) In this task situation, prediction does not only come in the continuous domain but also relies on a mental simulation model, in particular in the occlusion task. However, corresponding literature, notably the work by Shepard & Metzler (1971) on mental rotation (as well as follow-ups), is not mentioned here, I believe. Could the authors perhaps mention this if they think that's relevant (if not, feel free to ignore).

(2) I'm concerned that the novelty of dynamic RSA as explained at lines 56-64 might appear slightly exaggerated. After all, isn't it just a generalization of matrix correlation in model and brain time domains? (Again, feel free to ignore if I misunderstood.)

(3) How do the authors explain that high-level motion prediction is still significantly larger than zero (correct?) in the inverted movie condition? Shouldn't it be entirely abolished?

Reviewer #3 (Public review):

Summary:

The authors investigate whether the brain's predictive representation of observed biological motion depends on holistic priors about body structure or on kinematic priors about motion continuity. The manuscript applies dynamic representational similarity analysis to MEG data from a large number of participants viewing ballet sequences under three conditions: normal, upside-down inverted, and temporally scrambled into short epochs.

Strengths:

The study reports that inversion selectively attenuates predictions of view-invariant body motion and enhances predictions of view-dependent body motion, while leaving low-level pixel-wise motion prediction unaffected. Further, scrambling eliminates predictive motion representations at every level and instead produces stronger post-stimulus representations of body posture, with view-invariant posture also delayed. The pattern across the two manipulations is internally consistent, holds across both peak magnitude and peak latency measures, and is also supported by a neural-to-neural dynamic representational similarity analysis (dRSA) between normal and inverted conditions. The principal component regression pipeline is validated through simulations showing that it recovers the model of interest while suppressing covarying models. In particular, the inversion result provides strong evidence that high-level predictions of biological motion depend on holistic priors while predictions at lower levels do not, and the finding that disruption at the top of the hierarchy does not propagate down is informative for predictive processing accounts that assume a more cascading architecture.

Weaknesses:

The interpretation of the scrambling result is the main caveat of the manuscript. The claim that low-level motion prediction depends on kinematic continuity rests on the absence of pixelwise motion prediction in the scrambled condition, but the 200 to 500-ms segments may not be sufficient for prediction to develop, as the authors also point out. Without a parametric manipulation of segment length, it is difficult to distinguish a genuine dependence on kinematic priors from a floor effect. The interpretation of increased post-stimulus posture representations as prediction errors is also somewhat indirect, since a positive latency does not rule out potential top-down modulation.
