Bayesian causal inference unifies perceptual and neuronal processing of center-surround motion in area MT

  1. Department of Brain and Cognitive Science, University of Rochester, Rochester, United States
  2. Center for Visual Sciences, University of Rochester, Rochester, United States
  3. Institute of Science and Technology Austria, Klosterneuburg, Austria
  4. Zuckerman Institute, Columbia University, New York, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Peter Latham
    University College London, London, United Kingdom
  • Senior Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America

Joint Public Review:

Summary:

Lengyel et al. present a normative model of single-neuron activity in area MT, which is known for its role in processing visual motion. The authors focus on responses to a center and a surround that move at different velocities. Both the center and surround are rigid: picture a set of dots all moving at the same velocity. The center dots are arranged in a disc; the surround dots in an annulus, and in both cases, the velocity of each is time-varying.

The core proposal is that the brain does not process motion in a fixed coordinate system, but instead infers a latent reference frame, and that MT neurons encode motion either in retinal coordinates or relative to this inferred reference frame. The model is meant to overcome a challenge in the existing literature on area MT: on the one hand, experimental findings are heterogeneous, including both surround suppression and surround facilitation of neural responses; on the other, existing models are either designed ad hoc to capture specific phenomena or they are somewhat general (e.g., divisive normalization), but in either case they can't explain the full range of responses. This manuscript proposes that the full range of responses in MT is explained as Bayesian inference over the reference frame in which center motion speed and direction should be estimated. The model extends one introduced in a previous publication from the same lab (Shivkumar et al. 2025). That publication focused on human perception of motion; this one makes predictions about MT mean responses and across-trial variability.
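As a rough illustration of what "Bayesian inference over the reference frame" can mean, the following is a minimal one-dimensional sketch. It is not the model of Shivkumar et al. (2025) or of the manuscript. It assumes only two candidate frames (retinal and surround), a zero-mean Gaussian slow-motion prior with standard deviation `tau` in whichever frame applies, Gaussian sensory noise `sigma`, and a perfectly known surround velocity. The function names and parameter values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def surround_frame_posterior(x_center, x_surround, sigma=1.0, tau=2.0,
                             prior_surround=0.5):
    """Posterior probability that the center is best described in the
    surround's reference frame rather than in the retinal frame.

    Toy assumptions (not taken from the manuscript):
      - retinal frame:  v_center ~ N(0, tau^2)
      - surround frame: v_center - v_surround ~ N(0, tau^2)
      - the measurement x_center equals v_center plus N(0, sigma^2) noise
      - the surround velocity is observed without noise
    """
    var = sigma ** 2 + tau ** 2  # marginal variance of the measurement
    like_retinal = normal_pdf(x_center, 0.0, var)
    like_surround = normal_pdf(x_center, x_surround, var)
    numerator = prior_surround * like_surround
    return numerator / (numerator + (1.0 - prior_surround) * like_retinal)


if __name__ == "__main__":
    # The center moves with the surround: the surround frame is favoured.
    print(surround_frame_posterior(5.0, 5.0))
    # The center is stationary on the retina: the retinal frame is favoured.
    print(surround_frame_posterior(0.0, 5.0))
```

In this toy version the posterior over frames depends only on how close the measured center velocity is to zero versus to the surround velocity. The full model additionally allows intermediate frames and uncertainty about the surround.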

Strengths:

Processing visual motion is important for normal visual function, including for the integration and segmentation of visual objects. This manuscript presents a normative theory, supported by recent human perceptual data, and extends it to make predictions about neural firing rate and variability in area MT. The theory is well motivated and supported by the simulation analysis and comparison to data. It provides new insight into how causal inference of relative motion reference frames can modulate neural activity in MT. The richness of the theory's predictions can guide future experiments. In particular, the theory explains both center-surround suppression and facilitation, unifying disparate empirical observations in MT for which no unified explanation had been proposed. The manuscript also demonstrates a new method to map ideal observer predictions (posterior distributions over speed and direction, which depend on the posterior inference over reference frames) onto predicted neural activity for center-surround stimuli, by only considering basic tuning curves measured in the center-alone condition. This is a useful methodological contribution. The manuscript offers a thorough review of center-surround (CS) modulation studies in MT.

Weaknesses:

We found this paper difficult to read for two reasons. First, math is generally explained in words. This made it extremely difficult (impossible for some reviewers) to understand the details of the model, which are important. We're not against words, but it's critical that they be accompanied by equations.

Second, the manuscript is not self-contained in the sense that many of the motivations, assumptions, and limitations of the approach are only evident if one carefully reads the group's prior work, Shivkumar et al. (2025). Following up on previous work isn't necessarily a flaw, but the introduction of the paper is written from a very broad perspective that does not effectively summarize the prior work and lay out the specific questions that motivate the current study. For example, it is not clear from the introduction whether the authors believe this framework can explain all sorts of center-surround interactions (including in non-motion stimuli and in other areas like the retina), or if the focus is only on area MT.

Finally, the connection to neural data is confusing and mostly qualitative. The authors create a library of "hypothetical but plausible tuning curves" and show that their modeling framework is flexible enough to capture a variety of center-surround interactions. Although they do state that their model can't explain all possible tuning curves, it's still hard to tell whether they have particularly strong evidence for the Bayesian causal inference hypothesis.

We also have several technical, but potentially important, comments.

Line 427: 'Our framework not only reinterprets past findings but also generates new, testable predictions. The model makes directly testable predictions for surround modulation. Facilitation, for instance, is predicted for neurons encoding retinal-centric motion (v_center) under high sensory uncertainty. In contrast, suppression is the hallmark of neurons encoding relative motion (v^relative_center) with respect to a surround-influenced reference frame.' It seems that to test the predictions of the model, one would need to first determine if a neuron encodes retinal or relative motion, without relying on the patterns predicted by this model, and then test if the two types of neurons behave as predicted. It is unclear how one can obtain this labeling of neurons independently of the model predictions.

Line 492: 'This offers a principled account of how the same population of neurons can support both perceptual states (integration and segmentation)'. However, because the theory assumes each neuron encodes either center velocity or center velocity relative to a moving reference frame, but not both, it does not explain how the same neuron could shift from suppression to facilitation. It may be worth considering another possibility, using V1 surround modulation as an analogy. Different neuron types are required to implement the surround computation: in mouse V1, SST interneurons are surround-facilitated, and they are necessary to implement surround suppression of pyramidal neurons (https://pmc.ncbi.nlm.nih.gov/articles/PMC3621107), but the SST outputs are not communicated to downstream targets. In that view, facilitation is not a signature of some neurons encoding a particular type of latent variable; it arises only as an intermediate step in the computation of the other latents (those that require suppression).

Misspecification of either the prior or likelihood can be a problem for Bayesian inference. Discussion of this point -- and in particular evidence (say from analysis of natural scene statistics in the case of the prior) that both are well-specified -- would strengthen the manuscript.

Author response:

We thank the reviewers for their careful and constructive assessment. We are glad they found the theory well motivated, that they recognised it unifies the previously unexplained center–surround suppression and facilitation in MT, and that they appreciated the methodological innovation in mapping ideal-observer predictions onto neural responses. We will make the manuscript more self-contained and mathematically explicit.

To clarify our central claim and the connection to neural data: Our model can account for both what people perceive and what neurons do. Specifically, we take a Bayesian causal-inference model that was built and fitted to human behaviour (Shivkumar et al., 2025), and use it to derive neural predictions for center–surround interactions in area MT during motion perception. We then compare these predictions to previously reported MT single-neuron responses – qualitatively, but without any further parameter fitting.

The reviewers are correct that there are two types of latent variables in our model, which imply two different sets of neural predictions. While one might conjecture relationships to other neural properties, such as classic center-surround suppression, determining which latent variable a neuron corresponds to is a simple matter of model comparison after fitting its responses to both sets of predictions. If the responses of a recorded neuron correspond to one of these predictions (as many previously recorded neurons appear to do, as we show in our paper), then this constitutes evidence that the neuron represents the corresponding posterior in our model. If they do not, this can be due to a number of factors: our generative model being wrong, the neural encoding assumption (sampling or LDC) being wrong, or the Bayesian brain hypothesis being wrong (also see Lengyel et al. 2023; Haefner et al. 2024).
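The model comparison described in the paragraph above could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' analysis. It assumes Poisson spike counts, one count per stimulus condition, and a single free gain parameter per candidate prediction, so that both candidates have the same number of free parameters and their maximized log-likelihoods can be compared directly. The function names and the example numbers are hypothetical.

```python
import math


def poisson_loglik(counts, rates):
    """Log-likelihood of integer spike counts under independent Poisson rates."""
    return sum(k * math.log(r) - r - math.lgamma(k + 1)
               for k, r in zip(counts, rates))


def fit_gain(counts, shape):
    """Maximum-likelihood gain g for Poisson rates g * shape[i]."""
    return sum(counts) / sum(shape)


def compare_latents(counts, pred_retinal, pred_relative):
    """Fit one gain per candidate prediction and return the better-fitting
    label together with both maximized log-likelihoods.

    pred_retinal and pred_relative are strictly positive predicted response
    profiles across conditions (hypothetical inputs, e.g. derived from a
    model posterior and a center-alone tuning curve).
    """
    scores = {}
    for name, shape in (("retinal", pred_retinal), ("relative", pred_relative)):
        gain = fit_gain(counts, shape)
        scores[name] = poisson_loglik(counts, [gain * s for s in shape])
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up counts that rise across conditions, as in surround facilitation.
    counts = [2, 5, 9, 14]
    best, scores = compare_latents(counts, [1, 2, 4, 6], [6, 4, 2, 1])
    print(best, scores)
```

With repeated trials, more free parameters, or differing model complexity, a cross-validated likelihood or an information criterion would be a more appropriate comparison than the raw log-likelihood used here.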

Furthermore, we would also like to clarify that both types of latents support each of the perceptual states (including integration and segmentation) and that there is no one-to-one correspondence between them. Importantly, the relative velocity latent represents the velocity in the inferred reference frame, which may be the surround frame, the retinal frame, or an intermediate frame (see Fig. 5 in Shivkumar et al. 2025). As a result, the same neuron can show both suppressive and facilitatory effects (yellow and blue regions in the difference panels in Fig. 6 of our paper).

Finally, while specifying both the likelihood and the prior constitutes our model definition, we have made reasonable assumptions about the shape of each. The physics of the world — for instance, that objects tend to be stationary or to move slowly — motivates a spike-and-slab prior (Knill & Richards, 1996), and the likelihood is well described by a unimodal form (Stocker & Simoncelli, 2006). Our qualitative predictions do not depend on the exact specification of the prior and likelihood; other unimodal likelihoods yield similar results.
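To make the combination of a spike-and-slab prior with a unimodal likelihood concrete, here is a minimal one-dimensional sketch. The specific forms are assumptions chosen for illustration and are not taken from the manuscript: the spike is a point mass at zero velocity with weight `w`, the slab is a zero-mean Gaussian with standard deviation `tau`, and the likelihood is Gaussian with standard deviation `sigma`. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def spike_probability(x, w=0.5, sigma=1.0, tau=3.0):
    """Posterior probability that the true velocity is exactly zero (the spike),
    given a noisy measurement x."""
    evidence_spike = w * normal_pdf(x, 0.0, sigma ** 2)
    evidence_slab = (1.0 - w) * normal_pdf(x, 0.0, sigma ** 2 + tau ** 2)
    return evidence_spike / (evidence_spike + evidence_slab)


def posterior_mean_velocity(x, w=0.5, sigma=1.0, tau=3.0):
    """Posterior mean velocity: the spike contributes zero, and the slab
    contributes the usual Gaussian shrinkage of the measurement toward zero."""
    shrink = tau ** 2 / (tau ** 2 + sigma ** 2)
    return (1.0 - spike_probability(x, w, sigma, tau)) * shrink * x


if __name__ == "__main__":
    for x in (0.0, 1.0, 3.0, 10.0):
        print(x, spike_probability(x), posterior_mean_velocity(x))
```

Small measurements are pulled strongly toward zero because the spike absorbs most of the posterior mass, whereas large measurements are shrunk only by the slab. Replacing the Gaussian likelihood with another unimodal form would change the numbers but not this qualitative pattern.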

We will make corresponding edits throughout the text to clarify each of these points.

References:

Haefner, R. M., Beck, J., Savin, C., Salmasi, M., & Pitkow, X. (2024). How does the brain compute with probabilities? arXiv.

Knill, D. C., & Richards, W. (Eds.). (1996). Perception as Bayesian inference. Cambridge University Press.

Lengyel, G., Shivkumar, S., & Haefner, R. M. (2024). A general method for testing Bayesian models using neural data. In Proceedings of UniReps: The First Workshop on Unifying Representations in Neural Models (Proceedings of Machine Learning Research, Vol. 243).

Shivkumar, S., DeAngelis, G. C., & Haefner, R. M. (2025). Hierarchical motion perception as causal inference. Nature Communications, 16, Article 3868.

Stocker, A. A., & Simoncelli, E. P. (2006). Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, 9(4), 578–585.
