Understanding neural circuit principles for representation learning through joint-embedding predictive architectures

Ashena Gorgan Mohammadi; Manu Srinath Halvagal; Friedemann Zenke

doi:10.7554/eLife.110895.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Tatyana Sharpee
Salk Institute for Biological Studies, La Jolla, United States of America
Senior Editor
Panayiota Poirazi
FORTH Institute of Molecular Biology and Biotechnology, Heraklion, Greece

Reviewer #1 (Public review):

Summary:

The paper describes a biologically plausible version of JEPA using recurrent neural networks, called RPL for recurrent predictive learning. Given an embedding z_t, a recurrent neural network processes these inputs in the form c_{t+1} = RNN(c_t, z_t). The predictive network f then predicts future inputs by minimizing ||f(c_t) − stop_grad(z_{t+∆t})||². I understand that a prediction error is defined as e = z_{t+∆t} − f(c_t) to model cortical measurements in the oddball task.

The RPL model is also shown to build an internal world model, with "real-world" data like the movement of moving animals or speech signals. The representation is then compared to V1 data and expected prediction error signals in an oddball setting. In a stacked hierarchy of RNN learning with RPL, the higher layers appear to learn high-level latent variables, although gradients are not propagated downward to the lower layers.

Strengths:

(1) The paper tackles an open question: Self-supervised learning is thought to be a fundamental principle to explain how computation is structured in the brain. Cortical data suggest qualitatively that prediction error is a core principle of representation learning in the brain, but the field is still looking for a simple yet expressive model that would explain how the cortex learns its representations. RPL contributes in that direction by making a useful link between cortical representation learning in RNN models and the JEPA learning algorithm that was demonstrated to scale to large world model learning from video data by LeCun's group. It is very useful to connect this popular deep learning algorithm to cortical data.

(2) The model formalism is relatively elegant and simple: Simple next input prediction objectives are conceptually simple but not necessarily trivial to build at scale. There is a clear benefit in comparison with contrastive or IL methods because such objectives are free from dataset-specific data augmentation and negative samples, thereby moving the comp neuro field towards conceptually simpler models of representation in the cortex. Yet predictive-only models (and in particular predictive models in latent space instead of pixel space) are not easy to build in a stable fashion. The JEPA family is basically intended to solve this problem; it is very nice and timely to bring this to comp neuro.

(3) The methodology combining comp neuro and deep learning makes sense: The conceptual and qualitative analogy with cortical prediction errors is relevant and consistent with what is expected as a model of self-supervised learning in cortical models. The methodology to compare RPL with IL and CL is methodologically meaningful and grounded: showing, for instance, how some of the models fail to represent some latent structure in some toy datasets is interesting.

(4) h-RPL: The h-RPL is perhaps the most creative departure from the JEPA model family. It would be interesting to say more about what was particularly difficult to see in the latent variables emerging in the hierarchical model. I often find it magical that layer-wise learning rules of this type are not learning redundant representations. Any insights why this is not the case here would be potentially insightful.

Weaknesses:

In general, I fully support the type of question and ideas that the paper is putting forward. It is, however, very hard in this research field to gain insight into specific conceptual contributions or specific bits of experimental data that the model puts forward. In pointing to the following weaknesses, I am encouraging the authors to lay out more clearly what the unique hypothesis is or the contribution of the RPL model that we should remember it for.

(1) The devil is in the details:

1a) Comparison with JEPA variants: JEPA variants are integrating different details into the learning algorithm. Integrating, for instance, "masking" of the latent encoder targets, or EMA in the style of BYOL or Siamese networks, for the predicted representations. It is great that RPL does not seem to need any of those (next input prediction is a natural implementation of masking, and EMA does not seem to be used). It is notoriously hard for the JEPA model to work without these features. Since some of these details are sometimes surprisingly crucial for a simulation to work, it would be good to report which of the other important details were key to live without EMA and masking. Is it the difference in learning rate, for instance? Or maybe the tasks considered are simply easy enough for any model to work; if so, it could be useful to acknowledge to what extent this is true.

1b) Comparison with IL and CL: On a high level, the comparison with IL and CL algorithms is written as conclusive. I suspect that the failure modes of IL and CL that are described are not due to the algorithms themselves, but rather to the construction of invariance statistics or the choice of negative sample sets (the sets of samples among which variance 1 is requested by VICreg). For instance, if variance (or negative sample set) is taken only across time, the variance object identity is expected to collapse. Similarly, if the variance is taken across the object identity, the variance across time can collapse. So I wonder if the failure of IL and CL is induced by the construction of the variance definition.

(2) Prediction error: When the model is compared to the recordings of cortical activity in Figure 7, it is not obvious from the figure which latent space we are talking about mathematically. Is it the vector z, c, or the prediction error e? This is rather important from a neuroscientific point of view, because the prediction error e is expected to explain the neuronal data. On the other hand, the prediction error e is only used in the learning algorithm to define the loss function, but it is not the communication medium between the RNN units c (or with the encoder z).

In the brain, since the measurements are recorded as neural activity, they are communication channels between specific units (z or c). It is probably c or z that would already explain the oddball prediction error. I believe that other models, like Forward-forward of Nejad et al., have tried quite hard to address this apparent tension. Whether or not this is resolved by RPL, I think it would be beneficial to state the problem and clarify how the algorithm addresses or ignores the issue.

(3) Successor representation without value? I believe the term successor representation is historically relevant in a reinforcement learning (RL) setting and has a precise mathematical definition. Without RL, I feel that learning successor representation is conceptually identical to learning a transition matrix (aka, a primitive world model). I therefore wonder if the pitch for high-level framing of the successor representation is appropriately described or trivial.

(4) Learning in RNN: Learning with recurrent networks appears to be a key in this model presented here (it is in the algorithm name). Yet, this aspect of the model and the literature on biologically plausible learning rules for RNN is not really discussed.

https://doi.org/10.7554/eLife.110895.1.sa3

Reviewer #2 (Public review):

This is a very interesting manuscript, which proposes a novel idea on how cortical networks may learn useful representations of sensory stimuli. The model implementing this idea is thoroughly tested in multiple experimental paradigms. The manuscript is very clearly written. I feel it may have a significant impact on our understanding of cortical circuitry.

https://doi.org/10.7554/eLife.110895.1.sa2

Reviewer #3 (Public review):

Summary:

This paper presents Recurrent Predictive Learning (RPL), a self-supervised model conceptually similar to Joint-Embedding Predictive Architecture (JEPA) models. RPL sequentially observes dynamic scenes to predict subsequent observations. A central claim of the work is that the model's trained representations are simultaneously invariant and equivariant to transformations, such as movement properties that emerge without explicit supervision. These representational qualities are demonstrated through three experiments utilizing two simulated datasets and one naturalistic dataset. Furthermore, the latent embeddings are qualitatively compared with neural data, showing that the model reproduces the successor representation observed in human V1 and the local/global oddball effect in the monkey Prefrontal Cortex.

Strengths:

(1) The paper addresses a fundamental question relevant to both computational neuroscience and machine vision: how the brain learns representations that are simultaneously invariant and equivariant to transformations. The manuscript is well-written, easy to follow, and supported by clear visualizations.

(2) While JEPA-style models have recently gained significant traction in the artificial intelligence community, this paper nicely bridges the gap to neuroscience. By framing these architectures as a theory for visual learning in the brain, the authors provide valuable insights into how predictive frameworks can explain cortical processing.

(3) The qualitative alignment with V1 and PFC data is a particularly strong contribution, as it offers a potential mechanistic explanation for observed neural phenomena through the lens of self-supervised learning.

Weaknesses:

(1) The central claim, that both invariance and equivariance emerge spontaneously, requires further scrutiny (see Ghaemi et al., NeurIPS, 2025; Garrido et al., arXiv, 2024). In particular, the synthetic "moving animal" dataset used in this paper may be too simple to fully support this claim. In latent space prediction, a model must predict both the scene content and the dynamics of movement. Because movement (whether ego-motion or external) is often highly uncertain (or multi-modal), predictive models in naturalistic settings often "collapse" toward learning purely invariant representations, ignoring the hard-to-predict dynamics. In the provided simulations, the movements are extremely predictable. In more complex scenarios, the model would likely prioritize content (invariance) over dynamics (equivariance) unless aided by action-conditioning or explicit factor estimation (Zhang et al., ICLR, 2026). The authors' results in Figure 5 using naturalistic video seem to reflect this limitation, given the lower performance on the naturalistic videos compared to the synthetic datasets.

(2) The framing of the RPL model as an entirely new theory of representation learning is slightly overstated. The focus on prediction in representation space rather than input space is the defining characteristic of JEPA and various other Self-Supervised Learning (SSL) models, even sequential prediction. While this paper clarifies the connection between these AI frameworks and cortical circuits, the work would be strengthened by more explicitly positioning RPL within the context of existing JEPA-style models and prior SSL theories of the visual system.

(3) A significant challenge in latent-space SSL is avoiding "representational collapse" (where the model provides a trivial constant output). While the paper alludes to JEPA-like solutions, it lacks a detailed explanation (in both the text and the architectural schematics) of the specific technique used to prevent collapse. Consequently, it is difficult to evaluate the authors' claim of "biological plausibility," as the biological equivalents of common machine learning techniques (such as stop gradient) are not discussed.

(4) Recent work has shown that the capacity (size) of the predictor significantly influences the learned representations in a JEPA-type world model (Garrido et al., 2024). In simpler scenarios, a large enough predictor can allow a model to "memorize" dynamics rather than learning generalized equivariant features. It would be beneficial to see how the ratio of predictor size to encoder size affects the emergence of these features.

Methodological Clarifications:

(1) The authors mention a contrastive learning comparison but provide few details. Since contrastive learning is primarily a technique to avoid collapse, it would be a more rigorous baseline if implemented within the same architecture as RPL to isolate the effect of the predictive objective.

(2) In the PFC data comparison (Figure 7f), there appears to be a discrepancy: the local and global conditions show nearly identical results in PFC but different dynamics in the model. It is unclear if this is a visualization error or a genuine model deviation.

(3) The criteria for selecting specific model variables for comparison with V1 versus PFC are not explicitly defined. Clarification is needed on whether the same latent variables were used for both brain regions or if different layers were selected.

https://doi.org/10.7554/eLife.110895.1.sa1

Author response:

Public Reviews:

Reviewer #1 (Public review):

The paper describes a biologically plausible version of JEPA using recurrent neural networks, called RPL for recurrent predictive learning. Given an embedding z_t, a recurrent neural network processes these inputs in the form c_{t+1} = RNN(c_t, z_t). The predictive network f then predicts future inputs by minimizing ||f(c_t) − stop_grad(z_{t+∆t})||². I understand that a prediction error is defined as e = z_{t+∆t} − f(c_t) to model cortical measurements in the oddball task.

The RPL model is also shown to build an internal world model, with “real-world” data like the movement of moving animals or speech signals. The representation is then compared to V1 data and expected prediction error signals in an oddball setting. In a stacked hierarchy of RNN learning with RPL, the higher layers appear to learn high-level latent variables, although gradients are not propagated downward to the lower layers.

The paper tackles an open question: Self-supervised learning is thought to be a fundamental principle to explain how computation is structured in the brain. Cortical data suggest qualitatively that prediction error is a core principle of representation learning in the brain, but the field is still looking for a simple yet expressive model that would explain how the cortex learns its representations. RPL contributes in that direction by making a useful link between cortical representation learning in RNN models and the JEPA learning algorithm that was demonstrated to scale to large world model learning from video data by LeCun’s group. It is very useful to connect this popular deep learning algorithm to cortical data.

The model formalism is relatively elegant and simple: Simple next input prediction objectives are conceptually simple but not necessarily trivial to build at scale. There is a clear benefit in comparison with contrastive or IL methods because such objectives are free from dataset-specific data augmentation and negative samples, thereby moving the comp neuro field towards conceptually simpler models of representation in the cortex. Yet predictive-only models (and in particular predictive models in latent space instead of pixel space) are not easy to build in a stable fashion. The JEPA family is basically intended to solve this problem; it is very nice and timely to bring this to comp neuro.

The methodology combining comp neuro and deep learning makes sense: The conceptual and qualitative analogy with cortical prediction errors is relevant and consistent with what is expected as a model of self-supervised learning in cortical models. The methodology to compare RPL with IL and CL is methodologically meaningful and grounded: showing, for instance, how some of the models fail to represent some latent structure in some toy datasets is interesting.

(1.1) h-RPL: The h-RPL is perhaps the most creative departure from the JEPA model family. It would be interesting to say more about what was particularly difficult to see in the latent variables emerging in the hierarchical model. I often find it magical that layer-wise learning rules of this type are not learning redundant representations. Any insights why this is not the case here would be potentially insightful.

We thank the reviewer for this comment. Regarding representational collapse in h-RPL, each local circuit independently applies the same collapse-preventing strategy as the single-level RPL model, namely the asymmetric prediction architecture combined with the stop-grad operator. Since this mechanism operates locally within each circuit, it is sufficient to prevent collapse at every level of the hierarchy independently (see also our response to Point 1.3).

The more subtle question is why the circuits learn non-redundant rather than identical representations across the hierarchy. We believe two mechanisms are at play here. First, the hierarchical encoder is a stacked convolutional network, meaning that receptive field sizes grow with depth. This architectural inductive bias naturally encourages successive circuits to operate on increasingly spatially integrated features, creating a structural pressure toward learning complementary rather than redundant representations. Second, the growing expressivity of the network with depth means that higher circuits have access to richer, more abstract inputs from which they can extract higher-level latent structure that is not already captured by lower circuits. Together, these factors (the local collapse-preventing mechanism and the depth-dependent growth in receptive field size and network expressivity) presumably explain why h-RPL builds an increasingly refined and non-redundant representational hierarchy.

What we will do: We will expand our discussion on this point in the revised manuscript. We plan to expand our quantification on how abstractions emerge in h-RPL in future work in which we will also study variations with top-down connections.

(1.2) In general, I fully support the type of question and ideas that the paper is putting forward. It is, however, very hard in this research field to gain insight into specific conceptual contributions or specific bits of experimental data that the model puts forward. In pointing to the following weaknesses, I am encouraging the authors to lay out more clearly what the unique hypothesis is or the contribution of the RPL model that we should remember it for.

We thank the reviewer for the positive feedback and the constructive criticism. We agree that articulating the core contributions more crisply would strengthen the paper.

At its heart, we believe the paper makes two contributions we hope it will be remembered for. First, while prior work has established that invariant representations can be learned via local Hebbian-like learning rules, we show that learning equivariant representations alongside a latent dynamics model requires something qualitatively different: a local circuit with recurrent dynamics and an asymmetric predictive architecture. RPL provides a minimal concrete instantiation of this principle.

Second, and perhaps more broadly, the model makes a structural prediction about (cortical) neuronal circuit organization: since the encoder, integrator, and predictor each perform functionally distinct computations, the framework implies the existence of corresponding cell types and connectivity patterns one should look for in experimental data.

What we will do: We will sharpen these messages in the revised manuscript to ensure these contributions are prominently highlighted throughout the paper.

(1.3) Comparison with JEPA variants: JEPA variants are integrating different details into the learning algorithm. Integrating, for instance, “masking” of the latent encoder targets, or EMA in the style of BYOL or Siamese networks, for the predicted representations. It is great that RPL does not seem to need any of those (next input prediction is a natural implementation of masking, and EMA does not seem to be used). It is notoriously hard for the JEPA model to work without these features. Since some of these details are sometimes surprisingly crucial for a simulation to work, it would be good to report which of the other important details were key to live without EMA and masking. Is it the difference in learning rate, for instance? Or maybe the tasks considered are simply easy enough for any model to work; if so, it could be useful to acknowledge to what extent this is true.

We thank the reviewer for raising this important point. There are two key mechanisms that ensure stable, non-trivial training in RPL. First, using a higher learning rate for the predictor relative to the encoder is crucial for stable training. This prevents the predictor from collapsing the encoder representations and was already noted empirically by Chen et al. (2021).
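As an illustration of the first mechanism, the following minimal Python sketch applies a plain SGD step with a separate learning rate per module. The function name `sgd_step`, the parameter values, and the tenfold ratio between the two learning rates are assumptions made for this example; they are not taken from the manuscript.

```python
def sgd_step(params, grads, learning_rates):
    """One plain SGD step with a separate learning rate per module."""
    return {
        name: [p - learning_rates[name] * g for p, g in zip(values, grads[name])]
        for name, values in params.items()
    }

# Hypothetical values: the predictor learns 10x faster than the encoder.
learning_rates = {"encoder": 0.001, "predictor": 0.01}
params = {"encoder": [1.0, -1.0], "predictor": [1.0, -1.0]}
grads = {"encoder": [1.0, 1.0], "predictor": [1.0, 1.0]}

updated = sgd_step(params, grads, learning_rates)
```

With identical gradients, the predictor parameters move ten times further than the encoder parameters in a single step, which is the asymmetry described above.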

Second, and more fundamentally, predicting at the level of the memoryless encoder output, rather than at the level of the recurrent integrator, is essential to prevent a degenerate solution in which the RNN simply learns to generate an internally predictable time series unrelated to the input. By anchoring the prediction target to the encoder, the model is forced to ground its representations in the sensory input. Intuitively, otherwise the RNN can simply “make up” a predictable time series, which satisfies the learning objective, but would not yield useful internal representations.
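The structure of this objective can be sketched in a few lines of Python. This is a forward-pass illustration only, under stated assumptions: the tanh encoder and integrator, the linear predictor, the random weights, and all function names are hypothetical stand-ins, and the time indexing follows the reviewer's summary (c_{t+1} = RNN(c_t, z_t), with f(c_t) predicting z_{t+∆}), which may differ from the manuscript's convention. The point it shows is that the target is the memoryless encoder output, not the integrator state.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.5):
    """Random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def rpl_loss(xs, W_enc, W_rec, W_in, W_pred, delta=1):
    """Mean squared latent prediction error over one input sequence xs."""
    # Memoryless encoder: z_t depends only on the current input x_t.
    zs = [[math.tanh(v) for v in matvec(W_enc, x)] for x in xs]
    c = [0.0] * len(W_rec)  # integrator state c_0
    total, count = 0.0, 0
    for t in range(len(zs) - delta):
        pred = matvec(W_pred, c)   # f(c_t)
        target = zs[t + delta]     # encoder output; held constant (stop-grad) when learning
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
        count += 1
        # Integrator update: c_{t+1} = RNN(c_t, z_t)
        c = [math.tanh(a + b) for a, b in zip(matvec(W_rec, c), matvec(W_in, zs[t]))]
    return total / count

# Toy usage with hypothetical dimensions and random data.
rng = random.Random(0)
x_dim, z_dim, c_dim = 5, 3, 4
xs = [[rng.gauss(0.0, 1.0) for _ in range(x_dim)] for _ in range(10)]
W_enc = rand_matrix(z_dim, x_dim, rng)
W_rec = rand_matrix(c_dim, c_dim, rng)
W_in = rand_matrix(c_dim, z_dim, rng)
W_pred = rand_matrix(z_dim, c_dim, rng)
loss = rpl_loss(xs, W_enc, W_rec, W_in, W_pred, delta=1)
```

If the target were instead taken from the integrator state c, nothing in the loss would tie the predicted series to the inputs xs, which is the degenerate solution described above.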

Beyond these architectural points, previous work from our group (Srinath Halvagal et al., 2023) has shown mathematically that JEPAs without EMA avoid collapse via an implicit variance regularization mechanism, and we believe RPL benefits from the same principle. Indeed, we now have a more complete theoretical understanding of this, including identifiability proofs for the latent dynamical model under relatively mild assumptions (Mikulasch et al., 2026). This work has recently been accepted at ICML. Other than that, one has to ensure that representations are not already nearly collapsed at the beginning of training. In this paper, we used normalization layers (batchnorm) in the encoder to ensure this.

Finally, as in all SSL paradigms, the augmentation strength is an important hyperparameter that impacts the quality of learned representations. In the temporal predictive setting, the augmentation strength is fixed by the world itself. The only knob we have to play with is the prediction horizon ∆. While we typically focused on next-time-step (∆ = 1) prediction, we saw a clear effect in the case of the speech dataset, where ∆ = 8, but not ∆ = 1, yielded useful representations for the tasks (Fig. 5b).

What we will do: We will discuss the above points more prominently in the discussion to avoid them being overlooked in the methods. Additionally, we will include a plot on the empirical prediction horizon for the speech dataset in the supplementary material for reference.

(1.4) Comparison with IL and CL: On a high level, the comparison with IL and CL algorithms is written as conclusive. I suspect that the failure modes of IL and CL that are described are not due to the algorithms themselves, but rather to the construction of invariance statistics or the choice of negative sample sets (the sets of samples among which variance 1 is requested by VICreg). For instance, if variance (or negative sample set) is taken only across time, the variance object identity is expected to collapse. Similarly, if the variance is taken across the object identity, the variance across time can collapse. So I wonder if the failure of IL and CL is induced by the construction of the variance definition.

We thank the reviewer for this thoughtful point. Both RPL and CL implement an implicit variance regularizer by virtue of being JEPAs (Srinath Halvagal et al., 2023), whereas IL uses an explicit regularizer computed along both the batch and time dimensions to avoid representational and dimensional collapse. The failure modes of IL and CL therefore cannot be entirely attributed to the statistics of the input samples chosen for variance regularization, but are instead primarily determined by the choice of prediction and target representations.
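For readers less familiar with explicit variance regularization, here is a minimal sketch of a VICReg-style variance term and of how the set of samples it is computed over matters. It is an assumption that the hinge form max(0, 1 − std) matches the regularizer used for IL in the manuscript; the function names and the toy data are hypothetical.

```python
import math

def variance_penalty(samples, eps=1e-4):
    """VICReg-style hinge: mean over feature dims of max(0, 1 - std_d)."""
    n, dim = len(samples), len(samples[0])
    penalty = 0.0
    for d in range(dim):
        mean = sum(s[d] for s in samples) / n
        var = sum((s[d] - mean) ** 2 for s in samples) / (n - 1)
        penalty += max(0.0, 1.0 - math.sqrt(var + eps))
    return penalty / dim

def pool_batch_and_time(z):
    """Flatten z[b][t] (a feature vector per sequence and time step) into one sample list."""
    return [vec for seq in z for vec in seq]

# Toy data: two sequences whose single feature differs across the batch
# but is constant across time within each sequence.
z = [
    [[-2.0], [-2.0], [-2.0]],
    [[2.0], [2.0], [2.0]],
]

pooled = variance_penalty(pool_batch_and_time(z))  # variance over batch and time
time_only = variance_penalty(z[0])                 # variance over time within one sequence
```

On this toy data the pooled penalty is 0.0, while the time-only penalty is close to 0.99, so a regularizer computed only across time would push against a feature that is perfectly informative about sequence identity. This is the kind of dependence on the variance definition that the reviewer's question refers to.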

What we will do: We will clarify this in the Methods section of the revised manuscript.

(1.5) Prediction error: When the model is compared to the recordings of cortical activity in Figure 7, it is not obvious from the figure which latent space we are talking about mathematically. Is it the vector z, c, or the prediction error e? This is rather important from a neuroscientific point of view, because the prediction error e is expected to explain the neuronal data. On the other hand, the prediction error e is only used in the learning algorithm to define the loss function, but it is not the communication medium between the RNN units c (or with the encoder z).

In the brain, since the measurements are recorded as neural activity, they are communication channels between specific units (z or c). It is probably c or z that would already explain the oddball prediction error. I believe that other models, like Forward-forward of Nejad et al., have tried quite hard to address this apparent tension. Whether or not this is resolved by RPL, I think it would be beneficial to state the problem and clarify how the algorithm addresses or ignores the issue.

We thank the reviewer for pointing out the clarity issue and for raising the important but subtle point about prediction error representation. To answer the immediate question of which vector we use in Figure 7: it is the vector c, corresponding to the integrator representations. We agree this should be stated explicitly and will update the manuscript accordingly.

On the more general point, we agree that the tension between recordable neural activity and the computational role of prediction errors is an important issue. We do already briefly engage with it in the Discussion (subsection “Relation to previous modeling work”), where we note that under RPL “inter-areal communication is dominated by representations rather than error signals”. However, we agree that this point should be surfaced more directly.

To elaborate, under classical predictive coding, prediction errors are the inter-areal communication channel and are therefore expected to be directly observable in neural recordings, e.g., as oddball responses. Under RPL, this is not the case: e is computed locally within a circuit and serves only as a learning signal for synaptic plasticity, not as a signal propagated between circuits or areas. What cortex primarily encodes and communicates in our framework are predictive representations, not reconstruction errors. Accordingly, what should map onto recorded population activity are the representations c (and z), while locally computed prediction errors could in principle remain observable as more circumscribed or transient mismatch-like signals within a circuit.

We would like to push this point further. The reviewer frames this as a tension that RPL needs to resolve, but growing neurophysiological evidence suggests that classical residual-difference prediction errors may not be a dominant mode of cortical encoding in the first place. Furutachi, Franklin, et al. (2024) showed that V1 responses to unexpected visual stimuli do not encode how input deviates from predictions, but instead selectively amplify the representation of the unexpected stimulus itself. Very recently, Furutachi and Hofer (2026) generalized this into a revised framework in which feedforward pathways transmit sensory representations modulated by prediction-error magnitude, rather than residual differences. Vasilevskaya et al. (2026) constrain the space of plausible cortical algorithms via functional-influence experiments, also concluding that no variant of standard predictive processing is consistent with the full pattern of layer 2/3 ↔ layer 5 interactions; they propose a JEPA-based model, citing RPL as a promising candidate. The model by Nejad et al. (2025) similarly shares with RPL the property that representations, rather than residual errors, propagate between circuit elements.

Taken together, the apparent tension may be less a problem RPL needs to resolve than one it is well positioned to explain, remaining consistent with the emerging picture of cortex as encoding amplified sensory features rather than transmitting residual errors across areas.

What we will do: We will add missing information to the main text and sharpen the Discussion with these arguments.

(1.6) Successor representation without value? I believe the term successor representation is historically relevant in a reinforcement learning (RL) setting and has a precise mathematical definition. Without RL, I feel that learning successor representation is conceptually identical to learning a transition matrix (aka, a primitive world model). I therefore wonder if the pitch for high-level framing of the successor representation is appropriately described or trivial.

The reviewer makes a valid point on the concept of successor representations. To answer the immediate question, it is not entirely trivial, as we observe not only the emergence of the transition structure (Fig. 6c), but also the encoding of decaying future (but not past) state occupancy (Fig. 6d,e). We largely adopted the terminology “successor-like representations” from the study by Ekman et al. (2023), but we will elaborate a bit further on why we stuck to it. As nicely pointed out by the reviewer, the term “successor representations” was introduced in the RL literature (Dayan, 1993), but was later adopted in neuroscience to describe the idea that a neuronal population encodes a predictive representation that reflects the expected future occupancy of states under a given policy. Ekman et al. (2023) use the term “successor-like representations” to explain the phenomenon where the neural activity in V1 (and hippocampus) represents both current and (discounted) future, but not past, state occupancies in a sequence learning task with no explicitly defined policy or value training. In other words, successor-like representations are simply predictive representations.
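The distinction between a one-step transition matrix and a successor representation can be made concrete with a small worked example. The sketch below uses the standard definition M = sum_k gamma^k P^k from the RL literature on a toy four-state sequence A → B → C → D that ends at D. The toy chain, the discount gamma = 0.5, and the function names are assumptions for illustration; they are not taken from the manuscript or from Ekman et al. (2023).

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

def successor_matrix(P, gamma, horizon):
    """Truncated successor representation: sum over k = 0..horizon of gamma^k * P^k."""
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    Pk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0 = identity
    for k in range(horizon + 1):
        for i in range(n):
            for j in range(n):
                M[i][j] += (gamma ** k) * Pk[i][j]
        Pk = matmul(Pk, P)
    return M

# One-step transitions for the sequence A -> B -> C -> D; D is terminal (all-zero row).
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Horizon 3 is exact here because P^4 is the zero matrix.
M = successor_matrix(P, gamma=0.5, horizon=3)
```

Row A of P marks only the immediate successor B, whereas row A of M is [1.0, 0.5, 0.25, 0.125], a discounted code for all future states. Row C of M is [0.0, 0.0, 1.0, 0.5], so past states receive no weight, which mirrors the future-but-not-past asymmetry described above.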

What we will do: To deal with this dichotomy, we will replace “successor-like representations” with the term “predictive representations” in the abstract and clarify this distinction in the Results section of the revised manuscript.

(1.7) Learning in RNN: Learning with recurrent networks appears to be a key in this model presented here (it is in the algorithm name). Yet, this aspect of the model and the literature on biologically plausible learning rules for RNN is not really discussed.

We thank the reviewer for raising this concern. While h-RPL is one step toward more biologically plausible and spatially local learning rules, exploring it further in terms of temporal credit assignment is beyond the scope of the present study and would require a more systematic and in-depth analysis. However, moving toward more biologically plausible learning rules is an interesting research direction that we plan to explore, as we also mentioned in the Discussion (“Limitations and future research directions”).

We think a viable strategy could be to combine a slim spatial credit assignment strategy such as feedback alignment (Nøkland, 2016; Lillicrap et al., 2016) with an online learning rule using eligibility traces for temporal credit assignment such as SuperSpike (Zenke et al., 2018) or e-prop (Bellec et al., 2020). Similar strategies have given promising results for CLAPP (Illing et al., 2021; Zihan et al., 2026).

What we will do: Following the suggestion, we will discuss biologically plausible learning rules for RNNs in the Discussion.

Reviewer #2 (Public review):

This is a very interesting manuscript, which proposes a novel idea on how cortical networks may learn useful representations of sensory stimuli. The model implementing this idea is thoroughly tested in multiple experimental paradigms. The manuscript is very clearly written. I feel it may have a significant impact on our understanding of cortical circuitry.

Reviewer #3 (Public review):

This paper presents Recurrent Predictive Learning (RPL), a self-supervised model conceptually similar to Joint-Embedding Predictive Architecture (JEPA) models. RPL sequentially observes dynamic scenes to predict subsequent observations. A central claim of the work is that the model’s trained representations are simultaneously invariant and equivariant to transformations, such as movement properties that emerge without explicit supervision. These representational qualities are demonstrated through three experiments utilizing two simulated datasets and one naturalistic dataset. Furthermore, the latent embeddings are qualitatively compared with neural data, showing that the model reproduces the successor representation observed in human V1 and the local/global oddball effect in the monkey Prefrontal Cortex.

The paper addresses a fundamental question relevant to both computational neuroscience and machine vision: how the brain learns representations that are simultaneously invariant and equivariant to transformations. The manuscript is well-written, easy to follow, and supported by clear visualizations.

While JEPA-style models have recently gained significant traction in the artificial intelligence community, this paper nicely bridges the gap to neuroscience. By framing these architectures as a theory for visual learning in the brain, the authors provide valuable insights into how predictive frameworks can explain cortical processing.

The qualitative alignment with V1 and PFC data is a particularly strong contribution, as it offers a potential mechanistic explanation for observed neural phenomena through the lens of self-supervised learning.

(3.1) The central claim, that both invariance and equivariance emerge spontaneously, requires further scrutiny (see Ghaemi et al., NeurIPS, 2025; Garrido et al., arXiv, 2024). In particular, the synthetic “moving animal” dataset used in this paper may be too simple to fully support this claim. In latent space prediction, a model must predict both the scene content and the dynamics of movement. Because movement (whether ego-motion or external) is often highly uncertain (or multi-modal), predictive models in naturalistic settings often “collapse” toward learning purely invariant representations, ignoring the hard-to-predict dynamics. In the provided simulations, the movements are extremely predictable. In more complex scenarios, the model would likely prioritize content (invariance) over dynamics (equivariance) unless aided by action-conditioning or explicit factor estimation (Zhang et al., ICLR, 2026). The authors’ results in Figure 5 using naturalistic video seem to reflect this limitation, given the lower performance on the naturalistic videos compared to the synthetic datasets.

We thank the reviewer for the feedback. We agree that further validation on more complex datasets would strengthen the claims, and we take this point seriously. We would welcome any suggestions from the reviewer for a specific alternative dataset.

Regarding the mouse video data specifically, we realized that this is a suboptimal benchmark rather than a shortcoming of our method. The culprit presumably is that the mice remain largely stationary, leading to a heavily imbalanced velocity distribution peaked near zero (Supplementary Fig. S9). This imbalance makes equivariance evaluation unreliable regardless of the learning algorithm. For example, end-to-end supervised training results in an R² of 0.19 compared to 0.08 ± 0.02 for RPL.
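As a purely illustrative aside, the sketch below shows one way in which a velocity distribution peaked at zero can make R² hard to interpret. The velocities are hypothetical and are not the mouse video data; `r_squared` is simply the standard coefficient of determination, and this is not a reconstruction of the analysis behind the values quoted above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical velocities: 96 stationary frames and only 4 frames with movement.
velocities = [0.0] * 96 + [1.0, -1.0, 2.0, -2.0]

# A decoder that ignores the representation and always predicts "no movement".
always_zero = [0.0] * len(velocities)

# Fraction of frames on which the constant decoder is exactly right.
exact_fraction = sum(t == p for t, p in zip(velocities, always_zero)) / len(velocities)

print(exact_fraction)                      # -> 0.96
print(r_squared(velocities, always_zero))  # -> 0.0
print(r_squared(velocities, velocities))   # -> 1.0
```

In this toy case the constant decoder is exact on 96% of the frames yet scores R² = 0, because the score is determined entirely by the four frames that contain movement. With so few informative frames, R² estimates can become noisy for any decoder.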

Regarding the moving animal dataset, we note that the dynamics are not trivial from an SSL perspective: unlike moving MNIST (Srivastava et al., 2015), the dataset includes changes in scale and orientation, both features that invariance-focused SSL models can easily ignore, yet RPL recovers reliably. For example, this discrepancy can be seen in Supplementary Table S1 where we compare to InfoNCE and CPC. That said, we acknowledge the reviewer’s broader concern and will seek to validate RPL on more complex datasets.

While it would be nice to compare to related work by Ghaemi et al. (2024), this study used 3DIEBench (Garrido et al., 2023). Unfortunately, 3DIEBench’s reliance on pair-based representations with annotated but random augmentations (such as rotations or color changes) precludes the possibility of smooth latent traversals that would be required for RPL to learn from the same dataset. We will look into whether it is computationally feasible to adapt or regenerate a similar dataset that meets the requirements for temporal prediction.

Regarding stochasticity, we agree that predictive learning in latent space is most natural in approximately deterministic settings, whereas real-world sensory information often comprises non-deterministic elements. While a deeper treatment of such stochastic environments is beyond the scope of the present manuscript, it will be the focus of ongoing and future work. In recent work from our group (Hauri et al., 2026), we have demonstrated that RPL’s core objective can replace the reconstruction loss in Dreamer, achieving competitive performance in complex, stochastic environments. While we did not systematically evaluate equivariance in that study, the results suggest that representation-space predictive learning is viable beyond the deterministic regime.

What we will do: We will make the point about the real-world mouse video dataset being a poor benchmark and include the additional R² values to show that. Further, we will try to identify or generate alternative datasets to back the equivariance claims and discuss our findings in the light of previous work, e.g., Ghaemi et al. (2024). Moreover, we will sharpen our discussion of our model’s limitations in stochastic settings and highlight notable connections to related work.

(3.2) The framing of the RPL model as an entirely new theory of representation learning is slightly overstated. The focus on prediction in representation space rather than input space is the defining characteristic of JEPA and various other Self-Supervised Learning (SSL) models, even sequential prediction. While this paper clarifies the connection between these AI frameworks and cortical circuits, the work would be strengthened by more explicitly positioning RPL within the context of existing JEPA-style models and prior SSL theories of the visual system.

Thanks for raising this point. We are unsure what the reviewer refers to. We did not frame our work as “an entirely new theory of representation learning,” as the reviewer suggests. In fact, we highlight quite the opposite already in the title of our article, which reads: “Understanding neural circuit principles for representation learning through joint-embedding predictive architectures.” We do not claim novelty over JEPA as an ML paradigm; we adopt it precisely because it provides a principled, non-generative framework for predictive representation learning, and our goal is to develop a circuit-level instantiation that accounts for neural circuit computation. We already discuss a body of previous work on self-supervised learning and JEPAs at length. Since the reviewer did not specify what they are missing, we will briefly reiterate what is already there.

Our contribution is a theory of representation learning in the brain, built on JEPAs as the underlying ML framework. The Title and Introduction already position our work quite explicitly this way. Specifically, we mention prior work on JEPAs (CPC, BYOL, SimSiam, I-JEPA, seq-JEPA, V-JEPA, V-JEPA 2), while noting that “most JEPAs developed in machine learning are poor models of cortical computation” because of their reliance on negative sampling, transformers, masking, static images, and/or known parametrized transformations, and motivate RPL as the minimal candidate that “must instead rely on recurrent neural dynamics, learn from streaming sensory input without masking, support both invariant and equivariant representations, and reproduce key neurophysiological observations.”

The Discussion (“Relation to previous modeling work”) further details the specific novelties of RPL relative to existing sequential JEPA-style and SSL models like CPC (Oord et al., 2018), V-JEPA (Bardes et al., 2024), V-JEPA 2 (Assran et al., 2025), seq-JEPA (Ghaemi et al., 2024). In brief:

RPL is a recurrent JEPA based on RNN dynamics, not transformers, and learns from streaming sensory input without masking or random negative sampling;

It explicitly compares three prediction-error topologies (RPL vs. invariance learning vs. context prediction; Fig. 2, Suppl. Fig. S2, S6) and shows that asymmetric recurrent prediction is essential for jointly learning invariant and equivariant representations;

Importantly, it does so via pure temporal prediction without access to underlying transformations, a property shared by very few JEPAs. The closest exception is VJ-VCR (Drozdov et al., 2024), which uses explicit variance-covariance regularization (VCReg) in a JEPA and which we will cite in the revised manuscript;

It provides the first hierarchical JEPA optimizing local prediction errors at multiple levels (h-RPL, Fig. 8), as envisioned by LeCun (2022) but not previously implemented;

It connects directly to neurophysiological data: successor-like representations in human V1 and abstract sequence representations in macaque PFC, which provides qualitative correspondence between JEPA components and cortical activity that the existing JEPA literature, focused on ML benchmarks, does not address.

Finally, our article already includes a discussion paragraph on recent self-supervised learning models in the context of the brain where we discuss work by Nejad et al. (2025) and Asabuki et al. (2025). Most other SSL theories of the visual system rely on static images and recognition tasks (Yerxa et al., 2024; Margalit et al., 2024). However, there are two studies that include temporal prediction objectives and are worth mentioning in more detail: First, Bakhtiari et al. (2021) show that representations similar to ventral and dorsal pathways in the visual system can emerge in a two-pathway encoder architecture within the CPC model. Second, Niu et al. (2024) use a “straightening” objective together with VCReg as a practical model of the perceptual straightening hypothesis (Hénaff et al., 2019). Though not a JEPA (i.e., it has no predictor network), it can decode equivariant factors in a sequential MNIST dataset where only single factors change throughout a video.

What we will do: We will carefully review our discussion of previous work and further discuss Drozdov et al. (2024), Bakhtiari et al. (2021), and Niu et al. (2024) in the revised manuscript.

(3.3) A significant challenge in latent-space SSL is avoiding “representational collapse” (where the model provides a trivial constant output). While the paper alludes to JEPA-like solutions, it lacks a detailed explanation (in both the text and the architectural schematics) of the specific technique used to prevent collapse. Consequently, it is difficult to evaluate the authors’ claim of “biological plausibility,” as the biological equivalents of common machine learning techniques (such as stop gradient) are not discussed.

Thanks for pointing this out. Our model avoids collapse through the asymmetric stop-grad / predictor architecture. It does not require an exponential moving average (EMA) when the predictor learns with a faster learning rate than the rest of the network (see also our response to Point P1.3).
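For reference, the asymmetry can be written out in the notation of Reviewer #1’s summary, where z_t is the embedding, c_t the recurrent state, and f the predictor. The stop-gradient operator sg(·), the parameter symbols θ_f and θ_enc, the learning rates η_f and η_enc, and the plain gradient-descent form of the updates are notational assumptions made here for illustration; the optimizer actually used may differ.

```latex
\begin{aligned}
c_{t+1} &= \mathrm{RNN}(c_t, z_t), \\
\mathcal{L} &= \bigl\lVert f(c_t) - \mathrm{sg}\!\left(z_{t+\Delta t}\right) \bigr\rVert^2, \\
\theta_f &\leftarrow \theta_f - \eta_f \, \nabla_{\theta_f} \mathcal{L}, \qquad
\theta_{\mathrm{enc}} \leftarrow \theta_{\mathrm{enc}} - \eta_{\mathrm{enc}} \, \nabla_{\theta_{\mathrm{enc}}} \mathcal{L}, \qquad
\eta_f > \eta_{\mathrm{enc}}.
\end{aligned}
```

Because of sg(·), the target branch z_{t+Δt} is treated as a constant, so gradients reach the encoder and the recurrent network only through the prediction f(c_t).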

The use of stop-grad suggests that a circuit learning with RPL needs to compute a vector-based instructive learning signal. While we do not explicitly model the circuit-level mechanisms of how this could be implemented in the brain, excitation-inhibition balance is one possibility (Rossbroich et al., 2025). Finally, differences in learning rate could be implemented either structurally or functionally in the brain (see, for instance, Liu et al., 2025). In addition, activity normalization has been suggested as a canonical computation in biological neural circuits (Carandini et al., 2012).

What we will do: We will make sure to discuss these putative biological mechanisms in the revised manuscript.

(3.4) Recent work has shown that the capacity (size) of the predictor significantly influences the learned representations in a JEPA-type world model (Garrido et al., 2024). In simpler scenarios, a large enough predictor can allow a model to “memorize” dynamics rather than learning generalized equivariant features. It would be beneficial to see how the ratio of predictor size to encoder size affects the emergence of these features.

Thanks for raising this concern. We do not observe a noticeable difference in position and velocity decoding when changing the width or depth of the MLP predictor on the moving animals data. However, performance on rotation speed and orientation decoding scales with the width, but not the depth, of the predictor. This analysis excludes the effect of the integrator’s capacity, as it directly affects the dimensionality of the representations, even though it also effectively contributes to the prediction computation in RPL.

What we will do: We will include a figure showing how task performance varies with the predictor’s width and depth.

Methodological Clarifications

(3.5) The authors mention a contrastive learning comparison but provide few details. Since contrastive learning is primarily a technique to avoid collapse, it would be a more rigorous baseline if implemented within the same architecture as RPL to isolate the effect of the predictive objective.

Thanks for the question. The contrastive predictive learning (InfoNCE) baseline in Supplementary Table S1 already uses the same network model as RPL, as mentioned in the main text (l. 164).
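For readers less familiar with the contrastive baseline, the sketch below shows the generic InfoNCE loss (Oord et al., 2018) for a single prediction. It illustrates the general technique only: the dot-product similarity, the temperature, the toy embeddings, and the function name are assumptions and do not describe the implementation behind Supplementary Table S1.

```python
import math

def info_nce(prediction, candidates, positive_index, temperature=1.0):
    """InfoNCE loss for one prediction: negative log-softmax of the positive score."""
    # Dot-product similarity between the prediction and every candidate embedding.
    scores = [sum(p * c for p, c in zip(prediction, cand)) / temperature
              for cand in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[positive_index]

# Hypothetical 2-D embeddings: index 0 is the true future embedding (positive),
# the remaining entries play the role of negative samples.
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

good = info_nce([1.0, 0.0], candidates, positive_index=0)  # prediction matches the positive
bad = info_nce([0.0, 1.0], candidates, positive_index=0)   # prediction matches a negative
print(good < bad)  # -> True
```

Unlike the stop-grad objective of RPL, this loss needs negative samples to rule out the trivial constant solution, which is why a comparison within the same network isolates the effect of the objective.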

What we will do: We will describe the architecture of the non-linear predictor used for the InfoNCE baseline more explicitly in the Methods.

(3.6) In the PFC data comparison (Figure 7f), there appears to be a discrepancy where the local and global conditions show nearly identical results in PFC, while different dynamics in the model. It is unclear if this is a visualization error or a genuine model deviation.

Thanks for picking up on this subtlety in the experimental results. To clarify, it is a model deviation but an interesting one. The local and global responses do look quite similar in the original PFC data. They differ in that the global oddball (xY|xx and xx|xY) response has a secondary peak that encodes the presence of the global oddball, whereas the initial response is actually dominated by local oddball encoding (xY vs xx). Concretely, this results in the response to the xx|xY condition only showing up weakly in the data and at a time lag with respect to the initial local oddball response. Our model, however, does not show the transient initial response to local oddballs in the decoding direction for global oddballs. In a sense, the network model encodes the global oddball concept more robustly than is seen in the PFC data. That said, whether this indicates a genuine difference in representational strategies that needs to be further accounted for, or whether it is an issue stemming from limited sub-sampling of PFC neurons, remains unclear.

(3.7) The criteria for selecting specific model variables for comparison with V1 versus PFC are not explicitly defined. Clarification is needed on whether the same latent variables were used for both brain regions or if different layers were selected.

To clarify, the successor-like representations in human V1 and the abstract representations in macaque PFC come from two different experiments, so each involves different latent variables and requires a different RPL model. The architecture used for each experiment is detailed in the Methods, and the criterion for selecting each architecture was to use the simplest one expected to work given the task complexity. Throughout the paper, all representation analysis is done on the output of the integrator (c) unless stated otherwise. We hope this resolves the confusion.

References

Chen, Xinlei et al. (2021). “Exploring simple siamese representation learning”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.

Srinath Halvagal, Manu et al. (2023). “Implicit variance regularization in non-contrastive SSL”. In: Advances in Neural Information Processing Systems 36, pp. 63409–63436.

Mikulasch, Fabian A et al. (2026). Understanding Self-Supervised Learning via Latent Distribution Matching. arXiv: 2605.03517[cs.LG].

Furutachi, Shohei, Alexis D. Franklin, et al. (Sept. 2024). “Cooperative thalamocortical circuit mechanism for sensory prediction errors”. In: Nature 633.8029, pp. 398–406. issn: 1476-4687. doi: 10.1038/s41586-024-07851-w.

Furutachi, Shohei and Sonja B Hofer (2026). “Rethinking Predictive Processing”. In: Annual Review of Neuroscience 49.

Vasilevskaya, Anna et al. (2026). “A functional influence based circuit motif that constrains the set of plausible algorithms of cortical function”. In: bioRxiv. doi: 10.64898/2026.01.29.702557. eprint: https://www.biorxiv.org/content/early/2026/01/29/2026.01.29.702557.full.pdf.

Nejad, Kevin Kermani et al. (July 2025). “Self-supervised predictive learning accounts for cortical layer-specificity”. In: Nat Commun 16.1, p. 6178. issn: 2041-1723. doi: 10.1038/s41467-025-61399-5.

Ekman, Matthias et al. (Feb. 2023). “Successor-like representation guides the prediction of future events in human visual cortex and hippocampus”. In: eLife 12. Ed. by Morgan Barense et al., e78904. issn: 2050-084X. doi: 10.7554/eLife.78904.

Dayan, Peter (1993). “Improving generalization for temporal difference learning: The successor representation”. In: Neural computation 5.4, pp. 613–624.

Nøkland, Arild (2016). “Direct feedback alignment provides learning in deep neural networks”. In: Advances in neural information processing systems 29.

Lillicrap, Timothy P et al. (2016). “Random synaptic feedback weights support error backpropagation for deep learning”. In: Nature communications 7.1, p. 13276.

Zenke, Friedemann et al. (2018). “Superspike: Supervised learning in multilayer spiking neural networks”. In: Neural computation 30.6, pp. 1514–1541.

Bellec, Guillaume et al. (2020). “A solution to the learning dilemma for recurrent networks of spiking neurons”. In: Nature communications 11.1, p. 3625.

Illing, Bernd et al. (2021). “Local plasticity rules can learn deep representations using self-supervised contrastive predictions”. In: Advances in Neural Information Processing Systems 34.

Zihan, Wu S et al. (2026). “Can Local Learning Match Self-Supervised Backpropagation?” In: arXiv preprint arXiv:2601.21683.

Srivastava, Nitish et al. (2015). “Unsupervised learning of video representations using lstms”. In: International conference on machine learning. PMLR, pp. 843–852.

Ghaemi, Hafez et al. (2024). “Seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models”. In: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice.

Garrido, Quentin et al. (2023). “Self-supervised learning of split invariant equivariant representations”. In: arXiv preprint arXiv:2302.10283.

Hauri, Michael et al. (2026). “Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction”. In: arXiv preprint arXiv:2603.07083.

Oord, Aaron van den et al. (July 2018). “Representation Learning with Contrastive Predictive Coding”. In: arXiv:1807.03748 [cs, stat]. arXiv: 1807.03748.

Bardes, Adrien et al. (2024). V-JEPA: Latent Video Prediction for Visual Representation Learning.

Assran, Mido et al. (2025). “V-jepa 2: Self-supervised video models enable understanding, prediction and planning”. In: arXiv preprint arXiv:2506.09985.

Drozdov, Katrina et al. (2024). “Video representation learning with joint-embedding predictive architectures”. In: arXiv preprint arXiv:2412.10925.

LeCun, Yann (2022). “A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27”.

Asabuki, Toshitake et al. (2025). “Learning predictive signals within a local recurrent circuit”. In: Proceedings of the National Academy of Sciences 122.27, e2414674122. doi: 10.1073/pnas.2414674122. eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.2414674122.

Yerxa, Thomas et al. (2024). “Contrastive-equivariant self-supervised learning improves alignment with primate visual area it”. In: Advances in neural information processing systems 37, pp. 96045–96070.

Margalit, Eshed et al. (2024). “A unifying framework for functional organization in early and higher ventral visual cortex”. In: Neuron 112.14, pp. 2435–2451.

Bakhtiari, Shahab et al. (2021). “The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran Associates, Inc., pp. 25164–25178.

Niu, Julie Xueyan et al. (2024). “Learning predictable and robust neural representations by straightening image sequences”. In: Advances in Neural Information Processing Systems 37, pp. 40316–40335.

Hénaff, Olivier J et al. (2019). “Perceptual straightening of natural videos”. In: Nature neuroscience 22.6, pp. 984–991.

Rossbroich, Julian et al. (2025). “Breaking Balance: Encoding local error signals in perturbations of excitation-inhibition balance”. In: bioRxiv, pp. 2025–05.

Liu, Peng et al. (2025). “Layer-specific changes in sensory cortex across the lifespan in mice and humans”. In: Nature neuroscience 28.9, pp. 1978–1989.

Carandini, Matteo et al. (2012). “Normalization as a canonical neural computation”. In: Nature reviews neuroscience 13.1, pp. 51–62.

https://doi.org/10.7554/eLife.110895.1.sa0
