Modeling flexible behavior with remapping-based hippocampal sequence learning

Yoshiki Ito; Taro Toyoizumi

doi:10.7554/eLife.106506.1

Introduction

Humans exhibit highly flexible behavior. However, a major challenge in solving various tasks with one neural network is that the same external stimulus can have different meanings depending on the context. For example, the word “mouse” can mean either an animal or a PC device, depending on the context (Figure 1A). Therefore, for correct word recognition, the biological neural computation should not be based only on the word “mouse” alone, but also on the context it appears in. In experiments, it is reported that the hippocampus is one of the most important regions for contextual behavior. Hippocampal neurons show sequential activity (Buzsáki and Tingley, 2018) related to episodic memory (Burgess et al., 2002), the amount of reward (Ambrose et al., 2016), planning (Ólafsdóttir et al., 2018), and recall (Carr et al., 2011), and their representation depends on the context (Hasselmo and Eichenbaum, 2005). Additionally, hippocampal neurons exhibit reorganized neural activity called remapping, which does not purely reflect the change in the external stimuli but task structure (Jeffery et al., 2003), and subjective context (Sanders et al., 2020). However, how context-dependent sequential activity in the hippocampus is established through remapping and how it contributes to flexible behavior remain to be understood.

Schematic representation of our model.
A, An example of context-dependent cognition. Humans can understand the meaning of “mouse” (an animal or a computer input device) depending on the context. B, Our model involves two modules: the context selection module (X) and the hippocampal sequential module (H). chooses a context depending on the external stimuli and the input from H, and activates a sequence in H. This sequence is used for reward prediction. In addition, H sends predictive feedback about external stimuli to X. C, X compares the predictive input from H to the external stimuli. In case of a prediction error (a green cross mark), remapping occurs: the context representation in X is either switched or newly created and a different sequence in H is activated. D, Episodic segments represented in H are combined depending on rewards (purple arrows) and concatenated into task-dependent sequences. The sequences support action planning and enable predictions of future external stimuli and rewards. E, Mechanism of remapping in our model. Blue squares represent contextual states in the context selection module, and orange squares represent those in the sequential module and gray circles represent visible environmental states. Reward-independent synaptic connections are indicated by black arrows and reward-dependent synaptic connections are indicated by purple arrows. Here we consider a situation where an agent has experienced the environmental states S1 and S2, and the corresponding contextual states are already established in X and H. If the agent assumes it is in a contextual state associated with S1 that predicts external stimuli other than S2 but experiences S2, a prediction error arises (a green cross mark) and triggers remapping: a new contextual state associated with S2 indexed by β (green squares) is created, and the synaptic connections are potentiated between X and H (green arrows).

Several theoretical models have been proposed to explain how hippocampal activity depends on context. The first approach uses the structure of the environment. The Tolman-Eichenbaum Machine (Whittington et al., 2020) and the Clone Structured Cognitive Graph (George et al., 2021) account for context-dependent neural activities, such as splitter cells (Dudchenko and Wood, 2014) and lap cells (Sun et al., 2020), by introducing graphical structure stored within the network. However, these models entail optimization procedures like backpropagation or the expectation-maximization (EM) algorism, which are not considered biologically plausible. The second approach uses eligibility trace to explain how past experiences, i.e., temporal context, are integrated into hippocampal activity (Cone and Clopath, 2024). In this framework, the length of the temporal context is constrained by the time constant of the eligibility trace. Nevertheless, animals can flexibly estimate the current context using history of various lengths (Barnett et al., 2014), suggesting that hippocampal activity may not be bound by a fixed eligibility window. The third approach trains recurrent neural networks (RNNs) to replicate the dynamics of hippocampal activity. While previous works have explored hippocampal sequential activity for planning (Jensen et al., 2024; Pettersen et al., 2024) and hippocampal remapping for contextual inference (Low et al., 2023) separately, they have yet to elucidate how these two aspects jointly enable flexible behavior. A comprehensive model that can explain the formation of context-dependent hippocampal sequences of various lengths through remapping, while relying on a biologically plausible learning process, would thus provide valuable insights into the mechanisms underpinning flexible behavior.

We aim to understand how context-dependent hippocampal sequences can emerge from hippocampal remapping driven by prediction errors. Our key idea is as follows. When the external environment deviates from the expectations of the current subjective context, prediction errors arise and trigger remapping. This process recruits distinct subsets of neurons to encode the novel experience, thereby establishing separate contextual memories and enabling flexible goal-oriented behavior in response to sudden environmental changes. To demonstrate the capability of this idea, we constructed a computational model comprising two modules: a context-selection module that selects the appropriate context based on prediction errors, and a hippocampal sequence module that learns to generate neural activity sequences predicting future events by concatenating context-dependent episodic segments according to reward. Our model implements simple model-based reinforcement learning in ambiguous contexts, yielding flexible behavior using a biologically plausible synaptic plasticity rule. We show that it reproduces a range of context-dependent hippocampal activities as well as the impairments associated with specific brain lesion studies.

Finally, our model predicts a relationship between deficits in model-based behavior and sensory processing. Clinical research has reported that patients with schizophrenia (SZ) or autism spectrum disorder (ASD) often exhibit problems with both behavioral flexibility and sensory processing, including hyper- and hyposensitivity (Javitt and Freedman, 2015; Watts et al., 2016). These symptoms frequently co-occur, but the underlying reason remains unclear. Our model shows that the relative sizes of the neural populations in the sensory-processing region and the context-processing region within the context-selection module are important for contextual inference, suggesting that treatments targeting sensory processing could improve cognitive flexibility in some psychoses.

Results

As illustrated in Figure 1B-D, our model consists of two components: a context-selection module (X), which selects appropriate contexts, and a hippocampal sequence module (H), which generates neural activity sequences that predict future events. We use the Amari-Hopfield network (Amari, 1972; Hopfield, 1982) with Hebbian plasticity for X. X has two domains: a stimulus domain that represents external stimuli, and a contextual domain that represents subjective contextual information. This contextual domain allows the representation of multiple contextual states for a given external stimulus, such as different interpretations or associations of the stimulus. X can stably store multiple contextual states by creating attractors. When agents are at a starting point (i.e., a landmark), X initializes the neural activity of the contextual domain based on the external stimulus (see Materials and methods). When agents move to other locations, X receives predictive input from H and compares the predicted outcome with the actual external stimulus (Figure 1C). When the prediction error is small, X performs error correction via attractor dynamics. When a significant prediction error occurs, remapping is triggered, and X’s activity either shifts to another contextual state or generates a new one, thereby adding a new context (Figure 1E, see Materials and methods). Once X’s activity is set, it transmits the resulting output to H, which then activates an initial segment of H’s episodic sequence. H produces an episodic sequence corresponding to hippocampal replay (Davidson et al., 2009) or planning (Ólafsdóttir et al., 2018) based on its connectivity. For simplicity, we use a binary recurrent neural network for H, whose connectivity is updated by a three-factor Hebbian plasticity rule that depends on reward (see Materials and methods). Each replayed sequence is associated with actions and two predictive outcomes: expected reward value and predicted future external stimuli. At the beginning of learning, there are only short sequences that generate immediate actions and yield short-term predictions. As learning continues, the three-factor Hebbian plasticity rule concatenates these memory fragments, thereby creating longer sequences that reflect the task structure (Figure 1D). Thus, H learns to generate extended sequences that outline a course of actions and predict both reward and subsequent changes in the environment, forming a simple transition model for model-based reinforcement learning (Coulom, 2007). If a significant reward prediction error arises from a sequence, the agent explores a random action not specified by that sequence (see Materials and methods).

Splitter cells

Our model reproduces a range of hippocampal activity patterns that align with empirical data. First, we confirmed that our model reproduces the splitter cells reported in the hippocampus (Dudchenko and Wood, 2014). Splitter cells are a subset of hippocampal neurons that fire differentially on an overlapping segment of trajectories depending on where the animal came from, and/or where it is going. It is known that they do so based on information that is not present in sensory or motor patterns at the time of the splitting effect, but rather appear to reflect the recent past, upcoming future, and/or inferences about the state of the environment (Duvelle et al., 2023).

Experimentally, splitter cells are most often observed in an alternation task in a modified T-maze. Here, we simplified this task by using an environment with five discrete states, i.e. five discrete external stimuli (Figure 2A). In this environment, agents successfully solve this task by remapping, which creates different contextual states C2α and C2p at a task state S2 based on where the agents came from, and thereby enabling context-specific exploration of which state to go (S3 or S4) (Figure 2B).

Our model replicates the emergence of splitter cells.
A, Simplified alteration task diagram. B, A successful context map of our model. The state S2 is split into 2 different contextual states, C2α and C2β. C, The correct rate of our model. The error bar indicates the standard error of the mean (N = 40). D, The maximum number of states ahead that the agents planned (planning length) gradually increases over learning. Black lines indicate the planning length of each agent, and the red line is their average. e, Emergence of splitter cells in the hippocampus in the modified T-maze modification task (Wood et al., 2000)(Wood et al., 2000). F, Our model replicates the emergence of splitter cells in S2.

In our model, most agents can solve this task (Figure 2C). As learning progresses, the length of episodic memories increases, and eventually planning of the transition from one reward state to the next is possible (Figure 2D). Our model can be compared to the neural activity of the rats’ splitter cells in the hippocampus during the modified T-maze task. Our model successfully replicates the result describing splitter cells in Wood et al. (2000)(Wood et al., 2000) (Figure 2E), and contextdependent neural activity is reproduced at S2 (Figure 2F).

Lap cells

The emergence of splitter cells explored above has also been studied in previous work (Duvelle et al., 2023; Hasselmo and Eichenbaum, 2005; Katz et al., 2007). However, preparing a temporal context in advance is generally challenging for tasks where the number of required histories is unknown or changes dynamically, because preparing too few histories results in failing to solve the tasks, while preparing too many slows down the search for a solution. Instead of preparing temporal context of fixed length in advance, our model uses remapping that adds new contextual states whenever a prediction error arises. This approach enables on-demand creation of contextual states and accelerates solution-finding in dynamically changing tasks.

To show the advantage of our model, we demonstrate that our model replicates the emergence of lap cells (Sun et al., 2020). We set up a simplified discrete environment with a loop structure where the number of laps required to receive a reward varies (Figure 3A). Agents are initially rewarded for the shortest transitions through environmental states S1-S2-S4. After 20 trials, the environment changes, and the agents are rewarded for one lap transition, i.e., S1-S2-S3-S2-S4. It causes remapping that splits contextual states. For example, task state S2 is discriminated into C2α and C2β based on the lap, and contextual state transitions C1-C2α-C3α-C2β-C4β emerges. After another 15 trials, the environment changes again and the agents are rewarded for two laps, i.e., S1-S2-S3-S2-S3-S2-S4, or more. This reward rule cannot be represented by the existing shortest transition, C1-C2α-C4α, or the one lap transition, C1-C2α-C3α-C2β-C4β, requiring the addition of new contextual states. Again, remapping occurs, the contextual states for the second lap are prepared, and the rewarded transition of contextual states, i.e., C1-C2α-C3α-C2β-C3β-C2γ-C4γ, is reinforced (Figure 3B).

Our model replicates the emergence of lap cells.
A, Simplified 2-lap task diagram. Agents are rewarded for the shortest path (S1→S2→S4) for the initial 15 trials, for the 1-lap path (S1→S2→S3→S2→S4) for the next 15 trials, and for the 2 or more laps (S1→S2→S3→S2→S3→S2→S4, etc.) for the next 30 trials. B, A successful context map of our model. The state S2 and S4 are split into three contextual states, while S3 is split into two contextual states. C, The correct rate of our model. The error bar indicates the standard error of the mean (N = 40). D, The planning length gradually increases during learning, depending on the task demand. The black lines indicate the planning length of each agent, and the red line is their average. E, The comparison of lap cells in the hippocampus in the 4-lap task (Sun et al., 2020)(Sun et al., 2020) and our replicated results. F, The inhibition experiment of medial entorhinal cortex axons at CA1. ESR cells show a weak lap-specific correlation (ESR correlation) between light-on trials and light-off trials, while they show a strong spatial correlation between light-on trials and light-off trials (Left). Our model replicates the result qualitatively with the inhibition on and off (Right).

In our model, most agents can solve this task (Figure 3C). The episodic memory used for planning changes successfully depending on the environment (Figure 3D). This task is comparable with the 4- lap task for rats (Sun et al., 2020). In an environment where rats are rewarded for every four laps of a circuit, different hippocampal neurons fire for each lap. Our model replicates this result with the different hippocampal cells firing for different laps (Figure 3E). It is also reported that the inhibition of medial entorhinal cortex axons at CA1 attenuates the lap-specific activity (i.e., event-specific rate remapping (ESR)) without much affecting spatial encoding. Our model replicates this result by blocking the synaptic transmission from most ofneurons in the context domain of X to H (Figure 3F).

This task can also be solved by simply preparing temporal contexts with three histories, which is the minimal number to solve this task. However, it takes much longer to find the correct transition for solving the 1-lap task than our model because it involves an excessive number of states (Figure S1).

This result indicates that our model, which creates contextual states on demand, can perform better than the model with a fixed-length history.

Planning in stimulus-cued dynamic environment

In the real world, external stimuli dynamically change, and animals make plans and derive appropriate behavior by using the external stimulus as a clue. Here, we demonstrate that our model replicates key features of stimulus-related contextual behavior and its neural activity reported in experimental studies using remapping.

We consider the simplified environment of Ekman et al. (2022) (Ekman et al., 2022) shown in Figure 4A. In initial environment I, agents start from S0 and go to a state where one of two different external stimuli S2 or S3 is presented with different probability (p=0.8, 0.2 respectively). When it is S2, agents can get a reward at S4, whereas when it is S3, they can get a reward at S5. After 30 trials, the environment changes to II and the initial stimulus is switched to S1, not S0. In this environment, agents are rewarded at S5 and S4 when the external stimulus is S2 and S3, respectively (i.e., Reversal).

In such a stochastic environment, the agents need to switch transition rules according to the external stimuli after the first transition. For instance, in environment I, two rewarded transitions exist: a more likely one (C0-C2α-C4α) and a less likely one (C0-C3α-C5β) (Figure 4B). When external stimuli indicate that the less likely transition has occurred, remapping in the hippocampus enables replanning of an appropriate action. As a result, the correct transition models are created for both environment I and environment II (Figure 4B).

Our model replicates key features of human neural activity in dynamic environments.
A, Simplified task diagram of Ekman et al. 2022. In environment I, agents start at S0 and move to S2 or S3 randomly (S2 for p = 0.8 and S3 for p = 0.2) and receive a reward in S4 when they come from S2 and in S5 otherwise. In environment II, agents start at S1 and move to S2 or S3 randomly (S2 for p = 0.2 and S3 for p = 0.8) and receive a reward in S5 when they come from S2 and in S4 otherwise. The environment switches between the two every 30 trials. B, A successful context map of this task. S2 and S3 are split into two contextual states, and S4 and S5 are split into four contextual states. The hippocampal connections are built for rewarded conditions only. C, The probability of choosing S4. The red/blue line shows its mean when S2/S3 is presented. The error bar indicates the standard error of the mean (N = 40). D, The planning length gradually increases over learning and converges to 3. The black lines indicate each agent’s planning length, and the red line is their average. E, The probability of generating a specific planning sequence at S0 or S1. The expected states (S2 or S3) are modulated according to the environment. F, Our model behavior is similar to the human fMRI result of Ekman et al. (2022)(Ekman et al., 2022). G, Simplified task diagram of Julian & Doeller (2021)(Julian and Doeller, 2021). The training phase is the same as A, but the contextual stimuli of Square (Sq) or Circle (Ci) are initially presented and the probability of S2 and S3 is equal. In the test phase, either one of Sq, Ci or the mixture stimuli of Sq and Ci (Squircle: SC) are presented, and the agent transfers following their faith. Reward feedback is not given in the test phase. H, The transition probability under Sq context (Left) and Ci context (Right). I, The transition probability under SC context of the human patients in Julian & Doeller (2021)(Julian and Doeller, 2021) (Left) and our model (Right). J, Comparison of behavioral decoding accuracy from hippocampal fMRI activity of Julian & Doeller (2021)(Julian and Doeller, 2021) (Left) and hippocampal neural activity of our model (Right). Our model replicates the worse decoding accuracy in SC context (Bottom) than Sq or Ci context (Top).

In our model, most agents can learn to make appropriate transitions depending on the external stimuli. Importantly, they show a one-shot switch when the agents experience the same environment for the second time (Figure 4C). The length of the planning sequence used in the actual transition converges to between 2 and 3 because agents reselect the correct planning sequence in case of unexpected external stimuli (Figure 4D). The probability of predicted external stimuli matches well with the actual probability (Figure 4E). This result replicates that of Ekman et al. (2022) (Ekman et al., 2022), who showed that the sequence generation in the human hippocampus reflects the probability of the external stimuli (Figure 4F).

Julian and Doeller (2021) (Julian and Doeller, 2021) studied a similar task structure that we model here. In the training phase, external stimuli associated with the Square (Sq) or Circle (Ci) arena are presented first. Then, one of two target objects is randomly specified by S2 or S3 with equal probability. Depending on the arena type, the agents decide to transit to S4 or S5 to find it. This state transition structure is the same as Figure 4A. In the test phase, Sq, Ci, or their morphed version, Squircle (SC) arena, is presented. The agents transit depending on the subjective context of either Sq or Ci. Reward feedback is not given in the test phase (Figure 4G).

Our model successfully learns the context-dependent behaviors of Sq and Ci (Figure 4H). Additionally, our model replicates the experimental results of the mixed Sq- or Ci-like behaviors under SC (Figure 4I). Under SC, three reconstruction cases are observed in X: Sq context reconstruction, Ci context reconstruction, and a new contextual state generation due to X’s failure to convergence (see Materials and methods). In the last case, the agents make a random transition by recruiting new hippocampal neurons. Therefore, behavioral decoding based on hippocampal neural activity is lower than that under the Sq and Ci conditions (Figure 4J). This result is consistent with the findings of Julian and Doeller (2021).

Prediction related to sensory processing and flexible behavior

Our model does not only replicate a variety of experimental results, but also make predictions. In clinical research, it has been reported that issues related to behavioral flexibility and sensory processing often co-occur in certain psychiatric conditions, including schizophrenia (SZ) (Javitt and Freedman, 2015) and autism spectrum disorder (ASD) (Watts et al., 2016). Many studies have reported that both symptoms are linked to the dysfunction of the prefrontal cortex (PFC) (Kaplan et al., 2016; Watanabe et al., 2012); however, the reasons for their cooccurrence are not yet fully understood.

We assume that this dysfunction corresponds to hypo-/hyper-representation of context information in X. To investigate this hypothesis, we altered the ratio of neurons in the context domain and sensory domain in X in our model. We used the same task described in Figure 4A with equal probability transitions to S2 and S3 (Figure 5A). The agents with the standard context-stimulus ratio successfully solve this task by identifying the correct episodic sequence, whereas those with altered contextstimulus ratios fail (Figure 5B). When the stimulus domain is relatively underrepresented, the correct context cannot be inferred from the given stimuli, resulting in inaccurate distinctions of different external stimuli (hallucination-like behavior) and an inability to switch back to the appropriate behavior rapidly. Associated prediction errors often generate excessive contextual states. Occasionally, multiple of them are assigned to the same hippocampal neurons, leading to ambiguity (see Materials and methods). In contrast, when the context domain is relatively underrepresented, the capacity of the Amari-Hopfield network to store distinct contexts is reduced, causing frequent failure to reconstruct the appropriate contextual state and persistent behavior under the default contextual state. Thus, our model predicts a relationship between sensory processing and behavioral flexibility.

Model prediction about the relationship between sensory processing and flexible behavior.
A, Task diagram. The structure is the same as Figure 4, but the probability of S2 and S3 is equal. B, The result of perturbation about the neuron ratio of stimulus domain in the cortex. (Left) We tested three stimulus neuron ratios; 2.5% for SZ, 16.7% for control and 50% for ASD. (Middle) The probability of choosing S4 is plotted for the task performance. SZ model fails to show one-shot switch for the second experience of the environment I and II, while ASD model shows an impaired task performance mainly to the environment II. (Right) The result of context calculation is plotted. The total number of context calculations is plotted in black, the number of wrong stimulus context reconstruction (hallucination-like) is plotted in green, the number of reconstruction fail (default network usage) is plotted in red, and the number of new context preparation is plotted in yellow.

Discussion

In this study, we proposed a simple, model-based reinforcement learning model equipped with two functional modules: a hippocampal sequence module and a context-selection module. We introduced prediction error-based remapping as a key for generating context-dependent sequential activity in hippocampus and enabling flexible behavior. This mechanism is biologically plausible, as it is observed in the hippocampus (Bostock et al., 1991) and in some cortical regions (Castegnetti et al., 2021). Our model could simulate a variety of context-dependent sequential representations in hippocampus such as splitter cells (Wood et al., 2000), lap cells (Sun et al., 2020), probabilistic model selection (Ekman et al., 2022), and contextual inference (Julian and Doeller, 2021), without taskdependent parameter tuning. Furthermore, our model predicted a mechanistic explanation for the cooccurrence of deficits in sensory processing and flexible behavior. This result supported clinical reports that psychosis can change the attractor dynamics in the hippocampus (34) and treatments for sensory processing helped restore flexible behavior in some psychoses (Andelin et al., 2021; Javitt and Freedman, 2015; Pfeiffer et al., 2011). To the best of our knowledge, this is the first model that describes the formation of context-dependent hippocampal activity through remapping and its contribution to flexible behavior.

Although remapping is a widely known phenomenon, its mechanism remains under debate. We used the Amari-Hopfield network as the context-selection module to distinguish multiple contextual states that share the same external stimuli, and to reconstruct them via attractor dynamics from partial observations. We propose two advantages of this associative memory model. First, it can represent different contexts under the same external stimuli depending on the feedback from H to implement rapid behavioral switching without requiring synaptic changes. The second advantage is its ability to infer a contextual state using the associative memory mechanism. This property might occasionally yield a non-trivial contextual state based on past experiences. Expanding upon our model with more sophisticated associative memory search mechanisms could enable creative behavior.

We speculate that the context-selection module is implemented across multiple brain regions with varying degrees of resolution, including a part of the hippocampus and certain cortical regions. First, CA1 receives inputs from both CA3, which exhibits attractor dynamics reflecting sensory input, and the entorhinal cortex, which represents context information. Consequently, CA1 might calculate prediction errors between them (Zhao et al., 2022). Second, PFC has been reported to retain contextdependent attractors, which reflect working memory (D’Ardenne et al., 2012), attention (Siegel et al., 2015), and confidence (Wynn and Nyhus, 2022), and to send inputs to the hippocampus via the nucleus reuniens. In addition, the PFC computes prediction errors that might trigger remapping. Specifically, reward-related prediction errors are computed in the orbitofrontal cortex (OFC) (Garvert et al., 2023; Stalnaker et al., 2014), anterior cingulate cortex (ACC) (Seo and Lee, 2007) and ventromedial PFC (Rehbein et al., 2023), whereas stimulus-related prediction errors are calculated in the ACC (Ide et al., 2013) and dorsolateral PFC (Masina et al., 2018; Zmigrod et al., 2014). These neural circuits likely coordinate to estimate the current context and select the appropriate representation in the hippocampus via remapping. Our modeling of the context-selection module captures this core functionality in a simplified manner. Incorporating more elaborate features, such as multiple hierarchies (Rao, 2024), in future studies might help explain a broader range of experimental results.

To validate our model, we propose three experiments. First, our model posits that an error about the context triggers remapping. The OFC is known to be active when reward-related prediction error occurs (Banerjee et al., 2020), and hippocampal remapping is suggested to be induced by the entorhinal cortex, especially its lateral part (Latuske et al., 2017). Because a direct projection exists from the OFC to the lateral entorhinal cortex (Kondo and Witter, 2014), this input might critically influence hippocampal remapping. Second, our model suggests that the prediction error about the environment would induce a shift from place-cell encoding to lap-cell encoding in the hippocampus (Figure 3). Third, our model proposes two types of prediction error; one is the conventional prediction error that updates the synaptic weights within the context, and the other is the prediction error about the context that triggers remapping in the cortex and the hippocampus. How these two different prediction errors are represented in neural circuits will deepen our understanding of the neural basis of flexible behavior.

Our model also has limitations. First, there are context-dependent tasks that our model cannot solve. Although our model learns to separate contextual states, it does not combine them; consequently, we did not consider simulating the environment in which the number of hidden states decreases over time. Greater flexibility might be achieved by integrating both sensory and contextual information within certain neurons (e.g., Figure S2). Second, the resolution at which our model should distinguish different contextual states is hand-tuned in this work. However, this resolution must be adjusted autonomously. Introducing hierarchical representations with multiple levels of resolution might help facilitate such adjustments. Third, our model assumed that only the hippocampus projects to the midbrain for reward prediction of sequential plans. However, there are projections from other brain regions, including the cortex, to the midbrain that are also involved in reward prediction (Jo and Mizumori, 2016). How these additional projections influence model-based behavior, especially in the case of hippocampal lesions, remains beyond the scope of this work. Finally, explicitly modeling the input from grid cells that encode geometric task structure (Krupic et al., 2015) might enable more sophisticated planning (e.g., discovering the shortest path).

Materials and methods

Simulation environment

We conducted all simulations and post-hoc analysis using a custom-made Python code. The source code is provided in Supplementary data.

Model description

Overview

Below, we introduce a model that describes the acquisition of model-based reasoning. Our model consists of two components: the context-selection module (X) and the sequence module (H). For simplicity, agents move through discrete environmental states characterized by distinct external stimuli. The agents execute actions specified by a hippocampal sequence. To generate a sequence, the agents perform state estimation by the context-selection module and activate a corresponding hippocampal neuron. Then, this hippocampal neuron initiates sequential activity based on hippocampal synaptic connectivity. Each hippocampal sequence represents a planned course of action and is used to predict a series of external stimuli. The agents follow the plan unless remapping (see Remapping section) or exploration (see Exploration section). The hippocampal sequence from which actions are generated is updated upon a reward. As the agents become familiar with the environment, hippocampal sequences that enable future predictions to become longer, and state estimation by the context-selection module becomes less frequent. The algorithmic flow chart of our model is described in Figure 6.

The algorithmic flow chart of the model.
Square boxes show the manipulation explained in Method, while the gray circles show if bifurcation with yes for ochre arrows and no for blue arrows. Synaptic weight update is indicated in the pink boxes.

Context-selection module

We model the context-selection module as Amari-Hopfield network (Amari, 1972; Hopfield, 1982) of N = 1200 binary neurons, whose activity is described by vector X. X consists of two domains: stimulus domain X^stim and context domain X^cont. The neuron ratio in the stimulus domain over the whole neurons dim(X^stim)/N is 16.7% for the control condition, 2.5% for the SZ condition, and 50% for the ASD condition. Note that dim describes the dimensions of a vector.

When the agents visit a task state for the first time, the X’s activity is set to

where converter function f(ξ^stim) = binary(Aξ^stim > a) returns a binary vector computed from dim (X^cont) by dim (X^stim) matrix A with independently and identically distributed unit Gaussian entries and scalar threshold a chosen so that f(ξ^stim) consists of half 1 and half 0 elements. This contextual state is set as a default pattern, and used in case associative memory dynamics fail to converge, as we explain below.

From the second visit of each task state after completing actions according to a hippocampal sequence, the contextual state is determined by associative memory dynamics of the Amari-Hopfield network to produce a hippocampal sequence. We adopt two ways of initialization: history-based and landmark based. We use the former calculation when the synaptic weights from the recently activated hippocampal neuron to have already been learned. We use the latter calculation when these weights have not been learned and the agents are at the landmark (at the initial state of the task environment in this manuscript). When these weights have not been learned and the agents are not at the landmark, which often happens after remapping (see Remapping section), a new contextual state is prepared and stored in the Amari-Hopfield network without running the associative memory dynamics (see below). The landmark-based calculation starts from the initial state of the Amari-Hopfield network

where random indicates a random binary vector consisting of half 0 and half 1 elements. This randomness in the initial condition helps in learning a tree structure with multiple branches from the same state upon remapping. The history-based calculation starts from the initial state of the Amari-Hopfield network

where binary represents the indicator function that takes 1 if the argument is true and 0 otherwise, H is the binary vector of hippocampal neural activity in the previous state, and dim(X) by dim (H) matrix W^XH represents the synaptic weights from H to X (see Synaptic weight update section for how W^XH changes).

Then, the contextual state is updated according to the associative memory dynamics:

where θ = 0.5 , X⁰ = 0.5 , and dim(X) by dim(X) matrix W^XX represents synaptic weights of context-selection module (see Synaptic weight update section for how W^XX changes). These dynamics end up either as a successful or failed recall. A recall is defined as successful if X converges within 50 iterations, and its stimulus domain X^stim becomes identical to ξ^stim. If X fails to converge within 50 iterations, the contextual state is set to the default state defined in (eq. 1). If X converges within 50 iterations but the stimulus domain X^stim of the converged X is different from ξ^stim, agents consider the external stimuli as new stimuli, and new contextual state is prepared and stored in the Amari-Hopfield network, i.e. X is set to be a new contextual pattern X = (ξ^stim, random)^T and the synaptic weights W^XX are updated (see Synaptic weight update section).

After X is set, the agents randomly generate a hippocampal sequence reflecting it (see Hippocampus section). The agents evaluate this sequence that encodes a course of actions and act according to it (see Sequence selection). This randomness in the sequence generation facilitates the exploration behavior of the agents, which is important for reinforcement learning, but also adds noise to the input from the sequence module to the context-selection module in the history-based computation. The associative memory dynamics help retrieve more appropriate contextual states both in the sensory and context domains. In addition, a successful or failed recall facilitates the production of a new context and the reuse of the default context, respectively. This difference becomes critical in explaining the SZ and ASD phenotypes.

Hippocampus

The hippocampus produces sequential activity probabilistically based on the contextual state computed above. Starting from the seed hippocampal neuron directly activated by the contextual state, the next hippocampal neuron is iteratively activated with a probability proportional to the synaptic weights from the previously activated hippocampal neuron. Therefore, the same contextual state could generate diverse sequences.

Hippocampal neurons initially receive input vector W^HXX_k, where W^HX is the synaptic weight matrix from X to the hippocampus, and X_k is the contextual state at time step k. Only the neuron that receives the strongest input is activated, whose index is described as

(see Synaptic weight update section for how W^HX changes), where the tilde mark indicates a neuron index.

Our model has two types of hippocampal neurons: state-coding and transition-coding types. The indices of neurons belonging to these types are denoted as and , respectively. The statecoding neurons receive input from X and represent the current state, while the transition-coding neurons send output to X and indicate the next state the agents can transition. When the agents experience this contextual state X_k for the first time, is randomly chosen and the synaptic weight from to is set to 1. From the second experience of the contextual state X_k, the corresponding hippocampal neuron initiates a sequence of hippocampal activity with a non-negative integer τ, where the next neuron is recursively chosen with a probability vector proportional to

where describes a vector of intra-hippocampal synaptic weights from neuron and w₀ = 0.3 is the effective threshold. The sequential activity can stop at a transition-coding neuron according to two conditions: when all the synaptic weights are equal to or below 0.01 or when the reward value function of the lastly activated transition-coding neuron becomes positive (see Reward prediction section).

Because the input source for the state-coding neuron and the transition coding neuron differ (the former is selected from X, while the latter is selected from H), the same hippocampal neuron could occasionally be used for both state-coding and transition-coding across different contextual states. This is evident when an excessive number of contextual states are prepared especially in the SZ condition. This phenomenon degrades state estimation at X (eq.3).

The synaptic connection from a state-coding neuron to a transition-coding neuron is formed in a reward-independent manner as described above, whereas the connection from a transition-coding neuron to a state-coding neuron is established in a reward-dependent manner (see Synaptic Weight Update section). Consequently, when animals receive few rewards during the initial exploration phase, minimal sequences with τ = 0 are constructed. As animals discover rewarding behaviors, these minimal sequences join, and eventually, agents anticipate the rewarding transition ahead.

Reward prediction

Each hippocampal sequence H is associated with rewards, perhaps via the operation of the midbrain. Reward value function which depends on the lastly activated transition-coding hippocampal neuron of the sequence, is updated every time the agents receive reward R > 0 according to

with learning rate α = 0.15. The sequence value associated with mirrors except when it is suppressed by this neuron’s no-good indicator (cross marks in Figure 1C), namely,

where suppression threshold θ_NG is set to 0.7. When the no-good indicator is active, i.e., ≥ θ_NG the sequence value becomes transiently negative.

These neurons’ no-good indicators change when a reward is presented. The no-good indicator of the lastly activated hippocampal neuron instantaneously drops to 0 when the reward is greater than the reward value function, i.e., R > but instantaneously increases by 1 otherwise. In addition, the no-good indicators of all hippocampal neurons gradually decay according to

with multiplication factor γ = 0.7 when the reward is less than the reward value function, i.e., R < .

Sequence selection

The agents generate a contextual state and, based on it, generate a random hippocampal sequence ℌ. The transition-coding hippocampal neurons [, …, ] in the sequence specify the series of states the agents plan to move to in the next τ steps in its task environment, unless remapping happens. Below, we describe how the agents generate multiple hippocampal sequences and choose the one to follow. The last hippocampal neuron of the sequence ℌ informs its sequence value (see Reward prediction section). When is positive or to test a new context, the agents execute actions according to this sequence. Otherwise, the agents reject this hippocampal sequence. In case of rejection, the agents repeat generating another hippocampal sequence (using a different random seed for the landmark-based initialization) for up to nine attempts. If none of the nine sequences have positive , one is selected randomly, excluding that with the lowest sequence value. The agents continue a course of actions following the accepted sequence till its end unless remapping happens, and the sequence is interrupted (see Remapping section). After completing the final action specified by the sequence, the agents repeat the whole procedure in a new environmental state.

Remapping

Remapping can occur while the agents execute a course of actions following a hippocampal sequence. We refer to remapping as the shift of X’s activity to another contextual state or generate a new one under the same external stimuli. Upon the course of actions following hippocampal sequence ℌ, the stimulus domain ofthe input binaiy(W^XHH > 0) predicts the next external stimulus (without running associative memory dynamics), which may differ from the actual one, ξ^stim. When this happens at time k + 1, remapping occurs, and the synaptic weights from a transition-coding hippocampal neuron to state-coding hippocampal neurons are modified following the steps below.

The hippocampal sequence is interrupted between the transition , and the corresponding synaptic weight is weakened (see Synaptic weight update section).
If the transition-coding neuron projects to state-coding neurons other than these state-coding neurons’ predictions about external stimuli are examined. If there exists one that predicts the actual external stimuli with an error less than the remapping threshold of θ_remap = 5 bit, this neuron is activated, and the contextual state X is set based on its input (see Context-selection module section). If there is no such state-coding neuron, a new contextual state is set as X = (ξ^stim, random)^T with the synaptic weights W^XX updated (see Synaptic weight update section).
A hippocampal neuron is activated based on the new X following (eq. 5), and the synaptic weight is strengthened between the interrupted transition-coding hippocampal neuron and the newly activated state-coding hippocampal neuron (see Synaptic weight update section).

Exploration

To gain information on the environment, the agents perform exploration. We refer to exploration as a random action not specified by the selected sequence. Exploration can occur with exploration probability p whenever the agents enter an environmental state with the number of transition candidates greater than the number of stored sequences initiating from the corresponding state-coding hippocampal neuron. The exploration probability is generally p = 0.3 but increases to certainty (p = 1) if the agents are taking actions following a sequence with a negative sequence value, which happens when its no-good indicator is active. In case of this exploration, one of the unconnected transitioncoding hippocampal neurons is randomly activated, and the agents take a random transition. Then, synaptic weights of H and X are updated (see Synaptic weight update section).

Synaptic weight update

We used a Hebbian learning rule to update the synaptic weight matrix W^XX only for the first time contextual state X is settled:

We also used a basic Hebbian learning rule for updating synaptic weights between X and H. Again, only for the first time a hippocampal neuron is activated according to (eq. 5) in response to contextual state X_k, synaptic weights are updated as

where H^(s) and H^(T) are the state-coding and transition-coding hippocampal activity vectors, respectively, whose elements take 1 for the activated neuron of the corresponding type and 0 for the others. Similarly, is the transition-coding hippocampal activity vector of the previous hippocampal sequence, where the element corresponding to the last transition-coding neuron takes 1, and others take 0. Learning rate η = (N + 1)/2 and offset X¹ = N/(N + 1) are chosen to achieve good association dynamics in the context-selection module. These synaptic weights change within the bound W^XH, W^HX ≤ 1/2.

We used different learning rules for the intra-hippocampal synaptic weights depending on the coding types. The initial synaptic weights are all w₀, and these weights change within the bound 0 ≤ W^HH ≤ 1. State-coding to transition-coding synapses are constantly updated when and are activated as

The second term describes Hebbian potentiation, and the third term describes hetero-synaptic depression between non-active presynaptic neurons and the active postsynaptic neuron. Note that we assume hetero-synaptic depression only upon the initial establishment of the synaptic connection between the two hippocampal neurons. Transition-coding to state-coding synapses involved in ℌ are constantly updated when the agents receive reward (R > 0) and and are involved in ℌ according to

The second term describes Hebbian potentiation that modifies the weight toward R − w₀, and the third term describes hetero-synaptic depression.

In addition, if remapping happens at the sequence location between and , the synaptic weight from to is weakened by −α, while that from to the activated state-coding hippocampal neurons is strengthened by α(0.65 − ) binary( < 0.65).

Considering the memory capacity ofthe Amari-Hopfield Network with correlated patterns, the number of memorizable contextual states sharing the same external stimulus is below 8. If this condition is violated, to prevent overloading the Amari-Hopfield network, the contextual state X that has never produced hippocampal sequences with a sequence value more than 0.7 induces a forgetting process as

This process represents forgetting of reward-unrelated episodic memory.

Model-free learning with temporal contexts

To highlight the advantage of our model, we compared it to the Q-learning with temporal contexts, namely, the state is defined by the recent transition history of task state. The number of histories is changed from 0 to 3. In the Q-learning, the action value for a temporal state s_t to the next s_t+1 is updated as

where the initial Q value is 0, learning rate α = 0.4 , the discount factor γ = 0.6 and the taskdependent reward function R = 100 for the rewarded transition and R = 1 for else. Next state selection policy π is set to be proportional to Q value as

Inhibition experiment

To replicate the inhibition experiment of medial entorhinal cortex axons at CA1, we inhibit 98.5% of the input from the context domain of X to H. After the 2-lap task in Figure 3, we observed the hippocampal activity responding to each contextual state with or without this inhibition. ESR correlation is calculated based on the hippocampal activity of each lap, while the spatial correlation is calculated based on that of space. To avoid nan value when calculating correlations, we assumed that the activity of hippocampal cells without firing would have a random spontaneous activity between 0 and 0.1. Note that this operation does not significantly affect the result.

Simplified 2-lap task with model-free learning with temporal contextual states.
The contextual states are defined by the composition of the current state and n back states. It requires at least 3 back states to complete this task, but the correct rate of 3 back states is worse than our model.

Reward-dependent plasticity when sensory and contextual encoding neurons coexist in hippocampus.
A, Schematic figure of how sensory and contextual encoding neurons can coexist in the hippocampus. Hippocampal neurons that receive synaptic input mainly from the stimulus-encoding region have sensory encoding, while those from the context-encoding region have contextual encoding. B, How the hippocampal network evolves when sensory and contextual encoding neurons coexist in the 1- round task. This task requires contextual encoding, otherwise agents cannot distinguish between the first and second visit of S2. After 100 trials of random exploration in this area, the network between sensory encoding hippocampal neurons (indicated by the orange square) does not increase synaptic weights, while that between relevant context-encoding hippocampal neurons increases synaptic weights. c, How the hippocampal network evolves when sensory and contextual encoding neurons coexist in the ignore task. In this task, contextual encoding is not necessary because agents receive a reward at S4 independent of past states or latent variables. In contrast to the 1-round task, the network between sensory encoding hippocampal neurons (indicated by the orange square) increases the synaptic weights as well as that between context encoding hippocampal neurons.

Data and materials availability:

All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. All source code is provided in https://github.com/toppo365/flexiblemodel.git.

Acknowledgements

Funding: The study was supported by RIKEN Center for Brain Science, the JST CREST program JPMJCR23N2, and RIKEN TRIP initiative (RIKEN Quantum).

Note

This reviewed preprint has been updated to add details for a corresponding author.

Significance of findings

Strength of evidence

Abstract

Introduction

Schematic representation of our model.

Results

Splitter cells

Our model replicates the emergence of splitter cells.

Lap cells

Our model replicates the emergence of lap cells.

Planning in stimulus-cued dynamic environment

Our model replicates key features of human neural activity in dynamic environments.

Prediction related to sensory processing and flexible behavior

Model prediction about the relationship between sensory processing and flexible behavior.

Discussion

Materials and methods

Simulation environment

Model description

Overview

The algorithmic flow chart of the model.

Context-selection module

Hippocampus

Reward prediction

Sequence selection

Remapping

Exploration

Synaptic weight update

Model-free learning with temporal contexts

Inhibition experiment

Simplified 2-lap task with model-free learning with temporal contextual states.

Reward-dependent plasticity when sensory and contextual encoding neurons coexist in hippocampus.

Data and materials availability:

Acknowledgements

Note

References

Article and author information

Author information

Yoshiki Ito

Taro Toyoizumi

Author Notes

Version history

Cite all versions

Copyright

Metrics