Learning predictive cognitive maps with spiking neurons during behavior and replays

  1. Jacopo Bono
  2. Sara Zannone
  3. Victor Pedrosa
  4. Claudia Clopath  Is a corresponding author
  1. Department of Bioengineering, Imperial College London, United Kingdom

Abstract

The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, mathematically and numerically showing that it implements the TD-lambda algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, smoothly moving from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, in both aiding learning in novel environments and finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.

Editor's evaluation

This is an important article that leverages a spiking network model of the hippocampal circuit to show how spike-time-dependent plasticity can implement predictive reinforcement learning and form a predictive map of the environment. The authors provide a convincing and solid framework for understanding the prediction based learning rules that may be employed by the hippocampus to optimize an animal's behavior. This paper will be of interest to theoretical and experimental neuroscientists working on learning and memory as it provides new ways to connect computational models to experimental data that has yet to be fully explored from a reinforcement learning perspective.

https://doi.org/10.7554/eLife.80671.sa0

Introduction

In the mid-twentieth century, Tolman proposed the concept of cognitive maps (Tolman, 1948). These maps are abstract mental models of an environment which are helpful when learning tasks and in decision making. Since the discovery of hippocampal place cells, cells that are activated only in specific locations of an environment, it has been believed that the hippocampus can provide the substrate to encode such cognitive maps (O’Keefe and Dostrovsky, 1971; O’Keefe and Nadel, 1978). More evidence for the role of the hippocampus in behavior was found in numerous experimental studies, such as the seminal water maze experiments (Morris, 1981; Morris et al., 1982) and radial arm maze experiments (Olton and Papas, 1979), as well as evidence of broader information processing beyond cognitive maps alone (Wood et al., 1999; Eichenbaum et al., 1999; Aggleton and Brown, 1999; Wood et al., 2000).

While these place cells offer striking evidence in favour of cognitive maps, it is not clear what representation is actually learned by the hippocampus and how this information is exploited when solving and learning tasks. Recently, it was proposed that the hippocampus computes a cognitive map containing predictive information, called the successor representation (SR). Theoretically, the SR framework has several computational advantages, such as efficient learning, simple computation of state values, fast relearning when the rewards change, and flexible decision making (Dayan, 1993; Stachenfeld et al., 2014; Stachenfeld et al., 2017; Russek et al., 2017; Momennejad et al., 2017). Furthermore, the SR is in agreement with experimental observations. Firstly, the firing fields of hippocampal place cells are affected by the strategy used by the animal to navigate the environment (known as the policy in machine learning), as well as by changes in the environment (Mehta et al., 2000; Stachenfeld et al., 2017). Secondly, the SR predicts that reward revaluation — the ability to recompute the values of the states when rewards change — should be more effective than transition revaluation, in line with behavioral findings (Russek et al., 2017; Momennejad et al., 2017).

In this work, we study how this predictive representation can be learned in the hippocampus with spike-timing dependent synaptic plasticity (STDP). Using STDP at the mechanistic level, we show that the learning is equivalent to TD(λ) on an algorithmic level. The latter is a well-studied and powerful algorithm known from reinforcement learning (Sutton and Barto, 1998), which we will discuss in more detail below.

Our model can thus learn over a behavioral timescale while using STDP timescales in the millisecond range. We show mathematically that our proposed framework smoothly connects a temporally precise spiking code akin to replay activity with a rate-based code akin to behavioral spiking. Subsequently, we show that the delay-discounting parameter γ allows us to treat time as a continuous variable, so that we do not need to discretize time as is usual in reinforcement learning (Doya, 1995; Doya, 2000). Moreover, the delay discounting in our model depends hyperbolically on time but exponentially on state transitions. We show how the γ parameter can be modulated by neuronal firing rates and neuromodulation, allowing state-dependent discounting and in turn enabling richer information in the SR, such as the encoding of salient states, landmarks or reward locations. Finally, replays have long been speculated to be involved in learning models of the environment, supported by experiments (Johnson and Redish, 2007; Pfeiffer and Foster, 2013; Kay et al., 2020) and models (Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012; Kubie and Fenton, 2012). Here, we investigate how replays could play an additional role in learning the SR cognitive map. Exploiting properties of TD(λ), we show how replays can be used to achieve both low bias and low variance, translating into both quicker initial learning and convergence to a lower error. We also show how replays can be used to learn offline; in this way, policies can be refined without the need for actual exploration.

Our framework allows us to make predictions about the roles of behavioral learning and replay-like activity and how they can be exploited in representation learning. Furthermore, we uncover a relation between STDP and a higher level learning algorithm. Our work therefore spans the three levels of analysis proposed by Marr, 2010. On the implementational level, our model consists of a feedforward network of excitatory neurons with biologically plausible spike-timing dependent plasticity. On the algorithmic level, we show that our model learns the successor representation using the TD(λ) algorithm. On the computational theory level, our model tackles representation learning using cognitive maps.

Results

Cognitive maps are internal models of an environment which help animals to learn, plan and make decisions during task completion. The hippocampus has long been thought to provide the substrate for learning such cognitive maps (O’Keefe and Dostrovsky, 1971; O’Keefe and Nadel, 1978; Morris, 1981; Morris et al., 1982; Wood et al., 1999; Eichenbaum et al., 1999), and recent evidence points towards a specific type of representation learned by the hippocampus, the successor representation (SR) (Stachenfeld et al., 2017).

The successor representation

In this section, we will give an overview of the successor representation and its properties, especially geared toward neuroscientists. Readers already familiar with this representation may safely move to the next section.

To understand the concept of the successor representation (SR), we can consider a spatial environment — such as a maze — which an animal is exploring. In this setting, the SR can be understood as how likely it is for the animal to visit a future location starting from its current position. We further assume the maze to be formed out of a discrete number of states. The SR can then be described more formally by a matrix of dimension ($N_{states} \times N_{states}$), where $N_{states}$ denotes the number of states in the environment and each entry $R_{ij}$ of this matrix describes the expected future occupancy of a state $S_j$ when the current state is $S_i$. In other words, starting from $S_i$, the more likely it is for the animal to reach the location associated with state $S_j$, and the nearer in the future, the higher the value of $R_{ij}$.

As a first example, we consider an animal running through a linear track. We assume the animal runs at a constant speed and always travels in the same direction — left to right (Figure 1a). We also split the track into four sections or states, $S_1$ to $S_4$, so that the SR is represented by a matrix of dimension ($4 \times 4$). Since the animal always runs from left to right, there is zero probability of finding the animal at position $i$ if its current position is greater than $i$. Therefore, the lower triangle of the successor matrix is equal to zero (Figure 1b). Conversely, if the animal is currently at position $S_1$, it will subsequently be found at positions $S_2$, $S_3$, and $S_4$ with probability 1. The further away from $S_1$, the longer it will take the animal to reach that other position. In terms of the successor matrix, we apply a discounting factor $\gamma$ ($0 < \gamma \leq 1$) for each extra ‘step’ required by the animal to reach a respective location (Figure 1b).
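
To make the discounting concrete, the short sketch below (not part of the original model) computes the successor matrix of this deterministic track directly from its transition matrix, assuming an illustrative discount factor of γ = 0.8:

```python
import numpy as np

gamma = 0.8                       # illustrative discount factor
n_states = 4

# Deterministic left-to-right policy: S1 -> S2 -> S3 -> S4, with S4 terminal.
P = np.zeros((n_states, n_states))
for s in range(n_states - 1):
    P[s, s + 1] = 1.0             # the animal always moves one state to the right

# Successor representation under this policy: M = sum_n (gamma * P)^n = (I - gamma * P)^(-1)
M = np.linalg.inv(np.eye(n_states) - gamma * P)
print(np.round(M, 3))             # upper-triangular, entries gamma^(j - i)
```

The lower triangle is zero and each extra step to the right multiplies the entry by another factor of γ, exactly as described above.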

Figure 1 with 2 supplements see all
Successor representation and neuronal network.

(A) Our simple example environment consists of a linear track with 4 states ($S_1$ to $S_4$) and the animal always moves from left to right — i.e. one epoch consists of starting in $S_1$ and ending in $S_4$. (B) The successor matrix corresponding to the task described in panel A. (C) Our neuronal network consists of two layers with all-to-all feedforward connections. The presynaptic layer mimics hippocampal CA3 and the postsynaptic layer mimics CA1. (D) The synaptic plasticity rule consists of a depression term and a potentiation term. The depression term depends on the synaptic weight and presynaptic spikes (blue). The potentiation term depends on the timing between a pre- and postsynaptic spike pair (red), following an exponentially decaying plasticity window (bottom). (E–F) Schematics illustrating some of the results of our model. (E) Our spiking model learns the top row of the successor representation (panel B) in the weights between the first CA3 place cell and the CA1 cells. (F) Our spiking model learns the third row of the successor representation (panel B) in the weights between the third CA3 place cell and the CA1 cells.

Even though we introduced the linear track as an illustrative example, the SR can be learned in any environment (see Figure 1—figure supplement 1 for an example in an open field). Note that the representation learned by the SR is dependent not only on the structure of the environment, but also on the policy — or strategy — used by the animal to explore the environment. This is because the successor representation is not purely concerned with the physical distance between two areas in the environment, but rather it measures how long it usually takes to reach one place when starting from the other. In this first example, the animal applied a deterministic policy (always running from left to right), but the SR can also be learned for stochastic policies. Furthermore, the SR is a multi-step representation, in the sense that it stores predictive information of multiple steps ahead.

Because of this predictive information, the SR allows sample-efficient re-learning when the reward location is changed (Gershman, 2018). In reinforcement learning, we tend to distinguish between model-free and model-based algorithms. The SR is believed to sit in between these two modalities. In model-free reinforcement learning, the aim is to directly learn the value of each state in the environment. Since there is no model of the environment at all, if the location of a reward is changed, the agent first has to unlearn the previous reward location by visiting it enough times, and only then can it re-learn the new location. In model-based reinforcement learning, a precise model of the environment is learned; specifically, the single-step transition probabilities between all states of the environment. Model-based learning is computationally expensive, but allows a certain flexibility: if the reward changes location, the updated values of the states can be derived immediately. As we have seen, however, the SR can re-learn a new reward location rather efficiently, although less so than model-based learning. The SR can also be efficiently learned using model-free methods and allows us to easily compute values for each state, which in turn can guide the policy (Dayan, 1993; Russek et al., 2017; Momennejad et al., 2017). This position between model-based and model-free methods makes the SR framework very powerful, and its similarities with hippocampal neuronal dynamics have led to increased attention from the neuroscience community. Finally, in our examples above we considered an environment made up of a discrete number of states. This framework can be generalised to a continuous environment represented by a discrete number of place cells.
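
As a small illustration of this last point about value computation (a sketch with hypothetical reward vectors, not a simulation from the paper), the value of every state follows from a single matrix product with the SR, so moving the reward requires no re-learning of the map itself:

```python
import numpy as np

gamma, n_states = 0.8, 4
P = np.zeros((n_states, n_states))
for s in range(n_states - 1):
    P[s, s + 1] = 1.0                               # deterministic left-to-right policy
M = np.linalg.inv(np.eye(n_states) - gamma * P)     # successor representation

R_old = np.array([0.0, 0.0, 0.0, 1.0])              # reward at the end of the track
R_new = np.array([0.0, 1.0, 0.0, 0.0])              # reward moved to state S2

V_old = M @ R_old                                   # state values under the old reward
V_new = M @ R_new                                   # updated instantly from the same M
```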

Learning the successor representation in biologically plausible networks

We propose a model of the hippocampus that is able to learn the successor representation. We consider a feedforward network comprising two layers. Similar to McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Mehta et al., 2000; Hasselmo et al., 2002, we assume that the presynaptic layer represents the hippocampal CA3 region and is all-to-all connected to a postsynaptic layer representing the CA1 network (Figure 1c). The synaptic connections from CA3 to CA1 are plastic, such that the weight changes follow a spike-timing-dependent plasticity (STDP) rule consisting of two terms: a weight-dependent depression term for presynaptic spikes and a potentiation term for pre-post spike pairs (Figure 1d).

For simplicity, we assume that the animal spends a fixed time T in each state. During this time, a constant activation current is delivered to the CA3 neuron encoding the current location and, after a delay, to the corresponding CA1 place cell (see Materials and methods). On top of these fixed and location-dependent activations, the CA3 neurons can activate neurons in CA1 through the synaptic connections. In other words, the CA3 neurons are activated according to the current location of the animal, while the CA1 neurons have a similar location-dependent activity combined with activity caused by presynaptic neurons. The constant currents delivered directly to CA3 and CA1 neurons can be thought of as location-dependent currents from entorhinal cortex. These activations subsequently trigger plasticity at the synapses, and we can show analytically that, using the spike-timing dependent plasticity rule discussed above, the SR is learned in the synaptic weights (Figure 1e and f, and see Appendix).

Moreover, we find that, on an algorithmic level, our weight updates are equivalent to a learning algorithm known as TD(λ), a powerful and well-known algorithm in reinforcement learning that can be used to learn the successor representation. TD(λ) is based on a mixed methodology, which is regulated by the parameter λ. At one extreme, when λ=1, the SR is estimated by taking the average of state occupancies over past trajectories. This type of algorithm is called TD(1) or Monte Carlo (MC). At the other extreme, when λ=0, the estimate of the SR is adjusted ‘online’, with every step of the trajectory, by comparing the observed position with its predicted value. This algorithm is equivalent to TD(0). For all values of λ in between, the algorithm employs a mixture of both methodologies. The extreme cases of TD(1) and TD(0) have different strengths and weaknesses, as we will discuss in more detail in the next sections.

In practice, we prove analytically the mathematical equivalence between the dynamics of our spiking neural network and the TD(λ) algorithm (see Appendix). Our calculations essentially prove that, at each step, our neural network tracks the reinforcement learning algorithm, which is known to converge to the theoretical values of the SR. This equivalence guarantees that our neural network weights will eventually converge to the correct SR matrix. As a proof of principle, we show that it is possible to learn the SR for any initial weights (Figure 1—figure supplement 2), independently of any previous learning in the CA3 to CA1 connections.

Importantly, from our analytical derivations (see Appendix), we find that the λ parameter depends on the behavioral parameter T (the time an animal spends in a state): the larger the time T, the smaller the value of λ, and vice versa. In other words, when the animal moves through the trajectory on behavioral time-scales (T large compared to the synaptic plasticity time-scale $\tau_{LTP}$), the network learns the SR with TD(λ → 0). For quick sequential activities (T → 0), akin to hippocampal replays, the network learns the SR with TD(λ → 1). As we will discuss below, this framework therefore combines learning based on rate coding as well as temporal coding. Furthermore, from our model follows the prediction that replays can also be used for learning purposes and that they are algorithmically equivalent to MC, whereas during behavior, the hippocampal learning algorithm is equivalent to TD(λ). This strategy of using replays to learn is in line with recent experimental and theoretical observations (see Momennejad, 2020 for a review).
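
The relation $e^{-T/\tau_{LTP}} = \lambda\gamma$ (Equation 19 in Materials and methods) already conveys this intuition. The short sketch below evaluates it for a few dwelling times T, with an illustrative plasticity time constant and, for simplicity, a fixed γ (in the full model γ itself also depends on T):

```python
import numpy as np

tau_LTP = 0.02          # plasticity time constant, 20 ms (illustrative value)
gamma = 0.9             # held fixed here purely for illustration

for T in [0.001, 0.02, 0.1, 1.0]:                 # time spent per state, in seconds
    lam = min(np.exp(-T / tau_LTP) / gamma, 1.0)  # from gamma * lambda = exp(-T / tau_LTP)
    print(f"T = {T:5.3f} s  ->  lambda ~ {lam:.3f}")
```

Replay-like timescales (T comparable to, or shorter than, $\tau_{LTP}$) give λ close to 1, whereas behavioral timescales give λ close to 0.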

To validate our analytical results, we again use a linear track with a deterministic policy. Using our spiking model with either rate-code activity on behavioral time-scales (Figure 2a top) or temporal-code activity similar to replays (Figure 2b top), we show that the synaptic weights across trials closely match the evolution of the TD(λ) algorithm (Figure 2a and b middle). While convergence to the SR is guaranteed (Figure 2a and b bottom) due to the mathematical equivalence between our setup and TD(λ) (Figure 2—figure supplement 1), the learning trajectory has more variance in the neural network case due to the noise introduced by the randomness of the spike times. This noise can be mitigated by averaging over a population of neurons. Moreover, due to the equivalence with TD(λ), our setup is general for any type of task where discrete states are visited, in any dimension, and need not be a navigation task (see e.g. Figure 1—figure supplement 1 for a 2D environment).

Figure 2 with 1 supplement see all
Comparison between TD(λ) and our spiking model.

(A-top) Learning during behavior corresponds to TD(λ → 0). States are traversed on timescales larger than the plasticity timescales and place cells use a rate code. (A-middle) Comparison of the learning over epochs for two synaptic connections (full line denotes the mean over ten random seeds, shaded area denotes one standard deviation) with the theoretical learning curve of TD(λ) (dotted line). (A-bottom) Final successor matrix learned by the spiking model (left) and the theoretical TD(λ) algorithm (right). Star and diamond symbols denote the corresponding weights shown in the middle row. (B-top) Learning during replays corresponds to TD(λ → 1). States are traversed on timescales similar to the plasticity timescales and place cells use a temporally precise code. (B-middle and bottom) Analogous to panel A middle and bottom.

In summary, we showed how the network can learn the SR using a spiking neural model. We analytically showed how the learning algorithm is equivalent to TD(λ), and confirmed this using numerical simulations. We derived a relationship between the abstract parameter λ and the timescale T representing the animal’s behavior — and in turn the neuronal spiking — allowing us to unify rate and temporal coding within one framework. Furthermore, we predict a role for hippocampal replays in learning the SR using an algorithm equivalent to Monte Carlo.

Learning over behavioral time-scales using STDP

An important observation in our framework is that the SR can be learned using the same underlying STDP rule over time-scales ranging from replays up to behavior. One can now wonder how it is possible to learn relationships between events that are seconds apart during awake behavior, without any explicit error encoding signal typically used by the TD algorithm, and while the STDP rule is characterised by millisecond time-scales (Figure 3a).

Learning on behavioral timescales and state-dependent discounting.

(A) In our model, the network can learn relationships between neurons that are active seconds apart, while the plasticity rule acts on a millisecond timescale. (B) Due to transitions between subsequent states, each synaptic weight update depends on the weight from the subsequent CA3 neuron to the same CA1 neuron. In other words, the change of a synaptic weight depends on the weight below it in the successor matrix. The top panel visualizes how weights depend on others in our linear track example, where each lighter color depends on its darker neighbor. The bottom panel shows the learning of the weights over 50 epochs. Notice that the lighter traces converge more slowly, due to their dependence on the darker traces. (C) Place fields of the place cells in the linear track — each place cell corresponding to a column of the successor matrix. Activities of each place cell when the animal is in each of the four states (dots) are interpolated (lines). Three variations are considered: (i) the time spent in each state and the CA3 firing rates are constant (blue and panel D); (ii) the time spent in state 3 is doubled (orange and panel E); (iii) the CA3 firing rate in state 3 is doubled (green and panel F). Panels E and F lead to a modified discount parameter in state 3, affecting the receptive fields of place cells 3 and 4.

From a neuroscience perspective, this can be understood when considering the trajectory of the animal. Each time the animal moves from a position $S_{j-1}$ to a position $S_j$, the CA3 cell encoding location $S_{j-1}$ stops firing and the CA3 cell encoding location $S_j$ starts firing. Since in our example this transition is instantaneous, these cells activate the same CA1 cells consecutively. Therefore, the change in the weight $w_{i,j-1}$ depends on the synaptic weight of the subsequent state, $w_{i,j}$ (Figure 3b, yellow depends on orange, orange depends on red, etc.). Indeed, in our example of an animal in a linear track subdivided into four locations, the weights on the diagonal, such as $w_{4,4}$, are the first ones to be learned, since they are learned directly. The off-diagonal weights, such as $w_{3,4}$, $w_{2,4}$, and $w_{1,4}$, are learned consecutively more slowly, as each depends on the subsequent synaptic weight. Eventually, weights between neurons encoding positions that are behaviorally far apart can be learnt using a learning rule acting on a synaptic timescale (Figure 3b).

From a reinforcement learning perspective, the TD(0) algorithm relies on a property called bootstrapping. This means that the successor representation is learned by first taking an initial estimate of the SR matrix (i.e. the previously learned weights), and then gradually adjusting this estimate (i.e. the synaptic weights) by comparing it to the states in the environment that the animal actually visits. This comparison is achieved by calculating a prediction error, similar to the widely studied one for dopamine neurons (Schultz et al., 1997). Since the synaptic connections carry information about the expected trajectories, in this case, the prediction error is computed between the predicted and observed trajectories (see Materials and methods).

The main point of bootstrapping, therefore, is that learning happens by adjusting our current predictions (e.g. the synaptic weights) to match the observed current state. This information is available at each time step and thus allows learning over long timescales using synaptic plasticity alone. If the animal moves to a state of the environment that the current weights deem unlikely, potentiation will prevail and the weight from the previous to the current state will increase. Otherwise, the opposite will happen. It is important to notice that the prediction error in our model is not encoded by a separate mechanism in the way that dopamine is thought to do for reward prediction (Schultz et al., 1997). Instead, the prediction error is represented locally, at the level of the synapse, through the depression and potentiation terms of our STDP rule, and the current weight encodes the current estimate of the SR (see Materials and methods). Notably, the prediction error is equivalent to the TD(λ) update. This mathematical equivalence ensures that the weights of our neural network track the TD(λ) update at each step, and thus guarantees stability and convergence to the theoretical values of the SR. We therefore do not need an external vector to carry prediction error signals as proposed in Gardner et al., 2018; Gershman, 2018. In fact, the synaptic potentiation in our model updates a row of the SR, while the synaptic depression updates a column.
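
On the algorithmic level, this bootstrapped update corresponds to the standard TD(0) learning rule for the SR. A minimal tabular sketch on the four-state track (with illustrative learning rate and discount, not the spiking implementation) makes the dependence on the next state's prediction explicit:

```python
import numpy as np

n_states, gamma, eta = 4, 0.8, 0.1
M = np.zeros((n_states, n_states))           # current SR estimate ("the weights")

for epoch in range(500):
    traj = [0, 1, 2, 3]                      # one left-to-right run
    for t in range(len(traj) - 1):
        s, s_next = traj[t], traj[t + 1]
        # prediction error: observed occupancy plus the bootstrapped prediction
        # from the next state, minus the current prediction for state s
        delta = np.eye(n_states)[s] + gamma * M[s_next] - M[s]
        M[s] += eta * delta
    # terminal state: no successor, so the target is its own occupancy only
    M[3] += eta * (np.eye(n_states)[3] - M[3])
```

Because the update of row s uses row s_next, the diagonal entries are learned first and entries further from the diagonal follow, mirroring the chaining of weights described above.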

On the other extreme, for very fast timescales such as replays, TD(1) is equivalent to online Monte Carlo learning (MC), which does not bootstrap at all. Instead, MC samples the whole trajectory and then simply takes the average of the discounted state occupancies to update the SR (see Materials and methods). During replays, the whole trajectory falls under the plasticity window and the network can learn without bootstrapping. For all cases in between, the network partially relies on bootstrapping and we correspondingly find a λ between 0 and 1.

In summary, in our framework, synaptic plasticity leads to the development of a successor representation in which synaptic weights can be directly linked to the successor matrix. In this framework, we can learn over behavioral timescales even though our plasticity rule acts on the scale of milliseconds, due to the bootstrapping property of TD algorithms.

Different discounting for space and time

In reinforcement learning, it is usual to have delay discounting: rewards that are further away in the future are discounted compared to rewards in the immediate future. Intuitively, it is indeed clear that a state leading to a quick reward can be regarded as more valuable than a state that only leads to an equal reward in the distant future. For tasks in a tabular setting, with a discrete state space and where actions are taken in discrete turns, such as chess or our simple linear track discussed in section ‘The Successor Representation’, one can simply use a multiplicative factor $0 < \gamma \leq 1$ for each state transition. In this case the discount follows an exponential dependence, where rewards that are $n$ steps away are discounted by a factor of $\gamma^{n}$.

In order to still use the above exponential discount when time is continuous, the usual approach is to discretize time by choosing a unit of time. However, this would imply one can never remain in a state for a fraction of this unit of time, and it is not clear how this unit would be chosen. Our framework deals naturally with continuous time, through the monotonically decreasing dependence of the discount parameter γ on the time an agent remains in a state, T. The dependence on T can be interpreted as an increased discounting the longer a state lasts.

In this way, instead of discounting by $\gamma^{n}$ when the agent stays $n$ units of time in a certain state, we would discount by $\gamma(nT)$. More generally, for any arbitrary time $T$, a discount corresponding to $\gamma(T)$ is applied. This allows the agent to act in continuous time (Figure 3c and e). Interestingly, the dependence of γ on T in our model is not exponential as in the tabular case; instead, we have a hyperbolic dependence. This hyperbolic discount is well studied in psychology and neuroeconomics and appears to agree well with experimental results (Laibson, 1997; Ainslie, 2012).

The difference between a hyperbolic discount and an exponential discount lies in the fact that we attribute a different value to the same temporal delay, depending on whether it happens sooner or later. A classic example is that, when given the choice, people tend to prefer 100 dollars today over 101 dollars tomorrow, while they tend to prefer 101 dollars in 31 days over 100 dollars in 30 days. They therefore judge the one day of delay differently when it happens later in time. Exponential discounting, on the other hand, always attributes the same value to the same delay, no matter when it occurs.
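
A small numerical version of this classic example, with hypothetical discount parameters chosen only to illustrate the reversal:

```python
def exponential(delay_days, gamma=0.97):
    return gamma ** delay_days              # same relative discount for every extra day

def hyperbolic(delay_days, k=0.012):
    return 1.0 / (1.0 + k * delay_days)     # an extra day matters less when it is far away

for name, disc in [("exponential", exponential), ("hyperbolic", hyperbolic)]:
    prefer_now_today = 100 * disc(0) > 101 * disc(1)     # $100 today vs $101 tomorrow
    prefer_now_later = 100 * disc(30) > 101 * disc(31)   # same choice, shifted by 30 days
    print(f"{name:12s}  today: {prefer_now_today}   in a month: {prefer_now_later}")
```

Only the hyperbolic discount reverses the preference when the whole choice is pushed into the future.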

Our model therefore combines two types of discounting: exponential when we move through space — when sequentially activating different place cells — and hyperbolic when we move through time — when we prolong the activity of the same place cell.

The discount factor γ also depends on other parameters, such as the firing rate and the STDP amplitudes (see Equation 22 in the Appendix). This gives our model the flexibility to encode state-dependent discounting even when the trajectories and the times spent in the states are the same. Such state-dependent discounting can be useful, for example, to encode salient locations in the environment such as landmarks or reward locations (Figure 3c and f).

Bias-variance trade-off

As discussed previously (section ‘Learning the successor representation in biologically plausible networks’), the TD(λ) algorithm unifies the TD algorithm and the MC algorithm. In our framework, replay-like neuronal activations are equivalent to MC, while behavioral-like activity is equivalent to TD. In this section, we will discuss how the replays and behavior can work together when learning the cognitive map of an environment, leveraging the strengths of MC and TD.

The MC algorithm effectively works by averaging over the sampled trajectories. As such, the estimated SR matrix will be a close approximation of the theoretical value. The difference between the estimated and the theoretical value is commonly referred to as bias; we can therefore say that the MC algorithm presents low bias. However, if the agent moves through the environment at random, the sampled trajectories will be quite different from each other. When taking the average, the estimated value will therefore fluctuate considerably. In this case, we say that the MC estimate has high variance (Figure 4A and B).

Figure 4 with 2 supplements see all
Replays can be used to control the bias-variance trade-off.

(A) The agent follows a stochastic policy starting from the initial state (denoted by START). The probability to move to either neighboring state is 50%. An epoch stops when reaching a terminal state (denoted by STOP). (B) Root mean squared error (RMSE) between the learned SR estimate and the theoretical SR matrix. The full lines are mean RMSEs over 1000 random seeds. Three cases are considered: (i) learning happens exclusively due to behavioral activity (TD STDP, green); (ii) learning happens exclusively due to replay activity (MC STDP, purple); (iii) a mixture of behavioral and replay learning, where the probability of replays drops off exponentially with epochs (Mix STDP, pink). The mix model, with a decaying number of replays, learns as quickly as MC in the first epochs and converges to a low error similar to TD, benefiting both from the low bias of MC at the start and the low variance of TD at the end. (C, D, E) Representative weight changes for each of the scenarios. Full lines show various random seeds, shaded areas denote one standard deviation over 1000 random seeds. (F) More replays are observed when an animal explores a novel environment (day 1). Panel F adapted from Figure 3A in Cheng and Frank, 2008.

Unlike MC, the TD algorithm updates its estimate of the SR by comparing the current estimate of the SR with the actual state the agent transitioned to. Because of the dependence on the current estimate, this estimate will be incrementally refined with small updates. In this way, the SR estimate will not fluctuate much, and be lower in variance. However, by this dependence on the current estimate, we introduce a bias in the algorithm, which will be especially significant when our initial estimate of the SR is bad (Figure 4A and B). The TD algorithm therefore presents high bias and low variance.

We now apply these concepts to learning in a novel environment. Since the MC algorithm is not biased by the initial estimate of the SR, replays should initially speed up learning in an unfamiliar environment. Later on, when the environment becomes familiar, the SR estimate is already closer to the exact value. At this point, low variance is preferable and thus the TD algorithm will be preferred. We confirm this logic using our spiking neural networks, and show that we can have both quick learning and a low error at convergence if we have proportionally more replays during the first trials in a novel environment (Figure 4a–e). In contrast, when having an equal proportion of replays throughout the whole simulation, we achieve neither learning as quick as MC nor an asymptotic error as low as TD (Figure 4—figure supplement 1). Interestingly, the pattern of proportionally more replays in novel environments versus familiar environments has also been observed experimentally (Cheng and Frank, 2008; Figure 4f). Please note that, while we implemented an exponentially decaying probability of replays after entering a novel environment, different schemes for replay activity could be investigated. Note also that other mechanisms besides the successor representation could account for these results, including model-based reinforcement learning.
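
The algorithmic counterpart of this simulation can be sketched with a tabular TD(λ) learner on the random-walk task of Figure 4a, where λ = 1 stands in for replay-like (MC) updates and λ = 0 for behavior-like (TD) updates. The state count, learning rate and replay schedule below are illustrative and do not reproduce the spiking simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, eta = 7, 0.9, 0.1                     # 5 interior states, 2 terminal states

# Ground-truth SR of the unbiased random walk (50% to either neighbour).
P = np.zeros((n, n))
for s in range(1, n - 1):
    P[s, s - 1] = P[s, s + 1] = 0.5
M_true = np.linalg.inv(np.eye(n) - gamma * P)

def td_lambda_episode(M, lam, start=n // 2):
    """One episode of tabular TD(lambda) learning of the SR."""
    s, e = start, np.zeros(n)                   # e: eligibility trace over states
    while True:
        e *= gamma * lam
        e[s] += 1.0
        if s in (0, n - 1):                     # terminal state: no successor
            M += eta * np.outer(e, np.eye(n)[s] - M[s])
            return M
        s_next = s + rng.choice((-1, 1))
        delta = np.eye(n)[s] + gamma * M[s_next] - M[s]
        M += eta * np.outer(e, delta)
        s = s_next

def learn(schedule, n_epochs=200):
    M, errors = np.zeros((n, n)), []
    for epoch in range(n_epochs):
        M = td_lambda_episode(M, schedule(epoch))
        errors.append(np.sqrt(np.mean((M - M_true) ** 2)))
    return errors

rmse_td  = learn(lambda ep: 0.0)                                               # behavior only
rmse_mc  = learn(lambda ep: 1.0)                                               # replays only
rmse_mix = learn(lambda ep: 1.0 if rng.random() < np.exp(-ep / 20) else 0.0)   # decaying replays
```

The decaying-replay mixture is intended to combine the fast initial drop of the replay-only learner with the lower asymptotic error of the behavior-only learner, qualitatively as in Figure 4b.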

Leveraging replays to learn novel trajectories

In the previous section, the replays re-activated the same trajectories as seen during behavior. In this section, we extend this idea and show how in our model replays can be useful during learning even when the re-activated trajectories were not directly experienced during behavior.

For this purpose, we reproduce a place-avoidance experiment from Wu et al., 2017. In short, rats were allowed to freely explore a linear track on day 1. Half of the track was dark, while the other half was bright. On day 2, the animals did four trials separated by resting periods: in the first trial (pre), the animals were free to explore the track; in the second trial (shock), they started in the light zone and received two mild footshocks when entering the shock zone; in the third and fourth trials (post and re-exposure, respectively), they were allowed to freely explore the track again, but starting from the light zone or the shock zone, respectively (Figure 5a). In the study, it was reported that during the post trial, animals tended to stay in the light zone, and forward replays from the current position to the shock zone were observed when the animals reached the boundary between the light and the dark zone (Figure 5b and c).

Figure 5 with 4 supplements see all
Reproducing place-avoidance experiments with the spiking model.

(A–C) Data from Wu et al., 2017. (A) Experimental protocol: the animal is first allowed to freely run in the track (Pre). In the next trial, a footshock is given in the shock zone (SZ). In subsequent trials the animal is again free to run in the track (Post, Re-exposure). Figure redrawn from Wu et al., 2017. (B) In the Post trial, the animal learned to avoid the shock zone completely and also mostly avoided the dark area of the track. Figure redrawn from Wu et al., 2017. (C) Time spent per location confirms that the animal prefers the light part of the track in the Post trial. Figure redrawn from Wu et al., 2017. (D) Mimicking the results from Wu et al., 2017, the shock zone is indicated by the black region, the dark zone by the gray region and the light zone by the white region. Left: without replays, the agent keeps extensively exploring the dark zone even after having experienced the shock. Right: with replays, the agent largely avoids entering the dark zone after having experienced the shock (replays not shown). (E) The value of each state in the cases with and without replays. (F) Occupancy of each state in our simulations and for the various trials. Solid lines and shaded areas denote the average and standard deviation over 100 simulations, respectively. Notice that we do not reproduce the peak of occupancy at the middle of the track seen in panel C, since our simplified model assumes that the same amount of time is spent in each state.

We simulated a simplified version of this task. Our simulated agent moves through the linear track following a softmax policy, and all states have equal value during the first phase (pre) (Figure 5d, blue trajectories). Then, the agent is allowed to move through the linear track until it reaches the shock zone and experiences a negative reward. Finally, the third phase is similar to the first phase, and the animal is free to explore the track. Two versions of this third phase were simulated. In one version there are no replays (Figure 5d, orange trajectories in left panel), while in the second version a forward replay up to the shock zone is simulated every time the agent enters the middle state (Figure 5d, orange trajectories in right panel, replays not shown). The replays affect the learning of the successor representation, and the negative reward information is propagated towards the decision point in the middle of the track. The states in the dark zone therefore have a lower value compared to the case without replays (Figure 5e). In turn, this different value affects the policy of the agent, which now tends to avoid the dark zone altogether, while the agent without replays still occupies states in the dark zone about as much as states in the light zone (Figure 5f). Moreover, even when doubling the amount of SR updates in the scenario without replays, the behavior of the agent remains unaltered (Figure 5—figure supplement 1). This shows that it is not the amount of updates, but the type of policy used when updating the SR that is important, and how using a different policy in the replay activity can significantly alter behavior.

Our setup for this simulation is simplified, and does not aim to reproduce the complex decision making of the rats. Observe for example the peak of occupancy of the middle state by the animals (Figure 5c), which is not captured by our model because we assume the agent to spend the same amount of time in each state. Nonetheless, it is interesting to see how replaying trajectories that were not directly experienced before, in combination with a model allowing replays to affect the learning of a cognitive map, can substantially influence the final policy of an agent and the overall performance. This mental imagination of trajectories could be exploited to refine our cognitive maps, avoiding unfavourable locations or finding shortcuts to rewards. It is important to note here that, while we are suggesting a potential role for the SR in solving this task, the data itself would also be compatible with a model-based strategy. In fact, experimental evidence suggests that humans may use a mixed strategy involving both model-based reinforcement learning and the successor representation (Momennejad et al., 2017).

Discussion

In this article, we investigated how a spiking neural network model of the hippocampus can learn the successor representation. Interestingly, we show that the updates in synaptic weights resulting from our biologically plausible STDP rule are equivalent to TD(λ) updates, a well-known and powerful reinforcement learning algorithm.

Reinforcement learning

Our network learns the SR in the CA3-CA1 weights. Since we have modeled neurons to integrate the synaptic EPSPs and generate spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate (see Figure 5—figure supplement 2). While the neuron model used is simple, it will be interesting for future work to study analogous models with non-linear neurons.

It is worth noting that, during learning, both the presynaptic and the postsynaptic layers receive external inputs representing the current state (Equation 10 and Equation 11 in Materials and methods). This may induce a distortion in the readout of the diagonal elements of the SR matrix (see Equations 13 and 15, and Figure 5—figure supplement 2). At first glance, this may indicate that learning and reading out are antagonistic. However, there are multiple ways this apparent conflict could be resolved: (i) since the external current in CA1 is present for only a fraction of the time T in each state, the readout might happen during the period of CA3 activation exclusively; (ii) the readout may extend over the whole time T but become noisier towards the end; even in that case, the distortion would be limited to the diagonal elements of the matrix; (iii) learning and readout may be separate mechanisms, where during readout only the CA3 driving current is present. This could for instance be signaled by neuromodulation (e.g. noradrenaline and acetylcholine are active during learning but not exploration; Micheau and Marighetto, 2011; Hasselmo and Sarter, 2011; Robbins, 1997; Teles-Grilo Ruivo and Mellor, 2013; Palacios-Filardo et al., 2021), or it could be that readout happens during replays; (iv) the weights to, or activation functions of, the readout neuron may learn to compensate for the distorted signal in CA1.

Furthermore, we can notice that the external inputs encoding the current state activate CA3 first, and CA1 later. The delay between these activations θ/T (Equation 10 and Equation 11 in Materials and methods) is an arbitrary parameter that can be adjusted. Varying this delay will change the reinforcement learning representation, especially parameters λ and γ, but also the strength of the input current (see Figure 5—figure supplement 3). However, this will not impact the distortion of the diagonal elements of the SR matrix, which remains similar across various delay values θ/T (see Figure 5—figure supplement 4).

Biological plausibility

Uncovering a connection between STDP and TD(λ) shows how, using minimal assumptions, a theoretically grounded learning algorithm can emerge from a biological implementation of plasticity. Similar learning rules have indeed been observed in the hippocampus (Shouval et al., 2002) and proposed on theoretical grounds (Mehta et al., 2000; Waddington et al., 2012; van Rossum et al., 2012).

The TD algorithm is most commonly known in neuroscience for describing how reward prediction can be computed in the brain. More specifically, it is widely believed that dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) encode the prediction error between the observed and expected reward (Schultz et al., 1997); dopamine thus acts as a global signal that can be broadcast to other areas of the brain, such as the striatum, to compute the expected reward. In our model, the TD algorithm estimates the SR (i.e. expected future occupancy), rather than the value. However, since the prediction error for the SR is different for every synaptic connection (i.e. each pair of states), it is not clear how it could be carried by a global signal analogous to dopamine. The SR would need multiple signals, or a matrix transformation of the global signal. Furthermore, we would need to postulate that such an error – or errors – is computed elsewhere in the brain. Instead, in our model, the prediction error simply emerges from the synaptic plasticity rule itself. Furthermore, thanks to the presynaptic depression, our STDP rule alone allows us to compute negative prediction errors, which still poses an open challenge for computation with dopamine because of the low baseline dopaminergic firing rate (Glimcher, 2011; Daw et al., 2002; Matsumoto and Hikosaka, 2007).

Our framework smoothly connects a temporally precise spiking code with a fully rate-based code, and anything in between. As we have proven mathematically, this translates into moving smoothly from Monte Carlo to temporal difference learning by means of TD(λ). Fast spiking sequences (temporal code) can be used for consolidation of previous experiences using Monte Carlo learning, while behavioral-timescale activity (rate code) results in TD updates, allowing learning on the timescale of seconds even with plasticity time constants on the order of milliseconds. This type of Hebbian learning over behavioral timescales exploits the bootstrapping property of TD, and is different from the one-shot behavioral plasticity described in Bittner et al., 2017. However, these two mechanisms could be complementary, where the latter could play a more significant role in the formation of new place fields, while the former would be more relevant to shape existing place fields so that they contain predictive information. Learning on behavioral timescales using STDP was also investigated in Drew and Abbott, 2006. The main difference between Drew and Abbott, 2006 and our work is that the former relies on overlapping neural activity between the pre- and postsynaptic neurons from the start, while in our case no such overlap is required. In other words, our setup allows us to learn connections between a presynaptic neuron and a postsynaptic neuron whose activities are initially separated by behavioral timescales. For this to be possible, there are two requirements: (1) the task needs to be repeated many times, and (2) a chain of neurons is consecutively activated between the aforementioned presynaptic and postsynaptic neurons. Due to this chain of neurons, over time the activity of the postsynaptic neuron will start earlier, eventually overlapping with that of the presynaptic neuron.

In our work, we did not include theta modulation, but phase precession and theta sequences could be yet another type of activity within the TD(λ) framework. A recent work (George et al., 2023) incorporated theta sweeps into behavioral activity, showing that it approximately learns the SR. Moreover, theta sequences allow for fast learning, playing a similar role as replays (or any other fast temporal-code sequences) in our work. By simulating the temporally compressed and precise theta sequences, their model also reconciles learning over behavioral timescales with STDP. In contrast, our framework reconciles both timescales relying purely on rate coding during behavior. Finally, their method allows the SR to be learned in continuous space. It would be interesting to investigate whether these methods co-exist in the hippocampus and other brain areas. Furthermore, Fang et al., 2023 recently showed how the SR can be learned using recurrent neural networks with biologically plausible plasticity.

There are three different neural activities in our proposed framework: the presynaptic layer (CA3), the postsynaptic layer (CA1), and the external inputs. These external inputs could for example be location-dependent currents from the entorhinal cortex, with timings guided by the theta oscillations. The dependence of CA1 place fields on CA3 and entorhinal input is in line with lesion studies (see e.g. Brun et al., 2008; Hales et al., 2014; O’Reilly et al., 2014). It would be interesting for future studies to further dissect the role various areas play in learning cognitive maps.

Notably, even though we have focused on the hippocampus in our work, the SR does not require predictive information to come from higher-level feedback inputs. This framework could therefore be useful even in sensory areas: certain stimuli are usually followed by other stimuli, essentially creating a sequence of states whose temporal structure can be encoded in the network using our framework. Interestingly, replays have been observed in other brain areas besides the hippocampus (Kurth-Nelson et al., 2016; Staresina et al., 2013). Furthermore, temporal difference learning in itself has been proposed in the past as a way to implement prospective coding (Brea et al., 2016).

Replays

We have also proposed a role for replays in learning the SR, in line with experimental findings and RL theories (Russek et al., 2017; Momennejad et al., 2017). In general, replays are thought to serve different functions, spanning from consolidation to planning (Roscow et al., 2021). Here, we have shown that when the replayed trajectories are similar to the ones observed during behavior, they play the role of speeding up and consolidating learning by regulating the bias-variance trade-off, which is especially useful in novel environments. On the other hand, if the replayed trajectories differ from the ones experienced during wakefulness, replays can play a role in reshaping the representation of space, which would suggest their involvement in planning. Experimentally, it has been observed that replays often start and end at relevant locations in the environment, like reward sites, decision points, obstacles or the current position of the animal (Ólafsdóttir et al., 2015; Pfeiffer and Foster, 2013; Jackson et al., 2006; Mattar and Daw, 2017). Since these are salient locations, this is in line with our proposition that replays can be used to maintain a convenient representation of the environment. It is worth noticing that replays can serve a variety of functions, and our framework merely proposes additional beneficial properties without claiming to explain all observed replays. For example, in addition to forward replays, reverse replays are also ubiquitous in the hippocampus (Pfeiffer, 2020). Reverse replays are not included in our framework, and it is not yet clear whether they play different roles, with some evidence suggesting that reverse replays are more closely tied to reward encoding (Ambrose et al., 2016). Moreover, while indirect evidence supports the idea that replays can play a role during learning (Igata et al., 2021), it is not yet clear how synaptic plasticity is manifested during replays (Fuchsberger and Paulsen, 2022).

Learning flexibility

Multiple ideas from reinforcement learning, such as TD(λ), state-dependent discounting and the successor representation, emerge quite naturally from our simple biologically plausible setting. We propose in our work that time and space can be discounted differently. Moreover, the flexibility to change the discounting factor by modulating firing rates and plasticity parameters — which is ubiquitous in neural circuits — suggests that these mechanisms could be used to encode a variety of information in a cognitive map. Moreover, the specific dependence of the discount factor on the biological parameters leads to experimentally testable predictions. Indeed, our framework predicts well-defined changes in place fields after modulations of firing rates, speed of the agent or neuromodulation of the plasticity parameters (Figure 3). Importantly, the discount parameter also depends on the time spent in each state. This eliminates the need for time discretization, which does not reflect the continuous nature of the response of time cells (Kraus et al., 2013).

Limitations of the reinforcement learning framework

We have already outlined some of the benefits of using reinforcement learning for modeling behavior, including providing clear computational and algorithmic frameworks. However, there are several intrinsic limitations to this framework. For example, RL agents that only use spatial data do not provide complete descriptions of behavior, which likely arises from integrating information across multiple sensory inputs. Whereas an animal would be able to smell and see a reward from a certain distance, an agent exploring the environment would only be able to discover it when randomly visiting the exact reward location. Furthermore, the framework rests on fairly strict mathematical assumptions: typically, the state dynamics need to be Markovian, time and space need to be discretized (which we manage to evade in this particular framework), and the discounting needs to follow an exponential decay. These assumptions are simplistic and it is not clear how often they are actually met. Reinforcement learning is also a sample-intensive technique, whereas we know that some animals, including humans, are capable of much faster or even one-shot learning.

Even though we have provided a neural implementation of the SR, and of the value function as its read-out (see Figure 5—figure supplement 2), the whole action selection process is still computed only at the algorithmic level. It may be interesting to extend the neural implementation to the policy selection mechanism in the future.

Taken together, our work joins — in a single framework — a variety of concepts from the neuronal level over cognitive theories to reinforcement learning.

Materials and methods

The successor representation

Request a detailed protocol

In a tabular environment, we define the value of a state $s$ as the expected cumulative reward that an agent will receive when following a certain policy starting in $s$. The future rewards are multiplied by a factor $0 < \gamma^{n} \leq 1$, where $n$ is the number of steps until reaching the reward location and $0 < \gamma \leq 1$ is the delay discount factor. It is usual to use $0 < \gamma < 1$, which ensures that earlier rewards are given more importance compared to later rewards. Formally, the value of a state $s$ under a certain policy $\pi$ is defined as

(1) $V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k}\,\middle|\,S_{t}=s\right]$
(2) $\;\;\;\;\;= \sum_{a}\pi(a|s)\left[R(s,a) + \sum_{s'}P(s'|s,a)\,\gamma\,V^{\pi}(s')\right]$

Here, $a$ denotes the action, $R(s,a)$ is the reward function and $P(s'|s,a)$ is the transition function, i.e. the probability that taking action $a$ in state $s$ will result in a transition to state $s'$. Following Dayan, 1993, we can decompose the value function into the inner product of the reward function and the successor matrix

(3) $V(s) = \sum_{s'} M_{s,s'}\,R(s')$

with

(4) $M_{s,s'} = \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{I}(s_{t}=s')\,\middle|\,s_{0}=s\right]$

This representation is known as the successor representation (SR), where each element $M_{ij}$ represents the expected future occupancy of state $j$ when in state $i$. By decomposing the value into the SR and the reward function (Equation 3), relearning the state values $V$ after changing the reward function is fast, similar to model-based learning. At the same time, the SR can be learned in a model-free manner, using for example temporal difference (TD) learning (Russek et al., 2017).

Derivation of the TD(λ) update for the SR

Request a detailed protocol

The TD(λ) update for the SR is then implemented according to (see e.g. Sutton and Barto, 1998)

(5) $\Delta M(j,i) = \delta^{TD}_{0} + \gamma\lambda\,\delta^{TD}_{1} + (\gamma\lambda)^{2}\,\delta^{TD}_{2} + \dots$

Using $\delta^{TD}_{n}$ for the TD error at step $n$ and $\delta_{xy}$ for the Kronecker delta,

(6) $\delta^{TD}_{n} = \delta_{j+n,\,i} + \gamma\,M(j+n+1,\,i) - M(j+n,\,i)$

corresponds to the TD error for element $M(j+n,\,i)$ of the successor representation after the transition from state $j+n$ to state $j+n+1$. Combining Equations 5 and 6, we find

(7) $\Delta M(j,i) = \left[\delta_{j,i} + \gamma M(j+1,i) - M(j,i)\right] + \gamma\lambda\left[\delta_{j+1,i} + \gamma M(j+2,i) - M(j+1,i)\right] + (\gamma\lambda)^{2}\left[\delta_{j+2,i} + \gamma M(j+3,i) - M(j+2,i)\right] + \dots$
$\;\;\;\;= -M(j,i) + \delta_{j,i} + (1-\lambda)\gamma M(j+1,i) + \gamma\lambda\,\delta_{j+1,i} + (1-\lambda)\lambda\gamma^{2}M(j+2,i) + (\gamma\lambda)^{2}\delta_{j+2,i} + \dots$
$\;\;\;\;= -M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^{n}\,\delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^{n}\,M(j+n+1,i)\right]$

and

(8) $M(j,i) \leftarrow M(j,i) + \eta\,\Delta M(j,i) = M(j,i) + \eta\left\{-M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^{n}\,\delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^{n}\,M(j+n+1,i)\right]\right\}$
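
The forward-view update of Equation 8 can be transcribed directly into code; the sketch below is illustrative (function and variable names are ours), and the online, eligibility-trace form of TD(λ) used in practice is closely related:

```python
import numpy as np

def delta_M_forward(M, traj, i, gamma, lam):
    """Forward-view TD(lambda) update of Equation 8 for element M(traj[0], i).

    traj is the observed sequence of states starting at state j = traj[0]."""
    total = -M[traj[0], i]
    for n in range(len(traj) - 1):
        total += (gamma * lam) ** n * float(traj[n] == i)                    # Kronecker delta term
        total += (1 - lam) * gamma * (gamma * lam) ** n * M[traj[n + 1], i]  # bootstrap term
    return total

# usage: M[j, i] += eta * delta_M_forward(M, trajectory_starting_at_j, i, gamma, lam)
```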

Neural network model

Plasticity rule

Request a detailed protocol

The synaptic plasticity rule (Figure 1d) consists of a weight-dependent depression for presynaptic spikes and a spike-timing dependent potentiation, given by

(9) $\frac{dw_{ij}(t)}{dt} = \eta_{STDP}\,A_{LTP}\,Tr^{j}_{LTP}(t)\sum_{t^{i}}\delta(t-t^{i}) - \eta_{STDP}\,A_{LTD}\,w_{ij}(t)\sum_{t^{j}}\delta(t-t^{j})$
$\;\;\;\;\tau_{LTP}\,\frac{dTr^{j}_{LTP}(t)}{dt} = -Tr^{j}_{LTP}(t) + \sum_{t^{j}}\delta(t-t^{j})$

Here, $w_{ij}$ represents the synaptic connection from presynaptic neuron $j$ to postsynaptic neuron $i$, $Tr^{j}_{LTP}$ is the plasticity trace, a low-pass filter of the presynaptic spike train with time constant $\tau_{LTP}$, $t^{j}$ and $t^{i}$ are the spike times of the presynaptic and postsynaptic neuron, respectively, $A_{LTP}$ and $A_{LTD}$ are the amplitudes of potentiation and depression, respectively, $\eta_{STDP}$ is the learning rate for STDP, and $\delta(\cdot)$ denotes the Dirac delta function.
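
A minimal Euler-discretized sketch of Equation 9, with illustrative parameter values (the exact ordering of trace and weight updates within a time bin is a discretization choice):

```python
import numpy as np

def stdp_step(w, tr_pre, pre_spikes, post_spikes, dt,
              eta=0.01, A_LTP=1.0, A_LTD=1.0, tau_LTP=0.02):
    """One Euler step of the plasticity rule in Equation 9 (illustrative parameters).

    w           : (n_post, n_pre) weight matrix
    tr_pre      : (n_pre,) low-pass-filtered presynaptic spike trains (the LTP trace)
    pre_spikes  : (n_pre,)  spike counts of presynaptic neurons in this time bin (0 or 1)
    post_spikes : (n_post,) spike counts of postsynaptic neurons in this time bin (0 or 1)
    """
    # trace dynamics: tau_LTP * dTr/dt = -Tr + sum of presynaptic spikes
    tr_pre = tr_pre + dt * (-tr_pre / tau_LTP) + pre_spikes / tau_LTP
    # potentiation: each postsynaptic spike is paired with the presynaptic trace
    w = w + eta * A_LTP * np.outer(post_spikes, tr_pre)
    # weight-dependent depression: each presynaptic spike depresses its synapse
    w = w - eta * A_LTD * w * pre_spikes[None, :]
    return w, tr_pre
```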

Place cell activation

Request a detailed protocol

We assume that each state in the environment is represented by a population of place cells in the network. In our model, this is achieved by delivering place-tuned currents to the neurons. Whenever a state $S=j$ is entered, the presynaptic neurons encoding state $j$ start firing at a constant rate $\rho^{\text{pre}}$ for a time $\theta$, following a Poisson process with rate $\rho_h^{\text{pre}}(t)$. The other presynaptic neurons are assumed to be silent:

(10) $\rho_h^{\text{pre}}(t) = \begin{cases}\rho^{\text{pre}}\,\delta_{hj}, & \text{if } t\in[0,\theta)\\ 0, & \text{otherwise}\end{cases}$

where the Kronecker delta is defined as $\delta_{hj}=1$ if $h=j$ and zero otherwise. Here, we use the index $j$ to denote any neuron belonging to the population of neurons encoding state $j$. After a short delay, at time $t^*$, a similar current $\rho^{\text{bias}}$ is delivered to the postsynaptic neuron encoding state $j$, for a duration $\omega$.

(11) $\rho_i^{\text{bias}}(t) = \begin{cases}\rho^{\text{bias}}\,\delta_{ij}, & \text{if } t\in[t^*,\, t^*+\omega)\\ 0, & \text{otherwise}\end{cases}$

Besides the place-tuned input current, CA1 neurons receive inputs from the presynaptic layer (CA3). The postsynaptic potential ρipost when the agent is in state j is thus given by

(12) $\rho_i^{\text{post}}(t) = \sum_{k}^{N_{\text{pop}}}\;\sum_{t_k^f < t} w_{ik}(t)\,\kappa(t - t_k^f) + \rho_i^{\text{bias}}(t),$

with the first sum running over all $N_{\text{pop}}$ presynaptic neurons encoding state $j$, and the second sum over all firing times $t_k^f$ of presynaptic neuron $k$ that occurred before $t$. The excitatory postsynaptic current $\kappa$ is modeled as an exponential decay, $\kappa(x)=\epsilon_0\, e^{-x/\tau_m}$ for $x\ge 0$ and zero otherwise. Each CA1 neuron $i$ fires following an inhomogeneous Poisson process with rate $\rho_i^{\text{post}}(t)$.

Note that in most simulations we use a single neuron per population ($N_{\text{pop}}=1$). In addition, we normally set $t^*=\theta$ and $\omega=T-\theta$. However, we keep these as explicit parameters for the theoretical derivations.
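The following Python sketch generates the activity of a single state visit according to Equations 10-12, for one presynaptic and one postsynaptic neuron (N_pop = 1, t* = θ, ω = T − θ). It is our illustration only; in particular, the value of ρ^bias is a placeholder, since in the model it is fixed by Equation 18 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_state_visit(w_ij, T=100.0, theta=80.0, dt=0.01,
                         rho_pre=0.1, rho_bias=0.1, eps0=1.0, tau_m=2.0):
    """One state visit following Equations 10-12 with a single pre- and
    postsynaptic neuron (N_pop = 1, t* = theta, omega = T - theta).
    Times in ms, rates in 1/ms; returns boolean pre/post spike trains."""
    n = int(T / dt)
    t = np.arange(n) * dt
    pre_rate = np.where(t < theta, rho_pre, 0.0)        # Equation 10
    pre_spikes = rng.random(n) < pre_rate * dt          # Poisson presynaptic spikes
    epsp = np.zeros(n)
    for k in np.flatnonzero(pre_spikes):                # EPSP kernel kappa(x) = eps0*exp(-x/tau_m)
        epsp[k:] += w_ij * eps0 * np.exp(-(t[k:] - t[k]) / tau_m)
    bias = np.where(t >= theta, rho_bias, 0.0)          # Equation 11 with t* = theta
    post_rate = epsp + bias                             # Equation 12
    post_spikes = rng.random(n) < post_rate * dt        # inhomogeneous Poisson CA1 spikes
    return pre_spikes, post_spikes

pre, post = simulate_state_visit(w_ij=1.0)
print(pre.sum(), post.sum())
```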

Equivalence with TD(λ)

Request a detailed protocol
Total plasticity update
Request a detailed protocol

Since we have the mathematical expression for the plasticity rule, and since CA3 and CA1 neurons follow inhomogeneous Poisson processes with time-dependent firing rates, we can calculate analytically the average total weight change of the synapse $w_{ij}$ for a given trajectory (details in the Appendix). Note that our calculation follows Kempter et al., 1999 and therefore accounts for the spike-spike correlation term that arises because the plasticity rule is sensitive to spike timing. We find that:

(13) $\Delta w_{ij} = A\, w_{ij} + \sum_{n=0}^{N}\left[B\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\delta_{i,j+n} + C\left(e^{-T/\tau_{\text{LTP}}}\right)^{n+1} w_{i,j+n+1}\right]$

where N is the number of states until the end of the trajectory and

(14) $A = \eta_{\text{STDP}} A_{\text{LTP}} N_{\text{pop}}\,\epsilon_0\,(\rho^{\text{pre}})^2\,\tau_{\text{LTP}}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\left[\theta-\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\right] + \eta_{\text{STDP}} A_{\text{LTP}}\,\theta\,\rho^{\text{pre}} N_{\text{pop}}\,\dfrac{\tau_m\,\tau_{\text{LTP}}}{\tau_m+\tau_{\text{LTP}}}\,\epsilon_0 - \eta_{\text{STDP}} A_{\text{pre}}\,\rho^{\text{pre}}\,\theta$
(15) $B = \eta_{\text{STDP}} A_{\text{LTP}}\,\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left(1-e^{-\omega/\tau_{\text{LTP}}}\right)\rho^{\text{bias}} = B'\rho^{\text{bias}}$
(16) $C = \eta_{\text{STDP}} A_{\text{LTP}} N_{\text{pop}}\,\epsilon_0\,\tau_m\,\tau_{\text{LTP}}^2\,(\rho^{\text{pre}})^2\left(1-e^{-\theta/\tau_m}\right)\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)$
Comparison with TD(λ)
Request a detailed protocol

Comparing the total weight change due to STDP (Equation 13) to the TD(λ) update (Equation 8), we can see that the two equations are very similar in form:

$w_{ij} \leftarrow w_{ij} - A\left\{-w_{ij} - \sum_{n=0}^{N}\left[\dfrac{B}{A}\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\delta_{i,j+n} + \dfrac{C}{A}\,e^{-T/\tau_{\text{LTP}}}\left(e^{-T/\tau_{\text{LTP}}}\right)^{n} w_{i,j+n+1}\right]\right\}$
$M(j,i) \leftarrow M(j,i) + \eta\left\{-M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^{n}\delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^{n} M(j+n+1,i)\right]\right\}$

We impose wij=M(j,i), and find:

(17) $-A = \eta$
(18) $-\dfrac{B}{A} = 1 \;\Rightarrow\; \rho^{\text{bias}} = -\dfrac{A}{B'}$
(19) $e^{-T/\tau_{\text{LTP}}} = \lambda\gamma$
(20) $-\dfrac{C\,e^{-T/\tau_{\text{LTP}}}}{A} = (1-\lambda)\gamma,$

where $A$, $B$, $B'$ and $C$ are defined as in Equations 14, 15, and 16.

Hence, our plasticity rule is learning the Successor Representation through a TD(λ) model with parameters:

(21) $\eta = -A$
(22) $\gamma = \dfrac{A-C}{A}\,e^{-T/\tau_{\text{LTP}}}$
(23) $\lambda = \dfrac{A}{A-C}$

To ensure the learning rate η is positive, one condition resulting from Equation 21 is

(24) $A_{\text{pre}} > A_{\text{LTP}}\, N_{\text{pop}}\,\tau_{\text{LTP}}\,\tau_m\,\epsilon_0\left(\rho^{\text{pre}}\,\dfrac{\left(1-e^{-\theta/\tau_m}\right)\left[\theta-\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\right]}{\theta} + \dfrac{1}{\tau_m+\tau_{\text{LTP}}}\right)$
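Putting Equations 14, 16 and 21-24 together, the mapping from plasticity parameters to TD(λ) parameters can be evaluated numerically. The Python sketch below is our illustration, not the authors' code; the defaults follow Table 1 below, A_pre defaults to the right-hand side of Equation 24 plus 5 (as in the Figure 2 simulations), and with θ = 80 ms and T = 100 ms it should reproduce approximately η ≈ 0.12, γ ≈ 0.89 and λ ≈ 0.21, the values quoted in the simulation details.

```python
import numpy as np

def td_parameters(theta, T, rho_pre, A_ltp=1.0, A_pre=None, eps0=1.0,
                  tau_m=2.0, tau_ltp=60.0, n_pop=1, eta_stdp=0.003):
    """Map STDP parameters onto TD(lambda) parameters via Equations 14, 16
    and 21-23 (times in ms, rates in 1/ms)."""
    bracket = theta - tau_ltp * (1 - np.exp(-theta / tau_ltp))
    if A_pre is None:
        # Default: right-hand side of Equation 24 plus 5, as in the Figure 2 simulations.
        A_pre = A_ltp * n_pop * tau_ltp * tau_m * eps0 * (
            rho_pre * (1 - np.exp(-theta / tau_m)) * bracket / theta
            + 1.0 / (tau_m + tau_ltp)) + 5.0
    # Equation 14
    A = (eta_stdp * A_ltp * n_pop * eps0 * rho_pre**2 * tau_ltp * tau_m
         * (1 - np.exp(-theta / tau_m)) * bracket
         + eta_stdp * A_ltp * theta * rho_pre * n_pop * eps0
         * tau_m * tau_ltp / (tau_m + tau_ltp)
         - eta_stdp * A_pre * rho_pre * theta)
    # Equation 16
    C = (eta_stdp * A_ltp * n_pop * eps0 * tau_m * tau_ltp**2 * rho_pre**2
         * (1 - np.exp(-theta / tau_m)) * (np.exp(theta / tau_ltp) - 1)
         * (1 - np.exp(-theta / tau_ltp)))
    # Equations 21-23
    eta = -A
    gamma = (A - C) / A * np.exp(-T / tau_ltp)
    lam = A / (A - C)
    return eta, gamma, lam

# With the Table 1 parameters, theta = 80 ms and T = 100 ms, this gives
# approximately eta = 0.12, gamma = 0.89 and lambda = 0.21.
print(td_parameters(theta=80.0, T=100.0, rho_pre=0.1))
```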

Learning during normal behavior (θ>>τLTP)

Request a detailed protocol

During normal behavior, we assume the place-tuned currents act on timescales much larger than the plasticity time constant: $\theta,\omega \gg \tau_{\text{LTP}}$. We can see from Equations 14 and 16 that the factor $A$ grows linearly with $\theta$, while $C$ grows exponentially with $\theta$. From Equation 23, we then have

(25) $\lambda \to 0$

(See also Figure 2—figure supplement 1).

Learning during replays (θ<<τLTP)

Assumptions

Request a detailed protocol

For the replay model, we assume the place-tuned currents are impulses, which make the neurons emit exactly one spike at a given time. Specifically, we let the duration of the place-tuned currents go to zero,

(26) $\theta,\omega \to 0$

while the intensity of the currents goes to infinity. For simplicity, we take:

$\rho^{\text{pre}}(\theta) = \dfrac{1}{\theta} \;\Rightarrow\; \lim_{\theta\to 0}\rho^{\text{pre}} = \infty, \qquad \rho^{\text{bias}}(\omega) = \dfrac{1}{\omega} \;\Rightarrow\; \lim_{\omega\to 0}\rho^{\text{bias}} = \infty$

Furthermore, we assume that the postsynaptic potentials caused by the single presynaptic spikes contribute negligibly to driving plasticity, allowing us to set

$\epsilon_0 \to 0$

Calculations of TD parameters

Request a detailed protocol

Given the assumptions above, we can see from Equations 14 and 16 that:

$A = -\eta_{\text{STDP}}\, A_{\text{pre}}, \qquad C = 0$

For Equation 15, we can use the Taylor expansion of $e^{x/\tau}$ around $x=0$, $e^{x/\tau}\approx 1+\frac{x}{\tau}$:

$B = \eta_{\text{STDP}} A_{\text{LTP}}\,\tau_{\text{LTP}}^2\,\rho^{\text{pre}}\,\dfrac{\theta}{\tau_{\text{LTP}}}\, e^{-t^*/\tau_{\text{LTP}}}\,\dfrac{\omega}{\tau_{\text{LTP}}}\,\rho^{\text{bias}} = \eta_{\text{STDP}} A_{\text{LTP}}\, e^{-t^*/\tau_{\text{LTP}}}$

Using Equations 21, 22, 23 and 18, we can calculate the parameters and constraints for the TD model:

(27) $\lambda = \dfrac{A}{A-C} = 1, \qquad \eta = -A = \eta_{\text{STDP}} A_{\text{pre}}, \qquad \gamma = \dfrac{A-C}{A}\,e^{-T/\tau_{\text{LTP}}} = e^{-T/\tau_{\text{LTP}}}, \qquad 1 = -\dfrac{B}{A} = \dfrac{A_{\text{LTP}}\, e^{-t^*/\tau_{\text{LTP}}}}{A_{\text{pre}}}$

As expected, the bootstrapping parameter λ=1 (see also Figure 2—figure supplement 1).

Alternative derivation of replay model

Place cell activation during replays

Request a detailed protocol

We model a replay event as a precise temporal sequence of spikes. Since every neuron represents a state of the environment, a replay sequence reproduces a trajectory of states. We assume that, when the agent is in state $S=j$, the neurons representing state $j$ fire $n_{\text{pre}}$ spikes at some point in the time interval $t\in[0,\sigma]$, where the exact firing times are uniformly sampled. After a short delay, the CA1 neurons representing state $j$ fire $n_{\text{post}}$ spikes at a time uniformly sampled from the interval $[t^*, t^*+\sigma]$. The time between two consecutive state visits is $T$. The exact number of spikes fired in each state is random but small. Specifically, it is sampled from the set $\{0, 1, 2\}$ according to the probability vector

(28) $p = \left(\dfrac{p_1}{2},\; 1-p_1,\; \dfrac{p_1}{2}\right)$

It is worth noting that other implementations are possible; we only assume that the average number of spikes in each state is 1 and that the average time between a presynaptic and a postsynaptic spike is $t^*$. The model could be further generalized to a higher average number of spikes per state.

Plasticity update

Request a detailed protocol

We consider again our learning rule, composed of a positive pre-post potentiation window and a presynaptic weight-dependent depression (Equation 9). For the synapse $w_{ij}$, we can see that, on average, the total amount of depression is determined by the number of times state $j$ is visited in the replayed trajectory:

$\text{LTD} = -A_{\text{pre}}\, w_{ij}\, N_j,$

where $N_j$ is the number of times state $j$ is visited. The amount of potentiation will instead be determined by the time difference between the postsynaptic and presynaptic firing times, which encodes the distance between state $j$ and state $i$:

$\text{LTP} = A_{\text{LTP}} \sum_{k} e^{-\frac{kT + t^*}{\tau_{\text{LTP}}}}\, n_k^{ij},$

where nkij represents the number of times the agent visited state i k steps after j. Combining the equations above we find that:

(29) $\Delta w_{ij} = \eta_{\text{STDP}} A_{\text{LTP}} \sum_{k} e^{-\frac{kT + t^*}{\tau_{\text{LTP}}}}\, n_k^{ij} - \eta_{\text{STDP}} A_{\text{pre}}\, w_{ij}\, N_j.$

If we assume that this value has converged to its stationary state, $\Delta w_{ij}=0$, we obtain

(30) $w^{*}_{ij} = \dfrac{A_{\text{LTP}}}{A_{\text{pre}}}\, e^{-t^*/\tau_{\text{LTP}}} \sum_{k}\left(e^{-T/\tau_{\text{LTP}}}\right)^{k}\dfrac{n_k^{ij}}{N_j}$

Comparison with online Monte Carlo learning

Request a detailed protocol

Given the stable weight w* from Equation 30, we can impose that:

(31) $\dfrac{A_{\text{LTP}}}{A_{\text{pre}}}\, e^{-t^*/\tau_{\text{LTP}}} = 1 \quad\text{and}$
(32) $e^{-T/\tau_{\text{LTP}}} = \gamma$

we find that the stable weight is:

(33) $w^{*}_{ij} = \sum_{k}\gamma^{k}\,\dfrac{n_k^{ij}}{N_j} \approx \mathbb{E}\!\left[\sum_{k}\gamma^{k}\,\mathbb{I}(S_k = i \mid S_0 = j)\right] = M(j,i)$

which is the definition of the successor representation matrix (Equation 4). Indeed, $w^{*}_{ij}$ computes the sample mean of the discounted distance between states $i$ and $j$, which is equivalent to performing an every-state Monte Carlo or TD(λ=1) update. Notably, from Equation 29, the learning rate of this Monte Carlo update is given by:

(34) $\eta = \eta_{\text{STDP}} A_{\text{LTP}}\, e^{-t^*/\tau_{\text{LTP}}} = \eta_{\text{STDP}} A_{\text{pre}}$
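As a numerical illustration of Equations 30-33 (our sketch, not the authors' code), the stationary weight can be estimated directly by averaging discounted state co-occupancies over sampled trajectories; for a random-walk policy on a linear track the result should approach the closed-form SR of that walk. Names and parameter values below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_sr_from_trajectories(n_states, gamma, n_trajectories=5000):
    """Every-state Monte Carlo estimate of the SR (Equation 33): for each
    occurrence of a state j in a trajectory, accumulate gamma^k for the state
    visited k steps later, then normalize by the number of visits N_j."""
    counts = np.zeros((n_states, n_states))
    visits = np.zeros(n_states)
    for _ in range(n_trajectories):
        s = int(rng.integers(1, n_states - 1))    # start in the interior of the track
        traj = [s]
        while 0 < s < n_states - 1:               # unbiased random walk until an end state
            s += int(rng.choice([-1, 1]))
            traj.append(s)
        for start, j in enumerate(traj):
            visits[j] += 1
            for k, i in enumerate(traj[start:]):
                counts[j, i] += gamma**k
    return counts / visits[:, None]

print(mc_sr_from_trajectories(n_states=5, gamma=0.89).round(2))
```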

Simulation details for Figure 2

Request a detailed protocol

A linear track with four states is simulated. The policy of the agent in this simulation is to traverse the track from left to right, with one epoch consisting of starting in state 1 and ending in state 4. One simulation consists of 50 epochs, and we re-run the whole simulation ten times with different random seeds. Over these ten seeds, mean and standard deviation of the synaptic weights are recorded after every epoch.

Our neural network consists of two layers, each with a single neuron per state (as in Figure 1). Synaptic connections are made from each presynaptic neuron to all postsynaptic neurons, resulting in a 4-by-4 matrix which is initialized as the identity matrix. The plasticity rule and neuronal activations follow Equations 9–12.

The STDP parameters are listed in Table 1.

Table 1
Parameters used for the spiking network.
ϵ₀: 1
ρ^pre: 0.1 ms⁻¹
τ_m: 2 ms
N_post: 1
N_pre^tot: 1
N_pre: 1
step size: 0.01 ms
η_STDP: 0.003
τ_LTP: 60 ms
A_LTP: 1 ms⁻¹

To satisfy Equation 24, we set $A_{\text{pre}}$ equal to its right-hand side plus 5.

For the behavioral case, we choose T=100ms, θ=80ms, ω=T-θ, which correspond to TD(λ) parameters λ=0.21, γ=0.89, η=0.12.

In the replay case, we have a sequence of single spikes per neuron (see Figure 2b and section ‘Alternative derivation of replay model’). Following Equation 27, we choose $T = -\log(\gamma)\,\tau_{\text{LTP}} \approx 7$ ms, where $\tau_{\text{LTP}}$ is as in Table 1 and $\gamma = 0.89$ as in the behavioral case. We set $\theta = 2$ ms and $\sigma = 0.5$ ms. By setting $\eta_{\text{STDP}} = \frac{\eta}{A_{\text{LTP}}}\exp(\theta/\tau_{\text{LTP}})$, the corresponding TD(λ) parameters are $\lambda = 1$, $\gamma = 0.89$, $\eta = 0.12$, just as in the behavioral case.

More details on the place cell activation during replays in our model can be found in section ‘Alternative derivation of replay model’. Using exactly one single spike per neuron with the above parameters would allow us to follow the TD(1) learning trajectories without any noise. For more biological realism, we choose p1=0.15 in Equation 28, in order to achieve an equal amount of noise due to the random spiking as in the case of behavioral activity (see Figure 4—figure supplement 2).
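For concreteness, a single replay event as used here can be sketched as follows (our Python illustration, not the authors' code; the pre-to-post offset t* is assumed equal to θ = 2 ms, and spike counts follow Equation 28).

```python
import numpy as np

rng = np.random.default_rng(2)

def replay_event(n_states, gamma=0.89, tau_ltp=60.0, sigma=0.5, t_star=2.0, p1=0.15):
    """One replay event for a left-to-right sweep: per state, the pre- and
    postsynaptic neurons fire 0, 1 or 2 spikes (probabilities from Equation 28),
    jittered within windows of width sigma; consecutive states are separated by
    T = -ln(gamma)*tau_LTP. Times in ms; t_star is assumed equal to theta = 2 ms."""
    T = -np.log(gamma) * tau_ltp
    pre, post = [], []
    for state in range(n_states):
        n_pre = rng.choice([0, 1, 2], p=[p1 / 2, 1 - p1, p1 / 2])
        n_post = rng.choice([0, 1, 2], p=[p1 / 2, 1 - p1, p1 / 2])
        t0 = state * T
        pre.append((state, t0 + rng.uniform(0.0, sigma, n_pre)))
        post.append((state, t0 + t_star + rng.uniform(0.0, sigma, n_post)))
    return T, pre, post

T, pre, post = replay_event(n_states=4)
print(round(T, 2))  # roughly 7 ms for gamma = 0.89 and tau_LTP = 60 ms
```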

Simulation details for Figure 3

Request a detailed protocol

Using the same neural network and plasticity parameters as the behavioral learning in Figure 2 (see previous section), we simulate the linear track in the following two situations:

  • The third state has T=200ms instead of 100ms. All other parameters remain the same as in Figure 2. Results plotted in Figure 3E.

  • The third state has ρpre=0.2ms-1 instead of 0.1ms-1. All other parameters remain the same as in Figure 2. Results plotted in Figure 3F.

Simulation details for Figure 4

Request a detailed protocol

A linear track with three states is simulated, and the agent has 50% probability to move left or right in each state (see Figure 4A). One epoch lasts until the agent reaches one of the STOP locations.

We then use the same neural network and plasticity parameters as used for Figure 2. We simulate three scenarios:

  • Only replay-based learning during all epochs (no behavioral learning). This scenario corresponds to MC STDP in Figure 4B and to Figure 4C.

  • Mixed learning using both behavior and replays. The probability for an epoch to be a replay decays over time as exp(-i/6), with i the epoch number (a minimal sketch of this schedule follows this list). This scenario corresponds to Mix STDP in Figure 4B and to Figure 4E.

  • Only behavioral learning during all epochs (no replays). This scenario corresponds to TD STDP in Figure 4B and to Figure 4D.
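A minimal Python sketch of the replay-probability schedule used in the mixed-learning scenario (our illustration; names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)

def epoch_types(n_epochs=50, tau=6.0):
    """Mixed-learning schedule: epoch i is a replay with probability exp(-i/tau),
    and a behavioural epoch otherwise."""
    p_replay = np.exp(-np.arange(n_epochs) / tau)
    return np.where(rng.random(n_epochs) < p_replay, "replay", "behaviour")

print(epoch_types()[:10])
```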

Simulation details for Figure 5

Request a detailed protocol

A linear track with 21 states is simulated. The SR is initialized as the identity matrix, and the reward vector (containing the reward at each state) is also initialized as the zero vector. We simulate the learning of the SR during behavior using the theoretical TD(0) updates and during replays using the theoretical TD(1) updates. The value of each state is then calculated as the matrix-vector product between the SR and the reward vector, resulting in an initial value of zero for each state.

The policy of the agent is a softmax policy (i.e. the probability to move to neighboring states is equal to the softmax of the values of those neighboring states). The first time the agent reaches the leftmost state of the track (state 1), the negative reward of –2 is revealed, mimicking the shock in the actual experiments, and the reward vector is updated accordingly for this state.

We now simulate two scenarios: in the first scenario, the agent always follows the softmax policy and no replays are triggered (see Figure 5D, left panel). In the second scenario, every time the agent enters the dark zone from the light zone (i.e. transitions from state 12 to state 11 in our simulation), a replay is triggered from that state until the leftmost state (state 1) (see Figure 5D, right panel). Both scenarios are simulated for 2000 state transitions. We then run these two scenarios 100 times and calculate mean and standard deviation of state occupancies (Figure 5F).

Finally, since the second scenario has more SR updates than the first scenario, we also simulate the first scenario for 4000 state transitions (Figure 5—figure supplement 1) and show how the observed behavior of Figure 5 is unaffected by this.
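For concreteness, the action selection and replay triggering of the two scenarios can be sketched as follows (our Python illustration; state indices are 0-based here, and the inverse temperature beta is an assumption, since the text only specifies a softmax over the neighboring states' values).

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax_step(state, M, rewards, beta=1.0):
    """Move to a neighbouring state with probability given by the softmax of the
    neighbours' values V = M @ rewards (beta is an assumed inverse temperature)."""
    V = M @ rewards
    neighbours = [s for s in (state - 1, state + 1) if 0 <= s < len(rewards)]
    logits = beta * V[neighbours]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(neighbours, p=p))

def maybe_trigger_replay(prev_state, state):
    """Second scenario: entering the dark zone from the light zone (the text's
    transition from state 12 to state 11, i.e. index 11 to 10 here) triggers a
    replay from the current state down to the leftmost state."""
    if prev_state == 11 and state == 10:
        return list(range(state, -1, -1))
    return None
```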

Appendix 1

Analytical derivations for the total weight change in the behavioural model

Presynaptic rate during state j

Whenever a state $S=j$ is entered, the presynaptic neurons encoding state $j$ start firing at a constant rate $\rho^{\text{pre}}$ for a time $\theta$, following a Poisson process with rate $\rho_j^{\text{pre}}(t)$:

(35) $\rho_j^{\text{pre}}(t) = \begin{cases}\rho^{\text{pre}}, & \text{if } t\in[0,\theta)\\ 0, & \text{otherwise}\end{cases}$

The other presynaptic neurons are silent.

Postsynaptic rate during state j

The average postsynaptic rate can be calculated as follows. The probability of a presynaptic spike between $t$ and $t+dt$ is equal to $\rho_j^{\text{pre}}(t)\,dt$. The size of the presynaptic population encoding state $j$ is $N_{\text{pop}}$, and each excitatory postsynaptic potential (EPSP) is modeled as an immediate jump of amplitude $\epsilon_0 w_{ij}$ followed by an exponential decay with time constant $\tau_m$.

Following Equation 12 in the main paper, reproduced below,

$\rho_i^{\text{post}}(t) = \sum_{k}^{N_{\text{pop}}}\;\sum_{t_k^f < t} w_{ik}(t)\,\kappa(t - t_k^f) + \rho_i^{\text{bias}}(t)$

we find that the average postsynaptic potential at time t is given by (assuming t=0 when entering the state j):

(36) $\bar{\rho}_i^{\text{post}}(t) = \int_0^{t} N_{\text{pop}}\,\rho_j^{\text{pre}}(t')\,\epsilon_0\, w_{ij}(t')\, e^{-(t-t')/\tau_m}\, dt' + \rho_i^{\text{bias}}(t)$

We assume that $w_{ij}(t)$ changes slowly compared to the timescale $\theta$, allowing us to treat the weight as constant during that time. We can then approximate the average postsynaptic rate as:

(37) $\bar{\rho}_i^{\text{post}}(t) = \begin{cases} N_{\text{pop}}\,\rho^{\text{pre}}\,\epsilon_0\, w_{ij}\,\tau_m\left(1-e^{-t/\tau_m}\right), & \text{if } 0 \le t < \theta\\ \rho^{\text{bias}}\,\delta_{ij}, & \text{if } t^* \le t < t^*+\omega\\ 0, & \text{otherwise}\end{cases}$

If $t^* < \theta$, both the first and the second term contribute to the postsynaptic rate in the time between $t^*$ and $\theta$.

LTP trace during state j

Given Equation 9 in the main paper, reproduced below,

$\tau_{\text{LTP}}\dfrac{dTr_{\text{LTP}}^{j}(t)}{dt} = -Tr_{\text{LTP}}^{j}(t) + \sum_{f}\delta(t - t_j^f)$

and combined with Equation 35, we can calculate the evolution of the LTP trace for neuron j during state j:

(38) $Tr_{\text{LTP}}^{j}(t) = \begin{cases}\rho^{\text{pre}}\displaystyle\int_0^{t} e^{-(t-t')/\tau_{\text{LTP}}}\, dt', & \text{if } 0 \le t < \theta\\ \rho^{\text{pre}}\,\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)e^{-(t-\theta)/\tau_{\text{LTP}}}, & \text{if } t \ge \theta\end{cases} = \begin{cases}\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(1-e^{-t/\tau_{\text{LTP}}}\right), & \text{if } 0 \le t < \theta\\ \rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t/\tau_{\text{LTP}}}, & \text{if } t \ge \theta\end{cases}$

For $0 \le t < \theta$, the presynaptic neuron $j$ is active and the trace therefore builds up with the presynaptic spikes; for $t \ge \theta$, the trace decays exponentially with time constant $\tau_{\text{LTP}}$.

Total amount of LTP during state j

Following (Kempter et al., 1999), first we calculate the amount of LTP without taking into account spike-to-spike correlation:

The probability of a postsynaptic spike between $t$ and $t+dt$ is $\bar{\rho}_i^{\text{post}}(t)\,dt$. The amount of LTP due to a single postsynaptic spike at time $t$ is $A_{\text{LTP}}\, Tr_{\text{LTP}}^{j}(t)$. Hence, combining Equations 37 and 38, the total amount of LTP during a state (i.e. between time 0 and $T$) becomes:

(39) $\text{LTP}_{\text{non-causal}} = A_{\text{LTP}} \displaystyle\int_0^{T}\bar{\rho}_i^{\text{post}}(t)\, Tr_{\text{LTP}}^{j}(t)\, dt$
(40) $\begin{aligned} &= A_{\text{LTP}} N_{\text{pop}}\,\rho^{\text{pre}}\,\epsilon_0\, w_{ij}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\rho^{\text{pre}}\,\tau_{\text{LTP}}\int_0^{\theta}\left(1-e^{-t/\tau_{\text{LTP}}}\right)dt + A_{\text{LTP}}\,\rho^{\text{bias}}\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\int_{t^*}^{t^*+\omega} e^{-t/\tau_{\text{LTP}}}\, dt\\ &= A_{\text{LTP}}\, w_{ij}\, N_{\text{pop}}\,\epsilon_0\,(\rho^{\text{pre}})^2\,\tau_{\text{LTP}}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\left[\theta-\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\right] + A_{\text{LTP}}\,\rho^{\text{bias}}\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left[1-e^{-\omega/\tau_{\text{LTP}}}\right]\end{aligned}$

Following Kempter et al., 1999, the amount of LTP due to the causal part (each presynaptic spike temporarily increases the probability of a postsynaptic spike) is given by:

(41) $\text{LTP}_{\text{causal}} = A_{\text{LTP}}\,\theta\,\rho^{\text{pre}}\,\epsilon_0\, w_{ij}\,\dfrac{\tau_m\,\tau_{\text{LTP}}}{\tau_m+\tau_{\text{LTP}}}$

Combining the non-causal (Equation 40) and causal (Equation 41) parts, we get the total amount of LTP during a state (assuming $\tau_m \ll \tau_{\text{LTP}}$):

(42) $\text{LTP} = A_{\text{LTP}}\, w_{ij}\, N_{\text{pop}}\,\epsilon_0\,(\rho^{\text{pre}})^2\,\tau_{\text{LTP}}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\left[\theta-\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\right] + A_{\text{LTP}}\,\rho^{\text{bias}}\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left[1-e^{-\omega/\tau_{\text{LTP}}}\right] + A_{\text{LTP}}\,\theta\,\rho^{\text{pre}}\,\dfrac{\tau_m\,\tau_{\text{LTP}}}{\tau_m+\tau_{\text{LTP}}}\,\epsilon_0\, w_{ij}$

Total amount of LTD during state j

There is a weight-dependent depression for each presynaptic spike, hence the amount of LTD during a state is given by:

(43) $\text{LTD} = -A_{\text{pre}}\,\rho^{\text{pre}}\,\theta\, w_{ij}$

Total plasticity during state j

Combining Equations 42 and 43, we can calculate the total amount of plasticity during the time the agent spends in the current state j:

(44) $\Delta_0 w_{ij} = A\, w_{ij} + B\,\delta_{ij}$

with

(45) $A = \eta_{\text{STDP}} A_{\text{LTP}}\, N_{\text{pop}}\,\epsilon_0\,(\rho^{\text{pre}})^2\,\tau_{\text{LTP}}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\left[\theta-\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\right] + \eta_{\text{STDP}} A_{\text{LTP}}\,\theta\,\rho^{\text{pre}}\,\dfrac{\tau_m\,\tau_{\text{LTP}}}{\tau_m+\tau_{\text{LTP}}}\,\epsilon_0 - \eta_{\text{STDP}} A_{\text{pre}}\,\rho^{\text{pre}}\,\theta$

and

(46) $B = \eta_{\text{STDP}} A_{\text{LTP}}\,\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left[1-e^{-\omega/\tau_{\text{LTP}}}\right]\rho^{\text{bias}}$

Plasticity due to states transitioning

Once the agent leaves state $j$, the decaying LTP trace can still cause potentiation due to the activity in the following states $j+n$, with $n = 1, 2, \dots$. Given that the agent spends a time $T$ in each state, the agent visits state $j+n$ during the time $t\in[nT, nT+T)$. We now calculate the contribution to plasticity due to these state transitions.

Postsynaptic rate during the new state j+n

During state j+n, the activity of the postsynaptic neurons is driven by the presynaptic neurons coding for j+n, and the bias current. We can thus generalize Equation 37 and find that the average postsynaptic rate ρ¯ipost during state j+n is:

(47) $\bar{\rho}_i^{\text{post}}(t) = \begin{cases} N_{\text{pop}}\,\rho^{\text{pre}}\,\epsilon_0\, w_{i,j+n}\,\tau_m\left(1-e^{-\theta/\tau_m}\right), & \text{if } nT \le t < nT+\theta\\ \rho^{\text{bias}}\,\delta_{i,j+n}, & \text{if } nT+t^* \le t < nT+t^*+\omega\\ 0, & \text{otherwise}\end{cases}$

LTP trace from state j, during the new state j+n

Following Equation 38, we find that the amplitude of the LTP trace from state j during state j+n is:

(48) $Tr_{\text{LTP}}^{j}(nT+t) = \rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-(t+nT)/\tau_{\text{LTP}}} = \rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(e^{-T/\tau_{\text{LTP}}}\right)^{n} e^{-t/\tau_{\text{LTP}}}$

with 0<t<T.

LTP due to state transitioning

We can then calculate the amount of LTP between the presynaptic neuron j and the postsynaptic neuron i, when the agent is in state j+n. We refer to Equation 39 and find:

$\begin{aligned}\text{LTP}_{\text{switch}} &= A_{\text{LTP}}\int_{nT}^{nT+T}\bar{\rho}_i^{\text{post}}(t)\, Tr_{\text{LTP}}^{j}(t)\, dt\\ &= A_{\text{LTP}} N_{\text{pop}}\,\rho^{\text{pre}}\,\epsilon_0\, w_{i,j+n}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\int_{nT}^{nT+\theta} e^{-(t-nT)/\tau_{\text{LTP}}}\, dt\\ &\quad + A_{\text{LTP}}\,\rho^{\text{bias}}\,\delta_{i,j+n}\,\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\int_{nT+t^*}^{nT+t^*+\omega} e^{-(t-nT)/\tau_{\text{LTP}}}\, dt\\ &= A_{\text{LTP}} N_{\text{pop}}\,\rho^{\text{pre}}\,\epsilon_0\, w_{i,j+n}\,\tau_m\left(1-e^{-\theta/\tau_m}\right)\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\tau_{\text{LTP}}\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)\\ &\quad + A_{\text{LTP}}\,\rho^{\text{bias}}\,\delta_{i,j+n}\,\rho^{\text{pre}}\,\tau_{\text{LTP}}\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\tau_{\text{LTP}}\left(1-e^{-\omega/\tau_{\text{LTP}}}\right)\end{aligned}$

The amount of plasticity in state j+n when starting from state j is thus:

(49) $\Delta_n w_{ij} = C\left(e^{-T/\tau_{\text{LTP}}}\right)^{n} w_{i,j+n} + B\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\delta_{i,j+n}$

where

(50) $C = \eta_{\text{STDP}} A_{\text{LTP}} N_{\text{pop}}\,(\rho^{\text{pre}})^2\,\epsilon_0\,\tau_m\,\tau_{\text{LTP}}^2\left(1-e^{-\theta/\tau_m}\right)\left(e^{\theta/\tau_{\text{LTP}}}-1\right)\left(1-e^{-\theta/\tau_{\text{LTP}}}\right)$
(51) $B = \eta_{\text{STDP}} A_{\text{LTP}}\,\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\left(e^{\theta/\tau_{\text{LTP}}}-1\right)e^{-t^*/\tau_{\text{LTP}}}\left(1-e^{-\omega/\tau_{\text{LTP}}}\right)\rho^{\text{bias}}$

It is worth noting that the parameter B derived here is the same as Equation 46.

Summary: total STDP update

Combining Equations 44 and 49, the total weight change for the synapse $w_{ij}$ is given by:

(52) $\Delta w_{ij} = \Delta_0 w_{ij} + \sum_{n=1}^{N}\Delta_n w_{ij} = A\, w_{ij} + \sum_{n=0}^{N}\left[B\left(e^{-T/\tau_{\text{LTP}}}\right)^{n}\delta_{i,j+n} + C\left(e^{-T/\tau_{\text{LTP}}}\right)^{n+1} w_{i,j+n+1}\right]$

where N is the number of states until the end of the trajectory and A, B, C are as defined in Equations 45, 46 and 50 respectively.

Analytical calculations for hyperbolic discounting

From Equation 22 in the main paper, we have that, in the behavioural model, $\gamma = \left(1-\frac{C}{A}\right)e^{-T/\tau_{\text{LTP}}}$. Here, we derive an approximation to this value.

If we assume that $\theta \gg \tau_m, \tau_{\text{LTP}}$, we can approximate $A$ and $C$ as:

(53) $\tilde{A} = \eta_{\text{STDP}} A_{\text{LTP}}\, N_{\text{pop}}\,\epsilon_0\,(\rho^{\text{pre}})^2\,\tau_{\text{LTP}}\,\tau_m\left(\theta-\tau_{\text{LTP}}\right) + \eta_{\text{STDP}} A_{\text{LTP}}\,\theta\,\rho^{\text{pre}}\,\dfrac{\tau_m\,\tau_{\text{LTP}}}{\tau_m+\tau_{\text{LTP}}}\,\epsilon_0 - \eta_{\text{STDP}} A_{\text{pre}}\,\rho^{\text{pre}}\,\theta = \eta_{\text{STDP}}\,\rho^{\text{pre}}\,(a\theta + b),$
(54) $\text{with}\quad a = A_{\text{LTP}}\,\epsilon_0\,\tau_{\text{LTP}}\,\tau_m\left(N_{\text{pop}}\,\rho^{\text{pre}} + \dfrac{1}{\tau_m+\tau_{\text{LTP}}}\right) - A_{\text{pre}}, \qquad b = -A_{\text{LTP}}\, N_{\text{pop}}\,\epsilon_0\,\rho^{\text{pre}}\,\tau_{\text{LTP}}^2\,\tau_m$
(55) $\tilde{C} = \eta_{\text{STDP}} A_{\text{LTP}} N_{\text{pop}}\,(\rho^{\text{pre}})^2\,\epsilon_0\,\tau_m\,\tau_{\text{LTP}}^2\, e^{\theta/\tau_{\text{LTP}}} = -\eta_{\text{STDP}}\,\rho^{\text{pre}}\, e^{\theta/\tau_{\text{LTP}}}\, b$

If we define ψ such that θ+ψ=T, we can rewrite and approximate the discount parameter as:

(56) $\gamma = \left(1-\dfrac{C}{A}\right)e^{-\frac{\theta+\psi}{\tau_{\text{LTP}}}} \approx -\dfrac{\tilde{C}}{\tilde{A}}\, e^{-\theta/\tau_{\text{LTP}}}\, e^{-\psi/\tau_{\text{LTP}}} = \dfrac{b\, e^{-\psi/\tau_{\text{LTP}}}}{a\theta+b} = \dfrac{1}{1+\frac{a}{b}\theta}\, e^{-\psi/\tau_{\text{LTP}}}$

From Equation 56, we can see that the discount $\gamma$ decreases hyperbolically as the duration $\theta$ of the presynaptic current increases. If, instead, we vary $\psi$, the discount decays exponentially (Figure 2—figure supplement 1a and b).
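To visualize the two regimes, one can compare the hyperbolic dependence on θ with the exponential dependence on ψ. The Python snippet below is purely illustrative, with k standing in for a/b as a placeholder constant.

```python
import numpy as np

# Hyperbolic discounting in theta (Equation 56, with k standing in for a/b)
# versus exponential discounting when psi is varied instead.
tau_ltp, k, psi = 60.0, 0.05, 20.0
theta = np.linspace(0.0, 400.0, 5)

gamma_hyperbolic = np.exp(-psi / tau_ltp) / (1.0 + k * theta)   # varying theta
gamma_exponential = np.exp(-(theta + psi) / tau_ltp)            # varying the total delay exponentially

print(np.round(gamma_hyperbolic, 3))
print(np.round(gamma_exponential, 3))
```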

Notice that this analysis extends to the replay model. Following what was done after Equation 26, we can connect the behavioural model with the replay model by letting $\theta, \epsilon_0 \to 0$, which implies $\psi \to T$. From Equation 56 we find that:

$\lim_{\theta,\epsilon_0\to 0}\gamma = e^{-T/\tau_{\text{LTP}}},$

which is exactly the definition of $\gamma$ in the replay model (Equation 27 in Materials and methods). For replays, the discount is therefore strictly exponential.

Furthermore, using the same calculations and Equations 21 and 19 in the main paper, we can find approximate values for the other parameters too (Figure 2—figure supplement 1c and d).

$\eta = -A \approx -\eta_{\text{STDP}}\,\rho^{\text{pre}}\,(a\theta+b), \qquad \lambda = \dfrac{e^{-T/\tau_{\text{LTP}}}}{\gamma} \approx \left(1+\frac{a}{b}\theta\right) e^{\psi/\tau_{\text{LTP}}}\, e^{-\frac{\psi+\theta}{\tau_{\text{LTP}}}} = \left(1+\frac{a}{b}\theta\right) e^{-\theta/\tau_{\text{LTP}}}$

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Modelling code is available on GitHub (https://github.com/jacopobono/learning_cognitive_maps_code, copy archived at swh:1:rev:d86b262545547353c7050bbc2d476c2f4a297989; Jacopo, 2023).

References

  1. Aggleton JP, Brown MW (1999). Episodic memory, amnesia, and the hippocampal-anterior thalamic axis. The Behavioral and Brain Sciences 22:425–444.
  2. Doya K (1995). Temporal difference learning in continuous time and space. In: Touretzky D, Mozer M, Hasselmo M, editors. Advances in Neural Information Processing Systems. Massachusetts, United States: MIT Press. pp. 1–10.
  3. Mattar MG, Daw ND (2017). A Rational Model of Prioritized Experience Replay. Ann Arbor, USA: The University of Michigan.
  4. Stachenfeld KL, Botvinick MM, Gershman SJ (2014). Design Principles of the Hippocampal Cognitive Map. NeurIPS Proceedings.

Decision letter

  1. Lisa M Giocomo
    Reviewing Editor; Stanford School of Medicine, United States
  2. Laura L Colgin
    Senior Editor; University of Texas at Austin, United States
  3. Michael E Hasselmo
    Reviewer; Boston University, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Learning predictive cognitive maps with spiking neurons during behaviour and replays" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Laura Colgin as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Michael E. Hasselmo (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Based on the comments from all reviewers, I'd recommend focusing your revisions on the primary topic of improving the link between their model and experimental data. Specifically, this would include: Consideration (or modeling) of how the model would extend to 2D, discussion on how the neural activity would be used to perform computations, the limitations of RL as it relates to interpreting experimental data, and to more appropriately frame the work in the context of experimental studies (all reviewers had detailed suggestions for how to do this with text changes). I've provided highlights from the three reviewers below that apply to these concerns but note that reviewer 2 also provided a number of specific suggestions related to text/reference changes. All reviewer comments are also included at the bottom of this message.

Reviewer 1:

– The successor representation is learned at the level of synaptic weights between the two layers. It is not clear how it is read out into neural activity and exploited to perform actual computations, as both layers are assumed to be strongly driven by external inputs. This is a major limitation of this work.

– One of the results is that STDP at the timescale of milliseconds can lead to learning over behavioral timescales of seconds. This result seems related to Drew and Abbott PNAS 2006. In that work, the mapping between learning on micro and macro timescales in fact relied on precise tuning of plasticity parameters. It is not clear to which extent similar limitations apply here, and what is the precise relation with Drew and Abbott.

– Most of the results are presented at a formal, descriptive level relating plasticity to reinforcement learning algorithms. The provided examples are quite limited and focus on a simplified setting, a linear track. It would be important to see that the results extend to two-dimensional environments, and to show how the successor representation is actually used (see first comment).

– The main text does not explain clearly how replays are implemented.

Reviewer 2:

I think the authors of this article need to be clear about the shortcomings of RL. They should devote some space in the discussion to noting neuroscience data that has not been addressed yet. They could note that most components of their RL framework are still implemented as algorithms rather than neural models. They could note that most RL models usually don't have neurons of any kind in them and that their own model only uses neurons to represent state and successor representations, without representing actions or action selection processes. They could note that the agents in most RL models commonly learn about barriers by needing to bang into the barrier in every location, rather than learning to look at it from a distance. The ultimate goal of research such as this should be to link cellular level neurophysiological data to experimental data on behavior. To the extent possible, they should focus on how they link neurophysiological data at the cellular level to spatial behavior and the unit responses of place cells in behaving animals, rather than basing the validity of their work on the assumption that the successor representation is correct.

Reviewer 3:

1. Could the authors elaborate more on the connection between the biological replays that are observed in a different context in the brain and the replays implemented in their model? Within the modeling context, when are replays induced upon learning in a novel environment, and what is the influence of replays when/if they are generated upon revisiting the previously seen/navigated environment?

2. The model is composed of CA1 and CA3, what are the roles of the other hippocampal subregions in learning predictive maps? From the reported results, it looks like it may be possible that prediction-based learning can be successfully achieved simply via the CA1-CA3 circuit. Are there studies (e.g., lesioned) that show this causal relationship to behavior? Along this line, what are the potential limitations of the proposed framework in understanding the circuit computation adopted by the hippocampus?

3. Do the authors believe that the plasticity rules/computational principles observed within the 2-layer model are specific to the CA1-CA3 circuit? Can these rules be potentially employed elsewhere within the medial temporal lobe or sensory areas? What are the model parameters used that could suggest that the observed results are specific to hippocampus-based predictive learning?

4. The analytical illustration linking the proposed model with reinforcement learning is well executed. However, in practice, the actual implementation of reinforcement learning within the model is unclear. Given the sample task provided where animals are navigating a simple environment, how can one make use of value-based learning to enhance behavior? Explicit discussion on the extent to which reinforcement learning is related to the actual computation potentially needed to navigate sensory environments (both learned and novel) would be really helpful in understanding the link between the model to reinforcement learning.

5. Subplots both within and across figures seem to be of very different text formatting and sizing (such as panel F in Figure 4 and Figure 5). Please reformat them accordingly.

Reviewer #2 (Recommendations for the authors):

Important: Note that the page numbers refer to the page in the PDF, which is their own page number-1 (due to eLife adding a header page).

Page 3 – "smoothly… and anything in between" – this is overstated and should be removed.

Page 3 – "don't need to discretize time…". Here and elsewhere there should be citations to the work of Doya, NeurIPS 1995, Neural Comp 2000 on the modeling of continuous time in RL.

Page 3 – "using replays" – It is very narrowminded to assume that all replay do is set up successor representations. They could also be involved in model-based planning of behavior as suggested in the work of Johnson and Redish, 2012; Pfeiffer and Foster, 2018; Kay et al. 2021 and modeled in Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012 and Fenton and Kubie, 2012.

Page 3 – They assume that STDP can occur during replay, but evidence for STDP during replay is unclear. McNaughton's lab showed that LTP is less likely to be induced during the modulatory states during which sharp-wave ripple replay events occur. They should look for citations that have actually shown LTP induction during the replay state.

Page 4 – Marr's three levels – They should remove this discussion about Marr's three levels as I think the implementation level is relatively sparse and the behavioral level is also relatively sparse.

Page 4 – "The hippocampus has long been thought" – It's astounding that the introduction only cites two experimental papers (O'Keefe and Dostrovsky, 1971; Mehta et al. 2000) and then the start of the Results section makes a statement like this and only cites Stachenfeld et al. 2017 as if it were an experimental paper. There are numerous original research papers that should be cited for the role of hippocampus in behavior. They should cite at least five or six so that the reader doesn't get the impression all o this work started with the holy paper by Stachenfeld et al. 2017. For example, they could again cite O'Keefe and Nadel, 1978 for the very comprehensive review of the literature up to that time, plus the seminal work of Morris et al. 1982 in Morris water maze and Olton, 1979 in 8-arm radial maze and perhaps some of the work by Aggleton and by Eichenbaum on spatial alternation.

Page 5 – The description of successor representations is very one dimensional. They should mention how it can be expanded to two dimensions.

Page 5 – "Usually attributed to model based…". They cannot just talk about SR being model free. Since this section was supposed to be for neuroscientists, they need to clearly explain the distinction between model free and model-based RL, and describe why successor representations are not just model-based RL, but instead provide a look-up table of predictive state that does NOT involve model-based planning of behavior. The blog of Vitay gives a much better overview that compares model-free, model-based and successor representations:

https://julien-vitay.net/post/successor_representations/ – This needs more than just a citation – there should be a clear description of model-based and model-free RL in contrast to SR, and Vitay is an example of that.

Page 5 – Related to this issue – they need to repeatedly address the fact that Successor representations are just an hypothesis contrast with model-based behavior, and repeatedly throughout the paper discuss that model-based behavior could still be the correct accounting for all of the data that they address.

Page 5 – "similar to (Mehta et al. 2000)" – Learning in the CA3-CA1 network has been modeled like this in many previous models that should be cited here including McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Treves and Rolls, 1994; Mehta et al. 2000; Hasselmo, Bodelon and Wyble, 2002.

Page 6 – Figure 1d looks like the net outcome of the learning rule in this example is long-term depression. Is that intended? Given the time interval between pre and post, it looks like it ought to be potentiation in the example.

Page 7 – They should address the problem of previously existing weights in the CA3 to CA1 connections. For example, what if there are pre-existing weights that are strong enough to cause post-synaptic spiking in CA1 independent of any entorhinal input? How do they avoid the strengthening of such connections? (i.e. the problem of prior weights driving undesired learning on CA3-CA1 synapses is addressed in Hasselmo and Schnell, 1994; Hasselmo, Bodelon and Wyble, 2002, which should be cited).

Page 7 – "Elegantly combines rate and temporal" – This is overstated. The possible temporal codes in the hippocampus include many possible representations beyond just one step prediction. They need to specify that this combines one type of possible temporal code. I also recommend removing the term "elegant" from the paper. Let someone else call your work elegant.

Page 7 – "replays for learning" – as noted above in experiments LTP has not been shown to be induced during the time periods of replay – sharp-wave ripple replay events seem to be associated with lower cholinergic tone (Buzsaki et al. 1983; VandeCasteele et al. 2014) whereas LTP is stronger when Ach levels are higher (Patil et al. J. Neurophysiol. 1998). This is not an all-or-none difference, but it should be addressed.

Page 7 – "equivalent to TD…". Should this say "equivalent to TD(0)"?

Page 8 – "Bootstrapping means that a function is updated using current estimates of the same function…". This is a confusing and vague description of bootstrapping. They should try to give a clearer definition for neuroscientists (or reduce their reference to this).

Page 9 – Figure 2 – Do TD λ and TD zero really give equivalent weight outputs?

Page 8 – "that are behaviorally far apart" – I don't understand how this occurs.

page 10 – "dependency of synaptic weights on each other as discussed above." This was not made sufficiently clear either here or above.

Page 10 – "dependency of synaptic weights on each other" This also suggests a problem of stability if the weights can start to drive their own learning and cause instability – how is this prevented?

Page 10 – "average of the discounted state occupancies" – this would be uniform without discounting but what is the biological mechanism for the discounting that is used here?

Page 10 – "due to the bootstrapping" – again this is unclear – can be improved by giving a better definition of bootstrapping and possibly by referring to specific equation numbers.

Page 11 – "exponential dependence" what is the neural mechanism for this?

Page 11 – "Ainsley" is not a real citation in the bibliography. Should fix and also provide a clearer definition (or equation) for hyperbolic.

Page 11 – "elegantly combines two types of discounting" – how is useful? Also, let other people call your work elegant.

Page 11 – how does discounting depend on both firing rate and STDP -- should provide some explanation or at least refer to where this is shown in the equations.

page 13 – "Cheng and Frank" – this is a good citation, but they could add more here on timing of replay events.

Page 15 – This whole section on the shock experiment starts with the assumption of a successor representation. As noted above, they need to explicitly discuss the important alternate hypothesis that the neural activity reflects model-based planning that guides the behavior in the task (and could perhaps better account for the peak of occupancy at the border of light and dark).

Page 16 – "mental imagination" – rather than using it for modifying SR, why couldn't mental imagination just be used for model-based behavior?

Page 17 – "spiking" – again, if they are going to refer to their model as a "spiking" model, they need to add some plots showing spiking activity.

Reviewer #3 (Recommendations for the authors):

I found the proposed modeling framework to be very exciting and of potential interest to not only computational neuroscientists but also to readers who are interested in neural mechanisms underlying learning in general. The manuscript is well-written and includes a detailed description and rationale of the model setups as well as the findings and their relevance to biological findings. That said, I have a few comments that I hope the authors could help address:

1. Could the authors elaborate more on the connection between the biological replays that are observed in a different context in the brain and the replays implemented in their model? Within the modeling context, when are replays induced upon learning in a novel environment, and what is the influence of replays when/if they are generated upon revisiting the previously seen/navigated environment?

2. The model is composed of CA1 and CA3, what are the roles of the other hippocampal subregions in learning predictive maps? From the reported results, it looks like it may be possible that prediction-based learning can be successfully achieved simply via the CA1-CA3 circuit. Are there studies (e.g., lesioned) that show this causal relationship to behavior? Along this line, what are the potential limitations of the proposed framework in understanding the circuit computation adopted by the hippocampus?

3. Do the authors believe that the plasticity rules/computational principles observed within the 2-layer model are specific to the CA1-CA3 circuit? Can these rules be potentially employed elsewhere within the medial temporal lobe or sensory areas? What are the model parameters used that could suggest that the observed results are specific to hippocampus-based predictive learning?

4. The analytical illustration linking the proposed model with reinforcement learning is well executed. However, in practice, the actual implementation of reinforcement learning within the model is unclear. Given the sample task provided where animals are navigating a simple environment, how can one make use of value-based learning to enhance behavior? Explicit discussion on the extent to which reinforcement learning is related to the actual computation potentially needed to navigate sensory environments (both learned and novel) would be really helpful in understanding the link between the model to reinforcement learning.

5. Subplots both within and across figures seem to be of very different text formatting and sizing (such as panel F in Figure 4 and Figure 5). Please reformat them accordingly.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Learning predictive cognitive maps with spiking neurons during behaviour and replays" for further consideration by eLife. Your revised article has been evaluated by Laura Colgin (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer 1 makes two text suggestions that I believe would clarify the findings. There remains some lack of clarification around (1) how the second layer of the model mixes the successor representation with a representation of the current state itself and (2) justification for the difference in duration of the external inputs to the two layers. Reviewer 1 also suggests an additional figure, but I leave the decision to add (or not) this to the authors. Details regarding the requested clarifications are below.

Reviewer #1 (Recommendations for the authors):

The revised article has only partly resolved my confusion.

My main issue was the following: in the proposed feed-forward model, the synaptic weights between the two layers learn the entries of the successor matrix. If external inputs were fed only to the first layer, the second layer would directly read out the successor representation (this is suggested in Figure 1 E-F, but not explicitly mentioned in the text as far as I can tell). Instead, in the model, both layers are driven by external inputs representing the current state. This is crucial for learning, but it implies that the activity of the second layer mixes the successor representation with a representation of the current state itself. Learning and readout, therefore, seem antagonistic. It would be worth explaining this fact in the main text.

In their reply, the authors clarify that the external inputs drive the activity of the second layer only for a limited time (20%). As far as I can tell, in the text, this is mentioned explicitly only in the legend of Figure 5-S2. That seems to imply that there is a large difference in the duration of the external inputs to the two layers. How can that be justified?

More importantly, it seems that varying the value of the delay should lead to a tradeoff between the accuracy of learning and the accuracy of the subsequent readout in the second layer. Is that the case? It would be useful to have a figure where the delay is varied.

https://doi.org/10.7554/eLife.80671.sa1

Author response

Reviewer 1:

– The successor representation is learned at the level of synaptic weights between the two layers. It is not clear how it is read out into neural activity and exploited to perform actual computations, as both layers are assumed to be strongly driven by external inputs. This is a major limitation of this work.

We thank the reviewer for this important remark. Since we modelled our neurons to integrate the synaptic EPSPs and generated spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate. We performed additional simulations and have added a supplementary figure (Figure 5—figure supplement S2) showing this setup.

We also want to note that the external inputs driving the activity affect the firing rate of the value neuron only during a limited amount of time (20% of the time in each state in our simulation of Figure 5—figure supplement S2). Moreover, this effect changes the estimate of the value quantitatively, but not qualitatively, i.e. the ranking of the states by value is not affected.

Practically, using the parameters in our simulation, one could either read out the correct estimate of the value during the first 80% of the time in a state, learn a correction to this perturbation in the weights to the value neuron, or simply use a policy that is based on the ranking of the values instead of the actual firing rate.

Besides the supplementary figure (Figure 5—figure supplement S2), we added the following paragraph in the Discussion section:

“Since we modelled our neurons to integrate the synaptic EPSPs and generate spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate (see Figure 5—figure supplement S2). While the neuron model used is simple, it will be interesting future work to study analogous models with non-linear neurons.”

– One of the results is that STDP at the timescale of milliseconds can lead to learning over behavioral timescales of seconds. This result seems related to Drew and Abbott PNAS 2006. In that work, the mapping between learning on micro and macro timescales in fact relied on precise tuning of plasticity parameters. It is not clear to which extent similar limitations apply here, and what is the precise relation with Drew and Abbott.

We thank the reviewer for pointing us to the interesting work by Drew and Abbott. We added the following paragraph in the Discussion section:

“Learning on behavioural timescales using STDP was also investigated in Drew 2006. The main difference between Drew 2006 and our work, is that the former relies on overlapping neural activity between the pre- and post-synaptic neurons from the start, while in our case no such overlap is required. In other words, our setup allows us to learn connections between a presynaptic neuron and a postsynaptic neuron whose activities are separated by behavioural timescales initially. For this to be possible, there are two requirements: (1) the task needs to be repeated many times and (2) a chain of neurons are consecutively activated between the aforementioned presynaptic and postsynaptic neuron. Due to this chain of neurons, over time the activity of the postsynaptic neuron will start earlier, eventually overlapping with the presynaptic neuron.”

– Most of the results are presented at a formal, descriptive level relating plasticity to reinforcement learning algorithms. The provided examples are quite limited and focus on a simplified setting, a linear track. It would be important to see that the results extend to two-dimensional environments, and to show how the successor representation is actually used (see first comment).

We thank the reviewer for the feedback, and have now included two new supplementary figures. Figure 5—figure supplement S2 shows the usefulness of our setup when learning the state values, as discussed above. Figure 1—figure supplement S1shows a simulation in a 2D environment. In fact, due to the exact link with TD(λ), our setup is general for any type of task, in any dimension, where states are visited and which may not need to be a navigation task.

We have added the following sentences in the second to last paragraph of section 2.2:

Moreover, due to the equivalence with TD(λ), our setup is general for any type of task where discrete states are visited, in any dimension, and which may not need to be a navigation task (see e.g. Figure 2d for a 2D environment).

– The main text does not explain clearly how replays are implemented.

We thank the reviewer for pointing out this issue. We have now updated the methods section 4.3.6 to be more clear about the implementation of the replays. We have also updated the description of replay generation in the methods of Figure 2 as follows:

“More details on the place cell activation during replays in our model can be found in section 4.3.6. Using exactly one single spike per neuron with the above parameters would allow us to follow the TD(1) learning trajectories without any noise. For more biological realism, we choose p1=0.15 in equation 28, in order to achieve an equal amount of noise due to the random spiking as in the case of behavioural activity (see Supplementary Figure S2).”

Reviewer 2:

I think the authors of this article need to be clear about the shortcomings of RL. They should devote some space in the discussion to noting neuroscience data that has not been addressed yet. They could note that most components of their RL framework are still implemented as algorithms rather than neural models. They could note that most RL models usually don't have neurons of any kind in them and that their own model only uses neurons to represent state and successor representations, without representing actions or action selection processes. They could note that the agents in most RL models commonly learn about barriers by needing to bang into the barrier in every location, rather than learning to look at it from a distance. The ultimate goal of research such as this should be to link cellular level neurophysiological data to experimental data on behavior. To the extent possible, they should focus on how they link neurophysiological data at the cellular level to spatial behavior and the unit responses of place cells in behaving animals, rather than basing the validity of their work on the assumption that the successor representation is correct.

We thank the reviewer for the important feedback. We have addressed the reviewer 2's concerns in the "Public Evaluation" section, which includes their detailed comments. In short, here, we made substantial changes to refer more thoroughly to the experimental literature, discuss the limitations of RL in general and the limitations of our proposed framework, discuss the link between neuronal data and behaviour, and finally, we made sure not to overstate the validity of the successor representation.

Reviewer 3:

1. Could the authors elaborate more on the connection between the biological replays that are observed in a different context in the brain and the replays implemented in their model? Within the modeling context, when are replays induced upon learning in a novel environment, and what is the influence of replays when/if they are generated upon revisiting the previously seen/navigated environment?

We thank the reviewer for the interesting questions. Within our modelling context, we speculate that replays may contribute to the learning of the SR. To understand why this may be beneficial, we show that in our framework, learning is faster in a novel environment when the proportion of replays is larger. Replays, because they rely on the MC algorithm, are great for learning quickly as they are fast to override obsolete weights. Specifically for Figure 4, we implemented a probability for replays that decays exponentially with the number of epochs in a novel environment, but other schemes that introduce more replays in novel environments than in familiar ones should lead to similar conclusions. We also show that, when replays are generated in familiar environments, they still contribute to learning the same SR but introduce more variance. We also argue that replays can be used to imagine novel trajectories (similar to ideas in model-based planning) and thus update the SR without actually walking the trajectory (Figure 5). In summary, we believe that replays can functionally serve a variety of purposes, and our framework merely proposes additional beneficial properties without claiming to explain all observed replays. For example, our framework does not encompass reverse replays. We have updated the ‘Replays’ paragraph of the Discussion section to reflect this.

“We have also proposed a role for replays in learning the SR, in line with experimental findings and RL theories (Russek 2017; Momennejad 2017). In general, replays are thought to serve different functions, spanning from consolidation to planning (Roscow 2021). Here, we have shown that when the replayed trajectories are similar to the ones observed during behaviour, they play the role of speeding up and consolidating learning by regulating the bias-variance trade-off, which is especially useful in novel environments. On the other hand, if the replayed trajectories differ from the ones experienced during wakefulness, replays can play a role in reshaping the representation of space, which would suggest their involvement in planning. Experimentally it has been observed that replays often start and end from relevant locations in the environment, like reward sites, decision points, obstacles or the current position of the animal (FreyjaOlafsdottir 2015; Pfeiffer 2013; Jackson 2006; Mattar 2017). Since these are salient locations, it is in line with our proposition that replays can be used to maintain a convenient representation of the environment. It is worth noticing that replays can serve a variety of functions, and our framework merely proposes additional beneficial properties without claiming to explain all observed replays. For example, next to forward replays, also reverse replays are ubiquitous in hippocampus (Pfeiffer 2020). The reverse replays are not included in our framework, and it is not clear yet whether they play different roles, with some evidence suggesting that reverse replays are more closely tied to the reward encoding (Ambrose 2016).”

2. The model is composed of CA1 and CA3, what are the roles of the other hippocampal subregions in learning predictive maps? From the reported results, it looks like it may be possible that prediction-based learning can be successfully achieved simply via the CA1-CA3 circuit. Are there studies (e.g., lesioned) that show this causal relationship to behavior? Along this line, what are the potential limitations of the proposed framework in understanding the circuit computation adopted by the hippocampus?

We thank the reviewer for the feedback. We have added the following paragraph in the discussion to address this point:

“There are three different neural activities in our proposed framework: the presynaptic layer (CA3), the postsynaptic layer (CA1), and the external inputs. These external inputs could for example be location-dependent currents from the entorhinal cortex, with timings guided by the theta oscillations. The dependence of CA1 place fields on CA3 and entorhinal input is in line with lesion studies (see e.g. Brun2008, Hales2014, Oreilly2014). It would be interesting for future studies to further dissect the role various areas play in learning cognitive maps.”

3. Do the authors believe that the plasticity rules/computational principles observed within the 2-layer model are specific to the CA1-CA3 circuit? Can these rules be potentially employed elsewhere within the medial temporal lobe or sensory areas? What are the model parameters used that could suggest that the observed results are specific to hippocampus-based predictive learning?

We thank the reviewer for this important question. We refer to the following paragraph in the discussion regarding this topic:

“Notably, even though we have focused on the hippocampus in our work, the SR does not require predictive information to come from higher-level feedback inputs. This framework could therefore be useful even in sensory areas: certain stimuli are usually followed by other stimuli, essentially creating a sequence of states whose temporal structure can be encoded in the network using our framework. Interestingly, replays have been observed in other brain areas besides the hippocampus Kurth-Nelson2016, Staresina2013. Furthermore, temporal difference learning in itself has been proposed in the past as a way to implement prospective coding Brea2016”

4. The analytical illustration linking the proposed model with reinforcement learning is well executed. However, in practice, the actual implementation of reinforcement learning within the model is unclear. Given the sample task provided where animals are navigating a simple environment, how can one make use of value-based learning to enhance behavior? Explicit discussion on the extent to which reinforcement learning is related to the actual computation potentially needed to navigate sensory environments (both learned and novel) would be really helpful in understanding the link between the model to reinforcement learning.

We thank the reviewer for this important remark. We have added Figure 5—figure supplement S2 and updated the discussion to address this point. Since we modelled our neurons to integrate the synaptic EPSPs and generated spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate. We performed additional simulations and have added a supplementary figure (Figure 5—figure supplement S2) showing this setup.

Besides the supplementary figure (Figure 5—figure supplement S2), we added the following paragraph in the Discussion section:

“Since we modelled our neurons to integrate the synaptic EPSPs and generate spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate (see Figure 5—figure supplement S2). While the neuron model used is simple, it will be interesting for future work to study analogous models with nonlinear neurons.”

5. Subplots both within and across figures seem to be of very different text formatting and sizing (such as panel F in Figure 4 and Figure 5). Please reformat them accordingly.

Thank you, we have reformatted the figures.

Reviewer #2 (Recommendations for the authors):

Important: Note that the page numbers refer to pages in the PDF, which are offset by one from the manuscript's own page numbers (due to eLife adding a header page).

Page 3 – "smoothly… and anything in between" – this is overstated and should be removed.

We thank the reviewer for the feedback. We have adapted the sentence as follows:

“We show mathematically that our proposed framework smoothly connects a temporally precise spiking code akin to replay activity with a rate based code akin to behavioural spiking.”

While we agree that our model implements only one type of temporal code, it is important to stress that there is a smooth transition between this temporal code and a pure rate encoding of the state. The longer the time T spent in each state, the less precisely the spikes are organised in time. In all cases, the learning dynamics have the same fixed point, namely, they converge to the successor representation. We believe that this fact is non-trivial. Furthermore, we show how this smooth transition changes how the fixed point (SR) is reached. At one extreme (T=0), the learning is algorithmically equivalent to TD(1); at the other extreme (T=infinity), to TD(0); and all intermediate cases implement a value of λ between 0 and 1.

Page 3 – "don't need to discretize time…". Here and elsewhere there should be citations to the work of Doya, NeurIPS 1995, Neural Comp 2000 on the modeling of continuous time in RL.

Thank you, we have added those citations.

Page 3 – "using replays" – It is very narrowminded to assume that all replay do is set up successor representations. They could also be involved in model-based planning of behavior as suggested in the work of Johnson and Redish, 2012; Pfeiffer and Foster, 2018; Kay et al. 2021 and modeled in Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012 and Fenton and Kubie, 2012.

We thank the reviewer for the feedback, but we assure the reviewer that we did not mean to suggest that this is all replays do. In fact, we are well aware of other potential roles of replays in memory consolidation, model-based planning, etc. We merely propose to add another potential benefit of replays to the existing hypotheses, namely that they can be used to learn, and in doing so they can reduce bias and allow learning offline. We have adapted the ‘Replays’ paragraph in the discussion to reflect this and have also adapted the highlighted text to make this clearer:

“Finally, replays have long been speculated to be involved in learning models of the environment (Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012; Fenton and Kubie, 2012; Johnson and Redish, 2012; Pfeiffer and Foster, 2018; Kay et al., 2021). Here, we investigate how replays could play an additional role in learning the SR cognitive map.”

Page 3 – They assume that STDP can occur during replay, but evidence for STDP during replay is unclear. McNaughton's lab showed that LTP is less likely to be induced during the modulatory states during which sharp-wave ripple replay events occur. They should look for citations that have actually shown LTP induction during the replay state.

Thank you for this comment. We have added the following sentence in the "Replay" paragraph of the discussion.

“Moreover, while indirect evidence supports the idea that replays can play a role during learning Igata2020, it is not yet clear how synaptic plasticity is manifested during replays Fuchsberger2022.”

Page 4 – Marr's three levels – They should remove this discussion about Marr's three levels as I think the implementation level is relatively sparse and the behavioral level is also relatively sparse.

We believe that this link between computation, algorithm and implementation is actually a strength of the proposed work. Typically, only one or at most two levels are discussed, without a more holistic view. We believe that our work (and e.g. George et al.) goes one step further by making the link between the implementation (spiking neurons with STDP), the algorithm (TD learning) and the computational theory (SR / predictive cognitive maps) explicit. While these modelling studies are, of course, abstractions of reality, we believe this link is not trivial, and we would like to maintain this paragraph to stimulate future research to bridge these levels as well.

Page 4 – "The hippocampus has long been thought" – It's astounding that the introduction only cites two experimental papers (O'Keefe and Dostrovsky, 1971; Mehta et al. 2000) and then the start of the Results section makes a statement like this and only cites Stachenfeld et al. 2017 as if it were an experimental paper. There are numerous original research papers that should be cited for the role of hippocampus in behavior. They should cite at least five or six so that the reader doesn't get the impression all o this work started with the holy paper by Stachenfeld et al. 2017. For example, they could again cite O'Keefe and Nadel, 1978 for the very comprehensive review of the literature up to that time, plus the seminal work of Morris et al. 1982 in Morris water maze and Olton, 1979 in 8-arm radial maze and perhaps some of the work by Aggleton and by Eichenbaum on spatial alternation.

We agree and thank the reviewer for the suggestion. We have now expanded the citations of relevant experimental work.

Page 5 – The description of successor representations is very one dimensional. They should mention how it can be expanded to two dimensions.

To address this, we performed a new simulation in a 2D environment (Supplementary Figure 2) and added a discussion of the generality of the approach to any dimension in section 2.1:

“Even though we introduced the linear track as an illustrative example, the SR can be learned in any environment (see Figure 1—figure supplement S1 for an example in an open field)”.

Page 5 – "Usually attributed to model based…". They cannot just talk about SR being model free. Since this section was supposed to be for neuroscientists, they need to clearly explain the distinction between model free and model-based RL, and describe why successor representations are not just model-based RL, but instead provide a look-up table of predictive state that does NOT involve model-based planning of behavior. The blog of Vitay gives a much better overview that compares model-free, model-based and successor representations:

https://julien-vitay.net/post/successor_representations/ – This needs more than just a citation – there should be a clear description of model-based and model-free RL in contrast to SR, and Vitay is an example of that.

We have now extended this paragraph with a more thorough explanation of model-free and model-based RL, including an example of what happens when a reward location changes in all three cases.

Because of this predictive information, the SR allows sample-efficient re-learning when the reward location is changed Gershman2018. In reinforcement learning, we tend to distinguish between model-free and model-based algorithms. The SR is believed to sit in between these two modalities. In model-free reinforcement learning, the aim is to directly learn the value of each state in the environment. Since there is no model of the environment at all, if the location of a reward is changed, the agent first has to unlearn the previous reward location by visiting it enough times, and only then can it re-learn the new location. In model-based reinforcement learning, a precise model of the environment is learned, specifically the single-step transition probabilities between all states of the environment. Model-based learning is computationally expensive, but allows a certain flexibility: if the reward changes location, the updated values of the states can be derived immediately. As we have seen, however, the SR can re-learn a new reward location rather efficiently, although less so than model-based learning. The SR can also be efficiently learned using model-free methods and allows us to easily compute values for each state, which in turn can guide the policy Dayan1993, Russek2017, Momennejad2017. This position between model-based and model-free methods makes the SR framework very powerful, and its similarities with hippocampal neuronal dynamics have led to increased attention from the neuroscience community. Finally, in our examples above, we considered an environment made up of a discrete number of states. This framework can be generalised to a continuous environment represented by a discrete number of place cells.
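As a purely illustrative sketch of this contrast (a made-up four-state environment, not one of the tasks in the paper), the following lines show how an SR agent re-computes values immediately after a reward is moved, whereas a model-free agent's cached value table would have to be unlearned and relearned through experience:

```python
import numpy as np

gamma, n = 0.9, 4

# Toy deterministic policy: always step right; the last state is terminal.
P = np.diag(np.ones(n - 1), k=1)

# SR under this policy (Dayan, 1993): M = (I - gamma * P)^-1.
M = np.linalg.inv(np.eye(n) - gamma * P)

R_old = np.array([0.0, 0.0, 0.0, 1.0])   # reward initially at the last state
R_new = np.array([0.0, 1.0, 0.0, 0.0])   # reward moved to state 1

# SR agent: the predictive map M is unchanged, so once the new reward vector
# is learned, the values follow immediately from a single matrix product.
print("SR values, old reward:", np.round(M @ R_old, 2))
print("SR values, new reward:", np.round(M @ R_new, 2))

# Model-free agent: V is a cached table of numbers with no structure to
# exploit; every entry must be updated through repeated visits (not shown).
```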

Page 5 – Related to this issue – they need to repeatedly address the fact that Successor representations are just an hypothesis contrast with model-based behavior, and repeatedly throughout the paper discuss that model-based behavior could still be the correct accounting for all of the data that they address.

We thank the reviewer for pointing out that the successor representation is only one of the possible hypotheses.

We have proceeded to address this for the Cheng and Frank data (page 14, ‘Please note, however, that other mechanisms besides the Successor Representation could account for these results, including model-based reinforcement learning.’), and in the final section of the shock experiment (page 16, ‘It is important to note here that, while we are suggesting a potential role for the SR in solving this task, the data itself would also be compatible with a model-based strategy. In fact, experimental evidence suggests that humans may use a mixed strategy involving both model-based reinforcement learning and the successor representation [Momennejad et al., 2017].’).

Page 5 – "similar to (Mehta et al. 2000)" – Learning in the CA3-CA1 network has been modeled like this in many previous models that should be cited here including McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Treves and Rolls, 1994; Mehta et al. 2000; Hasselmo, Bodelon and Wyble, 2002.

Thank you, we have added the citations.

Page 6 – Figure 1d looks like the net outcome of the learning rule in this example is long-term depression. Is that intended? Given the time interval between pre and post, it looks like it ought to be potentiation in the example.

This depends on the pre-post spike timing: in this example, the postsynaptic spike is too far from the presynaptic spike, leading to depression. If the postsynaptic spike were moved closer to the presynaptic spike, potentiation would occur. The plasticity rule qualitatively results in three regions depending on the spike timing: depression (post-pre), potentiation (pre-post with a small interval), and depression (pre-post with a large interval). Qualitatively, this is in line with, e.g., Shouval et al. 2002.
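To make the shape of such a rule concrete, below is a minimal sketch of a triphasic spike-timing window with the three regions listed above; the amplitudes and time constants are arbitrary illustrative numbers, not the parameters used in the paper.

```python
import numpy as np

def stdp_window(dt, a_plus=1.0, a_minus=0.4, tau_plus=20.0, tau_minus=80.0):
    """Illustrative weight change for a spike-time lag dt = t_post - t_pre (ms).

    Qualitative regions:
      dt < 0        -> depression (post before pre)
      small dt > 0  -> potentiation (pre shortly before post)
      large dt > 0  -> depression (post long after pre)
    """
    if dt <= 0:
        return -a_minus
    # narrow potentiation window minus a broader depression term
    return a_plus * np.exp(-dt / tau_plus) - a_minus * np.exp(-dt / tau_minus)

for dt in [-20, 5, 20, 60, 150]:
    print(f"dt = {dt:+4d} ms -> dw = {stdp_window(dt):+.3f}")
```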

Page 7 – They should address the problem of previously existing weights in the CA3 to CA1 connections. For example, what if there are pre-existing weights that are strong enough to cause post-synaptic spiking in CA1 independent of any entorhinal input? How do they avoid the strengthening of such connections? (i.e. the problem of prior weights driving undesired learning on CA3-CA1 synapses is addressed in Hasselmo and Schnell, 1994; Hasselmo, Bodelon and Wyble, 2002, which should be cited).

This is an important point. Since our model guarantees a stable fixed point of the learning dynamics, the initial conditions do not influence the final convergence. To show this explicitly, we have performed a new simulation and added Figure 1—figure supplement S2.

This is reflected in the manuscript:

“As a proof of principle, we show that it is possible to learn the SR for any initial weights (Figure 1—figure supplement S2), independently of any previous learning in the CA3 to CA1 connections.”

Page 7 – "Elegantly combines rate and temporal" – This is overstated. The possible temporal codes in the hippocampus include many possible representations beyond just one step prediction. They need to specify that this combines one type of possible temporal code. I also recommend removing the term "elegant" from the paper. Let someone else call your work elegant.

We thank the reviewer for the feedback. We have modified the manuscript to reflect the comments:

“As we will discuss below, this framework, therefore, combines learning based on rate coding as well as temporal coding.”

Page 7 – "replays for learning" – as noted above in experiments LTP has not been shown to be induced during the time periods of replay – sharp-wave ripple replay events seem to be associated with lower cholinergic tone (Buzsaki et al. 1983; VandeCasteele et al. 2014) whereas LTP is stronger when Ach levels are higher (Patil et al. J. Neurophysiol. 1998). This is not an all-or-none difference, but it should be addressed.

Thank you for this comment. We have added the following sentence in the "Replay" paragraph of the discussion.

“Moreover, while indirect evidence supports the idea that replays can play a role during learning Igata2020, it is not yet clear how synaptic plasticity is manifested during replays Fuchsberger2022.”

Page 7 – "equivalent to TD…". Should this say "equivalent to TD(0)"?

We changed the phrasing in the manuscript to: “equivalent to TD(λ)”.

Page 8 – "Bootstrapping means that a function is updated using current estimates of the same function…". This is a confusing and vague description of bootstrapping. They should try to give a clearer definition for neuroscientists (or reduce their reference to this).

We have now changed the explanation in the manuscript in the corresponding section.

From a reinforcement learning perspective, the TD(0) algorithm relies on a property called bootstrapping. This means that the successor representation is learned by first taking an initial estimate of the SR matrix (i.e. the previously learned weights), and then gradually adjusting this estimate (i.e. the synaptic weights) by comparing it to the states in the environment the animal actually visits. This comparison is achieved by calculating a prediction error, similar to the widely studied one for dopamine neurons Schultz1997. Since the synaptic connections carry information about the expected trajectories, in this case, the prediction error is computed between the predicted and observed trajectories (see Methods).

The main point of bootstrapping, therefore, is that learning happens by adjusting our current predictions (e.g. synaptic weights) to match the observed current state. This information is available at each time step and thus it allows learning over long timescales using synaptic plasticity alone. If the animal moves to a state in the environment that the current weights deem unlikely, potentiation will prevail and the weight from the previous to the current state will increase. Otherwise, the opposite will happen. It is important to notice that the prediction error in our model is not encoded by a separate mechanism in the way that dopamine is thought to do for reward prediction Schultz1997. Instead, the prediction error is represented locally, at the level of the synapse, through the depression and potentiation terms of our STDP rule, and the current weight encodes the current estimate of the SR (see Methods). Notably, the prediction error updates result in a total update equivalent to the TD(λ) update. This mathematical equivalence ensures that the weights of our neural network track the TD(λ) update at each state, and thus stability and convergence to the theoretical values of the SR. We therefore do not need an external vector to carry prediction error signals as proposed in Gardner2018, Gershman2018. In fact, the synaptic potentiation in our model updates a row of the SR, while the synaptic depression updates a column.
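For readers less familiar with the algorithmic side, the sketch below learns the SR of a simple random walk with TD(λ) updates and eligibility traces; it illustrates the bootstrapping and prediction-error logic described above in conventional (non-spiking) form, with arbitrary parameters, and is not the spiking implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, lam, alpha = 5, 0.9, 0.5, 0.05

M = np.zeros((n_states, n_states))            # current SR estimate ("the weights")

for episode in range(3000):
    s = rng.integers(n_states)
    trace = np.zeros(n_states)                # eligibility over recently visited states
    for _ in range(30):
        s_next = (s + rng.choice([-1, 1])) % n_states   # random walk on a ring
        trace = gamma * lam * trace
        trace[s] += 1.0
        # prediction error: observed occupancy plus bootstrapped prediction
        # from the next state, minus the current estimate for this state
        delta = np.eye(n_states)[s] + gamma * M[s_next] - M[s]
        M += alpha * np.outer(trace, delta)
        s = s_next

# Compare with the theoretical SR of the random walk, (I - gamma * P)^-1.
P = np.zeros((n_states, n_states))
for i in range(n_states):
    P[i, (i - 1) % n_states] = P[i, (i + 1) % n_states] = 0.5
M_true = np.linalg.inv(np.eye(n_states) - gamma * P)
print("max abs error:", np.abs(M - M_true).max())
```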

Page 9 – Figure 2 – Do TD λ and TD zero really give equivalent weight outputs?

The theoretical value of the SR is the same, independently of the algorithm used to learn it. TD(λ), for any value of λ, converges to this theoretical value. No tuning is needed for this; convergence of the algorithm is mathematically guaranteed. Thus, the results are exact also for λ=0.

Page 8 – "that are behaviorally far apart" – I don't understand how this occurs.

See explanation above.

page 10 – "dependency of synaptic weights on each other as discussed above." This was not made sufficiently clear either here or above.

See explanation above.

Page 10 – "dependency of synaptic weights on each other" This also suggests a problem of stability if the weights can start to drive their own learning and cause instability – how is this prevented?

See explanation above. The dynamics have a fixed point of convergence (the SR). No matter what the initial weights are, the learning is guaranteed to converge to the SR. We illustrate this in the new Supplementary Figure 7.

Page 10 – "average of the discounted state occupancies" – this would be uniform without discounting but what is the biological mechanism for the discounting that is used here?

This is an important point. No extra mechanism is needed to induce the discount in our framework. Instead, the exponentially decaying LTP window is sufficient to discount states that are further away.

Page 10 – "due to the bootstrapping" – again this is unclear – can be improved by giving a better definition of bootstrapping and possibly by referring to specific equation numbers.

We have now changed the explanation in the manuscript in the corresponding section, see comment above.

Page 11 – "exponential dependence" what is the neural mechanism for this?

In our case, it is the LTP window which exponentially decays.

Page 11 – "Ainsley" is not a real citation in the bibliography. Should fix and also provide a clearer definition (or equation) for hyperbolic.

Page 11 – "elegantly combines two types of discounting" – how is useful? Also, let other people call your work elegant.

Very true! We have now rephrased our statement. Hyperbolic discounting has been observed in neuroeconomics experiments, whereas reinforcement learning algorithms normally rely on exponential discounting for its favourable mathematical properties. This discrepancy has, however, been considered confusing when applying reinforcement learning to behaviour. Our model uses exponential discounting at the neural level and in space, but can account for the findings on hyperbolic discounting in time.
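For concreteness, the two forms of discounting at issue can be written in their standard textbook form, with $\tau$ and $k$ free parameters: exponential discounting decays by a constant factor per unit time, whereas hyperbolic discounting falls off more slowly at long delays.

$$
D_{\mathrm{exp}}(t) = \gamma^{t} = e^{-t/\tau},
\qquad
D_{\mathrm{hyp}}(t) = \frac{1}{1 + k t}.
$$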

Page 11 – how does discounting depend on both firing rate and STDP -- should provide some explanation or at least refer to where this is shown in the equations.

We have now added a reference to the equation.

page 13 – "Cheng and Frank" – this is a good citation, but they could add more here on timing of replay events.

We added the following sentence regarding the replay timing:

“Please note that, while we implemented an exponentially decaying probability for replays after entering a novel environment, different schemes for replay activity could be investigated.”

Page 15 – This whole section on the shock experiment starts with the assumption of a successor representation. As noted above, they need to explicitly discuss the important alternate hypothesis that the neural activity reflects model-based planning that guides the behavior in the task (and could perhaps better account for the peak of occupancy at the border of light and dark).

We have changed the text to reflect this suggestion.

Page 16 – "mental imagination" – rather than using it for modifying SR, why couldn't mental imagination just be used for model-based behavior?

We believe it is indeed possible that mental imagination can be used for model-based behaviour. However, it is possible that humans use a mixed strategy involving both model-based learning and the successor representation, as suggested by the findings in Momennejad et al. 2017.

Page 17 – "spiking" – again, if they are going to refer to their model as a "spiking" model, they need to add some plots showing spiking activity.

One example of the spiking activity of our model can be found in Figure 2 panel A.

Reviewer #3 (Recommendations for the authors):

We have responded to Reviewer 3's concerns at the beginning of the rebuttal, in the "Essential Revision" section.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #1 (Recommendations for the authors):

The revised article has only partly resolved my confusion.

My main issue was the following: in the proposed feed-forward model, the synaptic weights between the two layers learn the entries of the successor matrix. If external inputs were fed only to the first layer, the second layer would directly read out the successor representation (this is suggested in Figure 1 E-F, but not explicitly mentioned in the text as far as I can tell). Instead, in the model, both layers are driven by external inputs representing the current state. This is crucial for learning, but it implies that the activity of the second layer mixes the successor representation with a representation of the current state itself. Learning and readout, therefore, seem antagonistic. It would be worth explaining this fact in the main text.

We thank the reviewer for the question, and we have now added an extra explanation in the main text. In brief, the CA1 layer indeed mixes the SR with a representation of the current state. However, we argue that learning and readout are not necessarily antagonistic. We suggest here a few possibilities to resolve this conflict:

1. Since the external current in CA1 is present for only a fraction of the time T in each state, the readout might happen during the period of CA3 activation exclusively.

2. The readout may be over the whole time T but becomes noisier towards the end. It is worth noting that, even in the case where the readout is noisy, the distortion would be limited to the diagonal elements of the matrix (see equations 13 and 15 in the Methods section, and Figure 5—figure supplement 2, panels B and C).

3. Learning and readout may be separate mechanisms, where during readout only the CA3 driving current is present. This could, for instance, be signalled by neuromodulation (e.g. acetylcholine has been associated with attention and arousal during spatial learning [1,2,3,4,5]), or readout could happen during replays with a population of neurons.

4. The weights to or activation functions of the readout neuron may learn to compensate for the distorted signal in CA1.

We have now included this explanation in the main text in the Discussion, in the Reinforcement Learning subchapter.
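To illustrate the mixing between the SR readout and the current-state input discussed above, here is a small toy sketch (the successor matrix and the unit-strength external input are hypothetical stand-ins; the point is that the extra contribution only affects the diagonal entry, consistent with point 2 above):

```python
import numpy as np

gamma, n = 0.9, 4

# Toy SR for a 4-state track with rightward transitions (illustrative only).
P = np.diag(np.ones(n - 1), k=1)
M = np.linalg.inv(np.eye(n) - gamma * P)

s = 1                                  # current state
external = np.eye(n)[s]                # place-tuned external input to CA1

# CA3-driven CA1 activity alone reads out the SR row for the current state;
# while the external input is also on, only the diagonal entry M[s, s] is
# inflated, which is the distortion referred to in point 2 above.
print("CA3-driven readout:    ", np.round(M[s], 2))
print("with external input on:", np.round(M[s] + external, 2))
```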

In their reply, the authors clarify that the external inputs drive the activity of the second layer only for a limited time (20%). As far as I can tell, in the text, this is mentioned explicitly only in the legend of Figure 5-S2. That seems to imply that there is a large difference in the duration of the external inputs to the two layers. How can that be justified?

More importantly, it seems that varying the value of the delay should lead to a tradeoff between the accuracy of learning and the accuracy of the subsequent readout in the second layer. Is that the case? It would be useful to have a figure where the delay is varied.

As suggested, we added two figures (Figure 5—figure supplement 3, Figure 5—figure supplement 4) where we varied this delay. In our model, the external inputs are first active in CA3 and afterwards in CA1 (equations 10 and 11). We model these inputs as step currents. We call the overall time in each state T, and say that the external current for CA3 has duration θ, while the external current for CA1 has duration ω. If we assume that there is no pause between these two stimuli, we have ω = T – θ. We can therefore define the delay between the two stimuli as θ/T.
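Purely as an illustration of this timing convention (all numerical values below are arbitrary):

```python
# Illustrative timing of the place-tuned step currents within one state visit.
T = 100.0              # total time spent in the state (arbitrary units)
theta = 80.0           # duration of the CA3 external current
omega = T - theta      # duration of the CA1 external current (no pause in between)
delay = theta / T      # the "delay" parameter discussed in the text

def ca3_input_on(t):
    # CA3 place-tuned current is on during the first part of the visit
    return 0.0 <= t < theta

def ca1_input_on(t):
    # CA1 place-tuned current follows immediately afterwards
    return theta <= t < T

print(omega, delay)    # 20.0 and 0.8
```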

As the reviewer points out, the delay value θ/T that we chose for Figure 5 – Supplement 2 was 0.8. However, this was an arbitrary choice and other values are possible. We explore here what happens when we vary the delay parameter θ/T:

– Impact on the reinforcement learning representation

The delay parameter θ/T influences the reinforcement learning representation, as can be seen from equations 22 and 23. When the delay is longer, the value of γ increases and λ decreases (Figure 5—figure supplement 3, panels b-c). However, it is important to note that γ and λ are determined by other biological quantities as well (see equations 22 and 23), such as the depression amplitude (Figure 5—figure supplement 3, panels e-f) or the decay of the potentiation learning window. The value we choose for the delay is therefore flexible: even if we want to learn the SR with a certain fixed γ or λ, we can always vary other parameters in the spiking model.

– Impact on learning and readout

The strength of the place-tuned input to CA1 also depends on the choice of the delay (see equation 18). Typically, the CA1 place-tuned input strength increases with the delay, except for long values of T or low depression amplitudes, where the input remains fairly constant (Figure 5—figure supplement 3, panels a and d, respectively). Therefore, the longer the CA1 place-tuned input lasts, the weaker it has to be to ensure proper learning. Intuitively, this means that we can compensate for a shorter duration of the external current by increasing its strength. The distortion of the readout thus remains more or less constant, independently of our choice of delay (Figure 5—figure supplement 4). We therefore do not find a trade-off between accuracy of learning and accuracy of readout, and the distortion in the value could be corrected, for example, by one of the mechanisms proposed above.

We addressed the above points in the Discussion, in the Reinforcement Learning subchapter.

References:

1. Micheau, J. and Marighetto, A. Acetylcholine and memory: a long, complex and chaotic but still living relationship. Behav. Brain Res. 221, 424–429 (2011).

2. Hasselmo, M. E. and Sarter, M. Modes and models of forebrain cholinergic neuromodulation of cognition. Neuropsychopharmacology 36, 52–73 (2011).

3. Robbins, T. W. Arousal systems and attentional processes. Biol. Psychol. 45, 57–71 (1997).

4. Teles-Grilo Ruivo, L. M. and Mellor, J. R. Cholinergic modulation of hippocampal network function. Front. Synaptic Neurosci. 5, 2 (2013).

5. Palacios-Filardo, J. et al. Acetylcholine prioritises direct synaptic inputs from entorhinal cortex to CA1 by differential modulation of feedforward inhibitory circuits. Nat. Commun. 12, 1–16 (2021).

https://doi.org/10.7554/eLife.80671.sa2

Article and author information

Author details

  1. Jacopo Bono

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft
    Contributed equally with
    Sara Zannone
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-9552-3151
  2. Sara Zannone

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft
    Contributed equally with
    Jacopo Bono
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-9526-7001
  3. Victor Pedrosa

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Visualization, Writing - original draft
    Competing interests
    No competing interests declared
  4. Claudia Clopath

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Project administration, Writing - review and editing
    For correspondence
    c.clopath@imperial.ac.uk
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-4507-8648

Funding

Wellcome Trust (200790/Z/16/Z)

  • Claudia Clopath

Engineering and Physical Sciences Research Council (EP/R035806/1)

  • Claudia Clopath

Simons Foundation (564408)

  • Claudia Clopath

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Senior Editor

  1. Laura L Colgin, University of Texas at Austin, United States

Reviewing Editor

  1. Lisa M Giocomo, Stanford School of Medicine, United States

Reviewer

  1. Michael E Hasselmo, Boston University, United States

Version history

  1. Preprint posted: August 17, 2021 (view preprint)
  2. Received: May 30, 2022
  3. Accepted: January 12, 2023
  4. Version of Record published: March 16, 2023 (version 1)

Copyright

© 2023, Bono, Zannone et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Jacopo Bono, Sara Zannone, Victor Pedrosa, Claudia Clopath (2023) Learning predictive cognitive maps with spiking neurons during behavior and replays. eLife 12:e80671. https://doi.org/10.7554/eLife.80671
