Learning predictive cognitive maps with spiking neurons during behavior and replays

  1. Jacopo Bono
  2. Sara Zannone
  3. Victor Pedrosa
  4. Claudia Clopath  Is a corresponding author
  1. Department of Bioengineering, Imperial College London, United Kingdom

Abstract

The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, mathematically and numerically showing that it implements the TD-lambda algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, smoothly moving from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, in both aiding learning in novel environments and finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.

Editor's evaluation

This is an important article that leverages a spiking network model of the hippocampal circuit to show how spike-time-dependent plasticity can implement predictive reinforcement learning and form a predictive map of the environment. The authors provide a convincing and solid framework for understanding the prediction based learning rules that may be employed by the hippocampus to optimize an animal's behavior. This paper will be of interest to theoretical and experimental neuroscientists working on learning and memory as it provides new ways to connect computational models to experimental data that has yet to be fully explored from a reinforcement learning perspective.

https://doi.org/10.7554/eLife.80671.sa0

Introduction

In the mid-twentieth century, Tolman proposed the concept of cognitive maps (Tolman, 1948). These maps are abstract mental models of an environment which are helpful when learning tasks and in decision making. Since the discovery of hippocampal place cells, cells that are activated only in specific locations of an environment, it is believed that the hippocampus can provide the substrate to encode such cognitive maps (O’Keefe and Dostrovsky, 1971; O’Keefe and Nadel, 1978). More evidence of the role of the hippocampus in behavior was found in numerous experimental studies, such as the seminal water maze experiments (Morris, 1981; Morris et al., 1982), radial arm maze experiments (Olton and Papas, 1979) as well as evidence of broader information processing beyond just cognitive maps (Wood et al., 1999; Eichenbaum et al., 1999; Aggleton and Brown, 1999; Wood et al., 2000).

While these place cells offer striking evidence in favour of cognitive maps, it is not clear what representation is actually learned by the hippocampus and how this information is exploited when solving and learning tasks. Recently, it was proposed that the hippocampus computes a cognitive map containing predictive information, called the successor representation (SR). Theoretically, this SR framework has some computational advantages, such as efficient learning, simple computation of the values of states, fast relearning when the rewards change and flexible decision making (Dayan, 1993; Stachenfeld et al., 2014; Stachenfeld et al., 2017; Russek et al., 2017; Momennejad et al., 2017). Furthermore, the SR is in agreement with experimental observations. Firstly, the firing fields of hippocampal place cells are affected by the strategy used by the animal to navigate the environment (known as the policy in machine learning), as well as by changes in the environment (Mehta et al., 2000; Stachenfeld et al., 2017). Secondly, reward revaluation — the ability to recompute the values of the states when rewards change — would be more effective than transition revaluation (Russek et al., 2017; Momennejad et al., 2017).

In this work, we study how this predictive representation can be learned in the hippocampus with spike-timing dependent synaptic plasticity (STDP). Using STDP at the mechanistic level, we show that the learning is equivalent to TD(λ) on an algorithmic level. The latter is a well-studied and powerful algorithm known from reinforcement learning (Sutton and Barto, 1998), which we will discuss in more detail below.

Our model can thus learn over a behavioral timescale while using STDP timescales in the millisecond range. We show mathematically that our proposed framework smoothly connects a temporally precise spiking code akin to replay activity with a rate-based code akin to behavioral spiking. Subsequently, we show that the delay-discounting parameter γ allows us to consider time as a continuous variable; therefore, we do not need to discretize time as is usual in reinforcement learning (Doya, 1995; Doya, 2000). Moreover, the delay-discounting in our model depends hyperbolically on time but exponentially on state transitions. We show how the γ parameter can be modulated by neuronal firing rates and neuromodulation, allowing state-dependent discounting and in turn enabling richer information in the SR, such as the encoding of salient states, landmarks, reward locations, etc. Finally, replays have long been speculated to be involved in learning models of the environment, supported by experiments (Johnson and Redish, 2007; Pfeiffer and Foster, 2013; Kay et al., 2020) and models (Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012; Kubie and Fenton, 2012). Here, we investigate how replays could play an additional role in learning the SR cognitive map. Following properties of TD(λ), we show how we can achieve both low bias and low variance by using replays, translating to both quicker initial learning and convergence to lower error. We show how we can use replays to learn offline. In this way, policies can be refined without the need for actual exploration.

Our framework allows us to make predictions about the roles of behavioral learning and replay-like activity and how they can be exploited in representation learning. Furthermore, we uncover a relation between STDP and a higher level learning algorithm. Our work therefore spans the three levels of analysis proposed by Marr, 2010. On the implementational level, our model consists of a feedforward network of excitatory neurons with biologically plausible spike-timing dependent plasticity. On the algorithmic level, we show that our model learns the successor representation using the TD(λ) algorithm. On the computational theory level, our model tackles representation learning using cognitive maps.

Results

Cognitive maps are internal models of an environment which help animals to learn, plan and make decisions during task completion. The hippocampus has long been thought to provide the substrate for learning such cognitive maps (O’Keefe and Dostrovsky, 1971; O’Keefe and Nadel, 1978; Morris, 1981; Morris et al., 1982; Wood et al., 1999; Eichenbaum et al., 1999), and recent evidence points towards a specific type of representation learned by the hippocampus, the successor representation (SR) (Stachenfeld et al., 2017).

The successor representation

In this section, we will give an overview of the successor representation and its properties, especially geared toward neuroscientists. Readers already familiar with this representation may safely move to the next section.

To understand the concept of successor representation (SR), we can consider a spatial environment — such as a maze — while an animal explores this environment. In this setting, the SR can be understood as how likely it is for the animal to visit a future location starting from its current position. We further assume the maze to be formed out of a discrete number of states. Then, the SR can be more formally described by a matrix with dimension (Nstates×Nstates), where Nstates denotes the number of states in the environment and each entry Mij of this matrix describes the expected future occupancy of a state Sj when the current state is Si. In other words, starting from Si, the more likely it is for the animal to reach the location associated with state Sj and the nearer in the future, the higher the value of Mij.

As a first example, we consider an animal running through a linear track. We assume the animal runs at a constant speed and always travels in the same direction — left to right (Figure 1a). We also split the track into four sections or states, S1 to S4, and the SR will be represented by a matrix with dimension (4×4). Since the animal always runs from left to right, there is zero probability of finding the animal at position i if its current position is greater than i. Therefore, the lower triangle of the successor matrix is equal to zero (Figure 1b). Conversely, if the animal is currently at position S1, it will be subsequently found at positions S2, S3, and S4 with probability 1. The further away from S1, the longer it will take the animal to reach that other position. In terms of the successor matrix, we apply a discounting factor γ (0 < γ ≤ 1) for each extra ‘step’ required by the animal to reach a respective location (Figure 1b).
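As an illustration of this construction, the successor matrix of the linear track can be written down directly from the one-step transition matrix. The following sketch is our own illustration (with an assumed discount of γ = 0.9) and simply reproduces the structure shown in Figure 1b:

```python
import numpy as np

gamma = 0.9          # assumed discount factor for illustration
n_states = 4

# One-step transition matrix for the deterministic left-to-right policy;
# the epoch ends after S4, so its row is all zeros (no further transitions).
P = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    P[i, i + 1] = 1.0

# Successor matrix: expected discounted future occupancy,
# M = sum_t gamma^t P^t (the sum is finite because the epoch terminates).
M = sum(np.linalg.matrix_power(gamma * P, t) for t in range(n_states))
print(np.round(M, 3))
# Upper-triangular matrix with M[i, j] = gamma**(j - i) for j >= i, as in Figure 1b.
```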

Figure 1 with 2 supplements
Successor representation and neuronal network.

(A) Our simple example environment consists of a linear track with 4 states (S1 to S4) and the animal always moves from left to right — i.e. one epoch consists of starting in S1 and ending in S4. (B) The successor matrix corresponding to the task described in panel A. (C) Our neuronal network consists of two layers with all-to-all feedforward connections. The presynaptic layer mimics hippocampal CA3 and the postsynaptic layer mimics CA1. (D) The synaptic plasticity rule consists of a depression term and a potentiation term. The depression term is dependent on the synaptic weight and presynaptic spikes (blue). The potentiation term depends on the timing between a pre- and post-synaptic spike pair (red), following an exponentially decaying plasticity window (bottom). (E–F) Schematics illustrating some of the results of our model. (E) Our spiking model learns the top row of the successor representation (panel B) in the weights between the first CA3 place cell and the CA1 cells. (F) Our spiking model learns the third row of successor representation (panel B) in the weights between the third CA3 place cell and the CA1 cells.

Even though we introduced the linear track as an illustrative example, the SR can be learned in any environment (see Figure 1—figure supplement 1 for an example in an open field). Note that the representation learned by the SR is dependent not only on the structure of the environment, but also on the policy — or strategy — used by the animal to explore the environment. This is because the successor representation is not purely concerned with the physical distance between two areas in the environment, but rather it measures how long it usually takes to reach one place when starting from the other. In this first example, the animal applied a deterministic policy (always running from left to right), but the SR can also be learned for stochastic policies. Furthermore, the SR is a multi-step representation, in the sense that it stores predictive information of multiple steps ahead.

Because of this predictive information, the SR allows sample-efficient re-learning when the reward location is changed (Gershman, 2018). In reinforcement learning, we tend to distinguish between model-free and model-based algorithms. The SR is believed to sit in-between these two modalities. In model-free reinforcement learning, the aim is to directly learn the value of each state in the environment. Since there is no model of the environment at all, if the location of a reward is changed, the agent will have to first unlearn the previous reward location by visiting it enough times, and only then it will be able to re-learn the new location. In model-based reinforcement learning, a precise model of the environment is learned, specifically, single-step transition probabilities between all states of the environment. Model-based learning is computationally expensive, but allows a certain flexibility. If the reward changes location, the updated values of the states can be derived immediately. As we have seen, however, the SR can re-learn a new reward location somewhat efficiently, although less so than model-based learning. The SR can also be efficiently learned using model-free methods and allows us to easily compute values for each state, which in turn can guide the policy (Dayan, 1993; Russek et al., 2017; Momennejad et al., 2017). This position between model-based and model-free methods makes the SR framework very powerful, and its similarities with hippocampal neuronal dynamics have led to increased attention from the neuroscience community. Finally, in our examples above we considered an environment made up of a discrete number of states. This framework can be generalised to a continuous environment represented by a discrete number of place cells.
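As a small worked example of this flexibility (our own sketch, using the closed-form successor matrix of the linear track and the same assumed γ = 0.9), the state values follow from a single inner product with the reward vector and can be recomputed immediately when the reward moves, without touching the SR itself:

```python
import numpy as np

gamma, n_states = 0.9, 4
# Successor matrix of the deterministic track from the previous sketch:
# M[i, j] = gamma**(j - i) for j >= i, and 0 otherwise.
M = np.array([[gamma ** (j - i) if j >= i else 0.0 for j in range(n_states)]
              for i in range(n_states)])

# State values are the inner product of the SR with the reward vector (Equation 3).
R_old = np.array([0.0, 0.0, 0.0, 1.0])    # reward at S4 (assumed)
R_new = np.array([0.0, 0.0, 1.0, 0.0])    # reward moved to S3
print("V with reward at S4:", M @ R_old)
print("V with reward at S3:", M @ R_new)  # recomputed instantly, M is unchanged
```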

Learning the successor representation in biologically plausible networks

We propose a model of the hippocampus that is able to learn the successor representation. We consider a feedforward network comprising two layers. Similar to McNaughton and Morris, 1987; Hasselmo and Schnell, 1994; Mehta et al., 2000; Hasselmo et al., 2002, we assume that the presynaptic layer represents the hippocampal CA3 region and is all-to-all connected to a postsynaptic layer, representing the CA1 network (Figure 1c). The synaptic connections from CA3 to CA1 are plastic such that the weight changes follow a spike-timing-dependent plasticity (STDP) rule consisting of two terms: a weight-dependent depression term for presynaptic spikes and a potentiation term for pre-post spike pairs (Figure 1d).

For simplicity, we assume that the animal spends a fixed time T in each state. During this time, a constant activation current is delivered to the CA3 neuron encoding the current location and, after a delay, to the corresponding CA1 place cell (see Materials and methods). On top of these fixed and location-dependent activations, the CA3 neurons can activate neurons in CA1 through the synaptic connections. In other words, the CA3 neurons are activated according to the current location of the animal, while the CA1 neurons have a similar location-dependent activity combined with activity caused by presynaptic neurons. The constant currents delivered directly to CA3 and CA1 neurons can be thought of as location-dependent currents from entorhinal cortex. These activations subsequently trigger plasticity at the synapses, and we can show analytically that, using the spike-timing dependent plasticity rule discussed above, the SR is learned in the synaptic weights (Figure 1e and f, and see Appendix).

Moreover, we find that, on an algorithmic level, our weight updates are equivalent to a learning algorithm known as TD(λ), a powerful and well-known algorithm in reinforcement learning that can be used to learn the successor representation. TD(λ) is based on a mixed methodology, which is regulated by the parameter λ. At one extreme, when λ=1, the SR is estimated by taking the average of state occupancies over past trajectories. This type of algorithm is called TD(1) or Monte Carlo (MC). At the other extreme, when λ=0, the estimate of the SR is adjusted ‘online’, with every step of the trajectory, by comparing the observed position with its predicted value. This algorithm is equivalent to TD(0). For all values of λ in between, the algorithm employs a mixture of both methodologies. The extreme cases of TD(1) and TD(0) have different strengths and weaknesses, as we will discuss in more detail in the next sections.
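For readers who prefer code, the following is a minimal tabular sketch of the TD(λ) algorithm for the SR (backward view with accumulating eligibility traces). It is only the algorithmic reference point, not the spiking implementation; the function name, parameter values, and the convention that rows of M index the current state are our own choices:

```python
import numpy as np

def td_lambda_sr(trajectories, n_states, gamma=0.9, lam=0.5, eta=0.1):
    """Tabular TD(lambda) estimate of the successor matrix M (rows = current state).

    After each transition s -> s_next, the TD error for every column i is
    delta_i = 1(s == i) + gamma * M[s_next, i] - M[s, i],
    and it is distributed over previously visited states via an eligibility trace.
    """
    M = np.zeros((n_states, n_states))
    for traj in trajectories:
        E = np.zeros(n_states)                       # eligibility of each (current) state
        for s, s_next in zip(traj[:-1], traj[1:]):
            E *= gamma * lam
            E[s] += 1.0
            onehot = np.eye(n_states)[s]
            M_next = M[s_next] if s_next is not None else np.zeros(n_states)
            delta = onehot + gamma * M_next - M[s]   # vector of TD errors, one per column
            M += eta * np.outer(E, delta)
    return M

# Example: deterministic left-to-right track, terminating after S4 (index 3).
trajs = [[0, 1, 2, 3, None]] * 50
print(np.round(td_lambda_sr(trajs, n_states=4), 2))
```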

In practice, we prove analytically the mathematical equivalence between the dynamics of our spiking neural network and the TD(λ) algorithm (see Appendix). Our calculations essentially prove that, at each step, our neural network tracks the reinforcement learning algorithm, which is known to converge to the theoretical values of the SR. This equivalence guarantees that our neural network weights will eventually converge to the correct SR matrix. As a proof of principle, we show that it is possible to learn the SR for any initial weights (Figure 1—figure supplement 2), independently of any previous learning in the CA3 to CA1 connections.

Importantly, from our analytical derivations (see Appendix), we find that the λ parameter depends on the behavioral parameter T (the time an animal spends in a state). We find that the larger the time T, the smaller the value of λ, and vice versa. In other words, when the animal moves through the trajectory on behavioral time-scales (large T compared to the synaptic plasticity time-scales τLTP), the network is learning the SR with TD(λ → 0). For quick sequential activities (T → 0), akin to hippocampal replays, the network is learning the SR with TD(λ → 1). As we will discuss below, this framework therefore combines learning based on rate coding as well as temporal coding. Furthermore, our model predicts that replays can also be used for learning purposes and that they are algorithmically equivalent to MC, whereas during behavior, the hippocampal learning algorithm is equivalent to TD(λ). This strategy of using replays to learn is in line with recent experimental and theoretical observations (see Momennejad, 2020 for a review).

To validate our analytical results, we use again a linear track with a deterministic policy. Using our spiking model with either rate-code activity on behavioral time-scales (Figure 2a top) or temporal-code activity similar to replays (Figure 2b top), we show that the synaptic weights across trials match the evolution of the TD(λ) algorithm closely (Figure 2a and b middle). While convergence to the SR is guaranteed (Figure 2a and b bottom) due to the mathematical equivalence between our setup and TD(λ) (Figure 2—figure supplement 1), the learning trajectory has more variance in the neural network case due to the noise introduced by the randomness of the spike times. This noise can be mitigated by averaging over a population of neurons. Moreover, due to the equivalence with TD(λ), our setup is general for any type of task where discrete states are visited, in any dimension, and which may not need to be a navigation task (see e.g. Figure 1—figure supplement 1 for a 2D environment).

Figure 2 with 1 supplement
Comparison between TD(λ) and our spiking model.

(A-top) Learning during behavior corresponds to TD(λ → 0). States are traversed on timescales larger than the plasticity timescales and place cells use a rate-code. (A-middle) Comparison of the learning over epochs for two synaptic connections (full line denotes the mean over ten random seeds, shaded area denotes one standard deviation) with the theoretical learning curve of TD(λ) (dotted line). (A-bottom) Final successor matrix learned by the spiking model (left) and the theoretical TD(λ) algorithm (right). Star and diamond symbols denote the corresponding weights shown in the middle row. (B-top) Learning during replays corresponds to TD(λ → 1). States are traversed on timescales similar to the plasticity timescales and place cells use a temporally precise code. (B-middle and bottom) Analogous to panel A middle and bottom.

In summary, we showed how the network can learn the SR using a spiking neural model. We analytically showed how the learning algorithm is equivalent to TD(λ), and confirmed this using numerical simulations. We derived a relationship between the abstract parameter λ and the timescale T representing the animal’s behavior — and in turn the neuronal spiking — allowing us to unify rate and temporal coding within one framework. Furthermore, we predict a role for hippocampal replays in learning the SR using an algorithm equivalent to Monte Carlo.

Learning over behavioral time-scales using STDP

An important observation in our framework is that the SR can be learned using the same underlying STDP rule over time-scales ranging from replays up to behavior. One can now wonder how it is possible to learn relationships between events that are seconds apart during awake behavior, without any explicit error encoding signal typically used by the TD algorithm, and while the STDP rule is characterised by millisecond time-scales (Figure 3a).

Learning on behavioral timescales and state-dependent discounting.

(A) In our model, the network can learn relationships between neurons that are active seconds apart, while the plasticity rule acts on a millisecond timescale. (B) Due to transitions between subsequent states, each synaptic weight update depends on the weight from the subsequent CA3 neuron to the same CA1 neuron. In other words, the change of a synaptic weight depends on the weight below it in the successor matrix. The top panel visualizes how weights depend on others in our linear track example, where each lighter color depends on the darker neighbor. The bottom panel shows the learning of the weights over 50 epochs. Notice the lighter traces converge more slowly, due to their dependence on the darker traces. (C) Place fields of the place cells in the linear track — each place cell corresponding to a column of the successor matrix. Activities of each place cell when the animal is in each of the four states (dots) are interpolated (lines). Three variations are considered: (i) the time spent in each state and the CA3 firing rates are constant (blue and panel D); (ii) the time spent in state 3 is doubled (orange and panel E); (iii) the CA3 firing rate in state 3 is doubled (green and panel F). Panels E and F lead to a modified discount parameter in state 3, affecting the receptive fields of place cells 3 and 4.

From a neuroscience perspective, this can be understood when considering the trajectory of the animal. Each time the animal moves from a position Sj-1 to a position Sj, the CA3 cell encoding the location Sj-1 stops firing and the CA3 cell encoding the location Sj starts firing. Since in our example this transition is instantaneous, these cells are activating the same CA1 cells consecutively. Therefore, the change in the weight wi,j-1 depends on the synaptic weight of the subsequent state wi,j (Figure 3b, yellow depends on orange, orange depends on red, etc). Indeed, in our example of an animal in a linear track subdivided into four locations, the weights on the diagonal, such as w4,4, are the first ones to be learned, since they are learned directly. The off-diagonal weights, such as w3,4, w2,4, and w1,4, are learned progressively more slowly as they depend on the subsequent synaptic weight. Eventually, weights between neurons encoding positions that are behaviorally far apart can be learnt using a learning rule on a synaptic timescale (Figure 3b).

From a reinforcement learning perspective, the TD(0) algorithm relies on a property called bootstrapping. This means that the successor representation is learned by first taking an initial estimate of the SR matrix (i.e. the previously learned weights), and then gradually adjusting this estimate (i.e. the synaptic weights) by comparing it to the states in the environment that the animal actually visits. This comparison is achieved by calculating a prediction error, similar to the widely studied one for dopamine neurons (Schultz et al., 1997). Since the synaptic connections carry information about the expected trajectories, in this case, the prediction error is computed between the predicted and observed trajectories (see Materials and methods).

The main point of bootstrapping, therefore, is that learning happens by adjusting our current predictions (e.g. synaptic weights) to match the observed current state. This information is available at each time step and thus allows learning over long timescales using synaptic plasticity alone. If the animal moves to a state in the environment that the current weights deem unlikely, potentiation will prevail and the weight from the previous to the current state will increase. Otherwise, the opposite will happen. It is important to notice that the prediction error in our model is not encoded by a separate mechanism in the way that dopamine is thought to do for reward prediction (Schultz et al., 1997). Instead, the prediction error is represented locally, at the level of the synapse, through the depression and potentiation terms of our STDP rule, and the current weight encodes the current estimate of the SR (see Materials and methods). Notably, the prediction error is equivalent to the TD(λ) update. This mathematical equivalence ensures that the weights of our neural network track the TD(λ) update at each state, and thus stability and convergence to the theoretical values of the SR. We therefore do not need an external vector to carry prediction error signals as proposed in Gardner et al., 2018; Gershman, 2018. In fact, the synaptic potentiation in our model updates a row of the SR, while the synaptic depression updates a column.

At the other extreme, for very fast timescales such as replays, TD(1) is equivalent to online Monte Carlo learning (MC), which does not bootstrap at all. Instead, MC samples the whole trajectory and then simply takes the average of the discounted state occupancies to update the SR (see Materials and methods). During replays, the whole trajectory falls under the plasticity window and the network can learn without bootstrapping. For all cases in between, the network partially relies on bootstrapping and we correspondingly find a λ between 0 and 1.
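The two extremes can be contrasted in a few lines of code. The sketch below is our own illustration with assumed parameter values: a bootstrapped TD(0)-style update after a single transition, next to a Monte Carlo update that targets the discounted state occupancies of a complete trajectory.

```python
import numpy as np

n_states, gamma, eta = 4, 0.9, 0.1

def td0_step(M, s, s_next):
    """Bootstrapped update after a single transition (behavioral regime, lambda near 0)."""
    target = np.eye(n_states)[s]
    if s_next is not None:
        target += gamma * M[s_next]        # the current estimate fills in the unseen future
    M[s] += eta * (target - M[s])          # prediction error between predicted and observed
    return M

def mc_update(M, traj):
    """Monte Carlo update from a complete trajectory (replay regime, lambda near 1)."""
    for t, s in enumerate(traj):
        occ = sum((gamma ** (k - t)) * np.eye(n_states)[traj[k]]   # discounted occupancies
                  for k in range(t, len(traj)))
        M[s] += eta * (occ - M[s])         # no bootstrapping: the sampled return is the target
    return M

M = np.zeros((n_states, n_states))
M = td0_step(M, 0, 1)                      # one behavioral transition S1 -> S2
M = mc_update(M, [0, 1, 2, 3])             # one replayed trajectory S1 -> S4
```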

In summary, in our framework, synaptic plasticity leads to the development of a successor representation in which synaptic weights can be directly linked to the successor matrix. In this framework, we can learn over behavioral timescales even though our plasticity rule acts on the scale of milliseconds, due to the bootstrapping property of TD algorithms.

Different discounting for space and time

In reinforcement learning, it is usual to have delay-discounting: rewards that are further away in the future are discounted compared to rewards that are in the immediate future. Intuitively, it is indeed clear that a state leading to a quick reward can be regarded as more valuable compared to a state that only leads to an equal reward in the distant future. For tasks in a tabular setting, with a discrete state space and where actions are taken in discrete turns, such as for example chess or our simple linear track discussed in section ‘The Successor Representation’, one can simply use a multiplicative factor 0 < γ ≤ 1 for each state transition. In this case the discount follows an exponential dependence, where rewards that are n steps away are discounted by a factor of γ^n.

In order to still use the above exponential discount when time is continuous, the usual approach is to discretize time by choosing a unit of time. However, this would imply one can never remain in a state for a fraction of this unit of time, and it is not clear how this unit would be chosen. Our framework deals naturally with continuous time, through the monotonically decreasing dependence of the discount parameter γ on the time an agent remains in a state, T. The dependence on T can be interpreted as an increased discounting the longer a state lasts.

In this way, instead of discounting by γ^n when the agent stays n units of time in a certain state, we would discount by γ(nT). More generally, for any arbitrary time T, a discount corresponding to γ(T) will be applied. This allows the agent to act in continuous time (Figure 3c and e). Interestingly, the dependence of γ on T in our model is not exponential as in the tabular case. Instead, we have a hyperbolic dependence. This hyperbolic discount is well studied in psychology and neuroeconomics and appears to agree well with experimental results (Laibson, 1997; Ainslie, 2012).

The difference between a hyperbolic discount and an exponential discount lies in the fact that we will attribute a different value to the same temporal delay, depending on whether it happens sooner or later. A classic example is that, when given the choice, people tend to prefer 100 dollars today instead of 101 dollars tomorrow, while they tend to prefer 101 dollars in 31 days instead of 100 dollars in 30 days. They therefore judge the 1 day of delay differently when it happens later in time. Exponential discounting, on the other hand, always attributes the same value to the same delay no matter when it occurs.
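The dollar example can be made concrete numerically. In the sketch below (the discount parameters are illustrative assumptions, not values fitted to data), the exponential discount gives the same choice at both horizons, whereas the hyperbolic discount reverses the preference:

```python
def exponential(delay_days, gamma=0.98):
    return gamma ** delay_days            # same relative penalty for every extra day

def hyperbolic(delay_days, k=0.012):
    return 1.0 / (1.0 + k * delay_days)   # an extra day hurts less the later it occurs

for discount in (exponential, hyperbolic):
    now   = 100 * discount(0)  > 101 * discount(1)     # $100 today vs $101 tomorrow
    later = 100 * discount(30) > 101 * discount(31)    # $100 in 30 days vs $101 in 31 days
    print(f"{discount.__name__}: prefers $100 today: {now}, prefers $100 in 30 days: {later}")
# exponential -> True, True (consistent choice); hyperbolic -> True, False (preference reversal)
```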

Our model therefore combines two types of discounting: exponential when we move through space — when sequentially activating different place cells — and hyperbolic when we move through time — when we prolong the activity of the same place cell.

The discount factor γ also depends on other parameters such as firing rate and STDP amplitudes (see Equation 22 in the Appendix). This gives our model the flexibility to encode state-dependent discounting even when the trajectories and times spent in the states are the same. Such state-dependent discounting can be useful, for example, to encode salient locations in the environment such as landmarks or reward locations (Figure 3c and f).

Bias-variance trade-off

As discussed previously (section ‘Learning the successor representation in biologically plausible networks’), the TD(λ) algorithm unifies the TD algorithm and the MC algorithm. In our framework, replay-like neuronal activations are equivalent to MC, while behavioral-like activity is equivalent to TD. In this section, we will discuss how the replays and behavior can work together when learning the cognitive map of an environment, leveraging the strengths of MC and TD.

The MC algorithm effectively works by averaging over the sampled trajectories. As such, the estimated SR matrix will be a close approximation of the theoretical value. The difference between the estimated and theoretical value is commonly referred to as bias. We can therefore say that the MC algorithm presents low bias. However, if the agent moves in the environment at random, the sampled trajectories will be quite different from each other. When taking the average, the estimated value will therefore fluctuate a lot. In this case, we say that the MC estimate has high variance as well (Figure 4A and B).

Figure 4 with 2 supplements
Replays can be used to control the bias-variance trade-off.

(A) The agent follows a stochastic policy starting from the initial state (denoted by START). The probability to move to either neighboring state is 50%. An epoch stops when reaching a terminal state (denoted with STOP). (B) Root mean squared error (RMSE) between the learned SR estimate and the theoretical SR matrix. The full lines are mean RMSEs over 1000 random seeds. Three cases are considered: (i) learning happens exclusively due to behavioral activity (TD STDP, green); (ii) learning happens exclusively due to replay activity (MC STDP, purple); (iii) a mixture of behavioral and replay learning, where the probability of replays drops off exponentially with epochs (Mix STDP, pink). The mix model, with a decaying number of replays, learns as quickly as MC in the first epochs and converges to a low error similar to TD, benefiting both from the low bias of MC at the start and the low variance of TD at the end. (C, D, E) Representative weight changes for each of the scenarios. Full lines show various random seeds, shaded areas denote one standard deviation over 1000 random seeds. (F) More replays are observed when an animal explores a novel environment (day 1). Panel F adapted from Figure 3A in Cheng and Frank, 2008.

Unlike MC, the TD algorithm updates its estimate of the SR by comparing the current estimate of the SR with the actual state the agent transitioned to. Because of the dependence on the current estimate, this estimate will be incrementally refined with small updates. In this way, the SR estimate will not fluctuate much, and be lower in variance. However, by this dependence on the current estimate, we introduce a bias in the algorithm, which will be especially significant when our initial estimate of the SR is bad (Figure 4A and B). The TD algorithm therefore presents high bias and low variance.

We now apply these concepts to learning in a novel environment. Since the MC algorithm is unbiased by the initial estimate of the SR, replays should initially speed up learning in an unfamiliar environment. Later on, when the environment becomes familiar, the SR estimate is already closer to the exact value. At this point, we prefer to have low variance and thus the TD algorithm is preferred. We confirm this logic using our spiking neural networks, and show that we can have both quick learning and a low error at convergence if we have proportionally more replays in the first trials in a novel environment (Figure 4a–e). In contrast, with an equal proportion of replays throughout the whole simulation, learning is neither as quick as with MC nor is the asymptotic error as low as with TD (Figure 4—figure supplement 1). Interestingly, the pattern of proportionally more replays in novel environments versus familiar environments has also been experimentally observed (Cheng and Frank, 2008; Figure 4f). Please note that, while we implemented an exponentially decaying probability for replays after entering a novel environment, different schemes for replay activity could be investigated. Note also that other mechanisms besides the successor representation could account for these results, including model-based reinforcement learning.
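The following self-contained sketch illustrates this trade-off at the algorithmic level on the task of Figure 4a. It is not the spiking network: the number of states, learning rate, number of epochs and the exponentially decaying replay schedule are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, eta = 7, 0.9, 0.05
start, terminals = 3, (0, 6)

# Ground-truth SR for the unbiased random walk; terminal rows of the transition
# matrix are zero because the epoch stops there, so the geometric series converges.
P = np.zeros((n_states, n_states))
for s in range(1, n_states - 1):
    P[s, s - 1] = P[s, s + 1] = 0.5
M_true = np.linalg.inv(np.eye(n_states) - gamma * P)

def sample_episode():
    s, traj = start, [start]
    while s not in terminals:
        s += rng.choice([-1, 1])
        traj.append(s)
    return traj

def td_episode(M, traj):            # behavioral activity: bootstrapped TD(0)-like updates
    for s, s_next in zip(traj[:-1], traj[1:]):
        target = np.eye(n_states)[s] + gamma * M[s_next]
        M[s] += eta * (target - M[s])
    t = traj[-1]                    # final update for the terminal state
    M[t] += eta * (np.eye(n_states)[t] - M[t])
    return M

def mc_episode(M, traj):            # replay-like activity: Monte Carlo updates
    for t, s in enumerate(traj):
        occ = sum((gamma ** (k - t)) * np.eye(n_states)[traj[k]]
                  for k in range(t, len(traj)))
        M[s] += eta * (occ - M[s])
    return M

def run(replay_prob):
    """replay_prob(epoch) is the probability that an epoch is learned as a replay (MC)."""
    M, errors = np.zeros((n_states, n_states)), []
    for epoch in range(200):
        traj = sample_episode()
        M = mc_episode(M, traj) if rng.random() < replay_prob(epoch) else td_episode(M, traj)
        errors.append(np.sqrt(np.mean((M - M_true) ** 2)))
    return np.array(errors)

# Average RMSE curves: pure TD, pure MC, and a mixture whose replay probability
# decays exponentially with epochs, in the spirit of Figure 4b.
for name, p in [("TD", lambda e: 0.0), ("MC", lambda e: 1.0),
                ("Mix", lambda e: np.exp(-e / 10))]:
    curves = np.stack([run(p) for _ in range(20)])
    print(f"{name}: RMSE after 10 epochs {curves[:, 9].mean():.3f}, "
          f"after 200 epochs {curves[:, -1].mean():.3f}")
```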

Leveraging replays to learn novel trajectories

In the previous section, the replays re-activated the same trajectories as seen during behavior. In this section, we extend this idea and show how in our model replays can be useful during learning even when the re-activated trajectories were not directly experienced during behavior.

For this purpose, we reproduce a place-avoidance experiment from Wu et al., 2017. In short, rats were allowed to freely explore a linear track on day 1. Half of the track was dark, while the other half was bright. On day 2, the animals performed four trials separated by resting periods: in the first trial (pre), the animals were free to explore the track; in the second trial (shock), they started in the light zone and received two mild footshocks when entering the shock zone; in the third and fourth trials (post and re-exposure, respectively), they were allowed to freely explore the track again, but starting from the light zone or the shock zone respectively (Figure 5a). In the study, it was reported that during the post trial, animals tended to stay in the light zone and forward replays from the current position to the shock zone were observed when the animals reached the boundary between the light and the dark zone (Figure 5b and c).

Figure 5 with 4 supplements
Reproducing place-avoidance experiments with the spiking model.

(A–C) Data from Wu et al., 2017. (A) Experimental protocol: the animal is first allowed to freely run in the track (Pre). In the next trial, a footshock is given in the shock zone (SZ). In subsequent trials the animal is again free to run in the track (Post, Re-exposure). Figure redrawn from Wu et al., 2017. (B) In the Post trial, the animal learned to avoid the shock zone completely and also mostly avoids the dark area of the track. Figure redrawn from Wu et al., 2017. (C) Time spent per location confirms that the animal prefers the light part of the track in the Post trial. Figure redrawn from Wu et al., 2017. (D) Mimicking the results from Wu et al., 2017, the shock zone is indicated by the black region, the dark zone by the gray region and the light zone by the white region. Left: without replays, the agent keeps extensively exploring the dark zone even after having experienced the shock. Right: with replays, the agent largely avoids entering the dark zone after having experienced the shock (replays not shown). (E) The value of each state in the cases with and without replays. (F) Occupancy of each state in our simulations and for the various trials. Solid line and shaded areas denote the average and standard deviation over 100 simulations, respectively. Notice we do not reproduce the peak of occupancy at the middle of the track as seen in panel c, since our simplified model assumes the same amount of time is spent in each state.

We simulated a simplified version of this task. Our simulated agent moves through the linear track following a softmax policy, and all states have equal value during the first phase (pre) (Figure 5d, blue trajectories). Then, the agent is allowed to move through the linear track until it reaches the shock zone and experiences a negative reward. Finally, the third phase is similar to the first phase and the animal is free to explore the track. Two versions of this third phase were simulated. In one version, there are no replays (Figure 5d, orange trajectories in left panel), while in the second version a forward replay up to the shock zone is simulated every time the agent enters the middle state (Figure 5d, orange trajectories in right panel, replays not shown). The replays affect the learning of the successor representation and the negative reward information is propagated towards the decision point in the middle of the track. The states in the dark zone therefore have lower value compared to the case without replays (Figure 5e). In turn, this different value affects the policy of the agent, which now tends to avoid the dark zone altogether, while the agent without replays still occupies states in the dark zone as much as states in the light zone (Figure 5f). Moreover, even when doubling the amount of SR updates in the scenario without replays, the behavior of the agent remains unaltered (Figure 5—figure supplement 1). This shows that it is not the amount of updates, but the type of policy that is important when updating the SR, and how using a different policy in the replay activity can significantly alter behavior.

Our setup for this simulation is simplified, and does not aim to reproduce the complex decision making of the rats. Observe for example the peak of occupancy of the middle state by the animals (Figure 5c), which is not captured by our model because we assume the agent to spend the same amount of time in each state. Nonetheless, it is interesting to see how replaying trajectories that were not directly experienced before, in combination with a model allowing replays to affect the learning of a cognitive map, can substantially influence the final policy of an agent and the overall performance. This mental imagination of trajectories could be exploited to refine our cognitive maps, avoiding unfavourable locations or finding shortcuts to rewards. It is important to note here that, while we are suggesting a potential role for the SR in solving this task, the data itself would also be compatible with a model-based strategy. In fact, experimental evidence suggests that humans may use a mixed strategy involving both model-based reinforcement learning and the successor representation (Momennejad et al., 2017).

Discussion

In this article, we investigated how a spiking neural network model of the hippocampus can learn the successor representation. Interestingly, we show that the updates in synaptic weights resulting from our biologically plausible STDP rule are equivalent to TD(λ) updates, a well-known and powerful reinforcement learning algorithm.

Reinforcement learning

Our network learns the SR in the CA3-CA1 weights. Since we have modeled neurons to integrate the synaptic EPSPs and generate spikes using an inhomogeneous Poisson process based on the depolarization, the firing rate is proportional to the total synaptic weights. Therefore, the successor representation can be read out simply by a downstream neuron. Moreover, since the value of a state is defined by the inner product between the successor matrix and the reward vector, it is sufficient for the synaptic weights to the downstream neuron to learn the reward vector, and the downstream neuron will then encode the state value in its firing rate (see Figure 5—figure supplement 2). While the neuron model used is simple, it will be interesting for future work to study analogous models with non-linear neurons.

It is worth noting that, during learning, both pre-synaptic and post-synaptic layers receive external inputs representing the current state (Equation 10 and Equation 11 in Materials and methods). This may induce a distortion in the read out of the diagonal elements of the SR matrix (see Equations 13 and 15, and Figure 5—figure supplement 2). At first glance, this may indicate that learning and reading out are antagonistic. However, there are multiple ways we could resolve this apparent conflict: (i) Since the external current in CA1 is present for only a fraction of the time T in each state, the readout might happen during the period of CA3 activation exclusively; (ii) The readout may be over the whole time T but becomes more noisy towards the end. Even in the case where the readout is noisy, the distortion would be limited to the diagonal elements of the matrix; (iii) Learning and readout may be separate mechanisms, where the CA3 driving current is present during readout only. This could be for instance signaled by neuromodulation (e.g. noradrenaline and acetylcholine are active during learning but not exploration; Micheau and Marighetto, 2011; Hasselmo and Sarter, 2011; Robbins, 1997; Teles-Grilo Ruivo and Mellor, 2013; Palacios-Filardo et al., 2021), or it could be that readout happens during replays; (iv) The weights to or activation functions of the readout neuron may learn to compensate for the distorted signal in CA1.

Furthermore, we can notice that the external inputs encoding the current state activate CA3 first, and CA1 later. The delay between these activations θ/T (Equation 10 and Equation 11 in Materials and methods) is an arbitrary parameter that can be adjusted. Varying this delay will change the reinforcement learning representation, especially parameters λ and γ, but also the strength of the input current (see Figure 5—figure supplement 3). However, this will not impact the distortion of the diagonal elements of the SR matrix, which remains similar across various delay values θ/T (see Figure 5—figure supplement 4).

Biological plausibility

Uncovering a connection between STDP and TD(λ) shows how, using minimal assumptions, a theoretically grounded learning algorithm can emerge from a biological implementation of plasticity. Similar learning rules have indeed been observed in the hippocampus (Shouval et al., 2002) and proposed on theoretical grounds (Mehta et al., 2000; Waddington et al., 2012; van Rossum et al., 2012).

The TD algorithm is most commonly known in neuroscience for describing how reward prediction can be computed in the brain. More specifically, it is widely believed that dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) encode the prediction error between the observed and expected reward (Schultz et al., 1997); dopamine thus acts as a global signal that can be broadcast to other areas of the brain, like the striatum, to compute the expected reward. In our model, the TD algorithm estimates the SR (i.e. expected future occupancy), rather than the value. However, since the prediction error for the SR is different for every synaptic connection (i.e. each pair of states), it is not clear how it could be carried by a global signal analogous to dopamine. The SR would need multiple signals, or a matrix transformation of the global signal. Furthermore, we would need to postulate that such error – or errors – are computed elsewhere in the brain. Instead, in our model, the prediction error simply emerges from the synaptic plasticity rule itself. Furthermore, thanks to the presynaptic depression, our STDP rule alone allows us to compute negative prediction errors, which still poses an open challenge for computation with dopamine because of the low baseline dopaminergic firing rate (Glimcher, 2011; Daw et al., 2002; Matsumoto and Hikosaka, 2007).

Our framework smoothly connects a temporally precise spiking code with a fully rate-based code, and anything in between. As we have proven mathematically, this translates into moving smoothly from Monte Carlo to Temporal Difference by means of TD(λ). Fast spiking sequences (temporal code) can be used for consolidation of previous experiences using Monte Carlo learning, while the behavioral timescale activity (rate code) results in TD updates, allowing learning on the timescale of seconds even with plasticity time constants on the order of milliseconds. This type of Hebbian learning over behavioral timescales exploits the bootstrapping property of TD, and is different from the one-shot behavioral plasticity described in Bittner et al., 2017. However, these two mechanisms could be complementary, where the latter could play a more significant role in the formation of new place fields, while the former would be more relevant to shape the existing place fields to contain predictive information. Learning on behavioral timescales using STDP was also investigated in Drew and Abbott, 2006. The main difference between Drew and Abbott, 2006 and our work, is that the former relies on overlapping neural activity between the pre- and post-synaptic neurons from the start, while in our case no such overlap is required. In other words, our setup allows us to learn connections between a presynaptic neuron and a postsynaptic neuron whose activities are separated by behavioral timescales initially. For this to be possible, there are two requirements: (1) the task needs to be repeated many times and (2) a chain of neurons is consecutively activated between the aforementioned presynaptic and postsynaptic neurons. Due to this chain of neurons, over time the activity of the postsynaptic neuron will start earlier, eventually overlapping with the presynaptic neuron.

In our work, we did not include theta modulation, but phase precession and theta sequences could be yet another type of activity within the TD lambda framework. A recent work (George et al., 2023) incorporated theta sweeps into behavioral activity, showing that it approximately learns the SR. Moreover, theta sequences allow for fast learning, playing a similar role as replays (or any other fast temporal-code sequences) in our work. By simulating the temporally compressed and precise theta sequences, their model also reconciles the learning over behavioral timescales with STDP. In contrast, our framework reconciles both timescales relying purely on rate-coding during behavior. Finally, their method allows learning the SR in continuous space. It would be interesting to investigate whether these methods co-exist in the hippocampus and other brain areas. Furthermore, Fang et al., 2023 recently showed how the SR can be learned using recurrent neural networks with biologically plausible plasticity.

There are three different neural activities in our proposed framework: the presynaptic layer (CA3), the postsynaptic layer (CA1), and the external inputs. These external inputs could for example be location-dependent currents from the entorhinal cortex, with timings guided by the theta oscillations. The dependence of CA1 place fields on CA3 and entorhinal input is in line with lesion studies (see e.g. Brun et al., 2008; Hales et al., 2014; O’Reilly et al., 2014). It would be interesting for future studies to further dissect the role various areas play in learning cognitive maps.

Notably, even though we have focused on the hippocampus in our work, the SR does not require predictive information to come from higher-level feedback inputs. This framework could therefore be useful even in sensory areas: certain stimuli are usually followed by other stimuli, essentially creating a sequence of states whose temporal structure can be encoded in the network using our framework. Interestingly, replays have been observed in other brain areas besides the hippocampus (Kurth-Nelson et al., 2016; Staresina et al., 2013). Furthermore, temporal difference learning in itself has been proposed in the past as a way to implement prospective coding (Brea et al., 2016).

Replays

We have also proposed a role for replays in learning the SR, in line with experimental findings and RL theories (Russek et al., 2017; Momennejad et al., 2017). In general, replays are thought to serve different functions, spanning from consolidation to planning (Roscow et al., 2021). Here, we have shown that when the replayed trajectories are similar to the ones observed during behavior, they play the role of speeding up and consolidating learning by regulating the bias-variance trade-off, which is especially useful in novel environments. On the other hand, if the replayed trajectories differ from the ones experienced during wakefulness, replays can play a role in reshaping the representation of space, which would suggest their involvement in planning. Experimentally, it has been observed that replays often start and end at relevant locations in the environment, like reward sites, decision points, obstacles or the current position of the animal (Ólafsdóttir et al., 2015; Pfeiffer and Foster, 2013; Jackson et al., 2006; Mattar and Daw, 2017). Since these are salient locations, this is in line with our proposition that replays can be used to maintain a convenient representation of the environment. It is worth noticing that replays can serve a variety of functions, and our framework merely proposes additional beneficial properties without claiming to explain all observed replays. For example, in addition to forward replays, reverse replays are also ubiquitous in the hippocampus (Pfeiffer, 2020). Reverse replays are not included in our framework, and it is not yet clear whether they play different roles, with some evidence suggesting that reverse replays are more closely tied to reward encoding (Ambrose et al., 2016). Moreover, while indirect evidence supports the idea that replays can play a role during learning (Igata et al., 2021), it is not yet clear how synaptic plasticity is manifested during replays (Fuchsberger and Paulsen, 2022).

Learning flexibility

Multiple ideas from reinforcement learning, such as TD(λ), state-dependent discounting and the successor representation, emerge quite naturally from our simple biologically plausible setting. We propose in our work that time and space can be discounted differently. Moreover, the flexibility to change the discounting factor by modulating firing rates and plasticity parameters — which is ubiquitous in neural circuits — suggests that these mechanisms could be used to encode a variety of information in a cognitive map. Furthermore, the specific dependence of the discount factor on the biological parameters leads to experimentally testable predictions. Indeed, our framework predicts well-defined changes in place fields after modulations of firing rates, speed of the agent or neuromodulation of the plasticity parameters (Figure 3). Importantly, the discount parameter also depends on the time spent in each state. This eliminates the need for time discretization, which does not reflect the continuous nature of the response of time cells (Kraus et al., 2013).

Limitations of the reinforcement learning framework

We have already outlined some of the benefits of using reinforcement learning for modeling behavior, including providing clear computational and algorithmic frameworks. However, there are several intrinsic limitations to this framework. For example, RL agents that only use spatial data do not provide complete descriptions of behavior, which likely arises from integrating information across multiple sensory inputs. Whereas an animal would be able to smell and see a reward from a certain distance, an agent exploring the environment would only be able to discover it when randomly visiting the exact reward location. Furthermore, the framework rests on fairly strict mathematical assumptions: typically the state space needs to be Markovian, time and space need to be discretized (which we manage to evade in this particular framework) and the discounting needs to follow an exponential decay. These assumptions are simplistic and it is not clear how often they are actually met. Reinforcement learning is also a sample-intensive technique, whereas we know that some animals, including humans, are capable of much faster or even one-shot learning.

Even though we have provided a neural implementation of the SR, and of the value function as its read-out (see Figure 5—figure supplement 2), the whole action selection process is still computed only at the algorithmic level. It may be interesting to extend the neural implementation to the policy selection mechanism in the future.

Taken together, our work joins, in a single framework, a variety of concepts ranging from the neuronal level, through cognitive theories, to reinforcement learning.

Materials and methods

The successor representation


In a tabular environment, we define the value of a state s as the expected cumulative reward that an agent will receive following a certain policy starting in s. The future rewards are multiplied by a factor 0 < γ^n ≤ 1, where n is the number of steps until reaching the reward location and 0 < γ ≤ 1 is the delay discount factor. It is usual to use 0 < γ < 1, which ensures that earlier rewards are given more importance compared to later rewards. Formally, the value of a state s under a certain policy π is defined as

(1) $V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} R_{t+k} \,\middle|\, S_{t} = s\right]$
(2) $\phantom{V^{\pi}(s)} = \sum_{a}\pi(a|s)\left[R(s,a) + \sum_{s'} P(s'|s,a)\,\gamma V^{\pi}(s')\right]$

Here, a denotes the action, R(s,a) is the reward function and P(s'|s,a) is the transition function, i.e. the probability that taking an action a in state s will result in a transition to state s'. Following Dayan, 1993, we can decompose the value function into the inner product of reward function and successor matrix

(3) $V(s) = \sum_{s'} M_{s,s'}\, R(s')$

with

(4) $M_{s,s'} = \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{I}(s_{t} = s') \,\middle|\, s_{0} = s\right]$

This representation is known as the successor representation (SR), where each element Mij represents the expected future occupancy of state j when in state i. By decomposing the value into the SR and the reward function (Equation 3), relearning the state values V after changing the reward function is fast, similar to model-based learning. At the same time, the SR can be learned in a model-free manner, using for example temporal difference (TD) learning (Russek et al., 2017).

Derivation of the TD(λ) update for the SR


The TD(λ) update for the SR is then implemented according to (see e.g. Sutton and Barto, 1998)

(5) $\Delta M(j,i) = \delta_{0}^{TD} + \gamma\lambda\,\delta_{1}^{TD} + (\gamma\lambda)^{2}\delta_{2}^{TD} + \dots$

Using $\delta_{n}^{TD}$ for the TD error at step n and $\delta_{xy}$ for the Kronecker delta,

(6) $\delta_{n}^{TD} = \delta_{j+n,\,i} + \gamma\, M(j+n+1,\, i) - M(j+n,\, i)$

corresponds to the TD error for element M(j+n, i) of the successor representation after the transition from state j+n to state j+n+1. Combining Equations 5 and 6, we find

(7) $\begin{aligned}\Delta M(j,i) &= \left[\delta_{j,i} + \gamma M(j+1,i) - M(j,i)\right] + \gamma\lambda\left[\delta_{j+1,i} + \gamma M(j+2,i) - M(j+1,i)\right] + (\gamma\lambda)^{2}\left[\delta_{j+2,i} + \gamma M(j+3,i) - M(j+2,i)\right] + \dots\\ &= -M(j,i) + \delta_{j,i} + (1-\lambda)\gamma M(j+1,i) + \gamma\lambda\,\delta_{j+1,i} + (1-\lambda)\lambda\gamma^{2}M(j+2,i) + (\gamma\lambda)^{2}\delta_{j+2,i} + \dots\\ &= -M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^{n}\delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^{n}M(j+n+1,i)\right]\end{aligned}$

and

(8) $M(j,i) \leftarrow M(j,i) + \eta\,\Delta M(j,i) = M(j,i) + \eta\left\{-M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^{n}\delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^{n}M(j+n+1,i)\right]\right\}$
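For concreteness, the sketch below transcribes this forward-view update directly into code, applied once per epoch. It assumes tabular states and treats terms M(j+n+1, i) beyond the end of the trajectory as zero (terminated episode); the function name and parameter values are our own:

```python
import numpy as np

def td_lambda_episode_update(M, traj, gamma, lam, eta):
    """Forward-view TD(lambda) update of Equation 8, applied for one epoch.

    traj lists the visited state indices; terms M(j+n+1, i) beyond the end of the
    trajectory are treated as zero because the episode has terminated.
    """
    n_states = M.shape[0]
    M_new = M.copy()
    for start, j in enumerate(traj):
        for i in range(n_states):
            delta_M = -M[j, i]
            for n in range(len(traj) - start):                      # n indexes state j+n
                delta_M += (gamma * lam) ** n * (traj[start + n] == i)
                if start + n + 1 < len(traj):
                    delta_M += (1 - lam) * gamma * (gamma * lam) ** n * M[traj[start + n + 1], i]
            M_new[j, i] += eta * delta_M
    return M_new

M = np.zeros((4, 4))
for _ in range(50):                         # repeated epochs on the linear track S1 -> S4
    M = td_lambda_episode_update(M, [0, 1, 2, 3], gamma=0.9, lam=0.5, eta=0.1)
print(np.round(M, 2))
```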

Neural network model

Plasticity rule


The synaptic plasticity rule (Figure 1d) consists of a weight-dependent depression for presynaptic spikes and a spike-timing dependent potentiation, given by

(9) $\dfrac{dw_{ij}(t)}{dt} = \eta_{STDP}\,A_{LTP}\,Tr_{LTP}^{j}(t)\sum_{t_i}\delta(t - t_i) \;-\; \eta_{STDP}\,A_{LTD}\,w_{ij}(t)\sum_{t_j}\delta(t - t_j)$
$\tau_{LTP}\,\dfrac{dTr_{LTP}^{j}(t)}{dt} = -Tr_{LTP}^{j}(t) + \sum_{t_j}\delta(t - t_j)$

Here, wij represents the synaptic connection from presynaptic neuron j to postsynaptic neuron i, TrLTPj is the plasticity trace, a low-pass filter of the presynaptic spike train with time constant τLTP, ti and tj are the spike times of the postsynaptic and presynaptic neurons respectively, ALTP and ALTD are the amplitudes of potentiation and depression respectively, ηSTDP is the learning rate for STDP and δ(⋅) denotes the Dirac delta function.
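A possible way to simulate this rule is a simple Euler discretization, sketched below for a single synapse with binary spike trains binned at width dt. The parameter values are illustrative placeholders, not the ones used for the results in the paper:

```python
import numpy as np

def stdp_weight_change(w, pre_spikes, post_spikes, dt=1e-3,
                       eta_stdp=0.1, A_ltp=0.02, A_ltd=0.1, tau_ltp=0.02):
    """Euler-discretized sketch of Equation 9 for a single synapse w_ij.

    pre_spikes / post_spikes are 0/1 arrays with one entry per time bin of width dt,
    for presynaptic neuron j and postsynaptic neuron i.
    """
    trace = 0.0                                        # Tr_LTP^j: low-pass of presynaptic spikes
    for pre, post in zip(pre_spikes, post_spikes):
        trace += -trace * dt / tau_ltp + pre / tau_ltp # tau_LTP dTr/dt = -Tr + presynaptic spikes
        w += eta_stdp * (A_ltp * trace * post          # potentiation at postsynaptic spikes
                         - A_ltd * w * pre)            # weight-dependent depression at presyn spikes
    return w

# A presynaptic spike followed 5 ms later by a postsynaptic spike potentiates the synapse.
pre = np.zeros(100); pre[10] = 1
post = np.zeros(100); post[15] = 1
print(stdp_weight_change(0.5, pre, post))
```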

Place cell activation


We assume that each state in the environment is represented by a population of place cells in the network. In our model, this is achieved by delivering place-tuned currents to the neurons. Whenever a state $S=j$ is entered, the presynaptic neurons encoding state $j$ start firing at a constant rate $\rho^{pre}$ for a time $\theta$, following a Poisson process with parameter $\rho_h^{pre}(t)$. The other presynaptic neurons are assumed to be silent:

(10) $\rho_h^{pre}(t) = \begin{cases}\rho^{pre}\,\delta_{hj}, & \text{if } t \in [0,\theta) \\ 0, & \text{otherwise}\end{cases}$

where the Kronecker delta is defined as $\delta_{hj} = 1$ if $h = j$ and zero otherwise. Here we use the index $j$ to denote any neuron belonging to the population of neurons encoding state $j$. After a short delay, at time $t^*$, a similar current $\rho^{bias}$ is delivered to the postsynaptic neuron encoding state $j$, for a duration $\omega$:

(11) $\rho_i^{bias}(t) = \begin{cases}\rho^{bias}\,\delta_{ij}, & \text{if } t \in [t^*, t^*+\omega) \\ 0, & \text{otherwise}\end{cases}$

Besides the place-tuned input current, CA1 neurons receive inputs from the presynaptic layer (CA3). The postsynaptic rate $\rho_i^{post}$ when the agent is in state $j$ is thus given by

(12) $\rho_i^{post}(t) = \sum_{k}^{N_{pop}}\sum_{t_k^f < t} w_{ij}^{k}(t)\,\kappa(t - t_k^f) + \rho_i^{bias}(t),$

with the first sum running over all $N_{pop}$ presynaptic neurons encoding state $j$, and the second sum running over all firing times $t_k^f$ of presynaptic neuron $k$ that happened before $t$. The excitatory postsynaptic current $\kappa$ is modeled as an exponential decay, $\kappa(x) = \epsilon_0 e^{-x/\tau_m}$ for $x \ge 0$ and zero otherwise. Each CA1 neuron $i$ fires following an inhomogeneous Poisson process with rate $\rho_i^{post}(t)$.

Note that in most simulations we use a single neuron per population ($N_{pop}=1$). In addition, we normally set $t^* = \theta$ and $\omega = T - \theta$. However, we keep these as explicit parameters for theoretical purposes.
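The following sketch generates the place-tuned activity of Equations 10–12 for a single state visit with one presynaptic neuron ($N_{pop}=1$), $t^*=\theta$ and $\omega = T-\theta$; all numerical values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of one state visit under Equations 10-12, with a single presynaptic neuron.
dt, T, theta = 0.01, 100.0, 80.0       # ms; t* = theta and omega = T - theta, as in the text
rho_pre, rho_bias = 0.1, 0.1           # place-tuned rates (1/ms); illustrative values only
eps0, tau_m, w = 1.0, 2.0, 1.0         # EPSP amplitude, membrane time constant, CA3->CA1 weight

epsp = 0.0                             # summed exponentially decaying EPSPs (Equation 12)
post_spikes = []
for k in range(int(T / dt)):
    t = k * dt
    epsp -= dt * epsp / tau_m                         # kappa decays with time constant tau_m
    if t < theta and rng.random() < rho_pre * dt:     # presynaptic Poisson spike (Equation 10)
        epsp += eps0 * w
    bias = rho_bias if t >= theta else 0.0            # bias current window (Equation 11)
    rho_post = epsp + bias                            # instantaneous CA1 rate (Equation 12)
    if rng.random() < rho_post * dt:                  # inhomogeneous Poisson CA1 spiking
        post_spikes.append(t)
```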

Equivalence with TD(λ)

Total plasticity update

Since we have the mathematical equation for the plasticity rule, and since CA3 and CA1 neurons follow inhomogeneous Poisson processes with time-dependent firing rates, we can calculate analytically the average total weight change for the synapse $w_{ij}$, given a certain trajectory (details in the Appendix). Note that our calculation is based on Kempter et al., 1999, and takes into account the fact that our plasticity rule is sensitive to spike timing, through a spike-spike correlation term. We find that:

(13) $\Delta w_{ij} = A\, w_{ij} + \sum_{n=0}^{N}\left[B\left(e^{-T/\tau_{LTP}}\right)^n \delta_{i,j+n} + C\left(e^{-T/\tau_{LTP}}\right)^{n+1} w_{i,j+n+1}\right]$

where N is the number of states until the end of the trajectory and

(14) $A = \eta_{STDP} A_{LTP}\, N_{pop}\epsilon_0(\rho^{pre})^2\,\tau_{LTP}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\!\left[\theta - \tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)\right] + \eta_{STDP} A_{LTP}\,\theta\,\rho^{pre} N_{pop}\,\frac{\tau_m\tau_{LTP}}{\tau_m+\tau_{LTP}}\,\epsilon_0 - \eta_{STDP} A_{pre}\,\rho^{pre}\,\theta$
(15) $B = \eta_{STDP} A_{LTP}\,\rho^{pre}\tau_{LTP}^2\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left(1-e^{-\omega/\tau_{LTP}}\right)\rho^{bias} = B'\rho^{bias}$
(16) $C = \eta_{STDP} A_{LTP}\, N_{pop}\epsilon_0\,\tau_m\,\tau_{LTP}^2\,(\rho^{pre})^2\!\left(1-e^{-\theta/\tau_m}\right)\!\left(e^{\theta/\tau_{LTP}}-1\right)\!\left(1-e^{-\theta/\tau_{LTP}}\right)$
Comparison with TD(λ)

Comparing the total weight change due to STDP (Equation 13) to the TD(λ) update (Equation 8), we can see that the two equations are very similar in form:

$w_{ij} \leftarrow w_{ij} - A\left\{-w_{ij} + \sum_{n=0}^{N}\left[-\frac{B}{A}\left(e^{-T/\tau_{LTP}}\right)^n \delta_{i,j+n} - \frac{C}{A}\, e^{-T/\tau_{LTP}}\left(e^{-T/\tau_{LTP}}\right)^n w_{i,j+n+1}\right]\right\}$
$M(j,i) \leftarrow M(j,i) + \eta\left\{-M(j,i) + \sum_{n=0}^{N}\left[(\gamma\lambda)^n \delta_{j+n,i} + (1-\lambda)\gamma(\gamma\lambda)^n M(j+n+1,i)\right]\right\}$

We impose wij=M(j,i), and find:

(17) $-A = \eta$
(18) $-\frac{B}{A} = 1 \;\Rightarrow\; \rho^{bias} = -\frac{A}{B'}$
(19) $e^{-T/\tau_{LTP}} = \lambda\gamma$
(20) $-\frac{C\, e^{-T/\tau_{LTP}}}{A} = (1-\lambda)\gamma,$

where $A$, $B$, $B'$ and $C$ are defined as in Equations 14, 15, and 16.

Hence, our plasticity rule is learning the Successor Representation through a TD(λ) model with parameters:

(21) $\eta = -A$
(22) $\gamma = \frac{A-C}{A}\, e^{-T/\tau_{LTP}}$
(23) $\lambda = \frac{A}{A-C}$

To ensure the learning rate η is positive, one condition resulting from Equation 21 is

(24) $A_{pre} > A_{LTP}\, N_{pop}\,\tau_{LTP}\,\tau_m\,\epsilon_0\left(\rho^{pre}\,\frac{\left(1-e^{-\theta/\tau_m}\right)\left[\theta - \tau_{LTP}\left(1-e^{-\theta/\tau_{LTP}}\right)\right]}{\theta} + \frac{1}{\tau_m+\tau_{LTP}}\right)$
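The mapping from plasticity constants to TD parameters can be checked numerically. The sketch below evaluates $A$, $B'$ and $C$ from Equations 14–16 and then $\eta$, $\gamma$, $\lambda$ and $\rho^{bias}$ from Equations 18 and 21–23, using the network parameters of Table 1 (listed with the simulation details below) and the behavioral setting $T=100$ ms, $\theta=80$ ms; the result should come out close to the TD parameters quoted later for Figure 2.

```python
import numpy as np

# Sketch: TD(lambda) parameters implied by the STDP constants (Equations 14-16, 18, 21-24).
# Values follow Table 1 and the behavioral setting T = 100 ms, theta = 80 ms.
eta_stdp, A_ltp, eps0 = 0.003, 1.0, 1.0
rho_pre, tau_m, tau_ltp, N_pop = 0.1, 2.0, 60.0, 1
T, theta = 100.0, 80.0
omega, t_star = T - theta, theta

# Equation 24: lower bound on A_pre (augmented by 5, as in the simulations)
A_pre = 5 + A_ltp * N_pop * tau_ltp * tau_m * eps0 * (
    rho_pre * (1 - np.exp(-theta / tau_m))
    * (theta - tau_ltp * (1 - np.exp(-theta / tau_ltp))) / theta
    + 1 / (tau_m + tau_ltp))

# Equations 14-16
A = (eta_stdp * A_ltp * N_pop * eps0 * rho_pre**2 * tau_ltp * tau_m
     * (1 - np.exp(-theta / tau_m))
     * (theta - tau_ltp * (1 - np.exp(-theta / tau_ltp)))
     + eta_stdp * A_ltp * theta * rho_pre * N_pop * eps0 * tau_m * tau_ltp / (tau_m + tau_ltp)
     - eta_stdp * A_pre * rho_pre * theta)
C = (eta_stdp * A_ltp * N_pop * eps0 * tau_m * tau_ltp**2 * rho_pre**2
     * (1 - np.exp(-theta / tau_m)) * (np.exp(theta / tau_ltp) - 1)
     * (1 - np.exp(-theta / tau_ltp)))
B_prime = (eta_stdp * A_ltp * rho_pre * tau_ltp**2 * (np.exp(theta / tau_ltp) - 1)
           * np.exp(-t_star / tau_ltp) * (1 - np.exp(-omega / tau_ltp)))

# Equations 21-23 and 18
eta = -A                                     # learning rate
gamma = (A - C) / A * np.exp(-T / tau_ltp)   # discount factor
lam = A / (A - C)                            # bootstrapping parameter
rho_bias = -A / B_prime                      # bias amplitude enforcing Equation 18
print(eta, gamma, lam, rho_bias)             # roughly 0.12, 0.89, 0.21 for these settings
```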

Learning during normal behavior (θ>>τLTP)


During normal behavior, we assume the place-tuned currents are on larger timescales than the plasticity constants: θ,ω>>τLTP. We can see from Equations 14 and 16 that the factor A grows linearly with θ while C grows exponentially with θ. From Equation 23, we then have

(25) $\lambda \to 0$

(See also Figure 2—figure supplement 1).

Learning during replays (θ<<τLTP)

Assumptions


For the replay model we assume the place-tuned currents are impulses, which make the neurons emit exactly one spike at a given time. Specifically, we can make the duration of the place-tuned currents go to 0,

(26) $\theta, \omega \to 0$

while the intensity of the currents goes to infinity. For simplicity, we will take:

$\rho^{pre}(\theta) = \frac{1}{\theta} \;\Rightarrow\; \lim_{\theta\to 0}\rho^{pre} = \infty, \qquad \rho^{bias}(\omega) = \frac{1}{\omega} \;\Rightarrow\; \lim_{\omega\to 0}\rho^{bias} = \infty$

Furthermore, we assume that the contribution of the postsynaptic currents due to the single presynaptic spikes is negligible in terms of driving plasticity, allowing us to set

$\epsilon_0 \to 0$

Calculations of TD parameters


Given the assumptions above, we can see from Equations 14 and 16 that:

$A = -\eta_{STDP}\, A_{pre}, \qquad C = 0$

For Equation 15, we can use the Taylor expansion of $e^{x/\tau}$ around $x=0$, such that $e^{x/\tau} \approx 1 + \frac{x}{\tau}$:

$B = \eta_{STDP} A_{LTP}\,\tau_{LTP}^2\,\rho^{pre}\,\frac{\theta}{\tau_{LTP}}\, e^{-t^*/\tau_{LTP}}\,\frac{\omega}{\tau_{LTP}}\,\rho^{bias} = \eta_{STDP} A_{LTP}\, e^{-t^*/\tau_{LTP}}$

Using Equations 21, 22, 23 and 18, we can calculate the parameters and constraints for the TD model:

(27) $\lambda = \frac{A}{A-C} = 1$
$\eta = -A = \eta_{STDP}\, A_{pre}$
$\gamma = \frac{A-C}{A}\, e^{-T/\tau_{LTP}} = e^{-T/\tau_{LTP}}$
$1 = -\frac{B}{A} = \frac{A_{LTP}\, e^{-t^*/\tau_{LTP}}}{A_{pre}}$

As expected, the bootstrapping parameter λ=1 (see also Figure 2—figure supplement 1).

Alternative derivation of replay model

Place cell activation during replays


We model a replay event as a precise temporal sequence of spikes. Since every neuron represents a state in the environment, a replay sequence reproduces a trajectory of states. We assume that, when the agent is in state $S=j$, the neurons representing state $j$ fire $n_{pre}$ spikes at some point in the time interval $t \in [0,\sigma]$, where the exact firing times are uniformly sampled. After a short delay, the CA1 neurons representing state $j$ fire $n_{post}$ spikes at a time uniformly sampled from the interval $[t^*, t^*+\sigma]$. The time between two consecutive state visits is $T$. The exact number of spikes in each replay event is random but small. Specifically, it is sampled from the set $\{0,1,2\}$ according to the probability vector

(28) $p = \left(\frac{p_1}{2},\; 1-p_1,\; \frac{p_1}{2}\right)$

Other implementations are possible; here we assume that the average number of spikes per state is 1 and that the average delay between a presynaptic and a postsynaptic spike is $t^*$. The model could be further generalized to a higher average number of spikes per state.
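A minimal sketch of this replay spike generation is given below; the timing values follow the replay settings used for Figure 2 ($T \approx 7$ ms, $\sigma = 0.5$ ms, $t^* = \theta = 2$ ms), and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def replay_spikes(trajectory, T=7.0, sigma=0.5, t_star=2.0, p1=0.15):
    """Sketch of a replay event: for each replayed state, draw the number of CA3 (pre)
    and CA1 (post) spikes from {0, 1, 2} with probabilities (p1/2, 1-p1, p1/2)
    (Equation 28) and place them in [0, sigma] and [t*, t*+sigma], respectively."""
    probs = [p1 / 2, 1 - p1, p1 / 2]
    pre, post = [], []                       # lists of (spike time in ms, state index)
    for n, state in enumerate(trajectory):
        t0 = n * T                           # entry time of the n-th replayed state
        for _ in range(rng.choice([0, 1, 2], p=probs)):
            pre.append((t0 + rng.uniform(0, sigma), state))
        for _ in range(rng.choice([0, 1, 2], p=probs)):
            post.append((t0 + t_star + rng.uniform(0, sigma), state))
    return pre, post

pre_spikes, post_spikes = replay_spikes([0, 1, 2, 3])
```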

Plasticity update


Consider again our learning rule, composed of a positive pre-post potentiation window and a presynaptic, weight-dependent depression (Equation 9). For the synapse $w_{ij}$, the average total amount of depression is determined by the number of times state $j$ is visited in the replayed trajectory:

$LTD = -A_{pre}\, w_{ij}\, N_j,$

where $N_j$ is the number of times state $j$ is visited. The amount of potentiation is instead determined by the time differences between postsynaptic and presynaptic firing times, which encode the distance between state $j$ and state $i$:

$LTP = A_{LTP}\sum_k e^{-\frac{kT+t^*}{\tau_{LTP}}}\, n_k^{ij},$

where $n_k^{ij}$ is the number of times the agent visited state $i$, $k$ steps after state $j$. Combining the equations above, we find that:

(29) $\Delta w_{ij} = \eta_{STDP} A_{LTP}\sum_k e^{-\frac{kT+t^*}{\tau_{LTP}}}\, n_k^{ij} - \eta_{STDP} A_{pre}\, w_{ij}\, N_j.$

If we assume that this value has converged to its stationary state, $\Delta w_{ij} = 0$, we obtain

(30) $w_{ij}^{*} = \frac{A_{LTP}}{A_{pre}}\, e^{-t^*/\tau_{LTP}}\, \frac{\sum_k\left(e^{-T/\tau_{LTP}}\right)^k n_k^{ij}}{N_j}$

Comparison with online Monte Carlo learning


Given the stable weight w* from Equation 30, we can impose that:

(31) $\frac{A_{LTP}}{A_{pre}}\, e^{-t^*/\tau_{LTP}} = 1 \quad\text{and}$
(32) $e^{-T/\tau_{LTP}} = \gamma$

we find that the stable weight is:

(33) $w_{ij}^{*} = \frac{\sum_k \gamma^k n_k^{ij}}{N_j} \approx \mathbb{E}\!\left[\sum_k \gamma^k\, \mathbb{I}(S_k = i \,|\, S_0 = j)\right] = M(j,i)$

which is the definition of the successor representation matrix (Equation 4). Indeed, $w_{ij}^{*}$ computes the sample mean of the discounted number of visits to state $i$ after state $j$, which is equivalent to performing an every-visit Monte Carlo or TD($\lambda=1$) update. Notably, from Equation 29, the learning rate for the Monte Carlo update is given by:

(34) $\eta = \eta_{STDP}\, A_{LTP}\, e^{-t^*/\tau_{LTP}} = \eta_{STDP}\, A_{pre}$
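The estimator in Equation 33 can be computed directly as a discounted visit count over sampled trajectories, as in the sketch below; the toy random walk used for the usage example is only loosely modeled on the three-state track of Figure 4.

```python
import numpy as np

def mc_successor(trajectories, n_states, gamma):
    """Sketch of the every-visit Monte Carlo estimate in Equation 33: M(j, i) is the
    discounted count of visits to i after j, averaged over all visits to j."""
    counts = np.zeros((n_states, n_states))
    visits = np.zeros(n_states)
    for traj in trajectories:
        for t, j in enumerate(traj):
            visits[j] += 1
            for k, i in enumerate(traj[t:]):          # k = 0 counts the current state itself
                counts[j, i] += gamma ** k
    return counts / np.maximum(visits, 1)[:, None]

# Toy usage: unbiased random walks on a short track; a walk ends when it steps off either end
rng = np.random.default_rng(0)
def random_walk(start, n_states):
    traj, s = [start], start
    while True:
        s = s + rng.choice([-1, 1])
        if s < 0 or s >= n_states:
            return traj
        traj.append(s)

M_est = mc_successor([random_walk(1, 3) for _ in range(2000)], n_states=3, gamma=0.89)
```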

Simulation details for Figure 2


A linear track with four states is simulated. The policy of the agent in this simulation is to traverse the track from left to right, with one epoch consisting of starting in state 1 and ending in state 4. One simulation consists of 50 epochs, and we re-run the whole simulation ten times with different random seeds. Over these ten seeds, mean and standard deviation of the synaptic weights are recorded after every epoch.

Our neural network consists of two layers, each with a single neuron per state (as in Figure 1). Synaptic connections are made from each presynaptic neuron to all postsynaptic neurons, resulting in a 4-by-4 matrix which is initialized as the identity matrix. The plasticity rule and neuronal activations follow Equations 9–12.

The STDP parameters are listed in Table 1.

Table 1. Parameters used for the spiking network.
$\epsilon_0 = 1$
$\rho^{pre} = 0.1\ \mathrm{ms}^{-1}$
$\tau_m = 2\ \mathrm{ms}$
$N_{post} = 1$
$N_{pre}^{tot} = 1$
$N_{pre} = 1$
step size $= 0.01\ \mathrm{ms}$
$\eta_{STDP} = 0.003$
$\tau_{LTP} = 60\ \mathrm{ms}$
$A_{LTP} = 1\ \mathrm{ms}^{-1}$

To satisfy Equation 24, we set $A_{pre}$ equal to the right-hand side of Equation 24 plus 5.

For the behavioral case, we choose T=100ms, θ=80ms, ω=T-θ, which correspond to TD(λ) parameters λ=0.21, γ=0.89, η=0.12.

In the replay case, we have a sequence of single spikes per neuron (see Figure 2b and section ‘Alternative derivation of replay model’). Following Equation 27, we choose $T = -\log(\gamma)\,\tau_{LTP} \approx 7$ ms, where $\gamma$ takes the same value as in the behavioral case and $\tau_{LTP}$ is as in Table 1. We set $\theta = 2$ ms and $\sigma = 0.5$ ms. By setting $\eta_{STDP} = \eta\, A_{LTP}\exp(\theta/\tau_{LTP})$, the corresponding TD($\lambda$) parameters are $\lambda = 1$, $\gamma = 0.89$, $\eta = 0.12$, just as in the behavioral case.

More details on the place cell activation during replays in our model can be found in section ‘Alternative derivation of replay model’. Using exactly one spike per neuron with the above parameters would allow us to follow the TD(1) learning trajectories without any noise. For more biological realism, we choose $p_1 = 0.15$ in Equation 28, in order to achieve an amount of noise due to random spiking equal to that in the case of behavioral activity (see Figure 4—figure supplement 2).

Simulation details for Figure 3


Using the same neural network and plasticity parameters as the behavioral learning in Figure 2 (see previous section), we simulate the linear track in the following two situations:

  • The third state has T=200ms instead of 100ms. All other parameters remain the same as in Figure 2. Results plotted in Figure 3E.

  • The third state has ρpre=0.2ms-1 instead of 0.1ms-1. All other parameters remain the same as in Figure 2. Results plotted in Figure 3F.

Simulation details for Figure 4


A linear track with three states is simulated, and the agent has 50% probability to move left or right in each state (see Figure 4A). One epoch lasts until the agent reaches one of the STOP locations.

We then use the same neural network and plasticity parameters as used for Figure 2. We simulate three scenarios:

  • Only replay-based learning during all epochs (no behavioral learning). This scenario corresponds to MC STDP in Figure 4B and to Figure 4C.

  • Mixed learning using both behavior and replays. The probability for an epoch to be a replay decays over time following $\exp(-i/6)$, with $i$ the epoch number (a scheduling sketch follows this list). This scenario corresponds to Mix STDP in Figure 4B and to Figure 4E.

  • Only behavioral learning during all epochs (no replays). This scenario corresponds to TD STDP in Figure 4B and to Figure 4D.
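A minimal sketch of the epoch scheduling used in the mixed scenario is shown below; the SR updates themselves (TD-like during behavioral epochs, Monte-Carlo-like during replay epochs) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: epoch schedule for the mixed scenario. Epoch i is a replay with
# probability exp(-i/6); otherwise it is a behavioral epoch.
n_epochs = 50
schedule = ["replay" if rng.random() < np.exp(-i / 6) else "behavior"
            for i in range(n_epochs)]
print(schedule.count("replay"), "replay epochs out of", n_epochs)
```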

Simulation details for Figure 5


A linear track with 21 states is simulated. The SR is initialized as the identity matrix, and the reward vector (containing the reward at each state) is also initialized as the zero vector. We simulate the learning of the SR during behavior using the theoretical TD(0) updates and during replays using the theoretical TD(1) updates. The value of each state is then calculated as the matrix-vector product between the SR and the reward vector, resulting in an initial value of zero for each state.

The policy of the agent is a softmax policy (i.e. the probability to move to neighboring states is equal to the softmax of the values of those neighboring states). The first time the agent reaches the leftmost state of the track (state 1), the negative reward of –2 is revealed, mimicking the shock in the actual experiments, and the reward vector is updated accordingly for this state.

We now simulate two scenarios: in the first scenario, the agent always follows the softmax policy and no replays are triggered (see Figure 5D, left panel). In the second scenario, every time the agent enters the dark zone from the light zone (i.e. transitions from state 12 to state 11 in our simulation), a replay is triggered from that state until the leftmost state (state 1) (see Figure 5D, right panel). Both scenarios are simulated for 2000 state transitions. We then run these two scenarios 100 times and calculate mean and standard deviation of state occupancies (Figure 5F).
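At the algorithmic level, the second scenario can be sketched as below: softmax action selection over the values of the neighboring states, TD(0) updates of the SR during behavior, and Monte-Carlo-style (TD(1)) updates along the replayed path whenever the agent enters the dark zone. States are 0-indexed here, the starting state and learning rate are placeholders, and the boundary handling is simplified compared to the actual simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, gamma, eta = 21, 0.89, 0.1
M, R = np.eye(n_states), np.zeros(n_states)
s, shock_seen = 15, False                    # hypothetical starting position in the light zone

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

for step in range(2000):
    # softmax policy over the values of the two neighboring states
    neighbors = [max(s - 1, 0), min(s + 1, n_states - 1)]
    values = (M @ R)[neighbors]
    s_next = neighbors[rng.choice(2, p=softmax(values))]

    # TD(0) update of the SR row for the experienced transition
    onehot = np.zeros(n_states); onehot[s] = 1.0
    M[s] += eta * (onehot + gamma * M[s_next] - M[s])

    if s_next == 0 and not shock_seen:       # first visit to the leftmost state (state 1)
        R[0], shock_seen = -2.0, True        # the shock updates the reward vector

    if s == 11 and s_next == 10:             # entering the dark zone (state 12 -> 11) triggers a replay
        replayed = list(range(10, -1, -1))   # replay from state 11 down to state 1
        for idx, j in enumerate(replayed):
            # TD(1) / Monte-Carlo target: discounted occupancies along the rest of the replayed path
            target = np.zeros(n_states)
            for m, jj in enumerate(replayed[idx:]):
                target[jj] += gamma ** m
            M[j] += eta * (target - M[j])
    s = s_next
```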

Finally, since the second scenario has more SR updates than the first scenario, we also simulate the first scenario for 4000 state transitions (Figure 5—figure supplement 1) and show how the observed behavior of Figure 5 is unaffected by this.

Appendix 1

Analytical derivations for the total weight change in the behavioural model

Presynaptic rate during state j

Whenever a state $S=j$ is entered, the presynaptic neurons encoding state $j$ start firing at a constant rate $\rho^{pre}$ for a time $\theta$, following a Poisson process with parameter $\rho_j^{pre}(t)$:

(35) $\rho_j^{pre}(t) = \begin{cases}\rho^{pre}, & \text{if } t \in [0,\theta) \\ 0, & \text{otherwise}\end{cases}$

The other presynaptic neurons are silent.

Postsynaptic rate during state j

The average postsynaptic rate can be calculated as follows. The probability of a presynaptic spike between $t$ and $t+dt$ is equal to $\rho_j^{pre}(t)\,dt$. The size of the presynaptic population encoding state $j$ is $N_{pop}$, and each excitatory postsynaptic potential (EPSP) is modeled as an immediate jump of amplitude $\epsilon_0 w_{ij}$ followed by an exponential decay with time constant $\tau_m$.

Following Equation 12 in the main paper, reproduced below,

$\rho_i^{post}(t) = \sum_{k}^{N_{pop}}\sum_{t_k^f < t} w_{ij}^{k}(t)\,\kappa(t - t_k^f) + \rho_i^{bias}(t)$

we find that the average postsynaptic rate at time $t$ is given by (assuming $t=0$ when entering state $j$):

(36) $\bar{\rho}_i^{post}(t) = \int_0^t N_{pop}\,\rho_j^{pre}(t')\,\epsilon_0\, w_{ij}(t')\, e^{-\frac{t-t'}{\tau_m}}\, dt' + \rho_i^{bias}(t)$

We assume that wij(t) changes slowly compared to the timescale θ allowing us to consider the weight constant during that time. We can then approximate the average postsynaptic rate as:

(37) $\bar{\rho}_i^{post}(t) = \begin{cases} N_{pop}\,\rho^{pre}\epsilon_0 w_{ij}\,\tau_m\!\left(1-e^{-t/\tau_m}\right), & \text{if } 0 \le t < \theta \\ \rho^{bias}\,\delta_{ij}, & \text{if } t^* \le t < t^*+\omega \\ 0, & \text{otherwise} \end{cases}$

If $t^* < \theta$, both the first and the second term contribute to the postsynaptic rate in the interval between $t^*$ and $\theta$.

LTP trace during state j

Given Equation 9 in the main paper, reproduced below,

$\tau_{LTP}\,\frac{dTr_{LTP}^{j}(t)}{dt} = -Tr_{LTP}^{j}(t) + \sum_j \delta(t - t_j)$

and combined with Equation 35, we can calculate the evolution of the LTP trace for neuron j during state j:

(38) $Tr_{LTP}^{j}(t) = \begin{cases}\rho^{pre}\displaystyle\int_0^t e^{-\frac{t-t'}{\tau_{LTP}}}\,dt', & \text{if } 0 \le t < \theta \\ \rho^{pre}\tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)e^{-\frac{t-\theta}{\tau_{LTP}}}, & \text{if } t \ge \theta\end{cases} = \begin{cases}\rho^{pre}\tau_{LTP}\!\left(1-e^{-t/\tau_{LTP}}\right), & \text{if } 0 \le t < \theta \\ \rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)e^{-t/\tau_{LTP}}, & \text{if } t \ge \theta\end{cases}$

For $0 \le t < \theta$, the presynaptic neuron $j$ is active and the trace therefore builds up with the presynaptic spikes; for $t \ge \theta$, the trace decays exponentially with time constant $\tau_{LTP}$.

Total amount of LTP during state j

Following (Kempter et al., 1999), first we calculate the amount of LTP without taking into account spike-to-spike correlation:

The probability of a postsynaptic spike between $t$ and $t+dt$ is $\bar{\rho}_i^{post}(t)\,dt$, and the amount of LTP due to a single postsynaptic spike at time $t$ is $A_{LTP}\,Tr_{LTP}^{j}(t)$. Hence, combining Equations 37 and 38, the total amount of LTP during a state (i.e. between time 0 and $T$) becomes:

(39) $LTP_{\text{non-causal}} = A_{LTP}\int_0^T \bar{\rho}_i^{post}(t)\, Tr_{LTP}^{j}(t)\, dt$
(40) $= A_{LTP} N_{pop}\rho^{pre}\epsilon_0 w_{ij}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\rho^{pre}\tau_{LTP}\int_0^{\theta}\!\left(1-e^{-t/\tau_{LTP}}\right)dt + A_{LTP}\,\rho^{bias}\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\int_{t^*}^{t^*+\omega}\! e^{-t/\tau_{LTP}}\, dt$
$= A_{LTP}\, w_{ij}\, N_{pop}\epsilon_0(\rho^{pre})^2\,\tau_{LTP}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\!\left[\theta - \tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)\right] + A_{LTP}\,\rho^{bias}\rho^{pre}\tau_{LTP}^2\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left[1-e^{-\omega/\tau_{LTP}}\right]$

Following Kempter et al., 1999, the amount of LTP due to the causal part (each presynaptic spike temporarily increases the probability of a postsynaptic spike) is given by:

(41) $LTP_{\text{causal}} = A_{LTP}\,\theta\,\rho^{pre}\epsilon_0 w_{ij}\,\frac{\tau_m \tau_{LTP}}{\tau_m + \tau_{LTP}}$

Combining the non-causal (Equation 40) and causal (Equation 41) parts, we obtain the total amount of LTP during a state (assuming $\tau_m \ll \tau_{LTP}$):

(42) $LTP = A_{LTP}\, w_{ij}\, N_{pop}\epsilon_0(\rho^{pre})^2\,\tau_{LTP}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\!\left[\theta - \tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)\right] + A_{LTP}\,\rho^{bias}\rho^{pre}\tau_{LTP}^2\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left[1-e^{-\omega/\tau_{LTP}}\right] + A_{LTP}\,\theta\,\rho^{pre}\,\frac{\tau_m\tau_{LTP}}{\tau_m+\tau_{LTP}}\,\epsilon_0\, w_{ij}$

Total amount of LTD during state j

There is a weight-dependent depression for each presynaptic spike, hence the amount of LTD during a state is given by:

(43) $LTD = -A_{pre}\,\rho^{pre}\,\theta\, w_{ij}$

Total plasticity during state j

Combining Equations 42 and 43, we can calculate the total amount of plasticity during the time the agent spends in the current state j:

(44) $\Delta_0 w_{ij} = A\, w_{ij} + B\,\delta_{ij}$

with

(45) $A = \eta_{STDP} A_{LTP}\, N_{pop}\epsilon_0(\rho^{pre})^2\,\tau_{LTP}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\!\left[\theta - \tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)\right] + \eta_{STDP} A_{LTP}\,\theta\,\rho^{pre}\,\frac{\tau_m\tau_{LTP}}{\tau_m+\tau_{LTP}}\,\epsilon_0 - \eta_{STDP} A_{pre}\,\rho^{pre}\,\theta$

and

(46) $B = \eta_{STDP} A_{LTP}\,\rho^{pre}\tau_{LTP}^2\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left[1-e^{-\omega/\tau_{LTP}}\right]\rho^{bias}$

Plasticity due to states transitioning

Once the agent leaves state $j$, the decaying LTP trace can still cause potentiation due to the activity in the following states $j+n$, with $n = 1, 2, \dots$. Given that the agent spends a time $T$ in each state, the agent visits state $j+n$ during the time $t \in [nT, nT+T)$. We will now calculate the contribution to plasticity due to these state transitions.

Postsynaptic rate during the new state j+n

During state j+n, the activity of the postsynaptic neurons is driven by the presynaptic neurons coding for j+n, and the bias current. We can thus generalize Equation 37 and find that the average postsynaptic rate ρ¯ipost during state j+n is:

(47) $\bar{\rho}_i^{post}(t) = \begin{cases}N_{pop}\,\rho^{pre}\epsilon_0 w_{i,j+n}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right), & \text{if } nT \le t < nT+\theta \\ \rho^{bias}\,\delta_{i,j+n}, & \text{if } nT+t^* \le t < nT+t^*+\omega \\ 0, & \text{otherwise}\end{cases}$

LTP trace from state j, during the new state j+n

Following Equation 38, we find that the amplitude of the LTP trace from state j during state j+n is:

(48) $Tr_{LTP}^{j}(nT+t) = \rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-\frac{t+nT}{\tau_{LTP}}} = \rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\!\left(e^{-T/\tau_{LTP}}\right)^n e^{-t/\tau_{LTP}}$

with 0<t<T.

LTP due to state transitioning

We can then calculate the amount of LTP between the presynaptic neuron j and the postsynaptic neuron i, when the agent is in state j+n. We refer to Equation 39 and find:

$LTP_{switch} = A_{LTP}\int_{nT}^{nT+T}\bar{\rho}_i^{post}(t)\, Tr_{LTP}^{j}(t)\, dt$
$= A_{LTP} N_{pop}\rho^{pre}\epsilon_0 w_{i,j+n}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\!\left(e^{-T/\tau_{LTP}}\right)^n\int_{nT}^{nT+\theta} e^{-\frac{t-nT}{\tau_{LTP}}}\, dt + A_{LTP}\,\rho^{bias}\delta_{i,j+n}\,\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\!\left(e^{-T/\tau_{LTP}}\right)^n\int_{nT+t^*}^{nT+t^*+\omega} e^{-\frac{t-nT}{\tau_{LTP}}}\, dt$
$= A_{LTP} N_{pop}\rho^{pre}\epsilon_0 w_{i,j+n}\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\!\left(e^{-T/\tau_{LTP}}\right)^n\tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right) + A_{LTP}\,\rho^{bias}\delta_{i,j+n}\,\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left(e^{-T/\tau_{LTP}}\right)^n\tau_{LTP}\!\left(1-e^{-\omega/\tau_{LTP}}\right)$

The amount of plasticity in state j+n when starting from state j is thus:

(49) $\Delta_n w_{ij} = C\left(e^{-T/\tau_{LTP}}\right)^n w_{i,j+n} + B\left(e^{-T/\tau_{LTP}}\right)^n \delta_{i,j+n}$

where

(50) $C = \eta_{STDP} A_{LTP}\, N_{pop}\,\rho^{pre}\epsilon_0\,\tau_m\!\left(1-e^{-\theta/\tau_m}\right)\rho^{pre}\tau_{LTP}\!\left(e^{\theta/\tau_{LTP}}-1\right)\tau_{LTP}\!\left(1-e^{-\theta/\tau_{LTP}}\right)$
(51) $B = \eta_{STDP} A_{LTP}\,\rho^{pre}\tau_{LTP}^2\!\left(e^{\theta/\tau_{LTP}}-1\right) e^{-t^*/\tau_{LTP}}\!\left(1-e^{-\omega/\tau_{LTP}}\right)\rho^{bias}$

It is worth noting that the parameter B derived here is the same as Equation 46.

Summary: total STDP update

If we combine together Equations 44 and 49, we have that the total weight change for the synapse wij is given by:

(52) $\Delta w_{ij} = \Delta_0 w_{ij} + \sum_{n=1}^{N}\Delta_n w_{ij} = A\, w_{ij} + \sum_{n=0}^{N}\left[B\left(e^{-T/\tau_{LTP}}\right)^n \delta_{i,j+n} + C\left(e^{-T/\tau_{LTP}}\right)^{n+1} w_{i,j+n+1}\right]$

where N is the number of states until the end of the trajectory and A, B, C are as defined in Equations 45, 46 and 50 respectively.

Analytical calculations for hyperbolic discounting

From Equation 22 in the main paper, in the behavioural model $\gamma = \left(1-\frac{C}{A}\right)e^{-T/\tau_{LTP}}$. Here, we derive an approximation to this value.

If we assume that $\theta \gg \tau_m, \tau_{LTP}$, we can approximate $A$ and $C$ as:

(53) $\tilde{A} = \eta_{STDP} A_{LTP}\, N_{pop}\epsilon_0(\rho^{pre})^2\,\tau_{LTP}\,\tau_m\left(\theta - \tau_{LTP}\right) + \eta_{STDP} A_{LTP}\,\theta\,\rho^{pre}\,\frac{\tau_m\tau_{LTP}}{\tau_m+\tau_{LTP}}\,\epsilon_0 - \eta_{STDP} A_{pre}\,\rho^{pre}\,\theta = -\eta_{STDP}\,\rho^{pre}\left(a\theta + b\right),$
(54) $\text{with}\quad a = A_{pre} - A_{LTP}\,\epsilon_0\,\tau_{LTP}\,\tau_m\left(N_{pop}\,\rho^{pre} + \frac{1}{\tau_m+\tau_{LTP}}\right) \quad\text{and}\quad b = A_{LTP}\, N_{pop}\,\epsilon_0\,\rho^{pre}\,\tau_{LTP}^2\,\tau_m$
(55) $\tilde{C} = \eta_{STDP} A_{LTP}\, N_{pop}(\rho^{pre})^2\epsilon_0\,\tau_m\,\tau_{LTP}^2\, e^{\theta/\tau_{LTP}} = \eta_{STDP}\,\rho^{pre}\, e^{\theta/\tau_{LTP}}\, b$

If we define ψ such that θ+ψ=T, we can rewrite and approximate the discount parameter as:

(56) $\gamma = \left(1-\frac{C}{A}\right)e^{-\frac{\theta+\psi}{\tau_{LTP}}} \approx -\frac{\tilde{C}}{\tilde{A}}\, e^{-\frac{\theta}{\tau_{LTP}}}\, e^{-\frac{\psi}{\tau_{LTP}}} = \frac{b\, e^{-\psi/\tau_{LTP}}}{a\theta + b} = \frac{1}{1+\frac{a}{b}\theta}\, e^{-\psi/\tau_{LTP}}$

From Equation 56, we can see that the discount γ follows a hyperbolic function if we increase the duration of the presynaptic current θ. If, instead, we vary ψ, the discount becomes exponential (Figure 2—figure supplement 1a and b).
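As a quick numerical illustration of Equation 56, the sketch below evaluates the approximate discount for increasing $\theta$ at fixed $\psi$; the parameters follow Table 1, and $A_{pre}=20$ is a placeholder chosen comfortably above the bound of Equation 24 so that the discount stays below one.

```python
import numpy as np

# Sketch: approximate discount from Equation 56 as a function of theta, at fixed psi.
A_ltp, A_pre, eps0 = 1.0, 20.0, 1.0          # A_pre is a placeholder above the Equation 24 bound
rho_pre, tau_m, tau_ltp, N_pop = 0.1, 2.0, 60.0, 1
psi = 20.0                                    # ms

a = A_pre - A_ltp * eps0 * tau_ltp * tau_m * (N_pop * rho_pre + 1 / (tau_m + tau_ltp))
b = A_ltp * N_pop * eps0 * rho_pre * tau_ltp**2 * tau_m

theta = np.array([80.0, 160.0, 320.0, 640.0])                   # ms
gamma_approx = np.exp(-psi / tau_ltp) / (1 + (a / b) * theta)   # hyperbolic decay in theta
print(dict(zip(theta.tolist(), gamma_approx.round(3).tolist())))
```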

Notice that this analysis extends to the replay model. Following what was done after Equation 26, we can connect the behavioural model with the replay model by letting $\theta, \epsilon_0 \to 0$, which implies $\psi \to T$. From Equation 56 we find that:

$\lim_{\theta,\epsilon_0 \to 0}\gamma = e^{-T/\tau_{LTP}},$

which is exactly the definition of $\gamma$ in the replay model (Equation 27 in Materials and methods). For replays, the discount is therefore strictly exponential.

Furthermore, using the same calculations and Equations 21 and 19 in the main paper, we can find approximate values for the other parameters too (Figure 2—figure supplement 1c and d):

$\eta = -A \approx \eta_{STDP}\,\rho^{pre}\left(a\theta + b\right)$
$\lambda = \frac{e^{-T/\tau_{LTP}}}{\gamma} \approx \left(1+\frac{a}{b}\theta\right) e^{\psi/\tau_{LTP}}\, e^{-\frac{\psi+\theta}{\tau_{LTP}}} = \left(1+\frac{a}{b}\theta\right) e^{-\theta/\tau_{LTP}}$

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Modelling code is available on GitHub (https://github.com/jacopobono/learning_cognitive_maps_code, copy archived at swh:1:rev:d86b262545547353c7050bbc2d476c2f4a297989; Jacopo, 2023).


Article and author information

Author details

  1. Jacopo Bono

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft
    Contributed equally with
    Sara Zannone
    Competing interests
    No competing interests declared
ORCID iD: 0000-0001-9552-3151
  2. Sara Zannone

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft
    Contributed equally with
    Jacopo Bono
    Competing interests
    No competing interests declared
ORCID iD: 0000-0002-9526-7001
  3. Victor Pedrosa

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Visualization, Writing - original draft
    Competing interests
    No competing interests declared
  4. Claudia Clopath

    Department of Bioengineering, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Project administration, Writing - review and editing
    For correspondence
    c.clopath@imperial.ac.uk
    Competing interests
    No competing interests declared
ORCID iD: 0000-0003-4507-8648

Funding

Wellcome Trust (200790/Z/16/Z)

  • Claudia Clopath

Engineering and Physical Sciences Research Council (EP/R035806/1)

  • Claudia Clopath

Simons Foundation (564408)

  • Claudia Clopath

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Version history

  1. Preprint posted: August 17, 2021
  2. Received: May 30, 2022
  3. Accepted: January 12, 2023
  4. Version of Record published: March 16, 2023 (version 1)

Copyright

© 2023, Bono, Zannone et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


