Rapid learning of predictive maps with STDP and theta phase precession
Abstract
The predictive map hypothesis is a promising candidate principle for hippocampal function. A favoured formalisation of this hypothesis, called the successor representation, proposes that each place cell encodes the expected state occupancy of its target location in the near future. This predictive framework is supported by behavioural as well as electrophysiological evidence and has desirable consequences for both the generalisability and efficiency of reinforcement learning algorithms. However, it is unclear how the successor representation might be learnt in the brain. Error-driven temporal difference learning, commonly used to learn successor representations in artificial agents, is not known to be implemented in hippocampal networks. Instead, we demonstrate that spike-timing dependent plasticity (STDP), a form of Hebbian learning, acting on temporally compressed trajectories known as ‘theta sweeps’, is sufficient to rapidly learn a close approximation to the successor representation. The model is biologically plausible – it uses spiking neurons modulated by theta-band oscillations, diffuse and overlapping place cell-like state representations, and experimentally matched parameters. We show how this model maps onto known aspects of hippocampal circuitry and explains substantial variance in the temporal difference successor matrix, consequently giving rise to place cells that demonstrate experimentally observed successor representation-related phenomena including backwards expansion on a 1D track and elongation near walls in 2D. Finally, our model provides insight into the observed topographical ordering of place field sizes along the dorsal-ventral axis by showing this is necessary to prevent the detrimental mixing of larger place fields, which encode longer timescale successor representations, with more fine-grained predictions of spatial location.
Editor's evaluation
This theoretical work is important in that it bridges neural mechanisms within the hippocampus with the abstract computations it is thought to support for reinforcement learning. The study offers a potential mechanism by which spike timing dependent plasticity and theta phase precession within spiking neurons in CA3 and CA1 can yield successor representations. The simulations are compelling in that they continue to hold even when some of the simple but less realistic assumptions are relaxed in support of more realistic scenarios consistent with biological data.
https://doi.org/10.7554/eLife.80663.sa0
Introduction
Knowing where you are and how to navigate in your environment is an everyday existential challenge for motile animals. In mammals, a key brain region supporting these functions is the hippocampus (Scoville and Milner, 1957; Morris et al., 1982), which represents self-location through the population activity of place cells – pyramidal neurons with spatially selective firing fields (O’Keefe and Dostrovsky, 1971). Place cells, in conjunction with other spatially tuned neurons (Taube et al., 1990; Hafting et al., 2005), are widely held to constitute a ‘cognitive map’ that encodes the relative positions of remembered locations and provides a basis upon which to flexibly navigate (Tolman, 1948; O’Keefe and Nadel, 1978).
The hippocampal representation of space incorporates spike-time and spike-rate based encodings, which convey broadly similar levels of information about self-location (Skaggs et al., 1996b; Huxter et al., 2003). Thus, the position of an animal in space can be accurately decoded from place cell firing rates (Wilson and McNaughton, 1993) as well as from the precise time of these spikes relative to the background 8–10 Hz theta oscillation in the hippocampal local field potential (Huxter et al., 2003). The latter is possible because place cells tend to spike progressively earlier in the theta cycle as the animal traverses the place field – a phenomenon known as phase precession (O’Keefe and Recce, 1993). Therefore, during a single cycle of theta the activity of the place cell population smoothly sweeps from representing the past to representing the future position of the animal (Maurer et al., 2006), and can simulate alternative possible futures across multiple cycles (Johnson and Redish, 2007).
In order for a cognitive map to support planning and flexible goal-directed navigation, it should incorporate information about the overall structure of space and the available routes between locations (Tolman, 1948; O’Keefe and Nadel, 1978). Theoretical work has identified the regular firing patterns of entorhinal grid cells with the former role, providing a spatial metric sufficient to support the calculation of navigational vectors (Bush et al., 2015; Banino et al., 2018). In contrast, associative place cell–place cell interactions have been repeatedly highlighted as a plausible mechanism for learning the available transitions in an environment (Muller et al., 1991; Blum and Abbott, 1996; Mehta et al., 2000). In the hippocampus, such associative learning has been shown to follow a spike-timing dependent plasticity (STDP) rule (Bi and Poo, 1998) – a form of Hebbian learning where the temporal ordering of spikes between presynaptic and postsynaptic neurons determines whether long-term potentiation or depression occurs. One of the consequences of phase precession is that correlates of behaviour, such as position in space, are compressed onto the timescale of a single theta cycle and thus coincide with the time window of STDP, $\mathcal{O}(20\text{–}50\ \text{ms})$ (Skaggs et al., 1996b; Mehta et al., 2000; Mehta, 2001; Mehta et al., 2002). This combination of theta sweeps and STDP has been applied to model a wide range of sequence learning tasks (Jensen and Lisman, 1996; Koene et al., 2003; Reifenstein et al., 2021), and as such potentially provides an efficient mechanism to learn from an animal’s experience, forming associations between cells that are separated by behavioural timescales much larger than that of STDP.
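A back-of-envelope calculation makes the compression argument concrete. The Python sketch below compares two lags for a pair of neighbouring place fields: the behavioural lag between reaching each field, and the within-cycle spike lag produced by phase precession. It assumes an idealised linear precession spanning the full 360° of a 10 Hz theta cycle (real precession usually spans less). It also assumes fields 2 m wide with centres 0.5 m apart and a running speed of about 0.16 m/s. These values and the function names are illustrative choices, not parameters taken from the cited studies.

```python
# Back-of-envelope sketch (hypothetical numbers): how phase precession maps a
# behavioural lag of seconds onto a within-theta-cycle lag of tens of ms.
THETA_PERIOD_S = 0.1  # one cycle of a 10 Hz theta oscillation


def behavioural_lag_s(separation_m: float, speed_m_per_s: float) -> float:
    """Time for the animal to run from one field centre to the next."""
    return separation_m / speed_m_per_s


def theta_lag_s(separation_m: float, field_width_m: float,
                precession_deg: float = 360.0) -> float:
    """Spike-time lag within one theta cycle between two cells whose fields are
    offset by separation_m, assuming phase falls linearly by precession_deg
    across a field of width field_width_m (an idealised precession model)."""
    phase_diff_deg = precession_deg * separation_m / field_width_m
    return THETA_PERIOD_S * phase_diff_deg / 360.0


# Fields 0.5 m apart and 2 m wide, animal running at roughly 0.16 m/s
print(behavioural_lag_s(0.5, 0.16))  # roughly 3.1 s: far outside the STDP window
print(theta_lag_s(0.5, 2.0))         # roughly 0.025 s: inside the 20-50 ms window
```

Under these assumptions the lag shrinks from seconds to tens of milliseconds. A narrower precession range shortens the within-cycle lag further, whereas the behavioural lag depends only on running speed.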
Spatial navigation can readily be understood as a reinforcement learning problem – a framework which seeks to define how an agent should act to maximise future expected reward (Sutton and Barto, 1998). Conventionally, the value of a state is defined as the expected cumulative reward that can be obtained from that state with some temporal discount applied. Thus, the relationship between states and the rewards expected from those states is captured in a single value, which can be used to direct reward-seeking behaviour. However, the computation of expected reward can be decomposed into two components – the successor representation, a predictive map capturing the expected location of the agent discounted into the future, and the expected reward associated with each state (Dayan, 1993). Such segregation yields several advantages, since information about available transitions can be learnt independently of rewards, and thus changes in the locations of rewards do not require the value of all states to be relearnt. This recapitulates a number of long-standing theories of the hippocampus, which state that it provides spatial representations that are independent of the animal’s particular goal and support goal-directed spatial navigation (Redish and Touretzky, 1998; Burgess et al., 1997; Koene et al., 2003; Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012).
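A toy problem illustrates how value decomposes into a successor representation and a reward vector. This is a textbook illustration, not the model used in this paper. It assumes a small loop of discrete states traversed strictly left-to-right. It computes the successor matrix $M=\sum_{k\ge 0}\gamma^k T^k$ (equivalently $(I-\gamma T)^{-1}$) by a truncated series, then obtains values as $V = MR$. The helper names and the choice $\gamma = 0.9$ are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def successor_matrix(T, gamma, n_terms=2000):
    """Truncated series M = sum_{k>=0} gamma^k T^k, where T[s][s'] is the
    probability of moving from state s to s' under the current policy."""
    n = len(T)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]  # k = 0 term
    power = identity
    weight = 1.0
    for _ in range(n_terms):
        power = matmul(power, T)  # T^k
        weight *= gamma           # gamma^k
        M = [[m + weight * p for m, p in zip(m_row, p_row)]
             for m_row, p_row in zip(M, power)]
    return M


def values(M, R):
    """V = M R: expected discounted reward from each state."""
    return [sum(m * r for m, r in zip(row, R)) for row in M]


# A 5-state loop traversed strictly left-to-right (a strongly biased policy)
N, GAMMA = 5, 0.9
T = [[1.0 if j == (i + 1) % N else 0.0 for j in range(N)] for i in range(N)]
M = successor_matrix(T, GAMMA)

# Moving the reward only requires a new R; M is reused unchanged
V_reward_at_0 = values(M, [1.0, 0.0, 0.0, 0.0, 0.0])
V_reward_at_2 = values(M, [0.0, 0.0, 1.0, 0.0, 0.0])
```

Only `R` changes when the reward moves, so the same `M` serves both reward layouts. This is the advantage of the decomposition described above.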
A growing body of empirical and theoretical evidence suggests that the hippocampal spatial code functions as a successor representation (Stachenfeld et al., 2017): specifically, that the activity of hippocampal place cells encodes a predictive map over the locations the animal expects to occupy in the future. Notably, this framework accounts for phenomena such as the skewing of place fields due to stereotyped trajectories (Mehta et al., 2000), the reorganisation of place fields following a forced detour (Alvernhe et al., 2011), and the behaviour of humans and rodents whilst navigating physical, virtual, and conceptual spaces (Momennejad et al., 2017; de Cothi et al., 2022). However, the successor representation is typically conceptualised as being learnt using the temporal difference learning rule (Russek et al., 2017; de Cothi and Barry, 2020), which uses the prediction error between expected and observed experience to improve the predictions. Whilst correlates of temporal difference learning have been observed in the striatum during reward-based learning (Schultz et al., 1997), it is less clear how it could be implemented in the hippocampus to learn a predictive map. In this context, we hypothesised that the predictive and compression properties of theta sweeps, combined with STDP in the hippocampus, might be sufficient to approximately learn a successor representation.
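For comparison with the STDP mechanism studied here, the following sketch shows tabular TD(0) learning of a successor matrix on one-hot states. It is a generic textbook rule under stated assumptions: discrete states, a fixed learning rate, and a deterministic loop as a toy environment. It is not the feature-based rule derived in the paper's Methods, and the function name and parameters are illustrative.

```python
def td_learn_sr(trajectory, n_states, gamma=0.9, lr=0.1):
    """Tabular TD(0) learning of the successor matrix from a state sequence.
    For each transition s -> s_next, the target for row s is
    onehot(s) + gamma * M[s_next]; the prediction error (target - M[s])
    drives the update."""
    M = [[0.0] * n_states for _ in range(n_states)]
    for s, s_next in zip(trajectory[:-1], trajectory[1:]):
        for j in range(n_states):
            target = (1.0 if j == s else 0.0) + gamma * M[s_next][j]
            M[s][j] += lr * (target - M[s][j])
    return M


# The same 5-state loop, traversed left-to-right for many laps
N, GAMMA = 5, 0.9
laps = 2000
trajectory = [step % N for step in range(N * laps + 1)]
M_td = td_learn_sr(trajectory, N, GAMMA)
```

With enough laps, `M_td` converges to the closed-form successor matrix of the loop, $M_{ij}=\gamma^{(j-i)\bmod N}/(1-\gamma^{N})$.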
We simulated the synaptic weights learnt due to STDP between a set of synthetic spiking place cells and found that they closely resemble the weights of a successor representation learnt with temporal difference learning. The inclusion of theta sweeps with the STDP rule increased the efficiency and robustness of the learning, with the STDP weights being a close approximation to the temporal difference successor matrix. Further, we find that no fine-tuning of parameters is needed – biologically determined parameters are optimal to efficiently approximate a successor representation and replicate experimental results synonymous with the predictive map hypothesis, including the behaviourally biased skewing of place fields (Mehta et al., 2000; Stachenfeld et al., 2017) in realistic one- and two-dimensional environments. Finally, we use the simulation of STDP with theta sweeps to generate insight into the observed topographical ordering of place field sizes along the dorsal-ventral hippocampal axis (Kjelstrup et al., 2008), by observing that such organisation is necessary to prevent the detrimental mixing of larger place fields, which approximate longer timescale successor representations (Momennejad and Howard, 2018), with more fine-grained predictions of future spatial location. Our model, focussing on the role of theta sweeps and STDP in learning a hippocampal predictive map, is part of a growing body of recent work emphasising hippocampally plausible mechanisms of learning successor representations, such as using hippocampal recurrence (Fang et al., 2023) or synaptic learning rules which bootstrap long-range predictive associations (Bono et al., 2023).
Results
We set out to investigate whether a combination of STDP and phase precession is sufficient to generate a successor representation-like matrix of synaptic weights between place cells in CA3 and downstream CA1. The model comprises an agent exploring a maze where its position $\mathbf{x}(t)$ is encoded by the instantaneous firing of a population of $N$ CA3 basis features, each with a spatial receptive field $f_j^{x}(\mathbf{x})$ given by a thresholded Gaussian of radius 1 m and 5 Hz peak firing rate. As the agent traverses the receptive field, its rate of spiking is subject to phase precession $f_j^{\theta}(\mathbf{x},t)$ with respect to a 10 Hz theta oscillation. This is implemented by modulating the firing rate by an independent phase precession factor, which varies according to the current theta phase and how far through the receptive field the agent has travelled (Chadwick et al., 2015) (see Methods and Figure 1a), such that, in total, the instantaneous firing rate of the $j^{\text{th}}$ basis feature is given by:

$$f_j(\mathbf{x},t) = f_j^{x}(\mathbf{x})\, f_j^{\theta}(\mathbf{x},t). \tag{1}$$
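The following is a minimal sketch of a single phase-precessing CA3 basis feature on a 1D track. Its instantaneous rate is the product of a spatial factor and a phase-precession factor. The 1 m radius, 5 Hz peak and 10 Hz theta come from the text. The following are assumptions made for illustration:

- the Gaussian width;
- the von Mises-shaped phase tuning and its concentration `KAPPA`;
- the linear mapping from progress through the field to preferred phase.

This is not necessarily the Chadwick et al. (2015) formulation used in the Methods.

```python
import math

PEAK_RATE_HZ = 5.0    # peak firing rate (from the text)
FIELD_RADIUS_M = 1.0  # field radius (from the text)
THETA_HZ = 10.0       # theta frequency (from the text)
SIGMA_M = 1.0         # hypothetical Gaussian width
KAPPA = 1.0           # hypothetical concentration of the phase tuning


def gaussian(d):
    return math.exp(-d ** 2 / (2 * SIGMA_M ** 2))


def spatial_rate(x, centre):
    """Spatial factor: a Gaussian bump shifted down so it reaches zero at the field edge."""
    d = abs(x - centre)
    if d >= FIELD_RADIUS_M:
        return 0.0
    edge = gaussian(FIELD_RADIUS_M)
    return PEAK_RATE_HZ * (gaussian(d) - edge) / (1.0 - edge)


def bessel_i0(k, terms=30):
    """Modified Bessel function I0 via its power series (normalises the phase tuning)."""
    return sum((k / 2) ** (2 * m) / math.factorial(m) ** 2 for m in range(terms))


def phase_factor(x, centre, theta_phase):
    """Phase factor for a rightward run: the preferred phase falls linearly from
    2*pi at field entry to 0 at exit; the factor averages to 1 over a theta cycle."""
    progress = (x - (centre - FIELD_RADIUS_M)) / (2 * FIELD_RADIUS_M)
    progress = min(max(progress, 0.0), 1.0)
    preferred = 2 * math.pi * (1.0 - progress)
    return math.exp(KAPPA * math.cos(theta_phase - preferred)) / bessel_i0(KAPPA)


def basis_rate(x, centre, t):
    """Instantaneous rate: spatial factor times phase factor at time t (seconds)."""
    theta_phase = (2 * math.pi * THETA_HZ * t) % (2 * math.pi)
    return spatial_rate(x, centre) * phase_factor(x, centre, theta_phase)
```

Normalising by $I_0(\kappa)$ makes the phase factor average to one over a theta cycle. Phase precession therefore redistributes spikes within each cycle without changing the time-averaged spatial rate.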
CA3 basis features $f_j$ then linearly drive downstream CA1 ‘STDP successor features’ $\tilde{\psi}_i$ (Figure 1b):

$$\tilde{\psi}_i(\mathbf{x},t) = \sum_{j} \mathsf{W}_{ij}\, f_j(\mathbf{x},t). \tag{2}$$
Using an inhomogeneous Poisson process, the firing rates of the basis and STDP successor features are converted into spike trains, which drive learning in the weight matrix $\mathsf{W}_{ij}$ according to an STDP rule (see Methods and Figure 1c). The STDP synaptic weight matrix $\mathsf{W}_{ij}$ (Figure 1d) can then be directly compared to the temporal difference (TD) successor matrix $\mathsf{M}_{ij}$ (Figure 1e), learnt via TD learning on the CA3 basis features (the full learning rule is derived in Methods and shown in Equation 27). Further, the TD successor matrix $\mathsf{M}_{ij}$ can also be used to generate the ‘TD successor features’:

$$\psi_i(\mathbf{x},t) = \sum_{j} \mathsf{M}_{ij}\, f_j(\mathbf{x},t), \tag{3}$$
allowing for direct comparison and analysis with the STDP successor features $\tilde{\psi}_i$ (Equation 2), using the same underlying firing rates driving the TD learning to sample spikes for the STDP learning. This abstraction of biological detail avoids the challenges and complexities of implementing a fully spiking network, although the approaches of Brea et al., 2016 and Bono et al., 2023 offer an avenue for removing it. In our model, phase precession generates theta sweeps (Figure 1a, grey box), as cells successively visited along the current trajectory fire at progressively later times in each theta cycle. Theta sweeps take the current trajectory of the agent and effectively compress it in time. As we show below, these compressed trajectories are important for learning successor features.
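The spike generation and plasticity steps can be sketched in two parts. First, an inhomogeneous Poisson process is approximated by one Bernoulli draw per 1 ms bin. Second, a pair-based exponential STDP window accumulates potentiation for pre-before-post pairs and depression for post-before-pre pairs. The amplitudes and time constants below are hypothetical placeholders within the tens-of-milliseconds range discussed in the Introduction. They are not the fitted parameters of the model, whose own rule is given in the Methods.

```python
import math
import random


def poisson_spikes(rate_fn, duration_s, dt=0.001, rng=None):
    """Spike times from an inhomogeneous Poisson process, approximated by one
    Bernoulli draw per time bin (valid when rate * dt << 1)."""
    rng = rng or random.Random()
    n_bins = round(duration_s / dt)
    return [k * dt for k in range(n_bins) if rng.random() < rate_fn(k * dt) * dt]


def stdp_dw(pre_spikes, post_spikes, a_plus=1.0, a_minus=0.4,
            tau_plus=0.02, tau_minus=0.04):
    """All-to-all pair-based STDP with hypothetical parameters: pre-before-post
    pairs potentiate, post-before-pre pairs depress, each weighted by an
    exponentially decaying window."""
    dw = 0.0
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            lag = t_post - t_pre
            if lag > 0:
                dw += a_plus * math.exp(-lag / tau_plus)
            elif lag < 0:
                dw -= a_minus * math.exp(lag / tau_minus)
    return dw


# Example: two 5 Hz cells simulated for 10 s with a fixed seed
rng = random.Random(0)
pre = poisson_spikes(lambda t: 5.0, 10.0, rng=rng)
post = poisson_spikes(lambda t: 5.0, 10.0, rng=rng)
dw = stdp_dw(pre, post)
```

In this sketch, a synapse from basis feature $j$ to successor feature $i$ would be updated as $\mathsf{W}_{ij} \leftarrow \mathsf{W}_{ij} + \eta\,\Delta w$. Here $\Delta w$ comes from `stdp_dw(spikes_j, spikes_i)` and $\eta$ is a hypothetical learning rate.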
The STDP learned synaptic weight matrix closely approximates the TD successor matrix
We first simulated an agent with $N=50$ evenly spaced CA3 place cell basis features on a 5 m circular track (a linear track with circular boundary conditions forming a closed loop; Figure 2a). The agent moved left-to-right at a constant velocity for 30 min, performing ∼58 complete traversals of the loop. The STDP weights learnt between the phase precessing basis features and their downstream STDP successor features (Figure 2b) were markedly similar to the successor representation matrix generated using temporal difference learning applied to the same basis features under the same conditions (Figure 2c; element-wise Pearson correlation between matrices, $R^{2}=0.87$). In particular, the agent’s strong left-to-right behavioural bias led to the characteristic asymmetry in the STDP weights predicted by successor representation models (Stachenfeld et al., 2017), with both matrices dominated by a wide band of positive weight shifted left of the diagonal and negative weights shifted right.
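The element-wise similarity score quoted above can be computed by flattening both matrices and squaring their Pearson correlation. The plain-Python sketch below assumes the two matrices have the same shape; the function name is hypothetical.

```python
import math


def elementwise_r2(A, B):
    """Squared Pearson correlation between all corresponding entries of A and B."""
    a = [v for row in A for v in row]
    b = [v for row in B for v in row]
    if len(a) != len(b):
        raise ValueError("matrices must have the same number of entries")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    r = cov / math.sqrt(var_a * var_b)
    return r * r


W = [[1.0, 2.0], [3.0, 4.0]]
M = [[1.0, 2.0], [3.0, 5.0]]
print(elementwise_r2(W, M))  # 42.25 / 43.75, about 0.966
```

Because the score is a squared correlation, it is insensitive to an overall scale or offset between the two matrices. This is convenient when comparing STDP weights with TD values that live on different scales.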
To compare the structure of the STDP weight matrix ${W}_{ij}$ and TD successor matrix ${M}_{ij}$, we aligned each row on the diagonal and averaged across rows (see Methods), effectively calculating the mean distribution of learnt weights originating from each basis feature (Figure 2d). Both models exhibited a similar distribution, with values smoothly ramping up to a peak just left of centre, before a sharp drop-off to the right caused by the left-to-right bias in the agent’s behaviour. In the network trained by TD learning this is because CA3 place cells to the left of (i.e. preceding) a given basis feature are reliable predictors of that basis feature’s future activity, with those immediately preceding it being the strongest predictors and thus conferring the strongest weights to its successor feature. Conversely, the CA3 place cells immediately to the right of (i.e. after) this basis feature are the furthest they could possibly be from predicting its future activity, resulting in minimal weight contributions. Indeed, we observed some of these weights even becoming negative (Figure 2d) – necessary to approximate the sharp drop-off in predictability using the smooth Gaussian basis features. In the STDP model, the similar distribution of weights is caused by the asymmetry in the STDP learning rule combined with the consistent temporal ordering of spikes in a theta sweep. Hence, the sequence of spikes emitted by different cells within a theta cycle directly reflects the order in which their spatial fields are encountered, resulting in commensurate changes to the weight matrix. So, for example, if a postsynaptic neuron reliably precedes its presynaptic cell on the track, the corresponding weight will be reduced, potentially becoming negative. We note that weights changing sign is not biologically plausible, as it violates Dale’s Law (Dale, 1935). This could perhaps be corrected with the addition of global excitation or by recruiting inhibitory interneurons.
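On the circular track, the row-alignment procedure can be sketched as follows. Each row is rotated so that its diagonal element lands in the centre column, and the rotated rows are then averaged. This assumes periodic boundaries; on a linear track with ends, rows would need padding rather than rotation. The function name is illustrative, and the Methods describe the exact procedure used.

```python
def diagonal_aligned_profile(W):
    """Rotate each row of a square weight matrix so its diagonal entry sits in
    the centre column, then average the rotated rows (periodic track only)."""
    n = len(W)
    centre = n // 2
    # Column k of rotated row i holds the weight to cell (i + k - centre) mod n,
    # so columns left of centre correspond to cells left of (preceding) cell i.
    rotated = [[row[(i + k - centre) % n] for k in range(n)] for i, row in enumerate(W)]
    return [sum(column) / n for column in zip(*rotated)]


# Toy matrix: every cell is connected only to the cell immediately to its left
N = 7
W = [[1.0 if j == (i - 1) % N else 0.0 for j in range(N)] for i in range(N)]
profile = diagonal_aligned_profile(W)  # all mass lands one column left of centre
```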
Notably, the temporal compression afforded by theta phase precession, which brings behavioural effects into the millisecond domain of STDP, is an essential element of this process (Lisman and Grace, 2005; Koene et al., 2003). When phase precession was removed from the STDP model, the resulting weights failed to capture the expected behavioural bias and thus did not resemble the successor matrix. This is evidenced by the lack of asymmetry (Figure 2d, dashed line; ratio of mass either side of the y-axis: 4.54 with phase precession vs. 0.99 without) and by a decrease in the explained variance of the TD successor matrix (Figure 2e, $R^{2}=0.87\pm 0.01$ vs. $R^{2}=0.63\pm 0.02$ without phase precession). Similarly, without the precise ordering of spikes, the learnt weight matrix was less regular, with increased levels of noise, and took over $4.5\times$ longer to converge (Figure 2e; time to reach $R^{2}=0.5$: 2.5 vs. 11.5 min without phase precession), having still not fully converged after 1 hr (Figure 2—figure supplement 1a). Thus, the ability to approximate TD learning appears specific to the combination of STDP and phase precession. Indeed, there are deep theoretical connections linking the two – see Methods section 5.9 for a theoretical investigation into the connections between TD learning and STDP learning augmented with phase precession. This effect is robust to variations in running speed (Figure 2—figure supplement 1b) and field sizes (Figure 2—figure supplement 1c). It also holds in scenarios where target CA1 cells have multiple firing fields (Figure 2—figure supplement 2a) that are updated online during learning (Figure 2—figure supplement 2b–d), or are fully driven by spikes in CA3 (Figure 2—figure supplement 2e); see Methods for more details.
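The two summary statistics above can be sketched as follows. The asymmetry ratio is taken as the total weight left of the centre column of a row-averaged profile divided by the total to its right. The convergence time is taken as the first sampled time at which $R^2$ reaches 0.5. Both definitions are assumptions for illustration, and the simple ratio becomes ill-defined if the weights on the right sum to zero. The exact definitions behind the reported numbers are given in the Methods.

```python
def side_mass_ratio(profile):
    """Total weight left of the centre column divided by total weight right of it."""
    centre = len(profile) // 2
    left = sum(profile[:centre])
    right = sum(profile[centre + 1:])
    return left / right


def time_to_threshold(times, scores, threshold=0.5):
    """First sampled time at which the similarity score reaches the threshold."""
    for t, score in zip(times, scores):
        if score >= threshold:
            return t
    return None  # never reached within the recorded period


print(side_mass_ratio([1.0, 3.0, 5.0, 1.0, 1.0]))            # 2.0
print(time_to_threshold([0.0, 2.5, 5.0], [0.1, 0.6, 0.8]))  # 2.5
```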
We also conducted a hyperparameter sweep to test whether these results were robust to changes in the phase precession and STDP learning rule parameters (Figure 2—figure supplement 3). The sweep range for each parameter contained and extended beyond the ‘biologically plausible’ values used in this paper (Figure 2—figure supplement 3a). We found that optimised parameters (those which result in the highest final similarity between the STDP and TD weight matrices, ${W}_{ij}$ and ${M}_{ij}$) were very close to the biological parameters already selected for our model from a literature search (Figure 2—figure supplement 3c,d; parameter references are also listed in the figure). When the optimised parameters were used, no drastic improvement was seen in the similarity between ${W}_{ij}$ and ${M}_{ij}$. The only exception was firing rate, for which performance improved monotonically as it increased, something the brain likely cannot achieve due to energy constraints. Among these parameters, those controlling phase precession in the CA3 basis features (Figure 2—figure supplement 4a) can affect the CA1 STDP successor features learnt:

- ‘weak’ phase precession resembles learning in the absence of theta modulation (Figure 2—figure supplement 4b,c);
- biologically plausible values provide the best match to the TD successor features (Figure 2—figure supplement 4d);
- ‘exaggerated’ phase precession actually hinders learning (Figure 2—figure supplement 4e; see Methods for more details).

Additionally, we find these CA1 cells go on to inherit phase precession from the CA3 population even after learning when they are driven by multiple CA3 fields (Figure 2—figure supplement 4f), and that this learning is robust to realistic phase offsets between the populations of CA3 and CA1 place cells (Figure 2—figure supplement 4g).
Next, we examined the correspondence between our model and the TD-trained successor representation in a situation without a strong behavioural bias. We therefore reran the simulation on the linear track without the circular boundary conditions, so that the agent turned and continued in the opposite direction whenever it reached either end of the track (Figure 2f). Again, the STDP and TD successor representation weight matrices were remarkably similar ($R^{2}=0.88$; Figure 2g–h), both being characterised by a wide band of positive weight centred on the diagonal (Figure 2i), reflecting the directionally unbiased behaviour of the agent. In this unbiased regime, theta sweeps were less important, though they still conferred a modest advantage in shape, learning speed and signal strength over the non-phase-precessing model (Figure 2j). This advantage appears as an increased amount of explained variance ($R^{2}=0.88\pm 0.01$ vs. $R^{2}=0.76\pm 0.02$) and faster convergence (time to reach $R^{2}=0.5$: 3 vs. 7.5 min).
To test whether the STDP model’s ability to capture the successor matrix would scale up to open field spaces, we implemented a 2D model of phase precession (see Methods) where the phase of spiking is sampled according to the distance travelled through the place field along the chord currently being traversed (Jeewajee et al., 2014). We then simulated the agent in an environment consisting of two interconnected 2.5 × 2.5 m square rooms (Figure 2k). The agent followed an adapted policy modelling rodent foraging behaviour that is biased towards traversing doorways and following walls (Raudies and Hasselmo, 2012; see Methods and the 10-minute sample trajectory shown in Figure 2k). After 2 hr of exploration, we found that the combination of STDP and phase precession was able to successfully capture the structure in the TD successor matrix (Figure 2l–m, $R^{2}=0.74$, TD successor matrix calculated over the same 2 hr trajectory).
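A minimal sketch of the geometric quantity behind 2D phase precession follows. For an agent inside a circular field moving in a straight line, it finds where that line enters and leaves the field and reports the fraction of the chord already travelled. Treating the field as a circle and taking the current heading as the direction of travel are simplifying assumptions; the function name is hypothetical. The resulting progress value could stand in for 1D progress through the field in a phase-precession model.

```python
import math


def chord_progress(pos, heading, centre, radius):
    """Fraction (0 to 1) of the straight-line chord through a circular field
    already travelled by an agent at `pos` moving along `heading`.
    Returns None if the agent is outside the field."""
    px, py = pos[0] - centre[0], pos[1] - centre[1]
    norm = math.hypot(heading[0], heading[1])
    hx, hy = heading[0] / norm, heading[1] / norm
    # Points on the line are p + s*h; solve |p + s*h|^2 = r^2, i.e.
    # s^2 + 2*b*s + c = 0 with b = p.h and c = |p|^2 - r^2.
    b = px * hx + py * hy
    c = px * px + py * py - radius * radius
    disc = b * b - c
    if c > 0 or disc <= 0:
        return None
    root = math.sqrt(disc)
    s_entry, s_exit = -b - root, -b + root  # s_entry <= 0 <= s_exit inside the field
    return -s_entry / (s_exit - s_entry)


print(chord_progress((-0.5, 0.0), (1.0, 0.0), (0.0, 0.0), 1.0))  # 0.25
print(chord_progress((0.0, 0.6), (1.0, 0.0), (0.0, 0.0), 1.0))   # 0.5 (on a shorter chord)
```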
Theta sequenced STDP place cells show behaviourally biased skewing, a hallmark of successor representations
We next wanted to investigate how the similarities in weights between the STDP and TD successor representation models are conveyed in the downstream CA1 successor features. One hallmark of the successor representation is that strong biases in behaviour (for example, travelling one way round a circular track) make upcoming locations reliably predictable, which in turn causes a backward skewing in the resulting successor features (Stachenfeld et al., 2017). Such skewing, opposite to the direction of travel, has also been observed in hippocampal place cells (Mehta et al., 2000). Under strongly biased behaviour on the circular linear track, the biologically plausible STDP CA1 successor features (Equation 2) were very highly correlated with the TD successor features (Equation 3) predicted by successor theory (Figure 3a; $R^{2}=0.98\pm 0.01$). Both exhibited a pronounced backward skew, opposite to the direction of travel (mean TD vs. STDP successor feature skewness: $-0.39\pm 0.01$ vs. $-0.24\pm 0.07$). Furthermore, both the STDP and TD successor representation models predict that such biased behaviour should induce a backwards shift in the location of place field peaks (Figure 3a, left panel; TD vs. STDP successor feature shift in metres: $0.28\pm 0.00$ vs. $0.38\pm 0.03$). This phenomenon is also observed in hippocampal place cells (Mehta et al., 2000), and our model accounts for the observation that more shifting and skewing is observed in CA1 place cells than in CA3 place cells (Dong et al., 2021). As expected, when theta phase precession was removed from the model, no significant skew or shift was observed in the STDP successor features. Similarly, the skew in field shape and shift in field peak were not present when the behavioural bias was removed (Figure 3b) – in this unbiased scenario, the advantage of the STDP model with theta phase precession was modest relative to the same model without phase precession ($R^{2}=0.99\pm 0.01$ vs. $R^{2}=0.96\pm 0.01$).
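The skewness and peak-shift measures can be sketched by treating a 1D successor feature's firing-rate profile as a distribution over position. With rightward travel, a backward skew appears as negative skewness, and a backward peak shift appears as a negative displacement from the original basis-feature centre. This is an illustrative definition, assuming non-negative rates sampled on a grid; the function names are hypothetical.

```python
def field_skewness(xs, rates):
    """Skewness of a non-negative rate profile treated as a distribution over position."""
    total = sum(rates)
    mean = sum(x * r for x, r in zip(xs, rates)) / total
    var = sum(r * (x - mean) ** 2 for x, r in zip(xs, rates)) / total
    third = sum(r * (x - mean) ** 3 for x, r in zip(xs, rates)) / total
    return third / var ** 1.5


def peak_shift(xs, rates, original_centre):
    """Signed displacement of the rate peak from the basis feature's centre
    (negative = shifted backwards for an agent running towards +x)."""
    peak_index = max(range(len(rates)), key=rates.__getitem__)
    return xs[peak_index] - original_centre


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
backward_skewed = [0.5, 1.5, 3.0, 4.0, 0.5]  # long tail behind, sharp edge ahead
print(field_skewness(xs, backward_skewed) < 0)          # True
print(peak_shift(xs, [1.0, 3.0, 2.0, 1.0, 0.5], 1.0))   # -0.5
```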
Examining the activity of CA1 cells in the two-room open field environment, we found an increase in the eccentricity of fields close to the walls (Figure 3c & d; average eccentricity of STDP successor features near vs. far from wall: $0.57\pm 0.06$ vs. $0.33\pm 0.07$). In particular, this increased eccentricity is facilitated by a shorter field width along the axis perpendicular to the wall (Figure 3e), an effect observed experimentally in rodent place cells (Tanni et al., 2021). This increased eccentricity of cells near the wall remained when the behavioural bias to follow walls was removed (Figure 3d; average eccentricity with vs. without wall bias: $0.57\pm 0.06$ vs. $0.54\pm 0.06$). The effect is therefore primarily caused by the inherent bias imposed on behaviour by extended walls rather than by an explicit policy bias. Note that our ellipse fitting algorithm accounts for portions of the field that have been cut off by environmental boundaries (see Methods & Figure 3c), so this effect is not simply a product of basis features being occluded by walls.
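Field eccentricity can be estimated from the weighted second moments of a 2D rate map. The eigenvalues of the covariance matrix are proportional to the squared semi-axes of an equivalent ellipse, giving $e=\sqrt{1-\lambda_{\min}/\lambda_{\max}}$. This moment-based sketch is an assumption for illustration. Unlike the ellipse fitting described in the Methods, it does not correct for parts of a field cut off by walls.

```python
import math


def field_eccentricity(points, weights):
    """Eccentricity of the ellipse implied by the weighted covariance of a 2D field.
    points: list of (x, y) bin centres; weights: firing rates in those bins."""
    total = sum(weights)
    mx = sum(w * x for (x, _), w in zip(points, weights)) / total
    my = sum(w * y for (_, y), w in zip(points, weights)) / total
    sxx = sum(w * (x - mx) ** 2 for (x, _), w in zip(points, weights)) / total
    syy = sum(w * (y - my) ** 2 for (_, y), w in zip(points, weights)) / total
    sxy = sum(w * (x - mx) * (y - my) for (x, y), w in zip(points, weights)) / total
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major, lam_minor = half_trace + spread, half_trace - spread
    # Semi-axes scale with sqrt(eigenvalue), so e = sqrt(1 - b^2 / a^2)
    return math.sqrt(1.0 - lam_minor / lam_major)


circle = [(1, 0), (-1, 0), (0, 1), (0, -1)]
stretched = [(2, 0), (-2, 0), (0, 1), (0, -1)]
print(field_eccentricity(circle, [1, 1, 1, 1]))     # 0.0
print(field_eccentricity(stretched, [1, 1, 1, 1]))  # sqrt(0.75), about 0.866
```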
In a similar fashion, the bias in the motion model we used – which is predisposed to move between the two rooms – resulted in a shift in STDP successor feature peaks towards the doorway (Figure 3f & g; inwards shift in metres for STDP successor features near vs. far from doorway: $0.15\pm 0.06$ vs. $0.04\pm 0.05$; with doorway bias turned off: $0.05\pm 0.08$ vs. $0.04\pm 0.05$). At the level of individual cells, this was visible as an increased propensity for fields to extend into the neighbouring room after learning (Figure 3h). Hence, although basis features were initialised as two approximately non-overlapping populations – with only a small proportion of cells near the doorway extending into the neighbouring room – after learning many cells bind to those on the other side of the doorway, causing their place fields to diffuse through the doorway and into the other room (Figure 3f). This shift could partially explain why place cell activity is found to cluster around doorways (Spiers et al., 2015) and rewarded locations (Dupret et al., 2010) in electrophysiological experiments. Equally, a similar effect might plausibly underlie experimental observations of neural representations in multi-compartment environments. These typically begin heavily fragmented by boundaries and walls but, over time, adapt to form a smooth global representation (e.g. as observed in grid cells by Carpenter et al., 2015).
Multiscale successor representations are stored along the hippocampal dorsal-ventral axis by populations of differently sized place cells
Finally, we wanted to investigate whether the STDP learning rule was able to form successor representation-like connections between basis features of different scales. Recent experimental work has highlighted that place fields form a multiscale representation of space, which is particularly noticeable in larger environments (Tanni et al., 2021; Eliav et al., 2021), such as the one modelled here. Such multiscale spatial representations have been hypothesised to act as a substrate for learning successor features with different time horizons – large-scale place fields are able to make predictions of future location across longer time horizons, whereas place cells with smaller fields are better placed to make temporally fine-grained predictions. Agents could use such a set of multiscale successor features to plan actions at different levels of temporal abstraction, or to predict precisely which states they are likely to encounter soon (Momennejad and Howard, 2018). However, it is not known whether different sized place fields will form associations when subject to STDP coordinated by phase precession, nor what effect this would have on the resulting successor features. Hypothetically, consider a small basis feature cell with a receptive field entirely encompassed by that of a larger basis cell with no theta phase offset between the entry points of both fields. A potential consequence of theta phase precession is that the cell with the smaller field would phase precess faster through the theta cycle than the other cell. Initially, it would fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it would fire earlier. These periods of potentiation and depression instigated by STDP could act against each other. The extent to which they cancel each other out would depend on the relative placement of the two fields, their size difference, and the parameters of the learning rule.
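A toy calculation can check the nested-field scenario. Assume linear precession from 360° at field entry to 0° at exit, a hypothetical idealisation. A small field centred inside a larger one on a rightward run then fires later in the cycle than the large-field cell in its first half, and earlier in its second half. The sign of the spike-time difference, and hence of the STDP update, therefore flips at the shared centre. Field positions and function names are illustrative.

```python
def preferred_phase_deg(x, entry, exit_, start_deg=360.0, end_deg=0.0):
    """Hypothetical linear precession: preferred phase falls from start_deg at
    field entry to end_deg at field exit (rightward run)."""
    progress = (x - entry) / (exit_ - entry)
    return start_deg + (end_deg - start_deg) * progress


def small_minus_large_phase(x, small=(1.0, 2.0), large=(0.0, 3.0)):
    """Positive: the small-field cell fires later in the theta cycle than the
    large-field cell, so large -> small pairs potentiate under pre-before-post
    STDP; negative: it fires earlier, so those pairs depress."""
    return preferred_phase_deg(x, *small) - preferred_phase_deg(x, *large)


# Small field spans 1-2 m, concentric with a large field spanning 0-3 m
for x in (1.1, 1.5, 1.9):
    print(x, small_minus_large_phase(x))
```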
To test this, we simulated an agent learning according to our STDP model in the circular track environment with three simultaneous sets of differently sized basis features ($\sigma =0.5$, 1.0 and 1.5 m; Figure 4a). Such ordered variation in field size has been observed along the dorsoventral axis of the hippocampus (Kjelstrup et al., 2008; Strange et al., 2014; Figure 4b), and has been theorised to facilitate successor representation predictions across multiple timescales (Stachenfeld et al., 2017; Momennejad and Howard, 2018).
When we trained the STDP model on a population of homogeneously distributed multiscale basis features, the resulting weight matrix displayed binding across the different sizes regardless of the scale difference (Figure 4c, top). This in turn led to a population of downstream successor features with the same redundantly large scale (Figure 4c, bottom). The negative interaction between different sized fields was not sufficient to prevent binding and, as such, the place fields of small features are dominated by contributions from bindings to larger basis features. Conversely, when these multiscale basis features were ordered along the dorsoventral axis to prevent binding between the different scales (cells of the three scales were processed separately; Figure 4d, top), the multiscale structure was preserved in the resulting successor features (Figure 4d, bottom). We thus propose that place cell size can act as a proxy for the predictive time horizon, $\tau$, which sets the discount parameter $\gamma = e^{-\mathrm{d}t/\tau}$ of discrete Markov Decision Processes. However, for this effect to be meaningful, plasticity between cells of different scales must be minimised to prevent short timescales from being overwritten by longer ones; this segregation may plausibly be achieved by the observed size ordering along the hippocampal dorsal-ventral axis.
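The relationship between the predictive horizon and the discount factor can be sketched directly from $\gamma = e^{-\mathrm{d}t/\tau}$. For small time steps, the effective horizon $\mathrm{d}t/(1-\gamma)$ is close to $\tau$, so longer horizons correspond to discount factors closer to one. The time step, the example horizons and the function names are hypothetical. So is the pairing of horizons with small, medium and large fields.

```python
import math


def gamma_from_tau(tau_s, dt_s):
    """Per-step discount factor for a predictive horizon tau: gamma = exp(-dt / tau)."""
    return math.exp(-dt_s / tau_s)


def tau_from_gamma(gamma, dt_s):
    """Inverse mapping: tau = -dt / ln(gamma), for 0 < gamma < 1."""
    return -dt_s / math.log(gamma)


def effective_horizon_s(gamma, dt_s):
    """dt * sum_k gamma^k = dt / (1 - gamma); close to tau when dt << tau."""
    return dt_s / (1.0 - gamma)


DT = 0.01  # hypothetical simulation step in seconds
for tau in (0.5, 1.0, 2.0):  # illustrative horizons, e.g. small, medium, large fields
    g = gamma_from_tau(tau, DT)
    print(tau, round(g, 4), round(effective_horizon_s(g, DT), 3))
```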
Discussion
Successor representations store long-run transition statistics and allow for rapid prediction of future states (Dayan, 1993) – they are hypothesised to play a central role in mammalian navigation strategies (Stachenfeld et al., 2017; de Cothi and Barry, 2020). We show that Hebbian learning between spiking neurons, resembling the place fields found in CA3 and CA1, learns an accurate approximation to the successor representation when these neurons undergo phase precession with respect to the hippocampal theta rhythm. The approximation achieved by STDP explains a large proportion of the variance in the TD successor matrix and replicates hallmarks of successor representations (Stachenfeld et al., 2014; Stachenfeld et al., 2017; de Cothi and Barry, 2020) such as behaviourally biased place field skewing, elongation of place fields near walls, and clustering near doorways in both one- and two-dimensional environments.
That the predictive skew of place fields can be accomplished with an STDP-type learning rule is a long-standing hypothesis; in fact, the authors who originally reported this effect also proposed an STDP-type mechanism for learning these fields (Mehta et al., 2000; Mehta, 2001). Similarly, the possible accelerating effect of theta phase precession on sequence learning has also been described in a number of previous works (Jensen and Lisman, 1996; Koene et al., 2003; Reifenstein et al., 2021). Until recently (Fang et al., 2023; Bono et al., 2023), successor representation (SR) models have largely not connected with this literature. They either remain agnostic to the learning rule or assume temporal difference learning (Stachenfeld et al., 2014; Stachenfeld et al., 2017; de Cothi and Barry, 2020; Geerts et al., 2020; Vértes and Sahani, 2019). TD learning has been well mapped onto striatal mechanisms (Schultz et al., 1997; Seymour et al., 2004), but it is unclear how it is implemented in hippocampus. Thus, one contribution of this paper is to quantitatively and qualitatively compare theta-augmented STDP to temporal difference learning, and to demonstrate where these functionally overlap. This explicit link permits some insights about the physiology, such as the observation that the biologically observed parameters for phase precession and STDP resemble those that are optimal for learning the SR (Figure 2—figure supplement 3), and that the topographic organisation of place cell sizes is useful for learning representations over multiple discount timescales (Figure 4). It also permits some insights for reinforcement learning (RL), such as that the approximate SR learned with theta-augmented STDP, while provably different from TD (Section: A theoretical connection between STDP and TD learning), is sufficient to capture key qualitative phenomena.
Theta phase precession has a dual effect, not only allowing learning by compressing trajectories to within STDP timescales but also accelerating convergence to a stable representation by arranging the spikes from cells along the current trajectory to arrive in the order those cells are actually encountered (Jensen and Lisman, 1996; Koene et al., 2003). Without theta phase precession, STDP fails to learn a successor representation reflecting the current policy unless that policy is approximately unbiased. Further, by instantiating a population of place cells with multiple scales, we show that topographical ordering of these place cells by size along the dorsoventral hippocampal axis is a necessary feature to prevent small discount timescale successor representations from being overwritten by longer ones. Last, performing a grid search over STDP learning parameters, we show that those values selected by evolution are approximately optimal for learning successor representations. This finding is compatible with the idea that the necessity to rapidly learn predictive maps by STDP has been a primary factor driving the evolution of synaptic learning rules in hippocampus.
While the model is biologically plausible in several respects, there remain a number of aspects of the biology that we do not interface with, such as different cell types, interneurons and membrane dynamics. Further, we do not consider anything beyond the simplest model of phase precession, in which theta sweeps arise directly rather than developing and synchronising across place cells over time (Feng et al., 2015). Rather, our philosophy is to reconsider the most pressing issues with the standard model of predictive map learning in the context of hippocampus (e.g. the absence of dopaminergic error signals in CA1 and the inadequacy of synaptic plasticity timescales). We believe this minimalism is helpful, both for interpreting the results presented here and for providing a foundation on which further work may examine these biological intricacies, such as whether the model’s theta sweeps can alternately represent future routes (Kay et al., 2020), for example by the inclusion of attractor dynamics (Chu et al., 2022). Still, we show this simple model is robust to the observed variation in phase offsets between phase-precessing CA3 and CA1 place cells across different stages of the theta cycle (Mizuseki et al., 2012). In particular, this phase offset is most pronounced as animals enter a field ($\sim {90}^{\circ}$) and is almost completely reduced by the time they leave it ($\sim {90}^{\circ}$; Figure 2—figure supplement 4g). Essentially, our model hypothesises that the majority of plasticity induced by STDP and theta phase precession will take place in the latter part of place fields, equating to earlier theta phases. Notably, this is in keeping with experimental data showing enhanced coupling between CA3 and CA1 in these early theta phases (Colgin et al., 2009; Hasselmo et al., 2002). 
However, as our simulations show (Figure 2—figure supplement 4g), even if these assumptions do not hold true, the model is sufficiently robust to generate SR equivalent weight matrices for a range of possible phase offsets between CA3 and CA1.
Our model extends previous work – which required successor features to recursively expand in order to make long range predictions (e.g. as demonstrated in Brea et al., 2016; Bono et al., 2023) – by exploiting the existence of temporally compressed theta sweeps (O’Keefe and Recce, 1993; Skaggs et al., 1996b), allowing place cells with distant fields to bind directly without intermediaries or ‘bootstrapping’. This configuration yields several advantages. First, learning with theta sweeps converges considerably faster than without them. Biologically, it is likely that successor feature learning via Hebbian learning alone (without theta precession) would be too slow to account for the rapid stabilisation of place cells in new environments at behavioural time scales (Bittner et al., 2017) – Dong et al. observed place fields in CA1 to increase in width for approximately the first 10 laps around a 3 m track (Dong et al., 2021). This timescale is well matched by our model with theta sweeps in which CA1 place cells reach 75% of their final extent after 5 min (or 9.6 laps) of exploration on a 5 m track but is markedly slower without theta sweeps.
Second, as well as extending previous work to large two-dimensional environments and complex movement policies, our model also uses realistic population codes of overlapping Gaussian features. These naturally present a hard problem for models of spiking Hebbian learning since, in the absence of theta sweeps, the order in which features are encountered is not encoded reliably in the relative timing or order of their spikes at synaptic timescales. Theta sweeps address this by tending to sequence spikes according to the order in which their originating fields are encountered. Indeed, our preliminary experiments show that when theta sweeps are absent the STDP successor features show little similarity to the TD successor features. Our work is thus particularly relevant in light of a recent trend to focus on biologically plausible features for reinforcement learning (Gustafson and Daw, 2011; de Cothi and Barry, 2020).
Other contemporary theoretical works have made progress on biological mechanisms for implementing the successor representation algorithm using somewhat different but complementary approaches. Of particular note are the works by Fang et al., 2023, who show a recurrent network with weights trained via a Hebbian-like learning rule converges to the successor representation in steady state, and Bono et al., 2023 who derive a learning rule for a spiking feedforward network which learns the SR of one-hot features by bootstrapping associations across time (see also Brea et al., 2016). Combined, the above models, as well as our own, suggest there may be multiple means of calculating successor features in biological circuits without requiring a direct implementation of temporal difference learning.
Our theory makes the prediction that theta contributes to learning predictive representations, but is not necessary to maintain them. Thus, inhibiting theta oscillations during exposure to a novel environment should impact the formation of successor features (e.g. asymmetric backwards skew of place fields) and subsequent memory-guided navigation. However, inhibiting theta in a familiar environment in which experience-dependent changes have already occurred should have little effect on the place fields: that is, some asymmetric backwards skew of place fields should be intact even with theta oscillations disrupted. To our knowledge, this has not been directly measured, but there are some experiments that provide hints. Experimental work has shown that power in the theta band increases upon exposure to novel environments (Cavanagh et al., 2012); our work suggests this is because theta phase precession is critical for learning and updating stored predictive maps for spatial navigation. Furthermore, it has been shown that place cell firing can remain broadly intact in familiar environments even with theta oscillations disrupted by temporary inactivation or cooling (Bolding et al., 2020; Petersen and Buzsáki, 2020). It is worth noting, however, that even with intact place fields, these theta disruptions impair the ability of rodents to reach a hidden goal location that had already been learned, suggesting theta oscillations play a role in navigation behaviours even after initial learning (Bolding et al., 2020; Petersen and Buzsáki, 2020). Other work has also shown that muscimol inactivation of the medial septum can disrupt acquisition and retrieval of the memory of a hidden goal location (Chrobak et al., 1989; Rashidy-Pour et al., 1996), although it is worth noting that these papers use muscimol lesions, which Bolding and colleagues show also disrupt place-related firing, not just theta precession.
The SR model has a number of connections to other models from the computational hippocampal literature that bear on the interpretation of these results. A longstanding property of computational models in the hippocampus literature is a factorisation of spatial and reward representations (Redish and Touretzky, 1998; Burgess et al., 1997; Koene et al., 2003; Hasselmo and Eichenbaum, 2005; Erdem and Hasselmo, 2012), which permits spatial navigation to rapidly adapt to changing goal locations. Even within RL, the SR is not unique in factorising spatial and reward representations, as purely model-based approaches do this too (Dayan, 1993; Sutton and Barto, 1998; Daw, 2012). The SR occupies a much narrower niche: factorising reward from spatial representations while caching long-term occupancy predictions (Dayan, 1993; Gershman, 2018). Thus, it may be possible to retain some of the flexibility of model-based approaches while keeping the rapid computation of model-free learning.
A number of other models describe how physiological and anatomical properties of hippocampus may produce circuits capable of goal-directed spatial navigation (Erdem and Hasselmo, 2012; Redish and Touretzky, 1998; Koene et al., 2003). These models adopt an approach more characteristic of model-based RL, searching iteratively over possible directions or paths to a goal (Erdem and Hasselmo, 2012) or replaying sequences to build an optimal transition model from which sampled trajectories converge toward a goal (Redish and Touretzky, 1998) (this model bears some similarities to the SR that are explored by Fang et al., 2023, which shows dynamics converge to SR under a similar form of learning). These models rely on dynamics to compute the optimal trajectory, while the SR realises the statistics of these dynamics in the rate code and can therefore adapt very efficiently. Thus, the SR retains some efficiency benefits. These models are very well grounded in known properties of hippocampal physiology, including theta precession and STDP, whereas until recently, SR models have enjoyed a much looser affiliation with exact biological mechanisms. Thus, a primary goal of this work is to explore how hippocampal physiological properties relate to SR learning as well.
More generally, in principle, any form of sufficiently ordered and compressed trajectory would allow STDP plasticity to approximate a successor representation. Hippocampal replay is a well-documented phenomenon in which previously experienced trajectories are rapidly recapitulated during sharp-wave ripple events (Wilson and McNaughton, 1994), within which spikes show a form of phase precession relative to the ripple band oscillation (150–250 Hz; Bush et al., 2022). Thus, our model might explain the abundance of sharp-wave ripples during early exposure to novel environments (Cheng and Frank, 2008): when new ‘informative’ trajectories, for example those which lead to reward, are experienced, it is desirable to rapidly incorporate this information into the existing predictive map (Mattar and Daw, 2018).
The distribution of place cell receptive field size in hippocampus is not homogeneous. Instead, place field size grows smoothly along the longitudinal axis (from very small in dorsal regions to very large in ventral regions). Why this is the case is not clear – our model contributes by showing that, without this ordering, large and small place cells would all bind via STDP, essentially overwriting the short timescale successor representations learnt by small place cells with long timescale successor representations. Topographically organising place cells by size anatomically segregates place cells with fields of different sizes, preserving the multiscale successor representations. Further, our results exploring the effect of different phase offsets on STDP successor learning (Figure 2—figure supplement 4g) suggest that the gradient of phase offsets observed along the dorsoventral axis (Lubenov and Siapas, 2009; Patel et al., 2012) is insufficient to impair the plasticity induced by STDP and phase precession. The premise that such separation is needed to learn multiscale successor representations is compatible with other theoretical accounts for this ordering. Specifically, Momennejad and Howard, 2018 showed that exploiting multiscale successor representations downstream, in order to recover information which is ‘lost’ in the process of compiling state transitions into a single successor representation, typically requires calculating the derivative of the successor representation with respect to the discount parameter. This derivative calculation is significantly easier if the cells – and therefore the successor representations – are ordered smoothly along the hippocampal axis.
Work in control theory has shown that the difficult reinforcement learning problem of finding an optimal policy and value function for a given environment becomes tractable if the policy is constrained to be near a ‘default policy’ (Todorov, 2009). When applied to spatial navigation, the optimal value function resembles the value function calculated using a successor representation for the default policy. This solution allows for rapid adaptation to changes in the reward structure since the successor matrix is fixed to the default policy and need not be relearnt even if the optimal policy changes. Building on this, recent work suggested the goal of hippocampus is not to learn the successor representation for the current policy but rather for a default diffusive policy (Piray and Daw, 2021).
Indeed, we found that in the absence of theta sweeps, the STDP rule learns a successor representation close to that of an unbiased policy, rather than the current policy. This is because without theta sweeps to order spikes along the current trajectory, cells bind according to how overlapping their receptive fields are, that is, according to how close they are under a ‘diffusive’ policy. In this context it is interesting to note that a substantial proportion of CA3 place cells do not exhibit significant phase precession (O’Keefe and Recce, 1993; Jeewajee et al., 2014). One possibility is that these place cells with weak or absent phase precession contribute to learning a policy-independent ‘default representation’, useful for rapid policy prediction when the reward structure of an environment is changed. Simultaneously, theta-precessing place cells may learn a successor representation for the current (potentially biased) policy, in total giving the animal access to both an off-policy-but-near-optimal value function and an on-policy-but-suboptimal value function.
Finally, we comment on the approximate nature of the successor representations learnt by our biologically plausible model. The STDP successor features described here are unlikely to converge analytically to the TD successor features. Potentially, this implies that a value function calculated according to Equation 31 would not be accurate and may prevent an agent from acting optimally. There are several possible resolutions to this point. First, the successor representation is unlikely to be a self-contained reinforcement learning system. In reality, it likely interacts with other model-based or model-free systems acting in other brain regions, such as the nucleus accumbens in striatum (Lisman and Grace, 2005). Plausibly, errors in the successor features are corrected for by counteracting adjustments in the reward weights implemented by some downstream model-free, error-based learning system. Alternatively, it is likely that the value function learnt by the brain is either fundamentally approximate or uses a different, less tractable, temporal discounting scheme. Ultimately, although in principle specialised and expensive learning rules might be developed to exactly replicate TD successor features in the brain, this may be undesirable if a simple learning rule (STDP) is adequate in most circumstances. Indeed, animals – including humans – are known to act suboptimally (Zentall, 2015; de Cothi et al., 2022), perhaps in part because of a reliance on STDP learning rules in order to learn long-range associations.
Methods
General summary of the model
The model comprises an agent exploring a maze, whose position $\mathbf{x}$ at time $t$ is encoded by the instantaneous firing of a population of $N$ CA3 basis features, ${f}_{j}(\mathbf{x},t)$ for $j\in \{1,\dots,N\}$. Each has a spatial receptive field given by a thresholded Gaussian of peak firing rate 5 Hz:
where ${\mathbf{x}}_{j}$ is the location of the field peak, $\sigma =1\mathrm{m}$ is the standard deviation and $c$ is a positive constant that keeps ${f}_{j}^{x}$ continuous at the threshold.
The theta phase of the hippocampal local field potential oscillates at 10 Hz and is denoted by ${\varphi}_{\theta}(t)\in [0,2\pi ]$. Phase precession suppresses the firing rate of a basis feature for all but a short period within each theta cycle. This period (and consequently the time at which spikes are produced, described in more detail below) precesses earlier in each theta cycle as the agent crosses the spatial receptive field. Specifically, this is implemented by multiplying the spatial firing rate ${f}_{j}^{x}$ by a theta modulation factor which rises and falls according to a von Mises distribution in each theta cycle, peaking at a ‘preferred phase’, ${\varphi}_{j}^{*}$, which depends on how far through the receptive field the agent has travelled (hence the spike timings implicitly encode location):
where $\kappa =1$ is the concentration parameter of the von Mises distribution. These basis features in turn drive a population of $N$ downstream ‘STDP successor features’ (Equation 2).
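Equations 1 and 2 are cited here but did not survive extraction. Read off the surrounding definitions, they presumably take the following form; this is a reconstruction for orientation, not a verbatim copy of the authors' equations. During learning, $W_{ij}$ is replaced by the fixed anchoring weights $W^{A}_{ij}$ described under Synaptic learning via STDP.

```latex
\begin{aligned}
f_j(\mathbf{x}, \varphi_\theta) &= f_j^{x}(\mathbf{x})\, f_j^{\theta}(\mathbf{x}, \varphi_\theta), \\
\tilde{\psi}_i(\mathbf{x}, \varphi_\theta) &= \sum_{j=1}^{N} W_{ij}\, f_j(\mathbf{x}, \varphi_\theta).
\end{aligned}
```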
Firing rates of both populations (${f}_{j}(\mathbf{x},{\varphi}_{\theta})$ and ${\stackrel{~}{\psi}}_{i}(\mathbf{x},{\varphi}_{\theta})$) are converted to spike trains according to an inhomogeneous Poisson process. These spikes drive learning in the synaptic weight matrix, ${W}_{ij}$, according to an STDP learning rule (details below). In summary, if a presynaptic CA3 basis feature fires immediately before a postsynaptic CA1 successor feature, the binding strength between these cells is strengthened. Conversely, if they fire in the opposite order, their binding strength is weakened.
For comparison, we also implement successor feature learning using a temporal difference (TD) learning rule, referred to as ‘TD successor features’, ${\psi}_{i}(\mathbf{x})$, to provide a ground truth against which we compare the STDP successor features. Like STDP successor features, these are constructed as a linear combination of basis features (Equation 3).
Temporal difference learning updates ${M}_{ij}$ as follows
where ${\delta}_{ij}^{\text{TD}}$ is the temporal difference error, which we derive below. In reinforcement learning, the temporal difference error is used to learn discounted value functions (successor features can be considered a special type of value function). It works by comparing a sampled, bootstrapped estimate of the value function (the observed reward plus the discounted current estimate at the next state) to the currently held estimate. The difference between these is known as the temporal difference error and is used to update the value estimate until, eventually, it converges on (or close to) the true value function.
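Equation 3 and the update referred to above are missing from the extracted text. Based on the linear construction described here and on the TD(0) derivation later in this section, they presumably read as follows, where, by our reading of that derivation, the per-weight error is the TD error of successor feature $i$ times the presynaptic rate (up to the small L2 term introduced later). This is a reconstruction, not a verbatim copy.

```latex
\begin{aligned}
\psi_i(\mathbf{x}) &= \sum_{j=1}^{N} M_{ij}\, f_j(\mathbf{x}), \\
M_{ij} &\leftarrow M_{ij} + \eta\, \delta_{ij}^{\text{TD}},
\qquad \delta_{ij}^{\text{TD}} \approx \delta_i\, f_j(\mathbf{x}).
\end{aligned}
```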
Definition of TD successor features and TD successor matrix
Phase precession model details
In our hippocampal model, CA3 place cells, referred to as basis features and indexed by $j$, have thresholded Gaussian receptive fields. The threshold radius is $\sigma =1$ m and the peak firing rate is $F=5$ Hz. Mathematically, this is written as
where ${[f(x)]}_{+}=\max(0,f(x))$, ${\mathbf{x}}_{j}$ is the centre of the receptive field and $\mathbf{x}(t)$ is the current location of the agent.
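The displayed equation for the spatial receptive field did not survive extraction. The sketch below is one form consistent with the stated constraints: peak rate $F = 5$ Hz at the field centre, zero beyond radius $\sigma = 1$ m, and continuous at that threshold. The offset $c = e^{-1/2}$ and the rescaling by $1/(1-c)$, which keeps the peak at exactly $F$, are our assumptions rather than the authors' confirmed parameterisation; the function name `spatial_rate` is hypothetical.

```python
import math

F_PEAK = 5.0  # peak firing rate F in Hz (from the text)
SIGMA = 1.0   # field width and threshold radius sigma in metres (from the text)
# Assumed offset: exp(-1/2) is the Gaussian's value at radius sigma, so
# subtracting it makes the field fall continuously to zero at the threshold.
C = math.exp(-0.5)


def spatial_rate(x, x_j, F=F_PEAK, sigma=SIGMA, c=C):
    """Thresholded Gaussian rate f_j^x(x) of a basis feature centred at x_j.

    x and x_j are coordinate tuples of equal length (1D or 2D). Dividing by
    (1 - c) is an assumption that keeps the peak rate at exactly F.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_j))
    gauss = math.exp(-sq_dist / (2 * sigma ** 2))
    return F * max(0.0, gauss - c) / (1 - c)


print(spatial_rate((0.0, 0.0), (0.0, 0.0)))  # field centre: the peak rate F
print(spatial_rate((0.5, 0.0), (0.0, 0.0)))  # inside the field: between 0 and F
print(spatial_rate((1.5, 0.0), (0.0, 0.0)))  # beyond radius sigma: 0.0
```

Subtracting the constant, rather than simply truncating the Gaussian, is what avoids a jump in firing rate at the field edge.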
Phase precession is implemented by multiplying the spatial firing rate, ${f}_{j}^{x}(\mathbf{x})$, by a phase precession factor
where ${f}_{\text{VM}}(x\mid\mu ,\kappa )$ denotes the circular von Mises distribution on $x\in (0,2\pi ]$ with mean $\mu ={\varphi}_{j}^{\ast}(\mathbf{x})$ and spread parameter $\kappa =1$. This factor is large only when the current theta phase,
which oscillates at ${\nu}_{\theta}=10$ Hz, is close to the cell’s ‘preferred’ theta phase,
${d}_{j}(\mathbf{x}(t))\in [-1,1]$ tracks how far through the cell’s spatial receptive field, as measured in units of $\sigma $, the agent has travelled:
In instances where the agent travels directly across the centre of a cell (as is the case in 1D environments), $(\mathbf{x}(t)-{\mathbf{x}}_{j})$ and the normalised velocity $\frac{\dot{\mathbf{x}}(t)}{\Vert \dot{\mathbf{x}}(t)\Vert}$ (a vector of length 1, pointing in the direction of travel) are parallel, such that ${d}_{j}(\mathbf{x})$ progresses smoothly in time from its minimum, $-1$, to its maximum, $1$. In general, however, this extends to any arbitrary curved path an agent might take across the cell and matches the model used in Jeewajee et al., 2014. We fit $\beta $ and $\kappa $ to biological data in Figure 5a of Jeewajee et al., 2014 ($\beta =0.5$, $\kappa =1$). The factor of $2\pi$ normalises this term: although the instantaneous firing may briefly rise above the spatial firing rate ${f}_{j}^{x}(\mathbf{x})$, the average firing rate over the entire theta cycle is still given by the spatial factor ${f}_{j}^{x}(\mathbf{x})$. In total, the instantaneous firing rate of the basis feature is given by the product of the spatial and phase precession factors (Equation 1).
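The equation for the preferred phase is missing from the extracted text. The sketch below assumes a linear mapping $\varphi_j^{*} = \pi - \beta\pi\, d_j$, which moves the preferred phase earlier as the agent crosses the field, as described above; that exact form is our assumption, not a verbatim copy. $I_0$ is evaluated from its power series so that only the Python standard library is needed, and all function names are hypothetical. Multiplying `theta_factor` by the spatial rate gives the instantaneous rate of Equation 1.

```python
import math

BETA = 0.5    # precession slope fitted in the text
KAPPA = 1.0   # von Mises concentration fitted in the text
SIGMA = 1.0   # field radius in metres


def bessel_i0(k, terms=30):
    """Modified Bessel function I0(k) from its power series (fine for small k)."""
    return sum((k / 2) ** (2 * n) / math.factorial(n) ** 2 for n in range(terms))


def progress_through_field(x, x_j, velocity, sigma=SIGMA):
    """d_j: displacement from the field centre, in units of sigma, projected
    onto the unit vector along the direction of travel."""
    speed = math.sqrt(sum(v * v for v in velocity))
    return sum((a - b) / sigma * (v / speed) for a, b, v in zip(x, x_j, velocity))


def preferred_phase(d, beta=BETA):
    """Assumed linear precession: late phase on entry (d = -1), early on exit (d = +1)."""
    return (math.pi - beta * math.pi * d) % (2 * math.pi)


def theta_factor(phase, d, kappa=KAPPA, beta=BETA):
    """2*pi times the von Mises density of the current phase around the preferred phase."""
    mu = preferred_phase(d, beta)
    return math.exp(kappa * math.cos(phase - mu)) / bessel_i0(kappa)


# Agent moving in +x through a field centred at the origin.
d_entry = progress_through_field((-1.0, 0.0), (0.0, 0.0), (0.2, 0.0))
d_exit = progress_through_field((1.0, 0.0), (0.0, 0.0), (0.2, 0.0))
print(preferred_phase(d_entry), preferred_phase(d_exit))  # ~4.712 (3*pi/2) then ~1.571 (pi/2)
```

With $\beta = 0.5$ the preferred phase under this assumed mapping sweeps from $3\pi/2$ on entry to $\pi/2$ on exit, i.e. half a theta cycle per field crossing.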
Note that the firing rate of a cell depends explicitly on the agent’s location through the spatial receptive field (its ‘rate code’) and implicitly on location through the phase precession factor (its ‘spike-time code’), where location dependence is hidden inside the calculation of the preferred theta phase. Notably, the effect of phase precession is only visible on rapid ‘sub-theta’ timescales. Its effect disappears when averaging over any timescale ${T}_{av}$ substantially longer than the theta timescale of ${T}_{\theta}=0.1$ s:
This is important since it implies that the effect of phase precession is only important for synaptic processes with very short integration timescales, for example, STDP.
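The averaging equation is missing from the extracted text, but the property it expresses can be checked numerically. Assuming the agent barely moves within one 100 ms cycle (so the preferred phase is fixed within the cycle), the theta factor $2\pi f_{\text{VM}}$ averages to exactly 1 over a full cycle, whatever the preferred phase, so the time-averaged rate reduces to $f_j^x(\mathbf{x})$. The helper names below are hypothetical.

```python
import math

KAPPA = 1.0  # von Mises concentration, as in the text


def theta_factor(phase, mu, kappa=KAPPA):
    """2*pi times the von Mises density: exp(kappa*cos(phase - mu)) / I0(kappa)."""
    i0 = sum((kappa / 2) ** (2 * n) / math.factorial(n) ** 2 for n in range(30))
    return math.exp(kappa * math.cos(phase - mu)) / i0


def cycle_average(mu, n_samples=1000):
    """Mean of the theta factor over one full theta cycle, sampled uniformly in phase."""
    step = 2 * math.pi / n_samples
    return sum(theta_factor(k * step, mu) for k in range(n_samples)) / n_samples


# Whatever the preferred phase, the cycle average is 1, so averaging
# f_j^x * f_j^theta over many cycles recovers the spatial rate f_j^x.
for mu in (0.5 * math.pi, math.pi, 1.5 * math.pi):
    print(round(cycle_average(mu), 6))
```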
Our phase precession model is ‘independent’ (essentially identical to Chadwick et al., 2015) in the sense that each place cell phase precesses independently of what the other place cells are doing. In this model, phase precession directly leads to theta sweeps, as shown in Figure 1. Another class of models, referred to as ‘coordinated assembly’ models (Harris, 2005), hypothesises that internal dynamics drive theta sweeps within each cycle because assemblies (aka place cells) dynamically excite one another in a temporal chain. In these models, theta sweeps directly lead to phase precession. Feng and colleagues draw a distinction between theta precession and theta sequence, observing that while independent theta precession is evident right away in novel environments, longer and more stereotyped theta sequences develop over time (Feng et al., 2015). Since we are considering the effect of theta precession on the formation of place field shape, the independent model is appropriate for this setting. We believe that considering how our model might relate to the formation of theta sequences, or what implications theta sequences have for this model, is an exciting direction for future work.
Synaptic learning via STDP
STDP is a discrete learning rule: if a presynaptic neuron $j$ fires before a postsynaptic neuron $i$ their binding strength ${W}_{ij}$ is potentiated, conversely if the postsynaptic neuron fires before the presynaptic then weight is depressed. This is implemented as follows.
First, we convert the firing rates to spike trains. For each neuron, we sample spikes from an inhomogeneous Poisson process with rate parameter ${f}_{j}(\mathbf{x},t)$ (for presynaptic basis features) or ${\stackrel{~}{\psi}}_{i}(\mathbf{x},t)$ (for postsynaptic successor features). This is done over the period $[0,T]$ across which the animal is exploring.
Asymmetric Hebbian STDP is implemented online using a trace learning rule. Each presynaptic spike from a CA3 cell, indexed $j$, increments an otherwise decaying memory trace, ${T}_{j}^{\text{pre}}(t)$, and likewise each postsynaptic spike from a CA1 cell, indexed $i$, increments an analogous trace, ${T}_{i}^{\text{post}}(t)$. We matched the STDP plasticity window decay times to experimental data: ${\tau}^{\text{pre}}=20$ ms and ${\tau}^{\text{post}}=40$ ms (Bush et al., 2010).
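The trace dynamics are not written out in the extracted text. One standard way to express such traces, assumed here, is as a sum of decaying exponentials over past spikes, where $t^{\text{pre}}_{j,k}$ and $t^{\text{post}}_{i,k}$ denote the $k$-th spike times of CA3 cell $j$ and CA1 cell $i$ (notation introduced for this sketch). Under this reading, each CA1 spike potentiates $W_{ij}$ in proportion to $a^{\text{pre}}T^{\text{pre}}_j$, and each CA3 spike depresses it in proportion to $a^{\text{post}}T^{\text{post}}_i$, with $a^{\text{post}}$ negative.

```latex
\begin{aligned}
T_j^{\text{pre}}(t) &= \sum_{k:\, t^{\text{pre}}_{j,k} \le t} e^{-(t - t^{\text{pre}}_{j,k})/\tau^{\text{pre}}}, \\
T_i^{\text{post}}(t) &= \sum_{k:\, t^{\text{post}}_{i,k} \le t} e^{-(t - t^{\text{post}}_{i,k})/\tau^{\text{post}}}.
\end{aligned}
```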
We simplify our model by fixing weights during learning:
where we will refer to ${W}_{ij}^{A}$ as the “anchoring” weights which, up until now, have been set to the identity, ${W}_{ij}^{A}={\delta}_{ij}$. Since ${f}_{j}(\mathbf{x},t)$ are the phase-precessing features, ${\stackrel{~}{\psi}}_{i}(\mathbf{x},t)$ also inherits phase precession from these features, mapped through ${W}_{ij}^{A}$. Fixing the weights means that, during learning, changes in ${W}_{ij}$ are not propagated to the successor features (CA1); their influence is only considered during post-learning recall, broadly analogous to the distinct encoding and retrieval phases that have been hypothesised to underpin hippocampal function (Hasselmo et al., 2002). We relax this assumption in Figure 2—figure supplement 2 and allow ${W}_{ij}$ to be updated online, showing this isn’t essential.
After a period, $[0,T]$ of exploration the synaptic weights are updated on aggregate to account for STDP.
where the second term accounts for the cumulative potentiation and depression due to STDP from spikes in the CA3 and CA1 populations. $\eta $ is the learning rate (here set to 0.01), and ${a}^{\text{pre}}$ and ${a}^{\text{post}}$ give the relative amounts of pre-before-post potentiation and post-before-pre depression, set to match experimental data from Bi and Poo, 1998 as $1$ and $-0.4$ respectively. The weights are initialised to the identity: ${W}_{ij}(0)={\delta}_{ij}$.
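The aggregate update equation is missing from the extracted text. The sketch below implements a pair-based trace rule of the kind described, using the constants quoted above, on explicit lists of spike times rather than Poisson-sampled trains. The function names are hypothetical, and it makes no claim to match the authors' code (for example, their handling of exactly simultaneous spikes).

```python
import math

TAU_PRE = 0.020   # s, presynaptic trace decay time (from the text)
TAU_POST = 0.040  # s, postsynaptic trace decay time (from the text)
A_PRE = 1.0       # pre-before-post potentiation amplitude (from the text)
A_POST = -0.4     # post-before-pre depression amplitude (from the text)
ETA = 0.01        # learning rate (from the text)


def stdp_weight_change(pre_spikes, post_spikes, eta=ETA):
    """Cumulative STDP change for one synapse from spike-time lists (seconds).

    Each postsynaptic spike adds A_PRE times the presynaptic trace at that
    moment; each presynaptic spike adds A_POST (negative) times the
    postsynaptic trace. A trace is a sum of decaying exponentials over the
    earlier spikes of its own cell.
    """
    dw = 0.0
    for t_post in post_spikes:
        pre_trace = sum(math.exp(-(t_post - t) / TAU_PRE) for t in pre_spikes if t < t_post)
        dw += A_PRE * pre_trace
    for t_pre in pre_spikes:
        post_trace = sum(math.exp(-(t_pre - t) / TAU_POST) for t in post_spikes if t < t_pre)
        dw += A_POST * post_trace
    return eta * dw


def aggregate_update(W, pre_trains, post_trains):
    """Return W + dW, where W[i][j] is the CA3 cell j -> CA1 cell i weight."""
    return [[W[i][j] + stdp_weight_change(pre_trains[j], post_trains[i])
             for j in range(len(pre_trains))]
            for i in range(len(post_trains))]


# Two CA3 cells firing 10 ms apart, each driving "its" CA1 cell 5 ms later.
W0 = [[1.0, 0.0], [0.0, 1.0]]  # identity initialisation, as in the text
pre = [[0.100], [0.110]]       # CA3 spike times, cells j = 0, 1
post = [[0.105], [0.115]]      # CA1 spike times, cells i = 0, 1
W1 = aggregate_update(W0, pre, post)
print(W1)  # W1[1][0] grows (cell 0 fires before cell 1); W1[0][1] becomes negative
```

The asymmetry of the two time constants and amplitudes means a pre-before-post pairing is rewarded more strongly than the reverse pairing is punished.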
Finally, when analysing the successor features after learning, we use the updated weight matrix rather than the anchoring weights (and turn off phase precession, since we are only interested in rate maps).
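The two readout equations referred to in this subsection, during and after learning, are missing from the extracted text. From the surrounding description they presumably take the following form: phase-precessing features mapped through the fixed anchoring weights during learning, and purely spatial features mapped through the learned weights afterwards (a reconstruction, not a verbatim copy).

```latex
\begin{aligned}
\text{during learning:}\quad \tilde{\psi}_i(\mathbf{x}, t) &= \sum_{j} W^{A}_{ij}\, f_j(\mathbf{x}, t), \\
\text{after learning:}\quad \tilde{\psi}_i(\mathbf{x}) &= \sum_{j} W_{ij}\, f^{x}_j(\mathbf{x}).
\end{aligned}
```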
Temporal difference learning
To test our hypothesis that STDP is a good approximation to TD learning we simultaneously computed the TD successor features defined as the total expected future firing of a basis feature:
$\tau $ is the temporal discounting time horizon (related to $\gamma $, the discount factor used in reinforcement learning on temporally discretised MDPs, by $\gamma ={e}^{-dt/\tau}$), and the expectation is over trajectories initiated at position $\mathbf{x}$. This formula explains the one-to-one correspondence between CA3 cells and CA1 cells in our hippocampal model (Figure 1b): each CA1 cell, indexed $i$, learns to approximate the TD successor feature for its target basis feature, also indexed $i$. We set the discount timescale to $\tau =4$ s to match relevant behavioural timescales for an animal exploring a small maze environment, where behavioural decisions, such as whether to turn left or right, need to be made with respect to optimising future rewards occurring on the order of seconds.
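The defining equation for the TD successor features (Equation 19) did not survive extraction. A form consistent with the surrounding text and with the derivation below is shown here, together with a worked value of the discount factor. The normalisation (for example, whether a $1/\tau$ prefactor is included) is our assumption, and $dt = 0.1$ s is the update interval quoted later in this section.

```latex
\[
\psi_i(\mathbf{x}) = \mathbb{E}\left[\int_{t}^{\infty} e^{-(t'-t)/\tau}\, f_i\big(\mathbf{x}(t')\big)\, dt' \;\middle|\; \mathbf{x}(t) = \mathbf{x}\right]
\]
\[
\gamma = e^{-dt/\tau} = e^{-0.1/4} \approx 0.9753,
\qquad \frac{1}{1-\gamma} \approx 40.5 \approx \frac{\tau}{dt} = 40 \text{ steps}
\]
```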
We learn these successor features by tuning the weights of a linear decomposition over the basis feature set:
this way we can directly compare ${M}_{ij}$ to the STDP weight matrix ${W}_{ij}$.
Our TD successor matrix, ${M}_{ij}$, should not be confused with the successor representation as defined in Stachenfeld et al., 2017 and denoted $M({s}_{i},{s}_{j})$, although they are analogous. ${M}_{ij}$ can be thought of as an analogue of $M({s}_{i},{s}_{j})$ for spatially continuous (i.e. not one-hot) basis features; we show below that, in the limit of discrete, one-hot place cells, the two coincide up to a transpose (strictly, $M({s}_{i},{s}_{j})={M}_{ji}$).
Temporal difference learning
The temporal difference (TD) update rule is used to learn the TD successor matrix (Equation 20). The standard TD(0) learning rule for a linear value function, ${\psi}_{i}(\mathbf{x})$, with basis feature weights ${M}_{ij}$ is (Sutton and Barto, 1998):
where ${\delta}_{i}$ is the observed TD error for the ${i}^{\text{th}}$ successor feature and $\eta $ is the learning rate. Note that we only consider the spatial component of the firing rate, ${f}_{j}^{x}(\mathbf{x})$, not the phase modulation component, ${f}_{j}^{\theta}(\mathbf{x})$, which (as shown above) averages away over any timescale significantly longer than the theta timescale (100 ms). For now we drop the superscript and write ${f}_{j}^{x}(\mathbf{x})={f}_{j}(\mathbf{x})$.
To find the TD error, we must derive a temporally continuous analogue of the Bellman equation. Following Doya, 2000, we take the derivative of Equation 19, which gives a consistency equation on the successor feature as follows:
This gives a continuous TD error of the form
which can be re-discretised and rewritten by Taylor expanding the derivative ($\dot{\psi}_{i}(t)\approx \frac{{\psi}_{i}(t)-{\psi}_{i}(t-dt)}{dt}$) to give
This looks like a conventional TD error term (typically something like ${\delta}_{t}={R}_{t}+\gamma {V}_{t}-{V}_{t-1}$) except that we can choose $dt$ (the timestep between learning updates) freely. Finally, expanding ${\psi}_{i}(\mathbf{x}(t))$ using Equation 3 and substituting this back into Equation 21 gives the update rule:
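The equations referenced in this derivation are missing from the extracted text. The chain below reconstructs the steps the prose describes, assuming the unnormalised definition $\psi_i(t) = \int_t^\infty e^{-(t'-t)/\tau} f_i(t')\,dt'$ and, as in standard TD(0), evaluating the gradient at the earlier sample $\mathbf{x}(t-dt)$; the authors' exact indexing may differ.

```latex
\begin{aligned}
\text{TD(0) rule:}\quad & M_{ij} \leftarrow M_{ij} + \eta\, \delta_i\, f_j(\mathbf{x}) \\
\text{consistency:}\quad & \dot{\psi}_i(t) = \tfrac{1}{\tau}\,\psi_i(t) - f_i(t) \\
\text{continuous TD error:}\quad & \delta_i(t) = f_i(t) - \tfrac{1}{\tau}\,\psi_i(t) + \dot{\psi}_i(t) \\
\text{discretised:}\quad & \delta_i(t) = \tfrac{1}{dt}\Big[f_i(t)\,dt + \big(1 - \tfrac{dt}{\tau}\big)\,\psi_i(t) - \psi_i(t - dt)\Big] \\
\text{update:}\quad & M_{ij} \leftarrow M_{ij} + \eta\, \delta_i(t)\, f_j\big(\mathbf{x}(t - dt)\big),
\quad \psi_i(t) = \sum_{k} M_{ik}\, f_k\big(\mathbf{x}(t)\big)
\end{aligned}
```

Multiplying the discretised error by $dt$ and writing $\gamma \approx 1 - dt/\tau$ gives $dt\,\delta_i(t) = f_i(t)\,dt + \gamma\,\psi_i(t) - \psi_i(t-dt)$, the conventional form with reward $R_t = f_i(t)\,dt$.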
This rule does not stipulate a fixed time step between updates. Unlike traditional TD update rules on discrete MDPs, $dt$ can take any positive value. The ability to adaptively vary $dt$ has potentially underexplored applications for efficient learning: when information density is high (e.g. when exploring new or complex environments, or during a compressed replay event (Skaggs and McNaughton, 1996a)), it may be desirable to learn frequently by setting $dt$ small. Conversely, when the information density is low (for example in well-known or simple environments) or learning is undesirable (for example when the agent is aware that a change to the environment is transient and should not be committed to memory), $dt$ can be increased to slow learning and save energy. In practice, we set our agent to perform a learning update approximately every 1 cm along its trajectory ($dt\approx 0.1$ s).
We add a small amount of $\mathrm{L2}$ regularisation by adding the term $-2\eta \lambda M$ to the right-hand side of Equation 27. This breaks the degeneracy in ${M}_{ij}$ caused by having a basis set that is richer than needed to construct the successor features, and can be interpreted, roughly, as a mild energy constraint favouring smaller synaptic connectomes. In total, the full update rule for our TD successor matrix in matrix form is given by
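The matrix-form rule is missing from the extracted text; it presumably reads $\mathsf{M} \leftarrow \mathsf{M} + \eta\,\big(\boldsymbol{\delta}\,\mathbf{f}(\mathbf{x}(t-dt))^{\top} - 2\lambda\mathsf{M}\big)$. Below is a minimal sketch of one such update, assuming the L2 term enters with a negative sign so that it shrinks $\mathsf{M}$, and that the gradient is taken at the earlier sample. `td_update` is a hypothetical name, and plain lists are used so the example needs no external packages.

```python
def td_update(M, f_prev, f_now, dt, tau, eta, lam=0.0):
    """One continuous-time TD(0) step on the successor matrix M (a list of rows).

    f_prev, f_now: basis-feature rates at x(t - dt) and x(t).
    delta_i = f_i(t) + ((1 - dt/tau) * psi_i(t) - psi_i(t - dt)) / dt
    M_ij   += eta * (delta_i * f_j(t - dt) - 2 * lam * M_ij)   # L2 shrinkage
    """
    n = len(M)
    psi_now = [sum(M[i][k] * f_now[k] for k in range(n)) for i in range(n)]
    psi_prev = [sum(M[i][k] * f_prev[k] for k in range(n)) for i in range(n)]
    new_M = [row[:] for row in M]
    for i in range(n):
        delta = f_now[i] + ((1 - dt / tau) * psi_now[i] - psi_prev[i]) / dt
        for j in range(n):
            new_M[i][j] += eta * (delta * f_prev[j] - 2 * lam * M[i][j])
    return new_M


# A stationary agent whose single basis feature fires at rate 1: the fixed point
# is M = tau (no regularisation), whatever the sequence of step sizes dt.
tau = 4.0
M = [[0.0]]
for step in range(3000):
    dt = 0.05 if step % 2 else 0.2  # dt may change from one update to the next
    M = td_update(M, [1.0], [1.0], dt=dt, tau=tau, eta=0.05)
print(M[0][0])  # close to 4.0
```

Because the stationary TD error here reduces to $1 - M/\tau$ regardless of $dt$, alternating step sizes leave the fixed point unchanged; only the speed of convergence depends on $\eta$.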
Successor features in continuous time and space
Typically, as in Stachenfeld et al., 2017, the successor representation is calculated in discretised time and space. $M({s}_{i},{s}_{j})$ encodes the expected discounted future occupancy of state ${s}_{j}$ along a trajectory initiated in state ${s}_{i}$:
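The displayed definition is missing from the extracted text. The standard discrete form (Dayan, 1993), which the prose describes, is given below together with its closed form for a fixed policy with state-transition matrix $T$ (a symbol introduced here); each row of $M$ then sums to $1/(1-\gamma)$.

```latex
\[
M(s_i, s_j) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}[s_t = s_j] \;\middle|\; s_0 = s_i\right],
\qquad
M = \sum_{t=0}^{\infty} \gamma^{t} T^{t} = (I - \gamma T)^{-1}
\]
```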
There are two forms of discretisation here. Firstly, time is discretised: it increases by a fixed increment, +1, to transition the state from ${s}_{t}\to {s}_{t+1}$. Secondly, assuming this is a spatial exploration task, space is discretised: the agent can be in exactly one state at any given time.
We loosen both these constraints, reinstating time and space as continuous quantities. Since we cannot hope to enumerate an infinite number of spatial locations, we represent the state by a population vector of diffuse, overlapping, spatially localised place cells. Thus it is no longer meaningful to ask what the expected future occupancy of a single location will be. The closest analogue, since the place cells are spatially localised, is to ask how much we expect place cell $i$, centred at ${\mathbf{x}}_{i}$, to fire in the near (discounted) future. Making time continuous turns the sum over time steps into an integral over time. Further, the role of $\gamma $, which discounts state occupancy many time steps into the future, is replaced by $\tau $, which discounts firing a long time into the future. Thus the extension of the successor representation, $M({s}_{i},{s}_{j})$, to continuous time and space is given by the successor feature,
Why have we chosen to do this? Temporally, it makes little sense to discretise time in a continuous exploration task: $\gamma$, the reinforcement learning discount factor, describes how many timesteps into the future the predictive encoding accounts for, and so undesirably ties the predictive encoding to the otherwise arbitrary size of the simulation timestep, $dt$. In the continuous definition, $\tau$ intuitively describes how long into the future the predictive encoding discounts over, and is independent of $dt$. This definition allows for online flexibility in the size of $dt$, as shown in Equation 27. This relieves the agent of a burden imposed by discretisation, namely that it must always learn with a fixed time step of +1. The agent can now potentially choose the temporal resolution at which it learns, which may bring significant benefits in terms of energy efficiency, as described above. Further, using the discretised form implicitly ties the definition of the successor representation (or any similarly defined value function) to the time step used in the simulation.
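The dependence on $dt$ can be made concrete with a short numerical sketch (illustrative only; `tau`, `horizon` and `discount_after` are hypothetical names). With a fixed per-step $\gamma$, the discount applied after one second of behaviour changes drastically as $dt$ shrinks. With $\gamma = 1-dt/\tau$ (the relation used below when time is re-discretised), it converges to $e^{-t/\tau}$, independent of $dt$.

```python
import numpy as np

tau = 2.0       # hypothetical continuous discount horizon, in seconds
horizon = 1.0   # look-ahead time (s) at which discount factors are compared


def discount_after(horizon: float, dt: float, gamma: float) -> float:
    """Discount applied after `horizon` seconds of discrete steps of size dt."""
    n_steps = int(round(horizon / dt))
    return gamma**n_steps


for dt in [0.1, 0.01, 0.001]:
    fixed = discount_after(horizon, dt, gamma=0.95)            # gamma fixed, dt varies
    scaled = discount_after(horizon, dt, gamma=1 - dt / tau)   # gamma = 1 - dt / tau
    print(f"dt={dt}: fixed gamma -> {fixed:.3g}, gamma=1-dt/tau -> {scaled:.4f}")

print(f"continuous limit exp(-horizon/tau) = {np.exp(-horizon / tau):.4f}")
```

With the fixed discount factor, the effective horizon in seconds shrinks in proportion to $dt$; the scaled version keeps it pinned to $\tau$.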
When space is discretised, the successor representation is a matrix encoding predictive relationships between these discrete locations. TD successor features, defined above, are the natural extension of the successor representation to a continuous space where location is encoded by a population of overlapping basis features, rather than exclusive one-hot states. The TD successor matrix, ${M}_{ij}$, is most easily viewed as a set of driving weights: ${M}_{ij}$ is large if basis feature ${f}_{j}(\mathbf{x})$ contributes strongly to successor feature ${\psi}_{i}(\mathbf{x})$. The two are closely related (for example, in the effectively discrete case of non-overlapping basis features, it can be shown that the TD successor matrix corresponds directly to the transpose of the successor representation, ${M}_{ij}^{\mathsf{T}}=M({s}_{i},{s}_{j})$; see below for proof), but we believe the continuous case has more applications in terms of biological plausibility: electrophysiological studies show that the hippocampus encodes position using a population vector of overlapping place cells, rather than one-hot states. Furthermore, the continuous case maps neatly onto known neural circuitry: in our case, CA3 place cells act as basis features, CA1 place cells as successor features, and the successor matrix as the synaptic weights between them. In our model, the choice not to discretise space, and to use a more biologically compatible basis set of large overlapping place cells, is necessary: were our basis features not to overlap, they would not be able to reliably form associations using STDP, since often only one cell would fire in a given theta cycle.
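The transpose relation for non-overlapping basis features can also be checked numerically. In this minimal sketch (an assumed toy setup, not the paper's simulation code), one-hot basis features sit on a hypothetical 6-state loop with a rightward-biased random walk, so the SR is not symmetric. The successor feature of each basis feature is read off the discrete SR, and the weights that reconstruct it from the basis turn out to be exactly the transposed SR.

```python
import numpy as np

n, gamma = 6, 0.8
# Hypothetical biased walk on a loop: step right w.p. 0.8, left w.p. 0.2 (so T is not symmetric)
T = 0.8 * np.roll(np.eye(n), 1, axis=1) + 0.2 * np.roll(np.eye(n), -1, axis=1)
SR = np.linalg.inv(np.eye(n) - gamma * T)   # SR[a, b] = M(s_a, s_b)

# Non-overlapping (one-hot) basis features: F[j, k] = f_j(s_k)
F = np.eye(n)

# Successor feature of basis feature i evaluated at start state s_k:
# psi_i(s_k) = sum_b M(s_k, s_b) f_i(s_b)
psi = (SR @ F.T).T                          # psi[i, k] = psi_i(s_k)

# TD successor matrix: the weights satisfying psi_i(s_k) = sum_j M_td[i, j] f_j(s_k)
M_td = psi @ np.linalg.inv(F)

print(np.allclose(M_td, SR.T))              # True: M_td[i, j] = M(s_j, s_i)
print(np.allclose(M_td, SR))                # False: the transpose matters here
```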
For completeness (although this is not something studied in this report), this continuous successor feature form also allows for rapid estimation of the value function in a neurally plausible way. Whereas for the discrete case value can be calculated as:
where $R({s}_{j})$ is the per-timestep reward to be found at state ${s}_{j}$, in the continuous successor feature setting:
where ${R}_{j}$ is a vector of weights satisfying ${\sum}_{j}{R}_{j}{f}_{j}(\mathbf{x})=R(\mathbf{x})$, where $R(\mathbf{x})$ is the reward rate found at location $\mathbf{x}$. Equation 31 can be confirmed by substituting Equation 29 into it. ${R}_{j}$ (like $R({s}_{j})$) must be learned separately from, and in addition to, the successor features, a process which is not the focus of this study, although correlates have been observed in the hippocampus (Gauthier and Tank, 2018). $V(\mathbf{x})$ is the temporally continuous value associated with trajectories initialised at $\mathbf{x}$:
Equivalence of the TD successor matrix to the successor representation
Here, we show the equivalence between $M({s}_{i},{s}_{j})$ and ${M}_{ij}$. First, we re-discretise time by setting $dt'$ to be constant and defining $\gamma = 1-\frac{dt'}{\tau}$ and ${\mathbf{x}}_{n}=\mathbf{x}(n\cdot dt')$. The integral in Equation 29 becomes a sum,
Next, we re-discretise space by supposing that CA3 place cells in our model have strictly non-overlapping receptive fields which tile the environment. For each place cell, $i$, there is a continuous area, ${\mathcal{A}}_{i}$, such that for any location within this area place cell $i$ fires at a constant rate whilst all others are silent. When $\mathbf{x}\in {\mathcal{A}}_{i}$ we denote this state $s(\mathbf{x})={s}_{i}$ (since all locations in this area have identical population vectors).
Let the initial state be $s(\mathbf{x})={s}_{j}$ (i.e. $\mathbf{x}\in {\mathcal{A}}_{j}$). Putting this into Equation 33 and equating to Equation 3, the definition of our TD successor matrix, gives
confirming that
Simulation and analysis details
Maze details
In the 1D open loop maze (Figure 2a–e), the policy was to always move around the maze in one direction (left to right, as shown) at a constant velocity of 16 cm s–1 along the centre of the track. Although figures display this maze as a long corridor, it is topologically identical to a loop; place cells close to the left or right sides have receptive fields extending into the right or left of the corridor respectively. Fifty Gaussian basis features of radius 1 m, as described above, are placed with their centres uniformly spread along the track. Agents explored for a total time of 30 min.
In the 1D corridor maze (Figure 2f–j), only one thing changes: the left and right hand edges of the maze are closed by walls. When the agent reaches a wall, it turns around and walks the other way until it reaches the opposite wall. Agents explored for a total time of 30 min.
In the 2D two room maze, 200 basis features are positioned in a grid across the two rooms (100 per room) and their locations are then jittered slightly (Figure 2k). The cells are geodesic Gaussians: the $\lVert \mathbf{x}(t)-{\mathbf{x}}_{i}\rVert^{2}$ term in Equation 7 measures the distance from the agent's location to the centre of cell $i$ along the shortest walk which complies with the wall geometry. This explains the bleeding of the basis feature through the door in Figure 3d. Agents explored for a total time of 120 min.
The movement policy of the agent is a random walk with momentum. The agent moves forward with its speed at each discrete time step drawn from a Rayleigh distribution centred at 16 cm s–1. At each time step the agent rotates a small amount; the rotational speed is drawn from a normal distribution centred at zero with standard deviation $3\pi$ rad s–1 ($\pi$ rad s–1 for the 1D mazes). Whenever the agent gets close to a wall (within 10 cm), its direction of motion is changed to run parallel to the wall, thus biasing towards trajectories which ‘follow’ the boundaries, as observed in real rats. This model was designed to closely match the behaviour of freely exploring rats and was adapted from the model initially presented in Raudies and Hasselmo, 2012. We add one additional behavioural bias: in the 2D two room maze, whenever the agent passes within 1 m of the centre point of the doorway connecting the two rooms, its rotational velocity is biased to turn it towards the door centre. This has the effect of encouraging room-to-room transitions, as is observed in freely moving rats (Carpenter et al., 2015).
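A minimal sketch of this kind of movement policy is given below. It is neither the model of Raudies and Hasselmo, 2012 nor the code used in this paper. The square arena, the time step, reading ‘centred at 16 cm s–1’ as the Rayleigh scale (its mode), and the wall rule (dropping the velocity component that points into a nearby wall) are our assumptions, and the doorway bias is omitted.

```python
import numpy as np


def random_walk_with_momentum(n_steps, dt=0.01, box=1.0, speed_scale=0.16,
                              rot_std=3 * np.pi, wall_dist=0.1, seed=0):
    """Sketch of a momentum random walk in the square arena [0, box] x [0, box].

    Speeds (m/s) are Rayleigh distributed with scale `speed_scale`; the heading
    diffuses with a zero-mean Gaussian rotational velocity (std `rot_std`, rad/s).
    Within `wall_dist` of a wall, the velocity component pointing into that wall is
    removed so the agent moves parallel to it.
    """
    rng = np.random.default_rng(seed)
    pos = np.full(2, box / 2)
    heading = rng.uniform(0.0, 2 * np.pi)
    traj = np.empty((n_steps, 2))
    for t in range(n_steps):
        heading += rng.normal(0.0, rot_std) * dt       # small random rotation each step
        direction = np.array([np.cos(heading), np.sin(heading)])
        for axis in range(2):
            into_low_wall = pos[axis] < wall_dist and direction[axis] < 0
            into_high_wall = pos[axis] > box - wall_dist and direction[axis] > 0
            if into_low_wall or into_high_wall:
                direction[axis] = 0.0                  # keep only the wall-parallel part
        norm = np.linalg.norm(direction)
        if norm == 0.0:                                # cornered: turn around
            heading += np.pi
            direction = np.array([np.cos(heading), np.sin(heading)])
        else:
            direction /= norm
            heading = np.arctan2(direction[1], direction[0])
        speed = rng.rayleigh(speed_scale)
        pos = np.clip(pos + speed * direction * dt, 0.0, box)
        traj[t] = pos
    return traj


traj = random_walk_with_momentum(n_steps=5000)   # 50 s of simulated exploration
```

Re-aligning the heading with the wall after each projection is what produces extended wall-following segments in this sketch.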
Analyses of the STDP and TD successor matrices
For the 1D mazes, there exists a translational symmetry relating the $N=50$ uniformly distributed basis features and their corresponding rows in the STDP/TD weight matrices. This symmetry is exact for the 1D loop maze (all cells around a circle are rotated versions of one another) and approximate for the corridor maze (broken only for cells near the left or right bounding wall). As a result, much of the information in the linear track weight matrices (Figure 2b, c, g and h) can be viewed more easily by aligning each row on its diagonal entry and collapsing the matrix over rows (plotted in Figure 2d and i). This is done using a circular permutation of each matrix row by a count, ${n}_{i}$, equal to how many times we must shift cell $i$ to the right in order for its centre to lie at the middle of the track, ${x}_{i}=2.5\,\mathrm{m}$,
This is the ‘row aligned matrix’. Averaging over its rows removes little information thanks to the symmetry of the circular track. We therefore define the 1D quantity
which is a convenient way to plot, in 1D, only the nonredundant information in the weight matrices.
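The row alignment and averaging can be sketched in a few lines of NumPy. The function name is hypothetical, and we assume cells are indexed in order of their position around the loop, so that the shift for row $i$ is $n_i = N/2 - i$ (mod $N$). For a perfectly circulant matrix every aligned row is identical, so the row average loses nothing.

```python
import numpy as np


def row_aligned_average(W):
    """Circularly shift each row so that its diagonal entry lands in the middle column,
    then average over rows. Assumes cell i is centred at position index i on a loop."""
    n = W.shape[0]
    centre = n // 2
    aligned = np.stack([np.roll(W[i], centre - i) for i in range(n)])  # shift n_i = centre - i
    return aligned, aligned.mean(axis=0)


# Example: a circulant (perfectly translation-symmetric) matrix with a backward skew
n = 50
offsets = (np.arange(n) + n // 2) % n - n // 2         # signed offsets j - i in [-25, 24]
kernel = np.exp(-np.abs(offsets) / 5.0) * np.where(offsets <= 0, 1.0, 0.3)
W = np.stack([np.roll(kernel, i) for i in range(n)])   # W[i, j] = kernel[(j - i) % n]
aligned, profile = row_aligned_average(W)
```

`profile` is the 1D summary: its middle entry is the diagonal weight, entries to its left are weights from cells behind, and entries to its right are weights from cells ahead.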
A theoretical connection between STDP and TD learning
Why does STDP between phase-precessing place cells approximate TD learning? In this section, we attempt to shed some light on this question by analytically studying the equations of TD learning. Ultimately, comparisons between these learning rules are difficult since STDP is inherently a discrete learning rule acting on pairs of spikes, whereas TD learning is a continuous learning rule acting on firing rates. Nonetheless, in the end we will draw the following conclusions:
In the first part, we will show that, under a small set of biologically feasible assumptions, temporal difference learning ‘looks like’ a spike-time dependent, temporally asymmetric Hebbian learning rule (that is, roughly, STDP) where the temporal discount time horizon, $\tau$, is equal to the synaptic plasticity timescale, $O(20\,\text{ms})$.
In the second part, we will see that this limitation (a temporal discount time horizon restricted to the very short timescale of synaptic plasticity) can be overcome by compressing the inputs. Phase precession, or more precisely theta sweeps, performs exactly the required compression.
In sum, there is a deep connection between TD learning and STDP and the role of phase precession is to compress the inputs such that a very short predictive time horizon amounts to a long predictive time horizon in decompressed time coordinates. We will finish by discussing where these learning rules diverge and the consequences of their differences on the learned representations. The goal here is not to derive a mathematically rigorous link between STDP and TD learning but to show that a connection exists between them and to point the reader to further resources if they wish to learn more.
Reformulating TD learning to look like STDP
First, recall that the temporal difference (TD) rule for learning the successor features ${\psi}_{i}(\mathbf{x})$ defined in Equation 19 takes the form:
where ${M}_{ij}$ are the weights of the linear function approximator (Equation 3) and ${\delta}_{i}(t)$ is the continuous temporal difference error defined in Equation 24. (Note, firstly, that it is a coincidence specific to this study that the basis features of the linear function approximator, Equation 3, happen to be the same features whose successor features we are computing, Equation 19; in general, this need not be the case. Secondly, this analysis applies to any value function, not just successor features, which are a specific example. If ${f}_{i}(\mathbf{x})$ in Equation 19 were a reward density, then ${\psi}_{i}(\mathbf{x})$ would become a true value function (discounted sum of future rewards) in the more conventional sense.) ${O}_{j}(t)$ is the eligibility trace for feature $j$, defined according to
or, equivalently, by its dynamics (which we will make use of)
where ${\tau}_{O}\in [0,\tau ]$ is a ‘free’ parameter, the eligibility trace timescale, analogous to $\lambda $ in discrete TD($\lambda $). When ${\tau}_{O}=0$ we recover the learning rule we use to learn successor features, ‘TD(0)’, in Equation 21.
Substituting Equation 24 and Equation 41 into the update rule, Equation 39, and rearranging gives
where we redefined $\eta \leftarrow {\eta}^{\prime}=\eta /\tau $. Now let the predictive time horizon be equal to the eligibility trace timescale. This setting is also called TD(1) or Monte Carlo learning,
Now
The final term in this update rule, the total derivative, can be ignored with respect to the stationary point of the learning process. To see why, consider the simple case of a periodic environment which repeats over a time period $T$ – this is true for the 1D experiments studied here. Learning is at a stationary point when the integrated changes in the weights vanish over one whole period:
where the last term vanishes due to the periodicity. This shows that the learning rule converges to the same fixed point (i.e. the successor feature) irrespective of whether this term is present and it can therefore be removed. The dynamics of this updated learning rule won’t strictly follow the same trajectory as TD learning but they will converge to the same point. Although strictly we only showed this to be true in the artificially simple setting of a periodic environment it is more generally true in a stochastic environment where the feature inputs depend on a stationary latent Markov chain (Brea et al., 2016).
Thus, a valid learning rule which converges onto the successor feature can be written as
Claim: this looks like a continuous analogue of STDP acting on the weights between a set of input features, indexed $j$, and a set of downstream ‘successor features’, indexed $i$. Each term in the above learning rule can be identified, non-rigorously, as follows. A key change is that the successor feature neurons have two compartments, a somatic compartment and a dendritic compartment:
${f}_{i}(t):={V}_{i}^{\text{soma}}(t)$ is the somatic membrane voltage, which is primarily set by a ‘target signal’. In general, this target signal could be any reward density function; here it is the firing rate of the $i$th input feature.
${\psi}_{i}(t):={V}_{i}^{\text{dend}}(t)$ is the voltage inside a dendritic compartment, which is a weighted linear sum of the input currents (Equation 3). This compartment is responsible for learning the successor feature by adjusting its input weights, ${M}_{ij}$, according to Equation 48.
${f}_{j}(t):={I}_{j}(t)$ are the synaptic currents into the dendritic compartment from the upstream features.
${O}_{j}(t):={\stackrel{~}{I}}_{j}(t)$ are the low-pass filtered eligibility traces of the synaptic input currents.
This learning rule, mapped onto the synaptic inputs and voltages of a two-compartment neuron, is Hebbian. The first term potentiates the synapse ${M}_{ij}$ if there is a correlation between the low-pass filtered presynaptic current and the somatic voltage (which drives postsynaptic activity). More specifically, this potentiation is temporally asymmetric due to the second term, which sets a threshold. A postsynaptic spike (e.g. when ${V}_{i}^{\text{soma}}(t)$ reaches threshold) will cause potentiation if
but since the eligibility trace decays uniformly after a presynaptic input, this will only be true if the postsynaptic spike arrives very soon after. This is pre-before-post potentiation. Conversely, an unpaired presynaptic input (e.g. when ${I}_{j}(t)$ spikes) will likely cause depression, since it bolsters the second, depressive term of the learning rule but not the first (note this is true if its synaptic weight is positive, such that ${V}^{\text{dend}}(t)$ will be high too). This is analogous to post-before-pre depression. Whilst not identical, this rule clearly bears the key hallmarks of the STDP learning rule used in this study: first, pre-before-post synaptic activity potentiates a synapse if the postsynaptic activity arrives within a short time of the presynaptic activity and, secondly, post-before-pre synaptic activity will typically result in depression of the synapse.
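The origin of the pre-before-post window can be illustrated numerically. This sketch assumes the eligibility trace follows the normalised low-pass dynamics $\tau_O \dot{O}_j = f_j - O_j$ (our reading of the trace dynamics above, chosen because it reduces to $O_j = f_j$ when $\tau_O = 0$). It models a postsynaptic spike as a unit pulse in $f_i$, so the potentiation term is proportional to $O_j$ at the moment of the postsynaptic spike; the depressive threshold term is not simulated, and all parameter values are hypothetical.

```python
import numpy as np


def eligibility_trace(f, dt, tau_O):
    """Euler-integrate tau_O * dO/dt = f - O, starting from O = 0.

    This normalised low-pass filter reduces to O = f in the limit tau_O -> 0 (TD(0)).
    """
    O = np.zeros_like(f)
    for t in range(1, len(f)):
        O[t] = O[t - 1] + (dt / tau_O) * (f[t - 1] - O[t - 1])
    return O


dt, tau_O = 1e-4, 0.02                 # 0.1 ms steps; 20 ms trace (hypothetical values)
t = np.arange(0.0, 0.2, dt)
pre = np.zeros_like(t)
pre[0] = 1.0 / dt                      # one presynaptic spike at t = 0 (unit-area pulse)
O = eligibility_trace(pre, dt, tau_O)

# Potentiation term f_i * O_j for a single postsynaptic spike arriving `delays` after the pre spike
delays = np.array([0.005, 0.01, 0.02, 0.05, 0.1])
potentiation = O[np.round(delays / dt).astype(int)]
print(potentiation / potentiation[0])  # decays roughly as exp(-delay / tau_O)
```

The resulting pre-before-post window is exponential with width $\tau_O$, which is why matching this rule to STDP pins $\tau_O$ to the plasticity timescale.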
Intuitively, it now makes sense why asymmetric STDP learns successor features. If a postsynaptic spike from the $i$th neuron arrives just after a presynaptic spike from the $j$th feature, it means, in all probability, that the presynaptic input feature is ‘predictive’ of whatever caused the postsynaptic spike, which in this case is the $i$th feature. Thus, if we want to learn a function which is predictive of the $i$th feature's future activity (its successor feature), we should increase the synaptic weight ${M}_{ij}$. Finally, identifying that this learning rule looks similar to STDP fixes the timescale of the eligibility trace to be the timescale of STDP plasticity, i.e. $O(20\text{–}50\,\text{ms})$. Since deriving this learning rule required the temporal discount time horizon to equal the eligibility trace timescale, altogether:
This limits the predictive time horizon of the learnt successor feature to a rather useless, but importantly non-zero, 20–50 ms. In the next section, we will show how phase precession presents a novel solution to this problem.
Theta phase precession compresses the temporal structure of input features
We showed in Figure 1 how phase precession leads to theta sweeps. These phenomena are two sides of the same coin. Here we will start by positing the existence of theta sweeps and show that this leads to a potentially large amount of compression of the feature basis set in time.
First, consider two different definitions of position. ${\mathbf{x}}_{T}(t)$ is the ‘True’ position of the agent, representing where it is in the environment at time $t$. ${\mathbf{x}}_{E}(t)$ is the ‘Encoded’ position of the agent, which determines the firing rate of place cells with spatial receptive fields ${f}_{i}({\mathbf{x}}_{E}(t))$. During a theta sweep, the encoded position ${\mathbf{x}}_{E}(t)$ moves with respect to the true position ${\mathbf{x}}_{T}(t)$ at a relative speed of ${\mathbf{v}}_{S}(t)$, where the subscript $S$ distinguishes the ‘Sweep’ speed from the absolute speed of the agent, ${\dot{\mathbf{x}}}_{T}(t)={\mathbf{v}}_{A}(t)$. In total, accounting for the motion of the agent:
Now consider how the population activity vector changes in time
and compare this with how it would vary in time if there were no theta sweep (i.e. ${\mathbf{x}}_{E}(t)={\mathbf{x}}_{T}(t)$)
They are proportional. Specifically, in 1D, where the sweep is observed to move in the same direction as the agent (from behind it to in front of it), this amounts to a compression of the temporal dynamics by a factor of
This ‘compression’ is also true in 2D where sweeps are also observed to move largely in the same direction as the agent.
If this compression is large, it would solve the timescale problem described above. This is because learning a successor feature with a very small time horizon, $\tau $, where the input trajectory is heavily compressed in time by a factor of ${\kappa}_{\theta}$ amounts to the same thing as learning a successor feature with a long time horizon ${\tau}^{\prime}=\tau {\kappa}_{\theta}$ where the inputs are not compressed in time.
What is ${v}_{S}$, and is it fast enough to provide enough compression to learn temporally extended SRs? We can make a very rough ballpark estimate. Data are hard to come by, but studies suggest the intrinsic speed of theta sweeps can be quite fast. Figures in Feng et al., 2015, Wang et al., 2020 and Bush et al., 2022 show sweeps moving at up to, respectively, 9.4 m s–1, 8.5 m s–1 and 2.3 m s–1. A conservative range estimate of ${v}_{S}\approx 5\pm 5$ m s–1 accounts for very fast and very slow sweeps. The timescale of STDP is debated, but a reasonable conservative estimate would be around ${\tau}_{\text{STDP}}\approx (35\pm 15)\times {10}^{-3}$ s, which would cover the range of STDP timescales we use here. The typical speed of a rat, though highly variable, is somewhere in the range ${v}_{A}\approx 0.15\pm 0.15$ m s–1. Combining these (with correct error analysis, assuming Gaussian uncertainties) gives an effective timescale increase of
Therefore, we conclude that theta sweeps can provide enough compression to lift the timescale of the SR learnt by STDP from short synaptic timescales to relevant behavioural timescales on the order of seconds. Note that this ballpark estimate is not intended to be precise and does not account for many unknowns, for example the covariability of sweep speed with running speed [cite], or the variability of sweep speed with track length [cite] or cell size [cite], which could potentially extend this range further.
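The ballpark estimate can be reproduced with a short script. We assume the 1D compression factor is $\kappa_\theta = 1 + v_S/v_A$, i.e. the ratio of the encoded-position speed $v_A + v_S$ to the agent speed $v_A$ in the two expressions above. We then propagate the quoted uncertainties to first order. The function name is hypothetical, and the error analysis used for the estimate above may differ in detail.

```python
import numpy as np


def effective_timescale(tau, v_S, v_A, sig_tau, sig_vS, sig_vA):
    """tau' = tau * kappa, with kappa = 1 + v_S / v_A, and a first-order error estimate.

    Uncertainties are propagated linearly, assuming independent Gaussian errors.
    """
    kappa = 1.0 + v_S / v_A
    tau_eff = tau * kappa
    # Partial derivatives of tau' = tau * (1 + v_S / v_A) with respect to each input
    d_tau = kappa
    d_vS = tau / v_A
    d_vA = -tau * v_S / v_A**2
    sigma = np.sqrt((d_tau * sig_tau)**2 + (d_vS * sig_vS)**2 + (d_vA * sig_vA)**2)
    return kappa, tau_eff, sigma


# Ballpark values from the text: v_S = 5 +/- 5 m/s, tau_STDP = 35 +/- 15 ms, v_A = 0.15 +/- 0.15 m/s
kappa, tau_eff, sigma = effective_timescale(tau=0.035, v_S=5.0, v_A=0.15,
                                            sig_tau=0.015, sig_vS=5.0, sig_vA=0.15)
print(f"kappa ~ {kappa:.1f}, tau' ~ {tau_eff:.2f} +/- {sigma:.2f} s")
```

Under these assumptions the central estimate is on the order of a second. However, with 100% uncertainty on both speeds, the linearised error bar exceeds the estimate itself, so the result should be read only as a ballpark figure.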
Differences between STDP and TD learning: where our model does not work
We only drew a hand-waving connection between the TD-derived Hebbian learning rule in Equation 48 and STDP. There are numerous differences between STDP and TD learning, including the fact that:
Depression in Equation 48 is dependent on the dendritic voltage, which is not true for our STDP rule.
Depression in Equation 48 is not explicitly dependent on the time between post- and presynaptic activity, unlike STDP.
Equation 48 is a continuous learning rule for continuous firing rates; STDP is a discrete learning rule applicable only to spike trains.
Analytic comparison is difficult due to this final difference, which is why in this paper we instead opted for an empirical comparison. Our goal was never to derive a spike-time dependent synaptic learning rule which replicates TD learning (other papers have worked in this direction; see Brea et al., 2016; Bono et al., 2023); rather, we wanted to (i) see how well unmodified learning rules, as measured in hippocampal neurons, perform and (ii) study whether phase precession aids learning. Under the regimes tested here, STDP seems to hold up well.
These differences aside, the learning rule does share other similarities with our model setup. A special feature of this learning rule is that it postulates that the somatic voltage driving postsynaptic activity during learning is not affected by the neuron's own dendritic voltage. Rather, dendritic voltages affect plasticity by setting the potentiation threshold. These learning rules have been studied under the collective name of ‘voltage dependent’ Hebbian learning rules [CITE]. This matches the learning setting we use here where, during learning, CA1 neurons are driven by one and only one CA3 feature (the ‘target feature’), whilst the weights being trained, ${W}_{ij}$, do not immediately affect somatic activity. The lack of online updating matches the electrophysiological observation that plasticity between CA3 and CA1 is highest during the phase of theta when CA1 is driven by entorhinal cortex and lowest at the phase when CA3 actually drives CA1 (Hasselmo et al., 2002).
Finally, there is one clear failure of our STDP model: learning very long timescale successor features. Unlike TD learning, which can ‘bootstrap’ long timescale associations through intermediate connections, our STDP rule cannot do this in its current form. Brea et al., 2016 and Bono et al., 2023 show how Equation 48 can be modified to allow long timescale SRs whilst still enforcing the timescale constraint we imposed in Equation 43, thus maintaining the biological plausibility of the learning rule. This requires allowing the dendritic voltage to modify the somatic voltage during learning, in a manner highly similar to bootstrapping in RL. Specifically, in the former study this is done by a direct extension to the two-compartment model; in the latter it is recast in a one-compartment model, although the underlying mathematics shares many similarities. Ultimately, both mechanisms could be at play: even in neurons endowed with the ability to bootstrap long timescale associations with short timescale plasticity kernels, phase precession would still increase learning speed significantly by reducing the amount of bootstrapping required by a factor of ${\kappa}_{\theta}$, something we intend to study further in future work. Lastly, it is not clear what timescales predictive encoding in the hippocampus reaches; there is likely to be an upper limit on the utility of such predictive representations, beyond which the animal uses model-based methods to find optimal solutions to guide behaviour.
Supplementary analysis
Figure 2—figure supplement 1: Place cell size and movement statistics
For convenience, panel a of Figure 2—figure supplement 1 duplicates the experiment shown in Figure 2a–e of the main paper. The only change is that learning time was extended from 30 minutes to 1 hour.
Movement speed variability
Panel b shows an experiment where we reran the simulation shown in Figure 2a–e of the main paper except that, instead of moving at a constant speed, the agent moves with a variable speed drawn from a continuous stochastic process (an Ornstein–Uhlenbeck process). The parameters of the process were selected so the mean velocity remained the same (16 cm s–1 left-to-right) but now with significant variability (standard deviation of 16 cm s–1, thresholded so the speed cannot go negative). Essentially, the velocity takes a constrained random walk. This detail is important: the velocity is not drawn randomly on each time step, since these changes would rapidly average out with small $dt$; rather, the change in the velocity (the acceleration) is random. This drives slow stochasticity in the velocity, with extended periods of fast motion and extended periods of slow motion. After learning there is no substantial difference in the learned weight matrices. This is because both TD and STDP learning rules are able to average over the stochasticity in the velocity and converge on representations reflecting the mean statistics of the motion.
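A minimal sketch of such a process follows. The velocity relaxes towards its mean with a correlation time `tau_v` while being kicked by Gaussian noise, so fast and slow epochs each persist for roughly `tau_v` seconds. The correlation time (1 s here), the time step, and the choice to threshold the output rather than the internal state are hypothetical. Note that thresholding at zero slightly raises the mean speed above 16 cm s–1 in this sketch.

```python
import numpy as np


def ou_speed(n_steps, dt=0.01, mean=0.16, std=0.16, tau_v=1.0, seed=0):
    """Ornstein-Uhlenbeck velocity (m/s): the acceleration, not the velocity, is random.

    Euler-Maruyama step of dv = -(v - mean) / tau_v dt + sqrt(2 std^2 / tau_v) dW,
    whose stationary distribution has the given mean and standard deviation.
    Returns the raw velocity and the speed thresholded at zero.
    """
    rng = np.random.default_rng(seed)
    v = np.empty(n_steps)
    v[0] = mean
    noise = np.sqrt(2.0 * std**2 / tau_v * dt)
    for t in range(1, n_steps):
        v[t] = v[t - 1] - (v[t - 1] - mean) / tau_v * dt + noise * rng.normal()
    return v, np.maximum(v, 0.0)


raw, speed = ou_speed(n_steps=200_000)   # 2000 s at dt = 10 ms
```

Drawing an independent speed on every step would, by contrast, average out within a few time steps and leave the effective motion nearly constant.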
Smaller place cells and faster movement
Nothing fundamental prevents learning from working in the case of smaller place fields or faster movement speeds. We explore this in Figure 2—figure supplement 1, panel c, as follows: the agent speed is doubled from 16 cm s–1 to 32 cm s–1 and the place field size is shrunk by a factor of 5, from 2 m diameter to 40 cm diameter. To facilitate learning we also increase the cell density along the track from 10 cells m–1 to 50 cells m–1. We also shrink the track size from 5 m to 2 m (any additional track is redundant due to the circular symmetry of the setup and the small size of the place cells). We then train for 12 min. This time was chosen since 12 min moving at 32 cm s–1 on a 2 m track means the same number of laps as 60 min moving at 16 cm s–1 on a 5 m track (about 115 laps in total). Despite these changes, the weight matrix converged with high similarity to the successor matrix with a shorter time horizon (0.5 s). Convergence time measured in minutes was faster than in the original case, but this is mostly due to the shortened track length and increased speed. Measured in laps, it now takes longer to converge due to the decreased number of spikes (smaller place fields and faster movement through the place fields). This can be seen in the shallower convergence curve in panel c (right) relative to panel a.
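The matched-lap argument is simple arithmetic: laps = duration × speed / track length. It is checked below with a helper of our own, `n_laps`.

```python
def n_laps(minutes: float, speed_m_per_s: float, track_m: float) -> float:
    """Number of laps completed when moving at constant speed around a loop."""
    return minutes * 60.0 * speed_m_per_s / track_m


original = n_laps(60, 0.16, 5.0)   # 60 min at 16 cm/s on the 5 m loop
reduced = n_laps(12, 0.32, 2.0)    # 12 min at 32 cm/s on the 2 m loop
print(original, reduced)           # both approximately 115.2
```

Halving the duration-to-track ratio while doubling the speed cancels out, which is why the two settings are matched lap for lap.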
Figure 2—figure supplement 2: Weight initialisation and updating schedule
Random initialisation
In Figure 2—figure supplement 2, panel a, we explore what happens if weights are initialised randomly. Rather than the identity, the weight matrix during learning is fixed (‘anchored’) to a sparse random matrix ${W}_{ij}^{A}$, defined such that each CA1 neuron receives positive connections from 3, 4 or 5 randomly chosen CA3 neurons with weights summing to one. In all other respects learning remains unchanged. CA1 neurons now have multimodal receptive fields, since they receive connections from multiple, potentially far apart, CA3 cells. This should not cause a problem, since each subfield now acts as its own place field, phase precessing according to whichever CA3 place cell is driving it. Indeed it does not: after learning with this fixed but random CA3–CA1 drive, the learned synaptic weights compare favourably in aggregate to the successor matrix (panel a, middle and right). Specifically, this is the successor matrix which maps the unmixed, unimodal place cells in CA3 to the successor features of the new multimodal ‘mixed’ features found in CA1 before learning. We note in passing that this is easy to calculate due to the linearity of the successor feature (SF): the SF of a linear sum of features is the same linear sum of their SFs. We can therefore calculate the new successor matrix using the same algorithm as before (described in the Methods) and then multiply it by the sparse random matrix, ${M}_{ij}^{\prime}={\sum}_{k}{W}_{ik}^{A}{M}_{kj}$.
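The sparse random anchor and the mixed successor matrix ${M}_{ij}^{\prime}={\sum}_{k}{W}_{ik}^{A}{M}_{kj}$ can be sketched as follows. The weight distribution (uniform, then normalised), the loop dynamics, and the stand-in successor matrix are all hypothetical choices made only for illustration. The last line checks the linearity argument: mixing successor features gives the same result as applying the mixed successor matrix to the basis.

```python
import numpy as np


def sparse_random_anchor(n_ca1, n_ca3, rng):
    """Each CA1 row gets 3, 4 or 5 positive weights on randomly chosen CA3 cells, summing to one."""
    W_A = np.zeros((n_ca1, n_ca3))
    for i in range(n_ca1):
        k = rng.integers(3, 6)                            # 3, 4 or 5 inputs (upper bound exclusive)
        cols = rng.choice(n_ca3, size=k, replace=False)
        w = rng.uniform(0.1, 1.0, size=k)                 # hypothetical weight distribution
        W_A[i, cols] = w / w.sum()
    return W_A


rng = np.random.default_rng(0)
n, gamma = 50, 0.9
T = np.roll(np.eye(n), 1, axis=1)                         # hypothetical loop dynamics
M = np.linalg.inv(np.eye(n) - gamma * T).T                # stand-in successor matrix (unmixed)
W_A = sparse_random_anchor(n, n, rng)
M_mixed = W_A @ M                                         # M'_ij = sum_k W^A_ik M_kj

# Linearity: the SF of a mixture equals the same mixture of SFs
F = rng.random((n, 200))                                  # arbitrary basis activity at 200 locations
print(np.allclose(W_A @ (M @ F), M_mixed @ F))            # True
```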
So that some structure is visible, matrix rows (which index the CA1 postsynaptic cells) have been ordered according to the location of peak CA1 activity. This explains why the random sparse matrix (panel a, middle) looks ordered even though it is not. After learning, the STDP successor feature looks close in form to the TD successor feature, and both show a shift and skew backwards along the track (panel a, right; one example CA1 field shown).
Online weight updating
In Figure 2—figure supplement 2, panels b, c and d, we explore what happens if the weights are updated online during learning. It is not possible to build a stable, fully online model, and it is easy to understand why: if the weight matrix doing the learning is also the matrix driving the downstream features, then there is nothing to prevent instabilities where, for example, the downstream feature keeps shifting backwards (no convergence) or the weight matrix for some or all features disappears or blows up (incorrect convergence). However, it is possible to get most of the way there by splitting the driving weights into two components. The first and most significant component is the STDP weight matrix being learned online; this creates a ‘closed loop’ where changes to the weights affect the downstream features, which in turn affect learning on the weights. The second, smaller component is what we call the ‘anchoring’ weights, which we set to a fraction of the identity matrix (here $\frac{1}{2}$) and which are not learned. In summary, Equation 16 becomes
for ${W}_{ij}^{A}=\frac{1}{2}{\delta}_{ij}$.
These anchoring weights provide structure, analogous to a target signal or ‘scaffold’, onto which the successor features can learn without risk of infinite backwards expansion or weight decay. When the weights and successor features are analysed after learning, the anchoring component is not considered.
Other models of TD learning also, implicitly or explicitly, have a form of anchoring. For example, in classical TD learning each successor feature receives a fixed ‘reward’ signal from the feature it is learning to predict (this is the second term in Equation 23 of our Methods). Even other ‘synaptically plausible’ models include a non-learnable constant drive (see the CA3–CA1 model of Bono et al., 2023, specifically the bias term in their Equation 12). This is the approach we take here. We add the additional constraint that the sum of each row of the weight matrix must be smaller than or equal to 1, enforced by renormalisation on each time step. This constraint encodes the notion that there may be an energetic cost to large synaptic weight matrices, and it prevents infinite growth of the weight matrix.
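The combination of learnable and anchoring weights, together with the row-sum constraint, can be sketched as follows. We assume that the renormalisation is applied to the learnable component only, and that a row is rescaled only when its sum exceeds 1. The random `dW` is a placeholder for the STDP update, and all function names are hypothetical.

```python
import numpy as np


def renormalise_rows(W, max_row_sum=1.0):
    """Rescale any row whose sum exceeds max_row_sum so that it sums to exactly max_row_sum."""
    row_sums = W.sum(axis=1, keepdims=True)
    scale = np.ones_like(row_sums)
    too_big = row_sums > max_row_sum
    scale[too_big] = max_row_sum / row_sums[too_big]
    return W * scale


def driving_weights(W_learn, anchor_scale=0.5):
    """Total CA3 -> CA1 drive: learnable weights plus fixed anchor W^A = anchor_scale * I."""
    return W_learn + anchor_scale * np.eye(W_learn.shape[0])


# Hypothetical online loop: after every (here random) weight update, enforce row sums <= 1
rng = np.random.default_rng(0)
W_learn = np.eye(5)
for _ in range(100):
    dW = 0.05 * rng.standard_normal((5, 5))   # placeholder for the STDP update
    W_learn = renormalise_rows(W_learn + dW)
W_total = driving_weights(W_learn)            # used to drive CA1; the anchor itself is never learned
```

Only `W_learn` would be analysed after learning; the anchor enters solely through `W_total`.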
The resulting evolution of the learnable weight component, ${W}_{ij}(t)$, starting from initialisation at the identity, is shown in panel b (middle: row-aligned averages of ${W}_{ij}(t)$ from $t=0$ to $t=64$ min; the full matrices are also shown) and panel f (full matrix). The weight matrix evolves to look like a successor matrix (long skew left of the diagonal, negative right of the diagonal). One risk, when weights are updated online, is that the asymmetric expansion continues indefinitely. This does not happen, and the matrix stabilises after 15 min (panel e, colour progression). It is important to note that the anchoring component is smaller than the online weight component, and we believe it could be made very small in the limit of less noisy learning (e.g. more cells or higher firing rates).
In panel c, we explore the combination: random weight initialisation and online weight updating. As can be seen, even with rather strong random initial weights learning eventually ‘forgets’ these and settles to the same successor matrix form as when identity initialisation was used.
In panel d, we show that anchoring is essential. Without it (${W}_{ij}^{A}=0$), the weight matrix initially shows some structure, shifting and skewing to the left, but this quickly disintegrates and no observable structure remains at the end of learning.
Many-to-few spiking model
In Figure 2—figure supplement 2, panel e, we simulate the more biologically realistic scenario where each CA1 neuron integrates spikes (rather than rates) from a large (rather than equal) number of upstream CA3 neurons. This is done with two changes:
First, we increased the number of CA3 neurons from 50 to 500 while keeping the number of CA1 neurons fixed. Each CA1 neuron now receives fixed anchoring drive from a Gaussian-weighted sum of the 10 (as opposed to 1) closest CA3 neurons.
Second, in our standard model spikes are used for learning but neurons communicate via their rates. Here we change this so that CA3 spikes directly drive CA1 spikes, in the form of a reduced spiking model. Let ${X}_{i,t}^{\mathrm{CA1}}$ be the spike count of the ${i}^{\text{th}}$ CA1 neuron at timestep $t$ and ${X}_{j,t}^{\mathrm{CA3}}$ the equivalent for the ${j}^{\text{th}}$ CA3 neuron; then, under the reduced spiking model,
As expected, this model behaves very similarly to the original model, since CA3 spikes are noisy samples of their underlying rates. This noise should average out over time, and the simulations confirm this.
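The two changes can be sketched under the following assumptions:

- a 1D unit-length track with no wrap-around;
- evenly spaced place-field centres;
- an arbitrarily chosen Gaussian width `sigma`;
- anchoring rows normalised to sum to 1;
- constant 10 Hz CA3 rates.

The reduced spiking equation itself is not reproduced here. The linear readout `W_A @ X` is used only to check the noise-averaging argument, and `gaussian_anchor_weights` is a hypothetical helper, not the authors' code.

```python
import numpy as np


def gaussian_anchor_weights(ca1_centres, ca3_centres, k=10, sigma=0.02):
    """Hypothetical anchoring matrix W_A (n_CA1 x n_CA3): each CA1 cell is driven by
    a Gaussian-weighted sum of the k CA3 cells with the closest field centres.
    Rows are normalised to sum to 1 (an assumption made for this sketch)."""
    W_A = np.zeros((len(ca1_centres), len(ca3_centres)))
    for i, centre in enumerate(ca1_centres):
        dist = np.abs(ca3_centres - centre)          # 1D track, no wrap-around
        nearest = np.argsort(dist)[:k]               # indices of the k closest CA3 cells
        w = np.exp(-dist[nearest] ** 2 / (2 * sigma ** 2))
        W_A[i, nearest] = w / w.sum()
    return W_A


n_ca3, n_ca1 = 500, 50
ca3_centres = np.linspace(0.0, 1.0, n_ca3)
ca1_centres = np.linspace(0.0, 1.0, n_ca1)
W_A = gaussian_anchor_weights(ca1_centres, ca3_centres)

# Noise-averaging check with constant 10 Hz CA3 rates and 1 ms time steps.
rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 10_000
rates = np.full(n_ca3, 10.0)
spikes = rng.poisson(rates * dt, size=(n_steps, n_ca3))   # X_t^CA3 spike counts

rate_drive = W_A @ (rates * dt)                 # drive computed from the rates
spike_drive = (spikes @ W_A.T).mean(axis=0)     # time-averaged drive computed from spikes
```

Because the readout is linear, the time-averaged spike-driven input converges to the rate-driven input as the number of time steps grows. This is the sense in which the spike noise averages out.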
Figure 2—figure supplement 3: Hyperparameter sweep
We perform a hyperparameter sweep over STDP and phase precession parameters to see which values are optimal for learning successor matrices. Remarkably, the optimal parameters (those giving the highest R² between the weight matrix and the successor matrix) are found to be those, or very close to those, used by biological neurons (Figure 2—figure supplements 2 and 3). To avoid excess computational cost, two independent sweeps were run:

- The first was over the four relevant STDP parameters: the two synaptic plasticity timescales, the ratio of potentiation to depression, and the firing rate.
- The second was over the phase precession parameters: the phase precession spread parameter and the phase precession fraction.
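A standard pair-based exponential STDP window makes the swept plasticity quantities concrete: the two timescales and the balance of potentiation to depression. This sketch assumes that form. The exact rule used in the paper (for example a trace-based online implementation) and its fitted values are not given in this section. The defaults below are therefore placeholders, and `stdp_kernel` and `pairwise_weight_change` are hypothetical names.

```python
import numpy as np


def stdp_kernel(delta_t, a_plus=1.0, tau_plus=20.0, depression_ratio=1.0, tau_minus=40.0):
    """Pair-based exponential STDP window; times in ms.

    delta_t = t_post - t_pre. Pre-before-post (delta_t > 0) potentiates,
    post-before-pre (delta_t < 0) depresses. depression_ratio = A_minus / A_plus.
    Default values are placeholders, not the parameters used in the paper.
    """
    delta_t = np.asarray(delta_t, dtype=float)
    a_minus = depression_ratio * a_plus
    ltp = a_plus * np.exp(-delta_t / tau_plus)
    ltd = -a_minus * np.exp(delta_t / tau_minus)
    return np.where(delta_t > 0, ltp, np.where(delta_t < 0, ltd, 0.0))


def pairwise_weight_change(pre_times, post_times, **kernel_params):
    """Sum the STDP window over all pre/post spike pairs of one synapse."""
    delta = np.subtract.outer(np.asarray(post_times, float), np.asarray(pre_times, float))
    return float(stdp_kernel(delta, **kernel_params).sum())


forward = pairwise_weight_change(pre_times=[0.0], post_times=[10.0])   # pre leads post
backward = pairwise_weight_change(pre_times=[10.0], post_times=[0.0])  # post leads pre
```

When a presynaptic cell reliably fires a few milliseconds before the postsynaptic cell, as theta sweeps arrange for cells ordered along the direction of travel, the net change is positive in that direction and negative in the reverse one.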
In all cases, the optimal value sits close to the biological value used in this paper (panels c and d). One exception is the firing rate: higher firing rates always give better scores, likely because noise has less effect. However, it is reasonable to assume that biology cannot achieve arbitrarily high firing rates, for energetic reasons.
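The R² score can be sketched as follows. The sketch assumes the score is the squared Pearson correlation between the flattened weight and successor matrices, which equals the R² of a least-squares linear fit and so ignores overall scale and offset. The toy successor matrix uses the closed form $M=(I-\gamma T)^{-1}$ for a deterministic rightward walk on a circular track. The paper's TD successor matrix, built from the agent's trajectories and overlapping place-cell features, is not reproduced. A transpose may also be needed to match the ${W}_{ij}$ index convention.

```python
import numpy as np


def successor_matrix(T, gamma):
    """Closed-form successor matrix M = sum_k gamma^k T^k = (I - gamma T)^{-1}.
    M[s, s'] is the expected discounted future occupancy of s' starting from s."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)


def r_squared(W, M):
    """Squared Pearson correlation between the flattened matrices, i.e. the R^2
    of a least-squares linear fit of one onto the other (scale-invariant)."""
    return float(np.corrcoef(np.ravel(W), np.ravel(M))[0, 1] ** 2)


n, gamma = 20, 0.9
T = np.roll(np.eye(n), 1, axis=1)   # deterministic rightward step on a circular track: T[s, s+1] = 1
M = successor_matrix(T, gamma)

score_identity = r_squared(np.eye(n), M)          # identity initialisation: only a partial match
score_rescaled = r_squared(3.0 * M + 0.5, M)      # same structure at a different scale: approximately 1
```

Because this score is invariant to scale, a learned weight matrix can score highly even if its absolute magnitudes differ from the successor matrix.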
Figure 2—figure supplement 4: Phase precession
The optimality of biological phase precession parameters
In Figure 2—figure supplement 3, we ran a hyperparameter sweep over the two parameters associated with phase precession:

- $\kappa $, the von Mises parameter describing how noisy phase precession is;
- $\beta $, the fraction of the full 2π theta cycle that phase precession spans.

For both parameters there is a clear ‘Goldilocks’ zone around the biologically fitted values we chose originally. With too much (large $\kappa $, large $\beta $) or too little (small $\kappa $, small $\beta $) phase precession, performance is worse than at intermediate, biological amounts. According to the central hypothesis of the paper, it makes sense that weak or non-existent phase precession hinders learning. It is initially counterintuitive, however, that strong phase precession also hinders learning.
We speculate that the reasons are as follows:

- When $\beta $ is too large, phase precession spans the full range from 0 to 2π. Because 2π comes just before 0 on the unit circle, a cell firing very late in its receptive field can then fire just before a cell a long distance behind it on the track that is firing very early in the cycle.
- When $\kappa $ is too large, phase precession is too clean. Cells firing at opposite ends of the theta cycle can then never bind, since their spikes never fall within 20 ms of each other.

We illustrate these ideas in Figure 2—figure supplement 4. Panel a describes the phase precession model, and panel b simulates spikes from 4 overlapping place cells. These spikes are shown for weak (panel c), intermediate/biological (panel d) and strong (panel e) phase precession. The right-hand side of panels c, d and e compares the learned weight matrix to the successor matrix, confirming these intuitions about why a phase precession ‘Goldilocks’ zone exists. Only in the intermediate case is the similarity good.
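A sketch of the two parameters is given below. It assumes a linear precession model in which the preferred phase falls from $\pi+\beta\pi$ at field entry to $\pi-\beta\pi$ at field exit, with spike phases drawn from a von Mises distribution of concentration $\kappa$. This is an illustrative stand-in, not necessarily the exact model shown in panel a of that supplementary figure. The helper names are hypothetical.

```python
import numpy as np


def preferred_phase(x, beta):
    """Hypothetical linear precession. x in [0, 1] is the fraction of the field
    already traversed; the preferred phase falls from pi + beta*pi at entry to
    pi - beta*pi at exit, so it spans a fraction beta of the 2*pi theta cycle."""
    return np.pi + beta * np.pi * (1.0 - 2.0 * np.asarray(x, dtype=float))


def sample_spike_phases(x, beta, kappa, rng):
    """Von Mises spike phases around the preferred phase, wrapped to [0, 2*pi).
    Larger kappa means cleaner (less noisy) phase precession."""
    return np.mod(rng.vonmises(preferred_phase(x, beta), kappa), 2 * np.pi)


def resultant_length(phases):
    """Mean resultant length: 1 for perfectly locked phases, near 0 for uniform ones."""
    return float(np.abs(np.mean(np.exp(1j * np.asarray(phases)))))


rng = np.random.default_rng(0)
x_mid = np.full(10_000, 0.5)   # spikes emitted at the centre of the field

clean = sample_spike_phases(x_mid, beta=0.5, kappa=100.0, rng=rng)   # large kappa: very clean
noisy = sample_spike_phases(x_mid, beta=0.5, kappa=0.5, rng=rng)     # small kappa: very noisy

# With beta = 1 the phase range covers the whole cycle: field entry (2*pi) and
# field exit (0) fall on the same point of the circle.
entry_phase, exit_phase = np.mod(preferred_phase([0.0, 1.0], beta=1.0), 2 * np.pi)
```

With $\beta=1$, the first and last spikes of a field pass become indistinguishable on the circle, which is the wrap-around problem described above. A large $\kappa$ gives a mean resultant length close to 1, leaving little temporal jitter to bridge distant phases. A small $\kappa$ spreads spikes around the whole cycle.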
Phase precession of CA1
In most results shown in this paper, the weights are anchored to the identity during learning. This means each CA1 cell inherits phase precession from the one and only CA3 cell that drives it. It is therefore important to establish whether CA1 still phase precesses when it is driven by multiple CA3 cells. This happens after learning or, equivalently, during learning when the weights are not anchored. Analysing the spiking data from CA1 cells after learning (with phase precession turned on) shows that they do phase precess. This phase precession is only slightly noisier than that of a CA3 cell. It also compares favourably to real phase precession data for CA1 neurons (panel f, right, adapted from Jeewajee et al., 2014).
The reason is that CA1 cells are still localised and are therefore driven mostly by nearby CA3 cells, which peak in activity together at a similar phase each theta cycle. As the agent moves through the CA1 cell's receptive field, it also moves through the fields of these CA3 cells. Their peak firing phase precesses, driving an earlier peak in CA1 firing. Phase precession in CA1 after learning is noisier and broader than in CA3, but far from non-existent.
Phase shift between CA3 and CA1
In Figure 2—figure supplement 4g, we simulate the effect of a decreasing phase shift between CA3 and CA1. Mizuseki et al., 2012 observed such a shift between CA3 and CA1 neurons. It starts at around 90° at the end of each theta cycle (where cells fire as their receptive field is first entered) and decreases to 0° at the start. We simulate this by adding a temporal delay to all downstream CA1 spikes, equivalent to phase shifts of 0°, 45° and 90°. The average of the weight matrices learned over all three cases still displays clear SR-like structure.
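The phase shifts translate into spike delays as sketched below. The sketch assumes a 10 Hz theta rhythm, since the theta frequency is not stated in this section. Under that assumption, 0°, 45° and 90° correspond to 0, 12.5 and 25 ms. The helper names are hypothetical.

```python
import numpy as np


def phase_shift_to_delay_ms(shift_deg, theta_freq_hz=10.0):
    """Convert a CA3-CA1 theta phase shift in degrees to a delay in ms,
    assuming a fixed theta frequency (10 Hz is an assumption)."""
    period_ms = 1000.0 / theta_freq_hz
    return shift_deg / 360.0 * period_ms


def delay_ca1_spikes(spike_times_ms, shift_deg, theta_freq_hz=10.0):
    """Shift every downstream CA1 spike time by the same phase-equivalent delay."""
    return np.asarray(spike_times_ms, dtype=float) + phase_shift_to_delay_ms(shift_deg, theta_freq_hz)


delays = {shift: phase_shift_to_delay_ms(shift) for shift in (0, 45, 90)}
```

At 10 Hz the largest shift, 25 ms, is comparable in size to the roughly 20 ms window within which spikes must fall to bind.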
Data availability
All code associated with this project can be found at https://github.com/TomGeorge1234/STDP-SR (George, 2023, copy archived at swh:1:rev:f126330b993d50cee021b1c356077bdab80299f4). There are no raw or external datasets associated with this project.
References

Local remapping of place cell firing in the Tolman detour task. The European Journal of Neuroscience 33:1696–1705. https://doi.org/10.1111/j.1460-9568.2011.07653.x

A model of spatial map formation in the hippocampus of the rat. Neural Computation 8:85–93. https://doi.org/10.1162/neco.1996.8.1.85

Prospective coding by spiking neurons. PLOS Computational Biology 12:e1005003. https://doi.org/10.1371/journal.pcbi.1005003

Robotic and neuronal simulation of the hippocampus and rat navigation. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 352:1535–1543. https://doi.org/10.1098/rstb.1997.0140

Dual coding with STDP in a spiking recurrent neural network model of the hippocampus. PLOS Computational Biology 6:e1000839. https://doi.org/10.1371/journal.pcbi.1000839

Grid cells form a global representation of connected environments. Current Biology 25:1176–1182. https://doi.org/10.1016/j.cub.2015.02.037

Intraseptal administration of muscimol produces dose-dependent memory impairments in the rat. Behavioral and Neural Biology 52:357–369. https://doi.org/10.1016/s0163-1047(89)90472-x

Pharmacology and nerve-endings (Walter Ernest Dixon memorial lecture). Proceedings of the Royal Society of Medicine 28:319–332.

Model-based reinforcement learning as cognitive search: neurocomputational theories. In: Todd PM, editors. Cognitive Search: Evolution, Algorithms and the Brain. MIT Press. pp. 195–208.

Neurobiological successor features for spatial navigation. Hippocampus 30:1347–1355. https://doi.org/10.1002/hipo.23246

Predictive maps in rats and humans for spatial navigation. Current Biology 32:3676–3689. https://doi.org/10.1016/j.cub.2022.06.090

Reinforcement learning in continuous time and space. Neural Computation 12:219–245. https://doi.org/10.1162/089976600300015961

The reorganization and reactivation of hippocampal maps predict spatial memory performance. Nature Neuroscience 13:995–1002. https://doi.org/10.1038/nn.2599

A goal-directed spatial navigation model using forward trajectory planning based on grid cells. The European Journal of Neuroscience 35:916–931. https://doi.org/10.1111/j.1460-9568.2012.08015.x

The successor representation: its computational logic and neural substrates. The Journal of Neuroscience 38:7193–7200. https://doi.org/10.1523/JNEUROSCI.0151-18.2018

Grid cells, place cells, and geodesic generalization for spatial reinforcement learning. PLOS Computational Biology 7:e1002235. https://doi.org/10.1371/journal.pcbi.1002235

Neural signatures of cell assembly organization. Nature Reviews. Neuroscience 6:399–407. https://doi.org/10.1038/nrn1669

Theta phase precession of grid and place cell firing in open environments. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 369:20120532. https://doi.org/10.1098/rstb.2012.0532

Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. The Journal of Neuroscience 27:12176–12189. https://doi.org/10.1523/JNEUROSCI.3761-07.2007

Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience 21:1609–1617. https://doi.org/10.1038/s41593-018-0232-z

Neuronal dynamics of predictive coding. The Neuroscientist 7:490–495. https://doi.org/10.1177/107385840100700605

The successor representation in human reinforcement learning. Nature Human Behaviour 1:680–692. https://doi.org/10.1038/s41562-017-0180-8

Modeling boundary vector cell firing given optic flow as a cue. PLOS Computational Biology 8:e1002553. https://doi.org/10.1371/journal.pcbi.1002553

The role of the hippocampus in solving the Morris water maze. Neural Computation 10:73–111. https://doi.org/10.1162/089976698300017908

Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology 13:e1005768. https://doi.org/10.1371/journal.pcbi.1005768

Loss of recent memory after bilateral hippocampal lesions. Journal of Neurology, Neurosurgery & Psychiatry 20:11–21. https://doi.org/10.1136/jnnp.20.1.11

Design principles of the hippocampal cognitive map. Advances in Neural Information Processing Systems 27.

The hippocampus as a predictive map. Nature Neuroscience 20:1643–1653. https://doi.org/10.1038/nn.4650

Functional organization of the hippocampal longitudinal axis. Nature Reviews. Neuroscience 15:655–669. https://doi.org/10.1038/nrn3785

Reinforcement learning: an introduction. IEEE Transactions on Neural Networks 9:1054. https://doi.org/10.1109/TNN.1998.712192

Efficient computation of optimal actions. PNAS 106:11478–11483. https://doi.org/10.1073/pnas.0710743106

A neurally plausible model learns successor representations in partially observable environments. In: Jordan MI, editors. Advances in Neural Information Processing Systems. MIT Press. pp. 5–6.

When animals misbehave: analogs of human biases and suboptimal choice. Behavioural Processes 112:3–13. https://doi.org/10.1016/j.beproc.2014.08.001
Decision letter

Michael J Frank, Senior and Reviewing Editor; Brown University, United States

Michael E Hasselmo, Reviewer; Boston University, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Rapid learning of predictive maps with STDP and theta phase precession" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Michael E. Hasselmo (Reviewer #1).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) Significantly more discussion of the work's relationship to relevant prior models of the hippocampus (as described by Reviewer #1)
2) New simulations that address Reviewer 2's concerns about biological plausibility.
3) Analysis that sheds light on why theta sequences + STDP approximates the TD algorithm (as described by Reviewer #2).
The second essential revision above may involve significant restructuring of the modeling approach. If the authors wish to undertake this, we will be happy to consider the substantially revised version for publication in eLife.
Reviewer #1 (Recommendations for the authors):
Page 4 – top line – "in the successor representation this is because CA3 place cells to the left…". I think this is confusing as the STDP model essentially generates the same effect. I think this should say: "In the network trained by Temporal Difference learning this is because CA3 place cells to the left…". This better description is used further down where the text says "between STDP and TD weight matrices". Throughout the manuscript
Page 4 – end of the first paragraph – "potentially becoming negative" – it is disconcerting to have this discussion of the idea of synaptic weights going from positive to negative in the context of the STDP model. One of the main advantages of this model is its biological realism, so it should not so casually mention violating Dale's law and having the synapse magically switch from being glutamatergic to GABAergic. This is disturbing to a neuroscientist.
Page 4 "is an essential element of this process." – The importance of theta phase precession to sequence learning with STDP has been discussed in numerous previous papers. For example, in a series of four papers in 1996, Jensen and Lisman describe in great detail a buffer mechanism for generating theta phase precession, and show how this allows encoding of a sequence. This is also explicitly discussed in Koene, Gorchetnikov, Cannon, and Hasselmo, Neural Networks, 2003, in terms of a spiking window of LTP less than 40 msec that requires a short-term memory buffer to allow spiking within this window.
Page 4 – "our model and the successor representation" – again this is confusing and should instead contrast "our model and the TD trained successor representation"
Page 6 – "in observed" – is observed.
Page 6 – "binding across the different sizes" – This needs to be stated more clearly in the text as it is very vague. I would suggest adding the phrase: "regardless of the scale difference".
Figure 4D – "create a physical barrier" – this is very ambiguous as it recalls a physical barrier in the environment as between two rooms – should instead say "created an anatomical segregation".
Page 8 – "hallmarks of successor representations" – there should be citations for what paper shows these hallmarks of the successor representation.
Page 8 – "arrive in the order" – Here is a location where citations to previous papers on the use of a phase precession buffer to correctly time spiking for STDP should be added (i.e. Jensen and Lisman, 1996; Koene et al. 2003).
Page 8 – "via Hebbian learning alone" – add "without theta phase precession" to be clear about what is not being included (since it could be anything such as other aspects of a learning rule).
Page 9 – "for spiking a feedforward network" – what does this mean – do they mean "for spiking in a feedforward network"? Aren't these other network mechanisms less biological realistic than the one presented here? I'd like to see some critical comparison between the models.
Page 9 – "makes a clear prediction…should impact subsequent navigation and the formation of successor features" – This is not a clear prediction but is instead circular – it essentially says – "if successor representations are not formed successor representations will not be observed" This is not much use to an experimentalist. This prediction should be stated in terms of a clear experimental prediction that refers only to physical testable quantities in an experiment and not circularly referring to the same vague and abstract concept of successor representations.
Page 9 – "to reach a hidden goal" – A completely different hippocampal modeling framework was used to model the finding of hidden goals in the Morris water maze in Erdem and Hasselmo, 2012, Eur. J. Neurosci and earlier work by Redish and Touretzky 1998, Neural Comp. To clarify the status of the successor representation framework relative to these older models that do not use successor representations, it would be very useful to have a few sentences of discussion about how the successor representation differs and is somehow either advantageous or biologically more realistic than these earlier models.
Page 9 "Lesions of the medial septum" – inactivation of the medial septum has also been shown to impair performance in Morris water maze (Chrobak et al. 2006).
Page 9 – "physical barrier to binding" – this is again very confusing as there is no physical barrier in the hippocampus. They should instead say "anatomical segregation"
Citation 32 – Momennejad and Howard, 2018 – This is a very important citation and highly relevant to the discussion. However, I think it should just be cited as bioRxiv. It is confusing to call it a preprint.
Reviewer #2 (Recommendations for the authors):
This is an interesting study, and I enjoyed reading it. However, I have a number of concerns, particularly regarding the biological plausibility of the model, that I believe can be addressed with additional simulations and analysis.
– I had a number of concerns regarding the biological plausibility of the model and the choice of parameter settings, especially:
1) Mapping from rates to rates. The CA3 neurons act on CA1 neurons via their firing rate rather than their spikes, but the STDP rule acts on the spikes. What happens if the CA1 neurons are driven by the synaptically filtered CA3 spikes rather than the underlying rates? How does the model perform, and how does the performance vary with the number of CA3 neurons (since more neurons may be required in order to average over the stochastic spikes)?
2) Weights are initialised as ${W}_{ij}={\delta}_{ij}$, meaning a 1-1 correspondence from CA3 to CA1 cells. This would have been ok, except that the weights are not updated during learning – they are held fixed during the entire learning phase and only updated on aggregate after learning. Thus, during the entire learning process each CA1 cell is driven by exactly 1 CA3 cell, and therefore simply inherits (or copies) the activity of that CA3 cell (according to equation 2). If either 1) a more realistic weight initialisation were used (e.g., random) or 2) weights were updated online during learning, it seems likely that the proposed mechanism would no longer work.
3) Lack of discussion of phase precession in CA1 cells. What are the theta firing patterns of CA1 (successor) cells in the model? Do they exhibit theta sequences and/or phase precession? We are never told this. The spike phase of the downstream CA1 cell is extremely important for STDP, as it determines whether synapses associated with past or future events are potentiated or suppressed (see Figure 8 of Chadwick et al. 2016, eLife). Based on my understanding, in the current setup CA1 place cells should produce phase precession during learning (before weights are updated), but only because each CA1 cell copies the activity of exactly one CA3 cell, which is unrealistic. Moreover, after the weights are updated, whether they produce phase precession is no longer clear. It is important to determine whether the proposed mechanism works in the more realistic scenario in which both CA3 and CA1 cells exhibit phase precession, but CA1 cells are driven by multiple CA3 cells.
4) Related to the preceding comment, there is a phase shift/delay between CA3 and CA1 (Mizuseki, Buzsaki et al., 2010). This doesn't seem to have been taken into account. Can the model be set up so that i) CA1 cells receive inputs from multiple CA3 cells ii) both CA3 and CA1 cells exhibit phase precession iii) there is the appropriate phase delay between CA3 and CA1?
5) Dependence of learning on the noisiness of phase precession. The hyperparameter sweep seems to omit some of the most important variables, such as the spread parameter (kappa) and the place field width and running speed (see next comment). Since the successor representation is shown to be learned well when kappa=1 but not when kappa=0 (i.e. when phase precession is removed), this leaves open the question of what happens when kappa is bigger or smaller than 1. It would be nice to see kappa systematically varied and the consequences explored.
6) Wide place fields and slow speeds. Place fields in the model have a diameter of 2 metres. This is quite big – bigger than typical place field sizes in the dorsal hippocampus (which often have around 30 cm diameter, or 15 cm radius). Moreover, the chosen velocity of 16 cm/s is quite slow, and rats often run much faster in experiments (30 cm/s and higher). With the chosen parameters, it takes the rodent 12.5 s to traverse a place field, which is unrealistically long. My concern is that this setup leads to a large number of spikes per pass through a place field and that this unrealistic setting is needed for the proposed mechanism to learn effectively in a reasonable number of laps. What happens when place fields are smaller and running speeds faster, as is typically found in experiments? How many laps are required for convergence?
7) Running speed-dependence of phase precession and firing rate. The rat is assumed to run at a fixed speed – what happens when speed is allowed to vary? Running speed has profound effects on the firing of place cells, including i) a change in their rate of phase precession ii) a change in their firing rate (Huxter et al., 2003). More simulations are needed in which running speed varies lap-by-lap, and/or within laps.
8) Two-dimensional phase precession. There is debate over how 2D environments are encoded in the theta phase (Chadwick et al. 2015, 2016; Huxter et al., 2008; Climer et al., 2013; Jeewajee et al., 2013). This should be mentioned and discussed – how much do the results depend on the specific assumptions regarding phase precession in 2D? For example, Huxter et al. found that, when animals pass through the edge of a place field, the cell initially precesses but then processes back to its initial phase, but this isn't captured by the model used in the present study. Chadwick et al. (2016) proposed a model of two-dimensional phase precession based on the phase locking of an oscillator, which reproduces the findings of Huxter et al. and makes different predictions for phase precession in two dimensions than the Jeewajee model used by the authors. It would be nice to test alternative models for 2D phase precession and determine how well they perform in terms of generating successor-like representations.
9) Modelling the distribution of place field sizes along the dorsoventral axis. Two important phenomena were omitted that are likely important and could alter the conclusions. First, there is a phase gradient along the dorsoventral axis, which generates travelling theta waves (Patel, Buzsáki et al., 2012; Lubenov and Siapas, 2009). How do the results change when including a 180° (or 360°) phase gradient along the DV axis? The authors state that "A consequence of theta phase precession is that the cell with the smaller field will phase precess faster through the theta cycle than the other cell – initially it will fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it will fire earlier" – this neglects to consider the phase gradient along the DV axis (see also Leibold and Monsalve-Mercado, 2017). Second, the authors chose three discrete place field sizes for their dorsoventral simulations. How would these simulations look if a continuum of sizes were used reflecting the gradient along the dorsoventral axis? Going further, CA1 cells likely receive input from CA3 cells with a distribution of place field sizes rather than a single place field size – how would the model behave in that case?
– There is no theoretical analysis of why theta sequences+STDP approximates the TD algorithm, or when the proposed mechanism might/might not work. The model is simple enough that some analysis should be possible. It would be nice to see this elaborated on – can a reduced model be obtained that captures the learning algorithm embodied by theta sequences+STDP, and does this reduced model reveal an explicit link to the TD algorithm? If not, then why does it work, and when might it generalise/not work?
– The comparison of successor features to neural data was qualitative rather than quantitative, and often quite vague. This makes it hard to know whether the predictions of the model are actually consistent with real neural data. It would be much preferred if a direct quantitative comparison of the learned successor features to real data could be performed, for example, the properties of place fields near to doorways.
– Statistical structure of theta sequences. The model used by the authors is identical to that of Chadwick et al. (2015) (except for the thresholding of the Gaussian field), and so implicitly assumes that theta sequences are generated by the independent phase precession of each place cell. However, the authors mention in the introduction that other studies argue for the coordination of place cells, such that theta sequences can represent alternative futures on consecutive theta cycles (Kay et al.). This begs the question: how important is the choice of an independent phase precession model for the results of this study? For example, if the authors were to simulate a T-maze, would a model which includes cycling of alternative futures learn the successor representation better or worse than the model based on independent coding? Given that there now is a large literature exploring the coordination of theta sequences and their encoded trajectories, it would be nice to see some discussion of how the proposed mechanism depends on/relates to this.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Rapid learning of predictive maps with STDP and theta phase precession" for further consideration by eLife. Your revised article has been evaluated by Michael Frank (Senior Editor) and the Reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:
1. Spiking model. We all agree with you that a full spiking model would be much too complex. However, since you already generate spikes using a Poisson process, it would be useful to see a simulation where the Poisson rate of CA1 cell is determined by the integration of the incoming CA3 spikes (perhaps with many incoming CA3 neurons). If this doesn't work, you should discuss why this is the case and what the implications are for the model.
2. CA3 => CA1 projections. CA1 cells still receive input from just one CA3 cell for each place field in the updated model (at least in the majority of simulations). This allows precise theta timing of the pre- and postsynaptic neurons which appears to be critical for the plasticity rule to function. For example, the mathematics of Geisler et al. 2007 shows that, if the CA1 cell would receive input from a set of phase precessing CA3 cells with spatially offset place field and a Gaussian weight profile (the most common way to model CA3-CA1 connections), then the CA1 cell would actually fire at the LFP theta frequency and wouldn't phase precess, and as a consequence the STDP mechanism would no longer learn the successor representation. This suggests strong constraints on the conditions under which the model can function which are currently not being adequately discussed. This should be investigated and discussed, and the constraints required for the model to function should be plainly laid out.
3. A similar concern holds with the phase offset between CA3 and CA1 found by Mizuseki et al. The theta+STDP mechanism learns the successor representation because the CA1 cells inherit their responses from a phase-precessing upstream CA3 cell, so the existence of a phase lag is troubling, because it suggests that CA1 cells are not driven causally by CA3 cells in the way the model requires. You may be right that, if some external force were to artificially impose a fixed lag between the CA3 and CA1 cell, the proposed learning mechanism would still function but now with a spatial offset. However, the Reviewer was concerned that the very existence of the phase lag challenges the basic spirit of the model, since CA1 cells are not driven by CA3 cells in the way that is required to learn causal relationships. At the very least, this needs to be addressed and discussed directly and openly in the Discussion section, but it would be better if the authors could implement a solution to the problem to show that the model can work when an additional mechanism is introduced to produce the phase lag (for example, a combination of EC and CA3 inputs at different theta phases?)
4. DV phase precession. The Reviewer would still like to see you introduce DV phase lags, which could be done with a simple modification of the existing simulations. At minimum, it is critical to remove/modify the sentence "A consequence of theta phase precession is that the cell with the smaller field will phase precess faster through the theta cycle than the other cell – initially it will fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it will fire earlier." As R2 noted in their original review, this is not the case when DV phase lags are taken into account, as was shown by Leibold and Monsalve-Mercado (2017). Ideally, the simulations should be updated to account for the DV phase lags, and the discussion updated to account for their functional implications.
Reviewer #1 (Recommendations for the authors):
I am satisfied with the response of the authors to the reviewer's comments.
Reviewer #2 (Recommendations for the authors):
While the authors have undertaken a number of important additional analyses which address some of the concerns raised in the review, several of the most pressing concerns regarding biological plausibility have not been addressed. In particular, each CA1 place field is still inherited from exactly 1 CA3 place field in the updated protocol, and cells still interact via their firing rates with spikes only being used for the weight updates. Moreover, the authors chose not to address concerns regarding quantitative comparisons between the model and data. Overall, while the authors correctly point out that their primary contribution should be viewed as illustrating a mechanism to learn successor representations via phase precession and STDP, this message is undermined if the proposed mechanism can't function when reasonable assumptions are made regarding the number of cells and their mode of interaction.
Detailed points below:
1) In the updated protocol where CA1 cells receive inputs from multiple CA3 cells, the model still copies CA3 place fields to CA1 place fields in a 1-1 manner. This is not biologically plausible, since receptive fields in the brain are formed by integration of thousands of synaptic inputs from cells with spatially offset but overlapping receptive fields. Moreover, neurons in the model still interact from rates to rates, with plasticity instead acting only on spikes. The authors could have addressed these two concerns jointly by having CA1 cells integrate input from a large number of spiking CA3 neurons with spatially overlapping place fields and plastic synapses, but since the authors chose not to do so, I can only assume that the model doesn't work when realistic assumptions are incorporated. Such an approach needn't involve simulating a full spiking network as the authors suggest – rather, a GLM/LNP style model can be used to model CA1 spikes in response to CA3 spiking input. Moreover, I do not see any reason why this should complicate the comparison to the TD successor representation as suggested by the authors, as the model would still have a continuous rate underlying the Poisson process that could be used to this end. If the proposed model can't be made to work with realistic numbers of CA3 neurons (with realistic firing rates and plastic synapses), then the proposed mechanism is not a plausible learning rule for the hippocampus, which undercuts the central message of the study.
2) The authors chose not to perform a quantitative comparison of the model to experimental data (e.g., clustering of place fields around doorways etc.), leaving a central concern unaddressed. While I understand that theories of the hippocampal successor representation more generally have been compared to data, the lack of quantitative comparison of the particular model proposed in this study is still troubling to me.
3) Many other concerns were not addressed, such as:
– The phase shift between CA3 and CA1. While the authors may be correct that, if a phase shift were artificially imposed on the model, this would entail a spatial shift along the track, the model as it stands is premised on the notion that CA1 cells inherit their activity entirely from upstream CA3 cells, and the model predicts that the two regions are in phase with one another. If a phase shift were imposed by another mechanism (e.g. EC input), then CA1 cells would no longer inherit their responses from CA3, and the proposed mechanism for learning the successor representation would no longer function. Thus, it seems essential to the proposed model that CA3 and CA1 are in phase, in contrast to experimental data.
– The phase shift along the DV axis and its impact on phase relationships. In the revised manuscript, the authors still say "A consequence of theta phase precession is that the cell with the smaller field will phase precess faster through the theta cycle than the other cell – initially it will fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it will fire earlier.", but as pointed out in the original review (and shown by Leibold et al.), this is not true when the DV phase shift is included. I see no reason why unrealistic assumptions should be made in the model regarding DV phase precession.
https://doi.org/10.7554/eLife.80663.sa1
Author response
Essential revisions:
1) Significantly more discussion of the work's relationship to relevant prior models of the hippocampus (as described by Reviewer #1)
We have added a large quantity of text addressing the work’s relationship to relevant prior models of the hippocampus. We have added substantially to the introduction and discussion, and also have made other additions throughout the results to provide better context.
2) New simulations that address Reviewer 2's concerns about biological plausibility.
We have performed several new simulations, producing new results that speak to the model’s robustness and biological plausibility, constituting 3 entirely new multipanel supplementary figures examining the effects on the model of place field size, running speed, phase precession parameters, weight initialisation, weight update regimes and downstream phase precession in CA1.
3) Analysis that sheds light on why theta sequences + STDP approximates the TD algorithm (as described by Reviewer #2).
A significant new theoretical section provides mathematical insight as to why a combination of STDP and theta phase precession can approximate the temporal difference learning algorithm.
Reviewer #1 (Recommendations for the authors):
Page 4 – top line – "in the successor representation this is because CA3 place cells to the left…". I think this is confusing as the STDP model essentially generates the same effect. I think this should say: "In the network trained by Temporal Difference learning this is because CA3 place cells to the left…". This better description is used further down where the text says "between STDP and TD weight matrices". The same change should be made throughout the manuscript.
Thank you for this suggestion. We’ve gone through the text and implemented this change where the issue arises, as well as adding the sentence clarifying our terms (described in our response to point 4 of the public review).
Page 4 – end of the first paragraph – "potentially becoming negative" – it is disconcerting to have this discussion of the idea of synaptic weights going from positive to negative in the context of the STDP model. One of the main advantages of this model is its biological realism, so it should not so casually mention violating Dale's law and having the synapse magically switch from being glutamatergic to GABAergic. This is disturbing to a neuroscientist.
Thank you for this valid point – we’ve added the following line to follow that sentence:
“So, for example, if a postsynaptic neuron reliably precedes its presynaptic cell on the track, the corresponding weight will be reduced, potentially becoming negative. We note that weights changing their sign is not biologically plausible, as it is a violation of Dale’s Law [43]. This could perhaps be corrected with the addition of global excitation or by recruiting inhibitory interneurons.”
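For illustration, a pair-based exponential STDP window shows how a postsynaptic cell that reliably fires shortly before its presynaptic partner drives the weight below zero, and how simply clipping at zero would keep the sign fixed. The kernel shape, the 20 ms time constants, the learning rate and the function names are assumptions for this sketch, not the study's parameters; clipping is a cruder fix than the global excitation or inhibitory interneurons mentioned in the quoted text.

```python
import numpy as np

def stdp_kernel(dt, a_plus=1.0, a_minus=1.0, tau_plus=0.02, tau_minus=0.02):
    """Pair-based exponential STDP window; dt = t_post - t_pre in seconds.

    dt > 0 (pre fires before post) gives potentiation; dt <= 0 gives depression.
    Amplitudes and 20 ms time constants are illustrative assumptions.
    """
    dt = np.asarray(dt, dtype=float)
    return np.where(dt > 0,
                    a_plus * np.exp(-np.abs(dt) / tau_plus),
                    -a_minus * np.exp(-np.abs(dt) / tau_minus))

def accumulate_weight(pre_times, post_times, w0=0.0, lr=0.01, clip_at_zero=False):
    """Sum the kernel over all pre/post spike pairs, starting from weight w0.

    clip_at_zero=True keeps the weight non-negative, one crude way of
    respecting Dale's law.
    """
    dts = np.subtract.outer(np.asarray(post_times), np.asarray(pre_times)).ravel()
    w = float(w0 + lr * stdp_kernel(dts).sum())
    return max(w, 0.0) if clip_at_zero else w

# The postsynaptic cell fires ~10 ms before the presynaptic cell in two theta cycles.
pre = [0.010, 0.135]
post = [0.000, 0.125]
print(accumulate_weight(pre, post))                     # negative (about -0.012)
print(accumulate_weight(pre, post, clip_at_zero=True))  # 0.0
```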
Page 4 – "is an essential element of this process." – The importance of theta phase precession to sequence learning with STDP has been discussed in numerous previous papers. For example, in a series of four papers in 1996, Jensen and Lisman describe in great detail a buffer mechanism for generating theta phase precession, and show how this allows encoding of a sequence. This is also explicitly discussed in Koene, Gorchetnikov, Cannon, and Hasselmo, Neural Networks, 2003, in terms of a spiking window of LTP less than 40 msec that requires a short-term memory buffer to allow spiking within this window.
We agree that the paper would benefit from better connection with the prior work on sequence learning with STDP and have added text to the introduction and discussion. In the introduction, we have added:
“One of the consequences of phase precession is that correlates of behaviour, such as position in space, are compressed onto the timescale of a single theta cycle and thus coincide with the time window of STDP O(20 − 50 ms) [8, 18, 20, 21]. This combination of theta sweeps and STDP has been applied to model a wide range of sequence learning [22, 23, 24], and as such, potentially provides an efficient mechanism to learn from an animal’s experience – forming associations between cells which are separated by behavioural timescales much larger than that of STDP.”
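The compression described in this passage can be made concrete with a linear phase precession model: two cells whose field centres are d apart are reached d/v seconds apart behaviourally, but within a theta cycle their spikes are separated by only about beta * T_theta * d / L, where L is the field length and beta the fraction of the cycle swept by precession. This sketch assumes an 8 Hz theta rhythm, beta = 0.5 and hypothetical field and speed values; the function name is illustrative.

```python
def theta_compressed_lag(separation_m, field_length_m, speed_m_s,
                         theta_hz=8.0, beta=0.5):
    """Spike-time offset between two co-active cells, with and without compression.

    Assumes linear phase precession sweeping a fraction `beta` of the theta cycle
    across a field of length `field_length_m`; valid for separation < field length.
    Returns (behavioural_lag_s, within_cycle_lag_s).
    """
    if not 0 <= separation_m < field_length_m:
        raise ValueError("cells must have overlapping fields")
    behavioural_lag = separation_m / speed_m_s      # time between reaching each field centre
    theta_period = 1.0 / theta_hz
    within_cycle_lag = beta * theta_period * separation_m / field_length_m
    return behavioural_lag, within_cycle_lag

# Hypothetical values: 40 cm fields, centres 10 cm apart, running at 30 cm/s.
behavioural, compressed = theta_compressed_lag(0.10, 0.40, 0.30)
print(f"{behavioural * 1e3:.0f} ms -> {compressed * 1e3:.1f} ms")   # 333 ms -> 15.6 ms
```

A behavioural lag of a third of a second, far outside the STDP window, is mapped onto roughly 16 ms, inside it.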
And we’ve included a paragraph in the discussion to make this clear. This is contained in the paragraph above, in our response to point 1 in the public review (see paragraph starting “That the predictive skew of place fields can be accomplished…”).
Page 4 – "our model and the successor representation" – again this is confusing and should instead contrast "our model and the TD trained successor representation"
Thank you, we have made this change to the text.
Page 6 – "in observed" – is observed.
Thank you – fixed.
Page 6 – "binding across the different sizes" – This needs to be stated more clearly in the text as it is very vague. I would suggest adding the phrase: "regardless of the scale difference".
Thank you for the suggestion – we have implemented this change.
Figure 4D – "create a physical barrier" – this is very ambiguous as it recalls a physical barrier in the environment as between two rooms – should instead say "created an anatomical segregation".
Thank you for the suggestion – we have implemented this change.
Page 8 – "hallmarks of successor representations" – there should be citations for what paper shows these hallmarks of the successor representation.
Thank you – we have added citations to Stachenfeld et al. 2014, Stachenfeld et al. 2017, and de Cothi and Barry 2020 to this sentence.
Page 8 – "arrive in the order" – Here is a location where citations to previous papers on the use of a phase precession buffer to correctly time spiking for STDP should be added (i.e. Jensen and Lisman, 1996; Koene et al. 2003).
Thank you for the suggestion – we have implemented this change.
Page 8 – "via Hebbian learning alone" – add "without theta phase precession" to be clear about what is not being included (since it could be anything such as other aspects of a learning rule).
Thank you for the suggestion – we have implemented this change.
Page 9 – "for spiking a feedforward network" – what does this mean – do they mean "for spiking in a feedforward network"? Aren't these other network mechanisms less biologically realistic than the one presented here? I'd like to see some critical comparison between the models.
Thank you for spotting this, this was actually a typo: the sentence should read “for a spiking feedforward network”, which in this case semantically alters the meaning.
Page 9 – "makes a clear prediction…should impact subsequent navigation and the formation of successor features" – This is not a clear prediction but is instead circular – it essentially says – "if successor representations are not formed successor representations will not be observed" This is not much use to an experimentalist. This prediction should be stated in terms of a clear experimental prediction that refers only to physical testable quantities in an experiment and not circularly referring to the same vague and abstract concept of successor representations.
We have addressed both of these points with changes to the same paragraph, so we have condensed them for readability. Firstly, we agree our stated “clear prediction” of the model was, in fact, unclear. We have rewritten the paragraph (see below) to clarify what we meant by this. Further, we were unable to locate the Chrobak et al., 2006 reference, but found a Chrobak et al., 1989 that matches this description. This is indeed relevant and we have added a citation (let us know if this was not the intended reference or if there is an additional relevant one):
Chrobak, J. J., Stackman, R. W., and Walsh, T. J. (1989). Intraseptal administration of muscimol produces dose-dependent memory impairments in the rat. Behavioral and Neural Biology, 52(3), 357–369. https://doi.org/10.1016/S0163-1047(89)90472-X
However, we noted that this paper uses a muscimol inactivation of the medial septum, which was shown by Bolding et al. 2019 to disrupt place-related firing as well as theta-band activity, so it is possible that the disruption to place code is what is driving the navigational deficit. Also, we accidentally referred to the inactivations performed by Bolding and colleagues as lesions, but in fact they performed temporary inactivations with a variety of drugs (tetracaine, muscimol, gabazine; the latter of which disrupted theta but left place-related firing intact).
We have modified our paragraph describing these points and the predictions of our model as follows:
“Our theory makes the prediction that theta contributes to learning predictive representations, but is not necessary to maintain them. Thus, inhibiting theta oscillations during exposure to a novel environment should impact the formation of successor features (e.g., asymmetric backwards skew of place fields) and subsequent memory-guided navigation. However, inhibiting theta in a familiar environment in which experience-dependent changes have already occurred should have little effect on the place fields: that is, some asymmetric backwards skew of place fields should be intact even with theta oscillations disrupted. To our knowledge this has not been directly measured, but there are some experiments that provide hints. Experimental work has shown that power in the theta band increases upon exposure to novel environments [62] – our work suggests this is because theta phase precession is critical for learning and updating predictive maps for spatial navigation. Furthermore, it has been shown that place cell firing can remain broadly intact in familiar environments even with theta oscillations disrupted by temporary inactivation or cooling [63, 64]. It is worth noting, however, that even with intact place fields, these theta disruptions impair the ability of rodents to reach a hidden goal location that had already been learned, suggesting theta oscillations play a role in navigation behaviours even after initial learning [63, 64]. Other work has also shown that muscimol inactivations to medial septum can disrupt acquisition and retrieval of the memory of a hidden goal location [65, 66], although it is worth noting that these papers use muscimol lesions which Bolding and colleagues show also disrupt place-related firing, not just theta precession.”
Page 9 – "to reach a hidden goal" – A completely different hippocampal modeling framework was used to model the finding of hidden goals in the Morris water maze in Erdem and Hasselmo, 2012, Eur. J. Neurosci and earlier work by Redish and Touretzky 1998, Neural Comp. To clarify the status of the successor representation framework relative to these older models that do not use successor representations, it would be very useful to have a few sentences of discussion about how the successor representation differs and is somehow either advantageous or biologically more realistic than these earlier models.
We agree this would be helpful, and have added the following text to the discussion:
“A number of other models describe how physiological and anatomical properties of hippocampus may produce circuits capable of goal-directed spatial navigation [30, 27, 23]. These models adopt an approach more characteristic of model-based RL, searching iteratively over possible directions or paths to a goal [30] or replaying sequences to build an optimal transition model from which sampled trajectories converge toward a goal [27] (this model bears some similarities to the SR that are explored by [40], which shows that under certain assumptions, dynamics converge to SR under a similar form of learning). These models rely on dynamics to compute the optimal trajectory, while the SR realises the statistics of these dynamics in the rate code and can therefore adapt very efficiently. Thus, the SR retains some efficiency benefits. The models cited above are very well-grounded in known properties of hippocampal physiology, including theta precession and STDP, whereas until recently, SR models have enjoyed a much looser affiliation with exact biological mechanisms. Thus, a primary goal of this work is to explore how hippocampal physiological properties relate to SR learning as well.”
Page 9 – "physical barrier to binding" – this is again very confusing as there is no physical barrier in the hippocampus. They should instead say "anatomical segregation".
Thank you for the suggestion – we have implemented this change as well.
Citation 32 – Momennejad and Howard, 2018 – This is a very important citation and highly relevant to the discussion. However, I think it should just be cited as bioRxiv. It is confusing to call it a preprint.
Thank you for highlighting this, we have now changed the citation of this and all other cited preprints to their appropriate server e.g. bioRxiv.
Reviewer #2 (Recommendations for the authors):
This is an interesting study, and I enjoyed reading it. However, I have a number of concerns, particularly regarding the biological plausibility of the model, that I believe can be addressed with additional simulations and analysis.
Thank you again for your thorough appraisal of our work. Your suggestions have led to new simulations and analyses that have contributed to a significantly improved manuscript. To briefly summarise, these include: 3 new multipanel supplementary figures examining the effects of place field size, running speed, phase precession parameters, weight initialisation, weight update regimes and CA1 phase precession; a new appendix providing theoretical analyses and insight into how and why the model approximates temporal difference learning; and an extension of the hyperparameter sweep analysis to include the parameters controlling phase precession.
– I had a number of concerns regarding the biological plausibility of the model and the choice of parameter settings, especially:
1) Mapping from rates to rates. The CA3 neurons act on CA1 neurons via their firing rate rather than their spikes, but the STDP rule acts on the spikes. What happens if the CA1 neurons are driven by the synaptically filtered CA3 spikes rather than the underlying rates? How does the model perform, and how does the performance vary with the number of CA3 neurons (since more neurons may be required in order to average over the stochastic spikes)?
We agree that swapping rates for spikes would move the model in the direction of being more biologically plausible; however, this ends up complicating the central comparison of the work. The purpose of this study was to test the hypothesis that a combination of STDP and theta phase precession can approximate the learning of successor representations via temporal difference (TD) learning. As such, since this TD learning rule applies to continuous firing rate values (e.g. de Cothi and Barry 2020), we find this mapping of rates to rates is an essential component to facilitate fair comparison between the two learning rules. This also simplifies our model and its interpretation, as it allows us to avoid the complexity of spiking models. However, we recognise that this is a biologically implausible assumption that we are making. An avenue for correcting this in future work would be to adopt the approach of Brea et al. 2016 or Bono et al. 2021 (on bioRxiv, also currently in review at eLife). We have now added the following text to the beginning of the Results section to clarify why this particular set up was used and its caveats:
“Further, the TD successor matrix Mij can also be used to generate the ‘TD successor features’ … allowing for direct comparison and analyses with the STDP successor features (Eqn. 2), using the same underlying firing rates driving the TD learning to sample spikes for the STDP learning. This abstraction of biological detail avoids the challenges and complexities of implementing a fully spiking network, although an avenue for correcting this would be the approach of Brea et al., 2016 and Bono et al., 2021 [41, 43].”
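As a rough illustration of the rate-based TD learning referred to in this passage, the sketch below applies a generic semi-gradient TD(0) update for successor features to Gaussian place-cell rates on a circular track, so that M @ phi(x) approximates the discounted future basis activity. The track, field width, per-step discount gamma, learning rate and function names are hypothetical, and this is a textbook form of the rule rather than the study's implementation. The final check illustrates the backward skew: the weight onto cell i from the cell just behind it exceeds the weight from the cell just ahead.

```python
import numpy as np

def place_features(x, centres, sigma, track_length):
    """Gaussian basis features (CA3-like rates) on a circular track."""
    d = np.abs(x - centres)
    d = np.minimum(d, track_length - d)           # wrap-around distance
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def learn_td_successor_matrix(n_cells=20, track_length=1.0, sigma=0.05,
                              step=0.01, n_laps=50, gamma=0.95, lr=0.05):
    """Learn M such that M @ phi(x) predicts discounted future basis activity."""
    centres = np.linspace(0.0, track_length, n_cells, endpoint=False)
    M = np.zeros((n_cells, n_cells))
    x = 0.0
    phi = place_features(x, centres, sigma, track_length)
    for _ in range(int(round(n_laps * track_length / step))):
        x_next = (x + step) % track_length        # constant rightward running
        phi_next = place_features(x_next, centres, sigma, track_length)
        td_error = phi + gamma * (M @ phi_next) - M @ phi
        M += lr * np.outer(td_error, phi)         # semi-gradient TD(0) update
        x, phi = x_next, phi_next
    return M, centres

M, centres = learn_td_successor_matrix()
# TD successor feature of cell i at position x: (M @ place_features(x, ...))[i]
i = 10
print(M[i, i - 1] > M[i, i + 1])   # True: input from the cell behind exceeds the one ahead
```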
2) Weights are initialised as W_ij = δ_ij, meaning a 1:1 correspondence from CA3 to CA1 cells. This would have been ok, except that the weights are not updated during learning – they are held fixed during the entire learning phase and only updated on aggregate after learning. Thus, during the entire learning process each CA1 cell is driven by exactly 1 CA3 cell, and therefore simply inherits (or copies) the activity of that CA3 cell (according to equation 2). If either 1) a more realistic weight initialisation were used (e.g., random) or 2) weights were updated online during learning, it seems likely that the proposed mechanism would no longer work.
Thank you for this suggestion. Originally the 1:1 correspondence from CA3 to CA1 cells was to directly correspond to the definition of a successor feature (in which each successor feature corresponds to the predicted activity of a specific basis feature, e.g. Stachenfeld et al., 2017; de Cothi and Barry 2020). However, we acknowledge the biological implausibility of this approach. As such, we have updated the manuscript to include analyses of simulations where the target CA1 activity is initialised by random weights (i.e. not the identity matrix), as well as where this target activity is updated online during learning (Figure 2—figure supplement 2). As we show, neither manipulation inhibits successful learning of the STDP successor features, with the caveat that when updating the target weights online, the target features need to be partially anchored to the external world to prevent perpetual drift in the target population. We now summarise these new simulations in the Results section:
“This effect is robust to variations in running speed (Figure 2—figure supplement 1b) and field sizes (Figure 2—figure supplement 1c), as well as scenarios where target CA1 cells have multiple firing fields (Figure 2–supplement 2a) that are updated online during learning (Figure 2—figure supplement 2b,c; see Supplementary Materials for more details)”
and elaborate on this method in the appendices/methods:
“Random initialisation: In Figure 2—figure supplement 2, panel a, we explore what happens if weights are initialised randomly. Rather than the identity, the weight matrix during learning is fixed (“anchored”) to a sparse random matrix W^A; this is defined such that each CA1 neuron receives positive connections from 3, 4 or 5 randomly chosen CA3 neurons with weights summing to one. […] After learning the STDP successor feature looks close in form to the TD successor feature and both show a shift and skew backwards along the track (panel a, right, one example CA1 field shown).”
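One way to build an anchor matrix with the properties quoted above (each CA1 cell receiving positive weights from 3, 4 or 5 randomly chosen CA3 cells, summing to one) is sketched below. The uniform choice of fan-in, the uniform(0.1, 1) weight distribution before normalisation and the function name are assumptions; the quoted procedure is elided and may differ.

```python
import numpy as np

def sparse_random_anchor(n_ca1, n_ca3, fan_in_choices=(3, 4, 5), rng=None):
    """Sparse, row-normalised CA3 -> CA1 anchor matrix (rows = CA1 cells)."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_ca1, n_ca3))
    for i in range(n_ca1):
        k = int(rng.choice(fan_in_choices))               # number of CA3 inputs
        pre = rng.choice(n_ca3, size=k, replace=False)    # which CA3 cells
        w = rng.uniform(0.1, 1.0, size=k)                 # assumed weight distribution
        W[i, pre] = w / w.sum()                           # positive, summing to one
    return W

W_A = sparse_random_anchor(50, 50, rng=np.random.default_rng(0))
print(np.allclose(W_A.sum(axis=1), 1.0))                 # True
print(sorted(set((W_A > 0).sum(axis=1).tolist())))      # a subset of [3, 4, 5]
```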
“Online weight updating: In Fig. 2 supplement 2, panels b, c and d, we explore what happens if the weights are updated online during learning. […] In panel d we show that anchoring is essential. Without it (W^A_ij = 0) the weight matrix initially shows some structure shifting and skewing to the left but this quickly disintegrates and no observable structure remains at the end of learning.”
One interpretation of our setup (the original one, described in the main text of the paper where weights are not updated online) is that it matches the “Separate Phases of Encoding and Retrieval” model [Hasselmo (2002)]. This paper describes how LTP between CA1 and CA3 synapses is strongest at the phase of theta when input to CA1 is primarily coming from entorhinal cortex. To quote the abstract of this paper: “effective encoding of new associations occurs in the phase when synaptic input from entorhinal cortex is strong and long-term potentiation (LTP) of excitatory connections arising from hippocampal region CA3 is strong, but synaptic currents arising from region CA3 input are weak”. Broadly speaking, this matches what we have here. That is to say: the synapses that drive CA1 during learning are not those on which learning is accumulating. Of course we don’t replicate this model in all its details – for example we don’t actually separate CA1 drive into two phases, and don’t model phase dependent LTD and so don’t reproduce their memory extinction results – but, philosophically, it is similar.
3) Lack of discussion of phase precession in CA1 cells. What are the theta firing patterns of CA1 (successor) cells in the model? Do they exhibit theta sequences and/or phase precession? We are never told this. The spike phase of the downstream CA1 cell is extremely important for STDP, as it determines whether synapses associated with past or future events are potentiated or suppressed (see Figure 8 of Chadwick et al. 2016, eLife). Based on my understanding, in the current setup CA1 place cells should produce phase precession during learning (before weights are updated), but only because each CA1 cell copies the activity of exactly one CA3 cell, which is unrealistic. Moreover, after the weights are updated, whether they produce phase precession is no longer clear. It is important to determine whether the proposed mechanism works in the more realistic scenario in which both CA3 and CA1 cells exhibit phase precession, but CA1 cells are driven by multiple CA3 cells.
Thank you for these suggestions. We now show in Figure 2—figure supplement 4f that the CA1 STDP successor features in the model do indeed inherit this phase precession:
The reason for this is that CA1 cells are still localised and therefore driven mostly by cells in CA3 which are close and which peak in activity together at a similar phase each theta cycle. As the agent moves through the CA1 cell's field it also moves through all the CA3 cells' fields, and their peak firing phase ‘precesses’, driving an earlier peak in the CA1 firing. Phase precession in CA1 after learning is noisier/broader than in CA3 but far from nonexistent and looks similar to real phase precession data from cells in CA1. This result is described in the main text:
“In particular, the parameters controlling phase precession in the CA3 basis features (Figure 2–supplement 4a) can affect the CA1 STDP successor features learnt, with ‘weak’ phase precession resembling learning in the absence of theta modulation (Figure 2–supplement 4bc), biologically plausible values providing the best match to the TD successor features (Figure 2–supplement 4d) and ‘exaggerated’ phase precession actually hindering learning (Figure 2–supplement 4e; see Supplementary Materials for more details). Additionally, we find these CA1 cells go on to inherit phase precession from the CA3 population even after learning when they are driven by multiple CA3 fields (Figure 2–supplement 4f).”
And we elaborate on this in the appendices/methods:
“Phase precession of CA1: In most results shown in this paper the weights are anchored to the identity during learning. This means each CA1 cell inherits phase precession from the one and only one CA3 cell it is driven by. It is important to establish whether CA1 still shows phase precession after learning when driven by multiple CA3 cells or, equivalently, during learning when the weights aren’t anchored and it is therefore driven by multiple CA3 neurons. Analysing the spiking data from CA1 cells after learning (phase precession turned on) shows that they do phase precess. This phase precession is noisier than the phase precession of a cell in CA3, but only slightly, and compares favourably to real phase precession data for CA1 neurons (panel f, right, with permission from Jeewajee et al. (2014) [46]).
The reason for this is that CA1 cells are still localised and therefore driven mostly by cells in CA3 which are close and which peak in activity together at a similar phase each theta cycle. As the agent moves through the CA1 cell's field it also moves through all the CA3 cells' fields, and their peak firing phase precesses, driving an earlier peak in the CA1 firing. Phase precession in CA1 after learning is noisier/broader than in CA3 but far from nonexistent and looks similar to real phase precession data from cells in CA1.”
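The argument in this passage, that a localised CA1 cell inherits precession from the nearby CA3 cells it pools, can be checked with a toy calculation. Here each CA3 cell's preferred phase falls linearly from late to early across a field of extent plus or minus 2 sigma, and the CA1 firing phase is approximated as the rate-weighted circular mean of its inputs' preferred phases. This is a simplifying assumption, not the spiking simulation used in the study; all widths, weights and function names are hypothetical.

```python
import numpy as np

def ca3_preferred_phase(x, centres, sigma, beta=0.5):
    """Linear phase precession: late phase on field entry, early phase on exit."""
    frac = np.clip((x - centres) / (4 * sigma), -0.5, 0.5)   # position within a ±2σ field
    return np.pi - 2 * np.pi * beta * frac

def ca1_phase(x, centres, sigma, ca1_weights, beta=0.5):
    """CA1 firing phase approximated as the rate-weighted circular mean of CA3 phases."""
    rates = np.exp(-(x - centres) ** 2 / (2 * sigma ** 2))
    phases = ca3_preferred_phase(x, centres, sigma, beta)
    z = np.sum(ca1_weights * rates * np.exp(1j * phases))
    return np.mod(np.angle(z), 2 * np.pi)

centres = np.linspace(-1.0, 1.0, 201)            # CA3 field centres (m), hypothetical
sigma = 0.1                                      # CA3 field width (m), hypothetical
w = np.exp(-centres ** 2 / (2 * sigma ** 2))     # CA1 cell at 0 pools many nearby CA3 cells
for x in (-0.1, 0.0, 0.1):
    print(f"x = {x:+.1f} m: CA1 phase = {ca1_phase(x, centres, sigma, w):.2f} rad")
```

The printed phase decreases from field entry to exit, even though no single CA3 input is copied one-to-one.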
Additionally, by extending our parameter sweep to include phase precession parameters (Figure 2–supplement 3 panel c, last 2 subplots), we now show that the biologically derived values for the parameters determining the phase precession in the model are in fact optimally placed to approximate the TD learning of successor features (Figure 2–supplement 4, please see response to point 5 for more details).
Finally, we show that the CA1 successor features can still be successfully learnt via the STDP + phase precession mechanism when the target features are driven by multiple CA3 cells (Figure 2 supplement 2A), and when the target features are updated by the learnt weights online (Figure 2 supplement 2bc, please see response to point 2 for technical details).
4) Related to the preceding comment, there is a phase shift/delay between CA3 and CA1 (Mizuseki, Buzsaki et al., 2010). This doesn't seem to have been taken into account. Can the model be set up so that i) CA1 cells receive inputs from multiple CA3 cells ii) both CA3 and CA1 cells exhibit phase precession iii) there is the appropriate phase delay between CA3 and CA1?
Thank you for this comment, as it provoked much thought. At the level of individual cells in our model, the phase shift presented by Mizuseki, Buzsaki et al., 2010 (i.e. CA1 being shifted temporally just ahead of CA3) is functionally near-identical to each CA3 basis feature being connected to a different CA1 cell slightly further ahead of it down the track. Therefore, in total, this would simply manifest as a rotation of the weight matrix (i.e. a realignment of CA1 cells along the track). Thus, any functional role of these phase delays likely lies in aspects of learning we are not capturing here. However, if this shift were more substantial, it is not entirely clear what would happen. We identify this as a limitation and direction for future work in the new paragraph we have added that discusses the limits of the model’s biological plausibility (reprinted below for convenience):
“While our model is biologically plausible in several respects, there remain a number of aspects of the biology that we do not interface with, such as different cell types, interneurons and membrane dynamics. Further, we do not consider anything beyond the most simple model of phase precession, which directly results in theta sweeps in lieu of them developing and synchronising across place cells over time [60]. Rather, our philosophy is to reconsider the most pressing issues with the standard model of predictive map learning in the context of hippocampus (e.g., the absence of dopaminergic error signals in CA1 and the inadequacy of synaptic plasticity timescales). We believe this minimalism is helpful, both for interpreting the results presented here and providing a foundation for further work to examine these biological intricacies, such as the possible effect of phase offsets in CA3, CA1 [61] and across the dorsoventral axis [62, 63], as well as whether the model’s theta sweeps can alternately represent future routes [64] by the inclusion of attractor dynamics [65].”
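The equivalence claimed in the response above, that a small temporal lead of CA1 over CA3 amounts to connecting each CA3 cell to a CA1 cell slightly further along the track, can be written as a circular shift of the identity weight matrix (the "rotation" mentioned in the response). The sketch is a toy illustration under that assumption; the helper names and the conversion from time offset to a whole number of cells are hypothetical.

```python
import numpy as np

def shifted_identity(n_cells, shift):
    """CA3 -> CA1 weights where CA3 cell j drives the CA1 cell `shift` positions ahead.

    Row i (CA1 cell i) receives input from CA3 cell (i - shift) mod n_cells.
    """
    return np.roll(np.eye(n_cells), shift, axis=0)

def cells_shifted(time_offset_s, speed_m_s, cell_spacing_m):
    """Convert a temporal lead of CA1 over CA3 into an equivalent spatial re-indexing."""
    return int(round(time_offset_s * speed_m_s / cell_spacing_m))

W = shifted_identity(5, 1)
rates_ca3 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(W @ rates_ca3)   # [4. 0. 1. 2. 3.]: the CA3 pattern, re-indexed by one cell
```

Because the shifted matrix is a permutation, it only relabels which CA1 cell carries which prediction; larger or non-uniform offsets would not reduce to this simple form.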
5) Dependence of learning on the noisiness of phase precession. The hyperparameter sweep seems to omit some of the most important variables, such as the spread parameter (kappa) and the place field width and running speed (see next comment). Since the successor representation is shown to be learned well when kappa=1 but not when kappa=0 (i.e. when phase precession is removed), this leaves open the question of what happens when kappa is bigger or smaller than 1. It would be nice to see kappa systematically varied and the consequences explored.
Thank you for this suggestion. We have now extended our parameter sweep (Figure 2 supplement 3) to systematically determine the effect of variations in the noisiness of the phase precession (kappa) and the proportion of the theta cycle in which the precession takes place (β). Interestingly, we find that the biologically derived parameters are in fact optimally placed to approximate the TD learning of successor features (Figure 2 supplement 3c and 4a–e). We summarise these results in the main text:
“In particular, the parameters controlling phase precession in the CA3 basis features (Figure 2–supplement 4a) can affect the CA1 STDP successor features learnt, with ‘weak’ phase precession resembling learning in the absence of theta modulation (Figure 2–supplement 4bc), biologically plausible values providing the best match to the TD successor features (Figure 2–supplement 4d) and ‘exaggerated’ phase precession actually hindering learning (Figure 2–supplement 4e; see Supplementary Materials for more details). Additionally, we find these CA1 cells go on to inherit phase precession from the CA3 population (Figure 2–supplement 4f).”
In an additional supplementary figure (Figure 2–supplement 4) we delve into these hyperparameter sweep results, showing examples of the effect of too much or too little phase precession on the learnt successor features, and attempt to shed light on why this intermediate optimum exists.
We also go into further detail in the appendices/methods:
“The optimality of biological phase precession parameters: In figure 2 supplement 3 we ran a hyperparameter sweep over the two parameters associated with phase precession: κ, the von Mises parameter describing how noisy phase precession is, and β, the fraction of the full 2π theta cycle phase precession crosses. The results show that for both of these parameters there is a clear “goldilocks” zone around the biologically fitted parameters we chose originally. When there is too much (large κ, large β) or too little (small κ, small β) phase precession, performance is worse than at intermediate biological amounts of phase precession. Whilst – according to the central hypothesis of the paper – it makes sense that weak or nonexistent phase precession hinders learning, it is initially counterintuitive that strong phase precession also hinders learning.
We speculate the reason is as follows. When β is too big, phase precession spans the full range from 0 to 2π, so it is possible for a cell firing very late in its receptive field to fire just before a cell a long distance behind it on the track firing very early in the cycle, because 2π comes just before 0 on the unit circle. When κ is too big, phase precession is too clean and cells firing at opposite ends of the theta cycle will never be able to bind, since their spikes will never fall within a 20 ms window of each other. We illustrate these ideas in figure 2 supplement 4 by first describing the phase precession model (panel a), then simulating spikes from 4 overlapping place cells (panel b) when phase precession is weak (panel c), intermediate/biological (panel d) and strong (panel e). We confirm these intuitions about why there exists a phase precession “goldilocks” zone by showing the weight matrix compared to the successor matrix (right hand side of panels c, d and e). Only in the intermediate case is there good similarity.”
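The two failure modes described in this passage can be illustrated numerically with a simple phase precession model: mean phase falls linearly from pi + pi*beta on field entry to pi - pi*beta on exit, with von Mises noise of concentration kappa. With beta = 1 the entry and exit phases meet at the cycle boundary, so nothing separates extreme-phase spikes in consecutive cycles; with very large kappa, two co-active cells a quarter cycle apart almost never fire within 20 ms of each other. The 8 Hz theta frequency, the model form and the function names are assumptions for this sketch rather than the study's implementation.

```python
import numpy as np

THETA_HZ = 8.0          # assumed theta frequency
STDP_WINDOW_S = 0.020   # ~20 ms, as in the text

def mean_phase(frac_through_field, beta):
    """pi + pi*beta on field entry (frac=0) down to pi - pi*beta on exit (frac=1)."""
    return np.pi + np.pi * beta * (1.0 - 2.0 * frac_through_field)

def sample_phases(frac_through_field, beta, kappa, n, rng):
    """Von Mises-distributed spike phases around the mean, wrapped to [0, 2*pi)."""
    mu = mean_phase(frac_through_field, beta)
    return np.mod(rng.vonmises(mu, kappa, size=n), 2 * np.pi)

def cross_cycle_gap_s(beta, theta_hz=THETA_HZ):
    """Time from an entry-phase spike in one cycle to an exit-phase spike in the next."""
    return (1.0 - beta) / theta_hz

def frac_pairs_in_window(phases_a, phases_b, theta_hz=THETA_HZ, window=STDP_WINDOW_S):
    """Fraction of same-cycle spike pairs whose time difference lies inside the STDP window."""
    dt = np.subtract.outer(phases_a, phases_b).ravel() / (2 * np.pi * theta_hz)
    return np.mean(np.abs(dt) < window)

print(cross_cycle_gap_s(1.0), cross_cycle_gap_s(0.5))   # 0.0 0.0625

# Two co-active cells whose mean phases differ by a quarter cycle (31.25 ms at 8 Hz).
rng = np.random.default_rng(0)
for kappa in (1.0, 100.0):
    a = sample_phases(0.25, 0.5, kappa, 500, rng)
    b = sample_phases(0.75, 0.5, kappa, 500, rng)
    print(kappa, frac_pairs_in_window(a, b))   # the fraction shrinks as kappa grows
```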
6) Wide place fields and slow speeds. Place fields in the model have a diameter of 2 metres. This is quite big – bigger than typical place field sizes in the dorsal hippocampus (which often have around 30 cm diameter, or 15 cm radius). Moreover, the chosen velocity of 16 cm/s is quite slow, and rats often run much faster in experiments (30 cm/s and higher). With the chosen parameters, it takes the rodent 12.5 s to traverse a place field, which is unrealistically long. My concern is that this setup leads to a large number of spikes per pass through a place field and that this unrealistic setting is needed for the proposed mechanism to learn effectively in a reasonable number of laps. What happens when place fields are smaller and running speeds faster, as is typically found in experiments? How many laps are required for convergence?
Thank you for this suggestion; we now explore this in a new supplementary figure (Figure 2–supplement 1b,c). In summary, we find there is no critical effect on learning with smaller place fields and faster speeds. As hypothesised by the reviewer, we find that learning is slower (when measured in number of laps) due to the decreased number of spikes, but not catastrophically so. This is summarised in the results:
“Thus, the ability to approximate TD learning appears specific to the combination of STDP and phase precession. Indeed, there are deep theoretical connections linking the two – see Methods section 5.8 for a theoretical investigation into the connections between TD learning and STDP learning augmented with phase precession. This effect is robust to variations in running speed (Figure 2–supplement 1b) and field sizes (Figure 2–supplement 1c), as well as scenarios where target CA1 cells have multiple firing fields (Figure 2–supplement 2a) that are updated online during learning (Figure 2–supplement 2b,c; see Supplementary Materials for more details).”
And elaborated on in the appendices/methods:
“Smaller place cells and faster movement: Nothing fundamental prevents learning from working in the case of smaller place fields or faster movement speeds. We explore this in Figure 2–supplement 1c as follows: the agent speed is doubled from 16 cm s⁻¹ to 32 cm s⁻¹ and the place field size is shrunk by a factor of 5, from 2 m diameter to 40 cm diameter. To facilitate learning we also increase the cell density along the track from 10 cells m⁻¹ to 50 cells m⁻¹. We also shrink the track from 5 m to 2 m (any additional track is redundant due to the circular symmetry of the setup and the small size of the place cells). We then train for 12 minutes. This duration was chosen because 12 minutes moving at 32 cm s⁻¹ on a 2 m track gives the same number of laps as 60 minutes moving at 16 cm s⁻¹ on a 5 m track (about 115 laps in total). Despite these changes, the weight matrix converged with high similarity to the successor matrix with a shorter time horizon (0.5 s). Convergence measured in minutes was faster than in the original case, but this is mostly due to the shortened track length and increased speed. Measured in laps, convergence now takes longer due to the decreased number of spikes (smaller place fields and faster movement through them). This can be seen in the shallower convergence curve in panel c (right) relative to panel a.”
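The lap-matching argument, and the difference in time spent inside a single field, follow from simple arithmetic. The sketch below uses illustrative function names:

```python
import math

def laps(duration_min, speed_cm_s, track_m):
    """Laps completed on a circular track at a constant mean speed."""
    distance_m = duration_min * 60 * speed_cm_s / 100
    return distance_m / track_m

def traversal_time_s(field_diameter_m, speed_cm_s):
    """Time taken to cross a place field along its diameter."""
    return field_diameter_m / (speed_cm_s / 100)

original = laps(60, 16, 5)     # 576 m travelled on a 5 m track
small_fast = laps(12, 32, 2)   # 230.4 m travelled on a 2 m track
print(original, small_fast, math.isclose(original, small_fast))

# Time spent inside one field: 2 m at 16 cm/s vs. 40 cm at 32 cm/s
print(traversal_time_s(2.0, 16), traversal_time_s(0.4, 32))
```

If peak firing rates are similar, the tenfold shorter traversal time (12.5 s versus 1.25 s) means roughly tenfold fewer spikes per field per pass. This is consistent with the slower per-lap convergence described above.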
7) Running speed-dependence of phase precession and firing rate. The rat is assumed to run at a fixed speed – what happens when speed is allowed to vary? Running speed has profound effects on the firing of place cells, including i) a change in their rate of phase precession and ii) a change in their firing rate (Huxter et al., 2003). More simulations are needed in which running speed varies lap-by-lap and/or within laps.
Thank you for this suggestion; we now explore this in a new supplementary figure (Figure 2–supplement 1b; see comment above), where the speed of the rat/agent is allowed to vary smoothly and stochastically. In summary, after learning we find no observable effect on the STDP weight matrix, the TD successor matrix, or the R² value between the two. This is summarised in the results:
“Thus, the ability to approximate TD learning appears specific to the combination of STDP and phase precession. Indeed, there are deep theoretical connections linking the two – see Methods section 5.8 for a theoretical investigation into the connections between TD learning and STDP learning augmented with phase precession. This effect is robust to variations in running speed (Figure 2–supplement 1b) and field sizes (Figure 2–supplement 1c), as well as scenarios where target CA1 cells have multiple firing fields (Figure 2–supplement 2a) that are updated online during learning (Figure 2–supplement 2b,c; see Supplementary Materials for more details).”
With further details in the appendices/methods:
“Movement speed variability: Panel b shows an experiment where we reran the simulation shown in Figures 2a–e of the paper, except that instead of a constant motion speed the agent moves with a variable speed drawn from a continuous stochastic process (an Ornstein–Uhlenbeck process). The parameters of the process were selected so the mean velocity remained the same (16 cm s⁻¹ left-to-right) but now with significant variability (standard deviation of 16 cm s⁻¹, thresholded so the speed can't go negative). Essentially, the velocity takes a constrained random walk. This detail is important: the velocity is not drawn randomly on each time step, since such changes would rapidly average out with small dt. Rather, the change in the velocity (the acceleration) is random, which drives slow stochasticity in the velocity with extended periods of fast motion and extended periods of slow motion. After learning there is no substantial difference in the learned weight matrices. This is because both the TD and STDP learning rules are able to average over the stochasticity in the velocity and converge on representations that reflect the mean statistics of the motion.”
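The following is a minimal sketch of such a speed process. It assumes an Euler–Maruyama discretisation of an Ornstein–Uhlenbeck process with a hypothetical correlation time `tau_s`, whose value in the study is not given here. Noise enters through the increments, so speed drifts slowly instead of jittering independently on every step, and clipping at zero implements the threshold. In this sketch, clipping nudges the realised mean speed slightly above the 16 cm s⁻¹ mean of the underlying process.

```python
import numpy as np

def ou_speed(n_steps, dt, mean=0.16, std=0.16, tau_s=1.0, seed=0):
    """Speed trace (m/s) from an Ornstein-Uhlenbeck process, clipped at zero.

    mean, std: stationary mean and standard deviation of the unclipped process.
    tau_s: hypothetical correlation time (s) controlling how slowly speed drifts.
    """
    rng = np.random.default_rng(seed)
    v = np.empty(n_steps)
    v[0] = mean
    # Euler-Maruyama step: dv = (mean - v) dt / tau + std * sqrt(2 / tau) dW
    noise_scale = std * np.sqrt(2.0 * dt / tau_s)
    for t in range(1, n_steps):
        v[t] = v[t - 1] + (mean - v[t - 1]) * dt / tau_s + noise_scale * rng.standard_normal()
    return np.maximum(v, 0.0)  # threshold so the agent never runs backwards

dt = 0.01
speed = ou_speed(n_steps=60_000, dt=dt)   # 10 minutes of simulated running
position = np.cumsum(speed * dt) % 5.0    # position on a 5 m circular track

# Consecutive samples are strongly correlated (slow drift, not white noise)
lag1 = np.corrcoef(speed[:-1], speed[1:])[0, 1]
print(f"mean speed {speed.mean():.3f} m/s, lag-1 autocorrelation {lag1:.3f}")
```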
8) Two-dimensional phase precession. There is debate over how 2D environments are encoded in the theta phase (Chadwick et al. 2015, 2016; Huxter et al., 2008; Climer et al., 2013; Jeewajee et al., 2013). This should be mentioned and discussed – how much do the results depend on the specific assumptions regarding phase precession in 2D? For example, Huxter et al. found that, when animals pass through the edge of a place field, the cell initially precesses but then processes back to its initial phase, but this isn't captured by the model used in the present study. Chadwick et al. (2016) proposed a model of two-dimensional phase precession based on the phase locking of an oscillator, which reproduces the findings of Huxter et al. and makes different predictions for phase precession in two dimensions than the Jeewajee model used by the authors. It would be nice to test alternative models for 2D phase precession and determine how well they perform in terms of generating successor-like representations.
Thank you for this suggestion. We agree this is an important topic in terms of understanding the correlates and consequences of phase precession. There is a wealth of literature surrounding this topic, some of which we relied upon for defining the model of 2D phase precession implemented here (e.g. Jeewajee et al., 2013 and Chadwick et al. 2015). However, we believe that this would be better suited as a follow-up to the current study, which addresses the first question of how closely the representations learned with classical theta precession resemble TD-trained SRs. Nonetheless, we agree that considering alternative 2D models of phase precession would be a wonderful direction for future work, and our code is publicly available should anyone wish to explore this.
9) Modelling the distribution of place field sizes along the dorsoventral axis. Two phenomena that are likely important, and could alter the conclusions, were omitted. First, there is a phase gradient along the dorsoventral axis, which generates travelling theta waves (Patel, Buzsáki et al., 2012; Lubenov and Siapas, 2009). How do the results change when including a 180° (or 360°) phase gradient along the DV axis? The authors state that "A consequence of theta phase precession is that the cell with the smaller field will phase precess faster through the theta cycle than the other cell – initially it will fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it will fire earlier" – this neglects to consider the phase gradient along the DV axis (see also Leibold and Monsalve-Mercado, 2017). Second, the authors chose three discrete place field sizes for their dorsoventral simulations. How would these simulations look if a continuum of sizes were used, reflecting the gradient along the dorsoventral axis? Going further, CA1 cells likely receive input from CA3 cells with a distribution of place field sizes rather than a single place field size – how would the model behave in that case?
Thank you for this interesting point. The model and results presented here pertain more to the role of theta compression (and STDP) in approximating TD learning. However, we have now added the following to our discussion to consider these additional aspects of theta oscillations:
“The distribution of place cell receptive field size in the hippocampus is not homogeneous. Instead, place field size grows smoothly along the longitudinal axis (from very small in dorsal regions to very large in ventral regions). Why this is the case is not clear – our model contributes by showing that, without this ordering, large and small place cells would all bind via STDP, essentially overwriting the short-timescale successor representations learnt by small place cells with long-timescale successor representations. Topographically organising place cells by size anatomically segregates place cells with fields of different sizes, preserving the multiscale successor representations. The functional separation of these spatial scales could be further enhanced by a gradient of phase offsets along the dorsoventral axis, resulting from the theta oscillation being a travelling wave [62, 63]. This may act as a temporal segregation preventing learning between cells of different field sizes, on top of the anatomical segregation we explore here. The premise that such separation is needed to learn multiscale successor representations is compatible with other theoretical accounts for this ordering. Specifically, Momennejad and Howard [39] showed that exploiting multiscale successor representations downstream, in order to recover information which is ‘lost’ in the process of compiling state transitions into a single successor representation, typically requires calculating the derivative of the successor representation with respect to the discount parameter. This derivative calculation is significantly easier if the cells – and therefore the successor representations – are ordered smoothly along the hippocampal axis.”
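The derivative argument in the passage above can be made concrete. For a transition matrix T and discount γ, the successor representation is M(γ) = (I − γT)⁻¹, and differentiating the inverse gives dM/dγ = M T M. The sketch below uses a hypothetical biased random walk on a ring rather than any simulation from the study. It checks that a central finite difference between SRs at two neighbouring discounts recovers this derivative. Neighbouring discounts are what a smooth ordering of scales along the axis would supply. This is an illustration of that point only, not Momennejad and Howard's method.

```python
import numpy as np

def ring_transition_matrix(n, p_forward=0.8):
    """Hypothetical biased random walk on a ring of n states."""
    T = np.zeros((n, n))
    for i in range(n):
        T[i, (i + 1) % n] = p_forward
        T[i, (i - 1) % n] = 1.0 - p_forward
    return T

def successor_representation(T, gamma):
    """M = sum_k gamma^k T^k = (I - gamma T)^-1."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)

T = ring_transition_matrix(20)
gamma, h = 0.9, 1e-5

# Closed-form derivative of the SR with respect to the discount
M = successor_representation(T, gamma)
dM_analytic = M @ T @ M

# Finite difference using SRs at two neighbouring discounts
# (e.g. two adjacent positions along an ordered axis)
dM_fd = (successor_representation(T, gamma + h)
         - successor_representation(T, gamma - h)) / (2 * h)

print(np.max(np.abs(dM_fd - dM_analytic)))
```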
As well as this, we include a new paragraph in the discussion pertaining to these limits in the model’s biological plausibility and our intended contribution:
“While the model is biologically plausible in several respects, there remain a number of aspects of the biology that we do not interface with, such as different cell types, interneurons and membrane dynamics. Further, only the most simple model of phase precession is considered, which directly results in theta sweeps in lieu of them developing and synchronising across place cells over time [60]. Rather, our philosophy is to reconsider the most pressing issues with the standard model of predictive map learning in the context of hippocampus. These include the absence of dopaminergic error signals in CA1 and the inadequacy of synaptic plasticity timescales. We believe this minimalism is helpful, both for interpreting the results presented here and providing a foundation on which further work may examine these biological intricacies, such as the possible effect of phase offsets in CA3, CA1 [61] and across the dorsoventral axis [62, 63], as well as whether the model’s theta sweeps can alternately represent future routes [64] e.g. by the inclusion of attractor dynamics [65].”
– There is no theoretical analysis of why theta sequences+STDP approximates the TD algorithm, or when the proposed mechanism might/might not work. The model is simple enough that some analysis should be possible. It would be nice to see this elaborated on – can a reduced model be obtained that captures the learning algorithm embodied by theta sequences+STDP, and does this reduced model reveal an explicit link to the TD algorithm? If not, then why does it work, and when might it generalise/not work?
Thank you for this suggestion. We have now updated the manuscript to include a section (Methods 5.8) explaining the theoretical connection between STDP and TD learning. In short, it starts by showing how temporal difference learning can be mathematically recast into a temporally asymmetric Hebbian learning rule reminiscent of simplified STDP. However, in order to recast TD learning in its STDP-like form it is necessary to fix the temporal discount time horizon to the synaptic plasticity timescale. This alone would produce TD-style learning on a timescale too short to capture meaningful predictions of behaviour. Thus, we show mathematically that the importance of theta phase precession is to provide a precise temporal compression on the input sequences that effectively increases this predictive time horizon from the timescale of synaptic plasticity to the timescale of behaviour. This temporal compression overcomes the timescales problem since, by symmetry, learning a successor feature with a very small time horizon where the input trajectory is temporally compressed is equivalent to learning a successor feature with a long time horizon where the inputs are not compressed. We derive a formula for the amount of compression as a function of the typical speed of a ‘theta sweep’ and estimate a ballpark figure showing that in many cases this compression is enough to extend the synaptic plasticity timescale into behaviourally relevant timescales. In essence, this section provides the mathematics behind the very intuition on which we based the study (e.g. Figure 1). That is:
1. Fundamentally, STDP behaves similarly to TD learning since the temporally asymmetric learning rule binds pairs of cells if one cell spikes before (i.e. is predictive of) the other.
2. STDP can’t easily learn temporally extended predictive maps but can if phase precession “compresses” input features.
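Both points can be illustrated numerically in a minimal sketch. The first part applies an all-pairs additive STDP rule with a hypothetical 20 ms exponential window to cells that fire in a fixed order on every theta cycle. Synapses from earlier-firing to later-firing cells are potentiated, the reverse direction is depressed, and potentiation decays with firing-order distance, giving the asymmetric structure of a successor matrix. The second part checks the compression argument in continuous time. Take the successor feature with horizon τ to be ψ_τ(t) = (1/τ) ∫_t^∞ exp(−(t′ − t)/τ) f(t′) dt′. Then a signal compressed by a factor k, g(s) = f(ks), learnt with horizon τ matches the uncompressed signal learnt with horizon kτ. The value k = 10 is an assumed compression factor, not the estimate derived in Methods 5.8.

```python
import numpy as np

# Part 1: temporally asymmetric STDP binds cells in firing order
def stdp_weights(spike_times, tau=0.02, a_plus=1.0, a_minus=1.0):
    """All-pairs additive STDP. W[post, pre] grows when pre spikes before post
    (potentiation) and shrinks when pre spikes after post (depression)."""
    n = len(spike_times)
    W = np.zeros((n, n))
    for post in range(n):
        for pre in range(n):
            if pre == post:
                continue
            lags = spike_times[post][:, None] - spike_times[pre][None, :]  # t_post - t_pre
            W[post, pre] = (a_plus * np.exp(-lags[lags > 0] / tau).sum()
                            - a_minus * np.exp(lags[lags < 0] / tau).sum())
    return W

# Four cells firing 10 ms apart within each 100 ms theta cycle, for 20 cycles
cycle_starts = np.arange(20) * 0.1
spikes = [cycle_starts + 0.01 * i for i in range(4)]
W = stdp_weights(spikes)
print(np.round(W, 2))

# Part 2: compressing the input stretches the effective horizon
def successor_feature(f, dt, tau):
    """Discretised psi(t) = (1/tau) * integral_t^inf exp(-(t'-t)/tau) f(t') dt',
    via a backward recursion (assumes f is ~0 beyond the end of the grid)."""
    decay = np.exp(-dt / tau)
    psi = np.empty_like(f)
    acc = 0.0
    for i in range(len(f) - 1, -1, -1):
        acc = f[i] + decay * acc
        psi[i] = acc
    return (1.0 - decay) * psi

def bump(t):
    """A place-field-like input on the behavioural timescale (seconds)."""
    return np.exp(-((t - 1.0) ** 2) / (2 * 0.15 ** 2))

k, tau_stdp = 10.0, 0.02               # assumed compression factor; 20 ms window
t = np.arange(0.0, 3.0, 1e-3)          # behavioural time (s)
s = np.arange(0.0, 3.0 / k, 5e-5)      # compressed time (s)
psi_slow = successor_feature(bump(t), 1e-3, k * tau_stdp)  # 200 ms horizon, uncompressed
psi_fast = successor_feature(bump(k * s), 5e-5, tau_stdp)  # 20 ms horizon, compressed
mismatch = np.max(np.abs(np.interp(t / k, s, psi_fast) - psi_slow))
print(f"max mismatch: {mismatch:.4f}")
```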
Finally, we end this theoretical analysis section by examining where and why the two learning rules diverge (i.e. where STDP does not approximate TD learning). We direct the reader to studies that focus more closely on modified Hebbian learning rules to circumvent these issues, whilst pointing out that it does not have to be one or the other – the intuition for why theta phase precession helps learning applies equally well to modified learning rules which focus more closely on exactly replicating TD learning at the expense of similarity to biological STDP. We include the newly added theory section at the end of this review response document.
– The comparison of successor features to neural data was qualitative rather than quantitative, and often quite vague. This makes it hard to know whether the predictions of the model are actually consistent with real neural data. It would be much preferred if a direct quantitative comparison of the learned successor features to real data could be performed, for example, the properties of place fields near to doorways.
We agree that we could be much more specific in our comparisons to neural data, and that making quantitative comparisons to experimentally recorded place cells would be a valuable contribution. To address the first point, we have clarified the presentation of our results in several places in order to make the connections to existing neural data more specific. As for making comparisons to data, we believe this is outside the scope of this work. Our primary contribution is to make quantitative comparisons between successor representations learned by TD and those learned by STDP+theta. This led us to testable predictions that we have described in the discussion (page 11, paragraph beginning “Our theory makes the prediction”) that specifically relate to the effect of impairing theta oscillations at different stages of learning (we note that these descriptions have been rewritten to be clearer in the revised manuscript). We believe that these kinds of experiments would provide datasets better suited to the specific theoretical questions we investigate here than would a post-hoc analysis of existing datasets. Finally, we have now included a theoretical analysis of the connection between STDP and TD learning (see comment above), in which readers may find a more insightful way to gain intuition about how closely this model matches SR theory, and which solidifies the theoretical contribution.
We also want to note that some prior (and currently under review) work has conducted quantitative comparisons between hippocampal data and successor representations. Neuroimaging studies have shown evidence for predictive coding of spatial and non-spatial states on varying time horizons (Garvert et al. 2017, Schapiro et al. 2016, Brunec and Momennejad 2022). Other studies, such as Duvelle et al. 2021, have found that the SR did not explain the data under certain conditions. de Cothi et al. 2022 provide a model comparison to explain navigation behaviours in humans and rats, and found that both were best explained by a successor representation-like strategy. We also note that in recent work, also under review at eLife, Ching Fang and colleagues conduct a quantitative comparison between place fields recorded from chickadees and the successor representation (Fang et al. 2022).
– Statistical structure of theta sequences. The model used by the authors is identical to that of Chadwick et al. (2015) (except for the thresholding of the Gaussian field), and so implicitly assumes that theta sequences are generated by the independent phase precession of each place cell. However, the authors mention in the introduction that other studies argue for the coordination of place cells, such that theta sequences can represent alternative futures on consecutive theta cycles (Kay et al.). This raises the question: how important is the choice of an independent phase precession model for the results of this study? For example, if the authors were to simulate a T-maze, would a model which includes cycling of alternative futures learn the successor representation better or worse than the model based on independent coding? Given that there is now a large literature exploring the coordination of theta sequences and their encoded trajectories, it would be nice to see some discussion of how the proposed mechanism depends on/relates to this.
Thank you for this suggestion. We have added a citation to Chadwick et al., 2015 (ref [42]) as well as the following at the beginning of the results:
“As the agent traverses the receptive field, its rate of spiking is subject to phase precession f_j^θ(x,t) with respect to a 10 Hz theta oscillation. This is implemented by modulating the firing rate by an independent phase precession factor which varies according to the current theta phase and how far through the receptive field the agent has travelled [42] (see Methods and Figure 1a).”
We also discuss limits of the model with regard to the Kay et al. study, as well as possible manipulations to capture this result, in a new discussion paragraph:
“While our model is biologically plausible in several respects, there remain a number of aspects of the biology that we do not interface with, such as different cell types, interneurons and membrane dynamics. Further, we do not consider anything beyond the most simple model of phase precession, which directly results in theta sweeps in lieu of them developing and synchronising across place cells over time [60]. Rather, our philosophy is to reconsider the most pressing issues with the standard model of predictive map learning in the context of hippocampus (e.g., the absence of dopaminergic error signals in CA1 and the inadequacy of synaptic plasticity timescales). We believe this minimalism is helpful, both for interpreting the results presented here and providing a foundation for further work to examine these biological intricacies, such as the possible effect of phase offsets in CA3, CA1 [61] and across the dorsoventral axis [62, 63], as well as whether the model’s theta sweeps can alternately represent future routes [64] by the inclusion of attractor dynamics [65].”
And elaborate on both of these points in the methods section:
“Our phase precession model is “independent” (essentially identical to Chadwick et al. (2015) [42]) in the sense that each place cell phase precesses independently from what the other place cells are doing. In this model, phase precession directly leads to theta sweeps as shown in Figure 1. Another class of models referred to as “coordinated assembly” models [76] hypothesise that internal dynamics drive theta sweeps within each cycle because assemblies (aka place cells) dynamically excite one another in a temporal chain. In these models theta sweeps directly lead to phase precession. Feng and colleagues draw a distinction between theta precession and theta sequence, observing that while independent theta precession is evident right away in novel environments, longer and more stereotyped theta sequences develop over time [77]. Since we are considering the effect of theta precession on the formation of place field shape, the independent model is appropriate for this setting. We believe that considering how our model might relate to the formation of theta sequences or what implications theta sequences have for this model is an exciting direction for future work.”
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:
1. Spiking model. We all agree with you that a full spiking model would be much too complex. However, since you already generate spikes using a Poisson process, it would be useful to see a simulation where the Poisson rate of each CA1 cell is determined by the integration of the incoming CA3 spikes (perhaps with many incoming CA3 neurons). If this doesn't work, you should discuss why this is the case and what the implications are for the model.
Thank you for this suggestion. In order to address points 1 and 2 (see below), we have updated the manuscript to include a new simulation where the spiking activity in CA1 cells (N=50) is driven by a large number of spiking CA3 neurons (N=500) with overlapping fields that phase precess. To avoid the complexity of Hodgkin–Huxley or leaky integrate-and-fire models, spiking activity in CA1 is determined by a linear-nonlinear cascade model, which we would like to thank Reviewer 2 for suggesting. In the simulation, which has been added as a panel in Figure 2–supplement 2, we find that the resulting weights learnt via STDP in the spiking model are almost identical to those learnt by the standard STDP-successor learning rule used in most of our previous simulations (diagonal-aligned average across rows: R² = 0.99). Note that the resulting weight matrix is no longer square, due to the tenfold greater number of cells in CA3 than in CA1.
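For readers unfamiliar with the approach, below is a minimal linear–nonlinear–Poisson sketch of a CA1 population driven by CA3 spikes. The synaptic filter, the rectified-tanh nonlinearity and every parameter value are hypothetical placeholders, not the equations used in the study. Only the population sizes (50 CA1, 500 CA3) follow the text.

```python
import numpy as np

def ln_poisson_ca1(ca3_spikes, W, dt, tau_syn=0.01, r_max=50.0, scale=5.0, rng=None):
    """Hypothetical linear-nonlinear-Poisson (LNP) cascade.

    ca3_spikes: (n_steps, n_ca3) boolean array of CA3 spikes.
    W: (n_ca1, n_ca3) CA3->CA1 weights.
    1. Linear stage: exponentially filtered CA3 spikes, weighted by W.
    2. Nonlinear stage: rectified, saturating rate (Hz), bounded by r_max.
    3. Poisson stage: a CA1 spike occurs in a bin with probability rate * dt.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps, n_ca3 = ca3_spikes.shape
    decay = np.exp(-dt / tau_syn)
    trace = np.zeros(n_ca3)
    rates = np.zeros((n_steps, W.shape[0]))
    for step in range(n_steps):
        trace = decay * trace + ca3_spikes[step]   # synaptic trace per CA3 cell
        drive = W @ trace                          # linear stage
        rates[step] = r_max * np.tanh(np.maximum(drive, 0.0) / scale)  # nonlinearity
    ca1_spikes = rng.random(rates.shape) < rates * dt  # Bernoulli approximation to Poisson
    return rates, ca1_spikes

rng = np.random.default_rng(0)
dt = 0.001
ca3 = rng.random((2000, 500)) < 5.0 * dt   # 2 s of roughly 5 Hz CA3 spiking
W = rng.random((50, 500)) * 0.05           # non-negative random weights
rates, ca1 = ln_poisson_ca1(ca3, W, dt, rng=rng)
print(rates.shape, ca1.mean() / dt)        # mean CA1 firing rate (Hz)
```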
The results of the simulation are also now referred to in the Results:
“This effect is robust to variations in running speed (Figure 2–supplement 1b) and field sizes (Figure 2–supplement 1c), as well as scenarios where target CA1 cells have multiple firing fields (Figure 2–supplement 2a) that are updated online during learning (Figure 2–supplement 2b–d), or are fully driven by spikes in CA3 (Figure 2–supplement 2e); see Methods for more details.”
and details of the simulations have been added to the Methods section 5.10.2 (equations for the spiking model are also summarised in the figure above).
2. CA3 => CA1 projections. CA1 cells still receive input from just one CA3 cell for each place field in the updated model (at least in the majority of simulations). This allows precise theta timing of the pre- and postsynaptic neurons, which appears to be critical for the plasticity rule to function. For example, the mathematics of Geisler et al. 2007 shows that, if the CA1 cell were to receive input from a set of phase-precessing CA3 cells with spatially offset place fields and a Gaussian weight profile (the most common way to model CA3–CA1 connections), then the CA1 cell would actually fire at the LFP theta frequency and wouldn't phase precess, and as a consequence the STDP mechanism would no longer learn the successor representation. This suggests strong constraints on the conditions under which the model can function, which are currently not being adequately discussed. This should be investigated and discussed, and the constraints required for the model to function should be plainly laid out.
We agree that the one-to-one nature of CA3 => CA1 connections in the majority of simulations might suggest that this is a strict condition for the model to function. Rather, it is a condition we impose in order to simplify the model as well as to establish a clear connection between the model and successor feature theory (see Methods section 5.9). However, it is important to note that this condition can easily be relaxed, and that doing so still produces results that are extremely similar to true successor feature learning. In order to show this, we have updated the manuscript to include a new simulation, also outlined above, where the spiking activity in CA1 cells (N=50) is driven by a large number of spiking CA3 neurons (N=500) with overlapping fields that phase precess. We show that in this regime, even when the spiking output of each CA1 cell is determined by the spiking input of a large number of phase-precessing CA3 cells, the resulting STDP synaptic weights closely resemble those of the STDP-successor matrix used in the majority of simulations (diagonal-aligned average across rows: R² = 0.99).
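One plausible reading of the “diagonal-aligned average across rows” summary works in three steps. First, shift each row so that its diagonal element sits in a common column. Second, average the shifted rows into a single profile. Third, compare two such profiles with R². The sketch below implements that reading for square matrices on a circular track, where wrap-around shifting is valid. Any normalisation, and the handling of the rectangular CA3-to-CA1 case in the study, are not reproduced here. The successor and "STDP" matrices are hypothetical stand-ins.

```python
import numpy as np

def diagonal_aligned_profile(M):
    """Roll each row so its diagonal entry lands in the centre column, then
    average across rows. Assumes a square matrix on a circular track."""
    n = M.shape[0]
    centre = n // 2
    aligned = np.stack([np.roll(M[i], centre - i) for i in range(n)])
    return aligned.mean(axis=0)

def r_squared(reference, prediction):
    """Coefficient of determination of prediction against reference."""
    ss_res = np.sum((reference - prediction) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical successor matrix (deterministic rightward steps on a ring)
# compared with a noisy copy standing in for an STDP weight matrix
rng = np.random.default_rng(0)
n, gamma = 50, 0.9
T = np.roll(np.eye(n), 1, axis=1)
M_td = np.linalg.inv(np.eye(n) - gamma * T)
W_stdp = M_td + 0.05 * rng.standard_normal((n, n))
print(r_squared(diagonal_aligned_profile(M_td), diagonal_aligned_profile(W_stdp)))
```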
Additionally, we show a similar result in Figure 2–supplement 2b and c (added after the first round of review), where the synaptic weight matrix is updated online during learning. In these models all 50 (not just one) CA3 cells are able to drive CA1 cells during learning, and SR-like weight matrices still develop. In total, we believe these simulations and the new one performed here demonstrate that many-to-one projections do not pose a fundamental issue.
Regarding the effect on phase precession (or the lack thereof) when cells are driven by multiple neurons, the simulation described above, and the ones provided in response to the first round of reviews, show that this is not a substantial concern. Geisler et al. (2010) raised the possibility that in such a situation phase precession would not emerge in CA1. However, our simulations show that it is possible for CA1 cells receiving input from multiple CA3 cells to phase precess as long as there is some spatial structure to the connections. If a CA1 cell is most strongly driven by a population of CA3 cells in a similar location on the track (which therefore phase precess similarly), it too will phase precess. This spatial structure can be quite broad; for evidence of this, please see Figure 2–supplement 2f, included in our previous rebuttal, and Author response image 1 for an equivalent plot drawn from the spiking model simulation described above. In these figures we show the phase precession of CA1 when driven by the learnt synaptic weight matrix, W, which is significant over a large portion of the input CA3 neurons. This demonstrates that many-to-one connections are not incompatible with phase precession, and therefore our proposed learning mechanism can still work.
3. A similar concern holds with the phase offset between CA3 and CA1 found by Mizuseki et al. The theta+STDP mechanism learns the successor representation because the CA1 cells inherit their responses from a phase-precessing upstream CA3 cell, so the existence of a phase lag is troubling: it suggests that CA1 cells are not driven causally by CA3 cells in the way the model requires. You may be right that, if some external force were to artificially impose a fixed lag between the CA3 and CA1 cell, the proposed learning mechanism would still function, but now with a spatial offset. However, the Reviewer was concerned that the very existence of the phase lag challenges the basic spirit of the model, since CA1 cells are not driven by CA3 cells in the way that is required to learn causal relationships. At the very least, this needs to be addressed and discussed directly and openly in the Discussion section, but it would be better if the authors could implement a solution to the problem to show that the model can work when an additional mechanism is introduced to produce the phase lag (for example, a combination of EC and CA3 inputs at different theta phases?)
The reviewer is correct that, since there is a theta phase offset between CA3 and CA1, it is important to consider the possible impact on our model. Indeed, while Mizuseki et al. (2009) allude to a fixed phase difference between CA3 and CA1 neurons, the consequences for phase-precessing place cells are more nuanced. Importantly, in a later paper from 2012, Mizuseki et al. demonstrate that this offset in phase between CA3 and CA1 place cells varies at different stages of the theta cycle. Thus, as an animal first enters a place field and spikes are fired late in the theta cycle, CA1 spikes are emitted around 80° to 90° after spikes from CA3. However, as the animal progresses through the field, spikes from both regions precess to earlier phases, but the effect is more pronounced in CA1, meaning that by the time the animal exits a place field the phase offset between the two regions is essentially 0° (the key figure from Mizuseki et al. 2012 is shown in Figure 2–supplement 4g). Importantly, this result fits with the work of Hasselmo et al. (2002) and Colgin et al. (2009), both of which point to enhanced CA3→CA1 coupling at early theta phases; in other words, CA3’s influence on CA1 appears to be most pronounced in the latter half of place fields.
In response to this we have done two things. First, to simulate the effect of a variable phase offset, we ran the model as before but with offsets of 90°, 45° and 0°, which correspond to late, mid and early theta phases. We then averaged the resulting STDP weight matrices to generate a single prediction for a system in which the CA3-to-CA1 phase offset varies in a plausible fashion. The resulting matrix is still very similar to the TD successor matrix (diagonal-aligned average across rows: R² = 0.76) and clearly shows the SR-like asymmetry (positive band left of the diagonal, negative band right), confirming that our model is robust to the observed phase offset. These simulations, including the weight matrices for offsets of 90°, 45° and 0°, have now been included in a new figure panel appended to Figure 2–supplement 4,
and are referred to in the Results section:
“Additionally, we find these CA1 cells go on to inherit phase precession from the CA3 population even after learning when they are driven by multiple CA3 fields (Figure 2–supplement 4f), and that this learning is robust to realistic phase offsets between the populations of CA3 and CA1 place cells (Figure 2–supplement 4g).”
Secondly, we have updated the discussion to cover these points in more detail, and in particular we have addressed the nuances suggested by the experimental results of Hasselmo et al. (2002) and Colgin et al. (2009). Specifically, we indicate that because CA3→CA1 coupling is most pronounced at early theta phases, when the phase offset between the regions is at its lowest, the effect of the offset is likely to be less important than might immediately be thought. Thus the simulation presented above, which still learns a good approximation of the TD SR matrix (diagonal-aligned average across rows: R² = 0.76), should be considered a worst-case scenario.
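At the model's 10 Hz theta frequency, a phase offset maps directly onto a time lag. This shows why the 90° case is the most demanding: a 25 ms lag is comparable to the 20 ms window discussed for phase precession. The sketch below converts the three offsets into lags. It reports the relative STDP potentiation for a spike pair separated only by each lag, under an exponential window with a hypothetical 20 ms time constant. It also shows the averaging step used to combine per-offset weight matrices. The placeholder matrices are illustrative only.

```python
import numpy as np

THETA_HZ = 10.0   # theta frequency used in the model
TAU_STDP = 0.020  # hypothetical STDP time constant (s)

def phase_offset_to_lag_s(offset_deg, theta_hz=THETA_HZ):
    """Time lag (s) corresponding to a theta phase offset (degrees)."""
    return (offset_deg / 360.0) / theta_hz

def average_over_offsets(weight_matrices):
    """Combine weight matrices learnt at different fixed offsets by averaging."""
    return np.mean(np.stack(weight_matrices), axis=0)

for offset in (90, 45, 0):
    lag = phase_offset_to_lag_s(offset)
    print(f"{offset:>2} deg -> {lag * 1000:.1f} ms lag, "
          f"relative potentiation {np.exp(-lag / TAU_STDP):.2f}")

# Placeholder matrices standing in for the per-offset STDP results
W_by_offset = [np.full((3, 3), v) for v in (0.5, 1.0, 1.5)]
print(average_over_offsets(W_by_offset))
```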
We now expand upon these points in the Discussion:
“While the model is biologically plausible in several respects, there remain a number of aspects of the biology that we do not interface with, such as different cell types, interneurons and membrane dynamics. Further, we do not consider anything beyond the simplest model of phase precession, which directly results in theta sweeps in lieu of them developing and synchronising across place cells over time [60]. Rather, our philosophy is to reconsider the most pressing issues with the standard model of predictive map learning in the context of hippocampus (e.g., the absence of dopaminergic error signals in CA1 and the inadequacy of synaptic plasticity timescales). We believe this minimalism is helpful, both for interpreting the results presented here and for providing a foundation on which further work may examine these biological intricacies, such as whether the model’s theta sweeps can alternately represent future routes [61], e.g. by the inclusion of attractor dynamics [62]. Still, we show this simple model is robust to the observed variation in phase offsets between phase-precessing CA3 and CA1 place cells across different stages of the theta cycle [63]. In particular, this phase offset is most pronounced as animals enter a field (∼90°) and is almost completely reduced by the time they leave it (∼0°) (Figure 2—figure supplement 4g). Essentially, our model hypothesises that the majority of plasticity induced by STDP and theta phase precession will take place in the latter part of place fields, equating to earlier theta phases. Notably, this is in keeping with experimental data showing enhanced coupling between CA3 and CA1 in these early theta phases [64, 65]. However, as our simulations show (Figure 2—figure supplement 4g), even if these assumptions do not hold true, the model is sufficiently robust to generate SR-equivalent weight matrices for a range of possible phase offsets between CA3 and CA1.”
with details of the simulations added to the Methods:
“Phase shift between CA3 and CA1. In Figure 2—figure supplement 4g we simulate the effect of a decreasing phase shift between CA3 and CA1. As observed by Mizuseki et al. (2012) [87], the phase shift between CA3 and CA1 neurons is maximal (around 90°) at the end of each theta cycle, decreasing to 0° at the start. We simulate this by adding a temporal delay to all downstream CA1 spikes equivalent to phase shifts of 0°, 45° and 90°. The average of the weight matrices learned over all three cases still displays clear SR-like structure.”
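The Methods passage converts each phase offset into a fixed delay on CA1 spikes. A minimal per-synapse sketch of that idea follows; it assumes a 10 Hz theta rhythm and an all-to-all pairwise exponential STDP rule, and every parameter value and function name is hypothetical rather than taken from the model. Averaging weight matrices is elementwise, so averaging a single synapse's update over the 0°, 45° and 90° offsets mirrors the averaging described above.

```python
import math

THETA_HZ = 10.0  # assumed theta frequency in Hz (hypothetical)


def phase_offset_to_delay(offset_deg, theta_hz=THETA_HZ):
    """Convert a CA3->CA1 theta phase offset (degrees) into a delay (seconds)."""
    return (offset_deg / 360.0) / theta_hz


def stdp_dw(pre_times, post_times, a_plus=1.0, a_minus=0.4,
            tau_plus=0.02, tau_minus=0.04):
    """All-to-all pairwise STDP (hypothetical parameters): potentiate when the
    post spike follows the pre spike, depress when it precedes it."""
    dw = 0.0
    for t_pre in pre_times:
        for t_post in post_times:
            dt = t_post - t_pre
            if dt > 0:
                dw += a_plus * math.exp(-dt / tau_plus)
            elif dt < 0:
                dw -= a_minus * math.exp(dt / tau_minus)
    return dw


def delayed_dw(pre_times, post_times, offset_deg):
    """Weight change after delaying every CA1 (post) spike by the phase offset."""
    delay = phase_offset_to_delay(offset_deg)
    return stdp_dw(pre_times, [t + delay for t in post_times])


def averaged_dw(pre_times, post_times, offsets=(0.0, 45.0, 90.0)):
    """Mean update over several offsets; for weight matrices the same mean is
    taken elementwise."""
    return sum(delayed_dw(pre_times, post_times, o) for o in offsets) / len(offsets)


# A CA1 spike 10 ms before its CA3 input is depressed with no offset but
# potentiated once a 90 deg offset (25 ms at 10 Hz) delays it.
print(delayed_dw([0.0], [-0.01], 0.0), delayed_dw([0.0], [-0.01], 90.0))
```

Under these assumptions a 90° offset is a 25 ms delay, long enough to flip the sign of an individual spike pair; whether the averaged matrix keeps its SR-like structure depends on the full spike statistics, which this per-synapse sketch does not model.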
4. DV phase precession. The Reviewer would still like to see you introduce DV phase lags, which could be done with a simple modification of the existing simulations. At minimum, it is critical to remove/modify the sentence "A consequence of theta phase precession is that the cell with the smaller field will phase precess faster through the theta cycle than the other cell – initially it will fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it will fire earlier." As R2 noted in their original review, this is not the case when DV phase lags are taken into account, as was shown by Leibold and Monsalve-Mercado (2017). Ideally, the simulations should be updated to account for the DV phase lags, and the discussion updated to account for their functional implications.
Reviewer #2 (Recommendations for the authors):
While the authors have undertaken a number of important additional analyses that address some of the concerns raised in the review, several of the most pressing concerns regarding biological plausibility have not been addressed. In particular, each CA1 place field is still inherited from exactly one CA3 place field in the updated protocol, and cells still interact via their firing rates, with spikes only being used for the weight updates. Moreover, the authors chose not to address concerns regarding quantitative comparisons between the model and data. Overall, while the authors correctly point out that their primary contribution should be viewed as illustrating a mechanism to learn successor representations via phase precession and STDP, this message is undermined if the proposed mechanism cannot function when reasonable assumptions are made regarding the number of cells and their mode of interaction.
Thank you for highlighting this. The sentence mentioned was intended as a ‘strawman’ to motivate the subsequent analyses, which show that the different rates of phase precession induced by varied field sizes do not impair plasticity in a manner sufficient to segregate spatial scales (Figure 4). To be clear, we were referring to the fact that small place fields – found at the dorsal pole of the hippocampus – do phase precess more rapidly. The point is that phase precession is scaled to field size, so for a given distance travelled (say 20 cm) a small field will exhibit a greater change in spiking phase than a large one. We apologise for presenting this information unclearly. We have now made the following changes to ensure the intention of this paragraph is clear:
“Hypothetically, consider a small basis feature cell with a receptive field entirely encompassed by that of a larger basis cell with no theta phase offset between the entry points of both fields. A potential consequence of theta phase precession is that the cell with the smaller field would phase precess faster through the theta cycle than the other cell – initially it would fire later in the theta cycle than the cell with a larger field, but as the animal moves towards the end of the small basis field it would fire earlier. These periods of potentiation and depression instigated by STDP could act against each other, and the extent to which they cancel each other out would depend on the relative placement of the two fields, their size difference, and the parameters of the learning rule.”
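The 20 cm arithmetic in the response and the hypothetical in the quoted paragraph can be sketched together. The sketch assumes linear phase precession through one full theta cycle (360° at field entry to 0° at exit), a 10 Hz theta rhythm, one spike pair per position sample, a small field centred inside a large one, and an exponential STDP kernel; all sizes, parameters and names are hypothetical. Early in the small field its cell fires later than the large-field cell (potentiation), late in the field it fires earlier (depression), and with a symmetric kernel the two contributions cancel.

```python
import math

THETA_HZ = 10.0  # assumed theta frequency (Hz), hypothetical
PRECESSION_RANGE_DEG = 360.0  # assumed: precession spans one full theta cycle per field


def linear_phase(x, start, size):
    """Spike phase (deg): 360 at field entry, falling linearly to 0 at field exit."""
    return PRECESSION_RANGE_DEG * (1.0 - (x - start) / size)


def phase_change(distance, field_size):
    """Phase advance over a given distance; inversely proportional to field size."""
    return PRECESSION_RANGE_DEG * distance / field_size


def stdp_kernel(dt, a_plus, a_minus, tau_plus, tau_minus):
    """Pairwise STDP with dt = t_post - t_pre in seconds."""
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def net_nested_dw(large_size=1.0, small_size=0.2, n_samples=200,
                  a_plus=1.0, a_minus=1.0, tau_plus=0.02, tau_minus=0.02):
    """Net large->small (pre->post) update as the animal crosses a small field
    centred inside a large one. Positions are sampled at bin midpoints; with an
    even n_samples no sample falls exactly at the centre, where dt would be 0."""
    small_start = (large_size - small_size) / 2.0
    total = 0.0
    for k in range(n_samples):
        x = small_start + small_size * (k + 0.5) / n_samples
        phase_small = linear_phase(x, small_start, small_size)
        phase_large = linear_phase(x, 0.0, large_size)
        # A larger phase means a later spike within the cycle, so dt > 0 when
        # the small-field (post) cell fires after the large-field (pre) cell.
        dt = (phase_small - phase_large) / 360.0 / THETA_HZ
        total += stdp_kernel(dt, a_plus, a_minus, tau_plus, tau_minus)
    return total


print(phase_change(20, 40), phase_change(20, 200))  # → 180.0 36.0
print(net_nested_dw())             # symmetric kernel: close to zero
print(net_nested_dw(a_minus=0.5))  # weaker depression: net potentiation
```

Under these assumptions a 40 cm field advances 180° over 20 cm while a 200 cm field advances only 36°. The net nested-field update is approximately zero for a symmetric kernel and becomes positive or negative once `a_plus` and `a_minus` differ, illustrating that the degree of cancellation depends on the parameters of the learning rule.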
Similarly, as outlined in the simulations above, graduated theta phase offsets of up to and including 90° are also insufficient to impair the plasticity induced by STDP and phase precession. Applying both of these findings in the context of theta as a travelling wave across the dorsal-ventral axis, our original conclusion remains unchanged: topographic organisation of place cells by size along the DV axis is necessary to prevent cross-binding and to preserve multiscale structure in the resulting successor features.
We now clarify these points in the discussion:
“The distribution of place cell receptive field size in hippocampus is not homogeneous. Instead, place field size grows smoothly along the longitudinal axis (from very small in dorsal regions to very large in ventral regions). Why this is the case is not clear – our model contributes by showing that, without this ordering, large and small place cells would all bind via STDP, essentially overwriting the short-timescale successor representations learnt by small place cells with long-timescale successor representations. Topographically organising place cells by size anatomically segregates place cells with fields of different sizes, preserving the multiscale successor representations. Further, our results exploring the effect of different phase offsets on STDP-successor learning (Figure 2—figure supplement 4g) suggest that the gradient of phase offsets observed along the dorsoventral axis [79, 80] is insufficient to impair the plasticity induced by STDP and phase precession. The premise that such separation is needed to learn multiscale successor representations is compatible with other theoretical accounts for this ordering. Specifically, Momennejad and Howard [39] showed that exploiting multiscale successor representations downstream, in order to recover information which is ‘lost’ in the process of compiling state transitions into a single successor representation, typically requires calculating the derivative of the successor representation with respect to the discount parameter. This derivative calculation is significantly easier if the cells – and therefore the successor representations – are ordered smoothly along the hippocampal axis.”
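The point about derivatives can be made concrete with the standard closed form of the successor representation, M(γ) = (I − γT)⁻¹, whose derivative with respect to γ is M T M. A minimal sketch, assuming a random walk on a 20-state ring as a hypothetical environment: when successor matrices at neighbouring discounts are available, as they would be if cells were ordered smoothly by timescale, a central finite difference recovers the derivative closely. This only illustrates the finite-difference argument; it is not Momennejad and Howard's method, and the function names are hypothetical.

```python
import numpy as np


def successor_matrix(T, gamma):
    """SR for transition matrix T and discount gamma: M = (I - gamma T)^-1."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)


def dM_dgamma_analytic(T, gamma):
    """Exact derivative: d/dgamma (I - gamma T)^-1 = M T M."""
    M = successor_matrix(T, gamma)
    return M @ T @ M


def dM_dgamma_neighbours(T, gamma, h):
    """Central finite difference from SRs at the neighbouring discounts
    gamma - h and gamma + h, as smoothly ordered cells would provide."""
    return (successor_matrix(T, gamma + h) - successor_matrix(T, gamma - h)) / (2 * h)


# Unbiased random walk on a ring of 20 states (hypothetical environment).
n_states = 20
T = 0.5 * (np.roll(np.eye(n_states), 1, axis=1) + np.roll(np.eye(n_states), -1, axis=1))
exact = dM_dgamma_analytic(T, 0.8)
approx = dM_dgamma_neighbours(T, 0.8, 1e-3)
print(np.max(np.abs(exact - approx)))  # small, and shrinks roughly as h**2
```

The finite-difference error grows with the spacing between neighbouring discounts, which is one way to see why a smooth, finely graded ordering of timescales along the axis would help.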
https://doi.org/10.7554/eLife.80663.sa2
Article and author information
Author details
Funding
Wellcome Trust (212281/Z/18/Z)
 William de Cothi
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.
Acknowledgements
We thank Wellcome for supporting this work through the Senior Research Fellowship awarded to C.B. [212281/Z/18/Z]. We also thank Samuel J Gershman and Talfan Evans for useful feedback on the manuscript.
Senior and Reviewing Editor
 Michael J Frank, Brown University, United States
Reviewer
 Michael E Hasselmo, Boston University, United States
Publication history
 Preprint posted: April 21, 2022
 Received: May 30, 2022
 Accepted: February 26, 2023
 Version of Record published: March 16, 2023 (version 1)
Copyright
© 2023, George, de Cothi et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.