Successor-like representation guides the prediction of future events in human visual cortex and hippocampus

  1. Matthias Ekman (corresponding author)
  2. Sarah Kusch
  3. Floris P de Lange
  1. Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Netherlands

Abstract

Human agents build models of their environment, which enable them to anticipate and plan upcoming events. However, little is known about the properties of such predictive models. Recently, it has been proposed that hippocampal representations take the form of a predictive map-like structure, the so-called successor representation (SR). Here, we used human functional magnetic resonance imaging to probe whether activity in the early visual cortex (V1) and hippocampus adheres to the postulated properties of the SR after visual sequence learning. Participants were exposed to an arbitrary spatiotemporal sequence consisting of four items (A-B-C-D). We found that after repeated exposure to the sequence, merely presenting single sequence items (e.g., - B - -) resulted in V1 activation at the successor locations of the full sequence (e.g., C-D), but not at the predecessor locations (e.g., A). This highlights that visual representations are skewed toward future states, in line with the SR. Similar results were also found in the hippocampus. Moreover, the hippocampus developed a coactivation profile that showed sensitivity to the temporal distance in sequence space, with fading representations for sequence events in the more distant past and future. V1, in contrast, showed a coactivation profile that was only sensitive to spatial distance in stimulus space. Taken together, these results provide empirical evidence for the proposition that both visual and hippocampal cortex represent a predictive map of the visual world akin to the SR.

Editor's evaluation

In this paper, Ekman and colleagues present compelling fMRI evidence from a visual sequence task that both the early visual cortex (V1) and the hippocampus represent perceptual sequences in the form of a predictive "successor" representation, where the current state is represented in terms of its future (successor) states in a temporally discounted fashion. In both brain structures, there was evidence for upcoming, but not preceding steps in the sequence, and these results were found only in the temporal but not spatial domain. This study offers the fundamental suggestion that both the hippocampus and V1 represent temporally structured information in a predictive, future-oriented manner.

https://doi.org/10.7554/eLife.78904.sa0

Introduction

Anticipation and planning of future visual input require knowledge of the relational structure between events. The relational structure, for instance that stimulus B usually follows stimulus A, is learned through exposure during past experiences (Behrens et al., 2018; Finnie et al., 2021; Gavornik and Bear, 2014) and can be used to build a model or cognitive map (Tolman, 1948) that enables us to generate inferences in situations with noisy or partial input (Ekman et al., 2017; Momennejad, 2020; Schwartenbeck et al., 2021).

In the visual domain, with rapidly changing input, it remains unknown what the inherent properties of the model underlying our predictions are. On the one hand, such a model needs to be efficient enough to generate predictions from a constant stream of visual input, while on the other hand, it also needs to allow for flexible updating in an ever-changing environment. In the context of hippocampal representations, the successor representation (SR) has been recently proposed (Dayan, 1993; Stachenfeld et al., 2017) to balance this trade-off between flexibility and efficiency. The SR postulates a predictive representation in which the current state is represented in terms of its future (successor) states, in a temporally discounted fashion. The SR is dependent on the actual experience, with states experienced more frequently being represented more strongly. This enables learning in an environment without explicit reward (Gläscher et al., 2010). This hypothesis captures many aspects of empirical hippocampal place cell firing patterns, like the exponential decay toward distant future locations (Alvernhe et al., 2011; Mehta et al., 2000).

Previous research has repeatedly shown that prior expectations influence neural activity in the visual cortex (Ekman et al., 2017; Gavornik and Bear, 2014; Hindy et al., 2016; Kok et al., 2012; Xu et al., 2012). It remains, however, unknown if SR-like representations are present outside the hippocampus in areas like the early visual cortex (V1) that have a strong retinotopic organization. Theoretically, it is possible that V1 receptive fields, analogous to hippocampal place fields, become tuned to respond not only to the current input, but also to expected future inputs. Here, we propose that the computationally efficient and flexible properties of the SR could in theory also underlie the anticipation of future events in V1.

To directly test this hypothesis, we conducted a functional magnetic resonance imaging (fMRI) study in which participants were presented with an arbitrary visual dot sequence (A-B-C-D). After initial sequence exposure, we introduced occasional omission trials, where only one element of the sequence was presented (e.g., B), while the rest of the sequence (e.g., A, C, and D) was omitted. These partial sequence trials allowed us to study expectations of future stimulus sequences in the absence of physical stimulation. This design allowed us to test the specific assumptions of the SR and also assess whether V1 predictions were better described by an alternative mechanism called pattern completion. Pattern completion describes a framework in which autoassociative connections within the hippocampal CA3 regions reactivate related sequence items from partial input (Deuker et al., 2014; Leutgeb and Leutgeb, 2007; Rolls, 2013) that is then propagated to sensory regions such as V1 (Hindy et al., 2016). In contrast to the SR, pattern completion predicts reactivations of all associated items, without any skewing toward future locations or temporal discounting of events that are farther in the future.

Using fMRI, we found reactivations of future sequence locations (e.g., C-D), but not of past locations (e.g., A) in both V1 and hippocampus. In line with the SR, a model comparison confirmed that predictive representations constitute a map-like structure, with exponential decay toward distant future states. Further, more detailed analysis of predictive codes revealed that hippocampus represented visual locations based on their temporal proximity within the sequence, rather than spatial distance.

Taken together, these data suggest that humans predict upcoming visual input by using a generative model whose properties resemble the SR. Importantly, the presence of SR-like representations in V1 indicates that SR might be a more ubiquitous coding schema that is present beyond hippocampal place cells. Finally, while SR-like representations were found to be present in both V1 and hippocampus, the predictive codes between these areas revealed complementary tuning properties, with hippocampus being sensitive to temporal distance and V1 being more sensitive to retinotopic spatial distance.

Results

Human observers (N = 35) were exposed to four dots presented in rapid succession that formed an arbitrary visual sequence A-B-C-D (Figure 1A). Dot locations were sampled from eight locations (Figure 1B) and the resulting possible sequences were randomly assigned across subjects. After an initial exposure period with the full sequence (352 trials outside the scanner, 160 trials inside the scanner), occasionally only one item of the sequence was presented, omitting the remaining sequence items (e.g., partial sequence trial ‘- B - -’ where B is shown and A, C, and D are omitted; Figure 1A). Participants were instructed to maintain fixation throughout the experiment, and tasked to detect a slight temporal onset delay (170 ms vs. 17 ms) of the last sequence dot that occurred in ~40% of the full sequence trials. The task was designed to be demanding (group averaged hit rate = 78%, SD = 8%) and to keep participants’ attention on the sequence.

Sequence paradigm to probe successor-like representations.

(a) Stimulus timing for full sequence trials (top) and partial sequence trials (bottom). During full sequence trials, four dots were presented in rapid succession in a fixed sequence order (A-B-C-D). During partial sequence trials, only one of the four dots was presented, omitting the remaining sequence dots. Here shown for -B - -, while A- - -, - -C-, and - - -D partial trials were also presented. (b) Sequences were randomized across subjects such that sequence locations were sampled from a total of eight possible locations with the constraint that every quadrant was stimulated once. Dot locations were evenly spaced around central fixation at a radius of 7 degrees visual angle (dva). (c) Independent stimulus localizer trials to map out stimulus representations.

We hypothesized that presenting only one item of the sequence would elicit anticipatory activity at the omitted sequence locations that followed the presented stimulus (i.e., successor states), but not at the sequence location that preceded the sequence item (i.e., predecessor states). For instance, during partial ‘- B - -’ trials, we expected activity at omitted sequence locations C (+1) and D (+2), but not at omitted location A (–1).

Stimulus sequences elicit spatially specific responses in V1

To test our prediction, we first selected V1 subregions of interest (ROIs) that responded selectively to the eight stimulus locations based on an independent localizer session (Figure 1C). Stimulus-response profiles of these eight (retinotopic) ROIs show little coactivation of neighboring locations in the visual field, which allows for a precise investigation of location-specific activity (Figure 2A). Unsurprisingly, during full sequence trials, BOLD activity at the sequence locations receiving bottom-up visual input was markedly enhanced compared to non-stimulated control locations (Figure 2B). Population-based receptive field (pRF) data that were acquired for a subset of participants confirmed that the selected voxels correspond to the retinotopic stimulus locations as expected. For all analyses, we subtracted the average BOLD activity of all control locations from the sequence location activity (Figure 3A), which provides an accurate measure of stimulus-specific responses independent of global signal fluctuations, for instance due to attention.
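The control-location subtraction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name, the location indices, and the activity values are all hypothetical.

```python
from statistics import mean


def baseline_corrected(roi_activity, sequence_locs, control_locs):
    """Subtract the mean control-location activity from each sequence location."""
    baseline = mean(roi_activity[loc] for loc in control_locs)
    return {loc: roi_activity[loc] - baseline for loc in sequence_locs}


# Hypothetical BOLD estimates (arbitrary units) for the eight location ROIs.
activity = {0: 1.9, 1: 0.3, 2: 2.1, 3: 0.1, 4: 2.0, 5: 0.2, 6: 1.8, 7: 0.2}

# Hypothetical split: four sequence locations, four non-stimulated controls.
corrected = baseline_corrected(
    activity, sequence_locs=[0, 2, 4, 6], control_locs=[1, 3, 5, 7]
)
print(corrected)
```

Because the same baseline is removed from every sequence location, a global shift in signal (for example due to attention) cancels out, while location-specific differences are preserved.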

V1 stimulus mapping.

(a) An independent stimulus localizer was used to identify V1 subpopulations that respond to individual dot locations (left). Stimulus-response profiles show tuning properties for selected V1 populations (middle). Visualizing stimulus activity by projecting group averaged BOLD activity (n=35) into stimulus space (right) shows focal activity at the stimulated location with minimal spreading to neighboring locations. (b) Identified V1 subpopulations during full sequence trials (left) show heightened BOLD activity compared to non-stimulated control locations (middle). Group averaged (n=7) sequence activity projected into stimulus space shows spatially specific activity at the stimulated locations (right).

Successor-like representation of future sequence events in V1.

(a) BOLD activity during full sequence trials. (b) Schematic of all partial sequence trials (left) illustrating the omission of different predecessor (purple), or successor (orange) sequence locations. Group averaged (n=35) V1 activity during partial sequence trials (right) shows enhanced activation of successor locations compared to predecessor locations. (c) Group averaged V1 activity for individual partial sequence trials. Error bars denote ± s.e.m.; two-tailed t test, ***p < 0.001; **p < 0.01; *p < 0.05, uncorrected for multiple comparisons.

Anticipated stimulus sequences in V1

Briefly flashing individual dots during partial sequence trials, while omitting the other dots of the sequence, allowed us to probe anticipatory activity at the successor and predecessor locations (Figure 3B). In line with our predictions, V1 BOLD activity was indeed enhanced at the non-stimulated successor locations compared to the non-stimulated predecessor locations (averaged across all partial trials and sequence locations; t(34) = 6.45, p = 2.23 × 10^-7). The same pattern of future-directed prediction was also evident from the visual inspection of BOLD activity for all partial sequence trials separately (Figure 3C). Further, this result of greater activity at successor compared to predecessor locations also holds when comparing individual sequence locations without averaging (i.e., comparing non-stimulated location B when successor vs. predecessor, t(34) = 5.72, p = 2.02 × 10^-6; and location C when successor vs. predecessor, t(34) = 3.13, p = 0.0035).

The activity decay toward distant future locations was formally tested by fitting an exponential decay factor γ ∈ [0,1] to each participant’s data. Here, values closer to 0 indicate a steeper decay and values closer to 1 indicate no decay. In line with our predictions, we found a group averaged decay factor of γ = 0.14 (±0.03 s.e.m.) that was statistically significantly different from 1 (non-parametric t test t(34) = –17.17, p = 2.54 × 10^-18).
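To make the role of γ concrete, the following sketch fits a decay factor to a single participant's successor activity by grid search. The fitting procedure itself is described in Materials and methods, which is not part of this section, so the model form (activity at k steps ahead proportional to γ^k with a free scale β), the grid search, and the example data are assumptions for illustration only.

```python
def fit_decay(activity, step=0.01):
    """Grid-search the discount factor gamma in [0, 1].

    activity[k-1] is the baseline-corrected response at the successor
    location k steps ahead. Assumed model: activity_k ~ beta * gamma**k,
    with the scale beta solved in closed form for each candidate gamma.
    """
    best_gamma, best_error = 0.0, float("inf")
    n_steps = round(1 / step)
    for i in range(n_steps + 1):
        gamma = i / n_steps
        x = [gamma ** k for k in range(1, len(activity) + 1)]
        sxx = sum(v * v for v in x)
        # Least-squares scale; undefined when gamma == 0, so fall back to 0.
        beta = sum(v * y for v, y in zip(x, activity)) / sxx if sxx > 0 else 0.0
        error = sum((y - beta * v) ** 2 for v, y in zip(x, activity))
        if error < best_error:
            best_gamma, best_error = gamma, error
    return best_gamma, best_error


# Synthetic data generated with gamma = 0.2 and scale 0.5 (hypothetical values).
synthetic = [0.5 * 0.2 ** k for k in (1, 2, 3)]
gamma_hat, fit_error = fit_decay(synthetic)
print(gamma_hat, fit_error)
```

A fitted value near 0 corresponds to activity confined to the immediate successor, whereas a value near 1 corresponds to equal activity at all future locations.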

One might argue that participants with stronger predictions toward future locations would perform better at the behavioral detection task. However, no such correlation between individual V1 BOLD activity and task accuracy was found in an across-subject correlation analysis (see Materials and methods, Spearman r = 0.05; p = 0.769).

Successor-like representation in V1

Next, we sought to formally test how well the observed data fit the prediction of the SR, namely an exponential decay of activity for states farther into the future.

For each subject, we fitted partial sequence trials with an SR model (Figure 4A), keeping the exponential decay parameter γ as a free parameter (see Materials and methods). In order to evaluate how well the SR model resembled the data, we then computed the error between the SR prediction and the actual data (lower values indicate a better model fit). For comparison, we additionally fitted a traditional pattern-completion co-occurrence (CO) model, which predicts that events that occur together will be reactivated together (Figure 4B). In contrast to the SR model, predictions of the CO model are non-directional, meaning that it predicts equal reactivation of both successor and predecessor locations. Furthermore, while the SR model predicts a temporal discounting toward future states, the CO model assumes no differential activity of reactivated states. In our implementation of the CO model, anticipatory activity was modulated by one multiplicative parameter ω. Additionally, as a baseline model, we also evaluated a null model (H0) that assumes no predictive activity (i.e., no difference between successor and predecessor locations). Note that in order to be interpretable as a predictive representation, the best-fitting model should not only have the smallest error, but also differ significantly from the H0 model.
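The difference between the two model predictions can be illustrated with a short sketch. It uses the standard SR definition from the cited literature, M = Σ_k γ^k T^k (equivalently (I − γT)^-1), applied to the directed A-B-C-D transition matrix. Whether the authors' implementation includes the diagonal (current-state) term, and how exactly the CO matrix is parameterized, is not stated in this section, so `sr_matrix` and `co_matrix` are illustrative assumptions.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sr_matrix(T, gamma):
    """Successor representation M = sum_k gamma^k T^k for k = 0 .. n-1.

    The truncation is exact here because the A-B-C-D chain has no cycles,
    so every power of T beyond n-1 is zero.
    """
    n = len(T)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]   # k = 0 term: the current state itself
    Tk = [row[:] for row in identity]  # running power T^k
    for k in range(1, n):
        Tk = matmul(Tk, T)
        for i in range(n):
            for j in range(n):
                M[i][j] += gamma ** k * Tk[i][j]
    return M


def co_matrix(n, omega):
    """Non-directional co-occurrence prediction: every other item gets omega."""
    return [[1.0 if i == j else omega for j in range(n)] for i in range(n)]


# Directed transitions A->B, B->C, C->D (rows = current state, columns = next state).
T = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]

M = sr_matrix(T, gamma=0.3)
C = co_matrix(4, omega=0.3)
print([round(v, 3) for v in M[1]])  # SR prediction when B is shown
print(C[1])                         # CO prediction when B is shown
```

In the row for B, the SR assigns zero to predecessor A and discounted values to successors C (γ) and D (γ²), while the CO model assigns the same weight ω to A, C, and D.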

Model comparison favors successor-like representation in V1.

(a) Probing predictions of the successor representation (SR) against the competing co-occurrence (CO) model. The relational structure of the full sequence A-B-C-D is translated into a transition matrix (top), where a non-zero value indicates a transition between two states in the sequence. The SR matrix (bottom) is computed from the transition matrix, here shown with a temporal discount factor of γ = 0.3 (see Materials and methods). (b) The relational structure in the CO model is non-directional, resulting in a constant prediction of past and future states weighted by a factor ω. (c) Competing model predictions were fitted to partial sequence trial V1 data of each individual participant with γ and ω as free parameters. Comparison of model errors showed that the data is most in line with the SR. A null model (bottom), resembling no prediction of past and future locations, was included in the model comparison as baseline. Error bars denote ± s.e.m.; BIC, Bayesian Information Criterion (taking into account that the H0 model has fewer parameters). Two-tailed t test, ***p < 0.001, uncorrected for multiple comparisons.

Our results show that anticipatory activity in V1 is best described by the predictions of the SR (Figure 4C; SR vs. CO t(34) = –2.29, p = 0.028). Additionally, both SR and CO describe the data better than the null model (SR vs. H0 t(34) = –8.25, p = 1.24 × 10^-9; CO vs. H0 t(34) = –7.59, p = 8.22 × 10^-9).
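The Figure 4 legend notes that model errors are compared using the Bayesian Information Criterion (BIC) to account for the H0 model having fewer parameters. As a rough sketch of how such a penalty works, the example below uses the common least-squares form BIC = n·ln(RSS/n) + k·ln(n). The exact error metric and BIC variant used by the authors are not given in this section, and the data and predictions are hypothetical.

```python
import math


def sse(prediction, data):
    """Sum of squared errors between a model prediction and the data."""
    return sum((p - d) ** 2 for p, d in zip(prediction, data))


def bic(rss, n_obs, n_params):
    """BIC for a least-squares fit with Gaussian errors (lower is better)."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Hypothetical baseline-corrected activity at six omitted locations:
# three successor observations followed by three predecessor observations.
data = [0.30, 0.10, 0.28, -0.05, -0.04, -0.06]

sr_pred = [0.30, 0.09, 0.30, 0.0, 0.0, 0.0]  # SR with gamma = 0.3 (one free parameter)
h0_pred = [0.0] * 6                          # null model (no free parameter)

bic_sr = bic(sse(sr_pred, data), n_obs=len(data), n_params=1)
bic_h0 = bic(sse(h0_pred, data), n_obs=len(data), n_params=0)
print(bic_sr, bic_h0)
```

The parameter penalty k·ln(n) means that a model with a free parameter must reduce the residual error by enough to justify the added flexibility before it is preferred over the null model.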

SR in hippocampus

The predictive neural representation in the form of an SR was originally postulated for the hippocampus (Stachenfeld et al., 2017). We therefore wanted to investigate whether the predictive representations that we observed in V1 were also present in the hippocampus. Note that while the hippocampal formation and nearby entorhinal cortex might feature a coarse representation of visual space (Killian et al., 2012; Knapen, 2021; Nau et al., 2018b; Silson et al., 2021), it does not feature the same fine-scale retinotopic organization present in V1 (Dumoulin and Wandell, 2008). Therefore, instead of focusing on univariate BOLD activity within certain hippocampal subregions, we focused on population activity patterns across the entire hippocampus using a decoding approach similar to previous studies (Ekman et al., 2022; Kok and Turk-Browne, 2018; Kurth-Nelson et al., 2016; Russek et al., 2021; Schapiro et al., 2012).

In keeping with the V1 analysis, we used the independent stimulus localizer to extract location-specific activity patterns in the hippocampus and then tested during partial sequence trials to what extent location-specific representations were reactivated. Specifically, we trained a pattern classifier to distinguish between the eight dot locations within the localizer. Before applying the trained classifier to omission trials of the main task (see Materials and methods), we confirmed that cross-validated decoding accuracies within the localizer were above chance-level to ensure that the hippocampal pattern shows a reliable representation of space.

Within-localizer decoding accuracy confirmed that hippocampus has a coarse representation of the eight stimulus locations (Figure 5B) within the localizer (two-sided one-sample t test; t(34) = 3.28, p = 0.002; cross-validated accuracy = 15 ± 3.6%, mean ± s.d.; see Materials and methods). Notably, compared to V1 (Figure 2A), within-localizer accuracy was relatively low and, as a consequence, tuning curves in hippocampus appeared less sharp (Figure 5C). In order to maximize sensitivity for the hippocampus, we averaged classification evidence across successor and predecessor locations. Non-averaged results can be found in Figure 5—figure supplement 1.

Figure 5 (with 1 supplement)
Hippocampus represents spatial locations and engages in future-directed predictions.

(a) Hippocampus region of interest (green). (b) A pattern classifier was trained to distinguish between the eight stimulus locations during a perceptual localizer. Resulting stimulus-response profiles reveal that hippocampus distinguishes between individual stimulus locations. (c) Averaged (n=35) tuning profiles shifted to one location. (d) A classifier that was trained on the perceptual localizer was applied to partial sequence trials during the main task to probe whether hippocampal representations skew toward predecessor locations (purple), or successor locations (orange). (e) Classifier evidence, averaged across possible successor and predecessor locations, shows that hippocampus predominantly represents future (successor) stimulus locations over predecessor locations. (f) Since the hemodynamic properties of hippocampal functions are not well understood, the decoding analysis was additionally performed in a time-resolved manner and fitted with a canonical hemodynamic function to estimate the time to peak. The difference time-course (successor minus predecessor) showed a temporally distinct peak around 4.7 s, indicating that the future-directed prediction occurs as a transient response to the partial stimulus input and not as a sustained signal throughout the trial. Error bars denote ± s.e.m.; **p < 0.01.

Applying the trained classifier to partial sequence trials of the main task, we asked whether hippocampus would preferentially reactivate successor or predecessor locations (Figure 5D). To answer this question, we first subtracted the probabilistic classifier evidence for the control locations from the classifier evidence of the sequence locations. Consequently, values greater than 0 reflect evidence for the reactivation of sequence representations, while values smaller than 0 reflect a relative suppression of sequence locations. We then averaged the evidence across all successor and predecessor locations, respectively, and tested for differences across participants. Our results reveal that hippocampal representations were preferentially biased toward successor locations (Figure 5E; paired-sample t test, t(34) = 2.74, p = 0.009), mirroring the results found in V1.
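The evidence computation for a single partial sequence trial can be sketched as follows. This is an illustrative assumption about the bookkeeping only (the classifier, cross-validation, and trial averaging are described in Materials and methods); the function name, location indices, and evidence values are hypothetical.

```python
from statistics import mean


def reactivation_evidence(evidence, sequence, shown, controls):
    """Return (successor, predecessor) evidence for one partial sequence trial.

    evidence: classifier probability per location index.
    sequence: location indices in sequence order (A, B, C, D).
    shown:    location index of the item that was actually presented.
    controls: location indices that were never part of the sequence.
    """
    baseline = mean(evidence[loc] for loc in controls)
    pos = sequence.index(shown)
    successors = [evidence[loc] - baseline for loc in sequence[pos + 1:]]
    predecessors = [evidence[loc] - baseline for loc in sequence[:pos]]
    return (mean(successors) if successors else None,
            mean(predecessors) if predecessors else None)


# Hypothetical classifier output for a '- B - -' trial (probabilities sum to 1).
evidence = {0: 0.08, 1: 0.10, 2: 0.16, 3: 0.10, 4: 0.20, 5: 0.10, 6: 0.10, 7: 0.16}
sequence = [0, 4, 2, 7]   # hypothetical locations of A, B, C, D
controls = [1, 3, 5, 6]

succ, pred = reactivation_evidence(evidence, sequence, shown=4, controls=controls)
print(succ, pred)
```

In this toy trial the successor value is positive and the predecessor value is negative, which is the pattern the sign convention above is meant to expose. Trials in which the first or last item is shown have no predecessors or successors, respectively, and return `None` for that entry.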

Finally, in order to better understand the temporal dynamics of the anticipatory representations in hippocampus, we repeated the decoding analysis in a time-resolved manner. We reasoned that if reactivations of future sequence locations were triggered by the brief presentations of partial sequence dots, the evidence time-course should follow a transient response profile. Alternatively, if hippocampus were to signal a constant bias toward future sequence locations, the evidence time-course should be unrelated to the stimulus onset and show a sustained temporal profile.

Results of the evidence difference time-course clearly show a transient response peaking approximately 4.7 s post-stimulus onset (Figure 5F) indicating that hippocampal predictions were triggered by the partial sequence dot. Note that the decoding time-course reflects the evidence for successor locations versus predecessor locations independent of the bottom-up stimulus. The transient decoding profile can therefore not simply reflect the onset of a given trial.

In order to probe the relationship between hippocampus and V1 successor reactivations, we performed an across-subject analysis, correlating V1 BOLD activity, averaged across all successor locations, with hippocampus classifier evidence, averaged across all successor locations. No significant relationship was observed (Spearman correlation, r = –0.08, p = 0.668).

One could ask whether our findings are specific to V1 and hippocampus, or widespread throughout the brain. In order to answer this question, we repeated the analysis for low-level visual area V2. In contrast to V1, no predictive effects were found in area V2. V2 BOLD activity was not enhanced at the non-stimulated successor locations compared to the non-stimulated predecessor locations (averaged across all partial trials and sequence locations; t(34) = 1.41, p = 0.168).

Hippocampal codes preserve spatiotemporal tuning

In contrast to V1, hippocampal representations are not inherently retinotopic and feature only a coarse representation of visual space (Knapen, 2021; Nau et al., 2018b; Silson et al., 2021). Instead, hippocampal place cells provide a detailed representation of the allocentric position in an environment. More recently, however, an intriguing picture has emerged in which hippocampus also contributes to a more general organization of information by representing non-spatial aspects of experience in a map-like way (Constantinescu et al., 2016; Garvert et al., 2017; Stachenfeld et al., 2017), similar to the representation of space (Aronov et al., 2017).

Inspired by these recent observations, we asked what the underlying properties of the reported hippocampal representations were. Given that we successfully trained a classifier based on eight spatial locations, it might seem obvious to conclude that the underlying code for these representations is purely spatial (retinotopic) as well. This is, however, not necessarily the case, given that the localizer was shown after the main task and might therefore reflect persistent predictive representations. Instead, robust discrimination of sequence locations could theoretically also be based on coding of temporal properties of the sequence. Indeed, Deuker et al., 2016 have recently shown that hippocampal representations can in principle reflect both spatial and temporal aspects. In our case, a temporal coding mechanism could represent stimulus locations not based on proximity in space, but rather by proximity in time.

In order to address this question, we conducted a detailed analysis of the coactivation pattern in the stimulus localizer (Figure 6A). Note that the localizer was shown at the end of the study, allowing us to test whether learned associations persisted even after the full sequence was not relevant anymore. Here, coactivations were defined as activation of non-stimulated locations. For instance, when presenting stimulus A, locations B-C-D might become activated as well. In general, such coactivations are often attributed to noise or ambivalent responses driven by overlapping receptive fields. However, in this case, we made use of the coactivation pattern to draw inferences about the learned persistent representations.

Stimulus localizer reveals complementary coactivation (tuning) properties in hippocampus and V1.

(a) Schematic of the localizer trial with the stimulated location ‘A’ and the non-stimulated locations (B, C, D, dashed circle) that were part of the sequence in the main task preceding the localizer. (b) Illustration of coactivation (tuning) of sequence locations based on spatial (Euclidean) distance from the stimulated location (left) and temporal distance in sequence space (right). Note how sequence locations A and B are far apart in the spatial (Euclidean) domain, but close in terms of temporal distance in sequence space. (c) Hypothetical activation pattern for representational tuning of spatial distance and temporal distance for illustration shown in (b). (d) Illustration of tuning pattern averaged across all localizer conditions for temporal tuning (top), spatial tuning (middle), Successor Representation (SR, middle), and no coactivation (H0, bottom). For visualization purposes the x-axis is sorted by time for all three tuning patterns. (e) Classifier evidence for current, future, and past locations for hippocampus (left) and V1 (right). (f) Comparing model errors (i.e., lower is better) show that hippocampal representations were best described by temporal coactivation (left), while V1 (right) was best described by spatial coactivation and the absence of coactivation (H0) of sequence locations. Error bars denote ± s.e.m.; two-tailed t test, ***p < 0.001; **p < 0.01; *p < 0.05, uncorrected for multiple comparisons; BIC, Bayesian Information Criterion (taking into account that the H0 model has fewer parameters).

Specifically, for a representation based on spatial coactivation (tuning) one would expect a coactivation of nearby spatial locations. In other words, the coactivation of non-stimulated locations should be modulated by the spatial (Euclidean) distance to the stimulated location (Figure 6B). This spatial tuning pattern is typically seen in early visual areas with overlapping receptive fields and is also visually present in our V1 results (Figure 2A). Alternatively, for a representation based on temporal coactivation (tuning) one would expect a coactivation of nearby locations in sequence space (Figure 6C).

Thus, spatial and temporal tuning codes lead to different coactivation patterns (Figure 6D) that can be disentangled with our stimulus paradigm. For instance, sequence locations A and B are far apart in the spatial (Euclidean) domain, but close in the temporal domain (distance in sequence space). Conversely, locations A and D are close in terms of spatial distance, but far apart in terms of temporal distance (Figure 6B). Note that while spatial tuning is in principle independent of any task-specific experience, temporal tuning requires exposure to a sequential structure and can therefore only occur for the four dots that were part of the sequence. For this reason, we restricted the coactivation (tuning) analysis to the four dot locations that were part of the sequence.
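The dissociation between the two distance measures can be worked through numerically. The sketch below places eight evenly spaced locations at 7 dva eccentricity, as stated in the Figure 1 legend. The angular offset of the first location and the assignment of A, B, C, and D to location indices are assumptions chosen to reproduce the example in the text (one dot per quadrant, A opposite B, A adjacent to D).

```python
import math

RADIUS_DVA = 7.0  # eccentricity stated in the Figure 1 legend


def location_xy(index, n_locations=8, offset_deg=22.5):
    """Cartesian position (dva) of one of the evenly spaced dot locations.

    offset_deg is an assumption; the angle of the first location is not given.
    """
    angle = math.radians(offset_deg + index * 360.0 / n_locations)
    return RADIUS_DVA * math.cos(angle), RADIUS_DVA * math.sin(angle)


# Hypothetical assignment of sequence items to location indices (one per quadrant).
SEQUENCE = {"A": 0, "B": 4, "C": 2, "D": 7}
ORDER = "ABCD"


def spatial_distance(a, b):
    """Euclidean distance in stimulus space (dva)."""
    return math.dist(location_xy(SEQUENCE[a]), location_xy(SEQUENCE[b]))


def temporal_distance(a, b):
    """Number of steps separating two items in sequence space."""
    return abs(ORDER.index(a) - ORDER.index(b))


for pair in (("A", "B"), ("A", "D")):
    print(pair, round(spatial_distance(*pair), 2), temporal_distance(*pair))
```

Under these assumptions, A and B are separated by the full diameter (14 dva) but only one sequence step, whereas A and D are neighbors in space but three steps apart in the sequence, so a spatial and a temporal tuning model make opposite predictions for these two pairs.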

For each participant, individual localizer data were fitted by a spatial coactivation model, a temporal coactivation model, an SR model, and a no-coactivation (H0) control model (Figure 6D). The latter was included as a low-level baseline control. Visual inspection of the group averaged localizer coactivation pattern revealed a clear temporal tuning pattern in hippocampus but not in V1 (Figure 6E). These results were confirmed by a formal model comparison (Figure 6F, Hippocampus: two-sided t test, Temporal vs. Spatial t(34) = −2.36, p = 0.024; Temporal vs. SR t(34) = −3.24, p = 0.003; Temporal vs. H0 t(34) = −19.27, p = 7.12 × 10^-20; V1: H0 vs. Temporal t(34) = −22.09, p = 9.49 × 10^-22; H0 vs. Spatial t(34) = −14.26, p = 6.52 × 10^-16; H0 vs. SR t(34) = −17.15, p = 2.61 × 10^-18).

Discussion

Uncovering the computations that drive human prediction and planning is central to understanding human cognition. What are the general coding mechanisms that allow us to utilize knowledge of the environment to make inferences and generalizations about future events? In this study, we sought to answer the question of whether the map-like SR that has been posited for the hippocampus (Mehta et al., 2000; Stachenfeld et al., 2017) may also explain the shape of anticipatory activity in visual cortex (V1).

There is an extensive body of literature that shows how expectations elicit anticipatory activity in early visual cortices (de Lange et al., 2018; Hindy et al., 2016; Kok et al., 2012). For instance, we have previously shown that flashing an individual dot of a simple, linear sequence triggers an activity wave in V1 that resembles the full stimulus sequence (Ekman et al., 2017; Ekman et al., 2022), akin to replay of place field activity during spatial navigation (Foster and Wilson, 2006; Gupta et al., 2010). However, what remains unknown is whether these sensory replay traces are guided by a generative model that represents the relational structure of the stimulus sequence, akin to a predictive map. Alternatively, anticipatory activity traces could simply reflect the association between different stimuli, based on their CO, without the added complexity of any temporal relational structure. The latter explanation appears plausible, given that predictive representations in early visual cortex are generally time critical and operate in parallel to a constant stream of new sensory input, which arguably requires efficient processing and in turn limits the complexity of such representations.

In fact, we previously speculated that cue-triggered reactivation of simple sequences might be driven by an automatic pattern completion-like mechanism that reactivates all associated items based on partial input (Ekman et al., 2017). This idea is in line with the finding that predictive representations in V1 correlated with pattern completion-like activity in the hippocampus (Hindy et al., 2016; Kok and Turk-Browne, 2018) that might be driving V1 activity (Finnie et al., 2021; Ji and Wilson, 2007).

Our current findings directly challenge this interpretation and instead point to a predictive representation of expected, temporally discounted, future states. We accomplished this by using a paradigm in which one visual event (e.g., the presentation of one dot) was framed as one state in a directed transition matrix with a fixed relational structure. The SR hypothesis makes two testable predictions, namely that population activity represents future states over predecessor states, and that future state representations are temporally discounted, such that events in the close future are more prominently represented compared to events in the distant future. Occasionally presenting only single items of the full sequence allowed us to investigate V1 activity at omitted sequence locations.

Confirming the SR predictions, V1 activity at the successor locations was enhanced compared to activity at the predecessor locations, indicating a representation skewed toward future locations and away from the past. Notably, this relative difference was not only due to an enhancement of successor states, but our results also showed a decrease of activity at the predecessor states (compared to baseline). This suppression of predecessor states might seem surprising at first given that SR postulates the mere absence of predecessor activity (Momennejad, 2020; Stachenfeld et al., 2017). We speculate that the observed decrease at the predecessor states might constitute a functional separation mechanism between predecessor and successor states, strengthening the future-directed representation of the sequence by selectively decreasing representations of the unexpected predecessor states.

One aspect that sets our study apart is that the viewing of the visual sequence does not require any predictive planning of the participant to evaluate different future outcomes. In contrast, related studies reporting neuronal evidence for SR-like representations in hippocampus and PFC (Barron et al., 2020; Brunec and Momennejad, 2022) and occipital cortex (Schwartenbeck et al., 2021) have used paradigms in which participants were actively engaged in prospective planning and choice evaluation. Given the relatively passive nature of our task, one might therefore wonder whether it is expected to find any map-like activity at all. However, in this context, it is important to stress that the SR, unlike other model-based algorithms, does not depend on choice-dependent reward to build its transitional task structure (Momennejad et al., 2017; Stachenfeld et al., 2017) and therefore might not depend on participants’ active engagement. Furthermore, Russek et al., 2021 have recently used a paradigm in which subjects were passively exposed to transitions between visual states and reported evidence for SR-like representations in the absence of active choices in line with the results of the present study. Further supporting this notion, we have previously shown that anticipatory sequence activity occurred even after subjects’ attention was diverted from the sequence to a demanding task at fixation (Ekman et al., 2017), rendering the sequence task irrelevant. Taken together, these observations indicate that SR-like representations are not limited to situations that require active planning, or multiple-choice evaluations but may rather be formed automatically and incidentally, as has been shown repeatedly in the domain of statistical learning (Fiser and Aslin, 2002; Turk-Browne et al., 2005).

While we have interpreted the neural activity patterns in the light of the SR, it is strictly speaking not possible to distinguish between model-based (MB) and SR algorithms within the context of our design. The key distinction between them is that SR caches a predictive map of states that the agent expects to visit in the future, whereas MB algorithms store a full model of the world and compute trajectories at decision time (Gershman, 2018; Momennejad et al., 2017). Both algorithms nevertheless predict a temporally discounted activation of successor states. It should be noted, however, that MB comes at a higher computational cost, and is more intensive both in terms of time and working memory resources. The activation of successor states that we observed, on the other hand, occurred in the absence of a decision-making process (i.e., participants did not perform any task on the trials where a single dot was presented). Also, importantly, we previously observed that this activation pattern was not dependent on the task, and was equally present when attentional resources were strongly drawn away from the stimuli (Ekman et al., 2017). These observations may be more readily in line with the automatic (cached) activation of successor states that is embodied by SR, rather than the effortful iterative calculation of successor states that is the hallmark of MB. One future possibility to disentangle SR and MB algorithms could be to probe how well each model adapts to changes in the dot sequence structure. It has previously been shown that, compared to MB, the SR is less flexible in adapting to changes in the transitional structure, because such changes require the entire SR to be relearned (Momennejad et al., 2018).

The hippocampal formation can acquire arbitrary relationships between objects (Aronov et al., 2017; Backus et al., 2016; Behrens et al., 2018; Constantinescu et al., 2016; Garvert et al., 2017) beyond geometric location in space (O’Keefe and Nadel, 1978). While our main focus in the current study was on V1 representations, we also wanted to test to what extent hippocampus showed a similar SR-like representation of visual sequences. Previous fMRI studies investigating hippocampal representations have mainly focused on either navigation in a spatial (Brunec and Momennejad, 2022; Deuker et al., 2016) or non-spatial task (Garvert et al., 2017; Schapiro et al., 2013; Schuck and Niv, 2019) in which participants explore a relatively complex task space. It was recently shown that hippocampus has a rudimentary representation of visual space (Knapen, 2021; Silson et al., 2021), but it was not clear whether hippocampus would also engage in the representation of a comparably simple, low-level visual sequence presented in our paradigm.

Our results confirmed that hippocampal representations resemble an SR-like predictive map, favoring future over past sequence locations. This result highlights the compelling conceptual parallels between mnemonic expectations in hippocampus (Hindy et al., 2016) and their perceptual manifestation in sensory cortex. On a conceptual level, navigation (in memory and space) and processing visual events both involve abstraction of the relational structure between events to enable forward planning and predictions. Similar to navigational space, visual space can be represented in terms of its relational structure, such as direction and distance, and it has been suggested that similar mechanisms might underlie spatial and non-spatial representations (Nau et al., 2018a), especially if there is sequential structure present (Finnie et al., 2021). Supporting this notion, a conceptual link between representations for visual understanding and spatial navigation has recently been proposed that suggests a common underlying map-like representation of visual and navigational task structure (Schwartenbeck et al., 2021).

These conceptual links, as well as anatomical (Felleman and Van Essen, 1991; Huang et al., 2021) and functional connections between the hippocampus and visual cortex (Bosch et al., 2014; Hindy et al., 2016; Ji and Wilson, 2007; Kok and Turk-Browne, 2018; Lee et al., 2012; Nau et al., 2018a), raise the question of whether hippocampal representations are independent from V1, or whether V1 is instead receiving the predictions as a feedback signal from hippocampus. Supporting the idea of functional feedback, Finnie et al., 2021 recently showed that V1 predictions were heavily impaired after hippocampus damage. However, contrary to this notion, spatiotemporal sequence predictions have also been shown to occur locally within V1 without the need for top-down predictions (Gavornik and Bear, 2014; Xu et al., 2012). Our study showed no functional relationship between sequence prediction in V1 and hippocampus. However, our experimental paradigm was not primarily designed to address this question, as it does not exclude the possibility that an apparent coordination might be driven by other factors like attentional fluctuations across participants. Further, V1-hippocampus coordination might exist on a trial-by-trial level, which does not necessarily transfer to statistical comparisons across participants. Future experiments using more than one stimulus sequence could potentially address this question by comparing evidence of sequence-specific representations in both areas within participants.

It is notable that while hippocampal and visual representations appear similar with respect to their SR-like representation, they also show qualitative differences with respect to their underlying coding properties. V1 representations of individual sequence items resembled a coding based on spatial tuning. Hippocampus, on the other hand, represented relevant items predominantly in terms of their temporal distance within the sequence, suggesting that its representations capitalize on the transitional structure of the visual sequence. These results align with previous reports that hippocampus can learn to represent temporal sequence structure (Thavabalasingam et al., 2018; Thavabalasingam et al., 2019) and temporal proximity in a spatial navigation task (Deuker et al., 2016; Howard et al., 2014), but to the best of our knowledge, constitute the first report of coding of temporal distance in a visual sequence.

Furthermore, hippocampal predictive codes were found to persist after the sequence task, and coactivation of related sequence locations was still present during the stimulus localizer, potentially indicating that hippocampal representations reflect a more stable code operating on a longer timescale. V1 representations, on the other hand, did not persist throughout the stimulus localizer and reverted to representing individual spatial locations without coactivation of related sequence locations, highlighting another qualitative difference between V1 and hippocampus coding. According to the SR, it is expected that sequence predictions will change once the regularities of the environment change. The absence of an SR-like pattern in V1 during the functional localizer is therefore not at odds with our results from the main task, but rather indicative of a dynamic updating of the generative model. Taking these qualitative differences together, it is reasonable to speculate that predictive activity in V1 does not merely reflect top-down feedback from hippocampus, but instead that SR-like representations in V1 are somewhat independent of, and potentially complementary to, SR-like representations found in the hippocampus.

In conclusion, our data show that anticipatory activity in early visual cortex and hippocampus is guided by a generative model that represents the relational structure of the visual world, akin to a predictive map. Our results suggest that the observed SR-like representation underlying visual predictions can provide a sophisticated state space representation that enables flexible generalization from partial input to future sequence locations, while also being efficient enough to provide rapid visual computations.

Materials and methods

Preregistration

The experimental design, data analyses, and hypotheses were all preregistered at Open Science Framework (https://osf.io/f8dv9/) prior to data collection.

Participants

Thirty-seven right-handed subjects participated in the fMRI study. Two participants were excluded based on predetermined performance and motion criteria during scanning (error rate/relative motion three standard deviations above the group mean). The final sample included 35 subjects (20 females, mean age = 27 years). Target sample size was decided prior to data collection based on a power analysis (two-sided paired t test, power = 80%, Cohen’s d ≥ 0.5 and α = 0.05). Participants gave written informed consent in accordance with the institutional guidelines of the local ethical committee (CMO region Arnhem-Nijmegen, The Netherlands) and received monetary compensation for their participation. All participants had normal or corrected-to-normal visual acuity.

Stimuli

Participants viewed a sequence of four white dots on a black background. Dot locations were sampled from eight possible locations (Figure 1B). The center of each dot location was 7 degrees visual angle (dva) away from the central white fixation cross (0.5 dva) and the locations were equally spaced around the center (distance in polar angle from the vertical line: 22.5°, 67.5°, 112.5°, 157.5°, 202.5°, 247.5°, 292.5°, and 337.5°, see Figure 1B). The dots had a diameter of 1.2 dva. Stimulus sequences were shown on an MRI safe LCD screen (BOLDscreen 32, 1920 × 1080 pixel resolution, 60 Hz refresh rate). Participants were positioned 134 cm away from the screen and viewed the stimuli via a mirror on top of the head coil.

During full sequence trials, each dot was shown for 100 ms with an interstimulus interval (ISI) of 17 ms, resulting in a total sequence duration of 451 ms. For 52 out of 128 full sequence trials, the onset of the last dot was delayed with an ISI of 170 ms (instead of 17 ms). Participants were instructed to detect and report these delayed sequence presentations via a button press with their right index finger.
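
The timing arithmetic above can be checked with a short sketch. The durations are taken from the text; the function and variable names are illustrative and not part of the original stimulus code.

```python
# Timing values taken from the text; names are illustrative.
DOT_MS = 100          # duration of each dot
ISI_MS = 17           # regular interstimulus interval
DELAYED_ISI_MS = 170  # ISI before the last dot on delayed trials


def dot_onsets(delayed=False, n_dots=4):
    """Return onset times (ms) of each dot relative to sequence start."""
    onsets = [0]
    for i in range(1, n_dots):
        # Only the gap before the final dot is lengthened on delayed trials.
        isi = DELAYED_ISI_MS if (delayed and i == n_dots - 1) else ISI_MS
        onsets.append(onsets[-1] + DOT_MS + isi)
    return onsets


def sequence_duration(delayed=False):
    """Total duration from first dot onset to last dot offset (ms)."""
    return dot_onsets(delayed)[-1] + DOT_MS


print(dot_onsets())             # [0, 117, 234, 351]
print(sequence_duration())      # 451
print(sequence_duration(True))  # 604
```

Four dots of 100 ms separated by three 17 ms gaps give the stated 451 ms; a delayed trial, under the same reading, lasts 604 ms.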

Sequences were constructed such that each of the eight locations served as a starting location for one possible sequence. Further, each quadrant was stimulated once, which also excluded the possibility that neighboring dots were part of the same sequence. This constraint was chosen to minimize the potential spreading of activity from one location to neighboring sequence locations. Specifically, the second dot was always presented opposite the starting location (180° clockwise from the start). The third dot was shown 90° clockwise from the second location and the last dot was on the opposite side of the third location. These constraints also served to decouple spatial and temporal distance within the sequence. With these constraints, there were eight possible visual sequences that were randomly assigned and counterbalanced in frequency across subjects. Dots that were part of the sequence are labeled as sequence dots A-B-C-D, while the remaining four dots, at locations that were not part of the sequence, are referred to as ‘control dots’.
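
The construction rules can be sketched as follows. This is a minimal illustration, assuming that polar angle increases in the clockwise direction from the vertical line; the function names are hypothetical and not taken from the study code.

```python
# The eight polar angles (degrees from the vertical line) listed in the Stimuli section.
LOCATIONS = [22.5 + 45.0 * i for i in range(8)]


def build_sequence(start_deg):
    """Apply the rules: B opposite A, C 90 degrees clockwise from B, D opposite C."""
    a = start_deg % 360
    b = (a + 180) % 360
    c = (b + 90) % 360
    d = (c + 180) % 360
    return [a, b, c, d]


def quadrant(angle_deg):
    """Index (0-3) of the 90-degree quadrant containing the angle."""
    return int(angle_deg // 90)


def angular_distance(x, y):
    """Smallest angle (degrees) between two polar angles."""
    diff = abs(x - y) % 360
    return min(diff, 360 - diff)


sequences = [build_sequence(start) for start in LOCATIONS]
for seq in sequences:
    # The four locations not used by the sequence serve as control dots.
    controls = [loc for loc in LOCATIONS if loc not in seq]
    print(seq, "controls:", controls)
```

Under these rules every sequence visits each quadrant once, and any two sequence dots are at least 90° apart in polar angle, so immediate neighbors (45° apart) never belong to the same sequence.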

Note that because within each dot sequence, temporal order and spatial distance were not perfectly decorrelated (e.g., the second sequence dot was always farthest apart from the starting dot), it is not possible to estimate the combined influence of the SR model and the spatial coactivation model on the observed BOLD activity.

Experimental design

The experiment lasted a total of 2 hr and consisted of three blocks: (i) learning, (ii) main task, and (iii) a stimulus localizer. During the learning part, participants were familiarized with one of the eight sequences. The full sequence, consisting of four successively presented dots A-B-C-D, was shown 352 times outside and 160 times inside the scanner. In order to maintain participants’ attention during the learning part, there was a delay detection task on 50% of the trials. Participants were instructed to detect a timing delay of the last dot, for which they had 1 s to respond. After every 30 trials, participants were shown their aggregated detection accuracy. During the initial learning phase outside the scanner, participants received additional feedback after each trial on whether their response was correct or incorrect through changes in the color of the fixation cross (green for correct and red for incorrect answers). No trial-wise feedback was given inside the scanner. Participants were instructed to maintain fixation throughout the experiment and eye movements were measured with an Eyelink 1000 eye-tracker system (SR Research, Ontario, Canada; 1000 Hz sampling rate).

The main task consisted of three runs of equal duration (about 13 min). There were 192 trials per run and 576 trials in total. Trials were separated by a variable inter-trial interval (ITI) with a duration drawn from a truncated exponential distribution with a minimum of 2 s, maximum of 10.9 s, and mean of 3.72 s. The variable ITI ensured that the paradigm contained no temporal structure from which participants could learn to anticipate the onset of a trial. This allowed us to focus in the present study on the learning and representation of structural knowledge, independent of any temporal expectation effects.
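
One way to draw ITIs with these properties is sketched below. The text gives only the minimum, maximum, and mean; the parameterization (an exponential shifted to the minimum and truncated at the maximum, with the scale solved numerically so that the mean matches) is an assumption, as are all names.

```python
import math
import random

ITI_MIN, ITI_MAX, ITI_MEAN = 2.0, 10.9, 3.72  # seconds, from the text


def truncated_mean(scale, lo=ITI_MIN, hi=ITI_MAX):
    """Analytic mean of an exponential shifted to `lo` and truncated at `hi`."""
    width = hi - lo
    return lo + scale - width / math.expm1(width / scale)


def solve_scale(target=ITI_MEAN, lo=0.1, hi=50.0, tol=1e-10):
    """Bisection on the scale; the truncated mean grows monotonically with it."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_iti(scale, rng):
    """Inverse-CDF draw from the truncated exponential."""
    width = ITI_MAX - ITI_MIN
    u = rng.random()
    return ITI_MIN - scale * math.log(1.0 - u * (1.0 - math.exp(-width / scale)))


scale = solve_scale()
rng = random.Random(0)  # fixed seed for reproducibility of the sketch
itis = [sample_iti(scale, rng) for _ in range(20000)]
print(round(scale, 3), round(sum(itis) / len(itis), 2))
```

Because short intervals are the most likely but long ones remain possible, the elapsed time since the last trial carries little information about when the next one will start.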

To probe activity replay, we introduced partial sequence trials where only one of the four dots was shown for 100 ms, instead of the full sequence. Visually, there was no difference between the ITI and the part of the partial sequence trials where the dots were omitted; both showed a fixation cross at the center of the screen. During each run, two thirds of the 192 trials were full sequence trials (128 trials) and one third of the trials were partial sequence trials (64 trials). Trial order was pseudo-randomized with the constraint that partial sequence trials were always preceded and followed by a full sequence trial, excluding the possibility of partial sequence trial repetitions. The pseudo-randomization (perfect counterbalancing was numerically not possible with the set number of trial types and repetitions) rules out the possibility of systematic order effects. There was a task on ~40% of the full sequence trials (156/384 trials). At the end of each run, participants received feedback on their performance.
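
The ordering constraint can be satisfied in several ways. The sketch below shows one simple possibility, not necessarily the procedure used in the study: each partial trial is placed into a distinct gap between two consecutive full sequence trials.

```python
import random

N_FULL, N_PARTIAL = 128, 64  # trials per run, from the text


def make_trial_order(rng):
    """Insert each partial trial into a distinct gap between two full trials."""
    # Gap i lies between full trial i and full trial i + 1.
    gaps = set(rng.sample(range(N_FULL - 1), N_PARTIAL))
    order = []
    for i in range(N_FULL):
        order.append("full")
        if i in gaps:
            order.append("partial")
    return order


order = make_trial_order(random.Random(1))
print(len(order), order.count("full"), order.count("partial"))  # 192 128 64
```

With this scheme a run always starts and ends with a full sequence trial, and two partial trials can never occur back to back.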

After the main task, we ran a functional localizer (~16 min) where each dot was flashed at 2 Hz for 13.5 s in a pseudo-randomized order, followed by a 15 s rest period. In total, each dot location was presented eight times and each of the eight dots followed once immediately after the rest period. Participants performed a letter stream task at fixation where they had to detect target letters (‘X’ and ‘Z’) in a stream of non-target letters (‘A’, ‘T’, ‘N’, ‘U’, ‘V’, ‘Y’, ‘H’, and ‘R’). The target probability was 10%. Each letter was presented for 500 ms.

For a small subset of N = 7 participants, after the localizer, we additionally presented moving bar stimuli, in order to map the pRFs of voxels in early visual cortex. During these runs, bars containing full-contrast flickering checkerboards (2 Hz) moved across the screen in a circular aperture with a diameter of 20°. The bars moved in eight different directions (four cardinal and four diagonal directions) in 20 steps of 1°. Four blank fixation screens (10.8 s) were inserted after each of the cardinally moving bars. Throughout each run (5.76 min), a colored fixation dot was presented in the center of the screen, changing color (red to green and green to red) at random time points. Participants’ task was to press a button whenever this color change occurred. Participants performed four identical runs of this task.

MRI acquisition

Functional and anatomical MRI data were acquired on a 3 T PrismaFit scanner (Siemens AG, Healthcare Sector, Erlangen, Germany) using a 32-channel head coil. The protocol included a T1-weighted anatomical scan and five functional runs. The anatomical scan was acquired with a Magnetization Prepared Rapid Acquisition Gradient Echo sequence (MP-RAGE; TR = 2300 ms, TI = 1100 ms, TE = 3 ms, flip angle = 8°, 1 × 1 × 1 mm³ isotropic). To acquire the functional images, we used a T2*-weighted multiband 4 (Moeller et al., 2010) sequence (TR = 1500 ms, TE = 39 ms, flip angle = 75°, 2 × 2 × 2 mm³, 68 slices). The five functional runs comprised one learning run, three main task runs, and one localizer run. For two subjects only two main task runs were acquired because of time constraints. Seven participants participated in a previous study in which they completed four runs of pRF mapping.

fMRI preprocessing

MRI data were preprocessed using FSL (version 6.00; FMRIB Software Library) (Smith et al., 2004). We applied brain extraction using BET, motion correction using MCFLIRT, temporal high-pass filtering (100 s) and spatial smoothing (Gaussian kernel, FWHM = 5 mm). All analyses were carried out in native subject space. The first three volumes of each run were discarded to allow for signal stabilization. Registration of the functional images to the anatomical image was performed with FLIRT boundary-based registration. The anatomical image was registered to the MNI152 T1 2 mm standard space template (linear registration, 12 degrees of freedom).

ROI selection

V1 and hippocampus ROIs were determined using the automatic cortical parcellation provided by Freesurfer (Fischl, 2012) based on individual T1 images. Anatomical V1 and hippocampus masks were then transformed into native space using linear transformation. For V1, we used a preregistered voxel selection method to determine V1 subpopulations that are most responsive to individual stimulus locations.

First, the localizer data were fitted with a voxel-wise general linear model (GLM) using FSL FEAT (Smith et al., 2004) with the following regressors: 8 regressors of interest for stimulation of each of the locations (duration = 13.5 s), 1 regressor for the instructions and the end-of-block screen (duration = 4.5 and 15 s, respectively) as well as the 24 FSL motion regressors.

Second, for each location, we calculated the GLM contrast by comparing one location to all other locations and selected the 25 most selective voxels (highest z-values). Third, we removed voxels from the selection that were selective for multiple dot locations. Finally, we determined the lowest number of selective voxels per location and removed the least active voxels from all other locations until all V1 subpopulations had the exact same number of selected voxels per location. This procedure was chosen to rule out the possibility that potential activity differences across locations could be attributed to different numbers of voxels per location. Across subjects, we selected on average 22.05 voxels (SD = 2.88) per location.
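
The selection steps can be sketched with NumPy on synthetic data. This is an illustration under stated assumptions: "least active" is interpreted here as the lowest contrast z-value, and the function name and input layout are hypothetical.

```python
import numpy as np


def select_voxels(z_maps, n_top=25):
    """z_maps: array (n_locations, n_voxels) of 'one location vs. all others' z-values."""
    n_loc = z_maps.shape[0]
    # Step 2: indices of the n_top highest z-values per location.
    top = [set(np.argsort(z_maps[k])[::-1][:n_top].tolist()) for k in range(n_loc)]
    # Step 3: drop voxels that were picked for more than one location.
    counts = np.zeros(z_maps.shape[1], dtype=int)
    for s in top:
        counts[list(s)] += 1
    unique = [[v for v in s if counts[v] == 1] for s in top]
    # Step 4: equalize by keeping the n_keep strongest voxels per location
    # (assumption: "least active" means lowest z-value).
    n_keep = min(len(u) for u in unique)
    return [sorted(u, key=lambda v: z_maps[k, v], reverse=True)[:n_keep]
            for k, u in enumerate(unique)]


rng = np.random.default_rng(0)
z_maps = rng.normal(size=(8, 2000))  # synthetic contrast maps, not real data
selected = select_voxels(z_maps)
print([len(s) for s in selected])
```

Equalizing the voxel count means that the final number per location can fall below 25, consistent with the reported average of about 22 voxels.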

V1 BOLD amplitude modulation

A GLM for the main task was created with the following regressors: 8 regressors for each single dot trial (4 sequence dots and 4 control dots), 1 regressor for the full sequence trial, 1 regressor of no interest to model the instructions and the feedback at the end of a run and 24 motion regressors (6 standard and 18 extended FSL motion parameters, i.e., the derivatives of the standard motion parameters, the squares of standard motion parameters, and the squares of the derivatives). Note that the control dot trials in the main task were modeled in the GLM, but treated as regressors of no interest. The model was convolved with a single gamma hemodynamic response function. Nine contrasts were set up that tested which voxels were more responsive to presentation of a single dot (eight contrasts, one for each dot) or the full sequence (one contrast) compared to baseline. The GLM was fit to each run separately and resulting beta estimates were averaged across runs for each participant. In order to obtain an estimate of stimulus-specific activity (Figure 2B), we averaged the activity at the four control ROIs and subtracted it from the activity at the sequence ROIs.
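
The composition of the 24 motion regressors (6 standard parameters, their derivatives, and the squares of both) can be illustrated as follows. The sketch assumes a backward difference for the temporal derivative; the exact computation in FSL may differ, and the function name is hypothetical.

```python
import numpy as np


def extended_motion_regressors(motion):
    """motion: (n_volumes, 6) realignment parameters -> (n_volumes, 24) design columns."""
    # Temporal derivative via backward difference; the first row is set to zero.
    deriv = np.vstack([np.zeros((1, motion.shape[1])), np.diff(motion, axis=0)])
    # Column order: parameters, derivatives, squared parameters, squared derivatives.
    return np.hstack([motion, deriv, motion ** 2, deriv ** 2])


rng = np.random.default_rng(0)
motion = rng.normal(scale=0.1, size=(100, 6))  # synthetic realignment parameters
confounds = extended_motion_regressors(motion)
print(confounds.shape)  # (100, 24)
```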

Correlation with behavior

In order to relate SR representations to behavior, we first calculated individual V1 BOLD differences for all successor versus all predecessor locations to get an estimate of how much each participant’s predictions were skewed toward future locations. We then correlated these values with behavioral accuracy across subjects using Spearman correlation.

V1 model comparison

For each participant, V1 BOLD activity from the partial trials was fitted with three models: SR, coactivation (CO), and a null model (H0). The resulting root mean square error (RMSE; lower values = better fit) between model fit and observed data was then compared across participants using paired-sample t tests, to test whether one model describes the underlying data better than the competing models.

The model prediction of the SR is based on the task structure, formalized in a transition matrix T of the sequence A-B-C-D (Figure 4). The SR matrix M is then calculated as:

$$M = (I - \gamma T)^{-1}$$

where I is the identity matrix and γ ∈ [0, 1] is the discount factor, or predictive horizon. During model fitting, γ was a free parameter: instead of using a fixed value, individual γ values were determined for each participant. Larger values of γ produce a slower exponential decay across future states, that is, a longer predictive horizon. The model prediction of the CO model is based on the CO of events. In contrast to the SR, the task structure is non-directed, and off-diagonal values in the CO model are constant and modulated in amplitude by a free multiplicative parameter ω. The H0 (null) model serves as a baseline that assumes no off-diagonal (predictive) activity. To be interpretable, any winning model should outperform the H0 model. The diagonal values in all three models reflect the bottom-up stimulation induced by the single dot of the partial trials.
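
The three model predictions and their comparison by RMSE can be sketched as follows. Several details are assumptions made for illustration: the transition matrix is taken to be a deterministic chain A → B → C → D without a transition out of D (Figure 4 defines the matrix actually used), the free parameters are fitted by a simple grid search, the model diagonals are fixed at 1, and the "observed" matrix is noiseless synthetic data rather than measured BOLD activity.

```python
import numpy as np

N_STATES = 4  # sequence items A, B, C, D

# Assumed transition matrix: chain A -> B -> C -> D, no transition out of D.
T = np.zeros((N_STATES, N_STATES))
for i in range(N_STATES - 1):
    T[i, i + 1] = 1.0


def successor_representation(T, gamma):
    """M = (I - gamma * T)^(-1)."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)


def co_model(n, omega):
    """Undirected alternative: unit diagonal, constant off-diagonal value omega."""
    return np.full((n, n), omega) + (1.0 - omega) * np.eye(n)


def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


def fit_by_grid(model_fn, obs, grid):
    """Return (best parameter, RMSE) from a simple grid search."""
    errors = [rmse(model_fn(p), obs) for p in grid]
    best = int(np.argmin(errors))
    return float(grid[best]), errors[best]


observed = successor_representation(T, 0.6)  # synthetic stand-in for data
grid = np.linspace(0.0, 1.0, 101)
gamma_hat, sr_err = fit_by_grid(lambda g: successor_representation(T, g), observed, grid)
omega_hat, co_err = fit_by_grid(lambda w: co_model(N_STATES, w), observed, grid)
h0_err = rmse(np.eye(N_STATES), observed)  # null model: no off-diagonal activity
print(gamma_hat, sr_err, co_err, h0_err)
```

Under the chain assumption, the row of M for item B is (0, 1, γ, γ²): no activity at the predecessor A, full activity at the presented item, and geometrically discounted activity at the successors C and D. The CO model, being symmetric, cannot reproduce this asymmetry.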

Hippocampal decoding

The decoding analysis was performed with scikit-learn (Pedregosa et al., 2011). Individual voxel time courses were low-pass filtered using a Savitzky-Golay filter with a window length of 5 TRs and polynomial order of 3 (Savitzky and Golay, 1964) and normalized to z-scores. Volumes for individual localizer trials were averaged between 3 and 13.5 s to capture only stimulus-related BOLD activity. A logistic regression classifier (default values, L2 regularization; C = 1) was trained to distinguish between eight stimulus locations during the independent localizer run. Before applying the trained classifier to the main task, we confirmed that the classifier was indeed able to distinguish between stimulus locations within the localizer. To this end, we performed a leave-one-out cross-validation and tested the decoding accuracy against chance level (1/8 = 12.5 %) across subjects using a one-sample t test. In addition to a binary classifier output for each class, we also looked at the probabilistic output. For each sample in the localizer test set, we obtained eight probability values, one for each class. We refer to the classifier probability as classifier evidence, as the probability reflects the evidence that a particular class is represented. For each participant, probability values were averaged across trials to obtain location-specific response profiles.
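
The localizer decoding steps can be sketched with scikit-learn and SciPy on synthetic data. Everything below is illustrative: the array sizes, the simulated voxel patterns, and `max_iter=1000` are assumptions, and the filtering step is shown on a random time course that is not linked to the simulated trial patterns.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n_locations, n_reps, n_voxels = 8, 8, 40

# Filtering and z-scoring of voxel time courses (time x voxels), as described.
timecourse = rng.normal(size=(200, n_voxels))
smoothed = savgol_filter(timecourse, window_length=5, polyorder=3, axis=0)
zscored = (smoothed - smoothed.mean(axis=0)) / smoothed.std(axis=0)

# Synthetic trial patterns standing in for the 3-13.5 s localizer averages.
labels = np.repeat(np.arange(n_locations), n_reps)
templates = rng.normal(size=(n_locations, n_voxels))
patterns = 2.0 * templates[labels] + rng.normal(size=(labels.size, n_voxels))

# L2 regularization is the scikit-learn default; max_iter is raised as a precaution.
clf = LogisticRegression(C=1.0, max_iter=1000)
evidence = cross_val_predict(clf, patterns, labels, cv=LeaveOneOut(),
                             method="predict_proba")
accuracy = float(np.mean(evidence.argmax(axis=1) == labels))

# Location-specific response profiles: mean evidence per presented location.
profiles = np.vstack([evidence[labels == k].mean(axis=0) for k in range(n_locations)])
print(accuracy, profiles.shape)
```

Each row of `evidence` holds eight probabilities that sum to one, which is why the values can be read as graded evidence for each location rather than as a single winning label.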

Next, we trained the classifier on all localizer trials and applied it to individual trials of the main task. Volumes for individual main task trials were averaged between 3 and 6 s to capture only stimulus-related BOLD activity. Note that the main task was an event-related design with shorter trial durations compared to the block-design localizer with 13.5 s stimulation periods; hence, the different averaging windows of 3–13.5 and 3–6 s. Similar to the BOLD analysis in V1, for each partial sequence trial in the main task, we averaged the classifier evidence for the four control locations and subtracted it from the evidence of the sequence locations. We then averaged the classifier evidence for all predecessor and successor locations, respectively, and compared the evidence across subjects with a paired-sample t test.
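
The control subtraction and the predecessor versus successor comparison can be written compactly. The sketch below uses made-up numbers, not data from the study, and hypothetical function names; the paired t statistic is computed by hand to keep the example dependency-light.

```python
import numpy as np


def pred_succ_evidence(evidence, seq_locs, ctrl_locs, shown_pos):
    """evidence: classifier probabilities for the 8 locations on one partial trial.
    seq_locs: location indices of A, B, C, D in order; shown_pos: position shown."""
    # Subtract the mean evidence at the four control locations.
    corrected = evidence[seq_locs] - evidence[ctrl_locs].mean()
    pred = corrected[:shown_pos]      # items earlier in the sequence
    succ = corrected[shown_pos + 1:]  # items later in the sequence
    return (float(pred.mean()) if pred.size else float("nan"),
            float(succ.mean()) if succ.size else float("nan"))


def paired_t(a, b):
    """Paired-sample t statistic for two equally long arrays."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))


# Made-up evidence for a trial in which item B (position 1) was shown.
evidence = np.array([0.05, 0.40, 0.20, 0.15, 0.05, 0.05, 0.05, 0.05])
pred, succ = pred_succ_evidence(evidence, [0, 1, 2, 3], [4, 5, 6, 7], shown_pos=1)
print(pred, succ)
```

Trials showing the first or last item have no predecessor or successor, respectively; the sketch returns NaN in those cases so that they can be excluded when averaging.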

Finally, in order to rule out that the chosen time window had any influence on the results, we repeated the decoding analysis in a time-resolved manner, repeating the steps above for each volume from 0 to 13.5 s separately. Fitting a standard hemodynamic response function (hrf) revealed a transient decoding evidence peak at around 4.7 s.

Hippocampus and V1 tuning

The tuning analysis investigates coactivation patterns during the localizer and focuses on the four locations that were part of the stimulus sequence in the preceding main task. Classifier evidence values within the localizer were averaged and sorted to reveal potential coactivation (tuning) patterns of sequence locations. Three tuning patterns were considered and tested: (i) temporal tuning, assuming a linear decay from the currently presented stimulus toward locations that were farther in the past and future (two free parameters, slope and intercept), (ii) spatial tuning, assuming a linear decay from the current stimulus toward other stimulus locations modulated by spatial distance (two free parameters, slope and intercept), and (iii) a baseline no-coactivation pattern. Note that the latter model was considered because V1 tuning curves were rather sharp with little activity spread to immediate neighboring locations (5.4° apart; Figure 2A) and locations in the current analysis were 9.9° apart. For each participant, aggregated classifier evidence was fitted using the three tuning patterns and the resulting errors were compared across subjects to determine the best-fitting pattern. Fitting was performed using the curve_fit function in SciPy 1.6.2 (Virtanen et al., 2020).
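
The comparison of tuning patterns can be sketched with `scipy.optimize.curve_fit`. The evidence values are invented for illustration, the example sequence is one of the eight possible sequences, and the form of the no-coactivation baseline (evidence only at the presented location) is an assumption. Spatial distance is computed as the chord length between dots at 7 dva eccentricity, which gives roughly 9.9 dva for dots 90° apart in polar angle.

```python
import numpy as np
from scipy.optimize import curve_fit

ECCENTRICITY = 7.0  # dva, from the Stimuli section
seq_angles = np.array([22.5, 202.5, 292.5, 112.5])  # example sequence A, B, C, D


def linear_decay(distance, slope, intercept):
    return intercept + slope * distance


def fit_rmse(distance, evidence):
    """Fit the two-parameter linear decay and return the RMSE of the fit."""
    params, _ = curve_fit(linear_decay, distance, evidence, p0=[-0.1, 0.5])
    return float(np.sqrt(np.mean((evidence - linear_decay(distance, *params)) ** 2)))


shown = 0  # item A is presented
temporal_dist = np.abs(np.arange(4) - shown).astype(float)  # steps in sequence space
delta = np.radians(seq_angles - seq_angles[shown])
spatial_dist = 2.0 * ECCENTRICITY * np.abs(np.sin(delta / 2.0))  # chord length in dva

evidence = np.array([0.56, 0.39, 0.26, 0.10])  # made-up values, not study data
temporal_err = fit_rmse(temporal_dist, evidence)
spatial_err = fit_rmse(spatial_dist, evidence)

# Assumed baseline: evidence at the presented location only, none elsewhere.
baseline = np.where(np.arange(4) == shown, evidence[shown], 0.0)
baseline_err = float(np.sqrt(np.mean((evidence - baseline) ** 2)))
print(temporal_err, spatial_err, baseline_err)
```

Because the second sequence item is always the spatially most distant one, temporal and spatial distance order the locations differently, which is what allows the two tuning patterns to be told apart.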

pRF estimation and reconstruction

pRF data were available for seven participants from a previous study (Ekman et al., 2022) and were used to validate visually that the voxel selection based on the functional localizer selected voxels that correspond to the stimulated location in visual space (Figure 2). Data from the moving bar runs were used to estimate the pRF of each voxel in the functional volumes using MrVista (http://white.stanford.edu/software). In this analysis, a predicted BOLD signal is calculated from the known stimulus parameters and a model of the underlying neuronal population. The model of the neuronal population consisted of a two-dimensional Gaussian pRF, with parameters x0, y0, and σ0, where x0 and y0 are the coordinates of the center of the receptive field, and σ0 indicates its spread (standard deviation), or size. All parameters were stimulus-referred, and their units were degrees of visual angle. These parameters were adjusted to obtain the best possible fit of the predicted to the actual BOLD signal. This method has been shown to produce pRF size estimates that agree well with electrophysiological receptive field measurements in monkey and human visual cortex (Klink et al., 2021). For details of this procedure, see Dumoulin and Wandell, 2008; Kay et al., 2015. Once estimated, x0 and y0 were converted to eccentricity and polar-angle measures and co-registered with the functional images using linear transformation.

For the pRF-based stimulus reconstruction, for each participant, we first limited the pRF data to voxels that were selected based on the functional localizer (25 voxels per stimulus location, 200 voxels in total). This selection step allowed us to visually inspect whether the voxel selection accurately selected voxels corresponding to the respective stimulus location. Second, every voxel was described as a 2D Gaussian with parameters x0, y0, and σ0 from the pRF estimation. The 2D Gaussians for each voxel, represented by a pixel × pixel image, were scaled based on the percent signal change obtained from the functional localizer GLM, and subsequently summed over voxels to create one 2D representation of the reconstructed stimulus. This procedure was repeated separately for all eight stimulus locations. Finally, for visualization purposes, the eight individual localizer conditions were rotated to one stimulus location (22.5°) and averaged across stimulus locations and participants.
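
The reconstruction step amounts to a weighted sum of 2D Gaussians. The sketch below uses hypothetical pRF parameters and weights, an arbitrary grid size, and a polar angle measured from the horizontal axis in `to_polar` (the Stimuli section measures it from the vertical line); none of these choices are taken from the study code.

```python
import numpy as np


def to_polar(x0, y0):
    """Convert a pRF centre (dva) to eccentricity and polar angle (degrees from horizontal)."""
    return np.hypot(x0, y0), np.degrees(np.arctan2(y0, x0))


def reconstruct(x0, y0, sigma, weights, extent=10.0, n_pix=101):
    """Sum weighted 2D Gaussians (one per voxel) on a grid spanning +/- extent dva."""
    axis = np.linspace(-extent, extent, n_pix)
    gx, gy = np.meshgrid(axis, axis)  # gx varies along columns, gy along rows
    image = np.zeros_like(gx)
    for cx, cy, s, w in zip(x0, y0, sigma, weights):
        image += w * np.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2.0 * s ** 2))
    return axis, image


# Five hypothetical voxels centred on the dot at 22.5 degrees from vertical, 7 dva out.
angle = np.radians(22.5)
cx, cy = 7.0 * np.sin(angle), 7.0 * np.cos(angle)
x0, y0 = np.full(5, cx), np.full(5, cy)
sigma = np.linspace(0.8, 1.6, 5)    # hypothetical pRF sizes (dva)
weights = np.linspace(0.5, 1.5, 5)  # hypothetical percent signal change
axis, image = reconstruct(x0, y0, sigma, weights)
row, col = np.unravel_index(np.argmax(image), image.shape)
print(axis[col], axis[row], to_polar(cx, cy))
```

If the voxel selection works as intended, the peak of the reconstructed image should fall at the stimulated dot location, which is the visual check described above.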

Data availability

All data and code used for stimulus presentation and analysis are available on the Donders Repository (https://doi.org/10.34973/bsy6-9h29).

The following data sets were generated:
  1. Ekman M, Kusch S, de Lange FP (2023) Successor-like representation guides the prediction of future events in human visual cortex and hippocampus. Donders Repository. https://doi.org/10.34973/bsy6-9h29

References

  1. O’Keefe J, Nadel L (1978) The Hippocampus as a Cognitive Map. Clarendon Press.
  2. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Decision letter

  1. Morgan Barense
    Reviewing Editor; University of Toronto, Canada
  2. Chris I Baker
    Senior Editor; National Institute of Mental Health, National Institutes of Health, United States
  3. Helen Barron
    Reviewer

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Successor-like representation guides the prediction of future events in human visual cortex and hippocampus" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Chris Baker as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Helen Barron (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The reviewers detail their essential revisions below. Our discussion converged on three key points:

1) We ask that the authors keep all model comparisons consistent across regions and tasks.

2) Additional analyses appear necessary to clarify the relationship between the hippocampus and V1.

3) In the revision it will be important to compare the successor representation model proposed here to other predictive sequence models.

Reviewer #1 (Recommendations for the authors):

1) If SR is the best name for the discussed model, it should be clarified why this is the case and, importantly, any difference with the SR as defined in the RL literature should be discussed. Otherwise, another term might be more appropriate.

2) It would be interesting to discuss in the discussion the distinction between the SR model and more complex models that might fit human behaviors and representations just as good or better. For example, with the current design, the SR model can't be disentangled from a more complex model in which all one-step transitions are stored and perhaps in which predictions are iteratively updated based on additional evidence (appearing items). A design in which each state is associated with multiple possible states (with different probabilities) might allow disentangling such additional possibilities.

3) There should be an additional analysis to investigate the relationship between the hippocampus and V1. I understand the limitations of fMRI and of the current experimental design, but there are still possible analyses, even if they are indirect and the results non-definitive (for example, a correlation of the hippocampal and V1 effects across individuals, as in Hindy et al., 2016, Nat Neurosci).

4) The goal of the tuning analysis and the interpretation of its result should be clarified.

5) It should be clarified whether the screen during the ITI is the same as during the omitted items of the partial sequence trials. If this is the case, the potential implications should be discussed.

6) It is unclear from the methods how the tuning analysis was performed exactly. It is a bit circular to define voxels sensitive to a given dot location based on the localizer data and then evaluate on that same data which dot representations were activated on a given trial. Was there some form of cross-validation performed? I could not find it in the code. Even if this was done correctly without double dipping, it seems strange conceptually to use the localizer data for both the fitting and testing purposes here because implicitly, the authors would both assume that the localizer data is independent of the learned associations (to determine the voxels sensitive to a given dot) and dependent on it (to assess temporal tuning). Relatedly, this somewhat applies to the other analyses too: since the localizer was performed after the main task, could it be that the authors did not select the right set, or the complete set, of voxels that are normally sensitive to a given dot location?

7) There seems to be a trend toward the last dot leading to a greater BOLD activity (Figure 3a). I'm wondering if this is because of the task, which is specific to the last dot. I don't think this explains the successor vs predecessor effect though, as you show in Figure 3c. However, this could explain the result of the current only statistical test performed in the "Anticipated stimulus sequences in V1" section. To formally exclude this possibility, the authors should test the difference in the activation of a given dot (B or C) when it is a successor vs when it is a predecessor.

8) The second important prediction of the SR model, in addition to the greater activation for successors than for predecessors, is the decreasing trend in activation for further successors. Although it is visible in the figures, it would be nice if this trend was also statistically tested and reported in the "Anticipated stimulus sequences in V1" section.

9) I don't find the time-resolved hippocampus analysis very convincing: couldn't this transient temporal profile be in response to the start of the trial rather than the missing dot (but see recommendation 5)? It would be best to perform the same analyses suggested above (recommendations 7 and 8) to really test whether the hippocampus exhibits the properties of the SR.

10) Continuing from above, concerning the time-resolved decoding: since trials are very short and ITI are jittered, it seems to me that the activity from previous trials could affect the results. Performing the decoding analysis on regression coefficients from a single-trial GLM analysis would help avoid this confound.

11) Could you show a similar figure as Figure 3c but in Figure 5 for the hippocampus? It would be helpful to see the activation related to each dot location (including the shown dot).

12) Background about predictions and predictive effects in V1 should be added to the introduction, this is currently lacking.

13) There is no mention of corrections for multiple comparisons in the paper. For example, are the tests for the significance of each item in Figure 3b corrected? This should be indicated at all relevant places in the manuscript and figure legends, along with whether the tests are one-tailed or two-tailed.

14) Concerning the model fitting analysis, I'm unsure whether the H0 model can be compared to the other two models using RMSE, since it seems to have fewer parameters. A criterion like BIC or AIC should be used in this case.

Reviewer #2 (Recommendations for the authors):

I had two thoughts, but I leave it to the authors to decide how to address these.

1. While I agree with the authors that this is the first evidence for SR in visual sequences (to the best of my knowledge), there is another set of studies that comes to mind looking at hippocampal contributions to sequence and duration coding of perceptual sequences, which the authors may wish to discuss:

Thavabalasingam, S., O'Neil, E. B., Tay, J., Nestor, A., and Lee, A. C. (2019). Evidence for the incorporation of temporal duration information in human hippocampal long-term memory sequence representations. Proceedings of the National Academy of Sciences, 116(13), 6407-6414.

Thavabalasingam, S., O'Neil, E. B., and Lee, A. C. (2018). Multivoxel pattern similarity suggests the integration of temporal duration in hippocampal event sequence representations. NeuroImage, 178, 136-146.

2. In the model fitting procedure, what exactly does it mean that the discount parameter γ was a free parameter (p. 18)? It would be helpful to provide a bit more clarity on this, but it's also potentially theoretically interesting in light of evidence that different neural structures represent information in line with different values of γ.

Reviewer #3 (Recommendations for the authors):

1. SR versus other predictive sequence models: It remains unclear to me whether the predictive activity observed in V1 is best explained by an SR model or by other models that capture predictive sequences (of which there are many). To assess whether the data is best explained by an SR model, it seems necessary to check whether two adjacent states that predict divergent future states have dissimilar representations, while two states that predict similar future states have similar representations. The data presented here is unfortunately not designed to test this comparison. Can the authors nevertheless distinguish between an SR model (e.g. Figure 4A) and a 'flat prediction' model where each stimulus predicts all possible successor states equally without any temporal discounting (i.e. A predicts B, C, and D with equal probability; B predicts C and D with equal probability but does not predict A; etc..)? It seems important to report this comparison and discuss how it may be difficult to distinguish between an SR model and a 'flat prediction' using the BOLD signal.

2. Related to point 1, it remains unclear to me why the authors consider this data to reflect an SR model, while in their previous data they characterise predictive sequences as reflecting preplay. Can the authors provide a clearer explanation for why this data is best described as an SR model rather than preplay, while Ekman et al., 2017 reflect preplay? Or do the authors consider these codes to be equivalent?

3. It is not clear to me how the ROIs are being used in Figure 3 and 4? If V1 activity reflects an SR, within a given ROI it should be possible to see evidence for backward skew in the representation of each location (consistent with Mehta et al., 2000), while at the population level there is a forward skew?

4. The authors seem to apply different models to data from different brain regions and to data from the task and localiser data. Why? For consistency and clarity would it be possible for the authors to apply the same set of models throughout, to both V1 and hippocampus, and to both task and localiser data? i.e. SR model, 'flat prediction' model, CO model, H0 model, spatial model, temporal model.

5. Related to point 4, in Figure 6 it seems that V1 data from the localiser scan does not support an SR model? This suggests that the task itself is driving the predictive sequence activity in Figures 3-4? This important difference in evidence for an SR-like code during the task and localiser scan should be emphasised and discussed.

6. How specific are these findings to V1 and hippocampus? If the authors use a searchlight analysis to look for multivariate patterns consistent with an SR model, do they not find that many brain regions show evidence for an SR representation?

7. In general, several of the reported analyses are not clearly explained. For example, how do the authors generate the reconstruction maps in Figure 2? Why was pRF mapping only performed in 7 subjects? Why were the data from the pRF maps not used to generate ROIs?

8. Statistics:

a) Can the authors clarify how they corrected for multiple comparisons when performing model comparisons?

b) The authors say they performed a one-sided t-test using data from Figure 5b. Can they clarify what they did here?

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Successor-like representation guides the prediction of future events in human visual cortex and hippocampus" for further consideration by eLife. Your revised article has been evaluated by Chris Baker (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there is one remaining issue that needs to be addressed, as outlined by Reviewer #1. Specifically, we thought it would be helpful to provide a bit more detail on the differences between the predictions of an SR versus model-based algorithm:

Reviewer #1 (Recommendations for the authors):

The authors have considerably revised their paper and they have addressed most of my comments satisfactorily. However, I remain uncertain about point 1.1.

I understand that there are no rewards in your task and that the SR algorithm can apply in the absence of rewards. I am not sure however that a model-based (MB) algorithm would make different predictions than SR in the context of your experiment. Indeed, it can be difficult to distinguish SR and MB in many contexts, especially if there is no reevaluation of the transition matrix during the experiment (Momennejad et al., 2017, Nat Hum Behav). Could the authors perhaps test what the predictions of a MB algorithm would be in their experiment (see, e.g., the equation reported in the Methods of the Momennejad paper), or otherwise explain why this would be irrelevant?

Reviewer #2 (Recommendations for the authors):

The authors have done a thorough job of addressing my comments. I don't have any further suggestions.

https://doi.org/10.7554/eLife.78904.sa1

Author response

Essential revisions:

Reviewer #1 (Recommendations for the authors):

(1.1) If SR is the best name for the discussed model, it should be clarified why this is the case and, importantly, any difference with the SR as defined in the RL literature should be discussed. Otherwise, another term might be more appropriate.

We thank the reviewer for giving us the opportunity to clarify this aspect. The reviewer states that “the SR has previously been used only in a RL context, where there are rewards associated with specific states and where predictions are task-relevant.” We believe that this might reflect a misunderstanding of the SR model and have revised our manuscript to point out more clearly that the SR, in contrast to model-free RL algorithms, is in fact not reward dependent.

The SR learning algorithm is based on temporal-difference-learning, but instead of learning future rewards it learns discounted expected future state occupancies. This enables learning in an environment without reward and results in a representation that encodes each state in relation to its successor states, i.e. those states that are expected to be visited in the future. This aspect is now clarified in the Introduction and Discussion of our manuscript.
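The learning rule described above can be illustrated with a minimal tabular sketch for a deterministic four-item sequence (A-B-C-D). This is a generic textbook-style temporal-difference update for the SR, not the model-fitting code used in the study; the function name, the learning rate `alpha`, the discount `gamma`, and the number of episodes are assumptions chosen for illustration.

```python
import numpy as np

def learn_sr(sequence, n_states, gamma=0.5, alpha=0.1, n_episodes=500):
    """Learn a successor matrix M with temporal-difference updates.

    M[s, s2] estimates the discounted expected future occupancy of state s2
    when starting in state s. No reward signal is involved.
    """
    M = np.zeros((n_states, n_states))
    identity = np.eye(n_states)
    for _ in range(n_episodes):
        for s, s_next in zip(sequence[:-1], sequence[1:]):
            # TD error on state occupancy instead of reward
            td_error = identity[s] + gamma * M[s_next] - M[s]
            M[s] += alpha * td_error
        # The last item has no successor within an episode
        last = sequence[-1]
        M[last] += alpha * (identity[last] - M[last])
    return M

# States 0..3 stand for the sequence items A, B, C, D
M = learn_sr([0, 1, 2, 3], n_states=4, gamma=0.5)
```

After learning, each row of `M` has non-zero entries only for the state itself and its successors, with values that decay with temporal distance (for example, the row for B approaches 0, 1, 0.5, 0.25). These are the two properties tested in the data: directionality and temporal discounting.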

Introduction:

“In the context of hippocampal representations, the successor representation (SR) has been recently proposed (Dayan, 1993; Stachenfeld et al., 2017) to combine the trade-off between both flexible and efficient model properties. The SR postulates a predictive representation in which the current state is represented in terms of its future (successor) states, in a temporally discounted fashion. The SR is dependent on the actual experience, with states experienced more frequently being represented more strongly. This enables learning in an environment without explicit reward (Gläscher et al., 2010).”

Our discussion now also includes the following paragraph highlighting the passive nature of our task and the absence of any reward:

Discussion:

“One aspect that sets our study apart is that the viewing of the visual sequence does not require any predictive planning of the participant to evaluate different future outcomes. In contrast, related studies reporting neuronal evidence for SR-like representations in hippocampus and PFC (Barron et al., 2020; Brunec and Momennejad, 2022) and occipital cortex (Schwartenbeck et al., 2021) have used paradigms in which participants were actively engaged in prospective planning and choice evaluation. Given the relatively passive nature of our task, one might therefore wonder whether it is expected to find any map-like activity at all. However, in this context it is important to stress that the SR, unlike other model-based algorithms, does not depend on choice-dependent reward to build its transitional task structure (Momennejad et al., 2017; Stachenfeld et al., 2017) and therefore might not depend on participants’ active engagement. Furthermore, Russek et al. (2021) have recently used a paradigm in which subjects were passively exposed to transitions between visual states and reported evidence for SR-like representations in the absence of active choices in line with the results of the present study. Further supporting this notion, we have previously shown that anticipatory sequence activity occurred even after subjects’ attention was diverted from the sequence to a demanding task at fixation (Ekman et al., 2017), rendering the sequence task irrelevant. Taken together, these observations indicate that SR-like representations are not limited to situations that require active planning, or multiple-choice evaluations but may rather be formed automatically and incidentally, as has been shown repeatedly in the domain of statistical learning (Fiser and Aslin, 2002; Turk-Browne et al., 2005).”

(1.2) It would be interesting to discuss in the discussion the distinction between the SR model and more complex models that might fit human behaviors and representations just as good or better. For example, with the current design, the SR model can't be disentangled from a more complex model in which all one-step transitions are stored and perhaps in which predictions are iteratively updated based on additional evidence (appearing items). A design in which each state is associated with multiple possible states (with different probabilities) might allow disentangling such additional possibilities.

We agree with the reviewer that a different experimental design would be required to dissociate the SR from other, more complex models. While there is a series of model-based algorithms that indeed store the entire transitional structure (and reward information) that can be iteratively updated, we are not aware of any such model that would align with the predictions of the SR (temporal discounting and directionality).

We have revised the manuscript Discussion to include this discussion point:

“For future designs it would be interesting to include visual sequences where individual states have multiple possible successor states with different probabilities associated with them. Such a design would further allow the SR to be dissociated from alternative models that simply store all one-step transitions and their respective probabilities.”

(1.3) The second important prediction of the SR model, in addition to the greater activation for successors than for predecessors, is the decreasing trend in activation for further successors. Although it is visible in the figures, it would be nice if this trend was also statistically tested and reported in the "Anticipated stimulus sequences in V1" section.

We added a statistical test to quantify the activity decay visible in Figure 3 and revised the manuscript as follows:

“The activity decay toward distant future locations was formally tested by fitting an exponentially decaying factor γ ∈ [0,1] to each participant’s data. Here, values closer to 0 indicate a steeper decay and values closer to 1 indicate no decay. In line with our predictions, we found a group-averaged decay factor of γ = 0.14 (± 0.03 s.e.m.) that was statistically significantly different from 1 (non-parametric t-test, t(34) = -17.17, p = 2.54 × 10⁻¹⁸).”
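One way such a decay fit could be set up is sketched below on simulated data. The exact model form and test are not specified in the text above, so several choices here are assumptions: activity at successor step d is modeled as a · γ^d, γ is found by a bounded grid search with a closed-form amplitude, and the comparison against 1 uses a sign-flip permutation test. All function names and toy values are hypothetical.

```python
import numpy as np

def fit_decay(activity, steps, n_grid=1001):
    """Least-squares fit of activity ~ a * gamma**steps with gamma in [0, 1]."""
    activity = np.asarray(activity, dtype=float)
    steps = np.asarray(steps, dtype=float)
    best_gamma, best_sse = 0.0, np.inf
    for gamma in np.linspace(0.0, 1.0, n_grid):
        basis = gamma ** steps
        denom = basis @ basis
        # Closed-form optimal amplitude for this gamma
        amplitude = (activity @ basis) / denom if denom > 0 else 0.0
        sse = np.sum((activity - amplitude * basis) ** 2)
        if sse < best_sse:
            best_gamma, best_sse = gamma, sse
    return best_gamma

def sign_flip_p(values, null_value, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test of mean(values) against null_value."""
    rng = np.random.default_rng(seed)
    centered = np.asarray(values, dtype=float) - null_value
    observed = abs(centered.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, centered.size))
    null = np.abs((signs * centered).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Noise-free check: a true gamma of 0.2 should be recovered
gamma_check = fit_decay([0.2, 0.04, 0.008], [1, 2, 3])

# Simulated group of 35 participants with successor steps 1, 2, 3
rng = np.random.default_rng(0)
steps = np.array([1, 2, 3])
n_participants = 35
gammas = np.array([
    fit_decay(0.15 ** steps + 0.01 * rng.standard_normal(steps.size), steps)
    for _ in range(n_participants)
])
mean_gamma = gammas.mean()
sem_gamma = gammas.std(ddof=1) / np.sqrt(n_participants)
p_value = sign_flip_p(gammas, null_value=1.0)
```

The simulated values are not meant to reproduce the reported γ = 0.14; they only show how per-participant fits can be summarized as a group mean with s.e.m. and compared against the no-decay value of 1.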

(1.4) There seems to be a trend toward the last dot leading to a greater BOLD activity (Figure 3a). I'm wondering if this is because of the task, which is specific to the last dot. I don't think this explains the successor vs predecessor effect though, as you show in Figure 3c. However, this could explain the result of the current only statistical test performed in the "Anticipated stimulus sequences in V1" section. To formally exclude this possibility, the authors should test the difference in the activation of a given dot (B or C) when it is a successor vs when it is a predecessor.

Following the reviewer’s suggestion, we compared V1 BOLD activity across predecessor vs successor states for individual dot locations B and C. The statistical results of this control analysis replicate our main results, showing larger activity at location B during trials when B is a successor state (i.e., when dot A is presented) compared to when it is a predecessor state (i.e., when dot C is presented): t(34) = 5.72, p = 2.02 × 10⁻⁶. The same pattern of results was observed for location C (t(34) = 3.13, p = 0.0035), providing an internal conceptual replication. These new results show that the reported effects are also present when comparing individual dots and therefore exclude the possibility that the statistical comparison was driven by an increase in BOLD across locations.

We revised the manuscript to include the additional analysis:

“In line with our predictions, V1 BOLD activity was indeed enhanced at the non-stimulated successor locations compared to the non-stimulated predecessor locations (averaged across all partial trials and sequence locations; t(34) = 6.45, p = 2.23 × 10⁻⁷). The same pattern of future-directed prediction was also evident from the visual inspection of BOLD activity for all partial sequence trials separately (Figure 3C).

Further, this result of greater activity at successor compared to predecessor locations also holds when comparing individual sequence locations without averaging (i.e., comparing non-stimulated location B when successor vs predecessor, t(34) = 5.72, p = 2.02 × 10⁻⁶; and location C when successor vs predecessor, t(34) = 3.13, p = 0.0035).”
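The per-location comparison has the structure of a within-participant contrast: the same non-stimulated location is measured once when it is a successor and once when it is a predecessor. A minimal sketch of such a paired test on simulated data is shown below. The use of `scipy.stats.ttest_rel`, the variable names, and the toy effect size are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

# Simulated BOLD at location B for 35 participants (arbitrary units)
rng = np.random.default_rng(1)
n_participants = 35
bold_as_successor = 0.3 + 0.2 * rng.standard_normal(n_participants)    # dot A shown
bold_as_predecessor = 0.0 + 0.2 * rng.standard_normal(n_participants)  # dot C shown

# Paired comparison within participants
result = ttest_rel(bold_as_successor, bold_as_predecessor)
t_value, p_value = result.statistic, result.pvalue
dof = n_participants - 1  # 34 degrees of freedom, matching t(34)
```

Because the comparison is made within one location, a general increase in BOLD across sequence positions cannot produce the difference.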

(1.5) I don't find the time-resolved hippocampus analysis very convincing: couldn't this transient temporal profile be in response to the start of the trial rather than the missing dot (but see recommendation 5)? It would be best to perform the same analyses suggested above (recommendations 7 and 8) to really test whether the hippocampus exhibits the properties of the SR.

We thank the reviewer for bringing up this point. It appears that we had not properly explained what the transient profile reflects. Concretely, if the decoded information were only in response to the start of the trial, as the reviewer suggests, the time-resolved decoding profile would be completely flat.

We rephrased the result section to emphasize that the time-resolved hippocampus analysis reflects the decoding time-course (successor vs predecessor states) and therefore differs in its interpretation from a BOLD response. In the latter case, the transient profile could indeed reflect a bottom-up response to the starting dot, as the reviewer pointed out. However, in the case of decoding the transient profile shows that individual trials represent evidence specifically for dot locations associated with the successor states.

“Results of the evidence difference time-course clearly show a transient response peaking approximately 4.7 s post stimulus onset (Figure 5F), indicating that hippocampal predictions were triggered by the partial sequence dot. Note that the decoding time-course reflects the evidence for successor locations vs predecessor locations independent of the bottom-up stimulus. The transient decoding profile can therefore not simply reflect the onset of a given trial.”

(1.6) Continuing from above, concerning the time-resolved decoding: since trials are very short and ITI are jittered, it seems to me that the activity from previous trials could affect the results. Performing the decoding analysis on regression coefficients from a single-trial GLM analysis would help avoid this confound.

We thank the reviewer for this suggestion. We would like to point out that the order of our trial sequences was counterbalanced to prevent systematic influences from previous trials to the current trial. We therefore believe that the single-trial GLM analysis is not strictly required in this case.

We added the motivation for counterbalancing the trial order in the revised manuscript:

“The pseudo-randomization (perfect counterbalancing was numerically not possible with the set number of trial types and repetitions) rules out the possibility of systematic order effects.”

2) There should be an additional analysis to investigate the relationship between the hippocampus and V1. I understand the limitations of fMRI and of the current experimental design, but there are still possible analyses, even if they are indirect and the results non-definitive (for example, a correlation of the hippocampal and V1 effects across individuals, as in Hindy et al., 2016, Nat Neurosci).

We appreciate this suggestion. To clarify, we had previously refrained from a V1-Hippocampus correlation analysis, and discussed why we believe that the results would not be very meaningful with the current design:

“Our study was not designed to address the question to what extent V1 and hippocampus representations are independent of each other. Here, we purposefully refrained from reporting correlations between the two regions as we could not exclude that an apparent coordination might be driven by other factors like attentional fluctuations. Future experiments, using more than one stimulus sequence could potentially address this question by comparing evidence of sequence specific representations in both areas. (p. 13)”

However, to empirically address the reviewer’s question, we correlated the averaged BOLD activity across all successor locations in V1 with the average classifier evidence across successor locations in hippocampus, across participants. No significant relationship was observed (Spearman correlation, r = -0.08, p = 0.668). While there could be several reasons for this lack of relationship, we believe that a lack of power precludes us from drawing strong conclusions from this null finding. Nevertheless, for completeness, we have now revised the manuscript to include this new analysis:

“In order to probe the relationship between hippocampus and V1 successor reactivations, we performed an across-subject analysis, correlating V1 BOLD activity, averaged across all successor locations, with hippocampus classifier evidence, averaged across all successor locations. No significant relationship was observed (Spearman correlation, r = -0.08, p = 0.668).”
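The across-participant analysis amounts to one value per participant and region, followed by a rank correlation. A minimal sketch on simulated, unrelated data is given below. The variable names and the simulated values are assumptions, and the sketch does not reproduce the reported statistics.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_participants = 35

# One summary value per participant (simulated, independent by construction)
v1_successor_bold = rng.standard_normal(n_participants)       # V1 BOLD, successor locations
hpc_successor_evidence = rng.standard_normal(n_participants)  # hippocampus classifier evidence

# Rank-based correlation across participants
rho, p_value = spearmanr(v1_successor_bold, hpc_successor_evidence)
```

With one data point per participant, such a test has limited power, which is why a null result is difficult to interpret.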

We rephrased the discussion as follows:

“Our study showed no functional relationship between sequence prediction in V1 and hippocampus. However, our experimental paradigm was not primarily designed to address this question, as it does not exclude the possibility that an apparent coordination might be driven by other factors like attentional fluctuations across participants. Further, V1-hippocampus coordination might exist on a trial-by-trial level, which does not necessarily transfer to statistical comparisons across participants. Future experiments using more than one stimulus sequence could potentially address this question by comparing evidence of sequence-specific representations in both areas within participants.”

(3.1) The goal of the tuning analysis and the interpretation of its result should be clarified.

We thank the reviewer for giving us the opportunity to clarify the localizer analysis and interpretation.

We realized that the term ‘tuning’ might have been misunderstood to imply that we were quantifying the neural coding properties of a cortical region. However, this is not the case; instead, we had intended to use the term ‘tuning’ to refer to a learned association, as in “after exposure hippocampus representations become tuned to a certain stimulus”.

We have now replaced most occurrences of the term ‘spatial and temporal tuning’ with ‘spatial and temporal coactivation pattern’ to avoid this confusion. Wherever we still use the term ‘tuning’, we make clear that it refers to learned coactivation patterns. Further, we have placed greater emphasis on the fact that we are investigating and interpreting the localizer coactivation pattern as learned associations that might persist from the main task.

Here we highlight some of these changes from the Results section:

“Given that we successfully trained a classifier based on eight spatial locations it might seem obvious to conclude that the underlying code for these representations is purely spatial (retinotopic) as well. This is however not necessarily the case, given that the localizer was shown after the main task and might therefore reflect persistent predictive representations. Instead, robust discrimination of sequence locations could theoretically also be based on coding of temporal properties of the sequence. Indeed, Deuker et al. (2016) have recently shown that hippocampus representations can reflect in principle both spatial and temporal aspects. In our case, a temporal coding mechanism could represent stimulus locations not based on proximity in space, but rather by proximity in time.

In order to address this question, we conducted a detailed analysis of the coactivation pattern in the stimulus localizer (Figure 6A). Note that the localizer was shown at the end of the study, allowing us to test whether learned associations persisted even after the full sequence was not relevant anymore. Here, coactivations were defined as activation of non-stimulated locations. For instance, when presenting stimulus A, locations B-C-D might become activated as well. In general, such coactivations are often attributed to noise or ambivalent responses driven by overlapping receptive fields. However, in this case we made use of the coactivation pattern to draw inferences about the learned persistent representations.”

Regarding the absence of blank screens in the localizer: the localizer did in fact include so-called null events (blank screens) in which only the fixation cross was shown. Further, the presentation order of dot locations in the localizer was counterbalanced, avoiding any systematic influence of previous trials on the current trial. To answer the reviewer’s question, we do not see how the blank screens could contribute to the pattern observed in hippocampus. Arguably, if there were any issues with the blank screen causing a certain pattern of activity, that should be visible in both V1 and hippocampus. However, the temporal coactivation pattern we describe was only observed in hippocampus, but not in V1, rendering this possibility unlikely.

(3.2) It is unclear from the methods how the tuning analysis was performed exactly. It is a bit circular to define voxels sensitive to a given dot location based on the localizer data and then evaluate on that same data which dot representations were activated on a given trial. Was there some form of cross-validation performed? I could not find it in the code. Even if this was done correctly without double dipping, it seems strange conceptually to use the localizer data for both the fitting and testing purposes here because implicitly, the authors would both assume that the localizer data is independent of the learned associations (to determine the voxels sensitive to a given dot) and dependent on it (to assess temporal tuning).

To answer the reviewer’s question, we did use leave-one-out cross-validation for the localizer analysis to prevent double dipping.

This is described in the Methods section:

“Before applying the trained classifier to the main task, we confirmed that the classifier was indeed able to distinguish between stimulus locations within the localizer. To this end, we performed a leave-one-out cross validation and tested the decoding accuracy against chance level (1/8 = 12.5%) across subjects using a one-sample t-test. In addition to a binary classifier output for each class, we also looked at the probabilistic output. For each sample in the localizer test set, we obtained 8 probability values, one for each class. We refer to the classifier probability as classifier evidence, as the probability reflects the evidence that a particular class is represented. For each participant, probability values were averaged across trials to obtain location-specific response profiles.”
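The cross-validated decoding and the resulting classifier evidence can be sketched for a single simulated participant as follows. The classifier type is not stated in the passage above, so the logistic regression used here is an assumption, as are the variable names, the number of samples and voxels, and the simulated activity patterns. The scikit-learn functions themselves (`LeaveOneOut`, `cross_val_predict`) are standard.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Simulated localizer data: 8 stimulus locations, 10 samples each, 20 voxels
n_classes, n_per_class, n_voxels = 8, 10, 20
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(n_classes), n_per_class)
centers = 2.0 * rng.standard_normal((n_classes, n_voxels))
patterns = centers[labels] + rng.standard_normal((labels.size, n_voxels))

# Each sample is predicted by a classifier trained on all other samples,
# so no sample is used for both fitting and testing
clf = LogisticRegression(max_iter=1000)
evidence = cross_val_predict(
    clf, patterns, labels, cv=LeaveOneOut(), method="predict_proba"
)

# Decoding accuracy, to be compared against chance (1/8) across participants
accuracy = np.mean(np.argmax(evidence, axis=1) == labels)

# Location-specific response profiles: mean evidence per presented location
profiles = np.vstack([evidence[labels == c].mean(axis=0) for c in range(n_classes)])
```

Each row of `profiles` corresponds to one presented location. The off-diagonal entries are the coactivations of non-stimulated locations that the spatial and temporal models are compared against.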

We apologise that the analysis code was not sufficiently documented. We previously provided analysis code that recreates the article Figures from pre-processed, intermediate data specific to each figure, and code that creates the intermediate data from the raw data. The cross-validation analysis was not included in the scripts associated with the figures. We have now improved the documentation of our analysis scripts and separated the code for the article figures from the code that processes the raw data, which makes the distinction more obvious.

Regarding the ‘conceptual strangeness’: in the localizer, we are simply assessing the structure of persistent activity patterns after the main task. Our results show that no such structure is present in V1, which basically rules out any concerns. The V1 results are also confirmed using independent pRF data (see detailed response below).

The hippocampus does show a co-activation pattern, whereby not only the presented stimulus but also other stimuli were represented, albeit to a lesser degree. Importantly, in contrast to V1, the hippocampus analysis is based on a classification analysis that does not rely on a two-step process in which relevant voxels are first identified and then characterized based on their BOLD activity. Instead, the classifier takes all hippocampus voxels and outputs a probability for each possible stimulus location based on the multivariate structure.

Since the presented localizer stimulus is also the one correctly identified as most likely by the classifier (despite evidence for other stimuli), we can use the classifier to identify the presented and reactivated stimuli in the main task.

Relatedly, this somewhat applies to the other analyses too: since the localizer was performed after the main task, could it be that the authors did not select the right set, or the complete set, of voxels that are normally sensitive to a given dot location?

We thank the reviewer for bringing up the issue of correct voxel selection.

In V1, the voxels for each location were selected by (1) contrasting the BOLD activity at one location with all other locations and then (2) selecting the 25 most active voxels (highest z-value) for that location. For the main results shown in Figure 3, this contrast approach ensures that we select only voxels specific to one dot location. Even in the case of co-activation, or activity spread to neighboring locations, the contrast approach will ensure that only voxels specific to the stimulated location are selected (assuming that the region receiving the bottom-up stimulus input will always elicit the strongest BOLD response).
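The two-step selection described above can be illustrated with a minimal sketch. It assumes that a z-map for one location (from the contrast of that location against all others) is already available as a flat list; the function name `select_top_voxels` and the toy z-values are hypothetical and are not taken from the analysis code.

```python
def select_top_voxels(z_values, n_voxels=25):
    """Return the indices of the n_voxels voxels with the highest z-value.

    z_values is assumed to hold one z-score per voxel from the contrast
    'this location > all other locations'.
    """
    ranked = sorted(range(len(z_values)), key=lambda i: z_values[i], reverse=True)
    return sorted(ranked[:n_voxels])


# Hypothetical z-map: 100 voxels, of which the first 30 respond to the location.
z_map = [3.0 + 0.1 * i if i < 30 else -1.0 + 0.01 * i for i in range(100)]
roi = select_top_voxels(z_map, n_voxels=25)
print(len(roi), roi[0], roi[-1])
```

Because the ranking uses a contrast rather than a baseline comparison, a voxel that responds to several locations receives a lower z-value than a voxel that responds to one location only.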

The location selectivity can also be empirically seen in Figure 2a (right), where we plot the BOLD activity of the selected voxels, projected into stimulus space, using independently acquired receptive-field data. Figure 2a (right) shows that the selected voxels were indeed at the expected stimulus location, and not at other receptive field locations in the visual field.

We have now stressed more clearly that the pRF data in Figure 2 validate that we selected the right set of voxels for a given dot location:

“Stimulus response profiles of these eight (retinotopic) ROIs show little coactivation of neighboring locations in the visual field which allows for a precise investigation of location specific activity (Figure 2A). Unsurprisingly, during full sequence trials BOLD activity at the sequence locations receiving bottom-up visual input was markedly enhanced compared to non-stimulated control locations (Figure 2B). Population-based receptive field (pRF) data that was acquired for a subset of participants confirmed that the selected voxels correspond to the retinotopic stimulus locations as expected.”

For the additional results shown in Figure 6, we show data from a baseline contrast (opposed to the direct contrast employed for the main results).

4) It should be clarified whether the screen during the ITI is the same as during the omitted items of the partial sequence trials. If this is the case, the potential implications should be discussed.

The reviewer is correct that there is no visual difference between the inter-trial interval (ITI) and the part of the sequence where no dot is shown. By design, the variable ITI prevents the subject from learning any temporal structures related to the start of a trial. In doing so, we can focus on the predictive process that is triggered by the presentation of a sequence dot, independent of any temporal expectation effects.

In the revised manuscript, we have now clarified that the variable ITI looks visually identical to the omission of sequence dots and therefore ensures that we are not confounding the predictive effects of interest with any temporal expectation effects.

“The variable ITI ensured that the experimental paradigm had no temporal structure that participants could learn to expect the onset of a trial. This allowed us to focus in the present study on the learning and representation of structural knowledge, independent of any temporal expectation effects.

To probe activity replay we introduced partial sequence trials where only one of the four dots was shown for 100 ms, instead of the full sequence. Visually, there was no difference between the ITI and the part of the partial sequence trials where the dots were omitted, both showed a fixation cross at the center of the screen.”

5) Could you show a similar figure as Figure 3c but in Figure 5 for the hippocampus? It would be helpful to see the activation related to each dot location (including the shown dot).

Given the significant, but very low classification accuracy within the localizer (accuracy = 15% ± 3.6%, mean ± s.d.; p = 0.002), we had previously decided to only report averaged location results for the hippocampus as the non-averaged predictions would be very noisy. To put the hippocampus classification accuracy into context, in V1 cross-validated accuracy within the localizer was 92% ± 12% (mean ± s.d.).

We have now stressed this difference between V1 and hippocampus decoding in the Results section and motivated our reason for presenting averaged results:

“Within localizer decoding accuracy results confirmed that hippocampus has a coarse representation of the eight stimulus locations (Figure 5B) within the localizer (one-sample t-test; t(34) = 3.28, p = 0.002; cross-validated accuracy = 15% ± 3.6%, mean ± s.d.; see Materials and methods). Notably, compared to V1 (cf. Figure 2A), within localizer accuracy was relatively low and as a consequence tuning curves in hippocampus appeared less sharp (Figure 5C). In order to maximize sensitivity for the hippocampus, we averaged classification evidence across successor and predecessor locations. Non-averaged results can be found in Supplementary Figure 1A.”

Further, we followed the reviewer’s suggestion and added a new supplementary Figure including the non-averaged results for hippocampus. The new Figure also includes the model comparison the reviewers had asked for.

6) Background about predictions and predictive effects in V1 should be added to the introduction, this is currently lacking.

We rephrased the introduction to focus more on predictive effects in V1:

“Previous research has repeatedly shown that prior expectations influence neural activity in the visual cortex (Ekman et al., 2017; Gavornik and Bear, 2014; Hindy et al., 2016; Kok et al., 2012; Xu et al., 2012). It remains, however, unknown if SR-like representations are present outside the hippocampus in areas like the early visual cortex (V1) that have a strong retinotopic organization. Theoretically it is possible that V1 receptive fields, analogous to hippocampal place fields, become tuned to respond not only to the current input, but also to expected future inputs. Here we propose that the computationally efficient and flexible properties of the SR could in theory also underlie the anticipation of future events in V1.”

7) There is no mention of corrections for multiple comparisons in the paper. For example, are the tests for the significance of each item in Figure 3b corrected? This should be indicated at all relevant places in the manuscript and figure legends, along with whether the tests are one-tailed or two-tailed.

We thank the reviewer for pointing this out. We added the information to the legend of Figure 3, Figure 4 and Figure 6:

“Error bars denote ± s.e.m.; two-tailed t-test, ***P<0.001; **P<0.01; *P<0.05 uncorrected for multiple comparisons.”

8) Concerning the model fitting analysis, I'm unsure whether the H0 model can be compared to the other two models using RMSE, since it seems to have fewer parameters. A criterion like BIC or AIC should be used in this case.

We implemented this suggestion and calculated BIC, instead of RMSE for every subject, thereby controlling for the difference in model parameters. The results remain unchanged compared to the previous version of the manuscript. In short, the SR model has the smallest BIC value (smaller = more likely), followed by the CO model and the H0 model.
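To illustrate why BIC, unlike RMSE, accounts for the difference in the number of parameters, here is a minimal sketch. It assumes Gaussian residuals, so that BIC = n·ln(RSS/n) + k·ln(n); the response does not state which likelihood was used, and the data and the two models below are purely hypothetical.

```python
import math


def bic_gaussian(observed, predicted, n_params):
    """BIC under the assumption of Gaussian residuals (smaller = better).

    The residual sum of squares must be non-zero for the logarithm to be defined.
    """
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + n_params * math.log(n)


# Hypothetical activity at four locations for one participant.
observed = [0.50, 0.26, 0.12, 0.07]

# Hypothetical decaying model with one free parameter.
decay_prediction = [0.5, 0.25, 0.125, 0.0625]
# Hypothetical baseline model with no free parameters (predicts no activity).
baseline_prediction = [0.0, 0.0, 0.0, 0.0]

bic_decay = bic_gaussian(observed, decay_prediction, n_params=1)
bic_baseline = bic_gaussian(observed, baseline_prediction, n_params=0)
print(bic_decay, bic_baseline)
```

The k·ln(n) term penalizes the model with the extra parameter, so a more flexible model only wins if its improvement in fit outweighs that penalty.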

We updated Figure 4c, Figure 6f and the related Results and Method section.

Reviewer #2 (Recommendations for the authors):

I had two thoughts, but I leave it to the authors to decide how to address these.

1. While I agree with the authors that this is the first evidence for SR in visual sequences (to the best of my knowledge), there is another set of studies that comes to mind looking at hippocampal contributions to sequence and duration coding of perceptual sequences, which the authors may wish to discuss:

Thavabalasingam, S., O'Neil, E. B., Tay, J., Nestor, A., and Lee, A. C. (2019). Evidence for the incorporation of temporal duration information in human hippocampal long-term memory sequence representations. Proceedings of the National Academy of Sciences, 116(13), 6407-6414.

Thavabalasingam, S., O'Neil, E. B., and Lee, A. C. (2018). Multivoxel pattern similarity suggests the integration of temporal duration in hippocampal event sequence representations. NeuroImage, 178, 136-146.

We thank the reviewer for pointing out these articles. We have now included both references in the revised manuscript.

“Hippocampus on the other hand represented relevant items predominantly in terms of their temporal distance within the sequence, suggesting that representations capitalize on the transitional structure of the visual sequence. These results align with previous reports that hippocampus can learn to represent temporal sequence structure (Thavabalasingam et al., 2018, 2019) and temporal proximity in a spatial navigation task (Deuker et al., 2016; Howard et al., 2014), but to the best of our knowledge, constitute the first reports of coding temporal distance of a visual sequence.”

2. In the model fitting procedure, what exactly does it mean that the discount parameter γ was a free parameter (p. 18)? It would be helpful to provide a bit more clarity on this, but it's also potentially theoretically interesting in light of evidence that different neural structures represent information in line with different values of γ.

Keeping γ as a “free parameter” was meant to convey that, instead of using a fixed value for γ (e.g., based on previous literature) and fitting the curve to all participants, the value of γ was determined during data fitting for each participant individually. We rephrased this formulation to make that clearer.

“During model fitting γ was a free parameter, meaning that instead of using a fixed value, individual γ values were determined for each participant. Here, larger values of γ result in a smaller exponential decay of future states.”

We also report group statistics of obtained γ values:

“The activity decay toward distant future locations was formally tested by fitting an exponentially decaying factor γ (γ ∈ [0,1]) to each participant’s data. Here, values closer to 0 indicate a steeper decay and values closer to 1 indicate no decay. In line with our predictions, we found a group averaged decaying factor of γ = 0.14 (± 0.03 s.e.m.) that was statistically significantly different from 1 (non-parametric t-test t(34) = -17.17, p = 2.54 × 10⁻¹⁸).”
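What it means for γ to be a free parameter can be sketched as follows. This is an illustration under stated assumptions, not the fitting routine used for the manuscript: the activity k steps into the future is modelled as amplitude · γ^k, γ is found by a simple grid search, and the function name `fit_gamma` and the synthetic data are hypothetical.

```python
def fit_gamma(activity, n_grid=100):
    """Grid-search gamma in (0, 1] for one participant.

    activity[k-1] is assumed to be the response k steps into the future.
    Returns the best-fitting (gamma, amplitude) pair.
    """
    best = None
    for i in range(1, n_grid + 1):
        gamma = i / n_grid
        basis = [gamma ** k for k in range(1, len(activity) + 1)]
        # Closed-form least-squares amplitude for this value of gamma.
        amplitude = sum(a * b for a, b in zip(activity, basis)) / sum(b * b for b in basis)
        sse = sum((a - amplitude * b) ** 2 for a, b in zip(activity, basis))
        if best is None or sse < best[0]:
            best = (sse, gamma, amplitude)
    return best[1], best[2]


# Synthetic participant generated with gamma = 0.3 and amplitude = 0.8.
observed = [0.8 * 0.3 ** k for k in (1, 2, 3)]
gamma_hat, amplitude_hat = fit_gamma(observed)
print(gamma_hat, amplitude_hat)
```

Running the fit once per participant yields one γ per participant, which is what allows group statistics on γ to be reported.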

Reviewer #3 (Recommendations for the authors):

1. SR versus other predictive sequence models: It remains unclear to me whether the predictive activity observed in V1 is best explained by an SR model or by other models that capture predictive sequences (of which there are many). To assess whether the data is best explained by an SR model, it seems necessary to check whether two adjacent states that predict divergent future states have dissimilar representations, while two states that predict similar future states have similar representations. The data presented here is unfortunately not designed to test this comparison. Can the authors nevertheless distinguish between an SR model (e.g. Figure 4A) and a 'flat prediction' model where each stimulus predicts all possible successor states equally without any temporal discounting (i.e. A predicts B, C, and D with equal probability; B predicts C and D with equal probability but does not predict A; etc.)? It seems important to report this comparison and discuss how it may be difficult to distinguish between an SR model and a 'flat prediction' using the BOLD signal.

The reviewer points out that there are many other possible predictive activity patterns that could be expected. We are actually not aware of any existing (biological) model that would generate predictions selectively for successor states and not for predecessor states.

While we agree with the reviewer’s point that there are many possible predictive activity patterns, like ‘flat prediction’, ‘linear decrease prediction’, ‘linear increase predictions’, to the best of our knowledge, none of these predictions can be derived from existing models. That’s why we had previously only included one alternative model, the co-occurrence model which is based on a biological framework in which autoassociative connections within the hippocampal CA3 regions reactivate related sequence items from partial input, without skewing toward future locations. To the best of our knowledge, the SR model is the only model that predicts an asymmetry toward future locations.

While we like to keep the focus of our manuscript on the two existing, biologically motivated models, we have calculated the suggested ‘flat prediction’ model for the revision letter.

Comparing the model fit (BIC) of the suggested ‘flat prediction’ pattern with the SR model showed that the SR model describes the data significantly better (two-sided t-test, t(34) = 6.12, p = 5.98 × 10⁻⁷).

2. Related to point 1, it remains unclear to me why the authors consider this data to reflect an SR model, while in their previous data they characterise predictive sequences as reflecting preplay. Can the authors provide a clearer explanation for why this data is best described as an SR model rather than preplay, while Ekman et al., 2017 reflect preplay? Or do the authors consider these codes to be equivalent?

Prior to the present study, we didn’t know whether the observed preplay/replay traces were guided by a generative model that represents the relational structure of the environment. Our previous paradigm in Ekman et al. 2017 was not designed to address this question, as (i) the dot sequence had no intermediate omissions and (ii) the dot locations had different eccentricities from fixation, which hinders the interpretation of the absolute BOLD values.

The difference with our previous study is discussed as follows:

“There is an extensive body of literature that shows how expectations elicit anticipatory activity in early visual cortices (de Lange et al., 2018; Hindy et al., 2016; Kok et al., 2012). For instance, we have previously shown that flashing an individual dot of a simple, linear sequence triggers an activity wave in V1 that resembles the full stimulus sequence (Ekman et al., 2017, 2022), akin to replay of place field activity during spatial navigation (Foster and Wilson, 2006; Gupta et al., 2010). However, what remains unknown is whether these sensory replay traces are guided by a generative model that represents the relational structure of the stimulus sequence, akin to a predictive map. Alternatively, anticipatory activity traces could simply reflect the association between different stimuli, based on their co-occurrence, without the added complexity of any temporal relational structure. The latter explanation appears plausible, given that predictive representations in early visual cortex are generally time critical and operate in parallel to a constant stream of new sensory input, which arguably requires efficient processing and in turn limits the complexity of such representations.

In fact, we previously speculated that cue-triggered reactivation of simple sequences might be driven by an automatic pattern completion-like mechanism that reactivates all associated items based on partial input (Ekman et al., 2017). This idea is in line with the finding that predictive representations in V1 correlated with pattern completion-like activity in the hippocampus (Hindy et al., 2016; Kok and Turk-Browne, 2018) that might be driving V1 activity (Finnie et al., 2021; Ji and Wilson, 2007).

Our current findings directly challenge this interpretation and instead point to a predictive representation of expected, temporally discounted, future states. We accomplished this by using a paradigm in which one visual event (e.g., the presentation of one dot) was framed as one state in a directed transition matrix with a fixed relational structure. The SR hypothesis makes two testable predictions, namely that population activity represents future states over predecessor states, and that future state representations are temporally discounted, such that events in the close future are more prominently represented compared to events in the distant future. Using a paradigm in which we occasionally presented only single items of the full sequence, allowed us to investigate V1 activity at omitted sequence locations.”

3. It is not clear to me how the ROIs are being used in Figure 3 and 4? If V1 activity reflects an SR, within a given ROI it should be possible to see evidence for backward skew in the representation of each location (consistent with Mehta et al., 2000), while at the population level there is a forward skew?

We believe that this is indeed what our data shows. For example, within the V1 ROI that is responsive to dot location B, there is elevated activity when dot A is shown. This could be interpreted as ‘backward skew’ of this ROI: the ROI that is tuned to location B also starts responding to location A.

At the level of the entire V1 population, however, this results in “forward skew”: when presenting dot location A, the population response is skewed forward, by virtue of the anticipatory activity in V1 neurons that are tuned to successor location B.

4. The authors seem to apply different models to data from different brain regions and to data from the task and localiser data. Why? For consistency and clarity would it be possible for the authors to apply the same set of models throughout, to both V1 and hippocampus, and to both task and localiser data? i.e. SR model, 'flat prediction' model, CO model, H0 model, spatial model, temporal model.

We appreciate the suggestion made by the reviewer to apply the same models to both V1 and hippocampus.

For the hippocampus, we had previously analysed averaged classifier outputs across locations. This was done to improve the signal-to-noise ratio. However, averaging the output (i.e., all successor locations vs all predecessor locations) did not allow us to do any model fitting. In the revised version, we have now implemented six changes: (1) we added our motivation for collapsing the hippocampus data; (2) we now show the non-averaged hippocampus results as a Supplementary Figure; (3) we report the same model comparison for hippocampus that was done for V1, thereby keeping the model comparison consistent across regions; (4) we now include the SR model in the model comparison for the localizer; (5) we added our motivation for applying the spatial and temporal models to the localizer and not to the main task; and (6) we renamed the no coactivation (NoCo) model from the localizer to H0 model, indicating more clearly that this is the same ‘baseline’ model used in the main task. The different names (H0, NoCo) might have previously contributed to the impression that these are different models, despite being conceptually the same.

Below we copy our response to Reviewer #1 from above, who brought up a similar point.

Given the significant, but very low classification accuracy within the localizer (accuracy = 15% ± 3.6%, mean ± s.d.; p = 0.002), we had previously decided to only report averaged location results for the hippocampus as the non-averaged predictions would be very noisy. To put the hippocampus classification accuracy into context, in V1 cross-validated accuracy within the localizer was 92% ± 12% (mean ± s.d.).

We have now stressed this difference between V1 and hippocampus decoding in the Results section and motivated our reason for presenting averaged results:

“Within localizer decoding accuracy results confirmed that hippocampus has a coarse representation of the eight stimulus locations (Figure 5B) within the localizer (one-sample t-test; t(34) = 3.28, p = 0.002; cross-validated accuracy = 15% ± 3.6%, mean ± s.d.; see Materials and methods). Notably, compared to V1 (cf. Figure 2A), within localizer accuracy was relatively low and as a consequence tuning curves in hippocampus appeared less sharp (Figure 5C). In order to maximize sensitivity for the hippocampus, we averaged classification evidence across successor and predecessor locations. Non-averaged results can be found in Supplementary Figure 1A.”

Further, we followed the reviewer’s suggestion and added a new supplementary Figure including the non-averaged results for hippocampus. The new Figure also includes the model comparison the reviewers had asked for.

Additionally, we also followed the reviewer’s suggestion and included the SR model to the localizer analysis, confirming that the localizer is not best described by an SR coactivation pattern.

Finally, we explained why we don’t apply the temporal and spatial coactivation models to the main task. Here we copy our reply to Reviewer #2 from above, who had a similar point:

The reviewer is correct that the fact that the sequence order and spatial distance were not fully decorrelated (second presentation was always farthest away from starting dot, third and fourth dot always the same distance from start) prevents us from quantifying the interaction of the SR and CO model with a spatial model during the main task.

We added the following to the Method section to clarify this:

“Note that because within each dot sequence, temporal order and spatial distance were not perfectly decorrelated (e.g. the second sequence dot was always farthest apart from the starting dot), it is not possible to estimate the combined influence of the SR model and the spatial coactivation model on the observed BOLD activity.”

Having said that, we believe that there is little concern that the reported reactivations of the main task are driven by the Euclidean distance in a meaningful way for two reasons:

(1) Detailed analysis of the localizer data showed that there is no spatial spreading from one dot location to the other sequence locations (Figure 6). This is likely because the relevant dot locations were sufficiently spaced apart. The lack of spreading during the localizer, where the dot was flashed for 13.5 s, makes the presence of spreading during the main task, where the dot was flashed for only 100 ms, equally unlikely.

(2) The presence of spatial spreading would actually obfuscate the reported SR-like pattern and could not have caused it. Specifically, because the second sequence dot was always farthest apart from the start, this is where one would assume the least amount of activity spread (greatest Euclidean distance). Sequence dots three and four should be more active given that they are both closer to the starting point in terms of Euclidean distance. Our reported results are the opposite of that pattern, ruling out the possibility that these were caused by spatial spreading.

5. Related to point 4, in Figure 6 it seems that V1 data from the localiser scan does not support an SR model? This suggests that the task itself is driving the predictive sequence activity in Figures 3-4? This important difference in evidence for an SR-like code during the task and localiser scan should be emphasised and discussed.

The reviewer is correct that V1 data from the localizer does not show the persistent SR-like predictions from the main task. In the revised version, we now included a formal test for this (see response above).

We believe that this is to be expected as the sequence predictions are learned and updated based on exposure. Within the localizer, no more dot sequences are shown. Instead, individual dot locations are repeatedly flashed for 13.5 s at the same location. It is therefore expected that sequence predictions, related to the previous task, would eventually fade away. This can be understood in the context of continuous updating of the predictions once the regularities of the environment change and does not constitute any evidence against the SR model.

We have added the following to the Discussion:

“Furthermore, hippocampus predictive codes were found to persist after the sequence task and coactivation of related sequence locations were still present during the stimulus localizer, potentially indicating that hippocampus representations reflect a more stable code operating on a longer timescale. V1 representations on the other hand did not persist throughout the stimulus localizer and reverted back to representing individual spatial locations without coactivation of related sequence locations, further highlighting another qualitative difference between V1 and hippocampus coding. According to the SR, it is expected that sequence predictions will change once the regularities of the environment change. The absence of SR-like pattern in V1 during the functional localizer is therefore not at odds with our results from the main task, but rather indicative of a dynamic updating of the generative model.”

6. How specific are these findings to V1 and hippocampus? If the authors use a searchlight analysis to look for multivariate patterns consistent with an SR model, do they not find that many brain regions show evidence for an SR representation?

We addressed the reviewer’s question about the specificity of the effects, by testing another low-level visual area V2. These new results show that in contrast to V1 and hippocampus, V2 does not feature any predictive effects and suggest that the reported findings are not ubiquitous throughout the brain.

This new result is mentioned in the revised manuscript:

“One could ask whether our findings are specific to V1 and hippocampus, or widespread throughout the brain. In order to answer this question, we repeated the analysis for low-level visual area V2. In contrast to V1, no predictive effects were found in area V2. V2 BOLD activity was not enhanced at the non-stimulated successor locations compared to the non-stimulated predecessor locations (averaged across all partial trials and sequence locations; t(34) = 1.41, p = 0.168).”

We appreciate the searchlight suggestion to complement our ROI analysis approach. However, we believe that an additional ROI analysis in this case is more meaningful compared to a searchlight analysis. The reason for this is that the dot locations in our experiment are up to ~14 degrees apart in visual space. Within retinotopically organized visual areas, these stimulus locations are represented in different hemifields, multiple centimetres away.

The sphere of a searchlight with a commonly used radius of ~4 mm would not be able to capture that effect, simply because the sphere would be too small to include all relevant voxels. One could argue that running a searchlight analysis with a radius of ~40 mm (an order of magnitude larger) would alleviate that problem, but that would result in a complete loss of spatial specificity. For this reason, we refrained from using a searchlight analysis and present the additional V2 results instead.

7. In general, several of the reported analyses are not clearly explained. For example, how do the authors generate the reconstruction maps in Figure 2? Why was pRF mapping only performed in 7 subjects? Why were the data from the pRF maps not used to generate ROIs?

We thank the reviewer for pointing out these unclarities. In short, the pRF data were available for 7 subjects who had also participated in a previous experiment. Since the goal of the pRF mapping was only to confirm that the localizer voxel selection produced accurate results (Figure 2), we decided not to invite all subjects for a second pRF session that would have taken 45 minutes of fMRI scanning time.

The ROIs were generated from the functional localizer, because the pRF data was not available for all subjects. Please note that the voxel selection via the functional localizer was also the method we had preregistered for the data analysis (https://osf.io/f8dv9/), so this was always the intended analysis.

In the revised manuscript we now stress that the pRF data was from a previous experiment:

“pRF estimation. pRF data was available for seven participants from a previous study and was used to validate visually that the voxel selection based on the functional localizer selected voxels that correspond to the stimulated location in visual space (Figure 2).”

We further added the missing information how the pRF reconstruction in Figure 2 was performed:

“For the pRF-based stimulus reconstruction, for each participant, we first limited the pRF data to voxels that were selected based on the functional localizer (25 voxels per stimulus location, 200 voxels in total). This selection step allowed us to visually inspect whether the voxel selection accurately selected voxels corresponding to the respective stimulus location. Second, every voxel is described as a 2D Gaussian with parameters x0, y0, and s0 from the pRF estimation. The 2D Gaussians for each voxel, represented by a pixel x pixel image, were scaled based on the percent signal change obtained from the functional localizer GLM, and consecutively summed over voxels to create one 2D representation of the reconstructed stimulus. This procedure was repeated separately for all 8 stimulus locations. Finally, for visualization purposes, the 8 individual localizer conditions were rotated to one stimulus location (22.5°) and averaged across stimulus locations and participants.”
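The core of this reconstruction, a sum of 2D Gaussians weighted by percent signal change, can be sketched as follows. This is a minimal illustration, assuming hypothetical pRF parameters and a coarse grid in degrees of visual angle; the rotation and averaging steps are omitted, and none of the names below come from the analysis code.

```python
import math


def gaussian_2d(x0, y0, sigma, grid):
    """Evaluate an isotropic 2D Gaussian (peak value 1) on a square grid."""
    return [[math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
             for x in grid] for y in grid]


def reconstruct(voxels, grid):
    """Sum per-voxel Gaussians, each scaled by its percent signal change.

    voxels is a list of (x0, y0, sigma, percent_signal_change) tuples.
    """
    image = [[0.0 for _ in grid] for _ in grid]
    for x0, y0, sigma, psc in voxels:
        g = gaussian_2d(x0, y0, sigma, grid)
        for r in range(len(grid)):
            for c in range(len(grid)):
                image[r][c] += psc * g[r][c]
    return image


# Hypothetical grid from -10 to 10 degrees in steps of 0.5 degrees.
grid = [i * 0.5 for i in range(-20, 21)]
# Hypothetical voxels whose receptive fields cluster around (4, 3) degrees.
voxels = [(4.0, 3.0, 1.0, 1.2), (4.5, 3.0, 1.5, 0.8), (3.5, 3.5, 1.0, 1.0)]
image = reconstruct(voxels, grid)

peak_value, peak_x, peak_y = max(
    (image[r][c], grid[c], grid[r])
    for r in range(len(grid)) for c in range(len(grid))
)
print(peak_x, peak_y)
```

If the localizer-based selection picked the right voxels, the peak of the reconstructed image should fall at the stimulated location, which is the visual check the paragraph above describes.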

8. Statistics:

a) Can the authors clarify how they corrected for multiple comparisons when performing model comparisons?

Thank you for pointing this out. No correction for multiple comparisons was carried out during the model comparison. This was now added to the legend of Figure 6.

“Error bars denote ± s.e.m.; two-tailed t-test, ***P<0.001; **P<0.01; *P<0.05 uncorrected for multiple comparisons.”

b) The authors say they performed a one-sided t-test using data from Figure 5b. Can they clarify what they did here?

We apologize for the unclarity. We performed a two-sided one-sample t-test. We revised the paragraph in the main text to refer to the Methods section where the analysis is explained:

“Within localizer decoding accuracy results confirmed that hippocampus has a coarse representation of the eight stimulus locations (Figure 5B) within the localizer (two-sided one-sample t-test; t(34) = 3.28, p = 0.002; cross-validated accuracy = 15% ± 3.6%, mean ± s.d.; see Materials and methods).”

The Methods section under “Hippocampal decoding” reads:

“A logistic regression classifier (default values, L2 regularization; C=1) was trained to distinguish between 8 stimulus locations during the independent localizer run. Before applying the trained classifier to the main task, we confirmed that the classifier was indeed able to distinguish between stimulus locations within the localizer. To this end, we performed a leave-one-out cross-validation and tested the decoding accuracy against chance level (1/8 = 12.5%) across subjects using a one-sample t-test.”
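The group-level test against chance can be illustrated with a short sketch. It assumes that one cross-validated accuracy per subject is already available; the accuracies below are invented for illustration and are not the study's data, and the helper name `one_sample_t` is hypothetical. Only the t statistic and degrees of freedom are computed here.

```python
import math
import statistics

def one_sample_t(values, popmean):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == popmean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample s.d. (n - 1 denominator).
    sem = statistics.stdev(values) / math.sqrt(n)
    return (mean - popmean) / sem, n - 1

CHANCE = 1 / 8  # eight stimulus locations, so chance level is 12.5%

# Hypothetical per-subject cross-validated accuracies (NOT the study's data).
accuracies = [0.15, 0.10, 0.20, 0.15, 0.175, 0.125]
t_stat, dof = one_sample_t(accuracies, CHANCE)
```

A two-sided p-value would then be read from a t distribution with `dof` degrees of freedom; a library routine such as `scipy.stats.ttest_1samp` returns the statistic and the p-value together.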

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there is one remaining issue that needs to be addressed, as outlined by Reviewer #1. Specifically, we thought it would be helpful to provide a bit more detail on the differences between the predictions of an SR versus model-based algorithm:

Reviewer #1 (Recommendations for the authors):

The authors have considerably revised their paper and they have addressed most of my comments satisfactorily. However, I remain uncertain about point 1.1.

I understand that there are no rewards in your task and that the SR algorithm can apply in the absence of rewards. I am not sure however that a model-based (MB) algorithm would make different predictions than SR in the context of your experiment. Indeed, it can be difficult to distinguish SR and MB in many contexts, especially if there is no reevaluation of the transition matrix during the experiment (Momennejad et al., 2017, Nat Hum Behav). Could the authors perhaps test what the predictions of a MB algorithm would be in their experiment (see, e.g., the equation reported in the Methods of the Momennejad paper), or otherwise explain why this would be irrelevant?

We thank the reviewer for raising this issue, which has prompted us to reflect on it more thoroughly.

The reviewer is correct that, within the context of our design, it is strictly speaking not possible to distinguish between model-based (MB) and SR algorithms. The key distinction between them is that SR caches a predictive map of states that the agent expects to visit in the future, whereas MB algorithms store a full model of the world and compute trajectories at the decision time. Both predict a temporally discounted activation of successor states.

It should be noted however that MB comes at a higher computational cost, and is more intensive both in terms of time and working memory resources. The activation of successor states that we observed occurred in the absence of a decision-making process (i.e., participants did not perform any task on the trials where a single dot was presented). Also, and interestingly, we previously observed that this activation pattern was not dependent on the task, and was equally present when attentional resources were strongly drawn away from the stimuli (Ekman et al. Nat Comm 2017). These observations may be more readily in line with the automatic (cached) activation of successor states that is embodied by SR, rather than the effortful iterative calculation of successor states that is the hallmark of MB. Nevertheless, we agree with the reviewer that our study does not provide strong evidence in favor of SR over MB computations in the visual cortex. We have now made this clearer in the Discussion section of our manuscript (see below).

Also, the question raised by the reviewer inspired a potential follow-up experiment, which is outside of the scope of the current manuscript, but which would be a potentially promising avenue of future research. The transition revaluation manipulation described earlier (Momennejad et al. 2017) could also be applied to our experimental setting. After exposing participants to our dot sequences (ABCD), one could introduce a relearning phase in which participants are exposed to BDC. When participants, after this relearning phase, are exposed to A, there are competing predictions about the activation pattern of successor states:

H1 (SR): BCD

H2 (MB): BDC

We believe that this could be an interesting follow-up experiment, and have added it to the Discussion section of the manuscript.
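The two competing predictions can be made concrete with a toy simulation. This is a sketch under simplifying assumptions that are not taken from the manuscript: transitions are deterministic, the discount factor of 0.5 is arbitrary, the SR is modelled as a cached table of discounted successor activations, and the cached entry for A is assumed to stay unchanged because A is never shown during relearning. All function and variable names are hypothetical.

```python
STATES = "ABCD"

def chain_to_transitions(chain):
    """Deterministic one-step transitions for a sequence such as 'ABCD'."""
    return {a: b for a, b in zip(chain, chain[1:])}

def rollout(start, transitions, gamma):
    """Discounted successor activations obtained by stepping through a model."""
    activation = {s: 0.0 for s in STATES}
    state, discount, visited = start, 1.0, {start}
    # Stop at the end of the chain or before revisiting a state.
    while state in transitions and transitions[state] not in visited:
        state = transitions[state]
        discount *= gamma
        activation[state] = discount
        visited.add(state)
    return activation

def order(activation):
    """Active successor states sorted from strongest to weakest activation."""
    active = [s for s in STATES if activation[s] > 0]
    return "".join(sorted(active, key=lambda s: -activation[s]))

gamma = 0.5  # arbitrary discount factor, chosen for illustration
learned = chain_to_transitions("ABCD")

# SR: cache the discounted successor row of every state after initial learning.
sr_cache = {s: rollout(s, learned, gamma) for s in STATES}

# Relearning phase: only B-D-C is shown, so the one-step model is updated,
# but the cached SR row for A is left untouched (simplifying assumption).
relearned = dict(learned)
relearned.update(chain_to_transitions("BDC"))

sr_prediction = order(sr_cache["A"])                   # stale cached row
mb_prediction = order(rollout("A", relearned, gamma))  # recomputed from model
```

Under these assumptions `sr_prediction` is `"BCD"` and `mb_prediction` is `"BDC"`, matching H1 and H2. The cached row for B also shows the signature pattern discussed throughout: activation of the successors C and D that fades with distance, and none for the predecessor A.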

We have added the following section to the Discussion section (page 13):

“While we have interpreted the neural activity patterns in the light of the SR, it is strictly speaking not possible to distinguish between model-based (MB) and SR algorithms within the context of our design. The key distinction between them is that SR caches a predictive map of states that the agent expects to visit in the future, whereas MB algorithms store a full model of the world and compute trajectories at the decision time (Momennejad et al., 2017; Gershman, 2018). Therefore, both predict a temporally discounted activation of successor states. It should be noted however that MB comes at a higher computational cost, and is more intensive both in terms of time and working memory resources. The activation of successor states that we observed, on the other hand, occurred in the absence of a decision-making process (i.e., participants did not perform any task on the trials where a single dot was presented). Also, importantly, we previously observed that this activation pattern was not dependent on the task, and was equally present when attentional resources were strongly drawn away from the stimuli (Ekman et al., 2017). These observations may be more readily in line with the automatic (cached) activation of successor states that is embodied by SR, rather than the effortful iterative calculation of successor states that is the hallmark of MB. One future possibility to disentangle SR and MB algorithms could be to probe how well each model adapts to changes in the dot sequence structure. It has previously been shown that, compared to MB, the SR is less flexible in reflecting changes in the transitional structure, because it requires the entire SR to be relearned (Momennejad et al., 2017).”

https://doi.org/10.7554/eLife.78904.sa2

Article and author information

Author details

  1. Matthias Ekman

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    matthias.ekman@donders.ru.nl
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-1254-1392
  2. Sarah Kusch

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Formal analysis, Methodology, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Floris P de Lange

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing - review and editing
    Competing interests
    Senior editor, eLife
    ORCID: 0000-0002-6730-1452

Funding

Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Veni Grant No. 016.Veni.195.435)

  • Matthias Ekman

HORIZON EUROPE European Research Council (ERC Consolidator Grant 101000942 "Surprise")

  • Floris P de Lange

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Human subjects: The study followed institutional guidelines of the local ethics committee (CMO region Arnhem-Nijmegen, The Netherlands; Research Protocol "Imaging Human Cognition", NL45659.091.14), including informed consent of all participants.

Senior Editor

  1. Chris I Baker, National Institute of Mental Health, National Institutes of Health, United States

Reviewing Editor

  1. Morgan Barense, University of Toronto, Canada

Reviewer

  1. Helen Barron

Version history

  1. Received: March 23, 2022
  2. Preprint posted: March 26, 2022
  3. Accepted: January 13, 2023
  4. Version of Record published: February 2, 2023 (version 1)

Copyright

© 2023, Ekman et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Matthias Ekman
  2. Sarah Kusch
  3. Floris P de Lange
(2023)
Successor-like representation guides the prediction of future events in human visual cortex and hippocampus
eLife 12:e78904.
https://doi.org/10.7554/eLife.78904
