Challenges in Replay Detection by TDLM in Post-Encoding Resting State

  1. Clinical Psychology, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
  2. Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Mannheim, Germany
  3. Addiction Behavior and Addiction Medicine, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
  4. Department of Psychology, Ruprecht Karl University of Heidelberg, Heidelberg, Germany
  5. Institute of Psychology, Universität Hamburg, Hamburg, Germany
  6. Institute of Medical Psychology and Behavioral Neurobiology, Eberhard-Karls-University Tübingen, Tübingen, Germany
  7. Max Planck UCL Centre for Computational Psychiatry and Ageing Research, London, United Kingdom
  8. Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Anna Schapiro
    University of Pennsylvania, Philadelphia, United States of America
  • Senior Editor
    Michael Frank
    Brown University, Providence, United States of America

Reviewer #1 (Public review):

Summary:

Participants learned a graph-based representation but, contrary to the hypotheses, failed to show neural replay shortly after. This prompted a critical inquiry into temporally delayed linear modeling (TDLM), the algorithm used to find replay. First, it was found that TDLM detects replay only at implausible numbers of replay events per second. Second, it detects replay-to-cognition correlations only at implausible densities. Third, there are concerning baseline shifts in sequenceness across participants. Fourth, spurious sequences arise in control conditions without a ground-truth signal. Fifth, reframing previously published simulations reveals similar evidence.

Strengths:

(1) This work is meticulous and meets a high standard of transparency and open science, with preregistration, code and data sharing, and external resources such as a GUI for the task and materials for the public.

(2) The writing is clear, balanced, and matter-of-fact.

(3) By injecting visually evoked empirical data into the simulation, many surface-level problems are avoided, such as questions of biological plausibility and signal-to-noise ratio.

(4) The investigation of sequenceness-to-cognition correlations is an especially useful add-on because much of the previous work uses this to make key claims about replay as a mechanism.

Weaknesses:

Many of the weaknesses are not so much flaws in the analyses as shortcomings in interpretation and a failure to make these findings as useful as they could be.

(1) I found the bigger-picture analysis to be lacking. Let us take stock: in other work during active cognition, including at least one study from the authors, TDLM shows significant sequenceness. But the evidence provided here suggests that even very strong localizer patterns injected into the data cannot be detected as replay except at implausible speeds. How can both of these things be true? Assuming these analyses are cogent, do these findings not imply something more destructive about all studies that found positive results with TDLM?

(2) All things considered, TDLM seems like a fairly 'vanilla', low-assumption algorithm for finding event sequences (for concreteness, a minimal sketch of its two-step GLM is given after this list). It is hard to see intuitively what the breaking factor might be; why do the authors think ground-truth patterns cannot be detected by this GLM-based framework at reasonable densities?

(3) Can the authors sketch any directions for alternative methods? It seems we need an algorithm that outperforms TDLM, but not many clues or speculations are given as to what that might look like. Relatedly, no technical or "internal" critique is provided. What is it about TDLM that causes it to be so weak?
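To make point (2) concrete, here is a minimal sketch of the two-step GLM at the heart of TDLM, written in our own notation rather than taken from the authors' pipeline (the first level omits an intercept and any regularization for brevity):

```python
# Minimal sketch of TDLM's two-step GLM. Variable names and
# simplifications (no intercept, no regularization) are ours.
import numpy as np

def tdlm_sequenceness(X, T, max_lag=60):
    """X: (n_samples, n_states) decoded state time courses.
    T: (n_states, n_states) hypothesized transition matrix.
    Returns forward and backward sequenceness at each lag (in samples)."""
    _, k = X.shape
    zf = np.zeros(max_lag + 1)
    zb = np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        # First level: regress each state's time course on all state
        # time courses shifted by `lag` samples.
        beta, *_ = np.linalg.lstsq(X[:-lag], X[lag:], rcond=None)
        # Second level: decompose the empirical lag-specific coupling
        # into forward (T), backward (T'), self, and constant components.
        design = np.column_stack([T.ravel(), T.T.ravel(),
                                  np.eye(k).ravel(), np.ones(k * k)])
        coefs, *_ = np.linalg.lstsq(design, beta.ravel(), rcond=None)
        zf[lag], zb[lag] = coefs[0], coefs[1]
    return zf, zb
```

The simplicity is the point: each lag amounts to two ordinary least-squares fits, so it is not obvious where along this short chain the sensitivity is lost.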

Addressing these points would make this manuscript more useful, workable, and constructive, even if they would not necessarily increase its scientific breadth or strength of evidence.

Reviewer #2 (Public review):

Summary:

Kern et al. investigated whether temporally delayed linear modeling (TDLM) can uncover sequential memory replay from a graph-learning task in human MEG during an 8-minute post-learning rest period. After failing to detect replay events, they conduct a simulation study in which they insert synthetic replay events, derived from each participant's localizer data, into a control rest period prior to learning. The simulations suggest that TDLM only reveals sequences when replay occurs at very high densities (> 80 per minute) and that individual differences in baseline sequenceness may lead to spurious and/or lackluster correlations between replay strength and behavior.
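To illustrate the hybrid-simulation logic, here is a sketch (our simplification under assumed parameters, not the authors' exact code) of how localizer-derived topographies could be injected into baseline sensor data at a chosen density and state-to-state lag:

```python
# Illustrative sketch of inserting synthetic replay into resting-state
# MEG; all names and parameter values are assumptions for illustration.
import numpy as np

def insert_replay(rest, patterns, sequence, density_per_min,
                  sfreq=100, lag_ms=80, seed=None):
    """rest: (n_channels, n_samples) baseline sensor data.
    patterns: (n_states, n_channels) localizer topographies.
    sequence: ordered list of state indices to replay."""
    rng = np.random.default_rng(seed)
    data = rest.copy()
    n_samples = rest.shape[1]
    lag = int(round(lag_ms / 1000 * sfreq))
    n_events = int(density_per_min * n_samples / sfreq / 60)
    # Uniformly random (possibly overlapping) event onsets.
    onsets = rng.integers(0, n_samples - lag * len(sequence), n_events)
    for t0 in onsets:
        for i, state in enumerate(sequence):
            data[:, t0 + i * lag] += patterns[state]
    return data
```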

Strengths:

The approach is extremely well documented and rigorous. The authors have done an excellent job re-creating the TDLM methodology that is most commonly used, reporting the different approaches and parameters that they used, and reporting their preregistrations. The hybrid simulation study is creative and provides a new way to assess the efficacy of replay decoding methods. The authors remain measured in the scope/applicability of their conclusions, constructive in their discussion, and end with a useful set of recommendations for how to best apply TDLM in future studies. I also want to commend this work for not only presenting a null result but thoroughly exploring the conditions under which such a null result is expected. I think this paper is interesting and will be generally quite useful for the field, but I believe it also has a number of weaknesses that, if addressed, could improve it further.

Weaknesses:

The sample size is small (n=21 after exclusions), even for TDLM studies (which typically have between 25 and 40 participants). The authors address this somewhat through a power analysis of the relationship between replay and behavioral performance in their simulations, but this is very dependent on the assumptions of the simulation. Further, according to their own power analysis, the replay-behavior correlations are seriously underpowered (~10% power according to Figure 7C), so if this is taken at face value, their own null findings on this point (Figure 3C) could simply reflect undersampling rather than methodological failure. I think this point needs to be made more clearly earlier in the manuscript. Relatedly, it would be very useful if one of the recommendations coming out of the simulations in this paper were a power analysis for detecting sequenceness in general, as I suspect that the small sample size impacts this as well, given that sequenceness effects reported in other work are often small even with larger sample sizes. Further, I believe that the authors' simulations of basic sequenceness effects would themselves still suffer from the small number of subjects, thereby impacting statistical power. Perhaps the authors could perform the same sort of bootstrapping analysis they perform for the correlation between replay and performance, but over sequenceness itself?
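One way to implement the suggested bootstrap over sequenceness itself, sketched here with hypothetical inputs (per-subject sequenceness values at the lag of interest, and a threshold that would in practice come from the permutation procedure):

```python
# Hypothetical bootstrap over sequenceness: how often does a resampled
# group clear the significance threshold? Inputs are assumptions.
import numpy as np

def sequenceness_power(subject_seq, threshold, n_boot=10_000,
                       sample_size=None, seed=None):
    """subject_seq: (n_subjects,) per-subject sequenceness at the lag
    of interest; threshold: e.g., a permutation-derived cutoff."""
    rng = np.random.default_rng(seed)
    n = sample_size or len(subject_seq)
    hits = 0
    for _ in range(n_boot):
        sample = rng.choice(subject_seq, size=n, replace=True)
        if sample.mean() > threshold:
            hits += 1
    return hits / n_boot  # estimated power at this sample size
```

Sweeping `sample_size` would directly yield the kind of sample-size recommendation requested above.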

The task paradigm may introduce issues in detecting replay that are separate from TDLM. First, the localizer task involves a match/mismatch judgment and a button press during the stimulus presentation, which could add noise to classifier training separate from the semantic/visual processing of the stimulus. This localizer is similar to others that have been used in TDLM studies, but notably in other studies (e.g., Liu, Mattar et al., 2021), the stimulus is presented prior to the match/mismatch judgment. A discussion of variations in different localizers and what seems to work best for decoding would be useful to include in the recommendations section of the discussion. Second, and more seriously, I believe that the task design for training participants about the expected sequences may complicate sequence decoding. Specifically, this is because two images (a "tuple") are shown together and used for prediction, which may encourage participants to develop a single bound representation of the tuple that then predicts a third image (AB -> C rather than A -> B, B -> C). This would obviously make it difficult to i) use a classifier trained on individual images to detect sequences and ii) find evidence for the intended transition matrix using TDLM. Can the authors rule out this possibility?

Participants only modestly improved (from 76% to 82% accuracy) following the rest period (which the authors refer to as a consolidation period). If the authors assume that replay leads to improved performance, then this suggests there is little reason to expect much task-related replay during rest in the first place. This limitation is touched on (lines 228-229), but I think it makes the lack of a replay finding here less surprising. However, note that the supplement shows that the amount of forward sequenceness is marginally related to the performance difference between the last block of training and retrieval, and this is the effect I would predict would be most likely to appear. Obviously, my sample size concerns still hold, and this is not a significant effect under the null hypothesis testing framework the authors employ, but I think this set of results should at least be reported in the main text. I was also wondering whether the authors could clarify how the criterion over six blocks was 80%, yet the performance baseline they use from the last block is 76%. Is it just that participants must reach 80% *at some point* within the six blocks of training, but that they could dip below that again later?

Because most of the conclusions come from the simulation study, there are a few decisions about the simulations that I would like the authors to expand upon before I can fully support their interpretations. First, the authors use a state-to-state lag of 80ms and do not appear to vary this throughout the simulations - can the authors provide context for this choice? Does varying this lag matter at all for the results (i.e., does the noise structure of the data interact with this lag in any way?) Second, it seems that the approach to scaling simulated replays with performance is rather coarse. I think a more sensitive measure would be to scale sequence replays based on the participants' responses to *that* specific sequence rather than altering the frequency of all replays by overall memory performance. I think this would help to deliver on the authors' goal of simulating an "increase of replay for less stable memories" (line 246). On the other hand, I was also wondering whether it is actually necessary to use the real memory performance for each participant in these simulations - couldn't similar goals (with a better/more full sampling of the space of performance) be achieved with simulated memory performance as well, taking only the MEG data from the participant? Finally, Figure 7D shows that 70ms was used on the y-axis. Why was this the case, or is this a typo?
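A sketch of the per-sequence scaling suggested above; the inverse-accuracy weighting rule and all names here are our assumptions, one scheme among several one could choose:

```python
# Hypothetical scheme: scale the number of inserted replays of each
# specific sequence by the participant's errors on that sequence,
# keeping the overall replay rate fixed.
import numpy as np

def per_sequence_replay_counts(accuracy_per_seq, base_rate_per_min,
                               minutes=8.0):
    """accuracy_per_seq: (n_sequences,) proportion correct per sequence.
    Less stable memories (lower accuracy) receive more replays."""
    acc = np.asarray(accuracy_per_seq, dtype=float)
    weights = (1.0 - acc) + 1e-6      # more errors -> more replay
    weights /= weights.mean()         # preserve the mean per-sequence rate
    counts = np.round(base_rate_per_min * minutes * weights)
    return counts.astype(int)
```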

Because this is a re-analysis of a previous dataset combined with a new simulation study on that data aimed at making recommendations about how to best employ TDLM, I think the usefulness of the paper to the field could be improved in a few places. Specifically, in the discussion/recommendation section, the authors state that "yet unknown confounders" (line 295) lead to non-random fluctuations in the simulated correlations between replay detection and performance at different time lags. Because the potential to detect sequenceness in a baseline condition with no ground-truth sequences is a particularly strong claim, the manuscript could benefit from a more thorough exploration of the cause(s) of this bias, beyond the speculation provided in the current version. In addition, to really demonstrate that a realistic simulation is necessary (one of the primary conclusions of the paper), it would be useful to provide a comparison to a fully synthetic simulation performed on this exact task and transition structure (in addition to the recreation of the original simulation code from the TDLM methods paper). Finally, I think the authors could do further work to determine whether some of their recommendations for improving the sensitivity of TDLM pan out in the current data - for example, they could report results from incorporating not just the peak decoding timepoint but also other timepoints into classifier training.

Lastly, I would like the authors to address a point that was raised in a separate public forum by an author of the TDLM method, which is that when replays "happen during rest, they are not uniform or close". Because the simulations in this work assume regularly occurring replay events, I agree that this is an important limitation that should be incorporated into alternative simulations to ensure the lack of findings is not because of this assumption.
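One simple way to relax that assumption in future simulations is to draw replay onsets from a homogeneous Poisson process, which produces irregular, clustered inter-event intervals; the parameters in this sketch are illustrative only:

```python
# Sketch: Poisson-process replay onsets instead of a regular grid.
import numpy as np

def poisson_replay_onsets(rate_per_min, duration_s, sfreq=100, seed=None):
    """Returns event onsets (in samples) with exponential inter-event
    intervals, i.e., a homogeneous Poisson process."""
    rng = np.random.default_rng(seed)
    rate_per_sample = rate_per_min / 60.0 / sfreq
    n_samples = int(duration_s * sfreq)
    expected = rate_per_sample * n_samples
    # Draw more intervals than expected so the cumsum spans the window.
    gaps = rng.exponential(1.0 / rate_per_sample, size=int(3 * expected) + 10)
    onsets = np.cumsum(gaps)
    return onsets[onsets < n_samples].astype(int)
```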

Reviewer #3 (Public review):

Summary:

Kern et al. critically assess the sensitivity of temporally delayed linear modelling (TDLM), a relatively new method used to detect memory replay in humans via MEG. While TDLM has recently gained traction and been used to report many exciting links between replay and behavior in humans, Kern et al. were unable to detect replay during a post-learning rest period. To determine whether this null result reflected an actual absence of replay or a lack of sensitivity of the method, the authors ran a simulation: synthetic replay events were inserted into a control dataset, and TDLM was used to decode them, varying both replay density and its correlation with behavior. The results revealed that TDLM could only reliably detect replay at unrealistically high (non-physiological) replay densities, and the authors were unable to induce strong behavior correlations. These findings highlight important limitations of TDLM, particularly for detecting replay over extended, minutes-long time periods.

Strengths:

Overall, I think this is an extremely important paper, given the growing use of TDLM to report exciting relationships between replay and behavior in humans. I found the text clear, the results compelling, and the critique of TDLM quite fair: it is not that this method can never be applied, but just that it has limits in its sensitivity to detect replay during minutes-long periods. Further, I greatly appreciated the authors' efforts to describe ways to improve TDLM: developing better decoders and applying them to smaller time windows.

The power of this paper comes from the simulation, whereby the authors inserted replay events and attempted to detect them using TDLM. Regarding their first study, there are many alternative explanations and possible analysis strategies that the authors do not discuss; however, none of these are relevant if replay cannot be detected even under conditions where it is synthetically inserted.

Additionally, the authors are relatively clear about which parameters they chose, why they chose them, and how well they match previous literature (they seem well matched).

Finally, I found the application of TDLM to a baseline period particularly important, as it demonstrated that there are fluctuations in sequenceness in control conditions (where no replay would be expected); it is important to contrast/calculate the difference between control (pre-resting state) and target (post-resting state) sequenceness values.

Weaknesses:

While I found this paper compelling, I was left with a series of questions.

(1) I am still left wondering why other studies were able to detect replay using this method. My takeaway from this paper is that large time windows lead to high significance thresholds/required replay density, making it extremely challenging to detect replay at physiological levels during resting periods. While it is true that some previous studies applying TDLM used smaller time windows (e.g., Kern's previous paper detected replay in 1500ms windows), others, including Liu et al. (2019), successfully detected replay during a 5-minute resting period. Why do the authors believe others have nevertheless been able to detect replay during multi-minute time windows?

For example, some studies using TDLM report evidence of sequenceness as a contrast between evidence of forward (f) versus backward (b) sequenceness, defined as Z_f(Δt) - Z_b(Δt), where Z refers to the sequence-alignment coefficient for a transition matrix at a specific time lag (a sketch of this contrast appears after these questions). This use case is not discussed in the present paper, despite its prevalence in the literature. If the same logic were applied to the data in this study, would significant sequenceness have been uncovered? Whether it would or not, I believe this point is important for understanding methodological differences between this paper and others.

(2) Relatedly, while the authors note that smaller time windows are necessary for TDLM to succeed, a more precise description of the appropriate window size would greatly improve the utility of this paper. As it stands, the discussion feels incomplete without this information, as providing explicit guidance on optimal window sizes would help future researchers apply TDLM effectively. Under what window size range can physiological levels of replay actually be detected using TDLM? Or, is there some scaling factor that should be considered, in terms of window size and significance threshold/replay density? If the authors are unable to provide a concrete recommendation, they could add information about time windows used in previous studies (perhaps, is 1500ms as used in their previous paper a good recommendation?).

(3) In their simulation, the authors define a replay event as a single transition from one item to another (example: A to B). However, in rodents, replay often traverses more than a single transition (example: A to B to C, even to D and E). Observing multistep sequences increases confidence that true replay is present. How does sequence length impact the authors' conclusions? Similarly, can the authors comment on how the length of the inserted events impacts TDLM sensitivity, if at all?

For example, regarding sequence length, is it possible that TDLM would detect multiple parts of a longer sequence independently, meaning that the high density needed to detect replay is actually not quite so dense? (example: if 20 four-step sequences (A to B to C to D to E) were sampled by TDLM such that it recorded each transition separately, that would lead to a density of 80 events/min).
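Returning to the Z_f(Δt) - Z_b(Δt) contrast raised under point (1): here is a sketch of that contrast together with the max-statistic permutation threshold conventional in this literature. The `seqfun` argument stands for any per-lag sequenceness routine (for example, the two-step GLM sketched earlier), the names and structure are ours, and brute-force enumeration of state relabelings is only feasible for small state spaces:

```python
# Sketch of differential (forward-minus-backward) sequenceness with a
# permutation-based significance threshold. For illustration only.
import numpy as np
from itertools import permutations

def differential_sequenceness(seqfun, T):
    """seqfun: callable mapping a transition matrix to per-lag
    (zf, zb) arrays. Returns the Z_f - Z_b curve and its threshold."""
    zf, zb = seqfun(T)
    diff = zf - zb
    k = T.shape[0]
    null_peaks = []
    for perm in permutations(range(k)):       # small k only
        T_shuf = T[np.ix_(perm, perm)]
        if np.array_equal(T_shuf, T):
            continue                          # skip structure-preserving shuffles
        zf_s, zb_s = seqfun(T_shuf)
        null_peaks.append(np.max(np.abs(zf_s - zb_s)))
    threshold = np.max(null_peaks)            # peak across lags, max over shuffles
    return diff, threshold
```

Lags where |diff| exceeds `threshold` would count as significant under this scheme; applying it to the present data would directly answer the question above.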
