Experimental procedure in the MEG.

Control: Eight minutes of resting state activity were recorded before participants encountered any stimuli. Localizer task: Next, the ten individual items were repeatedly presented to participants auditorily and visually to extract multisensory activity patterns. Learning: Participants learned triplets of the ten items, pseudo-randomly generated from an underlying graph, by performing a three-alternative forced-choice task. Participants were unaware of the graph layout. Consolidation: Eight minutes of resting state activity were recorded. Retrieval: Participants’ recall was tested by cueing triplets from the graph. The letters in the pictograms are placeholders for individual images. For details, see the Methods section and Figure 10. For an animated overview of the task and the experiment, see also the study’s GitHub pages: https://cimh-clinical-psychology.github.io/DeSMRRest-TDLM-Simulation/. Figure is reproduced from Kern et al. (2024), Figure 4.

Behavioural results and localizer decoding accuracy.

A) Memory performance of participants after completing the first block of learning, the last block of learning (blocks 2 to 6, depending on learning speed), and retrieval performance after the resting state. Participants completed up to six blocks of learning trials. Once a participant reached 80% accuracy in any block, no further learning blocks were performed (criterion learning). B) Decoding accuracy of the currently displayed item during the localizer task for participants with a decoding accuracy above 30% (n=21; 9 participants’ data did not pass this criterion). The mean peak time point across all participants was 210 ms, with an average peak decoding accuracy of 42% (n=21). Note that the displayed curve combines accuracies across participants, whereas peak values were computed per participant and then averaged; at the group level, the mean of the individual peaks therefore need not coincide with the peak of the average curve. C) Classifier transfer within the localizer when trained and tested at different time points, as determined by cross-validation. After ∼200 ms, most training and testing time points generalize to each other (significant clusters indicated by cluster permutation testing, alpha<0.05). Figure is reproduced from Kern et al. (2024), Figure 1.
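
The train/test-time generalization matrix in panel C can be computed along the following lines. This is a minimal scikit-learn sketch under our own assumptions about the data layout (`epochs` as trials × sensors × time points, `labels` as stimulus classes; a single train/test split stands in for the full cross-validation), not the study's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def generalization_matrix(epochs, labels):
    """epochs: (n_trials, n_sensors, n_times); labels: (n_trials,)."""
    n_times = epochs.shape[2]
    acc = np.zeros((n_times, n_times))
    train, test = train_test_split(np.arange(len(labels)), random_state=0)
    for t_train in range(n_times):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(epochs[train, :, t_train], labels[train])  # train at one time point
        for t_test in range(n_times):
            # ...and test at every other time point
            acc[t_train, t_test] = clf.score(epochs[test, :, t_test], labels[test])
    return acc  # rows: training time, columns: testing time
```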

Sequenceness results of the resting state sessions.

A) Forward and backward sequenceness values on the y-axis at different time lags (x-axis) for the control resting state (light colours) and the post-learning resting state (dark colours). Two significance thresholds are indicated: the more lenient 95% quantile of the maximum sequenceness of all permutations (lightly dotted line) and the maximum mean peak sequenceness across all permutations of all subjects (bold dotted line). Neither threshold was surpassed. The peak mean sequenceness is marked by a vertical line in each condition. B) Difference between pre-learning and post-learning resting states across time lags. FDR-corrected paired t-tests at each time lag revealed no significant difference between conditions (without correction, two sequenceness differences were significant: forward 140 ms, p=0.039, and 220 ms, p=0.035). An additional cluster-permutation test found a non-significant cluster at 130-140 ms in the forward direction (p=0.43). Bands indicate two standard errors of the mean of the difference scores. C) T-value distribution of a sign-flip permutation test with 10,000 iterations. In each permutation, a random subset of subjects’ sequenceness values was flipped and the maximum t-value across lags was used to construct the empirical distribution. The true maximum t-value is indicated in red (non-significant in all cases). D) Individual participants’ retrieval memory performance relative to their sequenceness values at the peak mean sequenceness (indicated in A). Regression lines and 95% confidence intervals are fitted. For the correlation with performance deltas between the post- and pre-resting-state retrievals, see Supplement Figure 9. Note, however, that as our simulation shows, correlations of sequenceness with behavioural markers are likely to be underpowered and emerge only at very high replay rates or with much larger sample sizes. See our simulation discussion for a more detailed explanation of how such correlations may be inherently biased, with fluctuations in baseline sequenceness overshadowing individual scaling with behavioural markers. For individual sequenceness curves, see Supplement Figure 3. For within-session forward minus backward sequenceness, see Supplement Figure 10.
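
For illustration, the sign-flip procedure in C) can be sketched as follows, assuming `seq` holds per-participant sequenceness values as an (n_subjects, n_lags) array (names and shapes are our own, not the study's code):

```python
import numpy as np
from scipy import stats

def sign_flip_max_t(seq, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    true_t = np.abs(stats.ttest_1samp(seq, 0, axis=0).statistic).max()
    null = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1, 1], size=(seq.shape[0], 1))  # flip a random subset of subjects
        t = stats.ttest_1samp(seq * flips, 0, axis=0).statistic
        null[i] = np.abs(t).max()           # max across lags builds the null distribution
    return true_t, (null >= true_t).mean()  # observed max t-value and permutation p-value
```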

Sequenceness map of different segments.

Each of the eight minutes of resting state is shown on the y-axis, with different time lags on the x-axis. For each minute, we used either the first or the second 30-second half for the final calculation and the other half for the exploration data set. Sequenceness crossed the 95% permutation threshold only in minute 4 of the control resting state (forward direction, 100 ms time lag). However, the sequenceness score was negative, rendering its interpretation non-trivial. Note that we did not correct for multiple comparisons across segments. For sequenceness curves per segment, see Supplement Figure 4.

Schematic of procedure for inserting simulated replay into the control resting state.

First, for each stimulus, neural patterns are extracted at the peak of decodability during the localizer task. Participant-specific patterns are normalized by subtracting the average sensor values of seeing any other stimulus from the sensor values of seeing the stimulus of interest. Next, these subtle patterns are inserted into the control resting state at a fixed interval, following a one-step transition pattern according to the task sequence. A refractory period is kept between replay events to prevent overlap. For details, see the Methods section.
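
The insertion step can be sketched roughly as follows; this is a simplified illustration under assumed shapes (`rest` as sensors × samples resting-state data, `patterns` as one normalized sensor vector per item), and the refractory handling in the actual code may differ:

```python
import numpy as np

def insert_replay(rest, patterns, sequence, lag=8, refractory=100, start=0):
    """Add item patterns as one-step transitions at a fixed lag (in samples)."""
    data = rest.copy()
    t = start
    for a, b in zip(sequence[:-1], sequence[1:]):  # one-step transitions A->B
        if t + lag >= data.shape[1]:
            break
        data[:, t] += patterns[a]        # reactivation of the first item
        data[:, t + lag] += patterns[b]  # second item, `lag` samples later
        t += lag + refractory            # refractory period prevents overlap
    return data
```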

Sequenceness results of the hybrid simulation.

A) Simulation results for inserting replay at different densities: sequenceness at an 80-millisecond state-to-state time lag on the y-axis, at different replay densities on the x-axis, with the standard error shaded in orange. The 95% confidence interval significance threshold is reached at a density of ∼80 min-1, while the maximum permutation threshold is reached only around 130 min-1. B) Observed t-values and significance thresholds of a sign-flip permutation test, in which the maximum t-value across time lags of sign-flip-permuted sequenceness curves is used to construct a non-parametric null distribution (see Methods for details). Significance (p<0.05) is similarly reached at around 80 min-1. C) Sequenceness across time lags for replay at baseline (0 min-1) and simulated at densities of 40, 80 and 120 min-1 with a simulated time lag of 80 milliseconds. In this simulation, we assumed a temporally uniform distribution of replay throughout the resting state, as TDLM is relatively invariant to the temporal distribution (see Appendix A.1 and Appendix A Figure 1).
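
For reference, the two-level TDLM GLM that produces these sequenceness curves can be condensed as follows. This is a sketch after Liu et al. (2021), with `probs` as decoded state probabilities over time and `T` as the task transition matrix, not the authors' exact implementation:

```python
import numpy as np

def sequenceness(probs, T, max_lag=60):
    """probs: (n_samples, n_states); T: (n_states, n_states) transition matrix."""
    n_states = probs.shape[1]
    # second-level regressors: forward, backward, self-transitions, constant
    design = np.column_stack([T.flatten(), T.T.flatten(),
                              np.eye(n_states).flatten(),
                              np.ones(n_states ** 2)])
    fwd, bkw = np.zeros(max_lag + 1), np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        X, Y = probs[:-lag], probs[lag:]         # states at t and at t + lag
        B = np.linalg.pinv(X) @ Y                # first level: empirical transitions
        betas = np.linalg.pinv(design) @ B.flatten()
        fwd[lag], bkw[lag] = betas[0], betas[1]  # forward/backward sequenceness
    return fwd, bkw
```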

Correlation with behaviour is underpowered and strongly affected by systematic baseline fluctuations.

A) Relationship (Pearson’s r) between memory performance (behaviour) and sequenceness at 80 ms time lag on the y-axis, with replay density on the x-axis, for two scaling rules. Realistic scaling: Replay density was inversely scaled with participants’ performance to between 50 and 100% (i.e., the worst performer had 100% of the replay density while the best performer had only 50%). Maximal scaling: Replay density was scaled from 0 to 100% (i.e., the best performer had no replay inserted). The correlation did not reach significance even at the maximum density. Note, however, that the baseline correlation started at r=0.1; without this offset, the simulated effect would have reached significance at a density of around 200 min-1. B) Correlation of sequenceness computed across different time lags in the control condition (i.e., without inserted replay). Even when no replay is present, the correlation fluctuates substantially depending on the time lag at which it is computed, possibly due to baseline variations that fluctuate differentially between participants. C) Bootstrap power analysis showing how many participants would be necessary to detect a significant correlation between replay and performance in the maximally scaled scenario with a replay density of 130 min-1 (see the sketch below). 80% power to detect a correlation is reached at a sample size of around 160. D) Sequenceness at 80 milliseconds time lag in the control condition (left), in the realistic scenario with replay scaling from 50-100% (middle) and in the maximal scenario with scaling from 0-100% (right). Participants’ performance is indicated by increasing saturation of blue. While all participants show an increase in sequenceness with increased scaling, some participants start from a higher baseline sequenceness value, and an ordering according to performance is barely visible even in the maximal scaling case.
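
The bootstrap power analysis in C) follows this logic; a sketch with hypothetical arrays `perf` and `seq` holding per-participant performance and sequenceness values:

```python
import numpy as np
from scipy import stats

def bootstrap_power(perf, seq, n_subjects, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(perf), size=n_subjects)  # resample with replacement
        _, p = stats.pearsonr(perf[idx], seq[idx])
        significant += p < alpha
    return significant / n_boot  # fraction of significant resamples = estimated power
```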

Results of the fully synthetic simulation, akin to the one presented in Liu et al. (2021).

A) Mean sequenceness at the simulated time lag of 80 milliseconds. Thresholds are already reached at a much lower density of ∼19 events per minute, in line with the original MATLAB code (Supplement Figure 6). Note that due to scaling with performance, the actual average number of events per minute is slightly below the indicated density, explaining why the thresholds are reached at a slightly higher nominal density (i.e., ∼19 vs. ∼14). B) Four exemplar plots of sequenceness at densities of 0 (baseline), 40, 80 and 120 events per minute. Oscillatory behaviour of the sequenceness curves is minimal in the baseline condition. Maximum permutation thresholds are easily reached in the 40-events-per-minute case and surpassed several times at higher densities. However, it is exactly this lack of spurious confounds, together with overly strong pattern discriminability, that makes TDLM work well on a synthetic dataset. See Supplement Figure 6 to compare results with the original MATLAB simulation code.

Pattern discriminability between the target class and other classes impacts sequenceness sensitivity.

A) Histograms of log-scaled classifier probability estimates at peak decodability during the localizer (left, cross-validated), for reactivation events in the hybrid simulation (middle) and in the synthetic simulation (right). B) KDE plot of the Wasserstein distance of participants’ classifier probabilities. The hybrid simulation likely over-estimates how well the decoder detects a reactivation event, while the synthetic simulation over-estimates it by a large margin. C) Pattern discriminability as a function of the inserted pattern strength (akin to a signal-to-noise ratio). We scaled the magnitude of the inserted pattern in the synthetic simulation. Dotted lines indicate the estimated pattern strength of the empirical data and the hybrid simulation, as given by their mean Wasserstein distance. D) TDLM sequenceness increases with better pattern discriminability. However, it is questionable whether pattern discriminability during a resting-state reactivation event is higher than during the localizer.
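
The distance in B) can be computed directly with SciPy; a minimal sketch with stand-in probability samples (`p_target` and `p_other` are illustrative placeholders, not the empirical distributions):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
p_target = rng.beta(5, 2, size=1000)  # stand-in: classifier probabilities, target class
p_other = rng.beta(2, 5, size=1000)   # stand-in: probabilities of the other classes
# larger distance = reactivation events are easier to tell apart from baseline
d = wasserstein_distance(p_target, p_other)
```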

Detailed overview of the experiment procedure.

A) During the localizer task, a word describing the stimulus was played via headphones and the corresponding visual item was then shown to the participant. In 11% of trials, the auditory and visual cues did not match; in this case, participants were instructed to press a button upon detection (to sustain attention to the stimuli and check for inattention). B) Graph layout of the task. Two elements could appear in two different triplets. The graph was directed such that each tuple had exactly one successor (e.g., apple→zebra could only be followed by cake and not mug), but individual items could have different successors (zebra alone could be followed by mug or cake). Participants never saw the illustrated bird’s-eye view. C) During learning, in each trial one node was randomly chosen as the current node. First, its predecessor node was shown, followed by the current node; the participant was then given a choice of three items and required to choose the node that followed the displayed cue tuple. Feedback was provided after each choice. This process was repeated until the participant reached 80% accuracy within a block or had completed a maximum of six learning blocks. D) The retrieval followed the same structure as the learning task, except that no feedback was given. Figure is reproduced from Kern et al. (2024), Figure 5.

Normalized probability estimates at time points in the control resting state, for the target class whose pattern was added and for non-target classes.

Left: At time points where no pattern was inserted, both classes have similar probability estimates at baseline. Right: At time points where the stimulus-specific pattern was added, the decoded probability estimate is significantly increased. Note that the probability estimates of non-target classes are also slightly increased at the time point of pattern insertion, owing to slight shifts in sensor values that raise all probabilities by a small amount.

Mean classifier probabilities before and after normalization, per participant, for time points selected for replay pattern insertion.

Left: Variance in baseline probability estimates is already visible before pattern insertion, i.e., some participants show a higher baseline than others. These higher baseline probability values (or rather, their stronger absolute fluctuations) lead to falsely increased sequenceness values in the GLM betas. Right: After normalization, baseline probabilities are similar across all participants. Blue shading indicates participants’ performance, from lightest at 50% to darkest at 100% memory performance.

Distribution of raw sensor values before and after insertion of the simulation pattern.

As MEG measures differences in magnetic field strength, the signal is centred on zero. Adding a pattern that is extracted from a zero-centred distribution should not affect the distribution’s mean and hence still produces a valid MEG signal, as shown in the figure. However, a very small difference in spread between the two distributions can be seen, indicating that sensor values do not follow exactly the same distribution after the stimulus-specific pattern is added, which might explain the small increase in baseline probability for non-target classes in Figure 11.

Individual sequenceness curves for all participants in the control resting state (left) and the post-learning resting state (right), for forward (top) and backward (bottom) sequenceness.

Participants’ performance is indicated in shades of blue, from the lightest shade at 50% to the darkest shade at 100% retrieval performance. Note that an oscillatory component is present in all sequenceness curves; however, their phases and exact amplitudes appear shifted between participants.

Individual sequenceness curves for different segments.

A) Control resting state and B) post-learning resting state. Each segment is taken from minute N of the respective resting state. For each participant, either the first or the second 30-second interval of each segment was used. Two significance thresholds are indicated: the more lenient 95% quantile of the maximum sequenceness of all permutations (lightly dotted line) and the maximum mean peak sequenceness across all permutations of all subjects (bold dotted line). Sequenceness crossed the 95% permutation threshold in the control resting state in minute 4 (forward direction, 100 ms time lag). Note that the sequenceness scores are negative in this case, which would indicate a suppression of these specific sequences.

Sequenceness curves when not correcting for alpha oscillations.

An increase in oscillatory behaviour of the sequenceness curves can be observed. Alpha correction is usually applied to mitigate nuisance effects introduced by strong eyes-closed occipital alpha, by adding probability time courses shifted in 100 ms steps to the GLM (Y. Liu, Dolan, et al., 2021).
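
This correction can be sketched as adding a copy of the lagged design matrix shifted by one alpha cycle; a simplified single-nuisance version, assuming decoded probabilities sampled at 100 Hz (so 100 ms = 10 samples; names are illustrative):

```python
import numpy as np

def lagged(probs, lag):
    """Shift decoded probabilities (n_samples, n_states) forward by `lag` samples."""
    out = np.zeros_like(probs)
    out[lag:] = probs[:-lag]
    return out

def first_level_alpha_corrected(probs, lag, alpha_step=10):
    # design: regressor at the lag of interest plus a nuisance copy one
    # alpha cycle (100 ms) further, soaking up 10 Hz oscillatory structure
    X = np.hstack([lagged(probs, lag), lagged(probs, lag + alpha_step)])
    B = np.linalg.pinv(X) @ probs
    return B[:probs.shape[1]]  # keep only the betas of the lag of interest
```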

Rerun of original MATLAB simulation code.

A) Original TDLM (Y. Liu, Dolan, et al., 2021) simulation with a replay density of 2000 min-1. B) Re-run of the code with a realistic replay density of 15 min-1. At this density, the significance thresholds are barely reached. C) Baseline sequenceness with no inserted sequences; some participants reach significance even in the absence of replay, as can be seen from the confidence intervals. D) Left: At time points where no pattern was inserted, both classes have similar probability estimates at baseline. Right: At time points where the stimulus-specific pattern was added, the decoded probability estimate is significantly increased, cf. Figure 11. E) Probability estimates before and after insertion of the replay pattern, visualised as in Supplement Figure 1. The probability estimates at baseline are close to zero, while those at insertion time points are increased by several orders of magnitude. Using synthetic data probably overestimates the probability estimates of true events. Note that, unlike in our simulation, the replay time lag was here sampled from a gamma prior with a mean of 40 milliseconds, such that the majority of events were simulated at 40 milliseconds, with some events assigned a lag of 30 or 50 milliseconds.

Cross-validation accuracy during the localizer across different regularization values.

For each regularization value, a cross-validation was performed per participant as documented in the Methods section, evaluated at 150 to 250 milliseconds after stimulus onset. The best regularization value was found around C=9.1.
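
A sketch of such a regularization search with scikit-learn, using dummy data in place of the real localizer epochs (all shapes and values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 306))    # stand-in: 500 localizer trials x 306 MEG sensors
y = rng.integers(0, 10, size=500)  # stand-in: 10 stimulus classes

Cs = np.logspace(-2, 2, 20)        # candidate inverse-regularization strengths
scores = [cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5).mean()
          for C in Cs]
best_C = Cs[int(np.argmax(scores))]  # in our data, the optimum was near C = 9.1
```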

Visualization of the sensor patterns that were inserted into the resting state per class, averaged across all participants.

Patterns were created on a per-participant basis by taking the mean sensor values across all trials of the specific class at the peak time point of decodability during the localizer. From these patterns, the mean across all trials (the ERP) at the peak time point was subtracted, to remove activity related to general visual processing. Saturation indicates value magnitude (a.u.); red indicates positive sensor values, blue negative. Figure is reproduced from Kern et al. (2024), Figure 1—figure supplement 6.
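
In code, this pattern construction amounts to a class mean minus the grand mean at the decoding peak; a sketch under assumed shapes (`epochs` and `labels` are hypothetical names):

```python
import numpy as np

def class_patterns(epochs, labels, peak):
    """epochs: (n_trials, n_sensors, n_samples); labels: (n_trials,); peak: sample index."""
    at_peak = epochs[:, :, peak]  # sensor values at the decoding peak
    erp = at_peak.mean(axis=0)    # grand mean = general visual activity
    return np.stack([at_peak[labels == c].mean(axis=0) - erp
                     for c in np.unique(labels)])
```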

Individual participants’ retrieval performance delta (post-rest minus pre-rest retrieval) relative to their sequenceness values at the peak mean sequenceness (indicated in A).

Regression lines and 95% confidence intervals are fitted. While we found a non-significant relation between memory performance enhancement and post-learning forward sequenceness, we are cautious not to overinterpret these results. As noted in the section “Correlation with behaviour only present at high replay speeds”, this correlational measure oscillates heavily with baseline sequenceness fluctuations, and any true replay effect is likely to be overshadowed by such fluctuations.

Forward minus backward sequenceness within each resting state session.

Previous papers often report the subtraction of backward from forward sequenceness (fwd-bkw) as a means of removing oscillatory confounds that affect both sequenceness directions in synchrony. While required in early cross-correlation approaches (Kurth-Nelson et al., 2016), its validity in GLM-based frameworks depends on the assumption that confounds are global and in-phase across sensors. We observed that this assumption is violated in our baseline data, where spurious sequenceness occasionally diverges in opposite directions at specific time lags (e.g., ∼90 ms). In such instances, subtraction would increase the false-positive rate rather than suppress noise. In Figure 3B, we prioritized the comparison of pre-task versus post-task sequenceness within the same direction, as oscillatory confounds appeared more stable across time within a single direction than across directions within a single session. However, we consider both approaches valid. We provide the fwd-bkw plots here for completeness and comparison with previous literature. A) Forward minus backward sequenceness for the control (left) and post-learning resting state (right). B) T-value distribution of the sign-flip permutation test for the control (left) and post-learning resting state (right).

Power analysis of sequenceness significance for bootstrapped sample sizes.

A) Power map for state-permutation thresholds. Here, however, the bootstrap approach suffers from a conceptual problem: significance thresholds are defined by the permutation maximum and/or the 95th percentile of the maxima across all sequence permutations across participants. If we resample bootstrap participants from our existing pool, the computed maximum thresholds remain relatively stable across resamples, as the test only compares against the mean and disregards the standard deviation. B) The newly presented statistical approach is considerably more sensitive at higher sample sizes. Note that even then, 80% power is only reached with a replay density higher than 50 min-1 at a sample size of 60 participants. Additionally, the sign-flip permutation test assumes that the mean is zero. As we observed a non-zero mean due to spurious oscillations, we subtracted the mean sequenceness of the baseline condition from each participant before permuting to achieve a null distribution with mean zero; otherwise, we would have found significant replay effects in the baseline condition at larger sample sizes. Nevertheless, due to its higher sensitivity, the new sign-flip test is recommended over the previous sequence-permutation-based test. Colours indicate the power from 0 to 1 for different bootstrapped sample sizes and densities. 80% power thresholds are outlined in black.

TDLM is invariant to the macro-distribution of sequences within the recording.

A) Forward and backward sequenceness of replay events that were uniformly distributed (light orange/green) or “bursty”, i.e., temporally constrained to specific periods (dark orange/green). Shaded areas denote the standard error across participants. B) Probability distribution from which the starting point of each replay event was drawn. C) Histogram of empirical replay starting positions across all participants.

TDLM is invariant to the length of sequence replay trains under the assumption that the number of target transitions (i.e., single steps) is fixed.

We simulated replay either as two temporally separate A->B, B->C events (light orange/green) or as a single A->B->C event (dark orange/green), both yielding equivalent sequenceness. Shaded areas denote the standard error across participants.

TDLM is insensitive to which transitions are replayed and only sensitive to how many transitions are detected in total.

Here, we simulated transitions either sampled from the full graph (light orange/green) or taken from participant-specific transitions of trials that participants correctly remembered (dark orange/green). Shaded areas denote the standard error across participants.

Sequenceness for longer multi-step replay.

Here, we inserted longer replay trains (e.g., A->B->C->D->E in the 4-step case) without keeping the number of transitions fixed. To prevent overlap, an increase in replay length is accompanied by a longer refractory window, which reduces the maximum number of replay events per minute. TDLM becomes more sensitive under the assumption that (i) multiple sequence steps are reactivated and (ii) all sequence steps are decoded accurately. Notably, as mentioned before, only the number of single-step transitions is relevant for TDLM’s sensitivity. The indicated line shows sequenceness across time lags for replay simulated at an 80-millisecond time lag, with the participants’ standard error in the shaded area.