Experiment procedure in the MEG. Localizer task: The ten individual items were repeatedly presented to the participant auditorily and visually to extract multisensory activity patterns. Learning: Participants learned pseudo-randomly generated triplets of the ten items by trial and error. These triplets were determined by an underlying graph structure. Participants were unaware of the exact structure and graph layout. Consolidation: Eight minutes of resting state activity were recorded. Retrieval: Participants’ recall was tested by cueing triplets from a sequence. The letters in the pictograms are placeholders for individual images.

Task structure: A) During the localizer task, a word describing the stimulus was played back via headphones and the item was then shown to the participant. In 4% of trials, the audio and visual cues did not match; in these trials, participants were instructed to press a button (attention check). B) Graph layout of the task. Two elements could appear in two different triplets. The graph was directed such that each tuple had exactly one successor (e.g., apple→zebra could only be followed by cake and not mug), but individual items could have different successors (zebra alone could be followed by mug or cake). Participants never saw the illustrated bird’s-eye view. C) During learning, one node was randomly chosen as the current node in each trial. First, its predecessor node was shown, then the current node; the participant was then given a choice of three items and had to choose the node that followed the displayed cue tuple. Feedback was then provided to the participant. This process was repeated until the participant reached 80% accuracy in a block or completed the maximum of six learning blocks. D) Retrieval testing followed the same structure as the learning task, except that no feedback was given.
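
The successor rule in panel B can be illustrated with a small lookup table. This is only a sketch: the item "pen" and the two transitions listed are invented placeholders, not the layout used in the study. It shows that a (predecessor, current) tuple fixes the next item, while the current item alone does not.

```python
# Hypothetical transition table: (predecessor, current) -> successor.
# "pen" is an invented placeholder; the real graph layout is not reproduced here.
SUCCESSOR = {
    ("apple", "zebra"): "cake",
    ("pen", "zebra"): "mug",
}


def successor_of_tuple(predecessor, current):
    """Return the single item that follows a cue tuple."""
    return SUCCESSOR[(predecessor, current)]


def successors_of_item(item):
    """Return every item that can follow `item` when its predecessor is unknown."""
    return sorted({nxt for (_, cur), nxt in SUCCESSOR.items() if cur == item})


tuple_answer = successor_of_tuple("apple", "zebra")  # unambiguous
item_answers = successors_of_item("zebra")           # ambiguous without the predecessor
```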

A) Decoding accuracy of the currently displayed item during the localizer task for participants with a decoding accuracy higher than 30% (n=21). The mean peak time point across all participants was 210 ms, with an average peak decoding accuracy of 42% (n=21). Note that the displayed graph combines accuracies across participants, whereas peak values were computed on an individual level and then averaged. Therefore, the indicated individual mean peak does not match the peak of the group-level average. B) Memory performance of participants after completing the first block of learning, the last block (block 2 to 6, depending on speed of learning), and the test performance. C) Classifier transfer within the localizer when trained and tested at different time points, determined by cross-validation. D) Classifier transfer from the localizer session to the retrieval session when trained at different time points during training and tested at different time points during presentation of the first (predecessor) image cue during retrieval. For C and D: Within the white outline, classification was significantly above chance level (cluster permutation testing, alpha<0.05).
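
The note in panel A about individual versus group-level peaks can be made concrete with a toy calculation. The three accuracy curves below are invented for illustration and are not study data. Because participants peak at different latencies, the mean of the individual peaks is higher than the peak of the averaged curve.

```python
# Invented decoding-accuracy curves for three participants (not study data).
times_ms = [150, 200, 250, 300]
curves = [
    [0.20, 0.50, 0.30, 0.20],  # peaks at 200 ms
    [0.20, 0.30, 0.50, 0.20],  # peaks at 250 ms
    [0.50, 0.30, 0.20, 0.20],  # peaks at 150 ms
]

# Peak per participant first, then average across participants.
individual_peaks = [max(curve) for curve in curves]
mean_individual_peak = sum(individual_peaks) / len(individual_peaks)

# Average across participants first, then take the peak of the group curve.
group_curve = [sum(values) / len(values) for values in zip(*curves)]
group_peak = max(group_curve)
```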

A) Strength of forward and backward sequenceness across different time lags up to 250 ms during the 1500 ms window after cue onset. Two significance thresholds are shown: a conservative threshold, given by the maximum across all time lags of 1000 permutations of classification labels, and the 95th percentile (see Methods section for details). B) Permutation distribution of mean sequenceness values across 1000 state permutations. Observed mean sequenceness is indicated with a red line. C) Association between memory performance and mean sequenceness value, computed across all trials and time lags for each participant.
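
One way to read the two thresholds in panel A is sketched below. It assumes that each permutation contributes its peak sequenceness across time lags, and that the thresholds are the maximum and the 95th percentile of those peaks. The Methods section remains the authoritative description; the function name and the nearest-rank percentile are assumptions made for this illustration.

```python
def permutation_thresholds(perm_sequenceness):
    """Derive two thresholds from permuted sequenceness values.

    perm_sequenceness: one list per permutation, holding one value per time lag.
    Returns (conservative, percentile_95). Assumed procedure, see lead-in.
    """
    # Peak across time lags within each permutation, sorted ascending.
    peaks = sorted(max(lags) for lags in perm_sequenceness)
    conservative = peaks[-1]
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(peaks) + 99) // 100
    percentile_95 = peaks[rank - 1]
    return conservative, percentile_95


# Toy input: 100 "permutations" with a single lag each, peaks 0.01 ... 1.00.
toy = [[i / 100] for i in range(1, 101)]
conservative, percentile_95 = permutation_thresholds(toy)
```

An observed sequenceness value would then be compared against either threshold, with the maximum being the stricter of the two.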

Clustered Reactivation: A) Differential reactivation probability between off-screen items that were up to two steps ahead of the current stimulus cue vs. distant items that were more than two steps away on the graph, for trials with correct answers. Between 220 and 260 ms, the next items are simultaneously reactivated significantly more than items that are further away (p<0.05; permutation test with 10000 shuffles). B) Reactivation strength of items after retrieval cue onset by distance of items to the currently on-screen stimulus. A significant negative correlation between distance on a directional graph and reactivation strength can be seen (p=0.008). C) Same as B, but subdivided into trials in which participants answered correctly (left) and in which participants did not know the correct answer (right). A correlation between reactivation strength and distance can only be seen for successful retrieval (but see also limitations for a discussion of the low number of trials and participants in this sub-analysis). Mean probability values are marked by black dots. D) Example activations of a successful retrieval (left) and a failed retrieval (right), sorted by distance to the current cue. Colors indicate probability estimates of the decoders. Indicated time points in D) are relative to onset of the current image cue.
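
The near-versus-distant contrast in panel A can be sketched as follows. The six-item directed ring and the decoder probabilities are invented for illustration and do not reproduce the study's graph or data. Distance is counted in forward steps along directed edges, items up to two steps ahead count as near, and the differential is the mean near probability minus the mean distant probability.

```python
from collections import deque


def graph_distances(edges, start):
    """Breadth-first forward step counts from `start` on a directed graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def differential_reactivation(probs, dist, max_near=2):
    """Mean decoder probability of near items minus that of distant items."""
    near = [p for item, p in probs.items() if 1 <= dist[item] <= max_near]
    far = [p for item, p in probs.items() if dist[item] > max_near]
    return sum(near) / len(near) - sum(far) / len(far)


# Hypothetical directed ring A -> B -> ... -> F -> A; "A" is the on-screen cue.
edges = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": ["A"]}
# Invented decoder probabilities for the off-screen items at one time point.
probs = {"B": 0.30, "C": 0.20, "D": 0.10, "E": 0.10, "F": 0.10}

dist = graph_distances(edges, "A")
diff = differential_reactivation(probs, dist)
```

A positive difference means that nearby items are reactivated more strongly than distant ones. Testing it against shuffled labels, as in the caption, would be a separate step.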

Percentage of rejected trials for each participant. Artifacts were detected automatically by AutoReject. Where possible, channels were interpolated for the affected time span; otherwise, the trial was rejected.

Excluded participants based on decoding accuracy and memory performance during testing.

Decoding accuracy across time determined by a leave-one-out cross-validation per participant.

Number of learning blocks that each participant completed. Learning was stopped if participants reached at least 80% memory performance in a block or if they reached 6 blocks.
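
The stopping rule can be written as a short function. This is a minimal sketch in which the function name and the input format (one accuracy value per block, as a fraction) are assumptions.

```python
def blocks_completed(block_accuracies, criterion=0.8, max_blocks=6):
    """Count learning blocks run before the stopping rule applies.

    Learning stops after the first block with accuracy >= criterion,
    or after max_blocks blocks, whichever comes first.
    """
    for n, accuracy in enumerate(block_accuracies[:max_blocks], start=1):
        if accuracy >= criterion:
            return n
    return min(len(block_accuracies), max_blocks)


fast_learner = blocks_completed([0.50, 0.70, 0.85, 0.90])  # criterion met in block 3
slow_learner = blocks_completed([0.40] * 10)               # capped at six blocks
```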

Percentage of sensors relevant for each image across all participants (beta weight of sensor location not equal to zero). Larger/darker dots indicate that more participants’ decoders used information from this sensor. The largest dot indicates that this sensor was used for all participants for this image. The smallest/lightest dot indicates that almost no participant’s decoder used information from this sensor.
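
The dot sizes can be understood as a per-sensor count of participants with a non-zero decoder weight. The sketch below assumes one weight vector per participant for a single image. The values are invented, and the function name is hypothetical.

```python
def sensor_usage_percent(weights):
    """Percentage of participants whose decoder uses each sensor.

    weights[p][s] is the beta weight of sensor s in participant p's decoder
    for one image; a sensor counts as used when its weight is non-zero.
    """
    n_participants = len(weights)
    n_sensors = len(weights[0])
    return [
        100.0 * sum(1 for w in weights if w[s] != 0.0) / n_participants
        for s in range(n_sensors)
    ]


# Invented weights for two participants and three sensors.
usage = sensor_usage_percent([
    [0.0, 0.2, -0.1],
    [0.0, 0.0, 0.3],
])
```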