Behavioral evidence for memory replay of video episodes in the macaque
Abstract
Humans recall the past by replaying fragments of events temporally. Here, we demonstrate a similar effect in macaques. We trained six rhesus monkeys with a temporal-order judgement (TOJ) task and collected 5000 TOJ trials. In each trial, the monkeys watched a naturalistic video of about 10 s comprising two across-context clips, and after a 2 s delay, performed TOJ between two frames from the video. The data are suggestive of a non-linear, time-compressed forward memory replay mechanism in the macaque. In contrast with humans, such compression of replay is, however, not sophisticated enough to allow these monkeys to skip over irrelevant information by compressing the encoded video globally. We also reveal that the monkeys detect event contextual boundaries, and that such detection facilitates recall by increasing the rate of information accumulation. Demonstration of a time-compressed, forward replay-like pattern in the macaque provides insights into the evolution of episodic memory in our lineage.
Introduction
Accumulating evidence indicates that non-human primates possess the ability to remember temporal relationships among events (Templer and Hampton, 2013; Gower, 1992; Charles et al., 2004). The apes can remember movies based on the temporal order of scenes (Morimura and Matsuzawa, 2001) and keep track of the past time of episodes (Martin-Ordas et al., 2010), whereas macaque monkeys possess serial (Terrace et al., 2003) and ordinal positions expertise for multi-item lists (Chen et al., 1997) and are able to categorize sequences of fractal images by their ordinal number (Orlov et al., 2000). However, keeping track of and remembering the positional coding, and forming associative chaining (Templer et al., 2019; Long and Kahana, 2019) of lists of arbitrary items might diverge from how a semantically linked, temporally relational representation of real-life events is maintained and utilized.
In the human literature, it has been shown that episodes can be replayed sequentially on the basis of learned structures (Liu et al., 2019), sensory information (Michelmann et al., 2019), and pictorial content (Wimmer et al., 2019). These findings suggest the possibility that monkeys can rely on a similar mechanism in recalling events that are linked temporally during temporal order judgement (TOJ). However, the extent to which mechanisms of temporal order judgement overlap across humans and monkeys remains undefined. One hypothesis is that the macaques can similarly rely on a scanning model for information retrieval – akin to serial replay of episodes – to perform temporal order judgements (Gower, 1992; Charles et al., 2004). In this model, after encoding streams of events, the animal performs retrieval by replaying the stream of information in a forward direction. In this way, retrieval time (RT) would be positively correlated to the temporal distance between the beginning of the stream and the target location. Recent findings in rodents (Panoz-Brown et al., 2018) and in humans of memory replay during cued-recall tasks across fragments of video episodes are characterized, however, by replay that proceeds in a forward manner and is temporally compressed (Michelmann et al., 2019). A feature of this more sophisticated mechanism is that memory replay is a fluidic process that allows subjects to skip flexibly across sub-events (Michelmann et al., 2019). Subjects can omit non-informative parts of episodes and replay a shorter episode (shorter than physical perception) in memory, which contains less information. This interpretation is supported by other works on the mental simulation of paths (Bonasia et al., 2016) and video episodes (Michelmann et al., 2019). This latter model constitutes a global compression of parts of episodes (that allows skipping across sub-events) and is regarded as substantially superior to a strict forward-replay mechanism.
In order to simulate the dynamic flow of information that occurs in real-life scenarios, we used naturalistic videos as experimental material to study the mechanism of memory retrieval of event order in the monkeys. These videos are more realistic than the arbitrary items or images that were used in previous studies (Templer and Hampton, 2013; Naya et al., 2017). We used a temporal order judgement paradigm to examine whether and to what extent the pattern underlying memory retrieval conforms to a time-compressed, forward-replay mechanism. In each trial, monkeys watched a naturalistic video composed of two clips, and following a 2 s retention delay, made a temporal order judgement to choose the frame that was shown earlier in the video between two frames extracted from that video (Figure 1A). The two frames were either extracted from the same clip or from two different clips of the video. Given that analyses on response latency can provide insights into the extent to which the monkeys’ behavior might conform to the two putative replay models outlined above, we looked into the RT data. By applying representational similarity analyses (RSA), the Linear Approach to Threshold with Ergodic Rate (LATER) model and generalized linear models to the RT data, we examined the presence of replay-like behavioral patterns in the monkeys. Specifically, if monkeys recall the frames by their ordinal positions, this would imply a linear increase in their retrieval times. By contrast, if the memory search entails a complex processing of the content determined by their semantically linked, temporally relational linkage within the cinematic footage, we should observe evidence of some non-linear pattern.

TOJ task schema and RT results.
(A) In each trial, the monkey watched a video (8–12 s, comprising two 4–6 s video clips), and following a 2 s retention delay, made temporal order judgement between two probe frames extracted from the video. The monkeys were required to choose the frame that was presented earlier in the video for water reward. (B) Task performance of six monkeys. Proportion correct for the six monkeys (left); mean reaction times for three trial types (right). Error bars are standard errors of the means over monkeys. *** denotes p<0.001. (C) Linear plots of reaction time (RT) for each monkey as a function of chosen frame location, see also Table 1. (D) Linear plots of RT as a function of chosen frame location for each human participant, see also Supplementary file 4. In panels (C) and (D), black lines and orange lines refer to lists of non-primate video clips and primate video clips, respectively (with five repetitions collapsed for monkeys and two repetitions collapsed for human participants). All responses in the within-context condition are shown, with cyan and magenta dots denoting whether the chosen probe frames were extracted from Clip 1 or Clip 2, respectively.
Our results suggested that macaque monkeys might adopt a time-compressed, replay-like pattern to search within the representation of continuous information. This time-compression characteristic refers to durations of memory replay that are significantly shorter than the length of the videos. We found that while both species recall the video content non-linearly, there is an aspect of discrepancy between the two species in which the monkeys do not compress the cinematic events globally as effectively as in humans, whereas human participants possess an ability to skip irrelevant information within the video. Finally, we revealed that the monkeys can make use of context changes to facilitate memory retrieval, thus increasing their rate of information accumulation in a drift diffusion model framework.
Results
Human-like forward replay in macaques
All six monkeys learned to perform the temporal order judgement task with dynamic cinematic videos as encoded content (Figure 1B left and Video 1). The six monkeys performed the task with a significantly above chance level with an overall accuracy of 67.9% ± 1.5% (mean ± SD). The human participants performed the task on average at 92.7% ± 1.2% (Figure 1—figure supplement 1A, left). Note that there are two main kinds of TOJ trials: ‘within-context’ and ‘across-context’ trials (Figure 1A). Here, we are first concerned with the response times (RT) data from ‘within-context’ trials, which allow us to examine TOJ mechanisms, whereas RT data from ‘across-context’ trials were used to test for effects arising from context changes (event boundaries), as is discussed in subsequent subsections.
A monkey performing an example trial.
The monkey performs an across-context trial with a correct response (rewarded with liquid).
We first addressed our main hypothesis by examining changes in RT using only within-context trials. One possibility is that the monkeys perform TOJ by relying on some form of memory replay, recalling the events coded at specific ordinal positions in a serial manner. To test this hypothesis, in each trial we explicitly looked at the relationship between RT and the location of the frame within the video that the monkeys chose (‘chosen frame location’, as indexed by the ordinal frame numbers in the video). Considering a range of nuisance variables that might affect these relationships (see also full GLM results in Figure 6), we ran the linear regression analysis of reciprocal latency as a function of chosen frame location/temporal similarity, while including a range of variables as nuisance regressors for each monkey separately. We found a negative relationship between reciprocal latency and chosen frame location in all monkeys (all p<0.001 ; Figure 1C and Table 1 upper panel), and between reciprocal latency and temporal similarity (all p<0.001; Figure 2—figure supplement 1A and Supplementary file 2 upper panel). This suggests that the monkeys are systemically faster in identifying frames that are located earlier in the video. Moreover, we replicated these results when considering only correct trials or only incorrect trials separately (Table 1 middle/bottom panel for chosen frame location; Supplementary file 2 middle/bottom panels for temporal similarity), suggesting that the putative response latency patterns are not affected by the memory outcome. These patterns of result are also replicated using logarithmically transformed RT data (all p<0.001).
To address the direction and speed of memory replay, reaction times at retrieval were compared between TOJ frames that are within Clip 1 versus Clip 2. During TOJ retrieval, frames that were presented in Clip 1 (mean reaction time = 1.59 s) were retrieved significantly faster than those that were experienced in Clip 2 (mean reaction time = 1.88 s) (one-tailed t5 = −4.533; p=0.003; Cohen’s d = −0.54; 95% CI: –infinity to −0.158 s; log(RT): one-tailed t5 = −6.473; p=6.558 × e−4; Cohen’s d = −0.69; 95% CI: –infinity to −0.114 s). Again, these findings confirm that the replay of the video takes place in a forward direction.
Moreover, since the latency required to respond to the chosen frames is much smaller than the duration of the videos themselves, the replay of the video must have been conducted at a compressed speed (i.e., memory replay was faster than perception during video-watching). The difference in reaction time between the very first frame and the last frame was averaged at 942 ms (range: 468–1859 ms). This is equivalent to 94.2 ms to scan through each second of the video, and corresponds to a compression factor of 10.61 during replay in these monkeys (compression factor for each monkey: Jupiter = 13.59, Mars = 7.39, Saturn = 14.80, Mercury = 21.37, Uranus = 17.78, Neptune = 5.38, see Figure 1C). This is comparable to a compression factor of 13.7 observed in humans, corroborating the notion of forward replay and those findings in humans (Michelmann et al., 2019).
The results suggest that the monkeys’ judgments are faster when responding to probe frames that are located in the earlier parts of the videos, and they also illustrate that the search might not follow a linear function. We formally tested for any ‘non-linearity’ of the fit. We began by testing for a linear relationship and progressively moved to higher-order relationships (quadratic, cubic). At each step, we assessed the significance of the new predictor, having accounted for all the variance fitted by the lower-level predictors. We reported the significance of different trends (linear, quadratic, cubic) for the two species (Supplementary file 3). This non-linear (quadratic and cubic) relationship in the monkeys is important because it rules out alternative explanations in which the main effects simply result from positional effects. Rather, the non-linear change in slope indicates that other factors are in play, and that the monkeys might group (or parse) the content according to the relational storyline structure (e.g., the two-clip video design), and do not merely recall the frames/items as of their ordinal positions in a fixed linear manner. In comparison, this analysis also highlights differences between human and macaque monkeys. We found that the human data fits best, and only, with the cubic model (Supplementary file 3), suggesting that humans might have treated the video and TOJ differently from the macaques. This difference is reflected in the global compression capability of the human participants (see next section).
In summary, we found that our monkeys might have performed TOJ of video episodes using a forward search of ordered elements in the mnemonic representation at the time of memory test with a non-linear, time-compressed function.
We also ran the same sets of analyses on human participant data for comparison between the two species. Showing a completely opposite pattern, regression analyses on human subjects showed a negative relationship between temporal similarity and reciprocal latency for all participants (all p<0.001, one subject with p=0.055, Figure 2—figure supplement 1B and Supplementary file 4 upper panel). These results imply that the more similar the two frames to be judged, the longer time needed for retrieve temporal order information. There was also no observable reciprocal latency/chosen frame location slope in the human data (if anything, it shows an opposite trend; see Figure 1D and Supplementary file 4 bottom panel). It is notable that when reciprocal latency as a function of chosen frame location is analyzed in the humans (as shown in Figure 2B compared with Figure 2A for monkeys) a very different pattern emerges, suggesting some form of mechanistic discrepancy between the species. We will examine these aspects in detail in the next section.

Moving average analysis based on reciprocal latency and accuracy for monkeys (left panel) and human participants (right panel).
(A) Reciprocal latency for monkeys as a function of chosen frame location for the average of all animals (upper panel) in the within-context condition, with the results for six individual monkeys are shown in the lower panel. The relationship between chosen frame location and RT follows a non-linear pattern. (B) Reciprocal latency for human participants as a function of chosen frame location for the average of all human subjects (upper panel) in the within-context condition, with results for individual subjects are shown in the lower panel. In panels (A) and (B), the shaded region denotes confidence intervals. (C) Proportion of correct answers for individual monkeys as a function of target frame location in the within-context condition. (D) Proportion of correct answers for individual human subjects as a function of target frame location in the within-context condition. In panels (C) and (D), the horizontal blue lines denote chance-level accuracy. Blue vertical lines in these plots denote the mean boundary location between Clip 1 and Clip 2 (116th frame).
We also ran a sliding-window average analysis to illustrate how accuracy varies as a function of the target frames' location within within-context trials (Figure 2C for monkeys; Figure 2D for humans). It is clear that humans can make use of the across-clip boundary to facilitate their TOJ. Interestingly, in paradigms that use discreet images for encoding, the TOJ performance reported in the human literature is distinct from those making use of continuous streaming movie material. Specifically, it has been reported that accuracy and response times are worse in TOJ for items are that separated by a boundary during encoding in the literature (Ezzyat and Davachi, 2014; Heusser et al., 2018), whereas we show that TOJ performance is better for frames taken from an across-context condition than for frames taken from a within-context condition in both the proportion of correct answers and RTs (Figure 1—figure supplement 1A). This might be because remembering ‘discrete segment of events’ is facilitated by making use of the contextual or perceptual cues in the video frames in across-context condition. We will pursue this aspect further in a latter part of this manuscript, in which we focus on contrasting within-context trials with across-context trials (results and discussion related to Figure 5).
We also noted that the accuracy might vary along the course of the videos. In order to rule out the possibility that the putative RT effects are not driven by a trade-off in accuracy, we segmented each video into four segments (i.e., each clip was segmented into two segments on the basis of target frame location) and calculated the inverse efficiency score [RT (ms)/percentage correct (%)] for each segment for each individual. The monkeys showed a mild numeral increase in inverse efficiency score across the four segments but this trend did not reach statistical significance (Figure 1—figure supplement 1B, left panel) [F (3, 20) = 0.10, p=0.96], suggesting that the increase in RT towards the end of the video did not contribute to better accuracy. By contrast, we found that the humans showed a lower inverse efficiency score for the video parts immediately after a boundary (bars 2 and 3 in Figure 1—figure supplement 1B, right panel) [F (3, 24) = 4.17, p=0.016, a post hoc test shows significant difference between bar 2 and bar 3). This pattern of boundary effect aligns with the little ‘blip’ in proportion correct that occurs shortly after the beginning of Clip 2 in the humans (Figure 2B). These characteristics might be related to their ability to detect the boundaries.
One sample t-tests of the slopes of reciprocal latency as a function of chosen frame location for each monkey after having entered a range of nuisance variables as regressor-of-no-interest (see also Figure 6).
The three panels correspond to analyses performed using all trials (top), only correct trials (middle), and only incorrect trials (bottom). The same slope patterns were observed irrespective of response correctness.
Monkeys | Beta | SEM | t-statistics | p-value | 95% confidence interval Lower Upper | |
---|---|---|---|---|---|---|
Slope of reciprocal latency/chosen frame location tested against zero (all trials) | ||||||
Jupiter | –0.203 | 0.021 | –9.751 | <0.001 | –0.244 | –0.163 |
Mars | –0.369 | 0.025 | –14.950 | <0.001 | –0.417 | –0.320 |
Saturn | –0.157 | 0.027 | –5.810 | <0.001 | –0.210 | –0.104 |
Mercury | –0.207 | 0.052 | –3.958 | <0.001 | –0.309 | –0.104 |
Uranus | –0.164 | 0.022 | –7.595 | <0.001 | –0.207 | –0.122 |
Neptune | –0.197 | 0.031 | –6.361 | <0.001 | –0.257 | –0.136 |
Slope of reciprocal latency/chosen frame location tested against zero (correct trials) | ||||||
Jupiter | –0.185 | 0.025 | –7.393 | <0.001 | –0.234 | –0.136 |
Mars | –0.272 | 0.028 | –9.879 | <0.001 | –0.326 | –0.218 |
Saturn | –0.092 | 0.032 | –2.857 | 0.004 | –0.155 | –0.029 |
Mercury | –0.246 | 0.065 | –3.777 | <0.001 | –0.374 | –0.118 |
Uranus | –0.153 | 0.024 | –6.259 | <0.001 | –0.201 | –0.105 |
Neptune | –0.150 | 0.039 | –3.858 | <0.001 | –0.226 | –0.074 |
Slope of reciprocal latency/chosen frame location tested against zero (Incorrect trials) | ||||||
Jupiter | –0.175 | 0.027 | –6.619 | <0.001 | –0.227 | –0.123 |
Mars | –0.366 | 0.031 | –11.705 | <0.001 | –0.428 | –0.305 |
Saturn | –0.191 | 0.035 | –5.386 | 0.002 | –0.261 | –0.122 |
Mercury | –0.075 | 0.077 | –0.975 | 0.330 | –0.227 | 0.076 |
Uranus | –0.140 | 0.029 | –4.816 | <0.001 | –0.197 | –0.083 |
Neptune | –0.209 | 0.041 | –5.148 | <0.001 | –0.288 | –0.129 |
Discrepancy with humans: compression of replay is local but not global
It has been shown in the humans that memory replay is not a straightforward recapitulation of the original experience. Subjects can skip through their memories, on a faster time scale across segments of a video episode than within-segment, by skipping flexibly over salient elements such as video boundaries within episodes (Michelmann et al., 2019). We propose two possible models with respect to whether the compression is global or not over the whole video. If there is a global compression of the video during replay, the time to initiate replay of Clip 2 would be sooner than the endpoint of replay for Clip 1, as the animal would be able to skip over the whole of Clip 1 to the beginning of Clip 2 (Global-compression model, Figure 3A, right panel). However, if the monkeys are not equipped with the ability to skip video segments during the replay process, we would expect a linear increase of retrieval time with chosen frame location, irrespective of the boundary (Strict forward model, Figure 3A left). We tested statistically whether the time to initiate replay of Clip 2 was shorter than the duration of Clip 1.
We divided each video into eight equal segments and computed cross-correlations derived from pairs of averaged condition-wise RTs based on chosen frame locations using a representational similarity analysis (see section 'Representational similarity analysis (RSA)' in 'Materials and methods' for details). The RT for TOJ between each segment of the video increases linearly according to their position in encoding (Figure 3B, left). We tested these against a hypothetical ‘Strict forward’ model and found significant correlation with the Strict forward model (r = 0.66, p=0.009), but not with the Global compression model (r = −0.16, p=0.802) (Figure 3B, left). These statistics also remain significant for the Strict forward model when we divided the video into either 10 (p=0.030) or 14 equal segments (p=0.020). The same patterns are also obtained when considering correct trials (Strict forward model: r = 0.37, p=0.040; Global compression model: r = −0.02, p=0.545) or incorrect trials (Strict forward model: r = 0.45, p=0.010; Global compression model: r = −0.06, p=0.545) separately. Contrarily, these correlational patterns with the Strict forward model are not observed in the human subjects (r = −0.11, p=0.703), but rather we observe a trend favoring the global compression model instead (r = 0.39, p=0.069) (Figure 3B right). When contrasting the the two models using pairwise comparisons, the two models are both statistically significant for monkeys (p<0.01) and humans (p<0.05), confirming the significance of the winning model.

Model comparison using representative similarity analysis.
(A) Visualization of two candidate models as representational dissimilarity matrices (RDMs). Patterns of reaction time (rank-transformed values) as a function of chosen frame location for the the two hypothetical models. The colors of Clip 1 and Clip 2 evolve increasingly with the temporal progression of the video (left), and their respective hypothetical RDMs (right). The reduction in RT (indicated by an arrow) between Clip 1 and Clip 2 is defined as ‘offset’; the magnitude of such an ‘offset’ is arbitrary (but see further analysis in Figure 4). (B) We segmented the videos into eight equal segments, and the RDMs show pairwise Euclidean distances between these different segments for the species group average (monkeys: left; humans: right) and for each individual separately (monkeys: left bottom; humans: right bottom). RDM correlation tests between behavioral RDMs and two candidate RDMs show that the monkeys replay the footage using a Strict forward strategy (r = 0.66, p=0.009), and provide little evidence for the Global compression strategy (r = −0.16, p=0.802). Humans show an opposite pattern from the macaques. In the humans, the Global compression model shows a higher correlation with behavioral RDM (marginally insignificant r = 0.39, p=0.069) than with the Strict forward model (r = −0.11, p=0.703). Pairwise comparisons show that the two models are both statistically significant for monkeys (p<0.01) and humans (p<0.05). Error bars indicate the standard errors of the means based on 100 iterations of randomization. P values are FDR-corrected (***, p<0.001, **, p<0.01, *, p<0.05).

The Strict forward model provides a better fit to the RT data in monkeys but not in humans.
(A) ‘Offsets’ are defined as the magnitude of reduced RT when the frames were in Clip 2. 11 hypothetical models with their reaction time patterns (top) and RDMs (bottom). We systemically varied the ‘offset’ parameter while keeping a constant slope. These models progressively range from an absolute Global compression model (model 1, most left) to a Strict-forward model (model 6, middle), and beyond (7th to 11th models, right). The numerals below the RDMs denote the magnitude of the respective offsets. (B) Each monkey’s data were tested against each of these 11 hypothetical models. The Spearman correlations increase as a function of offset magnitude between Clip 1 and Clip 2 until reaching an asymptote when the offset value is around zero, which corresponds to the Strict forward model (model 6 in panel (A); see also Figure 3). Individuals’ RT RDMs are shown in insets. (C) Each human participant’s data were also tested against each of these 11 hypothetical models. The Spearman correlations decrease as a function of offset magnitude between Clip 1 and Clip 2 until reaching an asymptote when the offset value is around zero. This analysis confirms the hypothetical discrepancy between the two species (see also Figure 3B).
To control for the confound that the two species might be differentially susceptible to the effect of having different numbers of trials in the experiment, we performed a control analysis that made use of only the first two repetitions of data to equate the number of trials between the two species. With these equated subsets of data, we re-calculated the correlation between monkey RT RDM and the two hypothetical models (i.e., the Strict forward model vs the Global compression model). The results showed a significant correlation with the Strict forward model (r = 0.56, p=0.049), but not with the Global compression model (r = −0.10, p=0.643) in the monkeys. With this smaller set of data, we also directly compared the correlations between the two models for the monkeys and found significant differences between them (p<0.05, FDR-corrected). These results replicated the findings observed when the full set of data were used, suggesting that more numerous exposures to the same stimuli did not affect the main results.
Factors modulating the model: ‘offsets’ for search and memory-search RT slope
We defined the reduced RT to initiate replay for Clip 2 as ‘offsets’ in initiating search in Clip 2 by skipping the non-informative Clip 1 (Figure 3A). With respect to the detailed differences between the two models, one may wonder whether and how the ‘offsets’ between Clip 1 and Clip 2 might influence the results. Especially for the Global compression model, changes of this parameter will cause changes in the RDMs. To address this concern, we simulated an array of RDMs by systemically varying the offset parameter and produced 11 hypothetical models, ranging from an absolute Global compression model (model 1, most left in Figure 4A) to a Strict-forward model (model 6, middle in Figure 4A), and beyond (7th to 11th models, right in Figure 4A). We then tested each individual monkeys’ data with each of these 11 models. The results show that the Spearman correlation values between the monkey’s data and hypothetical RDMs reach an asymptote of around r = 0.8 as the offset parameter tends to zero, and notably, that the correlation values only improve minimally with increasing offsets (Figure 4B, with each individual’s RT RDM displayed as an inset). These results suggest that the monkeys have processed the video as a holistic chunk of information rather than taking advantage of skipping the non-informative first clip when the two probe frames were in Clip 2. For comparison, we also tested human participant’s data against each of these 11 hypothetical models and found a completely opposite pattern in the humans (Figure 4C, with each individual’s RT RDM displayed as an inset). Taken together, we reveal a discrepancy between human and macaque performance in terms of their ability to compress past (irrelevant) information during TOJ.
Context changes (event boundaries) increase the rate of rise in decision information
Thus far, we have focused on how the monkeys retrieve the order of frames when information was equated within contexts, but how contextual changes might aid TOJ processes remains to be examined. It was evident that the monkeys retrieved the temporal order of frames with numerally different speeds for the three trial-types: across-Clip 1 and Clip 2 vs. within-Clip1 vs. within-Clip2[F (2, 15) = 2.32, p=0.132 (Figure 1B, right)]. Thus, we then compared the latency distribution of within-context and across-context conditions. and we hypothesized that a context shift would change the rate of rise of information accumulation (shift model) without altering the decision threshold (swivel model) within the Linear Approach to Threshold with Ergodic Rate (LATER) model (Figure 5A). We compared across-context and within-context trials specifically and fitted the two types of LATER models to each monkey’s data separately [Figure 5B, see section 'LATER (linear approach to threshold with ergodic rate) modelling' in 'Materials and methods'], together with a ‘two fits’ model, which supposes that the reaction times for the two conditions are independent of each other, and a null model, which assumes that there is no effect of manipulation. Using the Bayesian information criterion (BIC) as an index of model comparison, the results consistently indicate that the shift model is better than the swivel model for all six monkeys [range of ΔBIC = (14.57, 300.07); Supplementary file 3, see section 'Model comparison' in 'Materials and methods'). These results further indicate that contextual changes do not alter the judgement threshold for decisions (providing no evidence for a swivel pattern). By contrast, this pattern is not seen in the human participants (Figure 5C and Supplementary file 3), suggesting that the two species might not treat the information given by the event boundary during TOJ in the same manner. Within a drift diffusion model framework, the results suggest that monkeys accumulate information for memory decisions at a faster rate when the frames were extracted from two different clips than when the frames were extracted from the same-context clip.

LATER model fitting of RT in across-context and within-context conditions for both species.
(A) Cartoon of the LATER model cartoon illustrating that a decision signal triggered by a stimulus rises from its start level, at a rate of information accumulation r, to the threshold. Once it reaches the threshold, a decision is initiated. The rate of rise r varies from trial to trial, obeying the Gaussian distribution (variation denoted as green shaded area). (B) Contextual change effect on the distribution of response latency for the monkeys; data from Monkey ‘Mars’ was chosen for larger display. (C) Contextual change effect on the distribution of response latency for humans; data from Subject 1 was chosen for larger display. The red and blue dashed lines show the best fits (maximum likelihood) of across-context trials and within-context trials, respectively (see also Supplementary file 3).
Confirmatory GLMs and control analyses for the putative patterns
To verify whether the effects are attributed to basic stimulus features such as the perceptual differences inherent in the across-context condition. We then generated several generalized linear models to quantify the effect sizes of several principal variables (see section 'Generalized linear models (GLM)' in 'Materials and methods'). In the within-context condition, given that the monkeys would replay their experience to judge the relative temporal order of probe frames (‘replay hypothesis’), we used the temporal characteristics of probe frames, as represented by chosen frame location (or temporal similarity, which is essentially an inverse of frame location) as the independent variables. In the across-context condition, we included a perceptual similarity measure, which was based on feature points extracted by the SURF algorithm (SURF similarity, see Figure 6—figure supplement 1C), in the GLM to reflect the extent to which the monkeys were able to capitalize on using contextual boundaries for TOJ judgment. In addition, we also entered a number of independent variables as regressors: a binary regressor indicating whether the video includes primate content or not, a binary regressor indicating that a video is played forward or backward, the five repetitions (or two repetitions for humans) of the video-trials, physical location of the selected probe (left or right), time elapsed within a session, chosen frame location, temporal similarity, perceptual similarity (SURF), temporal distance, and response of the subjects (correct/incorrect).
The within-context GLM shows that monkeys' RT was indeed significantly faster when the probe frame was located earlier in the video, p<0.001 (or in equivalent terms, when two frames were temporally closer, p=0.004), confirming our main finding that the monkeys might have adopted a forward scanning strategy for information retrieval (Figure 6A, left). By contrast, the across-context GLM shows that there was no significant effect of the chosen frame location on RT. Rather, the monkeys retrieve their memories significantly faster for probe frames that are contextually (or perceptually) distinct, p=0.004 (Figure 6A , right). Human results are shown in Figure 6B for comparison.

Full GLM analysis including a number of variables that might affect reciprocal latency separately for within-context and across-context conditions.
(A) Monkey data. (B) Human data. We included ten regressors, namely, a binary regressor indicating whether the video category is primate or non-primate (video category), a binary regressor indicating that a video is played forward or backward (play order), the repeated exposure of the trial (Monkey: 1–5; Human: 1–2) (exposure), the physical location of the selected probe on screen (left or right) (touch side), time elapsed within a session (elapsed time; to rule out fatigue or attentional confounds), chosen frame location, temporal similarity, SURF similarity as a perceptual similarity measure (perceptual similarity), temporal distance between two probe frames, and response of subjects (correct/incorrect). In the monkeys, the results confirm that chosen frame location is the most significant regressor in within-context trials, whereas perceptual similarity is the most significant regressor in across-context trials. ***, p<0.001, **, p<0.01, *, p<0.05.
Finally, considering that the monkeys performed a larger number of trials than the human participants (5000 vs. 2000 trials), we plotted the accuracy data for each testing day for each monkey (and for humans too, Figure 7) and observed no apparent increase in performance over the course of the 50 testing days. This absence of performance change could also be due to the extensive training that the monkeys had received prior to this experiment (~21,150 trials in total for each monkey across 141 sessions on average, see Table 2). We therefore deem the main effects reported here as unlikely to be attributable to behavioral or strategic changes over the course of main experiment.
Discussion
In light of recent reports on the neural correlates underlying how humans and rodents replay their past experiences (Liu et al., 2019; Michelmann et al., 2019; Panoz-Brown et al., 2018; Davidson et al., 2009), here, we demonstrate parallel behavioral findings in macaque monkeys looking at dynamic cinematic material. Previous reports of macaques succeeding in TOJ indicated their ability to remember the order of events (Templer and Hampton, 2013; Martin-Ordas et al., 2010; Ninokura et al., 2003) and even to monitor the quality of representations of temporal relations among item images meta-cognitively (Templer et al., 2018). For example, Orlov et al., 2000 suggested that monkeys can categorize stimuli by their ordinal number to aid recall of order, and Templer and Hampton, 2013 showed that monkeys retrieve the temporal order information on the basis of the order of events rather than elapsed time.
One possible common mechanism underlying these performances is that monkeys use a forward search to identify targets in memory representation (Gower, 1992). Taking advantage of the latency data obtained during TOJ on naturalistic materials, we provide new behavioral evidence in support of the hypothesis proposed by Gower, 1992 that the monkeys can replay their memory in a serial forward manner. Our analysis further clarifies that this replay process is conducted in a time-compressed manner. Notwithstanding task differences, both humans and macaques execute retrieval with forward replay with a comparable compression factor (factors of ~11 in macaques vs. ~13 in humans, Michelmann et al., 2019) (but see also Wimmer et al., 2019). Another cross-species similarity rests on the observation that the RT patterns are independent of retrieval success. We have previously showed in analogous TOJ paradigms in humans that TOJ task-specific BOLD activation and behavioral RT patterns are independent of retrieval accuracy (Kwok et al., 2012; Kwok and Macaluso, 2015a). We interpreted these effects as process-based rather than content-based. At present, the latency results probably also point towards some ‘search’ or replay processes, and any ultimate incorrect responses are thus likely to be caused by memory noises injected during encoding and/or during delay maintenance.
Despite the cross-species similarity, our revelation of a critical discrepancy between humans and macaques carries an important theoretical implication: humans can do both local and global compression, whereas monkeys are not able to attain global compression. The implication is that mental time travel is not all-or-none. There could be multiple layers underpinning the concept of mental time travel, which entail the ability to relive the past (Suddendorf et al., 2009; Tulving, 1985) and skipping over unimportant information (Michelmann et al., 2019). The latter aspect allows humans to recall our memories flexibly far into our past and suggest powerful computational efficiencies that may facilitate memory storage and recall. By contrast, we did not find any evidence in the macaques of an ability to make use of salient boundary cues to skip unimportant details. It thus remains unknown how far back in time monkeys can replay their memories. It has been shown recently that humans can spontaneously replay experience on the basis of learned structures, with fast structural generalization to new experiences facilitated by representing structural information in a format that is independent of its sensory consequences (Liu et al., 2019). The lack of global compression in the monkeys of their video experience implies that the monkeys might not be able to use factorized representations to allow components to be recombined in more ways than were experienced (Behrens et al., 2018). However, by establishing that neither the number of intervening frames nor the passage of time per se determines the RT pattern (probably a mixed effect resultant from a combination of both), we ruled out order or positional memory as the underlying mechanism supporting TOJ in this task.
Although we have argued for the presence of replay-like patterns in the monkey during TOJ, we are aware that replay is a neural phenomenon supported by the activity of individual neurons and that implicates the offline reactivation of sequences of hippocampal place cells that reflect past and future trajectories (Pfeiffer and Foster, 2013; Jadhav et al., 2012; Ólafsdóttir et al., 2018). On the basis of the behavioral data presented here, the question of where the putative replay may be occurring anatomically is important for the broader field. On the one hand, in rodent research, replay has been linked to sharp wave-ripples in the hippocampal formation (Lee and Wilson, 2002; Foster and Wilson, 2006). On the other hand, recent MEG studies on humans' replay have implicated several cortical areas such as the occipito-parietal cortex (Michelmann et al., 2016), the vmPFC (Liu et al., 2019), and the visual cortex (Wimmer et al., 2019), in addition to the MTL including the hippocampus (Liu et al., 2019; Wimmer et al., 2019). These observations, made under simultaneous whole-brain recordings in humans, are in line with the idea that replay may be coordinated between the hippocampus and neocortical areas (Ji and Wilson, 2007).
A further caveat is that although most of the rodents studies on replay focus on spontaneous replay patterns at rest (Liu et al., 2019), during sleep (Lee and Wilson, 2002; Louie and Wilson, 2001) or during a task-free state (Foster and Wilson, 2006; Karlsson and Frank, 2009), our monkeys perform their replay-like recall of the videos as an effortful operation to solve a TOJ task. This is more akin to studies using stimuli embedded in episodes (Wimmer et al., 2019) or short video-episodes (Michelmann et al., 2016) for their ecological validity. Our combined results thus constitute a novel connection between various kinds of replay-like behaviors that are shared between rodents and humans, and provide a primate model for anatomical investigation. This cognitive discrepancy should be further elucidated using electrophysiological methods probing into the MTL (Davidson et al., 2009; Foster and Wilson, 2006; Karlsson and Frank, 2009; Diba and Buzsáki, 2007) and the neocortices (Naya et al., 2017).
We observed one further interesting phenomenon here, which is that the monkeys are able to detect contextual changes to facilitate TOJ. We show that the expedited TOJ in across-context condition was facilitated by contextual details, which in turn results in an increased rate of rise of signal towards memory decision in a drift-diffusion process. Humans studies show that contextual changes lead to segmentation of ongoing information (Magliano et al., 2001; Hard et al., 2006). Our results provide evidence consistent with event segmentation in the macaque monkeys and imply that these monkeys might be capable of parsing the footage using contextual information, akin to what has been shown in humans (DuBrow and Davachi, 2013; Sols et al., 2017; Ezzyat and Davachi, 2011; Kwok and Macaluso, 2015b) and rodents (Panoz-Brown et al., 2016).
Memory replay is an elaborate mental process and our demonstration of a time-compressed, forward replay-like pattern in the macaque monkeys, together with their primordial rigidity in compressing the experienced past, provides promise for mapping the evolution of episodic memory in our lineage.
Materials and methods
Subjects
Macaque monkeys
Request a detailed protocolSix male rhesus macaques (Macaca mulatta) (5.49 ± 0.5 kg) with a mean age of 3.5 years at the start of testing participated in this study. They were initially housed in a group of 6 in a specially built spacious enclosure (max capacity = 12–16 adults) with enrichment elements such as a swing and climbing structures present until the study began. The monkeys were then housed in pairs during the experimentation period according to their social hierarchy and temperament. They were fed twice a day with portions of 180 g monkey chow and pieces of fruits (8:30am/4:00pm). Water was available ad libitum except on experimental days. They were routinely offered treats such as peanuts, raisins and various kinds of seeds in their home cage for forage purpose. The monkeys were procured from a nationally accredited colony located in the Beijing outskirts, where the monkeys were bred and reared. The animals were thus ecologically naive to the natural wilderness and should not have had any previous encounter with other creatures except humans and their companion. The room in which they are housed is operated with an automated 12:12 (7am/7pm) light-dark cycle and kept within temperate around 18–23°C and humidity of 60–80%.
Human subjects
Request a detailed protocolSeven participants (mean age = 19.57 ± 1.13, 6 female) took part in the experiment. The participants were recruited from the undergraduate population in East China Normal University. The participants provided informed consent and were compensated 400 RMB for their time.
The experimental protocol was approved by the Institutional Animal Care and Use Committee (permission code: M020150902 and M020150902-2018) and the University Committee on Human Research Protection (permission code: HR 023–2017) at East China Normal University. All experimental protocols and animal welfare adhered with the ‘NIH Guidelines for the Care and Use of Laboratory Animals’.
Training history and task performance
Request a detailed protocolThere were five stages of training on the TOJ task and the numbers of days per monkey are reported in Table 2. With these extended periods of training, the monkeys’ performances were unlikely to change over the course of the main experiment (see also Figure 7).
Number of days accrued in each training stage.
When we trained the two additional monkeys (Uranus and Neptune), we made them skip the 4th and 5th stages entirely. The performance patterns of these two monkeys were not different from those of the initial four.
Monkeys | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 |
---|---|---|---|---|---|
Jupiter | 12 | 39 | 16 | 65 | 51 |
Mars | 29 | 7 | 15 | 48 | 42 |
Saturn | 25 | 42 | 31 | 64 | 95 |
Mercury | 13 | 38 | 17 | 61 | 49 |
Uranus | 14 | 17 | 11 | - | - |
Neptune | 13 | 21 | 11 | - | - |
Apparatus and testing cubicle
Macaque monkeys
Request a detailed protocolThe testing was conducted in an automated test apparatus controlled by two Windows computers (OptiPlex 3020, Dell). The subject sat, head-unrestrained, in a wheeled, specially made Plexiglas monkey chair (29.4 cm × 30.8 cm × 55 cm) fixed in position in front of a 17-inch infrared touch-sensitive screen (An-210W02CM, Shenzhen Anmite Technology Co., Ltd, China) with a refresh rate of 60 Hz. The distance between the subject's head and the screen was kept at ~20 cm. The touch-sensitive screen was mounted firmly on a custom-made metal frame (18.5 cm × 53.2 cm) on a large platform (100 cm × 150 cm × 76 cm). Water reward delivery was controlled by an automated water-delivery rewarding system (5-RLD-D1, Crist Instrument Co., Inc, U.S.) and each delivery was accompanied by an audible click. An infrared camera and video recording system (EZVIZ-C2C, Hangzhou Ezviz Network Co., Ltd, China) allowed the subject to be monitored while it was engaged in the task. The entire apparatus was housed in a sound-proof experimental cubicle that was dark apart from the background illumination from the touch screen.
Human subjects
Request a detailed protocolWe used the same model of computers and touch-sensitive screens for the human participants. The human subjects sat ~30 cm in front of another identical 17-inch infrared touch-sensitive screen, with stimuli presented with the same computer used in the monkeys’ experiment.
Source of video materials and preparation
Request a detailed protocolA collection of documentary films on wild animals was gathered from YouTube. The films were Monkey Kingdom (Disney), Monkey Planet (Episode 1–3; BBC), Planet Earth (Episode 1–11; BBC), Life (Episode 1–10; BBC), and Snow Monkey (PBS Nature). In total, 28 hr of footage was gathered. We applied Video Studio X8 (Core Corporation) to parse the footage into smaller segments. Experimenters then applied the following criteria to edit out ~2500 unique clips manually: 1) the clip must contain a continuous flow of depiction of events (i.e., no scene transition); 2) at least one living creature must be included; 3a) at least one of the animals contained must be in obvious motion; 3b) the trajectories of these motions must be unidirectional (i.e., no back and forth motion of the same subject); 4) clips with snakes were discarded. From this library, we then selected 2000 clips (all of 4 s – 6 s) for the final test, and a small number of additional clips were also prepared for the training stages. Each of the monkeys was assigned with a pseudo-randomized remix of a unique 1000 video-trials set from the 2000-video library, thus eliminating video idiosyncrasy associated with any experimental conditions. For each individual monkey, the videos assigned to be in an experimental condition were not used/shown in another condition, so that a particular monkey would only view that video repeatedly under the same condition.
Task and experimental procedure
Request a detailed protocolWe combined naturalistic material with a TOJ paradigm that is widely used in episodic memory research (Templer and Hampton, 2013; Manns et al., 2007; Kwok and Macaluso, 2015c). In each trial, the monkey initiated a trial by pressing a colored rectangle in the center of the screen (0.15 ml water). An 8–12 s video (consisting of two 4–6 s clips) was then presented (0.15 ml water), and following a 2 s retention delay, two frames extracted from the video were displayed bilaterally on the screen for TOJ. The monkeys were trained to choose the frame that was shown earlier in the video (see Video 1). A touch to the target frame resulted in 1.5 ml water as reward, the incorrect frame was removed, and the target frame remained alone for 5 s as positive feedback. A touch to the incorrect frame removed both frames from the screen and blanked the screen for 20 s without water delivery. As the monkeys could self-start the trials, we did not set an explicit inter-trial interval. Correction trial procedures were not used in the main test.
We collected 50 daily sessions of data. Each session contained 100 trials, giving us a 5000 trials per monkey. The 5000 trials contained a break-down of four factors: Boundary (Within vs. Across), Play Order (Normal vs. Reverse), Temporal distance (TD, 25 levels), and Exposure (R1–R5), giving out a 2 * 2 * 25 * 5 within-subject design. The two frames for TOJ could be extracted from the same clip (Within) or from distinct clips (Across). TD was an ordinal variable, with 25 levels ranging between a minimum of 1000 ms (equivalent to 25 frames) and a maximum of 3880 ms (equivalent of to frames). Each TD level increased progressively with three frames in each step. The ten lists contained five lists of primate animals and five lists of non-primate animals (Category: Primate/Non-Primate). The six monkeys were counter-balanced in their order of receiving the two kinds of material: three monkeys (Jupiter, Mars, and Saturn) were tested first on Non-Primate lists, whereas the other three (Mercury, Neptune, and Uranus) were tested first on Primate lists. The whole experiment lasted for 68 days. There were three blocks of reset (3 days, 9 days and 6 days) in-between testing days. The task was programmed in PsychoPy2 implemented in Python.
The same experimental design was adopted by the human participants. They re-used the 6 unique video-trials/list sets and TOJ frames (one unique set per monkey) correspondingly for the human subjects (the extra subject 7 re-used set 1). The only critical difference was that human participants performed only two exposures (R1–R2) rather than five. Identically, each session contained 100 trials, and with 20 daily sessions of testing, we acquired 2000 trials per participant for analysis.
Data analysis: modelling, temporal and perceptual similarity, and model comparison
Trials with RT longer than 10 s (1.45%) or shorter than 0.7 s (2.47%) were excluded from the analyses. Both correct and incorrect trials were entered for all analyses. For Neptune, data from one day of primate list 5 (exposure 4) and from two days of primate lists 4 and 5 (exposure 5) were lost because of machine breakdown and thus only 4700 trials were included for Neptune.
Representational similarity analysis (RSA)
Request a detailed protocolTo create RDMs for the representational pattern of response time, we evenly divided each video into eight segments on the basis of chosen frame location and averaged the response times within each segment for each monkey individually. For each matrix, we then computed the Euclidean distance of average response time for each pair of the segments. In order to show the relative distance among all pairs of segments intuitionally, we further transformed the matrix by replacing each element with the rank number in the distribution of all the elements, and linearly scaled into [0,1]. To compare similarity between RT patterns and the candidate models, we computed Spearman correlation between the model and group averaged RT RDMs with 100 randomized iterations by bootstrapping. We then used two-sided Wilcoxon signed-rank test to test the null hypothesis that the correlation between data RDM and the two hypothetical models are equal. The statistical threshold was set at p<0.05 (FDR corrected).
LATER (linear approach to threshold with ergodic rate) modelling
Request a detailed protocolThe observations that the brain needs more time than that requires for nerves to transport information and that trial-by-trial RTs vary considerably have stimulated researchers to make use of distribution of RT to examine mental processes (Magliano et al., 2001). Latency, as an indicator of decision processes, provides a source of insight into the underlying decision mechanisms (Louie and Wilson, 2001; Karlsson and Frank, 2009; Diba and Buzsáki, 2007). The Linear Approach to Threshold with Ergodic Rate (LATER) model is a widely used model that taps into these processes (Noorani and Carpenter, 2016). The LATER model stipulates that the winner signals that reach threshold faster would trigger the decision, resulting in shorter latency. Accordingly, changing the rate of rise would cause the line to ‘shift’ along the abscissa without changing its slope. By contrast, lines plotted by latency distribution would ‘swivel’ around an intercept when the threshold changes (Reddi and Carpenter, 2000; Figure 5A). To explore the underlying mechanisms of different experiment conditions, we plotted the reciprocal of RT (i.e., a multiplicative inverse of RT, 1/RT) as a function of their z-scores, thus making the distributions follow a Gaussian distribution (Noorani and Carpenter, 2011). Hence, we fitted the main component for each condition. The main component is a ramp to the threshold with rate of rise r. The distance between start level S0 and threshold ST is defined as θ, and the rate of rise r follows a Gaussian distribution of mean, μ, and standard deviation, σ1.
For model comparison, we fitted four different models to the data. The ‘null’ model fits the RT from within-context vs. across-context conditions with the same parameters, implying no effect of manipulation. The ‘two fits’ model set all the parameters to be free, which supposes that the RTs of the two conditions are independent of each other. The ‘Shift’ model only allows the slope of main component μ to change according to different conditions on the assumption that the manipulation will change the rate of rise. The ‘Swivel’ model only allows θ to change according to different conditions on the assumption that the subject sets different thresholds for different conditions.
Generalized linear models (GLM)
Request a detailed protocolWe ran GLM to compare the effect sizes of independent variables on the dependent variable. The mean (μ) of the outcome distribution depends on the independent variables , according to the following formula:
where is a distribution of outcomes, β is an unknown parameter to be estimated, and is a link function (Gaussian function). The dependent variable (Y) is reciprocal latency, and the independent variables are as follows: a binary regressor indicating whether the video includes primate content or not, a binary regressor indicating that a video is played forward or backward, the five repetitions (or two repetitions for humans) of the video-trials, physical location of the selected probe (left or right), time elapsed within session, chosen frame location, temporal similarity, perceptual similarity (SURF), temporal distance, and the response of the subjects (correct/incorrect).
Model comparison
Request a detailed protocolTo obtain the best fit among these models, we used Bayesian Information Criterion (BIC) as a criterion for model selection among these four models. The formula for BIC is: BIC = −2(logL) + numParam* log(numObs), where L is the maximum likelihood for the model, and numParam and numObs represent the number of free parameters and the number of samples, respectively. We computed ΔBIC as the strength of the evidence, which indicates the extent to which the selected model is superior to other models. Different ranges of ΔBIC show different level of evidence: a value of ΔBIC larger than two shows positive evidence, and a value of ΔBIC larger than six indicates strong evidence (Kass and Raftery, 1995).
Temporal similarity
Request a detailed protocolFor each trial, we calculated temporal similarity (TS) as an index of the discriminability of probe frames. Temporal similarity between two probe frames extracted from the video is calculated by the ratio of the two frames’ temporal separation between their occurrence in the video and the time of testing. The temporal similarity of any two memory traces can be calculated as: TS = delay2/delay1, where delay2 < delay1 (Brown et al., 2007).
Perceptual similarity
Request a detailed protocolWe made use of three main parameters to measure the perceptual dissimilarity between TOJ frames for the GLMs. First, RGB-histogram is computed as the Sum-of Square-Difference (SSD) error between image pairs for the three color channels (RGB). For each color channel, the intensity values range from 0 to 255 (i.e., 256 bins), and we computed the total number of pixels at each intensity value and then the SSD for all 256 bins for each image pair. The smaller the value of the SSD, the more similar the two images (image pair) were. Second, for HOG similarity, we constructed a histogram of directions of gradient over fixed-sized grids across the entire image. A vector is generated from each grid cell and correlated with HOG features from another image. Third, Speeded Up Robust Features (SURF) (Bay et al., 2008) uses Box Filter using integral images (Viola and Jones, 2004) to approximate Laplacian-of-Gaussian (LoG). Wavelet responses in both horizontal and vertical directions are used to assign orientation in SURF. SURF consists of fixing a reproducible orientation that is based on information from a circular region around the interest point. A descriptor vector is generated around the interest point using the integral image, which matches with descriptor vectors extracted from a compared image. SURF uses various scales and different orientation to identify unique features or key-points in an image. If the same feature exits in another image that is smaller or larger in size or even at a different orientation, SURF identifies that feature (or key-point) as corresponding or similar in both images (Bay et al., 2008; see Figure 6—figure supplement 1C for illustration). The Euclidean distance is used to measure the similarity between two descriptor vectors from images.
Data availability
All data is available at Dryad (https://doi.org/10.5061/dryad.3r2280gcc).
-
Dryad Digital RepositoryBehavioral evidence for memory replay of video episodes in macaque monkeys.https://doi.org/10.5061/dryad.3r2280gcc
References
-
Speeded-Up robust features (SURF)Computer Vision and Image Understanding 110:346–359.https://doi.org/10.1016/j.cviu.2007.09.014
-
A temporal ratio model of memoryPsychological Review 114:539–576.https://doi.org/10.1037/0033-295X.114.3.539
-
Impaired recency judgments and intact novelty judgments after fornix transection in monkeysJournal of Neuroscience 24:2037–2044.https://doi.org/10.1523/JNEUROSCI.3796-03.2004
-
Knowledge of the ordinal position of list items in rhesus monkeysPsychological Science 8:80–86.https://doi.org/10.1111/j.1467-9280.1997.tb00687.x
-
Forward and reverse hippocampal place-cell sequences during ripplesNature Neuroscience 10:1241–1242.https://doi.org/10.1038/nn1961
-
The influence of context boundaries on memory for the sequential order of eventsJournal of Experimental Psychology: General 142:1277–1286.https://doi.org/10.1037/a0034024
-
What constitutes an episode in episodic memory?Psychological Science 22:243–252.https://doi.org/10.1177/0956797610393742
-
Short-term memory for the temporal order of events in monkeysBehavioural Brain Research 52:99–103.https://doi.org/10.1016/S0166-4328(05)80329-8
-
Making sense of abstract events: building event schemasMemory & Cognition 34:1221–1235.https://doi.org/10.3758/BF03193267
-
Perceptual boundaries cause mnemonic trade-offs between local boundary processing and across-trial associative bindingJournal of Experimental Psychology: Learning, Memory, and Cognition 44:1075–1090.https://doi.org/10.1037/xlm0000503
-
Coordinated memory replay in the visual cortex and Hippocampus during sleepNature Neuroscience 10:100–107.https://doi.org/10.1038/nn1825
-
Awake replay of remote experiences in the HippocampusNature Neuroscience 12:913–918.https://doi.org/10.1038/nn.2344
-
Bayes factorsJournal of the American Statistical Association 90:773–795.https://doi.org/10.1080/01621459.1995.10476572
-
Hippocampal contributions to serial-order memoryHippocampus 29:252–259.https://doi.org/10.1002/hipo.23025
-
Indexing space and time in film understandingApplied Cognitive Psychology 15:533–545.https://doi.org/10.1002/acp.724
-
Speed of time-compressed forward replay flexibly changes in human episodic memoryNature Human Behaviour 3:143–154.https://doi.org/10.1038/s41562-018-0491-4
-
Memory of movies by chimpanzees (Pan Troglodytes)Journal of Comparative Psychology 115:152–158.https://doi.org/10.1037/0735-7036.115.2.152
-
Representation of the temporal order of visual objects in the primate lateral prefrontal cortexJournal of Neurophysiology 89:2868–2873.https://doi.org/10.1152/jn.00647.2002
-
Full reaction time distributions reveal the complexity of neural decision-makingEuropean Journal of Neuroscience 33:1948–1951.https://doi.org/10.1111/j.1460-9568.2011.07727.x
-
The LATER model of reaction time and decisionNeuroscience & Biobehavioral Reviews 64:229–251.https://doi.org/10.1016/j.neubiorev.2016.02.018
-
The role of hippocampal replay in memory and PlanningCurrent Biology 28:R37–R50.https://doi.org/10.1016/j.cub.2017.10.073
-
Rats remember items in context using episodic memoryCurrent Biology 26:2821–2826.https://doi.org/10.1016/j.cub.2016.08.023
-
Replay of episodic memories in the ratCurrent Biology 28:1628–1634.https://doi.org/10.1016/j.cub.2018.04.006
-
The influence of urgency on decision timeNature Neuroscience 3:827–830.https://doi.org/10.1038/77739
-
Mental time travel and the shaping of the human mindPhilosophical Transactions of the Royal Society B: Biological Sciences 364:1317–1324.https://doi.org/10.1098/rstb.2008.0301
-
Co-operation of long-term and working memory representations in simultaneous chaining by rhesus monkeys ( Macaca mulatta )Quarterly Journal of Experimental Psychology 72:2208–2224.https://doi.org/10.1177/1747021819838432
-
Serial expertise of rhesus macaquesPsychological Science 14:66–73.https://doi.org/10.1111/1467-9280.01420
-
Memory and consciousnessCanadian Psychology/Psychologie Canadienne 26:1–12.https://doi.org/10.1037/h0080017
-
Robust Real-Time face detectionInternational Journal of Computer Vision 57:137–154.https://doi.org/10.1023/B:VISI.0000013087.49260.fb
Decision letter
-
Joshua I GoldSenior Editor; University of Pennsylvania, United States
-
Morgan BarenseReviewing Editor; University of Toronto, Canada
-
Bryan StrangeReviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Memory replay enables humans recall the past in a flexible and temporally-compressed manner. The extent to which this process is shared with nonhuman primates is unknown. Kwok and colleagues demonstrate that, like humans, macaque monkeys temporally compress past experiences with a non-linear forward-replay mechanism. However, replay in macaques was strictly forward; unlike humans, macaque monkeys lacked a global compression mechanism that enabled the flexibility to skip irrelevant information. This work helps to map the evolution the episodic memory.
Decision letter after peer review:
Thank you for submitting your article "Behavioral evidence for memory replay of video episodes in macaque monkeys" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Joshua Gold as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Bryan Strange (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
All reviewers agreed that this is an interesting and timely paper, and they universally congratulated you on the Herculean effort involved in conducting this research. The reviewers did raise several issues that must be addressed prior to publication. Their reviews are included below.
The manuscript presents a number of very complex analyses, which were at times difficult to digest. In the revision, we encourage the authors to improve the overall clarity of the manuscript by providing more (and clearer) details on the methods. Each reviewer has highlighted specific requests for more information and additional analyses to clarify results. Clearer linkage between the Materials and methods and corresponding Results would also help to improve the manuscript's clarity.
The reviewers raised questions about the behavioural performance measures (e.g., the need to more carefully consider the accuracy data, the inclusion of both correct and incorrect trials, and how stimulus repetition may affect the results). In several instances, the reviewers suggested additional interpretations of the data.
We hope that these comments are helpful as you prepare your revision, and look forward to receiving the revised manuscript.
Reviewer #1:In this manuscript, Zuo and colleagues present novel behavioural data from macaque monkeys with the aim of investigating the presence of replay during temporal order judgments for previously seen video clips. The question and paradigm are interesting and timely. The comparisons between monkey and human strategies and performance are especially intriguing. However, I have some questions about the behavioural performance and measures. The statistical models should also be reported more clearly and the link to predictions should be more explicit.
1) Additional details about the testing paradigm and overall behavioural results are needed. Were the monkeys trained ahead of time (and if so, how long did the training take)? With such a protracted period, it would be interesting to know whether the monkeys' performance improved or changed over the course of the experiment. As their performance is not at ceiling level, it seems possible that they improve overall, or that they start relying on different aspects to make their judgments. Day of testing could therefore also be included as a predictor in the models.
Regarding the monkeys' accuracy: in the subsection “Task performance”, the authors report that the monkeys' overall accuracy was ~68%. Was this the overall accuracy calculated across all 50 testing days? In contrast, the monkeys' accuracy plotted in Figure 2C is barely above chance level – while it's clear that the pattern displayed in the figure will be noisy as the data are split according to target frame location, the lines hover close to chance level and don't seem to average out to 68%. I may be missing something, but unless the data in Figure 2C represent only a subset, for example, this needs to be explained in more detail. Related to the point above, it would generally be helpful if the authors provided a measure of the monkeys' accuracy over the course of testing.
This is also important as the human participants' accuracy was at ceiling after a single testing session (based on the aforementioned subsection) – did the trend of the monkeys' performance remain the same across the 50 testing days (i.e., no global compression), or did this only develop after extensive experience on the task?
2) On a related note, it seems confusing to include both correct and incorrect trials in most of the analyses. In the GLM analyses (subsection “Human-like forward replay in macaques”, Supplementary file 2), the patterns were indeed quite similar for correct and incorrect trials, but it still seems unusual to include trials where the monkeys may not have replayed the content correctly as they reached an incorrect decision. Further, the different amounts of data entered into different models (correct trials only vs. all trials) make it more difficult to compare them. I would recommend that only correct trials are included in all analyses, or additional control analyses are needed to ensure that all findings remain the same when only correct trials are used.
3) The comparison between monkey and human data in Figure 3 is especially interesting. It is intriguing to see that human participants display global compression, but monkeys do not, and this interpretation seems to be well-supported by the data. However, the authors state that monkeys 'reach their memory decision threshold more quickly when probe frames are extracted from the two different contexts' and that these findings 'parallel closely established findings in the humans'. This pattern is in fact the opposite to that observed in humans; when two stimuli are separated by a boundary, humans are slower to reach a decision (e.g. Ezzyat and Davachi, 2014; Heusser et al., 2018, to name just two recent papers). This interpretation should be revised as the data from the present manuscript in fact suggest that boundaries affect human and monkey decision making processes in opposite directions.
4) The manuscript contains a relatively large number of (at times complex) analyses. As the more complex models are such an important aspect of the paper, they should be explained in more detail and the link between Materials and methods and Results should be clear. I found it somewhat difficult to align the description of the analytic approach in the Materials and methods section with the reporting of the results. In general, more detail in the description of the methods is needed. This especially applies to the model and variable descriptions. A few key points below:
– In the subsection “Task and experimental procedure”, the authors state that the experiment comprised four factors: boundary, play order, temporal distance, and exposure. First, I found it unusual that the authors reported temporal distance as a factor with 25 levels, since modeling it as a categorical variable assumes that all 25 levels are independent of one another. If this was not the case, it should be clarified in text, but if this variable was indeed modeled as a categorical factor, this should be changed to continuous or ordinal.
Second, in the subsection “Generalized linear models (GLM)”, the authors then report a set of variables included in the GLM. If I'm not mistaken, these are only regressed out of the very last GLM (subsection “Confirmatory GLMs for the putative patterns”). While the evidence from this analysis is clear and the significance of the key factors of interest does not change, I think including these regressors in each of the analyses and treating them as nuisance regressors would be more convincing/parsimonious.
– The description of the LATER model fitting was somewhat confusing. For example, it wasn't immediately clear that 'both conditions' (subsection “LATER (linear approach to threshold with ergodic rate) modelling”) refers to within vs. across clips. I also found it somewhat confusing that one of the models was called 'unconstrained' in the Materials and methods, but 'two fits' in Supplementary file 5 (if that is indeed the same model). Finally, in the Materials and methods, the authors refer to Figure 4A, which I believe should be Figure 6A. I am not a modeler, but it would nonetheless be important for the model descriptions to be linked back to the experimental predictions to make the connection between model parameters and behaviour clearer.
– How exactly was reciprocal latency calculated? I was not familiar with the term beforehand so I looked it up in the literature, but I would suggest that all key variables are defined in an accessible manner.
5) Related to the above: in the Introduction, the authors say that a 'non-linear' pattern would be predicted, which led me to expect a comparison of linear and non-linear model fits. Similarly, in the Materials and methods, the authors state that they used BIC to 'obtain the best fit among these models' – as this section immediately follows the section on GLMs, I assumed the models would be compared. However, the only BIC value reported is that comparing the 'shift' and 'swivel' models. I believe BIC can be used to compare any model types (i.e. linear and non-linear), but if this is not the case, the approach should be clarified. Unless I'm missing something, it appears that the reciprocal latency was modelled in a linear model (subsection “Human-like forward replay in macaques”), but the 'non-linearity' of the fit was only assessed visually. As the linear analyses all had significant outcomes, it seems important to provide a benchmark of 'goodness of fit' for different models. Reporting the significance of different trends (e.g. linear, quadratic, cubic) could be helpful.
Reviewer #2:In this study, Zuo et al. examined existence of memory replay during the retrieval of video clip material encoded continuously during periods of ~10 seconds. The study combined several computational modeling approaches and complex analytical procedures on reaction time data to test whether monkeys showed forward replay of the encoded material and whether the replayed memory content showed a structured pattern modulated by context changes during encoding, as shown previously in humans. Their results, though similar with humans in many aspects, also revealed striking differences, being the Inability to skip irrelevant Information In their replay a major one.
I found the study very well written and the topic of research of interest among many in the neuroscientific community. The analytical approach is sophisticated and the implementation sounding. I do have however some concerns that would require further attention though, which are listed below:
1) Many of the findings described in human studies related to sequence of images or video clips that are presented only once. Here, animals are shown repeatedly with the video clips. To what extent the repetition is affecting the underlying neural mechanisms and consequently the actual findings? I am aware the study included somehow this in their GLM model (Supplementary Figure 3) but in my view, this should require some more work. The question that one would like to see answered here would be: To what extent differences between humans and macaques are susceptible to be affected by the large amount of repeated material used in monkeys? I am aware many of the analysis included in the study require large number of trials for each individual but maybe authors can explore this issue across participants?
2) Several details concerning the experimental design tested in humans are lacking. Which are the differences between the two species when it comes to the experimental design? This information should be clear in the manuscript so that differences and similarities between species can be fully evaluated.
3) I was expecting that most of the analysis were implemented in monkeys' and in humans' data. Is there any particular reason to skip some of them in humans RTs (for example: effects of context change (within vs. across) and GLM confirmatory analysis)?
4) Correlation results between models should be directly compared and show they differed significantly to be able to attribute a winner one (i.e., Figure 3).
5) I found the results showing that many of the effects were equally robust for correct and incorrect trials a bit confusing. In my understanding, the behavioural manifestation of how memory content is organized and replayed should be specifically evident for when retrieval access has been successful, as otherwise it may be difficult to discard the possibility that the observations are driven by a more general task oriented operation. Can the authors please justify why it would be relevant that many of their central findings were valid for correct and incorrect trials? And if so, wouldn't it be also relevant to show the same results in humans' data?
Reviewer #3:By training 6 macaques on a cinematic video-clip task, Kwok and colleagues have leveraged reaction time (RT) data to make inferences on the ability of non-human primates to make temporal order judgment (TOJ). RT analyses, using LATER and drift-diffusion modelling, enabled to authors to suggest potential mechanistic underpinnings to the behavioural effects they observe. Irrespective of performance, RTs were faster if the still pertained to earlier segments of the video clip. The effect was non-linear, as the correlation was significant for log-transformed RTs. Furthermore, the relationship of still presentation latency to RT was around 10:1, which the authors interpret as time compression in replay. Humans, on the other hand do not show this relationship.
This is an interesting paper, and the cross-species differences are very clearly depicted and highly interesting. I commend the authors on what is a tremendous effort in terms of stimulus preparation and animal training. I have some comments that would need to be addressed before recommending publication.
1) The authors have concentrated mainly on RTs, but when accuracy is considered (plotted in Figure 2C) it is clear that for many target frame locations, the macaques are performing at chance (horizontal blue line). In at least 3 of the macaques, accuracy seems worse for earliest stills from clip 1. Could there be a speed-accuracy trade-off underlying faster RTs for these early stills? I appreciate that, overall, performance was above chance (it is somewhat atypical to report this at the beginning of the Materials and methods section), but the possible confound of monkeys making fast responses with little memory content to these early still probes needs addressing.
2) Furthermore, there is clearly an improvement in accuracy for stills that are at the beginning of clip 2. The authors mention this as "a blip" but provide no statistics. This is an interesting boundary effect that could be reported better and integrated with point 3.
3) Representational similarity analyses were used to demonstrate that "global compression" of individual video clips is not evident in macaques, who appear to show increasing RTs to stills drawn from over the course of the 2 clips (i.e. the strictly forward model). There is a non-significant trend towards global compression effects in humans. It is clear, though, that macaques and humans respond differently. What makes things a bit confusing is that LATER modeling indicated that macaques show an important boundary effect. Memory decision threshold is reached more quickly if probe stills come from different clips. I wonder whether this has something to do with perceptual similarity effects reported later for the GLM analyses, but I did not get a good feel for what this perceptual similarity parameter is measuring.
4) I am undecided as to whether Figure 5 and associated Results section really add to the findings. It is clear from the preceding figures that slope is markedly different in the two species.
5) In view of the indirect nature of the inference, certain statements such as "The monkeys apply a non-linear forward, time-compressed replay mechanism during the temporal-order judgement” (Abstract) need to be toned down.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your article "Behavioral evidence for memory replay of video episodes in the macaque" for consideration by eLife. Your revised article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Joshua Gold as the Senior Editor The following individual involved in review of your submission has agreed to reveal their identity: Bryan Strange (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.
The reviewers were very satisfied with the revisions and we would like to congratulate you on a fine paper. The number of testing days for each animal is testimony to Herculean effort put into this work, and we believe that it will be a very important contribution to understanding cross-species differences in memory replay.
It was noted that with the additional analyses, the paper now shows more robust evidence for a forward scanning strategy in non-human primates. The additional analyses regarding training/testing session over time were also appreciated, in particular including response accuracy as a covariate and the trend analysis.
Only a few relatively minor comments remain:
1) The observation that the findings largely hold up when the data are split by correct and incorrect trials (Table 1) certainly supports the authors' decision to include all trials in their analysis, regardless of correctness. However, since the findings are so similar for correct and incorrect trials, the authors may wish to discuss why this pattern might be observed even on incorrect trials. Is the assumption that the monkeys are replaying the content correctly but then reaching an incorrect decision or that the information was incorrectly encoded? In other words, if the latency data reflects memory processes, an incorrect decision here would suggest that the initially encoded temporal order was incorrect. Either way, this seems like an interesting finding and Discussion point.
2) Regarding the analysis of linear, quadratic, and cubic trends: it is indeed encouraging to see that the non-linear (quadratic and cubic) trends are significant in the monkeys. However, interestingly, only the cubic trend seems to be significant in the human sample (linear and quadratic are not). Since one of the important contributions of this paper is a direct comparison between monkeys and humans, we think it would be helpful if the authors also addressed this difference in the manuscript. We also suggest that the manuscript text more explicitly stated what type of non-linear relationship was observed (i.e., in the subsection “Human-like forward replay in macaques” where these results are reported).
There were two requests for greater clarity:
1) The Introduction sets up the notion of linear vs. non-linear models for RTs and the authors state that they adjudicate between the two aspects of the replay models comparing between human and macaque data. While I appreciate that the non-linear human component refers to the global compression, the fact that monkeys appear to have performed TOJ using a forward search with non-linear compression, might confuse some readers. It was recommended that this be made explicit in the Introduction.
2) The legend of Figure 6—figure supplement 1 needs to include more explanation; please correct. Is the point of this to indicate that there is not a direct mapping of all features between inset and T-shirt images (i.e. the coloured lines don't always go to the same point of the corresponding image)?
https://doi.org/10.7554/eLife.54519.sa1Author response
Reviewer #1:
[…]
1) Additional details about the testing paradigm and overall behavioural results are needed. Were the monkeys trained ahead of time (and if so, how long did the training take)? With such a protracted period, it would be interesting to know whether the monkeys' performance improved or changed over the course of the experiment. As their performance is not at ceiling level, it seems possible that they improve overall, or that they start relying on different aspects to make their judgments. Day of testing could therefore also be included as a predictor in the models.
We now provided more details on these aspects. The monkeys had received training before the main test reported in the manuscript. There were five stages of learning on the TOJ task and the numbers of days are reported in (Table 2). With these extended periods of training, we think that the monkeys were unlikely to change over the course of the experiment in terms of their base TOJ ability (see also our reply below). In order to check this, we plotted out the proportion correct per day per individual: (A) Monkey data. (B) Human data. We observe no obvious increase in performance over the course of testing days for both monkeys and humans (Figure 7). This absence of performance changes is due to the extensive training the monkeys had received prior to this experiment (~ 21150 trials in total per monkey across 141 sessions on average, see Table 2). We therefore deem the main effects reported here as unlikely attributable to behavioral or strategic changes over the course of main experiment.
Regarding the monkeys' accuracy: in the subsection “Task performance”, the authors report that the monkeys' overall accuracy was ~68%. Was this the overall accuracy calculated across all 50 testing days? In contrast, the monkeys' accuracy plotted in Figure 2C is barely above chance level – while it's clear that the pattern displayed in the figure will be noisy as the data are split according to target frame location, the lines hover close to chance level and don't seem to average out to 68%. I may be missing something, but unless the data in Figure 2C represent only a subset, for example, this needs to be explained in more detail.
The results displaying ~68% are the aggregated data from all trials (both within- and across-) from all 50 testing days, whereas the data in Figure 2C contained only within-context trials. These are now clarified more explicitly in the revised manuscript (subsection “Human-like forward replay in macaques”). The analyses in this paper are broadly made of two parts: using only within-context trials and comparing within-context vs. across-context trials.
Related to the point above, it would generally be helpful if the authors provided a measure of the monkeys' accuracy over the course of testing. This is also important as the human participants' accuracy was at ceiling after a single testing session (based on the aforementioned subsection) – did the trend of the monkeys' performance remain the same across the 50 testing days (i.e., no global compression), or did this only develop after extensive experience on the task?
We plotted the accuracy data for each testing day for each monkey and human as Figure 7. We do not observe any obvious increase in performance over the course of the 50 testing days. Together with the extensive training sessions/trials prior to the experiment (over 20,000 trials per monkey). We deem the main effects reported in the manuscript as not attributable to changes over the course of main experiment.
Moreover, to further address the possibility that the two species might be differentially susceptible to the effect of having different numbers of trials in the experiment, we performed a control analysis making use of only the first 2 repetitions of data to equate the number of trials between the two species. With these equated subsets of data, we re-calculated the correlation between monkey RT RDM and two hypothetical model (i.e., Strict forward model vs. Global compression model). The results replicated the same significant correlation with the Strict forward model (r = 0.56, P = 0.049), but not with the Global compression model (r = -0.10, P = 0.643). These results replicated the findings as of when the full set of data was used, suggesting that the more numerous exposure to the same stimuli did not affect the main results. These results are now added in the revised manuscript (subsection “Discrepancy with humans: Compression of replay is local but not global”).
2) On a related note, it seems confusing to include both correct and incorrect trials in most of the analyses. In the GLM analyses (subsection “Human-like forward replay in macaques”, Supplementary file 2), the patterns were indeed quite similar for correct and incorrect trials, but it still seems unusual to include trials where the monkeys may not have replayed the content correctly as they reached an incorrect decision. Further, the different amounts of data entered into different models (correct trials only vs. all trials) make it more difficult to compare them. I would recommend that only correct trials are included in all analyses, or additional control analyses are needed to ensure that all findings remain the same when only correct trials are used.
In order to clarify this issue, we first made use of the GLMs, but now adding a new binary regressor (response: correct or incorrect) into the full GLMs. In this way, we checked how correctness might influencethe overall response latency per condition per species.The results indicate that response correctness influences the overall response latency considerably especially in the across-context condition in both species, but to a lesser extent for within-context condition in the monkeys (Figure 6). However, in order to directly address the main issue of “changes in response latency as a function of chosen frame location”, we improved the linear regression analyses by including other variables as nuisance regressors (including response correctness) (cf. reviewer 1 comment 4 below). We found a significantly positive relationship between RT and chosen frame location in all monkeys, all P < 0.001 (more significant than without the nuisance regressors). We have accordingly modified Table 1 (top panel) with these new results. Moreover, we have also conducted these analyses considering only correct trials (or only incorrect trials) separately. The new results are now also reported in Table 1 (middle and bottom panels respectively) (subsection “Human-like forward replay in macaques”). These results support that possibility that the retrieval RT patterns as function of chosen frame location are not affected by the memory outcome. The same has been done with response latency/temporal similarity slope analysis too (Supplementary file 2).
3) The comparison between monkey and human data in Figure 3 is especially interesting. It is intriguing to see that human participants display global compression, but monkeys do not, and this interpretation seems to be well-supported by the data. However, the authors state that monkeys 'reach their memory decision threshold more quickly when probe frames are extracted from the two different contexts' and that these findings 'parallel closely established findings in the humans'. This pattern is in fact the opposite to that observed in humans; when two stimuli are separated by a boundary, humans are slower to reach a decision (e.g. Ezzyat and Davachi, 2014; Heusser et al., 2018 to name just two recent papers). This interpretation should be revised as the data from the present manuscript in fact suggest that boundaries affect human and monkey decision making processes in opposite directions.
We have now added an analysis on the human data with the LATER and did not observe the same pattern as the monkeys (Figure 5C and Supplementary file 3). It is worth mentioning that in paradigms making use of discreet images for encoding, the TOJ performance thus far reported in the human literature is distinct from those making use of continuous streaming video material. Specifically, it has been reported that accuracy and response times are worse in TOJ for items are that separated by a boundary during encoding in the literature (Ezzyat and Davachi, 2014; Heusser et al., 2018), whereas we show that TOJ performance is better between frames taken from an across-context condition than from a within-context condition (Figure 1—figure supplement 1A). This is likely due to the ease of remembering ‘discrete segment of events’ when making use of the contextual or perceptual clues in the video frames in an across-context condition (see also Figure 6—figure supplement 1 that contextual information affects TOJ). We have added this discussion briefly in the revised manuscript (subsection “Human-like forward replay in macaques”).
4) The manuscript contains a relatively large number of (at times complex) analyses. As the more complex models are such an important aspect of the paper, they should be explained in more detail and the link between Materials and methods and Results should be clear. I found it somewhat difficult to align the description of the analytic approach in the Materials and methods section with the reporting of the results. In general, more detail in the description of the methods is needed. This especially applies to the model and variable descriptions. A few key points below:
Thank you for this suggestion. Apart from addressing the specific sub-points mentioned below, we have re-organized some content from the Materials and methods section and provided some more detailed clarification of the methods. For example, to clarify our methods in parameterizing perceptual similarity information between TOJ frames, we have provided an illustration of what the main perceptual similarity parameter indexes in Figure 6—figure supplement 1C.
– In the subsection “Task and experimental procedure”, the authors state that the experiment comprised four factors: boundary, play order, temporal distance, and exposure. First, I found it unusual that the authors reported temporal distance as a factor with 25 levels, since modeling it as a categorical variable assumes that all 25 levels are independent of one another. If this was not the case, it should be clarified in text, but if this variable was indeed modeled as a categorical factor, this should be changed to continuous or ordinal.
Temporal distance is an ordinal variable. We have clarified this in the revised manuscript (subsection “Task and experimental procedure”).
Second, in the subsection “Generalized linear models (GLM)”, the authors then report a set of variables included in the GLM. If I'm not mistaken, these are only regressed out of the very last GLM (subsection “Confirmatory GLMs for the putative patterns”). While the evidence from this analysis is clear and the significance of the key factors of interest does not change, I think including these regressors in each of the analyses and treating them as nuisance regressors would be more convincing/parsimonious.
We now ran the linear regression analysis of response latency as a function of chosen frame location/temporal similarity having included other variables as nuisance regressors for each monkey separately. We now found a significantly positive relationship between RT and chosen frame location in all monkeys, all P < 0.001 (more significant than without these nuisance variables). We have modified Table 1 (top panel) with these new results. Moreover, related to comment 2 above, we have also conducted these analyses considering only correct trials (and only incorrect trials) separately. The results are reported in Table 1 (middle and bottom panels respectively) (subsection “Human-like forward replay in macaques”). We also ran the same sets of analyses on human participant data for comparison between the two species. The results are reported in Supplementary file 4.
– The description of the LATER model fitting was somewhat confusing. For example, it wasn't immediately clear that 'both conditions' (subsection “LATER (linear approach to threshold with ergodic rate) modelling”) refers to within vs. across clips. I also found it somewhat confusing that one of the models was called 'unconstrained' in the Materials and methods, but 'two fits' in Supplementary file 5 (if that is indeed the same model). Finally, in the Materials and methods, the authors refer to Figure 4A, which I believe should be Figure 6A. I am not a modeler, but it would nonetheless be important for the model descriptions to be linked back to the experimental predictions to make the connection between model parameters and behaviour clearer.
We apologize for the confusion and the mistaken reference. To clarify, the ‘unconstrained’ model is the same as ‘two fits’ model, and we have now only used the term ‘two fits’ model throughout the main text and the table. We have also corrected the reference to correct figure in the revised manuscript.
– How exactly was reciprocal latency calculated? I was not familiar with the term beforehand so I looked it up in the literature, but I would suggest that all key variables are defined in an accessible manner.
Reciprocal latency is the multiplicative inverse of raw reaction time (i.e., 1/RT). We added its definition in the Materials and methods (subsection “LATER (linear approach to threshold with ergodic rate) modelling”).
5) Related to the above: in the Introduction, the authors say that a 'non-linear' pattern would be predicted, which led me to expect a comparison of linear and non-linear model fits. Similarly, in the Materials and methods, the authors state that they used BIC to 'obtain the best fit among these models' – as this section immediately follows the section on GLMs, I assumed the models would be compared. However, the only BIC value reported is that comparing the 'shift' and 'swivel' models. I believe BIC can be used to compare any model types (i.e. linear and non-linear), but if this is not the case, the approach should be clarified. Unless I'm missing something, it appears that the reciprocal latency was modelled in a linear model (subsection “Human-like forward replay in macaques”), but the 'non-linearity' of the fit was only assessed visually. As the linear analyses all had significant outcomes, it seems important to provide a benchmark of 'goodness of fit' for different models. Reporting the significance of different trends (e.g. linear, quadratic, cubic) could be helpful.
We formally tested for any ‘non-linearity’ of the fit. We ran curvilinear regression analyses using reaction time as the dependent variable as chosen frame location to estimate the significance of different trends. We started testing for the linear relationship and progressively moved to higher-order relationships (quadratic, cubic). At each step, we assessed the significance of the new predictor, having accounted for all the variance fitted by the lower-level predictors. We reported the significance of different trends (linear, quadratic, cubic) for the two species (Supplementary file 5). This non-linear relationship is important because it rules out alternative explanations that the main effects are simply resultant from positional effects. Rather, the non-linear change in slope indicates there are other factors in play that the monkeys might group (or parse) the content according to the relational storyline structure (e.g., the 2-clip video design), and do not merely recall the frames/items as of their ordinal positions in a fixed linear manner.
Reviewer #2:
[…]
1) Many of the findings described in human studies related to sequence of images or video clips that are presented only once. Here, animals are shown repeatedly with the video clips. To what extent the repetition is affecting the underlying neural mechanisms and consequently the actual findings? I am aware the study included somehow this in their GLM model (Supplementary Figure 3) but in my view, this should require some more work. The question that one would like to see answered here would be: To what extent differences between humans and macaques are susceptible to be affected by the large amount of repeated material used in monkeys? I am aware many of the analysis included in the study require large number of trials for each individual but maybe authors can explore this issue across participants?
In a control analysis, we now made use of only the first 2 repetitions of data to equate the number of trials between the two species (note that the same experimental design was adopted by the human participants, see our reply to comment 2 below). Now with these equated subsets of data, we re-calculated the correlation between monkey RT RDM and two hypothetical model (i.e., Strict forward model vs. Global compression model). The results replicated the significant correlation with Strict forward model (r=0.56, P=0.049), but not with the Global compression model (r=-0.10, P=0.643). With this smaller set of data, we also directly compared the correlations between the two models and found significant differences between them (P<0.05, FDR-corrected, cf. reply to comment 4 below). These results replicated the findings as of when the full set of data was used, suggesting that having the more numerous exposure to the same stimuli did not affect the main results. These results are now added in the revised manuscript (subsection “Discrepancy with humans: Compression of replay is local but not global”).
It may be worth mentioning that despite having administered a large amount of training to the animals prior to this study, the monkeys’ performance have stabilized over the course of the experiment (see our reply to reviewer 1, point 1 and revised Materials and methods, subsection “Training history and task performance”). We have also taken care to ensure that all the videos used in the training stages were sampled differently from those employed in the main test, and that each of the monkeys was assigned with a pseudo-randomized remix of a unique 1000 video-trials set from the 2000-video library, thus eliminating video idiosyncrasy to be associated with any experimental conditions. With the large pool of unseen videos we have generated, only five repeated presentations were needed to accrue the relatively large numbers of trials (5000 per animal), which we believe is not too deviant from accepted standards in psychological studies in human.
2) Several details concerning the experimental design tested in humans are lacking. Which are the differences between the two species when it comes to the experimental design? This information should be clear in the manuscript so that differences and similarities between species can be fully evaluated.
The experimental design is essentially the same as the monkeys. The human participants re-used the 6 unique video-trials/list sets and TOJ frames (one unique set per monkey) correspondingly (the extra subject 7 re-used set 1). The only critical difference was that human participants performed only two exposures (R1 – R2) rather than five. Identically, each session contained 100 trials and with 20 daily sessions of testing, we have acquired 2000 trials per participant for analysis. Human participants used the same model of computers and touch-sensitive screens as monkeys. The human subjects sat ~30cm in front of another identical 17-inch infrared touch-sensitive screen, with stimuli presented with the same computer used in the monkeys’ experiment. These details are now further clarified in the revised manuscript (subsections “Apparatus and testing cubicle”, “Source of video materials and preparation” and “Task and experimental procedure”).
3) I was expecting that most of the analysis were implemented in monkeys' and in humans' data. Is there any particular reason to skip some of them in humans RTs (for example: effects of context change (within vs. across) and GLM confirmatory analysis)?
We have now completed the full comparison between the two species’ data with a series of new analyses with human data. These include the following main analyses: (i) LATER modelling for within- vs. across-conditions (Figure 5 and Supplementary file 3) and (ii) GLM confirmatory analyses (Figure 6B and Figure 6—figure supplement 1B). In addition, we also completed the comparison on (iii) the relationship between temporal similarity and reciprocal latency for within-context trials in humans (Figure 2—figure supplement 1B), (iv) proportion corrects and RT analyses for the humans (Figure 1—figure supplement 1A and Figure 7). These altogether help elucidate the differences and similarities between the species more explicitly.
4) Correlation results between models should be directly compared and show they differed significantly to be able to attribute a winner one (i.e., Figure 3).
We used two-sided Wilcoxon signed-rank tests to test the null hypothesis that the correlation between data RDM and the two hypothetical models are equal. Pairwise comparisons between the two models are statistically significant for monkeys (P<0.01) and humans (P<0.05), confirming the significant differences between the models in both cases. These are now modified in the revised manuscript and in Figure 3 (subsections “Discrepancy with humans: Compression of replay is local but not global”, and “Representational similarity analysis (RSA)”).
5) I found the results showing that many of the effects were equally robust for correct and incorrect trials a bit confusing. In my understanding, the behavioural manifestation of how memory content is organized and replayed should be specifically evident for when retrieval access has been successful, as otherwise it may be difficult to discard the possibility that the observations are driven by a more general task oriented operation. Can the authors please justify why it would be relevant that many of their central findings were valid for correct and incorrect trials? And if so, wouldn't it be also relevant to show the same results in humans' data?
Making use of the GLMs, we now added a new binary regressor (response: correct or incorrect) into the full GLMs. In this way, we checked how correctness might influencethe overall response latency per condition per species.The results indicate that response correctness influences response latency considerably especially in the across-context condition in both species, but to a lesser extent for within-context condition in the monkeys (Figure 6). However, to directly address the main issue of “changes in response latency as a function of chosen frame location”, we also ran the linear regression analyses but now including other variables as nuisance regressors (including response correctness). We found a significantly negative relationship between reciprocal latency and chosen frame location in all monkeys, all P < 0.001 (more significant than without those nuisance variables). We have modified Table 1 (top panel) with these new results. Moreover, we have also re-conducted these analyses considering only correct trials (or only incorrect trials) separately. The results are now reported in Table 1 (middle and bottom panels respectively) (subsection “Human-like forward replay in macaques”). These results support the possibility that the response latency patterns are not affected by the memory outcome.
Reviewer #3:
[…]
1) The authors have concentrated mainly on RTs, but when accuracy is considered (plotted in Figure 2C) it is clear that for many target frame locations, the macaques are performing at chance (horizontal blue line). In at least 3 of the macaques, accuracy seems worse for earliest stills from clip 1. Could there be a speed-accuracy trade-off underlying faster RTs for these early stills? I appreciate that, overall, performance was above chance (it is somewhat atypical to report this at the beginning of the Materials and methods section), but the possible confound of monkeys making fast responses with little memory content to these early still probes needs addressing.
In order to rule out that the putative RT effects are not driven by a trade-off in accuracy, we segmented each video into four segments (i.e., each clip was parsed into two halves based on target frame location separately) and calculated the inverse efficiency score [RT (ms) / percentage correct (%)] for each segment per individual. The monkeys showed a mild numeral increase trend in inverse efficiency score across the four segments but did not reach statistical significance (Figure 1—figure supplement 1B, left panel) (F (3, 20) = 0.10, P = 0.96), suggesting that the increase in RT towards the end of the video did not contribute to better accuracy. In contrast, we found that the humans show a lower inverse efficiency score for the video parts right after a boundary (F (3, 24) = 4.17, P = 0.016), with post hoc tests showing significant difference between bars 2 and 3 (Figure 1—figure supplement 1B, right panel). This difference between bars 2 and 3 suggests that the participants are generally faster in their decision (while having accuracy controlled for) right after a clip boundary. This pattern of boundary effect aligns with the little “blip” in proportion correct right after the beginning of Clip 2 in the humans (Figure 2B, right panel). This discussion is now added in the revised manuscript (subsection “Human-like forward replay in macaques”).
2) Furthermore, there is clearly an improvement in accuracy for stills that are at the beginning of clip 2. The authors mention this as "a blip" but provide no statistics. This is an interesting boundary effect that could be reported better and integrated with point 3.
Together with an additional analysis making use of inverse efficiency score (cf. reply to comment 1), it is clear that humans can make use of the across-clip boundary to facilitate their TOJ. Interestingly, in paradigms making use of discreet images for encoding, the TOJ performance thus far reported in the human literature is distinct from those making use of continuous streaming video material. Specifically, accuracy and response times are worse in TOJ for items are that separated by a boundary during encoding (Ezzyat and Davachi, 2014; Heusser et al., 2018), whereas TOJ performance is better between frames taken from an across-context condition than from a within-context condition (Figure 1—figure supplement 1A). This might be due to the ease of remembering ‘discrete segment of events’ when making use of the contextual or perceptual cues in the video frames in an across-context condition (see also Figure 6—figure supplement 1 that contextual information affects TOJ). We have added this discussion briefly in the revised manuscript (subsection “Human-like forward replay in macaques”).
3) Representational similarity analyses were used to demonstrate that "global compression" of individual video clips is not evident in macaques, who appear to show increasing RTs to stills drawn from over the course of the 2 clips (i.e. the strictly forward model). There is a non-significant trend towards global compression effects in humans. It is clear, though, that macaques and humans respond differently. What makes things a bit confusing is that LATER modeling indicated that macaques show an important boundary effect. Memory decision threshold is reached more quickly if probe stills come from different clips. I wonder whether this has something to do with perceptual similarity effects reported later for the GLM analyses, but I did not get a good feel for what this perceptual similarity parameter is measuring.
Thank you for raising this consideration. While both species make use of the perceptual dissimilarity to some extent (see Figure 6—figure supplement 1), their benefits from utilizing the perceptual information seemingly differ. Specifically, we have now added an analysis on the human data with the LATER and indeed did not observe the same pattern as the monkeys (Figure 5C and Supplementary file 3; cf. reviewer 1’s comment 3), suggesting that the two species might not treat the information given by the event boundary at TOJ in the same manner. Moreover, to clarify our methods in parameterizing perceptual similarity information between TOJ frames, we have provided a more intuitive illustration of what the main perceptual similarity parameter (SURF) indexes in Figure 6—figure supplement 1.
4) I am undecided as to whether Figure 5 and associated Results section really add to the findings. It is clear from the preceding figures that slope is markedly different in the two species.
Thank you for this suggestion. Indeed, reviewer 1 also concurs with this concern. Removal of Figure 5 is justified given that most of the information conveyed by Figure 5 is shown in Figures 3 and 4.
5) In view of the indirect nature of the inference, certain statements such as "The monkeys apply a non-linear forward, time-compressed replay mechanism during the temporal-order judgement” (Abstract) need to be toned down.
Yes, indeed. Thank you for this suggestion. We have made changes in the revised manuscript to emphasize that the data (or analyses) patterns are suggestive of a possibility of memory replay in the animals, or similar statements to that effect (e.g., Abstract, Introduction, Results and Discussion).
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
1) The observation that the findings largely hold up when the data are split by correct and incorrect trials (Table 1) certainly supports the authors' decision to include all trials in their analysis, regardless of correctness. However, since the findings are so similar for correct and incorrect trials, the authors may wish to discuss why this pattern might be observed even on incorrect trials. Is the assumption that the monkeys are replaying the content correctly but then reaching an incorrect decision or that the information was incorrectly encoded? In other words, if the latency data reflects memory processes, an incorrect decision here would suggest that the initially encoded temporal order was incorrect. Either way, this seems like an interesting finding and Discussion point.
Thank you for pointing this out. We are also intrigued by this finding. We have previously showed in analogous TOJ paradigms in humans that TOJ task specific BOLD activation and behavioural RT patterns are independent of retrieval accuracy (Ezzyat and Davachi, 2014; Heusser et al., 2018). We interpreted these effects as process-based rather than content-based. At present, the latency results likely also point towards some “search” or replay memory processes, and any ultimate incorrect responses would thus likely be caused by memory noises injected during encoding and/or during delay maintenance. We have now discussed these issues more explicitly, as evidence for cross-species correspondence, in the revised text (Discussion).
2) Regarding the analysis of linear, quadratic, and cubic trends: it is indeed encouraging to see that the non-linear (quadratic and cubic) trends are significant in the monkeys. However, interestingly, only the cubic trend seems to be significant in the human sample (linear and quadratic are not). Since one of the important contributions of this paper is a direct comparison between monkeys and humans, we think it would be helpful if the authors also addressed this difference in the manuscript. We also suggest that the manuscript text more explicitly stated what type of non-linear relationship was observed (i.e., in the subsection “Human-like forward replay in macaques” where these results are reported).
This analysis also helps highlight differences between humans and macaque monkeys. As we found that the human data fits best, and only, with the cubic model, it suggests humans might have treated the video and TOJ differently from the macaques. This difference is reflected in the global compression capability in the human participants. We link this part to the reporting of its subsequent section in the revised manuscript (subsection “Human-like forward replay in macaques”).
To stipulate the non-linear relationship more precisely, we have added “quadratic and cubic” to qualify the non-linear pattern in monkeys.
There were two requests for greater clarity:
1) The Introduction sets up the notion of linear vs. non-linear models for RTs and the authors state that they adjudicate between the two aspects of the replay models comparing between human and macaque data. While I appreciate that the non-linear human component refers to the global compression, the fact that monkeys appear to have performed TOJ using a forward search with non-linear compression, might confuse some readers. It was recommended that this be made explicit in the Introduction.
We clarify this by stating that while both species do recall the video content non-linearly, there is an aspect of discrepancy between the two species, in which the monkeys do not compress the cinematic events globally as effectively as in humans, whereas human participants possess an ability to skip irrelevant information of the video. This is now clarified in the revised manuscript (Introduction).
2) The legend of Figure 6—figure supplement 1 needs to include more explanation; please correct. Is the point of this to indicate that there is not a direct mapping of all features between inset and T-shirt images (i.e. the coloured lines don't always go to the same point of the corresponding image)?
We modified the legend of Figure 6—figure supplement 1C. Specifically, we clarified that the cartoon is to illustrate that how features from two images can be matched irrespective of their scale.We can see that features of the inset are matched to the T-shirt on how strongly they are related, but some feature points (mostly minority) may still incorrectly match other parts of the image.
https://doi.org/10.7554/eLife.54519.sa2Article and author information
Author details
Funding
973 Program (2013CB329501)
- Yong-di Zhou
Ministry of Education of the People's Republic of China (Humanities and Social Sciences 16YJC190006)
- Sze Chai Kwok
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Ethics
Human subjects: The experimental protocol was approved by the the University Committee on Human Research Protection (permission code: HR 023-2017) at East China Normal University. The participants provided informed consent.
Animal experimentation: The experimental protocol was approved by the Institutional Animal Care and Use Committee (permission code: M020150902 & M020150902-2018) at East China Normal University. All experimental protocols and animal welfare adhered with the "NIH Guidelines for the Care and Use of Laboratory Animals".
Senior Editor
- Joshua I Gold, University of Pennsylvania, United States
Reviewing Editor
- Morgan Barense, University of Toronto, Canada
Reviewer
- Bryan Strange
Version history
- Received: December 17, 2019
- Accepted: April 20, 2020
- Accepted Manuscript published: April 20, 2020 (version 1)
- Version of Record published: May 18, 2020 (version 2)
Copyright
© 2020, Zuo et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 2,965
- Page views
-
- 340
- Downloads
-
- 6
- Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Neuroscience
Perceptual decisions about sensory input are influenced by fluctuations in ongoing neural activity, most prominently driven by attention and neuromodulator systems. It is currently unknown if neuromodulator activity and attention differentially modulate perceptual decision-making and/or whether neuromodulatory systems in fact control attentional processes. To investigate the effects of two distinct neuromodulatory systems and spatial attention on perceptual decisions, we pharmacologically elevated cholinergic (through donepezil) and catecholaminergic (through atomoxetine) levels in humans performing a visuo-spatial attention task, while we measured electroencephalography (EEG). Both attention and catecholaminergic enhancement improved decision-making at the behavioral and algorithmic level, as reflected in increased perceptual sensitivity and the modulation of the drift rate parameter derived from drift diffusion modeling. Univariate analyses of EEG data time-locked to the attentional cue, the target stimulus, and the motor response further revealed that attention and catecholaminergic enhancement both modulated pre-stimulus cortical excitability, cue- and stimulus-evoked sensory activity, as well as parietal evidence accumulation signals. Interestingly, we observed both similar, unique, and interactive effects of attention and catecholaminergic neuromodulation on these behavioral, algorithmic, and neural markers of the decision-making process. Thereby, this study reveals an intricate relationship between attentional and catecholaminergic systems and advances our understanding about how these systems jointly shape various stages of perceptual decision-making.
-
- Neuroscience
The relationship between obesity and human brain structure is incompletely understood. Using diffusion-weighted MRI from ∼30,000 UK Biobank participants, we test the hypothesis that obesity (waist-to-hip ratio, WHR) is associated with regional differences in two micro-structural MRI metrics: isotropic volume fraction (ISOVF), an index of free water, and intra-cellular volume fraction (ICVF), an index of neurite density. We observed significant associations with obesity in two coupled but distinct brain systems: a prefrontal/temporal/striatal system associated with ISOVF and a medial temporal/occipital/striatal system associated with ICVF. The ISOVF~WHR system colocated with expression of genes enriched for innate immune functions, decreased glial density, and high mu opioid (MOR) and other neurotransmitter receptor density. Conversely, the ICVF~WHR system co-located with expression of genes enriched for G-protein coupled receptors and decreased density of MOR and other receptors. To test whether these distinct brain phenotypes might differ in terms of their underlying shared genetics or relationship to maps of the inflammatory marker C-reactive Protein (CRP), we estimated the genetic correlations between WHR and ISOVF (rg = 0.026, P = 0.36) and ICVF (rg = 0.112, P < 9×10−4) as well as comparing correlations between WHR maps and equivalent CRP maps for ISOVF and ICVF (P<0.05). These correlational results are consistent with a two-way mechanistic model whereby genetically determined differences in neurite density in the medial temporal system may contribute to obesity, whereas water content in the prefrontal system could reflect a consequence of obesity mediated by innate immune system activation.