Introduction

Keeping relevant information in an easily accessible state is vital for adaptive behavior in dynamic environments. In the primate visual system, this requirement is met by visual working memory (VWM), the capacity to actively maintain visual information from milliseconds to seconds after a stimulus disappears from view [1–4]. While the contents of VWM are frequently updated to reflect changes in the environment and in behavioral priorities, the visual processing hierarchy itself introduces additional layers of dynamism [5, 6]. The fidelity of representations therefore evolves from the moment VWM starts accumulating evidence [7, 8] throughout the maintenance period until the information is used for action [9–11].

Nonetheless, within most theoretical frameworks, VWM is treated as a stationary process whereby representations are measured and modelled as fixed states of the system. One such model of working memory is based on principles of neural population coding [12, 13]. In the Neural Resource model, visual information is encoded in the activity of a population of noisy feature-selective neurons [14, 15]. The spiking activity of the neural population is constrained by normalization [16, 17], such that the total activity is fixed but flexibly distributed between memoranda, implementing a form of limited memory resource. At retrieval, encoded stimulus values are reconstructed from the noisy spiking activity. This model has provided a quantitative account of patterns of recall error across a range of tasks and stimulus dimensions [18–22]. However, despite its grounding in principles of neural coding, the basic architecture of the model lacks a temporal dimension to describe the dynamics of memory representations during encoding and maintenance.

Research on prolonged memory maintenance has demonstrated that the precision of stored representations gradually deteriorates over time (e.g., 23, 24). Computational models attempting to account for these dynamics have often relied on principles of diffusion within an attractor network. In such a network, information is maintained in a sustained pattern of activity, which can be visualized as a “bump” of activity centered on the stored value. Over time, the bump diffuses along the feature dimension due to random fluctuations in neural activity, leading to stochastic changes in the encoded feature value and a gradual loss of information [25, 26]. Critically, the neural code diffuses without decay in signal strength. A growing body of empirical support, both at the behavioral [9] and neural level [27, 28], identifies diffusion as a key mechanism of memory deterioration.

In contrast to such gradual deterioration over longer retention intervals, studies that probed memory within a few hundred milliseconds of stimulus offset revealed a precipitous decrease in memory fidelity immediately after a stimulus disappears [29–32]. This early superior recall was attributed to a high-capacity but short-lived form of storage termed iconic memory (IM) [33]. An implicit assumption has often been that the behavioral advantage of early cues derives from reading out information directly from IM and circumventing capacity limitations imposed by VWM; however, this idea has not been formally modelled or tested. At the neural level, IM is thought to be supported by a brief period of decaying neural activity in early visual areas following the response elicited by the visible stimulus [34–37]. In contrast to later memory dynamics arising due to noise accumulation, early changes in memory fidelity were supported by modulation of the neural signal strength. However, little is known about the read-out of this sensory memory buffer.

Finally, memory fidelity changes during encoding while the evidence is extracted from the visible stimulus. Previous studies revealed that longer stimulus exposures have a favorable effect on the subsequent recall, but that this effect is modulated by the number of simultaneously encoded objects [38–40], providing evidence for a processing or encoding limitation of VWM. As stimulus presentation duration increases, more information may be extracted from the sensory signal into VWM, increasing the fidelity of the representation. Critically, with prolonged exposure, VWM fidelity approaches a stable level that depends on the number of encoded items, suggesting that a ceiling is imposed on evidence accumulation by a shared limit on VWM resources. However, a computational framework describing information accumulation from sensory areas into VWM is lacking, and the observed encoding limit may reflect dynamics in sensory areas registering visible objects as well as VWM accumulating this sensory evidence.

Here, we investigated the temporal dynamics in the fidelity of VWM from information encoding until its recall. To map human recall fidelity to the time domain, we conducted psychophysical experiments in which we probed memory representations at different time points relative to stimulus onset and offset while simultaneously manipulating set size. To isolate memory dynamics due to changes in the representational signal, we advanced an analogue reproduction task with a novel response method specifically adapted to minimize the time cost of motor (i.e., response) processes and capture the momentary state of memory representations. This allowed us to precisely measure the time course of fidelity dynamics during representation formation (i.e., encoding) and retention (i.e., maintenance). A major conclusion is that the enhanced precision seen at very brief retention intervals depends on integration of information from the sensory store into VWM following the cue, with the result that retrieval from IM of even the simplest stimulus is subject to the temporal and capacity limitations of working memory.

To explain the neural computations underlying the observed time courses, we devised a comprehensive neural model of memory dynamics whose core architecture is rooted in the Neural Resource model of VWM [12, 13]. The Dynamic Neural Resource (DyNR) model assumes that changes in memory fidelity reflect temporal dynamics in the sensory population registering the stimuli, together with the signal and noise accumulation processes of resource-limited VWM (Fig. 1). In particular, the model prescribes how time-dependent gain control mechanisms in sensory areas produce a smooth neural response following abrupt changes in stimulus presence. As this sensory signal provides feed-forward input to VWM, the dynamics in VWM activity in the temporal vicinity of stimulus presentation (i.e., onset and offset) strongly reflect not only limits in VWM, but also the dynamics of the sensory signal. Finally, once accumulated into VWM, the neural signal is subject to perturbations due to noise accumulation, resulting in degradation of internal representations with time. The DyNR model accurately reproduced the detailed empirical patterns of human recall errors in the psychophysical experiments. Based on these results, we argue that changes in memory fidelity on short time scales reflect dynamics in the gain or signal strength in neural populations representing the stimulus, while changes on longer time scales are dominated by corruption of the representation by accumulated noise.

Proposed neural population dynamics for encoding a single orientation into VWM and maintaining it over a delay. Top: Stimulus onset is followed by a ramping increase in activity (indicated by color) of sensory neurons whose tuning (indicated on y axis) matches the stimulus orientation. Following stimulus offset, this sensory signal rapidly decays. The sensory signal, including its decaying post-stimulus component, provides input into VWM. Bottom: At stimulus onset, the VWM population begins to accumulate activity from the sensory population. This accumulation saturates at a maximum amplitude determined by global normalization. As the sensory activity decays, the activity in the VWM population is maintained at a constant amplitude, but accumulation of random errors causes the activity bump to diffuse along the feature dimension (y axis) over time, changing the orientation represented by the population. At recall, when the VWM population activity is decoded, accuracy of the recall estimate depends on both the orientation represented (centre of the activity bump) and the fidelity with which it can be retrieved (determined by activity amplitude).

Dynamic Neural Resource (DyNR) Model

The Dynamic Neural Resource model generalizes an established neural population account of VWM, originally proposed by Bays [12] and inspired by similar models of attention and perceptual decision-making [41–43]. In the original model, memorization and recall of visual stimuli is achieved by encoding and decoding of spiking activity in idealized feature-tuned neurons. The limited capacity of VWM to hold multiple object features simultaneously is reproduced by a global divisive normalization that constrains total spiking activity, implementing a continuous memory resource [16, 12]. The DyNR model (illustrated in Fig. 1) extends this stationary encoding-decoding model with a temporal dimension. First, to capture encoding dynamics, stimulus information enters the VWM population (Fig. 1, bottom) indirectly, by accumulation of neural signal from a separate sensory population (top), which receives the visual input. The signal strength in the VWM population at any point in time jointly depends on the history of the signal in the sensory population and the number of features competing for representation in VWM. Once the sensory signal is gone, the VWM signal is maintained at its maximum attained amplitude, but the stimulus value encoded by the signal gradually diffuses due to accumulation of random noise. Recall error depends on both the stimulus value represented at the time of retrieval (what is encoded) and the signal amplitude at that time, read out in the form of spikes (how precisely it can be decoded).

Dynamics of sensory signal strength

To model the temporal dynamics of human memory fidelity, we begin by defining computations of the sensory system registering the incoming signal. A particularly important computation is temporal filtering, the property of neurons to respond more sensitively to specific temporal patterns in stimuli. To model the signal represented at the cortical sensory level, we assume that the sensory response to a stimulus presentation of fixed duration (described as a step function in visual input amplitude, Fig. 2A & B, left) is controlled by a monophasic temporal filter having a low-pass frequency response [44]. This choice is a natural one since it is consistent with electrophysiological studies demonstrating that a large range of temporal frequencies registered by the retina and LGN [45, 46] is attenuated at higher frequencies before the signal enters the primary visual cortex [47]. Passing the stimulus through such a temporal filter attenuates the neural response to fast transients in the signal, and thereby produces a smooth rise and decay of neural activity in response to a uniform input signal (Fig. 2C). In particular, we assume that the activity of the sensory population after stimulus onset and offset changes exponentially towards the maximum sensory activity and baseline activity, respectively.

Schematic of signal amplitudes in the DyNR model during a cued recall trial. (A) Observers are presented with a memory array (left), followed after a blank delay (not shown) by an arrow cue (center) indicating the location of one item (the target) whose remembered orientation should immediately be reported (right). (B) The amplitude of the visual input associated with each item is modelled as a step function (left). The sensory response (D) is modelled as a low-pass filtering of the stimulus input, with different time constants for rise and decay (C). (F) Amplitude of the working memory signal reflects a saturating accumulation of activity from the sensory population (illustrated in E). Beginning with stimulus onset, activity associated with each item is accumulated from the sensory population into the VWM population, approaching an upper bound (green dashed line) that reflects a total activity limit shared between the N items in memory. Once the cue has been presented (solid orange line) and processed (dashed orange line), uncued items can be dropped from VWM, raising the ceiling on activity available to represent the cued item (green arrow). This allows more information about the cued item to be accumulated from the decaying sensory trace (equivalent to the red shaded area in D). Response variability depends on the asymptotic VWM signal amplitude available for decoding (red circle) combined with the accumulated effects of diffusion (see text).

The choice of the filter’s temporal response characteristics (i.e., its time constant) fully defines dynamics in the sensory population activity and controls the signal projected towards higher areas. The available physiological evidence suggests the temporal properties of the rising and decaying neural response are not symmetric [48–50]. In particular, the neural response typically reaches the maximum activity after the onset faster than it reaches the baseline activity after the offset. Consistent with this, we allowed the sensory signal to decay at a different rate than the rising rate. The temporal dynamics in sensory population firing activity in response to a fixed input signal of duration toffset is then given by:

where is the maximum sensory signal, τrise and τdecay are rising and decaying time constants of the temporal filter, respectively.
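The displayed equation is not reproduced in this text, so the following Python sketch shows one functional form consistent with the description above: an exponential approach to the maximum while the stimulus is visible, followed by an exponential return to baseline from whatever level was reached at offset. The function name, the symbol `s_max`, the default parameter values, and the use of seconds as the time unit are assumptions made for illustration, not the authors' implementation.

```python
import math

def sensory_signal(t, t_offset, s_max=1.0, tau_rise=0.05, tau_decay=0.2):
    """Sensory amplitude at time t (s) for a stimulus visible from 0 to t_offset.

    Assumed form: low-pass (exponential) rise towards s_max during the stimulus,
    exponential decay towards zero baseline afterwards. All defaults are
    hypothetical values chosen only to illustrate asymmetric rise and decay.
    """
    if t < 0:
        return 0.0  # before stimulus onset the population sits at baseline
    if t <= t_offset:
        return s_max * (1.0 - math.exp(-t / tau_rise))
    # Level reached at stimulus offset; decay starts from here, not from s_max
    peak = s_max * (1.0 - math.exp(-t_offset / tau_rise))
    return peak * math.exp(-(t - t_offset) / tau_decay)
```

Because the decay starts from the level reached at offset, the trace is continuous at `t_offset`, and a brief exposure yields a weaker post-stimulus trace than a long one.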

The temporal properties of the sensory response have been shown to depend on the physical characteristics of stimuli, such as contrast and location [48, 51]. Similarly, previous work has demonstrated that the decaying component of the sensory response is strongly influenced by the engagement of the sensory population after stimulus offset (e.g., 35). In particular, a new input signal, e.g. a backward noise mask, curtails ongoing activity related to the previous stimulus, resulting in a faster decay of activity compared to the unmasked post-stimulus period [52]. Consistent with this, here we assume that the backward mask operates by interrupting ongoing sensory processing of stimuli, limiting the access to the sensory signal (Fig. 5) (cf. integration mask) [53].

Dynamics of VWM signal strength

The information registered by the sensory system is subsequently accumulated into a VWM population capable of maintaining activity in the absence of further input (e.g. by self-excitation, see 54, 55, 26; although only the resulting dynamics are modelled here). The total activity of the VWM neural population is normalized, implementing a limited resource shared out between memory items [12, 13]. Consequently, if the stimuli are presented for long enough, the evidence accumulated from the sensory signal into VWM will saturate at a level that reflects the total number of stimuli represented (Fig. 2D). The dynamics in VWM population activity are given by:

where is the maximum VWM signal amplitude, M(t) is the number of items represented in VWM at time t, and τwm is the time constant of accumulation into VWM.

A common assumption of VWM models is that the strength of the representational signal remains stable after encoding from a visible stimulus. This stationary view has been reinforced by typically measuring VWM sufficiently long after the stimulus disappears (1 second) and at a single time-point. In contrast, work on IM demonstrated that recall fidelity in a brief period after stimulus offset typically surpasses and then precipitously decays towards VWM fidelity level [56]. Consistent with that, we consider how the normalized representational signal in VWM formed during encoding can be boosted in the absence of the physical stimulus. In particular, we assume a representation stored in VWM can be strengthened as long as the sensory population provides feed-forward input and VWM activity is not saturated at the normalized level. Such a scenario can be achieved by cueing an item for recall in the temporal vicinity of stimulus offset, i.e. before sensory activity decays to zero. By cueing an item for recall, the remaining contents of VWM become obsolete and can be removed from memory [57]. In the model,

where tcue is the time when the item is identified for recall and the readout of stimulus value begins. This “demounting” of resource from uncued items makes it available for storing additional information about the cued item, which is extracted from the residual sensory representation, increasing the representation fidelity beyond that granted by equal distribution of neural signal between items. Critically, as sensory information quickly decays, there will be less signal remaining to supplement the VWM representation of a cued item if the cue is delivered later, and at the longest cue intervals the cue will confer no advantage over the fidelity attained when all items compete equally for VWM representation (Fig. 2D).
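The interaction between saturating accumulation and the release of resource by the cue can be illustrated numerically. The sketch below is a minimal simulation under stated assumptions: the accumulation rule (sensory drive multiplied by the distance to the current ceiling, divided by a time constant), the ceiling of `v_max / M(t)`, the Euler integration, and all names and default values are hypothetical choices consistent with the verbal description, since the governing equations are not reproduced in this text.

```python
import math

def sensory_drive(t, t_offset, tau_rise=0.05, tau_decay=0.2):
    """Sensory amplitude as a fraction of its maximum (assumed exponential rise/decay)."""
    if t <= t_offset:
        return 1.0 - math.exp(-max(t, 0.0) / tau_rise)
    peak = 1.0 - math.exp(-t_offset / tau_rise)
    return peak * math.exp(-(t - t_offset) / tau_decay)

def final_vwm_amplitude(n_items, t_offset, t_release, v_max=1.0,
                        tau_wm=0.1, dt=0.001, t_end=3.0):
    """Euler-integrate the cued item's VWM amplitude from stimulus onset to t_end.

    t_release is the (hypothetical) moment at which uncued items are dropped,
    i.e. M(t) switches from n_items to 1 and the ceiling rises to v_max.
    """
    v = 0.0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        m = n_items if t < t_release else 1   # items sharing the resource
        ceiling = v_max / m                   # normalization limit per item
        # Accumulate towards the ceiling at a rate gated by the sensory signal
        v += dt * sensory_drive(t, t_offset) * (ceiling - v) / tau_wm
    return v

single = final_vwm_amplitude(1, t_offset=0.2, t_release=0.2)
early = final_vwm_amplitude(4, t_offset=0.2, t_release=0.2)
late = final_vwm_amplitude(4, t_offset=0.2, t_release=1.2)
```

With these illustrative defaults, the four-item amplitude after an early release exceeds the amplitude after a late release, which stays close to `v_max / 4`, and both remain below the single-item amplitude.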

We note that removal of uncued items cannot occur until the cue has been processed to the point of identifying one of the N items in the memory array. We follow Hick [58] in modelling this cue processing time as logarithmic in the number of alternatives:

where b is a scaling parameter. Previous work demonstrated that estimation of temporal dynamics in attention and memory could be confounded with the time needed to interpret the cue and start acting on it [59]. This is especially significant when trying to accurately capture quickly changing processes, such as decay of the sensory residual. Although the cue processing time likely fluctuates on a trial-by-trial basis due to changes in, e.g. attention, arousal, or motivation, here we focus on the influence of set size arising from a limited information processing capacity.
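As a small illustration of how set size could delay the release of resource, the sketch below assumes the cue processing time takes the form b · log2(N). The base of the logarithm and the absence of any additive term are assumptions, as the equation itself is not reproduced in this text; the default value of `b` is taken from the Experiment 1 fit reported in the Results.

```python
import math

def cue_processing_time(n_items, b=0.171):
    """Assumed Hick-style cue processing time in seconds: b * log2(N).

    Under this assumed form a single item needs no selection time.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return b * math.log2(n_items)

def release_time(t_cue_onset, n_items, b=0.171):
    """Hypothetical helper: time at which uncued items can be dropped."""
    return t_cue_onset + cue_processing_time(n_items, b)
```

Under this form, larger arrays push the release of resource later into the sensory decay, so less residual signal remains to strengthen the cued item.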

Diffusion of VWM encoded values

So far we have described only changes in the strength of the neural signal encoding features in memory. However, feature representations maintained over time in neural activity will accumulate noise in the absence of external input. We model this process of noise-driven diffusion as Brownian motion in feature space throughout the retention interval (Fig. 1), contributing to variability in the decoded feature value [25, 9]. The resulting variability is described by a wrapped normal distribution with variance σ² that increases linearly with time from stimulus offset, so that at time t the encoded feature corresponding to a true stimulus feature θ is

where specifies the base diffusion rate. While the fast decay of sensory activity after stimulus offset accounts for early dynamics in VWM fidelity, diffusion becomes prominent over longer delays, accounting for more gradual deterioration of precision with time.
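The diffusion step can be sketched by sampling from a wrapped normal distribution whose variance grows linearly with elapsed time. In this sketch the feature space is mapped to the interval [−π, π), and the names `diffuse`, `wrap`, and `diffusion_rate` are hypothetical; the rate of 0.03 used in the example mirrors the fitted base diffusion value reported for Experiment 1, with its time unit assumed to be seconds.

```python
import math
import random

def wrap(angle):
    """Map an angle in radians onto the circular interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def diffuse(theta, elapsed, diffusion_rate, rng):
    """Sample the encoded value after `elapsed` seconds of Brownian diffusion.

    Wrapped normal: Gaussian displacement with variance diffusion_rate * elapsed,
    wrapped back onto the circle.
    """
    sd = math.sqrt(diffusion_rate * elapsed)
    return wrap(theta + rng.gauss(0.0, sd))

rng = random.Random(0)
# Ten illustrative samples of a value initially at 0 rad after a 1 s delay
samples = [diffuse(0.0, 1.0, 0.03, rng) for _ in range(10)]
```

Because the variance scales with elapsed time, doubling the retention interval doubles the variance of the encoded value while leaving its mean unchanged.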

Such a diffusion account has support in the available neural evidence as well as in theoretical work. At the neural level, an electrophysiological study in monkeys performing a spatial working memory task demonstrated that shifts of neural tuning curves during a memory delay predicted behavioral response errors [26]. A similar finding was observed in humans where drift in the fMRI activity patterns relative to the target predicted errors in an orientation discrimination task [27]. At a theoretical level, continuous attractor models explain diffusion as a consequence of neural variability in networks where excitatory and inhibitory connections constrain population activity to a sub-space or manifold corresponding to the encoded feature space [25, 60, 55].

Retrieval

To model the process that leads to a response we first consider that in some trials observers may erroneously identify a non-target item as being cued. Previous work indicates these “swap” errors occur due to uncertainty in memory for the cue features of the stimuli, in this case their locations [20, 61]. We assume that changes in variability in the cue features mirror those of the memory features, leading swap frequency to decrease exponentially as a function of presentation duration and increase linearly with retention interval (Fig. S2):

where τspatial is the time constant related to presentation duration, and rspatial is the rate constant related to the retention interval.

If θ is the true feature value of the item identified as the target (i.e. the cued item with probability 1 − pswap, a randomly selected non-cued item with probability pswap), then due to diffusion (Eq. 5) the value encoded in the VWM population at the time of retrieval is given by

We model retrieval as estimation of θ based on spiking activity in the VWM population that encodes the selected item. For this purpose we assume an idealized set of tuning functions, where the mean response of neuron i encoding orientation θ with population gain γ is described by

where n is the number of neurons, and κ determines the tuning width. The preferred orientations of the neurons, φi, are evenly distributed throughout the circular space to provide uniform coverage. The spike count produced by each neuron is drawn from a Poisson distribution,

and the decoded orientation estimate is obtained by maximum likelihood estimation based on the spike counts:
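The encoding and decoding steps above can be sketched as a small simulation. This is an illustration under assumptions rather than the authors' implementation: tuning curves are taken to be von Mises shaped and scaled so that expected spike counts sum to the population gain, the feature space is mapped to [−π, π), and maximum likelihood decoding is approximated by a grid search. All function names and default values are hypothetical.

```python
import math
import random

def tuning_rates(theta, prefs, gain, kappa):
    """Mean spike counts: von Mises tuning scaled so the counts sum to `gain`."""
    weights = [math.exp(kappa * math.cos(theta - p)) for p in prefs]
    total = sum(weights)
    return [gain * w / total for w in weights]

def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def decode_ml(counts, prefs, gain, kappa, grid, rng):
    """Grid-search maximum likelihood estimate of the encoded angle."""
    if sum(counts) == 0:
        return rng.uniform(-math.pi, math.pi)  # flat likelihood: random guess
    best_theta, best_ll = grid[0], -math.inf
    for theta in grid:
        rates = tuning_rates(theta, prefs, gain, kappa)
        # Poisson log-likelihood, omitting terms that do not depend on theta
        ll = sum(c * math.log(r) - r for c, r in zip(counts, rates))
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

def mean_abs_error(gain, n_trials=100, n_neurons=16, kappa=3.2, seed=0):
    """Average absolute circular decoding error over simulated trials."""
    rng = random.Random(seed)
    prefs = [-math.pi + 2 * math.pi * i / n_neurons for i in range(n_neurons)]
    grid = [-math.pi + 2 * math.pi * j / 90 for j in range(90)]
    total = 0.0
    for _ in range(n_trials):
        theta = rng.uniform(-math.pi, math.pi)
        rates = tuning_rates(theta, prefs, gain, kappa)
        counts = [poisson_sample(r, rng) for r in rates]
        estimate = decode_ml(counts, prefs, gain, kappa, grid, rng)
        total += abs((estimate - theta + math.pi) % (2 * math.pi) - math.pi)
    return total / n_trials
```

Calling `mean_abs_error` with a high and a low gain shows the qualitative link the model relies on: fewer spikes available at retrieval produce larger decoding errors.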

Additional assumptions

To fit the model to behavioral data, we make several further simplifying assumptions. We assume that the exponential decay of the sensory signal is rapid enough that there is effectively no information remaining by the time the VWM population is decoded to generate a response. This allows us to approximate the VWM activity at the time of decoding by the asymptotic VWM activity were the sensory decay to continue indefinitely:

Next, we identify diffusion in the encoded value at the time of retrieval with diffusion at the time of target item identification, justifying the use of tcue in Eq. 8. We reason that the rate of diffusion is slow enough relative to the rate of sensory decay, that any additional diffusion in the brief period of post-cue sensory accumulation is negligible.

In Experiment 1 (see below), a task with a fixed 200 ms exposure period, we assume that the initial encoding of all items into VWM is complete by the time of stimulus offset, i.e. that VWM activity at this time can be approximated by its asymptotic level reflecting normalization:

Finally, in the condition of Experiment 1 where memory array and cue are presented simultaneously, we assume that only the cued feature is encoded in VWM, reaching the maximum amplitude irrespective of set size. Maximum likelihood fits were obtained via the Nelder-Mead simplex method (function fminsearch in Matlab). All parameters and variables used to describe the DyNR model are listed in Table 1.

DyNR model parameters (1–9) and other variables (10–24) used in model description.

Overview of Experiments

We tested predictions of the Dynamic Neural Resource model against empirical data collected in continuous report tasks. In Experiment 1 (Fig. 3A & B), observers were presented with an array of oriented stimuli for a fixed duration followed after a variable delay by a visual cue identifying one of the preceding stimuli whose orientation should be reported. This experiment was designed to investigate the contribution of decaying sensory representations following stimulus offset to the dynamics of recall fidelity. Experiment 2 (Fig. 3C) extended the first experiment to also assess the accumulation of information while the stimuli were visible. In this case, the exposure duration was varied while the delay before the visual cue was held constant. In both experiments we varied the number of stimuli in the array (set size) to assess capacity limitations affecting encoding and maintenance.

Experimental procedure. (A) Experiment 1. On each trial, a memory array was presented consisting of 1, 4, or 10 randomly oriented Gabor stimuli. In 50% of all trials, the stimuli underwent a change of phase and contrast towards the end of the exposure period intended to minimize retinal aftereffects. After a variable delay, an arrow cue was shown pointing towards the location of one stimulus from the preceding array. Observers reported the remembered orientation of the cued stimulus by swiping their index finger on the touchpad. The response was followed by feedback showing the true orientation. (B) In a proportion of trials, the cue was presented simultaneously with the stimuli. (C) Experiment 2. On each trial a memory array consisting of 1, 4, or 10 randomly oriented Gabors was presented for a variable duration, and followed by a white noise flickering mask. The mask was replaced by an arrow cue pointing towards the location of one stimulus from the preceding array. Observers reported its remembered orientation and received feedback as in Experiment 1. Stimuli are not drawn to scale.

To provide additional validation of the DyNR model, we also tested its predictions against data from a previously published continuous report experiment (Experiment 1 in 12) and one additional dataset collected as part of a separate study [62]. A detailed description of all experiments is provided in Supplementary Information.

Results

Experiment 1: Delay duration

In Experiment 1, we evaluated the time course of VWM fidelity over brief memory intervals. Previous work has demonstrated that immediately after a stimulus physically disappears, its representation briefly persists in the sensory system in the form of residual neural activity [36]. Accumulation of this lingering sensory activity into VWM could enable superior recall of information [56] within the constraints of a finite VWM resource that strongly limits representational fidelity [3]. To describe these dynamics, we examined human recall of orientation stimuli presented in arrays of varying sizes and probed after a variable delay ranging from 0 ms to 1000 ms. Here we focus on an experimental condition in which retinal afterimages were suppressed by a phase shift towards the end of stimulus presentation. Validation of this method and results from the condition without a phase shift are provided in the Supplementary Information.

Experimental data

Recall error distributions and mean performance in Experiment 1 are plotted in Figs. 4A and B. Response error (measured as RMSE) increased with both set size and delay duration. A repeated measures ANOVA revealed a significant effect of set size (F(2,18) = 117.8, p < .001, η² = .44), delay time (F(5,45) = 52, p < .001, η² = .23), and their interaction (F(10,90) = 26.7, p < .001, η² = .13) on response error. We further explored this interaction, first finding that response error in the 1 item condition (red in Fig. 4) did not change with delay (F(5,45) = 1.32, p = .27, η² = .07). This was supported by Bayesian analysis (BF10 = 0.34), which found weak to moderate evidence against modulation of 1 item recall by memory delay. In contrast, response error increased with delay for the remaining two set sizes (4 items, green; 10 items, blue; main effect: F(5,45) = 55, p < .001, η² = .48). This increase in response error consisted of an initial rapid rise (over the first 200 ms), followed by a more gradual increase as the delay between stimulus and cue increased. Next, we found that set size modulated the effect of delay on recall for these two set sizes (interaction: F(5,45) = 10.1, p < .001, η² = .05). The direct comparison revealed that the increase in response error with delay (ΔRMSE = RMSE1000ms − RMSESimult) was greater when observers memorized more items (t(9) = 9.1, p < .001, d = 2.88).

Experiment 1 data and model fits show the consequences of varying set size and delay duration on WM reproduction error. (A) Empirical recall error distributions (black circles) and the DyNR model fits (colored curves). Different panels correspond to different set sizes (rows) and delays (columns). (B) Corresponding RMS errors from experimental data (circles and error bars) and the DyNR model fits (curves and error patches). Error bars and patches indicate ± 1 SEM.

One surprising result was the observed set size effect in the 0 ms delay condition (F(2,18) = 23.7, p < .001, η² = .53) consistent with a stepwise increase in recall error with set size (pairwise comparison, t(9) ≥ 2.88, p ≤ .036, d ≥ 0.91, Bonferroni correction applied). Importantly, this effect was a consequence of responding based on a memory of the stimulus, since orientation reproduction was comparable across set sizes in the perceptual condition (simultaneous presentation; F(2,18) = 1.26, p = .3, η² = .04, BF10 = 0.47). Previous studies have characterized iconic memory as an effectively unlimited store, capable of holding any number of items without a consequent loss of fidelity [63, 30]. While our modelling ultimately affirmed this conception of IM, we nonetheless show that recall of information is contingent on the number of objects concurrently in memory from the moment stimuli physically disappear (see below).

Taken together, these results provide evidence that the fidelity of stored representations changes dramatically over the first few moments after stimulus offset. We next aimed to explain the neural computations supporting these dynamics. In summary, behavioral data displayed three key characteristics we aimed to explain, all visible in Fig. 4B. First, recall fidelity for a single item remained relatively stable across changes in delay, and was the same as perceptual fidelity. Second, recall fidelity for higher set sizes showed substantial, non-linear temporal dynamics. Lastly, recall fidelity was contingent on the number of stored items from the moment stimuli disappeared.

Dynamic Neural Resource model

Curves in Figs. 4A and B show fits of the model with maximum likelihood (ML) parameters (mean ± SE: population gain γ = 59.8 ± 3.3, tuning width κ = 3.21 ± 0.2, sensory decay time constant τdecay = 0.21 ± 0.052, VWM accumulation time constant τwm = 0.096 ± 0.045, cue processing constant b = 0.171 ± 0.055 s, base diffusion σ² = 0.03 ± 0.017, swap probability p = 0.027 ± 0.009). The model provided a close fit to response error distributions (Fig. 4A) and summary statistics (Fig. 4B; see also Fig. S2 for reproduction of swap error frequencies), successfully reproducing the pattern of changes with set size and delay. In particular, the model accounted for the three key observations identified above.

First, the model predicted the near-constant recall fidelity observed for a single item across these short retention intervals. The neural signal associated with the target object at recall depends on the normalized signal in VWM at offset supplemented by the available sensory signal post-cue. The sensory signal is integrated into VWM after the cue to fill any unallocated neural resource that arose by discarding uncued items. In the case of a single item, the entirety of VWM resources is allocated to one object during encoding, so no resource is freed by the cue that would allow the signal to be further strengthened based on the decaying sensory representation.

Importantly, this prediction contradicts the classical view of direct read-out from IM, according to which representational fidelity should be enhanced with very short delays irrespective of VWM limitations (see Alternative accounts below for a formal test of such a model). Note that the DyNR model nonetheless predicts some deterioration in fidelity over time even for a single item, due to noise-driven diffusion of the stored value. However, based on previous reports, we expected this process to be substantially slower and the impact on single item precision relatively small on this (1 s) timescale. The fitted diffusion parameters and resulting shallow slope of fitted RMS error (red curve in Fig. 4B) confirmed this.

Second, the neural model predicts the specific pattern of dynamics observed in trials with multiple items (set sizes 4, green, and 10, blue curves). Once the cue is presented, resources encoding uncued items are freed and the decaying sensory signal representing the target item is further integrated into VWM, still subject to limited total VWM resources but now without competition from other items. Due to exponential decay of the sensory signal, the increase in fidelity thus accrued changes rapidly with retention interval over the first few hundred milliseconds. At longer delays, the cue identifies the target only after the sensory signal has effectively disappeared, so the VWM signal representing the target item remains at the normalized level reflecting equal distribution between all items in the memory array, and memory dynamics consist only of the more gradual deterioration of fidelity due to accumulated noise in the encoded value.
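The post-cue integration described here can be illustrated with a minimal sketch. It is not the model's actual implementation: it assumes that the sensory signal decays as exp(-t/τdecay) after offset, that the target's VWM gain starts at the normalized baseline γ/N, and that after the cue it approaches the ceiling γ at a rate proportional to the sensory signal times the unfilled resource, divided by τWM. The function name, the exact form of the rate equation, and the treatment of time constants as seconds are assumptions made for illustration only; the default values are the fitted means reported above.

```python
import math

def final_vwm_gain(set_size, cue_time, gamma=59.8, tau_decay=0.21, tau_wm=0.096):
    """Target gain after post-cue integration (illustrative assumption).

    Assumed dynamics after the cue, with s(t) = exp(-t / tau_decay):
        dA/dt = s(t) * (gamma - A) / tau_wm,   A(cue_time) = gamma / set_size
    Integrating from cue_time to infinity gives the closed form below.
    """
    baseline = gamma / set_size  # normalized gain per item at stimulus offset
    # Time integral of the remaining sensory drive, in units of tau_wm
    integrated_drive = (tau_decay / tau_wm) * math.exp(-cue_time / tau_decay)
    # Unfilled resource (gamma - baseline) shrinks exponentially with the drive
    return gamma - (gamma - baseline) * math.exp(-integrated_drive)

for n in (1, 4, 10):
    gains = [final_vwm_gain(n, t) for t in (0.0, 0.1, 0.3, 1.0)]
    print(n, [round(g, 2) for g in gains])
```

Under these assumptions, a single item sits at the ceiling for every cue time, whereas for larger arrays the gain is highest with an immediate cue and falls toward γ/N as the cue is delayed, while remaining below the ceiling even at zero delay.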

Finally, the DyNR model predicts the presence of a set size effect on fidelity throughout the entire memory period, including the no delay (0 ms) condition in which the cue onset was coincident with stimulus offset, but not in the simultaneous cue condition. In the model, this behavior emerges as a consequence of two independent processes. First, at the end of stimulus presentation, items within smaller (lower set size) arrays are encoded in VWM with higher signal amplitude, reflecting normalization. This signal strength represents a baseline that can be supplemented by further integration of the sensory signal after an early cue. However, if the sensory decay is sufficiently rapid, then even if the cue is presented immediately, a target representation starting from a lower baseline will not attain the maximum amplitude (equivalent to a set size of one). Second, as described by Hick’s Law [58], it takes longer to identify the target item based on the cue as the number of alternatives increases (see Alternative accounts below for a formal test of this assumption). As a result, for higher set sizes, less sensory signal encoding the target item remains to be integrated into VWM once it has been identified.

Model variants

We next focused on alternative explanations for the temporal dynamics observed in Experiment 1. Specifically, we examined whether the observed dynamics could be accounted for either solely by post-stimulus changes in neural signal amplitude or solely by noise-driven diffusion of stored values. To pre-empt our conclusions, we demonstrate that both components are needed to explain the observed dynamics in memory fidelity. Moreover, to more closely examine the role of diffusion in WM dynamics, we fit our neural model to an additional dataset collected in our lab ([62]; see Additional dataset 1 in Supplementary Information). This experiment used longer delays compared to those used in Experiment 1, and therefore precluded any beneficial effect of post-stimulus sensory information, while at the same time allowing the diffusion to operate over a longer period. This experiment allowed us to test whether diffusion is sufficient to account for human recall errors with longer memory delays.

Fixed neural signal

A recent computational study on forgetting in VWM proposed that diffusion is sufficient to explain memory dynamics over delay [10]. To test this, we developed two reduced versions of the DyNR model in which the diffusion process was solely responsible for memory fidelity dynamics. In both variants, the sensory signal terminated abruptly at stimulus offset, so the VWM signal encoding the stimuli was independent of the delay duration and equal to the limit imposed by normalization. In the first variant, the diffusion rate was constant across set sizes, as in the full model. The formal model comparison demonstrated that the full DyNR model performed better than this simplified alternative (ΔAIC = 609.5).

In the second variant, we allowed the diffusion rate to increase proportionally with set size (for a similar proposal see [64]). This model was again outperformed by the full DyNR model (ΔAIC = 666.4). Critically, both models tested here failed to qualitatively reproduce the observed non-linear pattern of changes in recall error with time, notably overestimating recall error at the shortest delays by assuming no modulation in the representational signal (Fig. S3).
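For readers unfamiliar with the comparison metric, the following sketch shows how a ΔAIC value of the kind reported here is obtained from maximized log-likelihoods and parameter counts. The helper names and the example numbers are hypothetical and are not taken from the fits; how log-likelihoods were pooled across participants is not specified in this section and is not assumed here.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def delta_aic(candidate, reference):
    """AIC of a candidate model minus AIC of the reference model.

    Each argument is a (log_likelihood, n_params) tuple.
    Positive values favor the reference model.
    """
    return aic(*candidate) - aic(*reference)

# Hypothetical example: a reduced model with fewer parameters but a
# lower maximized log-likelihood than the full model.
reduced = (-5300.0, 5)
full = (-5000.0, 7)
print(delta_aic(reduced, full))
```

Because the penalty term grows by only 2 per parameter, a reduced model is preferred only if dropping parameters costs less than one log-likelihood unit per parameter removed.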

Diffusion

We developed two variants of the proposed neural model to test the role of diffusion. In the first variant, we completely omitted the diffusion process from the model to test whether the sensory signal modulation during the retention period is sufficient to explain temporal dynamics in recall fidelity. It could be argued that diffusion accounts for only minor changes in precision over brief delays as used here, and therefore adds unnecessary complexity to the proposed model without improving the fit substantially. However, the formal model comparison revealed that the full DyNR model provides a better fit to human recall error compared to the matching model without diffusion (ΔAIC = 17.9).

The second variant was identical to the proposed model, except that we replaced the constant diffusion rate with a set-size-scaled diffusion rate by multiplying the right side of Eq. 6 by N. The model comparison showed that the full DyNR model also outperformed this variant (ΔAIC = 29.8). While both model variants qualitatively reproduced the increase in memory error with delay and set size, the pattern of variability was better explained by the model with a constant diffusion rate across set sizes. Although a more substantial diffusion effect could become apparent with longer delays than those used here, previous work demonstrated that noise-driven diffusion causes representations to deteriorate throughout the entire retention period [60].
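The difference between a constant and a set-size-scaled diffusion rate can be sketched as follows. The sketch assumes Brownian diffusion on an unbounded line, so that the variance of the stored value grows linearly with delay (rate × delay); it ignores the circular nature of the feature space and does not reproduce Eq. 6. The function names are hypothetical, and the base rate of 0.03 is the fitted mean from Experiment 1, used here only as an example value.

```python
import random
import statistics

def diffusion_variance(delay, base_rate, set_size=1, scale_with_set_size=False):
    """Variance accumulated by a Brownian random walk over `delay`.

    With scale_with_set_size=True the rate is multiplied by the number of
    items, mirroring the set-size-scaled variant described in the text.
    """
    rate = base_rate * set_size if scale_with_set_size else base_rate
    return rate * delay

def simulate_drift(delay, base_rate, n_steps=50, n_trials=4000, seed=0):
    """Monte Carlo check: variance of random-walk end points."""
    rng = random.Random(seed)
    dt = delay / n_steps
    step_sd = (base_rate * dt) ** 0.5  # SD of each Gaussian increment
    finals = []
    for _ in range(n_trials):
        x = 0.0
        for _ in range(n_steps):
            x += rng.gauss(0.0, step_sd)
        finals.append(x)
    return statistics.pvariance(finals)

print(diffusion_variance(1.0, 0.03))                      # constant rate
print(diffusion_variance(1.0, 0.03, 4, True))             # scaled by N = 4
print(simulate_drift(1.0, 0.03))                          # simulation estimate
```

In this simplified picture the two variants differ only in slope: the scaled variant predicts error variance growing N times faster with delay, which is the kind of difference the model comparison is sensitive to.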

Finally, we examined the role of diffusion with longer memory intervals in a separate experiment using variable set sizes and memory intervals (1 and 7 seconds) (for full details see Additional dataset 1 in Supplementary Information). We demonstrated that, once sensory information had decayed completely, an accumulation of error during the retention interval accounted for continuing memory deterioration. Together, the results presented here corroborate findings on the role of diffusion in temporal dynamics of recall fidelity [9].

Experiment 2: Exposure duration

In Experiment 2, we evaluated the encoding phase of VWM, by testing recall of orientation stimuli displayed in arrays of variable size presented for variable durations. In the DyNR model, increasing the sensory evidence by prolonging stimulus presentation has a favorable effect on later recall of stimulus, as more of that evidence can be accumulated into VWM. Importantly, this accumulation is also capped by the VWM resources available to store it (Fig. 5).

Time course of sensory and WM gain with variable exposure duration. (A, B) The signal amplitude in the sensory population increases from stimulus onset, exponentially approaching the maximum sensory activity. For shorter presentation durations (A) the attained amplitude at stimulus offset is only a fraction of the maximum (compare B, late offset). Following offset, sensory areas produce a decaying neural response that is curtailed (faster decay) but not abolished by a backward mask. (C, D) Information about the stimulus is accumulated in WM from sensory activity. A shorter presentation (C) provides less sensory evidence for the initial accumulation of all items into VWM (compare D, late offset), and subsequently less decaying sensory activity that can supplement VWM activity for the target item following the cue.

Experimental data

Figure 6 shows the response error for different presentation durations and set sizes. Consistent with previous findings, response error can be seen to decrease with prolonged presentation duration, but increase as the number of items in memory increases. This was confirmed with a significant effect of display duration (F(6,72) = 29.01, p < .001, η2 = .21), set size (F(2,24) = 112.51, p < .001, η2 = .54), and their interaction (F(12,144) = 2.58, p = .004, η2 = .019).

Experiment 2 results and modelling data show the consequences of varying set size and stimulus exposure time on VWM reproduction error. (A) Empirical recall error distributions (black circles) and the DyNR model fits (colored curves). Different panels correspond to different set sizes (rows) and exposure durations (columns). (B) Corresponding RMS errors from experimental data (circles and errorbars) and the DyNR model fits (curves and error patches). Error bars and patches indicate ± 1 SEM.

We further explored this interaction by first confirming that response error decreased with display duration within each set size (F(6,72) ≥ 10.24, p < .001, η2 ≥ .26). A consistent pattern was observed across set sizes, comprising an initial rapid decrease in response error over the briefest presentation times (first 200 ms), followed by saturation at prolonged exposure durations. Next, we calculated the change in recall error between the longest and the shortest display exposure within each set size, revealing that response error decreased more rapidly with display time as the number of items in memory decreased (ANOVA: F(2,24) = 7.79, p = .002, η2 = .21; corrected pairwise comparisons: set sizes 1 vs. 4, t = 3.65, p = .016, d = 0.87; set sizes 4 vs. 10, t = 0.96, p = .72, d = 0.27).

These results reveal the time course of information accumulation into VWM and the formation of stable representations. We again identified several key characteristics of the dynamics of recall fidelity in the data (Fig. 6B) to test against the DyNR model. Consistent with previous studies, we found recall fidelity changed with both presentation duration and the number of presented stimuli [38–40]. Specifically, as display duration increased from the shortest exposure, recall error showed an initial rapid decrease followed by a gradual levelling-off. As set size increased, the initial slope became shallower and the plateau occurred at a higher level of error.

Dynamic Neural Resource model

Curves in Figs. 6A & B show fits of the model with maximum likelihood (ML) parameters (mean ± SE: population gain γ = 188.5 ± 109.6, tuning width κ = 10.2 ± 6.08, sensory rise time constant τrise = 0.33 ± 0.18, sensory decay time constant τdecay = 0.61 ± 0.19, VWM accumulation time constant τWM = 0.8 ± 0.34, cue processing constant b = 0.2 s ± .09 s, base diffusion σ2 = 0.28 ± 0.08, spatial uncertainty time constant τspatial = 0.013 ± 0.004, swap probability p = 0.053 ± 0.01). The model provided an excellent quantitative fit to response distributions (Fig. 6A) and RMSE (Fig. 6B), successfully reproducing the pattern of changes with set size and presentation duration.

The model predicted that information from a visible stimulus accrues at a high rate immediately after the stimulus onset, consistent with observed changes in human recall error over stimulus durations up to 200 ms (Fig. 6). This initial high encoding rate emerges naturally in the model due to the joint dynamics of sensory and VWM populations. In the sensory population, a low-pass temporal filter serves as a neural gain control mechanism, attenuating neural response to transient changes in stimuli [44, 47]. As a consequence, the neural response to stimulus onset increases exponentially (Fig. 5). The information from sensory areas is accumulated into VWM, such that the accumulation rate is directly proportional to the difference between the current and saturating state (i.e. the rate is faster when accumulated information is far from the saturating state). Therefore, dynamics in the sensory and VWM population jointly account for the initial high rate of information extraction from stimuli, and its dependence on set size.

After the initial steep change, the model predicts that recall fidelity will asymptote. This was again observed in human behavior (Fig. 6). Extending stimulus presentation beyond 200 ms had negligible impact on recall precision, consistent with previous studies [38]. The model explains this behavior by describing how sensory signal and VWM accumulation independently saturate with time (Fig. 5). Since the temporal filtering in the sensory population attenuates only high-frequency stimuli (i.e. very short presentations), with sufficient exposure, the sensory signal plateaus, resulting in a stable feed-forward input to VWM. Similarly, VWM signal strength is subject to limits determined by normalization. Once the accumulated information reaches the normalized maximum set by the number of objects in memory, further accumulation of sensory evidence is not possible. Following the cue, a portion of the resource is freed, allowing the target representation to be further strengthened. However, because the sensory signal plateaus at longer exposures, the information available for integration after the cue remains constant across the longer exposures, supplementing normalized VWM signal by the same amount. The result is a plateau in fidelity that varies with set size.

Model variants

We investigated whether post-stimulus sensory persistence contributed to the model fits in Experiment 2. We assumed that the signal persisting after stimulus offset would be impaired but not eliminated by the subsequent presentation of a noise mask in this experiment [52]. An alternative account suggests that the mask immediately terminates any stimulus-related signal. To test for this, we fit a variant of the DyNR model in which the sensory signal was terminated by the onset of the mask, providing a feed-forward signal to VWM only for the period of the stimulus presentation. We found that the proposed DyNR model, in which some sensory signal persists after the mask onset, gave a better account of the data than this model variant (ΔAIC = 446.67). Although the alternative model captured the general pattern of changes in memory fidelity with exposure duration, it mispredicted fidelity at shorter exposures, in particular the effect of set size (Fig. S4A).

A testable prediction of this alternative model is that the memory fidelity at recall should obey the neural normalization principle because there was no additional signal to supplement the representation after initial encoding. To test this, we additionally fitted each exposure condition separately using the original Neural Resource model with only three parameters (i.e., neural gain, tuning width, and swap probability). This model failed to predict actual fidelity levels at recall (Fig. S4B), corroborating the findings of the model comparison.

Finally, to investigate the role of the post-stimulus sensory persistence on encoding dynamics, we fit the DyNR model to an additional dataset from Bays et al. [38] (for full details see Additional dataset 2 in Supplementary Information). This experiment aimed to investigate VWM dynamics during encoding, like our Experiment 2. In contrast to our Experiment 2, Bays et al. [38] used a much longer delay interval (1100 ms vs 100 ms), precluding the possibility of further accumulation of sensory evidence following the cue. We expected that the DyNR model could account for memory dynamics in this study without any post-stimulus sensory activity. This was confirmed by accurately reproducing memory dynamics with a model in which encoding into VWM relied only on sensory evidence during stimulus presentation (detailed results in Supplementary Information).

Alternative accounts

Having demonstrated the need for both post-stimulus sensory persistence and diffusion to account for empirical data, we next considered alternatives to our account of VWM accumulation and information read-out.

Direct read-out of sensory information

In the DyNR model, recall fidelity is enhanced following the cue by integrating remaining sensory activity into capacity-limited VWM. As a consequence, response precision is bounded from above by the memory limit irrespective of the available sensory signal. An alternative possibility is that the decaying sensory representation can be directly read out following the cue to inform a response, bypassing working memory limitations. To formalize this alternative model, we assumed that independent sensory and VWM representations would be optimally combined via summation of neural activity to yield population gain

The model is otherwise identical to the proposed DyNR model. A distinctive prediction of this model is that the precision of recall changes exponentially with delay for every set size, including 1 item (Fig. S5). This prediction is qualitatively inconsistent with the pattern of results observed in Experiment 1, in contrast with the DyNR model which does not predict any beneficial effect of earlier cues with set size 1. This alternative model provided a worse fit to data from Experiment 1 (ΔAIC = 164) and Experiment 2 (ΔAIC = 84.6), for combined evidence favouring the DyNR model of ΔAIC = 248.6.
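The qualitative signature of direct read-out can be illustrated with a short sketch. It assumes, for illustration only, that the gain available at recall is the sum of the normalized VWM gain γ/N and an exponentially decaying sensory gain; the function name, the `sensory_peak` parameter, and its value are hypothetical and do not come from the fitted model.

```python
import math

def direct_readout_gain(set_size, cue_time, gamma=59.8,
                        sensory_peak=30.0, tau_decay=0.21):
    """Summed gain under direct read-out (illustrative assumption).

    VWM contributes the normalized gain gamma / set_size; the sensory
    population contributes sensory_peak * exp(-cue_time / tau_decay),
    with no ceiling imposed by VWM capacity.
    """
    vwm_gain = gamma / set_size
    sensory_gain = sensory_peak * math.exp(-cue_time / tau_decay)
    return vwm_gain + sensory_gain

# Even with one item, the summed gain depends on when the cue arrives.
for t in (0.0, 0.1, 0.3, 1.0):
    print(t, round(direct_readout_gain(1, t), 2))
```

Because the sensory term is added rather than integrated into a capacity-limited store, the summed gain exceeds γ at short delays and declines exponentially with delay at every set size, including one item, which is the prediction that the Experiment 1 data contradict.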

Cue processing

In the DyNR model, we assumed that identifying the target stimulus based on the cue is time-consuming, and becomes more so as the number of alternatives increases. Cue processing time encompasses perceptual, attentional, and decision components needed to interpret and act on the cue. We tested the necessity of this component by fitting a model variant in which VWM started accumulating evidence about the cued item at the moment of cue presentation. This model provided a worse fit to empirical data from both Experiment 1 (ΔAIC = 84.5) and Experiment 2 (ΔAIC = 107.5), for total evidence in favor of the DyNR model of ΔAIC = 192 (Fig. S6). We fit another variant in which cue processing time was constant across set sizes. This alternative provided a worse fit to the data in Experiment 1 (ΔAIC = 191.6) and Experiment 2 (ΔAIC = 105), for combined evidence ΔAIC = 296.6 in favor of the full DyNR model that assumes cue processing time increases with set size. These results corroborate previous findings on the important role of cue processing time in models of attention [59] and IM [65].
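A minimal sketch of set-size-dependent cue processing is given below. It assumes the standard form of Hick's Law, T = b · log2(N + 1), and an exponentially decaying sensory signal; whether the model uses exactly this form is not stated in this section, so the functions should be read as hypothetical illustrations. The default values of b and τdecay are the fitted means from Experiment 1, with τdecay assumed to be in seconds.

```python
import math

def cue_processing_time(set_size, b=0.171):
    """Time to identify the target from the cue, assuming Hick's Law
    in its standard form: b * log2(N + 1)."""
    return b * math.log2(set_size + 1)

def residual_sensory_signal(set_size, delay, b=0.171, tau_decay=0.21):
    """Fraction of the sensory signal left when the target is identified.

    The signal is assumed to decay exponentially from stimulus offset,
    and integration can only start after delay + cue processing time.
    """
    t_identified = delay + cue_processing_time(set_size, b)
    return math.exp(-t_identified / tau_decay)

for n in (1, 4, 10):
    print(n, round(cue_processing_time(n), 3),
          round(residual_sensory_signal(n, 0.0), 3))
```

Under these assumptions, even a cue presented at stimulus offset leaves progressively less sensory signal to integrate as set size grows, which contributes to the set size effect at zero delay.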

Constant accumulation rate

In the DyNR model, the rate of accumulation into VWM is proportional to the difference between the present VWM amplitude and the maximum normalized amplitude (Eq. 2). An arguably simpler assumption is that the neural signal approaches saturation at a constant rate [66, 67]. In particular, the rate at which the signal representing an item is transferred to VWM is constant and depends only on the number of encoded items, i.e.

The dependence on M(t) satisfies the constraint that the neural resources in VWM are allocated at a constant rate, irrespective of the number of items. We applied this model to psychophysical data from both experiments (Fig. S7) and found it provides a worse fit to the data from Experiment 1 (ΔAIC = 11.5) and Experiment 2 (ΔAIC = 36.2), for combined evidence favouring the DyNR model with exponential saturation (ΔAIC = 47.7).
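The contrast between the two accumulation rules can be sketched as follows. Both functions are illustrative assumptions rather than the paper's equations: the exponential rule ignores sensory dynamics and lets the per-item gain approach the normalized ceiling γ/N with time constant τWM, while the constant-rate rule shares a fixed total rate equally among N items until the same ceiling is reached. The names, `total_rate`, and the default values are hypothetical.

```python
import math

def exponential_accumulation(t, set_size, gamma=1.0, tau_wm=0.1):
    """Per-item gain when the rate is proportional to unfilled capacity."""
    ceiling = gamma / set_size
    return ceiling * (1.0 - math.exp(-t / tau_wm))

def constant_rate_accumulation(t, set_size, gamma=1.0, total_rate=10.0):
    """Per-item gain when total resource is allocated at a constant rate.

    Each of the N items receives total_rate / N per unit time, so the
    summed allocation is independent of N until the ceiling is hit.
    """
    ceiling = gamma / set_size
    return min(ceiling, (total_rate / set_size) * t)

for t in (0.02, 0.05, 0.1, 0.3):
    print(t,
          round(exponential_accumulation(t, 4), 4),
          round(constant_rate_accumulation(t, 4), 4))
```

The constant-rate rule produces a linear rise with an abrupt stop at the ceiling, whereas the exponential rule slows continuously as the ceiling is approached; the model comparison above favours the latter shape.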

Discussion

In the present study, we investigated the temporal dynamics of short-term recall fidelity. We conducted two new human psychophysical experiments and analyzed two existing datasets in order to characterize how recall errors are influenced by set size, stimulus duration and retention interval. We developed a Dynamic Neural Resource (DyNR) model to provide a mechanistic explanation of the observed behavior, capturing not only changes in overall fidelity but also the distribution of errors in the stimulus space and frequencies of swaps (intrusion errors). A key finding is that the benefit to recall precision observed at very short delays is due to additional post-cue integration of sensory information into working memory, and that direct retrieval from sensory memory is unable to account for the empirical patterns of error.

Sensory and WM dynamics during delay

In the first experiment we investigated the effects of brief unfilled delays on recall fidelity. With multi-item arrays, we observed that memory performance deteriorates precipitously over the first few hundred milliseconds after stimuli disappear, followed by a gradual levelling-off of error with longer delays (Fig. 4). These results are consistent with previously reported patterns of memory dynamics [29–31, 33], and estimates of sensory decay ranging between 100 ms and 400 ms [68, 69]. Here, we shed new light on these results by taking a computational approach to explaining the observed temporal dynamics, asking what the neural origin of this superior recall is and how it relates to VWM. To answer these questions, we extended the Neural Resource model of Bays [12] with a temporal component. The new DyNR model considers dynamics in a sensory neural population registering the stimuli and in a VWM population that stores the stimuli for later recall. Critically, our model assumes that objects encoded with limited precision into VWM can be flexibly supplemented with sensory activity following a recall cue, within a brief temporal window while the sensory population provides a feed-forward input post-stimulus. The boost in the representational VWM signal predicts a behavioral benefit of early cues that is consistent with our data and a large corpus of previous experiments [56].

A common assumption in studies of visual short-term memory is that recall over brief delays is exclusively supported by one of two memory stores, IM or VWM [31, 32]. In this account, a cue presented within the first few hundred milliseconds after stimulus offset allows observers to access high resolution but rapidly deteriorating representations in IM; once the information in IM has decayed, objects must be retrieved from the capacity-limited VWM store. Two pieces of evidence from the current study contradict this view and strongly suggest that recall depends on VWM from the moment objects disappear. First, the recall benefit of short delays was not observed for one item arrays. We propose that this behavior reflects the fact that, during encoding, the entirety of the VWM resource is allocated to a single object, leaving no free capacity for further enhancement based on the available sensory signal post-cue. Second, we found clear evidence that recall fidelity varied with set size even with no delay between stimulus offset and cue (0 ms condition). We argue that this arises from the set-size dependence of representational fidelity in VWM, which is only incompletely compensated by integration of the decaying sensory signal post-cue, resulting in lower fidelity for higher set sizes. The DyNR model provides a successful quantitative account for these findings, which are in clear contrast with the traditional view of IM.

The rapid changes in fidelity over short delays can be distinguished from dynamics over longer retention intervals. A number of recent studies have observed a slow deterioration of VWM precision over the course of prolonged retention [9, 23, 24, 70–72]. The causes of this deterioration are still contested, but growing evidence links this behavior to noise-driven diffusion. At a mechanistic level, diffusion is considered a fundamental property of continuous attractor networks of the kind commonly associated with models of working memory [73, 74]. In such networks, memorized features are represented as persistent activity “bumps” in the network’s representational feature space. Over a memory delay, the activity bump is sustained by balanced excitatory and inhibitory connections, while stochasticity in neural activity causes shifts of the bump along the feature dimension, taking the form of a random walk. Although we did not model the network processes governing stability and diffusion within neural populations, our implementation is consistent with random (Brownian) perturbation, as assumed by attractor models (see also 9).

Our theoretical account of memory dynamics during delay differs from several existing models of forgetting, which emphasize diffusion as the dominant source of error in short-term memory (e.g., 10, 64). For diffusion alone to account for the observed data in Experiment 1, it would need to be strongest early in the retention period, followed by much weaker diffusion at longer delays. However, it is unclear why the diffusion rate would change, and particularly slow down, with time. Assuming a constant neural signal encoding the stimulus, this would predict greater variability in neural activity initially compared to the later period after stimulus offset. This is inconsistent with electrophysiological data showing relatively stable levels of spiking variability throughout the memory delay period [75–77]. The results observed here are consistent with the proposal that modulation of neural signal over short memory intervals accounts for an abrupt change in response fidelity, while diffusion accounts for a slower change that grows with time.

In the present study, a model assuming a constant diffusion rate, independent of the stored number of items, was preferred to one in which diffusion rate increases linearly with set size. This is consistent with results of Shin et al. [71] who did not find a significant effect of set size on the rate of memory deterioration. In contrast to that, Koyluoglu et al. [64] recently proposed that the rate of diffusion scales with set size. However, this study did not account for the presence of swap errors, which we found to increase with retention interval as well as set size. To draw strong conclusions about the dependence of diffusion on set size would require a future study to disentangle the different sources of error that could, in principle, increase with delay.

Sensory and WM dynamics during encoding

Having investigated memory degradation during the retention interval, in Experiment 2 we focused on the dynamics arising from accumulation of information during stimulus presentation. Using new psychophysical data, we showed that encoding of information into VWM is contingent on both presentation duration and the number of memorized stimuli. The observed patterns of data indicate that VWM encoding of elementary stimuli is mostly completed within the first 200 ms of presentation even at the largest set sizes, with minimal benefit of longer exposures, extending previous work [38–40]. This fast encoding process may have an adaptive role: with a key function of VWM to store and accumulate information across saccadic eye movements, an efficient system should deploy its resources within the duration of a typical gaze fixation [78, 79].

Our aim was again to move beyond the description of the encoding dynamics and to provide a biologically plausible neurocomputational account of these dynamics. To achieve that, we applied the same VWM accumulation process that operates post-cue to the sensory information during stimulus presentation. Using previously published and newly collected data, we show that a model in which VWM accumulates dynamical sensory input up to a fidelity limit can successfully account for patterns of human recall errors with variable set size and stimulus presentation. An important result of our modelling is that the accumulated information in VWM increases with a rate proportional to unfilled capacity. In particular, the model with such exponential accumulation provided a better fit than a model assuming a constant encoding rate. This parallels previous observations that models based on exponential-like extraction of information successfully characterize attention [80, 81], working memory encoding [38, 82], memory updating [83], and broader cognitive processes [84, 85]. We hypothesize that this pattern represents an approach to an equilibrium state of balanced excitation from the sensory input and lateral inhibition within the VWM population, which is the basis for capacity of the memory system.

In Experiment 2, the longest presentation duration shows an upward trend in error at set sizes 4 and 10. While this falls within the range of measurement error, it is also possible that this is a meaningful pattern arising from visual adaptation of the sensory signal, whereby neural populations reduce their activity after prolonged stimulation. This would mean less residual sensory signal would be available after the cue to supplement VWM activity, predicting a decline in fidelity at higher set sizes. Visual adaptation has previously been successfully accounted for by a type of delayed normalization model in which the sensory signal undergoes a series of linear and nonlinear transformations [86]. Such a model could in future be incorporated into DyNR and validated against psychophysical and neural data.

Our computational account of VWM encoding dynamics differs from several existing modelling frameworks aiming to explain similar data. For example, the Theory of Visual Attention (TVA; 80) assumes that visual stimuli participate in a parallel exponential race towards limited VWM. Like the DyNR model, TVA assumes a form of normalization in the sense that the speed with which items race towards VWM depends on the number of items in the visual field. Unlike our dynamic model, TVA is not a theory of VWM, and it considers VWM only as a store for categorizations of visual objects. In particular, TVA takes into account the limits of VWM but does not specify why or how these limitations arise. Finally, TVA considers whether an object was selected for entry into VWM in an all-or-none fashion; our dynamic model is mostly concerned with the fidelity of representations. An alternative account of VWM encoding is provided by the Competitive Interaction Theory (CIT; 87), which is similarly based on signal detection theory and principles of normalization [43]. Like TVA, CIT is mostly focused on item selection and merely incorporates a concept of VWM capacity derived from object-based models of VWM. Although CIT had success in accounting for behavioral data from a two-alternative orthogonal discrimination task using up to four items and a limited range of encoding times, it remains an open question whether this model can account for error distributions as measured in a continuous report task, and a larger range of set sizes and stimulus exposures. Importantly, compared to both TVA and CIT, the DyNR model is strongly rooted in and inspired by findings from neuroscience. This not only adds to the biological plausibility of our model but also allows future studies to test the model’s predictions using physiological methods.

Neural mechanisms

The theory presented here generalizes the Neural Resource model of Bays [12], a simple encoding-decoding model in which visual features are represented in the noisy spiking activity of neural populations [15], and where the activity representing each feature scales inversely with the total number of representations, consistent with the prevalence of normalization mechanisms in the brain and observations from single-neuron recording [88] and fMRI decoding [89] studies. The population coding in the model is based on an abstract idealization of neural response functions. Nevertheless, it has recently been shown that more realistic population coding schemes that allow for heterogeneity in neural tuning curves and correlated spiking activity as observed in visual cortex, maintain the key predictions of the idealized model [90, 13]. This may be seen as a consequence of the different population codes inducing a common representational geometry [91].

We adapted the stationary VWM model by first incorporating a sensory population that provides an input drive to the VWM population. In parallel with neurophysiological observations, a common approach is to model these dynamics with a low-pass filter which acts like a neural gain modulation mechanism [47]. As a consequence, the sensory response to stimulus onset and offset is an exponential rise and decay in activity, respectively. The decaying component of the response has been recognized as a neural substrate of visual persistence and IM [37, 36]. Here, we modelled sensory decay with an exponential function [92], although other forms of decay have been proposed. For example, Loftus et al. [68] showed that iconic decay could be better captured using a gamma survival function, a generalization of exponential decay that could simply be implemented in our neural model by replacing a single filter with a cascade of exponential low-pass filters.
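The link between a cascade of low-pass filters and gamma-shaped decay can be checked numerically. The sketch below assumes n identical first-order filters with time constant τ, all at steady state (output 1) when the input is switched off; in that case the output of the last stage follows the survival function of a gamma distribution with integer shape n (an Erlang distribution), which reduces to plain exponential decay for n = 1. Function names and parameter values are hypothetical and chosen only for illustration.

```python
import math

def gamma_survival(t, tau, n_stages):
    """Survival function of an Erlang(n_stages, 1/tau) distribution."""
    x = t / tau
    return math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(n_stages))

def simulate_cascade_decay(t_end, tau, n_stages, dt=1e-4):
    """Euler simulation of a filter cascade after the input drops to zero.

    Stage k obeys dy_k/dt = (y_{k-1} - y_k) / tau, with y_0 = 0 (input off)
    and every stage starting at its steady-state value of 1.
    """
    stages = [1.0] * n_stages
    for _ in range(int(round(t_end / dt))):
        inputs = [0.0] + stages[:-1]  # each stage is driven by the previous one
        stages = [y + dt * (u - y) / tau for y, u in zip(stages, inputs)]
    return stages[-1]

print(gamma_survival(0.3, 0.2, 1), math.exp(-0.3 / 0.2))   # single filter
print(gamma_survival(0.3, 0.2, 3), simulate_cascade_decay(0.3, 0.2, 3))
```

Adding stages delays the onset of decay and makes the curve more sigmoidal while keeping the same per-stage time constant, so the shape parameter can be changed without altering the rest of the filter model.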

In addition to the dynamics in the sensory population, two features of VWM introduce additional dynamics in representation fidelity: the accumulation of information (discussed above) and the diffusion of representations owing to accumulated noise. Although we did not aim to model the neural processes behind diffusion, our implementation is consistent with the consequences of neural variability in attractor networks [25, 74]. Converging neural evidence demonstrating such diffusion has been observed using single-unit neural recording in monkeys [26], as well as EEG [28] and fMRI [27, 93] studies in humans.
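The consequence of noise-driven diffusion for recall variability can be sketched as a random walk of the stored feature value. The Python example below is an illustration under stated assumptions rather than the fitted model: the function name, the diffusion rate (treated here as variance per second), the time step and the trial count are all hypothetical choices. It shows the defining property of diffusion, namely that the variance of the stored value grows linearly with the delay while the signal strength is left untouched.

```python
import numpy as np

def simulate_diffusion(sigma2=0.009, delay=1.0, dt=0.05,
                       n_trials=20000, seed=0):
    """Simulate drift of a stored circular feature value over a delay.

    sigma2 : diffusion rate, variance added per second (assumed units)
    delay  : retention interval in seconds
    dt     : simulation time step in seconds
    Returns the accumulated drift on each trial, wrapped to (-pi, pi].
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(delay / dt))
    # Independent Gaussian increments; each adds variance sigma2 * dt.
    steps = rng.normal(0.0, np.sqrt(sigma2 * dt), size=(n_trials, n_steps))
    drift = steps.sum(axis=1)
    return np.angle(np.exp(1j * drift))
```

Comparing the sample variance of the returned drift for delays of 1 s and 4 s should give values close to sigma2 and 4 * sigma2, respectively.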

As well as being implicated in higher cognitive processes including VWM [88, 89], divisive normalization has been shown to be widespread in basic sensory processing [94–96]. The DyNR model presently incorporates the former but not the latter type of normalization. While the data observed in our experiments do not provide evidence for normalization of sensory signals (note comparable recall errors across set size in the simultaneous cue condition of Experiment 1), this may be because sensory suppressive effects are localized and our stimuli were relatively widely separated in the visual field: future research could explore the consequences of sensory normalization for recall from VWM using, e.g., centre-surround stimuli [97].

Following onset of a stimulus, the visual signal ascends through visual areas via a cascade of feedforward connections. This feedforward sweep conveys sensory information that persists during stimulus presentation and briefly after it disappears [98]. Simultaneously, reciprocal feedback connections carry higher-order information back towards antecedent cortical areas [99]. In our psychophysical task, feedback connections likely play a critical role in orienting attention towards the cued item, facilitating the extraction of persisting sensory signals, and potentially signalling continuous information on the available resources for VWM encoding. While our computational study does not address the nature of these feedforward and feedback signals, a challenge for future research is to describe the relative contributions of these signals in mediating transmission of information between sensory and working memory [100].

Our model makes a clear distinction between dynamics in sensory and VWM populations; however, it remains agnostic as to whether the populations have the same or different anatomical locus [101]. Although the population tuning in the model was inspired by the properties of orientation-selective neurons in area V1, tuning of this kind is a common coding motif across the brain [15]. While it could be considered efficient to use already specialized circuits to maintain as well as process visual information, it is still debated whether sensory areas are a feasible candidate for memory storage [102, 103]. While some studies have focused on prefrontal [104], parietal [105] or occipital [106] cortices as the primary locus of VWM, others argue for distributed storage by demonstrating that VWM contents can be decoded from imaging signals originating in multiple brain areas [107].

Representational dynamics of cue-dimension features

Memory retrieval failures in which a non-cued item is reported in place of the intended target represent an important source of error in VWM recall. These swap errors occur more often at higher set sizes and when spatial confusability is high [108–111], as predicted by models in which they arise from uncertainty in the recall of cue-dimension features leading to incorrect selection of an item in memory [20, 61]. In the current study, we assumed memory for spatial location (the cue feature) undergoes similar dynamics to memory for orientation (the report feature), and in particular that spatial information degrades with retention time [9], leading to changes in swap error frequency with delay interval. Similarly, during encoding the fidelity of spatial representation increases with the accumulation of sensory evidence [112], reducing the uncertainty at retrieval and consequently swap errors at longer stimulus exposure. Although we did not explicitly model the neural signals representing location, the modelled dynamics in the probability of swap errors were consistent with those of the primary memory feature. We provided a more detailed neural account of swap errors in our earlier works that is theoretically compatible with the DyNR model [20, 61].

The DyNR model successfully captured the observed pattern of swap frequencies (intrusion errors). The only notable discrepancy between DyNR and the three-component mixture model (Fig. S2) arises with the largest set size and longest delay, although with considerable interindividual variability. As the variability in report-dimension increases, the estimates of swap frequency become more variable due to the growing overlap between the probability distributions of swap and non-swap responses. This may explain apparent deviations from the modelled swap frequencies with the highest set size and longest delay where orientation response variability was greatest.
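For readers unfamiliar with the three-component mixture model referred to above, the following Python sketch writes out a response density of the usual form: a von Mises component centered on the target, von Mises components centered on the non-target items (swap errors), and a uniform component (guessing). The function names and the example parameter values are assumptions for illustration. The sketch uses plain NumPy and is intended for moderate concentration values; very large kappa would need a numerically scaled Bessel function.

```python
import numpy as np

def von_mises_pdf(x, kappa):
    """Von Mises density on the circle with mean zero and concentration kappa."""
    return np.exp(kappa * np.cos(x)) / (2.0 * np.pi * np.i0(kappa))

def mixture_density(response, target, nontargets, kappa, p_swap, p_guess):
    """Density of a response under a target + swap + uniform mixture.

    All feature values are in radians on the circle. Swap probability is
    shared equally among the non-target items.
    """
    response = np.asarray(response, dtype=float)
    nontargets = np.atleast_1d(np.asarray(nontargets, dtype=float))
    p_target = 1.0 - p_swap - p_guess

    density = p_guess / (2.0 * np.pi)
    density = density + p_target * von_mises_pdf(response - target, kappa)
    if nontargets.size:
        swap = np.mean([von_mises_pdf(response - nt, kappa)
                        for nt in nontargets], axis=0)
        density = density + p_swap * swap
    else:
        # With a single item there is nothing to swap with, so the swap
        # weight is folded back into the target component (a design choice).
        density = density + p_swap * von_mises_pdf(response - target, kappa)
    return density
```

The overlap problem described in the text can be seen directly in this form: when kappa is low, the target and non-target components become broad and hard to distinguish from one another and from the uniform component, so estimates of p_swap become more variable.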

Removal of information from WM

In the DyNR model, taking advantage of early cues requires rapid removal of the VWM signal associated with uncued items, to admit further accumulation of activity encoding the cued item. To achieve this, an active process of selective content elimination may be required [57], as opposed to a passive decay of uncued representations during the post-cue interval. Mounting evidence for such active removal has been provided at the behavioral [113] and neural [114] level. Importantly, studies show that a functional role of such active removal is to release resources allocated to the uncued representations, facilitating the encoding of new information [115, 116]. The fast reallocation of neural resources assumed by the DyNR model is consistent with such a description of active removal.

Methods

Participants

A total of twenty-three naive observers (12 females, 11 males; aged 18–34) took part in the study after giving informed consent in accordance with the Declaration of Helsinki. Ten observers participated in Experiment 1 and thirteen observers participated in Experiment 2. Volunteers were recruited through the Cambridge Psychology research sign-up system. All observers reported normal color vision and normal or corrected-to-normal visual acuity, and were remunerated £10/hr for their participation.

General methods

Experimental setup

Stimuli were presented on a 69 cm gamma-corrected LCD monitor with a refresh rate of 144 Hz. Participants were seated in a dark room and viewed the monitor at a distance of 60 cm, with their head supported by a forehead and chin rest. Responses were collected using Magic Trackpad 2, a pointing device (16 x 11.5 cm) with a tactile sensor operating at ∼90 Hz (Apple Inc.). Eye position was monitored online at 1000 Hz using an infrared eye tracker (SR Research). Stimulus presentation and response registration were controlled by a script written in Psychtoolbox and run using Matlab (The Mathworks Inc.).

Stimuli

Memory stimuli consisted of randomly oriented Gabor patches (wavelength of the sinusoid, 0.65° of visual angle; s.d. of Gaussian envelope, 0.5°) presented on a uniform mid-grey background. The contrast of Gabor patches varied between experiments (see below). Memory stimulus positions were randomly chosen from a set of ten equidistant locations on the perimeter of an invisible circle with radius 6° centered at fixation. At the start of each trial, a black fixation annulus was shown (r = 0.15° and R = 0.25°) in the display center. Once steady fixation was registered, the size of the inner radius increased (r = 0.2°). Observers perceived this change as the annulus becoming thinner. The fixation annulus then stayed visible throughout the trial. Items were cued for recall by displaying a black arrow (2° length) extending from the center of the display and pointing to one of the previously occupied locations without overlapping with it.

Procedure

Each trial started with presentation of the central fixation annulus. Observers were required to maintain gaze fixation for 500 ms within a radius of 2° around the central annulus in order for a trial to proceed. Following stable fixation, the appearance of the fixation annulus changed, indicating that the memory array would appear in 500 ms. The memory sample array consisting of 1, 4, or 10 randomly oriented Gabor patches was then presented. This was followed by a delay period and finally a cue display, indicating to observers to report the memorized orientation of an item previously displayed at the indicated location.

Observers were instructed to reproduce the remembered orientation as accurately and as quickly as possible by executing a single movement of their index fingertip over the surface of the touchpad located centrally in front of them. Simultaneously with the observer’s movement, a blue line appeared on the screen, extending from the center of the screen and mimicking the observer’s response in real-time. The response was terminated if one of the following conditions was satisfied: the observer stopped movement for 500 ms; the observer lifted their finger from the touchpad; or the response line reached the edge of the display. This was followed by a feedback display, consisting of the actual orientation (shown with a white line) and reported orientation (shown with a blue line) overlaid at the location of the cued item. The recalled orientation was calculated as the angle of the line connecting a starting point and an endpoint of hand movement on the touchpad.
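The computation of the recalled orientation from the movement endpoints can be sketched as follows. This is a minimal Python illustration, not the experiment code: the function names are hypothetical, and the coordinate convention (x to the right, y upward, angles counterclockwise from horizontal) is an assumption. Because orientation is an axial quantity, angles are reduced modulo 180°, and the response error is wrapped to the range from -90° to 90°.

```python
import math

def recalled_orientation(start, end):
    """Orientation (degrees, in [0, 180)) of the line from start to end.

    start, end : (x, y) touchpad coordinates; y is assumed to increase upward.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    # Opposite movement directions describe the same orientation.
    return math.degrees(math.atan2(dy, dx)) % 180.0

def orientation_error(response, target):
    """Signed difference between two orientations, wrapped to [-90, 90)."""
    return (response - target + 90.0) % 180.0 - 90.0
```

For example, a movement from (0, 0) to (1, 1) and one from (0, 0) to (-1, -1) both map to an orientation of 45°, and a response of 170° to a target of 10° counts as an error of -20° rather than 160°.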

Observers were required to maintain central fixation during the stimulus presentation and delay phase. If gaze position deviated by more than 2°, a message appeared on the screen, and the trial was aborted and restarted with newly randomized orientations. Participants completed the task in blocks of 50 trials, and each block corresponded to one experimental condition. The order of blocks was randomized for every observer. At the beginning of the testing session observers familiarized themselves with the task and experimental setup by doing at most 50 practice trials.

Experiment 1

In Experiment 1 we investigated the temporal dynamics of VWM fidelity over short delays by presenting observers with sets of stimuli of variable size and then cueing one of them for recall after a variable delay relative to the stimuli offset. A typical trial sequence is shown in Figure 3A. The memory sample array (Michelson contrast = 0.5) was presented for 200 ms. In 50% of trials, the stimuli changed phase (by 180°) and contrast (Michelson contrast = 1) for the last 50 ms of presentation, while remaining at the same orientation. This manipulation was intended to minimize retinal after-effects (see e.g. 117 for similar techniques). The stimuli offset was followed by a variable blank delay of 0, 100, 200, 400, or 1000 ms, after which one item was cued for recall. In one additional condition, the cue was instead presented simultaneously with the memory sample array, indicating an item while it was still visible on the screen (Fig. 3B).

Each observer completed a total of 1800 trials, split into 36 blocks. The experiment was organized such that half of the observers first completed 18 blocks with phase shift (see above), and the other half first completed blocks without phase shift. Except for this constraint, block order was randomized for every observer. The testing was divided into four equal testing sessions, each lasting approximately 1.5 hours, with a separation of at least one day between sessions.

Experiment 2

In Experiment 2 we investigated the temporal dynamics of VWM fidelity during encoding. To this end, we displayed oriented stimuli for a variable duration and in sets of variable size. The experiment was similar to the previous experiment with a few exceptions (Fig. 3C). Each trial started with a presentation of a fixation annulus, followed by a memory array (Michelson contrast = 0.3). The stimuli stayed on the screen for a variable duration of 30, 48, 77, 122, 196, 313, or 500 ms, and were then replaced by noise masks (100 ms). Mask stimuli consisted of white noise at full contrast, windowed with a Gaussian envelope (0.5° s.d.) and flickering at 35 Hz. At the offset of the masking stimuli, one memory item was cued for recall. Each observer completed 21 blocks, for a total of 1050 trials. Blocks were spread over two testing sessions, each lasting approximately 1.5 hours, and taking place on different days. Observers completed 10 blocks in the first, and the remaining 11 blocks in the second session.
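The seven exposure durations listed above coincide, after rounding to the nearest millisecond, with values spaced evenly on a logarithmic scale between 30 and 500 ms. This is an observation about the numbers rather than a statement of how the durations were generated. The short Python check below reproduces them and also confirms the block and trial counts, assuming one block of 50 trials per combination of the three set sizes and seven durations.

```python
import numpy as np

# Seven values evenly spaced on a log scale from 30 to 500 ms, rounded.
durations_ms = np.round(np.geomspace(30, 500, num=7)).astype(int)

# Design arithmetic: one 50-trial block per set size x duration condition.
n_set_sizes = 3          # set sizes 1, 4 and 10
trials_per_block = 50
n_blocks = n_set_sizes * len(durations_ms)
n_trials = n_blocks * trials_per_block
```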

Data Availability

Data and code related to this study will be made available at https://doi.org/10.17863/CAM.95223.

Acknowledgements

We thank George Sperling and Sebastian Schneegans for helpful discussion, Robert Taylor for help with Bayesian hierarchical modelling, and Jessica McMaster for help with data collection. We used resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service. This research was supported by the Wellcome Trust (grant 106926 to PMB).

Supplementary information

Minimizing retinal after-effects

We assessed the method of minimizing retinal afterimages by repeating all measurements, with the exception of not using phase shift of stimuli (Fig 3A). We predicted retinal afterimages could serve as an additional source of information, but only for a brief period after stimuli offset. Therefore, here we expected to see better performance for brief delays compared to conditions with phase shift. Figure S1A shows recall error increased with both set size and delay. Both of these effects were statistically significant, as well as their interaction (set size: F(2,18) = 47.3, p < .001, η2 = .31; delay time: F(5,45) = 48.4, p < .001, η2 = .26; interaction: F(10,90) = 21.3, p < .001, η2 = .14), reminiscent of findings for data with phase shift.

Next, we focused on the comparison of conditions with and without phase shift of stimuli (Fig S1B). We illustrate the difference in performance by subtracting RMSE obtained in the condition with phase shift (Fig 4B) from RMSE shown in Figure S1A. Negative values indicate better performance in a condition without phase shift. As predicted, the overall pattern of data suggested performance was comparable for 1 item across all delays, and for all set sizes for extreme delays (simultaneous presentation and 1000 ms), indicated by the difference values around 0. We confirmed that recall error for 1 item did not differ consistently with and without phase shift at any delay, as neither phase shift (F(1,9) = 0.03, p = .86, η2 < .001, BFincl = 0.143) nor the interaction of phase shift and delay (F(5,45) = 0.41, p = .89, η2 = .00, BFincl = 0.042) reached significance. Based on this result, we conducted all remaining analyses using only the remaining two set sizes. We ran separate repeated measures ANOVAs for each delay using phase shift and set size as factors. The pattern of results we observed was clear: performance was comparable with and without phase shift with the simultaneous presentation and 1000 ms delay (phase shift, F(1,9) ≤ 1.08, p ≥ .33, η2 ≤ .002, BFexcl ≥ 3.62; interaction, F(2,18) ≤ 0.8, p ≥ .44, η2 ≤ .02, BFexcl ≥ 3.39), while for the remaining intermediate delays recall error was consistently lower when phase shift was omitted (phase shift, F(1,9) ≥ 5.8, p ≤ .039, η2 ≥ .06; interaction, F(1,9) ≤ 2.8, p ≥ .13, η2 ≤ .001).

(A) Experiment 1 RMSE for trials without phase shift. (B) Differences in RMSE between trials with and without phase shift across set size and delay conditions. Negative values indicate better performance in the condition without phase shift.

Taken together, performance with and without phase shift of stimuli was comparable in perceptual condition (simultaneous presentation) and with the longest delay, suggesting phase shift did not change visibility or encoding of information into VWM. In contrast, we found strong evidence that observers had access to an additional source of information over intermediate delays when phase shift was not used, demonstrated by a better recall performance from 0 ms to 400 ms delay. Specifically, this source of information was available immediately after stimuli offset and was short-lived, consistent with the theoretical description of retinal afterimages [118].

Swap error estimates

Swap error estimates. (A&B) Probability of swap errors estimated from empirical data using the three-component mixture model [108] in Experiment 1 (A) and Experiment 2 (B). (C&D) Probability of swap errors in best-fitting DyNR model in Experiment 1 (C) and Experiment 2 (D).

Alternative models’ fits

Experiment 1 behavioral data and model fit for the DyNR model without sensory persistence after stimulus offset. (A) A version of the DyNR model with equal diffusion across set sizes. (B) A version of the DyNR model with diffusion that scales with set size.

Experiment 2 behavioral data and model fit for the neural model without sensory persistence after stimulus offset. (A) A version of the DyNR model without sensory persistence. (B) Separate fits of the simplified neural model to each exposure time.

Behavioral data and model fit for a neural model with the direct read-out of information from sensory memory for (A) Experiment 1 and (B) Experiment 2.

Behavioral data and model fit for the DyNR model without the cue processing time for (A) Experiment 1 and (B) Experiment 2.

Behavioral data and model fit for a neural model with constant accumulation of information into WM for (A) Experiment 1 and (B) Experiment 2.

Additional dataset 1

To further investigate the role of diffusion in memory dynamics, we analysed an additional dataset collected in our lab [62]. In this experiment we varied the set size and delay duration similar to Experiment 1. In contrast to Experiment 1, we used longer memory delays, which allowed us to examine the diffusion mechanism on a more suitable time scale. Moreover, memory delays used in this study are out of reach of the decaying sensory information, enabling us to investigate the diffusion without changes in the neural signal strength post-cue.

Methods

Ten observers (6 females, 4 males, aged 18–34) took part in this experiment. The data for this experiment were collected using the same equipment and testing setting as described for the main experiments. A typical trial sequence is illustrated in Fig. S8. Each trial began with the presentation of a central annulus which served as a fixation point. Once a stable fixation was achieved, the inner annulus radius changed indicating that stimuli would appear in 500 ms. The memory sample array was then presented for a duration of 500 ms. The array consisted of one or three randomly oriented black bars (length 2.8°). Each bar was positioned in one of six predetermined locations equally distributed around a circle with a radius of 5° around the center of the screen. Each bar was presented along with a placeholder circle (radius 1.5°).

Experimental procedure. Stimuli are not drawn to scale.

Memory array presentation was followed by a memory delay during which fixation circle and placeholders stayed visible. The retention interval was either 1 or 7 seconds long. After that, one stimulus was randomly cued for recall. The cue consisted of a second, larger circle drawn around one of the placeholders. Observers were instructed to start rotating a response dial (Griffin Technology PowerMate USB) once they were ready to respond. After the rotation of the response dial was detected, a randomly oriented black bar was displayed within the placeholder. Observers were instructed to rotate the dial until the displayed bar matched the remembered orientation of the cued item. Observers confirmed their response by pressing the dial. Trials with different set sizes and delay durations were randomly interleaved.

Eye movements were monitored from the beginning of the trial until stimuli offset, and observers were required to hold steady fixation during that period. If the gaze position deviated by more than 2°, a message appeared on the screen and the trial was aborted and restarted with new orientations. Each observer completed 700 trials, divided into two sessions, each consisting of 7 equal blocks.

Two sessions were separated by at least one day, and each lasted approximately 1 hour. At the beginning of each session observers familiarized themselves with the task and experimental setup by doing at most 50 practice trials.

Results

Behavioral data

Recall performance is shown in Figure S9. As predicted, response error increased with set size and memory delay. A repeated measures ANOVA revealed a significant effect of set size (F(1,9) = 111.17, p < .001, η2 = .76) and memory interval (F(1,9) = 58.14, p < .001, η2 = .12), and their interaction (F(1,9) = 10.66, p = .01, η2 = .02) on response error. Moreover, conducting paired t-tests within each set size revealed recall error increased with the delay with set size 1 (t(9) = 5.83, p < .001, d = 1.84) and set size 3 (t(9) = 5.78, p < .001, d = 1.83). The interaction effect was a consequence of a larger increase in error with delay for set size 3 compared to set size 1 (ΔRMSE = RMSE7000ms − RMSE1000ms; t(9) = 3.27, p = .01, d = 1.03). These results are consistent with Experiment 1, corroborating our finding that increasing the set size and delay time have a disadvantageous effect on memory fidelity.
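The paired comparisons reported above can be illustrated with a short Python sketch of the paired t statistic and a matching standardized effect size. The function name is hypothetical, and the definition of Cohen's d used here (mean difference divided by the standard deviation of the differences) is an assumption; it was chosen because it implies d = t / sqrt(n), a relation that the reported values satisfy for n = 10 observers (for example, 3.27 / sqrt(10) ≈ 1.03).

```python
import numpy as np

def paired_t_and_d(x, y):
    """Paired t statistic, Cohen's d for paired samples, degrees of freedom.

    x, y : per-observer values in the two conditions (same observer order).
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = diff.size
    sd = diff.std(ddof=1)                 # sample s.d. of the differences
    t = diff.mean() / (sd / np.sqrt(n))   # paired t statistic
    d = diff.mean() / sd                  # standardized mean difference
    return t, d, n - 1
```

To test the interaction in the way described, x would hold each observer's ΔRMSE for set size 3 and y the corresponding ΔRMSE for set size 1.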

Behavioral data and model fit for Experiment 1a

Neural model

We fitted the DyNR model to the data to test whether noise-driven diffusion is sufficient to account for changes in recall fidelity with longer memory intervals. We applied a simplified version of the model without sensory decay and VWM accumulation components. This was justified given that the estimate of sensory decay from Experiment 1 was shorter (mean lifetime τ = 0.21 s) than the shortest interval used in this experiment (1 s). Moreover, based on our findings in Experiment 2, we argue that a display duration of 500 ms is sufficient to fully encode objects into VWM.

Curves in Figure S9 show fits of the model with maximum likelihood (ML) parameters (mean ± se: population gain γ = 385.02 ± 208.3, tuning width κ = 2.67 ± 0.43, cue processing constant b = 0.68 ± .67, base diffusion σ2 = 0.009 ± 0.001, swap probability p = 0.005 ± 0.002). The model provided an excellent quantitative fit to response distributions and summary statistics (Fig. S9), successfully explaining the adverse effects of set size and memory interval on recall fidelity. Critically, and consistent with results from Experiment 1, the proposed DyNR model provided a better fit to human response error compared to the matching model without diffusion (ΔAIC = 144.75) or the model in which diffusion rate increases with set size (ΔAIC = 42.3). In conclusion, this result shows that variability in representations over longer memory intervals can be fully accounted for by noise-driven accumulation without changes in the representational signal [9, 10, 28].
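For reference, the model comparison above relies on the Akaike Information Criterion, AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood. The Python sketch below spells out the arithmetic. The function names are hypothetical, and the sign convention (alternative model minus reference model, so that positive values favor the reference) is an assumption that matches the way positive ΔAIC values are described in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better model."""
    return 2.0 * n_params - 2.0 * log_likelihood

def delta_aic(loglik_alt, k_alt, loglik_ref, k_ref):
    """AIC of an alternative model minus AIC of the reference model.

    Positive values mean the reference model is preferred.
    """
    return aic(loglik_alt, k_alt) - aic(loglik_ref, k_ref)
```

As a made-up example, an alternative model with 4 parameters and log-likelihood −1050 compared against a reference with 5 parameters and log-likelihood −1000 gives ΔAIC = 98: the extra parameter costs 2 AIC units but is more than repaid by the gain in likelihood.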

Additional dataset 2

To further validate predictions of the DyNR model we fitted it to an existing working memory study (Experiment 1 in 38). This study focused on the role of temporal dynamics during WM encoding, thereby addressing the same question as our Experiment 2. In contrast to our Experiment 2, Bays et al. [38] used a longer delay period (1100 ms), precluding the strengthening influence of decaying sensory information on recall. This dataset therefore isolates the initial information accumulation process during stimuli presentation.

Methods

The observers (N = 32) performed a continuous report task in which a variable number of oriented bars was presented for a variable duration, followed by a pattern mask (100 ms) and a 1-second delay period after which one of the items was probed for recall. Set size was manipulated between observers and exposure duration was manipulated within observers. Each observer performed 100 trials per exposure duration, for a total of 25600 trials in the study. A more detailed description of the experiment is provided in Bays et al. [38].

Analysis

Considering that only exposure duration was manipulated at the observer level in this experiment, we expanded our modelling approach by employing a Bayesian hierarchical method as a compromise between fitting the data for each observer (i.e., set size) independently and pooling the data across all observers. In Bayesian hierarchical modelling, individual-observer parameters are treated as samples from population distributions, whose means and variances are estimated based on all available data. In general, this approach has the desirable characteristic of constraining individual-level parameters with the population-level distribution and producing meaningful parameter estimates when a model is fitted across separate groups. The dynamic neural model fitted to the data is identical to the model fitted in Experiment 2, with the exception that here we assumed any existing post-stimulus sensory activity had completely diminished by the time of the cue (1100 ms post-stimulus offset), and therefore we did not model sensory decay here. To obtain the hierarchical fit, we used the Differential Evolution Markov Chain algorithm [119]. All individual-level parameters were modelled as samples drawn from normal (i.e., Gaussian) distributions, with the corresponding mean and standard deviation constrained by uniform hyperprior distributions. We collected 240000 post-warmup samples across 12 chains and computed the median and 95% equal-tailed intervals (ETI) of posterior distributions to obtain the group and individual-level parameter estimates. Prior specifications and empirical data for all analyses can be found along with the published code.
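To give a sense of the sampling scheme, the following Python sketch implements a basic Differential Evolution Markov Chain update and the posterior summaries mentioned above (median and equal-tailed interval). It is a generic illustration, not the code used in the study: the function names, the jitter size and the toy target are assumptions, and the hierarchical structure (observer-level parameters drawn from group-level distributions) would have to be built into the log posterior function supplied by the user. Each chain proposes a jump along the difference between two other randomly chosen chains, scaled by the commonly used factor 2.38 / sqrt(2d), and accepts it with the Metropolis rule.

```python
import numpy as np

def demc_sample(log_post, init, n_iter=2000, eps=1e-4, rng=None):
    """Differential Evolution Markov Chain sampler (minimal sketch).

    log_post : function mapping a parameter vector to its log posterior
    init     : array (n_chains, n_dim) of starting points, n_chains >= 3
    eps      : s.d. of the small Gaussian jitter added to each proposal
    Returns samples with shape (n_iter, n_chains, n_dim).
    """
    rng = np.random.default_rng() if rng is None else rng
    chains = np.array(init, dtype=float)
    n_chains, n_dim = chains.shape
    gamma = 2.38 / np.sqrt(2 * n_dim)          # standard DE-MC jump scale
    logp = np.array([log_post(x) for x in chains])
    out = np.empty((n_iter, n_chains, n_dim))
    for it in range(n_iter):
        for i in range(n_chains):
            # Two distinct chains other than i define the jump direction.
            others = [j for j in range(n_chains) if j != i]
            a, b = rng.choice(others, size=2, replace=False)
            proposal = (chains[i] + gamma * (chains[a] - chains[b])
                        + rng.normal(0.0, eps, n_dim))
            logp_new = log_post(proposal)
            # Metropolis acceptance (the proposal is symmetric).
            if np.log(rng.random()) < logp_new - logp[i]:
                chains[i], logp[i] = proposal, logp_new
        out[it] = chains
    return out

def median_and_eti(samples, level=0.95):
    """Posterior median and equal-tailed interval for each parameter."""
    tail = 50.0 * (1.0 - level)
    lo, hi = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    return np.median(samples, axis=0), lo, hi
```

On a toy two-dimensional standard normal target, discarding an initial warm-up portion and pooling the chains should give medians near 0 and 95% intervals near ±1.96.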

Results

Figure S10 and Figure S11 show empirical distributions and summary statistics across all conditions. Similar to Experiment 2, increasing the exposure duration (F(7,196) = 110.9, p < .001, η2 = .188) and decreasing the set size (F(3,28) = 22.83, p < .001, η2 = .53) had a beneficial effect on response error. The interaction of exposure duration and set size was significant (F(21,196) = 3.13, p < .001, η2 = .02). Critically, the pattern of memory fidelity dynamics largely matches the pattern observed in Experiment 2, with response errors decreasing rapidly as presentation duration was increased from the minimum duration, saturating at longer durations. This pattern was consistent across all set sizes, which only differed in the absolute error.

These dynamics were accurately predicted by the DyNR model, both at the level of response distributions (curves in Fig. S10) and summary statistics (curves in Fig. S11). The parameters used to generate model predictions were obtained by taking the individual observer’s posterior medians. We observed the following hyperparameters (median and 95% ETI of hyperposterior): population gain γ = 109.47 (88.1 - 133.57), tuning width κ = 3.23 (2.6 - 4.03), sensory rise time constant τrise = 0.0049 (0.0019 - 0.0091), VWM accumulation time constant τWM = 0.067 (0.051 - 0.087), cue processing constant b = 0.423 (0.093 - 0.8436), base diffusion σ2 = 0.095 (0.057 - 0.149), spatial uncertainty time constant τspatial = 0.031 (0.022 - 0.041), swap probability p = 0.02 (0.011 - 0.034).

Empirical recall error distributions (black circles) from Experiment 1 in Bays et al. [38] and the DyNR model fits to the data (colored curves).

Summary statistics (black circles) from Experiment 1 in Bays et al. [38] and the DyNR model fits to the data (colored curves). The DyNR model was fit to the distributions of recall errors shown in Fig. S10.