Strategy-dependent effects of working-memory limitations on human perceptual decision-making
Abstract
Deliberative decisions based on an accumulation of evidence over time depend on working memory, and working memory has limitations, but how these limitations affect deliberative decision-making is not understood. We used human psychophysics to assess the impact of working-memory limitations on the fidelity of a continuous decision variable. Participants decided the average location of multiple visual targets. This computed, continuous decision variable degraded with time and capacity in a manner that depended critically on the strategy used to form the decision variable. This dependence reflected whether the decision variable was computed either: (1) immediately upon observing the evidence, and thus stored as a single value in memory; or (2) at the time of the report, and thus stored as multiple values in memory. These results provide important constraints on how the brain computes and maintains temporally dynamic decision variables.
Editor's evaluation
This paper employs sophisticated modeling of human behavior in well-controlled tasks to study how limitations of working memory constrain decision-making. Because both are key cognitive processes that have so far largely been studied in isolation, the paper will be of broad interest to neuroscientists and psychologists. The observed working memory limitations support previous findings and extend them in critical ways.
https://doi.org/10.7554/eLife.73610.sa0
eLife digest
Working memory, the brain’s ability to temporarily store and recall information, is a critical part of decision making – but it has its limits. The brain can only store so much information, for so long. Since decisions are not often acted on immediately, information held in working memory ‘degrades’ over time. However, it is unknown whether or not this degradation of information over time affects the accuracy of later decisions.
The tactics that people use, knowingly or otherwise, to store information in working memory also remain unclear. Do people store pieces of information such as numbers, objects and particular details? Or do they tend to compute that information, make some preliminary judgement and recall their verdict later? Does the strategy chosen impact people’s decision-making?
To investigate, Schapiro et al. devised a series of experiments to test whether the limitations of working memory, and how people store information, affect the accuracy of decisions they make. First, participants were shown an array of colored disks on a screen. Then, either immediately after seeing the disks or a few seconds later, the participants were asked to recall the position of one of the disks they had seen, or the average position of all the disks. This measured how much information degraded for a decision based on multiple items, and how much for a decision based on a single item. From this, the method of information storage used to make a decision could be inferred.
Schapiro et al. found that the accuracy of people’s responses worsened over time, whether they remembered the position of each individual disk, or computed their average location before responding. The greater the delay between seeing the disks and reporting their location, the less accurate people’s responses tended to be. Similarly, the more disks a participant saw, the less accurate their response became. This suggests that however people store information, if working memory reaches capacity, decision-making suffers and that, over time, stored information decays.
Schapiro et al. also noticed that participants remembered location information in different ways depending on the task and how many disks they were shown at once. This suggests people adopt different strategies to retain information momentarily.
In summary, these findings help to explain how people process and store information to make decisions and how the limitations of working memory impact their decision-making ability. A better understanding of how people use working memory to make decisions may also shed light on situations or brain conditions where decision-making is impaired.
Introduction
Many perceptual, memory-based, and reward-based decisions depend on an accumulation of evidence over time (Brody and Hanks, 2016; Gold and Shadlen, 2007; Ratcliff et al., 2016; Shadlen and Shohamy, 2016; Summerfield and Tsetsos, 2012). This dynamic process, which can operate on timescales ranging from tens to hundreds of milliseconds for many perceptual decisions to seconds or longer for reward-based and other decisions (Bernacchia et al., 2011; Gold and Stocker, 2017), requires working memory to maintain representations of new, incoming evidence and/or the aggregated, updating decision variable. Working memory is constrained by capacity and temporal limitations (Bastos et al., 2018; Cowan et al., 2008; Funahashi et al., 1989; Oberauer et al., 2016; Panichello et al., 2019; Ploner et al., 1998; Schneegans and Bays, 2018; White et al., 1994) that, in principle, could also constrain decision performance. Several previous studies failed to identify such constraints on decisions that depend on working memory, but those studies used tasks involving binary choices that may be relatively insensitive to known working-memory limitations (Liu et al., 2015; Waskom and Kiani, 2018). It remains unclear if and how working-memory limitations affect decisions that require interpreting and storing continuously valued quantities whose representations are known to degrade over time (Ploner et al., 1998; Schneegans and Bays, 2018; Wei et al., 2012; White et al., 1994).
To better understand how working-memory limitations affect decision-making, we examined how humans made decisions that required interpreting and storing continuously valued visuospatial information (visual target locations) that is sensitive to capacity and temporal limitations of working memory (Bastos et al., 2018; Funahashi et al., 1989; Panichello et al., 2019; Ploner et al., 1998; Schneegans and Bays, 2018; White et al., 1994). Specifically, we required participants to indicate a remembered spatial location that was informed by one or more briefly presented visual stimuli (‘disks’; Figure 1) after a variable delay. We compared the effects of variable set size and delay when the remembered location corresponded to either: (1) the perceived location (angle) of a specific disk, identified at the time of interrogation, which is a design that has been used previously (Ploner et al., 1998; Schneegans and Bays, 2018; Wei et al., 2012; White et al., 1994); or (2) the computed mean angle of a set of multiple disks, which is a form of continuous decision variable whose sensitivity to working-memory limitations has not been examined in detail. Additionally, we examined the effects of working-memory limitations on computed locations under two conditions that are representative of certain decision-making tasks. The first was a ‘simultaneous’ condition in which all disks (and thus all information) were presented at once. The second was a ‘sequential’ condition in which one disk was presented later than the others. This condition required participants to adjust to a within-trial change of available decision-relevant information, typifying decisions that require evidence accumulation over time.
For spatial working-memory tasks, the precision of working memory for perceived spatial locations is often well described by diffusion dynamics (Compte et al., 2000; Kilpatrick, 2018; Kilpatrick et al., 2013; Laing and Chow, 2001) that are commonly implemented in ‘bump-attractor’ models of working memory (Compte et al., 2000; Constantinidis et al., 2018; Laing and Chow, 2001; Riley and Constantinidis, 2016; Wei et al., 2012; Wimmer et al., 2014). Our analyses built on this framework by examining memory diffusion dynamics for the different task conditions and potential decision strategies. For the conditions we tested, most participants’ behavior was well fit by one of two distinct strategies, each with its own constraints on decision performance based on different working-memory demands. The first strategy was to compute the decision variable (mean disk angle) immediately upon observing the evidence (individual disk angles), and then store that value in working memory in a manner that, like the memory of a single perceived angle, could be modeled as a single particle with a particular diffusion constant (Average-then-Diffuse model; AtD). The second strategy was to maintain representations of all disk locations in working memory, modeled as separate diffusing particles, and then to combine them into a decision variable only at the time of the decision (Diffuse-then-Average model; DtA). Such a strategy results in an effective diffusion constant for the average that is inversely related to the number of items. Our results show that, like memory for perceived locations, memory for computed mean locations degraded with increased set size (of relevant information) and with increased delay between presentation and report. However, the degree of degradation depended on the strategy used to compute the decision variables, implying that multiple strategy- and task-dependent effects of working memory should be considered in the construction of future neural and computational models of decision-making.
Results
We measured the ability of human participants to remember spatial angles as a function of set size (1, 2, or 5 disks), delay duration (0, 1, or 6 s), and task context (Perceived or Computed blocks). Specifically, we measured the error between reported and probed angles as a proxy for working-memory representations and inferred rates of memory degradation (diffusion constants) from the increase in variance of these errors over time within a framework of diffusing-particle models. Below we first describe the model framework, detailing its key assumptions and predictions. We next describe results from Simultaneous conditions, in which all items were presented simultaneously at the beginning of each trial, which demonstrate how capacity and temporal constraints on working memory relate to the accuracy of computed decision variables. We then describe results from Sequential conditions, in which one item was presented after the others in each trial, which demonstrate how capacity and temporal constraints affect the process of evidence integration over time.
Diffusing-particle framework and predictions
Within our diffusing-particle framework, the memory of an item is represented by the location of a diffusing particle. This representation allows us to quantify the corruption (i.e., reduced precision) of the memory by two distinct sources of noise. The first is described by a static, additive term (η_{1}) that encompasses all potential one-time noise sources within a trial including noise associated with the sensory encoding and the motor response. The second is the dynamic degradation of memory precision over time that is modeled as the diffusion of the particle (Figure 2a). This diffusion corresponds to an increase in variability over time that is linear, with a slope equal to the diffusion constant (σ_{1}^{2}; Figure 2b). Consistent with past modeling studies (Bays et al., 2009; Brady and Alvarez, 2015; Koyluoglu et al., 2017; Wei et al., 2012), we accounted for the decrease in working-memory fidelity with item load by incorporating item number (N) dependence into both the static noise term (η_{N}) and the diffusion constant of each particle (σ_{N}^{2}; Figure 2b).
We extended this framework to account for working-memory representations of values computed from multiple stimuli, namely their average location, via two primary models (these models also served as the basis for more complex extensions, including mixtures of the two models, used to account for the Sequential condition detailed in subsequent sections). In the first, called the AtD model, the average is calculated immediately upon observing the evidence and then stored as a single particle in working memory. This model has its own static noise term that includes variability in estimates of the mean of N items (η_{MN}) and then assumes that the single estimate held in working memory diffuses with the same diffusion constant as a single perceived item (σ_{MN}^{2}=σ_{1}^{2}; see the parallel purple and black lines in Figure 2b and the overlapping lines in Figure 2c). In the second, called the DtA model, the memories of all constituent items are maintained and then combined into a decision variable (the average) only at the time of the response. This model assumes an effective diffusion constant for the reported average that is related to σ_{N}^{2} by the inverse of the number of items (σ_{MN}^{2}=σ_{N}^{2}/N) because averaging over N random variables with a variability of σ_{N}^{2} results in a random variable with variability σ_{N}^{2}/N.
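The two noise sources described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the authors' implementation: it treats the static noise as zero-mean Gaussian with variance `static_var` (an assumed parameterization of the η term), treats report errors as values on a line rather than on a circle, and approximates diffusion as a sum of small Gaussian steps. All function and parameter names are hypothetical.

```python
import math
import random
import statistics


def simulate_report_errors(static_var, diffusion_const, delay,
                           n_trials=10000, n_steps=10, seed=0):
    """Simulate report errors for one remembered item.

    static_var      : variance of the one-time (static) noise; assumed Gaussian
    diffusion_const : variance gained per second of delay
    delay           : memory delay in seconds
    """
    rng = random.Random(seed)
    # Each step contributes an equal share of the total diffusion variance.
    step_sd = math.sqrt(diffusion_const * delay / n_steps)
    errors = []
    for _ in range(n_trials):
        x = rng.gauss(0.0, math.sqrt(static_var))  # static noise
        for _ in range(n_steps):                   # diffusion during the delay
            x += rng.gauss(0.0, step_sd)
        errors.append(x)
    return errors


def predicted_variance(static_var, diffusion_const, delay):
    """Linear growth of error variance with delay."""
    return static_var + diffusion_const * delay


if __name__ == "__main__":
    for delay in (0.0, 1.0, 6.0):
        errs = simulate_report_errors(4.0, 2.0, delay)
        print(delay, statistics.pvariance(errs), predicted_variance(4.0, 2.0, delay))
```

In this sketch the simulated variance at each delay should track the straight line `static_var + diffusion_const * delay`, which is the linear relationship whose slope the framework calls the diffusion constant.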
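The 1/N scaling assumed by the DtA model follows from a short variance calculation. The sketch below assumes that the N stored items diffuse independently with a common diffusion constant and ignores static noise; the symbols $\theta_i(t)$ (the diffusion-driven displacement of item i after a delay t) and $\bar{\theta}(t)$ (their average) are introduced here only for illustration.

```latex
\begin{align}
\bar{\theta}(t) &= \frac{1}{N}\sum_{i=1}^{N}\theta_i(t),
\qquad \operatorname{Var}[\theta_i(t)] = \sigma_{N}^{2}\,t \\
\operatorname{Var}[\bar{\theta}(t)]
 &= \frac{1}{N^{2}}\sum_{i=1}^{N}\operatorname{Var}[\theta_i(t)]
  = \frac{N\,\sigma_{N}^{2}\,t}{N^{2}}
  = \frac{\sigma_{N}^{2}}{N}\,t
  \equiv \sigma_{MN}^{2}\,t
\end{align}
```

If the items' diffusion were correlated, the cross terms would not vanish and the reduction from averaging would be smaller than 1/N.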
The ability to distinguish these two models depends on their relative ability to capture specific changes in error over time in the report of the average, which in turn depends on the relationship between the diffusion constant of a single item and multiple items. We describe this relationship as σ_{N}^{2}=σ_{1}^{2}*N^{A}, where A is a constant for a given participant and set size that describes the cost to store N items in memory. Because of the previously described relationships between σ_{1}^{2}, σ_{N}^{2}, and σ_{MN}^{2}, it is therefore also true that (the following constraints were enforced when determining best-fitting values of A using data from both Perceived and Computed blocks): (1) in the AtD model, σ_{N}^{2}=σ_{MN}^{2}*N^{A} (i.e., the diffusion constant describing memory degradation over time for N Perceived items held in working memory is proportional to the diffusion constant describing memory degradation over time for the one Computed value held in working memory); and (2) in the DtA model, σ_{MN}^{2}=σ_{1}^{2}*N^{A}/N (i.e., the diffusion constant describing memory degradation over time for N Computed items held in working memory is proportional to the diffusion constant describing memory degradation over time for one Perceived item held in working memory). For a given static noise level and σ_{1}^{2}, the A parameter dictates whether AtD or DtA has a lower σ_{MN}^{2} and thus results in lower memory loss over time (Figure 2c). Specifically, when A<1, DtA has a lower σ_{MN}^{2} and less variable responses because the averaging over multiple diffusing items counteracts the greater total noise of having many items. When A=1, the additional noise cost of each individual point in DtA exactly balances the effect of averaging, such that AtD and DtA have equal σ_{MN}^{2} and equal levels of accuracy and thus are indistinguishable (Figure 2—figure supplement 1 shows the models becoming increasingly indistinguishable as A approaches 1). When A>1, the additional cost of storing multiple items outweighs the effect of averaging, and AtD produces a lower σ_{MN}^{2} and less variable responses than DtA. A summary of all framework variables can be found in Table 1.
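The three regimes of A can be checked numerically from the relationships stated above (σ_{N}^{2}=σ_{1}^{2}*N^{A}, σ_{MN}^{2}=σ_{1}^{2} for AtD, and σ_{MN}^{2}=σ_{1}^{2}*N^{A}/N for DtA). The following is a minimal sketch; the function names are hypothetical and chosen only for this illustration.

```python
import math


def sigma_n_sq(sigma1_sq, n_items, a):
    """Per-item diffusion constant when n_items are stored: sigma_1^2 * N^A."""
    return sigma1_sq * n_items ** a


def sigma_mean_sq_atd(sigma1_sq, n_items, a):
    """AtD: the stored average diffuses like a single perceived item."""
    return sigma1_sq


def sigma_mean_sq_dta(sigma1_sq, n_items, a):
    """DtA: averaging N diffusing items divides sigma_N^2 by N."""
    return sigma_n_sq(sigma1_sq, n_items, a) / n_items


def lower_diffusion_model(sigma1_sq, n_items, a):
    """Return which model predicts the lower diffusion constant for the mean."""
    atd = sigma_mean_sq_atd(sigma1_sq, n_items, a)
    dta = sigma_mean_sq_dta(sigma1_sq, n_items, a)
    if math.isclose(atd, dta):
        return "equal"
    return "AtD" if atd < dta else "DtA"


if __name__ == "__main__":
    for a in (0.5, 1.0, 1.5):
        print(a, lower_diffusion_model(1.0, 5, a))
```

For N=5 and σ_{1}^{2}=1, the DtA constant is about 0.45 when A=0.5, exactly 1 when A=1, and about 2.24 when A=1.5, matching the ordering described in the text.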
To summarize, our two models describe two different possible ways for decision-relevant information to be stored in working memory prior to executing a decision. The different storage strategies result in different patterns of memory degradation, corresponding to trial-to-trial variability (imprecision) of decision reports that increase as a function of the length of the within-trial delay period. For the AtD model, the individual pieces of information are immediately combined into a single decision variable that is then stored in memory. Thus, the rate of degradation of an estimated average is identical to the rate of degradation of a single item. In contrast, for the DtA model, all of the relevant pieces of information are stored in memory and then combined only at the time of indicating the decision. Thus, the rate of degradation of an estimated average equals the rate of degradation of each item held in memory divided by the number of items. We used fits of these models to performance data from individual participants to distinguish different patterns of memory degradation and therefore different storage strategies.
Simultaneous condition behavior
When all disks were presented simultaneously, performance was consistent with several key predictions of the particle model. Specifically, the difference in reports of Perceived spatial angles and the true probed location (i.e., the response error) tended to be unbiased, in that the mean error across participants was not reliably different from 0 (Figure 3a, full distributions in Figure 3—figure supplement 1, individual participant mean errors in Figure 3—figure supplement 2). However, the variance of these errors increased roughly linearly over time (Figure 3c), like the location of a diffusing particle or bump attractor (Compte et al., 2000; Kilpatrick, 2018; Kilpatrick et al., 2013; Laing and Chow, 2001). This error variance depended systematically on set size (Figure 3c). However, the change in error variance over time (slope of variance increase) did not depend on set size (ANOVA, significant effect of set size, F(2,32)=83.87, p=1.88e−13, and delay, F(2,32)=29.55, p=5.37e−08, but no significant interaction between set size and delay, F(4,64)=1.36, p=0.256). Errors in reports of Computed (i.e., inferred mean) spatial angles relative to true mean angles showed similar trends, albeit with a much weaker dependence on the number of items. Specifically, Computed angle reports were also unbiased (mean error from the true value was not reliably different from 0; Figure 3b, Figure 3—figure supplements 1 and 3) but degraded (became more variable) with a roughly linear increase in variance over time (Figure 3d). Error variance in the report of the Computed average was higher at higher set sizes (set size 5 had higher variances), but the rate of degradation in accuracy did not depend on set size (ANOVA, significant effect of set size, F(2,32)=13.53, p=5.515e−5, and delay, F(2,32)=130.79, p=4.441e−16, but not their interaction, F(4,64)=0.538, p=0.708).
Simultaneous condition model fits
To better understand the effects of delay and set size on working-memory representations of Perceived and Computed angles for individual participants, we fit the AtD and DtA models separately to data from each set size (N=2 or 5) condition and participant (Table 2; the two models each had the same number of free parameters and thus were compared using the log-likelihoods obtained from the fits). The fitting procedures for both models used data from all trials from Perceived set size one and set size N conditions, and from Computed set size N conditions. In general, both strategies were used by our participants in each of the two set-size conditions (for set size 2, 8 and 9 participants were best fit by the AtD and DtA models, respectively; for set size 5, 14 and 3 participants were best fit by the AtD and DtA models, respectively).
Because the two models are indistinguishable when A=1 (i.e., σ_{MN}^{2}=σ_{1}^{2}=σ_{1}^{2}*N^{A}/N=σ_{N}^{2}/N), we further analyzed the best-fitting values of A. Across our participants, the 95% confidence intervals (CIs) for A (determined from the SEM values shown in Table 2) did not overlap with 1, supporting the distinguishability of the two models, on average (although not for each individual participant; Figure 4—figure supplement 1a,b). Moreover, best-fitting values of A were similar when they were estimated separately from Perceived versus Computed blocks for DtA participants (two-sided test for H_{0}: difference in mean best-fitting values across participants=0, p=0.895 and 0.452 for set sizes 2 and 5, respectively; note that A is not defined for the AtD model on Computed blocks alone), which supports our modeling assumption that A (the cost of storing N items in memory) is roughly the same in the two blocks. For the participants best fit by the AtD model, the mean best-fitting values of A for both set sizes were close to 0. These values were consistent with the lack of interaction between set size and delay in the Perceptual ANOVA in Figure 3c (because in this model, σ_{N}^{2}=σ_{1}^{2} when A=0). Conversely, for the participants best fit by the DtA model, the mean best-fitting values of A were slightly higher (the between-group difference in each parameter at a given set size was significantly different from 0 only for A at set size 5; see Table 2). These values were consistent with the lack of interaction between set size and delay in the Computed ANOVA in Figure 3d (because in this model, σ_{MN}^{2}=σ_{1}^{2}*N^{A}/N, which becomes less dependent on N when A approaches 1).
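Comparing two models with equal numbers of free parameters by log-likelihood can be sketched as follows. This is an illustration under assumptions, not the fitting procedure used in the study: it assumes zero-mean Gaussian report errors whose variance is a static term plus a diffusion constant times the delay, it ignores the circular nature of angles, and it evaluates fixed candidate parameters rather than optimizing them. All names are hypothetical.

```python
import math


def gaussian_loglik(trials, static_var, diffusion_const):
    """Log-likelihood of report errors under a zero-mean Gaussian whose
    variance grows linearly with delay.

    trials : iterable of (delay_seconds, error) pairs
    """
    total = 0.0
    for delay, err in trials:
        var = static_var + diffusion_const * delay
        total += -0.5 * (math.log(2.0 * math.pi * var) + err * err / var)
    return total


def loglik_difference(trials, params_a, params_b):
    """Positive when parameter set A explains the trials better than set B.

    params_a, params_b : (static_var, diffusion_const) tuples
    """
    return gaussian_loglik(trials, *params_a) - gaussian_loglik(trials, *params_b)


if __name__ == "__main__":
    # Toy errors whose spread grows with delay (hypothetical values).
    trials = [(0.0, 2.0), (0.0, -2.0), (6.0, 4.0), (6.0, -4.0)]
    print(loglik_difference(trials, (4.0, 2.0), (4.0, 0.2)))
```

In an actual comparison, each model's predicted diffusion constant for the Computed reports would be substituted for `diffusion_const` after fitting, and the sign of the difference would indicate the better-fitting model for that participant.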
Simultaneous condition model validation
When A differs from 1, AtD and DtA make distinct assumptions about the diffusion constant relationships between either single (AtD) or multiple (DtA) Perceived angle(s) versus a Computed average angle (Figure 2b and c). We used these assumptions to validate whether the better-fitting model and best-fit parameters for a given participant at a given set size were likely to produce the participant’s behavior. Specifically, the AtD model assumes that the diffusion constant for a single Perceived angle and for a Computed average angle are the same because both involve the memory of a single value (Equation 9). In contrast, the DtA model assumes that the diffusion constant for a Computed average angle is 1/Nth the diffusion constant for N items, because all N items are held in memory prior to averaging (Equation 10). We analyzed how consistent these assumptions were with the behavioral data (Figure 4). For each participant, we fit a line to the measured error variances as a function of delay for a given set size in both Perceived and Computed blocks to estimate the change in variance over time (the empirical diffusion constant estimates: σ̂_{1}^{2}, σ̂_{N}^{2}, σ̂_{MN}^{2}, where N=2 or 5 for the two set sizes). We then compared the differences of these empirical estimates to the differences predicted between diffusion constants by the best-fit model for a given participant.
In general, the participant data conformed to the model predictions of the best-fit model for that participant, despite substantial individual variability. For participants whose data were best fit by the AtD model, empirical estimates of the diffusion constant (σ̂_{MN}^{2}) from Computed blocks tended to be similar to the empirical estimates of the diffusion constant for a single Perceptual point (σ̂_{1}^{2}; Figure 4a and b). Specifically, for all but two participants, the empirical diffusion constant differences fell within the 95% CI of the simulated distribution. Likewise, for participants whose data were best fit by the DtA model, empirical estimates of the diffusion constant (σ̂_{MN}^{2}) from Computed blocks tended to be similar to the empirical estimates of the diffusion constant for multiple Perceptual items divided by the set size (σ̂_{N}^{2}/N; Figure 4d and e). Specifically, for all but one participant, empirical diffusion constant differences fell within the 95% CI of the simulated distribution. These analyses, which are summarized in Figure 4c and f, thus support the idea that for most participants, their behavior was well captured by their better-fitting model.
Summaries of the predicted report-error variances by the AtD and DtA fits for well-fit participants are shown in Figure 5. Overall, the model predictions qualitatively match participant behavior. In general, AtD behavior was predicted by diffusion constants that were the same for either one Perceived location or the mean Computed location based on two or five items (i.e., parallel lines in Figure 5e and g). DtA behavior was well predicted by diffusion constants that were larger for multiple Perceived items compared to single Perceived items (Figure 5f and h). As predicted by the DtA model, the Computed errors for DtA participants were well predicted by 1/Nth the diffusion constant for multiple Perceived items (Figure 5e–h). We also compared the variance in AtD and DtA participants’ reports of the mean across delays using an ANOVA and multiple comparisons. For set size 2, AtD participants had a significantly higher variance in their reports than DtA participants at the 6 s delay (p=0.018), reflecting the lower effective diffusion constant of the mean for DtA than AtD when A<1, as was the case for our participants, on average. For set size 5, there were no significant differences in variability at any delay between models (p>0.05). This lack of statistical difference may reflect the low number of participants at set size 5 and/or the fact that the A values of the DtA-using participants at set size 5 were closer to 1, at which AtD and DtA perform identically.
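The empirical diffusion constant described above is the slope of a line fit to error variance as a function of delay. A minimal sketch of that estimate using ordinary least squares is shown here; the data layout (a mapping from delay to a list of report errors), the use of the sample variance, and the function name are assumptions made for illustration rather than details taken from the study.

```python
import statistics


def empirical_diffusion_constant(errors_by_delay):
    """Fit variance = intercept + slope * delay by ordinary least squares.

    errors_by_delay : dict mapping delay (s) -> list of report errors
    Returns (slope, intercept); the slope estimates the diffusion constant
    and the intercept estimates the static (delay-independent) variance.
    """
    delays = sorted(errors_by_delay)
    variances = [statistics.variance(errors_by_delay[d]) for d in delays]
    mean_t = statistics.fmean(delays)
    mean_v = statistics.fmean(variances)
    sxx = sum((t - mean_t) ** 2 for t in delays)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(delays, variances))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical errors (degrees) at the three delays used in the task.
    data = {0: [-3.0, 1.0, 2.5, -0.5], 1: [-4.0, 3.0, 1.5, -2.0], 6: [-7.0, 6.0, 2.0, -5.5]}
    print(empirical_diffusion_constant(data))
```

Applying the same function to Perceived set size 1, Perceived set size N, and Computed set size N trials would give the three slopes whose differences are compared against the model predictions.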
Simultaneous condition strategy comparisons
Across the population, participants seemed to have different tendencies to use the two strategies (AtD or DtA) for the two set-size conditions (Figure 6). For set size 2, equal numbers of well-fit participants were best fit by the AtD (n=8) and the DtA (n=8) models, and as such neither model was significantly more likely to be a better fit across the population (Wilcoxon signed-rank test, p=0.756, Figure 6a, blue items). Conversely, for set size 5, more well-fit participants were better fit by the AtD (n=12) than the DtA (n=3) model (p=0.0027; Figure 6a, green items). Participants who were not poorly fit at either set size were more likely to be better fit by AtD in set size 5 compared to set size 2 (Wilcoxon signed-rank two-sided test for equal median log-likelihood differences of fits of the two models across set sizes, p=0.029). We also found these results were robust to uncertainty associated with model identifiability (participant-wise identifiability is given in Figure 2—figure supplement 1). Specifically, given different possible distributions of underlying strategy prevalence (proportions), the probability of obtaining the empirically observed distributions of models shown in Figure 6a for either set size while considering the average model identifiability was peaked near the observed strategy proportions. This result demonstrates that the observed proportions were not likely obtained due to misidentification-related chance. These probability distributions were also highly non-overlapping, which is consistent with a different prevalence of strategy use at the two different set sizes (Figure 6b).
These differences in strategy use did not correlate with the ages of the participants (Pearson correlation, Figure 6—figure supplement 1, p>0.20). These findings suggest that working-memory load might have affected our participants’ decision strategies, such that a higher load corresponded to an increased tendency to discard information about individual samples (disk locations) and hold only the relevant computed decision variable in memory.
Sequential condition behavior
For the Sequential condition, we separately analyzed errors for Perceived reports of disks presented at the beginning (Early) or middle (Late) of a trial. Early Perceived reports tended to be relatively unbiased (two-sided t-test for H_{0}: mean error=0, p>0.05; Figure 7a, full distributions in Figure 7—figure supplement 1; individual participant mean errors in Figure 7—figure supplement 2a–d) but became more variable over time in a roughly linear manner (Figure 7d), consistent with the predictions of the particle-diffusion models. For higher set sizes, errors were more variable than at lower set sizes. The rate of variance increase over time did not depend on set size (ANOVA, significant effect of set size, F(2,32)=33.44, p=1.45e−08, and delay, F(1,16)=77.02, p=1.64e−07, but not their interaction, F(2,32)=0.15, p=0.256). Late Perceived reports were likewise unbiased (mean error not significantly different from 0; Figure 7b, full distributions in Figure 7—figure supplement 1; individual participant mean errors in Figure 7—figure supplement 2e–h) and degraded in precision (i.e., increased in variance) over time (Figure 7e). However, this degradation did not depend on set size (ANOVA, significant effect of delay, F(1,16)=39.28, p=1.12e−05, but not set size, F(1,16)=0.90, p=0.36 or their interaction, F(1,16)=0.0029, p=0.96).
Conversely, Computed (i.e., inferred mean) reports that required integrating both Early and Late items tended to be slightly biased towards the Early items for set size 2 (mean=13% of the distance between the two disks closer to the Early point than a true mean; Student’s two-sided t-test, p<0.001) but not set size 5 (mean=3.5% closer to the mean of N–1 items than the true mean of all items; p>0.5; Figure 7c, full distributions in Figure 7—figure supplement 1; individual participant mean errors in Figure 7—figure supplement 3). The Computed report errors also increased in variance over time (Figure 7f). These overall errors of bias and variance did not change dramatically with training (t-tests for differences in mean and standard deviation of report errors between the first and second half of trials=0, p>0.05 in all cases). However, the magnitude of the errors of variance, and their change over time, depended systematically on the number of items to remember, such that more items corresponded to a slightly greater overall variance in reports at short delays, but less gain in variance over time (ANOVA, significant effect of set size, F(2,32)=7.73, p=1.8e−3, delay, F(1,16)=73.76, p=2.18e−07, and their interaction, F(2,32)=6.81, p=3.4e−3). This interaction of delay and set size suggests the representation of the Computed value diffused in working memory with a different diffusion constant than for a single Perceived value. Such an interaction is consistent with predictions of both the AtD and DtA models under these conditions, though the nature of this interaction depends on the specific model, as detailed below.
Sequential condition model fitting
To better understand the effects of delay and set size on working-memory representations of Perceived and Computed locations for individual participants under Sequential conditions, we fit the AtD and DtA models separately to data from each condition and participant (Table 3; the two models each had the same number of free parameters and thus were compared using the log-likelihoods of the fits). Recall that the η parameters quantify the effect of set size on non-time-dependent noise (noise when delay is 0), whereas σ_{1}^{2} is the model-based estimate of the diffusion constant for a single Perceived point. In general, both strategies were used by our participants in each of the two set-size conditions (for set size 2, 9 and 8 participants were best fit by the AtD and DtA models, respectively; for set size 5, 8 and 8 participants were best fit by the AtD and DtA models, respectively).
As in the Simultaneous condition, the models make identical predictions when A=1. Across the population, 95% CIs of A did not overlap with 1, supporting the distinguishability of the two models; however, this difference from 1 was not always true for individual participants (estimates of A on a participant-by-participant basis are shown in Figure 8—figure supplement 1). For a given set size, none of the fit parameter estimates differed significantly when comparing their best-fitting values from AtD versus DtA participants (t-tests, p>0.05 in all cases). For participants best fit by AtD at both set sizes, the average A was close to 0, which is consistent with the lack of interaction between set size and delay seen in the Early and Late Perceptual ANOVA. Unlike in the Simultaneous condition, the participants best fit by DtA had negative A values at both set sizes, implying that the diffusion constant for multiple Perceived items became closer to 0 as the number of items increased. While counterintuitive to the concept that adding more items should increase the diffusion constant, negative A values can be explained by ceiling effects: if a participant has high levels of static noise (such as in set size 5), their performance has less room to degrade while they continue to accurately track the target. As such, the rate of increase in variability σ_{NE}^{2} cannot be very large and may be smaller than σ_{1}^{2}, which translates into a negative A value. Alternatively, the presentation of a new item may have had a stabilizing effect on the ensemble by creating directional drift toward the new item rather than random diffusion in the remaining items (Almeida et al., 2015; Wei et al., 2012), which is not inherently accounted for in any of the present models.
Sequential condition model validation
The Sequential condition models also make predictions about the relationship between the diffusion constants of remembered Computed and Perceived values. Once again, we assessed how well participant behavior matched these assumptions, detailed in Equation 11 for AtD and Equation 12 for DtA (Figure 8). We fit a line to the measured variances in reporting error as a function of delay for a given set size in both Perceived and Computed Sequential blocks to estimate the change in variance over time (the empirical diffusion constant estimates: $\hat{\sigma}$_{1}^{2}, $\hat{\sigma}$_{NE}^{2}, $\hat{\sigma}$_{NL}^{2}, $\hat{\sigma}$_{MNseq}^{2}, where N=2 or 5 for the two set sizes). We then compared the difference of these empirical estimates to the predictions of the best-fit model for each participant (Figure 8).
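The line fit used to obtain an empirical diffusion constant can be sketched as an ordinary least-squares regression of error variance on delay, where the slope plays the role of the estimate. The helper name and the variance values below are hypothetical; the original analysis code may differ.

```python
def fit_line(delays, variances):
    """Least-squares fit of variance = intercept + slope * delay."""
    n = len(delays)
    mean_t = sum(delays) / n
    mean_v = sum(variances) / n
    sxx = sum((t - mean_t) ** 2 for t in delays)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(delays, variances))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


# Hypothetical report-error variances (deg^2) at delays of 1 s and 6 s.
delays = [1.0, 6.0]
variances = [30.0, 50.0]
slope, intercept = fit_line(delays, variances)
print(f"empirical diffusion constant ~ {slope:.2f} deg^2/s, intercept {intercept:.2f} deg^2")
```

With only two delays, the fit reduces to the slope between the two measured variances; the intercept corresponds to the non-time-dependent noise.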
In general, the participant data conformed to the model predictions of the best-fit model for each participant, despite substantial individual variability. For participants whose data were best fit by the AtD model (n=9 for both set sizes), the difference between empirical estimates of the diffusion constant ($\hat{\sigma}$_{MNseq}^{2}) from Computed blocks and the model-predicted equivalent fraction of the empirical estimates of the diffusion constant for a single point tended to be low (Figure 8a and b). Specifically, for every participant, the empirical diffusion constant differences fell within the 95% CI computed from simulations using the model fits. For participants whose data were best fit by the DtA model (n=8 for both set sizes), empirical estimates of the diffusion constant ($\hat{\sigma}$_{MNseq}^{2}) from Computed blocks tended to be similar to the expected average of empirical estimates of the diffusion constant for multiple items ((0.5*$\hat{\sigma}$_{NL}^{2}+(N–1)*$\hat{\sigma}$_{NE}^{2})/N^{2}; Figure 8d and e). Specifically, for seven participants, empirical diffusion constant differences fell within the 95% CI computed from simulations using the model fits. The remaining participant was considered poorly fit and was excluded from further analyses. These analyses, which are summarized in Figure 8c and f, thus support the idea that the behavior of most participants was well captured by their better-fitting model.
Summaries of the predictions of report-error variances for AtD and DtA fits are shown in Figure 9. In general, participants best fit by AtD exhibited diffusion constants that were, on average, lower for Computed than Perceived values (Figure 9i and k; lower slope of cyan/blue line vs. purple line). This difference decreased with increased set size, which is expected from the averaging process (Figure 2d). Additionally, both the Early and Late variances were, on average, fairly well matched by their model predictions (Figure 9a, e, c and g). Conversely, participants best fit by DtA exhibited diffusion constants that were notably smaller for Computed mean locations versus single Perceived locations (Figure 9j and l; lower slope of cyan/blue line vs. purple line). The corresponding average predictions by the best-fit DtA models for error variance of Early and Late items also aligned with participant data from DtA-fit participants (Figure 9b, f, d and h). We also compared the variance in AtD and DtA participants’ reports of the mean across delays using an ANOVA and multiple comparisons but found no significant differences in variability at any delay between models (p>0.05).
Sequential condition strategy comparisons
Across the population, participants had roughly equal tendencies to use either one of the two strategies (AtD or DtA) for the two set-size conditions (Figure 10). For set size 2, one more participant was best fit by the AtD (n=9) versus the DtA (n=8) model (Wilcoxon signed-rank two-sided test for the median difference in the log-likelihoods of fits of the two models to data from each participant=0, p=0.868). For set size 5, two more participants were best fit by the AtD (n=9) versus the DtA (n=7) model (p=0.234). Participants well fit at either set size were not significantly more likely to be fit by either model across set sizes (Wilcoxon signed-rank two-sided test for identical median log-likelihood difference of fits of the two models across set size, p=0.283). Given different possible distributions of underlying strategy prevalence (proportions), the probability of obtaining the empirically observed distributions of models shown in Figure 10a for either set size while considering the average model identifiability was peaked near the observed strategy proportions. This result demonstrates that the observed proportions were unlikely to have arisen by chance from model misidentification (Figure 10b). These differences in strategy use did not correlate with age of participants (Figure 10—figure supplement 1, Pearson correlation, p>0.20). Thus, on average, participants lost fidelity in their representations of a Computed value when it needed to be computed from sequentially presented information, as in many processes of evidence accumulation. The dynamics of this degradation differed for the two strategies, neither of which was statistically more likely than the other across our participants.
Alternative models for Sequential conditions
Up to this point, we have considered the two extremes of either: (1) holding all stimuli in memory until the time of the report (the DtA model), or (2) averaging all stimuli as soon as possible and then holding only this single average in memory (the AtD model). When set size is larger than 2, it is possible to perform a hybrid of these strategies. For example, one could diffuse the initial N–1 items until the final stimulus is presented and then combine all evidence at that point. Thus, we did an additional log-likelihood comparison using this hybrid model for set size 5 (note that AtD and this hybrid model are identical for set size 2, and all models had the same number of parameters). We found that three participants previously identified as AtD and three participants previously identified as DtA were slightly better described by the hybrid model (average LL difference for formerly DtA participants=0.34; average LL difference for formerly AtD participants=1.28). This finding is consistent with the idea that people likely use a spectrum of strategies, including nuanced combinations of AtD-like and DtA-like dynamics (which were the focus of the present study, to demonstrate how those dynamics can give rise to identifiable signatures of behavioral errors) that warrant further consideration when building more detailed models of memory-dependent decision-making.
Strategy comparisons across conditions
The use of different strategies (i.e., those captured by the AtD and DtA models) did not appear to reflect a tendency of individual participants to use a particular strategy across different conditions. Specifically, we used Fisher’s exact test of independence based on set size across temporal conditions as well as based on temporal conditions across set sizes to test whether individual participants were best fit by the same model under different task conditions. We failed to reject the null hypothesis that there is no relationship between a participant’s strategy use across set size for both Simultaneous and Sequential conditions (i.e., strategy use in set size 2 Simultaneous was not predictive of use in set size 5 Simultaneous, nor was it for Sequential conditions; p=0.31 and p=1, respectively). We also failed to reject the null hypothesis that there is no relationship between a participant’s strategy use across temporal conditions for both set sizes 2 and 5 (p=0.54 and p=1, respectively). Thus, only in the Simultaneous condition with set size 5 were participants more likely to use one strategy (AtD) than the other (DtA). In all other tested cases, participants were equally likely to use either strategy, and strategy use was not predictive across conditions for individual participants.
Discussion
The goal of this study was to better understand if and how capacity and temporal limitations of working memory affect human decision-making. We used a task that required participants to report remembered spatial locations based on different numbers of objects and for different delay durations, both of which are known to systematically affect the precision of memory reports (Bastos et al., 2018; Cowan et al., 2008; Funahashi et al., 1989; Oberauer et al., 2016; Panichello et al., 2019; Ploner et al., 1998; Schneegans and Bays, 2018; White et al., 1994). We used two pairs of conditions to investigate these effects across decision-making circumstances. The first condition was Perceptual versus Computed, which allowed us to recapitulate previous findings of the effects of capacity and temporal limitations of working memory for directly observed (perceptual) quantities and then extend those findings to the kind of computed quantity that is used as a decision variable for tasks that require integration or averaging to reduce uncertainty (Brody and Hanks, 2016; Gold and Shadlen, 2007; Ratcliff et al., 2016; Shadlen and Shohamy, 2016; Summerfield and Tsetsos, 2012). The second was Simultaneous versus Sequential conditions, which extended our investigation to include the effects of working-memory limitations on decision-making under relatively simple conditions (i.e., when all relevant evidence was presented at once) to the effects in a basic case of evidence accumulation over time (i.e., in which a new piece of evidence is used to update a computed quantity).
Our primary finding was that computed variables based on either simultaneously or sequentially presented information were susceptible to the same kinds of working-memory constraints as perceived variables. These working-memory limitations corresponded to a decrease in precision over time, which places critical constraints on the kinds of decision variables that are required to persist over time, such as when decisions are delayed. Specifically, the variability caused by singular events such as sensory encoding or averaging tended to be ~10° in Simultaneous conditions, whereas noise accumulating during memory delay periods was typically ~3.5°/s. Therefore, after ~3 s, the noise that can be attributed to memory versus non-memory sources is about equal, and after this point memory noise begins to dominate overall variability. This result appears to contradict previous findings of no effect of extra delays on the effectiveness of evidence accumulation for certain decisions (Liu et al., 2015; Waskom and Kiani, 2018). However, those studies used tasks with binary choices that required decision variables with less clear sensitivity to the kinds of working-memory effects we found in the context of a continuous, spatially based decision variable. Additionally, we found that increasing the number of decision-relevant items also decreased the accuracy of the continuous decision variable, although the nature of this effect was variable. More work is needed to fully characterize the conditions under which temporal and capacity limitations on the precision of working-memory representations affect decisions based on those representations.
We also found that the exact nature of interactions between working-memory limitations and decision-making depends critically on the strategy used to form the decision, and those strategies can vary substantially across individuals and tasks. For our tasks, we focused on two primary strategies. The first strategy, captured by the AtD model, stipulated that a participant first calculates and then stores the Computed value. Its key prediction is that a Computed value should be susceptible to the same effects of working-memory limitations as a single remembered Perceptual value in Simultaneous conditions. The second strategy, captured by the DtA model, stipulated that all individual values are stored in working memory until the time of decision. Its key prediction is that the overall rate of variance increase is inversely related to the number of items. Although the differences in performance of adopters of each of the strategies were often minimal in the present study, the differences in strategy rely on rates of degradation over time and thus performance differences would be expected to grow over longer delays. We found that participants tended to use an AtD strategy for the Simultaneous conditions with a relatively high load (five items), but otherwise were roughly equally likely to use either strategy, including for all Sequential conditions.
This finding of multiple strategy use raises several intriguing future questions. For example, we found that for the Simultaneous condition, several individuals switched from using DtA for the smaller set size to AtD for the larger set size, but we do not know if this switch was a consequence of their personal working-memory capacities. From an optimality standpoint, DtA better preserves a computed value compared to AtD for a given level of non-time-dependent noise and cost per storage item (A), but only if A remains low (<1). It would be interesting to see if for more intermediate set sizes (i.e., three or four items) there is a reliable increase in the probability of a participant using AtD with a progression that relates to other measures of the individual’s working-memory capacity. Such future studies would more definitively support the conclusion that increased working-memory load corresponds to an increased tendency to discard information about individual samples and hold only the computed decision variable in memory. Future studies should also examine other factors that might govern which strategy is used for a given set of conditions. For example, participants in our study were instructed to report the average but given no additional details about how to do so, nor given strong incentives for choosing any particular strategy versus another. Future studies could provide more detailed instructions, incentives, and/or feedback to better understand the flexibility with which these different strategies can be employed.
Future work should also examine in more detail several other facets of working memory that were not included in our models but in principle could affect decision variables that are computed and retained over time. First, we did not consider possible differences in metabolic energy and other resources needed to implement the different working-memory demands of different strategies (van den Berg and Ma, 2018). Future studies of strategy heterogeneity may need to consider how different strategies minimize both response errors and execution costs. Second, our DtA model assumed no interference between multiple items stored in memory. This assumption is undoubtedly an oversimplification, given that storage of multiple items has been both hypothesized and shown to create attraction and repulsion (Almeida et al., 2015; Krishnan et al., 2018; Wei et al., 2012). Such directional drift can create a decrease in variance over time that could affect decision variables that involve multiple quantities stored at once. Third, our DtA model also assumed that each item was stored individually. Alternatively, items could have been discarded or merged (chunked) (Krishnan et al., 2018; Wei et al., 2012), leading to different memory loads which could also affect performance. Fourth, most of our participants used strategies that were well described by the AtD or DtA model. However, under certain conditions (i.e., Sequential, set size 5) some participants seemed to use hybrid strategies. This kind of strategy would suggest extensive flexibility in when and how evidence is incorporated into computed decision variables, thereby placing potentially complex demands on working memory.
Both of our primary models were based on assumptions of a drifting memory representation. This random drift is traditionally associated with attractor models of working memory (Bays, 2014; Compte et al., 2000; Macoveanu et al., 2007; Wei et al., 2012) that have been used extensively to describe the underlying neural mechanisms (Funahashi et al., 1989; Shafi et al., 2007; Takeda and Funahashi, 2002; Wimmer et al., 2014). In these models, neural network activity is induced by an external stimulus and then maintained via excitatory connections of similarly tuned neurons and long-range inhibition. Random noise causes the center of this activity (which represents the stimulus) to drift in a manner that, depending on the implementation, can depend on the delay duration, set size, and/or their interaction (Almeida et al., 2015; Bays, 2014; Koyluoglu et al., 2017). A recent implementation can even naturally compute a running average based on sequentially presented information (Esnaola-Acebes et al., 2021). Our results imply that such models should be extended to support the flexible use of different strategies that govern when and how incoming information is used to form such averages. It will be interesting to see if such a flexible model can account for neural activity in the dorsolateral prefrontal cortex, which includes neurons with persistent activity that has been associated with both spatial working memory (Compte et al., 2000; Constantinidis et al., 2018; Riley and Constantinidis, 2016; Wei et al., 2012; Wimmer et al., 2014) and the formation of decisions based on an accumulation of evidence (Curtis and D’Esposito, 2003; Heekeren et al., 2006; Heekeren et al., 2008; Kim and Shadlen, 1999; Lin et al., 2020; Philiastides et al., 2011).
In conclusion, we found that in this spatial, continuous task, participant accuracy for both perceived and computed values was subject to working-memory limitations of both time and capacity. Additionally, we found behavior that was consistent with both of the storage strategies we investigated. The fact that different participants employed different strategies for storing a computed value (such as a decision variable) and that these strategies have different consequences for overall accuracy has important implications not only for future neural network models of working memory, but also for future computational models of decision-making.
Materials and methods
Human psychophysics behavioral task
We tested 17 participants (4 males, 12 females, 1 chose not to answer; age range=22–87 years). The task was created with PsychoPy3 (Peirce et al., 2019) and distributed to participants via Pavlovia.org, which allowed participants to perform the task on their home computers after providing informed consent. These protocols were reviewed by the University of Pennsylvania Institutional Review Board (IRB) and determined to meet eligibility criteria for IRB review exemption authorized by 45 CFR 46.104, category 2.
Participants were instructed to sit one arm's length away from their computer screens during the experiment and to use the mouse to indicate choices. Each participant completed 1–2 sets of four blocks of trials in their own time.
The basic trial structure is illustrated in Figure 1. Each trial began with the presentation of a central white fixation cross (1% of the screen height). The participant was instructed to maintain fixation on this cross when not actively responding. The participant began each trial by placing the mouse over the cross and clicking, to allow for self-pacing and pseudo-fixation. Initiating a trial caused a white annulus of radius 25% of the screen height to appear. A block-specific memory array appeared 250 ms later, centered at an angle chosen uniformly and at random on the annulus. The array consisted of 1, 2, or 5 colored disks, each with a diameter of 1.5% of the screen. The angular difference between any two adjacent disks was at least 6°, and the angular difference between the two most distal disks was at most 60°. The disks from clockwise to counterclockwise were always presented in the same order: green, red, blue, magenta, and yellow. When fewer than five disks were presented, the latter colors were omitted. The consistent color ordering was intended to reduce errors caused by misbinding of location and color. The angular differences between disks in an array were randomly selected from five preselected sets of five angular differences that obeyed the restrictions stated above, centered on a randomly selected location on the circle. If set size was <5, later numbers were omitted. The sets were [–22, –11, –2, 7, 13], [–25, –4, 6, 12, 24], [–30, –18, 3, 15, 29], [–22, –10, 0, 7, 17], and [–19, –12, 0, 9, 28] (numbers are degrees clockwise relative to the randomly selected location on the circle).
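As an illustrative sketch (not the PsychoPy task code), the following checks that the five listed sets obey the stated spacing restrictions and draws the disk angles for one trial. The function names are hypothetical, and the handling of angles as plain degrees modulo 360 is an assumption; the assignment of colors to positions is deliberately left out.

```python
import random

# The five preselected sets of angular offsets (degrees clockwise).
ANGLE_SETS = [
    [-22, -11, -2, 7, 13],
    [-25, -4, 6, 12, 24],
    [-30, -18, 3, 15, 29],
    [-22, -10, 0, 7, 17],
    [-19, -12, 0, 9, 28],
]


def satisfies_spacing(offsets, min_gap=6, max_span=60):
    """Check adjacent gaps >= min_gap and total span <= max_span (degrees)."""
    ordered = sorted(offsets)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(g >= min_gap for g in gaps) and ordered[-1] - ordered[0] <= max_span


def make_array(set_size, rng):
    """Draw disk angles for one trial: random center plus the first offsets."""
    center = rng.uniform(0.0, 360.0)
    offsets = rng.choice(ANGLE_SETS)[:set_size]  # later numbers are omitted
    return [(center + off) % 360.0 for off in offsets]


print(all(satisfies_spacing(s) for s in ANGLE_SETS))
print(make_array(2, random.Random(0)))
```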
The memory array remained on the screen for 0.5 s, while the annulus remained on the screen throughout the delay of 0, 1, or 6 s. At the end of the delay, the fixation cross was replaced with a response cue that either matched a color of a disk in the memory array, indicating a response to the remembered location of that disk, or was white, indicating a response to the mean angle of all disks in the present trial. The response type varied by block (see below). The participant then moved the mouse and clicked on the annulus at a position at which they remembered the requested response. Feedback was then given indicating the correct location, the participant’s response, and the difference between the two.
We used four block-wise conditions: (1) Simultaneous Perceived blocks used arrays of 1, 2, or 5 disks presented simultaneously at the beginning of the trial. Participants were told in advance that they would always be asked to report the location of one of the array disks but were not informed which one until the response period. The probed disk was picked randomly on each trial. (2) Simultaneous Computed blocks used arrays of 2 or 5 disks presented simultaneously at the beginning of the trial. Participants were told in advance they would need to report the average angle of all disks shown in the present trial. (3) Sequential Perceived blocks were identical to Simultaneous Perceived blocks, except only arrays of 2 or 5 disks were used, and all but one of the disks (the most counterclockwise) were presented at the beginning of the trial. The final disk was presented for 0.5 s ending midway through the delay of 1 or 6 s. The most counterclockwise disk was always the last presented disk, to make the task easier. Participants were told in advance that the final disk would be presented in the middle of the delay for these blocks. (4) Sequential Computed blocks were identical to Simultaneous Computed blocks, but with delayed presentation of the final disk as in Sequential Perceived blocks. Again, participants were told in advance that the final disk would be presented in the middle of the delay.
All participants completed at least one block of each type, and most (12) completed two. Each block contained 50 trials at each set size and each delay time, the order of which was randomized.
Basic analyses
Trials were excluded from analysis if the response was >30° from the correct angle. This cutoff was based on assessment of the error distributions (Figure 3—figure supplement 1, Figure 7—figure supplement 1); using a cutoff of 25° did not noticeably change the results. On average, <10% of trials were excluded per delay condition per set size per block (see Figure 3—figure supplement 1, Figure 7—figure supplement 1). These trials were excluded to focus analysis on trials that were directed toward the correct location and avoid lapses of attention and extreme motor errors. We investigated both the bias and variance in participant responses, as follows.
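The exclusion rule can be sketched as follows. The wrapped signed-error helper and the function names are assumptions for illustration; they are not taken from the original analysis code, and the sign convention depends on how angles are measured.

```python
def signed_error(response_deg, target_deg):
    """Response minus target, wrapped into the interval [-180, 180) degrees."""
    return (response_deg - target_deg + 180.0) % 360.0 - 180.0


def keep_trials(responses, targets, cutoff_deg=30.0):
    """Return the signed errors of trials within the exclusion cutoff."""
    errors = [signed_error(r, t) for r, t in zip(responses, targets)]
    return [e for e in errors if abs(e) <= cutoff_deg]


# Hypothetical responses to a target at 0 degrees; the 100-degree response is dropped.
print(keep_trials([10.0, 100.0, 350.0], [0.0, 0.0, 0.0]))
```

Changing `cutoff_deg` to 25 gives the alternative cutoff mentioned above.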
We quantified bias as the mean error between the response and the true probed angle for each participant and condition (positive/negative values imply errors that were systematically counterclockwise/clockwise, respectively). A Bonferroni-corrected two-sided t-test was used to assess whether this mean response error was significantly different from zero across participants for each set size, delay, response type, and temporal presentation. Additionally, the mean error and CI for each participant were calculated for each condition (Figure 3—figure supplements 2 and 3; Figure 7—figure supplements 2 and 3). For the Sequential condition, we also assessed how biased Computed responses were compared to the true mean location. We took the difference of the reported mean from the Nth point and normalized this difference by the distance between the mean of N–1 items and the Nth item. For set size 2, the true mean had a normalized value of 0.5. For set size 5, the true mean had a normalized value of 0.8.
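The normalized values of 0.5 and 0.8 follow from the true mean lying (N–1)/N of the way from the Nth item to the mean of the first N–1 items. A minimal sketch with hypothetical item locations (function names are illustrative only):

```python
def normalized_report(report, early_mean, late_item):
    """Distance of the report from the Nth (Late) item, in units of the
    distance between the mean of the N-1 Early items and the Nth item."""
    return (report - late_item) / (early_mean - late_item)


def true_mean(early_items, late_item):
    """Mean of all N items (Early items plus the Late item)."""
    items = list(early_items) + [late_item]
    return sum(items) / len(items)


# Hypothetical locations in degrees.
for early in ([10.0], [10.0, 20.0, 30.0, 40.0]):
    late = 0.0
    early_mean = sum(early) / len(early)
    value = normalized_report(true_mean(early, late), early_mean, late)
    print(f"set size {len(early) + 1}: normalized true mean = {value:.2f}")
```

On this scale, a value of 1 corresponds to ignoring the Late item and a value of 0 corresponds to reporting the Late item alone.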
We quantified the variance of the error between the response and the true probed angle for each participant and condition. We chose variance as opposed to other measures of dispersion for consistency with our particle models (see below) in which variance scales linearly with delay. We examined the effects of set size, delay duration, and task context on response variability using a two-way repeated-measures ANOVA. On Simultaneous Perceived and Computed blocks, we used a 3 (delay duration: 0, 1, or 6 s) × 3 (set size: 1, 2, or 5 disks) within-participant design. On Sequential Perceived blocks, we used a 2 (delay duration: 1 or 6 s) × 3 (set size: 1, 2, or 5 disks) within-participant design for stimuli presented at the beginning of the trial (Early) and a 2 (delay: 0.5 or 3 s) × 2 (set size: 2 or 5 disks) design for stimuli presented halfway through the trial (Late). On Sequential Computed blocks, we used a 2 (delay duration: 1 or 6 s) × 3 (set size: 1, 2, or 5 disks) within-participant design. When the comparison included set size=1, data were always taken from the Simultaneous Perceived block.
To assess performance differences based on strategy use, additional analyses were performed once the data had been fit to the models and the best-fit model had been selected (see below). These analyses included an assessment of response error variability in the Computed blocks using a 2 (model: AtD or DtA) × 3 or 2 (delay: 0, 1, or 6 s for the Simultaneous condition; 1 or 6 s for the Sequential condition) ANOVA with multiple comparisons to identify differences. To interrogate best-fit parameter differences, two-sided t-tests were used to see if the mean difference in best-fit parameter between AtD and DtA participants was significantly different from 0 for both Simultaneous and Sequential conditions. To assess learning effects, a two-sided, paired t-test was used to see if the mean or standard deviation of error responses in set size 5 Sequential conditions differed between the first and second half of trials (we found no difference at either delay: for 1 s delay p=0.67 and 0.11 for mean and standard deviation, respectively; for 6 s delay p=0.75 and 0.98 for mean and standard deviation, respectively).
Model-based analyses
Our models were based on principles of working memory that are well described by bump-attractor network models (Compte et al., 2000; Laing and Chow, 2001; Wimmer et al., 2014). In such models, stimulus location is represented by a ‘bump’ in activity from neurons tuned to that and similar locations. These neurons recurrently activate each other, maintaining a bump of activity even after stimulus cessation. However, because of the stochastic nature of neural activity and synaptic transmission (Faisal et al., 2008), there is variability in which neurons have the most activity at any given time (and thus are the center of the bump representing the stimulus). This variability in bump center corresponds to variability in the location representation and a degradation of the memory representation over time. The dynamics of this bump can be described as a diffusion process that obeys Brownian motion (Compte et al., 2000; Kilpatrick, 2018; Kilpatrick et al., 2013; Laing and Chow, 2001). We used this simplified description in our models as follows.
Perceived values in working memory
A single point (i.e., the central spatial location of a single disk), x_{1}, is assumed to be represented in working memory by $\widehat{x}$_{t,1}, where t represents the time since the removal of the stimulus. We assume that $\widehat{x}$_{t,1} evolves like a sample from a Brownian-motion process. Specifically, when x_{1} is observed, it is encoded with some perceptual noise, η^{p}. Therefore, at time zero, $\widehat{x}$_{0,1} ~ N(x_{1}, η^{p}). This representation accumulates noise over time with some diffusion constant, σ_{1}^{2}, further degrading the representation of $\widehat{x}$_{t,1} from x_{1} such that $\widehat{x}$_{t,1} ~ N(x_{1}, η^{p}+t*σ_{1}^{2}). There is additional motor noise in the participant’s report, r_{t,1}, and we denote the variance of this motor noise by η^{m}. Mathematically, it is equivalent to add the motor noise at the beginning or the end of the diffusion of $\widehat{x}$_{t,1} when considering the report, r_{t,1}. In our model, we thus represent the sum of the perceptual and motor noise as a single, static noise term. Hence, we show simulated trajectories of $\widehat{x}$_{t,1} in Figure 2a with an initial variance of η_{1}=η^{p}+η^{m}, so that at time t, the report r_{t,1} is the current angle of the trajectory. Therefore:
r_{t,1} ~ N(x_{1}, η_{1}+t*σ_{1}^{2}).
We used Gaussian rather than von Mises distributions because: (1) they are easier to generalize to other, non-circular domains; (2) the Gaussian standard deviation parameter has a more intuitive interpretation than the von Mises concentration parameter; and (3) our stimuli were constrained to be a maximum of 60° apart, and thus the periodicity of the von Mises distribution was unnecessary to capture the diffusion dynamics.
When multiple items are held in memory, they are held with less fidelity than a single point (Bays et al., 2009; Brady and Alvarez, 2015; Koyluoglu et al., 2017; Wei et al., 2012). We therefore assume that the sum of the initial perceptual noise and final motor noise, with variance denoted by η_{N}, can depend on the number of disks, N. Moreover, we describe $\widehat{x}$_{t,n}, the representation of the nth item at time t, by a normal distribution with a diffusion constant that is potentially higher than for a single point. We assume that this new diffusion constant, σ_{N}^{2}, equals σ_{1}^{2}*N^{A} and thus scales as a power of the total number of stimuli, N, held in memory (Bays et al., 2009; Bays and Husain, 2008; Wei et al., 2012), and is proportional to the diffusion constant corresponding to a single stimulus representation, σ_{1}^{2}. Therefore, for the report of the nth item with true location x_{n}:
r_{t,n} ~ N(x_{n}, η_{N}+t*σ_{1}^{2}*N^{A}).
All representations in a set of size N share the same magnitude of non-time-dependent noise, η_{N}, but the evolution of each representation is assumed to be independent. To examine distributions of responses across the various presented locations, we measured the error of the response r_{t,n} relative to the true location of the target the observer was asked to report, x_{n}. According to our model, the difference between the true and reported location (the error, e_{t,n}=r_{t,n}–x_{n}) is distributed as:
e_{t,n} ~ N(0, η_{N}+t*σ_{1}^{2}*N^{A}).
The linear relationship between total accumulated noise and time for both a single and multiple memoranda is illustrated in Figure 2b.
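The linear growth of error variance with delay can be checked with a short simulation of the diffusing-particle description. This is a minimal sketch, not the authors' fitting code: the parameter values, step size, and function names are hypothetical, and the predicted variance uses the form η_{N}+t*σ_{1}^{2}*N^{A} described above.

```python
import math
import random
import statistics


def predicted_variance(t, eta, sigma1_sq, n_items=1, a=0.0):
    """Model variance of the report error after a delay of t seconds."""
    return eta + t * sigma1_sq * n_items ** a


def simulate_errors(t, eta, sigma1_sq, n_items=1, a=0.0,
                    n_trials=2000, dt=0.1, seed=0):
    """Simulate report errors: static noise plus a Gaussian random walk."""
    rng = random.Random(seed)
    step_sd = math.sqrt(sigma1_sq * n_items ** a * dt)
    n_steps = round(t / dt)
    errors = []
    for _ in range(n_trials):
        err = rng.gauss(0.0, math.sqrt(eta))  # perceptual + motor noise
        for _ in range(n_steps):
            err += rng.gauss(0.0, step_sd)    # diffusion during the delay
        errors.append(err)
    return errors


# Hypothetical parameters: eta = 16 deg^2, sigma1_sq = 4 deg^2/s.
for delay in (0.0, 1.0, 6.0):
    sim_var = statistics.variance(simulate_errors(delay, 16.0, 4.0))
    print(f"delay {delay:.0f} s: simulated {sim_var:.1f}, "
          f"predicted {predicted_variance(delay, 16.0, 4.0):.1f}")
```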
AtD simultaneous model
For this model, the representation of the average is stored as a single particle that diffuses the same as a Perceived item (i.e., a location at which there was a visible stimulus; see Figure 2b). Thus, the diffusion term for the representation of a computed average of N items, σ_{MN}^{2}, is also σ_{1}^{2}. We do not assume that the representation of the average has the same static noise as a single point, because there could be additional noise from inaccurately averaging multiple items or conversely a reduction in overall noise resulting from the averaging of multiple random variables (the constituent items). We denote the variance of the static noise for the Computed mean by η_{MN}. The difference between the true mean of N stimuli and the mean reported at time t is therefore:
e_{t,MN} ~ N(0, η_{MN}+t*σ_{1}^{2}).
DtA simultaneous model
For this model, the individual perceived items are stored as individual, independently diffusing particles and then averaged at the end of the trial. Thus, the diffusion constant of the Computed value is the variance of the average of N random variables each with the diffusion constant σ_{1}^{2}*N^{A}, resulting in an effective diffusion constant for the Computed value of σ_{MN}^{2}=σ_{1}^{2}*N^{A}/N, where the division by N arises from averaging. Again, we allow for a free non-time-dependent noise term because of the uncertain effects of the averaging calculation itself. For this model, the error in the reported location at time t of the mean, M, of N items, e_{t,MN}, is:
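Using the effective diffusion constant just derived, the error distribution referred to later as Equation 5 can be reconstructed as below. As before, this is a sketch with an assumed zero mean.

```latex
e_{t,MN} \sim \mathcal{N}\!\left(0,\; \eta_{MN} + t\,\frac{\sigma_{1}^{2} N^{A}}{N}\right) \tag{5}
```

The factor 1/N follows from averaging N independent particles: the variance of their mean is N times the single-particle variance divided by N^{2}.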
If A=1, the AtD and DtA models are identical. We thus used best-fitting values of A to help assess model distinguishability for each participant and task condition (see Figure 2—figure supplement 1; Figure 4—figure supplement 1). If A<1, then the DtA strategy results in a lower diffusion constant for a Computed value than predicted by the AtD model and results in a smaller average reporting error (see Figure 2c). If A>1, then AtD results in the lower diffusion constant and thus a lower average reporting error. However, given the parameter estimates obtained in this study, we did not find that participants necessarily used the strategy that would result in the lowest response variability.
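The dependence on A can be checked numerically. The following Python sketch compares the effective diffusion constants of the Computed mean under the two strategies; the function names and the example values of σ_{1}^{2}, N, and A are illustrative assumptions, not values from the study.

```python
import math


def effective_diffusion(sigma1_sq: float, n_items: int, exponent_a: float) -> dict:
    """Effective diffusion constant of the Computed mean under each strategy."""
    atd = sigma1_sq  # one stored average diffuses like a single item
    dta = sigma1_sq * n_items ** exponent_a / n_items  # mean of N independent particles
    return {"AtD": atd, "DtA": dta}


def lower_variance_strategy(sigma1_sq: float, n_items: int, exponent_a: float) -> str:
    """Name the strategy with the smaller effective diffusion constant."""
    d = effective_diffusion(sigma1_sq, n_items, exponent_a)
    if math.isclose(d["AtD"], d["DtA"]):
        return "tie"
    return "AtD" if d["AtD"] < d["DtA"] else "DtA"


if __name__ == "__main__":
    # Hypothetical values: sigma_1^2 = 1 and set size N = 5.
    for a in (0.5, 1.0, 1.5):
        print(a, effective_diffusion(1.0, 5, a), lower_variance_strategy(1.0, 5, a))
```

With these illustrative inputs, DtA has the smaller constant for A<1, the two coincide at A=1, and AtD has the smaller constant for A>1, matching the cases described above.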
Sequentially presented values in working memory
In the Sequential blocks, N–1 items were presented immediately (Early items), and the Nth item was presented halfway through the delay (Late item). Therefore, both models assume that the diffusion constant for the representation of the N–1 Early items increases with the addition of the Nth item, corresponding to the increased memory load. The representation of the Late item then diffuses for only half of the delay time, T (see Figure 2d and e). We formalized this process with the following model for the report error of the Early (e_{T,NE}) and Late (e_{T,NL}) items:
Here, T is the total time of the delay, and we assumed different non-time-dependent noise for Early and Late items (η_{NE} and η_{NL}, respectively).
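The two error distributions, referred to later as Equations 6a and 6b, can be reconstructed from this description and from the diffusion constants quoted for Early and Late items in the model-assumption checks. The zero means are an assumption of this sketch.

```latex
\begin{align}
e_{T,NE} &\sim \mathcal{N}\!\left(0,\; \eta_{NE} + \tfrac{T}{2}\,\sigma_{1}^{2}(N-1)^{A} + \tfrac{T}{2}\,\sigma_{1}^{2}N^{A}\right) \tag{6a} \\
e_{T,NL} &\sim \mathcal{N}\!\left(0,\; \eta_{NL} + \tfrac{T}{2}\,\sigma_{1}^{2}N^{A}\right) \tag{6b}
\end{align}
```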
AtD sequential model
This model assumes that the Early items are averaged immediately and stored as a single item. At t=T/2, the Late item is presented and combined immediately, through appropriate weighted averaging, with the mean of the Early items. This new mean again diffuses with the same accumulating noise as a single item (see Figure 2d). Therefore:
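The error distribution referred to as Equation 7 can be reconstructed from the two time-dependent terms described in the text; the zero mean is an assumption of this sketch.

```latex
e_{T,MN} \sim \mathcal{N}\!\left(0,\; \eta_{MNSeq}
  + \left(\frac{N-1}{N}\right)^{2}\frac{T}{2}\,\sigma_{1}^{2}
  + \frac{T}{2}\,\sigma_{1}^{2}\right) \tag{7}
```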
At t=T/2, the representation of the Nth item has not accumulated any diffusion noise and only has non-time-dependent noise, which is absorbed in the η_{MNSeq} term. The first time-dependent term, ((N−1)/N)^{2}*T/2*σ_{1}^{2}, results from the appropriate weighted averaging of the mean of the Early items (time-dependent noise of T/2*σ_{1}^{2}) with the Late item (time-dependent noise=0). The final term, T/2*σ_{1}^{2}, is the diffusion of the resultant mean until the end of the delay.
DtA sequential model
This model assumes that the representations of all N items diffuse as they are presented, resulting in N–1 items described by Equation 6a and one item described by Equation 6b. These items are then averaged at the end of the delay, resulting in an overall error of:
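The error distribution referred to as Equation 8 can be reconstructed from the time-dependent variance quoted in the DtA Sequential model-assumption check; the zero mean is an assumption of this sketch.

```latex
e_{T,MN} \sim \mathcal{N}\!\left(0,\; \eta_{MNSeqDtA}
  + \frac{\tfrac{T}{2}\,\sigma_{1}^{2}N^{A}
  + (N-1)\left[\tfrac{T}{2}\,\sigma_{1}^{2}(N-1)^{A} + \tfrac{T}{2}\,\sigma_{1}^{2}N^{A}\right]}{N^{2}}\right) \tag{8}
```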
where the constant noise terms from the Early and Late items are absorbed in the η_{MNSeqDtA} term, the next term, T/2*σ_{1}^{2}*N^{A}, is the diffusion in the representation of the last disk shown, and the remaining terms arise from the first N–1 items shown. The effect of this averaging on the effective diffusion constant is shown in Figure 2e.
Alternative model
A third model was considered for Sequential set size five conditions: the first N–1 items diffused until the Nth point was presented, at which point all items were averaged and this average diffused for the remainder of the delay. Thus, e_{T,MNseqDtA} ~ N(η_{MN_SeqAlt} + (N–1)*(T/2*σ_{1}^{2}*(N–1)^{A})/N^{2} + T/2*σ_{1}^{2}).
Model fitting
The AtD and DtA models were fit to data from the Simultaneous Perceived and Computed blocks using five free parameters: (1) the static noise of a single point (η_{1}), (2) the diffusion noise of a single point (σ_{1}^{2}), (3) the static noise of N items (η_{N}), (4) the exponent of storing N items (A), and (5) the static noise of the mean (η_{MN}, for AtD or DtA). We fit these models for N=2 and N=5 conditions separately, using trials from the following conditions. Perceived: delays 1, 3, and 6 s; array sizes 1 (Equation 3a) and N (Equation 3b). Computed: delays 1, 3, and 6 s; array size N (Equation 4 for AtD, Equation 5 for DtA). To validate the assumption that the cost of storing additional items (A) was constant between Perceived and Computed blocks for participants best fit by DtA, we refit the models using only Perceived or only Computed trials. The differences in the best-fit A values were compared across participants using a two-sided t-test for mean difference=0.
The AtD and DtA models were fit to data from the Sequential Perceived and Computed blocks using six free parameters. The additional parameter accounted for differences in the static noise for Early and Late items. We fit these models for N=2 and N=5 conditions separately, using trials from the following conditions. Perceived: delays 1, 3, and 6 s; array size 1 (Equation 3a). Perceived: delays 3 and 6 s, array size N for both Early (Equation 6a) and Late (Equation 6b) items. Computed: delays 3 and 6 s, array size N (Equation 7 for AtD or Equation 8 for DtA).
Because the mean error for each individual participant was not always 0, when fitting the AtD and DtA models we used the empirical mean error from the condition being fitted as a fixed bias term in the model. Mean error and CIs for each participant for each condition are shown in Figure 3—figure supplements 2 and 3.
We obtained separate maximum-likelihood fits for AtD and DtA models for each individual participant, using the function fmincon in MATLAB to minimize the summed negative log-likelihood of obtaining the observed errors for a given condition according to the above equations. Initial parameter values were randomized and the fitting repeated to avoid local minima. Because all models within a given condition had the same number of parameters, we compared log-likelihoods to determine the best-fitting model for a given participant. Because the number of parameters is the same, comparing likelihoods produces equivalent model selection to BIC or AIC.
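The objective being minimized can be illustrated with a short Python sketch. It is not the authors' MATLAB code: it evaluates only the Computed-block term of the Simultaneous models for given parameter values, omits the Perceived-block terms that also constrain the fit, and uses hypothetical function names and trial data. In practice this function would be passed to a constrained optimizer with several random starting points.

```python
import math


def gaussian_nll(errors, variance, bias=0.0):
    """Summed negative log-likelihood of errors under N(bias, variance)."""
    return sum(
        0.5 * math.log(2 * math.pi * variance) + (e - bias) ** 2 / (2 * variance)
        for e in errors
    )


def computed_variance(model, t, n_items, eta_mn, sigma1_sq, exponent_a):
    """Predicted error variance of the Computed mean after a delay of t seconds."""
    if model == "AtD":
        return eta_mn + t * sigma1_sq
    if model == "DtA":
        return eta_mn + t * sigma1_sq * n_items ** exponent_a / n_items
    raise ValueError(f"unknown model: {model}")


def total_nll(model, trials, n_items, eta_mn, sigma1_sq, exponent_a, bias=0.0):
    """trials: iterable of (delay_seconds, error) pairs from Computed blocks."""
    return sum(
        gaussian_nll(
            [err],
            computed_variance(model, t, n_items, eta_mn, sigma1_sq, exponent_a),
            bias,
        )
        for t, err in trials
    )


def best_model(trials, n_items, params_by_model):
    """Pick the model with the lower negative log-likelihood (same parameter count)."""
    scores = {m: total_nll(m, trials, n_items, *p) for m, p in params_by_model.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up trials and parameter values (eta_MN, sigma_1^2, A), for illustration only.
    trials = [(1.0, 0.2), (3.0, -0.5), (6.0, 1.1)]
    params = {"AtD": (0.1, 0.05, 0.6), "DtA": (0.1, 0.05, 0.6)}
    print(best_model(trials, 5, params))
```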
Assessing model assumption and identifiability
To assess how well each participant’s data matched the assumptions of the AtD and DtA models, we also fit a line to the variances of response errors across delays for a given condition for a given participant to obtain empirical estimates of the various diffusion constants (e.g., slope of lines in Figure 2b; empirical estimate of a Perceived value, σ̂_{N}^{2}, for set size N; empirical estimate of a Computed value, σ̂_{MN}^{2}, set size N). These empirical estimates of the diffusion constants did not enforce the relationships imposed by the AtD and DtA models between the different diffusion constants in each model, respectively. We compared the relationships of these empirical estimates of diffusion constants to the relationships assumed by our models, as follows:
AtD Simultaneous. The Computed mean diffuses with the same diffusion constant as a single value. Thus:
DtA Simultaneous. The Computed mean is the average of N items each diffusing with a constant of σ_{N}^{2}. Thus:
AtD Sequential. The time-dependent noise has variance that increases as ((N–1)/N)^{2}*T/2*σ_{1}^{2}+T/2*σ_{1}^{2} (Equation 7). Factoring out T gives the diffusion constant for the Computed mean, σ_{MN}^{2}=[(N–1)^{2}+N^{2}]/(2N^{2})*σ_{1}^{2}. Thus:
DtA Sequential. The time-dependent noise has variance that increases as (T/2*σ_{1}^{2}*N^{A}+(N–1)[T/2*σ_{1}^{2}*(N–1)^{A}+T/2*σ_{1}^{2}*N^{A}])/N^{2} (Equation 8). By Equation 6a, the diffusion constant for an Early Perceived point, σ_{NE}^{2}, is [0.5*σ_{1}^{2}*(N–1)^{A}+0.5*σ_{1}^{2}*N^{A}] and by Equation 6b, the diffusion constant for a Late Perceived point, σ_{NL}^{2}, is σ_{1}^{2}*N^{A}. Factoring out T and substituting gives the diffusion constant for the Computed mean, σ_{MN}^{2}=(0.5σ_{NL}^{2}+(N–1)*σ_{NE}^{2})/N^{2}. Thus:
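The four relationships introduced by "Thus:" above (referred to below as Equations 9–12) are not present in this copy. Written in terms of the empirical estimates, they can plausibly be summarized as follows; the exact published form (for example, as ratios or differences) is an assumption.

```latex
\begin{align}
\text{AtD Simultaneous:}\quad & \hat{\sigma}_{MN}^{2} = \hat{\sigma}_{1}^{2} \tag{9} \\
\text{DtA Simultaneous:}\quad & \hat{\sigma}_{MN}^{2} = \frac{\hat{\sigma}_{N}^{2}}{N} \tag{10} \\
\text{AtD Sequential:}\quad & \hat{\sigma}_{MN}^{2} = \frac{(N-1)^{2} + N^{2}}{2N^{2}}\,\hat{\sigma}_{1}^{2} \tag{11} \\
\text{DtA Sequential:}\quad & \hat{\sigma}_{MN}^{2} = \frac{0.5\,\hat{\sigma}_{NL}^{2} + (N-1)\,\hat{\sigma}_{NE}^{2}}{N^{2}} \tag{12}
\end{align}
```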
To assess how well the relationships between participant empirical estimates of the diffusion constants matched these assumptions, for each participant, we simulated 1000 iterations of a participant performing the task using the best-fit model for the given true participant and the maximum-likelihood estimate parameters for that participant. We then estimated the empirical diffusion constants for each of these iterations in the same way that we did for our participants, namely by fitting a line to the measured variance of the simulated errors across delays, for each condition and iteration. Our 1000 simulations gave us an expected range around the expected diffusion constant relationships detailed in Equations 9–12 to compare to our participants’ empirical diffusion constant relationships. Participants whose empirical diffusion constant relationships fell within the central 95% of the simulated expected range were considered well fit by their model.
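The simulate-then-refit procedure can be sketched in Python as follows. This is an illustrative outline only: the parameter values, trial counts, and function names are assumptions, and a single condition is simulated rather than the full task.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def simulate_errors(delay, eta, sigma_sq, n_trials, rng):
    """Draw response errors with variance eta + delay * sigma_sq."""
    sd = (eta + delay * sigma_sq) ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n_trials)]


def empirical_diffusion(delays, eta, sigma_sq, n_trials, rng):
    """Slope of error variance against delay for one simulated participant."""
    variances = [
        statistics.variance(simulate_errors(t, eta, sigma_sq, n_trials, rng))
        for t in delays
    ]
    return fit_line(delays, variances)[0]


def central_interval(values, coverage=0.95):
    """Bounds of the central `coverage` fraction of the sorted values."""
    ordered = sorted(values)
    lo = int((1 - coverage) / 2 * len(ordered))
    return ordered[lo], ordered[len(ordered) - 1 - lo]


if __name__ == "__main__":
    rng = random.Random(0)
    delays = [1.0, 3.0, 6.0]  # delays used in the Simultaneous blocks
    # Hypothetical best-fit parameters and trial count per delay.
    sims = [empirical_diffusion(delays, 1.0, 0.5, 60, rng) for _ in range(1000)]
    print(central_interval(sims))
```

A participant's empirical estimate would then be compared against the interval returned by `central_interval`.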
To assess model identifiability, for each participant and condition, we fit both models to the results of each set of 1000 simulations generated using the best-fitting parameters from the best-fitting model for that participant and condition. We used the log-likelihoods to determine the best model for each simulation and determined the percentage of correctly identified models. We used these results to further assess the reliability of our analyses of strategy prevalence across the population of participants, as follows. For each condition (Simultaneous vs. Sequential and set size 2 vs. 5), we determined an empirical proportion of AtD versus DtA prevalence (i.e., in terms of the relative numbers of participants whose data were best fit by the two models). We then sampled from a binomial distribution 10,000 times using a range of possible proportions between 0 and 1 in 1/34th increments. Each iteration yielded a simulated proportion, which we then adjusted with our model-identifiability results: each simulated AtD participant had a chance of being misidentified according to average percent correctly identified for that model as determined above, and likewise with DtA. The proportion of samples that matched our data was used to create the probability of obtaining our observed results, given possible underlying proportions and average model identifiability (Figures 6b and 10b).
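The prevalence analysis can be sketched in Python as below. The sample of 17 participants follows the author response; the identifiability rates, the observed count, and all function names are placeholders chosen for illustration.

```python
import random


def observed_atd_count(n_participants, p_atd, acc_atd, acc_dta, rng):
    """Simulate one population and count participants that would be labeled AtD."""
    count = 0
    for _ in range(n_participants):
        true_atd = rng.random() < p_atd
        # Chance of correct identification depends on the true model.
        correct = rng.random() < (acc_atd if true_atd else acc_dta)
        labeled_atd = true_atd if correct else not true_atd
        count += labeled_atd
    return count


def probability_of_observation(observed, n_participants, p_atd, acc_atd, acc_dta,
                               n_iter=10_000, seed=0):
    """Fraction of simulated populations whose labeled AtD count equals `observed`."""
    rng = random.Random(seed)
    hits = sum(
        observed_atd_count(n_participants, p_atd, acc_atd, acc_dta, rng) == observed
        for _ in range(n_iter)
    )
    return hits / n_iter


if __name__ == "__main__":
    # Candidate underlying proportions from 0 to 1 in 1/34 steps.
    grid = [k / 34 for k in range(35)]
    # Placeholder inputs: 8 of 17 participants labeled AtD; 90% and 85% identifiability.
    curve = [probability_of_observation(8, 17, p, 0.90, 0.85, n_iter=2000) for p in grid]
    best = max(range(len(grid)), key=curve.__getitem__)
    print(round(grid[best], 3), curve[best])
```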
Data availability
All analysis code is available on GitHub (https://github.com/TheGoldLab/Memory_Diffusion_Task, copy archived at swh:1:rev:69cee7449f92f9d19148332979087bf4e6a9f867). Data used for figures will be made available on Dryad.

Dryad Digital Repository. Memory array locations, delay times, and participant response. https://doi.org/10.5061/dryad.w3r2280rm
References

Neural circuit basis of visuospatial working memory precision: a computational and behavioral study. Journal of Neurophysiology 114:1806–1818. https://doi.org/10.1152/jn.00362.2015

Dynamic shifts of limited working memory resources in human vision. Science (New York, N.Y.) 321:851–854. https://doi.org/10.1126/science.1158023

Noise in neural populations accounts for errors in working memory. The Journal of Neuroscience 34:3632–3645. https://doi.org/10.1523/JNEUROSCI.3204-13.2014

A reservoir of time constants for memory traces in cortical neurons. Nature Neuroscience 14:366–372. https://doi.org/10.1038/nn.2752

Neural underpinnings of the evidence accumulator. Current Opinion in Neurobiology 37:149–157. https://doi.org/10.1016/j.conb.2016.01.003

Synaptic Mechanisms and Network Dynamics Underlying Spatial Working Memory in a Cortical Network Model. Cerebral Cortex (New York, N.Y.) 10:910–923. https://doi.org/10.1093/cercor/10.9.910

Persistent Spiking Activity Underlies Working Memory. The Journal of Neuroscience 38:7020–7028. https://doi.org/10.1523/JNEUROSCI.2486-17.2018

Theory and Measurement of Working Memory Capacity Limits. Advances in Research and Theory 9:49–104. https://doi.org/10.1016/S0079-7421(08)00002-9

Persistent activity in the prefrontal cortex during working memory. Trends in Cognitive Sciences 7:415–423. https://doi.org/10.1016/s1364-6613(03)00197-9

Mnemonic Coding of Visual Space in the Monkey’s Dorsolateral Prefrontal Cortex. Journal of Neurophysiology 61:331–349. https://doi.org/10.1152/jn.1989.61.2.331

The neural basis of decision making. Annual Review of Neuroscience 30:535–574. https://doi.org/10.1146/annurev.neuro.29.051605.113038

Visual Decision-Making in an Uncertain and Dynamic World. Annual Review of Vision Science 3:227–250. https://doi.org/10.1146/annurev-vision-111815-114511

The neural systems that mediate human perceptual decision making. Nature Reviews. Neuroscience 9:467–479. https://doi.org/10.1038/nrn2374

Optimizing working memory with heterogeneity of recurrent cortical excitation. Journal of Neuroscience 33:18999–19011. https://doi.org/10.1523/JNEUROSCI.1641-13.2013

Synaptic mechanisms of interference in working memory. Scientific Reports 8:1–20. https://doi.org/10.1038/s41598-018-25958-9

Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature Neuroscience 2:176–185. https://doi.org/10.1038/5739

Synaptic efficacy shapes resource limitations in working memory. Journal of Computational Neuroscience 44:273–295. https://doi.org/10.1007/s10827-018-0679-7

Stationary bumps in networks of spiking neurons. Neural Computation 13:1473–1494. https://doi.org/10.1162/089976601750264974

What limits working memory capacity. Psychological Bulletin 142:758–799. https://doi.org/10.1037/bul0000046

Error-correcting dynamics in visual working memory. Nature Communications 10:3366. https://doi.org/10.1038/s41467-019-11298-3

PsychoPy2: Experiments in behavior made easy. Behavior Research Methods 51:195–203. https://doi.org/10.3758/s13428-018-01193-y

Temporal limits of spatial working memory in humans. European Journal of Neuroscience 10:794–797. https://doi.org/10.1046/j.1460-9568.1998.00101.x

Diffusion Decision Model: Current Issues and History. Trends in Cognitive Sciences 20:260–281. https://doi.org/10.1016/j.tics.2016.01.007

Role of Prefrontal Persistent Activity in Working Memory. Frontiers in Systems Neuroscience 9:181. https://doi.org/10.3389/fnsys.2015.00181

Drift in Neural Population Activity Causes Working Memory to Deteriorate Over Time. The Journal of Neuroscience 38:4859–4869. https://doi.org/10.1523/JNEUROSCI.3440-17.2018

From Distributed Resources to Limited Slots in Multiple-Item Working Memory: A Spiking Network Model with Normalization. The Journal of Neuroscience 32:11228–11240. https://doi.org/10.1523/JNEUROSCI.0735-12.2012
Decision letter

Tobias H Donner, Reviewing Editor; University Medical Center Hamburg-Eppendorf, Germany

Michael J Frank, Senior Editor; Brown University, United States

Peter R Murphy, Reviewer; Trinity College Dublin, Ireland
Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Strategy-dependent effects of working-memory limitations on human perceptual decision-making" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Tobias Donner as the Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Peter R Murphy (Reviewer #1).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) Alternative strategies for sequential task.
The authors claim that some subjects follow the AtD strategy and others the DtA strategy but experimental evidence for this claim seems weak. Take Figure 10 as an example (Figure 6 is similar). The authors conclude from the data in Figure 10 that on the population level there is no significant difference between the models. On an individual subject level, the δ_{LL} values are small (for most subjects |δ_{LL}| < 3) which one could interpret as either model fitting the data equally well.
In order to claim that there are indeed two different strategies in place, it needs to be shown that the data can only be explained by heterogeneous strategies (for example following a methodology as in Stephan et al. NeuroImage 2009 and Rigoux et al. NeuroImage 2014).
Regarding the sequential task: It may be worth considering a mixed strategy model as an alternative because it may explain the data better. Specifically, subjects would follow the DtA strategy until the last stimulus is observed and then switch to the AtD strategy until the end of the delay (i.e., compute the average in the middle of the trial, once all the evidence has been observed).
2) Appropriateness of modeling choices.
The A parameter, governing the relationship between the diffusion constant for a single point and the constants for multiple points, seems estimated differently in AtD and DtA models: in AtD, it's estimated using only data from Perceived blocks with set size > 1, and it plays no role in the AtD process (only, instead, in the memory maintenance process during the delay period of Perceived trials); whereas in DtA, it's estimated using data from both the same Perceived blocks, ‘and’ the Computed blocks at equivalent set sizes. This raises two concerns.
i. Because A parameters in each model are effectively fit to different data, any comparison of the parameter estimates (which is invited by placing them in the same table and by some of the discussion in the text [p.9]) needs to be carefully qualified in the associated text.
ii. There is an implicit assumption that the A parameter is fixed across Perceived and Computed blocks. However, Perceived trials with set size > 1 require working memory maintenance of a ‘conjunction’ of stimulus features (location and colour), whereas the latter require maintenance (assuming the DtA process is employed) of only a single feature per stimulus (location); thus, it can reasonably be expected that the effect of load may be more severe in Perceived than Computed blocks. It seems that this possibility is not allowed for in the presented model fits.
Recommendations:
a. The above concerns could be addressed by fitting another round of models, this time fitting A in the DtA model using data from only Computed blocks.
b. In addition, A estimates should be compared between fits of the AtD and DtA models (something that is not possible given the fits as currently presented): if there is a systematic difference between the two, this would indicate that A is indeed different in Perceived and Computed blocks and this should be accounted for in the fits.
3) Implications of model fits.
The implications of strategy choice could be further illuminated by examining what factors if any (overall accuracy of judgments; magnitude of non-time-dependent model parameters) differentiate AtD and DtA adopters. Further clarification of what differentiates working memory from decision computations on the presented tasks could be achieved by addressing the following questions through further analysis and/or discussion: How should the decision-specific (η_{MN}) parameters be interpreted in the context of other prominent models of decision-making? How does their magnitude compare to other noise sources? Does this speak to the question of whether the predominant source of noise in decision-making is sensory/motor or memory-related or related to the decision computation itself?
4) Clarity of presentation.
The Results section is difficult to read, and several key aspects of the approach and findings are only clarified during careful reading of the Methods section. Most prominently, there is insufficient explanation of key model predictions that may be counterintuitive for many readers; a lack of clarity around what individual model parameters capture; and confusing elements to how the model fits are presented. We encourage the authors to carefully revise the Results section with this concern in mind.
Specific recommendations:
a. Implications of AtD vs DtA strategy choice:
The fact that, all else being equal, the DtA strategy generates ‘more’ precise behaviour on Computed trials than the AtD model (at least for the parameter range human participants seem to occupy here) is the central feature that differentiates behaviour produced by the two strategies and renders the models identifiable. The authors also seem to take the direction of this effect to be self-evident, as no effort is made to explain to the reader why this pattern emerges. For instance, readers may wonder whether allowing for N = [2, 5] sources of Gaussian noise compared to only 1 source should actually produce ‘more’ variability in behaviour. Now, the averaging over particles that takes place at the culmination of the DtA process counteracts the greater total noise to produce less variability in behavioural reports. But this was far from self-evident, and this key effect should be unpacked.
b. Model parameter descriptions:
There seems to be a general lack of clarity around what exactly each model parameter, and in particular different subscripts to different parameters, are supposed to capture. In Figure 2, for example, subscripts N, MN, 1, N(E/L) and MNSeq are all used but only explained in Methods.
c. Alternative strategy interpretation. (see also point 3):
Please clarify what exactly the finding is, because this currently seems ambiguous: Compare line 244 "(…) participants had roughly equal tendencies to use either of the two strategies" implying that we can distinguish which strategy individual subjects are following, vs. line 257 "(…) neither of which was more likely than the other for a given participant" which implies the opposite.
d. Model fitting procedures (see also point 2):
The role of the A parameter in the different fits is confusing, specifically, seeing fits of A for the AtD model, since this parameter does not seem to be used at all in the AtD process. If our understanding is correct, the A parameter in these fits is only relevant to producing behavior in Perceived blocks with set size > 1 – a condition of the experiment to which the AtD process is never actually applied. But at no point is this made explicit, leaving room for quite considerable confusion when the reader encounters this important section of the Results.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Strategy-dependent effects of working-memory limitations on human perceptual decision-making" for further consideration by eLife. Your revised article has been evaluated by Michael Frank (Senior Editor) and a Reviewing Editor.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. Both reviewers were overall very positive about your revisions and felt that the manuscript is much more accessible now. Both support publication in eLife.
Essential revisions:
1) Please address one outstanding clarification question by Reviewer #1, with which Reviewer #2 agreed. Rather than summarizing, we paste the original reviewer point below. Once this point is addressed, the paper can be accepted without additional review.
I only have one lingering point of confusion that I would welcome clarification on. This again centres around treatment of the A parameter in the AtD model. The authors write in the current manuscript (p.7) that in this model "the average is calculated immediately upon observing the evidence and then stored as a single particle in working memory" (lines 95–96) and then "the single estimate held in working memory diffuses with the same diffusion constant as a single perceived item (σ_{MN}^{2} = σ_{1}^{2})" (lines 97–99). Based on this my understanding is that there is only ever one particle diffusing in the AtD model during Computed blocks, regardless of set size; this particle always has the same diffusion constant (σ_{1}^{2}), and there is, therefore, ‘no role’ for set size/the A parameter in determining diffusion noise during Computed-block memory maintenance in the AtD model. Why, then, is it later written that "Because of the previously described relationships between σ_{1}^{2}, σ_{N}^{2}, and σ_{MN}^{2} it is therefore also true that in the AtD model σ_{N}^{2} = σ_{MN}^{2} * N^{A}" (lines 109–110)? Given the earlier sentences, the only way I can see this being true is if N here refers to the number of ‘particles’ being maintained in memory (which, in the AtD model, is always equal to 1, and so the N^{A} term is doing no work here and just causes considerable confusion) – and not the set size presented to the participant, as N is consistently used to denote elsewhere in the paper. I'm sorry if I'm missing something here, but this seems a key conceptual point to get right for clear presentation and differentiation between the two models. The new Table 1, and my careful reading of the Methods, seems consistent with my own intuition that A plays no role on Computed blocks in the AtD model.
But this seems fundamentally inconsistent with the equation emphasized on lines 109–110; and indeed with the authors' response to point 2 in the first round of reviews, which I must confess I did not understand.
Now, assuming my own interpretation is correct, and that indeed the A parameter is not doing any work on Computed blocks in the AtD model (instead, in this model it only serves to set the diffusion noise across different set sizes in ‘Perceived’ blocks), then I stand by my initial point that without clarification, it is misleading to include and invite comparison of fitted A parameters for the AtD and DtA models in the same table (new Table 2). In one case (AtD) the A parameter only captures (and in turn, will only be constrained by) behaviour in Perceived blocks; in the other (DtA) it captures (and is constrained by) behaviour in both Perceived and Computed blocks. But currently, this is never made explicit.
Reviewer #1 (Recommendations for the authors):
I thank the authors for engaging well with all comments and suggestions from the first round of reviews. In my opinion the new draft – including more detailed explanation of the models and their predictions early on – is a lot more accessible. I also found the new model identifiability analyses to be quite convincing, in the sense that they provide further evidence to support the claim that distinct strategies are indeed being used by different participants and that while the specifically identified proportions are subject to quite some uncertainty (especially for low set sizes), this key result is nonetheless recoverable given the data at the authors' disposal. Altogether, these additions reaffirm my initial impression that this manuscript is a valuable contribution to the field, breaking new ground in connecting working memory and decision-making.
Reviewer #2 (Recommendations for the authors):
The authors have followed the recommendations of the previous decision letter and the additional analyses confirm the findings of the first version. I have no further issues.
https://doi.org/10.7554/eLife.73610.sa1
Author response
Essential revisions:
1) Alternative strategies for sequential task.
The authors claim that some subjects follow the AtD strategy and others the DtA strategy but experimental evidence for this claim seems weak. Take Figure 10 as an example (Figure 6 is similar). The authors conclude from the data in Figure 10 that on the population level there is no significant difference between the models. On an individual subject level, the δ_{LL} values are small (for most subjects  δ_{LL}  < 3) which one could interpret as either model fitting the data equally well.
In order to claim that there are indeed two different strategies in place, it needs to be shown that the data can only be explained by heterogeneous strategies (for example following a methodology as in Stephan et al. Neurimage 2009 and Rigoux et al. Neuroimage 2014).
We thank the reviewers for their suggestion of using Variable Bayesian Analysis (VBA) to assess conclusions regarding heterogeneous strategy use in the population. We investigated the use of this analysis by running our log likelihoods through the VBA toolbox. For the Simultaneous set size 5 and both Sequential data sets, the VBA agreed with our conclusions: of a greater frequency of AtD use in set size 5 and heterogeneous populations for both sequential conditions. However, for Simultaneous set size 2, the VBA suggested that there was a significantly greater frequency of AtD subjects than DtA subjects, in contrast to our finding that 8 subjects were best described by AtD and 9 by DtA (though one was not considered well fit). To try to better understand this discrepancy, we simulated 100 rounds of task performance using the parameters and models that were best fit to data from our subject population. Thus, for this simulated data set we knew the true underlying distribution of parameters and models. We then ran the VBA on data generated from these simulations. For our set size 2 Simultaneous simulations, we found that the VBA consistently overestimated the proportion of AtD subjects compared to the ground truth. We include a summary of these results in Author response image 1. We concluded that VBA was not an effective way to reliably recover population distributions and therefore chose not to include these analyses in our revised manuscript.
Instead, we further analyzed the identifiability of our models (percent of simulations correctly identified by LLR) and the probability of obtaining the observed model-use distributions of our participants. We quantified model identifiability as the percent of simulated data sets for which the correct model had a lower negative log likelihood than the alternative. We now report these results for Simultaneous conditions in Figure 2, figure supplement 1. We then used these identifiability results to run simulations of different underlying model-use proportions. For 10,000 iterations of each sampled underlying model-use proportion (using values that ranged between 0 and 1 in 1/34 steps), we sampled 17 subjects and assigned them each a model (AtD or DtA) according to the given model-use proportion. We then used our model-identifiability results to reverse the model assignments according to our estimated probability of misidentifying each model. To supplement our log-likelihood difference results, we used the resulting percentages to calculate the probability of obtaining our results for different “true” underlying proportions of model prevalence, given average model identifiability. The results, presented in Figures 6b and 10b, support our conclusions that: (1) for the Simultaneous condition, more subjects used AtD for set size 5 versus set size 2; and (2) for the Sequential condition, roughly equal numbers of subjects used each strategy type for the two set sizes. The probability functions are also peaked near the observed distributions, indicating that average model identifiability was high enough to make it unlikely that the observed participant distribution arose by chance through misidentification.
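The resampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name, the identifiability values (0.85 for each model), and the use of independent random draws to assign strategies are all assumptions made for the example. The observed split of 8 AtD subjects out of 17 is taken from the Simultaneous set size 2 result quoted earlier in this response.

```python
import random

def probability_of_observed_split(observed_atd=8, n_subjects=17,
                                  p_identify_atd=0.85, p_identify_dta=0.85,
                                  n_iter=10_000, seed=0):
    """Estimate P(observed AtD count) for each candidate true AtD proportion.

    p_identify_* are hypothetical model-identifiability values (probability
    that a simulated subject using that model is labeled correctly).
    """
    rng = random.Random(seed)
    proportions = [k / 34 for k in range(35)]  # 0 to 1 in steps of 1/34
    probabilities = []
    for p_atd in proportions:
        hits = 0
        for _ in range(n_iter):
            labeled_atd = 0
            for _ in range(n_subjects):
                # Assumption: each subject's true strategy is an independent draw.
                is_atd = rng.random() < p_atd
                p_correct = p_identify_atd if is_atd else p_identify_dta
                if rng.random() >= p_correct:
                    is_atd = not is_atd  # misidentified: reverse the label
                labeled_atd += is_atd
            hits += labeled_atd == observed_atd
        probabilities.append(hits / n_iter)
    return proportions, probabilities
```

A curve of `probabilities` against `proportions` that peaks near the observed proportion would correspond to the pattern described for Figures 6b and 10b.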
Regarding the sequential task: It may be worth considering a mixed strategy model as an alternative because it may explain the data better. Specifically, subjects would follow the DtA strategy until the last stimulus is observed and then switch to the AtD strategy until the end of the delay (i.e., compute the average in the middle of the trial, once all the evidence has been observed).
We have considered the reviewer’s thoughtful suggestion of a DtA model that transitions to AtD when the final stimulus is observed. For set size 2, this suggested model is identical to AtD (because there is only one point to diffuse until the last stimulus is observed) and was thus not investigated further. In contrast, for set size 5, the suggested model is distinct, although not quite as separable from the existing two models as they are from each other. We include a new section (lines 353–367) reporting that when this model is included in the model selection process, 3 former AtD subjects and 3 former DtA subjects are (slightly) better fit by this hybrid. This result suggests that for more complex decision-making scenarios, a more nuanced spectrum of strategies may be in use.
2) Appropriateness of modeling choices.
The A parameter, governing the relationship between the diffusion constant for a single point and the constants for multiple points, seems to be estimated differently in the AtD and DtA models: in AtD, it's estimated using only data from Perceived blocks with set size > 1, and it plays no role in the AtD process (only, instead, in the memory maintenance process during the delay period of Perceived trials); whereas in DtA, it's estimated using data from both the same Perceived blocks ‘and’ the Computed blocks at equivalent set sizes. This raises two concerns.
We regret the confusion and have clarified in the revised manuscript that the A parameter, which describes the relationship between the diffusion constants for Perceived set size 1 and set size N, is fit to the same data for the two models: each time we fit the parameter using trials of all possible delays from Perceived set size 1, set size N, and Computed set size N (revised manuscript lines 155–156).
To further clarify, the main difference between the two models is that: 1) the AtD model enforces relationships between the diffusion constants for Perceived set size 1 and Computed set size N, and thus the A parameter also describes the relationship between the diffusion constants for Perceived set size N and Computed set size N; whereas 2) the DtA model enforces relationships between the diffusion constants for Perceived set size N and Computed set size N, and thus the A parameter also describes the relationship between the diffusion constants for Perceived set size 1 and [Computed set size N] × N (because the Computed diffusion constant is defined in DtA as 1/N^{th} the diffusion constant for Perceived set size N) (lines 107–109). Accordingly, estimates of the A parameter are obtained using all of the aforementioned trials.
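As a compact restatement of the relationships described above, the following sketch writes them out in symbols. It assumes, based on the equations quoted elsewhere in this exchange, that σ_{1}^{2}, σ_{N}^{2}, and σ_{MN}^{2} denote the diffusion constants for Perceived set size 1, Perceived set size N, and Computed set size N, respectively; it is a reading of the text, not a formula copied from the manuscript.

```latex
% Shared by both models: A links Perceived set size 1 to Perceived set size N
\sigma_{N}^{2} = \sigma_{1}^{2}\, N^{A}

% AtD: the stored average diffuses like a single perceived item
\sigma_{MN}^{2} = \sigma_{1}^{2}
\quad\Longrightarrow\quad
\sigma_{N}^{2} = \sigma_{MN}^{2}\, N^{A}

% DtA: N items diffuse and are then averaged
\sigma_{MN}^{2} = \frac{\sigma_{N}^{2}}{N}
\quad\Longrightarrow\quad
N\,\sigma_{MN}^{2} = \sigma_{1}^{2}\, N^{A}
```

Under this reading, the two models predict the same Computed diffusion constant only when A = 1, which matches the later statement that the strategies cannot be distinguished in that case.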
i. Because A parameters in each model are effectively fit to different data, any comparison of the parameter estimates (which is invited by placing them in the same table and by some of the discussion in the text [p.9]) needs to be carefully qualified in the associated text.
See above; we clarified in the text that the A parameters for each model are fit to the same data.
ii. There is an implicit assumption that the A parameter is fixed across Perceived and Computed blocks. However, Perceived trials with set size > 1 require working memory maintenance of a ‘conjunction’ of stimulus features (location and colour), whereas the latter require maintenance (assuming the DtA process is employed) of only a single feature per stimulus (location); thus, it can reasonably be expected that the effect of load may be more severe in Perceived than Computed blocks. It seems that this possibility is not allowed for in the presented model fits.
See below for responses to the specific recommendations.
Recommendations:
a. The above concerns could be addressed by fitting another round of models, this time fitting A in the DtA model using data from only Computed blocks.
We thank the reviewers for this excellent suggestion. We tested and report in the revised manuscript (lines 164–169) that there is no statistically significant difference when comparing best-fitting values of A estimated from Perceived blocks alone versus Computed blocks alone for DtA participants. Therefore, there is no evidence that the memory load (as captured by A) was affected by the requirements of conjunction in the Perceived vs the Computed blocks.
We also now note (Figure 1 legend, Methods: 508–509) that we used a consistent ordering of the colors of the disks to minimize any extra memory load caused by the conjunction of location and color and balance memory load between Perceived and Computed blocks (as measured by A).
b. In addition, A estimates should be compared between fits of the AtD and DtA models (something that is not possible given the fits as currently presented): if there is a systematic difference between the two, this would indicate that A is indeed different in Perceived and Computed blocks and this should be accounted for in the fits.
We now report (Table 2, Table 3) that best-fitting values of A were slightly higher for DtA versus AtD subjects, although the difference was statistically significant (p<0.05) only for set size=5 Simultaneous. It is certainly true that our models do not capture all aspects of each participant’s strategy, but we still believe that the distinction of DtA versus AtD strategies parameterized in this way provides a parsimonious account of the primary strategy classes that they used.
3) Implications of model fits.
The implications of strategy choice could be further illuminated by examining what factors if any (overall accuracy of judgments; magnitude of non-time-dependent model parameters) differentiate AtD and DtA adopters. Further clarification of what differentiates working memory from decision computations on the presented tasks could be achieved by addressing the following questions through further analysis and/or discussion: How should the decision-specific (eta_{MN}) parameters be interpreted in the context of other prominent models of decision-making? How does their magnitude compare to other noise sources? Does this speak to the question of whether the predominant source of noise in decision-making is sensory/motor or memory-related or related to the decision computation itself?
We thank the reviewers for this excellent suggestion. We now include an additional analysis of the factors that differentiate performance and parameters of AtD and DtA adopters. In brief, we found a significant performance difference in terms of the variability of responses for the average of AtD and DtA adopters at set size 2, delay 6, Simultaneous conditions. These comparisons are now reported in lines 172–174 and Table 2 for Simultaneous parameters, 211–216 for Simultaneous variability over conditions, 286–287 and Table 3 for Sequential parameters, and 332–334 for Sequential variability over conditions.
Regarding the broader questions about the decision-making/sensory/motor/memory-related noise sources, we do not feel that our current study design is best suited to address these concerns. Eta_{MN} represents all static (i.e., not diffusion-dependent) noise sources that affected the participants’ decisions. This term thus captures a combination of initial sensory cue location, motor, and averaging (decision) noise. The individual contribution of these different noise sources is difficult to disentangle given our study design. We found that eta_{MN} was typically ~10°, whereas σ^{2}_{MN} was ~3–4 °/second. As a consequence, after ~3 sec the majority of the response variability can be attributed to diffusion rather than the other factors, which we now note in the discussion (lines 403–407). Other studies, such as Drugowitsch et al. 2016, are more specifically designed to disentangle the different components of these non-diffusion-dependent noise sources.
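The arithmetic behind the "~3 sec" statement can be illustrated with a short sketch. It rests on an assumption not spelled out in this response: that the static term and the diffusion term add as variances, so that total response variance is eta_{MN} + σ^{2}_{MN} × delay, with both quantities expressed in matching squared-degree units. The function names and default values (10 and 3.5, the midpoint of the quoted 3–4 range) are illustrative only.

```python
def diffusion_share(delay_s, static_var=10.0, diffusion_rate=3.5):
    """Fraction of total response variance attributable to diffusion.

    Assumes total variance = static_var + diffusion_rate * delay_s.
    """
    diffusion_var = diffusion_rate * delay_s
    return diffusion_var / (static_var + diffusion_var)


def crossover_delay(static_var=10.0, diffusion_rate=3.5):
    """Delay (seconds) at which diffusion variance equals static variance."""
    return static_var / diffusion_rate
```

Under these assumptions, `crossover_delay()` falls just under 3 seconds, and diffusion rates of 3 and 4 bracket it between roughly 2.5 and 3.3 seconds.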
4) Clarity of presentation.
The Results section is difficult to read, and several key aspects of the approach and findings are only clarified during careful reading of the Methods section. Most prominently, there is insufficient explanation of key model predictions that may be counterintuitive for many readers; a lack of clarity around what individual model parameters capture; and confusing elements to how the model fits are presented. We encourage the authors to carefully revise the Results section with this concern in mind.
Specific recommendations:
a. Implications of AtD vs DtA strategy choice:
The fact that, all else being equal, the DtA strategy generates ‘more’ precise behaviour on Computed trials than the AtD model (at least for the parameter range human participants seem to occupy here) is the central feature that differentiates behaviour produced by the two strategies and renders the models identifiable. The authors also seem to take the direction of this effect to be self-evident, as no effort is made to explain to the reader why this pattern emerges. For instance, readers may wonder whether allowing for N = [2, 5] sources of Gaussian noise compared to only 1 source should actually produce ‘more’ variability in behaviour. Now, the averaging over particles that takes place at the culmination of the DtA process counteracts the greater total noise to produce less variability in behavioural reports. But this was far from self-evident, and this key effect should be unpacked.
We apologize for the lack of clarity. We have revised the manuscript substantially to address this issue. We now include a new subsection near the beginning of Results entitled “Diffusing-Particle Framework and Predictions” that provides intuitions, descriptions, and justifications of our modeling choices. We have also added a table that includes a description of the model parameters and relationships (Table 1).
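The averaging effect raised in the reviewers' comment can be illustrated with a small Monte Carlo sketch. It is not the authors' model code: it ignores static (non-diffusion) noise, treats each stored value's displacement after the delay as Gaussian with variance equal to the diffusion constant times the delay, and assumes the set-size scaling σ_{N}^{2} = σ_{1}^{2} × N^{A}. The function name and the default σ_{1}^{2} of 3.5 are illustrative.

```python
import random

def report_variance(strategy, set_size, delay_s, a_exponent,
                    sigma1_sq=3.5, n_trials=20_000, seed=0):
    """Monte Carlo variance of the reported mean after a delay (diffusion only)."""
    rng = random.Random(seed)
    reports = []
    for _ in range(n_trials):
        if strategy == "AtD":
            # Average first: one stored value diffuses like a single item.
            sd = (sigma1_sq * delay_s) ** 0.5
            reports.append(rng.gauss(0.0, sd))
        elif strategy == "DtA":
            # Diffuse first: each of N stored items diffuses with
            # sigma_N^2 = sigma_1^2 * N**A, then the endpoints are averaged.
            sd = (sigma1_sq * set_size ** a_exponent * delay_s) ** 0.5
            endpoints = [rng.gauss(0.0, sd) for _ in range(set_size)]
            reports.append(sum(endpoints) / set_size)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    mean = sum(reports) / n_trials
    return sum((r - mean) ** 2 for r in reports) / n_trials


if __name__ == "__main__":
    for a in (0.5, 1.0):
        atd = report_variance("AtD", 5, 6.0, a)
        dta = report_variance("DtA", 5, 6.0, a)
        print(f"A={a}: AtD variance ~{atd:.1f}, DtA variance ~{dta:.1f}")
```

In this sketch, averaging N independent endpoints divides the per-item variance by N, so DtA reports are less variable than AtD reports whenever A < 1, and the two coincide at A = 1.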
b. Model parameter descriptions:
There seems to be a general lack of clarity around what exactly each model parameter, and in particular different subscripts to different parameters, are supposed to capture. In Figure 2, for example, subscripts N, MN, 1, N(E/L) and MNSeq are all used but only explained in Methods.
We have expanded and clarified our descriptions of the model parameters in the Results section, including descriptions of all of the subscripts. We have also added Table 1 for easy reference.
c. Alternative strategy interpretation. (see also point 3):
Please clarify what exactly the finding is, because this currently seems ambiguous: Compare line 244 "(…) participants had roughly equal tendencies to use either of the two strategies" implying that we can distinguish which strategy individual subjects are following, vs. line 257 "(…) neither of which was more likely than the other for a given participant" which implies the opposite.
We apologize for the confusion. We now make it clear that: 1) we can distinguish the two strategies (except when A=1), and 2) under some conditions the two strategies had roughly equal prevalence across our participants. The ambiguous sentence has been removed.
d. Model fitting procedures (see also point 2):
The role of the A parameter in the different fits is confusing. Specifically, it is confusing to see fits of A for the AtD model, since this parameter does not seem to be used at all in the AtD process. If our understanding is correct, the A parameter in these fits is only relevant to producing behavior in Perceived blocks with set size > 1 – a condition of the experiment to which the AtD process is never actually applied. But at no point is this made explicit, leaving room for quite considerable confusion when the reader encounters this important section of the Results.
Please see our response to point 2 and Specific Recommendations for point 4a. We have made major revisions to the text to make sure the procedure for determining A is clear.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Essential revisions:
1) Please address one outstanding clarification question by Reviewer #1, with which Reviewer #2 agreed. Rather than summarizing, we paste the original reviewer point below. Once this point is addressed, the paper can be accepted without additional review.
I only have one lingering point of confusion that I would welcome clarification on. This again centres around treatment of the A parameter in the AtD model. The authors write in the current manuscript (p.7) that in this model "the average is calculated immediately upon observing the evidence and then stored as a single particle in working memory" (lines 95–96) and then "the single estimate held in working memory diffuses with the same diffusion constant as a single perceived item (σ_{MN}^{2} = σ_{1}^{2})" (lines 97–99). Based on this my understanding is that there is only ever one particle diffusing in the AtD model during Computed blocks, regardless of set size; this particle always has the same diffusion constant (σ_{1}^{2}), and there is, therefore, ‘no role’ for set size/the A parameter in determining diffusion noise during Computed-block memory maintenance in the AtD model. Why, then, is it later written that "Because of the previously described relationships between σ_{1}^{2}, σ_{N}^{2}, and σ_{MN}^{2} it is therefore also true that in the AtD model σ_{N}^{2} = σ_{MN}^{2} * N^{A}" (lines 109–110)? Given the earlier sentences, the only way I can see this being true is if N here refers to the number of ‘particles’ being maintained in memory (which, in the AtD model, is always equal to 1, and so the N^{A} term is doing no work here and just causes considerable confusion) – and not the set size presented to the participant, as N is consistently used to denote elsewhere in the paper. I'm sorry if I'm missing something here, but this seems a key conceptual point to get right for clear presentation and differentiation between the two models. The new Table 1, and my careful reading of the Methods, seems consistent with my own intuition that A plays no role on Computed blocks in the AtD model.
But this seems fundamentally inconsistent with the equation emphasized on lines 109–110; and indeed with the authors' response to point 2 in the first round of reviews, which I must confess I did not understand.
Now, assuming my own interpretation is correct, and that indeed the A parameter is not doing any work on Computed blocks in the AtD model (instead, in this model it only serves to set the diffusion noise across different set sizes in ‘Perceived’ blocks), then I stand by my initial point that without clarification, it is misleading to include and invite comparison of fitted A parameters for the AtD and DtA models in the same table (new Table 2). In one case (AtD) the A parameter only captures (and in turn, will only be constrained by) behaviour in Perceived blocks; in the other (DtA) it captures (and is constrained by) behaviour in both Perceived and Computed blocks. But currently, this is never made explicit.
The reviewers wish us to clarify the role of the A parameter, particularly with regard to the AtD strategy in Computed blocks. We regret the confusion and very much appreciate the careful reading and the opportunity to clarify this important point.
It is true that A is not directly relevant to the execution of the AtD strategy in the Computed condition, because only one value (the average) is remembered. However, A is relevant to determining the best-fitting AtD diffusion constant for a Computed item (σ_{MN}^{2}), because in AtD, A governs the relationship between the diffusion constant of N Perceived items held in memory (σ_{N}^{2}) and one Computed item held in memory (σ_{MN}^{2}): σ_{N}^{2} = σ_{MN}^{2} * N^{A}. Our fitting procedure therefore determined the best-fitting value of A per participant and set size by finding the value that best enforced this relationship between data from Perceived and Computed blocks. These points are now clarified in lines 109–116.
To further clarify the reviewer’s specific comments: (1) in the above relationship, N continues to refer to the set size and number of particles being maintained; and (2) because A was determined from both Perceived and Computed data for both AtD and DtA fits, we believe the comparisons we present are useful.
https://doi.org/10.7554/eLife.73610.sa2
Article and author information
Author details
Funding
National Institute of Mental Health (R01 MH115557)
 Kresimir Josic
 Zachary P Kilpatrick
 Joshua I Gold
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The authors thank Adrian Radillo and Gaia Tavoni for their discussions early in the development of this project, particularly regarding early formulations of the models used here and the task structure.
Ethics
Human subjects: The task was created with PsychoPy3 and distributed to participants via Pavlovia.com, which allowed participants to perform the task on their home computers after providing informed consent. These protocols were reviewed by the University of Pennsylvania Institutional Review Board (IRB) and determined to meet eligibility criteria for IRB review exemption authorized by 45 CFR 46.104, category 2.
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Tobias H Donner, University Medical Center HamburgEppendorf, Germany
Reviewer
 Peter R Murphy, Trinity College Dublin, Ireland
Publication history
 Received: September 4, 2021
 Preprint posted: September 6, 2021
 Accepted: March 10, 2022
 Accepted Manuscript published: March 15, 2022 (version 1)
 Accepted Manuscript updated: March 18, 2022 (version 2)
 Version of Record published: April 12, 2022 (version 3)
Copyright
© 2022, Schapiro et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.