Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.
Editors:
- Reviewing Editor: Simon van Gaal, University of Amsterdam, Amsterdam, Netherlands
- Senior Editor: Michael Frank, Brown University, Providence, United States of America
Reviewer #1 (Public review):
Summary:
This paper examines whether humans use protracted temporal integration in a noise-free, deferred-response contrast discrimination task, using a covert evidence-duration manipulation combined with EEG (SSVEP, CPP, Mu/Beta). The key finding is that evidence for protracted sampling is behaviorally and neurally supported, but even joint CPP + behaviour fitting cannot fully discriminate a standard integration (DDM) model from a novel "extremum-flagging" non-integration model. The paper is transparent about this outcome.
Strengths:
This is a well-conducted and well-written study that makes a genuine contribution to the perceptual decision-making literature by introducing a clean experimental design for probing temporal integration without participants adapting their strategy and demonstrating for the first time that a non-integration model (extremum-flagging) can replicate CPP waveform dynamics that have long been considered hallmarks of evidence accumulation. The transparent treatment of equivocal modelling outcomes is commendable.
Weaknesses:
My main concerns relate to statistical power and the under-specification of the extremum-flagging mechanism. Addressing these would greatly strengthen the paper.
(1) The sample of 16 participants (15, after the exclusion of one participant) is described as "close to similar EEG studies" with no formal power analysis. Given that the paper's core claim rests on subtle quantitative differences between two model classes - differences that are, by the authors' own admission, not sufficient to declare a winner - even a modest increase in sample size might yield a more decisive outcome. At a minimum, the authors should report a sensitivity analysis or post-hoc power calculation to indicate what effect sizes the current N could reliably detect, particularly for the rmANOVA comparisons and the neural constraint fitting.
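For concreteness, a sensitivity calculation of the suggested kind can be run in a few lines. The sketch below approximates the rmANOVA contrasts with a paired t-test and uses conventional (hypothetical) alpha = .05 and power = .80; it returns the smallest standardized within-subject effect detectable with the current N.

```python
# Minimal sensitivity sketch: smallest detectable effect size at N = 15.
# A paired/one-sample t-test is used as a stand-in for the rmANOVA contrasts;
# alpha and target power are conventional choices, not values from the paper.
from statsmodels.stats.power import TTestPower

detectable_dz = TTestPower().solve_power(nobs=15, alpha=0.05, power=0.80,
                                         alternative='two-sided')
print(f"Smallest reliably detectable within-subject effect (Cohen's dz): {detectable_dz:.2f}")
```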
(2) The Extremum-flagging model is the paper's most novel contribution, yet its physiological basis is underspecified. The model posits that each decision-terminating bound-crossing triggers a stereotyped, half-sine-shaped centroparietal signal, but no neural circuit or computational mechanism is proposed for how the brain could detect the first bound-crossing event in a non-accumulating evidence stream or generate a temporally precise, fixed-amplitude signal in response. Possible connections to P3b theories of context updating and response facilitation are acknowledged, but these are vague functional descriptions rather than mechanistic accounts. I think the discussion should engage more directly with potential neural substrates that could generate this flagging signal, and whether these are consistent with the known generators of the CPP/P3b. Without this, the extremum-flagging model risks being viewed as a mathematical convenience rather than a biologically plausible alternative.
(3) The Integration model at the preferred neural weighting estimates a high-to-low contrast drift rate ratio of 8.7, whereas the empirical Mu/Beta lateralization slopes suggest a ratio of approximately 3.5. The authors attribute this discrepancy to the nonlinear contrast response function of early visual cortex and the salience of the high-contrast evidence onset, but these explanations are speculative. This discrepancy is arguably the most quantitatively damaging result for the Integration model, so it deserves more than a brief discussion. I would recommend that the authors (a) estimate what range of contrast response nonlinearities would be required to close this gap, (b) test whether an alternative drift rate parameterization (e.g., scaling drift rates directly by SSVEP amplitude rather than contrast) reduces the discrepancy, or (c) be more explicit about treating this as a point against the Integration account.
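On point (a), a standard divisive contrast response function gives a sense of the compression that would be required; the sketch below uses a Naka-Rushton form with hypothetical parameters and contrast levels (the actual stimulus contrasts are not restated here) purely to show how a large input ratio can map onto a much smaller response ratio.

```python
# Sketch: a saturating (Naka-Rushton) contrast response function compresses
# ratios. Parameters (r_max, c50, n) and contrast levels are hypothetical;
# the point is only that response ratios can be far smaller than the
# corresponding contrast or drift-rate ratios.
import numpy as np

def naka_rushton(c, r_max=1.0, c50=0.15, n=2.0):
    """R(c) = r_max * c^n / (c^n + c50^n)."""
    return r_max * c**n / (c**n + c50**n)

c_low, c_high = 0.2, 0.8                      # illustrative contrast levels
print(f"contrast ratio:  {c_high / c_low:.1f}")
print(f"response ratio:  {naka_rushton(c_high) / naka_rushton(c_low):.1f}")
```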
(4) The sensitivity analysis over neural constraint weightings (w = 0.1 to 1000) is thoughtful, but the paper ultimately acknowledges that the preferred weighting is w=10, chosen because it achieves "a good fit to CPP dynamics without substantively sacrificing behavioral fit" - a qualitative criterion. No principled statistical framework is used to select the optimal weighting or to compare models at a given weighting. A Bayesian model comparison could provide a more formal framework for combining behavioral and neural fit components, and would allow a clearer statement about the relative posterior probability of each model.
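One way to make the comparison at a given weighting formal is to convert the joint fit statistics into approximate posterior model probabilities. The sketch below uses BIC-based weights with made-up log-likelihoods and parameter counts, purely to illustrate the form such a statement would take; a full Bayesian treatment of the combined behavioural and neural likelihood would of course be preferable.

```python
# Sketch: approximate posterior model probabilities from BIC weights.
# The log-likelihoods, parameter counts, and number of observations are
# placeholders, not values from the paper.
import numpy as np

def bic(loglik, n_params, n_obs):
    return -2 * loglik + n_params * np.log(n_obs)

n_obs = 2000                                              # hypothetical
models = {"Integration": (-1050.0, 6), "Extremum-flagging": (-1052.0, 7)}
bics = {m: bic(ll, k, n_obs) for m, (ll, k) in models.items()}
best = min(bics.values())
weights = {m: np.exp(-0.5 * (b - best)) for m, b in bics.items()}
total = sum(weights.values())
for m, w in weights.items():
    print(f"{m}: approximate posterior probability {w / total:.2f}")
```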
Reviewer #2 (Public review):
Summary:
The manuscript by Hajimohammadi, Mohr, O'Connell and Kelly is intended to demonstrate that participants integrate evidence over time to make a decision, even in a noise-free, static decision context. This is validated by the observations that (1) participant accuracy improves with increased exposure to the stimulus; and (2) there is a correlation between participant accuracy and a neural index of evidence accumulation, as measured by the centro-parietal positivity (CPP).
Strengths:
(1) Joint modelling of accuracy and CPP dynamics is a significant achievement, as behaviour alone often cannot distinguish between competing theories of decision-making. In the case of protracted sampling in particular, the absence of reaction times (RT) due to the delayed nature of the response makes this method highly appealing.
(2) The experimental manipulations and the method used to extract the different neural indices are well chosen, enabling the mapping of putative cognitive processes such as evidence accumulation and motor preparation onto the recorded EEG with clarity.
(3) The in-depth discussion of the results clearly articulates those reported by the authors and in previous works.
Weaknesses:
(1) One main issue for the authors' interpretation in favour of protracted sampling is the timing of the evidence. By design, participants believe that the signal is present for 1.6 seconds (reinforced by the fact that easy trials were displayed for 1.6 seconds). However, the contrast difference is turned off either 1.4, 1.2, 0.8 or 0 seconds before the cue to respond. While this makes sense in the context of the authors' question, it also raises the possibility that participants focus on the last samples before answering. Even if participants apply equal weighting, this still favours delaying evidence accumulation until they are sufficiently certain that the evidence is present (e.g. participants might start accumulating only after the stimulus has already disappeared in the 0.2 s condition). I do not see an easy way to test these alternative explanations other than running a study in which the evidence always offsets before the go cue.
(2) Regarding the behavioural models, are these identifiable based on accuracy data alone? This should be addressed using a parameter recovery study, in which a set of parameters is used to generate data, and the same fitting routine used for the real data is used to estimate those parameters. This would enable us to determine what can be inferred from the model comparison presented. This is not a serious problem for the manuscript, as it specifically aims to go beyond behaviour. It is, however, worth noting that such a parameter recovery addition could be used to demonstrate the need for a joint modelling framework to answer the question of protracted sampling in a delayed-response design.
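A self-contained illustration of such a recovery exercise is sketched below. It uses a deliberately simplified stand-in model (accuracy = Phi(drift * sqrt(duration)) plus a lapse rate) with made-up trial counts; in practice the authors' own simulation and fitting routines would be substituted.

```python
# Parameter-recovery sketch with a simplified stand-in accuracy model.
# The model form, parameter ranges, and trial counts are illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

durations = np.array([0.2, 0.4, 0.8, 1.6])   # evidence durations (s), as in the task
n_trials = 96                                 # hypothetical trials per duration

def p_correct(params, t):
    drift, lapse = params
    return lapse * 0.5 + (1 - lapse) * norm.cdf(drift * np.sqrt(t))

def neg_loglik(params, k_correct):
    p = np.clip(p_correct(params, durations), 1e-6, 1 - 1e-6)
    return -np.sum(k_correct * np.log(p) + (n_trials - k_correct) * np.log(1 - p))

rng = np.random.default_rng(0)
generated, recovered = [], []
for _ in range(100):
    true = np.array([rng.uniform(0.5, 2.5), rng.uniform(0.0, 0.1)])
    k = rng.binomial(n_trials, p_correct(true, durations))
    fit = minimize(neg_loglik, x0=[1.0, 0.05], args=(k,),
                   bounds=[(0.01, 5.0), (0.0, 0.5)])
    generated.append(true)
    recovered.append(fit.x)

generated, recovered = np.array(generated), np.array(recovered)
for i, name in enumerate(["drift", "lapse"]):
    r = np.corrcoef(generated[:, i], recovered[:, i])[0, 1]
    print(f"{name}: recovery correlation r = {r:.2f}")
```

Recovery would be considered adequate if generated and recovered values correlate strongly and show no systematic bias for each parameter.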
Minor comments:
(1) I would advise authors to fix the D1 parameter and use it as a scaling parameter across all models. Currently, as I understand it, the models are scale-free, meaning the same fit is achieved by multiplying all parameters by two, for example. This makes the fit more complex (bounds on parameter values are required) and means that the models are less comparable in terms of their estimates. Perhaps I'm missing something, but I would have thought that fixing D1 (the common parameter across all models) would solve these issues.
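The scale-invariance point is easy to verify by simulation: in the sketch below (a basic Euler-discretized diffusion with made-up parameter values), multiplying drift, bound, and diffusion noise by a common factor leaves predicted accuracy unchanged, which is exactly why fixing one parameter such as D1 would pin down the scale.

```python
# Sketch: scale invariance of a simple diffusion model. Scaling drift, bound,
# and diffusion noise by the same factor leaves choice probabilities (and
# first-passage times) unchanged, so one parameter must be fixed to anchor the
# scale. All parameter values here are made up.
import numpy as np

def simulate_accuracy(drift, bound, noise, n_trials=5000, dt=0.005, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)                     # start midway between the bounds
    done = np.zeros(n_trials, dtype=bool)
    correct = np.zeros(n_trials, dtype=bool)
    for _ in range(int(4.0 / dt)):             # simulate up to 4 s
        active = ~done
        x[active] += drift * dt + noise * np.sqrt(dt) * rng.standard_normal(active.sum())
        hit_upper = active & (x >= bound)
        hit_lower = active & (x <= -bound)
        correct[hit_upper] = True
        done |= hit_upper | hit_lower
    return correct[done].mean()

print(simulate_accuracy(drift=0.8, bound=1.0, noise=1.0))   # baseline
print(simulate_accuracy(drift=1.6, bound=2.0, noise=2.0))   # all parameters doubled
```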
(2) Why is the snapshot model so poor here, despite performing well in Stine et al. (2020)? Could the authors speculate on this in the discussion?
(3) The meaning of the flag width is unclear. Figure 4 provides the reader with an intuitive understanding of the model that the authors have in mind. However, the tables in the appendices report values between 0.2 and 0.9. I understand that these values represent the width of the half-sine in seconds. This suggests that the actual estimated values for these flag events are much broader than those displayed in Figure 4. While this is probably fine for most models, it can be problematic for the extremum-flagging model, as it means that the rise to the peak takes between 0.1 and 0.45 seconds. While strictly speaking, this is still a 'flag' model, such a slow rise to the peak, given the usual expectation of evidence accumulation, would place this model closer to a smooth integration model than to a boundary-crossing flagging mechanism.
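For reference, the relation between the fitted flag width and the rise time described here follows directly from the half-sine form; the short sketch below (with an arbitrary amplitude) generates the pulse and confirms that the peak arrives at half the width.

```python
# Sketch of a half-sine 'flag': f(t) = A * sin(pi * t / w) for 0 <= t <= w,
# so the peak occurs at t = w / 2. Amplitude is arbitrary; the widths span the
# range reported in the appendix tables (0.2 to 0.9 s).
import numpy as np

def half_sine_flag(width, amplitude=1.0, fs=500):
    t = np.arange(0.0, width, 1.0 / fs)
    return t, amplitude * np.sin(np.pi * t / width)

for w in (0.2, 0.9):
    t, y = half_sine_flag(w)
    print(f"width {w:.1f} s -> time to peak {t[np.argmax(y)]:.2f} s")
```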
(4) In the modelling section, it is not clear overall (i.e. for G² and R²) how the participant dimension is taken into account. Are these individually fitted models, and if so, how are the secondary statistics generated from the individual estimates? Or were these fitted over all participants?
(5) On page 7, in the last sentence of the first paragraph of the section titled 'Decision-Related Neural Signals', the authors state that 'this stable contrast-difference encoding suggests that a constant (i.e. non-adapting) drift rate is a reasonable simplifying model assumption'. However, I am not sure how this follows, given that the SSVEP quantifies sensory encoding whereas the drift rate can also vary with non-sensory factors (e.g. attention).
(6) The mu/beta lateralisation does indeed favour the integration model more, but in terms of the boundary estimation and starting-point analyses, the two models are pretty far apart. Providing an interpretation of this observation, e.g. regarding alternative linking functions for mu/beta, would add to the manuscript.
Reviewer #3 (Public review):
Summary:
The authors aim to compare proposed models of perceptual decision making using a joint modeling approach, where they fit models to both behavioral outcomes and the CPP. Most notably, they compare a standard evidence accumulation model with models that track the evidence without integrating it over time (extrema detection). The authors report that the joint CPP-behavioral data do not discriminate between two of their proposals.
Strengths:
This is an interesting finding that reinforces the idea that what we infer from aggregation over trials may not be what happens on every single trial. The models are creative, and the simulations are convincing, relating the models to multiple neural markers of decision formation. These include not only the CPP but also mu/beta spectral power.
Weaknesses:
The paper makes some strong points, and the work seems generally well-executed. The weaknesses that I identified are twofold:
(1) Embedding in the literature/exposition of the main argument.
The focus in the introduction is on the noise-free nature of the stimulus and the prolonged presentation time. However, after reading the paper, I felt these were mostly experimental design choices that enable comparison of the different models using the CPP. Perhaps my misreading of the goals of the paper stems from two other observations:
a) The fact that the stimulus is noise-free does not entail that perception is noise-free. Thus, the argument that using a noise-free stimulus precludes the necessity of temporal integration seems not completely valid. Of course, one could argue that noise is limited in this case, but that makes a noise-free stimulus more of a design choice.
b) The focus on prolonged stimulus presentation, but at the same time the contrast with expanded judgement, did not make sense to me. Perhaps, as a non-native speaker, I am misreading the subtle difference between "protracted sampling" and "longer sampling", but again, the longer duration seems mostly a design choice.
More could be said about the optimality of the extrema detection mechanisms. In particular, decades (centuries?) of work have shown that evidence integration is an optimal decision-making procedure: for example, the Sequential Probability Ratio Test minimizes mean decision time for a given error rate (Wald, 1946), and evidence accumulation with a collapsing threshold serves to maximize reward in repeated choices (e.g., Bogacz et al., Psych Rev, 2006; Boehm et al., AP&P, 2020). Given all this work, why would the brain have evolved to adopt a different mechanism? I realize that the paper is not about optimal decision making, but some discussion of this point seems warranted.
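As a concrete reference point for this optimality argument, a minimal SPRT takes only a few lines; the sketch below (discriminating two hypothetical Gaussian means, with illustrative error rates) accumulates log-likelihood ratios until a Wald threshold is crossed, i.e. exactly the kind of integration that the extrema-detection schemes dispense with.

```python
# Minimal SPRT sketch: accumulate log-likelihood ratios until a Wald threshold
# is crossed. Gaussian means, noise SD, and error rates are illustrative only.
import numpy as np
from scipy.stats import norm

def sprt(samples, mu0=0.0, mu1=0.5, sigma=1.0, alpha=0.05, beta=0.05):
    upper = np.log((1 - beta) / alpha)     # decide H1 above this
    lower = np.log(beta / (1 - alpha))     # decide H0 below this
    llr = 0.0
    for n, x in enumerate(samples, start=1):
        llr += norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", len(samples)

rng = np.random.default_rng(0)
print(sprt(rng.normal(0.5, 1.0, size=200)))   # data generated under H1
```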
(2) Modeling choices.
The authors introduce a parameter, sampT, that represents uncertainty in the sampling onset time. It was not clear to me whether this parameter represents an offset applied to all trials, or a trial-by-trial distribution (probably the latter). I wonder how exactly this parameter was integrated into the models, and in particular, whether and how it interacts with the starting-point parameters. My intuition is that on a single trial, IF early sampling occurs, it can be modelled either with a negative sampT and z at 0, or with sampT at 0 but a shift in z. This would suggest trade-offs between these parameters, making them hard to estimate independently. Since the paper does not depend on the identification of parameter estimates, this may not be a huge problem, but it would nevertheless be good to explore the consequences.
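To make the suspected trade-off concrete: if sampling begins some interval before evidence onset, the accumulator collects only noise over that interval, so its state at evidence onset is distributed like a (variable) starting point. The sketch below, using a simple unbounded constant-drift accumulator with made-up values, shows the two parameterizations producing matching end-of-trial distributions.

```python
# Sketch of the suspected sampT / starting-point trade-off. Pre-onset sampling
# adds only noise, so a negative sampT with z = 0 is distributionally
# equivalent to sampT = 0 with starting-point variability of matched variance.
# A simple unbounded constant-drift accumulator is used; values are made up.
import numpy as np

rng = np.random.default_rng(0)
drift, noise, T, dt_early, n = 1.0, 1.0, 0.8, 0.2, 100_000

# (a) negative sampT: pure-noise accumulation before onset, z fixed at 0
z_a = noise * np.sqrt(dt_early) * rng.standard_normal(n)
# (b) sampT = 0, starting point drawn with matched variance
z_b = rng.normal(0.0, noise * np.sqrt(dt_early), n)

for label, z in (("negative sampT", z_a), ("variable z", z_b)):
    end_state = z + drift * T + noise * np.sqrt(T) * rng.standard_normal(n)
    print(f"{label}: P(correct) = {(end_state > 0).mean():.3f}")
```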
The way the Bounded Integration model (BIntg) is formulated seems very close to the EZ-diffusion model (Wagenmakers et al., PBR, 2007). This model states that the proportion of correct responses is Pc = 1/(1 + exp(-B*D/s^2)), with B and D the bound and drift rate parameters, respectively. However, filling in the numbers for the high contrast condition from Table 2, and assuming that s = 2 (because the model description states that dt = 2, with s undefined), I get a Pc of 80% for the 1.6H condition. This seems substantially less than what Figure 2 suggests.
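For reference, the quoted expression is easy to evaluate; the snippet below implements Pc = 1/(1 + exp(-B*D/s^2)) with placeholder bound, drift, and scaling values standing in for the Table 2 entries (which are not reproduced here).

```python
# EZ-diffusion accuracy expression cited above (Wagenmakers et al., 2007).
# The bound B, drift D, and scaling parameter s are placeholders, not the
# values from Table 2 of the manuscript.
import numpy as np

def ez_p_correct(B, D, s):
    return 1.0 / (1.0 + np.exp(-B * D / s**2))

print(ez_p_correct(B=1.0, D=2.0, s=2.0))   # made-up numbers -> Pc ~ 0.62
```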
On some occasions, it is unclear to me what modeling choices are being made:
a) It seems as if the models are fit on accuracy data alone (before introducing the neural data). This seems suboptimal given that the authors do report differences in RT.
b) Are the models fit on all data combined, or on the data of individual participants? Fitting individual participant data is preferred, as combined or aggregated data may be distorted by individual differences.
c) The authors seem to suggest that the diffusion coefficient s is estimated (in the section "Integration models"). Most likely, however, this is set to a fixed value. Obviously, it matters for the model comparison using AIC whether this parameter was freely estimated or not.
Not really a weakness, but I wondered about the effect of stimulus duration on RT. In particular, what hypothesis (or post hoc explanation) do the authors have for these RT effects? I could think of at least three hypotheses that are consistent with the behavioral data:
a) H1: The shorter the evidence duration, the more likely participants are to require a double-check before response execution, reflecting their uncertainty about their decision.
b) H2: There is a collapsing threshold that initiates at stimulus offset, leading to quicker responses on trials where there is more evidence.
c) H3: Motor preparation is correlated with the evidence signal, which leads to faster responses on trials with more evidence.