Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife's peer review process.

Editors
- Reviewing Editor: Tobias Donner, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Senior Editor: Joshua Gold, University of Pennsylvania, Philadelphia, United States of America
Reviewer #1 (Public review):
This work presents data from three species (mice, rats, and humans) performing an evidence accumulation task that has been designed to be as similar as possible across species (and is based on a solid foundation of previous work on decision-making). The tasks are well designed, and the analyses are solid and clearly presented, showing that the overall parameters of the decision-making process differ between species. This is valuable to neuroscientists who aim to translate behavioral and neuroscientific findings from rodents to humans, and it offers a word of caution for the field against readily claiming that behavioral strategies and computations are representative of all mammals. The dataset would be of great interest to the community and may be a source of further modelling of across-species behavior, but unfortunately, neither data nor code are currently shared.
A few other questions remain that make the conclusions of the paper somewhat hard to assess:
(1) The main weakness is that the authors claim that all species rely on evidence accumulation as a strategy, but this is not tested against alternative models (see e.g. Stine et al., https://elifesciences.org/articles/55365): the fact that the DDM fits rather well does not mean that evidence accumulation is the strategy each species was actually using.
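To make this point concrete, the sketch below (not the authors' pipeline; all parameter values are placeholders) simulates choices on pulse-based stimuli under full evidence integration and under an extrema-detection rule that commits on the first sufficiently extreme sample. Both produce well-above-chance choices, so an explicit trial-by-trial model comparison is needed to tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 5000, 20
noise_sd, extrema_bound = 1.5, 2.5

# Signed per-sample evidence: +1 flash on the correct side with probability 0.8.
evidence = np.where(rng.random((n_trials, n_samples)) < 0.8, 1.0, -1.0)
noisy = evidence + rng.normal(0, noise_sd, evidence.shape)

# (a) Integration: choose the sign of the summed noisy evidence.
choice_integration = np.sign(noisy.sum(axis=1))

# (b) Extrema detection: commit on the first sample exceeding +/- bound,
#     falling back to the sign of the last sample if none does.
def extrema_choice(samples, bound):
    hits = np.abs(samples) >= bound
    first = hits.argmax(axis=1)                  # index of first extreme sample (0 if none)
    picked = samples[np.arange(len(samples)), first]
    fallback = samples[:, -1]
    return np.sign(np.where(hits.any(axis=1), picked, fallback))

choice_extrema = extrema_choice(noisy, extrema_bound)

print("accuracy, integration      :", (choice_integration > 0).mean())
print("accuracy, extrema detection:", (choice_extrema > 0).mean())
# Both rules perform well above chance; separating them requires fitting both
# to the observed per-trial flash sequences and comparing their likelihoods.
```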
(2) In all main analyses, it is unclear what the effect of the generative flash rate is and how it was calibrated between species. Only in Figure 6C do we see basic psychometric functions, but these should presumably also feature as a crucial variable governing the accuracy and RTs (chronometric functions) across species. The very easy trials are useful to constrain the basic sensorimotor differences that may account for RT variability; for instance, the smaller body of mice may require them to move a relatively longer distance to trigger the response.
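As an illustration of the descriptive curves that would be useful for every species, here is a minimal sketch (on synthetic placeholder trials, not the authors' data) that bins trials by the empirically delivered flash difference and computes psychometric and chronometric functions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
n_right = rng.binomial(20, 0.5, n)                  # placeholder flash counts
evidence = n_right - (20 - n_right)                 # signed flash difference
p_right = 1 / (1 + np.exp(-0.3 * evidence))         # placeholder observer
choice_right = rng.random(n) < p_right
correct = choice_right == (evidence > 0)
rt = 0.4 + 0.5 * np.exp(-0.1 * np.abs(evidence)) + rng.normal(0, 0.05, n)

trials = pd.DataFrame(dict(evidence=evidence, choice_right=choice_right,
                           correct=correct, rt=rt))
curves = trials.groupby("evidence").agg(
    p_right=("choice_right", "mean"),               # psychometric function
    accuracy=("correct", "mean"),
    mean_rt=("rt", "mean"),                         # chronometric function
    n_trials=("rt", "size"))
print(curves)
```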
(3) The GLM-HMM results (that mice are not engaged in all trials) are very important, but they imply that the mouse DDM fits might be more similar to those of rats and humans if performed only on engaged trials. Could it be that the main species differences are driven by differences in engagement-state occupancy?
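A minimal sketch of this control analysis (the state labels and the fitting routine are hypothetical placeholders for the authors' own GLM-HMM posteriors and DDM code):

```python
import numpy as np

def engaged_trial_mask(state_posteriors, engaged_state, threshold=0.8):
    """state_posteriors: (n_trials, n_states) posteriors from the fitted GLM-HMM.
    Returns a boolean mask selecting trials confidently assigned to the engaged state."""
    return state_posteriors[:, engaged_state] >= threshold

# mask = engaged_trial_mask(posteriors, engaged_state=0)    # hypothetical arrays
# engaged_fit = fit_ddm(choices[mask], rts[mask])           # same DDM routine as in the paper
# Comparing engaged_fit with the all-trials fit would show whether the species
# differences shrink once disengaged trials are excluded.
```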
(4) It would be very helpful if the authors could present a comprehensive overview (perhaps a table) of the factors that may be relevant for explaining the observed species differences. This may include contextual/experimental variables (age range (adolescent humans vs. mice/rats, see https://www.jax.org/news-and-insights/jax-blog/2017/november/when-are-mice-considered-old), reward source, etc.) and also outcomes (e.g. training time required to learn the task, number of trials per session and in total).
Reviewer #2 (Public review):
Summary:
Chakravarty et al. propose a 'synchronized framework' for studying perceptual decision-making (DM) across species, namely humans, rats, and mice. Although all species shared hallmarks of evidence accumulation, the results highlighted species-specific differences: humans were the slowest and most accurate, rats optimized the speed-accuracy tradeoff to maximize reward rate, and mice were the fastest but least accurate. In addition, while humans were better fit by a classic DDM with fixed bounds, rodents were better fit by a DDM with collapsing bounds. While comparing behavioral strategies in evidence accumulation tasks across species is an important and timely question, some of the reported differences across species lack a clear interpretation and could simply be caused by differences in task design. Important information and analyses about the DDM and the other models used are missing, which lowers confidence in, and enthusiasm for, the results.
Strengths:
The comparison of behavior across species, including humans and commonly used laboratory species like rats and mice, is a fundamental step in neuroscience to establish more informed links between animal experiments and human cognition. In this work, Chakravarty et al. analyze and model the behavior of three species during the same evidence accumulation task. They draw conclusions about the different strategies used in each case.
Weaknesses:
Novelty:
While quite relevant, some parts of the work presented are more novel than others. That evidence accumulation (EA) drives choice behavior and that these choices can be described with a DDM has been shown before (see e.g. Kane et al. 2023; Brunton et al. 2013; Pinto et al. 2018). The novelty here lies mostly in the comparison of three species on the same task and in fitting the same exact model (a close quantitative comparison of behavioral strategies). However, some of the differences lack a clear interpretation. For instance, the values of some of the fitted DDM parameters are not ordered across the three species "as expected" (e.g. non-decision time or DDM BIC). Other comparison results lack an explanation entirely (e.g. rats' RTs are near-optimal while humans' and mice's are not). The aspect I found most novel and exciting is the application of HMMs to each of the species. However, this part comes at the end of the paper and has not been developed with sufficient depth; there is almost no explanation of the results. I would suggest the authors bring this part forward and move back other aspects which are, in my opinion, less novel or interpretable (e.g. the results around the optimality of RTs).
Task design:
Since there is no fixation, the response time (RT) reflects both the evidence integration time and the motor time (stimuli are played until a response is given). This design makes it hard to compare RTs between species. While humans just had to press a button, rodents had to move their whole bodies from a central port to a side port. When comparing rats and mice, their difference in body size relative to port distance could explain different RTs. This could, for example, explain the large difference in non-decision time (ndt) between mice and rats in Figure 3F. Are the dimensions of the rat and mouse boxes comparable? The authors should address this difference more openly and discuss its implications when interpreting the results. The Methods should also provide the distance between ports for each species. I also strongly recommend including a few videos of rats and mice performing the task, to give a sense of the movements involved for each species.
(1) DDM
Goodness of fit:
The authors conclude that the three species use an accumulation-of-evidence strategy because they can fit a DDM. However, there is little information about the goodness of these fits. They only show the RT distributions for one example subject (too small to judge whether the fit to the histograms is good or not). We suggest they make a figure showing in more detail the match of the RT distributions across subjects (e.g. comparing RT quartiles for data and model for the entire group of subjects). They also provide the BIC, which is a measure that depends on the number of trials. Were the numbers of trials matched across subjects/species? Could the authors provide a measure independent of the number of trials (e.g. cross-validated log-likelihood per trial)? Moreover, is this BIC computed only on the RTs, only on the choices, or on both?
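To be concrete, here is a sketch of the two checks we are asking for; `fit_fn` and `loglik_fn` are hypothetical stand-ins for the authors' own DDM fitting and likelihood code:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_loglik_per_trial(rts, choices, fit_fn, loglik_fn, n_folds=5):
    """Mean held-out log-likelihood per trial: independent of trial count, unlike the BIC."""
    scores = []
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(rts):
        params = fit_fn(rts[train], choices[train])
        scores.append(loglik_fn(rts[test], choices[test], params).mean())
    return float(np.mean(scores))

def rt_quantile_table(rts_data, rts_model, qs=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Side-by-side RT quantiles for observed and model-simulated trials,
    to be reported per subject (and, ideally, per choice)."""
    return np.column_stack([np.quantile(rts_data, qs), np.quantile(rts_model, qs)])
```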
Overparameterization:
The authors chose to include as DDM parameters the variability of the initial offset, the variability in non-decision time, and the variability of the drift rate. Having so many parameters with just one stimulus condition (an 80:20 ratio of flashes) may lead to identifiability problems, as recognized previously (e.g. see M. Jones (2021), osf.io/preprints/psyarxiv/gja3u). Their parameter recovery analysis (Supplementary Figure 3) shows that at least two of these variability parameters cannot be recovered. I also could not find the fitted values of these parameters. I therefore wonder to what extent adding these parameters improves the fits and is necessary at all.
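One way to answer this directly would be an ablation of the kind sketched below, where the variability parameters are fixed to zero in a reduced fit and the two fits are compared with a trial-count-independent metric (the cross-validated likelihood function is again a hypothetical stand-in for the authors' code):

```python
def variability_params_gain(rts, choices, cv_loglik_fn):
    """Difference in mean held-out log-likelihood per trial between the full DDM
    and a reduced DDM with all three variability parameters fixed to zero."""
    full = cv_loglik_fn(rts, choices, fixed={})
    reduced = cv_loglik_fn(rts, choices, fixed={"sv": 0.0, "sz": 0.0, "st": 0.0})
    return full - reduced   # values near zero would mean the extra parameters add little
```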
Tachometric curves:
The authors show increasing tachometric curves (i.e. accuracy vs RT) and use this finding as proof of accumulation. They fit these curves using a GAAM with little justification or detail (in fact, the GAAM seems to over-fit the data a bit). The authors do not mention, however, that the other model used, i.e. the DDM, may not reproduce these increasing tachometric curves, because in its basic form the DDM predicts flat tachometric curves. Does the DDM fitted to the individual RT and choice data capture the monotonic increase observed in the tachometric curves?
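The check we are proposing could look like the sketch below: simulate the fitted DDM and compute its conditional-accuracy (tachometric) curve for comparison with the data. The parameter values here are placeholders, not the authors' fits.

```python
import numpy as np

def simulate_ddm(n_trials, drift, bound, ndt, dt=0.005, noise=1.0, seed=0):
    """Euler simulation of a fixed-bound DDM; returns RTs and correctness."""
    rng = np.random.default_rng(seed)
    rts, correct = np.empty(n_trials), np.empty(n_trials, dtype=bool)
    for i in range(n_trials):
        x, t = 0.0, 0.0
        while abs(x) < bound:
            x += drift * dt + noise * np.sqrt(dt) * rng.normal()
            t += dt
        rts[i], correct[i] = t + ndt, x > 0
    return rts, correct

def tachometric(rts, correct, n_bins=10):
    """Accuracy as a function of RT, using RT-quantile bins."""
    edges = np.quantile(rts, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(rts, edges[1:-1])
    return np.array([correct[bins == b].mean() for b in range(n_bins)])

rts, correct = simulate_ddm(5000, drift=1.0, bound=1.0, ndt=0.3)
print(np.round(tachometric(rts, correct), 2))
# A fixed-bound DDM without drift-rate variability yields a roughly flat curve,
# whereas the data show accuracy increasing with RT.
```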
Correct vs Error trials:
Along similar lines, the authors do not test the fitted DDM separately on correct vs error trials, a classic distinction that most DDMs cannot capture. It would be good to know whether: (1) the RTs of correct vs error responses are similar in the data (quantified as in Figure 2B, because this is not clear from 2E), and (2) the same trend between correct and error RTs is observed in the fitted DDMs.
Urgency model:
It is not clear how the urgency model used works. The authors cite Ditterich (2006), but in that paper, the urgency signal was applied to a race model with two decision variables: the urgency signal "accelerated" both DVs equally and sped up the race without favoring one DV versus the other. In a one-dimensional DDM, it is not clear where the urgency is applied. We assume it is applied in the direction of the stimulus, but then it is unclear how the urgency knows about the stimulus, which is what the DDM is trying to estimate in the first place. The authors should explain this model in greater detail and try to resolve this question.
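To make the question concrete, here are two ways an urgency term could enter a one-dimensional DDM; they are not equivalent, and only the second is stimulus-independent. This is only an illustrative sketch with placeholder parameters, not the authors' implementation.

```python
import numpy as np

def ddm_with_urgency(n_trials, drift, bound, urgency_slope, variant, dt=0.005, seed=0):
    """variant='drift': urgency added to the drift toward the correct bound
    (requires knowing the correct side). variant='bound': stimulus-independent
    collapsing bound."""
    rng = np.random.default_rng(seed)
    rts, correct = np.empty(n_trials), np.empty(n_trials, dtype=bool)
    for i in range(n_trials):
        x, t = 0.0, 0.0
        while True:
            if variant == "drift":
                eff_drift, eff_bound = drift + urgency_slope * t, bound
            else:
                eff_drift, eff_bound = drift, max(bound - urgency_slope * t, 0.05)
            x += eff_drift * dt + np.sqrt(dt) * rng.normal()
            t += dt
            if abs(x) >= eff_bound:
                break
        rts[i], correct[i] = t, x > 0
    return rts, correct

for variant in ("drift", "bound"):
    rts, correct = ddm_with_urgency(2000, drift=0.8, bound=1.2, urgency_slope=0.5,
                                    variant=variant)
    print(variant, round(correct.mean(), 2), round(rts.mean(), 2))
```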
Despite finding differences between species, the analyses seem mostly exploratory instead of hypothesis-driven. There is little justification for why differences in some DDM parameters across species would be expected.
(2) GLM and HMM
The GLM fits show nicely that humans, rats, and mice weigh the total provided evidence differently (Figures 6C-D). This may be because the internal noise in the accumulation of evidence is higher, but it could also simply be because animals do not weigh evidence that is presented once they are already moving towards the side ports. A parsimonious alternative to the "noisier species" account is simply that they only consider the first part of the stimulus. Extending the GLM to capture the differential weighting of each sequential sample (the so-called psychophysical kernel, PK) should be straightforward and would provide a fairer comparison between species (i.e. perhaps the slopes of the psychometric curves are not that different once evidence is weighted, for each species, with its corresponding PK).
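A minimal sketch of the extended GLM we are suggesting, fit here to synthetic placeholder data (the true kernel and all settings are illustrative): logistic regression of choice on the signed evidence in each successive time bin recovers the per-sample weights (the psychophysical kernel) together with a signed bias term.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_trials, n_bins = 3000, 10
# evidence[t, k]: signed flash difference (right minus left) in time bin k of trial t
evidence = rng.normal(0, 1, (n_trials, n_bins))
true_kernel = np.exp(-np.arange(n_bins) / 3.0)        # e.g. early evidence weighted more
p_right = 1 / (1 + np.exp(-(evidence @ true_kernel)))
choice_right = (rng.random(n_trials) < p_right).astype(int)

glm = LogisticRegression(C=1e6).fit(evidence, choice_right)   # effectively unregularized
kernel = glm.coef_.ravel()          # psychophysical kernel: weight of each time bin
bias = glm.intercept_[0]            # signed side bias
print(np.round(kernel, 2), round(bias, 2))
```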
Choice Bias:
Panel 3G (DDM starting point) shows that both rats and mice are slightly but systematically biased to the left (x0 < 0.5). Panel 6D ("Bias") seems to show the absolute value of the GLM bias parameter. It would be nice to (i) show the signed GLM bias parameter, (ii) check whether the biases estimated by the DDM and the GLM are comparable across species and subjects (from the GLM they look comparable in magnitude across species, whereas in the DDM they were not, with mice having a much larger |x0|), and (iii) explain (or at least comment on) why the animals show a systematic bias to one side.
Reviewer #3 (Public review):
Summary:
This study directly compares decision-making strategies across three species: humans, rats, and mice. Based on a new behavioral task that is largely shared across species, specific features of evidence accumulation could be quantified and compared between species. The authors argue that their work provides a framework for studying decision-making across species with the same decision models. They report specific features of decision-making strategies, such as humans having a larger decision threshold leading to more accurate responses, and rodents deciding under time pressure.
Strengths:
The behavioral task is set up in similar, comparable ways across species, allowing the same decision models to be employed and specific features of decision behavior to be compared directly. This approach is compelling, since it is otherwise challenging to compare behavior between species. The data analysis is solid and not only quantifies features of classic drift-diffusion models but also additional commonly applied behavioral models and features, such as win-stay/lose-shift strategies, reward-maximization behavior, and slow, latent changes in behavioral strategy. This approach reveals some interesting species differences, which are a starting point for investigating species-specific decision strategies more deeply and could inform a broad set of past and future behavioral studies commonly used in cognitive science and neuroscience.
Weaknesses:
(1) The choice of stimulus difficulty is unclear, as using a single, specific evidence strength (80:20) could limit model-fitting performance and the interpretation of psychometric curves. This could also limit conclusions about species differences, since perceptual sensitivity seems quite different between species; thus, the 80:20 ratio lies at a different uncertainty level for each species, and uncertainty is known to influence behavioral strategies. This might be addressed by exploiting the distribution of actually delivered flashes, but it remained unclear to me to what degree this was done. Previous perceptual discrimination studies typically sample multiple evidence levels to differentiate the sources of variability in choice behavior.
(2) The authors argue that their task is novel and provides a framework to investigate perceptual decision-making. However, very similar, and potentially more powerful, perceptual decision-making tasks (e.g., using several evidence-strength levels) have been used in humans, non-human primates, rats, mice, and other species. In some instances, analogous behavioral tasks, including studies using the same sensory stimulus, have been used across multiple species. While these may have been published in different papers, they have in some instances been conducted by the same lab and using the same analyses. Furthermore, much of this work is not referenced here. This limits the impact of the present work.
(3) The employed drift-diffusion model has many parameters, which are not discussed in detail. The results in Supplementary Figures 3-5 are not explained or discussed, including the observation that model-recovery tests fail to recover some of the parameters (e.g., Figures S3E, G). This makes the interpretation of these models more difficult.
(4) The results regarding potential reward-maximization strategies are compelling and connect perceptual and normative decision models. The results are, however, limited by the different inter-trial intervals and trial initiation times between species, which are shown in Figure S6. It is unclear to me, for example, how the long trial initiation times in rats relate to a putative reward-maximizing strategy. This contrasts with the very short trial initiation times (i.e., very 'efficient') of humans, even though humans are 'too accurate' in terms of their sampling time. Assessing reward-maximizing strategies seems difficult with such different trial times and in the absence of experimental manipulation.
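To illustrate the concern: reward rate depends on the total time per trial, not only on sampling time, so with identical accuracy and RT, different initiation and inter-trial times move the optimum. The numbers below are purely illustrative, not the measured ones.

```python
def reward_rate(p_correct, mean_rt, initiation_time, iti):
    """Expected rewards per second = accuracy / mean total time per trial."""
    return p_correct / (mean_rt + initiation_time + iti)

# Same choice behavior, different overheads (all values illustrative):
print(reward_rate(0.85, mean_rt=0.8, initiation_time=0.5, iti=3.0))   # short initiation
print(reward_rate(0.85, mean_rt=0.8, initiation_time=5.0, iti=3.0))   # long initiation (rat-like)
```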