Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife's peer review process.
Editors
- Reviewing Editor: Michael Frank, Brown University, Providence, United States of America
- Senior Editor: Michael Frank, Brown University, Providence, United States of America
Reviewer #1 (Public review):
Summary:
The authors developed a new gaze-based reversal task to study 6- to 10-month-old infants, typically a very challenging age group in which to study behavior related to learning, exploration, and perseveration. The research question is excellently motivated by pointing out a limitation of past work, which has typically applied similar approaches to adult clinical populations and therefore captures only the endpoint of the developmental process; there is thus important clinical and scientific value in studying much earlier developmental stages. The authors accomplish this with a new gaze-based paradigm that allows them to fit a variety of complex computational models to data from 41 infants. The main advantage of their winning model is that its parameters provide better pattern separation between two identified clusters of participants than behavioral variables alone.
Strengths:
Overall, the paper is well-written, and the models and analyses are applied in a principled and thorough fashion. The authors do an excellent job of both motivating their research question and addressing it through their task and set of computational models. The scope is also quite ambitious, modeling both choices and pupillary responses, while also using the models to generate behavior that is comparable to the experimental data and performing a cluster analysis to compare the suitability of the model parameters vs. other behavioral/questionnaire data in performing pattern separation between participants.
Weaknesses:
Despite these strengths, I had a number of concerns that may limit the reliability of the findings.
First, given that the rewards in the initial pre-reversal setting are defined by the infants' first choice, it was unclear to me whether the behavioral patterns in Figure 2 really show that there was any (prediction-error-based) learning in the task at all. The behavioral analyses proceed very briskly without really addressing this question, before rapidly jumping off the complexity cliff to present the models. However, even then, the winning model only had free parameters for preference (c) and left-right dominance (epsilon), which don't really capture mechanisms related to learning. The epistemic and extrinsic components included in the model at the second stage could potentially help shed light on this question, but (unless I've misunderstood) they seem to be all-or-nothing parts of the model and thus don't reappear in later analyses (e.g., the cluster analysis) because they are not individual-specific parameters. As a result, the main learning-relevant aspects of the model seem divorced from the ability to perform clustering or other clinically relevant diagnoses downstream, and it was unclear to me whether the results really capture the mechanisms related to cognitive flexibility that motivate the manuscript in the introduction.
My other main concern was the complexity of the models and the way model comparison was performed across the three stages. First, the set of models is quite complex and risks alienating many developmental psychologists who would otherwise be very interested in these findings. I'm therefore curious why the authors didn't consider including much simpler context-based RL models (e.g., Rescorla-Wagner/Q-learning models) that explicitly use prediction-error updates and whose simplicity might better match the simplicity of the behavior that 6- to 10-month-old infants are capable of displaying. Certainly, preference (as an inverse temperature parameter for a softmax policy) and left-right dominance (as a bias) could be implemented in these much simpler models (a minimal sketch is given below). Second, while the three-stage model comparison seems somewhat principled, it left me questioning whether the first-stage or second-stage results might be affected by later stages - for instance, whether the Simple-discard model would still win in the first stage once omega and eta had been eliminated as free parameters. I understand that there may be feasibility issues with testing all combinatorial variants of the model, but it was unclear why this specific order was chosen and what consequences this sequential dependency in the model fitting may have for the conclusions. Finally, while model identifiability is stated in the abstract as one of the strengths of this approach, there don't seem to be any clear analyses supporting this claim. I would have loved to see a model recovery analysis (see Wilson & Collins, eLife 2019) to support this statement.
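To illustrate the kind of simpler baseline I have in mind, here is a minimal sketch in Python; the parameter names (alpha, beta, bias) and the rewards callback are hypothetical illustrations, not taken from the manuscript:

```python
import numpy as np

def simulate_rw_choices(cues, rewards, alpha, beta, bias, rng=None):
    """Illustrative Rescorla-Wagner / Q-learning sketch (not the authors' model).

    cues    : sequence of context-cue indices, one per trial
    rewards : callable rewards(trial, choice) returning a 0/1 outcome
    alpha   : learning rate applied to prediction errors
    beta    : inverse temperature ("preference")
    bias    : additive left-right bias in the softmax (positive favours left)
    """
    rng = np.random.default_rng() if rng is None else rng
    n_cues = int(np.max(cues)) + 1
    Q = np.zeros((n_cues, 2))            # value of left/right choice per cue
    choices, prediction_errors = [], []
    for t, cue in enumerate(cues):
        # Softmax policy over the two sides, with a constant left-right bias
        logits = beta * Q[cue] + np.array([bias, 0.0])
        p_left = 1.0 / (1.0 + np.exp(logits[1] - logits[0]))
        choice = 0 if rng.random() < p_left else 1
        # Prediction-error update of the chosen option only
        outcome = rewards(t, choice)
        pe = outcome - Q[cue, choice]
        Q[cue, choice] += alpha * pe
        choices.append(choice)
        prediction_errors.append(pe)
    return np.array(choices), np.array(prediction_errors)
```

Fitting alpha, beta, and bias per infant would yield learning-relevant, individual-specific parameters that could, in principle, enter the same downstream cluster analysis.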
Reviewer #2 (Public review):
Summary:
This paper examines infants' learning in a novel gaze-contingent cued reversal learning task. The study provides strong evidence that infants learn in the task, and the authors characterize individual differences in learning using computational modeling. The best-fitting model of the set compared reflects learning of mappings between context cues and outcomes that do not carry over across blocks. Infants are then clustered into two groups based on model parameter estimates capturing primacy bias and reward sensitivity. These groupings exhibited differences in infant temperament and other developmental measures. The modeling is rigorous, with model predictions accounting for substantial variance in infants' choices, and parameter estimates showing high recoverability. This study is important in that it demonstrates that such rigorous standards in computational modeling of behavior can be successfully deployed in infant studies.
Strengths:
The study provides evidence that infants exhibit cognitive flexibility within a reversal learning task and do not simply perseverate.
The methods used within the novel gaze-contingent paradigm will be useful for other groups interested in studying learning and decision-making in infants.
The study applies rigorous computational modeling approaches to infants' choices (inferred from gaze) and their physiological responses (i.e., pupil dilation) in the task, demonstrating that infants' reward learning is well-captured by an error-driven learning process.
The authors conduct model comparison, posterior predictive checks, and parameter recoverability analyses and demonstrate that model parameters can be well estimated and that the model can recapitulate infant choice behavior.
Physiological pupil dilation measures that correlate with prediction error signals from the model further validate the model as capturing the learning process.
Weaknesses:
It is not entirely clear that the individual differences in reversal learning identified between the two clusters of infants (ostensibly reflecting differences in cognitive flexibility) have construct validity or specificity for the associated developmental abilities that differ between groups (daily living, communication, motor function, and socialization).
Similarly, it's not clear why the paper is framed as an advance for infant computational *psychiatry* rather than simply an advance in computational modeling of infant behavior. It seems to me that a more general framing is warranted. Basic cognitive development research can also benefit from cognitive hypothesis testing via computational model comparison and precise measurement of infants' behavior in reward learning tasks. Is there reason to believe that infants' behavior in this task might have construct validity for mental health problems related to cognitive flexibility later in development? Do the Vineland or IBQ-R-VSF prospectively predict clinical symptoms?
A large proportion of the recruited infants (14 of 55) were excluded, but few details are provided on why and when they were excluded. Did the excluded infants differ on any of the non-task measures? This information would be helpful to understand limitations in the utility of the task or the generalizability of the findings.
It is stated that: "The infants who completed at least three trials following the reversal were included in the analysis, as it is more likely that their expectations were violated in this interval." Are three trials post-reversal sufficient to obtain reliable estimates of model parameters? More details should be provided on the number of trials completed for all of the included/excluded infants.
Reviewer #3 (Public review):
This paper used computational modeling of infants' performance in a reversal learning paradigm to identify two subgroups of infants, one that initially learned a bit faster but then perseverated more and failed to switch after the reversal (yellow cluster), and those who sampled more before the switch but then perseverated less/switched better (magenta cluster - though see below for comments about infants' overall weak performance). The authors describe magenta babies as showing a profile of greater cognitive flexibility, which they note in adults is linked to better outcomes and a lower incidence of psychiatric disorder. Indeed, the yellow cluster scored less well on several scales of the Vineland and showed lower surgency on the IBQ than the magenta cluster. The authors argue that this paper paves the way for the field of "infant computational neuropsychiatry."
In general, I think this is a fun and intriguing paper. That said, I have a number of concerns with how it is currently written.
First, the role of pupil dilation in the models was really unclear -- I've read it through a few times and came away with different impressions each time. I am now pretty sure the models were based only on infants' behavioural responses (e.g., choices of the correct versus incorrect location) rather than differences in pupil size, but pupil size kept popping up throughout, and so I initially thought the clusters were based on that. The authors should clarify this so other readers are not confused. (One thing that might help is avoiding the word "behaviour" on its own, unless it is specified whether it refers to looking behaviour, as I assume that some would characterize pupil dilation as a behaviour as well.)
If clusters were NOT based on pupil size (e.g., reaction to prediction error), why not? Was this attempted, and did no clusters emerge? Did the yellow and magenta groups also differ in their reaction to prediction error, or not? It seems to me that the argument that this work will be the basis of infant computational psychiatry would require not simply a link between behaviour in an infant study and other measurements of infants' functioning - many other papers to date have demonstrated such relationships, many longitudinally - but instead a link to something where the neurobiology of the behaviour being studied is better understood. I assume this is why pupil dilation kept coming up, but again, it didn't actually seem to be part of the modelling unless I missed something. That is, although this is a nice finding, I currently think the novelty of the finding, as well as the suggestion that it will start a whole new field, may be overblown. I certainly think the pupillometry data has promise, as does the LUMO data, which the authors alluded to being in the works. But perhaps the implications should be toned down a bit in this paper, until those data are further along.
My final substantial comment (a few more minimal ones below) is that, overall, babies did quite poorly at this task. Even after 9 post-switch trials, the magenta group was still responding at chance, and the yellow group seemed not to switch at all. Infants then all seemed to perform very well again during block 2, which makes it seem as if they still had the original contingency in mind. That said, from what I could see, no data were provided about how many babies looked to the original correct side first during block 2. Based on the data, I assume they basically all went back to predicting on the original side, as their return to high levels of successful trials would otherwise not make sense, unless they somehow forgot the entire thing. It would be good to know for sure, and to have those data (specifically, how many babies looked to the original side again at the start of block 2) in the main paper. Given this overall lack of sensitive performance in the paradigm, even though the cues signaling where the rewarding video would appear changed completely (that is, the contingency between cue and outcome did not itself switch; the cues themselves did), it seems odd to discuss things like statistical or even skillful learning alongside these data.