Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Editors
- Reviewing Editor: Rui Ponte Costa, University of Oxford, Oxford, United Kingdom
- Senior Editor: Michael Frank, Brown University, Providence, United States of America
Reviewer #1 (Public review):
Summary:
This study uses an encoding model approach to compare a range of different deep learning models in predicting functional MRI data, collected while participants played the game "Super Mario Bros" inside the scanner. The fMRI data is rich, within-subject data, with around 15 hours of gameplay for each of five participants who took part in the study. A range of models are compared, including deep RL models (PPO), behaviour cloning (imitation learning), supervised visual models (ResNet), and untrained but structurally equivalent models. The main metric of model comparison is brain prediction (i.e., cross-validated R^2, and within-subject generalisation to out-of-distribution gameplay), rather than focussing on which model features are being encoded.
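For readers less familiar with the encoding-model metric referred to here, the following is a minimal sketch of cross-validated R^2 with ridge regression. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the fixed `alpha`, and the contiguous five-fold split are hypothetical choices, `X` stands for model features (time points by features), and `Y` stands for fMRI responses (time points by voxels or parcels).

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def r2_per_voxel(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each column of Y."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(X, Y, alpha=1.0, n_folds=5):
    """Mean out-of-fold R^2 per voxel, using contiguous folds (no shuffling)."""
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(X.shape[0], dtype=bool)
        train_mask[test_idx] = False
        # Centre with training statistics only, so no test information leaks in.
        x_mean = X[train_mask].mean(axis=0)
        y_mean = Y[train_mask].mean(axis=0)
        W = ridge_fit(X[train_mask] - x_mean, Y[train_mask] - y_mean, alpha)
        Y_pred = (X[test_idx] - x_mean) @ W + y_mean
        scores.append(r2_per_voxel(Y[test_idx], Y_pred))
    return np.mean(scores, axis=0)
```

In practice the regularisation strength would be chosen on a validation split rather than fixed, and folds would respect run or session boundaries; both are left out here for brevity.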
The core results are:
(1) The deep RL and imitation learning models show a modest improvement in prediction accuracy relative to the untrained and visual models (around a 1-2% increase in R^2). Notably, this is against a background in which the untrained model - essentially random projections of the gameplay pixels - can explain around 6 or 7% of the variance in fMRI data (Figure 2). So, the improvement in model fit is a small (but significant) one, and a major driver of prediction scores appears to be low-level visual stimulation as opposed to gameplay prediction.
(2) There is little variation across layers in prediction accuracy in the trained models. In the untrained model, prediction accuracy drops across layers. This suggests that the prediction accuracy in this untrained model results from its (early-layer) representations being closer to what is presented on screen - as the random weights move the untrained model's representation away from sensory features, it becomes less predictive of the brain. In a trained model, meaningful representations are maintained in deeper layers - and interestingly, there is no clear correspondence between layers of the model and layers of the visual pathway.
(3) There is a noticeable improvement in brain prediction by both the deep RL and imitation models with model training. In other words, the 1-2% increase in R^2 mentioned in point (1) is a result of the training, rather than any other factor.
(4) None of the models, including the untrained model, perform well in generalising to out-of-distribution data held out from the training/evaluation. This leads to the claim that the brain's encoding representations are 'brittle'.
Strengths:
(1) A major strength of the dataset is that it contains rich, extended naturalistic gameplay data within individual subjects. This mirrors some of the advantages seen in other naturalistic datasets (e.g., natural scenes dataset, storybook listening, video watching) - but there are very few examples of such data where the subject is controlling or generating the behaviour in the naturalistic task. This allows potentially new questions to be asked about how these representations are learned across time, within individual participants.
(2) A further strength of the manuscript is the clarity with which the aims and hypotheses are articulated in the introduction, and evaluated/discussed throughout the paper. This provides a clear set of objective criteria against which to evaluate the performance of the resulting models; the paper is also written in a very clear and honest way, in that some of the a priori hypotheses are not supported - this makes for a more transparent report than one written in an a posteriori manner.
(3) Finally, although the results in comparing different models are perhaps not as impressive as one might have hoped, the authors have been quite careful in making the models comparable in terms of their architecture and number of parameters, etc. This means that any variation in prediction is likely attributable to the different objective functions used to train the models, rather than other features of the model architecture.
Weaknesses:
(1) The work is currently framed as "training neural networks from scratch...leads to brittle brain encoding" - but I'm not sure that the results fully support this. First, the brittleness is still present in the untrained network (i.e., random projections of pixels), as shown in Figure 5b. This implies that the brittleness may not be a consequence of the network training, but of overfitting to the encoding (ridge regression) model of the fMRI data (as the authors acknowledge when presenting these results). I would instead encourage the authors to shift the emphasis slightly towards the (modest) improvement in prediction using the RL/imitation objectives, and/or the (similarly modest) improvement in prediction with training, rather than foregrounding the brittleness of the encoding.
(2) While the analyses of how model prediction improves with training are nice, it is a shame that there is no consideration of how prediction improves (or otherwise) across the training of the participants. Do participants improve across the 15 hours of gameplay - or do they, for instance, become more predictable by the imitation learning model? Is this more true in the naïve participants than those with extensive past experience of Mario? And does this in any way lead to better alignment with model predictions across sessions? These all seemed like natural questions that could benefit from the unique longitudinal nature of this dataset, and it seemed a shame that they were not touched upon at all.
(3) While there is little variation between the models in terms of predictive performance, it is currently a little unclear whether this is simply due to fitting a set of highly parameterised models to the data, or because the models are themselves fundamentally similar in their representations. One way to address the latter point might be to perform some kind of RSA or CKA (Kornblith et al, arXiv 2019; Williams et al, bioRxiv 2024) across the layer representations within-model, and between-models, to ask how similar (or different) the learned representations are between the different models used for fMRI prediction.
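To make the suggested comparison concrete, here is a minimal sketch of linear CKA in the form given by Kornblith et al. It assumes two activation matrices recorded on the same set of frames (rows are frames, columns are units); the function name `linear_cka` and the matrix shapes are illustrative choices, not something taken from the manuscript.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centred kernel alignment between two activation matrices.

    X has shape (n_samples, d1) and Y has shape (n_samples, d2); rows must
    correspond to the same stimuli. The result lies in [0, 1].
    """
    # Centre each feature so the similarity ignores mean activation offsets.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

Applying this to every pair of layers, within and between the RL, imitation, ResNet-proxy, and untrained models, would give a similarity matrix showing whether the models' comparable encoding scores reflect similar representations. Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why it can be used across layers of different widths.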
Reviewer #2 (Public review):
Summary:
This paper aims to test whether training models to play video games from visual inputs through reinforcement learning leads to better matches to human visual encoding during gameplay, compared to models with the same architecture and training images but with different training objectives. The authors find a slight advantage for the RL model, but encoding performance and generalization overall are weak and variable.
Strengths:
This was a reasonable hypothesis to test, and the model comparisons adequately represent other possibilities for training a model of the given architecture. The ResNet proxy is a particularly interesting way to benefit from a larger model's pre-training while still using the same constrained architecture and training set.
Weaknesses:
I always prefer to see learning curves for models on the tasks they were trained on, just to contextualize their performance on the brain encoding results, but they are not shown here.
The paper misses some of the relevant literature that has performed similar comparisons across learning objectives for visual encoding models, such as https://arxiv.org/abs/2112.02027 and https://pmc.ncbi.nlm.nih.gov/articles/PMC10569538/
The authors end up advocating for the idea that large-scale pre-training is needed in order to build good visual encoders for matching human data. In many ways, this was already known (given that brain encoding scores scale with ImageNet performance, which requires at least a moderate amount of general-purpose image training to achieve). However, they also note that "the brain encoding performance of the ResNet model was not significantly different from that of the Untrained model." I would assume that an ImageNet-trained ResNet would be in the direction of the type of large-scale pre-trained model the authors advocate for (even when not trained for action generation), yet their results don't support this direction being the solution. Are their results about ResNet not surpassing an untrained model consistent with prior work, and if not, why not? How do they view this in light of their argument for the use of larger models?
Reviewer #3 (Public review):
Summary:
In this paper, the authors have 5 human subjects learn to play Super Mario Bros while undergoing fMRI for 15 hrs each. They compare a reinforcement learning (RL) model (PPO), an imitation learning (IL) model, and a vision model (ResNet) in their ability to play the game, match human behavior, and, critically, explain human brain activity.
The key findings can be summarized as follows:
(1) RL, IL, and vision models explain similar amounts of variance in the BOLD signal (Figure 2a), with a significant but small trend of RL > IL > ResNet (Table 1).
(2) Untrained models with the same architecture explain a smaller but very similar amount of variance (Figure 2a, Table 1).
(3) The brain maps across all models (and layers) are strikingly similar, with the strongest effects in visual, parietal, and motor regions (Figures 2b, 2d; Supplementary Material II).
(4) Behavioral and neural performance are correlated across model checkpoints (but not levels), such that later checkpoints in training have better behavioral and neural encoding performance (Figures 3 & 4), although the neural effect plateaus pretty quickly.
(5) Out-of-distribution performance is quite poor, both behaviorally (Figure 5a) and neurally (Figure 5b).
I believe this work will be of interest to neuroscientists, cognitive scientists, and AI researchers alike. There has been a growing trend in neuroscience to adopt AI models as cognitive models of complex perception and action, while at the same time, AI researchers are increasingly looking at the brain for inspiration. The key finding of this paper -- that these models fail to generalize to out-of-distribution levels -- questions the core assumptions of this whole enterprise.
Strengths:
Unlike previous studies applying machine learning to naturalistic game-play, the authors take great care to make sure their models are evaluated on an equal footing, using equivalent or similar architectures/number of parameters and training data.
While the number of subjects (5) is relatively small, the amount of data per subject (15 hours) is impressive, which is important for fitting the imitation learning & ResNet models and for obtaining reliable encoding performance for each individual subject. The authors employed a train/val/test split and held-out sets, the gold standard in the literature.
Overall, the paper was well-written and easy to follow. The figures clearly illustrate the main findings.
Weaknesses:
(1) Missing statistical tests
I think the main weakness of the paper is that many of the claims are qualitative in nature and lack appropriate statistical tests, for example:
- "The conv3 layer has the highest brain encoding score";
- "Robust association between task performance and brain encoding";
- "Level patterns strongly predict brain encoding";
- "Brain encoding performance was severely degraded";
- "Effect of training on brain encoding was apparent".
While these effects are indeed qualitatively visible in the figures, it is unclear which of these differences are significant (with the notable exception of Table 1). I believe the paper would benefit substantially if these effects were quantified and every claim were supported by the appropriate statistical tests. As an example, with the exception of Table 1 and the corresponding paragraph, I could not find any p-values in the results section.
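As one hedged illustration of the kind of test that could back such claims, the sketch below runs a paired sign-flip permutation test on the difference in encoding scores between two conditions (for example, two layers or two checkpoints). It is an assumption-laden example rather than a recommendation for a specific analysis: the function name, the number of permutations, and the choice of paired unit (for example, one score per run or parcel) are all hypothetical.

```python
import numpy as np

def sign_flip_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    scores_a and scores_b are paired scores (same units, same order).
    Returns the observed mean difference and a Monte Carlo p-value.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under the null the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # Add-one correction keeps the estimated p-value strictly positive.
    exceed = np.sum(np.abs(null) >= abs(observed))
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value
```

Note that with only five subjects as the paired unit there are just 2^5 = 32 sign patterns, so the smallest achievable two-sided p-value is 1/16; a finer unit or a hierarchical model would be needed for stronger inference, and any such test would also need correction for multiple comparisons across layers or regions.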
(2) Missing model performance and human-likeness
Also absent from the results is an assessment of model performance on the task and similarity to human performance/behavior. From Figures 3 and 4, we can see that the game score of PPO is around 500-1000 - how does that compare to the humans? We can also see that the imitation scores for IL are around 0.4-0.7, but what does that mean? Such results would be crucial to assess if the models have indeed learned to play the games and/or imitate the humans, and therefore, whether they would be good candidates as cognitive models (before even looking at brain activity). At minimum, plotting the human versus model game scores (see e.g. Tomov et al. 2023 Neuron, Figure 2) would be helpful; or, if you'd like to dig deeper, showing that human actions are more valuable or more likely under those models (see e.g. Cross et al. 2022 Neuron, Figure 2). It might also be helpful to look at imitation scores for the RL model and game performance of the imitation model -- I suspect they will both be bad, but they can at least serve as informative baselines for their counterparts.
(3) Possible undertraining
Relatedly, one possible explanation for why the Untrained model does so well is that all the models may be effectively undertrained. For example, while there are no training curves in the paper, it seems from the spacing of the checkpoint game scores (x-axis on Figure 3c) that the RL model may not have converged yet (it would be helpful if those were somehow colored by training epoch). Showing training curves would be helpful (i.e., something similar to Figure 3a, except with performance on the y-axis).
Additionally, it would be great to provide more details regarding the PPO training protocol. How many episodes? How many steps per episode? How many steps for all of the training? Similarly, for the imitation learning model: batch size, number of epochs, optimizer, scheduler, etc.
(4) Mysterious poor encoding performance of Untrained and ResNet models on the held-out set
Critically, and related to that, I'm a little confused about the Untrained model results on the held-out set (Figure 5b, top row on the right). Why should those be any different from the test set results with the Untrained model (Figure 2a, right, fourth row from the top)? It makes sense why the other models are worse on the held-out set -- they have never been trained on any frames from those levels. However, the untrained model has not been trained on *any* frames from *any* levels, including the test set and the held-out set.
The same is true for the ResNet model, which is pre-trained on a completely separate data set and yet similarly shows worse performance on the held-out set compared to the test set.
This cannot be explained by the ridge regression, which has no parameters or hyperparameters fitted on either the test set or the held-out set.
The big discrepancy in the untrained model & ResNet results between the test and the held-out set makes me think that there is something substantially different about the levels in that held-out set; that they are truly out of distribution compared to the other 20 levels (e.g., maybe they're the last 2 hardest levels and look completely different? e.g., the ResNet proxy in Figure 5c shows worse performance than the mean, which is indicative of an anti-correlation). Alternatively, it may be some issue with the analysis pipeline. The poor generalization results are central to the claims of the paper, so I believe this should be clarified.
(5) Brittleness conclusion rationale
I'm not quite on board with the authors' rationale that "[poor model performance on the out-of-distribution levels] demonstrates that the models we tested are limited in scope and may not provide a valid inference of brain-like processing, as human behavior remains robust and generalizable across levels".
First, unlike the models, humans were actually trained on those levels, so it would not be surprising if they perform just as well on them as on the other levels (but do they? Again, it would be great to see some behavioral data from the humans and the models).
Second, as the authors themselves show, task performance and human-likeness do not really correlate with neural encoding across levels (Figures 4a and 4b, respectively), so even if model performance remained "robust and generalizable" on the held-out levels, that will not necessarily translate to good neural encoding.
Third, and perhaps most importantly, unless the test set and held-out set were sampled exclusively from the practice phase when the subjects have mastered all the levels (that doesn't seem to be the case, but the authors should clarify), then the humans are continuously learning, which means that their own internal representations of the game are evolving. That's not the case for the models, which I assume are in "inference mode" when their representations are extracted for neural encoding. That is, their weights are frozen. So there's a fundamental mismatch between the mode in which humans are operating (continuously learning and executing) and the mode in which the models are operating (just executing). While this is true for all the levels, it may partially account for the discrepancy in the held-out set specifically.