A high-throughput approach for the efficient prediction of perceived similarity of natural objects

  1. Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
  2. Computational Cognitive Neuroscience and Quantitative Psychiatry, Department of Medicine, Justus Liebig University Giessen, Giessen, Germany
  3. Neural Coding Lab, Donders Institute for Brain, Cognition and Behavior, Nijmegen, Netherlands
  4. Center for Mind, Brain and Behavior (CMBB), Universities of Marburg, Giessen, and Darmstadt, Marburg, Germany

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Stefania Bracci
    University of Trento, Rovereto, Italy
  • Senior Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America

Reviewer #1 (Public review):

Summary:

This manuscript addresses the challenge of understanding and capturing the similarity among large numbers of visual images. The authors show that an automated approach using artificial neural networks, which builds on an embedding of similarity along behaviorally relevant dimensions, can predict human similarity data up to a certain level of granularity.
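[For readers less familiar with this kind of pipeline, a minimal sketch of what it might look like is given below. This is purely illustrative and not the authors' code; the ridge-regression readout and dot-product similarity are assumptions, and all names are placeholders.]

```python
# Illustrative sketch (not the authors' code): predict a 49-dimensional
# behavioral embedding from DNN activations, then derive pairwise similarity.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_dims = 1000, 200, 2048, 49

# Stand-ins for real data: DNN activations and behavioral embedding scores.
X_train = rng.normal(size=(n_train, n_features))   # activations, training images
Y_train = rng.normal(size=(n_train, n_dims))       # 49 behavioral dimension scores
X_test = rng.normal(size=(n_test, n_features))     # activations for new images

# One cross-validated ridge regression per behavioral dimension.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
Y_pred = np.column_stack([
    model.fit(X_train, Y_train[:, d]).predict(X_test) for d in range(n_dims)
])

# Predicted similarity between new images as the dot product of their
# predicted dimension profiles (one common choice; others are possible).
predicted_similarity = Y_pred @ Y_pred.T
```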

Strengths:

The manuscript starts with a very useful introduction that sets the stage with an insightful Figure 1. The methods are state of the art and well thought out, and the data are compelling. The authors demonstrate the added value of their approach in several directions, resulting in a manuscript that is highly relevant for different domains. The authors also explore its limitations (e.g., granularity).

Weaknesses:

Although this manuscript and the work it describes are already of high quality, I see several ways in which it could be further improved. Below I rank these suggestions tentatively in order of importance.

Predictions obtain correlations above 0.80, often close to 0.90. The performance of DimPred is not trivial, given how much better it performs relative to classic RSA and feature reweighting. Yet, the ceiling is not sufficiently characterized. What is the noise ceiling in the main and additional similarity sets that are used? If the noise ceiling is higher than the prediction correlations, then can the authors try to find the stimulus pairs for which the approach systematically fails to capture similarity? Or is the mismatch distributed diffusely across the full stimulus set?

Also, in the section on pp. 8-9, it is crucial to provide information on the noise ceiling of the various datasets.
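[One common way such a noise ceiling could be estimated, assuming individual-participant RDMs are available, is a leave-one-subject-out procedure. The sketch below is illustrative only and is not taken from the manuscript.]

```python
# Illustrative sketch: a leave-one-subject-out noise ceiling for a set of
# single-participant representational dissimilarity matrices (RDMs).
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform

def noise_ceiling(subject_rdms):
    """subject_rdms: array of shape (n_subjects, n_items, n_items)."""
    # Keep only the upper triangle of each RDM (pairwise dissimilarities).
    vecs = np.array([squareform(rdm, checks=False) for rdm in subject_rdms])
    lower, upper = [], []
    for s in range(len(vecs)):
        others = np.delete(vecs, s, axis=0).mean(axis=0)   # mean of remaining subjects
        everyone = vecs.mean(axis=0)                       # mean including subject s
        lower.append(spearmanr(vecs[s], others)[0])
        upper.append(spearmanr(vecs[s], everyone)[0])
    return float(np.mean(lower)), float(np.mean(upper))    # lower / upper bound estimates
```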

This consideration of noise ceiling brings me to another consideration. Arguments have been made that a focus on overall prediction accuracy might mask important differences in underlying processes that can be demonstrated in more specific, experimental situations (Bowers et al., 2023). Can the authors exclude the possibility that their automatic approach would fail dramatically in specifically engineered situations? Some examples can be found in the 2024 challenge of the BrainScore platform. How can future users of this approach know whether they are in such a situation or not?

The authors demonstrated one limitation of the DimPred approach: its limited ability to capture fine-grained similarity among highly similar stimuli. The implications of this finding were not clear to me from the Abstract etc., because the summaries do not sufficiently highlight that in this case DimPred performs even worse, and much worse, than simpler approaches like feature reweighting and even classic RSA. I would discuss this outcome in more detail. With hindsight, this problem might not be so surprising, given that DimPred relies upon an embedding with a few tens of dimensions that mostly capture between-category differences. To me, this seems like a more fundamental limitation than a mere problem of granularity or lack of data, as suggested in the abstract.

The DimPred approach is based on the dimensions of a similarity embedding derived from human behavior. Which is important here: (i) that DimPred is based upon an approach that tries to capture latent dimensions, or (ii) that these dimensions are behaviorally relevant? There are many dimension-focused approaches. Generic ones are PCA, MDS, etc. More domain-specific approaches in cognitive neuroscience include the following: (i) for two-dimensional shape representations, good results have been obtained with image-computable dimensions of various levels of complexity (Morgenstern et al., 2021, PLOS Comput. Biol.); (ii) another dimension-focused approach has focused upon identifying dimensions that are universal across networks and human representations (Chen & Bonner, 2024, arXiv). Would such generic or more specific approaches work as well as DimPred?

Reviewer #2 (Public review):

In this paper, the authors successfully incorporated the 49 dimensions found in a human similarity judgment task to better train DNNs to perform accurate human-like object similarity judgments. The model performance results are impressive, but I am not totally convinced that the present modeling approach brings new insights regarding the mental and neural representations of visual objects in the human brain. I have a few thoughts that I would like the authors to consider.

(1) Can the authors provide a detailed description of what these off-the-shelf DNNs are trained on? For models trained on visual images only, because semantic information was never present during training, it is not surprising that they fail to capture such information, even with additional DimPred training. For the CLIP models, because visual-semantic associations were included during training, it again comes as no surprise that these models do better even without DimPred training. Similarly, the results for homogeneous image sets are not particularly surprising. In this regard, I find that the paper reports many obvious results. Better motivation should be given for why particular models and analyses were chosen, what predictions can be made, and how the results may be informative beyond what we already know.

(2) I am curious as to what DimPred training is doing exactly. If you create an arbitrary similarity structure (i.e., not the one derived from human similarity judgment) by, e.g., shuffling the values during training or creating 49 arbitrary dimensions, can the models be trained to follow this new arbitrary structure? In other words, do the models intrinsically contain a human-like structure, but we just have to find the right parameters to align them with the human structure or do we actually impose/force the human similarity structure onto the model with DimPred training?

Is it also an issue that you are including more parameters during DimPred training, and that the increase in parameters alone can boost performance?
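[One way the shuffling control suggested above could be implemented is to refit the same regression after permuting the dimension scores and compare held-out performance. The sketch below is illustrative only; it assumes a ridge-regression readout and is not the authors' implementation.]

```python
# Illustrative sketch of the suggested shuffling control: fit the same
# regression to shuffled dimension scores and compare held-out performance.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def dimension_r2(X, y, shuffle=False, seed=0):
    """X: DNN activations (n_images, n_features); y: one dimension's scores."""
    if shuffle:
        y = np.random.default_rng(seed).permutation(y)  # break the image-score mapping
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# If the unshuffled fit is high but the shuffled fit collapses towards zero,
# the performance reflects structure already present in the model features
# rather than the number of free parameters alone.
# r2_true = dimension_r2(X, y)
# r2_shuffled = dimension_r2(X, y, shuffle=True)
```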

(3) There is very little information on how Figure 8 was generated. I could not find in the Methods any detailed description of how the values were calculated. Are the results from the category-insensitive and category-sensitive embeddings both obtained from the same OpenCLIP-RN50x64? Figure 8 reports the relative improvement. What do the raw activation maps look like for the category-insensitive and category-sensitive embeddings? I am surprised that the improvement is seen primarily in the early visual cortex (EVC) and higher visual areas but not more extensively in association areas sensitive to semantics. Why should EVC show such large improvements, given that category information is stored elsewhere?

Related to this point, how well do other DNN models account for human brain fMRI responses in the present study? Many prior studies have documented the similarities and differences between DNN and human fMRI visual object representations. Do category-sensitive CLIP models outperform other DNN models? It is important to report the full results. Even if category-sensitive CLIP models outperform category-insensitive CLIP ones, if the overall model performance is low compared to the other DNNs, the results would not be very meaningful or impressive. I am wondering if, in the process of achieving better human-like similarity judgment performance, these models lose some of the ability to account for visual object representations in the human ventral visual cortex.

(4) I am wondering how precisely the present results may yield new insights into the mental and neural representations of visual objects in the human brain. Prior human studies have already identified 49 dimensions that can capture human similarity judgment. Beyond predicting performance for new pairs of objects, how would the present modeling approach help us understand more about the human brain? The authors discussed this, but I am not sure the arguments are convincing.

Reviewer #3 (Public review):

Summary:

The authors compare how well their automatic dimension prediction approach (DimPred) can support similarity judgements relative to more standard RSA approaches. The authors show that DimPred does better when assessing out-of-sample heterogeneous image sets, but worse for out-of-sample homogeneous image sets. DimPred also does better at predicting brain-behaviour correspondences compared to an alternative approach. The work appears to be well done, but I am left unsure what conclusions the authors are drawing.
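[For context, "classic" RSA here amounts to correlating the upper triangles of a model-derived and a behavioral dissimilarity matrix without any reweighting. The sketch below is illustrative only, with assumed distance and correlation choices.]

```python
# Illustrative sketch of "classic" RSA: correlate the upper triangles of a
# model-derived RDM and a behavioral (ground-truth) RDM without reweighting.
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist, squareform

def classic_rsa(model_features, behavioral_rdm):
    """model_features: (n_items, n_features) activations for the image set.
    behavioral_rdm: (n_items, n_items) dissimilarity matrix from behavior."""
    model_rdm_vec = pdist(model_features, metric="correlation")  # 1 - Pearson r
    behav_vec = squareform(behavioral_rdm, checks=False)
    rho, _ = spearmanr(model_rdm_vec, behav_vec)
    return rho
```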

In the abstract, the authors write: "Together, our results demonstrate that current neural networks carry information sufficient for capturing broadly-sampled similarity scores, offering a pathway towards the automated collection of similarity scores for natural images". If that is the main claim, then they have done a reasonable job supporting this conclusion. However, the importance of automating this process for broadly-sampled object categories is not made sufficiently clear.

But the authors also highlight the importance that similarity judgements have had for theories of cognition and the brain; for example, in the first paragraph of the paper they write: "Similarity judgments allow us to improve our understanding of a variety of cognitive processes, including object recognition, categorization, decision making, and semantic memory6-13. In addition, they offer a convenient means for relating mental representations to representations in the human brain14,15 and other domains16,17". The fact that the authors also assess how well a CLIP model using DimPred can predict brain activation suggests that their work is not just about automating similarity judgements, but also about highlighting how their approach reveals that ANNs are more similar to brains than previously assessed.

My main concern is with regard to the claim that DimPred is revealing better similarities between ANNs and brains (a claim that the authors may not be making, but this should be clarified). The fact that predictions are poor for homogeneous images is problematic for this claim, and I expect their DimPred scores would be very poor under many conditions, such as when applied to line drawings of objects, or to a variety of additional out-of-sample stimuli that are easily identified by humans. The fact that so many different models obtain such similar prediction scores (Fig 3) also raises questions as to the inferences one can make about ANN-brain similarity based on the results. Do the authors want to claim that CLIP models are more like brains?

With regard to the brain prediction results, why is the DimPred approach doing so much better in V1? I would not think the 49 interpretable dimensions are encoded in V1, and the ability to predict would likely reflect a confound rather than V1 encoding these dimensions (e.g., if a dimension were "things that are burning", then the DNN might predict V1 activation based on the encoding of colour).

In addition, more information is needed on the baseline model; based on what is provided, it is hard to tell whether we should be impressed by the better performance of DimPred: "As a baseline, we fit a voxel encoding model of all 49 dimensions. Since dimension scores were available only for one image per category36, for the baseline model, we used the same value for each image of the same category and estimated predictive performance using cross-validation". Is it surprising that predictions are not good with one image per category? Is this a reasonable comparison?
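[To make concrete what such a baseline might involve, here is an illustrative sketch of a cross-validated linear encoding model from the 49 dimension scores to voxel responses. It is not the authors' implementation, and all names and modeling choices are placeholders.]

```python
# Illustrative sketch of a baseline voxel-wise encoding model: predict voxel
# responses from the 49 dimension scores with cross-validated linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def encoding_performance(dim_scores, voxel_responses, n_splits=5):
    """dim_scores: (n_images, 49); voxel_responses: (n_images, n_voxels).
    Returns the mean correlation between predicted and measured responses per voxel."""
    n_images, n_voxels = voxel_responses.shape
    preds = np.zeros((n_images, n_voxels))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(dim_scores):
        reg = LinearRegression().fit(dim_scores[train_idx], voxel_responses[train_idx])
        preds[test_idx] = reg.predict(dim_scores[test_idx])
    # Pearson correlation per voxel between cross-validated prediction and data.
    r = [np.corrcoef(preds[:, v], voxel_responses[:, v])[0, 1] for v in range(n_voxels)]
    return float(np.mean(r))
```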

Relatedly, how well was the baseline model able to predict? (I do not think that information was provided.) Did the authors attempt to predict outside the visual brain areas? What would it mean if predictions were still better there?

Minor points:

The authors write: "Please note that, for simplicity, we refer to the similarity matrix derived from this embedding as "ground-truth", even though this is only a predicted similarity". Given this, it does not seem like a good idea to use "ground truth", as this clarification will be lost in future work citing this article.

It would be good to have the 49 interpretable dimensions listed in the supplemental materials rather than having to go to the original paper.

Strengths:

The experiments seem well done.

Weaknesses:

It is not clear what claims are being made.

Author response:

We wish to express our gratitude to the reviewers for their insightful and constructive comments on the initial version of our manuscript. We greatly value their observations and intend to address their remarks in a thorough and constructive manner. Based on the editors' and reviewers' feedback, we realize that it was not entirely clear that we intended this work primarily as a resource rather than as a source of strong insights into DNN-human alignment. Since our method also covers the broad range of natural objects used in the vast majority of studies on object processing, we feel we did not sufficiently highlight the breadth of the tool. Based on the editors' assessment, our explorations into the limits of the method - which we saw as a strength, not a weakness, of our work - perhaps somewhat overshadowed its otherwise broad applicability. We hope to clarify this in the revised manuscript. Beyond these general remarks, we would like to address the following four points:

• Where feasible, we intend to undertake additional analyses and refine existing ones. For instance, we plan to provide noise ceilings for all datasets where such calculations are possible, and we will give careful consideration to implementing a permutation or label-shuffling test to explore some of the ideas shared by the reviewers.

• We plan to discuss more thoroughly several topics raised by the reviewers (e.g., how our approach might contend with different experimental situations, such as when using line drawings as stimuli).

• We aim to enhance the clarity of our manuscript throughout. This will include refining the wording of our abstract and offering a more detailed explanation of the methods employed in the fMRI analyses.

• We plan to elaborate further on our line of reasoning by addressing potential sources of misunderstanding—such as clarifying what we mean by a “lack of data” and providing greater detail regarding the nature of the 49-dimensional embedding.
