Cost functions and different perceived similarity sampling regimes.

(a) For a stimulus set with n stimuli (x-axis), the cost of sampling all similarities (y-axis) grows differently under different sampling regimes. (b) For most classical behavioral similarity tasks, such as the pairwise rating task (left), cost grows superlinearly. Ideally, behavioral tasks should instead have linear cost increases, like the task we present in this study (middle). Flat costs (right) are attainable only when utilizing deep neural networks (DNNs) and applying an automatic method like the one we explore in this study, which predicts the whole embedding for a given image set in one go; that embedding can then be used to compute the whole representational similarity matrix.
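As a back-of-the-envelope illustration of the three regimes, the sketch below counts behavioral trials under hypothetical task designs; only the growth rates (quadratic, linear, constant), not the per-trial constants, follow from the figure.

```python
# Back-of-the-envelope comparison of the three sampling regimes.
# The per-trial constants are hypothetical; only the growth rates
# (quadratic, linear, constant) follow from the task designs.

def pairwise_cost(n: int) -> int:
    """Pairwise rating task: every unordered pair is rated once."""
    return n * (n - 1) // 2          # O(n^2)

def linear_cost(n: int) -> int:
    """Dimension-rating task: each image is rated once per dimension."""
    n_dimensions = 49                # SPoSE embedding size used in the study
    return n * n_dimensions          # O(n)

def flat_cost(n: int) -> int:
    """DNN-based prediction: one model predicts all embeddings at once."""
    return 1                         # O(1) in behavioral trials

for n in (100, 1_000, 1_854):
    print(n, pairwise_cost(n), linear_cost(n), flat_cost(n))
```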

Overview of our approach.

For a given image set, we extracted image activations from various image-computable neural network models (top). Based on these activations, we trained an L2-penalized multi-output multiple regression encoding model to predict a 49-dimensional Sparse Positive Similarity Embedding (SPoSE). We then used this embedding to construct a DNN-predicted representational similarity matrix (RSM), which we validated against a behavioral ground-truth RSM. To compare our DNN-based approach to human performance, we conducted crowdsourcing to obtain human-generated ratings of all images on the same embedding (bottom). Based on this embedding, we computed a human-constructed RSM, which we likewise validated against the ground-truth RSM. Images in this figure were taken from the THINGSplus dataset and are used for illustrative purposes only (ref. 79).
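A minimal sketch of this pipeline is given below, assuming scikit-learn's Ridge for the L2-penalized multi-output regression and a dot-product similarity for building the RSM; the stand-in data, hyperparameters, and the exact similarity measure are illustrative assumptions, not the study's settings.

```python
# Sketch of the DimPred pipeline (illustrative assumptions throughout):
# ridge regression from DNN activations to 49 SPoSE dimensions, then a
# dot-product RSM validated against a ground-truth RSM.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2048))    # stand-in DNN activations
Y_train = rng.random(size=(500, 49))      # stand-in SPoSE dimension values
X_test = rng.normal(size=(48, 2048))      # activations for a new image set
rsm_gt = rng.random(size=(48, 48))        # stand-in ground-truth RSM
rsm_gt = (rsm_gt + rsm_gt.T) / 2          # make it symmetric

model = Ridge(alpha=1.0)                  # L2-penalized multi-output regression
model.fit(X_train, Y_train)
embedding = model.predict(X_test)         # (48, 49) predicted SPoSE embedding

rsm_pred = embedding @ embedding.T        # DNN-predicted RSM (dot products)

iu = np.triu_indices_from(rsm_pred, k=1)  # compare off-diagonal entries only
r, _ = pearsonr(rsm_pred[iu], rsm_gt[iu])
print(f"correlation with ground truth: {r:.2f}")
```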

Predictive performance of all computational models for perceived similarity.

For the reference image set of 1,854 images, and separately for each computational model (x-axis), we Pearson-correlated the predicted and ground-truth similarity scores (y-axis). Across models, DimPred outperformed established representational similarity methods that relate representational spaces to each other (cRSA and FR-RSA). Amongst all computational models, OpenCLIP models performed exceptionally well. An ensemble DimPred model consisting of the best 8 computational models performed even better (denoted by a star symbol). For a mapping of computational model name abbreviations to full model names, please see Supplement 4.
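How the top-8 ensemble combines models is not spelled out in this caption; the sketch below assumes a simple average of the models' predicted RSMs, which should be taken as one plausible reading rather than the study's definition.

```python
# Hedged sketch of an ensemble DimPred model: average the predicted
# RSMs of the top-k computational models. The averaging rule is an
# assumption; embeddings or rank-transformed RSMs could be combined instead.
import numpy as np

def ensemble_rsm(rsms: list[np.ndarray]) -> np.ndarray:
    """Element-wise average of equally shaped predicted RSMs."""
    return np.mean(np.stack(rsms), axis=0)

rng = np.random.default_rng(0)
top8 = [rng.random(size=(48, 48)) for _ in range(8)]  # stand-in RSMs
rsm_ensemble = ensemble_rsm(top8)
```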

Predictive performance of all computational models for out-of-sample heterogeneous image sets.

Our dimension prediction approach (DimPred) yielded similarity scores that correlated very highly with ground-truth similarity scores across five different heterogeneous validation sets. Across image sets, DimPred outperformed classical RSA (cRSA; left) in 256 and feature-reweighted RSA (FR-RSA; right) in 231 of 265 cases, respectively. The best computational model for each image set is indicated using the model abbreviation from Supplement 4. An ensemble DimPred model consisting of the best 8 computational models performed even better (denoted by a star symbol for each image set in the left panel).
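For readers unfamiliar with the cRSA baseline, the sketch below shows the common variant that correlates an unweighted activation-based RSM directly with the ground-truth RSM; the distance and correlation measures used here are assumptions, not necessarily the study's exact choices.

```python
# Sketch of classical RSA (cRSA) as a baseline: build an RSM directly
# from raw model activations and correlate it with the ground truth.
# Pearson correlations are assumed for both steps.
import numpy as np
from scipy.stats import pearsonr

def crsa_score(activations: np.ndarray, rsm_gt: np.ndarray) -> float:
    rsm_model = np.corrcoef(activations)       # (n_images, n_images)
    iu = np.triu_indices_from(rsm_model, k=1)
    return pearsonr(rsm_model[iu], rsm_gt[iu])[0]

rng = np.random.default_rng(0)
acts = rng.normal(size=(48, 2048))             # stand-in activations
rsm_gt = np.corrcoef(rng.normal(size=(48, 100)))
print(crsa_score(acts, rsm_gt))
```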

Correspondence of human-derived, human & best-single-DNN combined, and human & ensemble-DNN combined similarity matrices with ground-truth similarity matrices.

For (a) the 48-image-generalization image set and (b) the 48-concept-generalization image set, the representational similarity matrix (RSM) based solely on human ratings of image dimension values (top row), an average of that RSM and the best single DNN-predicted RSM (middle row), and an average of the human-based RSM and the top-8 ensemble DimPred model RSM (bottom row) were each correlated with the respective ground-truth RSM. Humans provided sensible dimensional ratings of images, as evidenced by their RSMs correlating highly with the ground-truth RSMs; however, enriching the human-based RSMs with the RSMs yielded by DimPred for OpenCLIP-RN50×64 (OpenAI, visual) or by the ensemble DimPred model notably increased correspondence with the ground-truth RSMs compared to using human-derived RSMs alone.
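The caption describes an unweighted average of the human-based and DNN-predicted RSMs; a minimal sketch of that combination follows (whether the RSMs are normalized before averaging is an assumption left open here).

```python
# Combining a human-based RSM with a DNN-predicted RSM by simple
# averaging, as the caption describes, then scoring against ground truth.
import numpy as np
from scipy.stats import pearsonr

def combine_and_score(rsm_human, rsm_dnn, rsm_gt):
    rsm_combined = (rsm_human + rsm_dnn) / 2
    iu = np.triu_indices_from(rsm_combined, k=1)
    return pearsonr(rsm_combined[iu], rsm_gt[iu])[0]

rng = np.random.default_rng(0)
rsm_h, rsm_d, rsm_g = (rng.random(size=(48, 48)) for _ in range(3))  # stand-ins
print(combine_and_score(rsm_h, rsm_d, rsm_g))
```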

Predictive performance of all computational models for out-of-sample homogeneous image sets.

DimPred yielded similarity scores that did not correlate well with ground-truth similarity scores for five validation sets containing homogeneous images. Across image sets, DimPred was outperformed by classical RSA (cRSA; left) in 73 and by feature-reweighted RSA (FR-RSA; right) in 59 of 265 cases, respectively. An ensemble DimPred model consisting of the best 8 computational models sometimes performed slightly better (denoted by a star symbol for each image set in the left panel).

Visualization of image regions relevant for predicting different dimensions underlying similarity judgments.

For each of the example images, the three most important SPoSE dimensions are shown. In addition, the overall relevance for similarity prediction reflects a weighted integration across all dimensions, highlighting candidate image regions that are behaviorally relevant for similarity judgments and categorization behavior.
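The caption does not name the attribution method behind these relevance maps; the sketch below uses simple patch occlusion purely to illustrate how dimension-wise maps could be integrated with weights across all 49 dimensions. The function predict_embedding is a hypothetical stand-in for a trained DimPred model.

```python
# Hedged sketch: occlusion-based relevance maps per SPoSE dimension,
# then a weighted integration across all dimensions. This is one way
# such maps could be obtained, not the study's actual method.
import numpy as np

def occlusion_relevance(predict_embedding, image, patch=16):
    """predict_embedding: callable mapping an image to a (49,) vector."""
    base = predict_embedding(image)             # (49,) unoccluded prediction
    H, W = image.shape[:2]
    rel = np.zeros((49, H, W))
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            occluded = image.copy()
            occluded[y:y+patch, x:x+patch] = 0  # zero out one patch
            drop = base - predict_embedding(occluded)
            rel[:, y:y+patch, x:x+patch] = drop[:, None, None]
    return rel                                  # per-dimension relevance maps

def similarity_relevance(rel, weights):
    """Weighted integration across all 49 dimensions -> one (H, W) map."""
    return np.tensordot(weights, rel, axes=1)
```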

Prediction accuracy improvement of an fMRI voxel-wise encoding model based on the DimPred embedding.

For three subjects, we computed an encoding model to predict voxel-wise fMRI activity, using either a coarser, category-insensitive object image embedding or the category-sensitive, image-specific embedding produced by DimPred based on activations from OpenCLIP-RN50×64. The relative improvement in encoding-model prediction accuracy when using DimPred is color-coded separately for each subject on a cortical flat map.
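A minimal sketch of such a voxel-wise encoding model follows, assuming ridge regression, cross-validated prediction, and per-voxel correlation as the accuracy measure; the dimensions of the stand-in data and the estimator settings are illustrative, not the study's.

```python
# Voxel-wise encoding model sketch: predict each voxel's response from
# the 49-dimensional DimPred embedding and score per-voxel accuracy.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
E = rng.random(size=(720, 49))      # DimPred embedding, one row per stimulus
V = rng.normal(size=(720, 500))     # stand-in fMRI data (stimuli x voxels)

pred = cross_val_predict(Ridge(alpha=1.0), E, V, cv=5)

# Prediction accuracy per voxel: correlation of measured and predicted.
acc = np.array([np.corrcoef(V[:, v], pred[:, v])[0, 1]
                for v in range(V.shape[1])])
print(acc.mean())
```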