Statistical inference on representational geometries
Abstract
Neuroscience has recently made much progress, expanding the complexity of both neural activity measurements and braincomputational models. However, we lack robust methods for connecting theory and experiment by evaluating our new big models with our new big data. Here, we introduce new inference methods enabling researchers to evaluate and compare models based on the accuracy of their predictions of representational geometries: A good model should accurately predict the distances among the neural population representations (e.g. of a set of stimuli). Our inference methods combine novel 2factor extensions of crossvalidation (to prevent overfitting to either subjects or conditions from inflating our estimates of model accuracy) and bootstrapping (to enable inferential model comparison with simultaneous generalization to both new subjects and new conditions). We validate the inference methods on data where the groundtruth model is known, by simulating data with deep neural networks and by resampling of calciumimaging and functional MRI data. Results demonstrate that the methods are valid and conclusions generalize correctly. These data analysis methods are available in an opensource Python toolbox (rsatoolbox.readthedocs.io).
Editor's evaluation
Schütt and colleagues introduce a new method for statistical inference on representational geometries based on a crossvalidated twofactor bootstrap that allows for generalization across both participants and stimuli while allowing the fitting of flexible models. In a series of elegant simulations and empirical analyses on existing datasets, the authors validate the method statistically. The work provides a fundamental and compelling advance for the analysis of representational geometries.
https://doi.org/10.7554/eLife.82566.sa0Introduction
Experimental neuroscience has recently made rapid progress with technologies for measuring neural population activity. Spatial and temporal resolution, as well as the coverage of measurements across the brains of animals and humans have all improved considerably (Parvizi and Kastner, 2018; Abbott et al., 2020; Wang and Xu, 2020; Allen et al., 2021; Guo et al., 2021; Uğurbil, 2021; Bandettini et al., 2021). Activity is measured using a wide range of techniques, including electrode recordings (Jun et al., 2017; Steinmetz et al., 2018; Parvizi and Kastner, 2018), calcium imaging (Wang and Xu, 2020), functional magnetic resonance imaging (fMRI; Allen et al., 2021; Uğurbil, 2021; Bandettini et al., 2021), and scalp electro and magnetoencephalography (EEG and MEG; Baillet, 2017; Craik et al., 2019). In parallel to the advances in measuring brain activity, theoretical neuroscience has substantially scaled up braincomputational models that implement computational theories (e.g. Kriegeskorte, 2015; Kell et al., 2018; Kubilius et al., 2019; Zhuang et al., 2021). The engineering advances associated with deep learning (e.g. Paszke et al., 2019; Abadi et al., 2015) provide powerful tools for modeling brain information processing for complex, naturalistic tasks (LeCun et al., 2015). How to leverage the new big data to evaluate the new big models, however, is an open problem (Stevenson and Kording, 2011; Sejnowski et al., 2014; Smith and Nichols, 2018; Kriegeskorte and Douglas, 2018).
An important concept for understanding neural population codes is the concept of representational geometry (Shepard and Chipman, 1970; Edelman et al., 1998; Edelman, 1998; Norman et al., 2006; Diedrichsen and Kriegeskorte, 2017; Kriegeskorte et al., 2008a; Kriegeskorte et al., 2008b; Connolly et al., 2012; Xue et al., 2010; KhalighRazavi and Kriegeskorte, 2014; Yamins et al., 2014; Cichy et al., 2014; Haxby et al., 2014; Freeman et al., 2018; Kietzmann et al., 2019; Stringer et al., 2019; Chung et al., 2018; Chung and Abbott, 2021; Kriegeskorte and Wei, 2021). Neural activity patterns that represent particular pieces of mental content, such as the stimuli presented in a neurophysiological experiment, can be viewed as points in the multivariate neural population response space of a brain region. The representational geometry is the geometry of these points. The geometry is characterized by the matrix of distances among the points. This distance matrix abstracts from the roles of individual neurons and provides a summary characterization of the neural population code that can be directly compared among animals and between brain and model representations (e.g. a cortical area and a layer of a neural network model). The representational geometry provides a multivariate characterization of a neural population code that can be motivated as a generalization of linear decoding analyses. A linear decoder reveals a single projection of the geometry. The full distance matrix (when measured after a transform that renders the noise isotropic) captures what information is available in any linear projection (Kriegeskorte and Diedrichsen, 2019a).
A popular method for analyzing representational geometries (Kriegeskorte and Kievit, 2013) on which we build here is representational similarity analysis (RSA; Kriegeskorte et al., 2008a; Nili et al., 2014). RSA is a threestep process (Figure 1): In the first step, RSA characterizes the representational geometry of the brain region of interest (ROI) by estimating the representational distance for each pair of experimental conditions (e.g. different stimuli). The distance estimates are assembled in a representational dissimilarity matrix (RDM). We use the more general term ‘dissimilarity’ here to include dissimilarity measures that are not distances or metrics in the mathematical sense, such as crossvalidated distance estimators that can return negative values. This relaxation enables inclusion of measures that are not biased by the noise in the data (Kriegeskorte et al., 2007; Nili et al., 2014; Walther et al., 2016; Kriegeskorte and Diedrichsen, 2019a), returning values distributed symmetrically about 0, when the true distance is 0, but patterns are noisy estimates. In the second step, each model is evaluated by the accuracy of its prediction of the data RDM. To this end, an RDM is computed for each model representation. Each model’s prediction of the data RDM is evaluated using an RDM comparator, such as a correlation coefficient. In the third step, models are inferentially compared to each other in terms of their RDM prediction accuracy to guide computational theory.
RSA is widely used (Kriegeskorte and Kievit, 2013; Haxby et al., 2014; Kriegeskorte and Diedrichsen, 2019a) and has gained additional popularity with the rise of imagecomputable representational models like deep neural networks (e.g. Krizhevsky et al., 2012; Yamins et al., 2014; KhalighRazavi and Kriegeskorte, 2014; Mehrer et al., 2017; Kriegeskorte, 2015; Yamins and DiCarlo, 2016; Xu and VaziriPashkam, 2021; Konkle and Alvarez, 2022; Cichy et al., 2016). There has been important recent progress with methods for estimating representational distances (step 1) as well as measures of RDM prediction accuracy (step 2). For RDM estimation, biased and unbiased distance estimators with improved reliability have been proposed (Nili et al., 2014; Cai et al., 2019; Walther et al., 2016). For quantification of the RDM prediction accuracy, the sampling distribution of distance estimators has been derived and measures of RDM prediction accuracy that take the dependencies between dissimilarity estimates into account have been proposed (Diedrichsen et al., 2020). However, existing statistical inference methods for RSA (step 3) have important limitations. Established RSA inference methods (Nili et al., 2014) provide a noise ceiling and enable comparisons of fixed models with generalization to new subjects and conditions. However, they cannot handle flexible models, can be severely suboptimal in terms of statistical power, and have not been thoroughly validated using simulated or real data where ground truth is known. Addressing these shortcomings poses three substantial challenges. (1) Modelcomparative inference with generalization to new conditions is not trivial because new conditions extend an RDM and the evaluation depends on pairwise dissimilarities, thus violating independence assumptions. (2) Standard methods for statistical inference do not handle multiple random factors — subjects and conditions in RSA. (3) Flexible models, that is models that have parameters enabling them to predict different RDMs, are essential for RSA (Diedrichsen et al., 2018; Kriegeskorte and Diedrichsen, 2016). Evaluation of such models requires methods that are unaffected by overfitting to either subjects or conditions to avoid a bias in favor of more flexible models.
Here, we introduce a comprehensive methodology for statistical inference on models that predict representational geometries (Figure 1). We introduce novel bootstrapping methods that support generalization of modelcomparative statistical inferences to new subjects, new conditions, or both simultaneously, as required to support the theoretical claims researchers wish to make. We also introduce a novel crossvalidation method for estimation of the RDM prediction accuracy of flexible models, that is models with parameters fitted to the data (KhalighRazavi and Kriegeskorte, 2014; Kriegeskorte and Diedrichsen, 2016). This is important, because theories do not always make a specific prediction for the representational geometry. There may be unknown parameters, such as the relative prevalences of different tuning functions (KhalighRazavi and Kriegeskorte, 2014; Jozwik et al., 2016) in the neural population or properties of the measurement process (Kriegeskorte and Diedrichsen, 2016). The combination of our 2factor bootstrap and 2factor crossvalidation methods enables statistical comparisons among fixed and flexible models that generalize across subjects and conditions.
We thoroughly validate the new inference methods using simulations and neural activity data. Extensive simulations based on deep neural network models and models of the measurement process enable us to test modelcomparative inference in a setting where the groundtruth model (the one that actually generated the data) is known. These simulations confirm the validity of the inference procedures and their ability to generalize to the populations of subjects and/or conditions. We also validated the methods on real data from calcium imaging (mouse) and functional MRI (human). For both datasets, we confirm that conclusions generalize from an experimental dataset (a subset of the real data) to the entire dataset (which serves as a standin for the population). The statistical inference methodology described in this paper is available in a new opensource RSA toolbox written in Python (https://github.com/rsagroup/rsatoolbox, copy archived at Schütt, 2023).
Results
We now introduce the 2factor bootstrap procedure for modelcomparative inference and the 2factor crossvalidation procedure for unbiased evaluation of flexible models. This paper also introduces a new representational dissimilarity estimator for electrophysiological recordings of patterns of firing rates across a population of neurons, based on the KLdivergence between Poisson distributions (Appendix 2) and a faster alternative to the rank correlation ${\tau}_{a}$ as an RDM comparator (Nili et al., 2014), which we call ${\rho}_{a}$ (Appendix 3). The proposed inferential methods work for any representational dissimilarity measure and any RDM comparator. We evaluate alternative RDM comparators in terms of their power in Appendix 6. A complete description of all steps of the new methodology can be found in the Materials and methods (Full description of the RSA method).
Methods for inference on representational geometries
A simple approach to inferential comparison of two models is to compute the difference between the models’ performance estimates for each subject and use Student’s $t$test (or a nonparametric alternative). However, inference then only takes the variability over subjects into account and thus does not justify generalization to different experimental conditions (e.g. different stimuli). Computational neuroscience usually pursues insights that generalize not only to a population of subjects but also to a population of conditions (Yarkoni, 2020). To support generalization to the population of conditions statistically, we require uncertainty estimates that treat the experimental conditions as a random sample from a population (Kriegeskorte et al., 2008a), whether or not the subjects are treated as a random sample.
For frequentist inference, the challenge is to estimate how variable the modelperformance estimates would be if we repeated the experiment many times with new subjects and/or conditions. We would like to know (1) the variance of each model’s performance estimate and (2) the variance of the estimated performance difference for each pair of models. The variance of modelperformance estimates enables us to statistically compare each model to a fixed value such as an RDM correlation of 0. The variance of our estimate of modelperformance difference enables us to statistically compare two models to each other (see Frequentist tests for model evaluation and model comparison for details).
Estimating the variance of modelperformance estimates for generalization to new subjects and conditions
To estimate the variance of modelperformance estimates across repetitions of the experiment with new conditions, we use a bootstrap method. Bootstrap methods estimate the variance of experimental outcomes by sampling from the measured data with replacement, treating the measured data as an approximation to the population (Efron and Tibshirani, 1994). The population here is the set of experimental conditions of which the actual experimental conditions can be considered a random sample. Because the conditions do not have independent influences on the model evaluations, we cannot compute a sample variance across conditions as we can across subjects to replace the bootstrap.
When we bootstrapresample conditions, we obtain RDMs of the same size as the original RDMs, but some of the conditions will be repeated. Here, we exclude the entries that correspond to the dissimilarity of any condition with itself from the comparisons between RDMs. Simulations confirm that this procedure yields a good estimate of how variable the results are when we sample new conditions with the same subjects (Figures 4a and 6g).
For simultaneous generalization to the populations of both conditions and subjects, we can employ a 2factor bootstrap (Figure 1b) as introduced previously (Nili et al., 2014; Storrs et al., 2021). However, our simulations and theory here show that a naive 2factor bootstrap approach triplecounts the variance contributed by the measurement noise (Methods, Estimating the uncertainty of our modelperformance estimates, Figures 4c and 7c). This effect is not unique to RSA; a naive 2factor bootstrap will triplecount variance related to the measurement noise for any type of experiment in which two factors (here subject and condition) jointly determine the experimental outcome. The true variance ${\sigma}_{both}^{2}$ of the experimental outcome when sampling both factors can be separated into a contribution from condition sampling (${\sigma}_{cond}^{2}$), a contribution from subject sampling (${\sigma}_{subj}^{2}$), and a contribution of the interaction of subjects and conditions or measurement noise (${\sigma}_{noise}^{2}$).
This decomposition is for the actual variance ${\sigma}_{both}^{2}$ across repeated experiments with new subjects and conditions. The variance ${\widehat{\sigma}}_{both}^{2}$ of the naive 2factor bootstrap can likewise be decomposed into three additive terms (Online Methods, Estimating the uncertainty of our modelperformance estimates), corresponding to subject sampling, condition sampling, and the interaction and/or noise. However, in the naive 2factor bootstrap estimate ${\widehat{\sigma}}_{both}^{2}$, the independent noise contribution enters not only its own term, but also the two others. Thus, the original bootstrap estimate contains the noise variance component three times instead of once:
This problem can be understood by considering the 1factor bootstraps, which also contain the independent noise component although it has not been added explicitly:
When we bootstrap two factors, this automatic inclusion of the noise component happens three times. We confirmed this by both theory and simulation. The overestimate of the variance renders the naive 2factor bootstrap conservative and not optimally powerful.
To correct the variance estimate, we introduce a novel corrected 2factor bootstrap procedure to estimate the variance: We first compute the 1factor bootstrap variance estimates ${\widehat{\sigma}}_{subj}^{2}$ and ${\widehat{\sigma}}_{cond}^{2}$. We also compute the naive 2factor bootstrap estimate ${\widehat{\sigma}}_{both}^{2}$. We can then linearly combine the variances from these three bootstraps to cancel the surplus contribution from the measurement noise. This procedures yields a corrected 2factor bootstrap estimate ${\widehat{\sigma}}_{c2f}^{2}$ that has approximately the right expected value:
The approximations in these equations are due to $\frac{N1}{N}$ factors that apply to the individual terms. We give the exact formulae including these factors in the methods section (Estimating the uncertainty of our modelperformance estimates). We show in multiple simulations that this estimate approximates the correct variance better than the uncorrected 2factor bootstrap (Figures 4c and 7c).
To stabilize the estimator and eliminate the possibility of a negative variance estimate, we bound the estimate from above and below. We use both ${\widehat{\sigma}}_{subj}^{2}$ and ${\widehat{\sigma}}_{stim}^{2}$ as lower bounds for the estimate as the variances they estimate are always smaller than the true variance. As an upper bound, we use ${\widehat{\sigma}}_{both}^{2}$, the naive, conservative estimate. Bounding slightly biases the variance estimate, but reduces its variability and ensures that it is strictly positive.
Evaluating the performance of flexible models
We often want to test flexible models, that is models that have parameters to be fitted to the brainactivity data. Two elements that often require fitting are weights for the model features and parameters of a measurement model. Feature weighting is required when a model is not meant to specify a priori how prevalent different tuning profiles are in the neural population or in the measured signals. For example, for deep neural network representations to match brain responses well, it is usually necessary to weight the features (e.g. Yamins et al., 2014; KhalighRazavi and Kriegeskorte, 2014; KhalighRazavi et al., 2017; Storrs et al., 2021). A flexible measurement model may be necessary to account for the process of measurement, which may subsample, average, or distort neural responses. For example, fMRI voxels average the neural activity locally, which can be modeled with a parameter for the local averaging range, and electrophysiological recordings may preferentially sample certain classes of neurons (Kriegeskorte and Diedrichsen, 2016).
To avoid the bias in the modelperformance estimates that can result from overfitting of flexible models, we use crossvalidation. Crossvalidation means that we partition the dataset into separate test sets. In each fold of crossvalidation, we then fit the models to all but one set and evaluate on the heldout set. Taking the average over the folds yields a single performance estimate. As for bootstrapping, crossvalidation is performed over both conditions and subjects so as to avoid overestimating the generalization performance of flexible models when tested on new subjects and new conditions drawn from the populations of subjects and conditions sampled in the actual experiment (Figure 1b).
Because the RDM for the test set must contain multiple values to allow a sensible comparison, the smallest possible number of conditions to perform crossvalidation is 6, which would yield three test conditions for twofold crossvalidation. For small numbers of conditions, we use twofolds. We use threefolds for ≥12 conditions, fourfolds for ≥24 conditions, and fivefolds for ≥40 conditions. These numbers seem to work reasonably well, but were chosen ad hoc.
To estimate our uncertainty about the crossvalidated model performances, we use the same bootstrap methods as for fixed models. To do so, we need to perform crossvalidation on each bootstrap sample. We call this procedure bootstrapwrapped crossvalidation.
In any crossvalidation, different ways to partition the data into test sets lead to different overall evaluations of the models. When we partition the conditions set into disjoint test sets in RSA, this effect is particularly strong, because dissimilarities between conditions in separate test sets do not contribute to the evaluation in any fold. The variance in the evaluations created by this random assignment is generated by our analysis and would vanish if we performed repeated cycles of crossvalidation with all possible partitionings of the conditions set into test sets. Unfortunately, such exhaustive crossvalidation will usually be prohibitively expensive in terms of computation time, especially in bootstrapwrapped crossvalidation.
We can estimate the variance without this surplus by sampling ${n}_{cv}>1$ different randomly chosen partitionings of the conditions set into crossvalidation test sets for each bootstrap sample. Each of the ${n}_{cv}$ partitionings into $k$ subsets defines a complete cycle of $k$fold crossvalidation. The bootstrapwrapped crossvalidation estimate of the variance of the modelperformance estimates with ${n}_{cv}$ crossvalidation cycles will be larger than the variance ${\sigma}_{boot}^{2}$ of the exact mean performance over all possible partitionings of a dataset. When we assume that the variance ${\sigma}_{cv}^{2}$ of randomly chosen partitionings around the mean is equal for each bootstrap sample, the overall variance ${\sigma}_{bootcv,{n}_{cv}}^{2}$ is:
When we have more than one cycle of crossvalidation for each bootstrap sample, it is straightforward to compute an estimate for the variance we would have gotten if we had drawn only a single partitioning ${\sigma}_{bootcv,1}^{2}$. We can simply use only the $i$ th partitioning for each bootstrap to estimate the variance and average these estimates. Using these two variance estimates for 1 and ${n}_{c}v$ partitionings, we can simply solve for the variance contributions of the random partitioning and of the bootstrap:
Thus, we can directly compute an estimate of the variance we expect for exhaustive crossvalidation from two or more crossvalidation cycles using random partitionings for each bootstrap sample. The repetition across bootstrap samples enables a stable estimate even for ${n}_{cv}=2$. The average estimate is independent of ${n}_{cv}$ (Figure 2a). We could invest computation in increasing either the number of bootstrap samples or the number of crossvalidation cycles per bootstrap sample. Our simulations show that the reliability of the bootstrap estimate of the variance of the modelperformance estimate improves more when we increase the number of bootstrap samples than when we increase the number of crossvalidation cycles per bootstrap sample (Figure 2b). Thus, we recommend using only two crossvalidation cycles per bootstrap sample.
This crossvalidation approach provides modelperformance estimates that are not biased by overfitting of flexible models to either subjects or conditions. Fixed and flexible models with different numbers of parameters can be robustly compared with generalization over conditions and/or subjects. The method can handle any model that can be fitted efficiently enough (for the types of flexible models we actually implemented, see Methods, Flexible models).
Validation of the statistical inference methods
We validate the inference methods using simulations, functional MRI data, and neural data. First, we establish that the statistical tests for model comparison are valid, controlling the falsepositive rate at the nominal level. This requires simulating data under the null hypothesis, where two models that predict distinct RDMs are exactly equal in their RDM prediction accuracy. We use a matrixnormal model to simulate this null scenario for model comparison. Second, we show that the estimates of our uncertainty about model performance correctly capture the true variability for different generalization schemes in more realistic simulated scenarios based on neural network models. In these simulations, we cannot simulate the null hypothesis of two models that predict the representational geometry equally accurately. We also use these more realistic simulations to evaluate the power afforded by different RDM comparators. Third, we validate the inference procedure for flexible models, confirming that our bootstrapwrapped crossvalidation scheme correctly accounts for the overfitting of flexible models. Fourth, we validate the methods using real data, acquired with functional MRI in humans and calcium imaging in mice.
Validity of inferential model comparisons
A frequentist test is valid when the rate of false positives (i.e. the rate of positive results when the null hypothesis is true) does not exceed the specified error rate α (e.g. 5%). Here, we check the validity of modelcomparative inference, where the null hypothesis is that the two models perform equally well at explaining the representational geometry. We simulate scenarios where two models predict distinct geometries, but perform equally well on average at predicting the true representational geometry.
To simulate situations where two different models perform equally well, we generated conditionresponse matrices (containing an activity level for each combination of condition and response channel) by sampling from matrixnormal density models. A matrixnormal distribution over matrices yields matrices with normally distributed cells whose covariance is separable into a covariance matrix across rows and one across columns. In our case, rows correspond to the experimental conditions (e.g. stimuli) and the columns correspond to measurement channels (e.g. neurons or voxels). For matrixnormal data, the covariance across conditions captures the similarity among conditionrelated response patterns and determines the expected squared Euclideandistance RDM (Diedrichsen and Kriegeskorte, 2017). The covariance among channels only scales the covariance of the distance estimates. This relationship enables us to generate matrixnormal data for arbitrary choices of the expected squared Euclideandistance RDM. To model the null hypothesis, we choose two models that predict distinct RDMs and generate data, such that the expected data RDM has equal Pearson correlation to both model RDMs (results in Appendix 1—figure 1; details in Appendix 1).
We first evaluated the bootstrap in the scenario, where the goal is to generalize across subjects only. All modelcomparative subjectonly bootstrap tests were found to be valid (Appendix 1—figure 1). Inflated falsepositive rates were observed for subjectonly bootstrap tests only when using a small sample of subjects (<20). For a small number of samples, bootstrapping is known to produce underestimates of the variance by a factor $\frac{n}{n1}$ for $n$ samples (e.g. Efron and Tibshirani, 1994, chapter 5.3). In this scenario, we recommend using a $t$test across subjects, which is more computationally efficient and more accurate than bootstrap methods for small numbers of subjects.
Next, we tested bootstrapping for generalization to new conditions. In this scenario, the bootstrap methods were all conservative, showing falsepositive rates substantially below 5% (Appendix 1—figure 1). This is expected, because we did not include any random selection of conditions in our data simulation, but enforced the H_{0} exactly for the measured conditions.
To assess how problematic it is to choose an inference method that ignores the variance due to condition sampling, we ran a simulation in which we sampled the conditions from a large pool. We generated two models that perform equally well on 1000 conditions using matrixnormal sampling and then sampled a smaller set of these conditions for the simulated experiment. In these simulations, all techniques that only take subjects into account as a random factor fail catastrophically (Appendix 1—figure 1), with falsepositive rates growing with the number of simulated subjects and reaching 60% at 40 simulated subjects. In contrast, our bootstrap tests that include condition sampling all remain valid, including the uncorrected 2factor bootstrap and our new corrected 2factor bootstrap with falsepositive rates below the nominal 5%. However, the uncorrected 2factor bootstrap was extremely conservative.
We also validated the tests against chance performance, where a single model is tested and the null hypothesis is that its performance is at chance level. To do so, we performed similar matrixnormal data simulations, evaluating a model that predicts a specific randomly sampled RDM on matrixnormal data consistent with an independently sampled random expected data RDM. Results show that a $t$test across subjects as well as the bootstrap $t$test approaches provide valid inference (Appendix 1—figure 1, top row). The subject $t$test and the corrected 2factor bootstrap $t$test avoid overly conservative falsepositive rates.
We conclude that the tests are valid in these simple simulated H_{0} scenarios, where we are able to estimate the falsepositive rate. In more realistic simulations using neural network models and real data, we can no longer simulate distinct models that predict the data RDM equally well. We therefore restrict ourselves to evaluating our bootstrap estimate of the variance of modelperformance estimates, assuming that the falsepositive rates are adequately controlled when we use an accurate variance estimate.
Criteria for evaluation of inference procedures
To evaluate alternative inference procedures, we perform simulations that reveal (1) whether the estimates of the uncertainty of the modelperformance estimates are accurate (ensuring the validity of the inferences), and (2) how sensitive different model comparison methods are to subtle differences between models (determining the power of the inferences). To measure whether our bootstrap methods correctly estimate the uncertainty of the modelperformance estimates, we compute the relative uncertainty (RU). The RU is the standard deviation of the bootstrap distribution of modelperformance estimates ${\sigma}_{boot}$ divided by the true standard deviation of modelperformance estimates ${\sigma}_{true}$ as observed over repeated simulations:
where ${\sigma}_{i}^{2}$ is the variance estimator of the bootstrap in simulated dataset $i$ of the $N$ simulations. Ideally, we would like the bootstrapestimated variance to match the true variance such that the RU is 1.
To measure how sensitive our analysis is to differences in model performance (e.g. comparing layers of a deep neural network), we define the model discriminability as a signaltonoise ratio (SNR). The signal is the magnitude of modelperformance differences, which is measured as the variance across models of their average of performance estimates across simulations. The noise is the nuisance variation, which includes subject and condition sample variation along with measurement noise. The noise is measured as the average across models of the variance of performance estimates across simulations. This results in the following formula, in which ${\mathrm{Perf}}_{i,m}$ is the performance of model $m$ of $M$ in repetition $i$ of $N$ repetitions of the simulation:
A higher SNR indicates greater sensitivity to differences in model performance: differences between models are larger relative to the variation of modelperformance estimates over repeated simulations. Note that this measure does not depend on the accuracy of the bootstrap because the bootstrap estimates of the variances do not enter this statistic. The SNR exclusively measures how large differences between models are compared to the level of nuisance variation we simulate, which may include random sampling of conditions, subjects, or both (in addition to measurement noise).
Validity of generalization to new subjects and conditions
To test whether our inference methods correctly generalize to new subjects and conditions, we performed a simulation that includes random sampling of both subjects and conditions (Figure 3). We used the internal representations of the deep convolutional neural network model AlexNet (Krizhevsky et al., 2012) to generate fMRIlike simulated data. In each simulated scenario, one of the layers of AlexNet served as the true (datagenerating) model, while all layers were considered as candidate models in the inferential model comparisons. We simulated true voxel responses as local averages of the activities of closeby units in the feature maps of layers of the model. The response of each simulated voxel was a local average of unit responses, weighted according to a 2D Gaussian kernel over the locations of the feature map multiplied by a vector of nonnegative random weights (drawn uniformly from the unit interval) across the features. We then simulated hemodynamicresponse timecourses and added measurement noise. The covariance structure of the noise was determined by the overlap of the simulated voxels’ averaging regions over space and a firstorder autoregressive model over time. The simulated data were subjected to a standard general linear model (GLM) analysis to estimate the conditionresponse matrix. Variation over conditions was generated by using randomly sampled natural images from ecoset (Mehrer et al., 2017) as input to the AlexNet model. Variation over subjects was generated by randomly choosing a new location and a new vector of feature weights for each voxel of a new simulated subject.
We simulated N = 100 datasets for each parameter setting to estimate how variable the modelperformance estimates truly are. In analysis, we must estimate our uncertainty about model performance from a single dataset. To estimate how accurate these estimates were, we compared the uncertainty estimates used by different inference procedures (including different bootstrap methods) to the true variability. This comparison is a enables us to validate our inference despite the fact that we cannot compute falsepositive rates of the model comparison tests. Our neuralnetworkbased simulations do not contain situations that correspond to the H_{0} of two different models with equal performance, which would require that the datagenerating neural network layer predicts an RDM equally similar to those predicted by two other model layers. As expected, the rate of erroneously finding an alternative model outperforming the true datagenerating model was very low (not shown) whenever the type of bootstrap matches the simulated level of generalization because the true layer has a higher average performance than the other models. At the 5% uncorrected significance level, the proportion of cases where any other layer performed significantly better than the true (datagenerating) layer was only 1.524%. This rate reflects the differences between the layers of AlexNet, the simulated variability due to subject, stimulus, and voxel sampling, the simulated noise level, and the number of layers. Tests against the best other layer (chosen based on all data) significantly favor this other layer in only 0.694% of cases. Multiplecomparison correction would reduce these modelselection error rates even further.
To test generalization to either new conditions or new subjects (but not both simultaneously), we kept the other dimension constant. When simulating condition sampling, the true variance across conditions is accurately estimated for 40 or more conditions (Figure 4a) and is overestimated by 1factor bootstrap resampling of conditions (rendering the inference conservative) when we have less than about 40 conditions (Figure 4a). When simulating subject sampling, the true variance across subjects is accurately estimated for 20 or more subjects (Figure 4b) and is underestimated by 1factor bootstrap resampling of subjects (invalidating the inference) when we have very few subjects (Figure 4b). This downward bias corresponds to the $\frac{n}{n1}$ factor between the sample variance and the unbiased estimate for the population variance. Our implementation in the RSA toolbox uses this factor to correct the variance estimate.
To test our corrected 2factor bootstrap method’s ability to generalize to new subjects and new conditions simultaneously, we simulated sampling of both conditions (stimuli) and subjects in our simulations. The corrected 2factor bootstrap estimates the overall variation caused by random sampling of subjects and conditions and by measurement noise much more accurately than the naive 2factor bootstrap (Figure 4c). Cases where an incorrect model (not the datagenerating model) significantly outperformed the true model occurred in only 0.3% of simulations with the corrected 2factor bootstrap, even without any multiplecomparison correction. This proportion would be larger if the alternative models performed more similar to the true model than simulated here. The RSA toolbox adjusts for multiple comparisons, controlling either the familywise error rate or the falsediscovery rate across all pairwise model comparisons.
Overall, we found that the new more powerful corrected 2factor bootstrap method yields accurate estimates of the variance across the simulated populations of subjects and conditions when the dataset is large enough (≥20 subjects, ≥40 conditions) and the type of bootstrap matches the population sampling simulated (subject, condition, or both).
The model discriminability (SNR) increases monotonically with the number of measurements, affording greater power for modelcomparative inference. Model discriminability increases with the amount of data according to a power law (straight line in log–log plot; Figure 4d–f). Such a relationship holds whether we increase the number of conditions, the number of subjects, or the number of repetitions per condition. This result is expected and validates the SNR as an indicator of modelcomparison power. In general, increasing the number of measurements helps most for the factor that causes most variability of the performance estimates, rendering generalization harder. For example, in our deepneuralnetworkbased simulations, the variability over subjects is smaller than the variability across conditions (Figure 4g). In this simulation, it thus increases statistical power more to collect data for more conditions. When there is more variability across subjects, the opposite is expected to hold. An intermediate voxel size (Gaussian kernel width) yielded the highest modelperformance discriminability as measured by the SNR (Figure 4h, see Appendix 5 for more discussion on this topic).
Validity of inference on flexible models
To validate inferential model comparisons involving flexible models, we made a variant of the deep neural network simulation in which we do not assume to know how voxels average local neural responses. As the simulated ground truth, we set the spatial weights for each voxel to a Gaussian with a standard deviation of 5% of the image size (full width at half maximum FWHM ≈ 11.77%) and randomly weighted the feature maps (with weights drawn independently for each voxel and feature map uniformly at random from the unit interval; details in Methods, Neuralnetworkbased simulation).
We then used models that a researcher could generate without knowing the ground truth of how voxels average local features. As building blocks for the models, we computed RDMs for different voxel averaging pool sizes and for different methods to deal with averaging across feature maps. To capture voxel averaging across retinotopic locations, we smoothed the feature maps with Gaussians of different sizes. To capture voxel averaging across feature maps, we (1) generated RDMs computed after taking the average across feature maps at each location (avg), (2) computed the expected RDM for the weight sampling implemented in the simulation (weighted), or (3) computed RDMs without any featuremap averaging (full).
We combined these building blocks into two types of flexible model: selection models and nonnegative linearcombination models. In a selection model, fitting is implemented as selection of the best among a finite set of RDMs. Here we defined one selection model for each method of combining the feature maps. Each selection model contained RDMs computed for different sizes of the local averaging pool. In linearcombination models, fitting consists in finding nonnegative weights for a set of basis RDMs, so as to maximize RDM prediction accuracy. The RDMs contain estimates of the squared Mahalanobis distances, which sum across sets of tuned neurons that jointly form a population code. As component RDMs, we chose the four extreme cases of RDM generation: no pooling across space or averaging across the whole image, each paired with either ‘full’ or ‘avg’ treatment of the feature maps. The resulting fourRDMcomponent linear model approximates the effect of computing the RDM from voxels that reflect the average activity over retinotopic patches of different sizes (Kriegeskorte and Diedrichsen, 2016). For the averaging across feature maps, which uses random weights, there is a strong motivation for using a linear model: When the voxel activities are nonnegatively weighted averages of the underlying neurons with the weights drawn independently from the same distribution, the expected squared Euclidean RDM is exactly a linear combination of the RDM computed based on the univariate populationaverage responses and the RDM based on all neurons (Appendix 4; see also Carlin and Kriegeskorte, 2017). For comparison, we also included fixed RDM models, corresponding to component RDMs of the fitted models.
We found that our bootstrapwrapped crossvalidation (corrected 2factor bootstrap with adjustment for excess crossvaldation variance) yielded accurate estimates of the uncertainty. The relative uncertainties were close to 1 (Figure 5b). The modelperformance discriminability (SNR) was primarily determined by how accurately the different models were able to recreate the true measurement model (Figure 5c). The highest SNRs were achieved when the assumed model matched exactly (weighted feature treatment and voxel size 0.05), but the model variants which allowed for some fitting still yield high SNRs. Analyses that take the averaging across space and features into account yielded the highest average model performance for the true model. In contrast, analyses that ignore averaging over space or features (the full feature set selection model and some of the fixed models) not only lead to lower SNRs (as seen in Figure 5c), but also systematically selected the wrong layer, because a higher average performance was achieved by a different layer than the one we used for generating the data (not shown).
We conclude that when the true voxel sampling is unknown, flexible models are needed to account for voxel sampling, so as to enable us recover the underlying datagenerating computational model with our modelcomparative inference. Fixed models based on incorrect assumptions about the voxel sampling can lead to low modelperformance discriminability (SNR) and even to incorrect inferences as to which model is the true model.
Validation with functional MRI data
The simulations presented so far validated all statistical inference procedures, but may not capture all aspects of the structure of real measurements of brain activity. To test our methods under realistic conditions, we used real human fMRI (this section) and mouse calciumimaging data (next section). We resampled data from a large openly available fMRI experiment in which humans viewed pictures from ImageNet (Horikawa and Kamitani, 2017). These data contain various noise sources, individual differences, signal shapes, and distributions that are difficult to simulate accurately without using measured data. We therefore implemented a databased simulation to create realistic synthetic data, whose groundtruth RDM we knew (Figure 6). By subsampling from this dataset, we generated smaller datasets to test inference with bootstrapping over conditions. We used the entire dataset as a standin for the population a researcher might wish to generalize to. For each cortical area, we computed the mean RDM using all data (all runs and subjects). Each area’s mean RDM served as a groundtruth RDM for datasets sampled from that area and as a model RDM for datasets sampled from all areas. The model comparison we attempted aims to recover which cortical area a dataset was subsampled from. The simulation enables us to check whether our uncertainty estimates are correct for modelperformance estimates based on real data.
We varied the strength of noise, the number of runs, and the number of conditions (i.e. viewed images). We did not vary the number of subjects because the original dataset contains only five subjects, which precludes informative resampling of subjects. To increase the variability of the resampled datasets beyond sampling from the 35 measurement runs and to vary the noise strength, we created new voxel timecourses for each sampled run while preserving the spatial structure and serial autocorrelation of the noise. To achieve this, we estimated a secondorder autoregressive model ($AR(2)$) separately for each run’s GLM residuals, permuted the ARmodel’s residuals and added the results to the GLM’s predicted timecourse (see Figure 6a–e and Methods, fMRIdatabased simulation for details). We repeated each simulated experiment 24 times and used the RU and the modelperformance discriminability (SNR) as our evaluation criteria.
Results were largely similar to those of the neuralnetworkbased simulations (Figure 6f, g). For the RU, which measures the accuracy of our bootstrap variance estimates, we see a convergence toward the expected ratio (dashed line at 1), validating the bootstrap procedure for real fMRI data. For the modelperformance discriminability (SNR), we find the same powerlaw increase with the number of conditions and the number of runs used as data. These results suggest that the regions are discriminable on the basis of their RDMs estimated from fMRI given five subjects’ data when a sufficient number of stimuli (≥30) and runs (≥16) is used.
Validation with calciumimaging data
We can also adjudicate among models of the representational geometry on the basis of direct neural measurements, such as electrophysiological recordings or calciumimaging data. These measurement modalities have very different statistical properties than fMRI. To test our methods for this kind of data, we performed a resampling simulation based on a large calciumimaging dataset of responses of mouse visual cortex to natural images (de Vries et al., 2020). This dataset contains recordings from six visual cortical areas: primary visual cortex (V1), laterointermediate (LM), posteromedial (PM), rostrolateral (RL), anteromedial (AM), and anterolateral (AL) visual area (Figure 7a).
As in the previous section, we used the overall mean RDM for each area as a groundtruth model and subsampled the data to create simulated datasets for which we know the groundtruth RDM. We used different numbers of stimulus repetitions, neurons, mice, and stimuli to vary the amount of information afforded by each simulated dataset. We used the crossnobis estimator of representational dissimilarity for all analyses here. We repeated each simulated experiment 100 times and computed the RU to assess the correctness of our bootstrap uncertainty estimates and the modeldiscriminability SNR to determine which noise covariance estimators and RDM comparators afford most sensitivity to model differences.
We analyzed the overall discriminability of the brain areas (Figure 7b). Although cortical areas vary in the reliability of the estimated RDMs, they can be discriminated reliably when using all data. We used the RU to assess whether our bootstrap variance estimates are correct for these data (Figure 7c). We resampled all factors (subjects, stimuli, runs, and cells) to generate simulted datasets. Correspondingly, the analysis used bootstrapping over both subjects and stimuli. We observed correct variance estimates for the corrected 2factor bootstrap. The uncorrected 2factor bootstrap was conservative, substantially overestimating the true variance.
To understand how the modelcomparative power depends on experimental parameters and analysis choices, we analyzed the modeldiscriminability SNR. We found that more subjects, more stimuli, more runs, and more cells all increased the SNR just as in our fMRI and neuralnetworkbased simulations (Figure 7d). Furthermore, we find that taking the noise covariance into account for computing the crossnobis RDMs in the firstlevel analysis improves the SNR (Figure 7e). Univariate noise normalization (implemented by using a diagonal noise covariance matrix) is better than no noise normalization. Multivariate noise normalization is slightly better than univariate noise normalization (Walther et al., 2016). For multivariate noise normalization, we tested two different shrinkage estimators with different targets: a multiple of the identity and the diagonal matrix of variances. These two variants perform similarly. In addition, we find that different RDM comparators yield the best model discriminability for different cortical areas (Figure 7g). For some, cosine RDM similarity performs better, for others, Pearson RDM correlation performs better. The whitened RDM comparators are better on average, but there are cases where the unwhitened RDM comparators perform slightly better. Thus, it remains dependent on the concrete experiment (with a particular choice of conditions, tested models and underlying representational geometry), which RDM comparator affords the best power for model comparison (Diedrichsen et al., 2020).
Discussion
We present new methods for inferential evaluation and comparison of models that predict brain representational geometries. The inference procedures enable generalization to new measurements, new subjects, and new conditions, treat flexible models correctly using crossvalidation, and work for any representational dissimilarity estimator and RDM comparator. For fixed as well as flexible models, our inference methods support all combinations of generalization: to new measurements using the same subjects and conditions, to new subjects, to new conditions, and to both new subjects and new conditions simultaneously. We validated the methods using simulated data as well as calciumimaging and fMRI data, showing that the inferences are correct. The methods are available as part of an opensource Python toolbox (rsatoolbox.readthedocs.io).
Generalizing to new measurements, new subjects, and/or new conditions
Inferential statistics is about generalization from the experimental random samples to the underlying populations. We must carefully consider the level of generalization, both at the stage of designing our experiments and analyses and at the stage of interpreting the results. The lowest level of inferential generalization is to new measurements. Our conclusions in this scenario are expected to hold only for replications of the experiment in the same animals using the same conditions. Inferential generalization to new subjects may not be possible, for example, in case studies or when the number of animals (e.g. two macaques) is insufficient. Generalization to new conditions is not needed when all conditions relevant to our claims have been sampled. For example, Ejaz et al., 2015 studied the representational similarity of finger movements in primary motor cortex. All five fingers were sampled in the experiments and there are no other fingers to generalize to. When generalizing to replications with the same subjects and conditions, we need separate data partitions to estimate the variability of the modelperformance estimates. We can then use a $t$test or ranksum test to test for significant differences between models.
If generalization to the population of subjects is desired, we need a sufficiently large sample of subjects. We can then evaluate each model for each subject and use a $t$test or ranksum test, treating the subjects as a random sample from a population. We showed that this method is valid, controlling falsepositive rates at their nominal values in our matrixnormal simulations (Methods, Frequentist tests for model evaluation and model comparison). The variance across subjects here is a good estimate of the variance across the population of subjects. However, the interpretation of the results must be restricted to the exact set of experimental conditions used in the experiment.
We often would like our inferences to generalize to a population of conditions. For example, when evaluating computational models of vision, we are not usually interested in determining which models dominate just for the particular visual stimuli presented in our experiment. We are interested in models that dominate for a population of visual stimuli. Modelcomparative inference can generalize to the population of conditions that the experimental conditions were randomly sampled from. The inference requires bootstrapping, because RDM prediction accuracy cannot be assessed for single conditions. We bootstrapresample the conditions set and evaluate all models on each sample. This procedure correctly estimates our uncertainty about modelperformance differences, and $t$tests based on the estimated bootstrap variances provide valid frequentist inference.
If we want to generalize simultaneously across conditions and subjects, then the corrected 2factor bootstrap approach provides accurate estimates of our uncertainty about model performances. These uncertainty estimates support valid inferential model comparisons, comparisons to the lower bound of the noise ceiling, and tests against chance performance. We expect the results to generalize to new subjects and conditions drawn from the respective populations sampled randomly in the experiment.
Inference on fixed and flexible models
Our performance estimates for flexible models must not be biased by overfitting to measurement noise, subjects, or conditions. To avoid this bias, we use a novel 2factor crossvalidation scheme that enables us to evaluate models’ predictive accuracy when simultaneously generalizing to new subjects and/or new conditions. The 2factor crossvalidation is nested in our 2factor bootstrap procedure for estimating uncertainty. By using two crossvalidation cycles with different data partitionings for each bootstrap sample, we can accurately remove the excess variance introduced by crossvalidation. Our method provides a computationally efficient estimate of the variances and covariances of modelperformance estimates for flexible models, which enables us to use a $t$test to inferentially compare models to each other, to the lower bound of the noise ceiling, and to chance performance.
Our methods are fully general in that inference can be performed on any model for which the user provides a fitting and an RDM prediction method. In practice, the complexity of the models is limited by the requirement that we need to fit each model thousands of times in our bootstrapwrapped crossvalidation scheme. Thus, we need a sufficiently fast and reliable fitting method for the model.
If fitting the model so often is not feasible or if the data RDMs do not provide sufficient constraints, one solution is to fit all models using a separate set of neural data before the inferential analyses. This approach is appropriate when many parameters are to be fitted, as is the case in nonlinear systems identification approaches as well as linear encoding models (Wu et al., 2006), where a large set of neural fitting data is required. All conclusions are then conditional on the fitting data: Inference will generalize to new test data assuming models are fitted on the same fitting data. Our methods support fitting of lowerparametric models as part of the modelcomparative inference. When applicable, this approach obviates the need for separate neural data for fitting and supports stronger generalization (not conditional on the neural fitting data).
Supported tests and implications of test results
Our methods enable comparison of a model’s RDM prediction performance (1) against other models, (2) against the noise ceiling, and (3) against chance performance. The first two of these tests are central to the evaluation of models. The test against chance performance is often also reported, but represents a low bar that we should expect most models to pass. In practice, RDM correlations tend to be positive even for very different representations, because physically highly similar stimuli or conditions tend to be similar in all representations. Just like a significant Pearson correlation indicates a dependency, but does not demonstrate that the dependency is linear, a significant RDM prediction result indicates the presence of stimulus information, but does not lend strong support to the particular model. We should resist interpreting significant prediction performance per se as evidence for a particular model (the singlemodelsignificance fallacy; Kriegeskorte and Douglas, 2019b). Theoretical progress instead requires that each model be compared to alternative models and to the noise ceiling. An additional point to note is that the interpretation of chance performance, where the RDM comparator equals 0, depends on the chosen RDM comparator, differing, for example, between the Pearson correlation coefficient and the cosine similarity (Diedrichsen et al., 2020).
RDM comparators like the Pearson correlation and the cosine similarity are related to the distance correlation (Székely et al., 2007), a general indicator of mutual information. Like a significant distance correlation, a significant RDM correlation mainly demonstrates that there is some mutual information between the brain region in question and the model representation. For a visual representation, for example, all that is required is for the two representations to contain some shared information about the input images. In contrast to the distance correlation (and other nonnegative estimates of mutual information), however, negative RDM correlations can occur, indicating simply that pairs of stimuli close in one representation tend to be far in the other and vice versa. For any RDM, there is even a valid perfectly anticorrelated RDM (Pearson $r=1$), which can be found by flipping the sign of all dissimilarities and adding a large enough value to make the RDM conform to the triangle inequality (which ensures the existence of an embedding of points that is consistent with the anticorrelated RDM). The existence of valid negative RDM correlations is important to the inferential methods presented here because it is required for our assumption of symmetric ($t$)distributions around the true RDM correlation.
Omnibus tests for the presence of information about the experimental conditions in a brain region have been introduced in previous studies (e.g. Kriegeskorte et al., 2006; Allefeld et al., 2016; Nili et al., 2020). Whether stimulus information is present in a region is closely related to the question whether the noise ceiling is significantly larger than 0, indicating RDM replicability. Such tests can sensitively detect small amounts of information in the measured activity patterns and can be helpful to assess whether there is any signal for model comparisons. If we are uncertain whether there is a reliable representational geometry to be explained, we need not bother with model comparisons.
The question whether an individual dissimilarity is significantly larger than zero is equivalent to the question whether the distinction between the two conditions can be decoded from the brain activity. Decoding analyses can be used for this purpose (Naselaris et al., 2011; Hebart et al., 2014; Tong and Pratte, 2012; Kriegeskorte and Douglas, 2019b). Such tests require care because the discriminability of two conditions cannot be systematically negative (Allefeld et al., 2016). This is in contrast to comparisons between RDMs, which can be systematically negative (although, as mentioned above, they tend to be positive in practice).
How many subjects, conditions, repetitions, and measurement channels?
Statistical inference gains power when more data are collected along any dimension. More independent measurement channels, more subjects, more conditions, and more repetitions all help. How much data is needed along each of these dimensions depends on the experiment. The most helpful dimension to extend is the one that currently limits generalization. When crossvalidation across repeated measurements is used to eliminate the bias of the distance estimates (as in the crossnobis estimator), using more repetitions brings an additional performance bonus because it reduces the variance increase associated with unbiased estimates (Diedrichsen et al., 2020, Appendix 5).
Which distance estimator and RDM comparator?
The statistical inference procedures introduced here work for any choice of representationaldistance estimator and RDM comparator. However, the choice of distance estimator and RDM comparator affects the power of modelcomparative inference and the meaning of the inferential results.
For computing the RDM, we tested only variations of the crossnobis (crossvalidated Mahalanobis) distance estimator, as recommended based on earlier research (Walther et al., 2016). The crossnobis estimator can use different noise covariance estimates to normalize patterns, such that the noise distribution becomes approximately isotropic. The noise covariance matrix can be the identity (no normalization), diagonal (univariate normalization), or a full estimate (multivariate normalization). Consistent with previous findings (Walther et al., 2016; Ritchie et al., 2021), our results suggest that univariate noise normalization is always preferable to no normalization, and that multivariate noise normalization using a shrinkage estimate of the noise covariance (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005) helps in some circumstances and never hurts model discrimination.
For evaluating RDM predictions, we can distinguish RDM comparison methods by the scale they assume for the distance estimates: ordinal, interval, or ratio. For ordinal comparisons, the different rank correlation coefficients perform similarly. We recommend ${\rho}_{a}$ for its computational efficiency and analytically derived noise ceiling. For interval and ratioscale comparisons, a more complex pattern emerges. In particular whether cosine similarities (ratio scale) or Pearson correlations (interval scale) work better depends on the structure of the model RDMs to be compared. We recently proposed whitened variants of the cosine similarity and Pearson correlation, which take into account that the distance estimates in an RDM are not independent (Diedrichsen et al., 2020). The whitened RDM comparators were more sensitive to subtle differences in model performance when evaluated on fixed models (Figure 5c). In the simulations based on the calciumimaging data, whitened RDM comparators still performed better on average, but there were some cortical areas that were easier to identify by using the unwhitened comparison measures.
Alternative approaches
We present a frequentist inference methodology that uses crossvalidation to obtain point estimates of model performance and bootstrapping to estimate our uncertainty about them. Bayesian alternatives deserve consideration. For example, a Bayesian approach has been proposed to alleviate the bias of distance estimates (Cai et al., 2019). This Bayesian estimate makes more detailed assumptions about the trial dependencies than our crossvalidated distance estimators, which remove the bias. The Bayesian estimate might be preferable for its higher stability when its assumptions hold and could be used in combination with our modelcomparative inference methods. For model comparisons, Bayesian inference is also an interesting alternative to the frequentist methods we discuss here (Kriegeskorte and Diedrichsen, 2016). Our whitened RDM comparison methods can be motivated as approximations to the likelihood for a model and we reported recently that they afford similar power as likelihoodbased inference with normal assumptions (Diedrichsen et al., 2020). Thus, frequentist inference using the whitened RDM comparators is related to Bayesian inference with a uniform prior across models. In the Bayesian framework, generalization to the populations of subjects and conditions would require a model of how RDMs vary across subjects and conditions. We currently do not have such a model. Until such models and Bayesian inference procedures for them are developed, the frequentist methods we present here remain the only method for generalization to the populations of subjects and conditions.
Another strongly related method for comparing models to data in terms of their geometry is pattern component modeling (Diedrichsen et al., 2018), which compares conditions in terms of their covariance over measurement channels instead of their representational dissimilarities. This approach is deeply related to representational similarity analysis (Diedrichsen and Kriegeskorte, 2017). Pattern component modeling is somewhat more rigid than RSA as the theory is based on normal distributions, but it has advantages in terms of analytical solutions. In particular, the likelihood of models can be directly evaluated, enabling tests based on the likelihood ratio. Due to the direct evaluation of likelihoods, this framework can be combined with Bayesian inference more easily and recently a variational Bayesian analysis was presented for this model (Friston et al., 2019).
Another powerful approach to inference on braincomputational models is to fit encoding models that predict measured brainactivity data instead of representational geometries (e.g. Wu et al., 2006; Kay et al., 2008; Dumoulin and Wandell, 2008; Naselaris et al., 2011; Wandell and Winawer, 2015; Diedrichsen and Kriegeskorte, 2017; Cadena et al., 2019a). This approach was originally developed in the context of lowdimensional models and measurements. When models and measurements are both high dimensional, even a linear encoding model can be severely underconstrained (Cadena et al., 2019b; Kornblith et al., 2019). As a result, an encoding model requires a combination of substantial fitting data and strong priors on the weights. The predictive model that is being evaluated comprises the encoding model and the priors on its weights (Diedrichsen and Kriegeskorte, 2017), which complicates the interpretation of the results (Cadena et al., 2019b; Kriegeskorte and Douglas, 2019b). Both model performances and the fitted weights can then be highly uncertain and/or dependent on the details of the assumed encoding model. The additional data and assumptions needed to fit complex encoding models motivate the consideration of methods as proposed here that do not require fitting of a highparametric mapping from model to measured brain activity.
The generalization challenges that we tackle here for RSA apply equally to encoding models and pattern component modeling. Inferences are often meant to generalize to new subjects and/or experimental conditions. The alternative approaches, in their current implementations, do not yet enable simultaneous generalization to the populations of experimental conditions and subjects. By default pattern component modeling and its Bayesian variants assume a single geometry and thus do not take either subject or condition variability into account. Variability across subjects can be taken into account in a grouplevel analysis (see e.g. Diedrichsen et al., 2018, 2.7.3), but this approach does not account for uncertainty due to the sample of experimental conditions. Encoding models usually follow the machine learning approach with training, validation, and test sets (e.g. Naselaris et al., 2011; Cichy et al., 2019; Cichy et al., 2021). Uncertainty about the model evaluations is either not estimated at all or estimated in a secondary analysis based on the variability across subjects, cells, or conditions. Because these secondary analysis is based solely on the test set, results are conditional on the training and validation sets, and so fall short of generalizing modelcomparative inferences to the underlying populations. Note that the bootstrapping and crossvalidation approaches we introduce here are not inherently specific to RSA. These methods could be adapted for estimating the uncertainties about other model evaluation measures such as those provided by pattern component and encoding models.
Conclusion
We present a comprehensive new methodology for inference on models of representational geometries that is more powerful than previous approaches, can handle flexible models, and enables neuroscientists to draw conclusions that generalize to new subjects and conditions. The validity of the methods has been established through extensive simulations and using real neural data. These methods enable neuroscientists working with humans and animals to evaluate complex braincomputational models with measurements of neural population activity. As we enter the age of big models and big data, we hope these methods will help connect computational theory to neuroscientific experiment.
Materials and methods
The methods section for this paper is separated into two parts: First, we describe the RSA analysis pipeline we propose in full. In the second part, we describe the simulation methods we used to test our pipeline for this paper.
Full description of the RSA method
The inference method we describe here represents a new pipeline for representational similarity analysis. Nonetheless, some parts of the analysis appeared in earlier or concurrent publications (Kriegeskorte et al., 2008b; Nili et al., 2014; Walther et al., 2016; Storrs et al., 2014). In this section, we describe the whole pipeline, including both new and established procedures, without requiring familiarity with previous papers.
Computing representational dissimilarity matrices
Request a detailed protocolThe first step of RSA is the computation of the representational dissimilarity matrices. The main question for this step is which dissimilarity measure shall be computed between conditions.
In the formal mathematical sense, a distance or metric is a function that takes two points from the space as input and computes a real number from them conforming to the following three rules: (1) Nonnegativity: The result is larger than or equal to zero, with equality only if the two points are equal. (2) Symmetry: The result is the same if the two points are swapped. (3) Triangle inequality: The sum of distances from $a$ to $b$ and $b$ to $c$ is less than or equal to the distance from $a$ to $c$ for all choices of the three points. We use the term dissimilarity for symmetric measures that may violate (1) and (3). Dropping requirement (3) admits symmetric divergences between probability distributions, for example. Dropping requirement (1) and allowing measures that may return negative values admits unbiased distance estimators (whose distribution is symmetric about 0 when the true distance is 0). We would still like the dissimilarity to be nonnegative in expectation.
In principle, any dissimilarity measure on the measured representation vectors can be used to quantify the dissimilarities between conditions. Popular choices in the past were the Pearson correlation distance, squared and unsquared Euclidean distances, cosine distance, and lineardecodingbased measures such as the decoding accuracy, the lineardiscriminant contrast (LDC, also known as the crossnobis estimator; Walther et al., 2016) and the lineardiscriminant $t$ value (LD$t$; Kriegeskorte et al., 2007; Nili et al., 2014). Earlier publications comparing different measures of dissimilarity found correlation distances to be less interpretable in terms of condition decodability and continuous crossvalidated decoding measures (LDC, LD$t$) to be more sensitive than decoding accuracy (Walther et al., 2016).
How representational dissimilarity is best quantified and inferred from raw data depends on the type of measurements taken. For fMRI for example, it is beneficial to take the noise covariance across voxels into account by computing Mahalanobis distances (Walther et al., 2016). For electrophysiological recordings of individual neurons one should take the Poisson nature of the variability into account. One approach is to transform the measured spike rates so as to stabilize their variance (Kriegeskorte and Diedrichsen, 2019a). Here, we introduce a KLdivergence dissimilarity measure based on the Poisson distribution (Appendix 2). Representational dissimilarities can also be inferred from behavioral responses, such as speeded categorizations or explicit judgments of properties or pairwise dissimilarities (Kriegeskorte and Mur, 2012).
Two aspects of the computation of dissimilarities warrant further discussion: crossvalidation of dissimilarities and taking noise covariance into account.
Crossvalidated distance estimators
Request a detailed protocolOne important aspect of the firstlevel analysis is that naive estimates of representational similarity can be severely biased (Walther et al., 2016; Cai et al., 2019; Diedrichsen et al., 2020) toward a structure dictated by the structure of the experiment rather than the structure of the representations. This happens because noise in the underlying patterns biases distance estimates upward. When different conditions are measured with different amounts of noise or the measurements between some pairs of conditions are correlated, this bias will be different for different distances introducing other structure into the RDM.
To avoid this problem, one can use crossvalidated distances, which combine difference estimates from independent measurements like separate runs, such that the dissimilarity estimate is unbiased. Crossvalidation applies to all dissimilarity measures that are defined based on inner products of the differences with themselves (e.g. squared and unsquared Euclidean distances, Mahalanobis distances, Poisson KL divergences). To compute a crossvalidated dissimilarity one computes two estimates of the difference vector from independently measured parts of the data and takes the inner product of these two independent estimates, averaging over different splits into independent data.
The most commonly used version of crossvalidated distances are crossvalidated Mahalanobis (Crossnobis) dissimilarities, which we use througout our simulations in this paper. For $N\ge 2$ repeated measurements of response patterns ${\mathbf{x}}_{mi},m=1\dots N,i=1\dots K$ (for $K$ conditions), the crossnobis estimator ${\widehat{d}}_{ij}$ is defined as:
for an estimate noise covariance matrix $\mathrm{\Sigma}$.
Crossnobis dissimilarities seem to be the most reliable dissimilarity estimate for fMRIlike data (Walther et al., 2016). As in the noncrossvalidated Mahalanobis distance, the linear transformation of of the response dimensions (using the noise precision matrix ${\mathrm{\Sigma}}^{1}$) improves reliability (Walther et al., 2016; Nili et al., 2014) and renders the estimates monotonically related to the linear decoding accuracy for each pair of conditions, when a fixed Gaussian error distribution is assumed. Their sampling distributions can be described analytically (Diedrichsen et al., 2020).
For Poisson distributed measurements as for electrophysiological recordings we can also define a crossvalidated dissimilarity based on the KLdivergence as we introduce in Appendix 2.
Noise covariance estimation
Request a detailed protocolTo take the noise covariance into account (in Mahalanobis and Crossnobis dissimilarities) we first need to estimate it. To do so, we can use one of two sources of information: We either estimate the covariance based on the variation of the repeated measurements around their mean or based on the residuals of a firstlevel analysis which estimated the patterns from the raw data. For fMRI for example, these residuals would be the residuals of the firstlevel GLM. In either case, we may have relatively little data to estimate a large noise covariance matrix. Making this feasible usually requires regularization. To do so the following methods are available:
When the estimation task is judged to be entirely impossible one can reduce the Mahalanobis and Crossnobis back to the Euclidean and crossvalidated Euclidean distances by using the identity matrix instead.
As a univariate simplification one can estimate a diagonal matrix which only takes the variances of voxels into account.
For estimating a full covariance one may use a shrinkage estimate, which ‘shrinks’ the sample covariance toward a simpler estimate of the covariance like a multiple of the identity or the diagonal of variances (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005). The amount of shrinkage used in these methods fortunately can be estimated quite accurately based on the data directly such that these methods do not require parameter adjustment.
We implemented these different methods. Overall the shrinkage estimates perform best, while the other techniques are equally good in some situations. Using the sample covariance directly is not advisable unless an unusually large amount of data exists for this estimation, in which case the shrinkage estimates converge toward the sample covariance anyway.
Comparing RDMs
Request a detailed protocolThe secondlevel analysis is comparing a measured data RDM (for each subject) to the RDMs predicted by different models. There are various measures to compare RDMs. The right choice depends on the aspects of the data RDM that the models are meant to predict. A strict measure would be the Euclidean distance (or equivalently the mean squared error) between a model RDM and the data RDM. However, we usually cannot predict the absolute magnitude of the distances because the SNR varies between subjects and measurement sessions. Allowing an overall scaling of the RDMs leads to the cosine similarity between RDMs. If we additionally drop the assumption that a predicted difference of 0 corresponds to a measured dissimilarity of 0, we can use a correlation coefficient between RDMs. For the cosine similarity and Pearson correlation between RDMs we recently proposed whitened variants which take the correlations between the different entries of the RDM into account (Diedrichsen et al., 2020). Finally, we can drop the assumption of a linear relationship between RDMs by using rank correlations like Kendall’s $\tau $ or Spearman’s $\rho $. For this lowest bar for a relationship Kendall’s ${\tau}_{a}$ or randomized rank breaking for Spearman’s ${\rho}_{a}$ are usually preferred over a standard Spearman’s $\rho $ or Kendall’s ${\tau}_{b}$ and ${\tau}_{c}$, which all favor RDMs with tied ranks (Nili et al., 2014). As we discuss below there is a direct formula for the expected Spearman’s $\rho $ under random tiebraking, which we prefer now for computational efficiency reasons.
Estimating the uncertainty of our modelperformance estimates
Additional to the point estimate of model performances we want to estimate how certain we should be about our evaluations. In the frequentist framework, this corresponds to an estimate how variable our evaluation results would be if we repeated the experiment. All variances we need can be computed from the covariance matrix of the modelperformance estimates. Thus, we keep an estimate of this matrix as our overall uncertainty estimate.
Subject variance
Request a detailed protocolThe easiest to estimate variance is the variance our results would have if we repeated the experiment with new subjects, but the same conditions, as all our evaluations are simple averages across subjects. Thus, an estimate of the variance can always be obtained by dividing the sample variance over subjects by the number of subjects.
Bootstrapping conditions
Request a detailed protocolThe dependence of the evaluation on the conditions is more complicated. Thus, we use bootstrapping (Efron and Tibshirani, 1994) to estimate the variance we expect over repetitions of the experiment with new conditions but the same subjects. To do so, we sample sets of conditions with replacement from the set of measured conditions and generate the data RDM and the model RDMs for this sample of conditions. The variance of the model performances on these resampled RDMs is then an estimate of the variance over experiments with different stimulus choices. The bootstrap samples of conditions contain repetitions of the same condition. The dissimilarity between a stimulus and itself will be 0 in the data and any model such that every model would correctly predict these selfdissimilarities. Thus, including these selfdissimilarities would bias all model performances upward. To avoid this, we exclude them from the evaluation.
2Factor bootstrap
Request a detailed protocolIf we want to estimate the variance for generalization to new subjects and new stimuli simultaneously, we need to use the correction we introduce in the results section (see Estimating the variance of modelperformance estimates for generalization to new subjects and conditions). This yields an estimate of the model evaluation (co)variances as all other methods for variance estimation.
As we mention in the main text, the exact formulas for the correction contain factors that depend on the number of subjects ${N}_{s}$ and conditions ${N}_{c}$, respectively. The expected value for the uncorrected 2factor bootstrap variance ${\widehat{\sigma}}_{sc}$ is:
If the true variances due to the subject is ${\sigma}_{s}^{2}$, the one due to the conditions is ${\sigma}_{c}^{2}$, and the one due to measurement noise or interaction of the two is ${\sigma}_{sc}^{2}$. Correspondingly, the expectations for the two 1factor bootstraps ${\widehat{\sigma}}_{s}^{2}$ and ${\widehat{\sigma}}_{c}^{2}$ are:
Combining these equations an unbiased estimate of the variance ${\widehat{\sigma}}^{2}$ can be obtained:
As ${N}_{s}$ and ${N}_{c}$ grow, the three ratio factors converge to 1, and the result converges to the one given by the simpler formula in the main text (Equation 5).
Bootstrapwrapped crossvalidation
Request a detailed protocolIf we employ any flexible models, we should additionally use crossvalidation, which leads to our new bootstrapwrapped crossvalidation explained in the results section (Evaluating the performance of flexible models).
Frequentist tests for model evaluation and model comparison
Request a detailed protocolBased on the uncertainty estimates, we construct frequentist tests to compare models to each other. The default method is a $t$test based on bootstrapestimated variances. There is a collection of other tests available to compare model performances against each other, to the noise ceiling, or to chance performance.
Because we base our uncertainty estimates on a bootstrap, there are two types of tests we can use for these comparisons: A percentile test based on the bootstrap samples or a $t$test based on the estimated variances.
For the percentile test, we calculate the bootstrap distribution for the differences and then test by checking whether the difference expected under the H_{0} (usually 0) lies within the simple percentile bootstrap confidence interval. It is possible to generate more exact confidence intervals like biascorrected and accelerated intervals based on the bootstrap samples, which might result in better tests. In our simulations and experience with natural data, however, model performances tend to be fairly symmetrically distributed around the true value, suggesting that these corrections are unnecessary.
For the $t$test we use the variance estimated from the bootstrap and use the number of observations minus one as the degrees of freedom. When bootstrapping across both subjects and conditions, we used the smaller number to stay conservative. This approach follows Efron and Tibshirani, 1994, chapter 12.4.
The $t$test has some advantages over the percentile bootstrap (Efron and Tibshirani, 1994): First, precise $p$ value estimates require many bootstrap samples. Especially, when smaller $\alpha $ levels or corrections for multiple comparisons are used, the percentile bootstrap can become computationally expensive and/or unreliable. Second, for small sample sizes, the bootstrap distribution does not take the uncertainty about the variance of the distribution into account. This is a similar error as taking the standard normal instead of a $t$distribution to define confidence intervals. Third, the bootstrap distributions are discrete, which is a bad approximation in the tails of the distribution. For example, a sample of five RDMs which are all positively related to the model is declared significantly related to the model at any $\alpha $ level, because all bootstrapped average evaluations are at least as high as the lowest individual evaluation. Fourth, for the 2factor bootstrap and the bootstrapwrapped crossvalidation, we can give corrections for the variance, but lack techniques to generate bootstrap samples directly.
We should also note that we expect the $t$distribution to be a good approximation for our case: We expect fairly symmetric distributions for the differences between models and average them across subjects, which should lead to a quick convergence toward a normal distribution for the model performances and their differences.
In particular, the model performances we base our tests on are not necessarily positive, allowing us to use the same techniques for the test against 0. This is different from tests that handle the original dissimilarities, where the true distances can only be positive.
Noise ceiling for model performance
Request a detailed protocolIn addition to comparing models to each other, we compare models to a noise ceiling and to chance performance. The noise ceiling provides an estimate of the performance the true (datagenerating) model would achieve. A model that approaches the noise ceiling (i.e. is not significantly below the noise ceiling) cannot be statistically rejected. We would need more data to reveal any remaining shortcomings of the model. The noise ceiling is not 1, because even the true group RDM would not perfectly predict all subjects’ RDMs because of the intersubject variability and noise affecting the RDM estimates. We estimate an upper and a lower bound for the true model’s performance (Nili et al., 2014). The upper bound is constructed by computing the RDM which performs best among all possible RDMs. Obviously, no model can perform better than this best RDM, so it provides a true upper bound. To estimate a lower bound, we use leaveoneout crossvalidation, computing the best performing RDM for all but one of the subjects and evaluating on the heldout subject. We can understand the upper and lower bound of the noise ceiling as uncrossvalidated and crossvalidated estimates of the performance of an overly flexible model that contains the true model. The uncrossvalidated estimate is expected to be higher than the true model’s performance because it is overfitted. The crossvalidated estimate is expected to be lower than the true model’s performance because it is compromised by the noise and subjectsampling variability in the data.
For most RDM comparators, the best performing RDM can be derived analytically as a mean after adequate normalization of the single subject RDMs. For cosine similarity, they are normalized to unit norm. For Pearson correlation, the RDM vectors are normalized to zero mean and unit standard deviation. For the whitened measures the normalization is based on the norm induced by the noise precision instead, that is subject RDM vectors $\mathbf{d}$ are divided by $\sqrt{{\mathbf{d}}^{T}{\mathrm{\Sigma}}^{1}\mathbf{d}}$ instead of the standard Euclidean norm $\sqrt{{\mathbf{d}}^{T}\mathbf{d}}$. For the Spearman correlation, subject RDM vectors are first transformed to ranks.
For Kendall’s ${\tau}_{a}$, there is no efficient method to find the optimal RDM for a dataset, which is one of the reasons for using the Spearman rank correlation for RDM comparisons. If Kendall $\tau $ based inference is chosen nonetheless, the problem can be solved approximately by applying techniques for Kemeny–Young voting (Ali and Meilă, 2012) or by simply using the average ranks, which is a reasonable approximation, especially if the rank transformed RDMs are similar across subjects. In the toolbox, we currently use this approximation without further adjustment.
For the lower bound, we use leaveoneout crossvalidation over subjects. To do this, each subject is once selected as the leftout subject and the best RDM to fit all other subjects is computed. The expected average performance of this RDM is a lower bound on the true model’s performance, because fitting all distances independently is technically a very flexible model, which performs the same generalization as the tested models. As all other models it should thus perform worse than or equal to the correct model.
When flexible models are used, such that crossvalidation over conditions is performed, the computation of noise ceilings needs to take this into account (Storrs et al., 2014). Essentially, the computation of the noise ceilings is then restricted to the test sets of the crossvalidation, which takes into account which parts of the RDMs are used for evaluation.
Flexible models
Request a detailed protocolAs model types, we implement three types of flexible model, in addition to the standard fixed model, which represents a single RDM to be tested:
A selection model, which states that one of a set of RDMs is the correct one.
A onedimensional manifold model, which consists of an ordered list of RDMs and is allowed to linearly interpolate between neighboring RDMs.
A weighted representational model, which states that the RDM is a (positively) weighted sum of a set of RDMs.
These models aim to provide the flexibility necessary to appropriately represent the uncertainty about the data generation process in different ways. First, we may be uncertain about aspects of the underlying braincomputational model, such as the relative prevalence in the neural population of different subpopulations of tuned neurons (KhalighRazavi and Kriegeskorte, 2014; KhalighRazavi et al., 2017; Jozwik et al., 2016). Second, we may be uncertain about aspects of the measurement process, such as the spatial smoothing and weighted averaging of features due to measurement methods. Measurement with functional MRI voxels, for example, can strongly influence the resulting RDMs, which can lead to wrong conclusions and generally bad model performance when the models are compared to measured RDMs (Kriegeskorte and Diedrichsen, 2016).
The selection model implements flexibility in perhaps the simplest way, by allowing a choice among a set of RDMs produced from the model under different assumptions about the brain computations and/or the measurement process. For training, we can simply evaluate each possible RDM on the training data and choose the best performing one as the model prediction for evaluation. This model implies no structure in which RDMs can be predicted, but can only handle a finite set of RDMs.
The onedimensional manifold model implements an ordered set of RDMs, where the model is allowed to interpolate between each pair of consecutive RDMs. This representation is helpful if the uncertainty about the measurements effect can be well summarized by one continuous parameter like the width of a smoothing kernel. Then we can sample a set of values for this parameter and use the simulated results as the basis RDMs for this kind of model. Then the model will provide an approximation to the continuous set of RDMs predicted by changing the parameter without requiring a method to optimize the parameter directly.
Finally, the weighted representational model represents the effect of weighting orthogonal features. In this case, the overall RDM is a weighted sum of the RDMs generated by the individual features. Whenever the original model has a feature representation, we may be uncertain about the features relative prevalence in the neural population and/or in the measured responses. It can be sensible then to assume that these features are represented with different weights or are differently amplified by the measurement process. The squared Euclidean representational distances that would obtain from a concatenation of feature subsets, each multiplied by a different weight, is equal to a nonnegatively weighted combination of the squared Euclidean RDMs for the individual feature sets. This justifies a nonnegative weighted model at the level of the RDMs.
A particular application of the weighted representation model is motivated by the local averaging in fMRI voxels, which leads to an overrepresentation of the populationmean dimension of the multivariate response space (Carlin and Kriegeskorte, 2017; Kriegeskorte and Diedrichsen, 2016; Ramírez et al., 2014). The expected RDM for measurements that independently randomly weight features is a linear combination of two RDMs, one based on treating features separately and one based on averaging all features before computing the RDM (see Appendix 4).
Our methodology is not specific to these types of model and can be easily extended to other types of model. To do so, the only requirement is that there is a reasonably efficient fitting method to infer the best fitting parameters for a given dataset of training RDMs. Indeed, new model types can be slotted into our toolbox by users by implementing only two functions: one that predicts an RDM based on a parameter vector and one that fits the parameter vector to a dataset.
Validation of the methodology
To evaluate our methods we use three kinds of simulations. First, we implement simulations based on deep neural networks and a simple approximation of voxel sampling. By choosing a new random voxel sampling per subject and using different randomly selected input images, we can test our methods with systematic variations across conditions and/or subjects. Second, we implement a simulation based on real fMRI data recombining measurements signals and noise to keep all complications found in true fMRI data. Third, we present simulations based on calciumimaging data from mice (de Vries et al., 2020).
Additionally, we tested that the tests we implement are in principle valid using a simple simulation based on a normal distribution for the original measurements, which corresponds to the matrixnormal generative model we used for theoretical derivations elsewhere (Diedrichsen et al., 2020). These simulations are presented in Appendix 1.
Neuralnetworkbased simulation
Request a detailed protocolOur simulations were based on the activities in the convolutional layers of AlexNet (Krizhevsky et al., 2012) in response to randomly chosen images from the ecoset validation set (Mehrer et al., 2021). For each stimulus, we computed the activities in the convolutional layer and took randomly chosen local averages to simulate the averaging of voxels. We then generated fMRIlike measurement timecourses to a randomly ordered short eventrelated design by convolution with a hemodynamicresponse function and addition of autoregressive noise. We then ran a GLM analysis to estimate the response strength to each stimulus. From these estimated voxel responses, we computed data RDMs per subject and ran our proposed analysis procedures to compute model performances of different models which we also based on the convolutional layers of AlexNet.
For the network, we used the implementation available for pytorch through the torchvision package (Paszke et al., 2019).
Stimuli
Request a detailed protocolStimuli were chosen independently from the validation set of ecoset by first choosing a category randomly and then sampling an image randomly from that category. These stimuli are natural images with categories chosen to approximate the relevance for human observers. The validation set contain 565 categories with 50 images each, that is 28,250 images in total.
Noisefree voxel response
Request a detailed protocolTo compute the response strength of a voxel to a stimulus we computed a local average of the feature maps. We first convolved the feature maps with a Gaussian representing the spatial extend of the voxels, whose size we defined by its standard deviation relative to the overall size of the feature map. A voxel with size 0.05 would thus correspond to a Gaussian averaging area whose standard deviation is 5% of the size of the feature map. Voxel locations were then chosen uniformly randomly over the locations within the feature map. To average across features, we chose a weight for each feature and each voxel uniformly between 0 and 1 and then took the weighed sum as the voxel response.
fMRI simulation
Request a detailed protocolTo generate timecourses we assumed a measurement was taken every 2 s and a new stimulus was presented during every second measurement, with no stimulus presented in the measurement intervals between stimulus presentations.
To generate a simulated fMRI response, we computed the stimulus by voxel response matrix and normalized it per subject to have equal averaged squared value. We then converted this into timecourses following the usual GLM assumptions and convolved the predictions with a hemodynamicresponse function. We set the hrf to the standard sum of two gamma distributions as assumed in statistical parametric mapping (SPM; Pedregosa et al., 2015), normalized to an overall sum of 1.
We then added noise from an autoregressive model of rank 1 (AR1) with covariance between pairs of voxels given by the overlap of the weighting functions of their weights. To control the strength of the autocorrelation, we set the coefficient for the previous data point to 0.5. To enforce the covariance between voxels, we multiplied the noise matrix with the cholesky decomposition of the desired covariance. To control the overall noise strength we scaled the final noise by a constant.
Each stimulus was presented once per run, with multiple stimulus presentations implemented as multiple runs.
Analysis
To analyse the simulated data we ran a standard GLM analysis which yielded a $\beta $estimate for each presented stimulus for each run of the experiment.
To compute RDMs we used Crossnobis distances based on leaveoneout crossvalidation over runs and the covariance of the residuals of the GLM. For this step, we used the function implemented in our toolbox.
Fixedmodel definition
Request a detailed protocolAs models to be compared we used the different layers of AlexNet. To generate an optimal model RDM we applied two transformations to mimic the average effect of voxel sampling. First we convolved the representation with the spatial receptive field of the voxels to mimic the spatial averaging effect. To capture the effect of pooling the features with nonnegative weights, we then computed a weighted sum of the RDM containing the features separately and one RDM based on the summed response across features weighted with weights 1 and 3.
This weighting computes the expected Euclidean distance of patterns under our random weighting scheme as we derive in Appendix 4: For our ${w}_{i}\sim U(0,1)$ the expected value is $\mathbb{E}(w)=\frac{1}{2}$ and the variance is $\mathrm{Var}({w}_{i})=\frac{1}{12}$ such that the weights for the RDM based on the individual features is $\frac{1}{4}$ and the weight for the RDM based on the summed feature response is $\frac{1}{12}$, that is a 3:1 weighting.
Based on this weighting we generated a fixed model for each individual processing step in AlexNet including the nonlinearities and pooling operations resulting in 12 models predicting a fixed RDM.
Tested conditions
Request a detailed protocolFor the large deepneuralnetworkbased simulation underlying the results in Figure 4, we chose a base set of factors which we crossed with all other conditions and a separate set of factors which were not crossed with each other but only with the base set.
Into the base set of factors we included the following factors: Which experimental parameters were changed over repetitions of the experiment (none, subjects, conditions, or both) and which bootstrapping method we applied (over conditions, over subjects, over both or applying the bootstrap correction). We applied all four bootstrapping conditions to the simulations in which none of the parameters were varied, the fitting ones to the subject and stimulus varying simulations and the bootstrap with and without correction for to the simulations were both parameters varied over repetitions resulting in 8 conditions for variation and bootstrap. Additionally, we included the number of subjects (5, 10, 20, 40, or 80) and the number of conditions (10, 20, 40, 80, and 160). For each set of conditions we thus ran 8 × 5 × 5 = 200 conditions.
Other factors we varied were: The number of repeats, which we set to 4 usually and tested 2 and 8. The layer we used to simulate the data, which we usually set to layer number 8 which corresponds to the output of the 3rd convolutional layer, and also tried 2, 5, 10, and 12, which correspond to the other 4 convolutional layers of AlexNet. The size of the voxels which we usually set to 0.05, that is we set the standard deviation of the Gaussian to 5% of the size of the feature map. As variations we tried 0, 0.25, and $\mathrm{\infty}$, that is no spatial pooling, a quarter of the size of the feature map as standard deviation and an average over the whole feature map. Finally, we varied the number of voxels, which we usually set to 100, but tried 10 and 1000 additionally. In total we thus ran 3 + 5 + 4 + 3 = 15 sets of conditions with 200 conditions each resulting in 3000 conditions, with a grand total of 300,000 simulations.
Bootstrapwrapped crossvalidation
Request a detailed protocolTo test the precision and consistency of the calculations for the bootstrapwrapped crossvalidation (Figure 5a, b), we needed repeated analyses for the same datasets. For this simulations we thus simulated only 10 datasets for the standard conditions, 20 subjects and 40 conditions, while varying both conditions and subjects and then ran repeated analyses on these datasets. For each setting, we ran 100 repeated analysis of each dataset. As conditions we chose 2, 4, 8, 16, and 32 crossvalidation assignments for 1000 bootstrap samples and additionally variants with only 2 crossvalidation folds and 2000, 4000, 8000, or 16,000 bootstrap samples.
Flexiblemodel treatment
Request a detailed protocolTo test whether our methods are adequate for estimating the variability for model performances of flexible models (Figure 5d–f), we ran our standard settings for 20 subjects and 40 stimuli and drawing new subjects and new stimuli, while replacing the fixed models per layer with flexible models of different kinds.
We generated models by combining models with different assumptions about the voxel pooling pattern: We varied two factors: (1) How feature weighting was handled: full, that is predicted distances are Euclidean distances in the original feature space; avg, that is distances are the differences in the average activation across features; or ‘weighted’, that is the weighted average of these two models, that corresponds to the expected RDM under the weight sampling we simulated. (2) How averaging over space was handled.
We first used different kinds fixed models, which serve as the building blocks for the flexible models. We varied two aspects of the measurement models applied: How large voxels are assumed to be (no pooling, $std=5\%$ of the image size and pooling over the whole feature maps) and how the features pooling is handled (no pooling, average feature, or the correct weighting assumed for the fixed models previously). These 3 × 3 combinations are the 9 fixedmodel variants.
We then generated selection models which had a range of voxel sizes to choose from (no pooling, std = 1%, 2%, 5%, 10%, 20%, 50% of the image size and pooling across the whole feature map). For the treatment of pooling over features we used four variants: For the first three called full, average, and weighted we used one of the types of fixed models to generate the RDMS. For the last, we allowed both the RDMs used by the full models and the ones used by the average models as a choice.
As an example of a linearly weighted model, we generated a model which was allowed to use a linear weighting of the four cornercase RDMs: no feature pooling and no spatial pooling, average feature and no spatial pooling, a global average per feature map, and the RDM induced by pooling over all locations and features. The model could predict any linear combination of the cornercase RDMs to fit the data RDMs.
fMRIdatabased simulation
With our fMRIdatabased simulation, we aim to show that our analyses are correct and functional for real fMRI data. Real data may contain additional statistical regularities, which we did not take into account in our deepneuralnetworkbased simulations. To do so, we took a large published dataset of fMRI responses to images and sampled from this dataset to generate hypothetical experimental datasets across which we would like to generalize. All scripts for the fMRIdatabased simulation are openly available on https://github.com/adkipnis/fmrisimulations, (Kipnis, 2023).
Dataset
Request a detailed protocolFor these simulations we used data from Horikawa and Kamitani, 2017 (as available from https://openneuro.org/datasets/ds001246/versions/1.2.1). This dataset contains fMRI data collected from five subjects viewing natural images selected from ImageNet or imagining images from a category. For our simulations, we used only the ‘test’ datasets, which contain 50 different images from distinct categories, which were each presented 35 times to each subject giving us an overall reliable signal and repetitions to resample from.
We used the automatic MRI preprocessing pipeline implemented in fMRIPrep 1.5.2 (Esteban et al., 2019; Esteban et al., 2019; RRID:SCR_016216), which is based on Nipype 1.3.1 (Gorgolewski et al., 2011; Gorgolewski et al., 2018; RRID:SCR_002502). This program was also used to produce the following description of the preprocesing performed.
Anatomicaldata preprocessing
Request a detailed protocolThe T1weighted (T1w) image was corrected for intensity nonuniformity (INU) with N4BiasFieldCorrection (Tustison et al., 2010), distributed with ANTs 2.2.0 (Avants et al., 2008, RRID:SCR_004757), and used as T1w reference throughout the workflow. The T1w reference was then skull stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as target template. Brain tissue segmentation of cerebrospinal fluid (CSF), white matter (WM), and gray matter (GM) was performed on the brainextracted T1w using fast (FSL 5.0.9, RRID:SCR_002823, Zhang et al., 2001). Brain surfaces were reconstructed using reconall (FreeSurfer 6.0.1, RRID:SCR_001847, Dale et al., 1999), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs and FreeSurferderived segmentations of the cortical GM of Mindboggle (RRID:SCR_002438, Klein et al., 2017). Volumebased spatial normalization to one standard space (MNI152NLin2009cAsym) was performed through nonlinear registration with antsRegistration (ANTs 2.2.0), using brainextracted versions of both T1w reference and the T1w template. The following template was selected for spatial normalization: ICBM 152 Nonlinear Asymmetrical template version 2009c (Fonov et al., 2009, RRID:SCR_008796; TemplateFlow ID: MNI152NLin2009cAsym).
Functional data preprocessing
Request a detailed protocolFor each of the 35 bloodoxygenleveldependent (BOLD) runs found per subject (across all tasks and sessions), the following preprocessing was performed. First, a reference volume and its skullstripped version were generated using a custom methodology of fMRIPrep. The BOLD reference was then coregistered to the T1w reference using bbregister (FreeSurfer) which implements boundarybased registration (Greve and Fischl, 2009). Coregistration was configured with six degrees of freedom. Headmotion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) are estimated before any spatiotemporal filtering using mcflirt (FSL 5.0.9, Jenkinson et al., 2002). BOLD runs were slicetime corrected using 3dTshift from AFNI 20160207 (Cox and Hyde, 1997, RRID:SCR_005927). The BOLD timeseries, were resampled to surfaces on the following spaces: fsaverage5 and fsaverage6. The BOLD timeseries (including slicetiming correction when applied) were resampled onto their original, native space by applying the transforms to correct for headmotion. These resampled BOLD timeseries will be referred to as preprocessed BOLD in original space, or just preprocessed BOLD. The BOLD timeseries were resampled into standard space, generating a preprocessed BOLD run in [‘MNI152NLin2009cAsym’] space. First, a reference volume and its skullstripped version were generated using a custom methodology of fMRIPrep. Several confounding timeseries were calculated based on the preprocessed BOLD: framewise displacement (FD), the spatial root mean square of the data after temporal differencing (DVARS) and three regionwise global signals. FD and DVARS are calculated for each functional run, both using their implementations in Nipype (following the definitions by Power et al., 2014). The three global signals are extracted within the CSF, the WM, and the wholebrain masks. Additionally, a set of physiological regressors were extracted to allow for componentbased noise correction (Behzadi et al., 2007). Principal components are estimated after highpass filtering the preprocessed BOLD timeseries (using a discrete cosine filter with 128 s cutoff) for the two CompCor variants: temporal (tCompCor) and anatomical (aCompCor). tCompCor components are then calculated from the top 5% variable voxels within a mask covering the subcortical regions. This subcortical mask is obtained by heavily eroding the brain mask, which ensures it does not include cortical GM regions. For aCompCor, components are calculated within the intersection of the aforementioned mask and the union of CSF and WM masks calculated in T1w space, after their projection to the native space of each functional run (using the inverse BOLDtoT1w transformation). Components are also calculated separately within the WM and CSF masks. For each CompCor decomposition, the k components with the largest singular values are retained, such that the retained components’ timeseries are sufficient to explain 50% of variance across the nuisance mask (CSF, WM, combined, or temporal). The remaining components are dropped from consideration. The headmotion estimates calculated in the correction step were also placed within the corresponding confounds file. The confound timeseries derived from headmotion estimates and global signals were expanded with the inclusion of temporal derivatives and quadratic terms for each (Satterthwaite et al., 2013). Frames that exceeded a threshold of 0.5 mm FD or 1.5 standardized DVARS were annotated as motion outliers. All resamplings can be performed with a single interpolation step by composing all the pertinent transformations (i.e. headmotion transform matrices, susceptibility distortion correction when available, and coregistrations to anatomical and output spaces). Gridded (volumetric) resamplings were performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels (Lanczos, 1964). Nongridded (surface) resamplings were performed using mri_vol2surf (FreeSurfer).
Many internal operations of fMRIPrep use Nilearn 0.5.2 (Abraham et al., 2014, RRID:SCR_001362), mostly within the functional processing workflow. For more details of the pipeline, see the section corresponding to workflows in fMRIPrep’s documentation.
The above boilerplate text was automatically generated by fMRIPrep with the express intention that users should copy and paste this text into their manuscripts unchanged. It is released under the CC0 license.
Region selection
Request a detailed protocolVisual areas were defined according to the surface atlas by Glasser et al., 2016. For our simulations we used the following 10 visual areas as ROIs, joining areas from the atlas to avoid too small ROIs (the name of the areas in the atlas is given in brackets): V1 (V1), V2 (V2), V3 (V3), V4 (V4), ventral visual complex (VVC), ventromedial visual area (VMV1, VMV2, VMV3), parahippocampal place area (PHA1, PHA2, PHA3), fusiform face area (FFC), inferotemporal cortex (TF, PeEc), and MT/MST (MT, MST). The areas were selected separately for the two hemispheres.
To map the atlas onto individual subject’s brain space we used the mappings estimated by FreeSurfer with fmriprep’s standard settings. The Glasser Atlas was registered to each participant’s native space with mri_surf2surf, and voxels labeled using mri_annotation2label. Next, each ROI was mapped to native T1w volumetric space with mri_label2vol. To cover as many contiguous voxels as possible, the resulting masks were inflated with mri_binarize and every voxel outside of the volume between the pial surface and WM was eroded with mris_calc. To convert the resulting masks to T2*w space we used custom python scripts: First, masks were smoothed with a Gaussian kernel of FWHM = 3 mm, resampled to T2*w space using nearest neighbor interpolation, and finally thresholded. The threshold for each mask was set to equalize mask volume between T1w and T2*w space. Finally, voxels with multiple ROI assignments were removed from all ROIs but the one with the highest prethreshold value. Voxels outside the fMRIPrepgenerated brain mask were removed from all generated 3dmasks of ROIs.
General linear modeling
Request a detailed protocolFor extracting response patterns from the measurements we used two GLMs. In the first GLM, we regressed out noise sources and in the second we estimate stimulus responses. This twostep process is advantageous in this case where stimulus predictors and noise predictors are highly collinear (As there is only one presentation of each stimulus per run and as they cover the whole run they can together form almost any sufficiently slow variation.): It allows us to attribute all variance that could be attributed to the noise sources uniquely to them and not to effects of stimulus presentation. This is closer to the original papers analysis, leads to higher reliability and generates the second GLM as a stage at which we can adequately model the noise with a relatively simple AR(2) model.
GLM was performed in SPM12 (http://www.fil.ion.ucl.ac.uk/spm). No spatial smoothing was applied, models were estimated using Restricted Maximum Likelihood on top of Ordinary Least Squares and autocorrelations were taken into account using SPM’s inbuilt AR(1) method.
For the first GLM, we used the following noise regressors from the ones provided by fMRIprep: An intercept for each run, the six basic motion parameters and their derivatives, six cosine basis functions to model drift, FD, DVARS, and the first six aCompCors with the largerst eigenvalues. All runs were pooled to get the best noise parameter estimates possible.
We interpret the residuals from the first GLM as a denoized version of the fMRI signal and use them as input for a second GLM separately for each run to estimate stimulus effects: Stimulusspecific regressors were generated by convolving stimulus onset timeseries with the canonical HRF without derivatives. From this GLM, we kept the estimated $\beta $ coefficients ${\widehat{\beta}}_{i}\in {\mathbb{R}}^{p}$ for each stimulus and the residuals r_{i} for further processing.
Resampling
Request a detailed protocolTo sample a single run for further analysis, we randomly chose a run from the measured data without replacement. To expand the set of possible datasets, we then generated a new simulated BOLD signal ${\stackrel{~}{y}}_{i}$ for each voxel $i$ at the stage of the secondlevel GLM. To do so, we model the data as a GLM with an AR(2) model for the noise and then generate a new timecourse by permuting the residuals ${\eta}_{i}$ of the AR(2) model. As we apply the same permutation to each voxel, this procedure largely preserves spatial noise covariance.
In mathematical formulas, this process can be described as follows: Let $p$ be the number of conditions, $n$ be the number of scans per run, and ${y}_{i}\in {\mathbb{R}}^{n}$ be the denoized BOLD response of voxel $i$. We can then use the design matrix of the run $X\in {\mathbb{R}}^{n\times p}$ , the point estimate ${\widehat{\beta}}_{i}\in {\mathbb{R}}^{p}$ for the parameter values and corresponding residuals ${r}_{i}\in {\mathbb{R}}^{n}$ estimated by SPM to simulate a new data run:
where ${w}_{i,1}$ and ${w}_{i,2}$ are the estimated parameters of an AR(2) model fitted to the residuals $\mathbf{r}}_{i$. Its residuals are denoted ${\eta}_{i}$ and were randomly permuted to give ${\stackrel{~}{\eta}}_{i}$ using the same permutation for all voxels in a run.
Additionally, we sampled the conditions to use with replacement, that is we used the ${\beta}_{i}$ of a random sample of conditons, which we also used to select the RDMs from the models.
We saved this dataset in the same format as the original data and reran the secondlevel GLM using SPM on these simulated data to generate noisy estimates stimulus responses in each voxel.
RDM calculation and comparison
Request a detailed protocolWe use crossnobis RDMs for this simulation, testing four different estimates for the noise covariance: We either use the identity, univariate noise normalization, or a shrinkage estimate of the covariance based on the covariance of the residuals, or based on the covariance of the individual runs’ meancentered $\beta $ estimates.
For comparing RDMs, we use the cosine similarity throughout.
Model RDMs
Request a detailed protocolWe use the RDMs of different ROIs as models, effectively testing how well our methods recover the datagenerating ROI. The model RDM for each ROI is the pooled RDM across all subjects and runs computed by the same noise normalization method as the one used for the data RDMs. Data for these RDMs stemmed from the original results of the secondlevel GLM, making them less noisy than any RDM stemming from the simulated data.
Simulation design
For each condition, we ran 100 repeats to estimate the true variability of results and ran all combinations of the following conditions: We used 2, 4, 8, 16, or 32 runs per simulation (5 variants). We used 5, 10, 20, 30, or 50 stimuli (5 variants). We scaled the noise by 0.1, 1.0, or 10.0 (3 variants). We used each of the 20 ROIs for data generation once (20 variants). And we used the 4 methods for estimating the noise covariance (4 variants). Resulting in 24· 5· 5· 3· 20·4 = 144,000 different simulations. To save computation time, we ran the fMRI simulation and analyses only once per repeat and noise level and ran all analysis variants on the same data. When fewer runs or conditions were required for a variant, we randomly selected a subset for the analysis without replacement.
Calciumimagingdatabased simulation
Request a detailed protocolFor the calciumimagingdatabased simulation we used the Allen institutes mouse visual coding calciumimaging data available at https://observatory.brainmap.org/visualcoding/ (de Vries et al., 2020). Detailed information on the recording techniques can be obtained from the original publications and with the dataset.
We used the ‘natural scenes’ data, which consists of measured calcium responses to 118 natural scenes. The natural scenes were shown for 250 ms each without an inter stimulus interval in random order. In each session, each image was present 50 times.
From this dataset, we selected all experimental sessions, which contained a natural scenes experiment. Additionally, we restricted ourselves to three relatively broad cre driver lines, which target excitatory neurons relatively broadly: ’Cux2CreERT2’, ’Emx1IRESCre’, and ’Slc17a7IRES2Cre’. For further analyses, we ignore which driver line was used to achieve enough data for resampling. This resulted in 174 experimental sessions from 91 mice with 146 cells recorded on average (range: 18–359). Of these recordings, 35 came from laterointermediate area, 32 from posteromedial visual area, 23 from rostrolateral visual area, 46 from primary visual cortex, 16 from anteromedial area, and 22 from anterolateral area.
To quantify the response of a neuron to the stimuli, we used the fully preprocessed $\frac{df}{F}$ traces as provided by the dataset. We then extract the measurements from the frame after the one marked as stimulus onset till the stated stimulus endframe resulting in six or seven frames per stimulus presentation. As a response per neuron we then simply took the average of these frames.
To compute RDMs based on these data, we used Crossnobis distances based on different estimates of the noise covariance matrix based on the variance of the stimulus repetitions around the average neural response for each stimulus. We either used: an identity matrix, effectively calculating a crossvalidated Euclidean distance; a diagonal matrix of variances, corresponding to univariate noise normalization; a shrinkage estimate toward a constant diagonal matrix (Ledoit and Wolf, 2004), or a shrinkage estimate shrunk toward the diagonal of sample variances (Schäfer and Strimmer, 2005).
To generate new datasets, we randomly sampled subsets of stimuli, mice, runs, and cells from a brain area without replacement. To exclude possible interactions we avoided sampling multiple sessions recorded from the same mouse by sampling the mice and then randomly sampling from the sessions of each mouse, if there were more than one. For this dataset, we did not use any further processing of the data.
As variants for this simulation, we performed all combinations of the following factors: 20, 40, or 80 cells per experiment; 5, 10, or 15 mice; 10, 20, or 40 stimuli; 10, 20, or 40 stimulus repeats; the four types of noise covariance estimates; four types of rdm comparison: cosine similarity, correlation, whitened cosine similarity, and whitened correlation; whether the bootstrap was corrected; and the six brain areas. This resulted in 3 × 3 × 3 × 3 × 4 × 4 × 2 × 6 = 15,552 simulation conditions for which we simulated 100 simulations each.
As models for the simulations, we used the average RDM for each brain area as a fixed RDM model for that brain area. Thus the models are not independent from the data in our main simulation. This is not problematic for checking the integrity of our inference methods, but does not show that we can indeed differentiate brain areas based on their RDMs. To show that retrieving the brain area is possible as displayed in Figure 7b in the main text, we performed leaveoneout crossvalidation across mice, that is we chose the RDM models for the brain areas based on all but one mice and evaluated the RDM correlation with the left out mouse’s RDM.
Appendix 1
Matrixnormal simulations
To establish the validity of our modelcomparative frequentist inference, we need to look at the falsealarm rate for data that is generated under the assumption that the null hypothesis H_{0} is true, that is that a model has chance performance in expectation or that two models, predict distinct RDMs, but achieve equal RDM prediction accuracy in expectation. In the deepneuralnetworkbased simulations and the dataresampling simulations in the main text, we are not able to generate such data. Here, we used a matrixnormal model, as a simpler simulation scheme for RSA in which we can enforce these null hypotheses.
We started by specifying a desired RDM for our data that fulfills the null hypothesis for the model(s). We then exploit the relationship between the RDM and the covariance matrix between conditions (Diedrichsen and Kriegeskorte, 2017) to find a covariance matrix that results in the given RDM and generate responses with this covariance between conditions. This random pattern will then have the desired (squared Euclidean) RDM.
Concretely, the secondmoment matrix $\mathbf{G}$ of inner products among conditionrelated patterns across voxels can be computed from the squared Euclideandistance matrix $\mathbf{D}$ as follows:
where $\mathbf{H}={\mathbf{I}}_{k}\frac{1}{K}{\mathbf{1}}_{k}$ is a centering matrix with $\mathbf{1}}_{k$ being a square matrix of ones. A dataset with this covariance across conditions has $\mathbf{D}$ as its squared Euclidean RDM (Diedrichsen and Kriegeskorte, 2017). We can easily generate Gaussian data with a given secondmoment matrix and can thus generate data with any desired RDM.
Comparison against 0
To generate H_{0} data for testing our comparisons of models against 0, we choose both the model and the data RDMs as the distances between independent drawn Gaussian noise samples.
Model comparisons
To generate H_{0} data for model comparisons, we first generate two random model RDMs from independent standard normal noise data. We then normalize the model RDMs to have 0 mean and standard deviation of 1. Then we average the two RDMs, which yields a matrix with equal correlation to the two models. As a last step, we then subtract the minimum, to yield only positive distances and add the maximum distance to all distances once, such that the triangle inequality is guaranteed. As this last step only shifted the distance vector by a constant, the final distance vector still has the exact same correlation with the two model predictions. These methods effectively draw the covariance over conditions from a standardized Wishart distribution with as many degrees of freedom as the number of measurement channels.
Random conditions
To generate H_{0} data for model comparisons with variance due to stimulus selection, we created two models for a large set of 1000 conditions, and generated a data RDM and covariance matrix that would yield equal performance as for the other modelcomparison simulations. We then sampled a random subset of the conditions for each simulated experiment.
Data generation
In all cases, we find a new configuration of data points that produce the desired RDM for each subject by converting the RDM into the secondmoment matrix via equation 17 and drawing random normal data as described at the beginning of this section. We then add additional i.i.d. normal ‘measurement noise’ to each entry of the data matrix. From this data matrix we then compute a squared Euclideandistance RDM per subject and use this as the data RDM to enter our inference process. Finally, we run our inference methods on these data RDMs and the original model RDMs to check whether the falsepositive rate matches the nominal level.
Selected conditions
For each test and setting we generated 50 randomly drawn model RDMs and 100 datasets for each of these RDMs. We always used 200 measurement dimensions and tested all combinations of the following factors: 5, 10, 20, or 40 subjects, 5, 20, 80, or 160 conditions and all test types. As tests we used percentile tests and $t$tests based on bootstrapping both dimensions, subjects only or conditions only, a standard $t$test across subjects and a Wilcoxon ranksum test. For the corrected bootstrap, we only used the $t$test based on the estimated variances, because we cannot draw bootstrap samples based on our correction.
Results
All test results are shown in Appendix 1—figure 1. They mostly turned out as expected. The classical $t$ and Wilcoxon tests performed very similar to the bootstrap tests based on subjects. For the tests against chance performance and the model comparisons with fixed conditions the falsepositive rates are all close to the nominal 5%. However, we observed some inflated falsepositive rates for the bootstraps at small sample sizes: About 7% when using the $t$test and up to 12% for the simple percentile bootstrap test. These slightly too large falsepositive rates are due to the bootstrap estimating the biased variance estimate (dividing by $N$ instead of $N1$). For more than 20 subjects, we cannot distinguish the percentage from 5% anymore. For the $t$test and Wilcoxon ranksum test, there were no such caveats as they consistently achieved a falsepositive rate of about 5%.
Once we introduce variance due to stimulus selection by random sampling of the conditions (Appendix 1—figure 1 third row), all methods based solely on the variance across subjects fail catastrophically with falsepositive rates of up to 60% that grow with the number of tested subjects. This effect demonstrates the need to include the bootstrap across conditions into the evaluation.
When bootstrap resampling the conditions, the tests were conservative, achieving falsepositive rates below 1%, lower than the nominal 5% (at the expense of power). This held whether or not subjects were treated as a random effect: The $t$tests based on either the corrected or the uncorrected 2factor bootstrap similarly had falsepositive rates below 1%. This conservatism is expected for the tests against chance and the model comparisons with fixed conditions, because these simulations contained no true variation due to sampling of conditions. All techniques that include resampling the conditions also remain valid and conservative in the random conditions simulations that add some variation due to the condition choice. In particular, the corrected bootstrap remains conservative despite yielding strictly lower variance estimates than the uncorrected bootstrap.
We additionally ran a similar simulation, to test the tests against the noise ceiling, which is not displayed in the figure, but the results of this simulation are quickly summarized: We generated a single random model and used the same RDM also for data generation. In these data, the lower noise ceiling never significantly outperformed the true model indicating that the comparison against the lower noise ceiling is a very conservative test. This is most likely due to the difference between the lower bound on the noise ceiling and the true noise ceiling.
Poisson KLdivergence
Instead of the Gaussian variability implied by the Euclidean and Mahalanobis dissimilarity measures, noise is often assumed to be Poisson or at least to have its variance increase linearly with mean activation. This is used primarily when the spiking variability of neurons is thought to be the main noise source as in electrophysiological recordings. For this case, we discuss two possible solutions.
The first alternative, discussed by Kriegeskorte and Diedrichsen, 2019a is to use a variance stabilizing transform, that is to apply a square root to all dimensions of all representations and use an RDM based on the transformed values. This has the advantage, that the covariances can be taken into account.
The second alternative, which we introduce here, is to use a symmetrized KLdivergence of Poisson distributions with mean firing rates given by the feature values. This approach automatically takes the increased variance at larger activation levels into account and inherits nice informationtheoretic and decodingbased interpretations from the KLdivergence.
The KLdivergence of two Poisson distributions with mean rates ${\lambda}_{1}$ and ${\lambda}_{2}$ is given by:
Based on this we can compute the symmetrized version of the KL:
To get a crossvalidated version of this dissimilarity we can calculate the difference in logarithms from one crossvalidation fold and the difference between raw values for a different fold and average across all pairs of different crossvalidation folds.
This KLdivergencebased dissimilarity is theoretically more interpretable than the squareroot transform, but comes with two small drawbacks: First, the underlying firing rates cannot be 0 as a Poisson distribution which never fires is infinitely different from all others. This can be easily fixed by using a weak prior on the firing rate, which results in a nonzero estimated firing rate. Second, there is no straightforward way to include a noise covariance into the dissimilarity. While such noise correlations are much weaker than correlations between nearby voxels in fMRI or nearby electrodes in MEG, correlated noise may still reduce or enhance discriminability based on large neural populations (Averbeck et al., 2006; Kriegeskorte and Wei, 2021). There might be situations when the need to model noise correlations is a good reason to prefer the squareroot transform.
Spearman’s ${\rho}_{a}$
Nili et al., 2014 recommended Kendall’s ${\tau}_{a}$ as the RDM comparator over other rank correlation coefficients whenever any of the models predicts tied ranks. Kendall’s ${\tau}_{a}$ does not prefer predictions with tied ranks over random orderings of the same entries in expectation, making it a valid RDM comparator when any model predicts the same dissimilarity for any pair of conditions. However, Kendall’s $\tau $type correlation coefficients are considerably slower to compute than Spearman’s $\rho $type correlation coefficients. Moreover, finding the RDM with the highest average ${\tau}_{a}$ for a given set of data RDMs (for computing noise ceilings) is equivalent to the Kemeny–Young method for preference voting (Kemeny, 1959; Young and Levenglick, 1978), which is NPhard and in practice too slow to compute for our application (Ali and Meilă, 2012).
Here, we propose using the expectation of Spearman’s $\rho $ under random tie breaking as the RDM comparator instead. The coefficient ${\rho}_{a}$ was described by Kendall, 1948, chapter 3.8 and is derived below. For a vector $\mathbf{x}\in {\mathbb{R}}^{n}$, let $Rae(\mathbf{x})$ be the distribution of randomamongequals ranktransforms, where each unique value in $\mathbf{x}$ is replaced with its integer rank and, in the case of a set of tied values, a random permutation of the corresponding ranks. For each draw $\stackrel{~}{\mathbf{x}}\sim Rae(\mathbf{x})$, thus, $x}_{i}<{x}_{j}\Rightarrow {\stackrel{~}{x}}_{i}<{\stackrel{~}{x}}_{j$ . However, for pairs $(i,j)$, where ${x}_{i}={x}_{j}$, the ranks will fall in order $\stackrel{~}{x}}_{i}<{\stackrel{~}{x}}_{j$ or $\stackrel{~}{x}}_{i}>{\stackrel{~}{x}}_{j$ with equal probability. The set of values $\{{\stackrel{~}{x}}_{i}1\le i\le n\}$ is $\{1,\mathrm{\dots},n\}$. The ${\rho}_{a}$ correlation coefficient is defined as:
For this expectation, we can derive a direct formula:
where $\overline{\mathbf{x}}$ and $\overline{\mathbf{y}}$ contain the ranks of $\mathbf{x}$ and $\mathbf{y}$, respectively, with tied values represented by tied average ranks. Thus, computing ${\rho}_{a}$ does not require drawing actual random tie breaks to sample $\stackrel{~}{\mathbf{x}}$ and $\stackrel{~}{\mathbf{y}}$.
The RDM comparator ${\rho}_{a}$ provides a general solution for rankbased evaluation that is correct in the presence of tied predictions and fast to compute. In addition, the mean of ranktransformed RDMs provides the bestfitting RDM, obviating the need for optimization and approximation in computing the noise ceiling.
Expected RDM under random feature weighting
If measurements weight features identically and independently, we can directly compute the expected squared Euclidean RDM for the measurements. We use this calculation both to justify a linear weighting model and to compute the correct models in some of our simulations.
Formally, we can show that this is true by the following calculation: Let ${w}_{iv}$ be the weighting for the $i$ th feature in the $v$ th voxel for two patterns $\mathbf{x}$ and $\mathbf{y}$ with feature values x_{i} and y_{i}. Then the expected squared Euclidean distance in voxel space can be written as:
This means that the expected RDM is a linear combination of the RDM based on individual features and the RDM based on the average across features weighted by the variance and the squared expected value of the weight distribution, respectively. As averaging or filtering across space is interchangeable with feature weighting, we can also use this calculation to compute the expected RDM for models that combine averaging over space and across features as in our simulations. Then the RDM at some level of averaging over space is still always a linear combination of the featureaveraged and featureseparate RDMs at that level of spatial averaging.
Choosing experimental design parameters for sensitive model adjudication
To quantify how much increasing the number of measurements along one of the experimental factors improved SNR for adjudication among models (Equation 10), we can use the slope of a regression line for the SNR against the number of measurements in log–log space. This slope corresponds to the exponent of the powerlaw relationship (Figure 4 in the main text). We observe that increasing the number of conditions ($slope=0.935$) is slightly more effective than increasing the number of subjects ($slope=0.690$), and increasing the number of repeated measurements is most effective ($slope=1.581$), probably due to the crossvalidation we employ. The crossvalidation across repeated measurements we use to yield unbiased distance estimates produces $\frac{m}{m1}$ times the variance in the original RDM entries compared to the biased estimates without crossvalidation. This provides an additional benefit for increasing the number of repeated independent measurements.
The modeldiscriminability SNR depends on the sources of nuisance variation included in the simulations. In these particular simulated experiments, resampling the conditions set induces more nuisance variation than resampling the subjects set (Figure 4g). This indicates that inference generalizing across conditions is harder than inference generalizing across subjects in these simulations. For small noise levels, the SNR is much higher when nothing is varied over repetitions or only subjects are varied than when the conditions are also varied. At large measurement noise levels this effect disappears, because the measurement noise becomes the dominant factor.
The intuition to explain our observations about the SNR is that it is most helpful to take more samples along the dimension which currently causes most variation in the results. Clearly, our variation in conditions caused more variance than our variation in voxel sampling to simulate subject variability. As a result, to boost the modeldiscriminability SNR, increasing the number of conditions is more effective than increasing the number of subjects by the same factor. Results also reveal that we simulated sufficiently high noise levels for a reduction in measurement noise through more repetitions to remain effective. Beyond noise reduction through averaging, more repetitions are also more profitable due to the crossvalidated distances, which loses less efficiency the larger the sample becomes Diedrichsen et al., 2020.
Additionally, we observe that an intermediate voxel size (Gaussian kernel width) yields the highest model discriminability as measured by the SNR (Figure 4h in the main text). When each voxel averages over a large area, information in finegrained patterns of activity is lost, which is detrimental to model selection. The falloff for very small voxels in our simulations is due to randomly sampled voxels covering the feature map less well leading to greater variability. In real fMRI experiments, we do not expect this effect to play a role, as we expect voxels to always cover the wholebrain area, such that smaller voxels correspond to more voxels, which are clearly beneficial for better model selection. We do nonetheless expect a fall of for small voxel sizes for real fMRI experiments as well, because small voxel sizes lead to a steep increase in instrumental noise for fMRI and the BOLD signal itself is not perfectly local to the neurons that cause it (Bodurka et al., 2007; Chaimow et al., 2018; Weldon and Olman, 2021). Thus, the dependence on voxel averaging size is what we expect for real fMRI experiments as well, albeit for different reasons. Also, it might be informative for other measurement methods like electrophysiology, that a local average can be preferable over perfectly local measurements for model selection, when the number of measured channels is limited.
Choosing an RDM comparator for sensitive model adjudication
An important question is how to measure RDM prediction accuracy for model evaluation. We ran the same analysis with different RDM comparators on the same datasets in a separate simulation.
We presented the deep neural network with our standard set of stimuli and simulated data for 10 subjects, 40 conditions, and 2 repeats, changing which parameters varied over repetitions of the experiment as in the main simulation. We omitted all bootstrapping, because the bootstrap variances are not needed to estimate model discriminability (SNR, Equation 10) for different RDM comparators. To improve comparability between different generalization conditions, we enforced that the first simulation for each generalization condition used the same conditions and subjects. The other 99 simulations then varied conditions and subject according to the required generalization. The different RDM comparators were applied to the same simulated experimental data.
We found that different types of rank correlation are all similarly good at discriminating models (Appendix 1—figure 2). Proper evaluation of models predicting tied dissimilarities requires Kendall’s ${\tau}_{a}$ (Nili et al., 2014) or ${\rho}_{a}$, a rarely used variant of Spearman’s rank correlation coefficient without correction for ties, analogous to Kendall’s ${\tau}_{a}$ (derivation in Appendix 3). We recommend ${\rho}_{a}$ over ${\tau}_{a}$ for its lower computational cost and analytically derived noise ceiling.
If we are willing to assume that the representational dissimilarity estimates are on an interval scale, we expect to be able to achieve greater modelperformance discriminability with RDM comparators that are not just sensitive to ranks. In this context, we compare the Pearson correlation and cosine similarity, and their whitened variants, which we introduced recently (Diedrichsen et al., 2020). The whitened measures boost the power of inferential model comparisons, by accounting for the anisotropic sampling distribution of RDM estimates. To further increase our modelcomparative power, both the whitened and the unwhitened cosine similarity assume a ratio scale for the representational dissimilarities, which requires that indistinguishable conditions have an expected dissimilarity of zero. This assumption is justified when using a crossvalidated distance estimator (Nili et al., 2014; Walther et al., 2016), which provides unbiased dissimilarity estimates with an interpretable zero point.
Consistent with the theoretical expectations, we observe greatest modelperformance discriminability for the whitened cosine similarity, which assumes ratioscale dissimilarities, intermediate discriminability for the whitened Pearson correlation, and somewhat lower modelperformance discriminability for the unwhitened Pearson correlation and the unwhitened cosine similarity. Rank correlation coefficients performed surprisingly well, matching or even outperforming unwhitened Pearson correlation and unwhitened cosine similarity (Appendix 1—figure 2). They provide an attractive alternative to the whitened criteria when researchers wish to make weaker assumptions about their model predictions.
Data availability
No new data were collected for this study. The code to run both the analysis we do and our simulations is available with our rsatoolbox (https://github.com/rsagroup/rsatoolbox, copy archived at Schütt, 2023). The data for the fMRIbased resampling analysis are available from Horikawa and Kamitani, 2017 and the data for the calcium imaging are available from the Allen institutes website.

Open NeuroID ds001246. Generic decoding of seen and imagined objects using hierarchical visual features.

Allen Brain AtlasID visualcoding. A largescale standardized physiological survey reveals functional organization of the mouse visual cortex.
References

A nanoelectrode array for obtaining intracellular recordings from thousands of connected neuronsNature Biomedical Engineering 4:232–241.https://doi.org/10.1038/s4155101904557

Machine learning for neuroimaging with scikitlearnFrontiers in Neuroinformatics 8:14.https://doi.org/10.3389/fninf.2014.00014

Experiments with Kemeny ranking: What works when?Mathematical Social Sciences 64:28–40.https://doi.org/10.1016/j.mathsocsci.2011.08.008

Neural correlations, population coding and computationNature Reviews. Neuroscience 7:358–366.https://doi.org/10.1038/nrn1888

Magnetoencephalography for brain electrophysiology and imagingNature Neuroscience 20:327–339.https://doi.org/10.1038/nn.4504

Challenges and opportunities of mesoscopic brain mapping with fMRICurrent Opinion in Behavioral Sciences 40:189–200.https://doi.org/10.1016/j.cobeha.2021.06.002

Deep convolutional models improve predictions of macaque V1 responses to natural imagesPLOS Computational Biology 15:e1006897.https://doi.org/10.1371/journal.pcbi.1006897

ConferenceHow well do deep neural networks trained on object recognition characterize the Mouse visual system?Advances in Neural Information Processing Systems.

Adjudicating between facecoding models with individualface fMRI responsesPLOS Computational Biology 13:e1005604.https://doi.org/10.1371/journal.pcbi.1005604

Classification and geometry of general perceptual manifoldsPhysical Review X 8:031003.https://doi.org/10.1103/PhysRevX.8.031003

Neural population geometry: An approach for understanding biological and artificial neural networksCurrent Opinion in Neurobiology 70:137–144.https://doi.org/10.1016/j.conb.2021.10.010

Resolving human object recognition in space and timeNature Neuroscience 17:455–462.https://doi.org/10.1038/nn.3635

ConferenceThe Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence2019 Conference on Cognitive Computational Neuroscience.https://doi.org/10.32470/CCN.2019.10180

The representation of biological classes in the human brainThe Journal of Neuroscience 32:2608–2618.https://doi.org/10.1523/JNEUROSCI.554711.2012

Deep learning for electroencephalogram (EEG) classification tasks: A reviewJournal of Neural Engineering 16:031001.https://doi.org/10.1088/17412552/ab0ab5

Comparing representational geometries using whitened unbiaseddistancematrix similarityNeurons, Behavior, Data Analysis, and Theory 5:27664.https://doi.org/10.51628/001c.27664

Representation is representation of similaritiesThe Behavioral and Brain Sciences 21:449–467.https://doi.org/10.1017/s0140525x98001253

Hand use predicts the structure of representations in sensorimotor cortexNature Neuroscience 18:1034–1040.https://doi.org/10.1038/nn.4038

The neural representational geometry of social perceptionCurrent Opinion in Psychology 24:83–91.https://doi.org/10.1016/j.copsyc.2018.10.003

Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in pythonFrontiers in Neuroinformatics 5:13.https://doi.org/10.3389/fninf.2011.00013

ConferenceFlexible, MultiShank Stacked Array for HighDensity OminiDirectional Intracortical Recording2021 IEEE 34th International Conference on Micro Electro Mechanical Systems (MEMS).. pp. 540–543.https://doi.org/10.1109/MEMS51782.2021.9375160

Decoding neural representational spaces using multivariate pattern analysisAnnual Review of Neuroscience 37:435–456.https://doi.org/10.1146/annurevneuro062012170325

Generic decoding of seen and imagined objects using hierarchical visual featuresNature Communications 8:15037.https://doi.org/10.1038/ncomms15037

Deep supervised, but not unsupervised, models may explain IT cortical representationPLOS Computational Biology 10:e1003915.https://doi.org/10.1371/journal.pcbi.1003915

Fixed versus mixed RSA: Explaining visual representations by fixed and mixed feature sets from shallow and deep computational modelsJournal of Mathematical Psychology 76:184–197.https://doi.org/10.1016/j.jmp.2016.10.007

Mindboggling morphometry of human brainsPLOS Computational Biology 13:e1005350.https://doi.org/10.1371/journal.pcbi.1005350

ConferenceSimilarity of Neural Network Representations RevisitedProceedings of the 36th International Conference on Machine Learning.

Representational similarity analysis  connecting the branches of systems neuroscienceFrontiers in Systems Neuroscience 2:4.https://doi.org/10.3389/neuro.06.004.2008

Representational geometry: integrating cognition, computation, and the brainTrends in Cognitive Sciences 17:401–412.https://doi.org/10.1016/j.tics.2013.06.007

Deep neural networks: A new framework for modeling biological vision and brain information processingAnnual Review of Vision Science 1:417–446.https://doi.org/10.1146/annurevvision082114035447

Inferring braincomputational mechanisms with models of activity measurementsPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 371:20160278.https://doi.org/10.1098/rstb.2016.0278

Cognitive computational neuroscienceNature Neuroscience 21:1148–1160.https://doi.org/10.1038/s4159301802105

Peeling the onion of brain representationsAnnual Review of Neuroscience 42:407–432.https://doi.org/10.1146/annurevneuro080317061906

Interpreting encoding and decoding modelsCurrent Opinion in Neurobiology 55:167–179.https://doi.org/10.1016/j.conb.2019.04.002

Neural tuning and representational geometryNature Reviews. Neuroscience 22:703–718.https://doi.org/10.1038/s41583021005023

ConferenceImagenet classification with deep convolutional neural networksAdvances in neural information processing systems.

BookBrainlike object recognition with highperforming shallow recurrent AnnsIn: Wallach H, Larochelle H, Beygelzimer A, AlchéBuc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 1–12.

Evaluation of Noisy DataJournal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 1:76–85.https://doi.org/10.1137/0701007

Honey, i shrunk the sample covariance matrixThe Journal of Portfolio Management 30:110–119.https://doi.org/10.3905/jpm.2004.110

ConferenceDeep neural networks trained on ecologically relevant categories better explain human ITConference on Cognitive Computational Neuroscience.

Encoding and decoding in fMRINeuroImage 56:400–410.https://doi.org/10.1016/j.neuroimage.2010.07.073

A toolbox for representational similarity analysisPLOS Computational Biology 10:e1003553.https://doi.org/10.1371/journal.pcbi.1003553

Beyond mindreading: multivoxel pattern analysis of fMRI dataTrends in Cognitive Sciences 10:424–430.https://doi.org/10.1016/j.tics.2006.07.005

Promises and limitations of human intracranial electroencephalographyNature Neuroscience 21:474–483.https://doi.org/10.1038/s4159301801082

BookPytorch: an imperative style, highperformance deep learning libraryIn: Wallach H, Larochelle H, Beygelzimer A, d’ F, Fox E, Garnett R, editors. Advances in Neural Information Processing System. Curran Associates, Inc. pp. 8024–8035.

The neural code for face orientation in the human fusiform face areaThe Journal of Neuroscience 34:12155–12167.https://doi.org/10.1523/JNEUROSCI.315613.2014

A shrinkage approach to largescale covariance matrix estimation and implications for functional genomicsStatistical Applications in Genetics and Molecular Biology 4:Article32.https://doi.org/10.2202/15446115.1175

SoftwareRepresentational Similarity Analysis 3.0, version swh:1:rev:01e767c432e77633fe31304201718afce6a6ff9cSoftware Heritage.

Putting big data to good use in neuroscienceNature Neuroscience 17:1440–1441.https://doi.org/10.1038/nn.3839

Challenges and opportunities for largescale electrophysiology with Neuropixels probesCurrent Opinion in Neurobiology 50:92–100.https://doi.org/10.1016/j.conb.2018.01.009

How advances in neural recording affect data analysisNature Neuroscience 14:139–142.https://doi.org/10.1038/nn.2731

Diverse deep neural networks all predict human inferior temporal cortex well, after training and fittingJournal of Cognitive Neuroscience 33:2044–2064.https://doi.org/10.1162/jocn_a_01755

Measuring and testing dependence by correlation of distancesThe Annals of Statistics 35:2769–2794.https://doi.org/10.1214/009053607000000505

Decoding patterns of human brain activityAnnual Review of Psychology 63:483–509.https://doi.org/10.1146/annurevpsych120710100412

N4ITK: improved N3 bias correctionIEEE Transactions on Medical Imaging 29:1310–1320.https://doi.org/10.1109/TMI.2010.2046908

Ultrahigh field and ultrahigh resolution fMRICurrent Opinion in Biomedical Engineering 18:100288.https://doi.org/10.1016/j.cobme.2021.100288

Computational neuroimaging and population receptive fieldsTrends in Cognitive Sciences 19:349–357.https://doi.org/10.1016/j.tics.2015.03.009

Forging a path to mesoscopic imaging success with ultrahigh field functional magnetic resonance imagingPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 376:20200040.https://doi.org/10.1098/rstb.2020.0040

Complete functional characterization of sensory neurons by system identificationAnnual Review of Neuroscience 29:477–505.https://doi.org/10.1146/annurev.neuro.29.051605.113024

Using goaldriven deep learning models to understand sensory cortexNature Neuroscience 19:356–365.https://doi.org/10.1038/nn.4244

The generalizability crisisThe Behavioral and Brain Sciences 45:1–37.https://doi.org/10.1017/S0140525X20001685

A consistent extension of condorcet’s election principleSIAM Journal on Applied Mathematics 35:285–300.https://doi.org/10.1137/0135023

Segmentation of brain MR images through a hidden Markov random field model and the expectationmaximization algorithmIEEE Transactions on Medical Imaging 20:45–57.https://doi.org/10.1109/42.906424
Decision letter

John T SerencesReviewing Editor; University of California, San Diego, United States

Timothy E BehrensSenior Editor; University of Oxford, United Kingdom
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Decision letter after peer review:
Thank you for submitting your article "Statistical Inference on Representational Geometries" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
The authors appear to get lost in details and some of the key methods are hard to find in the Methods section. This will make the paper hard to follow for those not super familiar with the analysis approach.
Expanding the section on testing for the presence of information would be very useful to broaden the appeal.
Expanding the discussion and exposition of the generalizability of the statistical tests (see Reviewer #1).
Reviewer #1 (Recommendations for the authors):
Schütt and colleagues introduce a new method for statistical inference on representational geometries based on a crossvalidated twofactor bootstrap that allows for generalization across both participants and stimuli while allowing the fitting of flexible models. In a series of elegant simulations and empirical analyses on existing datasets, the authors validate the method statistically.
Strengths:
– The authors are clearly experts on the methods, and the statistical approach significantly improves upon the state of the art of existing methods in terms of generalization across participants and stimuli.
– There is a potential for this method to not only become a new standard for analyses of representational geometries but to be applicable to different methodological approaches that not only aim at generalizing to new participants but also to new stimuli.
– The treatment of the topic is very thorough, with both extensive simulations as well as validation using functional MRI and calcium imaging data.
– The authors introduce a number of complex yet highly informative and useful new methodological advances, such as the (re)discovery of Spearman rho_a for improved comparison of dissimilarities as compared to Kendall's tau_a.
Weaknesses:
– Overall, while the introduction starts off very nicely, the manuscript ends up being rather difficult to read. The authors appear to get lost in details in the main text. Other critical methodological details are buried in the Methods section. Specifically, the key methodological advance, the twofactor bootstrap, is barely explained in the main text, and in my reading, it is never mentioned what data are bootstrapped (i.e. original data, rows and columns in the RDM, individual cells in the RDM).
– The authors assume a lot of knowledge from the reader, often referring to very recent work or preprints in a matteroffact kind of way. While this can be seen as a strength and highlights the timeliness of the work, the constant mix of more established and recent methods makes it much harder for the reader to understand what is actually introduced in this work. This separation is solved nicely in the introduction but does not appear to continue into the Results section.
– Representational similarity analysis is recommended by the authors to be used for model comparison. However, a very common, probably even more common, use case is to test for the presence of information (i.e. is the representational similarity > 0), which, however, is only briefly discussed.
– The validity of the Ttest based on bootstrap estimates for tests against chance seem to assume a null distribution for individual modeldata comparisons that is centered around zero. However, negative similarities cannot be explained by population variance in the population null distribution, which is currently not discussed by the authors.
Are the claims of the authors justified?
– For comparisons between models, the claims of the authors clearly seem to be justified and reflect an important advance in the stateoftheart statistical evaluation of representational geometries.
– That said, I believe that it is important to clarify the open statistical issue of generalizability to the population.
1. I really enjoyed reading this manuscript and believe it will make an important contribution to the field. That said, the authors introduce a lot in this work that is only indirectly related to the statistical analysis framework, and as a consequence, the manuscript is currently quite dense and hard to follow. I think that this manuscript would benefit strongly from a much more focused treatment of the key aspects (the introduction of the new method) and a reduction of the emphasis on advanced methods that are not key to this work (such as the use of reweighting and neuronal population sampling approaches, to name only a few).
I think the issue is that given the flexibility of analyzing representational geometries with RSA, the authors try to be as general as possible and try to encompass all possible use cases in their writing. In addition, the specific use case for a crossvalidated twofactor bootstrap seems to be fitting flexible models, which alone is already quite advanced. I know this is difficult to solve, so I would like to provide one specific recommendation for making the manuscript easier to digest: it would perhaps help to first provide the reader with a quick runthrough, without justifying all steps in detail but only summarizing the approach and the basic motivation for it. Then a more thorough treatment, including relevant parts from the methods section that explains the motivation behind the twofactor bootstrap could follow, again followed by extensive validation. This is just one suggestion for improving clarity.
2. Given the very common use of RSA for testing the presence of effects, rather than model comparison, I think the impact of this work would be strengthened if the authors expanded on their specific use case, even if it is comparably simple (they call this "simple dependence test", which is perhaps a little confusing to the reader).
3. RSA measures the match of one or more model RDMs with a data RDM. For a test against chance, without very specific biases, a negative representational similarity should not be found empirically for subjects and only for a subset of stimuli. Any such effects should thus only be caused by measurement noise or by stimulus variability. I am wondering to what degree this affects the ability to carry out valid inferences against the null at the population level. See Allefeld et al. (2016) for the treatment of a similar problem with decoding accuracies.
4. The introduction would benefit from a better motivation of the method. It seems as if the authors discuss previous work on RDMs but then jump to the introduction of the new method. Did no other method exist before? What were the issues with these methods, and what is the gap that needs to be filled? This would help the reader better understand why they should be reading this work.
5. While valid, the approach appears to be rather conservative, producing very low false positive rates. Are false negatives not a potentially problematic issue in that respect?
Reviewer #2 (Recommendations for the authors):
This paper addresses a major question in computational neuroscience by proposing a novel methodology to test models to explain behavioral/brain data that generalize across conditions and subjects using bootstrapping.
The experiments reported validate the claims of the authors. The methodology is applied and analysed in different available datasets.
I found particularly helpful and thorough the tests with the simulations. However, I found that the reported analysis is focused mainly on the newly proposed method, and this could bring a wider perspective into the picture.
It is with such simulated data that I believe a deeper discussion and possibly adding a comparison to existing methods, such as vanilla RSA and/or linear encoding methods could be reported to support the final discussion on the limitations of such existing methods. This would allow showcasing in which cases this method reveals new conclusions and has lower false positive rates, or in which cases there which method is limited to the experimental paradigm used to obtain the data (e.g., how many participants, repetitions, and conditions).
https://doi.org/10.7554/eLife.82566.sa1Author response
Essential revisions:
The authors appear to get lost in details and some of the key methods are hard to find in the Methods section. This will make the paper hard to follow for those not super familiar with the analysis approach.
We have restructured the manuscript to make it easier to follow. Concretely, we moved the methods parts that are newly introduced here into the main text. We collected all parts that give a general explanation of state of the art representational similarity analysis into a coherent material and methods section (5.1) for readers that are not familiar with the analysis approach yet. And to avoid getting lost in details, we pushed all parts that we did not deem central to our new methods into Appendices. We believe this separation makes the manuscript much more readable.
Additionally, we redistributed the flexible model figure, which one of the reviewers described as the hardest to follow. This is now separated into an early part that explains the method and math and a later section that reports our tests of the method.
Expanding the section on testing for the presence of information would be very useful to broaden the appeal.
Expanding the discussion and exposition of the generalizability of the statistical tests (see Reviewer #1).
We discuss this point in a bit more detail in the paper in the new section 3.3. We agree that this is a test that is frequently and sensibly employed and should be mentioned. In the case we are most interested in, this is a relatively low information test, as we should expect most of our models to be distinguishable from pure noise predictions.
We also discuss in this section, what kind of tests our methods cover exactly and which ones we don’t. This should hopefully make things clearer.
The new section 3.3. contains the following text:
“3.3 Supported tests and implications of test results
Our methods enable comparison of a model’s RDM prediction performance (1) against other models, (2) against the noise ceiling, and (3) against chance performance. The first two of these tests are central to the evaluation of models. The test against chance performance is often also reported, but represents a low bar that we should expect most models to pass. In practice, RDM correlations tend to be positive even for very different representations, because physically highly similar stimuli or conditions tend to be similar in all representations. Just like a significant Pearson correlation indicates a dependency, but does not demonstrate that the dependency is linear, a significant RDM prediction result indicates the presence of stimulus information, but does not lend strong support to the particular model. We should resist interpreting significant prediction performance per se as evidence for a particular model (the singlemodelsignificance fallacy; Kriegeskorte and Douglas (2019)). Theoretical progress instead requires that each model be compared to alternative models and to the noise ceiling. An additional point to note is that the interpretation of chance performance, where the RDM comparator equals 0, depends on the chosen RDM comparator, differing, for example, between the Pearson correlation coefficient and the cosine similarity (Diedrichsen et al., 2020).
RDM comparators like the Pearson correlation and the cosine similarity are related to the distance correlation (Székely, Rizzo, and Bakirov, 2007), a general indicator of mutual information. Like a significant distance correlation, a significant RDM correlation mainly demonstrates that there is some mutual information between the brain region in question and the model representation. For a visual representation, for example, all that is required is for the two representations to contain some shared information about the input images. In contrast to the distance correlation (and other nonnegative estimates of mutual information), however, negative RDM correlations can occur, indicating simply that pairs of stimuli close in one representation tend to be far in the other and vice versa. For any RDM, there is even a valid perfectly anticorrelated RDM (Pearson r = −1), which can be found by flipping the sign of all dissimilarities and adding a large enough value to make the RDM conform to the triangle inequality (which ensures the existence of an embedding of points that is consistent with the anticorrelated RDM). The existence of valid negative RDM correlations is important to the inferential methods presented here because it is required for our assumption of symmetric (t)distributions around the true RDM correlation.
Omnibus tests for the presence of information about the experimental conditions in a brain region have been introduced in previous studies (e.g. Allefeld, Görgen, and Haynes, 2016; Kriegeskorte, Goebel, and Bandettini, 2006; Nili, Walther, Alink, and Kriegeskorte, 2020). Whether stimulus information is present in a region is closely related to the question whether the noise ceiling is significantly larger than 0, indicating RDM replicability. Such tests can sensitively detect small amounts of information in the measured activity patterns and can be helpful to assess whether there is any signal for model comparisons. If we are uncertain whether there is a reliable representational geometry to be explained, we need not bother with model comparisons.
The question whether an individual dissimilarity is significantly larger than zero is equivalent to the question whether the distinction between the two conditions can be decoded from the brainactivity. Decoding analyses can be used for this purpose (Hebart, Görgen, and Haynes, 2015; Kriegeskorte and Douglas, 2019; Naselaris, Kay, Nishimoto, and Gallant, 2011; Tong and Pratte, 2012). Such tests require care because the discriminability of two conditions cannot be systematically negative (Allefeld et al., 2016). This is in contrast to comparisons between RDMs, which can be systematically negative (although, as mentioned above, they tend to be positive in practice).”
Reviewer #1 (Recommendations for the authors):
1. I really enjoyed reading this manuscript and believe it will make an important contribution to the field. That said, the authors introduce a lot in this work that is only indirectly related to the statistical analysis framework, and as a consequence, the manuscript is currently quite dense and hard to follow. I think that this manuscript would benefit strongly from a much more focused treatment of the key aspects (the introduction of the new method) and a reduction of the emphasis on advanced methods that are not key to this work (such as the use of reweighting and neuronal population sampling approaches, to name only a few).
I think the issue is that given the flexibility of analyzing representational geometries with RSA, the authors try to be as general as possible and try to encompass all possible use cases in their writing. In addition, the specific use case for a crossvalidated twofactor bootstrap seems to be fitting flexible models, which alone is already quite advanced. I know this is difficult to solve, so I would like to provide one specific recommendation for making the manuscript easier to digest: it would perhaps help to first provide the reader with a quick runthrough, without justifying all steps in detail but only summarizing the approach and the basic motivation for it. Then a more thorough treatment, including relevant parts from the methods section that explains the motivation behind the twofactor bootstrap could follow, again followed by extensive validation. This is just one suggestion for improving clarity.
As described in our response to the essential revisions above, we restructured the manuscript to make it less dense. In particular, we tried to move as many of the only indirectly related parts to appendices.
We also followed the reviewers suggestion to move the relevant methods sections into the main text and generated a Materials and methods section that gives a proper exposition to the state of the art RSA methods. As most of the content in that section is not new we decided to place this part into the Materials and methods rather than the Results section. Nonetheless, this part should fulfill the purpose of explaining advanced topics like the new evaluations metrics and flexible models to readers that are not familiar with them.
2. Given the very common use of RSA for testing the presence of effects, rather than model comparison, I think the impact of this work would be strengthened if the authors expanded on their specific use case, even if it is comparably simple (they call this "simple dependence test", which is perhaps a little confusing to the reader).
We hope this point is resolved with our new section 3.3. And the answer we give to the general point above.
3. RSA measures the match of one or more model RDMs with a data RDM. For a test against chance, without very specific biases, a negative representational similarity should not be found empirically for subjects and only for a subset of stimuli. Any such effects should thus only be caused by measurement noise or by stimulus variability. I am wondering to what degree this affects the ability to carry out valid inferences against the null at the population level. See Allefeld et al. (2016) for the treatment of a similar problem with decoding accuracies.
The problem the reviewer highlights here is very important for tests for the presence of any information in the data. We do not cover this kind of test in this manuscript though. We are always testing model performances here and model performances can be reliably negative. We do see that this distinction is fairly subtle and that we did not state this clearly in our previous manuscript. To make this distinction clearer, we added some discussion on which exact tests we cover in this paper in our new section 3.3:
“In contrast to the distance correlation (and other nonnegative estimates of mutual information), however, negative RDM correlations can occur, indicating simply that pairs of stimuli close in one representation tend to be far in the other and vice versa. For any RDM, there is even a valid perfectly anticorrelated RDM (Pearson r = −1), which can be found by flipping the sign of all dissimilarities and adding a large enough value to make the RDM conform to the triangle inequality (which ensures the existence of an embedding of points that is consistent with the anticorrelated RDM). The existence of valid negative RDM correlations is important to the inferential methods presented here because it is required for our assumption of symmetric (t)distributions around the true RDM correlation.”
4. The introduction would benefit from a better motivation of the method. It seems as if the authors discuss previous work on RDMs but then jump to the introduction of the new method. Did no other method exist before? What were the issues with these methods, and what is the gap that needs to be filled? This would help the reader better understand why they should be reading this work.
This connection does indeed seem important. We added a paragraph that explicitly states the limitations of previous RSA inference methods:
“However, existing statistical inference methods for RSA (step 3) have important limitations. Established RSA inference methods (Nili et al., 2014) provide a noise ceiling and enable comparisons of fixed models with generalization to new subjects and conditions. However, they cannot handle flexible models, can be severely suboptimal in terms of statistical power, and have not been thoroughly validated using simulated or real data where ground truth is known. Addressing these shortcomings poses three substantial challenges. (1)
Modelcomparative inference with generalization to new conditions is not trivial because new conditions extend an RDM and the evaluation depends on pairwise dissimilarities, thus violating independence assumptions. (2) Standard methods for statistical inference do not handle multiple random factors — subjects and conditions in RSA. (3) Flexible models, that is models that have parameters enabling them to predict different RDMs, are essential for RSA (Diedrichsen, Yokoi, and Arbuckle, 2018; Kriegeskorte and Diedrichsen, 2016). Evaluation of such models requires methods that are unaffected by overfitting to either subjects or conditions to avoid a bias in favor of more flexible models.”
5. While valid, the approach appears to be rather conservative, producing very low false positive rates. Are false negatives not a potentially problematic issue in that respect?
The reviewer here refers to the simple simulations based on matrix normal data, which we now present in Appendix 1—figure 1 for model comparisons. In this case, there were indeed some methods that are overly conservative.
There are three reasons why we believe that this is less problematic than it seems:
1) This impression is due to the simplified simulations. In these simulations we enforced that the two models performed equally well for the exact stimuli in the RDM. This is even more equal than one would expect for a random sample of stimuli for models that perform equally on average. For simulations that include variation due to the stimuli, the methods that look good in our earlier simulations can fail catastrophically, as we demonstrate with a new simulation in which we create such models that perform equally only on average (New third row of Figure 8). Thus the existing, less conservative methods are in fact invalid.
2) The new methods we propose in this manuscript are strictly more progressive than existing methods for the 2factor bootstrap. We explicitly bound the variance estimates with the uncorrected 2factor bootstrap. Thus, we produce fewer false negatives than existing valid methods.
3) We show in a diverse set of more realistic simulations, that we can estimate the variance quite accurately. Given an unbiased variance estimate and fairly simple unimodal sampling distributions for the model evaluations, we believe that there is not that much room for improvement left.
Overall, we are sure that we improve over existing methods, also in terms of false negative rates. Nonetheless, it is true that there could be other methods that gain even more power than the methods we propose by raising the false positive rate to the nominal value.
Reviewer #2 (Recommendations for the authors):
This paper addresses a major question in computational neuroscience by proposing a novel methodology to test models to explain behavioral/brain data that generalize across conditions and subjects using bootstrapping.
The experiments reported validate the claims of the authors. The methodology is applied and analysed in different available datasets.
I found particularly helpful and thorough the tests with the simulations. However, I found that the reported analysis is focused mainly on the newly proposed method, and this could bring a wider perspective into the picture.
It is with such simulated data that I believe a deeper discussion and possibly adding a comparison to existing methods, such as vanilla RSA and/or linear encoding methods could be reported to support the final discussion on the limitations of such existing methods. This would allow showcasing in which cases this method reveals new conclusions and has lower false positive rates, or in which cases there which method is limited to the experimental paradigm used to obtain the data (e.g., how many participants, repetitions, and conditions).
We believe there are two aspects to this question, which are largely independent:
First, we can compare to other inference methods for RSA to illustrate their limitations. The primary reason to do this is to show that these limited methods indeed fail for the simultaneous generalization to subjects and stimuli that we cover now.
We have added one such simulation result in Appendix 1—figure 1 in the Appendix. There we created two models that are equally close to the data across a large set of stimuli and then subsample these stimuli to create simulated experiments. This leads to vastly inflated false alarm rates for differences between the two models, for all inference methods that do not handle variance due to stimulus selection explicitly. Preliminary simulations based on the deep neural networks also showed that all methods based on the variance across subjects do indeed fail. These simulations drive home the point that earlier methods were insufficient, but we felt they were distracting from our main point that the methods we now recommend work. Also, the severity of these failures depends on how much variability is due to subject choice and stimulus choice respectively and we can make all methods that are based on the variability across subjects fail arbitrarily badly, depending on the simulation parameters, such that the precise size of the error is rather meaningless.
Second, it would be interesting to compare RSA to other methods that are used for comparisons between representational models, like encoding models to estimate which method performs best for model selection. Such comparisons feel out of scope for this manuscript unfortunately. Which model performance metric is best suited for comparing representational models will depend on many technical details of the model evaluations, which we would not want to add to this already dense manuscript. Furthermore, the statistical problems we tackle here apply equally to any other model performance metrics and the question which one works best is largely orthogonal to our questions here.
https://doi.org/10.7554/eLife.82566.sa2Article and author information
Author details
Funding
Deutsche Forschungsgemeinschaft (Forschungsstipendium SCHU 3351/11)
 Heiko H Schütt
The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 John T Serences, University of California, San Diego, United States
Version history
 Preprint posted: December 16, 2021 (view preprint)
 Received: August 9, 2022
 Accepted: August 7, 2023
 Version of Record published: August 23, 2023 (version 1)
Copyright
© 2023, Schütt et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 670
 Page views

 87
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Neuroscience
Consumption of food and water is tightly regulated by the nervous system to maintain internal nutrient homeostasis. Although generally considered independently, interactions between hunger and thirst drives are important to coordinate competing needs. In Drosophila, four neurons called the interoceptive subesophageal zone neurons (ISNs) respond to intrinsic hunger and thirst signals to oppositely regulate sucrose and water ingestion. Here, we investigate the neural circuit downstream of the ISNs to examine how ingestion is regulated based on internal needs. Utilizing the recently available fly brain connectome, we find that the ISNs synapse with a novel celltype bilateral Tshaped neuron (BiT) that projects to neuroendocrine centers. In vivo neural manipulations revealed that BiT oppositely regulates sugar and water ingestion. Neuroendocrine cells downstream of ISNs include several peptidereleasing and peptidesensing neurons, including insulin producing cells (IPCs), crustacean cardioactive peptide (CCAP) neurons, and CCHamide2 receptor isoform RA (CCHa2RRA) neurons. These neurons contribute differentially to ingestion of sugar and water, with IPCs and CCAP neurons oppositely regulating sugar and water ingestion, and CCHa2RRA neurons modulating only water ingestion. Thus, the decision to consume sugar or water occurs via regulation of a broad peptidergic network that integrates internal signals of nutritional state to generate nutrientspecific ingestion.

 Neuroscience
Complex behaviors depend on the coordinated activity of neural ensembles in interconnected brain areas. The behavioral function of such coordination, often measured as cofluctuations in neural activity across areas, is poorly understood. One hypothesis is that rapidly varying cofluctuations may be a signature of momentbymoment taskrelevant influences of one area on another. We tested this possibility for errorcorrective adaptation of birdsong, a form of motor learning which has been hypothesized to depend on the topdown influence of a higherorder area, LMAN (lateral magnocellular nucleus of the anterior nidopallium), in shaping momentbymoment output from a primary motor area, RA (robust nucleus of the arcopallium). In paired recordings of LMAN and RA in singing birds, we discovered a neural signature of a topdown influence of LMAN on RA, quantified as an LMANleading cofluctuation in activity between these areas. During learning, this cofluctuation strengthened in a premotor temporal window linked to the specific movement, sequential context, and acoustic modification associated with learning. Moreover, transient perturbation of LMAN activity specifically within this premotor window caused rapid occlusion of pitch modifications, consistent with LMAN conveying a temporally localized motorbiasing signal. Combined, our results reveal a dynamic topdown influence of LMAN on RA that varies on the rapid timescale of individual movements and is flexibly linked to contexts associated with learning. This finding indicates that interarea cofluctuations can be a signature of dynamic topdown influences that support complex behavior and its adaptation.