Statistical inference on representational geometries

Abstract

Neuroscience has recently made much progress, expanding the complexity of both neural activity measurements and brain-computational models. However, we lack robust methods for connecting theory and experiment by evaluating our new big models with our new big data. Here, we introduce new inference methods enabling researchers to evaluate and compare models based on the accuracy of their predictions of representational geometries: A good model should accurately predict the distances among the neural population representations (e.g. of a set of stimuli). Our inference methods combine novel 2-factor extensions of crossvalidation (to prevent overfitting to either subjects or conditions from inflating our estimates of model accuracy) and bootstrapping (to enable inferential model comparison with simultaneous generalization to both new subjects and new conditions). We validate the inference methods on data where the ground-truth model is known, by simulating data with deep neural networks and by resampling of calcium-imaging and functional MRI data. Results demonstrate that the methods are valid and conclusions generalize correctly. These data analysis methods are available in an open-source Python toolbox (rsatoolbox.readthedocs.io).

Editor's evaluation

Schütt and colleagues introduce a new method for statistical inference on representational geometries based on a cross-validated two-factor bootstrap that allows for generalization across both participants and stimuli while allowing the fitting of flexible models. In a series of elegant simulations and empirical analyses on existing datasets, the authors validate the method statistically. The work provides a fundamental and compelling advance for the analysis of representational geometries.

https://doi.org/10.7554/eLife.82566.sa0

Introduction

Experimental neuroscience has recently made rapid progress with technologies for measuring neural population activity. Spatial and temporal resolution, as well as the coverage of measurements across the brains of animals and humans have all improved considerably (Parvizi and Kastner, 2018; Abbott et al., 2020; Wang and Xu, 2020; Allen et al., 2021; Guo et al., 2021; Uğurbil, 2021; Bandettini et al., 2021). Activity is measured using a wide range of techniques, including electrode recordings (Jun et al., 2017; Steinmetz et al., 2018; Parvizi and Kastner, 2018), calcium imaging (Wang and Xu, 2020), functional magnetic resonance imaging (fMRI; Allen et al., 2021; Uğurbil, 2021; Bandettini et al., 2021), and scalp electro- and magnetoencephalography (EEG and MEG; Baillet, 2017; Craik et al., 2019). In parallel to the advances in measuring brain activity, theoretical neuroscience has substantially scaled up brain-computational models that implement computational theories (e.g. Kriegeskorte, 2015; Kell et al., 2018; Kubilius et al., 2019; Zhuang et al., 2021). The engineering advances associated with deep learning (e.g. Paszke et al., 2019; Abadi et al., 2015) provide powerful tools for modeling brain information processing for complex, naturalistic tasks (LeCun et al., 2015). How to leverage the new big data to evaluate the new big models, however, is an open problem (Stevenson and Kording, 2011; Sejnowski et al., 2014; Smith and Nichols, 2018; Kriegeskorte and Douglas, 2018).

An important concept for understanding neural population codes is the concept of representational geometry (Shepard and Chipman, 1970; Edelman et al., 1998; Edelman, 1998; Norman et al., 2006; Diedrichsen and Kriegeskorte, 2017; Kriegeskorte et al., 2008a; Kriegeskorte et al., 2008b; Connolly et al., 2012; Xue et al., 2010; Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Cichy et al., 2014; Haxby et al., 2014; Freeman et al., 2018; Kietzmann et al., 2019; Stringer et al., 2019; Chung et al., 2018; Chung and Abbott, 2021; Kriegeskorte and Wei, 2021). Neural activity patterns that represent particular pieces of mental content, such as the stimuli presented in a neurophysiological experiment, can be viewed as points in the multivariate neural population response space of a brain region. The representational geometry is the geometry of these points. The geometry is characterized by the matrix of distances among the points. This distance matrix abstracts from the roles of individual neurons and provides a summary characterization of the neural population code that can be directly compared among animals and between brain and model representations (e.g. a cortical area and a layer of a neural network model). The representational geometry provides a multivariate characterization of a neural population code that can be motivated as a generalization of linear decoding analyses. A linear decoder reveals a single projection of the geometry. The full distance matrix (when measured after a transform that renders the noise isotropic) captures what information is available in any linear projection (Kriegeskorte and Diedrichsen, 2019a).

A popular method for analyzing representational geometries (Kriegeskorte and Kievit, 2013) on which we build here is representational similarity analysis (RSA; Kriegeskorte et al., 2008a; Nili et al., 2014). RSA is a three-step process (Figure 1): In the first step, RSA characterizes the representational geometry of the brain region of interest (ROI) by estimating the representational distance for each pair of experimental conditions (e.g. different stimuli). The distance estimates are assembled in a representational dissimilarity matrix (RDM). We use the more general term ‘dissimilarity’ here to include dissimilarity measures that are not distances or metrics in the mathematical sense, such as crossvalidated distance estimators that can return negative values. This relaxation enables inclusion of measures that are not biased by the noise in the data (Kriegeskorte et al., 2007; Nili et al., 2014; Walther et al., 2016; Kriegeskorte and Diedrichsen, 2019a), returning values distributed symmetrically about 0, when the true distance is 0, but patterns are noisy estimates. In the second step, each model is evaluated by the accuracy of its prediction of the data RDM. To this end, an RDM is computed for each model representation. Each model’s prediction of the data RDM is evaluated using an RDM comparator, such as a correlation coefficient. In the third step, models are inferentially compared to each other in terms of their RDM prediction accuracy to guide computational theory.
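The first two steps can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the dissimilarity measure and Pearson correlation as the RDM comparator; the function names and the simulated data are hypothetical and are not the toolbox API.

```python
import numpy as np

def squared_euclidean_rdm(patterns):
    """Return the upper triangle (i < j) of the squared Euclidean RDM.

    patterns: array-like of shape (n_conditions, n_channels).
    """
    patterns = np.asarray(patterns, dtype=float)
    diff = patterns[:, None, :] - patterns[None, :, :]
    full = (diff ** 2).sum(axis=-1)
    rows, cols = np.triu_indices(patterns.shape[0], k=1)
    return full[rows, cols]

def pearson_comparator(rdm_a, rdm_b):
    """Pearson correlation between two vectorized RDMs."""
    return float(np.corrcoef(rdm_a, rdm_b)[0, 1])

# Hypothetical data: 8 conditions, 20 model features, 50 measurement channels.
rng = np.random.default_rng(0)
model_features = rng.normal(size=(8, 20))
mixing = rng.normal(size=(20, 50))  # unknown to the analysis
brain_patterns = model_features @ mixing + rng.normal(size=(8, 50))

data_rdm = squared_euclidean_rdm(brain_patterns)    # step 1: data RDM
model_rdm = squared_euclidean_rdm(model_features)   # step 2: model RDM
accuracy = pearson_comparator(data_rdm, model_rdm)  # step 2: RDM comparator
```

The model and the measurements have different numbers of channels here, which illustrates that the comparison at the level of RDMs needs no mapping between model features and measurement channels.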

Figure 1

Overview of model-comparative inference.

(a) Multiple conditions are presented to observers and to models (here different stimulus images). The brain measurements during the presentation produce a set of measurements for each stimulus and subject, potentially with repetitions; a model yields a feature vector per stimulus. Importantly, no mapping between brain measurement channels and model features is required. (b) To compare the two representations, we compute a representational dissimilarity matrix (RDM) measuring the pairwise dissimilarities between conditions for each subject and each model. For model comparison, we perform 2-factor crossvalidation within a 2-factor bootstrap loop to estimate our uncertainty about the model performances. On each fold of crossvalidation, flexible models are fitted to the representational dissimilarities for a set of fitting stimuli estimated in a set of fitting subjects (blue fitting dissimilarities). The fitted models must then predict the representational dissimilarities among held-out test stimuli for held-out test subjects (red test dissimilarities). The resulting performance estimates are not biased by overfitting to either subjects or stimuli. (c) Based on our uncertainty about model performances (error bars indicate estimated standard errors of measurement), we can perform various statistical tests, which are marked in the graphical display. Dew drops (gray) clinging to the lower bound of the noise ceiling mark models performing significantly below the noise ceiling. White dew drops on the horizontal axis mark models whose performance significantly exceeds 0 or chance performance. Pairwise differences are summarized by arrows. Each arrow indicates that the model marked with the dot performed significantly better than the model the arrow points at and all models further away in the direction of the arrow.

Image credit: Ecoset (Mehrer et al., 2017) and Wiki Commons.

RSA is widely used (Kriegeskorte and Kievit, 2013; Haxby et al., 2014; Kriegeskorte and Diedrichsen, 2019a) and has gained additional popularity with the rise of image-computable representational models like deep neural networks (e.g. Krizhevsky et al., 2012; Yamins et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Mehrer et al., 2017; Kriegeskorte, 2015; Yamins and DiCarlo, 2016; Xu and Vaziri-Pashkam, 2021; Konkle and Alvarez, 2022; Cichy et al., 2016). There has been important recent progress with methods for estimating representational distances (step 1) as well as measures of RDM prediction accuracy (step 2). For RDM estimation, biased and unbiased distance estimators with improved reliability have been proposed (Nili et al., 2014; Cai et al., 2019; Walther et al., 2016). For quantification of the RDM prediction accuracy, the sampling distribution of distance estimators has been derived and measures of RDM prediction accuracy that take the dependencies between dissimilarity estimates into account have been proposed (Diedrichsen et al., 2020). However, existing statistical inference methods for RSA (step 3) have important limitations. Established RSA inference methods (Nili et al., 2014) provide a noise ceiling and enable comparisons of fixed models with generalization to new subjects and conditions. However, they cannot handle flexible models, can be severely suboptimal in terms of statistical power, and have not been thoroughly validated using simulated or real data where ground truth is known. Addressing these shortcomings poses three substantial challenges. (1) Model-comparative inference with generalization to new conditions is not trivial because new conditions extend an RDM and the evaluation depends on pairwise dissimilarities, thus violating independence assumptions. (2) Standard methods for statistical inference do not handle multiple random factors, which in RSA are subjects and conditions. (3) Flexible models, that is, models that have parameters enabling them to predict different RDMs, are essential for RSA (Diedrichsen et al., 2018; Kriegeskorte and Diedrichsen, 2016). Evaluation of such models requires methods that are unaffected by overfitting to either subjects or conditions to avoid a bias in favor of more flexible models.

Here, we introduce a comprehensive methodology for statistical inference on models that predict representational geometries (Figure 1). We introduce novel bootstrapping methods that support generalization of model-comparative statistical inferences to new subjects, new conditions, or both simultaneously, as required to support the theoretical claims researchers wish to make. We also introduce a novel crossvalidation method for estimation of the RDM prediction accuracy of flexible models, that is models with parameters fitted to the data (Khaligh-Razavi and Kriegeskorte, 2014; Kriegeskorte and Diedrichsen, 2016). This is important, because theories do not always make a specific prediction for the representational geometry. There may be unknown parameters, such as the relative prevalences of different tuning functions (Khaligh-Razavi and Kriegeskorte, 2014; Jozwik et al., 2016) in the neural population or properties of the measurement process (Kriegeskorte and Diedrichsen, 2016). The combination of our 2-factor bootstrap and 2-factor crossvalidation methods enables statistical comparisons among fixed and flexible models that generalize across subjects and conditions.

We thoroughly validate the new inference methods using simulations and neural activity data. Extensive simulations based on deep neural network models and models of the measurement process enable us to test model-comparative inference in a setting where the ground-truth model (the one that actually generated the data) is known. These simulations confirm the validity of the inference procedures and their ability to generalize to the populations of subjects and/or conditions. We also validated the methods on real data from calcium imaging (mouse) and functional MRI (human). For both datasets, we confirm that conclusions generalize from an experimental dataset (a subset of the real data) to the entire dataset (which serves as a stand-in for the population). The statistical inference methodology described in this paper is available in a new open-source RSA toolbox written in Python (https://github.com/rsagroup/rsatoolbox, copy archived at Schütt, 2023).

Results

We now introduce the 2-factor bootstrap procedure for model-comparative inference and the 2-factor crossvalidation procedure for unbiased evaluation of flexible models. This paper also introduces a new representational dissimilarity estimator for electrophysiological recordings of patterns of firing rates across a population of neurons, based on the KL-divergence between Poisson distributions (Appendix 2) and a faster alternative to the rank correlation $τ_{a}$ as an RDM comparator (Nili et al., 2014), which we call $ρ_{a}$ (Appendix 3). The proposed inferential methods work for any representational dissimilarity measure and any RDM comparator. We evaluate alternative RDM comparators in terms of their power in Appendix 6. A complete description of all steps of the new methodology can be found in the Materials and methods (Full description of the RSA method).

Methods for inference on representational geometries

A simple approach to inferential comparison of two models is to compute the difference between the models’ performance estimates for each subject and use Student’s $t$-test (or a nonparametric alternative). However, inference then only takes the variability over subjects into account and thus does not justify generalization to different experimental conditions (e.g. different stimuli). Computational neuroscience usually pursues insights that generalize not only to a population of subjects but also to a population of conditions (Yarkoni, 2020). To support generalization to the population of conditions statistically, we require uncertainty estimates that treat the experimental conditions as a random sample from a population (Kriegeskorte et al., 2008a), whether or not the subjects are treated as a random sample.
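As a minimal sketch of the simple subject-wise approach, the paired $t$ statistic can be computed from per-subject performance differences. The function name and the performance values below are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(perf_model_a, perf_model_b):
    """Paired t statistic for per-subject model-performance differences.

    Returns the t statistic and its degrees of freedom (n_subjects - 1).
    Only variability across subjects enters this statistic.
    """
    diffs = [a - b for a, b in zip(perf_model_a, perf_model_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical RDM correlations of two models in five subjects.
perf_a = [0.30, 0.35, 0.28, 0.40, 0.33]
perf_b = [0.25, 0.31, 0.27, 0.33, 0.30]
t_value, dof = paired_t_statistic(perf_a, perf_b)  # t is close to 4 with 4 dof
```

Because the conditions are held fixed in this computation, a significant result supports generalization to new subjects only.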

For frequentist inference, the challenge is to estimate how variable the model-performance estimates would be if we repeated the experiment many times with new subjects and/or conditions. We would like to know (1) the variance of each model’s performance estimate and (2) the variance of the estimated performance difference for each pair of models. The variance of model-performance estimates enables us to statistically compare each model to a fixed value such as an RDM correlation of 0. The variance of our estimate of model-performance difference enables us to statistically compare two models to each other (see Frequentist tests for model evaluation and model comparison for details).

Estimating the variance of model-performance estimates for generalization to new subjects and conditions

To estimate the variance of model-performance estimates across repetitions of the experiment with new conditions, we use a bootstrap method. Bootstrap methods estimate the variance of experimental outcomes by sampling from the measured data with replacement, treating the measured data as an approximation to the population (Efron and Tibshirani, 1994). The population here is the set of experimental conditions of which the actual experimental conditions can be considered a random sample. Because the conditions do not have independent influences on the model evaluations, we cannot compute a sample variance across conditions as we can across subjects to replace the bootstrap.

When we bootstrap-resample conditions, we obtain RDMs of the same size as the original RDMs, but some of the conditions will be repeated. Here, we exclude the entries that correspond to the dissimilarity of any condition with itself from the comparisons between RDMs. Simulations confirm that this procedure yields a good estimate of how variable the results are when we sample new conditions with the same subjects (Figures 4a and 6g).
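The condition bootstrap described above can be sketched as follows. This is an illustrative implementation under stated assumptions (full square RDMs as input, Pearson correlation as the comparator, a hypothetical function name), not the toolbox code.

```python
import numpy as np

def bootstrap_condition_variance(data_rdm, model_rdm, n_boot=1000, seed=0):
    """Bootstrap the RDM correlation over conditions.

    data_rdm, model_rdm: symmetric (n_cond, n_cond) dissimilarity matrices.
    Returns the variance of the bootstrap evaluations and the evaluations.
    """
    rng = np.random.default_rng(seed)
    data_rdm = np.asarray(data_rdm, dtype=float)
    model_rdm = np.asarray(model_rdm, dtype=float)
    n_cond = data_rdm.shape[0]
    rows, cols = np.triu_indices(n_cond, k=1)
    evaluations = np.empty(n_boot)
    for b in range(n_boot):
        # Resample conditions with replacement; the RDM keeps its size.
        sample = rng.integers(0, n_cond, size=n_cond)
        first, second = sample[rows], sample[cols]
        # Exclude dissimilarities of a condition with itself.
        keep = first != second
        d = data_rdm[first[keep], second[keep]]
        m = model_rdm[first[keep], second[keep]]
        evaluations[b] = np.corrcoef(d, m)[0, 1]
    return float(np.var(evaluations, ddof=1)), evaluations

# Hypothetical example: 12 conditions, 30 channels.
rng = np.random.default_rng(1)
patterns = rng.normal(size=(12, 30))
features = patterns + rng.normal(size=(12, 30))
data_rdm = ((patterns[:, None] - patterns[None]) ** 2).sum(axis=-1)
model_rdm = ((features[:, None] - features[None]) ** 2).sum(axis=-1)
variance, evaluations = bootstrap_condition_variance(data_rdm, model_rdm, n_boot=200)
```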

For simultaneous generalization to the populations of both conditions and subjects, we can employ a 2-factor bootstrap (Figure 1b) as introduced previously (Nili et al., 2014; Storrs et al., 2021). However, our simulations and theory here show that a naive 2-factor bootstrap approach triple-counts the variance contributed by the measurement noise (Methods, Estimating the uncertainty of our model-performance estimates, Figures 4c and 7c). This effect is not unique to RSA; a naive 2-factor bootstrap will triple-count variance related to the measurement noise for any type of experiment in which two factors (here subject and condition) jointly determine the experimental outcome. The true variance $σ_{both}^{2}$ of the experimental outcome when sampling both factors can be separated into a contribution from condition sampling ($σ_{cond}^{2}$), a contribution from subject sampling ($σ_{subj}^{2}$), and a contribution of the interaction of subjects and conditions or measurement noise ($σ_{noise}^{2}$):

\begin{aligned} σ_{both}^{2} & \approx σ_{subj}^{2} + σ_{cond}^{2} + σ_{noise}^{2} \end{aligned}

This decomposition is for the actual variance $σ_{both}^{2}$ across repeated experiments with new subjects and conditions. The variance ${\hat{σ}}_{both}^{2}$ of the naive 2-factor bootstrap can likewise be decomposed into three additive terms (Methods, Estimating the uncertainty of our model-performance estimates), corresponding to subject sampling, condition sampling, and the interaction and/or noise. However, in the naive 2-factor bootstrap estimate ${\hat{σ}}_{both}^{2}$, the independent noise contribution enters not only its own term, but also the two others. Thus, the naive 2-factor bootstrap estimate contains the noise variance component three times instead of once:

\begin{aligned} {\hat{σ}}_{both}^{2} & \approx (σ_{subj}^{2} + σ_{noise}^{2}) + (σ_{cond}^{2} + σ_{noise}^{2}) + σ_{noise}^{2} \\ & = σ_{subj}^{2} + σ_{cond}^{2} + 3 σ_{noise}^{2} \end{aligned}

This problem can be understood by considering the 1-factor bootstraps, which also contain the independent noise component although it has not been added explicitly:

\begin{aligned} {\hat{σ}}_{subj}^{2} & \approx σ_{subj}^{2} + σ_{noise}^{2} \end{aligned}

\begin{aligned} {\hat{σ}}_{cond}^{2} & \approx σ_{cond}^{2} + σ_{noise}^{2} \end{aligned}

When we bootstrap two factors, this automatic inclusion of the noise component happens three times. We confirmed this by both theory and simulation. The overestimate of the variance renders the naive 2-factor bootstrap conservative and not optimally powerful.

To correct the variance estimate, we introduce a novel corrected 2-factor bootstrap procedure: We first compute the 1-factor bootstrap variance estimates ${\hat{σ}}_{subj}^{2}$ and ${\hat{σ}}_{cond}^{2}$. We also compute the naive 2-factor bootstrap estimate ${\hat{σ}}_{both}^{2}$. We can then linearly combine the variances from these three bootstraps to cancel the surplus contribution from the measurement noise. This procedure yields a corrected 2-factor bootstrap estimate ${\hat{σ}}_{c2f}^{2}$ that has approximately the right expected value:

\begin{aligned} {\hat{σ}}_{c2f}^{2} & = 2 ({\hat{σ}}_{subj}^{2} + {\hat{σ}}_{cond}^{2}) - {\hat{σ}}_{both}^{2} \\ & \approx σ_{subj}^{2} + σ_{cond}^{2} + σ_{noise}^{2} \end{aligned}

The approximations in these equations are due to $\frac{N - 1}{N}$ factors that apply to the individual terms. We give the exact formulae including these factors in the methods section (Estimating the uncertainty of our model-performance estimates). We show in multiple simulations that this estimate approximates the correct variance better than the uncorrected 2-factor bootstrap (Figures 4c and 7c).
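To make the cancellation explicit, one can substitute the approximate expected values of the three bootstrap estimates into the linear combination. The following worked step is a sketch that ignores the $\frac{N - 1}{N}$ factors:

\begin{aligned} 2 ({\hat{σ}}_{subj}^{2} + {\hat{σ}}_{cond}^{2}) - {\hat{σ}}_{both}^{2} & \approx 2 (σ_{subj}^{2} + σ_{cond}^{2} + 2 σ_{noise}^{2}) - (σ_{subj}^{2} + σ_{cond}^{2} + 3 σ_{noise}^{2}) \\ & = σ_{subj}^{2} + σ_{cond}^{2} + σ_{noise}^{2} \end{aligned}

The four noise terms contributed by the two 1-factor bootstraps, minus the three contained in the naive 2-factor bootstrap, leave the noise variance exactly once.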

To stabilize the estimator and eliminate the possibility of a negative variance estimate, we bound the estimate from above and below. We use both ${\hat{σ}}_{subj}^{2}$ and ${\hat{σ}}_{cond}^{2}$ as lower bounds for the estimate, as the variances they estimate are always smaller than the true variance. As an upper bound, we use ${\hat{σ}}_{both}^{2}$, the naive, conservative estimate. Bounding slightly biases the variance estimate, but reduces its variability and ensures that it is strictly positive.
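The correction and the bounding can be sketched in a few lines. The function name is hypothetical, the formula is the approximate form without the $\frac{N - 1}{N}$ factors, and the order in which the bounds are applied is an assumption of this sketch.

```python
def corrected_two_factor_variance(var_subj, var_cond, var_both):
    """Corrected 2-factor bootstrap variance (approximate form).

    var_subj, var_cond: 1-factor bootstrap variance estimates.
    var_both: naive 2-factor bootstrap variance estimate.
    """
    corrected = 2.0 * (var_subj + var_cond) - var_both
    lower = max(var_subj, var_cond)  # 1-factor estimates bound from below
    upper = var_both                 # naive estimate bounds from above
    # Assumption of this sketch: if the bounds conflict, the upper bound wins.
    return min(max(corrected, lower), upper)

# Hypothetical components: subj 0.3, cond 0.2, noise 0.1.
# 1-factor estimates include the noise once, the naive estimate three times.
estimate = corrected_two_factor_variance(0.3 + 0.1, 0.2 + 0.1, 0.3 + 0.2 + 3 * 0.1)
```

In this noise-free illustration the corrected estimate recovers the sum of the three components, 0.6, whereas the naive estimate would be 0.8.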

Evaluating the performance of flexible models

We often want to test flexible models, that is models that have parameters to be fitted to the brain-activity data. Two elements that often require fitting are weights for the model features and parameters of a measurement model. Feature weighting is required when a model is not meant to specify a priori how prevalent different tuning profiles are in the neural population or in the measured signals. For example, for deep neural network representations to match brain responses well, it is usually necessary to weight the features (e.g. Yamins et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Khaligh-Razavi et al., 2017; Storrs et al., 2021). A flexible measurement model may be necessary to account for the process of measurement, which may subsample, average, or distort neural responses. For example, fMRI voxels average the neural activity locally, which can be modeled with a parameter for the local averaging range, and electrophysiological recordings may preferentially sample certain classes of neurons (Kriegeskorte and Diedrichsen, 2016).

To avoid the bias in the model-performance estimates that can result from overfitting of flexible models, we use crossvalidation. Crossvalidation means that we partition the dataset into separate test sets. In each fold of crossvalidation, we then fit the models to all but one set and evaluate on the held-out set. Taking the average over the folds yields a single performance estimate. As for bootstrapping, crossvalidation is performed over both conditions and subjects so as to avoid overestimating the generalization performance of flexible models when tested on new subjects and new conditions drawn from the populations of subjects and conditions sampled in the actual experiment (Figure 1b).

Because the RDM for the test set must contain multiple values to allow a sensible comparison, the smallest possible number of conditions to perform crossvalidation is 6, which would yield three test conditions for twofold crossvalidation. For small numbers of conditions, we use two folds. We use three folds for ≥12 conditions, four folds for ≥24 conditions, and five folds for ≥40 conditions. These numbers seem to work reasonably well, but were chosen ad hoc.
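The fold-selection rule can be written as a small helper. The function name is hypothetical; the thresholds are the ad hoc values stated above.

```python
def default_n_folds(n_conditions):
    """Number of crossvalidation folds over conditions (ad hoc rule)."""
    if n_conditions < 6:
        # Fewer than 6 conditions leave too few test dissimilarities.
        raise ValueError("crossvalidation needs at least 6 conditions")
    if n_conditions >= 40:
        return 5
    if n_conditions >= 24:
        return 4
    if n_conditions >= 12:
        return 3
    return 2

folds = [default_n_folds(n) for n in (6, 12, 24, 40)]
```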

To estimate our uncertainty about the crossvalidated model performances, we use the same bootstrap methods as for fixed models. To do so, we need to perform crossvalidation on each bootstrap sample. We call this procedure bootstrap-wrapped crossvalidation.

In any crossvalidation, different ways to partition the data into test sets lead to different overall evaluations of the models. When we partition the conditions set into disjoint test sets in RSA, this effect is particularly strong, because dissimilarities between conditions in separate test sets do not contribute to the evaluation in any fold. The variance in the evaluations created by this random assignment is generated by our analysis and would vanish if we performed repeated cycles of crossvalidation with all possible partitionings of the conditions set into test sets. Unfortunately, such exhaustive crossvalidation will usually be prohibitively expensive in terms of computation time, especially in bootstrap-wrapped crossvalidation.

We can estimate the variance without this surplus by sampling $n_{cv} > 1$ different randomly chosen partitionings of the conditions set into crossvalidation test sets for each bootstrap sample. Each of the $n_{cv}$ partitionings into $k$ subsets defines a complete cycle of $k$-fold crossvalidation. The bootstrap-wrapped crossvalidation estimate of the variance of the model-performance estimates with $n_{cv}$ crossvalidation cycles will be larger than the variance $σ_{boot}^{2}$ of the exact mean performance over all possible partitionings of a dataset. When we assume that the variance $σ_{cv}^{2}$ of randomly chosen partitionings around the mean is equal for each bootstrap sample, the overall variance $σ_{bootcv, n_{cv}}^{2}$ is:

σ_{bootcv, n_{cv}}^{2} = σ_{boot}^{2} + \frac{σ_{cv}^{2}}{n_{cv}}

When we have more than one cycle of crossvalidation for each bootstrap sample, it is straightforward to estimate the variance $σ_{bootcv, 1}^{2}$ we would have obtained had we drawn only a single partitioning: We use only the $i$th partitioning for each bootstrap sample to estimate the variance and average these estimates over $i$. Using these two variance estimates for 1 and $n_{cv}$ partitionings, we can solve for the variance contributions of the random partitioning and of the bootstrap:

{\hat{σ}}_{cv}^{2} = \frac{n_{cv}}{n_{cv} - 1} ({\hat{σ}}_{bootcv, 1}^{2} - {\hat{σ}}_{bootcv, n_{cv}}^{2})

\begin{aligned} {\hat{σ}}_{boot}^{2} & = {\hat{σ}}_{bootcv, n_{cv}}^{2} - \frac{{\hat{σ}}_{cv}^{2}}{n_{cv}} \\ & = {\hat{σ}}_{bootcv, n_{cv}}^{2} - \frac{{\hat{σ}}_{bootcv, 1}^{2} - {\hat{σ}}_{bootcv, n_{cv}}^{2}}{n_{cv} - 1} \end{aligned}

Thus, we can directly compute an estimate of the variance we expect for exhaustive crossvalidation from two or more crossvalidation cycles using random partitionings for each bootstrap sample. The repetition across bootstrap samples enables a stable estimate even for $n_{cv} = 2$. The average estimate is independent of $n_{cv}$ (Figure 2a). We could invest computation in increasing either the number of bootstrap samples or the number of crossvalidation cycles per bootstrap sample. Our simulations show that the reliability of the bootstrap estimate of the variance of the model-performance estimate improves more when we increase the number of bootstrap samples than when we increase the number of crossvalidation cycles per bootstrap sample (Figure 2b). Thus, we recommend using only two crossvalidation cycles per bootstrap sample.
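The correction formulas can be sketched as follows, assuming the crossvalidated evaluations are stored as one row per bootstrap sample and one column per random partitioning. The function name and the toy numbers are hypothetical.

```python
from statistics import mean, variance

def exhaustive_cv_variance(evaluations):
    """Estimate the bootstrap and partitioning variance components.

    evaluations: list of rows, one per bootstrap sample; each row holds the
    crossvalidated performance for each of n_cv >= 2 random partitionings.
    Returns (var_boot, var_cv).
    """
    n_cv = len(evaluations[0])
    if n_cv < 2:
        raise ValueError("need at least two crossvalidation cycles")
    # Variance of the mean over all n_cv partitionings per bootstrap sample.
    var_ncv = variance([mean(row) for row in evaluations])
    # Variance with a single partitioning, averaged over the partitionings.
    var_1 = mean([variance([row[i] for row in evaluations]) for i in range(n_cv)])
    var_cv = n_cv / (n_cv - 1) * (var_1 - var_ncv)
    var_boot = var_ncv - var_cv / n_cv
    return var_boot, var_cv

# Toy input: 3 bootstrap samples with 2 partitionings each.
var_boot, var_cv = exhaustive_cv_variance([[1.0, 3.0], [2.0, 2.0], [5.0, 3.0]])
```

For the toy input, the variance of the row means is 4/3 and the averaged single-partitioning variance is 7/3, so the sketch returns a partitioning variance of 2 and a corrected bootstrap variance of 1/3.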

Figure 2

Correction for variance caused by crossvalidation.

(a) Unbiased estimates of the variance of model-performance estimates (dashed line) require either many crossvalidation cycles (light blue dots) or the proposed correction formula (black dots). Each model in each simulated dataset contributes one dot to each point cloud in this plot, corresponding to the average estimated variance across 100 repeated analyses. All variance estimates of a model are divisively normalized by the average corrected variance estimate for this model over all numbers of crossvalidation cycles for the dataset. For many crossvalidation cycles, the uncorrected and corrected estimates converge, but the correction formula yields this value even when we use only two crossvalidation cycles. (b) Reliability of the corrected bootstrap variance estimate across multiple estimations on the same dataset, comparing the use of more crossvalidation cycles per bootstrap sample (gray, 2, 4, 8, 16, 32 crossvalidations at 1000 bootstrap samples) to using more bootstrap samples (black, 1000, 2000, 4000, 8000, 16,000 bootstrap samples with 2 crossvalidation cycles per sample). The horizontal axis represents the total number of crossvalidation cycles (number of cycles per bootstrap × number of bootstraps). More bootstrap samples are more efficient at stabilizing our bootstrap estimates of the variance of model-performance estimates. Increasing the number of bootstraps decreases the variance roughly at the $N^{- \frac{1}{2}}$ rate expected for sampling approximations indicated by the dashed line.

This crossvalidation approach provides model-performance estimates that are not biased by overfitting of flexible models to either subjects or conditions. Fixed and flexible models with different numbers of parameters can be robustly compared with generalization over conditions and/or subjects. The method can handle any model that can be fitted efficiently enough (for the types of flexible models we actually implemented, see Methods, Flexible models).

Validation of the statistical inference methods

We validate the inference methods using simulations, functional MRI data, and calcium-imaging data. First, we establish that the statistical tests for model comparison are valid, controlling the false-positive rate at the nominal level. This requires simulating data under the null hypothesis, where two models that predict distinct RDMs are exactly equal in their RDM prediction accuracy. We use a matrix-normal model to simulate this null scenario for model comparison. Second, we show that the estimates of our uncertainty about model performance correctly capture the true variability for different generalization schemes in more realistic simulated scenarios based on neural network models. In these simulations, we cannot simulate the null hypothesis of two models that predict the representational geometry equally accurately. We also use these more realistic simulations to evaluate the power afforded by different RDM comparators. Third, we validate the inference procedure for flexible models, confirming that our bootstrap-wrapped crossvalidation scheme correctly accounts for the overfitting of flexible models. Fourth, we validate the methods using real data, acquired with functional MRI in humans and calcium imaging in mice.

Validity of inferential model comparisons

A frequentist test is valid when the rate of false positives (i.e. the rate of positive results when the null hypothesis is true) does not exceed the specified error rate α (e.g. 5%). Here, we check the validity of model-comparative inference, where the null hypothesis is that the two models perform equally well at explaining the representational geometry. We simulate scenarios where two models predict distinct geometries, but perform equally well on average at predicting the true representational geometry.

To simulate situations where two different models perform equally well, we generated condition-response matrices (containing an activity level for each combination of condition and response channel) by sampling from matrix-normal density models. A matrix-normal distribution over matrices yields matrices with normally distributed cells whose covariance is separable into a covariance matrix across rows and one across columns. In our case, rows correspond to the experimental conditions (e.g. stimuli) and the columns correspond to measurement channels (e.g. neurons or voxels). For matrix-normal data, the covariance across conditions captures the similarity among condition-related response patterns and determines the expected squared Euclidean-distance RDM (Diedrichsen and Kriegeskorte, 2017). The covariance among channels only scales the covariance of the distance estimates. This relationship enables us to generate matrix-normal data for arbitrary choices of the expected squared Euclidean-distance RDM. To model the null hypothesis, we choose two models that predict distinct RDMs and generate data, such that the expected data RDM has equal Pearson correlation to both model RDMs (results in Appendix 1—figure 1; details in Appendix 1).
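Matrix-normal sampling and the resulting expected RDM can be sketched as below. The sketch uses the standard construction of a matrix-normal sample from Cholesky factors of the two covariance matrices; the function names and the example covariances are hypothetical, and the procedure for equating the two models' correlations (Appendix 1) is not reproduced here.

```python
import numpy as np

def sample_matrix_normal(cov_conditions, cov_channels, rng):
    """Draw one zero-mean condition-by-channel matrix.

    Rows (conditions) covary according to cov_conditions, columns
    (channels) according to cov_channels; both must be positive definite.
    """
    a = np.linalg.cholesky(cov_conditions)
    b = np.linalg.cholesky(cov_channels)
    z = rng.standard_normal((cov_conditions.shape[0], cov_channels.shape[0]))
    return a @ z @ b.T

def expected_sq_euclidean_rdm(cov_conditions, cov_channels):
    """Expected squared Euclidean RDM implied by the two covariances."""
    diag = np.diag(cov_conditions)
    # The channel covariance enters only as a scale factor (its trace).
    return (diag[:, None] + diag[None, :] - 2 * cov_conditions) * np.trace(cov_channels)

# Hypothetical example: 4 conditions, 30 independent channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6))
cov_conditions = features @ features.T / 6 + 0.1 * np.eye(4)
cov_channels = np.eye(30)
responses = sample_matrix_normal(cov_conditions, cov_channels, rng)
expected_rdm = expected_sq_euclidean_rdm(cov_conditions, cov_channels)
```

Averaging the squared Euclidean distances over many such samples should approach `expected_rdm`, which is how an arbitrary expected RDM can be imposed through the choice of the condition covariance.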

We first evaluated the bootstrap in the scenario where the goal is to generalize across subjects only. Model-comparative subject-only bootstrap tests were valid for sufficiently large samples (Appendix 1—figure 1); inflated false-positive rates were observed only when using a small sample of subjects (<20). For a small number of samples, bootstrapping is known to underestimate the variance by a factor $\frac{n}{n - 1}$ for $n$ samples (e.g. Efron and Tibshirani, 1994, chapter 5.3). In this scenario, we recommend using a $t$-test across subjects, which is more computationally efficient and more accurate than bootstrap methods for small numbers of subjects.

Next, we tested bootstrapping for generalization to new conditions. In this scenario, the bootstrap methods were all conservative, showing false-positive rates substantially below 5% (Appendix 1—figure 1). This is expected, because we did not include any random selection of conditions in our data simulation, but enforced the H₀ exactly for the measured conditions.

To assess how problematic it is to choose an inference method that ignores the variance due to condition sampling, we ran a simulation in which we sampled the conditions from a large pool. We generated two models that perform equally well on 1000 conditions using matrix-normal sampling and then sampled a smaller set of these conditions for the simulated experiment. In these simulations, all techniques that only take subjects into account as a random factor fail catastrophically (Appendix 1—figure 1), with false-positive rates growing with the number of simulated subjects and reaching 60% at 40 simulated subjects. In contrast, our bootstrap tests that include condition sampling all remain valid, including the uncorrected 2-factor bootstrap and our new corrected 2-factor bootstrap with false-positive rates below the nominal 5%. However, the uncorrected 2-factor bootstrap was extremely conservative.

We also validated the tests against chance performance, where a single model is tested and the null hypothesis is that its performance is at chance level. To do so, we performed similar matrix-normal data simulations, evaluating a model that predicts a specific randomly sampled RDM on matrix-normal data consistent with an independently sampled random expected data RDM. Results show that a $t$-test across subjects as well as the bootstrap $t$-test approaches provide valid inference (Appendix 1—figure 1, top row). The subject $t$-test and the corrected 2-factor bootstrap $t$-test avoid overly conservative false-positive rates.

We conclude that the tests are valid in these simple simulated H₀ scenarios, where we are able to estimate the false-positive rate. In more realistic simulations using neural network models and real data, we can no longer simulate distinct models that predict the data RDM equally well. We therefore restrict ourselves to evaluating our bootstrap estimate of the variance of model-performance estimates, assuming that the false-positive rates are adequately controlled when we use an accurate variance estimate.

Criteria for evaluation of inference procedures

To evaluate alternative inference procedures, we perform simulations that reveal (1) whether the estimates of the uncertainty of the model-performance estimates are accurate (ensuring the validity of the inferences), and (2) how sensitive different model comparison methods are to subtle differences between models (determining the power of the inferences). To measure whether our bootstrap methods correctly estimate the uncertainty of the model-performance estimates, we compute the relative uncertainty (RU). The RU is the standard deviation of the bootstrap distribution of model-performance estimates $\sigma_{\mathrm{boot}}$ divided by the true standard deviation of model-performance estimates $\sigma_{\mathrm{true}}$ as observed over repeated simulations:

$$\mathrm{RU} = \frac{\sigma_{\mathrm{boot}}}{\sigma_{\mathrm{true}}} = \sqrt{\frac{\frac{1}{N} \sum_{i = 1}^{N} \sigma_{i}^{2}}{\sigma_{\mathrm{true}}^{2}}},$$

where $σ_{i}^{2}$ is the variance estimator of the bootstrap in simulated dataset $i$ of the $N$ simulations. Ideally, we would like the bootstrap-estimated variance to match the true variance such that the RU is 1.
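As a minimal sketch of how the RU could be computed from simulation output, the following assumes one bootstrap variance estimate and one model-performance estimate per simulated dataset. The function name is hypothetical, and the use of the sample standard deviation (dividing by $N - 1$) for the true variability is an assumption not stated in the text.

```python
import math
import statistics

def relative_uncertainty(bootstrap_variances, performance_estimates):
    """RU = sqrt(mean bootstrap variance) / SD of performance estimates.

    bootstrap_variances[i]   : bootstrap variance estimate in simulation i
    performance_estimates[i] : model-performance estimate in simulation i
    """
    sigma_boot = math.sqrt(statistics.fmean(bootstrap_variances))
    # Assumption: sample standard deviation across the N simulations.
    sigma_true = statistics.stdev(performance_estimates)
    return sigma_boot / sigma_true

# Toy numbers: mean bootstrap variance 0.04 (SD 0.2), true SD 0.2.
ru = relative_uncertainty([0.03, 0.05, 0.04], [0.1, 0.3, 0.5])
```

An RU above 1 corresponds to a conservative (overestimated) uncertainty, and an RU below 1 to an underestimated one.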

To measure how sensitive our analysis is to differences in model performance (e.g. comparing layers of a deep neural network), we define the model discriminability as a signal-to-noise ratio (SNR). The signal is the magnitude of model-performance differences, which is measured as the variance across models of their average of performance estimates across simulations. The noise is the nuisance variation, which includes subject and condition sample variation along with measurement noise. The noise is measured as the average across models of the variance of performance estimates across simulations. This results in the following formula, in which ${Perf}_{i, m}$ is the performance of model $m$ of $M$ in repetition $i$ of $N$ repetitions of the simulation:

$$\mathrm{SNR} = \frac{\mathrm{Var}_{m}\left(\frac{1}{N} \sum_{i = 1}^{N} \mathrm{Perf}_{i, m}\right)}{\frac{1}{M} \sum_{m = 1}^{M} \mathrm{Var}_{i}\left(\mathrm{Perf}_{i, m}\right)} .$$

A higher SNR indicates greater sensitivity to differences in model performance: differences between models are larger relative to the variation of model-performance estimates over repeated simulations. Note that this measure does not depend on the accuracy of the bootstrap because the bootstrap estimates of the variances do not enter this statistic. The SNR exclusively measures how large differences between models are compared to the level of nuisance variation we simulate, which may include random sampling of conditions, subjects, or both (in addition to measurement noise).
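The SNR definition can be written out as a short sketch. The function name and the nested-list layout `perf[i][m]` are hypothetical, and the use of sample variances (dividing by one less than the count) is an assumption.

```python
import statistics

def model_discriminability_snr(perf):
    """perf[i][m]: performance of model m in simulation repetition i.

    Signal: variance across models of the mean performance over repetitions.
    Noise:  mean across models of the variance over repetitions.
    """
    n_models = len(perf[0])
    per_model = [[row[m] for row in perf] for m in range(n_models)]
    signal = statistics.variance([statistics.fmean(v) for v in per_model])
    noise = statistics.fmean([statistics.variance(v) for v in per_model])
    return signal / noise

# Toy example: two models whose means differ by 1.0, with small
# repetition-to-repetition variation.
perf = [[0.0, 1.0], [0.2, 1.2]]
snr = model_discriminability_snr(perf)
```

No bootstrap quantity enters this computation, consistent with the SNR being independent of the accuracy of the bootstrap.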

Validity of generalization to new subjects and conditions

To test whether our inference methods correctly generalize to new subjects and conditions, we performed a simulation that includes random sampling of both subjects and conditions (Figure 3). We used the internal representations of the deep convolutional neural network model AlexNet (Krizhevsky et al., 2012) to generate fMRI-like simulated data. In each simulated scenario, one of the layers of AlexNet served as the true (data-generating) model, while all layers were considered as candidate models in the inferential model comparisons. We simulated true voxel responses as local averages of the activities of close-by units in the feature maps of layers of the model. The response of each simulated voxel was a local average of unit responses, weighted according to a 2D Gaussian kernel over the locations of the feature map multiplied by a vector of nonnegative random weights (drawn uniformly from the unit interval) across the features. We then simulated hemodynamic-response timecourses and added measurement noise. The covariance structure of the noise was determined by the overlap of the simulated voxels’ averaging regions over space and a first-order autoregressive model over time. The simulated data were subjected to a standard general linear model (GLM) analysis to estimate the condition-response matrix. Variation over conditions was generated by using randomly sampled natural images from ecoset (Mehrer et al., 2017) as input to the AlexNet model. Variation over subjects was generated by randomly choosing a new location and a new vector of feature weights for each voxel of a new simulated subject.
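The voxel model described above can be sketched as follows. This is an illustrative simplification, not the simulation code: feature maps are plain nested lists indexed as `feature_maps[f][y][x]`, the Gaussian kernel is normalized to sum to 1 (an assumption), and all names are hypothetical.

```python
import math
import random

def gaussian_weights(height, width, center, sigma):
    """2D Gaussian kernel over feature-map locations, normalized to sum to 1."""
    cy, cx = center
    w = [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
          for x in range(width)] for y in range(height)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def voxel_response(feature_maps, center, sigma, feature_weights):
    """Spatially Gaussian-weighted local average of each feature map,
    combined across features with nonnegative weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    g = gaussian_weights(height, width, center, sigma)
    response = 0.0
    for fmap, wf in zip(feature_maps, feature_weights):
        local_avg = sum(g[y][x] * fmap[y][x]
                        for y in range(height) for x in range(width))
        response += wf * local_avg
    return response

# Toy example: constant feature maps, one randomly placed voxel.
rng = random.Random(1)
n_features, height, width = 4, 6, 6
feature_maps = [[[1.0] * width for _ in range(height)]
                for _ in range(n_features)]
center = (rng.uniform(0, height - 1), rng.uniform(0, width - 1))
feature_weights = [rng.random() for _ in range(n_features)]  # uniform on [0, 1)
r = voxel_response(feature_maps, center, sigma=1.5,
                   feature_weights=feature_weights)
```

Drawing a new `center` and a new `feature_weights` vector for every voxel of a new simulated subject corresponds to the subject variation described in the text.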

Figure 3

Illustration of the deep-neural-network-based simulations for functional magnetic resonance imaging (fMRI)-like data.

The aim of the analyses was always to infer which layer of AlexNet the simulation was based on. (a) Stimuli are chosen randomly from ecoset (Mehrer et al., 2017) and we simulate a simple rapid event-related experimental design. (b) The ‘true’ average response of each voxel to a stimulus is based on local averages of the internal representations of AlexNet. To simulate the response of a voxel to a stimulus, we choose an (x,y)-position uniformly at random and take a weighted average of the activities around that location. As weights, we choose a Gaussian in space and independently draw a weight per feature between 0 and 1. (c) To generate a simulated voxel timecourse, we generate the undistorted timecourses of voxel activities, convolve them with a standard hemodynamic-response function, and add temporally correlated normal noise. (d) To estimate the response of a voxel to a stimulus, we fit a standard general linear model (GLM) to arrive at a noisy estimate of the true channel responses we started with in (c). (e) From the estimated channel responses, we compute the stimulus-by-stimulus dissimilarity matrices. These dissimilarity matrices can then be compared to the dissimilarity matrices computed based on the full deep neural network representations from the different layers.

We simulated N = 100 datasets for each parameter setting to estimate how variable the model-performance estimates truly are. In analysis, we must estimate our uncertainty about model performance from a single dataset. To estimate how accurate these estimates were, we compared the uncertainty estimates used by different inference procedures (including different bootstrap methods) to the true variability. This comparison enables us to validate our inference despite the fact that we cannot compute false-positive rates of the model comparison tests. Our neural-network-based simulations do not contain situations that correspond to the H₀ of two different models with equal performance, which would require that the data-generating neural network layer predicts an RDM equally similar to those predicted by two other model layers. As expected, the rate of erroneously finding an alternative model outperforming the true data-generating model was very low (not shown) whenever the type of bootstrap matched the simulated level of generalization, because the true layer has a higher average performance than the other models. At the 5% uncorrected significance level, the proportion of cases where any other layer performed significantly better than the true (data-generating) layer was only 1.524%. This rate reflects the differences between the layers of AlexNet, the simulated variability due to subject, stimulus, and voxel sampling, the simulated noise level, and the number of layers. Tests against the best other layer (chosen based on all data) significantly favor this other layer in only 0.694% of cases. Multiple-comparison correction would reduce these model-selection error rates even further.

To test generalization to either new conditions or new subjects (but not both simultaneously), we kept the other dimension constant. When simulating condition sampling, the true variance across conditions is accurately estimated for 40 or more conditions (Figure 4a) and is overestimated by 1-factor bootstrap resampling of conditions (rendering the inference conservative) when we have fewer than about 40 conditions (Figure 4a). When simulating subject sampling, the true variance across subjects is accurately estimated for 20 or more subjects (Figure 4b) and is underestimated by 1-factor bootstrap resampling of subjects (invalidating the inference) when we have very few subjects (Figure 4b). This downward bias corresponds to the $\frac{n}{n - 1}$ factor between the sample variance and the unbiased estimate for the population variance. Our implementation in the RSA toolbox uses this factor to correct the variance estimate.
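The size of this bias can be checked numerically with a small sketch. It bootstraps the mean of a toy sample of per-subject performance values (hypothetical numbers) and compares the raw and the $\frac{n}{n - 1}$-corrected bootstrap variance with the usual unbiased estimate of the variance of the mean. It illustrates the correction factor only and is not the toolbox's implementation.

```python
import random
import statistics

def bootstrap_variance_of_mean(sample, n_boot, rng):
    """Variance of the mean across bootstrap resamples (with replacement)."""
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.pvariance(means)

# Toy per-subject performance values (hypothetical).
sample = [0.21, 0.35, 0.18, 0.40, 0.27]
n = len(sample)
rng = random.Random(0)

raw = bootstrap_variance_of_mean(sample, n_boot=20000, rng=rng)
corrected = raw * n / (n - 1)                # apply the n / (n - 1) factor
unbiased = statistics.variance(sample) / n   # s^2 / n, unbiased for Var(mean)
```

For five subjects, `raw` falls short of `unbiased` by roughly 20%, whereas `corrected` agrees with it up to Monte Carlo error.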

Figure 4

Results of the deep-neural-network-based simulations.

(a–c) Relative uncertainty, that is, the bootstrap estimate of the standard deviation of model-performance estimates divided by the true standard deviation over repeated simulations. Dashed line and gray box indicate the expected value and standard deviation due to the number of simulations per condition. (a) Bootstrap resampling of conditions when repeated simulations use random samples of conditions and a fixed set of subjects. (b) Bootstrap resampling of subjects when simulations use random samples of subjects (simulated voxel placements) and a fixed set of conditions. (c) Direct comparison of the uncorrected and corrected 2-factor bootstraps (see Estimating the uncertainty of our model-performance estimates for details) for simulations that varied both conditions and subjects. (d–h) Signal-to-noise ratio (Equation 10), a measure of sensitivity to differences in model performance, for the different inference procedures and simulated scenarios. Infinite voxel averaging range refers to voxels averaging across the whole feature map. All error bars indicate standard deviations across simulation types that fall into the category.

To test our corrected 2-factor bootstrap method’s ability to generalize to new subjects and new conditions simultaneously, we simulated sampling of both conditions (stimuli) and subjects in our simulations. The corrected 2-factor bootstrap estimates the overall variation caused by random sampling of subjects and conditions and by measurement noise much more accurately than the naive 2-factor bootstrap (Figure 4c). Cases where an incorrect model (not the data-generating model) significantly outperformed the true model occurred in only 0.3% of simulations with the corrected 2-factor bootstrap, even without any multiple-comparison correction. This proportion would be larger if the alternative models performed more similarly to the true model than simulated here. The RSA toolbox adjusts for multiple comparisons, controlling either the familywise error rate or the false-discovery rate across all pairwise model comparisons.
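For orientation, the resampling step of the naive (uncorrected) 2-factor bootstrap can be sketched generically. The sketch assumes a hypothetical data layout `data[s][c]` (one value per subject and condition) and a user-supplied `evaluate` callback. It does not implement the correction, and it omits RDM-specific details such as the handling of repeated conditions within a bootstrap sample.

```python
import random
import statistics

def two_factor_bootstrap_sd(data, evaluate, n_boot, rng):
    """Naive 2-factor bootstrap: resample subjects AND conditions with
    replacement, re-evaluate, and return the SD of the estimates.

    data[s][c] : measurement for subject s and condition c
    evaluate   : function mapping a resampled data table to a float
    """
    n_subj, n_cond = len(data), len(data[0])
    estimates = []
    for _ in range(n_boot):
        subj = rng.choices(range(n_subj), k=n_subj)
        cond = rng.choices(range(n_cond), k=n_cond)
        resampled = [[data[s][c] for c in cond] for s in subj]
        estimates.append(evaluate(resampled))
    return statistics.stdev(estimates)

def grand_mean(table):
    """Toy 'performance' statistic: mean over all entries."""
    return statistics.fmean([v for row in table for v in row])

rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
sd = two_factor_bootstrap_sd(data, grand_mean, n_boot=500, rng=rng)

# Sanity check: constant data yield zero bootstrap variability.
constant = [[1.0] * 8 for _ in range(5)]
constant_sd = two_factor_bootstrap_sd(constant, grand_mean, n_boot=200, rng=rng)
```

Resampling along only one of the two index lists gives the corresponding 1-factor bootstrap over subjects or over conditions.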

Overall, we found that the new more powerful corrected 2-factor bootstrap method yields accurate estimates of the variance across the simulated populations of subjects and conditions when the dataset is large enough (≥20 subjects, ≥40 conditions) and the type of bootstrap matches the population sampling simulated (subject, condition, or both).

The model discriminability (SNR) increases monotonically with the number of measurements, affording greater power for model-comparative inference. Model discriminability increases with the amount of data according to a power law (straight line in log–log plot; Figure 4d–f). Such a relationship holds whether we increase the number of conditions, the number of subjects, or the number of repetitions per condition. This result is expected and validates the SNR as an indicator of model-comparison power. In general, increasing the number of measurements helps most for the factor that causes most variability of the performance estimates, rendering generalization harder. For example, in our deep-neural-network-based simulations, the variability over subjects is smaller than the variability across conditions (Figure 4g). In this simulation, it thus increases statistical power more to collect data for more conditions. When there is more variability across subjects, the opposite is expected to hold. An intermediate voxel size (Gaussian kernel width) yielded the highest model-performance discriminability as measured by the SNR (Figure 4h, see Appendix 5 for more discussion on this topic).

Validity of inference on flexible models

To validate inferential model comparisons involving flexible models, we made a variant of the deep neural network simulation in which we do not assume to know how voxels average local neural responses. As the simulated ground truth, we set the spatial weights for each voxel to a Gaussian with a standard deviation of 5% of the image size (full width at half maximum FWHM ≈ 11.77%) and randomly weighted the feature maps (with weights drawn independently for each voxel and feature map uniformly at random from the unit interval; details in Methods, Neural-network-based simulation).
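The stated FWHM follows from the standard conversion between the standard deviation $\sigma$ of a Gaussian profile and its full width at half maximum, written out here for convenience:

$$\mathrm{FWHM} = 2 \sqrt{2 \ln 2}\, \sigma \approx 2.3548 \times 5\% \approx 11.77\%.$$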

We then used models that a researcher could generate without knowing the ground truth of how voxels average local features. As building blocks for the models, we computed RDMs for different voxel averaging pool sizes and for different methods to deal with averaging across feature maps. To capture voxel averaging across retinotopic locations, we smoothed the feature maps with Gaussians of different sizes. To capture voxel averaging across feature maps, we (1) generated RDMs computed after taking the average across feature maps at each location (avg), (2) computed the expected RDM for the weight sampling implemented in the simulation (weighted), or (3) computed RDMs without any feature-map averaging (full).

We combined these building blocks into two types of flexible model: selection models and nonnegative linear-combination models. In a selection model, fitting is implemented as selection of the best among a finite set of RDMs. Here we defined one selection model for each method of combining the feature maps. Each selection model contained RDMs computed for different sizes of the local averaging pool. In linear-combination models, fitting consists in finding nonnegative weights for a set of basis RDMs, so as to maximize RDM prediction accuracy. The RDMs contain estimates of the squared Mahalanobis distances, which sum across sets of tuned neurons that jointly form a population code. As component RDMs, we chose the four extreme cases of RDM generation: no pooling across space or averaging across the whole image, each paired with either ‘full’ or ‘avg’ treatment of the feature maps. The resulting four-RDM-component linear model approximates the effect of computing the RDM from voxels that reflect the average activity over retinotopic patches of different sizes (Kriegeskorte and Diedrichsen, 2016). For the averaging across feature maps, which uses random weights, there is a strong motivation for using a linear model: When the voxel activities are nonnegatively weighted averages of the underlying neurons with the weights drawn independently from the same distribution, the expected squared Euclidean RDM is exactly a linear combination of the RDM computed based on the univariate population-average responses and the RDM based on all neurons (Appendix 4; see also Carlin and Kriegeskorte, 2017). For comparison, we also included fixed RDM models, corresponding to component RDMs of the fitted models.
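Fitting and evaluating a selection model can be sketched as follows. The sketch assumes RDMs stored as flat vectors of pairwise dissimilarities and uses the Pearson correlation as the RDM comparator. The function names and toy numbers are hypothetical, and the single training/test split stands in for a full crossvalidation loop.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_selection_model(candidate_rdms, train_rdm):
    """Fitting = choosing the candidate RDM that best predicts the training RDM."""
    scores = [pearson(c, train_rdm) for c in candidate_rdms]
    return max(range(len(candidate_rdms)), key=scores.__getitem__)

def crossvalidated_score(candidate_rdms, train_rdm, test_rdm):
    """Select on the training data, then evaluate on held-out data."""
    best = fit_selection_model(candidate_rdms, train_rdm)
    return best, pearson(candidate_rdms[best], test_rdm)

# Toy example: 4 conditions -> 6 pairwise dissimilarities per RDM.
candidates = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
    [1.0, 1.5, 3.5, 4.0, 4.5, 6.5],
]
train = [1.1, 1.4, 3.6, 3.9, 4.6, 6.4]
test = [0.9, 1.6, 3.4, 4.2, 4.4, 6.6]
best, score = crossvalidated_score(candidates, train, test)
```

Because the selected RDM is scored on data not used for the selection, a model with more candidates gains no advantage from its flexibility alone. A nonnegative linear-combination model would replace the selection step with a nonnegative least-squares style fit of the weights.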

We found that our bootstrap-wrapped crossvalidation (corrected 2-factor bootstrap with adjustment for excess crossvalidation variance) yielded accurate estimates of the uncertainty. The relative uncertainties were close to 1 (Figure 5b). The model-performance discriminability (SNR) was primarily determined by how accurately the different models were able to recreate the true measurement model (Figure 5c). The highest SNRs were achieved when the assumed model matched exactly (weighted feature treatment and voxel size 0.05), but the model variants that allowed for some fitting still yielded high SNRs. Analyses that take the averaging across space and features into account yielded the highest average model performance for the true model. In contrast, analyses that ignore averaging over space or features (the full feature set selection model and some of the fixed models) not only led to lower SNRs (as seen in Figure 5c), but also systematically selected the wrong layer, because a higher average performance was achieved by a different layer than the one we used for generating the data (not shown).

Figure 5

Validation of flexible model tests using bootstrap crossvalidation.

(a) MDS arrangement of the representational dissimilarity matrices (RDMs) for one simulated dataset. Colored circles show the predictions based on one correct and one wrong layer, changing the voxel averaging region and the treatment of features (‘full’, ‘weighted’, and ‘avg’, as described in the text). Fixed models correspond to a single choice of model RDM for each layer. Selection models select the best fitting voxel size from the RDMs presented in one color (or two for ‘both’). Crosses mark the four components of the linear model for Layer 2. The small black dots represent simulated subject RDMs without functional magnetic resonance imaging (fMRI) noise. (b) Histogram of relative uncertainties $\sigma_{\mathrm{boot}} / \sigma_{\mathrm{true}}$, showing that the bootstrap-wrapped crossvalidation accurately estimates the variance of the performance estimates across many different inference scenarios. (c) Model discriminability as signal-to-noise ratios for different model types.

We conclude that when the true voxel sampling is unknown, flexible models are needed to account for voxel sampling, so that our model-comparative inference can recover the underlying data-generating computational model. Fixed models based on incorrect assumptions about the voxel sampling can lead to low model-performance discriminability (SNR) and even to incorrect inferences as to which model is the true model.

Validation with functional MRI data

The simulations presented so far validated all statistical inference procedures, but may not capture all aspects of the structure of real measurements of brain activity. To test our methods under realistic conditions, we used real human fMRI (this section) and mouse calcium-imaging data (next section). We resampled data from a large openly available fMRI experiment in which humans viewed pictures from ImageNet (Horikawa and Kamitani, 2017). These data contain various noise sources, individual differences, signal shapes, and distributions that are difficult to simulate accurately without using measured data. We therefore implemented a data-based simulation to create realistic synthetic data, whose ground-truth RDM we knew (Figure 6). By subsampling from this dataset, we generated smaller datasets to test inference with bootstrapping over conditions. We used the entire dataset as a stand-in for the population a researcher might wish to generalize to. For each cortical area, we computed the mean RDM using all data (all runs and subjects). Each area’s mean RDM served as a ground-truth RDM for datasets sampled from that area and as a model RDM for datasets sampled from all areas. The model comparison we attempted aims to recover which cortical area a dataset was subsampled from. The simulation enables us to check whether our uncertainty estimates are correct for model-performance estimates based on real data.

Figure 6

Functional magnetic resonance imaging (fMRI)-data-based simulation.

(a) These simulations are based on a dataset of neural recordings for 50 stimuli in 5 human observers (Horikawa and Kamitani, 2017), with each stimulus shown 35 times. To extract stimulus responses from these data, we perform two general linear model (GLM) steps as in the original publication. (b) In the first step, we regress out diverse noise estimators (provided by fMRIprep) from pooled fMRI runs. (c) We then apply a second GLM separately on each run to extract the stimulus responses. (d) We then extract regions of interest (ROIs) based on an atlas (Glasser et al., 2016) and randomly choose differently sized subsets of runs and stimuli to enter further analyses. To simulate realistic noise, we estimate an AR(2) model on the second GLM’s residuals, permute and filter them to keep their original autocorrelation structure, and finally scale them by the factors 0.1, 1, and 10. To generate simulated timecourses, we add these altered residuals to the GLM prediction. We then rerun the second GLM on the simulated data and use the beta-coefficient maps for the following steps. (e) Finally, we compute crossnobis representational dissimilarity matrices (RDMs) and perform RSA based on the overall RDM across all subjects. (f) Results of the simulations, separately for each noise scaling factor. The signal-to-noise ratio shows the same increase as for our abstract simulation. (g) The relative uncertainty converges to 1 for increasing stimulus numbers. Error bars indicate standard deviations across different simulation types.

We varied the strength of noise, the number of runs, and the number of conditions (i.e. viewed images). We did not vary the number of subjects because the original dataset contains only five subjects, which precludes informative resampling of subjects. To increase the variability of the resampled datasets beyond sampling from the 35 measurement runs and to vary the noise strength, we created new voxel timecourses for each sampled run while preserving the spatial structure and serial autocorrelation of the noise. To achieve this, we estimated a second-order autoregressive model (AR(2)) separately for each run’s GLM residuals, permuted the AR-model’s residuals, and added the results to the GLM’s predicted timecourse (see Figure 6a–e and Methods, fMRI-data-based simulation for details). We repeated each simulated experiment 24 times and used the RU and the model-performance discriminability (SNR) as our evaluation criteria.
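The noise-resampling step can be sketched for a single voxel timecourse. The sketch assumes that the AR(2) coefficients `a1` and `a2` have already been estimated (the fitting step is omitted) and treats samples before the start of the run as zero. The function names are hypothetical, and the spatial structure across voxels is not modeled here.

```python
import random

def ar2_innovations(x, a1, a2):
    """Whiten a series: e_t = x_t - a1 * x_{t-1} - a2 * x_{t-2}."""
    e = []
    for t in range(len(x)):
        prev1 = x[t - 1] if t >= 1 else 0.0
        prev2 = x[t - 2] if t >= 2 else 0.0
        e.append(x[t] - a1 * prev1 - a2 * prev2)
    return e

def ar2_filter(e, a1, a2):
    """Re-color innovations: x_t = e_t + a1 * x_{t-1} + a2 * x_{t-2}."""
    x = []
    for t in range(len(e)):
        prev1 = x[t - 1] if t >= 1 else 0.0
        prev2 = x[t - 2] if t >= 2 else 0.0
        x.append(e[t] + a1 * prev1 + a2 * prev2)
    return x

def simulate_timecourse(glm_prediction, glm_residuals, a1, a2, noise_scale, rng):
    """Permute the AR(2) innovations, restore the autocorrelation by
    filtering, scale the noise, and add it to the GLM prediction."""
    e = ar2_innovations(glm_residuals, a1, a2)
    rng.shuffle(e)
    noise = ar2_filter(e, a1, a2)
    return [p + noise_scale * n for p, n in zip(glm_prediction, noise)]

# Toy example with hypothetical coefficients and residuals.
rng = random.Random(0)
residuals = [rng.gauss(0.0, 1.0) for _ in range(50)]
prediction = [0.5] * 50
roundtrip = ar2_filter(ar2_innovations(residuals, 0.4, 0.2), 0.4, 0.2)
simulated = simulate_timecourse(prediction, residuals, 0.4, 0.2, 1.0, rng)
noiseless = simulate_timecourse(prediction, residuals, 0.4, 0.2, 0.0, rng)
```

Permuting the whitened innovations rather than the raw residuals is what preserves the serial autocorrelation of the noise in the new timecourse.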

Results were largely similar to those of the neural-network-based simulations (Figure 6f, g). For the RU, which measures the accuracy of our bootstrap variance estimates, we see a convergence toward the expected ratio (dashed line at 1), validating the bootstrap procedure for real fMRI data. For the model-performance discriminability (SNR), we find the same power-law increase with the number of conditions and the number of runs used as data. These results suggest that the regions are discriminable on the basis of their RDMs estimated from fMRI given five subjects’ data when a sufficient number of stimuli (≥30) and runs (≥16) is used.

Validation with calcium-imaging data

We can also adjudicate among models of the representational geometry on the basis of direct neural measurements, such as electrophysiological recordings or calcium-imaging data. These measurement modalities have very different statistical properties than fMRI. To test our methods for this kind of data, we performed a resampling simulation based on a large calcium-imaging dataset of responses of mouse visual cortex to natural images (de Vries et al., 2020). This dataset contains recordings from six visual cortical areas: primary visual cortex (V1), laterointermediate (LM), posteromedial (PM), rostrolateral (RL), anteromedial (AM), and anterolateral (AL) visual area (Figure 7a).

Figure 7

Results in mice with calcium-imaging data.

(a) Mouse visual cortex areas used for analyses and resampling simulations. (b) Overall similarities of the representations in different cortical areas in terms of their representational dissimilarity matrix (RDM) correlations. For each mouse and cortical area (‘data RDM’, vertical), the RDM was correlated with the average RDM across all other mice (‘model RDM’, horizontal), for each other cortical area. We plot the average across mice of the crossvalidated RDM correlation (leave-one-mouse-out crossvalidation). The prominent diagonal shows the replicability across mice and the distinctness between cortical areas of the representational geometries. (c) Relative uncertainty for the 2-factor bootstrap methods. The gray box indicates the range of results expected from simulation variability if the bootstrap estimates were perfectly accurate. The correction is clearly advantageous here, although the method is still slightly conservative (overestimating the true standard deviation $\sigma_{\mathrm{true}}$ of model-performance evaluations) for small numbers of stimuli. For 40 or more stimuli, the corrected 2-factor bootstrap correctly estimates the variance of model-performance evaluations. (d) Signal-to-noise ratio validation: The signal-to-noise ratio (SNR) grows with the number of cells per subject and the number of repeats per stimulus. (e) Signal-to-noise ratio for different noise covariance estimates. Taking a diagonal covariance estimate into account, that is, normalizing cell responses by their standard deviation, is clearly advantageous. The shrinkage estimates provide a marginal improvement over that. (f) Signal-to-noise ratio for data sampled from different areas. (g) Which measure is optimal for discriminating the models depends on the data-generating area. On average, there is an advantage of the cosine similarity over the RDM correlation and of the whitened measures over the unwhitened ones. Error bars indicate standard deviations across different simulation types.

Image credit: Allen Institute.

As in the previous section, we used the overall mean RDM for each area as a ground-truth model and subsampled the data to create simulated datasets for which we know the ground-truth RDM. We used different numbers of stimulus repetitions, neurons, mice, and stimuli to vary the amount of information afforded by each simulated dataset. We used the crossnobis estimator of representational dissimilarity for all analyses here. We repeated each simulated experiment 100 times and computed the RU to assess the correctness of our bootstrap uncertainty estimates and the model-discriminability SNR to determine which noise covariance estimators and RDM comparators afford most sensitivity to model differences.

We analyzed the overall discriminability of the brain areas (Figure 7b). Although cortical areas vary in the reliability of the estimated RDMs, they can be discriminated reliably when using all data. We used the RU to assess whether our bootstrap variance estimates are correct for these data (Figure 7c). We resampled all factors (subjects, stimuli, runs, and cells) to generate simulated datasets. Correspondingly, the analysis used bootstrapping over both subjects and stimuli. We observed correct variance estimates for the corrected 2-factor bootstrap. The uncorrected 2-factor bootstrap was conservative, substantially overestimating the true variance.

To understand how the model-comparative power depends on experimental parameters and analysis choices, we analyzed the model-discriminability SNR. We found that more subjects, more stimuli, more runs, and more cells all increased the SNR just as in our fMRI and neural-network-based simulations (Figure 7d). Furthermore, we find that taking the noise covariance into account for computing the crossnobis RDMs in the first-level analysis improves the SNR (Figure 7e). Univariate noise normalization (implemented by using a diagonal noise covariance matrix) is better than no noise normalization. Multivariate noise normalization is slightly better than univariate noise normalization (Walther et al., 2016). For multivariate noise normalization, we tested two different shrinkage estimators with different targets: a multiple of the identity and the diagonal matrix of variances. These two variants perform similarly. In addition, we find that different RDM comparators yield the best model discriminability for different cortical areas (Figure 7g). For some, cosine RDM similarity performs better, for others, Pearson RDM correlation performs better. The whitened RDM comparators are better on average, but there are cases where the unwhitened RDM comparators perform slightly better. Thus, it remains dependent on the concrete experiment (with a particular choice of conditions, tested models and underlying representational geometry), which RDM comparator affords the best power for model comparison (Diedrichsen et al., 2020).

Discussion

We present new methods for inferential evaluation and comparison of models that predict brain representational geometries. The inference procedures enable generalization to new measurements, new subjects, and new conditions, treat flexible models correctly using crossvalidation, and work for any representational dissimilarity estimator and RDM comparator. For fixed as well as flexible models, our inference methods support all combinations of generalization: to new measurements using the same subjects and conditions, to new subjects, to new conditions, and to both new subjects and new conditions simultaneously. We validated the methods using simulated data as well as calcium-imaging and fMRI data, showing that the inferences are correct. The methods are available as part of an open-source Python toolbox (rsatoolbox.readthedocs.io).

Generalizing to new measurements, new subjects, and/or new conditions

Inferential statistics is about generalization from the experimental random samples to the underlying populations. We must carefully consider the level of generalization, both at the stage of designing our experiments and analyses and at the stage of interpreting the results. The lowest level of inferential generalization is to new measurements. Our conclusions in this scenario are expected to hold only for replications of the experiment in the same animals using the same conditions. Inferential generalization to new subjects may not be possible, for example, in case studies or when the number of animals (e.g. two macaques) is insufficient. Generalization to new conditions is not needed when all conditions relevant to our claims have been sampled. For example, Ejaz et al. (2015) studied the representational similarity of finger movements in primary motor cortex. All five fingers were sampled in the experiments and there are no other fingers to generalize to. When generalizing to replications with the same subjects and conditions, we need separate data partitions to estimate the variability of the model-performance estimates. We can then use a $t$-test or rank-sum test to test for significant differences between models.

If generalization to the population of subjects is desired, we need a sufficiently large sample of subjects. We can then evaluate each model for each subject and use a $t$-test or rank-sum test, treating the subjects as a random sample from a population. We showed that this method is valid, controlling false-positive rates at their nominal values in our matrix-normal simulations (Methods, Frequentist tests for model evaluation and model comparison). The variance across subjects here is a good estimate of the variance across the population of subjects. However, the interpretation of the results must be restricted to the exact set of experimental conditions used in the experiment.
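As an illustration of the subject-level test, the following minimal sketch computes a paired $t$ statistic from per-subject model performances. The function name and the simulated accuracies are hypothetical and are not taken from the toolbox; the p-value would be obtained from a $t$-distribution with the returned degrees of freedom.

```python
import numpy as np


def paired_t_across_subjects(perf_a, perf_b):
    """Paired t statistic for the per-subject performance difference of two models.

    Subjects are treated as a random sample from a population; the
    conditions are treated as fixed.
    """
    diff = np.asarray(perf_a, dtype=float) - np.asarray(perf_b, dtype=float)
    n_subjects = diff.size
    # Standard error of the mean difference across subjects
    sem = diff.std(ddof=1) / np.sqrt(n_subjects)
    return diff.mean() / sem, n_subjects - 1


# Hypothetical per-subject RDM prediction accuracies for two models
rng = np.random.default_rng(0)
perf_a = 0.30 + 0.05 * rng.standard_normal(20)
perf_b = 0.25 + 0.05 * rng.standard_normal(20)

t_stat, dof = paired_t_across_subjects(perf_a, perf_b)
print(f"t({dof}) = {t_stat:.2f}")
```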

We often would like our inferences to generalize to a population of conditions. For example, when evaluating computational models of vision, we are not usually interested in determining which models dominate just for the particular visual stimuli presented in our experiment. We are interested in models that dominate for a population of visual stimuli. Model-comparative inference can generalize to the population of conditions that the experimental conditions were randomly sampled from. The inference requires bootstrapping, because RDM prediction accuracy cannot be assessed for single conditions. We bootstrap-resample the condition set and evaluate all models on each bootstrap sample. This procedure correctly estimates our uncertainty about model-performance differences, and $t$-tests based on the estimated bootstrap variances provide valid frequentist inference.
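The condition bootstrap can be sketched as follows. This is a simplified illustration under stated assumptions, not the toolbox implementation: RDMs are stored as square matrices, the Pearson correlation serves as the RDM comparator, pairs in which the same condition was drawn twice are dropped, and all names and simulated data are hypothetical.

```python
import numpy as np


def bootstrap_conditions(data_rdm, model_rdms, n_boot=1000, seed=0):
    """Evaluate all models on bootstrap resamples of the condition set.

    Returns an (n_boot, n_models) array of Pearson RDM correlations.
    """
    rng = np.random.default_rng(seed)
    n_cond = data_rdm.shape[0]
    upper = np.triu_indices(n_cond, k=1)
    evals = np.full((n_boot, len(model_rdms)), np.nan)
    for b in range(n_boot):
        idx = rng.integers(0, n_cond, size=n_cond)  # sample with replacement
        rows, cols = idx[upper[0]], idx[upper[1]]
        keep = rows != cols  # drop pairs of a condition with itself
        data_vec = data_rdm[rows[keep], cols[keep]]
        for m, model in enumerate(model_rdms):
            model_vec = model[rows[keep], cols[keep]]
            evals[b, m] = np.corrcoef(data_vec, model_vec)[0, 1]
    return evals


def squared_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)


# Hypothetical example: a noisy data RDM, one good and one unrelated model
rng = np.random.default_rng(1)
true_rdm = squared_distances(rng.standard_normal((20, 5)))
noise = rng.standard_normal((20, 20))
noise = (noise + noise.T) / 2
np.fill_diagonal(noise, 0.0)
data_rdm = true_rdm + noise
model_good = true_rdm
model_bad = squared_distances(rng.standard_normal((20, 5)))

evals = bootstrap_conditions(data_rdm, [model_good, model_bad], n_boot=200)
diff = evals[:, 0] - evals[:, 1]
print("mean difference:", diff.mean())
print("bootstrap variance of the difference:", diff.var(ddof=1))
```

The bootstrap variance of the performance difference is the quantity that would enter a $t$-test for the model comparison.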

If we want to generalize simultaneously across conditions and subjects, then the corrected 2-factor bootstrap approach provides accurate estimates of our uncertainty about model performances. These uncertainty estimates support valid inferential model comparisons, comparisons to the lower bound of the noise ceiling, and tests against chance performance. We expect the results to generalize to new subjects and conditions drawn from the respective populations sampled randomly in the experiment.

Inference on fixed and flexible models

Our performance estimates for flexible models must not be biased by overfitting to measurement noise, subjects, or conditions. To avoid this bias, we use a novel 2-factor crossvalidation scheme that enables us to evaluate models’ predictive accuracy when simultaneously generalizing to new subjects and/or new conditions. The 2-factor crossvalidation is nested in our 2-factor bootstrap procedure for estimating uncertainty. By using two crossvalidation cycles with different data partitionings for each bootstrap sample, we can accurately remove the excess variance introduced by crossvalidation. Our method provides a computationally efficient estimate of the variances and covariances of model-performance estimates for flexible models, which enables us to use a $t$-test to inferentially compare models to each other, to the lower bound of the noise ceiling, and to chance performance.
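One plausible way to organize the data partitioning for a 2-factor crossvalidation is sketched below. It only generates index sets, assuming that a flexible model is fitted on the training subjects' RDMs restricted to the training conditions and evaluated on the test subjects' RDMs restricted to the test conditions. The function name and fold counts are hypothetical, and the bootstrap wrapper and variance correction are omitted.

```python
import numpy as np


def two_factor_folds(n_subjects, n_conditions, k_subj=2, k_cond=2, seed=0):
    """Yield (train_subj, train_cond, test_subj, test_cond) index arrays.

    Subjects and conditions are partitioned independently; every
    combination of a subject fold and a condition fold is one test set.
    """
    rng = np.random.default_rng(seed)
    subj_folds = np.array_split(rng.permutation(n_subjects), k_subj)
    cond_folds = np.array_split(rng.permutation(n_conditions), k_cond)
    for test_subj in subj_folds:
        train_subj = np.setdiff1d(np.arange(n_subjects), test_subj)
        for test_cond in cond_folds:
            train_cond = np.setdiff1d(np.arange(n_conditions), test_cond)
            yield train_subj, train_cond, np.sort(test_subj), np.sort(test_cond)


# Hypothetical data: 6 subjects, 10 conditions, square RDM per subject
rdms = np.zeros((6, 10, 10))
folds = list(two_factor_folds(6, 10, k_subj=3, k_cond=2))
for train_subj, train_cond, test_subj, test_cond in folds:
    # Sub-RDMs used for fitting and for evaluation in this fold
    train_rdms = rdms[np.ix_(train_subj, train_cond, train_cond)]
    test_rdms = rdms[np.ix_(test_subj, test_cond, test_cond)]
    print(train_rdms.shape, test_rdms.shape)
```

In this layout, dissimilarities between a training and a test condition are used in neither set, so the evaluation requires generalization along both factors at once.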

Our methods are fully general in that inference can be performed on any model for which the user provides a fitting and an RDM prediction method. In practice, the complexity of the models is limited by the requirement that we need to fit each model thousands of times in our bootstrap-wrapped crossvalidation scheme. Thus, we need a sufficiently fast and reliable fitting method for the model.

If fitting the model so often is not feasible or if the data RDMs do not provide sufficient constraints, one solution is to fit all models using a separate set of neural data before the inferential analyses. This approach is appropriate when many parameters are to be fitted, as is the case in nonlinear systems identification approaches as well as linear encoding models (Wu et al., 2006), where a large set of neural fitting data is required. All conclusions are then conditional on the fitting data: Inference will generalize to new test data assuming models are fitted on the same fitting data. Our methods support fitting of lower-parametric models as part of the model-comparative inference. When applicable, this approach obviates the need for separate neural data for fitting and supports stronger generalization (not conditional on the neural fitting data).

Supported tests and implications of test results

Our methods enable comparison of a model’s RDM prediction performance (1) against other models, (2) against the noise ceiling, and (3) against chance performance. The first two of these tests are central to the evaluation of models. The test against chance performance is often also reported, but represents a low bar that we should expect most models to pass. In practice, RDM correlations tend to be positive even for very different representations, because physically highly similar stimuli or conditions tend to be similar in all representations. Just like a significant Pearson correlation indicates a dependency, but does not demonstrate that the dependency is linear, a significant RDM prediction result indicates the presence of stimulus information, but does not lend strong support to the particular model. We should resist interpreting significant prediction performance per se as evidence for a particular model (the single-model-significance fallacy; Kriegeskorte and Douglas, 2019b). Theoretical progress instead requires that each model be compared to alternative models and to the noise ceiling. An additional point to note is that the interpretation of chance performance, where the RDM comparator equals 0, depends on the chosen RDM comparator, differing, for example, between the Pearson correlation coefficient and the cosine similarity (Diedrichsen et al., 2020).
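The dependence of the zero point on the RDM comparator can be illustrated with a small simulation. The sketch assumes vectors of independent, nonnegative dissimilarities, which is a simplification (such vectors need not correspond to valid representational geometries); the function names are hypothetical.

```python
import numpy as np


def pearson_sim(a, b):
    """Pearson correlation: both vectors are mean-centered first."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def cosine_sim(a, b):
    """Cosine similarity: no centering, so the origin is meaningful."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


rng = np.random.default_rng(0)
n_pairs = 45  # number of dissimilarities for 10 conditions
pearson_vals, cosine_vals = [], []
for _ in range(2000):
    # Two unrelated vectors of nonnegative dissimilarities
    rdm_a = rng.uniform(0.0, 2.0, n_pairs)
    rdm_b = rng.uniform(0.0, 2.0, n_pairs)
    pearson_vals.append(pearson_sim(rdm_a, rdm_b))
    cosine_vals.append(cosine_sim(rdm_a, rdm_b))

mean_pearson = float(np.mean(pearson_vals))
mean_cosine = float(np.mean(cosine_vals))
print("mean Pearson correlation:", mean_pearson)
print("mean cosine similarity:", mean_cosine)
```

Under these assumptions, unrelated dissimilarity vectors yield Pearson correlations scattered around 0, whereas their cosine similarity is clearly positive. A cosine similarity of 0 therefore corresponds to a different null hypothesis than a Pearson correlation of 0.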

RDM comparators like the Pearson correlation and the cosine similarity are related to the distance correlation (Székely et al., 2007), a general indicator of mutual information. Like a significant distance correlation, a significant RDM correlation mainly demonstrates that there is some mutual information between the brain region in question and the model representation. For a visual representation, for example, all that is required is for the two representations to contain some shared information about the input images. In contrast to the distance correlation (and other nonnegative estimates of mutual information), however, negative RDM correlations can occur, indicating simply that pairs of stimuli close in one representation tend to be far in the other and vice versa. For any RDM, there is even a valid perfectly anti-correlated RDM (Pearson $r = -1$), which can be found by flipping the sign of all dissimilarities and adding a large enough value to make the RDM conform to the triangle inequality (which ensures the existence of an embedding of points that is consistent with the anti-correlated RDM). The existence of valid negative RDM correlations is important to the inferential methods presented here because it is required for our assumption of symmetric ($t$-)distributions around the true RDM correlation.
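The construction of the anti-correlated RDM can be checked numerically. In this sketch, the offset of twice the largest dissimilarity is one sufficient choice (an assumption of the example, not a value given above): every flipped dissimilarity then lies between half the offset and the offset, so any two of them sum to at least the third. The function names are hypothetical.

```python
import numpy as np


def anti_correlated_rdm(rdm):
    """Flip the sign of all dissimilarities and add a constant offset."""
    offset = 2.0 * rdm.max()  # sufficient for the triangle inequality
    flipped = offset - rdm
    np.fill_diagonal(flipped, 0.0)  # self-dissimilarities stay zero
    return flipped


def satisfies_triangle_inequality(rdm, tol=1e-12):
    """Check d[i, j] <= d[i, k] + d[k, j] for all i, j, k."""
    # two_step[i, k, j] = d[i, k] + d[k, j]
    two_step = rdm[:, :, None] + rdm[None, :, :]
    return bool(np.all(rdm <= two_step.min(axis=1) + tol))


# Example RDM: Euclidean distances among random points
rng = np.random.default_rng(0)
points = rng.standard_normal((8, 3))
rdm = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

flipped = anti_correlated_rdm(rdm)
upper = np.triu_indices(rdm.shape[0], k=1)
r = np.corrcoef(rdm[upper], flipped[upper])[0, 1]
print("Pearson r:", r)
print("triangle inequality holds:", satisfies_triangle_inequality(flipped))
```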

Omnibus tests for the presence of information about the experimental conditions in a brain region have been introduced in previous studies (e.g. Kriegeskorte et al., 2006; Allefeld et al., 2016; Nili et al., 2020). Whether stimulus information is present in a region is closely related to the question whether the noise ceiling is significantly larger than 0, indicating RDM replicability. Such tests can sensitively detect small amounts of information in the measured activity patterns and can be helpful to assess whether there is any signal for model comparisons. If we are uncertain whether there is a reliable representational geometry to be explained, we need not bother with model comparisons.

The question whether an individual dissimilarity is significantly larger than zero is equivalent to the question whether the distinction between the two conditions can be decoded from the brain activity. Decoding analyses can be used for this purpose (Naselaris et al., 2011; Hebart et al., 2014; Tong and Pratte, 2012; Kriegeskorte and Douglas, 2019b). Such tests require care because the discriminability of two conditions cannot be systematically negative (Allefeld et al., 2016). This is in contrast to comparisons between RDMs, which can be systematically negative (although, as mentioned above, they tend to be positive in practice).

How many subjects, conditions, repetitions, and measurement channels?

Statistical inference gains power when more data are collected along any dimension. More independent measurement channels, more subjects, more conditions, and more repetitions all help. How much data is needed along each of these dimensions depends on the experiment. The most helpful dimension to extend is the one that currently limits generalization. When crossvalidation across repeated measurements is used to eliminate the bias of the distance estimates (as in the crossnobis estimator), using more repetitions brings an additional performance bonus because it reduces the variance increase associated with unbiased estimates (Diedrichsen et al., 2020, Appendix 5).

Which distance estimator and RDM comparator?

The statistical inference procedures introduced here work for any choice of representational-distance estimator and RDM comparator. However, the choice of distance estimator and RDM comparator affects the power of model-comparative inference and the meaning of the inferential results.

For computing the RDM, we tested only variations of the crossnobis (crossvalidated Mahalanobis) distance estimator, as recommended based on earlier research (Walther et al., 2016). The crossnobis estimator can use different noise covariance estimates to normalize patterns, such that the noise distribution becomes approximately isotropic. The noise covariance matrix can be the identity (no normalization), diagonal (univariate normalization), or a full estimate (multivariate normalization). Consistent with previous findings (Walther et al., 2016; Ritchie et al., 2021), our results suggest that univariate noise normalization is always preferable to no normalization, and that multivariate noise normalization using a shrinkage estimate of the noise covariance (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005) helps in some circumstances and never hurts model discrimination.
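The three normalization variants can be sketched as follows. This is a simplified illustration, not the toolbox implementation: the shrinkage intensity is a fixed hypothetical value rather than the data-driven estimate of Ledoit and Wolf (2004) or Schäfer and Strimmer (2005), the shrinkage target is the diagonal matrix of variances, and all names and simulated data are assumptions of the example.

```python
import numpy as np


def noise_precision(residuals, method="shrinkage", shrinkage=0.5):
    """Inverse noise covariance used to normalize the patterns.

    residuals: (n_samples, n_channels) array of noise residuals.
    """
    cov = np.cov(residuals, rowvar=False)
    if method == "none":  # identity: no normalization
        return np.eye(cov.shape[0])
    if method == "univariate":  # diagonal covariance
        return np.diag(1.0 / np.diag(cov))
    # Multivariate: shrink the full estimate toward its diagonal
    target = np.diag(np.diag(cov))
    return np.linalg.inv(shrinkage * target + (1.0 - shrinkage) * cov)


def crossnobis_rdm(patterns, precision):
    """Crossvalidated Mahalanobis distances.

    patterns: (n_runs, n_conditions, n_channels) array. Pattern
    differences from two different runs are multiplied, so that
    independent noise does not bias the estimate.
    """
    n_runs, n_cond, n_chan = patterns.shape
    rdm = np.zeros((n_cond, n_cond))
    n_run_pairs = 0
    for r in range(n_runs):
        diff_r = patterns[r][:, None, :] - patterns[r][None, :, :]
        for s in range(n_runs):
            if r == s:
                continue
            diff_s = patterns[s][:, None, :] - patterns[s][None, :, :]
            rdm += np.einsum("ijp,pq,ijq->ij", diff_r, precision, diff_s)
            n_run_pairs += 1
    return rdm / (n_run_pairs * n_chan)


# Hypothetical data: 4 runs, 6 conditions, 10 measurement channels
rng = np.random.default_rng(0)
n_runs, n_cond, n_chan = 4, 6, 10
signal = rng.standard_normal((n_cond, n_chan))
patterns = signal + 0.5 * rng.standard_normal((n_runs, n_cond, n_chan))
residuals = (patterns - patterns.mean(axis=0)).reshape(-1, n_chan)

rdms = {
    method: crossnobis_rdm(patterns, noise_precision(residuals, method))
    for method in ("none", "univariate", "shrinkage")
}
for method, rdm in rdms.items():
    print(method, rdm[0, 1])
```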

For evaluating RDM predictions, we can distinguish RDM comparison methods by the scale they assume for the distance estimates: ordinal, interval, or ratio. For ordinal comparisons, the different rank correlation coefficients perform similarly. We recommend $ρ_{a}$ for its computational efficiency and analytically derived noise ceiling. For interval- and ratio-scale comparisons, a more complex pattern emerges. In particular, whether cosine similarities (ratio scale) or Pearson correlations (interval scale) work better depends on the structure of the model RDMs to be compared. We recently proposed whitened variants of the cosine similarity and Pearson correlation, which take into account that the distance estimates in an RDM are not independent (Diedrichsen et al., 2020). The whitened RDM comparators were more sensitive to subtle differences in model performance when evaluated on fixed models (Figure 5c). In the simulations based on the calcium-imaging data, whitened RDM comparators still performed better on average, but there were some cortical areas that were easier to identify by using the unwhitened comparison measures.
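A rank-based comparator of this kind can be sketched as below. The sketch assumes that $ρ_{a}$ is computed from tie-averaged ranks $x$ and $y$ of the $n$ dissimilarities as $12\,x^\top y / (n^3 - n) - 3(n+1)/(n-1)$; this formula is our assumption for the example and is not stated in this section. Without ties it coincides with the Spearman rank correlation. The function names are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def rho_a(x, y):
    """Rank correlation computed from the inner product of the ranks."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    return 12.0 * (rank_x @ rank_y) / (n ** 3 - n) - 3.0 * (n + 1) / (n - 1)


print(rho_a([1, 2, 3, 4], [1, 3, 2, 4]))   # no ties: equals Spearman's rho
print(rho_a([5, 5, 5, 5], [1, 2, 3, 4]))   # constant prediction
```

Because the normalization depends only on $n$ and not on the pattern of ties, a model that predicts many tied dissimilarities is not favored in this sketch: a constant prediction scores 0.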

Alternative approaches

We present a frequentist inference methodology that uses crossvalidation to obtain point estimates of model performance and bootstrapping to estimate our uncertainty about them. Bayesian alternatives deserve consideration. For example, a Bayesian approach has been proposed to alleviate the bias of distance estimates (Cai et al., 2019). This Bayesian estimate makes more detailed assumptions about the trial dependencies than our crossvalidated distance estimators, which remove the bias. The Bayesian estimate might be preferable for its higher stability when its assumptions hold and could be used in combination with our model-comparative inference methods. For model comparisons, Bayesian inference is also an interesting alternative to the frequentist methods we discuss here (Kriegeskorte and Diedrichsen, 2016). Our whitened RDM comparison methods can be motivated as approximations to the likelihood for a model, and we reported recently that they afford similar power to likelihood-based inference with normal assumptions (Diedrichsen et al., 2020). Thus, frequentist inference using the whitened RDM comparators is related to Bayesian inference with a uniform prior across models. In the Bayesian framework, generalization to the populations of subjects and conditions would require a model of how RDMs vary across subjects and conditions. We currently do not have such a model. Until such models and Bayesian inference procedures for them are developed, the frequentist methods we present here remain the only available approach for generalization to the populations of subjects and conditions.

Another strongly related method for comparing models to data in terms of their geometry is pattern component modeling (Diedrichsen et al., 2018), which compares conditions in terms of their covariance over measurement channels instead of their representational dissimilarities. This approach is deeply related to representational similarity analysis (Diedrichsen and Kriegeskorte, 2017). Pattern component modeling is somewhat more rigid than RSA as the theory is based on normal distributions, but it has advantages in terms of analytical solutions. In particular, the likelihood of models can be directly evaluated, enabling tests based on the likelihood ratio. Due to the direct evaluation of likelihoods, this framework can be combined with Bayesian inference more easily and recently a variational Bayesian analysis was presented for this model (Friston et al., 2019).

Another powerful approach to inference on brain-computational models is to fit encoding models that predict measured brain-activity data instead of representational geometries (e.g. Wu et al., 2006; Kay et al., 2008; Dumoulin and Wandell, 2008; Naselaris et al., 2011; Wandell and Winawer, 2015; Diedrichsen and Kriegeskorte, 2017; Cadena et al., 2019a). This approach was originally developed in the context of low-dimensional models and measurements. When models and measurements are both high dimensional, even a linear encoding model can be severely under-constrained (Cadena et al., 2019b; Kornblith et al., 2019). As a result, an encoding model requires a combination of substantial fitting data and strong priors on the weights. The predictive model that is being evaluated comprises the encoding model and the priors on its weights (Diedrichsen and Kriegeskorte, 2017), which complicates the interpretation of the results (Cadena et al., 2019b; Kriegeskorte and Douglas, 2019b). Both model performances and the fitted weights can then be highly uncertain and/or dependent on the details of the assumed encoding model. The additional data and assumptions needed to fit complex encoding models motivate the consideration of methods as proposed here that do not require fitting of a high-parametric mapping from model to measured brain activity.

The generalization challenges that we tackle here for RSA apply equally to encoding models and pattern component modeling. Inferences are often meant to generalize to new subjects and/or experimental conditions. The alternative approaches, in their current implementations, do not yet enable simultaneous generalization to the populations of experimental conditions and subjects. By default, pattern component modeling and its Bayesian variants assume a single geometry and thus do not take either subject or condition variability into account. Variability across subjects can be taken into account in a group-level analysis (see e.g. Diedrichsen et al., 2018, 2.7.3), but this approach does not account for uncertainty due to the sample of experimental conditions. Encoding models usually follow the machine learning approach with training, validation, and test sets (e.g. Naselaris et al., 2011; Cichy et al., 2019; Cichy et al., 2021). Uncertainty about the model evaluations is either not estimated at all or estimated in a secondary analysis based on the variability across subjects, cells, or conditions. Because this secondary analysis is based solely on the test set, results are conditional on the training and validation sets, and so fall short of generalizing model-comparative inferences to the underlying populations. Note that the bootstrapping and crossvalidation approaches we introduce here are not inherently specific to RSA. These methods could be adapted for estimating the uncertainties about other model evaluation measures such as those provided by pattern component and encoding models.

Conclusion

We present a comprehensive new methodology for inference on models of representational geometries that is more powerful than previous approaches, can handle flexible models, and enables neuroscientists to draw conclusions that generalize to new subjects and conditions. The validity of the methods has been established through extensive simulations and using real neural data. These methods enable neuroscientists working with humans and animals to evaluate complex brain-computational models with measurements of neural population activity. As we enter the age of big models and big data, we hope these methods will help connect computational theory to neuroscientific experiment.

Materials and methods

The methods section for this paper is separated into two parts: First, we describe the RSA analysis pipeline we propose in full. In the second part, we describe the simulation methods we used to test our pipeline for this paper.

Full description of the RSA method

The inference method we describe here represents a new pipeline for representational similarity analysis. Nonetheless, some parts of the analysis appeared in earlier or concurrent publications (Kriegeskorte et al., 2008b; Nili et al., 2014; Walther et al., 2016; Storrs et al., 2014). In this section, we describe the whole pipeline, including both new and established procedures, without requiring familiarity with previous papers.

Overview of model-comparative inference.

Correction for variance caused by crossvalidation.

Illustration of the deep-neural-network-based simulations for functional magnetic resonance imaging (fMRI)-like data.

Results of the deep-neural-network-based simulations.

Validation of flexible model tests using bootstrap crossvalidation.

Functional magnetic resonance imaging (fMRI)-data-based simulation.

Results in mice with calcium-imaging data.

Evaluation of the tests using normally distributed data simulated under different null hypotheses.

Sensitivity to model differences of different RDM comparators.

Author details

Heiko H Schütt

Alexander D Kipnis

Jörn Diedrichsen

Nikolaus Kriegeskorte
