Statistical inference on representational geometries

  1. Heiko H Schütt (corresponding author)
  2. Alexander D Kipnis
  3. Jörn Diedrichsen
  4. Nikolaus Kriegeskorte (corresponding author)

  Affiliations:
  1. Zuckerman Institute, Columbia University, United States
  2. Western University, Canada
9 figures and 1 additional file

Figures

Figure 1
Overview of model-comparative inference.

(a) Multiple conditions are presented to observers and to models (here, different stimulus images). The brain-activity measurements during presentation yield a set of responses for each stimulus and subject, potentially with repetitions; a model yields a feature vector per stimulus. Importantly, no mapping between brain measurement channels and model features is required. (b) To compare the two representations, we compute a representational dissimilarity matrix (RDM) measuring the pairwise dissimilarities between conditions for each subject and each model. For model comparison, we perform 2-factor crossvalidation within a 2-factor bootstrap loop to estimate our uncertainty about the model performances. On each fold of crossvalidation, flexible models are fitted to the representational dissimilarities among a set of fitting stimuli, estimated in a set of fitting subjects (blue fitting dissimilarities). The fitted models must then predict the representational dissimilarities among held-out test stimuli for held-out test subjects (red test dissimilarities). The resulting performance estimates are not biased by overfitting to either subjects or stimuli. (c) Based on our uncertainty about model performances (error bars indicate estimated standard errors of measurement), we can perform various statistical tests, which are marked in the graphical display. Gray dew drops clinging to the lower bound of the noise ceiling mark models performing significantly below the noise ceiling. White dew drops on the horizontal axis mark models whose performance significantly exceeds 0 or chance performance. Pairwise differences are summarized by arrows: each arrow indicates that the model marked with the dot performed significantly better than the model the arrow points at and all models further away in the direction of the arrow.

Image credit: Ecoset (Mehrer et al., 2017) and Wikimedia Commons.
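
To make the RDM computation and comparison in panel (b) concrete, here is a minimal numpy/scipy sketch (illustrative only, not the accompanying toolbox implementation). The squared Euclidean distance and the Pearson correlation are assumed choices of dissimilarity measure and RDM comparator; the array shapes in the example are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def compute_rdm(responses):
    """Pairwise dissimilarities between conditions from a (conditions x channels) matrix.
    Squared Euclidean distance is an illustrative choice; returns the upper-triangular
    entries as a vector (the RDM in vector form)."""
    return pdist(responses, metric='sqeuclidean')

def compare_rdms(data_rdm, model_rdm):
    """Compare two RDM vectors; Pearson correlation is an illustrative comparator."""
    return pearsonr(data_rdm, model_rdm)[0]

# Hypothetical example: 10 conditions, 50 measurement channels, 100 model features.
rng = np.random.default_rng(0)
data_rdm = compute_rdm(rng.standard_normal((10, 50)))    # one subject's measurements
model_rdm = compute_rdm(rng.standard_normal((10, 100)))  # one model's feature vectors
print(compare_rdms(data_rdm, model_rdm))
```

In practice, crossvalidated dissimilarity estimates and other comparators (see Appendix 1—figure 2) may be preferable, and flexible models additionally require the fitting step described above.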

Figure 2
Correction for variance caused by crossvalidation.

(a) Unbiased estimates of the variance of model-performance estimates (dashed line) require either many crossvalidation cycles (light blue dots) or the proposed correction formula (black dots). Each model in each simulated dataset contributes one dot to each point cloud in this plot, corresponding to the average estimated variance across 100 repeated analyses. All variance estimates of a model are divisively normalized by the average corrected variance estimate for this model over all numbers of crossvalidation cycles for the dataset. For many crossvalidation cycles, the uncorrected and corrected estimates converge, but the correction formula yields this value even when we use only two crossvalidation cycles. (b) Reliability of the corrected bootstrap variance estimate across multiple estimations on the same dataset, comparing the use of more crossvalidation cycles per bootstrap sample (gray: 2, 4, 8, 16, 32 crossvalidations at 1000 bootstrap samples) to the use of more bootstrap samples (black: 1000, 2000, 4000, 8000, 16,000 bootstrap samples with 2 crossvalidation cycles per sample). The horizontal axis represents the total number of crossvalidation cycles (number of cycles per bootstrap sample × number of bootstrap samples). More bootstrap samples are more efficient at stabilizing our bootstrap estimates of the variance of model-performance estimates. Increasing the number of bootstrap samples decreases the variance roughly at the N^(-1/2) rate expected for sampling approximations, indicated by the dashed line.
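
The general idea of such a correction can be sketched as follows (a minimal sketch under a simple additive-noise assumption; the paper's exact correction formula is the authoritative version). If each bootstrap sample is evaluated with n_cv independent crossvalidation cycles, the variance of the per-sample mean evaluations across bootstrap samples contains an extra 1/n_cv share of the cycle-to-cycle crossvalidation variance, which can be estimated within samples and subtracted:

```python
import numpy as np

def corrected_bootstrap_variance(evals):
    """evals: (n_bootstrap, n_cv) array of model-performance evaluations,
    one row per bootstrap sample, one column per crossvalidation cycle (n_cv >= 2).
    The variance of the per-sample means across bootstrap samples contains an extra
    1/n_cv share of the cycle-to-cycle crossvalidation variance; subtracting an
    estimate of that share removes the crossvalidation-induced inflation."""
    n_boot, n_cv = evals.shape
    boot_means = evals.mean(axis=1)
    var_across = boot_means.var(ddof=1)            # true variance + cv variance / n_cv
    var_within = evals.var(axis=1, ddof=1).mean()  # cycle-to-cycle cv variance
    return var_across - var_within / n_cv
```

With such a subtraction, even two crossvalidation cycles per bootstrap sample yield an approximately unbiased variance estimate, consistent with panel (a).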

Figure 3
Illustration of the deep-neural-network-based simulations for functional magnetic resonance imaging (fMRI)-like data.

The aim of the analyses was always to infer which layer of AlexNet the simulation was based on. (a) Stimuli are chosen randomly from ecoset (Mehrer et al., 2017) and we simulate a simple rapid event-related experimental design. (b) The ‘true’ average responses of a voxel to the stimuli are based on local averages of the internal representations of AlexNet. To simulate the response of a voxel to a stimulus, we choose an (x,y)-position uniformly at random and take a weighted average of the activities around that location. As weights, we use a Gaussian in space and independently draw a weight between 0 and 1 for each feature. (c) To generate a simulated voxel timecourse, we construct the undistorted timecourse of voxel activities, convolve it with a standard hemodynamic-response function, and add temporally correlated normal noise. (d) To estimate the response of a voxel to a stimulus, we fit a standard general linear model (GLM), arriving at a noisy estimate of the true channel responses we started with in (b). (e) From the estimated channel responses we compute the stimulus-by-stimulus dissimilarity matrices. These dissimilarity matrices can then be compared to the dissimilarity matrices computed from the full deep-neural-network representations of the different layers.
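
A minimal sketch of the voxel model in panel (b), assuming a (height × width × features) activation tensor for one layer and one stimulus; the Gaussian width sigma is an illustrative parameter choice:

```python
import numpy as np

def sample_voxel(height, width, n_features, sigma=2.0, rng=None):
    """Draw one simulated voxel: a uniformly random (x, y) center with Gaussian spatial
    weights (the width sigma is an assumed value) and an independent weight per feature
    drawn between 0 and 1. These weights stay fixed for the voxel across stimuli."""
    rng = np.random.default_rng() if rng is None else rng
    cy, cx = rng.uniform(0, height), rng.uniform(0, width)
    ys, xs = np.mgrid[0:height, 0:width]
    spatial = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    feature_weights = rng.uniform(0, 1, size=n_features)
    return spatial, feature_weights

def voxel_response(feature_map, spatial, feature_weights):
    """'True' voxel response to one stimulus: a weighted average of the layer's
    (height x width x features) activations for that stimulus."""
    total_weight = spatial.sum() * feature_weights.sum()
    return (feature_map * spatial[:, :, None] * feature_weights).sum() / total_weight
```

In the full simulation, each voxel's center and feature weights are drawn once and held fixed across stimuli, so the voxel has a consistent 'true' tuning that the later GLM step tries to recover.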

Figure 4
Results of the deep-neural-network-based simulations.

(a–c) Relative uncertainty, that is, the bootstrap estimate of the standard deviation of model-performance estimates divided by the true standard deviation across repeated simulations. The dashed line and gray box indicate the expected value and the standard deviation due to the finite number of simulations per condition. (a) Bootstrap resampling of conditions when repeated simulations use random samples of conditions and a fixed set of subjects. (b) Bootstrap resampling of subjects when simulations use random samples of subjects (simulated voxel placements) and a fixed set of conditions. (c) Direct comparison of the uncorrected and corrected 2-factor bootstraps (see ‘Estimating the uncertainty of our model-performance estimates’ for details) for simulations that varied both conditions and subjects. (d–h) Signal-to-noise ratio (Equation 10), a measure of sensitivity to differences in model performance, for the different inference procedures and simulated scenarios. An infinite voxel averaging range refers to voxels averaging across the whole feature map. All error bars indicate standard deviations across the simulation types that fall into each category.
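
The relative uncertainty follows directly from the definition above; the signal-to-noise ratio is sketched here in a hedged form (the exact definition is Equation 10): the mean performance difference between two models divided by the standard deviation of that difference across repeated simulations.

```python
import numpy as np

def relative_uncertainty(bootstrap_sds, performance_estimates):
    """Mean bootstrap estimate of the standard deviation of a model's performance,
    divided by the 'true' standard deviation of the performance estimates observed
    across repeated simulations (values near 1 indicate accurate uncertainty estimates)."""
    return np.mean(bootstrap_sds) / np.std(performance_estimates, ddof=1)

def model_pair_snr(perf_a, perf_b):
    """A hedged reading of a model-discriminability signal-to-noise ratio (the exact
    definition is Equation 10): mean performance difference between two models divided
    by the standard deviation of that difference across repeated simulations."""
    diff = np.asarray(perf_a) - np.asarray(perf_b)
    return diff.mean() / diff.std(ddof=1)
```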

Figure 5
Validation of flexible model tests using bootstrap crossvalidation.

(a) Multidimensional scaling (MDS) arrangement of the representational dissimilarity matrices (RDMs) for one simulated dataset. Colored circles show the predictions based on one correct and one wrong layer, varying the voxel averaging region and the treatment of features (‘full’, ‘weighted’, and ‘avg’, as described in the text). Fixed models correspond to a single choice of model RDM for each layer. Selection models select the best-fitting voxel size from the RDMs presented in one color (or two colors for ‘both’). Crosses mark the four components of the linear model for Layer 2. The small black dots represent simulated subject RDMs without functional magnetic resonance imaging (fMRI) noise. (b) Histogram of relative uncertainties σ_boot/σ_true, showing that the bootstrap-wrapped crossvalidation accurately estimates the variance of the performance estimates across many different inference scenarios. (c) Model discriminability as signal-to-noise ratios for different model types.
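
A minimal sketch of how a selection model can be fitted and evaluated under crossvalidation. Candidate RDMs and data RDMs are treated as vectors over all condition pairs; fit_idx and test_idx are assumed index arrays selecting the pairs among fitting and among held-out test stimuli, and the fitting and test data RDMs come from the fitting and test subjects, respectively. Pearson correlation is an illustrative criterion:

```python
import numpy as np

def fit_selection_model(candidate_rdms, fit_data_rdm, fit_idx):
    """Selection model: among candidate RDM vectors (e.g., different voxel-averaging
    ranges for one layer), pick the one that best fits the fitting dissimilarities.
    Pearson correlation is an illustrative fitting criterion."""
    corrs = [np.corrcoef(c[fit_idx], fit_data_rdm[fit_idx])[0, 1] for c in candidate_rdms]
    return int(np.argmax(corrs))

def evaluate_selection_model(candidate_rdms, fit_data_rdm, test_data_rdm, fit_idx, test_idx):
    """Fit on the fitting dissimilarities (fitting stimuli, fitting subjects), then score
    the selected candidate on held-out test dissimilarities; the crossvalidation keeps the
    selection model's extra flexibility from inflating its performance estimate."""
    best = fit_selection_model(candidate_rdms, fit_data_rdm, fit_idx)
    return np.corrcoef(candidate_rdms[best][test_idx], test_data_rdm[test_idx])[0, 1]
```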

Figure 6
Functional magnetic resonance imaging (fMRI)-data-based simulation.

(a) These simulations are based on a dataset of neural recordings for 50 stimuli, each shown 35 times, in 5 human observers (Horikawa and Kamitani, 2017). To extract stimulus responses from these data, we perform two general linear model (GLM) steps as in the original publication. (b) In the first step, we regress out diverse noise regressors (provided by fMRIPrep) from the pooled fMRI runs. (c) We then apply a second GLM separately to each run to extract the stimulus responses. (d) We then extract regions of interest (ROIs) based on an atlas (Glasser et al., 2016) and randomly choose subsets of runs and stimuli of different sizes to enter further analyses. To simulate realistic noise, we estimate an AR(2) model on the second GLM’s residuals, permute the residuals and filter them with the fitted AR(2) model to restore their original autocorrelation structure, and finally scale them by factors of 0.1, 1, and 10. To generate simulated timecourses, we add these altered residuals to the GLM prediction. We then rerun the second GLM on the simulated data and use the beta-coefficient maps for the following steps. (e) Finally, we compute crossnobis representational dissimilarity matrices (RDMs) and perform representational similarity analysis (RSA) based on the overall RDM across all subjects. (f) Results of the simulations, separately for each noise scaling factor. The signal-to-noise ratio shows the same increase as for our abstract simulation. (g) The relative uncertainty converges to 1 with increasing numbers of stimuli. Error bars indicate standard deviations across different simulation types.
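
A minimal sketch of the crossnobis dissimilarity used in panel (e), reduced to two independent data partitions (e.g., a split of the runs); with more partitions one averages over all pairs of distinct partitions. The noise precision (inverse noise covariance) would typically be estimated from the GLM residuals and is simply passed in here:

```python
import numpy as np

def crossnobis_rdm(patterns_a, patterns_b, noise_precision):
    """Crossnobis dissimilarities from two independent data partitions (e.g., a run split).
    patterns_a, patterns_b: (conditions x channels) response estimates from each partition.
    noise_precision: (channels x channels) inverse noise covariance estimate.
    Crossing independent partitions makes the squared-distance estimate unbiased, so
    truly identical representations give dissimilarities of zero on average."""
    n_cond, n_chan = patterns_a.shape
    rdm = np.zeros((n_cond, n_cond))
    for i in range(n_cond):
        for j in range(i + 1, n_cond):
            diff_a = patterns_a[i] - patterns_a[j]
            diff_b = patterns_b[i] - patterns_b[j]
            d = diff_a @ noise_precision @ diff_b / n_chan
            rdm[i, j] = rdm[j, i] = d
    return rdm
```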

Figure 7
Results in mice with calcium-imaging data.

(a) Mouse visual cortex areas used for the analyses and resampling simulations. (b) Overall similarities of the representations in different cortical areas in terms of their representational dissimilarity matrix (RDM) correlations. For each mouse and cortical area (‘data RDM’, vertical), the RDM was correlated with the average RDM across all other mice (‘model RDM’, horizontal), for each other cortical area. We plot the average across mice of the crossvalidated RDM correlation (leave-one-mouse-out crossvalidation). The prominent diagonal shows that the representational geometries replicate across mice and are distinct between cortical areas. (c) Relative uncertainty for the 2-factor bootstrap methods. The gray box indicates the range of results expected from simulation variability if the bootstrap estimates were perfectly accurate. The correction is clearly advantageous here, although the method is still slightly conservative (overestimating the true standard deviation σ_true of model-performance evaluations) for small numbers of stimuli. For 40 or more stimuli, the corrected 2-factor bootstrap correctly estimates the variance of model-performance evaluations. (d) Signal-to-noise ratio validation: the signal-to-noise ratio (SNR) grows with the number of cells per subject and the number of repeats per stimulus. (e) Signal-to-noise ratio for different noise covariance estimates. Taking a diagonal covariance estimate into account, that is, normalizing cell responses by their standard deviations, is clearly advantageous. The shrinkage estimates provide a marginal improvement over that. (f) Signal-to-noise ratio for data sampled from different areas. (g) Which measure is optimal for discriminating the models depends on the data-generating area. On average, the cosine similarity has an advantage over the RDM correlation, and the whitened measures over the unwhitened ones. Error bars indicate standard deviations across different simulation types.

Image credit: Allen Institute.
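
A minimal sketch of the leave-one-mouse-out crossvalidated RDM correlation underlying panel (b), with Pearson correlation standing in for the RDM comparator and subject_rdms as an assumed (subjects × dissimilarities) array of RDM vectors for one cortical area:

```python
import numpy as np

def leave_one_out_rdm_correlation(subject_rdms):
    """Leave-one-subject-out RDM replicability: correlate each subject's RDM vector
    (here, each mouse's RDM for one cortical area) with the average RDM of all other
    subjects, then average across subjects. Pearson correlation stands in for the
    RDM comparator used in the figure."""
    subject_rdms = np.asarray(subject_rdms)  # (subjects x dissimilarities)
    corrs = []
    for s in range(subject_rdms.shape[0]):
        others_mean = np.delete(subject_rdms, s, axis=0).mean(axis=0)
        corrs.append(np.corrcoef(subject_rdms[s], others_mean)[0, 1])
    return float(np.mean(corrs))
```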

Appendix 1—figure 1
Evaluation of the tests using normally distributed data simulated under different null hypotheses.

Each plot shows the false-positive rate as a function of the number of subjects and conditions used. Ideal tests should fall on the dotted line at the nominal alpha level of 5%. Dots below the line indicate tests that are valid but conservative; dots above the line indicate invalid tests. The ‘test against chance’ simulations (top row) evaluate tests of the ability of a model to predict RDMs. Data are simulated under the null hypothesis of no correlation between the data and model RDMs; a positive result would (erroneously) indicate that the model predicts the data RDM better than expected by chance. The ‘model comparison’ simulations (middle and bottom rows) evaluate tests that compare the predictive accuracy of two models. Data are simulated under the null hypothesis that both models are equally good matches to the data. For the ‘fixed conditions’ simulations (middle row), this was enforced for the exact measured conditions. For the ‘random conditions’ simulations (bottom row), we instead generated models that are equally good on a large set of 1000 conditions, of which only a random subset of the given size is available for the inferential analysis.
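
The validation logic is the standard one for frequentist tests: simulate data under the null hypothesis, apply the test repeatedly, and check that the rejection rate does not exceed the nominal alpha. A minimal sketch, with run_test and simulate_null_dataset as hypothetical placeholders for a concrete test and its matching null-data generator:

```python
import numpy as np

def false_positive_rate(run_test, simulate_null_dataset, n_sim=1000, alpha=0.05, seed=0):
    """Estimate a test's false-positive rate: simulate datasets under the null hypothesis,
    apply the test, and count how often it (wrongly) rejects at the nominal alpha level.
    run_test (dataset -> p-value) and simulate_null_dataset (rng -> dataset) are
    hypothetical placeholders for a concrete test and its matching null-data generator."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        dataset = simulate_null_dataset(rng)
        rejections += run_test(dataset) < alpha
    return rejections / n_sim
```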

Appendix 1—figure 2
Sensitivity to model differences of different RDM comparators.

We used the data simulated on the basis of neural-network representations of images to assess how well different models (neural-network layer representations) can be discriminated in model-comparative inference when using different RDM comparators. We plot the model discriminability (signal-to-noise ratio, Equation 10) computed from the same simulated data for each RDM comparator and generalization objective (to new measurements of the same conditions in the same subjects: gray; to new measurements of the same conditions in new subjects: red; to new measurements of new conditions in the same subjects: blue; and to new measurements of new conditions in new subjects: purple). Because the condition-related variability dominated the simulated subject-related variability here, model discriminability is markedly higher (gray, red) when no generalization across conditions is attempted. The different rank-based RDM comparators (τ_a, τ_b, ρ_a, ρ_b) perform similarly and at least as well as the Pearson correlation (corr) and cosine similarity (cosine), while requiring fewer assumptions. This may motivate the use of the computationally efficient ρ_a, which we introduce in Appendix C. Better sensitivity to model differences can be achieved using the whitened Pearson correlation (whitened corr) and the whitened cosine similarity (whitened cosine).
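
For reference, a minimal sketch of the cosine and whitened cosine comparators applied to RDM vectors (upper-triangular entries). The whitened version shown here is one common form: the dissimilarity entries are decorrelated with the inverse square root of an estimated covariance of the measured dissimilarities before taking the cosine; rdm_cov is an assumed (pairs × pairs) covariance estimate:

```python
import numpy as np

def cosine_similarity(rdm_a, rdm_b):
    """Cosine similarity between two RDM vectors (upper-triangular dissimilarities)."""
    return rdm_a @ rdm_b / (np.linalg.norm(rdm_a) * np.linalg.norm(rdm_b))

def whitened_cosine_similarity(rdm_a, rdm_b, rdm_cov):
    """Whitened cosine similarity in one common form: decorrelate the dissimilarity
    entries with the inverse square root of an estimated covariance of the measured
    dissimilarities (rdm_cov, pairs x pairs), then take the cosine."""
    vals, vecs = np.linalg.eigh(rdm_cov)
    whitener = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    return cosine_similarity(whitener @ rdm_a, whitener @ rdm_b)
```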

Cite this article

Heiko H Schütt, Alexander D Kipnis, Jörn Diedrichsen, Nikolaus Kriegeskorte (2023) Statistical inference on representational geometries. eLife 12:e82566. https://doi.org/10.7554/eLife.82566