Workflow of the Fisher kernel prediction approach.

To generate a description of brain dynamics, we (1) concatenate all subjects’ individual timeseries; then (2) estimate a Hidden Markov Model (HMM) on these timeseries to obtain a group-level model; then (3) dual-estimate subject-level HMMs. Steps 1-3 are the same for all kernels. To then use this description of all subjects’ individual patterns of brain dynamics, we map each subject into a feature space (4). This mapping can be done in different ways: in the naïve kernels (4a), the manifold (i.e., the curved structure) on which the parameters lie is ignored and the parameters are treated as if they lay in Euclidean space. The Fisher kernel (4b), on the other hand, respects the structure of the parameters in their original Riemannian manifold by working in the gradient space. We then construct kernel matrices (κ), in which each pair of subjects is assigned a similarity value given their parameters in the respective embedding space. Finally, we feed κ to kernel ridge regression to predict a variety of demographic and behavioural traits in a cross-validated fashion (5).
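For illustration, the following is a minimal sketch of steps (4)-(5): given per-subject feature vectors (standing in for either the vectorised subject-level HMM parameters of the naïve kernels or the Fisher score vectors of the Fisher kernel), a linear kernel matrix κ is built and passed to kernel ridge regression under cross-validation. All data, dimensions, and the regularisation strength below are placeholders, not the settings used in the actual analysis.

```python
# Minimal sketch of steps (4)-(5): a linear kernel over per-subject feature
# vectors, fed to kernel ridge regression with a precomputed kernel. The
# "features" matrix is a placeholder for either the vectorised subject-level
# HMM parameters (naive kernels) or the Fisher score vectors (Fisher kernel).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_subjects, n_features = 100, 50
features = rng.standard_normal((n_subjects, n_features))  # placeholder embedding
y = rng.standard_normal(n_subjects)                       # placeholder trait

# Linear kernel: kappa[i, j] = <phi(subject_i), phi(subject_j)>
kappa = features @ features.T

preds = np.zeros(n_subjects)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(features):
    model = KernelRidge(alpha=1.0, kernel="precomputed")
    model.fit(kappa[np.ix_(train_idx, train_idx)], y[train_idx])
    preds[test_idx] = model.predict(kappa[np.ix_(test_idx, train_idx)])

accuracy = np.corrcoef(preds, y)[0, 1]  # Pearson's r between predicted and actual
```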

Distributions of performance across subject traits and CV iterations for different prediction methods on HCP data.

The best-performing methods are highlighted by black arrows in each plot. a) Pearson’s correlation coefficients (r) between predicted and actual variable values in deconfounded space as a measure of prediction accuracy (x-axis) of each method (y-axis). Larger values indicate that the model predicts more accurately. The linear Fisher kernel has the highest average accuracy among the time-varying methods, while the Ridge regression model in Riemannian space has the highest average accuracy among the time-averaged methods. Note that, for visualisation purposes, we show the distribution across target variables and CV iterations averaged over folds, while the fold-wise accuracies were used for significance testing. Asterisks indicate significant Benjamini-Hochberg-corrected p-values below 0.05 (*) from repeated k-fold cross-validation corrected t-tests. b) Coefficient of determination (R²) in deconfounded space (x-axis) for each of the methods (y-axis). The x-axis is cropped at -0.1 for visualisation purposes since individual runs can produce large negative outliers (see panel c). c) Normalised maximum absolute errors (NMAXAE) in original (non-deconfounded) space as a measure of excessive errors (x-axis) by method (y-axis). Large maximum errors indicate that the model predicts very poorly in single cases. Differences between the methods lie mainly in the tails of the distributions, where the naïve normalised Gaussian kernel produces extreme maximum errors in some runs (NMAXAE > 10,000), while the linear naïve normalised kernel and the linear Fisher kernel, along with several time-averaged methods, have the smallest risk of excessive errors (NMAXAE below 1). The x-axis is plotted on the log scale. d) Robustness of prediction accuracies. The plot shows, for each method (y-axis), the distribution across variables of the standard deviation of correlation coefficients over folds and CV iterations (x-axis). Smaller values indicate greater robustness. The linear Fisher kernel and the time-averaged Ridge regression model in Riemannian space are the most robust. Asterisks indicate significant Benjamini-Hochberg-corrected p-values from repeated-measures t-tests below 0.01 (**) and 0.001 (***). a), b), c) Each violin plot shows the distribution over 3,500 runs (100 iterations of 10-fold CV for all 35 variables) for each method. d) Each violin plot shows the distribution over the 35 predicted variables for each method.
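The performance measures reported in panels a)-d) can be summarised as in the sketch below; note that the exact normalisation used for NMAXAE is an assumption here (maximum absolute error divided by the range of the true values).

```python
# Sketch of the performance measures shown in panels a)-d). The normalisation
# of NMAXAE (dividing the maximum absolute error by the range of the true
# values) is an assumption, not necessarily the exact definition used here.
import numpy as np

def pearson_r(y_true, y_pred):
    # panel a): correlation between predicted and actual values
    return np.corrcoef(y_true, y_pred)[0, 1]

def r_squared(y_true, y_pred):
    # panel b): coefficient of determination
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def nmaxae(y_true, y_pred):
    # panel c): normalised maximum absolute error (assumed normalisation)
    return np.max(np.abs(y_true - y_pred)) / (np.max(y_true) - np.min(y_true))

def robustness(fold_correlations):
    # panel d): standard deviation of fold-wise correlations (smaller = more robust)
    return np.std(fold_correlations)
```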

Model performance estimates over cross-validation (CV) iterations by behavioural variable and method, ordered by accuracy on HCP data.

Boxplots show the distribution over 100 iterations of 10-fold CV of correlation coefficient values (x-axis) for each method, separately for each of the 35 predicted variables (y-axes). Among the time-varying methods, the linear Fisher kernel (green) predicts many variables with higher accuracy and also shows the narrowest range, indicating high robustness. However, for many target variables, it is outperformed by the time-averaged tangent-space models (Ridge reg. Riem. and Elastic Net Riem.). Black lines within each boxplot represent the median.

Simulations.

a) Simulating two groups of subjects that differ in their state means. The error distributions across all 10 iterations show that the Fisher kernel recovers the simulated group difference in all runs with 0% error. b) Simulating two groups of subjects that differ in their transition probabilities. Neither kernel is able to reliably recover the group difference across all 10 iterations. c) Simulating two groups of subjects that differ in their transition probabilities, but excluding state parameters when constructing the kernels. The Fisher kernel performs best at recovering the group difference.
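As a rough illustration of the simulation setup in panel b), the sketch below samples timeseries for two groups from HMMs that share state means but differ in their transition probabilities; all parameter values are illustrative and do not reproduce the exact simulation settings.

```python
# Rough sketch of the panel b) simulation: two groups sampled from HMMs with
# shared state means but different transition probabilities. All values here
# are illustrative and not the settings used in the actual simulations.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_channels, n_timepoints = 3, 5, 500
state_means = rng.standard_normal((n_states, n_channels))  # shared across groups

def sample_subject(transition_matrix):
    """Sample one subject's timeseries from an HMM with Gaussian observations."""
    states = np.zeros(n_timepoints, dtype=int)
    for t in range(1, n_timepoints):
        states[t] = rng.choice(n_states, p=transition_matrix[states[t - 1]])
    return state_means[states] + rng.standard_normal((n_timepoints, n_channels))

P1 = np.array([[0.90, 0.05, 0.05],   # group 1: persistent states
               [0.05, 0.90, 0.05],
               [0.05, 0.05, 0.90]])
P2 = np.array([[0.60, 0.20, 0.20],   # group 2: faster switching
               [0.20, 0.60, 0.20],
               [0.20, 0.20, 0.60]])

group1 = [sample_subject(P1) for _ in range(20)]
group2 = [sample_subject(P2) for _ in range(20)]
```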

Effects of removing sets of features from the kernels on prediction accuracies.

a) In terms of overall prediction accuracy, removing state features significantly decreases performance for the Fisher kernel and the naïve normalised kernel, while removing transition features has no significant effect. b) Removing features has similar effects on all variables, both better-predicted (left panel) and worse-predicted ones (right panel).
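The feature-removal comparison amounts to dropping one block of columns from the per-subject embedding before the kernel is constructed, roughly as in the sketch below; the column layout is hypothetical.

```python
# Sketch of the feature-removal comparison: drop one block of columns from the
# per-subject embedding before building the (linear) kernel. Which columns hold
# which parameters depends on the embedding; the indices below are hypothetical.
import numpy as np

def build_kernel(features, drop_columns=None):
    """Linear kernel after optionally removing a block of feature columns."""
    if drop_columns is not None:
        features = np.delete(features, drop_columns, axis=1)
    return features @ features.T

# e.g., if columns 0-9 held transition features and columns 10-59 state features:
# kappa_full      = build_kernel(features)
# kappa_no_states = build_kernel(features, drop_columns=np.arange(10, 60))
```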

Effects of HMM training scheme.

a) Prediction accuracies for HMM-based kernels depending on HMM training scheme (training on all subjects: together; training only on the training set: separate). In the real data (N=1,001), fitting the HMM to all subjects before constructing the kernels has no effect compared to fitting it only to the training set to preserve train-test separation. Note that here we plot the fold-wise accuracies (as opposed to accuracies averaged over folds, as in the figures above), and we ran only one iteration of CV (rather than 100 repetitions, as in the figures above). b) Prediction accuracies in simulated heterogeneous subject groups depending on training scheme, between-group difference, and target variable (Y) noise. In the simulated data, the Fisher kernel’s performance decreases as the test subjects become increasingly different from the training subjects. c) Example kernels for a high between-group difference. The naïve kernel underperforms in both cases; the strong difference between training and test subjects is visible in the naïve normalised kernel, whereas it completely dominates the Fisher kernel.
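The difference between the two training schemes lies only in where the group-level model is fitted relative to the cross-validation split. The sketch below illustrates this with a PCA embedding standing in for the (much more involved) group-level HMM; it is not the actual pipeline.

```python
# Schematic comparison of the "together" and "separate" training schemes. A PCA
# embedding stands in for the group-level HMM (the real pipeline is far more
# involved); the point is only where the group model is fitted relative to the
# cross-validation split.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))  # placeholder per-subject data
y = rng.standard_normal(100)        # placeholder target variable

def run_cv(X, y, fit_group_model_on_all):
    preds = np.zeros(len(y))
    group_model_all = PCA(n_components=10).fit(X) if fit_group_model_on_all else None
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        # "together": group model fitted on all subjects; "separate": training set only
        group_model = group_model_all if fit_group_model_on_all else PCA(n_components=10).fit(X[tr])
        emb = group_model.transform(X)   # subject embeddings
        kappa = emb @ emb.T              # linear kernel
        krr = KernelRidge(alpha=1.0, kernel="precomputed")
        krr.fit(kappa[np.ix_(tr, tr)], y[tr])
        preds[te] = krr.predict(kappa[np.ix_(te, tr)])
    return np.corrcoef(preds, y)[0, 1]

r_together = run_cv(X, y, fit_group_model_on_all=True)
r_separate = run_cv(X, y, fit_group_model_on_all=False)
```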