Workflow of the Fisher kernel prediction approach. To generate a description of brain dynamics, we (1) concatenate all subjects' individual timeseries; then (2) estimate a Hidden Markov Model (HMM) on these timeseries to generate a group-level description of brain dynamics; then (3) dual-estimate subject-level HMMs. Steps 1-3 are the same for all kernels. To use these descriptions of individual brain dynamics for prediction, we map each subject into a feature space (4). This mapping can be done in different ways: in the naïve kernels (a), the manifold on which the parameters lie is ignored and examples are treated as if they were in Euclidean space. The Fisher kernel (b), on the other hand, respects the structure of the parameters' original Riemannian manifold by working in the gradient space. We then construct kernel matrices k, in which each pair of subjects is assigned a similarity value given their parameters in the respective embedding space. Finally, we feed k to kernel ridge regression to predict a variety of demographic and behavioural traits in a cross-validated fashion (5).
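A minimal sketch of steps (4)-(5) is given below, assuming the per-subject embedding matrix F (one row per subject) has already been computed, e.g. stacked subject-level HMM parameters for the naïve kernels or Fisher score vectors (gradients of the log-likelihood with respect to the group-level HMM parameters) for the Fisher kernel. The function names, the use of scikit-learn, and the fixed regularisation parameter are illustrative assumptions, not the paper's implementation; in practice the regularisation parameter would be tuned in a nested cross-validation loop.

```python
# Sketch: build a linear kernel from per-subject embeddings and predict a trait
# with cross-validated kernel ridge regression on a precomputed kernel.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

def linear_kernel(F, normalise=False):
    """Linear kernel k(i, j) = <f_i, f_j>; optionally on unit-norm embeddings."""
    if normalise:
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T

def predict_trait(K, y, n_folds=10, alpha=1.0, seed=0):
    """10-fold cross-validated kernel ridge regression on a precomputed kernel K."""
    y_pred = np.zeros_like(y, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(y):
        model = KernelRidge(alpha=alpha, kernel="precomputed")
        model.fit(K[np.ix_(train, train)], y[train])      # train-by-train block
        y_pred[test] = model.predict(K[np.ix_(test, train)])  # test-by-train block
    return y_pred

# Example with random placeholders standing in for embeddings and one trait
rng = np.random.default_rng(0)
F = rng.standard_normal((100, 500))   # hypothetical subject embeddings
y = rng.standard_normal(100)          # hypothetical behavioural trait
K = linear_kernel(F)
y_pred = predict_trait(K, y)
print(np.corrcoef(y, y_pred)[0, 1])   # prediction accuracy (Pearson's r)
```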

Distributions of model performance by kernel. The best-performing methods are highlighted by grey columns in each plot. a) Pearson's correlation coefficients (r) between predicted and actual variable values as a measure of prediction accuracy (y-axis) of each method (x-axis). Larger values indicate that the model predicts more accurately. The linear Fisher kernel has the highest average accuracy. b) Normalised maximum errors (NMAXAE) as a measure of excessive errors (y-axis) by kernel (x-axis). Large maximum errors indicate that the model predicts very poorly in single cases. Differences between the kernels lie mainly in the tails of the distributions, where the KL divergence model produces extreme maximum errors in some runs (NMAXAE > 1,000), while the linear naïve normalised kernel and the linear Fisher kernel have the smallest risk of excessive errors (NMAXAE < 10). The y-axis is plotted on a log scale. Asterisks here indicate significant results of Kolmogorov-Smirnov permutation tests. c) Robustness of the correlation between model-predicted and actual values. The plot shows, for each kernel (x-axis), the distribution across variables of the standard deviation of correlation coefficients over CV iterations (y-axis). Smaller values indicate greater robustness. The linear naïve normalised kernel and the linear Fisher kernel are most robust. a), b) Each violin plot shows the distribution over 3,500 runs (100 iterations of 10-fold CV for all 35 variables) predicted from each kernel. c) Each violin plot shows the distribution over the 35 variables predicted from each kernel. Asterisks indicate significant Bonferroni-corrected p-values using 20,000 permutations: *: p < 0.05, **: p < 0.01, ***: p < 0.001.
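For reference, the three quantities plotted in panels a)-c) can be computed as sketched below, given arrays of predicted and actual values per CV iteration. The NMAXAE normalisation shown here (maximum absolute error divided by the range of the actual values) is an assumption for illustration; the exact normalisation follows the definition given in the Methods.

```python
# Sketch of the evaluation quantities in panels a)-c).
import numpy as np

def pearson_r(y_true, y_pred):
    """Panel a): correlation between predicted and actual values."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def nmaxae(y_true, y_pred):
    """Panel b): maximum absolute error, normalised here by the target range (assumed)."""
    return np.max(np.abs(y_pred - y_true)) / (y_true.max() - y_true.min())

def robustness(r_per_iteration):
    """Panel c): standard deviation of r over CV iterations (smaller = more robust)."""
    return np.std(r_per_iteration)
```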

Variables predicted at relatively high accuracy by at least one method. The table shows mean correlation coefficients for all methods for those variables that could be predicted at relatively high accuracy (correlation coefficient > 0.3) by at least one method. The best-performing model is highlighted in bold.

Model performance estimates over CV iterations by behavioural variable and method. Boxplots show the distribution over 100 iterations of 10-fold CV of correlation coefficient values (x-axis) of each method, separately for each of the 35 predicted variables (y-axes). The Fisher kernel (green) not only predicts many variables with higher accuracy, but also shows the narrowest range, indicating high robustness. Black lines within each boxplot represent the median.

Simulations. If the group difference was apparent in the features, there would be a clear difference between red and blue dots in the feature plot. If the group difference was apparent in the kernels, there would be a checkerboard pattern with four squares in the kernel matrices: high similarity within the first half and within the second half of subjects, and low similarity between the first and the second half of subjects. Each kernel should also have the strongest similarity on the diagonal, because each subject should be more similar to themselves than to any other subject. In the second kernel plots, we remove the diagonal for visualisation purposes to show the group difference more clearly. a) Simulating two groups of subjects that differ in their state means. The error distributions of all 10 iterations show that the Fisher kernel recovers the simulated group difference in all runs with 0% error (1). Features, kernel matrices, and kernel matrices with the diagonal removed for the first iteration for the linear naïve kernel (2), the linear naïve normalised kernel (3), and the linear Fisher kernel (4). The Fisher kernel matrices show an obvious checkerboard pattern corresponding to the within-group similarity and the between-group dissimilarity of the first and the second half of subjects. b) Simulating two groups of subjects that differ in their transition probabilities. None of the kernels is able to reliably recover the group difference, as shown in the error distribution of all 10 iterations (1), and the features and kernel matrices of one example iteration (2-4). c) Simulating two groups of subjects that differ in their transition probabilities, but excluding state parameters when constructing the kernels. The Fisher kernel performs best in recovering the group difference, as shown by the error distributions of all 10 iterations (1).
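The logic of panel a) can be illustrated with the toy sketch below: two simulated groups whose embeddings differ (here simply by a mean shift, standing in for a difference in HMM state means), and a check for the expected checkerboard structure by comparing within-group with between-group kernel similarity. This is a stand-in for illustration only, not the HMM-based simulation used in the paper.

```python
# Toy sketch: does a linear kernel on the embeddings show the group structure?
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_feat = 25, 50
group1 = rng.standard_normal((n_per_group, n_feat))
group2 = rng.standard_normal((n_per_group, n_feat)) + 0.5   # shifted group mean
F = np.vstack([group1, group2])                             # subject embeddings

K = F @ F.T                                   # linear kernel on the embeddings
labels = np.repeat([0, 1], n_per_group)
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)   # drop the diagonal, as in the plots

within = K[same & off_diag].mean()
between = K[~same].mean()
print(within > between)   # True if the checkerboard (group) structure is recovered
```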

Effects of removing sets of features on prediction accuracies. a) In the overall prediction accuracies, removing state features significantly decreased performance for the Fisher kernel and the naïve normalised kernel, while removing transition features had no significant effect. b), c) Removing features has similar effects across variables, both the better-predicted (b) and the worse-predicted (c) ones.
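The feature-removal comparison amounts to dropping parameter blocks from each subject's embedding before constructing the kernel, as sketched below. The block names and sizes are hypothetical placeholders; the actual embedding layout follows the HMM parameterisation described in the Methods.

```python
# Sketch: drop state- or transition-related blocks before building the kernel.
import numpy as np

def build_embedding(blocks, drop=()):
    """Concatenate the retained feature blocks into one vector per subject."""
    return np.concatenate([v for k, v in blocks.items() if k not in drop])

# hypothetical per-subject parameter blocks (placeholder sizes)
subject_blocks = {
    "initial_probabilities": np.zeros(6),
    "transition_probabilities": np.zeros(36),
    "state_means": np.zeros(6 * 50),
    "state_covariances": np.zeros(6 * 50),
}
full = build_embedding(subject_blocks)
no_states = build_embedding(subject_blocks, drop=("state_means", "state_covariances"))
no_transitions = build_embedding(subject_blocks, drop=("transition_probabilities",))
```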