It is problematic to naively use statistics when necessary assumptions are not met. A: Empirical distribution of the Pearson correlation coefficient between two independent samples of 200 data points each, repeated 10000 times. Within each sample the data are either independent, weakly autocorrelated, or strongly autocorrelated. When the samples are temporally autocorrelated, the empirical distribution of the correlation coefficient widens beyond the theoretical distribution for independent and identically distributed (IID) data. B: Corresponding distributions of p-values obtained under the assumption that the data are independent, showing that this mistake inflates the number of small p-values, yielding false positives (type I errors) beyond the significance level set by the researcher. C: Empirical distribution of the deviance of a Poisson GLM comparing two nested models, where one includes an irrelevant covariate (the x-axis is cut at 30). 200 data points are generated for each covariate and the response variable, and this is repeated 10000 times. The covariates are either IID or autocorrelated, and we compare cases where another relevant covariate is or is not missing from the model. D: Corresponding distributions of p-values obtained from likelihood-ratio tests comparing the nested models. When a relevant covariate is missing, the number of small p-values, and hence the rate of false positives, is inflated. E: Type I error rates (how often an overly complex model attains the best CV score) for different CV schemes, varying the block size and whether blocks are skipped, including a scheme where folds are drawn completely at random, ignoring temporal structure. Failure to properly account for temporal dependencies greatly increases the probability that the wrong model is selected. Data are generated as in Results on simulated data, with 300 spike trains and 3 covariates simulated.
The bars show 95% Clopper-Pearson confidence intervals for the proportion of incorrect conclusions for each CV scheme. F: Same scenario as in E, except that here a combination of cross-validation and the Wilcoxon signed-rank test is used. The significance level is set to α = 0.05, meaning this is the type I error rate we expect to see. With blocks that are too small the false positive rate is inflated, and with larger blocks the method appears overly conservative. G: Expected experimentwise type I error rates when performing multiple, independent tests, illustrating the need for correction. For a hypothesis test that on its own has type I error rate α, the experimentwise type I error rate grows as 1 − (1 − α)^n, where n is the number of tests.
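The widening of the null distribution under autocorrelation (panels A and B) can be reproduced with a minimal sketch. The AR(1) generator, sample size, and coefficients below are illustrative choices, not the paper's exact simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, phi, rng):
    """Generate an AR(1) series x_t = phi * x_{t-1} + noise."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

def null_corr_sd(phi, n=200, reps=2000, rng=rng):
    """Std. dev. of the Pearson correlation between two independent AR(1) series."""
    r = [np.corrcoef(ar1(n, phi, rng), ar1(n, phi, rng))[0, 1]
         for _ in range(reps)]
    return np.std(r)

# For IID data (phi = 0) the null sd is about 1/sqrt(n - 1) ~= 0.071;
# autocorrelation widens the empirical null distribution well beyond this.
print(null_corr_sd(0.0))   # close to 0.071
print(null_corr_sd(0.9))   # substantially larger
```

A test that assumes the IID null distribution will therefore call too many autocorrelated-but-independent pairs significant, which is exactly the inflation shown in panel B.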

Summary of the different hypothesis tests used in step 3 of the forward selection procedure, and the corresponding CV scheme used to determine which covariate should be added to the model.

Different hypothetical distributions of p-values resulting from a hypothesis test. For the test to be valid, the probability of obtaining a p-value smaller than a specified significance level α must be at most α, i.e. Pr(p ≤ α) ≤ α, when the null hypothesis is assumed to be true. If this is not the case, and Pr(p ≤ α) > α, the distribution will be skewed towards 0, the rate of false positives will be inflated (A), and the test is invalid. For a valid test we typically see Pr(p ≤ α) = α (B), giving a uniform distribution of p-values. A test might also be conservative, meaning it is still valid, but the distribution is skewed towards 1 (C). In the latter case the rate of false negatives will be inflated, and true positive results are less likely to be discovered.
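The uniform null distribution of panel B can be checked empirically with a quick simulation. The choice of a two-sample t-test and the sample sizes here are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Under the null hypothesis (two IID normal samples with equal means),
# the p-value of a valid test is uniform on [0, 1], so Pr(p <= alpha) = alpha.
pvals = np.array([
    stats.ttest_ind(rng.standard_normal(50), rng.standard_normal(50)).pvalue
    for _ in range(2000)
])

alpha = 0.05
print(np.mean(pvals <= alpha))  # close to 0.05 for a valid test
```

An inflated test (panel A) would give a proportion clearly above α; a conservative one (panel C), clearly below.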

Summary of results when using simulated data with 300 simulated cells. We show type I error rates (proportion of simulations where a covariate is incorrectly included) for scenario 1, where none of the observed covariates affect the response, and power (proportion of simulations where the two-dimensional covariate is correctly included) for scenario 2, where the two-dimensional covariate has an effect on the response. We also show corresponding 95% Clopper-Pearson confidence intervals. For each hypothesis test the significance level α = 0.05 is used.
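The Clopper-Pearson intervals reported here can be computed directly from beta quantiles. This is a generic sketch of the standard formula, not code from the paper; the example counts are made up:

```python
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion, given k successes out of n trials."""
    a = (1 - conf) / 2
    lo = beta.ppf(a, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - a, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Hypothetical example: 15 incorrect inclusions out of 300 simulated cells.
lo, hi = clopper_pearson(15, 300)
print(f"({lo:.3f}, {hi:.3f})")
```

Because the interval is exact rather than asymptotic, it remains valid for the small error counts that arise at α = 0.05 with 300 cells.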

Testing simulated data where none of the observed covariates have any effect on the response. A: For each method (CV, mSRMaxT, mSRRMaxT, CSBonf, SR, SRBonf, see Table 1) the number of cells classified as tuned to each of the 8 combinations of the three covariates is shown, for a total of 300 simulated cells. The bars are colored according to which final model is selected by the forward selection procedure. Note that the y-axis is in log-scale to emphasize sparse results. B: Distribution of p-values for each method at the first step. For this scenario the proportion of p-values smaller than α = 0.05 corresponds to the type I error rate, and must itself be smaller than α on average for the method to be valid. CV without any added hypothesis test exceeds this threshold, whereas the other methods all give error rates below it.

Testing on simulated data where the simulated position covariate has an effect on the response. For each method (CV, mSRMaxT, mSRRMaxT, CSBonf, SR, SRBonf, see Table 1) the number of cells classified as tuned to each of the 8 combinations of the three covariates is shown, for a total of 300 simulated cells. The bars are colored according to which final model is selected by the forward selection procedure. We see that the methods in general correctly favor the pure position model. Note that the y-axis is in log-scale to emphasize sparse results. The position covariate is relevant to the activity of the simulated cells and should be included in the model. Additionally, since we use a significance level of α = 0.05, we expect additional, irrelevant covariates to be included for approximately 5% of the cells where the correct one is included.

Summary of type I error rates (proportion of cells where a covariate is incorrectly included) for the scenario with mismatched data where none of the observed covariates affect the response, and proportion of the 147 grid cells classified by Zong et al. (2022) that are also classified as position tuned. We also show corresponding 95% Clopper-Pearson confidence intervals. For each hypothesis test the significance level α = 0.05 is used. (*) The confidence intervals are based on the binomial distribution, which assumes that the neurons are independent. In reality there are small correlations present, meaning that the intervals are slightly too optimistic, and should be somewhat wider.

Results from testing on calcium data where the cell activity and covariates are from different sessions. A: For each method (CV, mSRMaxT, mSRRMaxT, CSBonf, SR, SRBonf, see Table 1) the number of cells classified as tuned to each of the 8 combinations of the three covariates is shown, for a total of 249 cells. The bars are colored according to which final model is selected by the forward selection procedure. Note that the y-axis is in log-scale to emphasize sparse results. B: Distribution of p-values for each method at the first step. For this scenario the proportion of p-values smaller than α = 0.05 corresponds to the type I error rate, and must itself be smaller than α on average for the method to be valid. Cross-validation without any added hypothesis test exceeds this threshold, whereas the other methods all give error rates below it.

Results from testing on calcium data where the cell activity and covariates are from the same session. For each method (CV, mSRMaxT, mSRRMaxT, CSBonf, SR, SRBonf, see Table 1) the number of cells classified as tuned to each of the 8 combinations of the three covariates is shown, for a total of 249 cells. The bars are colored according to which final model is selected by the forward selection procedure. Most methods favor models including position as a covariate, and CS often suggests a model with multiple covariates.

Ratemaps and tuning curves for a selection of cells with different classifications by the GLM approach with CSBonf. A: Cells classified as grid cells by Zong et al. (2022), and as position tuned by the GLM. B: Cells not classified as grid cells, but as position tuned by the GLM. C: Cells classified as grid cells, but not classified as position tuned by the GLM. Animal position is binned using 40 bins in each direction, and the average number of events occurring in each two-dimensional bin is calculated. Then a two-dimensional Gaussian filter with a standard deviation of 1 bin is applied. Each ratemap is then normalized by dividing it by its 95th percentile. The mean of the average event rates of the group each subset of cells is drawn from is displayed in each plot title.
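The binning, smoothing, and normalization steps described above can be sketched as follows. The function name, arena bounds, and input conventions are assumptions for illustration; only the 40-bin grid, the 1-bin Gaussian, and the 95th-percentile normalization come from the caption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ratemap(x, y, events, n_bins=40, sigma=1.0, arena=(0.0, 1.0)):
    """Smoothed, normalized ratemap: bin position 40x40, average the
    events per bin, smooth with a Gaussian (sd = 1 bin), and divide
    by the 95th percentile."""
    edges = np.linspace(arena[0], arena[1], n_bins + 1)
    occ, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    evt, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=events)
    with np.errstate(invalid="ignore", divide="ignore"):
        rate = np.where(occ > 0, evt / occ, 0.0)  # average events per bin
    rate = gaussian_filter(rate, sigma=sigma)     # 2D Gaussian, sd = 1 bin
    p95 = np.percentile(rate, 95)
    return rate / p95 if p95 > 0 else rate
```

Normalizing by the 95th percentile rather than the maximum makes the color scale robust to a single outlier bin.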

Resulting classification from the GLM using CSBonf, mSRRMaxT and SR, compared to the grid cell classification by Zong et al. (2022). This means that CSBonf classifies, for example, 2 + 0 + 20 + 17 + 33 + 15 + 2 + 0 = 89 cells as tuned to head direction, 17 + 20 + 16 + 31 + 37 + 40 + 33 + 15 = 209 as tuned to position, 33 + 20 = 53 as tuned to position, head direction and speed, and that 37 + 40 + 33 + 15 = 125 of the 147 grid cells are classified as position-tuned. In general CSBonf classifies more cells than the other two methods, which yield similar results.

Illustration of CV scheme with blocking and skipping. Each box is a block of consecutive time bins. When a fold is used as the test set, the neighboring blocks are excluded from the training set. Note that there are 8 folds in the illustration, compared to 10 or 20 in our analyses.
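The blocking-and-skipping scheme in the illustration can be sketched as a fold generator. The round-robin assignment of blocks to folds is an assumption made for this sketch; the caption specifies only the blocking and the exclusion of neighboring blocks:

```python
import numpy as np

def blocked_folds(n_time_bins, n_folds, block_size, skip=True):
    """Yield (train_idx, test_idx) pairs of time-bin indices.
    Consecutive time bins are grouped into blocks; blocks are assigned
    to folds in round-robin order (an illustrative choice). If skip=True,
    blocks adjacent to a test block are dropped from the training set."""
    blocks = [np.arange(s, min(s + block_size, n_time_bins))
              for s in range(0, n_time_bins, block_size)]
    fold_of = np.arange(len(blocks)) % n_folds
    for k in range(n_folds):
        test = [i for i, f in enumerate(fold_of) if f == k]
        excluded = set(test)
        if skip:
            for i in test:
                excluded.update({i - 1, i + 1})
        train = [i for i in range(len(blocks)) if i not in excluded]
        yield (np.concatenate([blocks[i] for i in train]),
               np.concatenate([blocks[i] for i in test]))
```

Skipping the neighboring blocks keeps autocorrelated time bins from appearing on both sides of the train/test split, at the cost of discarding some training data.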

Illustration of the cyclical shift permutation.
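A minimal sketch of how a cyclical shift permutation test could be implemented. The function name, the test statistic, and the add-one p-value correction are illustrative choices, not details from the figure:

```python
import numpy as np

def cyclic_shift_pvalue(stat_fn, response, covariates,
                        n_shifts=1000, min_shift=1, rng=None):
    """Permutation p-value via cyclical shifts: the response is rotated
    by a random offset, preserving its autocorrelation structure while
    breaking its alignment with the covariates."""
    rng = np.random.default_rng(rng)
    n = len(response)
    observed = stat_fn(response, covariates)
    shifts = rng.integers(min_shift, n - min_shift + 1, size=n_shifts)
    null = np.array([stat_fn(np.roll(response, s), covariates)
                     for s in shifts])
    # The add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (n_shifts + 1)
```

Because only the alignment is destroyed, the null distribution of the statistic reflects the temporal dependence within each series, unlike a full random permutation.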

Illustration of how multiple comparisons result in too small p-values when not accounted for (A), and potential solutions using the Bonferroni (B) and maxT (C) correction methods.
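The two corrections can be sketched generically. This is a standard-textbook implementation under the assumption that null statistics for the maxT method come from permutations, arranged with one row per permutation and one column per test:

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni correction: multiply each p-value by the number of
    tests, capped at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * len(pvals), 1.0)

def maxT_pvalues(observed, null_stats):
    """maxT correction: compare each observed statistic against the
    null distribution of the MAXIMUM statistic across all tests.
    null_stats has shape (n_permutations, n_tests)."""
    max_null = null_stats.max(axis=1)  # max over tests, per permutation
    return np.array([(1 + np.sum(max_null >= t)) / (len(max_null) + 1)
                     for t in observed])
```

Bonferroni controls the familywise error rate without any distributional information, while maxT exploits the joint null distribution and is therefore less conservative when the tests are correlated.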

Comparison of real and simulated data. A: Animal X-position over time. B: Autocorrelation function of the animal’s X-position. C: X- and Y-position from the real mouse data. D: Simulated X-position. E: Autocorrelation function of the simulated X-position. F: Simulated X- and Y-position.

Illustration of how the event probability is constructed for the simulated data. A: The contribution X and Y have on the probability of an event, pxy. B: The contribution from h, ph. C: The resulting probability as a function of X and Y for one set of simulated (x, y, h). D: A smoothed ratemap of a sample of Bernoulli variables generated from p(x, y, h).

Illustration of calcium data. A: Raster of cell activity. B: Head direction of the animal over time. C: Animal position over time. D: Ratemap of the cell activity over position for a single neuron.