It is problematic to naively apply statistical methods when their assumptions are not met. A: Empirical distribution of the Pearson correlation coefficient between two independent samples of 200 data points each, repeated 10,000 times. Within each sample, the data are either independent, weakly autocorrelated, or strongly autocorrelated. When the samples have temporal autocorrelation, the empirical distribution of the correlation coefficient widens beyond the theoretical distribution for independent and identically distributed (IID) data. B: Corresponding distributions of p-values obtained under the assumption that the data are independent, showing that this mistake inflates the number of small p-values, yielding false positives (type I errors) at a rate beyond the significance level set by the researcher. C: Empirical distribution of the deviance of a Poisson GLM comparing two nested models, where one includes an irrelevant covariate (x-axis cut at 30). 200 data points are generated for each covariate and for the response variable, and this is repeated 10,000 times. The covariates are either IID or autocorrelated, and we compare cases in which another relevant covariate is or is not missing from the model. D: Corresponding distributions of p-values obtained from likelihood-ratio tests comparing the nested models. When a relevant covariate is missing, the number of small p-values, and hence the rate of false positives, is inflated. E: Type I error rates (how often an overly complex model attains the best CV score) for different CV schemes, using different choices of block size and whether or not to skip blocks, including a scheme where folds are drawn completely at random, ignoring temporal structure. Failure to properly account for temporal dependencies greatly increases the probability that the wrong model is selected. Data are generated as in Results on simulated data, simulating 300 spike trains and 3 covariates. The bars show 95% Clopper-Pearson confidence intervals for the proportion of incorrect conclusions under each CV scheme. F: Same scenario as in E, except that here a combination of cross-validation and the Wilcoxon signed-rank test is used. The significance level is set to α = 0.05, meaning this is the type I error rate we expect to observe. With blocks that are too small the false-positive rate is inflated, while with larger blocks the method appears overly conservative. G: Expected experimentwise type I error rates when performing multiple independent tests, illustrating the need for correction. For a hypothesis test that on its own has type I error rate α, the experimentwise type I error rate grows as 1 − (1 − α)^n, where n is the number of tests.
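
The simulation behind panels A and B can be sketched in a few lines. The caption does not specify the autocorrelation model, so the sketch below assumes AR(1) noise, with coefficients chosen to represent the IID, weak, and strong cases, and uses scipy's Pearson test, whose p-value is derived under the IID assumption.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def ar1(n, phi):
    """AR(1) series x_t = phi * x_{t-1} + Gaussian noise (assumed model)."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

n, n_rep, alpha = 200, 2_000, 0.05   # n_rep reduced from the caption's 10,000 for speed
for phi in (0.0, 0.5, 0.9):          # illustrative IID / weak / strong settings
    # p-values from the IID-based Pearson test applied to independent series
    pvals = np.array([pearsonr(ar1(n, phi), ar1(n, phi))[1] for _ in range(n_rep)])
    print(f"phi={phi}: false-positive rate = {np.mean(pvals < alpha):.3f}")
```

For phi = 0 the false-positive rate stays near the nominal 0.05, while autocorrelation pushes it well above, matching the widened distributions in panel A.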
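
Panels C and D compare nested Poisson GLMs with a likelihood-ratio test. Below is a minimal sketch of a single repetition, assuming AR(1) covariates and an illustrative effect size for the missing covariate (neither is specified in the caption):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 200

def ar1(n, phi=0.9):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

z = ar1(n)                         # relevant covariate, omitted from both models
x = ar1(n)                         # irrelevant covariate under test
y = rng.poisson(np.exp(0.5 * z))   # response depends on z only (0.5 is illustrative)

null = sm.GLM(y, np.ones((n, 1)), family=sm.families.Poisson()).fit()
full = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

lr = 2 * (full.llf - null.llf)     # LR statistic, nominally chi-square with df = 1
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, df=1):.3f}")
```

Repeating this many times and comparing the empirical LR distribution to the χ²(1) reference reproduces the inflation shown in panels C and D.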
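
For panel E, the blocked CV schemes partition the time series into contiguous blocks and assign whole blocks to folds, optionally skipping blocks around each test block so that temporal dependence leaks less information into training. The helper below is an illustrative fold constructor, not the authors' implementation; the interleaved fold assignment and the neighbour-skipping rule are assumptions.

```python
import numpy as np

def blocked_folds(n, block_size, n_folds, skip_adjacent=False):
    """Yield (train_idx, test_idx) over contiguous blocks of time points."""
    blocks = [np.arange(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    for k in range(n_folds):
        test = [i for i in range(len(blocks)) if i % n_folds == k]
        banned = set(test)
        if skip_adjacent:  # drop blocks neighbouring each test block from training
            banned |= {i - 1 for i in test} | {i + 1 for i in test}
        train_idx = np.concatenate([b for i, b in enumerate(blocks) if i not in banned])
        test_idx = np.concatenate([blocks[i] for i in test])
        yield train_idx, test_idx

# e.g. 300 time points, blocks of 30, 5 folds, skipping neighbouring blocks
for train_idx, test_idx in blocked_folds(300, 30, 5, skip_adjacent=True):
    print(len(train_idx), len(test_idx))
```

Drawing fold members completely at random instead, as in the worst-performing scheme of panel E, amounts to replacing the block assignment with a per-point random permutation.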
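
Panel F pairs cross-validation with a Wilcoxon signed-rank test on the per-fold scores of the two models. A sketch of that comparison step, with made-up per-fold log-likelihood scores standing in for the output of a blocked CV run:

```python
import numpy as np
from scipy.stats import wilcoxon

# Made-up per-fold test log-likelihoods; in panel F these would come from
# blocked CV runs of the simple and the overly complex model.
scores_simple  = np.array([-512.3, -498.7, -505.1, -520.9, -510.2, -507.8])
scores_complex = np.array([-511.8, -499.0, -504.6, -521.3, -509.5, -508.1])

# One-sided test: accept the complex model only if its per-fold scores are
# significantly higher (alternative="greater" tests complex - simple > 0).
stat, p = wilcoxon(scores_complex, scores_simple, alternative="greater")
print(f"p = {p:.3f}, prefer complex model: {p < 0.05}")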
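
Panel G's curve follows directly from independence: if each of n tests falsely rejects with probability α, the probability that at least one rejects is 1 − (1 − α)^n, which can be tabulated directly:

```python
# Experimentwise type I error for n independent tests at level alpha.
alpha = 0.05
for n in (1, 5, 10, 20, 50):
    print(f"n={n:2d}: P(at least one false positive) = {1 - (1 - alpha)**n:.3f}")
```

Already at n = 10 the experimentwise rate exceeds 0.40, which is why a correction for multiple comparisons is needed.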