## Peer review process

**Not revised:** This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

## Editors

- Reviewing Editor: Joshua Gold, University of Pennsylvania, Philadelphia, United States of America
- Senior Editor: Joshua Gold, University of Pennsylvania, Philadelphia, United States of America

**Reviewer #1 (Public Review):**

Summary:

This paper describes a comparison of different statistical methods for model comparison and covariate selection in neural encoding models. It shows in particular that issues arising from temporal autocorrelation and missing variables can lead to statistical tests with substantially higher false positive rates than expected from theory. The paper proposes methods for overcoming these problems, in particular cross-validation with cyclical shift permutation tests. The results are timely, important, and likely to have a broad impact. In particular, the paper shows that cell tuning classification can vary dramatically with the testing procedure, which is an important lesson for the field as a whole.

Strengths:

- Novel and important comparison of different methods for variable selection in nested models.

Weaknesses:

- Does not (yet) examine effect sizes.

- Does not motivate/explain key methods clearly enough in the main text.

General Comments:

1. My first general comment is that the paper in its current form focuses on the "null hypothesis significance testing" (NHST) paradigm. That is, it is focused on binary tests about what variables to include (or not include) in a regression model, and the false-positive rates of such tests. However, the broader statistics community has recently seen a shift away from NHST and towards a statistical reporting paradigm focused on effect sizes. See for example:

- "Scientists rise up against statistical significance". Nature, March 2019.

- Moving to a World Beyond "p < 0.05". RL Wasserstein, AL Schirm, NA Lazar. The American Statistician, 2019.

In light of this shift, I think the paper would be substantially strengthened if the authors could add a description of effect sizes for the statistical procedures they consider. For example, in cases where a procedure selects the wrong model (e.g., by selecting a variable that should not be included), how large is the inferred regression weight, and/or how large is the improvement in prediction performance (e.g., test log-likelihood) from including the erroneous regressor? How strong is the position tuning ascribed to an MEC cell that is inappropriately classified as having position tuning under one of the sub-optimal procedures? (Figure 7 shows some example place maps, but it would be nice to see a more thorough and rigorous analysis.)

My suspicion would be that even when the hypothesis test gives a false positive, the effect sizes tend to remain small... but it is certainly possible that I'm mistaken, or that inferred effect sizes are more accurate for some procedures than others.
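Concretely, the effect-size readout I have in mind could be as simple as the held-out log-likelihood gain per sample from the spurious regressor. A minimal sketch with toy Gaussian data (hypothetical numbers, not the paper's GLMs) fits nested models with and without an unrelated covariate and reports both the inferred weight and the test log-likelihood improvement:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)          # candidate regressor, truly unrelated to y
y = rng.normal(size=n)          # response

# split into train/test halves
xtr, xte, ytr, yte = x[:n // 2], x[n // 2:], y[:n // 2], y[n // 2:]

def gauss_loglik(y, mu, sigma2):
    """Mean Gaussian log-likelihood per sample."""
    return -0.5 * np.mean(np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

# null model: intercept only
mu0, s0 = ytr.mean(), ytr.var()
ll0 = gauss_loglik(yte, mu0, s0)

# full model: intercept + x, fit by least squares
X = np.column_stack([np.ones(n // 2), xtr])
beta = np.linalg.lstsq(X, ytr, rcond=None)[0]
s1 = (ytr - X @ beta).var()
Xte = np.column_stack([np.ones(n // 2), xte])
ll1 = gauss_loglik(yte, Xte @ beta, s1)

delta = ll1 - ll0   # effect size: held-out log-likelihood gain per sample
print(f"weight on spurious regressor: {beta[1]:+.3f}")
print(f"log-likelihood gain per sample: {delta:+.4f}")
```

Reporting a quantity like `delta` alongside each hypothesis test would show directly whether false positives come with small or large apparent effects.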

2. My only other major criticism relates to clarity and readability: in particular, the various procedures discussed in the paper ("forward selection", "maxT correction", "permutation test with cyclic shifts") are not clearly explained in the main paper, but are relegated to the Methods. Although I think it is useful to keep many of the mathematical details in the methods section, it would benefit the reader to have a general and intuitive explanation of the key methods within the flow of the main paper. The first paragraph of the Results section is particularly underdeveloped and hard to read and could benefit from a substantial revision to introduce and motivate the terms and procedures more clearly. I would recommend moving much of the text from the Methods into the Results section, or at the very least adding a paragraph describing the general idea/motivation for each method in Results.
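For orientation, the core of a cyclic shift permutation test fits in a few lines; even a sketch of this kind in the Results (illustrative only, not the authors' implementation) would help readers:

```python
import numpy as np

rng = np.random.default_rng(1)

def cyclic_shift_pvalue(stat_fn, signal, covariate, n_perm=500, rng=rng):
    """P-value for stat_fn(signal, covariate) against a null built by
    cyclically shifting the signal relative to the covariate.  Shifting
    preserves the signal's autocorrelation structure while breaking its
    temporal alignment with the covariate."""
    observed = stat_fn(signal, covariate)
    n = len(signal)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shift = rng.integers(1, n)                  # random nonzero offset
        null[i] = stat_fn(np.roll(signal, shift), covariate)
    # permutation p-value with the standard +1 correction
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def smooth(x, k=20):
    """Boxcar smoothing, to give toy signals realistic autocorrelation."""
    return np.convolve(x, np.ones(k) / k, mode="same")

# toy example: an autocorrelated signal that genuinely tracks the covariate
cov = smooth(rng.normal(size=2000))
sig = cov + 0.5 * smooth(rng.normal(size=2000))
p = cyclic_shift_pvalue(lambda s, c: np.corrcoef(s, c)[0, 1], sig, cov)
print(f"p = {p:.3f}")
```

A short intuitive description like the docstring above, placed in the Results, would carry most of the explanatory load without duplicating the mathematical detail in the Methods.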

**Reviewer #2 (Public Review):**

This paper considers methods for statistical analysis of autocorrelated neural recording time series: an important question for neuroscience that is underappreciated in the community. The paper makes a valuable contribution to this topic by comparing methods based on cross-validation and cyclic shift on simulated grid-cell data. My main suggestions concern clarity, which would greatly benefit from a more didactic approach: explaining the compared methods in the main text and providing more explanatory figures. But there are also some additional analyses that would strengthen the paper.

There are two ways to build support for the validity of a statistical method: by mathematically proving that it is valid, or by empirically verifying it with simulated data where the correct answer is known. A mathematical proof removes all doubt as to validity, but empirical validation can still be useful even without a proof, as it demonstrates that the method works in at least some circumstances. For empirical validation to be most convincing, it helps to also show some situations where the method doesn't work, ideally by varying a continuous parameter that reliably moves the simulation from a situation where the method works to one where it doesn't. If the method fails only in extremely unrealistic cases, this builds confidence that it will work on real data.

The main conclusion of this paper's simulations is that the cyclic shift method most often detects valid correlations, while still not exceeding the false positive rate expected for a valid test. Readers may take this paper as indicating that the circular shift method is safe in all circumstances, but this is not correct. The authors acknowledge that circular shift can sometimes be invalid, and have made modifications to mitigate the problem. But there is neither a mathematical proof that these mitigations work, nor an analysis of the circumstances under which they succeed and fail. I doubt a formal proof is possible since there are likely situations in which even the new methods give false positive results. So the authors should include an empirical test of their modified circular shift method as compared to plain circular shift in various simulations. To gain confidence in the new method it is important to characterize the situations where both methods succeed; where the new method succeeds but traditional cyclic shift gives false positive errors; and situations in which both fail. If situations where the new method fails are so unrealistic that they would never occur in real data, we can have better confidence in the method.
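A parametric stress test of this kind is cheap to set up. The following sketch (arbitrary settings of my own choosing, and deliberately using only a naive correlation test rather than the authors' methods) sweeps the AR(1) coefficient of two independent series and shows the naive test's false-positive rate inflating as autocorrelation grows; the same sweep applied to plain versus modified cyclic shift would map out where each succeeds and fails:

```python
import numpy as np

rng = np.random.default_rng(2)

def ar1(n, phi, rng):
    """AR(1) series x_t = phi * x_{t-1} + noise."""
    x = np.zeros(n)
    eps = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

def naive_fpr(phi, n=500, trials=200, rng=rng):
    """False-positive rate of a correlation test that (wrongly) assumes
    independent samples, applied to two independent AR(1) series."""
    hits = 0
    for _ in range(trials):
        x, y = ar1(n, phi, rng), ar1(n, phi, rng)
        r = np.corrcoef(x, y)[0, 1]
        # t-statistic with the nominal alpha = 0.05 two-sided threshold
        t = r * np.sqrt((n - 2) / (1 - r ** 2))
        hits += abs(t) > 1.96
    return hits / trials

for phi in (0.0, 0.5, 0.9, 0.99):
    print(f"phi = {phi:4}: naive FPR = {naive_fpr(phi):.2f}")
```

The continuous knob `phi` is exactly the kind of parameter that can move a simulation from a regime where a method is valid to one where it is not.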

The main contributions of the paper are the modifications to circular shifting and cross-validation that avoid problems of temporal contiguity, yet these are only described in the Methods section. Since this is a methods paper, the description of the new methods should appear in the main text, including the explanatory figures currently in the Methods.

The introduction presents two problems that can occur in neural data: autocorrelation and omitted variables. However, it is not clear that the current methods help with the problem of omitted variables. In fact, I don't see how any analysis method could solve it. If an experimenter observes a correlation between X and Y, there is no way to know this isn't because a third variable Z correlates with X and influences Y, without any effect of X on Y. It is generally impossible to prove causation without randomized manipulations of one variable; although some methods claim to infer causality by observing all variables that could possibly have a causal effect, this is unlikely to be feasible in neuroscience. In any case, the problem of omitted variables seems irrelevant to the current study and could be removed.
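The confound is easy to demonstrate in a toy simulation (hypothetical numbers of my own): an unobserved Z drives both X and Y, and a regression of Y on X alone recovers a confidently nonzero weight even though X has no effect on Y.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)             # unobserved confounder
x = z + 0.5 * rng.normal(size=n)   # X is driven by Z...
y = z + 0.5 * rng.normal(size=n)   # ...and so is Y; X never enters y

# regress y on x alone (intercept + slope), omitting z
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"slope on x, z omitted:  {beta[1]:+.2f}")   # large, despite no x -> y link

# including z drives the x coefficient toward zero
Xz = np.column_stack([np.ones(n), x, z])
beta_z = np.linalg.lstsq(Xz, y, rcond=None)[0]
print(f"slope on x, z included: {beta_z[1]:+.2f}")
```

No resampling scheme applied to x and y alone can distinguish these two situations, which is why the problem seems out of scope for the present methods.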

The list of analysis methods mentioned in the first paragraph of the introduction (e.g., TDA, LVM) seems irrelevant: it is not clear how the methods evaluated here would be used to assess the significance of those methods. Better to stick to a description of how correlations are difficult to detect in autocorrelated signals, which is what the current methods address.

**Reviewer #3 (Public Review):**

Summary:

The authors consider various statistical testing frameworks for model selection in the context of neuronal tuning. They consider cross-validation as a baseline scheme, and show various corrections and modifications to existing cross-validation schemes, together with the underlying data/sign shuffling procedures for finding null distributions. Through careful simulations, they show that some of these tests are, as expected, too conservative or too optimistic, and that a log-likelihood-based test statistic with a cyclic shift permutation test for obtaining the null distribution and Bonferroni correction strikes the right balance between hits and false detections. They further apply these tests to calcium imaging data from the mouse entorhinal cortex to identify grid cells (i.e., cells for which position is selected as a relevant variable).

Strengths:

The paper is very well written, easy to follow, and enjoyable to read. It addresses an important issue in modern neuroscience: drawing conclusions from data with missing or unaccounted-for auto-correlated covariates.

Weaknesses:

The paper would benefit from a more rigorous theoretical justification of why some of the procedures examined here outperform the others. This could be done in a stylized example with a Gaussian linear model, for which some of the statistics used have well-known distributions.

Comparisons with false discovery rate (FDR) control, as a more appropriate measure of performance when dealing with many comparisons, would strengthen the existing comparisons, which rely solely on Bonferroni correction.

Including spiking history in the generalized linear models (GLMs) used in analyzing the mouse data could be beneficial, as existing literature points to the importance of spiking history as a relevant covariate.
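Adding spike-history covariates is mechanically simple: append lagged copies of the spike train as extra columns of the GLM design matrix. A minimal sketch (hypothetical lag count and toy Poisson data, not tied to the paper's models):

```python
import numpy as np

def add_spike_history(X, spikes, n_lags=5):
    """Append columns of lagged spike counts to a design matrix X.
    Column j of the history block holds the spike count at lag j+1,
    zero-padded at the start of the recording."""
    n = len(spikes)
    H = np.zeros((n, n_lags))
    for j in range(1, n_lags + 1):
        H[j:, j - 1] = spikes[:-j]
    return np.hstack([X, H])

rng = np.random.default_rng(4)
spikes = rng.poisson(0.2, size=100)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + covariate
Xh = add_spike_history(X, spikes, n_lags=5)
print(Xh.shape)  # (100, 7)
```

Because the history block would then compete with position and other covariates in the model-selection procedure, it could change which cells the tests classify as position-tuned.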