Introduction

Scientists frequently look for correlations between variables to identify potentially important relationships, or to support conceptual models. In disciplines with a focus on dynamics such as ecology and physiology, it is common to measure correlations between time series. Many important biological processes are nonstationary, meaning that their statistical properties (e.g. mean or variance) change systematically across time. Such processes range from the spread of an invasive species to a cell’s response to a new environmental stress.

Interpreting a correlation between a pair of nonstationary time series can be highly fraught because it is easy to obtain a seemingly impressive correlation between two time series that have no meaningful relationship [1]. For example, the sizes of any two exponentially-growing populations will be correlated over time due to a shared growth law, even if the populations lack any interaction or shared influences. To avoid overinterpreting spurious correlations, it helps to distinguish between the concepts of “correlation” and “dependence”. In time series research, the term “correlation” is often used procedurally [2, 3, 4]. Similarly, here we define a correlation function (ρ) to be any function that takes two time series and produces a statistic, although it is usually interpreted as a measure of similarity or relatedness.
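To make the exponential-growth example concrete, the following minimal sketch (ours, not from any study cited here; the growth rates and noise levels are arbitrary choices) shows two independent exponentially growing series yielding a large Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
t = np.arange(100)

# Two "populations" that grow exponentially but share no interaction
# or common influence beyond each following a growth law.
pop_a = 10 * np.exp(0.05 * t) * rng.lognormal(0, 0.1, size=t.size)
pop_b = 3 * np.exp(0.08 * t) * rng.lognormal(0, 0.1, size=t.size)

r, p = pearsonr(pop_a, pop_b)
print(f"Pearson r = {r:.3f}, nominal p = {p:.1e}")  # r is typically near 1
```

The nominal p-value is tiny despite the complete absence of dependence, which is precisely the trap described above.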

In contrast to correlation, “dependence” is a hypothesis about the relationship between variables, and it has clearer scientific implications. Two events A and B are called dependent if the probability that they both occur, P(A, B), differs from the product of their individual probabilities P(A)P(B). Similarly, two temporal processes are dependent if the probability of observing any particular pair of trajectories differs from the product of the probabilities of individual trajectories. (The formal definition of dependence extends this idea to continuous variables, where discrete probabilities may be replaced by probability densities [5, 6].) The importance of dependence relationships can be attributed to Reichenbach’s common cause principle, which states that if two variables are dependent, then either they share a common causal influence, or one variable causally influences the other (possibly indirectly) [7, 8]. Thus, it is useful to first test whether an observed correlation indicates dependence before pursuing specific mechanistic explanations.

There are a handful of systematic approaches to testing for dependence between nonstationary time series. However, most are fairly limited in their scope. First, when possible, a nonstationary time series can be transformed to become stationary, meaning that its statistical properties do not change systematically across time. This enables access to a wide arsenal of tests applicable to stationary data. Transformations include subtracting a trend, taking the derivative (more precisely, “differencing” between neighboring points), or choosing a stationary-seeming window of time [9, 10]. However, it is easy to see potential pathologies in each of these: Taking the derivative of an exponential curve just produces another exponential curve; subtracting a fitted linear trend from the random walk X(t) = X(t − 1) + ϵ(t) (where ϵ(t) is random noise) does not make it stationary [9]; similarly, searching for a stationary window of the same random walk process is futile since the variance of X(t) increases with each step. In a second approach, one may compare the observed correlation with correlations obtained when one time series is replaced by a synthetic replicate, where synthetic replicates are generated in a way that reflects the original nonstationary process [11]. However, this requires a correct model of how the statistical properties of the process evolve over time. Lastly, there are tests within the econometrics literature that can provide evidence for dependence between random-walk-like nonstationary processes by detecting a property called cointegration, but cointegration only occurs when some linear combination of the time series is stationary [12].
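The random-walk pathologies just described can be checked directly. Below is a small sketch (ours) that differences and linearly detrends a simulated random walk, then applies the KPSS stationarity test used later in Methods; exact p-values vary by seed and are capped by the test's lookup table:

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=500))  # X(t) = X(t-1) + eps(t)

diffed = np.diff(walk)  # differencing removes the unit root
t = np.arange(walk.size)
detrended = walk - np.polyval(np.polyfit(t, walk, 1), t)  # subtracting a fitted line does not

print(kpss(diffed, regression="c", nlags="auto")[1])     # typically large p: consistent with stationarity
print(kpss(detrended, regression="c", nlags="auto")[1])  # typically small p: still nonstationary
```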

Alternatively, an approach based on permutations is possible when there are multiple independent and identically distributed (iid) replicates (or trials; we use “replicates” and “trials” interchangeably). In this case, one may evaluate the significance of a within-replicate correlation by comparing it to between-replicate correlations. That is, if Xi and Yi are time series of variables X and Y from replicate i, the correlation of (Xi, Yi) may be compared to the correlation of (Xi, Yj) for j ≠ i. If the two variables are dependent, within-replicate correlations should tend to be stronger than between-replicate correlations. This approach is sometimes called “inter-subject surrogates” [13, 14].

The inter-subject surrogate approach can be used to test for dependence in each trial separately [15]. In this case, a simple nonparametric test of dependence between, for instance, X1 and Y1 can be performed by computing the correlation ρ(X1, Yj) for j = 1, …, n and writing down a p-value as the proportion of these correlations that are greater than or equal to ρ(X1, Y1) [16, 17]. For this approach, at least n = 20 trials are needed if one wishes to possibly obtain a p-value of 0.05 or below. This procedure checks for dependence in trial 1 specifically; assessing dependence in a different trial via this method would involve an analogous test.
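As a concrete sketch of this single-trial procedure (the function and variable names are ours, and ρ is any correlation function the analyst chooses):

```python
import numpy as np

def single_trial_surrogate_test(x1, ys, rho):
    """Test for dependence in trial 1 by comparing rho(x1, ys[0]) with
    the inter-subject surrogate correlations rho(x1, ys[j]).
    ys[0] must be the Y time series from the same trial as x1."""
    corrs = np.array([rho(x1, y) for y in ys])
    # p-value: proportion of the n correlations (including trial 1's own)
    # that are at least as large as the within-trial correlation.
    return np.mean(corrs >= corrs[0])
```

With n trials, the smallest achievable p-value is 1/n, which is why at least 20 trials are needed to reach 0.05 by this route.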

To reduce the required number of trials below 20, a straightforward strategy is to perform a single permutation test (Fig 1A) using the data from all replicates [18, 19]. As an example with n = 4 trials, the test procedure begins by computing the mean within-trial correlation:

ρorig = [ρ(X1, Y1) + ρ(X2, Y2) + ρ(X3, Y3) + ρ(X4, Y4)]/4.
Next, as a null model, recompute the mean correlation after permuting the Y time series while holding the X time series in the original order. This is the same as permuting the Xs and maintaining the order of the Ys. For instance, one such permutation might be:

ρshfl = [ρ(X1, Y2) + ρ(X2, Y3) + ρ(X3, Y4) + ρ(X4, Y1)]/4.
A p-value can be calculated as the proportion of permutations (including the original ordering) that produce a mean correlation that is as strong as, or stronger than, ρorig (Fig 1A). One may then detect a correlation at the significance level α if pperm ≤ α. We emphasize that these permutations are obtained by swapping trials, not by swapping time points. This test has been used to detect correlations between time series in neuroscience and psychology settings [20, 18, 21]. It has also been used to detect correlations between variables measured in brain images, which can have similar nonstationarity challenges as time series [22]. A noteworthy advantage of this approach is that it is valid (meaning that the false positive rate is guaranteed to not exceed α) under very mild assumptions. Namely, the test is valid if the Xi trials are exchangeable with one another (i.e. all permutations of the sequence X1, …, Xn have the same joint probability distribution), or if the Yi trials are exchangeable (see Corollary 4 in Appendix 1).
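A minimal implementation sketch of this pooled permutation test (ours; the original analysis code is in S1 Code) could look like:

```python
import math
import numpy as np
from itertools import permutations

def permutation_test(xs, ys, rho):
    """Trial-swapping permutation test (Fig 1A). xs and ys are lists of n
    time series; trial i of X is originally paired with trial i of Y.
    Swaps whole trials, never time points within a trial."""
    n = len(xs)
    def mean_corr(pairing):
        return np.mean([rho(xs[i], ys[j]) for i, j in enumerate(pairing)])
    rho_orig = mean_corr(range(n))
    # N counts permutations (including the identity) whose average
    # correlation is at least as large as the original pairing's.
    big_n = sum(mean_corr(perm) >= rho_orig for perm in permutations(range(n)))
    return big_n / math.factorial(n)
```

Enumerating all n! permutations is cheap for the small n considered here (n! = 120 for n = 5).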

The permutation test, perfect match concept, and permute-match tests. (A) The permutation test. Circles indicate the average correlation obtained with (grey) or without (pink) trial swapping. Here, N is the number of average correlations (including the original correlation) that are equal to or larger than the original correlation. pperm is then N/n! because n! is the total number of average correlations, including the original. (B) Illustration of a Y-perfect match, represented as a table of correlations wherein each diagonal entry is the greatest in its own row. Each line with a ">" or "<" symbol denotes a required ordering relationship between the two numbers at either end of the line. Note that this example has a Y-perfect match, but not an X-perfect match. For an X-perfect match, each diagonal entry would need to be the greatest in its own column. (C) If an X- or Y-perfect match occurs, then the original correlation ρorig is always greater than any shuffled correlation ρshfl. Top: If a Y-perfect match occurs, then each Yi gives the greatest correlation when it is paired with the Xi from the same trial. A shuffled correlation ρshfl pairs Yis from some trials with Xjs from different trials, thereby reducing the correlation. Lines with symbols ">", "<", and "=" denote comparisons between terms. Bottom: If an X-perfect match occurs, a similar argument can be applied. (D) Since a perfect match ensures that the original correlation is greater than any shuffled correlation, a perfect match also ensures that pperm takes on its lowest possible value of 1/n!. (E) The simultaneous permute-match(X;Y) test. (F) The sequential permute-match(XY) test; note that the permute-match(YX) test is defined analogously.

Yet, even the permutation test remains considerably limited by the number of trials. The lowest p-value that can be obtained with n replicates is 1/n! since there are n! possible permutations. For example, with n = 3, the lowest possible p-value is 1/3! ≈ 0.17 and with n = 4, the lowest is 1/4! ≈ 0.04. These minima apply even if the average within-trial correlation is much greater than the average between-trial correlation. While replicate counts of 4 and 5 can produce p-values below the conventional 0.05 threshold, these p-values remain modest. This limitation is particularly problematic when multiple hypothesis testing is required, as the modest p-values may no longer be significant after applying corrections such as the Bonferroni method.

The challenge of limited replicates is a practical reality in many experimental settings due to cost and feasibility concerns [23, 24, 25]. Factors outside the control of investigators further restrict the number of useful replicates. For example, a recently published study of cerebral ischemia in pigs began with 6 animals, but lost 2 due to death and experimental mishap, so complete records exist for only 4 pigs [26]. Similarly, in an animal migration study, authors affixed GPS tags to 13 juvenile cuckoos, but only 5 of them produced GPS records suitable for the study due to heterogeneous behavior among cuckoos [27]. Finally, secondary analysis of public data from earlier studies can be a resource-efficient mode of research [28], and this approach is necessarily limited to the number of replicates within the existing dataset.

In this article, we show that by adding additional steps to the permutation test, we may achieve p-values as low as 1/n^n. For n = 3 or 4 respectively, the lowest possible p-value is then ≈ 0.04 or 0.004. This new result only requires that the replicates be iid. Thus, this modified permutation test allows the data analyst to detect dependence with stronger confidence when there is sufficient evidence to do so. We illustrate this approach using simulations as well as a real-world example involving recordings of zebrafish swimming behavior with only 3 replicates.

Results

The concept of an X-perfect or Y-perfect match

Central to our approach is the concept of an X-perfect match or Y-perfect match. To motivate the ideas, suppose that a group of n graduate students X = (X1, …, Xn) has been paired with a group of n advisors Y = (Y1, …, Yn) so that the ith student (Xi) is paired with the ith advisor (Yi). Moreover, let ρ(Xi, Yj) be a number that measures how well the ith student and the jth advisor get along. We say that the ith student is “happy” if they get along with their own advisor strictly better than they get along with any other advisor, meaning that ρ(Xi, Yi) > ρ(Xi, Yj) for all j ≠ i. Similarly, the ith advisor is “happy” if ρ(Xi, Yi) > ρ(Xj, Yi) for all j ≠ i. The arrangement of student-advisor pairs is X-perfect if all students are happy, and Y-perfect if all advisors are happy.

Analogously, if X = (X1, …, Xn) and Y = (Y1, …, Yn) are collections of time series with n trials each, and if ρ is some correlation function, we say that an X-perfect match has occurred if (and only if) for all X trials,

ρ(Xi, Yi) > ρ(Xi, Yj) for all j ≠ i.
Similarly, a Y-perfect match has occurred if and only if:

ρ(Xi, Yi) > ρ(Xj, Yi) for all j ≠ i.
Throughout the article, we use boldface X or Y to indicate a collection of trials, Xi or Yi to indicate an individual trial, and X or Y to refer to the variable in general.
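In code, checking for perfect matches amounts to inspecting the n × n table of correlations, as in this sketch (ours; names are hypothetical), where table[i][j] holds ρ(Xi, Yj):

```python
import numpy as np

def perfect_match(xs, ys, rho):
    """Return (x_perfect, y_perfect) for the table table[i][j] = rho(X_i, Y_j)."""
    n = len(xs)
    table = np.array([[rho(xs[i], ys[j]) for j in range(n)] for i in range(n)])
    diag = table.diagonal()
    # X-perfect: rho(Xi, Yi) strictly exceeds rho(Xi, Yj) for every j != i,
    # i.e. each diagonal entry is the strict maximum of its row of this table.
    x_perfect = all(diag[i] > np.delete(table[i, :], i).max() for i in range(n))
    # Y-perfect: rho(Xi, Yi) strictly exceeds rho(Xj, Yi) for every j != i,
    # i.e. each diagonal entry is the strict maximum of its column of this table.
    y_perfect = all(diag[i] > np.delete(table[:, i], i).max() for i in range(n))
    return x_perfect, y_perfect
```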

A perfect match (either X-perfect or Y-perfect) provides especially strong evidence to support the hypothesis that X and Y are dependent. With a perfect match, the original correlation ρorig is strictly greater than any of the shuffled correlations, as illustrated in Fig 1C. Thus, a perfect match guarantees pperm = 1/n!, the lowest p-value achievable with the permutation test (Fig 1D). Furthermore, letting P denote probability, we can prove that

P(Y-perfect match) ≤ 1/n^n

under the null hypothesis H0 wherein (1) the Yi trials are iid, (2) the Xi trials are iid, and (3) X and Y are independent (Lemma 6 in Appendix 2). By an analogous argument (Lemma 7 in Appendix 2), under H0 we have P(X-perfect match) ≤ 1/n^n and thus

P(X-perfect match or Y-perfect match) ≤ 2/n^n.    (1)
Permute-match tests: modified permutation tests of dependence

We now describe tests for dependence between nonstationary time series based on the ideas just laid out, which we will call permute-match tests. Suppose that an experiment produces time series X = (X1, …, Xn) and Y = (Y1, …, Yn), where Xi and Yi are time series obtained from the ith replicate, and we want to test whether X and Y are dependent. For example, we may have video recordings of n animals in separate but identically constructed enclosures; for the ith animal, we extract a time series of movement speed (Xi) and a time series of the distance from a light source (Yi). Using these data, we may wish to investigate whether the animals under study tend to move at different speeds depending on their distance from the light source. We introduce two tests: simultaneous permute-match and sequential permute-match. As we will see, the former can be more reproducible while the latter can be more powerful.

In the simultaneous permute-match test (Fig 1E), we check for both an X-perfect and a Y-perfect match. If either the X- or Y-perfect match occurs, we write down a p-value of p = 2/n^n per equation 1. If neither X nor Y achieves a perfect match, then we fall back to the permutation test: p = pperm = N/n!, where N is the number of the n! possible permutations (including the original ordering) that produce an average correlation at least as large as that of the original ordering (Fig 1A).

In the sequential permute-match test, we begin by choosing a variable (either X or Y) for the first perfect match test. For example, suppose we choose X first (in which case we refer to the procedure as the “permute-match(XY)” test). If an X-perfect match is observed, then p = 1/n^n (Fig 1F). If an X-perfect match is not observed, we then check for a Y-perfect match, and write p = 2/n^n if a Y-perfect match is observed. If neither an X-perfect nor a Y-perfect match occurs, then p = pperm. A permute-match(YX) test is defined analogously.
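Putting the pieces together, here is a sketch of the two tests (ours, reusing the hypothetical perfect_match and permutation_test helpers sketched earlier):

```python
def sequential_permute_match(xs, ys, rho, first="X"):
    """Sequential permute-match test (Fig 1F): check one variable's perfect
    match, then the other's, then fall back to the permutation test."""
    n = len(xs)
    x_perfect, y_perfect = perfect_match(xs, ys, rho)
    hits = (x_perfect, y_perfect) if first == "X" else (y_perfect, x_perfect)
    if hits[0]:
        return 1 / n**n
    if hits[1]:
        return 2 / n**n
    return permutation_test(xs, ys, rho)

def simultaneous_permute_match(xs, ys, rho):
    """Simultaneous permute-match(X;Y) test (Fig 1E)."""
    n = len(xs)
    if any(perfect_match(xs, ys, rho)):
        return 2 / n**n
    return permutation_test(xs, ys, rho)
```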

Under H0 (i.e. the Yi replicates are iid, the Xi replicates are iid, and X and Y are independent), all the above tests are valid, meaning that for any significance level α, we have P(p ≤ α) ≤ α. This fact is proven as Theorems 10-12 in Appendix 2.

The permute-match tests (Fig 1E-F) are “upgrades” to the permutation test (Fig 1A) because permute-match tests always report the same or a lower p-value (since 1/n^n ≤ 2/n^n ≤ 1/n! for all integers n ≥ 2; see Lemma 9 in Appendix 2). In exchange for lower p-values, the permute-match tests have a slightly more restrictive data requirement than the permutation test. The permutation test is valid either if the Xi trials are exchangeable, or if the Yi trials are exchangeable (Corollary 4 in Appendix 1). The permute-match tests require that all Xi trials are iid and that all Yi trials are iid. If trials are iid, then they are necessarily exchangeable. However, exchangeable trials are not always iid. For example, the first five cards drawn from a shuffled deck are exchangeable because all possible orderings of cards are equally likely, but they are not iid because if the first card is the ace of hearts, the second card cannot be the ace of hearts. In practice, biological experimental designs typically use replicates that are both exchangeable and iid, so the more restrictive requirement of the permute-match test is often inconsequential.

The different tests - permutation test, simultaneous permute-match test, and sequential permute-match tests - are valid tests of dependence given iid replicates (Theorems 10-12 in Appendix 2). In other words, they all guarantee false positive rates at or below the significance level. Two subtle aspects of the permute-match tests are worth mentioning here. First, no multiple testing correction is applied to account for the fact that both match tests and a permutation test are used. In other words, although the simultaneous permute-match(X;Y) test assigns p = 2/n^n instead of p = 1/n^n when an X- or Y-perfect match occurs as a Bonferroni-type correction, no such correction is needed when combining the match test and the permutation test (Fig 1E). This is because the permutation test and perfect match events are nested. That is, an X- or Y-perfect match is possible only if pperm is already at its lowest possible value of 1/n! (Fig 1C-D). Thus, unlike combining un-nested tests where each can separately contribute false positives, we can safely combine the checks for perfect matches with the permutation test. The second subtlety concerns the sequential permute-match tests. These tests use the lower p-value of 1/n^n if the first perfect match occurs, but use the more conservative “Bonferroni-corrected” p-value of 2/n^n if only the second match occurs (Fig 1F). This superficially resembles the practice of conducting additional tests merely because earlier tests did not result in a detection, a form of p-hacking that generally results in an invalid final p-value. However, our sequential permute-match procedure is actually valid because it combines binary events instead of continuous variables. See Appendix 3 for a detailed discussion of this point.

Permute-match test variants may differ in statistical power

Although the different tests - permutation test, simultaneous permute-match test, and sequential permute-match tests - are all valid in the sense of correctly controlling false positive rates, how do they vary in power, the probability of detecting true dependence? The answer depends on the number of replicates n and the desired significance level α. There are four regimes to consider (Fig 2A).

Statistical power of various permutation test variants as illustrated using a simple nonstationary system. (A) Summary of test power as a function of the significance level α and the number of replicates n. (B, C) System equations and example dynamics. The processes X and Y are given by a linear trend with additive noise on a time grid t = 1, 2, …, 100. The noise terms ϵX,i(t) and ϵY,i(t) are drawn from a bivariate normal distribution with a mean of 0, variance of 1, and covariance of rX,Y. The chart shows an example pair of time series where X and Y are dependent (rX,Y = 0.3). (D) Statistical power of the permutation test and various permute-match tests as a function of the replicate number n, significance level α, and strength of dependence rX,Y. Power was calculated from 5000 simulations at each value of rX,Y between rX,Y = 0 and 0.54 in steps of size 0.01. We chose the Pearson correlation coefficient as our correlation function ρ.

Regime 1: when α ≥ 1/n!, all tests have the same power. If the permutation test detects dependence, permute-match tests will too since they fall back to a permutation test in the “worst case” (Fig 2A, column 1). Conversely, if the permutation test did not detect dependence (i.e. pperm > α), then it follows that pperm > 1/n!, which implies that neither an X-perfect nor Y -perfect match occurred (Fig 1D), and so the permute-match tests cannot have detected dependence either.

Regime 2: when 1/n! > α ≥ 2/n^n, all permute-match tests are tied for highest power (Fig 2A, column 2). In this regime, the permutation test has zero power, since its p-value cannot fall below 1/n!. Conversely, if either an X-perfect or a Y-perfect match occurs, then the sequential permute-match tests as well as the simultaneous permute-match(X;Y) test will all detect dependence.

Regime 3: when 2/n^n > α ≥ 1/n^n, the simultaneous permute-match(X;Y) test has no power, as the significance level lies below the lowest p-value achievable by that test. Only the permute-match(XY) or (YX) tests now have power (Fig 2A, column 3). Note that the permute-match(XY) and permute-match(YX) tests may differ in power, as can be seen in the upper left panel of Fig 2D. We do not currently know how to predict from a given dataset which of these two tests will be more powerful, so the choice of which test to run in this case is arbitrary.

Regime 4: when α < 1/n^n, no test has any power, even when a perfect match occurs.

Figure 2B-D illustrates these different scenarios using a simulated nonstationary system and some common significance cutoffs. We simulated time series of two variables (X and Y) from a small number of replicates (between 3 and 5). Here, the time series were produced from a pair of linear time trends with coupled additive noise (rX,Y determines the strength of coupling; Fig 2B). We chose the Pearson correlation coefficient as our correlation function ρ. The various sub-panels of Fig 2D illustrate the different aforementioned regimes.
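A sketch of one replicate of this simulation (ours; the caption does not state the trend slope, so a unit slope is assumed here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_trial(r_xy, T=100):
    """Linear trends plus bivariate normal noise with covariance r_xy (Fig 2B)."""
    t = np.arange(1, T + 1)
    cov = [[1.0, r_xy], [r_xy, 1.0]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=T)
    return t + eps[:, 0], t + eps[:, 1]  # (X_i, Y_i)

# n = 4 dependent replicates (r_xy = 0.3), tested with Pearson's correlation
trials = [simulate_trial(0.3) for _ in range(4)]
xs, ys = [x for x, _ in trials], [y for _, y in trials]
rho = lambda x, y: np.corrcoef(x, y)[0, 1]
```

Repeating this over many simulated datasets and recording the fraction of runs with p ≤ α estimates the power curves shown in Fig 2D.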

The above example was chosen for ease of presentation - in fact, the example is so simple that it may be correctly treated by just fitting and removing the linear trend. In Appendix 4, we show that the same ideas hold in an example with a more complex nonstationary data-generating process (a logistic map with time-varying parameters) and a nonlinear correlation function (cross-map skill).

An example from animal behavior science

Living in groups is a common experience among animals. For example, at least half of fish species are thought to form groups at some stage of life [29, 30]. The properties of these groups can impart a variety of important fitness effects [31]. In groups of fish, coordinated swimming may facilitate foraging (e.g. by enabling fish to “communicate” the location of food to others) and predator escape (e.g. by confusing predators) [32, 30]. One study [30] used videos of small groups of zebrafish (8 fish per group) to observe that fish swam faster during video segments when they were in a high-polarization state (i.e. when they were directionally aligned) than during segments when they were in a low-polarization state. A correlation between polarization and speed might indicate statistical dependence between the two variables, or it might just be a consequence of temporal trends (e.g. polarization and swimming speed both happen to decrease with time).

Using a publicly available dataset [33] containing fish trajectories from 3 replicate 10-minute videos of small groups of juvenile Danio rerio zebrafish (10 fish per group), we performed a permute-match test of dependence between polarization and fish swimming speed. As we will see, although these quantities are potentially nonstationary, the permute-match test nevertheless detects a statistical dependence between them.

To quantify polarization, we use “circular variance” (vcirc; see Methods), a common measure of angular dispersion [34]. Since high polarization means low dispersion, we quantify polarization as 1 − vcirc, and call this quantity the circular concentration, since it indicates “how concentrated the [circular] data is toward the center” ([35], page 15). Circular concentration is bounded between 0 (low alignment) and 1 (perfect alignment). To quantify speed, we took the “average individual speed”, which is a term we use to denote an average across individuals, not across time. More precisely, the average individual speed of a 10-fish group at time t is (s1,t + s2,t + · · · + s10,t)/10 where si,t is the speed of fish i at time t. Note that average individual speed is distinct from the speed of the group center, which has also been studied in Danio fish [31].

Neither average individual speed nor circular concentration is obviously stationary (Fig 3A). By visual inspection, the average individual speed seems to decrease across time, and all time series are deemed nonstationary by a Kwiatkowski–Phillips–Schmidt–Shin test (p < 0.01), so we would be on shaky ground using methods that require stationarity. Additionally, since only 3 trials are available, the permutation test and the simultaneous permute-match test cannot detect dependence at the 0.05 level, so we perform a sequential permute-match test.

A sequential permute-match test detects dependence between average individual speed and circular concentration in small groups of zebrafish. (A) Time series of average individual speed and circular concentration for the three replicate videos. Black curves show original time series, and red curves show a 30-second moving average. Average individual speed appears to decrease with time in all trials. All 12 time series (2 variables × 3 trials × 2 smoothing conditions) were deemed nonstationary by a Kwiatkowski–Phillips–Schmidt–Shin (KPSS; [37]) test (p < 0.01). Note that the KPSS test seeks to reject a stationary null hypothesis, in contrast to other common tests, whose null hypothesis is nonstationarity. (B) Pearson correlation between circular concentration and average individual speed, both within trials and between trials, using the full 10 minutes of data. Table entries are shaded by correlation. We observe a speed-perfect match (and a circular concentration-perfect match), and thus detect dependence with p = 1/3^3 ≈ 0.04. (C) A parametric test also detects dependence. As a parametric alternative, for each trial we averaged values of speed and circular concentration over the full 10 minutes (e.g. the scatter plot shown, where each point is one trial). The sample Pearson correlation of the time-averaged variables is 0.99996, and a one-tailed test of significance gives p ≈ 0.003. (D) Permute-match tests detected a significant correlation between speed and circular concentration more consistently than the parametric test. We truncated time series to different lengths starting from the first frame, and compared the parametric test with the two possible permute-match tests. Length was varied between 20 seconds and 10 minutes in 20-second increments, with each increment representing 640 frames since the frame rate is 32 frames per second. We used the Pearson correlation (not its magnitude) in the permute-match tests since the alternative hypothesis is that the correlation is positive, not merely nonzero.

We arbitrarily started with average individual speed and identified a “speed”-perfect match, indicating a significant correlation between average individual speed and circular concentration (Fig 3B; p = 1/3^3 ≈ 0.04). Significance does not depend on the choice of which match is tested, since both a “speed”-perfect and a “circular concentration”-perfect match are obtained. Note that although trials are independent of each other, all between-trial correlations from the full dataset are positive (Fig 3B), illustrating the danger in applying naive data analysis methods to data that are autocorrelated or even nonstationary.

To try a parametric alternative, for each trial we averaged values of average individual speed and circular concentration over the full 10 minutes (Fig 3C, where each point is one trial). The sample Pearson correlation of the time-averaged variables is 0.99996, and a one-tailed test of significance (using the function stats.pearsonr from the Python package SciPy [36]) also detects dependence (p ≈ 0.003), consistent with the permute-match tests. This test relies on a null distribution derived under the assumption that the three points in Fig 3C are drawn from a bivariate Gaussian distribution with zero covariance. We view this test as comparable to the permute-match tests because although it requires a parametric assumption, it can be valid even when the time series are nonstationary.

Since data are limited in many scientific applications, we asked whether the results of the sequential permute-match tests and the parametric Pearson correlation test were consistent across different time series lengths. We truncated time series to various lengths (from 20 seconds to 10 minutes), and checked whether each test reported a significant (p ≤ 0.05) correlation (Fig 3D). Both sequential permute-match tests reported a significant correlation at all lengths tested, whereas the parametric test was inconsistent, finding a significant correlation for about one third of lengths.

Discussion

Outside the context of time series, permutation tests have been celebrated because they require only minimal assumptions [38]. To satisfy the requirements of permutation tests, it is sufficient (see Corollary 4 in Appendix 1) to collect data from exchangeable replicates. Data in each trial can be autocorrelated (e.g. temporally or spatially) and nonstationary. No distributional assumptions are required, and the correlation function can even be asymmetric so that ρ(X, Y) ≠ ρ(Y, X) as occurs when correlations are based on techniques that use one time series to predict values of another (e.g. Appendix 4). Such radical freedom from assumptions stands in stark contrast to the situation in the analysis of single-replicate time series, where the practitioner may be forced to make assumptions that can be difficult to verify. For example, most statistical tests of correlation between two time series require stationary data at a minimum [39]. Yet, it is difficult to verify that a single time series is stationary because stationarity is a property of an entire ensemble of time series [39], so that checking for stationarity in a single time series is philosophically analogous to checking for the Gaussianity of a single data point.

However, applying a trial-swapping permutation test to check for significant correlations requires that a sufficient number of trials be available to permute amongst one another. When only small numbers of trials are available, the permutation test is severely limited by mathematics. The minimum p-value that can possibly be achieved by a permutation test is 1/n!, which will provide only modest evidence against a null hypothesis when n = 4 and is essentially useless when n = 3, regardless of how strong the true dependence may be. This limitation is particularly severe for researchers testing several related hypotheses, as this ideally requires a test that can produce p-values sufficiently low to maintain significance even after a multiple-testing correction.

Here we have described modified permutation tests, the sequential and simultaneous permute-match tests, that have access to lower p-values. In the sequential permute-match test, if the first variable does not display a perfect match, the test checks the other variable, effectively falling back to the simultaneous permute-match test. Thus, the sequential permute-match test is more powerful than the simultaneous test. However, the sequential test requires the arbitrary choice of checking for an X- or Y-perfect match first, making it potentially less reproducible between studies than the simultaneous test, which does not involve an arbitrary choice. If neither an X- nor Y-perfect match is obtained, both tests default to a permutation test. This step is valid because the tests are nested in the sense that a perfect match can only occur when the permutation p-value takes on its lowest possible value (Lemma 8 in Appendix 2).

It seems likely that the tests could be further modified to report an even lower p-value when both an X- and a Y-perfect match occur, as in Fig 3B. However, we have not yet determined a sharp upper bound on the probability of this two-way match event. This problem is left for future efforts.

Although we have here used language and examples involving time series, the permute-match test can in principle be applied to other scenarios where iid replicates are available, but data within each replicate are dependent. One example is spatial data in brain imaging, where the permutation test has been performed using multiple scanned brains [22]. Beyond spatial processes, dependent data also appear in nucleotide sequences and natural language text. Overall, advances in methods that take advantage of multiple replicates may facilitate statistical testing without requiring assumptions that are difficult to verify.

Methods

Data preprocessing

We obtained fish trajectory data [33] from the web address at https://drive.google.com/drive/folders/1UmzlX-yJhzQ5KX5rGry8wZgXvcz6HefD. For the first trial we used the file at the path “10/1/trajectories_wo_gaps.npy” (where “10” indicates the number of fish and “1” indicates the first trial). For the second and third trials, the file path was the same except it began with “10/2/” and “10/3/” respectively. These files contain sequences of fish positions indexed by video frame, as well as constants such as frame rate and approximate fish body length.

We estimated the velocity of each fish as

vi,t = R (xi,t − xi,t−1),

where xi,t is a 2-dimensional column vector specifying the position of the ith fish in units of body length at video frame t, vi,t is a 2-dimensional column vector specifying the fish velocity in units of body length per second at video frame t, and R is the frame rate (32 frames per second). A group’s average individual speed at each time was then given by

st = (∥v1,t∥ + ∥v2,t∥ + · · · + ∥vN,t∥)/N,

where N is the number of fish. The circular concentration Ct of a group at each time was given by the formula

Ct = 1 − vcirc,t,  with  vcirc,t = 1 − (1/N) √[(cos θ1,t + · · · + cos θN,t)² + (sin θ1,t + · · · + sin θN,t)²],

where

θi,t = atan2(vy,i,t, vx,i,t).

Here, atan2 is the 2-argument arctangent function (i.e. “arctan2” in the Python package Numpy, ver. 1.21.6). Also, vx,i,t and vy,i,t denote the x and y components of the velocity vector vi,t. The calculation was carried out using the function “stats.circstats.circvar” in the Python package Astropy (ver. 3.2.2) [40].
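A compact sketch of these computations (ours; it implements the formulas above directly with NumPy rather than calling Astropy):

```python
import numpy as np

def speed_and_concentration(pos, R=32.0):
    """pos: array of shape (T, N, 2), fish positions in body lengths.
    Returns per-frame average individual speed and circular concentration."""
    vel = R * np.diff(pos, axis=0)                     # (T-1, N, 2), body lengths per second
    speed = np.linalg.norm(vel, axis=2).mean(axis=1)   # average individual speed s_t
    theta = np.arctan2(vel[:, :, 1], vel[:, :, 0])     # heading angle of each fish
    # Circular concentration C_t = 1 - v_circ,t, which equals the mean
    # resultant length of the heading angles at frame t.
    conc = np.hypot(np.cos(theta).mean(axis=1), np.sin(theta).mean(axis=1))
    return speed, conc
```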

Statistical analysis

The permute-match test is described in Fig 1 and its validity is proved in Appendix 2. For the KPSS test, we used the function “statsmodels.tsa.stattools.kpss” in the Python package Statsmodels (ver. 0.10.1) [41]. We set the “regression” argument to “c” (which tests the null hypothesis of stationarity rather than trend-stationarity), and we set the “nlags” argument to “auto” (which means that the number of lags used for testing stationarity is automatically chosen by a data-dependent method).
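For reference, the KPSS call described above looks like the following (a sketch; series stands in for one of the 12 time series tested):

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

series = np.arange(100.0)  # placeholder: a trending, hence nonstationary, series
# Null hypothesis: level-stationarity; small p-values reject stationarity.
stat, p_value, lags, crit = kpss(series, regression="c", nlags="auto")
```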

Data availability

All code used, along with execution instructions, is within S1 Code. The time series data analyzed in this paper can be obtained via the following link: https://drive.google.com/drive/folders/1UmzlX-yJhzQ5KX5rGry8wZgXvcz6HefD

Additional files

S1 Code

Appendix

In Appendix 1, we review the basic permutation test of independence and a proof of its validity. This is because, although these results are known to the statistical community, they are often presented in an abstract form whose connection to independence testing may not be obvious to the nonspecialist, or parts of the argument are simply left as an exercise to the reader. In Appendix 2, we provide the theoretical justification for the permute-match test, which is the main topic of this work. Appendix 3 expands on a brief claim made in the main text about p-values of sequential statistical tests, and Appendix 4 contains supplementary data.

1 Justification of the permutation test of independence

In this section we provide a proof that the permutation test of independence is valid. No major novelty is claimed in this subsection. For instance, the results of Lemma 3 can be mostly reconstructed by piecing together various theorems, examples, and homework problems from section 15.2.1 of Lehmann and Romano [42]. Nevertheless, we present the complete argument here as a courtesy to the reader. Related proofs or proof sketches of various kinds of permutation tests can be found in [19, 43].

The following lemma describes a general rank-based method to test whether the distribution of a random variable changes upon applying a set of transformations. This lemma and proof are based on Theorem 15.2.1 in Lehmann and Romano [42]. We begin with this result because it comes first in the chain of logical steps that justify the permutation test, but it is somewhat abstract, and so the reader who prefers to start out with concrete concepts may wish to skip to Def 2 and return to this lemma when its necessity becomes apparent later. As a notational clarification, note that for the transformation hi, to denote “hi applied to Z” we write hiZ rather than the more traditional hi(Z).

Lemma 1

Let Z ∈ Ƶ be a random variable. Let H = {h1, …, hm} be a set of functions from Ƶ to Ƶ such that:

  1. the probability distributions of Z and hiZ are the same for all functions hi in H, and

  2. the unordered sets {h1Z, …, hmZ} and {h1hiZ, …, hmhiZ} are equal for all hi in H.

Let the statistic T be a function from Ƶ to ℝ. Given some z ∈ Ƶ, let the ordered values of {T(h1z), …, T(hmz)} be

T(1)(z) ≤ T(2)(z) ≤ · · · ≤ T(m)(z).

Choose some significance level α ∈ [0, 1] and let the rank k be

k = m − ⌊mα⌋,    (A-1)

where ⌊x⌋ is the largest integer less than or equal to x. Then,

P(T(Z) > T(k)(Z)) ≤ α,

where P(·) denotes the probability of an event.

Remark: Before proceeding to the proof, let us walk through the lemma, using a concrete example to help with the most abstract aspects. Suppose that random variable Z is the result of a fair coin flip. That is, Z = 0 with 50% chance and Z = 1 with 50% chance. Then Ƶ (the set of valid Z values) is the set {0, 1}, and z can take on the value of 0 or 1. Next the lemma introduces H, which is a set of functions that need to satisfy certain requirements with respect to Z. Let’s choose H = {h1, h2} where h1 “turns the coin over” and h2 leaves the coin as it is. That is, h1z = 1 − z and h2z = z. To check that H satisfies requirement (1), note that flipping the fair coin over does not change the probability distribution of its outcome, and neither does leaving it alone. To see that H satisfies requirement (2), we need to check that

{h1Z, h2Z} = {h1h1Z, h2h1Z} and {h1Z, h2Z} = {h1h2Z, h2h2Z}.

If Z = 0, then the above expressions become

{1, 0} = {0, 1} and {1, 0} = {1, 0},

and if Z = 1, then they are

{0, 1} = {1, 0} and {0, 1} = {0, 1},

and since these are unordered sets, they are all equal, so we have verified that H satisfies requirement (2). Next, we need a function T. For example, choose T(z) = z + 10. Thus, for z = 0 we have {T(h1z), T(h2z)} = {11, 10}. The lemma then introduces a ranking notation where T(i)(Z) is the ith smallest term in the sequence. This notation is useful as we will be dealing with the set {T(h1Z), …, T(hmZ)} for potentially large values of m. Returning to our example, if z = 0, then T(1)(0) = 10 and T(2)(0) = 11. Similarly for z = 1, T(1)(1) = 10 and T(2)(1) = 11. Also, in the statement of the lemma, T(i) is defined as a function of z, not Z. This is to emphasize that the ordering depends on the particular outcome of the random variable and is not fixed. Finally, the lemma makes a claim about the probability that T(Z) exceeds T(k)(Z), which is the kth lowest statistic in the set. In the example, if α = 0.05, then k = 2 − ⌊0.1⌋ = 2. T(2)(Z) is always going to be the second smallest, i.e. the largest, element of {T(h1z), T(h2z)}, which is 11. Of course, the probability of T(Z) being even greater than the largest number is 0, and since 0 is less than α = 0.05 we have verified the lemma in this case. As another example, suppose α = 0.5. Then k = 2 − ⌊1⌋ = 1. T(k)(Z) is then T(1)(Z), the smallest element of {T(h1z), T(h2z)}. The probability of T(Z) being greater than the smallest number is 0.5 ≤ α = 0.5, which verifies the lemma for α = 0.5.

Proof of Lemma 1: Let I(A) be the indicator function for the event A (meaning that I(A) = 1 if A occurs and I(A) = 0 otherwise).

First suppose that T(k)(Z) is not tied with any other T(i)(Z). Then Eq. A-1 implies that there are ⌊mα⌋ values of T(i)(Z) that are greater than T(k)(Z). That is,

⌊mα⌋ = I(T(h1Z) > T(k)(Z)) + · · · + I(T(hmZ) > T(k)(Z)) = I(T(h1Z) > T(k)(h1Z)) + · · · + I(T(hmZ) > T(k)(hmZ)).

To arrive at the second equality, we used the fact that T(k)(Z) = T(k)(hiZ), which follows from the second requirement of the lemma (i.e. {h1Z, …, hmZ} = {h1hiZ, …, hmhiZ}).

Alternatively, if there is possibly a tie for T(k)(Z) (meaning that there may be some j ≠ k such that T(j)(Z) = T(k)(Z)), then we have a similar formula, but with an inequality instead of an equality:

⌊mα⌋ ≥ I(T(h1Z) > T(k)(h1Z)) + · · · + I(T(hmZ) > T(k)(hmZ)).
This version with the inequality is of course general since it is true under both cases. Using this formula and taking expectations of both sides,

⌊mα⌋ ≥ E[I(T(h1Z) > T(k)(h1Z)) + · · · + I(T(hmZ) > T(k)(hmZ))] = P(T(h1Z) > T(k)(h1Z)) + · · · + P(T(hmZ) > T(k)(hmZ)).
Since Z and hiZ have the same distribution, we may remove the hi in the above formula, and we have

⌊mα⌋ ≥ P(T(Z) > T(k)(Z)) + · · · + P(T(Z) > T(k)(Z)) = m · P(T(Z) > T(k)(Z)).
Dividing through by m gives

P(T(Z) > T(k)(Z)) ≤ ⌊mα⌋/m ≤ α,
as required.

Next we give a definition of the permutation test. The definition here is somewhat more general than in the main text. As with hi, we will denote “gi applied to Z” by writing giZ.

Definition 2 The permutation test of independence

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of random variables. Let G = {g1, …, gn!} be the complete set of functions that permute a sequence of length n. For example, if Y is (2, 7, 4), then giY might be (2, 4, 7). Define the statistic T(X, Y) to be a function that returns a real number, typically interpreted as the overall correlation strength. Repeatedly compute the statistic after permuting the elements of Y. That is, compute

T(X, g1Y), T(X, g2Y), …, T(X, gn!Y).
Let N be the number of permuted statistic values that are at least as large as the original statistic value. That is,

N = I(T(X, g1Y) ≥ T(X, Y)) + · · · + I(T(X, gn!Y) ≥ T(X, Y)),
where I(A) is the indicator function for the event A (meaning that I(A) = 1 if A occurs and I(A) = 0 otherwise). Then the p-value of the test is given by

pperm = N/n!.
We now prove the validity of the permutation test.

Lemma 3

The permutation test of independence is valid when Y is exchangeable.

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of random variables such that Y is exchangeable and where X is independent of Y. Perform the permutation test of independence (Def. 2) to obtain pperm. Then, P(pperm ≤ α) ≤ α for any α ∈ [0, 1].

Proof: Here we use the symbols gi and T with their meanings from Def. 2.

We first set up the problem in a way that allows us to apply Lemma 1. To construct the random variable Z and transformation set H for Lemma 1, we choose Z = (X, Y) and hiZ = (X, giY). Lemma 1 has two requirements. First, Z and hiZ must have the same distribution. This requirement is satisfied: Since Y is exchangeable and X is independent of Y, it follows that (X, Y) has the same distribution as (X, giY). The second requirement is that {h1Z, …, hmZ} = {h1hiZ, …, hmhiZ}. This requirement is also satisfied: The set of permutations of Y is the same as the set of permutations of giY. Since both requirements are satisfied, we may apply Lemma 1.

We now prove the result directly. Below, A ⇔ B means “A if and only if B”.

pperm ≤ α
⇔ N/n! ≤ α
⇔ N ≤ ⌊n!α⌋ (since N is an integer)
⇔ I(T(X, g1Y) ≥ T(X, Y)) + · · · + I(T(X, gn!Y) ≥ T(X, Y)) ≤ ⌊n!α⌋
⇔ I(T(X, g1Y) < T(X, Y)) + · · · + I(T(X, gn!Y) < T(X, Y)) ≥ n! − ⌊n!α⌋.

As in Eq. A-1, let us define k = m − ⌊mα⌋, where m = n!. Then, the final line above says “the number of permuted Ys that produce a smaller T than the original Y is greater than or equal to k.” Said another way, “when we rank the Ts from least to greatest, the T from the original Y will have a higher rank than k”. This in turn is equivalent to T(X, Y) > T(k)(X, Y). Thus, we may directly apply Lemma 1, thereby obtaining

P(pperm ≤ α) = P(T(X, Y) > T(k)(X, Y)) ≤ α,

as required.

Lemma 3 is “asymmetric” in that it requires Y to be exchangeable and says nothing about X. It is possible to write down a variant of this lemma where either an exchangeable Y or an exchangeable X is sufficient, but this requires an extra condition on T. Specifically, it is required that applying the same permutation to both X and Y must leave the value of T(X, Y) unchanged. That is, T(X, Y) = T(giX, giY) for any permutation function gi. Importantly, the form of T that is used for the permute-match test (i.e. T(X, Y) = [ρ(X1, Y1) + · · · + ρ(Xn, Yn)]/n) satisfies this requirement because in this case, T(giX, giY) simply rearranges the order of the terms in the summation, which clearly has no effect on the value of T. We establish this formally in the corollary below.

Corollary 4

If T is “nice”, then the permutation test is valid when either X or Y is exchangeable.

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of random variables such that at least one of (X, Y) is exchangeable and where X is independent of Y. Perform the permutation test of independence (Def. 2) using a correlation function with the “nice” property that

T(X, Y) = T(giX, giY)    (A-2)

for any permutation function gi. Then, P(pperm ≤ α) ≤ α for any α ∈ [0, 1].

Proof: Lemma 3 already covers the case where Y is exchangeable. We now consider the case where X is exchangeable, but not Y.

We proceed by showing that the permutation test where Y is permuted is equivalent to the permutation test where X is permuted. Notice that to show this equivalence it is sufficient to establish that

{T(X, g1Y), …, T(X, gn!Y)} = {T(g1X, Y), …, T(gn!X, Y)},

where the left and right sides both denote unordered sets. It will be useful to refer to the inverse permutation gi⁻¹, which is the permutation that “undoes” gi. More precisely, gi⁻¹ is defined by the relation gi⁻¹giZ = Z. Using Eq. A-2, we have:

T(X, giY) = T(gi⁻¹X, gi⁻¹giY) = T(gi⁻¹X, Y).    (A-3)

Note that the inverse permutation gi⁻¹ is guaranteed to exist because any permutation gi can be “undone” by some other permutation. In the edge case where gi is the trivial identity permutation, gi⁻¹ is also the identity permutation. Additionally, the set of permutation functions is the same as the set of inverse permutation functions, meaning that

{g1, …, gn!} = {g1⁻¹, …, gn!⁻¹},    (A-4)

because each permutation is an inverse permutation (because gi inverts gi⁻¹), and each inverse permutation is a permutation. It follows that

{T(X, g1Y), …, T(X, gn!Y)} = {T(g1⁻¹X, Y), …, T(gn!⁻¹X, Y)} = {T(g1X, Y), …, T(gn!X, Y)},

where the first equality follows from Eq. A-3 and the second equality follows from Eq. A-4. This shows that the permutation test where Y is permuted is equivalent to the permutation test where X is permuted.

From Def. 2 and Lemma 3 it is clear that a permutation test where X is permuted instead of Y would be valid (i.e. P(pperm ≤ α) ≤ α) if X is exchangeable. Since we have just now shown that the present procedure is equivalent to a permutation test where X is permuted instead of Y, it follows that this procedure is valid when X is exchangeable, which completes the proof.

2 Validity of the permute-match test

We now justify the validity of the permute-match test, which is the primary contribution of the present work. Recall from the main text that the null hypothesis of this test is that X1, …, Xn are iid, Y1, …, Yn are iid and Y is independent of X.

Definition 5

The perfect match events (MX and MY)

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of random variables. Let ρ (the “correlation function”) be a function that maps a pair (Xi, Yj) to a real number. We say that the event MY (also called a “Y-perfect match”) occurs if and only if:

ρ(Xi, Yi) > ρ(Xj, Yi)

for all pairs (i, j) such that i ≠ j.

Similarly, the event MX (also called an “X-perfect match”) occurs if and only if

ρ(Xi, Yi) > ρ(Xi, Yj)

for all pairs (i, j) such that i ≠ j.

Lemma 6

Let X = (X1, …, Xn) be a sequence of random variables and let Y = (Y1, …, Yn) be a sequence of iid random variables. Let X and Y be independent. Then, the probability of a Y-perfect match is at most 1/n^n.

Proof: Let us define the random variables (C1, …, Cn) as follows:

Ci = min{ j ∈ {1, …, n} : ρ(Xj, Yi) ≥ ρ(Xk, Yi) for all k ∈ {1, …, n} }.

That is, Ci ∈ {1, …, n} is defined to be the choice of j that maximizes ρ(Xj, Yi). In the case of a multiway tie, Ci is the lowest from among the maximizing index values. (For example, if j = 3 and j = 7 both produced the greatest value of ρ(Xj, Yi), then Ci would be 3 since 3 < 7.) These Ci variables are useful because if a Y-perfect match occurs, then it must be the case that

(C1, …, Cn) = (1, …, n).
Notice that C1, …, Cn are iid given X = x for any particular x that X may take. This is because Ci is a function of only Yi and X, so once we condition on X = x, each Ci is then a function of only Yi. Moreover, the Y1, …, Yn are of course iid given X = x because Y and X are independent.

Since the Ci terms are identically distributed given X = x, we have

P(Ci = i | X = x) = P(C1 = i | X = x)

for all i. Therefore we have

P(C1 = 1 | X = x) + P(C2 = 2 | X = x) + · · · + P(Cn = n | X = x) = P(C1 = 1 | X = x) + P(C1 = 2 | X = x) + · · · + P(C1 = n | X = x) = 1.

By dividing the above equation through by n, we may write

1/n = [P(C1 = 1 | X = x) + · · · + P(Cn = n | X = x)]/n ≥ [P(C1 = 1 | X = x) × · · · × P(Cn = n | X = x)]^(1/n),    (A-5)
where the inequality comes from the AM-GM inequality (e.g. [44] pg. 20).

Since the Ci terms are independent given X = x, we have

P(C1 = 1, …, Cn = n | X = x) = P(C1 = 1 | X = x) × · · · × P(Cn = n | X = x).    (A-6)

Substituting Eq. A-6 into the right side of inequality A-5 and raising both sides to the nth power gives us

P(C1 = 1, …, Cn = n | X = x) ≤ 1/n^n.

Since X must take on some value x, it is clear that P(C1 = 1, …, Cn = n) ≤ 1/n^n. More formally, applying the law of total probability we find

P(C1 = 1, …, Cn = n) = E[P(C1 = 1, …, Cn = n | X)] ≤ 1/n^n.    (A-7)

Since a Y-perfect match can only occur when (C1 = 1, …, Cn = n), we may then write

P(Y-perfect match) ≤ P(C1 = 1, …, Cn = n).    (A-8)
Combining inequalities A-7 and A-8 completes the proof.

Lemma 7

Let X = (X1, …, Xn) be a sequence of iid random variables and let Y = (Y1, …, Yn) be a sequence of random variables. Let X and Y be independent. Then, the probability of an X-perfect match is at most 1/n^n.

Proof: The argument is analogous to that of Lemma 6. The main differences are that in this case we define Ci to be the (lowest) choice of j that maximizes ρ(Xi, Yj) instead of ρ(Xj, Yi), and here we condition on Y = y instead of X = x.

Lemma 8

A perfect match implies pperm = 1/n!

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of random variables. Let ρ be a function that maps a pair (Xi, Yj) to a real number. Define the average correlation

T(X, Y) = [ρ(X1, Y1) + · · · + ρ(Xn, Yn)]/n.    (A-9)
Perform the permutation test of independence (Def. 2) using T as the statistic, and obtain the p-value pperm. Also check for an X-perfect and a Y-perfect match (Def. 5) using ρ as the correlation function. Then, the occurrence of either an X-perfect match or a Y-perfect match implies that pperm = 1/n!.

Proof: This claim is shown by a visual argument in Fig 1C. Here we walk through the same logic in prose. Suppose a Y-perfect match occurs, meaning that

ρ(Xi, Yi) > ρ(Xj, Yi) for all pairs (i, j) with i ≠ j.
In other words, for each Yi, its correlation term ρ(·, Yi) is strictly maximized when Yi is paired with Xi. Applying a permutation to Y (other than the trivial identity permutation) means that for at least one choice of i, the ρ(Xi, Yi) term in Eq. A-9 becomes replaced with ρ(Xj, Yi) for some j ≠ i. But ρ(Xi, Yi) > ρ(Xj, Yi).

Therefore, T(X, Y) must be strictly greater than all permuted versions T(X, giY) (other than when gi is the identity permutation). It follows that pperm = 1/n!.

Suppose now that an X-perfect match occurs. It follows that for each Xi its correlation term ρ(Xi, ·) is strictly maximized when Xi is paired with Yi. Applying a permutation to X (other than the identity permutation) means that for at least one choice of i, the ρ(Xi, Yi) term in Eq. A-9 becomes replaced with ρ(Xi, Yj) for some j ≠ i. Therefore, T(X, Y) is strictly greater than all permuted versions T(giX, Y) whenever gi is not the identity permutation. So pperm = 1/n!.

Lemma 9

1/n! ≥ 2/n^n for all integers n > 1, with equality only when n = 2.

Proof: When n = 2 it is clear that 1/n! = 2/n^n = 1/2.

For n > 2, we establish 1/n! > 2/n^n by an inductive argument. For the base case, suppose n = 3. Then, 1/n! > 2/n^n becomes 1/6 > 2/27, which is true.

For the inductive step, suppose the target inequality 1/n! > 2/n^n holds. Multiplying the target inequality on both sides by 1/(n + 1), we obtain

1/(n + 1)! > 2/[(n + 1) n^n] > 2/(n + 1)^(n+1),

where the second inequality holds because (n + 1) n^n < (n + 1)^(n+1) (equivalently, n^n < (n + 1)^n). This shows that 1/n! > 2/n^n implies 1/(n + 1)! > 2/(n + 1)^(n+1). We conclude by the inductive principle that the target inequality is true for n > 2.

Theorem 10

The permute-match(XY) test is valid.

Let X = (X1, …, Xn) and Y = (Y1, …, Yn) be sequences of iid random variables. Let X and Y be independent. Let the correlation function ρ be a function that maps a pair (Xi, Yj) to a real number. Define the average correlation

T(X, Y) = [ρ(X1, Y1) + · · · + ρ(Xn, Yn)]/n.

Perform the permutation test of independence (Def. 2) using T as the statistic, and obtain the p-value pperm. Also check for an X-perfect match and a Y-perfect match (Def. 5) using ρ as the correlation function. Let

p = 1/n^n if an X-perfect match occurs,
p = 2/n^n if a Y-perfect match occurs but an X-perfect match does not, and
p = pperm otherwise.

Then P(p ≤ α) ≤ α for α ∈ [0, 1] and n > 1.

Proof: For notational convenience, we denote the events of an X-perfect match and a Y-perfect match as MX and MY.

There are four cases to consider. Either (case i) α ≥ 1/n!, or (case ii) 1/n! > α ≥ 2/n^n, or (case iii) 2/n^n > α ≥ 1/n^n, or (case iv) α < 1/n^n. Note that Lemma 9 tells us that 1/n! ≥ 2/n^n.

Case i: Suppose α ≥ 1/n!. In this case, p ≤ α if MX occurs, or MY occurs, or pperm ≤ α.

This may be split into 3 terms via the identity P(A or B) = P(A and B) + P(A and not B) + P((not A) and B), as follows:

P(p ≤ α) = P((MX or MY) and pperm ≤ α) + P((MX or MY) and pperm > α) + P(not (MX or MY) and pperm ≤ α).

However, by Lemma 8, if either MX or MY occurs, then pperm = 1/n!, which would in turn imply pperm ≤ α under case i. That is, the term P((MX or MY) and pperm > α) in the above summation is zero. Removing this term, we have

P(p ≤ α) = P((MX or MY) and pperm ≤ α) + P(not (MX or MY) and pperm ≤ α) = P(pperm ≤ α) ≤ α,
where the final inequality is due to Lemma 3. This establishes the claim for case i.

Case ii: Suppose 1/n! > α ≥ 2/n^n. In this case, pperm > α since pperm cannot be less than 1/n!, so we have

P(p ≤ α) = P(MX or MY) ≤ P(MX) + P(MY).
From Lemmas 6-7 we know P(MX) ≤ 1/n^n and P(MY) ≤ 1/n^n. Thus the above inequality becomes

P(p ≤ α) ≤ 2/n^n ≤ α,
which establishes the claim in case ii.

Case iii: Suppose 2/n^n > α ≥ 1/n^n. In this case, only MX will give us p ≤ α. So we have

P(p ≤ α) = P(MX) ≤ 1/n^n ≤ α,
which verifies the claim in case iii.

Case iv: If α < 1/n^n, then p ≤ α is impossible, so P(p ≤ α) = 0 ≤ α, and the claim is trivially satisfied in case iv.

Having been shown in all four cases, the claim has been proven.

Theorem 11

The permute-match(YX) test is valid.

Consider the same setting as Theorem 10, except that now

p = 1/n^n if a Y-perfect match occurs,
p = 2/n^n if an X-perfect match occurs but a Y-perfect match does not, and
p = pperm otherwise.

Then, P(p ≤ α) ≤ α for α ∈ [0, 1] and n > 1.

Proof: The proof is analogous to that of Theorem 10. In fact, the only thing we need to change here is that in case iii, we use the fact that P(MY) ≤ 1/n^n instead of P(MX) ≤ 1/n^n.

Theorem 12 The simultaneous permute-match(X;Y) test is valid.

Consider the same setting as Theorem 10, except that now

q = 2/n^n if an X-perfect match or a Y-perfect match occurs, and
q = pperm otherwise.

Then, P(q ≤ α) ≤ α for α ∈ [0, 1] and n > 1.

Proof: We first show by cases that p of Theorem 10 is always less than or equal to q. First, if an X-perfect match occurs, then p = 1/n^n < 2/n^n = q. Second, if there is a Y-perfect match but not an X-perfect match, then p = 2/n^n = q. Third, if neither an X-perfect nor a Y-perfect match occurs, then p = pperm = q. Overall, p ≤ q. It follows that q ≤ α implies p ≤ α.

Then it follows from Theorem 10 that

P(q ≤ α) ≤ P(p ≤ α) ≤ α,

so P(q ≤ α) ≤ α as required.

3 The sequential permute-match test versus a similar flawed procedure

In the main text we briefly noted that the sequential permute-match test resembles the flawed practice of conducting additional tests merely because earlier tests did not result in a detection. Here we provide a detailed explanation of why this practice generally results in an invalid p-value, and point out that the sequential permute-match test does not have this problem.

To spell out the flawed practice, suppose a researcher runs one test (test X) and obtains a p-value, pX. If pX is below their desired significance level α, they stop there and report pX. Conversely, if pX > α, this researcher performs another test (test Y) that assesses the same biological hypothesis in a slightly different way, obtaining a second p-value, pY. However, the researcher has a vague understanding that if one conducts multiple tests, a correction is required, so at this point they perform a Bonferroni correction to account for two tests and report 2pY. Overall, the p-value obtained from this series of steps is:

p = pX if pX ≤ α, and p = 2pY otherwise.    (A-10)
One may ask whether this overall procedure has the desired property (known as ‘validity’) that it prevents the false positive rate from exceeding the significance level. Unfortunately, this procedure is generally not valid, meaning that it commonly leads to P(p ≤ α) > α under the null hypothesis.

To demonstrate this fact mathematically, we make three assumptions. First, we assume that the null hypothesis is true for both test X and Y so that P (p ≤α) is the false positive rate. Second, we assume that pX and pY are both random variables following a uniform distribution between 0 and 1. Finally, we assume that the significance level α is less than 1/2.

The false positive rate of the procedure in Eq A-10 depends on the relationship between the two tests. If tests X and Y tend to report the same result, they will behave more like a single test and the false positive rate of Eq A-10 will be lower; if the tests frequently report different results, they will behave like two genuinely distinct tests, and the false positive rate will be greater. This relationship is quantified in Table A-1, which describes the joint and marginal distributions of the X and Y test outcomes. We write the probability that test X is significant (or not) along the bottom row, and the probability that test Y is significant (or not) along the rightmost column. We then define a parameter (P) which is the probability that test Y is significant while test X is not. We then fill out the rest of the table, corresponding to the probabilities of the other possible test results, by ensuring that the rows and columns of the 2 × 2 “Joint probability” table sum to the entries in the “Total probability” row and column.

The procedure of Eq A-10 produces a false positive rate that generally exceeds α and depends on the relationship between tests X and Y. This relationship between X and Y can be quantified by P, the probability that test Y is significant (i.e. 2pY ≤ α) but test X is not significant (i.e. pX > α). The false positive rate is shown over the full range of P, from 0 to α/2. The only way for the overall procedure of Eq A-10 to be valid is when P = 0, where tests X and Y are so tightly coupled that a significant result in test Y deterministically ensures that test X is also significant. Other than this edge case, the false positive rate of the overall procedure will exceed α.

Joint and marginal probabilities of the outcomes of tests X and Y, assuming that pX and pY both follow Unif(0, 1). The parameter P is the probability that test Y is significant but test X is not.

                          | X significant (pX ≤ α) | X not significant (pX > α) | Total probability
Y significant (2pY ≤ α)   | α/2 − P                | P                          | α/2
Y not significant         | α/2 + P                | 1 − α − P                  | 1 − α/2
Total probability         | α                      | 1 − α                      | 1

Each joint probability is of course bounded between 0 and 1. Thus, moving clockwise from the upper left of the 2 × 2 inner table we have the following constraints on P:

α/2 − P ≥ 0,    P ≥ 0,    1 − α − P ≥ 0,    α/2 + P ≤ 1.

Among these constraints, the tightest bounds on P are 0 ≤ P ≤ α/2. Notice that the false positive rate of the overall procedure is given by:

P(p ≤ α) = P(pX ≤ α) + P(pX > α and 2pY ≤ α) = α + P.    (A-11)

Fig A-1 plots Eq A-11 within the allowed bounds of P, noting the extreme cases of P = 0 and P = α/2, as well as the case of independence in which P = P(2pY ≤ α) × P(pX > α) = (α/2)(1 − α). Other than the extreme case of P = 0, in which a significant Y test result deterministically ensures that the X test is also significant, the false positive rate of the overall procedure, P(p ≤ α), will be greater than α. Thus the procedure is not valid.
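A quick Monte Carlo check (ours) of Eq A-11 in the independent case, where P = (α/2)(1 − α):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n_sim = 0.05, 200_000

# Independent tests under the null: both p-values follow Unif(0, 1).
p_x = rng.uniform(size=n_sim)
p_y = rng.uniform(size=n_sim)

# Flawed procedure (Eq A-10): report p_x if significant, else 2 * p_y.
p_final = np.where(p_x <= alpha, p_x, 2 * p_y)

fpr = np.mean(p_final <= alpha)
prediction = alpha + (alpha / 2) * (1 - alpha)  # Eq A-11 with independent tests
print(fpr, prediction)  # both approximately 0.074, exceeding alpha = 0.05
```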

The approach for the sequential permute-match test (Fig 1F) is different because it is based on events, not continuous random variables. Perhaps counterintuitively, we can actually get away with a similar recipe while producing a valid p-value. To best compare our approach to the flawed procedure above, we presently leave out the permutation part and focus only on the matching test component in the following proposition:

Proposition 13

Let MX and MY be events with P(MX) ≤ 1/n^n and P(MY) ≤ 1/n^n. Let

p = 1/n^n if MX occurs,
p = 2/n^n if MY occurs but MX does not, and
p = 1 otherwise.

Then, P(p ≤ α) ≤ α for 0 ≤ α ≤ 1 and for all integers n ≥ 2.

Proof: We consider four cases, each corresponding to possible values of α.

First, if α = 1, then P(p ≤ α) ≤ α trivially because a probability does not exceed 1.

Second is the case 1 > α ≥ 2/n^n. In this case, either MX or MY is sufficient to achieve p ≤ α. So we have:

P(p ≤ α) = P(MX or MY) ≤ P(MX) + P(MY) ≤ 2/n^n ≤ α,
as required.

Third, consider the case 2/n^n > α ≥ 1/n^n. In this case, only MX will give us p ≤ α. So we have

P(p ≤ α) = P(MX) ≤ 1/n^n ≤ α,
as required.

Finally consider the case α < 1/n^n. In this case p ≤ α is impossible so P(p ≤ α) = 0 ≤ α, as required. Since the claim holds under our (exhaustive) set of cases, the proof is complete.

The same argument holds if 1/n^n and 2/n^n are replaced by q and 2q for an arbitrary q between 0 and 1, but we use 1/n^n and 2/n^n to highlight the connection to the permute-match test. Our full proof (Theorem 10) follows largely the same steps as shown here. Overall, we hope that readers of this section now have a deeper understanding of why the sequential permute-match test is valid, and an appreciation of why similar approaches are invalid in the more standard context where statistical tests are not simply binary event checks.

4 Illustration of permute-match test with logistic map system

Statistical power of permute-match tests and the permutation test in a nonlinear and nonstationary system. (A) System equations. (B) Example dynamics. The processes X and Y are given by a coupled logistic map on a time grid t = 1, 2, …, 100. The system is nonstationary because the parameter r(t) varies with time from 3.72 when t = 1 to 3.82 when t = 99. To see the nonstationarity clearly, we plot Xi(t) against Xi(t + 1) and color points by time, showing that as time progresses, there is an upward drift in the parabola that the points lie on. To better show the trend, 10 replicates are shown simultaneously in this chart. (D) Statistical power of the permutation test and permute-match tests as a function of the number of replicates, the significance level, and rX,Y. Power was calculated from 5000 simulations at each value of rX,Y between rX,Y = 0 and rX,Y = 0.2 in steps of size 0.005. The correlation statistic (ρ) was cross-map skill, which is known to readily detect dependence between X and Y in this system [45]. For the cross-map skill calculation, we used Y to estimate X, which corresponds to a scenario where the data analyst hypothesizes that X influences Y. Cross-map skill requires two parameters, the embedding dimension and the embedding lag, and these were set to 2 and 1 respectively following prior works that used the logistic map for benchmarking [45, 46].