Abstract
Within neuroscience, psychology, and neuroimaging, the most frequently used statistical approach is null hypothesis significance testing (NHST) of the population mean. An alternative approach is to perform NHST within individual participants and then infer, from the proportion of participants showing an effect, the prevalence of that effect in the population. We propose a novel Bayesian method to estimate such population prevalence that offers several advantages over population mean NHST. This method provides a populationlevel inference that is currently missing from study designs with small participant numbers, such as in traditional psychophysics and in precision imaging. Bayesian prevalence delivers a quantitative population estimate with associated uncertainty instead of reducing an experiment to a binary inference. Bayesian prevalence is widely applicable to a broad range of studies in neuroscience, psychology, and neuroimaging. Its emphasis on detecting effects within individual participants can also help address replicability issues in these fields.
eLife digest
Scientists use statistical tools to evaluate observations or measurements from carefully designed experiments. In psychology and neuroscience, these experiments involve studying a randomly selected group of people, looking for patterns in their behaviour or brain activity, to infer things about the population at large.
The usual method for evaluating the results of these experiments is to carry out null hypothesis statistical testing (NHST) on the population mean – that is, the average effect in the population that the study participants were selected from. The test asks whether the observed results in the group studied differ from what might be expected if the average effect in the population was zero. However, in psychology and neuroscience studies, people’s brain activity and performance on cognitive tasks can differ a lot. This means important effects in individuals can be lost in the overall population average.
Ince et al. propose that this shortcoming of NHST can be overcome by shifting the statistical analysis away from the population mean, and instead focusing on effects in individual participants. This led them to create a new statistical approach named Bayesian prevalence. The method looks at effects within each individual in the study and asks how likely it would be to see the same result if the experiment was repeated with a new person chosen from the wider population at random.
Using this approach, it is possible to quantify how typical or uncommon an observed effect is in the population, and the uncertainty around this estimate. This differs from NHST which only provides a binary ‘yes or no’ answer to the question, ‘does this experiment provide sufficient evidence that the average effect in the population is not zero?’ Another benefit of Bayesian prevalence is that it can be applied to studies with small numbers of participants which cannot be analysed using other statistical methods.
Ince et al. show that the Bayesian prevalence can be applied to a range of psychology and neuroimaging experiments, from brain imaging to electrophysiology studies. Using this alternative statistical method could help address issues of replication in these fields where NHST results are sometimes not the same when studies are repeated.
Introduction
Within neuroscience, psychology, and neuroimaging, the common experimental paradigm is to run an experiment on a sample of participants and then infer and quantify any effect of the experimental manipulation in the population from which the sample was drawn. For example, in a psychology experiment, a particular set of stimuli (e.g. visual or auditory stimuli) might be presented to a sample of human participants, who are asked to categorise the stimuli or perform some other task. Each participant repeats the procedure several times with different stimuli (experimental trials), and their responses and reaction times are recorded. In a neuroimaging experiment, the same procedure is employed with neuroimaging signals recorded in addition to behavioural responses. The researcher analyses these responses to infer something about brain function in the population from which the participants were drawn.
In this standard experimental paradigm, the implicit goal is usually to determine the presence of a causal relationship between the experimental manipulation and the response of interest. For example, between a stimulus property and the neural activity in a particular brain region, or between neural activity and a behavioural measure (e.g. accuracy, reaction time). A properly controlled experimental design in which other extraneous factors are held constant (i.e. a randomised control trial) enables a causal interpretation of a correlational relationship (Angrist and Pischke, 2014; Pearl, 2009). We use statistical tools to evaluate the measured effect and ensure that we are not being fooled by randomness—that is, it should be unlikely our observed result was produced by random fluctuations with no effect present. This is often formalised as null hypothesis significance testing (NHST). We reject the null hypothesis (often that the population mean is zero) when the probability of observing a given effect size (or larger) is less than some prespecified value (often p=0.05) if the null hypothesis was true (i.e. if the population mean really was zero). Simply stated, we would be unlikely to obtain the observed effect size if the null hypothesis was true.
Researchers performing such studies usually wish to infer something about the population from which the experimental participants are selected (Holmes and Friston, 1998), rather than about the specific sample of participants that were examined (as in a case study). Importantly, any statistical inference from a sample to the population requires a model of the population itself. The ubiquitous approach used in psychology and neuroimaging is to model the effect in the population with a normal distribution and perform inference on the mean of this model: the population mean (see Materials and methods). For example, the null hypothesis is often that the population mean is zero, and the probability of the observed sample data is computed under this assumption, taking into account the variance between individuals in the sample of participants.
However, an alternative and equally valid question is to ask how typical is an effect in the population (Friston et al., 1999b). In this approach, we infer an effect in each individual of the sample, and from that infer the prevalence of the effect in the population—that is, the proportion of the population that would show the effect, if tested (Allefeld et al., 2016; Donhauser et al., 2018; Friston et al., 1999a; Rosenblatt et al., 2014). The results obtained using the population mean (vs. population prevalence) can differ, particularly when effects are heterogenous across participants.
Here, we argue that in many experimental applications in psychology and neuroscience, the individual participant is the natural replication unit of interest (Little and Smith, 2018; Nachev et al., 2018; Smith and Little, 2018; Thiebaut de Schotten and Shallice, 2017). This is because many aspects of cognition, and the neural mechanisms underlying them, are likely to vary across individuals. Therefore, we should seek to quantify effects and ensure that our results can be reliably distinguished from chance within individual participants. We argue that with such a shift in perspective towards experimental assessment within individual participants, we should also shift our statistical focus at the population level from NHST of the population mean to estimating the population prevalence: the proportion of individuals in the population who would show a true positive abovechance effect in a specific experiment. This can be thought of as the expected withinparticipant replication probability. Although we focus here on a broad class of experiments in psychology and neuroimaging that feature human participants and noninvasive recording modalities, the arguments we present are general and apply equally well to other experimental model organisms or sampling units.
To support this shift in perspective, we present a simple Bayesian method to estimate population prevalence based on the results of withinparticipant NHST, including prevalence differences between groups of participants or between tests on the same participants. This approach can also be applied without explicit withinparticipant NHST to estimate prevalence of different effect sizes, giving a new view on what can be learned about a population from an experimental sample. We suggest that applying Bayesian population prevalence in studies that are sufficiently powered within individual participants could address many of the recent issues raised about replicability in psychology and neuroimaging research (Benjamin et al., 2018; Ioannidis, 2005). Bayesian prevalence provides a population prevalence estimate with associated uncertainty and therefore avoids reducing an entire experiment to a binary NHST inference on a population mean (McShane et al., 2019).
Results
Population vs. withinparticipant statistical tests
To illustrate, we simulate data from the standard hierarchical population model that underlies inference of a population mean effect based on the normal distribution (Friston, 2007; Holmes and Friston, 1998; Penny and Holmes, 2007) (see Materials and methods).
We emphasise that this standard hierarchical population simulation is the simplest possible illustrative example, intended to demonstrate the different perspective of withinparticipant vs. population mean inference. The model simulated here is used implicitly in almost every randomeffects inference employed in psychology and neuroimaging studies, in which the tested null hypothesis is that the population distribution is normal with zero mean and nonzero variance. We also emphasise that the simulated data values within individuals are considered to represent the effect of a controlled experimental manipulation—that is, a causal relationship. However, this could be any causal relationship that we can quantify in psychology, neuroscience, or neuroimaging. For example, we could consider a withinparticipant difference in reaction time between two classes of stimuli, a coefficient or contrast from a linear model (e.g. a General Linear Model of fMRI data), a crossvalidated outofsample predictive correlation from a highdimensional stimulus encoding model (e.g. a model predicting auditory cortex MEG responses to continuous speech stimuli), a rank correlation of dissimilarity matrices in a Representational Similarity Analysis, or parameters of computational models of decision making (e.g. the Diffusion Drift Model) or learning. When evaluating the results of the simulations of Figure 1, it is important to keep in mind that these results are meant to represent any statistical quantification of any experimental manipulation.
Figure 1 illustrates the results of four simulations that differ in the true population mean, µ (A, B: $\mu =0$; C, D, $\mu =1$) and in the number of trials performed per participant (A, C: 20 trials; B, D: 500 trials). For each simulation, we performed inference based on a standard twosided, onesample ttest against zero at two levels. First, we applied the standard summary statistic approach: we took the mean across trials for each participant and performed the secondlevel population ttest on these perparticipant mean values. This provides an inference on the population mean, taking into account betweenparticipant variation. This is equivalent to inference on the full hierarchical model in a design where participants are separable (Holmes and Friston, 1998; Penny and Holmes, 2007; Penny et al., 2003) (see Materials and methods). The modelled population distribution is plotted as a solid curve, coloured according to the result of the population mean inference (orange for significant population mean, blue for nonsignificant population mean). Second, we performed inference within each participant, applying the ttest on withinparticipant samples, separately for each participant. The sample mean ± s.e.m. is plotted for each participant (orange for significant participants, blue for nonsignificant participants).
The population mean inference correctly fails to reject the null hypothesis for Figure 1 panels A and B (ground truth $\mu =0$) and correctly rejects the null in panels C and D (ground truth $\mu =1$). But consider carefully Figure 1B,C. With 500 trials in panel B, 32/50 participants (orange markers) show a significant withinparticipant result. The probability of this happening by chance, if there was no effect in any members of the population, can be calculated from the cumulative density function of the binomial distribution. In this case, it is tiny—for a test with false positive rate $\alpha =0.05$, and no effect in any individual, p<2.2×10^{16} (below 64bit floatingpoint precision). Compare that to p=0.008 for the population ttest in panel C. Thus, the panel B results provide much stronger statistical evidence of a nonzero effect at the population level: the observed results are very unlikely if no individuals in the population have a nonzero effect in this test. This statistical evidence would be ignored in analyses based only on the population mean. Considering inference at the individual level, panel C results (11/50 significant) have p=4.9×10^{6} if there was no effect within any individuals (i.e. the proportion of the population showing an effect was zero). Thus, even panel C, which simulates an experiment with only 20 trials per participant, provides stronger evidence for a population effect from the withinparticipant perspective than from the population mean perspective.
Obviously, these two different pvalues are calculated from two different null hypotheses. The population ttest tests the null hypothesis that the population mean is zero, assuming that individual means follow a normal distribution:
while the pvalue for the number of significant withinparticipant tests comes from the null hypothesis that there is no effect in any individual in the population, termed the global null (Allefeld et al., 2016; Donhauser et al., 2018; Nichols et al., 2005):
where i indexes all members of the population. These are testing different questions, but importantly both are framed at the population level and both provide a populationlevel inference. Clearly, the global null itself is quite a blunt tool. The goal of our Bayesian prevalence method is to quantify withinparticipant effects at the population level in a more meaningful and graded way. We agree that it is important to align ‘the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating’ (Yarkoni, 2020). Often, when the goal of the analysis is to infer the presence of a causal relationship within individuals in the population, the withinparticipant perspective may be more appropriate. We argue that performing NHST at the individual participant level is preferable for conceptual reasons in psychology and neuroimaging, but also for practical reasons related to the replicability crisis (see Discussion).
Although we have so far considered only the global null, the simulations show that the withinparticipant perspective can give a very different impression of the evidence for a populationlevel effect from a data set. The results in Figure 1B represent strong evidence of a nonzero effect within many participants, albeit one that is approximately balanced across participants between positive and negative directions. We do not intend to suggest that such twosided effects are particularly common but would highlight how they are implicitly assumed in traditional population mean inference models. Figure 2 provides further examples of how the withinparticipant and population mean approaches can diverge in situations with effects in one direction. In fact, if the researcher believes that all effects in an experiment are positive, then the standard null model illustrated here is inappropriate. The traditional population mean ttest is actually providing a fixedeffects analysis based on the global null rather than the randomeffects interpretation of generalisation to the population that is normally applied (Allefeld et al., 2016).
Estimating population prevalence
The pvalues under the global null are obtained from the cumulative density function of the binomial distribution, based on a withinparticipant false positive rate $\alpha =0.05$. However, we can also model the number of abovethreshold individuals in a sample when the population prevalence of true positive (significant) test results is $\gamma$ (see Materials and methods). Consider a withinparticipant test with a false positive rate $\alpha$. In this case, the distribution of the number of significant test results follows a binomial distribution with success probability $\theta =\left(1\gamma \right)\alpha +\gamma$. Here, we present a Bayesian approach to estimate the population prevalence of true positive test results, $\gamma $, from this binomial model of withinparticipant testing. Alternatively, frequentist inference approaches can be used (Allefeld et al., 2016; Donhauser et al., 2018; Friston et al., 1999a). Note that we are not estimating the population prevalence of the ground truth binary status of the tested effect. We could only obtain a lower bound on this quantity because there might be individuals with an effect too small to detect with our test. Therefore, here we focus throughout on the prevalence of true positive test results—that is, the proportion of the population we would expect to give a true positive test result given the specific experimental test procedure considered.
The Bayesian approach provides a full posterior distribution for $\gamma$, from which we can obtain the maximum a posteriori (MAP) estimate, together with measures of uncertainty—for example, highest posterior density intervals (HPDIs) or lower bound posterior quantiles. Figure 1E shows the Bayesian posteriors, MAPs, and HPDIs for the four simulated data sets in Figure 1A–D. Even though there is no population mean effect in Figure 1B, the MAP estimate of the prevalence is 0.62 (96% HPDI: [0.47 0.76]). Given the data, the probability that the population prevalence is greater than 47% is higher than 0.96. Based on this result, we would consider it highly likely that more than 47% of the population would show a true positive effect if tested in the same experiment with 500 trials.
Figure 2 demonstrates two situations where Bayesian prevalence gives a different statistical assessment of the presence of an effect in the population compared to a secondlevel ttest for a nonzero population mean. We simulated a simple EEG experiment with 100 trials repeated in each of 20 participants. Template Gaussian eventrelated potentials (ERPs) are added on each trial with fixed width and uniformly distributed amplitude, with a specific peak time per participant. Both withinparticipant and secondlevel ttests are Bonferroni corrected over the 600 time points considered. Figure 2A shows a simulation where all participants have an effect, with peak times drawn from a uniform distribution over the 100–400 ms range. There is strong evidence for an effect in each participant, and the estimated prevalence is high: 1 [0.85 1] (MAP [96% HPDI]). However, because the effects are not aligned in time across participants, there are no time points at which the null hypothesis of zero population mean can be rejected. In the simulation shown in Figure 2B, only 10/20 participants demonstrate an effect, with peak times drawn from a uniform distribution over the range 200–275 ms. Here, effects can be reliably detected in those 10 participants, leading to an estimated prevalence of 0.47 [0.25 0.70]. However, because the effects are not present in all participants, there are no time points when the null hypothesis of zero population mean can be rejected. Interestingly, plotting the prevalence posterior distribution as a function of time does reveal evidence for an effect in the population during the time window of the simulated effect.
Estimating differences in prevalence
Often the scientific question of interest might involve comparing an effect between groups of participants or between different experimental conditions in the same set of participants. In the first case, a researcher might, for example, be interested in examining differences in a behavioural or neuroimaging effect between patients and healthy controls, or between participants from two different cultures. In the second case, a researcher might be interested in determining the effect of an intervention on a particular measured effect, testing the same set of participants on two occasions, once with an intervention and once in a control condition. From the population mean perspective, these questions would typically be addressed with a twosample unpaired ttest for the first case and a paired ttest for the second. From the prevalence perspective, the question would be whether the prevalence of true positive results differs between two sampled populations (in the first case) or between two experimental tests (in the second case). We therefore provide additional functions (see Materials and methods) to directly estimate the difference in prevalence of true positive test results for these two comparisons, which we term betweengroup (same test applied to two different samples) and withingroup (two different tests applied to a single sample).
To estimate the difference in prevalence of true positive results for a given test between two samples from separate populations (e.g. patients vs. healthy controls), the input data required is the count of positive results and the total number of participants in each group. We illustrate this with a simulated data set. We specify the true prevalence in the two populations as ${\gamma}_{1}=0.75$ and ${\gamma}_{2}=0.25$, respectively, and draw a random sample based on the respective binomial distributions with parameters ${\theta}_{i}$ (see Materials and methods). We simulate ${N}_{1}=60$ participants in the first group and ${N}_{2}=40$ participants in the second group. The results of one such draw give ${k}_{1}=45$, ${k}_{2}=11$ positive tests in each group, respectively. In this case, the MAP [96% HPDI] prevalence difference ${\gamma}_{1}{\gamma}_{2}$, calculated from these four values (${k}_{1},{k}_{2},{N}_{1},{N}_{2}$), is 0.49 [0.29 0.67], which closely matches the ground truth (0.5). Figure 3A and B shows how the betweengroup posterior prevalence difference estimates scale with the number of participants for three different simulations.
To estimate the difference in prevalence of true positive results between two different tests applied to the same sample of participants, the input parameters are the number of participants significant in both tests, the number significant only in each of the two tests, and the total number of participants. We simulate two tests applied to a group of $N=50$ participants. Each test detects a certain property with false positive rate $\alpha =0.05$. The ground truth prevalences of true positives for the two tests are ${\gamma}_{1}=0.5$ and ${\gamma}_{2}=0.25$, respectively, and the correlation between positive results is ${\rho}_{12}=0.2$ (i.e. participants who test positive on one test are more likely to also test positive on the other test). The results of one random draw from this model give: (see Materials and methods) ${k}_{11}=8$ participants with a significant result in both tests; ${k}_{10}=19$ participants with a significant result in the first test but not the second; and ${k}_{01}=5$ participants with a significant result in the second but not the first. In this case, the MAP [96% HPDI] prevalence difference $\gamma}_{1}{\gamma}_{2$, calculated from these four values (${k}_{11},{k}_{10},{k}_{01},N$), is 0.28 [0.08 0.46], which again matches the ground truth (0.25). Figure 3C,D shows how the withingroup posterior prevalence difference estimates scale with the number of participants for three different ground truth situations, given as $\left[{\gamma}_{1}\text{}{\gamma}_{2}\right]{\rho}_{12}$.
Both these approaches are implemented using Monte Carlo methods, and the functions return posterior samples (Gelman, 2014), which can be used to calculate other quantities, such as the posterior probability that one test or group has a higher prevalence than the other. The posterior log odds in favour of this hypothesis can be computed by applying the logit function to the proportion of posterior samples in favour of a hypothesis. In the betweengroup example above, the proportion of posterior samples in favor of the hypothesis $\gamma}_{1}>{\gamma}_{2$ is 0.9999987, corresponding to log posterior odds of 13.5. In the above withingroup comparison, the proportion of posterior samples in favour of the hypothesis $\gamma}_{1}>{\gamma}_{2$ is 0.9979451, corresponding to log posterior odds of 6.2 (each computed from 10 million posterior samples).
The additional example in Figure 4 demonstrates how betweengroup prevalence differences can occur between two populations with the same mean. We simulated two groups of participants where group 1 was from a homogenous population with a nonzero mean whereas group 2 comprised participants drawn from two different distributions, one with zero mean and one with a large mean. The two samples (and populations) have similar means, with no significant difference between them (twosample ttest p=0.78). However, considering the prevalence in the two populations clearly reveals their difference (Figure 4C), which is formalised with the posterior distribution of the prevalence difference between the two populations (Figure 4D).
Prevalence as a function of effect size
In the above, we focused on performing explicit statistical inference within each participant. A possible criticism of this approach is that the withinparticipant binarisation of a continuous effect size can lose information. If the null distribution is the same for each participant, then the withinparticipant inference involves comparing each participant’s effect size, $E}_{p$, to a common statistical threshold $\hat{E}$. The prevalence estimation can therefore be interpreted as estimating the population prevalence of participants for which $E}_{p}>\hat{E$. In the NHST case, $\hat{E}$ is chosen so that $P(E>\hat{E})=\alpha$ (usually $\alpha =0.05$), but we can generally consider the prevalence of participants with effects exceeding any value $\hat{E}$. We therefore estimate the prevalence of $E}_{p}>\hat{E$ as a function of $\hat{E}$. This can reveal if a data set provides evidence for population prevalence of a subthreshold withinparticipant effect, as well as showing how population prevalence decreases for larger effects. Figure 5 demonstrates this approach for the simulated systems of Figure 1, showing that prevalence results for both rightsided, $E}_{p}>\hat{E$, and leftsided, $E}_{p}<\hat{E$, effect size thresholds. Note that this approach requires the null distribution to be the same for each participant and requires the false positive rate $\alpha$ to be calculated for each effect size considered. It reveals everything we can learn about the population prevalence of different effect sizes from our data set, exploring beyond the binarisation of the withinparticipant inference. The asymmetry visible in Figure 5C,D reflects the positive population mean for those simulations.
In Figure 6, three simulated EEG data sets (c.f. Figure 2) show how prevalence as a function of effect size can disambiguate situations where the prevalence of p=0.05 NHST rejections does not differ. Each panel (A–C) shows a simulation of a different EEG experiment. All three have a similar population prevalence of p=0.05 null hypothesis rejections, as shown in the lowerright posterior distributions. However, the prevalence curves as a function of effect size differ in each simulation. For example, in panel A there is no evidence that the population contains individuals with an effect size greater than T(99) = 12, whereas the MAP for prevalence of effects greater than 12 is around 50% in panel B. Similarly, the prevalence curve in panel C reflects the larger population variance of the effect in that simulation compared to panel A. While these differences would also be clear from a standard plot of the participant data (e.g. violin or raincloud plot of perparticipant maximal effect sizes), the posterior prevalence curves go beyond descriptive visualisations of the specific data sample by quantifying a formal inference to the population, including its associated uncertainty.
How to apply Bayesian prevalence in practice
As in the simulation of Figure 1, a typical population mean inference is often framed as a twolevel summary statistic procedure. At the first level, the effect is quantified within each participant (e.g. a difference in mean response between two conditions). At the second level, the population mean is inferred under the assumption that the effect is normally distributed in the population (i.e. based on the mean and standard deviation of the measured effect across participants). Bayesian prevalence is similarly framed as a twolevel procedure. At the first level, a statistical test is applied within each participant, the result of which can be binarized via a withinparticipant NHST (e.g. using a parametric test, as in our simulation, or alternatively using nonparametric permutation methods, independently for each participant), or via an arbitrary effect size threshold $\hat{E}$. At the second level, the binary results from the first level (i.e. the counts of significant participants) are the input to the Bayesian population prevalence computation. To accompany this paper, we provide code in MATLAB, Python, and R to visualise the full posterior distribution of the population prevalence, as well as extract properties, such as the MAP point estimate and HPDIs. We also provide functions to provide Bayesian estimates of the difference in prevalence between two mutually exclusive participant groups to the same test (betweengroup prevalence difference), as well as the difference in prevalence between two different tests applied to a single sample of participants (withingroup prevalence difference). We suggest reporting population prevalence inference results as the MAP estimate together with one or more HPDIs (e.g. with probability 0.5 or 0.96, see Materials and methods).
It is important to stress that the secondlevel prevalence inference does not impose any requirements on the firstlevel withinparticipant tests, other than that each test should provide strong control of the same false positive rate $\alpha $ (see Materials and methods). It is not required, for example, that each participant have the same number of trials, degrees of freedom, or withinparticipant variance. The withinparticipant test can be parametric (e.g. a ttest) or nonparametric (e.g. based on permutation testing). It can be a single statistical test or an inference over multiple tests (e.g. a neuroimaging effect within a certain brain region), provided that the familywise error rate is controlled at $\alpha $ (e.g. by using permutation tests with the method of maximum statistics).
Effect of number of participants and number of trials
Figure 7 illustrates how Bayesian prevalence inference scales with the number of participants and trials. Figure 7A–C suggests that there are benefits to having larger numbers of participants for the Bayesian prevalence metrics (including decrease in variance of obtained MAP and HPDI width, increase in prevalence lower bound). However, above around 50 participants, these benefits become less pronounced. Figure 7E shows that, on average, inferred prevalence is mostly sensitive to the number of trials per participant (horizontal contours) and is invariant to the number of participants (although variance decreases with increasing N, as in Figure 7A, C and F). By comparison, ttest power (Figure 7D) is mostly sensitive to the number of participants (vertical contours) and is largely invariant to the number of trials above around 100 trials per participant (Baker et al., 2020). In sum, compared to the population mean ttest, prevalence exhibits greater sensitivity to the number of trials obtained per participant and less sensitivity to the number of participants.
Discussion
The fields of psychology and neuroimaging are currently dominated by a statistical approach in which the primary focus is the population mean. Here, we propose a conceptual shift in perspective, from estimating population means to estimating the prevalence of effects detected within individuals. We provide a simple but novel Bayesian method to estimate the population prevalence of true positive results from any experimental test. We propose that withinparticipant inference, combined with prevalence estimation, may often match a researcher’s qualitative hypothesis regarding the existence of an effect in the population better than a binary inference that the population mean differs from zero. Particularly in the cognitive sciences, we would argue that effects within individuals are often the ultimate object of study, although these must be assessed at the population level to generalize from a specific experiment. Bayesian prevalence estimation provides this generalisation. It can easily be applied to almost any statistical evaluation of any experiment, provided a NHST can be performed at the individual participant level. The simulations presented here can also be used for simple power analyses when designing studies from this perspective.
Together, this conceptual shift and novel practical method support an alternative perspective for statistics in which the individual participant becomes the most relevant experimental unit to consider for the purposes of replication (Nachev et al., 2018; Smith and Little, 2018; Thiebaut de Schotten and Shallice, 2017), and where power should be considered for effects within individual participants. This perspective gives a very different view of the strength of evidence provided by a data set and of the importance of sample size (for both participants and trials) compared to the more common population mean perspective (Baker et al., 2020). For example, the simulation of 50 participants with 20 trials in Figure 1C has p=0.008 for a group mean different from zero, a result that is as surprising, under the null hypothesis, as observing seven heads in a row from tosses of a fair coin (Rafi and Greenland, 2020). This is weaker evidence than just two out of five participants showing an effect at $\alpha =0.05$ (p=0.0012 under the global null, or about as surprising as 10 heads in a row). Three out of five significant participants correspond to p=0.00003 under the global null (as surprising as 15 heads in a row); this is substantially stronger evidence for a populationlevel effect than that provided by the population mean inference in Figure 1D (from 50 participants, even with 500 trials). However, in the current scientific climate, the weaker result obtained from the larger sample size would commonly be viewed as providing more satisfactory evidence by most readers and reviewers. We would like to highlight this pervasive misunderstanding: that larger participant numbers automatically imply better evidence at the population level. The crux of our argument is that most studies focus on the difference between Figure 1A and C (a small difference in population mean, ignoring the implications of large betweenparticipant variance in studies that are underpowered for withinparticipant effects). In contrast, moving towards the situation shown in Figure 1B and D (increased power within individual participants) would provide both improved replicability and greater insight into the size and prevalence of effects in the population.
The practice of collecting enough data to perform withinparticipant inference is not a new idea—much of traditional psychophysics employs this approach (Smith and Little, 2018). We have employed this technique with EEG (Schyns et al., 2011; Smith et al., 2006) and MEG (Ince et al., 2015; Zhan et al., 2019), and in fMRI it is becoming more common to collect large quantities of data for fitting highdimensional, crossvalidated machine learning models within individual participants (Huth et al., 2016; Naselaris et al., 2021; Stansbury et al., 2013). Recently, this practice has also been adopted in the restingstate fMRI field where it is termed dense sampling or precision imaging (Braga and Buckner, 2017; Gordon et al., 2017; Laumann et al., 2015; Poldrack, 2017), or more recently deep imaging (Gratton and Braga, 2021). Ensuring that experiments are sufficiently powered to obtain reliable effect size estimates within individual participants could benefit studies relating individual differences in experimental effects to other external factors. In fact, the focus on population mean may have emphasised experimental effects which have low betweenparticipant variance and which are therefore less wellsuited for studying individual differences (Elliott et al., 2020; Hedge et al., 2018). With more reliable withinparticipant estimates, comparisons between groups can look beyond means to consider other differences in the distribution of effect sizes (Rousselet et al., 2017), as we show here considering prevalence as a function of effect size (Figure 6).
Note that while similar points regarding withinparticipant inference have been made elsewhere (Grice et al., 2020; Nachev et al., 2018; Smith and Little, 2018; Thiebaut de Schotten and Shallice, 2017), such results are typically considered to form a case study, without an inferential link that allows generalisation to the population (Neuroscience, S. for, 2020). (‘[The withinparticipant] approach only allows for statements that pertain to the existence and magnitude of effects in those subjects, rather than in the populations those subjects are drawn from.’) The methods presented here address this concern by providing an inferential bridge from withinparticipant tests to the population level, even when the number of participants is small. Randomisation tests (Edgington et al., 2007) provide a completely nonparametric and assumptionfree approach to experimental statistics, but their interpretation is limited since they cannot provide statistical statements which generalise to a population. The combination of withinparticipant randomisation tests with Bayesian prevalence provides a population inference with minimal assumptions (compared to tests of population mean which usually assume a normal distribution).
Bayesian prevalence produces a quantitative prevalence estimate with associated uncertainty, rather than reducing an entire experiment to a single, binary, population NHST result. This addresses many of the problems noted with the NHST framework (Amrhein et al., 2019; McShane et al., 2019) and also reduces the risk of questionable research practices, such as phacking (Forstmeier et al., 2017). Each participant serves as a replication since the same statistical analysis is performed on separate sets of data. This provides a degree of protection from problems such as researcher degrees of freedom. While it takes on average 20 different independent analyses of null data to find a single p=0.05 population mean NHST rejection in a multiparticipant data set, it would take 400 different analyses to reject a withinparticipant NHST at p=0.05 by chance in two different participants, or 8000 different analyses to find three participants significant by chance. Therefore, withinparticipant inference provides exponential protection with increasing participants against issues such as phacking and researcher degrees of freedom, likely resulting in more robust and replicable findings. We have also shown how to estimate the prevalence of different effect size thresholds, which avoids focusing on a single withinparticipant dichotomisation. However, estimated prevalence is only one approach to evaluate the strength of evidence provided by a data set, and it should be assessed alongside the quality of the experimental design, withinparticipant power, and the effect sizes within individuals. Prevalence and population mean estimation are not mutually exclusive; ideally one could report withinparticipant effect sizes and estimated population prevalence together with estimation and inference of the population mean.
It is also possible to perform frequentist inference on the prevalence testing model (Allefeld et al., 2016; Donhauser et al., 2018). Various null hypotheses for specific prevalence values can be considered, for example, the global null ($\gamma =0$), the majority null ($\gamma <0.5$), or any value of interest. However, this approach results in a binarisation of the experimental outcome to significant or not, and adds a degree of freedom to the researcher in terms of the particular value of $\gamma $ to test. An alternative approach is to consider the family of null hypotheses $\gamma <{\gamma}_{c}$ and find the largest ${\gamma}_{c}$ value for which the corresponding null hypothesis can be rejected at the chosen $\alpha $. This can be interpreted as a frequentist lower bound on the true prevalence given the specified false positive rate. However, the Bayesian posterior we present here gives a full quantitative description of the population based on the evidence provided by the data, avoids dichotomisation of the final analysis result, provides direct comparison between different groups or tests, and allows for prevalence as a function of effect size to give a more detailed picture of the population. Prevalence as a function of effect size can give insights into the variation of effects across the population (e.g. subgroups, Figure 6B) without requiring a parametric model (e.g. a normal distribution).
We have focused on human participants in typical psychology or neuroimaging experiments, but the proposed methods can be applied to infer the population prevalence of effects in other types of experimental units. For example, our approach could be directly applied to identified singleunit neurons in electrophysiological recordings. Typically, only a subset of the recorded neurons respond to a specific stimulus or experimental manipulation. Bayesian prevalence can be applied to estimate the proportion of neurons that respond in a particular way in the population of neurons in a particular brain region. For example, one could consider wellisolated single units recorded in the ventroposterior medial nucleus (VPm) of rodent thalamus (Petersen et al., 2008). Of all the N cells recorded (across sessions and perhaps animals), a certain number k might be found to be sensitive to the velocity of whisker motion, based on a nonparametric test of an encoding model. Applying Bayesian prevalence to this data would give a posterior distribution for the proportion of neurons in VPm thalamus that are velocity sensitive in a particular experimental preparation. Our betweengroup comparison method could also be used to formally compare the population prevalence of a certain type of responsive neuron between different brain areas, between different species, or between different experimental conditions (e.g. before or after learning, or with or without a pharmacological intervention). Thus, although it is common in electrophysiology to have individual neurons as the replication unit and perform inference at that level, the inferential bridge to a population of cells that Bayesian prevalence provides offers a new perspective on the results of such studies.
One potential criticism is that the demonstration of withinparticipant effects, which the prevalence approach requires, sets a much higher bar of evidence. It might be impractical to reach sufficient withinparticipant power in some imaging modalities with typical experimental designs. However, many statisticians have explicitly argued that the replicability crisis results from standards of evidence for publication being too weak (Benjamin et al., 2018). If so, the withinparticipant approach should lead to increased replicability. Indeed, it is common in neuroimaging studies to have no consistent pattern discernible in the effect size maps of individual participants, and yet such studies report a significant population mean effect in a focal region. In our view, this is problematic if our ultimate goal is to relate neuroimaging results to cognitive functions within individuals (Naselaris et al., 2021; Smith and Little, 2018). By contrast, as our simulations show (Figures 1B and 2), strong evidence for a modulation can occur in the absence of a population mean effect when the effect is heterogeneous across participants.
It is natural to expect considerable heterogeneity to exist across populations for a wide range of experimental tasks. In fact, the normal distribution that underlies most inferences on the population mean implies such heterogeneity (Figure 1B). In clinical studies of rare diseases, patient numbers are often limited, and heterogeneity can be higher relative to the healthy population. If an experiment is sufficiently powered within individual participants, then Bayesian prevalence provides a statistical statement about the population of patients with the disease, even from a small sample and without assuming a normal distribution for the population.
It has recently been suggested that researchers should define a smallest effect size of interest (SESOI) (Lakens, 2017) and should consider this when calculating statistical power. We suggest that the SESOI should also be considered at the individual level and explicitly related to the population variance obtained from hierarchical mixed effects models. If the population variance is large, then this implies that individuals exist in the population who have effects that are larger than the individual SESOI, even if the population mean is not larger. This possibility is almost always ignored in applications of such modelling, which usually focus only on the population mean. Equivalence tests (Lakens, 2017) could be applied within each participant and used to estimate the population prevalence of effects significantly smaller than the SESOI (i.e. the population prevalence of the absence of the effect).
Furthermore, the common assumption that effect sizes in the population follow a normal distribution is strong, although seldom justified (Lakens et al., 2018). For example, information processing, decision making, risk taking, and other strategies might vary categorically within the healthy population and across clinical and subclinical groups (Nachev et al., 2018; Smith and Little, 2018). In neuroimaging studies, there are related issues of variability in anatomical and functional alignment. To address these issues, results are often spatially smoothed within individuals before performing population mean inference. However, many new experimental developments, such as highresolution 7T fMRI to image cortical layers and highdensity intracranial recordings in humans, pose increasing difficulties in terms of aligning data across participants. A major advantage of the prevalence approach is that we can perform withinparticipant inference corrected for multiple comparisons, and then estimate the population prevalence of these effects without requiring them to be precisely aligned (in space, time frequency, etc.) across participants (as illustrated in Figure 2). For example, one might report that 24/30 participants showed an EEG alpha band power effect between 500 ms and 1000 ms post stimulus, which implies a population prevalence MAP of 0.79 (96% HPDI [0.61 0.91]), without requiring these individual effects to occur at precisely the same time points in those 24 participants. Similarly, if withinparticipant inference is corrected for multiple comparisons, one can infer prevalence of, say, an effect in a certain layer of primary visual cortex, without requiring precise alignment of the anatomical location of the effect between participants (Fedorenko, 2021).
Of course, there are many cases where the population mean is indeed the primary interest; in such cases, estimating and inferring on the population mean using hierarchical models is the most appropriate analysis, when possible. Linear mixedeffect models can also be interrogated in different ways to investigate the question of prevalence, for example, by examining the variance of the byparticipant random slopes or explicitly computing prevalence of different effect sizes from the normal distribution fit to the population. It is possible to extend Bayesian hierarchical models to explicitly account for different subgroups of participants (Bartlema et al., 2014; Haaf and Rouder, 2017; Haaf and Rouder, 2019; Rouder and Haaf, 2020). However, these approaches currently are not widely adopted, cannot easily be applied to nonlinear or highdimensional analysis methods common in neuroimaging, and add both mathematical and computational complexity compared to the secondlevel Bayesian prevalence method we present here, which is straightforward to apply to any firstlevel, withinparticipant analysis. Further, for many complex computational techniques—from modelling learning behaviour and decision making, to neuroimaging analysis techniques such as Representational Similarity Analysis, or highdimensional encoding models (Haxby et al., 2014)—it is currently not possible to employ a multilevel linear modelling approach down to the trial level due to the complexity of the nonlinear analysis functions and models employed. It is also important to note that if the effect of interest is an unsigned information measure such as mutual information (Ince et al., 2017) or classification accuracy (Haxby et al., 2014) then the ttest on the population mean does not actually provide a randomeffects inference that generalises to the population, but is equivalent to rejection of the global null showing only some sampled participants have an effect, that is, a fixedeffect analysis (Allefeld et al., 2016). In such cases, Bayesian prevalence estimates may be more descriptive and provide stronger generalisation than the population mean NHST.
In sum, Bayesian prevalence has a broad range of applicability—spanning dense sampling studies with high withinparticipant power, to more traditional sampling models (more participants, fewer trials, e.g. Figure 1C). It is applicable to any behavioural study, including detailed computational models of behaviour, provided that model comparison or inference on model parameters can be performed within individuals. In neuroimaging, Bayesian prevalence can be applied to any imaging modality (EEG, MEG, fMRI, fNIRS, intracranial EEG), to individual neurons within a brain region (to infer the proportion of responsive neurons), and with any statistical approach, including nonlinear and multivariate analysis methods. The crucial requirement is an adequately powered experiment to detect effects within individual participants (or other units of interest, e.g. neurons). We argue that ensuring experiments are adequately powered to detect effects within individuals would have a wide range of advantages, including strengthening the robustness and replicability of reported results.
Conclusions
While the problems that underlie the replication crisis are being increasingly recognised, there is currently no consensus as to alternative statistical approaches that are needed to address the underlying problems. Here, we propose that shifting our focus to quantifying and inferring effects within individuals addresses many of the pressing concerns recently highlighted in psychology and neuroimaging (Amrhein et al., 2019; Benjamin et al., 2018; Forstmeier et al., 2017; Ioannidis, 2005; McShane et al., 2019). We present a Bayesian approach to estimating population prevalence which is broadly applicable as it places no assumptions on the withinparticipant tests nor on the distribution of effects in the population. Further, prevalence does not require a Bayesian treatment; frequentist inference approaches can be used instead. The crucial point is to shift our perspective to first evaluate effects within individual participants, who represent the natural replication unit for studies of brains and behaviour.
Materials and methods
Simulations from an hierarchical population model
Request a detailed protocolThe data shown in Figure 1 were simulated from the standard hierarchical model:
where ${y}_{ij}$ denotes the measurement made on the jth trial (out of T) of the ith participant (out of N). $\mu}_{i$ represents the mean of each individual participant, $\sigma}_{w$ represents a common withinparticipant standard deviation over trials, $\sigma}_{b$ represents the standard deviation between participants, and $\mu}_{pop$ represents the overall population mean. This can be written as
where ${\eta}_{ij}~N(0,{\sigma}_{w}^{2})$, and ${\u03f5}_{i}~N(0,{\sigma}_{b}^{2})$. Note that under this model the distribution of the withinparticipant means is $N({\mu}_{pop},{\sigma}_{b}^{2}+\frac{1}{t}{\sigma}_{w}^{2})$.
Binomial model of population prevalence of the ground truth state of sampled units
Request a detailed protocolWe consider a population of experimental units (e.g. human participants or individual neurons) which are of two types: those that have a particular binary effect or property, and those that do not. We consider the population prevalence of the ground truth state of each unit $\gamma}_{gt$, which is the proportion of the population from which the sample was drawn that have the effect ($0<{\gamma}_{gt}<1$). If the true status of each individual unit could be directly observed, then the sample could be modelled with a binomial distribution with probability parameter $\gamma}_{gt$. However, we cannot directly observe the true status of each unit. Instead, we apply to each unit a statistical test following the NHST framework, which outputs a binary result (which we term here positive vs. negative). This test has a false positive rate $\alpha ,$ and sensitivity $\beta$. Thus, the probability that a randomly selected unit from the population that does not possess the defined effect but will produce a positive test result is $\alpha $, while the probability that a randomly selected unit that does possess the defined effect will produce a positive test result is $\beta .$ Under the assumption that the units are independent and $\alpha $ and $\beta $ are constant across units, the number of positive tests $k$ in a sample of size $n$ can be modelled as a binomial distribution with parameter $\theta $ (Donhauser et al., 2018; Friston et al., 1999b; Friston et al., 1999a; Rogan and Gladen, 1978):
Binomial model of population prevalence of true positives for a given test procedure
Request a detailed protocolA major issue with the above approach is that it requires the sensitivity $\beta$ to be specified and constant across units. $\beta $ is the probability of a significant result given that the individual has an effect and differs as a function of the ground truth effect size. In general, we do not assume that all members of the population have the same effect size, so it is not possible to specify a common $\beta$ for all individuals.
Therefore, rather than modelling the prevalence of ground truth effects, we consider the prevalence of true positive test results for the particular test procedure we employ,$\gamma}_{tp$:
In this case, the only assumption is that test procedure has the same false positive rate $\alpha $ for every individual, which is easily satisfied by most common parametric and nonparametric statistical tests. Note that this is equivalent to estimating ground truth prevalence with a test with $\beta =1$. In general, a test with lower sensitivity allows inference of a higher prevalence for an observed $k$ because some of the observed negative results will be missed true positive results. Therefore, prevalence of true positives obtained with $\beta =1$ is a conservative lower bound on the prevalence of ground truth state (Allefeld et al., 2016; Donhauser et al., 2018; Friston et al., 1999a).
Throughout this paper we consider the prevalence of true positives for a given test procedure rather than the prevalence of ground truth state and so omit the subscript tp. This quantifies the proportion of the population that would be expected to provide a true positive test result (i.e. have a nonnull effect that would be detected by the experimental test procedure considered).
Frequentist estimation of and inference on population prevalence
Request a detailed protocolVarious frequentist approaches can be used with the above binomial model of statistical testing. First, the maximum likelihood estimate of the population prevalence of true positives can be obtained as
Standard bootstrap techniques (Johnson, 2001) can give percentile bootstrap confidence intervals as an indication of uncertainty in this estimate. We can also explicitly test various null hypotheses at the population level. For example, we can test a compound null hypothesis $\gamma <0.5$, termed the majority null (Allefeld et al., 2016; Donhauser et al., 2018). This is chosen with the idea that a prevalence of >50% supports a claim that the effect is typical in the population. Other explicit compound nulls of this form can also be tested (e.g. that $\gamma <0.25$ or $\gamma <0.75$). Alternatively, it is possible to infer a lower bound on the population prevalence by finding the largest value ${\gamma}_{c}$, such that $p\left(X>k\text{}\text{}\gamma {\gamma}_{c}\right)0.05$ (Allefeld et al., 2016; Donhauser et al., 2018). This inferred lower bound provides a more graded output than a binary significance result of testing against a specific compound null (i.e. the continuous value ${\gamma}_{c}$).
Bayesian estimation of population prevalence
Request a detailed protocolWe apply standard Bayesian techniques to estimate the population prevalence parameter of this model (Gelman, 2014). Assuming a beta prior distribution for $\theta $ with shape parameters $r,s$, together with a binomial likelihood function, the posterior distribution for $\theta $ is given by a beta distribution with parameters $(k+r,nk+s)$, truncated to the interval $[\alpha ,1]$, where $k$ is the number of participants showing an abovethreshold effect out of $n$ tested. In the examples shown here, we use a uniform prior (beta with shape parameters $r=s=1$), as in the general case there is no prior information regarding $\theta $. This implies a uniform prior also for $\gamma $, so, a priori, we consider any value of population prevalence equally likely. While we default to the uniform prior, the code supports any beta distribution as a prior. Alternative priors could be implemented via Markov chain Monte Carlo methods (Gelman, 2014) together with the models described here. Note that similar Bayesian approaches have been applied in the field of epidemiology, where sometimes multiple complementary diagnostic tests for a disease are applied with or without a gold standard diagnosis in a subset of the sampled units (Berkvens et al., 2006; Enøe et al., 2000; Joseph et al., 1995).
Under the uniform prior, the Bayesian MAP estimate for prevalence proportion of true positives is available analytically and is equivalent to the maximum likelihood estimate:
Following McElreath, 2016, we present 96% HPDIs here to emphasise the arbitrary nature of this value and reduce the temptation to interpret the interval as a frequentist p=0.05 inference.
Bayesian estimation of the prevalence difference between two independent groups
Request a detailed protocolWe consider here a situation where the same test is applied to units sampled from two different populations. In addition to the prevalence of true positives within each population, we wish to directly estimate the difference in prevalence between the two populations. We denote the prevalence within each population as $\gamma}_{1$, $\gamma}_{2$, respectively. We sample $n}_{1},{n}_{2$ participants at random from each population and record ${k}_{1},{k}_{2}$, the number of significant withinparticipant tests in each sample. Assuming independent uniform priors on the prevalences and associated ${\theta}_{i}$ variables as above with:
then the posterior distribution for $({\theta}_{1},{\theta}_{2})$ is given by the product of two truncated beta distributions, with parameters $({k}_{i}+1,{n}_{i}{k}_{i}+1)$, respectively, both truncated to the interval $\left[\alpha ,1\right]$. The prevalence difference can be obtained as:
For nontruncated beta distributions, an analytic exact result is available (PhamGia et al., 1993). This result could be extended to provide an exact distribution for the prevalence difference, but the mathematical expressions involved are quite complex. We find it simpler to employ Monte Carlo methods, which can provide as close an approximation to the exact answer as desired. Here we use Monte Carlo methods to draw samples from the posterior for $({\theta}_{1},{\theta}_{2})$, obtaining samples for the prevalence difference with the above expression. We use these samples to numerically compute the MAP and HPDIs.
Bayesian estimation of the prevalence difference of two different tests within the same sample of participants
Request a detailed protocolIn this situation, we consider that two different test procedures are applied to a single sample of $n$ units. We assume both tests have the same value of $\alpha$ and define:
Here ${\theta}_{i}$ is the probability that a randomly selected unit from the population will show a positive result on the ith test, and ${\gamma}_{i}$ is the population prevalence of true positives associated with the ith test. Now, each unit provides one of four mutually exclusive results based on the combination of binary results from the two tests. ${k}_{11}$ represents the number of units that have a positive result in both tests, ${k}_{10},{k}_{01}$ represent the number of units that have a positive result only in the first or second test, respectively, and ${k}_{00}$ is the number of units that do not show a positive result in either test. So $\sum _{i,j}{k}_{ij}=n$. We can define analogous variables ${\mathit{\theta}=\{\theta}_{ij}\}$, representing the population proportion for each of the four combined test outcomes. Note that ${\theta}_{ij}>0$ and $\sum _{i,j}{\theta}_{ij}=1$. The marginal success probabilities ${\theta}_{i}$ can be expressed as:
and so
The marginal probabilities ${\theta}_{i}$ are subject to the constraints
and so
Assuming a uniform prior and a multinomial distribution for the ${k}_{ij}$, the posterior of $\mathit{\theta}$ is a truncated Dirichlet distribution with parameters ${m}_{ij}={k}_{ij}+1$ subject to the constraints above (which are analogous to the truncation of the beta posterior distribution in the case of a single test). We use a Monte Carlo approach to draw samples from this posterior following a modified stickbreaking process.
Draw a sample ${\theta}_{11}$ from a $Beta({m}_{11},{m}_{10}+{m}_{01}+{m}_{00})$ distribution truncated to the interval [0,1].
Draw a sample ${z}_{10}$ from a $Beta({m}_{10},{m}_{01}+{m}_{00})$ distribution truncated to the interval $\left[max\left(\frac{\alpha {\theta}_{11}}{1{\theta}_{11}},0\right),\frac{1{\theta}_{11}}{1{\theta}_{11}}\right]$. Set ${\theta}_{10}=\left(1{\theta}_{11}\right){z}_{10}$.
Draw a sample ${z}_{01}$ from a $Beta({m}_{01},{m}_{00})$ distribution truncated to the interval $\left[max\left(\frac{\alpha {\theta}_{11}}{1{\theta}_{11}{\theta}_{10}},0\right),\text{}min\left(\frac{1{\theta}_{11}}{1{\theta}_{11}{\theta}_{10}},1\right)\right]$. Set ${\theta}_{01}=\left(1{\theta}_{11}{\theta}_{10}\right){z}_{01}$.
Set ${\theta}_{00}=1{\theta}_{11}{\theta}_{10}{\theta}_{01}$.
Then $\left({\theta}_{11},{\theta}_{01},{\theta}_{10},{\theta}_{00}\right)$ is a draw from the required truncated Dirichlet distribution, and $({\theta}_{10}{\theta}_{01})/(1\alpha )$ is a draw from the posterior distribution of the prevalence difference.
We use these samples to numerically compute properties like the MAP estimate and HPDIs.
To specify a ground truth to simulate data from two tests applied to the same participants (Figure 3), we require ${\gamma}_{1}$ and ${\gamma}_{2}$, the population prevalences of the two tested effects, together with ${\rho}_{12}$, the correlation between the presence of the two effects across the population. From this we can calculate ${\gamma}_{11}$, the proportion of true positive results to both tests as
Similarly, we can define ${\gamma}_{ij}$ representing the population proportions corresponding to the other test result configurations. Then we can generate multinomial samples using the parameters ${\theta}_{ij}$ computed as
Prevalence as a function of effect size threshold
Request a detailed protocolEstimating the prevalence of $E}_{p}>\hat{E$ proceeds as for prevalence inference based on withinparticipant NHST. One additional step is the need to calculate $\alpha $, the false positive rate under the null hypothesis of no effect, for each threshold value $\hat{E}$. This is simply the probability of $E}_{p}>\hat{E$ under the null hypothesis. In the examples shown here, we calculate this from the cumulative distribution function of the appropriate tdistribution, but for other tests this could also be estimated nonparametrically. A number of $\hat{E}$ values are selected, linearly spaced over the observed range of the sample. For each of these values the count of the number of participants satisfying the inequality and the $\alpha $ value corresponding to the inequality are used to obtain the Bayesian posterior for prevalence of true positives. Note that this can be applied to either tail $E}_{p}>\hat{E$ or $E}_{p}<\hat{E$.
Data availability
Request a detailed protocolScripts implementing the simulations and creating the figures from the paper are available at https://github.com/robince/bayesianprevalence (copy archived at swh:1:rev:a10f2760930f7638d1c2a73944719e6283aedcec, Ince, 2021).
Code availability
Request a detailed protocolTo accompany this paper, we provide functions in MATLAB, Python, and R to calculate the Bayesian prevalence posterior density (e.g. to plot the full posterior distribution), the MAP estimate of the population prevalence, HPDI intervals of the posterior and lower bound quantiles of the posterior, as well as prevalence differences between samples or between tests within a sample. We also provide example scripts that produce posterior plots as in Figure 1E. See https://github.com/robince/bayesianprevalence.
Data availability
Scripts implementing the simulations and creating the figures from the manuscript are available at https://github.com/robince/bayesianprevalence (copy archived at https://archive.softwareheritage.org/swh:1:rev:a10f2760930f7638d1c2a73944719e6283aedcec). A toolbox of functions to calculate Bayesian prevalence is provided for Matlab, Python and R.
References

BookMastering ’metrics: The Path from Cause to EffectPrinceton ; Oxford: Princeton University Press.

Power contours: Optimising sample size and precision in experimental psychology and human neurosciencePsychological Methods 1:met0000337.https://doi.org/10.1037/met0000337

Redefine statistical significanceNature Human Behaviour 2:6–10.https://doi.org/10.1038/s415620170189z

Imaging of neural oscillations with embedded inferential and group prevalence statisticsPLOS Computational Biology 14:e1005990.https://doi.org/10.1371/journal.pcbi.1005990

The early origins and the growing popularity of the individualsubject analytic approach in human neuroscienceCurrent Opinion in Behavioral Sciences 40:105–112.https://doi.org/10.1016/j.cobeha.2021.02.023

Detecting and avoiding likely falsepositive findings – a practical guideBiological Reviews 92:1941–1968.https://doi.org/10.1111/brv.12315

BookStatistical Parametric Mapping: The Analysis of Functional Brain ImagesAcademic Press.

Editorial overview: Deep imaging of the individual brain: past, practice, and promiseCurrent Opinion in Behavioral Sciences 40:iii–vi.https://doi.org/10.1016/j.cobeha.2021.06.011

Persons as Effect SizesAdvances in Methods and Practices in Psychological Science 3:443–455.https://doi.org/10.1177/2515245920922982

Developing constraint in bayesian mixed modelsPsychological Methods 22:779–798.https://doi.org/10.1037/met0000156

Some do and some don’t? Accounting for variability of individual difference structuresPsychonomic Bulletin & Review 26:772–789.https://doi.org/10.3758/s134230181522x

Decoding Neural Representational Spaces Using Multivariate Pattern AnalysisAnnual Review of Neuroscience 37:435–456.https://doi.org/10.1146/annurevneuro062012170325

Tracing the Flow of Perceptual Features in an Algorithmic Brain NetworkScientific Reports 5:17681.https://doi.org/10.1038/srep17681

SoftwareBayesianprevalence, version swh:1:rev:a10f2760930f7638d1c2a73944719e6283aedcecSoftware Heritage.

BookTeaching StatisticsIn: Johnson RW, editors. An Introduction to the Bootstrap. CRC press. pp. 49–54.https://doi.org/10.1111/14679639.00050

Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standardAmerican Journal of Epidemiology 141:263–272.https://doi.org/10.1093/oxfordjournals.aje.a117428

Equivalence tests: A practical primer for T tests, correlations, and metaanalysesSocial Psychological and Personality Science 8:355–362.https://doi.org/10.1177/1948550617697177

Replication is already mainstream: Lessons from small n designsBehavioral and Brain Sciences 41:766.https://doi.org/10.1017/S0140525X18000766

BookStatistical Rethinking: A Bayesian Course with Examples in R and StanBoca Raton: CRC Press/Taylor & Francis Group.

Abandon statistical significanceThe American Statistician 73:235–245.https://doi.org/10.1080/00031305.2018.1527253

Extensive sampling for complete models of individual brainsCurrent Opinion in Behavioral Sciences 40:45–51.https://doi.org/10.1016/j.cobeha.2020.12.008

Consideration of sample size in neuroscience studiesThe Journal of Neuroscience 40:4076–4077.https://doi.org/10.1523/JNEUROSCI.086620.2020

ConferenceStatistical Parametric Mapping: The Analysis of Functional Brain ImagesRandom effects analysis. pp. 156–165.

Bayesian analysis of the difference of two proportionsCommunications in Statistics  Theory and Methods 22:1755–1771.https://doi.org/10.1080/03610929308831114

Estimating prevalence from the results of a screening testAmerican Journal of Epidemiology 107:71–76.https://doi.org/10.1093/oxfordjournals.aje.a112510

Beyond differences in means: Robust graphical methods to compare two groups in neuroscienceEuropean Journal of Neuroscience 46:1738–1748.https://doi.org/10.1111/ejn.13610

Cracking the code of oscillatory activityPLOS Biology 9:e1001064.https://doi.org/10.1371/journal.pbio.1001064

Small is beautiful: In defense of the smalln designPsychonomic Bulletin & Review 25:2083–2101.https://doi.org/10.3758/s1342301814518

Identical, similar or different? Is a single brain model sufficient?Cortex; a Journal Devoted to the Study of the Nervous System and Behavior 86:172–175.https://doi.org/10.1016/j.cortex.2016.12.002
Decision letter

John T SerencesReviewing Editor; University of California, San Diego, United States

Timothy E BehrensSenior Editor; University of Oxford, United Kingdom

Alex HuthReviewer; University of Texas, Austin, United States

Sam LingReviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
This is paper describes methods to move away from frequentist null hypothesis testing toward a Bayesian approach of estimate the prevalence of withinparticipant effects. This methodological advance should be widely applicable in many fields, and should promote a more robust evaluation of effects.
Decision letter after peer review:
Thank you for submitting your article "Bayesian inference of population prevalence" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Alex Huth (Reviewer #1); Sam Ling (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. We will all reevaluate the manuscript after the resubmission, and I hope the comments below are useful.
Summary:
As you'll see below, the reviewers and I were all enthusiastic about the paper. That said, we all felt that a revision is required to provide concrete examples of when this method is preferred over frequentist approaches, as well as the general types of questions that this method is appropriate for addressing.
In addition, we all felt that the focus of the paper should be expanded to give more attention to nonfMRI studies. For example, there is passing mention in the Discussion section about how this might apply to single unit studies, but that seems more of an afterthought at the moment. Thus, the paper is missing a chance to speak to, and influence, a much broader audience.
Essential revisions:
In addition to revising the text as indicated above (and in the more detailed reviews), it seems likely that some additional simulations may need to be run to address all of the points. This is particularly true of the suggestion to demonstrate a situation where prevalence estimation is more powerful than standard methods.
Reviewer #1 (Recommendations for the authors):
In this paper, Ince and colleagues describe a novel set of Bayesian methods for estimating and working with "population prevalence" of an effect, which they pose as an alternative to NHST on the mean effect in the population. In turns this paper was tantalizing, exciting, and frustrating. I very much want the method the authors propose to be useful, but I was left without a clear sense of (1) what specific questions ("qualitative hypotheses") this method is appropriate for answering, and (2) why (quantitatively) this method should be preferred over others. Specific comments and suggestions are enumerated below.
1. Please show us how prevalence estimation can be more useful than classical approaches! The authors convincingly argued that estimating prevalence amounts to replicating an experiment in each individual subject, a shift in perspective that may help alleviate issues of nonreplicability in psychology and neuroimaging (I am very sympathetic to these arguments!). However, it was somewhat frustrating that none of the computational experiments seemed to directly contrast prevalence estimation vs. population mean estimation. If prevalence is indeed a superior way to conceptualize effects in a population, then there should be some experiment where prevalence and population mean give different results for testing the same qualitative hypothesis. For example, in the discussion the authors bring up the issue of uninteresting heterogeneity across subjects, which can reduce the effectiveness of standard approaches (i.e. where the precise anatomical location or timing of an effect is variable across subjects). This paper would be much more convincing if it could experimentally demonstrate a situation like this (in simulation, of course), where prevalence estimation might truly be more powerful than standard methods.
2. Tell us more about what inferences can actually be made using prevalence estimation! The experiment in Figure 1 lays out the differences between prevalence and population mean estimation quite clearly, but the conclusions are much less clear. The authors "suggest that often, if the goal of the analysis is to infer the presence of a causal relationship within individuals in the population, the withinparticipant perspective may be more appropriate" – but more appropriate for what? Exploringspecifically at this point in the paperwhich types of qualitative hypotheses (or some specific examples?) can be meaningfully addressed using prevalence analyses would, in my opinion, greatly strengthen the authors' case.
3. The argument that prevalence estimation is more sensitive to trials per participant than number of participants (Figure 2E) seems circular. The authors define prevalence as the fraction of subjects with "true positive (significant) test results" rather than the fraction where the null hypothesis is actually false, because they can only find a lower bound on the latter measure. Given this definition, the finding that "prevalence greater sensitivity to the number of trials per participant, and less sensitivity to the number of participants" (which is seemingly framed as the main conclusion of the analyses in Figure 2?) is highly unsurprising.
4. How does estimating group differences in prevalence compare to estimating differences in group means? The experiment shown in Figure 3 is tantalizing, but what I really wanted to know here is whether (and under what conditions) comparing groups using prevalence might be more powerful than comparing groups using traditional models.
5. The analysis of prevalence as a function of effect size is missing context and conclusions. Ultimately, I was left without any sense of the utility of estimating prevalence for different effect sizes after reading this section (and Figure 4). What is this specific procedure useful for? What kinds of hypotheses can it test? How does it compare to other methods? Answering these questions would undoubtedly strengthen the paper.
Reviewer #2 (Recommendations for the authors):
This manuscript advocates for a shift in the statistical (and methodological) approaches taken in neuroscience and psychology, wherein effects are considered from the perspective of prevalence, rather than populationcentric NHST. They provide simple yet compelling simulations to make their case, and introduce the bayesian prevalence method, along with code to implement on one's own. This manuscript is incredibly well written, the simulations are tightly constructed, and will, I hope, become widely read by the neuroscience and psychology community. It's a straightforward treatment of a statistic that the field needs to consider shifting towards, and the ability to generalize to any NHST will no doubt make this user friendly for most all researchers. I will readily admit that this was a method that my own lab has been in search of for a while, and we plan to deploy this in our own work. Below, I outline my questions/comments, but they are largely meant to support clarification to a broader audience.
1. While the authors promote a Bayesian treatment of the estimate of prevalence, they readily admit that a frequentist approach is suitable as well. It would be worth detailing a bit more the advantages and disadvantages of the two, if any. Particularly because the frequentist approach to neuroimaging has been out in the wild for a while now.
2. The authors simulate individuals within the population as having equal withinsubject (or unit) variability. To what degree are the assumptions for prevalence estimation upheld in the case where individual subjects have unequal variance?
3. I assume this rests on the assumption of independence between units of replication (ie subjects, voxels, ROIs, cells). Is this true? And how robust is this method to a lack of independence? I ask because if one were carrying this method out on a voxelbyvoxel basis, within a participant, there's quite a bit of nonindependence between these voxels. And likely same goes for between adjacent ROIs, or even between time points. It would be helpful if the authors dedicated some text to this assumption.
4. This comment is more to aid in promotion of the work, but I think the introduction and walkthrough of the simulation would be better serviced by using a case example throughout. It could be a psychophysics example, clinical psych example, or neuroscience example, doesn't matter. I think it would simply help the reader better concretize the reasoning this way. This is entirely up to the authors, but I imagined that the broader audience that they (and I) want to reach would be more likely to engage if this was more explicitly related to their work.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for submitting your article "Bayesian inference of population prevalence" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Sam Ling (Reviewer #2).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential Revisions:
This is a paper that describes methods to move away from frequentist null hypothesis testing toward a Bayesian approach of estimate the prevalence of withinparticipant effects. This methodological advance should be widely applicable in many fields, and should promote a more robust evaluation of effects.
Reviewer #1 (Recommendations for the authors):
In their revised manuscript, Ince and colleagues have added a number of illustrative simulations and clarified several important points, substantially improving the final product. In particular, the simulated EEG analysis (new Figure 2) is an extremely clear and powerful demonstration of the advantages of prevalence estimation over population mean inference for a realistic problem. This is exactly the type of comparison that seemed to be missing in the initial submission, and its inclusion strengthens the paper substantially. The new simulations in Figures 4 and 6 also clarify the related points. Overall, I think the current manuscript is excellent, interesting, and timely.
Here I will respond to two of the authors' points specifically.
Authors: We avoid direct quantitative comparisons because the hypotheses tested (nonzero population mean, vs. posterior estimate of prevalence of withinparticipant effects) and statistical methods (frequentist vs. Bayesian) are different, although we do invite comparison of the different frequentist pvalues (for null of mean zero vs global null of no participants with an effect) for the examples in Figure 1. We focused instead on qualitative arguments in favour of the prevalence view (better match to qualitative hypothesis of interest in psychology and neuroimaging, which we believe should better consider effects within individual behaving brains, and greater robustness given that the effect is shown independently in multiple participants).
This is spot on. The hypotheses tested by these methods are different, but the key question is how these (quantitative) hypotheses map to the implicit, qualitative hypotheses that are actually the subject of inquiry. On the discussion of this point, the current manuscript does not differ substantially from the original. Yet I still found myself much more convinced of its correctness this time around than the first. To this I credit the new simulations that the authors added (Figure 2 and 4). By directly comparing the results of the two methods, these simulations highlight how much better the hypotheses tested by prevalence estimation align with intuitive/qualitative hypotheses than do ttests.
Minor aside: oddly, I think Figure 1 actually does the authors' arguments something of a disservice on this central point. Yes, it clearly illustrates how the two methods differ. But it presumes a world in which all the assumptions of the hierarchical population mean test are met. This leads to a situation where the standard method matches the reader's expectations (/qualitative hypotheses) much more closely than does the prevalence method. The authors acknowledge the weirdness of the twosided effects seen in Figure 1B, and point out that this weirdness is actually implicit in the standard hierarchical model, but I worry that by this point the damage is already done. I'm not sure how best to do this – Figure 1 is a very clear illustration of the methods – but it could be worthwhile to consider reorganization that deprioritizes this simplebutweird example in favor of less weird examples like Figure 2.
Authors (regarding prevalence as a function of effect size): The main point of this section is to answer the potential criticism that we are dichotomising within participant results and to show that this is not a requirement for valid quantification of population prevalence. Prevalence as a function of effect size gives a fuller picture of the distribution of effect sizes observed. One could argue this is also given by a standard density plot, for example in a violin or raincloud plot, but these are descriptive properties of the specific data sample, without any formal inferential link to the population.
The new simulations shown in Figure 6 certainly clarify this point by showing examples with different underlying population statistics. I'm not specifically asking for anything more – I think this is sufficiently convincing asis – but I think this would be stronger still with a more concrete example that shows how the prevalence/effect size curve could appear in different realworld situations, with implications for the reader's qualitative hypotheses. For an example like the question posed by Reviewer 2, this curve would look quite different if the individual subject standard deviation was not fixed, but was instead drawn from some distribution. Again, this is not critical, but a more realworldlike example would make this section more convincing.
Reviewer #2 (Recommendations for the authors):
The authors have done an admirable job addressing my questions. I'm eager for this work, which I hope to be influential in the field, to be out in the wild for others to adopt!
https://doi.org/10.7554/eLife.62461.sa1Author response
Summary:
As you'll see below, the reviewers and I were all enthusiastic about the paper. That said, we all felt that a revision is required to provide concrete examples of when this method is preferred over frequentist approaches, as well as the general types of questions that this method is appropriate for addressing.
We thank the reviewers for their useful comments and apologise for the delay in producing our revisions. We have added additional simulations to compare directly with Null Hypothesis Statistical Testing (NHST) on the population mean and reordered the results presentation to emphasise the broad applicability of our Bayesian prevalence method.
In addition, we all felt that the focus of the paper should be expanded to give more attention to nonfMRI studies. For example, there is passing mention in the Discussion section about how this might apply to single unit studies, but that seems more of an afterthought at the moment. Thus, the paper is missing a chance to speak to, and influence, a much broader audience.
It was not our intention to focus on fMRI, although it’s possible that the citations reflected this area of research more than others. In fact, our approach has been developed primarily from behavioural and M/EEG studies. We revised the manuscript to clarify that the new approach is completely general with a simple abstract example to propose the concept. Also, we have now added simulations of typical eventrelated EEG experiments to motivate how Bayesian prevalence can be applied and how its results can differ from a second level ttest. The extension to single unit studies is a discussion point because we prefer to focus the examples to develop the critical point that human participants are the replication unit of interest in a broad range of psychology and neuroimaging studies. We have extended this discussion point with a more concrete example of how prevalence could be applied with neurons as the replication unit of interest.
Essential revisions:
In addition to revising the text as indicated above (and in the more detailed reviews), it seems likely that some additional simulations may need to be run to address all of the points. This is particularly true of the suggestion to demonstrate a situation where prevalence estimation is more powerful than standard methods.
We want to emphasise that we don’t necessarily claim that Bayesian prevalence is more powerful than standard methods. In fact, the frequentist notion of power doesn’t directly apply to Bayesian estimation. Rather, we want to explain how Bayesian prevalence quantifies a different property of the sampled population, and thereby supports a conceptual shift of perspective, where the replication unit is the individual participant, and the proportion of participants showing a significant effect becomes the focus of statistical inference (in contrast to Null Hypothesis Significance Testing of the mean across participants). There are situations where this different perspective shows effects or differences that the population mean is not sensitive to, and we have added such example simulations as requested by the reviewers (new Figures 2, 4, 6).
Reviewer #1 (Recommendations for the authors):
In this paper, Ince and colleagues describe a novel set of Bayesian methods for estimating and working with "population prevalence" of an effect, which they pose as an alternative to NHST on the mean effect in the population. In turns this paper was tantalizing, exciting, and frustrating. I very much want the method the authors propose to be useful, but I was left without a clear sense of (1) what specific questions ("qualitative hypotheses") this method is appropriate for answering, and (2) why (quantitatively) this method should be preferred over others. Specific comments and suggestions are enumerated below.
1. Please show us how prevalence estimation can be more useful than classical approaches! The authors convincingly argued that estimating prevalence amounts to replicating an experiment in each individual subject, a shift in perspective that may help alleviate issues of nonreplicability in psychology and neuroimaging (I am very sympathetic to these arguments!). However, it was somewhat frustrating that none of the computational experiments seemed to directly contrast prevalence estimation vs. population mean estimation. If prevalence is indeed a superior way to conceptualize effects in a population, then there should be some experiment where prevalence and population mean give different results for testing the same qualitative hypothesis. For example, in the discussion the authors bring up the issue of uninteresting heterogeneity across subjects, which can reduce the effectiveness of standard approaches (i.e. where the precise anatomical location or timing of an effect is variable across subjects). This paper would be much more convincing if it could experimentally demonstrate a situation like this (in simulation, of course), where prevalence estimation might truly be more powerful than standard methods.
In the original submission, we tried to strike a balance between a clear conceptual exposition and lists of practical examples, because the breadth of possible qualitative hypotheses to which Bayesian prevalence can be applied is so broad. In fact, pretty much any experimental question or test that can be considered at the per participant level can be framed as a population prevalence question, as an alternative to the typical population mean inference. So, in the revised manuscript, we keep the examples generic in the hope that readers could export the conceptual shift embedded within the prevalence concept to any questions that might arise from their own work.
We avoid direct quantitative comparisons because the hypotheses tested (nonzero population mean, vs. posterior estimate of prevalence of withinparticipant effects) and statistical methods (frequentist vs. Bayesian) are different, although we do invite comparison of the different frequentist pvalues (for null of mean zero vs global null of no participants with an effect) for the examples in Figure 1. We focused instead on qualitative arguments in favour of the prevalence view (better match to qualitative hypothesis of interest in psychology and neuroimaging, which we believe should better consider effects within individual behaving brains, and greater robustness given that the effect is shown independently in multiple participants).
We would argue the simple generic simulation of the standard hierarchical population model shown in Figure 1 does demonstrate a situation where prevalence and population mean give different results for testing the same qualitative hypothesis (presence of a nonzero experimental effect in the population). This simulation directly contrasts inference of nonzero population mean (color of population curves, reported pvalues on the left), with the prevalence approach (pvalues for rejection of global null on the right, posterior prevalence curves in panel E). For example, the experiment simulated in Figure 1B shows very different results from the two approaches to testing the qualitative hypothesis “do participants in the population show a nonzero effect?”. The population mean provides no evidence for this (p=0.52), whereas the prevalence perspective provides very strong evidence (global null, p=2.2x10^{16}; purple posterior distribution in Panel 1E).
To give further intuition into situations where the NHST and Bayesian Prevalence can differ, we have added two new simulations of more concrete and realistic situations with similar divergence between the two approaches (Figure 2). These additional simulations don’t depend on symmetrical signed effects as the examples of Figure 1 (although we would like to emphasise that the presence of symmetrical signed effects is directly implied by the null hypothesis employed whenever a secondlevel ttest is used). The new Figure 2 shows simulated EEG data for two situations where prevalence can quantify robust population effects that are not visible from the population mean perspective. Panel A illustrates the alignment problem, robust effects are obtained in each participant, but at different poststimulus times. Panel B illustrates the homogeneity problem, here the participants are heterogeneous, with 50% showing a strong effect and 50% showing no effect. In both cases the population mean NHST fails to show an effect when correcting for multiple comparisons over time. In contrast, the prevalence method does show strong evidence for effects at the population level (with the same multiple comparison correction applied withinparticipants).
As an aside, it was not our intention to suggest between participant heterogeneity is uninteresting; only that it can present a problem for traditional populationmean based approaches. In fact, we think exploring individual differences in anatomy and function is an important area of research. Our point is that Bayesian prevalence can provide an inferential link to the population which is currently missing in such cases.
2. Tell us more about what inferences can actually be made using prevalence estimation! The experiment in Figure 1 lays out the differences between prevalence and population mean estimation quite clearly, but the conclusions are much less clear. The authors "suggest that often, if the goal of the analysis is to infer the presence of a causal relationship within individuals in the population, the withinparticipant perspective may be more appropriate" – but more appropriate for what? Exploringspecifically at this point in the paperwhich types of qualitative hypotheses (or some specific examples?) can be meaningfully addressed using prevalence analyses would, in my opinion, greatly strengthen the authors' case.
We aimed to present the prevalence approach as a completely general second level approach to drawing population level inferences from data obtained from multiple participants. In this sense, it is completely general and can test any qualitative hypothesis that can be quantified within individual participants (either via an effect size, or a withinparticipant NHST). Our argument is that, in many cases, the withinparticipant perspective may be more appropriate for quantifying the presence of a causal relationship (which could be any relationship tested with a randomised experiment and quantified with any test statistic or model). It can be applied to everything from behavioural modelling (including uncertainty, risk, expectation etc.), reaction time measures, multivariate decoding, encoding models for complex naturalistic stimuli, simple neuroimaging contrasts (face vs house), spectral analyses (phase or power response, phaseamplitude coupling etc.).
We have now moved the paragraph covering the breadth of possible applications to the start of this section to ensure it is in the front of the readers mind when considering the results of the simulations in Figure 1.
3. The argument that prevalence estimation is more sensitive to trials per participant than number of participants (Figure 2E) seems circular. The authors define prevalence as the fraction of subjects with "true positive (significant) test results" rather than the fraction where the null hypothesis is actually false, because they can only find a lower bound on the latter measure. Given this definition, the finding that "prevalence greater sensitivity to the number of trials per participant, and less sensitivity to the number of participants" (which is seemingly framed as the main conclusion of the analyses in Figure 2?) is highly unsurprising.
We agree this result is unsurprising when due consideration is given to the different perspective of prevalence. However, we still like to show it explicitly as most discussion of statistical power and power analysis in the psychology, neuroscience and neuroimaging literature considers only number of participants. So, we think it is important to emphasise this point – that the different perspectives have different properties, and that these give very different views in terms of experimental design. We also think these simulations can be used directly, or the scripts easily repurposed to provide simulation based “power analysis”. While not formal frequentist power analysis, these results could nevertheless provide some motivation for justifying number of participants in a particular study design. We have moved this to the end of the Results section to deemphasise.
4. How does estimating group differences in prevalence compare to estimating differences in group means? The experiment shown in Figure 3 is tantalizing, but what I really wanted to know here is whether (and under what conditions) comparing groups using prevalence might be more powerful than comparing groups using traditional models.
They are quantifying fundamentally different properties of the populations. In Figure 4, we have added an additional simulation to illustrate the differences between ttest for difference in means and prevalence difference between two groups. We show two groups whose means do not significantly differ, but for which there is evidence for a clear difference in the proportion (of the population, not just the samples) that shows a statistically significant withinparticipant effect. If the population mean is the primary interest, then the ttest should be used. If the proportion of participants who show the effect is of interest, then Bayesian prevalence provides a valid methodology for making statistical claims at the population level. Crucially, these are valid quantitative statistical statements about the population, and not just a case study of the specific collected samples, which is the criticism traditionally levelled at withinparticipant statistics, e.g. “It should be acknowledged that this approach only allows for statements that pertain to the existence and magnitude of effects in those subjects, rather than in the populations those subjects are drawn from.” https://doi.org/10.1523/JNEUROSCI.086620.2020
5. The analysis of prevalence as a function of effect size is missing context and conclusions. Ultimately, I was left without any sense of the utility of estimating prevalence for different effect sizes after reading this section (and Figure 4). What is this specific procedure useful for? What kinds of hypotheses can it test? How does it compare to other methods? Answering these questions would undoubtedly strengthen the paper.
The main point of this section is to answer the potential criticism that we are dichotomising within participant results and to show that this is not a requirement for valid quantification of population prevalence. Prevalence as a function of effect size gives a fuller picture of the distribution of effect sizes observed. One could argue this is also given by a standard density plot, for example in a violin or raincloud plot, but these are descriptive properties of the specific data sample, without any formal inferential link to the population.
The prevalence as a function of effect size provides a modelfree valid inferential link to the population and can be interpreted as Bayesian evidence for prevalence at the population level, rather than as a property only of the observed sample. Figure 6 shows six new simulations to illustrate situations where prevalence based on withinparticipant NHST at p=0.05 is equivalent (panels A,B,C and D,E,F respectively), but the prevalence effect size plot gives further information, at the population level, that clearly distinguishes the simulated populations.
Reviewer #2 (Recommendations for the authors):
This manuscript advocates for a shift in the statistical (and methodological) approaches taken in neuroscience and psychology, wherein effects are considered from the perspective of prevalence, rather than populationcentric NHST. They provide simple yet compelling simulations to make their case, and introduce the bayesian prevalence method, along with code to implement on one's own. This manuscript is incredibly well written, the simulations are tightly constructed, and will, I hope, become widely read by the neuroscience and psychology community. It's a straightforward treatment of a statistic that the field needs to consider shifting towards, and the ability to generalize to any NHST will no doubt make this user friendly for most all researchers. I will readily admit that this was a method that my own lab has been in search of for a while, and we plan to deploy this in our own work. Below, I outline my questions/comments, but they are largely meant to support clarification to a broader audience.
We thank the reviewer for their comments.
1. While the authors promote a Bayesian treatment of the estimate of prevalence, they readily admit that a frequentist approach is suitable as well. It would be worth detailing a bit more the advantages and disadvantages of the two, if any. Particularly because the frequentist approach to neuroimaging has been out in the wild for a while now.
We have added some comments on this to the Discussion section.
2. The authors simulate individuals within the population as having equal withinsubject (or unit) variability. To what degree are the assumptions for prevalence estimation upheld in the case where individual subjects have unequal variance?
The only assumptions for the Bayesian prevalence are that the participants are independent, and the false positive rate is properly controlled to the nominal α level within each participant. So long as the withinparticipant test is robust there is no problem with differences in variance between participants. We show here Author response image 1, a version of Figure 1 generated with betweensubject standard deviation randomly sampled from a lognormal distribution with mean log(10) and standard deviation log(3). Prevalence decreases w.r.t. Figure 1 in the paper because some participants have much larger variance and so there are fewer withinparticipant significant results, but the prevalence estimation is still valid and the interpretations remain the same. We have added text making this point explicit in section “How to apply Bayesian prevalence in practise”.
3. I assume this rests on the assumption of independence between units of replication (ie subjects, voxels, ROIs, cells). Is this true? And how robust is this method to a lack of independence? I ask because if one were carrying this method out on a voxelbyvoxel basis, within a participant, there's quite a bit of nonindependence between these voxels. And likely same goes for between adjacent ROIs, or even between time points. It would be helpful if the authors dedicated some text to this assumption.
The method does assume independence between “replication units” (most commonly thinking of participants) and so doesn’t address any hierarchical structure (e.g. time of day, time of year, testing lab, characteristics of university subject pool vs full population etc.). We would argue this is common to many statistical methods and needs to be evaluated by the particular scientific field as a “nonstatistical” consideration (Eddington and Onghena, “Randomization Tests”, CRC Press). For example, the secondlevel ttest used for NHST of the population mean also assumes the participants are independent samples of the population.
We agree that treating voxels or ROIs as the replication unit of interest would make this problem particularly pronounced, and we haven’t suggested that as an intended application. We propose to do population prevalence across participants (ie. the proportion of the population that would be expected to show an effect in a particular voxel or region), which is different to treating voxels as the replication unit of interest and attempting to estimate the prevalence of an effect in a population of voxels (within a subject, the entire population of voxels for that individual is available). To simplify, throughout the paper we have participants as the replication unit of interest over which prevalence is calculated. The multiplicity which results from multiple time points, sensors or voxels is dealt with using existing multiple comparison correction techniques, but applied at the individual participant, rather than the population level. We hope the new simulations in Figure 2 make this more explicit (prevalence is calculated over participants, either for a time window, or for each time point, but prevalence is not carried out over time points).
Another possible replication unit which we discuss is to consider well isolated single unit neuron recordings. We have expanded this point in the discussion with a clearer example application. In this case, we propose combining neurons from multiple animals to determine the proportion of neurons in a particular region (where the population is all neurons across all animals). Again, the issue of whether it is valid to ignore the hierarchical structure is a problemspecific nonstatistical consideration that the researcher must address. Perhaps it would be more reasonable to combine cells recorded in multiple sessions from the same animal (separate electrode insertions in the same brain region) to make an inference about the prevalence of an effect within the population of cells in that region in that individual if there is reason to suspect large variation between individual animals.
4. This comment is more to aid in promotion of the work, but I think the introduction and walkthrough of the simulation would be better serviced by using a case example throughout. It could be a psychophysics example, clinical psych example, or neuroscience example, doesn't matter. I think it would simply help the reader better concretize the reasoning this way. This is entirely up to the authors, but I imagined that the broader audience that they (and I) want to reach would be more likely to engage if this was more explicitly related to their work.
We have kept the simulations the same, but we have moved up to the start of this section the paragraph discussing the breadth and generality of possible applications, inviting readers to keep in mind the most relevant application to their work when considering the results of the simulations in Figure 1.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Reviewer #1 (Recommendations for the authors):
In their revised manuscript, Ince and colleagues have added a number of illustrative simulations and clarified several important points, substantially improving the final product. In particular, the simulated EEG analysis (new Figure 2) is an extremely clear and powerful demonstration of the advantages of prevalence estimation over population mean inference for a realistic problem. This is exactly the type of comparison that seemed to be missing in the initial submission, and its inclusion strengthens the paper substantially. The new simulations in Figures 4 and 6 also clarify the related points. Overall, I think the current manuscript is excellent, interesting, and timely.
We thank the reviewer for their comments. The paper evolved originally from a short perspective piece. We were aiming to keep the presentation general without locking in to one particular modality (like EEG) that not all readers might be familiar with. But we agree adding Figure 2 improves the presentation a lot and thank the reviewer for the suggestion.
This is spot on. The hypotheses tested by these methods are different, but the key question is how these (quantitative) hypotheses map to the implicit, qualitative hypotheses that are actually the subject of inquiry. On the discussion of this point, the current manuscript does not differ substantially from the original. Yet I still found myself much more convinced of its correctness this time around than the first. To this I credit the new simulations that the authors added (Figure 2 and 4). By directly comparing the results of the two methods, these simulations highlight how much better the hypotheses tested by prevalence estimation align with intuitive/qualitative hypotheses than do ttests.
We thank the reviewer for such a frank opinion, which we of course share. However, we don’t want to fall into the trap of statisticians telling users what they want to know http://daniellakens.blogspot.com/2017/11/thestatisticiansfallacy.html We try to make the argument that the population mean is not the only population property we can formally quantify. In many cases prevalence might be of interest and particularly so in cognitive domains, where we would argue that prevalence might often better match the qualitative scientific question of interest. We agree that the new Figures make this point much more clearly and hope our new Figure 6 will similarly add to the weight of the argument. We thank the reviewer for their input that helped us understand how to better communicate the shift of perspective from means to prevalence.
Minor aside: oddly, I think Figure 1 actually does the authors' arguments something of a disservice on this central point. Yes, it clearly illustrates how the two methods differ. But it presumes a world in which all the assumptions of the hierarchical population mean test are met. This leads to a situation where the standard method matches the reader's expectations (/qualitative hypotheses) much more closely than does the prevalence method. The authors acknowledge the weirdness of the twosided effects seen in Figure 1B, and point out that this weirdness is actually implicit in the standard hierarchical model, but I worry that by this point the damage is already done. I'm not sure how best to do this – Figure 1 is a very clear illustration of the methods – but it could be worthwhile to consider reorganization that deprioritizes this simplebutweird example in favor of less weird examples like Figure 2.
We understand what the Reviewer is suggesting, weighted both options, even tried to change the presentation as suggested. However, after careful thinking the logic of presentation of current Figure 2 still requires that the prevalence method is clearly defined. The best way to do this is to introduce it step by step (from the idea of withinparticipant significance providing evidence against a population effect given by the global null). So, in the end, we decided to leave current Figures 1 and 2 figures in this order, for the purpose of logical clarity of presentation. We have added a sentence to signpost Figure 2 in the last paragraph of the Population vs withinparticipant statistical tests section.
Authors (regarding prevalence as a function of effect size): The main point of this section is to answer the potential criticism that we are dichotomising within participant results and to show that this is not a requirement for valid quantification of population prevalence. Prevalence as a function of effect size gives a fuller picture of the distribution of effect sizes observed. One could argue this is also given by a standard density plot, for example in a violin or raincloud plot, but these are descriptive properties of the specific data sample, without any formal inferential link to the population.
The new simulations shown in Figure 6 certainly clarify this point by showing examples with different underlying population statistics. I'm not specifically asking for anything more – I think this is sufficiently convincing asis – but I think this would be stronger still with a more concrete example that shows how the prevalence/effect size curve could appear in different realworld situations, with implications for the reader's qualitative hypotheses. For an example like the question posed by Reviewer 2, this curve would look quite different if the individual subject standard deviation was not fixed, but was instead drawn from some distribution. Again, this is not critical, but a more realworldlike example would make this section more convincing.
We have produced a new Figure 6, which uses the same EEG simulation setting as Figure 2 to motivate three different situations where the overall prevalence of p=0.05 is similar, but the effect size prevalence curve reveals interpretable differences between the simulated data sets. We hope this makes a more intuitive and clearer link between data and effect size prevalence results. As these three new examples cover the main features of the old Figure 6 (different variances and subgroups) we replaced the old figure with this new one.
https://doi.org/10.7554/eLife.62461.sa2Article and author information
Author details
Funding
Wellcome Trust (214120/Z/18/Z)
 Robin AA Ince
 Angus T Paton
Wellcome Trust (107802)
 Philippe G Schyns
Engineering and Physical Sciences Research Council (MURI 1720461)
 Philippe G Schyns
Royal Society (RSWF\R3\183002)
 Philippe G Schyns
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Carsten Allefeld for useful discussions. RAAI was supported by the Wellcome Trust [214120/Z/18/Z]. PGS was supported by the EPSRC [MURI 1720461] and the Wellcome Trust [107802]. PGS is a Royal Society Wolfson Fellow [RSWF\R3\183002].
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 John T Serences, University of California, San Diego, United States
Reviewers
 Alex Huth, University of Texas, Austin, United States
 Sam Ling
Publication history
 Received: August 25, 2020
 Accepted: July 26, 2021
 Version of Record published: October 6, 2021 (version 1)
Copyright
© 2021, Ince et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,533
 Page views

 159
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.