Bayesian analysis of phase data in EEG and MEG

  1. Sydney Dimmock (corresponding author)
  2. Cian O'Donnell
  3. Conor Houghton
  1. Faculty of Engineering, University of Bristol, United Kingdom
  2. School of Computing, Engineering & Intelligent Systems, Ulster University, United Kingdom

Abstract

Electroencephalography and magnetoencephalography recordings are non-invasive and temporally precise, making them invaluable tools in the investigation of neural responses in humans. However, these recordings are noisy, both because the neuronal electrodynamics involved produces a muffled signal and because the neuronal processes of interest compete with numerous other processes, from blinking to day-dreaming. One fruitful response to this noisiness has been to use stimuli with a specific frequency and to look for the signal of interest in the response at that frequency. Typically, this signal is quantified by measuring the coherence of the response phase: here, a Bayesian approach to measuring phase coherence is described. This Bayesian approach is illustrated using two examples from neurolinguistics and its properties are explored using simulated data. We suggest that the Bayesian approach is more descriptive than traditional statistical approaches because it provides an explicit, interpretable generative model of how the data arises. It is also more data-efficient: it detects stimulus-related differences for smaller participant numbers than the standard approach.

Editor's evaluation

This important work advances the available statistical methods for estimating the degree to which the neural response is phase-locked to a stimulus. It does so by taking a compelling Bayesian approach that leverages the circular nature of the phase readout and demonstrates the added value of the approach in both simulated and empirical datasets.

https://doi.org/10.7554/eLife.84602.sa0

eLife digest

Phase coherence is a measurement of waves, for example, brain waves, which quantifies the similarity of their oscillatory behaviour at a fixed frequency. That is, while the waves may vibrate the same number of times per minute, the relative timing of the waves with respect to each other may be different (incoherent) or similar (coherent).

In neuroscience, scientists study phase coherence in brain waves to understand how the brain responds to external stimuli, for example if they occur at a fixed frequency during an experiment. To do this, phase coherence is usually quantified with a statistic known as ‘inter-trial phase coherence’ (ITPC). When ITPC equals one, the waves are perfectly coherent, that is, there is no shift between the two waves and the peaks and troughs occur at exactly the same time. When ITPC equals zero, the waves are shifted from each other in an entirely random way.

Phase coherence can also be modelled on phase angles – which describe the shift in each wave relative to a reference angle of zero – and wrapped distributions. Wrapped distributions are probability distributions over phase angles that express their relative likelihood. Wrapped distributions have statistics, including a mean and a variance. The variance of a wrapped distribution can be used to model phase coherence because it explicitly represents the similarity of phase angles relative to the mean: larger variance means less coherence.

While the ITPC is a popular method for analysing phase coherence, it is a so-called ‘summary statistic’. Analyses using the ITPC discard useful information in the trial-level data, information which might not be lost when working directly with phase angles.

Thus, Dimmock, O’Donnell and Houghton set out to determine whether they could create a model of phase coherence that works directly on phase angles (rather than on the ITPC) and yields better results than existing methods.

Dimmock, O’Donnell and Houghton compare their model to the ITPC using both experimental and simulated data. The comparison demonstrates that their model can detect entrainment of the brain to grammatical phrases, compared to ungrammatical ones, at smaller sample sizes than the ITPC, and with fewer false positives. Traditional tools for studying how the brain processes language often yield a lot of noise in the data, which makes measurements difficult to analyse. Dimmock, O’Donnell and Houghton also demonstrate that the brain is not simply responding to the ‘surprise factor’ of words in a phrase, as some have suggested, but also to their grammatical category.

The results of this study will benefit scientists who analyse phase coherence. By using the model in addition to other approaches to study phase coherence, researchers can provide a different perspective on their results and potentially identify new features in their data. This will be particularly powerful in studies with small sample sizes, such as pilot studies where maximising the use of data is important.

Introduction

In an electroencephalography (EEG) or magnetoencephalography (MEG) frequency-tagged experiment, the stimuli are presented at a specific frequency and the neural response is quantified at that frequency. This provides a more robust response than the typical event-related potential (ERP) paradigm because the response the brain makes to the stimuli occurs at the predefined stimulus frequency while noise from other frequencies, which will correspond to other cognitive and neurological processes, does not contaminate the response of interest. This quantification is often approached by calculating the inter-trial phase coherence (ITPC). Indeed, estimating coherence is an important methodological tool in EEG and MEG research and is used to answer a wide variety of scientific questions. There is, however, scope for improving how the phase coherence is measured by building a Bayesian approach to estimation. This is a per-item analysis and gives a better description of uncertainty. In contrast, the ITPC discards information by averaging across trials. As a demonstration, both approaches are compared by applying them to two different frequency-tagged experimental datasets and through the use of simulated data.

Frequency tagging is a well-established tool in the study of vision, where it is often referred to as the steady-state visual-evoked potential (Regan, 1966). At first, it was predominantly used to study low-level processing and attention (see Norcia et al., 2015 for a review). Latterly, though, it has been used for more complex cognitive tasks, such as face recognition and discrimination (Alonso-Prieto et al., 2013; Farzin et al., 2012; Liu-Shuang et al., 2014), perception of number (Guillaume et al., 2018; Van Rinsveld et al., 2020), and the ‘human quality’ of dance movements (Alp et al., 2017). It has been applied to other modalities: audition (Galambos et al., 1981; Picton et al., 2003; Bharadwaj et al., 2014), somatosensation (Tobimatsu et al., 1999; Galloway, 1990), and nociception (Colon et al., 2012; Colon et al., 2014). It has been used to study broad phenomena like memory (Lewis et al., 2018) and lateralisation (Lochy et al., 2015; Lochy et al., 2016) along with more specific types of neurocognitive response, such as visual acuity (Barzegaran and Norcia, 2020) and the perception of music (Nozaradan, 2014). Furthermore, frequency tagging can be used in the assessment of disorders such as autism (Vettori et al., 2020a; Vettori et al., 2020b) and schizophrenia (Clementz et al., 2008). It has even been used to study neural responses to social interaction (Oomen et al., 2022). Beyond EEG and MEG, phase coherence has been proposed as a mechanism for signal routing (Abeles, 1982; Salinas and Sejnowski, 2001; Börgers and Kopell, 2008), assembly formation (Singer, 1999; Buzsáki, 2010), and coding (O’Keefe and Recce, 1993; Panzeri et al., 2010), so the measurement of phase coherence for electrocorticography, local field potentials, and neuronal spiking is important for the neuroscience of neuronal systems.

One striking application of frequency tagging is in neurolinguistics (Ding et al., 2016; Ding et al., 2017). Neurolinguistic experiments are difficult; since language experiments inevitably involve humans and target a phenomenon whose temporal grain is often too fine for magnetic resonance imaging, the principal neural imaging techniques are EEG and MEG. However, the complexity of the neural processing of language makes the signals recorded in these experiments particularly noisy, difficult to analyse, and difficult to disentangle from other cognitive processes. A further difficulty in neurolinguistics is that an ERP is often difficult to obtain because the tens or hundreds of repetitions required would render the stimulus meaningless to the participant, a phenomenon known as semantic satiation (Jakobovits, 1962).

Consider, as an example, the frequency-tagged experiment described in Burroughs et al., 2021 and following the paradigm shown in Ding et al., 2017. This is used here as a paradigmatic example of a frequency-tagged experiment in neurolinguistics and, although the description here is specific to this example, much of the methodology is typical. In Burroughs et al., 2021, the neural response to phrases was investigated by comparing the response to grammatical adjective–noun (AN) phrases

...old rat_ sad man ill_ wife_...

to ungrammatical adjective–verb (AV) pairs

...rough give ill tell thin chew ...

where care had been taken to have a similar 2-gram frequency for adjacent word pairs in each condition. The words are all presented at 3.125 Hz; however, the frequency of interest is the phrase rate, 1.5625 Hz, corresponding to the phrase structure of the AN stimuli (see Figure 1). In Burroughs et al., 2021, it is suggested that the strength of the response to AN stimuli relative to AV stimuli at this frequency measures a neural response to the grammatical structure, rather than to the lexical category of the words.

The syntactic target for the experiment.

In the adjective–noun (AN) stimulus, there are noun phrases at the phrase rate, 1.5625 Hz; this structure is absent in the adjective–verb (AV) stimulus because AV pairs do not form a phrase. 3.125 Hz corresponds to the syllable rate in this experiment.

This investigation required a quantitative measurement of the strength of the response. The obvious choice, the induced power at 1.5625 Hz, does not work: empirically, it proves too noisy a quantity for the stimulus-dependent signal to be easily detected. Indeed, although the frequency tag produces a more robust signal than an ERP, for more high-level or cognitive tasks, particularly the neurolinguistic tasks where frequency tagging is now proving valuable, the power is not a useful measure; more needs to be done to remove the noise. Typically, this is done by assuming the response is phase-locked to the stimulus, and so for frequency-tagged data in cognitive tasks it is common to use the ITPC. The ITPC is defined using the mean phase angle:

(1) $\mathbf{R}(f,\phi) = \frac{1}{K}\sum_{k} e^{i\theta_{fk\phi}}$

where f is the frequency, k is the trial index, K is the number of trials, ϕ represents other indices such as electrode number or experimental condition, and $\theta_{fk\phi}$ is the phase of the complex Fourier coefficient for the EEG or MEG trace (f,k,ϕ). Across different applications, the mean phase angle is often called the mean resultant, a term we will use here. The ITPC is the length of the mean resultant:

(2) $R(f,\phi) = \left|\mathbf{R}(f,\phi)\right|$

The ITPC is chosen as a quantitative measure to extract the portion of the response at the frequency of interest that is phase-locked to the stimulus and therefore consistent in phase from trial to trial. This is another of the denoising strategies required by these noisy data.
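To make Equations 1 and 2 concrete, here is a minimal numpy sketch (our illustration, not the original analysis code); `phases` is assumed to hold the Fourier phase angles for one frequency, electrode, and condition, with trials along the first axis.

```python
import numpy as np

def mean_resultant(phases, axis=0):
    """Mean resultant vector (Equation 1): the average of the unit phasors over trials."""
    return np.mean(np.exp(1j * phases), axis=axis)

def itpc(phases, axis=0):
    """Inter-trial phase coherence (Equation 2): the length of the mean resultant."""
    return np.abs(mean_resultant(phases, axis=axis))

# Example: 24 trials of perfectly coherent phases versus uniformly random phases.
rng = np.random.default_rng(0)
coherent = np.full(24, 0.3)                   # identical phase on every trial
incoherent = rng.uniform(-np.pi, np.pi, 24)   # no phase locking
print(itpc(coherent))    # 1.0
print(itpc(incoherent))  # small, though biased above zero for finite trial counts
```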

In the case of the experiment we are discussing here, the hold-all index ϕ is made up of participant, condition, and electrode indices. To produce a result, the ITPC is averaged over the 32 electrodes used in the experiment to give R(f,p,c), where p labels participants, c labels conditions, and f labels frequency. The principal result of Burroughs et al., 2021 is that, in the language of frequentist statistics, R(f=1.5625 Hz, p, c=AN) is significantly larger than R(f=1.5625 Hz, p, c=AV).

The results of these experiments analysed using ITPC are summarised in Figure 2. This plots the ITPC measure for all six experimental conditions: the two, AN and AV, that have already been described, and four others; a table of the experimental conditions is provided in Appendix 3. In Figure 2A, it can be seen that there is a strong peak in ITPC at the syllable rate, 3.125 Hz, and, in the case of AN, at the phrase rate, 1.5625 Hz. The ITPC at the phrase rate is graphed in Figure 2B, where, again, it appears only AN has a response at the phrase rate.

Summarising the inter-trial phase coherence (ITPC) for different conditions.

(A) Coloured lines show the ITPC for each participant, averaged over electrodes and traced across all frequencies. The mean trace, calculated by averaging over all participant traces, is overlaid in black. Vertical lines mark the sentence, phrase, and syllable frequencies, in increasing order. (B) Statistical significance was assessed with an uncorrected paired two-sided Wilcoxon signed-rank test (*p<0.05, **p<0.01). (C) ITPC differences for each condition pair calculated at the phrase frequency and interpolated across the skull. Filled circular points mark clusters of electrodes that were discovered to be significantly different (p<0.05) by the cluster-based permutation test.

In Burroughs et al., 2021, all analyses are done for ITPC averaged across electrodes; nonetheless in Figure 2C, we show the condition-to-condition difference in ITPC at each electrode, but averaged across participants:

(3) $\Delta R_{ce} = \left\langle \Delta R_{pce} \right\rangle_{p}$

where

(4) $\Delta R_{pce} = R(f{=}1.5625\text{ Hz}, p, e, c{=}\text{AN}) - R(f{=}1.5625\text{ Hz}, p, e, c{=}\text{RR})$.

Visually these data appear to show a left temporal and right parietal response. However, the data are noisy and deciding the significance of any comparison is complicated: a straightforward calculation of the p-value using an uncorrected paired two-sided Wilcoxon signed-rank test gives a value <0.05 for 54 of the 480 possible comparisons. This includes some comparisons that fit well with the overall picture, for example, when comparing AN to AV the P4 electrode shows a difference with p=0.002 and the T7 electrode with p=0.044. It also includes some more surprising results: for the comparison of RR to RV, the CP5 electrode has p=0.0182 and the FC1 electrode has p=0.0213. If we interpret these as significant differences, the apparent difference between two conditions without any phrase structure is odd and presumably misleading. However, a naïve Bonferroni correction would use a factor of 32×15 and, in a manner typical of this very conservative approach, it massively reduces the number of significantly different responses, to one in this case.

As part of our reanalysis of these data, we used cluster-based permutation testing (Maris and Oostenveld, 2007) to identify significant clusters of electrodes for each ITPC condition difference, thereby providing a quantification of the observed structure in the headcaps. See Appendix 4 for an outline of the method. This statistical test is a widely adopted approach to significance testing of EEG data because it does not fall prey to the high false-negative rate of a Bonferroni correction and takes advantage of spatiotemporal correlations in the data. However, this is different to testing individual electrode effects, and we must be careful to articulate this difference. With this method it is not possible to make inferential claims about the strength of the effect of any one electrode appearing in a significant cluster; electrodes appearing in significant clusters cannot themselves be declared significant, as these comparisons are not controlled for under the null (Sassenhagen and Draschkow, 2019). As will be discussed later, this is a weaker claim than one based on the Bayesian posterior, which can quantify this effect through a probabilistic statement.

There are a number of disadvantages to the ITPC. The most obvious problem is that the item in the statistical analysis of ITPC is a participant, not a trial. In the results described in Burroughs et al., 2021, the statistical significance relied on a t-test between conditions with a pair of data points for each participant: there are actually 24 trials for each participant, but these are used only to calculate the ITPC values. Some of the analysis in Burroughs et al., 2021 is done using 20 participants, some using 16; sticking to the latter for simplicity, the hypothesis testing is performed using 16 pairs of values, rather than 16 × 24 = 384 items, or even 16 × 24 × 32 = 12,288 items if the individual electrodes are included. In short, the ITPC is itself a summary statistic, a circular version of variance, and so it hides the individual items inside a two-stage analysis:

(5) $\text{items} \rightarrow \text{ITPC} \rightarrow \text{statistical analysis}$

However, this is hard to rectify: it is difficult to compare items across participants, or across electrodes, because the mean phase, $\text{phase}[\mathbf{R}(f,\phi)]$, is very variable and not meaningful to the scientific questions of interest. This variability is graphed in Figure 4: this figure shows the value of

(6) $\mu_{pe} = \text{phase}[\mathbf{R}(f{=}1.5625\text{ Hz}, p, e, c{=}\text{AN})]$

the phase of the mean resultant for the AN condition. For illustrative purposes, three example electrodes are picked out and the distribution across participants is plotted. What is clear is how variable these phases are; this means that individual responses cannot be compared across participants and electrodes since p and e have such a strong effect on their value.

There are other classical tests of coherence which use phase information. One example is the Rayleigh test (Rayleigh, 1880; Rayleigh, 1919); this test can be used to check for a significant departure either from uniformity or from the ‘expected phase’, that is, a particular phase angle specified by the researcher based on some other insight into the behaviour. Other tests, such as Hotelling’s $T^2$, apply jointly to phase and amplitude (Hotelling, 1931; Picton et al., 2001; Picton et al., 1987). These classical approaches are incompatible with the neurolinguistic study presented here. Firstly, it would be difficult to provide an expected phase; as demonstrated in Figure 4, the mean phase angle is highly variable across participants, and there is no substantive prior information that could supply this value, because it varies from experiment to experiment. Secondly, because of the problem of semantic satiation, the experiments we consider here are relatively short and lack the frequency resolution these classical approaches require.

Here we provide a Bayesian approach to phase data. We believe this has advantages when compared to the ITPC: it permits a per-item analysis and correspondingly a more statistically efficient and richer use of the data. Furthermore, as a Bayesian approach, it supports a better description of the data because it quantifies uncertainty and because it describes a putative abstraction of the stochastic process that may have generated the data while explicitly stating the underpinning assumptions. This replaces a hypothesis-testing and significance-based account with a narrative phrased in terms of models and their consequences, so, in place of an often contentious or Procrustean framework based on hypotheses, a Bayesian approach describes a putative model and quantifies the evidence for it.

A Bayesian account starts with a parameterised probabilistic model of the data. The model proposes a distribution for the data given a set of model parameters: this is the likelihood. In our case, the likelihood will be the probability distribution for the phases of the responses, given our model. Our interest is in how the variance of this distribution depends on the condition. In addition to the likelihood, the full Bayesian model also includes a prior distribution for parameters, for example, it includes priors for the parameters which determine the relationship between the condition and the distribution of phases. The goal is to calculate the posterior distribution, the probability distribution for the parameters given the experimental data; this follows from Bayes’ theorem:

(7) $P(\Theta \mid \Delta) = \frac{P(\Delta \mid \Theta)\,P(\Theta)}{P(\Delta)}$

where Θ are the parameters and Δ the data. Essentially, P(Δ|Θ) is the likelihood, the distribution of the data given some parameters: the goal is to take the data and from them calculate the posterior distribution of the parameters: P(Θ|Δ). The denominator P(Δ) can usually be ignored because it is just a normalising constant that is independent of the parameter values and therefore does not change the shape of the posterior distribution. There are a number of excellent descriptions of the Bayesian approach to data, including the textbook (Gelman et al., 1995) and a recent review (van de Schoot et al., 2021); our terminology and notation will often follow conventions established by the textbook (McElreath, 2018).

In many ways Bayesian descriptions are more intuitive and easier to follow than the frequentist approaches that have been favoured over the last century. The impediment to their use has been the difficulty of calculating the posterior distribution. These days, however, powerful computing resources and new insights into how to treat these models mean that there are a variety of approaches to estimating the posterior; one approach, the one used here, is to sample from the posterior without calculating it analytically using Markov chain Monte Carlo (MCMC) techniques. Probabilistic programming languages such as Stan (Carpenter et al., 2017) and Turing (Ge et al., 2018) make it easy to use advanced MCMC sampling methods such as Hamiltonian/Hybrid Monte Carlo (HMC) and the no U-turn sampler (NUTS) (Duane et al., 1987; Neal, 2011; Betancourt, 2013), making the complexity of a frequentist analysis unnecessary. Here, we report results calculated using Stan, though many of the computations were carried out in both Stan and Turing.

Materials and methods

Data

In this article, we consider two experimental datasets and simulated data. The first experimental dataset is the phrase data described above; this can be considered the primary example, and these data are familiar to us and formed the target data while developing the approach. In this section, the methods are described with reference to these particular data; we believe using a particular example allows us to describe the method with greater clarity. However, to demonstrate the generality of the approach we also apply it to another dataset measuring statistical learning of an artificial language. These data are described briefly here. The experiments we performed with simulated data used data generated by the Bayesian model with different effect sizes; this is described in the ‘Results’ section.

The second experimental dataset is related to statistical learning of an artificial language. Statistical learning refers to the implicit capacity of the brain to extract the rules and relationships between different stimuli in the environment. We used our Bayesian model to analyse data from an interesting frequency-tagged experiment that investigated statistical learning in an artificial language task (Pinto et al., 2022). In the experiment, 18 syllables are arranged into six three-syllable pseudowords and played in a stream so that the transition probabilities inside a pseudoword are 1, while the transition probability between the last syllable of one pseudoword and the first of another is 0.2. The goal is to assess the degree to which pseudowords are learned. A frequency-tagged paradigm was used: the syllables were presented at a constant rate f, such that three-syllable pseudowords had a frequency of f/3. Evidence of statistical learning can then be quantified using ITPC at this frequency and its harmonics.

In the experiment, syllables were presented at 4 Hz, resulting in a three-syllable pseudoword frequency of 1.33 Hz. Each participant was subject to two conditions, which we refer to as baseline (BL) and experimental (EXP) in line with Pinto et al., 2022. For the EXP condition, there were six pseudowords with a between-word transitional probability of 0.2; a summary of these is shown in Figure 3. The BL condition contained the same 18 syllables as EXP, but adopted different transitional probability rules to remove regularities. In this study, recordings were taken from 40 adult participants (25 females, 35 right-handed, ages 20–38) using 64 electrodes, over three blocks of 44 trials; of these participants, 39 had complete EEG recordings to use in the analysis. In the ‘Results’ section, we recapitulate the original ITPC analysis of these data and consider a Bayesian model.

Pseudowords and position-controlled syllables.

All six pseudowords used in the experiment are numbered. Groups of position-controlled syllables have been coloured.

Circular distributions


Here, the data are a set of phases and so the model for the data is a probability distribution on a circle. The motivation which informs the ITPC is that the phases to a greater or lesser extent have a unimodal distribution around the circle and so the model should be a unimodal distribution on the circle (see Figure 5). One common class of distributions on the circle is given by the wrapped distributions with probability density function

(8) $p(\theta) = \sum_{n=-\infty}^{\infty} p_r(\theta + 2\pi n)$

where $p_r(x)$ is a probability density of a distribution on the real line and θ is the angle. It might seem that the obvious choice for $p_r(x)$ would be the Gaussian distribution. In fact, the wrapped Gaussian distribution is not a very satisfactory example of a wrapped distribution because p(θ) cannot be calculated in closed form. A much better example is the Cauchy distribution

(9) $p_r(x) = \frac{1}{\pi\gamma}\,\frac{\gamma^2}{(x - x_0)^2 + \gamma^2}$

where $x_0$ is a parameter giving the median of the distribution and γ is a scale parameter; the corresponding wrapped distribution has the closed form:

(10) $p(\theta) = \frac{1}{2\pi}\,\frac{\sinh\gamma}{\cosh\gamma - \cos(\theta - \mu)}$

and, in contrast to the Cauchy distribution on the real line, where the moments are not defined, the wrapped distribution has a well-defined and convenient value for the mean resultant:

(11) $\mathbf{R} = e^{i\mu - \gamma}$

The circular variance S of the wrapped Cauchy distribution can be derived from the length of this complex vector:

(12) $S = 1 - |\mathbf{R}|$
(13) $\;\;\;\, = 1 - e^{-\gamma}$

Thus, as illustrated in Figure 5A, a large value of γ corresponds to a highly dispersed distribution; a low value to a concentrated one. With this explicit relationship between parameter values and the mean resultant, the Cauchy distribution is a convenient choice for our model.
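To make these relationships concrete, the following sketch (ours, not the paper's code) implements the closed-form density of Equation 10 and checks Equations 11 and 13 by Monte Carlo; a wrapped Cauchy sample is obtained by drawing from the real-line Cauchy of Equation 9 and wrapping the result to (-π, π].

```python
import numpy as np
from scipy.stats import cauchy

def wrapped_cauchy_pdf(theta, mu, gamma):
    """Closed-form wrapped Cauchy density (Equation 10)."""
    return np.sinh(gamma) / (2 * np.pi * (np.cosh(gamma) - np.cos(theta - mu)))

# Draw from the real-line Cauchy (Equation 9) and wrap to (-pi, pi].
rng = np.random.default_rng(1)
mu, gamma = 0.5, 0.7
theta = np.angle(np.exp(1j * cauchy.rvs(loc=mu, scale=gamma, size=100_000,
                                        random_state=rng)))

# The sample mean resultant should have length exp(-gamma) (Equation 11),
# giving circular variance S = 1 - exp(-gamma) (Equation 13).
R = np.mean(np.exp(1j * theta))
print(np.abs(R), np.exp(-gamma))          # both approximately 0.50
print(1 - np.abs(R), 1 - np.exp(-gamma))  # circular variance S
```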

Prior distributions


The next important element is the choice of priors, both for the mean of the wrapped distribution, µ, and for γ, which determines how dispersed the distribution is. The prior for µ is the more straightforward: a different value of µ is required for each participant, condition, and electrode. This prior should be uniform over the unit circle (Figure 4). Although there are likely to be correlations in µ values for the same electrode across participants, and across the electrodes for a given participant, since the value of µ is not of interest, it is convenient to ignore this and pick an independent value μpce for each triplet of participant–condition–electrode values. Future studies that aim to extend this model could consider adding correlations to µ.

Mean phases are uniform across participants.

The left-hand panel shows the distribution of phases across electrodes for each participant: each column corresponds to one participant and each dot marks the mean phase µ for one of the 32 electrodes calculated at the phrase frequency for the adjective–noun (AN) grammatical condition. To show how a given electrode varies across participants, three example electrodes are marked, T7 in green, P4 in orange, and F8 in purple. The right-hand panel shows the distribution of mean phases, µ, across participants.

Since µ has a uniform prior over the unit circle, it would seem that the correct prior is

(14) $\mu_{pce} \sim \text{Uniform}(-\pi, \pi)$

This is, however, wrong; the ideal distribution is a uniform distribution on the circle, not on the interval, and while the uniform distribution on an interval has the same numerical probability values, it has a different topology. This matters when sampling using an MCMC method. In MCMC, to create the list of samples, referred to as the chain, the sampler moves from sample to sample, exploring the parameter space. In this exploration for µ, if the posterior value is, for example, close to π, then the chain should explore the region near to π, which includes values near -π in [-π,π). A small change should move the sampler from π to -π. However, dynamics on the interval [-π,π) can only get from one to the other by traversing the potentially low-likelihood interval in between. Nothing in the mathematical description of the common MCMC samplers, such as NUTS, prevents the prior from being defined on a circle or other compact region. However, there is a problem: the current high-quality implementations of these methods do not allow priors over circles.
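A small numerical illustration (ours) makes the problem concrete: two angles that are adjacent on the circle are maximally distant on the interval [-π, π), so an interval-based sampler has to cross the whole domain to move between them.

```python
import numpy as np

def interval_distance(a, b):
    """Distance when angles are treated as points on the interval [-pi, pi)."""
    return abs(a - b)

def circular_distance(a, b):
    """Distance on the circle: the magnitude of the angle of the phasor ratio."""
    return abs(np.angle(np.exp(1j * (a - b))))

a, b = np.pi - 0.01, -np.pi + 0.01
print(interval_distance(a, b))  # ~6.26: far apart on the interval
print(circular_distance(a, b))  # 0.02: adjacent on the circle
```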

Sampling from a circular prior distribution


As a practical approach to avoiding this difficulty, we introduce a two-dimensional distribution which, in polar coordinates, is uniform in the angle coordinate and in the radial part restricts sampling to a ring around the origin. Because its probability density function resembles the Bundt cake tin used to make kugelhopf (Hudgins, 2010, see Figure 5B), this will be referred to as a Bundt distribution. The choice of the radial profile of the Bundt distribution is not critical; its purpose is to restrict the samples to a ring: we sample points (x,y) on a plane so that their radius $\rho = \sqrt{x^2 + y^2}$ is drawn from a gamma distribution

(15) $\rho \sim \text{Gamma}(10, 0.1).$
Model construction and geometry.

(A) Example wrapped Cauchy density functions for different values of the scale parameter. (B) The Bundt distribution has a shape reminiscent of a cake made in a Bundt tin. (C) The mean phase is sampled from an axially symmetric prior distribution with a soft constraint on the radius. Highlighted pairs (x,y) give the location of example points; these points correspond to the mean angle for a wrapped Cauchy distribution using angle(x,y). (D) An example distribution for S=1-R which determines the γ parameter in the wrapped Cauchy distribution. S is related to other parameters, such as condition and participant number, through a logistic regression, as in Equation 19; the priors for the slopes in the regression are used to produce the distribution shown here. As before, two example points are chosen; each will correspond to a different value of γ in the corresponding wrapped Cauchy distribution. (E) Example wrapped Cauchy distributions are plotted in correspondence with the numbered prior proposals in (C) and (D).

giving what we will call a Bundt-gamma distribution. This distribution has mean 1 and standard deviation 0.1, giving the golden ring of likely (x,y) values seen in Figure 5C. In fact, the radial values are not used in the model; what is used is the angle:

(16) $\mu_{pce} = \text{angle}(x_{pce}, y_{pce})$
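A minimal sketch of sampling from this Bundt-gamma prior (Equations 15 and 16) follows; we read Gamma(10, 0.1) as shape 10 and scale 0.1, which gives a unit-mean radius, and, as in the model, only the angle of each point is retained.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Radius drawn from Gamma(10, 0.1), read as shape 10 and scale 0.1 (Equation 15);
# the angle is uniform on the circle.
rho = rng.gamma(shape=10.0, scale=0.1, size=n)
phi = rng.uniform(-np.pi, np.pi, size=n)

# Cartesian coordinates of the Bundt-gamma samples: the ring in Figure 5C.
x, y = rho * np.cos(phi), rho * np.sin(phi)

# Only the angle enters the model (Equation 16); the radius is discarded.
mu = np.arctan2(y, x)
print(rho.mean())  # the radius concentrates around 1
```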

A linear model for the scale of the wrapped Cauchy distribution


The final element of the model is the prior for γ; obviously the intention is to have this depend on the condition. To make our priors easier to interpret, it is convenient to use a link function, first converting from γ to the circular variance S:

(17) $\gamma_{pce} = -\log(1 - S_{pce})$

S is bounded between 0 and 1, so a second link function is applied

(18) $S_{pce} = \sigma(\upsilon_{pce})$

where σ(υ) is the logistic function. The quantity υpce quantifies the effect of participant, condition, and electrode on response. In this model it is linear

(19) $\upsilon_{pce} = \alpha_c + \beta_{pc} + \delta_{ce}$

so αc is understood as quantifying the effect of condition, βpc the effect of participant, and δce the effect of electrode. In the language of regression, these are slopes. In the case of βpc and δce, experimenting with different models has demonstrated a better fit when these are interaction terms, allowing the effects of participant and electrode, respectively, to be condition dependent.
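The chain of link functions in Equations 17-19 is simple to express directly; in this sketch the slope values are illustrative placeholders, not posterior estimates.

```python
import numpy as np

def scale_from_slopes(alpha_c, beta_pc, delta_ce):
    """Map regression slopes to the wrapped Cauchy scale (Equations 17-19)."""
    upsilon = alpha_c + beta_pc + delta_ce   # linear predictor (Equation 19)
    S = 1.0 / (1.0 + np.exp(-upsilon))       # logistic link to circular variance (Equation 18)
    gamma = -np.log(1.0 - S)                 # invert S = 1 - exp(-gamma) (Equation 17)
    return S, gamma

# Illustrative values: a weak condition effect with small participant and
# electrode deviations around it.
S, gamma = scale_from_slopes(alpha_c=1.0, beta_pc=-0.2, delta_ce=0.1)
print(S, gamma)  # S ~ 0.71, so the mean resultant length is 1 - S ~ 0.29
```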

Thus, the main objects of interest are αc, βpc, and δce, and our result is calculated by sampling the posterior distribution for these quantities. Of course, these quantities also require priors. The obvious place to start is the condition effects αc; because effects are weak in these data, our prior belief is that for any condition the circular variance should be reasonably large, likely bigger than a half. Conversely, the parameters βpc and δce correspond to deviations about the baseline level αc, which can be represented easily using unbounded symmetric distributions. The prior for the slopes βpc has a hierarchical structure, allowing correlations across conditions; βpc models the participant response: roughly speaking, the idea is that a participant who is not paying attention in one condition is likely to be inattentive in all of them. The participant slopes, βpc, were assigned a multivariate t-distribution, chosen because its heavy tails give a more robust estimation in the presence of ‘unusual’ participants: exceptionally strong or exceptionally weak, probably due to lack of attention. This multivariate parameterisation allows for a simultaneous two-way regularisation process due to information sharing both within conditions and across conditions. The idea of self-regularising priors is common in hierarchical Bayesian models and is often referred to as partial pooling (see Gelman et al., 1995 for a review). A similar approach was adopted for the electrode slopes, but with partial pooling only within condition, and not across conditions: testing showed that this was not useful. These priors are described in further detail as part of a full description of the model in the supporting information (see Appendix 1).

At the moment one disadvantage of Bayesian analysis is that the process of selecting priors is unfamiliar and this might appear intimidating, particularly for experimental scientists hoping to benefit from the approach without being interested in the nitty-gritty of defining priors. Hopefully, as our understanding matures, this process will become both better established and better understood, with good default choices available as suggestions from analysis libraries.

Results

The posterior distribution was sampled using the NUTS algorithm implementation in Stan. Four chains were run for 4000 iterations, with half dedicated to a warm-up, or calibration, epoch. Details of the software packages and libraries used can be found in Appendix 2.

Comparison with ITPC results

The posterior distributions are described in Figure 6. This figure reprises, using our analysis, the ITPC analysis exhibited in Figure 2. Figure 6A shows a point estimate of the mean resultant length across all frequencies estimated using the optimise function within RStan; as in the earlier figure (Figure 2A), there is a peak visible at the phrase frequency, 1.5625 Hz, for AN, but not for the other conditions.

Posterior distributions.

(A) The traces show point estimates of the mean resultant length calculated across all 58 frequencies using the optimisation procedure. (B) The marginal posterior distributions for each transformed condition effect αc are shown with a violin plot. Posteriors over condition differences are given directly above; their colour represents the condition against which the comparison is made. For example, the green interval above the adjective–verb (AV) condition describes the posterior difference AN-AV. Posterior differences and marginal intervals are all given as 90% highest density intervals (HDIs) marked with posterior medians. (C) Posterior medians are interpolated across the skull for all condition comparisons. Filled circles show those electrodes where zero was not present in the 95% HDI for the marginal posteriors over the quantity in Equation 22.

Figure 6B represents our attempt to present the results of the Bayesian analysis in a way which resembles, as much as possible, the ‘significance difference bracket’ often used in presenting experimental results. At the bottom of Figure 6B, we see the posterior distributions over the mean resultant length for each condition. These posteriors are obtained by transforming posterior samples of the parameter αc, which describes the effect of condition on the response within the regression, to the circular variance Sc, as described in Equation 18, and then subtracting from one to obtain the mean resultant length:

(20) $R_c = 1 - S_c$
(21) $\;\;\;\;\, = 1 - \sigma(\alpha_c)$

It appears that the AN condition has a higher value of the mean resultant length than the other five conditions. To examine this further, the upper panel in Figure 6B also shows the 90% highest density intervals (HDIs) and posterior medians of the posterior distributions over the differences between the mean resultant lengths of all condition pairings. The HDI provides a summary of the full posterior distribution: it is the smallest-width interval that contains a specified proportion of the total probability. Here, above the violin plot for each αc, we have plotted the HDIs for that condition relative to the others: these could be considered a Bayesian equivalent to the confidence brackets common in frequentist plots like Figure 2. The only HDIs that do not overlap zero are those corresponding to a difference between AN and another condition: this clearly shows that in our model there is a neural response at the phrase stimulus frequency for AN but not for the other conditions. For example, although the MP condition consists of grammatical phrases, the fact that these phrases are of different types means that no response is apparent. This suggests that the neuronal response observed for AN is a response to a specific type of phrase, not to any phrase.
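For readers wanting to reproduce this kind of summary, computing an HDI from posterior samples is straightforward; the sketch below (libraries such as arviz offer an equivalent, az.hdi) uses synthetic draws in place of real posterior samples.

```python
import numpy as np

def hdi(samples, prob=0.90):
    """Smallest-width interval containing `prob` of the posterior samples."""
    x = np.sort(np.asarray(samples))
    n_in = int(np.ceil(prob * len(x)))       # number of points the interval must cover
    widths = x[n_in - 1:] - x[: len(x) - n_in + 1]
    i = np.argmin(widths)                    # left edge of the narrowest such window
    return x[i], x[i + n_in - 1]

# Synthetic 'posterior' draws of a condition difference.
rng = np.random.default_rng(3)
draws = rng.normal(0.03, 0.01, size=4000)
print(hdi(draws, 0.90))  # an interval excluding zero indicates a credible difference
```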

In Figure 6C, we see the electrode-by-electrode comparisons across conditions. These graphs show a clearer structure than the corresponding ITPC analysis in Figure 2C; there is a left temporal and right parietal response for AN and nothing else. To draw a comparison with the headcaps in Figure 2C, we have highlighted the electrodes whose marginal posterior HDIs did not contain zero. This is not a one-to-one comparison: in Figure 2C, no claims of significance can be made about any specific electrode; only the clusters of activity themselves are significant. Here we are presenting evidence given by the Bayesian model based on each marginal distribution and argue that false positives arising from these 32 comparisons should be reduced by the multivariate machinery of the Bayesian model. In Figure 6C, only a summary of the posterior for each electrode-by-electrode comparison is shown. It is important to note that, in contrast with the ITPC analysis in Figure 2C, the posterior is much more than a point estimate. In Appendix 9, an example of the plausible patterns of activity captured by the posterior distribution for the AN–AV comparison is provided.

Participant effects

In Figure 7A, we plot the 90% HDIs for the participant slopes, βpc, for c = AN; more positive values of β correspond to less attentive participants, more negative values to more attentive ones. These have been arranged in increasing order of β with the participant number p given on the x-axis. From an experimental point of view, this plot gives some reassurance that there is no systematic trend, with participation becoming better or worse as the experiment progressed through participants. Our model includes a condition-dependent standard deviation for the participant response (see Appendix 1); posterior distributions for these standard deviations are plotted in Figure 7B. This appears to indicate that there is more across-participant variation in responses to the MP and ML conditions, where there is a structure but a complicated or confusing one, than to either the highly structured and grammatical AN condition or the RV and RR conditions, with little or no structure at the phrase rate.

Participant attentiveness and localised electrode effects.

(A) The intervals show participant effects for the grammatical adjective–noun (AN) condition given as 50/90% highest density intervals (HDIs) and posterior medians. (B) The posterior distributions over the standard deviation of participant slopes for each condition. Outer vertical lines mark the 90% posterior HDIs, inner lines mark the posterior median. (C) The skull plot from Figure 6C for the AN–AV difference with electrode names marked. (D) Posterior distributions over electrode differences for those positions on the skull where the grammatical condition shows a higher coherence of phases for the average participant in (C). Intervals give 50/90% HDIs and the posterior medians. AV, adjective–verb.

Electrode effects

To investigate the electrode-dependent response, Figure 7C is an enlarged version of the first headcap plot from Figure 6C: the difference in mean resultant between AN and AV. The heatmap colour scale is recalibrated since here it refers only to this one plot. The localisation of the response is seen very clearly. It is difficult to combine a headcap plot and information about the posterior distribution, so the HDI for

(22) $\Delta R_e = R_{c_1 e} - R_{c_2 e}$

where $c_1 = \text{AN}$, $c_2 = \text{AV}$, and

(23) $R_{ce} = 1 - \sigma(\alpha_c + \delta_{ce})$

is plotted for three example electrodes, one electrode from each of the two active areas and one from an area that shows little activity. The response for P4 and T7 is clearly different from zero, indicating that there is a stronger response to the AN condition than to the AV condition at these two electrodes. The same HDI analysis for RR versus RV does not show any electrodes whose HDI does not overlap zero; the presumably misleading results for CP5 and FC1 noted in the discussion of the ITPC results do not appear here.

In Figure 2C, we see that even for conditions, such as RR and RV, which contain no linguistic structure at the phrase rate, there are patterns of electrode activity in the topographic headcaps. In contrast, the analogous Bayesian headcaps in Figure 6C do not show similar patterns. We used simulated data to investigate whether the Bayesian model is correctly demonstrating that there is no phrase-level response for these conditions, rather than the other possibility: that the beguiling patterns seen in the ITPC headcaps represent a real activity invisible to the Bayesian analysis. In fact, the evidence points to the first alternative; Figure 8 indicates that the Bayesian model is more faithful to the data when there is no meaningful variation in electrode effects. Figure 8A shows the real data again; however, whereas previously the headcap was plotted for differences between conditions, here we fit directly to the RR condition. There is no effect visible for the Bayesian headcap, but for the ITPC headcap there are variations that may suggest localised activity, even though this condition does not have any structure at the phrase rate. In Figure 8B, four datasets were simulated from the generative Bayesian model with electrode effects set to zero; other parameters were centred on the posterior means of the ungrammatical AV condition. The four simulations are marked as 1-4 in the figure. For simplicity, there is only one condition, but in other respects the simulated data mimic the real data: 16 participants, 32 electrodes, and 24 trials. These simulations are intended to represent four different iterations of the same experiment; apart from differing in any random numbers, they are identical. The data resulting from these four simulations were fitted with both methods. Evidently, the Bayesian results are much closer to the ground truth; the ITPC results show variations that could easily be misinterpreted.
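A sketch of this kind of simulation is given below, under simplifying assumptions: a single condition, electrode effects fixed at zero, participant slopes drawn from a plain normal rather than the multivariate t of the full model, and an illustrative value of α standing in for the AV posterior mean.

```python
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(4)
P, E, K = 16, 32, 24              # participants, electrodes, trials, as in the real data
alpha = 1.4                       # weak condition effect (illustrative, AV-like)
beta = rng.normal(0.0, 0.3, P)    # participant slopes; the full model uses a multivariate t
delta = np.zeros(E)               # ground truth: no electrode effects

# Link chain (Equations 17-19): slopes -> circular variance -> Cauchy scale.
upsilon = alpha + beta[:, None] + delta[None, :]   # shape (participants, electrodes)
S = 1.0 / (1.0 + np.exp(-upsilon))
gamma = -np.log(1.0 - S)

# Each trial's phase is wrapped Cauchy around a participant/electrode mean phase.
mu = rng.uniform(-np.pi, np.pi, (P, E))
raw = cauchy.rvs(loc=mu[:, :, None], scale=gamma[:, :, None],
                 size=(P, E, K), random_state=rng)
theta = np.angle(np.exp(1j * raw))

# ITPC per participant and electrode, then the across-participant electrode profile.
itpc = np.abs(np.mean(np.exp(1j * theta), axis=-1))
print(itpc.mean(axis=0).round(3))  # should be flat across electrodes up to noise
```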

Comparison of electrode effects for no signal.

(A) Topographic headcaps for the phrase data using the random words condition (RR). When phase coherence is calculated using the inter-trial phase coherence (ITPC) for this condition, there is an apparently high but misleading variation across electrodes on the skull. This does not manifest in the Bayesian result due to regularisation of the electrode effects. (B) Data were simulated from the generative Bayesian model four times with electrode effects set to zero to provide a known ground truth. Plots 1-4 can be thought of as results from four separate experiments. On these simulated data, the ITPC shows variation similar to (A); the Bayesian results are consistent with the ground truth. The ITPC has an upward bias, so in all figures the mean was subtracted for ease of comparison.

Sampler diagnostics

When calculating posteriors using MCMC, it is necessary to check the success of sampling; sometimes the sampler can become stuck in one part of the parameter space (Gelman et al., 1995; McElreath, 2018). Figure 9 plots the standard MCMC diagnostic measures calculated from our posterior samples. There do not appear to have been any problems. The most commonly used measure of the success of sampling is $\hat{R}$, often referred to as R-hat; this is a measure of convergence that compares the means and variances of chains. Ideally it would be 1.0, but typically a value <1.05 is considered acceptable and <1.02 desirable. Here, all values of R-hat are <1.006, indicating good mixing; values are plotted in Figure 9A. Figure 9C plots the chains for the parameter with the largest R-hat value for each parameter type; none of these plots appear to show the sort of pathological behaviour associated with poor sampling, and the chains are both stationary and convergent. Another measure of sampling success, the comparison of marginal and transitional energy distributions of the Hamiltonian, is exhibited in Figure 9B; this also indicates good sampling. See Appendix 2 for a note on tree depth warnings.
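These diagnostics can also be computed outside Stan's own tooling; for example, the arviz library (assumed available) calculates R-hat and effective sample size from an array of chains. The chains below are synthetic stand-ins for real posterior draws.

```python
import numpy as np
import arviz as az

# Synthetic draws standing in for one parameter's chains, shape (chain, draw).
rng = np.random.default_rng(5)
chains = rng.normal(0.0, 1.0, size=(4, 2000))

print(az.rhat(chains))  # ~1.0 for well-mixed chains
print(az.ess(chains))   # effective number of samples
```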

Sampler performance and diagnostics.

(A) The performance of the sampler is illustrated by plotting $\hat{R}$ (R-hat) against the ratio of the effective number of samples for each parameter in the model. Points represent individual model parameters, grouped by colour with a different colour for each parameter type. For convenience, the dot sizes are scaled so that the more numerous parameter types have smaller dots and the less numerous have larger ones; for example, αc, with only six instances, is large. (B) A histogram comparing the marginal energy distribution πE and the transitional energy distribution πΔE of the Hamiltonian. (C) Post-warmup trace plots. All four chains for the poorest performing parameter within each parameter group are overlaid. Corresponding points in (A) are marked with a black border and zero transparency.

Case study: Statistical learning for an artificial language

As for the phrase data in Figure 2, we perform a standard ITPC analysis at the group level for this dataset. In Figure 10A, we replicate the original statistical analysis in Pinto et al., 2022 with a one-tailed t-test for each frequency. Since ITPC is bounded by zero and one, it cannot be normally distributed, so we also present the results of a Wilcoxon signed-rank test. There is a strong response at the syllable frequency (4 Hz) for both the BL and EXP conditions; however, the statistical tests give complicated results. A small increase in coherence can be observed at the pseudoword rate (1.33 Hz) and an even stronger one at the second harmonic (5.33 Hz). No significant difference was observed between BL and EXP at the first harmonic (2.66 Hz), although four participants showed a considerable increase in coherence at this frequency, exceeding values 1.5 × IQR above the 75th percentile of the data.

Inter-trial phase coherence (ITPC) analysis.

(A) ITPC averages across all trials and electrodes for each of the 39 participants. We replicate the statistical procedure stated in Pinto et al., 2022: a paired one-sided test for a greater mean ITPC in the experimental condition (EXP) at the pseudoword rate (1.33 Hz) and its first and second harmonics (2.66 Hz, 5.33 Hz). A one-sided test for a larger ITPC value in the baseline condition (BL) at the syllable rate is also performed. Significance values are given for an uncorrected paired Wilcoxon signed-rank test (left) and an uncorrected paired t-test (right). (B) Statistically significant clusters of electrodes were found using a cluster-based permutation test between these two conditions at 4 Hz and 5.33 Hz.

The headcaps in Figure 10B present the condition-to-condition difference in ITPC at each electrode averaged across participants (Equation 3) and interpolated across the skull. We used cluster-based permutation testing to identify significant clusters of activity that describe the differences in average electrode response between conditions. No significant clusters of activity were found at the pseudoword frequency or its first harmonic; however, a stronger mid-to-frontal cluster appears at the second harmonic, suggesting a larger ITPC of electrodes in EXP compared to BL. A significant cluster of activity also appears at the syllable rate and describes an opposite effect of condition: frontal electrodes have a larger ITPC for BL compared to EXP.

Bayesian analysis

The model was fit separately for each frequency using four chains sampling for 2000 iterations each, with half of these attributed to the warmup phase. No divergent transitions were detected during sampling, and the convergence diagnostic satisfied $\hat{R} < 1.03$. The posteriors over the difference in mean resultant length for each frequency are shown in Figure 11A. Although the posterior at the pseudoword rate favours values greater than zero, there are some values in the left tail consistent with no difference. From the summary statistics of the posterior differences in Table 1, we can calculate that zero is approximately 1.6 standard deviations from the posterior mean.

Bayes analysis for the statistical learning data set.

(A) Posterior distributions over the condition difference EXP-BL are shown for all frequencies of interest. In each case, the full posterior is given by the histogram and is annotated by its 90% highest density interval (HDI) and posterior median. (B) Marginal posterior distributions over the mean resultant length for each condition and frequency of interest. (C) Posterior means for the difference EXP-BL at each of the 64 electrodes are interpolated across the skull. Filled circles label those electrodes where zero was not contained by the 95% HDI calculated from the quantity in Equation 22.

Table 1
Summary of posterior values for the statistical learning dataset.

All values are rounded to three decimal places. The difference shown is EXP-BL. HDI: highest density interval.

Frequency (Hz)   Mean     Median   SD      90% HDI
1.33             0.016    0.016    0.010   [0.001, 0.032]
2.66             0.000    0.000    0.009   [–0.014, 0.015]
4                –0.025   –0.025   0.016   [–0.052, 0.000]
5.33             0.030    0.030    0.009   [0.016, 0.046]

If we are interested in the full extent of the posterior distribution, it is more appropriate to calculate the total posterior probability assigned to parameter values that support a difference. For example, we can calculate from posterior samples that $P(\Delta R > 0 \mid y) \approx 0.956$. This analysis indicates that there is no strong evidence for a large difference in expectation between conditions at the pseudoword frequency, but it is plausible that a difference exists. The first harmonic of the pseudoword rate clearly demonstrates no difference between the experimental groups: the posterior is peaked symmetrically around a mean of zero and has low variance. The second harmonic shows the largest difference between the groups; zero is approximately 3.24 standard deviations away from the mean. Consistent with the ITPC analysis, the results at the syllable frequency also show BL as having a larger value than EXP; in this case, zero lies approximately 1.59 standard deviations from the mean.
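All of these summaries are simple functions of the raw posterior draws; a sketch, with a synthetic sample standing in for the ΔR draws at the pseudoword rate:

```python
import numpy as np

# Synthetic draws mimicking the 1.33 Hz row of Table 1.
rng = np.random.default_rng(6)
delta_R = rng.normal(0.016, 0.010, size=8000)

print(delta_R.mean(), delta_R.std())   # posterior mean and standard deviation
print(delta_R.mean() / delta_R.std())  # zero is ~1.6 posterior SDs from the mean
print((delta_R > 0).mean())            # P(Delta R > 0 | y), roughly 0.95
```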

Posterior means at each electrode for the same difference are shown in Figure 11B. As before, filled points are those electrodes where the 95% HDI of the marginal posterior over the condition difference does not contain zero. In line with the ITPC analysis for the pseudoword frequency and its first harmonic (Figure 10B), the Bayesian model gives no evidence of localised patterns of activity in either condition. A strong result appears for the second harmonic; according to the Bayesian result, every electrode has a higher mean resultant length in EXP compared to BL. At the syllable rate, a frontal response is discovered; this also reprises the findings from the ITPC analysis.

The Bayesian results give evidence of statistical learning in the second harmonic of the pseudoword frequency; however, results for the pseudoword frequency are not so strong as to rule out no difference entirely. It appears that the strength of the conclusions to be made is limited by some participants demonstrating an opposite effect. In Appendix 8, we plot the posterior distribution over the EXP-BL comparison at each frequency and for each participant. In the strongest result, 5.33 Hz, the majority of participants show an increased response in EXP. However, at 2.66 Hz and 1.33 Hz, the split between participants that show an opposite effect of condition and those that do not is much more even. In Pinto et al., 2022, it is suggested that evidence of statistical learning is difficult to find in individual participants. The high variance of participant posteriors both within and across frequencies supports this conclusion.

Simulation study

In this article, we have tried to concentrate on real experimental data: real data often have characteristics and irregularities that are hard to reproduce using fictive data. In our approach here, we have been fortunate that there are datasets which have already been analysed using other, more traditional methods, allowing us to test our method in a situation with something akin to ground truth. Nonetheless, it is useful to also explore the characteristics of our method on fictive data; using fictive data allows us to manipulate effect sizes and the amount of data, and contains a ground truth, albeit one located in the perhaps too-regular context provided by simulated data generation.

The Bayesian model reduces bias in the estimation of R. If $R_i$ is the true value of the mean resultant length for a condition i, then $\bar{R}_i$, as calculated by the formula for ITPC (Equation 2), is a positively biased estimate of this quantity (Kutil, 2012). In a simulation study, we demonstrate that our Bayesian model reduces bias in the estimation of this quantity compared to a typical ITPC analysis. We sampled $R_1$ and $R_2$ as ground-truth mean resultant lengths for two fictive conditions uniformly over the interval [0,1] by replacing the Beta distribution, which described an explicit prior assumption, with a uniform distribution (see Equation 35). We then used this modified model to generate fictive datasets with different numbers of participants and trials over the sampled ground-truth values $R_1$ and $R_2$. The estimation bias in each dataset was then analysed with both the ITPC and Bayesian approaches.

Figure 12A plots the result of this study for fictive datasets of 5 participants, 10 trials, and 8 electrodes. Each point on the graph is a summary of the performance of each method on its estimation of the true difference in mean resultant length. The x axis gives the absolute value of the true difference, and the y axis gives the ratio of the estimated difference and the true difference:

(24) $\text{ratio of difference} = \frac{\bar{R}_1 - \bar{R}_2}{R_1 - R_2}$
Simulation study.

(A) The Bayesian model has a higher true detection rate for lower participant numbers. The bias of the estimate is also greatly reduced by the Bayesian model. As the real difference increases along the x axis, the variation in model estimates reduces in both methods; however, the distribution of these points around y=1 is much more symmetric for the Bayesian model; this result highlights its bias reduction. The y axis has been restricted to help highlight the behaviour of interest. (B) Simulation-based calibration for the same participant and trial numbers where the rank of ΔR is analysed. There is no evidence to suggest a tendency of the Bayesian model to overestimate or underestimate the difference.

Any systematic deviation from the ideal value of 1 implies a bias in the estimation. Such a trend is present with the ITPC estimation but reduced in the Bayesian one.

Using the same simulated datasets, we looked at how well each method detects a real difference in the mean resultant length. In Figure 12A, points are coloured orange when a difference is correctly detected by the method (zero lies outside the 95% HDI for Bayes, p<0.05 for the ITPC using a paired two-tailed Wilcoxon signed-rank test). The Bayesian model can detect a real difference in mean resultant length for smaller participant numbers. Interestingly, even after a doubling of the number of trials for the same participant number, a difference still cannot be detected by the statistical test: see Appendix 5 and Appendix 6 for the comparison of true-positive and false-positive rates on simulated data between both methods.

We used simulation-based calibration (SBC) (Talts et al., 2018) to demonstrate the calibration of our model. SBC is primarily an algorithm for unit-testing a Bayesian model: it provides a visual check that the model, and its geometry, are not so complex that the sampling algorithm cannot, on average, sample from it faithfully, which would give a dishonest description of the parameters in the posterior. We use this method to show that our Bayesian model does not provide a biased estimate of ΔR arising from systematic overestimation or underestimation of $\bar{R}$.

Checks for calibration using SBC require that the histogram of rank statistics produced by the algorithm is demonstrably uniform. Figure 12B gives the straightforward histogram of rank statistics, together with the quantiles containing 99% of the variation expected of a true uniform distribution estimated from a finite sample (details are provided in Appendix 7). The histogram gives no indication of behaviour deviating from a uniform distribution, either in specific bins or through a more general trend such as a sloping histogram of ranks. Smaller deviations can be hard to detect using this approach, so a second, more sensitive approach is recommended (Talts et al., 2018). The second plot shows the difference between the empirical cumulative distribution function (ECDF) estimated from the histogram of rank statistics and the CDF of a uniform distribution; the area containing the variability expected of this difference under true uniformity is shaded. Even with this more sensitive approach, no deviation beyond the permissible variation is present. From this result, we can be confident that, under its generating process, the Bayesian model does not provide a biased estimate of ΔR.

Discussion

Here, we have presented a Bayesian description of phase data using examples from neurolinguistics. Our approach reprises the original conclusions of the statistical analyses of these data but, we believe, does so in a more expressive and more natural way. Our account focuses on neurolinguistics, where frequency tagging is common, and we use specific neurolinguistic examples: one with which we are familiar and for which data is openly available (Burroughs et al., 2021), and one from a recent study that investigated statistical learning for an artificial language (Pinto et al., 2022). However, we believe that our approach applies broadly across the many applications of frequency tagging. Bayesian analysis is, essentially, a more modest approach to data than the more commonly used frequentist analysis: where a frequentist approach seeks to establish with significant certainty whether a hypothesis is true or false, perhaps also using Bayes factors to quantify evidence for a particular hypothesis using a discrete set of models, a Bayesian analysis restricts itself to the more achievable goal of estimating the values of parameters in a model of the data and quantifying the certainty or uncertainty of those estimates.

Model extensions

Because the model for the scale of the wrapped Cauchy resembles a logistic regression, it is easily adjusted to include other terms typical of a linear model. For example, the statistical learning dataset (Pinto et al., 2022) records from participants in blocks. One extension to the model would be an additional term to capture any block effects; alternatively, this could be implemented as an interaction between block and condition, as sketched below.
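As a minimal sketch, writing $\eta_b$ for a hypothetical block effect (the symbol and index are ours, not part of the model described in Appendix 1), the linear predictor of Equation 34 would become

$\upsilon_{pceb} = \alpha_c + \beta_{pc} + \delta_{ce} + \eta_b$

with $\eta_b$ partially pooled across blocks; the interaction version replaces $\eta_b$ with $\eta_{bc}$, mirroring the treatment of $\delta_{ce}$.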

It is clear from the ITPC headcaps in both analyses that electrodes are spatially correlated. The datasets used in the analyses were relatively large, both in participants and trials; this has been an important factor in helping the Bayesian model, and the ITPC, resolve activity patterns across the skull. For smaller datasets, however, this is not guaranteed. An extension to the model would incorporate prior knowledge about electrode locations, replacing the current independent treatment with one that encodes a correlated response between neighbouring electrodes. A natural starting point would be a multivariate normal with a correlation matrix calculated by applying an appropriate kernel function, such as an exponentiated quadratic, to a 2D or 3D distance matrix of electrode locations; a sketch is given below.
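The following is a minimal sketch in R of this suggestion, using placeholder electrode coordinates and an assumed lengthscale; none of the names come from our codebase.

# Sketch: exponentiated quadratic correlations over 2D electrode positions.
set.seed(1)
coords <- matrix(runif(64), ncol = 2)    # placeholder 2D electrode positions
ell <- 0.5                               # assumed kernel lengthscale
D <- as.matrix(dist(coords))             # pairwise electrode distances
Omega_e <- exp(-D^2 / (2 * ell^2))       # correlation matrix between electrodes
# The independent prior delta_ce ~ Normal(0, tau_c) of Equation 42 would then
# be replaced by a joint prior delta_c ~ MultivariateNormal(0, tau_c^2 * Omega_e).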

The statistical learning dataset (Pinto et al., 2022) differs from the phrase dataset (Burroughs et al., 2021) in that effects of condition also appear in the harmonics of the main frequency. In the Bayesian analysis we presented, each harmonic is fitted independently of the others. However, if a response is expected at a selection of harmonics of the baseline, then a model that jointly handles these frequencies would be a potential avenue for improvement, especially for smaller datasets, where information sharing between dependent parameters is a powerful tool for obtaining better estimates.

Data efficiency

The Bayesian approach also appears to make more efficient use of the data. To investigate the data efficiency of the frequentist and Bayesian approaches, we used the phrase data (Burroughs et al., 2021) and simulated the result we would have had if the experiment had been stopped early, with fewer participants. It can be misleading to directly compare frequentist and Bayesian results, since the aims of the two approaches differ; nonetheless, we have done just that. In Figure 13A, we plot the confidence intervals arising from the two-sided paired Wilcoxon signed-rank test alongside the HDIs from the posterior distribution for decreasing participant numbers. This is produced by removing participants from the data, starting with the last recorded. It shows that the posterior still points to a real difference of condition in cases where the low participant number causes the frequentist confidence interval to overlap with zero and fail. The widths of these intervals are derived from the critical values outlined below. In Figure 13B, we plot the p-value produced by the same test, and the corresponding probability, calculated from the posterior, of observing a difference less than zero. We also mark the lines for α=0.1, α=0.05, and α=0.003; these correspond to the critical values of an uncorrected one-sided test, an uncorrected two-sided test, and a two-sided test with a Bonferroni correction of C(6,2) for multiple comparisons across the six phrase conditions. We are not advocating for any of these α values; the uncertainty in deciding an appropriate value of α plagues frequentist approaches. Interestingly, when both participants 15 and 16 are removed from the data, leaving 14 participants, the p-value increases by a factor of approximately two (n=15: 0.005; n=14: 0.011). Figure 7A explains this result: the posteriors for β show that participant 15 performs better on the task than participant 16, so removing participant 15 from the analysis weakens the result more than removing participant 16.
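To make the frequentist side of this early-stopping procedure concrete, here is a minimal sketch in R; the per-participant values x and y are fabricated placeholders, whereas the real analysis uses the per-participant mean resultant lengths from the phrase dataset.

set.seed(1)
x <- runif(16, 0.2, 0.6)                     # placeholder values, condition RR
y <- x + rnorm(16, mean = 0.05, sd = 0.05)   # placeholder values, condition AN
for (n in 16:6) {                            # drop participants from the end
  w <- wilcox.test(y[1:n], x[1:n], paired = TRUE,
                   conf.int = TRUE, conf.level = 0.95)
  cat(n, "participants: p =", signif(w$p.value, 2),
      "CI = [", signif(w$conf.int[1], 2), ",", signif(w$conf.int[2], 2), "]\n")
}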

Figure 13
Efficiency of the frequentist and Bayesian approaches for participant number.

(A) Confidence intervals arising from a two-sided paired Wilcoxon signed-rank test (left), alongside Bayesian highest density intervals (right), calculated for the condition difference AN-RR in the phrase dataset. The intervals have widths 90/95/99.7% for each method respectively. (B) The p-value arising from the same significance test (left) compared with the posterior probability of observing a value less than zero (right).

Figure 14 is similar to Figure 13, but uses the statistical learning dataset (Pinto et al., 2022), comparing conditions BL and EXP at 5.33 Hz. For these data, we saw little evidence that the Bayesian approach works better when the number of participants is reduced; we attribute this to the large number of trials: the extra efficiency of a Bayesian approach generally appears most clearly in low-data regimes, and the statistical learning dataset is admirably well sampled. For this reason, we used this dataset to investigate data efficiency when the number of trials is reduced for a fixed number of participants. In Figure 14, data from the first 20 participants are considered and the analysis is repeated with different numbers of trials, discarding trials from the end. It is clear from Figure 14A that the Bayesian model can reliably detect the signal in the data using no more than half the number of trials that the frequentist approach requires; this is potentially useful, especially given the challenge semantic satiation poses to some neurolinguistic experiments. Figure 14B compares the p-values arising from the significance test with P(ΔR<0) calculated from the posterior and shows the fast convergence of the posterior to the signal; the p-value converges much more slowly and is more variable across trials. For these data efficiency analyses, the degrees of freedom parameter ν was fixed to 30 to address divergent transitions arising at small participant numbers. HDIs were calculated from 8000 posterior samples.

Figure 14
Efficiency of the frequentist and Bayesian approaches for trial number.

(A) Confidence intervals arising from a two-sided paired Wilcoxon signed-rank test (left), alongside Bayesian highest density intervals (HDIs) (right), calculated for the condition difference EXP-BL in the statistical learning dataset. The confidence intervals are given for levels of 90 and 95%; on the right are the HDIs for the same levels. (B) The p-value arising from the significance test (left) compared with the posterior probability of observing a value less than zero (right).

Through simulation, we have shown that at lower participant numbers the Bayesian model can detect a true difference with fewer data. Similarly, with many participants but few trials, the Bayesian model also provides a benefit. The probability of making a type 1 error also appeared markedly reduced when using the Bayesian approach across a range of data sizes. Together, these results support the adoption of the Bayesian approach to analysing phase data, especially in studies where data is limited, such as pilot studies, where findings influence the direction of subsequent research.

It may appear that our motivation is contradictory: we first explain that frequency tagging produces robust encephalography results, but then explain that a new framework is required to analyse these results because they are often too noisy to study using a naïve power analysis. Of course, there is no contradiction; the encephalographic study of cognitive phenomena like language demands both a robust experimental paradigm and a cutting-edge analysis pipeline!

EEG data can benefit from a Bayesian analysis

The Bayesian approach we have advanced in this article is undoubtedly much more computationally demanding than a frequentist approach; it also demands some thought and experimentation in the formulation of the model and its priors. Frequency tagging is, in this regard, a particularly demanding application of the approach. However, we believe that the clarity of a Bayesian description and the complete way it presents the model and its evidence, along with the greater data efficiency it provides, make it superior. Some of the complexity of our approach derives from the difficulty of sampling on a circle, and we hope this example will be helpful in incorporating compact distributions into standard probabilistic packages such as Stan and Turing.

In general, Bayesian models become worth the effort in scenarios with two properties: (1) the data are limited and noisy, so statistical uncertainty is high and therefore worth representing explicitly; (2) the dataset has a strong structure, which the Bayesian model can be designed to match, sharing information across parameters. For these reasons, we also believe that similar Bayesian approaches will have broad application to EEG data. The nature of EEG data, with its noisiness, high dimension, and tendency towards small participant numbers, makes it likely that Bayesian methods will be helpful. This is certainly evident in the preliminary work reported in Turco and Houghton, 2022.

Appendix 1

Full model

In the Bayesian model, the individual phases are modelled as draws from a wrapped Cauchy distribution:

(25) $\theta_{pcek} \sim \text{Wrapped-Cauchy}(\mu_{pce}, \gamma_{pce})$

where, as above, p, c, and e are participant, condition, and electrode number, and k is the trial number. The mean phase is derived from the Bundt-gamma distribution:

(26) $(x_{pce}, y_{pce}) \sim \text{BundtGamma}(10, 0.1)$

The probability density function for the Bundt-gamma distribution can be derived through a Jacobian adjustment from polar to Cartesian coordinates. Our assumptions in polar coordinates are a uniform angle and a gamma-distributed radius:

(27) $\rho_{pce} \sim \text{Gamma}(10, 0.1)$
(28) $\mu_{pce} \sim \text{Uniform}(-\pi, \pi)$

Unlike the choice of a uniform distribution for the mean, the choice of distribution for the radius is somewhat arbitrary because it has no implications for the quantities we analyse. It is simply a mathematical tool that can promote more efficient sampling by soft-constraining the sampler in parameter space. To represent these assumptions in Cartesian coordinates, we multiply the product of these independent densities by the Jacobian of the transformation, $1/\rho$:

(29) $p(x_{pce}, y_{pce}) = \frac{1}{\rho_{pce}}\, p(\rho_{pce}, \mu_{pce})$
(30) $\qquad\qquad\;\; = \frac{1}{2\pi\rho_{pce}}\, \text{Gamma}(\rho_{pce} \mid 10, 0.1)$

This gives an angle uniform on the circle, not on the interval:

(31) $\mu_{pce} = \text{angle}(x_{pce}, y_{pce})$.
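As a minimal sketch, the Bundt-gamma prior can be sampled in R directly from its polar definition and mapped to the Cartesian coordinates the sampler works with (the names here are illustrative):

set.seed(1)
rho <- rgamma(1000, shape = 10, scale = 0.1)  # radius prior, Equation 27
mu  <- runif(1000, min = -pi, max = pi)       # uniform angle, Equation 28
x <- rho * cos(mu)                            # Cartesian coordinates, as in
y <- rho * sin(mu)                            # Equation 26
mu_recovered <- atan2(y, x)                   # Equation 31: angle(x, y)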

As described above, the model for γ uses a pair of link functions, so

(32) $\gamma_{pce} = -\log(1 - S_{pce})$

and

(33) $S_{pce} = \sigma(\upsilon_{pce})$

with a linear model for υpce:

(34) $\upsilon_{pce} = \alpha_c + \beta_{pc} + \delta_{ce}$
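A minimal sketch in R of this pair of link functions, with an arbitrary value standing in for the linear predictor:

sigmoid <- function(x) 1 / (1 + exp(-x))
upsilon <- 0.2                 # placeholder for alpha_c + beta_pc + delta_ce
S <- sigmoid(upsilon)          # circular variance, Equation 33
gamma <- -log(1 - S)           # wrapped Cauchy scale, Equation 32
R <- 1 - S                     # mean resultant length; equals exp(-gamma)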

We have priors for each of $\alpha_c$, $\beta_{pc}$, and $\delta_{ce}$, which in a linear regression would be referred to as slopes. The prior for $\alpha_c$ is induced by placing a prior over $\sigma(\alpha_c)$, which represents the baseline circular variance for each condition:

(35) $\sigma(\alpha_c) \sim \text{Beta}(3, 2)$

By applying the change of variables formula, we can work out the pdf of the prior induced on $\alpha_c$:

(36) $p(\alpha_c) = \text{Beta}(\sigma(\alpha_c) \mid 3, 2)\, \dfrac{e^{\alpha_c}}{(1 + e^{\alpha_c})^2}$
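The induced prior is easy to inspect by sampling: draw the baseline circular variance from the Beta prior and apply the inverse sigmoid. A minimal sketch in R (names are illustrative):

set.seed(1)
S0 <- rbeta(10000, shape1 = 3, shape2 = 2)  # sigma(alpha_c), Equation 35
alpha <- log(S0 / (1 - S0))                 # logit: induced draws of alpha_c
hist(alpha, breaks = 50)                    # visual check against Equation 36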

As discussed above, for βpc we have a hierarchical structure modelling covariance of participant responses across conditions, thus:

(37) $\beta_p \sim \text{MvT}(\nu, 0, \Sigma)$

where $\beta_p$ is a vector over the $c$ index. With $C$ conditions, the scale matrix $\Sigma$ is a $C \times C$ matrix, made up of a correlation matrix $\Omega$ and a set of scales, $\sigma_1$ to $\sigma_C$:

(38) $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_C)\, \Omega\, \text{diag}(\sigma_1, \ldots, \sigma_C)$
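A minimal sketch in R of assembling this scale matrix, with placeholder scales and a placeholder correlation matrix (the values are illustrative only):

sigma_c <- c(0.3, 0.4, 0.5)                 # per-condition scales, C = 3
Omega <- diag(3); Omega[Omega == 0] <- 0.2  # placeholder correlation matrix
Sigma <- diag(sigma_c) %*% Omega %*% diag(sigma_c)  # Equation 38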

To interpret this scale matrix as a covariance matrix, it must be multiplied by $\nu/(\nu-2)$. The correlation matrix has a Lewandowski–Kurowicka–Joe prior (Lewandowski et al., 2009; Gelman et al., 1995):

(39) $\Omega \sim \text{LKJ}(2.0)$

The degrees of freedom parameter ν is given a gamma prior:

(40) $\nu \sim \text{Gamma}(2, 10)$

and the scales have half-normal priors:

(41) $\sigma_c \sim \text{HalfNormal}(0, 0.5)$

Finally, for δce we partially pool electrodes within condition only:

(42) $\delta_{ce} \sim \text{Normal}(0, \tau_c)$
(43) $\tau_c \sim \text{HalfNormal}(0, 0.5)$

To keep a standard notation, we have followed the conventions set by the Julia library Distributions.jl, writing the distributions as words and using the same arguments as are found there; in particular, the two parameters of the Gamma distribution correspond to shape and scale.

The prior distributions for β and δ were implemented using a reparameterisation known as non-centring (Papaspiliopoulos et al., 2007). This is a commonly adopted technique in hierarchical Bayesian modelling to help alleviate funnels, a class of pathological features in the target distribution that cause slow and biased sampling. This reparameterisation does not change the mathematical model; its sole purpose is to help the numerical computation. See McElreath, 2018 and Betancourt and Girolami, 2015 for an introduction to this approach.
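As a minimal sketch of non-centring, take the electrode effects of Equations 42-43: the centred draw $\delta_{ce} \sim \text{Normal}(0, \tau_c)$ is rewritten as $\tau_c$ multiplied by a standard normal variate, which has an identical distribution but gives the sampler much easier geometry (R is used here for illustration; the model itself is written in Stan):

set.seed(1)
tau_c <- abs(rnorm(1, mean = 0, sd = 0.5))  # HalfNormal(0, 0.5), Equation 43
z <- rnorm(64)                              # one standard normal per electrode
delta <- tau_c * z                          # same law as Normal(0, tau_c)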

Appendix 2

Software

Posteriors were sampled using rstan v2.21.5 and cmdstanr v0.5.2. Data and posteriors were analysed using R v4.2.1; tidyverse v1.3.1; reshape2 v1.4.4; and HDInterval v0.2.2. All graphs were plotted in ggplot2 v3.3.6. Figure 2B used ggsignif v0.6.3 for hypothesis testing and additional plotting functionality; Figure 5B used viridis v0.6.2 for heatmap colours; headcaps were interpolated using mgcv v1.8–40 for Figure 2C, Figure 6C, and Figure 7C; ridgeplots were created for Figure 7B with ggridges v0.5.3; Hamiltonian energy distributions were plotted in Figure 9B using bayesplot v1.9.0 (Gabry and Mahr, 2022; Gabry et al., 2019). All panels were composed using inkscape v1.1.1.

Code and data

The data used here are from the open dataset (Burroughs et al., 2021); all code is available on GitHub (copy archived at Houghton and Dimmock, 2023).

Tree depth warnings

The sampler has been observed to produce a small number (<1%) of max_treedepth warnings. Unlike divergences, this does not imply biased computation; it is a warning about efficiency. Each additional tree depth doubles the number of gradient evaluations required at the previous depth (Hoffman and Gelman, 2014), adding a penalty to the run time.

Computing resources

Posteriors were sampled locally on a Dell XPS 13 7390 laptop (Intel i7-10510U @ 1.80 GHz, 16 GB of RAM) running under Ubuntu 20.04.4 LTS.

Appendix 3

Table of experimental conditions

The six experimental conditions are shown in Appendix 3—table 1.

Appendix 3—table 1
Table of conditions for the phrase dataset.
Condition | Description | Example
AN | Adjective–noun pairs | old rat sad man
AV | Adjective–verb pairs | …rough give ill tell…
ML | Adjective–pronoun verb–preposition | …old this ask in…
MP | Mixed grammatical phrases | not full more green
RV | Random words with every fourth a verb | …his old from think…
RR | Random words | …large out fetch her…

Of these, four are ‘phrase conditions’, AN, AV, MP, and RR, and were analysed in Burroughs et al., 2021; the other two, ML and RV, are ‘sentence conditions’ which formed part of the experiment and were used to investigate phenomena that proved to be absent. ML stands for ‘mixed lexical’ and provides a four-word analogue of the AV condition, repeating lexical category but avoiding grammatical structure. All stimuli are available; see the data and code availability list in Appendix 2.

Appendix 4

Cluster-based permutation testing

For the non-Bayesian results, significant clusters of electrodes were identified using cluster-based permutation testing on the ITPC differences. First, the mean resultant length was calculated for each participant, condition, and electrode by averaging over trials at the frequency of interest, f:

(44) $R(f, \phi) = \left|\frac{1}{K} \sum_{k} e^{i\theta_{fk\phi}}\right|$

Cluster-based permutation testing requires a test statistic (separate from any significance testing of clusters) to threshold the values, in this case electrodes, that contribute to cluster formation. The test statistic we chose for deciding whether an electrode should appear in a cluster is the mean of the difference in mean resultant length:

(45) $z_e = \frac{1}{P} \sum_{p=1}^{P} \Delta R_{pe}$

The threshold that signifies an electrode as important enough to appear in a cluster is based on it being larger than a value that would be unlikely if there were no signal in the data. Specifically, assuming uniformity in the angle of an electrode across trials, and a sufficiently large number of trials ($K \geq 10$), the quantity $R_{pe}$ can be modelled using a Rayleigh distribution (Rayleigh, 1880; Horner, 1946).

(46) $R_{pe} \sim \text{Rayleigh}\!\left(1/\sqrt{2K}\right)$

To determine a threshold for values of $z_e$ that may be due to a real signal in the data, we bootstrap the distribution of $z_e$ as the mean of the difference of two Rayleigh distributions. This is the distribution of test statistics arising from the assumption that observed phase angles are uniform across trials for each participant, electrode, and condition. The 2.5 and 97.5% quantiles of this distribution were used to threshold the test statistic; we estimated

(47) $P\left[z_e(P{=}16, K{=}24) \leq 0.065\right] \approx 0.975$
(48) $P\left[z_e(P{=}39, K{=}132) \leq 0.018\right] \approx 0.975$
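These quantiles can be reproduced with a short bootstrap; a minimal sketch in R, assuming $R_{pe}$ follows the Rayleigh approximation of Equation 46 in each condition (function and variable names are ours):

ze_null <- function(P, K, n_boot = 10000) {
  s <- 1 / sqrt(2 * K)                   # Rayleigh scale under uniformity
  replicate(n_boot, {
    r1 <- s * sqrt(-2 * log(runif(P)))   # Rayleigh draws via inverse CDF
    r2 <- s * sqrt(-2 * log(runif(P)))
    mean(r1 - r2)                        # the test statistic z_e
  })
}
set.seed(1)
quantile(ze_null(P = 16, K = 24), c(0.025, 0.975))   # approximately +/- 0.065
quantile(ze_null(P = 39, K = 132), c(0.025, 0.975))  # approximately +/- 0.018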

With this setup defined, we can proceed with the cluster-based permutation algorithm. For 5000 iterations, $\Delta R_{pe}$ was calculated for each participant after randomly swapping condition labels. For each permuted dataset, the value of the test statistic was calculated and electrodes were thresholded. Clusters were formed based on the spatial proximity of thresholded electrodes in a normalised 2D map. The size of the largest cluster in each iteration (the sum of the absolute values of the test statistics $z_e$ of all electrodes within the cluster) was appended to the null distribution. Finally, this procedure was replicated without permuting the data to calculate the value of the test statistic for the observed data. Clusters identified in the non-shuffled data with a sum of test statistics greater than the 95th percentile of the approximated null distribution were marked as significant. These appear as filled points in the plots.

Appendix 5

True discovery rates

It is important to quantify how well the Bayesian model can correctly detect a true difference in mean resultant length. We simulated from the generative model 500 times for two conditions over a range of participant–trial pairs. From this simulation experiment, we conclude that at lower participant numbers the Bayesian model detects a true difference much more consistently than the ITPC. The disparity between the two disappears once enough data have been included. From the plot, we can see that the type 2 error is approximately 25% and appears to be slowly decreasing with data size. This is as expected: more data should increase the power of the ITPC and, by similar reasoning, should also benefit the Bayesian analysis.

Appendix 5—figure 1
True discovery rates.

A summary, over more participant and trial numbers, of the results shown in Figure 12A. For five participants, doubling the number of trials still provides insufficient information for the inter-trial phase coherence (ITPC) and the resulting significance test to conclude that a difference exists in any simulation. Bars show the 95% confidence intervals on this measure, estimated through bootstrap.

Appendix 6

False discovery rates

Alongside our comparison of the true discovery rates of the Bayesian model and the ITPC, we also looked at the false discovery, or type 1 error, rates. This required a slight change in the generative model for the data, namely setting the true value of $R_c$ equal in both fictive condition groups. Our simulation showed that the false discovery rates of the two methods are considerably different and depend on data size. It is immediately clear from the plot below that the Bayesian approach outperforms the ITPC in almost every combination of participant and trial number.

To investigate what was driving this discrepancy between the two approaches, we compared the posterior distribution to a bootstrapped sampling distribution of the mean difference for simulated datasets where the Bayesian model gives a true negative but the ITPC a false positive. This highlighted that the Bayesian estimate is the more conservative one, likely because it takes more sources of variation in the data into account when forming its conclusion about the difference. The ITPC was overconfident in its estimates compared to its Bayesian counterpart; its paradoxical trend of increasing false-positive rates with increasing data size is due to already overconfident conclusions being made worse. The statistical test only operates on a summary statistic of the data; it does not know about the number of trials, or even the number of electrodes, whereas the Bayesian model does. Increasing the number of participants at a fixed trial number, rather than increasing trials at a fixed participant number, helps reduce the type 1 error rate of the ITPC because this is the only dimension that can effectively inform the test about variation in the population.

Appendix 6—figure 1
False discovery rates.

The frequentist approach to inter-trial phase coherence (ITPC) differences using a paired Wilcoxon signed-rank test has a type 1 error that increases with both participant and trial number. Bars show the 95% confidence intervals on this measure estimated through bootstrap.

Appendix 7

Simulation-based calibration

To generate the set of rank statistics, we iterated N=2000 times, taking L=1023 post-warm-up samples at each iteration. This results in a distribution of 2000 rank statistics: in this case, integers in the interval [0, 1023]. Neighbouring ranks were then binned together to reduce the total number of bins to 32, a trade-off between variance reduction and the sensitivity of the histogram (Talts et al., 2018). The horizontal lines on the histogram mark the (0.005, 0.5, 0.995) quantiles of the binomial distribution that describes the variation expected of the counts in any of the 32 rank-bins after N iterations:

(49) Binomial(n=2000,p=1/32)
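A minimal sketch in R of this binning and of the binomial bounds; the ranks below are uniform placeholders, whereas in the real check they come from the SBC iterations:

set.seed(1)
ranks <- sample(0:1023, 2000, replace = TRUE)   # placeholder rank statistics
bins <- table(cut(ranks, breaks = seq(-0.5, 1023.5, length.out = 33)))
qbinom(c(0.005, 0.5, 0.995), size = 2000, prob = 1/32)  # Equation 49 quantiles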

Appendix 8

Participant posteriors

In a frequency-tagged experiment, it may be of interest to look for effects of condition in individual participants. From posterior samples, it is possible to construct participant-specific posteriors over the mean resultant length, or its difference between conditions, in a similar manner to the calculation for electrodes in Equation 22:

(50) $\Delta R_p = R_{c_1 p} - R_{c_2 p}$

where $c_1$ = EXP, $c_2$ = BL, and

(51) $R_{cp} = 1 - \sigma(\alpha_c + \beta_{pc})$
Appendix 8—figure 1
Participant posteriors.

Highest density intervals containing 99% of the posterior probability for each participant and frequency over the difference in mean resultant length EXP-BL.
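A minimal sketch in R of Equations 50-51, assuming posterior draws are stored as arrays alpha[s, c] and beta[s, p, c]; the array names, shapes, and placeholder draws are illustrative, not from our codebase:

set.seed(1)
S <- 4000; P <- 20; C <- 2                   # draws, participants, conditions
alpha <- matrix(rnorm(S * C), S, C)          # placeholder posterior draws
beta  <- array(rnorm(S * P * C), c(S, P, C))
sigmoid <- function(x) 1 / (1 + exp(-x))
R_cp <- function(c, p) 1 - sigmoid(alpha[, c] + beta[, p, c])  # Equation 51
dRp <- R_cp(1, 5) - R_cp(2, 5)               # Equation 50 for participant 5
quantile(dRp, c(0.005, 0.995))  # central 99% interval (the paper reports HDIs)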

Appendix 9

Headcap variation

In Figures 6B and 7C, headcap plots are constructed by interpolating posterior means at each electrode across the skull. This is useful for summarising the result; however, it ignores the joint behaviour of the posterior and how its uncertainty describes a range of similar, but different, responses. The plot below shows 25 samples from the AN-AV posterior distribution. Each headcap has been normalised through local z-scoring to prevent large magnitude differences from masking individual behaviour.

Appendix 9—figure 1
Joint headcap posterior.

Here we visualise uncertainty in the posterior over the difference AN-AV and how it captures a range of plausible activity patterns. As expected, samples demonstrate variation about the mean shown in Figure 6B of a right parietal and left temporal activation. AN: adjective–noun; AV: adjective–verb.

Data availability

This manuscript is a computational study, so no data have been generated. All modelling code for this study is available from GitHub (also provided in Appendix 2). The statistical learning dataset used as a case study in this paper is available from OSF (contact author EZ Golumbic for data-related correspondence).

The following previously published data sets were used
    1. Burroughs A
    2. Kazanina N
    3. Houghton C
    (2020) Zenodo
    Grammatical category and the neural processing of phrases - EEG data.
    https://doi.org/10.5281/zenodo.4385970
    1. Pinto D
    2. Golumbic EZ
    (2021) Open Science Framework
    ID syn3h. Assessing the sensitivity of EEG-based frequency-tagging as a metric for statistical learning.

References

Abeles M (1982) Role of the cortical neuron: integrator or coincidence detector? Israel Journal of Medical Sciences 18:83–92.

Betancourt M, Girolami M (2015) Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications 79:2–4. https://doi.org/10.1201/b18502

Ge H, Xu K, Ghahramani Z (2018) Turing: a language for flexible probabilistic inference. International Conference on Artificial Intelligence and Statistics, PMLR. pp. 1682–1690.

Hoffman MD, Gelman A (2014) The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15:1593–1623.

Jakobovits L (1962) Effects of repeated stimulation on cognitive aspects of behavior: some experiments on the phenomenon of semantic satiation. PhD thesis.

Picton TW, Vajsar J, Rodriguez R, Campbell KB (1987) Reliability estimates for steady-state evoked potentials. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section 68:119–131. https://doi.org/10.1016/0168-5597(87)90039-6

Decision letter

  1. Andrea E Martin
    Reviewing Editor; Max Planck Institute for Psycholinguistics, Netherlands
  2. Joshua I Gold
    Senior Editor; University of Pennsylvania, United States
  3. Benedikt Zoefel
    Reviewer; CNRS, France

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Bayesian analysis of phase data in EEG and MEG" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Reviewing Editor Andrea Martin and Senior Editor Joshua Gold. The following individual involved in the review of your submission has agreed to reveal their identity: Benedikt Zoefel (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) All reviewers agree that a more extensive quantitative demonstration of the advantages of your methodological approach compared to existing approaches is needed. For example, first, quantifying the advantage of your analysis over the ITPC analysis in the manuscript would be more convincing than the current graphical approach.

Furthermore, it seems that using a simulation approach could be helpful. Simulation of common experimental and data situations, as well as extreme or tough cases where traditional methods run into problems (but your method does not, or is more robust), could be persuasive and help make the impact of the approach more demonstrable and quantifiable.

2) Comparison of data – More thorough and extensive quantitative comparison of the performance of your method compared to existing approaches, as all Reviewers mention, could be carried out on multiple (open) datasets of various sample sizes.

3) Reviewer 3 gives helpful concrete suggestions and concerns regarding the impact of this method for statistical inference (viz., mixed models). These, too, need to be addressed, ideally also quantitatively, but could also be addressed formally/mathematically.

4) Reviewer 1 helpfully explains how the perspective of experimentalists needs to be taken into account in order for the work to have more impact. Similar to (3) above.

Reviewer #1 (Recommendations for the authors):

The study by Dimmock et al. proposes a Bayesian approach to measuring phase coherence. Although I'm familiar with the kind of EEG data analyzed here, I didn't figure out the purpose of the study. It seems like the aim of the study is neither to provide a more powerful statistical test nor to demonstrate some new neural phenomena. The only purpose seems to provide a Bayesian test, but why do we want it?

If the aim is to provide a more powerful test, it should be compared to classic tests for steady-state responses, such as the ones described in the following article.

Picton, Terence W., et al. "The use of phase in the detection of auditory steady-state responses." Clinical Neurophysiology 112.9 (2001): 1698-1711.

The current article is certainly not written in a way that can be understood by an experimentalist. It doesn't matter too much if the methods are hard to follow, but it does matter if no interpretable results are shown. For example, the authors argue that the topographic plots using the new method have a clearer structure than the traditional ones. As an experimentalist, however, I can't figure out which structure is clearer and why it helps to answer scientific questions.

As a methodological paper, testing the method on multiple datasets is a basic requirement. More importantly, the method has to have a clear goal and clearly demonstrate how the goal is achieved.

Reviewer #2 (Recommendations for the authors):

This paper presents a novel Bayesian approach to testing for phase coherence in neurophysiological recordings. The approach is centred on probability distributions and therefore allows for more fine-grained conclusions about the existence of such phase consistency, in contrast to the often artificial yes/no decision on the acceptance of the alternative hypothesis that can be found in the literature.

I find this manuscript well written and the rationale well explained. The authors demonstrate that their approach can produce similar, but potentially clearer and less noisy results as compared to more commonly applied techniques (such as inter-trial coherence). It remains difficult to quantify differences between the two approaches (Bayesian vs frequentist) – for instance, the authors write that "these graphs [from Bayesian analysis] show a clearer structure than the corresponding ITPC analysis" without providing a quantification of the difference.

Together, this paper will be useful to the community, possibly opening up new ways of analysing phase-locked neural responses.

Reviewer #3 (Recommendations for the authors):

This paper proposes a Bayesian take on the inter-trial phase coherence (ITPC) used to estimate how consistent the oscillatory phase is across trials for a given condition of interest. For standard ITPC the statistical power of the comparisons on the group level is determined by the sample size of the dataset since estimates are derived by first averaging across trials to derive a single condition-level estimate per subject. The advantage of the proposed Bayesian approach is that the resulting model is more robust as it is estimated from the trial level without averaging. It also allows us to derive subject-level estimates (slopes) and explore subject-variable noise. The authors illustrated this by replicating the ITPC analysis from the paper by Burroughs et al. (2021a) using the Bayesian ITPC and demonstrating perceivable noise reduction in the resulting estimates across frequencies and topographical EEG plots. Another key advantage of this method, as illustrated by the authors, is the ability to generate stable estimates for much smaller EEG datasets. While the authors show that Bayesian ITPC can replicate the findings obtained with the standard ITPC, it is not clear what advantages the proposed Bayesian approach offers over other previously proposed methods that allow for trial-level modelling such as linear mixed effects modelling. Secondly, a broader and more accessible description of the steps of the model settings, estimation, and the derivation of the summary statistics should be provided to enable the reader to replicate this method for their own dataset.

Abstract

1) lines 12-17 please consider re-phrasing as the message here is not very clear. Please be more specific (based on your analysis findings) what Bayesian approach to coherence contributes more than the traditional one? 'More descriptive' and 'data-efficient' are vague descriptions.

Introduction

2) Lines 26-44. Here to help the readers I would recommend communicating your main point early in the paragraph – that measurement of coherence is an important methodological tool in M/EEG research that is used to answer a wide variety of scientific questions, yet there is room for improvement in how ITPC is estimated.

3) Line 84 – 'this plots', instead of 'this graphs'

4) Lines 96-107 – the main message from this section is not clear. Do authors argue that in the per-electrode ITPC approach the Bonferroni correction for multiple comparisons precludes finding meaningful spatial patterns in the data? In such cases, Bonferroni is rarely used, and spatial cluster-based permutation is a typically used approach that is less conservative and allows the finding of significant clusters of spatially connected electrodes.

5) Lines 126-128 – please unpack a bit more what is meant by 'a better description of the data' and 'a narrative phrased in terms of models and their consequence'.

6) Line 161 – here you mean Figure 4?

Methods section

7) Authors provide a detailed explanation and mathematical descriptions for the distributions from which the phase data can be modelled, and parameters are sampled when building up a Bayesian model of the ITPC. The supplementary materials then detail equations behind the full model used. Yet from these two sources of information, it is challenging for the reader to reconstruct the set of steps authors took to derive the results they plot in Figure 5. If the aim of the paper is to have the reader use the Bayesian approach to ITPC for their own datasets a more accessible step-by-step description of the model estimation is necessary – from calculating participant and electrode slopes/estimates to averaging steps used to produce Figure 5. This can be done by expanding relevant sections in the Methods.

8) Other methods such as Linear mixed models that likewise allow trial-level analysis and model subject slopes have been broadly applied to the EEG data and also ITPC. To increase the contribution of this paper, authors need to outline and demonstrate analytically the advantages of the Bayesian approach over these other non-Bayesian methods.

Discussion section

9) The section Model design choices seem to belong in the Results and not the Discussion section.

10) The Data efficiency section is very helpful in demonstrating the advantage of the Bayesian approach for smaller datasets. This section can be expanded by demonstrating further key advantages of the Bayesian approach over other non-Bayesian methods that use a trial-level approach (as proposed in point 8).

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Bayesian analysis of phase data in EEG and MEG" for further consideration by eLife. Your revised article has been evaluated by Joshua Gold (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewers 2 and 3 are satisfied with your revisions, however, given that eLife works on consensus and the fact that Reviewer 1 is not satisfied with a major concern of theirs from the first round of review, we ask that you directly address Reviewer 1's queries thoroughly. This includes those concerns regarding interpretability for experimentalists, and specifically, that you compare your method to the classic tests for steady-state responses as the Reviewer suggests. Please pay close attention to each of Reviewer 1's queries and address them in full.

Reviewer #1 (Recommendations for the authors):

1. The authors did not address all my comments and I copied them here.

If the aim is to provide a more powerful test, it should be compared to classic tests for steady-state responses, such as the ones described in the following article.

Picton, Terence W., et al. "The use of phase in the detection of auditory steady-state responses." Clinical Neurophysiology 112.9 (2001): 1698-1711.

The current article is certainly not written in a way that can be understood by an experimentalist. It doesn't matter too much if the methods are hard to follow, but it does matter if no interpretable results are shown. For example, the authors argue that the topographic plots using the new method have a clearer structure than the traditional ones. As an experimentalist, however, I can't figure out which structure is clearer and why it helps to answer scientific questions.

2. I'm glad that the authors included a new dataset in the analysis. However, as an experimentalist, I still cannot see why the new method outperforms the traditional ITPC analysis in the newly added experiment. For the section "Case study – statistical learning for an artificial language", we need at least a few conclusions, explicitly stating whether the new method or the traditional method gives a more reasonable result and why.

3. Simulation is also important. However, I can't really understand the "Simulation study" section. What is exactly R1 or R2? Why do we care about the bias? A more helpful simulation is probably just to simulate the time-domain EEG signal (e.g., using sinusoids and noise) and demonstrate that the new method, e.g., can yield statistical significance with fewer subjects.

"We then use this modified model to generate fictive datasets with different numbers of participants and trials", but where are the results? It seems like Figure 11 does not show how the results change with the number of participants and trials.

For the new section on "Data efficiency", why just one dataset and why only 4 participants? Testing two datasets and all possible numbers of participants are minimal requirements. Also, as an experimentalist, I really cannot understand what is shown in Figure 12.

4. "the power is not a useful measure. Instead, the typical approach to frequency-tagged data for cognitive tasks is to use the inter-trial phase coherence." In fact, most of the studies cited in the introduction used power rather than phase analysis.

5. "The Bayesian approach is more descriptive than traditional statistical approaches: it is a generative model of how the data arises and each component is interpretable and informative about data characteristics."

It's great. But why is the method more interpretable? Could you please summarize it in a way that can be understood by experimentalists?

"It is also more data-efficient: it detects stimulus-related differences for smaller participant numbers than the standard approach."

How is this demonstrated in the two datasets? Is there a guideline about how many participants can be saved using the new approach?

https://doi.org/10.7554/eLife.84602.sa1

Author response

Essential revisions:

Reviewer #1 (Recommendations for the authors):

The study by Dimmock et al. proposes a Bayesian approach to measuring phase coherence. Although I'm familiar with the kind of EEG data analyzed here, I didn't figure out the purpose of the study. It seems like the aim of the study is neither to provide a more powerful statistical test nor to demonstrate some new neural phenomena. The only purpose seems to provide a Bayesian test, but why do we want it?

If the aim is to provide a more powerful test, it should be compared to classic tests for steady-state responses, such as the ones described in the following article.

Picton, Terence W., et al. "The use of phase in the detection of auditory steady-state responses." Clinical Neurophysiology 112.9 (2001): 1698-1711.

The current article is certainly not written in a way that can be understood by an experimentalist. It doesn't matter too much if the methods are hard to follow, but it does matter if no interpretable results are shown. For example, the authors argue that the topographic plots using the new method have a clearer structure than the traditional ones. As an experimentalist, however, I can't figure out which structure is clearer and why it helps to answer scientific questions.

As a methodological paper, testing the method on multiple datasets is a basic requirement. More importantly, the method has to have a clear goal and clearly demonstrate how the goal is achieved.

We agree with this comment: our paper presents a novel approach to the analysis of phase data and so should be tested on additional datasets. Stress-testing a novel method is paramount to finding where it excels and perhaps where it does not, and is a necessary step to both increase confidence in the method and promote its adoption.

In our improved manuscript we have provided a detailed analysis of another dataset from a frequency-tagged experiment that investigated the role of statistical learning in an artificial language (Pinto et al., 2022). As with the first dataset (Burroughs et al., 2021a, 2021b), we use our Bayesian model to compute statistics equivalent to those of the frequentist approach and compare the results. Please see: Materials and methods > Data for a description of this new data, and Case study – statistical learning for an artificial language, for the results.

Additionally, we analysed the performance of our Bayesian model through a simulation study. This compared the two approaches on the following quantities:

– reported discovery of a difference when it exists (true-positive)

– reported discovery of a difference when it does not exist (false-positive)

We found that the Bayesian model can not only detect a true difference in mean resultant length between conditions with less data than the ITPC approach, but also has a greatly reduced false-positive rate. These results are described in Simulation study, and appendices 5-6.

As a consequence of this simulation study, we also discovered that our Bayesian model reduces bias in the estimation of the mean resultant length. The value as calculated in the ITPC analysis is positively biased (Kutil, 2012); however, as demonstrated by our simulation study, the Bayesian estimates do not show the same systematic bias as the ITPC. To add rigour to this claim, we also provide, through simulation-based calibration (Talts et al., 2018), evidence of unbiased estimation by the Bayesian model.

Reviewer #2 (Recommendations for the authors):

This paper presents a novel Bayesian approach to testing for phase coherence in neurophysiological recordings. The approach is centred on probability distributions and therefore allows for more fine-grained conclusions about the existence of such phase consistency, in contrast to the often artificial yes/no decision on the acceptance of the alternative hypothesis that can be found in the literature.

I find this manuscript well written and the rationale well explained. The authors demonstrate that their approach can produce similar, but potentially clearer and less noisy results as compared to more commonly applied techniques (such as inter-trial coherence). It remains difficult to quantify differences between the two approaches (Bayesian vs frequentist) – for instance, the authors write that "these graphs [from Bayesian analysis] show a clearer structure than the corresponding ITPC analysis" without providing a quantification of the difference.

Together, this paper will be useful to the community, possibly opening up new ways of analysing phase-locked neural responses.

Thank you for this encouraging review of our paper. As you have pointed out, it can be difficult to compare Bayesian and frequentist results, and this is certainly something we experienced throughout this project. Instead of strict one-to-one comparisons, which may not always be possible and which, in any case, would be against the spirit of what a Bayesian analysis attempts to do, we pursued a qualitative approach, comparing the methods based on the description of the data they provide in their conclusions. For example, confidence intervals are not comparable to highest density intervals (HDIs); nevertheless, their respective applications are similar: to determine an interval and check its support for a hypothesis. We feel that one of the contributions of our manuscript is to introduce figure types which are analogous to traditional frequentist results figures while keeping to the descriptive, rather than hypothesis-based, spirit of Bayesian inference; examples of this are figures 6A, 10A and 12B.

Reviewer #3 (Recommendations for the authors):

This paper proposes a Bayesian take on the inter-trial phase coherence (ITPC) used to estimate how consistent the oscillatory phase is across trials for a given condition of interest. For standard ITPC the statistical power of the comparisons on the group level is determined by the sample size of the dataset since estimates are derived by first averaging across trials to derive a single condition-level estimate per subject. The advantage of the proposed Bayesian approach is that the resulting model is more robust as it is estimated from the trial level without averaging. It also allows us to derive subject-level estimates (slopes) and explore subject-variable noise. The authors illustrated this by replicating the ITPC analysis from the paper by Burroughs et al. (2021a) using the Bayesian ITPC and demonstrating perceivable noise reduction in the resulting estimates across frequencies and topographical EEG plots. Another key advantage of this method, as illustrated by the authors, is the ability to generate stable estimates for much smaller EEG datasets. While the authors show that Bayesian ITPC can replicate the findings obtained with the standard ITPC, it is not clear what advantages the proposed Bayesian approach offers over other previously proposed methods that allow for trial-level modelling such as linear mixed effects modelling. Secondly, a broader and more accessible description of the steps of the model settings, estimation, and the derivation of the summary statistics should be provided to enable the reader to replicate this method for their own dataset.

Thank you for this clear summary of our manuscript. While we are pleased that we have been able to convey the central results, we would like to improve on this by addressing your points.

We appreciate that the communication of the Bayesian model requires extensive mathematical treatment. Due to the nature of the model, the sequence of transformations that are required to change raw values to quantities of interest can become complicated. To help improve clarity, we have provided further detail on how different quantities are related to each other (see point 7 below).

We do not draw a strong distinction between our Bayesian model and linear mixed models (LMMs). Our Bayesian model is, at its core, an LMM, with fixed effects for condition and random effects for participants and electrodes. However, the power of the estimation method allows for increased model complexity, such as a custom link function and a wrapped distribution for the likelihood. For example, brms (Bürkner, 2017) is a package based on Stan (Carpenter et al., 2017) that aims to provide greater flexibility for multilevel or mixed modelling than maximum likelihood based approaches such as lme4 (Bates et al., 2015), all while using a near-identical syntax for model specification.

Abstract

1) lines 12-17 please consider re-phrasing as the message here is not very clear. Please be more specific (based on your analysis findings) what Bayesian approach to coherence contributes more than the traditional one? 'More descriptive' and 'data-efficient' are vague descriptions.

It is important for us to communicate the proposed benefits of our Bayesian approach clearly and thoroughly. We have expanded on the abstract, taking these points into consideration.

“This Bayesian approach is illustrated using two examples from neurolinguistics and its properties are explored using simulated data. The Bayesian approach is more descriptive than traditional statistical approaches: it is a generative model of how the data arises and each component is interpretable and informative about data characteristics. It is also more data-efficient: it detects stimulus-related differences for smaller participant numbers than the standard approach.”

Introduction

2) Lines 26-44. Here to help the readers I would recommend communicating your main point early in the paragraph – that measurement of coherence is an important methodological tool in M/EEG research that is used to answer a wide variety of scientific questions, yet there is room for improvement in how ITPC is estimated.

Thank you for this recommendation. We agree, the main ideas of this manuscript had not been introduced adequately in the opening paragraphs. We have now changed the text to convey our motivations earlier.

“In an electroencephalography (EEG) or magnetoencephalography (MEG) frequency-tagged experiment the stimuli are presented at a specific frequency and the neural response is quantified at that frequency. This provides a more robust response than the typical event-related potential (ERP) paradigm because the response the brain makes to the stimuli occurs at the predefined stimulus frequency while noise from other frequencies, which will correspond to other cognitive and neurological processes, does not contaminate the response of interest. This quantification is often approached by calculating the inter-trial phase coherence (ITPC). Indeed, estimating coherence is an important methodological tool in EEG and MEG research and is used to answer a wide variety of scientific questions. There is, however, scope for improving how the phase coherence is measured by building a Bayesian approach to estimation. This is a per-item analysis which gives a better description of uncertainty. In contrast, the ITPC discards information by averaging across trials. As a demonstration, both approaches are compared by applying them to two different frequency-tagged experimental datasets and through the use of simulated data.”

3) Line 84 – 'this plots', instead of 'this graphs'

Thank you – “This plots the ITPC measure for all six experimental conditions, …”

4) Lines 96-107 – the main message from this section is not clear. Do authors argue that in the per-electrode ITPC approach the Bonferroni correction for multiple comparisons precludes finding meaningful spatial patterns in the data? In such cases, Bonferroni is rarely used, and spatial cluster-based permutation is a typically used approach that is less conservative and allows the finding of significant clusters of spatially connected electrodes.

This was an interesting point, and one that we had overlooked. Bonferroni correcting such a large number of electrodes was a very extreme case and, as you have mentioned, is avoided in practice in favour of a weaker, but fairer, cluster-based permutation test (CBPT). This has allowed us to form a more realistic comparison between the Bayesian and ITPC approaches; not a strictly one-to-one comparison, but nevertheless one that invites similar interpretations.

The headcap plots in figure 2B and figure 9B (for the additional dataset) now include significant clusters of electrodes marked on the skull. The corresponding Bayesian results, figures 6/10, mark electrodes that did not contain zero in their highest density interval. Text (lines 119-130) has also been added to the manuscript that discusses CBPT, how it applies to the ITPC, and how it compares with the Bayesian results.

5) Lines 126-128 – please unpack a bit more what is meant by 'a better description of the data' and 'a narrative phrased in terms of models and their consequence'.

In light of this comment we have expanded on the text:

“Furthermore, as a Bayesian approach, it supports a better description of the data, by describing a putative abstraction of the stochastic process that may have generated the data while explicitly stating the underpinning assumptions. This replaces a hypothesis-testing and significance-based account with a narrative phrased in terms of models and their consequence so, in place of an often contentious or Procrustean framework based on hypotheses, a Bayesian approach describes a putative model and quantifies the evidence for it.”

6) Line 161 – here you mean Figure 4?

Yes, thank you – Done.

Methods section

7) Authors provide a detailed explanation and mathematical descriptions for the distributions from which the phase data can be modelled, and parameters are sampled when building up a Bayesian model of the ITPC. The supplementary materials then detail equations behind the full model used. Yet from these two sources of information, it is challenging for the reader to reconstruct the set of steps authors took to derive the results they plot in Figure 5. If the aim of the paper is to have the reader use the Bayesian approach to ITPC for their own datasets a more accessible step-by-step description of the model estimation is necessary – from calculating participant and electrode slopes/estimates to averaging steps used to produce Figure 5. This can be done by expanding relevant sections in the Methods.

Thank you for this helpful comment. We agree that it is important for the reader to be clear on the series of steps taken to arrive at the plotted results. To improve clarity, we have added extra equations. Equations 12-13 have been added to show exactly how the circular variance relates to both the mean resultant length R and the scale γ of the wrapped Cauchy distribution. Equations 20-21 have been added to the Results section to make more apparent the calculation used to obtain the values plotted in figure 6 (previously figure 5). As an additional example of how one could go about extracting quantities of interest from the Bayesian model, we have provided a further example in appendix 8 that shows posteriors over differences in mean resultant length for each participant.

8) Other methods such as Linear mixed models that likewise allow trial-level analysis and model subject slopes have been broadly applied to the EEG data and also ITPC. To increase the contribution of this paper, authors need to outline and demonstrate analytically the advantages of the Bayesian approach over these other non-Bayesian methods.

Our Bayesian model is very similar to a linear mixed model (LMM); we are essentially estimating fixed and random effects corresponding to conditions, participants, and electrodes. However, it affords us much more flexibility than a typical LMM or GLMM. Because of this, we felt that the paper would not significantly benefit from including an LMM treatment of these data in addition to the current frequentist analysis.

While we agree that an LMM may be suitable for modelling the ITPC, we are unaware of how one could do this with phase angles, because this requires a simultaneous estimation of the mean resultant R and the circular variance S. As we have shown, these require a wrapped distribution. Modelling ITPC values, R_pce, which at the very least have been averaged over trials, is a reasonable approach, but restricted compared to our model, both in terms of expressiveness (there is no joint model for R and S) and data efficiency, since it works directly on a summary statistic of the data.

Discussion section

9) The section Model design choices seem to belong in the Results and not the Discussion section.

Yes, we agree. This section has now been moved to Results.

10) The Data efficiency section is very helpful in demonstrating the advantage of the Bayesian approach for smaller datasets. This section can be expanded by demonstrating further key advantages of the Bayesian approach over other non-Bayesian methods that use a trial-level approach (as proposed in point 8).

Providing examples where the Bayesian model outperforms the standard approach we are comparing it to is important. However, including LMMs in this analysis is out of scope. Instead, to address this point from a different perspective, we investigated the benefits of the Bayesian approach compared to the ITPC in a simulation study; this expands upon the data efficiency section to look at bias reduction, false-positive rates, and true-positive rates over various data sizes. We hope that this provides further interesting detail for the data-efficiency discussion.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Thank you for resubmitting your work entitled "Bayesian analysis of phase data in EEG and MEG" for further consideration by eLife. Your revised article has been evaluated by Joshua Gold (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewers 2 and 3 are satisfied with your revisions, however, given that eLife works on consensus and the fact that Reviewer 1 is not satisfied with a major concern of theirs from the first round of review, we ask that you directly address Reviewer 1's queries thoroughly. This includes those concerns regarding interpretability for experimentalists, and specifically, that you compare your method to the classic tests for steady-state responses as the Reviewer suggests. Please pay close attention to each of Reviewer 1's queries and address them in full.

Reviewer #1 (Recommendations for the authors):

1. The authors did not address all my comments and I copied them here.

We are sorry we did not address these comments; this was a mistake on our part. The comments in the original reviews which required the most additional research – the addition of a second data set and the simulation studies – distracted us from other important comments.

If the aim is to provide a more powerful test, it should be compared to classic tests for steady-state responses, such as the ones described in the following article.

Picton, Terence W., et al. "The use of phase in the detection of auditory steady-state responses." Clinical Neurophysiology 112.9 (2001): 1698-1711.

We have included a discussion of this approach in our introduction. This approach was developed for long stimulation by periodic stimuli recorded using a small number of scalp electrodes; our recordings are shorter, giving poorer frequency resolution, are less temporally homogeneous, and are recorded from multiple electrodes, all of which makes the classic tests unsuitable. The classic tests are best suited to the classic application of steady-state methods, with simple tone-like stimuli; the hope is that it is becoming possible to apply frequency-tagging to richer, more temporally complex stimuli. We write:

“There are other classical tests of coherence which use phase information. One example is the Rayleigh test [Rayleigh (1880), Rayleigh (1919)]; this test can be used to check for a significant departure either from uniformity or from the ‘expected phase’, that is, a particular phase angle specified by the researcher based on some other insight into the behaviour. Other tests, such as Hotelling’s T², apply jointly to phase and amplitude [Hotelling (1931); Picton et al. (1987, 2001)]. These classical approaches are incompatible with the neurolinguistic study presented here. Firstly, it would be difficult to provide an expected phase; as demonstrated in Figure 4, the mean phase angle is highly variable across participants. There is also no substantive prior information available that could supply this value because language paradigms vary from experiment to experiment. Secondly, because of the problem of semantic satiation, the experiments we consider here are relatively short and lack the frequency resolution these classical approaches require.”
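As an illustration of what such a classical test computes, a minimal implementation of the Rayleigh test for non-uniformity might look like the following; this is a generic sketch (the small-sample p-value correction is Zar's standard approximation), not code from the manuscript:

```python
import numpy as np

def rayleigh_test(phases):
    """Rayleigh test for departure from circular uniformity.

    phases : 1-D array of phase angles in radians.
    Returns the test statistic Z = n * R^2 and an approximate p-value.
    """
    n = len(phases)
    R = np.abs(np.mean(np.exp(1j * phases)))  # mean resultant length
    Z = n * R**2
    # Small-sample correction to p = exp(-Z) (Zar, Biostatistical Analysis)
    p = np.exp(-Z) * (1 + (2 * Z - Z**2) / (4 * n)
                      - (24 * Z - 132 * Z**2 + 76 * Z**3 - 9 * Z**4)
                      / (288 * n**2))
    return Z, float(np.clip(p, 0.0, 1.0))
```

Note that this tests a single sample of phases against uniformity; it does not by itself compare conditions, participants, or electrodes.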

Beyond these details, though, we believe that the Bayesian approach is better because it models the experiment. Ultimately, classical approaches depend on the researcher's skill in statistical analysis, whereas Bayesian inference provides a set of methods that make efficient use of the data by building a model of our scientific understanding of the process that produces the data. In other words, although these Bayesian approaches are unfamiliar, ultimately they will be easier for experimentalists to use! It seems to us that there are many intricate non-Bayesian “workarounds” for various data challenges, but in the long run these make the statistical framework harder and harder to understand, while a Bayesian approach, though unfamiliar, leads to more straightforward analyses. Even if this turns out not to be the case, it is a possibility that should be explored, and we feel we are contributing to that exploration.

The current article is certainly not written in a way that can be understood by an experimentalist. It doesn't matter too much if the methods are hard to follow, but it does matter if no interpretable results are shown. For example, the authors argue that the topographic plots using the new method have a clearer structure than the traditional ones. As an experimentalist, however, I can't figure out which structure is clearer and why it helps to answer scientific questions.

We are very grateful for this comment; we realise we have “under-sold” one of the biggest advantages of our approach: the Bayesian method is less inclined to produce attractive-looking but meaningless information. In frequentist statistics there is a danger of false discovery or, if that is corrected for, a risk of losing real evidence in an effort to remove fictive results. We have added a new figure focusing on a comparison of the topographic headmaps, both using the real data and using fictive data. It is clear that the Bayesian approach represents the underlying ground truth better. The new figure is Figure 8 and we write (lines 370 – 388):

“In Figure 2C we see that even for conditions, such as RR and RV, which contain no linguistic structure at the phrase rate, there are patterns of electrode activity in the topographic headcaps. In contrast, the analogous Bayesian headcaps in Figure 4C did not show similar patterns. We used simulated data to investigate whether the Bayesian model is correctly demonstrating that there is no phrase-level response for these conditions, rather than the other possibility: that the beguiling patterns seen in the ITPC headcaps represent a real activity invisible to the Bayesian analysis. In fact, the evidence points to the first alternative. Figure 8 presents evidence that the Bayesian model is more faithful to the data when there is no meaningful variation in electrode effects. Figure 8A shows the real data again; however, whereas previously the headcap was plotted for differences between conditions, here we fit directly to the RR condition. There is no effect visible for the Bayesian headcap, but for the ITPC headcap there are variations that may suggest localised activity, even though this condition does not have any structure at the phrase rate. In Figure 8B four datasets were simulated from the generative Bayesian model with electrode effects set to zero; the four simulations are marked as 1-4 in the figure. Except that there is only one condition, the simulated data mimic the real data: there are 16 participants, 32 electrodes and 24 trials. These simulations are intended to represent four different iterations of the same experiment; apart from differing in their random numbers, they are identical. The data resulting from these four simulations were fitted with both methods. Evidently, the Bayesian results are much closer to the ground truth. The ITPC results show variations that could easily be misinterpreted.”

We are pleased that the new figure so clearly demonstrates this advantage of our approach.
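To give a flavour of this kind of simulation, here is a short sketch in the same spirit: phases are drawn from a wrapped Cauchy whose scale varies across participants but with electrode effects set to zero, so the ground-truth ITPC is identical at every electrode, yet the finite-trial estimates still vary across electrodes. The data sizes match those quoted above, but the scale values and variable names are made up for illustration; this is not our actual generative model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_participants, n_electrodes, n_trials = 16, 32, 24

# Participant-level coherence varies; electrode effects are exactly zero,
# so the ground-truth ITPC is the same at every electrode.
gamma_p = rng.uniform(0.8, 1.5, size=n_participants)  # wrapped Cauchy scales

# Wrapping a Cauchy(0, gamma) variate mod 2*pi gives a wrapped Cauchy phase.
phases = (rng.standard_cauchy((n_participants, n_electrodes, n_trials))
          * gamma_p[:, None, None]) % (2 * np.pi)

# ITPC per participant and electrode, then averaged over participants.
itpc = np.abs(np.mean(np.exp(1j * phases), axis=-1)).mean(axis=0)

# Despite a flat ground truth, the estimates vary across electrodes:
print(f"ITPC across electrodes: {itpc.min():.3f} to {itpc.max():.3f}")
```

The spurious electrode-to-electrode spread printed here is exactly the kind of variation that, plotted on a headcap, could be misread as localised activity.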

Overall, we have tried to find ways to make a Bayesian analysis interpretable and believe this is a useful contribution on our part, for example, in Figure 6B we introduce a novel Bayesian equivalent to the typical “significance bracket” style of graph.

2. I'm glad that the authors included a new dataset in the analysis. However, as an experimentalist, I still cannot see why the new method outperforms the traditional ITPC analysis in the newly added experiment. For the session "Case study – statistical learning for an artificial language", we need at least a few conclusions, explicitly stating whether the new method or the traditional method gives a more reasonable result and why.

Thank you for this comment. We are pleased that you found the additional dataset useful, but regret that we did not make clear the respective performance of each method on it. To rectify this, we have added a new figure (Figure 14) that compares the two methods, in terms of both confidence and highest density intervals and of p-values and posterior probabilities, as a function of the number of trials in the data. This figure shows that although the methods perform similarly on the full data (compare Figure 10 with Figure 11), with a lower number of trials the Bayesian model is much better at detecting the signal in the data. Thus, although the original experiment was impressive in its sample size and, as a published data set, is an example where the traditional statistical framework produced a result, a smaller experiment would have failed to produce a signal in cases where the Bayesian analysis would have succeeded. We also write (lines 581-594):

“Figure 14 is similar to Figure 13; however, it uses the statistical learning dataset of Pinto et al. (2022), comparing conditions BL and EXP at the frequency 5.33Hz. In fact, for these data we saw little evidence that the Bayesian approach works better when the number of participants is reduced: we attribute this to the large number of trials; generally the extra efficiency of a Bayesian approach appears most apparent in low data regimes and the statistical learning data set is admirably well sampled. For this reason we used this data set to investigate data efficiency when the number of trials is reduced for a fixed number of participants. In Figure 14 data from the first 20 participants are considered and the analysis is repeated with different numbers of trials, discarding trials from the end. It is clear from Figure 14A that the Bayesian model can reliably detect the signal in the data with about half the number of trials that the frequentist approach requires; this is potentially useful, especially given the challenge semantic satiation poses to some neurolinguistic experiments. Figure 14B compares the p-values arising from the significance test with P(∆R < 0) calculated from the posterior and shows the fast convergence of the posterior to the signal; the p-value converges much more slowly and is also more variable across trial numbers.”
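For readers wanting to reproduce these summaries from their own fits, both the posterior probability and the highest density interval are simple functions of the posterior draws. A minimal sketch, where delta_R is a hypothetical array of MCMC samples of the between-condition difference in mean resultant length:

```python
import numpy as np

def posterior_summary(delta_R, prob=0.95):
    """Posterior probability of a negative difference, plus the HDI.

    delta_R : 1-D array of posterior draws of the difference in
              mean resultant length between two conditions.
    """
    delta_R = np.asarray(delta_R)
    p_neg = float(np.mean(delta_R < 0))   # Monte Carlo P(delta_R < 0 | data)
    s = np.sort(delta_R)
    m = int(np.floor(prob * len(s)))
    widths = s[m:] - s[:len(s) - m]       # all intervals holding `prob` mass
    i = int(np.argmin(widths))            # the shortest such interval is the HDI
    return p_neg, (s[i], s[i + m])
```

Unlike a p-value, p_neg here is a direct statement about the quantity of interest given the data and the model.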

3. Simulation is also important. However, I can't really understand the "Simulation study" section. What is exactly R1 or R2? Why do we care about the bias? A more helpful simulation is probably just to simulate the time-domain EEG signal (e.g., using sinusoids and noise) and demonstrate that the new method, e.g., can yield statistical significance with fewer subjects.

R1 and R2 refer to the mean resultant lengths of the two conditions, 1 and 2, in the simulation study; we had failed to say this and have now fixed the omission.

"We then use this modified model to generate fictive datasets with different numbers of participants and trials", but where are the results? It seems like Figure 11 does not show how the results change with the number of participants and trials.

The results for different participants and trials were given in appendices 6 and 7; however, in light of this point, we felt that the structure of this section was not helpful. It has now been re-arranged to flow in a more obvious way, and the importance of appendices 6 and 7 to the simulation results has been made clearer in the text (lines 467-476).

For the new section on "Data efficiency", why just one dataset and why only 4 participants? Testing two datasets and all possible numbers of participants are minimal requirements. Also, as an experimentalist, I really cannot understand what is shown in Figure 12.

We go to a minimum of five participants to accommodate the parameters in the Bayesian model that estimate participant variance. With too few participants these parameters will not be informed. Similarly, this is why a minimum number of groups is recommended for random-effects models: it is important to obtain a reasonable estimate of their variance.

Our previous Figure 12 (now adapted into Figure 13) aims to show that the Bayesian model can detect a difference between conditions using fewer participants. We compare confidence intervals of widths 90%, 95%, and 99.6% to Bayesian highest density intervals of the same widths. Depending on how conservative the analyst is about their significance level α, small to large differences arise between the methods in the number of participants they require to convincingly detect the signal in the data.

4. "the power is not a useful measure. Instead, the typical approach to frequency-tagged data for cognitive tasks is to use the inter-trial phase coherence." In fact, most of the studies cited in the introduction used power rather than phase analysis.

That is a good point; while power is sufficient for some tasks, the ITPC becomes necessary for more difficult cognitive tasks. For example, power is not used in neurolinguistics since it rarely succeeds in identifying a signal. We have adjusted our text to reflect this:

“the induced power [...] does not work: empirically this proves too noisy a quantity for the stimulus-dependent signal to be easily detected. Indeed, although the frequency tag produces a more robust signal than an ERP, for more high-level or cognitive tasks, particularly neurolinguistic tasks, where frequency-tagging is now proving valuable, the power is not a useful measure; more needs to be done to remove the noise. Typically this is done by assuming the response is phase locked to the stimulus, and so for frequency-tagged data in cognitive tasks it is common to use the inter-trial phase coherence.”
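To make the distinction concrete, the two quantities can be computed side by side from single-trial Fourier coefficients. This is a generic sketch assuming epoched data and a known tag frequency; the function and argument names are ours for illustration, not from any particular toolbox:

```python
import numpy as np

def power_and_itpc(trials, fs, f_tag):
    """Power and inter-trial phase coherence at a tagged frequency.

    trials : array of shape (n_trials, n_samples), one EEG epoch per row
    fs     : sampling rate in Hz
    f_tag  : stimulation (tag) frequency in Hz
    """
    coeffs = np.fft.rfft(trials, axis=1)
    freqs = np.fft.rfftfreq(trials.shape[1], d=1 / fs)
    k = int(np.argmin(np.abs(freqs - f_tag)))    # bin nearest the tag frequency
    power = np.mean(np.abs(coeffs[:, k]) ** 2)   # mean power over trials
    phases = np.angle(coeffs[:, k])
    itpc = np.abs(np.mean(np.exp(1j * phases)))  # mean resultant length
    return power, itpc
```

The ITPC discards the amplitude of each trial's Fourier coefficient and keeps only its phase, which is why it can recover a phase-locked response that is invisible in the noisy power.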

5. "The Bayesian approach is more descriptive than traditional statistical approaches: it is a generative model of how the data arises and each component is interpretable and informative about data characteristics."

It's great. But why is the method more interpretable? Could you please summarize it in a way that can be understood by experimentalists?

We don’t believe the current approaches are easy to interpret; the issue of how to correct for multiple comparisons is very fraught, for example, and there is no real clarity as to what the appropriate correction is. In the case of the ITPC there are many different standards used in claiming a response is significant, based on comparison with simulated data or with data at other nearby frequencies; it is very hard to decide what the correct approach is, or how to interpret results that appear significant under one standard but not another. As another ITPC example, two conditions are often compared by t-test even though, since the ITPC is bounded by zero and one, the data cannot be Gaussian; interpreting the Gaussianity of data that passes a test of Gaussianity when it cannot actually be Gaussian is difficult! In contrast, a Bayesian analysis makes a more modest claim: it gives a description of a model and estimates the posterior of the model parameters based on the data. The familiarity we feel for the traditional methods is the result of years of exposure; we cannot hope to make the Bayesian approach equally familiar to experimentalists in a single paper, but we hope our use of multiple figures, two real data sets and two simulated data sets will be a step towards that goal.

"It is also more data-efficient: it detects stimulus-related differences for smaller participant numbers than the standard approach."

How is this demonstrated in the two datasets? Is there a guideline about how many participants can be saved using the new approach?

We demonstrate the data efficiency of the Bayesian model on the two datasets in Figure 13 and Figure 14 respectively. Figure 13 shows that the number of participants can be reduced when using the Bayesian model, with the amount depending on how cautious the analyst wants to be regarding false positives. Figure 14 is a new addition that looks at this problem as a function of trials instead of participants. There it is shown that the Bayesian model detects the signal in the data with approximately half the number of trials required by the ITPC. Please also see our responses to comments 2 and 3b.

https://doi.org/10.7554/eLife.84602.sa2

Article and author information

Author details

  1. Sydney Dimmock

    Faculty of Engineering, University of Bristol, Bristol, United Kingdom
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Visualization
    For correspondence
    sd14814@bristol.ac.uk
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-0163-2048
  2. Cian O'Donnell

    1. Faculty of Engineering, University of Bristol, Bristol, United Kingdom
    2. School of Computing, Engineering & Intelligent Systems, Ulster University, Derry/Londonderry, United Kingdom
    Contribution
    Writing – original draft, Writing – review and editing, Methodology, Project administration, Supervision, Visualization
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-2031-9177
  3. Conor Houghton

    Faculty of Engineering, University of Bristol, Bristol, United Kingdom
    Contribution
    Conceptualization, Funding acquisition, Writing – original draft, Writing – review and editing, Methodology, Project administration, Supervision, Software, Visualization
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-5017-9473

Funding

Leverhulme Trust (RF-2021-533)

  • Conor Houghton

Medical Research Council (MR/S026630/1)

  • Cian O'Donnell

Engineering and Physical Sciences Research Council (EP/R513179/1)

  • Sydney Dimmock

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

CH is a Leverhulme Research Fellow (RF-2021-533). We would also like to acknowledge funding from the MRC (MR/S026630/1 to COD) and an EPSRC Doctoral Training Partnership (EP/R513179/1) award to SD. We would like to recognise the discussion 'divergence treedepth issues with unit vector' that highlighted some of the difficulties involved with sampling directional statistics and potential ways to ameliorate them in our model.

Senior Editor

  1. Joshua I Gold, University of Pennsylvania, United States

Reviewing Editor

  1. Andrea E Martin, Max Planck Institute for Psycholinguistics, Netherlands

Reviewer

  1. Benedikt Zoefel, CNRS, France

Version history

  1. Preprint posted: October 17, 2022
  2. Received: October 31, 2022
  3. Accepted: September 11, 2023
  4. Accepted Manuscript published: September 12, 2023 (version 1)
  5. Accepted Manuscript updated: September 20, 2023 (version 2)
  6. Version of Record published: October 20, 2023 (version 3)

Copyright

© 2023, Dimmock et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Sydney Dimmock, Cian O'Donnell, Conor Houghton (2023) Bayesian analysis of phase data in EEG and MEG. eLife 12:e84602. https://doi.org/10.7554/eLife.84602

