Abstract
Markergene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased toward detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues toward truly quantitative and reproducible metagenomics measurements.
https://doi.org/10.7554/eLife.46923.001Introduction
Markergene and metagenomic sequencing (jointly, MGS) have transformed the study of biological communities. Extracting and sequencing total DNA from a community can identify thousands of taxa along with their genes and potential functions, while sequencing a phylogenetic marker gene (e.g. 16S rRNA) can quantify taxon abundances (Li, 2015; Quince et al., 2017). MGS measurements of microbial communities are yielding fundamental new insights into the structure and dynamics of microbial ecosystems and the roles of microbes as drivers of host and ecosystem health (Zeevi et al., 2015; Graham et al., 2016; Knight et al., 2017; Callahan et al., 2017; Lehman et al., 2015). Applications of MGS, often under the alternative terms eDNA sequencing or metabarcoding, increasingly extend beyond microbes to the measurement and monitoring of plants, insects, and vertebrates (Bell et al., 2019; Krehenwinkel et al., 2017; Thomas et al., 2016). MGS methods are now being adopted in fields ranging from food safety (Cocolin et al., 2018) to wastewater remediation (Rosso et al., 2018) to forensics (Metcalf et al., 2017) along with biology and medicine. Unfortunately, however, the community compositions measured by MGS are wrong.
MGS measurements are biased: The measured relative abundances of the taxa and genes in the sample are systematically distorted from their true values (Brooks, 2016; Sinha et al., 2017). Bias arises because each step in an experimental MGS workflow preferentially measures (i.e. preserves, extracts, amplifies, sequences, or bioinformatically identifies) some taxa over others (Brooks, 2016; Hugerth and Andersson, 2017; Pollock et al., 2018). For example, bacterial species differ in how easily they are lysed and therefore how much DNA they yield during DNA extraction (Morgan et al., 2010; Costea et al., 2017), and they differ in their number of 16S rRNA gene copies and thus how much PCR product we expect to obtain per cell (Kembel et al., 2012). Most sources of bias are protocoldependent: Different PCR primers preferentially amplify different sets of taxa (Sipos et al., 2007), different extraction protocols can produce 10fold or greater differences in the measured proportion of a taxon from the same sample (Costea et al., 2017), and almost every choice in an MGS experiment has been implicated as contributing to bias (D'Amore et al., 2016; Hugerth and Andersson, 2017; Sinha et al., 2017; Pollock et al., 2018). Every MGS experiment is biased to some degree, and measurements from different protocols are quantitatively incomparable (Nayfach and Pollard, 2016; Hiergeist et al., 2016; Mallick et al., 2017; Sinha et al., 2017; Gibbons et al., 2018).
The biases of MGS protocols and the error those biases introduce remain unknown. Thus we do not know whether the measured taxonomic or gene compositions derived from MGS are accurate, or to what extent the biological conclusions derived from them are valid. It is common to assume that conclusions drawn from measurements using the same protocol are robust to MGS bias. But simulated examples have shown that bias can lead to qualitatively incorrect conclusions about which taxa dominate different samples (Kembel et al., 2012; Edgar, 2017), which ecosystems are more similar (Kembel et al., 2012), and which taxa are associated with a given disease (Brooks, 2016). Furthermore, variation in bias limits our ability to make the direct comparisons between results from different experiments that are central to the scientific process. It has been suggested that these issues would be circumvented if bias were the same in every experiment, leading to a number of efforts to define and promulgate standardized MGS protocols (Gilbert et al., 2014; Costea et al., 2017). However, methodological standardization has several limitations. For example, it can be overly restrictive given the variety of ecosystems and biological questions where MGS methods are applied as well as the continual advance in technology, and unmeasured technical variability can introduce experimentspecific biases into nominally standardized methods (Yeh et al., 2018). More important, standardized protocols remain biased and thus still do not provide accurate measurements of the underlying communities.
Current attempts to counter bias are limited and of unknown efficacy because of our poor understanding of how bias across the full experimental workflow distorts MGS measurements. Hundreds of published studies compare MGS measurements of defined samples to their expected composition in an effort to characterize the bias of the given protocol (many cited in Hugerth and Andersson, 2017; Pollock et al., 2018). But this approach has limited value so long as we do not know how the error observed in one sample translates to differently composed samples. If we understood how bias acts across samples we might be able to estimate the effect of bias from measurements of samples of defined composition and use those estimates to calibrate measurements of samples of interest to their true values (Thomas et al., 2016; Hardwick et al., 2017). Alternatively, natural communities measured by multiple experiments could be used to calibrate measurements between experiments using different protocols. A quantitative understanding of how bias distorts MGS measurements would also elucidate how statistical analyses and diagnostics are affected by bias and suggest more robust alternatives.
Here we propose and test a mathematical model of how bias distorts taxonomic compositions measured by MGS from their true values. In our model, bias manifests as a multiplication of the true relative abundances by taxon and protocolspecific factors that are constant across samples of varying compositions. We validate key components of this model, including that bias acts independently on each taxon in a sample, in 16S rRNA gene and shotgun metagenomic sequencing data from bacterial communities of defined composition. We use our proposed model to quantify bias, to partition bias into steps such as DNA extraction and PCR amplification, and to reason about the effects of bias on downstream statistical analyses. Finally, we demonstrate how this model can be used to correct biased MGS measurements when suitable controls are available.
Results
A mathematical model of MGS bias
Consider a markergene or metagenomic sequencing (MGS) experiment as a multistep transformation that takes as input biological material and provides as output the taxonomic profile corresponding to each sample—the set of measured taxa and their associated relative abundances (Figure 1A). Each step introduces systematic and random errors that cumulatively lead to error in the measured taxonomic profiles. Bias is a particular, ubiquitous form of systematic error that arises from the different efficiencies with which various taxa are measured (i.e. preserved, extracted, amplified, sequenced, or bioinformatically identified and quantified) at each step.
Many bias mechanisms are thought to act multiplicatively on the taxon abundances, at least to first approximation. For instance, the DNA concentration of a taxon after DNA extraction equals its initial cell concentration multiplied by its DNA yield per cell. This percell yield indicates the efficiency of extraction for the taxon, which is is expected to depend on factors such as genome size and the structure of the taxon’s cell wall (Morgan et al., 2010). Therefore, we expect extraction efficiencies to vary among taxa, but be approximately constant for any specific taxon across samples treated with the same protocol. Other multiplicative sources of bias include variation in PCR binding and amplification efficiencies (Wagner et al., 1994; Suzuki and Giovannoni, 1996; Polz and Cavanaugh, 1998; Edgar, 2017) and in markergene copy number (Kembel et al., 2012).
Inspired by these observations, we propose that at every step in an MGS experiment, the output abundances of individual taxa differ from the input abundances by taxonspecific multiplicative factors (Figure 1), which we refer to as the measurement efficiencies in that step. The measurement efficiencies are determined by the interaction between the experimental protocol and the biological/chemical/physical/informatic state of each taxon in that step, and are therefore independent of the composition of the sample. Typical MGS experiments only measure relative rather than absolute abundances (Gloor et al., 2017), and the change in the relative abundances during a step depend only on the relative efficiencies. This yields the following mathematical model of bias (Figure 1): The relative abundances measured in an MGS experiment are equal to the input relative abundances multiplied by taxonspecific but compositionindependent factors (the relative efficiencies) at every step.
The mathematical accounting of bias is simplified by the use of compositional vectors: vectors for which only the ratios among elements carry meaning. The relative abundances and relative efficiencies can be described as compositional vectors with $K$ nonnegative elements, where $K$ is the number of possible taxa. Two vectors $\mathbf{\mathbf{X}}$ and $\mathbf{\mathbf{Y}}$ are compositionally equivalent, denoted $\mathbf{\mathbf{X}}\sim \mathbf{\mathbf{Y}}$, if $\mathbf{\mathbf{X}}=a\mathbf{\mathbf{Y}}$ for some positive constant $a$ because the ratios among the elements of $\mathbf{\mathbf{X}}$ and $\mathbf{\mathbf{Y}}$ are the same: ${X}_{i}/{X}_{j}=a{Y}_{i}/a{Y}_{j}$ (BarcelóVidal et al., 2001). A compositional vector $\mathbf{\mathbf{X}}$ of relative abundances can be converted to proportions, which we denote $\mathrm{Pr}(\mathbf{\mathbf{X}})$, by dividing the taxon abundances by their sum, $\mathrm{Pr}(\mathbf{\mathbf{X}})=\mathbf{\mathbf{X}}/{\sum}_{i}{X}_{i}$, without changing its meaning in terms of the ratios among taxa. For example, the vector of observed proportions in Figure 1 of (4%, 72%, 24%) is compositionally equivalent to the vector (1, 18, 6) obtained by dividing all abundances by that of the first taxon.
For a given sample, let $\mathbf{\mathbf{A}}$ be the vector of actual relative abundances and $\mathbf{\mathbf{O}}$ be the vector of observed (measured) relative abundances. Subscripts denote specific taxa; for example ${A}_{i}$ is the relative abundance of taxon $i$. Similarly, let ${\mathbf{\mathbf{B}}}^{({P}_{l})}$ be the vector of the relative efficiencies of each taxon at step $l$ in Protocol $P$. (Interactions between steps are allowed; see Appendix 1.) Our model of bias can be stated mathematically as
where $\cdot $ denotes elementwise multiplication of two vectors (Figure 1). We define the bias of Protocol $P$ by the product over all steps, ${\mathbf{\mathbf{B}}}^{(P)}\sim {\mathbf{\mathbf{B}}}^{({P}_{1})}\cdot {\mathbf{\mathbf{B}}}^{({P}_{2})}\cdot \mathrm{\cdots}\cdot {\mathbf{\mathbf{B}}}^{({P}_{L})}$. The observed composition is then simply the actual composition multiplied by the protocol’s bias,
When considering samples measured by the same protocol, we will drop the superscript $P$ and simply refer to the total protocol bias as $\mathbf{\mathbf{B}}$.
From Equation 2 we see that the ratio between the observed relative abundances of any two taxa $i$ and $j$ is
and the observed proportion of taxon $i$ is
The denominator, ${\sum}_{j=1}^{K}\mathrm{Pr}{(\mathbf{\mathbf{A}})}_{j}{B}_{j}$, is the sample mean efficiency—the average efficiency of the sampled individuals.
The systematic error in the measured composition under our model is $\mathbf{\mathbf{O}}/\mathbf{\mathbf{A}}\sim \mathbf{\mathbf{B}}$, where $/$ denotes elementwise division and is referred to as the compositional difference (Aitchison, 1992). The compositional difference unites the experimental notion of bias—variation in the efficiencies with which different taxa are measured—with the statistical notion of bias—the difference between the expected value of an estimate and the true value—with the understanding we are considering the compositional difference rather than the conventional Euclidean difference between compositions.
Properties and implications of the model
Bias is fully described by the relative efficiencies of the total workflow
The bias of individual steps only influences the measurement error through their product, ${\mathbf{\mathbf{B}}}^{(P)}$. Consequently, knowledge of the total protocol bias is sufficient to determine how bias affects the measured taxonomic profiles even if the biases of the individual protocol steps (the ${\mathbf{\mathbf{B}}}^{({P}_{l})}$) remain unknown. The bias ${\mathbf{\mathbf{B}}}^{(P)}$ has just $K$−1 parameters, denoting the relative efficiencies with which the $K$ taxa of interest are measured by the protocol as a whole, and fully describes the effect of bias on measurements of those $K$ taxa in all samples.
Systematic error in taxon ratios, but not in taxon proportions, is independent of sample composition
The folderror in the observed ratios of the abundances of taxon $i$ and taxon $j$ relative to the actual ratio in their abundances is $({O}_{i}/{O}_{j})/({A}_{i}/{A}_{j})={B}_{i}/{B}_{j}$ (Equation 3). This error depends only on the ratio between the total protocol efficiencies of taxon $i$ and taxon $j$ and is independent of the rest of the sample. Critically, this means that the systematic error in taxon ratios caused by bias will remain the same in samples of varying composition.
In contrast, the error in the proportion of a taxon depends on the sample composition. The folderror in the observed proportion of taxon $i$ relative to its actual proportion is
This error depends on the sample mean efficiency ${\sum}_{j}\mathrm{Pr}{(\mathbf{\mathbf{A}})}_{j}{B}_{j}$ and thus depends on the proportions of all the other taxa in the sample. Intuitively, bias leads to overestimation of taxa that are more easily measured than the community average in the given sample. As a result, the same taxon can be overestimated in samples dominated by lowefficiency taxa and underestimated in samples dominated by highefficiency taxa.
To illustrate, we consider the hypothetical measurement of a second community sample (Sample ${\text{S}}_{2}$ in Figure 2) alongside that of the even sample from Figure 1 (Sample ${\text{S}}_{1}$ in Figure 2). The dominance of the lowefficiency Taxon 1 in Sample ${\text{S}}_{2}$ substantially lowers its sample mean efficiency compared to the evenmixture Sample ${\text{S}}_{1}$, changing the folderror in all taxon proportions. In particular, Taxon 3 changes from having a lowerthanaverage efficiency in Sample ${\text{S}}_{1}$ to a higherthanaverage efficiency in Sample ${\text{S}}_{2}$. As a result, its observed proportion is lower than its actual proportion in Sample ${\text{S}}_{1}$, but higher than its actual proportion in Sample ${\text{S}}_{2}$! Yet the folderror in the ratios among taxa is identical in both samples and equal to the bias (Figure 2, bottom row).
Analyses based on foldchanges in taxon ratios are insensitive to bias, while analyses based on taxon proportions can give spurious results
Although it is widely understood that bias distorts individual community profiles, it is often thought to effectively ‘cancel out’ when analyzing the differences between samples that have been measured by the same protocol. Unfortunately, simulating measurement under our model easily provides examples where common analyses give qualitatively incorrect results. In Figure 2, for example, bias causes the uneven Sample ${\text{S}}_{2}$ to appear to have a more even distribution of taxa than the perfectly even Sample ${\text{S}}_{1}$. As a result, any analysis of alpha diversity that incorporates evenness (e.g. the Shannon or Inverse Simpson indices) will incorrectly conclude that Sample ${\text{S}}_{2}$ is more diverse. The previous section provides a general explanation as to why, for many analyses, bias does not simply cancel: The underlying statistics are functions of the individual taxon proportions, the error of which varies inconsistently across samples. Consequently, proportionbased analyses can lead to qualitatively incorrect conclusions. As a further example, the actual proportion of Taxon 3 decreases from Sample ${\text{S}}_{1}$ to Sample ${\text{S}}_{2}$ in Figure 2, but the measured proportion increases!
In contrast, the folderror in taxon ratios is independent of sample composition, and foldchanges in taxon ratios across samples are insensitive to bias. Consider the foldchange in the ratio of a pair of taxa $i$ and $j$ between two samples $s$ and $t$. Following Equation 3, the observed change is
and thus equals the true change. That is, the foldchange in taxon ratios between samples is invariant to bias. More generally, the compositional difference between samples is invariant to multiplication by a fixed vector (Aitchison, 1992) and thus to bias,
Returning to the samples in Figure 2, the actual and observed ratios of Taxon 2 to Taxon 1 both change by the same factor of 1/15 from Sample ${\text{S}}_{1}$ to Sample ${\text{S}}_{2}$, and the actual and observed compositional difference between samples is (1, 1/15, 4/15). Equation 7 shows that any analysis that depends only on the compositional differences between samples will be invariant to bias under our model.
The systematic difference between measurements from different protocols is given by the difference in their biases
Consider Protocol $P$ with bias ${\mathbf{\mathbf{B}}}^{(P)}$ and reference Protocol $R$ with bias ${\mathbf{\mathbf{B}}}^{(R)}$. If both protocols measure the same sample with actual composition $\mathbf{\mathbf{A}}$, the compositional difference between their measurements is
The actual composition drops from the equations and the difference in their measurements is simply the compositional difference in the biases of each protocol, which we refer to as the differential bias ${\mathbf{\mathbf{B}}}^{(P/R)}\equiv {\mathbf{\mathbf{B}}}^{(P)}/{\mathbf{\mathbf{B}}}^{(R)}$ of Protocol $P$ relative to the reference Protocol $R$. Measurements on common samples are related to one another by ${\mathbf{\mathbf{O}}}^{(P)}\sim {\mathbf{\mathbf{O}}}^{(R)}\cdot {\mathbf{\mathbf{B}}}^{(P/R)}$, independent of the actual composition of the sample. Usefully, differential bias is mathematically equivalent to bias if we consider the ‘reference’ compositions measured by Protocol $R$ as the truth.
Estimates of bias from control samples can be used to correct measurements of other samples
The consistency of bias across samples makes it possible to estimate bias from samples of known composition, referred to as calibration controls, and to use that estimate $\widehat{\mathbf{\mathbf{B}}}$ to calibrate, or remove the bias from, measurements of other samples with unknown compositions. A point estimate of the bias of $K$ taxa present with known relative abundances in control sample $c$ is given by the compositional difference between the observed and actual compositions, $\widehat{\mathbf{\mathbf{B}}}\sim \mathbf{\mathbf{O}}(c)/\mathbf{\mathbf{A}}(c)$. In Materials and methods, we describe a general method for estimating bias from multiple controls by maximizing the explained compositional error in the control measurements. Measurements from controls containing different taxa can be combined into a single estimate of bias provided that the controls have sufficient taxonomic overlap (Appendix 2).
Once bias has been estimated for a set of taxa, it can be used to calibrate the relative abundances of those taxa in an unknown sample. Letting $\mathbf{\mathbf{O}}$ denote the measured relative abundances for these taxa, the estimate $\widehat{\mathbf{\mathbf{A}}}$ of the actual relative abundances is
That is, the calibrated abundances are found by compositionally subtracting the estimated bias from the original measurement. Through its use of compositional vectors, Equation 9 automatically accounts for differences in composition between the controls and the target sample. Calibrated estimates of the true taxon proportions are obtained by normalizing the elements of $\widehat{\mathbf{\mathbf{A}}}$ to sum to 1.
An alternative form of calibration we call reference calibration can be performed using control samples whose true composition is unknown but that have been measured by a reference Protocol $R$. Estimation and calibration proceed as before but with the control composition $\mathbf{\mathbf{A}}(c)$ replaced by the reference measurement ${\mathbf{\mathbf{O}}}^{(R)}(c)$. In this case, the calibrated composition is an estimate of the measurement we would expect if the target sample had been measured by the reference protocol.
Testing the model with mock communities
We tested our model of bias in data from two studies, Brooks et al. (2015) and Costea et al. (2017), that evaluated the bias of markergene and shotgun metagenomic sequencing, respectively, using mock microbial communities in samples of varying composition.
Markergene sequencing of even mixtures of various bacterial taxa
Brooks et al. (2015) generated taxonomic profiles from 71 samples of 58 unique mock communities by amplicon sequencing of the V1V3 region of the 16S rRNA gene. Each unique mock community consisted of an even mixture of between two and seven bacterial taxa. Each sample was measured in three experiments employing a common experimental workflow, but beginning from different starting points: even mixtures of cells, of extracted DNA, and of PCR product. The authors reported large systematic errors in the taxon proportions measured from the cell and DNA mixtures, which they explained in part by a highly parameterized linear model with many interaction terms. Here we reanalyze the data from this study in order to evaluate our model of bias and its performance relative to alternatives.
The proportions measured from the cellmixture mock communities differed greatly from the expected even proportions of each taxon (Figure 3A). The ratios between pairs of taxa also diverged sharply from the ratio of 1 expected in these even mixtures (Figure 3D). However, and as predicted by our model (see Properties and implications), the error in the ratios was consistent across samples (Figure 3D) while the error in the proportions varied dramatically in both magnitude and direction (Figure 3C).
Our model explained almost all of the error in the measured compositions of the cell mixtures. We estimated bias from all samples by a simple pointestimation procedure (Materials and methods; Table 1). We then used the estimated bias to predict the observed compositions from the expected even mixtures using Equation 2. The measured pairwise ratios closely matched the ratios predicted by our model—the ratios of the efficiencies of the two taxa (black crosses in Figure 3D). The proportions predicted from the fitted model reduced the mean squared error by 98.8% and closely matched the observed proportions (Figure 3B).
The DNA and PCRproduct mixture experiments confirmed that our model can also effectively describe partial MGS workflows. The compositions measured from DNA mixtures were affected by large systematic errors that were well explained by our model, while the systematic error in compositions measured from the PCR mixtures was small compared to the random errors (Figure 3—figure supplement 1 and Figure 3—figure supplement 2). Notably, the bias in DNA mixtures substantially differed from the bias in the cell mixtures (Table 1 and Figure 3—figure supplement 1). This observation suggests that PCR (performed in both experiments) and DNA extraction (performed only in the cellmixture experiment) are both large, independent sources of bias that each act in accordance with our model.
Our model better explains the data from the cell and DNA mixtures than proposed alternatives while employing a small number of parameters (6, equal to the number of taxa minus 1). Two recent studies (Krehenwinkel et al., 2017; Bell et al., 2019) used simple linear regression of the observed proportion of a taxon against its actual proportion, which uses 7 or 14 parameters for all taxa depending on whether intercept terms are included. Such models do not constrain the observed proportions to the [0, 1] interval. More critically, they cannot explain that the observed proportion of a given taxon can be higher than or lower than its actual proportion in samples of different composition (e.g. L. crispatus in Figure 3C), while such behavior is a straightforward consequence of our model. Brooks et al. (2015) attempted to overcome this limitation by adding second and thirdorder interaction terms between taxa to the linear model. This model obtains a close fit at the cost of vastly increased model complexity—441 parameters for all taxa instead of just 6. As a result, the interactions model is likely to overfit and poorly predict the measured compositions with different compositions from which it is trained on. Figure 3—figure supplement 3 compares the fit of our model to the simple linear model and the linear interactions model.
Metagenomic sequencing of fecal samples with a spikein mock community
The Phase III experiment of Costea et al. (2017) performed shotgun metagenomic sequencing of a cellular mock community spiked into fecal samples. The mock spikein contained 10 bacterial taxa with known abundance spanning 2.5 orders of magnitude and (unintentionally) an Escherichia/Shigella contaminant with unknown true abundance. It was added to fecal specimens from eight individuals as well as a blank ‘mockonly’ sample. DNA was then extracted from each specimen using three distinct DNA extraction protocols (Protocols H, Q, and W) and measured via a common shotgun sequencing protocol. Here we test whether bias among the spikein taxa is consistent across the varying backgrounds of the nine specimens.
Taxonomic profiles measured by MetaPhlAn2 (Materials and methods) showed substantial variation in the native bacterial composition across fecal specimens and in the proportion formed by the spikein, both of which were protocoldependent (Figure 4A). In contrast, the observed relative abundances of the spikein taxa were consistent across specimens for a given protocol (Figure 4A). This observation is what we expect given that the true spikein composition is fixed, since our model predicts that the error in the ratios among the spikein taxa is independent of the presence and abundance of other taxa. Figure 4A shows abundances relative to the geometric mean of the 10 mock taxa. The average difference between the observed and actual abundance for each taxon estimates the bias of the protocol in terms of the taxon’s efficiency relative to the average taxon (Figure 4—figure supplement 1 and Table 2). The bias shows qualitative differences between protocols, with certain mock taxa being enriched by one protocol and diminished by another. Also, the consistent difference in the observed relative abundance of the contaminant indicates a consistent differential bias between Protocol W and the other two protocols of the contaminant relative to the 10 mock taxa. These results indicate a consistent and unique bias associated with each protocol when bias is measured in accordance to our model.
Applications of the model
Calibration
Our model implies that a protocol’s bias can be estimated from control sample(s) of known composition and used to calibrate (through Equation 9) the measured compositions of unknown samples towards their true compositions. In the Brooks et al. (2015) cell mixtures, we estimated the bias from two samples containing all seven taxa and used this estimate to calibrate the measurements of the other 69 samples. Calibration reduced the mean squared error of the proportions in the calibrated samples by 92.6% and the average BrayCurtis dissimilarity between the actual and observed compositions from 0.35 to 0.08. In the Costea et al. (2017) dataset, the measured composition of the spikein mock community deviated from the truth in a protocolspecific fashion (Figure 5, top row). We estimated the bias of each protocol on the mock taxa from three fecal specimens and used those estimates to calibrate all samples (Materials and methods). Calibration removed most of the systematic error and greatly increased the accuracy of the measurements (Figure 5, middle row).
Random error, or noise, in the measurement process creates error in the estimated bias that propagates into the calibrated measurements. To evaluate the effect of noise on the accuracy of bias estimation, we picked the protocol with an intermediate noise level (Protocol H) and estimated the standard error in the relative efficiencies as a function of the number of control samples (Figure 5—figure supplement 1). Because noise was much weaker than bias, standard errors were less than the bias even for a single control measurement, suggesting substantial benefits from calibration even with a limited number of control observations. The results further suggest that three or four control measurements for a taxon substantially reduces the risk of inaccurate bias estimates due to occasional large random errors.
Differential bias between experiments (Equation 8) can be estimated from samples common to each experiment, even if the actual composition of the common samples is unknown. Differential bias can then be used to calibrate measurements from various experiments to those of a chosen reference protocol, thereby making measured compositions from different experiments quantitatively comparable even if their fidelity to the true compositions remains unclear. We illustrate calibration to a reference protocol using the multiprotocol design of the Costea et al. (2017) experiment (Figure 5). We defined the measurements by Protocol W as the reference composition that was then used in place of the actual composition in our calibration procedure. This greatly reduced the systematic differences between measurements from different protocols, without necessarily improving the accuracy compared to the actual composition (Figure 5, bottom row).
Bias measurement as a means of evaluating and improving protocols
Under our model, the compositional vector of relative efficiencies completely describes the effect of bias in samples of any composition, and thus is the correct way to measure and evaluate bias.
For the purpose of selecting lessbiased protocols, the overall magnitude of bias can be quantified through the use of ratiobased summary statistics. We provide two such statistics for the Brooks et al. (2015) experiment in Table 1 and the Costea et al. (2017) experiment in Table 2. The maximum pairwise bias, equal to the geometric range of the relative efficiencies, indicates the maximum error due to bias in the ratio of any two taxa. The average pairwise bias indicates the magnitude of the multiplicative error averaged over all pairs of taxa. For the three shotgun protocols in Table 2, these statistics indicate that Protocol H has a much larger bias than the other two protocols, though one should keep in mind that the large values for Protocol H are heavily influenced by its extremely low efficiency for L. plantarum and high efficiency for F. nucleatum. Summarizing the residual error in the control samples leads to an analogous average pairwise measure of the noise, or random error, associated with each protocol, which we also include in the tables. The noise measure for the three shotgun protocols indicates that, in this case, the least biased protocol (Protocol Q) also yielded the noisiest measurements of the mock taxa. However, the average noise of the protocols decreases with the propensity of the protocol to sequence the mock over the native gut taxa (Figure 4A), indicating that the greater noise of Protocol Q may be at least partially an artifact of limited sequencing depth.
Given suitable experimental designs, the estimated vectors of relative efficiencies can be used to quantify the bias attributable to specific parts of the workflow (Materials and methods and Appendix 1). In the Brooks et al. (2015) study, the same MGS workflow was run from different starting points: cells, extracted DNA, and PCR product. Comparing the bias resulting from different starting points leads to estimates of the bias attributable to DNA extraction, PCR amplification, and sequencing plus (bio)informatics (Figure 3; Figure 6A). For instance, dividing the relative efficiencies measured in the cell mixtures by those in the PCR mixtures provides an estimate of the bias that arises during DNA extraction. These estimates indicate that for these taxa and workflow, DNA extraction is the largest single source of bias, although PCR bias was also substantial. We can alternately understand these estimates through their predicted effect on the composition of an even mixture of taxa as it moves through the experimental workflow (Figure 6B), which clearly shows how extraction and PCR can oppose each other or work together. PCR and extraction bias acted in opposite directions for some taxa, such as L. crispatus and P. bivia, and in the same direction for others, such as G. vaginalis and for L. iners, leading to more moderate or extreme total relative efficiencies, respectively.
In the Costea et al. (2017) study, the same MGS workflow with different DNA extraction protocols was used to measure a common set of samples. This design implies that the differential bias between protocols (Table 2) can be attributed specifically to the effect of extraction (including possible effects of extraction on downstream steps; see Appendix 1). The differential bias of Protocol H relative to Protocols Q or W is substantially less than its bias (relative to the actual abundances), as can be seen from the summary statistics in Table 2 and visually in Figure 4—figure supplement 1. This observation suggests that components of bias are shared between protocols, either due to similarities among the extraction protocols or bias from shared steps such as library preparation.
Estimates of bias can be used to test mechanistic hypotheses and proposed methods for predicting bias. To demonstrate this application, we considered the effect of 16S copy number (CN) on bias in the Brooks et al. (2015) data. We estimated 16S CN per genome and per bp for the seven taxa using available CN and genome size estimates (Materials and methods; Table 3). We then compared the bias predicted by CN to the estimated PCR bias and the estimated bias of the total protocol without CN correction (Figure 6—figure supplement 1). 60% of the variance in estimated PCR bias was explained by CN variation (log relative efficiency scale; coefficient of determination $\approx 0.60$; $p\approx 0.021$ by permutation test). In contrast, total bias was poorly explained by CN variation (coefficient of determination $\approx 0.10$; $p\approx 0.23$ by permutation test). Accordingly, CN correction reduced the mean squared error in the taxon proportions by about half in the DNA mixtures but only slightly in the cell mixtures (Figure 3—figure supplement 2). The limited effect of CN correction can also be seen in Figure 6B. These results indicate that CN variation is just one component of PCR bias, which itself is just one component of total bias, and thus even perfect correction of CN bias may not substantially ameliorate total bias in markergene sequencing experiments.
Discussion
The lack of a rigorous understanding of how bias distorts markergene and metagenomic sequencing (jointly, MGS) measurements stands in the way of accurate and reproducible communitycomposition measurements. Previous analyses of bias in MGS experiments have largely relied on descriptive statistical models (Brooks et al., 2015; Sinha et al., 2017; Krehenwinkel et al., 2017; Bell et al., 2019) whose parameters cannot be identified with biophysical quantities that one might expect to apply to differently composed samples. Failure to develop more mechanistic models may have stemmed from the seeming hopelessness of accounting for the many verified sources of bias. Here we proposed a mathematical model of bias in MGS experiments as a set of taxonspecific factors (the relative efficiencies) that multiply the true relative abundances to produce the measured relative abundances. Our model was inspired by the observation that many sources of bias, such as differences in DNA extraction efficiency (Morgan et al., 2010), PCR primer binding and amplification efficiency (Wagner et al., 1994; Suzuki and Giovannoni, 1996; Polz and Cavanaugh, 1998; Edgar, 2017), and markergene copy number (Kembel et al., 2012), are thought to act multiplicatively, and hence so could their cumulative effect. The parameters in our model (the relative efficiencies) have biophysical interpretations as the relative yield per unit of input for each taxon, for individual steps or for the workflow overall. Our hypothesis that the relative efficiencies are consistent across samples is grounded in existing understanding of individual bias mechanisms and was supported by markergene and shotgunmetagenomic sequencing measurements of mock bacterial communities with varying composition. We further showed how our model could be used to measure, understand, and correct bias.
We found bias to be independent of sample composition only after accounting for the compositional nature of MGS measurements. Bias appeared inconsistent when viewed in terms of taxon proportions—for example the measured proportion of L. crispatus was both higher and lower than its true value in different samples (Figure 3C). However, these apparent inconsistencies did not reflect inconsistency in the action of bias, but instead were a consequence of the compositional nature of MGS data. A limited number of sequencing reads are generated from each sample, so if one taxon is enriched by bias then other taxa must be correspondingly diminished. Therefore, a taxon's proportional over or underrepresentation depends not on its absolute measurement efficiency but on its efficiency relative to the average individual (e.g. microbial cell) in the sample—L. crispatus increased in proportion when its efficiency was greater than the sample average, and decreased otherwise. Models that do not account for this effect (such as those of Brooks et al., 2015; Krehenwinkel et al., 2017; Bell et al., 2019; Kevorkian et al., 2018) will yield parameter estimates that do not extrapolate to differentlycomposed samples. Once we accounted for compositionality it became clear that relative efficiencies were consistent across differently composed samples. Bias had the same effect in each sample when we divided out the effect of compositionality by considering ratios of taxa (Figure 3D), and when we fully modeled the normalization involved in constructing proportions we found that a single set of relative efficiencies explained the observed proportions in every sample (Figure 3B).
A quantitative model allows the sensitivity of downstream analyses to bias to be rigorously evaluated. Consider the often unstated assumption that analyses of the differences between samples measured in the same experiment should be robust to bias, because each sample is biased in the same way. We can formally evaluate this assumption in the simple numerical example shown in Figure 2: Sample ${\text{S}}_{1}$ has higher Shannon diversity than Sample ${\text{S}}_{2}$ but the measured diversity of Sample ${\text{S}}_{1}$ is lower than Sample ${\text{S}}_{2}$, and Sample ${\text{S}}_{1}$ has a higher proportion of Taxon 3 than Sample ${\text{S}}_{2}$ but is observed to have a lower proportion of Taxon 3, despite the same bias distorting each sample. Whether such qualitative errors are likely can be investigated by simulating our model with empirical distributions of bias and community compositions. For example, consider the Bacteroidetes:Firmicutes ratio, a repeatedlyproposed diagnostic of gut health (Ley et al., 2006; Finucane et al., 2014). The range of relative efficiencies within the Firmicutes indicates that the Bacteroidetes:Firmicutes ratio measured by the metagenomic sequencing protocols evaluated in Costea et al. (2017) can differ from the true ratio by very little, or by as much as 100fold, depending on which Firmicutes species is dominant in the sample!
If bias acts (as we propose) as a consistent multiplication of the relative abundances, then analyses of MGS data based on taxon ratios could reduce the possibility for spurious results and make results from different experiments more comparable. The key insight is that the fold changes between samples in the ratios between taxa is invariant to consistent multiplicative bias, because the relative efficiencies divide out (Equation 6 and Equation 7). In contrast, such canceling does not occur for the foldchanges in taxon proportions, as exemplified by the spurious observed increase of Taxon 2 in Figure 2. To be clear, this biasinvariance property for ratiobased analyses only holds for samples biased in the same way, so these analyses still must be conducted within experiments sharing a common MGS protocol. But by controlling for studyspecific bias, these analyses may give results that are more concordant across studies than other analyses. One ready source of such methods is the field of Compositional Data Analysis (CoDA) (Aitchison, 1986; Gloor et al., 2017). In fact, our model of bias is equivalent to what is referred to as a compositional perturbation in the CoDA field; many CoDA methods are invariant to compositional perturbations (Aitchison, 2003; van den Boogaart and TolosanaDelgado, 2013) and thus would be invariant to bias. CoDA methods are being increasingly used to analyze MGS data, but to date this has been motivated by the need to account for the compositionality of MGS data. The possibility that such methods could also reduce or remove the effect of bias has not been widely appreciated.
Studies investigating bias and/or optimizing protocols should evaluate the systematic errors introduced by bias in a way that accounts for the compositional nature of MGS data. Most previous studies of bias quantified bias with taxon proportions (e.g. Brooks et al., 2015; Krehenwinkel et al., 2017; Bell et al., 2019) or proportionbased summary statistics such as BrayCurtis dissimilarities (Sinha et al., 2017) or differences in Shannon diversity (Song et al., 2016). Proportionbased measurements do not consistently measure bias in differently composed samples, and thus are difficult to interpret and can mislead researchers attempting to reduce the effect of bias on their experiments. The adoption of compositionally aware analytical methods to study bias may lead to insights that generalize beyond the specific sample compositions considered in these studies. In particular, quantification of bias in the form of the ‘bias vector’ of relative efficiencies we proposed here has a natural biological interpretation as the relative yields of each taxon and can be naturally decomposed into an elementwise product of bias vectors for each step in a workflow, allowing for granular investigation of MGS protocol choices.
Our results suggest that calibration could become a practical approach to improving the accuracy of MGS analyses and diagnostics. If bias is composition independent, then it can be estimated from one or more control samples and those estimates used to correct the relative abundances in target samples. Our results overcomes major limitations in recent attempts at MGS calibration (Brooks et al., 2015; Thomas et al., 2016; Krehenwinkel et al., 2017; Bell et al., 2019) by indicating how to obtain a compositionally independent estimate of bias for many taxa from a small number of control samples. Intriguingly, we show that the differential bias between protocols behaves in the same manner as the bias of an individual protocol. This property opens the possibility of calibration based on a reference protocol's measurements of control samples even if their true composition is not known. Reference calibration does not give the abundances in terms of biologically tangible units like cell concentration, but can make measurements from differently biased experiments quantitatively comparable, allowing diagnostic criteria to be applied outside of the lab in which they were defined. Reference calibration may sidestep the practical challenges of creating defined cellular mixtures of many taxa by using natural samples (or aggregations of natural samples) as calibration controls that would then contain the full range of taxa naturally present.
Limitations and next steps
We found bias to act multiplicatively in accordance with our model in two mockcommunity experiments; however, many sources of bias may deviate from multiplicativity under nonideal conditions. For example, it has been observed that the efficiency of a target sequence is altered by saturation during PCR amplification when the target is either rare or highly abundant (Suzuki and Giovannoni, 1996; Gonzalez et al., 2012). In a nonmicrobial markergene experiment, Thomas et al. (2016) found a strong and consistent saturation effect that may have been caused by such a mechanism. The opposite of saturation, where taxa have lower efficiencies when rarer in the sample, may occur if lowabundance taxa are culled by the minimumabundance thresholds used by taxonomic profilers such as MetaPhlAn2 (Truong et al., 2015). Such deviations from multiplicativity may be eliminated through protocol design (e.g. to avoid PCR saturation; Polz and Cavanaugh, 1998) or accounted for with extensions allowing for deterministic and random variation in efficiencies.
Even when bias does act multiplicatively on individual microbial taxa, that multiplicativity will not hold for aggregates of taxa that vary in their efficiencies (Appendix 1). Consider again the Bacteroidetes:Firmicutes ratio diagnostic. We know that within the Firmicutes phylum there is tremendous phenotypic variation, and that variation manifested itself in the Costea et al. (2017) study as orderofmagnitude differences between the relative efficiencies of various Firmicutes species. As a result, consistent multiplicative bias could dramatically enrich or diminish the relative abundance of the Firmicutes phylum depending on which Firmicutes species were present. Unfortunately, the potential for bias to act inconsistently on aggregates of taxa cannot be entirely circumvented by eschewing aggregation in our analyses because even the fundamental units we derive from MGS data effectively aggregate variation at some level (McLaren and Callahan, 2018). In the bacterial context, the traditional 97% ribosomal OTU groups variation at roughly the genus level (Yarza et al., 2014), exact shortread ribosomal sequence variants and common shotgun profilers group variation at roughly the species level (Edgar, 2018; Hillmann et al., 2018), and metagenome assembly combines strains too similar to be separated by the assembler or by the subsequent binning (Dick, 2018, p. 79). This does not make bias an intractable problem, but it does mean that additional work is needed to understand the phylogenetic scales over which bias significantly varies. Methods to control bias will need to operate on taxonomic objects with commensurate or finer resolution than that of variation in the bias phenotype (McLaren and Callahan, 2018).
Cellular phenotype and matrix chemistry can vary substantially between samples of different types, which may cause bias to vary among samples. For example, different cellular growth states may change the cellular membrane and thus the efficiency of DNA extraction for a given taxon, and inhibitors in soil samples are known to influence the efficiency of PCR (Schrader et al., 2012). In the Costea et al. (2017) experiment, protocols differentially extracted the mock and native gut taxa (Figure 4A), which may have been due to physiological differences between the labgrown cells and the preserved fecal cells. Potential phenotype and matrix effects are particularly relevant for calibration applications, as accurate calibration will require that bias measured in the controls is representative of the bias in the target samples.
The development of effective calibration methodologies will be limited by our ability to develop control samples that cover the range of taxa present in target communities. The bias estimation procedure we developed here is limited to those taxa present in the controls, and thus only allows for partial calibration of the subcomposition of the target samples consisting of those taxa. However, even partial calibration is useful when controls with key taxa in the target community are available. For example, vaginal microbiome samples from a patient over several clinical visits analyzed by Brooks et al. (2015) were mostly comprised of the seven taxa in their mock mixtures. It may be possible to effectively augment the range of taxa with credible bias estimates beyond those included in control samples by using phylogenetic inference methods to predict the bias of related taxa (Goberna and Verdú, 2016). It will be easier to broadly cover the taxa in a given environment when calibrating to a reference protocol, because any sample measured by the reference protocol can be used as a reference calibration control, including samples from the environment of interest.
Our discussion has so far ignored another ubiquitous source of error in MGS experiments—contamination, either from the lab, reagents, or other samples (Eisenhofer et al., 2019). Contamination may be more important than bias in certain scenarios, such as the study of low biomass samples or analyses that are sensitive to very low frequency taxa, and bias correction may be insufficient for accurate measurement and inference in such settings. Accurate bias inference also requires accurate removal of contaminant reads prior to estimation through manual filtering or automated removal methods (Davis et al., 2018; Larsson et al., 2018). Incorporation of our bias model with models of contamination may allow for simultaneous and improved estimation and correction of both error processes.
The model of bias we explored in this paper treats bias and measurement error more generally as deterministic. In practice, however, there is variability in the error of taxon ratios around the mean (Figure 3D; Figure 4—figure supplement 1). We developed a pointestimation method with a bootstrapping procedure to estimate the bias with associated uncertainty from random observations (Materials and methods). However, robust estimation and calibration may require a statistical model of bias that explicitly accounts for sources of random error such as variation in relative efficiencies across samples and limited sequencing depth. A statistical model would facilitate the construction of confidence intervals for the calibrated taxon proportions in a sample. Challenges associated with building such a model include modeling the presence of taxa thought to be absent from the community (but observed due to contamination or index switching; Eisenhofer et al., 2019), the absence of taxa known to be present (Yeh et al., 2018), and accounting for the noise associated with the count nature of sequencing data. Our finding that multiplicative error in taxon ratios provides a parsimonious model for bias paves the way for the development of a such a statistical model, which we leave for future work.
Conclusion
We suggest a simple yet profound change in how researchers view MGS measurements. Currently, researchers tend to either 1) take MGS measurements as telling us only about presence and absence of detectable taxa, 2) hope that bias in the measurements of individual samples will somehow cancel out when analyzing differences between samples within a given experiment, or 3) pretend bias doesn’t exist. We propose a new view in which the measured relative abundances within an experiment are biased by unknown—but constant—multiplicative factors. When bias acts consistently in this manner it can be accounted for through the use of biasinsensitive analyses or corrected by a calibration procedure. Our results lay a foundation for the rigorous understanding of bias in markergene and metagenomic sequencing measurements that is required for accurate and reproducible research using MGS methods and for the development of reliable MGS diagnostics and interventions.
Materials and methods
Bias estimation
Request a detailed protocolWe describe a procedure for averaging multiple control observations to obtain a single estimate of the protocol’s bias; additional details and motivation are given in Appendix 2. For sample $s$ in a set of control samples $S$, let $\mathbf{\mathbf{A}}(s)$ denote the actual composition and $\mathbf{\mathbf{O}}(s)$ be the observed composition. We assume that $\mathbf{\mathbf{O}}(s)$ and $\mathbf{\mathbf{A}}(s)$ are nonzero for the same taxa; in practice, this assumption requires ignoring sequencing reads from taxa not supposed to be in the sample and adding a small abundance to taxa actually present but not detected. Under our deterministic model, the observed composition is given exactly by Equation 2. In practice, however, each measurement will vary—for example, due to random error in sample construction, variation in sample handling, and random sampling of reads during sequencing. We decompose the error $\mathbf{\mathbf{O}}(s)/\mathbf{\mathbf{A}}(s)$ into a deterministic component $\mathbf{\mathbf{B}}$ and a random component $\u03f5(s)$,
The random error $\u03f5(s)$ is a random compositional vector that we assume has an expected value of $(1,\mathrm{\dots},1)$ in the compositional Aitchison geometry (given by the elementwise geometric mean; Aitchison, 2003, p. 38). Intuitively, we estimate $\mathbf{\mathbf{B}}$ by the vector $\widehat{\mathbf{\mathbf{B}}}$ that minimizes the residual errors, $\widehat{\u03f5}(s)\sim \mathbf{\mathbf{O}}(s)/(\mathbf{\mathbf{A}}(s)\cdot \widehat{\mathbf{\mathbf{B}}})$.
We quantify the magnitude of errors by the Aitchison norm. The Aitchison norm (PawlowskyGlahn, 2015; Aitchison, 2003, Chapter 2) of a $K$element composition $\mathbf{\mathbf{X}}$ is given by
where $g(\mathbf{\mathbf{X}})={({\prod}_{i=1}^{K}{X}_{i})}^{1/K}$ is the geometric mean of the elements of $\mathbf{\mathbf{X}}$. When estimating the bias from a sample $s$ that is missing some of the taxa, the elements of $\mathbf{\mathbf{O}}(s)/\mathbf{\mathbf{A}}(s)$ and of the residual error $\widehat{\u03f5}(s)$ corresponding to the missing taxa are undefined. We define $\Vert \hat{\u03f5}(s)\Vert$ in this case by restricting to just the defined elements (and adjusting $K$ accordingly) before applying Equation 11.
We take our estimate of $\mathbf{\mathbf{B}}$ to be the compositional vector that minimises the sum of squared residual error over all samples,
This definition equates $\widehat{\mathbf{\mathbf{B}}}$ with the compositional mean, or center, of the compositional errors $\mathbf{O}(s)/\mathbf{A}(s)$ when the center is defined to allow missing values (Appendix 2). If all samples contain all $K$ taxa, then $\widehat{\mathbf{\mathbf{B}}}$ is given by the elementwise geometric mean of the set $\{\mathbf{\mathbf{O}}(s)/\mathbf{\mathbf{A}}(s)\}$,
where $\leftS\right$ is the number of samples in the set $S$. More generally, the solution to $\hat{\mathbf{B}}$ can be computed using the projection approach of van den Boogaart et al. (2006). The solution is unique up to compositional equivalence given sufficient taxonomic overlap among the samples (Appendix 2).
Differential bias of the given protocol to a reference Protocol $R$ that has also measured the samples in $S$ can be estimated without knowing the actual sample compositions by replacing the actual compositions $A(s)$ in the above with the reference measurements ${\mathbf{\mathbf{O}}}^{(R)}(s)$.
We describe a bootstrap procedure for estimating the uncertainty of the bias estimate in Appendix 2. Each bootstrap replicate consists of drawing either multinomial or Dirichlet weights for the control samples and computing the weighted center to obtain the replicate value of $\widehat{\mathbf{\mathbf{B}}}$. Standard errors for the relative efficiencies are estimated by the geometric standard deviation of the corresponding efficiency ratio across replicates.
Bioinformatics and statistical analysis
Data and code availability
Request a detailed protocolFunctions and a tutorial for estimating and visualizing bias and performing calibration are provided in the ‘metacal’ R package, available on GitHub (McLaren, 2019a, copy archived at https://github.com/elifesciencespublications/metacal). The raw data for our analysis is available in the ‘Additional files’ of Brooks et al. (2015) and in European Nucleotide Archive study accession PRJEB14847 for Costea et al. (2017). The processed data—along with all code used to download and process the raw data, perform all statistical analyses, and generate all figures and tables—is contained in the manuscript’s GitHub repository (McLaren, 2019b, copy archived at https://github.com/elifesciencespublications/mgsbiasmanuscript). Analysis and visualization is performed using the R software environment (R Development Core Team, 2018) with the ‘metacal’ package, the ‘tidyverse’ suite of R packages (Wickham, 2019), and the ‘cowplot’ R package (Wilke, 2019). Analysis code is contained in Rmarkdown documents that can be executed to generate all numerical results, tables, and figures. Versions that have been ‘knit’ into html documents showing code interlaced with output and figures are available in the GitHub repository.
Brooks et al. (2015) experiment
Request a detailed protocolWe used taxonomic profiles generated in the original study and provided as supplemental information. Specifically, we used the sample information and read assignments in Additional Files 2, 10, and 11 of Brooks et al. (2015) to build a table of amplicon sequences assigned to each of the seven mock taxa in each sample. Brooks et al. (2015) used a classification method and 16S reference database designed for specieslevel classification of vaginally associated taxa from V1V3 region amplicons (Fettweis et al., 2012). Reads were assigned to species in the database according to a 97% sequence identity threshold, resulting in 93.5% of reads assigned and for which the vast majority (99.98%) were assigned to species corresponding to the seven mock taxa. We discarded the small fraction (0.0002%) of reads assigned to other species. Most samples were assigned a small fraction of their reads from species not expected to be in the sample. These outofsample species generally had much lower frequency than the expected species, suggesting they were the result of crosssample contamination rather than mislabeling or misconstruction of the samples. We therefore removed these reads before evaluating and estimating bias.
We took the actual composition of each sample to be an even mixture of the taxa added to that sample, in units of cell concentration, DNA concentration, or PCRproduct concentration. Brooks et al. (2015) constructed the cell mixtures to be even mixtures based on CFUs (a proxy for cell concentration); the DNA mixtures based on DNA concentration; and the PCR mixtures based on volume from amplification of a fixed weight of DNA. Extraction and PCR protocols differed somewhat when using pure cultures to create the DNA and PCRproduct mixtures than when applied to communities in the cell experiment. Thus, the DNA and PCR product in the second and third experiments may differ qualitatively from that in the cell mixture experiments, which could in principle affect the bias of downstream steps.
We estimated genome size and 16S copy number for the seven mock taxa from available genome databases and experimental measurements. We estimated genome size by the average genome size for the given species in NCBI RefSeq release 86 as collated by the GTDB (Parks et al., 2018). We estimated 16S copy number (CN) through a combination of RefSeq annotations for the given species; CN estimates in the rrnDB for the given species or their nearby relatives identified in the GTDB phylogeny (Parks et al., 2018); and measurement by pulsefield gel electrophoresis reported by Yuan et al. (2012) for A. vaginae and L. iners. A full account is given in the manuscript GitHub repository. The resulting genome size estimates approximately agree with those of Brooks et al. (2015), but the 16S CN estimates differ substantially for several taxa (compare our Table 3 to their Table 5). In particular, Brooks et al. (2015) estimated four taxa to have CNs of 1 based on NCBI annotations then available, but we suspect these numbers to be artifacts of poor assembly and annotation.
We estimated the bias predicted due to CN variation for each mixture type as follows. Cell mixtures: CN bias is simply the compositional vector of CNs (16S copies per genome). DNA mixtures: CN bias is the compositional vector of CN divided by genome size (16S copies per bp). PCRproduct mixtures: CN bias is the compositional identity vector (1,..., 1) (i.e. no bias). Denoting the estimated CN bias as ${\widehat{\mathbf{\mathbf{B}}}}^{(\text{CN})}$ for the given experiment, the CNcorrected proportions are $\mathrm{Pr}[\mathbf{\mathbf{O}}/{\widehat{\mathbf{\mathbf{B}}}}^{(\text{CN})}]$.
For each mixture experiment, we estimated bias as described above in ‘Bias estimation’. We then used these estimates to partition our estimate of the total protocol bias into three steps—1) DNA extraction, 2) PCR amplification, and 3) sequencing and bioinformatics—under the simplifying assumption that the bias of shared steps are the same across experiments. We assume the bias of the cell mixture experiments is ${\mathbf{\mathbf{B}}}^{\text{(Cells)}}={\mathbf{\mathbf{B}}}^{({P}_{1})}\cdot {\mathbf{\mathbf{B}}}^{({P}_{2})}\cdot {\mathbf{\mathbf{B}}}^{({P}_{3})}$, of the DNA mixtures is ${\mathbf{\mathbf{B}}}^{\text{(DNA)}}={\mathbf{\mathbf{B}}}^{({P}_{2})}\cdot {\mathbf{\mathbf{B}}}^{({P}_{3})}$, and of the PCRproduct mixtures is $\mathbf{B}}^{(\text{PCR product})}={\mathbf{B}}^{({P}_{3})$. We therefore estimate the extraction bias as ${\widehat{\mathbf{\mathbf{B}}}}^{({P}_{1})}={\widehat{\mathbf{\mathbf{B}}}}^{\text{(Cells)}}/{\widehat{\mathbf{\mathbf{B}}}}^{\text{(DNA)}}$, the PCR bias as $\hat{\mathbf{B}}}^{({P}_{2})}={\hat{\mathbf{B}}}^{(\text{PCR product})}/{\hat{\mathbf{B}}}^{(\text{DNA})$, and the sequencing and bioinformatics bias as $\hat{\mathbf{B}}}^{({P}_{3})}={\hat{\mathbf{B}}}^{(\text{PCR product})$.
Costea et al. (2017) Phase III experiment
Request a detailed protocolWe downloaded raw sequencing reads for the Phase III experiment from European Nucleotide Archive study accession PRJEB14847 and generated taxonomic profiles using MetaPhlAn2 version 2.7.6 (Truong et al., 2015) with the commandline options min_cu_len 0 stat avg_g. These options were chosen to increase sensitivity and accuracy for the rarest spikein taxa and resulted in the detection of all spikein taxa in every sample. Taxonomic profiles generated by MetaPhlAn2 provide estimated proportions of taxa at various taxonomic levels. We restricted our analysis to specieslevel abundances and the kingdom Bacteria, which constituted over 99% of nonviral abundance in each sample.
Costea et al. (2017) reported Escherichia coli as a likely spikein contaminant due to its presence in sequence data from the mockonly samples. Consistent with this report, the MetaPhlAn2 profiles showed a substantial presence of Shigella flexneri in the mockonly samples and we identified this species as the ‘Contaminant’ in our subsequent analyses and in all figures and tables.
We estimated the true mockcommunity composition using the flow cytometry (FACS) measurements reported in Costea et al. (2017). We used the arithmetic mean of two replicate measurements where available and ignored any measurement error in the resulting actual mock composition for our analysis. The FACS measurements provided by Costea et al. (2017) disagree with those shown in their Figure 6 for three taxa (V. cholerae, C. saccharolyticum, and Y. pseudotuberculosis). Analysis of our MetaPhlAn2 profiles indicates that these taxa are most likely mislabeled in the figure and not in the FACS measurements. A mislabeling in the FACS measurements would change the specific bias values we estimate for these taxa but not our main results or conclusions.
We estimated the bias of each protocol and the differential bias between protocols as described in ‘Bias estimation’. We estimated standard errors using the Dirichletweighted bootstrap method described in Appendix 2. To determine how precision in the bias estimate for Protocol H varies with the number of control samples (Figure 4—figure supplement 1), we computed standard errors using the multinomialweighted bootstrap method with the number of trials in the multinomial distribution equal to the specified number of control samples (Appendix 2).
To demonstrate calibration, we randomly chose three fecal specimens to use as the ‘estimation set’ to estimate bias, and then calibrated all samples using Equation 9. We excluded the mockonly specimen from the estimation set since its atypical values for a few taxa resulted in an unrepresentative picture of the success of calibration; however, we included it when evaluating the effect of noise on bias estimation in Figure 4—figure supplement 1.
Appendix 1
Model details
Qualitative or downstream effects of a step
In the most general formulation of our model first presented in the Results, we allow for the possibility that the protocol choice at a step influences the qualitative, as well as the quantitative, properties of the taxa. For instance, DNA extraction protocols yield differently fragmented DNA (Costea et al., 2017), which may affect the bias of downstream steps such as PCR and sequencing. Therefore, the bias at a step in general will depend on the protocol for that step as well as for all previous steps. To make this dependence explicit, let ${P}_{l}\mid {P}_{1}\mathrm{\dots}{P}_{l1}$ denote the $l$th step of protocol $P$ performed after steps $1$ through $l1$ of protocol $P$. The bias of protocol $P$ is then
Our core assumption is that the bias at each step is fixed within the context of the total protocol $P$, but independent of the composition of the sample. All properties and implications of bias or differential bias of the total protocols described in the main text are valid under this condition in the presence of such qualitative or downstream effects.
Downstream effects are important to consider, however, when attempting to estimate the bias of an individual step or protocol choice. Consider using differential bias to estimate the effect of DNA extraction protocol in the Costea et al. (2017) experiment, where the protocols differ only at the extraction step. The differential bias between Protocols H and W, ${\mathbf{\mathbf{B}}}^{(H/W)}$, includes both the direct effect of the differential bias at the extraction step as well as the indirect downstream effects the extraction protocol has on later steps. Here we cannot distinguish the two, as estimating the size of indirect effects requires experiments where protocols at different steps are varied in a combinatorial fashion.
A different but related issue arises in our estimation of bias from different steps in the Brooks et al. (2015) experiment. There, DNA extraction from pure cultures to create the DNA mixtures differed from extraction in the community samples (the cell mixtures). In order to estimate the bias during extraction, we must assume that there is no qualitative difference in the DNA, and so the bias of PCR, sequencing, etc. will be the same whether starting from the cell or the DNA mixtures. In other words, we must assume that the bias at a step ${P}_{l}$ is only a property of the individual step.
Aggregates of taxa
Aggregating groups of taxa through the standard method of summing their abundances can cause bias to appear to vary among samples with different compositions. Consider four taxa with bias $({B}_{1},{B}_{2},{B}_{3},{B}_{4})$ such that Taxon 1 is in Phylum 1, Taxon 2 is in Phylum 2, and Taxon 3 and Taxon 4 are both in Phylum 3. In a sample with actual composition $({A}_{1},{A}_{2},{A}_{3},{A}_{4})$, the observed composition is $({A}_{1}{B}_{1},{A}_{2}{B}_{2},{A}_{3}{B}_{3},{A}_{4}{B}_{4})$. If we aggregate taxa within phyla by the standard method, we observe the phylumlevel composition $({A}_{1}{B}_{1},{A}_{2}{B}_{2},{A}_{3}{B}_{3}+{A}_{4}{B}_{4})$. The bias we observe at the phylum level is
The observed efficiency of Phylum 1 relative to Phylum 2 remains ${B}_{1}/{B}_{2}$, independent of sample composition, but the efficiency of Phylum 3 relative to the other phyla depends on the relative abundances of Taxon 3 and Taxon 4.
The above example illustrates the general result that if we sum taxa that differ in their efficiencies, bias will no longer be independent of composition. As a result, estimates of bias at the aggregate level may find bias to vary among samples and perturbationinvariant analyses applied to taxonomic aggregates may not be invariant to bias.
When our measurement has sufficient taxonomic resolution to distinguish taxa at the level at which bias is conserved, then we can overcome these limitations by using the compositional approach to combining elements, which involves multiplying rather than summing the elements within each group (van den Boogaart and TolosanaDelgado, 2013). For instance, if we multiply rather than sum taxa within a phylum we see that the error is again independent of composition,
Bias remains independent when individual taxa are replaced with products of taxa, and analyses that are perturbation invariant to bias on taxa will remain so when applied to taxa products. CoDA analyses that use this approach are typically based on balances, which are scaled and logtransformed ratios of the products of groups of taxa (van den Boogaart and TolosanaDelgado, 2013). Applications of balances to microbiome analyses are described in Silverman et al. (2017) and RiveraPinto et al. (2018).
Appendix 2
Biasestimation details
Bias estimation is complicated by the compositional nature of MGS measurement, particularly if different control samples contain different taxa. Because only relative abundances are measured, the measurement of a control sample $s$ only provides information about the efficiencies of the taxa in the sample relative to each other. If we restrict our scope to a set of $K$ possible taxa, then we can write the actual and observed compositions of $s$ as $K$element compositions denoted $\mathbf{\mathbf{A}}(s)$ and $\mathbf{\mathbf{O}}(s)$. The measurement error can now be described by the compositional difference, $\mathbf{\mathbf{O}}(s)/\mathbf{\mathbf{A}}(s)$, which we denote $\mathbf{\mathbf{D}}(s)$. Recall that the compositional difference is computed by elementwise division. If the sample $s$ contains all taxa—that is, if all of the elements of $\mathbf{\mathbf{A}}(s)$ are positive—then $\mathbf{\mathbf{D}}(s)$ is fully defined. But if $s$ does not contain all taxa, then some of the elements of $\mathbf{\mathbf{D}}(s)$ are $0/0$ or undefined. The undefined elements of $\mathbf{\mathbf{D}}(s)$ indicate that the control measurement tells us nothing about the bias between the taxa that are in $s$ and taxa that are not in $s$.
This appendix first summarizes concepts from the field of Compositional Data Analysis for computing the mean of a set of compositional vectors with undefined elements. These concepts allow us to equate the estimator of bias defined in Equation 13 with the compositional mean, or center, of the compositional errors observed in the control samples,
We then describe the condition for sufficient taxonomic overlap among the control samples for this estimate to be fully defined. Finally, we describe a bootstrap procedure for approximating the sampling distribution of the center that can be used to estimate uncertainty in the bias estimate.
Distances and means in the compositional Aitchison geometry
The Aitchison geometry (PawlowskyGlahn, 2015; Aitchison, 2003, Chapter 2) defines a normed vector space for compositional vectors of a given length $K$, in which vector addition and subtraction are performed by elementwise multiplication and division and distance is measured according to the Aitchison norm. Equation 11 shows that the Aitchison norm of $\mathbf{\mathbf{X}}$ grows with the pairwise ratios among the elements or, equivalently, with the ratios of the elements to their geometric mean. The distance between two compositions $\mathbf{\mathbf{X}}$ and $\mathbf{\mathbf{Y}}$ in the Aitchison geometry is given by the Aitchison norm of their (compositional) difference and equals
Equation 16 shows that the Aitchison distance between two compositions grows with the folddifference in the element ratios between $\mathbf{\mathbf{X}}$ and $\mathbf{\mathbf{Y}}$.
The compositional mean, or center, of a set $\{X(s)\mid s\in S\}$ of compositional vectors with no undefined elements is typically defined by their elementwise geometric mean (van den Boogaart and TolosanaDelgado, 2013, p. 74),
where $\leftS\right$ is the number of elements in the set $S$. Alternatively, we can define the center geometrically as the vector that minimizes the sum of squared distances between the compositions in $\{\mathbf{\mathbf{X}}(s)\}$ and their center (PawlowskyGlahn and Egozcue, 2002),
In this way, the compositional mean can be seen to be directly analogous to the familiar mean in Euclidean geometry, which minimizes the sum of squared Euclidean distances between the vectors and their mean.
The geometric definition Equation 18 can be used even when some of the compositions have elements that are undefined, provided that we first define the Aitchison distance between vectors with undefined elements. In the scenario we are interested in, the elements of $\mathbf{\mathbf{X}}=\mathbf{\mathbf{D}}(s)$ are undefined because the taxa corresponding to these elements were not present in the sample $s$, leaving us with no information about their ratios to the elements corresponding to the present taxa. It is therefore natural to compute the distance between $\mathbf{\mathbf{X}}$ and a candidate estimate of the center $\mathbf{\mathbf{Y}}$ with Equation 16 restricted to just the defined elements. Equivalently, we can define the norm of $\mathbf{\mathbf{Z}}\sim \mathbf{\mathbf{X}}/\mathbf{\mathbf{Y}}$ as the norm of the smallestnormed $K$element vector that has the same ratios among the defined elements of $\mathbf{\mathbf{Z}}$. This definition of the norm amounts to computing the norm of the ${K}^{*}$element vector of the defined elements, as commented below Equation 11. To see why, note that the smallest $K$element vector consistent with $\mathbf{\mathbf{Z}}$ is given by setting the undefined elements of $\mathbf{\mathbf{Z}}$ equal to the geometric mean of the defined elements. When we compute its norm using the second form in Equation 11, the logarithmic terms corresponding to these elements equal zero and so do not contribute to the sum.
This generalized definition of the norm now lets us use Equation 18 to define the center of a set of compositional vectors that may have undefined elements. This definition can be used to compute the center numerically by minimizing the sum of squared distances. However, van den Boogaart et al. (2006) have developed an analytical approach based on projections of the logtransformed compositions; we refer readers to van den Boogaart et al. (2006) and Bren et al. (2008) for an explanation of this approach, which we use for all bias estimation in our analysis.
Taxonomic overlap among control samples needed for a fullydetermined bias estimate
Recall that the bias $\mathbf{B}$ is a compositional vector and so is defined only up to compositional equivalence, or an arbitrary positive constant multiplied by all elements. The bias estimate $\widehat{\mathbf{\mathbf{B}}}$ is uniquely determined (up to compositional equivalence) for all taxa present in at least one control sample only if there is sufficient taxonomic overlap among the samples. Here we explain the intuition for why this is the case and describe the general condition for sufficient overlap.
Suppose that there are two control samples; the first contains Taxon 1 and Taxon 2 and the second contains Taxon 3 and Taxon 4. Each sample allows us to learn the efficiencies of its taxa relative to others in that sample but not in the other sample. The first sample allows us to estimate ${B}_{1}/{B}_{2}$ and the second to estimate ${B}_{3}/{B}_{4}$, but we cannot estimate ${B}_{1}/{B}_{3}$. The situation improves if there is a third sample containing Taxon 2 and Taxon 4, allowing us to also estimate ${B}_{2}/{B}_{4}$. Now we can estimate ${B}_{1}/{B}_{3}$ by the relation ${B}_{1}/{B}_{3}=({B}_{1}/{B}_{2})({B}_{2}/{B}_{4})({B}_{4}/{B}_{3})$.
With just the first two samples, the bias estimate $\widehat{\mathbf{\mathbf{B}}}=({\widehat{B}}_{1},{\widehat{B}}_{2},{\widehat{B}}_{3},{\widehat{B}}_{4})$ is not welldefined. Because we have no information about the ratios between the two groups of taxa, we can always multiply the estimated efficiencies of one group (say, ${\widehat{B}}_{1}$ and ${\widehat{B}}_{2}$) by a constant $a$ while leaving the other group unchanged. The resulting distinct compositional vector would be just as consistent with the data from the control measurements in the sense that the residual sum of squared errors (RSS) would be unaffected. However, once we have the third sample, changing the ratios between the two groups of taxa in such a manner would increase the RSS and yield a worse estimate of the bias. Thus, with the inclusion of the third sample there is a unique estimate $\widehat{\mathbf{\mathbf{B}}}$ that minimizes the RSS and the bias estimate is welldefined.
The general condition for determining if there is sufficient taxonomic overlap to uniquely determine $\widehat{\mathbf{\mathbf{B}}}$ can be stated using the concept from network theory of a network component. A (connected) component of an undirected network is a subnetwork whose nodes are connected by paths through the network and are not connected to nodes in any other component (Newman, 2010, p. 142). Define the taxon cooccurrence network as the network with taxa as nodes and undirected edges between any pair of taxa that cooccur in the same sample. Each component of the taxon cooccurrence network defines a set of taxa within which bias can be estimated, but for which the bias to other taxa cannot be estimated. Thus the estimate $\widehat{\mathbf{\mathbf{B}}}$ is uniquely determined if and only if the taxon cooccurrence network has a single component. When there are multiple components, the bias estimation method we have implemented (using the projection method of van den Boogaart et al., 2006) resolves ambiguities by giving the value of $\widehat{\mathbf{\mathbf{B}}}$ with the smallest Aitchison norm. The resulting estimate can still be used to analyze the bias as long as only the efficiency ratios between taxa that share a component are considered as having any meaning.
Bootstrap method for estimating uncertainty in the bias
We describe two bootstrap (resampling) methods for estimating the uncertainty in the estimated bias. In the first method, based on the standard bootstrap introduced by Efron (1979), the weights assigned to the control measurements in each resampling are drawn from a multinomial distribution. In the second method, based on the Bayesian bootstrap introduced by Rubin (1981), the weights are drawn from a Dirichlet distribution. The two methods give approximately the same standard errors if all samples contain all taxa, but the second method avoids a pathology of the first when different samples contain different taxa.
Each method works by approximating the sampling distribution of $\widehat{\mathbf{\mathbf{B}}}$ by computing $\widehat{\mathbf{\mathbf{B}}}$ for a large number, $R$, of resamplings of the observed compositional errors, $\{\mathbf{\mathbf{D}}(s)=\mathbf{\mathbf{O}}(s)/\mathbf{\mathbf{A}}(s)\}$. We first describe the multinomial resampling (standard bootstrap) method in the simple case where all control samples contain all taxa. In this method, a single bootstrap replicate consists of drawing $\leftS\right$ error measurements by sampling with replacement from $\{\mathbf{\mathbf{D}}(s)\}$ and computing the center of the resulting error measurements. Equivalently, we first draw a length$\leftS\right$ vector $\mathbf{\mathbf{w}}$ from a $\text{Multinomial}(\leftS\right,(1/\leftS\right,\dots ,1/\leftS\right))$ distribution to specify the weights assigned to the $\leftS\right$ original measurements and compute the weighted center, $\mathrm{cen}[\{\mathbf{\mathbf{D}}(s)\},\mathbf{\mathbf{w}}]$, defined by
From each bootstrap replicate $r$ we obtain an estimate ${\widehat{\mathbf{\mathbf{B}}}}_{r}=\mathrm{cen}[\{\mathbf{\mathbf{D}}(s)\},{\mathbf{\mathbf{w}}}_{r}]$ of the bias that, because each sample contains all taxa, is necessarily a fullydetermined $K$element compositional vector.
We now use the approximate resampling distribution given by $\{{\widehat{\mathbf{\mathbf{B}}}}_{r}\}$ for a large number of replicates ($R$) to estimate the uncertainty in $\widehat{\mathbf{\mathbf{B}}}$. In order to respect the compositional nature of bias, we propose quantifying the uncertainty in $\widehat{\mathbf{\mathbf{B}}}$ by the geometric (multiplicative) standard errors in the relative efficiencies of various taxa. In the tables and figures, we typically present bias relative to the average taxon, which amounts to dividing the relative efficiency of each taxon by the geometric mean efficiency of all taxa. To estimate the geometric standard errors of the relative efficiencies in this case, we first geometrically center the elements of each ${\widehat{\mathbf{\mathbf{B}}}}_{r}$ (by taking ${\widehat{\mathbf{\mathbf{B}}}}_{r}\mapsto {\widehat{\mathbf{\mathbf{B}}}}_{r}/g({\widehat{\mathbf{\mathbf{B}}}}_{r})$) and then compute the elementwise geometric standard deviations over the $R$ replicates,
This standard error quantifies the uncertainty in the ratio of the efficiency of Taxon $i$ to the geometric mean efficiency of all taxa. Other ratios are also useful for understanding the uncertainty in the estimated bias. For example, one can estimate the standard error in the bias between taxa $i$ and $j$ by the geometric standard deviation of ${\widehat{B}}_{r,i}/{\widehat{B}}_{r,j}$ over the $R$ replicates.
The above procedure can fail if different samples have different taxa. Multinomial resamplings typically leave out some samples and thus can lack sufficient taxonomic overlap to give a fullydetermined bias estimate even if $\widehat{\mathbf{\mathbf{B}}}$ is fully determined. A simple way to overcome this problem is to instead use Dirichlet resampling, as in the Bayesian bootstrap (Rubin, 1981). Procedurally, the Bayesian bootstrap proceeds like the weightbased formulation of the standard bootstrap given above, except that the weights ${\mathbf{\mathbf{w}}}_{r}$ are sampled from a $\text{Dirichlet}(1,\dots ,1)$ distribution rather than the multinomial distribution. Each sample now receives a positive weight (with probability 1), ensuring that each replicate estimate ${\widehat{\mathbf{\mathbf{B}}}}_{r}$ is fully determined so long as $\widehat{\mathbf{\mathbf{B}}}$ is. We can then proceed to compute compositional standard errors as before. The Dirichlet and multinomial weights of the two methods have the same expected value and approximately the same variance (Rubin, 1981). Consequently, the two methods will give approximately the same standard errors when all samples have all taxa.
To determine how much the precision of the bias estimate for Protocol H of Costea et al. (2017) varied with the number of control samples (Figure 4—figure supplement 1), we computed standard errors using the multinomialweighted bootstrap method with the number of trials in the multinomial distribution equal to the specified number of control samples. That is, to estimate the precision associated with a given number $n$ of control samples ($1\le n\le \leftS\right$), we drew the length$\leftS\right$ weight vector $\mathbf{\mathbf{w}}$ from a $\text{Multinomial}(n,(1/\leftS\right,\dots ,1/\leftS\right))$ distribution.
References
 1

2
On criteria for measures of compositional differenceMathematical Geology 24:365–379.https://doi.org/10.1007/BF00891269

3
A concise guide to compositional data analysis2nd Compositional Data Analysis Workshop.

4
Mathematical foundations of compositional data analysisProceedings of IAMG'01—The Sixth Annual Conference of the International Association for Mathematical Geology.
 5

6
News from "compositions", the R packageNews from "compositions", the R package, https://core.ac.uk/download/pdf/132548286.pdf.
 7

8
Challenges for casecontrol studies with microbiome dataAnnals of Epidemiology 26:336–341.https://doi.org/10.1016/j.annepidem.2016.03.009
 9

10
Next generation microbiological risk assessment metaomics: the next need for integrationInternational Journal of Food Microbiology 287:10–17.https://doi.org/10.1016/j.ijfoodmicro.2017.11.008

11
Towards standards for human fecal sample processing in metagenomic studiesNature Biotechnology 35:1069–1076.https://doi.org/10.1038/nbt.3960
 12
 13

14
Genomic Approaches in Earth and Environmental SciencesChichester, UK: John Wiley & Sons, Ltd.
 15

16
Updating the 97% identity threshold for 16S ribosomal RNA OTUsBioinformatics 34:2371–2375.https://doi.org/10.1093/bioinformatics/bty113

17
Bootstrap methods: another look at the jackknifeThe Annals of Statistics 7:1–26.https://doi.org/10.1214/aos/1176344552

18
Contamination in low microbial biomass microbiome studies: issues and recommendationsTrends in Microbiology 27:105–117.https://doi.org/10.1016/j.tim.2018.11.003

19
Specieslevel classification of the vaginal microbiomeBMC Genomics 13 Suppl 8:S17.https://doi.org/10.1186/1471216413S8S17
 20

21
Correcting for batch effects in casecontrol microbiome studiesPLOS Computational Biology 14:e1006102.https://doi.org/10.1371/journal.pcbi.1006102
 22

23
Microbiome datasets are compositional: and this is not optionalFrontiers in Microbiology 8:2224.https://doi.org/10.3389/fmicb.2017.02224

24
Predicting microbial traits with phylogeniesThe ISME Journal 10:959–967.https://doi.org/10.1038/ismej.2015.171
 25
 26

27
Reference standards for nextgeneration sequencingNature Reviews Genetics 18:473–484.https://doi.org/10.1038/nrg.2017.44

28
Multicenter quality assessment of 16S ribosomal DNAsequencing for microbiome analyses reveals high intercenter variabilityInternational Journal of Medical Microbiology 306:334–342.https://doi.org/10.1016/j.ijmm.2016.03.005
 29
 30

31
Incorporating 16S gene copy number information improves estimates of microbial diversity and abundancePLOS Computational Biology 8:e1002743.https://doi.org/10.1371/journal.pcbi.1002743

32
Estimating population turnover rates by relative quantification methods reveals microbial dynamics in marine sedimentApplied and Environmental Microbiology 84:e0144317.https://doi.org/10.1128/AEM.0144317

33
The microbiome and human biologyAnnual Review of Genomics and Human Genetics 18:65–86.https://doi.org/10.1146/annurevgenom083115022438
 34
 35

36
Soil biology for resilient, healthy soilJournal of Soil and Water Conservation 70:12A–18.https://doi.org/10.2489/jswc.70.1.12A
 37

38
Microbiome, metagenomics, and HighDimensional compositional data analysisAnnual Review of Statistics and Its Application 2:73–94.https://doi.org/10.1146/annurevstatistics010814020351
 39
 40

41
mgsbiasmanuscriptmgsbiasmanuscript, 4812a65, https://github.com/mikemc/mgsbiasmanuscript.
 42

43
Microbiome tools for forensic scienceTrends in Biotechnology 35:814–823.https://doi.org/10.1016/j.tibtech.2017.03.006
 44
 45
 46
 47

48
Modelling and Analysis of Compositional Data23–31, The Aitchison geometry, Modelling and Analysis of Compositional Data, 3, Chichester, UK, John Wiley & Sons, Ltd.

49
BLU estimators and compositional dataMathematical Geology 34:259–274.https://doi.org/10.1023/A:1014890722372

50
The madness of microbiome: attempting to find consensus "Best Practice" for 16S Microbiome StudiesApplied and Environmental Microbiology 84:e0262717.https://doi.org/10.1128/AEM.0262717

51
Bias in templatetoproduct ratios in multitemplate PCRApplied and Environmental Microbiology 64:3724–3730.

52
Shotgun metagenomics, from sampling to analysisNature Biotechnology 35:833–844.https://doi.org/10.1038/nbt.3935

53
R: A Language and Environment for Statistical ComputingR Foundation for Statistical Computing, Vienna, Austria.
 54

55
Tools for metagenomic analysis at wastewater treatment plants:application to a foaming episodeWater Environment Research 90:258–268.https://doi.org/10.2175/106143017X15054988926352
 56

57
PCR inhibitors  occurrence, properties and removalJournal of Applied Microbiology 113:1014–1026.https://doi.org/10.1111/j.13652672.2012.05384.x
 58
 59
 60
 61

62
Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCRApplied and Environmental Microbiology 62:625–630.
 63

64
MetaPhlAn2 for enhanced metagenomic taxonomic profilingNature Methods 12:902–903.https://doi.org/10.1038/nmeth.3589

65
Concepts for handling of zeros and missing values in compositional dataProc. IAMG 6:1–4.
 66
 67

68
tidyverse: Easily install and load packages from the tidyversetidyverse: Easily install and load packages from the tidyverse, https://github.com/tidyverse/tidyverse.

69
cowplot: Streamlined plot theme and plot annotations for ’ggplot2’cowplot: Streamlined plot theme and plot annotations for ’ggplot2’, https://github.com/wilkelab/cowplot.

70
Uniting the classification of cultured and uncultured Bacteria and archaea using 16S rRNA gene sequencesNature Reviews Microbiology 12:635–645.https://doi.org/10.1038/nrmicro3330
 71
 72
 73
Decision letter

Peter TurnbaughReviewing Editor; University of California, San Francisco, United States

Wendy S GarrettSenior Editor; Harvard T.H. Chan School of Public Health, United States

Peter TurnbaughReviewer; University of California, San Francisco, United States

Christopher QuinceReviewer; University of Warwick, United Kingdom

Sean GibbonsReviewer
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Consistent and correctable bias in metagenomic sequencing experiments" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Peter Turnbaugh as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Wendy Garrett as the Senior Editor. The following individuals involved in review of your submission have also agreed to reveal their identity: Christopher Quince (Reviewer #2); Sean Gibbons (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The independent replication of microbiome research can produce very different results due to biases inherent in different protocols for studying complex microbial communities. Here, McLaren, Willis, and Callahan focus on multiplicative biases in microbiome experiments and quantify their effect on distorting the resulting abundance distributions. The authors formulate a rigorous framework in treating multiplicative biases based on individual protocol steps and derive a correction of that bias. Application of their correction methods to two mock community data sets validates their findings and provides improved estimates of microbial abundances. Together, this work provides an important framework to begin to more robustly compare data across experiments and research groups. This work is incredibly timely and provides a significant advance in our understanding of biases in microbiome data and a framework that will be built on to address them in the future.
Essential revisions:
1) The github repository is great but might not be enough of a tutorial for inexperienced users. I'd recommend writing a step by step tutorial (a la http://qiime.org/tutorials/).
Also, are there any plans in place to update this code as new approaches become available? It may also help to name the package/tool.
2) More discussion is needed regarding the propagation of noise or errors across multiplicative factors as opposed to bias. For instance, is noise in the extraction protocol more important than PCR because it occurs earlier? Or are the errors simply multiplicative too so that there is no purpose in trying to reduce noise in the earlier steps? In any case noise is perhaps more important than bias as it cannot be corrected so in their Tables 1 and 2 it would be good to have an estimate of the average error associated with each protocol. In some cases, a more biased but reproducible protocol could be preferable.
3) Both data sets used by the authors are derived from mock communities with known ground truth. The authors argue that their methods can be used even in the absence of a ground truth by using relative efficiencies. I feel that this is an important use case for realworld data sets. Standard communities, including all/most of the detected taxa across a set of realworld samples, are usually not available for any given microbiome study. Here, one could still estimate relative efficiencies by using technical replicates (samples coprocessed across batches – how many samples need to be coprocessed across batches for this approach to be useful?). However, that particular use case is discussed only briefly and there is no quantitative study with any data set that is not a mock data set. I feel that inclusion of a nonmock data set with technical replicates may be more informative here since that is closer to most published microbiome data sets. Based on that the authors could formulate a protocol that can be proposed for future microbiome experiments and that would allow systematic elimination of multiplicative biases. One challenge I foresee is that relative efficiencies will be very difficult to estimate for lowabundance taxa that are often unique to a small number of samples, unless every sample is processed across all batches.
4) The authors start by arguing that major biases should be multiplicative. Even though they provide a wide array of references describing the existence of bias in microbiome assays, there is no reference that supports the assumption that the strongest bias has to be multiplicative. This is addressed later in the manuscript by showing that multiplicative bias captures the error in mock communities well and by stating this multiplicative limitation in the Discussion. I would have liked a more systematic study of the potential magnitudes of multiplicative vs. additive/other biases.
a) For instance, what is the amount of variance explained by multiplicative bias vs. potentially additive biases (e.g. saturating effects, contamination from external sources, or welltowell contamination)? I would expect that additive biases, like welltowell contamination, to be more important in lowbiomass samples. Further discussion of this point would be useful.
b) In the two data sets analyzed in this study the authors removed contamination manually. However, this approach likely won't be possible for nonmock data.
c) All bias estimates reported in the text are usually reported without dispersion measures, even though the formulated point estimate should induce an appropriate dispersion measure as well. It would be interesting to see how much uncertainty there is regarding the bias as a function of samples used for correction. This could aid in determining how many samples should be used to estimate the bias.
5) It would be worthwhile to discuss the observed bias structure with other data transformations (not only the ones from compositional data analysis). For instance, Equation 7 essentially makes fold changes of the same taxon across samples invariant to the bias. This would mean that the established transformation from RNAseq analysis would also be applicable here (logtransformation and comparison by logfold change rather than absolute difference).
6) I feel that some further investigation into correlations of the identified biases with taxonomy may be helpful to gain mechanistic insight into the sources of bias. For instance, does the bias correlate with the abundance of the taxon, or do phylogenetically similar taxa show similar biases? This could help to formulate strategies for estimating bias for taxa that are undetected in the batch replicates. The authors also argue that summarization of taxa into higher phylogenetic ranks may introduce additional errors, but they do not state which rank is permissible here (species, genus, family?).
The reviewers were all enthusiastic about publication and realize that all of these points may be difficult to accomplish within the next couple months. We would encourage the authors to use their best judgement in deciding what can be addressed in a substantive way and which points would be best addressed with additional discussion.
https://doi.org/10.7554/eLife.46923.026Author response
Essential revisions:
1) The github repository is great but might not be enough of a tutorial for inexperienced users. I'd recommend writing a step by step tutorial (a la http://qiime.org/tutorials/).
Also, are there any plans in place to update this code as new approaches become available? It may also help to name the package/tool.
We agree with the reviewers that a more accessible and better documented R package would be useful, and have now created a named package and a stepbystep tutorial.
We split off the generically useful functions into an R package called “metacal” (for “metagenomics calibration”), hosted on Github at https://github.com/mikemc/metacal. This package should be considered “alpha” but contains functions that can be used to measure bias in qualitycontrol experiments that include samples of known composition. The GitHub Issues tracker will host discussions about the needs of potential users. We will update this package as we develop improved estimation and calibration methods.
We have added a tutorial to the R package that takes users through the process of estimating bias and performing calibration in the special case where the bias of all taxa can be directly measured from the control samples. The tutorial is viewable online at https://mikemc.github.io/metacal/articles/tutorial.html. We plan to expand upon this tutorial as we further our methods.
We have also taken several steps during the revision that make the package a more robust and useful tool for analyses beyond the present manuscript. In particular, we have:
Added functions to facilitate computing ratiobased statistics and diagnostic plots;
Added a bootstrap procedure for estimating uncertainty in bias estimates (now described in Appendix 2 section “Bootstrap method for estimating uncertainty in the bias”);
Implemented an analytical solution to the bias estimator for the case where different samples have different taxa, which is much faster than our previous optimization method and allows fast bootstrapping (described in Materials and methods section “Bias estimation” and motivated in Appendix 2 section “Distances and means in the compositional Aitchison geometry”);
Added functions to check if there is sufficient taxonomic overlap among the control samples to obtain a fully determined bias estimate, or to find the disjoint groups of taxa within which bias can be estimated (fully explained in Appendix 2 section “Taxonomic overlap among control samples needed for a fullydetermined bias estimate”).
Each of these features is demonstrated in the tutorial.
2) More discussion is needed regarding the propagation of noise or errors across multiplicative factors as opposed to bias. For instance, is noise in the extraction protocol more important than PCR because it occurs earlier? Or are the errors simply multiplicative too so that there is no purpose in trying to reduce noise in the earlier steps? In any case noise is perhaps more important than bias as it cannot be corrected so in their Tables 1 and 2 it would be good to have an estimate of the average error associated with each protocol. In some cases, a more biased but reproducible protocol could be preferable.
We have now added a measure of multiplicative noise to Tables 1 and 2 and described this measure and its application to the Costea et al. (2017) experiment in the second paragraph of the subsection “Bias measurement as a means of evaluating and improving protocols”. To allow direct comparison, we now use the analogous measure of bias instead of the previously used geometric standard deviation, although the new measure has a similar interpretation. In the Brooks et al. (2015) experiments (Table 1), we see the highest noise in the PCR mixtures, which is likely due to random error during sample construction resulting in imprecise estimates of the actual compositions, rather than noise during the MGS measurement. In the Costea et al. experiment (Table 2), the noise is smaller for protocols that preferentially measured the spikein, suggesting that limited sequencing depth or other discretesampling effects may be having an effect.
Our quantification of noise assumes a multiplicative model of noise, and found that it gave reasonable results consistent with our expectations (e.g. noise << bias in most cases). However, we have not proven that all sources of noise in MGS measurements are multiplicative. Some noise mechanisms are likely to be wellmodeled as multiplicative. For example, PCR amplification efficiencies may vary based on location of the PCR tube in the PCR machine, introducing random variation in a multiplicative factor (the efficiencies) that is independent of the sample’s composition. However, at least one source of noise is known to not be wellmodeled as multiplicative: the random variation around the relative abundances introduced by the sequencing sampling process is larger and more asymmetric for rare taxa.
In general, we expect that noise will be reasonably modeled as multiplicative for reasonably abundant taxa. This means that, as the reviewers noted, the effect of multiplicative noise will be the same whether it arises in an early step or a late step. However, further investigation is warranted, and we expect at a minimum that a proper accounting for the nonmultiplicative noise introduced by the generation of discrete sequencing reads will be necessary for robust bias estimation in low abundance taxa.
3) Both data sets used by the authors are derived from mock communities with known ground truth. The authors argue that their methods can be used even in the absence of a ground truth by using relative efficiencies. I feel that this is an important use case for realworld data sets. Standard communities, including all/most of the detected taxa across a set of realworld samples, are usually not available for any given microbiome study. Here, one could still estimate relative efficiencies by using technical replicates (samples coprocessed across batches – how many samples need to be coprocessed across batches for this approach to be useful?). However, that particular use case is discussed only briefly and there is no quantitative study with any data set that is not a mock data set. I feel that inclusion of a nonmock data set with technical replicates may be more informative here since that is closer to most published microbiome data sets. Based on that the authors could formulate a protocol that can be proposed for future microbiome experiments and that would allow systematic elimination of multiplicative biases. One challenge I foresee is that relative efficiencies will be very difficult to estimate for lowabundance taxa that are often unique to a small number of samples, unless every sample is processed across all batches.
We agree with the reviewers that the restriction of our analyses to mock communities is a major limitation of the present study. We are actively working towards demonstrating the utility of our methods on natural samples from other experiments, with the ultimate goal of formulating protocols for both mock and nonmock calibration in future microbiome experiments. However, we determined that a thorough investigation of differential bias in natural samples would require followup article(s) for reasons we outline below.
First, as the reviewers noted, most species are not present in most samples in realworld datasets, and many of the taxa that are present are at very low abundances. These complications make robust estimation of the relative efficiencies more difficult in natural samples than in controls. Second, consistent matching between the taxa identified by different protocols is challenging in natural samples in which many similar taxa may be present. Consider a set of E. coli and Shigella flexneri strains that are present in the realworld dataset. The way that a 16S rRNA protocol and a shotgun metagenomics protocol will divide these strains into taxonomic units will be different, and matching between the taxonomic units derived from different MGS protocols can be surprisingly challenging. This issue is avoided in control samples that contain collections of strains that are taxonomically wellseparated. Third, most currently available datasets that could be used for investigating differential bias (i.e. that contain multiple samples measured by different MGS protocols) do not include technical replicate measurements, which are necessary to be able to distinguish inconsistent bias from contamination, random error, or sample mislabeling. For example, such technical replicate measurements of the same samples are not present in the Costea et al. (2017) dataset we analyzed, which otherwise would be a useful dataset for investigating differential bias in the natural fecal portion of the measured communities. We believe that the problems posed by noise, contamination, and taxonomic matching in realworld datasets can be overcome by more careful modeling and data curation, but require further substantial work and dataset development that would overly complicate the present manuscript.
Despite these challenges, we believe that differential bias is a nonobvious concept that is critical for understanding and developing methods to reduce technical disagreement between studies, and that it is therefore worth introducing in this manuscript. In order to better support its practical utility given the reviewers’ concerns, we expanded the illustration of differential bias in the revised manuscript: 1) We added the spikein contaminant in the Costea et al. (2017) experiment to Figure 4B to illustrate how a consistent differential bias can be observed even when the true abundance of the taxon is unknown (subsection “Metagenomic sequencing of fecal samples with a spikein mock community”, last paragraph). 2) We added estimates of differential bias between the Costea et al. (2017) protocols including the contaminant taxon to Table 2. While not equivalent to demonstration on a completely natural community, we believe this at least shows the utility of differential bias in the case where true abundances are not known.
4) The authors start by arguing that major biases should be multiplicative. Even though they provide a wide array of references describing the existence of bias in microbiome assays, there is no reference that supports the assumption that the strongest bias has to be multiplicative. This is addressed later in the manuscript by showing that multiplicative bias captures the error in mock communities well and by stating this multiplicative limitation in the Discussion. I would have liked a more systematic study of the potential magnitudes of multiplicative vs. additive/other biases.
a) For instance, what is the amount of variance explained by multiplicative bias vs. potentially additive biases (e.g. saturating effects, contamination from external sources, or welltowell contamination)? I would expect that additive biases, like welltowell contamination, to be more important in lowbiomass samples. Further discussion of this point would be useful.
Note: To be consistent with the manuscript, we use the term “bias” exclusively for the systematic error caused by differences in the measurement efficiency among taxa.
Our aim has been to determine how the error due to bias behaves and can be addressed. In particular, we do not consider error due to contamination and taxonomic misassignment, which can lead to measurements of taxa not actually in the sample and cannot be expressed in our simple multiplicative framework. These other sources of error may be more important than bias in some contexts and efforts to obtain accurate and reproducible metagenomics measurements must ultimately grapple with all three types of error. We have added a paragraph to the Discussion (subsection “Limitations and Next Steps”) to clarify the potential for contamination to limit effective application of the present work. In this response, we discuss further the role of contamination in obtaining quantitative measurements of community differences. We discuss practical solutions for dealing with contamination in our response to 4 b).
Contamination: Like bias, contamination (either from the lab environment or reagents or crosscontamination between samples) is a ubiquitous source of systematic error in metagenomics experiments. Whether bias or contamination is more important will depend on the experimental platform and system and the specific goals of the experiment. For example, contamination will increase in importance for sequencing platforms with high rates of index/barcode switching, for low biomass samples, and for downstream analyses that are sensitive to taxa found in very small proportions. The measures we propose for quantifying bias are challenged by the presence of contamination, as evidenced by the fact that an arbitrary amount of contamination from a taxon not in the sample yields an infinite multiplicative error. Future work that incorporates our bias model with models of contamination can lead to improved measurement of each error type and improved understanding of the joint error process in applications. We have summarized these points in the Discussion subsection “Limitations and Next Steps”.
Saturation and other systematic variation in taxon efficiencies: Saturation – where a taxon’s efficiency decreases with its relative or absolute abundance in the sample – is a special case of the relative efficiencies systematically depending on sample composition. The residual compositional errors, $\hat{\u03f5}(s)\sim \mathbf{O}(s)/(\mathbf{A}(s)\cdot \hat{\mathbf{B}})$, include systematic and random variation in the efficiencies and so place an upper bound on the strength of such effects. With the exception of the Brooks et al. (2015) PCRproduct mixtures, the residuals are much smaller than the bias (Figure 3—figure supplement 1; Figure 4—figure supplement 1), and the large residuals in the PCRproduct mixtures can plausibly be explained by random sample construction error. These observations indicate that saturation or other systematic variation in the bias is not significant in the experiments we analyzed. It is possible that saturation would become apparent in samples with more extreme variation in taxon abundances, we note one potential example of this (Thomas et al., 2016) in the Discussion and further testing is warranted. Our model can serve as a null model for future researchers to look for such effects in their own protocols; since it properly accounts for compositionality, deviations from our model can correctly be interpreted as efficiencies varying with sample composition. Protocol design should strive for consistent efficiencies, which is likely to be much more attainable than eliminating bias.
b) In the two data sets analyzed in this study the authors removed contamination manually. However, this approach likely won't be possible for nonmock data.
Removal of contamination is indeed more difficult in natural control samples where the true taxa present in the sample are not known a priori. However, removal of contamination using current practices – such as filtering out taxa below a minimum frequency above which outofsample contribution is assumed negligible or applying a specialized contaminationremoval procedure (Davis et al., 2018; Larsson et al., 2018) – can be used instead of manual removal. Because our model and method for estimating bias consider only the ratios among taxa, they apply equally well to arbitrary subcompositions of taxa from each sample. Therefore, one can restrict to the taxa in each sample one is confident are not contaminants and proceed to estimate bias from these subcompositions with the methods we describe. Future Bayesian or maximumlikelihood methods that jointly model contamination and bias may make it possible to avoid the loss of information associated with current contaminant filtering approaches. We have summarized these points in the Discussion subsection “Limitations and Next Steps”.
c) All bias estimates reported in the text are usually reported without dispersion measures, even though the formulated point estimate should induce an appropriate dispersion measure as well. It would be interesting to see how much uncertainty there is regarding the bias as a function of samples used for correction. This could aid in determining how many samples should be used to estimate the bias.
We agree with the reviewers that applications of bias estimation should account for uncertainty in the estimated bias, and understanding of the structure of this uncertainty for different controlsample designs should inform experimental planning. Towards these ends, we have taken several steps in this revision.
We now implement a bootstrap procedure for estimating the uncertainty in the estimate of bias. This procedure is described in Appendix 2 section “Bootstrap method for estimating uncertainty in the bias”, implemented in the R package, and demonstrated in the tutorial. We are developing more sophisticated maximumlikelihood and Bayesian estimation procedures that we plan to add R package and tutorials in the future and that overcome some limitations of our bootstrap method (namely, not accounting for sequencing depth and potentially suboptimal performance when different control samples contain different taxa).
In the online R Markdown documents, we use the bootstrap procedure to estimate standard errors for all bias estimates reported in the manuscript. Due to the large sample size of these experiments, the standard errors are small relative to the estimated bias parameters and we exclude them from the tables for clarity. Standard errors for the estimated bias for the Costea et al. (2017) experiment are plotted in Figure 4—figure supplement 1 and for the estimated bias of individual workflow steps for the Brooks et al. (2015) experiment in Figure 6.
We also examined how the precision of the bias estimate varies with number of control samples for the protocol with an intermediate degree of noise (Protocol H) in the Costea et al. (2017) experiment. To estimate the (geometric) standard errors for N control samples (from N=1 to N=9), we performed multinomial bootstrap resampling of N samples and computed the geometric standard deviation of the resulting efficiency estimates. Figure 5—figure supplement 1 shows the resulting geometric standard errors as a function of N, alongside the observed compositions for help in interpretation. This figure and analysis is discussed in the subsection “Calibration”. When noise in the controls is much lower than bias, as it is here, measuring just 1 control sample gives a sufficiently accurate estimate to be useful in calibration. However, multiple samples are needed to allow estimating uncertainty and may be essential if “heavytail” random errors are common (such as those seen in two samples in the figure).
5) It would be worthwhile to discuss the observed bias structure with other data transformations (not only the ones from compositional data analysis). For instance, Equation 7 essentially makes fold changes of the same taxon across samples invariant to the bias. This would mean that the established transformation from RNAseq analysis would also be applicable here (logtransformation and comparison by logfold change rather than absolute difference).
Equation 6 and Equation 7 demonstrate that bias cancels out in analyses based on fold changes (FCs) in the ratios of taxa. However, bias does not cancel out in nonratio based analyses due to the reasons stated in the section “Systematic error in taxon ratios, but not in taxon proportions, is independent of sample composition”. Here we present the general result for the error in the estimated proportional FC of a taxon between two samples, which provides a direct point of comparison to Equation 6 and Equation 7. From Equation 4, we have that the proportional FC of taxon $i$ from sample $t$ to sample $s$ is
The bias partially cancels, but not fully, due to the dependence of the observed proportions on the sample mean efficiency (SME); the result is an error in the estimated proportional FC equal to the reciprocal of the FC in the SME. This can even results in a sign if the true proportional FC of a taxon is smaller than and opposing the FC in the SME. We give a numerical example of such a sign error for Taxon 3 in Figure 2 (subsection “Analyses based on foldchanges in taxon ratios are insensitive to bias, while analyses based on taxon proportions can give spurious results”, first paragraph) where the taxon is observed to have a proportional FC greater than 1 but in fact has a proportional FC that is less than 1. In a recent paper, Kevorkian et al. (2018) suggested that bias cancels when computing the proportional FC between samples. However, their derivation did not properly account for the compositional nature of MGS measurements, which causes the appearance of the sample mean efficiency in our Equation 4 but that is missing in their Equation 3.
Whether spurious associations (or missed true associations) are a serious worry in applications appears to us an open question. It is easy to come up with examples of spurious associations in a given taxon due to another taxon changing from high to low frequency between samples (such as Taxon 1 in Figure 2; see also Brooks, 2016, p. 338). Such large singlespecies changes might be uncommon in some ecosystems; but large phylumlevel changes could similarly cause spurious results if species’ efficiencies are similar within the phylum. On the other hand, it is possible that the SME is fairly stable across samples. In this case, the partial canceling may be sufficient for accurate estimates of differential abundance. Because the observed FCs of all taxa are multiplied by the same error term, the observed logFCs remain correlated with the actual logFCs. A plot of observed vs. actual logFCs will appear systematically shifted from the y = x line by the logFC in the SME; but such a shift might often be obscured in real data by noise. This result could explain the high correlation and apparent lack of a systematic shift seen in D'Amore et al. (2016) Figure 7, which considers this relationship for the FCs in proportions between two DNA mock communities. Thus determining the conditions in which bias causes spurious taxonomic associations remains an important line of future work.
Other nonratiobased transformations currently employed in RNAseq and microbiome differentialabundance analysis, such as rank transformation and quantile normalization, face a similar problem. However, the practical implications of bias for such analyses again remain unclear and requires further consideration via theory and/or simulation, for which our model can provide a conceptual foundation.
6) I feel that some further investigation into correlations of the identified biases with taxonomy may be helpful to gain mechanistic insight into the sources of bias. For instance, does the bias correlate with the abundance of the taxon, or do phylogenetically similar taxa show similar biases? This could help to formulate strategies for estimating bias for taxa that are undetected in the batch replicates. The authors also argue that summarization of taxa into higher phylogenetic ranks may introduce additional errors, but they do not state which rank is permissible here (species, genus, family?).
We agree that understanding the extent to which bias is conserved among phylogenetically similar species is critical for understanding the causes and consequences of bias and for developing practical estimation and correction methods. The patterns we observe provide some anecdotal evidence about the nature of phylogenetic conservation of bias. For example, the variation in efficiencies seen among the five Firmicute species in the Costea et al. (2017) experiment shows that bias is not universally conserved at the phylum level and may pose a problem for analyses based on Phylumlevel aggregation (as noted in the third paragraph of the Discussion). As further evidence, we visualized differential bias against phylogeny for the spikein taxa in the Costea et al. (2017) experiment including the contaminant Shigella flexneri, (see Author response image 1). Author response image 1 shows that the three most closely related species – S. flexneri, S. enterica, and Y. pseudotuberculosis – have very similar relative efficiencies. These three taxa are all within the family Enterobacteriaceae. On the other hand, the next two most closely related species, B. hansenii and C. saccharolyticum in the family Lachnospiraceae, do not have similar relative efficiencies. Thus these results are ambiguous regarding how conserved bias is at the family level.
Our ability to draw firm conclusions about the phylogenetic conservation of bias is limited by the small number of taxa in these datasets. For instance, only two taxa in the Brooks et al. (2015) dataset, L. iners and L. crispatus, share a genus, while none of the spikein taxa in the Costea et al. (2017) experiment do. Our future plans include using additional mock and nonmock datasets to estimate the phylogenetic conservation of bias more broadly, so as to give more concrete guidance on when bias is (or is not) likely to remain consistent when aggregating at a given taxonomic level.
Certain phenotypes are known to interact mechanistically with the MGS measurement process to introduce bias. In this revision, we improved our analysis of the relationship between bias and one such phenotype – 16S copy number (CN) – by performing a more rigorous estimation of CN (described in Materials and methods subsection “Data and analysis code availability” and shown in an Rmarkdown document online) and improving the figure showing the relationship between bias and CN (Figure 6—figure supplement 1). Previously we followed Brooks et al. (2015) in basing our estimates of 16S CN on RefSeq annotations for the given species. We have now revised our CN estimates for A. vaginae from 1 to 2 and for L. iners from 1 to 5 (Table 3). Previously 16S CN did not predict total bias or PCR bias, and that 16S CN correction actually made the estimated proportions worse than no correction in both cell and DNA mixtures. With these new numbers, we find that 16S CN variation explains a substantial amount of estimated PCR bias and that CN correction substantially improves the estimated proportions for the DNA mixtures. We thus substantially revised our previous conclusion that CN does not explain PCR bias in this experiment; however, it remains that CN explains at best a small fraction of total bias. These new results are reflected in the last paragraph of the Results subsection “Bias measurement as a means of evaluating and improving protocols” and reflected in the improved fit in the middle column of Figure 3—figure supplement 2. Investigation of other potential phenotypic predictors of bias is of interest, but we are again limited in our ability to systematically survey potentially relevant phenotypes by the limited number of taxa in our datasets.
https://doi.org/10.7554/eLife.46923.027Article and author information
Author details
Funding
The authors declare that there was no funding for this work.
Acknowledgements
We thank Glen Satten and David Clausen for valuable discussions.
Senior Editor
 Wendy S Garrett, Harvard T.H. Chan School of Public Health, United States
Reviewing Editor
 Peter Turnbaugh, University of California, San Francisco, United States
Reviewers
 Peter Turnbaugh, University of California, San Francisco, United States
 Christopher Quince, University of Warwick, United Kingdom
 Sean Gibbons
Publication history
 Received: March 16, 2019
 Accepted: August 10, 2019
 Version of Record published: September 10, 2019 (version 1)
Copyright
© 2019, McLaren et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 6,127
 Page views

 796
 Downloads

 38
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.