A phylogenetic transform enhances analysis of compositional microbiota data
 Cited 9
 Views 3,692
 Annotations
Abstract
Surveys of microbial communities (microbiota), typically measured as relative abundance of species, have illustrated the importance of these communities in human health and disease. Yet, statistical artifacts commonly plague the analysis of relative abundance data. Here, we introduce the PhILR transform, which incorporates microbial evolutionary models with the isometric logratio transform to allow offtheshelf statistical tools to be safely applied to microbiota surveys. We demonstrate that analyses of communitylevel structure can be applied to PhILR transformed data with performance on benchmarks rivaling or surpassing standard tools. Additionally, by decomposing distance in the PhILR transformed space, we identified neighboring clades that may have adapted to distinct human body sites. Decomposing variance revealed that covariation of bacterial clades within human body sites increases with phylogenetic relatedness. Together, these findings illustrate how the PhILR transform combines statistical and phylogenetic models to overcome compositional data challenges and enable evolutionary insights relevant to microbial communities.
https://doi.org/10.7554/eLife.21887.001Introduction
Microbiota research today embodies the datarich nature of modern biology. Advances in highthroughput DNA sequencing allow for rapid and affordable surveys of thousands of bacterial taxa across hundreds of samples (Caporaso et al., 2011). The exploding availability of sequencing data has poised microbiota research to advance our understanding of fields as diverse as ecology, evolution, medicine, and agriculture (Waldor et al., 2015). Considerable effort now focuses on interrogating microbiota datasets to identify relationships between bacterial taxa, as well as between microbes and their environment.
Increasingly, it is appreciated that the relative nature of microbial abundance data in microbiota studies can lead to spurious statistical analyses (Jackson, 1997; Friedman and Alm, 2012; Aitchison, 1986; Lovell et al., 2011; Gloor et al., 2016a; Britanova et al., 2014; Li, 2015; Tsilimigras and Fodor, 2016). With next generation sequencing, the number of reads per sample can vary independently of microbial load (Lovell et al., 2011; Tsilimigras and Fodor, 2016). In order to make measurements comparable across samples, most studies therefore analyze the relative abundance of bacterial taxa. Analyses are thus not carried out on absolute abundances of community members (Figure 1A), but rather on relative data occupying a constrained geometric space and represented in a nonCartesian coordinate system (Figure 1B). Such relative abundance datasets are often termed compositional. The use of most standard statistical tools (e.g., correlation, regression, or classification) within a compositional space leads to spurious results (PawlowskyGlahn et al., 2015). For example, threequarters of the significant bacterial interactions inferred by Pearson correlation on a compositional human microbiota dataset were likely false (Friedman and Alm, 2012), and over twothirds of differentially abundant taxa inferred by a ttest on a simulated compositional human microbiota dataset were spurious (Mandal et al., 2015). To account for compositional effects in microbial datasets, bioinformatics efforts have rederived common statistical methods including correlation statistics (Friedman and Alm, 2012; Fang et al., 2015), hypothesis testing (La Rosa et al., 2012), and variable selection (Chen and Li, 2013; Lin et al., 2014).
An alternative approach is to transform compositional microbiota data to a space where existing statistical methods may be applied without introducing spurious conclusions. This approach is attractive because of its efficiency: the vast toolbox of existing statistical models can be applied without rederivation. Normalization methods, for example, have been proposed to modify count data by assuming reads follow certain statistical distributions (e.g., negative binomial) (Paulson et al., 2013; Anders and Huber, 2010). Alternatively, the field of Compositional Data Analysis (CoDA) has focused on formalizing methods for transforming compositional data into a simpler geometry without having to assume data adhere to a distribution model (BaconShone, 2011). Previous microbiota analyses have already leveraged CoDA theory and used the centered logratio transform to reconstruct microbial association networks and interactions (Kurtz et al., 2015; Lee et al., 2014) and to analyze differential abundances (Fernandes et al., 2014; Gloor et al., 2016b). However, the centered logratio transform has a crucial limitation: it yields a coordinate system featuring a singular covariance matrix and is thus unsuitable for many common statistical models (PawlowskyGlahn et al., 2015). This drawback can be sidestepped using another CoDA transform, known as the Isometric LogRatio (ILR) transformation (Egozcue et al., 2003). The ILR transform can be built from a sequential binary partition of the original variable space (Figure 1C), creating a new coordinate system with an orthonormal basis (Figure 1D and E) (Egozcue and PawlowskyGlahn, 2005). However, a known obstacle to using the ILR transform is the choice of partition such that the resulting coordinates are meaningful (PawlowskyGlahn et al., 2015). To date, microbiota studies have chosen ILR coordinates using ad hoc sequential binary partitions of bacterial groups that are not easily interpreted (Finucane et al., 2014; Lê Cao et al., 2016). Alternatively, external covariates have been used to pick groups of bacterial taxa to contrast (Morton et al., 2017).
Here, we introduce the bacterial phylogenetic tree as a natural and informative sequential binary partition when applying the ILR transform to microbiota datasets (Figure 1C). Using phylogenies to construct the ILR transform results in an ILR coordinate system capturing evolutionary relationships between neighboring bacterial groups (clades). Analyses of neighboring clades offer the opportunity for biological insight: clade analyses have linked genetic differentiation to ecological adaptation (Hunt et al., 2008), and the relative levels of sister bacterial genera differentiate human cohorts by diet, geography, and culture (De Filippo et al., 2010; Wu et al., 2011; Yatsunenko et al., 2012). Datasets analyzed by a phylogenetically aware ILR transform could therefore reveal ecological and evolutionary factors shaping hostassociated microbial communities.
We term our approach the Phylogenetic ILR (PhILR) transform. Using published environmental and humanassociated 16S rRNA datasets as benchmarks, we found that simple Euclidean distances calculated on PhILR transformed data provided a compositionally robust alternative to distance/dissimilarity measures like BrayCurtis, Jaccard, and Unifrac. In addition, we observed that the accuracy of supervised classification methods on our benchmark datasets was matched or improved with PhILR transformed data relative to applying the same models on untransformed (raw) or log transformed relative abundance data. Decomposing distances between samples along PhILR coordinates identified bacterial clades that may have differentiated to adapt to distinct body sites. Similar decomposition of variance along PhILR coordinates showed that, in all human body sites studied, the degree to which neighboring bacterial clades covary tends to increase with the phylogenetic relatedness between clades. Together, these findings demonstrate that the PhILR transform can be used to enhance existing microbiota analysis pipelines, as well as enable novel phylogenetic analyses of microbial ecosystems.
Results
Constructing the PhILR transform
The PhILR transform has two goals. The first goal is to transform input microbiota data into an unconstrained space with an orthogonal basis while preserving all information contained in the original composition. The second goal is to conduct this transform using phylogenetic information. To achieve these dual goals on a given set of $N$ samples consisting of relative measurements of $D$ taxa (Figure 1B), we transform data into a new space of $N$ samples and $(D1)$ coordinates termed ‘balances’ (Figure 1C–E) (Egozcue et al., 2003; Egozcue and PawlowskyGlahn, 2005). Each balance ${y}_{i}^{*}$ is associated with a single internal node $i$ of a phylogenetic tree with the $D$ taxa as leaves (the asterisk denotes a quantity represented in PhILR space). The balance represents the logratio of the geometric mean relative abundance of the two clades of taxa that descend from $i$ (Materials and methods). Although individual balances may share overlapping sets of leaves and thus exhibit dependent behavior, the ILR transform rescales and combines leaves to form a coordinate system whose basis vectors are orthonormal and the corresponding coordinates are Cartesian (Egozcue et al., 2003; Egozcue and PawlowskyGlahn, 2005). The orthogonality of basis vectors allows conventional statistical tools to be used without compositional artifacts. The unitlength of basis vectors makes balances across the tree statistically comparable even when they have differing numbers of descendant tips or exist at different depths in the tree (PawlowskyGlahn et al., 2015). In addition, the unitlength ensures that the variance of PhILR balances has a consistent scale, unlike the variance of logratios originally proposed by Aitchison (Aitchison, 1986) as a measure of association, in which it can be unclear what constitutes a large or small variance (Friedman and Alm, 2012).
While the above description represents the core of the PhILR transform, we have also equipped the PhILR transform with two sets of weights that can: (1) address the multitude of zero and nearzero counts present in microbiota data; and, (2) incorporate phylogenetic branch lengths into the transformed space. Because zero counts cause problems when computing logs or performing division, zeros are often replaced in microbiota analyses with small nonzero counts. However, to avoid excess zero replacement that may itself introduce bias, stringent hard filtering thresholds are often employed (e.g. removing all taxa that are not seen with at least a minimum number of counts in a subset of samples). Still, hard filtering thresholds may remove a substantial fraction of observed taxa and do not account for the low precision (or high variability) of nearzero counts (Gloor et al., 2016a; Good, 1956; McMurdie and Holmes, 2014). We therefore developed a ‘taxon weighting’ scheme that acts as a type of softthresholding, supplementing zero replacement methods with a generalized form of the ILR transform that allows weights to be attached to individual taxa (Egozcue and PawlowskyGlahn, 2016). Weights are chosen with a heuristic designed to down weight the influence of taxa with many zero or nearzero counts (Materials and methods).
Our second weighting scheme is called branch length weighting. Certain analyses may benefit from incorporating information on evolutionary distances between taxa (Lozupone and Knight, 2005; Fukuyama et al., 2012; Purdom, 2011). For example, because related bacteria may be more likely to share similar traits (Martiny et al., 2015), it may be desirable to consider communities differing only in the abundance of closelyrelated microbes to be more similar than communities differing only in the abundance of distantlyrelated microbes. Because of the onetoone correspondence between PhILR balances and internal nodes on the phylogenetic tree, evolutionary information can be incorporated into the PhILR transform by scaling balances using the phylogenetic distance between their direct descendants (Materials and methods). We note that we employ both branch length weighting and taxon weighting throughout our following analyses except where noted; still, these weights should be considered optional additions to the core PhILR transform.
Benchmarking communitylevel analyses in the PhILR coordinate system
To illustrate how the PhILR transform can be used to perform standard communitylevel analyses of microbiota datasets, we first examined measures of community dissimilarity. Microbiota analyses commonly compute the dissimilarity or distance between pairs of samples and use these computed pairwise distances as input to a variety of statistical tools. We investigated how Euclidean distances calculated on PhILR transformed data compared to common ecological measures of microbiota distance or dissimilarity (UniFrac, BrayCurtis, and Jaccard) as well as Euclidean distance applied to raw relative abundance data in standard distancebased analysis. We chose three different microbiota surveys as reference datasets: Costello Skin Sites (CSS), a dataset of 357 samples from 12 human skin sites (Costello et al., 2009; Knights et al., 2011); Human Microbiome Project (HMP), a dataset of 4743 samples from 18 human body sites (e.g., skin, vaginal, oral, and stool) (Human Microbiome Project Consortium, 2012); and, Global Patterns (GP), a dataset of 26 samples from nine human or environmental sites (Caporaso et al., 2011) (Supplementary file 1 and Figure 2—figure supplement 1).
Distancebased analyses using Euclidean distances computed on PhILR transformed data exhibited performance rivaling common ecological distance or dissimilarity measures. Principal coordinate analyses (PCoA) qualitatively demonstrated separation of body sites using both Euclidean distances on PhILR transformed data (Figure 2A) and with several standard distance measures calculated on raw relative abundance data (Figure 2—figure supplement 2). To quantitatively compare distance measures, we tested how well habitat information explained variability among distance matrices as measured by the R^{2} statistic from PERMANOVA (Chen et al., 2012). By this metric, the Euclidean distance in the PhILR coordinate system significantly outperformed the five competing distance metrics in all but one case (in comparison to Weighted UniFrac when applied to the HMP dataset; Figure 2B).
Next, we tested the performance of predictive statistical models in the PhILR coordinate system. We examined four standard supervised classification techniques: logistic regression (LR), support vector machines (SVM), knearest neighbors (kNN), and random forests (RF) (Knights et al., 2011). We applied these methods to the same three reference datasets used in our comparison of distance metrics. As a baseline, the machine learning methods were applied to raw relative abundance datasets and raw relative abundance data that had been logtransformed.
The PhILR transform significantly improved supervised classification accuracy in 7 of the 12 benchmark tasks compared to raw relative abundances (Figure 2C). Accuracy improved by more than 90% in two benchmarks (SVM on HMP and GP), relative to results on the raw data. Log transformation of the data also improved classifier accuracy significantly on 6 of the 12 benchmarks but also significantly underperformed on one benchmark compared to raw relative abundances. In addition, the PhILR transform significantly improved classification accuracy in 5 of the 12 benchmarks relative to the log transform. Overall, the PhILR transform often outperformed the raw and log transformed relative abundances with respect to classification accuracy and was never significantly worse.
Identifying neighboring clades that differ by body site preference
While our benchmarking experiments demonstrated how PhILR transformed data performed in communitylevel analyses, we also wanted to explore potential biological insights afforded by the PhILR coordinate system. We therefore investigated how distances decomposed along PhILR balances using a sparse logistic regression model to examine which balances distinguished human body site microbiota in the HMP dataset. Such balances could be used to identify neighboring bacterial clades whose relative abundances capture communitylevel differences between body site microbiota. Microbial genetic differentiation may be associated with specialization to new resources or lifestyle preferences (Hunt et al., 2008), meaning that distinguishing balances near the tips of the bacterial tree may correspond to clades adapting to human body site environments.
We identified dozens of highly discriminatory balances, which were spread across the bacterial phylogeny (Figure 3A and Figure 3—figure supplement 1). Some discriminatory balances were found deep in the tree. Abundances of the Firmicutes, Bacteroidetes, and Proteobacteria relative to the Actinobacteria, Fusobacteria, and members of other phyla, separated skin body sites from oral and stool sites (Figure 3B). Levels of the genus Bacteroides relative to the genus Prevotella differentiated stool microbiota from other communities on the body (Figure 3C). Notably, values of select balances below the genus level also varied by body site. Relative levels of sister Corynebacterium species separated human skin sites from gingival sites (Figure 3D). Specieslevel balances even differentiated sites in nearby habitats; levels of sister Streptococcus species or sister Actinomyces species vary depending on specific oral sites (Figure 3E and F). These results show that the decomposition of distances between groups of samples along PhILR balances can be used to highlight ancestral balances that distinguish body site microbiota, as well as to identify more recent balances that may separate species that have adapted to inhabit different body sites.
Balance variance and microbiota assembly
As a natural extension of our analysis of how distance decomposes along PhILR balances, we next investigated how balance variance decomposed in the PhILR coordinate system. Balance variance is a measure of association between neighboring bacterial clades. When the variance of a balance between two clades approaches zero, the mean abundance of taxa in each of the two clades will be linearly related and thus covary across microbial habitats (Lovell et al., 2015). By contrast, when a balance exhibits high variance, related bacterial clades exhibit unlinked or exclusionary patterns across samples. Unlike standard measures of association (e.g., Pearson correlation) balance variance is robust to compositional effects (PawlowskyGlahn et al., 2015).
Our preliminary investigation demonstrated a striking pattern in which balance variance decreased for balances closer to the tips of the phylogeny and increased for balances nearer to the root. To determine if this observed pattern was not the result of technical artifact, we took the following three steps. First, we omitted branch length weights from the transform as we anticipated that branch lengths may vary nonrandomly as a function of depth in the phylogeny. Second, we anticipated that balances near the tips of the phylogeny would be associated with fewer read counts and thus would be more biased by our chosen heuristics for taxon weighting and zero replacement. We therefore omitted taxon weighting, employed more stringent filtering thresholds, and conditioned our calculation of balance variance on nonzero counts rather than using zeroreplacement (Materials and methods). Third, we combined regression and a permutation scheme to test the null hypothesis that the degree to which neighboring clades covary is independent of the phylogenetic distance between them (Materials and methods). By permuting tip labels on the tree, our test generates a restricted subset of random sequential binary partitions that still maintains the count variability (and potential biases due to our zero handling methods) of the observed data.
With our modified PhILR analysis in place, we observed significantly decreasing balance variances near the tips of the phylogenetic tree for all body sites in the HMP dataset (p<0.01, permutation test with FDR correction; Figure 4A–F and Figure 4—figure supplements 1–2). Low variance balances predominated near the leaves of the tree. Examples of such balances involved B. fragilis species in stool (Figure 4H), Rothia mucilaginosa species in the buccal mucosa (Figure 4J), and Lactobacillus species in the midvagina (Figure 4L). By contrast, higher variance balances tended to be more basal on the tree. Three examples of high variance balances corresponded with clades at the order (Figure 4G), family (Figure 4I), and genus (Figure 4K) levels. We also observed that the relationship between balance variance and phylogenetic depth varied at different taxonomic scales. LOESS regression revealed that trends between variance and phylogenetic depth were stronger above the species level than below it (Materials and methods; Figure 4D–F and Figure 4—figure supplement 2). Overall, the observed pattern of decreasing balance variance near the tips of the phylogenetic tree suggested that closely related bacteria tend to covary in human body sites.
Discussion
There exists a symbiosis between our understanding of bacterial evolution and the ecology of hostassociated microbial communities (Matsen, 2015). Microbiota studies have shown that mammals and bacteria cospeciated over millions of years (Moeller et al., 2016; Ley et al., 2008), and human gut microbes have revealed the forces driving horizontal gene transfer between bacteria (Smillie et al., 2011). Evolutionary tools and theory have been used to explain how cooperation benefits members of gut microbial communities (RakoffNahoum et al., 2016), and raise concerns that rising rates of chronic disease are linked to microbiota disruption (Blaser and Falkow, 2009). Here, we aimed to continue building links between microbiota evolution and ecology by designing a data transform that uses phylogenetic models to overcome the challenges associated with compositional data while enabling novel evolutionary analyses.
We found that the resulting PhILR coordinate system, at least with respect to the performance metrics chosen, led to significantly improved performance for a variety of communitylevel analyses now used in microbiota analysis. While these results add credence to our proposed approach, we underscore that we do not find it essential that PhILR demonstrates superior benchmark performance to motivate its use in microbiota analysis. We believe that the need for compositionally robust tools has already been well established (Jackson, 1997; Friedman and Alm, 2012; Aitchison, 1986; Lovell et al., 2011; Gloor et al., 2016a; Li, 2015; Tsilimigras and Fodor, 2016) and intended these benchmarks to showcase the flexibility and utility of working with PhILR transformed data. We also note that for some analyses, a phylogenybased ILR transform will not outperform an ILR transform built from another sequential binary partition. In fact, in the absence of branch length weights, any random ILR partition would yield equivalent results on our benchmark tasks. Instead, what distinguishes the PhILR transform from other ILR transforms is the interpretability of the transformed coordinates. Balances in PhILR space correspond to speciation events, which can be a source for biological insight.
For example, performing regression on PhILR transformed data enabled us to decompose the distance between bacterial communities onto individual locations on the phylogeny, highlighting balances near the tips of the tree that distinguished human body sites. These balances may reflect functional specialization, as ecological partitioning among recently differentiated bacterial clades could be caused by genetic adaptation to new environments or lifestyles (Hunt et al., 2008). Indeed, among oral body sites, we observed consistent site specificity of neighboring bacterial clades within the genera Actinomyces (Figure 3F) and Streptococcus (Figure 3E). Species within the Actinomyces genera have been previously observed to partition by oral sites (Aas et al., 2005; Mager et al., 2003). Even more heterogeneity has been observed within the Streptococcus genus, where species have been identified that distinguish teeth, plaque, mucosal, tongue, saliva, and other oral sites (Aas et al., 2005; Mager et al., 2003). This partitioning likely reflects variation in the anatomy and resource availability across regions of the mouth (Aas et al., 2005), as well as the kinds of surfaces bacterial strains can adhere to (Mager et al., 2003).
We also observed evidence for potential withingenus adaptation to body sites that has not been previously reported. Within the genus Corynebacterium, we found ratios of taxa varied among oral plaques and select skin sites (Figure 3D). Although the genus is now appreciated as favoring moist skin environments, the roles played by individual Corynebacteria within skin microbiota remain incompletely understood (Grice and Segre, 2011). Precisely linking individual Corynebacterium species or strains to body sites is beyond the scope of this study due to the limited taxonomic resolution of 16S rRNA datasets (Janda and Abbott, 2007; Větrovský and Baldrian, 2013). Nevertheless, we believe the PhILR coordinate system may be used in the future to identify groups of related bacterial taxa that have undergone recent functional adaptation.
Another example of how the PhILR transform can provide biological insights arose in our analysis of how human microbiota variance decomposes along individual balances. We observed that balances between more phylogenetically related clades were significantly more likely to covary than expected by chance. This pattern could reflect evolutionary and ecological forces structuring microbial communities in the human body. Related bacterial taxa have been hypothesized to have similar lifestyle characteristics (Martiny et al., 2015; Zaneveld et al., 2010), and may thus covary in human body sites that favor their shared traits (Levy and Borenstein, 2013; Faust et al., 2012). An alternative explanation for the balance variation patterns we observed is that sequencing errors and read clustering artifacts are likely to produce OTUs (Operational Taxonomic Units) with similar reference sequences and distributions across samples. While we cannot conclusively rule out this alternative hypothesis, we note that it would not explain why signal for taxa covariation is weakest for balances at higher taxonomic levels and appears to plateau for balances near or below the species level. A biological explanation for the plateauing signal would be that lifestyle characteristics enabling bacteria to persist in human body sites are conserved among strains roughly corresponding to the same species. Followup studies are needed to more conclusively understand how balance variance patterns across the phylogenies can be interpreted from an evolutionary standpoint.
Though the methods presented here provide a coherent geometric framework for performing microbiota analysis free from compositional artifacts, future refinements are possible. Specifically, we highlight issues relating to our choice of weights, the handling of zero values, and information loss during count normalization. Both the taxa weights and the branch lengths weights we introduce here may be viewed as preliminary heuristics; future work will likely yield additional weighting schemes, as well as knowledge for when a given weighting scheme should be matched to an analysis task. In the case of supervised machine learning, weighting selection could be optimized as part of the training process. Additionally, if it is important that the transformed data has meaningful numerical coordinates, such that one desires to interpret the exact numerical value of a given balance in a sample, we suggest that neither branch length weights nor taxa weights be used as these weights can complicate this type of interpretation. Concerning our handling of zero values, this model design choice confronts an outstanding challenge for microbiota and compositional data analysis (Tsilimigras and Fodor, 2016; MartınFernandez et al., 2011). Part of this challenge’s difficulty is whether a zero value represents a value below the detection limit (rounded zero) or a truly absent taxon (essential zero). Here, we employ zero replacement, which implies an assumption that all zero values represent rounded zeros. New mixture models that explicitly allow for both essential and rounded zeros (Bear and Billheimer, 2016), as well as more advanced methods of zero replacement (MartınFernandez et al., 2011; MartinFernandez et al., 2015), may enable us to handle zeros in a more sophisticated manner. Lastly, in regards to informational loss caused by normalization, it is known that the number of counts measured for a given taxon influences the precision with which we may estimate its relative abundance in a sample (Gloor et al., 2016a; Good, 1956; McMurdie and Holmes, 2014). While our taxa weights are intended to address this idea, a fully probabilistic model of counts would likely provide more accurate error bounds for inference. We believe it would be possible to build such a model in a Bayesian framework by viewing the observed counts as multinomial draws from a point in the PhILR transformed space, as has been done for other logratio based spaces (Billheimer et al., 2001).
Beyond refining the PhILR transform itself, future effort may also be directed towards interpreting the transform's results at the single taxon level. Microbiota studies frequently focus on individual taxa for tasks such as identifying specific bacteria that are causal or biomarkers of disease. Logratio approaches can provide a compositionally robust approach to identifying biomarkers based on changes in the relative abundance of individual taxa. Due to the onetoone correspondence between CLR coordinates and individual taxa, the CLR transform has been used previously to build compositionally robust models in terms of individual taxa (Mandal et al., 2015; Kurtz et al., 2015; Fernandes et al., 2014). However, CLR transformed data suffer from the drawback of a singular covariance matrix, which can make the development of new models based on the CLR transform difficult (PawlowskyGlahn et al., 2015). ILR transformed data do not suffer this drawback (PawlowskyGlahn et al., 2015) and moreover, can be analyzed at the single taxon level. To do so, the inverse ILR transform can be applied to model results generated in an ILR coordinate system, yielding analyses in terms of changes in the relative abundance of individual taxa (PawlowskyGlahn et al., 2015). The use of the inverse ILR transform in this manner is well established (PawlowskyGlahn et al., 2015; van den Boogaart and TolosanaDelgado, 2013; PawlowskyGlahn and Buccianti, 2011) and the inverse transform is provided in the Methods (Egozcue and PawlowskyGlahn, 2016).
Despite these avenues for improvement, modification, or extension we believe the PhILR transform already enables existing statistical methods to be applied to metagenomic datasets, free from compositional artifacts and framed according to an evolutionary perspective. We foresee the PhILR transform being used as a default transformation prior to many microbiota analyses, particularly if a phylogenetic perspective is desired. For example, the PhILR transform could be used in lieu of the conventional log transform, which is often the default choice in microbiota analysis but is not robust to compositional effects. Substituting PhILR into existing bioinformatics pipelines should often be seamless and we emphasize that all statistical tools applied to PhILR transformed data in this study were used 'offtheshelf' and without modification. Importantly, such a substitution contrasts with the alternative approach for accounting for compositional microbiota data, which is to modify existing statistical techniques. Such modification is often challenging because many statistics were derived assuming an unconstrained space with an orthonormal basis, not a constrained and overdetermined compositional space. Therefore, while select techniques have already been adapted (e.g. distance measures that incorporate phylogenetic information (Lozupone and Knight, 2005) and feature selection methods that handle compositional input (Chen and Li, 2013; Lin et al., 2014)), it is likely that certain statistical goals, such as nonlinear community forecasting or control system modeling, may prove too complex for adapting to the compositional nature of microbiota datasets. Finally, beyond microbiota surveys, we also recognize that compositional metagenomics datasets are generated when studying the ecology of viral communities (Culley et al., 2006) or clonal population structure in cancer (Britanova et al., 2014; Yuan et al., 2015; Roth et al., 2014). We expect the PhILR transform to aid other arenas of biological research where variables are measured by relative abundance and related by an evolutionary tree.
Materials and methods
The ILR transform
A typical microbiome sample consists of measured counts ${c}_{j}$ for taxa $j\in \left\{1,\dots ,D\right\}$. A standard operation is to take count data and transform it to relative abundances. This operation is referred to as closure in compositional data analysis (Aitchison, 1986) and is given by
where $\mathit{x}$ represents a vector of relative abundances for the $D$ taxa in the sample. We can represent a binary phylogenetic tree of the $D$ taxa using a sign matrix $\mathrm{\Theta}$ as introduced by PawlowskyGlahn and Egozcue (PawlowskyGlahn and Egozcue, 2011) and shown in Figure 5. Each row of the sign matrix indexes an internal node $i$ of the tree and each column indexes a tip of the tree. A given element in the sign matrix is $\pm 1$ depending on which of the two clades descending from $i$ that tip is a part of and 0 if that tip is not a descendent of $i$. The assignment of $+1$ versus $1$ determines which clade is represented in the numerator versus the denominator of the corresponding logratio (as described below). Exchanging this assignment for a given balance switches which clade is represented in the numerator versus the denominator of the logratio. Following Egozcue and PawlowskyGlahn (Egozcue and PawlowskyGlahn, 2016), we represent the ILR coordinate (balance) associated with node $i$ in terms of the shifted composition $\mathit{y}\mathit{=}\mathit{x}\mathit{/}\mathit{p}=\left({x}_{1}/{p}_{1},\dots ,{x}_{D}/{p}_{D}\right)$ as
Here, ${g}_{p}\left({\mathit{y}}_{i}^{+}\right)$ and ${g}_{p}\left({\mathit{y}}_{i}^{}\right)$ represents the weighted geometric mean of the components of $y$ that represent tips in the $+1$ or $1$ clade descendant from node $i$ respectively. This weighted geometric mean is given by
where ${p}_{j}$ is the weight assigned to taxa $j$. The term $\sqrt{{n}_{i}^{+}{n}_{i}^{}/{n}_{i}^{+}+{n}_{i}^{}}$ in equation 1 is the scaling term that ensures that the ILR basis element has unit length and the terms ${n}_{i}^{\pm}$ are given by
Note that when $\mathit{p}=\left(1,\dots ,1\right),\text{}\mathit{y}\mathit{=}\mathit{x}$, Equation 1 represent the ILR transform as originally published (Egozcue et al., 2003), Equation 2 represents the standard formula for geometric mean of a vector $\mathit{y}$, and equation 3 represents the number of tips that descend from the $+1$ or $1$ clade descendant from node $i$. However, when $\mathit{p}\ne \left(1,\dots ,1\right)$, these three equations represent a more generalized form of the ILR transform that allows weights to be assigned to taxa in the transformed space (Egozcue and PawlowskyGlahn, 2016).
Following Egozcue and PawlowskyGlahn (Egozcue and PawlowskyGlahn, 2016), we also note that the form of the generalized ILR (which we will denote $\mathrm{i}\mathrm{l}{\mathrm{r}}_{\mathrm{p}}$) transform can be rewritten in terms of a generalized CLR transform (which we will denote $\mathrm{c}\mathrm{l}{\mathrm{r}}_{\mathrm{p}}$). This formulation in terms of the generalized CLR transform can be more efficient to compute and allows the inverse of the transform to be easily described. We can define the generalized CLR transform as
The generalized ILR transform can then be written as
with the $ij$th element of the matrix $\mathrm{\Psi}$ given by
With these components defined the inverse of generalized ILR transform can be written as $\mathcal{\mathcal{C}}\left[\mathit{y}\right]={\mathrm{i}\mathrm{l}{\mathrm{r}}_{\mathrm{p}}}^{1}\left({\mathit{y}}^{\ast}\right)=\mathcal{\mathcal{C}}\left[\mathrm{exp}\left({\mathit{y}}^{\ast}\mathrm{\Psi}\right)\right]$ and $x=\text{}\mathcal{\mathcal{C}}\left[{\mathrm{i}\mathrm{l}{\mathrm{r}}_{\mathrm{p}}}^{1}\left({\mathit{y}}^{\ast}\right)\mathit{p}\right]$.
Soft thresholding through weighting taxa
We make use of this generalized ILR transform to down weight the influence of taxa with many zero and nearzero counts since these are less reliable and therefore more variable (Good, 1956). Our choice of taxa weights is a heuristic that combines two terms multiplicatively: a measure of the central tendency of counts, such as the mean or median of the raw counts for a taxon across the $N$ samples in a dataset; and, the norm of the vector of relative abundances of a taxon across the $N$ samples in a dataset. We add this vector norm term to weight taxa by their sitespecificity. Preliminary studies showed that the geometric mean of the counts (with a pseudocount added to avoid skew from zero values) outperformed both the arithmetic mean and median as a measure of central tendency for the counts (data not shown). Additionally, while both the Euclidean norm and the Aitchison norm improved preliminary benchmark performance compared to using the geometric mean alone, in one case (classification using support vector machine on the global patterns dataset), the Euclidean norm greatly outperformed the Aitchison norm (Supplementary file 1). Therefore, our chosen taxa weighting scheme uses the geometric mean times the Euclidean norm:
Note that we add the subscript $j$ to the righthand side of the above equation to emphasize that this is calculated with respect to a single taxon across the $N$ samples in a dataset. As intended, this scheme tended to assign smaller weights to taxa in our benchmarks with more zero and nearzero counts (Figure 2—figure supplement 1). Despite their heuristic nature, we found that our chosen weights provide performance improvements over alternative weights (or the lack thereof) as measured by our benchmark tasks (Supplementary file 1).
Our taxa weighting scheme supplements the use of pseudocounts and represents a softthreshold on low abundance taxa. More generally, these taxa weights represent a form of prior information regarding the importance of each taxon. We note that if prior biological information suggests allowing specific taxa to influence the PhILR transform more (or less) strongly, such a weighting could be achieved for taxon $j$ by increasing (or decreasing) $p}_{j$.
Incorporating branch lengths
Beyond utilizing the connectivity of the phylogenetic tree to dictate the partitioning scheme for ILR balances, branch length information can be embedded into the transformed space by linearly scaling ILR balances (${y}_{i}^{*}$) by the distance between neighboring clades. We call this scaling by phylogenetic distance ‘branch length weighting’. Specifically, for each coordinate ${y}_{i}^{*}$, corresponding to node $i$ we use the transform
where ${d}_{i}^{\pm}$ represent the branch lengths of the two direct children of node $i$. When $f\left({d}_{i}^{+},{d}_{i}^{}\right)=1$, the coordinates are not weighted by branch lengths. The form of this transform was chosen so that the weights ${d}_{i}^{\pm}$, only influence the corresponding coordinate (${y}_{i}^{*,blw}$).
We also investigated the effect of using $f\left({d}_{i}^{+},{d}_{i}^{}\right)=1$, $f\left({d}_{i}^{+},{d}_{i}^{}\right)={d}_{i}^{+}+{d}_{i}^{}$, and based on the results of Chen et al. (2012), $f\left({d}_{i}^{+},{d}_{i}^{}\right)=\sqrt{{d}_{i}^{+}+{d}_{i}^{}}$ on benchmark performance. When coupled with the taxa weights specified above, the square root of the summed distances had the highest rank in 9 of the 12 supervised classification tasks and 2 of the three distance based tasks (Supplementary file 1). Based on these results, except for our analysis of balance variance versus phylogenetic depth (see below), the square root of the summed distances was used throughout our analyses.
Implementation
The PhILR transform, as well as the incorporation of branch length and taxa weightings has been implemented in the R programing language as the package philr available at https://bioconductor.org/packages/philr/.
Datasets and preprocessing
All data preprocessing was done in the R programming language using the phyloseq package for analysis of microbiome census data (McMurdie and Holmes, 2013) as well as the ape (Paradis et al., 2004) and phangorn (Schliep, 2011) packages for analysis of phylogenetic trees.
Data acquisition
We chose to use previously published OTU tables, taxonomic classifications, and phylogenies as the starting point for our analyses. The Human Microbiome Project (HMP) dataset was obtained from the QIIME Community Profiling Pipeline applied to highquality reads from the v35 region, available at http://hmpdacc.org/HMQCP/. The Global Patterns dataset was originally published in Caporaso, et al. (Caporaso et al., 2011) and is provided with the phyloseq R package (McMurdie and Holmes, 2013). The Costello Skin Sites dataset (CSS) is a subset of the dataset collected by Costello et al. (Costello et al., 2009) featuring only the samples from skin sites. This skin subset was introduced as a benchmark for supervised machine learning by Knights et al. (Knights et al., 2011) and can be obtained from http://www.knightslab.org/data.
OTU table preprocessing
To accord with general practice, we performed a minimal level of OTU table filtering for all datasets used in benchmarks and analyses. Due to differences in sequencing depth, sequencing methodology, and the number and diversity of samples between datasets, filtering thresholds were set independently for each dataset. For the HMP dataset, we initially removed samples with fewer than 1000 counts to mimic prior analyses (Human Microbiome Project Consortium, 2012). We additionally removed OTUs that were not seen with more than three counts in at least 1% of samples. Preprocessing of the Global Patterns OTU table followed the methods outlined in McMurdie and Holmes (McMurdie and Holmes, 2013). Specifically, OTUs that were not seen with more than three counts in at least 20% of samples were removed, the sequencing depth of each sample was standardized to the abundance of the median sampling depth, and finally OTUs with a coefficient of variation ≤3.0 were removed. The CSS dataset had lower sequencing depth than the other two datasets; we chose to filter OTUs that were not seen with greater than 10 counts across the skin samples. The PhILR transform, and more generally our benchmarking results in Figure 2b and C, were robust to varying our filtering strategies (Supplementary file 2).
Preprocessing phylogenies
For each dataset, the phylogeny was pruned to include only those taxa remaining after OTU table preprocessing. Except for the Global Patterns dataset, which was already rooted, we chose to root phylogenies by manually specifying an outgroup. For the HMP dataset the phylum Euryarchaeota was chosen as an outgroup. For the CSS dataset, the tree was rooted with OTU 12871 (from phylum Plantomycetes) as the outgroup. For all three phylogenies, any multichotomies were resolved with the function multi2di from the ape package which replaces multichotomies with a series of dichotomies with one (or several) branch(es) of length zero.
Zero replacement and normalization
A pseudocount of 1 was added prior to PhILR transformation to avoid taking logratios with zero counts. We found that our benchmarking results were robust to changing the value of this pseudocount from 1 to 2, 3, or 10 (Supplementary file 1).
Grouping sampling sites
To simplify subsequent analyses, HMP samples from the left and right retroauricular crease and samples from the left and right antecubital fossa were grouped together, respectively, as preliminary PERMANOVA analysis suggested that these sites were indistinguishable (data not shown).
Benchmarking
Distance/dissimilarity based analysis
Distance between samples in PhILR transformed space was calculated using Euclidean distance. All other distance measures were calculated using phyloseq on the preprocessed data without adding a pseudocount. Principle coordinate analysis was performed for visualization using phyloseq. PERMANOVA was performed using the function adonis from the R package vegan (v2.3.4). The R^{2} value from the fitted model was taken as a performance metric. Standard errors were calculated using bootstrap resampling with 100 samples each. Differences between the performance of Euclidean distance in PhILR transformed space and that of each other distance or dissimilarity measure on a given task was tested using twosided ttests and multiple hypothesis testing was accounted for using FDR correction.
Supervised classification
The performance, as measured by classification accuracy, of PhILR transformed data was compared against data preprocessed using one of two standard strategies for normalizing sequencing depth: the preprocessed data was transformed to relative abundances (e.g., each sample was normalized to a constant sum of 1; raw); or, a pseudocount of 1 was added, the data was transformed to relative abundances, and finally the relative abundances were logtransformed (log).
All supervised learning was implemented in Python using the following libraries: Scikitlearn (v0.17.1), numpy (v1.11.0) and pandas (v0.17.1). Four classifiers were used: penalized logistic regression, support vector classification with RBF kernel, random forest classification, and knearestneighbors classification. Each classification task was evaluated using the mean and variance of the test accuracy over 10 randomized test/train (30/70) splits which preserved the percentage of samples from each class at each split. For each classifier, for each split, the following parameters were set using crossvalidation on the training set. Logistic regression and Support Vector Classification: the ‘C’ parameter was allowed to vary between ${10}^{3}$ to ${10}^{3}$ and multiclass classification was handled with a onevsall loss. In addition, for logistic regression the penalty was allowed to be either ${l}_{1}$ or ${l}_{2}$. Knearestneighbors classification: the ‘weights’ argument was set to ‘distance’. Random forest classification: each forest contained 30 trees and the ‘max_features’ argument was allowed to vary between 0.1 and 1. All other parameters were set to default values. Due to the small size of the Global Patterns dataset, the supervised classification task was simplified to distinguishing human vs. nonhuman samples. Differences between each method's accuracy in each task was tested using twosided ttests and multiple hypothesis testing was accounted for using FDR correction.
Identifying balances that distinguish sites
To identify a sparse set of balances that distinguish sampling sites while accounting for the dependencies between nested balances, we fit a multinomial regression model with a grouped l_{1} penalty using the R package glmnet (v2.0.5). The penalization term lambda was set by visually inspecting model outputs for clear body site separation (lambda = 0.1198). This resulted in 35 balances with nonzero regression coefficients. Phylogenetic tree visualization was done using the R package ggtree (Yu et al., 2017).
Variance and depth
To reduce the likelihood that our analysis of balance variance and phylogenetic depth was affected by statistical artifact, we modified our PhILR transform in several ways. First, we omitted branch length weights (i.e., we set $f\left({d}_{i}^{+},{d}_{i}^{}\right)=1$) as these may vary nonrandomly as a function of phylogenetic depth. Second, we also anticipated that any zero replacement method would likely lead to lower variance measurements, which could have greater effects on balances closer to the tips of the tree. We therefore omitted taxa weights and zero replacement; we instead used stricter hard filtering thresholds and calculated balance values based on nonzero counts. In practice, we used the following filtering thresholds for each body site, taxa present in less than 20% of samples from that site were excluded and subsequently samples that had less than 50 total counts were excluded. To calculate balance values based on nonzero counts we retained balances that met the following criteria: the term ${g}_{p}\left({\mathit{y}}_{i}^{+}\right)/{g}_{p}\left({\mathit{y}}_{i}^{}\right)$ had nonzero counts from some taxa within the subcomposition $\mathit{y}}_{i}^{+$ (formed by the taxa that descend from the $+1$ clade of node $i$) and some other taxa within the subcomposition $\mathit{y}}_{i}^{$ (formed by the taxa that descend from the $1$ clade of node $i$) in at least 40 samples from that body site. We believe these two modifications to PhILR resulted in a more conservative analysis of balance variance versus phylogenetic depth but are likely not optimal in other situations.
To investigate the overall relationship between balance variance and phylogenetic depth we used linear regression. A balance’s depth in the tree was calculated as its mean phylogenetic distance to its descendant tips $\left(d\right)$. For a given body site the following model was fit:
where $d$ represents mean distance from a balance to its descendant tips. We then set out to test the null hypothesis that $\beta =0$, or that the variance of the logratio between two clades was invariant to the distance of the two clades from their most recent common ancestor. For each site, a null distribution for $\beta $ was constructed by permutations of the tip labels of the phylogenetic tree. For each permutation of the labels, the resultant tree was used to transform the data and $\beta $ was estimated. We chose this permutation scheme to ensure that the increasing variance we saw with increasing proximity of a balance to the root was not because deeper balances had more descendant tips, an artifact of variance scaling with mean abundance, or due to bias introduced due to our handling of zeros. Furthermore, for each body site, we found the null distribution for $\beta $ was symmetric about $\beta =0$ which further supports that balance variance depends on phylogenetic depth through a biological mechanism and not through a statistical artifact (Figure 4—figure supplement 3). Two tailed pvalues were calculated for $\beta $ based on 20000 samples from each site’s respective null distribution. FDR correction was applied to account for multiple hypothesis testing between body sites.
To visualize local trends in the relationship between balance variance and phylogenetic depth, a LOESS regression was fit independently for each body site. This was done using the function geom_smooth from the R package ggplot2 (v2.1.0) with default parameters.
The data and code needed to reproduce our analysis of balance variance versus phylogenetic depth is provided in Figure 4—source data 1 and Figure 4—source code 1 respectively.
Integrating taxonomic information
Taxonomy was assigned to OTUs in the HMP dataset using the assign_taxonomy.py script from Qiime (v1.9.1) to call uclust (v1.2.22) with default parameters using the representative OTU sequences obtained as described above. Taxonomic identifiers were assigned to the two descendant clades of a given balance separately using a simple voting scheme and combined into a single name for that balance. The voting scheme occurs as follows: (1) for a given clade, the entire taxonomy table was subset to only contain the OTUs that were present in that clade (2) starting at the finest taxonomic rank the subset taxonomy table was checked to see if any species identifier represented ≥95% of the table entries at that taxonomic rank, if so that identifier was taken as the taxonomic label for the clade (3) if no consensus identifier was found, the table was checked at the next mostspecific taxonomic rank.
Median phylogenetic depths for each taxonomic rank were estimated by first decorating a phylogenetic tree with taxonomy information using tax2tree (v1.0) (McDonald et al., 2012). For a given taxonomic rank the mean distance to tips was calculated for each internal node possessing a label that ended in that rank. The median of these distances was used to display an estimate of the phylogenetic depth of that given rank. This calculation of median phylogenetic depth of different taxonomic ranks was done separately for each body site.
The data and code needed to reproduce the taxonomic assignment and estimation of median phylogenetic depths for each taxonomic rank is included in Figure 4—source data 1 and Figure 4—source code 1 respectively.
References

1
Defining the normal bacterial flora of the oral cavityJournal of Clinical Microbiology 43:5721–5732.https://doi.org/10.1128/JCM.43.11.57215732.2005
 2
 3

4
A Short History of Compositional Data AnalysisIn: V PawlowskyGlahn, A Buccianti, editors. Compositional Data Analysis. Hoboken, NJ: John Wiley & Sons, Ltd. pp. 1–11.

5
A logistic normal mixture model for compositional data allowing essential zerosAustrian Journal of Statistics 45:3–23.https://doi.org/10.17713/ajs.v45i4.117

6
Statistical interpretation of species compositionJournal of the American Statistical Association 96:1205–1214.https://doi.org/10.1198/016214501753381850

7
What are the consequences of the disappearing human Microbiota?Nature Reviews Microbiology 7:887–894.https://doi.org/10.1038/nrmicro2245

8
Agerelated decrease in TCR repertoire diversity measured with deep and normalized sequence profilingThe Journal of Immunology 192:2689–2698.https://doi.org/10.4049/jimmunol.1302064
 9
 10

11
Variable selection for sparse Dirichletmultinomial regression with an application to microbiome data analysisThe Annals of Applied Statistics 7:418–442.https://doi.org/10.1214/12AOAS592
 12
 13
 14

15
Isometric logratio transformations for compositional data analysisMathematical Geology 35:279–300.https://doi.org/10.1023/A:1023818214614

16
Groups of parts and their balances in compositional data analysisMathematical Geology 37:795–828.https://doi.org/10.1007/s1100400573819

17
Changing the reference measure in the simplex and its weighting effectsAustrian Journal of Statistics 45:25–44.https://doi.org/10.17713/ajs.v45i4.126
 18

19
Microbial cooccurrence relationships in the human microbiomePLoS Computational Biology 8:e1002606.https://doi.org/10.1371/journal.pcbi.1002606
 20
 21

22
Inferring correlation networks from genomic survey dataPLoS Computational Biology 8:e1002687.https://doi.org/10.1371/journal.pcbi.1002687

23
Pacific Symposium on Biocomputing213, Comparisons of distance methods for combining covariates and abundances in microbiome studies, Pacific Symposium on Biocomputing, NIH Public Access.

24
Compositional uncertainty should not be ignored in highthroughput sequencing data analysisAustrian Journal of Statistics 45:73.https://doi.org/10.17713/ajs.v45i4.122

25
It's all relative: analyzing microbiome data as compositionsAnnals of Epidemiology 26:322–329.https://doi.org/10.1016/j.annepidem.2016.03.003

26
On the estimation of small frequencies in ContingencyTablesJournal of the Royal Statistical Society Series BStatistical Methodology 18:113–124.
 27
 28
 29
 30

31
16s rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfallsJournal of Clinical Microbiology 45:2761–2764.https://doi.org/10.1128/JCM.0122807

32
Supervised classification of human MicrobiotaFEMS Microbiology Reviews 35:343–359.https://doi.org/10.1111/j.15746976.2010.00251.x

33
Sparse and compositionally robust inference of microbial ecological networksPLOS Computational Biology 11:e1004226.https://doi.org/10.1371/journal.pcbi.1004226
 34

35
Helminth colonization is associated with increased diversity of the gut MicrobiotaPLoS Neglected Tropical Diseases 8:e2880.https://doi.org/10.1371/journal.pntd.0002880
 36

37
Evolution of mammals and their gut microbesScience 320:1647–1651.https://doi.org/10.1126/science.1155725
 38

39
Microbiome, metagenomics, and highdimensional compositional data analysisAnnual Review of Statistics and Its Application 2:73–94.https://doi.org/10.1146/annurevstatistics010814020351
 40

41
Proportions, percentages, ppm: do the molecular biosciences treat compositional data rightIn: V PawlowskyGlahn, A Buccianti, editors. Compositional Data Analysis: Theory and Applications. Hoboken, NJ: John Wiley & Sons, Ltd. pp. 193–207.

42
Proportionality: a valid alternative to correlation for relative dataPLOS Computational Biology 11:e1004075.https://doi.org/10.1371/journal.pcbi.1004075

43
UniFrac: a new phylogenetic method for comparing microbial communitiesApplied and Environmental Microbiology 71:8228–8235.https://doi.org/10.1128/AEM.71.12.82288235.2005

44
Distribution of selected bacterial species on intraoral surfacesJournal of Clinical Periodontology 30:644–654.https://doi.org/10.1034/j.1600051X.2003.00376.x

45
Analysis of composition of microbiomes: a novel method for studying microbial compositionMicrobial Ecology in Health & Disease 26:27663.https://doi.org/10.3402/mehd.v26.27663

46
Dealing with zerosIn: V PawlowskyGlahn, A Buccianti, editors. Compositional Data Analysis: Theory and Applications. Hoboken, NJ: John Wiley & Sons, Ltd. pp. 43–58.

47
Bayesianmultiplicative treatment of count zeros in compositional data setsStatistical Modelling 15:134–158.https://doi.org/10.1177/1471082X14535524
 48

49
Phylogenetics and the human microbiomeSystematic Biology 64:e26–e41.https://doi.org/10.1093/sysbio/syu053
 50
 51

52
Waste not, want not: why rarefying microbiome data is inadmissiblePLoS Computational Biology 10:e1003531.https://doi.org/10.1371/journal.pcbi.1003531
 53
 54
 55

56
Differential abundance analysis for microbial markergene surveysNature Methods 10:1200–1202.https://doi.org/10.1038/nmeth.2658

57
Compositional Data Analysis: Theory and ApplicationsHoboken, NJ: John Wiley & Sons, Ltd.
 58

59
Exploring compositional data with the CoDaDendogramAustrian Journal of Statistics 40:103–113.

60
Analysis of a data matrix and a graph: metagenomic data and the phylogenetic treeThe Annals of Applied Statistics 5:2326–2358.https://doi.org/10.1214/10AOAS402
 61
 62

63
Phangorn: phylogenetic analysis in RBioinformatics 27:592–593.https://doi.org/10.1093/bioinformatics/btq706
 64

65
Compositional data analysis of the microbiome: fundamentals, tools, and challengesAnnals of Epidemiology 26:330–335.https://doi.org/10.1016/j.annepidem.2016.03.002
 66
 67

68
Where next for microbiome research?PLOS Biology 13:e1002050.https://doi.org/10.1371/journal.pbio.1002050
 69
 70
 71
 72

73
Ribosomal RNA diversity predicts genome diversity in gut bacteria and their relativesNucleic Acids Research 38:3869–3879.https://doi.org/10.1093/nar/gkq066
Decision letter

Anthony FodorReviewing Editor; University of North Carolina at Charlotte
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "A phylogenetic transform enhances analysis of compositional microbiota data" for consideration by eLife. Your article has been favorably evaluated by Wendy Garrett (Senior Editor) and three reviewers, one of whom, Anthony Fodor, is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal their identity: Juan José Egozcue (Reviewer #2). All three reviewers found considerable merit in the manuscript praising the originality of the idea, the clarity of the presentation and the effectiveness of the visualizations.
For your reference, I have attached the three original reviews below. Per the eLife policy, you are not required to respond to each individual point raised by each reviewer. Below you will find a summary of the reviews reflecting the consensus of the reviewers of required revisions.
Summary:
It has long been recognized that compositional data is best analyzed in ratio space. For 16S rRNA sequence datasets and other genomic datasets, this leads to the rather difficult question of what ratio to take. In this interesting manuscript, Silverman et al. suggest taking ratios on a bifurcating phylogenetic tree. In their approach, the dependent variable for each node of the tree is a "balance", which is a normalized ratio of all the nodes beneath each node (with the ratio of all the left descendant nodes over all the right descendant nodes). The paper explores the consequences of this normalization showing reasonable performance when used for classification or machine learning. The authors provide an implementation in R and this approach may be of utility for analysts looking for a novel way to analyze microbial data in ratio space.
Essential revisions:
1) There was consensus among the reviewers that the normalization scheme seemed a bit adhoc and poorly justified. We suggest that it is made clear to the readers that the choice of weights used in the manuscript, both on the taxa and on the tips, are preliminary choices and further work will be required to ultimately determine the best weighting scheme. In addition, some further clarification on how sparsity was treated would strengthen the manuscript.
2) None of the reviewers were entirely satisfied with the demonstration that the proposed transformation was superior to other transformations. It is unclear what it means, for example, that the PhILR technique sometimes produces a stronger clustering signal and sometimes a weaker clustering signal than weighted UNIFRAC (Figure 2B). There was some discussion among the reviewers about whether a comparison to random binary partitions would be appropriate since this could reflect your null model (that phylogenetic binary partition is superior). Reviewer #2 was not enthusiastic about this idea and argued that a change on a tree does not alter (Aitchison) distances between data points. Therefore, results would likely not change no matter which tree has been selected for defining ILR coordinates.
In the end, the consensus of the reviewers was that exploring such null models might be of interest, but would not be appropriate as a requirement for revision. Nonetheless, reviewer #1 would have been interested to see if machine learning algorithms in fact performed better on the phylogenetic partition in comparison to random partitions. We leave it up to the authors as to whether they feel such an inclusion would improve their paper or be worth the effort. However, all reviewers felt that paper would be improved by more discussion of how to tell when one normalization procedure is superior to another. It seemed to reviewer #1 that it might be adequate to say that the proposed transformation appears to be doing no harm relative to weighted Unifrac but was providing some compositional safeguards. Reviewer #2 argued that the main benefit of the proposed procedure was in interpretability in yielding a phylogenetically aware transformation that could be utilized for teststatistics.
3) There was general consensus among the reviewers that Figure 4 was unconvincing in that technical (nonbiological) features of the algorithm could explain the results more easily than exclusionary dynamics. The authors could likely remove Figure 4 without damaging the paper, but if they choose to keep it, they should respond to the concerns below from reviewers #1 and #3 as both reviewers felt that simple lack of association between taxa and stochastic variation in sampling could explain many of the observations in Figure 4.
4) There was consensus among the reviewers that the authors could provide more guidance to the nonspecialists about when this algorithm should be used. As reviewer #3 put it, when and why should a reader choose the PhILR approach and when should it be avoided.
5) The authors should be explicit about what filtering and other preprocessing steps were used and whether those choices substantially impacted the performance of their algorithm.
Reviewer #1:
It has long been recognized that compositional data is best analyzed in ratio space. For 16S rRNA sequence datasets and other genomic datasets, this leads to the rather difficult question of what ratio to take. In this interesting manuscript, Silverman et al. suggest taking ratios on a bifurcating phylogenetic tree. In their approach, the dependent variable for each node of the tree is a "balance", which is a normalized ratio of all the nodes beneath each node (with the ratio of all the left descendant nodes over all the right descendant nodes). This is a clever idea and the authors make a reasonably compelling case that this is a useful contribution. While the evidence from the machine learning and clustering portions of the paper provide only modest evidence that this transformation is substantially better than other transformations, the transformation clearly is not making things much worse. As such, the proposed method will be of interest to researchers interested in how best to approach the challenging problem of compositional analyses of microbial datasets.
Suggestions for improvement follows:
1) The mathematical description of the algorithm could be improved in terms of its clarity. For example, it is not always clear how the pluses and minuses in the second equation of the subsection “The ILR Transform” are defined. It is not made immediately clear that the leftmost term (sqrt(ni+ * ni / (ni+ + ni)) in that equation is a correction for branch length. For the equation in the subsection “Addressing sparsity through weighting taxa”, please use parentheses to emphasize that the formula is sqrt(sum(xj)) * ((sum(cj 1…N)) ^ 1/N and not sqrt(sum(xj))^N * sqrt(sum(cj 1…N))
2) A limitation of working in ratio space in 16S data is what to do with all of the zeros in the spreadsheet (the "sparse data" problem). There is nothing inherent to the balancing algorithm that can deal with this problem (the sum in the denominator of the balance obviously can't be zero). The authors therefore require some kind of pseudocount scheme and they have settled on one (given in the equation in subsection “Addressing sparsity through weighting taxa”) which is a function of the number of samples. If I am reading their formula correctly, the weight given to a taxa with 1 read in one sample and zero in all others would be very close to the sqrt(number of samples). The authors might consider giving this as an example and explicitly noting that their method has some similarity to adding the sqrt(number of samples) as a pseudocount.
3) I struggled a bit with the variance analysis which concludes the Results sections. In the second paragraph of the subsection 2Balance variance and microbiota assembly” it is argued that "When the variance of a balance between two clades approaches zero, the mean abundance of taxa in each of the two clades will be linearly related and thus exhibit shared dynamics across microbial habitats". While this is true (and must be true mathematically as when the variance approaches zero, the ratio of the two normalized abundances of the clades is constant, which means they will become perfectly linearly related with a positive slope), it's not clear to me that a high variance necessarily means "exclusionary dynamics across samples". I don't see how a high variance in this number indicated anything about how direct or indirect the interactions are within the tree. There are lots of reasons two clades could lack perfect linear correlation with a positive slope from direct competition to being completely unrelated biologically with abundances just shifting due to stochastic variation. Also, wouldn't variance have to be lower closer to the tips of the phylogenetic tree as the number of sequences is lower, so the pseudocount has more of an effect depressing overall variance? Doesn't that explain Figure 4DF? And if I am reading these graphs correctly, below the species level there are not enough datapoints to support modeling via localized regression one way or another. Although not central to the arguments of the paper, I found Figure 4 to be less convincing than other sections of the paper.
Reviewer #2:
The manuscript introduces two new powerful ideas. They are: (a)construction of the sequential binary partition used to define
ILR coordinates on the base of phylogenetic trees; (b) use therecently published balances in weighted compositions to account forphylogenetic distances between OTU's or taxa. Both ideas aredeveloped and applied successfully to benchmark data sets.
For the originality of main ideas applied to microbiome analysis,the manuscript deserves publication.
Reviewer #3:
The manuscript is, in general, written in a clear and logical manner with a minimum of technical jargon. Figures 1, 3 and 4 are outstanding examples of data visualization and demonstrate the utility of this approach over others extant in the field. Figure 5 seems an afterthought and could be combined with Figure 1, or made supplementary. The binary portion could also be completely described in the Methods without a figure.
Figure 2, and the associated descriptive text are somewhat underwhelming demonstrations of the utility of the approach, which, in theory, should clearly outperform all other methods if the compositional data approach is necessary. The use of boxplots probably does not aid the visual appeal.
That the ILR based distance can seriously underperform the weighted unifrac metric is puzzling and cause for concern, since PhILR becomes another metric to be 'mined' to achieve the desired outcome. Does the PhILR partitioning scheme outperform a simple random binary partition? this is crucial to know.
There needs to be serious guidance given to the reader as to when and why to choose the PhILR approach and when not to. The same can be said for the classification methods shown, where the PhILR generally (but not always) outperforms relative abundance, but not log relative abundance. I am left to wonder if data analysis choices are driving these results.
Finally, I am puzzled as to why the authors chose to take a rigorous variancebased approach (ILR) and analyze it using a distancebased metric. It is established in the CoDa literature that Mahalanobis distances are often more appropriate for clustering based approaches and for outlier detection (see for example: http://www.statistik.tuwien.ac.at/public/filz/papers/2010Compstat.pdf). Did the authors examine other distances?
I believe that Figure 4 is somewhat overinterpreted. Subsection “Balance variance and microbiota assembly”, last paragraph: There are many technical reasons such as inefficient clustering, etc. that closely related sequences (called bacteria here) could exhibit strong correlations. Reading Lovell (referenced in this manuscript), and the points shown in 4GL, indicate that an alternative explanation to that given in the text (see aforementioned subsection, end of second paragraph) is that a higher balance variance would be compatible with a model of indifference rather than competition since the vast majority of high variance ratio balances are likely to be noncorrelated balances (Lovell 2015). Such noncorrelated balances persist even out to the extreme tails and there is as yet no general principled approach to identify negative correlations in compositions.
Introduction, last paragraph: “the accuracy of distance.…” – this is a strong statement to make as the underlying standard of truth is assumed, not known from first principles.
Subsection “Addressing sparsity through weighting taxa”, first paragraph: I would argue that total read counts contain information on precision, not variance. It is hard to square this statement with the otherwise compositional approach used in the paper.
Subsection “Addressing sparsity through weighting taxa”, second paragraph: Why use a prior of one. The supplement that I found did not support the assertion that the prior chosen had little effect. This needs to be clarified.
Subsection “Human Microbiome Project (HMP)”: What is the effect of filtering and preprocessing on the method? As with many methods papers, the arbitrary choices here have the potential to become the accepted norm later without evidence.
Subsection “Supervised Classification”, first paragraph: Were the logtransformed data not scaled? if not, this is a problem since it is not a fair comparison.
https://doi.org/10.7554/eLife.21887.024Author response
Essential revisions:
1) There was consensus among the reviewers that the normalization scheme seemed a bit adhoc and poorly justified. We suggest that it is made clear to the readers that the choice of weights used in the manuscript, both on the taxa and on the tips, are preliminary choices and further work will be required to ultimately determine the best weighting scheme. In addition, some further clarification on how sparsity was treated would strengthen the manuscript.
We understand the reviewers’ concern regarding the discussion of our chosen weights (both branch length weights and taxa weights) and how sparsity was handled. We also recognize that these chosen weights are heuristics motivated in part on benchmarking results, and that further work is needed to determine optimal weightings in different situations. In response to these comments, we have expanded our discussion of these issues in the Results, Discussion and Methods sections (see subsection “Constructing the PhILR transform”; Discussion, sixth paragraph; subsections “Soft thresholding through weighting taxa”, “Incorporating branch lengths” and “Implementation”). In particular, we now state in the main text:
“Both the taxa weights and the branch lengths weights we introduce here may be viewed as preliminary heuristics; future work will likely yield additional weighting schemes, as well as knowledge for when a given weighting scheme should be matched to an analysis task.”
We have also included an additional supplementary figure (Figure 4—figure supplement 2) to provide readers with intuition for our chosen taxa weights behave in the face of data sparsity; namely, that these weights assign smaller influence to taxa with more zero and nearzero counts.
2) None of the reviewers were entirely satisfied with the demonstration that the proposed transformation was superior to other transformations. It is unclear what it means, for example, that the PhILR technique sometimes produces a stronger clustering signal and sometimes a weaker clustering signal than weighted UNIFRAC (Figure 2B). There was some discussion among the reviewers about whether a comparison to random binary partitions would be appropriate since this could reflect your null model (that phylogenetic binary partition is superior). Reviewer #2 was not enthusiastic about this idea and argued that a change on a tree does not alter (Aitchison) distances between data points. Therefore, results would likely not change no matter which tree has been selected for defining ILR coordinates.
In the end, the consensus of the reviewers was that exploring such null models might be of interest, but would not be appropriate as a requirement for revision. Nonetheless, reviewer #1 would have been interested to see if machine learning algorithms in fact performed better on the phylogenetic partition in comparison to random partitions. We leave it up to the authors as to whether they feel such an inclusion would improve their paper or be worth the effort. However, all reviewers felt that paper would be improved by more discussion of how to tell when one normalization procedure is superior to another. It seemed to reviewer #1 that it might be adequate to say that the proposed transformation appears to be doing no harm relative to weighted Unifrac but was providing some compositional safeguards. Reviewer #2 argued that the main benefit of the proposed procedure was in interpretability in yielding a phylogenetically aware transformation that could be utilized for teststatistics.
We recognize from the reviewers’ comments that our rationale for providing the benchmarking results could be improved. Our goal in Figure 2 was to demonstrate that common microbiota analyses could be easily performed on PhILR transformed data with performance on par with current practice. Importantly, we did not intend these benchmarks to justify a compositionally robust approach to microbiota data analysis; we believe that this need has been well established (17). Nor do we believe it essential to our manuscript that PhILR outperforms other transforms or that Euclidean distance measured on PhILR transformed data outperforms all other (nonCoDA) distance metrics. In fact, we find it extremely unlikely that PhILR would always outperform any other transformation or distance metric under all situations. For example, while Weighted UniFrac may not be grounded in compositional data analysis theory, it was specifically designed for analysis of mixed microbial communities and we therefore would expect it to preform quite well in these tasks. We do think though that the results of Figure 2B and 2C are promising and suggest that at least with respect to our chosen metrics, PhILR transformed data can improve the performance of select tools in a variety of situations.
To reflect the reviewers’ overall concerns and our rationale for benchmarking, we have reworded our description of our benchmarking tasks and results throughout the manuscript. For example, we now motivate our use of benchmarking as follows:
“[…] we underscore that we do not find it essential that PhILR demonstrates superior benchmark performance to motivate its use in microbiota analysis. We believe that the need for compositionally robust tools has already been well established (17) and intended these benchmarks to showcase the flexibility and utility of working with PhILR transformed data.”
Additionally, we have been more explicit regarding the metrics used to evaluate the performance of PhILR transformed data. For example, we now specify in the text that the R^{2} statistic from PERMANOVA is used in Figure 2B and classification accuracy is used in Figure 2C.
With regards to tests involving random binary partitions, we strongly agree with reviewer #2's comments. In the absence of branch length weights, all ILR transforms would be expected to preform equally well on our benchmark tasks. In addition, it is not our intention to show that PhILR exhibits superiority to other ILR transforms with regards to benchmark performance. Rather, we believe much of PhILR’s utility is that, by choosing the phylogeny as the partition, the resulting coordinates have biological meaning. Figure 3 and 4 demonstrate how such meaning enables analyses involving evolutionarily relevant balances. Still, because the transformed space can be easily scaled by the phylogenetic distances between taxa (our branch length weighting), we believe the most natural “null” comparison to include is between PhILR with and without branch length weights. We include such an analysis in Supplementary file 1, which demonstrates that, when coupled with our taxa weights, our chosen branch length weights had the highest rank in 9 of the 12 supervised classifications tasks and 2 of the three distance based tasks. We have updated our Discussion section to reflect the relation between the PhILR transform and other ILR transforms. These updated lines now include the statement:
“We also note that for some analyses, a phylogenybased ILR transform will not outperform an ILR transform built from another sequential binary partition. […] Instead, what distinguishes the PhILR transform from other ILR transforms is the interpretability of the transformed coordinates. Balances in PhILR space correspond to speciation events, which can be both a source and target for biological insight.”
3) There was general consensus among the reviewers that Figure 4 was unconvincing in that technical (nonbiological) features of the algorithm could explain the results more easily than exclusionary dynamics. The authors could likely remove Figure 4 without damaging the paper, but if they choose to keep it, they should respond to the concerns below from reviewers #1 and #3 as both reviewers felt that simple lack of association between taxa and stochastic variation in sampling could explain many of the observations in Figure 4.
Based on the reviewers’ astute comments, we have dramatically narrowed our ecological interpretation of Figure 4. Notably, because we now recognize that high balance variance can be due to either a lack of association or opposing dynamics between neighboring clades, we no longer draw conclusions regarding the importance of competitive exclusion in structuring microbial communities. We have chosen to instead restrict our discussion of Figure 4 to the separate conclusion that more phylogenetically similar clades are more likely to covary within human body sites.
In addition, to address reviewers’ concerns about technical sources of bias, we have also added a new paragraph in to the Results section which discusses the steps we took to avoid artifact in our analysis (subsection “Balance variance and microbiota assembly”, second paragraph). (We believe this new paragraph also has the benefit of reducing confusion regarding how our zero handling and the use of weights differs from prior sections.) We believe one of the steps to avoid bias that we now detail, which is a permutationbased hypothesis test, accounts for many of the potential technical artifacts that the reviewers alluded to. Histograms from the null distributions for each body site (Figure 4—figure supplement 3) are symmetric about zero, suggesting that any bias introduced by our handing of zero values or stochastic sampling was likely minimal. Still, we do acknowledge that we cannot completely exclude the possibility that inefficient clustering of sequences into OTUs could add to the observed pattern between balance variance and phylogenetic depth. But, due to our observation that this relation appears to be stronger above the species level than below, we believe that inefficient clustering alone could not explain the pattern we observed. We have added a discussion of this issue to our Discussion section (fifth paragraph).
Ultimately, we believe the variance analyses described in this section to be technically novel and uniquely enabled by PhILR. Even if future work were to determine our ecological conclusions dubious, we believe it a service to the community to highlight new analytical approaches that could eventually prove useful when interpreted in some other manner. Given this novelty, as well as the measures we took to avoid artifact, we therefore elected to retain Figure 4 (albeit in the setting of a more careful discussion). Still, if the reviewers still have strong reservations, we will gladly remove this section from our final manuscript.
4) There was consensus among the reviewers that the authors could provide more guidance to the nonspecialists about when this algorithm should be used. As reviewer #3 put it, when and why should a reader choose the PhILR approach and when should it be avoided.
We appreciate this suggestion and have expanded on when the PhILR transform should be used and its limitations in our Discussion (last paragraph): “We foresee the PhILR transform being used as a default transformation prior to many microbiota analyses, particularly if a phylogenetic perspective is desired. For example, the PhILR transform could be used in lieu of the conventional log transform, which is often the default choice in microbiota analysis but not robust to compositional effects.”
Based on comments of reviewer #3 we have also added a Discussion paragraph discussing inference in terms of the changes of the relative abundance of individual taxa rather than balance values (seventh paragraph). This paragraph points out how single taxon analyses can actually be carried out after the ILR transform, by applying an inverse transform to model results.
5) The authors should be explicit about what filtering and other preprocessing steps were used and whether those choices substantially impacted the performance of their algorithm.
We agree with reviewers that these steps are important components of our analysis. Based on reviewer comments we have entirely rewritten the section of our Methods entitled ‘Datasets and Preprocessing’ in an attempt to clarify the preprocessing steps taken. In addition, we have created a new supplementary figure (Figure 2—figure supplement 2) to visually illustrate how our taxon weighting scheme assigns weights to taxa based on data sparsity.
We have also designed and performed new benchmarking analyses showing how data filtering choices affect model output (Supplementary file 2). For this analysis we redid our benchmarking results from Figure 2B and 2C for the CSS dataset under 4 different OTU filtering strategies. Notably these results include the performance of the other distance / dissimilarity measures we compared and the Log transformed and Raw relative abundance data. The four filtering strategies were as follows:
1) Keep taxa seen with at least 10 counts across all samples (this was the settings used in Figure 2B and 2C);
2) Keep taxa that are seen with at least 2 counts in more than 1% of samples (settings used by McMurdie and Holmes (8));
3) Keep all taxa that represent at least 0.005% of total reads (settings suggested by Qiime pipeline (9));
4) Keep all taxa that are at least 0.1% abundant in any sample (settings used by Gloor and Reid (10)).
The resulting 4 versions of the CSS dataset varied in the number of OTUs remaining by 16fold (minimum of 366 and maximum of 6142). In spite of this wide variation, we found PhILR’s results to be robust (Supplementary file 2).
Reviewer #1:
[…] Suggestions for improvement follows:
1) The mathematical description of the algorithm could be improved in terms of its clarity. For example, it is not always clear how the pluses and minuses in the second equation of the subsection “The ILR Transform” are defined. It is not made immediately clear that the leftmost term (sqrt(ni+ * ni / (ni+ + ni)) in that equation is a correction for branch length. For the equation in the subsection “Addressing sparsity through weighting taxa”, please use parentheses to emphasize that the formula is sqrt(sum(xj)) * ((sum(cj 1…N)) ^ 1/N and not sqrt(sum(xj))^N * sqrt(sum(cj 1…N))
Based on reviewer comments we have made numerous revisions aimed at improving the clarity of the Methods section entitled ‘The ILR Transform’. We do note that the term (sqrt(ni+ * ni / (ni+ + ni)) is not a correction for branch length but is the scaling term that ensures the ILR basis elements have unit length. This term can be thought of as a scaling factor that ensures that balances with different numbers of descendant tips are still comparable on the same scale. We have attempted to clarify this point in our revised manuscript. We have also rearranged the equation for taxa weights in the section entitled ‘Soft thresholding through weighting taxa’ to avoid potential confusion pointed out by the reviewer.
2) A limitation of working in ratio space in 16S data is what to do with all of the zeros in the spreadsheet (the "sparse data" problem). There is nothing inherent to the balancing algorithm that can deal with this problem (the sum in the denominator of the balance obviously can't be zero). The authors therefore require some kind of pseudocount scheme and they have settled on one (given in the equation in subsection “Addressing sparsity through weighting taxa”) which is a function of the number of samples. If I am reading their formula correctly, the weight given to a taxa with 1 read in one sample and zero in all others would be very close to the sqrt(number of samples). The authors might consider giving this as an example and explicitly noting that their method has some similarity to adding the sqrt(number of samples) as a pseudocount.
We regret our lack of clarity regarding how sparsity was handled. Please note that the scheme referenced by the reviewer is our ‘taxa weighting’ strategy. This strategy is meant to be independent of how zeroreplacement is handled (in our case, we add a pseudo count of 1). We hope the reviewer will find the new description of our taxa weights and sparsity to be clearer (as referenced in response to Essential revision 1). We have also added a new figure (Figure 2—figure supplement 2), which illustrates the weights assigned by our ‘taxa weighting’ to individual taxa.
3) I struggled a bit with the variance analysis which concludes the Results sections. In the second paragraph of the subsection 2Balance variance and microbiota assembly” it is argued that "When the variance of a balance between two clades approaches zero, the mean abundance of taxa in each of the two clades will be linearly related and thus exhibit shared dynamics across microbial habitats". While this is true (and must be true mathematically as when the variance approaches zero, the ratio of the two normalized abundances of the clades is constant, which means they will become perfectly linearly related with a positive slope), it's not clear to me that a high variance necessarily means "exclusionary dynamics across samples". I don't see how a high variance in this number indicated anything about how direct or indirect the interactions are within the tree. There are lots of reasons two clades could lack perfect linear correlation with a positive slope from direct competition to being completely unrelated biologically with abundances just shifting due to stochastic variation. Also, wouldn't variance have to be lower closer to the tips of the phylogenetic tree as the number of sequences is lower, so the pseudocount has more of an effect depressing overall variance? Doesn't that explain Figure 4DF? And if I am reading these graphs correctly, below the species level there are not enough datapoints to support modeling via localized regression one way or another. Although not central to the arguments of the paper, I found Figure 4 to be less convincing than other sections of the paper.
We found these concerns helpful and valid, and we address them more fully in our response above to Essential revision 3. In short, based on these comments, we have dramatically narrowed our ecological interpretation of Figure 4 and more clearly described our multiple efforts to avoid potential artifacts of analysis. We also shared the concern that variance would be lower closer to the tips due to the heightened effects of pseudocounts; we therefore did not use pseudocounts for this analysis and instead conditioned our analysis on nonzero counts while more strictly excluded taxa with many zeros (to avoid analyses involving pseudocounts). We also incorporated a permutationbased hypothesis test, which should account for biases associated with count variation.
Reviewer #2:
The manuscript introduces two new powerful ideas. They are: (a)construction of the sequential binary partition used to define
ILR coordinates on the base of phylogenetic trees; (b) use therecently published balances in weighted compositions to account forphylogenetic distances between OTU's or taxa. Both ideas aredeveloped and applied successfully to benchmark data sets.
For the originality of main ideas applied to microbiome analysis,the manuscript deserves publication.
We thank the reviewer for these comments. On a minor note, we clarify that we do not use the weighted ILR transform as a means for introducing the phylogenetic distances between OTUs. Rather, we use a weighted transform to downweight the effects of taxa with many zero and nearzero counts. We refer to this as taxa weighting. To account for phylogenetic distances, we simply scale balances by the phylogenetic distance between descendant clades. We do acknowledge that there is a deeper connection between these two sets of weights; that is, there is a close relationship between linearly scaling an ILR balance and changing the reference measure of the simplex (see (11), equation 18 in Appendix B); but for simplicity, we have not highlighted this connection in this manuscript. We hope that the updated description of these weightings in both the Results and Methods sections will make this distinction clearer.
Reviewer #3:
The manuscript is, in general, written in a clear and logical manner with a minimum of technical jargon. Figures 1, 3 and 4 are outstanding examples of data visualization and demonstrate the utility of this approach over others extant in the field. Figure 5 seems an afterthought and could be combined with Figure 1, or made supplementary. The binary portion could also be completely described in the Methods without a figure.
We agree that Figure 5 is much simpler than Figures 1–4 and wished to make it a supplementary figure. However, formatting constraints would require making it supplementary to a parent figure. The most natural parent figure would be Figure 1, but we believe that Figure 5’s sign matrix representation of a binary tree would confuse readers at that point in the manuscript. For this reason, we have chosen to leave Figure 5 in its current place.
Figure 2, and the associated descriptive text are somewhat underwhelming demonstrations of the utility of the approach, which, in theory, should clearly outperform all other methods if the compositional data approach is necessary. The use of boxplots probably does not aid the visual appeal.
That the ILR based distance can seriously underperform the weighted unifrac metric is puzzling and cause for concern, since PhILR becomes another metric to be 'mined' to achieve the desired outcome.
As we describe more fully in our response to Essential revision 2 above, we do not agree that PhILR must always outperform other methods to motivate a compositional data approach to microbiome data. First, we argue that the need for compositionally robust tools for microbiome data analysis is well established (17). Second, we include these benchmarks to assuage any concerns that PhILR is somehow not compatible with desired downstream analyses, like ordination analysis or supervised learning. Third, we find it extremely unlikely that PhILR would always outperform any other method on all experiments. This is especially true when comparing to tools tested and specifically designed for microbiota analysis such as Weighted UniFrac.
Does the PhILR partitioning scheme outperform a simple random binary partition? this is crucial to know.
As we state above in Essential revision 2, we strongly agree with reviewer #2’s comments. In the absence of branch length weights, all ILR transforms should perform equally. Moreover, we believe that much of PhILR’s utility derives from casting the resulting coordinates in the evolutionary framework afforded by use of a phylogenetic tree.
There needs to be serious guidance given to the reader as to when and why to choose the PhILR approach and when not to. The same can be said for the classification methods shown, where the PhILR generally (but not always) outperforms relative abundance, but not log relative abundance. I am left to wonder if data analysis choices are driving these results.
We appreciate this guidance and have inserted a statement into the Discussion suggesting the use of PhILR as a default microbiota transformation, prior to further analyses, particularly when a phylogenetic perspective is desired (see Essential revision 4 for more details). Still, we do feel that an indepth discussion of classification methods is beyond the scope of this work. Such choices are ultimately data driven and the subject of extensive research in the Machine Learning community. Our rationale for choosing 4 separate classification algorithms (both parametric and nonparametric) was simply to demonstrate that a range of algorithms can be applied to PhILR transformed data.
Finally, I am puzzled as to why the authors chose to take a rigorous variancebased approach (ILR) and analyze it using a distancebased metric. It is established in the CoDa literature that Mahalanobis distances are often more appropriate for clustering based approaches and for outlier detection (see for example: http://www.statistik.tuwien.ac.at/public/filz/papers/2010Compstat.pdf). Did the authors examine other distances?
While the Mahalanobis distance is a very useful tool for CoDa (and data analysis in general), we believe its use is neither simple nor standard practice. We therefore feel it more appropriate to leave this as a potential improvement on the PhILR method for cases where outliers are a chief concern.
I believe that Figure 4 is somewhat overinterpreted. Subsection “Balance variance and microbiota assembly”, last paragraph: There are many technical reasons such as inefficient clustering, etc. that closely related sequences (called bacteria here) could exhibit strong correlations. Reading Lovell (referenced in this manuscript), and the points shown in 4GL, indicate that an alternative explanation to that given in the text (see aforementioned subsection, end of second paragraph) is that a higher balance variance would be compatible with a model of indifference rather than competition since the vast majority of high variance ratio balances are likely to be noncorrelated balances (Lovell 2015). Such noncorrelated balances persist even out to the extreme tails and there is as yet no general principled approach to identify negative correlations in compositions.
We thank the reviewer for this astute observation and have removed our ecological interpretation of higher balance variances. Please see our above response to Essential revision #3 for additional details.
Introduction, last paragraph: “the accuracy of distance…” – this is a strong statement to make as the underlying standard of truth is assumed, not known from first principles.
Thank you for pointing this out. We agree it was an overly strong statement and in our revised draft, we have softened it to read:
“the accuracy of supervised classification methods on our benchmark datasets was matched or improved with PhILR transformed data”
Subsection “Addressing sparsity through weighting taxa”, first paragraph: I would argue that total read counts contain information on precision, not variance. It is hard to square this statement with the otherwise compositional approach used in the paper.
We have updated our discussion of this issue throughout the paper to reference precision rather than variance. Nevertheless, although we are of course strong proponents of compositional approaches, we do believe that including uncertainty from count data could lead to exciting advances in the analysis of microbiota data, which we briefly describe in our revised Discussion (sixth paragraph).
Subsection “Addressing sparsity through weighting taxa”, second paragraph: Why use a prior of one. The supplement that I found did not support the assertion that the prior chosen had little effect. This needs to be clarified.
We regret the lack of clarity in this area. The equation referenced represents our strategy for how taxa weights were chosen. This strategy is meant to be independent of how zeroreplacement is handled (in our case, we add a pseudocount of 1). We acknowledge our choice of pseudocount is a heuristic, which was supported largely by our benchmarking results (Supplementary file 1) and does not come with any optimality guarantees. We did investigate how changing the pseudocount added to the table affected our benchmarking results. Changing this value from 1 to 2, 3, or 10 had minimal effect (also Supplementary file 1). We have tried to clarify our description of taxa weights as well as our handling of zero values in the revised manuscript (see our response to Essential revision 1, above).
Subsection “Human Microbiome Project (HMP)”: What is the effect of filtering and preprocessing on the method? As with many methods papers, the arbitrary choices here have the potential to become the accepted norm later without evidence.
We understand this concern and now tested the effects of alternate filtering and preprocessing schemes on our method. In short, the effects are minimal on PhILR. Please see our response to Essential revision 5 above for more detail.
Subsection “Supervised Classification”, first paragraph: Were the logtransformed data not scaled? if not, this is a problem since it is not a fair comparison.
To preface, we understand this comment to be asking whether the log transformed data were scaled in the sense of how scaling is often applied in machine learning; that is, each variable is often scaled to a unit normal distribution or to exist between 0 and 1. If indeed this is what the reviewer was asking, then no, the logtransformed data were not scaled. But, then again, according to this criterion, the PhILR transformed data were also not scaled.
Alternatively, if the reviewer is referring to the scaling of the ILR basis elements that is part of the definition of the ILR transform (this is the $\sqrt{{n}_{i}^{+}{n}_{i}^{}/{n}_{i}^{+}+{n}_{i}^{}}$ term in the transform), we are not aware of an obvious analogy between this and a potential scaling step for the log transform that is commonly used.
Bibliography:
1) Jackson DA (1997) Compositional data in community ecology: The paradigm or peril of proportions? Ecology 78(3):929940.
2) Friedman J & Alm EJ (2012) Inferring correlation networks from genomic survey data. PLoS Comput Biol 8(9):e1002687.
3) Aitchison J (1986) The statistical analysis of compositional data (Chapman and Hall, London; New York).
4) Lovell D, Müller W, Taylor J, Zwart A, & Helliwell C (2011) Proportions, percentages, ppm: do the molecular biosciences treat compositional data right. Compositional Data Analysis: Theory and Applications, eds PawlowskyGlahn V & Buccianti A (John Wiley & Sons, Ltd.), pp 193207.
5) Gloor GB, Macklaim JM, Vu M, & Fernandes AD (2016) Compositional uncertainty should not be ignored in highthroughput sequencing data analysis. Austrian Journal of Statistics 45(4):73.
6) Li HZ (2015) Microbiome, Metagenomics, and HighDimensional Compositional Data Analysis. Annual Review of Statistics and Its Application, Vol 2 2(1):7394.
7) Tsilimigras MC & Fodor AA (2016) Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Ann Epidemiol 26(5):330335.
8) McMurdie PJ & Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 8(4):e61217.
9) NavasMolina JA, et al. (2013) Advancing our understanding of the human microbiome using QIIME. Methods Enzymol 531:371444.
10) Gloor GB & Reid G (2016) Compositional analysis: a valid approach to analyze microbiome highthroughput sequencing data. Can J Microbiol 62(8):692703.
11) Egozcue JJ & PawlowskyGlahn V (2016) Changing the Reference Measure in the Simplex and its Weightings Effects. Austrian Journal of Statistics 45(4):2544.
https://doi.org/10.7554/eLife.21887.025Article and author information
Author details
Funding
National Science Foundation (IIS 1546331)
 Sayan Mukherjee
National Science Foundation (IIS 1418261)
 Sayan Mukherjee
Global Probiotics Council (Young Investigator Grant for Probiotics Research)
 Lawrence A David
Searle Scholars Program (15SSP184 Research Agreement)
 Lawrence A David
Alfred P. Sloan Foundation (BR2014003)
 Lawrence A David
Duke University (Medical Scientist Training Program GM007171)
 Justin D Silverman
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Rachel Silverman, Aspen Reese, Firas Midani, Heather Durand, Jesse Shapiro, Jonathan Friedman, Simon Levin, and Susan Holmes for their helpful comments, Dan Knights for providing us with the CSS dataset, as well as Klaus Schliep and Liam Revell for their insight into manipulation of phylogenetic trees in the R programming language. JS was supported in part by the Duke University Medical Scientist Training Program (GM007171).
Reviewing Editor
 Anthony Fodor, Reviewing Editor, University of North Carolina at Charlotte
Publication history
 Received: September 27, 2016
 Accepted: February 13, 2017
 Accepted Manuscript published: February 15, 2017 (version 1)
 Version of Record published: February 27, 2017 (version 2)
Copyright
© 2017, Silverman et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 3,692
 Page views

 891
 Downloads

 9
 Citations
Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.
Download links
Downloads (link to download the article as PDF)
Download citations (links to download the citations from this article in formats compatible with various reference manager tools)
Open citations (links to open the citations from this article in various online reference manager services)
Further reading

 Genetics and Genomics

 Genetics and Genomics
 Microbiology and Infectious Disease