A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets

  1. Lenore Pipes (corresponding author)
  2. Rasmus Nielsen
  1. Department of Integrative Biology, University of California, Berkeley, United States
  2. GLOBE Institute, University of Copenhagen, Denmark
7 figures and 1 additional file

Figures

Figure 1 with 2 supplements
Species assignment in alignment-based methods (A) vs. Tronko (B).

In Tronko, scores are calculated for all nodes in the tree based on the query’s global alignment to the best BWA-MEM hit. The query is assigned to the lowest common ancestor (LCA) of the highest scoring nodes within the cut-off threshold. See Figure 1—figure supplement 1 for more details regarding using multiple trees.
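
As a rough illustration of the assignment rule in this caption, the sketch below picks every node whose score lies within a cut-off of the best score and returns their lowest common ancestor. The child-to-parent dictionary, the toy node names and scores, the "higher is better" convention, and the function name `lca_of_top_nodes` are all assumptions made for the example; this is not Tronko's implementation.

```python
def lca_of_top_nodes(parent, scores, cutoff):
    """Return the LCA of all nodes scoring within `cutoff` of the best score.

    parent: dict mapping each node to its parent (the root maps to None).
    scores: dict mapping each node to a score (assumed: higher is better).
    """
    best = max(scores.values())
    selected = [node for node, s in scores.items() if best - s <= cutoff]

    def path_to_root(node):
        # Nodes from `node` up to the root, inclusive.
        path = [node]
        while parent[node] is not None:
            node = parent[node]
            path.append(node)
        return path

    first_path = path_to_root(selected[0])
    common = set(first_path)
    for node in selected[1:]:
        common &= set(path_to_root(node))

    # Walking upward from the first selected node, the first shared
    # ancestor encountered is the deepest one, i.e. the LCA.
    for node in first_path:
        if node in common:
            return node


# Hypothetical toy tree: two genera under one root.
parent = {
    "root": None,
    "genusA": "root", "genusB": "root",
    "sp1": "genusA", "sp2": "genusA", "sp3": "genusB",
}
# Hypothetical scores for one query.
scores = {"root": -30, "genusA": -8, "genusB": -25,
          "sp1": -2, "sp2": -5, "sp3": -28}

print(lca_of_top_nodes(parent, scores, 0))  # only sp1 qualifies
print(lca_of_top_nodes(parent, scores, 5))  # sp1 and sp2 qualify
```

With a cut-off of 0 only the single best node is kept, so the query is assigned at the species level; a larger cut-off admits more nodes and pushes the assignment toward the root, which is consistent with the trade-off between recall and misclassification shown in the later figures.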

Figure 1—figure supplement 1
Workflow of iterative partitioning procedure.

First, the multiple sequence alignments (MSAs) and corresponding phylogenetic trees are used as input into the algorithm. Then, the sum-of-pairs scores are calculated for each partition. If the sum-of-pairs score is below a heuristic threshold, the tree is used to partition the sequences into three partitions in the cluster based on the node with the smallest variance. Each of the three partitions is realigned and phylogenetic trees are estimated. The algorithm stops for a given partition when the sum-of-pairs score is greater than the heuristic threshold.
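
The stopping criterion above can be sketched with a simple sum-of-pairs score. The scoring values (match +1, mismatch -1, gap -1, gap-gap ignored), the threshold values, and the function names are assumptions chosen for illustration; the caption does not state the scoring scheme or the threshold actually used. The splitting, realignment, and tree estimation steps depend on external tools and are not shown.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for a, b in combinations(msa, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # assumed: gap-gap columns contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def needs_partitioning(msa, threshold):
    """True if the alignment scores below the heuristic threshold."""
    return sum_of_pairs(msa) < threshold


# Hypothetical toy alignment.
msa = ["ACGT", "ACGA", "AC-T"]
print(sum_of_pairs(msa))            # 4
print(needs_partitioning(msa, 5))   # True: would be split further
print(needs_partitioning(msa, 3))   # False: partition is kept as is
```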

Figure 1—figure supplement 2
Comparisons of Tronko with pplacer and APPLES-2 using a database of 200, 400, 600, 800, 1000, 1200, 1400, and 1600 reference sequences.

(A) Assignment rate against the number of reference sequences at the species level. (B) Running time against the number of reference sequences. (C) Peak memory in gigabytes against the number of reference sequences. All methods had a 100% true positive rate for all database sizes. Assignment rate is the number of reads assigned at the species level for each method. Reference sequences were chosen randomly from the cytochrome oxidase 1 (COI) reference database.

Figure 2 with 2 supplements
Recall vs. misclassification rates using leave-one-species-out analysis for the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line).

See Figure 2—figure supplement 2 for results using different combinations of aligners and tree estimation methods.

Figure 2—figure supplement 1
Recall vs. misclassification rates using leave-one-species-out analysis for the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line) and wavefront alignment (dashed line).
Figure 2—figure supplement 2
Recall vs. misclassification rates using leave-one-species-out analysis for the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) with paired-end 150 bp × 2 reads with 2% error/polymorphism using Tronko with cut-offs of 0, 5, 10, 15, and 20 and different combinations of tree estimation methods and aligners.

For tree estimation, we used RAxML and IQ-TREE2. For multiple sequence aligners, we used FAMSA and MAFFT. For global alignment methods, we used Needleman–Wunsch (NW) and wavefront alignment (WFA). This is the same dataset as used in Figure 2. Colors represent different combinations of methods.

Figure 3
Confusion matrices at the genus level of the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) using the leave-one-species-out analysis with paired-end 150 bp × 2 reads with 2% error/polymorphism using kraken2 (A), metaphlan2 (B), pplacer (C), APPLES-2 (D), MEGAN (E), and Tronko using the Needleman–Wunsch alignment (NW) for cut-offs 0 (F), 5 (G), 10 (H), 15 (I), and 20 (J).

Unassigned column contains both unassigned queries and queries assigned to a lower taxonomic level. Phylogenetic tree represents ancestral sequences at the genus level.
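
A minimal sketch of how such a matrix could be tallied, with an "Unassigned" column that absorbs both missing assignments and assignments that are not at the genus level. The function name, the input layout, and the toy labels are assumptions for illustration and do not come from the article's analysis code.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels, known_genera):
    """Count (true genus, predicted column) pairs.

    Any prediction that is not one of `known_genera` (None, or a label
    at another taxonomic level) is counted in the "Unassigned" column.
    """
    counts = Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        column = pred if pred in known_genera else "Unassigned"
        counts[(truth, column)] += 1
    return counts


# Hypothetical toy data: four queries from two genera.
genera = {"Larus", "Sterna"}
truth = ["Larus", "Larus", "Sterna", "Sterna"]
predicted = ["Larus", None, "Larus", "Laridae"]  # "Laridae" is a family

counts = confusion_counts(truth, predicted, genera)
for (row, column), n in sorted(counts.items()):
    print(row, column, n)
```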

Figure 4 with 2 supplements
Recall vs. misclassification rates using leave-one-individual-out analysis for the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line).
Figure 4—figure supplement 1
Confusion matrices at the species level of the order Charadriiformes using the leave-one-individual-out analysis with paired-end 150 bp × 2 reads with 2% error/polymorphism using kraken2 (A), metaphlan2 (B), pplacer (C), APPLES-2 (D), MEGAN (E), and Tronko using the Needleman–Wunsch alignment (NW) for cut-offs 0 (F), 5 (G), 10 (H), 15 (I), and 20 (J).

Unassigned column contains both unassigned queries and queries assigned to a lower taxonomic level. Phylogenetic tree represents ancestral sequences at the species level.

Figure 4—figure supplement 2
Recall vs. misclassification rates using leave-one-individual-out analysis for the order Charadriiformes (cytochrome oxidase 1 [COI] metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line) and wavefront alignment (dashed line).
Figure 5 with 2 supplements
Recall vs. misclassification rates using leave-one-species-out analysis for bacterial species (16S metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line).
Figure 5—figure supplement 1
Recall vs. misclassification rates using leave-one-species-out analysis for bacterial species (16S metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism, single-end 150 bp reads with 0% (D), 1% (E), and 2% (F) error/polymorphism, and single-end 300 bp reads with 0% (G), 1% (H), and 2% (I) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer, APPLES-2, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line) and wavefront alignment (dashed line).
Figure 5—figure supplement 2
Recall vs. misclassification rates using leave-one-individual-out analysis for bacterial species (16S metabarcode) with paired-end 150 bp × 2 reads with 0% (A), 1% (B), and 2% (C) error/polymorphism using kraken2, metaphlan2, MEGAN, pplacer + hmmer, pplacer + mafft, APPLES-2 + hmmer, APPLES-2 + mafft, and Tronko with cut-offs of 0, 5, 10, 15, and 20 using the Needleman–Wunsch alignment (solid line).
Figure 6 with 4 supplements
Recall vs. misclassification rates using mock communities from Schirmer et al., 2015 (A), Lluch et al., 2015 (B), Gohl et al., 2016 (C), and Braukmann et al., 2019 (D) using both Needleman–Wunsch and wavefront alignment algorithms.

Figures with smaller misclassification rates on the x-axis are available for Schirmer et al., 2015, Lluch et al., 2015, Gohl et al., 2016, and Braukmann et al., 2019 in Figure 6—figure supplements 1, 2, 3, and 4, respectively.

Figure 6—figure supplement 1
Close-up of Figure 6A.
Figure 6—figure supplement 2
Close-up of Figure 6B.
Figure 6—figure supplement 3
Close-up of Figure 6C.
Figure 6—figure supplement 4
Close-up of Figure 6D.
Figure 7
Comparisons of running time (A) and peak memory (B) using 100, 1000, 10,000, 100,000, and 1,000,000 queries for Tronko, blastn + MEGAN, kraken2, and metaphlan2 using the cytochrome oxidase 1 (COI) reference database.

NW: Needleman–Wunsch; WFA: wavefront alignment.

Cite this article

Lenore Pipes, Rasmus Nielsen (2024)
A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets
eLife 13:e85794.
https://doi.org/10.7554/eLife.85794