Computational and Systems Biology

Barcode-free multiplex plasmid sequencing using Bayesian analysis and nanopore sequencing

Masaaki Uematsu
Jeremy M Baskin

Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, United States
Department of Chemistry and Chemical Biology, Cornell University, Ithaca, United States

https://doi.org/10.7554/eLife.88794.2

Open access

Figures and data

Workflow for Simple Algorithm for Very Efficient Multiplexing of Oxford Nanopore Experiments for You (SAVEMONEY).
The algorithm consists of three steps: pre-survey, sample submission, and post-analysis. The pre-survey step identifies the optimal combination of plasmids that will permit suitable accuracy for the classification step of the post-analysis. Plasmids with divergent sequences are grouped together, and those with very similar sequences are classified into different groups. After sample submission and sequencing, the post-analysis component, which consists of three different steps, is performed to deconvolve the obtained results. Reads (query sequences) are first classified based on their similarity to the plasmid blueprint/map (reference sequence). Reads are then aligned against reference sequences. Finally, consensus sequences and quality scores are calculated based on base calling, quality score from each read, and reference sequence, using Bayesian analysis.
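The final step described above, Bayesian consensus calling, can be illustrated with a minimal sketch for a single aligned position. It is not SAVEMONEY's actual model: the function names, the uniform split of errors across the three other bases, and the `ref_prior` value are assumptions made for illustration only.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability.

    Capped at 0.75 so that a Q0 call is uninformative rather than impossible.
    """
    return min(10 ** (-q / 10), 0.75)

def consensus_posterior(calls, ref_base, ref_prior=0.9):
    """Posterior over the true base at one position.

    calls: list of (called_base, phred_quality) from reads covering the position.
    ref_prior: assumed prior probability that the reference base is correct.
    """
    log_post = {}
    for b in BASES:
        prior = ref_prior if b == ref_base else (1 - ref_prior) / 3
        lp = math.log(prior)
        for base, q in calls:
            p_err = phred_to_error(q)
            # Assumption: an erroneous call is equally likely to be any other base.
            lp += math.log(1 - p_err) if base == b else math.log(p_err / 3)
        log_post[b] = lp
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    weights = {b: math.exp(lp - m) for b, lp in log_post.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def consensus_call(calls, ref_base, ref_prior=0.9):
    """Return the most probable base and a Phred-style consensus quality."""
    post = consensus_posterior(calls, ref_base, ref_prior)
    best = max(post, key=post.get)
    p_wrong = max(1 - post[best], 1e-10)  # floor caps the reported quality at Q100
    return best, -10 * math.log10(p_wrong)
```

Working in log space keeps the product of many per-read likelihoods from underflowing when coverage is high.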

Examples of the pre-survey outputs.
Sequences of 14 different plasmids were analyzed with the indicated distance_threshold and number_of_groups values. The Levenshtein distance between each plasmid pair is displayed in heatmaps, which were subsequently used to generate the dendrograms displayed on the right side of each heatmap. The dotted red lines in the dendrograms represent the distance_threshold values, enabling visualization of the results of clustering, i.e., plasmids with distances less than the red lines were classified in the same cluster. Based on these clustering results, plasmids were classified into groups for sequencing submission, with groups displayed in red and blue (a), red, blue, and magenta (b), or red, blue, magenta, and cyan (c). Levenshtein distances between plasmids within each group are also emphasized by the colored frames, and plasmids classified into the same cluster are assigned to different groups. Note that P1-P14 here are different from the example plasmids used in Fig. 3 and Fig. 4.
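The pre-survey logic described in this legend can be sketched as follows. This is an illustrative approximation, not the published implementation: it clusters by single linkage with a union-find structure (the linkage method behind the dendrograms is not stated here), treats sequences as linear strings rather than circular plasmids, and uses hypothetical function names. Only `distance_threshold` and `number_of_groups` come from the legend.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def cluster_by_threshold(seqs, distance_threshold):
    """Put plasmids closer than distance_threshold into the same cluster.

    seqs: dict mapping plasmid name to sequence.
    Assumption: single linkage, implemented with union-find.
    """
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if levenshtein(seqs[a], seqs[b]) < distance_threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

def assign_groups(clusters, number_of_groups):
    """Spread each cluster across different submission groups."""
    groups = [[] for _ in range(number_of_groups)]
    for cluster in sorted(clusters, key=len, reverse=True):
        if len(cluster) > number_of_groups:
            raise ValueError("cluster larger than number_of_groups")
        # Fill the currently smallest groups first, one cluster member per group.
        order = sorted(range(number_of_groups), key=lambda g: len(groups[g]))
        for name, g in zip(cluster, order):
            groups[g].append(name)
    return groups
```

For example, `assign_groups(cluster_by_threshold(seqs, 3), 2)` places any two plasmids that differ by fewer than three edits into separate groups, and raises an error when a cluster has more members than there are groups.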

Example results after the classification step for a set of six moderately related plasmids.
(a) Results of the pre-survey for the plasmid set used in (b-c). (b) Read length and quality score distributions. (c) Scatter plots of normalized alignment scores. Normalized alignment scores were calculated for each read over all reference plasmids and displayed as scatter plots. Density plots of the normalized alignment scores are also displayed for each reference plasmid in the diagonal panels, which are the projections of each scatter plot onto the horizontal axis. The y-axis ranges of these diagonal panels are shared. The vertical and the horizontal positions of dashed lines correspond to score_threshold values. These data depict results of classification performed with a score_threshold value of 0.5. Note that P1-P6 here are different from the example plasmids used in Fig. 2 and Fig. 4.
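The classification rule implied by these plots can be sketched as below. The sketch takes already-normalized alignment scores as input, because the normalization itself is not defined in this legend. The function name, the use of `>=` at the threshold, and the handling of ties are assumptions.

```python
def classify_read(scores, score_threshold=0.5):
    """Assign a read to the reference plasmid with the uniquely highest score.

    scores: dict mapping plasmid name to the read's normalized alignment score.
    Returns the plasmid name, or None when the read is left unclassified
    (best score below the threshold, or a tie for the best score).
    """
    best = max(scores.values())
    winners = [name for name, s in scores.items() if s == best]
    if best < score_threshold or len(winners) != 1:
        return None
    return winners[0]
```

A read scoring 0.9 against P1 and 0.4 against P2 would be assigned to P1, whereas a read scoring 0.8 against both would remain unclassified even though it clears the threshold.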

Example results after the classification step for closely related plasmids.
(a-c) Results of the pre-survey against plasmid sets. (d-f) Scatter plots of normalized alignment scores for the plasmid pairs. The vertical and the horizontal positions of dashed lines correspond to score_threshold values. These data depict results of classification performed with a score_threshold value of 0.5. (g-i) Breakdowns of reads covering the regions where the plasmids differ in sequence. In the rotated heatmap at the bottom of (g), the axis labeled as “P1 match” represents the number of bases matching the P1 sequence in the regions where the sequences of P1 and P2 differ, whereas the axis labeled as “P2 match” represents the equivalent for P2. The values in each cell represent the number of observed reads matching the values of the two axes at that position. The subtraction of “P1 match” from “P2 match” is represented by the horizontal axis, which is also shared with the x-axis of the histogram on top, where the sum projection of the heatmap is displayed. In the histogram, the breakdown of classification is represented by color: blue for reads classified to P1, orange for reads classified to P2, gray for unclassified reads with a score above score_threshold, and light gray for unclassified reads with a score below score_threshold. Because the normalized alignment scores for the two plasmids are the same where the value on the horizontal axis is 0, reads are not classified to either plasmid; therefore, the middle bar of the histogram is colored with either gray or light gray. The same interpretation applies to (h) and (i). (j-l) Summary of the fitting results. Based on the estimated parameters displayed in Table 1, the rotated heatmaps representing the breakdowns of reads originating from P1, P3, and P5 (upper panels) and P2, P4, and P6 (lower panels) were generated. Note that P1-P6 here are different from the example plasmids used in Fig. 2 and Fig. 3.
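The “P1 match” and “P2 match” counts behind the heatmaps in (g-i) can be illustrated with a small sketch. It assumes that each read has already been projected onto reference coordinates, that the two references have equal length and differ only by substitutions, and that uncovered or deleted positions are written as "-". All function names are hypothetical.

```python
from collections import Counter

def differing_positions(ref1, ref2):
    """Indices at which two equal-length reference sequences differ."""
    return [i for i, (a, b) in enumerate(zip(ref1, ref2)) if a != b]

def match_counts(read, ref1, ref2):
    """Count bases matching each reference at the differing positions.

    read: the read projected onto reference coordinates, with "-" marking
    uncovered or deleted positions (an assumption of this sketch).
    Returns (ref1_matches, ref2_matches).
    """
    m1 = m2 = 0
    for i in differing_positions(ref1, ref2):
        if read[i] == ref1[i]:
            m1 += 1
        elif read[i] == ref2[i]:
            m2 += 1
    return m1, m2

def match_heatmap(reads, ref1, ref2):
    """Number of reads observed for each (ref1_matches, ref2_matches) cell."""
    return Counter(match_counts(r, ref1, ref2) for r in reads)
```

Reads with equal counts for both references fall in the middle column of such a heatmap, which corresponds to the unclassified bar described in the legend.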

The fitting results summarized in Fig. 4j-l.
The values for the fitted parameters are displayed in the “Error rate” and “Total reads” columns. The rotated heatmaps in Fig. 4j-l were generated based on these estimated parameters. For P1, P3, and P5, the sums of values in heatmap cells whose positions on the horizontal axis are above 0, below 0, and 0 are shown in the “Correctly classified reads”, “Wrongly classified reads”, and “Reads not classified” columns, respectively. For P2, P4, and P6, the sums for below 0, above 0, and 0 are shown in the “Correctly classified reads”, “Wrongly classified reads”, and “Reads not classified” columns, respectively. Finally, the values in the “Rate of incorrect classification” columns for each plasmid were calculated by dividing the value of “Wrongly classified reads” for the other plasmid in the same set by the total number of reads estimated to be classified to the plasmid in focus, which is different from the value displayed in the “Total reads” column. Specific equations are provided at the bottom of the table.
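One reading of the “Rate of incorrect classification” calculation is sketched below. The table's own equations are not reproduced in this text, so the formula is an interpretation of the sentence above: the reads classified to a plasmid are taken to be its own correctly classified reads plus the wrongly classified reads that originate from the other plasmid in the pair. The function and parameter names are hypothetical.

```python
def incorrect_classification_rate(correct_own, wrong_other):
    """Fraction of reads classified to a plasmid that came from its partner.

    correct_own: reads from the plasmid in focus that were classified to it.
    wrong_other: reads from the other plasmid that were wrongly classified
                 to the plasmid in focus.
    """
    classified_to_focus = correct_own + wrong_other
    if classified_to_focus == 0:
        raise ValueError("no reads were classified to this plasmid")
    return wrong_other / classified_to_focus
```

With illustrative numbers only, 990 correctly classified reads and 10 wrongly classified reads from the partner plasmid give a rate of 0.01.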

Characteristics of base calling used for prior information.
(a) Grid showing error ratios for each base calling event. Based on the results obtained from samples analyzed by R10.4.1 flow cells with V14 library preparation chemistry by Oxford Nanopore Technologies via the Plasmidsaurus service, the frequency was analyzed for base calling of each pore (column labels) and the results of the consensus sequence (row labels) at each position. In the context of base calling, “-” represents bases that were base-called in the consensus sequence but skipped in the reads from each pore. In the context of consensus sequencing, “-” represents bases that do not appear in the consensus sequence but were base-called from pores. The color of the diagonal panels is saturated because the contrast range focuses on subtle differences among the nondiagonal panels. Of note, the sum of each row is 1, but the sum of the displayed numbers may be slightly different from 1 because the fourth decimal place is rounded in the grid. (b) Quality score distributions for each base calling event. The base calling of each pore (column labels) and the results of the consensus sequence (row labels) at each position were classified, and probability density plots of quality scores were calculated and displayed. The y-axis is shared by all panels and is scaled to focus on panels in which the true base and base calling are not the same (incorrect base calling, blue). Therefore, the density of the maximum quality score is out of the range of the display area in the diagonal panels (correct base calling, orange), and full-size plots are provided at right. Note that there are no density plots when base calling was skipped in the reads from each pore (column corresponding to the label “-” in (a)).
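The row-normalized grid in (a) can be reproduced in miniature as follows. This sketch only shows how such a grid is tabulated from paired observations. The input format and function name are assumptions, and the counts in the usage note are illustrative, not measured values.

```python
from collections import Counter

SYMBOLS = "ACGT-"  # "-" marks a skipped base (read) or an absent base (consensus)

def error_ratio_grid(pairs):
    """Row-normalized frequency of each read call given the consensus base.

    pairs: iterable of (consensus_symbol, read_symbol) for aligned positions.
    Returns {consensus_symbol: {read_symbol: ratio}}; each row sums to 1.
    Rows with no observations are omitted.
    """
    counts = Counter(pairs)
    grid = {}
    for true in SYMBOLS:
        row_total = sum(counts[(true, called)] for called in SYMBOLS)
        if row_total:
            grid[true] = {called: counts[(true, called)] / row_total
                          for called in SYMBOLS}
    return grid
```

For instance, 98 observations of ("A", "A"), one ("A", "G"), and one ("A", "-") yield a row for A with ratios 0.98, 0.01, and 0.01 in the corresponding columns.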

Analysis for the maximum similarity between plasmids that can be mixed and the minimum number of required reads.
(a) Density plot of the rate of reads with correct base calling. The rate was calculated at each position of plasmids and displayed using representative nanopore sequencing results. (b) Averaged quality score distribution of reads with a correct rate of less than 0.7 (upper panel) and more than 0.9 (bottom panel). The corresponding regions are displayed with dashed red frames in (a). Of note, “omitted” represents reads that did not cover the position of interest. (c) Probability logo plot. Statistical significance (-log₁₀[P value]) was calculated for a 5-mer around the positions that showed a correct rate lower than 0.7 in (a), using those that showed a rate of more than 0.9 as a background. Enriched residues are stacked on the top, whereas depleted residues are stacked on the bottom. (d) Estimated probability of incorrect classification. Based on the match/mismatch/deletion ratio of reads obtained in the “worst-case scenario”, i.e., top panel in (b), the probability of incorrect classification of a read was calculated assuming that two plasmids that differ by the indicated number of base(s) were mixed. (e) Estimated probability of correct/incorrect consensus base calling. Based on the quality score distribution obtained in the “worst-case scenario”, i.e., top panel in (b), the indicated number of reads were generated in silico, and the consensus base calling was calculated using SAVEMONEY. The simulation was performed 10,000 times for each condition to calculate the probability of correct/incorrect consensus base calling.
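The kind of calculation described in (d) can be approximated with a simple multinomial model. This is a sketch under stated assumptions, not the authors' procedure: each differing position is treated independently, and a read either matches its own plasmid (probability `p_own`), matches the other plasmid's base (`p_other`), or matches neither, e.g. a deletion or a third base. A read counts as misclassified when it matches the other plasmid at more positions than its own. The names and the probabilities in the usage note are hypothetical.

```python
from math import comb

def misclassification_probability(n_diff, p_own, p_other):
    """Probability that a read matches the wrong plasmid at more positions.

    n_diff: number of positions at which the two plasmids differ.
    p_own: per-position probability that the read matches its own plasmid.
    p_other: per-position probability that it matches the other plasmid.
    The remaining probability mass covers calls that match neither.
    """
    p_neither = 1 - p_own - p_other
    if p_neither < -1e-12:
        raise ValueError("p_own + p_other must not exceed 1")
    p_neither = max(p_neither, 0.0)
    total = 0.0
    for own in range(n_diff + 1):
        # Misclassification requires strictly more matches to the other plasmid.
        for other in range(own + 1, n_diff - own + 1):
            neither = n_diff - own - other
            ways = comb(n_diff, own) * comb(n_diff - own, other)
            total += ways * p_own**own * p_other**other * p_neither**neither
    return total
```

Under this model the probability falls as the number of differing bases grows: with `p_own=0.7` and `p_other=0.1`, a single difference gives a value near 0.1, and three differences give a value near 0.04.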
