Workflow for Simple Algorithm for Very Efficient Multiplexing of Oxford Nanopore Experiments for You (SAVEMONEY).

The algorithm consists of three steps: pre-survey, sample submission, and post-analysis. The pre-survey step identifies the optimal combination of plasmids that will permit suitable accuracy for the classification step of the post-analysis. Plasmids with divergent sequences are grouped together, and those with very similar sequences are assigned to different groups. After sample submission and sequencing, the post-analysis component, which itself consists of three steps, is performed to deconvolve the obtained results. Reads (query sequences) are first classified based on their similarity to the plasmid blueprint/map (reference sequence). Reads are then aligned against the reference sequences. Finally, consensus sequences and quality scores are calculated by Bayesian analysis from the base calls and quality scores of each read together with the reference sequence.
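The final consensus step can be illustrated with a minimal sketch. It is not the SAVEMONEY implementation: it assumes a simple error model in which a base call with Phred quality Q is wrong with probability 10^(-Q/10) and a wrong call is equally likely to be any of the other three bases. The function name `consensus_base` and the `prior` argument (which could, for example, place extra weight on the reference base) are hypothetical.

```python
import math

BASES = "ACGT"


def consensus_base(calls, prior=None):
    """Posterior over the true base at one position.

    calls: list of (called_base, phred_quality) pairs, one per read.
    prior: dict base -> strictly positive prior probability (assumed uniform if omitted).
    """
    prior = prior or {b: 0.25 for b in BASES}
    log_post = {b: math.log(prior[b]) for b in BASES}
    for called, quality in calls:
        # Phred definition of the error probability, capped at 0.75
        # so that a call never counts against its own base.
        err = min(10 ** (-quality / 10), 0.75)
        for b in BASES:
            log_post[b] += math.log(1 - err if called == b else err / 3)
    # Normalize in log space to avoid underflow with many reads.
    peak = max(log_post.values())
    weights = {b: math.exp(v - peak) for b, v in log_post.items()}
    norm = sum(weights.values())
    posterior = {b: w / norm for b, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


best, posterior = consensus_base([("A", 30), ("A", 20), ("G", 5)])
```

In this toy example two high-quality "A" calls outweigh one low-quality "G" call. A consensus quality score could then be derived from the posterior probability of the chosen base.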

Examples of the pre-survey outputs.

Sequences of 14 different plasmids were analyzed with threshold (pre) values of 6 (a), 10 (b), or 20 (c). The number of bases that differ between each plasmid pair is displayed in the heatmaps. These values were used to build the distance matrix and, from it, the dendrograms displayed on the right side of each heatmap. The dotted red lines in the dendrograms mark the threshold (pre) values and make the clustering results visible: plasmids joined at distances below the red line were classified into the same cluster. Based on these clustering results, plasmids were assigned to groups for sequencing submission, with groups displayed in red and blue (a); red, blue, and magenta (b); or red, blue, magenta, and cyan (c). The numbers of bases that differ between plasmids within each group are also emphasized by the colored frames, and plasmids classified into the same cluster are assigned to different groups. Note that P1–P14 here are different from the example plasmids used in Fig. 3 and Fig. 4.
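The clustering and grouping logic described above can be sketched as follows. This is an illustration under stated assumptions, not the SAVEMONEY code: Hamming distance between equal-length toy sequences stands in for the alignment-based count of differing bases, merging every pair closer than the threshold stands in for cutting the dendrogram (equivalent to single linkage, which may not be the linkage actually used), and all function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_threshold(names, dist, threshold):
    """Merge plasmids whose pairwise distance is below the threshold (union-find)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dist[(a, b)] < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def assign_groups(clusters):
    """Place the members of each cluster in different submission groups."""
    n_groups = max(len(c) for c in clusters)
    groups = [[] for _ in range(n_groups)]
    for cluster in sorted(clusters, key=len, reverse=True):
        # Fill the currently smallest groups first to keep group sizes balanced.
        order = sorted(range(n_groups), key=lambda g: len(groups[g]))
        for g, name in zip(order, cluster):
            groups[g].append(name)
    return groups


# Toy sequences (hypothetical, far shorter than real plasmids).
seqs = {"P1": "ACGTACGTAC", "P2": "ACGTACGTAA",
        "P3": "TTTTGGGGCC", "P4": "TTTTGGGGCA"}
names = list(seqs)
dist = {(a, b): hamming(seqs[a], seqs[b]) for a, b in combinations(names, 2)}
clusters = cluster_by_threshold(names, dist, threshold=3)
groups = assign_groups(clusters)
```

Here P1/P2 and P3/P4 each differ by a single base, so they form two clusters, and each submission group receives one member of each cluster. The number of groups equals the size of the largest cluster, which is consistent with larger threshold (pre) values requiring more groups.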

Example results after the classification step for a set of six moderately related plasmids.

(a) Results of the pre-survey against plasmid sets used in (b–c). (b) Read length and quality score distributions. (c) Scatter plots of normalized alignment scores. Normalized alignment scores were calculated for each read against all reference plasmids and displayed as scatter plots. Density plots of the normalized alignment scores are also displayed for each reference plasmid in the diagonal panels; each is the projection of the corresponding scatter plots onto the horizontal axis. The y-axis ranges of these diagonal panels are shared. The vertical and horizontal positions of the dashed lines correspond to the threshold (post) values. These data depict results of classification performed with a threshold (post) value of 0.5. Note that P1–P6 here are different from the example plasmids used in Fig. 2 and Fig. 4.
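How a threshold (post) value could turn normalized alignment scores into a read assignment is sketched below. The decision rule is an assumption made for illustration (assign the read to its best-scoring reference only if that score reaches the threshold, and leave ties unassigned); the caption does not spell out the exact rule, and `classify_read` is a hypothetical name.

```python
def classify_read(scores, threshold=0.5):
    """Return the best-matching reference for one read, or None.

    scores: dict reference name -> normalized alignment score in [0, 1].
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    if best_score < threshold:
        return None  # no reference is similar enough
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return None  # ambiguous: two references tie for the top score
    return best_name


assignment = classify_read({"P1": 0.92, "P2": 0.31, "P3": 0.28})
```

With the example scores the read is assigned to P1, while a read whose scores all fall below 0.5 is left unclassified.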

Example results after the classification step for closely related plasmids.

(a, e, i) Results of the pre-survey against plasmid sets used in (b–d, f–h, j–l). (b, f, j) Scatter plots of normalized alignment scores against the plasmid pairs. Density plots of the normalized alignment scores against P1, P3, and P5 are also displayed in the upper panels; each is the projection of the corresponding scatter plot onto the horizontal axis. The vertical and horizontal positions of the dashed lines correspond to the threshold (post) values. These data depict results of classification performed with a threshold (post) value of 0.5. (c, g, k) Magnified views of the bottom panels in (b, f, j). (d, h, l) Histograms displaying the distribution of reads by distance from the y=x line in the bottom panels in (b, f, j). Note that P1–P6 here are different from the example plasmids used in Fig. 2 and Fig. 3.
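The distance from the y=x line used in the histograms follows from plane geometry: a point (x, y) lies at signed perpendicular distance (y - x)/sqrt(2) from that line. A minimal sketch, with hypothetical function names and an arbitrary bin width, is given here. Whether the figure uses signed or absolute distance is not stated, so the sign convention is an assumption.

```python
import math
from collections import Counter


def signed_distance_from_diagonal(x, y):
    """Signed perpendicular distance of the point (x, y) from the line y = x.

    Positive values mean the read scores higher against the y-axis plasmid.
    """
    return (y - x) / math.sqrt(2)


def distance_histogram(points, bin_width=0.01):
    """Count reads per distance bin; keys are integer bin indices."""
    return Counter(
        math.floor(signed_distance_from_diagonal(x, y) / bin_width)
        for x, y in points
    )


# Each point is (score against plasmid on x-axis, score against plasmid on y-axis).
points = [(0.90, 0.90), (0.95, 0.90), (0.90, 0.95)]
hist = distance_histogram(points)
```

Reads from two closely related plasmids both score well above the threshold on each axis, so their separation shows up only as a small offset to either side of the diagonal.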

Characteristics of base calling used for prior information.

(a) Grid showing error ratios for each base calling event. Based on the results obtained from samples analyzed on R10.4.1 flow cells with V14 library preparation chemistry by Oxford Nanopore Technologies via the Plasmidsaurus service, the frequency was analyzed for the base calling of each pore (column labels) and the results of the consensus sequence (row labels) at each position. In the context of base calling, “–” represents bases that were base-called in the consensus sequence but skipped in the reads from each pore. In the context of the consensus sequence, “–” represents bases that do not appear in the consensus sequence but were base-called from pores. The color of the diagonal panels is saturated because the contrast range is set to bring out subtle differences among the non-diagonal panels. Of note, each row sums to 1, but the sum of the displayed numbers may differ slightly from 1 because the values are rounded at the fourth decimal place in the grid. (b) Quality score distributions for each base calling event. The base calling of each pore (column labels) and the results of the consensus sequence (row labels) at each position were classified, and probability density plots and quality scores were calculated and displayed. The y-axis is shared by all panels and is scaled to focus on panels in which the true base and the base call are not the same (incorrect base calling, blue). Therefore, the density at the maximum quality score is out of the range of the display area in the diagonal panels (correct base calling, orange), and full-size plots are provided at right. Note that there are no density plots when base calling was skipped in the reads from each pore (column corresponding to the label “–” in (a)).
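The row-normalized grid in (a) can be reproduced in outline with the sketch below. It assumes the input has already been reduced to (consensus symbol, called symbol) pairs from aligned reads; the function name and the toy counts are hypothetical and do not reflect the measured error ratios.

```python
from collections import Counter

SYMBOLS = "ACGT-"  # "-" marks a skipped base (column) or an inserted base (row)


def error_ratio_grid(pairs):
    """Row-normalized frequencies of called symbols for each consensus symbol.

    pairs: iterable of (consensus_symbol, called_symbol).
    Returns dict row -> dict column -> frequency; every returned row sums to 1.
    """
    counts = Counter(pairs)
    grid = {}
    for true in SYMBOLS:
        row_total = sum(counts[(true, called)] for called in SYMBOLS)
        if row_total == 0:
            continue  # no observations for this consensus symbol
        grid[true] = {called: counts[(true, called)] / row_total
                      for called in SYMBOLS}
    return grid


# Toy observations (hypothetical counts).
pairs = [("A", "A")] * 98 + [("A", "G"), ("A", "-"), ("-", "C")]
grid = error_ratio_grid(pairs)
```

Rounding each cell independently for display is what can make the printed values of a row sum to slightly more or less than 1, even though the unrounded row sums to 1.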

Analysis for the maximum similarity between plasmids that can be mixed and the minimum number of required reads.

(a) Density plot of the rate of reads with correct base calling. The rate was calculated at each position of the plasmids and displayed using representative nanopore sequencing results. (b) Averaged quality score distribution of reads with a correct rate of less than 0.7 (upper panel) and more than 0.9 (bottom panel). The corresponding regions are displayed with dashed red frames in (a). Of note, “omitted” represents reads that did not cover the position in question. (c) Probability logo plot. Statistical significance (−log10[P value]) was calculated for a 5-mer around the positions that showed a correct rate lower than 0.7 in (a), using those that showed a rate of more than 0.9 as a background. Enriched residues are stacked on the top, whereas depleted residues are stacked on the bottom. (d) Estimated probability of incorrect classification. Based on the match/mismatch/deletion ratio of reads obtained in the “worst-case scenario”, i.e., the upper panel in (b), the probability of incorrect classification of a read was calculated assuming that two plasmids that differ by the indicated number of bases were mixed. (e) Estimated probability of correct/incorrect consensus base calling. Based on the quality score distribution obtained in the “worst-case scenario”, i.e., the upper panel in (b), the indicated number of reads were generated in silico, and the consensus base calling was calculated using SAVEMONEY. The simulation was performed 10,000 times for each condition to calculate the probability of correct/incorrect consensus base calling.
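The reasoning behind (d) can be illustrated with a simplified model; it is not the calculation used for the figure. Assume two plasmids differ by substitutions at d positions and that, at each of those positions independently, a read from the true plasmid shows the true base with probability `p_match`, the other plasmid's base with probability `p_alt`, and anything else (another base or a deletion) otherwise. The read is counted as misclassified when more of the d positions support the wrong plasmid than the right one. The parameter values below are hypothetical.

```python
from math import comb


def misclassification_probability(d, p_match, p_alt):
    """P(more differing positions support the wrong plasmid than the right one)."""
    p_other = 1.0 - p_match - p_alt
    total = 0.0
    for n_alt in range(d + 1):
        # n_match must be smaller than n_alt and fit within the d positions.
        for n_match in range(min(n_alt, d - n_alt + 1)):
            n_other = d - n_alt - n_match
            ways = comb(d, n_alt) * comb(d - n_alt, n_match)  # multinomial coefficient
            total += ways * p_alt**n_alt * p_match**n_match * p_other**n_other
    return total


# Hypothetical worst-case rates: 70% correct, 10% reading as the other plasmid's base.
for d in (1, 2, 5, 10):
    print(d, misclassification_probability(d, p_match=0.7, p_alt=0.1))
```

Under this model the probability falls quickly as the number of differing bases grows, which is the qualitative behavior that motivates setting a minimum distance between plasmids sharing a group.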