Figures and data

Summary of the different protocol modifications and results of each sample.
* Standard lysis protocol of the Single-Microbe DNA Barcoding kit, i.e, 0.8 M KOH lysis for 15 minutes at 23°C. ** Standard WGA of the Single-Microbe DNA Barcoding kit, which is based on MDA.

Sequencing metrics and cell characterization in SPC-based scDNAseq. A: Schematic representing the experimental setup of the present study. Passage numbers of the two strains are indicated as P X (unknown number of passages) + passages since isolation from clinical sample (18 for BPK081, 14 for HU3), and in the case of BPK081, the number inside parenthesis indicate number of passages since cloning. Standard lysis was done with 0.8M of KOH for 15 minutes at 23 °C. Longer incubation was done for 1h instead of 15 minutes. Higher temperature was done at 99°C instead of 23 °C. PTA lysis indicates the lysis protocol of the PTA kit (ResolveDNATM – Bioskryb). B: Distribution of the total number of reads associated with each SPC barcode in all 6 samples. X-axis (log10 scale) show regions where some of the distributions peak. C: Distinction between ‘true cells’ and background signals. Read counts associated with each barcode were ranked in a log10 space and the ‘elbow’ point in the curve was set as the threshold for background signal. The labels show the rank and read count of the last SPC considered as ‘true cell’ D: Number of SPCs containing a ‘true cell’ (blue) or a background signal (grey) in each sample. E: Evaluation of HU3/BPK081 doublets (grey dots) in each sample. Y-axis show the percentage of variant reads with HU3 signature, which was predetermined based on bulk whole genome sequencing of HU3. F-I: analysis of nucleotide coverage depth in each sample. Coverage (F) is defined as the average number of reads mapping each nucleotide in the genome. Figures G and H show the percentage of the genome that was covered by at least 1 and 5 reads respectively, while figure I show the percentage of the genome covered by at least 1 read after cells were normalized to 100.000 reads per cell. Violin plots display the distribution of the data (kernel density), with embedded boxplots indicating the median (thick line), the interquartile range (IǪR – shown by a box spanning the 25th to 75th percentile), and whiskers (lines) extending to 1.5×IǪR. Points beyond the whiskers indicate outliers.

Evaluation of the impact of the different WGA methods on coverage evenness. A: Lorenz curves showing the cumulative fraction of sequencing counts as a function of the cumulative fraction of the genome covered in the scDNAseq datasets. The dashed diagonal represents perfect uniform coverage. Curves deviating downward from the diagonal indicate increasing coverage unevenness across the genome. The blue line indicates the 10X-scDNA datasets, while the green lines show samples SPC-PTA1-2, and orange lines represent samples SPC-STD1-3. B-E: violin plots showing the gini (B), MAPD (C), ICCV (D) and ICF (E) scores for each dataset compared to the 10X-BPK081 and 10-HU3 libraries. Asteriscs indicate the p-values of a wilcox test of all samples against both 10X-scDNA libraries combined adjusted with the hochberg method. **** = pvalue < 0.0001. F-H: heatmap showing the read count of each 20k bin (y axis) in each cell (x – axis) in each chromosome (row groups). Colors indicate the read count of each bin normalized by the median read count in the cell. Annotation bars indicate the cell strain, and cells were arranged with the ward.D2 hyerarchical clustering algorithm based on the heatmap values. A density plot on top shows the distribution of values in the heatmap. Here sample SPC-STD2 is shown as a representative of standard WGA samples, while sample SPC-PTA2 represents the PTA samples.

Evaluation of the standard WGA PTA for determination of chromosome copy numbers in single-cells. A-B: Heatmaps showing the estimated raw somy values for each cell in sample SPC-STD2 (A) SPC-PTA2 (B). Raw somy values were calculated as the mean count of each chromosome (normalized by the cells mean) multiplied by the cell’s scale factor. Annotation bars indicate the cell strain, and cells were arranged with the ward.D2 hyerarchical clustering algorithm based on the heatmap values. A density plot on top shows the distribution of values in the heatmap. C: heatmaps showing the 5 most frequent karyotypes identified the BPK081 cells of each dataset. Heatmap colors indicate the integer copy number of each chromosome (y-axis) in each of the 5 karyotypes (x-axis). Columns on top indicate the proportion of the total sequenced population displaying each karyotype. Numbers on top of the bars show the number of cells instead. D: Sankey diagram showing the proportion (y-axis) of each BPK081 karyotype in the SPC-scDNA samples compared to those found in the 10X-scDNA dataset. Each karyotype is represented by a line which width indicates its proportion and height indicating its position. Karyotypes which are not found in more than 1% of any dataset were not colored. E-F: The same as C-D but for the HU3 cells.

Detection of strain-specific CNVs. A: heatmaps showing the read counts of samples SPC-STD2 (left) and SPC-PTA2 (right) across 100bp bins in the region surrounding the m-locus in chromosome 36. Heatmap colors indicate normalized read count for each bin. Annotation bars indicate the cell strain, and cells were arranged with the ward.D2 hierarchical clustering algorithm based on the heatmap values. B: same as A, but showing the normalized read counts for the CNV found in the chromosome 29. C: Distribution of the estimated haploid copy number of the m-locus (top row) and the CNV in chromosome 29 (bottom row) in each cell and in each sample. Haploid copy numbers were estimated by dividing the average count in the bins located in the m-locus by the average count of the other bins of the chromosome. Colors indicate the strain.

Identification of nucleotide variants with the SPC-scDNA workflow. A: Principal component analysis (PCA) based on nucleotide variants identified in each single-cell dataset. PCA was done using PLINK2 after removal of linked variants. Colors of the dots indicate the strain. B: Violin plots indicating the fraction of cells in which each variant was confidently called. Included boxplots indicate the median (thick line), the interquartile range (IǪR – shown by a box spanning the 25th to 75th percentile), and whiskers (lines) extending to 1.5*IǪR of the data. C: heatmap showing the estimated allele frequency of 1000 randomly sampled variant loci in samples SPC-STD2 (left) and SPC-PTA2 (right). Color scale goes from blue (allele frequency of 0) to yellow (0.5) and red (1). White indicates missing values (NAs). Annotation bars indicate the cell strain, and cells were arranged with the ward.D2 hierarchical clustering algorithm based on the heatmap values. The black arrows to the right side of the heatmap of SPC-PTA2 show the drug resistance-associated loci. D: A zoom in the drug resistance-associated genes. Here only genes showing heterogeneity in nucleotide sequence are shown. The TEC1, TEC2, NUC, LiRos3, and HELI genes are associated with resistance against miltefosine (MIL), while MRPA is associated with Antimony (ANT) resistance, and C5D is related to amphotericin B (AMB). The right row labels indicate which type of mutation each variant is predicted to cause in the protein expressed by those genes. E-F: Phylogenetic trees built based on homozygous variants found in the dataset of SPC-PTA2. Trees were constructed with the ScisTree2 pipeline for the BPK081 strain (E) and HU3 (F). In F, an inset displays an expansion of the extreme right part of the tree. Numbers in the x-axis indicate the number of different variants between nodes.