(A) Violin plot of the distribution of DNA methylation levels quantified as weighted methylation over genes, promoters, and transposable elements (TEs). Cytosine methylation was analyzed in the CG, …
Genome-wide cytosine methylation levels in 10 kb windows in wild-type and Δhp1 Verticillium dahliae.
(A) Cytosine methylation was calculated using weighted methylation (see Materials and methods) in the CG, CHG, and CHH sequence contexts in both wild-type (WT) and Δhp1. Methylation levels were …
Cytosine methylation was calculated using weighted methylation (see Materials and methods) in the CG, CHG, and CHH sequence contexts in both wild-type (JR2) and Δhp1. DNA methylation was summarized …
(A) Volcano plot showing the log2 fold-change for Δhp1 compared to the wild-type (WT) grown in PDB culture. The adjusted p-value (-log10) is shown on the y-axis to indicate statistical significance. …
(A) Principal component analysis for 14 variables measured for each individual transposable element (TE). Each vector represents one variable, with the length signifying the importance of the …
Genomic data for transposable elements.
(A) Scatter plot for %GC sequence content versus CRI values for individual TE elements shown as points, separated by TE type, LINE, and DNA, labeled in the top left of each plot. Each point is …
Each data type is shown in the upper right corner of the individual box plots. Outliers are shown as individual points. The nine subclasses of TEs are listed to the left of each figure. Test for …
The same transposable elements (TEs) are shown in three separate scatter plots with marginal densities. Individual Copia elements are shown as blue points, and Gypsy elements as gray points. These …
(A) Heatmap comparing core versus LS values within the four TE classifications for the variable listed to the right. Plot colored based on p-values from Wilcoxon rank sum test. p-values≥0.05 are …
Violin plots show the distribution of the values for each category, along with a box plot showing the mean (thick black line) 1st and 3rd quartiles, and whiskers extending to the furthest data point …
(A and B) Whole-chromosome plots showing TE and gene counts over 10 kb genomic windows, blue and red heatmaps respectively. The %GC content is shown in purple, RNA-seq shown in orange, CG cytosine …
Genomic data for 10 kb windows.
Each genomic window is shown as a point on the graph, with the windows in the core colored as blue circles and LS as yellow triangles. Colored ellipses show the confidence interval for the core and …
(A) Area under the receiver operating characteristic curve (auROC) plotting sensitivity against false positive rate (FPR) for four machine learning algorithms: BCT, boosted classification tree; GBM, stochastic gradient …
(A) The random forest model was trained using 10-fold cross-validation (CV) repeated three times under varying conditions for the parameter ‘randomly selected predictor’. The plot shows the average accuracy …
(A) Two machine learning algorithms, Stochastic Gradient Boosting (GBM) and Random Forest (RF), were used to predict Lineage-Specific (LS) regions from 15 independent training-test splits (80/20). …
Consensus LS classification genomic regions.
Gene presence and absence counts.
TE presence and absence counts.
The genomic data were compiled into 3611 10 kb windows. For machine learning training and testing (related to Figure 6), only 20% of the data could be used for prediction. To generate predictions …
For each trial, the data set was split 80:20 (training:testing) 15 independent times. For each data split, the model was trained and tested, and the performance was assessed using Recall (A) and …
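The repeated 80:20 splitting described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `repeated_splits`, the fixed seed, and the use of a plain (unstratified) shuffle are assumptions; the window count of 3611 is taken from the caption above.

```python
import random


def repeated_splits(n_items, n_repeats=15, train_fraction=0.8, seed=0):
    """Return n_repeats independent (train, test) index splits.

    Assumption: a simple unstratified shuffle. The source does not say
    whether the core/LS class ratio was preserved in each split.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n_train = int(n_items * train_fraction)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        train = sorted(indices[:n_train])
        test = sorted(indices[n_train:])
        splits.append((train, test))
    return splits


# 3611 genomic windows of 10 kb, split 80:20 fifteen independent times.
splits = repeated_splits(3611)
for i, (train, test) in enumerate(splits[:3], start=1):
    print(f"split {i}: {len(train)} training windows, {len(test)} test windows")
```

Each split would then be used to train a model on the training windows and score it on the held-out windows, so that performance is summarized over 15 independent evaluations rather than a single one.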
The eight chromosomes of V. dahliae are labeled at the right (Chr. X) along with the physical DNA size indicated at the bottom. (A) GBM model predictions for 10 kb windows as either core or LS …
Box plot of the LS region sizes for the New classification based on model consensus and the previous LS classification. The number of regions, their mean and standard deviation (Std) are shown above …
(Top) The genome of V. dahliae was split into 10 kb windows, and labeled as core or LS based on previous observations, shown in Figure 4D, re-shown here for comparison. (Bottom) Same 10 kb genomic …
(A) Uniform Manifold Approximation and Projection (UMAP) clustering of individual V. dahliae TEs, color coded for core (blue) and LS (red). UMAP clustering in two dimensions resulted in the …
Genomic data and UMAP group for TEs.
Genomic data and UMAP group for genes.
GC content, GC sequence fraction; Jukes Cantor, corrected Jukes Cantor distance of TE comparisons; CRI, Composite RIP index; RNAseq, variance stabilizing transformed log2 RNA-sequencing reads from …
GC content, GC sequence fraction; RNAseq, log2 TPM+1 of RNA-sequencing reads from PDB grown fungus; H3K9me3 and H3K27me3 and ATAC-seq, TPM values of mapped reads from H3K9me3 ChIP-seq, H3K27me3 …
Genotype | Avg. weighted mCG* | Avg. weighted mCHG* | Avg. weighted mCHH* | Avg. fraction mCG* | Avg. fraction mCHG* | Avg. fraction mCHH* |
---|---|---|---|---|---|---|
WT | 0.0040 | 0.0037 | 0.0034 | 0.097 | 0.097 | 0.088 |
Δhp1 | 0.0030 | 0.0030 | 0.0032 | 0.082 | 0.083 | 0.079 |
Avg. weighted, the average of total methylated cytosines in a given context divided by total cytosines in that context in a 10 kb window; Avg. fraction, the total cytosine positions with a read supporting methylation divided by total cytosines in a specific context in a 10 kb window; mCG, methylated cytosine residing next to a guanine; mCHG, methylated cytosine residing next to any base that is not a guanine, followed by a guanine; mCHH, methylated cytosine residing next to any two bases that are not guanines. *, values are significantly different (p-value<0.001), Mann-Whitney U-test. The distributions of values and p-values for individual comparisons are shown in Figure 1—figure supplement 1.
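The two summary statistics defined in this footnote can be illustrated for a single window with a short sketch. The `Site` record, the function names, and the example read counts are hypothetical. The weighted value is computed here in the commonly used read-count form (methylated reads over all reads at cytosines in the context), which is an assumption about how the footnote's definition is applied.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Bisulfite read counts at one cytosine in a given context (hypothetical record)."""
    methylated_reads: int
    unmethylated_reads: int


def weighted_methylation(sites):
    """Methylated reads divided by all reads, pooled over the sites in a window."""
    methylated = sum(s.methylated_reads for s in sites)
    total = sum(s.methylated_reads + s.unmethylated_reads for s in sites)
    return methylated / total if total else float("nan")


def fraction_methylated(sites):
    """Share of cytosine positions with at least one read supporting methylation."""
    if not sites:
        return float("nan")
    supported = sum(1 for s in sites if s.methylated_reads > 0)
    return supported / len(sites)


# Made-up counts for four cytosines in one 10 kb window.
window = [Site(1, 9), Site(0, 10), Site(5, 5), Site(0, 20)]
weighted = weighted_methylation(window)
fraction = fraction_methylated(window)
print(f"weighted methylation: {weighted:.3f}")
print(f"fraction methylated:  {fraction:.3f}")
```

The example shows why the two columns of the table differ in scale: a position with a single methylated read counts fully toward the fraction, but contributes only one read to the weighted value.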
Model | Predicted | Known core | Known LS |
---|---|---|---|
LR | Core | 638 | 7 |
LR | LS | 50 | 26 |
GBM | Core | 645 | 5 |
GBM | LS | 43 | 28 |
BCT | Core | 672 | 20 |
BCT | LS | 16 | 13 |
RF | Core | 623 | 2 |
RF | LS | 65 | 31 |
LR, Logistic Regression; GBM, Stochastic Gradient Boosting; BCT, Boosted Classification Tree; RF, Random Forest; Core, regions of the genome defined as core; LS, regions of the genome defined as Lineage Specific. For final model parameter settings and classification thresholds, see Materials and methods.
Models | Precision | Recall | MCC | F1 |
---|---|---|---|---|
LR | 0.34 | 0.79 | 0.49 | 0.48 |
GBM | 0.39 | 0.85 | 0.55 | 0.54 |
BCT | 0.45 | 0.39 | 0.39 | 0.42 |
RF | 0.32 | 0.94 | 0.52 | 0.48 |
LR, Logistic Regression; GBM, Stochastic Gradient Boosting; BCT, Boosted Classification Tree; RF, Random Forest; MCC, Matthews Correlation Coefficient; F1, F-score or harmonic mean of precision and recall. For final model parameter settings and classification thresholds, see Materials and methods.
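As a check on how these metrics relate to confusion-matrix counts, the sketch below recomputes precision, recall, MCC, and F1 with LS treated as the positive class. The counts are the known-versus-predicted values reported for the four models; the function name and dictionary layout are illustrative assumptions, and no guard against empty classes is included.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and Matthews correlation coefficient from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall, "mcc": mcc, "f1": f1}


# (TP, FP, FN, TN) with LS as the positive class:
# TP = predicted LS and known LS, FP = predicted LS but known core,
# FN = predicted core but known LS, TN = predicted core and known core.
confusion = {
    "LR": (26, 50, 7, 638),
    "GBM": (28, 43, 5, 645),
    "BCT": (13, 16, 20, 672),
    "RF": (31, 65, 2, 623),
}

results = {model: classification_metrics(*counts) for model, counts in confusion.items()}
for model, m in results.items():
    print(
        f"{model}: precision={m['precision']:.2f} recall={m['recall']:.2f} "
        f"MCC={m['mcc']:.2f} F1={m['f1']:.2f}"
    )
```

Because only 33 of the test windows are known LS, precision stays low even when recall is high, which is why MCC and F1 are reported alongside the two.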
Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
---|---|---|---|---|
Strain, strain background (V. dahliae) | JR2, wild-type | PMID:26286689 | Fungal Biodiversity Center (CBS), 143773 | |
Strain, strain background (V. dahliae) | JR2, Δhp1 | https://doi.org/10.1101/2020.08.26.268789 | ||
Antibody | Rabbit anti H3K9me3 (Polyclonal) | Active Motif | 39765 | ChIP (1:200) |
Antibody | Rabbit anti H3K27me3 (Polyclonal) | Active Motif | 39155 | ChIP (1:200) |
Recombinant DNA reagent | pRF-HU2 | Frandsen et al., 2008 | USER-cloning | |
Commercial assay or kit | EZ DNA Methylation-Gold kit | Zymo Research | D5005 | |
Commercial assay or kit | Nextera DNA library Preparation | Illumina | FA-121–1030 |
Supplementary tables.
Table 1. Summary of transposable elements by family in core and LS regions. Table 2. Dunn's test of pairwise differences for TE families following Kruskal-Wallis test. Table 3. Summary count of TEs by sub-class. Table 4. Kruskal-Wallis test statistic for differences between TE sub-classes for genomic variables. Table 5. P-value results from Conover test and BH multiple testing correction for genomic variables summarized over TE sub-classes. Table 6. Contribution of variables to the first 5 dimensions of PCA. Table 7. Confusion matrix results for Stochastic Gradient Boosting machine learning of 15 independent training-test predictions of LS and core regions. Table 8. Confusion matrix results for Random Forest machine learning of 15 independent training-test predictions of LS and core regions. Table 9. GBM LS prediction results for each of the 15 rounds of training and testing. Table 10. RF LS prediction results for each of the 15 rounds of training and testing. Table 11. Contingency tables for observed and expected LS versus core designation for in planta induction. Table 12. Contingency tables for observed and expected LS versus core designation for predicted effectors. Table 13. Contingency tables for observed and expected LS versus core designation for proteins with secretion signal. Table 14. Contingency tables for observed and expected TEs classified as LS and core in the 3 UMAP groups. Table 15. Kruskal-Wallis test statistic for differences between TE UMAP groups across genomic variables. Table 16. Contingency tables for observed and expected genes classified as LS and core in the 3 UMAP groups. Table 17. Kruskal-Wallis test statistic for differences between gene UMAP groups across genomic variables. Table 18. Verticillium dahliae isolates used for the presence/absence variation.