A unique chromatin profile defines adaptive genomic regions in a fungal plant pathogen

  1. David E Cook  Is a corresponding author
  2. H Martin Kramer
  3. David E Torres
  4. Michael F Seidl
  5. Bart P H J Thomma  Is a corresponding author
  1. Department of Plant Pathology, Kansas State University, United States
  2. Laboratory of Phytopathology, Wageningen University & Research, Netherlands
  3. Theoretical Biology & Bioinformatics Group, Department of Biology, Utrecht University, Netherlands
  4. University of Cologne, Institute for Plant Sciences, Cluster of Excellence on Plant Sciences (CEPLAS), Germany
7 figures, 4 tables and 2 additional files

Figures

Figure 1 with 3 supplements
DNA methylation is present only at transposable elements, but not at those in Lineage-Specific (LS) regions.

(A) Violin plot of the distribution of DNA methylation levels quantified as weighted methylation over genes, promoters, and transposable elements (TEs). Cytosine methylation was analyzed in the CG, …

Figure 1—source data 1

Genome-wide cytosine methylation levels in 10 kb windows in wild-type and Δhp1 Verticillium dahliae.

https://cdn.elifesciences.org/articles/62208/elife-62208-fig1-data1-v2.txt
Figure 1—figure supplement 1
Genome-wide cytosine methylation in wild-type and Δhp1.

(A) Cytosine methylation was calculated using weighted methylation (see Materials and methods) in the CG, CHG, and CHH sequence contexts in both wild-type (WT) and Δhp1. Methylation levels were …

Figure 1—figure supplement 2
Cytosine methylation for functional elements in wild-type and Δhp1.

Cytosine methylation was calculated using weighted methylation (see Materials and methods) in the CG, CHG, and CHH sequence contexts in both wild-type (JR2) and Δhp1. DNA methylation was summarized …

Figure 1—figure supplement 3
Transcriptional impact of Δhp1.

(A) Volcano plot showing the log2 fold-change for Δhp1 compared to the wild-type (WT) grown in PDB culture. The adjusted p-value (-log10) is shown on the y-axis to indicate statistical significance. …

Figure 2 with 3 supplements
Individual TE families have distinct epigenetic and physical compaction profiles.

(A) Principal component analysis for 14 variables measured for each individual transposable element (TE). Each vector represents one variable, with the length signifying the importance of the …

Figure 2—figure supplement 1
Genomic distribution of DNA characteristics by transposable element (TE) classes.

(A) Scatter plot for %GC sequence content versus CRI values for individual TE elements shown as points, separated by TE type, LINE, and DNA, labeled in the top left of each plot. Each point is …

Figure 2—figure supplement 2
Characterization of transposable elements (TEs) in nine subclasses across genomic variables.

Each data type is shown in the upper right corner of the individual box plots. Outliers are shown as individual points. The nine subclasses of TEs are listed to the left of each figure. Test for …

Figure 2—figure supplement 3
The LTR subclass distinction does not account for the bimodal distribution of LTR elements in the genome.

The same transposable elements (TEs) are shown in three separate scatter plots with marginal densities. Individual Copia elements are shown as blue points, and Gypsy elements as gray points. These …

Figure 3 with 1 supplement
The LTR and Unspecified elements have significantly different chromatin profiles based on core versus LS location.

(A) Heatmap comparing core versus LS values within the four TE classifications for the variable listed to the right. Plot colored based on p-values from Wilcoxon rank sum test. p-values≥0.05 are …

Figure 3—figure supplement 1
Violin plots for twelve measured variables collected for the TEs located in either the core (blue) or LS (yellow) regions of the genome.

Violin plots show the distribution of the values for each category, along with a box plot showing the mean (thick black line), 1st and 3rd quartiles, and whiskers extending to the furthest data point …

Figure 4 with 1 supplement
Epigenome and physical DNA characteristics collectively define core and LS regions.

(A and B) Whole-chromosome plots showing TE and gene counts over 10 kb genomic windows, as blue and red heatmaps respectively. The %GC content is shown in purple, RNA-seq shown in orange, CG cytosine …

Figure 4—figure supplement 1
Principal component analysis for seven variables genome-wide in 10 kb windows.

Each genomic window is shown as a point on the graph, with the windows in the core colored as blue circles and LS as yellow triangles. Colored ellipses show the confidence interval for the core and …

Figure 5 with 1 supplement
Supervised machine learning can predict Lineage-Specific (LS) regions based on epigenome and physical genome characteristics.

(A) Area under the receiver operating characteristic curve (auROC) plotting sensitivity and false positive rate (FPR) for four machine learning algorithms, BCT- Boosted classification tree; GBM- stochastic gradient …

Figure 5—figure supplement 1
Results from model parameter tuning and assessment.

(A) The random forest model was trained using three-time 10-fold cross-validation (CV) under varying conditions for the parameter ‘randomly selected predictor’. The plot shows the average accuracy …

Figure 6 with 5 supplements
Machine learning predictions for genome-wide LS content.

(A) Two machine learning algorithms, Stochastic Gradient Boosting (GBM) and Random Forest (RF), were used to predict Lineage-Specific (LS) regions from 15 independent training-test splits (80/20). …

Figure 6—figure supplement 1
Density plot for the distribution of the number of predictions per genomic region.

The genomic data were compiled into 3611 10 kb windows. For machine learning training and testing (related to Figure 6), only 20% of the data could be used for prediction. To generate predictions …
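
A minimal sketch of why repeated splitting produces a distribution of prediction counts per window. It takes the 3611 windows from this caption and the 15 independent 80/20 splits from Figure 6, and it assumes simple random (non-stratified) sampling with an arbitrary seed; the function name `count_test_appearances` and the split procedure are illustrative assumptions, not the authors' pipeline.

```python
import random

N_WINDOWS = 3611       # number of 10 kb windows, from the caption
N_SPLITS = 15          # independent training-test splits, from Figure 6
TEST_FRACTION = 0.20   # share of windows held out per split


def count_test_appearances(n_windows, n_splits, test_fraction, seed=0):
    """Count how often each window lands in the held-out test set.

    Assumption: each split is a simple random sample without replacement;
    the original analysis may have stratified or sampled differently.
    """
    rng = random.Random(seed)
    n_test = round(n_windows * test_fraction)
    counts = [0] * n_windows
    for _ in range(n_splits):
        for idx in rng.sample(range(n_windows), n_test):
            counts[idx] += 1
    return counts


counts = count_test_appearances(N_WINDOWS, N_SPLITS, TEST_FRACTION)
mean_predictions = sum(counts) / len(counts)
never_tested = sum(1 for c in counts if c == 0)
```

Under these assumptions each window is tested about three times on average (15 × 0.2), and roughly 0.8^15 ≈ 3.5% of windows would be expected never to appear in a test set.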

Figure 6—figure supplement 2
Recall and Precision assessment for independent classification trials.

For each trial, the data set was split 80:20, training and testing, 15 independent times. For each data split, the model was trained and tested and the performance was assessed using Recall (A) and …

Figure 6—figure supplement 3
Genomic location of Lineage-Specific (LS) predictions from two ML models.

The eight chromosomes of V. dahliae are labeled at the right (Chr. X) along with the physical DNA size indicated at the bottom. (A) GBM model predictions for 10 kb windows as either core or LS …

Figure 6—figure supplement 4
Size distribution and summary description of the New and Old Lineage-Specific (LS) classifications.

Box plot of the LS region sizes for the New classification based on model consensus and the previous LS classification. The number of regions, their mean and standard deviation (Std) are shown above …

Figure 6—figure supplement 5
Genome model of core and Lineage-Specific (LS) regions defined by epigenetics and chromatin status.

(Top) The genome of V. dahliae was split into 10 kb windows, and labeled as core or LS based on previous observations, shown in Figure 4D, re-shown here for comparison. (Bottom) Same 10 kb genomic …

Figure 7 with 3 supplements
Genome-wide UMAP grouping shows that functional elements labeled core and LS have different epigenetic and DNA characteristics.

(A) Uniform Manifold Approximation and Projection (UMAP) clustering of individual V. dahliae TEs, color coded for core (blue) and LS (red). UMAP clustering in two dimensions resulted in the …

Figure 7—figure supplement 1
Multiple comparisons of transposable elements (TEs) in UMAP groups for genomic variables.

GC content, GC sequence fraction; Jukes Cantor, corrected Jukes Cantor distance of TE comparisons; CRI, Composite RIP index; RNAseq, variance stabilizing transformed log2 RNA-sequencing reads from …

Figure 7—figure supplement 2
Multiple comparisons of Genes in UMAP groups for genomic variables.

GC content, GC sequence fraction; RNAseq, log2 TPM+1 of RNA-sequencing reads from PDB grown fungus; H3K9me3 and H3K27me3 and ATAC-seq, TPM values of mapped reads from H3K9me3 ChIP-seq, H3K27me3 …

Figure 7—figure supplement 3
UMAP groupings vary significantly for absence across V. dahliae strains.

(A) Scatter plot showing each of the 11,429 genes as a point following the UMAP results. Each gene is colored according to its absence count across 42 V. dahliae strains. (B) Box plot …

Tables

Table 1
Summary of DNA methylation in Verticillium dahliae wild-type (WT) and heterochromatin protein 1 deletion mutant (Δhp1) as measured by whole genome bisulfite sequencing calculated over 10 kb non-overlapping windows.

| Genotype | Avg. weighted mCG* | Avg. weighted mCHG* | Avg. weighted mCHH* | Avg. fraction mCG* | Avg. fraction mCHG* | Avg. fraction mCHH* |
| --- | --- | --- | --- | --- | --- | --- |
| WT | 0.0040 | 0.0037 | 0.0034 | 0.097 | 0.097 | 0.088 |
| Δhp1 | 0.0030 | 0.0030 | 0.0032 | 0.082 | 0.083 | 0.079 |

  1. Avg. weighted, the average of total methylated cytosines in a given context divided by total cytosines in that context in a 10 kb window; Avg. fraction, the total cytosine positions with a read supporting methylation divided by total cytosines in a specific context in a 10 kb window; mCG, methylated cytosine residing next to a guanine; mCHG, methylated cytosine residing next to any base that is not a guanine next to a guanine; mCHH, methylated cytosine residing next to any two bases that are not guanines. *, values are significantly different (p-value<0.001), Mann-Whitney U-test. The distribution of values and p-values for individual comparisons are shown in Figure 1—figure supplement 1.
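
The two summary statistics in Table 1 can be contrasted with a small sketch. It assumes the commonly used read-count form of weighted methylation (methylated reads divided by all reads, pooled over the cytosines in a window), which the footnote does not spell out; the function names and the toy read counts are hypothetical.

```python
def weighted_methylation(sites):
    """Methylated reads divided by all reads, pooled over a window.

    `sites` holds one (methylated_reads, total_reads) pair per cytosine
    in a given sequence context (assumed input layout).
    """
    methylated = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return methylated / total if total else 0.0


def fraction_methylated(sites):
    """Share of cytosine positions with at least one methylated read."""
    if not sites:
        return 0.0
    return sum(1 for m, _ in sites if m > 0) / len(sites)


# Hypothetical 4-cytosine window, not taken from the data set
toy_window = [(0, 10), (1, 10), (5, 10), (0, 20)]
weighted = weighted_methylation(toy_window)   # 6 methylated / 50 total reads
fraction = fraction_methylated(toy_window)    # 2 of 4 positions
```

Because a single supporting read is enough to count a position as methylated, the fraction can sit well above the weighted value for the same window.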

Table 2
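Confusion Matrix for LS and core prediction in V. dahliae from test data classification using the final trained model.

| Model | Predicted | Known Core | Known LS |
| --- | --- | --- | --- |
| LR | Core | 638 | 7 |
| LR | LS | 50 | 26 |
| GBM | Core | 645 | 5 |
| GBM | LS | 43 | 28 |
| BCT | Core | 672 | 20 |
| BCT | LS | 16 | 13 |
| RF | Core | 623 | 2 |
| RF | LS | 65 | 31 |
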
  1. LR, Logistic Regression; GBM, Stochastic Gradient Boosting; BCT, Boosted Classification Tree; RF, Random Forest; Core, regions of the genome defined as core; LS, regions of the genome defined as Lineage Specific. For final model parameter settings and classification thresholds, see Materials and methods.

Table 3
Assessment of four trained machine learning algorithms on final test data.
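
| Models | Precision | Recall | MCC | F1 |
| --- | --- | --- | --- | --- |
| LR | 0.34 | 0.79 | 0.49 | 0.48 |
| GBM | 0.39 | 0.85 | 0.55 | 0.54 |
| BCT | 0.45 | 0.39 | 0.39 | 0.42 |
| RF | 0.32 | 0.94 | 0.52 | 0.48 |
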
  1. LR, Logistic Regression; GBM, Stochastic Gradient Boosting; BCT, Boosted Classification Tree; RF, Random Forest; MCC, Matthews Correlation Coefficient; F1, F-score or harmonic mean of precision and recall. For final model parameter settings and classification thresholds, see Materials and methods.
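
The scores in Table 3 can be recomputed from the Table 2 counts with the standard definitions of precision, recall, F1, and MCC. The sketch below treats LS as the positive class and reads the confusion-matrix cells as (true positives, false positives, false negatives, true negatives); the variable and function names are illustrative, not taken from the authors' code.

```python
import math

# Counts read from Table 2 as (TP, FP, FN, TN), with LS as the positive class
CONFUSION = {
    "LR": (26, 50, 7, 638),
    "GBM": (28, 43, 5, 645),
    "BCT": (13, 16, 20, 672),
    "RF": (31, 65, 2, 623),
}


def scores(tp, fp, fn, tn):
    """Return precision, recall, MCC, and F1 for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall, "mcc": mcc, "f1": f1}


results = {model: scores(*counts) for model, counts in CONFUSION.items()}
for model, s in results.items():
    print(model, {name: round(value, 2) for name, value in s.items()})
```

Rounded to two decimals, these values agree with the entries in Table 3.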

Key resources table

| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| --- | --- | --- | --- | --- |
| Strain, strain background (V. dahliae) | JR2, wild-type | PMID:26286689 | | Fungal Biodiversity Center (CBS), 143773 |
| Strain, strain background (V. dahliae) | JR2, Δhp1 | https://doi.org/10.1101/2020.08.26.268789 | | |
| Antibody | Rabbit anti H3K9me3 (Polyclonal) | Active Motif | 39765 | ChIP (1:200) |
| Antibody | Rabbit anti H3K27me3 (Polyclonal) | Active Motif | 39155 | ChIP (1:200) |
| Recombinant DNA reagent | pRF-HU2 | Frandsen et al., 2008 | | USER-cloning |
| Commercial assay or kit | EZ DNA Methylation-Gold kit | Zymo Research | D5005 | |
| Commercial assay or kit | Nextera DNA library Preparation | Illumina | FA-121–1030 | |

Additional files

Supplementary file 1

Supplementary tables.

Table 1. Summary of transposable elements by family in core and LS regions.
Table 2. Dunn's test of pairwise differences for TE families following Kruskal-Wallis test.
Table 3. Summary count of TEs by sub-class.
Table 4. Kruskal-Wallis test statistic for differences between TE sub-classes for genomic variables.
Table 5. P-value results from Conover test and BH multiple testing correction for genomic variables summarized over TE sub-classes.
Table 6. Contribution of variables to the first 5 dimensions of PCA.
Table 7. Confusion matrix results for Stochastic Gradient Boosting machine learning of 15 independent training-test predictions of LS and core regions.
Table 8. Confusion matrix results for Random Forest machine learning of 15 independent training-test predictions of LS and core regions.
Table 9. GBM LS prediction results for each of the 15 rounds of training and testing.
Table 10. RF LS prediction results for each of the 15 rounds of training and testing.
Table 11. Contingency tables for observed and expected LS versus core designation for in planta induction.
Table 12. Contingency tables for observed and expected LS versus core designation for predicted effectors.
Table 13. Contingency tables for observed and expected LS versus core designation for proteins with secretion signal.
Table 14. Contingency tables for observed and expected TEs classified as LS and core in the 3 UMAP groups.
Table 15. Kruskal-Wallis test statistic for differences between TE UMAP groups across genomic variables.
Table 16. Contingency tables for observed and expected genes classified as LS and core in the 3 UMAP groups.
Table 17. Kruskal-Wallis test statistic for differences between gene UMAP groups across genomic variables.
Table 18. Verticillium dahliae isolates used for the presence/absence variation.

https://cdn.elifesciences.org/articles/62208/elife-62208-supp1-v2.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/62208/elife-62208-transrepform-v2.docx
