Abstract
Summary
Fungi display a wide range of lifestyles and hosts. We still know little about the impact of lifestyles, including pathogenicity, on their genome architecture. Here, we combined and annotated 552 fungal genomes from the class Sordariomycetes and examined the association between 12 genomic features and two lifestyle traits: pathogenicity and insect association. We found that pathogens on average tend to have a larger number of protein-coding genes, including effectors, and tRNA genes. In addition, the non-repetitive size of their genomes is larger than that of non-pathogenic species. However, this pattern is not consistent across all groups. Insect endoparasites and symbionts have smaller genome sizes and genes with longer exons; moreover, insect-vectored pathogens possess fewer genes compared to those not transmitted by insects. Our study shows that genes are the main contributors to genome size variation in Sordariomycetes and that seemingly similar pathogens can exhibit distinct genome architectures, depending on their host and vector interactions.
Introduction
The growing emergence of severe infectious diseases in humans, agriculture, and wildlife caused by fungi, prompts strengthened efforts to uncover the molecular mechanisms of pathogenic traits. This requires understanding the forces shaping the evolution of fungal pathogen genomes (Fisher et al. 2020; Rokas 2022). Fungal genome sizes span from 3 Mb in microsporidians (Biderre et al. 1997) up to 892 Mb reported in a rust fungus (Mohanta and Bae 2015; Tavares et al. 2014), and fungal pathogens score both among the smallest and the largest of them. Various features of the genome have been implicated in the evolution of the pathogenic lifestyle. These include the evolution of small proteins called effectors that interact with the host immune system (Stergiopoulos and de Wit 2009). Other features include specialized gene families (Raffa and Keller 2019), horizontal gene transfer (Sahu et al. 2023; McDonald et al. 2019), copy number variation of pathogenicity-relevant genes (Bergin et al. 2022), expansion of transposable elements (Oggenfuss and Croll 2023), and compartmentalization of the genome (Raffaele and Kamoun 2012; Wacker et al. 2023). Yet, we still have limited knowledge of which genomic features are specific to pathogenic lifestyle, and if they are equally important across phylogenetically distinct groups of pathogens.
Fungal pathogen genomes, and in particular fungal plant pathogen genomes have been often linked to large sizes with expansions of TEs, and a unique presence of a compartmentalized genome with fast and slow evolving regions or chromosomes (Raffaele and Kamoun 2012; Möller and Stukenbrock 2017). Such accessory genomic compartments were shown to facilitate the fast evolution of effectors (Dong, Raffaele, and Kamoun 2015). Also, TE expansions were shown to play a role in the evolution of new virulence genes in emerging pathogens (Bao et al. 2017; Wacker et al. 2023). However, even though such architecture can facilitate pathogen evolution, it is currently recognized more as a side effect of a species evolutionary history rather than a pathogenicity related trait (Torres et al. 2020). Nevertheless, effectors are considered important genes for pathogen evolution, along with the diversification of other gene families (Muszewska et al. 2011; Baroncelli et al. 2016; Sipos et al. 2017). As the number of genes is strongly correlated with fungal genome size (Stajich 2017), such expansions could be a major contributor to fungal genome size. Importantly, not all fungal pathogens carry large genomes. At the end of the spectrum are the endoparasites Microsporidia, which have among the smallest known fungal genomes, showing signatures of strong genome contractions, including a reduced gene repertoire (Katinka et al. 2001).
The size of genomes is dictated by the balance of mutation, selection and genetic drift. Neutral evolution hypotheses generate predictions regarding the size of the genomes, the size of their deleterious non-coding parts, the number of genes, and the mutation rate (Lynch and Conery 2003; Petrov 2002; Lynch 2007). In eukaryotes, the efficiency of purifying selection depends on the size of the species’ effective population size (Ne) and will thus determine the rate of loss of slightly deleterious non-coding DNA, including mobile genetic elements or introns, leading to larger genomes with more repetitive content in species with smaller Ne. In bacteria, bias towards deletion rates will lead to shortening of the genome in small Ne (Mira, Ochman, and Moran 2001). In ascomycetes, the evolution of genome size on a broad phylogenetic scale supports the neutral hypothesis of genome evolution, but in different ways. In species with larger genomes, indicators of drift such as increased intron frequency and decreased gene density are associated with genome expansions, but in species with small and medium-sized genomes, those indicators are associated with genome size reductions (Kelkar and Ochman 2012). Given the strong impact of both demographic history (Ne) and host-specific adaptations of pathogens on their genomes, we reasoned that further examination of genomic sequences in groups of species with various lifestyles can generate predictions regarding the architecture of pathogenic genomes.
In this study, we investigate how fungal genome size and complexity associates with pathogenic lifestyle in the diverse ascomycete class of Sordariomycetes fungi. Sordariomycetes have rich genomic resources and include species with a wide range of lifestyles, including plant pathogens and saprophytes, and several groups of insect pathogens and mutualists. We studied 552 genomes, including 14 newly sequenced and assembled genomes to answer the following questions: 1) which genomic features predict genome size, 2) which genomic traits are associated with pathogenic and insect-associated species, 3) how gene evolution differs across pathogens.
Results
Genome assemblies and phylogeny of Sordariomycetes
We analyzed 552 genomes from the class Sordariomycetes (Ascomycota) together comprising fungi mostly represented by plant pathogens, saprotrophs, and entomopathogens (Supplementary Table 1). Most genomes were downloaded from NCBI or other sources (Supplementary Table 1) and complemented with 14 genomes sequenced and assembled in this study (Supplementary Table 2). Genome size ranged from 20.7 Mbp in Ceratocystiopsis brevicomis (CBS 137839) to 110.9 Mbp in Ophiocordyceps sinensis (IOZ07) and the number of ab initio gene models ranged from 6,280 in Ambrosiella xylebori (CBS 110.61) to 17,878 in Fusarium langsethiae (Fe2391). A maximum likelihood tree was generated from 1000 concatenated single-copy conserved proteins and had a 100% bootstrap support for all major nodes except for one (support of 82% for the split between two subclades of Ophiostomatales, Figure 1A).

Several genomic traits are correlated with genome size in Sordariomycetes.
A. Maximum likelihood tree based on 1000 concatenated protein sequence alignments calculated with IQ-TREE using ultrafast bootstrap approximation (n=563 species). The largest orders are indicated with different colors. Bootstrap support for all major clades except one within Ophiostomatales (82%, black dot) reached 100%. B. Spearman’s rank correlation for phylogenetic independent contrasts of all pairwise combinations of genomic traits. C. Principal component analysis of genomic features. Colors correspond to orders depicted in A. D. Model of genome size. On the y axis are the model coefficients with 95% confidence intervals obtained with phylogenetic generalized least squares (pgls). Principal components correspond to the three main principal components based on 11 genomic traits. E. Testing hypotheses of genome size evolution. First four plots show correlation of dN/dS as a proxy of Ne with the non-coding elements of the genome. The last plot is a correlation of gene loss rate with genome size. Correlations were tested using Spearman’s rank correlation on phylogenetic independent contrasts. Loadings of each principal component from figure C are shown in Supplementary Figure 1. Raw data underlying figures are in Figure 1-Source Data 1-3.
Genome size predictors
We examined the correlation between genome size (bp) and 11 genomic traits, including the number of genes, the repeats proportion in genome assembly, the size of the assembly excluding repeat content (bp), GC content, the mean number of introns per gene, mean intron length (bp), mean exon length (bp), the fraction of genes with introns, mean intergenic length (bp) and the number of tRNA and pseudo tRNA genes.
Most genomic features were correlated with genome size and with each other, with the strongest positive correlation observed between the size of the assembly excluding repeats and the number of genes (Figure 1B). Exon length, number of genes with introns and GC content were negatively correlated with genome size. We extracted three major principal components from all 11 traits, which together explained 66% of variance among species. The first PC was mostly driven by coding traits (exon length, number of introns, genome without repeats, genes), whereas the second PC was mostly driven by non-coding parts of the genome (repeats, intergenic length, GC and intron length, Supplementary Figure 1). The third PC was mainly driven by the proportion of genes with introns, tRNAs and number of introns. Modeling genome size using these 3 PCs showed that all axes were important predictors of genome size, with PC1 being the strongest (Figure 1C). PC1 also showed negative interaction with PC2 and PC3. This suggests that fungi with large genomes have in general either an excess of non-coding elements or an excess of coding elements but rarely both of them at the same time.
We calculated genome-wide ratio of non-synonymous to synonymous substitution rates (dN/dS) as a proxy for an effective population size (Ne) (Kelkar and Ochman 2012; Lefébure et al. 2017) to determine its association with the non-coding portion of the genome. We found no correlation between dN/dS and non-coding traits such as repeat content, intergenic length, intron length or number of introns (Figure 1D), therefore we found no confirmation of accumulation of slightly deleterious DNA in species that have high dN/dS indicative of small Ne. As the amount of non-coding elements is generally minimal in our dataset compared to other eukaryotes, such correlation may be difficult to spot. The rate of gene loss was negatively correlated with genome size (Figure 1D), but was not correlated with dN/dS (Spearman’s Rho=-0.05, P-value=0.44). This suggests that variations in deletion rates may influence genome size dynamics in this group of fungi.
Genomic traits associated with pathogenicity
We found that pathogens were not associated with larger genomes, but rather a greater total number of genes, effectors, tRNAs, and fewer non-coding regions (Figure 2B). However, these patterns were not consistent across the tree (Supplementary Table 2). To explore genomic traits associated with pathogenicity, we applied three methods that account for phylogenetic non-independence: an MCMC evolutionary model of binary traits (BayesTraits), phylogenetic logistic regression (Phyloglm), and random forest analysis. We assessed the same 11 genomic traits as before, adding genome size and the number of effectors. All three methods consistently pointed to the number of effectors and tRNA genes as predictors of pathogenicity (Figure 2B, Supplementary Figure 2). There was no support for a positive association between repeats and pathogenicity, and only one method indicated that pathogens tend to have larger genome sizes (Figure 2B). The total number of genes, genome size without repeats, and the number of pseudo tRNAs were also positively associated with pathogenicity in at least two methods.

Genomic traits associated with pathogenicity.
A. Time-scaled phylogeny of Sordariomycetes colored by the reconstructed genome size displayed in log scale. The outside heatmap indicates species which are pathogenic and insect-associated (IA). Letters and numbers correspond to clades used in subsequent analyses, with representative genera from each clade listed in the legend. B. Association of 13 genomic traits with pathogenicity estimated with BayesTraits, phylogenetic logistic regression (Phyloglm), and random forest classifier. In BayesTraits and Phyloglm, colors correspond to statistically significant associations, either positive (brown) or negative (green). In Phyloglm these are based on coefficient sign, in BayesTraits they were inferred based on transition rates between binary traits estimated from the model. “YES” indicates that dependent co-evolution was detected with BayesTraits but a uniform direction of association could not be deduced from transition rates. Grey color scale depicts prevalence of species classified as pathogenic and non-pathogenic with traits size below median (LOW) or above median (HIGH). C. Same three methods repeated for 10 subsets of the data with an equal number of pathogens and non-pathogens from each genome size bin. In random forest analysis, rank of the feature, instead of importance is given. Transition rates (modeled gains and losses of traits) are shown in Supplementary Figure 2. Raw data underlying figures are in Figure2-Source Data 1-2.
Our dataset has more pathogenic species than non-pathogenic ones (1.8:1), and this is true in particular for species with medium and large genomes (>50 Mb, 4.5:1). This could represent a genuine pattern or a bias due to the focus on sequencing pathogenic species with larger genomes. Even though we didn’t find an association between pathogenicity and genome size, we accounted for potential sampling bias to test the robustness of our findings. Species were grouped by genome size, and we subsampled equal numbers of pathogenic and non-pathogenic species from each group, repeating the process 10 times. These 10 subsets were analyzed using the same methods in parallel, and nearly all confirmed that the number of effectors and tRNA genes were the strongest predictors of pathogenicity (Figure 2C, Supplementary Figure 2). Genome size analyses yielded mixed results. Out of the 10 random subsets, BayesTraits identified co-evolution of genome size with pathogenicity but with no clear directional trend. In contrast, Phyloglm coefficient estimates for genome size varied, showing either positive or negative associations depending on the subset. The analysis also revealed some evidence of a negative association between pathogenicity and non-coding elements such as repeats, intron length, number of introns, and intergenic regions. This could explain the absence of consistent positive correlation between pathogenicity and genome size, as the predicted increase in gene number in pathogens is likely offset by a reduction in non-coding elements.
We also examined whether associations with pathogenicity are consistent across the entire phylogenetic tree or vary between branches. The model with varying rates of evolution provided a significantly better fit for all genomic traits (Supplementary Table 2).
Genomic traits of insect-associated (IA) species
Our analyses revealed that insect-associated (IA) species tend to have smaller genomes and genes with longer exons on average (Figure 3A). All three methods consistently showed a positive association between exon length and IA species (Figure 3A). Additionally, there was support for a negative association between IA species and genome size, genome size excluding repeats, and gene number. Similar to pathogenicity, the evolution of IA traits varied across the phylogenetic tree, as indicated by the covarion models (Supplementary Table 3).

Genomic traits associated with insect association (IA).
A. Association of 13 genomic traits with IA estimated with BayesTraits, phylogenetic logistic regression (Phyloglm), and random forest classifier. In BayesTraits and Phyloglm, colors correspond to statistically significant associations, either positive (brown) or negative (green). In Phyloglm these are based on coefficient sign, in BayesTraits they were inferred based on transition rates between binary traits estimated from the model. “YES” indicates that dependent co-evolution was detected with BayesTraits but a uniform direction of association could not be deduced from transition rates. Grey color scale depicts prevalence of species classified as IA and non-IA with traits size below median (LOW) or above median (HIGH). B. Comparison of exon and intron metrics in 38 one-to-one orthologs between two clades, Microascales (M), Ophiostomatales (O) and Diaportales (D). In the comparisons, one taxon is IA (M1, O) and another one non-IA (M2, D). Exon/intron metrics were averaged within each clade and compared using using paired Mann Whitney U test. C. Correlation between prevalence of gene families within clades and two exon metrics, in same four clades. In all clades, more common gene families have longer and more exons (negative binomial generalized linear model, P-values < 0.05). Raw data underlying figures are in Figure3-Source Data 1-3.
To ensure the association with exon length was not an artifact driven by a specific group of genes or species, we compared exon length and number in one-to-one orthologs between IA and non-IA species from two phylogenetic groups. In both groups, IA species showed longer and fewer exons compared to non-IA species, and no difference was observed in intron length (Figure 3B). We also tested whether genes with longer exons were more likely to be retained in IA species compared to non-IA species. While genes present in most species within a clade (i.e. genes less likely to be deleted) were enriched for longer exons, this pattern was observed in both IA and non-IA species (Figure 3C).
Pathogens gene evolution depends on lifestyle
Given that our results indicated genome size and gene content vary depending on fungal lifestyle, we investigated how IA pathogens evolve compared to non-IA ones. Using phylogenetic logistic regression, we modeled the IA trait, incorporating key genomic traits one at a time as predictors and including pathogenicity as an interaction term. This analysis was performed separately in three distinct fungal clades (H, M and a O/Ma/D/S comprising four subclades, Figure 2A). Models incorporating genome size and exon length could only be fitted for clade H, revealing a negative association between IA species and genome size, and a positive association with exon length in both pathogens and non-pathogens (Figure 4A). In clades O/Ma/D/S and M, genes showed a stronger negative association with IA in pathogens compared to non-pathogens. In clade H, tRNAs exhibited a stronger negative association with IA in pathogens than in non-pathogens. Additionally, effectors were negatively associated with IA in non-pathogens across all three clades, but not in pathogens.

Evolution of pathogens with different lifestyles.
A. Models of IA trait using phylogenetic logistic regression for each of the five genomic traits and pathogenicity as predictors. Coefficients of genomic traits for pathogenic and non-pathogenic species with 95% credible intervals are shown. B. Rates of gene loss and gain in four groups of species with different pathogenicity and IA status, estimated based on 527 small gene families using birth and death model in the program CAFE v5. Estimates of gene evolutionary rates (λ) are shown above the boxplots. Raw data underlying figures are in Figure4-Source Data 1-2.
Host-pathogen interactions often result in rapid gene turnover due to ongoing adaptation to host defense mechanisms, leading to gene family expansions and contractions (Gómez Luciano et al. 2019; Pendleton et al. 2014; Hartmann and Croll 2017; Stalder et al. 2023). To explore this, we analyzed a subset of ∼500 gene families to test whether the rate of gene evolution (λ, the rate of gene loss and duplication (Mendes et al. 2020)) differs across species with different lifestyles. Species were categorized into four groups based on their pathogenicity and IA status, and we fitted models with one, two, or four rate parameters. The model with four distinct rates was significantly better than the simpler models (likelihood ratio tests (LRT) 4 vs. 1 rates: χ2=7.8, P-value=0; LRT 4 vs. 2 rates: χ2=6, P-value=0.002). Gene evolutionary rates were higher in non-IA species and lower in IA species, regardless of pathogenicity (Figure 4B). Non-IA pathogens showed the highest evolutionary rates, while IA pathogens exhibited the lowest.
These results indicate that IA species are associated with slower gene evolutionary rates overall, and some IA clades also with fewer genes, particularly in pathogenic species. However, some genes important for pathogens, such as effectors, are not depleted to the same extent as in non-pathogens.
IA plant pathogens preferentially lose genes important for host colonization
To investigate the functional gene classes gained or lost in fungi with different lifestyles, we calculated the fold change in the number of various gene classes across 19 clades (Figure 2A) since their last common ancestor with sister clades. The clades include plant pathogenic fungi (H1, Ma, G, D), entomophathogens (H2.3, H2.4, H2.6), insect symbionts (H2.2), insect-vectored species (M1, O, H2.8), saprotrophs (S, H2.1), and mixed lifestyle groups. We utilized a range of databases to annotate genome assemblies or ab initio gene models (as detailed in the Methods).
Nearly all IA clades (M1, H2.8, O, H2.4, H2.1), along with a non-IA clade derived from an IA ancestor (H2.2), a saprotrophic clade (S), and one mixed clade (H2.9), experienced the largest gene losses (Figure 5). Entomopathogens exhibited a range of changes, with H2.4 showing mostly gene losses, H2.3 mostly gene gains, and H2.6 showing a mix of both. Conversely, the three clades that experienced the most gene gains were plant pathogenic clades (D, H1, G) and one IA clade (H2.3).

Insect-vectored clades lose genes involved in breaking plant host barriers.
Heatmap shows the fold change of genes/clusters relative to the ancestral state ((observed - ancestral)/ ancestral state). Clades are shown in columns (clade names correspond to the ones in Figure 2A) with the number of clade members in parentheses; functional classes are shown in rows. Clades comprise plant pathogenic fungi (H1, G, D), entomophathogens (H2.4, H2.6, H2.3), insect symbionts (H2.2), insect-vectored species (M1, O, H2.8), saprotrophs (S, H2.1), or mixed lifestyle groups. The dots indicate significant gain (brown) or loss (green) of genes/clusters across clade members estimated from 100 rounds of bootstrapping of 10 species in clades with >= 10 members. SMC: secondary metabolite clusters, CAZy: carbohydrate-active enzymes. Same heatmap but for pathogenic species only is shown in Supplementary Figure 3. Raw data underlying figures are in Figure5-Source Data 1-4.
Insect-vectored species (M1, O, H2.8) demonstrated the highest gene losses in genes associated with secondary metabolite pathways, defense mechanisms, and carbohydrate and lipid transport and metabolism (Figure 5). This was also evident from the gene losses among different categories of CAZymes (carbohydrate-active enzymes), and SMCs (secondary metabolite clusters), as well as peptidases, transcription factors, and effectors. These patterns remained consistent when focusing solely on plant pathogens (Supplementary Figure 3). The only gene classes that did not exhibit a reduction or showed gains in these three clades were tRNAs, genes involved in cytoskeleton functions, and cell motility (Figure 5). Other IA species displayed more variability in changes to their gene repertoires, typically showing some gains in SMC genes (H2.1, H2.6). Two clades enriched with non-IA plant pathogens (H1, G) demonstrated the highest gains in host-related genes. Clade H1 experienced the most gene gains in genes related to secondary metabolite pathways, defense mechanisms, carbohydrate and amino acid transport and metabolism, wheres clade G experienced most gene gains in genes related to extracellular structures, secondary metabolite pathways, carbohydrate transport and metabolism, and the cell wall/membrane/envelope biogenesis.
Overall, we observed common gene gains, particularly in CAZy and SMC classes, among many crop (clades H1, G) and some tree pathogens (clade D, Figure 5, Supplementary Figure 3). On the other hand, insect-symbionts (clade H2.2), insect-vectored species, including plant pathogens (clades M1, H2.8, O), and specialized entomopathogens (H2.4, H2.6) exhibited variable levels of gene losses, sometimes in genes critical for host interactions (Figure 5, Supplementary Figure 3). An exception was the IA clade H2.3, which includes the genera Metarhizium and Pochonia, both known for their ability to adapt to various lifestyles or hosts.
Discussion
Fungal pathogens are commonly linked to notably large genomes (Raffaele and Kamoun 2012; Wacker et al. 2023) yet they exhibit considerable diversity in both genome size and structure (Möller and Stukenbrock 2017). We focused on the class of Sordariomycetes, which includes fungi with diverse lifestyles, to investigate three questions: first, which genomic traits best explain genome size; second, whether genome size can be attributed to pathogenic lifestyle; and third, whether pathogens share common genomic traits. Our analysis revealed that gene number is the strongest predictor of genome size, followed by non-coding regions. We found no correlation between genome size and pathogenicity. Instead, pathogenicity was on average best explained by the number of effectors and tRNAs, but the overall genome architecture of pathogens was strongly influenced by the type of interacting host and the presence of potential vector species.
Genome size evolution in fungi
We found that the number of genes and their length—specifically, exon length and number—are the primary drivers of genome size in Sordariomycetes. While non-coding regions of the genome also contribute to this, they are less significant and are more frequently found in gene-poor species. These findings suggest that genes and non-coding regions do not evolve in concert or at the same rate.
To determine whether neutral processes can explain genome size in Sordariomycetes, we tested two hypotheses. The first posits that a decrease in effective population size (Ne) reduces the effectiveness of selection, including purifying selection, which in eukaryotes typically leads to a greater accumulation of deleterious non-coding elements and larger genome sizes (Lynch and Conery 2003, Lynch 2007). The second hypothesis suggests that, due to a higher rate of deletions compared to insertions, smaller Ne results in more streamlined genomes, similar to observations in bacteria (Kuo, Moran, and Ochman 2009; Mira, Ochman, and Moran 2001). Using the genome-wide ratio of non-synonymous to synonymous substitution rates (dN/dS) as a proxy for Ne (Lefébure et al. 2017; Kelkar and Ochman 2012; Marino et al. 2024), we did not find the link between Ne and the non-coding portion of the genome. Although we observed a weak negative correlation between the gene deletion rate and genome size, there was no correlation between the gene deletion rate and dN/dS.
The lack of clear association of dN/dS with the non-coding content could arise from several factors. First, dN/dS may not be an appropriate indicator of Ne, or species analyzed may not be in mutation-selection balance. Second, fungi typically have much more gene-rich genomes than most of other eukaryotes (Wells and Feschotte 2020; Merényi et al. 2023), and the species in this study generally have modest genome sizes, which may weaken the relationship between Ne and the non-coding regions. Lastly, the large non-coding content is often driven by the proliferation of transposable elements (TEs), which are known to be very dynamic. The absence of significant association between dN/dS and TE content on the long evolutionary timescales has been noted (Marino et al. 2024), suggesting possible variations in selective effects of TEs across different taxonomic groups.
Support for the neutral hypothesis of genome size evolution in our study is indirect. We found that species associated with insects—such as endoparasites and symbionts, therefore species that likely undergo transmission bottlenecks and in effect have reduced Ne (Mira and Moran 2002)—have smaller genomes and in some cases fewer genes. Additionally, insect-associated species have genes with fewer but longer exons, suggesting the evolution of a more streamlined gene architecture. While these findings align with neutral theories of genome evolution, further research is necessary to identify the primary processes, including any potential selective factors, that drive the dynamics of gene content in fungi.
Genomic predictors of fungal pathogens
Large genome size driven by repeats has often been implicated as one of the hallmarks of fungal pathogens, especially fungal plant pathogens (Raffaele and Kamoun 2012). However, in our analysis of over 500 genomes of Sordariomycetes, we found no evidence supporting a positive association between genome size or repeat content and pathogenic fungi. While a few species with the largest genomes are indeed plant pathogens and possess a high proportion of repeats, such instances appear to be exceptions rather than the rule on a broader scale. Although the high activity of transposable elements can contribute to genome expansion and facilitate the evolution of pathogenicity-related traits (Hartmann et al. 2017; Wacker et al. 2023), our findings suggest that this is not the predominant pathway for fungal pathogen evolution.
We found that the overall number of genes, and genome size without repeats showed positive associations with pathogenicity on average. However, we can only speculate whether this pattern arises from drift or selection. Fungal pathogens comprise a diverse group of species, with specialists or generalists adaptations and varying demographic histories. Pathogens vary from obligate endoparasites, free-living saprotrophs who become opportunistic pathogens, to pathogens of common crops or plants with fluctuating or large Ne. Unfortunately, the specifics of the population dynamics and life histories of many pathogens remain largely unknown. Following the idea that smaller effective population sizes lead to smaller, more streamlined genomes in Sordariomycetes, the higher gene counts in pathogens may imply that most pathogens posses larger Ne compared to non-pathogens. Alternatively, the greater number of genes in pathogens might reflect selective pressures favoring pathogen-specific gene expansions. An arms race between hosts and pathogens drives the continual evolution of new effector gene variants in pathogens, enabling them to evade the host immune system (Boller and He 2009; Stalder et al. 2023). Our findings indicate association of genes important for pathogenicity, such as effectors, with pathogens which can contribute to this overall pattern. In addition, we observe that clades with different lifestyles undergo distinct gene loss and gain dynamics, further suggesting a selective advantage for certain gene duplications.
Effectors and tRNAs were the only specific gene classes we examined in our analysis and they proved to be better associated with pathogenicity than the overall gene count. The significance of effectors in pathogens is well established (Stergiopoulos and de Wit 2009; Wilson and McDowell 2022). For instance, mutations in effector genes can provide pathogens with an advantage by enabling them to evade host immune defenses (Sánchez-Vallet et al. 2018). The role of tRNA modifications in pathogenicity has also been recognized (Hinsch, Galuszka, and Tudzynski 2016; G. Li et al. 2023), although the overall count of tRNAs remains less clear. In large eukaryotic genomes, short interspersed nuclear elements can lead to the misannotation of tRNAs (Gibbs et al. 2004). Therefore the role of tRNAs in fungal pathogens warrants further investigation. Other gene families could be important predictors of pathogenicity. Carbohydrate-active enzymes profiles are reported to be different in pathogenic and non-pathogenic fungi (Dort et al. 2023; Haridas et al. 2020; Hane et al. 2019). Given that overall gene repertoires between pathogens can vary drastically, small-scale evolution of specialized genes or gene families can better reflect fungal transmissions to pathogenic lifestyles.
Different genomic evolutionary pathways for fungal pathogens
Our study showed that the insect-association (IA) trait was negatively correlated with genome size and positively with exon length. Additionally, the total number of genes was also significantly reduced, particularly in insect-transmitted clades. Previous research showed that specialized fungi tend to have smaller genomes than generalist fungi (Loos et al. 2024), and the associations we observed between genomic traits and IA species may partially reflect the degree of specialization towards their hosts. In our dataset, IA species include several species of entomopathogens, many of which are host-specific. Plant pathogens vectored by insects are also largely host specific.
Unfortunately, it is challenging to obtain a comprehensive data on the full range of host species represented in our dataset, which limits our ability to compare genome sizes between specialist and generalist species.
We found significant differences in gene evolution between IA and non-IA pathogens. While non-IA pathogens had the highest gene evolution rates, IA pathogens on average had the lowest. However, among the three clades of IA species included in our study, non-pathogenic species displayed a more pronounced reduction in effector repertoires compared to pathogenic species, highlighting the critical role of these genes across a broad range of fungal pathogens. Two IA clades, Ophiostomatales and Microascales, exhibited losses in gene classes important for host colonization and wood decomposition, such as CAZymes, or secondary metabolite clusters (Scharf, Heinekamp, and Brakhage 2014; Cantarel et al. 2009). This is likely because the insect vector often facilitates host entry in these species, rendering many genes redundant and prone to pseudogenization and loss. In the third clade, Hypocreales, patterns of gene losses and gains varied across species. Hypocreales is our dataset’s largest order and includes both specialized and generalist species with diverse host and insect associations. Furthermore, many species within this order have lifestyles that remain poorly understood. Distinguishing between pathogenic and non-pathogenic lifestyles can be challenging, particularly for species found on dead host tissues, which may function as opportunistic pathogens or secondary saprophytes that benefit from decaying matter. These factors may partially contribute to the variation that we see across species.
Conclusions
Our study shows that genome size of Sordariomycetes is primarily determined by the number of genes, with non-coding regions playing a lesser role. We found no association between pathogenic species and larger genomes or higher repeat content compared to non-pathogenic species. While many pathogens have a high number of genes, especially effectors and tRNAs, this pattern does not hold true for all species. While gene gains are common to some plant pathogens, others, including, insect-vectored plant pathogens show frequent gene losses, including losses in gene classes important for host interactions. Additionally, insect-associated species tend to have smaller genomes and longer exons, likely reflecting their level of specialization toward hosts and vectors—factors that may influence their effective population size. In summary, our findings suggest that the genome architecture of fungal pathogens is mainly shaped by the dynamics of pathogenicity-related genes and the nature of interactions with hosts and other species.
Materials and methods
Genome assemblies
We selected 14 strains (11 x Ophiostoma and 3 x Leptographium, Supplementary Table 2) for high-depth short-read Illumina sequencing. Isolates were grown on Malt Extract Agar medium (15 g l−1 agar, 30 g l−1 malt extract, and 5 g l−1 mycological peptone), and DNA was extracted with acetyl trimethylammonium bromide chloroform protocol. The fungal isolates were handled in a facility that has received a Plant Pest Containment Level 1 certifcation by the Canadian Food Inspection Agency. Library preparation and sequencing were conducted at the Génome Québec Innovation Center (Montréal, Canada). One strain (O. quercus) was sequenced with Illumina NovaSeq (paired-end 150 bp), and the rest with Illumina HighSeq X (paired-end 150 bp). Data quality was inspected with Fastqc v0.11.2/8 (Andrews 2010). Ten strains were used for de novo genome assembly (Supplementary Table 2). The depth of coverage of the generated data was between 32x and 555x. Reads were trimmed for adapters with Trimmomatic v0.33/0.36 (Bolger, Lohse, and Usadel 2014) using options ‘ILLUMINACLIP:adapters.fa:6:20:10 MINLEN:21’ and overlapping reads were merged with bbmerge from BBTools v36/v37 (Bushnell, Rood, and Singer 2017). De novo genome assemblies were generated with SPAdes v3.9.1 (Bankevich et al. 2012) with options ‘-k 21,33,55,77,99 --careful’. Mitochondrial DNA was searched in contigs with NOVOPlasty v3.8.3 (Dierckxsens, Mardulyn, and Smits 2017) using the mitochondrial sequence of O. novo-ulmi as a bait (CM001753.1) and matching contigs were removed. Reads were remapped to the nuclear assembly with bwa mem v0.7.17 (H. Li and Durbin 2009), and contigs with normalized mean coverage < 5% and shorter than 1000 bp were also removed.
A total of 11 Ophiostoma strains were selected for long-read sequencing with PacBio (Supplementary Table 2). DNA extraction was done in the same way as for Illumina libraries, except that no vortexing and shaking were done to avoid DNA fragmentation. Library preparation and sequencing with the PacBio SMRTcell Sequel system were conducted at the Génome Québec Innovation Center (Montréal, Canada). The average depth of coverage of the generated data was between 47x and 269x. De novo genome assemblies were generated with pb-falcon v2.2.0 (Chin et al. 2016). The range of parameters was tested and the final configuration files with the best-performing parameters for each species can be found on https://github.com/aniafijarczyk/Fijarczyk_et_al_2023. O. quercus was sequenced in two runs and read data from the two runs were combined together. O. novo-ulmi (H294) was sequenced in two runs but only read data from one run was used for an assembly due to sufficient coverage. Assemblies were ordered according to O. novo-ulmi H327 genome (GCA_000317715.1) using Mauve snapshot-2015-02-13 (Darling et al. 2004). Assemblies were polished between two to four times by remapping long reads with minimap2 (pbmm v1 (H. Li 2018)) and correcting assemblies using arrow (pbgcpp v1). We also performed one round of polishing with pilon v1.23 (Walker et al. 2014) after mapping short-read Illumina reads (bwa v0.7.17 (H. Li and Durbin 2009)) to assemblies from this study (n=3) or from the previous study (Hessenauer et al. 2020) (n=7, Supplementary Table 2). The only exception was O. quercus, for which we had no corresponding short-read data, therefore we performed four rounds of correction using arrow. The effectiveness of polishing was assessed by analyzing the completeness of genes with Busco v3 (Waterhouse et al. 2018). Mitochondrial genomes were assembled using Illumina assemblies as baits (or those of related species). Long reads were mapped to bait assembly with pbmm2 (H. Li 2018), and a subsample of mapped reads was used for mtDNA assembly with mecat v2 (Xiao et al. 2017). Consecutive rounds of mapping and assembly were conducted until circularized assemblies were obtained. Nuclear assembly contigs with more than 50% of low-quality bases, those mapping to mtDNA, or with no mapping of Illumina reads (standardized mean coverage < 5%) were filtered out. The genome assembly of O. montium was very fragmented and had a high percentage of missing conserved genes (74.3%) so instead of a PacBio assembly, an Illumina de novo assembly for this species was considered in further analysis.
In both short-read and long-read assemblies, contigs were searched for viral or bacterial contaminants by BLAST searches against bacterial or viral UniProt accessions and separately against fungal UniProt accessions. Contigs having more bacterial/viral hits in length than fungal hits were marked as contaminants and removed. Finally, repeats were identified with RepeatModeller v2.0.1 (Flynn et al. 2020) using option -LTRStruct, and recovered repeat families together with fungal repeats from RepBase were used for masking assemblies with RepeatMasker v4.1.0 (Tarailo-Graovac and Chen 2009).
Genes were annotated with Augustus v.3.3.2 (Stanke and Morgenstern 2005) and Breaker v2.1.2 (Hoff et al. 2019). To obtain ab initio gene models in Ophiostoma novo-ulmi, ulmi, himal-ulmi, quercus and triangulosporum, a training file was generated using RNA-seq reads from O. novo-ulmi H327 (SRR1574322, SRR1574324, SRR2140676) mapped onto a O. novo-ulmi H294 genome with STAR v2.7.2b (Dobin et al. 2013). A training set of genes was generated with GeneMark-ES-ET v4.33 (Lomsadze et al. 2005). Final gene models were merged from ab initio models, models inferred from the alignment of RNA-seq reads (except O. triangulosporum) and O. novo-ulmi H327 proteins. The rest of the genome assemblies were annotated only ab initio using Magnaporthe grisea species training file.
Sordariomycetes genomes
A total of 580 genome assemblies (one reference genome per species) were downloaded from NCBI (21/05/2021), the two assemblies were downloaded from JGI Mycocosm (S. kochii, T. guianense), and two from other resources (O. ulmi W9 (Christendat 2013), L. longiclavatum (Wong et al. 2020)). All assemblies were filtered for short contigs (< 1000 bp) and contaminants as described above. Gene completion was assessed with Busco v3 (Waterhouse et al. 2018) using an orthologous gene set from Sordariomycetes. Ab initio gene models were obtained with Augustus v.3.3.2 (Stanke and Morgenstern 2005) with different species-specific training files: Magnaporthe grisea, Fusarium, Neurospora, Verticillium longisporum1 or Botrytis cinerea (Supplementary Table 1). Next, 31 assemblies were removed based on the low percentage of complete conserved genes (Busco score < 85%) and one due to an excessive number of gene models (Botrytis cinerea). The number of genes was systematically underestimated (10% on average) compared to the gene numbers that had been submitted to NCBI for the corresponding species, but underestimation was not different between pathogenic or non-pathogenic fungi (Wilcoxon rank sum test, w = 2994, p = 0.57).
Phylogeny
The maximum likelihood tree was built from 1000 concatenated genes (retrieved with Busco v3 (Waterhouse et al. 2018)) with the highest species representation. Protein alignments were generated with mafft v7.453 (Katoh and Standley 2013) with E-INS-i method (option ‘--genafpair --ep 0 --maxiterate 1000’), trimmed with trimal v1.4.rev22 (Capella-Gutiérrez, Silla-Martínez, and Gabaldón 2009) with option ‘-automated1’ and converted into a matrix. Best protein evolution model JTT+I+G4 was chosen as the most frequent model across all protein alignments, based on the BIC score in IQ-TREE v1.6.12 (Nguyen et al. 2015; Kalyaanamoorthy et al. 2017). The maximum likelihood tree was inferred with ultrafast bootstrap (Hoang et al. 2018), seed number 17629, and 1000 replicates implemented in IQ-TREE. Time-scaled phylogeny was inferred with program r8s v1.81 (Sanderson 2003), setting the calibration point of 201 My at the split of Neurospora crassa and Diaporthe ampelina, retrieved from TimeTree (Hedges and Kumar 2005).
Species and trait selection
New and downloaded genome assemblies after filtering together amounted to 573. Few species were represented by more than one strain, in particular, multiple Ophiostoma strains sequenced in this study. To avoid potential biases related to the overrepresentation of some species, only one strain per species was retained, leaving 563 species in total, including 11 outgroup species.
Pathogenicity and insect association traits were assigned based on a literature search. Pathogenicity was assigned if the species caused a well-recognized disease, or pathogenicity was experimentally documented in at least one host species. We considered pathogens of plants, animals, and fungi, both obligatory and opportunistic. Insect association included all types of relationships with insects, including pathogenic, symbiotic, and mutualistic. Insect-vectored species were limited to documented cases of insect transmission. Analyzed genomic traits included genome size (bp), number of genes, number of effectors, a fraction of repeat content, size of the assembly excluding repeat content (bp), GC content, the mean number of introns per gene, mean intron size (bp), mean exon size (bp), a fraction of genes with introns, mean intergenic length (bp) and number of tRNA and pseudo tRNA genes. Genome size is equivalent to assembly size after filtering. Genome size excluding repeat content is assembly size excluding regions masked by RepeatMasker (in this study or downloaded from NCBI). Repeat content was calculated as the proportion of masked bases in the raw assembly (after removing contaminated contigs, but before filtering for contig length) compared to the raw assembly size. Gene number corresponds to the number of all ab initio gene models obtained with Augustus. GC content is the proportion of GC bases in the total assembly. The mean number of introns per gene, intron and exon size, a fraction of genes with introns, and intergenic length were estimated from gff files with annotated gene models. tRNA and pseudo tRNA genes were obtained with tRNAscan-SE v2.0.9 (Chan et al. 2021). Effectors were identified from predicted proteins, first by running SignalP v6 (Teufel et al. 2022) to identify proteins containing signal peptides, and then by running EffectorP v3 (Sperschneider and Dodds 2022) on them to identify effectors.
Correlation of genomic traits with genome size
The phylogenetic generalized least squares (pgls) model of genome size was built with gls function from the R package nlme. Maximum likelihood (ML) method was used, and the phylogenetic relationships were modeled with the Pagel’s lambda correlation structure. We considered 11 genomic traits (except for genome size, and effectors) as explanatory variables. First, the correlations among pairs of genomic traits were calculated after transforming variables to independent contrasts using pic function from the R package ape (Paradis, Claude, and Strimmer 2004). Pairwise correlations were tested with non-parametric Spearman’s rank correlation test. As it turned out, many variables were strongly correlated with each other. Therefore, instead of including all variables in the model, PCA was run on 11 genomic variables to retain three main components, which explained 33%, 19%, and 13% of the trait variance respectively. The loadings revealed that PC1 was mainly influenced by genic traits (exon length, introns, genome w/o repeats, genes), PC2 by non-coding traits (repeats, intergenic length), whereas PC3 by other traits (tRNA, genes with introns). We fitted the PGLS model with the three main PCs and interactions among them. The model was checked for homogeneity of variance, and normal distribution of residuals, and was refitted after removing 8 outliers. Model fit was assessed using anova() function. To compute the importance of each component in the model, R2 was calculated with the R2 function from the rr2 R package after removing each component at a time and comparing it with the full model.
Correlation between dN/dS and non-coding regions
Nonsynnymous to synonymous substitution rate (dN/dS) was calculated as a proxy for effective population size (Ne). Species with a larger Ne are expected to exhibit a lower rate of nonsynonymous to synonymous substitution rate than species with smaller Ne due to stronger constraints on nonsynonymous deleterious mutations. dN/dS was calculated with codeml program in PAML v.4.10.6 (Yang 2007), on a codon-wise alignment of 300 random concatenated orthologs identified with Busco. Specifically, for each species, calculations were performed on a subset of ten most closely related species, with one species being the foreground branch and the rest serving as the background. Species, whose dN/dS estimate was based on a comparison with a recently diverged species showed elevated dN/dS values, likely because of the stronger contribution of segregating polymorphism compared to fixed differences to divergence estimates. To account for this time-dependence, logarithm of species divergence times was fitted to a logarithm of dN/dS using generalized least square model, and model residues were taken to serve as dN/dS estimates in further correlations.
To test if species with smaller Ne tend to accumulate more non-coding deleterious material, correlation between phylogenetically independent contrasts of time-corrected dN/dS estimates and log transformed traits such as repeat content, intergenic length, intron length and number were calculated using Spearman’s rank correlation.
Correlation between the gene loss rate and genome size
To test if differences in deletion rate can account for genome size changes, a correlation between the rate of gene loss and genome size was tested. Orthologous gene families for 563 species were determined with OrthoFinder v2.5.2 (Emms and Kelly 2019). Gene contractions were estimated by applying a birth-death model of gene evolution with CAFE v5 (Mendes et al. 2020). A single lambda for all branches was estimated, with a poisson distribution for gene family counts at the root. As the likelihood scores could not be computed for all gene families, dataset was limited to 527 gene families present at the root and with a maximum change of 4 genes across species. This procedure has likely led to the removal of many fast evolving and big gene families and provides estimates on the rather conserved set of gene families. Lambda search was run three separate times to assure convergence. The rate of gene loss was calculated by summing gene family contractions and dividing by the calibrated branch length (in My).
Correlation on phylogenetically independent contrasts of log transformed genome size and log transformed gene loss rate was calculated using Spearman’s rank correlation.
Association of genomic traits with pathogenicity and IA
Association of pathogenicity and insect association (IA) with genomic traits was estimated using three approaches, i) a reversible jump MCMC discrete model of evolution implemented in BayesTraits v3 (Pagel and Meade 2006), ii) a phylogenetic logistic regression with phyloglm function in R package phylolm (Ho and Ané 2014), iii) random forest classification using scikit-learn python (v3) package. In tests of pathogenicity associations, analyses were also run separately within 10 subsets, where each subset comprised an equal number of pathogens and non-pathogens (n = 190,190) sampled from five genome size bins (between 20 and 70 Mbp, with a 10 Mbp step).
BayesTraits tests coevolution of a pair of binary traits, by modeling eight possible transition rates between the two traits (transition of only one trait at a time). We modeled coevolution of each genomic trait with pathogenicity or IA. Genomic traits were converted into binary traits based on their median (0 if below median and 1 if above median), such that trait gain indicates increase of the size of that trait, and trait loss indicates decrease in the size of the trait. For each genomic trait, two models were compared, one with independent and one with dependent evolution of the two traits. In addition, to test if some transition rates are more frequent in one direction (gains vs losses of a trait), four models were run in which we set equal rates of change for the opposite transitions (for example the rate of gain of genome size in pathogens was set to equal the rate of loss of genome size in pathogens), and compared them to dependent models using log Bayes factor. Based on all significant transition changes (gains or losses of one of the trait in the presence of the other) we determined if the genomic trait is positively associated with the pathogenicity/IA (gain of the trait for pathogens/IA, or loss of the trait in non-pathogens/non-IA). Finally, we also compared dependent model with the covarion model in which coevolutionary transitions were allowed to vary across the phylogeny (Venditti, Meade, and Pagel 2011). Models were selected if the log Bayes factor surpassed 4. Each model was run three times to check for consistency, for 21 mln iterations, 1 mln burn-in, and thinning of 1000.
Phylogenetic logistic regression was used to fit each genomic trait to pathogenicity/IA using phyloglm function in R with the “logistic_MPLE” method, btol option (searching space limit) set to 30, and 1000 independent bootstrap replicates. Benjamini-Hochberg correction was applied to P-values. Scale of some genomic traits was adjusted. Assembly size with and without repeats was used in unit of 10 Mbp, number of genes in unit of 1000 genes, length of introns and intergenic length in unit of 1 kb, length of exons in unit of 100 bp, number of tRNAs, pseudo tRNAs, and effectors in unit of 100 genes.
To train a machine learning classifier for predicting pathogens and IA species, we tested the accuracy of five classifiers (KNeighbors, SVC, DecisionTree, RandomForest, and GradientBoosting). In each case, randomized grid search and five-fold cross validation was performed on the train data (80%), and balanced accuracy scores were obtained. Scores were averaged across 10 random splits into test and train datasets. Species with missing data were filtered out, and data were rescaled (between 0 and 1) for classification with SVC. As the random forest model showed one of the best scores, we used this model to train on the whole dataset and determine the most important features. Important features were determined using a mean decrease impurity method. In the final model, we used hyperparameters from the data split which gave the best balanced accuracy score. As the samples in the dataset are dependent through phylogenetic relationships, we also determined the impact of phylogeny on the model, by including for each species distances to each node as additional features. We found that node distances never appeared among the top 5 most important features.
Ancestral reconstruction of genome size was performed using fastAnc function in R package phytools (Revell 2011) and visualized on the tree with ggtree R package (Yu 2020) using the continuous option.
To test association of exon length and exon number between IA and non-IA clades, we obtained these features for 38 one-to-one orthologs found with OrthoFinder (described above). Mean values across IA species were compared to mean values in non-IA species for two clade pairs using paired Mann Whitney U test.
To determine whether long exon genes are more or less likely to be deleted compared to short exon genes, we examined the presence of gene families across different clade members. Gene families found in all clade members represent genes that are less likely to be deleted, whereas gene families present in only a few members of the clade represent genes that are more often deleted. We then compared features of the common and rare gene families in terms of exon length and number, by fitting a negative binomial generalized linear model using glm.nb() function from the MASS package in R.
Evolution of pathogens with different lifestyles
The insect association (IA) trait was modeled using phylogenetic logistic regression in the phyloglm model, as described earlier, with each of the five genomic traits (genome size, exon length, number of genes, effectors, and tRNAs) tested individually, including pathogenicity as an interaction term. Models were fitted separately for 3 clades comprising species with a combination of pathogenicity and IA traits (H, M and O/Ma/D/S).
To test if the rate of gene evolution varies between pathogenic species that are IA or non-IA, we used the birth-death model implemented in CAFE v5, to fit one, two or four parameters of lambda on a subset of 527 small gene families. Pathogenicity and IA trait states were determined for each node in the phylogeny using function ace in the R package phytools. Model with equal rates of transition was used for IA (ER model) and a model with non-equal transition rates was used for pathogenicity (ADR model). Models were compared using likelihood ratio tests.
Functional gene classes
Functional gene annotations were determined using several databases. Protein sequences were searched against KOG (release 2003 (Tatusov et al. 2003)) and MEROPS Scan Sequences (release 12.1 (Rawlings et al. 2018)) using diamond v2.0.9 (Buchfink, Xie, and Huson 2015) with options ‘--more-sensitive -e 1e-10’ and against CAZymes (download 24/09/2021 (Cantarel et al. 2009)) with options ‘--more-sensitive -e 1e-102’. Pfam domains were searched with pfam_scan.pl (Mistry et al. 2021) script with hmmer v3.2.1 (Mistry et al. 2013) and matched with Sordariomycetes transcription factors obtained from JGI Mycocosm database (download 9/12/2021). Secondary Metabolite Clusters were inferred from genome assemblies using antiSMASH v4.0.2 (Blin et al. 2019).
For each functional group of genes, and for each one of 19 selected clades, the ancestral number of genes/clusters was inferred in the most recent ancestor of the clade and its sister clade using the fastAnc function from R package phytools (Revell 2011). Fold change was calculated as a difference in observed and ancestral state divided by ancestral state and averaged across all clade members. Gains or losses of genes per clade were tested with a bootstrap of 10 species with 100 replicates for clades with >= 10 members.
Data Availability
Raw short and long reads from sequenced genomes are available at NCBI under project number PRJNA841745. Genome assemblies have been deposited at GenBank under the accessions JANSLN000000000-JANSMH000000000. The versions described in this paper are versions JANSLN010000000-JANSMH010000000. Code used in this study is available on github (https://github.com/aniafijarczyk/Fijarczyk_et_al_2023).
Figure Supplements

Loadings of the main PCs.
Principal component analysis of genomic features. Histogram shows proportion of variance explained by each PC. Dark gray PCs together explain 66% of the variance. Heatmap shows loadings for each PC, with brown colors depicting positive, and green colors depicting negative contributions to the PC.

Numbers of gains and losses among four transition types obtained in BayesTraits run.
Opaque colors show transitions significantly shifted towards gains or losses. The x axis lists all four possible transitions for the two traits (genomic trait and pathogenicity). Two states (before change, after change) are separated by an underscore. 1 stands for either a value of a trait above median or presence of pathogenicity, 0 stands for a value of a trait below median or absence of pathogenicity. For example transition 01_11, indicates a gain of a trait (eg. transition from a small to a large genome size, 0x_1x), and no change in pathogenicity (x1_x1).

Insect-associated pathogens lose genes involved in breaking host barriers.
Heatmap shows the fold change in genes/clusters number relative to the ancestral state for pathogenic species only. Clades are shown in columns (clade names correspond to the ones in Figure 2A) with the number of clade members in parentheses; functional classes are shown in rows. The dots indicate significant gain (red) or loss (green) of genes/clusters across clade members estimated from 100 rounds of bootstrapping of 10 species in clades with >= 10 members. SMC: secondary metabolite clusters, CAZy: carbohydrate-active enzymes.
Acknowledgements
We thank Erika Dort for an access to lifestyle database to confirm pathogenicity traits of 90 species from our dataset. We thank Rohan Dandage, Ilga Porth and Louis Bernier for comments and discussions about the manuscript. This project was funded by Genome Canada and Genome Québec BioSAFE project and a NSERC Discovery grant to CRL. CRL holds the Canada Research Chair in Cellular Systems and Synthetic Biology.
Additional files
References
- FastQC: A Quality Control Tool for High Throughput Sequence Datahttp://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell SequencingJournal of Computational Biology: A Journal of Computational Molecular Cell Biology 19:455–77
- PacBio Sequencing Reveals Transposable Elements as a Key Contributor to Genomic Plasticity and Virulence Variation in Magnaporthe OryzaeMolecular Plant 10:1465–68
- Gene Family Expansions and Contractions Are Associated with Host Range in Plant Pathogens of the Genus ColletotrichumBMC Genomics 17:555
- Systematic Analysis of Copy Number Variations in the Pathogenic Yeast Candida Parapsilosis Identifies a Gene Amplification in RTA3 That Is Associated with Drug ResistancemBio 13:e0177722
- Mapping of Repetitive and Non-Repetitive DNA Probes to Chromosomes of the Microsporidian Encephalitozoon CuniculiGene 191:39–45
- The antiSMASH Database Version 2: A Comprehensive Resource on Secondary Metabolite Biosynthetic Gene ClustersNucleic Acids Research 47:D625–30
- Trimmomatic: A Flexible Trimmer for Illumina Sequence DataBioinformatics 30:2114–20
- Innate Immunity in Plants: An Arms Race between Pattern Recognition Receptors in Plants and Effectors in Microbial PathogensScience (New York, N.Y.) 324:742–44
- Fast and Sensitive Protein Alignment Using DIAMONDNature Methods 12:59–60
- BBMerge - Accurate Paired Shotgun Read Merging via OverlapPloS One 12:e0185056
- The Carbohydrate-Active EnZymes Database (CAZy): An Expert Resource for GlycogenomicsNucleic Acids Research 37:D233–38
- trimAl: A Tool for Automated Alignment Trimming in Large-Scale Phylogenetic AnalysesBioinformatics 25:1972–73
- tRNAscan-SE 2.0: Improved Detection and Functional Classification of Transfer RNA GenesNucleic Acids Research 49:9077–96
- Phased Diploid Genome Assembly with Single-Molecule Real-Time SequencingNature Methods 13:1050–54
- Ophiostoma Ulmi Resource Browser2013http://www.moseslab.csb.utoronto.ca/o.ulmi/
- Mauve: Multiple Alignment of Conserved Genomic Sequence with RearrangementsGenome Research 14:1394–1403
- NOVOPlasty: De Novo Assembly of Organelle Genomes from Whole Genome DataNucleic Acids Research 45:e18
- STAR: Ultrafast Universal RNA-Seq AlignerBioinformatics 29:15–21
- The Two-Speed Genomes of Filamentous Pathogens: Waltz with PlantsCurrent Opinion in Genetics & Development 35:57–65
- Large-Scale Genomic Analyses with Machine Learning Uncover Predictive Patterns Associated with Fungal Phytopathogenic Lifestyles and TraitsScientific Reports 13:17203
- OrthoFinder: Phylogenetic Orthology Inference for Comparative GenomicsGenome Biology 20:238
- Threats Posed by the Fungal Kingdom to Humans, Wildlife, and AgriculturemBio 11:3https://doi.org/10.1128/mBio.00449-20
- RepeatModeler2 for Automated Genomic Discovery of Transposable Element FamiliesProceedings of the National Academy of Sciences of the United States of America 117:9451–57
- Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian EvolutionNature 428:493–521
- Blast Fungal Genomes Show Frequent Chromosomal Changes, Gene Gains and Losses, and Effector Gene TurnoverMolecular Biology and Evolution 36:1148–61
- ‘CATAStrophy,’ a Genome-Informed Trophic Classification of Filamentous Plant Pathogens - How Many Different Types of Filamentous Plant Pathogens Are There?Frontiers in Microbiology 10:3088
- 101 Dothideomycetes Genomes: A Test Case for Predicting Lifestyles and Emergence of PathogensStudies in Mycology 96:141–53
- Distinct Trajectories of Massive Recent Gene Gains and Losses in Populations of a Microbial Eukaryotic PathogenMolecular Biology and Evolution 34:2808–22
- A Fungal Wheat Pathogen Evolved Host Specialization by Extensive Chromosomal RearrangementsThe ISME Journal 11:1189–1204
- TIMETREE 5: The Timescale of Life2005http://www.timetree.org/
- Hybridization and Introgression Drive Genome Evolution of Dutch Elm Disease PathogensNature Ecology & Evolution 4:626–38
- Functional Characterization of the First Filamentous Fungal tRNA-Isopentenyltransferase and Its Role in the Virulence of Claviceps PurpureaThe New Phytologist 211:980–92
- UFBoot2: Improving the Ultrafast Bootstrap ApproximationMolecular Biology and Evolution 35:518–22
- Whole-Genome Annotation with BRAKERIn:
- Kollmar Martin
- A Linear-Time Algorithm for Gaussian and Non-Gaussian Trait Evolution ModelsSystematic Biology 63:397–408
- ModelFinder: Fast Model Selection for Accurate Phylogenetic EstimatesNature Methods 14:587–89
- Genome Sequence and Gene Compaction of the Eukaryote Parasite Encephalitozoon CuniculiNature 414:450–53
- MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and UsabilityMolecular Biology and Evolution 30:772–80
- Causes and Consequences of Genome Expansion in FungiGenome Biology and Evolution 4:13–23
- The Consequences of Genetic Drift for Bacterial Genome ComplexityGenome Research 19:1450–54
- Less Effective Selection Leads to Larger GenomesGenome Research 27:1016–28
- Unconventional Secretion of Magnaporthe Oryzae Effectors in Rice Cells Is Regulated by tRNA Modification and Codon Usage ControlNature Microbiology 8:1706–16
- Minimap2: Pairwise Alignment for Nucleotide SequencesBioinformatics 34:3094–3100
- Fast and Accurate Short Read Alignment with Burrows-Wheeler TransformBioinformatics 25:1754–60
- Gene Identification in Novel Eukaryotic Genomes by Self-Training AlgorithmNucleic Acids Research 33:6494–6506
- A Global Survey of Host, Aquatic, and Soil Microbiomes Reveals Shared Abundance and Genomic Features between Bacterial and Fungal GeneralistsCell Reports 43:114046
- The Origins of Genome ArchitectureSinauer Associates
- The Origins of Genome ComplexityScience 302:1401–4
- Effective Population Size Does Not Explain Long-Term Variation in Genome Size and Transposable Element Content in AnimalseLife https://doi.org/10.7554/elife.100574.1
- Transposon-Mediated Horizontal Transfer of the Host-Specific Virulence Protein ToxA between Three Fungal Wheat PathogensmBio 10:5https://doi.org/10.1128/mBio.01515-19
- CAFE 5 Models Variation in Evolutionary Rates among Gene FamiliesBioinformatics https://doi.org/10.1093/bioinformatics/btaa1022
- Genomes of Fungi and Relatives Reveal Delayed Loss of Ancestral Gene Families and Evolution of Key Fungal TraitsNature Ecology & Evolution 7:1221–31
- Estimating Population Size and Transmission Bottlenecks in Maternally Transmitted Endosymbiotic BacteriaMicrobial Ecology 44:137–43
- Deletional Bias and the Evolution of Bacterial GenomesTrends in Genetics: TIG 17:589–96
- Pfam: The Protein Families Database in 2021Nucleic Acids Research 49:D412–19
- Challenges in Homology Search: HMMER3 and Convergent Evolution of Coiled-Coil RegionsNucleic Acids Research 41:e121
- The Diversity of Fungal GenomeBiological Procedures Online 17:8
- Evolution and Genome Architecture in Fungal Plant PathogensNature Reviews. Microbiology 15:756–71
- Independent Subtilases Expansions in Fungi Associated with AnimalsMolecular Biology and Evolution 28:3395–3404
- IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood PhylogeniesMolecular Biology and Evolution 32:268–74
- Recent Transposable Element Bursts Are Associated with the Proximity to Genes in a Fungal Plant PathogenPLoS Pathogens 19:e1011130
- Bayesian Analysis of Correlated Evolution of Discrete Characters by Reversible-Jump Markov Chain Monte CarloThe American Naturalist 167:808–25
- APE: Analyses of Phylogenetics and Evolution in R LanguageBioinformatics 20:289–90
- Duplications and Losses in Gene Families of Rust Pathogens Highlight Putative EffectorsFrontiers in Plant Science 5:299
- Mutational Equilibrium Model of Genome Size EvolutionTheoretical Population Biology 61:531–44
- Genome Evolution in Filamentous Plant Pathogens: Why Bigger Can Be BetterNature Reviews. Microbiology 10:417–30
- A Call to Arms: Mustering Secondary Metabolites for Success and Survival of an Opportunistic PathogenPLoS Pathogens 15:e1007606
- The MEROPS Database of Proteolytic Enzymes, Their Substrates and Inhibitors in 2017 and a Comparison with Peptidases in the PANTHER DatabaseNucleic Acids Research 46:D624–32
- Phytools: An R Package for Phylogenetic Comparative Biology (and Other Things)Methods in Ecology and Evolution 3:217–23
- Evolution of the Human Pathogenic Lifestyle in FungiNature Microbiology 7:607–19
- Vertical and Horizontal Gene Transfer Shaped Plant Colonization and Biomass Degradation in the Fungal Genus ArmillariaNature Microbiology 8:1668–81
- The Genome Biology of Effector Gene Evolution in Filamentous Plant PathogensAnnual Review of Phytopathology 56:21–40
- r8s: Inferring Absolute Rates of Molecular Evolution and Divergence Times in the Absence of a Molecular ClockBioinformatics 19:301–2
- Human and Plant Fungal Pathogens: The Role of Secondary MetabolitesPLoS Pathogens 10:e1003859
- Genome Expansion and Lineage-Specific Genetic Innovations in the Forest Pathogenic Fungi ArmillariaNature Ecology & Evolution 1:1931–41
- EffectorP 3.0: Prediction of Apoplastic and Cytoplasmic Effectors in Fungi and OomycetesMolecular Plant-Microbe Interactions: MPMI 35:146–56
- Fungal Genomes and Insights into the Evolution of the KingdomMicrobiology Spectrum 5:4https://doi.org/10.1128/microbiolspec.FUNK-0055-2016
- The Population Genetics of Adaptation through Copy Number Variation in a Fungal Plant PathogenMolecular Ecology 32:2443–60
- AUGUSTUS: A Web Server for Gene Prediction in Eukaryotes That Allows User-Defined ConstraintsNucleic Acids Research 33:W465–67
- Fungal Effector ProteinsAnnual Review of Phytopathology 47:233–63
- Using RepeatMasker to Identify Repetitive Elements in Genomic SequencesCurrent Protocols in Bioinformatics Chapter 4:Unit 4.10
- The COG Database: An Updated Version Includes EukaryotesBMC Bioinformatics 4:41
- Genome Size Analyses of Pucciniales Reveal the Largest Fungal GenomesFrontiers in Plant Science 5:422
- SignalP 6.0 Predicts All Five Types of Signal Peptides Using Protein Language ModelsNature Biotechnology 40:1023–25
- Genome Evolution in Fungal Plant Pathogens: Looking beyond the Two-Speed Genome ModelFungal Biology Reviews 34:136–43
- Multiple Routes to Mammalian DiversityNature 479:393–96
- Two-Speed Genome Evolution Drives Pathogenicity in Fungal Pathogens of AnimalsProceedings of the National Academy of Sciences of the United States of America 120:e2212633120
- Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly ImprovementPloS One 9:e112963
- BUSCO Applications from Quality Assessments to Gene Prediction and PhylogenomicsMolecular Biology and Evolution 35:543–48
- A Field Guide to Eukaryotic Transposable ElementsAnnual Review of Genetics 54:539–61
- Recent Advances in Understanding of Fungal and Oomycete EffectorsCurrent Opinion in Plant Biology 68:102228
- Molecular Assays to Detect the Presence and Viability of Phytophthora Ramorum and Grosmannia ClavigeraPloS One 15:e0221742
- MECAT: Fast Mapping, Error Correction, and de Novo Assembly for Single-Molecule Sequencing ReadsNature Methods 14:1072–74
- PAML 4: Phylogenetic Analysis by Maximum LikelihoodMolecular Biology and Evolution 24:1586–91
- Using Ggtree to Visualize Data on Tree-Like StructuresCurrent Protocols in Bioinformatics 69:e96
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Copyright
© 2025, Fijarczyk et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 77
- downloads
- 0
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.