Several genomic traits are correlated with genome size in Sordariomycetes.

A. Maximum likelihood tree based on 1000 concatenated protein sequence alignments, calculated with IQ-TREE using the ultrafast bootstrap approximation (n=563 species). The largest orders are indicated with different colors. Bootstrap support reached 100% for all major clades except one within Ophiostomatales (82%, black dot). B. Spearman’s rank correlation for phylogenetic independent contrasts of all pairwise combinations of genomic traits. C. Principal component analysis of genomic traits. Colors correspond to the orders depicted in A. D. Model of genome size. The y axis shows the model coefficients with 95% confidence intervals obtained with phylogenetic generalized least squares (PGLS). Principal components correspond to the three main principal components based on 11 genomic traits. E. Testing hypotheses of genome size evolution. The first four plots show the correlation of dN/dS, used as a proxy for Ne, with the non-coding elements of the genome. The last plot shows the correlation of gene loss rate with genome size. Correlations were tested using Spearman’s rank correlation on phylogenetic independent contrasts. Loadings of each principal component from panel C are shown in Supplementary Figure 1. Raw data underlying figures are in Figure 1-Source Data 1-3.

Genomic traits associated with pathogenicity.

A. Time-scaled phylogeny of Sordariomycetes colored by the reconstructed genome size, displayed on a log scale. The outer heatmap indicates species that are pathogenic and insect-associated (IA). Letters and numbers correspond to clades used in subsequent analyses, with representative genera from each clade listed in the legend. B. Association of 13 genomic traits with pathogenicity estimated with BayesTraits, phylogenetic logistic regression (Phyloglm), and a random forest classifier. In BayesTraits and Phyloglm, colors correspond to statistically significant associations, either positive (brown) or negative (green). In Phyloglm these are based on the coefficient sign; in BayesTraits they were inferred from transition rates between binary traits estimated from the model. “YES” indicates that dependent co-evolution was detected with BayesTraits but a uniform direction of association could not be deduced from the transition rates. The grey color scale depicts the prevalence of species classified as pathogenic and non-pathogenic with trait values below the median (LOW) or above the median (HIGH). C. The same three methods repeated for 10 subsets of the data with an equal number of pathogens and non-pathogens from each genome size bin. In the random forest analysis, the rank of each trait is given instead of its importance. Distributions of genomic traits are shown in Supplementary Figure 2. Transition rates (modeled gains and losses of traits) are shown in Supplementary Figure 3. Raw data underlying figures are in Figure 2-Source Data 1-2.
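The balanced subsets described for panel C can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `balanced_subset`, the tuple layout of the input, and the rule of taking as many species per bin as the smaller group allows are all assumptions.

```python
import random
from collections import defaultdict

def balanced_subset(species, rng):
    """Draw equal numbers of pathogens and non-pathogens from each size bin.

    species: list of (name, size_bin, is_pathogen) tuples (hypothetical layout).
    rng: a random.Random instance, so that each subset is reproducible.
    """
    by_bin = defaultdict(lambda: {True: [], False: []})
    for name, size_bin, is_pathogen in species:
        by_bin[size_bin][bool(is_pathogen)].append(name)

    subset = []
    for size_bin in sorted(by_bin):
        groups = by_bin[size_bin]
        # Assumption: the smaller group sets the sample size within the bin.
        k = min(len(groups[True]), len(groups[False]))
        subset += rng.sample(groups[True], k) + rng.sample(groups[False], k)
    return subset
```

Ten subsets could then be drawn with `[balanced_subset(species, random.Random(i)) for i in range(10)]`, with each subset passed to the three methods in turn.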

Genomic traits associated with insect association (IA).

A. Association of 13 genomic traits with IA estimated with BayesTraits, phylogenetic logistic regression (Phyloglm), and a random forest classifier. In BayesTraits and Phyloglm, colors correspond to statistically significant associations, either positive (brown) or negative (green). In Phyloglm these are based on the coefficient sign; in BayesTraits they were inferred from transition rates between binary traits estimated from the model. “YES” indicates that dependent co-evolution was detected with BayesTraits but a uniform direction of association could not be deduced from the transition rates. The grey color scale depicts the prevalence of species classified as IA and non-IA with trait values below the median (LOW) or above the median (HIGH). B. Comparison of exon and intron metrics in 38 one-to-one orthologs between two pairs of clades from Microascales (M), Ophiostomatales (O) and Diaporthales (D). In each comparison, one taxon is IA (M1, O) and the other is non-IA (M2, D). Exon/intron metrics were averaged within each clade and compared using a paired Mann-Whitney U test. C. Correlation between the prevalence of gene families within clades and two exon metrics, in the same four clades. In all clades, more common gene families have longer and more numerous exons (negative binomial generalized linear model, P-values < 0.05). Raw data underlying figures are in Figure 3-Source Data 1-3.

Evolution of pathogens with different lifestyles.

A. Models of the IA trait using phylogenetic logistic regression, with each of the five genomic traits and pathogenicity as predictors. Coefficients of genomic traits for pathogenic and non-pathogenic species are shown with 95% credible intervals. B. Rates of gene loss and gain in four groups of species with different pathogenicity and IA status, estimated from 527 small gene families using the birth-death model in CAFE v5. Estimates of gene evolutionary rates (λ) are shown above the boxplots. Raw data underlying figures are in Figure 4-Source Data 1-2.

Insect-vectored clades lose genes involved in breaking plant host barriers.

A. Phylogenetic position of the selected clades. Names correspond to those in Figure 2A, and colors correspond to the general lifestyles explained in the legend of plot B. These broad categories indicate clades dominated by plant pathogenic fungi (H1, G, D), entomopathogens (H2.4, H2.6, H2.3), insect-vectored species (M1, O, H2.8, H2.2), saprotrophs (S), plant symbionts (H2.1, H2.2) or mixed lifestyle groups. The base of each triangle indicates the span between the minimum and maximum branch lengths for a given clade. The height of the triangles was scaled by a factor of 0.2. Empty branches correspond to smaller clades whose members are not shown. B. Means (dots) and standard deviations (error bars) of the number of genes and genome size for the selected clades. C. Heatmap showing the fold change in the number of genes/clusters relative to the ancestral state ((observed - ancestral) / ancestral). Clades are shown in columns with the number of clade members in parentheses; functional classes are shown in rows. The dots indicate significant gain (brown) or loss (green) of genes/clusters across clade members, estimated from 100 rounds of bootstrapping of 10 species in clades with >= 10 members. SMC: secondary metabolite clusters, CAZy: carbohydrate-active enzymes. The same heatmap for pathogenic species only is shown in Supplementary Figure 4. Raw data underlying figures are in Figure 5-Source Data 1-4.
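The fold change and bootstrap in panel C can be illustrated with a short sketch. The fold-change formula is taken from the legend. Everything else is an assumption made for illustration: the function names, sampling with replacement, and calling a change significant when a 95% percentile interval of the bootstrapped mean excludes zero. The legend does not state the criterion actually used.

```python
import random

def fold_change(observed, ancestral):
    """Fold change relative to the ancestral state, as defined in the legend."""
    return (observed - ancestral) / ancestral

def bootstrap_fold_change(counts, ancestral, rounds=100, sample_size=10, seed=0):
    """Classify a clade as 'gain', 'loss' or 'ns' for one functional class.

    counts: gene/cluster counts, one per clade member.
    Returns None for clades with fewer than sample_size members.
    """
    if len(counts) < sample_size:
        return None
    rng = random.Random(seed)
    means = []
    for _ in range(rounds):
        # Assumption: species are resampled with replacement.
        sample = rng.choices(counts, k=sample_size)
        means.append(sum(fold_change(c, ancestral) for c in sample) / sample_size)
    means.sort()
    # Assumption: 95% percentile interval of the bootstrapped mean.
    lower = means[int(0.025 * rounds)]
    upper = means[min(rounds - 1, int(0.975 * rounds))]
    if lower > 0:
        return "gain"
    if upper < 0:
        return "loss"
    return "ns"
```

For example, a clade whose members all carry half the ancestral number of genes in a class has a fold change of -0.5 in every bootstrap round and is classified as a loss.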

Loadings of the main PCs.

Principal component analysis of genomic traits. Histogram shows proportion of variance explained by each PC. Dark gray PCs together explain 66% of the variance. Heatmap shows loadings for each PC, with brown colors depicting positive, and green colors depicting negative contributions to the PC.

Distributions of genomic traits across lifestyles.

Boxplots indicate 25th and 75th percentiles of distributions, and dots indicate mean values. IA stands for insect-associated.

Numbers of gains and losses among four transition types obtained in BayesTraits run.

Opaque colors show transitions significantly shifted towards gains or losses. The x axis lists all four possible transitions for the two traits (genomic trait and pathogenicity). The two states (before change, after change) are separated by an underscore. 1 stands for either a trait value above the median or presence of pathogenicity; 0 stands for a trait value below the median or absence of pathogenicity. For example, transition 01_11 indicates a gain of a trait (e.g. a transition from a small to a large genome size, 0x_1x) and no change in pathogenicity (x1_x1).
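The transition labels can be decoded mechanically. The sketch below assumes, following the 01_11 example, that the first digit of each state is the genomic trait and the second is pathogenicity. The function name and the returned dictionary keys are hypothetical.

```python
def decode_transition(label):
    """Decode a label such as '01_11' into the change in each trait."""
    before, after = label.split("_")
    # Assumption: digit order is (genomic trait, pathogenicity).
    names = ("genomic_trait", "pathogenicity")
    changes = {}
    for name, b, a in zip(names, before, after):
        if b == a:
            changes[name] = "no change"
        elif (b, a) == ("0", "1"):
            changes[name] = "gain"
        else:
            changes[name] = "loss"
    return changes
```

With this encoding, `decode_transition("01_11")` reports a gain of the genomic trait and no change in pathogenicity, matching the example in the legend.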

Insect-associated pathogens lose genes involved in breaking host barriers.

Heatmap shows the fold change in the number of genes/clusters relative to the ancestral state for pathogenic species only. Clades are shown in columns (clade names correspond to those in Figure 2A) with the number of clade members in parentheses; functional classes are shown in rows. The dots indicate significant gain (red) or loss (green) of genes/clusters across clade members, estimated from 100 rounds of bootstrapping of 10 species in clades with >= 10 members. SMC: secondary metabolite clusters, CAZy: carbohydrate-active enzymes.

Proportion of species with different lifestyles among short- and long-read assemblies.

Distribution of trait values calculated for long-read assembly (orange) and short-read assembly (blue) species.

Vertical lines indicate the position of median values.

Distribution of genomic trait values for short- and long-read assemblies, shown separately for pathogenic (1) and non-pathogenic (0) species.

Distribution of genomic trait values for short- and long-read assemblies, shown separately for IA (insect-associated) (1) and non-IA (0) species.

Comparison of assembly length and repeat annotations between matching species from our dataset (NCBI) and JGI Mycocosm.

In the boxplots, category “1” corresponds to pathogens or IA (insect-associated) species, and “0” to non-pathogens or non-IA species.

Comparison of gene annotations between matching species from our dataset (NCBI) and JGI Mycocosm.

In the boxplots, category “1” corresponds to pathogens or IA (insect-associated) species, and “0” to non-pathogens or non-IA species.

Gene trait values as a function of distance from the five species or genera used in Augustus training.

Distributions and medians of gene traits compared between all species from Hypocreales and Microascales retrieved from JGI Mycocosm, and all species from the same two orders in our dataset annotated with Augustus using “fusarium” for training.

Species were split into three groups with increasing distance from the genus Fusarium: all Fusarium species in the order Hypocreales, all other Hypocreales, and Microascales.