Plant pests, such as pathogens and herbivores, can cause major yield losses in crops and often require the massive use of pesticides to control their damage. Natural plant populations, on the other hand, are constantly exposed to such biotic stressors and their higher genetic diversity often allows these populations to become locally adapted. Since many pest species are sensitive to climatic conditions, their pressure on plant communities is spatially heterogeneous, maintaining geographically structured genetic variation in plant defenses (1,2). For these reasons, natural plant populations are highly suitable to study defense mechanisms and evolution of defenses, and also a very useful source of beneficial and resistance alleles for specific pathogens and environmental conditions. This genetic variation in defense-related genes can for example be screened through Genome Wide Association (GWA) (36) or approaches based on known candidate genes (2).

Many pest species are also highly sensitive to temporal variation in weather conditions. This temporal heterogeneity in pathogen pressure can induce plastic responses in plants, involving gene expression and epigenetic changes (79), which may also be studied through stress experiments (79). Some plastic epigenetic responses can have a transient stability and be transmitted to the next generations through inheritance of epigenetic marks (1013). In particular, DNA methylation has been shown to respond to biotic and abiotic stresses through gene expression regulation and transposable elements (TEs) reactivation, and can be inherited across generations (9,12,14). In plants, DNA methylation can occur in the three sequence contexts CG, CHG and CHH (H being A, T or C), which differ in their molecular machineries depositing, maintaining and removing methylation and consequently also in their transgenerational stability (15,16). While CG methylation is usually more stable across generations, CHH methylation is less stable and more responsive to stress and the sensitivity of CHG methylation lies somewhere in between (1517).

Whether inherited or induced, some strategies of plants for defense against pathogens and herbivores include: i) physical barriers such as reinforced cell walls, leaf protective layers or closing stomata, ii) production of specialized (secondary) metabolites that reduce palatability or are toxic to pests, iii) oxidative bursts, iv) the activation of signaling cascades to induce systemic responses and v) RNA interference mechanisms to silence pathogen genes (1822). In Brassicaceae, a particularly important and diverse class of defense metabolites are glucosinolates, which often show local adaptation driven by variation in pests and can also be induced by herbivore and pathogen attacks (1,2,23).

Studying natural variation in plant resistance, along with associated genetic and epigenetic variation, can identify genes involved in defense and their regulators, including vital genes whose function cannot be determined through knockout experiments. Such knowledge, and especially the discovery of natural resistance alleles, are crucial sources for the breeding of more pest-resistant crop varieties. Nevertheless, because of the diversity of resistance mechanisms and their often multigenic nature, plant defense mechanisms remain difficult to study. In particular, antixenosis (the prevention of pathogen settlement) and antibiosis (the repression of pathogen growth and reproduction) require extensive and time-consuming phenotyping, based for example on choice (24) or settling (9) assays, and such assays are extremely challenging to perform on large collections. On the other hand, there are increasing numbers of large sequencing datasets, which may also be used to quantify contaminants or microbiome composition (2527) and thus as proxies for resistance phenotyping. In this study we investigated such usage of exogenous reads, i.e. reads not mapping to the target reference genome, as a source of information for quantifying herbivore and pathogen abundance in large collections.

We worked with field pennycress (Thlaspi arvense), an annual plant in the Brassicaceae family that is increasingly studied as a model species (2832) and new biofuel and winter cover crop (3336). In a previous study, we investigated natural epigenetic variation in a collection of 207 Thlaspi lines from across Europe (32). Prior to their whole-genome and -epigenome sequencing these lines had been grown in a common environment, an open glasshouse where the plants were spontaneously colonized by aphids and powdery mildew, as well as by other microbes. At the time of sequencing, pathogen contamination was still very limited but appeared highly variable, and preliminary analyses showed that it resulted in sizeable amounts of non-target reads assigned to the pest species, i.e. contamination of the DNA samples. Inspired by other recent studies on non-target reads (2527), we asked if there was systematic variation in the numbers of aphid and pathogen reads among different T. arvense lines, and whether these data, together with our whole-genome plant sequencing data, could provide insights into the genomic basis of plant resistance variation.

The goals of our study were thus two-fold: i) to contribute to a mechanistic understanding of pest resistance in Thlaspi arvense, and ii) to explore whether non-target reads from plant sequencing can be used as proxies for studying plant biotic interactions. Considering that we are moving towards an increasingly sequencing-prone world, with more and larger datasets being generated for many species (3743), the use of non-target reads has very broad potential.


Reads classification and species identification

Starting from our previously published sequencing data (32), the first step of our analysis was to separate the Whole Genome Sequencing (WGS) reads of each sample into the ∼99.5% mapping to the Thlaspi arvense reference genome (29) and the ∼0.5% that did not, hereafter called “exogenous reads” (Fig 1A). Initially, we used all mapped reads for calling variants in Thlaspi, but after some difficulties with Genome Wide Associations (see below) we suspected that some plant reads were false and mapped to the T. arvense genome only because of the high cross-taxa similarity of some genomic regions. We therefore remapped all reads to the genomes of the aphid Acyrthosiphon pisum, its endosymbiont Buchnera aphidicola and the powdery mildew Blumeria graminis, and found that, on average, 7.4% of the reads mapped to both T. arvense and at least one of the pests. We removed these ambiguous reads from our analyses and used only the T. arvense target reads, 92.1% on average, for variant calling (Fig 1A, S1 Table).

Classification of sequencing reads in Thlaspi arvense WGS data.

(A) Workflow of the analyses, including reads classification (orange nodes) into target, ambiguous and exogenous reads, and downstream analysis (dark blue nodes) (see Methods). (B) Fractions of exogenous reads assigned to different taxonomic groups by MG-RAST (44,45). (C) Read counts assigned to nine selected groups in our 207 T. arvense samples from different European regions. (D) Aphids and mildew occurring on T. arvense leaves during our experiment.

We next attempted a taxonomic classification of the exogenous reads, in multiple steps. First, we used MG-RAST (44,45) to assign reads to taxonomic groups based on public sequencing databases. Out of the 78% of the exogenous reads that passed the MG-RAST quality control (S1 Table) the majority belonged to bacteria and smaller fractions to fungi, plants and animals (Fig 1B and S2 Table). For subsequent group-level analyses, we then focused on nine taxonomic groups that occurred consistently within our samples (Fig 1C), and that were particularly abundant or relevant: Erysiphales (fungi), Peronosporales (oomycetes), Aphididae and Culicidae (both insects), and five bacterial families.

Visual inspection (Fig 1D) and other sources of information narrowed down the observed aphid and mildew species to a few candidates. For aphids we considered Acyrthosiphon pisum (indicated by MG-RAST), Myzus persicae (visual match, and a generalist attacking Brassicaceae (46)) and Brevicoryne brassicae (attacks Brassicaceae including Thlaspi (47)). For powdery mildew we considered Blumeria graminis (indicated by MG-RAST), and Erysiphe cruciferarum (attacks Brassicaceae (48)). To decide among these species, we then used a competitive mapping approach (49), where the exogenous reads were aligned to a pseudo-reference composed of the same DNA sequences from the different candidate species (see Methods for details, S3 and S4 Tables). The majority (77%) of the aphid reads mapped to M. persicae, 18% to B. brassicae, and 5% to A. pisum, while 98% of the mildew reads mapped to E. cruciferarum, with only 2% to the other mildew species (S5 Table). Based on these results, we concluded that the plants in our experiment had been attacked by Myzus persicae and Erysiphe cruciferarum.

Finally, to compare the power of a large database approach (MG-RAST) versus using specific reference genomes, we also remapped all exogenous reads to the M. persicae and B. aphidicola genome assemblies (50) ( and used the counts from these two mappings as additional phenotypes, besides the nine taxonomic groups selected through MG-RAST (Table 1, S6 Table).

Population differences and SNP-based heritability for different types of exogenous read counts.

Population differences were tested with a linear model, SNP-based heritabilities (and their confidence intervals) estimated with the R package heritability.

Exogenous read counts are a heritable Thlaspi phenotype

As we had observed that aphid and mildew infections in the glasshouse were not random, but prevalent on plants from some origins than others, i.e. possibly reflecting heritable variation in plant resistance, we next tested for population differences and SNP-based heritability in pest and microbiome loads (see Methods). Prior to these analyses, to avoid biases caused by different sequencing depths, we corrected the read counts for the total numbers of deduplicated reads in each library and used the residuals as unbiased estimates of aphid, mildew and microbe loads.

For most of the nine taxonomic groups, there were significant population effects, with 20-40% of the variance in read counts explained, as well as significant SNP-based heritability, typically in the range of 0.18 - 0.30 (Table 1). The highest heritability of 0.47 was for read counts of Erysiphales, indicating particularly strong variation for resistance to mildew. Both SNP-based heritability and population differences tended to be stronger for aphid and Buchnera data based on read mapping to the reference genomes than for those based on MG-RAST, demonstrating that the former method is stronger and thus preferable if high-quality genome assemblies are available.

An alternative explanation for different aphid and mildew loads in the greenhouse could be that enemy variation in the field was transmitted to the greenhouse, through maternal carry-over effects, or even as seed contamination. However, we had recorded aphid and mildew occurrence during seed sampling in the field and found no significant differences in the glasshouse between the offspring of plants that had been attacked in the field versus those that had not (S1 Fig).

Aphid and mildew loads correlate with climate of origin and glucosinolates content of plants

Having established that our method most likely captured variation in plant resistance, we were interested in the ecological drivers of this variation. As climate is known to be a major influence on many biotic interactions as well as plant defenses (1,51), we correlated the observed read counts with the climates of origin of the plants. We found negative correlations between aphid read counts and several temperature variables, in particular annual minimum temperature (Fig 2A). Aphid read counts were also positively correlated with temperature variability, i.e. the diurnal and annual ranges of temperature (Fig 2A). In other words, plants from warmer and more stable climates had consistently lower levels of aphid infestation in our glasshouse, possibly because these plants had evolved greater resistance under such benign climatic conditions where aphids thrive. We found similar, although weaker patterns, for the number of Erysiphales reads. The other analysed taxonomic groups showed different and often weaker patterns of correlation with climate, except that the read counts of several bacterial groups were positively correlated with annual maximum temperature and in particular diurnal temperature range.

Relationships between climates of origin or glucosinolate levels of plants and the exogenous reads loads.

(A) Correlations with bioclimatic variables. (B) Correlations with baseline glucosinolate (GS) levels measured in the same pennycress lines in another experiment. All correlations in (A) and (B) were done after correction for population structure. Aphid-related read counts are in green, mildew-related in gray, others in black. (C) Boxplot of the aphid reads residuals in samples where benzyl GS was detected vs. not.

Since glucosinolates are major defense metabolites of Brassicaceae, and their variation could thus be an explanation for variance in plant resistance, we also tested for correlations between the baseline amounts of these metabolites and the frequencies of aphid and mildew reads. Glucosinolate levels were measured on the same T. arvense lines in a separate experiment not affected by pests (S7 Table). We found positive correlations of aphid read counts with sinigrin, an aliphatic glucosinolate which is by far the most abundant in the leaves of T. arvense, and a stronger negative correlation with benzyl glucosinolates (glucotropaeolin) (Fig 2B). Although the baseline levels of benzyl glucosinolates were very low and probably sometimes below the detection level, plant lines where benzyl glucosinolate was detected had significantly lower aphid loads (over 70% less reads) in the glasshouse (Fig 3C). We also detected three indole glucosinolates, but these did not show any significant correlations with aphid loads.

Genome-wide association analyses for aphid and mildew loads.

We show only the results for M. persicae and MG-RAST Erysiphales read-counts; for full results see S3 Fig. (A) Manhattan plots, annotated with genes potentially affecting aphid/mildew colonization. The genome-wide significance (horizontal red line) was calculated based on unlinked variants (53), the blue line corresponds to –log(p) = 5. (B) Corresponding to the Manhattan plots on the left, enrichment of a priori candidates and expected false discovery rates (as in (52)) for increasing significance thresholds. (C) Allelic effects of the red-marked variants in the corresponding Manhattan plots, with genotypes on the x-axes and the read-count residuals on the y-axes. (D) The candidate genes marked in panel A, their putative functions and distances to the top variant of the neighboring peak. Candidates in dark blue are the a priori candidates included in the enrichment analyses and involved in defense response (GO:0006952). GS: glucosinolates. (E) Zoom-in for the Manhattan plot of Erysiphales load, around the first peak in Scaffold 1, with gene and TE models below, and a priori candidates in blue.

GWA identifies peaks near defense genes

To further investigate the genetic basis of variation in aphid, mildew and microbe loads, we next employed GWA and tested for associations between exogenous read counts and biallelic genetic variants (SNPs and short INDELs). We corrected for population structure using an IBS matrix and only tested variants with Minor Allele Frequency (MAF) > 0.04 (see Methods). Initially, we called genetic variants using all reads that mapped to the T. arvense genome and found massive peaks in some highly conserved regions of the genome, which had very high mapping coverage (S2 Fig). We suspected that this might be because some non-Thlaspi reads were very similar to these highly conserved regions and, by mapping there, generated false variants only in samples containing many non-Thlaspi reads. We therefore identified and removed ambiguous reads prior to variant calling, which eliminated the observed massive GWA peaks, indicating that they had indeed reflected false associations (S2 Fig).

After excluding the ambiguous reads, we still found significant GWA peaks for Erysiphales but not for other types of exogenous reads (excluding isolated, unreliable variants) (Fig 3A and S3 Fig). Nevertheless, when clear peaks were visible, regardless of their significance, they were usually located close to genes involved in plant defense response. An enrichment analysis (52) confirmed that stronger variants were indeed enriched close to these defense genes (S8 Table) for some exogenous read counts (Fig 3B and S3 Fig). For M. persicae load there was a peak in the proximity of Tarvense_01930, encoding a predicted pathogenesis-related peptide. The top variant in this peak had a slight but clear allelic effect on M. persicae load (Fig 3C). For Erysiphales load we detected a more persistent enrichment, with a highly significant peak in Scaffold 1, located in a region with several defense genes, including MAJOR LATEX PROTEINS (MLP) and two genes similar to Arabidopsis thaliana SALICYLATE METHYLTRANSFERASE 1 (BSMT1) (Fig 3D and E). This region is wide due to ancient TE colonization, but the top variants are clearly neighboring candidate genes involved in defense (Fig 3E). Other significant peaks for Erysiphales load were close to other genes that possibly contribute to resistance such as PBL7, involved in signaling and stomatal closure or SRF3, reinforcing cell walls by callose deposition.

Aphid and mildew loads correlate with differential methylation at genes and transposons

Variation in phenotypes, such as our indirect estimates of pest resistance, may not only be associated with DNA sequence but also with epigenetic changes like DNA methylation. This phenotype-associated epigenetic variation can include both heritable and plastic components. The Whole Genome Bisulfite Sequencing (WGBS) data from our previous study (32) allowed us to also explore these questions and to test for associations between DNA methylation variation and pest attack. For simplicity, we limited this analysis to M. persicae and Erysiphales loads.

Our analysis had two steps: First we called Differentially Methylated Regions (DMRs) between the 20 samples with the most and least M. persicae or Erysiphales loads, and then we conducted Epigenome Wide Association (EWA) analyses on individual positions located within these DMRs (see Methods). This approach allowed us to target genomic regions of interest, while strongly reducing the multiple-testing problem of millions of cytosines in the whole genome and correcting for population structure. Using a relaxed False Discovery Rate (FDR) of 20%, we identified 162 DMRs for M. persicae load and 548 DMRs for Erysiphales load (S4 Fig, S9 and S10 Table). The majority of these were in the CG context, especially for M. persicae-related DMRs (Fig 4A and S4 Fig). As observed previously (32), DMRs in CHH were generally shorter than in the other sequence contexts (S4A Fig). Since the genome of T. arvense is rich in TEs and intergenic regions, the majority of DMRs were located in those features (S4B Fig). However, the DMR density was higher in proximity of genes and particularly in coding sequences (Fig 4A), and even DMRs assigned as intergenic (Fig 4A) were often located close to genes or promoters. In accordance with previous studies (8,9), most DMRs were hypomethylated in the affected samples, indicating that genes needed for defense might be activated through demethylation.

Differential methylation associated with aphid and mildew loads.

(A) Differentially Methylated Region (DMR) densities in different genomic features when comparing the 20 samples with the most vs. the least M. persicae (top) or Erysiphales (bottom) load. CDS: coding sequences. (B) Manhattan plots from EWA analyses based on individual cytosines within DMRs, with sequence contexts in different colours and annotation of genes close to low P-value cytosines. The genome-wide Bonferroni significance thresholds (dashed red lines) were calculated based on the number of DMRs. (C) Candidate genes and TEs marked in panel B, their putative functions, genomic locations of associated DMRs, and whether affected samples were hyper- or hypomethylated. (D) Zoomed-in Manhattan plot for Erysiphales load around the peak in Scaffold 4, with gene and TE models given below. The CG methylation in the 20 most and least affected samples was calculated over 50 bp bins (see S5C Fig for methylation in other contexts).

For a more detailed investigation, we turned to EWA, leveraging the power of the entire Thlaspi collection. We tested for associations between M. persicae or Erysiphales loads and the methylation at individual cytosines located within the DMRs. As in GWA, we corrected for population structure using an IBS matrix. For both types of pest loads, we found associations in the proximity of genes and especially within TEs, but no genomic feature was particularly enriched for low P-value associations (S5A Fig). M. persicae load was associated with methylation at several genomic locations, especially TEs (Fig 4B), but these associations had strongly inflated P-values (S5B Fig). For Erysiphales load the P-value distribution was more well-behaved (S5B Fig), and we found a clear association with hypomethylation of Copia family 202 TEs upstream of MAPKK KINASE 20 (MAPKKK20), a gene involved in abscisic acid (ABA) stress response and stomatal closure (Fig 4B, C and D). A coverage analysis confirmed that none of the T. arvense lines carries insertions or deletions of the TEs upstream of MAPKKK20.


Plant pests are a major threat to food safety, causing large yield losses, and new crops such as the potential biofuel plant Thlaspi arvense must be able to resist pathogen and herbivore attacks. A powerful source for obtaining resistant varieties is natural variation in plant defenses, but phenotyping large collections can be very time-consuming and error-prone. Here we describe how an unplanned pest infestation in a glasshouse experiment, together with available WGS data, can be used to estimate aphid, mildew and microbial loads, and thus variation in plant resistance. The approach is straightforward, makes use of WGS data without microbiome-specific DNA extraction, and can in principle be applied to many other situations such as field experiments. It is not error-free, but we highlight some potential pitfalls, show how to reduce noise, and illustrate its potential to detect associations with climatic, genetic and epigenetic variation.

An important first step in our analyses was the identification and classification of pest-related reads in the plant WGS data. We began by classifying all reads as target (only mapping to T. arvense), ambiguous (mapping to T. arvense and at the same time to at least one of the pest genomes) or exogenous (not mapping to T. arvense) (Fig 1A). We demonstrated the importance of removing ambiguous reads prior to variant calling, as this prevented calling false positive variants caused by exogenous DNA that also mapped to highly conserved or repetitive sequences in the T. arvense genome. We then classified the exogenous reads using MG-RAST (44,45) or by confident mapping to specific pest genomes, and selected the eleven most relevant and/or abundant taxonomic groups to focus our analyses on. To obtain unbiased pest/microbe loads we also corrected the read-counts for the total number of deduplicated reads of each sample. A competitive mapping approach allowed us to identify the aphid and mildew species that had occurred in our experiment as the generalist aphid Myzus persicae and the powdery mildew Erysiphe cruciferaum (Fig 2).

Since we suspected a non-random colonization of pests and microbes in our T. arvense collection, we tested for population differences as well as SNP-based heritability. We found significant population differences for most pest and microbe loads, and often heritabilities above 15%, which although low, is still indicative of genetic determination (5). Moreover, Erysiphales load had a particularly high heritability of 47% (Table 1). We therefore next asked what could explain the observed variation in pest loads in our experiment. As pathogen abundances in the field are often determined by climatic conditions, we expected plants originating from climates less favorable to aphids to perform worse in our glasshouse, i.e. to have higher pathogen loads. As expected, aphid counts were negatively correlated with temperature of origin (particularly minimum temperature), and positively with temperature variability (Mean Diurnal Range and Temperature Annual Range) (Fig 3A), suggesting that plants from colder and more thermally fluctuating climates, which are less favorable to aphids, were less well defended and performed worse in our glasshouse. We found similar but weaker patterns for Erysiphales load.

As we expected the observed climate-associated variation in pest loads to be at least partially explained by variation in chemical defenses, and since in Arabidopsis thaliana glucosinolates, the main defensive compounds, are known to be geographically structured in response to aphid distributions (1), we also tested for association of aphid and mildew loads with glucosinolates in our collection. In accordance with literature on A. thaliana (54), we observed a positive correlation of aphid loads with total glucosinolates as well as with the most abundant glucosinolate sinigrin (aliphatic glucosinolate), but a negative correlation with benzyl glucosinolates (Fig 3B). These findings suggest that glucosinolate composition, rather than total amount, is important for aphid defense, and that while benzyl glucosinolates might have a deterrent effect, sinigrin might on the contrary attract M. persicae or act as a stimulant, which would be in accordance with previous observations (55).

To detect genetic variants associated with pest and microbe loads, we then conducted a GWA study. For aphid (M. persicae) load, we detected only one non-significant peak in Scaffold 5, close to a pathogenesis-related coding gene (Fig 4A and D). For Erysiphales load, however, there were several significant associations neighboring genes directly involved in defense, mostly members of the MLP family, clustered in a large peak on the first arm of Scaffold 1 (Fig 4E). MLP165, the closest gene to the most significant variant in the peak, is indirectly involved in GS biosynthesis in A. thaliana (56), which might explain why baseline GS levels were associated with Erysiphales load (Fig 3B). Further GWA peaks for Erysiphales pointed towards other genes indirectly involved in the defense response through phytohormone signaling (eg. CAR8, PBL7, GH3.1) or preventing pathogen access through cell wall reinforcement or stomatal closure (SRF3, PBL7) (Fig 4D). Further experiments would be necessary to confirm the functionality of these genes.

An important general insight from our GWA analyses was the frequent ambiguity of reads that mapped to both pest and host plant genome. Such ambiguous reads generated false variants only present in samples with pest DNA, which resulted in highly significant false associations, and it was therefore important to remove these reads before variant calling. Another potential reason for sequence similarity between host and pathogens could be defense mechanisms such as RNA interference. If T. arvense produces small or micro RNAs to silence pathogen genes, this would originate from genomic regions of high similarity between host and pathogen, and thus reflect a true association. However, a BLAST (57) of the region in which the suspicious associations occurred did not reveal any similarity to aphid or mildew genes, but instead to the highly conserved ribosomal RNA coding regions. While genetic variants are generally inherited from parents and thus reflect evolutionary processes, DNA methylation variants can be heritable but can also reflect plastic responses to environmental stresses like herbivores or pathogens. Our data do not allow to confidently distinguish between these two sources of DNA methylation variation, and thus should be interpreted with caution, especially with regard to the directionality of associations. A beneficial DNA methylation variant is expected to be associated with lower pathogen load when already present before pathogen arrival, but with higher pathogen load when plastically induced by pathogens during the experiment. For both M. persicae and Erysiphales, the majority of DMRs were hypomethylated in affected samples, which is in accordance with the loss of methylation observed in A. thaliana and T. arvense upon aphid feeding, and in diploid wheat upon powdery mildew infection (8,9,31), but we also detected hypermethylation at several loci. M. persicae load was associated with differential methylation at only few genes but several TEs, which is in accordance with the aphid or stress-induced TE reactivation observed in A. thaliana (9,14). Erysiphales load was associated with hypomethylated Copia TEs upstream of MAPKKK20, a gene involved in ABA-mediated signaling and stomatal closure. Since stomatal closure is a known defense mechanism to block pathogen access (21), it is tempting to conclude that hypomethylation of the MAPKKK20 promoter might induce its overexpression and consequent stomatal closure, thereby preventing mildew access to the leaf blade. Overall, we found associations between pathogen load and TE methylation that could act both in cis (eg. Copia TE methylation in MAPKKK20 promoter) and in trans, possibly through transposon reactivation (eg. LINE, Helitron and Ty3/Gypsi TEs isolated from genes). Although we cannot confidently distinguish inherited versus induced DNA methylation variants, to our knowledge this is the first coupled GWA - EWA analysis conducted on a large natural plant collection.

In summary, our study offers first insights into the defense mechanisms of Thlaspi arvense, including candidate genes and alleles which may be of interest for breeding efforts in this novel biofuel and cover crop. It also provides a proof of principle that exogenous reads from large sequencing efforts, usually discarded if not mapping to the target genome, can be leveraged to extract additional information about important biotic interactions of the target species, including its antagonists and microbiome components. We combined this approach with data from a common environment experiment to show that pest and microbiome load were geographically structured, as expected from locally adapted traits, and associated with both genetic and DNA methylation variants. In principle, our approach can be applied to many other designs. For example, field-collected samples could be used to quantify geographic pathogen distributions. With the decreasing cost of sequencing and increasing large-scale and single-species sequencing projects e.g. (3743), the number of data-sets suitable for such analyses is set to rapidly increase in the near future.

Materials and Methods

Plant growth and sequencing

The WGS data used in this study were already published in Galanti et al. (2022) (32). Please refer to this publication for details on data generation and methods. Briefly, we collected 207 Thlaspi arvense accessions from 36 European populations in July 2018, and we grew their offspring in a glasshouse at University of Tübingen (48°32’21.3“N, 9°02’04.2“E) between August and October 2019. The glasshouse was located in biodiverse surroundings, and insects and pests could enter when the windows opened for temperature regulation. A few weeks after germination, we noticed aphid and mildew infestations. After 46 d we sampled the 3rd or 4th true leaf of each plant and snap-froze it in liquid nitrogen. We extracted DNA using the DNeasy Plant Mini Kit (Qiagen, Hilden, DE), sonicated (Covaris) 300 ng of genomic DNA and used the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs) to prepare the libraries. Half way through the protocol we split the DNA into 1/3 for genomic libraries and 2/3 for bisulfite libraries. For the bisulfite conversion we used the EZ-96 DNA Methylation-Gold MagPrep (ZYMO) kit. We sequenced paired-end for 150 cycles using Illumina NovaSeq 6000 (Illumina, San Diego, CA) for genomic libraries and HiSeq X Ten (Illumina, San Diego, CA) for bisulfite libraries.

Reads mapping and classification

Upon demultiplexing the raw reads, we used cutadapt (58) for quality (minimum quality of 20) and adaptor trimming, excluding reads shorter than 35 bp. We used FastQC and MultiQC (59,60) to estimate the duplication rate, and calculated the total deduplicated reads, which we later used for correcting the number of exogenous reads. We then classified the reads based on their mapping behavior. First we aligned reads to the T. arvense reference genome (29) with BWA-MEM v0.7.17 (61), excluding multimapping reads (-c 1) and marking duplicates with MarkDuplicatesSpark (62,63). We then mapped all samples again (61) to the three putative exogenous genomes of pea aphid (Acrophyson pisum), the aphid symbiont Buchnera aphidicola and powdery mildew (Blumeria graminis), using available resources (, (64). After this, we used a custom script to collect all read IDs within a sample mapping to any of the three exogenous genomes, and removed any of these reads from the T. arvense alignment bam files. We thus removed all ambiguous reads before proceeding with variant calling. To compare coverage of specific regions with and without ambiguous reads, we used samtools bedcov (65). The numbers of reads classified by their mapping behaviour are reported in S1 Table.

Variant calling

For variant calling we used GATK4 v4.1.8.1 (62,63), following the best practices for germline short variant discovery ( with few adjustments for large datasets (32). Briefly, starting from the bam files generated after the removal of ambiguous reads, we i) ran HaplotypeCaller, ii) combined the resulting GVCF files with GenomicsDBImport and GenotypeGVCFs and iii) filtered out low quality variants with VariantFiltration (see Galanti et al. (2022) (32) for more details). Finally, we used vcftools v0.1.16 (66) to retain biallelic variants with MAF > 0.01 and a maximum of 10% missing genotype calls. We imputed these missing calls with BEAGLE 5.1 (67) to obtain a complete multisample vcf file.

Identification and classification of exogenous reads

To identify exogenous reads, we extracted all unmapped reads from the bam files created aligning WGS reads to the T. arvense genome (29). We selected reads with both mates unmapped (SAM flag 12) and excluded supplementary alignments (SAM flag 256 after running MarkDuplicatesSpark) with samtools (65). We then recovered these reads from the trimmed fastq files with seqtk subset ( to obtain fastq files of unmapped reads only. We used these as input for MG-RAST (44,45), a web-based tool for phylogenetic analysis of metagenomes.

We ran MG-RAST mostly with default parameters, without assembled reads, excluding dereplicated sequences, and dynamically trimming reads with a minimum Phred score of 15 in more than 5 consecutive bases. We set the “sequence screening” to A. thaliana, the closest relative of T. arvense available. We used two different approaches to extract read counts. First we classified all reads up to family level using the web-based Analysis tool from MG-RAST. We used RefSeq as query annotation database and filtered reads classified with low confidence using default settings: e-value 5, 60 %-identity, length 15 and min.abundance of 1 (S2 Table). Out of the hundreds of taxonomic groups identified by MG-RAST, we selected only a small subset for follow-up analyses, based on their biological relevance, our visual observations and/or abundance: Aphididae, Culicidae, Peronosporales, Staphylocaccaceae, Burkholderiaceae, Mycobacteriaceae and Pseudomonadaceae (Table 1). Additionally, we used a custom Python script to download individual “taxonomy” or “sequence_breakdown” results from MG-RAST API (68) and extracted the counts of the genus Buchnera, including bacterial symbionts of many aphid species, and of the order Erysiphales, to quantify the observed mildew infection (Table 1). All the code for extracting counts for all families or specific taxonomic groups are available on GitHub (

In addition to the nine read groups selected from MG-RAST results, we also performed a highly confident mapping of exogenous reads to the M. persicae and B. aphidicola genome assemblies (50) (, to test whether mapping to a high quality assembly of the exact pathogen has a higher sensitivity than MG-RAST. We mapped with BWA-MEM v0.7.17 (61), using a seed length of 25 bp (69) and removing reads with MAPQ < 20 and duplicates with MarkDuplicatesSpark (62,63). We then counted all reads in the bam files.

Finally, we log transformed all read counts to approximate normality, and corrected for the total number of deduplicated reads by extracting residuals from the following linear model, log(Read_count + 1) ∼ log(Deduplicated_reads), which allowed us to quantify non-Thlaspi loads, accounting for the sequencing depth of each sample.

Exogenous reads heritability and species identification

To exclude the possibility that aphid and mildew infestation patterns were carried-over from the field, through seed contamination or maternal effects, we used aphid and mildew presence/absence data collected in the field. We found no difference in aphid or mildew loads between samples with and without aphids or mildew on the original parental plant in the field (S1 Fig). Nevertheless, to exclude a possible bias, we excluded one outlier sample with particularly high aphid load and aphids observed in the field (S1 Fig) from the analyses.

Even though MG-RAST classifies reads based on all taxonomic ranks, the accuracy of species identification of course strongly depends on the sequences available in the query databases. MG-RAST assigned our aphid reads to A. pisum, but this did not fit with our visual observations and with the poor performance of this species on Brassicaceae (70). We therefore selected three plausible aphid species and test which of these had mostly likely attacked our experiment. In addition to A. pisum, we tested two other aphid species commonly attacking Brassicaceae: Brevicoryne brassicae and Myzus persicae. While not all three species have reference genomes available, all mitochondrial genomes are available on NCBI (71) under accession numbers MN232006, NC_011594 and NC_056270. We downloaded these sequences, aligned them to each other (72), removed INDELs to retain only SNPs and combined them into a single pseudo-reference fasta file (S3 Table). We then mapped the exogenous reads from 40 randomly selected samples to this pseudo-reference, allowing for unique mappings only and counted the reads mapping to either of the three aphid species. We used the same approach for mildew except that we included only two possible species: Blumeria graminis, as suggested by MG-RAST, and Erysiphe cruciferarum which is known to attack Brassicaceae but was not in the MG-RAST query database and seemed plausible from visual inspection (Fig 2B). For the mildew pseudo-reference (S4 Table) we used the Internal Transcribed Spacer (ITS), which is publicly available for both species on NCBI (71) under accession numbers MT644878 and AF031283.

Quantification of glucosinolates

Using seed material collected from the sequenced plants, we conducted a follow-up experiment to estimate the glucosinolates (GS) content of all 207 lines in the absence of pathogens. Briefly, we sowed the seeds in petri dishes, stratified them at 4°C in the dark for two weeks and transplanted the germinated seedlings to individual 9 x 9 cm pots. We grew the plants in a growth chamber with a 14/10 h light/dark cycle at 21/17 °C and a relative humidity of ∼45%. Two weeks after germination the plants were vernalized at 4°C for two more weeks in order to minimize phenological and developmental differences between winter and summer annuals. Ten days after vernalization, we collected the 3rd or 4th true leaf and snap-froze it in liquid nitrogen. After freeze drying, we weighed all samples and extracted the material threefold in 80% methanol, adding p-hydroxybenzl glucosinolate (Phytoplan, Heidelberg, Germany) as internal standard. After centrifugation, we applied the supernatants onto ion-exchange columns with diethylaminoethyl (DEAE) Sephadex A25 (Sigma Aldrich, St.Louis, MO, USA) in 0.5 M acetic acid buffer, pH 5. We added purified sulfatase, converting glucosinolates to desulfo glucosinolates. After one day, we eluted desulfo glucosinolates in water and analyzed them on a HPLC coupled to a DAD detector (HPLC-1200 Series, Agilent Trechnologies, Inc., Santa Clara, CA, USA) equipped with a Supelcosil LC 18 column (3 μm, 150×3 mm, Supelco, Bellefonte, PA, USA). We analysed the samples with a gradient from water to methanol starting at 5% methanol for 6 min and then increased from 5 to 95% within 13 min with a hold at 95% for 2 min, followed by a column equilibration cycle. We identified different glucosinolates based on their retention times and UV spectra in comparison to respective standards and an in-house database. We integrated peaks at 229 nm and calculated respective glucosinolate concentrations in relation to the internal standard and sample dry mass, using response factors as reported by Agerbirk et al. (2015) (73).

Drivers of exogenous reads variation

To test for associations between glucosinolate variation, as well as climate of origin, and the observed pest loads, we extracted average bioclimatic variables for the 25 years predating our experiment for our 36 study populations from the Copernicus website (74), as described in Galanti et al. 2022 (32). We then used the R package “lme4qtl” (75) to run mixed models that included either bioclimatic variables or glucosinolate contents as explanatory variables, and the exogenous read counts as dependent variables, while correcting for population structure with the same IBS matrix as in GWA and EWA analyses (see below).

GWA analysis

We conducted GWA with mixed models that corrected for population structure with a genetic Isolation By State (IBS) matrix as a random factor, as implemented in GEMMA (76). To obtain the IBS matrix we used PLINK v 1.90b6.12 (77). Starting from the imputed multisample vcf file obtained from variant calling, we pruned variants with LD > 0.8 in 50 variants windows, sliding by five. To produce the genetic variants used for GWAS, we also started from the imputed multisample vcf file from variant calling and filtered out variants with MAF < 0.04. As phenotypes we used the number of exogenous reads corrected for the total number of deduplicated reads, as described above.

To validate our results and test for overlap with existing gene functional annotations, we performed an enrichment analysis of variants neighboring a priori candidate genes as described in Atwell et al. (2010) (52). Briefly, we attributed a priori candidate status to all variants located within 20 kb from orthologs (78) of A. thaliana genes annotated with the GO term “defense response” (GO:0006952), including nine genes similar to AtBSMT1 (S8 Table). We then calculated the enrichment of these variants compared to the background frequency and an upper bound for the FDR for increasing - log(p) thresholds (32,52). The code for these analyses is available on

Methylation and DMR calling

For the methylome analyses we used the EpiDiverse toolkit (79), specifically designed for large WGBS datasets. We used the WGBS pipeline ( for read mapping and methylation calling, retained only uniquely-mapping reads longer than 30 bp, and obtained individual-sample bedGraph files for each sequence context. We then called DMRs using the DMR pipeline (79), with a minimum coverage of 4x. We compared the 20 samples with the most and the least M. periscae and Eriysiphales loads, resulting in two sets of DMRs for each sequence context. Since this was only the first step of our methylation analysis, meant to identify potential regions of interest, we retained all DMRs with a FDR < 20%. To understand the genomic preferences of DMRs, we intersected them with genomic features and calculated their densities in each by dividing their number by the total Mb covered by each genomic feature in the genome.

EWA analysis

Following the DMR calling, we investigated methylation-phenotype relationships in more detail, using EWA. We ran EWA similarly to GWA, enabling the “-notsnp” option available in GEMMA (76), and correcting for population structure with the same IBS matrix. To exclude possible biases, we excluded all samples with a bisulfite non-conversion rate >1 (32). To generate the methylation input files we first used custom scripts ( (32) to unite individual-sample bedGraph files into unionbed files and retain positions with coverage > 3 in at least 95% of the samples and a methylation difference of at least 5% in at least two samples. We then intersected the unionbed files with the DMRs of the corresponding sequence context using bedtools (80) and converted unionbed to BIMBAM format as input for GEMMA.

We ran EWA for individual positions within the DMRs and calculated Bonferroni thresholds based on the number of DMRs, assuming that cytosines within the same DMR are mostly autocorrelated. To observe in which genomic features associations with lower P-values were located, we performed enrichment analyses similar to the ones performed for defense a priori candidate genes in GWA (52), but based on whole genomic features. Starting from all cytosines used for EWAS, we calculated the background frequency as the fraction of all cytosines located in each genomic feature and then calculated the observed frequency in the same way for -log(p) 0.5 increments, with enrichment as the ratio of observed and expected frequencies. All code used for EWA and the enrichment analysis in genomic features is available on

Code and data availability statement

The seed material from the sequenced lines is available at the Nottingham Arabidopsis Stock Centre (NASC) under stock numbers N950001 to 950204. Genomic and bisulfite sequencing raw data are available on the ENA Sequence Read Archive ( under study accession number PRJEB50950. The reference genome and annotations were previously published by Nunn et al. (2022) (29). GWA and EWA results in a format compatible with the Integrative Genomics Viewer ( are available on Zenodo (

All code used in this study is available and documented on GitHub. The scripts for variant calling, filtering and imputation are on, and the scripts for the classification of sequencing reads and MG-RAST analysis are in and respectively. The pipelines for methylation and DMR calling from WGBS data can be found on the EpiDiverse GitHub ( The workflow for downstream analysis of methylation data is on Finally, the scripts for running GWA and EWA analysis are on and respectively.


This work was supported by the EU Horizon 2020 program under Marie Skłodowska-Curie grant agreement 764965 (Innovative Training Network EpiDiverse;; PhD fellowship of DG), as well as by the Deutsche Forschungsgemeinschaft (DFG) as part of the priority programme 2125 Deconstruction and Reconstruction of the Plant Microbiota, “DECRyPT” (grant 401829393 to OB; PhD fellowship of JHJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


We thank the EpiDiverse network for its amazing support and discussions, in particular Adrián Contreras-Garrido, Bárbara Díez Rodríguez, Iris Sammarco, Adam Nunn, Daniela Ramos and Anupoma Troyee for their close collaboration. We also thank Frank Reis for his feedback and expert tips for the project, and Cecilia Heyworth for proofreading the manuscript. We thank Peter Stadler at the University of Leipzig and David Langenberger from ecSeq, which helped with computing and hosted the EpiDiverse servers. The BinAC cluster is managed by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, and supported by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 37/935-1 FUGG.

Author Contributions

Dario Galanti: Conceptualization, Methodology, Formal analysis, Investigation, Data Curation, Writing – Original Draft Preparation, Visualization.

Jun Hee Jung: Conceptualization, Methodology, Data Curation, Writing – Review & Editing.

Caroline Müller: Investigation, Writing – Review & Editing.

Oliver Bossdorf: Conceptualization, Methodology, Writing – Original Draft Preparation, Supervision, Funding Acquisition.