The Arabidopsis thaliana mobilome and its impact at the species level

Abstract

Transposable elements (TEs) are powerful motors of genome evolution yet a comprehensive assessment of recent transposition activity at the species level is lacking for most organisms. Here, using genome sequencing data for 211 Arabidopsis thaliana accessions taken from across the globe, we identify thousands of recent transposition events involving half of the 326 TE families annotated in this plant species. We further show that the composition and activity of the 'mobilome' vary extensively between accessions in relation to climate and genetic factors. Moreover, TEs insert equally throughout the genome and are rapidly purged by natural selection from gene-rich regions because they frequently affect genes, in multiple ways. Remarkably, loci controlling adaptive responses to the environment are the most frequent transposition targets observed. These findings demonstrate the pervasive, species-wide impact that a rich mobilome can have and the importance of transposition as a recurrent generator of large-effect alleles.

https://doi.org/10.7554/eLife.15716.001

Introduction

Transposable elements (TEs) are sequences that move and replicate around the genome. Depending on whether their mobilization relies on an RNA or DNA intermediate, they are classified as retrotransposons (class I) or DNA transposons (class II), respectively (Slotkin and Martienssen, 2007). TEs are further subdivided into distinct families, the prevalence of which differs between organisms because of a complex array of factors, including variable transposition activity and diverse selection pressures (Barrón et al., 2014). Given their mobile nature, TEs pose multiple threats to the physical and functional integrity of genomes. In particular, TEs can disrupt genes through insertion and also through excision in the case of DNA transposons. Thus, TE mobilization is a source of both germline and somatic mutations (Richardson et al., 2015; Barrón et al., 2014; Lisch, 2013). Although TEs are endogenous mutagens with potentially catastrophic effects, their mobilization might sometimes be beneficial. In fact, soon after their discovery, Barbara McClintock named TEs 'controlling elements' to emphasize their role in the control of gene action (McClintock, 1956). In mammals, transposition of some LINE 1 retrotransposons occurs extensively during embryogenesis as well as in the adult brain, again suggesting functional relevance of somatic TE mobilization (Richardson et al., 2014). Nonetheless, TEs are under tight control to limit their mutational impact both within and across generations. In plants and mammals, a major control is through epigenetic silencing mechanisms, including DNA methylation, and these mechanisms can in turn have 'epimutagenic' effects on adjacent genes (Slotkin and Martienssen, 2007; Weigel and Colot, 2012; Heard and Martienssen, 2014; Quadrana and Colot, 2016).

Despite the many documented short-term as well as evolutionary consequences of TE mobilization (Rebollo et al., 2012; Trono, 2016; Kazazian, 2004; Elbarbary et al., 2016; Bennetzen and Wang, 2014; Lisch, 2013), TEs are among the least investigated components of genomes, mainly because they are present in multiple, often degenerated copies, which complicate analysis. Thus, a species-wide view of the mobilome - i.e. of the set of TE families with transposition activity - is lacking for most organisms.

Studies in humans suggest that although at least half of the 3 Gb genome is made up of TE sequences, mainly belonging to LINE 1 and SINE retrotransposon families, only a few of these, and none of the other TE families, contain mobile copies (Richardson et al., 2015). In contrast, the number of TE families that have retained transposition activity is much larger in the mouse and these include several so-called endogenous retroviruses (ERVs) in addition to LINEs and SINEs (Richardson et al., 2015). In Drosophila, which has a much smaller genome (∼120 Mb) characterized by a large repertoire of class I and class II TE families, the situation is again different with most TE families likely mobile (Mackay et al., 2012; Cridland et al., 2013; Robert et al., 2015; Rahman et al., 2015). However, in this species and even more so in mammals, the population genetics of the mobilome remains poorly characterized.

The flowering plant A. thaliana is particularly attractive for conducting a systematic survey of the mobilome and of its molecular as well as phenotypic impact at the species level. First, like Drosophila, A. thaliana has a compact genome and a large repertoire of class I and class II TE families (Arabidopsis Genome Initiative, 2000; Ahmed et al., 2011; Joly-Lopez and Bureau, 2014). Thus, most TE families are of relatively small size, which facilitates their study. Second, A. thaliana occupies a wide range of habitats across the globe and representative accessions have been extensively characterized both genetically and phenotypically (Weigel and Nordborg, 2015). Third, whole genome sequencing has been performed for >1000 A. thaliana accessions and DNA methylome as well as transcriptome data are also available for hundreds of these (Schmitz et al., 2013; Long et al., 2013; Dubin et al., 2015; Cao et al., 2011). Finally, genome-wide association studies (GWASs) are straightforward in this species (Weigel and Nordborg, 2015).

Here we present a comprehensive assessment of the A. thaliana mobilome, which radically changes the prevailing view of limited transposition potential in this species and provides important novel insights into the population genetics of TE mobilization (Hu et al., 2011; Maumus and Quesneville, 2014). Specifically, we show that the A. thaliana mobilome is composed of a very large number of class I and class II TE families overall, but differs extensively among accessions. We further show that TE mobilization is a complex trait and we have identified environmental as well as genetic factors that influence transposition in nature. These factors include the annual temperature range, the TEs themselves and multiple gene loci, notably MET2a, which encodes a poorly characterized DNA methyltransferase. In addition, we present compelling evidence that TEs insert throughout the genome with no overt bias and that the mobilome has a pervasive impact on the expression and DNA methylation status of adjacent genes. These and other observations indicate that purifying selection is most probably the main factor responsible for the differential accumulation of TE sequences along the A. thaliana genome and notably their clustering in pericentromeric regions. Finally, we reveal the importance of the mobilome as a generator of large-effect alleles at loci underlying adaptive traits. Collectively, our approaches and findings provide a unique framework for detailed studies of the dynamics and impact of transposition in nature.

Results

Composition of the A. thaliana mobilome

The reference genome sequence of A. thaliana is 125 Mb long (TAIR 10) and contains ~32,000 mostly degenerate TE copies that belong to 326 distinct families (Arabidopsis Genome Initiative, 2000; Ahmed et al., 2011). So far, transposition activity has been documented experimentally for nine TE families, mainly on the basis of studies carried out in the reference accession Col-0 (Ito and Kakutani, 2014; Tsay et al., 1993). To assess species-wide the composition of the A. thaliana mobilome, we used publicly available Illumina short genome sequence reads (Schmitz et al., 2013; Schneeberger et al., 2011). First, we looked for TE copy number variation (CNV) between the reference accession Columbia (Col-0) and 211 accessions taken from across the globe. To limit the problem posed by the presence of TEs in multiple copies across the genome, with varying degrees of similarity to each other, we performed an aggregated CNV analysis based on the 11,851 annotated Col-0 TE sequences longer than 300 bp (see ‘Materials and methods’). CNVs were detected for 263 TE families (Figure 1A and B; Figure 1—source data 1; see ‘Materials and methods’), in keeping with the results of a previous study indicating that the vast majority of the TE sequences annotated in the Col-0 reference genome are absent from that of at least one of 80 accessions analyzed (Cao et al., 2011).
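
The aggregated, coverage-based comparison described above can be illustrated with a minimal sketch. It sums normalized read depth over all annotated copies of one TE family and reports the log2 ratio relative to Col-0. The function name, the pseudocount and the toy coverage values are assumptions made for illustration only; the normalization and statistical testing actually used are those given in ‘Materials and methods’.

```python
import math

def family_log2_ratio(acc_cov, ref_cov, pseudocount=0.5):
    """Aggregate per-copy normalized coverage for one TE family and
    return log2(accession / reference).

    The pseudocount (an assumption of this sketch) avoids log2(0) when
    an accession lacks every copy of the family.
    """
    acc_total = sum(acc_cov) + pseudocount
    ref_total = sum(ref_cov) + pseudocount
    return math.log2(acc_total / ref_total)

# Toy data: two annotated copies, each at 1x normalized depth in Col-0.
col0 = [1.0, 1.0]
gain = [2.0, 2.0]   # hypothetical accession with doubled coverage
loss = [0.0, 0.0]   # hypothetical accession lacking both copies

print(family_log2_ratio(gain, col0))  # positive: copy number gain
print(family_log2_ratio(loss, col0))  # negative: copy number loss
```

Aggregating over the family rather than scoring each copy separately sidesteps the ambiguity of assigning reads to near-identical copies, at the cost of losing per-copy resolution.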

Overview of the *A. thaliana* mobilome.

(A) Genome browser tracks showing normalized sequencing coverage over the two full-length *ATCOPIA31* elements annotated in the reference genome (Col-0). CNV is detected as increased or decreased coverage in other accessions. Number of copies is indicated on the right. (B) Heat map representing CNVs (log2 ratio) for 317 TE families and 211 *A. thaliana* accessions. TE families with statistically significant CNV in at least one accession are indicated. Figure 1—source data 1 contains absolute copy number estimation of TE sequences. (C) Schematic representation of the bioinformatics pipeline to identify non-reference TE insertions with TSD using split-reads. 1- Reads are mapped on a collection of TE extremities from annotated TE sequences and reference sequences (Repbase update). 2- Reads aligning partially over TE extremities are extracted and clipped. 3- The unmapped portion of these split-reads are re-mapped on the Arabidopsis reference genome. 4- Non-reference TE insertions with TSDs are identified by searching for overlapping clusters of 5’ and 3’ split-reads. (D) Genome browser tracks showing split-reads for two non-reference *ATCOPIA31* insertions and TSD reconstruction. Figure 1—source data 2 contains the coordinates of all non-reference TE insertions with TSDs. (E) Distribution frequency of allele counts for non-reference TE insertions with TSDs. (F) Number of mobile TE families per accession identified using split-read and TE-sequence capture. (G) Cumulative plot of the number of mobile TE families detected with increasing numbers of accessions. (H) The total number of non-reference TE insertions with TSDs is indicated in relation to the number of accessions with such insertions, for each of the 131 mobile TE families. Asterisks indicate the nine TE families with experimental evidence of transposition (Ito and Kakutani, 2014; Tsay et al., 1993). 
Figure 1—source data 3 contains the total number of distinct non-reference TE insertions with TSD for each TE family and super-family. Figure 1—figure supplement 2 shows TE-capture results. Figure 1—figure supplement 1 contains IGV screenshots showing the pattern of split-reads characteristic of true- and false-positive non-reference TE insertions with TSDs.

https://doi.org/10.7554/eLife.15716.002

Figure 1—source data 1 Copy number estimation of TE sequences. (A) Copy number estimation based on read coverage for the 317 TE families analyzed across 211 A. thaliana accessions collected worldwide. Column descriptions are provided in (B).: https://doi.org/10.7554/eLife.15716.003
Figure 1—source data 2 Coordinates of non-reference TE insertions with TSDs. (A) Coordinates and presence or absence call (1 and 0, respectively) across the 211 A. thaliana accessions. Description of columns is provided in (B).: https://doi.org/10.7554/eLife.15716.004
Figure 1—source data 3 Number of distinct non-reference TE insertions with TSDs identified by the split-reads approach for each TE family and super-family.: https://doi.org/10.7554/eLife.15716.005
Figure 1—source data 4 TE insertions with TSDs present in Col-0 but absent in Ler-1. (A) Genomic coordinates of the insertion in Col-0 and of the corresponding empty site in Ler-1. Description of columns is provided in (B).: https://doi.org/10.7554/eLife.15716.006

Since CNVs could reflect either recent TE mobilization or the gain or loss of TE copies through other types of chromosomal rearrangements, we then looked among the unmapped Illumina short reads for so-called 'split-reads' that contain TE extremities. Crucially, because most TE families generate short target site duplications (TSDs) of fixed size upon insertion, TSDs can serve as signatures of bona fide transposition events. We therefore developed a pipeline for the systematic identification of split-reads covering TE junctions that are absent from the reference genome and that produce, when mapped to the insertion site, a sequence overlap of the size of TSDs (3 to 15 bp, depending on the TE family, Figure 1C and D; see ‘Materials and methods’). Our pipeline differs in that respect from that used in another study to detect the presence/absence of reference and non-reference TE insertions in the same set of accessions (Stuart et al., 2016). Results produced by our pipeline for the 292 annotated TE families that create TSDs upon transposition were verified visually to eliminate false positives (Figure 1—figure supplement 1; see ‘Materials and methods’). Following this approach, non-reference TE insertions with TSDs were identified for 131 TE families in total (Figure 1—source data 2), which all also show CNV (Figure 1—source data 1).
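
The TSD criterion can be sketched as follows. Because the target site is duplicated on both sides of a new insertion, the genomic portion of a read spanning the left junction ends a few bases after the position where the genomic portion of a read spanning the right junction begins, and the size of that overlap equals the TSD length. The coordinate convention, the function name and the toy positions below are assumptions for illustration; the actual pipeline clusters reads and applies family-specific TSD sizes (see ‘Materials and methods’).

```python
def call_tsd_insertions(left_ends, right_starts, tsd_min=3, tsd_max=15):
    """Pair split-read junctions on one chromosome into candidate insertions.

    left_ends:    1-based positions where the genomic part of genome->TE
                  split-reads stops matching the reference.
    right_starts: 1-based positions where the genomic part of TE->genome
                  split-reads starts matching the reference.
    Returns (tsd_start, tsd_end, tsd_length) tuples whose overlap falls
    within the accepted TSD size range.
    """
    calls = []
    for end in sorted(set(left_ends)):
        for start in sorted(set(right_starts)):
            overlap = end - start + 1
            if tsd_min <= overlap <= tsd_max:
                calls.append((start, end, overlap))
    return calls

# Toy data: two reads support a left junction ending at 1004 and one read
# supports a right junction starting at 1000, giving a 5 bp overlap.
print(call_tsd_insertions([1004, 1004, 5000], [1000, 7000]))
```

Junctions without a partner at a compatible distance (positions 5000 and 7000 in the toy data) produce no call, which is how the TSD requirement filters out rearrangements that are not transposition events.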

Most (86%) non-reference TE insertions with TSDs are private or shared only by a few accessions and thus they typically correspond to recently derived alleles, as expected (Figure 1E). Moreover, recent transposition activity is only detected for between four and 66 TE families in any given accession, thus indicating large variations in the composition of the mobilome among accessions (Figure 1F). Nonetheless, we have probably identified most of the annotated TE families that compose the mobilome at the species level, because the number of TE families defined as mobile by the split-reads approach reaches a plateau after examining just 74 accessions (Figure 1G). The 53 class I COPIA families and the 40 class II Mutator-like (MuDR) families are the most mobile, as they account for 1408 and 729 of the 2835 non-reference TE insertions with TSDs identified in total, respectively (Figure 1H and Figure 1—source data 2 and 3). However, the number of non-reference insertions per accession is always small (<16) for any given family (Figure 1—source data 2), thus suggesting a lack of recent transposition bursts.
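
As an illustration of how an allele count distribution such as that in Figure 1E can be derived from a presence/absence table, the sketch below tallies the number of accessions carrying each insertion. The data layout, identifiers and values are hypothetical and do not reproduce the format of Figure 1—source data 2.

```python
from collections import Counter

def allele_count_spectrum(calls):
    """calls: dict mapping insertion id -> list of 0/1 presence calls,
    one per accession. Returns a Counter of allele count -> number of
    insertions with that count."""
    return Counter(sum(row) for row in calls.values())

# Hypothetical calls for three insertions across four accessions.
calls = {
    "ins_a": [1, 0, 0, 0],   # private
    "ins_b": [0, 1, 0, 0],   # private
    "ins_c": [1, 1, 0, 1],   # shared by three accessions
}
spectrum = allele_count_spectrum(calls)
private_fraction = spectrum[1] / sum(spectrum.values())
print(spectrum, private_fraction)
```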

The ability to detect non-reference TE insertions with TSDs using split-reads is strongly dependent on read depth as well as sequence composition at the insertion site (Figure 1—figure supplement 2A; Hénaff et al., 2015). To assess the extent of this limitation, we used the assembled Ler-1 genome sequence recently obtained using PacBio long reads (see ‘Materials and methods’). Although not annotated, this sequence assembly can serve to identify by whole genome comparison the Col-0 TEs flanked by TSDs that are absent from the corresponding position in the Ler-1 genome (see ‘Materials and methods’). A total of 142 TEs belonging to 80 distinct families were identified in this way (Figure 1—source data 4), which is consistent with estimates obtained using other approaches (Ziolkowski et al., 2009; Hénaff et al., 2015). In contrast, we could detect only 78 Col-0-specific TEs with TSDs belonging to 49 TE families when using the split-reads pipeline to map Col-0 short reads onto the assembled Ler-1 genome. These results indicate therefore that the split-reads approach has a low sensitivity (Figure 1—figure supplement 1C; 45% false negatives and 10% false positives; False Discovery Rate: 15.3%).
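
The false-negative estimate follows from the two counts given above, as the short calculation below shows. It makes the simplifying assumption that every split-read call is also part of the whole-genome comparison set; the false-positive and False Discovery Rate figures depend on the call-by-call comparison reported in the figure supplement and are not recomputed here.

```python
reference_set = 142  # Col-0 TEs with TSDs absent from Ler-1 (genome comparison)
detected = 78        # Col-0-specific TEs with TSDs found by the split-reads pipeline

# Simplifying assumption: all 78 split-read calls belong to the 142-TE set.
sensitivity = detected / reference_set
false_negative_rate = 1 - sensitivity

print(f"sensitivity ~ {sensitivity:.0%}, false negatives ~ {false_negative_rate:.0%}")
```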

To obtain an independent estimation of the composition of the mobilome, we also performed TE sequence capture (TE-capture; Baillie et al., 2011). Briefly, probes were designed to cover the 5’ and 3’ extremities of 310 TE elements belonging to 181 distinct families, including 117 of the 131 TE families identified as mobile with the split-reads approach (see ‘Materials and methods’). Using genomic DNA extracted from 12 randomly chosen accessions (Figure 1—figure supplement 2D), we could validate by TE-capture most (87%) of the non-reference TE insertions with TSDs that were detected by the split-reads approach (Figure 1—figure supplement 2F; see ‘Materials and methods’). As expected, TE-capture also uncovered many additional non-reference TE insertions with TSDs for the same TE families (Figure 1—figure supplement 2F). However, no such insertions were detected for the other TE families that could be captured but which were not identified as mobile by the split-reads approach in any of the 12 accessions. These results confirm that despite the low sensitivity of the latter, we have probably identified most of the TE families with TSDs that compose the A. thaliana mobilome at the species level. Finally, non-reference insertions were also identified for 30 TE families (including 15 HELITRON families) that could not be analyzed using our split-reads pipeline because they do not produce TSDs or have insertion sites located in low complexity regions (Figure 1—figure supplement 2B). Since most of the non-reference insertions for these 30 TE families are present in only one or two of the 12 accessions examined (Figure 1—figure supplement 2G), they likely reflect recent transposition events. Thus, there are altogether at least 165 TE families with recent transposition activity at the species level. 
Moreover, based on the TE-capture data, we can estimate that since they diverged from each other any two accessions have accumulated between ~200 and ~300 newly transposed TE copies (Figure 1—figure supplement 2H).

TE mobilization as a complex trait

The observation that the composition of the mobilome differs extensively between accessions (Figure 1F) suggests that it is influenced by environmental and genetic factors. To try to identify such factors, we first established that copy number (CN) correlates positively with the number of TE insertions with TSDs that are detected by TE-capture (Figure 2—figure supplement 1; see ‘Materials and methods’). Thus, CNV is a reliable and quantitative estimator of differential TE mobilization between accessions, which we used to analyze the 113 TE families that were defined as mobile based both on the split-reads approach and TE-capture (Figure 2—source data 1).

Controlling for population stratification and considering thirteen geo-climatic variables (Hancock et al., 2011), we uncovered robust correlations with CN for 15 class I and class II TE families. Among these, ATCOPIA2 and ATCOPIA78 share the highest number of geo-climatic variables correlated with CN (Figure 2—figure supplement 2). Moreover, the strongest correlation is between temperature annual range and CN for ATCOPIA78 (Figure 2A and B). Given that at least one member of this TE family is transcriptionally induced by heat shock in the Col-0 accession (Ito et al., 2011; Cavrak et al., 2014), ATCOPIA78 provides a compelling case of a causal link between climate and TE mobilization.
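
For readers unfamiliar with the partial Mantel test used in this kind of analysis, a minimal sketch is given below. It correlates two pairwise distance matrices (here CN and a climate variable) after regressing out a third (standing in for population structure), and assesses significance by permuting accessions. All function names and the toy values are assumptions for illustration; this is not the implementation used in the study.

```python
import random

def upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def residuals(y, x):
    """Residuals of a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def partial_mantel(a, b, c, n_perm=999, seed=0):
    """Correlation between distance matrices a and b, controlling for c.
    Returns (r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    cv = upper(c)
    rb = residuals(upper(b), cv)

    def stat(order):
        # Permute rows and columns of a together, then residualize on c.
        permuted = [[a[i][j] for j in order] for i in order]
        return pearson(residuals(upper(permuted), cv), rb)

    n = len(a)
    observed = stat(list(range(n)))
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        if stat(order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def dist(values):
    """Pairwise absolute differences between accessions."""
    return [[abs(u - v) for v in values] for u in values]

# Hypothetical values for eight accessions.
temp_range = [18.0, 22.0, 25.0, 30.0, 33.0, 36.0, 40.0, 41.0]
cn = [2.0, 3.0, 3.0, 5.0, 6.0, 6.0, 8.0, 9.0]
structure = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]

r, p = partial_mantel(dist(cn), dist(temp_range), dist(structure))
print(r, p)
```

Permuting whole accessions, rather than individual matrix entries, preserves the dependence between distances that share an accession, which is the reason a Mantel-type test is preferred over an ordinary correlation test on the flattened matrices.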

Environmental and genetic factors associated with differential mobilome activity.

(A) Copy number (CN, red circles) of *ATCOPIA78* in accessions distributed across the globe. Annual temperature range is also shown. (B) Partial Mantel correlation between *ATCOPIA78* CN and annual temperature range. (C) Fraction of CNV variance explained by SNPs (*cis* and *trans*) and partial Mantel correlation with geo-climatic variables. (D) Distribution of *cis* and *trans* loci in the joint analysis (391 accessions) and number of TE families associated with a given *trans* locus. A complete list of the GWAS results is provided in Supplementary file 1. (E) Manhattan plots displaying GWAS results for the seven TE families with a *MET2a* association. The leading SNP within each interval is indicated as a red diamond. Colors indicate the extent of linkage disequilibrium (r²) with the leading SNP. (F) Schematic view of the MET2a protein (TD: targeting domain; BAH: bromo adjacent homology domain) and sequence alignment of the TD. The amino acid substitution (G519E) that is present in some accessions is indicated (red arrow). (G) Average DNA methylation level over non-mobile, mobile and *MET2a*-associated TE families in WT and *met2a* Col-0 seedlings (Stroud et al., 2013). Statistically significant differences are indicated (MWU test). Figure 2—figure supplement 1 shows the positive correlation between CN and number of non-reference TE insertions with TSDs. Figure 2—figure supplement 2 shows climate association to CNVs. Figure 2—figure supplement 3 shows GWAS results for CNVs.

https://doi.org/10.7554/eLife.15716.009

Figure 2—source data 1 Copy number estimation used for the geo-climatic associations and GWASs. (A) Copy number estimation based on read coverage for the 131 mobile TE families analyzed across 211 A. thaliana accessions collected worldwide. (B) Copy number estimation based on read coverage for the same 131 mobile TE families across 180 A. thaliana accessions from Sweden.: https://doi.org/10.7554/eLife.15716.010

We next explored the possibility of using GWASs to identify genetic variants influencing TE mobilization (see ‘Materials and methods’). For 33 TE families, a disproportionately large number of SNPs are associated with CNV, preventing further analysis. For the remaining 80 TE families, SNPs in linkage disequilibrium with each other and associated with CNVs delineate 230 loci. Moreover, 34% of these loci are also identified by GWAS using whole genome sequencing data obtained for another 180 accessions taken from Sweden (Long et al., 2013) (Figure 2—figure supplement 3A). This substantial overlap suggests a similar genetic architecture for the A. thaliana mobilome at both the local and global scales, which prompted us to perform joint GWASs using all 391 accessions in order to increase both sensitivity and specificity. Depending on the TE family, GWASs identified from 0 to 33 loci and collectively, associations explain between 2% and 67% of the variance in CN (Figure 2C).

Among the 334 loci detected in total, 130 encompass sites with reference or non-reference TE insertions (Figure 2C and Supplementary file 1). Furthermore, each of these local (cis) genetic variants explains on average 5.2% of the total variance compared with 2.2% for distal (trans) genetic variants. The higher explanatory power of cis variants is of course to be expected, as the TEs themselves are the primary determinants of the transposition process. Indeed, almost all cis SNPs that map to TE sequences in the reference genome are likely causal as they affect sequences involved in transposition, for example the long terminal repeats (LTRs) and the various open reading frames of LTR retrotransposons (Figure 2—figure supplement 3B and C).

While cis loci collectively explain most CN variance for class I TE families, this is not the case for class II TE families (Figure 2—figure supplement 3D). Given that class II TEs move by a cut and paste mechanism, some trans loci could in fact correspond to sites of excision. However, we could not find evidence of excision footprints, such as small insertions or deletions. Alternatively, the larger fraction of CN variance explained by trans loci for class II TE families may in part result from many of these families being non-autonomous, i.e. requiring factor(s) encoded by other TEs for their mobilization. Consistent with this possibility, the proportion of CN variance associated with trans loci as well as the number of TE annotations overlapping trans loci are higher for non-autonomous than autonomous class II TE families (Figure 2—figure supplement 3E). Although we did not investigate trans mobilization in depth, we readily identified one probable case, involving the non-autonomous and autonomous MuDR families ATDNA1T9A and VANDAL16, respectively (Figure 2—figure supplement 3F). Of note, CN does not co-vary between these two families, which could indicate that their transposition is differentially controlled. Finally, 'false' trans loci could also be caused by non-reference TE insertions that are sufficiently frequent to be in linkage disequilibrium with SNPs used for the GWASs but that we have failed to detect. However, such trans loci are not expected to be more prevalent for class II than class I TE families and should be very rare in any case given that the probability of missing moderately frequent (>5%) non-reference TE insertions by both the split-reads approach and TE-capture is low (Figure 2—figure supplement 3G).

Since transposition is controlled by multiple protein activities in addition to those encoded by the TEs themselves, we also examined genes located within or adjacent to trans loci. Overall, these genes do not appear to be enriched for any particular function and most of them are specific to a single TE family (Figure 2D and Figure 2—figure supplement 3H). These observations indicate either a complex genetic architecture of mobilome variation and/or spurious trans associations such as those considered above. Nonetheless, 22 trans loci stand out as they show association with CNV for two or more TE families (Figure 2D) and a causal link is evident in two cases. Indeed, the locus associated with CNV for respectively the retrotransposon and DNA transposon families ATGP2 and ATENSPM2 encodes the transcription factor ARF23, which recognizes motifs that are overrepresented in the sequence of these TEs. The second locus is associated with CNV for the largest number of TE families (four class I and three class II families, Figure 2E) and encodes the MET2a protein, a poorly characterized homolog of the main DNA maintenance methyltransferase MET1. Moreover, one of the MET2a SNPs is presumably causal as it leads to a non-synonymous amino-acid substitution (G519E) in a conserved domain of the protein (Figure 2F) that in the mammalian homolog Dnmt1 is required for the targeting to replication foci (Klein et al., 2011). A role for MET2a is also supported by the observation that met2a mutant plants (Stroud et al., 2013) lose some DNA methylation exclusively over mobile TE families. Furthermore, loss of methylation is more pronounced when only considering the seven TE families that show a MET2a association (Figure 2G). Intriguingly, CHG sites (where H=A, T or C), which are poor substrates for MET1 or Dnmt1 compared to CG sites (Law and Jacobsen, 2010), are the most affected in the met2a mutant. 
Whether or not this observation reflects an atypical recognition specificity for MET2a remains to be determined. Finally, we note that GWASs failed to detect any association with genes known to be involved in the epigenetic silencing of TEs (Ito and Kakutani, 2014) such as MET1 and DDM1, presumably because of their essential function.

Genome localization of newly inserted TEs

In A. thaliana as in many other eukaryotes, TE sequences tend to cluster in pericentromeric regions (Arabidopsis Genome Initiative, 2000). Mechanistically, such clustering may result from insertion bias, selective constraints or differential elimination of TE copies through ectopic homologous recombination (Barrón et al., 2014). To distinguish between these possibilities, we looked at the genomic location of the 2835 non-reference TE insertions with TSDs detected with the split-reads approach and found that they are distributed almost evenly along chromosomes (Figure 3A). Since a similar distribution is observed for the non-reference TE insertions with TSDs detected exclusively using TE capture (Figure 3—figure supplement 1A), we can rule out an ascertainment bias of the split-reads approach towards non-reference TE insertions located along the chromosome arms. However, there is a clear trend towards a more pericentromeric localization when only considering non-reference TE insertions with TSDs that are shared by two or more accessions and that are thus presumably more ancestral (Figure 3B and Figure 3—figure supplement 1B). Moreover, the density of non-reference TE insertions with TSDs positively correlates with the recombination rate but negatively with gene density (Figure 3—figure supplement 1C, D and E). Finally, except for COPIA families, non-reference TE insertions with TSDs are globally under-represented within genes, where they are expected to be most detrimental (Figure 3C and Figure 3—figure supplement 1F and Figure 3—figure supplement 2A). Collectively, these observations provide strong evidence that TEs insert equally throughout the genome and are preferentially purged over time from the chromosome arms because of their deleterious effects on adjacent genes rather than as a consequence of ectopic homologous recombination.

Genomic localization of non-reference TE insertions.

(A) Density of non-reference TE insertions with TSDs (blue) and of annotated TE sequences (red) along the reference sequence of chromosome 1. Inner pericentromeric regions are masked. (B) Fraction of private and shared non-reference TE insertions with TSDs and of annotated TE sequences in outer pericentromeric regions. Statistically significant differences are indicated (chi square test). (C) Observed/expected ratio (O/E) of private non-reference TE insertions with TSDs in and around genes. Error bars are defined as 95% confidence intervals. (D) Cumulative distribution of gene expression ratios between alleles harboring and lacking non-reference TE insertions. Statistically significant differences were calculated using the KS test. (E) As D, but only for *COPIA* (green) or *MuDR* (red) non-reference TE insertions with TSDs. Figure 3—figure supplement 1 shows detailed analysis of the distribution of non-reference TE insertions with TSDs along the genome. Figure 3—figure supplement 2 shows local TE insertion preferences. Figure 3—figure supplement 3 shows global expression levels of genes affected by non-reference TE insertions. Figure 3—figure supplement 4 shows expression levels of genes affected in some accessions by a non-reference insertion with TSD in plants grown under control conditions or subjected to heat stress.

https://doi.org/10.7554/eLife.15716.014

Although most TE families show no overt insertion bias at the genome scale, there are clear local insertion preferences. In agreement with previous observations (Fu et al., 2013; Miyao et al., 2003), private non-reference COPIA and MuDR insertions with TSDs are enriched at coding sequences and transcriptional start sites (TSS), respectively (Figure 3—figure supplement 2A). In addition, insertion sites for most TE superfamilies are enriched in specific DNA sequence motifs or exhibit biased sequence composition (Figure 3—figure supplement 2B and C; Supplementary file 2). For example, LINEs tend to insert within poly(A) tracts, as expected for this superfamily of non-LTR retrotransposons, which integrate into the genome via poly(A)-dependent, target site–primed reverse transcription (TPRT; Beck et al., 2011).

Impact of newly inserted TEs on the expression of adjacent genes

Transcriptome analyses in the reference accession Col-0 have revealed that A. thaliana genes nearest to TE sequences are expressed at lower levels compared with the genome-wide distribution of gene expression, suggesting that TE insertions tend to reduce the expression of neighboring genes (Hollister and Gaut, 2009). To investigate more directly the impact of TEs on the genes within or near which they insert, we examined RNA-seq data available for 144 accessions (Schmitz et al., 2013). Specifically, we considered all non-reference TE insertions with TSDs and calculated for each gene located within 1 kb of them (1616 genes in total), the ratio between the expression level in the accession(s) harboring the insertion and the median expression level in the accessions devoid of the insertion. Expression ratios expected under the null hypothesis (no effect of the TE insertions) were calculated by taking 10⁶ randomly chosen sets of 1616 genes and assigning for each set the TE insertion ‘presence/absence’ label randomly among the 144 accessions (see ‘Materials and methods’). Comparison of the distribution of the observed and expected expression ratios indicates that for a large fraction of genes, expression is indeed significantly altered when TEs insert within or near them (Figure 3D, p<1.9×10^-5). These alterations are most pronounced for the COPIA insertions, which are overrepresented in genes and less pronounced for the MuDR insertions, despite the latter being overrepresented around the TSS of genes (Figure 3E and Figure 3—figure supplement 2A). Although other TE superfamilies show similar trends, we could not draw firm conclusions in these cases because of insufficient statistical power (Figure 3—figure supplement 3). This notwithstanding, it is clear that TE insertions induce both increases and decreases in gene expression with equal frequency (Figure 3D and E).
Thus our findings contradict the prevailing view of a dampening effect of TE insertions on the expression of adjacent genes (Hollister and Gaut, 2009) and suggest instead a stronger selection against TE insertions when they occur close to highly expressed genes.
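
The expression ratio and its label-permutation null described in this section can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the toy expression values, the use of the mean to summarize several carrier accessions, and the two-sided empirical p-value are all hypothetical choices, and the sketch shuffles the presence label for a single gene rather than resampling 10⁶ sets of 1616 genes.

```python
import math
import random
from statistics import mean, median


def expression_ratio(expr, carriers):
    """Expression in carrier accessions relative to non-carriers.

    expr: dict mapping accession name -> expression level of one gene
    carriers: set of accessions harboring the TE insertion
    Assumption: several carriers are summarized by their mean.
    """
    with_te = [v for acc, v in expr.items() if acc in carriers]
    without_te = [v for acc, v in expr.items() if acc not in carriers]
    return mean(with_te) / median(without_te)


def permuted_ratios(expr, n_carriers, n_perm=1000, seed=0):
    """Null ratios obtained by assigning the presence label at random."""
    rng = random.Random(seed)
    accessions = sorted(expr)
    return [
        expression_ratio(expr, set(rng.sample(accessions, n_carriers)))
        for _ in range(n_perm)
    ]


def empirical_p(observed, null):
    """Two-sided empirical p-value on the log-ratio scale (add-one corrected)."""
    obs = abs(math.log(observed))
    hits = sum(abs(math.log(r)) >= obs for r in null)
    return (hits + 1) / (len(null) + 1)


# Toy data (hypothetical): one gene, four accessions, insertion in acc4 only.
toy_expr = {"acc1": 10.0, "acc2": 12.0, "acc3": 11.0, "acc4": 2.0}
observed = expression_ratio(toy_expr, {"acc4"})  # 2.0 / median(10, 12, 11)
null = permuted_ratios(toy_expr, n_carriers=1, n_perm=200)
p_value = empirical_p(observed, null)
```

Working on the log scale treats a two-fold increase and a two-fold decrease symmetrically, which matters here because insertions are reported to shift expression in both directions.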

To complement the re-analysis of transcriptome data, we also measured by RT-qPCR the expression level of 19 genes with recent COPIA or MuDR insertions, using nine different accessions grown under control conditions or subjected to heat shock. COPIA insertions were found to have more dramatic and systematic effects on gene expression in stressed plants (Figure 3—figure supplement 4), which in the case of ATCOPIA78 can be related to its transcriptional sensitivity to heat shock (Ito et al., 2011; Cavrak et al., 2014). On the other hand, we could not detect any effect of the MuDR insertions under the two conditions tested (Figure 3—figure supplement 4). These findings are in agreement with those of the transcriptome analysis and indicate in addition that the effect of TE insertions on the expression of adjacent genes can vary substantially in relation to the environment.

Impact of newly inserted TEs on the DNA methylation status of adjacent sequences

TE sequences are typically targeted by the RNA-directed DNA methylation (RdDM) machinery in A. thaliana (Lippman et al., 2004; Lister et al., 2008; Cokus et al., 2008) and we have previously provided genome-wide evidence that DNA methylation can spread from RdDM targets to flanking sequences, with possible consequences on gene expression (Ahmed et al., 2011). To investigate the effect of new TE insertions on the DNA methylation status of adjacent sequences, we used MethylC-Seq data available for 140 accessions (Schmitz et al., 2013). Analysis of this data set first indicated that mobile TE families have on average higher CG, CHG and CHH methylation than non-mobile TE families (Figure 4A). Furthermore, DNA methylation is also higher for most mobile TE families in the accessions with evidence of recent transposition activity (Figure 4—figure supplement 1A). These observations prompted us to examine in addition methylome data obtained for several mutation accumulation (MA) lines (Becker et al., 2011; Schmitz et al., 2011). Mobile TE families suffer less sporadic DNA methylation loss than non-mobile families (Figure 4B). These findings are entirely consistent with DNA methylation playing an important role in the control of TE mobility and they suggest in turn that most of the recent TE insertions we have identified are present in the methylated state. Moreover, given that DNA methylation is likely established over newly inserted TE copies by RdDM in a progressive manner across multiple generations (Teixeira et al., 2009; Marí-Ordóñez et al., 2013), unmethylated non-reference TE insertions should be mainly private and reflect very recent transposition events.

Figure 4

DNA methylation of non-reference TE insertion sites.

(A) Boxplot representation of average DNA methylation level for mobile and non-mobile TE families across all accessions. (B) O/E ratio of spontaneous DMRs identified in mutation accumulation lines (Becker et al., 2011; Schmitz et al., 2011) for non-mobile and mobile TE families. Statistically significant differences were calculated using a chi-square test. (C) Average DNA methylation level in 50 bp windows upstream and downstream of 1543 insertion sites for accessions lacking or containing a given non-reference TE insertion with TSD. (D) Genome browser tracks showing examples of insertion sites respectively associated with short- and long-distance DNA methylation. (E) Meta-analysis of DNA methylation around non-reference TE insertion sites. (F) Distribution of non-reference TE insertions associated with short- or long-distance DNA methylation according to their position relative to genes (stacked bar plot) and proportion of insertions in the two possible orientations relative to the closest gene (pie charts). (G) Average expression level in different organs and at different developmental time points (in Col-0) of genes with non-reference TE insertions with TSDs and affected by short- (blue) or long-distance (red) DNA methylation. Error bars are s.e.m. Statistical significance of differences was calculated using a MWU test. Figure 4—figure supplement 1 shows DNA methylation of TE families and impact on sequences flanking non-reference TE insertions with TSDs.

https://doi.org/10.7554/eLife.15716.019

Based on these considerations, we next analyzed the DNA methylation status of 1543 TE insertion sites for which reliable data could be extracted across all 140 accessions (Figure 4C). Approximately 10% of sites are methylated in most accessions, including systematically in the one(s) containing the TE. As expected, these sites are preferentially located within TE-rich, pericentromeric regions. In contrast, another 40% of sites are devoid of methylation in the accession(s) containing the TE insertion as well as in most of the other accessions. This absence of adjacent DNA methylation could indicate either that the TE insertions themselves are unmethylated or else that DNA methylation does not spread from them. Finally, 50% of sites are methylated exclusively or almost exclusively in the accession(s) with the TE insertion (Figure 4—figure supplement 1B), thus suggesting that at these sites TEs are methylated and that DNA methylation did spread into adjacent sequences. Why some sites may be refractory to DNA methylation spreading when others are not is unclear, as we did not identify any feature that could distinguish them, such as the identity of the TE or the sequence composition at the insertion site.
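
The three classes of insertion sites distinguished here can be expressed as a simple decision rule. The Python sketch below is only an illustration of that logic: the function name `classify_site`, the methylation cutoff of 0.1, and the fractions used to decide what counts as "most" or "almost none" of the other accessions are hypothetical values, not thresholds taken from the study.

```python
def classify_site(meth, carriers, level_cutoff=0.1, most=0.5, few=0.1):
    """Assign one insertion site to a methylation class.

    meth: dict mapping accession -> average flanking methylation level (0..1)
    carriers: set of accessions containing the TE insertion
    level_cutoff, most, few: hypothetical thresholds chosen for illustration
    """
    if not carriers:
        raise ValueError("at least one carrier accession is required")
    is_meth = {acc: level >= level_cutoff for acc, level in meth.items()}
    carrier_calls = [is_meth[acc] for acc in carriers]
    other_calls = [call for acc, call in is_meth.items() if acc not in carriers]
    frac_others = sum(other_calls) / len(other_calls) if other_calls else 0.0

    if all(carrier_calls) and frac_others > most:
        # Methylated regardless of the insertion (e.g. pericentromeric sites).
        return "methylated in most accessions"
    if not any(carrier_calls) and frac_others < most:
        # No methylation next to the TE, nor in most other accessions.
        return "unmethylated"
    if all(carrier_calls) and frac_others <= few:
        # Methylation found (almost) only where the TE is present.
        return "methylated with the insertion"
    return "unclassified"


# Toy sites (hypothetical values); accession "a" carries the insertion.
site_background = classify_site({"a": 0.7, "b": 0.8, "c": 0.6, "d": 0.0}, {"a"})
site_unmethylated = classify_site({"a": 0.0, "b": 0.0, "c": 0.02, "d": 0.0}, {"a"})
site_spreading = classify_site({"a": 0.6, "b": 0.0, "c": 0.01, "d": 0.0}, {"a"})
```

Sites that fit none of the three patterns are returned as "unclassified" rather than forced into a class, since the thresholds are arbitrary.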

Further analysis of DNA methylation associated with TE insertions indicates that it affects all three sequence contexts and that it generally extends for up to 300 bp on both sides of the insertions (Figure 4D and E; Figure 4—figure supplement 1B), a distance that closely matches that previously reported for the spreading of DNA methylation from RdDM targets (Ahmed et al., 2011). For 243 insertion sites however, DNA methylation extends over much longer distances (up to 3.5 kb) on one or the other side of the insertion (Figure 4D and E). While most of these sites lie within or close to genes, the TE insertions are not preferentially orientated with respect to gene transcription (Figure 4F), which rules out sense-antisense transcription as the likely trigger for this long-distance DNA methylation. Proximity to pericentromeric heterochromatin can also be ruled out, because most of the genes with long-distance DNA methylation are located on the chromosome arms (Figure 4—figure supplement 1C). To explore potential mechanisms further, we made use of the wealth of epigenomic data available in Col-0 to examine the 142 Col-0 TE insertions with TSDs that are absent from the assembled Ler-1 genome sequence (Figure 1—source data 4). Methylome data (Stroud et al., 2013) indicate that 121 of these 142 Col-0 TE copies are methylated and that DNA methylation tends to extend into flanking sequences, predominantly over short distances, but occasionally over much longer distances (<300 bp: 63 TE insertions; >1 kb: 36 TE insertions; Figure 1—source data 4). These results confirm those obtained for the non-reference TE insertions. In addition, analysis of Col-0 small RNA-seq data (Fahlgren et al., 2009) indicates that in contrast to short-distance DNA methylation, long-distance DNA methylation aligns with 24-nt siRNAs (Figure 4—figure supplement 1D). Thus, genes affected by the latter type of DNA methylation have presumably become secondary targets of RdDM, as was shown for a transgene (Kanno et al., 2008; Daxinger et al., 2009). Moreover, genes affected by long-distance DNA methylation in accessions other than Col-0 tend in the latter accession, where they are by definition in the ancestral state, to be most highly expressed in both pollen and seeds and more highly expressed in these two organs than genes affected by short-distance DNA methylation (Figure 4G). Given that RdDM activity is also maximal in these organs (Teixeira and Colot, 2010), our observations suggest that secondary RdDM results from the concomitance of strong transcription and strong RdDM at target loci.

Finally, our analysis of TE-associated DNA methylation indicates that it accounts for at least 7% of the so-called gene C-DMRs (i.e. regions of differential methylation at CG, CHG and CHH sites) identified in nature, which are typically low frequency gain of DNA methylation variants (Schmitz et al., 2013). These and similar findings reported recently (Stuart et al., 2016) confirm and extend previous results that first indicated that many natural gene C-DMRs are not bona fide epialleles but rather new alleles caused by TE insertions (Schmitz et al., 2013). Nonetheless, examination of one TE-insertion allele shared among 13 accessions indicates that it is present in the unmethylated state in one accession and thus possibly subjected to epigenetic variation in nature (Figure 4—figure supplement 1B).

TE insertions as motors of adaptive changes

Although TEs tend to insert with no overt bias at the genome scale (Figure 3A), we detected nineteen 10 kb windows with a high load of non-reference TE insertions (Figure 5A). Such enrichment could result from insertion preferences or reflect an absence of strong negative selection. In fact, three of these 10 kb windows span genes encoding nucleotide-binding domain and leucine-rich repeat containing (NLR) proteins, which function as immune receptors in plants and are known to be under diversifying selection (Chae et al., 2014). Moreover, a fourth 10 kb window spans the gene FLC, which encodes a key repressor of flowering and is one of the main genetic factors causing natural variation in the onset of flowering, another key adaptive trait (Ietswaart et al., 2012). Remarkably, the FLC locus has the highest number of non-reference TE insertions (seven in total) across the genome. These insertions belong to several COPIA families and affect four distinct FLC haplotypes in total (Figure 5A and Figure 5—figure supplement 1A). Moreover, five insertions are located within the first intron (Figure 5B), which plays an important role in the epigenetic regulation of FLC in response to cold (Ietswaart et al., 2012). Although four of these insertions as well as another intronic insertion were previously identified among early flowering accessions (Liu et al., 2004; Lempe et al., 2005), causality could not be established unequivocally because of numerous other sequence polymorphisms in complete linkage disequilibrium. To obtain direct proof of a causal role for the seven TE insertions we identified, we used publicly available transcriptomic (Schmitz et al., 2013) as well as phenotypic data (Li et al., 2010; Lempe et al., 2005) and compared FLC expression as well as flowering time among accessions that have the same FLC haplotype but differ by the presence or absence of a TE insertion (see ‘Materials and methods’). Results of these comparisons indicate that the TE-containing accessions have systematically much reduced FLC expression and flower much earlier than their TE-free counterparts (Figure 5C and D; Figure 5—figure supplement 1B and C). Thus, we can conclude that TEs are recurrent generators of major effect FLC alleles, which in turn suggests that they contribute significantly to the high level of allelic heterogeneity observed at this locus (Li et al., 2014).
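
The search for 10 kb windows with a high load of insertions can be sketched as a window count followed by a test against a uniform background. The Python example below is a hedged illustration only: the text does not specify which statistical test was used, so the Poisson upper tail with a Bonferroni correction is an assumption, as are the function names, the toy coordinates, and the round figure of 12,000 windows.

```python
import math
from collections import Counter


def window_counts(positions, window=10_000):
    """Count insertions per fixed-size window; positions are (chrom, bp) pairs."""
    return Counter((chrom, pos // window) for chrom, pos in positions)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    For extremely small tails the subtraction loses precision, which is
    acceptable here because only the comparison to a threshold matters.
    """
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def enriched_windows(positions, n_windows, alpha=0.05, window=10_000):
    """Windows whose count is unexpectedly high under a uniform insertion rate.

    Assumption: Poisson background with Bonferroni correction over n_windows.
    """
    counts = window_counts(positions, window)
    lam = len(positions) / n_windows   # expected insertions per window
    threshold = alpha / n_windows      # Bonferroni-corrected cutoff
    return {w: c for w, c in counts.items()
            if poisson_upper_tail(c, lam) < threshold}


# Toy data (hypothetical coordinates): 20 scattered insertions plus
# seven insertions falling in a single 10 kb window.
scattered = [("Chr1", i * 500_000 + 123) for i in range(20)]
clustered = [("Chr5", 3_170_000 + 1_000 * i) for i in range(7)]
hits = enriched_windows(scattered + clustered, n_windows=12_000)
```

In this toy example only the window holding the seven clustered insertions passes the cutoff; windows with a single insertion do not.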

Figure 5

Local enrichment of non-reference TE insertions with TSDs.

(A) Density of non-reference TE insertions with TSDs in 10 kb windows. The 19 regions statistically enriched in such insertions are indicated by red bars. (B) Position and identity of the seven non-reference TE insertions with TSDs spanning the FLC locus. (C) and (D) Level of FLC expression (C) and flowering time (D) for accessions of the same FLC haplotype but differing by the presence or absence of the relevant TE insertion. Error bars are s.e.m. Figure 5—figure supplement 1 shows the reconstruction of the FLC haplotypes and additional analyses of the effect on flowering time of non-reference TE insertions with TSDs at the locus.

https://doi.org/10.7554/eLife.15716.021

Discussion

We have shown that the A. thaliana mobilome is particularly rich at the species level, being composed of at least half of the 326 TE families that are annotated in the reference genome. This finding is at odds with the prevailing view that most TE families are mere molecular fossils in A. thaliana, since they contain a much lower proportion of 'young', i.e. non-degenerated TE copies than in the close relative A. lyrata (Hu et al., 2011; Maumus and Quesneville, 2014). Furthermore, we provide definitive evidence that TEs insert throughout the genome, with no overt bias towards pericentromeric regions, which contrasts with the observed clustering of annotated TE sequences around centromeres. However, these discrepancies are easily resolved, since we have also shown that despite the richness of the A. thaliana mobilome, most TEs tend to be rapidly purged by natural selection in this species when they insert in the chromosome arms, which are gene-rich. Indeed, our systematic survey indicates that TEs have pervasive effects on the expression and DNA methylation status of genes near or within which they insert. Incidentally, the deleterious effects associated with most transposition events in A. thaliana may also explain in part the fact that we did not detect any recent transposition bursts, as these should be strongly counter-selected. Furthermore, because A. thaliana is predominantly self-fertilizing, the purging of deleterious TE insertions should be accelerated in this species compared to A. lyrata, which is an obligate outcrosser. Given this difference in mating systems, the TE population dynamics of A. thaliana and A. lyrata are expected to differ significantly (Lockton and Gaut, 2010). Thus, homologous recombination could play a more prominent role in the elimination of TE insertions in A. lyrata, as is thought to be the case in D. melanogaster (Barrón et al., 2014). However, comprehensive studies similar to those presented here remain to be performed for A. lyrata in order to identify conclusively the forces that shape the TE landscape in this species.

We have also shown that the composition and activity of the mobilome vary greatly between accessions. GWASs revealed that this variation is caused in part by sequence polymorphisms within the TEs themselves (cis variation), which is in agreement with empirical data and theoretical models indicating that TE families contain only one or a few active, autonomous (i.e. master) TE copies at any one time (Becker et al., 2011). The fact that we could readily detect such cis variants may again be linked to the mating system of A. thaliana, which on the one hand should increase the probability that disabling mutations accumulate within the few active TE copies that are present within a given lineage, before these copies could transpose further; and on the other hand should decrease the probability of acquiring new active copies through crosses.

Another important result of the GWASs is that natural variation at the MET2a locus, which encodes a poorly characterized DNA methyltransferase, has a significant impact on mobilome composition and activity across accessions, being associated with differential transposition activity for seven class I and class II TE families. While the role of MET2a in transposition control remains to be determined experimentally, it is noteworthy that none of the epigenetic repressors of TE activity identified through genetic screens, such as MET1 or DDM1, are associated with natural variation of the mobilome, presumably because of their essential function. Altogether, these findings illustrate the power of GWASs in identifying the genetic factors affecting transposition in nature.

Being a complex trait, TE mobilization is also modulated by environmental factors, and we have identified temperature annual range as a clear contributor to the variation in ATCOPIA78 mobilization across accessions. Remarkably, this TE family has generated several rare alleles with large effects at the FLC locus, which is a major genetic determinant of the onset of flowering in nature. It is therefore tempting to speculate that ATCOPIA78 may endow A. thaliana with a unique ability to adapt to global warming and the associated increase in droughts by facilitating the creation of early flowering FLC alleles. Additionally, these observations may provide insights into how A. thaliana has been able to colonize efficiently the entire northern hemisphere from a few glacial refugia located in Southern Europe (François et al., 2008).

In summary, our findings have far-reaching implications, as they indicate that part of the missing heritability that plagues many GWASs may be accounted for by recent and thus rare TE insertion alleles with large effects (Vinkhuyzen et al., 2013; Brachi et al., 2011). More generally, our study highlights the need for similar species-wide explorations of the mobilome in a variety of organisms in order to assess the true mutational and epimutational impact of transposition as well as its contribution to natural phenotypic variation. In this respect, it can be anticipated that the advent of long-read sequencing technologies will greatly facilitate such studies, especially in organisms with large, repeat-rich genomes.

Author details

Leandro Quadrana
Amanda Bortolini Silveira
George F Mayhew
Chantal LeBlanc
Robert A Martienssen
Jeffrey A Jeddeloh
Vincent Colot (for correspondence)
