The Arabidopsis thaliana mobilome and its impact at the species level

  1. Leandro Quadrana
  2. Amanda Bortolini Silveira
  3. George F Mayhew
  4. Chantal LeBlanc
  5. Robert A Martienssen
  6. Jeffrey A Jeddeloh
  7. Vincent Colot (corresponding author)
  1. Institut de Biologie de l’Ecole Normale Supérieure, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Ecole Normale Supérieure, France
  2. Roche NimbleGen, Inc, United States
  3. Yale University, United States
  4. Howard Hughes Medical Institute, Gordon and Betty Moore Foundation, United States
  5. Cold Spring Harbor Laboratory, United States
5 figures and 5 additional files

Figures

Figure 1 with 2 supplements
Overview of the A. thaliana mobilome.

(A) Genome browser tracks showing normalized sequencing coverage over the two full-length ATCOPIA31 elements annotated in the reference genome (Col-0). CNV is detected as increased or decreased coverage in other accessions. Number of copies is indicated on the right. (B) Heat map representing CNVs (log2 ratio) for 317 TE families and 211 A. thaliana accessions. TE families with statistically significant CNV in at least one accession are indicated. Figure 1—source data 1 contains absolute copy number estimation of TE sequences. (C) Schematic representation of the bioinformatics pipeline to identify non-reference TE insertions with TSD using split-reads. 1- Reads are mapped on a collection of TE extremities from annotated TE sequences and reference sequences (Repbase update). 2- Reads aligning partially over TE extremities are extracted and clipped. 3- The unmapped portion of these split-reads is re-mapped on the Arabidopsis reference genome. 4- Non-reference TE insertions with TSDs are identified by searching for overlapping clusters of 5’ and 3’ split-reads. (D) Genome browser tracks showing split-reads for two non-reference ATCOPIA31 insertions and TSD reconstruction. Figure 1—source data 2 contains the coordinates of all non-reference TE insertions with TSDs. (E) Distribution frequency of allele counts for non-reference TE insertions with TSDs. (F) Number of mobile TE families per accession identified using split-read and TE-sequence capture. (G) Cumulative plot of the number of mobile TE families detected with increasing numbers of accessions. (H) The total number of non-reference TE insertions with TSDs is indicated in relation to the number of accessions with such insertions, for each of the 131 mobile TE families. Asterisks indicate the nine TE families with experimental evidence of transposition (Ito and Kakutani, 2014; Tsay et al., 1993).
Figure 1—source data 3 contains the total number of distinct non-reference TE insertions with TSD for each TE family and super-family. Figure 1—figure supplement 2 shows TE-capture results. Figure 1—figure supplement 1 contains IGV screenshots showing the pattern of split-reads characteristic of true- and false-positive non-reference TE insertions with TSDs.

https://doi.org/10.7554/eLife.15716.002
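
Step 4 of the pipeline in Figure 1C (searching for overlapping clusters of split-reads from the two TE extremities) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `call_tsd`, the read-count threshold `min_reads` and the size limit `max_tsd_len` are hypothetical, and the input is assumed to be the reference positions at which the re-mapped genomic portions of the clipped reads end (upstream flank) or start (downstream flank).

```python
from collections import Counter

def call_tsd(upstream_read_ends, downstream_read_starts,
             min_reads=2, max_tsd_len=20):
    """Infer a target site duplication (TSD) from split-read breakpoints.

    upstream_read_ends: 1-based reference positions where the genomic part
        of split-reads covering the upstream flank ends.
    downstream_read_starts: 1-based reference positions where the genomic
        part of split-reads covering the downstream flank starts.
    Returns (tsd_start, tsd_end), 1-based inclusive, or None.
    Thresholds are illustrative assumptions, not values from the article.
    """
    if (len(upstream_read_ends) < min_reads
            or len(downstream_read_starts) < min_reads):
        return None  # too few supporting reads on one side
    # Use the most frequent breakpoint of each cluster.
    end = Counter(upstream_read_ends).most_common(1)[0][0]
    start = Counter(downstream_read_starts).most_common(1)[0][0]
    # The duplicated target site is covered by both clusters, so the
    # downstream cluster starts before the upstream cluster ends.
    tsd_len = end - start + 1
    if 1 <= tsd_len <= max_tsd_len:
        return (start, end)
    return None
```

For example, `call_tsd([1005, 1005, 1004], [1001, 1001])` reports a 5 bp overlap spanning positions 1001 to 1005, whereas two clusters that do not overlap yield `None`.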
Figure 1—source data 1

Copy number estimation of TE sequences.

(A) Copy number estimation based on read coverage for the 317 TE families analyzed across 211 A. thaliana accessions collected worldwide. Column descriptions are provided in (B).

https://doi.org/10.7554/eLife.15716.003
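
The coverage-based comparison behind the log2 ratios of Figure 1B can be sketched as follows. The normalization shown here (dividing the depth over a TE family by the genome-wide depth of the same accession, then comparing with the reference accession) is an assumption made for illustration; the captions do not specify the exact procedure, and `log2_cn_ratio` is a hypothetical name.

```python
import math

def log2_cn_ratio(te_depth, genome_depth, ref_te_depth, ref_genome_depth):
    """log2 ratio of normalized TE coverage, accession versus reference.

    Each depth is a mean read depth. Dividing by the genome-wide depth
    corrects for differences in sequencing effort (assumed normalization).
    """
    sample_norm = te_depth / genome_depth
    ref_norm = ref_te_depth / ref_genome_depth
    return math.log2(sample_norm / ref_norm)
```

Under these assumptions, a positive value indicates more copies than in the reference accession and a negative value indicates fewer.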
Figure 1—source data 2

Coordinates of non-reference TE insertions with TSDs.

(A) Coordinates and presence or absence call (1 and 0, respectively) across the 211 A. thaliana accessions. Description of columns is provided in (B).

https://doi.org/10.7554/eLife.15716.004
Figure 1—source data 3

Number of distinct non-reference TE insertions with TSDs identified by the split-reads approach for each TE family and super-family.

https://doi.org/10.7554/eLife.15716.005
Figure 1—source data 4

TE insertions with TSDs present in Col-0 but absent in Ler-1.

(A) Genomic coordinates of the insertion in Col-0 and of the corresponding empty site in Ler-1. Description of columns is provided in (B).

https://doi.org/10.7554/eLife.15716.006
Figure 1—figure supplement 1
Visual inspection of true- and false-positive non-reference TE insertions with TSDs.

IGV screenshots showing split-reads for non-reference TE insertions with TSDs that are validated or not by TE-capture (true- and false-positives, respectively). Split-reads are shown for 12 different accessions. Accessions containing the non-reference TE insertion with TSD are indicated in red.

https://doi.org/10.7554/eLife.15716.007
Figure 1—figure supplement 2
Validation of the A. thaliana mobilome by TE-capture.

(A) The number of non-reference TE insertions with TSDs identified by the split-read pipeline is plotted against the corresponding genome sequencing coverage for each accession. Accessions analyzed by TE-capture are highlighted in red. (B) Genome browser tracks showing examples of non-reference TE insertions identified by TE-capture only. (C) Overlap between TE insertions with TSDs identified specifically in Col-0 using the Ler-1 genome assembly as a reference and either whole genome sequence alignment or the split-reads pipeline. The percentage of false positives (FP), true positives (TP) and false negatives (FN) as well as the false discovery rate (FDR) are indicated. (D) Description of the TE-capture design and workflow. (E) TE-capture enrichment of target sequences. (F) Overlap between non-reference insertions with TSDs identified by split-read analysis and TE-capture. The percentage of FP, TP and FN as well as the FDR are indicated. (G) Distribution frequency of allele counts for non-reference TE insertions identified using the split-read approach and TE-capture among the 12 accessions analysed. (H) Number of SNPs plotted against the number of non-reference TE insertions identified by TE-capture between any two accessions.

https://doi.org/10.7554/eLife.15716.008
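
The overlap metrics reported in panels (C) and (F) can be computed from two sets of calls. The sketch below uses the standard definitions (FDR = FP / (TP + FP)); the function name and the toy identifiers in the usage note are hypothetical, and matching of insertions between methods is reduced here to exact identity of a site label.

```python
def overlap_stats(predicted, validated):
    """Compare predicted insertion calls with a validation set.

    Both arguments are iterables of hashable site identifiers.
    Returns counts of TP, FP and FN plus FDR and sensitivity.
    """
    predicted, validated = set(predicted), set(validated)
    tp = len(predicted & validated)   # called and validated
    fp = len(predicted - validated)   # called but not validated
    fn = len(validated - predicted)   # validated but missed
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "FDR": fdr, "sensitivity": sensitivity}
```

With four predicted sites of which three are among five validated sites, this gives TP = 3, FP = 1, FN = 2 and an FDR of 0.25.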
Figure 2 with 3 supplements
Environmental and genetic factors associated with differential mobilome activity.

(A) Copy number (CN, red circles) of ATCOPIA78 in accessions distributed across the globe. Annual temperature range is also shown. (B) Partial Mantel correlation between ATCOPIA78 CN and annual temperature range. (C) Fraction of CNV variance explained by SNPs (cis and trans) and partial Mantel correlation with geo-climatic variables. (D) Distribution of cis and trans loci in the joined analysis (391 accessions) and number of TE families associated with a given trans locus. A complete list of the GWAS results is provided in Supplementary file 1. (E) Manhattan plots displaying GWAS results for the seven TE families with a MET2a association. The leading SNP within each interval is indicated as a red diamond. Colors indicate the extent of linkage disequilibrium (r²) with the leading SNP. (F) Schematic view of the MET2a protein (TD: targeting domain; BAH: bromo adjacent homology domain) and sequence alignment of the TD. The amino acid substitution (G519E) that is present in some accessions is indicated (red arrow). (G) Average DNA methylation level over non-mobile, mobile and MET2a-associated TE families in WT and met2a Col-0 seedlings (Stroud et al., 2013). Statistically significant differences are indicated (MWU test). Figure 2—figure supplement 1 shows the positive correlation between CN and number of non-reference TE insertions with TSDs. Figure 2—figure supplement 2 shows climate association to CNVs. Figure 2—figure supplement 3 shows GWAS results for CNVs.

https://doi.org/10.7554/eLife.15716.009
Figure 2—source data 1

Copy number estimation used for the geo-climatic associations and GWASs.

(A) Copy number estimation based on read coverage for the 131 mobile TE families analyzed across 211 A. thaliana accessions collected worldwide. (B) Copy number estimation based on read coverage for the same 131 mobile TE families across 180 A. thaliana accessions from Sweden.

https://doi.org/10.7554/eLife.15716.010
Figure 2—figure supplement 1
Pearson correlation between TE CN and number of TE sequences identified by TE capture.
https://doi.org/10.7554/eLife.15716.011
Figure 2—figure supplement 2
Climate association to TE CNV.

Heat map representing partial Mantel correlation coefficient between TE CN and geo-climatic variables. TE families with statistically significant correlations (P < 8.33×10⁻⁴) are indicated.

https://doi.org/10.7554/eLife.15716.012
Figure 2—figure supplement 3
GWAS of CNVs.

(A) Overlap between GWAS results obtained using CNVs and SNPs from world-wide accessions and from Swedish accessions. (B) Manhattan plot of the GWAS results for ATCOPIA69 CNVs. The leading SNP (red diamond) is located within the TE itself. Colors indicate the extent of linkage disequilibrium (r²) to the leading SNP. Distribution of CN values associated with the leading SNP and the common allele is shown on the right. (C) Schematic overview of ATCOPIA69. LTR, long terminal repeat; gag, nucleocapsid protein; pro, protease. The sequence alignment on the right indicates the position of the amino acid change (red arrow) caused by the leading SNP in the transposase protein. Conserved amino acids are highlighted in blue. (D) Average fraction of CN variance explained by cis and trans loci for Class I and Class II TE families. (E) Average fraction of CN variance explained by cis and trans loci for autonomous and non-autonomous Class II TE families. Ratio of the observed over expected (O/E) number of TE annotations overlapping trans loci for autonomous and non-autonomous Class II TE families. Statistically significant differences were calculated by resampling 10,000 times the coordinates of the trans loci. (F) Manhattan plot of the GWAS results for ATDNA1T9A. Distribution of CN values associated with the leading SNP and the common allele is shown on the right. Note that both ATDNA1T9A and VANDAL16 show similar insertion preference towards the TSS of genes. (G) Probability of missing a non-reference TE insertion with TSD as a function of the allele frequency of the trans locus identified by GWAS. (H) Gene ontology of genes overlapping or close to trans loci.

https://doi.org/10.7554/eLife.15716.013
Figure 3 with 4 supplements
Genomic localization of non-reference TE insertions.

(A) Density of non-reference TE insertions with TSDs (blue) and of annotated TE sequences (red) along the reference sequence of chromosome 1. Inner pericentromeric regions are masked. (B) Fraction of private and shared non-reference TE insertions with TSDs and of annotated TE sequences in outer pericentromeric regions. Statistically significant differences are indicated (chi square test). (C) Observed/expected ratio (O/E) of private non-reference TE insertions with TSDs in and around genes. Error bars are defined as 95% confidence intervals. (D) Cumulative distribution of gene expression ratios between alleles harboring and lacking non-reference TE insertions. Statistically significant differences were calculated using the KS test. (E) As D, but only for COPIA (green) or MuDR (red) non-reference TE insertions with TSDs. Figure 3—figure supplement 1 shows detailed analysis of the distribution of non-reference TE insertions with TSDs along the genome. Figure 3—figure supplement 2 shows local TE insertion preferences. Figure 3—figure supplement 3 shows global expression levels of genes affected by non-reference TE insertions. Figure 3—figure supplement 4 shows expression levels of genes affected in some accessions by a non-reference insertion with TSD in plants grown under control conditions or subjected to heat stress.

https://doi.org/10.7554/eLife.15716.014
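
The comparison of cumulative distributions in panels (D) and (E) relies on a KS test. As a minimal sketch of what that test measures, the function below computes the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between two empirical cumulative distributions. It returns the statistic only, with no p-value; the name `ks_statistic` is hypothetical and the authors' software is not specified in the captions.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so check each one.
    for value in sorted(set(a) | set(b)):
        ecdf_a = bisect_right(a, value) / len(a)
        ecdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d
```

Identical samples give a statistic of 0, and completely separated samples give 1.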
Figure 3—figure supplement 1
Distribution of non-reference TE insertions with TSDs along the genome.

(A) Density of non-reference TE insertions with TSDs detected by split-reads (black) or TE-sequence capture (red) across the five chromosomes. (B) Fraction of non-reference TE insertions with TSDs detected in pericentromeric regions using TE-sequence capture as a function of allele frequency (f). (C) Historical recombination landscape estimated using genome sequencing data for 211 accessions. (D) Density of non-reference TE insertions with TSDs as a function of the density of coding sequences. (E) Density of non-reference TE insertions with TSDs as a function of the recombination rate. The correlation between these two variables still holds after correcting for the partial correlation between gene density and recombination rate (r=0.32, p<4-23). (F) Distribution frequency of allele counts for non-reference TE insertions with TSDs located either within or close to genes or away from genes.

https://doi.org/10.7554/eLife.15716.015
Figure 3—figure supplement 2
Local TE insertion preferences.

(A) Metagene analysis of the distribution of private non-reference TE insertions with TSDs for four Class II and two Class I TE superfamilies. UTR, untranslated transcribed region. (B) Sequence motifs for non-reference insertion sites. (C) GC-content for non-reference insertion sites (including 50bp upstream and downstream). Blue bars represent the GC content for 100bp-long sequences randomly chosen from the reference genome sequence. Bars represent average GC-content ± SD. Statistically significant differences were calculated using a permutation test.

https://doi.org/10.7554/eLife.15716.016
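
The GC-content comparison of panel (C) can be sketched as a permutation test on the difference in mean GC content between insertion-site sequences and randomly chosen sequences. The test statistic, the number of permutations and the function names below are assumptions; the caption states only that a permutation test was used.

```python
import random
from statistics import mean

def gc_content(seq):
    """Fraction of G and C bases in a non-empty DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def permutation_test(group_a, group_b, n_perm=999, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Group labels are shuffled n_perm times; the p-value is the fraction
    of shuffles with a difference at least as large as the observed one
    (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In use, `gc_content` would be applied to each insertion-site sequence and to each random 100 bp sequence, and the two resulting lists passed to `permutation_test`.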
Figure 3—figure supplement 3
Global expression levels of genes affected by non-reference TE insertions of different TE superfamilies.

Cumulative distribution of gene expression ratios between alleles harboring and lacking non-reference TE insertions for the different TE superfamilies. The number of genes analyzed is indicated in each case. Statistically significant differences between the observed and expected distributions of expression ratios were calculated using a KS test.

https://doi.org/10.7554/eLife.15716.017
Figure 3—figure supplement 4
Expression levels of selected genes affected by non-reference TE insertions with TSDs.

Accession(s) with a non-reference TE insertion with TSD are indicated in red for each gene. Plants were grown under standard conditions (Ctrl.) or subjected to a heat shock (HS). RT-qPCR results (three technical replicates) are indicated relative to those obtained for a gene that shows invariant expression under multiple conditions (see ‘Material and methods’).

https://doi.org/10.7554/eLife.15716.018
Figure 4 with 1 supplement
DNA methylation of non-reference TE insertion sites.

(A) Boxplot representation of average DNA methylation level for mobile and non-mobile TE families across all accessions. (B) O/E ratio of spontaneous DMRs identified in mutation accumulation lines (Becker et al., 2011; Schmitz et al., 2011) for non-mobile and mobile TE families. Statistically significant differences were calculated using a chi square test. (C) Average DNA methylation level in 50bp windows upstream and downstream of 1543 insertion sites for accessions lacking or containing a given non-reference TE insertion with TSD. (D) Genome browser tracks showing examples of insertion sites respectively associated with short- and long-distance DNA methylation. (E) Meta-analysis of DNA methylation around non-reference TE insertion sites. (F) Distribution of non-reference TE insertions associated with short- or long-distance DNA methylation according to their position relative to genes (stacked bar plot) and proportion of insertions in the two possible orientations relative to the closest gene (pie charts). (G) Average expression level in different organs and at different developmental time points (in Col-0) of genes with non-reference TE insertions with TSDs and affected by short- (blue) or long-distance (red) DNA methylation. Error bars are s.e.m. Statistical significance of differences was calculated using a MWU test. Figure 4—figure supplement 1 shows DNA methylation of TE families and impact on sequences flanking non-reference TE insertions with TSDs.

https://doi.org/10.7554/eLife.15716.019
Figure 4—figure supplement 1
DNA methylation of TE families and impact on sequences flanking non-reference TE insertions with TSDs.

(A) Boxplot representation of average DNA methylation levels for non-mobile TE families (across all accessions) and for mobile TE families (separately for accessions with or without evidence of mobility). (B) Genome browser tracks for one insertion site. DNA methylation and split-reads are indicated whenever present. Two accessions (Zdr-1 and Knox-18) have DNA methylation (red arrows) but lack split reads supportive of the presence of the non-reference TE insertion (black crosses). Another accession (Anholt-1) has no DNA methylation (red cross) yet contains the non-reference TE insertion (black arrow). Another 12 accessions have the non-reference TE insertion and all have DNA methylation at the insertion site. (C) Expected density distribution of the fraction of non-reference TE insertions with TSDs that are located in the chromosome arms and are associated with long-distance DNA methylation. The observed fraction is indicated by the vertical line. (D) Genome browser tracks showing the density of 24-nt siRNAs over two Col-0 TE insertions with TSDs that are associated with long- and short-distance DNA methylation, respectively (top and bottom panels). Brackets indicate the absence of the insertions in the other accessions.

https://doi.org/10.7554/eLife.15716.020
Figure 5 with 1 supplement
Local enrichment of non-reference TE insertions with TSDs.

(A) Density of non-reference TE insertions with TSDs in 10 kb windows. The 19 regions statistically enriched in such insertions are indicated by red bars. (B) Position and identity of the seven non-reference TE insertions with TSDs spanning the FLC locus. (D) and (E) Level of FLC expression (D) and flowering time (E) for accessions of same FLC haplotype but differing by the presence or absence of the relevant TE insertion. Error bars are s.e.m. Figure 5—figure supplement 1 shows the reconstruction of the FLC haplotypes and additional analyses of the effect on flowering time of non-reference TE insertions with TSDs at the locus.

https://doi.org/10.7554/eLife.15716.021
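
The density track of panel (A) amounts to counting insertions in fixed 10 kb windows. The sketch below shows only the counting step, assuming 1-based insertion coordinates given as (chromosome, position) pairs; `window_counts` is a hypothetical name, and the statistical test used to call the 19 enriched regions is not described in the caption and is not reproduced here.

```python
from collections import Counter

def window_counts(insertions, window=10_000):
    """Count insertions per fixed-size window.

    insertions: iterable of (chromosome, position) pairs, 1-based positions.
    Returns a Counter keyed by (chromosome, window_index), where window 0
    covers positions 1..window.
    """
    return Counter((chrom, (pos - 1) // window) for chrom, pos in insertions)
```

For instance, positions 1 and 10000 on the same chromosome fall in window 0, while position 10001 falls in window 1.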
Figure 5—figure supplement 1
Reconstruction of the FLC haplotypes and additional analyses of the effect on flowering time of non-reference TE insertions with TSDs at the locus.

(A) SNPs identify 20 distinct haplotypes at the FLC locus (±50 kb). The seven non-reference TE insertions with TSDs (indicated A to G, as in Figure 5B) located within FLC affect four distinct haplotypes, as shown on the right. (B) and (C) Flowering time (Y. Li et al., 2010; Lempe et al., 2005) associated with FLC alleles belonging to the same haplotype but differing by the presence or absence of a non-reference TE insertion with TSD.

https://doi.org/10.7554/eLife.15716.022

Additional files

Supplementary file 1

Summary of GWAS results for CNV.

Distribution of CN values, Manhattan plot and QQ-plot across the joined data set (391 accessions) for the indicated TE families. Summary statistics of associations are indicated below. MAF indicates Minor Allele Frequency in the joined dataset. Genes within GWAS intervals are indicated (MET2a is in bold).

https://doi.org/10.7554/eLife.15716.023
Supplementary file 2

DNA sequence motifs at insertion sites.

Sequence logo of the overrepresented DNA sequence motifs at insertion sites (±30 bp) is shown for the 79 mobile TE families with at least 10 non-reference TE insertions with TSDs. The number of sequences used in each case is indicated.

https://doi.org/10.7554/eLife.15716.024
Supplementary file 3

PCR validation of non-reference TE insertions with TSDs and list of primer sequences used in this study.

https://doi.org/10.7554/eLife.15716.025
Supplementary file 4

List of TE-capture targets.

https://doi.org/10.7554/eLife.15716.026
Supplementary file 5

Historical population-scaled recombination rate map for A. thaliana.

https://doi.org/10.7554/eLife.15716.027

Cite this article

  1. Leandro Quadrana
  2. Amanda Bortolini Silveira
  3. George F Mayhew
  4. Chantal LeBlanc
  5. Robert A Martienssen
  6. Jeffrey A Jeddeloh
  7. Vincent Colot
(2016)
The Arabidopsis thaliana mobilome and its impact at the species level
eLife 5:e15716.
https://doi.org/10.7554/eLife.15716