K-mer analyses of the A. thaliana reference genome

(A) Median k-mer frequency in the non-repetitive and repetitive compartments of the reference genome. (B) Relationship between scaled 12-mer-based copy number estimates (y-axis) and simulated copy number change across 1,000 simulations. Each gray line denotes a single simulation iteration, while the blue line indicates the median across all iterations. (C) Relationship between 12-mer-based copy number estimates and annotation-based copy number estimates for 272 transposon families in the TAIR10 reference genome. Blue line indicates the line of best fit from a linear regression model. (D) Relationship between 12-mer abundances in Illumina reads (at 44× mean coverage) and the corresponding abundances in the genome assembly. Each hexagonal bin is colored by the number of observations within the bin. Blue line indicates the line of best fit from a linear regression model.
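As an illustration of the quantity plotted in these panels, 12-mer abundance can be tallied with a sliding window over a sequence. The sketch below is a minimal, assumption-laden example rather than the pipeline used in the study: the function name `count_kmers` is hypothetical, only the forward strand is counted, and windows containing ambiguous bases are skipped.

```python
from collections import Counter


def count_kmers(sequence: str, k: int = 12) -> Counter:
    """Count every overlapping k-mer in a DNA sequence (forward strand only).

    Assumption: windows containing bases other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    valid_bases = set("ACGT")
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


# Toy usage with k = 4 so that the repeated k-mer is easy to see.
toy_counts = count_kmers("ACGTACGT", k=4)
print(toy_counts.most_common(1))
```

Applying the same function to reads and to the assembly would give the two abundance vectors compared in panel D, under the assumptions stated above.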

Patterns of copy number variability across the genome

(A) Sequence abundance variability in non-overlapping 100 kb windows across the genome compared to the density of genes and repeats. sR denotes the standardized range (range / median) of 12-mer abundance across all individuals. Pink shaded regions indicate the centromeric and pericentromeric regions. Diamond = NOR2, triangles = knobs, star = cluster of cysteine-rich repeat proteins. (B) Number of individuals with putative copy number changes per repetitive sequence. Z-scores were calculated from the 12-mer-derived copy number estimates; values on the y-axis give, for each sequence, the count of Z-scores greater than 3 (increase in copy number, positive value) or less than -3 (decrease in copy number, negative value). Sequences colored teal are unclassified repeats.
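The two summary statistics in this legend can be written out in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, Z-scores are computed against the mean and population standard deviation of the values supplied, and the cutoff of 3 is applied symmetrically. The source does not say which standard deviation estimator was used.

```python
from statistics import mean, median, pstdev


def standardized_range(abundances):
    """sR = (max - min) / median of abundances across individuals."""
    return (max(abundances) - min(abundances)) / median(abundances)


def count_copy_number_changes(estimates, threshold=3.0):
    """Return (n_increase, n_decrease) for one repetitive sequence.

    Assumption: Z = (x - mean) / population SD over the supplied individuals.
    """
    mu = mean(estimates)
    sd = pstdev(estimates)
    if sd == 0:
        # No variation between individuals, so no value can be an outlier.
        return 0, 0
    z_scores = [(x - mu) / sd for x in estimates]
    n_increase = sum(1 for z in z_scores if z > threshold)
    n_decrease = sum(1 for z in z_scores if z < -threshold)
    return n_increase, n_decrease


# Toy data: twenty individuals at 10 copies and one at 100 copies.
gains, losses = count_copy_number_changes([10] * 20 + [100])
print(gains, losses)
```

In a plot like panel B, the decrease count would be drawn as a negative value so that gains and losses appear on opposite sides of zero.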

Repeat copy number differences between populations

(A) Principal component 1 of repeat copy number estimates partitioned by subpopulation. (B) Number of sequences with putative copy number change per individual, grouped by subpopulation. Z-scores were calculated from the 12-mer-derived copy number estimates; values on the y-axis give, for each individual, the count of Z-scores greater than 3 (increase in copy number, positive value) or less than -3 (decrease in copy number, negative value). Abbreviated subpopulations are as follows: N.S. is North Sweden, S.S. is South Sweden, Af. is Africa, R. is relict, W. Eur. is Western Europe, C. Eur. is Central Europe, Ch. is China, Ger. is Germany, I.B.C. is Italy Balkan Caucasus.

Genomic regions associated with repeat copy number variation

(A) Frequency of significant SNPs per 100 kb window for all GWAS (grey), or for GWAS of retrotransposons (blue), DNA transposons (red), satellites (green), and simple repeats (yellow). Pink highlighted region indicates the centromere and pericentromere. (B) Proportion of all SNPs (black), significant SNPs in all GWAS (grey), and significant SNPs in GWAS by sequence class that fall in the pericentromere.

Genome-wide association mapping of repeat abundance

(A) Meta-GWAS of repeat abundance. Dotted line indicates the significance threshold at Bonferroni-corrected alpha = 0.05. Labels correspond to candidate genes at each locus listed in Table S6. (B) GWAS of PC 1 of repeat abundance. Dotted line indicates the significance threshold at Bonferroni-corrected alpha = 0.05. (C) Frequency of significant meta-GWAS tag SNPs across individual GWAS, separated by sequence class. Bar colors indicate the effect of the minor allele on sequence abundance (i.e., a positive beta corresponds to a minor allele associated with an increase in sequence copy number).
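For readers unfamiliar with the dotted threshold lines, a Bonferroni correction divides alpha by the number of tests. The sketch below is illustrative only: the function names and SNP labels are hypothetical, and it assumes the number of tests equals the number of SNPs supplied, which the legend does not specify.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that controls the family-wise error at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_snps(p_values: dict, alpha: float = 0.05) -> list:
    """Return SNP identifiers whose p-value falls below the corrected cutoff.

    Assumption: the number of tests is the number of SNPs in p_values.
    """
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [snp for snp, p in p_values.items() if p < cutoff]


# Height of the dotted line on a Manhattan plot drawn on the -log10(p) scale,
# for a hypothetical one million tested SNPs.
line_height = -math.log10(bonferroni_threshold(0.05, 1_000_000))
print(round(line_height, 2))
```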

Evolutionary dynamics of alleles associated with copy number change

(A) Site frequency spectra for meta-GWAS tag SNPs, meta-GWAS tag SNPs stratified by predominant copy number effect, all GWAS-associated SNPs from this study, and AraGWAS-associated SNPs, compared to nonsynonymous and synonymous SNPs from the 1001 Genomes dataset. (B) Relationship between the median number of sequences with copy number change per subpopulation (normalized to the Africa group) and the ratio of mean derived allele frequencies for nonsynonymous versus synonymous SNPs.
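The two quantities in this legend, a site frequency spectrum and a ratio of mean derived allele frequencies, can be sketched as follows. This is a hedged illustration rather than the analysis code: the function names are hypothetical, the spectrum is assumed to be unfolded (based on derived allele counts), and monomorphic sites are dropped.

```python
from statistics import mean


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry i-1 holds the number of sites with i derived copies.

    Assumption: sites with 0 or n_chromosomes derived copies are excluded.
    """
    spectrum = [0] * (n_chromosomes - 1)
    for count in derived_counts:
        if 0 < count < n_chromosomes:
            spectrum[count - 1] += 1
    return spectrum


def mean_daf_ratio(nonsynonymous_freqs, synonymous_freqs):
    """Ratio of mean derived allele frequency, nonsynonymous over synonymous."""
    return mean(nonsynonymous_freqs) / mean(synonymous_freqs)


# Toy example: six sites sampled from four chromosomes.
print(site_frequency_spectrum([1, 1, 2, 3, 0, 4], n_chromosomes=4))
```

Computing `mean_daf_ratio` separately within each subpopulation would give one value per group, which is the kind of quantity plotted on one axis of panel B.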