1. Evolutionary Biology
  2. Genetics and Genomics
Download icon

Long-term balancing selection drives evolution of immunity genes in Capsella

  1. Daniel Koenig  Is a corresponding author
  2. Jörg Hagmann
  3. Rachel Li
  4. Felix Bemm
  5. Tanja Slotte
  6. Barbara Nueffer
  7. Stephen I Wright
  8. Detlef Weigel  Is a corresponding author
  1. Max Planck Institute for Developmental Biology, Germany
  2. Stockholm University, Sweden
  3. University of Osnabrück, Germany
  4. University of Toronto, Canada
Research Article
  • Cited 0
  • Views 849
  • Annotations
Cite this article as: eLife 2019;8:e43606 doi: 10.7554/eLife.43606

Abstract

Genetic drift is expected to remove polymorphism from populations over long periods of time, with the rate of polymorphism loss being accelerated when species experience strong reductions in population size. Adaptive forces that maintain genetic variation in populations, or balancing selection, might counteract this process. To understand the extent to which natural selection can drive the retention of genetic diversity, we document genomic variability after two parallel species-wide bottlenecks in the genus Capsella. We find that ancestral variation preferentially persists at immunity related loci, and that the same collection of alleles has been maintained in different lineages that have been separated for several million years. By reconstructing the evolution of the disease-related locus MLO2b, we find that divergence between ancient haplotypes can be obscured by referenced based re-sequencing methods, and that trans-specific alleles can encode substantially diverged protein sequences. Our data point to long-term balancing selection as an important factor shaping the genetics of immune systems in plants and as the predominant driver of genomic variability after a population bottleneck.

https://doi.org/10.7554/eLife.43606.001

eLife digest

Capsella rubella is a small plant that is found in southern and western Europe. This plant is young in evolutionary terms: it is thought to have emerged less than 200,000 years ago from a small group of plants belonging to an older species known as Capsella grandiflora.

Individuals of the same species may carry alternative versions of the same genes – known as alleles – and the total number of alleles present in a population is referred to as genetic diversity. When a few individuals form a new species, the gene pool and the genetic diversity in the new species is initially much lower than in the ancestral species, which may make the new species less robust to fluctuations in the environment. For example, alternative versions of a gene might be preferable in hot or cold climates, and loss of one of these versions would limit the species’ ability to survive in both climates.

A mechanism known as balancing selection can maintain various alleles in a species, even if the population is very small. However, it was not clear how common long-lasting balancing selection was after a species had split. To address this question, Koenig et al. assembled collections of wild C. rubella and C. grandiflora plants and sequenced their genomes in search of alleles that were shared between individuals of the two species.

The analysis found not just a few, but thousands of examples where the same genetic differences had been maintained in both C. rubella and C. grandiflora. Some of these allele pairs were also shared with individuals of a third species of Capsella that had split from C. rubella and C. grandiflora over a million years ago. The shared alleles did not occur randomly in the genome; genes involved in immune responses were far more likely to be targets of balancing selection than other types of genes.

These findings indicate that there is strong balancing selection to maintain different alleles of immunity genes in wild populations of plants, and that some of this diversity can be maintained over hundreds of thousands, if not millions of years. The strategy developed by Koenig et al. may help to identify new versions of immunity genes from wild relatives of crop plants that could be used to combat crop diseases.

https://doi.org/10.7554/eLife.43606.002

Introduction

Balancing selection describes the suite of adaptive forces that maintain genetic variation for longer than expected by random chance. It can have many causes, including heterozygous advantage, negative frequency-dependent selection, and environmental heterogeneity in space and time. The unifying characteristic of these situations is that the turnover of alleles is slowed, resulting in increased diversity at linked sites (Charlesworth, 2006). In principle, it should be simple to detect the resulting footprints of increased coalescence times surrounding balanced sites (Tellier et al., 2014), and many candidates have been identified using diverse methodology (Fijarczyk and Babik, 2015). However, balanced alleles will be stochastically lost over long time spans, suggesting that most balanced polymorphism is short lived (Fijarczyk and Babik, 2015).

The strongest evidence for balancing selection comes from systems in which alleles are maintained in lineages that are reproductively isolated and that have separated millions of years ago, resulting in trans-specific alleles with diagnostic trans-specific single nucleotide polymorphisms (tsSNPs). A few, well known genes fit this paradigm: the self-incompatibility loci of plants (Vekemans and Slatkin, 1994), mating-type loci of fungi (Wu et al., 1998), and the major histocompatibility complex (MHC) and ABO blood group loci in vertebrates (McConnell et al., 1988; Mayer et al., 1988; Lawlor et al., 1988; Watkins et al., 1990; Ségurel et al., 2012). Additional candidates have been proposed by comparing genome sequences from populations of humans and chimpanzees, and from populations of multiple Arabidopsis species. These efforts have revealed six loci in primates (Leffler et al., 2013b; Teixeira et al., 2015) and up to 129 loci, that were identified by at least two shared SNPs each, in Arabidopsis (Novikova et al., 2016; Bechsgaard et al., 2017), as potential targets of long-term balancing selection and/or introgression. In both systems, genes involved in host–pathogen interactions were enriched, which in Arabidopsis is consistent with previous findings that several disease resistance loci appear to be under balancing selection in this species, based on the analysis of individual genes (Huard-Chauveau et al., 2013; Botella, 1998; Caicedo et al., 1999; Noel, 1999; Stahl et al., 1999; Tian et al., 2002; Bakker, 2006; Rose et al., 2004; Todesco et al., 2010). However, even with the ability to conduct whole-genome scans for balancing selection in A. thaliana, the total number of examples with robust evidence across species remains small (Cao et al., 2011; 1001 Genomes Consortium, 2016).

One explanation for this paucity of evidence for pervasive and stable balancing selection is that cases of long-term maintenance of alleles are rare. However, there are good reasons to believe that many studies lacked the power to detect the expected effects (Fijarczyk and Babik, 2015; DeGiorgio et al., 2014). If one requires that alleles have been maintained in species separated by millions of years, then only targets of outstandingly strong selective pressures that remain the same over many millennia can be identified. Furthermore, recombination between deeply coalescing alleles will typically reduce the size of the genomic footprint to very short sequence stretches, thus limiting the opportunity for distinguishing old alleles from recurrent mutations.

We hypothesised that self-fertilizing species provide increased sensitivity to detect balancing selection based on two observations (Wiuf et al., 2004; Wright et al., 2008). First, self fertilisation greatly reduces the effective rate of recombination, thus potentially expanding the footprint of balancing selection. In addition, the transition to self fertilisation is generally associated with dramatic genome-wide reductions in polymorphism, potentially making it easier to detect outlier loci that retain variation from the outcrossing, more polymorphic ancestor. In this study we sought to assess how strongly selection acts to maintain genetic diversity in the context of repeated transitions to self fertilisation in the flowering plant genus Capsella. Like many plant lineages, the ancestral state of Capsella is outcrossing (found in the extant diploid species C. grandiflora), but selfing has evolved independently in two diploid species, C. rubella and C. orientalis (Figure 1A)(Foxe et al., 2009; Guo et al., 2009; Bachmann et al., 2018). The genomes of both species exhibit the drastic loss of genetic diversity typical for many selfers (Figure 1B–C) (Guo et al., 2009; Foxe et al., 2009; St Onge et al., 2011; Slotte et al., 2013; Brandvain et al., 2013; Slotte et al., 2012). In the younger species, C. rubella, loss of genetic diversity was initially thought to have occurred uniformly throughout the entire genome (Foxe et al., 2009; Guo et al., 2009), but subsequent reports already hinted at some loci having increased diversity (Gos et al., 2012; Brandvain et al., 2013), motivating the present study.

Figure 1 with 1 supplement see all
Polymorphism discovery in Capsella.

(A) Diagram of the relationships between Capsella species. Arrowheads indicate transitions from outcrossing to self-fertilisation. (B) Inflorescence of C. rubella with small flowers. (C) Inflorescence of C. grandiflora with large, showy flowers, to attract pollinators. (D) SNP discovery in C. rubella (top) and C. grandiflora (bottom). Samples were randomly downsampled ten times. Means of segregating transpecific (tsSNPs, purple), species specific in C. rubella (ssCrSNPs, blue), and species specific in C. grandiflora (ssCgSNPs, green) SNPs. (E) Number of heterozygous (light colours) and homozygous SNP calls (dark colours). (F) Average decay of linkage disequilibrium in C. grandiflora (green) and C. rubella (blue).

https://doi.org/10.7554/eLife.43606.003

Results

Polymorphism discovery in C. grandiflora and C. rubella

The species Capsella rubella is young, only 30,000 to 200,000 years old, and was apparently founded when a small number of C. grandiflora individuals became self-compatible (Foxe et al., 2009; Guo et al., 2009). Previous studies had hinted at unequal retention of C. grandiflora alleles across the C. rubella genome (Gos et al., 2012; Brandvain et al., 2013), leading us to analyse this phenomenon systematically by comparing the genomes of 50 C. rubella and 13 C. grandiflora accessions from throughout each species’ range (Figure 1—figure supplement 1 and Figure 1—source data 1). Because the calling of trans-specific SNPs (tsSNPs) is particularly sensitive to mismapping errors in repetitive sequences, we applied a set of stringent filters, resulting in 74% of the C. rubella reference genome remaining accessible to base calling in both species, with almost half (47%) of the masked sites in the repeat rich pericentromeric regions. After filtering, there were 5,784,607 SNPs and 883,837 indels. Unless otherwise stated, all subsequent analyses were performed using SNPs. Of these, only 27,852 were fixed between the two species, whereas 824,540 were found in both species (tsCgCrSNPs), consistent with the expected sharing of variation between the two species. In addition, 4,291,959 SNPs segregated only in C. grandiflora (species-specific SNPs; ssCgSNPs), and 640,256 only in C. rubella (ssCrSNPs). Sample rarefaction by subsampling our sequenced accessions indicated that common ssCrSNP and tsCgCrSNP discovery was near saturation in our experiment, though additional sampling will continue to uncover rare alleles (Figure 1D).

The consequences of selfing are easily seen as a dramatic reduction in genetic diversity in C. rubella (Figure 1—source data 2), consistent with the previously suggested genetic bottleneck (Foxe et al., 2009; Guo et al., 2009). As expected from a predominantly selfing species, SNPs segregating in C. rubella were much less likely to be heterozygous than those segregating in C. grandiflora, though evidence for occasional outcrossing in C. rubella is observed in the form of a variable number of heterozygous calls (Figure 1E). Selfing is also expected to reduce the effective rate of recombination between segregating polymorphisms. Linkage disequilibrium (LD) decayed, on average, to 0.1 within 5 kb in C. grandiflora, while it only reached this value at distances greater than 20 kb in C. rubella (Figure 1F). Though C. rubella is a relatively young species, it exhibits characteristics typical of a predominantly (but not exclusively) self-fertilising species: reduced genetic diversity, reduced observed heterozygosity, and reduced effective recombination rate. This last effect could potentially increase the visibility of signals for balancing selection from linked sites (Wiuf et al., 2004).

Capsella rubella demography

The degree of trans-specific allele sharing depends upon the level of gene flow between species, the age of the speciation event, and the demographic history of each resultant species. We first sought to understand how these neutral processes have affected extant polymorphism in C. grandiflora and C. rubella. We searched for evidence of population structure in our dataset by fitting individual ancestries to different numbers of genetic clusters with ADMIXTURE (Alexander et al., 2009) (Figure 2A and Figure 2—figure supplement 1A-B; k-values from 1 to 6). The best fit as determined by the minimum cross-validation error was three clusters, with one including all C. grandiflora individuals, and C. rubella samples split into two clusters. Principal component (PC) analysis (Price et al., 2006) of genetic variation revealed a similar picture, with PC1 separating the two species and PC2 separating the C. rubella samples (Figure 2A).

Figure 2 with 2 supplements see all
Demographic analysis of C.rubella.

(A) Admixture bar graphs (top) and PCA of population structure in C. grandiflora (green) and C. rubella (blue). The C. rubella colours correspond to the sampling locations in (B). Inset shows lines from outside Eurasia (Canary Islands and Argentina). (C) Pairwise interspecific identity-by-descent (IBD) between C. grandiflora and C. rubella samples. Comparisons between West C. rubella and C. grandiflora are in dark blue and E C. rubella and C. grandiflora in light blue. The minimum segment length threshold was 1 kb, and comparisons without IBD segments (all from the West C. rubella population) are at the bottom of the plot. (D) Total lengths of interspecific IBD sharing by sample site within the E C. rubella population. An approximate distribution of C. grandiflora is shown for comparison in green. (E) The most likely demographic model of C. rubella and C. grandiflora evolution as inferred from joint allele frequency spectra by fastsimcoal2. Arrows indicate gene flow.

https://doi.org/10.7554/eLife.43606.007

C. rubella population structure was strongly associated with geography. Samples from western Europe and southeastern Greece were unambiguously assigned to separate groups, while samples from northern and western Greece, near the presumed site of speciation in the current range of C. grandiflora (Hurka and Neuffer, 1997), showed mixed ancestry (or intermediate assignment to these groups, Figure 2A–B). A single C. rubella sample from western Europe showed some mixed ancestry. This sample was collected near Gargano National Park on the eastern coast of Italy. The source of its mixed ancestry is unclear, but its proximity to Greece suggests that it may result from ongoing migration across the Adriatic Sea. The general pattern of population structure is consistent with the centre of diversity for C. rubella being in northern Greece and a more recent rapid expansion into Western Europe, and agrees with predictions made based on previous, smaller datasets (Brandvain et al., 2013). The observed structure is principally organised by a major geographic barrier, the Adriatic Sea. We therefore separated our samples into into two distinct groups to the west (W) and east (E) of the Adriatic Sea for subsequent analyses.

Because their current ranges overlap, ongoing gene flow between sympatric C. rubella and C. grandiflora could be a potentially important source of allele sharing between the two species. While a previous study had not found any evidence for such a scenario (Brandvain et al., 2013), one of our C. grandiflora samples was assigned partial ancestry to the otherwise C. rubella-specific clusters, and resided at an intermediate position along PC1 (Figure 2A). Furthermore, eastern C. rubella individuals, many of which grew in sympatry with C. grandiflora, were less differentiated from C. grandiflora compared to western C. rubella samples along PC1 (Figure 2A and Figure 2—figure supplement 1C-D). Gene flow between eastern C. rubella and C. grandiflora was supported by significant genome-wide D-statistics for C. rubella samples from the C. grandiflora range (ABBA-BABA test; comparing each E individual with the W population) (Green et al., 2010; Durand et al., 2011), with D decreasing as a function of distance from the centre of C. grandiflora’s range (Figure 2—figure supplement 1 and Figure 2—source data 1). Because D statistics can be sensitive to ancient population structure (Durand et al., 2011), we further relied on identity-by-descent (IBD) segments as detected by BEAGLE (Browning and Browning, 2013) to identify genomic regions of more recent co-ancestry across these species. The proportion of the genome shared in IBD segments between C. rubella and C. grandiflora also decreased as a function of distance between samples, and the strongest evidence for recent ancestry was found between C. grandiflora individuals and sympatric northern Greek C. rubella lines (Figure 2C–D). These results indicate that gene flow is ongoing between the species, consistent with interspecific crosses often producing fertile offspring, specifically with C. rubella as the paternal parent (Sicard et al., 2011; Rebernig et al., 2015).

To estimate the magnitude and direction of gene flow and other demographic events that have shaped genetic variation in the two species we used fastsimcoal2 (Excoffier et al., 2013) to compare the likelihood of a large number of demographic models given the observed joint site frequency spectrum (Figure 2E, Figure 2—figure supplement 2 and Figure 2—source data 2). The best fitting model estimated the split between C. rubella and C. grandiflora to have occurred 170,000 generations ago, associated with a strong reduction in C. rubella population size (to only 2–14 effective chromosomes, or 1–7 individuals). Bidirectional gene flow at a relatively low rate apparently occurred until just over 10,000 generations ago, when C. rubella split into the W and E populations, after which gene flow continued only from E C. rubella to C. grandiflora (Figure 2E).

The close timing of the end of gene flow into C. rubella and the split into two populations suggests that westward expansion of the C. rubella range reduced the opportunity for gene flow from C. grandiflora, with potential genetic reinforcement by the development of hybrid incompatibilities (Sicard et al., 2015). If we assume an average of 1.3 years per generation as found in the close relative, A. thaliana (Falahati-Anbaran et al., 2014), which has similar life history and ecology, the population split and the end of introgression from C. grandiflora occurred around 13,500 years ago. This date is similar to the spread of agriculture and the end of the last glaciation in Europe (Walker et al., 2009), suggesting that C. rubella’s success might have been facilitated by one or both of these events.

Non-random polymorphism sharing after a genetic bottleneck

Our analyses provide dates for the bottleneck and rapid colonisation events that have led to dramatically reduced genetic variation in C. rubella. Yet, over half of the segregating variants in C. rubella were also found in C. grandiflora (Figure 1D). Such tsCgCrSNPs could originate from independent mutation in each species (identity by state, IBS). Alternatively, they could be the result of introgression after speciation or they could reflect retention of the same alleles since the species split (identity by descent, IBD). Older retained alleles are expected to be found at elevated frequencies relative to the genome-wide average, while younger, recurrent mutations are expected to be rare. We therefore identified ancestral and derived alleles by comparison with the related genus Arabidopsis, and then compared the derived allele frequency spectra of tsCgCrSNPs and ssSNPs in Capsella as a proxy for allele age. We found that tsCgCrSNPs are strongly enriched among high-frequency alleles in both Capsella species (Figure 3A, p-value << 0.0001 in C. grandiflora and C. rubella, Mann-Whitney U-test). At allele frequencies greater than 0.25 in C. rubella, tsCgCrSNPs accounted for more than 80% of all variation. These results indicate that tsCgCrSNPs are predominantly older alleles that were already present in the common ancestral population of C. rubella and C. grandiflora or that were introgressed from C. grandiflora to C. rubella prior to its expansion into western Europe.

Figure 3 with 9 supplements see all
Unequal presence of ancestral variation in modern C.rubella.

(A) Derived allele frequency spectra (DAF) of ssCgSNPs (green), ssCrSNPs (blue), and tsCgCrSNPs (purple) in C. grandiflora (top) and C. rubella (bottom). The inset depicts the fraction of alleles that are species or transpecific as a function of derived allele frequency (DAF). (B) Fraction of coding (CDS) SNPs that result in non-synonymous changes as a function of SNP sharing. (C) Fraction of genic SNPs as a function of SNP sharing. Because SNPs in different classes (ssSNPs, tsSNPs) differ in allele frequency distributions, we normalised by downsampling to comparable frequency spectra. Each bar consists of 1000 points depicting downsampling values. (D) 20 kb genomic windows required to cover different fractions of ssSNPs and tsSNPs. The black line corresponds to a completely even distribution of SNPs in the genome. tsSNPs deviate the most from this null distribution. (E) tsCgCrSNP density in 20 kb windows (5 kb steps) along the eight Capsella chromosomes. Histogram on the right shows distribution of values across the entire genome. (F) tsCgCrSNP density and (G) Watterson’s estimator (Θw) of genetic diversity along scaffold 2. The repeat dense pericentromeric regions are not filled. (H–J) Correlation of tsCgCrSNP density in 20 kb non-overlapping windows with genetic diversity in C. rubella (H), genetic diversity in C. grandiflora (I), and interspecific Fst (J). Spearman’s rho is always given on the top left. Only windows with at least 5000 accessible sites in both species were considered.

https://doi.org/10.7554/eLife.43606.012

The distribution of tsCgCrSNPs was uneven across the genome. When compared to ssCrSNPs drawn from the same allele frequency distribution, tsCgCrSNPs were less likely to result in nonsynonymous changes (Figure 3B, p-value < 0.001, from 1000 jackknife resamples from the same allele frequency distribution), but they were more likely to be in genes (Figure 3C). As expected for transpecific haplotype sharing, eighty-three percent of all tsCgCrSNPs were in complete LD with at least one other tsCgCrSNP in C. rubella, and the density of tsSNPs along the genome was highly variable (Figure 3D–G). tsCgCrSNP density was positively correlated with local genetic diversity in C. rubella (and less strongly so with genetic diversity in C. grandiflora; Figure 3F–I and Figure 3—figure supplements 25), and negatively correlated with differentiation between the species as measured by Fst (Figure 3J and Figure 3—figure supplements 25). The uneven pattern of diversity was similar in each C. rubella subpopulation (Figure 3—figure supplements 69), indicating that most of the retained polymorphism already segregated prior to colonisation. Thus, most common genetic variation in C. rubella is also retained in its outcrossing ancestor, and the rate of retention varies dramatically between genomic regions.

High density of tsSNPs around immunity-related loci

The observed heterogeneity in shared diversity across the C. rubella genome could be a simple consequence of a bottleneck during the transition to selfing. In the simplest scenario, C. rubella was founded by a small number of closely related individuals, and stochastic processes during subsequent inbreeding caused random losses of population heterozygosity. A study of genetic variation in bottlenecked populations of the Catalina fox found this exact pattern (Robinson et al., 2016). Alternatively, there may be selective maintenance of diversity in specific regions of the genome due to balanced polymorphisms, with contrasting activities of the different alleles. To explore this latter possibility, we tested whether the likelihood of allele sharing was dependent on annotated function of the affected genes. We found that tsCgCrSNPs were strongly biased towards genes involved in plant biotic interactions, including defense and immune responses, and also toward pollen-pistil interactions, though less strongly (Supplementary file 1, Figure 4A). Amongst the top ten enriched Gene Ontology (GO) categories for biological processes were apoptotic process, defense response, innate immune response, programmed cell death, and defensive secondary metabolite production (specifically associated with terpenoids). Of genes annotated with apoptotic process, 87% were homologs of A. thaliana NLR genes, a class of genes best known for its involvement in perception and response to pathogen attack (Jones and Dangl, 2006). An even higher enrichment for tsCgCrSNPs was found when testing this class of genes specifically, with tsCgCrSNPs falling in NLR genes being more likely than those in other types of genes to result in nonsynonymous changes (Figure 4A–B). These results indicate that despite a severe global loss of genetic diversity, genes involved in plant-pathogen interactions have maintained high levels of genetic variation in C. rubella.

Figure 4 with 2 supplements see all
Preferential sharing of alleles near immunity genes.

(A) Enrichment of tsCgCrSNPs and non-synonymous tsCgCrSNPs for genes associated with significant GO terms (means, blue; medians, red). Each point represents values calculated for an individual gene. For example, in the upper subplot each point is the number of tsSNPs identified in a gene divided by the total number of SNPs identified for a gene. GO terms with a significantly increased ratio of nonsynonymous changes are highlighted with a yellow box. (B) tsCgCrSNP frequency as a function of distance to the closest NLR cluster. (C) Pairwise genetic diversity at neutral (four-fold degenerate) sites as a function of distance to the closest NLR cluster. For (B–C) the thick black lines are mean values calculated in 500 bp windows as a function of distance from a NLR gene. The thin black lines are mean values from 100 random gene sets of equivalent size. The grey polygons are the range of values across all 100 random gene sets. (D) Chromosomal locations of Bal regions with the strongest evidence for balancing selection. Immunity genes with known function in A. thaliana in each region indicated. (E) Values for Watterson’s estimator (Θw) of diversity in Bal regions, calculated from 20 kb windows. (F) Tajima’s D. Dots in (E, F) are a random sample of 1000 windows for non candidate windows. Boxplots report the median 1st and 3rd quantiles of all windows in each class. (G) An example of the site level (dots) and windowed (green) decrease in Fst at the first region on chromosome 2. The subregion without data near 1 Mb is a CC-NLR cluster, which was largely masked for variant calling.

https://doi.org/10.7554/eLife.43606.022

While the high density of tsCgCrSNPs near immunity genes was intriguing, NLR genes frequently occur in complex clusters, which could elevate error rates during SNP calling and thus potentially influence our analyses. Of particular concern is that sequencing reads derived from paralogs not found in the reference, but present in some accessions, could be mismapped against the reference, leading to false positive tsCgCrSNPs calls. We therefore examined whether tsCgCrSNPs showed more evidence of such errors than other SNPs. Mismapping should increase coverage and reduce concordance (the fraction of reads supporting a particular call) at a site. That the distributions of these two metrics were nearly identical at tsCgCrSNPs and ssSNP sites indicates, however, that mismapping is unlikely to have affected our SNP calls (Figure 4—figure supplement 1). Mismapping is also expected to cause pseudo-heterozygous calls, due to reads from different positions in the focal genome being mapped to the same target in the reference genome. However, tsCgCrSNPs were not more likely to be found in the heterozygous state as compared to ssSNPs (Figure 4—figure supplement 1). In addition, we asked whether the signal of increased tsCgCrSNPs density extended into sequences adjacent to NLRs and is detectable even when masking the NLR clusters themselves. For this purpose, we collapsed NLR genes within 10 kb of one another into a single region, and calculated tsCgCrSNPs rates and genetic diversity as a function of distance from these collapsed regions, ignoring SNPs within the focal cluster. We found that elevated tsCgCrSNPs sharing and genetic diversity extended over 100 kb from NLR genes. Thus, increased sharing is not an artefact of the internal structure of NLR clusters (Figure 4B–C).

Increased retention of genetic diversity near immunity loci suggests that these genes might be the targets of balancing selection in either C. rubella, C. grandiflora, or both species. However, neutral processes including random introgression and stochastic allele fixation can give rise to uneven distributions of genetic variation across the genome after genetic bottlenecks (Robinson et al., 2016). We sought to identify regions that showed a pattern of allele sharing that was unlikely to have occured neutrally, as indicated by low values of the fixation index Fst, which quantifies genetic differentiation between populations. We compared the observed values of Fst between C. rubella and C. grandiflora to a distribution calculated from simulated sequences under our previously inferred neutral demographic model, which included gene flow between C. rubella and C. grandiflora. We simulated one million 20 kb DNA segments, or just over 7,000 C. rubella genome equivalents, under the neutral model and calculated the expected distribution of Fst values. Using this distribution, we assigned the probability of observing the Fst value for each non-overlapping 20 kb window throughout the genome. After Bonferroni correction and joining of adjacent significant segments, we identified 21 genomic regions that we designated as candidate targets of balancing selection (Bal, Figure 4D and Figure 4—source data 1). Bal regions showed several classical indications of balancing selection including substantially higher Tajima’s D and within-C. Rubella genetic diversity relative to the remainder of the genome (Figure 4E–F; p<<0.001 Mann-Whitney U-test for both statistics). tsCgCrSNPs in Bal regions were also less likely to have been lost during colonisation of Western Europe than ssCrSNPs or tsCgCrSNPs in other parts of the genome, and allele frequencies in Bal regions showed elevated correlation across populations (Figure 4—source data 2). Like tsSNPs in general, Bal regions did not show evidence for increased heterozygosity that might indicate increased error rates in SNP calling (Median Observed - Expected Heterozygosity in 20 kb windows was −0.021 inside of Bal regions, and −0.020 outside of these regions).

Estimates of Fst were reduced in large segments surrounding NLR and other immune gene candidate clusters (Figure 4G), consistent with allele sharing being the result of linkage to a nearby balanced polymorphism. Of the 21 candidate regions, nine overlapped with clusters of NLR genes, and five with clusters of RLK/RLP or CRK genes, two classes of genes with broad roles in innate immunity (Yeh et al., 2015; Zipfel, 2008). Many of the specific regions we identified in Capsella have been directly demonstrated to function in disease resistance in A. thaliana (McDowell et al., 2000; McDowell, 1998; Goritschnig et al., 2012; Holub, 1994; Gassmann et al., 1999; Zhang et al., 2014; Zhang et al., 2013; Xu, 2006; Yeh et al., 2015). RPP1 and RPP8 have been previously suggested as candidate targets of balancing selection, and trans-specific polymorphism has been reported at the RPP8 locus in the genus Arabidopsis (Bergelson et al., 2001; Wang et al., 2011). It should be noted, however, that these genes are often members of larger linked NLR gene superclusters, with some of the regions our approach identified being sizeable and thus making it difficult to pinpoint a single focal gene. Indeed the strong signal found in these regions could result from multiple linked balanced sites. Furthermore, the strongest signals of balancing selection are mostly derived from linked sites, rather than the clusters themselves, because confident SNP calling is very difficult, if not impossible, with short reads in the most complex genomic regions (Figure 4G).

It is formally possible that the unusual pattern of diversity that we observe near Bal loci could result from historical balancing selection in the outcrossing ancestor C. grandiflora rather than ongoing selection in the selfing C. rubella. Population genetic indices such as Fst, nucleotide diversity pi, Tajima’s D, and allele sharing are not fully independent, and elevated diversity in the C. rubella founding population, driven by historical balancing selection, could also generate the observed patterns. Genetic diversity was only modestly elevated in these regions in C. grandiflora (p<<0.001 Mann-Whitney U-test, Figure 4E), and Tajima’s D was not significantly different from other windows (Figure 4F), suggesting that this is not very likely. If balancing selection is acting at these loci in the outcrosser, it is clear that its genomic footprint is small, perhaps due to the rapid decay of LD in this species relative to the selfing C. rubella. Still, it is possible that even a small elevation of genetic diversity in Bal regions in the founding populations might have considerable impact on subsequent C. rubella diversity. We approximated this situation using our simulated genetic data. We subsetted simulations by the level of genetic diversity in C. grandiflora, choosing the top 1% of simulated values. Even in the case of elevated founder diversity in these data, the observed Fst values in Bal regions remain exceptionally unlikely (p<0.0001). These observations point to ongoing balancing selection within C. rubella maintaining diversity in Bal regions.

Adaptive retention of C. grandiflora diversity in Bal regions could be explained by two non-exclusive models. Allelic variation might have been present in the C. rubella founding population and maintained by balancing selection until the present. Alternatively, beneficial alleles may have been introgressed from C. grandiflora after the evolution of selfing, and retained by balancing selection. We searched for evidence of recent ancestry between the two species in Bal regions. A larger fraction of Bal region sequence was found to be IBD when compared to the genome-wide average (Figure 4—figure supplement 2), consistent with elevated retention of introgressed alleles in these regions. Shared segments in Bal regions were on average shorter than those found in other parts of the genome, suggesting that they are older and have been subjected to longer periods of recombination since the introgression event (median within 3,503 bp, median outside 6,661 bp, Wilcoxon-rank sum test, p=1e-54), although we cannot exclude the influence of differing patterns of recombination in these regions as a contributing factor to this observation.

Elevated IBD rates in Bal regions might result from gene flow between the species in either direction, and our previous results suggested that most modern gene flow occurs through introgression of C. rubella alleles into C. grandiflora. We explored the geographic pattern of IBD between C. rubella and C. grandiflora in Bal regions to determine whether it differs from that of the genome-wide pattern. Within the East population, IBD decayed as a function of distance from the C. grandiflora range in a manner comparable to the observed genome-wide pattern, albeit with a more shallow slope (Figure 4—figure supplement 2). In contrast to the genome-wide pattern, high levels of IBD were observed between C. grandiflora and West population accessions. Thus, we find evidence for neutral gene flow throughout the genome, perhaps dominated by C. rubella to C. grandiflora introgression, as indicated by our demographic simulations. However, allele sharing appears to be older in Bal regions and introgressed alleles have been retained for longer periods even after colonisation of Western Europe. This latter observation is consistent with the hypothesis that alleles were introgressed prior to the most recent range expansions in C. rubella, and that variation was subsequently maintained by selection in Bal regions.

Balancing selection over millions of years

Although evidence for balancing selection at immunity-related loci in C. rubella is much stronger than in C. grandiflora, it is difficult to completely exclude the effect of founder diversity at these loci on the observed patterns. We therefore sought to validate our findings in a related species that has been separated from C. grandiflora and C. rubella for a long time. The genus Capsella offers a unique opportunity to test the longevity of balancing selection, because selfing has evolved independently in C. orientalis, which diverged from C. grandiflora and C. rubella more than one million years ago and whose modern range no longer overlaps with the two other species, preventing ongoing introgression (Hurka et al., 2012; Douglas et al., 2015). We expected the evolution of selfing to have generated a similar bottleneck as in C. rubella (Douglas et al., 2015; Bachmann et al., 2018), and we therefore resequenced 16 C. orientalis genomes, to test whether there is evidence of balancing selection at similar types of loci.

After alignment, SNP calling, and filtering, we identified a mere 71,454 segregating SNPs in C. orientalis. This is a surprisingly small amount of variation, corresponding to an almost 50-fold reduction in diversity relative to the outcrossing C. grandiflora (Figure 5—source data 1). Using our divergence and diversity measures, we estimated that C. orientalis diverged from C. grandiflora over 1.8 million generations ago (calculated as in ref. Brandvain et al., 2013). The combination of long divergence times and low variability in C. orientalis makes it unlikely that alleles will have been maintained by random chance. Using estimates of Ne from nucleotide diversity at four-fold degenerate sites (C. orientalis [14,643] and C. grandiflora [694,643]), the divergence time above, and the genome assembly size of 134.8 Mb, the probability of finding a single tsSNP is <4×10−19 using the methodology of Leffler and colleagues and Wiuf and colleagues (Leffler et al., 2013a; Wiuf et al., 2004), which assumes constant population size. It was therefore surprising that 8,408 C. orientalis variants were shared with either C. rubella or C. grandiflora (ts2-waySNPs), and 3992 with both (ts3-waySNPs, Figure 5A–B). In each of the three species, ts3-waySNPs were enriched at higher derived allele frequencies relative to ssSNPs and ts2-waySNPs, suggesting that they are on average the oldest SNPs (Figure 5C).

Figure 5 with 2 supplements see all
The signal of ancient balancing selection.

(A) Absolute number and (B) fraction of ssSNPs (blue), ts2-waySNPs (purple), and ts3-waySNPs (orange) for Capsella. Only sites accessible to read mapping in all three species were considered. (C) Derived allele frequency (using A. lyrata and A. thaliana as outgroup) of ssSNPs (blue), ts2-waySNPs (purple) and ts3-waySNPs (orange). (D) Pairwise diversity (π) as a function of distance from a ts3-waySNP in C. grandiflora, C. rubella, and C. orientalis. Bottom right, Divergence between the C. rubella/C.grandiflora and C. orientalis lineages as a function of distance from a ts3-waySNP. Black dots, means over all sites at a particular distance, and coloured lines, means over bins of 50 bp. (E) Enrichment of ts3-waySNPs and ts3-wayhqSNPs in candidate balanced regions from the C. rubella/C. grandiflora comparison. (F) Watterson’s estimator (Θw) of genetic diversity in C. orientalis, in 20 kb windows. Grey-to-blue scale indicates the genome-wide percentile of the same window for Θw) in C. rubella. (G) ts3hqSNP number in each window.

https://doi.org/10.7554/eLife.43606.027

Because this large amount of trans-specific polymorphism was unexpected, we wanted to ensure that this was not due to more error-prone read mapping to a distant reference. We therefore also used an additional set of more stringent filters to identify high confidence ts3-waySNPs (ts3-wayhqSNPs; see Materials and methods). Importantly, we required ts3-wayhqSNPs to be in LD with at least one other ts3-wayhqSNP in all three species (r2 >0.2 in the same phase), to provide evidence that they represented the same ancestral haplotype. The aim was to improve the likelihood that such SNPs were true examples of identity by descent. Furthermore, we generated a draft assembly of the C. orientalis genome using Pacific Biosciences SMRT cell technology, and re-called ts3-waySNP sites. We identified 812 high quality transpecific SNPs segregating in all three species (ts3-wayhqSNPs). The distributions of coverage and concordance values in this dataset were similar between ts3-waySNP sites and other C. orientalis sites, further supporting their authenticity (Figure 5—figure supplement 1).

As discussed earlier, the presence of trans-specific polymorphism in diverged species could be driven by stable balancing selection or it could result from gene flow between the species. While C. grandiflora and C. rubella occur around the Mediterranean, C. orientalis is restricted to Central Asia (Hurka et al., 2012) and its current distribution is far from that of C. grandiflora and C. rubella. Modern gene flow between the C. orientalis and C. rubella/C. grandiflora lineages is therefore unlikely, but it is possible that the ranges of these species overlapped in the past. If alleles have been maintained since the split between the lineages, then the divergence between maintained alleles should meet or exceed the divergence between the species. On the other hand, if ts3-wayhqSNPs are the result of recent gene flow between the lineages, then divergence between species near these SNPs should be reduced compared to the genome-wide average divergence. We examined diversity and divergence at neutral (four-fold degenerate) sites surrounding ts3-wayhqSNPs (Figure 5D). In all three species, diversity was high directly adjacent to ts3-wayhqSNPs, close to average levels for genome-wide divergence between the two Capsella lineages. This footprint of elevated diversity is much more discernible in the two selfing species than in C. grandiflora. No obvious reduction in divergence was observed near ts3-wayhqSNPs (Figure 5D). We conclude that ts3-wayhqSNPs correspond predominantly to long-term maintained alleles that diverged on ancient time scales and that they are not the result of recent introgression.

The finding of tsSNPs shared between two independent lineages, C. grandiflora/C. rubella and C. orientalis, for over a million generations in spite of strong geographic barriers suggests that they are targets of stable long-term balancing selection. If this selection pressure remains constant across species, ancient alleles are expected to evolve towards similar equilibrium intermediate frequencies. In comparison to ts2-waySNPs, the minor alleles at ts3-wayhqSNP sites are closer to intermediate frequencies in all three species (Figure 5—figure supplement 2). Furthermore, ts3-wayhqSNPs segregate at more similar allele frequencies in C. rubella and C. grandiflora than other two-way tsSNPs, as measured by Fst median values: 0.03 for ts3-wayhqSNPs and 0.16 for ts2-waySNPs, p<<0.001 Mann-Whitney test) and correlation of derived allele frequencies (Figure 4—source data 2). These results suggest a conserved equilibrium maintained since the isolation of C. rubella and C. grandiflora over 10,000 generations ago. Derived allele frequencies for ts3-wayhqSNPs are not correlated between the two ancient Capsella lineages (Spearman’s rho −0.08 C. orientalis to C. grandiflora and −0.04 to C. rubella). It is possible that demographic reduction or habitat shift in C. orientalis has disturbed this equilibrium.

Like tsCgCrSNPs, ts3-waySNPs are strongly enriched in GO categories associated with immunity (Supplementary file 2). Our previously identified balanced regions strongly predicted the genomic distribution of ts3-waySNPs; 50% of ts3-wayhqSNPs fell into these regions, even though they encompass fewer than 10% of tsCgCrSNPs and fewer than 3% of all SNPs, resulting in an even more skewed and uneven distribution of genetic diversity along the genome (Figure 5F–G). At least one ts3-wayhqSNP was found in each of 10 of the 21 original candidate regions under balanced selection. Six of these corresponded to NLR clusters, two to RLK/RLP clusters, and one to a TIR-X cluster. Only one region did not contain a clear immunity candidate, with the caveat that this conclusion is based on the single annotated C. rubella reference genome (Slotte et al., 2013). Thus, even in a situation where a recent genetic bottleneck has wiped out almost all genetic diversity, there is very strong selection to maintain allelic diversity at specific immunity-related loci, consistent with these alleles having persisted already for very long evolutionary times.

Insights into balancing selection from de novo assembly of MLO2

The balanced regions we identified contained very old tsSNPs, yet as mentioned, the immunity genes themselves are often not accessible to variant discovery based on mapping short reads to a single reference genome. Furthermore, it is possible, or even likely, that the strongest evidence for balancing selection comes from loci that include several linked targets of balancing selection. This combination of factors makes it difficult to pinpoint potential functional changes maintained by balancing selection in these regions. To discover functional changes, we therefore focused on ts3-wayhqSNPs that did not fall in our large balanced regions but were clustered in regions of the genome that were likely less complex. We selected genes that were well covered by reads in all three species (>80% sites), contained at least six high quality tsSNPs, at least one non-synonymous ts3-wayhqSNP, were at least 100 kb from any of our candidate balanced regions, and had been functionally characterised in A. thaliana. These filters singled out a homolog of the A. thaliana MLO2 gene as a particularly good candidate for more detailed analysis (Supplementary file 3 and Figure 6).

Figure 6 with 3 supplements see all
Evidence for long-term balancing selection at MLO2.

(A) Diagram of MLO2 protein in the cell membrane. Blue ovals, transmembrane domains. Top is the extracellular space. Orange dots represent amino acid differences between proteins encoded by haplogroups A and B. (B) Haplogroup identification with reference based SNP calls. Circles represent ts3-wayhqSNPs and colours represent the reference (green) and alternative (blue) SNP calls. Numbers indicate haplotype frequencies in each species. (C) Top: A diagram of the MLO2 region on scaffold 1. Grey boxes represent coding regions. Empty circles show the positions of the seven initially identified ts3-wayhqSNPs shown in (B). MLO2b gene is drawn based on the reference annotation, but alignment with orthologous genes suggested a misannotation of the last splice site acceptor leading to truncation of the annotated gene. For final alignments, the corrected annotation was used. Middle: Average read coverage by haplogroup and species (blue is C. rubella and orange is C. orientalis). Region of poor coverage in haplogroup B is highlighted with a black bar. The green and yellow bars below the coverage plots highlight the de novo assembled region and the windows from which the neighbour-joining trees were built (excluding indels, each window is 1 kb). The blue and orange circles on the tree indicate samples from each species. Black scale indicates substitutions in trees.

https://doi.org/10.7554/eLife.43606.031

MLO2 encodes a seven-transmembrane domain protein with a conserved role in plant disease susceptibility (Figure 6A) (Consonni et al., 2006). The C. rubella MLO2 locus has experienced a tandem duplication, resulting in two genes, MLO2a and MLO2b. Although both homologs are sufficiently diverged to be accessible to unambiguous read mapping, all six ts3-wayhqSNPs were in MLO2b (Figure 6B–C). In C. rubella and C. orientalis, the ts3-wayhqSNPs were arranged in five different haplotypes, which we collapsed into three related haplogroups, A, B and C (Figure 6B). The reference haplogroup A was most frequent in both species.

Because several known targets of balancing selection in A. thaliana are the result of structural variation, or lesions larger than 1 kb (Mauricio et al., 2003; Stahl et al., 1999), we examined coverage patterns around the MLO2 locus to identify potential linked indels. We found that haplogroup B in both C. rubella and C. orientalis exhibited similar patterns of low read coverage at the 5’ end of MLO2b, suggesting a possible indel (Figure 6C). To examine the exact sequence of each allele, we took advantage of the homozygous nature of sequence data from these two selfing species and performed local de novo assembly of the MLO2 locus from read pairs mapping to this region. We were able to reconstruct the locus for 15 C. orientalis samples (13 haplogroup A and two haplogroup B) and 43 C. rubella samples (34 A, 2 B, and 7 C). Surprisingly, a comparison of the different haplotypes revealed that the pattern of low coverage observed for haplogroup B was not due to structural variation, but instead to extremely high divergence from the reference haplogroup A (Figure 6—figure supplement 1). Divergence between alleles within species was greater than 0.15 differences per bp, over three times higher than the genome-wide divergence between the species (Figure 6—figure supplement 1 and Figure 5—source data 1). This highly diverged region had therefore been originally inaccessible to reference-based read mapping in haplogroup B samples. De novo assembly allowed us to identify a total of 204 additional tsSNPs, nearly all of which mapped to the 5’ end of MLO2b (Figure 6—figure supplement 1). Neighbour-joining trees revealed the expected clustering of samples by species in regions adjacent to MLO2b, but clear clustering by haplogroup within the 5’ region, a pattern that is reproduced in phylogenetic analysis of the entire CDS (Figure 6C and Figure 6—figure supplement 2). Importantly, divergence within haplogroup across species was greater than, or similar to genome-wide averages for both A and B, demonstrating that recent introgression did not give rise to allele sharing (Figure 6—figure supplement 1).

The high nucleotide divergence between haplogroups A and B translates into numerous amino acid differences in the N terminal half of the encoded proteins. In a 157 amino acid stretch, 31 amino acid differences are found in both species (Figure 6A and Figure 6—figure supplement 3), with an indel polymorphism accounting for another seven amino acid differences. The large number of differences between the two haplogroups makes it difficult to point to any specific change as the target of balancing selection, but it seems likely that the two alleles differ functionally, perhaps reinforced by additional differences in the promoter. In summary, the nucleotide divergence in this region suggests that the MLO2b haplogroups are much older than the split between the two species.

Discussion

While balancing selection has long been recognised as an important evolutionary force, its relevance as a major factor shaping genomic variation has remained unclear (Charlesworth, 2006; Wiuf et al., 2004; Asthana et al., 2005). We have taken advantage of unique demographic situations in two Capsella lineages to demonstrate not only that there is pervasive balancing selection at immunity-related loci in this genus, but also that the same alleles are maintained in species that are likely experiencing quite different pathogen pressures. We expect that balancing selection plays a similar role in other taxa, but that its effects are masked by a background of higher neutral genetic diversity and more frequent recombination between balanced sites and linked variants (Wiuf et al., 2004; Charlesworth, 2006). In addition, the detection of long-term balancing selection is further compounded by very old alleles being less accessible to short read re-sequencing, the dominant mode of variant discovery today. In the two selfing Capsella species, the footprints of balancing selection extend for tens of kilobases, greatly impacting diversity of many other genes. While this makes it more difficult to pinpoint the actual selected variants, it greatly improves statistical power to identify regions under balancing selection. This is reminiscent of genome-wide association studies, where extended LD improves statistical power to detect causal regions of the genome but reduces the ability to identify the specific causal variants (Atwell et al., 2010).

The nature of balancing selection acting on the regions we have identified remains to be clarified. Stable balancing selection in self fertilising species is unlikely to derive from heterozygous advantage, pointing to negative frequency-dependent selection or fluctuating selection from variable pathogen pressures as possible factors. While the mode of selection cannot be determined from these static data, the strong signal that we observe in highly selfing lineages points to environmental heterogeneity or negative frequency dependent selection over heterozygote advantage. Based on the enrichment of immunity-related genes, it appears that biotic factors are the dominant drivers of long-term maintenance of polymorphism. This observation is consistent with a large body of work on intraspecific variation in A. thaliana. The signal of balancing selection has been observed for specific pairs of disease resistance alleles in A. thaliana (Stahl et al., 1999; Tian et al., 2002; Tian et al., 2003; Mauricio et al., 2003; Bakker, 2006), and in the case of the resistance gene RPS5, alternative alleles have been shown to affect fitness in the field (Karasov et al., 2014). It is possible, or perhaps even likely, that the signal of balancing selection is amplified by the fact that immunity-related loci occur in clusters (Meyers, 2003) and that our strongest signal is the result of simultaneous selection on several genes in these regions in a situation analogous to the MHC in animals (Hedrick, 1998). Thus, biotic factors might not be quite as important as our analyses make them appear. On the other hand, it is also possible that the clustering of disease resistance genes itself is a product of selection, if selection was more effective when acting on groups of genes (Charlesworth and Charlesworth, 1975), or if evolution under a balanced regime was deleterious at other types of loci. Even if we accept that biotic factors predominate, the nature of the potential trade-offs that prevent individual alleles from becoming fixed is still a mystery, but it might involve conflicts between growth and defense (Coley et al., 1985; Walling, 2009; Herms and Mattson, 1992), beneficial and harmful microbe interactions (Walters and Heil, 2007), or defense against different types of pathogens (Kliebenstein and Rowe, 2008). What is clear is that the trade-offs must be stable over very long periods of evolution.

Our findings suggest a model in which the success of self fertilising populations may be buoyed by gene flow from outcrossing relatives in a situation analogous to evolutionary rescue strategies in conservation biology (Whiteley et al., 2015). This model is a variation on the theme of adaptive introgressions, which have recently emerged as a major evolutionary force in a wide range of taxa (Whitney et al., 2006; Castric et al., 2008; Pease et al., 2016; Dasmahapatra et al., 2012; Henning and Meyer, 2014; Hedrick, 2013; Huerta-Sánchez et al., 2014; Racimo et al., 2015Castric et al., 2008; Pease et al., 2016; Dasmahapatra et al., 2012; Whitney et al., 2006; Huerta-Sánchez et al., 2014; Hedrick, 2013). The unique feature of self-fertilisation in comparison to these examples is that the amplified effects of linked selection and genetic drift lead to a steady loss of genetic variation over time. Constant replenishment via adaptive introgression from an outcrossing relative counters the loss of diversity at immunity-related loci, thereby preventing decreased fitness in competition with pathogens. Whether this model generally applies will require independent study of other lineages of related self-fertilising and outcrossing populations at various stages of speciation.

Finally, we note that maintenance of ancient variants is most easily detectable in a background of low variation. Therefore, it could potentially be used to rapidly identify loci with meaningful functional variation. Typically, agricultural breeding panels seek to maximise surveyed diversity, but our results indicate that identification of useful immunity-related polymorphism with genomic data might be facilitated in otherwise homogeneous wild populations.

Materials and methods

Plant material and DNA extraction

Seeds were stratified for two weeks at 4°C and germinated in controlled environment chambers. Four to six rosette leaves were collected from each accession and frozen in liquid nitrogen for gDNA extraction. The methods available for extraction and sequencing varied as the project progressed, and 24 of the C. rubella and the 13 C. grandiflora samples were analysed independently in previous studies (Agren et al., 2014; Williamson et al., 2014). See Figure 1—source data 1 for a listing of DNA preparation, library construction, and sequencing technology by sample. In brief, DNA was extracted following an abbreviated nuclei enrichment protocol (Becker et al., 2011) or using the Qiagen Plant DNeasy Extraction kit. The recovered DNA was sheared to the desired length using a Covaris S220 instrument, and Illumina sequencing libraries were prepared using the NEBNext DNA Sample Prep Reagent Set 1 (New England Biolabs) or the Illumina TruSeq DNA Library Preparation Kit and sequenced on the instrument as listed in Figure 1—source data 1. We aimed for a minimum genome coverage of 40x. We mapped reads to the C. rubella reference genome (Slotte et al., 2013) resulting in realised coverages of 30 – 126x.

Sequence handling and variant calling

Initial sequence read processing, alignment, and variant calling were carried out using the SHORE (v0.8) software package (Ossowski et al., 2008). Read filtering, de-multiplexing, and trimming were accomplished using the import command discarding reads that had low complexity, contained more than 10% ambiguous bases, or were shorter than 75 bp after trimming. Reads were mapped to the C. rubella reference genome (Phytozome v.1.0) using the GenomeMapper aligner (Schneeberger et al., 2009) with a maximum edit distance (gaps or mismatches) of 10%. Alignments from each sample were then processed to generate raw whole genome reference and variant calls with qualities computed using an empirical scoring matrix approach (Cao et al., 2011) allowing heterozygous positions. Of the initial 53 C. rubella samples, two were removed because of low or uneven coverage, and one was removed as a misidentified C. bursa-pastoris sample (C. rubella and its polyploid relative C. bursa-pastoris are not easily identified phenotypically, but they can be distinguished by the extreme number of pseudo-heterozygous calls in the latter).

The per-sample raw consensus calls produced by SHORE were used to construct a whole genome matrix of finalised genotype calls for each species. Positions were considered only if covered by at least four reads and if overlapping reads mapped uniquely (GenomeMapper applies a ‘best match’ approach, so unique means that only one best match exists) (Schneeberger et al., 2009). We simultaneously considered information from all samples within a species to make base pair calls. If no variant was called in any sample then the site was treated as reference. Individual sample calls were made if four reads supported the reference base, the computed quality was above 24, and at least 80% of reads supported a reference call. A site was excluded if more than 30% of the samples from that species did not meet these criteria.

If at least one sample reported a difference from the reference in the raw consensus, then variant (indel or SNP) or reference calls were considered. The SNP calling parameters were slightly different for the two selfing species as compared to the outcrossing C. grandiflora because variants should only rarely be found in the heterozygous state in the former (and the frequency of heterozygous calls in a selfing species is a powerful filter to detect problems with mismapped reads). The general approach was to require at least one high quality variant call at a site and then to call genotypes in other samples with slightly reduced stringency. If no variant call met the more stringent threshold, then the site was reconsidered using the above reference criteria. Finally, the calls from each of the three species were combined into a master matrix. If a position was not called biallelic or invariant across the compared species, then it was not considered. To facilitate further analyses in PLINK (v1.9) (Chang et al., 2015) and vcftools (v0.1.12a) (Danecek et al., 2011), the genome matrix at biallelic SNP sites was also converted into a minimal vcf format.

Defining pericentromeric regions

Regions of high repeat density near the centromeres of all chromosomes as well as two large, repeat-rich regions in chromosomes 1 and 7 were removed from genome scans. Coordinates for these regions are listed in Supplementary file 4.

Site annotations

We used the SnpEff (v.3.2a) (Cingolani et al., 2012) software package to annotate variant and invariant sites for the whole genome. The annotation database was built using the C. rubella v1.0 Phytozome gff file. Sites were annotated using the table input function that includes annotation of fold degeneracy for each site in coding regions. Invariant sites were annotated using a table with dummy SNPs at each position. The SnpEff program outputs several annotations for some sites, and a primary annotation was selected by ranking the strength of effect of each annotation and reporting the annotation with the strongest effect (the rankings are listed in Supplementary file 5).

Ancestral state assignment

To calculate derived allele frequency spectra we assigned ancestral state to each polymorphic site using three-way whole genome alignments between C. rubella, A. thaliana, and A. lyrata (Slotte et al., 2013). Only biallelic sites identical between A. lyrata and A. thaliana (indels were ignored) were considered. For the two species analysis, only sites also fixed for the ancestral allele in C. orientalis were considered.

Trans-specific SNP annotation comparisons

To compare tsSNP and ssSNP annotations from similar allele frequency spectra, we binned 20,000 tsSNPs randomly drawn from throughout the genome by derived allele frequency (10 bins). We then drew an equivalent number of ssSNPs from each allele frequency bin and calculated the fraction of CDS SNPs that caused nonsynonymous changes and the fraction that fell in genes. This process was repeated 1000 times for both species to generate the plots shown in Figure 3B.

Analysis of population structure and demographic modeling

Genotypes at four-fold degenerate SNP sites called in C. grandiflora and C. rubella were pruned in PLINK (50 kb windows, 5 kb step, and 0.2 r2 LD threshold) and used as input for ADMIXTURE (v.1.23) (Alexander et al., 2009) and EIGENSTRAT (v6.0 beta) (Price et al., 2006). For demographic modelling in Fastsimcoal (v2.5.2.11) (Excoffier et al., 2013), joint minor allele frequency spectra were generated at four-fold degenerate sites with complete information and ignoring heterozygous calls in selfing lineages (counting only one allele from each individual). Demographic parameters for each tested model were then inferred in 50 runs of Fastsimcoal (parameters: -l40 -L40 -n100000 -N100000 -M0.001 -C5). The global maximum likelihood model was selected after correcting for number of estimated parameters using Akaike Information Criterion. Confidence intervals were set for estimated parameters using 100 bootstraps of identical inference runs on simulated data under the most likely model. To reduce computational times, global maximum likelihoods were calculated for bootstraps after 13 runs rather than 50. The mutation rate assumed for this and other analyses was 7 × 10−9 mutations/generation/ bp based on mutation rate measurements in Arabidopsis thaliana (Ossowski et al., 2010).

Segments of recent ancestry and interspecific introgression

Segments of IBD were identified using the phasing and segment identification in Beagle (r1339) (Browning and Browning, 2013). For the analysis presented here, we considered only the first haplotype from each C. rubella sample and both haplotypes from each C. grandiflora sample. Segments were required to be larger than 1 kb to be considered in the analyses. D statistics were calculated as in Green et al. (2010); Patterson et al. (2012); Dasmahapatra et al. (2012) comparing each individual genotype from the eastern C. rubella population to allele frequencies from western C. rubella and C. grandiflora. The outgroup species for these analyses was C. orientalis.

Sliding window analysis of genetic diversity

Population genetic diversity statistics for genome scans were calculated for each species by transforming variant calls from the genome matrix into FASTA files and inputting these files into the compute function from the libsequence analysis package (Thornton, 2003). Heterozygous bases were randomly assigned as reference or variant to generate a single haplotype for each sample. Weir and Cockerham’s Fst was calculated using vcftools (v.0.1.12a) on biallelic SNP sites.

Identification of balanced regions

To identify regions of the genome with unusually low Fst after speciation, we generated a null distribution of Fst values by simulating one million 20 kb segments under our inferred best demographic model using Fastsimcoal2. The output of each simulation was transformed to vcf format and Fst between C. grandiflora and each C. rubella subpopulation was calculated using vcftools. The probability of a particular Fst value in the observed data was then assigned based on its rank in these simulations (independently for the two subpopulations; one sided test). Multiple testing was accounted for using Bonferroni correction. Significant outlier windows (adjusted p-value<0.05) identified for each subpopulation were collapsed into regions using a two state hidden markov-model as implemented in the Rhmm package. The HMM approach has the advantage of joining windows of high coverage separated by a low coverage window. Only regions significant in both subpopulations were considered for further analysis. Windows overlapping the pericentromeric regions were removed from the analysis.

Linkage disequilibrium

LD was calculated in 30 kb windows in C. grandiflora and C. rubella using PLINK (v.1.9). The decay of LD is the mean value at each position up to 30 kb from a focal SNP.

Gene ontology (GO) enrichment

Because the C. rubella annotation is sparse, we used annotations from nucleotide blast best hit matches (e < 1e-10) to CDS sequences from its close relative, A. thaliana, for our GO analysis. Enrichment tests were performed with the SNP2GO R library (Szkiba et al., 2014) using tsCgCrSNPs as the test set and all SNP sites called in either C. rubella or C. grandiflora as the background set. We chose this approach because it is less sensitive to gene length (which should similarly affect tsSNP and non-tsSNP distributions across genes). A corresponding analysis was performed in the three-way comparisons using a background set of all SNP sites called in all three species. Significant enrichments were considered at a q-value threshold of q < 0.01 after false discovery correction. A gene was considered as belonging to the NLR family in C. rubella if its best blast hit in A. thaliana was annotated as such (Supplementary file 6).

Identification of high quality three-way tsSNPs

To generate a list of high quality ts3-waySNPs, we applied a series of empirical filters. First, all ts3-waySNPs were required to have an r2 >0.2 with another ts3-wayhqSNP in the same phase in all three species. We excluded SNPs overlapping pericentromeric or annotated repeat sequences (Slotte et al., 2013). We also required that the coverage of SNPs was no more than two standard deviations above the mean coverage of all SNPs for that species, to have an average concordance greater than 0.98, and to be identified in more than one individual. These criteria were selected to increase our confidence in identified tsSNPs; it is likely that our inferences are conservative.

To validate our trans-specific SNPs we aligned the C. orientalis samples against the draft C. orientalis assembly using the bwa (v.0.7.12) mem command with default parameters. The output bam format file was sorted using samtools (v.1.6) and multisample variant calls were made with freebayes (v.1.1.0) using the parameter settings -z. 1–0 w. The resulting vcf file was filtered using vcftools (v.0.1.13) using the settings --remove-indels --minQ 50 --max-missing 0.8 --max-alleles two and further filtered to remove sites that were called as heterozygous in more than 5% of the samples. The sites overlapping with the original call set were extracted from this vcf and used for validation.

Coordinate transforms between the two genomes were necessary to validate tsSNPs. The draft assembly of C. orientalis and the C. rubella reference genome were aligned using the LAST (v.923) aligner. The C. rubella reference database was built with the lastdb command with the parameter settings -uMAM8 -cR11, and then the two genomes were aligned with the lastal command with the settings -m50 -E0.05. Equivalent sites were considered if they were present in alignments at least 500 bp long and contained only one C. orientalis and one C. rubella sequence.

Local de novo assembly and analysis of MLO2

To reconstruct alleles from the MLO2 locus, we used an iterative assembly approach. Reads were first mapped to the entire reference genome using bwa (v.0.7.8) (Li and Durbin, 2009) using the bwa-mem alignment algorithm for each sample. Reads that mapped to the MLO2 locus were then extracted and assembled de novo using SPAdes (v.3.5.0) (Nurk et al., 2013). Assemblies were filtered to be longer than 2,000 bp with a coverage greater than 5, and then used to create an index for a second round of read mapping. Reads that mapped to the assembly without mismatches were collected together with their mates (regardless of the mate’s mapping quality), and were again de novo assembled. This process was iterated six times until scaffolds covering both coding regions were achieved. Format conversions and file handling made use of the software samtools (v.0.1.19) (Li et al., 2009) and bamutil (v.1.0.13).

Assemblies were filtered for appropriate length, and aligned using MAFFT (Katoh and Standley, 2013). Alignments were visualised using AliView (Larsson, 2014), and manually edited where appropriate. The protein encoded by MLO2b annotated in the C. rubella reference was truncated relative to A. thaliana MLO2. We aligned the genomic and coding regions from both species and found that the premature stop in MLO2b is likely due to a mis-annotated splice junction. The A. thaliana junction is conserved in C. rubella and alternative annotations on phytozome identify the A. thaliana-like splice variant. We therefore used the full-length version derived from manual alignments for our analysis. The phylogeny of Capsella MLO2 CDS sequences was produced using the optim.pml command from the R package phangorn using Jukes-Cantor distances. 1000 bootstrap iterations were run to estimate support for nodes in the tree. To determine where amino acid substitutions had occurred, we aligned the proteins encoded by each allele against the barley mlo protein and annotated domains (UniProtKB P3766).

Draft assembly of the C. orientalis genome

The draft genome from the C. orientalis accession 2007–03 (Figure 1—source data 1) was assembled from long reads generated by PacBio single-molecule real-time sequencing. Long reads were assembled with Falcon (Chin et al., 2016) (version 0.5.4, max_diff = 150, max_cov = 150, min_cov = 2). The resulting primary contig set was iteratively polished with Quiver again using long reads (Chin et al., 2013) (version 2.0.0) and with Pilon (Walker et al., 2014) (version 1.16) using short reads from a single Illumina TruSeq DNA PCR-free library. The draft genome of C. orientalis comprises 135 Mb distributed over 423 gap-free contigs and covers 60% of the C. rubella reference with non-ambiguous 1-to-1 whole genome alignments. Its completeness is comparable to that of the C. rubella reference.

References

  1. 1
  2. 2
    “Mating System Shifts and Transposable Element Evolution in the Plant Genus Capsella.”
    1. JÅ Agren
    2. W Wang
    3. D Koenig
    4. B Neuffer
    5. D Weigel
    6. SI Wright
    (2014)
    BMC Genomics, 15, 10.1186/1471-2164-15-602, 25030755.
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
    Second-generation PLINK: rising to the challenge of larger and richer datasets
    1. CC Chang
    2. CC Chow
    3. LC Tellier
    4. S Vattikuti
    5. SM Purcell
    6. JJ Lee
    (2015)
    GigaScience, 4, 10.1186/s13742-015-0047-8, 25722852.
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
    Natural selection for polymorphism in the disease resistance gene RPS2 of Arabidopsis Thaliana
    1. R Mauricio
    2. EA Stahl
    3. T Korves
    4. D Tian
    5. M Kreitman
    6. J Bergelson
    (2003)
    Genetics 163:735–746.
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
    Research in Computational Molecular Biology
    1. S Nurk
    2. A Bankevich
    3. D Antipov
    4. A Gurevich
    5. A Korobeynikov
    6. A Lapidus
    7. A Prjibelsky
    (2013)
    158–170, Assembling Genomes and Mini-Metagenomes from Highly Chimeric Reads, Research in Computational Molecular Biology, Berlin, Springer Verlag.
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
    Gene and allelic genealogies at a gametophytic self-incompatibility locus
    1. X Vekemans
    2. M Slatkin
    (1994)
    Genetics 137:1157–1165.
  92. 92
  93. 93
  94. 94
    In Advances in Botanical Research
    1. LL Walling
    (2009)
    551–612, Adaptive Defense Responses to Pathogens and Insects, In Advances in Botanical Research, Academic Press.
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
    Enhanced arabidopsis pattern-triggered immunity by overexpression of cysteine-rich receptor-like kinases
    1. YH Yeh
    2. YH Chang
    3. PY Huang
    4. JB Huang
    5. L Zimmerli
    (2015)
    Frontiers in Plant Science, 6, 10.3389/fpls.2015.00322, 26029224.
  106. 106
  107. 107
  108. 108

Decision letter

  1. Molly Przeworski
    Reviewing Editor; Columbia University, United States
  2. Ian T Baldwin
    Senior Editor; Max Planck Institute for Chemical Ecology, Germany
  3. Ziyue Gao
    Reviewer; Stanford University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Long-term balancing selection drives evolution of immunity in Capsella" for consideration by eLife. Your article has been reviewed by Ian Baldwin as the Senior Editor, a Reviewing Editor and three reviewers. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. As you will see from the detailed comments provided below, the reviewers found the findings quite interesting, but raised a number of important concerns that will require both additional analyses and extensive rewriting of the manuscript.

Since additional analyses are needed, and the authors may want to collect more data to respond to the concern about possible artifacts in C. orientalis, it may not be feasible for the revisions to be completed within two months. We leave it to the discretion of the authors whether they think this timeline is feasible, or would prefer to work on revisions at more length and resubmit the manuscript to be considered anew.

In terms of the analysis, the three main points are that:

1) The data for C. orientalis seems a bit troubling, and should probably be further inspected for possible artifacts.

2) The hypothesis that balancing selection is currently acting in C. rubella needs to be better supported. In particular, the authors need to exclude the possibility that the observed signatures persisted since the ancestral species. It will also be important for the authors to more clearly explain whether they believe that the high variability is driven by introgression (adaptive or not), ongoing balancing selection (and of what kind…) or both.

3) The analysis would be more interpretable if more of the highlighted regions were investigated individually, rather than always treating all tsSNPs jointly.

In terms of the writing, all reviewers agreed that the paper is unrelated to the origin of sex or mating system evolution, and that these topics are not always discussed accurately. The discussions of these points should be removed. All reviewers also felt that a number of claims were too strong, and relevant literature is missing.

We hope you will find these comments useful in revising your manuscript, should you choose to do so.

Reviewer #1:

The idea of looking for balanced polymorphisms in selfers, in which the footprints are frozen by the lower effective recombination rate, is really nice, and I was convinced by the findings that increased diversity and other signs of balancing selection were enriched for host-pathogen genes.

Less clear to me is when/where the balancing selection is occurring. There are at least three possibilities:

1) The hypothesis the authors privilege: long-term balancing selection is ongoing in the selfers as well as the outcrossing species.

2) Balancing selection occurred only in the ancestor of the two selfers, increasing the probability that more than one lineage was retained in the selfer. In a parenthetical comment, they dismiss this possibility for orientalis based on a calculation from Wiuf et al., 2004. They need to spell out their assumptions about Ne etc… in doing so. More importantly, they also need to calculate a similar quantity for rubella. Otherwise, a number of statements in the manuscript seemed overstated (e.g., the last sentence of the Abstract).

3) Balancing selection is operating in the outcrosser and variation was introduced into the selfer by introgression. They entertain this possibility for rubella in the Discussion section. They label it as "adaptive introgression" but it seems to me that it could just as well be neutral, if introgression is more likely to introduce a different haplotype due to balancing selection in the outcrosser.

In this regard, I note that it is not correct that (1) and (3) cannot be distinguished, as they predict different genealogies. Given that the selfers are mostly homozygous at these positions, can they build trees for e.g., 100 regions, and see whether they support (1) or (3)? If not, they should at least explore diversity and divergence levels around the tsSNPs.

More generally, I was confused by what specific models the authors had in mind at a number of points in the manuscript. Because they focus on selfers, their study design seems aimed at picking up balancing selection due to fluctuating selection pressures not heterozygote advantage. Moreover, they repeatedly state that they are looking for a form of selection that favors rare types. Why then expect a positive Tajima's D (subsection “A high density of tsSNPs around immunity-related loci”)? I think a discussion of the mode of balancing selection they expect to detect is missing.

The frequency spectrum for orientalis in Figure 5A seems quite odd. And the observation about the low interchromosomal LD among ts3_SNPs is likewise disconcerting. The explanation the authors provide seems highly unlikely, unless there is rampant epistasis among ts SNPs. If there were, shouldn't the authors also expect to see large footprints of balancing selection in the outcrossers (because haplotypes, not just SNPs, would be retained)?

The Red Queen hypothesis is used to refer to a dizzying array of hypotheses, from the evolution of sex (Introduction) to selection for new combinations of alleles (subsection “Balancing selection over ancient time scales”) to selection for rare alleles (Introduction) to balancing selection (Discussion section), in ways that are often inaccurate. In particular, the Red Queen hypothesis is neither necessary nor sufficient for the maintenance of genetic variation. As Aurelien Tellier and others have shown, it is actually very unlikely that genetic variation would be maintained over long time periods due to simple models of host-pathogen coevolution, especially in small populations (see Tellier et al., 2014 Evolution, for example). This literature should be discussed, especially since the authors are finding so many putative examples.

Reviewer #2:

I was very disappointed by this manuscript, as the results are presented so as to appear more exciting than they really are, and parts of the study appear to reflect a lack of understanding. In the first category, I would mention text in the Introduction about the evolution of sex, with which the study does not even remotely deal. In the second category of lack of understanding and/or knowledge of the field and of results already established, I highlight the discussion of outcrossing, which does not mention inbreeding depression, which is surely a factor that ought to be mentioned – I think that it is incorrect to predict that in general outcrossing will be most favored by natural selection when the selective landscape changes frequently, and to imply that pathogen pressure is probably important in maintaining outcrossing. Moreover, the conclusion that balancing selection still operates in the selfers to maintain variation seems unjustified, as other possibilities are not excluded, and the same applies to the conclusion stated at the end of the Abstract that "population longevity in the face of pathogen attack depends on the persistence of ancient genetic variation".

The results may support the much more limited, but still interesting, conclusion that genes involved in defences against pathogens are enriched for cases of balancing selection, in line with the many previously published findings that this class of gene functions shows this kind of behaviour. It is far from clear that there is an "advantage of an expanded footprint of balancing selection after a genetic bottleneck", as the Abstract claims. It is correct that low diversity in a selfer might allow rare high diversity loci to be detected, but selfers that evolved recently from an outcrossing population have the disadvantage that one cannot disentangle selection prior to the selfing state being established from selection more recently, to say nothing of the effect of a bottleneck in increasing the variance of LD and other quantities that might be estimated.

Details of these major concerns are too long for your word limit, so I have emailed them to the editorial office in a separate file, to be passed on to the authors.

My recommendation would be to revise the manuscript extensively to show clearly the evidence for those conclusions that are well supported, and relate them to what is already known.

Reviewer #3:

In this manuscript, the authors aim to detect evidence and identify targets of balancing selection by genome-wide search of polymorphisms shared by three Caspella species, two of which experienced genetic bottleneck and transition to self-fertilization independently. The authors resequenced 50 C. rubella, 13 C. grandiflora and 16 C. orientalis accessions and identified millions of single-nucleotide polymorphisms (SNPs). With this dataset, they first characterized the demographic history of these species and, unlike a previous study, detected evidence of gene flow between (eastern) C. rubella and C. grandiflora. Then, they detected evidence of balancing selection enriched in immunity-related genes by focusing on shared polymorphisms between species. 21 genomic regions are highlighted as strong candidates of targets under balancing selection, many of which contain immunity genes with known functions in A. thaliana.

This study does a reasonably good job in excluding factors other than ancestral variation that could lead to shared polymorphisms, so the conclusions are convincing to me. I also enjoy reading this manuscript, as it generates several new insights: (1) it substantially increases the number of trans-species polymorphisms that are potentially maintained by balancing selection in plants, providing candidates for follow-up functional studies; (2) it strongly suggests that frequency-dependent selection is a common mechanism underlying ancient balancing selection (as compared to heterozygote advantage); (3) the case of MLO2 underscores the technical difficulty of detecting divergent haplogroups shared between species by short-read sequencing methods.

That said, I have several questions and comments that should be considered before publication:

1) The shared IBD segments and genome-wide D-statistic suggested past and ongoing gene flow between sympatric C. rubella and C. grandiflora. Are there shared IBD segments and significant D-statistic values for the regions with strong evidence of balancing selection? The presence of introgression does not change the conclusion that most tsSNPs are retained ancestral variation, as long as western C. rubella and C. grandiflora still share the polymorphisms, but I am wondering if some of the sharing by eastern C. rubella and C. grandiflora could be attributed to recent gene flow.

2) To exclude the possibility of spurious tsSNPs due to cryptic paralogs, the authors compared the coverage and read concordance. Although the overall distribution of the two metrics looks similar for tsSNPs and ssSNPs, it is not clear whether there could be a few regions with excess coverage or reduced concordance. It will be helpful to plot these two metrics along the chromosome coordinate for the regions with strong evidence for balancing selection.

Another way to rule out the concern of the cryptic paralogs is to compare the proportion of heterozygotes for tsSNPs and ssSNPs (controlling for allele frequency). If some of the tsSNPs are due to cryptic paralogs, the proportion of heterozygotes is expected to be higher than that for ssSNPs.

3) The presence of tsSNPs shared by all three species of Capsella provides strong evidence for balancing selection, but I am slightly concerned by the quality of the C. orientalis data because of a couple of puzzling patterns: (A) why is the allele frequency spectrum of ssSNPs in C. orientalis multi-modal? (B) Why is the average divergence between the two main Capsella lineages higher near sites adjacent to the three-way tsSNPs?

4) The interchromosomal LD for three-way tsSNPs is significantly lower than that of random polymorphisms in C. rubella and C. grandiflora. The authors speculated that this might be due to selection promoting new combinations of tsSNPs. However, I found this explanation implausible unless strong and prevalent epistasis is involved. Could the authors run simulations to show that the "reshuffling" hypothesis is able to explain the reduced LD and estimate how much epistasis is required to generate the observed pattern? In addition, why is the interchromosomal LD for three-way tsSNPs significantly higher for that of random polymorphisms in C. orientalis? Could this be due to technical issues?

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for resubmitting your work entitled "Long-term balancing selection drives evolution of immunity genes in Capsella" for further consideration at eLife. Your revised article has been favorably evaluated by Ian Baldwin (Senior Editor), a Reviewing Editor, and two reviewers.

The manuscript has been improved but there are some remaining issues with the presentation and discussion that need to be addressed before acceptance, as outlined below. Notably reviewer 2 made a number of useful suggestions of clarifications that would be helpful in understanding the analyses and results.

Reviewer #1:

This manuscript is significantly improved. However, the writing is still rough in several places, and there are some other things that should be improved.

The Abstract suggests that the study aim is "to understand the extent to which natural selection can drive the retention of genetic diversity", meaning sequence diversity. In fact, however, the conclusions relate mainly to whether genes whose sequences suggest that they are involved in resistance to pathogens show footprints of balancing selection, as has been suggested many times, and for which there is some empirical evidence, including from analyses of diversity. This study is therefore an advance mainly in that such genes were not chosen for study and shown to have interesting diversity patterns, but instead they emerged from a genomic analysis for balancing selection, and specifically through analyzing shared polymorphisms.

The revised manuscript concludes that genes showing signals of potential balancing selection show an enrichment for genes involved in interactions with pathogens. I have two comments about this. First, are some of these genes located in clusters in physically small genome regions (as is commonly the case for plant NRR genes)? If so, could this over-estimate the fraction under balancing selection, because one gene is actually under balancing selection, and the others have correlated diversity patterns, due to close linkage? This would seem to be a particular concern for any clusters that are located in regions of low recombination, or low effective recombination rates (which could encompass considerable parts of the genomes of the selfing species studied).

It is far from clear that "self-fertilizing species provide increased sensitivity to detect balancing selection" (as claimed in the Introduction) and indeed the evidence is not compelling, though it is overall consistent with its having been detected in the selfing Capsella species studied. Moreover, only certain kinds of balancing selection could be detected in a selfer, notably situations involving frequency dependence (not heterozygote advantage). This claim should therefore not be made.

Second, the revised manuscript makes much clearer than before that the long-term balancing selection that may have been detected could potentially occur solely within the outcrossing species from which the selfing species evolved (in one case, apparently very recently). The signal within the selfing species might therefore not reflect ongoing selection in those species, but merely introgression of sequences bearing the footprints of the selection. The revised text marshalls the evidence on this question quite clearly. However, it is not mentioned that introgression by pollen flow into the selfing species is more likely than the reverse, following the long-established general rule for attempts to inter-cross two species when only one of them is self-incompatible: pollen of the self-compatible species often fails to work on pistils of the self-incompatible one, whereas the reciprocal pollination succeeds (as might be expected if selfing species' pistils have lost the ability to reject incompatible pollen). However, some recently evolved selfing species also reject pollen from their close self-incompatible relatives. A comment on the case in these Capsella species would be worth adding. In my opinion, the abstract should mention that tests for adaptive introgression suggest that this is not causing the signal(s) on which the conclusion of balancing selection in the selfers is based.

The first paragraph is poorly written and somewhat misleading, as it fails to mention that loss of alleles in situation with strong balancing selection occurs only in special cases. I think that the authors meaning is as follows. The term "balancing selection" describes several different adaptive forces that maintain genetic variation for longer than expected under genetic drift of neutral variants. It includes situations with heterozygous advantage, negative frequency-dependent selection (rare allele advantage), and environmental heterogeneity affecting fitnesses in space and time. In these situations, selection prevents loss of alleles from populations at the functional genes or sites. As this also results in increased diversity at closely linked neutral or weakly selected sites (Charlesworth, 2006), it should be possible to detect balancing selection from the resulting footprints of increased coalescence times at closely linked sites, and many candidate genes have been identified using diverse methodology (Fijarczyk and Babik, 2015). However, theoretical models predict that even strongly balanced functional alleles can be stochastically lost over long time periods, suggesting that balanced polymorphism could often be short lived [a reference is needed here], particularly when the functional alleles have low equilibrium frequencies, or fluctuate in frequency, with periods of low frequency, as may occur in the case of plant pathogen resistance genes where temporarily rare alleles may have advantages (Tellier et al., 2014). There is no need to repeat this reference in the Introduction. It would be better to make clear all along that this is a special case.

Reviewer #2:

The manuscript has been substantially improved after extensive revisions. Removal of the discussion of evolution of sex makes the main story much clearer, and I appreciate the explicit discussion of different possible mechanisms underlying the shared polymorphisms (technical artifacts, recurrent mutations, maintenance of ancestral variation, introgression, etc). The evidence of trans-species polymorphisms in MLO2b is very compelling, and I am convinced that the high-quality three-way tsSNPs show evidence for long-term balancing selection and no evidence of recent gene flow.

However, I have a few questions about the analysis and hope the authors could add some details in the manuscript to improve clarity. I also feel the current abstract is too vague and lacks specifics of some key results. For instance, it is unclear which collection of variants the authors are referring to when saying "ancestral variation preferentially persists at immunity related loci" (ts_2waySNPs, Bal regions, ts_3waySNPs, or ts_3wayhqSNPs?). In addition, the gene name MLO2b should be specified in the second to the last sentence, as this is the only case the authors demonstrate trans-species sharing of haplotypes clearly. Please see below for my specific questions and comments.

1) The authors use the D statistic and IBD sharing to support their conclusion of ongoing gene flow between C. grandiflora and eastern C rubella, but some details of these analyses are unclear.

D-statistic calculation:

1) What exactly is the configuration for the ABBA-BABA test. For example, what species is used as the outgroup?

2) Why is the correlation between D statistic and distance evidence for ongoing gene flow (as claimed in subsection “Capsella rubella demography”)? Could this reflect past gene flow?

IBD detection:

1) What does the minimum segment length of 1kb translate into genetic distance? I am not familiar with recombination rates in plants, but such segments seem very short by standards in humans and thus very old. (The length threshold is only specified in legend of Figure 2, but I think it should also be included in the Materials and methods section.)

2) What is the distribution of detected IBD segment lengths?

3) Can the authors provide some estimates for the age of the IBD segments based on their lengths to support their conclusion of "very recent co-ancestry"?

2) Figure 4 is one of the key results showing enrichment in immunity-related genes, but it is unclear what each panel shows exactly, as some details or labels are missing.

1) Panel A: What does each dot mean? I guess each point represents a gene, but this is not clear from the text, figure or legend. What is the y-axis in each sub-panel? How is this number calculated?

2) Panels B: what do the thick line and the shaded area mean, respectively? Does the legend of panel C apply to panel B?

3) The authors use simulation data to argue against the possibility that the diversity patterns near Bal loci result from historical balancing selection in the ancestor population. Did the simulations include effects of balancing selection? If so, how is balancing selection simulated in practice? If balancing selection is not simulated, are the results based on neutral simulations sufficient to rule out the possibility of historical balancing selection?

4) MLO2b region

1) Why is the haplotype structure of MLO2b not discussed on C grandiflora? Since the tsSNPs are shared by C grandiflora, it seems the selection pressure is preserved in this species. Would the higher recombination rate in C grandiflora help narrow down the causal variant(s)?

2) How long are the shared haplotypes between C rubella and C orientalis? Is maintenance of such long haplotypes expected, given plausible split time and recombination rate, if there is no epistasis? It will be helpful to estimate this using the formula in Wiuf et al., 2004.

3) Based on Figure 3B, haplogroup C differs by only 1 SNP from the second haplotype in group B. Why is C classified as the separate haplogroup instead of a subtype of B?

4) In subsection “Insights into balancing selection from de novoassembly of MLO2”, the authors concluded that the shared haplotypes did not result from recent gene flow, because of similar within-haplotype divergence levels of A and B haplogroups. However, this evidence is insufficient: the authors need to demonstrate that the within-haplogroup divergence is at least as high as the genome-wide average (which seems to be the case by comparing Figure 6—figure supplement 1 to Figure 5D).

5) The reconstructed trees in Figure 6C are very small, and haplotypes need to be labeled in order to show the pattern of clustering by haplogroup.

https://doi.org/10.7554/eLife.43606.047

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

Since additional analyses are needed, and the authors may want to collect more data to respond to the concern about possible artifacts in C. orientalis, it may not be feasible for the revisions to be completed within two months. We leave it to the discretion of the authors whether they think this timeline is feasible, or would prefer to work on revisions at more length and resubmit the manuscript to be considered anew.

In terms of the analysis, the three main points are that:

1) The data for C. orientalis seems a bit troubling, and should probably be further inspected for possible artifacts.

The reviewers main concern regarding the C. orientalis data was that the allele frequency spectrum was unusual which may indicate elevated error rates. The unusual appearance of the spectrum is the result of having 16 samples, and binning allele frequencies into 10 bins. Because we had differing numbers of individuals from each species, we chose to compare everything using the same binning, which happens to mathematically introduce an uneven spectrum as observed in the original Figure 5. We replotted these spectra using different bin sizes for each species and show that the uneven pattern is largely removed.

To further alleviate the reviewers’ concerns, we generated a draft genome assembly for one of our C. orientalis samples using an alternative sequencing technology and recalled variants at tsSNP sites in our C. orientalis samples. In the updated manuscript, only SNPs identified in both approaches were considered “high quality”.

2) The hypothesis that balancing selection is currently acting in C. rubella needs to be better supported. In particular, the authors need to exclude the possibility that the observed signatures persisted since the ancestral species.

The reviewers questioned our assertion that balancing selection has retained genetic diversity in selfing lineages and suggested that the historical molecular signature of balancing selection in the outcrossing C. grandiflora, might result in the appearance of balancing selection in C. rubella. Specifically, if immunity related genes are under balancing selection in the outcrosser, then a random sample of alleles during the C. rubella founding event might result in higher genetic variation in subsequent generations.

We agree with the reviewers that these two possibilities are difficult to tease apart in the recent C. rubella / C. grandiflora speciation event. We based our original conclusion that balancing selection is operating in C. rubella based on the following arguments:

1) The allele frequencies in the proposed regions are far less diverged between C. grandiflora and C. rubella then expected by random chance under an explicit demographic model which includes gene flow.

2) These regions show many classic indications of balancing selection in C. rubella including:

a) Increased allele frequency

b) Increased genetic diversity

c) Low Fst between subpopulations

d) Elevated Tajima’s D

3) In C. grandiflora, genetic diversity is high throughout the genome, and balanced regions show only subtle deviation from the global pattern very near to the candidate targets. Therefore, a random sample of alleles anywhere in the genome would generate peaks of outstandingly high genetic diversity in C. rubella, yet we see that retention of variation is specifically elevated at immunity loci.

4) Though we identified tsSNPs throughout the genome, those tsSNPs found in our candidate regions were much more likely to be retained after population splits regardless of allele frequency.

5) In the much more diverged and much less diverse selfing lineage, C. orientalis, we observe the same pattern at many of the same loci. Though this is not direct evidence for balancing selection in C. rubella specifically, it is strong evidence that these loci are the targets of stable, long-term balancing selection in a selfing lineage. It is possible that, despite having been stable over millions of years of evolution previously, balancing selection ended within the last ~150,000 years in C. rubella specifically. We think that the more plausible explanation of the observed patterns is that variation is retained by ongoing selection in all three lineages.

In response to this concern we sought to strengthen our argument with two additional analyses. First, we examined whether elevated founder diversity might impact the likelihood of observing the pattern of Fst we used to identify outliers. We approximated the effect of balancing selection in C. grandiflora alone by including only simulations that showed the highest level of diversity C. grandiflora, and then explored whether our outlier C. rubella Fst values were more likely to occur. Even under this scenario of highfounder diversity, the observed values remain extremely unlikely. Second, we examined whether allele frequencies are correlated between C. grandiflora and C. rubella and between subpopulations of C. rubella for tsSNPs inside and outside of balanced regions.

We find that although all these SNPs segregate in all of the populations assayed, SNPs within balanced regions are show a substantial increase in AFS correlation.

It will also be important for the authors to more clearly explain whether they believe that the high variability is driven by introgression (adaptive or not), ongoing balancing selection (and of what kind…) or both.

We have added to discussion to make our thinking more clear on these issues. We do not think that adaptive introgression and maintenance by balancing selection are mutually exclusive hypotheses, as the later is not dependent on the source of an allele. In the case of a young species like C. rubella we think that gene flow has contributed to the modern pattern of variation, and that this variation is maintained for longer than expected by chance in regions near immunity related loci. For the older split, we find no evidence for recent introgression.

We further explored these ideas by looking for IBD regions inside and outside of balanced regions and find a pattern consistent with the above hypothesis.

With respect to the mechanism of balancing selection, we believe that definitive statements about this issue are beyond the scope of the current manuscript, however we have tried to make it clear that we think heterozygote advantage is unlikely.

3) The analysis would be more interpretable if more of the highlighted regions were investigated individually, rather than always treating all tsSNPs jointly.

We have reported statistics individually for these regions.

In terms of the writing, all reviewers agreed that the paper is unrelated to the origin of sex or mating system evolution, and that these topics are not always discussed accurately. The discussions of these points should be removed.

We have removed this emphasis throughout the manuscript.

All reviewers also felt that a number of claims were too strong, and relevant literature is missing.

We have softened claims that were pointed out by the reviewers as too strong and added relevant literature where requested.

We hope you will find these comments useful in revising your manuscript, should you choose to do so.

Reviewer #1:

The idea of looking for balanced polymorphisms in selfers, in which the footprints are frozen by the lower effective recombination rate, is really nice, and I was convinced by the findings that increased diversity and other signs of balancing selection were enriched for host-pathogen genes.

Thank you for the positive feedback.

Less clear to me is when/where the balancing selection is occurring. There are at least three possibilities:

1) The hypothesis the authors privilege: long-term balancing selection is ongoing in the selfers as well as the outcrossing species.

2) Balancing selection occurred only in the ancestor of the two selfers, increasing the probability that more than one lineage was retained in the selfer. In a parenthetical comment, they dismiss this possibility for orientalis based on a calculation from Wiuf et al., 2004. They need to spell out their assumptions about Ne etc… in doing so.

We have added a comment on the Ne assumption for the C. orientalis calculation.

More importantly, they also need to calculate a similar quantity for rubella. Otherwise, a number of statements in the manuscript seemed overstated (e.g., the last sentence of the Abstract).

We are unsure whether the reviewer is suggesting that we calculate this probability for C. rubella-C. grandiflora or C. rubella-C. orientalis. The C. grandiflora-C. orientalis calculation is conservative relative to the later. However, we have modified the Abstract extensively to account for the reviewer’s criticism.

3) Balancing selection is operating in the outcrosser and variation was introduced into the selfer by introgression. They entertain this possibility for rubella in the Discussion section. They label it as "adaptive introgression" but it seems to me that it could just as well be neutral, if introgression is more likely to introduce a different haplotype due to balancing selection in the outcrosser.

We refer the reviewer to our response to the general comments.

In this regard, I note that it is not correct that (1) and (3) cannot be distinguished, as they predict different genealogies. Given that the selfers are mostly homozygous at these positions, can they build trees for e.g., 100 regions, and see whether they support (1) or (3)? If not, they should at least explore diversity and divergence levels around the tsSNPs.

We have removed this statement and have included additional analysis of IBS regions to address the reviewer’s comment, and we present trees as suggested in our analysis of MLO2 where complete data on a single gene could be produced.

More generally, I was confused by what specific models the authors had in mind at a number of points in the manuscript. Because they focus on selfers, their study design seems aimed at picking up balancing selection due to fluctuating selection pressures not heterozygote advantage. Moreover, they repeatedly state that they are looking for a form of selection that favors rare types. Why then expect a positive Tajima's D (subsection “A high density of tsSNPs around immunity-related loci”)? I think a discussion of the mode of balancing selection they expect to detect is missing.

We have heavily modified our introduction and removed some of the references that the reviewer refers to. With respect to negative frequency dependent selection, though this selection favors rare alleles, it is expected to elevate Tajima’s D as observed for the classic S-locus case. We have also made specific wording to with respect to the different types of balancing selection that might be operating in the Discussion section.

The frequency spectrum for orientalis in Figure 5A seems quite odd.

Please see the response to the general comments.

And the observation about the low interchromosomal LD among ts3_SNPs is likewise disconcerting. The explanation the authors provide seems highly unlikely, unless there is rampant epistasis among ts SNPs. If there were, shouldn't the authors also expect to see large footprints of balancing selection in the outcrossers (because haplotypes, not just SNPs, would be retained)?

We agree and have removed these data/interpretation.

The Red Queen hypothesis is used to refer to a dizzying array of hypotheses, from the evolution of sex (Introduction) to selection for new combinations of alleles (subsection “Balancing selection over ancient time scales”) to selection for rare alleles (Introduction) to balancing selection (Discussion section), in ways that are often inaccurate. In particular, the Red Queen hypothesis is neither necessary nor sufficient for the maintenance of genetic variation. As Aurelien Tellier and others have shown, it is actually very unlikely that genetic variation would be maintained over long time periods due to simple models of host-pathogen coevolution, especially in small populations (see Tellier et al., 2014 Evolution, for example). This literature should be discussed, especially since the authors are finding so many putative examples.

We have removed the Red Queen references in our work. We have also added reference to Tellier.

Reviewer #2:

I was very disappointed by this manuscript, as the results are presented so as to appear more exciting than they really are, and parts of the study appear to reflect a lack of understanding.

We did not mean to present our results as more exciting than they are, only as exciting as we found them. We hope that the reviewer is more satisfied with our revised presentation.

In the first category, I would mention text in the Introduction about the evolution of sex, with which the study does not even remotely deal. In the second category of lack of understanding and/or knowledge of the field and of results already established, I highlight the discussion of outcrossing, which does not mention inbreeding depression, which is surely a factor that ought to be mentioned – I think that it is incorrect to predict that in general outcrossing will be most favored by natural selection when the selective landscape changes frequently, and to imply that pathogen pressure is probably important in maintaining outcrossing.

We have removed all of this material from the manuscript.

Moreover, the conclusion that balancing selection still operates in the selfers to maintain variation seems unjustified, as other possibilities are not excluded, and

We have added several additional analyses supporting our conclusions.

the same applies to the conclusion stated at the end of the Abstract that "population longevity in the face of pathogen attack depends on the persistence of ancient genetic variation".

We have revised the Abstract and modified this statement.

The results may support the much more limited, but still interesting, conclusion that genes involved in defences against pathogens are enriched for cases of balancing selection, in line with the many previously published findings that this class of gene functions shows this kind of behaviour. It is far from clear that there is an "advantage of an expanded footprint of balancing selection after a genetic bottleneck", as the Abstract claims. It is correct that low diversity in a selfer might allow rare high diversity loci to be detected, but selfers that evolved recently from an outcrossing population have the disadvantage that one cannot disentangle selection prior to the selfing state being established from selection more recently, to say nothing of the effect of a bottleneck in increasing the variance of LD and other quantities that might be estimated.

Please see our response to the general editorial comments.

Details of these major concerns are too long for your word limit, so I have emailed them to the editorial office in a separate file, to be passed on to the authors.

My recommendation would be to revise the manuscript extensively to show clearly the evidence for those conclusions that are well supported, and relate them to what is already known.

Reviewer #3:

[…] This study does a reasonably good job in excluding factors other than ancestral variation that could lead to shared polymorphisms, so the conclusions are convincing to me. I also enjoy reading this manuscript, as it generates several new insights: (1) it substantially increases the number of trans-species polymorphisms that are potentially maintained by balancing selection in plants, providing candidates for follow-up functional studies; (2) it strongly suggests that frequency-dependent selection is a common mechanism underlying ancient balancing selection (as compared to heterozygote advantage); (3) the case of MLO2 underscores the technical difficulty of detecting divergent haplogroups shared between species by short-read sequencing methods.

We appreciate the reviewers’ positive comments regarding the manuscript.

That said, I have several questions and comments that should be considered before publication:

1) The shared IBD segments and genome-wide D-statistic suggested past and ongoing gene flow between sympatric C. rubella and C. grandiflora. Are there shared IBD segments and significant D-statistic values for the regions with strong evidence of balancing selection? The presence of introgression does not change the conclusion that most tsSNPs are retained ancestral variation, as long as western C. rubella and C. grandiflora still share the polymorphisms, but I am wondering if some of the sharing by eastern C. rubella and C. grandiflora could be attributed to recent gene flow.

We have added analyses of IBD segments along the lines of the reviewers’ suggestions.

2) To exclude the possibility of spurious tsSNPs due to cryptic paralogs, the authors compared the coverage and read concordance. Although the overall distribution of the two metrics looks similar for tsSNPs and ssSNPs, it is not clear whether there could be a few regions with excess coverage or reduced concordance. It will be helpful to plot these two metrics along the chromosome coordinate for the regions with strong evidence for balancing selection. Another way to rule out the concern of the cryptic paralogs is to compare the proportion of heterozygotes for tsSNPs and ssSNPs (controlling for allele frequency). If some of the tsSNPs are due to cryptic paralogs, the proportion of heterozygotes is expected to be higher than that for ssSNPs.

We decided that this suggestion was the simplest way of addressing this concern and have provided it in the manuscript. We modified this approach a bit, we checked to see if the balanced regions that show signal of increased heterozygosity, rather than tsSNPs. We find no elevations of observed relative to expected heterozygosity in these regions.

3) The presence of tsSNPs shared by all three species of Capsella provides strong evidence for balancing selection, but I am slightly concerned by the quality of the C. orientalis data because of a couple of puzzling patterns: (A) why is the allele frequency spectrum of ssSNPs in C. orientalis multi-modal?

Please see the discussion of this issue in the general comments.

B) Why is the average divergence between the two main Capsella lineages higher near sites adjacent to the three-way tsSNPs?

We assume that the reviewer means relative to the genome-wide average. In our manuscript, we hypothesize that these alleles have been segregating in Capsella since before the species split. Thus, on average, we expect divergence to be slightly elevated in tight linkage with the balanced SNP (since the exact same allele was not selected in each lineage, but just related alleles), and this is indeed what we see.

4) The interchromosomal LD for three-way tsSNPs is significantly lower than that of random polymorphisms in C. rubella and C. grandiflora. The authors speculated that this might be due to selection promoting new combinations of tsSNPs. However, I found this explanation implausible unless strong and prevalent epistasis is involved. Could the authors run simulations to show that the "reshuffling" hypothesis is able to explain the reduced LD and estimate how much epistasis is required to generate the observed pattern? In addition, why is the interchromosomal LD for three-way tsSNPs significantly higher for that of random polymorphisms in C. orientalis? Could this be due to technical issues?

We have removed this analysis from the manuscript.

[Editors' note: the author responses to the re-review follow.]

The manuscript has been improved but there are some remaining issues with the presentation and discussion that need to be addressed before acceptance, as outlined below. Notably reviewer 2 made a number of useful suggestions of clarifications that would be helpful in understanding the analyses and results.

Reviewer #1:

[…] The revised manuscript concludes that genes showing signals of potential balancing selection show an enrichment for genes involved in interactions with pathogens. I have two comments about this. First, are some of these genes located in clusters in physically small genome regions (as is commonly the case for plant NRR genes)? If so, could this over-estimate the fraction under balancing selection, because one gene is actually under balancing selection, and the others have correlated diversity patterns, due to close linkage? This would seem to be a particular concern for any clusters that are located in regions of low recombination, or low effective recombination rates (which could encompass considerable parts of the genomes of the selfing species studied).

We agree with the reviewer’s comments, as we noted in the discussion “It is possible, or perhaps even likely, that the signal of balancing selection is amplified by the fact that immunity-related loci occur in clusters (Meyers et al., 2003) and that our strongest signal is the result of simultaneous selection on several genes in these regions in a situation analogous to the MHC in animals (Hedrick, 1998). Thus, biotic factors might not be quite as important as our analyses make them appear.” We hope that the reviewer finds this recognition of the biases of gene clusters acceptable.

It is far from clear that "self-fertilizing species provide increased sensitivity to detect balancing selection" (as claimed in the Introduction) and indeed the evidence is not compelling, though it is overall consistent with its having been detected in the selfing Capsella species studied. Moreover, only certain kinds of balancing selection could be detected in a selfer, notably situations involving frequency dependence (not heterozygote advantage). This claim should therefore not be made.

The beginning of the sentence actually reads “We hypothesized that self-fertilizing…”. We feel that we clearly state a hypothesis, not a conclusion. We hope we can leave it to the reader to decide whether this particular idea has merit, and indeed, we point out many of the caveats that the reviewer mentions throughout the manuscript. We think that our study does provide some evidence for this hypothesis, and we hope to learn if it is widely true from further studies.

Second, the revised manuscript makes much clearer than before that the long-term balancing selection that may have been detected could potentially occur solely within the outcrossing species from which the selfing species evolved (in one case, apparently very recently). The signal within the selfing species might therefore not reflect ongoing selection in those species, but merely introgression of sequences bearing the footprints of the selection. The revised text marshalls the evidence on this question quite clearly. However, it is not mentioned that introgression by pollen flow into the selfing species is more likely than the reverse, following the long-established general rule for attempts to inter-cross two species when only one of them is self-incompatible: pollen of the self-compatible species often fails to work on pistils of the self-incompatible one, whereas the reciprocal pollination succeeds (as might be expected if selfing species' pistils have lost the ability to reject incompatible pollen). However, some recently evolved selfing species also reject pollen from their close self-incompatible relatives. A comment on the case in these Capsella species would be worth adding.

We have added a comment on the direction of fertility and a reference that explores the hybrid compatibility of these species in detail.

In my opinion, the abstract should mention that tests for adaptive introgression suggest that this is not causing the signal(s) on which the conclusion of balancing selection in the selfers is based.

We preferred to leave this out of the abstract because it is entirely possible that adaptive introgression plays an important role during the early stages of divergence. It is true that recent gene flow between C. orientalis and the other species has not occurred (as stated in the text), but we thought that including a broader statement might confuse the reader.

The first paragraph is poorly written and somewhat misleading, as it fails to mention that loss of alleles in situation with strong balancing selection occurs only in special cases. I think that the authors meaning is as follows. The term "balancing selection" describes several different adaptive forces that maintain genetic variation for longer than expected under genetic drift of neutral variants. It includes situations with heterozygous advantage, negative frequency-dependent selection (rare allele advantage), and environmental heterogeneity affecting fitnesses in space and time. In these situations, selection prevents loss of alleles from populations at the functional genes or sites. As this also results in increased diversity at closely linked neutral or weakly selected sites (Charlesworth, 2006), it should be possible to detect balancing selection from the resulting footprints of increased coalescence times at closely linked sites, and many candidate genes have been identified using diverse methodology (Fijarczyk and Babik, 2015). However, theoretical models predict that even strongly balanced functional alleles can be stochastically lost over long time periods, suggesting that balanced polymorphism could often be short lived [a reference is needed here], particularly when the functional alleles have low equilibrium frequencies, or fluctuate in frequency, with periods of low frequency, as may occur in the case of plant pathogen resistance genes where temporarily rare alleles may have advantages (Tellier et al., 2014). There is no need to repeat this reference in the Introduction. It would be better to make clear all along that this is a special case.

We agree that the use of the Tellier reference was unclear, as it does support the possibility of identifying balanced sites, the Fijarczyk review covers the broad consensus that balancing selection over long time scales is rare. The second reference to Tellier has been removed.

Reviewer #2:

The manuscript has been substantially improved after extensive revisions. Removal of the discussion of evolution of sex makes the main story much clearer, and I appreciate the explicit discussion of different possible mechanisms underlying the shared polymorphisms (technical artifacts, recurrent mutations, maintenance of ancestral variation, introgression, etc). The evidence of trans-species polymorphisms in MLO2b is very compelling, and I am convinced that the high-quality three-way tsSNPs show evidence for long-term balancing selection and no evidence of recent gene flow.

However, I have a few questions about the analysis and hope the authors could add some details in the manuscript to improve clarity.

We thank the reviewer for their comments, and we hope that the changes documented below will satisfy their concerns.

I also feel the current abstract is too vague and lacks specifics of some key results. For instance, it is unclear which collection of variants the authors are referring to when saying "ancestral variation preferentially persists at immunity related loci" (ts_2waySNPs, Bal regions, ts_3waySNPs, or ts_3wayhqSNPs?).

All of the SNP classes suggested by the reviewer were enriched in immune loci. We feel that each subset is better defined in the text itself.

In addition, the gene name MLO2b should be specified in the second to the last sentence, as this is the only case the authors demonstrate trans-species sharing of haplotypes clearly.

We have added a sentence specifically in reference to MLO2 to the abstract. We note that ts3wayhqSNPS were found to be in LD with at least one other ts3wasyhqSNP in all three species.

Please see below for my specific questions and comments.

1) The authors use the D statistic and IBD sharing to support their conclusion of ongoing gene flow between C. grandiflora and eastern C rubella, but some details of these analyses are unclear.

D-statistic calculation:

1) What exactly is the configuration for the ABBA-BABA test. For example, what species is used as the outgroup?

This information has been added to the Materials and methods section.

2) Why is the correlation between D statistic and distance evidence for ongoing gene flow (as claimed in subsection “Capsella rubella demography”)? Could this reflect past gene flow?

The word “ongoing” has been removed.

IBD detection:

1) What does the minimum segment length of 1kb translate into genetic distance? I am not familiar with recombination rates in plants, but such segments seem very short by standards in humans and thus very old. (The length threshold is only specified in legend of Figure 2, but I think it should also be included in the Materials and methods section.)

The effective population size of C. grandiflora is several orders of magnitude larger than humans, meaning that effective recombination rate is much higher. IBD segments of the length observed in humans are very unlikely in this context (and shorter segments are supported by more SNPs than would be expected in human genetic data). The length threshold has been added to the Materials and methods section.

2) What is the distribution of detected IBD segment lengths?

This data is now shown in Figure 4—figure supplement 4.

3) Can the authors provide some estimates for the age of the IBD segments based on their lengths to support their conclusion of "very recent co-ancestry"?

We changed “very” to “more”. We think that estimating the age of IBD segments might be possible, however this would not be trivial given the differing recombination rates and population structure of our sample. Since our goal is not to estimate the rate of gene flow accurately, but to show that balancing selection is acting in the selfer regardless of the origin of the allele, we feel that this analysis is beyond the scope of the current work.

2) Figure 4 is one of the key results showing enrichment in immunity-related genes, but it is unclear what each panel shows exactly, as some details or labels are missing.

1) Panel A: What does each dot mean? I guess each point represents a gene, but this is not clear from the text, figure or legend. What is the y-axis in each sub-panel? How is this number calculated?

To clarify the figure, we have added the following text to the figure legend “Each point represents values calculated for an individual gene. For example, in the upper subplot each point is the number of tsSNPs identified in a gene divided by the total number of SNPs identified for a gene”.

2) Panel B: what do the thick line and the shaded area mean, respectively? Does the legend of panel C apply to panel B?

The text “For (B-C) the…” has been added to clarify the legend.

3) The authors use simulation data to argue against the possibility that the diversity patterns near Bal loci result from historical balancing selection in the ancestor population. Did the simulations include effects of balancing selection? If so, how is balancing selection simulated in practice? If balancing selection is not simulated, are the results based on neutral simulations sufficient to rule out the possibility of historical balancing selection?

The principal effect of balancing selection, increased diversity, was approximated in simulations by recalculating the probability of the observed values in the selfer using only simulations that showed outstandingly high diversity in C. grandiflora. The observed values could not be explained under these conditions.

We acknowledge that this is only an approximation of balancing selection. We note that the reviewer’s previous concerns were that increased founding diversity due to balancing selection could potentially increase the probability of sampling multiple haplotypes during speciation of C. rubella. We feel that this specific concern is well addressed by our method.

Still, we recognize that it is difficult to prove beyond all doubts that balancing selection is ongoing in C. rubella. Even if we were to explicitly model balancing selection in these two species, it would be difficult if not impossible to consider all possible relevant scenarios. We try to illustrate this challenge throughout the manuscript. For example: “Although evidence for balancing selection at immunity-related loci in C. rubella is much stronger than in C. grandiflora, it is difficult to completely exclude the effect of founder diversity at these loci on the observed patterns.” This was exactly the reason we analyzed C. orientalis and found considerable additional evidence that the same regions were targets of balancing selection in a much older selfer.

4) MLO2b region

1) Why is the haplotype structure of MLO2b not discussed on C grandiflora? Since the tsSNPs are shared by C grandiflora, it seems the selection pressure is preserved in this species. Would the higher recombination rate in C grandiflora help narrow down the causal variant(s)?

Our data is unphased Illumina short read data. We can assemble high confidence haplotypes in the sellers because they have very low observed heterozygosity. Accurate haplotype assembly in C. grandiflora might be possible with a larger sample size, and it is possible this data might narrow down causal variants. However, it is important to point out that the C. rubella and C. grandiflora haplotypes have had nearly identical amount of time to recombine since the split with C. orientalis (the shift to selfing in C. rubella is much more recent in comparison to the larger divergence).

2) How long are the shared haplotypes between C rubella and C orientalis? Is maintenance of such long haplotypes expected, given plausible split time and recombination rate, if there is no epistasis? It will be helpful to estimate this using the formula in Wiuf et al., 2004.

This is a very interesting question. Intuitively, it does not seem probable that now recombination has occurred. Epistasis is a fantastic hypothesis, and one that would certainly be worth exploring in the future. Another possibility is that local variation in recombination rate, potentially further reduced by divergence, is involved. Using the Wiuf equation with an extremely rough estimate of recombination rate does not seem to distinguish these two possible causal factors.

3) Based on Figure 3B, haplogroup C differs by only 1 SNP from the second haplotype in group B. Why is C classified as the separate haplogroup instead of a subtype of B?

We initially separated haplogroup C because the internal four SNPs were in a different orientation then observed in haplogroup A and B, and this orientation was not found to be transpecific. We admit that this was somewhat arbitrary. As it turns out, examination of the assembled sequences indicates that this was the correct decision, this haplogroup was found only in C. rubella and is quite diverged from A or B. However, only A and B were clearly transpecific. Because we did not find C in both species, we did not include it in the coverage analysis.

To improve the clarity of the relationships between haplogroups, we have included C in the phylogeny in Figure 6—figure supplement 2.

4) In subsection “Insights into balancing selection from de novo assembly of MLO2”, the authors concluded that the shared haplotypes did not result from recent gene flow, because of similar within-haplotype divergence levels of A and B haplogroups. However, this evidence is insufficient: the authors need to demonstrate that the within-haplogroup divergence is at least as high as the genome-wide average (which seems to be the case by comparing Figure 6—figure supplement 1 to Figure 5D).

We have modified the text to reflect this fact.

5) The reconstructed trees in Figure 6C are very small, and haplotypes need to be labeled in order to show the pattern of clustering by haplogroup.

We agree that the small trees in Figure 6 make it difficult to clearly show all of the haplotype relationships. We have provided a full maximum likelihood tree with all the assembled haplotypes as a separate supplemental figure.

https://doi.org/10.7554/eLife.43606.048

Article and author information

Author details

  1. Daniel Koenig

    Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    Present address
    Department of Botany and Plant Sciences, University of California, Riverside, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    dkoenig@ucr.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1037-5346
  2. Jörg Hagmann

    Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    Present address
    Computomics GmbH, Tübingen, Germany
    Contribution
    Data curation, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Rachel Li

    Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    Present address
    Berkeley Brewing Science, Oakland, United States
    Contribution
    Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8112-4237
  4. Felix Bemm

    Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    Present address
    KWS SE, Einbeck, Germany
    Contribution
    Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Tanja Slotte

    Department of Ecology,Environment, and Plant Sciences, Stockholm University, Stockholm, Sweden
    Contribution
    Resources, Supervision, Writing—review and editing
    Competing interests
    No competing interests declared
  6. Barbara Nueffer

    Department of Biology, University of Osnabrück, Osnabrück, Germany
    Contribution
    Resources, Writing—review and editing
    Competing interests
    No competing interests declared
  7. Stephen I Wright

    Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
    Contribution
    Resources, Supervision, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9973-9697
  8. Detlef Weigel

    Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
    Contribution
    Conceptualization, Supervision, Funding acquisition, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    weigel.elife@gmail.com
    Competing interests
    Deputy editor, eLife
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2114-7963

Funding

European Research Council (IMMUNEMESIS)

  • Detlef Weigel

Human Frontier Science Program (LT000783/2010-L)

  • Daniel Koenig

Deutsche Forschungsgemeinschaft (WE 2897/4-2)

  • Detlef Weigel

Max-Planck-Gesellschaft (Open-access funding)

  • Detlef Weigel

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Christa Lanz for expert assistance with Illumina sequencing. We thank Danelle Seymour, Rebecca Schwab, Beth Rowan, Derek Lundberg, Wangsheng Zhu, Efthymia Symeonidi, Gautam Shirsekar, Rui Wu, Patricia Lang, Talia Karasov, Hernán Burbano, Moisés Exposito Alonso, Maricris Zaidem, Rafal Gutaker, Eunyoung Chae, and Diep Tran for reading of the manuscript and insightful comments. Thank you to Dmitry German for his identification of C orientalis from herbarium samples, making this study possible in the first place. This work was supported by a Human Frontiers Science Program Long-Term Fellowship to DK (LT000783/2010 L) and by DFG-SPP1529 ADAPTOMICS (WE 2897/4–2), ERC Advanced Grant IMMUNEMESIS and the Max Planck Society (DW).

Senior Editor

  1. Ian T Baldwin, Max Planck Institute for Chemical Ecology, Germany

Reviewing Editor

  1. Molly Przeworski, Columbia University, United States

Reviewer

  1. Ziyue Gao, Stanford University, United States

Publication history

  1. Received: November 13, 2018
  2. Accepted: February 26, 2019
  3. Accepted Manuscript published: February 26, 2019 (version 1)
  4. Version of Record published: March 20, 2019 (version 2)

Copyright

© 2019, Koenig et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 849
    Page views
  • 158
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Nick Barton et al.
    Insight
    1. Evolutionary Biology
    2. Genetics and Genomics
    Mashaal Sohail et al.
    Research Communication