1. Genetics and Genomics
Download icon

Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences

  1. Fanny Pouyet  Is a corresponding author
  2. Simon Aeschbacher
  3. Alexandre Thiéry
  4. Laurent Excoffier  Is a corresponding author
  1. University of Bern, Switzerland
  2. Swiss Institute of Bioinformatics, Switzerland
  3. University of Zurich, Switzerland
Research Article
  • Cited 1
  • Views 2,699
  • Annotations
Cite this article as: eLife 2018;7:e36317 doi: 10.7554/eLife.36317

Abstract

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.

https://doi.org/10.7554/eLife.36317.001

eLife digest

Human chromosomes are made up of DNA, which contains about 3 billion ‘letters’ that carry the instructions needed to build and maintain an individual. However, only about 10 percent of the human genome is made up of genes that code for proteins, or have a defined role in the body. The DNA sequence is largely the same in all people, but some modifications – or variants – occur about every hundred letters. These produce different versions of the same gene, which give us our unique features, such as the color of our hair or eyes.

The frequencies of some genetic variants can change over time, which makes human populations diverge genetically and physically. This can happen through different mechanisms. Positive selection keeps variants that are beneficial in specific environments, while negative selection removes genetic changes that are detrimental, for example because they cause disease. Transmission bias favors one of the two variants from our two parents. Chance alters the frequencies of neutral variants, which are neither good nor bad for the individual.

It is important to distinguish between these different scenarios, as they inform us about the forces that act on human evolution. For example, neutral variants tell us about the demography and migration patterns between populations. Variants under negative selection reveal which genetic areas are under pressure to stay the same because they are important for the organism to function correctly. Until now, it was unclear how we could best identify the variants affected by different evolutionary pressures, and how much of the genome was under negative selection.

Pouyet, Aeschbacher et al. created a measure of genetic diversity that is only affected by selection or transmission bias. The results showed that negative selection influences as much as 85 percent of our genome, whereas transmission bias affects a majority of the rest of the genome. After removing these two biases, less than 5 percent of the human genome is found to evolve by chance. This suggests that while most of our genetic material is formed of non-functional sequences, the vast majority of it evolves indirectly under some type of selection.

These findings define which parts of our genome evolves neutrally and can therefore be used to correctly reconstruct the past demography and migration events of humans around the world. The next step could be to reassess the history of human populations that was drawn using genomic data.

https://doi.org/10.7554/eLife.36317.002

Introduction

Human genomic diversity has evolved under diverse and complex constraints (Auton et al., 2015), such as past demography, selection, mutations, or genomic rearrangements (Lohmueller et al., 2011; Schiffels and Durbin, 2014; Sudmant et al., 2015; Mallick et al., 2016). However, the influence of these evolutionary forces and their interactions remain to be fully understood. For instance, it is yet unclear which fraction of the genome evolves under positive or purifying selection (McVicker et al., 2009; Rands et al., 2014; Corbett-Detig et al., 2015). Such information is crucial to our understanding of what portion of the genome is evolving neutrally, and necessary to form a clear basis for demographic inference, the detection of selective events, or the inference of the distribution of fitness effects of new mutations.

Genome-wide variation in recombination may strongly affect neutral variants (Spencer et al., 2006; Corbett-Detig et al., 2015), as selection will have more impact on linked polymorphism in regions of low recombination (Charlesworth et al., 1995), whereas biased gene conversion, which can also mimic the effect of selection (Galtier and Duret, 2007; Ratnakumar et al., 2010), will occur mostly in regions of high recombination (Katzman et al., 2011). In humans, various measures of diversity are positively correlated with levels of recombination (Nachman, 2001; Spencer et al., 2006; Cai et al., 2009; Lohmueller et al., 2011). While a direct mutagenic effect of recombination seems unlikely (McVicker et al., 2009; Schaibley et al., 2013) except at CpG sites (Arbeithuber et al., 2015), there is still some debate about whether the correlation between diversity and recombination is driven by recurrent selective sweeps (hitchhiking of neutral and slightly deleterious mutations) or background selection (BGS; i.e. purifying selection against deleterious mutations at linked sites) (McVicker et al., 2009; Stephan, 2010; Hernandez et al., 2011; Lohmueller et al., 2011). The modeling of genomic diversity under selection in humans suggests that it can be explained entirely by BGS (Lohmueller et al., 2011), whereas a combination of both BGS and positive selection seems to best explain genomic diversity in Drosophila (Elyashiv et al., 2016). However, the correlation between diversity and recombination is generally relatively weak in humans for most tested statistics and seems restricted to genomic regions of relatively low-recombination rate (<1 cM/Mb, (Cai et al., 2009; Lohmueller et al., 2011)).

Given the positive relationship between recombination and genetic variability, it has been proposed that the genomic regions most suitable for demographic inferences should be far away from genes and have high-recombination rates (Lohmueller et al., 2011). However, regions of high recombination might be prone to GC-biased gene conversion (gBGC), a process by which GC alleles in recombination tracts are preferentially transmitted in GC/AT heterozygotes (Duret and Galtier, 2009). This process thus increases the frequency of G and C derived alleles (usually denoted as strong or S alleles, Lachance and Tishkoff, 2014) relative to A and T (denoted as weak or W alleles), especially in recombination hotspots (Spencer et al., 2006; Glémin et al., 2015). By modifying allele frequencies in high-recombination regions, gBGC affects the site frequency spectrum (SFS) (Lachance and Tishkoff, 2014; Glémin et al., 2015) such that it becomes right-shifted for W-to-S (WS) mutations and left-shifted for S-to-W (SW) mutations. In addition, gBGC affects various classical statistics used to detect selection, and WS SNPs show larger levels of population differentiation than other SNPs (Lachance and Tishkoff, 2014). Overall, gBGC is believed to directly affect only 1% to 2% of the human genome, near recombination hotspots (Glémin et al., 2015), but due to the transient nature of these hotspots, a larger fraction of the genome could have been affected in the long term.

Here, we use two whole-genome human datasets to determine how and to what extent recombination and selective forces affect genome-wide diversity in humans. We examine the relationship between recombination rate and the average derived allele frequency per individual, as well as the SFS. After determining the parts of the genome that are least affected by BGS and gBGC, we examine the impact of these two processes on the SFS, and how they affect demographic inference based on the SFS.

Results

We first used a representative set of one hundred individuals from the 1000 Genomes (1000G) Project (Auton et al., 2015) from ten populations in five geographic regions to study the pattern of human genomic diversity. Since our analyses compared genomic diversity across individuals for sets of sites devoid of any missing data, we selected in each population those 1000G individuals with the highest coverage. As a measure of genomic diversity, we used the average derived allele frequency per individual (DAFi¯). This statistic was computed over all sites that were found polymorphic across all populations (i.e. where derived alleles are neither fixed nor absent in all individuals). Assuming that there are STot such sites, DAFi¯ is computed for each diploid individual as the total number of derived alleles observed at those sites (ni) divided by 2 STot. We show in the Materials and methods below how this statistic depends on the average time to the most recent common ancestor (tMRCA) of the whole sample, and, if one assumes neutrality that this statistic should be the same on expectation for any individual in the sample across its whole genome, irrespective of the particular demography of its population (Figure 1—figure supplement 1). Differences in the number of derived alleles (ni) among individuals and among genomic regions should therefore only reflect differences in selection, mutation rate and/or generation time (Figure 1—figure supplement 2). The number of derived alleles is indeed broadly comparable across individuals from different geographic regions, even though Southern and Eastern Asians (SAS and EAS) show a slight yet significant deficit in the number of derived alleles than the three other groups (~50,000 out of 17 million, Tukey test, p<0.01) (Figure 1—figure supplement 2), suggesting either a more efficient selection, a lower mutation rate or a longer generation time. This statistic thus appears ideally suited to assess the impact of selection at linked sites that could locally alter their tMRCA.

Figure 1 with 10 supplements see all
Average derived allele frequency per individual (DAFi¯) as a function of recombination rate.

1000G SNPs were ranked by their local recombination rate and divided into 20 bins of equal size. DAFi¯ was computed for each individual as the number of heterozygous sites plus two times the number of derived homozygous sites and averaged per geographic region. (A) DAFi¯ vs. recombination rate on a log10 scale for all 17,129,351 1000G SNPs. (B) Same as panel A for SNPs in transcribed regions (TR), non-transcribed regions (NTR), or non-transcribed regions more than 50 kb away from TR (NTR-50kb). (C) Same as panel A for SNPs differently affected by GC-biased gene conversion (gBGC). Left: WS sites, where the derived allele is favored by gBGC. Center: SW sites, where the ancestral allele is favored by gBGC. Right: WW and SS sites, which are not affected by gBGC. The vertical dashed lines at 1.5 cM/Mb delimit an approximate threshold above which BGS has no effect on WW and SS sites, but where gBGC has a strong and opposite effect on WS and SW sites. Each group (AFR: Africans, EUR: Europeans, EAS: East-Asians, SAS: South Asians, AMR: Admixed Americans) includes individuals from two populations (see Supplementary file 1 - Table S1). Shaded areas delimit the 95% confidence interval of each group, estimated using a block-bootstrap approach (see Materials and methods).

https://doi.org/10.7554/eLife.36317.003

The average derived allele frequency per individual increases with recombination rate

For individuals belonging to five geographic regions, we studied the variability of DAFi¯ across the genome by computing it separately for SNPs that belong to different recombination classes and averaging it across individuals within each region (Figure 1A). Local recombination rates around each SNP were obtained from the 1000G Yoruba recombination map (Frazer et al., 2007) (see Materials and methods), but the use of alternative recombination maps leads to similar patterns (Figure 1—figure supplement 3). We find that the average intra-bin DAFi¯ increases almost log-linearly with the average recombination rate. The exception is for the lowest recombination class, most likely because low-recombination rates are difficult to estimate (Kong et al., 2010; Wegmann et al., 2011). We observe the same log-linear relationship in a set of 20 individuals chosen to represent five continents that were sequenced as part of the Simons Genome Diversity Project (Mallick et al., 2016) at higher coverage (31–60×) than the 1000G individuals (Figure 1—figure supplement 4A). The log-linear relationship between recombination rate and DAFi¯ is conserved among geographic regions (Figure 1A) and it is also observed at the level of single individuals (Figure 1—figure supplement 5A), as expected from our theoretical derivations. Note that this very similar behavior among individuals and populations is not in line with a differential action of positive selection (selective sweeps) in different continents. Therefore, if adaptive events were involved in shaping allele frequencies and creating this relation, they should have occurred before the human lineage split into different continental groups. Since most variation in exonic regions has emerged in the last 10,000 years (Fu et al., 2013), a pure adaptive explanation for this relation seems unlikely.

As expected if purifying selection was removing deleterious variation predominantly in coding regions, we find a stronger effect of BGS in transcribed (TR) than in non-transcribed (NTR) regions, in the sense that DAFi¯ is more reduced in regions of low recombination in TR than in NTR regions (Figure 1—figure supplement 6A). At the same distance from exons (between ~0.001 and~0.1 cM, Figure 1—figure supplement 6B), DAFi¯ is slightly larger for NTR than for TR regions suggesting that BGS is stronger in TR regions. However, DAFi¯ converges to similar values in high-recombination regions, in line with the view that BGS is not acting in these regions. Interestingly, BGS is clearly acting in NTR regions even when we focus on NTR regions more than 50 kb away from any transcribed region (Figure 1B). This result confirms that BGS is acting in NTRs (Asthana et al., 2007; Comeron, 2014; Rands et al., 2014), which could be either due to the presence of functional elements in these regions such as non-coding RNAs, histone marks, enhancers or insulators (Kellis et al., 2014; Bonev and Cavalli, 2016; Van Nostrand et al., 2017), or due to remote effects of exonic deleterious mutations on SNPs in NTR. However, since the influence of exonic regions on DAFi¯ is largely limited above 0.01 cM (Figure 1—figure supplement 5B), we suspect that functionally constrained elements are widespread in NTRs. Conservation scores have also been used to assess a potential effect of selection on DAFi¯. Sites associated to GERP RS scores between –2 and +2 are thought to be evolving neutrally in mammals (Davydov et al., 2010), but we still find a positive log-linear relationship between DAFi¯ and recombination rate for those sites (Figure 1—figure supplement 5C), suggesting that these sites are also influenced by BGS due to selection at linked sites. Note that we also find a positive relationship between DAFi¯ and recombination for more conserved sites that could be directly under negative selection (Figure 1—figure supplement 5D–E) suggesting that their diversity is also affected by BGS at neighbouring sites. These observations suggest that filtering by GERP score may not be sufficient to completely remove the effect of BGS. Since DAFi¯ patterns seem to be driven by BGS, we would expect that they are correlated with statistics that have been specifically developed to measure the extent of BGS in various regions of the genome, such as the B-statistic (McVicker et al., 2009). Indeed, the B-statistic measures the relative reduction in genetic diversity due to BGS and it ranges from 0 in regions highly affected by BGS to 1 in regions unaffected by BGS. As expected, we find that DAFi¯ and the average B-statistic, computed both in the same 20 recombination rate bins defined in Figure 1A, are highly correlated (Figure 1—figure supplement 5F). This result suggests that the average DAFi¯ and average B-statistic are affected by the same process, and thus that DAFi¯ provides information on the strength of background selection among a set of SNPs.

Limits of BGS and evidence for biased gene conversion in regions of high recombination

Since the impact of BGS is mediated by recombination, BGS should have a minimal influence in regions of high recombination (Hudson and Kaplan, 1995; Nordborg et al., 1996). However, it has been shown that GC biased gene conversion (gBGC) is acting in GC/AT heterozygotes in these regions, particularly in the vicinity of recombination hotspots (Spencer et al., 2006), potentially increasing the frequency of G and C derived alleles (usually denoted as strong or S alleles, see Lachance and Tishkoff, 2014) as compared to A and T (denoted as weak or W alleles). We have thus examined the relationship between DAFi¯ and local recombination rate for three combinations of S and W alleles (Figure 1C, Figure 1—figure supplement 6C). If the ancestral allele is W and the derived allele is S (WS sites, Figure 1C, left), we see the same log-linear relation between DAFi¯ and recombination as if we consider all SNPs (Figure 1A). However, at SW sites (Figure 1C, center), DAFi¯ decreases for recombination rates above ~1.5 cM/Mb. This non-monotonic behavior at SW sites is consistent with gBGC favoring the transmission of G and C alleles, and thus decreasing the frequency of derived A and T alleles. Finally, for mutations not affected by gBGC (WW and SS sites), DAFi¯ increases with local recombination rate until it reaches a plateau starting at ~1.5 cM/Mb, which suggests that the effect of BGS is absent or strongly reduced above this recombination threshold (Figure 1C, right). This latter observation implies that the linear increase of DAFi¯ above 1.5 cM/Mb at WS sites (Figure 1C, left) is entirely due to gBGC. Note that the exact same pattern holds for SGDP populations (Figure 1—figure supplement 4C). Moreover, if we analyze all possible types of substitutions separately, gBGC appears to affect the 12 types of SNP according to whether the SNP type belongs to the SW, WS, or WW +SS class (Figure 1—figure supplement 7. These results suggest that SNPs located in regions where recombination is higher than 1.5 cM/Mb are affected by gBGC and not by BGS (Figure 1C, Figure 1—figure supplement 6C). Therefore, WW and SS sites with a recombination rate above 1.5 cM/Mb (representing 2.88% and 2.94% of all SNPs for 1000G and SGDP datasets, respectively) should be optimal for demographic inference, as they appear to evolve mainly neutrally.

BGS and gBGC affect the whole SFS

Since DAFi¯ increases with recombination rate (Figure 1), BGS does not simply amount to lowering the effective population size (Charlesworth, 1994; Charlesworth et al., 1995; Hudson and Kaplan, 1995), as this simple rescaling would not modify allele frequencies. BGS thus affects the SFS (Zeng and Charlesworth, 2011) in complex ways (Nicolaisen and Desai, 2013), and the comparison of sites that are differentially exposed to BGS allows us to better examine this influence. The SFS computed in ten 1000G populations for different recombination classes (Figure 2A, Figure 2—figure supplement 1) shows distortions that are qualitatively similar in all populations, irrespective of differences in demographic history. As compared to the highest recombination class, the second-to-lowest recombination class (which is potentially the one most strongly affected by BGS) not only shows an excess of singletons, but also a deficit of intermediate and high frequency variants (Figure 2A). Similar distortions are also observed in non-transcribed regions, and even (but to a lower extent) in regions at least 50 kb away from transcribed regions (Figure 2—figure supplement 2), in line with our results for DAFi¯.

Figure 2 with 4 supplements see all
BGS and gBGC both have an impact on the SFS.

Each panel corresponds to the normalized unfolded SFS of Yoruba (top, YRI) and Japanese (bottom, JPT) populations. (A) SFS computed for all SNPs in the 2nd, 10th and 20th recombination classes (as defined in Figure 1). For each panel, pairwise comparisons of the SFS are significant with p-values<10–3 (see Materials and methods). The SFS for all ten 1000G populations are shown in Figure 2—figure supplement 1. (B) SFS for three gBGC mutation categories computed for three recombination classes. Note that WW and SS sites (in red) are unaffected by gBGC. All SFS are different from each other (site permutation test, p-values<10–3) except for the Yoruba recombination class two between WS and WWSS where p=0.0135. (C) SFS for sites unaffected by BGS and gBGC (WW + SS sites with RR ≥1.5 cM/Mb). The four SFSs are not significantly different from each other at the 1% significance level, as revealed by a permutation approach (see Materials and methods). Shaded areas delimit 95% confidence intervals using a block-bootstrap strategy (see Materials and methods).

https://doi.org/10.7554/eLife.36317.014

To understand the respective effects of gBGC and BGS on the SFS, we computed the SFS for subsets of mutations differentially affected by gBGC in the Yoruba (YRI) and Japanese (JPT) 1000G populations (Figure 2B). In line with previous work (Lachance and Tishkoff, 2014), we find that the difference between the SFSs of unbiased mutations (WW + SS) and biased mutations (SW and WS) increases with recombination rate. In particular, WS mutations show a deficit of low-frequency variants and an excess of intermediate- and high-frequency variants in regions of high recombination (Figure 2B). As previously recognized (Katzman et al., 2011; Lachance and Tishkoff, 2014), the excess of high-frequency variants at WS sites is not compensated by a corresponding deficit of high-frequency variants at SW sites, implying that gBGC could contribute to the increase of nearly fixed derived alleles that has previously been attributed to mislabelled ancestral states or positive selection (Hernandez et al., 2007).

Impact on demographic inferences

To investigate the impact that the choice of SNPs may have on demographic inference, we estimated demographic parameters for the Yoruba and Japanese populations using three different SFSs (Figure 3A): the synonymous SFS commonly used in exome resequencing studies; the SFS inferred on non-transcribed regions at least 50 kb away from coding regions (NTR-50kb), and on our best-filtered dataset (WW + SS sites in ≥1.5 cM/Mb regions), hereafter called the ‘neutral’ dataset. Note that this neutral SFS was computed over both TR and NTR regions since they show the same SFS (Figure 2C). Interestingly, the SFS observed at synonymous sites differs markedly from that observed at neutral sites, as it comparatively shows a significant deficit of low-frequency variants and a large excess of high-frequency variants (Figure 3A, Figure 3—figure supplement 1A). It appears that this latter excess is due to gBGC, as it disappears when one computes the SFS on synonymous sites not affected by gBGC (Figure 3—figure supplement 1B).

Figure 3 with 3 supplements see all
Recent demography inferred from three different datasets in the Yoruba and Japan 1000G populations.

(A) Observed SFS computed on the three datasets. Neutral: neutral SFS computed on WW +SS sites in regions with recombination rate (RR) ≥1.5 cM/Mb; Synonymous: synonymous SFS; NTR-50kb: SFS computed on sites in non-transcribed regions more than 50 kb away from any transcribed region. Synonymous and NTR-50kb SFS are significantly different from neutral SFS with p-values<10−3. (B) Comparison of inferred demographic events in the last 600,000 years under the model shown in panel C. Left and right panels compare neutral estimations to those of different datasets. All parameter values are given in Supplementary file 3 - Table S3. Solid lines represent maximum-likelihood (ML) estimates of population sizes and bottleneck times. Vertical arrows indicate ML estimates of admixture times; their height is proportional to the admixture estimates shown on the right axis. Boxes delimit 99% confidence intervals obtained by a block-bootstrap approach (see Materials and methods). Boxes surrounded by a solid line are for bottleneck parameters (size and time), open boxes are for population size between bottleneck events, and boxes surrounded by a dashed line are for admixture rates and times. Note that bottlenecks have been modeled with a fixed duration of 100 generations, and the width of the boxes denotes the range covered by 99% of the bootstrap estimations. (C) Sketch of the demographic model used for SFS-based demographic inferences. The model includes three possible bottlenecks of a fixed duration of 100 generations in the direct ancestry of the sampled population, and it allows some sampled genes (a fraction pADM) to have ancestors coming from an unsampled (ghost) population at any time (TADM) since its divergence from the sampled population TDIV generations ago. Note that in this model, NANC and NGHOST have been arbitrarily fixed to 40,000 and 20,000 (haploid sizes), respectively. Note also that the ghost population is used here to allow for some gene flow from some unspecified source, and so to account for the non-isolated nature of human populations.

https://doi.org/10.7554/eLife.36317.019

Using a simple demographic model of a focal population going through three successive bottlenecks and receiving some migrants from surrounding populations (modelled as a ghost population for simplicity) (Figure 3C), we can fit almost perfectly the three SFSs (Figure 3—figure supplement 2). Yet, the inferred parameters differ considerably (Supplementary file 3 - Table S3). For the Yoruba population, the differences in demography are especially important in the old periods (>100 ky, Figure 3B). With the neutral SFS, we nevertheless infer a more recent last bottleneck dated at the end of the Last Glacial Maximum (LGM), a more pronounced and more recent admixture event from surrounding populations. The ancient demography is markedly different with a significantly more ancient second bottleneck and a significantly lower ancient population size inferred from both synonymous and NTR-50 kb SFS. The Japanese demography inferred from the three data sets shows more similarity over the last 600 ky but the demography inferred from the neutral data set suggests a stronger recent bottleneck (pre LGM) and no population expansion as compared to what is inferred from the synonymous SFS neutral data set. Our results thus clearly show that very different demographies can be inferred from neutral and non-neutral SFSs. However, even though BGS and gBGC affect the SFS of populations with distinct histories in a qualitatively similar way, they have different consequences on their reconstructed demography. It thus appears difficult to predict how demographic parameters will be biased when using non-neutral SFS.

Simulations of BGS reproduce observed patterns

To confirm that our observed patterns were compatible with background selection, we ran individual-based forward simulations implementing BGS with SLiM v. 2.3 (Haller and Messer, 2017) in populations having the demography estimated from neutral sites in the Japanese and the Yoruba populations (see Supplementary file 3 - Table S3). Overall, the simulated BGS patterns qualitatively match the observation very well (Figure 4, Figure 4—figure supplement 1, and Figure 4—figure supplement 2). As observed in real data (Figure 1), neutral sites simulated next to selected regions present a strong increase in DAFi¯ with recombination rate (Figure 4A), and the SFS at neutral sites shows a considerable excess of singletons and a deficit of intermediate- and high-frequency variants for low-recombination rates (Figure 4B), respectively. These results show that BGS can reproduce both the observed correlation between DAFi¯ and local recombination rates, and the observed distortions of the SFS in low-recombination regions.

Figure 4 with 2 supplements see all
Genomic data simulated under a model of background selection (BGS).

We used the demographic parameters estimated for the Yoruba (YRI) and Japanese (JPT) populations from neutral sites (WW + SS sites with RR ≥1.5 cM/Mb) as reported in Supplementary file 3 - Table S3. Forward simulations of diploid individuals were performed with SLiM v. 2.3 (Haller and Messer, 2017). We simulated the evolution of a chromosome of 50 Mb made up of 1000 5 kb regions, each consisting of a 1 kb region experiencing purifying selections followed by a 4 kb region with neutral mutations. A, B) Average derived allele frequency per individual (DAFi¯). C, D) Unfolded normalized SFS. Solid and dashed lines correspond to simulations performed with and without BGS, respectively. The transition to effective neutrality occurs between a recombination rate of 1e–8 (blue curve) and 1e–7 (orange curve), a range that includes our proposed threshold of 1.5 cM/Mb.

https://doi.org/10.7554/eLife.36317.023

Discussion

Delineating the neutrally evolving part of the human genome remains a challenge, as variation in the intensity of recombination, mutation, and selection are increasingly recognised as having a strong effect on observable genomic diversity in humans (Corbett-Detig et al., 2015; Elyashiv et al., 2016) and other organisms (e.g. Elyashiv et al., 2016; Ravinet et al., 2017). Here, we have shown that a surprisingly large proportion (up to 95%) of our genome might be affected by background selection (BGS) and/or GC-biased gene conversion (gBGC). These two processes, which both depend on recombination, strongly affect observed measures of genetic diversity along the genome and can lead to biased demographic inference if not properly taken into account (Figure 3).

We have interpreted the striking linear relationship observed between DAFi¯ and recombination rate (Figure 1) as evidence for the pervasive effect of BGS but other processes could in principle lead to a similar relationship. For instance, a mutagenic effect of recombination could lead to an increased diversity in regions of high recombination (Hellmann et al., 2003). The examination of extremely low-frequency mutations, which should be enriched for new mutations, did not reveal any association between recombination rate and the density of new mutations in a large human sample (Schaibley et al., 2013), but a more recent study of de novo mutations suggested the existence of such a correlation (Francioli et al., 2015). Alternatively, a correlation between mutation and recombination rates could occur if these rates were both affected by the same process, such as replication timing (Stamatoyannopoulos et al., 2009; Koren et al., 2012) or transcription rate (Gerton et al., 2000; Park et al., 2012). However, a mere correlation between mutation and recombination rates cannot explain two key aspects of our observations. First, DAFi¯ plateau at high recombination rates once the effect of gBGC is removed (Figure 1C), whereas it should continue increasing if only mutation-recombination correlation was driving the relationship between DAFi¯ and recombination. Second, we find a significant difference in the shape of SFS computed in regions of low and high recombination (Figure 2A), even though mutation rate should have no effect on the shape of the SFS. To better investigate the effect of a possible mutation-recombination correlation, we have used the fact that DAFi¯ is correlated with the B-statistic (Figure 1—figure supplement 5F), for which a simple model (Hudson and Kaplan, 1995) predicts its value as a function of mutation and recombination rates. We find that the B-statistics inferred by McVicker et al. (2009) are significantly better fitted as a function of the recombination rate if we assume a log-log linear relationship between recombination and deleterious mutation rate than if we impose a constant mutation rate across the genome (Figure 1—figure supplement 8). Interestingly, under our log-log linear model, both the observed and predicted B-statistics reach a plateau value of ~0.9 above a recombination rate of ~1.5 cM/Mb. This pattern remains if we only consider subsets of SNPs (e.g. WW + SS sites; Supplementary file 4 - Table S4). Therefore, these results suggest that in addition to BGS and gBGC, some correlation between mutation and recombination rate is required to best explain our observed patterns. Moreover, given the relationship observed between B-statistics and DAFi¯ (Figure 1—figure supplement 5F), the reduced effect of recombination on B above 1.5 cM/Mb should translate into a similar absence of change in DAFi¯ above the same threshold, thus explaining the plateau we see in Figure 1C above 1.5 cM/Mb.

The occurrence of pervasive positive selection, either in the form of soft or hard sweeps (Kern and Hahn, 2018) or of positive selection on polygenic traits (Boyle et al., 2017) in our genome could also lead to a correlation between genetic diversity and recombination, as the effect of selection on linked neutral sites should decrease with recombination. However, positive selection should lead to an increase of both low- and high-frequency variants in the SFS (Fay et al., 2000; Hernandez et al., 2007; Huber et al., 2016; Pavlidis and Alachiotis, 2017), whereas we only observe an increase of low-frequency variants in low-recombination regions where the effect of selection should be strongest (Figure 2A), which is the expected effect of BGS (Figure 4).

The exact proportion of the genome that is influenced by selection is still the source of an intense debate (Bernstein et al., 2012; Rands et al., 2014; Graur, 2017; Kern and Hahn, 2018). Here, we show that up to 80–85% of the human genome is probably affected by background selection (BGS), an effect that is not subtle (Reed et al., 2005) and that is visible from single individuals genomes (Figure 1—figure supplement 5A). Even though our estimate of the fraction of the human genome influenced by BGS matches relatively well with that reported to be biochemically functional by the ENCODE consortium (Bernstein et al., 2012), our results do not imply that 80–85% of the human genome is functional. They rather show that functional sites that are the direct target of purifying selection in both coding and non-coding regions (potentially representing 8–15% of the genome, Rands et al. (2014); Graur, 2017) have an important but indirect influence on most of the genome.

As expected, the effect of BGS is clearly mediated by local recombination rate, but it extends well beyond coding regions in humans (Hernandez et al., 2011) (Figure 1), and it is thus not restricted to species with a large effective size (Corbett-Detig et al., 2015). Our results also show that the influence of gBGC is not restricted to recombination hotspots (Spencer et al., 2006; Glémin et al., 2015), but that it has also a strong footprint in regions with a recombination rate larger than 1.5 cM/Mb, but note that it could affect (to a lesser degree) regions with an even lower recombination rate (see Figure 1—figure supplement 5D. These regions represent about 15.9% and 16.2% of the polymorphic positions for the 1000G and SGDP datasets, respectively. Taken together, BGS and gBGC thus affect more than 95% of the polymorphic sites in our genome, and we have identified only a small fraction of all genomic SNPs (~3%, Supplementary file 2 - Table S2) that can be considered as evolving neutrally.

Interestingly, our neutral SNPs are found in both transcribed and non-transcribed-regions (Figure 2C), and they are enriched close to telomeric regions (Figure 1—figure supplement 10), where BGS is predicted to be weaker (Charlesworth, 2012). Whereas SNPs included in our best-filtered set are evolving mostly neutrally, it does not imply that all other SNPs are influenced by BGS and gBGC. Indeed, our way of identifying selection and biased gene conversion is indirect and operates on arbitrarily defined recombination-rate categories. Thus, DAFi¯ cannot be used to identify the presence of selection at the SNP level or in small genomic regions, or inversely, the presence of neutral SNPs in low recombining segments between recombination hotspots. A more precise mapping of selected genomic segments could use information on the positions of known functional elements (Siepel et al., 2005; Kellis et al., 2014; Rands et al., 2014; Elkon and Agami, 2017) or B-statistics (McVicker et al., 2009; Elyashiv et al., 2016), which could also be used to evidence neutrally evolving regions in both low- and high-recombination regions.

To investigate if and how DAFi¯ depends on potential co-variates within our neutral set of SNPs, we have examined its relationship with several statistics, such as B-statistics or the distance (in map units) to exons, as well as distances to conserved elements and to recombination hotspots. In our neutral set, we find virtually no relationship between DAFi¯ and recombination rate, with average DAFi¯ remaining close to its mean value of 0.146 (Figure 1—figure supplement 9A), but we find a negative relation with the distance to recombination hotspots (Figure 1—figure supplement 9B, a positive relationship with distance to conserved elements and with B-statistics (Figure 1—figure supplement 9C–D), and a small positive correlation with distance to exons (DAFi¯ varies from 0.145 to 0.15, close to the average, Figure 1—figure supplement 9E). It thus seems that recombination hotspots still play a role in decoupling selected from neutral sites, and that sites furthest away from hotspots might still be slightly sensitive to BGS. Purifying selection in phastCons conserved elements (Siepel et al., 2005) is also exerting a strong negative pressure on derived allele frequencies, with average DAFi¯ below 0.14 at sites less than 0.0003 cM away from these elements (which correponds approximatively to a distance of 200 bp if RR = 1.5 cM/Mb). Contrastingly, being further than 0.05 cM away from these conserved elements allows DAFi¯ to rise above 0.16, an average value that is barely reached for sites with associated mean B values close to 1. These results suggest that phastCons elements represent the covariate that has the strongest remaining influence on DAFi¯ within our neutral set.

The SFS of each population is affected by BGS and gBGC (Figure 2, Figure 2—figure supplement 1), and the demography inferred from neutrally evolving SNPs differs markedly from that based on synonymous sites or sites in non-transcribed regions (Figure 3A). However, we show that BGS and gBGC can have different impacts on the inferred demography of the populations. For instance, we found that they lead to an underestimation of the age of a bottleneck and an overestimation of the magnitude of a demographic expansion in the Yoruba population, but we do not observe such strong biases in the Japanese population. It therefore appears difficult to predict the specific biases introduced by these evolutionary forces on demographic inference, except perhaps under simple evolutionary scenarios (Ewing and Jensen, 2016). We therefore suggest that future studies of demographic history should be based on a set of markers that is minimally influenced by these non-neutral forces.

We have also computed the observed SFS for subsets of neutral SNPs with various values of the covariates mentioned above (Figure 1—figure supplement 9). SNPs in the 1st and 4th distance-quartiles to hotspot show similar SFS, with a slight excess of singletons and high-frequency variants for the sites furthest to hotspots (Figure 2—figure supplement 3A. Even though conserved elements had the strongest influence on DAFi¯, the SFSs computed at sites belonging to the 4th distance quartile and to all sites still look very similar, especially in the Japanese population, while sites in the 1st distance quartileshow an excess of singletons and a deficit of high-frequency variants (Figure 2—figure supplement 3B. Exonic and non-exonic SFSs within our neutral set differ mainly by increased frequencies of singletons for exonic SNPs, yet the removal of exonic SNPs has no impact on the SFS (Figure 2—figure supplement 3C). In conclusion, even though exonic SNPs and those located close (≤0.0003 cM) to phastCons elements show different SFS shapes (Figure 2—figure supplement 4), their removal from our neutral set would have no major effect on the shape of the SFS, since they represent only a small fraction (2.2% and 16.9% respectively) of the SNPs in our neutral set.

It is interesting to compare our neutral set of SNPs to another previously defined set of neutral regions of the human genome that has been used as a reference for demographic inferences in a series of studies (e.g. Gronau et al., 2011; McManus et al., 2015; King and Wakeley, 2016; Veeramah et al., 2018). Gronau et al. (2011) have identified a set of 37,574 potentially neutral regions of 1 kb in length with carefully chosen properties (e.g. at least 1 kb away from exons and 100 bp away from phastCons elements, without CpG sites, separated by at least 50 kb, without recombination hotspots). The SFS computed on this alternative neutral set departs significantly from our neutral set, with a significant excess of singletons, a deficit of sites with intermediate allele frequencies, and an excess of nearly fixed variants, a pattern that can be explained by the action of both BGS and gBGC (Figure 3—figure supplement 3A. Since a large B-statistic is also indicative of relaxed BGS, one could be tempted to use regions associated with B values larger than 0.9 as being potentially neutral. However, we see that its SFS also departs from that of our neutral set, with a small deficit of singleton and an excess of other frequency classes in Yoruba, and a slight excess of high-frequency variants in Japan (Figure 3—figure supplement 3A). These differences in SFS shapes also lead to inferred demographies that are markedly different from that inferred from our own neutral set, and this especially for the Yoruba population (Figure 3—figure supplement 3B). We suspect that the main discrepancy with our neutral set is the presence of gBGC in regions with B > 0.9, such that filtering out SW and WS SNPs may result in a good alternative data set on which to perform demographic inferences

Methods of demographic inference based on whole genomes (e.g. Li and Durbin, 2009; Sheehan et al., 2013; Schiffels and Durbin, 2014) should also be sensitive to BGS and gBGC, since they assume that heterozygosity levels within individuals is not driven by local recombination rates nor selection. In this respect, the history of human populations as well as that of other species might be more readily inferred from methods that can conveniently analyze restricted sets of neutrally evolving sites interspersed across the genome. Similarly, other types of inference using a biased neutral SFS as a reference could also be affected, such as inferences of the distributions of fitness effects (DFE) (Keightley and Eyre-Walker, 2010; Kim et al., 2017; Tataru et al., 2017), even though the magnitude of the effect remains to be investigated. In conclusion, we show that BGS and gBGC had a pervasive effect on most of our genome, but that we can conveniently define a set of sites (representing about 3% of all polymorphic sites of both 1000G and SGDP datasets) that should not be too influenced by these two evolutionary forces, even though some sites close to conserved elements could still be affected by BGS. Contrary to previously used sets of SNPs, these sites should lead to essentially unbiased demographic inferences and serve as a reference for future demographic reconstructions in humans. Due to its simplicity, our approach can be readily applied to any species for which a recombination map is available.

Materials and methods

Datasets

We analyzed two distinct whole genome datasets. The first one consisted of 100 individuals from ten 1000G populations (Auton et al., 2015). For each 1000G population, we selected the ten individuals with the highest depth of coverage (coverage >10×), such as to maximize the number of sites having no missing data. We also analyzed 20 individuals from panel C of the Simons Genome Diversity Project (SGDP) (Mallick et al., 2016). These individuals were selected from ten SGDP populations that were geographically close to those analyzed for the 1000G project. Coverage was higher for the SGDP individual and ranged between 31 × and 64× (see Supplementary file 1 - Table S1 for IDs and location of the 1000G and SGDP samples).

Data processing and annotations

We processed the 1000G and SGDP datasets identically. We removed all sites with any missing data and kept only diallelic sites from autosomal chromosomes. The ancestral state of each variant in these genomes was set to the chimpanzee reference genome (panTro4 genome assembly) to avoid any discrepancy between African and non-African populations. Only diallelic SNPs for which one of the variants observed in the 1000G or SGDP datasets corresponded to the chimpanzee ancestral state were kept for later analyses. In addition, we removed the CpG sites that present a peculiar mutation profile and are correlated with recombination rate (Arbeithuber et al., 2015). We used the LD-based Yoruba-specific recombination map from the 1000 Genomes project (Frazer et al., 2007) to obtain the local recombination rate (RR) surrounding each SNP. We also estimated local RR by using three other maps: the LD-based CEU or JPT-specific recombination maps (Frazer et al., 2007) and the sex-averaged pedigree-inferred deCode map (Kong et al., 2010). For each of these maps, we filtered out SNPs without RR information (see Supplementary file 2 - Table S2). We used the Yoruba-specific map to define hotspots as regions with RR >10 cM/Mb. Using Biomart (http://grch37.ensembl.org/biomart/martview/), we assigned SNPs to transcribed (TR) and non-transcribed regions (NTR). For each site, we inferred the distance to the closest exonic region in cM and in bp using the Ensembl exon positions (ftp://ftp.ensembl.org/pub/grch37/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz). The B-statistic (McVicker et al., 2009) (indicative of the strength of local background selection) associated with each SNPs was retrieved from http://www.phrap.org/othersoftware.html and lifted over from the hg18 to the hg19 reference genome using the UCSC liftOver tool. Genomic Evolutionary Rate Profiling (GERP) rejection scores (Davydov et al., 2010) that quantify the level of evolutionary constraint acting on polymorphic sites and conserved elements identified using PhastCons on the primate subset of 46 vertebrates (Siepel et al., 2005) were downloaded from the UCSC platform (Speir et al., 2016). The number of SNPs from the 1000G and SGDP datasets retained for each filter is reported in Supplementary file 2 - Table S2. We finally retrieved 37,574 potentially neutral regions of 1 kb (e.g. Gronau et al., 2011; McManus et al., 2015; King and Wakeley, 2016; Veeramah et al., 2018) from http://compgen.cshl.edu/GPhoCS/data.php to make comparisons between our neutral set of SNPs to another possible sets.

Estimating the impact of GC-biased gene conversion (gBGC)

As gBGC favors strong (abbreviated as S, and representing C and G bases) compared to weak (abbreviated as W, and representing A and T bases) alleles, we defined three groups of SNPs according to the expected consequences of gBGC: (1) SNPs for which the derived state is favoured (WS sites); (2) SNPs for which the ancestral state is favoured (SW sites), and (3) SNPs on which gBGC has no effect (WW or SS sites).

Average derived allele frequency per individual (DAFi¯)

To quantify a local effect of selection and/or gBGC, we used the average derived allele frequency per individual (DAFi¯), where this average is computed over a set of sites found polymorphic in a collection of individuals. We show in the following that this statistic is ideally suited to evidence the potential effect of selection (or mutation), as difference in the demography of the populations from which individuals are sampled should not translate into different values of this statistic among individuals.

Start by considering a single non-recombining locus (k) with mutation rate uk, and for the sake of simplicity, let us consider just two individuals i and j, drawn from two different populations. Note that the same reasoning can be extended to an arbitrary number of individuals drawn from an arbitrary number of populations. Now, suppose that the two homologous alleles of these individuals have coalesced ti and tj generations ago, and that the most recent common ancestor of these four homologous alleles is tglobal. Now, the frequency of the derived allele in individual i at the k-th locus is simply given by

(1) DAFik=nik2Stot,k

where Stot,k is the total number of sites that are polymorphic at this k-th locus for this sample of two individuals, and nik is the number of derived alleles observed in individual i. Since nik is the number of heterozygous sites (Hetik) plus two times the number of homozygous derived sites (Homik) (see Figure 1—figure supplement 1), the expected value of nik can be expressed as a function of tglobal and the mutation rate uk as

(2) E(nik)=E(Hetik+2 Homik)=2 uk tik+2 uk (tglobal,ktik)=2 uk tglobal,k,

which does not depend on tik, the coalescence times between homologous alleles in individuals 1 or 2, as illustrated in Figure 1—figure supplement 1. Therefore, E(njk)=E(nik)=2uktglobal,k, and

(3) E(DAFik)= E(nik)/E(2Stot,k)=tglobal,k/Ttot,k,i,

where Ttot,k is the total tree length at the k-th locus. Since the average derived allele frequency computed over an arbitrary number of unlinked loci m is obtained as the ratio of the total number of derived alleles over twice the total number of polymorphic sites, its expectation is then obtained as

(4) E(DAFi¯)=E(ni)E(2 Stot)=kmE(nik)kmE(2 Stot,k)=kmuk tglobal,kkmukTtot,k,

an equation that is valid irrespective of the number of individuals and populations sampled if one computes the number of derived alleles over all sites found polymorphic in the collection of individuals. If the mutation rate is uniform across loci, then equation (1.4) simplifies to

(5) E(DAFi¯)= t¯global/T¯tot,

which only depends on the average global coalescence time of the total sample t¯global, and on the average tree length over all loci T¯tot, and not on the coalescence times in each population. Therefore, even though E(DAFi¯) depends on the overall demography of the collection of individuals and on the composition of the samples, which both condition the global tMRCA and total tree lengths, the specific demographic histories of the sampled populations will not translate, in expectation, into different DAFi¯ among individuals examined for the same set of loci. Selection in some portion of the genome will affect tMRCAs, which should thus translate into differences in DAFi¯ computed for these regions. Differences in mutation rates across the genome might also affect DAFi¯ for some regions, but should not lead to individual differences, unless mutation rates are different in specific populations.

For both SGDP and 1000G data sets, we ranked SNPs according to their associated recombination rate and binned them into 20 equal-sized classes of increasing recombination rates. We performed a similar binning for the different groups of SNPs we considered (the three types of mutations, within a transcribed region or not, etc.) or after ranking SNPs according to their distance to the nearest exon, to hotspots or to conserved elements. We then computed DAFi¯ for each bin b as DAFib¯=nib/(2Stot,b).

Site frequency spectrum

We estimated the unfolded SFS for ten 1000G population samples using different filters (e.g. different recombination classes, different types of mutations). The SFS was then normalized (Lapierre et al., 2017) by dividing each entry by its expectation in a stationary population. To estimate if two SFSs are statistically different, we used a permutation approach. We first computed a distance between the two SFS as the sum of the squared difference in site frequencies over all SFS entries (noted Dobs). We divided the SNPs into three categories: those shared by the two SFS (if any), and those that were private to one of the SFS. We then randomly permuted sites among the two latest categories and re-evaluated the distance noted Dest. When one SFS was based on a subset of variants from another SFS, we subsampled sites from the largest dataset and re-evaluated Dest. We repeated the permutations or the resampling procedure 1000 times and estimated a p value as the frequency of Dobs ≥ Dest.

Block-bootstrap procedure

For each filter (e.g. per recombination class or per type of mutation), we identified sets of 100 adjacent SNPs along the genome and we sampled them with replacement such as to keep the same number of sites as in the non-bootstrapped set when computing statistics of interest (DAFi¯, SFS). We repeated the sampling 1000 times to obtain 1000 block-bootstrap sets of SNPs. 95% confidence intervals were computed by identifying the 2.5 and 97.5 quantiles of the resulting bootstrap distributions.

Demographic inference

We estimated the parameters of the demographic model shown in Figure 3C from the SFS of two 1000G populations (Japan and Yoruba) using the program fastsimcoal2 (Excoffier et al., 2013) ver 2.6. We used the following command line options:

./fsc26 -t pop.tpl -n200000 -d -e pop.est -M -l25 -L50 -q −0 -C1 -c1 -B1,

where pop denotes either the Japan or the Yoruba population. We used the tpl and est setting files defined in Supplementary file SF1. For each population, we performed 50 independent estimations and retrained the parameters that maximized the model likelihood. The confidence intervals of the parameters were estimated from 100 block-bootstrapped SFS obtained in a way similar to that described above. For each population, estimations were performed on each bootstrap dataset independently, using the maximum likelihood (ML) parameters values estimated above as initial values. Since we started parameter estimation close to the observed ML values, we only did five estimations per bootstrap and retained the parameters with maximum associated likelihood. A 99% confidence interval was then obtained for each parameter by estimating the 0.5% and 99.5% quantiles of its resulting bootstrap distribution.

Individual-based simulations

We performed individual-based simulations using the software SLiM v. 2.3 (Haller and Messer, 2017) to check that BGS could reproduce observations. We simulated the demographic scenario inferred from the ‘neutral’ SFS (i.e. from WW + SS sites with r ≥ 1.5 cM/Mb) for the Japanese (JPT) and Yoruba (YRI) 1000G populations as described above (Demographic inference). We simulated a linear genome of 50 Mb made up of 1000 regions of 5 kb. Each of these regions consisted of a 1 kb stretch experiencing purifying selection against deleterious mutations, followed by a 4 kb stretch with neutral mutations. We also simulated an alternative genomic architecture with 10,000 regions of 500 bp, each consisting of a 100 bp stretch under purifying selection, followed by a 400 bp stretch with neutral mutations. For computational efficiency, we scaled the inferred event times and population sizes by a factor of 0.1 and give below the rescaled values. We set the per-site mutation rate to 1.25 × 10–7 for deleterious and neutral mutations. The fitness contribution of all deleterious mutations was 1 – s in homozygous form and 1 – s/2 in heterozygous form. The fitness of individuals was computed multiplicatively across sites. We ran independent simulations for four recombination rates (r = 10–9, 10–8, 10–7, and 10–6). For each demographic scenario and recombination rate, we simulated a scenario with background selection (s = –0.1) and a neutral scenario (s = 0). For each parameter combination, we performed 100 independent replicates starting with a period of 4 × NANC generations, where NANC is the number of haploid genomes in the ancestral population (Figure 4—figure supplement 1). We set NANC = 4000 for both the Yoruba and Japanese simulation. At the end of each simulation, we output the full population and computed the number of derived alleles for each individual across a fixed number arbitrarily set to 40,000 SNPs, subsampled from all SNPs. These 40,000 SNPs were subsampled individually for each replicate simulation. The SFS of the population was subsampled to 10 individuals (i.e. 20 haploid genomes) following Nielsen et al. (2005) as pi,20=k1j=1k(fji)(njfj20i)/(nj20), where k is the total number of SNPs in the dataset, and nj and fj are the number of haploid genomes in the full sample and the number of derived alleles in the full sample at the jth SNP, respectively (see also Liu et al., 2017). We computed the SFS separately for each replicate simulation, and then calculated the mean and the 2.5 and 97.5 percentiles across these replicates for each entry pi,20. We normalized the SFS as described above (subsection SFS).

Accounting for a correlation between mutation and recombination rates

To model a potential correlation between mutation and recombination, we assumed that the per-base pair deleterious mutation rate ud depends on the local recombination rate r as

udr=u0rb.

This assumption implies a log-log linear relationship between mutation and recombination, with an intercept of logu0 and a slope of b. In the special case of b = 0, mutation is independent of recombination. We then modified the approximate BGS model of Hudson and Kaplan (1995) by substituting ud(r) for the deleterious mutation rate. The reduction in the nucleotide diversity a at a focal site due to BGS is then predicted to be

(6) B=ππ0exp(ud(r)r)=exp(u0rbr)=exp(u0r(b1))

where π0 is the baseline nucleotide diversity in the absence of BGS, and π is the effective nucleotide diversity with BGS. We fit this modified BGS model to the relationship between the B-statistic from McVicker et al. (2009) and the recombination rate associated with our polymorphic SNPs using the method of non-linear least squares as implemented in the nls function in R v 3.4.4 (R core Team, 2018). We then used the Akaike information criterion (AIC, Akaike, 1974) to compare this extended BGS model to the original BGS model in which the mutation rate does not depend on the recombination rate (b=0). Note that McVicker et al. (2009) obtained their B-statistics by fitting a more complex BGS model to polymorphism and recombination data (assuming no specific correlation between recombination and mutation). However, the model of Hudson and Kaplan (1995) used here is just a simplified version of that used by McVicker et al. (2009). It assumes that neutral sites on which diversity is measured are in the middle of a region containing sites under negative selection, that recombination rates are uniform in the considered region, and that selection coefficients at deleterious sites are small relative to the total recombination rate in the region. These assumptions seem reasonable except for sites that are very close to recombination hotspots or close to telomeres, but we expect a qualitatively global agreement between these two models. An exact quantitative match is not required here, since our goal here is simply to assess whether a correlation between mutation and recombination rates needs to be invoked rather than to accurately estimate the parameters of the model (u0 and b).

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
    The pattern of neutral molecular variation under the background selection model
    1. D Charlesworth
    2. B Charlesworth
    3. MT Morgan
    (1995)
    Genetics 141:1619–1632.
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
    Hitchhiking under positive Darwinian selection
    1. JC Fay
    2. CI Wu
    3. W Ci
    (2000)
    Genetics 155:1405–1413.
  21. 21
  22. 22
    A second generation human haplotype map of over 3.1 million SNPs
    1. KA Frazer
    2. DG Ballinger
    3. DR Cox
    4. DA Hinds
    5. LL Stuve
    6. RA Gibbs
    7. JW Belmont
    8. A Boudreau
    9. P Hardenbol
    10. SM Leal
    11. S Pasternak
    12. DA Wheeler
    13. TD Willis
    14. F Yu
    15. H Yang
    16. C Zeng
    17. Y Gao
    18. H Hu
    19. W Hu
    20. C Li
    21. W Lin
    22. S Liu
    23. H Pan
    24. X Tang
    25. J Wang
    26. W Wang
    27. J Yu
    28. B Zhang
    29. Q Zhang
    30. H Zhao
    31. H Zhao
    32. J Zhou
    33. SB Gabriel
    34. R Barry
    35. B Blumenstiel
    36. A Camargo
    37. M Defelice
    38. M Faggart
    39. M Goyette
    40. S Gupta
    41. J Moore
    42. H Nguyen
    43. RC Onofrio
    44. M Parkin
    45. J Roy
    46. E Stahl
    47. E Winchester
    48. L Ziaugra
    49. D Altshuler
    50. Y Shen
    51. Z Yao
    52. W Huang
    53. X Chu
    54. Y He
    55. L Jin
    56. Y Liu
    57. Y Shen
    58. W Sun
    59. H Wang
    60. Y Wang
    61. Y Wang
    62. X Xiong
    63. L Xu
    64. MM Waye
    65. SK Tsui
    66. H Xue
    67. JT Wong
    68. LM Galver
    69. JB Fan
    70. K Gunderson
    71. SS Murray
    72. AR Oliphant
    73. MS Chee
    74. A Montpetit
    75. F Chagnon
    76. V Ferretti
    77. M Leboeuf
    78. JF Olivier
    79. MS Phillips
    80. S Roumy
    81. C Sallée
    82. A Verner
    83. TJ Hudson
    84. PY Kwok
    85. D Cai
    86. DC Koboldt
    87. RD Miller
    88. L Pawlikowska
    89. P Taillon-Miller
    90. M Xiao
    91. LC Tsui
    92. W Mak
    93. YQ Song
    94. PK Tam
    95. Y Nakamura
    96. T Kawaguchi
    97. T Kitamoto
    98. T Morizono
    99. A Nagashima
    100. Y Ohnishi
    101. A Sekine
    102. T Tanaka
    103. T Tsunoda
    104. P Deloukas
    105. CP Bird
    106. M Delgado
    107. ET Dermitzakis
    108. R Gwilliam
    109. S Hunt
    110. J Morrison
    111. D Powell
    112. BE Stranger
    113. P Whittaker
    114. DR Bentley
    115. MJ Daly
    116. PI de Bakker
    117. J Barrett
    118. YR Chretien
    119. J Maller
    120. S McCarroll
    121. N Patterson
    122. I Pe'er
    123. A Price
    124. S Purcell
    125. DJ Richter
    126. P Sabeti
    127. R Saxena
    128. SF Schaffner
    129. PC Sham
    130. P Varilly
    131. D Altshuler
    132. LD Stein
    133. L Krishnan
    134. AV Smith
    135. MK Tello-Ruiz
    136. GA Thorisson
    137. A Chakravarti
    138. PE Chen
    139. DJ Cutler
    140. CS Kashuk
    141. S Lin
    142. GR Abecasis
    143. W Guan
    144. Y Li
    145. HM Munro
    146. ZS Qin
    147. DJ Thomas
    148. G McVean
    149. A Auton
    150. L Bottolo
    151. N Cardin
    152. S Eyheramendy
    153. C Freeman
    154. J Marchini
    155. S Myers
    156. C Spencer
    157. M Stephens
    158. P Donnelly
    159. LR Cardon
    160. G Clarke
    161. DM Evans
    162. AP Morris
    163. BS Weir
    164. T Tsunoda
    165. JC Mullikin
    166. ST Sherry
    167. M Feolo
    168. A Skol
    169. H Zhang
    170. C Zeng
    171. H Zhao
    172. I Matsuda
    173. Y Fukushima
    174. DR Macer
    175. E Suda
    176. CN Rotimi
    177. CA Adebamowo
    178. I Ajayi
    179. T Aniagwu
    180. PA Marshall
    181. C Nkwodimmah
    182. CD Royal
    183. MF Leppert
    184. M Dixon
    185. A Peiffer
    186. R Qiu
    187. A Kent
    188. K Kato
    189. N Niikawa
    190. IF Adewole
    191. BM Knoppers
    192. MW Foster
    193. EW Clayton
    194. J Watkin
    195. RA Gibbs
    196. JW Belmont
    197. D Muzny
    198. L Nazareth
    199. E Sodergren
    200. GM Weinstock
    201. DA Wheeler
    202. I Yakub
    203. SB Gabriel
    204. RC Onofrio
    205. DJ Richter
    206. L Ziaugra
    207. BW Birren
    208. MJ Daly
    209. D Altshuler
    210. RK Wilson
    211. LL Fulton
    212. J Rogers
    213. J Burton
    214. NP Carter
    215. CM Clee
    216. M Griffiths
    217. MC Jones
    218. K McLay
    219. RW Plumb
    220. MT Ross
    221. SK Sims
    222. DL Willey
    223. Z Chen
    224. H Han
    225. L Kang
    226. M Godbout
    227. JC Wallenburg
    228. P L'Archevêque
    229. G Bellemare
    230. K Saeki
    231. H Wang
    232. D An
    233. H Fu
    234. Q Li
    235. Z Wang
    236. R Wang
    237. AL Holden
    238. LD Brooks
    239. JE McEwen
    240. MS Guyer
    241. VO Wang
    242. JL Peterson
    243. M Shi
    244. J Spiegel
    245. LM Sung
    246. LF Zacharia
    247. FS Collins
    248. K Kennedy
    249. R Jamieson
    250. J Stewart
    251. International HapMap Consortium
    (2007)
    Nature 449:851–861.
    https://doi.org/10.1038/nature06258
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    Deleterious background selection with recombination
    1. RR Hudson
    2. NL Kaplan
    (1995)
    Genetics 141:1605–1617.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
    A survey of methods and tools to detect recent and strong positive selection
    1. P Pavlidis
    2. N Alachiotis
    (2017)
    Journal of Biological Research-Thessaloniki, 24, 10.1186/s40709-017-0064-0, 28405579.
  57. 57
    R: A language and environment for statistical computing
    1. R core Team
    (2018)
    R Foundation for Statistical Computing, Vienna, Austria.
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76

Decision letter

  1. Krishna Veeramah
    Reviewing Editor; Stony Brook University, United States
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States
  3. Ilan Gronau
    Reviewer; IDC Herzliya, Israel

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Identifying the neutrally evolving fraction of the human genome" for consideration by eLife. Your article has been reviewed by Patricia Wittkopp as the Senior Editor a Reviewing Editor, and three reviewers. The following individual involved in review of your submission has agreed to reveal his identity: Ilan Gronau (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

In this paper, Pouyet and colleagues develop a new framework to identify to what extent the human genome can be considered "neutral". In particular, the authors define a new statistic, INDA, that should be robust to population demographic, thus allowing the identification of genomic regions where patterns of diversity are not affected by natural selection. They apply this approach to 1000 Genomes and SGDP data. They find a strong positive relationship between local recombination rate and INDA, and that background selection is the primary process that skews genomic diversity, though biased gene conversion also contributes. They determine that WW and SS polymorphisms in regions of the genome where the recombination rate is >1.5cM/MB act effectively neutrally. ~2.5% of SNPs fall into this category, a fairly striking finding that suggests 95% of the genome is affected by natural selection. The authors also show that alternative definitions of neutrality lead to skewed allele frequency spectra which then have downstream affects when attempting to infer demographic history.

Essential revisions:

In general, the three reviewers were complementary of the manuscript and found it to be of value. However, five major issues that required addressing in a revision were identified and agreed upon by all reviewers and the reviewing editor and are summarized below.

1) The current description of INDA in the main text is confusing. The definition should be expanded from its current version in the Results section, and then described in detail in the methods section. Text from Figure 1—figure supplement 1 would be better served being incorporated into the main text Figure 1. The description in Figure 1—figure supplement 1 is also problematic. Stating that the expected value of INDA depends only on TDIV and not on tMRCA is a bit of an oversimplification. Under neutrality with no differences in mutation rate, E(INDA) should be the same across populations. However, the probability that a site will be segregating in the set of populations does depend on tMRCA and therefore differences in tMRCA at some loci will influence INDA. Maybe it should be stated that under neutrality, E(tMRCA) is uniform across the genome, however when there is selection, the tMRCA can differ at different loci and therefore different genomic regions can have different INDA?

2) Differences in mutation rate across genome will have a big effect on INDA and there is clearly a big difference in mutation rates across the genome. While the authors acknowledge that INDA is affected by mutation rate, they then seem to assume that the entire correlation between INDA and recombination rate is driven by selection. While recombination rate does not seem to be mutagenic based on previous studies, it could still be correlated with something that does affect the mutation rate (e.g. potentially replication timing or transcription). One way to control for mutation rate would be to use divergence from a species such as macaque (or by controlling for trinucleotide context given that CpG sites are already identified as being "peculiar"). As pointed out by Lohmueller et al. (2011), divergence is still affected by BGS, but the effect should be smaller for more distant species. At the very least, the authors need to include more discussion that some of what they see could be driven by mutation rate variation across the genome.

3) After the initial analysis, the neutral sites are identified solely based on recombination rate and mutation pattern (r>1.5 cM/Mb, WW,SS). Distance from genes and evolutionary conservation are completely ignored at this stage. The authors should examine whether the resulting neutral set does not have any residual correlations between INDA and other possible covariates, in particular (genetic as well as physical) distance to genes, density of conserved elements and B statistics. It would be surprising if there was no residual correlation with distance to coding exons or other conserved elements (the latter of which has been addressed in several recent studies, e.g. Elyashiv et al., 2016). One reviewer suggests these correlations could be examined on the r>1.5 cM/Mb set, though this may lack power due to the restricted size of the "neutral set". Alternatively, another reviewer suggests to use the marginal correlations shown in Figure 1—figure supplement 5 to suggest good cutoffs for the other covariates.

In a similar vein, the authors are assuming that only sites with a local recombination rate > 1.5 will evolve neutrally. However, it seems plausible that sites in low recombination regions could also evolve neutrally if they were separated from selected sites by recombination hotspots, and as such 95% of the genome being neutral may be an overestimate. One reviewer suggests that B values may provide a better estimate? (i.e. sites with B close to 1.0).

4) The authors should comment on the significance of the neutrality threshold occurring at a recombination rate of 1.5cM/Mb. Perhaps background selection is ineffective at recombination rates above 1.5cM/Mb because the human mutation rate is somewhere between 1.0 and 1.5 e-8 mutations per site per generation? Is background selection ineffective in this regime because recombination events occur at a higher rate than mutation events, meaning that mutations are decoupled from each other onto separate haplotypes soon after they occur? If an analytical equation cannot answer this question easily, it could be investigated using SLiM by trying out a few different combinations of mutation and recombination rates and seeing whether background selection is effective precisely when the recombination rate is lower than the mutation rate.

5) The lack of a golden standard in terms of neutrality make it very difficult to positively assess that a given set is truly neutrally evolving. This study demonstrates that the author's SNP set is "more neutral" compared to alternative sets that have been determined using fairly simple criteria (Figure 3). While some investigators still use synonymous SNPs, and SNPs far from genes as neutral proxies, there do exist some better ways to filter SNPs (e.g., the NRE of Arbiza et al., 2012, BMC Bioinformatics). The authors should compare their results against these alternative, more sophisticated neutral sets. One reviewer notes that NRE can be used to filter based on B, so the authors can use it to compare their neutral set with the set of regions with very high B.

https://doi.org/10.7554/eLife.36317.038

Author response

Summary:

In this paper, Pouyet and colleagues develop a new framework to identify to what extent the human genome can be considered "neutral". In particular, the authors define a new statistic, INDA, that should be robust to population demographic, thus allowing the identification of genomic regions where patterns of diversity are not affected by natural selection. They apply this approach to 1000 Genomes and SGDP data. They find a strong positive relationship between local recombination rate and INDA, and that background selection is the primary process that skews genomic diversity, though biased gene conversion also contributes. They determine that WW and SS polymorphisms in regions of the genome where the recombination rate is >1.5cM/Mb act effectively neutrally. ~2.5% of SNPs fall into this category, a fairly striking finding that suggests 95% of the genome is affected by natural selection. The authors also show that alternative definitions of neutrality lead to skewed allele frequency spectra which then have downstream affects when attempting to infer demographic history.

We appreciate the nice summary of our study. We would just mention that we find that “only” 80-85% of the genome seems affected by selection (and not 95%), whereas about 15% seem affected primarily by gBGC.

Essential revisions:

In general, the three reviewers were complementary of the manuscript and found it to be of value. However, five major issues that required addressing in a revision were identified and agreed upon by all reviewers and the reviewing editor and are summarized below.

1) The current description of INDA in the main text is confusing. The definition should be expanded from its current version in the Results section, and then described in detail in the methods section. Text from Figure 1—figure supplement 1 would be better served being incorporated into the main text Figure 1. The description in Figure 1—figure supplement 1 is also problematic. Stating that the expected value of INDA depends only on TDIV and not on tMRCA is a bit of an oversimplification. Under neutrality with no differences in mutation rate, E(INDA) should be the same across populations. However, the probability that a site will be segregating in the set of populations does depend on tMRCA and therefore differences in tMRCA at some loci will influence INDA. Maybe it should be stated that under neutrality, E(tMRCA) is uniform across the genome, however when there is selection, the tMRCA can differ at different loci and therefore different genomic regions can have different INDA?

We have followed the suggestions of the reviewers. We have modified the Results section to better introduce this statistic and justify its use and we have considerably extended its description in the Material and methods section to clarify its definition, hoping it removes any ambiguity. Note that in order to follow the suggestion of the third reviewer, we have dropped the use of INDA in favor of the average derived allele frequency, noted as DAFi¯, which is just INDA divided by twice the number of polymorphic sites in the sample, as it is easier to compare this statistic across data sets. In the new version, INDA, now simply called ni, is only described in the material and methods to show its quasi-independence of differential demography among populations. DAFi¯ is used in the reminder of the paper, and all figures have been modified to reflect this change.

2) Differences in mutation rate across genome will have a big effect on INDA and there is clearly a big difference in mutation rates across the genome. While the authors acknowledge that INDA is affected by mutation rate, they then seem to assume that the entire correlation between INDA and recombination rate is driven by selection. While recombination rate does not seem to be mutagenic based on previous studies, it could still be correlated with something that does affect the mutation rate (e.g. potentially replication timing or transcription). One way to control for mutation rate would be to use divergence from a species such as macaque (or by controlling for trinucleotide context given that CpG sites are already identified as being "peculiar"). As pointed out by Lohmueller et al. (2011), divergence is still affected by BGS, but the effect should be smaller for more distant species. At the very least the authors need to include more discussion that some of what they see could be driven by mutation rate variation across the genome.

We agree that there is a wide difference in mutation and recombination rates along the genome, and that these two processes might be directly (mutagenic effect of recombination) or indirectly (e.g. through replication timing) correlated. To address these concerns in more details, we have now added a full paragraph on the possible mutagenic effect of recombination at the beginning of the Discussion section as well as on the possible indirect association between mutation rate and recombination. We have also added a new analysis of the relationship between the intensity of BGS quantified by McVicker et al. (2009) in terms of the B-statistic and the recombination rate under a simple model. We show that the fit to the data is significantly better if we allow for some correlation between mutation and recombination rates. Since the sole mutation–recombination correlation cannot explain the exact relationship observed between DAFi¯ and recombination (i.e. the plateau for DAFi¯ above ~1.5 cM/Mb for SS+WW sites) and the difference in the shape of the SFS for low and high recombination regions, we now conclude that our observations are best explained by the action of BGS and gBGC and by the presence of some correlation between mutation and recombination along the genome. We have modified the Discussion and Material and Methods section to summarise our conclusions and describe the additional analyses (see also the new Figure 1—figure supplement 8 and Supplementary file 4).

3) After the initial analysis, the neutral sites are identified solely based on recombination rate and mutation pattern (r>1.5 cM/Mb, WW,SS). Distance from genes and evolutionary conservation are completely ignored at this stage. The authors should examine whether the resulting neutral set does not have any residual correlations between INDA and other possible covariates, in particular (genetic as well as physical) distance to genes, density of conserved elements and B statistics. It would be surprising if there was no residual correlation with distance to coding exons or other conserved elements (the latter of which has been addressed in several recent studies, e.g. Elyashiv et al., 2016). One reviewer suggests these correlations could be examined on the r>1.5 cM/Mb set, though this may lack power due to the restricted size of the "neutral set". Alternatively, another reviewer suggests to use the marginal correlations shown in Figure 1—figure supplement 5 to suggest good cutoffs for the other covariates.

In a similar vein, the authors are assuming that only sites with a local recombination rate > 1.5 will evolve neutrally. However, it seems plausible that sites in low recombination regions could also evolve neutrally if they were separated from selected sites by recombination hotspots, and as such 95% of the genome being neutral may be an overestimate. One reviewer suggests that B values may provide a better estimate? (i.e. sites with B close to 1.0).

We found it indeed interesting and important to better characterize our neutral set. We have done so by examining, for our neutral set of SNPs, possible relationships between DAFi¯ (now replacing INDA) and other covariates, such as the B-statistics as well as the distance (in terms of map units) to exons, to conserved elements and to recombination hotspots. In addition, we have compared the SFS observed in our neutral set (SS+WW sites with r > 1.5 cM/Mb) to the SFS computed for subsets of SNPs that showed extreme values for the covariates mentioned above (first and fourth quartile). In our neutral set, we find virtually not relationship between DAFi¯ and recombination rate or little for distance to exons (DAFi¯ varies from 0.145 to 0.15, close to the average of 0.146), but we find a negative relation with distance to recombination hotspots and a positive relationship with distances to conserved elements and with B-statistics. We find that the distance to phastCons conserved elements has a major influence on DAFi¯. However, removing the 25% closest sites to conserved elements sites has little effect on the SFS (the fourth quartile is close to the whole set suggesting that a majority of SNPs behave as the fourth quartile and not as the first one), suggesting that our whole neutral set can be used for SFS-based demographic inference, even though it may appear safer to remove these sites from some analyses. These results are reported in Figure 1—figure supplement 9 and Figure 2—figure supplement 3 and discussed in a new paragraph of the Discussion section. We also added a few sentences to acknowledge the fact that regions with low recombination rates can still harbor sites unaffected to BGS, but that the average DAFi¯ statistic has not enough resolution to recognize these regions.

4) The authors should comment on the significance of the neutrality threshold occurring at a recombination rate of 1.5cM/Mb. Perhaps background selection is ineffective at recombination rates above 1.5cM/Mb because the human mutation rate is somewhere between 1.0 and 1.5 e-8 mutations per site per generation? Is background selection ineffective in this regime because recombination events occur at a higher rate than mutation events, meaning that mutations are decoupled from each other onto separate haplotypes soon after they occur? If an analytical equation cannot answer this question easily, it could be investigated using SLiM by trying out a few different combinations of mutation and recombination rates and seeing whether background selection is effective precisely when the recombination rate is lower than the mutation rate.

We agree that a better theoretical justification of the threshold recombination rate used to define neutral regions would be of value. Actually, previous theory shows that the ratio of the deleterious mutation rate to the recombination rate is a major determinant of the effect of background selection (Hudson and Kaplan, 1995; Nordborg et al., 1996). We have therefore addressed this concern by using the observed linear relationship between DAFi¯ and B-statistics for which a dependence on BGS and recombination is theoretically available, and we explicitly used it to address point 2 above. We have thus used a simple model of BGS (Hudson and Kaplan, 1995) describing the relationship between B and recombination rate to evidence a potential correlation between recombination and mutation. In this analysis, we find that the B-statistic also reaches a plateau approximately above 1.5 cM/Mb by assuming a log-log linear relation between the deleterious mutation rate and recombination. In this simple BGS model, B exp(ud/r), where ud is the deleterious mutation rate. Thus, B tends symptotically to 1 with increasing r values, but more slowly when ud increases with r than when it is constant. In this analysis, in order to correctly explain the data, we find the deleterious mutation rate to be about 10 times smaller (∼1.4e-9 per site per generation) than the neutral mutation rate when r= 1 cM/Mb (see Figure 1—figure supplement 8). Of course, this model is very simple, but we think it predicts relatively well the main features of the average B statistics. Given that average DAFi¯ and B-statistics are highly correlated, the threshold recombination rate of 1.5 cM/Mb should also apply to DAFi¯. Therefore, the intuition of the reviewers is correct in the sense that there is a theoretical justification for this plateau, but theory suggests that it depends on the interaction between the deleterious mutation rate and recombination, and not between the neutral mutation rate and recombination.

Because the analytical theory provides a sharp qualitative prediction that justifies our threshold of 1.5 cM/Mb, we decided not to run additional simulations. We do, however, stress that not too much emphasis should be given to the exact value of 1.5 cM/Mb, but rather to the claim that the true threshold is expected to be roughly around this recombination rate.

5) The lack of a golden standard in terms of neutrality make it very difficult to positively assess that a given set is truly neutrally evolving. This study demonstrates that the author's SNP set is "more neutral" compared to alternative sets that have been determined using fairly simple criteria (Figure 3). While some investigators still use synonymous SNPs, and SNPs far from genes as neutral proxies, there do exist some better ways to filter SNPs (e.g., the NRE of Arbiza et al., 2012, BMC Bioinformatics). The authors should compare their results against these alternative, more sophisticated neutral sets. One reviewer notes that NRE can be used to filter based on B, so the authors can use it to compare their neutral set with the set of regions with very high B.

This is a good comment and to address it, we have also compared our neutral set to one that was initially defined by Gronau et al. (2011), and which consists of a series of 37,574 1-kb regions cleared of misalignment with chimpanzees, without CpG sites, exonic regions, or any conserved non-coding elements, at least 50 kb away from each other, without recombination hot-spots within regions and separated by hot-spots to minimize their linkage. These regions potentially minimally affected by BGS have been used as a neutral benchmark in several papers referenced in the discussion. We have also compared our neutral set to regions associated with B-statistics larger than 0.9, as suggested by one reviewer. These two alternative data sets show clearly different SFS shapes than our neutral set, mainly due to BGS for the 37,574 1-kb regions and to gBGC for the B > 0.9 SNPs. These alternative data sets are discussed in the Discussion section and their SFS are presented in new Figure 3—figure supplement 4A. The B > 0.9 data set including only WW and SS SNPs could therefore be considered as a good alternative data set on which to perform demographic inference.

https://doi.org/10.7554/eLife.36317.039

Article and author information

Author details

  1. Fanny Pouyet

    1. Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Simon Aeschbacher
    For correspondence
    fanny.pouyet@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5614-6998
  2. Simon Aeschbacher

    1. Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    3. Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Fanny Pouyet
    Competing interests
    No competing interests declared
  3. Alexandre Thiéry

    1. Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  4. Laurent Excoffier

    1. Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    laurent.excoffier@iee.unibe.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7507-6494

Funding

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (310030B-166605)

  • Laurent Excoffier

University of California Berkeley (Visiting Miller Professorship)

  • Laurent Excoffier

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Anthanasios Kousathanas, Guillaume Achaz, Etienne Patin, Lluis Quintana-Murci, Sylvain Glémin, and Rasmus Nielsen for informative discussions on the subject, and Montgomery Slatkin for his careful reading and helpful comments on the manuscript. FP and SA have been supported by a Swiss NSF grant No 310030B-166605 to LE. LE was also supported by the Institut Pasteur in Paris and by a Visiting Miller Professorship grant from the University of Berkeley during his sabbatical. Calculations were performed on the UBELIX (http://www.id.unibe.ch/hpc) cluster of the University of Bern. The source code and setting parameters to reproduce the analyses are available at http://datadryad.org/review?doi=doi:10.5061/dryad.t76fk80.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Krishna Veeramah, Stony Brook University, United States

Reviewer

  1. Ilan Gronau, IDC Herzliya, Israel

Publication history

  1. Received: March 1, 2018
  2. Accepted: August 17, 2018
  3. Accepted Manuscript published: August 20, 2018 (version 1)
  4. Accepted Manuscript updated: August 23, 2018 (version 2)
  5. Version of Record published: October 9, 2018 (version 3)

Copyright

© 2018, Pouyet et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,699
    Page views
  • 416
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Chromosomes and Gene Expression
    2. Genetics and Genomics
    Jürgen Jänes et al.
    Tools and Resources Updated
    1. Developmental Biology
    2. Genetics and Genomics
    Young Sun Hwang et al.
    Short Report