RegEvol: detection of directional selection in regulatory sequences through phenotypic predictions and phenotype-to-fitness functions

  1. Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
  2. Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
  3. Department of Computational Biology, University of Lausanne, Lausanne, Switzerland

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Read more about eLife’s peer review process.

Editors

  • Reviewing Editor
    Diethard Tautz
    Max Planck Institute for Evolutionary Biology, Plön, Germany
  • Senior Editor
    Aleksandra Walczak
    CNRS, Paris, France

Reviewer #1 (Public review):

Summary:

In this manuscript, the authors present a method to detect natural selection on transcription factor binding sites (TFBSs), which is an upgraded version of a previously published method (Liu and Robinson-Rechavi, 2020). This upgraded version of the test implements more explicit models of evolution and is shown to outperform its predecessor in terms of both power and false positive rate. I think this method can be a valuable resource for the community and can be helpful not only to studies of TFBSs but also broader evolutionary questions related to genotype-phenotype maps or fitness landscapes.

Major comments:

(1) Questions related to Figure 1

Figure 1, along with the first section of the Results, shows that the SVM score and its sensitivity to mutations are generally correlated with the strength of ChIP-seq signals. It is not very clear to me, however, what the motivation is behind this part of the paper. It seems that the model used to predict binding strength is a pre-existing one, and it is unclear what is new in this section. Was the prediction model retrained using different data? Was its validity confirmed using new data? I would appreciate some more elaboration on how these results differ from what was presented in the previous study of Liu and Robinson-Rechavi (2020).

The existence of weak or negative correlations between SVM and coverage, which reportedly reflects low-quality peaks, seems applicable not only to this paper, but also to previous ones, so I would like to have it confirmed whether the question and the authors' answers apply to previous studies as well.

It is reported that SVM scores capture TF binding signals better than conservation-based statistics do. My intuitive interpretation is that both ChIP-seq peaks and SVM scores are supposed to reflect binding strength, whereas conservation is supposed to reflect selection (i.e., different definitions of "function" as mentioned above). It is not explicitly explained in the Results, however, what the difference indicates, leaving only an impression that the SVM score is "better" than the conservation statistics.

In summary, I think further elaboration on the above problems would make the flow of thought of this paper easier to follow.

(2) Lack of directional selection for low binding affinity

In the analysis of Drosophila melanogaster ChIP-seq peaks, there were more cases of directional selection for higher binding affinity than directional selection for lower binding affinity. The authors suggested that this observation is "likely biological" because the same pattern was not seen in simulations (line 412-413). I wonder if this could have resulted from a difference in the distribution of ancestral binding affinity across TFBSs between real and simulated data. If binding affinity was generally low in the common ancestor of D. melanogaster and D. simulans, selection for low binding affinity would manifest mainly as purifying selection against mutations that increase affinity instead of directional selection. Ancestral sequences for simulations, if I understood correctly, are observed peaks in D. melanogaster (line 715-719), which would include high fraction sequences that could be rarer in the real ancestral sequences.

The description of this particular result does not refer to a figure or table, nor is it revisited in the Discussion. Figure 5 treats peaks under directional selection as a single category. Taken together, it is hard to tell how this observation should be interpreted. If the authors consider this result as biologically meaningful, I would suggest adding more details (e.g., the number of each side).

(3) Selection in non-focal lineages

Regarding the detected signals of directional selection for stronger binding in certain tissues (Figure 6), I wonder if it is the focal species or those very tissues that are "special": did the human lineage undergo more adaptive regulatory evolution than the chimpanzee lineage, or do nervous and male reproductive systems have a high "propensity" for adaptive regulatory evolution? Assuming that the binding preference of the same TF did not undergo a significant change since human-chimpanzee split (which, I believe, is a built-in assumption in both RegEvo and the permutation test), it should be possible to perform the same test using chimpanzee sequences that are homologous to the human ChIP-seq peak regions. In the case of coding sequences, for example, Bakewell et al. (2007) found that it was the chimpanzee that had more genes under positive selection than humans; I wonder if TFBSs show the same or a different pattern.

(4) Comments on terminology

a) Meaning of "function"

The word "function" has had different meanings in the biology literature, with some authors using "functional" to refer to anything with a phenotypic effect and some using it only for targets of selection. A (putative) TFBS would be considered "functional" as long as it has TF binding affinity if we follow the effect-based definition, but only if its binding affinity is under selection if we follow the selection-based definition. In this manuscript, the term "function" appears to have been used to refer to TF binding but not selection, most notably in the first Results section. There are also places where it is less clear what "function" means exactly (e.g., "deeply conserved elements that are likely to be functionally important" of line 61). Since this paper is about evolution, it is likely that many readers prefer the selection-based definition or assume that the selection-based definition would be used. Thus, using "function" to refer to just TF binding could be confusing. To this end, I would suggest that the authors drop the word "function" or give an explicit definition early in this paper.

b) Directional selection in different directions

In this paper, selection for increased TF binding affinity is referred to as "positive directional selection", and selection in the opposite direction is called "negative directional selection" (as exemplified in Figure 2). I understand that using such shorthand names would make the text less clumsy, but these two terms could potentially be confusing, as "positive selection" and "negative (purifying) selection" are also terms referring to specific types of selection and have some connection to directional and stabilizing selection. Therefore, I suggest that the authors use something like "selection for increased/decreased binding affinity" instead, or note explicitly in the text that "positive/negative directional selection" would be used as shorthand.

Reviewer #2 (Public review):

Summary:

The manuscript by Laverre et al. provides an interesting new test of selection on TF binding. Rather than focusing on sequence changes, this test is specifically for changes in predicted TF binding affinity. The authors report directional selection on 5.1% of tested regions in Drosophila, as well as a signal of selection on CTCF binding in the human CNS and male reproductive system.

Strengths:

Overall, I think this represents an important direction for the field of molecular evolution: now that TF binding can be predicted fairly well from sequence, it can be a very useful focus for tests of selection.

Weaknesses:

As mentioned several times in the manuscript, Jiang and Zhang (2024) pointed out some issues with a previous permutation-based version of this test. Foremost among these was the issue of ascertainment bias: when testing only experimentally supported TF binding sites from a focal species, and then asking what type of selection (or lack of selection) led to those sites, one is guaranteed to find more substitutions that increase affinity, simply because the sites were selected in the first place as those with maximum (empirically measured) affinity.

To address this issue, the authors simulated Drosophila CTCF peaks evolving neutrally and then tested different ascertainment cutoffs in Figure 4D. It was not entirely clear to me what is shown in Figure 4D: the text says the bins were stratified by derived delta-SVM, whereas the figure says SVM, and the legend says derived SVM (both without the delta). I was unable to find any clarification of this in the Methods section. In any case, I am not really convinced by this, for two main reasons. First, when analyzing empirical ChIP-seq data, I would guess that only a tiny fraction of the genome is bound (far less than 1%, especially in mammalian genomes). However, the most extreme bin in Figure 4D is taking the top 10% of (delta?) SVM values. What would Figure 4D look like at bins of the highest 0.1%, 0.001%, etc? My guess is there would be a strong uptick in the FPR. The second reason is actually more important and fundamental than the first. As long as this method is working as described, I cannot see any way that it would ‘not’ be impacted by ascertainment bias. As an extreme case, imagine that all TF binding sites tested had the maximum possible SVM scores; then none of them would have any chance of showing directional selection against binding, while even those that evolved neutrally would appear to have directional selection in favor of binding. Of course, real empirical data are not as extreme as this, but the same concept applies in less extreme scenarios.

This bias could explain patterns observed in the real data. For example: "We observe much more positive than negative directional selection, a pattern likely biological rather than methodological, since it is absent from simulations." This is exactly the pattern predicted under ascertainment bias (in the extreme-scenario thought experiment above). I suspect it is absent from simulations simply because the authors did not properly account for this bias in their simulations.

If the main result reported by the authors had been a lack of any directional selection in favor of binding, and instead only neutrality or directional selection against binding, then this ascertainment bias would not be an issue- it would only have made their results conservative. Unfortunately, this is not the case, and the directional selection in favor of binding, which is the main result emphasized from the empirical analysis, could be inflated by this bias.

Minor point:

The following statement: "In contrast, phastCons and phyloP scores lack such enrichment and have a lower dynamic range, suggesting that the conservation scores are less sensitive to fine-scale variation of TF occupancy and thus regulatory region function" is only true if one assumes that TF binding is the only function of this region. One could even turn this around and say the fact that the sites affecting TF binding are not the most conserved is actually evidence that TF binding is not a good indicator of these regions' entire function. I suggest the authors soften this claim that conservation scores are less sensitive to regulatory region function.

Author response:

Reviewer #1 (Public review):

Summary:

In this manuscript, the authors present a method to detect natural selection on transcription factor binding sites (TFBSs), which is an upgraded version of a previously published method (Liu and Robinson-Rechavi, 2020). This upgraded version of the test implements more explicit models of evolution and is shown to outperform its predecessor in terms of both power and false positive rate. I think this method can be a valuable resource for the community and can be helpful not only to studies of TFBSs but also broader evolutionary questions related to genotype-phenotype maps or fitness landscapes.

Major comments:

(1) Questions related to Figure 1

Figure 1, along with the first section of the Results, shows that the SVM score and its sensitivity to mutations are generally correlated with the strength of ChIP-seq signals. It is not very clear to me, however, what the motivation is behind this part of the paper. It seems that the model used to predict binding strength is a pre-existing one, and it is unclear what is new in this section. Was the prediction model retrained using different data? Was its validity confirmed using new data? I would appreciate some more elaboration on how these results differ from what was presented in the previous study of Liu and Robinson-Rechavi (2020).

We agree that the current manuscript does not clearly distinguish which parts of Figure 1 are novel and which are foundational. The SVM itself is not new and is the same as in Lee et al. (2016), as used in Liu & Robinson-Rechavi (2020). In the revision, we will explicitly state that the SVM used in Figure 1 is the standard gapped-kmer SVM (ls-gkm) approach. We retrained all gkm-SVM models de novo for each species-TF dataset, ensuring consistency across all analysed ChIP-seq peaks. For this, we recalled all ChIP-seq peaks in a homogeneous and robust manner using the nf-core ChIP-seq pipeline v2.0 (Ewels et al. 2022). Figure 1A confirms that the predicted binding affinity from the SVM correlates with experimental ChIP peak height. In addition, examining scores per site rather than per peak is new compared with Liu and Robinson-Rechavi (2020). The correlations between the SVM-derived scores and other features had not been shown before to the best of our knowledge, thus Figure 1B-C is entirely novel. In other words, this analysis is meant to show that our phenotypic metric (SVM score per site) indeed tracks binding intensity, i.e. molecular phenotype.

The existence of weak or negative correlations between SVM and coverage, which reportedly reflects low-quality peaks, seems applicable not only to this paper, but also to previous ones, so I would like to have it confirmed whether the question and the authors' answers apply to previous studies as well.

Yes, this is a well-known issue in ChIP-seq studies. Low coverage often matches weak predicted binding affinity scores because noisy or unreliable peaks naturally have weaker signals. This is not specific to our work, and it has been observed in many other studies (e.g., Bailey et al. 2013 doi:10.1371/journal.pcbi.1003326; Nakato and Shirahige 2017 doi:10.1093/bib/bbw023). It is simply an expected property of the data.

It is reported that SVM scores capture TF binding signals better than conservation-based statistics do. My intuitive interpretation is that both ChIP-seq peaks and SVM scores are supposed to reflect binding strength, whereas conservation is supposed to reflect selection (i.e., different definitions of "function" as mentioned above). It is not explicitly explained in the Results, however, what the difference indicates, leaving only an impression that the SVM score is "better" than the conservation statistics.

While the reviewer is correct that there are different definitions of function, both conservation-based statistics and RegEvol seek to capture selected function. The difference is that RegEvol aims to measure functional change, whereas conservation-based statistics aim to detect sequences that retain the same function across species. In both cases, we expect a correlation with causal function (i.e., binding). We will clarify these concepts and how they apply to our results in the revised manuscript.

(2) Lack of directional selection for low binding affinity

In the analysis of Drosophila melanogaster ChIP-seq peaks, there were more cases of directional selection for higher binding affinity than directional selection for lower binding affinity. The authors suggested that this observation is "likely biological" because the same pattern was not seen in simulations (line 412-413). I wonder if this could have resulted from a difference in the distribution of ancestral binding affinity across TFBSs between real and simulated data. If binding affinity was generally low in the common ancestor of D. melanogaster and D. simulans, selection for low binding affinity would manifest mainly as purifying selection against mutations that increase affinity instead of directional selection. Ancestral sequences for simulations, if I understood correctly, are observed peaks in D. melanogaster (line 715-719), which would include high fraction sequences that could be rarer in the real ancestral sequences.

The description of this particular result does not refer to a figure or table, nor is it revisited in the Discussion. Figure 5 treats peaks under directional selection as a single category. Taken together, it is hard to tell how this observation should be interpreted. If the authors consider this result as biologically meaningful, I would suggest adding more details (e.g., the number of each side).

We appreciate this insight. We agree that the text was not clear, but in fact, the simulations were performed using the reconstructed ancestral sequences of ChIP-seq peaks themselves. Thus, simulated and empirical results should be directly comparable, and different results should be due to biology. We will revise the Manuscript to explicitly state that simulations are performed from reconstructed ancestral sequences and why. We will also add more descriptive statistics of the simulated and real data.

(3) Selection in non-focal lineages

Regarding the detected signals of directional selection for stronger binding in certain tissues (Figure 6), I wonder if it is the focal species or those very tissues that are "special": did the human lineage undergo more adaptive regulatory evolution than the chimpanzee lineage, or do nervous and male reproductive systems have a high "propensity" for adaptive regulatory evolution? Assuming that the binding preference of the same TF did not undergo a significant change since human-chimpanzee split (which, I believe, is a built-in assumption in both RegEvo and the permutation test), it should be possible to perform the same test using chimpanzee sequences that are homologous to the human ChIP-seq peak regions. In the case of coding sequences, for example, Bakewell et al. (2007) found that it was the chimpanzee that had more genes under positive selection than humans; I wonder if TFBSs show the same or a different pattern.

This is an excellent suggestion. To compare in an unbiased manner, we would need transcription factor ChIP-seq from the same organs in chimpanzees and humans. We are not aware of such a dataset. If one is identified, we would be very interested in analysing it, and thus answer this question. As suggested by the reviewer, we will analyse the human homologous sequences. Although it should be clear that this will provide a biased estimate for comparing adaptation between the two species, as we will lack newly acquired binding sites in the chimpanzee.

(4) Comments on terminology

(a) Meaning of "function"

The word "function" has had different meanings in the biology literature, with some authors using "functional" to refer to anything with a phenotypic effect and some using it only for targets of selection. A (putative) TFBS would be considered "functional" as long as it has TF binding affinity if we follow the effect-based definition, but only if its binding affinity is under selection if we follow the selection-based definition. In this manuscript, the term "function" appears to have been used to refer to TF binding but not selection, most notably in the first Results section. There are also places where it is less clear what "function" means exactly (e.g., "deeply conserved elements that are likely to be functionally important" of line 61). Since this paper is about evolution, it is likely that many readers prefer the selection-based definition or assume that the selection-based definition would be used. Thus, using "function" to refer to just TF binding could be confusing. To this end, I would suggest that the authors drop the word "function" or give an explicit definition early in this paper.

We thank the reviewer for this precision and fully agree, we will revise our terminology for clarity. We will clarify the distinction between selected function and causal function, and we will pay attention to their use throughout the manuscript.

(b) Directional selection in different directions

In this paper, selection for increased TF binding affinity is referred to as "positive directional selection", and selection in the opposite direction is called "negative directional selection" (as exemplified in Figure 2). I understand that using such shorthand names would make the text less clumsy, but these two terms could potentially be confusing, as "positive selection" and "negative (purifying) selection" are also terms referring to specific types of selection and have some connection to directional and stabilizing selection. Therefore, I suggest that the authors use something like "selection for increased/decreased binding affinity" instead, or note explicitly in the text that "positive/negative directional selection" would be used as shorthand.

We agree with this ambiguity in the current terminologies. We will replace the phrases “positive directional selection” and “negative directional selection” with, e.g., “selection for increased binding affinity” and “selection for decreased binding affinity” as suggested when presenting our biological result on ChIP-seq peaks. However, we will still use “positive/negative directional” for the general framework (genotype → phenotype →fitness map) and insert a note that we use “positive/negative directional” as shorthand to mean increasing/decreasing affinity in the case of CHIP-seq peaks.

Reviewer #2 (Public review):

Summary:

The manuscript by Laverre et al. provides an interesting new test of selection on TF binding. Rather than focusing on sequence changes, this test is specifically for changes in predicted TF binding affinity. The authors report directional selection on 5.1% of tested regions in Drosophila, as well as a signal of selection on CTCF binding in the human CNS and male reproductive system.

Strengths:

Overall, I think this represents an important direction for the field of molecular evolution: now that TF binding can be predicted fairly well from sequence, it can be a very useful focus for tests of selection.

Weaknesses:

As mentioned several times in the manuscript, Jiang and Zhang (2024) pointed out some issues with a previous permutation-based version of this test. Foremost among these was the issue of ascertainment bias: when testing only experimentally supported TF binding sites from a focal species, and then asking what type of selection (or lack of selection) led to those sites, one is guaranteed to find more substitutions that increase affinity, simply because the sites were selected in the first place as those with maximum (empirically measured) affinity.

To address this issue, the authors simulated Drosophila CTCF peaks evolving neutrally and then tested different ascertainment cutoffs in Figure 4D. It was not entirely clear to me what is shown in Figure 4D: the text says the bins were stratified by derived delta-SVM, whereas the figure says SVM, and the legend says derived SVM (both without the delta). I was unable to find any clarification of this in the Methods section. In any case, I am not really convinced by his, for two main reasons. First, when analyzing empirical ChIP-seq data, I would guess that only a tiny fraction of the genome is bound (far less than 1%, especially in mammalian genomes). However, the most extreme bin in Figure 4D is taking the top 10% of (delta?) SVM values. What would Figure 4D look like at bins of the highest 0.1%, 0.001%, etc? My guess is there would be a strong uptick in the FPR.

We apologise for the confusion in Figure 4D, we will clarify the caption and text and specify that bins are stratified by derived SVM (post-simulation binding affinity proxy), not genome % or ΔSVM.

We want to note that we used the same subsampling approach as Jiang and Zhang (2024) to evaluate ascertainment bias, and that Figure 4 both confirms the issue that they identified with Liu and Robinson-Rechavi (2020), and shows very clearly that RegEvol does not have the same issue (flat red lines). Following the reviewer's suggestion, we can extend the figure to 1% or 0.1% bins. We note that the % of the total genome is different from the % of peaks: while actual peaks cover a very small proportion of the genome, the subsampling in Figure 4 (and in Jiang and Zhang 2024) aims to estimate the impact of detecting only the strongest peaks.

One difference between Jiang and Zhang (2024) and our study is that we simulated using whole empirical peaks, whereas they simulated 10-nucleotide transcription-binding sites, meaning that each substitution represented a 10% change. We will clarify these differences in the revised text.

The second reason is actually more important and fundamental than the first. As long as this method is working as described, I cannot see any way that it would ‘not’ be impacted by ascertainment bias. As an extreme case, imagine that all TF binding sites tested had the maximum possible SVM scores; then none of them would have any chance of showing directional selection against binding, while even those that evolved neutrally would appear to have directional selection in favor of binding. Of course, real empirical data are not as extreme as this, but the same concept applies in less extreme scenarios.

This bias could explain patterns observed in the real data. For example: "We observe much more positive than negative directional selection, a pattern likely biological rather than methodological, since it is absent from simulations." This is exactly the pattern predicted under ascertainment bias (in the extreme-scenario thought experiment above). I suspect it is absent from simulations simply because the authors did not properly account for this bias in their simulations.

If the main result reported by the authors had been a lack of any directional selection in favor of binding, and instead only neutrality or directional selection against binding, then this ascertainment bias would not be an issue- it would only have made their results conservative. Unfortunately, this is not the case, and the directional selection in favor of binding, which is the main result emphasized from the empirical analysis, could be inflated by this bias.

There is indeed a possible ascertainment bias, although we believe it concerns only the detection of negative directional selection, as long as we have only empirical peaks in the focal species and not the sister species. This is not so much a limitation of our method as an intrinsic limitation of asymmetrical sampling of species: to study both gain and loss of function, function must be studied experimentally in several species. We will revise the manuscript to highlight this limitation.

Concerning positive directional selection, the mathematical foundation of RegEvol makes it inherently robust to ascertainment bias for positive directional selection. RegEvol calculates the likelihood of the entire sequence of observed substitutions accounting for the starting ancestral state and the mutational landscape. In other words, the model does not assume a uniform probability of phenotypic change; instead, it models the probability of each nucleotide mutation to result in a substitution (i.e., go to fixation) depending on its phenotype.

In an extreme case where all tested TF binding sites had the maximum SVM score, detecting negative directional selection would indeed be impossible, as ancestral states would have had equivalent or lower scores. However, positive directional selection would be inferred only if the likelihood of observing the substitution pattern’s deltaSVM distribution significantly exceeded that expected under the mutational landscape. If a sequence evolved neutrally but reached a maximum SVM score, the likelihood of detecting directional selection would depend on: either the ancestral state being close to maximum with few substitutions increasing SVM (resulting in low statistical power), or the ancestral state being distant with many neutral substitutions and rare chance shifts to maximum (where the substitution distribution would be indistinguishable from neutrality). Then, even in such an extreme dataset, neutral evolution remains detectable, demonstrating RegEvol's strength beyond deltaSVM comparisons between two states.

Minor point:

The following statement: "In contrast, phastCons and phyloP scores lack such enrichment and have a lower dynamic range, suggesting that the conservation scores are less sensitive to fine-scale variation of TF occupancy and thus regulatory region function" is only true if one assumes that TF binding is the only function of this region. One could even turn this around and say the fact that the sites affecting TF binding are not the most conserved is actually evidence that TF binding is not a good indicator of these regions' entire function. I suggest the authors soften this claim that conservation scores are less sensitive to regulatory region function.

We thank the reviewer for this comment, the text will be revised to soften this claim. We will explicitly state that sequence conservation reflects general functional constraints, whereas sequence-to-phenotype predictions capture highly specific and lineage-specific TF-DNA interactions.

  1. Howard Hughes Medical Institute
  2. Wellcome Trust
  3. Max-Planck-Gesellschaft
  4. Knut and Alice Wallenberg Foundation