Figures and data

SVM-based predictions capture functional signals of transcription factor binding.
A) SVM scores computed across a CEBPA ChIP-seq peak in Homo sapiens. Each letter represents a nucleotide, with font size proportional to the SVM score. Two regions of elevated scores match the canonical CEBPA binding motifs (JASPAR). The summit of the peak is shown by a dotted line. B) Average profiles of normalised read coverage (red), SVM prediction scores (yellow), absolute ΔSVM values (green), phastCons score (blue), and phyloP score (purple), centred on the summits of CEBPA ChIP-seq peaks. C) Distribution of Pearson correlation coefficients between each metric (SVM, ΔSVM, phastCons, phyloP) and the base-level read coverage profile across individual CEBPA peaks.

Outline of RegEvol: from genotype to fitness.
A) Distribution of phenotypic effects (DPE) for all possible point mutations within a ChIP-seq peak (left). The observed substitutions relative to the ancestral sequence represent a subset of this distribution and are used to infer the peak’s evolutionary regime (right). B) Definition of nested selective regimes using Beta distributions parameterised by α and β, which define the shape of the underlying fitness landscape. C) Expected distributions of all possible substitutions under each phenotypic fitness landscape. Vertical orange lines indicate the observed substitutions. D) Likelihood of the observed substitutions under each selective regime. In this example, because the substitutions are biased toward positive phenotypic change, the model of positive directional selection is the most likely.
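
The regime comparison outlined in panels B to D can be illustrated with a toy calculation. The sketch below is not the RegEvol implementation: it assumes phenotypic effects have been rescaled to the open interval (0, 1), that the Beta density is used directly as a relative weight for each possible mutation, and that the regimes correspond to hypothetical parameter choices (α = β = 1 for neutral, α = β > 1 for stabilising, α > β for positive directional). All function names and values are illustrative.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def log_likelihood(observed, all_effects, a, b):
    """Log-likelihood of the observed substitutions under one regime.

    Assumption: each possible mutation is weighted by the Beta density at its
    rescaled phenotypic effect, normalised over all possible mutations.
    """
    total = sum(beta_pdf(x, a, b) for x in all_effects)
    return sum(math.log(beta_pdf(x, a, b) / total) for x in observed)

# Hypothetical rescaled effects of all possible mutations in a peak.
all_effects = [i / 20 for i in range(1, 20)]
# Hypothetical observed substitutions, biased toward positive change.
observed = [0.8, 0.85, 0.9]

# Hypothetical (alpha, beta) values for each regime.
regimes = {
    "neutral": (1.0, 1.0),
    "stabilising": (5.0, 5.0),
    "positive directional": (5.0, 1.0),
}

scores = {name: log_likelihood(observed, all_effects, a, b)
          for name, (a, b) in regimes.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With substitutions concentrated at high rescaled effects, the asymmetric regime receives the highest log-likelihood in this toy setting, mirroring the example described in panel D.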

Detection of directional selection on simulated evolution of Drosophila CTCF peaks.
A) Proportion of peaks simulated under Directional, Stabilising and Random scenarios and detected by RegEvol and the Permutation Test as evolving under positive directional selection (orange), negative directional selection (blue), stabilising selection (green) and random drift (grey). B-E) True positive rate of peaks simulated under directional selection (toward positive (+) or negative (-) changes) detected by RegEvol (red) or the Permutation Test (black), as a function of B) number of substitutions per peak, C) ΔSVM (5% quantiles), D) proportion of random substitutions, and E) selection strength (α parameter). N = 10,000 simulated peaks for each evolutionary scenario; FDR < 0.01.

Sensitivity to extreme substitutions and robustness to ascertainment bias.
Proportion of simulated Drosophila CTCF peaks detected under directional selection using the Permutation Test (black lines) and RegEvol (red lines). Substitutions were either drawn randomly (A, D) or under stabilising selection (B, C; mean(α=β)=45). A-B) False positive rate across 1% quantiles of ΔSVM; C) false positive rate as a function of the number of substitutions; D) fraction of datasets stratified by the highest derived SVM scores. N = 10,000 simulated peaks; FDR < 0.01.

Application of RegEvol to Drosophila melanogaster transcription factor binding sites.
A) Proportion of peaks inferred to be under directional selection using the original permutation test (black; p < 0.01) and RegEvol (red; FDR < 0.05). B) Proportion of peaks under directional selection across the 5% quantiles of derived ΔSVM values for all peaks. C) Ratio of substitutions to single-nucleotide polymorphisms (SNPs) in embryonic TwdlD peaks inferred by RegEvol under directional selection (red; N = 385) or not (blue; N = 877). The odds ratio and Fisher’s exact test between the two categories are indicated. D) Odds ratios between directional and non-directional peaks across datasets (restricted to experiments with >10 directional peaks). Experiments where directional peaks show higher ratios (red; odds ratio > 1), lower ratios (blue; odds ratio < 1), or no significant difference (grey; FDR > 0.05) are shown. E-F) Relationship between purifying selection on protein-coding genes (ω0) and the total number of associated peaks (E) or the proportion of their peaks inferred under directional selection (F). Spearman’s correlation coefficients were computed across all genes.

Tissue-level aggregation of directional selection on human CTCF peaks.
Median SUMSTAT scores across 10,000 resamplings of 5,000 human CTCF peaks per tissue. Bars indicate the standard deviation across resamplings. SUMSTAT scores were calculated following the framework from Daub et al. (2017) by taking the fourth root of the per-peak Δlog-likelihood (Directional minus Neutral model) and summing across all peaks within each tissue. Tissues are sorted by median SUMSTAT and coloured according to organ system.
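
The aggregation described in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, peaks are resampled without replacement (the caption does not specify), and negative Δlog-likelihood values are clipped to zero before taking the fourth root.

```python
import random
import statistics

def sumstat(delta_loglik):
    """Sum of fourth roots of per-peak Δlog-likelihoods.

    Assumption: negative values are clipped to zero so the root is defined.
    """
    return sum(max(d, 0.0) ** 0.25 for d in delta_loglik)

def resampled_sumstat(delta_loglik, n_peaks, n_resamplings, seed=0):
    """Median and standard deviation of SUMSTAT across random subsets of peaks.

    Assumption: each resampling draws n_peaks peaks without replacement.
    """
    rng = random.Random(seed)
    scores = [sumstat(rng.sample(delta_loglik, n_peaks))
              for _ in range(n_resamplings)]
    return statistics.median(scores), statistics.stdev(scores)

# Hypothetical per-peak Δlog-likelihoods (Directional minus Neutral) for one tissue.
rng = random.Random(1)
tissue_peaks = [rng.expovariate(1.0) for _ in range(200)]
median_score, sd_score = resampled_sumstat(tissue_peaks, n_peaks=50, n_resamplings=100)
print(median_score, sd_score)
```

Resampling a fixed number of peaks per tissue keeps the summed score comparable between tissues with different numbers of peaks, which is presumably why the caption reports 5,000 peaks per resampling.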