Mapping TFBS genotype-phenotype landscapes.

a-c) Sort-Seq procedure. We utilized a plasmid-based fluorescence reporter system followed by sort-seq to map TFBS genotype-phenotype landscapes. a. Library generation. For generating our E. coli libraries, we cloned TFBS sequence variants (28 = 65,536), represented as black circles, into our plasmid, between a sigma70 constitutive promoter (shown as a black right-facing arrow) and a gfp gene (shown as a green right-facing arrow), to measure transcriptional repression through fluorescence intensity. When a TF binds to the TFBS, it blocks RNA polymerase activity through steric hindrance, reducing gfp transcription. Thus, lower fluorescence levels (light green) represent stronger TF-TFBS binding. The TFBS variants produce a range of regulation strengths resulting, in variable green fluorescence intensities in bacterial cells (green-colored rectangles). b. Sorting procedure. We sorted libraries into expression bins based on fluorescence intensity using a fluorescence-activated cell sorting (FACS). c. Sequencing. We then sequenced the TFBSs from each bin and mapped their sequence (genotype) to their respective regulation strengths, depicted through a color gradient from dark green (low-affinity TFBS, high fluorescence) to light green (high-affinity TFBS, low fluorescence). d) Genotype-phenotype mapping. To construct a regulatory landscape, we connected TFBS genotypes (colored circles) that differed by a single nucleotide via edges, thereby establishing an interconnected genotype network. Each genotype conveys a regulatory phenotype (strength of regulation, heatmap colors), which can be viewed as the elevation dimension (z-axis) in a landscape.

a. Genotypes in each landscape vary broadly in their regulation strength. The violin plots show the distribution of regulation strengths S (vertical axis) for each TF landscape (horizontal axis). The width of a plot at a given value of S represents the frequency of TFBSs at this value. The vertical length of the box in each violin plot covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR. The white circle shows the regulation strength of the wild type for each landscape (CRP: 0.71, Fis: 0.97, IHF: 0.95). The landscape sizes are as follows. CRP: N=31,975 TFBS variants; Fis: N= 43,222; IHF: N= 41,325. b. Peak genotypes are stronger regulators than non-peak genotypes. Dual violin plots show the distribution of regulation strength (vertical axis) for the three TF landscapes (horizontal axis), stratified by non-peak genotypes (grey, CRP, N=29,821, Fis, N= 40,910, IHF, N= 38,872), and peak genotypes (red, CRP, N= 2,154, Fis, N= 2,312, IHF, N= 2,453). The black tick-mark in each plot indicates the mean regulation strength of both non-peak and peak genotypes taken together (mean ± standard deviation, CRP: 0.37 ±0.1, Fis: 0.57±0.13, IHF: 0.52±0.14). The white circle on each plot marks the regulation strength of the wild-type (CRP: 0.71, Fis: 0.97, IHF: 0.95). c. Sequence logos and nucleotide frequency matrices for strong CRP, Fis and IHF binding sites. Each sequence logo 67,68 is based on an alignment of TFBSs with greater regulation strength than the wild-type for each of the three TFs IHF, Fis, and CRP. Each logo also shows the non-varying position (grey) of the TFBS genotype from which our libraries were created. In each logo, the height of each letter at each TFBS position indicates the information content at that nucleotide position – the taller the letter, the more frequent the nucleotide is in strongly regulating TFBSs 67,68. Similar information is conveyed by the frequency matrices displayed as heat maps below each logo. They represent the variability of each nucleotide at each position (horizontal axis) through a color gradient (see color legend). Tall letters in the sequence logo and yellow letters in the frequency matrix indicate frequent, and thus likely important nucleotides for strong regulation.

Peaks and their basins of attractions.

a. The fraction of accessible paths declines with increasing path length. The figure illustrates how the fraction of accessible paths to high peaks (vertical axis, logarithmic scale) decreases with the length of the shortest accessible path (horizontal axis). Accessible paths are defined as paths where each step increases regulation, starting from a specified initial genotype. Each colored line corresponds to data from a different TF landscape: CRP (green), Fis (orange), and IHF (blue). Circles indicate the mean fraction of accessible paths for a given path length. b. Higher peaks have larger basins of attraction. The split-violin plots display the distribution of basin sizes (vertical axis) for high (red) and low (blue) peaks in the three TF landscapes CRP, Fis, and IHF (horizontal axis). High peaks have significantly larger basins of attraction (Welch Two Sample t-tests; CRP: t-value = 9.0898, df = 63.509, and p-value = 4.22 × 10-13, with mean basin sizes of xhigh = 9528.426 and xlow = 5655.035. Fis: t-value = 7.617, df = 172.35, and p-value = 1.645 × 10-12, with mean basin sizes of xhigh = 14206.70 and xlow = 10931.94. IHF: t-value = 6.1777, df = 201.99, and p-value = 3.521 × 10-9, with mean basin sizes of xhigh = 10965.872 and xlow = 8664.953.) The violin plots show the distribution of basin sizes for each of the two kinds of peaks. Their width represents the frequency of a given basin size. The vertical length of the box in each plot covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR. c. The regulation strength of peak genotypes (horizontal axis) is plotted against the size of their basins of attraction (vertical axis, shown as a percentage of all (non-peak genotypes) for peaks in all three landscapes (color legend). Color-coded values of R at the top of the graph indicate Pearson correlation coefficients for each landscape (color legend). Curved lines are based on a linear regression analysis (in linear space), with grey shading indicating 95% confidence intervals (CRP: R2=0.28, N = 2,154; Fis: R2=0.24, N = 2,312; IHF: R2=0.23, N = 2,434). d. Basins of attractions share many TFBS genotypes.

Peak accessibility, contingency and biases in adaptive walks.

a. More than 10% percent of adaptive walks lead to high peaks. The graph shows the cumulative distribution function (CDF, vertical axis) of regulation strength (horizontal axis) attained at the end of 15 million adaptive walks (15,000 starting genotypes × 1,000 adaptive walks each) for each TF landscape (color legend, CRP=: green; Fis: orange; IHF: purple). Dashed lines intersect the CDF at points equivalent to the wild-type regulation strength for each TF (color legend). They help to infer the proportion yTF of walks that terminate at high peaks (S>SWT), which is indicated by the numerical values at the graph’s top for each of the landscapes (color legend). b. Paths to high peaks are not much longer than genetic distances from the starting genotype. Colored boxplots summarize, for each of the three landscapes (horizontal axis), the distribution of shortest genetic distances (blue) and actual path lengths (green) for adaptive walks terminating at high peaks. Path lengths are typically less than one mutational step longer than genetic distances. Each box covers the range between the first and third quartiles (IQR). The horizontal line within each box represents the median value, and whiskers span 1.5 times the IQR. Numbers atop each plot indicate means ± one standard deviation. c. Different numbers of high peaks are reached from different starting variants. Violin plots integrated with boxplots show the distribution in the number of high peaks reached (vertical axis) from each starting variant that attained any high peak during 10,000 adaptive walks. The number of these variants is shown as an integer on top of the panel, and its percentage among all 15,000 starting variants is shown in parentheses (color legend). Underneath it, the panel shows the mean ± 1 s.d. of the number of attained high peaks. a. d. Some high peaks are reached more often than others. We randomly and uniformly sampled 10 starting genotypes from the CRP landscape, started 103 adaptive walks from each, and recorded the number and frequency of distinct high peaks attained in these random walks. Results for each starting genotype are symbolized by a vertical bar. The number of stacks within each bar (delineated by horizontal lines, also indicated by an integer above each bar) indicates the number of high peaks reached by the 103 adaptive walks. Starting variants are ordered in ascending order based on this number of attained peaks. Stack height indicates the fraction of walks that reached the same peak, and is indicated in red, orange, and yellow for the three most frequently attained peaks. See Supplementary Figure S32 for the Fis and IHF landscapes.

Plasmids used in this study

Table of strains

Libraries used in the present study.

TFBS reference (“wild-type”, WT) sequences used in the present study.

Global landscape properties

Primers for engineering the pCAW-Sort-Seq-V2 plasmid and for cloning libraries.

Construction of pCAW-Sort-Seq-V2.

The plasmid pCAW-Sort-Seq-V2 is a derivative of plasmid pCAW-Sort-Seq27. The figure illustrates the sequential cloning steps involved in the plasmid’s construction, from the bottom upwards. For an in-depth explanation, see Supplementary Methods. Briefly, we began with the synthesis and cloning of a codon-optimized mscarlet gene expression cassette into the pCAW-Sort-Seq plasmid, oriented antiparallel to the tetracycline resistance gene (tetr). This generated the intermediate plasmid pCAW-Sort-Seq-mscarlet plasmid. Subsequently, we inserted each TF gene, exemplified by the crp gene for the purpose of this representation, upstream of the mscarlet gene in a bicistronic operon configuration. This step created the TF-specific expression plasmid variants (one each for CRP, Fis and IHF). The final stage, not depicted here, involved the incorporation of the appropriate wild-type TFBSs and libraries into each plasmid variant upstream of the gfp gene, which yielded the library plasmids we utilized in our sort-seq experiments.

Components of plasmid pCAW-Sort-Seq-V2.

The plasmid system pCAW-Sort-Seq-V2 is a derivative of the plasmid pCAW-Sort-Seq27. It encodes the broad-host, low-copy number replication origin pBBR1 170 (5 to 10 copies per cell), together with an origin of transfer (OriT) for conjugation (blue), and a chloramphenicol resistance gene (yellow). The interchangeable regulatory region where the TFBS is located (cyan) is placed between a constitutive promoter (BBa_J23110193) and a superfolder GFP (sfgfp, green) fluorescent reporter gene172. The transcriptional insulator RiboJ173 is located upstream of the sfgfp gene. The tetr gene (purple) is derived from the original Tn10 transposon51,73 under the control of a low-strength constitutive promoter (pLac promoter variant developed in ref. 194). TetR regulates the expression of a bicistronic operon encoding the TF of interest (represented by CRP, dark purple), and an mScarlet-I red fluorescent reporter174 (red). Promoters and their respective insulators195 are represented as white arrows. Transcriptional terminators are represented as black arrows196. Ribosome binding sites (RBSs) are represented as pink arrows.

Growth profile for strains used in this study.

Growth profiles for all strains used for landscape mapping are represented as optical densities (OD600, y-axis) measured over 7 hours of growth at 1-hour intervals (x-axis). We transformed all strains with the plasmid pCAW-Sort-Seq-V2 and grew them in LB medium without inducers. Colors (legend) represent the different strains. Wild-type (purple) refers to the BW25113 parent-strain from which the mutants (Δcrp, Δfis, Δihf) were derived. Each circle represents the mean of three biological replicates independently measured in a plate reader. Shaded areas correspond to one standard deviation calculated from the three replicates.

Sorting procedure for CRP library.

Each density plot is based on a histogram representing the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while the vertical axis indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) CRP library in pCAW-v2 without TF expression (“Lib-CRP”) and 3) CRP library in pCAW-v2 with TF expression (“Lib +CRP”). To create each density plot from the histogram data, we performed density smoothing using a Gaussian kernel function with the aid of the ggplot2 package197. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent fluorescence bins/gates into which we sorted E.coli cells, spaced equally on a binary logarithmic (log2) scale.

Sorting procedure for Fis library.

Each density plot represents the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while each histogram indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) Fis library in pCAW-v2 without TF expression (“Lib-Fis”) and 3) Fis library in pCAW-v2 with TF expression (“Lib +Fis”). We performed density smoothing using a Gaussian kernel function to create a smooth density plot for each histogram with the aid of the ggplot2 package 197. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent bins/gates for each sample, spaced equally on a binary logarithmic (log2) scale.

Sorting procedure for IHF library.

Each density plot represents the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while each histogram indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) IHF library in pCAW-v2 without TF expression (“Lib-IHF”) and 3) IHF library in pCAW-v2 with TF expression (“Lib +IHF”). We performed density smoothing using a Gaussian kernel function to create a smooth density plot for each histogram with the aid of the ggplot2 package156. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent bins/gates for each sample, spaced equally on a binary logarithmic (log2) scale.

Correlation of read counts among replicates of the CRP library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for CRP. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Correlation of read counts among replicates of the Fis library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for Fis. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Correlation of read counts among replicates of the IHF library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for IHF. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Regulation strength distribution across the libraries

a. CRP library. The histogram illustrates the distribution of regulation strengths (normalized to the maximally observed value) among 31,975 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.37), while the wild type’s (WT = 0.71) strength is denoted by a dashed red line. A total of N=84 variants surpass the WT in regulation strength. B. Fis library. The histogram illustrates the distribution of regulation strengths (normalized to the maximum observed value) among 43,222 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.57), while the wild type’s (WT = 0.97) strength is denoted by a dashed red line. A total of N=172 variants surpass the WT in regulation strength. c. IHF library. The histogram illustrates the distribution of regulation strengths (normalized to the maximum observed value) among 41,325 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.52), while the wild type’s (WT = 0.95) strength is denoted by a dashed red line. A total of N=199 variants surpass the WT in regulation strength.

Validating regulation strengths.

In this analysis, we validated the regulation strengths of 10 selected TFBSs for each bin and for each library (130 sequences per library). These TFBSs covered a wide range of regulation strengths determined from sort-seq data. The horizontal axis of each panel shows the regulation strengths of the 130 binding sites as quantified through sort-seq. The vertical axis shows the fluorescence levels of the same binding sites, as quantified by a plate reader (arbitrary units). The top of each panel shows the result of a test of the null-hypothesis that the Pearson correlation coefficient R between the two quantities is equal to zero, together with a 95% confidence interval. The shaded grey area surrounding the regression line indicates the 95% confidence interval, i.e., the probability that the true linear regression line of the population lies within the confidence interval of the regression line calculated from the sample data at a confidence level of 95%. R2 indicates the coefficient of determination. Each circle in the plot represents the mean of three independent (biological) replicate fluorescence measurements in a plate reader after eight hours of growth. Circle colors indicate GFP fluorescence from low (dark green) to high (light green a. Association between plate reader fluorescence and regulation strengths for the CRP library. R = −0.81, P < 0.001; R² = 0.66, P < 0.001, N=130. b. Association between plate reader fluorescence and regulation strengths for the Fis library. R = −0.74, P < 0.001; R² = 0.54, P < 0.001, N=130. c. Association between plate reader fluorescence and regulation strengths for the IHF library. R = −0.73, P < 0.001; R² = 0.53, P < 0.001 N = 130

Adaptive peaks and their regulation strengths across TF landscapes.

a. Adaptive peaks in the CRP landscape. The distribution of regulation strengths across the landscape’s 2,154 adaptive peaks is shown. The WT’s regulation strength (0.71) is marked by a horizontal dashed red line for reference. A total of N=61 peaks have higher regulation strength than the WT. The color gradient indicates genotype density along the regulation strength axis, with the histogram on the right detailing the regulation strength distribution. b) like a) but for the Fis landscape, based on its 2,312 adaptive peaks. (WT regulation strength: 0.97) N=172 peaks regulate expression more strongly than the WT. c) like b) but for the IHF landscape, based on its 2,434 adaptive peaks. (WT regulation strength: 0.95) N=199 peaks regulate expression more strongly than the WT.

Quantitative analysis of CRP landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e,. the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24(=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks”. Mean relative connectivities and standard deviations, are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 4.59 × 10-77, N1= 29,821; N2 = 2,093). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 1, N1 = 29,821; N2 = 61). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The weak association (Pearson correlation, R = −0.13; t = −23.888; degrees of freedom (df) = 31,973; p-value < 2.2 × 10-16) indicates that strongly regulating genotypes dare not necessarily densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of CRP landscape based on regulation strength. Visual overview of the CRP landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). It indicates the density of genotypes within each of the four categories through a color gradient from blue (low density) to yellow (high density). e. Visualization of CRP landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color-codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Like panel d, but for four categories of relative connectivity (no data, low, medium, high).

Quantitative analysis of the Fis landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e., the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24 (=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks.” Mean relative connectivities and standard deviations are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 4.59 × 10-77, N1 = 40,910; N2 = 2,312). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 9.31 × 10-13, N1 = 40,910; N2 = 172). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The weak association (Pearson correlation, R = −0.13; t = −23.888; degrees of freedom (df) = 31,973; p-value < 2.2 × 10-16) indicates that strongly regulating genotypes are not necessarily densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of Fis landscape based on regulation strength. Visual overview of the Fis landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm 182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). The density of genotypes within each category is indicated by a color gradient from blue (low density) to yellow (high density). e. Visualization of Fis landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Similar to panel d, but for four categories of relative connectivity (no data, low, medium, high).

Quantitative analysis of the IHF landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e., the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24 (=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks.” Mean relative connectivities and standard deviations are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 1.14 × 10-77, N1 = 38,872; N2 = 2,254). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 9.31e-13, N1 = 38,872; N2 = 199). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The slight positive correlation (Pearson correlation, R = 0.017; t = 3.51; degrees of freedom (df) = 41,323; p-value = 0.0004485) indicates that genotypes with elevated regulation strength are only marginally more likely to be densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of IHF landscape based on regulation strength. Visual overview of the IHF landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm 182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). The density of genotypes within each category is indicated by a color gradient from blue (low density) to yellow (high density). e. Visualization of IHF landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Similar to panel d, but for four categories of relative connectivity (no data, low, medium, high).

Peaks are spread across each landscape.

The vertical axis shows the genetic distance d between peaks and an equal number of randomly chosen non-peak genotypes (color legend) for the CRP (N= 2,154), Fis (N= 2,312), and IHF (N= 2,434) landscape (horizontal axis). (CRP: dpeak= 6, drandom= 5.94, t-statistic= 52.49, p-value <0.01; Fis: dpeak = 6, drandom = 5.97, t-statistic= 30.96, p-value <0.01; IHF: dpeak = 6, drandom = 5.97, t-statistic= 34.07, p-value <0.01; null-hypothesis: means do not differ). Although the differences are statistically significant, peaks are almost as dispersed within genotype space as non-peaks. Each box covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR.

Genetic diversity among peaks across TF landscapes.

a. CRP landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,154 peaks (red) versus 2,154 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing a slight difference between peaks (d=5.6, in red) and non-peaks (d=5.5, in grey), demonstrating large genetic diversity within the peaks (Welch Two Sample t-test, t = 78.299, df = 4637305, p-value < 2.2 × 10-16). b. Fis landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,312 peaks (red) versus 2,312 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing no significant difference between peaks (d=5.7, in red) and non-peaks (d=5.7, in grey), demonstrating large genetic diversity within the peaks (t = 27.234, df = 5412460, p-value < 2.2 × 10-16). c. IHF landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,434 peaks (red) versus 2,434 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing no significant difference between peaks (d=6, in red) and non-peaks (d=6, in grey), demonstrating large genetic diversity within the peaks (t = 54.689, df = 6014743, p-value < 2.2 × 10-16).

Genetic diversity among high peaks across TF landscapes.

a. CRP landscape. The overlayed histograms compare the genetic diversity between 61 high regulation strength peaks (red) and 61 non-peak, randomly selected variants (grey). No significant difference exists (W) between high-regulation strength peaks (mean d=5.4, in red) compared to randomly selected genotypes (mean d=5.5, in grey; Welch Two Sample t-test, t = −0.78555, df = 3653.7, p-value = 0.4322). c. Fis landscape. The overlayed histograms compare the genetic diversity between 172 high-regulation strength peaks (red) and 172 non-peak, randomly selected genotypes (grey). No significant difference exists between high-regulation strength peaks (mean d=5.4, in red) and random genotypes (mean d=5.8, in grey; −16 Welch Two Sample t-test, t = −16.634, df = 29357, p-value < 2.2 × 10). c. IHF landscape. The overlayed histograms compare the genetic diversity between 199 high regulation strength peaks (red) and 199 non-peak, randomly selected genotypes (grey). The analysis reveals no significant difference between high regulation strength peaks (mean d=5.8, in red) compared to the random variants (mean d=5.8, in grey; Welch Two Sample t-test, t = 0.73365, df = 39366, p-value = 0.4632).

Principal component analysis of the CRP landscape.

a. PCA plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 31,975 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure in the genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot is a way to visualize the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component, and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component, and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g. G18 stands for a guanine at position 18 of the binding site).

Principal component analysis of the Fis landscape

a. PCA plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 43,222 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure of the data in genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot visualizes the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both the length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g., G18 stands for a guanine at position 18 of the binding site).

Principal Component Analysis of the IHF Landscape

a. PCA Plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 41,325 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure in the genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot is a way to visualize the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both the length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g., G18 stands for a guanine at position 18 of the binding site).

Accessible paths to high peaks are less than two mutational steps longer than the shortest genetic distances to the peak.

Each composite histogram shows the distribution of genetic distances (blue) and shortest accessible path length (green) for all pairs of high peaks and non-peak genotypes for one of the three landscapes. We used a t-test of the null hypothesis that the means of these distance distributions are statistically indistinguishable. This null hypothesis is rejected for all three landscapes. a. CRP Landscape. Mean genetic distance: 5.74; mean number of mutational steps in shortest accessible paths: 7.2; t = 500.17, df = 1,212,712, p < 2.2 × 10-16. b. Fis Landscape. Mean genetic distance: 5.98; mean number of mutational steps in shortest accessible paths: 7.7; t = −1271, df = 5,639,001, p < 2.2 × 10-16. c. IHF Landscape. Mean genetic distance: 5.78; mean number of mutational steps in shortest accessible paths: 7.6; t = −1365, df = 6,636,001, p < 2.2 × 10-16.

Moderate positive correlation between peak regulation strength and basin size for the three regulatory landscapes. a. CRP landscape. The scatter plot shows the association between regulation strength conveyed by a peak variant (horizontal axis, colors from dark blue [low] to yellow [high]) and the size of its basin of attraction (absolute numbers on the left vertical axis, percentages on the right vertical axis, note the logarithmic scales). The dashed vertical line represents the regulation strength of the wild type. The black curve represents a linear regression line (in linear space). R is the Pearson correlation coefficient (R=0.39) and R2 is the coefficient of determination for the linear regression model (R2 =0.18, N = 2,154) b. Fis landscape. Like panel a (R=0.37, R2 =0.15, N = 2,312), c. IHF landscape. Like panel a (R=0.41, R2 =0.19, N = 2,434).

The basins of attractions of high peaks overlap.

Each heatmap matrix quantifies the proportion of shared TFBS variants in the basins of attractions of all pairs of high peaks (Methods). Each matrix entry quantifies the basin overlap for one pair of high peaks using the Jaccard index (Supplementary Methods 7.5), where row and column indices correspond to specific pairs of peaks. The maximum overlap of one (red) signifies that two basins share all variants, while the minimum value of zero (blue) indicates that they share no variants. a. Basin overlap in the CRP landscape. The average Jaccard index is 0.461 with a standard deviation of ±0.144, based on 3,721 pairwise comparisons. b. Basin overlap in the Fis landscape. The average Jaccard index is 0.426 with a standard deviation of ±0.136, derived from 29,584 pairwise comparisons. c. Basin overlap in the IHF landscape. The average Jaccard index is 0.364, with a standard deviation of ±0.145, based on 39,601 pairwise comparisons.

Distribution of the fraction of accessible high peaks per TFBS variant across TF landscapes.

The figure shows violin plots with embedded boxplots, for the distribution of high peak accessibility of each variant within the CRP (green), Fis (orange), and IHF (purple) landscape. The y-axis quantifies peak accessibility as the fraction of the total number of high peaks that can be accessed from each (non-peak) genotype within each TF landscape, revealing the extent to which individual variants can potentially access different high peaks. In the CRP landscape, each genotype can access, on average, a proportion of 0.314±0.315 (mean±s.d.) high peaks from the total number of high peaks, with a median value of 0.194. Notably, 5,152 out of 29,821 genotypes (17%) can access no high peak. Fis: mean high peak accessibility: 0.363±0.311 (median: 0.305), with 4,304 out of 40,895 (10%) genotypes lacking access to any high peaks. IHF: mean peak accessibility: 0.301±0.289 (median: 0.205); 4,629 out of 38,872 genotypes (12%) show lack access to high peaks.

Kimura adaptive walks on the CRP landscape a-c. Fraction of adaptive walks attaining TFBS variants with strong regulation for different population sizes. The vertical axis indicates the cumulative fraction of adaptive walks on the CRP landscape that end at a genotype with a regulation strength shown on the horizontal axes. Data is based on 15 million simulated adaptive walks (15,000 random starting genotypes × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for CRP, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (Horizontal line, blue lettering). For example, a value of y=0.82 implies that 82% of walks terminate at peaks lower than the wild-type. The remaining 18% reach higher peaks. a. Population size 102. 18% of walks terminate at peaks higher than the wild-type. b. Population size 105. 10% of the walks terminate at higher peaks. c. Population size 108. 10% of walks terminate at higher peaks. d-f. Number of high peaks reached by a single adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 15 × 106 adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong, because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 102, e 105, and f 108.

Kimura adaptive walks on the Fis landscape. a-c. Fraction of adaptive walks attaining high peaks for different population sizes. The vertical axis indicates the cumulative percentage of adaptive walks on the Fis landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15 million simulated adaptive walks (15,000 starting genotypes × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for Fis, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a y-value of 0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% reaching higher peaks. a. Population size 10². 27% of walks terminate at peaks higher than the wild-type. b. Population size 10⁵. 13% of walks terminate at higher peaks. c. Population size 10⁸. 13% of walks terminate at higher peaks. d-f. Number of high peaks reached by individual adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 150 million adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 10², e 10⁵, and f 10⁸.

Kimura adaptive walks on the IHF landscape. a-c. Fraction of adaptive walks attaining high peaks for different population sizes. The vertical axis indicates the cumulative percentage of adaptive walks on the IHF landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15 million simulated adaptive walks (15,000 starting points × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for IHF, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a y-value of 0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% reaching higher peaks. a. Population size 10². 31% of walks terminate at peaks higher than the wild-type. b. Population size 10⁵. 15% of walks terminate at higher peaks. c. Population size 10⁸. 15% of walks terminate at higher peaks. d-f. Number of high peaks reached by individual adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 150 million adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 10², e 10⁵, and f 10⁸.

Fraction of greedy adaptive walks attaining high peaks for different TFs. a-c. Fraction of greedy adaptive walks attaining high peaks for different TFs. The vertical axis indicates the cumulative distribution function (CDF) of adaptive walks on each TF landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15,000 adaptive walks (15,000 starting points × 1 adaptive greedy walk for each) for each TF landscape. The vertical dashed line indicates the regulation strength of the wild-type TFBS for their respective TFs and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a value of y=0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% terminating at higher peaks. a. CDF for adaptive walks in the CRP landscape. Note the intersection at y= 0.89, meaning that 11% of the walks terminated on high peaks. b. CDF for adaptive walks in the Fis landscape. Note the intersection at y= 0.85, meaning that 15% of the walks terminated on high peaks. c. CDF for adaptive walks in the IHF landscape. Note the intersection at y= 0.82, meaning that 18% of the walks terminated on high peaks.

Diversity of alternative shortest paths for pairs of starting genotypes and their respective attained peaks.

Each violin plot is based on data from 15,000 randomly (and uniformly) sampled starting (non-peak) genotypes from each TF landscape – CRP, Fis and IHF (see color legend) – and on 1’000 Kimura adaptive walks per starting genotypes. For each pair of a starting genotype and a high peak that is attainable from this starting genotype, we calculated the number of distinct shortest paths reaching the peak (vertical axis, note logarithmic scale). The width of the violin plot at any given y-axis value represents the density of data points at that value. Wider sections indicate a higher density of data points. The text at the top of each violin indicates the median number of shortest paths and the interquartile range (IQR) for each TF. For example, in the CRP landscape peaks are reached via a median number (± IQR) of 4 (± 4) shortest paths, with a sample size (number of starting genotypes × attainable peak pairs) of N = 345,224. The violin regions appear discretized because the number of shortest paths is a discrete variable, not a continuous one. This means that the data can only take on specific integer values, resulting in discrete regions within the violin plots with a non-zero density of data points.

Peak genotypes conveying stronger regulation are also reached more often during adaptive evolution.

Each scatter plot represents the association between the strength of regulation conveyed by a peak genotype (horizontal axis) and the number of times the peak is attained (vertical axis, note the logarithmic scale). The data for each plot is based on 15,000 randomly (and uniformly) sampled starting (non-peak) genotypes from each TF landscape – CRP, Fis and IHF (see color legend) – and on 1’000 Kimura adaptive walks per starting genotypes (15×106 adaptive walks per landscape). The heatmap colors represent the density of peak genotypes from low (dark blue) to high (yellow). The dashed vertical line represents the regulation strength of the wild type TFBS. The red curve is derived from linear regression (in linear space). P-values in each panel result from a test of the null hypothesis that the two variables are not associated (Pearson correlation coefficient R=0; Coefficient of determination R2=0). a. CRP landscape (R= 0.56; R2 =0.34, N = 2,154). b. Fis landscape. (R=0.6, R2 =0.36, N = 2,312), c. IHF landscape. (R=0.62, R2 =0.4, N = 2,434).

The frequency distribution of attained high peaks is biased towards some frequently attained peaks.

We randomly and uniformly sampled 10 starting genotypes, started 103 adaptive walks from each, and recorded the number and frequency of distinct high peaks attained in these random walks. Each starting variant is symbolized by a vertical bar. The number of stacks within each bar (delineated by horizontal lines, also indicated by an integer above each bar) shows the number of high peaks reached by the 103 adaptive walks. Starting variants are ordered in ascending order based on this number of attained peaks. Stack height indicates the fraction of walks that reached the same peak, and is indicated in red, orange, and yellow for the three most frequently attained peaks. a. Fis landscape. b. IHF landscape.

The frequency distribution of traversed paths to high peaks is biased towards some paths that are frequently traversed.

We randomly and uniformly sampled 20 starting genotypes, initiated 103 adaptive walks from each, and recorded the number and frequency of distinct paths traversed by these random walks for each attained peak. We then randomly and uniformly sampled a high peak attained by each starting variant for our analysis. Each starting variant is represented by a vertical bar. The number of stacks within each bar indicates the number of unique paths traversed to reach the sampled high peak. Stack height represents the fraction of walks that reached the high peak, with the three most frequently transversed paths in red, orange, and yellow. a. CRP landscape. b. Fis landscape. c. IHF landscape.