Introduction

Transcriptional regulation controls the ability of RNA polymerase to initiate the transcription of a gene into mRNA1. It is crucial in the life of all organisms, orchestrating the expression of genes in response to environmental cues and cellular states2,3. Transcriptional regulation is mediated by transcription factors (TFs), proteins that bind DNA near a gene at short DNA words known as transcription factor binding sites (TFBSs). The binding of a TF to its binding site on DNA can either hinder or facilitate transcription initiation by RNA polymerase1,4,5. The stronger the binding of a TF to its TFBS is, the more strongly the TF can regulate a nearby gene69. In bacteria, TFs operate within a hierarchically organized gene regulatory network. At the bottom of this hierarchy are local TFs that regulate the expression of one or few genes and modulate specific biological processes 1012. At the top are global TFs that may regulate hundreds of genes and control many cellular functions 11,1315. Consistent with their broader role, global regulators are typically more highly expressed and bind to a broader range of TFBSs than local regulators12.

Both TFs and their TFBSs evolve, but TFBSs evolve more rapidly. The reason is that a mutation in a TF can affect the expression of many genes, whereas a mutation in a TFBS may affect only one gene and is less likely to be deleterious 13,1618. A special case of TFBS evolution is the evolution of a strong TFBS from a DNA word with weak or no regulatory activity. Such de novo evolution of TFBSs can create new regulatory interactions and change the structure of gene regulatory networks1922. Unfortunately, we know little about the ability of Darwinian evolution to create TFBSs de novo. A strong TFBS may have to arise from a non-binding site through an evolutionary path of multiple mutational steps. Darwinian evolution can favor this process only if strong binding is adaptive and if each mutational step in a path increases binding strength, i.e., if the path is accessible to Darwinian evolution. To find whether such paths exist may require the analysis of multiple paths. This is challenging because sequence space contains an astronomical number of potential TFBSs for any one TF. The number of evolutionary paths to strong transcriptional regulation is thus also astronomical. For each such path, the strength of each TFBS along the path has to be measured experimentally2326.

Building on our previous work on a local TF27, we take a step to address this challenge for three global regulators in Escherichia coli. Specifically, we perform three independent and massively parallel experiments23 using a plasmid-based expression system for each TF. Our synthetic biology approach allows us to explore the regulatory effects of tens of thousands of TFBS variants in a high-throughput manner. We quantify how strongly each of more than 30,000 binding sites for each TF can regulate the expression of a nearby reporter gene.

The first TF we study is the cAMP receptor protein (CRP). In the absence of glucose, CRP modulates the expression of genes mostly involved in carbon metabolism. It allows E. coli to efficiently switch between sources of carbon and energy 2831. The second TF is the factor for inversion stimulation (Fis), which helps to regulate the expression of genes involved in growth phase transitions. It also modulates the supercoiling of DNA3234 and influences DNA replication, recombination, and repair3234. The third factor is the integration host factor (IHF). IHF regulates genes involved in stress responses and stationary phase survival35,36. Similar to Fis, it is involved in DNA compaction, replication, and recombination, but also in the assembly of complex nucleoprotein structures35,36. We chose these factors because they are the most global regulators in E. coli, and they are diverse, belonging to different protein families.

We use the regulatory data for each TF’s binding sites to map a genotype-phenotype landscape for these binding sites 37,38. Such a landscape is a conceptual analog of a physical landscape, where each location corresponds to one genotype in sequence space, and the elevation at this location corresponds to a phenotype – the ability to regulate gene expression 6,27,39. One of our two main aims is to provide the first characterization of such landscapes for global bacterial regulators, and to find out whether these landscapes are different or similar.

When strong regulation of a gene is adaptive, a regulatory landscape becomes an adaptive landscape or fitness landscapes 38,4042. Our second aim is to understand how populations would evolve on each of our three landscapes when they are viewed as fitness landscapes. Specifically, we study how evolving populations would traverse each landscape through individual mutational steps that change a TFBS its ability to regulate a gene via its cognate TF. A peak in such a landscape is a TFBS conveying stronger regulation than all neighboring TFBSs in sequence space. If such a landscape is rugged (has multiple peaks), natural selection alone may not enable a population to discover the highest peaks, i.e., the strongest TFBSs. The reason is that the peaks may be separated by valleys of low regulation strength that cannot be traversed by natural selection alone 4345 38. In other words, high peaks may not be easily accessible through Darwinian evolution – they may be reachable by few or no evolutionary paths of single DNA mutations where each mutational step increases TFBS strength6. One factor that can reduce peak accessibility is epistasis – non-additive effects of two or more mutations on a phenotype. In addition, epistasis can reduce the predictability of evolutionary trajectories towards a peak 37,46.

To accomplish both aims, we first studied the topography of the three landscapes, and then simulated the dynamics of evolving populations on them. All three landscapes are highly rugged. They contain more than 2,000 peaks that are scattered through genotype space and are also rife with epistatic interactions. Despite these features, evolving populations can reach the strongest TFBSs in all three landscapes. Which of several high peak is reached is contingent on chance events during adaptive evolution.

Results

Landscape mapping

We constructed a plasmid system (Supplementary Table S1, Supplementary Figures S1-S2 and Supplementary Methods 2-3) to inducibly express each of the three global TFs CRP, Fis, and IHF in E. coli. We expressed each factor in a host strain in which its encoding chromosomal gene had been knocked out (Δcrp, Δfis and Δihf, see Supplementary Table S2). This ensures that each TF is expressed only upon induction. Expression of each TF can be induced with anhydrotetracycline (atc), which allowed us to control TF expression levels and measure how the binding of the TF to DNA can influence transcriptional regulation by measuring the expression of a green fluorescent reporter gene (gfp) (Figure 1a). In our plasmid system, stronger binding of a TF to DNA leads to lower GFP expression. The reason is that the TF’s binding site is located directly upstream of the gfp gene, and TF-DNA binding thus downregulates GFP expression by interfering with gfp transcription. Our three plasmid-strain pairs attained similar cell densities during late exponential or early stationary phase (Supplementary Figure S3) where we performed all our measurements.

Mapping TFBS genotype-phenotype landscapes.

a-c) Sort-Seq procedure. We utilized a plasmid-based fluorescence reporter system followed by sort-seq to map TFBS genotype-phenotype landscapes. a. Library generation. For generating our E. coli libraries, we cloned TFBS sequence variants (28 = 65,536), represented as black circles, into our plasmid, between a sigma70 constitutive promoter (shown as a black right-facing arrow) and a gfp gene (shown as a green right-facing arrow), to measure transcriptional repression through fluorescence intensity. When a TF binds to the TFBS, it blocks RNA polymerase activity through steric hindrance, reducing gfp transcription. Thus, lower fluorescence levels (light green) represent stronger TF-TFBS binding. The TFBS variants produce a range of regulation strengths resulting, in variable green fluorescence intensities in bacterial cells (green-colored rectangles). b. Sorting procedure. We sorted libraries into expression bins based on fluorescence intensity using a fluorescence-activated cell sorting (FACS). c. Sequencing. We then sequenced the TFBSs from each bin and mapped their sequence (genotype) to their respective regulation strengths, depicted through a color gradient from dark green (low-affinity TFBS, high fluorescence) to light green (high-affinity TFBS, low fluorescence). d) Genotype-phenotype mapping. To construct a regulatory landscape, we connected TFBS genotypes (colored circles) that differed by a single nucleotide via edges, thereby establishing an interconnected genotype network. Each genotype conveys a regulatory phenotype (strength of regulation, heatmap colors), which can be viewed as the elevation dimension (z-axis) in a landscape.

For each plasmid expressing one of our three TFs, we cloned an entire library of TFBSs upstream of the gfp reporter. Specifically, in this library, we randomized each of the eight most important (information-rich) base pair positions within the TFBS (see Supplementary Methods 4 and Supplementary Table S3). We obtained these most information-rich positions for each TF from the alignment of known TFBSs in the RegulonDB database47. Each library comprised 48=65,536 unique sequences. We constructed three replicates of each library (Supplementary Methods 4 and 5).

We then mapped these TFBS genotypes to their respective regulatory phenotypes using a well-established technique known as sort-seq23,39,4850 (Figure 1b), which combines fluorescence-activated cell sorting (FACS) with high-throughput sequencing (Figure 1c). In sort-seq, one first sorts cells into multiple fluorescence “bins” depending on the level of GFP expression. A cell’s GFP fluorescence serves as a proxy for GFP expression levels51 and quantifies how strongly the TFBS library member in this cell can regulate GFP expression (Methods, Supplementary Methods 5).

The results of our sort-seq experiments are three maps – one for each transcription factor – from each of more than 30,000 genotypes (TFBS variants) to regulatory phenotypes (regulation strength). One can view each map as a genotype-phenotype landscape that we also call a regulatory landscape8,9,27,39,5256 (Figure 1d). Whenever strong regulation entails high fitness, such a landscape is also an adaptive landscape or fitness landscape6,38. For the purpose of analyzing the landscape quantitatively, we represent it as a network of genotypes (TFBS variants). Each node in this network corresponds to a genotype. Edges link neighboring genotypes, which differ in a single nucleotide (Figure 1d).

Landscapes exhibit diverse regulation strengths and distribution breadths

To evaluate the ability of our library to regulate gene expression, we first measured the distribution of GFP fluorescence intensities across the bins in two conditions, i.e., in the presence or absence of the TF (with or without the atc inducer, Supplementary Figures S4-S6). In the presence of the TF, GFP expression was lower on average and showed a broader distribution than in the absence of the TF (Supplementary Figures S4-S6). This indicates that the TFBSs in each library can indeed downregulate GFP expression, but some library variants convey stronger regulation than others, hence the broader fluorescence distribution.

We then pooled barcoded DNA sequences extracted from cells in each bin and biological replicate. We sequenced at least 250 unique TFBS genotypes from each bin, with an average of 3992.3±4293.9 unique sequences per bin for the three TFs (as detailed in Methods, Supplementary Methods 7, and Supplementary Figures S4-S6). The resulting sequences covered 95%, 90%, and 93% of the total library sizes (N = 65,536 genotypes) for CRP, Fis and IHF, respectively. To ensure the reliability of our data, we excluded sequences with low read coverage and sequences that were not present in all triplicates (Methods, Supplementary Methods 7.3). This quality filtering step resulted in library sizes of 31,975 genotypes for CRP (49% of all 48 genotypes), 43,222 genotypes for Fis (66%), and 41,325 genotypes for IHF (63%). The correlation in read counts for each variant across replicates was very high for this quality-filtered data (Pearson’s R= 0.98-0.99) (Supplementary Figures S7-S9). We used this data for all further analyses.

The fluorescence and sequence data from each bin allowed us to quantify the variant’s ability to regulate gene expression for each TFBS library variant. We refer to the resulting metric as the regulation strength S conveyed by the variant (Methods, Supplementary Methods 7.2, Figure 1d). It ranges between S=0 (no regulation) to S=1 (strongest regulation among all variants in the library). Although our experiments directly quantify expression regulation, and not the affinity or binding strength of a TF to a TFBS variant, we also use binding strength as a proxy for regulation strength, because TF-DNA binding is necessary for regulation1,4,11,5759. To each of our three TFs, we also assigned a reference TFBS that is naturally occurring and conveys strong regulation by the TF, as proven by previous experimental work31,6062. We refer to this TFBS as the wild-type (WT 60,63 WT 61,64 and WT 62,65, Supplementary Table S4). We refer to TFBSs with regulation strengths below and above the wild-type (WTCRP: SWT=0.71, WTFis: SWT=0.97, WTIHF: SWT=0.95) as weak and strong, respectively.

We observed a broad range of regulation strengths S for each TF landscape, with varying spreads. The CRP landscape exhibited the lowest average regulation strength and a narrower distribution of S compared to Fis and IHF, which had similar distributions (mean±s.d., 0.37±0.1 for CRP, 0.57±0.13 for Fis, and 0.52±0.14 for IHF; see Figure 2a, Supplementary Figure S10). Next, we analyzed the strongly regulating TFBSs to identify nucleotides that may be particularly frequent and thus, potentially important for strong regulation (Figure 2b). We discovered a moderate association between the most frequent nucleotides (highlighted in yellow in Figure 2b) and the most informative nucleotides from the available position weight matrices (PWMs) for these TFs66. (Pearson correlation coefficients: R=0.51, R=0.43, R=0.47 for Fis, CRP and IHF, respectively, with p-values smaller than 10-16, rejecting the null hypothesis of an absence of association). This observation suggests that our logos capture different information compared to available PWMs. This is expected, because our approach allows us to filter sequences by regulatory strength thresholds to construct our matrices, unlike the traditional method of aligning genomic TFBSs without considering their regulation strengths67,68.

a. Genotypes in each landscape vary broadly in their regulation strength. The violin plots show the distribution of regulation strengths S (vertical axis) for each TF landscape (horizontal axis). The width of a plot at a given value of S represents the frequency of TFBSs at this value. The vertical length of the box in each violin plot covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR. The white circle shows the regulation strength of the wild type for each landscape (CRP: 0.71, Fis: 0.97, IHF: 0.95). The landscape sizes are as follows. CRP: N=31,975 TFBS variants; Fis: N= 43,222; IHF: N= 41,325. b. Peak genotypes are stronger regulators than non-peak genotypes. Dual violin plots show the distribution of regulation strength (vertical axis) for the three TF landscapes (horizontal axis), stratified by non-peak genotypes (grey, CRP, N=29,821, Fis, N= 40,910, IHF, N= 38,872), and peak genotypes (red, CRP, N= 2,154, Fis, N= 2,312, IHF, N= 2,453). The black tick-mark in each plot indicates the mean regulation strength of both non-peak and peak genotypes taken together (mean ± standard deviation, CRP: 0.37 ±0.1, Fis: 0.57±0.13, IHF: 0.52±0.14). The white circle on each plot marks the regulation strength of the wild-type (CRP: 0.71, Fis: 0.97, IHF: 0.95). c. Sequence logos and nucleotide frequency matrices for strong CRP, Fis and IHF binding sites. Each sequence logo 67,68 is based on an alignment of TFBSs with greater regulation strength than the wild-type for each of the three TFs IHF, Fis, and CRP. Each logo also shows the non-varying position (grey) of the TFBS genotype from which our libraries were created. In each logo, the height of each letter at each TFBS position indicates the information content at that nucleotide position – the taller the letter, the more frequent the nucleotide is in strongly regulating TFBSs 67,68. Similar information is conveyed by the frequency matrices displayed as heat maps below each logo. They represent the variability of each nucleotide at each position (horizontal axis) through a color gradient (see color legend). Tall letters in the sequence logo and yellow letters in the frequency matrix indicate frequent, and thus likely important nucleotides for strong regulation.

To further validate the data from our sort-seq experiments, we isolated cells harboring 10 different TFBS variants from each of 13 bins of each library (i.e., 10×13=130 variants per library), and determined their regulation strength more directly by quantifying GFP expression with a microplate reader (Methods and Supplementary Figure S11). This comparison validates the sort seq approach by revealing a strong association of the two independent quantifications of regulation strength (Pearson’s R=-0.81 for CRP, R=-0.74 for Fis, and R=-0.73 for IHF). (Methods and Supplementary Figure S11).

All three regulatory landscapes are highly rugged

The study of our landscapes in a network framework (Figure 1d) can help to quantify different aspects of landscape topography (Supplementary Table S5). One of them is the ruggedness of each landscape. It can be quantified by the number of peaks6972. In the network framework, a peak is a TFBS whose neighbors are all weaker regulators (with lower S) than the TFBS itself. We find that all three landscapes are highly rugged, with 2,154, 2,312, and 2,453 peaks for the CRP, Fis, and IHF landscapes, respectively (Figure 2c, Supplementary Figure S12, Supplementary Table S5). Not surprisingly, peak genotypes generally are stronger regulators than non-peak genotypes (Figure 2c). Only a small fraction of peak genotypes regulate expression more strongly than the wild-type (Figure 2c; 61 for CRP, 172 for Fis, and 199 for IHF). We refer to such peaks as high peaks and distinguish them from low peaks (S<SWT).

We compared the ruggedness of our landscapes with that of a well-established theoretical model of uncorrelated random landscapes, in which each sequence is assigned a fitness at random from the same fitness distribution, and neighbors have uncorrelated fitness values7274 Such landscapes are maximally rugged 7274. We created 100 uncorrelated random landscapes for each of our three TF landscapes by randomly shuffling the measured regulation strengths among all genotypes (Supplementary Methods 7.6). The number of peaks in our TF landscapes lies within 93%, 96%, and 98% percent of that of a maximally rugged random landscape, which, on average [mean±s.d.], has 2,308±133, 2,405±89 and 2,373±97 peaks for CRP, Fis and IHF, respectively. This analysis underlines that our landscapes are indeed highly rugged.

Because we use only quality-filtered genotype data, our landscapes lack regulatory information for about 40% of the 48 genotypes. While we cannot exclude that this undersampling of genotypes has led to systematic biases in our determination of landscape ruggedness, we note that the sampling of landscape genotypes by our experiments was not strongly biased with respect to regulation strength. Specifically, when we analyzed the relative connectivity of genotypes in our landscapes – the fraction of each genotype’s 24 possible neighbors for which our experiments yielded regulatory data – we found that it is only weakly correlated with regulation strength (R=-0.1, −0.1, 0.01 for the CRP, Fis, and IHF landscapes, Figures S13-S15). Similarly, the relative connectivity of peak genotypes is only weakly correlated with their regulation strength (R=-0.05, −0.04, 0.06 for the CRP, Fis, and IHF landscapes). Furthermore, the prevalence of sign epistasis (Supplementary Table S5) supports the notion that our landscapes are indeed rugged (see Supplementary Methods 7.5 for further details on epistasis and its evolutionary consequences).

Landscape peaks are widely scattered in genotype space

In a rugged adaptive landscape, reaching high fitness peaks can be challenging, because such peaks are separated from other genotypes by valleys of low fitness that cannot be traversed by natural selection alone38. If natural selection favors strong gene regulation, the ruggedness of our regulatory landscapes may thus present a challenge for adaptive evolution. To better understand the potential magnitude of this challenge, we next analyzed our landscapes’ topography in greater detail. We began by studying the distribution of peaks in genotype space. If a landscape’s highest peaks are widely scattered through genotype space, then they may be accessible via fitness-increasing evolutionary paths from diverse non-peak genotypes. This may facilitate adaptive evolution compared to a landscape where peaks are clustered in a small region of genotype space.

For each of our three landscapes, we determined the genetic distance between peaks, i.e., the minimum number of mutations needed to transition from one peak to another, regardless of their effect on regulation strength. We compared the distribution of these distances to the distribution of genetic distances for an equal number of randomly selected non-peak variants. In all three landscapes, the distances between peaks are almost indistinguishable from those of random genotypes, differing on average by fewer than 0.1 mutational steps. In other words, peaks are about as widely dispersed in each landscape as random genotypes (Supplementary Figure S16). This is the case for both low and high peaks (Supplementary Figures S17-S18). A principal component analysis further underscores this dispersion (Supplementary Methods 7.7, Supplementary Figures S19-S21).

Accessible paths to a peak are often not the shortest possible paths

Next, we focused on mutational paths to peaks that are evolutionarily accessible, i.e., a series of mutational steps where each step increases regulation strength. Specifically, we enumerated all accessible paths that exist from each non-peak genotype to each peak genotype for all three landscapes. We found that these paths are generally longer than the shortest genetic distance between a non-peak and the peak genotypes. Specifically, the mean length of accessible paths exceeded the shortest distance by 1.5, 1.8 and 1.8 steps for the CRP, Fis and IHF landscapes (Supplementary Figure S22). In other words, accessible mutational paths are often not the most direct paths.

The existence of indirect paths implies that some mutational steps are evolutionarily prohibited because they decrease regulation strength. The reason is closely linked to epistasis – non-additive effects of two or more mutations on regulation strength. (see Supplementary Methods 7.5 for further details on epistasis). More specifically, such inaccessible steps are a result of sign epistasis. In this kind of epistasis, a double mutant of a TFBS regulates expression more strongly than the TFBS itself, even though one or both constituent single mutants regulate expression more weakly than the TFBS37,7577. Indeed, epistasis is prevalent in all three landscapes (Table S5). Specifically, we observe sign epistasis in 62%, 66%, and 65% percent of interactions between single mutant pairs in the CRP, Fis, and IHF landscapes.

The existence of accessible paths alone does not tell us how easily a high peak can be found through Darwinian evolution. The reason is that only a very small fraction of all paths to that peak may be accessible, and an evolving population may not find any one of those paths. To quantify the fraction of accessible paths, we determined the total number of paths from each non-peak genotype to each high fitness peak, and computed the fraction of those paths that are accessible. Figure 3a shows how this fraction depends on the number of mutational steps in a path. It behaves similarly for all three landscapes (Figure 3a). The majority of two-step paths (a fraction greater than 80%) are accessible in all three landscapes (Figure 3a), but the fraction of accessible paths dwindles rapidly with an increasing number of mutational steps. It reaches a minimum below 1% for all three landscapes at eight mutational steps (Figure 3a).

Peaks and their basins of attractions.

a. The fraction of accessible paths declines with increasing path length. The figure illustrates how the fraction of accessible paths to high peaks (vertical axis, logarithmic scale) decreases with the length of the shortest accessible path (horizontal axis). Accessible paths are defined as paths where each step increases regulation, starting from a specified initial genotype. Each colored line corresponds to data from a different TF landscape: CRP (green), Fis (orange), and IHF (blue). Circles indicate the mean fraction of accessible paths for a given path length. b. Higher peaks have larger basins of attraction. The split-violin plots display the distribution of basin sizes (vertical axis) for high (red) and low (blue) peaks in the three TF landscapes CRP, Fis, and IHF (horizontal axis). High peaks have significantly larger basins of attraction (Welch Two Sample t-tests; CRP: t-value = 9.0898, df = 63.509, and p-value = 4.22 × 10-13, with mean basin sizes of xhigh = 9528.426 and xlow = 5655.035. Fis: t-value = 7.617, df = 172.35, and p-value = 1.645 × 10-12, with mean basin sizes of xhigh = 14206.70 and xlow = 10931.94. IHF: t-value = 6.1777, df = 201.99, and p-value = 3.521 × 10-9, with mean basin sizes of xhigh = 10965.872 and xlow = 8664.953.) The violin plots show the distribution of basin sizes for each of the two kinds of peaks. Their width represents the frequency of a given basin size. The vertical length of the box in each plot covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR. c. The regulation strength of peak genotypes (horizontal axis) is plotted against the size of their basins of attraction (vertical axis, shown as a percentage of all (non-peak genotypes) for peaks in all three landscapes (color legend). Color-coded values of R at the top of the graph indicate Pearson correlation coefficients for each landscape (color legend). Curved lines are based on a linear regression analysis (in linear space), with grey shading indicating 95% confidence intervals (CRP: R2=0.28, N = 2,154; Fis: R2=0.24, N = 2,312; IHF: R2=0.23, N = 2,434). d. Basins of attractions share many TFBS genotypes.

High peaks have large basins of attraction that share many TFBS genotypes

In the following analysis, we computed the basin of attraction of each peak, defined as the set of non-peak genotypes from which accessible paths to the peak exist70,78,79. In other words, a peak’s basin of attraction comprises all genotypes from which adaptive evolution can access the peak. The peaks in our landscapes have widely different basin sizes. They include peaks accessible from a mere fraction of genotypes to those accessible by a majority of genotypes (Figure 3b). Notably, high peaks generally had larger basins of attraction than low peaks (Figure 3c, Supplementary Figure S23). In addition, we found that many genotypes are part of multiple basins of attraction. Adaptive evolution may reach different high peaks starting from any such genotype. For all pairs of high peaks, we computed the pairwise overlap (intersection) of the basins of attraction, i.e., the fraction of genotypes that are part of both basins of attraction. (Figure 3d, Supplementary Figures S24-S25). The number of genotypes in this intersection varies widely, but the basins of attraction of high peaks generally share a substantial fraction of genotypes (mean shared genotypes: 48%, 42%, and 36% among all pairs of high peaks for the CRP, Fis, and IHF landscapes, Figure 3d)

The violin plots with embedded boxplots illustrate the fraction of genotypes shared between basins of attractions (vertical axis) for all pairs of high peaks in the CRP (green), Fis (orange), and IHF (purple) TF landscapes (horizontal axis). Specifically, we computed basin overlap as one minus the Jaccard coefficient (Supplementary Methods 7.5). A value of one (zero) indicates that the basin of attraction of two peaks share all (no) genotypes (CRP: 61 high peaks, N=1,830 comparisons, Fis: 172 high peaks, N=14,706 comparisons, IHF: 199 high peaks, N=19,701 comparisons). Their width represents the frequency of a given basin overlap. The vertical length of the box in each box covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR.

Genetic drift facilitates and clonal interference reduces the attainment of high peaks

Taken together, our analyses thus far indicate that high peaks in the CRP, Fis, and IHF landscapes are more accessible than low peaks (Figure 3b), and their basins of attraction also share more genotypes (Figure 3d). However, the accessible evolutionary paths to high peaks are often indirect. In addition, the fraction of accessible paths to any one high peak dwindles rapidly with the distance from that peak (Figure 3a).

To understand how these topographical features jointly affect adaptive evolution, we simulated the evolutionary dynamics on our three landscapes under the assumption that natural selection would favor strong regulation and that fitness is proportional to regulation strength.

Because high peaks constitute only a small fraction of all peaks (2.8%, 0.4%, and 0.5% in the CRP, Fis, and IHF landscapes), we hypothesized that most evolving populations would reach only low peaks. To test this hypothesis, we took advantage of the fact that E.coli has a low genomic mutation rate of 2.2 × 10−10 per base pair per generation 43, and that we study evolution only within an 8-nucleotide segment of a TFBS. In addition, E. coli has a large effective population size (>108 individuals43), which means that genetic drift is weak and even minor differences in fitness are visible to natural selection 43,80. These conditions imply that a population on our landscape would evolve in the well-studied strong-selection weak-mutation regime (SSWM)39,8183. In this regime, a population is monomorphic most of the time, until a beneficial mutation arises, which usually sweeps rapidly to fixation. In other words, adaptive evolution can be modeled as an adaptive walk, in which each mutational step is taken with a fixation probability that has been derived by Kimura 80. We thus call the resulting adaptive walks Kimura walks (Supplementary Methods 8). We simulated 103 such adaptive walks starting from each of 15,000 randomly and uniformly selected starting (non-peak) TFBS genotypes from each landscape. Each random walk comprised up to 25 mutational fixation steps, unless it reached a fitness peak earlier. Each adaptive walk also accounted for known E. coli mutational biases 8486 (Supplementary Methods 8).

As we hypothesized, most adaptive walks indeed reach only low peaks (90%, 85%, and 87% percent for the CRP, Fis, and IHF landscapes.). However, the percentage of walks that terminate at high peaks is several-fold greater than the proportion of high peaks. Specifically, 10 percent of walks terminate at high peaks in the CRP landscape, which is 3.6-fold higher than the percentage (2.8%) of high peaks in this landscape. In the Fis and IHF landscape, adaptive walks are 2 and 1.6-fold more likely to terminate at high peaks than expected from the proportion of these peaks among all peaks (7.4% and 8.1% percent) (Figure 4a). In other words, adaptive walks reach high fitness peaks more often than one might expect.

Peak accessibility, contingency and biases in adaptive walks.

a. More than 10% percent of adaptive walks lead to high peaks. The graph shows the cumulative distribution function (CDF, vertical axis) of regulation strength (horizontal axis) attained at the end of 15 million adaptive walks (15,000 starting genotypes × 1,000 adaptive walks each) for each TF landscape (color legend, CRP=: green; Fis: orange; IHF: purple). Dashed lines intersect the CDF at points equivalent to the wild-type regulation strength for each TF (color legend). They help to infer the proportion yTF of walks that terminate at high peaks (S>SWT), which is indicated by the numerical values at the graph’s top for each of the landscapes (color legend). b. Paths to high peaks are not much longer than genetic distances from the starting genotype. Colored boxplots summarize, for each of the three landscapes (horizontal axis), the distribution of shortest genetic distances (blue) and actual path lengths (green) for adaptive walks terminating at high peaks. Path lengths are typically less than one mutational step longer than genetic distances. Each box covers the range between the first and third quartiles (IQR). The horizontal line within each box represents the median value, and whiskers span 1.5 times the IQR. Numbers atop each plot indicate means ± one standard deviation. c. Different numbers of high peaks are reached from different starting variants. Violin plots integrated with boxplots show the distribution in the number of high peaks reached (vertical axis) from each starting variant that attained any high peak during 10,000 adaptive walks. The number of these variants is shown as an integer on top of the panel, and its percentage among all 15,000 starting variants is shown in parentheses (color legend). Underneath it, the panel shows the mean ± 1 s.d. of the number of attained high peaks. a. d. Some high peaks are reached more often than others. We randomly and uniformly sampled 10 starting genotypes from the CRP landscape, started 103 adaptive walks from each, and recorded the number and frequency of distinct high peaks attained in these random walks. Results for each starting genotype are symbolized by a vertical bar. The number of stacks within each bar (delineated by horizontal lines, also indicated by an integer above each bar) indicates the number of high peaks reached by the 103 adaptive walks. Starting variants are ordered in ascending order based on this number of attained peaks. Stack height indicates the fraction of walks that reached the same peak, and is indicated in red, orange, and yellow for the three most frequently attained peaks. See Supplementary Figure S32 for the Fis and IHF landscapes.

The adaptive walks that reached a high peak were short (Figure 4b). On average, they were also no more than half a mutational step longer than the shortest genetic distance between each starting genotype and peak (Figure 4b, CRP landscape: path length 3.2 ± 1.6 [mean±s.d.] vs. genetic distance 2.8 ± 1.2; Fis landscape: path length 3.1 ± 1.5 vs. genetic distance 2.7 ± 1.2; IHF landscape: path length 3.1 ± 1.5 vs. genetic distance 2.7 ± 1.2).

These short paths can be explained by our previous observation that peaks are widely distributed across the genotype spaces of the three landscapes. This distribution makes it easier for populations starting from any (non-peak) genotype to find a local peak through few mutations. Additionally, path lengths realized during adaptive walks tend to be shorter than the average lengths of accessible paths. (Figure 4b) The reason is that the average length of accessible paths takes all accessible paths into consideration, whereas adaptive walks tend to stop at the first peak they encounter, which results in shorter realized path lengths.

Because genetic drift can cause evolving populations to escape a low fitness peak and attain a higher fitness peak, we also asked how small population sizes affect the likelihood that a population attains a high fitness peak (Supplementary Material 8, Supplementary Figures S26-28). When simulating adaptive evolution in populations of N=15,000 individuals, we found that the likelihood that a population reaches a high peak increased by at least 10% percent (to 18%, 20%, and 21% for the CRP, Fis, and IHF landscapes, Supplementary Figures S26-S28).

We also studied how deviations from the strong selection weak mutation regime would affect evolutionary dynamics. In large populations or at high mutation rates, populations tend to be polymorphic and are subject to clonal interference, where the most advantageous of several co-occurring mutations typically dominates and becomes fixed 8789. To approximate this dynamic, we conducted simulations using “greedy” adaptive walks (Supplementary Methods 8) 38,90. In such an adaptive walk, it is always a genotype’s mutational neighbor with the highest increase in regulation strength that becomes fixed 82,9092. In other words, a greedy walk starting from a given genotype is deterministic. We, therefore, simulated only one walk for each of the 15,000 randomly chosen (non-peak) starting genotypes. We found that the fraction of greedy walks reaching high regulation peaks is slightly lower than in the SSWM regime, with 1%, 2%, and 5% fewer walks achieving such peaks in the CRP, Fis, and IHF landscapes, respectively (Supplementary Figure S29). This outcome is expected, because greedy walks prioritize immediate fitness gains at the expense of the ability to discover higher fitness peaks90.

The attainment of peaks is highly contingent on chance events

Because different basins of attraction share many TFBS genotypes (Figure 3d, Supplementary Figures S24-S25), it is possible that adaptive evolution starting from any one genotype can reach different peaks, depending on which mutational paths it takes. In other words, the structure of the landscapes we study may lead to evolutionary contingency 93,94 – the dependence of an outcome of evolution on unpredictable chance events26,93,95. Indeed, non-peak genotypes can access on average around 30% of the total number of high peaks in each landscape (Supplementary Figure S25). To assess the prevalence of evolutionary contingency, we determined how many different peaks are attained by 103 adaptive walks starting from the same randomly chosen genotype. We performed this analysis on 15,000 starting genotypes, focusing on the subset of those starting genotypes from which at least one high peak is reached during the 103 walks. Specifically, 28.6%, 23%, and 24.5% of starting genotypes for the CRP, Fis, and IHF landscapes, reached at least one high peak. Notably, these percentages represent the fraction of starting genotypes that can access at least one high peak, not the total fraction of adaptive walks that lead to a high peak, which is lower (as shown in Figure 4a). For instance, while adaptive walks starting from 28.6% of genotypes reached at least one high peak in the CRP landscape only 10% of all adaptive walks in this landscape end at a high peak (Figure 4a). Importantly, for any one starting genotype where random walks reached at least one high peak, they typically reached more than one high peak (Figure 4c, median±IQR, 4±4, 4±8, 6±6 for the CRP, Fis, and IHF landscapes, respectively). The number of high peaks attained from any one starting genotype was as large as 21, 30, and 33 for the three respective landscapes. Thus, evolutionary contingency is pervasive in all three landscapes. In addition, not only the end point of evolution but also the evolutionary trajectories leading to this end point show contingency. That is, adaptive walks starting from a specific genotype and ending at a specific high peak often take multiple paths to this peak (Supplementary Figure S30).

Biases towards specific evolutionary outcomes have been documented in both empirical and computational studies 26,27,70,93,9698. They also exist in our landscapes (Figure 4d, Supplementary Figure S31). Figure 4d illustrates both contingency and bias for 103 adaptive walks starting from each of 10 different genotypes in the CRP landscape (thus a total of 104 adaptive walks). Depending on the starting variant, the adaptive walks reached between 1 and 16 high peaks. Whenever different peaks are reached, they are reached by markedly different proportions of adaptive walks. The same holds in the Fis and IHF landscapes (Supplementary Figure S32). Additionally, while multiple routes to each peak can be traversed (Supplementary Figure S30), we observed in all three landscapes biases towards specific paths that are taken more frequently than others (Supplementary Figure S33).

Discussion

We evaluated the ability of three E. coli global regulators to control gene expression through each of more than 30,000 TFBSs for each regulator. To achieve this, we utilized a synthetic plasmid-based system that facilitates high-throughput fluorescence measurements. This system allowed us to quantify the regulatory strength of individual TFBSs by measuring gene expression through GFP fluorescence 51. Additionally, the system insulates the control of transcription from any direct effects library sequences might have on transcription, such as the transcriptional and translational impacts of 5’ untranslated regions on gene expression 99,100. Recent studies have investigated large empirical datasets of cis-regulatory genotypes, examining the impact of sequence variation on gene regulation in both eukaryotes and prokaryotes6,39,49,59,101104. However, few studies have focused on these interactions from an evolutionary perspective, as we do here 6,39,103.

We showed that the regulatory landscapes of all three TFs are highly rugged and have multiple peaks. The ruggedness of all three landscapes is also supported by the prevalence of epistasis in them (Supplementary Table S5). A particularly important form of epistasis is sign epistasis37,75,76, because it can lead to multiple adaptive peaks 37,75,76 (see Supplementary Methods 7.5) Our landscapes contain up to 65% of mutation pairs with sign epistasis, a value that is especially high compared to the almost exclusively additive interactions of mutations in eukaryotic TFs6,105. However, the TFs we study are not exceptions among other prokaryotic TFs. Recent studies have shown that epistatic interactions are common in binding sites of local prokaryotic regulators, such as AraC (sign epistasis in more than 50% of 20 mutants106) and Cl (sign epistasis in 85% of 113 mutants107). We attribute this greater incidence of epistasis to the nature of prokaryotic TFBSs. Specifically, prokaryotic TFBSs are at approximately 20bps, twice as long as eukaryotic ones57,108 and exhibit symmetries reflecting the dimeric state of their cognate TFs109111. These factors may increase the likelihood of intramolecular epistasis.

Despite high ruggedness and epistasis, we found that a modest proportion of evolving populations can access the highest peaks of each landscape. Specifically, in the large populations that are typical of E. coli, at least 10% percent of evolving populations reach one of the highest peaks. The clonal interference that may occur in even larger populations or in populations with high mutation rates only modestly reduces this percentage to 5% (Supplementary Figure S29) Conversely, in small populations where genetic drift can help a population escape from a low peak, this percentage increases to 18%. Numbers like these render the de novo evolution of strong transcription factor binding sites plausible for our three global regulators. Once such binding site have originated in one population, they can also spread to others through horizontal gene transfer112,113.

It is well known that epistasis can influence the trajectories of evolving populations on adaptive landscapes37,76,114. The regulation strength of a TFBS is partially determined by the combined interactions of the nucleotides within the TFBS115118, which in turn affects TF-TFBS binding affinities. Different combinations of mutations can result in similar regulation strengths and create multiple evolutionary pathways to achieve optimal or near-optimal regulation6,119. This can lead to overlapping basins of attraction among different peak TFBSs conveying strong regulation, such that even evolution starting from the same genotypes can take different paths and reach different peaks with similar regulation strengths (Supplementary Figure S30). Indeed, we observed such overlapping basins in our landscapes (Figure 3d and Supplementary Figure S24). Moreover, epistasis can create plateaus of regulation strength where various combinations of nucleotides yield TFBSs with intermediate regulation strengths. These plateaus serve as common evolutionary intermediates that multiple starting sequences can traverse on their way to different peaks, and thus contribute further to the shared basins of attraction we observe (Supplementary Figure S24).

In a landscape with substantial epistasis, the sequence of mutations that occurs in an evolving population can render adaptive evolution highly contingent on this sequence93,94. For example, some beneficial mutations within a TFBS might only be accessible after other specific mutations have occurred. Different starting sequences or early mutations can also change the spectrum of accessible mutations, leading to different peaks with similarly strong regulation 93,94. Indeed, we observed in all three landscapes that different evolving populations starting from the same genotypes in the landscape attain different peaks (Figures 3d, 4d and Supplementary Figure S32) 93,95,120,121. Such contingency renders evolution less predictable 26,93,122.

Despite the prevalence of contingency in all three landscapes, evolving populations starting from the same genotype more often attain some peaks than others (Supplementary Figure S32). Moreover, for each attained peak with multiple possible evolutionary routes, we also observed that some paths are more frequently transversed than their alternatives (Supplementary Figure S33). That is, evolution is biased towards traversing some paths and attaining some peaks more often than others123,124. These observations emphasize the complex interplay between chance, contingency, and evolutionary biases in shaping the outcomes of adaptive evolution.26,93,95,97.

The three TFs we studied here belong to different protein families, yet they have regulatory landscapes with similar topography. All three landscapes highly rugged, highly epistatic, and harbor multiple peaks that are widely scattered, and this holds for peaks of low, intermediate, and high regulation strength. This might help to explain how TFBSs with intermediate and high regulation strength can be easily attained during evolution. In addition, peak accessibility in all three landscapes increases with peak height (Supplementary Figure S31). On the one hand, these commonalities may be caused by common biological or biochemical properties of the three TFs. For example, CRP has been suggested to possess nucleoid-associated protein properties similar to Fis and IHF, due to its ability to bend and loop DNA125. On the other hand, the commonalities may reflect general characteristics of global gene regulation. One of them is that global TFs often bind unspecifically to multiple TFBSs11,13. Also, they may bind DNA with a broad range of different affinities centered around intermediate affinity, rather than bind few sites but very strongly 13. This possibility is supported by comprehensive analyses of in vitro eukaryotic TF binding affinities for thousands of TFBS variants, showing essentially continuous binding affinity distributions for hundreds of sites6.

Despite broad similarities among the three landscapes we study, we also found some differences. Most notable is the narrower distribution of regulation strengths for the binding sites of CRP compared to those of Fis and IHF (Figure 2a). This is not unexpected, given that CRP binding sites are known for being quasi-symmetric and less degenerate than those of Fis and IHF28,31. Previous studies have also shown that Fis and IHF binding sites are less conserved in their nucleotide structure and more biased toward AT-rich content22,126,127. One possible explanation for this observation is the nature of the binding dynamics of these regulators. CRP, mostly restricted to gene regulatory interactions, may have evolved to finely tune its regulatory output, resulting in a narrower distribution of regulation strengths. In contrast, Fis and IHF, which are involved in functions beyond gene regulation, such as DNA replication and genomic structural maintenance128,129, might encompass a more diverse range of regulation strengths.

The landscapes of our global regulators also differ in two respects from that of the previously characterized local regulator TetR 27. First, the peaks of the TetR landscape are more clustered in genotype space due to its lower diversity of strongly binding TFBSs 27. Second, the distribution of regulation strengths for its TFBSs is narrower and shows more strongly binding TFBSs 27. A possible explanation may lie in the fact that TetR regulates only two transcription units130.

Some of the properties of our landscapes are similar to what would be expected from theoretical models, such as uncorrelated random landscapes72,73,131,132. First, both kinds of landscapes have many peaks. Specifically, the number of peaks of the CRP, Fis, and IHF landscapes is 93%, 96%, and 98% of that expected in a random landscape of the same size (Supplementary methods 7.6). Second, the peaks of our landscapes are widely scattered in genotype space. This scattering may reflect a biological property, namely that global regulators can strongly bind multiple and highly diverse sites1113. In addition, high peaks in our empirical landscapes also have large basins of attractions, an observation that is consistent with predictions from simple theoretical models 132.

One limitation of our work stems from our rigorous quality filtering of sort-seq data. As a result, we lack regulatory data for some 40% of the TFBSs in each landscape. This limited diversity of reliable data is a common feature in mutational library studies. It has several technical causes, such as biases in library synthesis 133, PCR amplification 134, cloning, and loss of sequence diversity after cell sorting 102. In future work, this limitation could be overcome by a combination of strategies, such as subsampling complete genotype spaces or combining different molecular methods to overcome biases and diversity loss during PCR amplification135,136, sorting48,137,138, and high-throughput sequencing103.

A second limitation comes from our use of the sort-seq method, which is best-suited for our work, because it allows the high-throughput measurement and sorting of millions of individual cells in a standardized and straightforward manner. However, the method’s accuracy depends on the binning procedure. Other studies have used between two and several dozen bins, with or without unbiased sampling 48,50,59,101,102,137,139. Recommendations emerging from this work include sorting of cells into at least four logarithmically (log2) equally-spaced bins, each covering approximately 12.5-15% of the fluorescence distribution, 59,102,140. We followed these recommendations, using 13 bins. In addition, we computed a weighted average to calculate expression values from sequences appearing in multiple bins, a straightforward method validated by robust studies in transcriptional regulation analysis 39,49. In addition, we validated individual regulation strengths with an independent method to demonstrate its reliability.

A third limitation of our study is that we only examined variation at eight positions for each TF. We selected these positions based on their importance for DNA binding and regulation by our TFs, as evidenced by their high information content. We cannot exclude the possibility that selecting a different set of positions could yield different landscape topographies. However, we speculate that less information-rich position would not reduce but rather increase the potential for the de novo evolution of strong TFBSs. For example, they may facilitate landscape navigability through extradimensional bypasses78,141 or provide small-incremental changes in regulation strengths. These small changes help populations reach peaks via diminishing returns effects, meaning that as a population gets closer to a peak, each subsequent mutation contributes progressively smaller improvements to regulation strength 142,143. Investigating a wider array of positions and larger regulatory landscapes remains an important task for future work.

Fourth, we use a simplified empirical system for the study of gene regulation—a promoter followed by a single TFBS. Although this regulatory architecture exists for some genes144, global regulators often form part of more complex, combinatorial architectures4,144,145 that are influenced by environmental factors2,146,147, concentrations of active TFs148150, chromosome structure151155, methylation states156, and more. Studies on more complex promoter architectures may reveal regulatory landscapes with different topographies.

Lastly, we also analyzed the adaptive evolution of TFBSs in a simplified manner, performing data-driven evolutionary simulations rather than experimental evolution. Although many studies assume that regulation strength and fitness are correlated, this is not always the case. Low binding affinities can be adaptive during development 157, and many genes exhibit a nonlinear fitness-expression function 158 with a plateau of maximal fitness across a wide range of expression levels 159161. For instance, most mutations and polymorphisms in the promoter of the yeast gene TDH3 do not significantly affect fitness in a glucose-rich medium160. For these reasons we focused our observations and interpretations on regulatory phenotypes (regulation strength). Our simplified approach allowed us to model population evolution in a large genotype space and avoid monitoring thousands of evolving regulatory sequences simultaneously in vivo162,163. However, it cannot replace experimental evolution of TFBSs, which also remains an important challenge for future work.

Transcription factor binding sites are among the simplest units of biological organization. Our work provides the first large-scale analysis of the regulatory landscapes formed by such sites for global transcriptional regulators. It shows that strong binding sites can readily evolve de novo, even though prokaryotic transcription factor binding sites are much larger than their eukaryotic counterparts and have more rugged regulatory landscapes. In addition, the evolution of these simple sequences also displays phenomena that have been characterized in much more complex systems, such as evolutionary contingency and evolutionary biases.

Materials and methods

The Supplementary Materials contain extended details of experimental procedures and data analysis.

Strains and plasmids

Bacterial strains and plasmids used in this work are listed in Table 1. We obtained electrocompetent E. coli cells of strain SIG10-MAX® from Sigma Aldrich (CMC0004). We used this strain for molecular cloning and library generation due to its high transformation efficiency. The genotype of this strain (Supplementary Table S4) is similar to DH5α (Sigma Aldrich commercial information, see Supplementary Table S4). The strain is resistant to the antibiotic streptomycin.

We amplified plasmid libraries in SIG10-MAX®, extracted their DNA, and transformed them into mutants derived from E. coli K-12 strain BW25113 that harbor chromosomal deletions of the crp, fis, or ihfa gene. We obtained these mutant strains from the KEIO collection 164 and used them for sort-seq experiments. The design, genetic parts, and assembly of the plasmid vectors we used in this study are available in the Supplementary material, as are all primers, TFBS sequences/libraries, strains, and plasmids.

Sort-Seq procedure

To explore the regulatory effects of each transcription factor on binding sites in the corresponding library, we constructed three plasmids, each of which enables the inducible expression of one of our three TFs. These are plasmids pCAW-Sort-Seq-V2-CRP, pCAW-Sort-Seq-V2-Fis, and pCAW-Sort-Seq-V2-IHF (Supplementary Methods 3 and Supplementary Table S1). We cloned the TFBS libraries (Supplementary Table S3) into their respective plasmids and then transformed them into mutant strains lacking the corresponding TFs (Δcrp, Δfis, and Δihf, as listed in Supplementary Table S2). We induced TF expression using anhydrotetracycline (Atc) and, after overnight growth, performed cell sorting for cell populations. During sorting, we distributed cells into 13 equally-spaced logarithmic bins based on their fluorescence levels. We replicated each sort-seq experiment three times from three separate library transformations for each TF. To mitigate the impact of extrinsic noise (gene expression variation among cells 165), we adhered to standard protocols by normalizing our GFP fluorescence measurements against mScarlet-I fluorescence values obtained from flow-cytometry assays 7,48,166,167. We subsequently recovered the sorted cells from each bin in 50mL Falcon tubes containing 10 mL of LB medium supplemented with chloramphenicol and incubated them overnight at 37°C with shaking at 220 rpm. After this growth period, we re-sorted cell cultures from each bin to eliminate potential contaminants and ensure that the cell populations had preserved their fluorescence distributions. Following re-sorting, we extracted plasmids from the cell population of each bin. We amplified and barcoded the TFBS region from each population through a polymerase chain reaction (PCR). Lastly, we sequenced barcoded amplicons containing TFBS sequences, and used the sequencing results to calculate the regulation strengths of each TF to its TFBSs in the corresponding library. More details are provided in the Supplementary Material.

Regulation strengths

Due to gene expression and measurement noise, individual TFBS variants in a sort-seq experiment usually appear in more than a single bin, and their read count (frequency) varies among bins48,49,101,138. Following established practice27,49,103, we used a weighted average of these frequencies for each variant to represent the mean expression level driven by the variant. To facilitate the interpretation of this quantity, we converted this expression level into a regulation strength relative to the highest observed regulation strength for a given TF, to which we assigned a value of one. From each library, we selected a single naturally occurring binding site for each TF that was previously characterized in the literature as a strong binder. We called this TFBS the WT sequence and used it as a baseline to separate weakly (higher GFP expression) from strongly (lower GFP expression) binding TFBSs.

Validating regulation strengths with plate reader measurements

To further validate our regulation strength data, we chose 10 DNA binding sites from each bin (10 variants ×13 bins = 130 variants in total per TF library, Supplementary File 2), covering a wide range of measured regulation strengths. We cloned these sequences into the appropriate vector pCAW-Sort-Seq-V2-TF (TF: CRP, Fis or IHF), and transformed them into the appropriate mutant strain. We picked individual colonies and grew them overnight (16 hours, 37°C, 220 rpm) in liquid LB supplemented with 50 μg/mL of chloramphenicol and anhydrotetracycline. We diluted the cultures to 1:10 (v/v) in cold Dulbecco’s PBS (Sigma-Aldrich #D8537) to a final volume of 1 mL. We transferred 200 μl of the diluted cultures into individual wells in 96-well plates and measured GFP fluorescence (emission: 485nm/excitation: 510nm, bandpass: 20nm, gain: 50), as well as the optical density at 600nm (OD600) as an indicator of cell density. We then normalized fluorescence by the measured OD600 value to account for differences in cell density among cultures and compared the obtained ratios to the previously inferred regulation strengths for the 130 selected variants. We performed all such measurements in biological and technical triplicates (three colonies per sample, and three wells per colony, respectively).

Code Availability and Data Analysis

All code used for processing data and plotting, as well as the final processed data, plasmid sequences, and primer sequences are available in our GitHub repository.

Data availability

Sequencing data has been deposited in the NCBI database under the BioProject accession code: PRJNA1162449. The data generated in this study have been deposited in the Zenodo public repository and are accessible via the following DOI: 10.5281/zenodo.13838265

Computer code

The computer code generated in this study has been deposited in the Zenodo public repository and is accessible via the following DOI: 10.5281/zenodo.13838265.

Supplementary tables and figures

Plasmids used in this study

Table of strains

Libraries used in the present study.

TFBS reference (“wild-type”, WT) sequences used in the present study.

Global landscape properties

Primers for engineering the pCAW-Sort-Seq-V2 plasmid and for cloning libraries.

Construction of pCAW-Sort-Seq-V2.

The plasmid pCAW-Sort-Seq-V2 is a derivative of plasmid pCAW-Sort-Seq27. The figure illustrates the sequential cloning steps involved in the plasmid’s construction, from the bottom upwards. For an in-depth explanation, see Supplementary Methods. Briefly, we began with the synthesis and cloning of a codon-optimized mscarlet gene expression cassette into the pCAW-Sort-Seq plasmid, oriented antiparallel to the tetracycline resistance gene (tetr). This generated the intermediate plasmid pCAW-Sort-Seq-mscarlet plasmid. Subsequently, we inserted each TF gene, exemplified by the crp gene for the purpose of this representation, upstream of the mscarlet gene in a bicistronic operon configuration. This step created the TF-specific expression plasmid variants (one each for CRP, Fis and IHF). The final stage, not depicted here, involved the incorporation of the appropriate wild-type TFBSs and libraries into each plasmid variant upstream of the gfp gene, which yielded the library plasmids we utilized in our sort-seq experiments.

Components of plasmid pCAW-Sort-Seq-V2.

The plasmid system pCAW-Sort-Seq-V2 is a derivative of the plasmid pCAW-Sort-Seq27. It encodes the broad-host, low-copy number replication origin pBBR1 170 (5 to 10 copies per cell), together with an origin of transfer (OriT) for conjugation (blue), and a chloramphenicol resistance gene (yellow). The interchangeable regulatory region where the TFBS is located (cyan) is placed between a constitutive promoter (BBa_J23110193) and a superfolder GFP (sfgfp, green) fluorescent reporter gene172. The transcriptional insulator RiboJ173 is located upstream of the sfgfp gene. The tetr gene (purple) is derived from the original Tn10 transposon51,73 under the control of a low-strength constitutive promoter (pLac promoter variant developed in ref. 194). TetR regulates the expression of a bicistronic operon encoding the TF of interest (represented by CRP, dark purple), and an mScarlet-I red fluorescent reporter174 (red). Promoters and their respective insulators195 are represented as white arrows. Transcriptional terminators are represented as black arrows196. Ribosome binding sites (RBSs) are represented as pink arrows.

Growth profile for strains used in this study.

Growth profiles for all strains used for landscape mapping are represented as optical densities (OD600, y-axis) measured over 7 hours of growth at 1-hour intervals (x-axis). We transformed all strains with the plasmid pCAW-Sort-Seq-V2 and grew them in LB medium without inducers. Colors (legend) represent the different strains. Wild-type (purple) refers to the BW25113 parent-strain from which the mutants (Δcrp, Δfis, Δihf) were derived. Each circle represents the mean of three biological replicates independently measured in a plate reader. Shaded areas correspond to one standard deviation calculated from the three replicates.

Sorting procedure for CRP library.

Each density plot is based on a histogram representing the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while the vertical axis indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) CRP library in pCAW-v2 without TF expression (“Lib-CRP”) and 3) CRP library in pCAW-v2 with TF expression (“Lib +CRP”). To create each density plot from the histogram data, we performed density smoothing using a Gaussian kernel function with the aid of the ggplot2 package197. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent fluorescence bins/gates into which we sorted E.coli cells, spaced equally on a binary logarithmic (log2) scale.

Sorting procedure for Fis library.

Each density plot represents the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while each histogram indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) Fis library in pCAW-v2 without TF expression (“Lib-Fis”) and 3) Fis library in pCAW-v2 with TF expression (“Lib +Fis”). We performed density smoothing using a Gaussian kernel function to create a smooth density plot for each histogram with the aid of the ggplot2 package 197. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent bins/gates for each sample, spaced equally on a binary logarithmic (log2) scale.

Sorting procedure for IHF library.

Each density plot represents the distribution of GFP fluorescence for 106 cells. The histogram represents data from one replicate, and the same sorting strategy was applied to the other two replicates. Specifically, the horizontal axis shows GFP fluorescence (FITC-H values, arbitrary units, log2 scale), while each histogram indicates the relative frequency of cells with a given fluorescence value. Heatmap colors indicate fluorescence, from low (dark purple) to high (yellow). The vertical axis indicates the three experimental conditions under which we measured fluorescence: 1) negative control (“Neg”), without GFP expression (promoterless pCAW-v2); 2) IHF library in pCAW-v2 without TF expression (“Lib-IHF”) and 3) IHF library in pCAW-v2 with TF expression (“Lib +IHF”). We performed density smoothing using a Gaussian kernel function to create a smooth density plot for each histogram with the aid of the ggplot2 package156. The vertical dashed grey line marks the upper limit of cell autofluorescence (negative control). We did not select cells with fluorescence below this threshold for sorting. Rectangular boxes (1 to 13) represent bins/gates for each sample, spaced equally on a binary logarithmic (log2) scale.

Correlation of read counts among replicates of the CRP library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for CRP. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Correlation of read counts among replicates of the Fis library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for Fis. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Correlation of read counts among replicates of the IHF library.

The figure presents a pairwise comparison of log-transformed read counts across three experimental replicates for IHF. Each circle represents a distinct genotype, plotted according to its read counts in two sequencing replicates (x-axis for one replicate and y-axis for the other). The density of genotypes within the plot is indicated in the legend, with purple representing lower-density regions and yellow representing higher-density regions. The red line across each plot depicts a linear model’s fit to the data, illustrating the association between the read counts of genotypes across replicates. Pearson correlation coefficients (upper triangle) are indicated for each pair of replicates. Their values are close to 1, demonstrating high reproducibility.

Regulation strength distribution across the libraries

a. CRP library. The histogram illustrates the distribution of regulation strengths (normalized to the maximally observed value) among 31,975 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.37), while the wild type’s (WT = 0.71) strength is denoted by a dashed red line. A total of N=84 variants surpass the WT in regulation strength. B. Fis library. The histogram illustrates the distribution of regulation strengths (normalized to the maximum observed value) among 43,222 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.57), while the wild type’s (WT = 0.97) strength is denoted by a dashed red line. A total of N=172 variants surpass the WT in regulation strength. c. IHF library. The histogram illustrates the distribution of regulation strengths (normalized to the maximum observed value) among 41,325 library variants. The average regulation strength across all variants is highlighted by a dashed blue line (mean = 0.52), while the wild type’s (WT = 0.95) strength is denoted by a dashed red line. A total of N=199 variants surpass the WT in regulation strength.

Validating regulation strengths.

In this analysis, we validated the regulation strengths of 10 selected TFBSs for each bin and for each library (130 sequences per library). These TFBSs covered a wide range of regulation strengths determined from sort-seq data. The horizontal axis of each panel shows the regulation strengths of the 130 binding sites as quantified through sort-seq. The vertical axis shows the fluorescence levels of the same binding sites, as quantified by a plate reader (arbitrary units). The top of each panel shows the result of a test of the null-hypothesis that the Pearson correlation coefficient R between the two quantities is equal to zero, together with a 95% confidence interval. The shaded grey area surrounding the regression line indicates the 95% confidence interval, i.e., the probability that the true linear regression line of the population lies within the confidence interval of the regression line calculated from the sample data at a confidence level of 95%. R2 indicates the coefficient of determination. Each circle in the plot represents the mean of three independent (biological) replicate fluorescence measurements in a plate reader after eight hours of growth. Circle colors indicate GFP fluorescence from low (dark green) to high (light green a. Association between plate reader fluorescence and regulation strengths for the CRP library. R = −0.81, P < 0.001; R² = 0.66, P < 0.001, N=130. b. Association between plate reader fluorescence and regulation strengths for the Fis library. R = −0.74, P < 0.001; R² = 0.54, P < 0.001, N=130. c. Association between plate reader fluorescence and regulation strengths for the IHF library. R = −0.73, P < 0.001; R² = 0.53, P < 0.001 N = 130

Adaptive peaks and their regulation strengths across TF landscapes.

a. Adaptive peaks in the CRP landscape. The distribution of regulation strengths across the landscape’s 2,154 adaptive peaks is shown. The WT’s regulation strength (0.71) is marked by a horizontal dashed red line for reference. A total of N=61 peaks have higher regulation strength than the WT. The color gradient indicates genotype density along the regulation strength axis, with the histogram on the right detailing the regulation strength distribution. b) like a) but for the Fis landscape, based on its 2,312 adaptive peaks. (WT regulation strength: 0.97) N=172 peaks regulate expression more strongly than the WT. c) like b) but for the IHF landscape, based on its 2,434 adaptive peaks. (WT regulation strength: 0.95) N=199 peaks regulate expression more strongly than the WT.

Quantitative analysis of CRP landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e,. the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24(=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks”. Mean relative connectivities and standard deviations, are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 4.59 × 10-77, N1= 29,821; N2 = 2,093). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 1, N1 = 29,821; N2 = 61). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The weak association (Pearson correlation, R = −0.13; t = −23.888; degrees of freedom (df) = 31,973; p-value < 2.2 × 10-16) indicates that strongly regulating genotypes dare not necessarily densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of CRP landscape based on regulation strength. Visual overview of the CRP landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). It indicates the density of genotypes within each of the four categories through a color gradient from blue (low density) to yellow (high density). e. Visualization of CRP landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color-codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Like panel d, but for four categories of relative connectivity (no data, low, medium, high).

Quantitative analysis of the Fis landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e., the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24 (=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks.” Mean relative connectivities and standard deviations are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 4.59 × 10-77, N1 = 40,910; N2 = 2,312). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 9.31 × 10-13, N1 = 40,910; N2 = 172). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The weak association (Pearson correlation, R = −0.13; t = −23.888; degrees of freedom (df) = 31,973; p-value < 2.2 × 10-16) indicates that strongly regulating genotypes are not necessarily densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of Fis landscape based on regulation strength. Visual overview of the Fis landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm 182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). The density of genotypes within each category is indicated by a color gradient from blue (low density) to yellow (high density). e. Visualization of Fis landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Similar to panel d, but for four categories of relative connectivity (no data, low, medium, high).

Quantitative analysis of the IHF landscape sparsity.

a. Distribution of relative connectivity among genotypes. Violin plots augmented with embedded boxplots illustrate the distribution of relative connectivity, i.e., the ratio of the empirically observed number of adjacent genotypes to the theoretical maximum number of 24 (=8×3) within a fully connected network. Genotypes are stratified into the three categories “other” (non-peak genotypes), “peaks” (excluding high peaks), and “high peaks.” Mean relative connectivities and standard deviations are shown above each categorical division, revealing lower relative connectivity in peaks relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 1.14 × 10-77, N1 = 38,872; N2 = 2,254). High peaks exhibit comparable or superior connectivity relative to non-peak genotypes (Two-sample Welch’s t-test, p-value = 9.31e-13, N1 = 38,872; N2 = 199). b. Weak association between connectivity and regulation strength. Scatterplot of relative connectivity against regulation strength. The density of genotypes (circles) in the scatterplot area is represented by a color gradient from purple (low density) to yellow (high density). The slight positive correlation (Pearson correlation, R = 0.017; t = 3.51; degrees of freedom (df) = 41,323; p-value = 0.0004485) indicates that genotypes with elevated regulation strength are only marginally more likely to be densely connected. Marginal histograms adjacent to the scatterplot quantify the distributions of regulation strength and relative connectivity. c. Visualization of IHF landscape based on regulation strength. Visual overview of the IHF landscape with genotypes color-coded according to regulation strength (low strength: purple; medium: green; high: yellow; grey: genotypes with missing data). To enhance clarity, edges between genotypes are not shown. The landscape is projected onto a 2D space through a force-directed algorithm 182, with axes representing arbitrary units. d. Spatial distribution of regulation strengths. Each panel shows a density contour plot of the distribution of regulation strength scores of genotypes within one of four regulation strength categories (no data [missing data], low, medium, high). The density of genotypes within each category is indicated by a color gradient from blue (low density) to yellow (high density). e. Visualization of IHF landscape based on relative connectivity. Analogous to panel c, but for relative connectivity. Color codes indicate relative connectivity (see color legend). f. Spatial distribution of relative connectivity. Similar to panel d, but for four categories of relative connectivity (no data, low, medium, high).

Peaks are spread across each landscape.

The vertical axis shows the genetic distance d between peaks and an equal number of randomly chosen non-peak genotypes (color legend) for the CRP (N= 2,154), Fis (N= 2,312), and IHF (N= 2,434) landscape (horizontal axis). (CRP: dpeak= 6, drandom= 5.94, t-statistic= 52.49, p-value <0.01; Fis: dpeak = 6, drandom = 5.97, t-statistic= 30.96, p-value <0.01; IHF: dpeak = 6, drandom = 5.97, t-statistic= 34.07, p-value <0.01; null-hypothesis: means do not differ). Although the differences are statistically significant, peaks are almost as dispersed within genotype space as non-peaks. Each box covers the range between the first and third quartiles (IQR). The horizontal line within the box represents the median value, and whiskers span 1.5 times the IQR.

Genetic diversity among peaks across TF landscapes.

a. CRP landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,154 peaks (red) versus 2,154 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing a slight difference between peaks (d=5.6, in red) and non-peaks (d=5.5, in grey), demonstrating large genetic diversity within the peaks (Welch Two Sample t-test, t = 78.299, df = 4637305, p-value < 2.2 × 10-16). b. Fis landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,312 peaks (red) versus 2,312 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing no significant difference between peaks (d=5.7, in red) and non-peaks (d=5.7, in grey), demonstrating large genetic diversity within the peaks (t = 27.234, df = 5412460, p-value < 2.2 × 10-16). c. IHF landscape. Comparative analysis of genetic distances among nucleotide sequences from 2,434 peaks (red) versus 2,434 non-peak, random variants (grey). Mean genetic distances are displayed with vertical lines, showing no significant difference between peaks (d=6, in red) and non-peaks (d=6, in grey), demonstrating large genetic diversity within the peaks (t = 54.689, df = 6014743, p-value < 2.2 × 10-16).

Genetic diversity among high peaks across TF landscapes.

a. CRP landscape. The overlayed histograms compare the genetic diversity between 61 high regulation strength peaks (red) and 61 non-peak, randomly selected variants (grey). No significant difference exists (W) between high-regulation strength peaks (mean d=5.4, in red) compared to randomly selected genotypes (mean d=5.5, in grey; Welch Two Sample t-test, t = −0.78555, df = 3653.7, p-value = 0.4322). c. Fis landscape. The overlayed histograms compare the genetic diversity between 172 high-regulation strength peaks (red) and 172 non-peak, randomly selected genotypes (grey). No significant difference exists between high-regulation strength peaks (mean d=5.4, in red) and random genotypes (mean d=5.8, in grey; −16 Welch Two Sample t-test, t = −16.634, df = 29357, p-value < 2.2 × 10). c. IHF landscape. The overlayed histograms compare the genetic diversity between 199 high regulation strength peaks (red) and 199 non-peak, randomly selected genotypes (grey). The analysis reveals no significant difference between high regulation strength peaks (mean d=5.8, in red) compared to the random variants (mean d=5.8, in grey; Welch Two Sample t-test, t = 0.73365, df = 39366, p-value = 0.4632).

Principal component analysis of the CRP landscape.

a. PCA plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 31,975 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure in the genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot is a way to visualize the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component, and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component, and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g. G18 stands for a guanine at position 18 of the binding site).

Principal component analysis of the Fis landscape

a. PCA plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 43,222 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure of the data in genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot visualizes the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both the length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g., G18 stands for a guanine at position 18 of the binding site).

Principal Component Analysis of the IHF Landscape

a. PCA Plot. A principal component analysis reveals that adaptive peaks are distributed widely within the genotype space (Supplementary Methods). The panels show principal components (PC) 1 and 2. Axes labels also indicate the percentage of variation explained by each component. Every circle corresponds to one among the 41,325 genotypes within the landscape. Light blue and light red circles represent low and high peaks, respectively, while grey circles represent non-peak variants. The clustered structure in the genotype space is due to the different influence of individual nucleotides on the variation observed in the data (see panel b). b. The contribution of individual nucleotides to principal component analysis. A PCA contribution plot is a way to visualize the relative importance of different variables to the variation observed in the data. In this plot, the contribution of each variable (nucleotide identity at each position of a TFBS) is expressed as the square of the cosine of the angle (cos2) between the variable’s vector (column representing the presence/absence of a base letter at each position in the sequence) and each principal component axis. This quantity is represented as an arrow that indicates the correlation of the variable with PC1 and PC2, the two principal components that capture the largest amount of variation in the data. Both the length and color of the arrow represent the contribution of the variable to the variation observed in the data. A high cos2 value (red) indicates that the variable is strongly correlated with the principal component and therefore makes a large contribution to the variation observed in the data. A low cos2 value (blue) indicates that the variable is weakly correlated with the principal component and therefore makes a small contribution to the variation observed in the data. Each arrow’s length represents the importance of the variable’s contribution relative to other variables in the plot. Each nucleotide is represented with a base letter (A, T, C, G) followed by a number that indicates the position of that base in the binding site sequence (e.g., G18 stands for a guanine at position 18 of the binding site).

Accessible paths to high peaks are less than two mutational steps longer than the shortest genetic distances to the peak.

Each composite histogram shows the distribution of genetic distances (blue) and shortest accessible path length (green) for all pairs of high peaks and non-peak genotypes for one of the three landscapes. We used a t-test of the null hypothesis that the means of these distance distributions are statistically indistinguishable. This null hypothesis is rejected for all three landscapes. a. CRP Landscape. Mean genetic distance: 5.74; mean number of mutational steps in shortest accessible paths: 7.2; t = 500.17, df = 1,212,712, p < 2.2 × 10-16. b. Fis Landscape. Mean genetic distance: 5.98; mean number of mutational steps in shortest accessible paths: 7.7; t = −1271, df = 5,639,001, p < 2.2 × 10-16. c. IHF Landscape. Mean genetic distance: 5.78; mean number of mutational steps in shortest accessible paths: 7.6; t = −1365, df = 6,636,001, p < 2.2 × 10-16.

Moderate positive correlation between peak regulation strength and basin size for the three regulatory landscapes. a. CRP landscape. The scatter plot shows the association between regulation strength conveyed by a peak variant (horizontal axis, colors from dark blue [low] to yellow [high]) and the size of its basin of attraction (absolute numbers on the left vertical axis, percentages on the right vertical axis, note the logarithmic scales). The dashed vertical line represents the regulation strength of the wild type. The black curve represents a linear regression line (in linear space). R is the Pearson correlation coefficient (R=0.39) and R2 is the coefficient of determination for the linear regression model (R2 =0.18, N = 2,154) b. Fis landscape. Like panel a (R=0.37, R2 =0.15, N = 2,312), c. IHF landscape. Like panel a (R=0.41, R2 =0.19, N = 2,434).

The basins of attractions of high peaks overlap.

Each heatmap matrix quantifies the proportion of shared TFBS variants in the basins of attractions of all pairs of high peaks (Methods). Each matrix entry quantifies the basin overlap for one pair of high peaks using the Jaccard index (Supplementary Methods 7.5), where row and column indices correspond to specific pairs of peaks. The maximum overlap of one (red) signifies that two basins share all variants, while the minimum value of zero (blue) indicates that they share no variants. a. Basin overlap in the CRP landscape. The average Jaccard index is 0.461 with a standard deviation of ±0.144, based on 3,721 pairwise comparisons. b. Basin overlap in the Fis landscape. The average Jaccard index is 0.426 with a standard deviation of ±0.136, derived from 29,584 pairwise comparisons. c. Basin overlap in the IHF landscape. The average Jaccard index is 0.364, with a standard deviation of ±0.145, based on 39,601 pairwise comparisons.

Distribution of the fraction of accessible high peaks per TFBS variant across TF landscapes.

The figure shows violin plots with embedded boxplots, for the distribution of high peak accessibility of each variant within the CRP (green), Fis (orange), and IHF (purple) landscape. The y-axis quantifies peak accessibility as the fraction of the total number of high peaks that can be accessed from each (non-peak) genotype within each TF landscape, revealing the extent to which individual variants can potentially access different high peaks. In the CRP landscape, each genotype can access, on average, a proportion of 0.314±0.315 (mean±s.d.) high peaks from the total number of high peaks, with a median value of 0.194. Notably, 5,152 out of 29,821 genotypes (17%) can access no high peak. Fis: mean high peak accessibility: 0.363±0.311 (median: 0.305), with 4,304 out of 40,895 (10%) genotypes lacking access to any high peaks. IHF: mean peak accessibility: 0.301±0.289 (median: 0.205); 4,629 out of 38,872 genotypes (12%) show lack access to high peaks.

Kimura adaptive walks on the CRP landscape a-c. Fraction of adaptive walks attaining TFBS variants with strong regulation for different population sizes. The vertical axis indicates the cumulative fraction of adaptive walks on the CRP landscape that end at a genotype with a regulation strength shown on the horizontal axes. Data is based on 15 million simulated adaptive walks (15,000 random starting genotypes × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for CRP, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (Horizontal line, blue lettering). For example, a value of y=0.82 implies that 82% of walks terminate at peaks lower than the wild-type. The remaining 18% reach higher peaks. a. Population size 102. 18% of walks terminate at peaks higher than the wild-type. b. Population size 105. 10% of the walks terminate at higher peaks. c. Population size 108. 10% of walks terminate at higher peaks. d-f. Number of high peaks reached by a single adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 15 × 106 adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong, because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 102, e 105, and f 108.

Kimura adaptive walks on the Fis landscape. a-c. Fraction of adaptive walks attaining high peaks for different population sizes. The vertical axis indicates the cumulative percentage of adaptive walks on the Fis landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15 million simulated adaptive walks (15,000 starting genotypes × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for Fis, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a y-value of 0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% reaching higher peaks. a. Population size 10². 27% of walks terminate at peaks higher than the wild-type. b. Population size 10⁵. 13% of walks terminate at higher peaks. c. Population size 10⁸. 13% of walks terminate at higher peaks. d-f. Number of high peaks reached by individual adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 150 million adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 10², e 10⁵, and f 10⁸.

Kimura adaptive walks on the IHF landscape. a-c. Fraction of adaptive walks attaining high peaks for different population sizes. The vertical axis indicates the cumulative percentage of adaptive walks on the IHF landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15 million simulated adaptive walks (15,000 starting points × 1,000 adaptive walks each). The vertical dashed line indicates the regulation strength of the wild-type TFBS for IHF, and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a y-value of 0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% reaching higher peaks. a. Population size 10². 31% of walks terminate at peaks higher than the wild-type. b. Population size 10⁵. 15% of walks terminate at higher peaks. c. Population size 10⁸. 15% of walks terminate at higher peaks. d-f. Number of high peaks reached by individual adaptive walks. The histograms show the distribution of the number of high peaks visited by a single adaptive walk, based on the same 150 million adaptive walks analyzed in a-c. Note that a walk can visit multiple peaks when genetic drift is strong because genetic drift can displace a population from an adaptive peak. The mean ± s.d. of the distribution is shown at the top of each plot. Population sizes: d 10², e 10⁵, and f 10⁸.

Fraction of greedy adaptive walks attaining high peaks for different TFs. a-c. Fraction of greedy adaptive walks attaining high peaks for different TFs. The vertical axis indicates the cumulative distribution function (CDF) of adaptive walks on each TF landscape that end at a genotype with a regulation strength shown on the horizontal axis. Data is based on 15,000 adaptive walks (15,000 starting points × 1 adaptive greedy walk for each) for each TF landscape. The vertical dashed line indicates the regulation strength of the wild-type TFBS for their respective TFs and can be used to infer the proportion of walks that terminate at lower regulatory peaks (horizontal line, blue lettering). For example, a value of y=0.9 implies that 90% of walks terminate at peaks lower than the wild-type, with the remaining 10% terminating at higher peaks. a. CDF for adaptive walks in the CRP landscape. Note the intersection at y= 0.89, meaning that 11% of the walks terminated on high peaks. b. CDF for adaptive walks in the Fis landscape. Note the intersection at y= 0.85, meaning that 15% of the walks terminated on high peaks. c. CDF for adaptive walks in the IHF landscape. Note the intersection at y= 0.82, meaning that 18% of the walks terminated on high peaks.

Diversity of alternative shortest paths for pairs of starting genotypes and their respective attained peaks.

Each violin plot is based on data from 15,000 randomly (and uniformly) sampled starting (non-peak) genotypes from each TF landscape – CRP, Fis and IHF (see color legend) – and on 1’000 Kimura adaptive walks per starting genotypes. For each pair of a starting genotype and a high peak that is attainable from this starting genotype, we calculated the number of distinct shortest paths reaching the peak (vertical axis, note logarithmic scale). The width of the violin plot at any given y-axis value represents the density of data points at that value. Wider sections indicate a higher density of data points. The text at the top of each violin indicates the median number of shortest paths and the interquartile range (IQR) for each TF. For example, in the CRP landscape peaks are reached via a median number (± IQR) of 4 (± 4) shortest paths, with a sample size (number of starting genotypes × attainable peak pairs) of N = 345,224. The violin regions appear discretized because the number of shortest paths is a discrete variable, not a continuous one. This means that the data can only take on specific integer values, resulting in discrete regions within the violin plots with a non-zero density of data points.

Peak genotypes conveying stronger regulation are also reached more often during adaptive evolution.

Each scatter plot represents the association between the strength of regulation conveyed by a peak genotype (horizontal axis) and the number of times the peak is attained (vertical axis, note the logarithmic scale). The data for each plot is based on 15,000 randomly (and uniformly) sampled starting (non-peak) genotypes from each TF landscape – CRP, Fis and IHF (see color legend) – and on 1’000 Kimura adaptive walks per starting genotypes (15×106 adaptive walks per landscape). The heatmap colors represent the density of peak genotypes from low (dark blue) to high (yellow). The dashed vertical line represents the regulation strength of the wild type TFBS. The red curve is derived from linear regression (in linear space). P-values in each panel result from a test of the null hypothesis that the two variables are not associated (Pearson correlation coefficient R=0; Coefficient of determination R2=0). a. CRP landscape (R= 0.56; R2 =0.34, N = 2,154). b. Fis landscape. (R=0.6, R2 =0.36, N = 2,312), c. IHF landscape. (R=0.62, R2 =0.4, N = 2,434).

The frequency distribution of attained high peaks is biased towards some frequently attained peaks.

We randomly and uniformly sampled 10 starting genotypes, started 103 adaptive walks from each, and recorded the number and frequency of distinct high peaks attained in these random walks. Each starting variant is symbolized by a vertical bar. The number of stacks within each bar (delineated by horizontal lines, also indicated by an integer above each bar) shows the number of high peaks reached by the 103 adaptive walks. Starting variants are ordered in ascending order based on this number of attained peaks. Stack height indicates the fraction of walks that reached the same peak, and is indicated in red, orange, and yellow for the three most frequently attained peaks. a. Fis landscape. b. IHF landscape.

The frequency distribution of traversed paths to high peaks is biased towards some paths that are frequently traversed.

We randomly and uniformly sampled 20 starting genotypes, initiated 103 adaptive walks from each, and recorded the number and frequency of distinct paths traversed by these random walks for each attained peak. We then randomly and uniformly sampled a high peak attained by each starting variant for our analysis. Each starting variant is represented by a vertical bar. The number of stacks within each bar indicates the number of unique paths traversed to reach the sampled high peak. Stack height represents the fraction of walks that reached the high peak, with the three most frequently transversed paths in red, orange, and yellow. a. CRP landscape. b. Fis landscape. c. IHF landscape.

Acknowledgements

We gratefully acknowledge financial support from the Swiss National Science Foundation grant 310030_208174. We also extend our thanks to the UZH University Priority Research Program in Evolutionary Biology, the UZH flow cytometry facility, and the Functional Genomics Center Zurich for their technical support. We are especially thankful to Andrei Papkou for his guidance with computational analysis and for engaging in theoretical discussions.

Additional information

Author contributions

C.A.W. and A.W conceived the study and designed experiments. C.A.W. carried out experiments. C.A.W and A.W. analyzed data. C.A.W wrote computer code to carry out bioinformatic work and analysis. C.A.W generated figures. L.G. wrote computer code to carry out simulations. C.A.W. and A.W. wrote the paper, which was edited by all authors.

Supplementary methods

1. General procedures

Although all general procedures have been previously described 27, we describe them again below for completeness.

1.1 ​Media and reagents

To prepare SOB medium, we mixed 25.5g of solid medium stock (VWR J906) with 960 ml of purified water and subjected the resulting suspension to autoclaving. To prepare the SOC medium, we dissolved 20 ml of 1 M D-glucose (Sigma G8270) and 20 ml of 1 M magnesium sulfate (Sigma 230391) in 960 ml of pre-prepared SOB solution. For the LB medium, we combined 25g of solid medium stock (Sigma-Aldrich L3522) with 1 liter of purified water and then autoclaved it. To prepare the M9 minimal medium, we diluted M9 minimal salt sourced from Sigma (M6030) in distilled water as per the manufacturer’s guidelines, autoclaved the solution, and added 0.4% glucose (Sigma G8270), 0.2% casamino acid (Merk Millipore, 2240), 2 mM magnesium sulfate (Sigma 230391), and 0.1 mM calcium chloride (Sigma C7902). Where required, we supplemented growth media with chloramphenicol (50 µg/mL), anhydrotetracycline (100 ng/mL, Cayman-chemicals #10009542), and/or glucose (0.4% w/v final concentration). We prepared anhydrotetracycline by diluting the dried chemical in absolute ethanol, from a stock concentration (1000X) of 100 µg/mL to a working concentration of 100 ng/mL.

1.2 ​Overnight incubation of cultures in liquid and solid medium

We cultivated bacteria in liquid LB medium (using either 15mL or 50mL Falcon tubes), enriched with chloramphenicol at a concentration of 50 µg/mL. We incubated these cultures for a period of 16 hours at a temperature of 37°C, with a shaking speed of 200rpm and 50 mm orbital motion, using an Infors HT Multitron Incubator Shaker. Similarly, for cultures in solid medium, we grew bacterial colonies on LB-agar plates (using sterile plastic Petri dishes of 90mm × 15mm dimensions), also supplemented with chloramphenicol at a concentration of 50 µg/mL, and incubated them for the same time and at the same temperature.

1.3 ​PCR Reactions

Except where specifically mentioned, we amplified DNA fragments through a polymerase chain reactions (PCR), employing Q5® high-fidelity polymerase (NEB #M0491L) to minimize mutation introduction into the amplicons. We followed the protocol recommended by NEB, aiming for a final reaction mixture of 50uL. We conducted each PCR reaction twice, and combined the products from these duplicates after the completion of the reaction. We determined the primer melting temperatures (Tm) using the NEB Tm calculator (accessible at https://tmcalculator.neb.com/#!/main), with primers at a concentration of 500nM.

1.4 ​Verifying PCR products through gel electrophoresis

Unless indicated otherwise, we verified successful PCR amplification via gel electrophoresis to ensure the presence of singular-band amplicons and the absence of non-specific bands. This process involved separating PCR products in a 0.8% agarose Tris-EDTA (TAE) gel. We conducted electrophoresis for 45 minutes at 120V, or until the bands had progressed beyond the halfway point of the gel’s total length.

1.5 ​DNA purification with commercial kits

Once we had confirmed a successful PCR through gel electrophoresis, we purified the PCR products utilizing the Monarch® DNA PCR/Gel Extraction Kit (NEB #T1020L), adhering to the manufacturer’s protocol. In situations requiring further purification (such as the occurrence of non-specific bands post-PCR), we performed gel purification. This involved mixing 10 μL of 6x NEB DNA dye with each 50 μL PCR product, then loading all 60 μL onto a 1% agarose gel. We carried out electrophoresis for 45 minutes at 120V, or until the bands had moved more than halfway through the gel. We then excised the DNA band corresponding to the amplified sequence using a scalpel. For extracting DNA from the gel, we used the Monarch® DNA Gel Extraction Kit (NEB #T1020L).

1.6 ​Gibson assembly168

We assembled PCR-amplified fragments using the NEBuilder-HiFi® DNA Assembly Master Mix kit (NEB #E2621L). We determined the molarity required for assembly based on the guidelines outlined by the Barrick Lab (details available at https://barricklab.org/twiki/bin/view/Lab/ProtocolsGibsonCloning). We incubated the assembly mix for one hour at 50°C in a dry bath incubator, followed by cooling on ice for subsequent steps.

1.7 ​Preparation of electrocompetent cells

We utilized glycerol/mannitol step centrifugation to prepare electrocompetent cells 169. To this end, we first cultured the appropriate E. coli strain (Supplementary Table S4) in 5 mL SOB medium at 37°C with a shaking speed of 250 rpm overnight. The next day, we transferred 3 mL of the culture to 300 mL SOB medium, and incubated under the same conditions until the OD600 reached a value between 0.4 and 0.6 (measured at an optical path length of 1 cm), which took approximately 2-4 hours. After cooling the culture on ice for 15 minutes, we centrifuged the cells at 4°C and 1,500 g for 15 minutes. We then resuspended the cells in 60 mL of ice-cold distilled water (dH2O) and divided them into three 50 mL tubes. Gradually, we added 10 mL of an ice-cold glycerol/mannitol solution (consisting of 20% glycerol (w/v) and 1.5% mannitol (w/v)) to each tube using a 10 mL pipette. We centrifuged the tubes at 1,500 g and 4°C for 15 minutes in an Eppendorf 5810/5810 R centrifuge with acceleration/deceleration set to zero. After discarding the supernatant, we resuspended cells in 3.0 mL of the same glycerol/mannitol solution. We transferred these suspensions to pre-cooled 1.5 mL tubes, and incubated them in a dry ice-ethanol bath for about 1 minute. Finally, we stored the suspensions at −80°C for future transformation experiments.

1.8 ​Electroporation

In all transformation experiments described in this study, we used 100 µL of electrocompetent cells for electroporation, employing 0.2 cm cuvettes (EP202, Cell Projects, UK) and a Micropulser electroporator (Bio-Rad) set to the EC3 setting (15k V/cm). Post-electroporation, we recovered cells in 1 mL of SOC media, warmed in advance in 15 mL Falcon tubes. This recovery step lasted for 1.5 hours at 37°C with a shaking speed of 220 rpm. Unless specified otherwise, we spread 300 µL of each recovered culture on an LB agar plate that contained 50 μg/mL chloramphenicol. We then incubated this plate overnight for 16 hours at 37 °C. Subsequently, we confirmed the identity of the clones on the plate via Sanger sequencing.

2. The design of plasmid pCAW-Sort-Seq-V2

The plasmid pCAW-Sort-Seq-V2 is a derivative of the plasmid pCAW-Sort-Seq27. Briefly, plasmid pCAW-Sort-Seq harbours a pBBR1 replication origin, which ensures a broad host range and a low copy number—typically between 5 to 10 copies per cell170.It also harbours a chloramphenicol resistance gene, a TetR repression system171 and a TFBS measuring module. The latter consists of an interchangeable TFBS that lies between a constitutive promoter and a superfolder GFP (sfgfp172) reporter gene. In this system, TF-TFBS interactions can decrease GFP production by physically obstructing the bacterial RNA polymerase, a phenomenon known as steric hindrance. The reporter gene sfgfp is insulated by a transcriptional insulator named RiboJ, a synthetic ribozyme that removes 5’UTR interferences from variable TFBS sequences in the mRNA by self-cleavage173. More details and features have been previously described in ref. 27. Here, we have integrated a bicistronic expression cassette into pCAW-Sort-Seq to enable the regulated expression of a TF of choice. This cassette harbors the gene encoding one of our focal TFs situated upstream of a mscarlet-I 174 reporter gene to monitor the expression of this bicistronic operon via fluorescence. Regulation is achieved through the pLtetO-1 171 promoter, a synthetic promoter tightly repressed by TetR. Expression is initiated by the addition of anhydrotetracycline (Cayman Chemicals, catalog #10009542), which relieves TetR-mediated repression and activates the expression of the entire cassette.

3. Construction of the plasmid pCAW-Sort-Seq-V2 and its variants

We initiated the construction of pCAW-Sort-Seq-V2 by synthesizing (Twist Biosciences, Calfornia, USA) and cloning a codon-optimized mscarlet-I gene expression cassette into the pCAW-Sort-Seq plasmid. This gene is oriented antiparallel to the tetracycline resistance gene (tetr). This cloning step resulted in the formation of the intermediate pCAW-Sort-Seq-mScarlet plasmid. Subsequently, we amplified for each of our three TFs the encoding gene (crp, fis and ihf) from the genome of the E. coli strain BW25113 and inserted it upstream of the mscarlet-I gene in a bicistronic operon configuration using Gibson assembly. This process produced TF-specific expression plasmid variants. The final phase involved removing the core promoter to create negative controls for each TF and cloning libraries into each plasmid variant upstream of the sfgfp gene. This yielded the plasmids used in our experiments (Supplementary Table S1). The specific sets of primers used for each PCR reaction are listed in Supplementary Table S6.

4. Library design, synthesis, and cloning

We based the design of the TFBS library for each TF on consensus sequences available in the RegulonDB database47, the most comprehensive database for E. coli transcriptional studies. We designed each library by randomizing the eight most important TFBS positions, such that each of the four nucleotides had an equal probability (0.25) to occur at each position. The expected library size for this approach is (48) = 65,536 sequences. Library compositions can be found in Supplementary Table S3. We designed libraries in the Snapgene® software (snapgene.com), and had them synthesized by IDT (Coralville, USA) as single-stranded DNA Ultramers® of 140bp (4nmol). We resuspended each library in nuclease-free distilled water and serially diluted it to a concentration of 50ng/uL. We used Ultramers® as templates in a PCR reaction for the formation of dsDNA fragments and library amplification. We amplified the Ultramer® DNA molecules with the following PCR program: 98°C/30 s; 25 cycles of 98°C/10 s, 60°C/15 s and 72°C/80 s; and 1 cycle of 72°C/5 min. We opted for a maximum of 20 cycles to reduce amplification biases. We analyzed amplification products through gel electrophoresis to confirm that only a single product band with the expected size of around 140bp was present for each PCR reaction. After confirming the presence of single bands, we purified the products using the Monarch® DNA gel extraction kit (NEB #T1020L). Whenever unspecific bands appeared during electrophoresis, we gel-purified the PCR products using the same kit.

For the construction of each plasmid-based library, we digested 1μg of the purified library with HindIII-HF (NEB #R3104) and BamHI (NEB #R3136) restriction enzymes in a 100μL reaction, followed by overnight incubation at 37°C. We isolated the appropriate cloning plasmid from its host strain using the QIAprep spin miniprep kit (Qiagen, Germany), and digested it with the same enzymes. Post-digestion, we added 3μL of Quick CIP (calf intestinal alkaline phosphatase) to prevent self-ligation by dephosphorylating DNA ends. We purified the ligated DNA using the Monarch® DNA gel extraction kit (NEB #T1020L).

We performed the ligation with a 10:1 molar ratio of insert-to-vector, using 100ng of vector, 10 units of T4 DNA ligase (NEB #M0202L), and 2 µL of 10X ligation buffer in a 20 µL reaction. We incubated the mixture at 20-22°C for approximately 16 hours, followed by 15-minute inactivation of the ligase at 65°C. We purified the ligation product, resulting in 10 µL of DNA resuspended in dH2O. We transformed E. coli SIG10-MAX® cells by electroporation with this purified product.

Post-transformation, we plated 50 µL of each recovered culture on LB agar for colony-forming unit counting (cfu) and transformation efficiency estimation. From the agar plates, we selected 30 colonies for colony PCR and Sanger sequencing to assess library diversity (NightSeq® service, Microsynth, Switzerland). Our mean transformation efficiency was 106 cells per transformation. We diluted the remaining 950 µL of transformants in 9 mL of LB medium with chloramphenicol and cultured it overnight. We then aliquoted the overnight culture into 1 mL cryotubes with 20% glycerol and stored at −80°C.

After transforming plasmid libraries into the SIG10-MAX® strains for reasons of transformation efficiency and plasmid maintenance, we extracted the plasmids using a QIAprep spin miniprep kit (Qiagen, Germany), and transformed them into the appropriate host mutant host strains Δcrp, Δfis and Δihf (see the strain genotypes in Supplementary Table S2). We cultured each library-transformed strain overnight, aliquoted it in 1 mL cryotubes with 20% glycerol, and stored it at −80°C for further experimentation.

5. Analysing and sorting cells

In preparation for cell sorting, we cultivated cells harboring each library in liquid LB medium enriched with chloramphenicol. Specifically, we cultured 1mL of transformed cell aliquots and a streak of cells with a control plasmid (a promoterless pCAW-Sort-Seq-V2 plasmid lacking sfGFP expression) in 50ml Falcon tubes with 9mL LB medium (containing 50 µg/mL chloramphenicol) overnight. Subsequently, we diluted these overnight cultures at a 1:100 ratio (v/v) in LB medium with chloramphenicol and divided them into two aliquots. We supplemented one aliquot with the anhydrotetracycline (Atc) inducer to express the plasmid-encoded TF. The other aliquot remained unchanged. We incubated both cultures for 5 hours until late-exponential/early-stationary phase (200RPM, 37°C). Subsequently, we diluted 20µL of the cultures in 1mL of cold filtered Dulbecco’s PBS (Sigma-Aldrich #D8537) in 15 mL FACS tubes.

We performed FACS-sorting on a FACS Aria III flow cytometer (BD Biosciences, San Jose, CA) using a 70 µm nozzle. We utilized a 488 nm laser for detecting forward scatter (FSC) and side scatter (SSC) with a 488nm/10nm band-pass filter. We set the flow rate to 1.0, adjusting sample dilution as necessary to achieve no more than ≈10,000 events/second. Considering the small size of bacterial cells, we reduced the particle detection threshold to the lowest feasible setting (200 arbitrary units on FSC and SSC channels), increasing it to a maximum of 500 units if background noise was excessive. We then adjusted FSC-H and SSC-H for cells with the negative control plasmid to center the bacterial population in the cytometer’s software (BD FACSDiva™ Software v9.0). We based the sorting and binning of cells on both mScarlet-I (PE-Texas-Red channel, excitation laser: YellowGreen 561nm, LP filter: 600 nm, BP filter: 610/20nm) and sfGFP fluorescence (FITC channel, excitation laser: 488nm, LP filter: 502nm, BP filter: 530nm/30nm), setting both the PE-Texas-Red and FITC channels voltages so that the median fluorescence of the negative control was between 0 and 100 (arbitrary units) on the FITC-H and PE-Texas-Red-H axes.

Initially, the sorting process involved establishing a gate for identifying cells that are red-fluorescence-positive, i.e., reporter and thus TF-expressing. To establish this gate, we began by measuring the autofluorescence of our negative control culture on the PE-Texas-Red-H axis. Thereafter, we examined the fluorescence of a positive control that had been induced with Atc and was expressing the mScarlet-I protein. We then configured a gate around the positive mScarlet-I-expressing population, ensuring that all cells sorted through this gate exhibited a level of reporter expression higher than the negative control.

Next, we established the green-fluorescent sorting gates. For setting sorting gates on the FITC-H axis, we first recorded the autofluorescence of the negative control culture. This median autofluorescence defined the upper boundary of the lowest bin (B1) for the experimental population. We then analyzed the fluorescence of 106 cells expressing sfGFP and containing the library, without sorting, in order to establish the boundaries of the next binning gates. We took the lower bound of the highest bin (B13) for the experimental population to correspond to the 95th percentile of the fluorescence distribution of this population. We chose boundaries between the remaining 11 intermediate bins with equidistant spacing on a binary logarithmic (log2) scale. After we had determined the gates in this way, we calculated the fractions of the previously 106 recorded cells that fell inside each bin.

To maintain statistical robustness and minimize sampling error in our downstream analyses, we wanted to ensure that each of the 65,536 unique sequences in our library was represented by at least 30 cells after sorting. In addition, it was crucial to account for an anticipated diversity loss of up to 70% of the cell population during sorting, due to factors such as cell death and the dilution of low-frequency and lower-fitness genotypes during the post-sorting recovery phase 102. To establish the number of cells required to be sorted initially, we thus determined a multiplicative factor f based on the minimally needed number of 30 cells per sequence and the expected cell retention rate post-sorting of 30%, i.e., cells/sequence. We thus multiplied our initial library size by 100 for sorting purposes, i.e., we sorted total of 6,553,600 cells. This calculation aims to ensure that despite a substantial reduction in cell numbers post-sorting, each unique sequence remains adequately represented in the surviving cell population. We sorted cells into 1.5 mL Eppendorf tubes, each containing 500 µL of LB medium, and kept at 4 °C to prevent growth during sorting and sample processing. We replicated the entire sorting procedure three times, each based on independent library transformations.

After sorting, we added 1mL of LB without antibiotics to each tube and removed 20 uL of the resulting volume for serial dilutions. We transferred the remaining liquid culture (980 uL) to 50 mL falcon tubes and allowed each culture to recover for 2 hours (37°C, 220 rpm). After recovery, we added 9mL of LB supplemented with chloramphenicol, and grew the cultures overnight for freezing part of them in glycerol stock aliquots, for validating our binning procedure (see below), and for extracting plasmid DNA for subsequent PCR and sequencing steps. We used the 20 uL initially removed from each 1mL culture for preparing two serial dilutions (10-4 and 10-6) that we plated on LB-Cm agar plates (200uL per plate) to estimate the post-sorting viability through colony forming unit (cfu) counting. By knowing how many cells were sorted into each bin, we can estimate how many cells would be expected in our dilutions and compare this number with the cfu counts we observed. In this way, we estimated that on average (across bins) 77% of cells remained viable (standard deviation: 18.15%). We also estimated the genetic diversity of the library through Sanger sequencing of DNA from individual colonies (NightSeq® service, Microsynth, Switzerland).

To validate our binning procedure, we re-grew binned cultures from either overnight recovered cultures or frozen aliquot stocks. We then measured their expression distributions by flow cytometry, which reproduced the original expression measurements. This approach allowed us to compare the fluorescence distributions of the re-grown cultures with the pre-sorting distributions. Specifically, we assessed whether the geometric mean of the fluorescence distribution for each sorting bin matched those recorded during the initial sorting procedure. Next, we re-sorted cells derived from each bin in order to eliminate cell cross-contaminations and other factors that could alter the bin distributions. The geometric mean is often preferred over the arithmetic mean in this type of analysis, because it is less affected by the presence of outliers in the data 175,176. It is also a more accurate representation of the central tendency of data that is log-normally distributed, which is often the case with flow cytometry fluorescence measurements175,176.

6. DNA extraction and sequencing

We diluted 500 uL of individual glycerol stocks of each replicate subpopulation of sorted cells (i.e., cells from each bin of fluorescence intensity) in 5mL of LB supplemented with chloramphenicol in 15mL Falcon tubes, and grew the resulting cell culture overnight (16 hours, 37°C, 220 rpm). On the next day, we isolated plasmids from each culture using a QIAprep® spin miniprep kit (Qiagen, Germany). In order to allow the sequencing of multiple pooled samples (multiplexing), we barcoded our regulatory region through PCR with specific HPLC-purified primers (Supplementary Table S6) provided by Eurofins (Konstanz, Germany). We added barcodes to the 5’region of the amplicon through a PCR reaction. We performed this PCR reaction with the Q5 high-fidelity polymerase, and did so in triplicate for each miniprep-isolated plasmid library. To calculate primer melting temperatures (Tm), we used the NEB Tm calculator (https://tmcalculator.neb.com/#!/main) for a primer concentration of 500nM. We performed the PCR with the following program: 98°C/30 s; 25 cycles of 98°C/10 s, 64°C/30 s and 72°C/30 s; and 1 cycle of 72°C/2 min.

After PCR amplification, we digested the reaction products with the restriction enzymes DpnI (NEB #R0176L) and Exonuclease I (NEB #M0293L) in order to remove traces of genomic DNA, plasmids, and single-stranded DNA that could interfere with sequencing. The Master Mix we used for a single digestion harbored 1µL of 10x CutSmart® Buffer (NEB #B6004S), 1µL of Exonuclease I (NEB #M0293L), 1µL of DpnI restriction enzyme (NEB #R0176L), and 7µL of distilled nuclease-free water. For each PCR product, we added 10µL of the Master Mix. We incubated the reaction for 1 hour at 37°C, following 15 minutes at 80°C for deactivation of the enzymes.

We purified the digestion products using the Monarch® DNA PCR/Gel Extraction Kit (NEB #T1020L) and fractionated them through gel electrophoresis to confirm that only a single band with a size ≈150bp was present. After this confirmation, we pooled the purified PCR products from the different bins of each replicate sorting equimolarly to a total mass of 2,600 ng and a volume of 100 µL (26ng/µL of DNA) in 1.5mL Eppendorf tubes. We then sent the pooled purified amplicons for adapter ligation and sequencing at Eurofins (NGSelect Amplicons® on Illumina HiSeq), obtaining 15 million paired-end reads (2 × 150 bp) for all samples.

We quantified the total number of reads (t) required for sequencing based on several factors, including the number of bins (b), biological replicates (r), strains (s), genotypes (g), and the reads per genotype (p). Using these variables, we first calculated the total number nbc of needed barcodes as follows:

For each of our three TFs, the values of these parameters are b=13 bins, r=3 biological replicates, and s=1 strains, yielding nbc=39. The total number of required sequence reads can be estimated through the equation:

where we aimed for p=30 paired-reads per genotype. The number g of expected genotypes per library is the same (65,536) for each TF. These values lead to a total of t = 3 × 65,536 × 30 = 7,667,712 required paired-end reads for each library.

7. Data analysis

7.1 ​Filtering and preparing sequencing reads

We processed the sequencing data with a blend of custom python and awk scripts, complemented by established bioinformatics utilities. Firstly, we trimmed sequences by computationally excising Illumina adapters, followed by merging paired-end reads. We then organized the paired-end reads into distinct files, each tagged with a sequence barcode identifying the sequence bin from which the corresponding reads originated.

For the removal of Illumina adapter sequences from the paired-end reads, we employed Cutadapt 177 with the following parameters:

Here, “ADAPTER_FWD” and “ADAPTER_REV_RC” are placeholders for the actual adapter sequences used, which are 5’-AGATCGGAAGAGCACACGTCTGAACTCCAGTCA-3’ for read 1, and 5’-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3’ for read 2.

Because our amplicons were short (153 bp) and each sequencing sample was divided into two files with overlapping single-end reads, merging the two files (single-end reads) was necessary, for which we used the FLASH software 178. After the filtering and merging of reads, we demultiplexed paired-end reads with the following FLASH parameters:

Next, we utilized the FastX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)179 to discern and preserve only those sequences that exceeded a high-quality threshold of Q = 33. Subsequently, we refined our dataset by removing sequences that contained undesirable mutations or insertions/deletions (indels) within the region extending from the start of the core promoter to the end of the variable TFBS library. We performed this filtering step using a custom R 180 script. Thereafter, we used custom awk and R180 scripts to convert the data into a table. Each row of this table contains data from one TFBS in the library, and each column contains the number of reads for this TFBS from one of the thirteen bins into which we had sorted cells. We estimated the average regulation strength of each TFBS from this data.

7.2 ​Calculating regulation strengths

We calculated regulation strengths as previously described27. Briefly, in sort-seq experiments, sequences often appear in multiple fluorescence bins due to random mis-sorting48,137 and the stochastic nature of gene expression165. Following methods established in previous research 39,49,103, we determined the reporter expression level driven from each TFBS variant by calculating a weighted average of bins in which the variant occurred. This involved multiplying the frequency of each sequence (xi) in a given bin i by a numerical value representing that bin (wi=1,2,3,…,13), and then averaging these products over the total sequence count. Mathematically, this weighted average calculates as

where e is the expression level driven by the sequence.

This approach yields a continuous spectrum of expression levels e within the range of 1 to 13. High expression values indicate robust GFP expression and, thus, weak binding of a TF to a TFBS variant. To facilitate interpretation, we inverted this scale so that higher values indicate stronger repression and lower GFP expression, i.e., we define the regulation strength b (strength of regulation) as follows:

where emax=13 is the maximal expression level (maximal bin value). We then normalized b by the maximal regulation strength among all TFBSs we studied for a given TF. The resulting normalized regulation strength scores b vary from 0 to 1, where low scores denote TFBS variants with weak TF binding (weak reporter repression), and high scores indicate variants with strong TF binding (strong reporter repression). A score of b =1 signifies the highest regulation strength (repression) calculated among all TFBSs for a given TF.

7.3 ​Combining data from triplicates

As a quality-filtering step, we eliminated TFBS variants that did not appear in all three replicates or that were represented by less than 30 reads in total (summing their read counts over the thirteen bins). Given that we have 13 bins and most sequences are typically found within an average of 3.4 bins, we wanted a minimum of 10 reads per bin to prevent misinterpretation of repression levels and also guide our threshold’s choice. This procedure reduced the fraction of TFBS variants for which we had sort-seq-based sequence data from 95%, 90%, and 93% of the total library size (65,536 sequences) for CRP, Fis and IHF, respectively, to 49%, 66% and 63%.

Subsequently, we calculated the regulation strengths b of the remaining TFBS variants (Supplementary Materials and Methods 7.2). Then, we determined the coefficient of variation of regulation strengths across replicates for each TFBS variant, which estimate the consistency of measured regulation strengths among replicates. Notably, most TFBS variants exhibited a coefficient of variation (CV) below 0.5, which was the threshold we set for filtering sequences. Sequences with a CV above 0.5 were excluded from the analysis. This threshold is commonly used in transcriptional studies employing fluorescent reporters to account for transcriptional noise and measurement variability 165. Finally, we averaged regulation strengths for each TFBS variant across all replicates and normalized the resulting averages by the maximal observed regulation strength among all TFBSs of a given TF.

7.4 ​Frequency matrices and sequence logos

We generated frequency matrices of nucleotides that occur in TFBSs from each fluorescence bin by counting the frequency of each nucleotide at each variable position of the TFBS library. From this data, we generated heatmaps and DNA sequence logos representing the frequency matrices graphically. A sequence logo consists of a stack of the letters A, C, G, and T, at each position of a DNA sequence, where the relative size of each letter indicates its frequency in the sequence. The total height of the stack corresponds to the information content of that position, in bits 68,181. A sequence logo is a graphical representation of the informational properties of a TFBS. When mutated, nucleotides with high information content are more likely to lead to a loss of binding (repression) than nucleotides with low information content.

7.5 Creation of genotype networks and determining network metrics

We used in-house Python script and the Python package igraph182 to generate directed genotype networks. These are graphs in which TFBS variants are nodes (vertices, genotypes), and variants that differ in a single nucleotide are connected by an edge. Each node of this network is associated with the corresponding DNA sequence and the associated regulation strength. Each edge is directed, i.e., it corresponds to a binding-score-increasing mutation, and points from a TFBS variant with lower regulation strength to a neighbor with higher regulation strength. We extracted the largest weakly connected subgraph (the“giant component” 6) of the network, and used it for all further analyses. This giant component comprises the vast majority (99%) of sequenced genotypes for all of our TF landscapes. We used in-house Python and R scripts for all network analyses described below.

Epistasis

Epistasis refers to non-additive interactions between two or more mutations. It can impose severe constraints on molecular evolution, because the mutations that are beneficial in one genetic background may be deleterious in another37. Epistasis between two mutations can be classified as magnitude, simple sign, or reciprocal sign epistasis, depending on the sign (i.e., positive or negative) of the fitness effect of individual mutations and their combinations37. In magnitude epistasis, the effect of a mutation on regulation strength varies depending on the genetic background but the sign of this effect (increasing or decreasing regulation strength) does not. Simple sign epistasis occurs if one single mutant has a lower regulation strength than both the wild type and the double mutant, while the other single mutant has a regulation strength that is intermediate to the wild type and double mutant. Reciprocal sign epistasis occurs when both mutations independently decrease regulation strength, but their combination increases regulation strength. The presence of reciprocal sign epistasis is a necessary condition for the existence of multiple peaks in an adaptive landscape75,76.

To determine the incidence of epistasis in our landscapes, we employed a method that involves identifying all “squares” in a genotype network with the motifs function from the igraph library in R. Each square consists of a “wild-type” sequence, a double-nucleotide mutant, and the corresponding two single mutants. We assessed epistasis for each square along a single axis by selecting the highest-regulation strength sequence as the double mutant. Our analysis placed each square into one of three categories: no sign epistasis, simple sign epistasis, and reciprocal sign epistasis. The no sign epistasis category included both magnitude epistasis and additivity (no epistasis) without differentiating between them, because neither affects the accessibility of landscape peaks through only binding-increasing mutations11,108. We determined the proportion of all squares that fell into each category (see Supplementary Table S5).

Peaks

A peak is a genotype (TF binding site variant) whose neighbors all convey lower regulation strength than itself. Two peaks are connected if they are neighbors and convey the same regulation strength. We refer to the genotype with the highest regulation strength as the summit or global peak6,183.

Genotype connectivity

To evaluate the sparsity of our networks and the likelihood of peak misclassification due to the incompleteness of our landscapes, we used a metric called “genotype connectivity.” This represents the number of immediate (1-mutant) neighbors of a given genotype for which our experiment produced repression strength data. A perfect experiment creating a complete landscape data set would provide repression data for each of our 48 genotype and all of its 24 immediate neighbors. In practice, however, data for some genotypes and some of their neighbors is unavailable. We quantified the “relative connectivity” of any one genotype as the ratio of the actual number of adjacent genotypes for which we have repression data to the theoretical maximum of 24 neighbors. This measure helps us to address questions about the sparsity of our landscape and its implications. Specifically, it allows us to determine if our sample is biased, with some regions or genotypes being more connected than others. This is particularly important for assessing whether peaks and high peaks are less connected than non-peaks, which could potentially lead to peak misassignments.

Accessible Paths

In the context of our genotype networks, we call a mutational path accessible if the regulation strength increases with each mutational step 6,183. We systematically enumerated all shortest accessible paths, distinguishing them by their path length, i.e., by the number of mutational steps in a path. Notably, there can be multiple alternative accessible paths with the same number of mutational steps from a single starting genotype to a peak.

Basins of attraction

The basin of attraction of a peak comprises all TFBS variants from which accessible paths to the peak exist. We refer to the basin’s size as the number of variants in the basin. We determined basin sizes by exhaustive enumeration.

Overlap between basins

The basins of attraction of different peaks may comprise overlapping sets of variants. To determine the overlap between two basins B1 and B2, we used the Jaccard index J184,185, which is equal to the size of the intersection between two sets of variants divided by the size of their union:

7.6 ​Generating randomly shuffled landscapes

To generate uncorrelated random landscapes for our three TF landscapes, we performed a random shuffling of regulation strength values from our genotypes, followed by landscape analysis to identify peaks. To this end, we employed an ad-hoc python script to randomly permute experimentally measured regulation strength values among all genotypes in each of our landscape. This procedure preserves the distribution of regulation strengths in each landscape, but also creates an uncorrelated random landscape73,79. Following random shuffling, we analyzed each shuffled landscape’s topography with the same methods we had applied to the original (unshuffled) landscape to identify its peaks. To ensure statistical robustness, we repeated the random shuffling and peak identification 100 times. We then compared the number of peaks between the 100 shuffled landscapes and the original (unshuffled) landscape.

7.7 ​Principal component analysis

In order to investigate which nucleotides in a TFBS are most important for changes in regulation strengths, we performed a principal component analysis for each landscape (Supplementary Figures S19-S21). To this end, we first one-hot-encoded our data. This encoding represents each categorical value (nucleotide at each position of a DNA string) as a binary vector of length 4. It thus converts an entire DNA string of length L into a 4xL binary matrix. We used this binary matrix to perform PCA with the R base function prcomp.

8. Simulated adaptive walks

We simulated the adaptive evolution of a population on our landscapes by performing two different types of random walks, “greedy” adaptive random walks, and random walks based on Kimura’s model of fixation probabilities.

For all simulations, we assumed that only point mutations occur and that the time it takes for a point mutation to become fixed in a population is much shorter than the time it takes for a new mutation to appear that will eventually go to fixation. This scenario is also called the strong selection weak mutation (SSWM) scenario 39,8183. Under this scenario, evolving populations are monomorphic most of the time, i.e., all individuals have the same genotype. This scenario is realistic when the product of effective population size N and mutation rate μ is small ( < 1), which is the case for E. coli (N=1.8×108, μ=2×10-10) 43. It allows us to model adaptive evolution as an adaptive random walk in our landscapes. Our simulations also account for mutation bias, which means that different types of mutation have a different probability of occurring in a population. We use mutation biases that were experimentally determined for Escherichia coli86.

In a greedy adaptive random walk, starting from any one genotype, only the mutational neighbor that conveys the largest fitness advantage is fixed in a population. We initiated one greedy random walk from each non-peak genotype and terminated the walk once a fitness peak was reached, i.e., once every neighbor of the current genotype had lower fitness than the genotype itself. Because every greedy random walk is deterministic in the absence of neutral mutations, it is sufficient to initiate one greedy random walk per starting genotype.

Another well-established model for random walks allows for genetic drift. It uses fixation probabilities computed by Kimura80,186,187 i.e., fij = (1 – e-2s) / (1 – e-2Ns), where fij is the probability of fixing mutation j in the background of genotype i, N is the effective population size, and s is the selection coefficient, i.e. the difference in fitness between genotypes i and j80,186. For a given pair of genotypes, the only parameter of this model is the effective population size N, for which we explored values of N=108, N=105 and N=102. Generally, the smaller the population size is, the larger is the probability that neutral or deleterious mutations become fixed in a population.

For these “Kimura” adaptive walks, we chose 15,000 random starting genotypes for each of the three values of N, and simulated 1,000 random walks for each of them. At each step, we randomly picked a mutation j, generated a random number in the interval [0, 1], and considered the mutation to become fixed if the random number fell within the interval [0, fij]. Computationally, we accelerated this process by precomputing all fixation probabilities for all genetic backgrounds, and then used the python function numpy.choice to generate a random sample from the multinomial distribution of fixation probabilities at each step188. Because of genetic drift, Kimura’s random walks do not necessarily terminate when they reach a fitness peak. For this reason, we simulated each such random walk for a maximum of 25 mutational steps.