Mutating parent sequences reveals vastly different probabilities of promoter emergence.

(A) The location of -10 and -35 boxes in a subset of the parent sequences. See Fig S1 for the complete set (n=25). Orange trapezoids correspond to -35 box motifs, and magenta trapezoids to -10 box motifs, each identified using position-weight matrices (see sequence logos below and Fig S1). (B-C) The Sort-Seq protocol. (B) Top: we amplified 25 parent sequences (150 bp each) from 25 promoter islands in the E. coli genome using an error-prone polymerase chain reaction, generating a mutagenesis library of 245’639 unique daughter sequences. Bottom: we cloned the library into the pMR1 reporter plasmid between a green fluorescent protein (GFP) coding sequence on the top strand (blue arrow) and a red fluorescent protein (RFP) coding sequence on the bottom strand (red arrow). We transformed the plasmid library into E. coli. (C) Using fluorescence-activated cell sorting (FACS), we sorted the transformed E. coli cells into 8 fluorescence bins: none, weak, moderate, and strong, for both RFP and GFP expression (see Fig S2 for bins). We sequenced the plasmid inserts of cells from each bin and assigned fluorescence scores from 1.0 to 4.0 arbitrary units (a.u.), ranging from no fluorescence (1.0 a.u.) to strong fluorescence (4.0 a.u.) (see Fig S4 for score distributions). (D) Circles show the probability that mutation creates an active promoter from an inactive parent. Data are shown for each of the 40 inactive combinations of parent and orientation (25 parents × 2 orientations = 50; 50 – 3 top-strand promoters – 7 bottom-strand promoters = 40). The height of the box represents the interquartile range (IQR) and the center line the median. Whiskers correspond to 1.5×IQR.

The majority of promoters emerge and evolve within a subset of preexisting promoter motifs.

(A) We calculated the mutual information Iᵢ(b, f) between nucleotide identity (b = A,T,C,G) and fluorescence scores rounded to the nearest whole number (f = 1,2,3,4 a.u.) for each position i in a parent sequence. In essence, the calculation compares three quantities: the probability pᵢ(b) of a base b occurring at position i, the probability p(f) that a sequence has a fluorescence score f, and the joint probability pᵢ(b, f) that a sequence has base b at position i and fluorescence score f. The more the joint probability differs from the product of the two individual probabilities, the more important the base at this position is for promoter activity. See methods. (B) An example of how to interpret mutual information using position-weight matrix (PWM) scores of predicted -10 and -35 box motifs. Top: we plot the mutual information Iᵢ(b, f) for P19’s top strand. P19 is an active promoter on both DNA strands. Solid line: mean mutual information. Shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Bottom: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. We define hotspots as mutual information peaks greater than or equal to the 90th percentile of total mutual information (methods), and highlight them with dashed rectangles. (C) Stacked bar plots depicting the percentage of hotspots overlapping with -10 box motifs only (magenta), -35 box motifs only (orange), both -10 and -35 box motifs (red), or with neither (gray). We plot this both for active parents (left) and inactive parents (right). (D) Analogous to (B) but for the bottom strand of P3. P3 is an inactive parent sequence. The hotspot overlaps with a -10 box motif. (E) Analogous to (B) but for the top strand of P18. P18 is an inactive parent sequence. Hotspots overlap with (from left to right) a -35 box motif, both a -10 and a -35 box motif, and neither (None). Fig S6 shows analogous mutual information plots for daughters derived from each parent sequence.
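The per-position calculation described in (A) can be sketched as follows. This is a minimal illustration rather than the analysis code: it assumes base-2 logarithms, plain frequency estimates without any finite-sample correction, and half-up rounding of scores, none of which are specified in the legend. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(sequences, scores):
    """Return one mutual information value (in bits) per sequence position.

    sequences: equal-length daughter sequences derived from one parent.
    scores: fluorescence scores (1.0-4.0 a.u.), one per sequence.
    """
    # Round scores to the nearest whole number (half-up; an assumption).
    rounded = [int(math.floor(s + 0.5)) for s in scores]
    n = len(sequences)
    p_f = Counter(rounded)  # counts behind p(f)
    result = []
    for i in range(len(sequences[0])):
        p_b = Counter(seq[i] for seq in sequences)  # counts behind p_i(b)
        p_bf = Counter((seq[i], f) for seq, f in zip(sequences, rounded))
        mi = 0.0
        for (b, f), count in p_bf.items():
            joint = count / n
            # Compare the joint probability with the product of the marginals.
            mi += joint * math.log2(joint / ((p_b[b] / n) * (p_f[f] / n)))
        result.append(mi)
    return result
```

A position whose base fully determines the rounded score yields a high value, whereas a position that never varies yields zero.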

Gaining -10 and -35 boxes that overlap with promoter motifs creates de-novo promoters.

(A) The change in fluorescence (arbitrary units, a.u.) when gaining or losing -10 and -35 box motifs in mutual information hotspots of inactive parent sequences (see Fig S7 for calculation overview). Dotted lines indicate an effect size threshold of ± 0.5 a.u. Each circle corresponds to a gain or loss of a box motif in a mutual information hotspot (see Fig 2 for hotspots). Circles with letters in parentheses refer to the corresponding figure panels. The area of each violin plot corresponds to a kernel density estimate of each distribution. (B) The bottom strand of parent P16. Top: Mutual information Iᵢ(b, f) between nucleotide identity and fluorescence at each position. Solid red line: mean mutual information, shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Bottom: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. The dotted trapezoid indicates the region of interest analyzed in the subsequent panel. (C) Fluorescence scores (a.u.) for daughter sequences without (left) and with (right) a -10 box motif in the region of interest defined in (B). We tested the null hypothesis that the fluorescence scores of daughter sequences with and without the motif have the same central tendency (two-tailed Mann-Whitney U [MWU] test). The q-values correspond to Benjamini-Hochberg-corrected p-values (methods; two-tailed MWU test, q=7.64×10⁻³²). Within the violin plot is a box plot, where the box represents the interquartile range (IQR) and the white circle the median. Whiskers correspond to 1.5×IQR. (D) Analogous to (B) but for the top strand of parent P9. (E) Analogous to (C) except for the region of interest defined in (D) (two-tailed MWU test, q=2.83×10⁻¹⁷³). (F) Analogous to (B) but for the bottom strand of parent P1. (G) Analogous to (C) except for gaining a -35 box instead of a -10 box in the region of interest defined in (F) (two-tailed MWU test, q=4.25×10⁻⁴²). (H) Promoter emergence models, from left to right. Promoters emerge by gaining -10 boxes downstream of preexisting -35 box motifs (Shiko -10), gaining -35 boxes upstream of preexisting -10 box motifs (Shiko -35), gaining a -10 box independent of an upstream -35 box motif, or gaining a -35 box independent of a preexisting -10 box motif. See Fig S8 for additional examples.
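The q-values reported in these panels are Benjamini-Hochberg-corrected p-values. As a minimal sketch of that correction, assuming the standard step-up procedure applied to one family of tests (the legend does not say which tests were pooled), the adjustment could look like this. The function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q
```

Each q-value is the smallest false discovery rate at which the corresponding test would still be called significant.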

Gaining -10 and -35 boxes modulates promoter activity.

(A) A cartoon for the promoter mapped on the bottom strand of P12 based on data from Fig S9. Orange trapezoids = -35 boxes, magenta trapezoids = -10 boxes, dashed trapezoids = regions of interest in subsequent figure panels. (B) We test the null hypothesis that the fluorescence scores in each group have the same central tendency using a two-sided Mann-Whitney U (MWU) test, for daughter sequences with vs without a -35 box motif (left comparison) and with vs without a -10 box motif (right comparison). The q-values correspond to Benjamini-Hochberg-corrected p-values (methods; two-tailed MWU test, q=3.35×10⁻⁵ and q=4.36×10⁻⁴, respectively). The areas of the violin plots are the kernel density estimates (KDE) of each distribution. Within each violin plot is a box plot, where the box represents the interquartile range (IQR) and the white circle the median. Whiskers correspond to 1.5×IQR. Dotted lines bridging the box plots with numbers underneath correspond to the median change in fluorescence (a.u.). (C) A proposed mechanism whereby gaining -35 boxes adjacent to -35 boxes or -10 boxes adjacent to -10 boxes lowers expression. (D) Analogous to (B) but for the presence or absence of a -10 box shifted +1 bp downstream from the modeled -10 box (two-tailed MWU test, q=5.85×10⁻¹³). (E) A proposed mechanism whereby mutations shift -10 and -35 boxes either closer together or further apart to modulate promoter strength. See Fig S10 for 5 additional examples. (F) Analogous to (A) but for the promoter mapped on the top strand of P20. (G) Analogous to (B) but for the presence or absence of a -10 box motif overlapping with the modeled -35 box (two-tailed MWU test, q=7.29×10⁻⁷). (H) A proposed mechanism whereby mutations create new -10 or -35 box motifs in the locations of existing promoter binding sites, destroying the promoter. See Fig S10 for two additional examples.

Promoter island sequences.

(A) The locations and arrangements of -10 and -35 box motifs located across both the top and bottom strands of the 150 base pair DNA sequences called parent sequences (P1-P25). Orange trapezoids: -35 box motifs. Magenta trapezoids: -10 box motifs. Left: top strand. Right: bottom strand. Note: the plot is shown from 5’-3’ for both strands. (B) Distributions for the number of motifs found in the 25 promoter island sequences. We identified motifs with position-weight matrices (PWMs, methods). Magenta: the number of -10 box motifs in each promoter island, in both orientations (n=50). Orange: the number of -35 box motifs in each promoter island, in both orientations (n=50). (C) Distribution of AT-content for the 25 parent sequences. Boxes in (B) and (C) represent the interquartile range (IQR) and the center line the median. Whiskers correspond to 1.5×IQR.
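To illustrate how a position-weight matrix can locate box motifs along a sequence, the sketch below scans every window with a log-odds score and reports windows above a threshold. Everything specific here is an assumption made for illustration: the matrix is a hypothetical one built around the well-known -10 consensus TATAAT, the background is uniform, and the threshold is arbitrary. The PWMs and cutoffs actually used are described in the methods.

```python
import math

# Hypothetical frequency matrix: 0.85 for the consensus base, 0.05 otherwise.
CONSENSUS = "TATAAT"
FREQS = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in CONSENSUS]
BACKGROUND = 0.25  # uniform background frequency (an assumption)


def pwm_score(window, freqs=FREQS):
    """Log-odds score (bits) of one window against the matrix."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(freqs, window))


def find_motifs(sequence, threshold, freqs=FREQS):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(freqs)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = pwm_score(sequence[start:start + width], freqs)
        if score >= threshold:
            hits.append((start, score))
    return hits
```

With this toy matrix, a perfect match scores about 10.6 bits and a single mismatch about 6.5 bits, so a threshold of 8.0 keeps only exact matches.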

Mutagenesis library and sort-seq bins.

(A) A histogram with the number of unique daughter sequences per parent sequence (N=245’639). (B) The number of point mutations per daughter sequence (N=245’639). The number above each bin is the number of daughters in each bin. (C) We defined four red fluorescence bins (1, 2, 3, 4, corresponding to none, weak, moderate, and strong fluorescence, respectively) using three controls: GFP+, RFP+, and a negative control (see Methods). Plots show histograms of the fluorescent readouts from 10’000 cells for each control. For bin 1 (none), we defined a minimum boundary as the larger of two PE-H values. These values are the minimum PE-H of the negative control (empty pMR1 plasmid) and the minimum PE-H of the opposite fluorophore positive control (GFP in this case). The upper boundary of bin 1 is the highest PE-H value detected for these same controls. For bin 4 (strong), we defined the lower boundary as the mean fluorescence level of the positive RFP control. This bin does not have an upper bound, because it encompasses the highest levels of fluorescence. For RFP bins 2 (weak) and 3 (moderate), the lower bound of bin 2 is identical to the upper bound of bin 1. The upper bound of bin 3 is identical to the lower bound of bin 4. The upper bound of bin 2, which is identical to the lower bound of bin 3, equals the average of the lower boundary of bin 4 and the upper bound of bin 1. (D) Analogous to (C) except with GFP and RFP controls swapped, with the following exception: Because the GFP positive control produces a bimodal FITC-H distribution, we defined the lower bound of bin 4 for green fluorescence as the peak of the higher mode of this distribution.
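The bin boundaries described in (C) can be written out as a short calculation. This is a sketch under stated assumptions: boundaries are computed on the raw (linear) PE-H values, and the function and argument names are hypothetical. For the green bins in (D), the lower bound of bin 4 would instead be the peak of the higher mode of the positive control, which is not implemented here.

```python
from statistics import mean


def rfp_bin_bounds(negative, opposite_positive, same_positive):
    """Return {bin: (lower, upper)} for the four red fluorescence bins.

    negative: PE-H values of the negative control (empty pMR1 plasmid).
    opposite_positive: PE-H values of the GFP positive control.
    same_positive: PE-H values of the RFP positive control.
    """
    # Bin 1 (none): lower bound is the larger of the two control minima,
    # upper bound is the highest value seen in the same two controls.
    bin1_low = max(min(negative), min(opposite_positive))
    bin1_high = max(max(negative), max(opposite_positive))
    # Bin 4 (strong): starts at the mean of the positive control, no upper bound.
    bin4_low = mean(same_positive)
    # Bins 2 and 3 meet halfway between bin 1's upper and bin 4's lower bound.
    midpoint = (bin1_high + bin4_low) / 2
    return {
        1: (bin1_low, bin1_high),
        2: (bin1_high, midpoint),
        3: (midpoint, bin4_low),
        4: (bin4_low, None),
    }
```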

Mutational coverage for the parent sequences.

(A) Heatmaps depicting the frequency of each nucleotide (A,T,C,G, y-axis) at each position (5’-3’, x-axis) for all of the daughter sequences of each parent sequence (P1-P25). Daughter sequences contain 1-10 point mutations (see Fig S2B). The color of each square in the heatmap corresponds to the frequency of each nucleotide occurring in the library at its respective position (log-scale). Less frequent nucleotides are shown in dark magenta and highly frequent nucleotides in yellow. The latter include the nucleotides of the parent sequence, due to our modest mutation rate (∼2.0 point mutations per daughter sequence). Light gray squares indicate mutations absent from the library. (B) We calculate coverage as the percentage of all possible single point mutations of each parent sequence that occur in the library (3 alternative nucleotides × 150 positions = 450 possible neighboring mutations). In the boxplot, the box represents the interquartile range (IQR) and the center line the median. Whiskers correspond to 1.5×IQR.
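The coverage calculation in (B) can be sketched as follows, assuming that daughters differ from their parent only by substitutions, so that positions align one to one. The function name `mutation_coverage` is hypothetical.

```python
def mutation_coverage(parent, daughters):
    """Percentage of possible single point mutations seen among the daughters."""
    observed = set()
    for daughter in daughters:
        for position, (parent_base, base) in enumerate(zip(parent, daughter)):
            if base != parent_base:
                # Record each distinct (position, new base) pair once.
                observed.add((position, base))
    possible = 3 * len(parent)  # 3 alternative bases per position
    return 100.0 * len(observed) / possible
```

For a 150 bp parent the denominator is 3 × 150 = 450, as in the legend.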

Histogram of the fluorescence score distributions for each promoter island and genetic orientation.

Rows: parent sequences (P1-P25). Left column: bottom strand (RFP fluorescence scores). Right column: top strand (GFP fluorescence scores). Each panel is a histogram of the frequency of fluorescence scores (arbitrary units, a.u.) from the sort-seq experiment. Gray histograms indicate those parent sequences / orientations that already encode promoter activity. Note: daughter sequence counts vary among parents, and the y-axis scale is unique to each panel.

Number of hotspots correlates with Pnew, but numbers of -10 and -35 box motifs do not.

(A) Scatterplot comparing the number of mutual information hotspots (see Fig 2) and Pnew (see Fig 1). Pnew is the ratio of daughter sequences with a fluorescence score greater than 1.5 a.u. to the total number of daughter sequences for each parent. Each point represents a parent / orientation without regulatory activity (n=40). The dotted line is the line of best fit calculated using the method of least squares. We test the null hypothesis that the slope of the regression line equals zero with the Wald test. The r² value is the square of the Pearson correlation coefficient. Calculation carried out using scipy.stats.linregress (version 1.8.1) (p=1.35×10⁻⁴, r²=0.322). (B) Analogous to (A) but comparing the number of daughter sequences for each parent with Pnew (p=0.529, r²=0.0105, not significant (n.s.)). (C) Analogous to (A) but comparing the GC-content of each parent sequence with Pnew (p=0.930, r²=2.07×10⁻⁴, n.s.). (D) Analogous to (A) but comparing the number of -10 box motifs in each parent sequence and orientation with Pnew (p=0.852, r²=9.23×10⁻⁴, n.s.). (E) Analogous to (A) but comparing the number of -35 box motifs in each parent sequence and orientation with Pnew (p=0.834, r²=1.16×10⁻³, n.s.).
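The Pnew definition and the regression in (A) can be sketched as below. The legend names scipy.stats.linregress, whose p-value tests a zero slope with a Wald test. The helper names `p_new` and `fit_against_pnew` are hypothetical and introduced only for this illustration.

```python
from scipy.stats import linregress


def p_new(scores, threshold=1.5):
    """Fraction of daughter sequences scoring above the activity threshold (a.u.)."""
    return sum(score > threshold for score in scores) / len(scores)


def fit_against_pnew(predictor, pnew_values):
    """Least-squares fit of Pnew against one predictor per parent/orientation.

    Returns (slope, p_value, r_squared).
    """
    fit = linregress(predictor, pnew_values)
    # r_squared is the square of the Pearson correlation coefficient.
    return fit.slope, fit.pvalue, fit.rvalue ** 2
```

The same helper applies to panels (B) to (E) by swapping the predictor (number of daughters, GC-content, or motif counts).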

Mutual information and promoter motifs in the promoter islands.

Each panel corresponds to a unique promoter island sequence, with its numerical identifier in bold. Within each panel there are two plots, corresponding to the top strand (left), and the bottom strand (right), both shown from 5’ (left) to 3’ (right). Within each panel, we show on top the mutual information Iᵢ(b, f) between nucleotide identity and fluorescence levels at every position (solid line: mean mutual information; shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets; see methods). The bottom of each panel shows position-weight matrix (PWM) predictions for the occurrence of -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. See methods.

Identifying where motifs are gained and lost in hotspots that are associated with changes in fluorescence.

These panels do not show data from any parent sequence or hotspot, but show hypothetical data to illustrate how we identified the associations of base identity and fluorescence in our analysis. (A) Top: a cartoon illustration of all daughter sequences from a single parent. Red boxes correspond to position-weight matrix (PWM) predicted motifs. Bottom: Mutual information between fluorescence levels and nucleotide identity at each position calculated from the respective daughter sequences of a given parent sequence. For each mutual information hotspot, we examine whether -35 or -10 box motifs are present or absent in each daughter sequence. The dashed rectangle highlights a hotspot at the 3’-end of the daughter sequences discussed in (B) and (C). (B) We categorize the fluorescence scores for the daughter sequences based on whether they have a motif (red) or not (black) in the hotspot. (C) We test the null hypothesis that the fluorescence scores in each group have the same central tendency using a two-sided Mann-Whitney U (MWU) test. See methods.
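Steps (B) and (C) can be sketched as follows: split the scores by motif presence, run a two-sided Mann-Whitney U test, and report the difference between group medians as the change in fluorescence. Using the median difference as the effect size is an assumption based on the median changes shown in the main figures, and the function name `compare_motif_groups` is hypothetical.

```python
import statistics

from scipy.stats import mannwhitneyu


def compare_motif_groups(scores, has_motif):
    """Compare fluorescence scores of daughters with vs without a motif.

    scores: fluorescence scores (a.u.), one per daughter sequence.
    has_motif: booleans, True if the daughter has the motif in the hotspot.
    Returns (p_value, median_change).
    """
    with_motif = [s for s, present in zip(scores, has_motif) if present]
    without_motif = [s for s, present in zip(scores, has_motif) if not present]
    result = mannwhitneyu(with_motif, without_motif, alternative="two-sided")
    # Positive values mean the motif is associated with higher fluorescence.
    median_change = statistics.median(with_motif) - statistics.median(without_motif)
    return result.pvalue, median_change
```

The resulting p-values would then be corrected for multiple testing before being reported as q-values.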

Additional examples in which gaining -10 and -35 boxes creates de-novo promoters.

(A) The change in fluorescence (arbitrary units, a.u.) when gaining or losing -10 and -35 boxes in hotspots in the non-regulatory parent sequences. Dotted lines indicate an effect size threshold of ± 0.5 a.u. Each point corresponds to a gain or loss of a motif in a mutual information hotspot. Outlined points with letters in parentheses highlight the corresponding figure panels. See Fig 3A for the same plot. The areas of the violin plots are the kernel density estimates (KDE) of each distribution. (B) Top: the top strand of parent P9. Dashed rectangle indicates the region of interest analyzed in the subsequent figure panel. We plot the mutual information Iᵢ(b, f) between nucleotide identity and fluorescence at each position. Solid line: mean mutual information, shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Middle: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. Bottom: fluorescence scores (a.u.) for daughter sequences with vs without a -10 box in the region of interest at position 136:142. We tested the null hypothesis that the fluorescence scores have the same central tendency using a two-sided Mann-Whitney U (MWU) test (q=7.14×10⁻⁴). The areas of the violin plots are the kernel density estimates (KDE) of each distribution. Within each violin plot is a box plot. The box represents the interquartile range (IQR) and the center line the median. Whiskers correspond to 1.5×IQR. (C) Analogous to (B) except for gaining a -10 box on the bottom strand of parent P1 at position 114:120 (two-tailed MWU test, q=4.41×10⁻¹¹⁰). (D) Analogous to (B) except for gaining a -10 box on the bottom strand of parent P8 at position 104:110 (two-tailed MWU test, q=7.50×10⁻⁷). (E) Analogous to (B) except for gaining two -35 boxes on the bottom strand of parent P8 at position 108:115 (two-tailed MWU tests, q=1.65×10⁻³¹ and q=1.29×10⁻⁸, left and right -35 box, respectively). (F) Analogous to (B) except for gaining a -35 box on the top strand of parent P3 at position 16:22 (two-tailed MWU test, q=9.62×10⁻¹⁵).

Mapping promoters in active parents.

(A) The change in fluorescence (arbitrary units, a.u.) when gaining or losing -10 and -35 boxes in hotspots in the active parent sequences. Dotted lines indicate an effect size threshold of ± 0.5 a.u. Each point corresponds to a gain or loss of a -10 or -35 box in a mutual information hotspot. Outlined points with letters in parentheses highlight the corresponding figure panels. The areas of the violin plots are the kernel density estimates (KDE) of each distribution. Within each violin plot is a box plot, where the box represents the interquartile range (IQR) and the white circle the median. Whiskers correspond to 1.5×IQR. (B) Top: the bottom strand of parent P6. We plot the mutual information Iᵢ(b, f) between nucleotide identity and fluorescence at each position. Solid line: mean mutual information, shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Middle: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. Outlined trapezoids with shadows indicate regions of interest for the subsequent figure panel. (C) We test the null hypothesis that the fluorescence scores have the same central tendency for daughter sequences with vs without -10 boxes using a two-sided Mann-Whitney U (MWU) test. We tested the gain of -10 boxes in three locations. Left: position 92:98 (q=8.70×10⁻²⁵), middle: position 94:100 (q=8.21×10⁻⁷⁰), right: position 97:103 (q=2.96×10⁻²²⁹). (D) We combine the information from the previous panels (B-C) to generate a cartoon model for P6. Solid trapezoids correspond to validated binding sites in the previous panel. Trapezoids without borders correspond to PWM predicted motifs. Dashed trapezoids refer to inferred low-affinity motifs (see methods). Orange trapezoids = -35 boxes, magenta trapezoids = -10 boxes. (E) Analogous to (B) but for the bottom strand of P12. (F) Analogous to (C) but for losing a -35 box and a -10 box (two-tailed MWU test, q=6.29×10⁻⁶⁷ and q=5.28×10⁻⁹¹, respectively). (G) Analogous to (D) but a cartoon model for P12 based on panels (E-F). (H) Analogous to (B) but for the bottom strand of P13. (I) Analogous to (C) but for losing a -35 box and a -10 box (two-tailed MWU test, q=6.73×10⁻¹⁵⁵ and q=2.52×10⁻⁷⁴, respectively). (J) Analogous to (D) but a cartoon model for P13 based on panels (H-I). (K) Analogous to (B) but for the top strand of P22. (L) Analogous to (C) but for losing a -35 box (two-tailed MWU test, q=7.19×10⁻³⁵). (M) Analogous to (D) but a cartoon model for P22 based on panels (K-L).

Additional examples of mutations modulating promoter activity.

(A) Top: the top strand of parent P19. We plot the mutual information Iᵢ(b, f) between nucleotide identity and fluorescence at each position. Solid line: mean mutual information, shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Middle: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. Dashed trapezoids with shadows indicate regions of interest for the subsequent figure panel. (B) We test the null hypothesis that the fluorescence scores have the same central tendency for daughter sequences with vs without -10 boxes using a two-sided Mann-Whitney U (MWU) test (q=2.03×10⁻³). The areas of the violin plots are the kernel density estimates (KDE) of each distribution. Within each violin plot is a box plot, where the box represents the interquartile range (IQR) and the white circle the median. Whiskers correspond to 1.5×IQR. (C) A cartoon for the promoter mapped on the bottom strand of P13 based on data from Fig S9. Orange trapezoids = -35 boxes, magenta trapezoids = -10 boxes, dashed trapezoids = regions of interest in subsequent figure panels. (D) Analogous to (B) but for gaining (left) a -35 box, (middle) a -35 box, (right) a -10 box (two-sided MWU test, q=1.40×10⁻⁶, q=7.79×10⁻¹⁵, q=2.36×10⁻³, respectively). (E) Analogous to (C) but for the bottom strand of P6. (F) Analogous to (B) but for gaining (left) a -35 box, (middle) a -10 box, (right) a -35 box (two-sided MWU test, q=1.25×10⁻⁷, q=1.19×10⁻²³, q=1.23×10⁻²⁰, respectively).

Parent P20.

(A) We plot the mutual information for the top (blue) and bottom (red) strands of parent P20. Top: the top strand of parent P20. We plot the mutual information Iᵢ(b, f) between nucleotide identity and fluorescence at each position. Solid line: mean mutual information, shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Middle: position-weight matrix (PWM) predictions for the -10 box motifs (magenta trapezoids) and -35 box motifs (orange trapezoids) along the wild-type parent sequence. Outlined trapezoids with letters in parentheses indicate regions of interest for the subsequent figure panels. Bottom: the mutual information for the bottom strand of parent P20 as described for the top strand. (B) We test the null hypothesis that the fluorescence scores have the same central tendency for daughter sequences with vs without -35 boxes at position 21:26 using a two-sided Mann-Whitney U (MWU) test (q=6.30×10⁻⁸). The areas of the violin plots are the kernel density estimates (KDE) of each distribution. Within each violin plot is a box plot, where the box represents the interquartile range (IQR) and the white circle the median. Whiskers correspond to 1.5×IQR. (C) Analogous to (B) but for gaining a -10 box at position 46:51 (two-tailed MWU test, q=1.08×10⁻⁵). (D) Analogous to (B) but for gaining a -35 box at position 95:101 (two-tailed MWU test, q=0.037). (E) Analogous to (B) but for gaining a -10 box at position 122:128 (two-tailed MWU test, q=1.88×10⁻³).

Molecular cloning of parent sequences.

(A) We first amplify the parent sequences from the E. coli genome using Q5 polymerase. Top: the forward and reverse primers contain constant overhang regions and sequences (red N’s) specific to each parent sequence. Bottom: the product is a unique parent sequence with the sequences 5’-GGCTGAATTC…insert…GGATCCTTGC-3’ concatenated to the flanks. (B) We pooled the parent sequences amplified in (A) for the error-prone polymerase chain reaction using GoTaq and MnCl₂. Top: the forward and reverse primers contain constant overhang regions homologous to the pMR1 region for Gibson Assembly. Bottom: the product is the parent sequence flanked by sequences homologous to the pMR1 plasmid. (C) We carry out a Gibson Assembly reaction using NEBuilder. Top: we combine the products from (B) with a linear copy of the pMR1 plasmid. Bottom: the final assembly product is the pMR1 plasmid with a mutant library of parent sequences.