The majority of promoters emerge and evolve within a subset of preexisting promoter motifs.
(A) We calculated the mutual information I_i(b, f) between nucleotide identity (b = A, T, C, G) and fluorescence scores rounded to the nearest whole number (f = 1, 2, 3, 4 a.u.) for each position i in a parent sequence. In essence, the calculation compares the joint probability p_i(b, f) that a sequence has base b at position i and fluorescence score f with the two individual probabilities: the probability p_i(b) of base b occurring at position i, and the probability p(f) that a sequence has fluorescence score f. The more the joint probability departs from the product of the individual probabilities, the more important the base at this position is for promoter activity. See methods. (B) An example of how to interpret mutual information using position-weight matrix (PWM) scores of predicted -10 and -35 boxes. Top: mutual information I_i(b, f) for P19-GFP, an active promoter. Solid line: mean mutual information. Shaded region: ± 1 standard deviation when the dataset is randomly split into three equally sized subsets (methods). Bottom: PWM predictions for the -10 boxes (magenta trapezoids) and -35 boxes (orange trapezoids) along the wild-type parent sequence. We define hotspots as mutual information peaks greater than or equal to the 90th percentile of total mutual information (methods), and highlight them with dashed rectangles. (C) Stacked bar plots depicting the percentage of hotspots overlapping with -10 boxes only (magenta), -35 boxes only (orange), both -10 and -35 boxes (red), or neither (gray). We plot this information for the wild-type (WT) promoters and non-promoter parents, as well as for scrambled (scram.) versions of the parents. Horizontal lines correspond to chi-squared (χ²) tests between the counts in each group (3 degrees of freedom). (D) Analogous to (B) but for parent P3-RFP, an inactive parent sequence. The hotspot overlaps with a -10 box. (E) Analogous to (B) but for parent P18-GFP, an inactive parent sequence. Hotspots overlap with (from left to right) a -35 box, both a -10 and a -35 box, and neither (None). Fig S6 shows analogous mutual information plots for daughters derived from each parent sequence.
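The per-position calculation described in (A) can be illustrated with a minimal sketch. It is not the authors' implementation (see methods for that); it assumes the standard mutual information formula, a base-2 logarithm (bits), and half-up rounding of fluorescence scores, none of which is specified in this caption. The function name and the example data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(sequences, scores, position):
    """Mutual information (in bits) between the base at `position`
    and the fluorescence score rounded to the nearest whole number.

    Assumptions (not stated in the caption): log base 2, half-up rounding,
    and probabilities estimated as simple frequencies.
    """
    n = len(sequences)
    bases = [seq[position] for seq in sequences]
    # Round half up; Python's built-in round() would round halves to even.
    bins = [int(math.floor(score + 0.5)) for score in scores]

    base_counts = Counter(bases)             # n * p_i(b)
    score_counts = Counter(bins)             # n * p(f)
    joint_counts = Counter(zip(bases, bins))  # n * p_i(b, f)

    mi = 0.0
    for (b, f), count in joint_counts.items():
        joint = count / n
        # Ratio p_i(b, f) / (p_i(b) * p(f)), written with raw counts.
        ratio = count * n / (base_counts[b] * score_counts[f])
        mi += joint * math.log2(ratio)
    return mi


# Hypothetical toy data: position 0 fully determines the score bin,
# position 1 never varies and therefore carries no information.
example_sequences = ["AA", "AA", "TA", "TA"]
example_scores = [1.2, 0.9, 3.1, 2.8]

per_position = [
    mutual_information(example_sequences, example_scores, i)
    for i in range(len(example_sequences[0]))
]
```

In this toy case the first position carries one full bit of information about the score bin and the second carries none, which mirrors how peaks in the plotted mutual information mark positions where base identity tracks fluorescence.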