Functional characterization of 17,635 previously reported E. coli promoters.

A) Three sources of genome-wide promoter predictions show little agreement in the reported TSSs at the single-nucleotide level. B) We synthesized oligos covering the −120 to +30 bp context of 17,635 reported TSSs and integrated the constructs into a fixed genomic landing pad. Measuring barcode expression by RNA-Seq provides quantitative measurements of transcriptional activity for individual TSSs. C) MPRA results are highly replicable across technical replicates (r = 0.965, p < 2.2 × 10⁻¹⁶). D) The TSS library measurements span over 100-fold, with negative controls exhibiting low levels of expression and positive controls spanning the entire dynamic range. E) A majority of tested TSSs are inactive in LB. F) Active and inactive TSSs have significantly different mean PWM scores for −10 and −35 σ70 motifs (Wilcoxon rank-sum test, “***” = p < 0.001).
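
The PWM scoring in panel F can be illustrated with a minimal sketch. It assumes a log-odds PWM built from aligned example sites with a uniform background and a pseudocount; the function names, the pseudocount, and the toy hexamers are hypothetical and are not taken from the study, whose actual PWMs and scoring procedure may differ.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def best_pwm_score(seq, pwm):
    """Return (score, offset) of the highest-scoring window of seq under the PWM."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical aligned -10 hexamers, for illustration only (not data from the study).
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(toy_sites)
score, offset = best_pwm_score("GGCGTATAATGCC", pwm)
```

Scores computed this way for active and inactive TSSs could then be compared with a rank-sum test, as the caption describes.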

Genome-wide identification of E. coli promoters.

A) 321,123 sheared genomic fragments were screened using the same MPRA platform. The fragments were 200 to 300 bp in size, giving an average 8.5x coverage across each strand of the E. coli genome. Promoter activity of each fragment was measured and averaged at each position to recover nucleotide-specific expression. B) We created a website to showcase the E. coli promoter landscape (https://ecolipromoterdb.com/). The section of the genome displayed in this figure contains five candidate promoter regions that lie within intergenic regions. C) Meta-analysis of mean promoter activity at experimentally validated active TSSs, inactive TSSs, and negative controls. D) Oligo tiling library identifies promoters within candidate promoter regions. We synthesized 150 bp oligos tiling all promoter regions identified in rich media at 10 bp intervals. We then determined minimal promoter boundaries by identifying the overlap of transcriptionally active tiles. E) Oligo tile expression across the mraZ promoter shows two distinct promoters. Positions are defined according to the right-most genomic position of each 150 bp oligo. Dashed line indicates the threshold for active oligo tiles. F) Distribution of the number of promoters per promoter region shows that many regions contain multiple promoters. G) Left: Distribution of the lengths of the minimal promoter boundaries shows enrichment for σ70-promoter-sized regions (40 bp). Right: 40 bp minimal promoters (red) span a wide range of expression, whereas 150 bp promoters (blue) are typically weak.
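
Two steps in this caption lend themselves to a short sketch: averaging fragment activity at each position (panel A) and taking the overlap of consecutive active tiles as a minimal promoter boundary (panel D). The code below is one possible reading under stated assumptions: coordinates are 0-based and end-exclusive, an unweighted mean is used, and a run of active tiles with no common overlap is simply dropped. The function names and toy values are hypothetical, and the study's actual pipeline may differ.

```python
def per_position_mean(fragments, genome_length):
    """Average the activity of all fragments covering each position.

    fragments: iterable of (start, end, activity); uncovered positions give None.
    """
    totals = [0.0] * genome_length
    counts = [0] * genome_length
    for start, end, activity in fragments:
        for pos in range(start, end):
            totals[pos] += activity
            counts[pos] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

def _common_overlap(run):
    """Interval shared by every tile in a run, or None if there is none."""
    start = max(tile[0] for tile in run)
    end = min(tile[1] for tile in run)
    return (start, end) if start < end else None

def minimal_boundaries(tiles, threshold):
    """Return the common overlap of each run of consecutive active tiles."""
    regions, run = [], []
    for tile in sorted(tiles):
        if tile[2] >= threshold:
            run.append(tile)
            continue
        if run:
            regions.append(_common_overlap(run))
            run = []
    if run:
        regions.append(_common_overlap(run))
    return [region for region in regions if region is not None]

# Toy data: 150 bp tiles every 10 bp; tiles starting at 20..130 are "active".
toy_tiles = [(s, s + 150, 5.0 if 20 <= s <= 130 else 0.1) for s in range(0, 200, 10)]
toy_regions = minimal_boundaries(toy_tiles, threshold=1.0)
toy_profile = per_position_mean([(0, 4, 2.0), (2, 6, 4.0)], genome_length=8)
```

In the toy example, twelve consecutive active 150 bp tiles at 10 bp spacing share a 40 bp interval, which shows how tile overlap can narrow a promoter to roughly the size of a σ70 promoter.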

Intragenic promoters are widespread, often found in the antisense orientation, and alter transcript levels and codon usage of the genes they are within.

A) Orientation and positioning of identified promoters reveal that many promoters are intragenic and antisense. B) Antisense promoters suppress gene expression genome-wide. Left: Meta-gene analysis of the median RNA-Seq coverage across all sense, antisense, and dual-regulated genes. Right: Meta-gene analysis of sense promoter activity at sense, antisense, and dual-regulated genes. C) Intragenic promoters are enriched for specific amino acids relative to whole-genome amino acid frequencies (Chi-squared test, “*” = p < 0.05). Amino acids are arranged by mean AT-content of all corresponding codons. D) Specific, often rare, codons are enriched in intragenic promoters. Codon bias within intragenic promoters is shown relative to the whole genome. Bars are colored by the relative genome-wide usage compared to other synonymous codons (Chi-squared test, “*” = p < 0.05).
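
The enrichment tests in panels C and D can be sketched as a chi-squared test on a 2x2 table of counts (a given codon versus all other codons, inside intragenic promoters versus genome-wide). This is an assumption about the table layout; the caption does not state it, nor whether a continuity or multiple-testing correction was applied. The function name and the toy counts are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction)
    for the table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, for illustration only (not data from the study):
# codon of interest vs. all other codons, in intragenic promoters and genome-wide.
stat, p = chi2_2x2(30, 970, 100, 9900)
enriched = p < 0.05 and (30 / 1000) > (100 / 10000)
```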

The E. coli promoter landscape dynamically responds to environmental conditions.

A) Shared and unique promoter regions are found between LB and glucose minimal media. B) Genes activated by promoters in glucose minimal media are enriched for amino acid-related genes according to RAST subsystem annotations. C) Occurrence of reported transcription factor binding sites in promoter regions activated in LB compared to glucose minimal media (M9). Black lines indicate the 2-fold enrichment threshold. D) Number of binding sites per transcription factor within activated promoter regions. A median of four sites per transcription factor was activated in LB and a median of five sites in M9.
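
The 2-fold threshold in panel C can be expressed as a simple ratio of binding-site counts between conditions. The sketch below assumes that enrichment is the ratio of per-condition site counts and adds a pseudocount to avoid division by zero; both choices, along with the function name and labels, are hypothetical and not stated in the caption.

```python
import math

def classify_enrichment(lb_count, m9_count, fold=2.0, pseudocount=1.0):
    """Label a transcription factor by the LB/M9 ratio of its binding-site counts."""
    log2_ratio = math.log2((lb_count + pseudocount) / (m9_count + pseudocount))
    cutoff = math.log2(fold)
    if log2_ratio >= cutoff:
        return "LB-enriched", log2_ratio
    if log2_ratio <= -cutoff:
        return "M9-enriched", log2_ratio
    return "not enriched", log2_ratio

# Hypothetical counts, for illustration only.
label, log2_ratio = classify_enrichment(lb_count=7, m9_count=1)
```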

Scanning mutagenesis of 2,057 TSS-associated promoters identifies known and novel regulatory motifs.

A) Scanning mutagenesis of 2,057 E. coli promoters to identify regulatory elements. For each promoter, 10 bp regions were mutated across the full length of the promoter at 5 bp intervals. B) Mutating each position across E. coli promoters identifies sequences that activate and repress promoter activity. Rows are rearranged using hierarchical clustering and the intensities are normalized within each row. C) Scanning mutagenesis of the well-characterized (Left) lacZYA and (Right) relBE promoters captures known regulatory elements. D) Scanning mutagenesis of the newly characterized (Left) cfa and (Right) rpsL promoters identifies regions encoding regulation within these promoters.
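
The scanning design in panel A (10 bp windows at 5 bp intervals) can be sketched as follows. The code assumes each window is mutated by shuffling its own bases, which preserves base composition; this is an illustrative choice, and the mutation scheme actually used may differ. The function name, seed, and toy promoter are hypothetical.

```python
import random

def scanning_variants(promoter, window=10, step=5, seed=0):
    """Return (start, variant) pairs, each with one window of bases shuffled.

    Note: a shuffle can by chance reproduce the original window, so a real
    design would need to check that each variant differs from the wild type.
    """
    rng = random.Random(seed)
    variants = []
    for start in range(0, len(promoter) - window + 1, step):
        segment = list(promoter[start:start + window])
        rng.shuffle(segment)
        variant = promoter[:start] + "".join(segment) + promoter[start + window:]
        variants.append((start, variant))
    return variants

# Toy 150 bp sequence, for illustration only.
toy_promoter = "ACGT" * 37 + "AC"
toy_variants = scanning_variants(toy_promoter)
```

For a 150 bp promoter this yields 29 variants, with window starts at 0, 5, ..., 140.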

Global identification of 3,317 E. coli regulatory motifs by scanning mutagenesis.

A) We identified regulatory regions that significantly increase (N = 1,885) or decrease (N = 5,408) expression when scrambled, relative to the unscrambled promoter. Data are colored by whether the regulatory region activates or represses activity of the promoter. B) Activating promoter sequences are enriched at the −10, −35, and −80 positions, whereas repressing sequences are enriched at the +1, −20, and −50 positions. C) Identified regulatory regions overlapping reported TFBS annotations show mixed concordance with reported effects; 77.8% (2,583/3,317) of identified regulatory regions are unreported by RegulonDB. D) Scanning mutagenesis of the FadR promoter (bottom) identifies a repressing sequence near the −30 position that has been reported to be activating (top).

Various machine learning models for promoter activity classification and regression.

A) Performance of various models to classify promoter sequences. Convolutional neural networks performed best in the lower recall range, while logistic regression based on simple hand-crafted features performed better in the higher recall range. Dashed line represents the expected performance from random prediction using the full library. B) Performance of regression models to predict a quantitative level of promoter activity. We evaluated performance using both root mean squared error (RMSE) and coefficient of determination (R²) on the held-out test set. Similar to classification, convolutional neural networks performed best, with the lowest RMSE and highest R².
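
For reference, the two regression metrics in panel B follow their standard definitions, sketched below in plain Python. The function names and toy values are hypothetical; R² is computed as one minus the ratio of the residual to the total sum of squares.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
toy_rmse = rmse(observed, predicted)
toy_r2 = r_squared(observed, predicted)
```

A model that always predicts the mean of the observed values scores R² = 0, so lower RMSE and higher R² both indicate better fit on the held-out set.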