Genome-wide Functional Characterization of Escherichia coli Promoters and Sequence Elements Encoding Their Regulation

  1. Molecular Biology Interdepartmental Doctoral Program, University of California, Los Angeles, CA, 90095, USA
  2. Department of Systems Biology, Columbia University, New York, NY, 10032, USA
  3. Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, CA 90095, USA
  4. Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, CA, 90095, USA
  5. Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA, USA
  6. Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA
  7. Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, 37232 USA
  8. Molecular Biology Institute, University of California, Los Angeles, CA 90095, USA
  9. UCLA-DOE Institute for Genomics and Proteomics, Quantitative and Computational Biology Institute, Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA 90095, USA

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Joseph Wade
    Department of Biomedical Sciences, School of Public Health, University at Albany, Albany, United States of America
  • Senior Editor
    Aleksandra Walczak
    École Normale Supérieure - PSL, Paris, France

Reviewer #1 (Public Review):

Summary:
This paper uses a high-throughput assay of transcription levels to (i) assess the potential of large numbers of Escherichia coli genomic sequences to function as promoters, and (ii) identify regulatory sequences in some of those promoters. This is a substantial undertaking, and while much of the work supports principles of transcription and transcription regulation described by many prior studies, there is considerable value in assessing promoters on such a large scale. The identification of putative regulatory sequences in large numbers of promoters will likely be valuable to other groups studying transcription regulation in E. coli. In addition, the analysis of antisense promoters provides some interesting new insight that goes beyond previous anecdotal studies.

Strengths:
- The presentation of the work is very clear, and the conclusions are mostly well supported by the data.
- The assays are rigorously controlled and analyzed.
- Conclusions regarding the impact of antisense transcription on sense transcript levels provide new insight. While these data are consistent with previous anecdotal studies, to my knowledge this is the first large-scale analysis supporting a negative regulatory role for antisense transcription.
- The putative regulatory elements mapped in the high-throughput mutagenesis experiments will be a valuable resource for the scientific community.

Weaknesses:
(all minor)
- There are some parts where the authors could clarify their arguments.
- I'm not convinced that intragenic promoters impact codon usage rather than the other way around.
- The authors should present a more nuanced discussion of promoters that avoids making yes/no calls (i.e., characterize sequences by promoter strength rather than a binary yes/no call of being a promoter).
- Data relating to intragenic promoters should be presented and discussed for sense and antisense promoters separately.

Reviewer #2 (Public Review):

In this work, Urtecho et al. use genome-integrated massively parallel reporter assays (MPRAs) to catalog the locations of promoters throughout the E. coli genome. Their study uses four different MPRA libraries. First, they assayed a library containing 17,635 promoter regions having transcription start sites (TSSs) previously reported by three different sources. They found that 2,760 of these regions exhibited transcription above an experimentally determined threshold. Second, they assayed a library using sheared E. coli genome fragments. This library allowed the authors to systematically identify candidate promoter regions throughout the genome, some of which had not been identified before. Additionally, by performing experiments with this library under different growth conditions, the authors were able to identify promoters with condition-dependent activity. Third, to improve the resolution at which they were able to identify transcription start sites, the authors assayed a library that tiled all candidate promoter regions identified using the genomic fragments library. Data from the tiled library allowed the authors to identify minimal promoter regions. Fourth, the authors assayed a scanning mutagenesis library in which they systematically scrambled individual 10 bp windows within 2,057 previously identified active promoters at 5 bp intervals. After validation with known promoters, this approach allowed the authors to identify novel functional elements within regulatory regions. Finally, the authors fit multiple machine learning models to their data with the goal of predicting promoter activity from DNA sequences.
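
To make the scanning mutagenesis design described above concrete, the following is a minimal sketch, not the authors' code. It enumerates 10 bp windows at 5 bp intervals along a promoter and shuffles the bases inside each window. The function name `scramble_variants`, the seeded shuffle, and the 150 bp example length are assumptions made for illustration; the review does not state how the scrambled sequences were generated.

```python
import random


def scramble_variants(promoter, window=10, step=5, seed=0):
    """Return (start, variant) pairs, one per scrambled window.

    Hypothetical helper: each variant equals the input promoter except
    that the bases in promoter[start:start + window] are shuffled.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    variants = []
    # Window starts at 0, step, 2*step, ... while the window still fits.
    for start in range(0, len(promoter) - window + 1, step):
        segment = list(promoter[start:start + window])
        rng.shuffle(segment)  # assumed scrambling rule: permute the window
        variant = promoter[:start] + "".join(segment) + promoter[start + window:]
        variants.append((start, variant))
    return variants


if __name__ == "__main__":
    example = "ACGT" * 10  # 40 bp toy sequence, not a real promoter
    for start, variant in scramble_variants(example):
        print(start, variant)
```

Under these assumptions, a 150 bp promoter yields 29 variants (window starts 0, 5, ..., 140), and adjacent windows overlap by 5 bp, so every position except the outermost 5 bp at each end is covered by two variants.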

The work by Urtecho et al. provides an important resource for researchers studying bacterial transcriptional regulation. Despite decades of study, a comprehensive catalogue of E. coli promoters is still lacking. The results of Urtecho et al. provide a state-of-the-art atlas of promoters in the E. coli genome that is readily accessible through the website, http://ecolipromoterdb.com. The authors' work also provides an important demonstration of the power of genome-integrated MPRAs. Unlike many MPRA-based studies, the authors use the results of their initial MPRAs to design follow-up MPRAs, which they then carry out. Finally, the scanning mutagenesis MPRAs the authors perform provide valuable data that could lead to the discovery of novel transcription factor binding sites and other functional regulatory sequence elements.

Below I provide two major critiques and some minor critiques of the paper. The purpose of these critiques is simply to help the authors improve the quality of the manuscript.

Major points:
1. Ultimately, a comprehensive atlas of E. coli promoters should include nucleotide resolution TSS data, which is not present in the MPRA datasets reported by Urtecho et al. The authors do use some methods to narrow down the positions of TSSs, but these methods do not provide the resolution one would ideally like to see in a TSS atlas. I understand that acquiring single-nucleotide-resolution data is beyond the scope of this manuscript, but it still might make sense for the authors to discuss this limitation in the Discussion section.

2. The authors should clarify which points in the Results section are novel conclusions or observations, and which points are simply statements that prior conclusions or observations were confirmed. This distinction can be unclear at times.

Minor points:
1. Line 200-203: "We conclude that inactive TSS-associated promoters lack -35 elements but may become active in growth conditions where additional transcription factors mobilize and facilitate RNAP positioning in the absence of a -35 motif." Drawing this type of mechanistic conclusion from the slight difference observed in the enrichment analysis seems too speculative to me. Also, I do not understand how the discrepancies can be explained in terms of transcription factor differences. If the previous studies from which the annotated TSSs were extracted were also performed during log phase in rich media, why would the transcription factors present be different?

2. Line 224-226: "Active TSSs not overlapping a candidate promoter region generally exhibited weak activity, which may indicate that greater sensitivity is achieved through testing of oligo-array synthesized regions (Figure S3)." The authors should clarify this statement. In particular, it is mechanistically unclear why one library would be more sensitive than another if they contain similar sequences.

3. Figure 2B. The authors should clarify that the heights of the arrows correspond to TSS activity as assayed by one library and that the pile-up plots represent promoter activity as assayed by a different library.

4. Line 255-257: "We also observed an enrichment for 150 bp minimal promoter regions, although these were generally weak indicating that our resolution is limited when tiling weaker promoters." The authors should clarify whether the peak at 150 bp is an artifact of using oligos containing 150 bp tiles to construct the library. Also, the authors should clarify why there are some minimal promoters with lengths > 150 bp when the length of the tiles was 150 bp.

5. Line 262 refers to "Supplementary Table 1", but I was not able to find this table in the supplement.

6. Line 324-325: "We used a σ70 PWM to identify the highest-scoring σ70 motifs within intragenic promoters and determined their relative coding frames". I find the term "relative coding frame" here to be unclear; the authors should clarify what they mean.

7. Figure 3C, D: The authors should use the same terminology in the plots and the methods section describing them. They should also clarify how the values plotted in C and D were computed.

8. Line 329-332: "The observed depletion of -35 motifs positioned in the +2 reading frame and -10 motifs in the +1 reading frame is likely due to the fact that the canonical sequences for these motifs would create stop codons within the protein if placed at these positions." The definition of the reading frame here is unclear. Do the authors mean that the 0 frame is defined as occurring when the hexamer exactly overlaps 2 codons, the +1 frame is when the hexamer is shifted 1 nt downstream of that position, and the +2 frame is when the hexamer is shifted 2 nt downstream of that position?

9. Line 538-539: "We performed hyperparameter tuning for a three-layer CNN and achieved an AUPRC =0.44." The authors should explicitly describe the architecture used for the CNN, and perhaps include a diagram of this architecture. In addition, the authors should clarify the mathematical forms of the other methods tested.

10. Line 1204-1205: "We standardized all datasets as detailed above in 'Universal Promoter Expression Quantification and Activity Thresholding'". That title does not appear before in the text. I believe the appropriate subsection is called "Standardizing Promoter Expression Quantification and Activity Thresholding".

11. Line 1265-1266: "We include a k-mer if the absolute correlation with expression is greater than the 'random' k-mer frequency, resulting in 4800/5440 filtered k-mers." It is unclear to me which two correlations are being compared. Please clarify. For example, would this be accurate: "We include a k-mer if the absolute correlation of its frequency with expression is greater than the absolute correlation of its 'random' frequency with expression"?
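
The frame convention proposed in minor point 8 can be checked with a short sketch. It assumes that convention (frame 0 when the hexamer exactly overlaps two codons, +1 and +2 when the hexamer is shifted 1 or 2 nt downstream), assumes the motif is read on the coding strand, and uses the canonical hexamers TTGACA (-35) and TATAAT (-10). The function name `frames_with_stop` is hypothetical, and only codons lying entirely within the hexamer are examined, since partial codons depend on the flanking sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frames_with_stop(hexamer):
    """Return the frames (0, 1, 2) in which a codon fully contained
    in the hexamer is a stop codon.

    Assumed convention: in frame f the first base of the hexamer sits
    at codon position f, so in-frame codons begin at hexamer index
    (3 - f) % 3.
    """
    hits = []
    for frame in range(3):
        first = (3 - frame) % 3
        # Collect only the codons that fit entirely inside the hexamer.
        codons = [hexamer[i:i + 3] for i in range(first, len(hexamer) - 2, 3)]
        if any(codon in STOP_CODONS for codon in codons):
            hits.append(frame)
    return hits


if __name__ == "__main__":
    print("-35 TTGACA:", frames_with_stop("TTGACA"))
    print("-10 TATAAT:", frames_with_stop("TATAAT"))
```

Under this convention the canonical -35 hexamer contains an in-frame TGA only in the +2 frame, and the canonical -10 hexamer contains an in-frame TAA only in the +1 frame. This matches the depletion pattern quoted from lines 329-332, which suggests the proposed reading is the intended one, although the authors would need to confirm it.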

Reviewer #3 (Public Review):

In this revised manuscript, Urtecho et al. present an updated version of their earlier submission. They characterized thousands of promoter sequences in E. coli using a massively parallel reporter assay and built a number of computational models to distinguish active from inactive promoters or to relate sequence to promoter expression/strength. As alluded to in the earlier review cycle, the amount of experimental, bioinformatics, and analytical work presented here is astounding.

Identifying promoters and relating genomic (or promoter) sequences to promoter strength is nontrivial. The authors report challenges in achieving this grand goal even with the state-of-the-art characterization technology used here. Nevertheless, the experimental work, analytic workflow, and data resource presented here will serve as a milestone for future researchers.
