1. Physics of Living Systems
Download icon

Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time

  1. William T Ireland
  2. Suzannah M Beeler
  3. Emanuel Flores-Bautista
  4. Nicholas S McCarty
  5. Tom Röschinger
  6. Nathan M Belliveau
  7. Michael J Sweredoski
  8. Annie Moradian
  9. Justin B Kinney
  10. Rob Phillips  Is a corresponding author
  1. Department of Physics, California Institute of Technology, United States
  2. Division of Biology and Biological Engineering, California Institute of Technology, United States
  3. Division of Chemistry and Chemical Engineering, California Institute of Technology, United States
  4. Proteome Exploration Laboratory, Division of Biology and Biological Engineering, Beckman Institute, California Institute of Technology, United States
  5. Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
Research Article
  • Cited 0
  • Views 1,854
  • Annotations
Cite this article as: eLife 2020;9:e55308 doi: 10.7554/eLife.55308

Abstract

Advances in DNA sequencing have revolutionized our ability to read genomes. However, even in the most well-studied of organisms, the bacterium Escherichia coli, for ≈65% of promoters we remain ignorant of their regulation. Until we crack this regulatory Rosetta Stone, efforts to read and write genomes will remain haphazard. We introduce a new method, Reg-Seq, that links massively parallel reporter assays with mass spectrometry to produce a base pair resolution dissection of more than a E. coli promoters in 12 growth conditions. We demonstrate that the method recapitulates known regulatory information. Then, we examine regulatory architectures for more than 80 promoters which previously had no known regulatory information. In many cases, we also identify which transcription factors mediate their regulation. This method clears a path for highly multiplexed investigations of the regulatory genome of model organisms, with the potential of moving to an array of microbes of ecological and medical relevance.

Introduction

DNA sequencing is as important to biology as the telescope is to astronomy. We are now living in the age of genomics, where DNA sequencing has become cheap and routine. However, despite these incredible advances, how all of this genomic information is regulated and deployed remains largely enigmatic. Organisms must respond to their environments through the regulation of genes. Genomic methods often provide a 'parts list' but leave us uncertain about how those parts are used creatively and constructively in space and time. Yet, we know that promoters apply all-important dynamic logical operations that control when and where genetic information is accessed. In this paper, we demonstrate how we can infer the logical and regulatory interactions that control bacterial decision making by tapping into the power of DNA sequencing as a biophysical tool. The method introduced here provides a framework for solving the problem of deciphering the regulatory genome by connecting perturbation and response, mapping information flow from individual nucleotides in a promoter sequence to downstream gene expression, determining how much information each promoter base pair carries about the level of gene expression.

The advent of RNA-Seq (Lister et al., 2008; Nagalakshmi et al., 2008; Mortazavi et al., 2008) launched a new era in which sequencing could be used as an experimental read-out of the biophysically interesting counts of mRNA, rather than simply as a tool for collecting ever more complete organismal genomes. The slew of ‘X’-Seq technologies that are available continues to expand at a dizzying pace, each serving their own creative and insightful role: RNA-Seq, ChIP-Seq, Tn-Seq, SELEX, 5C, etc (Stuart and Satija, 2019). In contrast to whole genome screening sequencing approaches, such as Tn-Seq (Goodall et al., 2018) and ChIP-Seq (Gao et al., 2018), which give a coarse-grained view of gene essentiality and regulation respectively, another class of experiments known as massively parallel reporter assays (MPRA) have been used to study gene expression in a variety of contexts (Patwardhan et al., 2009; Kinney et al., 2010; Sharon et al., 2012; Patwardhan et al., 2012; Melnikov et al., 2012; Kwasnieski et al., 2012; Fulco et al., 2019; Kinney and McCandlish, 2019). One elegant study relevant to the bacterial case of interest here by Kosuri et al., 2013 screened more than 104 combinations of promoter and ribosome-binding sites (RBS) to assess their impact on gene expression levels. Even more recently, the same research group has utilized MPRAs in sophisticated ways to search for regulated genes across the genome (Urtecho et al., 2019; Urtecho et al., 2020), in a way we see as being complementary to our own. While their approach yields a coarse-grained view of where regulation may be occurring, our approach yields a base-pair-by-base-pair view of how exactly that regulation is being enacted.

One of the most exciting X-Seq tools based on MPRAs with broad biophysical reach is the Sort-Seq approach developed by Kinney et al., 2010. Sort-Seq uses fluorescence activated cell sorting (FACS) based on changes in the fluorescence due to mutated promoters combined with sequencing to identify the specific locations of transcription factor binding in the genome. Importantly, it also provides a readout of how promoter sequences control the level of gene expression with single base-pair resolution. The results of such a massively parallel reporter assay make it possible to build a biophysical model of gene regulation to uncover how previously uncharacterized promoters are regulated. In particular, high-resolution studies like those described here yield quantitative predictions about promoter organization and protein-DNA interactions (Kinney et al., 2010). This allows us to employ the tools of statistical physics to describe the input-output properties of each of these promoters which can be explored much further with in-depth experimental dissection like those done by Razo-Mejia et al., 2018 and Chure et al., 2019 and summarized in Phillips et al., 2019. In this sense, the Sort-Seq approach can provide a quantitative framework to not only discover and quantitatively dissect regulatory interactions at the promoter level, but also provides an interpretable scheme to design genetic circuits with a desired expression output (Barnes et al., 2019).

Earlier work from Belliveau et al., 2018 illustrated how Sort-Seq, used in conjunction with mass spectrometry, can be used to identify which transcription factors bind to a given binding site, thus enabling the mechanistic dissection of promoters which previously had no regulatory annotation. However, a crucial drawback of the approach of Belliveau et al., 2018 is that while it is high-throughput at the level of a single gene and the number of promoter variants it accesses, it was unable to readily tackle multiple genes at once. Even in one of biology’s best understood organisms, the bacterium Escherichia coli, for more than 65% of its genes, we remain completely ignorant of how those genes are regulated (Belliveau et al., 2018Santos-Zavaleta et al., 2019). If we hope to some day have a complete base pair resolution mapping of how genetic sequences relate to biological function, we must first be able to do so for the promoters of this 'simple' organism.

What has been missing in uncovering the regulatory genome in organisms of all kinds is a large-scale method for inferring genomic logic and regulation. Here, we replace the low-throughput, fluorescence-based Sort-Seq approach with a scalable, RNA-Seq based approach that makes it possible to attack many promoters at once. Accordingly, we refer to the entirety of our approach (MPRA, information footprints and energy matrices, and transcription factor identification) as Reg-Seq, which we employ here on over one hundred promoters. The concept of MPRA methods is to perturb promoter regions by mutating their sequences, and then to use next-generation sequencing (NGS) methods to read out how those mutations impact the expression level of each promoter (Patwardhan et al., 2009; Kinney et al., 2010; Sharon et al., 2012; Patwardhan et al., 2012; Melnikov et al., 2012; Kwasnieski et al., 2012; Fulco et al., 2019; Kinney and McCandlish, 2019). We generate a broad diversity of promoter sequences for each promoter of interest and use mutual information as a metric to measure the information flow from that distribution of sequences to gene expression. Thus, Reg-Seq is able to collect causal information about candidate regulatory sequences that is then complemented by techniques such as mass spectrometry, which allows us to find which transcription factors mediate the action of those newly discovered candidate regulatory sequences. Hence, Reg-Seq solves the causal problem of linking DNA sequence to regulatory logic and information flow.

To demonstrate our ability to perform Reg-Seq at scale, we report here our results for 113 E. coli genes, whose regulatory architectures (i.e. gene-by-gene distributions of transcription-factor-binding sites and identities of the transcription factors that bind those sites) were determined in parallel for multiple different growth conditions. Although we make substantial progress in mapping the regulatory information for a swath of E. coli genes in this study (the 'regulome'), the field still remains limited in its understanding of which specific growth conditions, small molecules and metabolites (the allosterome) are responsible for altering the milieu of transcription factor activities (Lindsley and Rutter, 2006; Piazza et al., 2018; Huang et al., 2018). We hope to address this shortcoming in future studies by appealing to recent work on solving the 'allosterome problem' (Piazza et al., 2018). By taking the Sort-Seq approach from a gene-by-gene method to a larger scale, more multiplexed approach, we can begin to piece together not just how individual promoters are regulated, but also the nature of gene-gene interactions by revealing how certain transcription factors serve to regulate multiple genes at once. This approach has the benefits of a high-throughput assay without sacrificing any of the resolution afforded by the previous gene-by-gene approach, allowing us to uncover the gene regulation of over 100 operons, with base-pair resolution, in one set of experiments.

The organization of the remainder of the paper is as follows. In the Results section, we benchmark Reg-Seq against our own earlier Sort-Seq experiments to show that the use of RNA-Seq as a readout of the expression of mutated promoters is equally reliable as the fluorescence-based approach. Additionally, we provide a global view of the discoveries that were made in our exploration of more than 100 promoters in E. coli using Reg-Seq. These results are described in summary form in the paper itself, with a full online version of the results (www.rpgroup.caltech.edu/RegSeq/interactive) showing how different growth conditions elicit different regulatory responses. This section also follows the overarching view of our results by examining several biological stories that emerge from our data and serve as case studies in what has been revealed in our efforts to uncover the regulatory genome. The Discussion section summarizes the method and the current round of discoveries it has afforded with an eye to future applications to further elucidate the E. coli genome and open up the quantitative dissection of other non-model organisms. Lastly, in the Materials and methods section and Appendices, we describe our methodology and the false positive and false negative rates of the method.

Results

Selection of genes and methodology

As shown in Figure 1, we have explored more than 100 genes from across the E. coli genome. Our choices were based on a number of factors (see Appendix 1 Section 'Choosing target genes' for more details); namely, we wanted a subset of genes that served as a 'gold standard' for which the hard work of generations of molecular biologists have yielded deep insights into their regulation. Our set of gold standard genes is lacZYA, znuCB, znuA, ompR, araC, marR, relBE, dgoR, dicC, ftsK, xylA, xylF, rspA, dicA, and araAB. By using Reg-Seq on these genes, we were able to demonstrate that this method recovers not only what was already known about binding sites of transcription factors for well-characterized promoters (Appendix 2—figures 2 and 3), but also whether there are any important differences between the results of the methods presented here and the previous generation of experiments based on fluorescence and cell-sorting as a readout of gene expression (Kinney et al., 2010; Belliveau et al., 2018). These promoters of known regulatory architecture are complemented by an array of previously uncharacterized genes that we selected in part using data from a recent proteomic study, in which mass spectrometry was used to measure the copy number of different proteins in 22 distinct growth conditions (Schmidt et al., 2016). We selected genes that exhibited a wide variation in their copy number over the different growth conditions considered, reasoning that differential expression across growth conditions implies that those genes are under regulatory control.

The E. coli regulatory genome.

Illustration of the current ignorance with respect to how genes are regulated in E. coli. Genes with previously annotated regulation (as reported on RegulonDB [Gama-Castro et al., 2016]) are denoted with blue ticks and genes with no previously annotated regulation denoted with red ticks. The 113 genes explored in this study are labeled in gray, and their precise genomic locations can be found in Figure 1—source data 1.

As noted in the introduction, the original formulation of Reg-Seq, termed Sort-Seq, was based on the use of fluorescence activated cell sorting, one gene at a time, as a way to uncover putative binding sites for previously uncharacterized promoters (Belliveau et al., 2018). As a result, as shown in Figure 2, we have formulated a second generation version that permits a high-throughput interrogation of the genome. A comparison between the Sort-Seq and Reg-Seq approaches on the same set of genes is shown in Figure 3. In the Reg-Seq approach, for each promoter interrogated, we generate a library of mutated variants and design each variant to express an mRNA with a unique sequence barcode. By counting the frequency of each expressed barcode using RNA-Seq, we can assess the differential expression from our promoter of interest based on the base-pair by base-pair sequence of its promoter. Using the mutual information between mRNA counts and sequences, we develop an information footprint that reveals the importance of different bases in the promoter region to the overall level of expression. We locate potential transcription-factor-binding regions by looking for clusters of base pairs that have a significant effect on gene expression. Further details on how potential binding sites are identified are found in the Methods Section 'Automated putative binding site algorithm' and 'Manual selection of binding sites', while determination of the false positive and false negative rates of the method can be found in Appendix 2 Section 'False positive and false negative rates'. Blue regions of the histogram shown in the information footprints of Figure 2 correspond to hypothesized activating sequences and red regions of the histogram correspond to hypothesized repressing sequences.

Schematic of the Reg-Seq procedure as used to recover a repressor-binding site.

The process is as follows: After constructing a promoter library driving expression of a randomized barcode (an average of five barcodes for each promoter), RNA-Seq is conducted to determine the frequency of these mRNA barcodes across different growth conditions (list included in Appendix 1 Section 'Growth conditions'). By computing the mutual information between DNA sequence and mRNA barcode counts for each base pair in the promoter region, an 'information footprint' is constructed that yields a regulatory hypothesis for the putative binding sites (with the RNAP-binding region highlighted in blue and the repressor-binding site highlighted in red). Energy matrices, which describe the effect that any given mutation has on DNA-binding energy, as well as sequence logos, are inferred for the putative transcription-factor-binding sites. Next, we identify which transcription factor preferentially binds to the putative binding site via DNA-affinity chromatography followed by mass spectrometry. This procedure culminates in a coarse-grained, cartoon-level view of our regulatory hypothesis for how a given promoter is regulated.

A summary of four direct comparisons of measurements from Sort-Seq and Reg-Seq.

We show the identified regulatory regions as well as quantitative comparisons between inferred position weight matrices. (A) CRP binds upstream of RNAP in the lacZYA promoter. Despite the different measurement techniques for the two inferred position weight matrices, the CRP-binding sites have a Pearson correlation coefficient of r=0.98. (B) The dgoRKADT promoter is activated by CRP in the presence of galactonate and is repressed by DgoR. For Sort-Seq and Reg-Seq, type II activator-binding sites can be identified based on the signals in the information footprint in the area indicated in green. Additionally, the quantitative agreement between the CRP position weight matrices are strong, with r=0.9. (C) The relBE promoter is repressed by RelBE as can be identified algorithmically in both Sort-Seq and Reg-Seq. The inferred logos for the two measurement methods have r=0.8. (D) The marRAB promoter is repressed by MarR. The inferred energy matrices (data not shown) and sequence logos shown have r=0.78. The right most MarR site overlaps with a ribosome-binding site. The overlap has a stronger obscuring effect on the sequence specificity of the Sort-Seq measurement, which measures protein levels directly, than it does on the output of the Reg-Seq measurement. Numeric values for the displayed data can be found in Figure 3—source data 1.

With the information footprint in hand, we can then determine energy matrices and sequence logos (described in the next section). Given putative binding sites, we use synthesized oligonucleotides that serve as fishing hooks to isolate the transcription factors that bind to those putative binding sites using DNA-affinity chromatography and mass spectrometry (Mittler et al., 2009). Given all of this information, we can then formulate a schematized view of the newly discovered regulatory architecture of the previously uncharacterized promoter. For the case schematized in Figure 2, the experimental pipeline yields a complete picture of a simple repression architecture (i.e. a gene regulated by a single binding site for a repressor).

Visual tools for data presentation

Throughout our investigation of the more than 100 genes explored in this study, we repeatedly relied on several key approaches to help make sense of the immense amount of data generated in these experiments. As these different approaches to viewing the results will appear repeatedly throughout the paper, here we familiarize the reader with five graphical representations referred to respectively as information footprints, energy matrices, sequence logos, mass spectrometry enrichment plots and regulatory cartoons, which taken together provide a quantitative description of previously uncharacterized promoters.

Information footprints

From our mutagenized libraries of promoter regions, we can build up a base-pair by base-pair graphical understanding of how the promoter sequence relates to level of gene expression in the form of the information footprint shown in Figure 2. In this plot, the bar above each base pair position represents how large of an effect mutations at this location have on the level of gene expression. Specifically, the quantity plotted is the mutual information Ib at base pair b between mutation of a base pair at that position and the level of expression. In mathematical terms, the mutual information measures how much the joint probability p(m,μ) differs from the product of the probabilities pmut(m)pexpr(μ) which would be produced if mutation and gene expression level were independent. Formally, the mutual information between having a mutation at position b and level of expression is given by

(1) Ib=m=01μ=01p(m,μ)log2(p(m,μ)pmut(m)pexpr(μ)).

Note that both m and µ are binary variables that characterize the mutational state of the base of interest and the level of expression, respectively. Specifically, m can take the values

(2) m={0,ifb is a mutated base1,ifb is a wild-type base.

and µ can take on values

(3) μ={0,for sequencing reads from the DNA library1,for sequencing reads originating from mRNA,

where both m and µ are index variables that tell us whether the base has been mutated and if so, how likely that the read at that position will correspond to an mRNA, reflecting gene expression or a promoter, reflecting a member of the library. The higher the ratio of mRNA to DNA reads at a given base position, the higher the expression. pmut(m) in Equation 1 refers to the probability that a given sequencing read will be from a mutated base. pexpr(μ) is a numeric value that gives the ratio of the number of DNA or mRNA sequencing counts to the total number of sequencing counts for each barcode.

Furthermore, we color the bars based on whether mutations at this location lowered gene expression on average (in blue, indicating an activating role) or increased gene expression (in red, indicating a repressing role). In this experiment, we targeted the regulatory regions based on a guess of where a transcription start site (TSS) will be, based on experimentally confirmed sites contained in RegulonDB (Santos-Zavaleta et al., 2019), a 5’ RACE experiment (Mendoza-Vargas et al., 2009), or by targeting small intergenic regions so as to capture all likely regulatory regions. Further details on TSS selection can be found in the Materials and methods Section 'Library design and construction'. After completing the Reg-Seq experiment, we note that many of the presumed TSS sites are not in the locations assumed, the promoters have multiple active RNA polymerase (RNAP) sites and TSS, or the primary TSS shifts with growth condition. To simplify the data presentation, the '0' base pair in all information footprints is set to the originally assumed base pair for the primary TSS, rather than one of the TSS that was found in the experiment.

Energy matrices

Focusing on an individual putative transcription-factor-binding site as revealed in the information footprint, we are interested in a more fine-grained, quantitative understanding of how the underlying protein-DNA interaction is determined. An energy matrix displays this information using a heat map format, where each column is a position in the putative binding site and each row displays the effect on binding that results from mutating to that given nucleotide (given as a change in the DNA-transcription factor interaction energy upon mutation) (Berg and von Hippel, 1987; Stormo and Fields, 1998; Kinney et al., 2010). These energy matrices are scaled such that the wild type sequence is colored in white, mutations that improve binding are shown in blue, and mutations that weaken binding are shown in red. These energy matrices encode a full quantitative picture for how we expect sequence to relate to binding for a given transcription factor, such that we can provide a prediction for the binding energy of every possible binding site sequence as 

(4) binding energy=i=1Nεi,

where the energy matrix is predicated on an assumption of a linear binding model in which each base within the binding site region contributes a specific value (εi for the ith base in the sequence) to the total binding energy. Energy matrices are either given in A.U. (arbitrary units) or, for several cases where the gene has a simple repression or activation architecture with a single RNA polymerase (RNAP) site, are assigned kBT energy units following the procedure in Kinney et al., 2010 and validated on repression by lac repressor in Barnes et al., 2019. The details of how and when absolute units are determined can be found in Appendix 3 Section 'Inference of scaling factors for energy matrices'.

Sequence logos

From an energy matrix, we can also represent a preferred transcription-factor-binding site with the use of the letters corresponding to the four possible nucleotides, as is often done with position weight matrices (Schneider and Stephens, 1990). In these sequence logos, the size of the letters corresponds to how strong the preference is for that given nucleotide at that given position, which can be directly computed from the energy matrix. This method of visualizing the information contained within the energy matrix is more easily digested and allows for quick comparison among various binding sites.

Mass spectrometry enrichment plots

As the final piece of our experimental pipeline, we wish to determine the identity of the transcription factor we suspect is binding to our putative binding site that is represented in the energy matrix and sequence logo. While the details of the DNA-affinity chromatography and mass spectrometry can be found in the Materials and methods, the results of these experiments are displayed in enrichment plots such as is shown in the bottom panel of Figure 2. In these plots, the relative abundance of each protein bound to our site of interest is quantified relative to a scrambled control sequence. The putative transcription factor is the one we find to be highly enriched compared to all other DNA-binding proteins.

Regulatory cartoons

The ultimate result of all these detailed base-pair-by-base-pair resolution experiments yields a cartoon model of how we think the given promoter is being regulated. A complete set of cartoons for all the architectures considered in our study is presented later in Figure 4. While the cartoon serves as a convenient, visual way to summarize our results, it is important to remember that these cartoons are a shorthand representation of all the data in the four quantitative measures described above and are, further, backed by quantitative predictions of how we expect the system to behave when tested experimentally. Throughout this paper, we use consistent iconography to illustrate the regulatory architecture of promoters with activators and their binding sites in green, repressors in red, and RNAP in blue.

All regulatory architectures uncovered in this study.

For each regulated promoter, activators and their binding sites are labeled in green, repressors and their binding sires are labeled in red, and RNAP-binding sites are labeled in blue. All cartoons are displayed with the transcription direction to the right. Only one RNAP site is depicted per promoter. The transcription-factor-binding sites displayed have either been identified by the method described in the Section 'Automated putative binding site algorithm' or have additional evidence for their presence as described in Table 2. Binding sites found for these promoters in the EcoCyc or RegulonDB databases are only depicted in these cartoons if the sites are within the 160 bp mutagenized region studied, and are detected by Reg-Seq.

Newly discovered E. coli regulatory architectures

Elucidating individual promoters

With the tools outlined above, we are positioned to explore individual promoters, specifically those belonging to the part of the E. coli genome for which the function of the genes is unknown. Previously christened as the ‘y-ome’, Ghatak et al., 2019 surprisingly found that roughly 35% of the genes in E. coli lack experimental evidence of function. The situation is likely worse for other organisms. For many of the genes in the y-ome, we remain similarly ignorant of how those genes are regulated. Figures 4 and 5 provide several examples of genes which until now had unknown regulation. As shown in Figure 5, our study has found the first examples that we are aware of in the entire E. coli genome of a binding site for YciT. These examples are intended to show the outcome of the methods developed here and to serve as an invitation to browse the online resource (https://www.rpgroup.caltech.edu/RegSeq/interactive) where our full dataset is presented.

Examples of the insight gained by Reg-Seq in the context of promoters with no previously known regulatory information.

Activator-binding regions are highlighted in green, repressor binding regions in red, and RNAP binding regions in blue. (A) From the information footprint of the ykgE promoter under different growth conditions, we can identify a repressor-binding site downstream of the RNAP-binding site. From the enrichment of proteins bound to the DNA sequence of the putative repressor as compared to a control sequence, we can identify YieP as the transcription factor bound to this site as it has a much higher enrichment ratio than any other protein. Lastly, the binding energy matrix for the repressor site along with corresponding sequence logo shows that the wild-type sequence is the strongest possible binder and it displays an imperfect inverted repeat symmetry. (B) Illustration of a comparable dissection for the phnA promoter. Numeric values for the displayed data can be found in Figure 5—source data 1.

The ability to find binding sites for both widely acting regulators and transcription factors which may have only a few sites in the whole genome allows us to get an in-depth and quantitative view of any given promoter. As indicated in Figure 5(A) and (B), we were able to perform the relevant search and capture for the transcription factors that bind our putative binding sites. In both of these cases, we now hypothesize that these newly discovered binding site-transcription factor pairs exert their control through repression. The ability to extract the quantitative features of regulatory control through energy matrices means that we can take a nearly unstudied gene such as ykgE, which is regulated by an understudied transcription factor YieP, and quickly get to the point at which we can do quantitative modeling in the style that we and many others have performed on the lac operon (Vilar and Leibler, 2003; Vilar et al., 2003; Bintu et al., 2005; Kinney et al., 2010; Garcia and Phillips, 2011; Vilar and Saiz, 2013; Barnes et al., 2019; Phillips et al., 2019).

A panoply of promoter results

Figure 6 (and Tables 1 and 2) provides a summary of the discoveries made in the work done here using our next-generation Reg-Seq approach. The outcome of our study is a set of hypothesized regulatory architectures as characterized by a suite of binding sites for RNAP, repressors, and activators, as well as the extremely potent binding energy matrices. We do not assume, a priori, that a particular collection of such binding sites is AND, OR, or any other logic (Galstyan et al., 2019). Figure 6(A) provides a shorthand notation that conveniently characterizes the different kinds of regulatory architectures found in bacteria. In this (na, nr) notation, na and nr correspond to the number of recovered activator- and repressor-binding sites, respectively. In previous work (Rydenfelt et al., 2014), we have explored the entirety of what is known about the regulatory genome of E. coli, revealing that the most common motif is the (0, 0) constitutive architecture, although we hypothesized that this is not a statement about the facts of the E. coli genome, but rather a reflection of our collective regulatory ignorance in the sense that we suspect that with further investigation, many of these apparent constitutive architectures will be found to be regulated under the right environmental conditions. The two most common regulatory architectures that emerged from our previous datebase survey are the (0, 1) and the (1, 0) architectures, the simple repression motif and the simple activation motif, respectively. It is interesting to consider that the (0, 1) architecture is in fact the repressor-operon model originally introduced in the early 1960s by Jacob and Monod, 1961 as the concept of gene regulation emerged. Now we see retrospectively the far reaching importance of that architecture across the regulatory genome.

A summary of regulatory architectures discovered in this study.

(A) The cartoons display a representative example of each type of architecture, along with the corresponding shorthand notation. (B) Counts of the different regulatory architectures discovered in this study. We exclude the 'gold-standard' promoters (listed in Appendix 2—table 1) unless new transcription factors are also discovered in the promoter. If, for example, one repressor was newly discovered and two activators were previously known, then the architecture is still counted as a (2,1) architecture. (C) Distribution of positions of binding sites discovered in this study for activators and repressors. Only newly discovered binding sites are included in this figure. The position of the transcription-factor-binding sites are calculated relative to the estimated TSS location, which is based on the location of the associated RNAP site. Numeric values for the binding locations can be found in Figure 6—source data 1.

Table 1
All promoters examined in this study, categorized according to type of regulatory architecture.

Those promoters which have no recognizable RNAP site are labeled as inactive rather than constitutively expressed (0, 0).

ArchitectureTotal number
of promoters
Number of promoters with
at least one newly
discovered binding site
All Architectures11348
(0,0)340
(0,1)2621
(1,0)1110
(1,1)43
(0,2)44
(2,0)32
(1,2)43
(2,1)22
(2,2)11
(3,0)31
(0,3)21
(0,4)10
inactive180
Table 2
All genes investigated in this study categorized according to their regulatory architecture, given as (number of activators, number of repressors).

The regulatory architectures as listed reflect only the binding sites that would be able to be recovered within our 160 bp constructs, but include both newly discovered and previously known binding sites. In those cases where binding sites that appear in RegulonDB or Ecocyc are omitted from this tally, the Section 'Explanation of included binding sites' in Appendix 4 has the reasoning, for each relevant gene, why the binding sites are not shown. The table also lists the number of newly discovered binding sites, previously known binding sites, and number of identified transcription factors. The evidence used for the transcription factor identification is given in the final column. 'Bioinformatic' evidence implies that discovered position weight matrices were compared to known transcription factor position weight matrices. The literature sites column contains only those sites that are both expected to be and are, in actuality, observed in the Reg-Seq data.

ArchitecturePromoterNewly discovered binding sitesLiterature binding sitesIdentified binding sitesEvidence
(0, 0)acuI000
aegA000
arcB000
cra000
dnaE000
ecnB000
fdoH000
holC000
hslU000
htrB000
minC000
modE000
ycgB000
mscL000
pitA000
poxB000
rlmA000
rumB000
sbcB000
sdaB000
tar000
ybdG000
ybiP000
ybjT000
yehT000
yfhG000
ygdH000
ygeR000
yggW000
ynaI000
yqhC000
zapB000
zupT000
amiC000
(0, 1)araC010
bdcR101Known binding location (NsrR) (Partridge et al., 2009)
coaA100
dicC010
dinJ100
ybeZ100
idnK101Mass- Spectrometry (YgbI)
leuABCD101Mass- Spectrometry (YgbI)
mscM100
yedK101Mass- Spectrometry (TreR)
rapA101Growth condition Knockout (GlpR), Bioinformatic (GlpR)
sdiA100
tff-rpsB-tsf101Growth condition Knockout (GlpR), Bioinformatic (GlpR), Knockout (GlpR)
thiM100
tig101Growth condition Knockout (GlpR), Bioinformatic (GlpR), Knockout (GlpR)
ybiO100
ydjA100
yedJ100
phnA101Mass- Spectrometry (YciT)
mutM100
rhlE101Growth condition Knockout (GlpR), Bioinformatic (GlpR), Mass- Spectrometry (GlpR)
uvrD101Bioinformatic (LexA)
dusC100
ftsK010
znuA010
znuCB010
(1, 0)waaA-coaD100
rcsF100
groSL100
mscS100
thrLABC100
yeiQ101Growth condition Knockout (FNR), Bioinformatic (FNR)
ycbZ100
ygjP100
lac010Bioinformatic (CRP)
yehS100
yehU101Growth condition Knockout (FNR), Bioinformatic (FNR)
(0, 2)pcm200
yecE201Mass- Spectrometry (HU)
yjjJ201Growth condition Knockout (MarA), Bioinformatic (MarA)
dcm201Mass- Spectrometry (HNS)
(1, 1)arcA202Growth condition Knockout (FNR), Bioinformatic (FNR), Mass- Spectrometry (FNR, CpxR)
dgoR020Bioinformatic (CRP) Bioinformatic (DgoR)
ykgE202Growth condition Knockout (FNR), Bioinformatic (FNR), Mass- Spectrometry(YieP) Knockout (YieP)
ymgG200
(2, 0)asnA200
fdhE202Growth condition Knockout (FNR, ArcA), Bioinformatic (FNR, ArcA), Knockout (ArcA)
xylF020
(1, 2)marR030Mass- Spectrometry (MarR)
aphA302Growth condition Knockout (FNR), Bioinformatic (FNR), Mass- Spectrometry (DeoR)
iap300
ilvC301Mass- Spectrometry (IlvY) (Rhee et al., 1998)
(2, 1)maoP303Growth condition Knockout (GlpR), Bioinformatic (GlpR), Knockout (PhoP, HdfR, GlpR)
rspA121Mass- Spectrometry (DeoR)
(2, 2)ybjX404Bioinformatic (2 PhoP sites), Mass- Spectrometry (HNS, StpA)
(3, 0)araAB030
xylA030
yicI300
(0, 3)ompR030
ybjL300
(0, 4)relBE040Mass- Spectrometry (RelBE)

For the 113 genes we considered, Figure 6(B) summarizes the number of simple repression (0, 1) architectures discovered, the number of simple activation (1, 0) architectures discovered and so on. A comparison of the frequency of the different architectures found in our study to the frequencies of all the known architectures in the RegulonDB database is provided in Appendix 4—figure 2. Tables 1 and 2 provide a more detailed view of our results. As seen in Table 1, of the 113 genes we considered, 34 of them revealed no signature of any transcription-factor-binding sites and they are labeled as (0, 0). The simple repression architecture (0, 1) was found 26 times, the simple activation architecture (1, 0) was found 11 times, and more complex architectures featuring multiple binding sites (e.g. (1, 1), (0, 2), (2, 0), etc.) were revealed as well. Further, for 18 of the genes that we label 'inactive’, Reg-Seq did not reveal a potential RNAP-binding site. The lack of observable RNAP site could be because the proper growth condition to get high levels of expression was not used, or because the mutation window chosen for the gene does not capture a highly transcribing TSS.

The tables also include our set of 15 'gold standard' genes for which previous work has resulted in a knowledge (sometimes only partial) of their regulatory architectures. We find that our method recovers the regulatory elements of these gold standard cases fully in 11 out of 15 cases, and the majority of regulatory elements in two of the remaining cases. Overall, the performance of Reg-Seq in these gold-standard cases (for more details see Appendix 2—figures 2 and 3) builds confidence in the approach. Further, the failure modes inform us of the blind spots of Reg-Seq. For example, we find it challenging to observe weaker binding sites when multiple strong binding sites are also present such as in the marRAB operon. The araC case study shows that Reg-Seq does not perform well when many repressor sites regulate the promoter. Additionally the method will fail when there is no active TSS in the mutation window, as occurred in the case of dicA. Further details on the comparison to gold standard genes can be found in Appendix 2 Section 'False positive and false negative rates'.

We observe that the most common motif to emerge from our work (with the exception of constitutive expression) is the simple repression motif. Another relevant regulatory statistic is shown in Figure 6(C) where we see the distribution of binding site positions. Our own experience in the use of different quantitative modeling approaches to transcriptional regulation reveal that, for now, we remain largely ignorant of how to account for transcription-factor-binding site positions, and datasets like the one presented here will begin to provide data that can help us uncover how this parameter dictates gene expression. Indeed, with binding site positions and energy matrices in hand, we can systematically move these binding sites and explore the implications for the level of gene expression, providing a systematic tool to understand the role of binding-site position.

Uncovering the action of global regulators

One of the revealing case studies that demonstrates the broad reach of our approach for discovering regulatory architectures is offered by the insights we have gained into two widely acting regulators, GlpR (Figure 7; Schweizer et al., 1985) and FNR (Figure 8; Körner et al., 2003; Kargeti and Venkatesh, 2017). In both cases, we have expanded the array of promoters that they are now known to regulate. Further, these two case studies illustrate that even for widely acting transcription factors, there is a large gap in regulatory knowledge and the approach advanced here has the power to discover new regulatory motifs. The newly discovered binding sites in Figure 7(A), with additional evidence for GlpR binding in Figure 7(B) and (C), more than double the number of operons known to be regulated by GlpR as reported in RegulonDB (Santos-Zavaleta et al., 2019). We found five newly regulated operons in our data set, even though we were not specifically targeting GlpR regulation. Although the number of example promoters across the genome that we considered is too small to make good estimates, finding five regulated operons out of approximately 100 examined operons supports the claim that GlpR widely regulates and many more of its sites would be found in a full search of the genome. The regulatory roles revealed in Figure 7(A) also reinforce the evidence that GlpR is a repressor.

GlpR as a widely acting regulator.

(A) Information footprints for the promoters which we found to be regulated by GlpR, all of which were previously unknown. Activator-binding regions are highlighted in green, repressor-binding regions in red, and RNAP-binding regions in blue. (B) GlpR was demonstrated to bind to rhlE by mass spectrometry. (C) Sequence logos for GlpR-binding sites. Binding sites in the promotes of tff, tig, maoP, rhlE, and rapA have similar DNA binding preferences as seen in the sequence logos and each transcription-factor-binding site binds strongly only in the presence of glucose (As shown in (A)). These similarities suggest that the same transcription factor binds to each site. To test this hypothesis, we knocked out GlpR and ran the Reg-Seq experiments for tff, tig, and maoP. In (A), we see that knocking out GlpR removes the binding signature of the transcription factor. Numeric values for the binding locations can be found in Figure 7—source data 1.

FNR as a global regulator.

FNR is known to be upregulated in anaerobic growth, and here we found it to regulate a suite of six genes. In aerobic growth conditions, the putative FNR sites are weakened. (A) Information footprints for the six regulated promoters. Activator binding regions are highlighted in green, repressor-binding regions in red, and RNAP binding regions in blue. (B) Sequence logos for the FNR-binding sites displayed in (A). The DNA binding preference of the six sites are shown to be similar from their sequence logos. Numeric values for the binding locations can be found in Figure 8—source data 1.

For the GlpR-regulated operons newly discovered here, we found that this repressor binds strongly in the presence of glucose while all other growth conditions result in greatly diminished, but not entirely abolished, binding (Figure 7(A)). As there is no previously known direct molecular interaction between GlpR and glucose and the repression is reduced but not eliminated, the derepression in the absence of glucose is likely an indirect effect. As a potential mechanism of the indirect effect, gpsA is known to be activated by CRP (Seoh and Tai, 1999), and GpsA is involved in the synthesis of glycerol-3-phosphate (G3P), a known binding partner of GlpR which disables its repressive activity (Larson et al., 1987). Thus, in the presence of glucose, GpsA and consequently G3P will be found at low concentrations, ultimately allowing GlpR to fulfill its role as a repressor.

Prior to this study, there were four operons known to be regulated by GlpR, each with between 4 and 8 GlpR-binding sites (Larson et al., 1992; Zhao et al., 1994; Yang and Larson, 1996; Ye and Larson, 1988; Weissenborn et al., 1992), where the absence of glucose and the partial induction of GlpR was not enough to prompt a notable change in gene expression (Lin, 1976). These previously explored operons seemingly are regulated as part of an AND gate. glpTQ, glpRABC, glpD, and glpFKX have high gene expression when grown in growth media that does not contain glucose but does contain contain G3P (or glycerol, which leads to high concentrations of G3P). All other combinations of growth media, such as M9 glucose with G3P, or growth in LB without G3P, lead to low gene expression (Lin, 1976). In contrast, we have discovered operons whose regulation appears to be mediated by a single GlpR site per operon. With only a single site, GlpR functions as an indirect glucose sensor, as only the absence of glucose is needed to relieve repression by GlpR.

The second widely acting regulator our study revealed, FNR, has 151 binding sites already reported in RegulonDB and is well studied compared to most transcription factors (Santos-Zavaleta et al., 2019). However, the newly discovered FNR sites displayed in Figure 8(A), with sequence logos of the respective sites displayed in Figure 8(B), demonstrate that even for well-understood transcription factors there is much still to be uncovered. Our information footprints are in agreement with previous studies suggesting that FNR acts as an activator. In the presence of O2, dimeric FNR is converted to a monomeric form and its ability to bind DNA is greatly reduced (Myers et al., 2013). Only in low oxygen conditions did we observe a binding signature from FNR, and we show a representative example of the information footprint from one of 11 aerobic growth conditions in Figure 8(A).

We observe quantitatively how FNR affects the expression of fdhE both directly through transcription factor binding (Figure 9B and C) and indirectly through increased expression of ArcA (Figure 9A, B, C and D). Also, fully understanding even a single operon often requires investigating several regulatory regions as we have in the case of fdoGHI-fdhE by investigating the main promoter for the operon as well as the promoter upstream of fdhE. 36% of all multi-gene operons have at least one TSS which transcribes only a subset of the genes in the operon (Conway et al., 2014). Regulation within an operon is even more poorly studied than regulation in general. The main promoter for fdoGHI-fdhE has a repressor-binding site, which demonstrates that there is regulatory control of the entire operon. However, we also see in Figure 9(B) that there is control at the promoter level, as fdhE is regulated by both ArcA and FNR and will therefore be upregulated in anaerobic conditions (Compan and Touati, 1994). The main TSS transcribes all four genes in the operon, while the secondary site shown in Figure 9(B) only transcribes fdhE, and therefore anaerobic conditions will change the stoichiometry of the proteins produced by the operon. By investigating over a hundred promoter regions in this experiment it becomes feasible to target multiple promoters within an operon as we have done with fdoGHI-fdhE. We can then determine under what conditions an operon is internally regulated.

Inspection of a genetic circuit.

(A) Here, the information footprint of the arcA promoter is displayed along with the energy matrix describing the discovered FNR-binding site. (B) Intra-operon regulation of fdhE by both FNR and ArcA. The information footprint of fdhE is displayed. The discovered sites for FNR and ArcA are highlighted and the energy matrix for ArcA is displayed. A TOMTOM (Gupta et al., 2007) search of the binding motif found that ArcA was the most likely candidate for the transcription factor. The displayed information footprint from a knockout of ArcA demonstrates that the binding signature of the site, and its associated RNAP site, are no longer determinants of gene expression. (C) Sequence logos for FNR generated from both the sites cataloged in RegulonDB, as well as the discovered sites regulating arcA and fdhE. (D) Sequence logos for ArcA from sites contained in RegulonDB and the ArcA site regulating fdhE. Numeric values for the binding locations can be found in Figure 9—source data 1.

In summary

By examining the over 100 promoters considered here, grown under 12 growth conditions, we have a total of more than 1000 information footprints and data sets. In this age of big data, methods to explore and draw insights from that data are crucial. To that end, as introduced in Figure 10, we have developed an online resource (see https://www.rpgroup.caltech.edu/RegSeq/interactive) that makes it possible for anyone who is interested to view our data and draw their own biological conclusions. Information footprints for any combination of gene and growth condition are displayed via drop down menus. Each identified transcription-factor-binding site is marked, and energy matrices for all transcription-factor-binding sites are displayed. In addition, for each gene, we feature a simple cartoon-level schematic that captures our now current, best understanding of the regulatory architecture and resulting mechanism.

Representative view of the interactive figure that is available online.

This interactive figure captures the entirety of our dataset. Each figure features a drop-down menu of genes and growth conditions. For each such gene and growth condition, there is a corresponding information footprint revealing putative binding sites, an energy matrix that shows the strength of binding of the relevant transcription factor to those binding sites and a cartoon that schematizes the newly-discovered regulatory architecture of that gene. Numeric values for the binding locations can be found in Figure 10—source data 1.

The interactive figure in question was invaluable in identifying transcription factors, such as GlpR, whose binding properties vary depending on growth condition. As sigma factor availability also varies greatly depending on growth condition, studying the interactive figure identified many of the secondary RNAP sites present. The interactive figure provides a valuable resource both to those who are interested in the regulation of a particular gene and those who wish to look for patterns in gene regulation across multiple genes or across different growth conditions.

Discussion

The study of gene regulation is one of the centerpieces of modern biology. As a result, it is surprising that in the genome era, our ignorance of the regulatory landscape in even the best-understood model organisms remains so vast. Despite understanding the regulation of transcription initiation in bacterial promoters (Browning and Busby, 2016), and how to tune their expression (Barnes et al., 2019), we lack an experimental framework to unravel understudied promoter architectures at scale. As such, in our view, one of the grand challenges of the genome era is the need to uncover the regulatory landscape for each and every organism with a known genome sequence. Given the ability to read and write DNA sequences at will, we are convinced that to make that reading of DNA sequence truly informative about biological function and to give that writing the full power and poetry of what Crick christened 'the two great polymer languages', we need a full accounting of how the genes of a given organism are regulated and how environmental signals communicate with the transcription factors that mediate that regulation – the so-called 'allosterome' problem (Lindsley and Rutter, 2006). The work presented here provides a general methodology for making progress on the former problem and also demonstrates that, by performing Reg-Seq in different growth conditions, we can make headway on the latter problem as well.

The advent of cheap DNA sequencing offers the promise of beginning to achieve this grand challenge in the form of MPRAs as reviewed in Kinney and McCandlish, 2019. A particular implementation of such methods was christened Sort-Seq (Kinney et al., 2010) and was demonstrated in the context of well understood regulatory architectures. A second generation of the Sort-Seq method (Belliveau et al., 2018) established a full protocol for regulatory dissection through the use of DNA-affinity chromatography and mass spectrometry which made it possible to identify the transcription factors that bind the putative binding sites discovered by Sort-Seq. However, there were critical shortcomings in the method, not least of which was that it lacked the scalability to uncover the regulatory genome in a more multiplexed manner.

The work presented here builds on the foundations laid in previous studies by invoking RNA-Seq as a readout for the level of expression of the promoter mutant libraries needed to infer information footprints and their corresponding energy matrices and sequence logos. The original inference and hypothesis generation is followed by a combination of mass spectrometry, comparison of binding motifs, and gene knockouts to identify the transcription factors that bind those sites. The case studies described in the main text showcase the ability of the Reg-Seq method to deliver on the promise of beginning to uncover the regulatory genome systematically. The extensive online resources hint at a way of systematically reporting those insights in a way that can be used by the community at large to develop regulatory intuition for biological function and to design novel regulatory architectures using energy matrices.

However, several shortcomings remain in the approach introduced here. First, the current implementation of Reg-Seq is not fully automated for various aspects in the experimental pipeline; for example, manual examination of information footprints is used to generate testable regulatory hypotheses. As the method is scaled up further, this can limit throughput of the analysis. To address this for future work, we have created an automated methodology for identifying putative binding sites, which we describe in the Materials and methods section, that will simplify future scaled up efforts at identifying putative binding sites. All putative binding sites reported in this study either were identified through the automated methodology or have additional evidence for their presence such as mass spectrometry. In addition, these regulatory hypotheses can be converted into gene regulatory models using statistical physics (Buchler et al., 2003; Bintu et al., 2005). However, here too, as the complexity of the regulatory architectures increases, it will be of great interest to use automated model generation as suggested in a recent biophysically based neural network approach (Tareen and Kinney, 2019).

Another key challenge faced by the methods described here is that the mass spectrometry and the gene knockout confirmation aspects of the experimental pipeline remain low-throughput and, at times, inconclusive. Occasionally, we have found it challenging to observe weaker binding sites when multiple strong binding sites are also present. This was the case for the marRAB operon. To make our transcription factor identification methods more high-throughput, we have begun to explore a new generation of experiments such as in vitro binding assays that will make it possible to accomplish transcription factor identification in a multiplexed manner. Specifically, we are exploring multiplexed mass spectrometry measurements and multiplexed Reg-Seq on libraries of gene knockouts as ways to break the identification bottleneck. Transcription factor identification using Reg-Seq is also complicated by the growth conditions that we can test; for the 18 genes that we tested and labeled as 'inactive' in this study, Reg-Seq did not reveal even an RNAP-binding site, suggesting that the proper growth condition to get high levels of expression was not used, or perhaps that the mutation window chosen for the gene does not capture a highly transcribing TSS. While information on the location of a TSS is available for 2500 of 2600 operons in E. coli (Santos-Zavaleta et al., 2019), this information does not guarantee those sites will have high transcription in the growth conditions studied. Similarly, many genes have multiple TSS that can be active under different growth conditions. In these cases, we are limited both by the finite set of growth conditions we test as well as by the length of the mutation window, as it cannot always capture all TSS.

Another shortcoming of the current implementation of the method is that it misses regulatory action at a distance. Indeed, our laboratory has invested a significant effort in exploring such long-distance regulatory action in the form of DNA looping in bacteria (Johnson et al., 2012; Han et al., 2009) and V(D)J recombination in jawed vertebrates (Lovely et al., 2015; Hirokawa et al., 2020). It is well known that transcriptional control through enhancers in eukaryotic regulation is central in contexts ranging from embryonic development to hematopoiesis (Melnikov et al., 2012). The current incarnation of the methods, as described here, have focused on contiguous regions in the vicinity of the transcription start site (within the 160 base pair mutagenized window). Clearly, to dissect the entire regulatory genome, these methods will have to be extended to non-contiguous regions of the genome.

Despite their limitations, the findings from this study provide a foundation for systematic, multiplexed regulatory dissections. We have developed a method to pass from complete regulatory ignorance to designable, regulatory architectures and we are hopeful that others will adopt these methods with the ambition of uncovering the regulatory architectures that preside over their organisms of interest.

Materials and methods

Here, we provide an overview of the key methodological aspects of Reg-Seq. Extensive details of the methods used in this study can also be found on the GitHub Wiki associated with this work.

Library design and construction

Request a detailed protocol

We selected 113 TSS from the E. coli K12 genome for experiments. The promoter regions analyzed in this study were each 160 base pairs in length, a region that includes 45 base pairs downstream and 115 base pairs upstream of each TSS. The general principles by which we selected each TSS were to first prioritize those TSS which have been extensively experimentally validated and catalogued in RegulonDB (Santos-Zavaleta et al., 2019) or EcoCyc (Keseler et al., 2017). Secondly, we selected those sites which had evidence of active transcription from RACE experiments (Mendoza-Vargas et al., 2009) and were listed in RegulonDB. If a TSS lacked both experimental evidence and active transcription as indicated by RACE experiments, we used the computationally predicted TSS as indicated on RegulonDB (Santos-Zavaleta et al., 2019) or EcoCyc (Keseler et al., 2017) and determined previously by Huerta and Collado-Vides, 2003. If there were multiple TSS located upstream of the gene in question, we selected the TSS closest to the gene start, unless selecting one further upstream would allow for multiple TSS to be contained in the 160 base pair mutated region analyzed for each promoter. Not all TSS locations are known, and many genes have multiple TSS. The exact start sites used, as well as the reasoning behind our selection of each TSS, are listed in Supplementary file 1.

Promoter variants were synthesized on a microarray (TWIST Bioscience, San Francisco, CA). The sequences were designed computationally such that each base in the 160 base pair promoter region has a 10% probability of being mutated. For each promoter’s oligonucleotide library, we ensured that the mutation rate as averaged across all sequences was kept between 9.5% and 10.5%, otherwise the library was regenerated. There are an average of 2200 unique promoter sequences per gene (for an analysis of how our results depend upon number of unique promoter sequences see Appendix 3—figure 1). The library arrived lyophilized (76 pmol) and was resuspended in 100 µL of TE (pH 8.0). Of the resuspended oligonucleotide, 1 µL was amplified for 12 cycles with New England Biolabs Q5 High-Fidelity 2x Master Mix (NEB, Ipswich, MA) to increase the quantity of DNA in the library. Unless otherwise stated, all amplifications were performed using this polymerase mixture.

The PCR product was then run on a 2% TAE agarose gel, and approximately 200 base pair amplicons were extracted using a Zymoclean Gel DNA Recovery Kit (Zymo Research, Irvine, CA). To add a random 20-nucleotide barcode to each oligonucleotide, 1 ng of the purified DNA library was amplified for 10 PCR cycles using primers containing random 20-nucleotide DNA overhangs. All primer sequences can be found in Supplementary file 2. After cleaning this PCR product using a Zymo Clean and Concentrator Kit (Zymo Research, Irvine, CA), the library was cloned into the plasmid backbone of pJK14 (SC101 origin) (Kinney et al., 2010) using Gibson Assembly. An illustration of this plasmid is displayed in Appendix 1—figure 1. Genetic constructs were electroporated into E. coli K-12 MG1655 (Blattner et al., 1997) and plated on LB plates with kanamycin. After 24 hr of growth on plates, libraries were scraped and inoculated into M9 media with 0.5% glucose in preparation for DNA sequencing.

All genetic barcodes were inserted 120 base pairs from the 5’ end of the mRNA, containing 45 base pairs from the targeted regulatory region, 64 base pairs containing primer sites used in the construction of the plasmid, and 11 base pairs containing a three frame stop codon. Exact sequences of primers and spacer sequences for the constructs are listed in Supplementary file 2. Following each genetic barcode, there is an RBS, a GFP-coding region, and a terminator.

Preparation of libraries for sequencing

Request a detailed protocol

To prepare cDNA libraries for sequencing, cells were grown to an optical density of 0.3 and RNA was stabilized using Qiagen RNA Protect (Qiagen, Hilden, Germany). Lysis was performed using lysozyme (Sigma Aldrich, Saint Louis, MO) and RNA isolated using the Qiagen RNA Mini Kit. Reverse transcription was preformed using Superscript IV (Invitrogen, Carlsbad, CA) with a specific primer for the labeled mRNA. qPCR was then performed in triplicate to check the level of DNA contamination. Any sample that had contaminating DNA at a level of 5% or more of the mRNA concentration was discarded. DNA libraries were prepared by growing cells to an optical density of 0.3 and isolating plasmid DNA with a spin miniprep kit (Qiagen, Hilden, Germany).

Sequencing

Request a detailed protocol

After preparing the barcoded libraries, we used next-generation sequencing (NGS) to map promoters to their respective barcodes. Sequencing libraries (both cDNA and DNA) had unindexed illumina flow cell adaptors attached via PCR, using primers that amplified a 221 base pair region that included the random barcode. We limited PCR cycles to exponential amplification, as determined by qPCR. Specifically, when we performed qPCR to check for DNA contamination, we also determined the number of cycles at which each sample reached exponential amplification, and then repeated the PCR reactions with the determined number of cycles to limit bias. After amplification, libraries were cleaned using a Zymo Clean and Concentrator kit and analyzed on an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA). Samples were submitted to NGX Bio (NGX Bio, South Plainfield, NJ) for 150 base pair paired-end sequencing on a Hi-Seq 2500 (Illumina, San Diego, CA). We typically acquired 250 million total reads for mapping of libraries. Further details of how we process the sequences can be found in Appendix 1 Section 'Sequencing Analysis' and the GitHub Wiki associated with this work.

To quantify relative gene expression values for each promoter mutant in our library, we next grew cells expressing the DNA libraries in various growth conditions to an OD600 of 0.3. DNA and cDNA libraries were prepared in the same way as stated previously, and were sequenced at the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech on a HiSeq 2500 with a 100 base pair single read flow cell. An average of five unique 20 base pair barcodes per variant promoter was used for the purpose of counting transcripts. Specifically, for each promoter variant the number of sequences from the DNA library and the number of sequences produced from mRNA are determined.

Determination of energy matrices

Request a detailed protocol

Energy matrices are used to represent the binding energy contribution for each nucleotide in a DNA sequence. We use relative gene expression values, as determined by counting genetic barcodes from NGS data for each mutated variant of a given regulatory sequence, and infer the energy contribution of each nucleotide by maximizing the mutual information between the rank-ordered binding strength predictions from the energy matrix and the gene expression data. We also perform this maximization using MCMC. Further discussion of how energy matrices are inferred can be found in Appendix 3 Section 'Energymatrix inference' and on the GitHub Wiki that accompanies this study.

In each energy matrix plot, a red box indicates that a mutation to a nucleotide in that position decreases the energy of transcription factor binding, while a blue box indicates that a mutation at a given nucleotide position increases transcription-factor-binding energy. Energy matrices are typically given in arbitrary units, but the method by which we can assign absolute units in kbT is covered in Appendix 3 Section 'Inference of scaling factors for energy matrices'.

DNA-affinity chromatography and mass spectrometry

Request a detailed protocol

Upon identifying a putative transcription-factor-binding site, we used DNA-affinity chromatography, as performed in Belliveau et al., 2018, to isolate and enrich for the transcription factor of interest. In brief, we order biotinylated oligos of our binding site of interest (Integrated DNA Technologies, Coralville, IA) along with a control, 'scrambled' sequence, that we expect to have no specificity for the given transcription factor. We tether these oligos to magnetic streptavidin beads (Dynabeads MyOne T1; ThermoFisher, Waltham, MA), and incubate them overnight with whole cell lysate grown in the presences of either heavy (with 15N) or light (with 14N) lysine for the experimental and control sequences, respectively. The next day, proteins are recovered by digesting the DNA with the PtsI restriction enzyme (New England Biolabs, Ipswich, MA), whose cut site was incorporated into all designed oligos.

Protein samples were then prepared for mass spectrometry by either in-gel or in-solution digestion using the Lys-C protease (Wako Chemicals, Osaka, Japan). Liquid chromatography coupled mass spectrometry (LC-MS) was performed as previously described by Belliveau et al., 2018, and is further discussed in Appendix 3 Section 'Processing of mass spectrometry experiments'. SILAC labeling was performed by growing cells (Δ LysA) in either heavy isotope form of lysine or its natural form.

It is also important to note that while we utilized the SILAC method to identify the transcription factor identities, our approach does not require this specific technique. Specifically, our method only requires a way to contrast between the copy number of proteins bound to a target promoter in relation to a scrambled version of the promoter. In principle, one could use multiplexed proteomics based on isobaric mass tags (Pappireddi et al., 2019) to characterize up to 10 promoters in parallel. Isobaric tags are reagents used to covalently modify peptides by using the heavy-isotope distribution in the tag to encode different conditions. The most widely adopted methods for isobaric tagging are the isobaric tag for relative and absolute quantitation (iTRAQ) and the tandem mass tag (TMT). This multiplexed approach involves the fragmentation of peptide ions by colliding with an inert gas. The resulting ions are resolved in a second MS-MS scan (MS2).

Only a subset (13) of all transcription factor targets were identified by mass spectrometry due to limitations in scaling the technique to large numbers of targets. The transcription factors identified by this method are enriched more than any other DNA binding protein, with p<0.01 using the outlier detection method as outlined by Cox and Mann, 2008, with corrections for multiple hypothesis testing using the method proposed by Benjamini and Hochberg, 1995. Details on data processing can be found in Appendix 3 Section 'Processing of mass spectrometry experiments'. A detailed explanation of all experimental and computational steps can be found in the GitHub Wiki that accompanies this work.

Construction of knockout strains

Request a detailed protocol

Conducting DNA-affinity chromatography followed by mass spectrometry on putative binding sites resulted in potential candidates for the transcription factors that bind to the target region. For some cases, to verify that a given transcription factor is, in fact, regulating a given promoter, we repeated the RNA sequencing experiments on strains in which the transcription factor of interest has been knocked out.

To construct the knockout strains, we ordered strains from the Keio collection (Yamamoto et al., 2009) from the Coli Genetic Stock Center. These knockouts were put in a MG1655 background via phage P1 transduction and verified with Sanger sequencing. To remove the kanamycin resistance that comes with the strains from the Keio collection, we transformed in the pCP20 plasmid (Datsenko and Wanner, 2000), induced FLP recombinase, and then selected for colonies that no longer grew on either kanamycin or ampicillin, verifying both loss of the chromosomally integrated kanamycin resistance and the pCP20 plasmid which confers ampicillin resistance. Finally, we transformed our desired promoter libraries into the constructed knockout strains, allowing us to perform the RNA sequencing in the same context as the original experiments.

Automated putative binding site algorithm

Request a detailed protocol

We introduce a systematized way of identifying the locations of binding sites to supplement manual curation (described in the Section 'Manual selection of binding sites'). As illustrated in Figure 11, for a given information footprint, we average over 15 base pair 'windows'. We then determine which base pairs are part of a regulatory region by setting an information threshold of 2.5×104 bits. Threshold selection is described in Appendix 2 Section 'False positive and false negative rates'. All base pair positions that pass the information threshold were then joined into regulatory regions. We consider 'activator-like' (mutation decreases expression) and 'repressor-like' (mutation increases expression) base pairs separately. This means that it is possible to have overlapping repressor- and activator-binding sites identified. We join any base pair positions within four base pairs of each other into single regulatory regions. We then find the edges of the region by trimming off any base pairs at the edge that are below the information threshold (even if the 15 base pair average is above the threshold). While we can often resolve overlapping or nearby repressors from activators, a limitation of this method of identification is that is cannot resolve two activators or two repressors that are very close to each other or overlapping.

Procedure to identify binding site regions automatically.

First, an information footprint is generated for the target region. Next, the information footprint is smoothed over a 15 base pair sliding window and a threshold of 2.5×104 bits is applied to identify regions of interest. RNAP-binding sites are first identified (in blue), and the remainder of the regulatory regions are identified as repressor-binding sites (if they tend to increase expression on mutation from wild type) or activator-binding sites (if they tend to decrease expression upon mutation).

To identify RNAP-binding sites, we compare the sequence preference (through energy matrices and sequence logos) to experimentally validated examples of RNAP sites. We have examples of energy matrices for the σ70 RNAP site from Belliveau et al., 2018. For energy matrices of other σ factor binding sites, such as σ32 and σ28, we use energy matrices generated from within the Reg-Seq experiment itself. For a σ32 binding site, for example, we used the example from the hslU gene. For a σ28 binding site, we used the energy matrix generated from the dnaE gene. We 'scan' the example energy matrices across the mutated region. For each position in the region, we calculate the Pearson correlation coefficient between the example RNAP energy matrix and the inferred energy matrix at that position. We find RNAP-binding site locations by thresholding the Pearson correlation coefficients at a value of 0.45. When performing manual curation of binding sites, we visually compare the sequence logos of the example RNAP-binding sites to the sequence logos of putative binding sites. Further details of the method to create energy matrices and compare them to known motifs are given in Appendix 3 Section 'Energy matrix inference' and Appendix 3 Section 'TOMTOM motif comparison', respectively. Further, a detailed discussion of energy matrix construction is provided in the Sequencing Analysis GitHub Wiki page that accompanies this work.

Manual selection of binding sites

Request a detailed protocol

Similarly to the automated method of locating putative binding regions, we look for regions of high mutual information in the information footprints. While there was no hard cut-off for mutual information values during manual curation, we select clusters of base pairs that have a similar average information value (2.5×104 bits) to that described in the Section 'Automated putative binding site algorithm'.

During manual curation of binding sites, we also disqualify any binding sites where there are only three or fewer base pairs with high values in the mutual information footprint. The logic behind this decision is that individual bases with very high mutual information can potentially indicate that a putative binding site is only active when a certain mutation occurs. In turn, the binding site would not be active in wild-type conditions. To explain why this is, consider that a typical binding site mutation, at any given base pair, will significantly weaken the binding site of interest. Therefore, each of those mutated base pairs is said to have a 'large effect' on expression. For a very poor binding site that is not active in the wild-type case, most mutations will further weaken a site which already will have only a minor effect on gene expression. However, for a small number of base pairs, a mutation can occur that makes the DNA bind more tightly to the transcription factor, making it relevant for gene expression. Therefore, in the case of an extremely weak binding site that is not relevant in the wild type condition, there can still be a small number of highly informative bases. Initial hypothesis generation in Reg-Seq was done manually. However, all those sites that are reported in Table 2 that do not have additional validation through mass spectrometry, gene knockouts, or bioinformatics appear in the set of putative binding sites generated by the method described in Section 'Automated putative binding site algorithm'.

Code and data availability

Request a detailed protocol

An in-depth discussion of all experimental protocols and mathematical analysis used in this study can be found on the GitHub Wiki for this study (Ireland, 2020 https://github.com/RPGroup-PBoC/RegSeq/wiki (copy archived at https://github.com/elifesciences-publications/RegSeq). All code used for processing data and plotting as well as the final processed data, plasmid sequences, and primer sequences can also be found on the GitHub repository(archived by Zenodo; https://doi.org/10.5281/zenodo.3966687). Energy matrices were generated using the MPAthic software (Ireland and Kinney, 2016). All raw sequencing data is available at the Sequence Read Archive (accession no.PRJNA599253 and PRJNA603368). All inferred information footprints and energy matrices can be found on the GitHub repository (archived by Zenodo; https://doi.org/10.5281/zenodo.3966687). All mass spectrometry raw data is available on the CaltechData repository (https://doi.org/10.22002/d1.1336).

Appendix 1

Extended details of experimental design

Choosing target genes

Genes in this study were chosen to cover several different categories. Twenty-nine genes had at least one transcription-factor-binding site listed in RegulonDB and were picked to validate our method under a number of conditions (15 with relevant high evidence sites). Thirty-seven were chosen because the work of Schmidt et al., 2016 demonstrated that gene expression changed significantly under different growth conditions. A handful of genes such as minC, maoP, or fdhE were chosen because we found either their physiological significance interesting, as in the case of minC, whose product is crucial for cell division and proper partitioning of the cell into two equal sized daughters in E. coli (Lutkenhaus, 2007). Alternatively, for some cases we found the gene regulatory question interesting, such as for the intra-operon regulation demonstrated by fdhE. The remainder of the genes were chosen because they had no regulatory information, often had minimal information about the function of the gene, and had an annotated transcription start site (TSS) in RegulonDB. A list of all genes chosen can be found in Supplementary file 1.

Sequencing analysis

Appendix 1—figure 1
Schematic of the genetic construct used in this study.

Mutated DNA libraries for each regulatory region were expressed from a pSC101 plasmid with kanamycin resistance (kanR). Each mutated sequence is 160 bp in length, which includes 45 bp downstream and 115 bp upstream of a given TSS. Each mutated sequence is flanked by primer-binding sites to facilitate cloning. The genetic construct also contains a random barcode, a ribosome-binding site (RBS), a GFP gene, and a terminator labeled with a large 'T'.

In this Appendix section, we provide further details associated with the analysis of next-generation sequencing (NGS) results, from both the 'mapping' experiment, in which each unique barcode is 'linked' to its corresponding mutated promoter region, and from the barcode sequencing experiments, in which the frequency of each barcode is counted and relative gene expression values determined. It is important to perform two sequencing experiments, in this manner, for a couple of reasons. Oligonucleotide libraries ordered from Twist Bioscience, which we use to construct promoter regions mutated at a 10% rate, are prone to random errors. This means that we do not fully know what is in the ordered library, and so it is necessary to sequence the full library and determine which mutations are present in each promoter region. The 'mapping' phase of experiments also serves to connect each random, genetic barcode (which is added via PCR with primer overhangs) to its corresponding, mutated promoter. By linking barcodes to promoters, we are able to build a 'codex' that enables us to count genetic barcodes and, in turn, understand the relative gene expression values for each mutant promoter sequence.

For the 'mapping' of genetic barcodes to their corresponding mutant promoter, we use paired-end sequencing, with 150 cycles for both Read 1 and Read 2, on a Hi-Seq 2500 machine. We acquired 250 million total reads for mapping of libraries.

In our analysis of FASTQ files, we removed any barcodes that were associated with a promoter variant which had insertions or deletions. Similarly, any genetic barcodes associated with multiple promoter variants were removed from the analysis, as were any sequences which appeared only once (barcodes must appear at least two times to be analyzed, as the appearance of a single, unique barcode sequence could be attributed to a sequencing error). The paired end reads from this sequencing step were assembled using the FLASH tool (Magoč and Salzberg, 2011). Any sequence with a PHRED score less than 20 was then removed using the FastX toolkit (Hannon, 2010). The specific commands used for this step of our analysis are listed on the GitHub Wiki associated with this work.

To analyze the 'mapping' data and link each genetic barcode to its unique, mutagenized promoter region, we used a custom Python module, which can be found on the GitHub repository associated with this work. This module contains functions to check that sequences are the expected length, map unique barcodes to their corresponding promoter regions, and extract barcode sequences for subsequent sequencing experiments. We also provided a Jupyter notebook on the GitHub repository which provides a step-by-step walkthrough of the code used in processing sequencing data.

After mapping each barcode to its corresponding, mutated promoter region, we next 'count' barcodes, both DNA and cDNA, to determine the relative gene expression values for each mutated promoter. For barcode counting experiments, only the region containing the random, 20 bp barcode was sequenced. For each growth condition, each promoter library yielded 20,000 to 500,000 usable sequencing reads. If the dataset for a gene in a given growth condition did not have at least 20,000 reads, it was not analyzed further, as we consistently found that, below this threshold, we reached a regime wherein the inference reliability of MCMC was reduced.

When preparing DNA and cDNA for NGS, we add a 4nt barcode, via PCR, to the library isolated from each growth condition. These 4nt barcodes are used during data analysis both to map each library to its particular growth condition and to keep track of biological replicates, while the 20 bp barcodes can be used to identify each mutated promoter region. We performed all experiments with two biological replicates.

After collecting the FASTQ files, we perform quality filtering with FastX. We then perform barcode splitting with the FastX toolkit to separate each FASTQ file based on its growth condition, as well as separate the sequencing files based on whether they are derived from the DNA or cDNA library. Each experimental condition (both biological replicates, RNA vs. DNA, and growth conditions) receives a unique, 4nt barcode sequence, which enables us to identify where each library came from. Full details of our sequencing analysis methodologies, as well as all Python scripts, can be found on the GitHub repository associated with this work.

Growth conditions

The growth conditions used in this study were inspired by Schmidt et al., 2016, a study which observed changes in the E. coli proteome under growth conditions similar to the ones presented. The growth conditions utilized in this study are tabulated in Appendix 1—table 1. The growth conditions explored here involved a range of environmental perturbations including altering the carbon source, inducing stress, or introducing trace metals. Unless otherwise noted in the caption of Appendix 1—table 1, the cells were grown in the medium at 37°C until reaching an OD of 0.3, at which point the cells were harvested and the RNA extracted. These growth conditions were chosen so as to span a wide range of growth rates, as well as to illuminate any carbon source specific regulators.

Appendix 1—table 1
All growth conditions used in the Reg-Seq study.

Heat shocked cells were exposed to 42°C for 5 min upon reaching OD 0.3 as this is known to induce transcription by σ32 (Arsène et al., 2000). Low oxygen growth cells were grown in a flask sealed with parafilm with minimal oxygen, although some was present as no anaerobic chamber was used. This level of oxygen stress was still sufficient to activate FNR binding, thus activating anaerobic metabolism. For cells grown with iron, upon reaching OD of 0.3 iron was added and cells were incubated for 10 min before harvesting RNA. Growth without cAMP was accomplished by the use of the JK10 strain (Kinney et al., 2010) which does not maintain its cAMP levels.

Growth conditions
M9 with glucose (0.5%)
M9 with acetate (0.5%)
M9 with arabinose (0.5%)
M9 with xylose (0.5%) and arabinose (0.5%)
M9 with succinate (0.5%)
M9 with trehalose (0.5%)
M9 with glucose (0.5%) and 5 mM sodium salycilate
LB
heat shock in M9 with glucose (0.5%)
LB in low oxygen
zinc, 5 mM ZnCl in M9 with glucose (0.5%)
iron, 5 mM FeCL in M9 with glucose (0.5%)
no cAMP in M9 with glucose (0.5%)

All knockout experiment were performed in M9 with glucose except for the knockouts for arcA, hdfR, and phoP which were grown in LB.

Appendix 2

Validating Reg-Seq against previous methods and results

The work presented here is effectively a third-generation of the use of Sort-Seq methods for the discovery of regulatory architecture. The primary difference between the present work and previous generations (Kinney et al., 2010; Belliveau et al., 2018) is the use of RNA-Seq rather than fluorescence and cell sorting as a readout of the level of expression of our promoter libraries. As such, there are many important questions to be asked about the comparison between the earlier methods and this work. We attack that question in several ways. First, as shown in Figure 3, we have performed a head-to-head comparison of the two approaches to be described further in this section. Second, as shown in the next section, our list of candidate promoters included roughly 20% for which there is at least one experimentally validated transcription-factor-binding site. In these cases, we examined the extent to which our methods recover the known features of regulatory control about those promoters.

Comparison between Reg-Seq by RNA-Seq and fluorescent sorting

As the basis for comparing the results of the fluorescence-based Sort-Seq approach with our RNA-Seq-based approach, we use information footprints and position weight matrices as our metrics.

When making these comparisons between the two methods, we compare the values of a position weight matrix (PWM), often displayed as a sequence logo, generated from the Sort-Seq and Reg-Seq methods. PWMs contain the probabilities that a given base will occur at a given position in the binding site. We calculate the Pearson correlation coefficient between the PWM values (represented as the height of the letters at each position) for the two methods. To compute the correlation coefficient, we use

(5) r=α=14i=1N(xi,α-x¯)(yi,α-y¯)α=14i=1N(xi,α-x¯)2α=14i=1N(yi,α-y¯)2,

where xi,α and yi,α are the entries of the PWM of nucleotide α at position i obtained from Sort-Seq and Reg-Seq respectively, N is the total length of the binding site, and x¯ and y¯ are the means of xi,α and yi,α, respectively. As an example, consider the following sequence logo from a Sort-Seq experiment,

PositionACGT
10.010.010.030.95
20.040.830.060.07
30.700.170.110.02
40.860.010.100.03

and the same region resulting from a Reg-Seq experiment:

PositionACGT
10.010.040.030.92
20.050.850.070.03
30.740.140.090.03
40.810.020.130.04

We see that for both sequence logos, the preferred nucleotides from position 1 through 4 are T-C-A-A, as indicated by the values in bold. Plugging in these values into Equation 5, we get a Pearson correlation coefficient of r=0.997, indicating substantial agreement between the Sort-Seq and Reg-Seq methods in this example. As a way to visualize similarity, for each position in the sequence logo we can plot the numerical value as resulting from the Sort-Seq experiment (xi,α) vs. the corresponding value obtained from the Reg-Seq experiment (yi,α). Perfect correspondence between the methods would result in all the data lying on the x=y line (Appendix 2—figure 1).

Appendix 2—figure 1
Mock data comparing Sort-Seq and Reg-Seq sequence logo values.

These data have a Pearson correlation coefficient of r=0.997. This high correlation is also indicated by the data deviating little from the x=y line.

Figure 3 shows examples of this comparison for four distinct genes of interest. Figure 3(A) shows the results of the two methods for the lacZYA promoter with special reference to the CRP-binding site. Both the information footprint and the the position weight matrices (displayed with sequence logos) identify the same binding site.

Figure 3(B) provides a similar analysis for the dgoRKADT promoter where the position weight matrices for the CRP-binding site from Reg-Seq and Sort-Seq have a correlation coefficient of r = 0.90. Figure 3(C) provides a quantitative dissection of the relBE promoter which is repressed by RelBE. Here, we use both information footprints and expression shifts as a way to quantify the significance of mutations to different binding sites across the promoter. Finally, Figure 3(D) shows a comparison of the two methods for the marRAB promoter. The two approaches both identify a MarR-binding site.

False positive and false negative rates

We introduce a systematized way of identifying the locations of binding sites, as shown in Figure 11, that allows the false negative and false positive rate of binding site identification to be clearly assessed. For a given information footprint, we average over 15 base pair 'windows'. We then determine which base pairs are part of a regulatory region by setting an information threshold of 2.5×104 bits, which is explained below. All base pair positions that pass the information threshold are then joined into 'regulatory regions', which we consider to be 'activator-like' (if a mutation decreases expression) or 'repressor-like' (if a mutation increases expression). This means that it is possible to identify overlapping repressor- and activator-binding sites. We join any base pair positions within 4 base pairs of each other into a single regulatory region. We then find the edges of each binding site region by trimming off any base pairs at the edge that are below the information threshold (even if the 15 base pair average is above the threshold). A limitation of this method of identification is that is cannot resolve transcription-factor-binding sites that are very close to each other. The primary reasons for this is that putative binding sites will overlap after the smoothing step. While the method could be tuned to avoid treating nearby regions as the same site, many transcription-factor-binding sites will have sections of base pairs within their site where base identity has little to no effect on gene expression. Helix-turn-helix type transcription factors like CRP (whose binding site can be observed in Figure 3) are common examples of this phenomenon.

To determine which information threshold to use as a cutoff for a putative binding site, as displayed in Figure 11, we selected a training set of genes which included two of our 'gold standard' genes with previously studied binding sites, DgoR (the upstream site from the dgoR promoter) and CRP (from the araAB promoter), two genes with only RNAP-binding sites, including hslU (under heat shock) and poxB, and several genes that we classified as inactive, wherein no RNAP-binding sites or other binding sites could be identified. These inactive genes included hicB, mtgA, eco, hslU (without heat shock), and yncD. The growth condition (heat shock) is specified for the hslU promoter as transcription occurs from a σ32 RNAP site, which will be inactive except during heat shock. We selected the threshold such that the RNAP sites and known binding sites were identified, while no binding sites were identified in the inactive regions.

We then determine a set of binding sites upon which to test this method and determine a false negative rate for the Reg-Seq experiment. In this set of binding sites, we include those sites which are 'high-evidence' according to EcoCyc. Such 'high evidence' binding sites have been validated experimentally with the binding of purified protein or through site mutation. Some 'high-evidence' sites are excluded because they are not included within our 160 base pair, mutagenized sequence, or because they are not active in any of the growth conditions that we tested. Justifications for those binding sites which were not included are now listed in a new appendix; Appendix 4 Section 'Explanation of included binding sites'. A full list of promoters and binding sites that were included in the set of genes used to validate our automated binding-site finding algorithms are also provided in Appendix 2—table 1.

Appendix 2—table 1
A suite of experimentally validated and high-evidence binding sites used to test our automated binding site finding algorithm.

Specifically, this list of genes was used to test the false negative rate of our Reg-Seq method by examining what fraction of high-evidence sites were also identified with Reg-Seq.

GeneTranscription factorTranscription factor type
rspACRPactivator
rspAYdfHrepressor
araABAraC (two sites)activator
znuCBZurrepressor
xylACRPactivator
xylAXylR (two sites)activator
xylFXylR (two sites)activator
dicCDicArepressor
relBERelBErepressor
ftsKLexArepressor
znuAZurrepressor
lacCRPactivator
marRFisactivator
marRMarAactivator
marRMarR (two sites)repressor
dgoRCRPactivator
dgoRDgoR (right site)repressor
ompRIHF (three sites)repressor
ompRCRPrepressor
dicADicArepressor
araCAraC (two sites)repressor
araCAraC (two sites)activator
araCCRPactivator
araCXylR (two sites)repressor

For each promoter contained in Appendix 2—table 1, we used the automated procedure outlined above and in Figure 11 to identify the activator- and repressor-binding sites. A visual display of the expected binding sites, the information footprints for the promoters in Appendix 2—table 1, and the discovered binding sites are all displayed in Appendix 2—figures 2 and 3. To assess the false negative rate, we compare the identified regulatory regions to the known binding sites from Appendix 2—table 2. At this stage, we did not consider the identities of the binding sites; we merely consider their presence or absence. Inferred binding sites are declared to 'match' the known binding site if the automated identification procedure classifies at least half of the base pairs reported in EcoCyc as belonging to a transcription-factor-binding site and correctly determines whether the binding site belongs to an activator or repressor.

We do not require exact matching of the edges of the binding sites for several reasons. One such reason is that, in some cases, the sequence of half of a binding site (for example, corresponding to one half of a helix-turn-helix binding motif) can contribute relatively little to gene expression, and so will not have high mutual information values in the corresponding information footprint for that binding site. While this may appear unintuitive for transcription factors where both sections of the binding site are bound by identical halves of a dimer, we see several examples of this in our Reg-Seq experiment results, including for CRP-binding sites of the rspA promoter studied during our analysis of false negative rates. We can see in Appendix 2—figures 2 and 3 that the downstream half of the binding site is not identified as important for gene expression. If we examine the wild type sequence of the rspA promoter, we also see that, for the upstream half of the sequence, the wild type matches the five most conserved bases of the consensus sequence (TGTGA) perfectly. The downstream half of the sequence, however, has three mismatches out of five bases. The downstream half of the binding site already binds to its target transcription factor poorly, so further mutations have little effect. While it is true that CRP binds to that sequence region, it is also true that CRP binds only extremely weakly to that section of the region. A similar effect can be seen in previous work from Belliveau et al., 2018 where a mutation in the downstream half of a CRP-binding site in the xylE promoter had more than a 10-fold greater effect on binding energy than mutation in the upstream half of the binding site. As such, we are lenient when evaluating the successes of our algorithm in this regard. Furthermore, the methods that have been used to determine the presence of 'high evidence' binding sites in the past, such as ChIP-Seq, do not typically have base pair resolution with which to precisely determine the edges of binding sites (Skene and Henikoff, 2015).

Lastly, a known weakness of our algorithmic approach is that binding sites that are extremely close or overlapping cannot be distinguished from each other initially. For example, the XylR sites in the xylF promoter are only separated by three bases according to RegulonDB. While the sites can be distinguished upon later investigation through gene knockouts, mass spectrometry, or motif comparison, our initial algorithm joins the sites into one large site. While this is a weakness of the algorithm, for our purposes it does not constitute a false negative, as the important regions for regulation are still discovered. All regions for all promoters that are classified as regulatory regions, their identities as activators, repressors, or RNAP binding sites, as well as their starting and ending base pairs, can be found in Supplementary file 3. Furthermore, we summarize the success and failures of the method at each binding site in Appendix 2—table 2 below.

Appendix 2—table 2
The results of the comparison between experimentally verified, high-evidence binding sites and Reg-Seq-binding sites.

A visual illustration of the comparison can be found in Appendix 2—figures 2 and 3.

GeneTranscription factorWas the region classified correctly?
rspACRPYes
rspAYdfHYes
araABAraC (two sites)Yes
znuCBZurYes
xylACRPYes
xylAXylR (two sites)Yes
xylFXylR (two sites)Yes
dicCDicAYes
relBERelBEYes
ftsKLexAYes
znuAZurYes
lacCRPYes
marRFisNo
marRMarAYes
marRMarR (two sites)Yes
dgoRCRPYes
dgoRDgoR (right site)No
ompRIHF (three sites)Yes
ompRCRPNo
dicADicANo
araCAraC (four sites)one site identified
araCCRPNo
araCXylR (two sites)No

We see in Appendix 2—table 2 that 11 of the 15 promoter regions included in Appendix 2—table 1 have all transcription-factor-binding sites classified as putative transcription factors, two have the majority of sites correctly classified, and two do not have any of their binding sites correctly classified as regulatory elements. We can see the information footprints used in the correct identifications in Appendix 2—figures 2 and 3. We could alternatively consider that 23 out of 33 binding sites are correctly classified. However, we argue that the false negative rate should be considered on a per promoter basis, rather than on the basis of individual binding sites. The reason for this argument can be seen in the two 'worst' cases of correct binding site identification; namely, for the araC and dicA promoters.

The araC promoter is repressed by multiple repressor-binding sites in all growth conditions tested. araC only has high expression transiently after addition of arabinose (Johnson and Schleif, 1995), and while growth in arabinose is utilized in this experiment, RNA was not collected during the window of high expression. The case study shows that Reg-Seq does not perform well when many repressor sites regulate the promoter. Reg-Seq relies on mapping the effect on expression of mutating a particular site, and when many strong repressor sites are present, expression change will be minimal unless all repressor sites are weakened through mutation. Additionally, in this highly repressed case, the RNAP-binding site we observe in the mutagenized region is not the documented RNAP site in RegulonDB, indicating that we are seeing transcription primarily from an alternative TSS. Different RNAP sites are often regulated differently, and in this case, the presence of an alternative and dominant RNAP-binding site (in the repressed case), likely contributes to a failure to observe six of the seven binding sites in the araC promoter. Similarly, in the dicA promoter, we did not find an RNAP-binding site in the studied region, which would make it very unlikely for any transcription-factor-binding sites to be identifiable.

In order to determine false positive rates, we test against promoters for which we are certain there are not additional, unannotated binding sites. Most known binding sites were not determined using a method like Reg-Seq, which looks for regulatory elements across an entire promoter region at base pair resolution. Rather, many efforts to pinpoint transcription-factor-binding site locations use assays like ChIP-Seq, which prioritizes looking for all binding sites of a given transcription factor across the entire genome. For those promoters studied with Reg-Seq, there are five promoters for which we have reason to believe that there are no undiscovered binding sites. There is evidence that the zupT promoter is constitutive (Grass et al., 2005), and the marR, relBE, dgoR, and lacZYA promoters have all been examined for binding sites at base pair resolution previously (in the Sort-Seq experiment [Belliveau et al., 2018; Kinney et al., 2010]).

To evaluate false positive rates, we examine the putative activator and repressor binding sites as identified using our automated methodology (described previously), and compare any known binding sites to the known binding sites for the target promoters. We also classify any putative regulatory regions that are outside of known transcription-factor-binding sites as false positives. Similarly, any identified RNAP-binding sites which were outside of the known RNAP binding locations were classified as false positives. In the zupT promoter, only the correctly placed RNAP site was identified. There were similarly no false positives identified in the marR, relBE, dgoR, or lacZYA promoters.

We additionally compare the energy matrices from putative regulatory regions to known binding site motifs. The known motifs are obtained either from RegulonDB or are generated from data from our prior Sort-Seq experiments (see Belliveau et al., 2018). We utilize the TOMTOM motif comparison software from Gupta et al., 2007 to perform these comparisons. TOMTOM generates a p-value under the null hypothesis that the two compared motifs are drawn independently from the same underlying probability distribution. We test 95 motifs against each target motif that we are attempting to identify. The 95 resulting p-values (for each target) generated by TOMTOM are displayed in Appendix 2—figure 4. A full discussion of TOMTOM can be found in Appendix 3 Section 'TOMTOM motif comparison'. We only included those transcription factors that either have over 50 known binding sites in RegulonDB or have experimental measurements of binding site preference, such as in Sort-Seq (Belliveau et al., 2018). As such, we used TOMTOM on the XylR, CRP, MarA, MarR, and RelBE sites in Appendix 2—table 2. We utilized a p-value cutoff of 0.05, corrected for multiple hypothesis testing. 95 motifs were tested against each target, and using the Bonferroni correction leads to a p-value cutoff of 0.0595=5×10-4. In Appendix 2—figure 4 we show that the correct transcription factor falls below the p-value threshold in all cases. For the CRP-binding site in the lacZYA promoter, FNR also falls below the cutoff, but CRP has a calculated p-value that is ≈ 6 orders of magnitude lower, and so is clearly identified as the correct binding site. The results show that motif comparisons can be used reliably in those cases where we have high-quality energy matrices for comparison.

Appendix 2—figure 2
A visual comparison of the literature binding sites (left panel) and the extent of the binding sites discovered by our algorithmic approach (right panel).

RNAP-binding sites are also labeled in the right panel, but RNAP-binding sites are not included in the false positive analysis. Numeric values for the displayed data can be found in Appendix 2—figure 2—source data 1.

Appendix 2—figure 3
A continuation of the visual comparison of the literature binding sites (left panel) and the binding sites discovered by our algorithmic approach (right panel) begun in Appendix 2—figure 2.
Appendix 2—figure 4
A visual display of the results of the TOMTOM motif comparison between the discovered binding sites and known sequence motifs from RegulonDB and our prior Sort-Seq experiment (Belliveau et al., 2018).

Each dot in a given panel represents a comparison between the target position weight matrix (given in the figure title) and a position weight matrix for a given transcription factor. The p-value is calculated using the null hypothesis, that both motifs are drawn independently from the same underlying probability distribution. The red dotted line is displayed at a p-value of 5×10-4. The line represents a p-value threshold of 0.05 that has been corrected for multiple hypothesis testing using the Bonferroni correction (95 motifs were compared against the target for a p-value threshold of 0.0595=5×10-4). Numeric values for the displayed data can be found in Appendix 2—figure 4-source data 1.

Appendix 3

Extended details of analysis methods

Information footprints

We favor the use of information footprints as a tool for hypothesis generation to identify regions which may contain transcription-factor-binding sites. In general, a mutation within a transcription factor site is likely to weaken that site. We look for groups of positions where mutation away from wild type has a large effect on gene expression. Our datasets consist of nucleotide sequences, the number of times we sequenced a given, specific mutated promoter in the plasmid library, and the number of times we sequenced its corresponding mRNA. A simplified illustrative dataset on a hypothetical 4 nucleotide sequence is shown in Appendix 3—table 1.

Appendix 3—table 1
Example dataset of four nucleotide sequences, and the corresponding counts from the plasmid library and mRNAs.
SequenceLibrary sequencing countsmRNA counts
ACTA523
ATTA53
CCTG1111
TAGA123
GTGC20
CACA87
AGGC73

One strategy to measure the impact of a given mutation on expression is to take all sequences which have base b at position i and determine the number of mRNAs produced per read in the sequencing library. By comparing the values for different bases we can determine how large of an effect a mutation has on gene expression. For example in Appendix 3—table 1, for the second position (i=2) those sequences that contain the wild type base A (b=A) have 20 sequencing counts out of 50 (23+3+11+3+0+7+3=50) from the DNA library and 10 sequencing counts from the 50 (5+5+11+12+2+8+7=50) mRNA reads. For all other sequences (b=C,G, or T), there are 30 sequencing counts from the DNA library and 40 sequencing counts from mRNA. A measure of the effect of mutation on expression would be to compare the ratios mRNA counts/total mRNA countslibrary counts/total library counts between mutated and wild-type sequences. For the data in Appendix 3—table 1, sequences with a wild type base at position 2 will have a ratio of purple (10/50)/(20/50)=0.5 and sequences with a mutated base at position 2 will have a ratio of (40/50)/(30/50)1.3.

While directly comparing ratios is one way to measure the effect on gene expression, we use mutual information to quantify the effect of mutation, as Kinney et al., 2010 demonstrated could be done successfully. In Appendix 3—table 1, the frequency of the nucleotide A in the DNA library at position 2 is 0.4, as 20 out of 50 sequencing counts have an A at position 2. Similarly, the other frequencies at position 2 are 0.32 for C, 0.14 for G and 0.14 for T. In the observed mRNA sequence counts, we find C at 34 of of 50 total mRNA counts, which gives a frequency of 0.68, indicating that Cytosine is enriched in the mRNA transcripts compared to the DNA library. The frequencies for the other bases are 0.2 for A, 0.06 for T and 0.06 for G. Large enrichment of a base compared to others in mRNA sequencing counts occurs when base identity is important for gene expression.

We are classifying bases as either wild type (m=0) or mutated (m=1). A discussion of this assumption can be found at the end of this section. We compute mutual information at position i as

(6) Ii=m=01μ=01p(m,μ)log2(p(m,μ)pmut(m)pexpr(μ)),

where pexpr(μ) is the ratio of the number of DNA (μ=0) or mRNA (μ=1) sequencing counts to the total number of counts,

(7) pexpr(μ)={(mRNA counts)/(total counts)ifμ=1(Library Sequencing counts)/(total counts),ifμ=0.

From the example data in Appendix 3—table 1 we can calculate pexpr(μ). To do so, we sum up DNA counts and mRNA counts from all sequences and divide by the total number of counts (50+50=100) to obtain

(8) pexpr(μ)={0.5,ifμ=10.5,ifμ=0.

In addition, pmut(m) is the fraction of the total counts that either have a mutation (m=1) at the given position or the fraction that have a wild-type base (m=0) at the position. pmut has to be computed for each position individually. For position 1, the wild type base is an A, and we see that there are a total of 100 sequencing counts, of which 46 counts (DNA and mRNA combined) contain an A at position 1. Therefore, p(m) can be calculated for position 1 as

(9) pmut(m)={0.46,ifm=00.54,ifm=1.

Lastly, the joint distribution p(m,μ) is the probability that a given sequencing read in the dataset will have expression level µ and mutation status m. p(m,μ) is calculated by dividing the number of sequencing reads at the chosen position with mutation status m and expression status µ by the total number of sequencing reads. In the case of the example dataset in Appendix 3—table 1 and for m=0 and μ=0, we sum the sequencing reads that are wild type at position 1 and also are in the DNA library. As there are 17 sequences that fit the criteria out of 100 total sequences, p(m=0,μ=0)=0.17. The other values of p(m,μ) can be calculated to be

(10) p(m,μ)={0.17,ifm=0(wild type base)andμ=0(DNA)0.21,ifm=1(mutated base)andμ=1(RNA)0.33,ifm=1andμ=00.29,ifm=0andμ=1.

The marginal distributions pexpr and pmut can be obtained by summing over one of the two variables, that is,

(11) pexpr(μ)=mp(m,μ),
(12) pmut(m)=μp(m,μ).

Plugging the values calculated above into Equation (6) yields a mutual information value of 0.06 bits at position 1. The unit is bits because the mutual information is computed with a logarithm of base 2. Other bases can be chosen, however, that results in different units for the mutual information.

Mutual information is a measurement that quantifies how much the measurement of one of two variables reduces uncertainty of the other variable. For example, very low mutual information means that by knowing one variable one gains no information about the other variable, while on the other hand high mutual information means that by knowing one variable our knowledge about the others increases. At a position where base identity matters little for expression level, there would be little difference in the frequency distributions for the library and mRNA transcripts. The entropy of the distribution would decrease only by a small amount when considering the two types of sequencing reads separately.

We seek to determine the effect on gene expression of mutating a given base. However, if mutation rates at each position are not fully independent such that p(mi,mi)p(mi)p(mi), then the information value calculated in Equation (6) will also encode the effect of mutation at correlated positions. For instance, if position i is part of an activator-binding site, mutating it will have a large effect on gene expression. If position i is not within the activator site, then mutating position i will have minimal true effect on gene expression. However, if mutations at the two bases are correlated, mutating position i will make it more likely for i, and therefore the activator-binding site, to be mutated. Knowledge that i is mutated is predictive of overall expression, and so position i will have high mutual information according to Equation (6), even though that position has no regulatory function. In our experiment we designed sequences to be synthesized such that each position had a probability of mutation that was independent of mutation at any other position. However, due to errors in the oligonucleotide synthesis process, additional mutations in the ordered sequences were introduced. Sequencing our DNA libraries reveals that mutation at a given base pair can make mutation at another base pair more likely by up to 10%, where neighboring base pairs are the most likely to have correlations between mutations. This is enough to cloud the signature of most transcription factors in an information footprint calculated using Equation (6).

We need to determine values for pi(m|μ) when mutations are independent, and to do this we need to fit these quantities from our data. We assert that

(13) CmRNAe-βEeff

is a reasonable approximation to make, which we will justify by considering a number of possible regulatory scenarios. CmRNA is the average number of mRNAs produced and Eeff is an effective binding energy for the sequence that can be determined by summing contributions from each position in the sequence independently. There are many possible underlying regulatory architectures, and those that have been discovered with Reg-Seq are summarized in Table 1. While we will show that under reasonable assumptions this approach is useful for any of these regulatory architectures, let us first consider the simple case where there is only an RNAP site in the region under study. We can write down an expression for average gene expression per cell as

(14) CmRNApboundPNNSe-βEP1+PNNSe-βEP,

where pbound is the probability that the RNAP is bound to DNA and is known to be proportional to gene expression in E. coli (Ackers et al., 1982; Buchler et al., 2003; Garcia and Phillips, 2011), EP is the energy of RNAP binding, NNS is the number of nonspecific DNA binding sites, and P is the number of RNAP. If RNAP binds weakly then PNNSe-βEP1, and we can simplify Equation (14) to

(15) CmRNAe-βEP.

Using this relation, we can compute the ratio of average mRNA counts in wild type CmRNAWTi to average mRNA counts in a mutant CmRNAMuti as

(16) CmRNAWTiCmRNAMuti=eβEPWTieβEPMuti,
(17) CmRNAWTiCmRNAMuti=eβ(EPWTiEPMuti),

where EPWTi is the binding energy of RNAP to the wild-type binding site and EPMuti is the binding energy of RNAP to the mutant-binding site. Using the assumption that each position contributes independently to the binding energy, we can simplify the differences in energies to EPWTi-EPMuti=ΔEPi. We can now calculate the probability of finding a specific base in the expressed sequences. If the probability of finding a wild type base at position i in the DNA library is pi(m=0|μ=0), then

(18) pi(m=0|μ=1)=pi(m=0|μ=0)CmRNAWTiCmRNAMutipi(m=1|μ=0)+pi(m=0|μ=0)CmRNAWTiCmRNAMuti,
(19) pi(m=0|μ=1)=pi(m=0|μ=0)eβΔEPipi(m=1|μ=0)+pi(m=0|μ=0)eβΔEPi.

Under certain conditions, we can also infer a value for pi(m|μ=1) using a linear model when there are any number of activator or repressor-binding sites. We will demonstrate this in the case of a single activator and a single repressor, although a similar analysis can be done when there are greater numbers of transcription factors. Define p=PNNSe-βEP and a=ANNSe-βEA where A is the number of activators, and EA is the binding energy of the activator. Also define r=RNNSe-βER where R is the number of repressors and ER is the binding energy of the repressor. Then we can compute the average number of produced mRNA as

(20) CmRNApboundp+pae-βϵAP1+a+p+r+pae-βϵAP,

where ϵAP is the interaction energy of activators and the RNAP. One assumption we make is that activators and RNAP bind weakly to their binding sites (a1 and p1) but interact strongly (pae-βϵAPp). Under this assumption, RNAP and associated activators are much more likely to bind DNA as a unit than separately. The binding energy measurements by Forcier et al., 2018 support this assumption in the case of CRP in the lac operon. The DNA-protein binding energy of CRP is measured to be -3.18 kBT and the interaction energy between CRP and RNAP is measured to be ϵAP=-6.56kBT. The copy number of CRP is A4000 (Schmidt et al., 2016), the copy number of RNAP is P2000 in slowly growing cells (Bremer and Dennis, 1996), and the RNAP binding energy for the wild type lac promoter is EP5.2 kBT (Brewster et al., 2012). As NNS4.6×106, the value of a can be calculated to be a40004.6×106e3.180.02. Similarly p can be calculated to be p20004.6×106e5.20.08. Lastly, we can calculate pae-βϵAPpae6.561. We can see that these numbers satisfy the assumptions a1, p1, and pae-ϵAPp. We can simplify Equation (20) to

(21) CmRNApboundpae-βϵAP1+r+pae-βϵAP.

The last assumption we make is that repressors bind very strongly (r1 and rpae-ϵAP). To justify this assumption, we once again look to the lac operon. Wild-type LacI copy number is R10 and the wild-type binding energy for the O1 operator is ER-16kBT (Garcia and Phillips, 2011). We can use these values to compute r104.6×106e1620. We can simplify Equation (21) to

(22) CmRNA paeβϵAPr
(23) CmRNA eβ(EPEA+ER)

As we typically assume that RNAP binding energy, activator binding energy, and repressor binding can all be represented as sums of contributions from their constituent bases, the combination of the energies can be written as a total effective energy Eeff which is a sum of independent contributions from all positions within the binding sites.

We fit the parameters for each base using Markov Chain Monte Carlo Method (MCMC). Two MCMC runs are conducted using randomly generated initial conditions. We require both chains to reach sufficiently similar distributions to prove the convergence of the chains or we repeat the runs. During the analysis, we artificially treat mutation rates at all positions as equal, as we do not wish for mutation rate to play a role in mutual information calculations. The information values are smoothed by averaging with neighboring values.

By only considering wild type or mutated energy contributions to the total effective binding energy rather than having separate values for energy contributions from all four base pairs, our methods will not be accurate in the case of calculating mutual information at locations with degenerate base pairs. However, the information footprints are intended to be hypothesis generation tools that can identify transcription-factor-binding sites. As such, the most important test for the assumption that we can approximate effective energy contributions from all 4 bases as contributions from only wild type or mutated bases is to assess whether the approximation has any effect on determining binding site locations. We re-ran the false positive and false negative assessments discussed in Appendix 2 Section 'False positive and false negative rates', but instead calculated the effective energy parameters for producing information footprints as a sum of contributions from all four bases. We find that the literature binding sites that were properly identified, as summarized in Appendix 2—table 2, are identically identified. Specifically, any site which was identified using the previous method is still identified and any site that failed to be identified is still not observed. Similarly, when we only fit effective energy parameters for mutated or wild type bases there are no false positives identified in the promoters for marR, relBE, dgoR, zupT, or lacZYA. There are also no false positives when repeating the procedure while considering all 4 bases in the effective energy fits, implying that the simplification to only considering mutated or wild type bases does not have an effect on our ability to identify binding sites.

Processing of mass spectrometry experiments

Mass spectrometry results were processed using MaxQuant (Cox and Mann, 2008; Cox et al., 2009). Spectra were searched against the UniProt E. coli K-12 database as well as a contaminant database (256 sequences). LysC was specified as the digestion enzyme. Proteins were considered if they were known to be transcription factors, or were predicted to bind DNA (using gene ontology term GO:0003677, for DNA-binding in BioCyc).

Uncertainty due to number of independent sequences

1400 promoter variants were ordered from TWIST Bioscience for each promoter studied. Due to errors in oligonucleotide synthesis, additional mutations are randomly introduced into the ordered oligos. We have found that, as a result of these random, additional errors, the final number of variants received was an average of 2200 per promoter.

To demonstrate that our results are not strongly dependent on the number of sequences in each promoter library, and also to assess how a reduction in the number of sequences per promoter library could facilitate larger scale experiments in the future, we generated examples of smaller data sets by computationally sub-sampling the Reg-Seq experimental data from seven mutated promoter libraries (maoP, hslU, rpsA, leuABCD, aphA, araC, and tig). These promoters are representative of a large cross-section of the variety of regulation we see in our study, as they include promoters with constitutive expression (hslU), simple repression(leuABCD, tig), simple activation (aphA), as well as more complicated regulatory architectures (maoP, rspA, araC). Each sub-sampling was done three times, and we then use the Pearson correlation coefficient (Appendix 2 Section 'Comparison between Reg-Seq by RNA-Seq and fluorescent sorting') as a comparison metric between the inference based on the full data set and the computationally sub-sampled data sets.

Based on our analysis, the results of which are displayed in Appendix 3—figure 1, we find that there is only a small effect on the resulting sequence logo until the library has been reduced to approximately 500 promoter variants. We could, therefore, reasonably lower the resolution of the experiment to approximately 1000 or fewer unique sequences before large deviations in the inference are experienced. Decreasing the number of unique sequences can give modest boosts to the number of genes that can be studied, but will not be able to give order of magnitude increases in the number of genes that can be explored.

Appendix 3—figure 1
Pearson correlation as a function of the number of unique DNA sequences (as explained in Appendix 2 Section 'Comparison between Reg-Seq by RNA-Seq and 2uorescent sorting').

For seven different genes, we studied how the number of mutated DNA sequences affects the reproducibility of our MCMC inference models. As the number of unique sequences increases, so too does the Pearson correlation value, approaching 1.0. Numeric values for the displayed data can be found in Appendix 3—figure 1—source data 1.

Effect on calculated energy matrices when transcription factor copy number ≈ plasmid copy number

Throughout this study, we utilize plasmids to express GFP from mutated promoters, and then use the ratio of mRNA/DNA, based on sequencing results, to handle the effect of variability in plasmid copy number between cells. It is necessary, however, to consider the situation wherein the plasmid copy number is of a similar magnitude to the transcription factor copy number, and whether this can impact the calculated energy matrices and binding energies. The genetic expression levels are determined not only by the binding energy, but also by the transcription factor availability, and so it is necessary to consider whether, for those cases where transcription factor copy number ≈ plasmid copy number, there is a corresponding under-estimation of binding energies. Prior work from our laboratory was precisely aimed at rigorously predicting and measuring this effect (Weinert et al., 2014). In that study, we demonstrated how to control this effect, wherein transcription factor copy number ≈ plasmid copy number, in a parameter-free manner. However, to mitigate this effect in future studies, we plan to use genome-integrated libraries, rather than plasmid-based expression.

The plasmid used in our experiments is derived from pUA66, which contains a pSC101 origin of replication (Zaslaver et al., 2006). The copy number of plasmids with a pSC101 origin is, in log phase, approximately 3 or 4 (Lutz and Bujard, 1997). We have not independently assessed the copy number of the plasmid used in this study.

The absolute copy number of thousands of proteins in E. coli have been determined using whole-proteome LC-MS. Specifically, a 2016 study that provides the absolute quantification for roughly 55 percent of predicted proteins in the E. coli K12 proteome (see Supplementary Table S6) (Schmidt et al., 2016). For those transcription factors that were quantified in that study, and also show up in our Reg-Seq experiments, we provide their absolute quantification in E. coli K12 for both glucose and LB growth media in Appendix 3—table 2.

Appendix 3—table 2
Global, absolute quantification for most transcription factors identified in this study, as determined for E. coli K12 grown in both glucose (5 g/L concentration in M9 minimal media) and LB medias.

The values in this table are reprinted from Schmidt et al., 2016 Supplemental Table S6.

Transcription factor nameGlucoseLB
FNR6091101
YieP158261
YciT82104
NsrR872136
LexA5601027
DeoR2634
CRP20483450
YdfH96154
ArcA33675464
Zur70130
GlpR75145
PhoP29673132
HNS2254147133
StpA68635241
DicA2025
YgbI26
XylR18

For most transcription factors, the copy number as determined by LC-MS is much greater than the expected, low copy number of the plasmid used in this study, thus mitigating the concern that the limited availability of a transcription factor could impact gene expression.

There are a few transcription factors that have copy number on the order of the plasmid copy number, however, including XylR, DicA, and YgbI. Prior work from our group (Weinert et al., 2014) has explored how gene expression behaves in the regime where transcription factor copy number is ≈ plasmid copy number. Here, we will discuss the case of simple repression to demonstrate how the relationship between transcription factor and plasmid copy number could impact our results. The standard thermodynamic model for gene expression under simple repression with a weak promoter, as described by Bintu et al., 2005, is

(24) Cpbound=PNNSe-βΔεP1+RNNSe-βΔεR,

where C is a measurement for gene expression level, NNS is the number of nonspecific DNA-binding sites, P is the number of RNAP, and R is the number of repressors. ΔεR and ΔεP represent the difference in the repressor-binding energy and RNAP-binding energy between the specific binding site and the averaged nonspecific genomic background respectively. Weinert et al., 2014 demonstrated experimentally that, in the presence of multiple target binding sites, such as from a multi-copy plasmid, the gene expression level can be described by a very similar functional form to Equation (24), namely,

(25) Cpbound=λPe-βΔεP1+λRe-βΔεR,

where λP and λR are the fugacity of RNAP and the repressor and describe the relative availability of RNAP or repressor as a function of plasmid copy number, transcription factor copy number, and binding site strength. The presence of additional plasmid copies does weaken the effect of repressor binding when the repressor copy number is ≈ plasmid copy number. Thus, our information footprint calculations will be affected and the information signature of binding sites such as YgbI, DicA, or XylR will be decreased.

For transcription-factor-binding site interactions that are sufficiently weak, together with a low transcription factor copy number, the effect of having multiple plasmids expressed in a cell could cause us to have a false negative, and thus miss the presence of a binding site. However, the Reg-Seq method does not claim to capture every regulatory feature for a given promoter, as the activity of some transcription factors is induced only in certain growth conditions, we use a finite, 160 bp mutation window that may miss 'regulation at a distance', and the presence of extremely weak and nonspecific binding sites may cause Reg-Seq to 'miss' some transcription factors (indeed, for the bdcR promoter, the GlaR-binding site is outside of the mutagenized region and so is not observed). The effect of additional plasmids within the cellular confines will always decrease the fugacity in Equation (25), as an increase in the number of sites competing for a limited pool of transcription factors will decrease the relative availability of those transcription factors. As a result, the effect on gene expression of a given transcription factor will always lessen in the presence of additional plasmids. This means that, while multi-copy plasmids can introduce false negatives into Reg-Seq, it will not introduce false positives. Additionally, we see empirically that, even for the lowest copy transcription factor for which we have a measurement, XylR (≈ 1 copy per cell), we can identify its transcription-factor-binding site. In Appendix 2—figures 2 and 3, 2 (previously known) XylR sites are identified for the xylA promoter, and 2 (previously known) XylR sites are identified in the xylF promoter.

Finally, the energy matrices, which are a quantitative output of the Reg-Seq experiment, will be unaffected by the presence of multi-copy plasmids. As discussed in Appendix 3 Section 'Energy matrix inference', energy matrix inference relies on calculating the mutual information between the energy predictions of the model and the experimental data. Mutual information is invariant under transformations to the input variables that do not affect their rank order. While the presence of multiple plasmid copies will affect the fugacity in Equation (25), and so will also affect any quantitative prediction of gene expression, a weaker repressor-binding site will still be predicted to have higher gene expression than a stronger repressor-binding site, regardless of the relative availability of the transcription factor. The rank-order is always preserved and so the presence of a multi-copy plasmid will not change the mutual information between model predictions and experimental data. As a result, the final inference of energy matrices will remain the same.

Energy matrix inference

Energy matrices in this experiment are of the form shown in Appendix 3—table 3,

Appendix 3—table 3
Example energy matrix.

This matrix is in arbitrary units, and the process to obtain absolute units (in kBT) is described in Appendix 3 Section 'Inference of scaling factors for energy matrices'.

PosACGT
0−0.01−0.01−0.010.03
10.0020.05−0.060.008
2−0.0002−0.040.0080.03
3−0.020.02−0.010.01

where each entry gives the energy contribution from a base pair at a given location. As an example from Appendix 3—table 3, an A at position 1 would give a total energy contribution of −0.01 (A.U.). All energy matrices used in our analysis are linear energy matrices, where the total energy is the sum of contributions from each base pair. As a result, total binding energy is

(26) binding energy=i=1Lj=ATθijδij,

where δij is the Kronecker delta, which takes on a value of 1 if the base at position i is equal to j and is 0 otherwise, L is the length of the binding site, and θij is the energy contribution of nucleotide j and position i in arbitrary units. To infer the parameters θij in Equation (26) from the experimental data, we perform Bayesian inference using a MCMC method, which requires us to calculate the likelihood of the model given the experimental data. The likelihood function is difficult to determine, but Kinney et al., 2010 found that, given a large amount of data, the likelihood function is related to the mutual information between energy predictions and data by the equation

(27) L(D|θ)2NI(μ,E),

where N is the total number of independent sequences, D is the data consisting of sequences and measured sequencing counts, I is the mutual information between gene expression label µ and energy predictions E. µ is a discrete variable that characterizes the gene expression level as described in Equation (3) in the main text. We can calculate mutual information using the formula for mutual information between a continuous and a discrete variable, namely,

(28) I(μ,E)=-dEμ=01p(μ,E)log2(p(μ,E)p(E)p(μ)).

While Equation (28) is used for continuous energy predictions, there are only N total sequences, and so only N discrete energy predictions. For a simple example of calculating the joint probability distribution p(μ,E), consider the hypothetical dataset of only four nucleotides in Appendix 3—table 1. We first predict the binding energy of each of the example sequences, shown in Appendix 3—table 4.

Appendix 3—table 4
Example dataset with energy predictions.

Energy predictions are made by applying the example energy matrix in Appendix 3—table 3 to the example dataset in Appendix 3—table 1 according to Equation (26).

μ=0μ=1Energy (kBT
5230.05
530.008
11110.09
123−0.03
200.03
87−0.07
73−0.04

We use kernel density estimation with kernel width of 4% to estimate the true joint distribution p(μ,Esmooth) from the data contained in the joint distribution in the matrix in Appendix 3—table 4. This process estimates an underlying continuous distribution from a discrete set of energy predictions. The details of kernel density estimation can be found in Hastie et al., 2009. We can do the final calculation of the mutual information by splitting the smoothed joint distribution into 500 energy 'bins' z and calculating

(29) I(μ,E)=z=1500μ=01p(μ,Ez)log2(p(μ,Ez)p(Ez)p(μ)).

With the ability to calculate the likelihood of an energy matrix model, MCMC can be used to infer the posterior distribution for our model. First a random matrix model is generated. The model is perturbed and the new model is accepted or rejected based on the Metropolis-Hastings algorithm (Patil et al., 2010). After an initial burn in period of 60,000 steps, iterations are saved every 60 steps. A total of 600,000 iterations are performed. This procedure is performed twice for each model, and if inferred models do not have a Pearson correlation coefficient of 0.99 or higher they are discarded and computed again. A complete overview of the computational pipeline can be found at the GitHub wiki page.

Inference of scaling factors for energy matrices

For the majority of energy matrices reported in our work, the results are given in arbitrary units. This is a direct result of using the method of Kinney et al., 2010 to infer our matrices. The method appeals to information theory to write an 'error-model-averaged' likelihood function for a given model. The likelihood function is given in Equation (27). A property of mutual information is that it is invariant to changes in the input variables as long as those transformations do not affect the rank-order of those variables. As a result, we can scale the energy predictions by any constant without changing the likelihood of the model, which means that in the case of simple linear models for transcription factor binding we cannot assign absolute units to energy matrix values. When we widen our view to considering promoter regions rather than single binding sites we can overcome this drawback. Using thermodynamic modeling as outlined in Bintu et al., 2005, we can predict the gene expression from any given transcriptional architecture. In the case a thermodynamic model of simple repression the expression is given by

(30) Cpbound=PNNSe-βΔεP1+PNNSe-βΔεP+RNNSe-βΔεR,

where C is a measurement for expression, P is the number of RNAP, R is the number of repressors and NNS is the number of nonspecific binding sites. ΔεR and ΔεP represent the difference in the repressor binding energy and RNAP-binding energy between the specific binding site and the averaged nonspecific genomic background respectively. As we use linear energy matrix models as described in Appendix 3 Section 'Energy matrix inference', ΔεR and ΔεP will be given by Equation (26). In these cases the overall rank order of gene expression predictions will change if you scale the energy matrix, and so the absolute units can be determined (Kinney and Atwal, 2014). Equation (30) is a more complicated and non-linear functional form for predicting C than a simple linear binding model, and has a correspondingly more difficult to sample posterior. To address complications in the inference, we first only use the non-linear fits to fix overall scale and wild-type energy for energy matrices rather than fit all parameters in this way. In other words, we use the standard fitting procedure to find the θij in the Equation (26) using the standard MCMC procedure.

The binding energy matrices can be written Aθij+B where A is a constant that scales the matrix from arbitrary units to absolute units (kBT) and B is an additive constant that relates to the wild-type energy. We fit the constants A and B for the transcription factor binding energy using the thermodynamic model in Equation (30).

While we can in principle fit thermodynamic models to any given architecture, these models are non-linear and, due to numerical difficulties, unreliable for sufficiently complex models. We only use this method on examples of simple repression or activation without more than one prominent RNAP model, whose transcription-factor-binding site does not overlap significantly with RNAP −10 or −35 sites. The scaling factors we discovered are given in Appendix 3—table 5.

Appendix 3—table 5
A table showing scaling factors to convert arbitrary units to absolute units in kBT.

Growth conditions indicate the energy matrix and dataset used in the fit. In some growth condition additional regulatory features will be present, meaning specify condition is important.

GeneGrowthScaling factor A
tff-rpsB-tsfHeat shock8.1 kBT
tigHeat shock26.3 kBT
yjjJHeat shock11.3 kBT
bdcRHeat shock9.9 kBT
fdhEAnaerobic growth6.34 kBT
ykgEArabinose12.1 kBT
dicCArabinose15.1 kBT
rspAArabinose5.5 kBT

We perform the inference using parallel tempering MCMC, where multiple chains are run in parallel with different 'temperatures'. High temperature chains widely explore parameter space, escaping any local optima, while low temperature chains optimize locally. The current parameter values of the chains are exchanged periodically. The fitting procedure is done using the emcee ensemble sampler (Goodman and Weare, 2010) with 10 temperatures ranging from 1 to 10,000 on a log scale.

Examination of promoters for which no RNAP site was found

We failed to find an RNAP site for 18 promoter sequences. In order to understand these sequences in more detail we examine the sequences within 50 bases of the TSS for the 18 genes in question for sequences which resemble the known consensus RNAP-binding site. For this comparison, we use the σ70 consensus binding sequence −35TTGACA - spacer sequence - TGNTATAAT−7 (where the superscripts −35 and −7 indicate the position relative to the TSS). The consensus sequence we use for comparison contains the extended −10 element, consisting of a TG at bases −15 and −14 as we have found those to be important for gene expression in our study. The spacer length is between 15 and 13 bases (the typically reported spacer length is between 18 and 16 but this does not include the extended minus 10 element). The consensus sequence for the heat shock σ factor was used for the promoter yajL.

Previously, to analyze RNAP sites, we have examined energy matrices produced by Reg-Seq. Now we add an examination of wild type sequences. For each promoter, we found the best match to the consensus site, namely the sequence with the fewest mutations compared to the consensus sequence. We use the number of mutations as a measure of how well the site resemble consensus. We find that 16 out of 18 promoters have at least five mutations in the sequence that most closely resembles RNAP, one promoter has four mutations, and the last has three mutations. To put these numbers into context, Brewster et al., 2012 measured the RNAP binding energies of several RNAP-binding site mutants. Mutations away from the strongest sequence tested (lacUV5, which is two mutations away from consensus) yields a change in binding energy of 12kBT. If the promoters are constitutive, then (in the weak promoter approximation) expression level will be proportional to e-βΔϵP where ΔϵP is the RNAP binding energy relative to the nonspecific background. Therefore, as an approximation, a sequence with three mutations would be predicted to be 3−10 fold weaker than a 'strong' RNAP site, and as such could be said to show a resemblance to the consensus RNAP site. However, 16 out of 18 of these promoter regions have, at best, extremely weak RNAP sites. It is important to note however, that even extremely weak RNAP sites often transcribe, especially when aided by activators. We do not intend to claim that RNAP does not bind to these promoter regions, merely that we do not detect it in our experiment. In fact, while the RNAP sites are weak, there is experimental evidence in EcoCyc of some level of transcription for 9 out of 18 promoters.

TOMTOM motif comparison

In some cases, we used an alternative approach to mass spectrometry to discover the transcription factor identity regulating a given promoter based on sequence analysis using a motif comparison tool. TOMTOM (Gupta et al., 2007) is a tool that uses a statistical method to infer if a putative motif resembles any previously discovered motif in a database. It accounts for all possible offsets between the motifs. Moreover, it uses a suite of metrics to compare between motifs such as Kullback-Leibler divergence, Pearson correlation, Euclidean distance, among others. All TOMTOM analyses in Reg-Seq utilize Euclidean distance. The method calculates a p-value under the null hypothesis that the two compared motifs are independently drawn from the same underlying distribution probability distribution.

We performed comparisons of the motifs generated from our energy matrices to those generated from all known transcription-factor-binding sites in RegulonDB. Appendix 3—figure 2 shows a result of TOMTOM, where we compared the motif derived from the -35 region of the ybjX promoter and found a good match with the motif of PhoP from RegulonDB.

The information derived from this approach was then used to guide some of the transcription factor knockout experiments, in order to validate its interaction with a target promoter characterized by the loss of the information footprint. Furthermore, we also used TOMTOM to search for similarities between our own database of motifs, in order to generate regulatory hypotheses in tandem. This was particularly useful when looking at the group of GlpR-binding sites found in this experiment.

Appendix 3—figure 2
Motif comparison using TOMTOM for the two PhoP-binding sites in the ybjX promoter.

Searching our energy motifs against the RegulonDB database using TOMTOM allowed us to guide our transcription factor knockout experiments. Here, we show the sequence logos of the PhoP transcription factor from RegulonDB (top) and the ones generated from the ybjX promoter energy matrix. E-value = 0.01 using Euclidean distance as a similarity matrix.

Appendix 4

Additional results

Binding sites regulating divergent operons

In addition to discovering new binding sites, we have discovered additional functions of known binding sites. In particular, in the case of bdcR, the repressor for the bdcA gene, which is transcribed from the same promoter in the opposite direction of transcription (Partridge et al., 2009), is also shown to repress bdcR in Appendix 4—figure 1(A). Similarly in Appendix 4—figure 1(B) IvlY is shown to repress ilvC in the absence of inducer. Divergently (transcription in opposite directions from the same promoter) transcribed operons that share regulatory regions are plentiful in E. coli, and although there are already many known examples of transcription-factor-binding sites regulating several different operons, there are almost certainly many examples of this type of transcription that have yet to be discovered.

Appendix 4—figure 1
Two cases in which we see transcription-factor-binding sites that we have found to regulate both of the two divergently transcribed genes.

(A) An information footprint and regulatory cartoon for the divergently transcribed bdcA and bdcR promoters. A single NsrR site regulates both promoters. (B) An information footprint and regulatory cartoon for the ilvC and ilvY promoters. Both promoters are repressed by IlvY when grown without acetolactate. Only the IlvY site is labeled on the information footprint.

In the case of ilvC, IlvY is known to activate ilvC in the presence of inducer. However, we now see that it also represses the promoter in the absence of that inducer. The production of ilvC is known to increase by approximately a factor of 100 in the presence of inducer (Rhee et al., 1998). The magnitude of the change is attributed to the cooperative binding of two IlvY-binding sites, but the lowered expression of the promoter due to IlvY repression in the absence of inducer is also a factor.

Comparison of results to RegulonDB

One area in which our work can be compared to current repositories of regulatory information such as RegulonDB is in comparing the prevalence of different regulatory architectures in the database to Reg-Seq. Appendix 4—figure 2 shows the prevalence of each type of architecture (not including architectures more complex than 2 activators and 2 repressors) and shows how simpler architectures are more common in both cases.

Another point of comparison between RegulonDB and Reg-Seq can be found in comparing sequence motifs from Reg-Seq to those generated from binding sites in RegulonDB. This can often produce useful results, such as in Appendix 3 Section 'TOMTOM motif comparison'. For other cases, the data used to generate the RegulonDB motifs can be lacking. We believe the GlpR motif in RegulonDB highlights some of the issues with using the reported motifs in RegulonDB to predict binding preference. First, there are only four promoters regulated by GlpR, with a total of 17 binding sites for GlpR in RegulonDB. Nine of these binding sites differ by nine mutations or more from the consensus site (out of 22 total base pairs). Nine mutations is more than even the weak O3 operator for LacI. We do believe that a relatively low number of weakly conserved binding sites likely do not reveal quality sequence logos for a binding site, especially as compared to Reg-Seq which constructs sequence logos from over a thousand promoter variants. Generation of such sequence motifs is a point on which we believe Reg-Seq can improve the current status of regulatory knowledge.

Explanation of included binding sites

This section is intended to clarify cases in which the regulatory cartoon or the displayed 'expected' binding sites differs from what can be found in RegulonDB or EcoCyc. The primary reason for these discrepancies is that our experiment only targets a 160 base pair mutation window. Some known binding sites will be outside of this window. Additionally, while some genes are known to be regulated by a specific transcription factor, the exact location of that transcription factor’s binding site is unknown and so we cannot be certain during the design of the 160 base pair mutagenized window whether or not the transcription-factor-binding site will be present in our experiment. The locations of the TSS selected in this experiment can be found in Supplementary file 1. Additionally, some transcription factors are known to only be active under certain growth conditions. Information footprints are depictions of the regulatory information for a specific growth condition; accordingly, not all transcription-factor-binding sites can be identified using a single growth condition. Throughout the main text and SI, however, we depict regulatory cartoons with their full milieu of transcription factors (based on experiments performed in multiple growth conditions).

When devising this study, we sought to test the reliability of the Reg-Seq method by testing experimentally-validated transcription-factor-binding sites, as reported by EcoCyc or RegulonDB, to assess our ability to recapitulate prior experiments. EcoCyc labels some transcription-factor-binding sites as 'low-evidence' in their database, most of which were identified via sequence motif matching. We have repeatedly observed that transcription-factor-binding sites identified from sequence matching are unreliable in relation to the empirical data collected in our experiment, and so we choose not to include them in the set of 'gold standard' genes which were used for this purpose of assessing Reg-Seq’s accuracy.

All of our 'gold standards' are genes for which there is high quality experimental evidence of their transcriptional regulation and the location of related transcription-factor-binding sites and, again, they were used to evaluate the false negative rates of our experiment. In those cases where the binding sites are either ’low-evidence’ according to EcoCyc, the location of a binding site is not known, a gene is only actively transcribed in certain or unknown growth conditions, or the binding site location is outside of the 160 bp mutagenized region, we do not include them in the list of sites we use to test our method even though they appear as binding sites in RegulonDB or EcoCyc. Regulatory features that are not transcription factors, including regulatory RNAs, are also not labeled in our reported results.

Accordingly, in some cases, the regulatory cartoons or architectures we present in this study may appear to be incomplete relative to previous reports of promoter architectures. For each gene below, we explain these discrepancies. This section is intended to explain why annotations on information footprints or regulatory cartoons do not match what is seen in RegulonDB or EcoCyc.

sdiA

sdiA is known to be regulated by both Nac as well as CsrA (which has two binding sites), the CsrA sites are downstream of the mutated region and the location of the Nac-binding site is unknown. Thus, none of these binding sites are reported in our regulatory architectures for this gene.

yqhC

yqhC is known to be regulated by GlaR, but the location of this binding site is unknown. As a result, we were unable to identify this binding site in our analysis, and the architecture for yqhC is listed in this study as (0,0).

bdcR

bdcR is known to be regulated by GlaR, but this binding site is outside of the targeted mutation window of 160 bp. A known binding site for NsrR is included within the 160 bp region, but it was not previously known to regulate bdcR; the binding site for NsrR is included as a new discovery as shown in Section 'Binding sites regulating divergent operons'.

aegA

aegA has a predicted CRP-binding site, but the location of this binding site is unknown and it is also listed as low-evidence in EcoCyc. As a result, the site is not included within this study’s analysis.

hicB

The CRP site associated with hicB is cited as low-evidence in EcoCyc and the HicB-binding site is outside of the 160 base pair mutated region. As a result, neither site is included in this study.

rplKAJL-rpoBC

The known RplA-binding site for this operon is outside of the targeted, 160 base pair mutation window. As a result, the RplA site is not included in this study.

tff-rpsB-tsf

RpsB is not contained in the mutated region. Additionally, the nearby predicted Mar-Sox-Rob-binding site is listed as low-evidence in EcoCyc and is also not directly predicted to regulate tff-rpsB-tsf, even though it may be present within the region. As a result, neither site is included in this study.

yodB

GlaR is known to regulate yodB. However, the location of this binding site is unknown. As a result, we do not include the GlaR-binding site in our reported regulatory architecture for this gene.

maoP

HdfR is known to regulate maoP. However, the location of the binding site is unknown. Additionally, the HdfR site is listed as low-evidence in EcoCyc. During the Reg-Seq experiment, however, we confirmed the presence of the low-evidence HdfR site with a gene knockout and located the binding site position. Thus, we include it in all regulatory cartoons and report the HdfR site in our discoveries.

poxB

MarA and Sox have low-evidence binding sites in the mutagenized region. There is also a low-evidence site for Cra with an unknown binding location. As a result, neither site is included in the reported regulatory architectures in this study.

mscM

While there is a known CpxR-binding site for mscM, the binding site exists outside of the mutagenized region. As a result, it is not included in the reported regulatory architectures in this study.

tar

There is a low-evidence FNR site for tar. Its location is unknown. For both of these reasons, we do not include the binding site in our reported regulatory architectures for this gene.

dpiBA

While there are 10 total binding sites for dpiBA, including an FNR site. However, the only ones that are known to regulate the particular TSS we chose (at position 652172 in E. coli) are 2 DcuR sites and a (low-evidence) NarL site. DcuR is induced by growth conditions like succinate or fumarate, neither of which were tested in this study. As a result, none of the sites are included in this study.

araAB

There are a total of five AraC-binding sites and one CRP-binding site that regulate araAB. However, the three furthest upstream AraC-binding sites are outside of the 160 bp mutagenized region, and so only two AraC sites and one CRP site is included in the reported regulatory architecture in this study.

xylF

There are two XylR sites, as well as three low-evidence Fis sites that regulate xylF in the mutagenized region. There is also a low-evidence CRP site outside the mutagenized region. Only the two XylR sites are included in the reported regulatory architectures, as the remaining sites are low-evidence or outside the mutagenized region.

xylA

There are two XylR sites, two AraC, and a CRP site that regulates xylA. In our analysis, we utilize a growth condition containing xylose and arabinose. Under growth with xylose, XylR will bind DNA and activate expression. Under growth with arabinose, AraC will not bind DNA. We would only expect to see two XylR sites and a CRP site under growth in xylose and arabinose, so we only include these sites in our study.

dicB

DicA has a low-evidence repressor-binding site for dicB. Additionally the binding location is unknown, and so we do not include the binding site in the reported regulatory architecture.

xapAB

XapR has two low-evidence binding sites. The binding site furthest upstream is outside of the 160 bp mutagenized region. As the remaining site is low-evidence, it is not included in our reported regulatory architectures.

ilvC

There are two IlvY-binding sites for ilvC. IlvY is known to be induced by acetolactate and activated in its presence . We do not utilize this growth condition in this experiment, however, nor do we include the two IlvY-binding sites in our 'gold standard' experimental analysis. We find that IlvY acts as a repressor when grown in other growth conditions. As repressor activity at these sites was not previously reported, we include this in our list of new discoveries.

asnA

There are four low-evidence AsnC-binding sites in the mutated region. As they are low-evidence, however, we do not include these binding sites in the reported regulatory architectures for this gene.

idnK

While there are three GntR sites, a CRP site, and one IdnR site, they are all low-evidence. As a result, we do not include any of these sites in our reported regulatory architectures.

dinJ

While dinJ is regulated by DinJ-YafQ and LexA, they are both outside of the mutagenized window. As a result, neither are included in our reported regulatory architectures.

yjiY

yjiY is regulated by both BtsR and CRP. However, CRP is outside of the mutagenized window and so CRP is not included in our reported regulatory architectures.

cra

cra is regulated by a low-evidence binding site of PhoB. The location of the binding site is not known, however. As a result, the site is not included in the reported regulatory architecture.

uvrD

uvrD is regulated by a low-evidence binding site for LexA. This binding site is not included in the reported regulatory architectures for this study.

znuCB

There are binding sites for Zur and OxyR in the mutagenized region for znuCB. OxyR is known to act as an activator under oxidative stress. As we do not utilize an oxidative stress growth condition in this study, we do not include this binding site in the reported regulatory architectures for this study.

znuA

There are binding sites for Zur and OxyR in the mutagenized region for znuA. The OxyR-binding site is outside of the mutagenized region. Only the Zur-binding site is included our reported regulatory architectures.

pitA

There is a low-evidence binding site for FNR in the mutagenized region. The location of this binding site, however, is unknown. Thus, this binding site is not included in our reported regulatory architectures.

ecnB

There is a low-evidence OmpR-binding site for ecnB. The binding site is not included in our reported regulatory architectures.

lacZYA

The mutagenized region extends from the TSS (the primary TSS p1) to 75 base pairs upstream of the TSS. The location of the mutagenized region excludes the LacI sites, while including a single CRP-binding site, a MarA-binding site, and two HNS-binding sites. The expression from marA is expected to be low, as we do not grow the cells in the presence salicylate or antibiotic stress and so we do not expect to observe the MarA site. In fact, the precursor of the Reg-Seq experiment, Sort-Seq, mutagenized and studied the same 75 base pair region, and only observed binding by CRP (Kinney et al., 2010). As such, we only include CRP in Table 2, the regulatory cartoons, or the analysis of false positives and false negatives.

leuABCD

There is a binding site for LeuO regulating leuABCD. The site is low-evidence and also has no known binding location. As a result, the site is not included in our reported regulatory architectures.

arcA

There is a binding site for FNR within the mutagenized region listed as 'low-evidence' in EcoCyc. We find substantial additional evidence for the presence of the FNR-binding site. As such, we include the site in Table 2 as an 'Identified Binding Site'.

relBE

The relBE promoter contains four RelBE-binding sites and two RelB-binding sites in EcoCyc and RegulonDB. While the all four RelBE sites are listed as high evidence, Belliveau et al., 2018 mutagenized the RelBE promoter and did not identify binding in the furthest downstream or furthest upstream binding sites. Also, the original identification of the RelBE-binding sites presented (Li et al., 2008), claims that the furthest upstream and downstream sites are only identified by similarity to consensus sequence. As a result only two of the RelBE and two of the RelB sites are included in this study.

marR

The marR promoter contains a CpxR, CRP, Cra, and AcrR in EcoCyc that are not included in the 'gold standard' analysis or Table 2. Belliveau et al., 2018 performed mutagenesis experiments on the marR promoter and did not identify these additional sites and so they have been excluded.

Appendix 4—figure 2
A comparison of the types of architectures found in RegulonDB (Santos-Zavaleta et al., 2019) to the architectures with newly discovered binding sites found in the Reg-Seq study.

For each type of architecture, labeled as (number of activators, number of repressors), the fraction that architecture comprises of the total number of operons is given both for the data found in RegulonDB and from the results of the Reg-Seq experiment. Numeric values for the displayed data can be found in Appendix 4—figure 2—source data 1.

Appendix 5

Resource Table

Appendix 5—key resources table
Reagent type (species)
or resource
DesignationSource or referenceIdentifiersAdditional information
Cell line (Escherichia coli)E. coli K12E. coli Stock Center
Cell line (Escherichia coli)E. coli ΔYiePE. coli Stock Center
Cell line (Escherichia coli)E. coli ΔGlpRE. coli Stock Center
Cell line (Escherichia coli)E. coli ΔArcAE. coli Stock Center
Cell line (Escherichia coli)E. coli ΔLrhAE. coli Stock Center
Cell line (Escherichia coli)E. coli ΔPhoPE. coli Stock Center
Cell line (Escherichia coli)E. coli ΔHdfRE. coli Stock Center
Strain, strain background
(Escherichia coli)
E. coli ΔGlpR
in K12 strain
This paperKnockout transferred to E. coli K12
Strain, strain background
(Escherichia coli)
E. coli ΔArcA
in K12 strain
This paperKnockout transferred to E. coli K12
Strain, strain background
(Escherichia coli)
E. coli ΔLrhA
in K12 strain
This paperKnockout transferred to E. coli K12
Strain, strain background
(Escherichia coli)
E. coli ΔPhoP
in K12 strain
This paperKnockout transferred to E. coli K12
Strain, strain background
(Escherichia coli)
E. coli ΔHdfR
in K12 strain
This paperKnockout transferred to E. coli K12
Chemical compound, drugQ5 PolymeraseQiagenCat. : M0491L
Chemical compound, drugqPCR master mixQuantaBioCat. : 101414–166
Chemical compound, drugLysyl EndopeptidaseWako ChemicalsCat. : 125–05061
Commercial assay or kitRNEasy Mini kitQiagenCat. : 74104
Chemical compound, drugRNAprotect
bacteria reagent
QiagenCat. : 76506
Software, algorithmmpathicKinney Lab Ireland and Kinney, 2016
Software, algorithmFastXHannon LabRRID:SCR_005534
Software, algorithmFLASHCBCBRRID:SCR_005531
OtherOligo PoolTwist Bioscience
Sequence-based reagentfwd oligo 101IDTTTCGTCTTCACCT CGAGCACGCTTATT CGTGCCGTGTTAT
Sequence-based reagentfwd oligo 102IDTTTCGTCTTCACCTC GAGCACTTTGCTT CAGTCAGATTCGC
Sequence-based reagentfwd oligo 103IDTTTCGTCTTCACCT CGAGCACGTCGAGT CCTATGTAACCGT
Sequence-based reagentfwd oligo 104IDTTTCGTCTTCACCT CGAGCACGTAAGAT GGAAGCCGGGATA
Sequence-based reagentfwd oligo 105IDTTTCGTCTTCACCT CGAGCACGGTGTCGC AACATGATCTAC
Sequence-based reagentfwd oligo 106IDTTTCGTCTTCACCT CGAGCACGTGCTAAG TCACACTGTTGG
Sequence-based reagentfwd oligo 107IDTTTCGTCTTCACCT CGAGCACTCTAAACA GTTAGGCCCAGG
Sequence-based reagentfwd oligo 108IDTTTCGTCTTCACCT CGAGCACGTCTTTAT ACTTGCCTGCCG
Sequence-based reagentfwd oligo 109IDTTTCGTCTTCACCT CGAGCACCACCGCGA TCAATACAACTT
Sequence-based reagentfwd oligo 110IDTTTCGTCTTCACCT CGAGCACTTCGGATA GACTCAGGAAGC
Sequence-based reagentfwd oligo 111IDTTTCGTCTTCACCT CGAGCACCCATTGAT AGATTCGCTCGC
Sequence-based reagentfwd oligo 112IDTTTCGTCTTCACCT CGAGCACTTTTCTAC TTTCCGGCTTGC
Sequence-based reagentfwd oligo 113IDTTTCGTCTTCACCT CGAGCACATGACTAT TGGGGTCGTACC
Sequence-based reagentfwd oligo 114IDTTTCGTCTTCACCT CGAGCACTCGACAAT AGTTGAGCCCTT
Sequence-based reagentfwd oligo 115IDTTTCGTCTTCACCT CGAGCACGAGCCATG TGAAATGTGTGT
Sequence-based reagentfwd oligo 116IDTTTCGTCTTCACCT CGAGCACCGTATACG TAAGGGTTCCGA
Sequence-based reagentfwd oligo 117IDTTTCGTCTTCACCT CGAGCACTTATGATG TCCGGATACCCG
Sequence-based reagentfwd oligo 118IDTTTCGTCTTCACCT CGAGCACTCTTAGAA ATCCACGGGTCC
Sequence-based reagentrev oligo 101IDTTGTAAAACGACGG CCAGTGACTAGCGC TGAGGAGAAGCCT AATAGGGCACAGC AATCAAAAGTA
Sequence-based reagentrev oligo 102IDTTGTAAAACGACG GCCAGTGAGGAGCGC TGAGGAGAAGCC TAATACCGGGATT CAGTGATTGAAC
Sequence-based reagentrev oligo 103IDTTGTAAAACGACG GCCAGTGAGTCCC GCTGAGGAGAAG CCTAATATGAAGAT ATGACGACCCCTG
Sequence-based reagentrev oligo 104IDTTGTAAAACGACGG CCAGTGACCGACGCT GAGGAGAAGCCTAA TATTCCACAGCTC TATGAGGTG
Sequence-based reagentrev oligo 105IDTTGTAAAACGACGG CCAGTGATTGGCGCT GAGGAGAAGCCTA ATAGCAAACATGA CTAGGAACCG
Sequence-based reagentrev oligo 106IDTTGTAAAACGACGG CCAGTGAGATACGC TGAGGAGAAGCC TAATACCGGGACG AGATTAGTACAA
Sequence-based reagentrev oligo 107IDTTGTAAAACGACGGC CAGTGAACTCCGCT GAGGAGAAGCCTA ATACACGCCAGTT GTGAACATAA
Sequence-based reagentrev oligo 108IDTTGTAAAACGACG GCCAGTGATACTCGC TGAGGAGAAGC CTAATACAAAGGC CAAATCAGTTCCA
Sequence-based reagentrev oligo 109IDTTGTAAAACGACGGC CAGTGACCAACGCT GAGGAGAAGCCT AATAGGTGCATGGG AGGAACTATA
Sequence-based reagentrev oligo 110IDTTGTAAAACGACG GCCAGTGAAGGCCGC TGAGGAGAAGCCT AATATGCATGGGT CTGTCTATTGT
Sequence-based reagentrev oligo 111IDTTGTAAAACGACGGC CAGTGAAATTCGC TGAGGAGAAGCCT AATACTCCTATGCT AGCTCGACTC
Sequence-based reagentrev oligo 112IDTTGTAAAACGACG GCCAGTGATTGT CGCTGAGGAGAAG CCTAATAATGGTA AGAAGCTCCCACAA
Sequence-based reagentrev oligo 113IDTTGTAAAACGACGGC CAGTGATTTACGCT GAGGAGAAGCCTA ATACTATGGTCA TTCCCGTACGA
Sequence-based reagentrev oligo 114IDTTGTAAAACGACGGC CAGTGAACCGCGCT GAGGAGAAGCCTA ATATAATCGGCT ACGTTGTGTCT
Sequence-based reagentrev oligo 115IDTTGTAAAACGACGGC CAGTGATGGCCGC TGAGGAGAAGC CTAATATGACTCGA TCCTTTAGTCCG
Sequence-based reagentrev oligo 116IDTTGTAAAACGACGG CCAGTGAGGCCCGC TGAGGAGAAGC CTAATAACGCTTT GTGTTATCCGATG
Sequence-based reagentrev oligo 117IDTTGTAAAACGACGG CCAGTGAGGTGCG CTGAGGAGAAG CCTAATAACCACG GTGGAGTATACATC
Sequence-based reagentrev oligo 118IDTTGTAAAACGACG GCCAGTGACAATCG CTGAGGAGAAGC CTAATAGGCACCA GGTACATATCTCA
Sequence-based reagentmRNA revIDTGCAGGGGATAA TATTGCCCA
Sequence-based reagentfwd sequencing 94IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTGACC TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 95IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTCAGT TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 96IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTTCTA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 97IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTAGAG TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 98IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTGCAT TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 99IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTCTTA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 100IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTTAGC TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 101IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTCAAG TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 102IDT AATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTGTAC TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 103IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTTGAA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 104IDT AATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTTCGT TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 105IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTATGC TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 106IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTGTCA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 107IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTCTCA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentfwd sequencing 108IDTAATGATACGGCGACCAC CGAGATCT ACACTCTT TCCCTACACGACGC TCTTCCGATCTAGTA TATTAGGCTT CTCCTCAGCG
Sequence-based reagentrev sequencingIDTAAGCAGAAGACGGCAT ACGAGATCGGT CTCG GCATTCCTGCTGAACC GCTCTTCCGATCTCAAA GCAGGGGATAA TATTGCCCA
OtherStreptavin coated dynabeadsThermo FisherCat. : 65601
DatabaseRegulonDBRRID:SCR_003499
DatabaseEcoCycRRID:SCR_002433

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
    Modulation of chemical composition and other parameters of the cell by growth rate
    1. H Bremer
    2. P Dennis
    (1996)
    In: F. C Neidhardt, editors. Escherichia coli and Salmonella Typhimurium: Cellular and Molecular Biology. Merck KGaA. pp. 1–3000.
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
    Ensemble samplers with affine invariance
    1. J Goodman
    2. J Weare
    (2010)
    Communications in Applied Mathematics and Computational Science 5:65–80.
    https://doi.org/10.2140/camcos.2010.5.65
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
    On the regulation of gene activity
    1. F Jacob
    2. J Monod
    (1961)
    Cold Spring Harbor Symposia on Quantitative Biology. pp. 193–211.
    https://doi.org/10.1101/SQB.1961.026.01.024
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
    Purification and characterization of the repressor for the sn-glycerol 3-phosphate regulon of Escherichia coli K12
    1. TJ Larson
    2. SZ Ye
    3. DL Weissenborn
    4. HJ Hoffmann
    5. H Schweizer
    (1987)
    The Journal of Biological Chemistry 262:15869–15874.
  50. 50
    Interaction at a distance between multiple operators controls the adjacent, divergently transcribed glpTQ-glpACB operons of Escherichia coli K-12
    1. TJ Larson
    2. JS Cantwell
    3. AT van Loo-Bhattacharya
    (1992)
    The Journal of Biological Chemistry 267:6114–6121.
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
    Structure and regulation of the glpFK operon encoding glycerol diffusion facilitator and glycerol kinase of Escherichia coli K-12
    1. DL Weissenborn
    2. N Wittekindt
    3. TJ Larson
    (1992)
    The Journal of Biological Chemistry 267:6122–6131.
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96

Decision letter

  1. Armita Nourmohammad
    Reviewing Editor; University of Washington, United States
  2. Naama Barkai
    Senior Editor; Weizmann Institute of Science, Israel
  3. Tamar Friedlander
    Reviewer; IST Austria, Austria

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The manuscript presents a new approach, Reg-Seq, to study regulatory logic in hundreds of promoters at a time. To do so, the authors use massively parallel reporter assays and mass spectroscopy and build an information theoretical model to characterize binding energies in regulatory promoters at the base-pair resolution level. This approach can significantly advance our understanding of gene regulation in microbes and opens new avenues towards genome-wide quantification of regulatory logic and discovery of new transcription factors.

Decision letter after peer review:

Thank you for submitting your article "Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Tamar Friedlander (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Comments:

This manuscript introduces a new method "Reg-Seq" for identifying regulatory sites in promoters using RNA sequencing for cell sorting. The method generates impressive quantitative data for many genes in E. coli, which corroborates many previous results and starts to give us hints about the overall trends of regulatory architectures in E. coli. Although the experimental set up is very compelling and potentially promising, there are still some major concerns regarding presentation of the data and the underlying methods and analyses, which we would like to see addressed.

The main concerns about the manuscript can be summarized as: 1) Experimental method should be better discussed in the main text, its limitations should be explicitly pointed out and its advantages should be compared more clearly against previous techniques. 2) Statistical methods (including inference based on mutual information profiles) should be described more clearly and with more quantitative details throughout the text and the significance of the conclusions should be more explicitly expressed. 3) Analysis of E. coli data, comparison with previously identified regulatory regions and binding sites and the significance of these findings should be better discussed.

The reviewers strongly believe that the manuscript should be reworked to focus less on cherry-picked examples and more on presenting the method. The manuscript needs a significant rearrangement to better explain the procedures applied and fill in various missing points as detailed below.

Essential revisions:

1) The approach based on the mutual information profile to identify the binding sites within a promoter and to classify them as activator or repressors should be better explained:

1.1) Since this is such a key step, the authors need to provide more details about how the manual choice of activator/ repressors works. Overall, it is unclear how one goes from mutual information profiles to deciding whether known sites are or are not identified, how to decide which regions are putative sites, whether they fit known motifs, whether they are RNAP sites, and how to construct an energy matrix for each putative site.

1.2) It is mentioned (subsection “Analysis of sequencing results”, first paragraph) that, to confirm these manual choices, authors computationally identify regions of activators / repressors by assessing the fold change in gene expression due to mutations. It is unclear how the authors are assessing the significance of the changes, i.e., what the underlying null model is. If this computational approach is statistically sound, why the need for the manual approach? These manual choices are worrisome and raise concerns about the feasibility to scale-up this method, as intended by the authors.

2) It is difficult to assess the false positive/negative rate of identifying binding sites: This is partly due to the lack of clarity about the authors' approach in deciding whether a known site was identified or not, partly because the annotation of known sites is unclear and incomplete, and partly because the statistics are not transparently reported in the manuscript. It is therefore important that the authors present an objective method for assessing what fraction of known sites is not recovered, for what fraction of predicted novel sites a motif can be made, and for what fraction of those the mass-spec successfully recovers a binding TF.

As a suggestion: The authors should start from a unambiguously defined set of sites (e.g. from RegulonDB and EcoCyc) and then have an unambiguous procedure for:

i) Calling segments where a site exists (for example based on the summed mutual information within the segment).

ii) Calling the motif by comparing the profile with PSWMs or energy matrices for known motifs.

After that, they should clearly report what fractions of sites are recovered and for what fraction of sites the correct motif is predicted.

3) Energy matrices:

3.1) How are the energy matrices constructed? Details are missing from the manuscript.

3.2) How are the p-values estimated to assign significance to energy parameters? It appears that the authors have used MCMC to sample from a likelihood function but the details are very vague. To define an energy, it is assumed that the average expression can be written as the negative exponential of some effective energy to parametrize the average effects of mutations on expression- this should be better justified in the text. However, it seems that instead of 4 different energy parameters at each position, the model only assumes a “mutant” and a “wild type” energy. This may not be a good approximation given that we know nucleotides within TF binding sites have highly varying degeneracies. Moreover, the likelihood function for the observed DNA and RNA reads should be expressed as a function of the energy parameters at all positions. However, this is not the approach taken by the authors and instead, equations (8) and (9) only look at single positions. So, it is not clear what kind of likelihood function the MCMC is sampling from, or how p-values and confidence intervals are determined. Also, if the authors are indeed using MCMC with some likelihood function to sample the space of possible parameter values, the resulting variation in the energy parameters should correspond to posterior probability intervals and not confidence intervals.

3.3) Could the authors present the accuracy of the inferred energy matrices as a function of the number of genomic variants used? A particular example is currently given in Appendix 3. Can the authors use a simulation for that purpose? This is required to understand the validity of the results and the ability to scale-up such an experiment (if for example a lower resolution is sufficient, can larger regulatory regions be analyzed in future experiments?)

4) The step for assigning regulatory logic to promoters is unclear. Do authors always assume OR-gate-like interactions between the TFs, which is what Equation 14 suggests? If so, what is the justification for that assumption? Is it just the most basic statistical mechanical assumption, or is there strong empirical support?

5) The description of the method and analysis is hard to follow and unclear about many assumptions and limitations. To give a few examples:

5.1) Appendix 3 outlines how to calculate the mutual information footprint from the read count data but a more explicit discussion would be helpful: For example, how should the quantities in Equation 5 (p(m, 𝜇), p(m), p(𝜇)) be calculate from the preceding example data (subsection “Information footprints”)?

5.2) The methods rely on both DNA and RNA sequencing of the same sample and using ratios of RNA/DNA reads from the same reporter to estimate expression. The description of the sequencing protocol is insufficient. Importantly, the DNA sequencing is not even mentioned in the methods and it is not explained how precisely this expression quantification is done.

5.3) The methods mention correcting for correlated mutations, MCMC to assign p-values, and energy matrix reconstruction, but it is completely unclear how either of these are done.

6) The discussion of the methods in the main text is mostly qualitative and crucial details are scattered throughout the manuscript, in the appendices and in Materials and methods.

6.1) For example, it requires digging into the appendices to understand that the authors study a region of 160 bp mostly upstream TSS, and so, is this method inadequate if the regulatory region is larger or if its location is unknown? A clear and concise explanation of the crucial details (e.g. size of the region, number of variants per promoter, etc) should be given in the main text.

6.2) To further clarify the methods, it would be helpful to include a picture of the reporter construct, which based on the descriptions includes:

-promoter including 45 bp downstream of annotated TSS.

-then 64bp with primers for plasmid construction

-then 11bp with stop codons in 3 frames.

-then barcodes?

-then a RBS

-then GFP mRNA

7) Authors used TSS information to decide which promoters to pursue. How limiting is this for scaling up the procedure? It would be helpful to discuss briefly what fraction of promoters in E. coli have good TSS information and how valid is to assume that at the genome-wide level, each operon is dominated by a single TSS?

8) The authors rely on expression from plasmids and use mRNA/DNA ratio to handle the effect of variability in plasmid copy number between cells. However, if the plasmid copy number is of a similar order of magnitude as the transcription factor copy number, then the expression level measured (to calculate the energy matrices) is determined not only by the binding energy, but also by the TF availability leading to under-estimation of the binding energies. The authors should comment on this in the manuscript, and if they have the data available, show the measurements for plasmid and TF copy numbers to address this point. At this point we do not see a necessity for additional experiments.

9) The limitations of the method should be more concisely explained. Currently, limitations are scattered throughout the text.

Revisions required for Figures and Tables:

Figure 1: This figure is hard to read. It is difficult to distinguish the individual tick marks around the genome, because there are too many, they are too densely packed, and the colors are too mixed. Also, the caption describes the color of some ticks as "red," but in the printed figure they look more brown (They appeared closer to red on the screen.)

Figure 2: You might want to clarify that the blue region likely corresponds to the σ factor binding site.

Figure 3: The number of (0,0) promoters should be shown as well. Moreover, it seems that the counts in Figure 3 add up to about 50 and the text mentions that there are 32 promoters with no sites found, i.e. type (0,0). Adding this in the sum still seems much less than 113 (i.e., the total number of promoters as indicated in the manuscript). Related to this, it seems that Table 2 has clearly less than 113 promoters in it. Where are the remaining ones?

Figure 4: The authors state about the results in Figure 4: "In each of the cases shown in the figure, prior to the work presented here, these promoters had no regulatory information in relevant databases such as EcoCyc (Keseler et al., 2016) and RegulonDB (Santos- 318 Zavaleta et al., 2019)."

This is simply not true. If you check EcoCyc for these genes you find:

i) idnK regulated by CRP, IdnR and GtnR.

ii) leuABCD regulated by LeuO, slyA, LrhA, and RcsB-BgU.

iii) maoP regulated by hdfR.

iv) rspA regulated by CRP and YdfH.

Only for yjjJ, aphA, and ybjX nothing was previously reported. It should also be noted that except for the rspA promoter, the previously reported regulatory interactions were generally not identified by the method.

Finally, it seems that there is no mass-spec data for yjjJ. Is the identification of marA as the regulator based on motif matching then?

Table 2: This table is unclear. Does the "architecture" of a promoter (first column) correspond to what is inferred in this study, to what is known in the literature, or to a combination of both? For the newly found sites it should be listed whether the binding TF was identified and (if so) what it is. For the fourth column, “literature binding sites”, it is not clear whether this is the total number of known literature sites or only the number of known sites that were successfully identified here – both should be listed. It is not clear what the fifth column “identified binding sites” refers to. The evidence column is also unclear. Is this evidence for the newly found sites or the literature sites?

Just to illustrate the confusion: consider the well-known lac operon. In Table 2 the lac operon (called lac as opposed to lacZYA, which is used elsewhere in the manuscript) is reported to have only 1 known literature site, whereas it is well established that it has a site for CRP and multiple sites for LacI. In the final column it is claimed that there is mass-spec evidence for LacI binding. However, on the website, the lac operon does not occur at all and in Appendix 2—figure 1 only the CRP site is reported. What makes this all even more confusing is that none of the experimental conditions included lactose or another inducer of the operon. As such, the lac operon should be highly repressed and so we expect very little effect of mutating the CRP site, and strong effects of abolishing the LacI site.

In multiple cases in the table, less literature information is reported than it actually exists: e.g. tar is a reported target of FNR, uvrD is a reported target of lexA, cra is a reported target of PhoB, and so on. The authors should make clear what literature information is shown here.

The authors should clear up this information so as to make it unambiguous and consistent across tables, figures, and website.

https://doi.org/10.7554/eLife.55308.sa1

Author response

Essential revisions:

1) The approach based on the mutual information profile to identify the binding sites within a promoter and to classify them as activator or repressors should be better explained:

1.1) Since this is such a key step, the authors need to provide more details about how the manual choice of activator/ repressors works. Overall, it is unclear how one goes from mutual information profiles to deciding whether known sites are or are not identified, how to decide which regions are putative sites, whether they fit known motifs, whether they are RNAP sites, and how to construct an energy matrix for each putative site.

We thank the reviewers for this comment and completely agree with both its spirit and content. As such, we have taken this opportunity to expand and clarify our discussion of the data analysis pipeline and how it leads to regulatory hypotheses. First, it is important to note that the information footprints used in this paper are tools for hypothesis generation, and the Reg-Seq approach is further complemented by the use of expression shifts (a signed quantity which provides other helpful hints) about possible binding site locations. Deciding which regions are putative binding sites depends on the value of the mutual information of the base pairs within the putative binding sites. The process by which we calculate mutual information can be found in Appendix 3 subsection “Information footprints”. In this part of the response, we focus on manual curation, the process by which we identify potential transcription factor binding sites by hand, and later take up the subject of automated identification of transcription factor binding sites as suggested by the reviewers.

In regards to manual curation, we used our experience in looking at gold-standard genes – genes with well-studied, experimentally-determined regulatory architectures (as indicated in EcoCyc or RegulonDB) – from a previous study conducted by our group (Belliveau et al., 2018). Our analysis of these genes, which include lacZYA and other well-studied promoters, provided us with a sense of the “threshold value” of the mutual information that indicates the presence of a binding site. As a result, as noted at the beginning of this response, a key additional tool that we have introduced and that has been a critical part of the manual method of identifying putative binding sites is the use of expression shifts, a signed version of the information footprint that is generally associated with increases in gene expression (corresponding to the presence of a repressor binding site) and decreases in gene expression (corresponding to presence of an activator or RNAP binding site). Because expression shifts are a signed quantity, boundaries between binding sites are more easily identified. For future iterations of the Reg-Seq method, we have every intention to explore automated curation of expression shifts as a tool to identify candidate binding sites as we highlight below.

During manual curation of binding sites, we also disqualify any binding sites where there are only 3 or fewer base pairs with high values in the mutual information footprint. The logic behind this decision is that individual bases with very high mutual information can potentially indicate that a putative binding site is only active when a certain mutation occurs. In turn, the binding site would not be active in wild-type conditions. To explain why this is, consider that a typical binding site mutation, at any given base pair, will significantly weaken the binding site of interest. Therefore, each of those mutated base pairs is said to have a “large effect” on expression. For a very poor binding site that is not active in the wild-type case, most mutations will further weaken a site which already will have only a minor effect on gene expression. However, for a small number of base pairs, a mutation can occur that makes the DNA bind more tightly to the TF, making it relevant for gene expression. Therefore, in the case of an extremely weak binding site that is not relevant in the wild type condition, there can still be a small number of highly informative bases. We integrate these expanded details into a new subsection in the Materials and methods section of the main text that is entitled “Manual selection of binding sites”.

While there was no hard cut-off for mutual information values during manual curation, inspired by the reviewer comments, we have undertaken another round of automation, a schematic for which is presented in Figure 11 and is to be explained in more detail in another reviewer comment. To automatically identify putative binding sites from information footprints, we developed a strategy, which is as follows: we first take a 15 base-pair window where the average mutual information value is greater than 2.5×104 bits. The cut-off value was chosen based on examination of a test set of genes, a fuller discussion of which can be found in another reviewer comment. Any cluster of 15 bp that meets the information value threshold is then classified as a putative regulatory region, and can then be used to identify putative RNAP, activator, and repressor binding sites. We classify binding sites as those for activators or repressors based on whether mutation of bases at a given site causes expression to increase (indicating a repressor binding site) or decrease (denoting an activator/RNAP binding site).

Specifically, to identify RNAP binding sites, we compare the sequence preference (through energy matrices and sequence logos) to experimentally validated examples of RNAP sites. We have examples of energy matrices for the σ70 RNAP site from Belliveau et al., 2018. For energy matrices of other σ factor binding sites, such as σ32 and σ28, we use energy matrices generated from within the Reg-Seq experiment itself. For a σ32 binding site, for example, we used the example from the hslU gene. For a σ28 binding site, we used the energy matrix generated from the dnaE gene. When performing manual curation of binding sites, we visually compare the sequence logos of the example RNAP binding sites to the sequence logos of putative binding sites. We have also included a discussion of the method to create energy matrices and compare them to known motifs in Appendix 3 subsection “Energy matrix inference” and Appendix 3 subsection “TOMTOM motif comparison”, respectively. Further, a detailed discussion of energy matrix construction is provided in the GitHub Wiki that accompanies this work.

We really hope that the reviewers understand our goals. The information footprints should not be viewed as the key outcome of this work; we see them, rather, as a useful visualization tool that helps us generate hypotheses about the regulatory architecture for a given promoter. These hypotheses are allied with, and bolstered by, regulatory cartoons and energy matrices, which serve as a jumping off point for the kind of quantitative dissections we have carried out using chemical master equations and thermodynamic models in conjunction with careful promoter by promoter analyses during previous studies in our group (see Laxhuber et al., 2020; Amit et al., 2011; Phillips et al., 2019; Razo-Mejia et al., 2018; Rydenfelt et al., 2014a; Weinert et al., 2014). We share the reviewer’s concern in regards to scaling up the Reg-Seq methodology, but believe that it is not enough to produce single base pair resolution graphs such as the information footprints (and their analogs in the work of others), and are striving instead to produce a systematic approach to parlay those plots into actionable, regulatory hypotheses.

1.2) It is mentioned (subsection “Analysis of sequencing results”, first paragraph) that, to confirm these manual choices, authors computationally identify regions of activators / repressors by assessing the fold change in gene expression due to mutations. It is unclear how the authors are assessing the significance of the changes, i.e., what the underlying null model is. If this computational approach is statistically sound, why the need for the manual approach? These manual choices are worrisome and raise concerns about the feasibility to scale-up this method, as intended by the authors.

We appreciate the reviewer’s suggestions and the opportunity to improve our computational methods in this resubmitted manuscript. We hope that the reviewers will appreciate that we have been engaged in a staged scale-up of our approach to dissect the regulatory genome. During each generation of experiments, we have been upgrading all of our methodologies in an effort to develop tools that, eventually, will help us fully elucidate the so-called “regulome”. We detail the improvements that we have made below.

In light of the excellent suggestions of the reviewers, we have implemented automated motif finding, which replaces the appeal to p-values as explained in this study. The use of automated motif finding replaces our initial focus on p-values, replacing it with an appeal to our ability to recover the regulatory architectures of certain “gold-standard” promoters as a metric of the soundness of our results. The details of this new automated method are discussed in the Materials and methods subsection entitled “Automated putative binding site algorithm”.

Despite the advancements (and shortcomings) of the Reg-Seq method, we still hope that the reviewers are aware of the state of the art in the field of bacterial regulation and the important advances that we have made even without a genome-scale effort. During each round of new experiments, we want to materially alter the completeness of the EcoCyc and RegulonDB databases such that researchers interested in particular promoters have a way forward to design and perturb those promoters. Indeed, the group of Collado-Vides (see Santos-Zavaleta et al., 2019,) in Mexico, which manages the RegulonDB database, has already taken the results of the unpublished work in this manuscript that appeared on BioRxiv, contacted us, and updated their database accordingly. Our hope in the coming few years is that each and every promoter will have a full regulatory cartoon and corresponding energy matrix but, for the moment, we are excited about two successive 10-fold scaleups in our work, and hope that the reviewers acknowledge this as significant progress (Belliveau et al., 2018).

2) It is difficult to assess the false positive/negative rate of identifying binding sites: This is partly due to the lack of clarity about the authors' approach in deciding whether a known site was identified or not, partly because the annotation of known sites is unclear and incomplete, and partly because the statistics are not transparently reported in the manuscript. It is therefore important that the authors present an objective method for assessing what fraction of known sites is not recovered, for what fraction of predicted novel sites a motif can be made, and for what fraction of those the mass-spec successfully recovers a binding TF.

As a suggestion: The authors should start from a unambiguously defined set of sites (e.g. from RegulonDB and EcoCyc) and then have an unambiguous procedure for:

i) Calling segments where a site exists (for example based on the summed mutual information within the segment).

ii) Calling the motif by comparing the profile with PSWMs or energy matrices for known motifs.

After that, they should clearly report what fractions of sites are recovered and for what fraction of sites the correct motif is predicted.

We thank the reviewer for this comment, which is an extremely serious consideration that we hope to fully address in this resubmitted manuscript. We note that, with each successive generation of our experimental efforts, we are expanding our ability to scale-up the methods. The manual curation of binding sites has been used as a tool for hypothesis generation that we have followed by independent confirmation of these hypotheses. That said, we are very interested in developing robust, scalable approaches.

In this resubmitted manuscript, we introduce a systematized way of identifying the locations of binding sites, as shown in Figure 11, that allows the false negative and false positive rate of binding site identification to be clearly assessed. This new approach also replaces the previous arguments based on p-values and offers an alternative method to assess the accuracy of our method. Thanks to the reviewer’s suggestions, we find this new treatment more intuitive and transparent, and think it has substantially improved this manuscript.

For a given information footprint, we average over 15 base pair “windows”. We then determine which base pairs are part of a regulatory region by setting an information threshold of 2.5×104 bits (the rationale for this threshold is explained below). All base pair positions that pass the information threshold are then joined into “regulatory regions”, which we consider to be “activator-like” (if a mutation decreases expression) or “repressor-like” (if a mutation increases expression). This means that it is possible to identify overlapping repressor and activator binding sites. We join any base pair positions within 4 base pairs of each other into a single regulatory region. We then find the edges of each binding site region by trimming off any base pairs at the edge that are below the information threshold (even if the 15 base pair average is above the threshold). While we can often resolve overlapping or nearby repressors from activators, a limitation of this method of identification is that is cannot resolve two activators or two repressors that are very close to each other or overlapping.

To determine which information threshold to use as a cutoff for a putative binding site, as displayed in Figure 11, we selected a training set of genes which included two of our ”gold standard” genes with transcription factor binding sites which have experimental validation, DgoR (the upstream site from the dgoR promoter) and CRP (from the araAB promoter), two genes with only RNAP binding sites, including hslU (under heat shock) and poxB, and several genes that we classified as inactive, wherein no RNAP binding sites or other binding sites could be identified. These inactive genes included hicB, mtgA, eco, hslU (without heat shock), and yncD. The growth condition (heat shock) is specified for the hslU promoter as transcription occurs from a σ32 RNAP site, which will be inactive except during heat shock. We selected the threshold such that the RNAP sites and known binding sites were identified, while no binding sites were identified in the inactive regions.

We then determine a set of binding sites upon which to test this method and determine a false negative rate for the Reg-Seq experiment. In this set of binding sites, we include those sites which are “high-evidence” according to EcoCyc. Such “high evidence” binding sites have been validated experimentally with the binding of purified protein or through site mutation. Some “high-evidence” sites are excluded because they are not included within our 160-base pair, mutagenized sequence, or because they are not active in any of the growth conditions that we tested. Justifications for those binding sites which were not included are now listed in a new appendix; Appendix 4 subsection “Explanation of included binding sites”. A full list of promoters and binding sites that were included in the set of genes used to validate our automated binding-site finding algorithms are also provided in Appendix 2—table 1.

For each promoter contained in Appendix 2—table 1, we used the automated procedure outlined above and in Figure 11 to identify the activator and repressor binding sites. A visual display of the expected binding sites, the information footprints for the promoters in Appendix 2—table 1, and the discovered binding sites are all displayed in Appendix 2—figure 2 and 3. To assess the false negative rate, we summarize the identified regulatory regions to the known binding sites from Appendix 2—table 2. At this stage, we did not consider the identities of the binding sites; we merely consider their presence or absence. Inferred binding sites are declared to “match” the known binding site if the automated identification procedure classifies at least half of the base pairs reported in EcoCyc as belonging to a transcription factor binding site and correctly determines whether the binding site belongs to an activator or repressor.

We do not require exact matching of the edges of the binding sites for several reasons. One such reason is that, in some cases, the sequence of half of a binding site (for example, corresponding to one half of a helix-turn-helix binding motif) can contribute relatively little to gene expression, and so will not have high mutual information values in the corresponding information footprint for that binding site. While this may appear unintuitive for TFs where both sections of the binding site are bound by identical halves of a dimer, we see several examples of this in our Reg-Seq experiment results, including for CRP binding sites of the rspA promoter studied during our analysis of false negative rates. We can see in Appendix 2—figures 2 and 3 that the downstream half of the binding site is not identified as important for gene expression. If we examine the wild type sequence of the rspA promoter, we also see that, for the upstream half of the sequence, the wild type matches the five most conserved bases of the consensus sequence (TGTGA) perfectly. The downstream half of the sequence, however, has 4 mismatches out of 5 bases (CGGCT compared to the consensus sequence of TCACA). The downstream half of the binding site already binds to its target transcription factor poorly, so further mutations have little effect. While it is true that CRP binds to that sequence region, it is also true that CRP binds only extremely weakly to that section of the region. A similar effect can be seen in previous work from Belliveau et al., 2018, where a mutation in the downstream half of a CRP binding site in the xylE promoter had more than a 10-fold greater effect on binding energy than mutation in the upstream half of the binding site. As such, we are lenient when evaluating the successes of our algorithm in this regard. Furthermore, the methods that have been used to determine the presence of “high evidence” binding sites in the past, such as ChIP-seq, do not typically have base pair resolution with which to precisely determine the edges of binding sites (Skene and Henikoff, 2015).

Lastly, a known weakness of our algorithmic approach is that binding sites that are extremely close or overlapping cannot be distinguished from each other initially. For example, the XylR sites in the xylF promoter are only separated by 3 bases according to RegulonDB. While the sites can be distinguished upon later investigation through gene knockouts, mass spectrometry, or motif comparison, our initial algorithm joins the sites into one large site. While this is a weakness of the algorithm, for our purposes it does not constitute a false negative, as the important regions for regulation are still discovered. All regions for all promoters that are classified as regulatory regions, their identities as activators, repressors, or RNAP binding sites, as well as their starting and ending base pairs, can be found in Supplementary file 3. Furthermore, we summarize the success and failures of the method at each binding site in Appendix 2—table 2.

We see in Appendix 2—table 2 that 11 of the 15 promoter regions included in Appendix 2—table 1 have all TF binding sites classified as putative transcription factors, two have the majority of sites correctly classified, and two do not have any of their binding sites correctly classified as regulatory elements. We can see the information footprints used in the correct identifications in Appendix 2—figures 2 and 3. Considering binding sites, we find that 23 out of 33 binding sites are correctly classified. However, we argue that the false negative rate should be considered on a per promoter basis, rather than on the basis of individual binding sites. The reason for this argument can be seen in the two “worst” cases of correct binding site identification; namely, for the araC and dicA promoters.

The araC promoter is repressed by multiple repressor binding sites in all growth conditions tested. araC only has high expression transiently after addition of arabinose (Johnson and Schleif, 1995), and while growth in arabinose is utilized in this experiment, RNA was not collected during the window of high expression. The case study shows that Reg-Seq does not perform well when many repressor sites regulate the promoter. Reg-Seq relies on mapping the effect on expression of mutating a particular site, and when many strong repressor sites are present, expression change will be minimal unless all repressor sites are weakened through mutation. Additionally, in this highly repressed case, the RNAP binding site we observe in the mutagenized region is not the documented RNAP site in RegulonDB, indicating that we are seeing transcription primarily from an alternative TSS. Different RNAP sites are often regulated differently, and in this case, the presence of an alternative and dominant RNAP binding site (in the repressed case), likely contributes to a failure to observe six of the seven binding sites in the araC promoter. Similarly, in the dicA promoter, we did not find an RNAP binding site in the studied region, which would make it very unlikely for any transcription factor binding sites to be identifiable.

We next take the reviewer’s suggestion to compare the energy matrices from putative regulatory regions to known binding site motifs. The known motifs are obtained either from RegulonDB or are generated from data from prior Sort-Seq experiments (see Belliveau et al., 2018). We utilize the TOMTOM motif comparison software (Gupta et al., 2007) to perform these comparisons. TOMTOM generates a p-value under the null hypothesis that the two compared motifs are drawn independently from the same underlying probability distribution. We test 95 motifs against each target motif that we are attempting to identify. The 95 resulting p-values (for each target) generated by TOMTOM are displayed in Appendix 2—figure 4. A full discussion of TOMTOM can be found in Appendix 3 subsection “TOMTOM motif comparison”. We only included those transcription factors that either have over 50 known binding sites in RegulonDB or have experimental measurements of binding site preference, such as in Sort-Seq (Belliveau et al., 2018). As such, we used TOMTOM on the XylR, CRP, MarA, MarR, and RelBE sites in Appendix 2—table 2. We utilized a p-value cutoff of 0.05, corrected for multiple hypothesis testing. 95 motifs were tested against each target and using the Bonferroni correction leads to a p-value cutoff of 0.0595 = 5 × 10−4. In Appendix 2—figure 4 we show that the correct transcription factor falls below the p-value threshold in all cases. For the CRP binding site in the lacZYA promoter, FNR also falls below the cut-off, but CRP has a calculated p-value that is ≈ 6 orders of magnitude lower, and so is clearly identified as the correct binding site. The results show that motif comparisons can be used reliably in those cases where we have high-quality energy matrices for comparison.

In order to determine false positive rates, we test against promoters for which we are certain there are not additional, unannotated binding sites. Most known binding sites were not determined using a method like Reg-Seq, which looks for regulatory elements across an entire promoter region at base pair resolution. Rather, many efforts to pinpoint TF binding site locations use assays like ChIP-seq, which prioritizes looking for all binding sites of a given TF across the entire genome. For those promoters studied with Reg-Seq, there are five promoters for which we have reason to believe that there are no undiscovered binding sites. There is evidence that the zupT promoter is constitutive (Grass et al., 2005), and the marR, relBE, dgoR, and lacZYA promoters have all been examined for binding sites at base pair resolution previously in the Sort-Seq experiments (Belliveau et al., 2018;Kinney et al., 2010).

To evaluate false positive rates, we examine the putative activator and repressor binding sites as identified using our automated methodology (described previously), and compare any known binding sites to the known binding sites for the target promoters. We also classify any putative regulatory regions that are outside of known TF binding sites as false positives. Similarly, any identified RNAP binding sites which were outside of the known RNAP binding locations were classified as false positives. In the zupT promoter, only the correctly placed RNAP site was identified. There were similarly no false positives identified in the marR, relBE, dgoR, or lacZYA promoters. We have added this discussion to Appendix 2 in a new subsection entitled “False positive and false negative rates”.

3) Energy matrices:

3.1) How are the energy matrices constructed? Details are missing from the manuscript.

We thank the reviewer for this comment and concur that additional information on our construction of energy matrices is necessary. We have added a general overview of our method for constructing energy matrices to the Materials and methods section of this manuscript and further detail can be found in Appendix 3 subsection “Energy matrix inference”. To ensure that readers of this work are able to pick up and use our methods in their own laboratories, we have also paid considerable attention, in this resubmitted manuscript, to the construction of a full, detailed set of experimental, mathematical, and computational discussions. Detailed instructions to build energy matrices can be found on the GitHub Wiki for this work, while associated code for building energy matrices can be found in our Jupyter notebooks.

To address the reviewer’s concern and clearly delineate how we built energy matrices in this work, we note that the plotting of an energy matrix for a regulatory region of interest begins by analyzing the next-generation sequencing data for each 160 bp mutagenized region included in our DNA libraries. In particular, we deduce the relative gene expression value for each mutated regulatory region by counting genetic barcodes in RNA and cDNA form. This approach, whereby genetic barcodes serve as a semi-quantitative metric for gene expression, is frequently used in a type of experiments known as “massively parallel reporter assays” (MPRAs). We count barcodes as a relative measure of gene expression values for every mutant sequence and infer the energy matrix that best describes the data by maximizing the mutual information between the expression predictions from a given energy matrix and its expression data. We do this by starting from a random energy matrix and using a Markov Chain Monte Carlo method following the procedure outlined in Kinney et al., 2010. An energy matrix with high mutual information would predict (if the binding site corresponds to a repressor) that the sequences with low expression would have strong binding of said repressor, while sequences with high expression would have weak binding by the repressor.

The output from our MCMC inference is a table of the form:

Author response table 1
posACGT
0-0.005387-0.011758-0.0101760.027322
10.0023380.049826-0.0580300.005866
2-0.000259-0.0372240.0080210.029461
3-0.0174940.015760-0.0121840.013918

In Author response table 1, each value represents the binding energy contribution for a specific nucleotide: A, C, G, or T. A negative value indicates that the nucleotide in that specific position confers a stronger binding energy with the transcription factor, while a positive value indicates a weaker binding energy. We next visualize these nucleotide-specific values using a simple Python script executed in Jupyter notebooks. The notebook and all code used for energy matrix visualization is available in our GitHub repository. Again, we are grateful to the reviewer for raising this concern and agree that our methods needed to be made clearer. We also note that we have updated our Materials and methods section, in the main text, to immediately state that “the reader is urged to visit our detailed GitHub repository and associated methods on our GitHub Wiki”.

3.2) How are the p-values estimated to assign significance to energy parameters? It appears that the authors have used MCMC to sample from a likelihood function, but the details are very vague. To define an energy, it is assumed that the average expression can be written as the negative exponential of some effective energy to parametrize the average effects of mutations on expression- this should be better justified in the text. However, it seems that instead of 4 different energy parameters at each position, the model only assumes a “mutant” and a “wild type” energy. This may not be a good approximation given that we know nucleotides within TF binding sites have highly varying degeneracies. Moreover, the likelihood function for the observed DNA and RNA reads should be expressed as a function of the energy parameters at all positions. However, this is not the approach taken by the authors and instead, equations (8) and (9) only look at single positions. So, it is not clear what kind of likelihood function the MCMC is sampling from, or how p-values and confidence intervals are determined. Also, if the authors are indeed using MCMC with some likelihood function to sample the space of possible parameter values, the resulting variation in the energy parameters should correspond to posterior probability intervals and not confidence intervals.

We thank the reviewer for the comment. We are afraid our p-value analysis was confusing since there is no such analysis associated with energy matrices. The information footprints, for which we calculate effective energies and p-values, and the energy matrices involve separate calculations. Furthermore, we would like to note that the quantitative measures created during Reg-Seq, the energy matrices of transcription factors or RNAP sites, do in fact consider all four bases separately for each position. This is again in contrast to the calculations of effective energies for information footprints, which are based on calculating parameters for only mutated or wild type bases.

We acknowledge that, by only considering wild type or mutated energy contributions to the total effective binding energy rather than having separate values for energy contributions from all four base pairs, our methods will not be accurate in the case of calculating mutual information at locations with degenerate base pairs. However, the information footprints, which are created from the fits to effective energy of wild type or mutated base pairs referred to in the reviewer comment, are intended to be hypothesis generation tools that can identify transcription factor binding sites, rather than a quantitative tool like the energy matrices for transcription factors that we generate. As such, the most important test for the assumption that we can approximate effective energy contributions from all 4 bases as contributions from only wild type or mutated bases is to assess whether the approximation has any effect on determining binding site locations. We re-ran the false positive and false negative assessments discussed in the response to major reviewer comment 2, but instead calculated the effective energy parameters for producing information footprints as a sum of contributions from all four bases. We find that the literature binding sites that were properly identified, as summarized in Appendix 2—table 2, are identically identified. Specifically, any site which was identified using the previous method is still identified and any site that failed to be identified is still not observed. Similarly, when we only fit effective energy parameters for mutated or wild type bases there are no false positives identified in the promoters for marR, relBE, dgoR, zupT, or lacZYA. There are also no false positives when repeating the procedure while considering all 4 bases in the effective energy fits, implying that the simplification to only considering mutated or wild type bases does not have an effect on our ability to identify binding sites. We have amended the text in Appendix 3 subsection “Information footprints” to call attention to the limitations of the assumption we make and to justify our assumptions regarding effective energy used when inferring the parameters used in creating information footprints.

We now refer explicitly to the likelihood function used during MCMC inference in Appendix 3 subsection “Energy matrix inference” in Equation 27.

3.3) Could the authors present the accuracy of the inferred energy matrices as a function of the number of genomic variants used? A particular example is currently given in Appendix 3. Can the authors use a simulation for that purpose? This is required to understand the validity of the results and the ability to scale-up such an experiment (if for example a lower resolution is sufficient, can larger regulatory regions be analyzed in future experiments?)

We appreciate the reviewer’s comment and, in this resubmitted manuscript, have expanded the analysis that was carried out in Appendix 3 to fully address the issue of how the number of mutated sequences for a given promoter affects inference of energy matrices. We wish to both prove that our results will not be strongly dependent on the number of sequences in the sequence regime that we are in, and also to explore, as the reviewer helpfully suggested, how a reduction in the number of mutated sequences could facilitate larger scale experiments in the future. While pure simulation of data from an energy matrix is a possible way to answer this question, we instead generated examples of smaller data sets by computationally sub-sampling the Reg-Seq data from 7 mutated promoters (maoP, hslU, rpsA, leuABCD, aphA, araC, and tig). These promoter regions are representative of a large cross section of the variety of transcriptional regulation that we see in our study, including constitutive expression (hslU), simple repression (leuABCD, tig), simple activation (aphA), and more complicated regulatory architectures (maoP, rspA, araC). Each sub-sampling was performed in triplicate, and we used the Pearson correlation coefficient as a comparison metric between the inference based on the full data set and the computationally sub-sampled data sets. The results of this analysis are displayed in Appendix 3—figure 1.

Based on this analysis, we find that there is room to moderately lower the resolution of the experiment to approximately 1000 unique, mutated sequences per 160-base pair “window” before our results are subjected to large deviations in the inference of energy matrices. A decrease in the number of unique sequences can give modest boosts to the number of genes that can be studied in our Reg-Seq experiments but would not, alas, be able to provide order of magnitude increases in the number of genes that can be explored. We have replaced the original discussion in Appendix 3 with this expanded analysis, which we hope will fully address the reviewer’s concerns.

4) The step for assigning regulatory logic to promoters is unclear. Do authors always assume OR-gate-like interactions between the TFs, which is what Equation 14 suggests? If so, what is the justification for that assumption? Is it just the most basic statistical mechanical assumption, or is there strong empirical support?

We appreciate that the reviewer has raised this question. First, we reiterate that the approach that we present in this study, if we as a field are being appropriately cautious, should be viewed as a powerful, quantitative tool for regulatory hypothesis generation. But it should be clear, based on many papers from our lab, that we are by no means satisfied with stopping at the current iteration of this method (Phillips et al., 2019; Razo-Mejia et al., 2018; Garcia and Phillips, 2011; Chure et al., 2019). The outcome of our study is a hypothesized regulatory architecture as characterized by a suite of binding sites for RNAP, repressors and activators, as well as the extremely potent binding energy matrices. We do not assume, a priori, that a particular collection of such binding sites is AND, OR, or any other logic (Buchler et al., 2003; Bintu et al., 2005; Galstyan et al., 2019). We are exploring a new round of experimental tools that will allow us to do the next step in our analysis of such architectures. Our new Figure 4, in the main body of this paper, shows the outcome of our analysis. For those promoters that contain multiple transcription factor binding sites, we do not yet know what detailed regulatory logic (AND, OR, NAND, and so forth) they imply. Using the traditional approach in our lab of using statistical physics to predict input-output functions, in conjunction with our methods for constructing energy matrices, we believe that there is a way forward to find out what the regulatory logic of a given promoter is. We also suspect, however, that information footprint and expression shifts under multiple growth conditions might produce substantial hints as to the logic as well. Ultimately, we believe the present work lays the foundation for asking and answering precisely these questions. To more fully address the reviewer’s concern, and our noted inability to identify regulatory logic, we have included a discussion of inferred regulatory logic based on this comment in the Results subsection entitled “Newly discovered E. coli regulatory architectures”.

5) The description of the method and analysis is hard to follow and unclear about many assumptions and limitations. To give a few examples:

5.1) Appendix 3 outlines how to calculate the mutual information footprint from the read count data but a more explicit discussion would be helpful: For example, how should the quantities in Equation 5 (p(m, m𝜇), p(m), p(𝜇)) be calculate from the preceding example data (subsection “Information footprints”)?

We thank the reviewer for their comment and agree that, as written in the original manuscript, our mathematical explanations are insufficient to guide the reader through our equations in a logical manner. We sincerely hope to address this shortcoming in this resubmitted manuscript. Accordingly, we have expanded our discussion regarding the calculation of information footprints in Appendix 3 to include explicit calculations of p(m), p(µ), and p(m, µ) from the example data presented in Appendix 3—table 1.

In the subsection entitled “Information footprints” in Appendix 3, we now give a very precise, pedagogical example of the calculations for p(m), p(µ), and p(m, µ), and provide further explanations regarding calculations of the mutual information itself. This Appendix, in conjunction with the new GitHub Wiki and Jupyter notebooks, also now show how these calculations can be performed on real data obtained using the Reg-Seq methodology.

5.2) The methods rely on both DNA and RNA sequencing of the same sample and using ratios of RNA/DNA reads from the same reporter to estimate expression. The description of the sequencing protocol is insufficient. Importantly, the DNA sequencing is not even mentioned in the Materials and methods and it is not explained how precisely this expression quantification is done.

We thank the reviewer for their comment and apologize for the error. We have now updated our Materials and methods section to more fully explain how our DNA libraries are prepared, how we isolate RNA and DNA, and how our sequencing experiments are performed. We have also added several sentences that describe how we count barcodes, in cDNA and DNA form, and then use these data to quantify gene expression values for mutated promoter sequences. In addition, the GitHub Wiki provides a detailed description of these methods.

5.3) The Materials and methods mention correcting for correlated mutations, MCMC to assign p-values, and energy matrix reconstruction, but it is completely unclear how either of these are done.

We apologize that our Materials and methods failed to mention the corrections for correlated mutations, MCMC to assign p-values, and how we construct energy matrices. We take this exclusion very seriously, as it is our sincere hope that readers and other laboratories will be able to easily utilize Reg-Seq to dissect a greater number of regulatory architectures. Since our first submission, we have made an in-depth GitHub Wiki that outlines all experimental, computational, and mathematical steps involved in our experiments. However, we do not merely wish to direct readers to an external GitHub repository, as this exclusion warrants remedy in the main body of the text. Accordingly, we have added a subsection in Appendix 3 entitled “Energy matrix inference” to explain the process of creating energy matrices. In light of the suggestions of the reviewers, we have also implemented a method for automated motif finding that replaces the appeal to p-values. We now explain this process in a subsection entitled “Automated putative binding site algorithm” in the Materials and methods section. The generation of mutual information footprints that we use to address the problem of correlated mutations is also now addressed in Appendix 3, in the subsection entitled “Information footprints”.

6) The discussion of the methods in the main text is mostly qualitative and crucial details are scattered throughout the manuscript, in the appendices and in Materials and methods.

6.1) For example, it requires digging into the appendices to understand that the authors study a region of 160 bp mostly upstream TSS, and so, is this method inadequate if the regulatory region is larger or if its location is unknown? A clear and concise explanation of the crucial details (e.g. size of the region, number of variants per promoter, etc) should be given in the main text.

6.2) To further clarify the methods, it would be helpful to include a picture of the reporter construct, which based on the descriptions includes:

-promoter including 45 bp downstream of annotated TSS.

-then 64bp with primers for plasmid construction

-then 11bp with stop codons in 3 frames.

-then barcodes?

-then a RBS

-then GFP mRNA

Author response image 1
Figure 10 from Rydenfelt et al., 2014b.

Distribution of activating and repressing binding sites bound by global TFs and specific TFs, respectively. The y-axis shows the number of binding sites overlapping each nucleotide position, after aligning all promoters with respect to their transcription start site (TSS) for the different kinds of TFs.

We thank the reviewer for this constructive comment and apologize that the methodology in our initial submission was unclear and insufficient. In this resubmitted manuscript, we have made numerous efforts to address these concerns.

We have created an exhaustive Materials and methods section that outlines every experimental, mathematical, and computational procedure utilized in this project. The full text is available on the GitHub Wiki that accompanies this project. Additionally, we now include a spreadsheet that lists every gene/operon that was studied in our manuscript (provided as Supplementary file 1). For each gene/operon, we also provide details on why each TSS was selected and provide applicable references for each choice. We hope that this provides clarity on our methodologies. Finally, we have updated our methods, in both the main text and appendix, to more clearly disclose our choice of TSS for each gene/operon and the design of our genetic constructs.

When we generate a library of mutated “regulatory regions” for Reg-Seq, we include a 160 bp region that includes 45 bp downstream of each TSS, and 115 bp upstream of each TSS. It is certainly possible, as noted by the reviewer, that a 160 bp region does not include all of the TF binding sites and regulatory information for a given gene/operon. In cases where little is known about the genetic architecture for a gene, and no TF binding sites have been experimentally determined, we cannot say definitively that our selected region includes all regulatory information.

A previous study from our group (Rydenfelt et al., 2014b) (see Author response image 1) asked the question: What is the distribution of binding site locations for activator and repressor TFs in the E. coli genome, based on a computational analysis of the RegulonDB database? In that study, as seen in Author response image 1, we found that most (but not all) TF binding sites fall within the range of approximately 45 bp downstream and 115 bp upstream of a TSS. Repressors tend to bind further upstream of a TSS than do activators. We used this study as the basis for designing the size and positioning of the 160 bp regulatory ”windows” used in this study, but of course, we are fully cognizant of the broader distribution revealed in Author response image 1, and in addition, of the even more thorny issue of action at a distance in the form of DNA looping. In future iterations of our experiments to dissect the E. coli ”regulome”, we aim to broaden the length of our mutagenized region, a goal which poses few conceptual or technical obstacles, though the question of DNA looping (and enhancers in eukaryotes) will clearly require further innovation.

For each gene, we obtained an average of 2200 unique DNA mutants in the form of oligo pools from TWIST Biosciences. This number was found to be sufficient for the replicable inference of binding energy matrices based on a previous systematic, experimental effort to evaluate the minimum number of DNA mutants that are necessary for such inference; we have previously discussed this in the context of Appendix 3—figure 1. We also now include our discussion of these points in Appendix 3 subsection “Uncertainty due to number of independent sequences”.

When we choose a TSS to study based on peer-reviewed literature and our search through the EcoCyc database, we do so using a few criteria; we always select, where possible, a TSS based on experimental evidence. If there are multiple TSS listed for a gene or operon of interest, we typically select the TSS that is best positioned to include regulatory information for all of the nearby TSS. For full transparency, however, we now include a spreadsheet that provides our rationale for selecting the TSS for each and every gene in this study (provided as Supplementary file 1).

To better disclose the design of our genetic constructs, we have also created a “cartoon” figure of the cloned plasmid architecture, which we show in Appendix 1—figure 1. As noted by the reviewer, our reporter construct is a pSC101-based plasmid with a kanamycin resistance marker. Each mutated, 160 bp DNA sequence is flanked by primer sites that are 20 bp in length (forward and reverse sites). Immediately downstream of the cloned regulatory regions is a random, unique, 20 bp barcode, followed by a ribosome binding site, GFP gene, and terminator. We have added clarifications on our genetic constructs to the accompanying figure legend in Appendix 1. Additionally, exact details of the sequences flanking the mutated region are stored in Supplementary file 2.

7) Authors used TSS information to decide which promoters to pursue. How limiting is this for scaling up the procedure? It would be helpful to discuss briefly what fraction of promoters in E. coli have good TSS information and how valid is to assume that at the genome-wide level, each operon is dominated by a single TSS?

We appreciate the reviewer’s comment and aim to address these concerns in this resubmission. We argue that the TSS information available in public databases is a significant limitation in the scaling up of this work that will require further ingenuity and vigilance on our part.

To explore this assumption, we downloaded the ”all promoters” dataset from RegulonDB v10.5 (Santos-Zavaleta et al., 2019), which provides information on every promoter in E. coli, with details on each promoter’s name, its DNA strand orientation, σ factors that recognize the promoter, the promoter sequence, evidence to support the existence of each promoter, and the confidence level of the evidence. An examination using Python indicated that, of the 2600 operons listed in RegulonDB, 2500 have a TSS known to be related to the operon or to have a TSS, determined by transcription initiation mapping, within 300 bp of the start of the operon. If we restrict our consideration to those TSS classified as having “Strong” or “Confirmed” evidence by RegulonDB, then 1700 operons have an associated TSS. In most cases, the evidence for a promoter with this classification is experimental; namely, the TSS was identified using transcription initiation mapping or RNA- seq experiments. The number of operons with a known TSS, as listed in these databases, represent a more than 10x increase over the number of genes explored in our Reg-Seq experiments and thus will serve as a powerful jumping off point for future iterations of this work. Unfortunately, EcoCyc and RegulonDB are not comprehensive, and thus will still not permit a complete genome-wide treatment of the problem.

The reviewer has also asked whether it is safe to assume, at the genome-wide level, that each operon is dominated by a single TSS. This is certainly not an assumption that we make in this study, as many genes/operons in E. coli have experimental evidence for multiple TSS, meaning that there are sometimes two or three promoters that are active upstream of certain genes or operons. Indeed, one of the limitations of the Reg-Seq method is its limited DNA “mutational window”; we are only able to build mutated oligo libraries for DNA fragments up to approximately 200 base pairs in length. This is a limitation that comes down to modern oligo synthesis technologies and financial feasibility, as ordering about 2000 unique sequences for 100 genes, each of which is about 200 bases in length, is already a costly endeavour. Thus, we do our utmost to include all known or putative regulatory information in the DNA sequences that we do order. If there are multiple TSS for a given gene/operon, we choose the TSS with the most experimental evidence, or the TSS that is located most closely to the others. In this resubmitted manuscript, we now include a spreadsheet as Supplementary file 1 that lists our rationale for choosing every TSS in this study.

8) The authors rely on expression from plasmids and use mRNA/DNA ratio to handle the effect of variability in plasmid copy number between cells. However, if the plasmid copy number is of a similar order of magnitude as the transcription factor copy number, then the expression level measured (to calculate the energy matrices) is determined not only by the binding energy, but also by the TF availability leading to under-estimation of the binding energies. The authors should comment on this in the manuscript, and if they have the data available, show the measurements for plasmid and TF copy numbers to address this point. At this point we do not see a necessity for additional experiments.

We are grateful to the reviewer for this comment, and certainly agree that this point demands serious consideration and scrutiny. Indeed, one of our favorite works from our laboratory was precisely aimed at rigorously predicting and measuring this effect, regarding protein and plasmid abundance (see Weinert et al., 2014). In that paper, we demonstrated how to control this effect in a parameter-free manner. Further, we share the reviewer’s aggravation that, until now, our experiments have been performed on plasmids. We have been working hard with the Sri Kosuri lab to make sure that our next set of experiments will involve chromosomally integrated libraries, and so only a single copy of each mutated promoter will be present in future iterations of our experiments.

With that said, the plasmid used in our experiments is derived from pUA66, which contains a pSC101 origin of replication (Zaslaver et al., 2006). A study from 1997 quantified the copy number of various “replication origins” and found that pSC101 has a copy number (in log phase) of 3 or 4 (Lutz and Bujard, 1997). To assess this copy number, the experimenters encoded luciferase on plasmids with differing origins of replication, quantified the luciferase activities for each, and compared their obtained values to a strain in which a single copy of the luciferase gene was inserted into the chromosome. The authors used a different strain of E. coli (DH5αZ1) and we have not independently assessed the copy number of our plasmid.

The absolute copy number of various TFs identified in our study can also be found using data from prior studies, chiefly those that performed experiments to quantify whole-proteome E. coli samples. Specifically, our laboratory has long admired a 2016 study from Matthias Heinemann’s group, which provides the absolute quantification for roughly 55 percent of predicted proteins in the E. coli K12 proteome using LC-MS (see Supplementary Table S6 from Schmidt et al., 2016. For those TFs that were quantified in the Schmidt study, and also identified in our Reg-Seq experiments, we provide their absolute quantification in E. coli K12 for both glucose and LB growth media in Appendix 3—table 2.

This table lists the global, absolute quantification for the listed TFs for E. coli K12 grown in both glucose (5 g/L concentration in M9 minimal media) and LB. For most TFs, the purported copy number for nearly all of these TFs is much greater than the low-copy number plasmid used in this study, thus mitigating the concern that the limited availability of a TF could impact gene expression, though we reiterate that going forward, we are committed to moving beyond this approach to perform chromosomal integrations.

“There are a few transcription factors in this work that have copy number on the order of the plasmid copy number, however, including XylR, DicA, and YgbI. […] The rank-order is always preserved and so the presence of a multi-copy plasmid will not change the mutual information between model predictions and experimental data. As a result, the final inference of energy matrices will remain the same.”

Again, we agree wholeheartedly with the reviewer that this point merits serious deliberation and discussion. As this is a point that astute readers of our work will likely share, we have added a modified form of this explanation to Appendix 3 subsection “Effect on calculated energy matrices when transcription factor copy number ≈ plasmid copy number”.

9) The limitations of the method should be more concisely explained. Currently, limitations are scattered throughout the text.

We thank the reviewer for their feedback and have aimed to address this concern by updating the main text’s Discussion section accordingly. Specifically, we have identified the major limitations discussed in the Results section, and have added them to the Discussion section, so that all main limitations relating to the Reg-Seq methodology can be found in one place. We have also added numerous limitations to this Discussion section, including our concerns regarding the use of mass spectrometry to identify weaker TF:DNA interactions, and further discuss our efforts to ameliorate some of these limitations for future work and the transcription research community at-large.

Revision required for Figures and Tables:

Figure 1: This figure is hard to read. It is difficult to distinguish the individual tick marks around the genome, because there are too many, they are too densely packed, and the colors are too mixed. Also, the caption describes the color of some ticks as "red," but in the printed figure they look more brown (They appeared closer to red on the screen.)

We thank the reviewer for their comment and agree that the lines on this figure are difficult to distinguish. When making this schematic, we wanted to emphasize the known and unknown regions of the E. coli genome and did not aim to encode precise information about their positions. The percentages in the center serve to summarize the main point: that the majority of operons have no known regulation. As a result, we have opted to leave the figure unchanged, as it bolsters our argument concerning our “regulatory ignorance” of the E. coli genome. However, to address the reviewer’s concern with respect to gene positions, we now include (in Supplementary file 1) the genomic location of each gene’s TSS.

Figure 2: You might want to clarify that the blue region likely corresponds to the σ factor binding site.

We thank the reviewer for noticing this error and have corrected the caption for Figure 2 accordingly. Specifically, we now state that the RNAP binding region is highlighted in blue, whilst repressor binding regions are in red.

We have also updated the legends for Figure 3, 5, 7 and 8 to point out that repressor binding regions are in red, activator binding regions in green, and RNAP binding regions in blue.

Figure 3: The number of (0,0) promoters should be shown as well. Moreover, it seems that the counts in Figure 3 add up to about 50 and the text mentions that there are 32 promoters with no sites found, i.e. type (0,0). Adding this in the sum still seems much less than 113 (i.e., the total number of promoters as indicated in the manuscript). Related to this, it seems that Table 2 has clearly less than 113 promoters in it. Where are the remaining ones?

This figure (now Figure 6) has been updated to include these (0,0) architectures. As summarized in Table 1 of the main text, we only include in Figure 6(B) those regulated promoters which have at least one newly discovered binding site. The remaining 18 promoters that are not accounted for in Figure 6(B) were those we deemed “inactive” as we were unable to recover an RNAP binding site in the growth conditions we explored here. Further discussion regarding those promoters for which we did not recover an active RNAP binding site is now included in Appendix 3 subsection “Examination of promoters for which no RNAP site was found”.

Figure 4: The authors state about the results in Figure 4: "In each of the cases shown in the figure, prior to the work presented here, these promoters had no regulatory information in relevant databases such as EcoCyc (Keseler et al., 2016) and RegulonDB (Santos- 318 Zavaleta et al., 2019)."

This is simply not true. If you check EcoCyc for these genes you find:

i) idnK regulated by CRP, IdnR and GtnR.

ii) leuABCD regulated by LeuO, slyA, LrhA, and RcsB-BgU.

iii) maoP regulated by hdfR.

iv) rspA regulated by CRP and YdfH.

Only for yjjJ, aphA, and ybjX nothing was previously reported. It should also be noted that except for the rspA promoter, the previously reported regulatory interactions were generally not identified by the method.

Finally, it seems that there is no mass-spec data for yjjJ. Is the identification of marA as the regulator based on motif matching then?

We acknowledge and regret this mistake in claiming that these were newly discovered regulatory architectures. As an explanation, this figure was meant to display examples of data from a range of regulatory architectures, regardless of whether or not they were previously known, and the line referenced by the reviewer was intended to describe the following figure, but erroneously was used to describe the previous Figure 4. It was certainly not our intention to falsely over-state our discoveries. We have amended the previous Figure 4 caption to remove the erroneous text.

Unrelated to this comment, but as part of an overall push to make the methodology and results clearer, this figure has now replaced by a new Figure 4, listing all TF binding sites for which we have data. The new Figure 4 better encapsulates our point here: that this method is able to recover a myriad of regulatory architectures. Furthermore, MarA was identified as a regulator of yjjJ through motif matching to a putative binding site. A note to this effect is in the “Evidence” column of Table 2.

Table 2: This table is unclear. Does the "architecture" of a promoter (first column) correspond to what is inferred in this study, to what is known in the literature, or to a combination of both? For the newly found sites it should be listed whether the binding TF was identified and (if so) what it is. For the fourth column, “literature binding sites”, it is not clear whether this is the total number of known literature sites or only the number of known sites that were successfully identified here – both should be listed. It is not clear what the fifth column “identified binding sites” refers to. The evidence column is also unclear. Is this evidence for the newly found sites or the literature sites?

Just to illustrate the confusion: consider the well-known lac operon. In Table 2 the lac operon (called lac as opposed to lacZYA, which is used elsewhere in the manuscript) is reported to have only 1 known literature site, whereas it is well established that it has a site for CRP and multiple sites for LacI. In the final column it is claimed that there is mass-spec evidence for LacI binding. However, on the website, the lac operon does not occur at all and in Appendix 2—figure 1 only the CRP site is reported. What makes this all even more confusing is that none of the experimental conditions included lactose or another inducer of the operon. As such, the lac operon should be highly repressed and so we expect very little effect of mutating the CRP site, and strong effects of abolishing the LacI site.

In multiple cases in the table, less literature information is reported than it actually exists: e.g. tar is a reported target of FNR, uvrD is a reported target of LexA, cra is a reported target of PhoB, and so on. The authors should make clear what literature information is shown here.

The authors should clear up this information so as to make it unambiguous and consistent across tables, figures, and website.

We thank the reviewer for their comment and hope to address it in this resubmitted manuscript. Our 160 bp promoter regions (with 45 bp downstream of the annotated TSS used) does not always encapsulate all possible regulatory mechanisms, especially long-distance binding of repressors, as is the case in the lac operon. The regulatory architectures that we list in Table 2 in the main text reflect only the binding sites that we are able to recover within our 160 bp “mutation” window that was analyzed for each promoter in this study. While this is a clear limitation of the method, in that we are only able to recover regulation in a rather small, 160 bp region, we are encouraged by previous work that found that the majority of annotated regulation is found within 100 bp of the TSS (see Rydenfelt et al., 2014b).

In that study (a figure from which is displayed here as Author response image 1 and is further discussed in Major Comment 6 in this response letter), Rydenfelt et al. were able to show, using TSS and binding site positions in RegulonDB, that ”75% of all reported TF interactions in RegulonDB (v8.5) take place within 100 bp of the transcription start site.” Regardless of this close proximity to the TSS, however, repressor and activator binding sites do have distinctly different profiles, as activator binding sites tend to reside slightly further upstream of the TSS compared to repressor binding sites.

Furthermore, we do not include binding sites in this analysis that lack a known binding site location or that are listed as “low-evidence” in EcoCyc, which generally means that they were found only by similarity to a consensus binding site or through gene expression analysis. We agree with the reviewers that our rationale for including or excluding binding sites, as well as our organization of Table 2, was inadequate in our previous submission. To amend this shortcoming, we have added a new section to the appendix; namely, subsection “Explanation of included binding sites” in Appendix 4, which provides our explanations, for each relevant gene, regarding binding sites that appear in the regulatory information in RegulonDB or EcoCyc that are not addressed in Reg-Seq as a ”literature binding site” in Table 2. As one specific example from lacZYA, the mutagenized region extends from the TSS (the primary TSS p1) to 75 base pairs upstream of the TSS. The location of the mutagenized region excludes the LacI sites, while including a single CRP binding site, a MarA binding site, and two HNS binding sites. The expression from marA is expected to be low, as we do not grow the cells in the presence of salicylate or antibiotic stress and so we do not expect to observe the MarA site. In fact, the precursor of the Reg-Seq experiment, Sort-Seq, mutagenized and studied the same 75-base pair region, and only observed binding by CRP (Kinney et al., 2010). As such, we only include CRP in Table 2, the regulatory cartoons, or the analysis of false positives and false negatives discussed in Appendix 2 subsection “False positive and false negative rates”.

In our Reg-Seq experiments, there were several genes for which we failed to identify any repressor or activator binding sites, and indeed a limitation of this method (which we also discuss in the Discussion section of the main text), is our inability to study those binding sites outside of our narrow, 160 bp window.

We also have amended Table 2 in the main text in an attempt to clarify what is meant by each column. We now indicate in the table caption that the first column is meant to reflect the total number of binding sites that could be seen in the Reg-Seq data, regardless of whether there was previous evidence for their existence. We also note that binding sites which are not in the mutagenesis window or are otherwise not expected to appear in the Reg-Seq data, as explained in Appendix 4 subsection “Explanation of included binding sites”, are not counted in this tally. Similarly, we also emphasize that the literature sites column contains those sites that are both expected to be and are, in actuality, observed in the Reg-Seq data.

https://doi.org/10.7554/eLife.55308.sa2

Article and author information

Author details

  1. William T Ireland

    Department of Physics, California Institute of Technology, Pasadena, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0971-2904
  2. Suzannah M Beeler

    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
    Contribution
    Conceptualization, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1930-4827
  3. Emanuel Flores-Bautista

    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
    Contribution
    Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Nicholas S McCarty

    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
    Contribution
    Resources, Software, Writing - review and editing
    Competing interests
    No competing interests declared
  5. Tom Röschinger

    Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, United States
    Contribution
    Data curation, Software, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4900-3216
  6. Nathan M Belliveau

    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
    Present address
    Howard Hughes Medical Institute and Department of Biology, University of Washington, Seattle, United States
    Contribution
    Conceptualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1536-1963
  7. Michael J Sweredoski

    Proteome Exploration Laboratory, Division of Biology and Biological Engineering, Beckman Institute, California Institute of Technology, Pasadena, United States
    Contribution
    Formal analysis, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0878-3831
  8. Annie Moradian

    Proteome Exploration Laboratory, Division of Biology and Biological Engineering, Beckman Institute, California Institute of Technology, Pasadena, United States
    Contribution
    Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0407-2031
  9. Justin B Kinney

    Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, United States
    Contribution
    Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1897-3778
  10. Rob Phillips

    1. Department of Physics, California Institute of Technology, Pasadena, United States
    2. Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Validation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    phillips@pboc.caltech.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3082-2809

Funding

National Institutes of Health (Director's Pioneer Award)

  • Rob Phillips

National Institutes of Health (National Research Service Award)

  • Suzannah M Beeler

National Institutes of Health (Maximizing Investigators Research Award)

  • Rob Phillips

Howard Hughes Medical Institute (International Student Research Fellowship)

  • Nathan M Belliveau

National Institutes of Health (1S10OD02001301)

  • Annie Moradian
  • Michael J Sweredoski

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are grateful to Rachel Banks, Stephanie Barnes, Curt Callan, Griffin Chure, Ana Duarte, Vahe Galstyan, Hernan Garcia, Soichi Hirokawa, Thomas Lecuit, Heun Jin Lee, Madhav Mani, Muir Morrison, Steve Quake, Manuel Razo-Mejia, Gabe Salmon, and Guillaume Urtecho for useful discussion and feedback on the manuscript. Guillaume Urtecho and Sri Kosuri have been instrumental in providing key advice and protocols at various stages in the development of this work. We would like to thank Jost Vielmetter and Nina Budaeva for providing access to their Cell Disruptor. Brett Lomenick provided crucial help and advice with protein preparation. We also thank Igor Antoshechkin for his help with sequencing at the Caltech Genomics Facility.

Funding: We are deeply grateful for support from NIH Grants DP1 OD000217 (Director’s Pioneer Award) and 1R35 GM118043-01 (Maximizing Investigators Research Award) which made it possible to undertake this multi-year project. NMB was supported by an HHMI International Student Research Fellowship. SMB was supported by the NIH Institutional National Research Service Award (5T32GM007616-38) provided through Caltech. The Proteome Exploration Laboratory is supported by, the Beckman Institute, and NIH 1S10OD02001301.

Senior Editor

  1. Naama Barkai, Weizmann Institute of Science, Israel

Reviewing Editor

  1. Armita Nourmohammad, University of Washington, United States

Reviewer

  1. Tamar Friedlander, IST Austria, Austria

Publication history

  1. Received: January 20, 2020
  2. Accepted: September 18, 2020
  3. Accepted Manuscript published: September 21, 2020 (version 1)
  4. Version of Record published: October 16, 2020 (version 2)

Copyright

© 2020, Ireland et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,854
    Page views
  • 321
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Microbiology and Infectious Disease
    2. Physics of Living Systems
    Vivek V Thacker et al.
    Tools and Resources

    We establish a murine lung-on-chip infection model and use time-lapse imaging to reveal the dynamics of host-Mycobacterium tuberculosis interactions at an air-liquid interface with a spatiotemporal resolution unattainable in animal models and to probe the direct role of pulmonary surfactant in early infection. Surfactant deficiency results in rapid and uncontrolled bacterial growth in both macrophages and alveolar epithelial cells. In contrast, under normal surfactant levels, a significant fraction of intracellular bacteria are non-growing. The surfactant-deficient phenotype is rescued by exogenous addition of surfactant replacement formulations, which have no effect on bacterial viability in the absence of host cells. Surfactant partially removes virulence-associated lipids and proteins from the bacterial cell surface. Consistent with this mechanism, the attenuation of bacteria lacking the ESX-1 secretion system is independent of surfactant levels. These findings may partly explain why smokers and elderly persons with compromised surfactant function are at increased risk of developing active tuberculosis.

    1. Cell Biology
    2. Physics of Living Systems
    Andrea Serra-Marques et al.
    Research Article

    Intracellular transport relies on multiple kinesins, but it is poorly understood which kinesins are present on particular cargos, what their contributions are and whether they act simultaneously on the same cargo. Here, we show that Rab6-positive secretory vesicles are transported from the Golgi apparatus to the cell periphery by kinesin-1 KIF5B and kinesin-3 KIF13B, which determine the location of secretion events. KIF5B plays a dominant role, whereas KIF13B helps Rab6 vesicles to reach freshly polymerized microtubule ends, to which KIF5B binds poorly, likely because its cofactors, MAP7-family proteins, are slow in populating these ends. Sub-pixel localization demonstrated that during microtubule plus-end directed transport, both kinesins localize to the vesicle front and can be engaged on the same vesicle. When vesicles reverse direction, KIF13B relocates to the middle of the vesicle, while KIF5B shifts to the back, suggesting that KIF5B but not KIF13B undergoes a tug-of-war with a minus-end directed motor.