The emergence and evolution of gene expression in genome regions replete with regulatory motifs

  1. Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
  2. Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland
  3. The Santa Fe Institute, Santa Fe, NM, USA

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Vincent Lynch
    University at Buffalo, State University of New York, Buffalo, United States of America
  • Senior Editor
    Alan Moses
    University of Toronto, Toronto, Canada

Reviewer #1 (Public Review):

Summary:

This study by Fuqua et al. examines the emergence of sigma70 promoters in bacterial genomes. While several studies have explored how mutations lead to promoter activity, this is the first to explore the phenomenon in a wide variety of backgrounds, which notably contain a diverse assortment of local sigma70 motifs in variable configurations. By exploring how mutations affect promoter activity in such diverse backgrounds, the authors identify a variety of anecdotal examples of gain/loss of promoter activity and propose several mechanisms for how these mutations interact within the local motif landscape. Ultimately, they show that different sequences have different probabilities of gaining/losing promoter activity and may do so through a variety of mechanisms.

Major strengths and weaknesses of the methods and results:

This study uses Sort-Seq to characterize promoter activity, a method that has been adopted by multiple groups and shown to be robust. Furthermore, the authors use a slightly altered protocol that allows measurement of bidirectional promoter activity. Combined with their pooling strategy, this allows them to characterize expression from many different backgrounds in both directions at extremely high throughput, which is impressive! A second key approach this study relies on is the identification of promoter motifs using position weight matrices (PWMs). While these methods are prone to false positives, the authors implement a systematic approach that is standard in the field. However, drawing these types of binary definitions (is this a motif? yes/no) should always come with the caveat that gene expression is a quantitative trait that we oversimplify when drawing boundaries.
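The caveat about binary motif calls can be illustrated with a minimal Python sketch of PWM scanning. The count matrix, pseudocount, uniform background, and thresholds below are all hypothetical values chosen for illustration; they are not the matrix or cutoff used in the study.

```python
import math

# Hypothetical position counts for a 6-bp motif resembling a -10 box.
# Illustrative only; NOT the matrix used in the study.
COUNTS = [
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 4, "C": 2, "G": 2, "T": 12},
    {"A": 11, "C": 3, "G": 3, "T": 3},
    {"A": 10, "C": 4, "G": 3, "T": 3},
    {"A": 1, "C": 1, "G": 1, "T": 17},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq, pwm):
    """Return (position, score) for every window of the motif's width."""
    width = len(pwm)
    return [
        (i, sum(pwm[k][seq[i + k]] for k in range(width)))
        for i in range(len(seq) - width + 1)
    ]


def call_motifs(seq, pwm, threshold):
    """Binary motif calls: keep windows scoring at or above the threshold."""
    return [(pos, score) for pos, score in scan(seq, pwm) if score >= threshold]


pwm = build_pwm(COUNTS)
seq = "CCTATAATCC"
lenient = call_motifs(seq, pwm, threshold=0.0)  # hypothetical cutoff
strict = call_motifs(seq, pwm, threshold=6.0)   # hypothetical cutoff
```

The underlying score is continuous; moving the threshold changes which windows count as motifs, which is the oversimplification the reviewer cautions against.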

Their approach to randomly mutagenizing promoters allowed them to find many anecdotal examples of the different kinds of evolutionary change that can increase or decrease promoter activity. However, because these phenomena were not validated in more controlled backgrounds, the results warrant further scrutiny. That is, the reasons why certain mutations create or abolish promoter activity may lie in interactions with other elements in the 'messy' backgrounds, rather than in the mechanisms proposed.

An appraisal of whether the authors achieved their aims, and whether the results support their conclusions:

The authors express a key finding that the specific landscape of promoter motifs in a sequence affects the likelihood that local mutations create or destroy regulatory elements. The authors have described many examples, including several that are non-obvious, and show convincingly that different sequence backgrounds have different probabilities for gaining or losing promoter activity. While this overarching conclusion is supported by the manuscript, the proposed mechanisms for explaining changes in promoter activity are not sufficiently validated to be taken for absolute truth. There is not sufficient description of the strength of emergent promoter motifs or their specific spacings from existing motifs within the sequence. Furthermore, they do not define a systematic process by which mutations are assigned to different categories (e.g. box shifting, tandem motifs, etc.) which may imply that the specific examples are assigned based on which is most convenient for the narrative.

Impact of the work on the field, and the utility of the methods and data to the community:

From this study, we are more aware of the different ways promoters can evolve and devolve, but we do not have a better ability to predict when mutations will lead to these effects. Recent work in the field of bacterial gene regulation has raised interest in bidirectional promoter regions. While the authors do not discuss how mutations that raise expression in one direction may affect the other, they have created an expansive dataset that may enable other groups to study this interesting phenomenon. Also, their variation of the Sort-Seq protocol will be a valuable example for other groups who may be interested in studying bidirectional expression. Lastly, this study may be of interest to groups studying eukaryotic regulation, as it can inform how the evolution of transcription factor binding sites influences short-range interactions with local regulatory elements.

Any additional context to understand the significance of the work:

The task of computationally predicting whether a sequence drives promoter activity is difficult. By learning what types of mutations create or destroy promoters from this study, we are better equipped for this task.

Reviewer #2 (Public Review):

Summary:

Fuqua et al. investigated the relationship between prokaryotic box motifs and the activation of promoter activity using a mutagenesis sequencing approach. By generating thousands of mutant daughter sequences from both active and non-active promoter sequences, they produced a fantastic dataset for investigating potential mechanisms of promoter activation. From these large numbers of mutated sequences, they computed the mutual information between mutations and gene expression to identify key mutations relating to the activation of promoter island sequences.

Strengths:

The data generated from this paper is an important resource to address this question of promoter activation. Being able to link the activation of gene expression to mutational changes in previously nonactive promoter regions is exciting and allows the potential to investigate evolutionary processes relating to gene regulation in a statistically robust manner. Alongside this, the method of identifying key mutations using mutual information in this paper is well done and should be standard in future studies for identifying regions of interest.
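For readers unfamiliar with the approach, the following Python sketch shows one way to compute per-position mutual information between mutation status and binned expression. It is an illustration under stated assumptions: the function names, the mutated/unmutated encoding, and the discrete expression bins are hypothetical, and the authors' actual calculation may differ (for example, in how bases are encoded or how finite sampling is corrected).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def per_position_mi(parent, daughters, expression_bins):
    """MI between 'position i is mutated' and the expression bin, for each i.

    parent: the unmutated sequence
    daughters: mutant sequences of the same length as parent
    expression_bins: one discrete expression label per daughter (assumed
        to come from something like sort-seq bins; hypothetical encoding)
    """
    return [
        mutual_information([d[i] != parent[i] for d in daughters], expression_bins)
        for i in range(len(parent))
    ]


# Toy data: a mutation at position 0 perfectly tracks the expression bin.
parent = "ACGT"
daughters = ["TCGT", "TCGT", "ACGT", "ACGT"]
bins = [1, 1, 0, 0]
profile = per_position_mi(parent, daughters, bins)
```

Positions whose mutation status carries the most information about expression stand out as peaks in such a profile.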

Weaknesses:

While the generation of the data is superb, the focus only on these mutational hotspots discards much of the information available to the authors for drawing robust conclusions. For instance:

(1) The linear regression in S5 used to demonstrate that the number of mutational hotspots correlates with the likelihood of a mutation causing promoter activation is driven by three extreme points.

(2) Many of the arguments also rely on the number of mutational hotspots being located near box motifs. The context-dependent likelihood of this occurring is not taken into account, given that these sequences are inherently rich in box motifs. Something like an enrichment test is needed to establish how likely these hotspots are to form in or next to motifs by chance.

(3) The link between changes in expression and mutations in surrounding motifs is assessed with two-sided Mann-Whitney U tests. This method assumes that the sequence motifs are independent of one another, but the hotspots of interest occur 0, 3, 4, or 5 times per sequence. There is therefore no sequence in which these hotspots can be independent, and the argument from correlation to causation for the effect of motif change on expression is weakened.

(4) The distance between -10 and -35 was mentioned briefly but not taken into account in the analysis.
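The kind of enrichment test suggested in point (2) could be sketched as a simple permutation test in Python. This is one possible design, not the reviewer's or the authors' method: the function names, the half-open `(start, end)` motif intervals, the `window` parameter, and the uniform-placement null model are all assumptions made for illustration.

```python
import random


def count_near_motifs(positions, motifs, window=0):
    """Count positions that fall inside, or within `window` bp of, any motif.

    motifs: list of half-open (start, end) intervals on the sequence.
    """
    return sum(
        any(start - window <= p < end + window for start, end in motifs)
        for p in positions
    )


def enrichment_p_value(hotspots, motifs, seq_len, window=0, n_perm=2000, seed=0):
    """One-sided permutation p-value for hotspots lying near motifs.

    Null model (an assumption): the same number of hotspots placed
    uniformly at random, without replacement, along the sequence.
    """
    rng = random.Random(seed)
    observed = count_near_motifs(hotspots, motifs, window)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        shuffled = rng.sample(range(seq_len), len(hotspots))
        if count_near_motifs(shuffled, motifs, window) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


# Motif-poor toy sequence: three hotspots inside a single 6-bp motif.
sparse = enrichment_p_value([10, 11, 12], [(10, 16)], seq_len=150)

# Motif-saturated toy sequence: every position overlaps a motif.
dense = enrichment_p_value([10, 50, 100], [(0, 150)], seq_len=150)
```

In the motif-saturated case every random placement also lands on a motif, so proximity to motifs carries no evidence. That is the concern with sequences that are inherently rich in box motifs.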

The authors propose mechanisms of promoter activation based on a few observations that are treated independently but occur concurrently. It would be more robust to address this with complementary approaches, such as an analysis that first identifies important motifs (using something like a GLM lasso regression to identify significant motifs) and then combines the result with the mutational hotspot information. Other elements known to be involved in promoter activation, including TGn or UP elements, were not investigated or discussed.

Reviewer #3 (Public Review):

Summary:

Like many papers in the last 5-10 years, this work brings a computational approach to the study of promoters and transcription, but unfortunately disregards or misrepresents much of the existing literature and makes unwarranted claims of novelty. My main concerns with the current paper are outlined below although the problems are deeply embedded.

Strengths:

The data could be useful if interpreted properly, taking into account i) the role of translation ii) other promoter elements, and iii) the relevant literature.

Weaknesses:

(1) Incorrect assumptions and oversimplification of promoters.

- There is a critical error on line 68 and Figure 1A. It is well established that the -35 element consensus is TTGACA but the authors state TTGAAA, which is also the sequence represented by the sequence logo shown and so presumably the PWM used. It is essential that the authors use the correct -35 motif/PWM/consensus.

- Likely, the authors have made this mistake because they have looked at DNA sequence logos generated from promoter alignments anchored by either the position of the -10 element or the transcription start site (TSS), most likely the latter. The distance between the TSS and -10 varies. Fewer than half of E. coli promoters have the optimal 7 bp separation, with distances of 8, 6, and 5 bp not being uncommon (PMID: 35241653). Furthermore, the distance between the -10 and -35 elements is also variable (16, 17, and 18 bp spacings are all frequently found, PMID: 6310517). This means that alignments used to generate sequence logos have misaligned -35 hexamers. Consequently, the true consensus is not represented. If the alignment discrepancies are corrected, the true consensus emerges. This problem seems to permeate the whole study, since this obviously incorrect consensus/motif has been used throughout to identify sequences that resemble -35 hexamers.

- An uninformed person reading this paper would be led to believe that prokaryotic promoters have only two sequence elements: the -10 and -35 hexamers. This is because the authors completely ignore the role of the TG motif, UP element, and spacer region sequence. All of these can compensate for the lack of a strong -35 hexamer and it's known that appending such elements to a lone -10 sequence can create an active promoter (e.g. PMIDs 15118087, 21398630, 12907708, 16626282, 32297955). Very likely, some of the mutations, classified as not corresponding to a -10 or -35 element in Figure 2, target some of these other promoter motifs.

- The model in Figure 4C is highly unlikely. There is no evidence in the literature that RNAP can hang on with one "arm" in this way. In particular, structural work has shown that sequence-specific interactions with the -10 element can only occur after the DNA has been unwound (PMID: 22136875). Further, -10 elements alone, even if a perfect match to the consensus, are non-functional for transcription. This is because RNAP needs to be directed to the -10 by other promoter elements, or transcription factors. Only once correctly positioned, can RNAP stabilise DNA opening and make sequence-specific contacts with the -10 hexamer. This makes the notion that RNAP may interact with the -10 alone, using only domain 2 of sigma, extremely unlikely.
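The point about variable spacing can be made concrete with a small Python sketch that searches for -35/-10 hexamer pairs while allowing 16, 17, or 18 bp spacers. This is a deliberately crude illustration under stated assumptions: it uses mismatch counts against consensus hexamers rather than PWMs, the `max_mismatch` cutoff is arbitrary, the -35 consensus TTGACA is taken from the review, and the -10 consensus TATAAT is the widely cited one rather than something stated in the review. It also ignores the TG motif, UP element, and spacer sequence discussed above.

```python
MINUS_35 = "TTGACA"  # -35 consensus as given in the review
MINUS_10 = "TATAAT"  # widely cited -10 consensus (assumption, not from the review)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_box_pairs(seq, spacers=(16, 17, 18), max_mismatch=2):
    """Find candidate -35/-10 pairs separated by an allowed spacer length.

    Returns tuples (pos_35, pos_10, spacer, mismatches_35, mismatches_10).
    max_mismatch is a hypothetical cutoff chosen for illustration.
    """
    k = len(MINUS_35)
    pairs = []
    for i in range(len(seq) - k + 1):
        m35 = mismatches(seq[i:i + k], MINUS_35)
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + k + spacer
            if j + k > len(seq):
                continue
            m10 = mismatches(seq[j:j + k], MINUS_10)
            if m10 <= max_mismatch:
                pairs.append((i, j, spacer, m35, m10))
    return pairs


# Toy sequence: consensus -35, a 17-bp spacer, then consensus -10.
toy = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GG"
pairs = find_box_pairs(toy)
```

Because the spacer length is searched explicitly, a -35 hexamer is paired with the -10 at whichever allowed distance it occurs, rather than being assumed to sit at one fixed offset.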

(2) Reinventing the language used to describe promoters and binding sites for regulators.

- The authors needlessly complicate the narrative by using non-standard language. For example, on page 1 they define a motif as "a DNA sequence computationally predicted to be compatible with TF binding". They distinguish this from a binding site "because binding sites refer to a location where a TF binds the genome, rather than a DNA sequence". First, these definitions are needlessly complicated; why not just say "putative binding sites" and "known binding sites" respectively? Second, there is an obvious problem with the definitions: many "motifs" will also be "binding sites". In fact, by the time the authors state their definitions, they have already fallen foul of this conflation; in the prior paragraph they stated: "controlled by DNA sequences that encode motifs for TFs to bind". The same issue reappears throughout the paper.

- The authors also use the terms "regulatory" and "non-regulatory" DNA. These terms are not defined by the authors and make little sense. For instance, I assume the authors would describe promoter islands lacking transcriptional activity (itself an incorrect assumption, see below) as non-regulatory. However, as horizontally acquired sections of AT-rich DNA, these will all be bound by H-NS and subject to gene silencing, which affects both promoters for mRNA synthesis and spurious promoters inside genes that create untranslated RNAs. Hence, regulation is occurring.

- Line 63: "In prokaryotes, the primary regulatory sequences are called promoters". Promoters are not generally considered regulatory. Rather, it is adjacent or overlapping sites for TFs that are regulatory. There is a good discussion of the topic here (PMID: 32665585).

(3) The authors ignore the role of translation.

- The authors' assay does not measure promoter activity alone; that can only be tested by measuring the amount of RNA produced. Rather, the assay measures the combined outputs of transcription and translation. If the DNA fragments they have cloned contain promoters with no appropriately positioned Shine-Dalgarno sequence, then the authors will not detect GFP or RFP production, even though the promoter could be making an RNA (likely to be prematurely terminated by Rho, due to a lack of translation). This is known for promoters in promoter islands (e.g. Figure 1 in PMID: 33958766).

- In Figure S6 it appears that there is a strong bias for mutations resulting in RFP expression to be close to the 3' end of the fragment. Very likely, this occurs because this places the promoter closer to RFP and there are fewer opportunities for premature termination by Rho.

(4) Ignoring or misrepresenting the literature.

- As alluded to above, promoter islands are large sections of horizontally acquired, high AT-content DNA. It is well known that such sequences are i) packed with promoters driving the expression of RNAs that aren't translated, ii) silenced, albeit incompletely, by H-NS, and iii) targeted by Rho, which terminates untranslated RNA synthesis (PMIDs: 24449106, 28067866, 18487194). None of this is taken into account anywhere in the paper, and it is highly likely that most, if not all, of the DNA sequences the authors have used contain promoters generating untranslated RNAs.

- The authors state that GC content does not correlate with the emergence of new promoters. It is known that GC content does correlate with the emergence of new promoters because promoters are themselves AT-rich DNA sequences (e.g. see Figure 1 of PMID: 32297955). There are two reasons the authors see no correlation in this work. First, the DNA sequences they have used are already very AT-rich (between 65 % and 78 % AT-content). Second, they have only examined a small range of different AT-content DNA (i.e. between 65 % and 78 %). The effect of AT-content on promoter emergence is most clearly seen at AT-content between around 40 % and 60 %. Above that level, the strong positive correlation plateaus.

- Once these authors better include and connect their results to the previous literature, they can also add some discussion of how previous papers in recent years may have also missed some of this important context.

(5) Lack of information about sequences used and mutations.

- To properly assess the work, any reader will need access to the sequences cloned at the start of the work and to the positions of known TSSs within these sequences (ideally +/- H-NS, which will silence transcription in the chromosomal context but may not when the sequences are removed from their natural context and placed in a plasmid). Without this information, it is impossible to assess the validity of the authors' work.

- The authors do not account for the possibility that DNA sequences in the plasmid, on either side of the cloned DNA fragment, could resemble promoter elements. If this is the case, then mutations in the cloned DNA will create promoters by "pairing up" with the plasmid sequences. There is insufficient information about the DNA sequences cloned, the mutations identified, or the plasmid, to determine if this is the case. It is possible that this also accounts for mutational hotspots described in the paper.

(6) Overselling the conclusions.

Line 420: The paper claims to have generated important new insights into promoters. At the same time, the main conclusion is that "Our study demonstrates that mutations to -10 and -35 boxes motifs are the primary paths to create new promoters and to modulate the activity of existing promoters". This isn't new or unexpected. People have been doing experiments showing this for decades. Of course, mutations that make or destroy promoter elements create and destroy promoters. How could it be any other way?
