1. Chromosomes and Gene Expression
  2. Genetics and Genomics
Download icon

Universally high transcript error rates in bacteria

  1. Weiyi Li
  2. Michael Lynch  Is a corresponding author
  1. Department of Biology, Indiana University, United States
  2. Center for Mechanisms of Evolution, The Biodesign Institute, Arizona State University, United States
Research Article
  • Cited 0
  • Views 1,309
  • Annotations
Cite this article as: eLife 2020;9:e54898 doi: 10.7554/eLife.54898

Abstract

Errors can occur at any level during the replication and transcription of genetic information. Genetic mutations derived mainly from replication errors have been extensively studied. However, fundamental details of transcript errors, such as their rate, molecular spectrum, and functional effects, remain largely unknown. To globally identify transcript errors, we applied an adapted rolling-circle sequencing approach to Escherichia coli, Bacillus subtilis, Agrobacterium tumefaciens, and Mesoplasma florum, revealing transcript-error rates 3 to 4 orders of magnitude higher than the corresponding genetic mutation rates. The majority of detected errors would result in amino-acid changes, if translated. With errors identified from 9929 loci, the molecular spectrum and distribution of errors were uncovered in great detail. A G→A substitution bias was observed in M. florum, which apparently has an error-prone RNA polymerase. Surprisingly, an increased frequency of nonsense errors towards the 3′ end of mRNAs was observed, suggesting a Nonsense-Mediated Decay-like quality-control mechanism in prokaryotes.

eLife digest

Most cells contain molecules of DNA that carry instructions to make the proteins cells need to perform different tasks. When a cell requires a certain protein, the corresponding DNA sequence is first transcribed into molecules of ribonucleic acid (RNA) known as transcripts. These sequences of RNA are then read by the cell and translated into the desired protein sequence.

Errors in copying DNA before a cell divides, can lead to genetic mutations that affect the ability of the cell to carry out certain roles, influencing the overall ‘fitness’ of the cell. Similar to genetic mutations, errors that arise when forming RNA transcripts may also alter the tasks a cell performs. However, it is difficult to find out what kinds of errors cells have in their transcripts and how often these mistakes occur. This is because current methods for sequencing RNA are prone to technical inaccuracies that interfere with the ability to detect true transcript errors.

Now, Li and Lynch have adapted a method for high-throughput sequencing of RNA, which can accurately identify transcript errors in Escherichia coli and other species of bacteria. The experiments showed that errors in RNA molecules occurred more frequently than genetic mutations in the same sequence of DNA. Li and Lynch also found that the transcripts contained more nonsense errors – that is, mutations which prematurely stop transcripts from being translated, resulting in shorter proteins – at the end of the RNA molecule than at the beginning or middle. It is possible that transcripts with errors at the beginning or the middle are more efficiently eliminated than those at the end, suggesting that bacteria have a quality-control mechanism for removing transcripts with premature stop sequences.

These findings suggest that at any one-time cells carry thousands of transcripts with inaccuracies in their sequence, which likely impact the tasks cells perform. The next step will be to investigate how these different transcript errors affect the fitness of cells.

Introduction

Transcript errors refer to any inconsistencies between RNA transcripts and their corresponding genomic loci. They can occur during ribonucleotide (rNTP) incorporations by RNA polymerases and/or via post-transcriptional modifications. Errors on RNA transcripts may directly cause dysfunctions due to the regulatory roles of small RNAs and the fate determination of mRNAs by RNA structural motifs (Strathern et al., 2012). Such errors can also indirectly induce various effects at the protein level. Transcript errors can inactivate proteins and result in a loss-of-function (Gordon et al., 2013). They can also indirectly give rise to misfolded proteins and induce proteotoxic stress (Gout et al., 2017; Vermulst et al., 2015). Errors on RNA transcripts may be causal factors leading to neuron degenerative diseases (van Leeuwen et al., 1998a; van Leeuwen et al., 1998b) and tumorigenesis (Saxowsky et al., 2008). Therefore, transcript errors represent a significant potential mechanism influencing cellular integrity and fitness.

Reporter-construct assays have long been the major approach to evaluating the fidelity of RNA polymerases and identifying transcript errors (Blank et al., 1986; Bubunenko et al., 2017; Nesser et al., 2006; Rosenberger and Foskett, 1981; Rosenberger and Hilton, 1983; Shaw et al., 2002; Springgate and Loeb, 1975; Strathern et al., 2012), but these methods focus only on individual loci and cannot identify errors without phenotypic marker effects. Conventional high-throughput sequencing approaches have been considered to identify transcript errors at a large scale (van Dijk et al., 2015). However, the challenge is to distinguish the real signal of transcript errors from noise produced by technical errors resulting during reverse transcription and sequencing. To circumvent this problem, a rolling-circle amplification-based sequencing (CirSeq) method (Acevedo and Andino, 2014; Acevedo et al., 2014; Lou et al., 2013) was recently proposed and later applied to identify transcript errors in the whole transcriptome of prokaryotes (Traverse and Ochman, 2016). We further modified this protocol to minimize RNA damage potentially introduced during the preparation of sequencing libraries (Gout et al., 2017).

In this study, we applied an adapted CirSeq approach, which has been demonstrated to identify transcript errors accurately and efficiently at a large scale in eukaryotes (Gout et al., 2017), to prokaryotes for the first time. A large number of transcript errors was detected, and transcript-error rates were revealed to be orders of magnitude higher than corresponding genetic mutation rates. Our results indicate that the bias in molecular spectra of transcript errors can be influenced by both RNA polymerases and cellular rNTP concentrations. Furthermore, the spatial distribution of transcript errors on RNAs provides novel insights into the mechanism of RNA quality-control in prokaryotes.

Results

A global view of the transcript error distribution

Applying the adapted CirSeq method (see Materials and methods) to E. coli, B. subtilis, A. tumefaciens, and M. florum, RNA sequencing libraries were made with three biological replicates for each species. Key steps of library preparations involve circularizing RNA fragments and generating cDNAs with tandem repeats by rolling-circle reverse transcription. In this way, transcript errors tend to appear on all repeats of sequencing reads, while sequencing and reverse transcription errors are nearly always revealed as singletons (Figure 1—figure supplement 1). The number of loci where transcript errors were identified from each species ranges from 2006 to 2942, totaling 9929 loci across all species. M. florum showed a per-site error rate of 1.82±0.01 (SEM)×10-5, the highest among the four species (P=0.009, Mann-Whitney U test). The error rates in E. coli, B. subtilis, and A. tumefaciens were 5.84±0.10 (SEM)×10-6, 5.80±0.14 (SEM)×10-6, and 7.26±0.35 (SEM)×10-6, respectively. These error rates are 3 to 4 orders of magnitude higher than the corresponding genomic (DNA-level) mutation rates estimated from mutation-accumulation experiments in these species (Lee et al., 2012; Lynch et al., 2016; Sung et al., 2016; Sung et al., 2015; Sung et al., 2012).

With such a large number of transcript errors identified, a transcriptome-wide view of the error distribution in each species was uncovered. Based on the circular genomes of bacteria (except for A. tumefaciens, which has one circular chromosome, one linear chromosome, and two plasmids [Goodner et al., 2001]), we annotated genomic positions of transcript errors with different potential functional effects and plotted transcript-error rates in 10 kb sliding windows (1 kb for M. florum) (Figure 1). To test whether transcript errors are randomly distributed across different genes, a previously proposed test (Long et al., 2016) was performed to identify genes enriched with transcript errors. For each gene, the expected number of transcript errors was calculated as the product of the average transcriptome-wide error rate per base and the sequencing coverage of the gene. The Poisson probability of observing a number of errors greater than or equal to the observed number was calculated. Out of 607, 495, 586, and 186 genes with detected transcript errors in E. coli, A. tumefaciens, B. subtilis and M. florum, respectively, 1, 4, 0 and 4 genes were revealed to have significantly larger numbers of errors than random expectations (Bonferroni-corrected P values of 0.05, Supplementary file 1, Tables 2-5), suggesting that transcript errors are in general randomly distributed across genes.

Figure 1 with 2 supplements see all
The distribution of transcript errors across the whole transcriptomes of E. coli, B. subtilis, A. tumefaciens, and M. florum.

The first nucleotide of the circular chromosome starts at the 12 o’clock position. For A. tumefaciens, chromosomes and plasmids are arranged from the largest to smallest size in a clockwise orientation. From the outer ring to the inner ring: bacterial chromosomes (dark gray), protein-coding region (grey, black strokes indicate gene densities), synonymous substitutions (blue), missense substitutions (orange), nonsense substitutions (purple) and average transcript-error rates (plots in dark gray) in a 10 kb sliding window with a step size of 1 bp (1 kb windows for M. florum). Windows without sufficient sequencing coverages to detect transcript errors are left blank.

The whole bacterial transcriptome is synthesized by a single type of RNA polymerase. However, RNA products from protein-coding and noncoding RNA (ncRNA) regions undergo distinct co- and post-transcriptional processes. mRNAs are mature upon transcription and ready for translation, while ncRNAs, such as ribosomal RNAs (rRNA) and transfer RNAs (tRNA), need to be further processed to be functional (Cooper, 2000). To evaluate whether transcript-error rates of these two genomic regions are different, we calculated the error rates of protein-coding and ncRNA transcripts by dividing the number of errors by the number of nucleotides assayed in corresponding regions. Transcript-error rates of these two regions are similar in E. coli and A. tumefaciens, but the error rate of ncRNA transcripts is higher than that of protein-coding transcripts in B. subtilis and lower in M. florum (p<0.05, paired t-test) (Figure 2).

Transcript-error rates of protein-coding and ncRNA regions.

cds includes all protein-coding genes that were sequenced in this study. ncRNA refers to RNAs that are functional but not translated into proteins, for example tRNA and rRNA. Transcript-error rates were calculated by dividing the number of errors by the number of nucleotides assayed in corresponding regions. Error bars indicate standard errors. The level of significance difference is indicated by asterisks (*p<0.05, paired t-test).

The molecular spectra of transcript errors are biased to C→U and G→A substitutions

A transition/transversion bias of genetic mutations has been widely observed in different species, with the molecular spectrum mostly dominated by G:C→A:T substitutions (Hershberg and Petrov, 2010; Hildebrand et al., 2010; Lynch, 2010). However, knowledge on the molecular spectrum of transcript errors in prokaryotes remains limited (Imashimizu et al., 2015; Traverse and Ochman, 2016; Traverse and Ochman, 2018). In this study, we calculated the error rate of all twelve categories of substitutions for each species (Figure 3), revealing a general bias of transitions over transversions. This bias has been thought to be driven solely by C→U substitutions (Traverse and Ochman, 2016), which may mainly result from post-transcriptional cytosine deaminations. However, the transition/transversion bias here even holds after C→U substitutions are excluded (P < 0.005, χ2 test, Supplementary file 1, Table 6). This observation indicates that the transcriptional machinery in bacteria, similar to the replication machinery, tends to have a low ability to distinguish rNTPs within the same structural class of nitrogenous bases (Keightley et al., 2009; Kucukyildirim et al., 2016; Lee et al., 2012; Long et al., 2015a; Long et al., 2015b; Lynch, 2007; Lynch, 2010; Lynch et al., 2008; Ossowski et al., 2010; Sung et al., 2015). Of all transitions, the C→U substitution rate is consistently high in all four species. In addition, an unexpectedly high G→A substitution rate is revealed in M. florum, which displayed the highest transcript-error rates among four species in the present study. Intriguingly, this substitution bias was also recently observed in yeast and E. coli transcription-machinery mutants with decreased fidelity (Gout et al., 2017; Imashimizu et al., 2015; Traverse and Ochman, 2018). Thus, the G→A substitution bias may be a signature of error-prone RNA polymerase in both eukaryotes and prokaryotes.

The molecular spectra of transcript errors for four bacterial species.

The conditional error rates of each type of substitutions were calculated from the number of particular transcript errors, divided by the number of corresponding ribonucleotides assayed. Error bars indicate standard errors.

Characterization of transcript errors

To evaluate potential functional effects of transcript errors, we categorized transcript errors within protein-coding regions into synonymous, missense, and nonsense substitutions using SnpEff (Cingolani et al., 2012Table 1). Based on the bias of rNTP substitution rates (Figure 3) and codon usages of each bacterium, we also calculated the expected percentages of each error type under the assumption that transcript errors are randomly generated across the genome without error-correction processes (see Materials and methods, and Supplementary file 1, Table 7). Consistent with observations, the majority of transcript errors are expected to result in amino-acid changes, if translated (Table 1). For nonsense errors, the observed percentages are close to or significantly lower than the random expectation (P < 0.005, χ2 test, Table 1).

Table 1
Percentages of transcript errors in mRNAs that are synonymous, missense, or nonsense (other potential types of transcript errors with small percentages, such as start/stop codon loss-errors, are not shown).

Observed and expected (in parentheses) percentages are presented. Based on the bias of observed rNTP substitution rates and codon usages of each bacterium, expected percentages are calculated assuming a random generation of errors and an absence of error-correction processes. The level of significant difference is indicated by asterisks (*P < 0.05, ** P < 0.005, χ2 test).

SpeciesSynonymousMissenseNonsense
E. coli40.18 (34.35) **56.25 (59.79) *3.57 (5.62) **
B. subtilis32.76 (31.86)61.69 (61.63)5.15 (6.15)
A. tumefaciens40.68 (36.76) *56.36 (59.17)2.96 (3.86)
M. florum17.58 (24.12) **79.27 (70.58) **2.37 (4.85) **

Biased distribution of nonsense errors in RNA transcripts

As shown in Table 1, nonsense errors represent only a small percentage of all errors. However, they are of particular interest because they will result in the formation of a premature termination codon (PTC) and thus truncated proteins if not degraded. To ameliorate the potential severe fitness effects resulting from such errors, eukaryotes have evolved the Nonsense Mediated Decay (NMD) mechanism (Losson and Lacroute, 1979; Maquat, 1995; Peltz et al., 1993) to facilitate the degradation of RNA transcripts carrying PTCs. A key to the success of NMD is distinguishing a PTC from the original stop codon (Amrani et al., 2004; Le Hir et al., 2001), and the ability of the NMD machinery to identify a PTC is thought to diminish as the PTC approaches the 3′ end of a mRNA (Isken and Maquat, 2007). This hypothesis is supported by yeast transcript-error data that show a marked increase in the frequency of PTCs towards the 3′ end of mRNAs (Gout et al., 2017).

Although no analog of the eukaryotic NMD system is known in prokaryotes, a destabilizing effect of PTCs on mRNA stability has been observed in bacteria (Arnold et al., 1998; Braun, 1998; Morse and Yanofsky, 1969; Nilsson et al., 1987). Evaluating the distribution of nonsense errors across the whole length of mRNA transcripts, we observed an increased frequency of nonsense errors at the 3′ end of transcripts, although the trend is not statistically significant in A. tumefaciens (Figure 4A). Compared to other three species, a smaller number of nonsense errors were detected in A. tumefaciens (Supplementary file 1, Table 7), which may result in a low statistical power to reveal a potential pattern for the distribution of nonsense errors. We further modified the analysis by dividing the frequency of nonsense errors by that of all errors. This ratio tends to be higher at the 3′ end of mRNAs (Figure 4—figure supplement 1), excluding the possibility that the enrichment of nonsense errors results mainly from a higher overall transcript-error rate at the 3′ end of mRNAs.

Figure 4 with 2 supplements see all
Nonsense errors in prokaryotic transcripts.

(A) Distributions of nonsense errors across mRNA transcripts. The frequency of nonsense errors is calculated in a 100-nt sliding window with a step size of 1 nt for data visualization. Grey intervals represent standard deviations assuming the number of errors at each locus follows a binomial distribution. Linear regression between the distance to the original stop codon and the frequency of nonsense errors of each window is indicated in dark grey lines. P values were calculated from weighted linear regressions of individual data points before binning into a window. (B) The ribosome-release model for PTCs degradation in prokaryotes. Compared to a late PTC, an early PTC results in a larger portion of ribonucleotides unprotected by ribosomes, and therefore a higher probability of being digested by cellular ribonuclease.

Of all types of genetic codons, those with one nucleotide difference from a stop codon (one-off codons) have a higher probability of mutating into PTCs. We further normalized the frequency of nonsense errors by the abundance of one-off codons at corresponding loci. This still revealed an increased frequency of nonsense errors towards 3′ ends of transcripts (Figure 4—figure supplement 2), suggesting the higher frequency of nonsense errors is not caused by more abundant one-off codons at the 3′ end of transcripts.

The increased frequency of PTCs at the 3′ end of mRNA transcripts suggests the presence of an NMD-like process, albeit by a likely different mechanism than in eukaryotes, which largely rely on the poly-A tail or exon-exon junction complex (Amrani et al., 2004). One speculative model for the degradation of PTCs in eukaryotes, the ribosome-release model (Brogna and Wen, 2009), in which the degradation of RNAs with PTCs depends on the degree of ribosome coverage on RNA molecules, has the potential to hold true in prokaryotes. Ribosomes can load on to nascent transcripts immediately after RNA synthesis. Therefore, a whole transcript with a normal stop codon can be covered by multiple ribosomes towards its 3′ end, with these ribosomes protecting the transcript from degradation by blocking ribonuclease cleavage sites. In contrast, a PTC upstream of the original stop codon will stall the ribosomes, leaving the ribonucleotides between the PTC and the site of the original stop codon unprotected by ribosomes, potentially promoting degradation by cellular ribonucleases (Figure 4B).

Discussion

A key to accurately identifying bona fide transcript errors is to distinguish them from technical errors and low-frequency genetic mutations. With previous efforts on method development to eliminate sequencing errors (Acevedo and Andino, 2014; Acevedo et al., 2014; Lou et al., 2013) and to evaluate the error rate of the reverse transcriptase (Gout et al., 2013), it is now possible to ensure that contributions from such technical errors are orders of magnitudes lower than true transcript-error rates by the CirSeq approach (See Materials and methods). Except for M. florum, transcript-error rates in bacteria estimated by the current study are about one order of magnitude lower than those from a previous study (Traverse and Ochman, 2016). Specifically in E. coli, our error-rate estimates for each type of substitutions tend to be lower than those from Traverse and Ochman (2016), the most striking difference involving the C→U substitution rate, which could be partly due to the use of a metal ion-based RNA fragmentation approach in the previous work vs. enzymatic RNA fragmentation in the present study. The latter minimizes RNA damage (Gout et al., 2017), in particular cytosine deaminations, introduced during the preparation of the sequencing library.

Besides base-substitution errors, a small portion of transcript errors can occur in other forms such as insertions and deletions. Estimates of transcript insertion/deletion (indel) error rates from species in this study are 0.1 to 0.2 of the corresponding base-substitution error rates (Supplementary file 1, Table 1).

Bacterial transcriptomes predominantly consist of ncRNA transcripts, such as rRNAs and tRNAs (Westermann et al., 2012). However, only a small portion of the whole ncRNA transcripts was evaluated in the present study (Supplementary file 1, Table 8) because of technical limitations. The rRNA depletion procedure in the sequencing library preparation protocol removes the majority of rRNAs. Secondary structures and nucleotide modifications of tRNAs interfere the cDNA synthesis and sequencing adapter ligations. In the future, to achieve a better measurement of transcript-error rates of ncRNA transcripts, total RNAs can be mixed with rRNA-depleted RNAs at a certain ratio to increase the abundance of rRNAs in the sequencing library. Demethylase enzymes and thermophilic reversetranscriptase can be used to remove nucleotide modifications of tRNAs and to improve processivity in generating cDNAs from highly structured RNA templates (Schwartz et al., 2018; Zheng et al., 2015).

The molecular spectrum of transcript errors revealed in our work indicates a general C→U substitution bias, which has been proposed to be due to spontaneous deamination (Imashimizu et al., 2013; Traverse and Ochman, 2016) owing to the chemical instability of cytosine (Alberts et al., 2015). Besides this widely accepted mechanism, non-Watson-Crick base pairing during rNTP incorporations may also contribute to this bias. Because dG and rU can form a base pair (Sugimoto et al., 2000; Sugimoto et al., 1997), mispairing between a template DNA (dG) and an RNA (rU) during rNTP incorporations likely also contributes to the C→U substitution bias.

Another intriguing observation from the molecular spectra in the present study is the G→A substitution bias in M. florum. One source for this substitution may be unrepaired uracils on the DNA antisense strand, which pair with rATPs during transcription, resulting in a G→A substitution on the RNA transcript. Although M. florum has a diminutive genome (0.79 Mb) and lacks many genes (RefSeq NC_006055.1), a uracil-DNA glycosylase (UDG) ortholog whose product presumably removes uracils (McCullough et al., 1999) does exist in the genome. Therefore, the extent to which mismatches between the unrepaired uracil and rATP can explain the G→A bias remains unclear.

Taking data from previous studies (Gout et al., 2017; Imashimizu et al., 2015; Traverse and Ochman, 2018) and this work together, G→A substitution bias seems to be a general pattern in cells with error-prone transcription machineries. What might be the underlying mechanism? The error spectrum is shaped by two factors. One is the ability of an RNA polymerase to distinguish correct rNTPs from incorrect ones. The other factor, which is sometimes neglected, is the rNTP pool within a cell. The error rate of competitive binding of rNTPs to the template can be expressed as, (kincorrectCincorrect-rNTPs)/(kcorrectCcorrect-rNTPs), where k refers to the rNTP incorporation rate and C indicates the concentration of rNTPs. As suggested by this equation, a biased cellular rNTP concentration might present an additional challenge to transcriptional fidelity for certain categories of rNTPs. Based on observations that RNA polymerases have a low ability to distinguish rNTPs with the same structural class of nitrogenous bases and that the cellular concentration of rATPs is the highest among all types of nucleotides in both eukaryotes and prokaryotes (Bennett et al., 2009; Buckstein et al., 2008; Traut, 1994), it is reasonable to speculate that the high cellular concentration of rATPs contribute to the observed bias towards G→A substitutions.

An additional cellular process influencing transcript errors is RNA quality-control. Because genes involved in NMD, such as up-frameshift (UPF) genes, have not been identified in prokaryotes, evidence for the existence of NMD in prokaryotes is still lacking. However, previous studies based on single gene-reporters (Baker and Mackie, 2003; Braun, 1998; Nilsson et al., 1987) and our transcriptome-wide survey suggest a Nonsense-Mediated Decay-like quality-control mechanism in prokaryotes. A key implication of the increased frequency of nonsense errors at the 3′ end of mRNAs (Figure 4A) is that the degradation of RNAs carrying nonsense errors may simply result from a higher degree of exposure to cellular ribonucleases rather than from a reliance on specific protein-based systems.

Current models of mRNA surveillance mechanisms mostly focus on stop codon-related errors (Deutscher, 2006; Richards et al., 2008), which are expected to represent only a small portion of the total transcript errors in a cell. It is largely unknown whether, and if so by which mechanisms the major transcript errors (missense errors) get degraded. To resolve this, future research will be required to evaluate the rate at which transcript errors are degraded after initially being generated during transcription. This might be possible by comparing transcript errors on nascent transcripts bound to RNA polymerases with those on mature transcripts associated with ribosomes.

Materials and methods

Bacteria strains and growth conditions

Request a detailed protocol

All bacteria strains were inoculated into liquid culture from single colonies and grew to mid-exponential growth phase upon harvest. E. coli MG1655 and B. subtilis NCIB 3610 were grown at 37°C in LB liquid medium. M. florum L1 (ATCC #33453) was grown at 30°C in SNE liquid medium. A. tumefaciens C58 was grown at 28°C in LB liquid medium.

RNA extraction

Request a detailed protocol

Bacteria were harvested from liquid culture media by centrifugation and total RNA was extracted and purified using the FastRNA Blue Kit (MPBiomedicals), RNase-free DNase set (Qiagen), and the RNeasy Mini Kit (Qiagen). rRNA was depleted by the Ribo-Zero rRNA Removal Kit (Bacteria) (Illumina) for the following library preparations.

Library preparation and sequencing

Request a detailed protocol

We followed a refined protocol of CirSeq (Gout et al., 2017) to prepare libraries for transcript error identifications. Five hundred nanograms of rRNA-depleted RNAs were firstly fragmented with the NEBNext RNase III RNA Fragmentation Module (New England Biolabs) for 90 min at 37°C. After a clean-up using the Oligo Clean and Concentrator kit (Zymo Research), RNA fragments were circularized with RNA ligase 1 (New England Biolabs) according to the manufactuer’s guidelines. cDNA with tandem repeats was generated by the rolling-circle reverse transcription as described in the refined CirSeq protocol. Synthesis of the second strand of cDNA and sequencing library preparation were performed using the NEBNext Ultra RNA Library Prep Kit and NEBNext Multiplex Oligos for Illumina (New England Biolabs). The size selection and clean-up during sequencing library preparations were performed by Agencourt AMPure XP Beads (Beckman Coulter) according the NEB guideline that is optimized for approximately 200nt RNA inserts. A final gel-based size selection was performed to enrich PCR amplified products that are longer than 300nt. Single-end reads (300nt) were then generated using Illumina HiSeq 2500 System. The sequencing data were deposited in NCBI with the BioProject Number PRJNA592142.

Genome references and annotation files

Request a detailed protocol

The accession numbers of genome references for E. coli, B. subtilis, and M. florum are NC_000913.3, NZ_CM000488.1, and NC_006055.1. For A. tumefacien, accession numbers are NC_003062.2, NC_003063.2, NC_003064.2 and NC_003065.3. The corresponding genome annotation files are from RefSeq.

Data analysis

Request a detailed protocol

Several analysis pipelines already existed to process reads with multiple tandem repeats and call transcript errors, but with their own limitations. The CirSeq_v2 pipeline (Acevedo and Andino, 2014; Acevedo et al., 2014) can only analyze reads with exactly three repeats and reads generated by CirSeq approach can contain more than four repeats if the original RNA template is smaller than 75 nt. Another pipeline described in a recent work in yeast (Gout et al., 2017) cannot generate consensus calls and recalculate the quality score from a site where not all base calls are identical. Therefore, we developed Python scripts following the methods outlined by Lou et al. (2013) (Figure 1—figure supplement 2). The structure of repeats within one read was identified by an autocorrelation-based method, in which the length of one potential repeat P is detected by the maximum fraction of identical base calls that are separated by a distance P within one read. The consensus sequence was constructed and the corresponding new quality score was calculated by a Bayesian approach where an inferred consensus call is taken with the maximum posterior probability given all observed base calls. This approach also allows the processing of varied numbers and types of base calls at one site. To identify the ligation junction of circular templates and to reorganize the consensus sequence, a tandem duplicate of the consensus sequence was constructed and then mapped back to the reference genome by BWA (Li and Durbin, 2009). The longest continuous mapped regions of the duplicated consensus sequences therefore correspond to original RNA fragments. We also excluded the 4 nucleotides at both ends of the reorganized consensus sequence to minimize potential confusions, because mapping can be ambiguous at the two ends of RNA fragments. After mapping of reconstructed consensus sequences, reads uniquely mapped to protein-coding regions and all reads mapped to ncRNA regions were kept. Transcript errors were called if a mismatch between a consensus call and the reference was supported by less than 1% of reads at corresponding loci. To exclude false positives of transcript errors from genetic mutations in multiple copies of ncRNA genes (such as rRNA and tRNA genes), an additional filter was included to exclude an error call that is supported by genetic variations from different copies of ncRNA genes. The transcript error rate of a given region was calculated as the number of transcript errors divided by the total number of rNTPs assayed from the corresponding region. The code for the bioinformatic pipeline can be found at https://github.com/LynchLab/CirSeq4TranscriptErrors (Li, 2020; copy archived at https://github.com/elifesciences-publications/CirSeq4TranscriptErrors).

Strategies to distinguish transcript errors from other types of errors

Request a detailed protocol

First, reverse transcription and sequencing errors need to be filtered out in the analysis. Because the rate of transcript error is generally 10-6 ∼ 10-5 /nt , the recalculated probability of an erroneous base call at 10-7 or lower was required to minimize contaminations from sequencing errors. Because the error rate of the reverse transcriptase used here is ∼10-4 /nt (Gout et al., 2013), at least two tandem repeats were required in the analysis to minimize false positives from reverse transcription errors.

Second, genetic mutations (DNA level) can arise during cell culture and low frequency mutations can behave like transcript errors in the sequencing data. The probability of capturing a genetic mutation can be calculated by dividing the expected number of genetic mutations generated during cell propagations by the total transcriptome size at the time point of sample collection, μgTnTn, in which µ is the per site per generation mutation rate, g is the number of generations during cell culture, T is the size of genome regions get transcribed, and n is the average expression level per site. This equation can be further simplified as µg. Because we know the mutation rate from mutation accumulation experiments (Lee et al., 2012; Lynch et al., 2016; Sung et al., 2016; Sung et al., 2015; Sung et al., 2012) and the number of generations from culture-growth dynamics (~30 generations), Low frequency genetic mutation can only inflate the transcript-error rate we calculated here by ~1‰ -1%.

To calculate the expected percentages of transcript errors with different effects

Request a detailed protocol

Take the calculation for synonymous substitution as one example. The percentage can be calculated by summing the probabilities of each codon to have a synonymous change, P(syn)=i=164PiPi(syn). Pi refers to the probability of having codon i based on the codon usage of a specific genome and there are 64 codons in total. Pi(syn) is the probability that codon i has a synonymous substitution and it can be calculated from, Pisyn=j=19μj1j results in syn. μj denotes the substitution probability of 1 of the 9 single-base substitutions that can happen in one codon. And it can be calculated by, μj=ejj=19ej, in which ej refers to the error rate of 1 of the 9 substitutions in one codon. Estimates of ej are displayed in Figure 3.

The sliding window analysis and weighted linear regression to evaluate the distribution of nonsense errors on mRNA transcripts

Request a detailed protocol

The sliding window analysis (window size = 100nt and step size = 1nt) of the distribution of nonsense errors across mRNAs was used for data visualization. To evaluate whether or not the negative correlation between the frequency of nonsense errors and the corresponding distance from a nonsense error to the original stop codon is statistically significant, a weighted linear regression method was used. The weight was calculated as the reciprocal of a variance of a nonsense error frequency. Because the observed number of transcript errors at each locus is expected to follow a binomial distribution, the variance of the nonsense error frequency can be estimated as p(1-p)n, where p is the estimated frequency of errors and n refers to the read coverage at the corresponding locus.

References

  1. 1
  2. 2
  3. 3
    Molecular Biology of the Cell
    1. B Alberts
    2. A Johnson
    3. J Lewis
    4. D Morgan
    5. M Raff
    6. K Roberts
    7. P Walter
    (2015)
    Garland Science.
  4. 4
  5. 5
    mRNA stabilization by the ompA 5' untranslated region: two protective elements hinder distinct pathways for mRNA degradation
    1. TE Arnold
    2. J Yu
    3. JG Belasco
    (1998)
    RNA 4:319–330.
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
    The Cell: A Molecular Approach
    1. GM Cooper
    (2000)
    ASM Press.
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
    The Origins of Genome Architecture
    1. M Lynch
    (2007)
    Sinauer Associates.
  37. 37
  38. 38
  39. 39
  40. 40
    When cells stop making sense: effects of nonsense codons on RNA metabolism in vertebrate cells
    1. LE Maquat
    (1995)
    RNA 1:453–465.
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
    Stabilities of internal rU-dG and rG-dT pairs in RNA/DNA hybrids
    1. N Sugimoto
    2. I Yasumatsu
    3. M Fujimoto
    (1997)
    Nucleic Acids Symposium Series 37:199–200.
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68

Decision letter

  1. Christian R Landry
    Reviewing Editor; Université Laval, Canada
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States
  3. Joanna Masel
    Reviewer; University of Arizona, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

DNA replication errors are one of the engines of evolution. However, how errors accumulate at the different steps of the central dogma of molecular biology has received limited attention. Investigating how errors in transcription and translation affect cell biology first requires that we measure their rates and that we examine how they affect various species. In this study, the authors use a recently developed technology to measure transcription error rates in four species of bacteria. Their results reveal that transcription error rates are higher than mutation rates, that many of these errors lead to amino acid changes if translated and that there may be a mechanism in prokaryotes for the quality control of mRNAs. These results open new research avenues on how cells may control the propagation of errors along the steps of the central dogma and why such error rates may be tolerated by evolution.

Decision letter after peer review:

Thank you for submitting your work entitled "Universally high transcript error rates in bacteria" for consideration by eLife. Your article has been reviewed by a Senior Editor, a Reviewing Editor, and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Joanna Masel (Reviewer #3).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife. But please keep in mind that a substantially changed version could be re-submitted as a new paper.

The reviewers agree that the questions addressed are interesting and important and consider that it is possible to improve the manuscript so that it meets their criticisms. They found particularly interesting the proposed mechanism based on premature termination but found that it was not fully supported by the analyses and may require actual lab experiments. They also identified several issues with the presentation of the data and some of the analyses. In light of this, we cannot consider the manuscript further under its present form. Their complete comments appear below.

Reviewer #1:

The manuscript submitted by Li and Lynch is interested in transcript error rates in bacteria, which were already known to be orders of magnitude more frequent than replication errors. The authors developed a slightly modified version of the previously published CirSeq approach, which produces tandem copies of transcribed fragments from circularized RNA molecules, to allow a more accurate detection of errors on a genome-wide scale. The authors used this method to quantify mutations in transcripts from four representative bacteria, and showed evidence suggesting that the M. florum RNA polymerase is error-prone, with a strong G to A substitution bias compared to the other bacteria. The authors showed data suggesting that nonsense mutations tend to accumulate at the 3' end of transcripts and proposed a model to explain how transcripts containing a premature termination codon could be recognized and degraded by the cell.

Overall, I found the topic interesting but the results rely on a single type of evidence (CirSeq) and are mostly descriptive. Additional experiments would be required in my opinion to fully support the model proposed by the authors. For example, it would be interesting to introduce a premature termination codon at different positions in a reporter gene such as lacZ and evaluate if such mutated transcripts have a shorter half-life compared to the wild-type. I am still debating whether this is essential for this manuscript or not. I also noted a few suggestions or elements that should be clarified:

- It would be useful for readers to include a more detailed description of CirSeq as the manuscript heavily depends on this technique. The manuscript was mentioning reference work but a short explanation would help.

- The average mutation rate of reverse transcriptase is ~1E-4. Since this rate is relatively high, the authors only considered mutations with at least two tandem repetitions in the CirSeq data. Were the reverse transcription errors observed by Gout et al., (2013) evenly distributed across the transcripts (hotspots), and were certain mutations types more predominant than others? If such biases exist, would it still be possible to confidently discriminate between reverse transcription errors versus transcription errors? Can the authors comment on that?

- In Figure 4 and Figure S1 to Figure S3, how were the replicates processed? Were they combined before the sliding window analysis or after? Would it be relevant to show the error frequency calculated for each replicate to see the inter-replicate variability? The Pearson's correlations coefficients should also be present on the figures, especially for Figure S1 as this comparison is very important for the conclusions. My first impression was that Pearson correlation coefficient from Figure S1 (all transcript errors) would be similar to those reported for Figure 4 (nonsense errors) but the comparison is not possible in this version of the manuscript. Is there a reason to not show the Pearson's correlation coefficient on Figure S1 but show them in Figure 4 and Figure S2- Figure S3? From what I understand in subsection “Biased distribution of nonsense errors in RNA transcripts”, the correlation coefficients calculated on data presented in Figure S1 should not be significant. Is that the case?

Reviewer #2:

The manuscript by Li and Lynch reports the mutational spectrum of transcription in 4 bacterial species. The objective, results and discussion are clear and easy to read. However, I found that the methodological descriptions often lack important technicalities that are needed (at least to me) to fully appreciate what the authors have done exactly. I will first raise few general points and then provide a list of missing methodological pieces of info. I have no expertise in the molecular biology part, so I won't comment on it.

- Subsection “A global view of the transcript error distribution”: If the authors used a Bonferroni correction for multiple testing, they should not observe 5% of false positives, but expect 0.05 false positives, that is well below 1. Typically they should observe 0 positive if H0 is correct. If they really observe 5%-ish (as it is implied in the text), they do have a lot of significant genes for which transcripts are enriched in mutations.

- Subsection “Biased distribution of nonsense errors in RNA transcripts”: the authors report an overrepresentation of non-sense mutations near the 3' of the transcript. This is interpreted as a sign of a potential NMD in bacteria. I indeed think this is indeed a possibility. Alternatively, it could also be the case that simply there are more mutations in the 3' end of transcript in general. What is the pattern for missense and synonymous mutations regarding their localization in the genes?

- Subsection “Biased distribution of nonsense errors in RNA transcripts”: I am not sure of what the added value of this measure of relative gene length. Can you get rid of this paragraph and plot?

- Did the authors attempted to look at indels? I think 'long enough' SSRs will also present SSRs copy number variations.

- for Abstract/Introduction: errors can also accumulate without replication nor transcription (see some recent papers on non-replicative/quiescent errors)

As mentioned earlier, many methodological pieces of information are missing. Here is a list of some of them. Consider that the readers must have *everything* to understand and eventually redo the experiments. In this current version, most treatments are opaque.

- Subsection “A global view of the transcript error distribution”: how did you normalized to "the same level"?

- Subsection “Characterization of transcript errors”: sure, ~2/3 of mutations are non-synonymous. So this does not come as a surprise.

- Subsection “Characterization of transcript errors”: "close to" means "close to significance"?

- Subsection “Data analysis”. Can you provide insights on what is the Lou et al., method? What is the autocorrelation method? Where is the python code available, so readers can have a chance to understand what you did? What is the Bayesian approach you mentioned using? What parameters of BWA did you used?

- Subsection “Strategies to distinguish transcript errors from other types of errors”: what is the equation? (you suggested simplifying by \mu g, but I don't see an equation)?

- Subsection “To calculate the expected percentages of transcript errors with different effects”: I suspect these are not probabilities but counts turned into frequencies, right? So at best, probabilities estimates. \mu_i are the relative mutation rates? More generally, I am not sure what did you use this whole calculation (in this paragraph) for?

- Figure 3 legend: what are 'conditional' error rates?

- Table1: How exactly did you compute your p-values? Did you take the spectrum into account?

- Figure 4: Why do you have the x-axis in reverse order? How did you normalize in 0-1.

In conclusion, I believe this manuscript has potential but should be revised with great care before becoming a decent published article.

Reviewer #3:This manuscript measures the single nucleotide transcription error rate and spectrum in four bacterial species. The primary findings are that E. coli errors are an order of magnitude less common than previously reported by Traverse and Ochman, that M. florum has a strikingly high G->A substitution bias, and that nonsense errors are depleted toward the 3' end of mRNA in a manner compatible with previously hypothesized NMD-like quality control in prokaryotes. Overall, a potential obstacle to publication in ELife is the incremental nature of these findings – the methods are not greatly advanced from earlier work including from the same group, this is not the first evidence for NMD-like quality control, and there is speculation about but not proof for the reasons for the high G->A bias and the discrepancy with previous E. coli measurements. I have a number of concerns, especially about the statistics, but correcting them is unlikely to reverse the major findings.

Probably the biggest issue is the contrast between Figure 4 and Figure S3, used to help infer the mechanism of action of the NMD-like system. Frequencies are a kind of binned data, and it is inappropriate to report correlation coefficients from binned data: see eg https://statmodeling.stat.columbia.edu/2016/06/17/29400/ or https://serialmentor.com/blog/2013/8/18/common-errors-in-statistical-analyses for discussions of this point. It is definitely not appropriate to claim that the Figure 4 model is better than the Figure S3 model (subsection “Biased distribution of nonsense errors in RNA transcripts”) because the correlation coefficients are larger, especially because the binning is clearly different in the two cases, eg. with far more zeros in Figure S3.

For these and all similar figures, there should be error bars on each dot from sampling error: for a binomial the error is sqrt(p*(1-p)/n). I was unable to figure out what exactly the normalization to scale the y-axis between 0 to 1 entailed – the fact that no code was made available (despite the ELife reporting form instructions to "Include code used for data analysis") meant that I couldn't compensate for thinly described methods. I see in any case no justification for normalization; it would be better to just give the actual numbers on the y-axis, while keeping the "zoom" the same for visualization purposes.

The best statistical approach is generally to work with the raw data, which in this case is a vast dataset of 0s (no error) and 1s (error). The number of (rare) errors meeting given criteria follows a Poisson distribution, and a generalized linear model can be used to model this error function. This is what we did for transcription errors in https://www.biorxiv.org/content/10.1101/554329v1, and it leads to vastly greater power to discern the kinds of trends hypothesized in Figure 4 and Figure S3. I realize that, as well as some conflict in asking for use and thus citation of my own work, there may be concerns citing a preprint, but the manuscript is now accepted pending minor revisions in GBE, and citable as such. I would really like to know eg whether M. florum really does have a different shape to E. coli in Figure 4 as it appears to, and more sophisticated statistical models are required to answer questions of this sort. Our accepted manuscript provides such models, with code fully available on github.

Our preprint re-analyzes the data of Traverse and Ochman, and finds that the (non-C->U) error rate depends on protein abundance. The Discussion section is the only place in the paper that attempts to say why the currently observed non-C->U error rate is an order of magnitude lower than that previously observed by Traverse and Ochman, attributing all the discrepancy, especially but not limited to cytosine deamination, to RNA damage during library preparation by Traverse and Ochman. This is problematic in the light of our finding of systematic differences that are not expected to be caused by library preparation problems. I don't know why the non-C->U error rates are so different between the studies either, but it clearly isn't all cytosine deamination, nor all library preparation, and it would be useful to acknowledge this puzzle and comment on other possibilities, e.g. differences in strain or experimental condition.

Note that the per-gene method to find outliers described in subsection “A global view of the transcript error distribution” is much lower powered than our test for dependence on protein abundance, and so in no way rules out the variation among genes that we discovered. And if the aim is to detect mutations or programmed errors, it is better to do so per-site than per-gene, as we also did to also find that such problems were rare.

In general, the statistics used in this manuscript assume that sampling error is negligible and that all variance is therefore attributed to biological replication. This underlies the use of t-tests and paired t-tests throughout. However, in tables like Supplementary file 5 and Supplementary file 6, neither error bars nor denominator is given, so I am unable to verify that sqrt(p*(1-p)/n) is negligible. If it is not negligible, then all these t-tests should be weighted rather than the current use of unweighted tests, or better still, a more advanced generalized linear model with a Poisson error term (see above) that simultaneously accounts for all sources of variance should be used instead of t-tests. The problems with the statistical tests used primarily create lower power, and so the positive findings should all hold, but more problematic is that the supplementary data files lack the information needed to allow future reanalysis of the data using better methods.

Another statistical problem is in calculating expected percentages (subsection “To calculate the expected percentages of transcript errors with different effects”). The equation used assumes that while the error rates of the 3 sites within the codon might be different, every codon has an equal error rate. The much more logical alternative would be to assume site-specific error rates that do not depend on which codon the site is found in. If our finding that error rates depend on protein abundance is also true, this will contribute to misleading results even after this correction is made. We found that the difference between error rates at synonymous vs. non-synonymous sites was entirely attributable to the dependence of codon bias strength on protein abundance, i.e. that variation in error rate was at the gene level not the codon level.

Figure 2 legend refers to the analysis only of genes with detected transcript errors. This would seem to create a variety of ascertainment biases, inflating the error rate overall as zeros are neglected, and preferentially neglecting potentially large numbers of genes each of low expression.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Universally high transcript error rates in bacteria" for consideration by eLife. Your article has been reviewed by Patricia Wittkopp as the Senior Editor, a Reviewing Editor, and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Joanna Masel (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

In this manuscript, Li and Lynch measure the rate of transcription errors in a few prokaryotes. This is a revised version of a previous manuscript. The reviewers appreciate the importance of the work and the changes that have been made to the initial version. However, there are still some concerns about the statistical analyses and interpretation of the data. The comments are summarized below.

Essential revisions:

Data analysis and interpretation:

1) They uncovered almost the full spectrum of transcription errors, but somehow omitting indel errors from their data set. As is, this work will serve as a method reference for people interested in transcription errors. Based on their spectrum analysis, the authors speculate about different mechanisms by which these errors can be made. In order to be complete, the authors should analyze and report indels from their existing data.

2) Concerning the model to explain the biased distribution of nonsense errors in mRNA, they proposed an NMD mechanism no yet characterized in bacteria, but this can easily be explained by transcription polarity described first by Morse and Yanofsky, (1969).

3) In order to make the sequencing method user-friendly, I would strongly suggest that the authors include a flowchart for the analysis of the sequencing data as they have done for the technical concept in Figure 1—figure supplement 1.

Statistical analyses:

4) While claims based on comparing the magnitudes of correlation coefficients of different binned data sets has now been removed, correlation coefficients for data that is not just binned (frequency data), but also non-independent (sliding windows) continue to be reported in Figure 4, Figure S2, and Figure S3. Corresponding regression models on these non-independent sliding window datapoints are inappropriately used to generate p-values. An alternative and somewhat less flawed approach is given only in the response to reviews document and not in the manuscript itself. This is presented as being the method of Meer et al., "with minor modifications", although they are not that minor, as they sacrifice the main advantage of that method. A good statistical model should have a good model of the error function. Count data has a binomial error function, well approximated as a Poisson. The advantage of the method of Meer et al., is that it exactly models this Poisson, but the method is dismissed in the response to reviews because it explicitly accounts for the fact that the raw count data depends on the number of reads, as though this were a bad thing. The alternative presented in the response to reviews does avoid the non-independence problem of the sliding window, but it still assumes homoscedasticity with respect to #errors/#reads, an assumption that is violated in practice given many more reads at some sites than others. If the authors want to model Equation (1) instead of Equation (2) (as numbered in the response to reviews), I would find this acceptable if and only if they (a) use a weighted regression to account for the known heteroscedasticity (ie at least model the magnitude of the error even if they don't model its shape), and (b) place this analysis in the manuscript itself as a replacement for doing a regression on the output of non-independent sliding-window-bins. The sliding window is perfectly appropriate and indeed very helpful as a visualization of the data, but inappropriate for the fitting and testing of statistical models.

5) I take issue with the claim in subsection “A global view of the transcript error distribution” that transcript errors are randomly distributed across genes. After a Bonferroni correction for the number of genes, the manuscript has very little power with which to make such a claim of evidence of absence. What is more, Meer et al., showed (using a different E. coli dataset) that the error rate does depend on the strength of selection on the gene in question, as assessed by its protein abundance. This analysis had much higher power than the current study, firstly because there was only one degree of freedom and hence no need for a radical Bonferroni correction, and secondly because explicit modeling of the error function as described above will generally increase power. Given that (uncited) evidence that contradicts the manuscript's claim for randomness has already been published, and the lower power of the current manuscript's basis for making this claim, a claim of randomness should not be made unless the authors are able to apply the same or similarly high-powered method and directly show contradictory results. Should the previously published claim non-randomness be supported, this would have implications for the authors' model of randomly generated transcript error, eg subsection “Characterization of transcript errors”. In this passage too, the statistics are problematic. Observed and expected "data" are compared using a paired t-test, as though the expectations were themselves data with error magnitudes homoscedastic with those of the observed data.

6) I am also confused by the bootstrapped standard deviation in Figure 4, Figure S2, and Figure S3. In Figure 4, why bootstrap rather than use the known formula for a binomial? In Figure S2 and Figure S3, why bootstrap only the numerator and not the denominator of the quantity whose error is being assessed?

7) I still fear that the casual reader will take away from the Discussion section that differences between the two studies in error rates other than C->U are trivial in magnitude, which is not the case.

https://doi.org/10.7554/eLife.54898.sa1

Author response

In this revision, we have addressed all the reviewers’ comments. In response to the major comment and to better support the Nonsense Mediated Decay-like mechanism in prokaryotes, refined analysis methods and statistical tests have been provided to evaluate the distribution of nonsense errors across mRNAs. Extensive empirical evidence has also been cited to support the destabilizing effect of nonsense errors to prokaryotic mRNAs.

Reviewer #1:

[…] Overall, I found the topic interesting but the results rely on a single type of evidence (CirSeq) and are mostly descriptive. Additional experiments would be required in my opinion to fully support the model proposed by the authors. For example, it would be interesting to introduce a premature termination codon at different positions in a reporter gene such as lacZ and evaluate if such mutated transcripts have a shorter half-life compared to the wild-type. I am still debating whether this is essential for this manuscript or not. I also noted a few suggestions or elements that should be clarified:

It would certainly be interesting to do this experiment, but one would also have to be cautious to interpret results from these reporter-construct assays because there may be gene-specific peculiarities. In contrast, nonsense errors detected by CirSeq are from a transcriptome-wide study. It is hard to see how CirSeq could cause the bias observed in Figure 4.

However, experiments similar to that suggested by the reviewer have actually been done previously, with nonsense mutations introduced to different positions of reporters, such as bla, rpsO, and rpsT genes. The dynamics of mRNA degradation of these mutants are fully consistent with our results, therefore providing further support for the ribosome-release model. Please see:

Figure 2A of https://www.ncbi.nlm.nih.gov/pubmed/2440033

Figure 3 of https://www.ncbi.nlm.nih.gov/pubmed/9707438

Figure 3CD, Table 1. Entry 14, 15 of https://onlinelibrary.wiley.com/doi/full/10.1046/j.1365-2958.2003.03292.x

All these results are now cited in the Discussion of the manuscript.

- It would be useful for readers to include a more detailed description of CirSeq as the manuscript heavily depends on this technique. The manuscript was mentioning reference work but a short explanation would help.

Thanks for this suggestion. We have included a short description on crucial steps in making the sequencing library. We also include a flowchart as Figure 1—figure supplement 1 to illustrate how this approach provides an accurate detection of transcript errors.

- The average mutation rate of reverse transcriptase is ~1E-4. Since this rate is relatively high, the authors only considered mutations with at least two tandem repetitions in the CirSeq data. Were the reverse transcription errors observed by Gout et al., (2013) evenly distributed across the transcripts (hotspots), and were certain mutations types more predominant than others? If such biases exist, would it still be possible to confidently discriminate between reverse transcription errors versus transcription errors? Can the authors comment on that?

Indeed, as shown in Figure 1—figure supplement 1 of Gout et al., (2013), some types of reverse transcription (RT) errors are more predominant than others. However, the highest RT error rate is < 1.5x10-4, so the probability of observing the same RT error in two tandem repeats and misidentifying it as a transcript error is < 2.25x10-8. Because the transcript-error rates that we observe are on the order of 10-6 to 10-5, RT errors cannot contribute more than ~2% of detected transcript errors by including at least two tandem repeats of CirSeq reads.

In summary, whereas in principle it might be possible to finesse a maximum-likelihood procedure to factor in RT errors in a full data set, this would be error-prone itself, only introducing another source of sampling variance in the final analysis, and given the points made above, we are confident that RT errors are not contributing to our estimates in more than a very minor way.

-In Figure 4 and Figure S1 to Figure S3, how were the replicates processed? Were they combined before the sliding window analysis or after? Would it be relevant to show the error frequency calculated for each replicate to see the inter-replicate variability? The Pearson's correlations coefficients should also be present on the figures, especially for Figure S1 as this comparison is very important for the conclusions. My first impression was that Pearson correlation coefficient from Figure S1 (all transcript errors) would be similar to those reported for Figure 4 (nonsense errors) but the comparison is not possible in this version of the manuscript. Is there a reason to not show the Pearson's correlation coefficient on Figure S1 but show them in Figure 4 and Figure S2- Figure S3? From what I understand in subsection “Biased distribution of nonsense errors in RNA transcripts”, the correlation coefficients calculated on data presented in Figure S1 should not be significant. Is that the case?

We apologize for the confusion. In the sliding-window analysis as shown in Figure 4, we first combined the data from three biological replicates for one species because the number of nonsense errors detected from each replicate is small.

In our previous study in yeast (https://advances.sciencemag.org/content/3/10/e1701484, Figure 4F), we found that the expected pattern (enrichment of PTCs at 3’ end of RNAs) is only observable with a large number of nonsense errors. (The pattern is not shown in WT but shown in transcriptional fidelity mutants).

The number of nonsense errors of each biological replicate ranges from 10 to 50, which is small.

Therefore, we merged the data from biological replicates before the sliding-window analysis to have better power to reveal the overall pattern. We have also included statistics such as regression coefficients and sampling errors in Figure 4.

Figure S1 has now been updated as Figure S2, in which the ratio of nonsense error frequency to total error frequency is plotted. The higher ratio at the 3’ end excludes the possibility that the enrichment of nonsense errors results mainly from a higher overall transcript-error rate at the 3′ end of mRNAs.

Reviewer #2:

The manuscript by Li and Lynch reports the mutational spectrum of transcription in 4 bacterial species. The objective, results and discussion are clear and easy to read. However, I found that the methodological descriptions often lack important technicalities that are needed (at least to me) to fully appreciate what the authors have done exactly. I will first raise few general points and then provide a list of missing methodological pieces of info. I have no expertise in the molecular biology part, so I won't comment on it.

To minimize the confusion on the refined CirSeq method, we have included more descriptions and added a flowchart as Figure 1—figure supplement 1 to clarify this approach.

- Subsection “A global view of the transcript error distribution”: If the authors used a Bonferroni correction for multiple testing, they should not observe 5% of false positives, but expect 0.05 false positives, that is well below 1. Typically they should observe 0 positive if H0 is correct. If they really observe 5%-ish (as it is implied in the text), they do have a lot of significant genes for which transcripts are enriched in mutations.

We have modified the analysis, first identifying genes enriched with transcript errors from each biological replicate. We define a gene as a hotspot if it is supported by at least two biological replicates. As shown in Supplementary file 1, Supplementary file 2, Supplementary file 3 and Supplementary file 4, there are just a few hotspot genes (ranging from 0-4 in each species), and transcript errors are in general randomly distributed across genes.

- Subsection “Biased distribution of nonsense errors in RNA transcripts”: the authors report an overrepresentation of non-sense mutations near the 3' of the transcript. This is interpreted as a sign of a potential NMD in bacteria. I indeed think this is indeed a possibility. Alternatively, it could also be the case that simply there are more mutations in the 3' end of transcript in general. What is the pattern for missense and synonymous mutations regarding their localization in the genes?

To exclude the alternative hypothesis proposed here, we have further divided the window-specific PTC frequencies by the overall error frequencies and evaluated the distribution of this resultant value. Figure S2 has been updated with this result and suggests that the enrichment of PTCs at the 3’ end of RNAs cannot be simply explained by more transcript errors at 3’ end than other regions.

- Subsection “Biased distribution of nonsense errors in RNA transcripts”: I am not sure of what the added value of this measure of relative gene length. Can you get rid of this paragraph and plot?

We did this analysis because we intended to test whether the efficiency of degradation depends on the absolute size or the portion of the ribonucleotides of one transcript that are uncovered by ribosomes.

As we revised the manuscript, we noticed that most bins in Figure S3 of the last version contained zeros and the sample size of nonsense errors in the present study does not allow us to compare the above hypotheses. Therefore, we have removed Figure S3 and the corresponding paragraph.

- Did the authors attempted to look at indels? I think 'long enough' SSRs will also present SSRs copy number variations.

We haven’t evaluated indel errors because an algorithm that is accurate and efficient to reconstruct consensus sequences from tandem repeats with potential indels is still not available. We agree with the reviewer that copy number variations of SSRs may happen at the transcriptome level, and that this piece of information may shed light on the mechanism of the slippage of RNA polymerases. More extensive analysis on indels will be a future direction of research.

- for Abstract/Introduction: errors can also accumulate without replication nor transcription (see some recent papers on non-replicative/quiescent errors)

We agree. Genetic mutations can be generated by non-replicative mechanisms such as DNA damage and DNA repair errors. We did not intend to review all mechanisms of errors, and our only point here is to draw attention from replication to transcription.

As mentioned earlier, many methodological pieces of information are missing. Here is a list of some of them. Consider that the readers must have *everything* to understand and eventually redo the experiments. In this current version, most treatments are opaque.

- Subsection “A global view of the transcript error distribution”: how did you normalized to "the same level"?

We have updated details on how transcript-error rates were calculated in the legend of Figure 2.

- Subsection “Characterization of transcript errors”: sure, ~2/3 of mutations are non-synonymous. So this does not come as a surprise.

This piece of result is not surprising, but essential to provide a full picture of the potential functional categories of transcript errors.

- Subsection “Characterization of transcript errors”: "close to" means "close to significance"?

“close to” means the observed values are close to random expectations.

- Subsection “Data analysis”. Can you provide insights on what is the Lou et al., method? What is the autocorrelation method? Where is the python code available, so readers can have a chance to understand what you did? What is the Bayesian approach you mentioned using? What parameters of BWA did you used?

The autocorrelation method meant to identify the structure of tandem repeats and the Bayesian approach to recalculate quality scores of consensus reads are proposed in Lou et al. We have posted these python scripts on Lynch-lab Github site.

The BWA cmd that we used is the following:

bwa mem reference sample.fastq > sample.sam

There are multiple filters, so as to keep uniquely mapped reads and to keep high quality reads, after this BWA mapping step. We also posted this analysis pipeline on Github.

- Subsection “Strategies to distinguish transcript errors from other types of errors”: What is the equation? (you suggested simplifying by \mu g, but I don't see an equation)?

We apologize for the confusion. We have updated with the original equation and explained how to derive the simplified equation.

- Subsection “To calculate the expected percentages of transcript errors with different effects”: I suspect these are not probabilities but counts turned into frequencies, right? So at best, probabilities estimates. \mu_i are the relative mutation rates? More generally, I am not sure what did you use this whole calculation (in this paragraph) for?

The reviewer is correct. Pi is the estimate of probabilities based on frequencies. The μj can be understood as relative substitution rates of each of the 9 base substitutions in a codon. Equations here are used to calculate the expected percentages shown in Table. 1.

- Figure 3 legend: what are 'conditional' error rates?

Conditional error rates refer to error rates based on four different kinds of rNTPs. The way to calculate the conditional error rates has been explained in the legend of Figure 3.

- Table1: How exactly did you compute your p-values? Did you take the spectrum into account?

Yes, we have considered the bias in the molecular spectra of transcript errors. The equation to calculate the expected percentage can be found in the Materials and methods section. P values are from paired t-tests. Details are shown in Supplementary file 6.

- Figure4: Why do you have the x-axis in reverse order? How did you normalize in 0-1.

We have the x-axis in the reverse order to make it in a 5’ to 3’ orientation.

In the previous version of Figure 4, first we calculated PTC frequencies of corresponding sliding windows. Second, we simply rescaled these values from 0-1 by (value-min)/(max-min). However, to make the y-axis more straightforward, we don’t do the rescale now.

In conclusion, I believe this manuscript has potential but should be revised with great care before becoming a decent published article.

Reviewer #3:

This manuscript measures the single nucleotide transcription error rate and spectrum in four bacterial species. The primary findings are that E. coli errors are an order of magnitude less common than previously reported by Traverse and Ochman, that M. florum has a strikingly high G->A substitution bias, and that nonsense errors are depleted toward the 3' end of mRNA in a manner compatible with previously hypothesized NMD-like quality control in prokaryotes. Overall, a potential obstacle to publication in ELife is the incremental nature of these findings – the methods are not greatly advanced from earlier work including from the same group, this is not the first evidence for NMD-like quality control, and there is speculation about but not proof for the reasons for the high G->A bias and the discrepancy with previous E. coli measurements. I have a number of concerns, especially about the statistics, but correcting them is unlikely to reverse the major findings.

Probably the biggest issue is the contrast between Figure 4 and Figure S3, used to help infer the mechanism of action of the NMD-like system. Frequencies are a kind of binned data, and it is inappropriate to report correlation coefficients from binned data: see eg https://statmodeling.stat.columbia.edu/2016/06/17/29400/ or https://serialmentor.com/blog/2013/8/18/common-errors-in-statistical-analyses for discussions of this point. It is definitely not appropriate to claim that the Figure 4 model is better than the Figure S3 model (subsection “Biased distribution of nonsense errors in RNA transcripts”) because the correlation coefficients are larger, especially because the binning is clearly different in the two cases, eg. with far more zeros in Figure S3.

We agree with the reviewer that the correlation coefficient from binned data should be avoided in evaluating the correlation between the frequency of nonsense errors and the distance between the nonsense error and the position of the original stop codon in a single base-resolution.

We did the sliding-window (99% overlap of neighboring windows) analysis in the present study for the following reasons:

1) Considering the sample size of nonsense errors (n=59-108) in each species, we can’t evaluate the distribution of the frequency of nonsense errors at single-base resolution. Our goal is simply to compare the frequency of nonsense errors at the very 3’ end to that at more upstream regions.

2) This being said, we have evaluated the correlations using individual points before binning (details shown in the following response), and the results are consistent with the sliding-window analysis.

3)Instead of binning data into separated windows, the step size of our sliding-window analysis is 1 nt and there is a 99% overlap between neighboring windows. This approach allows us to smooth fluctuations at a small scale, considering the small sample size of nonsense errors.

We also agree that it is not appropriate to claim a lack of correlation in analyses as shown in the Figure S3 because there are a lot of zeros. Data collected in the present study cannot help us to distinguish whether the absolute or relative distance between PTCs and 3’ end of RNAs matters for the efficiency of degradation. Therefore, we have removed the Figure S3 of the previous version and corresponding paragraphs.

For these and all similar figures, there should be error bars on each dot from sampling error: for a binomial the error is sqrt(p*(1-p)/n). I was unable to figure out what exactly the normalization to scale the y-axis between 0 to 1 entailed – the fact that no code was made available (despite the ELife reporting form instructions to "Include code used for data analysis") meant that I couldn't compensate for thinly described methods. I see in any case no justification for normalization; it would be better to just give the actual numbers on the y-axis, while keeping the "zoom" the same for visualization purposes.

We have included more details on how we ran analyses related to Figure 4, and error bars are included. We have also posted scripts and corresponding data on the Lynch-lab Github site.

The best statistical approach is generally to work with the raw data, which in this case is a vast dataset of 0s (no error) and 1s (error). The number of (rare) errors meeting given criteria follows a Poisson distribution, and a generalized linear model can be used to model this error function. This is what we did for transcription errors in https://www.biorxiv.org/content/10.1101/554329v1, and it leads to vastly greater power to discern the kinds of trends hypothesized in Figure 4 and Figure S3. I realize that, as well as some conflict in asking for use and thus citation of my own work, there may be concerns citing a preprint, but the manuscript is now accepted pending minor revisions in GBE, and citable as such. I would really like to know eg whether M. florum really does have a different shape to E. coli in Figure 4 as it appears to, and more sophisticated statistical models are required to answer questions of this sort. Our accepted manuscript provides such models, with code fully available on github.

We followed the approach suggested by the reviewer with minor modifications and got consistent results with our sliding-window analysis. Details are as follows.

The frequency of nonsense errors can be modeled as a linear function of the distance between nonsense errors and original stop codons,

(1) EiRi=α+βdi

where Ei is the number of nonsense errors at locus i, Ri refers to the number of ribonucleotides assayed, and di is the distance. To better evaluate this linear model, the reviewer suggested to multiply Ri on both sides and to evaluate the following model,

(2) EiRi+Ridi+εpoisson

However, the coefficient of the second term (Ridi) in function (2) is not only related to di.

Therefore, we evaluated the linear model shown in (1) using the data points without binning.

Slopes (β) are -3.43E-07, -1.20E-07, -2.31E-07, and -1.32E-07 for E. coli, A. tumefaciens, B. subtilis and M. florum, respectively with P values of 6.01E-09, 4.02E-10, 0.019, and 7.63E-09. These results still support our argument that there is an enrichment of PTCs at the 3’ end (Figure 4A, in which the x axis is reversed).

Our preprint re-analyzes the data of Traverse and Ochman, and finds that the (non-C->U) error rate depends on protein abundance. The Discussion section is the only place in the paper that attempts to say why the currently observed non-C->U error rate is an order of magnitude lower than that previously observed by Traverse and Ochman, attributing all the discrepancy, especially but not limited to cytosine deamination, to RNA damage during library preparation by Traverse and Ochman. This is problematic in the light of our finding of systematic differences that are not expected to be caused by library preparation problems. I don't know why the non-C->U error rates are so different between the studies either, but it clearly isn't all cytosine deamination, nor all library preparation, and it would be useful to acknowledge this puzzle and comment on other possibilities, e.g. differences in strain or experimental condition.

The most striking difference comes from the C-to-U error rate. As we discussed in the manuscript, a lower C-to-U error rate in our study may result from enzymatic fragmentation approach which minimizes the RNA damages, in particular cytosine deaminations.

We agree with the reviewer that non C-to-U error rates are also different between the present and Traverse and Ochman’s studies, and note that it is entirely possible that metal ion-based fragmentation may also induce non C-to-U error rates.

We have provided details on our data and analysis methods, such as the strain number, growth phase, and criteria to filter false positives, which may be helpful for future studies to compare with our work. We realize that the reviewer’s recent conclusion in a theoretical paper relies on the previous Traverse data set, which we believe has significant issues, but it is not clear that this should bear on a judgement of our empirical work.

Note that the per-gene method to find outliers described in subsection “A global view of the transcript error distribution” is much lower powered than our test for dependence on protein abundance, and so in no way rules out the variation among genes that we discovered. And if the aim is to detect mutations or programmed errors, it is better to do so per-site than per-gene, as we also did to also find that such problems were rare.

We followed the per-site method described in the paper that the reviewer suggested. For example, in an E. coli sample, we calculated the probability that the transcript-error frequency at one site is higher than the random expectation according to cumulative binomial distribution. A significance cutoff of 5.4x10-9 (Bonferroni corrected p-value of 0.02) was used. A hotspot site was defined as being supported by at least two biological replicates. Results are shown below (Author response image 1).

Author response image 1

Running this per-site method, first, we found transcript errors are generally randomly distributed, which is consistent with the per-gene method. Second, we also found the bias in different base substitution rates needs to be considered in this per-site model. For example, in M. florum, where the most hotspots were inferred, most hotspot sites correspond to GA errors. If a GA specific error rate is used, the number of hotspot sites decreases to 10. Lastly, we found that the feasibility of the per-site method depends on a high coverage at each site, which is not always the case in practice.We emphasize that all of our data will be freely available to the reviewer for further analysis.

In general, the statistics used in this manuscript assume that sampling error is negligible and that all variance is therefore attributed to biological replication. This underlies the use of t-tests and paired t-tests throughout. However, in tables like Supplementary file 5 and Supplementary file 6, neither error bars nor denominator is given, so I am unable to verify that sqrt(p*(1-p)/n) is negligible. If it is not negligible, then all these t-tests should be weighted rather than the current use of unweighted tests, or better still, a more advanced generalized linear model with a Poisson error term (see above) that simultaneously accounts for all sources of variance should be used instead of t-tests. The problems with the statistical tests used primarily create lower power, and so the positive findings should all hold, but more problematic is that the supplementary data files lack the information needed to allow future reanalysis of the data using better methods.

To provide information for future reanalysis, we have included details, such as numerators and denominators in tables like Supplementary file 5 and Supplementary file 6.

Another statistical problem is in calculating expected percentages (subsection “To calculate the expected percentages of transcript errors with different effects”). The equation used assumes that while the error rates of the 3 sites within the codon might be different, every codon has an equal error rate. The much more logical alternative would be to assume site-specific error rates that do not depend on which codon the site is found in. If our finding that error rates depend on protein abundance is also true, this will contribute to misleading results even after this correction is made. We found that the difference between error rates at synonymous vs. non-synonymous sites was entirely attributable to the dependence of codon bias strength on protein abundance, i.e. that variation in error rate was at the gene level not the codon level.

First, we would like to clarify the method we used to calculate the expected percentages. There are two biases we considered. The first is the bias in codon usages. The second is the bias in nucleotide-specific error rates. We do not assume every codon has an equal error rate; instead we calculated the error rate of each type of codon based on the nucleotide-specific error rate.

Second, we would like to argue that the finding from the reviewer’s study that the variation in error rate is at the gene level is only supported by the E. coli data set from Traverse and Ochman. It is unclear whether it holds true for other three bacterial species in the present study.

Lastly, the reviewer suggested to use site/gene-specific error rates to approximate empirical observations. However, we intend to compare expected percentages calculated from random expectations to observed percentages to reveal potential error-correction processes.

Figure 2 legend refers to the analysis only of genes with detected transcript errors. This would seem to create a variety of ascertainment biases, inflating the error rate overall as zeros are neglected, and preferentially neglecting potentially large numbers of genes each of low expression.

We apologize for the confusing wording here. We included all genes with adequate sequencing coverage, no matter whether there were errors detected or not.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Essential revisions:

Data analysis and interpretation:

1) They uncovered almost the full spectrum of transcription errors, but somehow omitting indel errors from their data set. As is, this work will serve as a method reference for people interested in transcription errors. Based on their spectrum analysis, the authors speculate about different mechanisms by which these errors can be made. In order to be complete, the authors should analyze and report indels from their existing data.

We have now provided the analysis pipeline to identify indel errors in the transcriptome. The code can be found on Lynch-lab GitHub site. Estimates of indel error rates are 0.1-0.2 of those of base substitution rates. The results have been included as Supplementary file 1.

2) Concerning the model to explain the biased distribution of nonsense errors in mRNA, they proposed an NMD mechanism not yet characterized in bacteria, but this can easily be explained by transcription polarity described first by Morse and Yanofsky, (1969).

The polarity described by Morse and Yanofsky is defined as “the reduction in the relative rates of synthesis of those polypeptides specified by the genes of an operon on the operator-distal side of a mutated gene with a mutated gene with a nonsense codon”. Morse and Yanofsky explicitly excluded the arrested transcription mechanism for polarity and suggested that the polarity results from degradation of the untranslated region of RNAs that are immediately distal to a nonsense codon and are not covered by ribosomes. Therefore, the transcription polarity and the ribosomerelease model essentially refer to the same underlying biological process. The transcription polarity in Morse and Yanofsky, 1969 emphasizes the destabilizing effect of genetic polar mutants and it is not widely interpreted as a quality-control mechanism for mRNAs.

In the present study, we interpreted the biased distribution of nonsense errors as a result of a potential NMD-like mechanism not only because of the biased distribution of nonsense transcript errors in mRNAs, but also because of observations shown in Table 1, where the expected percentages of nonsense errors tend to be higher than the observed ones.

We also clarified in the manuscript that the ribosome-release model was firstly proposed by Brogna and Wen, 2009 as a speculative model for NMD. We believe both previous studies (Figure 3CD and Table 1. Entry 14,15 of Baker and Mackie, 2003; Figure 3 of Braun et al., 1998; Figure 2A of Nilsson et al., 1987) and our survey provide empirical evidence for this model and therefore highlight the existence of potential NMD-like mechanism in prokaryotes.

3) In order to make the sequencing method user-friendly, I would strongly suggest that the authors include a flowchart for the analysis of the sequencing data as they have done for the technical concept in Figure 1—figure supplement 1.

We agree. One of the goals of this study is to serve as the method reference for the transcript error community. We have now included a flowchart in Figure 1—figure supplement 2 to show how CirSeq reads are processed. Crucial steps such as how consensus sequences are constructed and reorganized are illustrated in detail.

Statistical analyses:

4) While claims based on comparing the magnitudes of correlation coefficients of different binned data sets has now been removed, correlation coefficients for data that is not just binned (frequency data), but also non-independent (sliding windows) continue to be reported in Figure 4, Figure S2, and Figure S3. Corresponding regression models on these non-independent sliding window datapoints are inappropriately used to generate p-values. An alternative and somewhat less flawed approach is given only in the response to reviews document and not in the manuscript itself. This is presented as being the method of Meer et al.,( "with minor modifications", although they are not that minor, as they sacrifice the main advantage of that method. A good statistical model should have a good model of the error function. Count data has a binomial error function, well approximated as a Poisson. The advantage of the method of Meer et al., is that it exactly models this Poisson, but the method is dismissed in the response to reviews because it explicitly accounts for the fact that the raw count data depends on the number of reads, as though this were a bad thing. The alternative presented in the response to reviews does avoid the non-independence problem of the sliding window, but it still assumes homoscedasticity with respect to #errors/#reads, an assumption that is violated in practice given many more reads at some sites than others. If the authors want to model Equation (1) instead of Equation (2) (as numbered in the response to reviews), I would find this acceptable if and only if they (a) use a weighted regression to account for the known heteroscedasticity (ie at least model the magnitude of the error even if they don't model its shape), and (b) place this analysis in the manuscript itself as a replacement for doing a regression on the output of non-independent sliding-window-bins. The sliding window is perfectly appropriate and indeed very helpful as a visualization of the data, but inappropriate for the fitting and testing of statistical models.

Here the reviewer has two major comments. One is on the sliding-window analysis of binned data, and the other suggests accounting for sampling errors by using a weighted regression approach.

In response to the first major comment, we agree with the reviewer that the correlation coefficient from binned data should be avoided in evaluating the correlation between the frequency of nonsense errors and the distance between the nonsense error and the position of the original stop codon in a single base-resolution. In the current version of manuscript, we did the sliding-window (99% overlap of neighboring windows) analysis only for data visualization.

As the reviewer suggested, sampling errors should be considered in the regression analysis. We have updated the method by using a weighted regression method to evaluate whether there is a correlation between the frequency of a nonsense error and its distance from the original stop codon. The weight was calculated as the reciprocal of the variance of the nonsense error frequency, where the variance can be estimated as p(1p)R. 𝑅 refers to the number and of ribonucleotides assayed (the read coverage), and 𝑝 is the estimated frequency of transcript errors.

The statistics from the weighted linear regression indicates that the negative correlation revealed by the sliding window analysis is significant in E. coli, B. subtilis, and M. florum, but not in A. tumefaciens. Compared to other three species, a smaller number of nonsense errors were detected in A. tumefaciens (59 vs. 109, 113, 105 in other three species), which may result in a low statistical power to reveal a potential pattern for the distribution of nonsense errors at a single base-resolution. This speculation is supported by our previous study (https://advances.sciencemag.org/content/3/10/e1701484, Figure 4F), where an increased number of nonsense errors detected in yeast transcription fidelity mutants vs. the wild type helps to reveal the negative correlation.

5) I take issue with the claim in subsection “A global view of the transcript error distribution” that transcript errors are randomly distributed across genes. After a Bonferroni correction for the number of genes, the manuscript has very little power with which to make such a claim of evidence of absence. What is more, Meer et al., showed (using a different E. coli dataset) that the error rate does depend on the strength of selection on the gene in question, as assessed by its protein abundance. This analysis had much higher power than the current study, firstly because there was only one degree of freedom and hence no need for a radical Bonferroni correction, and secondly because explicit modeling of the error function as described above will generally increase power. Given that (uncited) evidence that contradicts the manuscript's claim for randomness has already been published, and the lower power of the current manuscript's basis for making this claim, a claim of randomness should not be made unless the authors are able to apply the same or similarly high-powered method and directly show contradictory results. Should the previously published claim non-randomness be supported, this would have implications for the authors' model of randomly generated transcript error, eg subsection “Characterization of transcript errors”. In this passage too, the statistics are problematic. Observed and expected "data" are compared using a paired t-test, as though the expectations were themselves data with error magnitudes homoscedastic with those of the observed data.

We think this comment comes from the discrepancy between the argument of gene-specific transcript-error rates from Meer et al., 2019 and our observation of randomly distributed transcript errors. In response to this comment, we would like to first revisit observations and the corresponding conclusion from Meer et al., 2019. By analyzing a different E. coli dataset from Traverse and Ochman, 2016, Meer et al. claim an inverse correlation between protein abundance and non C-to-U transcript-error rates and argue for potential gene-specific transcript-error rates shaped by selection. Because the C-to-U error rate in the previous study is likely overestimated owing to technical errors, it is unclear whether the Meer et al., conclusion holds for the overall error rate. It is also unclear whether the gene-specific error rate argument applies for prokaryotes other than E. coli.

In the present study, we provided a per-gene approach to evaluate the distribution of transcript errors, where the sampling error related to the coverage of genes has been considered as Poisson distributed. Besides, we also ran a per-site analysis suggested by the reviewer. Results from both analyses indicate, in general, there is no sign for gene-specific transcript-error rates.

We have also updated the statistics and used chi-square tests to compare expected and observed percentages displayed in Table 1.

6) I am also confused by the bootstrapped standard deviation in Figure 4, Figure S2, and Figure S3. In Figure 4, why bootstrap rather than use the known formula for a binomial? In Figure S2 and Figure S3, why bootstrap only the numerator and not the denominator of the quantity whose error is being assessed?

We agree with the reviewer that the number of transcript errors observed at one locus is expected to follow a binomial distribution. Therefore, we have updated those figures and calculated the binomial standard deviation of error frequencies according to the read coverage and the frequency of transcript errors.

7) I still fear that the casual reader will take away from the Discussion section that differences between the two studies in error rates other than C->U are trivial in magnitude, which is not the case.

We agree with the reviewer that although the most striking differences comes from the C-to-U error rate, other types of substitution rates are also different between the present study and Traverse and Ochman’s study. We have now clearly pointed out this difference in the main text. It is possible that metal ion-based fragmentation may also induce non C-to-U error rates.

We also would like to point out that we provided details on our data and analysis methods, such as the strain number, growth phase, and criteria to filter false positives, which we think may be helpful for future studies to compare with our work.

https://doi.org/10.7554/eLife.54898.sa2

Article and author information

Author details

  1. Weiyi Li

    Department of Biology, Indiana University, Bloomington, United States
    Present address
    Joint Initiative for Metrology in Biology, SLAC National Accelerator Laboratory, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1168-7093
  2. Michael Lynch

    1. Department of Biology, Indiana University, Bloomington, United States
    2. Center for Mechanisms of Evolution, The Biodesign Institute, Arizona State University, Tempe, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Validation, Investigation, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    mlynch11@asu.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1653-0642

Funding

National Institutes of Health (R01-GM036827)

  • Michael Lynch

National Institutes of Health (R35-GM122566)

  • Michael Lynch

Army Research Office (W911NF-09-1-0444)

  • Michael Lynch

Army Research Office (W911NF-14-1-0411)

  • Michael Lynch

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Stephen Simpson, W Kelley Thomas, Samuel F Miller, Jiaqi Zheng, Jie Huang, and James Ford for technical support, and Daniel Kearns and Clay Fuqua for providing B. subtilis NCIB 3610 and A. tumefaciens C58 strains. We also thank Hongan Long, Michelle Marasco, Chi-Chun Chen, Parul Johri, and Jean-Francois Gout for helpful discussions. This work was supported by National Institutes of Health Awards R01-GM036827 and R35-GM122566, and Multidisciplinary University Research Initiative Award W911NF-09-1-0444 and W911NF-14-1-0411 from the US Army Research Office (to ML).

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Christian R Landry, Université Laval, Canada

Reviewer

  1. Joanna Masel, University of Arizona, United States

Publication history

  1. Received: January 6, 2020
  2. Accepted: April 28, 2020
  3. Version of Record published: May 29, 2020 (version 1)

Copyright

© 2020, Li and Lynch

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,309
    Page views
  • 170
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Chromosomes and Gene Expression
    Ryan R Cheng et al.
    Research Article Updated

    Using computer simulations, we generate cell-specific 3D chromosomal structures and compare them to recently published chromatin structures obtained through microscopy. We demonstrate using machine learning and polymer physics simulations that epigenetic information can be used to predict the structural ensembles of multiple human cell lines. Theory predicts that chromosome structures are fluid and can only be described by an ensemble, which is consistent with the observation that chromosomes exhibit no unique fold. Nevertheless, our analysis of both structures from simulation and microscopy reveals that short segments of chromatin make two-state transitions between closed conformations and open dumbbell conformations. Finally, we study the conformational changes associated with the switching of genomic compartments observed in human cell lines. The formation of genomic compartments resembles hydrophobic collapse in protein folding, with the aggregation of denser and predominantly inactive chromatin driving the positioning of active chromatin toward the surface of individual chromosomal territories.

    1. Cancer Biology
    2. Chromosomes and Gene Expression
    Xiao Ling Li et al.
    Research Article

    Long noncoding RNAs (lncRNAs) are often associated with polysomes, indicating coding potential. However, only a handful of endogenous proteins encoded by putative lncRNAs have been identified and assigned a function. Here, we report the discovery of a putative gastrointestinal tract-specific lncRNA (LINC00675) that is regulated by the pioneer transcription factor FOXA1 and encodes a conserved small protein of 79 amino acids which we termed FORCP (FOXA1-Regulated Conserved Small Protein). FORCP transcript is undetectable in most cell types but is abundant in well-differentiated colorectal cancer (CRC) cells where it functions to inhibit proliferation, clonogenicity and tumorigenesis. The epitope-tagged and endogenous FORCP protein predominantly localizes to the endoplasmic reticulum (ER). In response to ER stress, FORCP depletion results in decreased apoptosis. Our findings on the initial characterization of FORCP demonstrate that FORCP is a novel, conserved small protein encoded by a mis-annotated lncRNA that regulates apoptosis and tumorigenicity in well-differentiated CRC cells.