Quantifying how posttranscriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions
Abstract
Transcriptional rates are often estimated by fitting the distribution of mature mRNA numbers measured using smFISH (single molecule fluorescence in situ hybridization) with the distribution predicted by the telegraph model of gene expression, which defines two promoter states of activity and inactivity. However, fluctuations in mature mRNA numbers are strongly affected by processes downstream of transcription. In addition, the telegraph model assumes one gene copy but in experiments, cells may have two gene copies as cells replicate their genome during the cell cycle. While it is often presumed that posttranscriptional noise and gene copy number variation affect transcriptional parameter estimation, the size of the error introduced remains unclear. To address this issue, here we measure both mature and nascent mRNA distributions of GAL10 in yeast cells using smFISH and classify each cell according to its cell cycle phase. We infer transcriptional parameters from mature and nascent mRNA distributions, with and without accounting for cell cycle phase and compare the results to livecell transcription measurements of the same gene. We find that: (i) correcting for cell cycle dynamics decreases the promoter switching rates and the initiation rate, and increases the fraction of time spent in the active state, as well as the burst size; (ii) additional correction for posttranscriptional noise leads to further increases in the burst size and to a large reduction in the errors in parameter estimation. Furthermore, we outline how to correctly adjust for measurement noise in smFISH due to uncertainty in transcription site localisation when introns cannot be labelled. Simulations with parameters estimated from nascent smFISH data, which is corrected for cell cycle phases and measurement noise, leads to autocorrelation functions that agree with those obtained from livecell imaging.
Editor's evaluation
This important paper tackles a core problem in systems biology – how to quantify kinetic parameters from incomplete and noisy experimental data. The study makes a convincing methodological contribution to the field, and its potential usefulness is demonstrated using experimental data in yeast.
https://doi.org/10.7554/eLife.82493.sa0Introduction
Transcription in single cells occurs in stochastic bursts (Suter et al., 2011; Larsson et al., 2019). Although the first observation of bursting occurred more than 40 years ago (McKnight and Miller, 1977), the precise mechanisms behind this phenomenon are still under active investigation (Nicolas et al., 2017; Tunnacliffe and Chubb, 2020). The direct measurement of the dynamic properties of bursting employs livecell imaging approaches, which allow visualization of bursts as they occur in living cells (Donovan et al., 2019). However, in practice, such livecell measurements are challenging because they are lowthroughput and require genomeediting (Brouwer et al., 2020; Lenstra and Larson, 2016). To circumvent this, one can exploit the fact that bursting creates heterogeneity in a population. In this case, it is relatively straightforward to obtain a steadystate distribution of the number of mRNAs per cell from smFISH or singlecell sequencing experiments. These distributions have been used to infer dynamics by comparison to theoretical models. The simplest mathematical model describing bursting is the telegraph (or twostate) model (Peccoud and Ycart, 1995; Raj et al., 2006). In this model, promoters switch between an active and inactive state, where initiation occurs during the active promoter state. The model makes the further simplifying assumption that the gene copy number is one and that all the reactions are effectively firstorder. The mRNA in this model can be interpreted as cellular (mature) mRNA since its removal via various decay pathways in the cytoplasm is known to follow singleexponential (firstorder) decay kinetics in eukaryotic cells (Wang et al., 2002; Herzog et al., 2017). The solution of the telegraph model for the steadystate distribution of mRNA numbers has been fitted to experimental mature mRNA number distributions to estimate the transcriptional parameters (Raj et al., 2006; Kim and Marioni, 2013; Suter et al., 2011; Larsson et al., 2019).
However, the reliability of the estimates of transcriptional parameters from mRNA distributions is questionable because the noise in mature mRNA (and consequently the shape of the mRNA distribution) is affected by a wide variety of factors. Recent extensions of the telegraph model have carefully investigated how mRNA fluctuations are influenced by the number of promoter states (Zhou and Zhang, 2012; Ham et al., 2020b), polymerase dynamics (Cao et al., 2020), celltocell variability in the rate parameter values (Dattani and Barahona, 2017; Ham et al., 2020a), replication and binomial partitioning due to cell division (Cao and Grima, 2020), nuclear export (Singh and Bokes, 2012) and cell cycle duration variability (PerezCarrasco et al., 2020). One way to avoid noise from various posttranscriptional sources is to measure distributions of nascent mRNA rather than mature mRNA, and then fit these to the distributions predicted by an appropriate mathematical model. A nascent mRNA (Zenklusen et al., 2008; Larson et al., 2009) is an mRNA that is being actively transcribed, that is it is still tethered to an RNA polymerase II (Pol II) moving along a gene during transcriptional elongation. Fluctuations in nascent mRNA numbers thus directly reflect the process of transcription. Because nascent mRNA removal is not firstorder, an extension of the telegraph model has been developed (the delay telegraph model) (Xu et al., 2016).
However, nascent mRNA data still suffers from other sources of noise due to celltocell variability. For example in an asynchronous population of dividing cells, cells can have either one or two gene copies. In the absence of a molecular mechanism that compensates for the increase in gene copy number upon replication, cells with two gene copies which cannot be spatially resolved will have a different distribution of nascent mRNA numbers (one with higher mean) than cells with one gene copy. The importance of the cell cycle is illustrated by the finding (Zopf et al., 2013) that noisy transcription from the synthetic TetO promoter in S. cerevisiae is dominated by its dependence on the cell cycle. The estimation of transcriptional parameters from nascent mRNA data for pre and postreplication phases of the cell cycle has, to the best of our knowledge, only been reported in Skinner et al., 2016.
Interestingly, all the studies that estimate transcriptional parameters from nascent mRNA data (Skinner et al., 2016; Xu et al., 2015; Zoller et al., 2018; Senecal et al., 2014; Fritzsch et al., 2018) do not compare them with transcriptional parameters estimated from cellular (mature) mRNA data measured in the same experiment. Similarly, a quantitative comparison between inference from cellcyclespecific data and data which contains information from all cell cycle phases is lacking. Likely, this is because it is considered evident that quantifying fluctuations earlier in the gene expression process and adjusted for the cellcycle will improve estimates. However, nascent mRNA distributions are technically more challenging to acquire than mature mRNA distributions; and inference from nascent mRNA distributions is substantially more complex (Xu et al., 2016). Thus, it still needs to be shown that acquiring nascent mRNA data is a necessity from a parameter inference point of view, i.e. that it leads to significantly different and more robust estimates than using mature mRNA data. We also note that current studies report parameter inference from organisms where it is possible to label introns to identify mRNA located at the transcription site. This is not possible in many yeast genes and other microorganisms, and in these cases it is unclear how to correct parameter estimates for uncertainty in the transcription site location.
In this paper, we sought to understand the precise impact of posttranscriptional noise and celltocell variability on the accuracy of transcriptional parameters inferred from mature mRNA data. The fitting algorithms (for mature and nascent mRNA data) were first tested on simulated data, where limitations of the algorithms were uncovered in accurately estimating the transcriptional parameters in certain regions of parameter space. The algorithms were then applied to four independent experimental data sets, each measuring GAL10 mature and nascent mRNA data from smFISH in galactoseinduced budding yeast, conditional on the stage of the cell cycle (G1 or G2) for thousands of cells. Comparison of the transcriptional parameter estimates allowed us to separate the influence of ignoring cell cycle variability from that of posttranscriptional noise (mature vs nascent mRNA data). We found that only fitting of nascent cellcycle data, corrected for measurement noise (due to uncertainty in the transcription site location), provided good agreement with measurements from livecell data. Cellcycle specific analysis also revealed that upon transition from G1 to G2, yeast cells show dosage compensation by reducing burst frequency, similar to mammalian cells (PadovanMerhar et al., 2015). Our systematic comparison highlights the challenges of obtaining kinetic information from static data, and provides insight into potential biases when inferring transcriptional parameters from smFISH distributions.
Results
Inference from mature mRNA data vs inference from nascent mRNA data: testing inference accuracy using synthetic data
To understand the accuracy of the inference algorithms from nascent and mature mRNA data, in various regions of parameter space, (i) we generated synthetic data using stochastic simulations with certain known values of the parameters; (ii) applied the inference algorithms to estimate the parameters from the synthetic data; (iii) compared the true and inferred kinetic parameter values.
The generation of synthetic mature mRNA data (mature mRNA measurements in each of 10^{4} cells) using stochastic simulations of the telegraph model (Figure 1a) is described in Methods Sections Mathematical model and Generation of synthetic mature mRNA data. The inference algorithm is described in detail in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. It is based on a maximization of the likelihood of observing the single cell mature mRNA numbers measured in a population of cells. The likelihood of observing a certain number of mature mRNA numbers from a given cell is given by evaluating the telegraph model’s steadystate mature mRNA count probability distribution.
For nascent RNA data, we used stochastic simulations of the delay telegraph model (Figure 1b) to generate the position of bound Pol II molecules from which we constructed the synthetic smFISH signal in each of 10^{4} cells (Methods Section Generation of synthetic nascent mRNA data). An inference algorithm estimates the parameters, based on a maximization of the likelihood of observing the single cell total fluorescence intensity measured in a population of cells (Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data). Note that the likelihood of observing a certain fluorescence signal intensity from a cell is given by extension of the delay telegraph model (but not directly by the delay telegraph model itself) to account for the smFISH probe positions.
This extension takes into account that the experimental fluorescence data used in this manuscript was acquired from smFISH of PP7GAL10 in budding yeast, where probes were hybridized to the PP7 sequences. Because the PP7 sequences are positioned at the 5’ of the GAL10 gene, the fluorescence intensity of a single mRNA on the DNA locus resembles a trapezoidal pulse (see Figure 1 for an illustration). As the Pol II molecule travels through the 14 repeats of the PP7 loops, the fluorescence intensity increases as the fluorescent probes binds to the nascent mRNA (this is the linear part of the trapezoidal pulse). However, once all 14 loops on the nascent mRNA are bound by the fluorescent probes, the intensity of a single mRNA reaches maximal intensity and the plot plateaus as the RNA elongates through the GAL10 gene body before termination and release. The total fluorescent signal density function is hence given by
where $p(sk)$ is the density function of the signal $s$ given there are $k$ bound Pol II molecules and $P(k;\theta )$ is the steadystate solution of the delay telegraph model giving the probability of observing $k$ bound Pol II molecules for the parameter set $\theta $. In Methods Section Mathematical model, we show how $p(sk)$ can be approximately calculated for the trapezoidal pulse. Hence Equation (1) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site. Note that both of these inference algorithms were used to infer the promoter switching and initiation rate parameters. The degradation rate and the elongation time were not estimated but assumed to be known. The inference and synthetic data generation procedures are summarised and illustrated in Figure 1d.
The accuracy of inference was first calculated as the mean of the relative error in the estimated parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$, and $\rho $ (for its definition see Methods, Equation (6)); note that this error measures deviations from the known ground truth values. Figure 2a shows, by means of a 3D scatter plot, the ratio of the mean relative error from nascent mRNA data (using delay telegraph model) and the mean relative error from mature mRNA data (using the telegraph model) for 789 independent parameter sets sampled on a grid (for each of these sets, we simulated 10^{4} cells). The overall bluish hue of the plot suggested that the mean relative error from nascent mRNA data was typically less than the error from mature mRNA data. This was confirmed in Figure 2b where the same data was plotted but now as a function of the fraction of ON time (defined as ${f}_{\text{ON}}={\sigma}_{\text{on}}/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$). Out of 789 parameter sets, for 483 of them ($\approx 61\%$) the inference accuracy was higher when using nascent mRNA data.
Thus far, we have implicitly assumed that fluctuations in both nascent and mature mRNA are due to transcriptional bursting. However, it is clear that mature mRNA data exhibit a higher degree of noise due to posttranscriptional processing. For example, it has been shown that transcriptional noise is typically amplified during mRNA nuclear export (Hansen et al., 2018). In addition, celltocell variation in the number of nuclear pore complexes has recently been identified as the source of heterogeneity in nuclear export rates within isogenic yeast populations (Durrieu et al., 2022). To take into account these additional noise sources, which we call external noise, we added noise to the initiation rate $\rho $ in the telegraph model since this rate implicitly models all processes between the synthesis of the transcript and the appearance of mature mRNA in the cytoplasm. Specifically, for each of the 789 parameter sets previously used, we changed $\rho $ to ${\rho}^{\prime}$ where the latter is a lognormal distributed random variable such that its mean is $\rho $ and its standard deviation is equal to 0.05 of the mean (5% external noise). Note that this implies that at the time of measurement, each cell in the population had a different value of the initiation rate. Simulations with this perturbed set of parameters led to a synthetic mature mRNA data set from which we reinferred parameters using the telegraph model. In Figure 2c we show the ratio of mean relative error from nascent mRNA data and the mean relative error from perturbed mature mRNA data as a function of the fraction of ON time, ${f}_{\text{ON}}$. The percentage of parameters where nascent mRNA is more accurate is slightly increased compared to the data without noise (64% versus 61% of the parameters; compare Figure 2c and Figure 2b). However, the addition of even more noise (10% external noise added to the initiation rate) increases the inference accuracy for 91% of the parameter sets when the nascent mRNA data is used (Appendix 1 and Appendix 1—figure 1).
To obtain more insight into the accuracy of the individual parameters, we next plotted the median relative error of transcriptional parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}},\rho $, burst size and the inferred fraction of ON time, as a function of the true fraction of ON time (Figure 2d). We compared the results using synthetic nascent mRNA, synthetic mature mRNA data and synthetic mature mRNA with 5% external noise. The median of the relative error for each transcriptional parameter (as given by the second equation of Equation 8) was obtained for the subset of the 789 parameter sets for which the true fraction of ON time ${f}_{\text{ON}}$ falls into the interval $[x0.05,x+0.05]$ where $x=0.1,0.2,\mathrm{\dots},0.9$. From the plots, the following can be deduced: (i) the errors in ${\sigma}_{\mathrm{on}}$ (the burst frequency), ${\sigma}_{\mathrm{off}}$ and the burst size $\rho /{\sigma}_{\mathrm{off}}$ tend to increase with ${f}_{\mathrm{ON}}$ while the rest of the parameters ($\rho $ and the estimated value of ${f}_{\mathrm{ON}}$) decrease; (ii) for small ${f}_{\mathrm{ON}}$, the best estimated parameters are the burst frequency and size while for large ${f}_{\mathrm{ON}}$, it was $\rho $ and the estimated value of ${f}_{\mathrm{ON}}$. The worst estimated parameter was ${\sigma}_{\mathrm{off}}$, independent of the value of ${f}_{\mathrm{ON}}$; (iii) the addition of external noise to mature mRNA data had a small impact on inference for small ${f}_{\mathrm{ON}}$; in contrast, for large ${f}_{\mathrm{ON}}$ the noise appreciably increased the relative error in ${\sigma}_{\mathrm{off}}$ and to a lesser extent the error in the other parameters too.
Additionally, in Appendices 1 and 2 we show that (i) independent of the accuracy of parameter estimation, the best fit distributions accurately matched the ground truth distributions (Appendix 1 and Appendix 1—figure 2); (ii) the parameters ordered by relative error were in agreement with the parameters ordered by sample variability (Appendix 1 and Appendix 1—table 1) and by profile likelihood error (Kreutz et al., 2013) (Appendix 1, Appendix 1—tables 2 and 3). Since from experimental data, only the sample variability and the profile likelihood error are available, it follows that the results of our synthetic data study in Figure 2 based on relative error from the ground truth have wide practical applicability; (iii) stochastic perturbation of the mature or nascent mRNA data (due to errors in the measurement of the number of spots and the fluorescent intensity) had little effect on the inference quality, unless the gene spent a large proportion of time in the OFF state (Appendix 1—tables 4 and 5); (iv) if one utilized the conventional telegraph model to fit the nascent data generated by the delay telegraph model, it was possible to obtain a distribution fitting as good as the delay telegraph model but with lowfidelity parameter estimation (Appendix 2, Appendix 2—figure 1 and Appendix 2—table 1). Analytically, the telegraph model is only an accurate approximation of the delay telegraph model when the promoter switching timescales are much longer than the time spent by Pol II on a gene or the off switching rates are very small such that gene expression is nearly constitutive.
In summary, by means of synthetic experiments, we have clarified how the accuracy of the parameter inference strongly depends on the type of data (nascent or mature mRNA) and the fraction of time spent in the ON state (which determines the mode of gene expression).
Applications to experimental yeast mRNA data
Now that we have introduced the inference algorithms and tested them thoroughly using synthetic data, we applied the algorithms to experimental data (see Method Section Experimental data acquisition and processing for details of the data acquisition). Note that in what follows, delay telegraph model refers to the extended delay telegraph model that accounts for the smFISH probe positions that was used to predict the smFISH fluorescent signal of the transcription site.
Inference from mature mRNA data: experimental artifacts
We have four independent datasets from which we determined mRNA count and nascent RNA distributions. Figure 3a shows an example cell with mature single RNAs in the cytoplasm, and a bright nuclear spot representing the site of nascent transcription. Spots and cell outlines were identified using automated pipelines. Importantly, to obtain an accurate estimation of transcriptional parameters, the experimental input distributions of mRNA count and nascent RNAs require high accuracy. We therefore first determined how technical artifacts in the analysis affects the inference estimates.
First, if the number of mRNA transcripts per cell is high, accurate determination of the number of transcripts may be challenging, as transcripts may overlap. To determine if this occurred in our datasets, we analyzed the distributions of intensities of the cytoplasmic spots, which revealed unimodal distributions where ∼90% of the detected spots fell in the range 0.5× median – 1.5× median (Figure 4a). We therefore concluded that overlapping spots are not a large confounder in our data. In fact, in our experiments, the number of detected mature mRNA transcripts per cell was lower than expected, based on the number of nascent transcripts (compare Figure 3 with Figure 4). This discrepancy between nascent and mature transcripts likely arises because the addition of the PP7 loops to the GAL10 RNA destabilizes the RNA, resulting in faster mRNA turnover compared to most endogenous RNAs (Miller et al., 2011; Wang et al., 2002; Holstege et al., 1998; Geisberg et al., 2014). Previously, both shorter and longer mRNA halflives from the addition of stem loops have been observed, which may be caused because changes in the 5’ UTR length or sequence affect its recognition by the mRNA degradation machinery (Heinrich et al., 2017; Tutucci et al., 2018; Garcia and Parker, 2015). In our case, we note that such high turnover should aid transcriptional parameter estimates, as it closely reflects transcriptional activity.
A second possible source of error is cell segmentation. To test how cell segmentation errors contribute to the mature mRNA distribution and the transcriptional bursting estimates, we compared two independent segmentation tools, where segmentation 1 often resulted in missed spots (Figure 3b), resulting in an underestimation of the mean mRNA count and of the variance (compare Figure 3b and c). We inferred the transcriptional parameters using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. In the absence of an experimental measurement of the degradation rate, we could only estimate the three transcriptional parameters normalised by $d$. The best fits of dataset 1 are shown in (Figure 3b and c) and the transcriptional parameters (for all four datasets) are summarized in (Figure 3e). Note that the estimated parameters for all four datasets, using both segmentations, are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Notably the segmentation algorithms led to similar estimates for the burst frequency but considerably different estimates for the rest of the parameters. In particular segmentation 1 suggested that burst expression is infrequent (≈20% of the time) whereas segmentation 2 was consistent with burst expression occurring half of the time. Given that accurate cell segmentation remains challenging, this analysis illustrates that parameter estimation from mature mRNA counts may be affected by technical errors. For the remainder of the mature mRNA analysis, we have used only segmentation 2 data.
Lastly, it may be challenging to distinguish the nascent transcription site from a mature RNA, especially if few nascent RNAs are being produced. Either one can decide to include all cellular spots in the total mRNA count, including the transcription site, with the result that the number of mature transcripts is overestimated with one RNA for cells which show a transcription site. Or conversely, one can decide to exclude the transcription site by subtracting one spot from each cell, with the result that the number of mature mRNAs may be underestimated by one RNA for cells that are transcriptionally silent. To understand how this choice affects the accuracy of parameter inference, we compared both options in (Figure 3c, d and e), where seg2 included all spots, and seg2TS excluded transcription sites (by subtracting 1 from each cell). The estimated parameters for all four datasets are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Although the mean was lower when transcription sites were excluded, all the parameters except the burst frequency ${\sigma}_{\mathrm{on}}$ were within the error, indicating that the choice of whether or not to include the transcription site in the mature mRNA count had a small influence on parameter estimation. For the remainder of the analysis, we included all spots, and counted the transcription site as one RNA.
Inference from mature mRNA data: merged versus cellcycle specific
The above analysis was performed using the merged data from all cells, irrespective of their position in the cell cycle. The inferred parameters of all four datasets are shown in Figure 3g (grey). To understand the effect of the cell cycle on these parameter estimates, we compared this inference with cellcyclespecific data. We used the integrated nuclear DAPI intensity as a measure for DNA content to classify cells into G1 or G2 cells (Figure 3f (left)) to obtain separate mature mRNA distributions for G1 and G2 cells.
To infer the transcriptional parameters from mature mRNA data of cells in G1, the inference protocol remained the same. However for cells in the G2 stage, this protocol needed to be altered since G2 cells have two gene copies, whereas the solution of the telegraph model assumes one gene copy. Assuming the transcriptional activities of the two gene copies are independent, the distribution of the total molecule number is the convolution of the molecule number (obtained from the telegraph model) with itself for mature mRNA data. This convolved distribution was used in steps (ii) and (iii) of the inference algorithm in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. A difference between our method of estimating parameters in G2 from that in the literature (Skinner et al., 2016) is that we do not assume that the burst frequency is the only parameter that changes upon replication, and we estimated all transcription parameters simultaneously.
Note that the independence of gene copy transcription has been verified for genes in some eukaryotic cells (Skinner et al., 2016) where the two copies can be easily resolved. For yeast data, as we are analyzing in this paper, it is generally not possible to resolve the two copies of the allele in G2 because they are within the diffraction limit. However, in the absence of experimental evidence, the independence assumption is the simplest reasonable assumption that we could make (see later for a relaxation of this assumption).
For both G1 and G2 cells, we performed inference for cellcycle specific mature mRNA data, the results of which are shown in Figure 3f (centre and right) and Figure 3g – see Appendix 3—table 2 for the confidence intervals of the estimates calculated using profile likelihood. As expected, the mean number of mRNAs in G2 cells was larger than that in G1 cells. For both merged and cellcycle specific data, the parameters ordered by increasing variability of the estimates from independent samples (the standard deviation divided by the mean) were: $\rho $, ${f}_{\mathrm{ON}}$, ${\sigma}_{\mathrm{ON}}$, burst size and ${\sigma}_{\mathrm{OFF}}$, and the same order was predicted by the relative error (from ground truth values) from our synthetic experiments (compare with ${f}_{\mathrm{ON}}=0.50$ and ${f}_{\mathrm{ON}}=0.80$ in the middle and right panels of Figure 2d) and by sample variability (Appendix 1). In Appendix 3 and Appendix 3—table 3 we show that the relaxation of the assumption of independence between the allele copies in G2 (by instead assuming perfect state correlation of the two alleles) had practically no influence on the inference of the two best estimated parameters ($\rho $, ${f}_{\mathrm{ON}}$).
A comparison of the two types of data predicted different behaviour (Figure 3g bottom): merged data indicated behaviour consistent with the gene being ON half of the time and small burst sizes, while cellcyclespecific data implied the gene is ON ≈80% of the time with large burst sizes. We note that the burst sizes have considerable sample variability, exemplifying burst size estimates of transcriptional parameters from mature mRNA distributions have to be treated with caution. Nevertheless, in line with this high fraction ON and large burst size, which start to approach constitutive expression, the variation introduced by the transcription kinetics is relatively modest with Fano factors not far from one: $2.43\pm 0.21$ for merged data and $1.75\pm 0.45$ for cellcycle data (the slightly higher value for merged data likely was due to heterogeneity stemming from varying gene copy numbers per cell).
Comparing the mean rates between the G1 and G2 phases, we found that ${\sigma}_{\mathrm{off}}$, ${\sigma}_{\mathrm{on}}$, $\rho $ decreased while ${f}_{\mathrm{ON}}$ and the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were wellseparated between the two phases were ${\sigma}_{\mathrm{on}}$ and $\rho $. These two parameters decreased by 65% and 21%, respectively, which suggests that upon replication, there are mechanisms at play which reduce the expression of each copy to partially compensate for the doubling of the gene copy number (gene dosage compensation) (Skinner et al., 2016).
In conclusion, what is particularly surprising in our analysis is the differences in the inference results using merged and cellcycle specific data: the former suggests the gene spends only half of its time in the ON state while the latter implies the gene is mostly in its ON state.
Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference
Cellcyclespecific versus merged data
To determine the number of nascent transcripts at the transcription site, we selected the brightest spot from each nucleus and normalized its intensity to the median intensity of the cytoplasmic spots. As the distribution of intensities of the cytoplasmic mRNAs followed a narrow unimodal distribution, its median likely represents the intensity of a single RNA (orange distribution in the central panel of Figure 4a). The inference of transcriptional parameters using the merged data was done using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data.
Similar to above, to account for two gene copies in G2 cells, we assumed that the transcriptional activities of the two gene copies are independent. The distribution of the total fluorescent signal from both gene copies was the convolution of the signal distribution (obtained from the extended delay telegraph model, i.e. Equation (1)) with itself. This convolved distribution was then used in steps (ii) and (iii) of the inference algorithm.
The inference of transcriptional parameters from nascent RNA data was done using a fixed elongation time, which was measured previously at a related galactoseresponsive gene (GAL3) at $65\mathrm{bp}/\mathrm{s}$ (Donovan et al., 2019). Since the total transcript length is $3062\mathrm{bp}$ (see Figure 1c), the elongation time ($\tau $ in our model) is $\approx 47.1s\approx 0.785\text{min}$. The fixed elongation rate enabled us to infer the absolute values of the three transcriptional parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$ and $\rho $.
Best fits of the extended delay telegraph model to the distribution of signal intensity of nascent mRNAs at the transcription site are shown in Figure 4a and b for dataset 1; for the other datasets see Appendix 4—figure 1. The corresponding estimates of the transcriptional parameters are shown in Appendix 4—table 1 and also illustrated by bar charts in Figure 4c. The confidence intervals of the transcriptional parameters (computed using the profile likelihood method) are shown in Appendix 4—table 2.
Comparing this estimation with that from mature mRNA, we observed that in both cases ${f}_{\mathrm{ON}}\approx 0.5$ for merged data and in the range $0.70.8$ for cellcyclespecific data. Also in both cases, the Fano factors of merged data were larger than those of cellcyclespecific data. Hence, we are confident that not accounting for the cell cycle phase leads to an overestimation of the time spent in the OFF state and of the Fano factor. In addition, comparing the burst sizes in Figure 3g and Appendix 4—table 1, we found that not taking into account posttranscriptional noise (by using mature mRNA data) led to an lower estimation of the burst size (2.6fold, 2.6fold, and 1.1fold lower for inference from merged, G1 and G2 data, respectively). We note that it would be useful to directly compare the absolute estimates of the other transcriptional parameters from mature and nascent mRNA data. However, this was not possible because the telegraph model only estimates the switching rates and the initiation rate scaled by the degradation rate, and the latter is unknown. On the other hand, the estimates from nascent data were rates multiplied by the average elongation time, which is known and hence the absolute rates can be estimated from nascent mRNA data only. The only quantities that could be directly compared were the burst size and the fraction of ON time, since these are both nondimensional.
Comparing the variability of the parameter estimates, we found that $\rho $ and ${f}_{\mathrm{ON}}$ were the parameters with the smallest variability across samples for the nascent data, as for inference from mature data. However, the inferred parameter variability across samples was on average about 2.5fold lower for nascent data compared to mature mRNA data (this was obtained by computing the standard deviation divided by the mean for each parameter and then averaging over all parameters and over merged, G1 and G2 data). Likely this is because nascent data does not suffer from posttranscriptional noise. Indeed, synthetic experiments suggested that the errors in parameter inference using nascent data are often less than those in mature data when ${f}_{\mathrm{ON}}\approx 0.80$ (Figure 2d). In summary, we have more confidence in the parameter estimates from nascent data, in particular those from cellcycle separated data.
To further investigate the hypothesis that estimates from cellcyclespecific data are more accurate than merged data, we compared the estimates from merged and cellcyclespecific data to previous livecell transcription measurements of the same gene (Donovan et al., 2019). Because livecell traces and simulated traces with the estimated transcriptional parameters are difficult to compare directly, we instead compared their normalized autocorrelation functions (ACFs). Specifically, the livecell traces displayed celltocell variation in overall fluorescent intensities arising from differences in the PP7 coat protein expression level, precluding a direct comparison of the livecell intensities with the smFISH distributions. The normalized ACFs are normalized per trace and thus can be used to directly compare the kinetics. For this, we fed the parameter estimates to the SSA to generate synthetic livecell data and then calculated the corresponding ACF (Appendix 5). We found that the estimates from cellcyclespecific data produced ACFs that match the livecell data closer than that from the merged data (Figure 4d). This was also clear from the sum of squared residuals which for each dataset was smaller for the ACF computed using the cellcyclespecific estimates rather than those from merged data (Figure 4e).
Using nascent data, we also reinvestigated the hypothesis that the gene exhibits dosage compensation. Comparing the mean rates between the G1 and G2 phases, we found that ${\sigma}_{\mathrm{off}}$, ${\sigma}_{\mathrm{on}}$, $\rho $, ${f}_{\mathrm{ON}}$ decreased while the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were cleanly separated between the two phases were ${\sigma}_{\mathrm{on}}$ and ${f}_{\mathrm{ON}}$. These two decreased by 41% and 5%, respectively. These results had some similarity to those deduced from cellcycle separated mature mRNA data (the decrease of ${\sigma}_{\mathrm{on}}$) but they also displayed differences. Namely, from mature mRNA data it was predicted that $\rho $ decreased upon replication while from nascent data we predicted that $\rho $ did not change and it was rather ${f}_{\mathrm{ON}}$ that decreased by a small degree. The decrease of the burst frequency ${\sigma}_{\text{on}}$ after replication has also been reported for some genes in mammalian cells (Skinner et al., 2016; PadovanMerhar et al., 2015), indicating that this could be a general mechanism for gene dosage compensation. Our results are consistent with a populationbased ChIPseq study (Voichek et al., 2016) that showed DNA dosage compensation after replication in budding yeast. We note that our singlecell analysis only revealed partial dosage compensation, where the mean signal intensity of nascent mRNAs in G2 is not the same as in G1, but 1.7fold higher in G2 than in G1 (Figure 4c).
Correcting for experimental artefacts
Although inference on cell cycle separated data outperformed inference on merged data, we noticed that the corresponding best fit distributions did not match well to the experimental signal distributions in the lower bins (Figure 4b and Appendix 4—figure 1). In all cases, the experimental distributions showed high intensities in bins 1, 2, and 3, which was likely an artifact of the experimental data acquisition system. Since we defined the transcription site as the brightest spot, this implies that in the absence of a transcription site, a mature transcript can be misclassified as a nascent transcript. We therefore investigated two methods to correct for this, the ‘rejection’ method and the ‘fusion’ method.
The rejection method removed all data associated with the first $k$ bins of the experimentally obtained histogram of fluorescent intensities (Figure 5a shows the fits for dataset 1; for the other datasets see Appendix 4—figure 2). We found that the parameter estimates varied strongly when the number of bins from which data was rejected ($k$) was changed (Figure 5b; see also Appendix 4—table 3). Although the distributions fit well to the experimental histograms (Appendix 4—figure 1), comparison with the livecell normalized ACF indicated that the estimates actually became worse than noncurated estimates, with a higher sum of squared residuals (Figure 5c and d). The rejection method therefore does not produce reliable estimates.
Next, we considered another data curation method which we call the fusion method. This works by setting to zero all fluorescent intensities in a cell population which were below a certain threshold. In other words, we fused or combined the first $k$ bins of the experimentally obtained histogram of fluorescent intensities, thereby taking into account that the true intensity of bin 0 was artificially distributed over some of the first bins.
Figure 6 and Appendix 4—table 4 show that the fusion method led to estimates that varied little with $k$ which enhanced our degree of confidence in them (note that $k=1$ is the same as the uncurated data). The peak at the zero bin for both G1 and G2 was better captured using the fusion method than using noncurated data (compare Figure 4b and Appendix 4—figure 1, with Figure 6b). Comparison to the autocorrelation function of the livecell data shows that correction with the fusion method also led to improved transcriptional estimates, as indicated by a reduction in the sum of the squared residuals for all four data sets (Figure 6c).
Overall, we conclude that for inferring parameters from the smFISH data, the optimal method is to use nascent cellcyclespecific data, corrected by the fusion method. The optimally inferred parameters for the four data sets in our study are those given in Appendix 4—figure 2d. The profile likelihood estimates of the 95% confidence intervals of these parameters are shown in Appendix 4—table 5. Note that in line with our synthetic data study in Figure 2, the parameters suffering from the least sample variability were ${f}_{\mathrm{ON}}$ and $\rho $. The rest of the parameters (${\sigma}_{\mathrm{off}},{\sigma}_{\mathrm{on}}$ and burst size) suffered more sample variability because the fraction of ON time was high; however since their standard deviation divided by the mean (computed over the four datasets) was not high (in the range of 1020%), they still can be regarded as useful estimates. Note also that the previous prediction that gene dosage compensation involves regulation of the burst frequency did not change upon correction of the nascent data using the fusion method. All these results were deduced assuming that the two copies in G2 are independent from each other. Inferring rates under the opposite assumption of perfectly synchronized copies (Appendix 4—table 6) gave very similar estimates for $\rho $ and ${f}_{\mathrm{ON}}$ (to be expected since according to the synthetic data study, these two are the most robustly estimated parameters for genes spending most of their time in the active state) but different estimates for the rest of the parameters. While such perfect synchronization of alleles is unlikely, some degree of synchronization is plausible and further improvement of the transcriptional parameters in the G2 phase will require its precise experimental quantification.
Discussion
In this study, we compared the reliability of transcriptional parameter inference from mature and nascent mRNA distributions, with and without taking into account the cell cycle phase. Although these distributions come from the same experiment, we found that the different fits produced very different parameter estimates, ranging from small bursts to very large bursts. Comparison to livecell data revealed that the optimal inference method is to use nascent mRNA data that is separated by cell cycle.
Our findings illustrate the risk of inferring transcriptional parameters from fitting of mRNA distributions. First of all, as we have shown, these fits are sensitive to the segmentation method which can lead to large errors in the estimates. Secondly, the most common method of parameter inference in the literature is fitting of mature mRNA distributions that are not separated by cell cycle (Larsson et al., 2019; Raj et al., 2006; Zenklusen et al., 2008). Obtaining such distributions is straightforward using methods such as smFISH, where one can directly count the number of mRNAs per cell. Additionally, with the advance of singlecell mRNA sequencing technologies, it is possible to obtain mRNA distributions for many genes simultaneously and it is tempting to use these to estimate bursting behaviour across the genome (Kim and Marioni, 2013; Larsson et al., 2019). However, our comparisons on the same dataset show that the values obtained from mature mRNA fits (using merged data) can be significantly different from the optimal values (using nascent cellcycle separated data corrected using the fusion method), with underestimation of the burst sizes of almost 10fold and underestimation of the active fraction of more than 1.5fold. These results indicate that parameter inference from merged mature mRNA data should be treated with caution. There were smaller differences between the burst size and the active fraction inferred from cellcycle separated mature and nascent data (only these two can be directly compared because these are nondimensional); however the relative errors in the estimates (computed over the four datasets) were more than twofold higher for mature data likely due to posttranscriptional noise which nascent data is free from.
It is more common to fit mature distributions rather than nascent distributions because nascent distributions are technically more challenging to obtain. As nascent singlecell sequencing methods are still in the early phase (Hendriks et al., 2019), the only method available so far for nascent measurements is smFISH (Patel et al., 2021). In such smFISH experiments, intronic probes can be used to specifically label nascent RNA, although there may be some effects of splicing kinetics on the distribution (Wan et al., 2021). If introns are not present, like for most yeast genes, one can use exonic probes instead (Zenklusen et al., 2008). Since exonic probes label both nascent and mature mRNA transcript, it may be challenging to identify the nascent transcription site unambiguously, especially at lower transcription levels. We show in this manuscript that the fusion method can correct for this bias by combining bins below k RNAs, which results in an improvement of the parameter estimates.
Our analysis also emphasizes the importance of separately analyzing G1 and G2 cells (Skinner et al., 2016). It is important to note that for cellcyclespecific analysis, experimental adjustments or cellcycle synchronized cultures are not required. Although asynchronous cultures consist of a mix G1, S and G2 cells, the integrated DNA intensity of the nucleus of each cell, for example from a DAPI signal, can be used to separate these cells by cell cycle phase in silico (Skinner et al., 2016; Roukos et al., 2015). As most smFISH experiments already include a DNAlabelled channel, adding an extra analysis step should in principle not limit the incorporation of this step in future smFISH fitting procedures.
Even with our optimal fitting strategy, there is a residual error of the simulated ACF and the measured ACF from livecell measurements. This difference may be the result of different experimental biases of the two measurements. For example, livecell measurements have a detection threshold below which RNAs may not be detected. In addition, livecell measurements include cells in S phase, which are not analyzed in smFISH. There could also be differences in the exact percentage of G1 and G2 cells, or other noise sources between livecell and smFISH experiments. Alternatively, the fit may be imperfect because there might be parameter sets, others than the ones which our inference algorithm found, which provide an accurate fit of the nascent mRNA distribution and perhaps an even better fit to the ACF than we found. We cannot exclude this possibility because we estimated ${f}_{\text{ON}}$ to be $0.70.8$ and using synthetic data we showed that the accuracy of some parameters (${\sigma}_{\mathrm{on}},{\sigma}_{\mathrm{off}}$ and the burst size) deteriorated as ${f}_{\text{ON}}$ approached 1 (Figure 2d). Another factor which could explain the residual error between the simulated ACF and the measured ACF is that perhaps the twostate model may be too simplistic to cover the true promoter states in living cells and may therefore not be able to describe the true in vivo kinetics. The promoter may switch between more than 2 states, or there may be sources of extrinsic noise other than the cell cycle that contribute to the heterogeneity. Previous studies have for example identified extrinsic noise on the elongation rate (Fritzsch et al., 2018). However, these more complex transcription models also have more parameters, which in practice often means that very few will be identifiable with the current set of experimental observations. To fit these models, one requires temporal data on the transcription kinetics (Fritzsch et al., 2018; Rodriguez et al., 2019), or simultaneous measurements of various sources of extrinsic noise, such as singlecell transcription factor concentration and RNA polymerase number measurements, cellular volume, local cell crowding, etc, which are often not available in standard smFISH experiments (Battich et al., 2015; Foreman and Wollman, 2020). Nevertheless, given that there is no explicit time component in smFISH data, the closeness of the simulated ACF to the measured ACF provides confidence we are close to the real values.
The optimal parameter set (Figure 6d) indicates long ON promoter times of 75 s, during which almost 50 RNAs are produced in a burst. Large burst sizes (>70) have been previously reported for mouse embryonic stem cells (Skinner et al., 2016, mouse hepatocytes Bahar Halpern et al., 2015 and human fibroblasts Larsson et al., 2019). The large burst size and high active fraction of 0.78 suggests that GAL10 expression is reaching its limit of maximal expression, which may not be surprising as it is already one of the most highly expressed genes in yeast. It is also interesting to note that the ON time of 75 s is longer than the residence time of a single transcript (47 s), which means that RNA polymerases in the beginning of a burst have already left the locus before the burst has finished.
The optimal parameter set (Figure 6d) also indicates partial gene dosage compensation. Specifically the burst frequency per gene copy (${\sigma}_{\text{on}}$) in the G2 phase is 0.66 that in the G1 phase; the other transcriptional rates are not significantly different between the two cell cycle phases. The fold change in the burst frequency per gene copy was previously estimated for the $Oct4$ and $Nanog$ genes to be 0.63 and 0.71 respectively, in mouse embryonic stem cells (Skinner et al., 2016). The similarity of our estimate of the fold change to those previously measured could be explained by the results of a recent study (Jia et al., 2021); using a detailed model of gene expression, it was shown that in the absence of a dependence of the initiation rate on cell volume, gene dosage compensation optimally leads to approximate mRNA concentration homeostasis when the fold change in the burst frequency upon DNA replication is $\sqrt{2}/2\approx 0.71$.
In conclusion, obtaining kinetic information from static distributions can introduce biases. However, we show that it is possible to obtain reasonable estimates that agree with livecell measurements, if one infers parameters from nascent mRNA distributions that are accounted for cell cycle phase.
Methods
Inference from mature mRNA data
Mathematical model
The steadystate solution of the telegraph model of gene expression (Peccoud and Ycart, 1995) gives mature mRNA distributions. The reaction steps in this model are illustrated in Figure 1a. Next we describe the generation of synthetic mature mRNA data and the algorithm used to infer parameters from this data.
Generation of synthetic mature mRNA data
We generate parameter sets on an equidistant mesh grid laid over the space:
where the units are inverse minute. Furthermore we apply a constraint on the effective transcription rate
In each of the three dimensions of the parameter space, we take 10 points that are equidistant, leading to a total of 1000 parameter sets which reduce to 789 after the effective transcription rate constraint is enforced.
We additionally fix the degradation rate $d=1$ min^{1}. Note that we choose not to vary the degradation rate (as we did for the other three parameters) since it is not possible to infer all four rates simultaneously – this is because the steadystate solution of the telegraph model is a function of the nondimensional parameter ratios $\rho /d,{\sigma}_{\text{off}}/d$ and ${\sigma}_{\text{on}}/d$ (Raj et al., 2006).
Once a set of parameters is chosen, we use the stochastic simulation algorithm (SSA Gillespie, 2007) to simulate the telegraph model reactions in Figure 1a and generate 10^{4} samples of synthetic data. Note that each sample mimicks a single cell measurement of mature mRNA.
Steps of the algorithm to estimate parameters from mature mRNA data
The inference procedure consists of the following steps: (i) select a set of random transcriptional parameters; (ii) use the solution of the telegraph model to calculate the probability of observing the number of mature mRNA measured for each cell; (iii) evaluate the likelihood function for the observed data; (iv) iterate the procedure until the negative loglikelihood is minimized; (v) the set of parameters that accomplishes the latter provides the best pointestimate of the parameters of the telegraph model that describes the measured mature mRNA fluctuations.
For step (i), we restrict the search for optimal parameters in the following region of parameter space
The degradation rate is fixed to $d=1$ min^{1}.
Step (ii) can be obtained either by computing the distribution from the analytical solution (Peccoud and Ycart, 1995 or by using the finite state projection (FSP) method Munsky and Khammash, 2006). Here, for the sake of computational efficiency, we use the FSP method to compute the probability distribution of mature mRNA numbers.
For step (iii) we calculate the likelihood of observing the data given a chosen parameter set $\theta $
where $P({n}_{j};\theta )$ is the probability distribution of mature mRNA numbers obtained from step (ii) given a parameter set $\theta $, n_{j} is the total number of mature mRNA from cell $j$ and ${N}_{\text{cell}}$ is the total number of cells.
Steps (i) and (iv) involve an optimization problem. Specifically we use a gradientfree optimization algorithm, namely adaptive differential evolution optimizer (ADE optimizer) using BlackBoxOptim.jl (https://github.com/robertfeldt/BlackBoxOptim.jl; Feldt and Stukalov, 2022) within the Julia programming language to find the optimal parameters
The minimization of the negative loglikelihood is equivalent to maximizing the likelihood. Note the optimization algorithm is terminated when the number of iterations is larger than 10^{4}; this number is chosen because we have found that invariably after this number of iterations, the likelihood has converged to some maximal value. Note that the inference algorithm is particularly low cost computationally, with the optimal parameter values estimated in at most a few minutes.
Once the best parameter set ${\theta}^{*}$ is found, we calculate the mean relative error (MRE) which is defined as
where ${\theta}_{i}^{*}$ and ${\theta}_{\text{true},i}$ represent the $i$th estimated and true parameters respectively, and $M$ denotes the number of the estimated parameters. Thus, the mean relative error reflects the deviation of the estimated parameters from the true parameters.
Inference from nascent mRNA data
Mathematical model
The steadystate solution of the delay telegraph model (Xu et al., 2016) gives the distribution of the number of bound Pol II. In Appendix 6, we present an alternative approach to derive the steadystate solution. The reaction steps are illustrated in Figure 1a.
The position of a Pol II molecule on the gene determines the fluorescence intensity of the mRNA attached to it. In particular for fluorescence data acquired from smFISH PP7GAL10, the fluorescence intensity of a single mRNA on the DNA locus looks like a trapezoidal pulse (see Figure 1b for an illustration). This presents a problem because although we can predict the distribution of the number of bound Pol II using the delay telegraph model, we do not have any specific information on their spatial distribution along the gene. However, since the delay telegraph model implicitly assumes that a Pol II molecule has fixed velocity and that Pol II molecules do not interact with each other (via volume exclusion), it is reasonable to assume that in steadystate, the bound Pol II molecules are uniformly distributed along the gene. This hypothesis is confirmed by stochastic simulations of the delay telegraph model where the position of a Pol II molecule is calculated as the product of the constant Pol II velocity and the time since its production.
By the uniform distribution assumption and the measured trapezoidal fluorescence intensity profile, it follows that the signal intensity of each bound Pol II has the density function $g$ defined by
where ${L}_{1}=862\text{bp}(\text{base pairs}),{L}_{2}=2200\text{bp},L={L}_{1}+{L}_{2}$ as defined in Figure 1b. The indicator function ${\chi}_{[0,1]}(s)=1$ if and only if $s\in [0,1]$ and ${\delta}_{1}(s)$ is the Dirac function at 1. The probability of the signal $s$ being between 0 and 1 is due to the first part of the trapezoid function and hence is multiplied by ${L}_{1}/L$ which is the probability of being in this region if Pol II is uniformly distributed. Similarly, the probability of $s$ being 1 is due to the L_{2} part of the trapezoid and hence the probability is ${L}_{2}/L$ by the uniform distribution assumption. Note that the signal $s$ from each Pol II is at most 1 because in practice, the signal intensity from the transcription site is normalized by the median intensity of single cytoplasmic mRNAs (Zenklusen et al., 2008).
The total signal is the sum of the signals from each bound Pol II. Hence, the density function of the sum is given by the convolution of the signal densities from each bound Pol II. Defining $p(sk)$ as the density function of the signal given there are $k$ bound Pol II molecules, we have that $p(sk)$ is the $k$–th convolution power of $g$, that is
where ${\delta}_{0}(s)$ is the Dirac function at. Finally we can write the total fluorescent signal density function as
where $P(k;\theta )$ is the steadystate solution of the delay telegraph model giving the probability of observing $k$ bound Pol II molecules for the parameter set $\theta $. Hence Equation (8) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site.
Comparison to the algorithm in Xu et al., 2016. Both algorithms take into account the fact that the signal intensity depends on the position of Pol II on the gene, albeit this is done in different ways. In Xu et al., 2016 a master equation is written for the joint distribution of gene state and the number of nascent mRNA. In this case the number of nascent RNAs can have noninteger values since it represents the experimentally measured signal from the (incomplete) nascent RNA. Solution of this master equation proceeds by (a) a discretization of the continuous nascent mRNA signal into bins which are much smaller than one; (b) solution using finite state projection (FSP). This approach can lead to a large state space which incurs a large computational cost. In contrast, in our method, we use FSP to solve for the delay telegraph model, i.e. the distribution of the discrete number of bound Pol II from which we construct (using convolution) the approximate distribution of the continuous nascent mRNA signal by assuming the Pol II is uniformly distributed on the gene. Since the state space of bound Pol II is typically not large, our method will typically be more computationally efficient than the one described in Xu et al., 2016.
Generation of synthetic nascent mRNA data
We generated synthetic smFISH signal data by using the SSA, modified to include delay to simulate the delay telegraph model (Fu et al., 2022). Specifically, we use Algorithm 2 described in Barrio et al., 2006. One run of the algorithm simulates the fluctuating number of bound Pol II molecules in a single cell.
The total fluorescence intensity (mimicking smFISH) is obtained as follows. When a particular bound Pol II is produced by a firing of the transcription reaction $G\to G+N$, we record this production time; since the elongation rate is assumed to be constant, given the production time we can calculate the position of the Pol II molecule on the gene at any later time and hence using Figure 1b we can deduce the fluorescent signal due to this Pol II molecule.
Specifically we normalize each transcribing Pol II’s position to $[0,1]$ and map the position to its normalized signal by
where $x$ is the normalized position on the gene. Thus at a given time, the total fluorescent signal from the $n$th cell (the $n$th realization of the SSA) equals
where ${J}_{n}$ is the number of bound Pol II molecules in the $n$th cell, and $\{{x}_{j}\}$ with $j=1,\mathrm{\dots},{J}_{n}$ is the vector of all Pol II positions on the gene. The total signal from each cell is a real number but it is discretized into an integer.
The kinetic parameters are chosen from the same region of parameter space as in (2), on the same equidistant mesh grid and with the same constraint on the effective transcription rate. Unlike the mature mRNA case, here there is no degradation rate; instead we have the elongation time, which we fix to $\tau =0.5\text{(min)}$. Note that fixing this time is necessary since it is not possible to infer the three transcriptional parameters rates and the elongation time simultaneously because the steadystate solution of the delay telegraph model is a function of the nondimensional parameter ratios $\rho \tau ,{\sigma}_{\text{off}}\tau $ and ${\sigma}_{\text{on}}\tau $.
Once a set of parameters is chosen, we use the modified SSA (as described above) to simulate the signal intensity in each of 10^{4} cells.
Steps of the algorithm to estimate parameters from nascent mRNA data
The inference procedure is essentially the same as steps (i)(v) described in mature mRNA inference except for the following points.
In step (ii), the probability of observing a total signal of intensity $i$ from a single cell is obtained by integrating $p(s;\theta )$ in Equation (8) on an interval $[i1,i]$ for $i\in \mathbb{N}$ which, in our numerical scheme, means
Note that the integration over the interval of length 1 is to match the discretization of the synthetic data and $\theta \in \mathrm{\Theta}$. Intuitively, one can always choose a positive integer $K$ such that $P(k)=0$ for any $k\ge K$. The computation of the solution of the delay telegraph model $P(k)$ can be done either using the analytical solution (evaluated using high precision) or using the finite state projection algorithm (FSP) Munsky and Khammash, 2006. In Appendix 6—figure 1 and Appendix 6—table 1, we show that the two methods yield comparable accuracy and CPU time.
For step (iii) we calculate the likelihood of observing the data given a chosen parameter set $\theta $
where q_{j} is the discretized total signal intensity from cell $j$ and ${N}_{\text{cell}}$ is the total number of cells. In the optimization, we aim to find
The whole procedure (for both mature and nascent mRNA inference) is summarized by a flowchart in Figure 1c.
Experimental data acquisition and processing
A diploid yeast strain of BY4743 background with a single integration of 14xPP7 loops at the 5’UTR of GAL10 (strain YTL047 Donovan et al., 2019) was used in this study. Four replicate yeast cultures were grown in synthetic complete media with 2% galactose to early midlog (OD 0.5), fixed with 5% paraformaldehyde (PFA) for 20 min, permeabilized with 300 units of lyticase and hybridized with 7.5 pmol each of four PP7 probes labeled with Cy3 (Integrated DNA Technologies) as described in Trcek et al., 2012 and Lenstra et al., 2015; Patel et al., 2021, resulting in four technical replicates. The PP7 probe sequences are: atatcgtctgctcctttcta, atatgctctgctggtttcta, gcaattaggtaccttaggat, aatgaacccgggaatactgc. Coverslips were mounted on microscope slides using mounting media with DAPI (ProLong Gold, Life Technologies).
The coverslips were imaged on a Zeiss AxioObserver (Zeiss, USA) widefield microscope with a PlanApochromat 40x1.4 NA oil DIC UV objective and a 1.25 x optovar. For Cy3, a 562 nm longpass dichroic (Chroma T562lpxr), 595/50 nm emission filter (Chroma ET595/50 m) and 550/15 nm LED excitation at full power (Spectra X, Lumencor) were used. For DAPI, a 425 nm longpass dichroic (Chroma T425lpxr) and a 460/50 nm emission filter (Chroma ET460/50 m) and LED excitation at 395/25 nm at 25% power (Spectra X, Lumencor) were used. The signal was detected on a Hamamatsu ORCAFlash4.0 V3 Digital CMOS camera (Hamamatsu Photonics, Japan). For each sample and each channel, we utilized the MicroManager software (UCSF) to acquire at least 20 fieldsofview based on the DAPI channel. Each fieldofview consisted of 13 zstacks (with a zstep of 0.5 µm) at 25ms exposure for DAPI and 250ms exposure for Cy3.
A custom python pipeline was used for analysis (https://github.com/Lenstralab/smFISH; Pomp, 2022). Maximum intensity projected images were used to segment the cell and nucleus using Otsu thresholding and watershedding (segmentation 1). In addition, we segmented cells using CellProfiler (segmentation 2). The diffractionlimited Cy3 spots were detected per zslice using bandpass filtering and refined using iterative Gaussian mask localization procedure (Crocker and Grier, 1996; Thompson et al., 2002; Larson et al., 2005; Larson et al., 2011 and Coulon et al., 2014). Cells in which no spots were detected were excluded from further analysis since a visual inspection indicated that these cells were not properly segmented or were improperly permeabilized.
Spots were classified as nuclear or cytoplasmic and the brightest nuclear spots were classified as transcription sites. The intensity of the brightest nuclear spot in a cell was normalized with the median fluorescence intensity of all the cytoplasmic spots in all cells. This is due to the fact that 90% of cytoplasmic mRNAs are isolated (Figure 4a), thus the median of the fluorescence signal of cytoplasmic mRNAs can be considered as the normalizing value. The distribution of the normalised intensity of the brightest nuclear spot, calculated over the cell population, is the experimental equivalent of the total fluorescent signal density function as given by the solution of the modified delay telegraph model, Equation (8).
The number of mature mRNA in each cell is given by counting the number of spots in the entire cell, that is nuclear plus cytoplasmic. The transcription site is counted as 1 mRNA, regardless of its intensity. We show in Figure 3c that this has negligible influence on the estimated parameters since the mean number of mature mRNA is much greater than 1. The distribution of the number of spots is the experimental equivalent of the solution of the telegraph model, that is the marginal distribution of mature mRNA numbers in steadystate conditions.
The integrated nuclear intensity of each cell was calculated by summing the DNA content intensity (DAPI) of all the pixels within the nucleus mask. The distribution of the intensities was fit with a bimodal Gaussian distribution. Those cells whose intensity was within a standard deviation of the mean of the first (second) Gaussian peak was classified as G1 (G2) (see Figure 3e left). This gave similar results to a different cell cycle classication method using the Fried/Baisch model (Johnston et al., 1978) which was recently employed in Skinner et al., 2016. See Appendix 7—figure 1 for a comparison of the two methods. We note that cells in late G2 may contain two separate transcription sites, one in the mother and one in the bud. When the nucleus moves into the bud, buds often contain less DNA than G1 cells, and mothers contain more DNA than G1 cells, both of which are excluded from the analysis. When the DNA content of the mother and daughter is similar, both mother and daughter are counted separately as G1 cells. We note that this late G2 subpopulation is very small.
We did four independent experiments with a total number of cells equal to 2510, 6411, 4592, 3181, respectively. After classification, the numbers of G1 cells are 766, 2111, 1495, 904 and the number of G2 cells are 683, 1657, 1209, 1143, whereas the rest were classified as undetermined.
Data availability
The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH; Pomp, 2022. The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl; Fu, 2022.
Appendix 1
Accuracy of inference from synthetic mature and nascent mRNA data
Inference from synthetic mature mRNA data with external noise
In the main text, Figure 2, we showed how the addition of 5% external noise to synthetic mature mRNA data degrades the inference accuracy. In Appendix 1—figure 1 we show how the addition of a larger amount of external noise (10%) causes an even larger loss of accuracy. In particular for 91% of the parameters, the inference accuracy is higher when using nascent mRNA data (Appendix 1—figure 1a) and the median relative errors become very high for most parameters, especially for $\rho $ and ${\sigma}_{\mathrm{off}}$ in the limit of large ${f}_{\text{ON}}$ (Appendix 1—figure 1b).
Accuracy of distribution fits
In the main text, Figure 2, we showed how the accuracy of parameter estimation is not uniform across parameter space. Here we investigate if there is any relationship between this accuracy and how well is a distribution of mature mRNA numbers / signal intensity fit by the inference algorithm. For 12 parameter sets (6 for the telegraph model – Appendix 1—figure 2a) and (6 for the delay telegraph model – Appendix 1—figure 2b), we evaluate the fits to the distribution of synthetic data in Appendix 1—figure 2c–d. The results show that independent of the accuracy of the parameters estimated by the inference algorithm, the fits of the delay telegraph and telegraph model distributions to the distributions generated from synthetic data are generally excellent.
Testing the variability of the inference procedure
To obtain a better understanding of the variability of the inference procedure, for each of the six parameter sets in Appendix 1—figure 2b and i.e. for the inference using the delay telegraph model, we generated 10 independent sets of synthetic data and used maximum likelihood to infer the parameters for each dataset. The mean and standard deviation of the parameters (computed over all 10 datasets) are shown in Appendix 1—table 1. The means are close to the true parameter values (in Appendix 1—figure 2b); this shows that the inference procedure is working correctly and the deviations from ground truth values are mostly due to noise in the synthetic datasets. We can define the sample variability of a parameter as the standard deviation divided by the mean. Ordering parameters by this quantity, we find that for all six parameter sets the error is largest for ${\sigma}_{\mathrm{off}}$. For small ${f}_{\mathrm{ON}}$, the parameters ordered by increasing error are: ${\sigma}_{\mathrm{on}}$, burst size, $\rho $, ${f}_{\mathrm{ON}}$ and ${\sigma}_{\mathrm{off}}$. While for large ${f}_{\mathrm{ON}}$, the order is: $\rho $, ${f}_{\mathrm{ON}}$, ${\sigma}_{\mathrm{on}}$, burst size and ${\sigma}_{\mathrm{off}}$. Note that this order is the same as we determined using the relative errors (from the ground truth) which is shown in Figure 2 of the main text.
Confidence intervals using profile likelihood
We perform a profile likelihood study (Kreutz et al., 2013) on the 12 parameter sets of synthetic mature/nascent mRNA data described in Appendix 1—figure 2ab. We obtain the 95% confidence interval for each parameter. The results are shown in Appendix 1—table 2. We also compare the relative errors for each parameter (computed as (estimated value  ground truth)/ground truth using the data in Appendix 1—figure 2a and b) with the profile likelihood error (computed as (upper bound  lower bound)/optimal estimate using the bounds in Appendix 1—table 2 and the optimal values in Appendix 1—figure 2a and b). The results are shown in Appendix 1—table 3. Note that in most cases, the parameters ordered by relative error are in agreement with the parameters ordered by profile likelihood error.
Effect of random perturbation of mature mRNA data on inference
To assess the reliability of the inference results due to errors in spot counting, we redid the inference with synthetic mRNA data (and using the telegraph model) perturbed randomly by minus 1/plus 1/unchanged with probability $1/3$. The results are shown in Appendix 1—table 4.
We found that when the fraction of ON time was very small (Set 1), there is a considerable effect of the perturbations on the values of the inferred parameters – this is because in this case, the mean number of mRNA is very small and hence a perturbation of one molecule is very significant. However as expected, the inference results are quite robust when the fraction of ON time is not too small (Sets 2–6).
Effect of lognormal noise in nascent fluorescent signal on inference
Given the synthetic nascent mRNA signal data ${\{{S}_{i}\}}_{i=1}^{N}$ (where $N$ represents the number of samples) generated by delay telegraph model, for each ${S}_{i}$, we perform a random perturbation $\mathrm{\Phi}$ under the following conditions
where $\mathrm{\Phi}$ is a stochastic perturbation satisfying the following constraints
$\mathrm{\Phi}({S}_{i})$ is a random variable sampled from the distribution Lognormal $(\alpha ,\beta )$ whose mean equals $S}_{i$, and the standard deviation equals $0.1\ast {S}_{i}$.
This means the random perturbation keeps the mean value of the signal ${S}_{i}$ unchanged but adds noise with a coefficient of variation equal to 0.1.
In Appendix 1—table 5 we compare the results of inference using synthetic nascent mRNA data with the aforementioned stochastic perturbation and without. As for mature mRNA data, we find that the perturbation has only a significant impact when the fraction of ON time is very small.
Appendix 2
Telegraph model versus delay telegraph model
In this section, we aim to precisely understand the differences between the telegraph and the delay telegraph models. For both models, we define the rate of switching from the ON state to OFF state as ${\sigma}_{\text{off}}$, the rate of switching from the OFF state to the ON state as ${\sigma}_{\text{on}}$ and the production rate of nascent mRNAs in the ON state as $\rho $. The firstorder decay rate of nascent mRNA in the telegraph model is given by $d$ and the delay time between initiation and degradation in the delay telegraph model is $\tau $. Note that while the telegraph model was in the main text explained in terms of mature mRNA, in this section we use it as a model for nascent mRNA since we want to compare directly with the delay telegraph model. The telegraph and delay telegraph models can be solved exactly in steadystate conditions (Peccoud and Ycart, 1995; Xu et al., 2016) (an alternative derivation for the delay telegraph model is also given in Section F). From the generating function solution of the models, one can deduce expressions for the first and second centered moments in steadystate conditions:
where $\u27e8n\u27e9}_{\text{delay}$ and $\u27e8n\u27e9}_{\text{tele}$ are the mean number of nascent mRNA in the delay telegraph and telegraph models respectively, and ${\mathrm{Var}}_{\text{delay}}$ and ${\mathrm{Var}}_{\text{tele}}$ are the corresponding variances in molecule numbers. To understand the differences between the two models, we set the means of the two models to be the same (by choosing $d=1/\tau $) and then compute the relative error in their variance predictions:
where $x$ and $y$ are nondimensional variables defined as
Note that $x$ is the ratio of the time for nascent mRNA to detach from the gene $\tau $ and the timescale of promoter switching $1/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$. The nondimensional parameter $y$ is a measure of the burstiness of gene expression since it increases with the mean burst size $\rho /{\sigma}_{\text{off}}$ and the fraction of time spent in the OFF state ${\sigma}_{\text{off}}/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$. In fact as ${\sigma}_{\text{off}}\to 0$ which implies $y\to 0$, the telegraph model converges to a constitutive model where the time between two successive nascent mRNA production events is exponentially distributed. Note that the relative errors in the Fano factor and the coefficient of variation squared are the same as the error in the variance since the means of the two models are the same.
Using Equation (12), it is easy to see that the relative error between the two models vanishes in the limit of $x\to 0$ (when the promoter switching timescales are much longer than the time spent by a polymerase on a gene) or in the limit of $y\to 0$ (when there is no burstiness in gene expression). Hence, the telegraph model is an accurate approximation of the delay telegraph model in these two limits. It can be shown that in the first case, the distribution of nascent mRNA numbers is well approximated by the sum of two Poisson distributions, whereas for the second case the distribution is a Poisson. Note that since $R$ is always positive, it follows that the telegraph model systematically underestimates the size of noise in nascent mRNA numbers. It can be further shown that $R$ increases monotonically with $x$ and $y$ and the maximum attainable value is $R=1/2$ (when $x\to \mathrm{\infty},y\to \mathrm{\infty}$), that is, the worst prediction of the telegraph model is that the variance is half that of the delay telegraph model.
In Appendix 2—figure 1a and c we contrast the distributions of nascent mRNA predicted by the delay telegraph model with those predicted by the telegraph model for the case $d=1/\tau $ where the means are matched, as also assumed for the calculation of the relative error above. Appendix 2—figure 1b shows the effect of increasing $x$ (via $\tau $) and Appendix 2—figure 1d shows the effect of increasing $y$ (via $\rho $). The number distributions are constructed from the generating functions of the telegraph model Peccoud and Ycart, 1995 and of the delay telegraph model (Section F). Note that the shapes of the distributions of the two models can be considerably different, e.g. the cases $\rho =3$ and $\rho =10$ in Appendix 2—figure 1c shows that the nascent distribution from the delay model is bimodal with peaks at 0 and at a nonzero value but it is unimodal with a nonzero peak from the telegraph model. Hence we conclude that if we are interested in accurately predicting nascent mRNA number distributions then generally the Markovian telegraph model is not a good approximation of the nonMarkovian delay telegraph model.
In the main text, we used the delay telegraph model to infer the synthetic data from the synthetic data generated by the delay SSA. For comparison, we repeat the same but now we use the telegraph model (rather than the delay telegraph model) to calculate step (ii) in the inference algorithm, that is $P(k;\theta )$ in Equation 8 in the main text is now chosen to be the steadystate solution of the telegraph model. Once the best parameter set ${\theta}^{*}$ is found, we calculate two scores: (i) the fitness given by the smallest negative loglikelihood value found by the optimizer normalized over the sample size. (ii) the mean relative error (for definition see Equation (4.5) in the main text). In Appendix 2—figure 1, we show both of these scores obtained for 20 independent numerical experiments – clearly the error using the delay telegraph model for the inference algorithm is significantly lower than if the telegraph model is used.
These observations are further reinforced in Appendix 2—figure 1f where we show the best fit distributions and the corresponding relative errors in the estimates of the burst size and the burst frequency (bar chart insets). These are computed using the formula:
We note that the distributions are well fit in all cases, using both telegraph and delay telegraph models. However the errors $\alpha $ and $\beta $ are considerably larger for the former.
In Appendix 2—table 1 we show the true and estimated parameters for the 6 distributions in Appendix 2—figure 1f. The estimates for $\rho $ are accurate independent of the choice of model; however, the estimates for the promoter switching rates ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$ are far more accurate using the delay telegraph model. Hence we conclude that although inference using the telegraph model provides a histogram that fits well with the synthetic data, the inferred parameter values have little meaning because they are not an accurate reflection of the true parameter values.
Appendix 3
Inference results using mature mRNA data
Influence of segmentation on inference
In main text Section Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference, we used independent segmentation tools to study segmentation artifacts. The inference results under segmentation 1, segmentation 2 and segmentation 2 without counting the transcriptional site are summarised in Appendix 3—table 1.
Confidence intervals for estimates from merged and cellcycle specific mRNA data using segmentation 2
We obtain 95% confidence intervals for the parameters estimated from mature mRNA data under segmentation 2 (shown in Figure 3f of the main text). The confidence intervals are shown for ${\sigma}_{\text{off}},{\sigma}_{\text{on}},\rho $ in Appendix 3—table 2.
Effect of synchronization of gene copies in G2 phase on parameter inference
In the main text, we assumed that the transcription of two allele copies in the G2 phase are independent. Here we consider the opposite scenario where the two allele copies in G2 phase are perfectly synchronized with each other, i.e. when one copy switches from on to off, the other copy also switches from on to off. Note that the time at which mRNA is transcribed from each copy is however not the same. Using this modified simulation algorithm, we reperform the inference using the experimental data under segmentation 2. The results are shown in Appendix 3—table 3. Comparing these values with those estimated for phase G2 under the assumption of allele independence (main text Figure 3f right panel), we find that $\rho $ and ${f}_{\mathrm{ON}}$ are very similar but the other parameters vary considerably.
Appendix 4
Inference results using nascent mRNA data
Inference using noncurated data
We show the inference results of the experimental noncurated nascent mRNA data (with and without taking into consideration the cellcycle) for the four datasets (Appendix 4—table 1) and their 95% confidence interval using the profile likelihood estimate (Appendix 4—table 2). The best fit distributions for merged and cellcyclespecific data are shown in Appendix 4—figure 1.
Inference using curated data
We show the inference results using nascent mRNA data curated with the rejection method in Appendix 4—table 3 and Appendix 4—figure 2, and with the fusion method in Appendix 4—table 4. We also used the profile likelihood method to obtain the 95% confidence intervals for the fusion method with $k=4$; the results are shown in Appendix 4—table 5. We also reinferred the parameters in the G2 phase for fusion corrected data by assuming that the allele states are perfectly synchronised; these are shown in Appendix 4—table 6.
Appendix 5
Calculating the autocorrelation function (ACF) from stochastic simulations
Intensity traces of GAL10 transcription were generated by a stochastic simulation algorithm (SSA) following the delay telegraph model presented in the main manuscript. Briefly, for each yeast cell, the ON and OFF GAL10 promoter states are simulated such that they have an exponentially distributed lifetime with means $1/{\sigma}_{\text{off}}$ and $1/{\sigma}_{\text{off}}$, respectively. When the promoter was ON, the transcription of single GAL10 mRNAs was initiated following a Poisson process with mean equal to the initiation rate $\rho $. The fluorescence trace $I(t)$, as measured experimentally over time, was obtained by convolving the train of mRNA initiation events with the trapezoidal intensity pulse of the PP7 tagging system (shown in Figure 1c of the main text). It is important to note that this last step is necessary to accurately model the fluorescent intensities that are experimentally measured.
To account for cell cycle, we simulated an asynchronous population, where cells could switch from G1 to G2. In the latter, to account for two independent transcription sites, two independent intensity traces were generated and summed over time. The ratio of G1 to G2 cells was 1.15:1, 1.27:1, 1,24:1 and 0.80:1 for datasets 1, 2, 3, and 4, respectively, as determined from the smFISH experiments. It is important to note that: (i) livecell experiments (as in Donovan et al., 2019) are performed in asynchronous populations, without any sorting based on cell cycle phase; (ii) the average G1 phase duration for mother and daughter cells (${t}_{\text{mother}}=2400s$ and ${t}_{\text{daughter}}=6480s$, respectively – unpublished data) is on the order of the typical total acquisition time (e.g. $1800s$). Hence, in our simulations, cells that were G1 at the start of the intensity trace (i.e. livecell experiment) were allowed to switch to G2 before the end of the experiment. Trivially, cells that were G2 at the start remained so until the end of the experiment.
The autocorrelation function (ACF) was calculated as:
where $\mathrm{\Delta}t$ is the timelag (e.g. $30s$ in Donovan et al., 2019), $\u27e8\cdot \u27e9$ denotes the temporal average over $t$, and $\delta I(t)=I(t)\u27e8I(t)\u27e9$. Appendix 5—figure 1 shows the ACFs from Figure 6c of the main manuscript, in which a linear fit is performed to correct for nonstationary or slowlyvarying effects, i.e. switching from G1 to G2 during the experiment, bleaching, heterogeneous galactose induction.
Appendix 6
Derivation of the exact solution for the master equation of the delay telegraph model and finite state projection
Exact solution
The delay telegraph model includes four reactions
where $G$ and ${G}^{\star}$ stand for the active (ON) and inactive (OFF) gene state, respectively, and ${\sigma}_{\text{off}}$ and ${\sigma}_{\text{on}}$ are the activation and inactivation rates, respectively. Once a nascent mRNA is produced, it will be removed from the system after a deterministic time $\tau $. We aim to find the quantitative description of the kinetics of nascent mRNA ($N$). We proceed by deriving the delay chemical master equation (CME) of Equation (15). The same delay models have also been studied by other authors .
Let $P(0,n,t)$ and $P(1,n,t)$ be the probability of observing $n$ nascent mRNAs while the gene state is inactive and active at time $t$, respectively. Consequently, we can state
Note that by instant reactions, we mean all reactions except the delayed reaction that removes nascent mRNA. Since Part A includes all Markovian reactions, it follows from standard arguments Gillespie, 2007 that
The contribution of Part B requires a careful consideration of the history of the process: (i) $(1,{n}^{\prime},t\tau )$ leads to $(1,{n}^{\prime}+1,t\tau +\mathrm{\Delta}t)$ when the gene was ON; (ii) $(1,{n}^{\prime}+1,t\tau +\mathrm{\Delta}t)$ leads to $(i,n+1,t)$ where the gene state $i$ can be either OFF or ON; (iii) $(i,n+1,t)$ leads to $(i,n,t)$. Indeed, since the removal of nascent mRNA is a delayed reaction, (iii) is due to a production reaction in (i) a time interval $\tau $ earlier. The probability of (i) occurring is $\rho \mathrm{\Delta}tP(1,{n}^{\prime},t\tau )$. The probability of (iii) occurring is 1 since every production event is followed by a removal event a time $\tau $ later. The probability of (ii) occurring is:
As for Equation (18), the two ‘+1’s means that the new nascent mRNA was produced during the time interval $(t\tau ,t\tau +\mathrm{\Delta}t)$, and it did not participate in any other reactions, thereby not influencing the probability at time $t+\mathrm{\Delta}t$. In short,
Furthermore, all ${n}^{\prime}$ nascent mRNAs must leave the system prior to time $t$ as they were all born before time $t\tau $. This implies that
Hence the probability that event B occurs is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of ${n}^{\prime}$
where we used ${\sum}_{{n}^{\prime}}P(1,{n}^{\prime},t\tau )=P(1,t\tau )$.
Finally, the contribution of Part C is obtained from simple probability arguments
where we used the same argument as in Equation (21). Using Equations 16; 17; 21; 22 and taking the limit of small $\mathrm{\Delta}t$, we finally obtain the set of delay CMEs
where ${E}^{k}P(i,n,t)=P(i,n+k,t)$ is the step operator. Summing over all possible $n$ for the two equations in Equation (23), we obtain
initiated with the inactive state, in which $P(0,t)$ and $P(1,t)$ are the probabilities of finding a cell at the inactivated and activated state at time $t$, and their solutions are
Besides, it is noted that in Equation (23)
for $t\ge \tau $, and $h(t\tau )=0$ if $t<\tau$. The $n$ nascent mRNAs of the two conditional probabilities $P(i,n,t1,0,t\tau )$ for $i=0,1$ in Equation (23) are produced during time $(t\tau ,t)$, and the pertinent dynamics are that of a reaction system only composed of the three instant reactions in Equation (15). Specifically, we have
where the initial values are $\stackrel{~}{P}(0,n,0)=0$ for any $n$, $\stackrel{~}{P}(1,n,0)=1$ for $n=0$ and equal to 0 otherwise, as well as $P(i,n,t1,0,t\tau )=\stackrel{~}{P}(i,n,\tau )$ for any $i=0,1$.
Let ${G}_{i}(u,t)=\sum _{n}(u+1{)}^{n}P(i,n,t)$ and ${\stackrel{~}{G}}_{i}(u,t)=\sum _{n}(u+1{)}^{n}\stackrel{~}{P}(i,n,t)$. We particularly define the generating function in such a form to simplify the notation. Then, using Equations 23; 24 we obtain
and
where the arguments $u$ and $t$ in the generating functions are suppressed for clarity, and the superscript $\tau $ is used to emphasize the generating function ${\stackrel{~}{G}}_{i}$ up to a particular time $\tau $. The initiation condition of Equation (27) is ${\stackrel{~}{G}}_{0}=0$ and ${\stackrel{~}{G}}_{1}=1$ when $t=0$.
Under the condition of steadystate as $t\to \mathrm{\infty}$, Equation (26) reduces to
with $\overline{h}={\sigma}_{\text{on}}/({\sigma}_{\text{on}}+{\sigma}_{\text{off}})$.
Therefore, by solving Equation (28) we obtain
where we define $\mathrm{\Delta}(u)={\left({\sigma}_{\text{off}}u\rho \right)}^{2}+2{\sigma}_{\text{on}}\left({\sigma}_{\text{off}}+u\rho \right)+{\sigma}_{\text{on}}^{2}$. Finally, we obtain the probability distribution by the following relation
We checked that the obtained distribution agrees exactly with the nascent mRNA distribution of Equation 7 in Xu et al., 2016.
Finite State Projection
The finite state projection (FSP) method Munsky and Khammash, 2006 is an efficient numerical method that can be implemented to solve the CME (23) to any desired degree of accuracy. Because Equation (23) is coupled with Equation (25), here we show how to use a twostep FSP method to obtain the probability distribution.
By assuming that there exists a large $N$ such that $P(i,n,t)=0$ and $\stackrel{~}{P}(i,n,t)=0$ for any $i=0,1$ and $n\ge N+1$, we denote
as a ${N}^{2}$dimensional vector and $\stackrel{~}{\mathbf{P}}(t)\in {\mathbb{R}}^{{N}^{2}}$ similarly. We first use FSP method to solve Equation (25) with the following ordinary differential equation (ODE) up to time
where $\mathbf{A}$ is defined as
with the initial condition
Secondly, we solve the following inhomogeneous ODE
where $\mathcal{S}$ is the rightshift operator, i.e. for any $\mathbf{x}=({x}_{1},{x}_{2},\dots ,{x}_{N})$, $\mathcal{S}\mathbf{x}=(0,{x}_{1},{x}_{2},\dots ,{x}_{N1})$ and
with the initial condition
Hence, the solution of Equation (31) gives the approximate probability distribution $\mathbf{P}(t)$ for any $t>0$.
Note that if one uses the generating function Equation 29 to compute the probability distribution, the computation for the highorder derivatives can cause numerical instabilities due to lack of arithmetic precision – compare the exact solutions with precision 85 and 300 in the left and right panels of Appendix 6—figure 1. Because FSP avoids the computation of the higherorder derivatives, it can be as numerically stable as computing the exact solution with high precision (right panel of Appendix 6—figure 1). In Appendix 6—table 1, we show the computational efficiency of FSP versus numerical evaluation of the exact solution – FSP and the exact solution with high precision are hence comparable both in terms of accuracy and efficiency. In the main text, we use FSP.
Appendix 7
Classification of the cell cycle phase: bimodal Gaussian distribution method versus the Fried/Baisch model
The distribution of DNA content intensities (DAPI) was fit using a bimodal distribution; those cells whose intensity was around one peak were classified as G1 and those around the second peak were classified as G2 (main text Figure 3e). We also did the classification using an alternative method, based on the Freid/Baisch model, which was recently employed in Skinner et al., 2016. In Appendix 7—figure 1 we show the distributions of fluorescent signal intensity for cells in the G1 phase (top row) and for those in the G2 phase (bottom row) for the four data sets described in the main text. Note that the method (bimodal Gaussian or Freid/Baisch) used to classify the cells in G1/G2 does not alter the intensity distribution hence verifying the robustness of our classification method.
Data availability
The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH (copy archived at swh:1:rev:b49af68653e9fdcab3fa48085f648fc86d8c659e). The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl (copy archived at swh:1:rev:be2fcc8f7a811a571a297d3e150395c0a73add09).

Open Science FrameworkID d5nvj. smFISH datasets for PP7GAL10 in budding yeast.
References

Bursty gene expression in the intact mammalian liverMolecular Cell 58:147–156.https://doi.org/10.1016/j.molcel.2015.01.027

Oscillatory regulation of HES1: discrete stochastic delay modelling and simulationPLOS Computational Biology 2:e117.https://doi.org/10.1371/journal.pcbi.0020117

A stochastic model of gene expression with polymerase recruitment and pause releaseBiophysical Journal 119:1002–1014.https://doi.org/10.1016/j.bpj.2020.07.020

Methods of digital video microscopy for colloidal studiesJ Colloid Interface Sci 179:298–310.https://doi.org/10.1006/jcis.1996.0217

Stochastic models of gene transcription with upstream drives: exact solution and sample path characterizationJournal of the Royal Society, Interface 14:126.https://doi.org/10.1098/rsif.2016.0833

Mammalian gene expression variability is explained by underlying cell stateMolecular Systems Biology 16:e9146.https://doi.org/10.15252/msb.20199146

EstrogenDependent control and celltocell variability of transcriptional burstingMolecular Systems Biology 14:e7678.https://doi.org/10.15252/msb.20177678

SoftwareRNAInferenceTool.jl, version swh:1:rev:be2fcc8f7a811a571a297d3e150395c0a73add09Software Heritage.

Stochastic simulation of chemical kineticsAnnual Review of Physical Chemistry 58:35–55.https://doi.org/10.1146/annurev.physchem.58.032806.104637

Extrinsic noise and heavytailed laws in gene expressionPhysical Review Letters 124:108101.https://doi.org/10.1103/PhysRevLett.124.108101

Exactly solvable models of stochastic gene expressionJ Chem Phys 152:144106.https://doi.org/10.1063/1.5143540

NASCseq monitors RNA synthesis in single cellsNature Communications 10:1–9.https://doi.org/10.1038/s41467019110289

Thiollinked alkylation of RNA to assess expression dynamicsNature Methods 14:1198–1204.https://doi.org/10.1038/nmeth.4435

Automatic processing and interpretation of DNA distributions: comparison of several techniquesComputers and Biomedical Research 11:393–404.https://doi.org/10.1016/00104809(78)900204

Profile likelihood in systems biologyFEBS Journal 280:2564–2571.https://doi.org/10.1111/febs.12276

A single molecule view of gene expressionTrends in Cell Biology 19:630–637.https://doi.org/10.1016/j.tcb.2009.08.008

Singlemolecule mrna detection in live yeastCurrent Protocols in Molecular Biology 113:14.https://doi.org/10.1002/0471142727.mb1424s113

Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeastMolecular Systems Biology 7:458.https://doi.org/10.1038/msb.2010.112

What shapes eukaryotic transcriptional bursting?Molecular BioSystems 13:1280–1290.https://doi.org/10.1039/c7mb00154a

Markovian modeling of geneproduct synthesisTheoretical Population Biology 48:222–234.https://doi.org/10.1006/tpbi.1995.1027

Effects of cell cycle variability on lineage and population measurements of messenger RNA abundanceJournal of the Royal Society, Interface 17:20200360.https://doi.org/10.1098/rsif.2020.0360

Cell cycle staging of individual cells by fluorescence microscopyNature Protocols 10:334–348.https://doi.org/10.1038/nprot.2015.016

Consequences of mRNA transport on stochastic variability in protein levelsBiophysical Journal 103:1087–1096.https://doi.org/10.1016/j.bpj.2012.07.015

Precise nanometer localization analysis for individual fluorescent probesBiophysical Journal 82:2775–2783.https://doi.org/10.1016/S00063495(02)75618X

What is a transcriptional burst?Trends in Genetics 36:288–297.https://doi.org/10.1016/j.tig.2020.01.003

Expression homeostasis during DNA replicationScience 351:1087–1090.https://doi.org/10.1126/science.aad1162

Stochastic kinetics of nascent RNAPhysical Review Letters 117:128101.https://doi.org/10.1103/PhysRevLett.117.128101

Singlerna counting reveals alternative modes of gene expression in yeastNature Structural & Molecular Biology 15:1263–1271.https://doi.org/10.1038/nsmb.1514

Analytical results for a multistate gene modelSIAM Journal on Applied Mathematics 72:789–818.https://doi.org/10.1137/110852887

Cellcycle dependence of transcription dominates noise in gene expressionPLOS Computational Biology 9:e1003161.https://doi.org/10.1371/journal.pcbi.1003161
Decision letter

Anna AkhmanovaSenior and Reviewing Editor; Utrecht University, Netherlands
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
[Editors' note: this paper was reviewed by Review Commons.]
Thank you for submitting your article "Quantifying how posttranscriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions" for consideration by eLife. Your article has been reviewed by 3 peer reviewers at Review Commons, and the evaluation at eLife has been overseen by Anna Akhmanova as the Senior Editor in consultation with two of the three original reviewers. We apologize for the delay in coming back to you, which was due to the fact that the reviewers have taken some time to respond.
The authors made substantial modifications to the text to accommodate reviewer comments and changed the title to better describe their results. The claims are now easier to assess in the context of the existent literature. Additionally, they also modified figures to more explicitly back their claims. Overall, these changes make the manuscript easier to understand.
However, there are some remaining issues that need to be addressed, as outlined below:
1. A main concern that remains is the discrepancy between nascent and mature mRNAs. The authors argue that this is presumably due to large mRNA turnover rates, which may be expected when the coat protein is not expressed. First, it would be great if the authors could back up this statement with appropriate references. Second, if I take the inferred transcription rates of ~36mRNA/min and the fraction the gene is on ~0.8, this leads to an effective transcription rate of roughly 30mRNA/min (the elongation delay should only lead to a static delay, but not affect the average amount of mature mRNAs that we observe at steady state). Now, in order to obtain an average steady state of mature mRNA around 10 transcripts (e.g., Fig. 3f), I require the mRNA degradation rate to be around 3min^1. This would give a halflife of roughly log(2)/3=0.2min, which seems to be exceedingly low. According to literature, mRNA halflives in yeast are typically in the order of 30min and only for very few mRNAs falls below 10mins (see e.g., Geisberg et al., Cell 2014). If we trust this simple calculation above, that would mean that the mRNA halflive is substantially smaller than the lowest reported values that I could find in the literature. I therefore still worry that there might be an issue with the data/spot extraction. At the very least, I would expect such mismatch and its potential origins to be discussed critically in the paper.
2. The second remaining concern relates to the validation against timeseries data. I appreciate that the discussion around the calculation of the ACF has been extended. The authors argue that a comparison between statistics obtained from smFISH and livecell data is not possible due to a potential detection threshold in the former. But why am I then allowed to compare the autocorrelation function (that is derived from calibration against smFISH data) with the autocorrelation of the livecell data, which also is a statistic of the same transcriptional process? Or conversely, if it was the case that the mRNA mean and variance don't match between experiments, why should we expect the autocorrelation to match? I would be grateful if the authors could clarify these questions. Moreover, I still believe that the term "nonstationary effects" is misleading, since the slowly varying components to the ACF still need to be assumed to be in stationarity in order to capture them by an ACF with just a single time parameter (e.g., the linear fit). Perhaps "slowlyvarying" would be a better term.
https://doi.org/10.7554/eLife.82493.sa1Author response
We thank all referees for their constructive comments which has helped us to improve the manuscript significantly. Below we provide a pointbypoint response.
Reviewer 1:
Referee Summary and Significance
Summary:
In this study, the authors consider the problem of inferring transcription dynamics from smFISH data. They distinguish between two important experimental situations. The first one considers measurements of mature mRNAs, while the second one considers measurements of nascent mRNA through fluorescent probes targeting PP7 stem loops. The former problem has been previously dealt with extensively, but less work has been done on the context of the latter. The inference approaches are based on maximum likelihood estimation, from which point estimates for promoterswitching and transcription rates are obtained. The study focuses on steady state measurements only. The authors perform several analyses using synthetic data to understand the limitations of both approaches. They find that inference from nascent mRNA is more reliable than inference from mature mRNA distributions. Moreover, they show that accounting for different cellcycle stages (G1 vs G2) is important and that pooling measurements across the cellcycle can lead to quantitatively and even qualitatively different inferences. Both approaches are then used to analyze transcription in an experimental system in yeast, for which they find evidence of gene dosage compensation. I consider this an interesting and relevant study, which will appeal to the systems and computational biology community. The paper is well written and the (computational) methods are described in detail. The experimental description is quite minimal and could profit from further details / explanations. I have several technical criticisms and questions, which I believe should be addressed before publication. Since I am a theorist, I will comment predominantly on the statistical / computational aspects.
Significance:
Quantifying kinetic parameters from incomplete and noisy experimental data is a core problem in systems biology. I therefore consider this manuscript to be very relevant to this field. The contribution of this manuscript is largely methodological, although its potential usefulness is demonstrated using experimental data in yeast.
Thank you for the overall positive assessment of our paper. Below, we respond to each comment in detail.
Referee Comment (Major)
– A key reference that is missing is Fritzsch et al. Mol Syst Biol (2018). In this work, the authors have used nascent mRNA distributions and autocorrelations (obtained from liveimaging) to infer promoter and transcription dynamics. I believe this work should be appropriately cited and discussed.
Thank you for pointing out this important reference. We have now added it to the Introduction and the Discussion.
Referee Comment (Major)
Synthetic case study:
– Inference and point estimates. The authors use a maximumlikelihood framework to extract point estimates of the parameters. Subsequently, relative absolute differences are used to assess the accuracy of the inference. However, as far as I have understood, this is performed for only a single simulated dataset, for each considered parameter configuration. The resulting metric, however, does not really capture the inference accuracy, since it is based on a single (random) realization of the MLE. I would recommend to at least repeat the inference multiple times for different realizations of the simulated dataset (per parameter configuration) to get a better feeling of the distribution of the MLE (e.g., its bias / variance). Alternatively, identifiability analyses based on the Fisher information could be performed for (some of) the different parameter configurations although this may be computationally more demanding.
We thank the reviewer for the suggestion. In SI Section 1.3 we have added numerical experiments to obtain a better understanding of the variability of the inference procedure. For each of the 6 parameter sets in SI Figure 2b, i.e. for the inference using the delay telegraph model, we generated 10 independent sets of synthetic data and used maximum likelihood to infer the parameters for each dataset. The mean and standard deviation of the estimated parameters (computed over all 10 datasets) are shown in SI Table 1. We can define the sample variability of a parameter as the standard deviation divided by the mean. Ordering parameters by this quantity, we find that for all 6 parameter sets the error is largest for σ_{off}. For small f_{ON}, the parameters ordered by increasing error are: σ_{on}, burst size, ρ, f_{ON} and σ_{off}. While for large f_{ON}, the order is: ρ, f_{ON}, σ_{on}, burst size and σ_{off}. Note that this order is the same as we determined using the relative errors of each of the parameters (from the ground truth) which we have added to Figure 2 of the main text. Note further that the same order is found by using profile likelihood error (SI Section 1.4, SI Tables 2 and 3). Since from experimental data, only the sample variability and the profile likelihood error are available, it follows that the results of our synthetic data study in Figure 2 based on relative error from the ground truth have wide practical applicability. We also note that the means of the parameters (calculated over the parameters inferred from 10 independent synthetic datasets) in SI Table 1 is close to the true parameters (in SI Figure 2b); this shows that the inference procedure is working correctly and the deviations from ground truth values are mostly due to noise in the synthetic datasets.
Referee Comment (Major)
– It would be useful to include confidence intervals based on profile likelihoods also for the synthetic case study, in particular for the 6 reported datasets. I would also find it helpful to see comprehensive profile likelihood plots for the key results / parameter inferences in the supplement. This would also provide useful insights into the identifiability of the parameters.
As suggested by the reviewer, we have performed a comprehensive profile likelihood study for the inference on synthetic data and obtained the 95% confidence intervals. We added these to SI Section 1.4. We have also provided confidence intervals for estimates from merged and cellcycle specific mRNA data in SI Section 3.2 and Table 8; for merged nascent noncurated data in SI Section 4.1 and Table 11; for cellcycle specific nascent data curated using the fusion method in SI Section 4.2 and Table 14.
Referee Comment (Major)
Experimental case study:
– Validation against livecell data. In the simulation of the autocorrelation function, what was the ratio of cells initialized in G1 / G2, respectively? I’d expect this to have direct influence on the simulated ACF. Moreover, a linear fit is used to correct for ”nonstationary effects” in the ACF that supposedly stem from cellcycle dynamics. First, I don’t think this terminology is really accurate, since nonstationarity would lead to an ACF that depends on two parameters (τ_{1} and τ_{2}). I suppose the goal of the linear correction is to remove slow / static population heterogeneity? If yes, wouldn’t it be easier / more direct to also change the simulations to nonsynchronized cellcycles? In this case, they should also display the very slow / static components as displayed in the data, which would eliminate the need for the posthoc correction. I was also wondering whether other statistics (e.g., mean, variance, distributions) match between the simulations and the livecell experiment? This could provide further validation of the inferred parameters.
The ratio of G1 to G2 cells was 1.15:1, 1.27:1, 1,24:1 and 0.80:1 for datasets 1, 2, 3 and 4 respectively, as determined from the smFISH experiments. As correctly pointed out by the reviewer, we correct for nonstationary affects in the ACF to remove slow extrinsic heterogeneity, which includes but is not limited to cell cycle. In addition to cell cycle, there are many nonstationary effects, such as bleaching or heterogeneous galactose induction. Since each of these nonstationary effects has its own kinetics and contribution to the ACF, we are limited by posthoc correction of the ACF. We clarified these points in SI Section 5. Furthermore we note that while in principle, the mean, variance and the distributions of intensities in the livecell data should be the same as the smFISH data, in reality for the former there is often a detection threshold, and hence the statistics of both distributions cannot be matched directly.
Referee Comment (Major)
– If I understood correctly, the signal intensity of the measured transcription spot is normalized by the median cytoplasmic spot brightness. Since the normalized intensity of a single complete transcript is 1, the cumulative intensity should give a lower bound on the nascent mRNAs. The histograms in Figure 4b show intensity values in the range of 30, which would mean that at least 30 transcripts contribute to the transcription spot. The total number of nucleoplasmic and cytoplasmic mRNA, however, is in the range of 10 (Figure 3a). I am probably missing something but how can we reconcile these numbers? The authors mention that the brightest spot just counts for one transcript, but argue that this has negligible influence on mature RNA counts. Could this be a possible explanation for the mismatch?
The difference between mature mRNA count and the nascent transcript count is likely the result of high mRNA turnover. Addition of PP7 loops to the gene is known to affect mRNA stability, when the PP7 coat protein is not expressed. We note that high mRNA turnover should aid in determination of kinetics, as it more closely matches the transcriptional activity. In addition, low mRNA counts allow for better fitting of spots to determine their intensity, in the small volume of the yeast cell. These advantages allow us to make a proper comparison between nascent and mature distributions. We clarified this in the main text (Section 2.2.1). To test the contribution of counting the transcription site as one transcript on the estimated parameters, we now compared two methods, where we either include or exclude the transcription site from the mature RNA distribution. As shown in the main text Figure 3c,d, subtracting the transcription site (shown as seg2TS) has only a small influence on the fitted parameters (compare to seg2). A discussion can be found in the last paragraph of Section 2.2.1.
In the experimental case study, the authors argue that the ”correct” inference result is the one that accounts for cellcycle stage, while the other one termed ”incorrect” I find this terminology too strong, since every estimate is subject to uncertainty.We agree and changed to phrasing to ”optimal” or ”matched the livecell data better”.
Page 2: ”… in a asynchronous population” → ”… in an asynchronous population”
We corrected this typo.
Page 7: ”…parameters sets 3 and 4” → ”…parameter sets 3 and 4”We corrected this typo.
Figures 5a and 6a: parameter names and units should go on the yaxis.
In the new figures, the equivalent of the previous Figures 5a and 6a are Figures 5b and 6a, respectively. Note that we have now changed these to bar graphs.With this format, we have included the parameter names and units on the yaxis and we also added the parameter names on top to make it visually more clear.
In the experimental case study, the authors argue that the ”correct” inference result is the one that accounts for cellcycle stage, while the other one termed ”incorrect”. I find this terminology too strong, since every estimate is subject to uncertainty.
Reviewer 2:
Referee Summary and Significance
Summary:
In the manuscript Fu and coauthors compare accuracy for 2 models that infer kinetics of the transcription from synthetic and experimental data. Specifically, they compare the telegraph model for mRNA and the delayed telegraph model for nascent RNA. They first provide the comparison for synthetically simulated data, and derive that the latter exhibits higher accuracy. Next they apply the model to experimental data from smFISH for PP7GAL10 strain, and provide the framework to estimate the number of mRNAs and use the intensity at the transcriptional site to infer the number of bounds of polymerase during the transcription (nascent RNA). For the latter, I appreciate that they account for the fact that intensity throughout the transcription will depend on ’spatial’ position of polymerase and incorporate this into the framework to infer nascent RNA levels. Additionally, for the experimental data they infer kinetics with and without accounting for cell cycle (accordingly 1 or 2 gene copies), and through comparing to life imaging data from Donovan et al., 2019, they suggest that the model that best describes experimental data is delayed telegraph for nascent RNA when accounted for cell cycle. Finally they provide 2 approaches – called rejection and fusion – to account for potential artifacts in estimation of nascent RNA levels from the intensity at transcriptional sites, and provide the comparison of how this approaches affect the overall fit.
Whereas it is important to have a systematic understanding/comparison for both models as well as for how accounting of cell cycle might improve the overall accuracy, some of the aspects of the results/estimation of values from experimental data require more thorough analysis.
Significance:
This paper is somewhat outside my core expertise, although closer to the expertise of my postdoc who assisted with the review.
The work is interesting but the generalisability of the conclusions is somewhat limited, partially by the lack of experimental validation. Nevertheless, there are interesting aspects of the study and the area of research is important.
Overall, the main statements of the paper – that cell cycle specific inference from the experimental data using delayed telegraph model from nascent RNA performs best (compared to telegraph model from mRNA or not cell cycle specific) are supported, and I agree that understanding of the limitations of the currently popular models (telegraph for mRNA and/or not accounting for cell cycle) is an important addition to the field. I would be happy to further proceed with the revision/acceptance of the paper if the (major) comments above are addressed/considered.
Thank you for the overall positive assessment of our paper. Below, we respond to each comment in detail.
Referee Comment (Major)
Comparison of the models for simulated data.
In the first two chapters of the results the authors compare simulations/parameter inference from the synthetic data for the telegraphbased model for mRNA and delayed telegraph model for nascent RNA, and conclude that the latter provides better accuracy. However, based on the relationship for mean relative error distribution as a function of fON, it seems to me that both models show very similar results, and the support of better accuracy for nascent RNA seems unclear to me. Additionally, simulations are performed for the concise number of parameter sets, and it is unclear how well/uniformly the chosen sets cover the parameter space.
I suggest that more thorough analysis is required. One way to do so would be to perform simulations on the same set of parameters that comprehensively cover the parameter space for both models and compare mean error rates in pairwise fashion. Additionally, it might be worth considering comparing error rate for each parameter separately (i.e. for σon, σoff and the production rate of mRNAs when promoter is on).
We agree and we now report a detailed study to take into account all the points raised above. Specifically, we calculate the relative errors (from the true values) for each inferred parameter from 789 synthetic nascent and mature mRNA datasets. Each of these datasets corresponds to an independent parameter set sampled on a grid. We generate parameter sets on an equidistant mesh grid laid over the space:
(σ_{off}, σ_{on}, ρ) ∈ [Uniform(0, 150), Uniform(0, 150), Uniform(0, 250)] , (1)
where the units are inverse minute. Furthermore we apply a constraint on the effective transcription rate $\hat{\rho}=\frac{{\rho \sigma}_{\mathrm{o}\mathrm{n}}}{{\sigma}_{\mathrm{o}\mathrm{n}}+{\sigma}_{\mathrm{o}\mathrm{f}\mathrm{f}}}<100.$ In each of the three dimensions of the parameter space, we take 10 points that are equidistant, leading to a total of 1000 parameter sets which reduce to 789 after the effective transcription rate constraint is enforced. The results are presented in Figure 2 of the main text and SI Figure 1. Note that for the telegraph model, besides the usual model, we consider an additional version where we add noise to the initiation rate to model posttranscriptional noise (due to splicing, export, etc) in mature mRNA data that is not present in the nascent mRNA data. In Figure 2b, c and SI Figure 1 we show that the mean relative error is generally smaller for the delay telegraph model (inference from synthetic nascent data) than using the telegraph model (inference using synthetic mature mRNA data with or without posttranscriptional noise). Furthermore in Figure 2d we show plots of the median relative error (for each parameter individually) as a function of the fraction of ON time. Several conclusions can be made including: (i) the errors in σ_{on} (the burst frequency), σ_{off} and the burst size ρ/σ_{off} tend to increase with f_{ON} while for the rest of the parameters (ρ and the estimated value of f_{ON}) there is a decreasing tendency; (ii) for small f_{ON}, the best estimated parameters are the burst frequency and size while for large f_{ON}, it was ρ and the estimated value of f_{ON}. The worst estimated parameter was σ_{off}, independent of the value of f_{ON}; (iii) the addition of noise to mature mRNA data (to model posttranscriptional noise) had a small impact on inference for small f_{ON}; in contrast, for large f_{ON} the noise appreciably increased the relative error in σ_{off} and to a lesser extent the error in the other parameters too.
Referee Comment (Major)
An additional analysis of the accuracy of the estimated values from the experimental data.
We have now carried such an analysis by (i) calculating the 95% confidence intervals using the profile likelihood method. For merged and cellcycle specific mRNA data these can be found in SI Section 3.2 and Table 8; for merged nascent noncurated data in SI Section 4.1 and Table 11; for cellcycle specific nascent data curated using the fusion method in SI Section 4.2 and Table 14; (ii) to test the contribution of counting the transcription site as one transcript on the estimated parameters, we now compared two methods, where we either include or exclude the transcription site from the mature RNA distribution. As shown in the main text Figure 3c,d, subtracting the transcription site (shown as seg2TS) has only a small influence on the fitted parameters (compare to seg2). A discussion can be found in the last paragraph of Section 2.2.1.
Referee Comment (Major)
When it comes to experimental data, the overall fit of any proposed model will depend on both the suitability/correctness of a model to explain the process in question as well as the reliability of the estimates (inputs for the model) from the experiments. Specifically, it is possible that a model (either telegraph for mRNA or delayed telegraph for nascent RNA or both) to explain transcriptional kinetics is fairly accurate, but the input estimates (for accordingly mRNA or nascent RNA) are biased (due to technical artifacts from the experiment and/or the approach towards estimating those values), thus affecting the overall fit of a model and interpretation of the results.
I appreciate that authors address one potential artifact in estimating nascent RNA, where it is possible that the intensity of nascent RNA is overestimated if it is mistakenly confused with mRNA. I suggest that the more detailed analysis of the accuracy for both the number of mRNA molecules and the intensity of nascent RNA is required to provide better insight in how reliably those values are estimated and accordingly whether models might perform poorly due to biased estimates.
We thank the reviewer for the good suggestion. Numerical experiments testing the reliability of the inference procedure to random perturbation of mature and nascent mRNA data is now reported in SI Sections 1.5 and 1.6 and the associated SI Tables 4, 5. Specifically for mature mRNA we randomly changed the number by minus 1/plus 1/unchanged with probability 1/3. For nascent mRNA, the perturbation we considered consisted of the addition of lognormal noise to the fluorescent signal. In all cases, we found the inference results were practically unchanged, except for cases where the gene spends a huge amount of time in the OFF state. This is not the case for our experimental data and hence we are confident that the inference results are robust. In the main text we have also added a new section Section 2.2.1 to discuss how various experimental artefacts (overlapping spots, segmentation method, mistaking a transcription site with mature mRNA) may affect the inference – see the response to the next comment.
Referee Comment (Major)
Mature mRNA:
More detailed method section covering the estimate for background signal and spot detection.
A potential proximity of mRNA molecules resulting in underestimation of the total number of mRNAs, and how this might affect the fit of the telegraph model. Even though smFISH has been widely used to estimate the number of mRNA molecules (as a total number of spots), the technique has been mostly applied to mammalian cells with considerably bigger cell size. Additionally, the usage of the total number of mRNA molecules in order to estimate transcriptional kinetics from the telegraph model seemingly requires a highly accurate estimate of the total number of molecules. Combined, it is not obvious if potential underestimation of mRNAs (specifically in cells with high number of mRNAs) via smFISH in budding yeast cells might lead to the misleading interpretation of the results. One way to assess whether such ’merging’ takes place is to look into the distribution of intensities for cytoplasmic spots (per cell and/or all the cells in the whole field of view). If those distributions frequently show bi/multimodal behavior, it is worth considering whether a proposed way to estimate mRNA number is suitable in for given model organism/growth conditions/gene, and further extend the analysis on simulated data to provide the robustness of the fit of the telegraph model for mRNAs in cases whether number of mRNAs is underestimated.
The reviewer is correct that a highly accurate estimate of the total number of mRNAs is optimal for proper estimates of the transcriptional parameters. We now added an extra section (2.2.1) to describe these possible artifacts and their potential effect on the parameter estimates. Briefly, since cytoplasmic PP7GAL10 has a high turnover, the number of RNAs allows us to accurately identify all spots, without ”merging” of spots. This is also apparent from the distribution of the intensities of cytoplasmic spots, which follows a unimodal distribution (orange colored distribution in the centre of Figure 4a) where ≈ 90 percent of the detected spots fall within half a standard deviation of the median. Another possible source of error is cell segmentation. To test how cell segmentation errors contribute to the mature mRNA distribution and the error estimates, we included a comparison of two independent segmentation tools. The inferred results are reported in Figure 3b,c,d. Segmentation 1 often results in missed spots, and underestimation of the mRNA count, which reduces the burst size and f_{ON} compared to segmentation 2. Because of the higher accuracy of the latter we used segmentation 2 for the inference from merged and cellcycle specific mature mRNA data. This analysis further illustrates the challenges associates with parameter estimation from mature mRNA counts, as nascent mRNA data is not affected by the segmentation method.
Referee Comment (Major)
A more minor issue, but authors state that, for each cell, the highest intensity of the nuclear spot will count as one mRNA, and that it has a negligible influence. I would appreciate a more thorough analytical explanation for this or an additional analysis on the simulated data to support how random +/1 of mRNAs might affect results of the fit, specifically for cases with low average mRNA estimate.
We analyzed how removal of the transcription site from the experimental mRNA count affects the inferred parameters. As shown in Figure 3c and d subtracting the transcription site has only a small influence on the estimated parameters. We included a description of this analysis in Section 2.2.1. As mentioned in response to a previous comment, numerical experiments testing the reliability of the inference procedure to random perturbation of mature and nascent mRNA data are now reported in SI Sections 1.5 and 1.6 and the associated SI Tables 4, 5. In all cases, we found the inference results were practically unchanged, except for cases where the gene spends a huge amount of time in the OFF state, which is not the case for our experimental data.
Referee Comment (Major)
Nascent RNA:
I might be missing something, but it seems that for cells in late G2 phase where nucleus is either strongly elongated (and looks like a sand clock) or even exhibits 2 separate nuclei connected with the chromatin bridge – 2 copies of the gene can be spatially resolved and therefore it might happen that 2 independent/separate brightest spots (one per each cell) amount to total estimate of nascent RNA in cases where promoter is on simultaneously in both copies? If so, depending on estimated in the study/prior literaturebased estimates for σon/off, the probability of simultaneous transcription might vary and this should be taken into account? This also might partially explain the phenomenon of lower transcriptional activity in G2 which is currently suggested to be explained with dosage compensation? Or are those cells considered as 2 cells in G1? If so, it needs to be specified in the text.
The reviewer is correct that cells in late G2 have elongated nuclei which may contain 2 transcription sites. When the nucleus moves into the bud, buds often contain less DNA than G1 cells, and mothers contain more DNA than G1 cells, both of which are excluded from the analysis. When the DNA content of the mother and daughter is similar, both mother and daughter are counted separately as G1 cells. We note that this late G2 subpopulation is very small. We included a description of this to Methods Section 4.3. For the cells in early G2, the two gene copies cannot be easily resolved in yeast cells since they are within the diffraction limit. In our manuscript, we assumed that these two transcription sites are independent. In SI Section 3.3 and SI Table 9 we show that the relaxation of the assumption of independence between the allele copies in G2 (by instead assuming the opposite case of perfect state correlation of the two alleles) has practically no influence on the inference of the two best estimated parameters (ρ, f_{ON}) from mature RNA data (compare the values for phase G2 in Figure 3f and those in SI Table 9). The same is shown for inference from nascent mRNA data in SI Table 15. We note at the end of Section 2 in the main text that while such perfect synchronization of alleles is unlikely, some degree of synchronization is plausible and further improvement of the transcriptional parameters in the G2 phase will require its precise quantification. This is however beyond the scope of the current manuscript since as previously mentioned it is currently difficult to reliably resolve two alleles in yeast cells.
Referee Comment (Major)
Additionally, I suggest that images from microscopy can be provided as a supplement to aid clarity in how cell cycle, number of mRNAs and intensity for nascent RNA were estimated.
We have uploaded the data to https://osf.io/d5nvj/. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH. The code for the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl. In addition, we adjusted the text and figure to provide clarity on how the data was processed. We included the DAPI intensity distribution in Figure 3e and updated the description in the Methods (section 4.3) to explain how this distribution was used for the selection of G1 and G2 cells. We also included images of the cells in Figure 3a,b and c to show how the number of RNAs was counted and how the segmentation was performed to obtain the mature RNA distributions. Lastly, we added Figure 4a to show how the intensity distribution of the cytoplasmic spots was used to estimate nascent transcripts at the transcription site.
Referee Comment (Major)
The analysis of the experimental data consists of the (I presume highly comparable with Donovan et al., 2019) single condition (i.e. galactose concentration, glu/galactose ratio) resulting in a single parameter set for transcriptional kinetics. Specifically, it is estimated that σ on and off will be comparable for the given set up, and therefore, based on simulated data, the estimates will be somewhat reliable for the cell cycle accounted delayed telegraph for nascent RNA. I wonder how in practice (i.e. estimated from the experiments) the same model will perform for a different set of parameters/different conditions. Ideally, I would suggest performing the similar experiment, but where σ on/σ off is expected to be different. One way to achieve this with the GAL10 / galactose set up is to tune the glu/gal ratio of the media. Even without a comparison to livecell tracing, the analysis of estimated parameters for merged and cell cycle specific data can shed light on how suitable the model is for alternative parameters.
Alternatively, if the experiment is currently not feasible, I would appreciate a more extensive discussion of the practical suitability of the cellcycle specific delayed telegraph model for nascent RNA for alternative sets of transcriptional parameters. Considering that the comparison was performed only against ’simple’ telegraph model and in introduction authors mention a variety of ’improved’ models for mRNA, that account for various sources of heterogeneity, they might be more suitable for alternative set of transcriptional parameters, and might be more suitable that cell cycle specific delayed telegraph for nascent RNA.
Although we appreciate the added value of an extra experimental dataset in different conditions, we question whether this gain outweighs the required time investment. We feel that this questions is already sufficiently addressed by our novel thorough analysis across a large parameters space using synthetic data, described above and in Figure 2, which indicates that inference from nascent transcription distributions performs better for the majority of the parameters space. As suggested by the reviewer, we included a section in the discussion on the suitability of more complex models models. We discuss that these more complex transcription models, although perhaps capturing the kinetics better, they also have many more parameters, meaning that in practice very few will be identifiable with the current set of experimental observables. To use these models, one would thus need the acquisition of temporal data or simultaneous measurement of various sources of extrinsic noise (single cell transcription factor and RNAP number measurements, cellular volume, local cell crowding, etc) which influence gene expression. This is far beyond the scope of the current manuscript, but will be interesting to explore in future studies.
Referee Comments (Minor)
Current method section is lacking the description of the growth media, which is an important aspect to specify when it comes to budding yeast (particularly when the sugar source is different from the standard glucose and/or results are compared to another publication).
We agree this should be included and added the growth media (synthetic complete media with 2 % galactose) to the Methods section 4.3
In the figure 2b I find the cartoon a little misleading – specifically why polymerase is bound when the promoter is off? If it is to illustrate the case when transcription/polymerase bound occured after promoter is switched off, why there are no polymerase to the right from the current one (as in in the case where promoter is on)?
We agree and we changed the cartoon (Figure 1c), by showing no Pol II when the promoter is off.
In table1 – there is a typo in the 2nd metarow – I suspect it should say G2?
This is corrected (SI Table 10).
Reviewer 3:
Referee Summary and Significance
Summary:
In this work, the authors investigate the effect of using mature mRNAs instead of only nascent mRNA (located at the transcription site) when estimating transcriptional kinetics parameters from singlemolecule fluorescent in situ hybridization (smFISH) experiments. The authors find that using nascent mRNA and correcting for cell cycle effects yields more accurate parameter estimates than using mature mRNAs. The author performs smFISH experiments of the GAL10 gene in yeast to test their findings. Also, the authors test different methods to obtain parameter estimates in cases where there is no information about the location of the transcription site.
Significance:
As described above, some claims do not seem novel considering the references in this manuscript. This is not a problem; the authors can soften their claims to novelty without compromising their other claims. Previous works that estimated mRNA transcription kinetic parameters by quantifying nascent mRNA recognized that using mature mRNA would incur in parameter estimation errors. They considered it evident that quantifying the process closer to the transcription site would improve estimates. Similarly, it was also apparent that adding missing information (the gene copy number based on cell cycle information) would improve parameter estimates. That is why the authors presenting those arguments as findings is unnecessary. However, it is true that here the authors are interested in the level of error, not the fact that getting more accurate (or relevant) measurement will improve estimates.
An item that the authors may want to emphasize is their finding that it is possible to correct for measurements where the identity of the transcription site is unknown. All the works that they cite where nascent mRNA is measured using some method to localize the position of the transcription site. I mammalian cells and fly embryos, it is possible to label introns to identify mRNA located at the transcription site. That is not possible in many yeast genes or other microorganisms.
Which audience would be most interested in this work? I think those searching for methods to quantify transcriptional kinetics in organisms where the identity of the transcription site cannot be measured by smFISH or other novel methods such as CasFISH.
I performed studies of transcriptional kinetics in bacteria during my doctorate, and I continue utilizing smFISH in my research.
Overall comment:
Overall, I think the current experiments are sufficient to support their claims. Also, the description of methods and references is appropriate to allow other researchers to reproduce their observations. Finally, the experiments are replicated, and enough cells are analyzed to provide enough statistical significance to their claims.
Thank you for the overall positive assessment of our paper. Below, we respond in detail to each comment.
Referee Comment (Major)
1. The authors make multiple claims of novelty that conflict with work described in some of their references, particularly: Skinner et al., eLife, 2016; Xu et al., Nature Methods, 2015 and Physical Review Letters, 2016 (References #26,27 and 24 in their manuscript). I could find several instances where the scope of their claims was unclear. Below I describe some cases:
a. The title of this paper, “accurate inference of stochastic gene expression from nascent transcript heterogeneity” could also be the summary conclusion of the three works cited above. However, later in the Introduction of the manuscript, the authors state that their goal is to “understand the impact of posttranscriptional noise and celltocell variability on the accuracy of transcriptional parameters inferred from mature mRNA data,” a related yet different topic. I would change the title of the manuscript to reflect their main goal better.
b. I would make their claims of novelty more specific. For example, at the end of the abstract, the authors claim that “our novel data curation method yields a quantitatively accurate picture of gene expression.” Quantifying nascent mRNA using smFISH to obtain transcription kinetic parameters has been done before (the references above are an example) also developing the modeling tools to do so (for example, in Xu et al., Physical Review Letters, 2016). What is, exactly, the novelty in their approach? They need to make that explicit or soften their claims.
c. In the Introduction, when discussing the effect of the cell cycle in parameter estimation, they write: “Since estimation of all transcriptional parameters (…) from nascent data as a function of the cell cycle phase has not been reported”. However, the work they reference (Skinner et al., eLife, 2016) shows such measurements for multiple transcriptional parameters for different cell cycle stages. The original work may not have gone as far as the current work, but it is unclear what has been done before from the way the authors describe earlier literature.
d. The authors develop a new formulation of the delay telegraph model to obtain kinetic parameters from the nascent RNA copy number statistics. They state in the SI that “Similar delay models have also been studied by other authors,” however, the authors do not explain in which way their model differs from previous work. Does their approach have advantages over previously published models?
We have now clarified the claims of novelty, as follows.
a. We agree with the reviewer. We have now changed the title of the manuscript to “Quantifying how posttranscriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions”.
b. We agree and have made our claims of novelty more specific and have changed parts of the abstract and introduction. We now describe in the introduction that previous work by Xu et al., PRL 2016 and Skinner et al., eLife 2016 develop methods to estimate parameters from nascent mRNA data where the transcription site is localized, but have not considered cases where the transcription sites are unknown; in addition, a quantitative comparison between parameters estimated from mature and nascent data is also not considered because from their point of view it is evident that the estimates using nascent mRNA data should be more accurate. Nevertheless, it could be the case that the estimates using nascent data are only marginally better than those from mature data. In addition, we have removed the sentence “Our novel data curation method yields a quantitatively accurate picture of gene expression”. Furthermore, in the abstract we have mentioned that while it is presumed in the literature that posttranscriptional noise and gene copy number variation may affect parameter estimation, however the size of the errors incurred is presently unclear and this is what we focus on. Lastly, we explicitly mention in the abstract that the fusion method corrects for measurement noise due to the uncertainty in the localization of the transcription site when introns cannot be labelled. We have also added a paragraph to the Introduction (the one before the last) clarifying what other studies before us have done and what remains to be addressed.
c. We agree that from our description, although technically correct, it was unclear what has been done before by Skinner et al. and what is novel. In Skinner et al., eLife (2016) (Supplementary File 2 Table B ), all parameters are reported, but it was assumed that the burst frequency is the only parameter that changed upon replication, i.e. not all parameters were simultaneously estimated pre and postreplication. In our manuscript, we estimate all parameters without this assumption. In the Introduction we have now changed the sentence to “The estimation of transcriptional parameters from nascent mRNA data for pre and postreplication phases of the cell cycle has, to the best of our knowledge, only been reported in Skinner et al. eLife (2016)”. In the Introduction, we also point out that in this reference (and others) there is no direct comparison between estimates from nascent and mature mRNA data. In Section 2.2.2, we have made the differences between our method and that in Skinner et al. more explicit by stating: “A difference between our method of estimating parameters in G2 from that in the literature (Skinner et al) is that we do not assume that the burst frequency is the only parameter that changes upon replication, i.e. we estimate all transcription parameters simultaneously.”.
d. Our algorithm and that in Xu et al. PRL (2016) both take into account the fact that the signal intensity depends on the position of Pol II on the gene, albeit this is done in different ways. In Xu et al. PRL (2016), a master equation is written for the joint distribution of gene state and the number of nascent mRNA. In this case the number of nascent RNAs can have noninteger values since it represents the experimentally measured signal from the (incomplete) nascent RNA. Solution of this master equation proceeds by (a) a discretization of the continuous nascent mRNA signal into bins which are much smaller than one; (b) solution using finite state projection (FSP). This approach can lead to a large state space which incurs a large computational cost. In contrast in our method, we use FSP to solve for the delay telegraph model, i.e. the distribution of the discrete number of bound Pol II from which we construct (using convolution) the approximate distribution of the continuous nascent mRNA signal by assuming Pol II is uniformly distributed on the gene. Since the state space of bound Pol II is typically not large, our method will typically be more computationally efficient than the one described in Xu et al. PRL (2016). We note that in Skinner, et al. eLife (2016) there is also mentioned that they used a uniform distribution assumption for the Pol II positions but we cannot find any details of the algorithm in the paper to compare directly to ours. However, note that our paper is not about a novel inference algorithm but about the systematic study of how posttranscriptional and gene copy number noise affect the accuracy of parameter inference – this is the main novelty. We have added a discussion of the above points to the Methods Section 4.2.1.
Referee Comment (Major)
2. There is a particular choice during their analysis that I find problematic. In section 2.3, the authors state “The transcription site is counted as 1 mRNA, regardless of its intensity, but has a negligible influence since the mean number of mature mRNA is much greater than 1” (the number should be spelled). It is unclear that statement is true for all possible kinetic parameters. It is also hard to evaluate that claim because the authors do not show images of transcription sites that would support it. Trying to find more information, I saw images from previous work from one of the authors (“Optimized protocol for singlemolecule RNA FISH to visualize gene expression in S. cerevisiae”, figure 4). Those images suggest that the opposite is the case: in the cell shown, the number of mRNAs in the transcription site is not negligible but instead seems to contain most of the mRNAs in the cell. Solving this problem would require the authors to remake their analysis without making this assumption.
The reviewer is correct that the number of nascent transcripts is substantial. As these transcripts are not mature, their count should however not contribute to the mature transcript distribution. We previously counted this transcription site as one mRNA. As also described in the reviewer comments to reviewers 1 and 2, we now analyzed how removal of the transcription site from the experimental mRNA count affects the inferred parameters. As shown in Figure 3c and d subtracting the transcription site has only a small influence on the estimated parameters (compare seg 2 and seg2TS). See also SI Table 7 and SI Figure 4a. We included a description of this analysis in Section 2.2.1. To see if this insensitivity to TS inclusion is true for all parameter sets, we used synthetic data to randomly change the number of mature mRNA by minus 1/plus 1/unchanged with probability 1/3. In all cases, we found the inference results were practically unchanged, except for cases where the gene spends a huge amount of time in the OFF state. The results are described in SI Section 1.5 and SI Table 4
Referee Comments (Minor)
1. In section 2.1.3, the authors mention using an optimization package written in Julia programing language. A reference to the package needs to be included, either an academic article or the website to the package.
We now added the webpage for BlackBoxOptim.jl https://github.com/robertfeldt/BlackBoxOptim.jl in Methods Section 4.1.3.
2. In the discussion, the authors state “In addition, livecell measurements include cells in S phase, which are excluded in smFISH.” I do not think that statement is correct. One would expect that a large enough sample of cells assayed with smFISH will contain a subpopulation containing cells in the Sphase.
The reviewer is correct. smFISH contains cells in S phase, which are the cells in Figure 3e with a DNA content in between the G1 and G2 population. In this manuscript, we do not analyze this S phase population. They are however measured in livecell imaging. We updated the text to “which are not analyzed in smFISH”.
3. I find the overall presentation of figures and the analysis performed not optimal to convey their points. Below are some suggestions regarding presentation (and in some cases, analysis).
– Text suggestions:
a. The meaning of the word “inference” seems to change across the manuscript. In the title, I understand that inference means “estimation,” or more explicitly, estimating model parameters from experimental or simulated data. However, in the methods section, the authors write “Mature mRNA inference” and “Nascent mRNA inference.” Do they mean “Estimating/Inferring model parameters from synthetic/experimental mature/nascent mRNA datasets”?
To avoid confusion and to be clearer, we have changed the titles of the subsections in Methods. With the word “inference” we mean the inference of parameters from mRNA data. The algorithms for inference from mature and nascent data in Methods are the same for synthetic and experimental data.
b. In the Introduction, the authors use three different terms for cell cycle (cell cycle position, cell cycle stage, and cell cycle phase). It is unclear to me if they are referring to the same concept.
We have now used the phrase “cell cycle phase” throughout the manuscript.
– Presentation suggestions:
c. I would remove Figure 2C and put it in the Supplementary information. It shows procedure details that are not fundamental to understanding their claims.
We appreciate the suggestion but we have left the new version of Figure 2c (presently Figure 1d) in the main text. The reason is that the study from synthetic data is still part of the story in the main text and we believe the figure aids its comprehension.
d. I would also relegate the tables in their six datasets in figure 1 and 2 to the Supplementary material. Tables are not very effective methods to present information.
We agree and moved these tables to the supplement. We have also replaced all relevant tables in the manuscript with bargraphs.
e. I do not think that figures 1c and 2d are needed. Comparing the results from stochastic simulations and the predictions from the models is an internal control that the researchers should do to test the accuracy of their SSA implementation; it does not convey a message related to the main conclusions of their work.
We agree and moved these figures to the supplement.
f. I like figure 4a; it conveys one of the main points: not correcting for cell cycle can lead to considerable errors in parameter estimation. **I would like to see a similar plot that conveys the difference in parameter estimation when using nascent vs. mature mRNA.**
We agree this comparison would be very interesting. We note however that for nascent versus mature, a direct comparison, parameter by parameter, is not possible. This is because using the telegraph model it is only possible to estimate the switching rates and the initiation rate scaled by the degradation rate, and the latter is is unknown. On the other hand, the estimates from nascent data are rates multiplied by the elongation time – the latter is known and hence the absolute rates can be estimated from nascent mRNA data only. The only quantities that can be directly compared are the burst size and the fraction of ON time, since these are both nondimensional. We now included bar charts in Figure 3f and 4c to compare these two and we comment on the comparison in Section 2.3.1.
g. Why do the authors have table 1 separated from figure 4 while adding the tables to figures 1 and 2? I would be consistent and move all tables to the supplementary material.
We have considerably reorganised the table and figure order in the manuscript. Most tables are now in the SI. However, we have left some of the tables in the main text, i.e. Figure 3f for mature mRNA data and Figure 6d for nascent mRNA data corrected for the fusion method, since these enable an easy way to assess the impact of transcriptional noise and cellcycle phase on inferred values.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
However, there are some remaining issues that need to be addressed, as outlined below:
1. A main concern that remains is the discrepancy between nascent and mature mRNAs. The authors argue that this is presumably due to large mRNA turnover rates, which may be expected when the coat protein is not expressed. First, it would be great if the authors could back up this statement with appropriate references. Second, if I take the inferred transcription rates of ~36mRNA/min and the fraction the gene is on ~0.8, this leads to an effective transcription rate of roughly 30mRNA/min (the elongation delay should only lead to a static delay, but not affect the average amount of mature mRNAs that we observe at steady state). Now, in order to obtain an average steady state of mature mRNA around 10 transcripts (e.g., Fig. 3f), I require the mRNA degradation rate to be around 3min^1. This would give a halflife of roughly log(2)/3=0.2min, which seems to be exceedingly low. According to literature, mRNA halflives in yeast are typically in the order of 30min and only for very few mRNAs falls below 10mins (see e.g., Geisberg et al., Cell 2014). If we trust this simple calculation above, that would mean that the mRNA halflive is substantially smaller than the lowest reported values that I could find in the literature. I therefore still worry that there might be an issue with the data/spot extraction. At the very least, I would expect such mismatch and its potential origins to be discussed critically in the paper.
We agree that the halflife of PP7tagged GAL10 is very fast compared to other endogenous mRNA. We have added a discussion on this in the results section, as well as added the reference suggested by the reviewer: “In fact, in our experiments, the number of detected mature mRNA transcripts per cell was lower than expected based on the number of nascent transcripts (compare Fig. 3 with Fig. 4). This discrepancy between nascent and mature transcripts likely arises because the addition of the PP7 loops to the GAL10 RNA results in a very fast mRNA turnover, which is much faster than the turnover of most endogenous RNAs [1, 2, 3, 4]. Previously, both shorter and longer mRNA halflives from the addition of stem loops have been observed, which may be caused because changes in the 5’ UTR length or sequence changes its recognition by the mRNA degradation machinery [5, 6, 7].”
2. The second remaining concern relates to the validation against timeseries data. I appreciate that the discussion around the calculation of the ACF has been extended. The authors argue that a comparison between statistics obtained from smFISH and livecell data is not possible due to a potential detection threshold in the former. But why am I then allowed to compare the autocorrelation function (that is derived from calibration against smFISH data) with the autocorrelation of the livecell data, which also is a statistic of the same transcriptional process? Or conversely, if it was the case that the mRNA mean and variance don't match between experiments, why should we expect the autocorrelation to match? I would be grateful if the authors could clarify these questions. Moreover, I still believe that the term "nonstationary effects" is misleading, since the slowly varying components to the ACF still need to be assumed to be in stationarity in order to capture them by an ACF with just a single time parameter (e.g., the linear fit). Perhaps "slowlyvarying" would be a better term.
We apologize that we misunderstood the reviewer previously. The reason that we cannot directly compare the mean and variance of the livecell data with the estimated distribution derived from the smFISH, is because of celltocell variation in the intensities of the livecell data. Specifically, the livecell traces showed celltocell variation in overall fluorescent intensity of each trace arising from differences in the PP7 coat protein expression level that was expressed from a plasmid. This means that the scaling factor of how much intensity represents a single RNA varies between cells, but the traces are not long enough to obtain this scaling factor (or the full intensity distribution) for each cell. The normalized ACF however, normalizes the intensity per trace, and thus allows comparison between the kinetics even if the absolute intensity differs per cell. We now added a statement to the manuscript explaining why we use the normalized ACF for comparison rather than the intensity distribution: ”Specifically, the livecell traces showed celltocell variation in overall fluorescent intensity arising from differences in the PP7 coat protein expression level, precluding a direct comparison of the livecell intensities with the smFISH distributions. The normalized ACFs are normalized per trace and thus can be used to directly compare the kinetics.”
We agree that the term ”nonstationary effects” does not accurately encompass all corrected slowvarying effects, such a bleaching. We note that we also use the same linear subtraction method to correct for the nonstationary fast induction from raffinose to galactose in the livecell data, which in our hands is welldescribed by a linear long time component in the ACF. We have adapted our description and nomenclature in the supplemental methods (last sentence of Section 5) to make this more accurate.
References
[1] Miller, C. et al. Dynamic transcriptome analysis measures rates of mrna synthesis and decay in yeast. Molecular systems biology 7, 458 (2011).
[2] Wang, Y. et al. Precision and functional specificity in mrna decay. Proceedings of the National Academy of Sciences 99, 5860–5865 (2002).
[3] Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717–728 (1998).
[4] Geisberg, J. V., Moqtaderi, Z., Fan, X., Ozsolak, F. & Struhl, K. Global analysis of mrna isoform halflives reveals stabilizing and destabilizing elements in yeast. Cell 156, 812–824 (2014).
[5] Heinrich, S., Sidler, C. L., Azzalin, C. M. & Weis, K. Stem–loop rna labeling can affect nuclear and cytoplasmic mrna processing. Rna 23, 134–141 (2017).
[6] Tutucci, E. et al. An improved ms2 system for accurate reporting of the mrna life cycle. Nature methods 15, 81–89 (2018).
[7] Garcia, J. F. & Parker, R. Ms2 coat proteins bound to yeast mrnas block 5 to 3 degradation and trap mrna decay products: implications for the localization of mrnas by ms2mcp system. Rna 21, 1393–1395 (2015).
https://doi.org/10.7554/eLife.82493.sa2Article and author information
Author details
Funding
National Natural Science Foundation of China (61988101)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
National Natural Science Foundation of China (62073137)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
H2020 European Research Council (755695 BURSTREG)
 Tineke L Lenstra
Leverhulme Trust (RPG2020327)
 Ramon Grima
Shanghai Action Plan for Technological Innovation Grant (22ZR1415300)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
Shanghai Action Plan for Technological Innovation Grant (22511104000)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
Shanghai Sailing Program (22YF1410700)
 Xiaoming Fu
Oncode Institute
 Tineke L Lenstra
Netherlands Organisation for Scientific Research (Gravitation program CancerGenomiCs.nl)
 Tineke L Lenstra
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
ZC, XF, and LX acknowledge the support from Natural Science Foundation of China (NSFC No. 61988101, 62073137), Shanghai Action Plan for Technological Innovation Grant (No. 22ZR1415300, 22511104000) and Shanghai Center of Biomedicine Development. XF acknowledges the support from Shanghai Sailing Program (22YF1410700). TLL was supported by the Netherlands Organization for Scientific Research (NWO, gravitation program CancerGenomiCs.nl), Oncode Institute, which is partly financed by the Dutch Cancer Society, and the European Research Council (ERC Starting Grant 755695 BURSTREG). RG was supported by a Leverhulme Trust research award (RPG2020–327).
Senior and Reviewing Editor
 Anna Akhmanova, Utrecht University, Netherlands
Publication history
 Preprint posted: November 11, 2021 (view preprint)
 Received: August 7, 2022
 Accepted: October 14, 2022
 Accepted Manuscript published: October 17, 2022 (version 1)
 Version of Record published: November 10, 2022 (version 2)
Copyright
© 2022, Fu, Patel et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 628
 Page views

 130
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structurebased measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higherorder statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

 Cancer Biology
 Computational and Systems Biology
Lung squamous cell carcinoma (LUSC) is a type of lung cancer with a dismal prognosis that lacks adequate therapies and actionable targets. This disease is characterized by a sequence of low and highgrade preinvasive stages with increasing probability of malignant progression. Increasing our knowledge about the biology of these premalignant lesions (PMLs) is necessary to design new methods of early detection and prevention, and to identify the molecular processes that are key for malignant progression. To facilitate this research, we have designed XTABLE (Exploring Transcriptomes of Bronchial Lesions), an opensource application that integrates the most extensive transcriptomic databases of PMLs published so far. With this tool, users can stratify samples using multiple parameters and interrogate PML biology in multiple manners, such as two and multiplegroup comparisons, interrogation of genes of interests, and transcriptional signatures. Using XTABLE, we have carried out a comparative study of the potential role of chromosomal instability scores as biomarkers of PML progression and mapped the onset of the most relevant LUSC pathways to the sequence of LUSC developmental stages. XTABLE will critically facilitate new research for the identification of early detection biomarkers and acquire a better understanding of the LUSC precancerous stages.