Quantifying how post-transcriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions

  1. Xiaoming Fu
  2. Heta P Patel
  3. Stefano Coppola
  4. Libin Xu
  5. Zhixing Cao  Is a corresponding author
  6. Tineke L Lenstra  Is a corresponding author
  7. Ramon Grima  Is a corresponding author
  1. Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, China
  2. School of Biological Sciences, University of Edinburgh, United Kingdom
  3. Center for Advanced Systems Understanding, Helmholtz-Zentrum Dresden-Rossendorf, Germany
  4. The Netherlands Cancer Institute, Oncode Institute, Division of Gene Regulation, Netherlands

Abstract

Transcriptional rates are often estimated by fitting the distribution of mature mRNA numbers measured using smFISH (single molecule fluorescence in situ hybridization) with the distribution predicted by the telegraph model of gene expression, which defines two promoter states of activity and inactivity. However, fluctuations in mature mRNA numbers are strongly affected by processes downstream of transcription. In addition, the telegraph model assumes one gene copy but in experiments, cells may have two gene copies as cells replicate their genome during the cell cycle. While it is often presumed that post-transcriptional noise and gene copy number variation affect transcriptional parameter estimation, the size of the error introduced remains unclear. To address this issue, here we measure both mature and nascent mRNA distributions of GAL10 in yeast cells using smFISH and classify each cell according to its cell cycle phase. We infer transcriptional parameters from mature and nascent mRNA distributions, with and without accounting for cell cycle phase and compare the results to live-cell transcription measurements of the same gene. We find that: (i) correcting for cell cycle dynamics decreases the promoter switching rates and the initiation rate, and increases the fraction of time spent in the active state, as well as the burst size; (ii) additional correction for post-transcriptional noise leads to further increases in the burst size and to a large reduction in the errors in parameter estimation. Furthermore, we outline how to correctly adjust for measurement noise in smFISH due to uncertainty in transcription site localisation when introns cannot be labelled. Simulations with parameters estimated from nascent smFISH data, which is corrected for cell cycle phases and measurement noise, leads to autocorrelation functions that agree with those obtained from live-cell imaging.

Editor's evaluation

This important paper tackles a core problem in systems biology – how to quantify kinetic parameters from incomplete and noisy experimental data. The study makes a convincing methodological contribution to the field, and its potential usefulness is demonstrated using experimental data in yeast.

https://doi.org/10.7554/eLife.82493.sa0

Introduction

Transcription in single cells occurs in stochastic bursts (Suter et al., 2011; Larsson et al., 2019). Although the first observation of bursting occurred more than 40 years ago (McKnight and Miller, 1977), the precise mechanisms behind this phenomenon are still under active investigation (Nicolas et al., 2017; Tunnacliffe and Chubb, 2020). The direct measurement of the dynamic properties of bursting employs live-cell imaging approaches, which allow visualization of bursts as they occur in living cells (Donovan et al., 2019). However, in practice, such live-cell measurements are challenging because they are low-throughput and require genome-editing (Brouwer et al., 2020; Lenstra and Larson, 2016). To circumvent this, one can exploit the fact that bursting creates heterogeneity in a population. In this case, it is relatively straightforward to obtain a steady-state distribution of the number of mRNAs per cell from smFISH or single-cell sequencing experiments. These distributions have been used to infer dynamics by comparison to theoretical models. The simplest mathematical model describing bursting is the telegraph (or two-state) model (Peccoud and Ycart, 1995; Raj et al., 2006). In this model, promoters switch between an active and inactive state, where initiation occurs during the active promoter state. The model makes the further simplifying assumption that the gene copy number is one and that all the reactions are effectively first-order. The mRNA in this model can be interpreted as cellular (mature) mRNA since its removal via various decay pathways in the cytoplasm is known to follow single-exponential (first-order) decay kinetics in eukaryotic cells (Wang et al., 2002; Herzog et al., 2017). The solution of the telegraph model for the steady-state distribution of mRNA numbers has been fitted to experimental mature mRNA number distributions to estimate the transcriptional parameters (Raj et al., 2006; Kim and Marioni, 2013; Suter et al., 2011; Larsson et al., 2019).

However, the reliability of the estimates of transcriptional parameters from mRNA distributions is questionable because the noise in mature mRNA (and consequently the shape of the mRNA distribution) is affected by a wide variety of factors. Recent extensions of the telegraph model have carefully investigated how mRNA fluctuations are influenced by the number of promoter states (Zhou and Zhang, 2012; Ham et al., 2020b), polymerase dynamics (Cao et al., 2020), cell-to-cell variability in the rate parameter values (Dattani and Barahona, 2017; Ham et al., 2020a), replication and binomial partitioning due to cell division (Cao and Grima, 2020), nuclear export (Singh and Bokes, 2012) and cell cycle duration variability (Perez-Carrasco et al., 2020). One way to avoid noise from various post-transcriptional sources is to measure distributions of nascent mRNA rather than mature mRNA, and then fit these to the distributions predicted by an appropriate mathematical model. A nascent mRNA (Zenklusen et al., 2008; Larson et al., 2009) is an mRNA that is being actively transcribed, that is it is still tethered to an RNA polymerase II (Pol II) moving along a gene during transcriptional elongation. Fluctuations in nascent mRNA numbers thus directly reflect the process of transcription. Because nascent mRNA removal is not first-order, an extension of the telegraph model has been developed (the delay telegraph model) (Xu et al., 2016).

However, nascent mRNA data still suffers from other sources of noise due to cell-to-cell variability. For example in an asynchronous population of dividing cells, cells can have either one or two gene copies. In the absence of a molecular mechanism that compensates for the increase in gene copy number upon replication, cells with two gene copies which cannot be spatially resolved will have a different distribution of nascent mRNA numbers (one with higher mean) than cells with one gene copy. The importance of the cell cycle is illustrated by the finding (Zopf et al., 2013) that noisy transcription from the synthetic TetO promoter in S. cerevisiae is dominated by its dependence on the cell cycle. The estimation of transcriptional parameters from nascent mRNA data for pre- and post-replication phases of the cell cycle has, to the best of our knowledge, only been reported in Skinner et al., 2016.

Interestingly, all the studies that estimate transcriptional parameters from nascent mRNA data (Skinner et al., 2016; Xu et al., 2015; Zoller et al., 2018; Senecal et al., 2014; Fritzsch et al., 2018) do not compare them with transcriptional parameters estimated from cellular (mature) mRNA data measured in the same experiment. Similarly, a quantitative comparison between inference from cell-cycle-specific data and data which contains information from all cell cycle phases is lacking. Likely, this is because it is considered evident that quantifying fluctuations earlier in the gene expression process and adjusted for the cell-cycle will improve estimates. However, nascent mRNA distributions are technically more challenging to acquire than mature mRNA distributions; and inference from nascent mRNA distributions is substantially more complex (Xu et al., 2016). Thus, it still needs to be shown that acquiring nascent mRNA data is a necessity from a parameter inference point of view, i.e. that it leads to significantly different and more robust estimates than using mature mRNA data. We also note that current studies report parameter inference from organisms where it is possible to label introns to identify mRNA located at the transcription site. This is not possible in many yeast genes and other microorganisms, and in these cases it is unclear how to correct parameter estimates for uncertainty in the transcription site location.

In this paper, we sought to understand the precise impact of post-transcriptional noise and cell-to-cell variability on the accuracy of transcriptional parameters inferred from mature mRNA data. The fitting algorithms (for mature and nascent mRNA data) were first tested on simulated data, where limitations of the algorithms were uncovered in accurately estimating the transcriptional parameters in certain regions of parameter space. The algorithms were then applied to four independent experimental data sets, each measuring GAL10 mature and nascent mRNA data from smFISH in galactose-induced budding yeast, conditional on the stage of the cell cycle (G1 or G2) for thousands of cells. Comparison of the transcriptional parameter estimates allowed us to separate the influence of ignoring cell cycle variability from that of post-transcriptional noise (mature vs nascent mRNA data). We found that only fitting of nascent cell-cycle data, corrected for measurement noise (due to uncertainty in the transcription site location), provided good agreement with measurements from live-cell data. Cell-cycle specific analysis also revealed that upon transition from G1 to G2, yeast cells show dosage compensation by reducing burst frequency, similar to mammalian cells (Padovan-Merhar et al., 2015). Our systematic comparison highlights the challenges of obtaining kinetic information from static data, and provides insight into potential biases when inferring transcriptional parameters from smFISH distributions.

Results

Inference from mature mRNA data vs inference from nascent mRNA data: testing inference accuracy using synthetic data

To understand the accuracy of the inference algorithms from nascent and mature mRNA data, in various regions of parameter space, (i) we generated synthetic data using stochastic simulations with certain known values of the parameters; (ii) applied the inference algorithms to estimate the parameters from the synthetic data; (iii) compared the true and inferred kinetic parameter values.

The generation of synthetic mature mRNA data (mature mRNA measurements in each of 104 cells) using stochastic simulations of the telegraph model (Figure 1a) is described in Methods Sections Mathematical model and Generation of synthetic mature mRNA data. The inference algorithm is described in detail in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. It is based on a maximization of the likelihood of observing the single cell mature mRNA numbers measured in a population of cells. The likelihood of observing a certain number of mature mRNA numbers from a given cell is given by evaluating the telegraph model’s steady-state mature mRNA count probability distribution.

Overview of the inference of transcriptional parameters from synthetic data.

(a) A schematic illustration of the telegraph model. (b). A schematic of the delay telegraph model. The double horizontal line for nascent mRNA removal indicates this is a delayed reaction. (c) Illustration showing promoter switching between two states, Pol II binding to the promoter in the ON state and subsequently undergoing productive elongation. Note that the length of the nascent mRNA tail increases until Pol II terminates at the end of the gene. As Pol II travels through the 14 repeats of the PP7 loops, the intensity of the mRNA increases due to fluorescent probe binding to the mRNA; intensity saturates as Pol II enters the GAL10 gene body. (d) Illustration of the algorithms to generate synthetic data and to perform inference from mature and nascent mRNA data. The green boxes are only applicable for the inference of the fluorescence signal intensity of nascent mRNAs; note that in nascent mRNA inference, the "RNA number" in the flow chart should be interpreted as the number of bound Pol II molecules on the gene. A large iteration step Nmax (104) is chosen as the termination condition for the optimizer.

For nascent RNA data, we used stochastic simulations of the delay telegraph model (Figure 1b) to generate the position of bound Pol II molecules from which we constructed the synthetic smFISH signal in each of 104 cells (Methods Section Generation of synthetic nascent mRNA data). An inference algorithm estimates the parameters, based on a maximization of the likelihood of observing the single cell total fluorescence intensity measured in a population of cells (Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data). Note that the likelihood of observing a certain fluorescence signal intensity from a cell is given by extension of the delay telegraph model (but not directly by the delay telegraph model itself) to account for the smFISH probe positions.

This extension takes into account that the experimental fluorescence data used in this manuscript was acquired from smFISH of PP7-GAL10 in budding yeast, where probes were hybridized to the PP7 sequences. Because the PP7 sequences are positioned at the 5’ of the GAL10 gene, the fluorescence intensity of a single mRNA on the DNA locus resembles a trapezoidal pulse (see Figure 1 for an illustration). As the Pol II molecule travels through the 14 repeats of the PP7 loops, the fluorescence intensity increases as the fluorescent probes binds to the nascent mRNA (this is the linear part of the trapezoidal pulse). However, once all 14 loops on the nascent mRNA are bound by the fluorescent probes, the intensity of a single mRNA reaches maximal intensity and the plot plateaus as the RNA elongates through the GAL10 gene body before termination and release. The total fluorescent signal density function is hence given by

(1) p(s;θ)=k=0p(s|k)P(k;θ),

where p(s|k) is the density function of the signal s given there are k bound Pol II molecules and P(k;θ) is the steady-state solution of the delay telegraph model giving the probability of observing k bound Pol II molecules for the parameter set θ. In Methods Section Mathematical model, we show how p(s|k) can be approximately calculated for the trapezoidal pulse. Hence Equation (1) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site. Note that both of these inference algorithms were used to infer the promoter switching and initiation rate parameters. The degradation rate and the elongation time were not estimated but assumed to be known. The inference and synthetic data generation procedures are summarised and illustrated in Figure 1d.

The accuracy of inference was first calculated as the mean of the relative error in the estimated parameters σoff,σon, and ρ (for its definition see Methods, Equation (6)); note that this error measures deviations from the known ground truth values. Figure 2a shows, by means of a 3D scatter plot, the ratio of the mean relative error from nascent mRNA data (using delay telegraph model) and the mean relative error from mature mRNA data (using the telegraph model) for 789 independent parameter sets sampled on a grid (for each of these sets, we simulated 104 cells). The overall bluish hue of the plot suggested that the mean relative error from nascent mRNA data was typically less than the error from mature mRNA data. This was confirmed in Figure 2b where the same data was plotted but now as a function of the fraction of ON time (defined as fON=σon/(σoff+σon)). Out of 789 parameter sets, for 483 of them (61%) the inference accuracy was higher when using nascent mRNA data.

Accuracy of the inferred kinetic parameters from synthetic mature and nascent mRNA data using the telegraph and delay telegraph model, respectively.

(a) 3D scatter plot showing the ratio of the mean relative error from nascent mRNA data (using delay telegraph model MREdelay) and the mean relative error from mature mRNA data (using the telegraph model MREtele) for 789 independent parameter sets sampled on a grid. Red data points indicate parameter sets with lower relative errors for mature data compared to nascent data, blue datapoints indicate parameter sets with lower relative error for nascent data compared to mature data (b) Same data as (a) but shown as a function of the fraction of ON time, fON. For 61% of the parameters, the inference accuracy is higher when using nascent mRNA data. (c) Sampling from the same parameter space, we then add log-normal distributed noise (size 5%) to the initiation rate ρ (see text for details) to mimic external noise due to post-transcriptional processing that is only present in mature mRNA. Log10 of the ratio of the median relative error (MRE) using perturbed mature mRNA data against Log10 MRE using nascent mRNA data is shown as a function of the true fraction of ON time, fON. For 64% of the parameters, the inference accuracy is higher when using nascent mRNA data. (d) The median relative error of each transcriptional parameter as a function of the fraction of ON time, using synthetic nascent mRNA, synthetic mature mRNA data and synthetic mature mRNA with external noise. Inference from nascent data is generally more accurate than using mature mRNA data.

Thus far, we have implicitly assumed that fluctuations in both nascent and mature mRNA are due to transcriptional bursting. However, it is clear that mature mRNA data exhibit a higher degree of noise due to post-transcriptional processing. For example, it has been shown that transcriptional noise is typically amplified during mRNA nuclear export (Hansen et al., 2018). In addition, cell-to-cell variation in the number of nuclear pore complexes has recently been identified as the source of heterogeneity in nuclear export rates within isogenic yeast populations (Durrieu et al., 2022). To take into account these additional noise sources, which we call external noise, we added noise to the initiation rate ρ in the telegraph model since this rate implicitly models all processes between the synthesis of the transcript and the appearance of mature mRNA in the cytoplasm. Specifically, for each of the 789 parameter sets previously used, we changed ρ to ρ where the latter is a log-normal distributed random variable such that its mean is ρ and its standard deviation is equal to 0.05 of the mean (5% external noise). Note that this implies that at the time of measurement, each cell in the population had a different value of the initiation rate. Simulations with this perturbed set of parameters led to a synthetic mature mRNA data set from which we re-inferred parameters using the telegraph model. In Figure 2c we show the ratio of mean relative error from nascent mRNA data and the mean relative error from perturbed mature mRNA data as a function of the fraction of ON time, fON. The percentage of parameters where nascent mRNA is more accurate is slightly increased compared to the data without noise (64% versus 61% of the parameters; compare Figure 2c and Figure 2b). However, the addition of even more noise (10% external noise added to the initiation rate) increases the inference accuracy for 91% of the parameter sets when the nascent mRNA data is used (Appendix 1 and Appendix 1—figure 1).

To obtain more insight into the accuracy of the individual parameters, we next plotted the median relative error of transcriptional parameters σoff,σon,ρ, burst size and the inferred fraction of ON time, as a function of the true fraction of ON time (Figure 2d). We compared the results using synthetic nascent mRNA, synthetic mature mRNA data and synthetic mature mRNA with 5% external noise. The median of the relative error for each transcriptional parameter (as given by the second equation of Equation 8) was obtained for the subset of the 789 parameter sets for which the true fraction of ON time fON falls into the interval [x-0.05,x+0.05] where x=0.1,0.2,,0.9. From the plots, the following can be deduced: (i) the errors in σon (the burst frequency), σoff and the burst size ρ/σoff tend to increase with fON while the rest of the parameters (ρ and the estimated value of fON) decrease; (ii) for small fON, the best estimated parameters are the burst frequency and size while for large fON, it was ρ and the estimated value of fON. The worst estimated parameter was σoff, independent of the value of fON; (iii) the addition of external noise to mature mRNA data had a small impact on inference for small fON; in contrast, for large fON the noise appreciably increased the relative error in σoff and to a lesser extent the error in the other parameters too.

Additionally, in Appendices 1 and 2 we show that (i) independent of the accuracy of parameter estimation, the best fit distributions accurately matched the ground truth distributions (Appendix 1 and Appendix 1—figure 2); (ii) the parameters ordered by relative error were in agreement with the parameters ordered by sample variability (Appendix 1 and Appendix 1—table 1) and by profile likelihood error (Kreutz et al., 2013) (Appendix 1, Appendix 1—tables 2 and 3). Since from experimental data, only the sample variability and the profile likelihood error are available, it follows that the results of our synthetic data study in Figure 2 based on relative error from the ground truth have wide practical applicability; (iii) stochastic perturbation of the mature or nascent mRNA data (due to errors in the measurement of the number of spots and the fluorescent intensity) had little effect on the inference quality, unless the gene spent a large proportion of time in the OFF state (Appendix 1—tables 4 and 5); (iv) if one utilized the conventional telegraph model to fit the nascent data generated by the delay telegraph model, it was possible to obtain a distribution fitting as good as the delay telegraph model but with low-fidelity parameter estimation (Appendix 2, Appendix 2—figure 1 and Appendix 2—table 1). Analytically, the telegraph model is only an accurate approximation of the delay telegraph model when the promoter switching timescales are much longer than the time spent by Pol II on a gene or the off switching rates are very small such that gene expression is nearly constitutive.

In summary, by means of synthetic experiments, we have clarified how the accuracy of the parameter inference strongly depends on the type of data (nascent or mature mRNA) and the fraction of time spent in the ON state (which determines the mode of gene expression).

Applications to experimental yeast mRNA data

Now that we have introduced the inference algorithms and tested them thoroughly using synthetic data, we applied the algorithms to experimental data (see Method Section Experimental data acquisition and processing for details of the data acquisition). Note that in what follows, delay telegraph model refers to the extended delay telegraph model that accounts for the smFISH probe positions that was used to predict the smFISH fluorescent signal of the transcription site.

Inference from mature mRNA data: experimental artifacts

We have four independent datasets from which we determined mRNA count and nascent RNA distributions. Figure 3a shows an example cell with mature single RNAs in the cytoplasm, and a bright nuclear spot representing the site of nascent transcription. Spots and cell outlines were identified using automated pipelines. Importantly, to obtain an accurate estimation of transcriptional parameters, the experimental input distributions of mRNA count and nascent RNAs require high accuracy. We therefore first determined how technical artifacts in the analysis affects the inference estimates.

Inference results using four mature mRNA data sets with sample sizes of 2333, 6366, 4550 and 3163 cells, respectively.

(a) Representative smFISH image of a yeast cell with PP7-GAL10 RNAs labeled with Cy3 and the nucleus labeled with DAPI. (b) The DAPI and Cy3 signals were used to determine the nuclear and cellular mask, respectively. Detected and fitted spots are indicated in green. Mature RNA count distribution (pink) for segmentation method 1 with a best fit obtained from the telegraph model (gray curve). Scale bar is 5 μm(c-d) The DAPI and Cy3 signals were used to determine the nuclear and cellular mask using a second independent segmentation tool (segmentation 2). Mature RNA count distribution (gray and cyan) with/without counting the transcription site (TS) for segmentation method 2 with a best fit obtained from the telegraph model (gray curves). (e) Bar graphs of inferred transcriptional parameters (merged mature RNA data) from fitting the distributions of the two segmentation methods (‘seg1’ and ‘seg2’) as well as the distribution of mature RNAs only (‘seg2 -TS’ which indicates the exclusion of one spot in each cell that represents the transcription site). The burst size was computed as ρ/σoff and the fraction of ON time as σon/(σon+σoff). Error bars indicate standard deviation computed over the four datasets. (f) Distribution of the integrated DAPI intensity for each cell. Cyan line represents a Gaussian bimodal fit with highlighted regions indicating the intensity-based classification of G1 and G2 cells. Distributions of the mature RNA count for all cells (merged) and cell-cycle classified cells (G1 cells and G2 cells). (g) Tables and bar graphs of inferred parameters for merged and cell-cycle-specific data. Note that the transcriptional parameters σon,σoff,ρ are normalised by the degradation rate and hence dimensionless. For the cell-cycle-specific data, parameters were inferred per gene copy.

First, if the number of mRNA transcripts per cell is high, accurate determination of the number of transcripts may be challenging, as transcripts may overlap. To determine if this occurred in our datasets, we analyzed the distributions of intensities of the cytoplasmic spots, which revealed unimodal distributions where ∼90% of the detected spots fell in the range 0.5× median – 1.5× median (Figure 4a). We therefore concluded that overlapping spots are not a large confounder in our data. In fact, in our experiments, the number of detected mature mRNA transcripts per cell was lower than expected, based on the number of nascent transcripts (compare Figure 3 with Figure 4). This discrepancy between nascent and mature transcripts likely arises because the addition of the PP7 loops to the GAL10 RNA destabilizes the RNA, resulting in faster mRNA turnover compared to most endogenous RNAs (Miller et al., 2011; Wang et al., 2002; Holstege et al., 1998; Geisberg et al., 2014). Previously, both shorter and longer mRNA half-lives from the addition of stem loops have been observed, which may be caused because changes in the 5’ UTR length or sequence affect its recognition by the mRNA degradation machinery (Heinrich et al., 2017; Tutucci et al., 2018; Garcia and Parker, 2015). In our case, we note that such high turnover should aid transcriptional parameter estimates, as it closely reflects transcriptional activity.

Inference from the normalized nascent mRNA distributions for merged and cell-cycle specific data.

(a) Normalized nascent mRNA distributions of merged cell-cycle data were obtained by normalizing the signal intensity of the transcription site (defined as the brightest spot in the cell) by the median signal intensity of the cytoplasmic spots (shown in orange and zoom-in depicted in the inset). In all 4 datasets, approximately 90% of the detected cytoplasmic spots fell in the range 0.5× median – 1.5× median (grey bargraph). Black line in normalized distribution on the right represents best fit with delay telegraph model. (b) Nascent RNA distributions for cell-cycle-specific data. Black lines represent best fits with delay telegraph model. (c) Bar graphs comparing the transcriptional parameters, burst size, fraction of ON time and Fano factor for cell-cycle-specific and merged data. Error bars indicate standard deviation of the four datasets. (d) Normalized ACF plots of cell-cycle-specific and merged data. The ACF plots are generated by stochastic simulations using estimated parameters from merged and cell-cycle specific nascent mRNA data for each of the four data sets; these were compared with the ACF measured directly using live-cell data in Donovan et al., 2019 (green line). (e) The sum of squared ACF residuals of merged and cell-cycle-specific data from each dataset (this is the sum of squared deviations between the measured and estimated normalised ACF where the sum was calculated over all time points).

A second possible source of error is cell segmentation. To test how cell segmentation errors contribute to the mature mRNA distribution and the transcriptional bursting estimates, we compared two independent segmentation tools, where segmentation 1 often resulted in missed spots (Figure 3b), resulting in an underestimation of the mean mRNA count and of the variance (compare Figure 3b and c). We inferred the transcriptional parameters using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. In the absence of an experimental measurement of the degradation rate, we could only estimate the three transcriptional parameters normalised by d. The best fits of dataset 1 are shown in (Figure 3b and c) and the transcriptional parameters (for all four datasets) are summarized in (Figure 3e). Note that the estimated parameters for all four datasets, using both segmentations, are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Notably the segmentation algorithms led to similar estimates for the burst frequency but considerably different estimates for the rest of the parameters. In particular segmentation 1 suggested that burst expression is infrequent (≈20% of the time) whereas segmentation 2 was consistent with burst expression occurring half of the time. Given that accurate cell segmentation remains challenging, this analysis illustrates that parameter estimation from mature mRNA counts may be affected by technical errors. For the remainder of the mature mRNA analysis, we have used only segmentation 2 data.

Lastly, it may be challenging to distinguish the nascent transcription site from a mature RNA, especially if few nascent RNAs are being produced. Either one can decide to include all cellular spots in the total mRNA count, including the transcription site, with the result that the number of mature transcripts is overestimated with one RNA for cells which show a transcription site. Or conversely, one can decide to exclude the transcription site by subtracting one spot from each cell, with the result that the number of mature mRNAs may be underestimated by one RNA for cells that are transcriptionally silent. To understand how this choice affects the accuracy of parameter inference, we compared both options in (Figure 3c, d and e), where seg2 included all spots, and seg2-TS excluded transcription sites (by subtracting 1 from each cell). The estimated parameters for all four datasets are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Although the mean was lower when transcription sites were excluded, all the parameters except the burst frequency σon were within the error, indicating that the choice of whether or not to include the transcription site in the mature mRNA count had a small influence on parameter estimation. For the remainder of the analysis, we included all spots, and counted the transcription site as one RNA.

Inference from mature mRNA data: merged versus cell-cycle specific

The above analysis was performed using the merged data from all cells, irrespective of their position in the cell cycle. The inferred parameters of all four datasets are shown in Figure 3g (grey). To understand the effect of the cell cycle on these parameter estimates, we compared this inference with cell-cycle-specific data. We used the integrated nuclear DAPI intensity as a measure for DNA content to classify cells into G1 or G2 cells (Figure 3f (left)) to obtain separate mature mRNA distributions for G1 and G2 cells.

To infer the transcriptional parameters from mature mRNA data of cells in G1, the inference protocol remained the same. However for cells in the G2 stage, this protocol needed to be altered since G2 cells have two gene copies, whereas the solution of the telegraph model assumes one gene copy. Assuming the transcriptional activities of the two gene copies are independent, the distribution of the total molecule number is the convolution of the molecule number (obtained from the telegraph model) with itself for mature mRNA data. This convolved distribution was used in steps (ii) and (iii) of the inference algorithm in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. A difference between our method of estimating parameters in G2 from that in the literature (Skinner et al., 2016) is that we do not assume that the burst frequency is the only parameter that changes upon replication, and we estimated all transcription parameters simultaneously.

Note that the independence of gene copy transcription has been verified for genes in some eukaryotic cells (Skinner et al., 2016) where the two copies can be easily resolved. For yeast data, as we are analyzing in this paper, it is generally not possible to resolve the two copies of the allele in G2 because they are within the diffraction limit. However, in the absence of experimental evidence, the independence assumption is the simplest reasonable assumption that we could make (see later for a relaxation of this assumption).

For both G1 and G2 cells, we performed inference for cell-cycle specific mature mRNA data, the results of which are shown in Figure 3f (centre and right) and Figure 3g – see Appendix 3—table 2 for the confidence intervals of the estimates calculated using profile likelihood. As expected, the mean number of mRNAs in G2 cells was larger than that in G1 cells. For both merged and cell-cycle specific data, the parameters ordered by increasing variability of the estimates from independent samples (the standard deviation divided by the mean) were: ρ, fON, σON, burst size and σOFF, and the same order was predicted by the relative error (from ground truth values) from our synthetic experiments (compare with fON=0.50 and fON=0.80 in the middle and right panels of Figure 2d) and by sample variability (Appendix 1). In Appendix 3 and Appendix 3—table 3 we show that the relaxation of the assumption of independence between the allele copies in G2 (by instead assuming perfect state correlation of the two alleles) had practically no influence on the inference of the two best estimated parameters (ρ, fON).

A comparison of the two types of data predicted different behaviour (Figure 3g bottom): merged data indicated behaviour consistent with the gene being ON half of the time and small burst sizes, while cell-cycle-specific data implied the gene is ON ≈80% of the time with large burst sizes. We note that the burst sizes have considerable sample variability, exemplifying burst size estimates of transcriptional parameters from mature mRNA distributions have to be treated with caution. Nevertheless, in line with this high fraction ON and large burst size, which start to approach constitutive expression, the variation introduced by the transcription kinetics is relatively modest with Fano factors not far from one: 2.43±0.21 for merged data and 1.75±0.45 for cell-cycle data (the slightly higher value for merged data likely was due to heterogeneity stemming from varying gene copy numbers per cell).

Comparing the mean rates between the G1 and G2 phases, we found that σoff, σon, ρ decreased while fON and the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were well-separated between the two phases were σon and ρ. These two parameters decreased by 65% and 21%, respectively, which suggests that upon replication, there are mechanisms at play which reduce the expression of each copy to partially compensate for the doubling of the gene copy number (gene dosage compensation) (Skinner et al., 2016).

In conclusion, what is particularly surprising in our analysis is the differences in the inference results using merged and cell-cycle specific data: the former suggests the gene spends only half of its time in the ON state while the latter implies the gene is mostly in its ON state.

Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference

Cell-cycle-specific versus merged data

To determine the number of nascent transcripts at the transcription site, we selected the brightest spot from each nucleus and normalized its intensity to the median intensity of the cytoplasmic spots. As the distribution of intensities of the cytoplasmic mRNAs followed a narrow unimodal distribution, its median likely represents the intensity of a single RNA (orange distribution in the central panel of Figure 4a). The inference of transcriptional parameters using the merged data was done using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data.

Similar to above, to account for two gene copies in G2 cells, we assumed that the transcriptional activities of the two gene copies are independent. The distribution of the total fluorescent signal from both gene copies was the convolution of the signal distribution (obtained from the extended delay telegraph model, i.e. Equation (1)) with itself. This convolved distribution was then used in steps (ii) and (iii) of the inference algorithm.

The inference of transcriptional parameters from nascent RNA data was done using a fixed elongation time, which was measured previously at a related galactose-responsive gene (GAL3) at 65bp/s (Donovan et al., 2019). Since the total transcript length is 3062bp (see Figure 1c), the elongation time (τ in our model) is 47.1s0.785 min. The fixed elongation rate enabled us to infer the absolute values of the three transcriptional parameters σoff,σon and ρ.

Best fits of the extended delay telegraph model to the distribution of signal intensity of nascent mRNAs at the transcription site are shown in Figure 4a and b for dataset 1; for the other datasets see Appendix 4—figure 1. The corresponding estimates of the transcriptional parameters are shown in Appendix 4—table 1 and also illustrated by bar charts in Figure 4c. The confidence intervals of the transcriptional parameters (computed using the profile likelihood method) are shown in Appendix 4—table 2.

Comparing this estimation with that from mature mRNA, we observed that in both cases fON0.5 for merged data and in the range 0.7-0.8 for cell-cycle-specific data. Also in both cases, the Fano factors of merged data were larger than those of cell-cycle-specific data. Hence, we are confident that not accounting for the cell cycle phase leads to an over-estimation of the time spent in the OFF state and of the Fano factor. In addition, comparing the burst sizes in Figure 3g and Appendix 4—table 1, we found that not taking into account post-transcriptional noise (by using mature mRNA data) led to an lower estimation of the burst size (2.6-fold, 2.6-fold, and 1.1-fold lower for inference from merged, G1 and G2 data, respectively). We note that it would be useful to directly compare the absolute estimates of the other transcriptional parameters from mature and nascent mRNA data. However, this was not possible because the telegraph model only estimates the switching rates and the initiation rate scaled by the degradation rate, and the latter is unknown. On the other hand, the estimates from nascent data were rates multiplied by the average elongation time, which is known and hence the absolute rates can be estimated from nascent mRNA data only. The only quantities that could be directly compared were the burst size and the fraction of ON time, since these are both non-dimensional.

Comparing the variability of the parameter estimates, we found that ρ and fON were the parameters with the smallest variability across samples for the nascent data, as for inference from mature data. However, the inferred parameter variability across samples was on average about 2.5-fold lower for nascent data compared to mature mRNA data (this was obtained by computing the standard deviation divided by the mean for each parameter and then averaging over all parameters and over merged, G1 and G2 data). Likely this is because nascent data does not suffer from post-transcriptional noise. Indeed, synthetic experiments suggested that the errors in parameter inference using nascent data are often less than those in mature data when fON0.80 (Figure 2d). In summary, we have more confidence in the parameter estimates from nascent data, in particular those from cell-cycle separated data.

To further investigate the hypothesis that estimates from cell-cycle-specific data are more accurate than merged data, we compared the estimates from merged and cell-cycle-specific data to previous live-cell transcription measurements of the same gene (Donovan et al., 2019). Because live-cell traces and simulated traces with the estimated transcriptional parameters are difficult to compare directly, we instead compared their normalized autocorrelation functions (ACFs). Specifically, the live-cell traces displayed cell-to-cell variation in overall fluorescent intensities arising from differences in the PP7 coat protein expression level, precluding a direct comparison of the live-cell intensities with the smFISH distributions. The normalized ACFs are normalized per trace and thus can be used to directly compare the kinetics. For this, we fed the parameter estimates to the SSA to generate synthetic live-cell data and then calculated the corresponding ACF (Appendix 5). We found that the estimates from cell-cycle-specific data produced ACFs that match the live-cell data closer than that from the merged data (Figure 4d). This was also clear from the sum of squared residuals which for each dataset was smaller for the ACF computed using the cell-cycle-specific estimates rather than those from merged data (Figure 4e).

Using nascent data, we also reinvestigated the hypothesis that the gene exhibits dosage compensation. Comparing the mean rates between the G1 and G2 phases, we found that σoff, σon, ρ, fON decreased while the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were cleanly separated between the two phases were σon and fON. These two decreased by 41% and 5%, respectively. These results had some similarity to those deduced from cell-cycle separated mature mRNA data (the decrease of σon) but they also displayed differences. Namely, from mature mRNA data it was predicted that ρ decreased upon replication while from nascent data we predicted that ρ did not change and it was rather fON that decreased by a small degree. The decrease of the burst frequency σon after replication has also been reported for some genes in mammalian cells (Skinner et al., 2016; Padovan-Merhar et al., 2015), indicating that this could be a general mechanism for gene dosage compensation. Our results are consistent with a population-based ChIP-seq study (Voichek et al., 2016) that showed DNA dosage compensation after replication in budding yeast. We note that our single-cell analysis only revealed partial dosage compensation, where the mean signal intensity of nascent mRNAs in G2 is not the same as in G1, but 1.7-fold higher in G2 than in G1 (Figure 4c).

Correcting for experimental artefacts

Although inference on cell cycle separated data outperformed inference on merged data, we noticed that the corresponding best fit distributions did not match well to the experimental signal distributions in the lower bins (Figure 4b and Appendix 4—figure 1). In all cases, the experimental distributions showed high intensities in bins 1, 2, and 3, which was likely an artifact of the experimental data acquisition system. Since we defined the transcription site as the brightest spot, this implies that in the absence of a transcription site, a mature transcript can be misclassified as a nascent transcript. We therefore investigated two methods to correct for this, the ‘rejection’ method and the ‘fusion’ method.

The rejection method removed all data associated with the first k bins of the experimentally obtained histogram of fluorescent intensities (Figure 5a shows the fits for dataset 1; for the other datasets see Appendix 4—figure 2). We found that the parameter estimates varied strongly when the number of bins from which data was rejected (k) was changed (Figure 5b; see also Appendix 4—table 3). Although the distributions fit well to the experimental histograms (Appendix 4—figure 1), comparison with the live-cell normalized ACF indicated that the estimates actually became worse than non-curated estimates, with a higher sum of squared residuals (Figure 5c and d). The rejection method therefore does not produce reliable estimates.

Inference results using the rejection method.

(a) Nascent RNA distributions for cell-cycle-specific and merged data. Black lines represent best fits with delay telegraph model using the rejection method. (only the distributions for dataset #1 k=4 are shown). (b) Estimated transcriptional parameters, burst size, fraction of ON time and Fano factor (mean values and standard deviation error bars of the four datasets) by rejecting the first k bins with k=1,2,3,4. The estimated parameters are listed in Appendix 4—table 3. (c) Normalized autocorrelation function (ACF) predicted by stochastic simulations using the estimated parameters (for k=4) for each of the four data sets versus that measured directly using live-cell data (green line). (d) The sum of squared residuals of the ACF of cell-cycle-specific data from each dataset without/with rejection when k=4.

Next, we considered another data curation method which we call the fusion method. This works by setting to zero all fluorescent intensities in a cell population which were below a certain threshold. In other words, we fused or combined the first k bins of the experimentally obtained histogram of fluorescent intensities, thereby taking into account that the true intensity of bin 0 was artificially distributed over some of the first bins.

Figure 6 and Appendix 4—table 4 show that the fusion method led to estimates that varied little with k which enhanced our degree of confidence in them (note that k=1 is the same as the uncurated data). The peak at the zero bin for both G1 and G2 was better captured using the fusion method than using non-curated data (compare Figure 4b and Appendix 4—figure 1, with Figure 6b). Comparison to the autocorrelation function of the live-cell data shows that correction with the fusion method also led to improved transcriptional estimates, as indicated by a reduction in the sum of the squared residuals for all four data sets (Figure 6c).

Inference results using the fusion method.

(a) Estimated burst size, fraction of ON and Fano factor (mean values and standard deviation error bars of the four datasets) by combining the first k bins with k=1,2,3,4. (b) Corresponding fitted distributions for G1 (top row) and G2 (bottom row) using delay telegraph model with the fusion method (only the distributions for k=4 are shown). Magenta bar represents the combined bin 0–3 when k=4. (c) Normalised autocorrelation function (ACF) predicted by stochastic simulations using the estimated parameters (for k=4) for each of the four data sets versus that measured directly using live-cell data (green line). The sum of squared residuals of the ACF plots using cell-cycle specific data without/with fusion method when k=4. (d) Estimated parameters of cell cycle specified data and merged data of nascent mRNAs with fusion method with k=4 (fusing bins 0–3). These correspond to the fitted distributions in b. The elongation time τ is fixed to 0.785 min. See the inferred parameters in Appendix 4—table 4 for all other values of k.

Overall, we conclude that for inferring parameters from the smFISH data, the optimal method is to use nascent cell-cycle-specific data, corrected by the fusion method. The optimally inferred parameters for the four data sets in our study are those given in Appendix 4—figure 2d. The profile likelihood estimates of the 95% confidence intervals of these parameters are shown in Appendix 4—table 5. Note that in line with our synthetic data study in Figure 2, the parameters suffering from the least sample variability were fON and ρ. The rest of the parameters (σoff,σon and burst size) suffered more sample variability because the fraction of ON time was high; however since their standard deviation divided by the mean (computed over the four datasets) was not high (in the range of 10-20%), they still can be regarded as useful estimates. Note also that the previous prediction that gene dosage compensation involves regulation of the burst frequency did not change upon correction of the nascent data using the fusion method. All these results were deduced assuming that the two copies in G2 are independent from each other. Inferring rates under the opposite assumption of perfectly synchronized copies (Appendix 4—table 6) gave very similar estimates for ρ and fON (to be expected since according to the synthetic data study, these two are the most robustly estimated parameters for genes spending most of their time in the active state) but different estimates for the rest of the parameters. While such perfect synchronization of alleles is unlikely, some degree of synchronization is plausible and further improvement of the transcriptional parameters in the G2 phase will require its precise experimental quantification.

Discussion

In this study, we compared the reliability of transcriptional parameter inference from mature and nascent mRNA distributions, with and without taking into account the cell cycle phase. Although these distributions come from the same experiment, we found that the different fits produced very different parameter estimates, ranging from small bursts to very large bursts. Comparison to live-cell data revealed that the optimal inference method is to use nascent mRNA data that is separated by cell cycle.

Our findings illustrate the risk of inferring transcriptional parameters from fitting of mRNA distributions. First of all, as we have shown, these fits are sensitive to the segmentation method which can lead to large errors in the estimates. Secondly, the most common method of parameter inference in the literature is fitting of mature mRNA distributions that are not separated by cell cycle (Larsson et al., 2019; Raj et al., 2006; Zenklusen et al., 2008). Obtaining such distributions is straightforward using methods such as smFISH, where one can directly count the number of mRNAs per cell. Additionally, with the advance of single-cell mRNA sequencing technologies, it is possible to obtain mRNA distributions for many genes simultaneously and it is tempting to use these to estimate bursting behaviour across the genome (Kim and Marioni, 2013; Larsson et al., 2019). However, our comparisons on the same dataset show that the values obtained from mature mRNA fits (using merged data) can be significantly different from the optimal values (using nascent cell-cycle separated data corrected using the fusion method), with underestimation of the burst sizes of almost 10-fold and underestimation of the active fraction of more than 1.5-fold. These results indicate that parameter inference from merged mature mRNA data should be treated with caution. There were smaller differences between the burst size and the active fraction inferred from cell-cycle separated mature and nascent data (only these two can be directly compared because these are non-dimensional); however the relative errors in the estimates (computed over the four datasets) were more than twofold higher for mature data likely due to post-transcriptional noise which nascent data is free from.

It is more common to fit mature distributions rather than nascent distributions because nascent distributions are technically more challenging to obtain. As nascent single-cell sequencing methods are still in the early phase (Hendriks et al., 2019), the only method available so far for nascent measurements is smFISH (Patel et al., 2021). In such smFISH experiments, intronic probes can be used to specifically label nascent RNA, although there may be some effects of splicing kinetics on the distribution (Wan et al., 2021). If introns are not present, like for most yeast genes, one can use exonic probes instead (Zenklusen et al., 2008). Since exonic probes label both nascent and mature mRNA transcript, it may be challenging to identify the nascent transcription site unambiguously, especially at lower transcription levels. We show in this manuscript that the fusion method can correct for this bias by combining bins below k RNAs, which results in an improvement of the parameter estimates.

Our analysis also emphasizes the importance of separately analyzing G1 and G2 cells (Skinner et al., 2016). It is important to note that for cell-cycle-specific analysis, experimental adjustments or cell-cycle synchronized cultures are not required. Although asynchronous cultures consist of a mix G1, S and G2 cells, the integrated DNA intensity of the nucleus of each cell, for example from a DAPI signal, can be used to separate these cells by cell cycle phase in silico (Skinner et al., 2016; Roukos et al., 2015). As most smFISH experiments already include a DNA-labelled channel, adding an extra analysis step should in principle not limit the incorporation of this step in future smFISH fitting procedures.

Even with our optimal fitting strategy, there is a residual error of the simulated ACF and the measured ACF from live-cell measurements. This difference may be the result of different experimental biases of the two measurements. For example, live-cell measurements have a detection threshold below which RNAs may not be detected. In addition, live-cell measurements include cells in S phase, which are not analyzed in smFISH. There could also be differences in the exact percentage of G1 and G2 cells, or other noise sources between live-cell and smFISH experiments. Alternatively, the fit may be imperfect because there might be parameter sets, others than the ones which our inference algorithm found, which provide an accurate fit of the nascent mRNA distribution and perhaps an even better fit to the ACF than we found. We cannot exclude this possibility because we estimated fON to be 0.7-0.8 and using synthetic data we showed that the accuracy of some parameters (σon,σoff and the burst size) deteriorated as fON approached 1 (Figure 2d). Another factor which could explain the residual error between the simulated ACF and the measured ACF is that perhaps the two-state model may be too simplistic to cover the true promoter states in living cells and may therefore not be able to describe the true in vivo kinetics. The promoter may switch between more than 2 states, or there may be sources of extrinsic noise other than the cell cycle that contribute to the heterogeneity. Previous studies have for example identified extrinsic noise on the elongation rate (Fritzsch et al., 2018). However, these more complex transcription models also have more parameters, which in practice often means that very few will be identifiable with the current set of experimental observations. To fit these models, one requires temporal data on the transcription kinetics (Fritzsch et al., 2018; Rodriguez et al., 2019), or simultaneous measurements of various sources of extrinsic noise, such as single-cell transcription factor concentration and RNA polymerase number measurements, cellular volume, local cell crowding, etc, which are often not available in standard smFISH experiments (Battich et al., 2015; Foreman and Wollman, 2020). Nevertheless, given that there is no explicit time component in smFISH data, the closeness of the simulated ACF to the measured ACF provides confidence we are close to the real values.

The optimal parameter set (Figure 6d) indicates long ON promoter times of 75 s, during which almost 50 RNAs are produced in a burst. Large burst sizes (>70) have been previously reported for mouse embryonic stem cells (Skinner et al., 2016, mouse hepatocytes Bahar Halpern et al., 2015 and human fibroblasts Larsson et al., 2019). The large burst size and high active fraction of 0.78 suggests that GAL10 expression is reaching its limit of maximal expression, which may not be surprising as it is already one of the most highly expressed genes in yeast. It is also interesting to note that the ON time of 75 s is longer than the residence time of a single transcript (47 s), which means that RNA polymerases in the beginning of a burst have already left the locus before the burst has finished.

The optimal parameter set (Figure 6d) also indicates partial gene dosage compensation. Specifically the burst frequency per gene copy (σon) in the G2 phase is 0.66 that in the G1 phase; the other transcriptional rates are not significantly different between the two cell cycle phases. The fold change in the burst frequency per gene copy was previously estimated for the Oct4 and Nanog genes to be 0.63 and 0.71 respectively, in mouse embryonic stem cells (Skinner et al., 2016). The similarity of our estimate of the fold change to those previously measured could be explained by the results of a recent study (Jia et al., 2021); using a detailed model of gene expression, it was shown that in the absence of a dependence of the initiation rate on cell volume, gene dosage compensation optimally leads to approximate mRNA concentration homeostasis when the fold change in the burst frequency upon DNA replication is 2/20.71.

In conclusion, obtaining kinetic information from static distributions can introduce biases. However, we show that it is possible to obtain reasonable estimates that agree with live-cell measurements, if one infers parameters from nascent mRNA distributions that are accounted for cell cycle phase.

Methods

Inference from mature mRNA data

Mathematical model

The steady-state solution of the telegraph model of gene expression (Peccoud and Ycart, 1995) gives mature mRNA distributions. The reaction steps in this model are illustrated in Figure 1a. Next we describe the generation of synthetic mature mRNA data and the algorithm used to infer parameters from this data.

Generation of synthetic mature mRNA data

We generate parameter sets on an equidistant mesh grid laid over the space:

(2) (σoff,σon,ρ)[Uniform(0,150),Uniform(0,150),Uniform(0,250)],

where the units are inverse minute. Furthermore we apply a constraint on the effective transcription rate

ρ^=ρσonσon+σoff<100.

In each of the three dimensions of the parameter space, we take 10 points that are equidistant, leading to a total of 1000 parameter sets which reduce to 789 after the effective transcription rate constraint is enforced.

We additionally fix the degradation rate d=1 min-1. Note that we choose not to vary the degradation rate (as we did for the other three parameters) since it is not possible to infer all four rates simultaneously – this is because the steady-state solution of the telegraph model is a function of the non-dimensional parameter ratios ρ/d,σoff/d and σon/d (Raj et al., 2006).

Once a set of parameters is chosen, we use the stochastic simulation algorithm (SSA Gillespie, 2007) to simulate the telegraph model reactions in Figure 1a and generate 104 samples of synthetic data. Note that each sample mimicks a single cell measurement of mature mRNA.

Steps of the algorithm to estimate parameters from mature mRNA data

The inference procedure consists of the following steps: (i) select a set of random transcriptional parameters; (ii) use the solution of the telegraph model to calculate the probability of observing the number of mature mRNA measured for each cell; (iii) evaluate the likelihood function for the observed data; (iv) iterate the procedure until the negative log-likelihood is minimized; (v) the set of parameters that accomplishes the latter provides the best point-estimate of the parameters of the telegraph model that describes the measured mature mRNA fluctuations.

For step (i), we restrict the search for optimal parameters in the following region of parameter space

(3) (σoff,σon,ρ)[Uniform(0,250),Uniform(0,250),Uniform(0,300)](min-1)=:Θ.

The degradation rate is fixed to d=1 min-1.

Step (ii) can be obtained either by computing the distribution from the analytical solution (Peccoud and Ycart, 1995 or by using the finite state projection (FSP) method Munsky and Khammash, 2006). Here, for the sake of computational efficiency, we use the FSP method to compute the probability distribution of mature mRNA numbers.

For step (iii) we calculate the likelihood of observing the data given a chosen parameter set θ

(4) L(θ)=j=1NcellP(nj;θ),

where P(nj;θ) is the probability distribution of mature mRNA numbers obtained from step (ii) given a parameter set θ, nj is the total number of mature mRNA from cell j and Ncell is the total number of cells.

Steps (i) and (iv) involve an optimization problem. Specifically we use a gradient-free optimization algorithm, namely adaptive differential evolution optimizer (ADE optimizer) using BlackBoxOptim.jl (https://github.com/robertfeldt/BlackBoxOptim.jl; Feldt and Stukalov, 2022) within the Julia programming language to find the optimal parameters

(5) θ=argmin θΘ(j=1NcelllogP(nj;θ)).

The minimization of the negative log-likelihood is equivalent to maximizing the likelihood. Note the optimization algorithm is terminated when the number of iterations is larger than 104; this number is chosen because we have found that invariably after this number of iterations, the likelihood has converged to some maximal value. Note that the inference algorithm is particularly low cost computationally, with the optimal parameter values estimated in at most a few minutes.

Once the best parameter set θ* is found, we calculate the mean relative error (MRE) which is defined as

(6) MRE=1Mi=1MRelative error(θi,θtrue,i),Relative error(θi,θtrue,i)=|θiθtrue,i||θtrue,i|

where θi* and θtrue,i represent the i-th estimated and true parameters respectively, and M denotes the number of the estimated parameters. Thus, the mean relative error reflects the deviation of the estimated parameters from the true parameters.

Inference from nascent mRNA data

Mathematical model

The steady-state solution of the delay telegraph model (Xu et al., 2016) gives the distribution of the number of bound Pol II. In Appendix 6, we present an alternative approach to derive the steady-state solution. The reaction steps are illustrated in Figure 1a.

The position of a Pol II molecule on the gene determines the fluorescence intensity of the mRNA attached to it. In particular for fluorescence data acquired from smFISH PP7-GAL10, the fluorescence intensity of a single mRNA on the DNA locus looks like a trapezoidal pulse (see Figure 1b for an illustration). This presents a problem because although we can predict the distribution of the number of bound Pol II using the delay telegraph model, we do not have any specific information on their spatial distribution along the gene. However, since the delay telegraph model implicitly assumes that a Pol II molecule has fixed velocity and that Pol II molecules do not interact with each other (via volume exclusion), it is reasonable to assume that in steady-state, the bound Pol II molecules are uniformly distributed along the gene. This hypothesis is confirmed by stochastic simulations of the delay telegraph model where the position of a Pol II molecule is calculated as the product of the constant Pol II velocity and the time since its production.

By the uniform distribution assumption and the measured trapezoidal fluorescence intensity profile, it follows that the signal intensity of each bound Pol II has the density function g defined by

g(s)=L1Lχ[0,1](s)+L2Lδ1(s),s[0,1],

where L1=862 bp(base pairs),L2=2200 bp,L=L1+L2 as defined in Figure 1b. The indicator function χ[0,1](s)=1 if and only if s[0,1] and δ1(s) is the Dirac function at 1. The probability of the signal s being between 0 and 1 is due to the first part of the trapezoid function and hence is multiplied by L1/L which is the probability of being in this region if Pol II is uniformly distributed. Similarly, the probability of s being 1 is due to the L2 part of the trapezoid and hence the probability is L2/L by the uniform distribution assumption. Note that the signal s from each Pol II is at most 1 because in practice, the signal intensity from the transcription site is normalized by the median intensity of single cytoplasmic mRNAs (Zenklusen et al., 2008).

The total signal is the sum of the signals from each bound Pol II. Hence, the density function of the sum is given by the convolution of the signal densities from each bound Pol II. Defining p(s|k) as the density function of the signal given there are k bound Pol II molecules, we have that p(s|k) is the k–th convolution power of g, that is

(7) p(s|k)=(ggg)(s)=g*k(s),g*0(s)=δ0(s),

where δ0(s) is the Dirac function at. Finally we can write the total fluorescent signal density function as

(8) p(s;θ)=k=0p(s|k)P(k;θ),

where P(k;θ) is the steady-state solution of the delay telegraph model giving the probability of observing k bound Pol II molecules for the parameter set θ. Hence Equation (8) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site.

Comparison to the algorithm in Xu et al., 2016. Both algorithms take into account the fact that the signal intensity depends on the position of Pol II on the gene, albeit this is done in different ways. In Xu et al., 2016 a master equation is written for the joint distribution of gene state and the number of nascent mRNA. In this case the number of nascent RNAs can have non-integer values since it represents the experimentally measured signal from the (incomplete) nascent RNA. Solution of this master equation proceeds by (a) a discretization of the continuous nascent mRNA signal into bins which are much smaller than one; (b) solution using finite state projection (FSP). This approach can lead to a large state space which incurs a large computational cost. In contrast, in our method, we use FSP to solve for the delay telegraph model, i.e. the distribution of the discrete number of bound Pol II from which we construct (using convolution) the approximate distribution of the continuous nascent mRNA signal by assuming the Pol II is uniformly distributed on the gene. Since the state space of bound Pol II is typically not large, our method will typically be more computationally efficient than the one described in Xu et al., 2016.

Generation of synthetic nascent mRNA data

We generated synthetic smFISH signal data by using the SSA, modified to include delay to simulate the delay telegraph model (Fu et al., 2022). Specifically, we use Algorithm 2 described in Barrio et al., 2006. One run of the algorithm simulates the fluctuating number of bound Pol II molecules in a single cell.

The total fluorescence intensity (mimicking smFISH) is obtained as follows. When a particular bound Pol II is produced by a firing of the transcription reaction GG+N, we record this production time; since the elongation rate is assumed to be constant, given the production time we can calculate the position of the Pol II molecule on the gene at any later time and hence using Figure 1b we can deduce the fluorescent signal due to this Pol II molecule.

Specifically we normalize each transcribing Pol II’s position to [0,1] and map the position to its normalized signal by

q(x)={xLL1x[0,L1L],1x[L1L,1],

where x is the normalized position on the gene. Thus at a given time, the total fluorescent signal from the n-th cell (the n-th realization of the SSA) equals

qn=j=1Jnq(xj),

where Jn is the number of bound Pol II molecules in the n-th cell, and {xj} with j=1,,Jn is the vector of all Pol II positions on the gene. The total signal from each cell is a real number but it is discretized into an integer.

The kinetic parameters are chosen from the same region of parameter space as in (2), on the same equidistant mesh grid and with the same constraint on the effective transcription rate. Unlike the mature mRNA case, here there is no degradation rate; instead we have the elongation time, which we fix to τ=0.5 (min). Note that fixing this time is necessary since it is not possible to infer the three transcriptional parameters rates and the elongation time simultaneously because the steady-state solution of the delay telegraph model is a function of the non-dimensional parameter ratios ρτ,σoffτ and σonτ.

Once a set of parameters is chosen, we use the modified SSA (as described above) to simulate the signal intensity in each of 104 cells.

Steps of the algorithm to estimate parameters from nascent mRNA data

The inference procedure is essentially the same as steps (i)-(v) described in mature mRNA inference except for the following points.

In step (ii), the probability of observing a total signal of intensity i from a single cell is obtained by integrating p(s;θ) in Equation (8) on an interval [i-1,i] for i which, in our numerical scheme, means

(9) S(i;θ):=k=0KP(k;θ)i1igk(x)dx,i=1,2,

Note that the integration over the interval of length 1 is to match the discretization of the synthetic data and θΘ. Intuitively, one can always choose a positive integer K such that P(k)=0 for any kK. The computation of the solution of the delay telegraph model P(k) can be done either using the analytical solution (evaluated using high precision) or using the finite state projection algorithm (FSP) Munsky and Khammash, 2006. In Appendix 6—figure 1 and Appendix 6—table 1, we show that the two methods yield comparable accuracy and CPU time.

For step (iii) we calculate the likelihood of observing the data given a chosen parameter set θ

(10) L(θ)=j=1NcellS(qj;θ),

where qj is the discretized total signal intensity from cell j and Ncell is the total number of cells. In the optimization, we aim to find

θ=argmin θΘ(j=1NcelllogS(qj;θ)).

The whole procedure (for both mature and nascent mRNA inference) is summarized by a flow-chart in Figure 1c.

Experimental data acquisition and processing

A diploid yeast strain of BY4743 background with a single integration of 14xPP7 loops at the 5’UTR of GAL10 (strain YTL047 Donovan et al., 2019) was used in this study. Four replicate yeast cultures were grown in synthetic complete media with 2% galactose to early mid-log (OD 0.5), fixed with 5% paraformaldehyde (PFA) for 20 min, permeabilized with 300 units of lyticase and hybridized with 7.5 pmol each of four PP7 probes labeled with Cy3 (Integrated DNA Technologies) as described in Trcek et al., 2012 and Lenstra et al., 2015; Patel et al., 2021, resulting in four technical replicates. The PP7 probe sequences are: atatcgtctgctcctttcta, atatgctctgctggtttcta, gcaattaggtaccttaggat, aatgaacccgggaatactgc. Coverslips were mounted on microscope slides using mounting media with DAPI (ProLong Gold, Life Technologies).

The coverslips were imaged on a Zeiss AxioObserver (Zeiss, USA) widefield microscope with a Plan-Apochromat 40x1.4 NA oil DIC UV objective and a 1.25 x optovar. For Cy3, a 562 nm longpass dichroic (Chroma T562lpxr), 595/50 nm emission filter (Chroma ET595/50 m) and 550/15 nm LED excitation at full power (Spectra X, Lumencor) were used. For DAPI, a 425 nm longpass dichroic (Chroma T425lpxr) and a 460/50 nm emission filter (Chroma ET460/50 m) and LED excitation at 395/25 nm at 25% power (Spectra X, Lumencor) were used. The signal was detected on a Hamamatsu ORCA-Flash4.0 V3 Digital CMOS camera (Hamamatsu Photonics, Japan). For each sample and each channel, we utilized the Micro-Manager software (UCSF) to acquire at least 20 fields-of-view based on the DAPI channel. Each field-of-view consisted of 13 z-stacks (with a z-step of 0.5 µm) at 25ms exposure for DAPI and 250ms exposure for Cy3.

A custom python pipeline was used for analysis (https://github.com/Lenstralab/smFISH; Pomp, 2022). Maximum intensity projected images were used to segment the cell and nucleus using Otsu thresholding and watershedding (segmentation 1). In addition, we segmented cells using CellProfiler (segmentation 2). The diffraction-limited Cy3 spots were detected per z-slice using band-pass filtering and refined using iterative Gaussian mask localization procedure (Crocker and Grier, 1996; Thompson et al., 2002; Larson et al., 2005; Larson et al., 2011 and Coulon et al., 2014). Cells in which no spots were detected were excluded from further analysis since a visual inspection indicated that these cells were not properly segmented or were improperly permeabilized.

Spots were classified as nuclear or cytoplasmic and the brightest nuclear spots were classified as transcription sites. The intensity of the brightest nuclear spot in a cell was normalized with the median fluorescence intensity of all the cytoplasmic spots in all cells. This is due to the fact that 90% of cytoplasmic mRNAs are isolated (Figure 4a), thus the median of the fluorescence signal of cytoplasmic mRNAs can be considered as the normalizing value. The distribution of the normalised intensity of the brightest nuclear spot, calculated over the cell population, is the experimental equivalent of the total fluorescent signal density function as given by the solution of the modified delay telegraph model, Equation (8).

The number of mature mRNA in each cell is given by counting the number of spots in the entire cell, that is nuclear plus cytoplasmic. The transcription site is counted as 1 mRNA, regardless of its intensity. We show in Figure 3c that this has negligible influence on the estimated parameters since the mean number of mature mRNA is much greater than 1. The distribution of the number of spots is the experimental equivalent of the solution of the telegraph model, that is the marginal distribution of mature mRNA numbers in steady-state conditions.

The integrated nuclear intensity of each cell was calculated by summing the DNA content intensity (DAPI) of all the pixels within the nucleus mask. The distribution of the intensities was fit with a bimodal Gaussian distribution. Those cells whose intensity was within a standard deviation of the mean of the first (second) Gaussian peak was classified as G1 (G2) (see Figure 3e left). This gave similar results to a different cell cycle classication method using the Fried/Baisch model (Johnston et al., 1978) which was recently employed in Skinner et al., 2016. See Appendix 7—figure 1 for a comparison of the two methods. We note that cells in late G2 may contain two separate transcription sites, one in the mother and one in the bud. When the nucleus moves into the bud, buds often contain less DNA than G1 cells, and mothers contain more DNA than G1 cells, both of which are excluded from the analysis. When the DNA content of the mother and daughter is similar, both mother and daughter are counted separately as G1 cells. We note that this late G2 subpopulation is very small.

We did four independent experiments with a total number of cells equal to 2510, 6411, 4592, 3181, respectively. After classification, the numbers of G1 cells are 766, 2111, 1495, 904 and the number of G2 cells are 683, 1657, 1209, 1143, whereas the rest were classified as undetermined.

Data availability

The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH; Pomp, 2022. The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl; Fu, 2022.

Appendix 1

Accuracy of inference from synthetic mature and nascent mRNA data

Inference from synthetic mature mRNA data with external noise

In the main text, Figure 2, we showed how the addition of 5% external noise to synthetic mature mRNA data degrades the inference accuracy. In Appendix 1—figure 1 we show how the addition of a larger amount of external noise (10%) causes an even larger loss of accuracy. In particular for 91% of the parameters, the inference accuracy is higher when using nascent mRNA data (Appendix 1—figure 1a) and the median relative errors become very high for most parameters, especially for ρ and σoff in the limit of large fON (Appendix 1—figure 1b).

Appendix 1—figure 1
Comparing inference accuracy using synthetic nascent mRNA data and synthetic mature mRNA data with 10% external noise (log-normal distributed noise is added to the initiation rate ρ to mimic external noise due to post-transcriptional processing that is only present in mature mRNA).

(a) Ratio of the mean relative errors in the two types of data as a function of the true fraction of ON time, fON. For ≈91% (719/789) of the parameters, the inference accuracy is higher when using nascent mRNA data. (b) The median relative error of each transcriptional parameter as a function of the fraction of ON time using synthetic mature mRNA.

Accuracy of distribution fits

Appendix 1—figure 2
Inference with the telegraph model and delay telegraph model for six parameter sets.

(a) Estimates using the inference algorithm with the telegraph model (with no external noise) for six parameter sets. For both the ground truth and the estimated parameters, we fix the degradation rate d=1 min-1. (b). Estimates using the inference algorithm with the delay telegraph model for six parameter sets. For both the ground truth and the estimated parameters, we fix the delay τ=0.5 min. (c) Distributions from synthetic mature mRNA data fitted using the telegraph model. (d) Distributions from synthetic nascent mRNA data fitted using the delay telegraph model.

In the main text, Figure 2, we showed how the accuracy of parameter estimation is not uniform across parameter space. Here we investigate if there is any relationship between this accuracy and how well is a distribution of mature mRNA numbers / signal intensity fit by the inference algorithm. For 12 parameter sets (6 for the telegraph model – Appendix 1—figure 2a) and (6 for the delay telegraph model – Appendix 1—figure 2b), we evaluate the fits to the distribution of synthetic data in Appendix 1—figure 2c–d. The results show that independent of the accuracy of the parameters estimated by the inference algorithm, the fits of the delay telegraph and telegraph model distributions to the distributions generated from synthetic data are generally excellent.

Testing the variability of the inference procedure

To obtain a better understanding of the variability of the inference procedure, for each of the six parameter sets in Appendix 1—figure 2b and i.e. for the inference using the delay telegraph model, we generated 10 independent sets of synthetic data and used maximum likelihood to infer the parameters for each dataset. The mean and standard deviation of the parameters (computed over all 10 datasets) are shown in Appendix 1—table 1. The means are close to the true parameter values (in Appendix 1—figure 2b); this shows that the inference procedure is working correctly and the deviations from ground truth values are mostly due to noise in the synthetic datasets. We can define the sample variability of a parameter as the standard deviation divided by the mean. Ordering parameters by this quantity, we find that for all six parameter sets the error is largest for σoff. For small fON, the parameters ordered by increasing error are: σon, burst size, ρ, fON and σoff. While for large fON, the order is: ρ, fON, σon, burst size and σoff. Note that this order is the same as we determined using the relative errors (from the ground truth) which is shown in Figure 2 of the main text.

Appendix 1—table 1
Mean and standard deviation of the parameters estimated from 10 independent synthetic datasets, generated for each parameter set in Appendix 1—figure 2.
ParameterMeanStandard deviation
σoffσonρburst sizefONσoffσonρburst sizefON
Set 131.852.6915.680.530.1019.070.227.680.070.05
Set 292.6730.47141.421.560.2523.761.5121.610.140.04
Set 39.948.0163.356.390.450.720.191.940.250.01
Set 42.302.1618.768.170.480.120.050.310.310.01
Set 52.265.8012.575.590.720.190.300.110.410.01
Set 61.2210.0124.9720.530.890.100.430.161.650.01

Confidence intervals using profile likelihood

We perform a profile likelihood study (Kreutz et al., 2013) on the 12 parameter sets of synthetic mature/nascent mRNA data described in Appendix 1—figure 2a-b. We obtain the 95% confidence interval for each parameter. The results are shown in Appendix 1—table 2. We also compare the relative errors for each parameter (computed as (estimated value - ground truth)/ground truth using the data in Appendix 1—figure 2a and b) with the profile likelihood error (computed as (upper bound - lower bound)/optimal estimate using the bounds in Appendix 1—table 2 and the optimal values in Appendix 1—figure 2a and b). The results are shown in Appendix 1—table 3. Note that in most cases, the parameters ordered by relative error are in agreement with the parameters ordered by profile likelihood error.

Appendix 1—table 2
95% confidence intervals of the 12 parameter sets (shown in Appendix 1—figure 2a-b).
ParameterTelegraph CIDelay CI
σoffσonρσoffσonρ
Set 1(6.76, 300.00)(3.53, 8.49)(5.59, 107.67)(13.80, 300.00)(1.85, 2.69)(9.94, 160.51)
Set 2(17.22, 190.38)(14.56, 23.92)(61.14, 250.51)(103.80, 268.35)(30.00, 35.38)(161.54, 300.00)
Set 3(0.54, 0.59)(0.41, 0.43)(46.08, 47.15)(6.83, 8.55)(6.42, 7.04)(58.28, 63.18)
Set 4(0.31, 0.35)(0.46, 0.49)(15.50, 15.91)(1.64,1.99)(1.65,1.85)(17.92,18.77)
Set 5(0.49, 0.85)(5.82, 7.62)(49.62, 50.92)(1.56,2.59)(4.58,5.85)(12.08,12.97)
Set 6(0.08, 0.16)(2.46, 3.54)(26.10, 26.54)(0.73,1.24)(7.53,9.75)(24.32,25.14)
Appendix 1—table 3
Table showing the relative error against profile likelihood error of 12 parameter sets (shown in Appendix 1—figure 2a and b).

See text for details.

TelegraphDelay
Relative errorProfile likelihood errorRelative errorProfile likelihood error
σoffσonρσoffσonρσoffσonρσoffσonρ
Set 10.870.270.7718.820.8811.070.820.110.581.750.341.82
Set 20.040.041.44E-034.720.522.220.410.110.270.810.160.55
Set 30.080.010.010.090.060.020.200.150.030.220.090.08
Set 40.010.011.11E-030.140.070.030.220.180.020.200.110.05
Set 50.220.134.97E-030.550.270.030.300.230.030.640.280.07
Set 60.290.240.010.740.370.020.200.120.010.550.250.03

Effect of random perturbation of mature mRNA data on inference

To assess the reliability of the inference results due to errors in spot counting, we redid the inference with synthetic mRNA data (and using the telegraph model) perturbed randomly by minus 1/plus 1/unchanged with probability 1/3. The results are shown in Appendix 1—table 4.

We found that when the fraction of ON time was very small (Set 1), there is a considerable effect of the perturbations on the values of the inferred parameters – this is because in this case, the mean number of mRNA is very small and hence a perturbation of one molecule is very significant. However as expected, the inference results are quite robust when the fraction of ON time is not too small (Sets 2–6).

Appendix 1—table 4
Effects of random perturbations on inference of parameters from mature mRNA data (using the telegraph model).
ParameterTrueUnperturbed–1/0/+1 stochastic perturbation
σoffσonρburst sizefONσoffσonρburst sizefONσoffσonρburst sizefON
Set 1120.387.7040.850.340.0615.585.639.220.590.270.591.173.706.280.66
Set 235.4817.3985.052.400.3336.7318.0385.172.320.3324.1315.7970.892.940.40
Set 30.520.4246.2188.360.440.570.4246.6282.470.430.610.4647.1776.740.43
Set 40.320.4715.7148.650.590.330.4715.7048.010.590.390.5416.0941.170.58
Set 50.535.8549.9594.720.920.646.6250.2077.990.910.686.7250.3574.480.91
Set 60.163.8926.45169.810.960.112.9426.30238.760.960.133.0626.42203.200.96

Effect of log-normal noise in nascent fluorescent signal on inference

Given the synthetic nascent mRNA signal data {Si}i=1N (where N represents the number of samples) generated by delay telegraph model, for each Si, we perform a random perturbation Φ under the following conditions

Φ:SiΦ(Si)

where Φ is a stochastic perturbation satisfying the following constraints

Φ(Si) is a random variable sampled from the distribution Log-normal (α,β) whose mean equals Si, and the standard deviation equals 0.1Si.

This means the random perturbation keeps the mean value of the signal Si unchanged but adds noise with a coefficient of variation equal to 0.1.

In Appendix 1—table 5 we compare the results of inference using synthetic nascent mRNA data with the aforementioned stochastic perturbation and without. As for mature mRNA data, we find that the perturbation has only a significant impact when the fraction of ON time is very small.

Appendix 1—table 5
Inference using the delay telegraph model from synthetic nascent fluorescent data, with and without perturbation by log-normal noise.
ParameterTrueDelayPerturbation
σoffσonρburst sizefONσoffσonρburst sizefONσoffσonρburst sizefON
Set 134.492.8016.650.480.0862.743.1026.380.420.0550.081.5231.710.630.03
Set 2135.6832.98179.741.320.2080.4229.44131.351.630.27233.9630.74300.001.280.12
Set 39.867.9663.166.410.457.856.7861.207.790.468.856.6365.497.400.43
Set 42.262.1418.648.260.491.771.7618.2810.360.501.901.6418.659.830.46
Set 52.275.8012.555.530.721.604.4812.217.630.742.594.7913.275.130.65
Set 61.189.9424.9221.150.890.948.7424.7426.370.901.8410.1526.1014.180.85

Appendix 2

Telegraph model versus delay telegraph model

In this section, we aim to precisely understand the differences between the telegraph and the delay telegraph models. For both models, we define the rate of switching from the ON state to OFF state as σoff, the rate of switching from the OFF state to the ON state as σon and the production rate of nascent mRNAs in the ON state as ρ. The first-order decay rate of nascent mRNA in the telegraph model is given by d and the delay time between initiation and degradation in the delay telegraph model is τ. Note that while the telegraph model was in the main text explained in terms of mature mRNA, in this section we use it as a model for nascent mRNA since we want to compare directly with the delay telegraph model. The telegraph and delay telegraph models can be solved exactly in steady-state conditions (Peccoud and Ycart, 1995; Xu et al., 2016) (an alternative derivation for the delay telegraph model is also given in Section F). From the generating function solution of the models, one can deduce expressions for the first and second centered moments in steady-state conditions:

(11) ntele=ρσond(σon+σoff),ndelay=ρσonτσon+σoff,Vardelay=ρσon((σoff+σon)3τ+2ρσoff(e(σoff+σon)τ+(σoff+σon)τ1))(σoff+σon)4,Vartele=ρσon(ρσoff+d(σoff+σon)+(σoff+σon)2)d(σoff+σon+d)(σoff+σon)2,

where ndelay and ntele are the mean number of nascent mRNA in the delay telegraph and telegraph models respectively, and Vardelay and Vartele are the corresponding variances in molecule numbers. To understand the differences between the two models, we set the means of the two models to be the same (by choosing d=1/τ) and then compute the relative error in their variance predictions:

(12) R=Vardelay-VarteleVardelay=(2(1+x)+ex(x2-2))y(1+x)(2y+ex(x+2(x-1)y)),

where x and y are non-dimensional variables defined as

x=τ(σoff+σon),y=(ρ/σoff)(σoff/(σoff+σon))2.

Note that x is the ratio of the time for nascent mRNA to detach from the gene τ and the timescale of promoter switching 1/(σoff+σon). The non-dimensional parameter y is a measure of the burstiness of gene expression since it increases with the mean burst size ρ/σoff and the fraction of time spent in the OFF state σoff/(σoff+σon). In fact as σoff0 which implies y0, the telegraph model converges to a constitutive model where the time between two successive nascent mRNA production events is exponentially distributed. Note that the relative errors in the Fano factor and the coefficient of variation squared are the same as the error in the variance since the means of the two models are the same.

Using Equation (12), it is easy to see that the relative error between the two models vanishes in the limit of x0 (when the promoter switching timescales are much longer than the time spent by a polymerase on a gene) or in the limit of y0 (when there is no burstiness in gene expression). Hence, the telegraph model is an accurate approximation of the delay telegraph model in these two limits. It can be shown that in the first case, the distribution of nascent mRNA numbers is well approximated by the sum of two Poisson distributions, whereas for the second case the distribution is a Poisson. Note that since R is always positive, it follows that the telegraph model systematically underestimates the size of noise in nascent mRNA numbers. It can be further shown that R increases monotonically with x and y and the maximum attainable value is R=1/2 (when x,y), that is, the worst prediction of the telegraph model is that the variance is half that of the delay telegraph model.

In Appendix 2—figure 1a and c we contrast the distributions of nascent mRNA predicted by the delay telegraph model with those predicted by the telegraph model for the case d=1/τ where the means are matched, as also assumed for the calculation of the relative error above. Appendix 2—figure 1b shows the effect of increasing x (via τ) and Appendix 2—figure 1d shows the effect of increasing y (via ρ). The number distributions are constructed from the generating functions of the telegraph model Peccoud and Ycart, 1995 and of the delay telegraph model (Section F). Note that the shapes of the distributions of the two models can be considerably different, e.g. the cases ρ=3 and ρ=10 in Appendix 2—figure 1c shows that the nascent distribution from the delay model is bimodal with peaks at 0 and at a non-zero value but it is unimodal with a non-zero peak from the telegraph model. Hence we conclude that if we are interested in accurately predicting nascent mRNA number distributions then generally the Markovian telegraph model is not a good approximation of the non-Markovian delay telegraph model.

In the main text, we used the delay telegraph model to infer the synthetic data from the synthetic data generated by the delay SSA. For comparison, we repeat the same but now we use the telegraph model (rather than the delay telegraph model) to calculate step (ii) in the inference algorithm, that is P(k;θ) in Equation 8 in the main text is now chosen to be the steady-state solution of the telegraph model. Once the best parameter set θ* is found, we calculate two scores: (i) the fitness given by the smallest negative log-likelihood value found by the optimizer normalized over the sample size. (ii) the mean relative error (for definition see Equation (4.5) in the main text). In Appendix 2—figure 1, we show both of these scores obtained for 20 independent numerical experiments – clearly the error using the delay telegraph model for the inference algorithm is significantly lower than if the telegraph model is used.

These observations are further reinforced in Appendix 2—figure 1f where we show the best fit distributions and the corresponding relative errors in the estimates of the burst size and the burst frequency (bar chart insets). These are computed using the formula:

(13) α=Relative error in the burst frequency estimate=|σon*-σon||σon*|,β=Relative error in the burst size estimate=|ρ*/σoff*-ρ/σoff||ρ*/σoff*|.

We note that the distributions are well fit in all cases, using both telegraph and delay telegraph models. However the errors α and β are considerably larger for the former.

In Appendix 2—table 1 we show the true and estimated parameters for the 6 distributions in Appendix 2—figure 1f. The estimates for ρ are accurate independent of the choice of model; however, the estimates for the promoter switching rates σoff,σon are far more accurate using the delay telegraph model. Hence we conclude that although inference using the telegraph model provides a histogram that fits well with the synthetic data, the inferred parameter values have little meaning because they are not an accurate reflection of the true parameter values.

Appendix 2—figure 1
Distributions and mean errors of the transcriptional parameter inference.

(a-d) Comparison of the stochastic properties of the delay telegraph model and the telegraph model. a. Distributions of the nascent mRNA predicted by the delay telegraph model and the telegraph model for various values of τ. We fix the parameters (σoff,σon,ρ)=(0.6,0.03,1) which implies that the change of τ leads to a change in x=τ(σon+σoff) at constant y=(ρ/σoff)(σoff/(σoff+σon))2. (b) Corresponding relative error R between the variances of two models calculated as a function of τ using Equation (12). (c) Distributions of the nascent mRNA predicted by the delay telegraph model and the telegraph model for various values of ρ. We fix the parameters (σoff,σon,τ)=(0.6,0.03,100) which implies that the change of ρ leads to a change in y at constant x. (d) Corresponding relative error R between the variances of two models calculated as a function of ρ using Equation (12). (e-f) Inference of transcriptional parameters using as input synthetic fluorescent signal data generated by SSA simulations of transcription and fluorescent tagging for 104 cells (see Methods section of the main text). (e) Mean relative error and normalised fitness score (fitness/number of samples) plot for 20 sets of numerical experiments. The inference is done in two different ways, using either the telegraph model (green) or the delayed telegraph model (blue). (f) Distributions of total fluorescent intensity from synthetic data (red dots) fit using the inference algorithm with telegraph model (dashed green) or delayed telegraph model (blue) for 6 different parameter sets. The insets show the relative errors in the estimates of the burst frequency (α) and of the burst size (β) calculated using Equation (13). Note that while both models provide a very good fit to the distribution from synthetic data, nevertheless parameter estimation is far more accurate using a delayed telegraph model. This is also reflected in (a) where we see low fitness scores for both models but a high mean relative error for estimates based on the telegraph model. The true and estimated parameters are shown in Appendix 2—table 1.

Appendix 2—table 1
Estimates using the inference algorithm with delay telegraph and telegraph models for the six parameter sets in Appendix 2—figure 1f.
ParameterTrueDelayTelegraph
σoffσonρburst sizefONσoffσonρburst sizefONσoffσonρburst sizefON
Set 11.058.2057.9955.090.890.947.1957.9261.870.880.604.1059.5199.420.87
Set 21.273.1458.1745.690.711.132.9158.0151.160.720.451.2259.43133.070.73
Set 32.275.8012.555.530.721.604.4812.217.630.740.591.7112.0920.610.74
Set 41.189.9424.9221.150.890.948.7424.7426.370.900.464.1224.9254.000.90
Set 52.262.1418.648.260.491.771.7618.2810.360.500.540.5817.6032.590.52
Set 61.384.7721.7415.790.781.084.0021.5019.940.790.381.4921.3156.440.80

Appendix 3

Inference results using mature mRNA data

Influence of segmentation on inference

In main text Section Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference, we used independent segmentation tools to study segmentation artifacts. The inference results under segmentation 1, segmentation 2 and segmentation 2 without counting the transcriptional site are summarised in Appendix 3—table 1.

Appendix 3—table 1
Inferred transcriptional parameters using merged mature mRNA data from segmentation 1, segmentation 2 and segmentation 2 without transcriptional site (TS).
Parametersegmentation 1segmentation 2segmentation 2 without TS
σoffσonρburst sizefONσoffσonρburst sizefONσoffσonρburst sizefON
Set 121.184.7347.152.230.183.273.3620.626.310.512.922.4521.057.220.46
Set 219.535.4440.112.050.223.223.8719.175.950.552.462.6818.357.460.52
Set 316.534.7739.482.390.223.674.0520.095.470.523.002.8719.736.570.49
Set 428.725.1650.001.740.155.794.5823.043.980.445.273.2724.334.610.38
Mean21.495.0244.192.100.193.993.9720.735.430.503.412.8220.876.460.46
Std5.190.345.210.280.031.060.431.430.890.041.260.352.561.290.06
Appendix 3—figure 1
Merged and cell-cycle specific mature mRNA count distributions.

(a) Merged mature mRNA count distribution (purple) under segmentation method 1 and with/without counting the transcriptional site (TS) under segmentation method 2 with a best fit obtained from the telegraph model (magenta curves). (b) Cell-cyle specific mature mRNA count distribution (purple) under segmentation method 2 with a best fit obtained from the telegraph model (magenta curves).

Confidence intervals for estimates from merged and cell-cycle specific mRNA data using segmentation 2

We obtain 95% confidence intervals for the parameters estimated from mature mRNA data under segmentation 2 (shown in Figure 3f of the main text). The confidence intervals are shown for σoff,σon,ρ in Appendix 3—table 2.

Appendix 3—table 2
95% confidence interval intervals for the estimates from experimental mature mRNA data using segmentation 2.
matureσoffσonρCI for σoff,σon,ρ
mergedSet 13.273.3620.62(2.08, 6.08)(2.81, 4.16)(18.03, 25.97)
Set 23.223.8719.17(2.38, 4.69)(3.41, 4.44)(17.51, 21.62)
Set 33.674.0520.09(2.54, 5.92)(3.51, 4.79)(18.03, 23.67)
Set 45.794.5823.04(3.33, 13.33)(3.79, 5.85)(19.36, 33.51)
G1Set 10.692.7610.71(0.27, 2.35)(1.73, 4.96)(9.77, 12.85)
Set 20.602.8010.27(0.35, 1.24)(2.13, 3.92)(9.77, 11.15)
Set 30.883.4110.33(0.38, 2.84)(2.23, 5.77)(9.58, 12.35)
Set 42.465.2111.12(0.46, 300.00)(2.68, 18.86)(8.96, 216.67)
G2Set 10.420.999.45(0.22, 0.91)(0.65, 1.48)(8.64, 10.86)
Set 20.141.017.92(0.05, 0.36)(0.56, 1.72)(7.61, 8.46)
Set 30.181.247.91(0.05, 0.72)(0.57, 2.59)(7.49, 8.85)
Set 40.421.688.19(0.12, 2.07)(0.76, 3.50)(7.41, 10.28)

Effect of synchronization of gene copies in G2 phase on parameter inference

In the main text, we assumed that the transcription of two allele copies in the G2 phase are independent. Here we consider the opposite scenario where the two allele copies in G2 phase are perfectly synchronized with each other, i.e. when one copy switches from on to off, the other copy also switches from on to off. Note that the time at which mRNA is transcribed from each copy is however not the same. Using this modified simulation algorithm, we re-perform the inference using the experimental data under segmentation 2. The results are shown in Appendix 3—table 3. Comparing these values with those estimated for phase G2 under the assumption of allele independence (main text Figure 3f right panel), we find that ρ and fON are very similar but the other parameters vary considerably.

Appendix 3—table 3
Inferred transcriptional rate (normalized) per gene copy for the G2 cell cycle phase under the assumption that the two gene states are perfectly synchronized.
matureσoffσonρburst sizefON
G2 syncSet 10.622.178.4513.690.78
Set 20.373.267.7221.050.90
Set 30.674.427.9211.850.87
Set 40.553.457.5513.620.86
Mean0.553.327.9115.050.85
Std0.130.920.394.090.05

Appendix 4

Inference results using nascent mRNA data

Inference using non-curated data

We show the inference results of the experimental non-curated nascent mRNA data (with and without taking into consideration the cell-cycle) for the four datasets (Appendix 4—table 1) and their 95% confidence interval using the profile likelihood estimate (Appendix 4—table 2). The best fit distributions for merged and cell-cycle-specific data are shown in Appendix 4—figure 1.

Appendix 4—table 1
Estimated parameters from the non-curated distribution of the normalized intensity of the brightest nuclear spot (nascent mRNA data) constructed by merging all data or else specific to the cell cycle phases G1 and G2.

The elongation time τ is estimated to be 0.785 min, based on measurements of the elongation speed.

nascentσoffσonρburst sizefON
mergedSet 15.245.0777.4614.780.49
Set 25.585.1382.1114.710.48
Set 35.115.1776.0914.900.50
Set 45.966.1373.2312.300.51
Mean5.475.3777.2214.170.50
Std0.380.503.701.250.01
G1Set 11.113.7637.8334.100.77
Set 21.533.9441.3327.060.72
Set 30.953.2336.7938.560.77
Set 41.283.7636.0428.090.75
Mean1.223.6738.0031.950.75
Std0.250.312.345.390.02
G2Set 10.741.6935.0047.300.70
Set 20.822.1836.3044.370.73
Set 30.912.1934.5437.900.71
Set 41.082.6133.2730.760.71
Mean0.892.1734.7840.080.71
Std0.150.381.257.350.01
Appendix 4—table 2
95% confidence intervals for non-curated data estimated using the profile likelihood method.
nascentσoffσonρburst sizefONCI for σoff,σon,ρ
mergedSet 15.245.0777.4614.780.49(4.42, 6.23)(4.68, 5.45)(73.48, 82.92)
Set 25.585.1382.1114.710.48(5.05, 6.15)(4.97, 5.37)(79.38, 85.53)
Set 35.115.1776.0914.900.50(4.53, 5.71)(4.98, 5.38)(73.49, 79.64)
Set 45.966.1373.2312.300.51(5.19, 7.22)(5.87, 6.57)(69.34, 78.37)
G1Set 11.113.7637.8334.100.77(0.77, 1.66)(3.04, 4.63)(36.26, 39.96)
Set 21.533.9441.3327.060.72(1.29, 1.88)(3.64, 4.40)(40.18, 42.85)
Set 30.953.2336.7938.560.77(0.77, 1.17)(2.86, 3.60)(36.05, 37.69)
Set 41.283.7636.0428.090.75(0.99, 1.73)(3.18, 4.34)(34.68, 37.75)
G2Set 10.741.6935.0047.300.70(0.54, 1.02)(1.36, 2.08)(33.64, 36.72)
Set 20.822.1836.3044.370.73(0.66, 1.07)(1.85, 2.52)(35.35, 37.40)
Set 30.912.1934.5437.900.71(0.70, 1.23)(1.85, 2.61)(33.38, 36.05)
Set 41.082.6133.2730.760.71(0.75, 1.71)(2.00, 3.41)(31.91, 35.40)
Appendix 4—figure 1
Inference results using merged and cell-cycle specific nascent data.

Experimental distributions (purple) are fit using the delay telegraph model (magenta curves).

Inference using curated data

We show the inference results using nascent mRNA data curated with the rejection method in Appendix 4—table 3 and Appendix 4—figure 2, and with the fusion method in Appendix 4—table 4. We also used the profile likelihood method to obtain the 95% confidence intervals for the fusion method with k=4; the results are shown in Appendix 4—table 5. We also reinferred the parameters in the G2 phase for fusion corrected data by assuming that the allele states are perfectly synchronised; these are shown in Appendix 4—table 6.

Appendix 4—figure 2
Inference results using cell-cycle specific data curated with the rejection method (only the distributions for k=4 are shown).

Corresponding fitted distributions (purple) for G1 (top row) and G2 (bottom row) using the delay telegraph model (magenta curves).

Appendix 4—table 3
(Rejection method) Estimated parameters by discarding the first k signal bins of the experimental distribution of the signal intensity (and renormalizing afterwards).

Inference is done for each of the four data sets. The elongation time is fixed to τ0.785 min.

k = 1σoffσonburst sizefON
G1Set 11.064.3336.6234.630.80
Set 21.454.4239.8427.510.75
Set 31.043.8436.6635.190.79
Set 41.274.1235.4127.780.76
Mean1.214.1837.1331.280.78
Std0.190.261.904.200.02
G2Set 10.842.1034.5441.200.71
Set 20.972.9435.4936.730.75
Set 30.942.5333.6835.990.73
Set 41.203.0732.8127.280.72
Mean0.992.6634.1335.300.73
Std0.150.441.155.820.02
k=2σoffσonρburst sizefON
G1Set 11.586.3337.7723.890.80
Set 21.935.9340.8621.160.75
Set 31.285.1537.0729.070.80
Set 41.555.2935.9223.100.77
Mean1.595.6737.9124.300.78
Std0.270.552.113.380.02
G2Set 11.433.4235.8325.110.71
Set 21.694.3937.4222.100.72
Set 31.333.3834.6426.080.72
Set 41.563.6833.7021.610.70
Mean1.503.7235.3923.720.71
Std0.160.471.612.200.01
k=3σoffσonρburst sizefON
G1Set 12.669.0339.7314.920.77
Set 22.557.4142.0416.460.74
Set 31.646.6737.6922.920.80
Set 42.227.3237.0216.650.77
Mean2.277.6139.1217.730.77
Std0.461.002.263.540.02
G2Set 12.014.3537.2318.500.68
Set 22.165.1038.5617.850.70
Set 31.704.0535.5520.850.70
Set 41.884.1634.4818.300.69
Mean1.944.4136.4518.880.69
Std0.190.481.801.340.01
k=4σoffσonρburst sizefON
G1Set 16.1014.1844.417.280.70
Set 23.729.5743.9611.810.72
Set 32.027.9938.2518.890.80
Set 42.758.6037.7813.760.76
Mean3.6510.0941.1012.940.74
Std1.772.803.574.810.04
G2Set 12.124.5037.4817.680.68
Set 22.595.6839.5515.250.69
Set 32.475.1437.2715.080.68
Set 42.184.5535.1716.130.68
Mean2.344.9737.3716.040.68
Std0.230.561.791.190.01
Appendix 4—table 4
(Fusion method) Estimated parameters by combining the first k signal bins of the experimental distribution of the signal intensity.

Inference is done for each of the four data sets. The elongation time is fixed to τ0.785 min.

k=1σoffσonρburst sizefON
G1Set 11.113.7637.8334.100.77
Set 21.533.9441.3327.060.72
Set 30.953.2336.7938.560.77
Set 41.283.7636.0428.090.75
Mean1.223.6738.0031.950.75
Std0.250.312.345.390.02
G2Set 10.741.6935.0047.300.70
Set 20.822.1836.3044.370.73
Set 30.912.1934.5437.900.71
Set 41.082.6133.2730.760.71
Mean0.892.1734.7840.080.71
Std0.150.381.257.350.01
k=2σoffσonρburst sizefON
G1Set 10.782.9936.6546.990.79
Set 21.143.2740.0135.060.74
Set 30.712.5836.0051.000.79
Set 40.963.0935.0836.520.76
Mean0.902.9836.9442.390.77
Std0.190.302.157.820.02
G2Set 10.541.3034.3763.200.70
Set 20.671.8735.7453.700.74
Set 30.741.8833.9645.750.72
Set 40.942.3632.8434.890.72
Mean0.721.8534.2349.380.72
Std0.170.441.2012.010.01
k=3σoffσonρburst sizefON
G1Set 10.712.7936.3851.390.80
Set 21.073.1439.7737.010.75
Set 30.632.3535.7456.840.79
Set 40.802.7134.5843.090.77
Mean0.802.7536.6247.080.78
Std0.190.322.238.780.02
G2Set 10.521.2534.3065.860.71
Set 20.651.8435.7054.750.74
Set 30.711.8133.8547.670.72
Set 40.912.3132.7635.910.72
Mean0.701.8034.1551.050.72
Std0.160.431.2212.570.01
k=4σoffσonρburst sizefON
G1Set 10.692.7336.3052.850.80
Set 21.053.0939.6937.710.75
Set 30.632.3535.7456.780.79
Set 40.832.7934.6841.570.77
Mean0.802.7436.6047.230.78
Std0.190.312.179.040.02
G2Set 10.561.3434.4561.050.70
Set 20.661.8635.7354.110.74
Set 30.671.7233.7050.560.72
Set 40.922.3332.7935.480.72
Mean0.701.8134.1750.300.72
Std0.150.411.2410.800.01
Appendix 4—table 5
Inference of the kinetic parameters (σoff,σon,ρ) using nascent data curated with the fusion method.

Inferred values and the corresponding 95% confidence intervals of G1 and G2 cell-cycle-specific data calculated using the profile likelihood method.

nascentσoffσonρburst sizefONCI for σoff,σon,ρ
G1Set 10.692.7336.3052.850.80(0.46, 1.05)(2.12, 3.52)(35.06, 37.91)
Set 21.053.0939.6937.710.75(0.86, 1.29)(2.73, 3.47)(38.77, 40.92)
Set 30.632.3535.7456.780.79(0.49, 0.79)(1.99, 2.71)(35.06, 36.58)
Set 40.832.7934.6841.570.77(0.62, 1.16)(2.27, 3.41)(33.54, 36.01)
G2Set 10.561.3434.4561.050.70(0.40, 0.81)(1.04, 1.73)(33.35, 35.82)
Set 20.661.8635.7354.110.74(0.51, 0.85)(1.57, 2.21)(34.87, 36.77)
Set 30.671.7233.7050.560.72(0.50, 0.91)(1.40, 2.11)(32.78, 34.87)
Set 40.922.3332.7935.480.72(0.62, 1.45)(1.74, 3.09)(31.65, 34.68)
Appendix 4—table 6
Inferred transcriptional rate (normalized) per gene copy for the G2 cell cycle phase using nascent data curated with the fusion method under the assumption that the two gene states are perfectly synchronized.
nascent fusionσoffσonρburst sizefON
G2 syncSet 11.844.4233.8418.420.71
Set 22.185.7835.8516.440.73
Set 32.155.3633.6115.630.71
Set 43.537.4534.169.690.68
Mean2.425.7534.3615.050.71
Std.0.751.261.013.760.02

Appendix 5

Calculating the autocorrelation function (ACF) from stochastic simulations

Intensity traces of GAL10 transcription were generated by a stochastic simulation algorithm (SSA) following the delay telegraph model presented in the main manuscript. Briefly, for each yeast cell, the ON and OFF GAL10 promoter states are simulated such that they have an exponentially distributed lifetime with means 1/σoff and 1/σoff, respectively. When the promoter was ON, the transcription of single GAL10 mRNAs was initiated following a Poisson process with mean equal to the initiation rate ρ. The fluorescence trace I(t), as measured experimentally over time, was obtained by convolving the train of mRNA initiation events with the trapezoidal intensity pulse of the PP7 tagging system (shown in Figure 1c of the main text). It is important to note that this last step is necessary to accurately model the fluorescent intensities that are experimentally measured.

To account for cell cycle, we simulated an asynchronous population, where cells could switch from G1 to G2. In the latter, to account for two independent transcription sites, two independent intensity traces were generated and summed over time. The ratio of G1 to G2 cells was 1.15:1, 1.27:1, 1,24:1 and 0.80:1 for datasets 1, 2, 3, and 4, respectively, as determined from the smFISH experiments. It is important to note that: (i) live-cell experiments (as in Donovan et al., 2019) are performed in asynchronous populations, without any sorting based on cell cycle phase; (ii) the average G1 phase duration for mother and daughter cells (tmother=2400s and tdaughter=6480s, respectively – unpublished data) is on the order of the typical total acquisition time (e.g. 1800s). Hence, in our simulations, cells that were G1 at the start of the intensity trace (i.e. live-cell experiment) were allowed to switch to G2 before the end of the experiment. Trivially, cells that were G2 at the start remained so until the end of the experiment.

The autocorrelation function (ACF) was calculated as:

(14) G(Δt)=δI(t)δI(t+Δt)I(t)21,

where Δt is the time-lag (e.g. 30s in Donovan et al., 2019), denotes the temporal average over t, and δI(t)=I(t)I(t). Appendix 5—figure 1 shows the ACFs from Figure 6c of the main manuscript, in which a linear fit is performed to correct for non-stationary or slowly-varying effects, i.e. switching from G1 to G2 during the experiment, bleaching, heterogeneous galactose induction.

Appendix 5—figure 1
Autocorrelation functions of 104 simulated GAL10 intensity traces (solid blue lines).

The transcriptional parameters for G1 and G2 cells in the four sets of experimental data were obtained using the fusion method (see Figure 6d of the main manuscript). A linear fit (dashed black line) was subtracted to correct the ACFs for switching from G1 to G2 (solid cyan lines).

Appendix 6

Derivation of the exact solution for the master equation of the delay telegraph model and finite state projection

Exact solution

The delay telegraph model includes four reactions

(15) GρG+N,GσoffG,GσonG,Nτ,

where G and G stand for the active (ON) and inactive (OFF) gene state, respectively, and σoff and σon are the activation and inactivation rates, respectively. Once a nascent mRNA is produced, it will be removed from the system after a deterministic time τ. We aim to find the quantitative description of the kinetics of nascent mRNA (N). We proceed by deriving the delay chemical master equation (CME) of Equation (15). The same delay models have also been studied by other authors .

Let P(0,n,t) and P(1,n,t) be the probability of observing n nascent mRNAs while the gene state is inactive and active at time t, respectively. Consequently, we can state

(16) P(i,n,t+Δt)={Part A: Probability of one instant reaction occurring during Δt}+{Part B: Probability of one delayed reaction occurring during Δt}+{Part C: Probability of no reactions occurring during Δt}.

Note that by instant reactions, we mean all reactions except the delayed reaction that removes nascent mRNA. Since Part A includes all Markovian reactions, it follows from standard arguments Gillespie, 2007 that

(17) P(0,n,t+Δt;1,n,t)=σoffΔtP(1,n,t)+o(Δt),P(1,n,t+Δt;0,n,t)=σonΔtP(0,n,t)+o(Δt),P(1,n,t+Δt;1,n-1,t)=ρΔtP(1,n-1,t)+o(Δt).

The contribution of Part B requires a careful consideration of the history of the process: (i) (1,n,t-τ) leads to (1,n+1,t-τ+Δt) when the gene was ON; (ii) (1,n+1,t-τ+Δt) leads to (i,n+1,t) where the gene state i can be either OFF or ON; (iii) (i,n+1,t) leads to (i,n,t). Indeed, since the removal of nascent mRNA is a delayed reaction, (iii) is due to a production reaction in (i) a time interval τ earlier. The probability of (i) occurring is ρΔtP(1,n,t-τ). The probability of (iii) occurring is 1 since every production event is followed by a removal event a time τ later. The probability of (ii) occurring is:

(18) P(i,n+1,t|1,n+1,t-τ+Δt)

As for Equation (18), the two ‘+1’s means that the new nascent mRNA was produced during the time interval (t-τ,t-τ+Δt), and it did not participate in any other reactions, thereby not influencing the probability at time t+Δt. In short,

(19) P(i,n+1,t|1,n+1,t-τ+Δt)=P(i,n,t|1,n,t-τ+Δt).

Furthermore, all n nascent mRNAs must leave the system prior to time t as they were all born before time t-τ. This implies that

(20) P(i,n,t|1,n,t-τ+Δt)=P(i,n,t|1,0,t-τ+Δt).

Hence the probability that event B occurs is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of n

(21) P(i,n,t+Δt;i,n+1,t)=ρΔtnP(1,n,t-τ)P(i,n,t|1,0,t-τ+Δt)=ρΔtP(1,t-τ)P(i,n,t|1,0,t-τ+Δt),

where we used nP(1,n,t-τ)=P(1,t-τ).

Finally, the contribution of Part C is obtained from simple probability arguments

(22) P(0,n,t;0,n,t)=P(0,n,t)-P(1,n,t;0,n,t)-P(0,n-1,t;0,n,t)=P(0,n,t)-σonΔtP(0,n,t)-ρΔtP(1,t-τ)P(0,n-1,t|1,0,t-τ+Δt),P(1,n,t;1,n,t)=P(1,n,t)-P(0,n,t;1,n,t)-P(1,n+1,t;1,n,t)-P(1,n-1,t;1,n,t)=P(1,n,t)-σoffΔtP(1,n,t)-ρΔtP(1,n,t)-ρΔtP(1,t-τ)P(1,n-1,t|1,0,t-τ+Δt),

where we used the same argument as in Equation (21). Using Equations 16; 17; 21; 22 and taking the limit of small Δt, we finally obtain the set of delay CMEs

(23) {dP(0,n,t)dt=σonP(0,n,t)+σoffP(1,n,t)+(E1)P(0,n1,t|1,0,tτ)ρP(1,tτ),dP(1,n,t)dt=σonP(0,n,t)σoffP(1,n,t)+ρ(E11)P(1,n,t)+(E1)P(1,n1,t|1,0,tτ)ρP(1,tτ),

where EkP(i,n,t)=P(i,n+k,t) is the step operator. Summing over all possible n for the two equations in Equation (23), we obtain

{dP(0,t)dt=σonP(0,t)+σoffP(1,t),dP(1,t)dt=σonP(0,t)σoffP(1,t),

initiated with the inactive state, in which P(0,t) and P(1,t) are the probabilities of finding a cell at the inactivated and activated state at time t, and their solutions are

P(0,t)=σoffσoff+σon+σonσoff+σone-(σon+σoff)t,P(1,t)=σonσoff+σon(1-e-(σon+σoff)t).

Besides, it is noted that in Equation (23)

(24) P(1,t-τ)=σonσoff+σon(1-e-(σon+σoff)(t-τ))=:h(t-τ),

for tτ, and h(t-τ)=0 if t<τ. The n nascent mRNAs of the two conditional probabilities P(i,n,t|1,0,t-τ) for i=0,1 in Equation (23) are produced during time (t-τ,t), and the pertinent dynamics are that of a reaction system only composed of the three instant reactions in Equation (15). Specifically, we have

(25) {dP~(0,n,t)dt=σonP~(0,n,t)+σoffP~(1,n,t),dP~(1,n,t)dt=σonP~(0,n,t)σoffP~(1,n,t)+ρ(E11)P~(1,n,t),

where the initial values are P~(0,n,0)=0 for any n, P~(1,n,0)=1 for n=0 and equal to 0 otherwise, as well as P(i,n,t|1,0,t-τ)=P~(i,n,τ) for any i=0,1.

Let Gi(u,t)=n(u+1)nP(i,n,t) and G~i(u,t)=n(u+1)nP~(i,n,t). We particularly define the generating function in such a form to simplify the notation. Then, using Equations 23; 24 we obtain

(26) {tG0=σonG0+σoffG1ρh(tτ)u1[τ,)G~0τ,tG1=ρuG1+σonG0σoffG1ρh(tτ)u1[τ,)G~1τ,

and

(27) {tG~0=σonG~0+σoffG~1,tG~1=ρuG~1+σonG~0σoffG~1,

where the arguments u and t in the generating functions are suppressed for clarity, and the superscript τ is used to emphasize the generating function G~i up to a particular time τ. The initiation condition of Equation (27) is G~0=0 and G~1=1 when t=0.

Under the condition of steady-state as t, Equation (26) reduces to

(28) {0=σonG0+σoffG1ρh¯uG~0τ,0=ρuG1+σonG0σoffG1ρh¯uG~1τ,

with h¯=σon/(σon+σoff).

Therefore, by solving Equation (28) we obtain

(29) G(u,)=G0(u,)+G1(u,)=e12τ(Δ(u)+σoff+σonuρ)2Δ(u)(σoff+σon)×[(σoff+σon)(Δ(u)+eΔ(u)τ(Δ(u)+σoff+σon)σoffσon)u(eΔ(u)τ1)ρ(σoffσon)],

where we define Δ(u)=(σoff-uρ)2+2σon(σoff+uρ)+σon2. Finally, we obtain the probability distribution by the following relation

P(n,t)=1n!unG(u,)|u=-1,for all n.

We checked that the obtained distribution agrees exactly with the nascent mRNA distribution of Equation 7 in Xu et al., 2016.

Finite State Projection

The finite state projection (FSP) method Munsky and Khammash, 2006 is an efficient numerical method that can be implemented to solve the CME (23) to any desired degree of accuracy. Because Equation (23) is coupled with Equation (25), here we show how to use a two-step FSP method to obtain the probability distribution.

By assuming that there exists a large N such that P(i,n,t)=0 and P~(i,n,t)=0 for any i=0,1 and nN+1, we denote

P(t)=(P(0,0,t),P(0,1,t),,P(0,N,t),P(1,0,t),P(1,1,t),,P(1,N,t))T

as a N2-dimensional vector and P~(t)N2 similarly. We first use FSP method to solve Equation (25) with the following ordinary differential equation (ODE) up to time

(30) ddtP~(t)=AP~(t)

where A is defined as

A=(σonIN00σoffIN+B)2(N+1)×2(N+1),B=(ρ000ρρ0000ρρ)N+1×N+1

with the initial condition

P~(0)=(0,0,,0N+1,1,0,,0N+1)T.

Secondly, we solve the following inhomogeneous ODE

(31) ddtP(t)=AP(t)+h(tτ)(P~(τ)SP~(τ)),

where S is the right-shift operator, i.e. for any x=(x1,x2,,xN), Sx=(0,x1,x2,,xN1) and

h(t)={ρσonσoff+σon(1e(σon+σoff)t)t0,0t<0,

with the initial condition

P(0)=(1,0,,0N+1,0,0,,0N+1)T.

Hence, the solution of Equation (31) gives the approximate probability distribution P(t) for any t>0.

Note that if one uses the generating function Equation 29 to compute the probability distribution, the computation for the high-order derivatives can cause numerical instabilities due to lack of arithmetic precision – compare the exact solutions with precision 85 and 300 in the left and right panels of Appendix 6—figure 1. Because FSP avoids the computation of the higher-order derivatives, it can be as numerically stable as computing the exact solution with high precision (right panel of Appendix 6—figure 1). In Appendix 6—table 1, we show the computational efficiency of FSP versus numerical evaluation of the exact solution – FSP and the exact solution with high precision are hence comparable both in terms of accuracy and efficiency. In the main text, we use FSP.

Appendix 6—figure 1
Left panel: Numerical instabilities due to the calculation of higher-order derivatives in the exact solution appear when the arithmetic precision is not very high (85).

Right panel: These instabilities disappear when the precision increases to 300. The exact solution with such high precision agrees well with the FSP method using double-precision floating-point (Float64) type. The parameters are,σoff=1.71 σon=5.82,,ρ=53.74 and.τ=0.56

Appendix 6—table 1
Comparison of the performance of three methods to compute the probability distribution.

Parameters same as in Appendix 6—figure 1. The time was calculated using the Julia package BenchmarkTools.jl.

Exact solution: precision = 85Exact solution: precision = 300FSP method
Minimum time:6.422ms9.277ms8.317ms
Median time:6.868ms9.562ms9.092ms
Mean time:8.279ms11.482ms9.415ms
Maximum time:16.791ms16.919ms14.203ms
# of Simulations:604436531

Appendix 7

Classification of the cell cycle phase: bimodal Gaussian distribution method versus the Fried/Baisch model

The distribution of DNA content intensities (DAPI) was fit using a bimodal distribution; those cells whose intensity was around one peak were classified as G1 and those around the second peak were classified as G2 (main text Figure 3e). We also did the classification using an alternative method, based on the Freid/Baisch model, which was recently employed in Skinner et al., 2016. In Appendix 7—figure 1 we show the distributions of fluorescent signal intensity for cells in the G1 phase (top row) and for those in the G2 phase (bottom row) for the four data sets described in the main text. Note that the method (bimodal Gaussian or Freid/Baisch) used to classify the cells in G1/G2 does not alter the intensity distribution hence verifying the robustness of our classification method.

Appendix 7—figure 1
Blue curves (Method 1) show the fluorescent intensity distributions of the four experimental data sets after the classification of cells into G1 and G2 phases using the Fried/Baisch model.

Magenta curves (Method 2) show the same but using a bimodal Gaussian, as described in the main text.

Data availability

The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH (copy archived at swh:1:rev:b49af68653e9fdcab3fa48085f648fc86d8c659e). The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl (copy archived at swh:1:rev:be2fcc8f7a811a571a297d3e150395c0a73add09).

The following data sets were generated
    1. Lenstra TL
    (2022) Open Science Framework
    ID d5nvj. smFISH datasets for PP7-GAL10 in budding yeast.

References

Article and author information

Author details

  1. Xiaoming Fu

    1. Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, China
    2. School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
    3. Center for Advanced Systems Understanding, Helmholtz-Zentrum Dresden-Rossendorf, Görlitz, Germany
    Contribution
    Software, Formal analysis, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Heta P Patel
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4073-9822
  2. Heta P Patel

    The Netherlands Cancer Institute, Oncode Institute, Division of Gene Regulation, Amsterdam, Netherlands
    Contribution
    Investigation, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Xiaoming Fu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1618-951X
  3. Stefano Coppola

    The Netherlands Cancer Institute, Oncode Institute, Division of Gene Regulation, Amsterdam, Netherlands
    Contribution
    Software, Formal analysis
    Competing interests
    No competing interests declared
  4. Libin Xu

    Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, China
    Contribution
    Formal analysis
    Competing interests
    No competing interests declared
  5. Zhixing Cao

    Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, China
    Contribution
    Formal analysis, Supervision, Writing – original draft, Project administration
    For correspondence
    zcao@ecust.edu.cn
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2600-5806
  6. Tineke L Lenstra

    The Netherlands Cancer Institute, Oncode Institute, Division of Gene Regulation, Amsterdam, Netherlands
    Contribution
    Conceptualization, Supervision, Methodology, Writing – original draft, Project administration, Writing – review and editing, Funding acquisition
    For correspondence
    t.lenstra@nki.nl
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4440-9962
  7. Ramon Grima

    School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Conceptualization, Supervision, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    ramon.grima@ed.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1266-8169

Funding

National Natural Science Foundation of China (61988101)

  • Zhixing Cao
  • Xiaoming Fu
  • Libin Xu

National Natural Science Foundation of China (62073137)

  • Zhixing Cao
  • Xiaoming Fu
  • Libin Xu

H2020 European Research Council (755695 BURSTREG)

  • Tineke L Lenstra

Leverhulme Trust (RPG-2020-327)

  • Ramon Grima

Shanghai Action Plan for Technological Innovation Grant (22ZR1415300)

  • Zhixing Cao
  • Xiaoming Fu
  • Libin Xu

Shanghai Action Plan for Technological Innovation Grant (22511104000)

  • Zhixing Cao
  • Xiaoming Fu
  • Libin Xu

Shanghai Sailing Program (22YF1410700)

  • Xiaoming Fu

Oncode Institute

  • Tineke L Lenstra

Netherlands Organisation for Scientific Research (Gravitation program CancerGenomiCs.nl)

  • Tineke L Lenstra

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

ZC, XF, and LX acknowledge the support from Natural Science Foundation of China (NSFC No. 61988101, 62073137), Shanghai Action Plan for Technological Innovation Grant (No. 22ZR1415300, 22511104000) and Shanghai Center of Biomedicine Development. XF acknowledges the support from Shanghai Sailing Program (22YF1410700). TLL was supported by the Netherlands Organization for Scientific Research (NWO, gravitation program CancerGenomiCs.nl), Oncode Institute, which is partly financed by the Dutch Cancer Society, and the European Research Council (ERC Starting Grant 755695 BURSTREG). RG was supported by a Leverhulme Trust research award (RPG-2020–327).

Copyright

© 2022, Fu, Patel et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,179
    views
  • 199
    downloads
  • 37
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Xiaoming Fu
  2. Heta P Patel
  3. Stefano Coppola
  4. Libin Xu
  5. Zhixing Cao
  6. Tineke L Lenstra
  7. Ramon Grima
(2022)
Quantifying how post-transcriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions
eLife 11:e82493.
https://doi.org/10.7554/eLife.82493

Share this article

https://doi.org/10.7554/eLife.82493

Further reading

    1. Computational and Systems Biology
    David B Blumenthal, Marta Lucchetta ... Martin H Schaefer
    Research Article Updated

    Degree distributions in protein-protein interaction (PPI) networks are believed to follow a power law (PL). However, technical and study biases affect the experimental procedures for detecting PPIs. For instance, cancer-associated proteins have received disproportional attention. Moreover, bait proteins in large-scale experiments tend to have many false-positive interaction partners. Studying the degree distributions of thousands of PPI networks of controlled provenance, we address the question if PL distributions in observed PPI networks could be explained by these biases alone. Our findings are supported by mathematical models and extensive simulations, and indicate that study bias and technical bias suffice to produce the observed PL distribution. It is, hence, problematic to derive hypotheses about the topology of the true biological interactome from the PL distributions in observed PPI networks. Our study casts doubt on the use of the PL property of biological networks as a modeling assumption or quality criterion in network biology.

    1. Computational and Systems Biology
    2. Immunology and Inflammation
    Peng Li, Sree Pulugulla ... Warren J Leonard
    Short Report

    Transcription factor partners can cooperatively bind to DNA composite elements to augment gene transcription. Here, we report a novel protein-DNA binding screening pipeline, termed Spacing Preference Identification of Composite Elements (SPICE), that can systematically predict protein binding partners and DNA motif spacing preferences. Using SPICE, we successfully identified known composite elements, such as AP1-IRF composite elements (AICEs) and STAT5 tetramers, and also uncovered several novel binding partners, including JUN-IKZF1 composite elements. One such novel interaction was identified at CNS9, an upstream conserved noncoding region in the human IL10 gene, which harbors a non-canonical IKZF1 binding site. We confirmed the cooperative binding of JUN and IKZF1 and showed that the activity of an IL10-luciferase reporter construct in primary B and T cells depended on both this site and the AP1 binding site within this composite element. Overall, our findings reveal an unappreciated global association of IKZF1 and AP1 and establish SPICE as a valuable new pipeline for predicting novel transcription binding complexes.