Quantifying how posttranscriptional noise and gene copy number variation bias transcriptional parameter inference from mRNA distributions
Abstract
Transcriptional rates are often estimated by fitting the distribution of mature mRNA numbers measured using smFISH (single molecule fluorescence in situ hybridization) with the distribution predicted by the telegraph model of gene expression, which defines two promoter states of activity and inactivity. However, fluctuations in mature mRNA numbers are strongly affected by processes downstream of transcription. In addition, the telegraph model assumes one gene copy but in experiments, cells may have two gene copies as cells replicate their genome during the cell cycle. While it is often presumed that posttranscriptional noise and gene copy number variation affect transcriptional parameter estimation, the size of the error introduced remains unclear. To address this issue, here we measure both mature and nascent mRNA distributions of GAL10 in yeast cells using smFISH and classify each cell according to its cell cycle phase. We infer transcriptional parameters from mature and nascent mRNA distributions, with and without accounting for cell cycle phase and compare the results to livecell transcription measurements of the same gene. We find that: (i) correcting for cell cycle dynamics decreases the promoter switching rates and the initiation rate, and increases the fraction of time spent in the active state, as well as the burst size; (ii) additional correction for posttranscriptional noise leads to further increases in the burst size and to a large reduction in the errors in parameter estimation. Furthermore, we outline how to correctly adjust for measurement noise in smFISH due to uncertainty in transcription site localisation when introns cannot be labelled. Simulations with parameters estimated from nascent smFISH data, which is corrected for cell cycle phases and measurement noise, leads to autocorrelation functions that agree with those obtained from livecell imaging.
Editor's evaluation
This important paper tackles a core problem in systems biology – how to quantify kinetic parameters from incomplete and noisy experimental data. The study makes a convincing methodological contribution to the field, and its potential usefulness is demonstrated using experimental data in yeast.
https://doi.org/10.7554/eLife.82493.sa0Introduction
Transcription in single cells occurs in stochastic bursts (Suter et al., 2011; Larsson et al., 2019). Although the first observation of bursting occurred more than 40 years ago (McKnight and Miller, 1977), the precise mechanisms behind this phenomenon are still under active investigation (Nicolas et al., 2017; Tunnacliffe and Chubb, 2020). The direct measurement of the dynamic properties of bursting employs livecell imaging approaches, which allow visualization of bursts as they occur in living cells (Donovan et al., 2019). However, in practice, such livecell measurements are challenging because they are lowthroughput and require genomeediting (Brouwer et al., 2020; Lenstra and Larson, 2016). To circumvent this, one can exploit the fact that bursting creates heterogeneity in a population. In this case, it is relatively straightforward to obtain a steadystate distribution of the number of mRNAs per cell from smFISH or singlecell sequencing experiments. These distributions have been used to infer dynamics by comparison to theoretical models. The simplest mathematical model describing bursting is the telegraph (or twostate) model (Peccoud and Ycart, 1995; Raj et al., 2006). In this model, promoters switch between an active and inactive state, where initiation occurs during the active promoter state. The model makes the further simplifying assumption that the gene copy number is one and that all the reactions are effectively firstorder. The mRNA in this model can be interpreted as cellular (mature) mRNA since its removal via various decay pathways in the cytoplasm is known to follow singleexponential (firstorder) decay kinetics in eukaryotic cells (Wang et al., 2002; Herzog et al., 2017). The solution of the telegraph model for the steadystate distribution of mRNA numbers has been fitted to experimental mature mRNA number distributions to estimate the transcriptional parameters (Raj et al., 2006; Kim and Marioni, 2013; Suter et al., 2011; Larsson et al., 2019).
However, the reliability of the estimates of transcriptional parameters from mRNA distributions is questionable because the noise in mature mRNA (and consequently the shape of the mRNA distribution) is affected by a wide variety of factors. Recent extensions of the telegraph model have carefully investigated how mRNA fluctuations are influenced by the number of promoter states (Zhou and Zhang, 2012; Ham et al., 2020b), polymerase dynamics (Cao et al., 2020), celltocell variability in the rate parameter values (Dattani and Barahona, 2017; Ham et al., 2020a), replication and binomial partitioning due to cell division (Cao and Grima, 2020), nuclear export (Singh and Bokes, 2012) and cell cycle duration variability (PerezCarrasco et al., 2020). One way to avoid noise from various posttranscriptional sources is to measure distributions of nascent mRNA rather than mature mRNA, and then fit these to the distributions predicted by an appropriate mathematical model. A nascent mRNA (Zenklusen et al., 2008; Larson et al., 2009) is an mRNA that is being actively transcribed, that is it is still tethered to an RNA polymerase II (Pol II) moving along a gene during transcriptional elongation. Fluctuations in nascent mRNA numbers thus directly reflect the process of transcription. Because nascent mRNA removal is not firstorder, an extension of the telegraph model has been developed (the delay telegraph model) (Xu et al., 2016).
However, nascent mRNA data still suffers from other sources of noise due to celltocell variability. For example in an asynchronous population of dividing cells, cells can have either one or two gene copies. In the absence of a molecular mechanism that compensates for the increase in gene copy number upon replication, cells with two gene copies which cannot be spatially resolved will have a different distribution of nascent mRNA numbers (one with higher mean) than cells with one gene copy. The importance of the cell cycle is illustrated by the finding (Zopf et al., 2013) that noisy transcription from the synthetic TetO promoter in S. cerevisiae is dominated by its dependence on the cell cycle. The estimation of transcriptional parameters from nascent mRNA data for pre and postreplication phases of the cell cycle has, to the best of our knowledge, only been reported in Skinner et al., 2016.
Interestingly, all the studies that estimate transcriptional parameters from nascent mRNA data (Skinner et al., 2016; Xu et al., 2015; Zoller et al., 2018; Senecal et al., 2014; Fritzsch et al., 2018) do not compare them with transcriptional parameters estimated from cellular (mature) mRNA data measured in the same experiment. Similarly, a quantitative comparison between inference from cellcyclespecific data and data which contains information from all cell cycle phases is lacking. Likely, this is because it is considered evident that quantifying fluctuations earlier in the gene expression process and adjusted for the cellcycle will improve estimates. However, nascent mRNA distributions are technically more challenging to acquire than mature mRNA distributions; and inference from nascent mRNA distributions is substantially more complex (Xu et al., 2016). Thus, it still needs to be shown that acquiring nascent mRNA data is a necessity from a parameter inference point of view, i.e. that it leads to significantly different and more robust estimates than using mature mRNA data. We also note that current studies report parameter inference from organisms where it is possible to label introns to identify mRNA located at the transcription site. This is not possible in many yeast genes and other microorganisms, and in these cases it is unclear how to correct parameter estimates for uncertainty in the transcription site location.
In this paper, we sought to understand the precise impact of posttranscriptional noise and celltocell variability on the accuracy of transcriptional parameters inferred from mature mRNA data. The fitting algorithms (for mature and nascent mRNA data) were first tested on simulated data, where limitations of the algorithms were uncovered in accurately estimating the transcriptional parameters in certain regions of parameter space. The algorithms were then applied to four independent experimental data sets, each measuring GAL10 mature and nascent mRNA data from smFISH in galactoseinduced budding yeast, conditional on the stage of the cell cycle (G1 or G2) for thousands of cells. Comparison of the transcriptional parameter estimates allowed us to separate the influence of ignoring cell cycle variability from that of posttranscriptional noise (mature vs nascent mRNA data). We found that only fitting of nascent cellcycle data, corrected for measurement noise (due to uncertainty in the transcription site location), provided good agreement with measurements from livecell data. Cellcycle specific analysis also revealed that upon transition from G1 to G2, yeast cells show dosage compensation by reducing burst frequency, similar to mammalian cells (PadovanMerhar et al., 2015). Our systematic comparison highlights the challenges of obtaining kinetic information from static data, and provides insight into potential biases when inferring transcriptional parameters from smFISH distributions.
Results
Inference from mature mRNA data vs inference from nascent mRNA data: testing inference accuracy using synthetic data
To understand the accuracy of the inference algorithms from nascent and mature mRNA data, in various regions of parameter space, (i) we generated synthetic data using stochastic simulations with certain known values of the parameters; (ii) applied the inference algorithms to estimate the parameters from the synthetic data; (iii) compared the true and inferred kinetic parameter values.
The generation of synthetic mature mRNA data (mature mRNA measurements in each of 10^{4} cells) using stochastic simulations of the telegraph model (Figure 1a) is described in Methods Sections Mathematical model and Generation of synthetic mature mRNA data. The inference algorithm is described in detail in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. It is based on a maximization of the likelihood of observing the single cell mature mRNA numbers measured in a population of cells. The likelihood of observing a certain number of mature mRNA numbers from a given cell is given by evaluating the telegraph model’s steadystate mature mRNA count probability distribution.
For nascent RNA data, we used stochastic simulations of the delay telegraph model (Figure 1b) to generate the position of bound Pol II molecules from which we constructed the synthetic smFISH signal in each of 10^{4} cells (Methods Section Generation of synthetic nascent mRNA data). An inference algorithm estimates the parameters, based on a maximization of the likelihood of observing the single cell total fluorescence intensity measured in a population of cells (Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data). Note that the likelihood of observing a certain fluorescence signal intensity from a cell is given by extension of the delay telegraph model (but not directly by the delay telegraph model itself) to account for the smFISH probe positions.
This extension takes into account that the experimental fluorescence data used in this manuscript was acquired from smFISH of PP7GAL10 in budding yeast, where probes were hybridized to the PP7 sequences. Because the PP7 sequences are positioned at the 5’ of the GAL10 gene, the fluorescence intensity of a single mRNA on the DNA locus resembles a trapezoidal pulse (see Figure 1 for an illustration). As the Pol II molecule travels through the 14 repeats of the PP7 loops, the fluorescence intensity increases as the fluorescent probes binds to the nascent mRNA (this is the linear part of the trapezoidal pulse). However, once all 14 loops on the nascent mRNA are bound by the fluorescent probes, the intensity of a single mRNA reaches maximal intensity and the plot plateaus as the RNA elongates through the GAL10 gene body before termination and release. The total fluorescent signal density function is hence given by
where $p(sk)$ is the density function of the signal $s$ given there are $k$ bound Pol II molecules and $P(k;\theta )$ is the steadystate solution of the delay telegraph model giving the probability of observing $k$ bound Pol II molecules for the parameter set $\theta $. In Methods Section Mathematical model, we show how $p(sk)$ can be approximately calculated for the trapezoidal pulse. Hence Equation (1) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site. Note that both of these inference algorithms were used to infer the promoter switching and initiation rate parameters. The degradation rate and the elongation time were not estimated but assumed to be known. The inference and synthetic data generation procedures are summarised and illustrated in Figure 1d.
The accuracy of inference was first calculated as the mean of the relative error in the estimated parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$, and $\rho $ (for its definition see Methods, Equation (6)); note that this error measures deviations from the known ground truth values. Figure 2a shows, by means of a 3D scatter plot, the ratio of the mean relative error from nascent mRNA data (using delay telegraph model) and the mean relative error from mature mRNA data (using the telegraph model) for 789 independent parameter sets sampled on a grid (for each of these sets, we simulated 10^{4} cells). The overall bluish hue of the plot suggested that the mean relative error from nascent mRNA data was typically less than the error from mature mRNA data. This was confirmed in Figure 2b where the same data was plotted but now as a function of the fraction of ON time (defined as ${f}_{\text{ON}}={\sigma}_{\text{on}}/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$). Out of 789 parameter sets, for 483 of them ($\approx 61\%$) the inference accuracy was higher when using nascent mRNA data.
Thus far, we have implicitly assumed that fluctuations in both nascent and mature mRNA are due to transcriptional bursting. However, it is clear that mature mRNA data exhibit a higher degree of noise due to posttranscriptional processing. For example, it has been shown that transcriptional noise is typically amplified during mRNA nuclear export (Hansen et al., 2018). In addition, celltocell variation in the number of nuclear pore complexes has recently been identified as the source of heterogeneity in nuclear export rates within isogenic yeast populations (Durrieu et al., 2022). To take into account these additional noise sources, which we call external noise, we added noise to the initiation rate $\rho $ in the telegraph model since this rate implicitly models all processes between the synthesis of the transcript and the appearance of mature mRNA in the cytoplasm. Specifically, for each of the 789 parameter sets previously used, we changed $\rho $ to ${\rho}^{\prime}$ where the latter is a lognormal distributed random variable such that its mean is $\rho $ and its standard deviation is equal to 0.05 of the mean (5% external noise). Note that this implies that at the time of measurement, each cell in the population had a different value of the initiation rate. Simulations with this perturbed set of parameters led to a synthetic mature mRNA data set from which we reinferred parameters using the telegraph model. In Figure 2c we show the ratio of mean relative error from nascent mRNA data and the mean relative error from perturbed mature mRNA data as a function of the fraction of ON time, ${f}_{\text{ON}}$. The percentage of parameters where nascent mRNA is more accurate is slightly increased compared to the data without noise (64% versus 61% of the parameters; compare Figure 2c and Figure 2b). However, the addition of even more noise (10% external noise added to the initiation rate) increases the inference accuracy for 91% of the parameter sets when the nascent mRNA data is used (Appendix 1 and Appendix 1—figure 1).
To obtain more insight into the accuracy of the individual parameters, we next plotted the median relative error of transcriptional parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}},\rho $, burst size and the inferred fraction of ON time, as a function of the true fraction of ON time (Figure 2d). We compared the results using synthetic nascent mRNA, synthetic mature mRNA data and synthetic mature mRNA with 5% external noise. The median of the relative error for each transcriptional parameter (as given by the second equation of Equation 8) was obtained for the subset of the 789 parameter sets for which the true fraction of ON time ${f}_{\text{ON}}$ falls into the interval $[x0.05,x+0.05]$ where $x=0.1,0.2,\mathrm{\dots},0.9$. From the plots, the following can be deduced: (i) the errors in ${\sigma}_{\mathrm{on}}$ (the burst frequency), ${\sigma}_{\mathrm{off}}$ and the burst size $\rho /{\sigma}_{\mathrm{off}}$ tend to increase with ${f}_{\mathrm{ON}}$ while the rest of the parameters ($\rho $ and the estimated value of ${f}_{\mathrm{ON}}$) decrease; (ii) for small ${f}_{\mathrm{ON}}$, the best estimated parameters are the burst frequency and size while for large ${f}_{\mathrm{ON}}$, it was $\rho $ and the estimated value of ${f}_{\mathrm{ON}}$. The worst estimated parameter was ${\sigma}_{\mathrm{off}}$, independent of the value of ${f}_{\mathrm{ON}}$; (iii) the addition of external noise to mature mRNA data had a small impact on inference for small ${f}_{\mathrm{ON}}$; in contrast, for large ${f}_{\mathrm{ON}}$ the noise appreciably increased the relative error in ${\sigma}_{\mathrm{off}}$ and to a lesser extent the error in the other parameters too.
Additionally, in Appendices 1 and 2 we show that (i) independent of the accuracy of parameter estimation, the best fit distributions accurately matched the ground truth distributions (Appendix 1 and Appendix 1—figure 2); (ii) the parameters ordered by relative error were in agreement with the parameters ordered by sample variability (Appendix 1 and Appendix 1—table 1) and by profile likelihood error (Kreutz et al., 2013) (Appendix 1, Appendix 1—tables 2 and 3). Since from experimental data, only the sample variability and the profile likelihood error are available, it follows that the results of our synthetic data study in Figure 2 based on relative error from the ground truth have wide practical applicability; (iii) stochastic perturbation of the mature or nascent mRNA data (due to errors in the measurement of the number of spots and the fluorescent intensity) had little effect on the inference quality, unless the gene spent a large proportion of time in the OFF state (Appendix 1—tables 4 and 5); (iv) if one utilized the conventional telegraph model to fit the nascent data generated by the delay telegraph model, it was possible to obtain a distribution fitting as good as the delay telegraph model but with lowfidelity parameter estimation (Appendix 2, Appendix 2—figure 1 and Appendix 2—table 1). Analytically, the telegraph model is only an accurate approximation of the delay telegraph model when the promoter switching timescales are much longer than the time spent by Pol II on a gene or the off switching rates are very small such that gene expression is nearly constitutive.
In summary, by means of synthetic experiments, we have clarified how the accuracy of the parameter inference strongly depends on the type of data (nascent or mature mRNA) and the fraction of time spent in the ON state (which determines the mode of gene expression).
Applications to experimental yeast mRNA data
Now that we have introduced the inference algorithms and tested them thoroughly using synthetic data, we applied the algorithms to experimental data (see Method Section Experimental data acquisition and processing for details of the data acquisition). Note that in what follows, delay telegraph model refers to the extended delay telegraph model that accounts for the smFISH probe positions that was used to predict the smFISH fluorescent signal of the transcription site.
Inference from mature mRNA data: experimental artifacts
We have four independent datasets from which we determined mRNA count and nascent RNA distributions. Figure 3a shows an example cell with mature single RNAs in the cytoplasm, and a bright nuclear spot representing the site of nascent transcription. Spots and cell outlines were identified using automated pipelines. Importantly, to obtain an accurate estimation of transcriptional parameters, the experimental input distributions of mRNA count and nascent RNAs require high accuracy. We therefore first determined how technical artifacts in the analysis affects the inference estimates.
First, if the number of mRNA transcripts per cell is high, accurate determination of the number of transcripts may be challenging, as transcripts may overlap. To determine if this occurred in our datasets, we analyzed the distributions of intensities of the cytoplasmic spots, which revealed unimodal distributions where ∼90% of the detected spots fell in the range 0.5× median – 1.5× median (Figure 4a). We therefore concluded that overlapping spots are not a large confounder in our data. In fact, in our experiments, the number of detected mature mRNA transcripts per cell was lower than expected, based on the number of nascent transcripts (compare Figure 3 with Figure 4). This discrepancy between nascent and mature transcripts likely arises because the addition of the PP7 loops to the GAL10 RNA destabilizes the RNA, resulting in faster mRNA turnover compared to most endogenous RNAs (Miller et al., 2011; Wang et al., 2002; Holstege et al., 1998; Geisberg et al., 2014). Previously, both shorter and longer mRNA halflives from the addition of stem loops have been observed, which may be caused because changes in the 5’ UTR length or sequence affect its recognition by the mRNA degradation machinery (Heinrich et al., 2017; Tutucci et al., 2018; Garcia and Parker, 2015). In our case, we note that such high turnover should aid transcriptional parameter estimates, as it closely reflects transcriptional activity.
A second possible source of error is cell segmentation. To test how cell segmentation errors contribute to the mature mRNA distribution and the transcriptional bursting estimates, we compared two independent segmentation tools, where segmentation 1 often resulted in missed spots (Figure 3b), resulting in an underestimation of the mean mRNA count and of the variance (compare Figure 3b and c). We inferred the transcriptional parameters using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. In the absence of an experimental measurement of the degradation rate, we could only estimate the three transcriptional parameters normalised by $d$. The best fits of dataset 1 are shown in (Figure 3b and c) and the transcriptional parameters (for all four datasets) are summarized in (Figure 3e). Note that the estimated parameters for all four datasets, using both segmentations, are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Notably the segmentation algorithms led to similar estimates for the burst frequency but considerably different estimates for the rest of the parameters. In particular segmentation 1 suggested that burst expression is infrequent (≈20% of the time) whereas segmentation 2 was consistent with burst expression occurring half of the time. Given that accurate cell segmentation remains challenging, this analysis illustrates that parameter estimation from mature mRNA counts may be affected by technical errors. For the remainder of the mature mRNA analysis, we have used only segmentation 2 data.
Lastly, it may be challenging to distinguish the nascent transcription site from a mature RNA, especially if few nascent RNAs are being produced. Either one can decide to include all cellular spots in the total mRNA count, including the transcription site, with the result that the number of mature transcripts is overestimated with one RNA for cells which show a transcription site. Or conversely, one can decide to exclude the transcription site by subtracting one spot from each cell, with the result that the number of mature mRNAs may be underestimated by one RNA for cells that are transcriptionally silent. To understand how this choice affects the accuracy of parameter inference, we compared both options in (Figure 3c, d and e), where seg2 included all spots, and seg2TS excluded transcription sites (by subtracting 1 from each cell). The estimated parameters for all four datasets are shown in Appendix 3—table 1 and the associated best fit distributions in Appendix 3—figure 1a. Although the mean was lower when transcription sites were excluded, all the parameters except the burst frequency ${\sigma}_{\mathrm{on}}$ were within the error, indicating that the choice of whether or not to include the transcription site in the mature mRNA count had a small influence on parameter estimation. For the remainder of the analysis, we included all spots, and counted the transcription site as one RNA.
Inference from mature mRNA data: merged versus cellcycle specific
The above analysis was performed using the merged data from all cells, irrespective of their position in the cell cycle. The inferred parameters of all four datasets are shown in Figure 3g (grey). To understand the effect of the cell cycle on these parameter estimates, we compared this inference with cellcyclespecific data. We used the integrated nuclear DAPI intensity as a measure for DNA content to classify cells into G1 or G2 cells (Figure 3f (left)) to obtain separate mature mRNA distributions for G1 and G2 cells.
To infer the transcriptional parameters from mature mRNA data of cells in G1, the inference protocol remained the same. However for cells in the G2 stage, this protocol needed to be altered since G2 cells have two gene copies, whereas the solution of the telegraph model assumes one gene copy. Assuming the transcriptional activities of the two gene copies are independent, the distribution of the total molecule number is the convolution of the molecule number (obtained from the telegraph model) with itself for mature mRNA data. This convolved distribution was used in steps (ii) and (iii) of the inference algorithm in Methods Section Steps of the algorithm to estimate parameters from mature mRNA data. A difference between our method of estimating parameters in G2 from that in the literature (Skinner et al., 2016) is that we do not assume that the burst frequency is the only parameter that changes upon replication, and we estimated all transcription parameters simultaneously.
Note that the independence of gene copy transcription has been verified for genes in some eukaryotic cells (Skinner et al., 2016) where the two copies can be easily resolved. For yeast data, as we are analyzing in this paper, it is generally not possible to resolve the two copies of the allele in G2 because they are within the diffraction limit. However, in the absence of experimental evidence, the independence assumption is the simplest reasonable assumption that we could make (see later for a relaxation of this assumption).
For both G1 and G2 cells, we performed inference for cellcycle specific mature mRNA data, the results of which are shown in Figure 3f (centre and right) and Figure 3g – see Appendix 3—table 2 for the confidence intervals of the estimates calculated using profile likelihood. As expected, the mean number of mRNAs in G2 cells was larger than that in G1 cells. For both merged and cellcycle specific data, the parameters ordered by increasing variability of the estimates from independent samples (the standard deviation divided by the mean) were: $\rho $, ${f}_{\mathrm{ON}}$, ${\sigma}_{\mathrm{ON}}$, burst size and ${\sigma}_{\mathrm{OFF}}$, and the same order was predicted by the relative error (from ground truth values) from our synthetic experiments (compare with ${f}_{\mathrm{ON}}=0.50$ and ${f}_{\mathrm{ON}}=0.80$ in the middle and right panels of Figure 2d) and by sample variability (Appendix 1). In Appendix 3 and Appendix 3—table 3 we show that the relaxation of the assumption of independence between the allele copies in G2 (by instead assuming perfect state correlation of the two alleles) had practically no influence on the inference of the two best estimated parameters ($\rho $, ${f}_{\mathrm{ON}}$).
A comparison of the two types of data predicted different behaviour (Figure 3g bottom): merged data indicated behaviour consistent with the gene being ON half of the time and small burst sizes, while cellcyclespecific data implied the gene is ON ≈80% of the time with large burst sizes. We note that the burst sizes have considerable sample variability, exemplifying burst size estimates of transcriptional parameters from mature mRNA distributions have to be treated with caution. Nevertheless, in line with this high fraction ON and large burst size, which start to approach constitutive expression, the variation introduced by the transcription kinetics is relatively modest with Fano factors not far from one: $2.43\pm 0.21$ for merged data and $1.75\pm 0.45$ for cellcycle data (the slightly higher value for merged data likely was due to heterogeneity stemming from varying gene copy numbers per cell).
Comparing the mean rates between the G1 and G2 phases, we found that ${\sigma}_{\mathrm{off}}$, ${\sigma}_{\mathrm{on}}$, $\rho $ decreased while ${f}_{\mathrm{ON}}$ and the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were wellseparated between the two phases were ${\sigma}_{\mathrm{on}}$ and $\rho $. These two parameters decreased by 65% and 21%, respectively, which suggests that upon replication, there are mechanisms at play which reduce the expression of each copy to partially compensate for the doubling of the gene copy number (gene dosage compensation) (Skinner et al., 2016).
In conclusion, what is particularly surprising in our analysis is the differences in the inference results using merged and cellcycle specific data: the former suggests the gene spends only half of its time in the ON state while the latter implies the gene is mostly in its ON state.
Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference
Cellcyclespecific versus merged data
To determine the number of nascent transcripts at the transcription site, we selected the brightest spot from each nucleus and normalized its intensity to the median intensity of the cytoplasmic spots. As the distribution of intensities of the cytoplasmic mRNAs followed a narrow unimodal distribution, its median likely represents the intensity of a single RNA (orange distribution in the central panel of Figure 4a). The inference of transcriptional parameters using the merged data was done using the algorithm described in Methods Section Steps of the algorithm to estimate parameters from nascent mRNA data.
Similar to above, to account for two gene copies in G2 cells, we assumed that the transcriptional activities of the two gene copies are independent. The distribution of the total fluorescent signal from both gene copies was the convolution of the signal distribution (obtained from the extended delay telegraph model, i.e. Equation (1)) with itself. This convolved distribution was then used in steps (ii) and (iii) of the inference algorithm.
The inference of transcriptional parameters from nascent RNA data was done using a fixed elongation time, which was measured previously at a related galactoseresponsive gene (GAL3) at $65\mathrm{bp}/\mathrm{s}$ (Donovan et al., 2019). Since the total transcript length is $3062\mathrm{bp}$ (see Figure 1c), the elongation time ($\tau $ in our model) is $\approx 47.1s\approx 0.785\text{min}$. The fixed elongation rate enabled us to infer the absolute values of the three transcriptional parameters ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$ and $\rho $.
Best fits of the extended delay telegraph model to the distribution of signal intensity of nascent mRNAs at the transcription site are shown in Figure 4a and b for dataset 1; for the other datasets see Appendix 4—figure 1. The corresponding estimates of the transcriptional parameters are shown in Appendix 4—table 1 and also illustrated by bar charts in Figure 4c. The confidence intervals of the transcriptional parameters (computed using the profile likelihood method) are shown in Appendix 4—table 2.
Comparing this estimation with that from mature mRNA, we observed that in both cases ${f}_{\mathrm{ON}}\approx 0.5$ for merged data and in the range $0.70.8$ for cellcyclespecific data. Also in both cases, the Fano factors of merged data were larger than those of cellcyclespecific data. Hence, we are confident that not accounting for the cell cycle phase leads to an overestimation of the time spent in the OFF state and of the Fano factor. In addition, comparing the burst sizes in Figure 3g and Appendix 4—table 1, we found that not taking into account posttranscriptional noise (by using mature mRNA data) led to an lower estimation of the burst size (2.6fold, 2.6fold, and 1.1fold lower for inference from merged, G1 and G2 data, respectively). We note that it would be useful to directly compare the absolute estimates of the other transcriptional parameters from mature and nascent mRNA data. However, this was not possible because the telegraph model only estimates the switching rates and the initiation rate scaled by the degradation rate, and the latter is unknown. On the other hand, the estimates from nascent data were rates multiplied by the average elongation time, which is known and hence the absolute rates can be estimated from nascent mRNA data only. The only quantities that could be directly compared were the burst size and the fraction of ON time, since these are both nondimensional.
Comparing the variability of the parameter estimates, we found that $\rho $ and ${f}_{\mathrm{ON}}$ were the parameters with the smallest variability across samples for the nascent data, as for inference from mature data. However, the inferred parameter variability across samples was on average about 2.5fold lower for nascent data compared to mature mRNA data (this was obtained by computing the standard deviation divided by the mean for each parameter and then averaging over all parameters and over merged, G1 and G2 data). Likely this is because nascent data does not suffer from posttranscriptional noise. Indeed, synthetic experiments suggested that the errors in parameter inference using nascent data are often less than those in mature data when ${f}_{\mathrm{ON}}\approx 0.80$ (Figure 2d). In summary, we have more confidence in the parameter estimates from nascent data, in particular those from cellcycle separated data.
To further investigate the hypothesis that estimates from cellcyclespecific data are more accurate than merged data, we compared the estimates from merged and cellcyclespecific data to previous livecell transcription measurements of the same gene (Donovan et al., 2019). Because livecell traces and simulated traces with the estimated transcriptional parameters are difficult to compare directly, we instead compared their normalized autocorrelation functions (ACFs). Specifically, the livecell traces displayed celltocell variation in overall fluorescent intensities arising from differences in the PP7 coat protein expression level, precluding a direct comparison of the livecell intensities with the smFISH distributions. The normalized ACFs are normalized per trace and thus can be used to directly compare the kinetics. For this, we fed the parameter estimates to the SSA to generate synthetic livecell data and then calculated the corresponding ACF (Appendix 5). We found that the estimates from cellcyclespecific data produced ACFs that match the livecell data closer than that from the merged data (Figure 4d). This was also clear from the sum of squared residuals which for each dataset was smaller for the ACF computed using the cellcyclespecific estimates rather than those from merged data (Figure 4e).
Using nascent data, we also reinvestigated the hypothesis that the gene exhibits dosage compensation. Comparing the mean rates between the G1 and G2 phases, we found that ${\sigma}_{\mathrm{off}}$, ${\sigma}_{\mathrm{on}}$, $\rho $, ${f}_{\mathrm{ON}}$ decreased while the burst size increased upon replication. However, taking into account the variability in estimates across the four datasets, the only two parameters which were cleanly separated between the two phases were ${\sigma}_{\mathrm{on}}$ and ${f}_{\mathrm{ON}}$. These two decreased by 41% and 5%, respectively. These results had some similarity to those deduced from cellcycle separated mature mRNA data (the decrease of ${\sigma}_{\mathrm{on}}$) but they also displayed differences. Namely, from mature mRNA data it was predicted that $\rho $ decreased upon replication while from nascent data we predicted that $\rho $ did not change and it was rather ${f}_{\mathrm{ON}}$ that decreased by a small degree. The decrease of the burst frequency ${\sigma}_{\text{on}}$ after replication has also been reported for some genes in mammalian cells (Skinner et al., 2016; PadovanMerhar et al., 2015), indicating that this could be a general mechanism for gene dosage compensation. Our results are consistent with a populationbased ChIPseq study (Voichek et al., 2016) that showed DNA dosage compensation after replication in budding yeast. We note that our singlecell analysis only revealed partial dosage compensation, where the mean signal intensity of nascent mRNAs in G2 is not the same as in G1, but 1.7fold higher in G2 than in G1 (Figure 4c).
Correcting for experimental artefacts
Although inference on cell cycle separated data outperformed inference on merged data, we noticed that the corresponding best fit distributions did not match well to the experimental signal distributions in the lower bins (Figure 4b and Appendix 4—figure 1). In all cases, the experimental distributions showed high intensities in bins 1, 2, and 3, which was likely an artifact of the experimental data acquisition system. Since we defined the transcription site as the brightest spot, this implies that in the absence of a transcription site, a mature transcript can be misclassified as a nascent transcript. We therefore investigated two methods to correct for this, the ‘rejection’ method and the ‘fusion’ method.
The rejection method removed all data associated with the first $k$ bins of the experimentally obtained histogram of fluorescent intensities (Figure 5a shows the fits for dataset 1; for the other datasets see Appendix 4—figure 2). We found that the parameter estimates varied strongly when the number of bins from which data was rejected ($k$) was changed (Figure 5b; see also Appendix 4—table 3). Although the distributions fit well to the experimental histograms (Appendix 4—figure 1), comparison with the livecell normalized ACF indicated that the estimates actually became worse than noncurated estimates, with a higher sum of squared residuals (Figure 5c and d). The rejection method therefore does not produce reliable estimates.
Next, we considered another data curation method which we call the fusion method. This works by setting to zero all fluorescent intensities in a cell population which were below a certain threshold. In other words, we fused or combined the first $k$ bins of the experimentally obtained histogram of fluorescent intensities, thereby taking into account that the true intensity of bin 0 was artificially distributed over some of the first bins.
Figure 6 and Appendix 4—table 4 show that the fusion method led to estimates that varied little with $k$ which enhanced our degree of confidence in them (note that $k=1$ is the same as the uncurated data). The peak at the zero bin for both G1 and G2 was better captured using the fusion method than using noncurated data (compare Figure 4b and Appendix 4—figure 1, with Figure 6b). Comparison to the autocorrelation function of the livecell data shows that correction with the fusion method also led to improved transcriptional estimates, as indicated by a reduction in the sum of the squared residuals for all four data sets (Figure 6c).
Overall, we conclude that for inferring parameters from the smFISH data, the optimal method is to use nascent cellcyclespecific data, corrected by the fusion method. The optimally inferred parameters for the four data sets in our study are those given in Appendix 4—figure 2d. The profile likelihood estimates of the 95% confidence intervals of these parameters are shown in Appendix 4—table 5. Note that in line with our synthetic data study in Figure 2, the parameters suffering from the least sample variability were ${f}_{\mathrm{ON}}$ and $\rho $. The rest of the parameters (${\sigma}_{\mathrm{off}},{\sigma}_{\mathrm{on}}$ and burst size) suffered more sample variability because the fraction of ON time was high; however since their standard deviation divided by the mean (computed over the four datasets) was not high (in the range of 1020%), they still can be regarded as useful estimates. Note also that the previous prediction that gene dosage compensation involves regulation of the burst frequency did not change upon correction of the nascent data using the fusion method. All these results were deduced assuming that the two copies in G2 are independent from each other. Inferring rates under the opposite assumption of perfectly synchronized copies (Appendix 4—table 6) gave very similar estimates for $\rho $ and ${f}_{\mathrm{ON}}$ (to be expected since according to the synthetic data study, these two are the most robustly estimated parameters for genes spending most of their time in the active state) but different estimates for the rest of the parameters. While such perfect synchronization of alleles is unlikely, some degree of synchronization is plausible and further improvement of the transcriptional parameters in the G2 phase will require its precise experimental quantification.
Discussion
In this study, we compared the reliability of transcriptional parameter inference from mature and nascent mRNA distributions, with and without taking into account the cell cycle phase. Although these distributions come from the same experiment, we found that the different fits produced very different parameter estimates, ranging from small bursts to very large bursts. Comparison to livecell data revealed that the optimal inference method is to use nascent mRNA data that is separated by cell cycle.
Our findings illustrate the risk of inferring transcriptional parameters from fitting of mRNA distributions. First of all, as we have shown, these fits are sensitive to the segmentation method which can lead to large errors in the estimates. Secondly, the most common method of parameter inference in the literature is fitting of mature mRNA distributions that are not separated by cell cycle (Larsson et al., 2019; Raj et al., 2006; Zenklusen et al., 2008). Obtaining such distributions is straightforward using methods such as smFISH, where one can directly count the number of mRNAs per cell. Additionally, with the advance of singlecell mRNA sequencing technologies, it is possible to obtain mRNA distributions for many genes simultaneously and it is tempting to use these to estimate bursting behaviour across the genome (Kim and Marioni, 2013; Larsson et al., 2019). However, our comparisons on the same dataset show that the values obtained from mature mRNA fits (using merged data) can be significantly different from the optimal values (using nascent cellcycle separated data corrected using the fusion method), with underestimation of the burst sizes of almost 10fold and underestimation of the active fraction of more than 1.5fold. These results indicate that parameter inference from merged mature mRNA data should be treated with caution. There were smaller differences between the burst size and the active fraction inferred from cellcycle separated mature and nascent data (only these two can be directly compared because these are nondimensional); however the relative errors in the estimates (computed over the four datasets) were more than twofold higher for mature data likely due to posttranscriptional noise which nascent data is free from.
It is more common to fit mature distributions rather than nascent distributions because nascent distributions are technically more challenging to obtain. As nascent singlecell sequencing methods are still in the early phase (Hendriks et al., 2019), the only method available so far for nascent measurements is smFISH (Patel et al., 2021). In such smFISH experiments, intronic probes can be used to specifically label nascent RNA, although there may be some effects of splicing kinetics on the distribution (Wan et al., 2021). If introns are not present, like for most yeast genes, one can use exonic probes instead (Zenklusen et al., 2008). Since exonic probes label both nascent and mature mRNA transcript, it may be challenging to identify the nascent transcription site unambiguously, especially at lower transcription levels. We show in this manuscript that the fusion method can correct for this bias by combining bins below k RNAs, which results in an improvement of the parameter estimates.
Our analysis also emphasizes the importance of separately analyzing G1 and G2 cells (Skinner et al., 2016). It is important to note that for cellcyclespecific analysis, experimental adjustments or cellcycle synchronized cultures are not required. Although asynchronous cultures consist of a mix G1, S and G2 cells, the integrated DNA intensity of the nucleus of each cell, for example from a DAPI signal, can be used to separate these cells by cell cycle phase in silico (Skinner et al., 2016; Roukos et al., 2015). As most smFISH experiments already include a DNAlabelled channel, adding an extra analysis step should in principle not limit the incorporation of this step in future smFISH fitting procedures.
Even with our optimal fitting strategy, there is a residual error of the simulated ACF and the measured ACF from livecell measurements. This difference may be the result of different experimental biases of the two measurements. For example, livecell measurements have a detection threshold below which RNAs may not be detected. In addition, livecell measurements include cells in S phase, which are not analyzed in smFISH. There could also be differences in the exact percentage of G1 and G2 cells, or other noise sources between livecell and smFISH experiments. Alternatively, the fit may be imperfect because there might be parameter sets, others than the ones which our inference algorithm found, which provide an accurate fit of the nascent mRNA distribution and perhaps an even better fit to the ACF than we found. We cannot exclude this possibility because we estimated ${f}_{\text{ON}}$ to be $0.70.8$ and using synthetic data we showed that the accuracy of some parameters (${\sigma}_{\mathrm{on}},{\sigma}_{\mathrm{off}}$ and the burst size) deteriorated as ${f}_{\text{ON}}$ approached 1 (Figure 2d). Another factor which could explain the residual error between the simulated ACF and the measured ACF is that perhaps the twostate model may be too simplistic to cover the true promoter states in living cells and may therefore not be able to describe the true in vivo kinetics. The promoter may switch between more than 2 states, or there may be sources of extrinsic noise other than the cell cycle that contribute to the heterogeneity. Previous studies have for example identified extrinsic noise on the elongation rate (Fritzsch et al., 2018). However, these more complex transcription models also have more parameters, which in practice often means that very few will be identifiable with the current set of experimental observations. To fit these models, one requires temporal data on the transcription kinetics (Fritzsch et al., 2018; Rodriguez et al., 2019), or simultaneous measurements of various sources of extrinsic noise, such as singlecell transcription factor concentration and RNA polymerase number measurements, cellular volume, local cell crowding, etc, which are often not available in standard smFISH experiments (Battich et al., 2015; Foreman and Wollman, 2020). Nevertheless, given that there is no explicit time component in smFISH data, the closeness of the simulated ACF to the measured ACF provides confidence we are close to the real values.
The optimal parameter set (Figure 6d) indicates long ON promoter times of 75 s, during which almost 50 RNAs are produced in a burst. Large burst sizes (>70) have been previously reported for mouse embryonic stem cells (Skinner et al., 2016, mouse hepatocytes Bahar Halpern et al., 2015 and human fibroblasts Larsson et al., 2019). The large burst size and high active fraction of 0.78 suggests that GAL10 expression is reaching its limit of maximal expression, which may not be surprising as it is already one of the most highly expressed genes in yeast. It is also interesting to note that the ON time of 75 s is longer than the residence time of a single transcript (47 s), which means that RNA polymerases in the beginning of a burst have already left the locus before the burst has finished.
The optimal parameter set (Figure 6d) also indicates partial gene dosage compensation. Specifically the burst frequency per gene copy (${\sigma}_{\text{on}}$) in the G2 phase is 0.66 that in the G1 phase; the other transcriptional rates are not significantly different between the two cell cycle phases. The fold change in the burst frequency per gene copy was previously estimated for the $Oct4$ and $Nanog$ genes to be 0.63 and 0.71 respectively, in mouse embryonic stem cells (Skinner et al., 2016). The similarity of our estimate of the fold change to those previously measured could be explained by the results of a recent study (Jia et al., 2021); using a detailed model of gene expression, it was shown that in the absence of a dependence of the initiation rate on cell volume, gene dosage compensation optimally leads to approximate mRNA concentration homeostasis when the fold change in the burst frequency upon DNA replication is $\sqrt{2}/2\approx 0.71$.
In conclusion, obtaining kinetic information from static distributions can introduce biases. However, we show that it is possible to obtain reasonable estimates that agree with livecell measurements, if one infers parameters from nascent mRNA distributions that are accounted for cell cycle phase.
Methods
Inference from mature mRNA data
Mathematical model
The steadystate solution of the telegraph model of gene expression (Peccoud and Ycart, 1995) gives mature mRNA distributions. The reaction steps in this model are illustrated in Figure 1a. Next we describe the generation of synthetic mature mRNA data and the algorithm used to infer parameters from this data.
Generation of synthetic mature mRNA data
We generate parameter sets on an equidistant mesh grid laid over the space:
where the units are inverse minute. Furthermore we apply a constraint on the effective transcription rate
In each of the three dimensions of the parameter space, we take 10 points that are equidistant, leading to a total of 1000 parameter sets which reduce to 789 after the effective transcription rate constraint is enforced.
We additionally fix the degradation rate $d=1$ min^{1}. Note that we choose not to vary the degradation rate (as we did for the other three parameters) since it is not possible to infer all four rates simultaneously – this is because the steadystate solution of the telegraph model is a function of the nondimensional parameter ratios $\rho /d,{\sigma}_{\text{off}}/d$ and ${\sigma}_{\text{on}}/d$ (Raj et al., 2006).
Once a set of parameters is chosen, we use the stochastic simulation algorithm (SSA Gillespie, 2007) to simulate the telegraph model reactions in Figure 1a and generate 10^{4} samples of synthetic data. Note that each sample mimicks a single cell measurement of mature mRNA.
Steps of the algorithm to estimate parameters from mature mRNA data
The inference procedure consists of the following steps: (i) select a set of random transcriptional parameters; (ii) use the solution of the telegraph model to calculate the probability of observing the number of mature mRNA measured for each cell; (iii) evaluate the likelihood function for the observed data; (iv) iterate the procedure until the negative loglikelihood is minimized; (v) the set of parameters that accomplishes the latter provides the best pointestimate of the parameters of the telegraph model that describes the measured mature mRNA fluctuations.
For step (i), we restrict the search for optimal parameters in the following region of parameter space
The degradation rate is fixed to $d=1$ min^{1}.
Step (ii) can be obtained either by computing the distribution from the analytical solution (Peccoud and Ycart, 1995 or by using the finite state projection (FSP) method Munsky and Khammash, 2006). Here, for the sake of computational efficiency, we use the FSP method to compute the probability distribution of mature mRNA numbers.
For step (iii) we calculate the likelihood of observing the data given a chosen parameter set $\theta $
where $P({n}_{j};\theta )$ is the probability distribution of mature mRNA numbers obtained from step (ii) given a parameter set $\theta $, n_{j} is the total number of mature mRNA from cell $j$ and ${N}_{\text{cell}}$ is the total number of cells.
Steps (i) and (iv) involve an optimization problem. Specifically we use a gradientfree optimization algorithm, namely adaptive differential evolution optimizer (ADE optimizer) using BlackBoxOptim.jl (https://github.com/robertfeldt/BlackBoxOptim.jl; Feldt and Stukalov, 2022) within the Julia programming language to find the optimal parameters
The minimization of the negative loglikelihood is equivalent to maximizing the likelihood. Note the optimization algorithm is terminated when the number of iterations is larger than 10^{4}; this number is chosen because we have found that invariably after this number of iterations, the likelihood has converged to some maximal value. Note that the inference algorithm is particularly low cost computationally, with the optimal parameter values estimated in at most a few minutes.
Once the best parameter set ${\theta}^{*}$ is found, we calculate the mean relative error (MRE) which is defined as
where ${\theta}_{i}^{*}$ and ${\theta}_{\text{true},i}$ represent the $i$th estimated and true parameters respectively, and $M$ denotes the number of the estimated parameters. Thus, the mean relative error reflects the deviation of the estimated parameters from the true parameters.
Inference from nascent mRNA data
Mathematical model
The steadystate solution of the delay telegraph model (Xu et al., 2016) gives the distribution of the number of bound Pol II. In Appendix 6, we present an alternative approach to derive the steadystate solution. The reaction steps are illustrated in Figure 1a.
The position of a Pol II molecule on the gene determines the fluorescence intensity of the mRNA attached to it. In particular for fluorescence data acquired from smFISH PP7GAL10, the fluorescence intensity of a single mRNA on the DNA locus looks like a trapezoidal pulse (see Figure 1b for an illustration). This presents a problem because although we can predict the distribution of the number of bound Pol II using the delay telegraph model, we do not have any specific information on their spatial distribution along the gene. However, since the delay telegraph model implicitly assumes that a Pol II molecule has fixed velocity and that Pol II molecules do not interact with each other (via volume exclusion), it is reasonable to assume that in steadystate, the bound Pol II molecules are uniformly distributed along the gene. This hypothesis is confirmed by stochastic simulations of the delay telegraph model where the position of a Pol II molecule is calculated as the product of the constant Pol II velocity and the time since its production.
By the uniform distribution assumption and the measured trapezoidal fluorescence intensity profile, it follows that the signal intensity of each bound Pol II has the density function $g$ defined by
where ${L}_{1}=862\text{bp}(\text{base pairs}),{L}_{2}=2200\text{bp},L={L}_{1}+{L}_{2}$ as defined in Figure 1b. The indicator function ${\chi}_{[0,1]}(s)=1$ if and only if $s\in [0,1]$ and ${\delta}_{1}(s)$ is the Dirac function at 1. The probability of the signal $s$ being between 0 and 1 is due to the first part of the trapezoid function and hence is multiplied by ${L}_{1}/L$ which is the probability of being in this region if Pol II is uniformly distributed. Similarly, the probability of $s$ being 1 is due to the L_{2} part of the trapezoid and hence the probability is ${L}_{2}/L$ by the uniform distribution assumption. Note that the signal $s$ from each Pol II is at most 1 because in practice, the signal intensity from the transcription site is normalized by the median intensity of single cytoplasmic mRNAs (Zenklusen et al., 2008).
The total signal is the sum of the signals from each bound Pol II. Hence, the density function of the sum is given by the convolution of the signal densities from each bound Pol II. Defining $p(sk)$ as the density function of the signal given there are $k$ bound Pol II molecules, we have that $p(sk)$ is the $k$–th convolution power of $g$, that is
where ${\delta}_{0}(s)$ is the Dirac function at. Finally we can write the total fluorescent signal density function as
where $P(k;\theta )$ is the steadystate solution of the delay telegraph model giving the probability of observing $k$ bound Pol II molecules for the parameter set $\theta $. Hence Equation (8) represents the extension of the delay telegraph model to predict the smFISH fluorescent signal of the transcription site.
Comparison to the algorithm in Xu et al., 2016. Both algorithms take into account the fact that the signal intensity depends on the position of Pol II on the gene, albeit this is done in different ways. In Xu et al., 2016 a master equation is written for the joint distribution of gene state and the number of nascent mRNA. In this case the number of nascent RNAs can have noninteger values since it represents the experimentally measured signal from the (incomplete) nascent RNA. Solution of this master equation proceeds by (a) a discretization of the continuous nascent mRNA signal into bins which are much smaller than one; (b) solution using finite state projection (FSP). This approach can lead to a large state space which incurs a large computational cost. In contrast, in our method, we use FSP to solve for the delay telegraph model, i.e. the distribution of the discrete number of bound Pol II from which we construct (using convolution) the approximate distribution of the continuous nascent mRNA signal by assuming the Pol II is uniformly distributed on the gene. Since the state space of bound Pol II is typically not large, our method will typically be more computationally efficient than the one described in Xu et al., 2016.
Generation of synthetic nascent mRNA data
We generated synthetic smFISH signal data by using the SSA, modified to include delay to simulate the delay telegraph model (Fu et al., 2022). Specifically, we use Algorithm 2 described in Barrio et al., 2006. One run of the algorithm simulates the fluctuating number of bound Pol II molecules in a single cell.
The total fluorescence intensity (mimicking smFISH) is obtained as follows. When a particular bound Pol II is produced by a firing of the transcription reaction $G\to G+N$, we record this production time; since the elongation rate is assumed to be constant, given the production time we can calculate the position of the Pol II molecule on the gene at any later time and hence using Figure 1b we can deduce the fluorescent signal due to this Pol II molecule.
Specifically we normalize each transcribing Pol II’s position to $[0,1]$ and map the position to its normalized signal by
where $x$ is the normalized position on the gene. Thus at a given time, the total fluorescent signal from the $n$th cell (the $n$th realization of the SSA) equals
where ${J}_{n}$ is the number of bound Pol II molecules in the $n$th cell, and $\{{x}_{j}\}$ with $j=1,\mathrm{\dots},{J}_{n}$ is the vector of all Pol II positions on the gene. The total signal from each cell is a real number but it is discretized into an integer.
The kinetic parameters are chosen from the same region of parameter space as in (2), on the same equidistant mesh grid and with the same constraint on the effective transcription rate. Unlike the mature mRNA case, here there is no degradation rate; instead we have the elongation time, which we fix to $\tau =0.5\text{(min)}$. Note that fixing this time is necessary since it is not possible to infer the three transcriptional parameters rates and the elongation time simultaneously because the steadystate solution of the delay telegraph model is a function of the nondimensional parameter ratios $\rho \tau ,{\sigma}_{\text{off}}\tau $ and ${\sigma}_{\text{on}}\tau $.
Once a set of parameters is chosen, we use the modified SSA (as described above) to simulate the signal intensity in each of 10^{4} cells.
Steps of the algorithm to estimate parameters from nascent mRNA data
The inference procedure is essentially the same as steps (i)(v) described in mature mRNA inference except for the following points.
In step (ii), the probability of observing a total signal of intensity $i$ from a single cell is obtained by integrating $p(s;\theta )$ in Equation (8) on an interval $[i1,i]$ for $i\in \mathbb{N}$ which, in our numerical scheme, means
Note that the integration over the interval of length 1 is to match the discretization of the synthetic data and $\theta \in \mathrm{\Theta}$. Intuitively, one can always choose a positive integer $K$ such that $P(k)=0$ for any $k\ge K$. The computation of the solution of the delay telegraph model $P(k)$ can be done either using the analytical solution (evaluated using high precision) or using the finite state projection algorithm (FSP) Munsky and Khammash, 2006. In Appendix 6—figure 1 and Appendix 6—table 1, we show that the two methods yield comparable accuracy and CPU time.
For step (iii) we calculate the likelihood of observing the data given a chosen parameter set $\theta $
where q_{j} is the discretized total signal intensity from cell $j$ and ${N}_{\text{cell}}$ is the total number of cells. In the optimization, we aim to find
The whole procedure (for both mature and nascent mRNA inference) is summarized by a flowchart in Figure 1c.
Experimental data acquisition and processing
A diploid yeast strain of BY4743 background with a single integration of 14xPP7 loops at the 5’UTR of GAL10 (strain YTL047 Donovan et al., 2019) was used in this study. Four replicate yeast cultures were grown in synthetic complete media with 2% galactose to early midlog (OD 0.5), fixed with 5% paraformaldehyde (PFA) for 20 min, permeabilized with 300 units of lyticase and hybridized with 7.5 pmol each of four PP7 probes labeled with Cy3 (Integrated DNA Technologies) as described in Trcek et al., 2012 and Lenstra et al., 2015; Patel et al., 2021, resulting in four technical replicates. The PP7 probe sequences are: atatcgtctgctcctttcta, atatgctctgctggtttcta, gcaattaggtaccttaggat, aatgaacccgggaatactgc. Coverslips were mounted on microscope slides using mounting media with DAPI (ProLong Gold, Life Technologies).
The coverslips were imaged on a Zeiss AxioObserver (Zeiss, USA) widefield microscope with a PlanApochromat 40x1.4 NA oil DIC UV objective and a 1.25 x optovar. For Cy3, a 562 nm longpass dichroic (Chroma T562lpxr), 595/50 nm emission filter (Chroma ET595/50 m) and 550/15 nm LED excitation at full power (Spectra X, Lumencor) were used. For DAPI, a 425 nm longpass dichroic (Chroma T425lpxr) and a 460/50 nm emission filter (Chroma ET460/50 m) and LED excitation at 395/25 nm at 25% power (Spectra X, Lumencor) were used. The signal was detected on a Hamamatsu ORCAFlash4.0 V3 Digital CMOS camera (Hamamatsu Photonics, Japan). For each sample and each channel, we utilized the MicroManager software (UCSF) to acquire at least 20 fieldsofview based on the DAPI channel. Each fieldofview consisted of 13 zstacks (with a zstep of 0.5 µm) at 25ms exposure for DAPI and 250ms exposure for Cy3.
A custom python pipeline was used for analysis (https://github.com/Lenstralab/smFISH; Pomp, 2022). Maximum intensity projected images were used to segment the cell and nucleus using Otsu thresholding and watershedding (segmentation 1). In addition, we segmented cells using CellProfiler (segmentation 2). The diffractionlimited Cy3 spots were detected per zslice using bandpass filtering and refined using iterative Gaussian mask localization procedure (Crocker and Grier, 1996; Thompson et al., 2002; Larson et al., 2005; Larson et al., 2011 and Coulon et al., 2014). Cells in which no spots were detected were excluded from further analysis since a visual inspection indicated that these cells were not properly segmented or were improperly permeabilized.
Spots were classified as nuclear or cytoplasmic and the brightest nuclear spots were classified as transcription sites. The intensity of the brightest nuclear spot in a cell was normalized with the median fluorescence intensity of all the cytoplasmic spots in all cells. This is due to the fact that 90% of cytoplasmic mRNAs are isolated (Figure 4a), thus the median of the fluorescence signal of cytoplasmic mRNAs can be considered as the normalizing value. The distribution of the normalised intensity of the brightest nuclear spot, calculated over the cell population, is the experimental equivalent of the total fluorescent signal density function as given by the solution of the modified delay telegraph model, Equation (8).
The number of mature mRNA in each cell is given by counting the number of spots in the entire cell, that is nuclear plus cytoplasmic. The transcription site is counted as 1 mRNA, regardless of its intensity. We show in Figure 3c that this has negligible influence on the estimated parameters since the mean number of mature mRNA is much greater than 1. The distribution of the number of spots is the experimental equivalent of the solution of the telegraph model, that is the marginal distribution of mature mRNA numbers in steadystate conditions.
The integrated nuclear intensity of each cell was calculated by summing the DNA content intensity (DAPI) of all the pixels within the nucleus mask. The distribution of the intensities was fit with a bimodal Gaussian distribution. Those cells whose intensity was within a standard deviation of the mean of the first (second) Gaussian peak was classified as G1 (G2) (see Figure 3e left). This gave similar results to a different cell cycle classication method using the Fried/Baisch model (Johnston et al., 1978) which was recently employed in Skinner et al., 2016. See Appendix 7—figure 1 for a comparison of the two methods. We note that cells in late G2 may contain two separate transcription sites, one in the mother and one in the bud. When the nucleus moves into the bud, buds often contain less DNA than G1 cells, and mothers contain more DNA than G1 cells, both of which are excluded from the analysis. When the DNA content of the mother and daughter is similar, both mother and daughter are counted separately as G1 cells. We note that this late G2 subpopulation is very small.
We did four independent experiments with a total number of cells equal to 2510, 6411, 4592, 3181, respectively. After classification, the numbers of G1 cells are 766, 2111, 1495, 904 and the number of G2 cells are 683, 1657, 1209, 1143, whereas the rest were classified as undetermined.
Data availability
The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH; Pomp, 2022. The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl; Fu, 2022.
Appendix 1
Accuracy of inference from synthetic mature and nascent mRNA data
Inference from synthetic mature mRNA data with external noise
In the main text, Figure 2, we showed how the addition of 5% external noise to synthetic mature mRNA data degrades the inference accuracy. In Appendix 1—figure 1 we show how the addition of a larger amount of external noise (10%) causes an even larger loss of accuracy. In particular for 91% of the parameters, the inference accuracy is higher when using nascent mRNA data (Appendix 1—figure 1a) and the median relative errors become very high for most parameters, especially for $\rho $ and ${\sigma}_{\mathrm{off}}$ in the limit of large ${f}_{\text{ON}}$ (Appendix 1—figure 1b).
Accuracy of distribution fits
In the main text, Figure 2, we showed how the accuracy of parameter estimation is not uniform across parameter space. Here we investigate if there is any relationship between this accuracy and how well is a distribution of mature mRNA numbers / signal intensity fit by the inference algorithm. For 12 parameter sets (6 for the telegraph model – Appendix 1—figure 2a) and (6 for the delay telegraph model – Appendix 1—figure 2b), we evaluate the fits to the distribution of synthetic data in Appendix 1—figure 2c–d. The results show that independent of the accuracy of the parameters estimated by the inference algorithm, the fits of the delay telegraph and telegraph model distributions to the distributions generated from synthetic data are generally excellent.
Testing the variability of the inference procedure
To obtain a better understanding of the variability of the inference procedure, for each of the six parameter sets in Appendix 1—figure 2b and i.e. for the inference using the delay telegraph model, we generated 10 independent sets of synthetic data and used maximum likelihood to infer the parameters for each dataset. The mean and standard deviation of the parameters (computed over all 10 datasets) are shown in Appendix 1—table 1. The means are close to the true parameter values (in Appendix 1—figure 2b); this shows that the inference procedure is working correctly and the deviations from ground truth values are mostly due to noise in the synthetic datasets. We can define the sample variability of a parameter as the standard deviation divided by the mean. Ordering parameters by this quantity, we find that for all six parameter sets the error is largest for ${\sigma}_{\mathrm{off}}$. For small ${f}_{\mathrm{ON}}$, the parameters ordered by increasing error are: ${\sigma}_{\mathrm{on}}$, burst size, $\rho $, ${f}_{\mathrm{ON}}$ and ${\sigma}_{\mathrm{off}}$. While for large ${f}_{\mathrm{ON}}$, the order is: $\rho $, ${f}_{\mathrm{ON}}$, ${\sigma}_{\mathrm{on}}$, burst size and ${\sigma}_{\mathrm{off}}$. Note that this order is the same as we determined using the relative errors (from the ground truth) which is shown in Figure 2 of the main text.
Confidence intervals using profile likelihood
We perform a profile likelihood study (Kreutz et al., 2013) on the 12 parameter sets of synthetic mature/nascent mRNA data described in Appendix 1—figure 2ab. We obtain the 95% confidence interval for each parameter. The results are shown in Appendix 1—table 2. We also compare the relative errors for each parameter (computed as (estimated value  ground truth)/ground truth using the data in Appendix 1—figure 2a and b) with the profile likelihood error (computed as (upper bound  lower bound)/optimal estimate using the bounds in Appendix 1—table 2 and the optimal values in Appendix 1—figure 2a and b). The results are shown in Appendix 1—table 3. Note that in most cases, the parameters ordered by relative error are in agreement with the parameters ordered by profile likelihood error.
Effect of random perturbation of mature mRNA data on inference
To assess the reliability of the inference results due to errors in spot counting, we redid the inference with synthetic mRNA data (and using the telegraph model) perturbed randomly by minus 1/plus 1/unchanged with probability $1/3$. The results are shown in Appendix 1—table 4.
We found that when the fraction of ON time was very small (Set 1), there is a considerable effect of the perturbations on the values of the inferred parameters – this is because in this case, the mean number of mRNA is very small and hence a perturbation of one molecule is very significant. However as expected, the inference results are quite robust when the fraction of ON time is not too small (Sets 2–6).
Effect of lognormal noise in nascent fluorescent signal on inference
Given the synthetic nascent mRNA signal data ${\{{S}_{i}\}}_{i=1}^{N}$ (where $N$ represents the number of samples) generated by delay telegraph model, for each ${S}_{i}$, we perform a random perturbation $\mathrm{\Phi}$ under the following conditions
where $\mathrm{\Phi}$ is a stochastic perturbation satisfying the following constraints
$\mathrm{\Phi}({S}_{i})$ is a random variable sampled from the distribution Lognormal $(\alpha ,\beta )$ whose mean equals $S}_{i$, and the standard deviation equals $0.1\ast {S}_{i}$.
This means the random perturbation keeps the mean value of the signal ${S}_{i}$ unchanged but adds noise with a coefficient of variation equal to 0.1.
In Appendix 1—table 5 we compare the results of inference using synthetic nascent mRNA data with the aforementioned stochastic perturbation and without. As for mature mRNA data, we find that the perturbation has only a significant impact when the fraction of ON time is very small.
Appendix 2
Telegraph model versus delay telegraph model
In this section, we aim to precisely understand the differences between the telegraph and the delay telegraph models. For both models, we define the rate of switching from the ON state to OFF state as ${\sigma}_{\text{off}}$, the rate of switching from the OFF state to the ON state as ${\sigma}_{\text{on}}$ and the production rate of nascent mRNAs in the ON state as $\rho $. The firstorder decay rate of nascent mRNA in the telegraph model is given by $d$ and the delay time between initiation and degradation in the delay telegraph model is $\tau $. Note that while the telegraph model was in the main text explained in terms of mature mRNA, in this section we use it as a model for nascent mRNA since we want to compare directly with the delay telegraph model. The telegraph and delay telegraph models can be solved exactly in steadystate conditions (Peccoud and Ycart, 1995; Xu et al., 2016) (an alternative derivation for the delay telegraph model is also given in Section F). From the generating function solution of the models, one can deduce expressions for the first and second centered moments in steadystate conditions:
where $\u27e8n\u27e9}_{\text{delay}$ and $\u27e8n\u27e9}_{\text{tele}$ are the mean number of nascent mRNA in the delay telegraph and telegraph models respectively, and ${\mathrm{Var}}_{\text{delay}}$ and ${\mathrm{Var}}_{\text{tele}}$ are the corresponding variances in molecule numbers. To understand the differences between the two models, we set the means of the two models to be the same (by choosing $d=1/\tau $) and then compute the relative error in their variance predictions:
where $x$ and $y$ are nondimensional variables defined as
Note that $x$ is the ratio of the time for nascent mRNA to detach from the gene $\tau $ and the timescale of promoter switching $1/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$. The nondimensional parameter $y$ is a measure of the burstiness of gene expression since it increases with the mean burst size $\rho /{\sigma}_{\text{off}}$ and the fraction of time spent in the OFF state ${\sigma}_{\text{off}}/({\sigma}_{\text{off}}+{\sigma}_{\text{on}})$. In fact as ${\sigma}_{\text{off}}\to 0$ which implies $y\to 0$, the telegraph model converges to a constitutive model where the time between two successive nascent mRNA production events is exponentially distributed. Note that the relative errors in the Fano factor and the coefficient of variation squared are the same as the error in the variance since the means of the two models are the same.
Using Equation (12), it is easy to see that the relative error between the two models vanishes in the limit of $x\to 0$ (when the promoter switching timescales are much longer than the time spent by a polymerase on a gene) or in the limit of $y\to 0$ (when there is no burstiness in gene expression). Hence, the telegraph model is an accurate approximation of the delay telegraph model in these two limits. It can be shown that in the first case, the distribution of nascent mRNA numbers is well approximated by the sum of two Poisson distributions, whereas for the second case the distribution is a Poisson. Note that since $R$ is always positive, it follows that the telegraph model systematically underestimates the size of noise in nascent mRNA numbers. It can be further shown that $R$ increases monotonically with $x$ and $y$ and the maximum attainable value is $R=1/2$ (when $x\to \mathrm{\infty},y\to \mathrm{\infty}$), that is, the worst prediction of the telegraph model is that the variance is half that of the delay telegraph model.
In Appendix 2—figure 1a and c we contrast the distributions of nascent mRNA predicted by the delay telegraph model with those predicted by the telegraph model for the case $d=1/\tau $ where the means are matched, as also assumed for the calculation of the relative error above. Appendix 2—figure 1b shows the effect of increasing $x$ (via $\tau $) and Appendix 2—figure 1d shows the effect of increasing $y$ (via $\rho $). The number distributions are constructed from the generating functions of the telegraph model Peccoud and Ycart, 1995 and of the delay telegraph model (Section F). Note that the shapes of the distributions of the two models can be considerably different, e.g. the cases $\rho =3$ and $\rho =10$ in Appendix 2—figure 1c shows that the nascent distribution from the delay model is bimodal with peaks at 0 and at a nonzero value but it is unimodal with a nonzero peak from the telegraph model. Hence we conclude that if we are interested in accurately predicting nascent mRNA number distributions then generally the Markovian telegraph model is not a good approximation of the nonMarkovian delay telegraph model.
In the main text, we used the delay telegraph model to infer the synthetic data from the synthetic data generated by the delay SSA. For comparison, we repeat the same but now we use the telegraph model (rather than the delay telegraph model) to calculate step (ii) in the inference algorithm, that is $P(k;\theta )$ in Equation 8 in the main text is now chosen to be the steadystate solution of the telegraph model. Once the best parameter set ${\theta}^{*}$ is found, we calculate two scores: (i) the fitness given by the smallest negative loglikelihood value found by the optimizer normalized over the sample size. (ii) the mean relative error (for definition see Equation (4.5) in the main text). In Appendix 2—figure 1, we show both of these scores obtained for 20 independent numerical experiments – clearly the error using the delay telegraph model for the inference algorithm is significantly lower than if the telegraph model is used.
These observations are further reinforced in Appendix 2—figure 1f where we show the best fit distributions and the corresponding relative errors in the estimates of the burst size and the burst frequency (bar chart insets). These are computed using the formula:
We note that the distributions are well fit in all cases, using both telegraph and delay telegraph models. However the errors $\alpha $ and $\beta $ are considerably larger for the former.
In Appendix 2—table 1 we show the true and estimated parameters for the 6 distributions in Appendix 2—figure 1f. The estimates for $\rho $ are accurate independent of the choice of model; however, the estimates for the promoter switching rates ${\sigma}_{\text{off}},{\sigma}_{\text{on}}$ are far more accurate using the delay telegraph model. Hence we conclude that although inference using the telegraph model provides a histogram that fits well with the synthetic data, the inferred parameter values have little meaning because they are not an accurate reflection of the true parameter values.
Appendix 3
Inference results using mature mRNA data
Influence of segmentation on inference
In main text Section Inference from nascent mRNA data: cell cycle effects, experimental artifacts and comparison with mature mRNA inference, we used independent segmentation tools to study segmentation artifacts. The inference results under segmentation 1, segmentation 2 and segmentation 2 without counting the transcriptional site are summarised in Appendix 3—table 1.
Confidence intervals for estimates from merged and cellcycle specific mRNA data using segmentation 2
We obtain 95% confidence intervals for the parameters estimated from mature mRNA data under segmentation 2 (shown in Figure 3f of the main text). The confidence intervals are shown for ${\sigma}_{\text{off}},{\sigma}_{\text{on}},\rho $ in Appendix 3—table 2.
Effect of synchronization of gene copies in G2 phase on parameter inference
In the main text, we assumed that the transcription of two allele copies in the G2 phase are independent. Here we consider the opposite scenario where the two allele copies in G2 phase are perfectly synchronized with each other, i.e. when one copy switches from on to off, the other copy also switches from on to off. Note that the time at which mRNA is transcribed from each copy is however not the same. Using this modified simulation algorithm, we reperform the inference using the experimental data under segmentation 2. The results are shown in Appendix 3—table 3. Comparing these values with those estimated for phase G2 under the assumption of allele independence (main text Figure 3f right panel), we find that $\rho $ and ${f}_{\mathrm{ON}}$ are very similar but the other parameters vary considerably.
Appendix 4
Inference results using nascent mRNA data
Inference using noncurated data
We show the inference results of the experimental noncurated nascent mRNA data (with and without taking into consideration the cellcycle) for the four datasets (Appendix 4—table 1) and their 95% confidence interval using the profile likelihood estimate (Appendix 4—table 2). The best fit distributions for merged and cellcyclespecific data are shown in Appendix 4—figure 1.
Inference using curated data
We show the inference results using nascent mRNA data curated with the rejection method in Appendix 4—table 3 and Appendix 4—figure 2, and with the fusion method in Appendix 4—table 4. We also used the profile likelihood method to obtain the 95% confidence intervals for the fusion method with $k=4$; the results are shown in Appendix 4—table 5. We also reinferred the parameters in the G2 phase for fusion corrected data by assuming that the allele states are perfectly synchronised; these are shown in Appendix 4—table 6.
Appendix 5
Calculating the autocorrelation function (ACF) from stochastic simulations
Intensity traces of GAL10 transcription were generated by a stochastic simulation algorithm (SSA) following the delay telegraph model presented in the main manuscript. Briefly, for each yeast cell, the ON and OFF GAL10 promoter states are simulated such that they have an exponentially distributed lifetime with means $1/{\sigma}_{\text{off}}$ and $1/{\sigma}_{\text{off}}$, respectively. When the promoter was ON, the transcription of single GAL10 mRNAs was initiated following a Poisson process with mean equal to the initiation rate $\rho $. The fluorescence trace $I(t)$, as measured experimentally over time, was obtained by convolving the train of mRNA initiation events with the trapezoidal intensity pulse of the PP7 tagging system (shown in Figure 1c of the main text). It is important to note that this last step is necessary to accurately model the fluorescent intensities that are experimentally measured.
To account for cell cycle, we simulated an asynchronous population, where cells could switch from G1 to G2. In the latter, to account for two independent transcription sites, two independent intensity traces were generated and summed over time. The ratio of G1 to G2 cells was 1.15:1, 1.27:1, 1,24:1 and 0.80:1 for datasets 1, 2, 3, and 4, respectively, as determined from the smFISH experiments. It is important to note that: (i) livecell experiments (as in Donovan et al., 2019) are performed in asynchronous populations, without any sorting based on cell cycle phase; (ii) the average G1 phase duration for mother and daughter cells (${t}_{\text{mother}}=2400s$ and ${t}_{\text{daughter}}=6480s$, respectively – unpublished data) is on the order of the typical total acquisition time (e.g. $1800s$). Hence, in our simulations, cells that were G1 at the start of the intensity trace (i.e. livecell experiment) were allowed to switch to G2 before the end of the experiment. Trivially, cells that were G2 at the start remained so until the end of the experiment.
The autocorrelation function (ACF) was calculated as:
where $\mathrm{\Delta}t$ is the timelag (e.g. $30s$ in Donovan et al., 2019), $\u27e8\cdot \u27e9$ denotes the temporal average over $t$, and $\delta I(t)=I(t)\u27e8I(t)\u27e9$. Appendix 5—figure 1 shows the ACFs from Figure 6c of the main manuscript, in which a linear fit is performed to correct for nonstationary or slowlyvarying effects, i.e. switching from G1 to G2 during the experiment, bleaching, heterogeneous galactose induction.
Appendix 6
Derivation of the exact solution for the master equation of the delay telegraph model and finite state projection
Exact solution
The delay telegraph model includes four reactions
where $G$ and ${G}^{\star}$ stand for the active (ON) and inactive (OFF) gene state, respectively, and ${\sigma}_{\text{off}}$ and ${\sigma}_{\text{on}}$ are the activation and inactivation rates, respectively. Once a nascent mRNA is produced, it will be removed from the system after a deterministic time $\tau $. We aim to find the quantitative description of the kinetics of nascent mRNA ($N$). We proceed by deriving the delay chemical master equation (CME) of Equation (15). The same delay models have also been studied by other authors .
Let $P(0,n,t)$ and $P(1,n,t)$ be the probability of observing $n$ nascent mRNAs while the gene state is inactive and active at time $t$, respectively. Consequently, we can state
Note that by instant reactions, we mean all reactions except the delayed reaction that removes nascent mRNA. Since Part A includes all Markovian reactions, it follows from standard arguments Gillespie, 2007 that
The contribution of Part B requires a careful consideration of the history of the process: (i) $(1,{n}^{\prime},t\tau )$ leads to $(1,{n}^{\prime}+1,t\tau +\mathrm{\Delta}t)$ when the gene was ON; (ii) $(1,{n}^{\prime}+1,t\tau +\mathrm{\Delta}t)$ leads to $(i,n+1,t)$ where the gene state $i$ can be either OFF or ON; (iii) $(i,n+1,t)$ leads to $(i,n,t)$. Indeed, since the removal of nascent mRNA is a delayed reaction, (iii) is due to a production reaction in (i) a time interval $\tau $ earlier. The probability of (i) occurring is $\rho \mathrm{\Delta}tP(1,{n}^{\prime},t\tau )$. The probability of (iii) occurring is 1 since every production event is followed by a removal event a time $\tau $ later. The probability of (ii) occurring is:
As for Equation (18), the two ‘+1’s means that the new nascent mRNA was produced during the time interval $(t\tau ,t\tau +\mathrm{\Delta}t)$, and it did not participate in any other reactions, thereby not influencing the probability at time $t+\mathrm{\Delta}t$. In short,
Furthermore, all ${n}^{\prime}$ nascent mRNAs must leave the system prior to time $t$ as they were all born before time $t\tau $. This implies that
Hence the probability that event B occurs is given by the product of the probability of events (i), (ii) and (iii) and summing over all values of ${n}^{\prime}$
where we used ${\sum}_{{n}^{\prime}}P(1,{n}^{\prime},t\tau )=P(1,t\tau )$.
Finally, the contribution of Part C is obtained from simple probability arguments
where we used the same argument as in Equation (21). Using Equations 16; 17; 21; 22 and taking the limit of small $\mathrm{\Delta}t$, we finally obtain the set of delay CMEs
where ${E}^{k}P(i,n,t)=P(i,n+k,t)$ is the step operator. Summing over all possible $n$ for the two equations in Equation (23), we obtain
initiated with the inactive state, in which $P(0,t)$ and $P(1,t)$ are the probabilities of finding a cell at the inactivated and activated state at time $t$, and their solutions are
Besides, it is noted that in Equation (23)
for $t\ge \tau $, and $h(t\tau )=0$ if $t<\tau$. The $n$ nascent mRNAs of the two conditional probabilities $P(i,n,t1,0,t\tau )$ for $i=0,1$ in Equation (23) are produced during time $(t\tau ,t)$, and the pertinent dynamics are that of a reaction system only composed of the three instant reactions in Equation (15). Specifically, we have
where the initial values are $\stackrel{~}{P}(0,n,0)=0$ for any $n$, $\stackrel{~}{P}(1,n,0)=1$ for $n=0$ and equal to 0 otherwise, as well as $P(i,n,t1,0,t\tau )=\stackrel{~}{P}(i,n,\tau )$ for any $i=0,1$.
Let ${G}_{i}(u,t)=\sum _{n}(u+1{)}^{n}P(i,n,t)$ and ${\stackrel{~}{G}}_{i}(u,t)=\sum _{n}(u+1{)}^{n}\stackrel{~}{P}(i,n,t)$. We particularly define the generating function in such a form to simplify the notation. Then, using Equations 23; 24 we obtain
and
where the arguments $u$ and $t$ in the generating functions are suppressed for clarity, and the superscript $\tau $ is used to emphasize the generating function ${\stackrel{~}{G}}_{i}$ up to a particular time $\tau $. The initiation condition of Equation (27) is ${\stackrel{~}{G}}_{0}=0$ and ${\stackrel{~}{G}}_{1}=1$ when $t=0$.
Under the condition of steadystate as $t\to \mathrm{\infty}$, Equation (26) reduces to
with $\overline{h}={\sigma}_{\text{on}}/({\sigma}_{\text{on}}+{\sigma}_{\text{off}})$.
Therefore, by solving Equation (28) we obtain
where we define $\mathrm{\Delta}(u)={\left({\sigma}_{\text{off}}u\rho \right)}^{2}+2{\sigma}_{\text{on}}\left({\sigma}_{\text{off}}+u\rho \right)+{\sigma}_{\text{on}}^{2}$. Finally, we obtain the probability distribution by the following relation
We checked that the obtained distribution agrees exactly with the nascent mRNA distribution of Equation 7 in Xu et al., 2016.
Finite State Projection
The finite state projection (FSP) method Munsky and Khammash, 2006 is an efficient numerical method that can be implemented to solve the CME (23) to any desired degree of accuracy. Because Equation (23) is coupled with Equation (25), here we show how to use a twostep FSP method to obtain the probability distribution.
By assuming that there exists a large $N$ such that $P(i,n,t)=0$ and $\stackrel{~}{P}(i,n,t)=0$ for any $i=0,1$ and $n\ge N+1$, we denote
as a ${N}^{2}$dimensional vector and $\stackrel{~}{\mathbf{P}}(t)\in {\mathbb{R}}^{{N}^{2}}$ similarly. We first use FSP method to solve Equation (25) with the following ordinary differential equation (ODE) up to time
where $\mathbf{A}$ is defined as
with the initial condition
Secondly, we solve the following inhomogeneous ODE
where $\mathcal{S}$ is the rightshift operator, i.e. for any $\mathbf{x}=({x}_{1},{x}_{2},\dots ,{x}_{N})$, $\mathcal{S}\mathbf{x}=(0,{x}_{1},{x}_{2},\dots ,{x}_{N1})$ and
with the initial condition
Hence, the solution of Equation (31) gives the approximate probability distribution $\mathbf{P}(t)$ for any $t>0$.
Note that if one uses the generating function Equation 29 to compute the probability distribution, the computation for the highorder derivatives can cause numerical instabilities due to lack of arithmetic precision – compare the exact solutions with precision 85 and 300 in the left and right panels of Appendix 6—figure 1. Because FSP avoids the computation of the higherorder derivatives, it can be as numerically stable as computing the exact solution with high precision (right panel of Appendix 6—figure 1). In Appendix 6—table 1, we show the computational efficiency of FSP versus numerical evaluation of the exact solution – FSP and the exact solution with high precision are hence comparable both in terms of accuracy and efficiency. In the main text, we use FSP.
Appendix 7
Classification of the cell cycle phase: bimodal Gaussian distribution method versus the Fried/Baisch model
The distribution of DNA content intensities (DAPI) was fit using a bimodal distribution; those cells whose intensity was around one peak were classified as G1 and those around the second peak were classified as G2 (main text Figure 3e). We also did the classification using an alternative method, based on the Freid/Baisch model, which was recently employed in Skinner et al., 2016. In Appendix 7—figure 1 we show the distributions of fluorescent signal intensity for cells in the G1 phase (top row) and for those in the G2 phase (bottom row) for the four data sets described in the main text. Note that the method (bimodal Gaussian or Freid/Baisch) used to classify the cells in G1/G2 does not alter the intensity distribution hence verifying the robustness of our classification method.
Data availability
The four smFISH datasets are available from https://osf.io/d5nvj/. These datasets include the maximum intensity projected images, the spot localization results, the nuclear and cellular masks used for merged, G1 and G2 cells and the analyzed results of the mature and nascent data. The analysis code of the smFISH microscopy data is available at https://github.com/Lenstralab/smFISH (copy archived at swh:1:rev:b49af68653e9fdcab3fa48085f648fc86d8c659e). The code for the the synthetic simulations and the parameter inference is available at https://github.com/palmtree2013/RNAInferenceTool.jl (copy archived at swh:1:rev:be2fcc8f7a811a571a297d3e150395c0a73add09).

Open Science FrameworkID d5nvj. smFISH datasets for PP7GAL10 in budding yeast.
References

Bursty gene expression in the intact mammalian liverMolecular Cell 58:147–156.https://doi.org/10.1016/j.molcel.2015.01.027

Oscillatory regulation of HES1: discrete stochastic delay modelling and simulationPLOS Computational Biology 2:e117.https://doi.org/10.1371/journal.pcbi.0020117

A stochastic model of gene expression with polymerase recruitment and pause releaseBiophysical Journal 119:1002–1014.https://doi.org/10.1016/j.bpj.2020.07.020

Methods of digital video microscopy for colloidal studiesJ Colloid Interface Sci 179:298–310.https://doi.org/10.1006/jcis.1996.0217

Stochastic models of gene transcription with upstream drives: exact solution and sample path characterizationJournal of the Royal Society, Interface 14:126.https://doi.org/10.1098/rsif.2016.0833

Mammalian gene expression variability is explained by underlying cell stateMolecular Systems Biology 16:e9146.https://doi.org/10.15252/msb.20199146

EstrogenDependent control and celltocell variability of transcriptional burstingMolecular Systems Biology 14:e7678.https://doi.org/10.15252/msb.20177678

SoftwareRNAInferenceTool.jl, version swh:1:rev:be2fcc8f7a811a571a297d3e150395c0a73add09Software Heritage.

Stochastic simulation of chemical kineticsAnnual Review of Physical Chemistry 58:35–55.https://doi.org/10.1146/annurev.physchem.58.032806.104637

Extrinsic noise and heavytailed laws in gene expressionPhysical Review Letters 124:108101.https://doi.org/10.1103/PhysRevLett.124.108101

Exactly solvable models of stochastic gene expressionJ Chem Phys 152:144106.https://doi.org/10.1063/1.5143540

NASCseq monitors RNA synthesis in single cellsNature Communications 10:1–9.https://doi.org/10.1038/s41467019110289

Thiollinked alkylation of RNA to assess expression dynamicsNature Methods 14:1198–1204.https://doi.org/10.1038/nmeth.4435

Automatic processing and interpretation of DNA distributions: comparison of several techniquesComputers and Biomedical Research 11:393–404.https://doi.org/10.1016/00104809(78)900204

Profile likelihood in systems biologyFEBS Journal 280:2564–2571.https://doi.org/10.1111/febs.12276

A single molecule view of gene expressionTrends in Cell Biology 19:630–637.https://doi.org/10.1016/j.tcb.2009.08.008

Singlemolecule mrna detection in live yeastCurrent Protocols in Molecular Biology 113:14.https://doi.org/10.1002/0471142727.mb1424s113

Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeastMolecular Systems Biology 7:458.https://doi.org/10.1038/msb.2010.112

What shapes eukaryotic transcriptional bursting?Molecular BioSystems 13:1280–1290.https://doi.org/10.1039/c7mb00154a

Markovian modeling of geneproduct synthesisTheoretical Population Biology 48:222–234.https://doi.org/10.1006/tpbi.1995.1027

Effects of cell cycle variability on lineage and population measurements of messenger RNA abundanceJournal of the Royal Society, Interface 17:20200360.https://doi.org/10.1098/rsif.2020.0360

Cell cycle staging of individual cells by fluorescence microscopyNature Protocols 10:334–348.https://doi.org/10.1038/nprot.2015.016

Consequences of mRNA transport on stochastic variability in protein levelsBiophysical Journal 103:1087–1096.https://doi.org/10.1016/j.bpj.2012.07.015

Precise nanometer localization analysis for individual fluorescent probesBiophysical Journal 82:2775–2783.https://doi.org/10.1016/S00063495(02)75618X

What is a transcriptional burst?Trends in Genetics 36:288–297.https://doi.org/10.1016/j.tig.2020.01.003

Expression homeostasis during DNA replicationScience 351:1087–1090.https://doi.org/10.1126/science.aad1162

Stochastic kinetics of nascent RNAPhysical Review Letters 117:128101.https://doi.org/10.1103/PhysRevLett.117.128101

Singlerna counting reveals alternative modes of gene expression in yeastNature Structural & Molecular Biology 15:1263–1271.https://doi.org/10.1038/nsmb.1514

Analytical results for a multistate gene modelSIAM Journal on Applied Mathematics 72:789–818.https://doi.org/10.1137/110852887

Cellcycle dependence of transcription dominates noise in gene expressionPLOS Computational Biology 9:e1003161.https://doi.org/10.1371/journal.pcbi.1003161
Article and author information
Author details
Funding
National Natural Science Foundation of China (61988101)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
National Natural Science Foundation of China (62073137)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
H2020 European Research Council (755695 BURSTREG)
 Tineke L Lenstra
Leverhulme Trust (RPG2020327)
 Ramon Grima
Shanghai Action Plan for Technological Innovation Grant (22ZR1415300)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
Shanghai Action Plan for Technological Innovation Grant (22511104000)
 Zhixing Cao
 Xiaoming Fu
 Libin Xu
Shanghai Sailing Program (22YF1410700)
 Xiaoming Fu
Oncode Institute
 Tineke L Lenstra
Netherlands Organisation for Scientific Research (Gravitation program CancerGenomiCs.nl)
 Tineke L Lenstra
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
ZC, XF, and LX acknowledge the support from Natural Science Foundation of China (NSFC No. 61988101, 62073137), Shanghai Action Plan for Technological Innovation Grant (No. 22ZR1415300, 22511104000) and Shanghai Center of Biomedicine Development. XF acknowledges the support from Shanghai Sailing Program (22YF1410700). TLL was supported by the Netherlands Organization for Scientific Research (NWO, gravitation program CancerGenomiCs.nl), Oncode Institute, which is partly financed by the Dutch Cancer Society, and the European Research Council (ERC Starting Grant 755695 BURSTREG). RG was supported by a Leverhulme Trust research award (RPG2020–327).
Version history
 Preprint posted: November 11, 2021 (view preprint)
 Received: August 7, 2022
 Accepted: October 14, 2022
 Accepted Manuscript published: October 17, 2022 (version 1)
 Version of Record published: November 10, 2022 (version 2)
Copyright
© 2022, Fu, Patel et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,058
 views

 187
 downloads

 36
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Biochemistry and Chemical Biology
 Computational and Systems Biology
Previously, Tuller et al. found that the first 30–50 codons of the genes of yeast and other eukaryotes are slightly enriched for rare codons. They argued that this slowed translation, and was adaptive because it queued ribosomes to prevent collisions. Today, the translational speeds of different codons are known, and indeed rare codons are translated slowly. We reexamined this 5’ slow translation ‘ramp.’ We confirm that 5’ regions are slightly enriched for rare codons; in addition, they are depleted for downstream Start codons (which are fast), with both effects contributing to slow 5’ translation. However, we also find that the 5’ (and 3’) ends of yeast genes are poorly conserved in evolution, suggesting that they are unstable and turnover relatively rapidly. When a new 5’ end forms de novo, it is likely to include codons that would otherwise be rare. Because evolution has had a relatively short time to select against these codons, 5’ ends are typically slightly enriched for rare, slow codons. Opposite to the expectation of Tuller et al., we show by direct experiment that genes with slowly translated codons at the 5’ end are expressed relatively poorly, and that substituting faster synonymous codons improves expression. Direct experiment shows that slow codons do not prevent downstream ribosome collisions. Further informatic studies suggest that for natural genes, slow 5’ ends are correlated with poor gene expression, opposite to the expectation of Tuller et al. Thus, we conclude that slow 5’ translation is a ‘spandrel’a nonadaptive consequence of something else, in this case, the turnover of 5’ ends in evolution, and it does not improve translation.

 Computational and Systems Biology
Accurate prediction of the structurally diverse complementarity determining region heavy chain 3 (CDRH3) loop structure remains a primary and longstanding challenge for antibody modeling. Here, we present the H3OPT toolkit for predicting the 3D structures of monoclonal antibodies and nanobodies. H3OPT combines the strengths of AlphaFold2 with a pretrained protein language model and provides a 2.24 Å average RMSD_{Cα} between predicted and experimentally determined CDRH3 loops, thus outperforming other current computational methods in our nonredundant highquality dataset. The model was validated by experimentally solving three structures of antiVEGF nanobodies predicted by H3OPT. We examined the potential applications of H3OPT through analyzing antibody surface properties and antibody–antigen interactions. This structural prediction tool can be used to optimize antibody–antigen binding and engineer therapeutic antibodies with biophysical properties for specialized drug administration route.