Abstract
Characterizing and manipulating cellular behaviour requires a mechanistic understanding of the causal interactions between cellular components. We present an approach that can detect causal interactions between genes without the need to perturb the physiological state of cells. This approach exploits naturally occurring cell-to-cell variability which is experimentally accessible from static population snapshots of genetically identical cells without the need to follow cells over time. Our main contribution is a simple mathematical relation that constrains the propagation of gene expression noise through biochemical reaction networks. This relation allows us to rigorously interpret fluctuation data even when only a small part of a complex gene regulatory process can be observed. This relation can be exploited to detect causal interactions by synthetically engineering a passive reporter of gene expression, akin to the established “dual reporter assay”. While the focus of our contribution is theoretical, we also present an experimental proof-of-principle to illustrate the approach. Our data from synthetic gene regulatory networks in E. coli are not unequivocal but suggest that the method could prove useful in practice to identify causal interactions between genes from non-genetic cell-to-cell variability.
Translating molecular abundance data into mechanistic models of cellular processes remains a major challenge of systems biology. High-throughput experimental techniques routinely produce statistical associations between genes through correlation measurements of gene expression variability [1–5]. While such statistical associations can provide insightful clues, they do not directly identify causal interactions: even if two components X and Z are non-spuriously correlated, we cannot conclude that X affects Z because the causal connection could be reversed or a confounding factor could be regulating both components, even with statistical associations that go beyond correlations such as Granger causality [6, 7].
Perturbation experiments are a conceptually simple solution to avoid this problem and directly infer causal relationships in gene regulatory networks [8–10]. However, they come with practical challenges: drug perturbations can affect multiple targets at once [11] and genetic perturbations, e.g., through changing gene copy numbers, are not guaranteed to keep cells in their physiologically relevant regime [12, 13]. For large enough perturbations, everything is expected to affect everything else in the cell.
Here we present a method to infer directional causal interactions between cellular components by utilizing stochastic fluctuations in cellular abundances. Existing work on analyzing non-genetic variability has focused on testing completely specified mechanistic models against such fluctuation data [1, 14–17]. However, completely specifying mechanistic models of cellular processes often requires making a large number of assumptions about unknown details, which can make inferences based on such an approach unreliable [18, 19].
We propose a novel inference method to detect whether a gene X causally affects a gene Z which rests on a mathematical identity that constrains an entire class of models. We show how covariability measurements of molecular abundances can then be used to detect causal interactions by specifying only “local” aspects of the underlying gene expression dynamics [19–21].
Experimentally, we exploit an approach similar to the “dual reporter assay” [22–25] previously established to quantify stochastic fluctuations in expression differences between copies of a gene of interest. However, instead of analyzing the covariance between the dual reporters, we focus on the covariance of each reporter with a third component of biological interest. We prove that under broad conditions these covariances must be identical in the absence of the causal interaction we wish to establish. Violations of this relation can thus be used to detect causal interactions.
To illustrate the generality of our covariance identity, we numerically verify the analytically proven result for a wide variety of example systems. Additionally, we present an experimental proof-of-principle for synthetically constructed gene regulatory circuits in E. coli with known interactions. The data establish a baseline estimate for the accuracy of our approach with current experimental methods. Despite the arguable occurrence of a false positive result, the experimental data support our hypothesis that single-cell variability measurements might prove useful in detecting causal interactions in gene regulatory networks.
Results
Invariant relation in the absence of causal interactions
Our main mathematical result is a covariance invariant that constrains the fluctuations of two molecular species in partially specified reaction networks. We consider a molecular species of interest X that is subject to firstorder degradation and is made with an unspecified and time-varying production rate. A reporter species Y acts as an identical but separate copy of X that is subject to the same control. This defines the following stochastic birth-death process, see Fig. 1
where x and y denote the number of copies of molecular species X and Y respectively, α is an arbitrary proportionality constant, and the shared production rate R can depend in arbitrary ways on the abundances zk of any cellular component (including X and Y themselves), collectively denoted as Z := {Z1, Z2, … .}
Theorem: if neither X nor Y causally affect a third cellular component of interest Zk ∈ Z, then
where angular brackets denote ensemble averages, and denotes the normalized covariance between Zk and X in a population of genetically identical cells.
Eq. (2) makes no assumptions on the interactions of all the unspecified cellular components and allows arbitrary effects from the component of interest Zk onto X and Y . The invariant constrains any cellular component Zk that is not affected by X and Y as long as the “local” dynamics of X and Y is given by Eq. (1). The detailed proof of Eq. (2) is presented in SI. The key condition under which it holds is that the non-genetic population variability has reached a stationary state. In systems in which there is uncertainty about this condition it can be directly verified experimentally.
Note, a violation of Eq. (2) implies the presence of a causal connection but not vice versa. This is a logical necessity because the dynamics of causally connected genes can be arbitrarily close to that of non-interacting genes. In a later section, we discuss further generalizations of the dual reporter class defined in Eq. (1), allowing for growing and dividing cells, measurement noise, and fluctuations in degradation rates as would be relevant for its application to experimental single-cell data. Note that due to the possibility of feedback, the dynamics of X and Y is not generally symmetric even though X and Y are co-regulated [20], see Fig. S1A. Eq. (2) is a statement about stochastic covariability and not a trivial relation based on the (incorrect) assumption that x(t) = y(t). For example, the Pearson correlation coefficients , are not necessarily equal even when X and Y do not affect Zk, see Fig. S1B.
Detecting causal interactions between genes through experimentally engineered “dual reporters”
Violation of Eq. (2) can be used to experimentally deduce the existence of a causal effect from a gene of interest X onto another gene of interest Zk as follows. If we engineer a co-regulated “dual reporter” system with the properties defined in Eq. (1), in which Y is a passive reporter (i.e., it does not significantly interact with other cellular components) that responds to the same input as X, then a violation of Eq. (2) implies that changes in X must causally affect the abundance of component Zk. This is because intrinsic fluctuations from the expression of X propagate through the network and affect Zk, but those from the passive reporter Y do not. As detailed next, this approach can be applied to dual reporters for gene expression both on the transcriptional as well as the translational level, see Fig. 1.
Transcriptional dual reporters
Consider two genes of interest, geneX with transcript X and geneZ with transcript Zk. Our approach relies on engineering a passive reporter gene geneY with identical transcriptional control to geneX and mRNA lifetime, but with transcript that does not affect other cellular components, see Fig. 2A. This could be achieved with a transcriptional reporter that is put under control of the same promoter as geneX and placed at a similar gene locus. One way to ensure that geneY interacts minimally with other cellular components would be to remove its start codon so that it will not be translated into a protein.
Co-regulated mRNA reporters have been engineered to satisfy the assumptions underlying Eq. (1), with transcripts counted with smFISH [15, 25, 26]. Alternatively, future improvements in RNAseq [27–29] accuracy might allow for reliable fluctuation measurements of transcript levels in single-cells through sequencing approaches.
Translational dual reporters
The dual reporter assay has been frequently implemented as co-regulated fluorescent proteins [14, 16, 22, 30]. The invariant of Eq. (2) directly applies when X and Y are co-regulated fluorescent proteins with first-order translation rates and maturation times (see SI Sec. 4 B for details of the proof). Fluorescent proteins can then be used to detect causal interactions in gene regulation as follows. The gene of interest geneX is fused to a fluorescent protein to make a functional fusion geneX-FP. A passive reporter protein Y is made by introducing a spectrally distinguishable fluorescent protein Y under the control of the same (but distinct) promoter as geneX-FP. The expression level zk of geneZ can be measured either through a transcriptional reporter or a functional fusion protein with a third fluorescence reporter. Under the assumption that the fluorescent protein Y does not directly affect other cellular components, Eq. (2) directly applies to the covariances of fluorescence levels as long as X does not causally affect Zk. As the normalized covariances in Eq. (2) are independent of scaling factors, standard fluorescence microscopy methods can be used without the need to determine absolute numbers. Two stable fluorescent proteins with similar maturation times [31] should be chosen to ensure that the assumptions underlying the class of systems is satisfied. Additionally, the translation rates of X and Y need not be identical, but can be proportional with a fluctuating proportionality factor as defined in the transitions of Eq. (1). As a result, different fluorescent proteins with different ribosome binding sites, mRNA secondary structures, and gene lengths, can be used as long as the translation rates remain proportional.
Typical experimental single-cell data can be analyzed using the invariant relation of Eq. (2)
In the above sections we introduced the main mathematical result and presented the basic logic of how it can be experimentally exploited using synthetically engineered gene expression reporters. Next, we describe how Eq. (2) generalizes to the broader class of systems necessary to analyze single-cell experimental data.
Eq. (2) constrains abundances in growing & dividing cells
The class of stochastic processes presented in Eq. (1) constrains the dynamics of cellular abundances in stationary processes. However, experimental data typically report measurements of growing and dividing cells. Under the assumption that during cell division, molecular abundances are divided on average proportional to cell size, we can show (see SI) that Eq. (2) must be satisfied by the dual reporter abundances when Zk is a component not affected by geneX. In this analysis of cellular growth and division, we allow for partitioning noise, division time fluctuations, asymmetric divisions, and arbitrary growth-rate dynamics. see Fig. 2B.
Eq. (2) constrains concentrations in growing & dividing cells
Chemical reaction rates typically depend on molecular concentrations and not absolute abundance numbers. Under general assumptions (see SI Sec. 10), we can show that the covariance constraint of Eq. (2) describes molecular concentrations in growing and dividing cells, see Fig. 2. This result assumes that the abundance of X does not affect cell volume or growth rate. This requirement can be intuitively understood, because if X affects cell volume then it causally affects the concentration of Zk which depends on volume (see SI Sec. 10).
Eq. (2) applies to reporters with fluctuating degradation rates and fluctuating translation rates
The class of systems defined in Eq. (1) assumes that the dual reporters are degraded in a first-order process with a constant rate parameter β, see Eq. (1). This assumption can be relaxed: the invariant of Eq. (2) generalizes to the class of systems in which the degradation rate constant is an arbitrary function of all the cellular components that are not affected by the dual reporters (see SI Sec. 4). As a result, the degradation constant can vary in time with arbitrary extrinsic fluctuations. Additionally, for the class of co-regulated fluorescent proteins, the translation rate parameters can also vary with arbitrary extrinsic fluctuations.
In silico validation of analytical results
In order to numerically verify and illustrate the generality of the above results, we simulated several stochastic birth-death systems in growing and dividing cells, see Fig. 2. We modelled 10 systems made up of 6 network topologies in which a component Z is correlated with X and Y but is not affected by X, see Fig. 2A. These systems include cellular components with non-linear reaction rates and feedback loops, fluctuating degradation rates, and confounding variables that affect all observed components. We analyzed each system subject to three different cellular growth dynamics: periodic exponential growth, periodic linear growth, and linear growth with fluctuating cell-division times and division sizes, see Fig. 2B.
To numerically test Eq. (2), we generated random sample paths for abundances and concentrations using the Gillespie algorithm [32, 33]. System parameters were varied over several orders of magnitude. The numerical data verify that the normalized covariances of cellular abundances as well as cellular concentrations satisfy Eq. (2) in growing and dividing cells, see Fig. 2C. The same does not hold for the corresponding Pearson correlation coefficients, see Fig. 2C. Note, due to finite sampling, any numerical simulation will necessarily show small deviations from the exact equality of Eq. (2). Through statistical analyses and re-running systems for increased sampling, we confirmed that the (minuscule) deviations observed in numerical simulations were consistent with finite sampling, see Materials and Methods.
Eq. (2) constrains data with significant measurement noise
Experimental techniques to measure mRNA abundances or fluorescence levels potentially introduce significant measurement noise. For example, when counting mRNA abundances, smFISH can lead to probabilistic undercounting noise and fluorescence microscopy can report fluorescence levels that includes photon noise, read noise, and segmentation errors, while flow cytometry introduces Poisson noise when used with bacteria [34].
The invariant of Eq. (2) holds in the face of measurement noise as long as the noise is symmetric in X and Y measurements. We can show (SI Sec. 12) that Eq. (2) holds in systems with arbitrary multiplicative noise that is independent of the X and Y signals, and additive noise that is independent of Zk. Additionally, the invariant holds in the face of systematic undercounting that introduces a binomial readout of the signal of interest, along with Poisson-Gaussian noise. If, however, the experiment introduces a different type of noise in X as compared to Y, then Eq. (2) is no longer valid. It is thus important to choose similar measurement techniques while measuring the abundances of X and its passive reporter Y .
Experimental proof-of-principle
Whether the above theoretical approach works in practice depends on two crucial questions: Can we reliably build dual reporter systems that satisfy the assumptions underlying Eq. (2)? Are experimental violations of Eq. (2) larger than measurement uncertainties for current experimental techniques?
To address these questions we present an experimental proof-of-principle using synthetically engineered gene regulatory circuits in E. coli. These circuits consist of variants of the “repressilator”, a celebrated synthetic control circuit in which three genes respectively repress each other [35, 36].
Causally connected genes that break the invariant
Inherent to our fluctuation approach is the direction of inference, i.e., violations of Eq. (2) imply the presence of a causal connection, whereas agreement with Eq. (2) does not imply the absence of causal interactions. The question thus becomes whether a biologically relevant genetic circuit with known causal connections ever breaks our covariance relation.
We engineered four synthetic circuits in which a TetR-YFP fusion protein (X) represses an RFP reporter (Z), and a passive CFP reporter (Y) is under the same transcriptional control as X, see Fig. 3A. While the circuits differ in dynamics and connections, X affects Z in all cases. See SI Table. S2 for details of the synthetic circuits #1–4. Using time-lapse fluorescent microscopy with cells growing in a microfluidic device (see Fig. 3C), we measured the normalized covariances from populations of genetically identical cells with high accuracy while discarding all temporal information.
We found that circuits #1 and #2 clearly broke the constraint, circuit #4 did not violate it, and circuit #3 showed significant deviations right at the limit of what can be reliably detected, see Fig. 3D.
The above repressilator test circuits were chosen for their well-characterized genetic interactions and because oscillating abundances make for a convenient test of the constructs. However, large oscillations in Z also make it much more challenging to detect violations of Eq. (2) because in those circuits the intrinsic noise we exploit is small compared to the periodically varying dynamics of the circuit. Indeed, the two clear violators correspond to altered versions of the repressilator circuit that do not oscillate (see SI Table. S2). These non-oscillating test cases may in fact be the most relevant assessment of using Eq. (2) in natural gene regulatory circuits.
Overall the experimental data confirm that Eq. (2) has the fundamental power to experimentally detect causal interactions. However, the precision of current experimental techniques lies at the edge of what is necessary to detect physiologically relevant interactions when genes show large variability relative to the intrinsic noise we exploit. This highlights the importance of correctly estimating measurement uncertainty in our approach. Our error bars estimate sampling error and correction errors from non-even sample illumination, temporal drift, background fluorescence, and autofluorescence. These were estimated using a bootstrapping approach where the corrections and the normalized covariances were computed recursively to samples of the data, see Materials and Methods, and supplemental Fig. S2.
Negative controls
Our crucial theoretical result is that Eq. (2) must be satisfied by any circuit in which X does not affect Z. To experimentally test this prediction we engineered test cases in which X and Y are co-regulated, while Z is a component that is not affected by X. This took one of two forms: First, we constructed eight different circuits in which RFP (Z) was expressed constitutively using the pRNA1 promoter as previously used as a segmentation marker in E. coli [36, 37]. We used various synthetic circuits and the RFP reporter was either integrated chromosomally or included on the circuit plasmid (the latter leading to correlations between RFP and our synthetic components without introducing a causal interaction). Second, as an additional control, for each circuit we also considered the cellular growth rate as the unaffected cellular property Z. While the growth rate affects reporter concentrations through dilution, we did not expect cellular growth to be affected by our synthetic circuits which made growth rates a convenient additional negative control.
We found the data from all but one synthetic circuit consistent with our predicted invariant, see Fig. 3E. Note, the false positives correspond to the same synthetic circuit: one data point is from taking Z as the RFP concentration, while the other data point is from taking Z as the growth rate. To rule out experimental error we repeated the experiment twice while swapping the YFP and CFP reporters which confirmed the result (see Fig. S3). Additionally, we confirmed the result using a different cell segmentation pipeline (see Fig. S4).
Taken at face-value, our method thus produced two false positives (out of 16 tests), suggesting that our approach could be valuable but somewhat unreliable. However, next we present experimental evidence that indicate the violating data points were not false positives but may have correctly identified an unexpected causal effect from X on Z in the outlier circuit.
Evidence that RpoS mediated stress response affected cellular growth in the outlier circuit
We observed that the outlier strain (Fig. 4A) exhibited periods of slow growth (Fig. 4B), suggesting that cells underwent periods of stress. Indeed, although RFP was thought to be constitutively expressed, RFP exhibited clear temporal dynamics that were negatively correlated with growth rate (Fig. 4C, Fig. S5). These fluctuations were larger than other strains expressing RFP from the same chromosomal gene (Fig. S6).
We hypothesized that in the outlier strain the “constitutive” RFP reporter was being affected by the bacterial stress response regulator RpoS [41, 42]. By measuring the transcriptional activity of the known RpoS target gadX [39] we observed variable RpoS activity with pulses that correlated with periods of slow cellular growth (Fig. S7). Similar pulsing behaviour of RpoS triggering periods of slow growth has been reported previously [39, 40]. RFP levels were in turn strongly correlated with RpoS activity (Fig. 4D), as expected for genes for which transcription rates are constant but cell growth is slowed down by RpoS activity. Indeed, upon deleting RpoS, RFP fluctuations as well as growth rate fluctuations satisfy Eq. (2), see Fig. 4E.
Next, we discuss why the RpoS stress response was triggered in the outlier circuit in a TetR dependent manner. In the outlier strain, LacI was chromosomally expressed, which represses TetR independent of the repressilator circuit. Instead of regular oscillations of TetR, we observed extended periods of very low concentrations, see Fig. 4F. We conjecture that during those extended periods of low TetR levels, expression of CI is exceedingly high, which in combination with the already highly expressed RFP, leads to resource competition in cells that ultimately triggers RpoS mediated stress response. Note, CI is under the control of the very strong pLtetO1 promoter, which can initiate the transcription of up to 0.3 mRNA/sec in bacteria when fully induced [43] and has been shown to burden E. coli [44]. RFP is under the control of another strong promoter, pRNA1, previously used as a bright segmentation marker [36, 37], which we find to affect growth rate (Fig. S8). This RpoS mediated causal effect of TetR on RFP would explain the violations of Eq. (2) by the outlier circuit, why deletion of rpoS removes the outlier data, and why the same circuit without endogenous lacI (strain #5) did not violate the constraint, see Fig. 3E.
Taken together these additional data provide evidence that the outlier data may not have been a false positive but detected an unexpected causal interaction mediated by cellular burden and RpoS.
Discussion
A ubiquitous problem in understanding gene regulatory networks is identifying which genes regulate the expression levels of which other genes. Stochastic fluctuations of molecular abundances within cells provide a natural source of information about such regulation. Here, we presented a mathematical identity that potentially allows for the translation of such single-cell variability data into causal interactions. Because correlations do not imply causation, this method requires experimental intervention, but crucially it does not require perturbing the physiological state of the cells.
Note that agreement with Eq. (2) does not prove the non-existence of such a causal interaction because the lack of deviation can have two reasons: the causal interaction could be negligible or the effect of the intrinsic noise we exploit is not large enough compared with other sources of variability of the target gene. In other words, while deviations of Eq. (2) rigorously establish causal interactions, “absence of this evidence does not constitute evidence of absence” [45].
Limitations of this study
Our method relies on a handful of key assumptions about the molecular reporters used. In the absence of standardized synthetic parts we may not know whether a given engineered system satisfies these assumptions. Only if the assumptions are satisfied, can violations of the invariant rigorously identify the existence of a directed causal interaction.
While the presented experimental study works as a proof-of-principle to illustrate the approach and establish its practical feasibility, at face value it also produced a false positive result. Through additional perturbation experiments we argue that the outlier circuit includes an unexpected causal connection. However, the evidence for this conclusion is weaker than the proven theoretical results presented. The fact that our interpretation of the perturbation experiments will be debated may serve as an illustration of how valuable a fluctuation based approach would be that avoids perturbation experiments altogether.
Additionally, all experimental synthetic circuits were analyzed in E. coli and eukaryotic test cases remain to be investigated.
Materials and Methods
Data and materials availability
Numerical simulation code and data will be available on Github. The segmented and tracked single-cell traces will be available online. Code for analyzing the singlecell traces to compute the normalized covariances will be available on Github. Python code used to run the DeLTA deep learning segmentation pipeline along with our trained U-Net models used for segmentation and chamber identification will be available on Github.
Numerical simulation details
Exact simulated single-cell time trajectories of the abundances were generated using the standard Gillespie algorithm [32], with an additional step to account for time-dependent rates and divisions [33] (see SI Sec. 11 for the exact algorithm used). The time trajectories for the concentrations correspond to the abundance trajectories divided by the cell volume trajectories V (t), with the latter being simulated independently of the species abundances. Simulations were performed with python.
Simulated cell growth and division models
Three cellular growth dynamics were simulated. The first and second are linear and exponential growth respectively, with constant division times and symmetric cell divisions. Here V (t) trajectories were generated analytically as periodic functions. The volume is reduced by factor 1/2 at evenly spaced division times {τi}. Between division times τi, τi+1 the volume is given by for the linear case and for the exponential case, where td is the time between divisions, and V0 is the volume right after a division. The third simulated volume dynamics is linear growth with stochastic division times and asymmetric divisions. Here a constant linear growth rate is used. Division times {τi} and division factors {ai} are picked recursively: The τi+1 is taken from a normal distribution with mean set to the doubling time:
where 𝒩 (μ, σ) is the normal distribution with mean μ and standard deviation σ. Until then the volume grows linearly with constant rate
At τi+1 the volume is reduced by factor ai+1 taken from a normal distribution with mean set to the ratio of cell volumes at the beginning and end of the cycle
As a result, cells that grow more (less) than double in size during a cycle tend to divide with larger (smaller) division factors, ensuring the volume trajectories do not eventually expand (decay) to infinity (zero).
At division, molecular abundances are reduced according to a binomial splitting with probability given by the division factor. For example, if the volume is reduced by a factor of 0.4 at a cell division, i.e., V →0.4V, then each molecular has probability 0.4 to remain in the followed daughter cell.
Computing normalized covariances from trajectories
Normalized covariances were computed by integrating over the trajectories to obtain time averages for first and second moments. This is equivalent to using the distribution given by calculating the fraction of the total system time spent in each sampled state. In the ergodic regime, this distribution converges to the stationary distribution of the ensemble.
Simulated systems
We simulated ten groups of systems defined by their rate functions (see SI Table. S1 for details). For each group, each model parameter was picked randomly multiple times from the set {0.1, 1, 10}. This was done for each of the the three volume dynamics we considered. If a simulated system gave an average abundance in one of the components less than 0.01 molecules, then the system was omitted to avoid numerical errors that arise from divisions of small numbers when computing the normalized covariances. In total, there are 6522 resulting simulations which are plotted in Fig. 2C.
Confidence intervals for finite sampling error
For each system, simulated trajectories were generated forty times, and then repeated until the percentage errors of the normalized covariances, taken as the standard error of the mean divided by the mean, reached less than 1% or the number of simulations reached 1,000. Each trajectory ran for 20,000 cell divisions and started with a unique random number generator seed. Each component abundance was set to 1 at the beginning of each trajectory, and the cell cycle time set to 0. We let the simulated cells reach 200 cell divisions before we start to compute the time average integrals for the moments, in order for the effect of the initial condition to dissipate and we analyze the systems’ cyclo-stationarity state. The final normalized covariance for each system corresponds to the average taken over the ensemble of simulated trajectories with 95% confidence intervals given by twice the standard error.
For a given system, Eq. (2) was tested by verifying that the ratio ηxz/ηyz produced a 95% confidence interval that encompassed the predicted value of 1. The test was satisfied by 6369 out of 6522 systems (97.7%), consistent with the definition of confidence intervals for finite sampling. Re-simulating the outliers for twice the number of simulations leads to a reduction in the standard error by a factor of , with Eq. (2) being satisfied by 143 of the 153 outliers (93.5%). The remaining 10 systems re-simulated for 4 times the number of simulations, reducing the errors by a factor of 1/2, and all of the 10 outliers now satisfying Eq. (2).
Strain and plasmid construction
All strains, plasmids, full construction details are provided in the SI. The base background strain used throughout the manuscript is E. coli MG1655.
The base plasmid used throughout the manuscript is the repressilator [35, 36]. It consists of three genes: tetR from the Tn10 transposon, cI from bacteriophage λ, and lacI from the lactose operon, which have promoters that are repressed by LacI, TetR, and CI respectively. All genes are placed on the low-copy pSC101 plasmid which is inserted into E. coli MG1655.
Using standard molecular biology techniques, the repressilator plasmid sequence was altered in a number of ways to construct the different circuits. The tetR repressor was set as the gene of interest geneX and TetR levels were measured through fluorescence measurements of a TetR fusion to the yellow fluorescent protein mVenus NB (YFP) [31, 46]. We engineered the passive reporter geneY by expressing a different fluorescent protein SCFP3A (CFP) [31, 47] under the control of the same pLacO1 promoter as geneX on the same pSC101 plasmid. Both X and Y levels are co-regulated by LacI concentrations. To ensure equal degradation times, we used a version of the repressilator in which the three repressing genes lack degradation tags so that circuit proteins are removed predominantly from cell-division and dilution [36]. The fluorescent proteins mVenus NB and SCFP3A were chosen since they both have a short maturation half-life (4.1 ± 0.3min and 6.6 ± 0.5min respectively at 37oC) compared to the time-scale of the repressilator oscillations (∼5 hours).
For the RFP, we used a modified mkate2 hybrid (with improved translational efficiency) throughout all experiments, which consists of the mCherry N-terminal 11 amino acids followed by the mKate2 sequence. In the experiment shown in Fig. 4D in which GFP was used to measure gadX expression, we used gfpmut2 as a transcriptional reporter for the gadX promoter, placed on a low-copy pSC101 plasmid (taken from the E. coli promoter library [48]). For Fig. 4E, we used a strain of E. coli MG1655 with rpoS deletion taken from the E. coli Keio Knockouts library [49].
We chose the repressilator as the base circuit because the circuit interactions are known and the resulting dynamics (periodic oscillations) can be used as a consistency check to ensure that the TetR-YFP fusion is functional (i.e., still represses the λ gene).
Mother machine experiment
We used a microfluidic device commonly called “mother machine” to follow growing and dividing cells for hundreds of generations in controlled environments [50, 51]. Single-cells are trapped in micrometer wide trenches. As cells grow and divide, the mother cells remain trapped while newborn daughter cells are washed away by the constant flow of growth media. Automated time-lapse microscopy and cell segmentation software enables us to track hundreds of cells in each experiment, while precisely measuring cell fluorescence, growth rate, and cell size. The time-series data for each mother cell can be pooled into a distribution from which the normalized covariances can be computed.
Though the invariant of Eq. (2) does not require the use of time-series data, we used the additional temporal information as a consistency check for our assumptions. In particular, the highly correlated trajectories of the YFP and CFP fluorescence (see Tables. S3 and S4) are consistent with our assumption that these reporters are co-regulated. Additionally, the observed oscillations in the closed loop circuits are indicative that the TetRYFP fusion has not lost its function as a result of the added fluorescent protein. Moreover, the time-series data allows us to measure cellular growth rates which can be used to test Eq. (2) as shown in Fig. (3)D.
Microfluidic chip preparation
Polydimethylsiloxane (PDMS) (Sylgard 184 Silicon Elastomer, Fisher Scientific) was mixed at a 10:1 (monomer:curing agent) ratio, poured on top of a 1.0 μm tall wafer and degassed for one hour at room temperature before baking for an additional 1.5 hours at 65°C. After careful removal of the PDMS from the wafer, individual PDMS chips were cut out with a razor blade. The inlet and outlet holes were punched with a 0.75 mm biopsy puncher (World Precision Instruments). The PDMS chips were sonicated in isopropyl alcohol (Fisher Scientific) for 30 minutes and dried at 65°C for 15 minutes. Glass coverslips (Fisher Scientific: 22×40 mm #1.5) were cleaned with 1M potassium hydroxide (KOH, Sigma Aldrich) for 20 minutes. The PDMS chips were bonded to the glass coverslips using a plasma cleaner (Oxygen flow rate at 45 sccm, power at 30W for 15 seconds, Tergeo Plasma Cleaner, PIE Scientific). The completed microfluidic device was heated to reinforce the plasma bonding at 100°C for 10 minutes, then 65°C for 30 minutes.
Cell preparation
E. coli strains were grown overnight in LB with appropriate antibiotics to select for cells containing the constructed plasmids. At ∼ 3-4 hours prior to the experiment start time, the overnight cultures were diluted 1:100 in imaging media consisting of M9 salts, 10% (v/v) LB, 0.2% (w/v) glucose, 2 mM MgSO4, 0.1 mM CaCl2, 1.5 μM thiamine hydrochloride and 0.85 g/L Pluronic F-108 (Sigma Aldrich, surfactant to prevent cells sticking to the surface of a microfluidic device). At an OD600 of 0.2-0.4, cells were loaded into the main feeding channel of the microfluidic chip and centrifuged at 5000 g for 10 min to push cells into the cell trenches. The feeding channels were then connected to syringes filled with imaging media using Tygon tubing, and the microfluidic chip was placed in a temperature controlled incubation chamber set at 37°C. Media was pumped through the feeding channel using syringe pumps (New Era Pump System), first at a rate of 5 μL/min for ∼ 0.5-1 hours to get the cells comfortable in the trenches, then at a high rate of 100 μL/min for 1 hour to clear the inlets and outlets. The rate of media flow was then set back to 5 μL/min for the duration of the experiment.
Microscopy and image acquisition
Images were acquired using a Zeiss Axio Observer inverted microscope equipped with a 63x Plan-Apochromat M27 oil objective (NA 1.40), an Orca Flash 4.0 LT camera (Hamamatsu), and an LED epifluorescence illuminator (Zeiss Colibri 7). The experiments were performed inside a temperature controlled incubation chamber set at 37°C. To reduce photobleaching the exposure time (100 ms) and light intensity (10-20%) were set low, with 16bit CZI images taken every 5-8 minutes. Focal drift was corrected automatically with the Definite Focus 2 (Zeiss) monitoring and compensation system using an infrared laser (850 nm). In all experiments in which CFP, YFP, and RFP are measured simultaneously (i.e., all experiments except for the one shown in Fig. 4D), the following filter sets were used for acquisition: CFP (Semrock FF02475/20-25), YFP (Chroma CT560/39bp), RFP (Zeiss Filter Set 91 HE LED), with respective Zeiss Colibri 7 illumination wavelengths set to 430nm (CFP), 511 nm (YFP), and 590 nm (RFP), along with a the Zeiss TBS 450/538/610 beam splitter. For the experiment in Fig. 4D in which a gfp reporter is used to measure gadX expression, the following filter sets were used for acquisition: GFP (Chroma ET525/50m) and cy5 (Zeiss BP 690/50) with Zeiss Colibri 7 illumination wavelengths set to 475 nm (GFP) and 630 nm (cy5), along with Zeiss QBS 405/493/575/653 beam splitter.
Data analysis
Segmentation
Because all three fluorescence channels were used to measure synthetic circuit components throughout, we used the bright-field channel for segmentation. This was achieved with the automated deep learning-based cell segmentation software DeLTA [52, 53]. The U-Net deep learning models used for channel identification and cell segmentation were trained with a large dataset built from 5 mother machine experiments performed in the Potvin Lab that were generated with the same microscopy and image acquisition setup. In these training datasets a bright RFP is expressed and a thresholding segmentation pipeline is performed on the RFP channel images to obtain segmentation masks. The U-Net deep learning models from DeLTA are trained with these segmentation masks in combination with bright-field images from the training dataset as described in [52, 53]. The training dataset was segmented with the same thresholding pipeline used in previously described procedures [36, 54]. We also trained a DelTA model to use the RFP channel for segmentation and obtained similar results on the strains with bright RFP fluorescence. In the SI we include videos of a mother machine with segmentation boundaries obtained from our DeLTA models, along with the single-cell growth trajectories.
We estimated the YFP, CFP, and RFP single-cell concentrations as the average fluorescence intensities of all pixels in the segmentation mask of each cell. Cell area is measured as returned by opencv’s contourArea() over the segmentation mask. Cell length is measured by fitting a rotated bounding box to the segmented cell. We only analyzed data from the mother cells trapped at the top of the growth chambers.
Single-cell traces construction
For a given cell chamber with the Region of Interest identified by the U-Net model, a single-cell trace was constructed by selecting the segmented cell in each frame with highest average vertical pixel location. This in effect keeps track of the mother cells trapped at the top of the growth chambers.
Each single-cell trace went through a manual purging process. Traces with cell areas that are not growing and dividing were removed, as they are expected to correspond to dead cells. Traces with growing and dividing cell areas with very low constant fluorescence correspond to cells that lost the inserted plasmid due to random partitioning at cell division. These traces were also purged, but were analyzed separately to measure the cell autofluorescence as described in the next section. Moreover, segmentation and tracking errors were reduced by purging any parts of the traces that exhibited a clear nonbiological anomaly (see SI for examples).
Cell divisions were identified by a sudden decrease in the cell area. A division is called whenever the cell lengths dropped to less than 80% of its previous value.
Temporal drift correction
Focal drift was reduced automatically with a 850 nm infrared laser (Zeiss Definite Focus 2). However, we used an oil objective, which can cause temporal drift from spreading of the oil over time. We thus applied an additional correction to the data as follows.
In a given mother machine experiment, we computed the mean YFP, CFP, and RFP signals of all the cells at each time frame. For example, the mean YFP time-trace is given by
where is the average YFP intensity taken over the segmented area of the ith cell at time frame t, and Ncells(t) is the number of surviving cells at time frame t. The mean time-traces of each fluorophore are fitted with a second degree polynomial according to a least-squared fit. To correct for the drift, we multiply a measured intensity with the reciprocal of the respective fitted polynomial (see Fig. S9). For example, if f (t) is the obtained fit of ⟨YFPavg⟩t from Eq. (3), we correct all the YFPavg measurements as follows
where f (0) is the fitted function f (t) taken over the first time frame.
Uneven illumination and background correction
A single image frame from our setup encompasses ∼ 30 cell trenches with total spatial extension of ∼ 100 μm. This results in uneven illumination onto the cells. We use the linear gain model [55, 56] to correct for uneven illumination and background fluorescence:
where Im(x) and Itrue(x) are the measured and true intensities at horizontal pixel position x respectively, the multiplicative term S(x) models the uneven illumination, and the additive term D(x) is any background noise that is present when no light is incident on the sensor. For our imaging setup we decompose Itrue(x) as follows
where IF P (x) is the intensity from the fluorescent proteins, Ia is the autofluorescence of the cell, Imedia is any fluorescence originating from the media, and Ib is any fluorescence originating from the PDMS background of the microfluidic device. The total background is given by b(x) = Ib(x) × S(x) + D(x), in which case the imaging model becomes
We use a ‘prospective’ approach to correct for S(x) and b(x) as follows. For each segmented cell in an image, we compute the average intensity in a box of dimensions 50 by 50 pixels located 15 pixels above the cell trench (see Fig. S11A). This in effect estimates b(x) at the x position of just above each cell trench. In each image, this b(x) is removed from the Imeas(x) measurements of each segmented cell. To estimate the uneven illumination gain function S(x) we pool all of the single cell fluorescence measurements according to their horizontal pixel position x. Data is binned into bins of size 50 pixels and averages are taken at each bin (see Fig. S11B). The resulting curve gives an estimate of ⟨IF P + Iaf + Imedia ⟩ × S(x). To estimate S(x) we fit the curve with a 3rd degree polynomial pS(x). To correct for the uneven illumination gain function we multiply the fluorescence measurements by the reciprocal of pS(x):
where pS,max is the max value of the fit pS(x) over x.
Autofluorescence and fluorescing media
The preceding section described how the background and the uneven illumination were corrected using the linear gain imaging model of Eq. (6). Even with the background correction, the fluorescence profile across growth chambers empty of cells is not negligible compared to the profiles of some cell-containing chambers, see Fig. S11. This indicates that media fluorescence is not negligible in experiments with strains that produce low levels of fluorescence. Here we show how we estimated the autofluorescence Ia and the media fluorescence Imedia to obtain the sought after fluorescent protein fluorescence IFP .
In each mother machine experiment there was a subset of cells that lost the synthetic circuit plasmid due to random partitioning at cell division. These cells were used to estimate Ia + Imedia. Cells that lost the plasmid were identified manually by observing single-cell traces: When a cell loses the plasmid, the CFP and YFP fluorescence decay rapidly and reach a constant low signal level for the remainder of the experiment, see Fig. S10. The segments of the time traces at constant low signal level were cut and saved, with the autofluorescence and the media fluorescence taken as the average fluorescence of the saved traces. Around 10–30 cases of plasmid loss occur for each strain in a mother machine experiment.
Note that Ia and Imedia do not need to be corrected in the RFP channel measurements. This is because RFP is taken as the component either affected or not affected by X. For the systems in which RFP is not affected by X, any function of RFP (like adding autofluorescence and media fluorescence) will also not be affected by X and the invariant of Eq. (2) will be satisfied. Alternatively, if X affects RFP, then it will generally also affect any typically expected function of RFP. We did not correct for Ia and Imedia in the RFP channel because a fraction of circuits have RFP located on the chromosome and not the plasmid.
Estimating confidence intervals
The corrections from the preceding sections rely on the distribution of single-cell measurements. As a result, sampling error affects the accuracy of the corrections, along with the final estimators of the normalized covariances. We applied two methods to estimate normalized covariance confidence intervals by taking into account sampling error and its effect on the corrections.
First we use bootstrapping, where the data corrections and the normalized covariances are computed over many samples of the data, allowing for replacement in sampling. If an experiment produces N single-cell traces of cells with plasmid, and M single-cell traces of cells that lost the plasmid, random samples of size N and M of each respective ensemble are taken. The corrections from the previous sections are then applied using the samples, with the corrected data pooled into a distribution from which the normalized covariances are computed. This is repeated until 100 normalized covariances ηsample have been computed for each sample. The final reported normalized covariance corresponds to the average over the ηsample, with confidence intervals given by twice the standard deviation. See Fig. S12 and SI Sec. 13 for details.
Second we use sampling without replacement, where the single-cell traces from each experiment are divided into 7 to 10 disjoint sets, and the corrections and the normalized covariances are computed for each set. The reported normalized covariances correspond to the average of those computed from the sets, with error bars being the standard error of the mean (see SI Sec. 13 for details).
A single mother machine experiment typically produced 100–500 single-cell traces of cells with plasmid, and 7–30 single-cell traces of cells that lost the plasmid due to random partitioning at division. The computed normalized covariances from the bootstrapping method are shown in Fig. 3D,E, while those from the alternative splitting method shown in Fig. S2. The two methods give similar results.
Growth rate estimation
We define growth rate in this work as the rate of change of the cell length normalized by cell length: . To estimate the growth rate from the single-cell traces we assume exponential growth between division events.
In that case, if there is no division event that occurs at time frames i−1, i, and i+1, then the growth rate at time ti is computed as , where Li is the cell length at time ti, Li−1 is the length at the preceding frame, Li+1 is the length at the subsequent frame, and Δt is the time between frames (5–8 min). If there is a division event at the subsequent frame i + 1, then the growth rate is computed as . If there is a division event at the current frame i, then the growth rate is computed as .
For illustration we smoothed the growth rate measurements shown in Fig. 4B using a moving average filter with a window of 5 frames. We did not smooth the data when computing the normalized covariances and CVs shown in Fig. 3.
Supporting information
Acknowledgements
We thank Raymond Fan, B Kell, Seshu Iyengar, Sid Goyal, and Josh Milstein for many helpful discussions. This work was supported by the Natural Sciences and Engineering Research Council of Canada and a New Researcher Award from the University of Toronto Connaught Fund. Simulations were performed on the Niagara, Beluga, and Narval supercomputers at the SciNet HPC Consortium. SciNet is funded by the Canada Foundation for Innovation; the Government of Ontario, Ontario Research Fund - Research Excellence, and the University of Toronto. E.J.S. gratefully acknowledges a doctoral research visit grant to support their experimental work in L.P.T.’s lab at Concordia University.
References
- [1]Science336
- [2]Mol. Cell45
- [3]Nature structural & molecular biology18
- [4]Nature microbiology4
- [5]Nature Methods16
- [6]All of statistics: a concise course in statistical inferenceSpringer
- [7]eLife11
- [8]Cell155
- [9]Proceedings of the National Academy of Sciences100
- [10]Proceedings of the National Academy of Sciences99
- [11]Nature communications13
- [12]Biophysical journal107
- [13]Nature cell biology13
- [14]Science348
- [15]Cell reports26
- [16]Science319
- [17]Nature Reviews Genetics20
- [18]Nature Reviews Genetics6
- [19]Cell Systems2
- [20]Physical Review E104
- [21]Phys. Rev. Lett116
- [22]Science297
- [23]Science304
- [24]Science317
- [25]PLoS Biol4
- [26]Nature protocols8
- [27]Nucleic acids research42
- [28]Nat Meth11
- [29]Nature protocols13
- [30]Nat. Genet38
- [31]Nature Methods15
- [32]J. Phys. Chem81
- [33]PLoS computational biology12
- [34]PloS one15
- [35]Nature403
- [36]Nature538
- [37]Science366
- [38]Journal of bacteriology169
- [39]Proceedings of the National Academy of Sciences119
- [40]Nature communications9
- [41]Molecular microbiology5
- [42]Proceedings of the National Academy of Sciences90
- [43]Nucleic acids research25
- [44]Molecular cell38
- [45]The 1971 NASA/ASEE Summer Fac. Fellowship Program (NASA-CR-114445
- [46]Nature biotechnology20
- [47]Biochemistry45
- [48]Nature methods3
- [49]Molecular systems biology2
- [50]Curr. Biol20
- [51]Frontiers in Bioengineering and Biotechnology10
- [52]PLoS computational biology16
- [53]PLOS Computational Biology18
- [54]Nature503
- [55]“A basic tool for background and shading correction of optical microscopy images, nat. commun8
- [56]Nature methods12
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Copyright
© 2024, Joly-Smith et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 199
- downloads
- 14
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.